The “Estimated Remaining Render Time” represents an estimate of the remaining total compute time required to complete the Job. This estimate is already statistically dubious, since there is no guarantee that the remaining unfinished Tasks will resemble the completed Tasks used to form the estimate. But in practice, the Tasks of a Job are often consistent enough that this aggregate estimate still provides some useful information. (And, technically, it should be called “Estimated Remaining Compute Time”, to be properly general, but who cares?)
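Just to make the idea concrete, here is a minimal sketch of how that aggregate estimate could be computed; the function name and inputs are hypothetical, not Deadline’s actual API:

```python
from statistics import mean

def estimated_remaining_compute_time(completed_durations_sec, incomplete_task_count):
    """Aggregate estimate: assume the unfinished Tasks will resemble the finished ones."""
    if not completed_durations_sec:
        return None  # nothing to extrapolate from yet
    return mean(completed_durations_sec) * incomplete_task_count

# e.g. 40 completed Tasks averaging 6 minutes each, with 60 Tasks still to go:
# estimated_remaining_compute_time([360.0] * 40, 60)  ->  21600.0 seconds (6 hours)
```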
So it seems like we could take this aggregate estimate and just divide it by the number of Slave instances on the farm and get a good “wall clock” estimate for when the Job will be done. But such an estimate would rarely be accurate. First, not all machines running Slave may be eligible, since they may not have the required worker software installed. Second, there may be a limited number of floating licenses for the worker software, and there’s no way to know how many of those licenses will be available to the Job at any given time. And the Job itself may have a machine limit set. These and other factors make it nearly impossible to estimate the maximum number of Slave instances that could be brought to bear on the remainder of the Job.
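For reference, the naive division being dismissed here would look something like this (again a hypothetical sketch, with names of my own invention):

```python
def naive_wall_clock_estimate(remaining_compute_sec, slave_count):
    """Spread the remaining compute time evenly across every Slave on the farm.

    This silently assumes every Slave is eligible, licensed, idle, and devoted
    to this Job for the entire remainder of its run -- which is almost never true.
    """
    return remaining_compute_sec / slave_count

# 6 hours of remaining compute across 20 Slaves "should" finish in 18 minutes:
# naive_wall_clock_estimate(21600.0, 20)  ->  1080.0 seconds
```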
But suppose we could, through complex modelling, form an estimate for the maximum number of Slave instances eligible to participate on the Job. Dividing the aggregate estimate by this number would give us a best-case-scenario estimate that would almost never come to pass, since it is very unlikely the upper bound of Slave instances would ever be available to the Job. So the wall clock estimate would not actually burn down in real time. If the estimate said 2 hrs remaining and we checked back in 2 hrs, the estimate might then say 50 minutes remaining. Come back in 50 minutes and the estimate now says 28 minutes. This is just setting up users for constant disappointment, which is fundamentally bad UX.
OK, well, what if we multiplied the optimistic wall clock estimate by some safety factor, like 5 (hey, let’s make yet another repository option!), so that we are not setting false expectations? Well, what meaning does this wall clock estimate now convey? We’ve taken a dubious aggregate estimate, divided it by a dubious upper bound of participating Slaves, and then multiplied it by an arbitrary safety factor. How can we expect this final wall clock estimate to be useful in any way? Of course it would occasionally be correct, in the same way that an analog clock with frozen hands is correct twice a day. (Interestingly, it has been established that people would rather have a bad map than have no map at all.)
OK, OK, OK, but maybe there is some other approach, since it seems like using historical compute time to make a wall clock estimate of completion time is like asking, “Where are we going, and why are we in a handbasket?” After all, we are trying to use a historical metric in one domain (Task vs Slave Instance) to derive a predictive summary estimate in a different domain (Job vs Farm). What if, instead, we used a historical metric from the target domain itself to form the predictive summary estimate?
To do that, we have to stop worrying about how long it takes for a Task to complete. Instead we look at the rate at which Tasks of the Job are being completed. (Sort the completion DateTimes of completed Tasks, and then measure the time delta between them. Use these deltas to form some kind of statistic, and then multiply that statistic by the number of incomplete Tasks.) This is a metric from the target domain, and it doesn’t require us to build a hopelessly complex model of the domain. Of course, over the lifetime of a Job the number of Slaves brought to bear will vary, so the “completion deltas” will probably be large in the beginning, smaller in the middle, and so forth. So rather than using the average of all completion deltas, it would make more sense to use a moving average, or an exponential moving average, so that the wall clock estimate will change based on the recent completion history of Tasks of the Job. This is still a bad map, but it is “less bad” than the one derived from Task durations.
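A minimal sketch of that calculation might look like the following; the function name, the smoothing factor, and the choice of a simple exponential moving average are my own assumptions, not anything Deadline actually implements:

```python
from datetime import datetime

def rate_based_wall_clock_estimate(completion_times, incomplete_task_count, alpha=0.3):
    """Estimate remaining wall clock time from the recent rate of Task completions.

    completion_times: the DateTimes at which Tasks of the Job finished (any order).
    alpha: EMA smoothing factor; larger values weight recent completions more heavily.
    """
    if len(completion_times) < 2:
        return None  # need at least two completions to measure a delta

    ordered = sorted(completion_times)
    deltas = [(later - earlier).total_seconds()
              for earlier, later in zip(ordered, ordered[1:])]

    # Exponential moving average of the completion deltas, so the estimate tracks
    # how the Job has been progressing lately rather than over its whole lifetime.
    ema = deltas[0]
    for delta in deltas[1:]:
        ema = alpha * delta + (1 - alpha) * ema

    return ema * incomplete_task_count

# Example: Tasks finished at 10:00, 10:04, and 10:06, with 12 Tasks remaining.
# finished = [datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 4), datetime(2024, 5, 1, 10, 6)]
# rate_based_wall_clock_estimate(finished, 12)  ->  2448.0 seconds (about 41 minutes)
```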
With this rate-based approach, there is still the problem that the burn-down rate of the wall clock estimate will likely not match real time. Suppose the estimate was 30 minutes; we come back in 30 minutes, check again, and now the estimate is 5 hours. What happened? Well, the estimate changed to reflect the fact that a bunch of higher-priority Jobs got submitted to the queue, and now this Job isn’t getting much farm time. At the end of the day, do we provide a warm-fuzzy bad map to make people feel better, or do we just not go down that road?
I think the takeaway from this post is quite clear: Never ask a software architect how long something will take; you’ll fall asleep long before the answer is delivered.