The “Estimated Remaining Render Time” represents an estimate of the remaining total compute time required to complete the Job. This estimate is already statistically dubious, since there is no guarantee that the remaining unfinished Tasks will resemble the completed Tasks used to form the estimate. But in practice, the Tasks of a Job are often consistent enough that this aggregate estimate still provides some useful information. (And, technically, it should be called “Estimated Remaining Compute Time”, to be properly general, but who cares?)
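Just to make the idea concrete, here is a minimal sketch of how that aggregate estimate could be computed; the function name and inputs are hypothetical, not Deadline’s actual API:

```python
from statistics import mean

def estimated_remaining_compute_time(completed_durations_sec, incomplete_task_count):
    """Aggregate estimate: assume the unfinished Tasks will resemble the finished ones."""
    if not completed_durations_sec:
        return None  # nothing to extrapolate from yet
    return mean(completed_durations_sec) * incomplete_task_count

# e.g. 40 completed Tasks averaging 6 minutes each, with 60 Tasks still to go:
# estimated_remaining_compute_time([360.0] * 40, 60)  ->  21600.0 seconds (6 hours)
```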
So it seems like we could take this aggregate estimate and just divide it by the number of Slave instances on the farm and get a good “wall clock” estimate for when the Job will be done. But such an estimate would rarely be accurate. First, not all machines running Slave may be eligible, since they may not have the required worker software installed. Second, there may be a limited number of floating licenses for the worker software, and there’s no way to know how many of those licenses will be available to the Job at any given time. And the Job itself may have a machine limit set. These and other factors make it nearly impossible to estimate the maximum number of Slave instances that could be brought to bear on the remainder of the Job.
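For reference, the naive division being dismissed here would look something like this (again a hypothetical sketch, with names of my own invention):

```python
def naive_wall_clock_estimate(remaining_compute_sec, slave_count):
    """Spread the remaining compute time evenly across every Slave on the farm.

    This silently assumes every Slave is eligible, licensed, idle, and devoted
    to this Job for the entire remainder of its run -- which is almost never true.
    """
    return remaining_compute_sec / slave_count

# 6 hours of remaining compute across 20 Slaves "should" finish in 18 minutes:
# naive_wall_clock_estimate(21600.0, 20)  ->  1080.0 seconds
```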
But suppose we could, through complex modelling, form an estimate for the maximum number of Slave instances eligible to participate on the Job. Dividing the aggregate estimate by this number would give us a best-case-scenario estimate that would almost never come to pass, since it is very unlikely the upper bound of Slave instances would ever be available to the Job. So the wall clock estimate would not actually burn down in real time. If the estimate said 2 hrs remaining and we checked back in 2 hrs, the estimate might then say 50 minutes remaining. Come back in 50 minutes and the estimate now says 28 minutes. This is just setting up users for constant disappointment, which is fundamentally bad UX.
OK, well, what if we multiplied the optimistic wall clock estimate by some safety factor, like 5 (hey, let’s make yet another repository option!), so that we are not setting false expectations? Well, what meaning does this wall clock estimate now convey? We’ve taken a dubious aggregate estimate, divided it by a dubious upper bound of participating Slaves, and then multiplied it by an arbitrary safety factor. How can we expect this final wall clock estimate to be useful in any way? Of course it would occasionally be correct, in the same way that an analog clock with frozen hands is correct twice a day. (Interestingly, it has been established that people would rather have a bad map than have no map at all.)
OK, OK, OK, but maybe there is some other approach, since it seems like using historical compute time to make a wall clock estimate of completion time is like asking, “Where are we going, and why are we in a handbasket?” After all, we are trying to use a historical metric in one domain (Task vs Slave Instance) to derive a predictive summary estimate in a different domain (Job vs Farm). What if, instead, we used a historical metric from the target domain itself to form the predictive summary estimate?
To do that, we have to stop worrying about how long it takes for a Task to complete. Instead we look at the rate at which Tasks of the Job are being completed. (Sort the completion DateTimes of completed Tasks, and then measure the time delta between them. Use these deltas to form some kind of statistic, and then multiply that statistic by the number of incomplete Tasks.) This is a metric from the target domain, and it doesn’t require us to build a hopelessly complex model of the domain. Of course, over the lifetime of a Job the number of Slaves brought to bear will vary, so the “completion deltas” will probably be large in the beginning, smaller in the middle, and so forth. So rather than using the average of all completion deltas, it would make more sense to use a moving average, or an exponential moving average, so that the wall clock estimate will change based on the recent completion history of Tasks of the Job. This is still a bad map, but it is “less bad” than the one derived from Task durations.
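A minimal sketch of that calculation might look like the following; the function name, the smoothing factor, and the choice of a simple exponential moving average are my own assumptions, not anything Deadline actually implements:

```python
from datetime import datetime

def rate_based_wall_clock_estimate(completion_times, incomplete_task_count, alpha=0.3):
    """Estimate remaining wall clock time from the recent rate of Task completions.

    completion_times: the DateTimes at which Tasks of the Job finished (any order).
    alpha: EMA smoothing factor; larger values weight recent completions more heavily.
    """
    if len(completion_times) < 2:
        return None  # need at least two completions to measure a delta

    ordered = sorted(completion_times)
    deltas = [(later - earlier).total_seconds()
              for earlier, later in zip(ordered, ordered[1:])]

    # Exponential moving average of the completion deltas, so the estimate tracks
    # how the Job has been progressing lately rather than over its whole lifetime.
    ema = deltas[0]
    for delta in deltas[1:]:
        ema = alpha * delta + (1 - alpha) * ema

    return ema * incomplete_task_count

# Example: Tasks finished at 10:00, 10:04, and 10:06, with 12 Tasks remaining.
# finished = [datetime(2024, 5, 1, 10, 0), datetime(2024, 5, 1, 10, 4), datetime(2024, 5, 1, 10, 6)]
# rate_based_wall_clock_estimate(finished, 12)  ->  2448.0 seconds (about 41 minutes)
```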
With this rate-based approach, there is still the problem that the burn-down rate of the wall clock estimate will likely not match real time. Suppose the estimate was 30 minutes; we come back in 30 minutes, check again, and now the estimate is 5 hours. What happened? Well, the estimate changed to reflect the fact that a bunch of higher-priority Jobs got submitted to the queue, and now this Job isn’t getting much farm time. At the end of the day, do we provide a warm-fuzzy bad map to make people feel better, or do we just not go down that road?
I think the takeaway from this post is quite clear: Never ask a software architect how long something will take; you’ll fall asleep long before the answer is delivered.