detailed job telemetry

LaszloSebo · August 18, 2014, 9:54pm

Hi there,

At SIGGRAPH, there was an excellent presentation about FQ, Framestore’s new queueing system. One thing that was very impressive was the kind of metrics they were collecting for frames / jobs.

Imagine several graphs, where the horizontal axis is time, and the vertical axis is things like:

CPU utilization for the rendering process (not the machine), based on how many CPUs were allocated for the task (since all parallel jobs run only on their allocated CPUs, this is practically always 1:1 with actual utilization, as opposed to random noise due to what’s happening on other CPUs)
incoming / outgoing bandwidth usage for the rendering process (not the machine)
kernel times / ratio with actual processing for the rendering process (not the machine)
ram usage for the rendering process (not the machine), virtual / physical
etc.
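
To make the list above a bit more concrete, here is a rough sketch of how such graphs could be derived. It assumes a hypothetical sampler that records cumulative per-process counters at intervals. The `Sample` layout and the `to_series` name are made up for illustration and are not anything FQ or Deadline exposes:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    """One reading of cumulative counters for a single render process (hypothetical layout)."""
    t: float          # wall-clock seconds since task start
    user_s: float     # cumulative user-mode CPU seconds
    kernel_s: float   # cumulative kernel-mode CPU seconds
    bytes_in: int     # cumulative bytes read/received
    bytes_out: int    # cumulative bytes written/sent
    rss_bytes: int    # physical memory in use at this instant

def to_series(samples, allocated_cpus):
    """Turn cumulative samples into per-interval points suitable for plotting."""
    points = []
    for prev, cur in zip(samples, samples[1:]):
        dt = cur.t - prev.t
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        user = cur.user_s - prev.user_s
        kernel = cur.kernel_s - prev.kernel_s
        busy = user + kernel
        points.append({
            "t": cur.t,
            # relative to the CPUs allocated to the task, not the whole machine
            "cpu_util": busy / (dt * allocated_cpus),
            "kernel_ratio": kernel / busy if busy > 0 else 0.0,
            "in_bps": (cur.bytes_in - prev.bytes_in) / dt,
            "out_bps": (cur.bytes_out - prev.bytes_out) / dt,
            "rss_bytes": cur.rss_bytes,
        })
    return points
```

Each point covers the interval between two samples, so the time axis is only as fine as the sampling rate.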

It was the most beautiful thing I have ever seen. These stats were generated PER frame for each job, and an aggregate result was also saved into the job itself to give a ‘job profile’.
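
How the aggregate was actually computed wasn't covered here, so the following is only one possible way to fold per-frame series into a 'job profile': resample each frame onto normalized time (start to end of the frame) and average across frames. The function names and the bucket count are assumptions:

```python
from statistics import mean

def bucketize(series, buckets):
    """series: list of (t, value) pairs in time order.
    Returns one average per bucket over the frame's normalized duration."""
    t0, t1 = series[0][0], series[-1][0]
    span = (t1 - t0) or 1.0  # avoid dividing by zero for single-sample frames
    bins = [[] for _ in range(buckets)]
    for t, v in series:
        i = min(int((t - t0) / span * buckets), buckets - 1)
        bins[i].append(v)
    return [mean(b) if b else None for b in bins]

def job_profile(frame_series, buckets=10):
    """Average the bucketized curves of all frames into one curve for the job."""
    per_frame = [bucketize(s, buckets) for s in frame_series if s]
    profile = []
    for i in range(buckets):
        vals = [p[i] for p in per_frame if p[i] is not None]
        profile.append(mean(vals) if vals else None)
    return profile
```

Normalizing time lets frames of different lengths line up, at the cost of hiding absolute durations, so the raw per-frame series would still be worth keeping.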

Something like this would be extremely beneficial to have. At larger scales, troubleshooting / optimizing really becomes a priority, and without metrics, you are just guessing…

rrussell · August 19, 2014, 2:27pm

If you scroll down to the bottom image here, you’ll see that Deadline 7 currently has task graphs that you can view for the selected jobs:
thinkboxsoftware.com/news/20 … ine-7.html

Note that the CPU and RAM we collect is only for the rendering process, not the machine. We already have IO on the wishlist as well.

We only track peak and average CPU and RAM per task though, not per frame.

Cheers,
Ryan

LaszloSebo · August 19, 2014, 4:39pm

Yes, the graphs I mentioned tracked each frame over time. This seems like it’s summarizing the peak/average of the frames for the job, as opposed to showing them over time.

This is useful as well, but seeing the rendering profile of the job as in:

startup takes 2 minutes, most is IO bound
render takes 30 seconds, initialization is 1 cpu
rendering fluctuates due to IO, mostly using all cpus
winddown, frame conversion / transfer takes 20 secs, IO bound

Would be extremely helpful for us to determine where we could optimize our farm performance.
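
As a sketch of how a profile like the one above could be pulled out of a time series: label each sampled interval as CPU-bound or IO-bound and merge consecutive intervals with the same label into phases. The thresholds and field names are assumptions picked for illustration, not measured values:

```python
from itertools import groupby

def label(point, cpu_busy=0.8, io_busy_bps=50e6):
    """Classify one interval; the thresholds are illustrative guesses."""
    if point["cpu_util"] >= cpu_busy:
        return "cpu-bound"
    if point["io_bps"] >= io_busy_bps:
        return "io-bound"
    return "idle/other"

def phases(points):
    """points: time-ordered dicts with t_start, t_end, cpu_util, io_bps.
    Returns (label, duration_seconds) for each run of same-labelled intervals."""
    out = []
    for name, run in groupby(points, key=label):
        run = list(run)
        out.append((name, run[-1]["t_end"] - run[0]["t_start"]))
    return out

# Made-up samples shaped like the profile described above
points = [
    {"t_start": 0,   "t_end": 60,  "cpu_util": 0.05, "io_bps": 80e6},
    {"t_start": 60,  "t_end": 120, "cpu_util": 0.04, "io_bps": 90e6},
    {"t_start": 120, "t_end": 150, "cpu_util": 0.95, "io_bps": 10e6},
    {"t_start": 150, "t_end": 170, "cpu_util": 0.10, "io_bps": 70e6},
]
print(phases(points))
```

With those made-up samples this gives a 120 second IO-bound startup, a 30 second CPU-bound render and a 20 second IO-bound wind-down.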

cbond · August 19, 2014, 5:03pm

yup - we have lots of plans for more analytics. keep the requests coming!!

LaszloSebo · August 19, 2014, 5:35pm

The problem is that without the ‘slot’ / ‘block’ allocation system, those numbers are very noisy… Especially if you have other renders going on the machine. So say you have a 24 core machine rendering a Max job and a Nuke job: you don’t know why the CPU usage is only 30% on the Max job… Is it 30% relative to what it was assigned? Or 30% because Nuke is running at ‘realtime’ priority and takes over the machine?

So the slot mechanism would make the numbers more meaningful.
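
To put numbers on that: if the reading is 30% of the whole 24 core machine and the job had, say, an 8 core slot (an assumed figure, just for illustration), then it is using about 7.2 cores, which is 90% of what it was given. A minimal sketch of the conversion, with a hypothetical function name:

```python
def utilization_vs_allocation(machine_pct, machine_cores, allocated_cores):
    """Convert a whole-machine CPU percentage into a percentage of the job's own slot."""
    return machine_pct * machine_cores / allocated_cores

# 24 core machine; the 8 core slot is an assumption for the example
print(utilization_vs_allocation(30, 24, 8))   # 90.0
# With no slots, the job is implicitly measured against all 24 cores
print(utilization_vs_allocation(30, 24, 24))  # 30.0
```

The same 30% reading means a healthy render in one case and a starved one in the other, and only the allocation tells them apart.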