AWS Thinkbox Discussion Forums

job 'stickiness'

I’ve been trying to get more consistent behavior out of our farm in terms of machines jumping from job to job, and I’m having a really hard time. Even with the “Rendering Task Buffer” set to higher values (I tried 3), the machines jump between jobs completely randomly. This adds a ~2-5 minute 3dsmax startup time to every single task we are rendering; the effect is worse on jobs where a single frame takes just a couple of seconds to render:

On this job, out of ~100 frames already rendered, only ~8 were “consecutive” renders; the rest were rendered by machines jumping in from other jobs. Because of this, 95% of the job’s time was spent starting up max. This is making our farm extremely inefficient :(

This particular job could have finished rendering in 10 minutes, total. Instead, it’s been going for 3 hours now, and it’s still only 16% done.

Are you restarting the renderer between frames or something? We hold 3dsmax open and just change frames, so I am assuming that you are doing some frame-to-frame calc [sim or particle?], or is there something I’m missing?

cb

What is the scheduling algorithm you are using? Pool/Priority/Date, or one of the fancy weighted ones?

Do you know ahead of time if jobs will render quickly like this? If so, setting a higher frames per task could result in less jumping around and less overhead between tasks. I imagine the jumping around is worse for faster jobs than for slower jobs, because more time is spent “between tasks” than actually rendering them, so the faster jobs are more likely to appear to have fewer slaves currently rendering them.

Setting the buffer to 4 or 5 might help a bit, but if you can chunk up those tasks for fast jobs like this, that should have a greater impact.
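
Just to put rough numbers on the overhead, here is a quick sketch using only the figures mentioned in this thread (a roughly 6 minute scene load and ~3 second frames; nothing here is measured by Deadline itself):

```python
# Back-of-the-envelope numbers from this thread (assumptions, not measurements):
# ~6 minutes to open the scene when a slave picks the job up fresh,
# ~3 seconds to render a frame once the scene is already open.
SCENE_LOAD_SECONDS = 6 * 60.0
FRAME_RENDER_SECONDS = 3.0

def rendering_fraction(frames_per_task):
    """Fraction of wall-clock time spent actually rendering, assuming the
    slave has to load the scene once per task."""
    render = frames_per_task * FRAME_RENDER_SECONDS
    return render / (render + SCENE_LOAD_SECONDS)

for chunk in (1, 10, 50, 100):
    print("%3d frames/task -> %4.1f%% of time spent rendering"
          % (chunk, 100 * rendering_fraction(chunk)))
# 1 frame per task:   ~0.8% rendering
# 100 frames per task: ~45% rendering
```

Even a modest chunk size moves a fast job like this from spending nearly all of its time loading scenes to spending a meaningful fraction of it rendering.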

We are using Pool/Weighted/Balanced:

Priority weight: 100
Error weight: 0
Submission time weight: 0
Rendering task weight: -75

The frames render in about 2-3 seconds, and there is a ~2 second delay between dequeuing them. The opening time of the scene varies between 5-8 minutes (good ol’ max). The renderer stays open correctly when a machine stays on the job; sadly, they almost never do. They move on to another job, and the slaves swap between jobs randomly.
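
For what it’s worth, my rough mental model of how those weights combine is something like the sketch below. I’m only guessing at the exact formula Deadline uses; the weight values are just ours from above:

```python
# Hypothetical sketch of a weighted job score; the real formula may differ.
PRIORITY_WEIGHT = 100
ERROR_WEIGHT = 0
SUBMISSION_TIME_WEIGHT = 0
RENDERING_TASK_WEIGHT = -75

def job_score(priority, errors, minutes_since_submission, rendering_tasks):
    return (priority * PRIORITY_WEIGHT
            + errors * ERROR_WEIGHT
            + minutes_since_submission * SUBMISSION_TIME_WEIGHT
            + rendering_tasks * RENDERING_TASK_WEIGHT)

# Two jobs at priority 50. Job A really has 4 slaves on it, but 3 of them are
# momentarily between tasks, so only 1 of its tasks shows as rendering.
print(job_score(50, 0, 0, 1))  # job A as the scheduler sees it: 4925
print(job_score(50, 0, 0, 4))  # job A if between-task slaves counted: 4700
print(job_score(50, 0, 0, 1))  # job B with 1 slave rendering: 4925, a tie
```

If that is roughly how it works, a job whose slaves happen to be between tasks looks under-served and ties with (or beats) jobs that are actually busier, which would explain the constant reshuffling.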

Frankly, without a central dispatcher, I don’t know how this can be solved. The slaves individually don’t have enough knowledge to decide whether to stay on a job or go to another.

I’ll try setting the task buffer to 5.

Chunking manually isn’t really the solution we are looking for… with 2-4000+ jobs a day, that’s not manageable. We would expect Deadline to squeeze the farm for juice.

Chris and I just did some brainstorming, and we might have a solution that doesn’t require a central dispatcher. The core issue here (and in other situations) is that a slave doesn’t have a snapshot of what other slaves are doing when it looks for a job. Simply looking at the current set of rendering tasks isn’t enough, because if a slave is between tasks, it would appear that the slave isn’t working on anything.

We think we can address this by introducing a new collection that simply contains a mapping of slave names to job IDs and task IDs. When a slave looks for a job to render, it will update its entry in the collection with the job ID and task ID if it gets a task, and it will clear its entry if it doesn’t find a task. When a slave is closed, or marked as stalled, its entry will get cleared as well. The key thing is that when a slave is BETWEEN rendering tasks (not idle, it’s just in the process of looking for the next one), its entry will still have the previous job information, and thus it can be assumed that the slave is still rendering that job.
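
In rough pseudocode, the bookkeeping would be something like this (an in-memory sketch with placeholder names, not the actual implementation):

```python
# Sketch of the proposed slave -> (job, task) mapping. In practice this would
# be a collection in the database; the names below are placeholders.

slave_assignments = {}  # slave name -> {"job_id": ..., "task_id": ...}

def on_task_dequeued(slave, job_id, task_id):
    # The slave picked up a task: record (or overwrite) its current assignment.
    slave_assignments[slave] = {"job_id": job_id, "task_id": task_id}

def on_no_task_found(slave):
    # The slave searched and found nothing: it really is idle, so clear it.
    slave_assignments.pop(slave, None)

def on_slave_closed_or_stalled(slave):
    # The slave shut down or was marked as stalled: clear its entry as well.
    slave_assignments.pop(slave, None)

# The key point: while a slave is BETWEEN tasks (it finished one and is
# searching for the next), only the earlier on_task_dequeued call has run,
# so its previous job is still in the mapping and it still counts as
# rendering that job.
```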

The balanced algorithms could then, in theory, rely on this collection when figuring out how many slaves are working on each job, rather than relying on the number of tasks currently being rendered for each job as they do now. It should be more reliable, and thus result in less jumping around.
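
The per-job slave count would then come from that mapping rather than from the rendering-task counts; again, just a sketch:

```python
from collections import Counter

def slaves_per_job(slave_assignments):
    """Count slaves per job from the mapping above, instead of counting the
    tasks currently flagged as rendering."""
    return Counter(entry["job_id"] for entry in slave_assignments.values())

# A slave that is between tasks on job "A" still counts toward "A", so the
# balanced weighting no longer sees "A" as under-served and other slaves
# have less reason to jump over to it.
```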

Obviously we have to think on this idea more, but if it does prove to be the solution, we’ll see if it’s something we can fit into 7.1.

Cheers,
Ryan

That sounds like a good idea, if I understand it correctly.
Although, aren’t slaves that are idle between tasks already marked as still ‘on the job’? They only return the job limitgroup stub when they can’t find another task. Maybe these limitgroups are not used for the queueing algorithm.

The job limit groups might already store this information?

The Limit collection doesn’t quite get us there, since different limits can behave differently (i.e. one limit per task, one per slave instance, or one per machine). It might just be cleaner to go with a new collection.
