AWS Thinkbox Discussion Forums

Runaway Stalled Slaves

We’re using Deadline 7.2.3.1 R (544c3fb15) on Windows 7 running 3DS Max 2015 jobs, and we’re seeing cases where a slave loses its connection to the database for an extended period and is therefore marked as ‘Stalled’. However, as far as I can tell, the slave and the machine are behaving properly: the slave GUI is responding and updating, and I can see that rendering is progressing.

If I sit and watch the render, the task eventually finishes and the slave immediately picks up a new task, meaning that it was able to reconnect to the database, update its status to ‘Rendering’, and even complete subsequent tasks without stalling. Checking the slave logs, I do see that logging seems to stop at a time that coincides with the last status update time I see in Deadline Monitor.
If I choose Cancel Current Task from the slave GUI, the slave kills the current task, and then, just as when letting the task finish on its own, it picks up another assignment and immediately shows up as ‘Rendering’.

What do you think may be going on here? It seems that the mechanism for reporting back to the database gets stuck somehow, while the rest of the machine seems to be moving along well enough.

For debugging purposes, can the Slave be programmed to report to the GUI cases where it can no longer talk to the database?

And also, since the machine otherwise seems to be working, is there an automated way to force a reconciliation of the slave’s status with its assignment? In the case where we let the machine run and finish the render, the task it has been working on has already been reassigned, and in some cases both copies can run for hours.
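I don’t know of a built-in Deadline call that does exactly this, but the reconciliation being asked about amounts to a periodic check: if a slave the database considers Stalled is in fact still rendering locally, one of the two copies of its task should be requeued before hours of duplicate work accumulate. The following is only an illustration of that decision logic under assumed inputs; `SlaveView`, `reconcile`, and the timeout value are hypothetical, not Deadline API:

```python
from dataclasses import dataclass

STALL_TIMEOUT = 10 * 60  # seconds without a DB status update before "Stalled" (illustrative)

@dataclass
class SlaveView:
    name: str
    last_status_update: float  # epoch seconds of the slave's last DB update
    locally_rendering: bool    # what the machine itself reports

def reconcile(slave: SlaveView, now: float) -> str:
    """Decide what to do with a slave whose DB status has gone quiet.

    Returns one of: 'ok', 'mark_stalled', 'requeue_duplicate'.
    All states and thresholds here are illustrative, not Deadline's.
    """
    quiet_for = now - slave.last_status_update
    if quiet_for < STALL_TIMEOUT:
        return "ok"
    if slave.locally_rendering:
        # The DB thinks the slave stalled, but the machine is still working:
        # its task may already have been reassigned elsewhere, so one copy
        # should be requeued to avoid duplicated rendering.
        return "requeue_duplicate"
    return "mark_stalled"
```

Something along these lines could run as a housecleaning-style job against whatever status the slaves actually report.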

Hey maxx,

We’ve seen this a couple of times now, and we believe we currently have a lead on the issue… So far, we think the problem actually lies in the Slave’s Info thread (the one responsible for updating its status in the DB) getting hung up on something. The Slave is likely still connected to the DB, but its Info thread is getting stuck on something else (our prime suspect is currently WMI queries).
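The failure mode described here (a status thread blocking indefinitely on a system query) can be guarded against generically by running the potentially blocking call on a worker thread with a timeout and falling back to cached values. This is a sketch of that technique, not Deadline’s actual implementation; `query_system_info` stands in for whatever WMI query might be hanging:

```python
import concurrent.futures

def query_system_info():
    # Stand-in for a WMI-style system query that can occasionally block.
    return {"cpu_usage": 12.5, "free_memory_mb": 8192}

class InfoReporter:
    """Status gatherer that never lets a hung system query block the
    DB-update loop: the query runs on a worker thread with a timeout,
    and the last known values are reused if it does not return in time."""

    def __init__(self, timeout=5.0):
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self._timeout = timeout
        self._cached = {}

    def gather(self):
        future = self._pool.submit(query_system_info)
        try:
            self._cached = future.result(timeout=self._timeout)
        except concurrent.futures.TimeoutError:
            # Query is hung; log it and carry on with stale cached info
            # instead of stalling the whole status update.
            print("system query timed out; reusing cached info")
        return self._cached
```

With a wrapper like this, the DB update loop would keep posting (possibly stale) status even while the underlying query is wedged, so the slave would not appear Stalled.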

It’s also possible that the thread dies altogether, but I’d expect to see some error in the logs if that were the case. Did the Slave go back to reporting its status as normal after the ‘dry spell’? Or was it necessary to restart the Slave to get it working properly again?

We’ll try to get a fix in for this ASAP, or at least add some debug logging to help diagnose this issue a bit better.

Cheers
Jon

Hi Jon,

From what I’ve seen, a slave restart was not necessary for the Slave to pick up again. The Slave reports back again when the task it’s working on finally finishes, or if we choose ‘Cancel Current Task’ from the GUI.

There may be other cases where a Slave restart was necessary, but I haven’t encountered one yet. I’ll have to ask my teammates whether they’ve seen that happen.

-M
