We’re using Deadline 7.2.3.1 R (544c3fb15) on Windows 7 running 3ds Max 2015 jobs, and we are seeing cases where a slave loses its connection to the database for an extended period of time and is therefore marked as ‘Stalled.’ As far as I can tell, however, the slave and the machine are behaving properly. The slave GUI is responding and updating, and I can see that rendering is progressing.
If I sit and watch the render, the task eventually finishes and the slave immediately picks up a new task, meaning that it was able to reconnect to the database, update its status to ‘Rendering,’ and even complete subsequent tasks without stalling. Checking the slave logs, I do see that logging seems to stop at a time that coincides with the last status update time I see in Deadline Monitor.
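A gap like this in a slave log could be located with something along the following lines. This is only a minimal sketch, not anything Deadline provides: the function name `find_log_gaps`, the 10-minute threshold, and the assumption that each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp are all hypothetical and would need adjusting to the real log format.

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix "YYYY-MM-DD HH:MM:SS" (hypothetical; adjust to the real log format).
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def find_log_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (start, end) pairs where consecutive timestamped lines
    are more than `threshold` apart. Lines without a timestamp are skipped."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # continuation line or unrecognised format
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Feeding it the lines of a slave log (for example via `open(path).readlines()`, with `path` pointing at whichever log file is being inspected) returns the periods of silence, which can then be compared against the last status update time shown in Deadline Monitor.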
If I choose Cancel Current Task from the slave GUI, the slave is able to kill the current task. Then, just as when I wait for the task to finish, the slave picks up another assignment and shows up as ‘Rendering’ immediately.
What do you think may be going on here? It seems that the mechanism for reporting back to the database gets stuck somehow, while the rest of the machine seems to be moving along well enough.
For debugging purposes, can the slave be set up to report in its GUI when it can no longer talk to the database?
Also, since the machine otherwise seems to be working, is there an automated way we can force the slave’s status to be reconciled with its assignment? When we let the machine run and finish the render, the task it has been working on has already been reassigned, and in some cases has been running for hours.