I am running beta 7 of Deadline. The Deadline server was inadvertently restarted for maintenance while a job was running. The slaves continued to render; however, when the server came back online, about two thirds of the slaves registered as stalled in the Monitor and their frames were re-queued. The frames were taking about an hour each to render, and all of them were mid-render. I am not sure if this is normal behavior. It would be nice if the slaves could re-sync in this instance.
I think this problem can occur if the server is offline for longer than the "Number Of Minutes Before an Unresponsive Slave is Marked as Stalled" setting in the Slave Settings in the Repository Options. When the server comes back online, if a housecleaning operation happens right away, it will see that the slaves haven’t updated their state in a while (because they couldn’t connect to the database), and consider them stalled.
We’re not sure of the best way to handle this situation at this stage. One idea might be to not allow an application to perform housecleaning if it was disconnected from the database in the last X minutes. However, that would only really work for full server shutdowns like in your case. It wouldn’t work in a partial network failure where some machines could still connect.
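To make the "last X minutes" idea concrete, here is a rough Python sketch. It is not Deadline code: `HousecleaningGuard`, `record_failure`, and `may_houseclean` are hypothetical names, and timestamps are plain seconds passed in by the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HousecleaningGuard:
    """Hypothetical helper: blocks housecleaning shortly after a database outage."""

    grace_seconds: float                   # the "X minutes" window, in seconds
    last_failure: Optional[float] = None   # time of the last failed database call

    def record_failure(self, now: float) -> None:
        # Call this whenever a database operation fails to connect.
        self.last_failure = now

    def may_houseclean(self, now: float) -> bool:
        # Allowed if no failure was ever seen, or the grace window has passed.
        if self.last_failure is None:
            return True
        return now - self.last_failure >= self.grace_seconds
```

As noted, this only protects the machine that noticed the outage itself. A machine that never lost its connection would still run housecleaning and mark the others as stalled.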
Cheers,
Ryan
What if it also checked the uptime of the MongoDB server? If the uptime is less than the time a slave has gone without updates, the slave would not be considered stalled until the server’s uptime reaches the stalled threshold.
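Roughly what I have in mind, as a Python sketch. The function and parameter names are made up for illustration and are not from Deadline. I am assuming the database uptime could be read from the `uptime` field (in seconds) of MongoDB's `serverStatus` command.

```python
def is_stalled(seconds_since_update: float,
               db_uptime_seconds: float,
               stall_threshold_seconds: float) -> bool:
    """Hypothetical stall check that discounts time the database was down.

    A slave cannot have reported in while the database was offline, so its
    silence is only counted for as long as the database has been up.
    """
    effective_silence = min(seconds_since_update, db_uptime_seconds)
    return effective_silence >= stall_threshold_seconds
```

With a 10 minute threshold, a slave that last updated 30 minutes ago would not be marked stalled one minute after the database restarts, but would be if it is still silent once the database has been up for the full 10 minutes.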
That would help in this case, where the server was shut down. However, it wouldn’t help if the network goes down.