So last night I restarted about 30-35 slaves in nogui mode. So far the results are promising; I only found one machine that hung overnight.
Attached are its logs.
While I see a line like this from this morning:
2013-03-06 00:48:18: Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
It appears that it's been hanging since long before that…
It is currently reporting that it's been rendering a Nuke job for ~13.2 hours, but said Nuke job finished (according to the slave logs) at 2013-03-05 20:27:28.
Good news on this one. This was actually a separate bug that we were able to reproduce. Basically, if a database error occurred while dequeueing a task, the slave would re-render the previous task, and then the scheduler thread would deadlock and never pick up another task. The slave would then report in the Monitor that it was rendering that task, even after it had finished rendering it again.
This bug will also be fixed in beta 14.
With the slave log also being removed from the UI in beta 14, this should, in theory, result in a much more stable slave.
Can’t wait to get this released! Should be early next week.
Great news! Found another hanging slave that was running in nogui, maybe related?
Attached are the logs.
Lines of interest:
2013-03-06 10:48:04: 0: STDOUT: Wed Mar 06 10:48:04 2013 (+4062ms) : PUBLISH STEP 4a/7 : Frame: 1019, Step time: 4.06 secs. Avg step time: 4.23 secs. Total time spent: 00h:04m:56s Estimated time left: 00h:10m:21s (step 70/217)
2013-03-06 10:48:08: 0: STDOUT: Wed Mar 06 10:48:08 2013 (+3937ms) : PUBLISH STEP 4a/7 : Frame: 1020, Step time: 3.94 secs. Avg step time: 4.23 secs. Total time spent: 00h:05m:00s Estimated time left: 00h:10m:17s (step 71/217)
2013-03-06 10:48:12: 0: STDOUT: Wed Mar 06 10:48:12 2013 (+4218ms) : PUBLISH STEP 4a/7 : Frame: 1021, Step time: 4.22 secs. Avg step time: 4.23 secs. Total time spent: 00h:05m:04s Estimated time left: 00h:10m:12s (step 72/217)
2013-03-06 10:48:16: 0: STDOUT: Wed Mar 06 10:48:16 2013 (+4062ms) : PUBLISH STEP 4a/7 : Frame: 1022, Step time: 4.06 secs. Avg step time: 4.22 secs. Total time spent: 00h:05m:08s Estimated time left: 00h:10m:08s (step 73/217)
2013-03-06 10:48:17: Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
2013-03-06 10:48:23: Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
2013-03-06 10:52:33: Slave - Updating slave settings
2013-03-06 10:57:40: Slave - Updating slave settings
2013-03-06 11:02:47: Slave - Updating slave settings
2013-03-06 11:07:50: Slave - Updating slave settings
2013-03-06 11:12:53: Slave - Updating slave settings
After the connection error, no more activity is recorded in the log. Mayabatch was still running full tilt, and the Monitor reported that the job was rendering on the machine.
Yup, these two issues look just like the original, which will be fixed in beta 14. We’re in the process of uploading beta 14 now, so it should be available soon.
I think we're having the same issue. Slaves are Win7 x64 and we're using DL6 beta 12 (sorry, too superstitious to install beta 13).
As soon as Mayabatch crashed due to low memory (12k print render), the DL slave process went down with it.
We're installing beta 14 now and will let you know if we have the same issue.
I think it's a general issue when the system runs out of memory. The same low-memory situation that causes Maya to crash also causes the Slave to crash. We're looking at adding a feature where the slave automatically restarts itself if it crashes because the system has run out of memory.
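In the meantime, an external watchdog could paper over this on the farm side. This is just a rough sketch; the executable path, process name, and the -nogui flag are assumptions based on a default Deadline 6 install on Windows, so adjust them for your environment:

# Hypothetical watchdog: relaunch the Deadline slave if it has died,
# but only when enough memory is free for a relaunch to be worthwhile.
# SLAVE_EXE, SLAVE_PROC_NAME, and the -nogui flag are assumptions.
import subprocess
import time

import psutil

SLAVE_EXE = r"C:\Program Files\Thinkbox\Deadline6\bin\deadlineslave.exe"  # assumed path
SLAVE_PROC_NAME = "deadlineslave.exe"   # assumed process name
MIN_FREE_BYTES = 2 * 1024 ** 3          # skip the relaunch if less than ~2 GB is free

def slave_running():
    # Scan the process table for a running slave.
    for proc in psutil.process_iter():
        try:
            if proc.name().lower() == SLAVE_PROC_NAME:
                return True
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            pass
    return False

while True:
    if not slave_running() and psutil.virtual_memory().available >= MIN_FREE_BYTES:
        # Relaunch headless, matching the nogui setup described above.
        subprocess.Popen([SLAVE_EXE, "-nogui"])
    time.sleep(60)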
Still having the issue where Mayabatch dies: it takes the DL slave service with it, and I see a farm full of red (stalled). When I log in to the machines, the DL slave never came back up.
Is there a newer version? I thought this was addressed, or is this a new issue?
We added the stuff I was referring to in beta 15. There is no guarantee this will save all of your slaves, though, if the system they're running on is running out of memory. For example, the slave could die first while Maya is still running, and then when it attempts to start up again, there might not be enough memory.
It would be best to avoid renders that use up all system memory in the first place.
We have four generations of nodes on our farm, and the oldest nodes have 12GB of RAM. This is very rarely a problem, but the current project requires 12k renders for print, which loads a lot into memory and kills these poor little guys. How would I submit a mayabatch job in a way that excludes systems with too little memory? Is there a way to exclude nodes based on a memory requirement, or do I have to create a pool for the elderly?
Yeah, the rule of thumb is that Groups are for software/machine specs (e.g. a 12 GB of RAM group, a 24 GB of RAM group, etc.) and Pools are for projects/departments/shots, or however you manage your specific jobs.
Thanks cbond and russell for the clarification. I'll reorganize my groups and pools.
So, the idea would be that for software-specific 'sets', like the machines with Nuke or RealFlow, I would create a pool (e.g. a 'nuke' pool, an 'RF' pool), and for the various generations of machines on the farm, I would create groups. Does that sound about right?
You might want to use groups for software as well. For example, nuke_all, nuke_new_machines, rf_all, rf_new_machines. The reason is that pools affect the priority of a job, and groups do not. So unless you want the software that will be used for rendering to affect priority, you probably want to stick with groups.
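For what it's worth, once the RAM-based groups exist, routing a job to them at submission time is just a matter of setting the Group key in the job info file. Here's a rough sketch in Python; the group and pool names are made up, and the exact MayaBatch plugin info keys may differ from what your normal submitter writes, so treat it as illustrative rather than definitive:

# Hypothetical submission sketch: send a MayaBatch job to a RAM-based group
# so low-memory nodes never dequeue it. Group/pool names, the scene path,
# and the plugin info keys below are assumptions; compare against the files
# your usual submitter generates. deadlinecommand is assumed to be on PATH.
import subprocess
import tempfile

job_info = {
    "Plugin": "MayaBatch",
    "Name": "print_12k_frame_range",
    "Frames": "1001-1100",
    "Group": "ram_24gb",      # only machines in this group will pick up the job
    "Pool": "print_project",  # pool still drives priority/project bookkeeping
}
plugin_info = {
    "SceneFile": r"\\server\projects\print\scenes\hero_12k.mb",
    "Version": "2013",
}

def write_info_file(pairs):
    # Deadline info files are simple key=value text files.
    f = tempfile.NamedTemporaryFile("w", suffix=".job", delete=False)
    for key, value in pairs.items():
        f.write("{0}={1}\n".format(key, value))
    f.close()
    return f.name

# Submitting with: deadlinecommand <job info file> <plugin info file>
subprocess.check_call([
    "deadlinecommand",
    write_info_file(job_info),
    write_info_file(plugin_info),
])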