
slave hanging - nogui

Hi there,

So last night I restarted about 30-35 slaves in nogui mode. So far the results are promising; I only found one machine hanging overnight.

Attached are its logs.

While I see a line like this from this morning:

2013-03-06 00:48:18: Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)

It appears that it’s been hanging since long before that, though. It is currently reporting that it has been rendering a Nuke job for ~13.2 hours, but said Nuke job finished (according to the slave logs) at 2013-03-05 20:27:28:

2013-03-05 20:27:28: 0: STDOUT: LogQtLoadOuts: Qt Library Paths:
2013-03-05 20:27:28: 0: STDOUT: LogQtLoadOuts: C:/Python26_64
2013-03-05 20:27:28: 0: STDOUT: LogQtLoadOuts: //S2/exchange/software/managed/Libraries/Qt/4.7.1_x64_Scl/plugins
2013-03-05 20:27:28: 0: STDOUT: LogQtLoadOuts: //s2/exchange/software/managed/Libraries/Qt/4.5.0_x64/plugins
2013-03-05 20:27:28: 0: STDOUT: LogQtLoadOuts: QT Plugins Path: G:\Develope\FlowLineSDK\external\qt4\qt4.7.1\win64_vc2008\plugins
2013-03-05 20:27:28: 0: STDOUT: LogQtLoadOuts: QT Plugins Environment Path: //S2/exchange/software/managed/Libraries/Qt/4.7.1_x64_Scl/plugins;\s2\exchange\software\managed\Libraries\Qt\4.5.0_x64\plugins
2013-03-05 20:27:28: 0: STDOUT: LogQtLoadOuts: Enforcing QT Plugins Path: //S2/exchange/software/managed/Libraries/Qt/4.7.1_x64_Scl/plugins
2013-03-05 20:27:28: 0: INFO: Process exit code: 0
2013-03-05 20:27:28: 0: Render time for frame(s): 9.281 s
2013-03-05 20:27:28: 0: Total time for task: 27.887 s
2013-03-05 20:27:45: Slave - Updating slave settings
2013-03-05 20:32:50: Slave - Updating slave settings
2013-03-05 20:37:54: Slave - Updating slave settings
deadlineslave_LAPRO0321-LAPRO0321-2013-03-05-0001.zip (994 KB)

Note that the Nuke process was not running on the machine.

Thanks! This is very helpful info!

Cheers,

  • Ryan

Good news on this one. This was actually a separate bug that we were able to reproduce. Basically, if a database error occurred while dequeuing a task, the slave would re-render the previous task, and then the scheduler thread would deadlock and never pick up another task. The slave would then report in the Monitor that it was rendering that task, even after it finished rendering it again.
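To give a rough idea of how an error during a dequeue can wedge a scheduler thread like this, here is a toy Python sketch of the general pattern (not our actual scheduler code, which is .NET; the class and field names are made up):

import threading


class SchedulerThread:
    """Toy scheduler whose lock guards the 'current task' slot."""

    def __init__(self, task_queue):
        self.task_queue = task_queue   # e.g. a plain list of task ids
        self.lock = threading.Lock()
        self.current_task = None       # what the Monitor would display

    def dequeue_next_task_buggy(self):
        # If pop() raises (say, a database connection error), the lock is
        # never released, so every later dequeue blocks forever while the
        # stale current_task keeps being reported as "rendering".
        self.lock.acquire()
        task = self.task_queue.pop(0)
        self.current_task = task
        self.lock.release()
        return task

    def dequeue_next_task_fixed(self):
        # 'with' releases the lock even when the dequeue raises.
        with self.lock:
            task = self.task_queue.pop(0)
            self.current_task = task
        return task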

This bug will also be fixed in beta 14.

With the slave log also being removed from the UI in beta 14, this should in theory result in a much more stable slave.

Can’t wait to get this released! Should be early next week.

Cheers,

  • Ryan

Great news! I found another hanging slave that was running in nogui; maybe it’s related?

Attached are the logs.

Lines of interest:

2013-03-06 10:48:04: 0: STDOUT: Wed Mar 06 10:48:04 2013 (+4062ms) : PUBLISH STEP 4a/7 : Frame: 1019, Step time: 4.06 secs. Avg step time: 4.23 secs. Total time spent: 00h:04m:56s Estimated time left: 00h:10m:21s (step 70/217)
2013-03-06 10:48:08: 0: STDOUT: Wed Mar 06 10:48:08 2013 (+3937ms) : PUBLISH STEP 4a/7 : Frame: 1020, Step time: 3.94 secs. Avg step time: 4.23 secs. Total time spent: 00h:05m:00s Estimated time left: 00h:10m:17s (step 71/217)
2013-03-06 10:48:12: 0: STDOUT: Wed Mar 06 10:48:12 2013 (+4218ms) : PUBLISH STEP 4a/7 : Frame: 1021, Step time: 4.22 secs. Avg step time: 4.23 secs. Total time spent: 00h:05m:04s Estimated time left: 00h:10m:12s (step 72/217)
2013-03-06 10:48:16: 0: STDOUT: Wed Mar 06 10:48:16 2013 (+4062ms) : PUBLISH STEP 4a/7 : Frame: 1022, Step time: 4.06 secs. Avg step time: 4.22 secs. Total time spent: 00h:05m:08s Estimated time left: 00h:10m:08s (step 73/217)
2013-03-06 10:48:17: Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
2013-03-06 10:48:23: Slave - An error occurred while updating the slave’s info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, or experiencing network issues. (FranticX.Database.DatabaseConnectionException)
2013-03-06 10:52:33: Slave - Updating slave settings
2013-03-06 10:57:40: Slave - Updating slave settings
2013-03-06 11:02:47: Slave - Updating slave settings
2013-03-06 11:07:50: Slave - Updating slave settings
2013-03-06 11:12:53: Slave - Updating slave settings

After the connection error, no more activity is recorded in the log. Mayabatch was still running at full tilt, and the Monitor reported the job as rendering on the machine.

Our own internal logs also stop at those last lines, by the way, so Maya is hanging (but at 100% core usage).
deadlineslave_LAPRO0321-LAPRO0321-2013-03-05-0001.zip (137 KB)

Found another that seems to fit the first issue, so it might be fixed with beta 14. Logs attached.
deadlinelauncher-LAPRO0315-2013-03-05-0000.zip (533 KB)

Yup, these two issues look just like the original, which will be fixed in beta 14. We’re in the process of uploading beta 14 now, so it should be available soon.

Cheers,

  • Ryan

I think we’re having the same issue. The slaves are Win7 x64 and we’re using DL6 beta 12 (sorry, too superstitious to install beta 13).
As soon as mayabatch crashed due to low memory (12k print render), the DL slave process went down with it.

We’re installing beta 14 now and will let you know if we have the same issue.

-ctj

I think it’s a general issue when the system runs out of memory. The same low-memory situation that causes Maya to crash also causes the Slave to crash. We’re looking at adding a feature where the slave automatically restarts itself if it crashes because the system has run out of memory.
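In the meantime, an external watchdog script can approximate that behaviour. Here is a rough Python sketch of the idea; the executable path and the -nogui flag are placeholders for whatever your nodes actually run, and this is not the feature itself, just an illustration:

import subprocess
import time

# Placeholder path and flag; point this at the Slave binary on your nodes.
SLAVE_CMD = [r"C:\Program Files\Thinkbox\Deadline6\bin\deadlineslave.exe", "-nogui"]


def run_with_restart(max_restarts=10, delay_seconds=30):
    """Relaunch the slave whenever it exits abnormally (e.g. killed by an
    out-of-memory condition), waiting a little so the OS can reclaim RAM."""
    restarts = 0
    while restarts <= max_restarts:
        exit_code = subprocess.call(SLAVE_CMD)
        if exit_code == 0:
            break                  # clean shutdown, don't restart
        restarts += 1
        time.sleep(delay_seconds)


if __name__ == "__main__":
    run_with_restart()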

Cheers,

  • Ryan

deadline v6.0.0.50586
win7 x64
mayabatch

Still having the issue where mayabatch dies: it takes the DL slave service with it, and I see a farm full of red (stalled) slaves. When I log in to the machines, the DL slave never came back up.

Is there a newer version? I thought this was addressed, or is this a new issue?

Thank you.

-ctj

You’re currently one version behind:
viewtopic.php?f=84&t=9256

We added the stuff I was referring to in beta 15. There is no guarantee this will save all of your slaves, though, if the system they’re running on is running out of memory. For example, the slave could die first while Maya is still running, and then when it attempts to start up again, there might not be enough memory.

It would be best to avoid renders that use up all system memory in the first place. :)
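One way to do that is a pre-flight check in your own render or pre-task scripts that fails the task early instead of letting the node thrash. A rough sketch (psutil is a third-party module, and the 20 GB threshold is just a made-up number for a heavy print render):

import sys

import psutil  # third-party: pip install psutil

REQUIRED_BYTES = 20 * 1024 ** 3   # made-up threshold; tune it per scene


def enough_memory(required=REQUIRED_BYTES):
    """True if there is enough free physical RAM to even attempt the render."""
    return psutil.virtual_memory().available >= required


if __name__ == "__main__":
    if not enough_memory():
        print("Not enough free RAM for this render; failing the task early.")
        sys.exit(1)   # a non-zero exit lets the farm requeue the task elsewhere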

Cheers,

  • Ryan

We have four generations of nodes on our farm, and the oldest nodes have 12 GB of RAM. This is very rarely a problem, but the current project requires 12k renders for print, which load a lot into memory and kill these poor little guys. How would I submit a mayabatch job in a way that excludes systems with too little memory? Is there a way to exclude nodes based on a memory requirement, or do I have to create a pool for the elderly?

Thanks.

-ctj

You can create a new Group that only includes the newer machines, and then submit the job to that Group.

Pools are priority based, so it probably makes sense to stick with a Group in this case.
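If it helps, a manual submission restricted to a Group looks roughly like the sketch below; the group name, job name, frame range, and scene path are placeholders, and deadlinecommand is assumed to be on the PATH:

import subprocess
import tempfile

GROUP = "highram"   # hypothetical group, created in the Monitor and
                    # containing only the machines with enough RAM

job_info = "\n".join([
    "Plugin=MayaBatch",
    "Name=print_12k_test",        # placeholder job name
    "Frames=1-10",                # placeholder frame range
    "Pool=none",
    "Group=" + GROUP,             # restricts the job to machines in that group
]) + "\n"

plugin_info = "SceneFile=//server/projects/print/shot01.mb\n"  # placeholder path

with tempfile.NamedTemporaryFile("w", suffix=".job", delete=False) as ji:
    ji.write(job_info)
with tempfile.NamedTemporaryFile("w", suffix=".job", delete=False) as pi:
    pi.write(plugin_info)

# Submit the job info / plugin info pair to the farm.
subprocess.call(["deadlinecommand", ji.name, pi.name])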

Cheers,

  • Ryan

Yeah, the rule of thumb is that Groups are for software/machine specs (e.g. a 12 GB of RAM group, a 24 GB of RAM group, etc.) and Pools are for projects/departments/shots, or however you manage your specific jobs.

cheers

cb

Thanks cbond and russell for the clarification. I’ll reorganize my groups and pools.

So the idea would be that for software-specific ‘sets’, like the machines with Nuke or RealFlow, I would create a pool (e.g. a ‘nuke’ pool, an ‘RF’ pool), and for the various generations of machines on the farm, I would create groups. Does that sound about right?

Thanks again.

-ctj

You might want to use groups for software as well. For example, nuke_all, nuke_new_machines, rf_all, rf_new_machines. The reason is that pools affect the priority of a job, and groups do not. So unless you want the software that will be used for rendering to affect priority, you probably want to stick with groups.

Here is some info on how pools and groups affect scheduling:
thinkboxsoftware.com/deadlin … Scheduling
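As a very rough mental model (a toy illustration only, not Deadline’s actual scheduling code): groups act as a hard filter on which jobs a slave will even consider, while pools influence the order in which the remaining candidates are picked.

def pick_job(slave, jobs):
    """Toy model only: groups filter, pools (then priority) order."""
    # 1. Groups: only consider jobs whose group this slave belongs to.
    candidates = [j for j in jobs if j["group"] in slave["groups"]]

    # 2. Pools: jobs in a pool that appears earlier in the slave's pool list
    #    are preferred; job priority breaks ties within the same pool.
    def rank(job):
        if job["pool"] in slave["pools"]:
            pool_rank = slave["pools"].index(job["pool"])
        else:
            pool_rank = len(slave["pools"])
        return (pool_rank, -job["priority"])

    candidates.sort(key=rank)
    return candidates[0] if candidates else None


# Example: the group keeps the job off the wrong machines entirely, while
# the pool position (and then priority) decides how soon it gets picked up.
slave = {"groups": ["nuke_all", "nuke_new_machines"], "pools": ["print", "none"]}
jobs = [
    {"group": "nuke_new_machines", "pool": "print", "priority": 50},
    {"group": "rf_all", "pool": "print", "priority": 90},
]
print(pick_job(slave, jobs))   # the RF job is filtered out by the group check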

Cheers,

  • Ryan