AWS Thinkbox Discussion Forums

Getting lots of stalled slave messages

We have been testing Deadline 8 on a job that had previously been working through Deadline 7; however, we are seeing a noticeable increase in stalled slave messages, with the timeout set to 20 minutes as we had it under Deadline 7.

Is there anything that has changed that might affect this?

The messages we get are (Redacted)…

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: SERVERNAME
Version: v8.0.0.47 Beta (63536e907)

Stalled Slave: Render44
Slave Version: v8.0.0.47 Beta (63536e907)
Last Slave Update: 2016-02-05 09:25:01
Current Time: 2016-02-05 09:45:29
Time Difference: 20.455 m
Maximum Time Allowed Between Updates: 20.000 m

Current Job Name: JOBNAME [Frame 0 - 100 Tiles]
Current Job ID: 56b3a6241d82b250a0577412
Current Job User: USER
Current Task Names: 72
Current Task Ids: 72

Searching for job with id “56b3a6241d82b250a0577412”
Found possible job: JOBNAME [Frame 0 - 100 Tiles]
Searching for task with id “72”
Found possible task: 72:[72-72]
Task’s current slave: Render44
Slave machine names match, stopping search
Associated Job Found: JOBNAME [Frame 0 - 100 Tiles]
Job User: USER
Submission Machine: MACHINENAME
Submit Time: 02/04/2016 19:27:32
Associated Task Found: 72:[72-72]
Task’s current slave: Render44
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.

Hmmm, we’re not aware of anything that would be causing this yet…

Do you know if the Slaves are actually still rendering in these cases, and this is just a false positive? Or are the Slaves actually getting hung up on something?

Which OS are these Slaves running, and what kind of Jobs are they typically working on when they stall? Would you be able to get us some Slave log extracts matching the times around which they last reported their status, to when they got reported as stalled?
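For example, the lines from the window between the last reported update and the stall in the first report above could be pulled out with something like the sketch below. The log directory (the default C:\ProgramData\Thinkbox\Deadline8\logs) and the “YYYY-MM-DD HH:MM:SS” prefix on each log line are assumptions and may need adjusting for your setup.

```python
# Hedged sketch: extract slave log lines inside the stall window from the
# first report (2016-02-05 09:25:01 to 09:45:29). Paths and the timestamp
# prefix format are assumptions about the local install.
from datetime import datetime
from pathlib import Path

LOG_DIR = Path(r"C:\ProgramData\Thinkbox\Deadline8\logs")  # assumed default location
WINDOW_START = datetime(2016, 2, 5, 9, 25, 1)   # Last Slave Update from the report
WINDOW_END = datetime(2016, 2, 5, 9, 45, 29)    # Current Time from the report

def lines_in_window(log_path):
    """Yield log lines whose leading timestamp falls inside the stall window."""
    with open(log_path, "r", errors="replace") as handle:
        for line in handle:
            try:
                stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            except ValueError:
                continue  # skip continuation lines without a timestamp prefix
            if WINDOW_START <= stamp <= WINDOW_END:
                yield line.rstrip()

for log_file in sorted(LOG_DIR.glob("*.log")):
    for line in lines_in_window(log_file):
        print(f"{log_file.name}: {line}")
```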

Cheers,
Jon

To be honest, I’m not sure if it’s a false positive. I’ll check the slave the next time one stalls.

Looking at the error in the log, there seems to be some issue with the active job not being found, even though it is still in the queue. Alternatively, it could be referring to the dependency, which had been deleted in this case. Would that cause the stall?

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: RenderMgr02
Version: v8.0.0.50 Beta (dd3a9e577)

Stalled Slave: render41
Slave Version: v8.0.0.50 Beta (dd3a9e577)
Last Slave Update: 2016-02-15 11:52:57
Current Time: 2016-02-15 12:13:18
Time Difference: 20.352 m
Maximum Time Allowed Between Updates: 20.000 m

Current Job Name: Aerial View 1 Night [Frame 0 - 100 Tiles]
Current Job ID: 56be16d3364b5964588740bf
Current Job User: NAME
Current Task Names: 59
Current Task Ids: 59

Searching for job with id “56be16d3364b5964588740bf”
Found possible job: Aerial View 1 Night [Frame 0 - 100 Tiles]
Searching for task with id “59”
Found possible task: 59:[59-59]
Task’s current slave:
Slave machine names do not match, continuing search
Associated job not found, it has probably been deleted.

Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.
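For reference, the stall decision in both reports comes down to the timestamp arithmetic shown above. Here is a minimal sketch of that comparison using the values from this second report; it only illustrates the numbers, not Deadline’s actual house-cleaning code.

```python
# Illustration of the timing check in the report above, using its reported
# values; this is just the arithmetic, not Deadline's implementation.
from datetime import datetime

last_slave_update = datetime(2016, 2, 15, 11, 52, 57)
current_time = datetime(2016, 2, 15, 12, 13, 18)
max_minutes_between_updates = 20.0

difference_minutes = (current_time - last_slave_update).total_seconds() / 60.0
print(f"Time Difference: {difference_minutes:.3f} m")  # ~20.35 m, matching the report

if difference_minutes > max_minutes_between_updates:
    print("Slave would be reported as stalled.")
```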

I think we might have figured out why our renderfarm is producing lots of stalled slaves.

It looks like Deadline 8 is producing lots of log files in C:\ProgramData\Thinkbox\Deadline8\logs, which has eaten all of the render nodes’ spare disk capacity.

One of the log files was 56GB!
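A quick way to confirm which files are eating the disk is to list the largest entries in that log directory. The path below is the default location mentioned above and is an assumption about the install.

```python
# List the ten largest files in the Deadline8 log directory, biggest first.
from pathlib import Path

LOG_DIR = Path(r"C:\ProgramData\Thinkbox\Deadline8\logs")  # assumed default location

sizes = sorted(
    ((f.stat().st_size, f) for f in LOG_DIR.iterdir() if f.is_file()),
    reverse=True,
)
for size, log_file in sizes[:10]:
    print(f"{size / 1024 ** 3:6.2f} GB  {log_file.name}")
```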

Hmmm, did you have a look at what the contents of the log file were?

Was it expected render output that was taking up all that space, or some spammed messages from Deadline itself?
