stalled slave report

using mantra, if i just let it use 1 concurrent task and max out the cores, things run fine.

as soon as i start to up the concurrent tasks (and reduce the # of threads per task), i start to get these failed jobs. the jobs report things like:

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: renderVFX034
Version: v6.1.0.54665 R
Note: This is a self cleaning operation because the slave previously did not shut down cleanly

Stalled Slave: renderVFX034
Slave Version: v6.1.0.54665 R
Last Slave Update: 2014-05-21 15:04:42
Current Time: 2014-05-21 15:05:20

the “last slave update” is 15:04:42 and the current time is “15:05:20”, so that’s only about 38 seconds since an update. is that REALLY what triggered the killing of the tasks on this host?

the repo settings indicate i’ve got 20 minutes before a slave is considered stalled.

i’ve considered that maybe ram is an issue and i’m simply tasking things too heavily, but each of these tasks needs only about 10% of available ram, so i should be fine with 6 concurrent tasks (4 threads each on a 24 thread machine). i’ve seen the same kind of flakiness with concurrent tasks set to 2. it’s like the concurrent tasks are somehow confusing the stalled slave logic (or maybe i’m just misunderstanding what a stalled slave is).
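as a sanity check on that 10% figure, i could log memory use while the tasks render, something along these lines (rough sketch; assumes the render processes show up as “mantra”, adjust if not):

# rough sketch: log free memory and the biggest mantra processes every 10 seconds
while true; do
    date >> ~/mem-watch.log
    free -m >> ~/mem-watch.log
    ps -C mantra -o pid,rss,pcpu,args --sort=-rss >> ~/mem-watch.log
    sleep 10
done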

any thoughts here?

Hello,

So in looking this over, it definitely looks like the slave is crashing. I am wondering if that could be caused by high memory use due to the concurrent tasks. Can you look for a crash report? If it’s on Windows, that would probably be in the Event Viewer.

this is running in linux.

how would i determine if the slave is crashing? i looked around the /var/log/ area and found some deadline logs, but nothing in there reported any crashes, it was just the output of the various tasks. although, i suppose the fact that there are logs in that area might imply that the slave crashed, as i would imagine the slave would clean those up on a proper exit.
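one thing i could check is whether the kernel’s OOM killer took the slave out; something like this should show it if it did (rough sketch; the exact syslog file varies by distro):

# rough sketch: look for OOM killer activity around the time the slave died
dmesg | grep -i -E "killed process|out of memory"
grep -i "oom" /var/log/messages    # or /var/log/syslog on debian/ubuntu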

so then the question is: was the slave killed by the “stalled slave” reaper, or did it really die? and how can i change the length of time before the slave is determined to be stalled/dead?

edit: i don’t think it’s a memory issue. i’m pretty confident the tasks i’m making concurrent are not high demand tasks. this also happens when i do even 2 concurrent tasks, though it’s more prevalent with higher concurrent task values.

Hello,

So Linux itself doesn’t appear to have any built-in crash reporting we can check. What we can have you do is run the following command:

./deadlineslave -console > ~/log.txt 2>&1

from the Deadline install folder; that will run the slave in console mode and write everything it does to log.txt in the user’s home folder. Once the slave crashes, send that file over to us. Thanks.
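If you also want to be able to tell a crash apart from a clean shutdown afterwards, you could record the exit status when the slave stops, along the lines of the following (just a sketch building on the command above):

# sketch: run the slave in console mode, capture its output, then record how it exited
# (an exit status above 128 generally means it was killed by a signal)
./deadlineslave -console > ~/log.txt 2>&1
echo "deadlineslave exited with status $? at $(date)" >> ~/log.txt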