Hi all,
We’ve got Deadline running on CentOS 6.2. However, for some reason, when I tell the Deadline Slave to restart, it stops the slave but doesn’t start it again (even when doing so directly on the slave through the Deadline Slave menu). The same goes for rebooting via Deadline Manager.
Been scratching my head on this for a while now.
We start the slaves by running /usr/local/Thinkbox/Deadline/bin/deadlinelauncher -slave. (Slaves running with the -nogui flag sometimes go stalled as well, so I’m not sure whether this is a separate issue from that, or whether it’s still a problem with Mono.)
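For the record, these are the two ways we kick things off (default client install path; I’m assuming -nogui is simply appended to the same launcher call, which is how we pass it here):

  # normal slave start (default install path)
  /usr/local/Thinkbox/Deadline/bin/deadlinelauncher -slave
  # headless start, with the -nogui flag appended
  /usr/local/Thinkbox/Deadline/bin/deadlinelauncher -slave -nogui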
Also, file permissions seem to be in order (I could check what happens when I assign ownership to root and the user group).
Also, after some time, some slaves stall. When I log in to a stalled slave, I see that Deadline has frozen (the window doesn’t redraw and is completely black).
Anyone seen that problem as well?
Thanks in advance.
That’s strange. After you tell the slave to restart from the slave menu, and it shuts down, can you grab the most recent launcher log and post it? The logs are found in the logs folder in the Deadline client install folder (ie: /usr/local/Thinkbox/Deadline/logs). The launcher should be launching the slave again, so hopefully there will be something in there that will explain the problem.
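Something like this should grab the newest one (I’m assuming the launcher logs start with “deadlinelauncher”; adjust the glob if yours are named differently):

  # list launcher logs newest-first and keep the top one
  ls -t /usr/local/Thinkbox/Deadline/logs/deadlinelauncher*.log | head -n 1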
Also, it would be helpful to see a log from a slave after it stalls too.
Thanks!
Will do. Shouldn’t take too long; we’re going to render a couple of thousand frames any second now.
I must add, though, that we now render using 4 concurrent tasks with VRay, as VRay is ‘bugged’ when it comes to rendering single tasks on E5-2680 systems (render times are all over the place and usually slower than on 6+ year old Gulftown systems; Chaos is looking into that, though).
I should also add that our current 8+ year old server (a Pentium 4) is starting to show its age. We will be upgrading to a new CentOS Sandy Bridge server and a new fiber network in the next week, so that might help as well. (I get the feeling that both our network and the old server are being thrashed at the moment… that might explain the (literally) smoking switch we had a couple of months ago.)
Here’s the first one: Deadline is unresponsive on the slave, and Deadline Monitor thinks it is still rendering, though the System Monitor shows no running VRay process and no CPU cycles going anywhere (the Deadline Slave’s waiting channel is utrace_stop).
The log is not being updated anymore. I’ve attached the part of the log from the start of rendering this job up to the point where it stops dead in its tracks.
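In case it’s useful, this is roughly how I checked the wait channel (I’m assuming the slave shows up as a mono process here; adjust the match to whatever your process list actually shows):

  # show the kernel wait channel of the (assumed) mono-hosted slave process
  ps -o pid,wchan:25,stat,cmd -C mono
  # or read it straight from /proc for a specific PID
  cat /proc/$(pgrep -f deadlineslave | head -n 1)/wchan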
deadlineslave_House166(House166)-2012-06-13-0001.log (688 KB)
Thanks for the log. Unfortunately, there is nothing in there that tells us if something went wrong.
Out of curiosity, do you see this problem happen with one or two concurrent tasks? Just wondering if it’s a concurrency issue…
That’s why I was baffled: no errors, no warnings, no nothing… just a Deadline Slave that becomes unresponsive.
I do remember that we had the same problem with single concurrent tasks; I should check, though.
Maybe something worth trying is running the slave in nogui mode. I know you had mentioned you had tried this before, and the slave would stall sometimes, but it would be interesting to see if they stall the same way, or if they actually crash.
Okay, I just ran a slave with -nogui, and it seems it crashes at a certain point; see the attached log. (I removed a gazillion lines from it, as it was 6.8 MB, and just posted the last thousand lines.)
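For what it’s worth, I trimmed it with something along these lines before attaching (the output name is just an example):

  # keep only the last thousand lines of the huge slave log
  tail -n 1000 'deadlineslave_House164(House164)-2012-06-14-0002.log' > slave_log_tail.log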
deadlineslave_House164(House164)-2012-06-14-0002.log (90.5 KB)
Hmm… the log shows vray seg-faulting, but I wouldn’t expect that to bring down the slave too.
I noticed in the original log that the frames per task was set to 15. Do you know if that’s the case for the latest log as well? Have you tried reducing the frames per task to 1 to see if that improves things? I’m just wondering if vray is hitting a memory issue or something, and restarting it after each frame might help if that’s the case.
Well, we hit a snag… I guess the SSD in that particular slave just died (folks, do not buy OCZ Octane SSDs; complete and utter rubbish, you’ve been warned), so I’m not sure how far that skewed the results…
Going to see what other slaves do…
Hey Ryan
Just an update.
We switched to Intel SSDs (the OCZ SSDs had a failure rate of over 60% within a month) and updated to the latest Deadline, 5.2.47700, which seems to have solved most problems.
Going to keep an eye on this; if we hit the ‘stalled slaves’ snag again, I’ll let you know.