DeadlineSlave CentOS 6.x disappears and other problems

More fun with CentOS 6.x and Deadline…

For some reason, when we render with VRay (or Maya Mental Ray), DeadlineSlave sometimes just ‘disappears’ (silently crashes?)

When we render Mental Ray for Maya (maya.bin), once a render completes DeadlineSlave sometimes just sits there, as if the render isn't finished yet (like it 'lost the connection' to the maya.bin it manages.)

I can't seem to find any useful VRay or maya.bin log files (and writing a custom log with memory and CPU usage for maya.bin, vray.bin and DeadlineSlave doesn't seem to yield any useful data.)
I'm going to try and see if I can get Munin to work for logging.

Anyone seen this behaviour before?
(link to last log backdoor.houseofsecrets.nl/down … 63_log.zip )

Can you post the Slave log from the session where either of these problems occurred? You can find the logs in the "logs" folder in the Deadline Client installation (ie: /usr/local/Thinkbox/Deadline/logs). We can take a look to see if there is any useful info in them.

Thanks!

  • Ryan

The logs are in the zip in the download link I posted. Or are we talking about other logs?

(The size is rather large, as my colleague seems to be unable to lower the verbose options…aka he’s too lazy to change a simple option :frowning: )

Sorry! Totally missed that. :slight_smile:

Just to confirm, is this log from when the slave crashed, or from when the render stalled?

This is from a slave where DeadlineSlave just vanished.
I couldn't check whether the frame finished rendering and the VRay instances exited when done, or whether they vanished/crashed as well (there were no instances running when doing a ps aux | grep vray).

This morning a couple of slaves had their DeadlineSlave process 'disappear' again.
I wrote a small shell script that periodically logs certain process information, and it shows something odd (I still have to install and figure out Munin.)
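
It's roughly along these lines (a simplified sketch, not the exact script from the zip; the interval and log path here are just placeholders):

#!/bin/bash
# Log CPU/memory usage of the processes we care about once a minute.
LOG=/tmp/proc_usage.log
while true; do
    date >> "$LOG"
    ps -eo pid,pcpu,pmem,rss,etime,args \
        | grep -E 'deadlineslave|vray\.bin|maya\.bin' \
        | grep -v grep >> "$LOG"
    sleep 60
done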

When I compare it to the DeadlineSlave log, it shows Deadline stops logging at 16:30, but the shell process log shows both the slave and VRay are still running and rendering until they go poof (or maybe VRay was done rendering, but Deadline re-queued those frames and I can't check whether they actually finished.)
Could it be the Linux OOM killer that kills Deadline? (Seems unlikely, but you never know.)
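
To rule that out, something like this should show any OOM killer activity around the time the slave vanished (standard CentOS log locations, nothing Deadline-specific):

grep -i 'out of memory' /var/log/messages     # OOM killer logs "Out of memory: Kill process ..."
dmesg | grep -iE 'oom|killed process'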

Link to the logs and the shell script I used: backdoor.houseofsecrets.nl/down … 4_logs.zip

Okay… I think it is a Mono thing; I found this in the logs:

Aug 2 01:05:42 House164 abrt[3463]: Saved core dump of pid 3616 (/usr/local/bin/mono) to /var/spool/abrt/ccpp-2012-08-02-01:05:41-3616 (446500864 bytes)
Aug 2 01:05:42 House164 abrtd: Executable '/usr/local/bin/mono' doesn't belong to any package

There's no core dump named 'ccpp-2012-08-02-01:05:41-3616' in that directory though.
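
If one of those dumps does survive, gdb can pull a backtrace out of it. A rough sketch, assuming abrt's usual layout where the core ends up as a file called coredump inside the dump directory:

# List abrt crash directories, then grab a backtrace from a surviving dump
ls -lh /var/spool/abrt/
gdb /usr/local/bin/mono /var/spool/abrt/ccpp-2012-08-02-01:05:41-3616/coredump -ex bt -ex quit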

Thanks! I noticed you’re running at least 2 concurrent tasks. Have you tested to see if the problem occurs when concurrent tasks are set to 1? Also, do you have a sense of what the memory usage is like on the machine when the slave dies? Just wondering if the slave is dying due to a lack of system resources.

Finally, are you running in -nogui mode, or with the slave user interface enabled?

Thanks!

  • Ryan

We run with the gui on.

vray.bin doesn't seem to go above 9.8 GB of memory with 10 GB of virtual assigned, and the machines have 32 GB of memory (again, hard to tell without an analysis tool that logs everything I need.)
We render with concurrent tasks because there is some serious thread locking with VRay and Sandy Bridge processors: rendering 4 concurrent tasks is faster than rendering 1 task. Say a single task takes 10 minutes (10 minutes per frame); 4 concurrent tasks take a little less than 10 minutes in total, resulting in a render time per frame of roughly 2.5 minutes.

It almost seems worth the hassle to switch the entire farm back to windows (shudder…)

Maybe try running with the GUI off to see if that helps at all (the slave itself requires fewer resources this way). Just run the slave from the terminal like this:

deadlineslave -nogui
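
If you want it to keep running after you log out of that terminal, plain nohup works; the log path here is just an example:

nohup deadlineslave -nogui >> /tmp/deadlineslave-nogui.log 2>&1 &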

Hopefully we can figure something out that doesn’t require switching back!

Hey Ryan,

As a workaround to keep renders going, I wrote a shell script yesterday that checks whether deadlineslave.exe is running; when it doesn't find it, it kills any errant vray and maya instances and then starts deadlineslave with the -nogui flag.
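
It's essentially this (a simplified sketch from memory, not the script verbatim; the poll interval is a placeholder):

#!/bin/bash
# Restart the slave if it has vanished, cleaning up orphaned renderer processes first.
while true; do
    if ! pgrep -f deadlineslave.exe > /dev/null; then
        pkill -9 -f vray.bin
        pkill -9 -f maya.bin
        deadlineslave -nogui &
    fi
    sleep 30
done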

Also, it doesn't seem to be just Mono related, as a slave reported this in /var/log/messages:

Jul 31 18:36:12 House167 kernel: vray.bin[24308]: segfault at 18 ip 00007f9331588bc1 sp 00007f92d8df0930 error 4 in libvray.so[7f933118b000+988000]
Jul 31 18:37:08 House167 abrtd: Directory 'ccpp-2012-07-31-18:36:13-24226' creation detected
Jul 31 18:37:08 House167 abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2012-07-12-15:57:18-8355'
Jul 31 18:37:08 House167 abrt[25217]: Saved core dump of pid 24226 (/usr/autodesk/maya2012-x64/vray/bin/vray.bin) to /var/spool/abrt/ccpp-2012-07-31-18:36:13-24226 (7834578944 bytes)
Jul 31 18:37:08 House167 abrtd: Executable '/usr/autodesk/maya2012-x64/vray/bin/vray.bin' doesn't belong to any package
Jul 31 18:37:08 House167 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2012-07-31-18:36:13-24226' exited with 1
Jul 31 18:37:08 House167 abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2012-07-31-18:36:13-24226, deleting

So, could still be a VRay memory problem.

The strange thing is that we have Windows slaves with 12 GB that run single-task renders of the same scene without exploding.

Another thing that makes troubleshooting harder is installing any updates or VRay nightlies: we rely on VRayScatter, which relies on the VRay SDK, so we've managed to work ourselves between a rock and a hard place when it comes to updating.

Sven

Hey Sven,

I wonder if the problem is strictly related to running concurrent tasks with VRay on your farm, since the Windows machines seem to run fine with a single task. I've seen a case where 3dsmax jobs don't run well concurrently on Windows (though I can't remember which renderer was being used), but basically, if 3 or 4 max instances tried to launch at the same time, some would crash. It just seems that some jobs are better geared for concurrency than others. You definitely have enough memory, but there could be other factors. For example, maybe VRay is still trying to use every core, so the multiple VRay instances compete with each other for CPU and/or other system resources, and that causes things to explode. That seemed to be the case in the 3dsmax concurrency issue we saw.

Does the problem occur on the Linux slaves when task concurrency is set to 1? Even though you have a solution to keep going with concurrent tasks, I think this is an important thing to confirm if you have the time to spare.

Cheers,

  • Ryan

Hey Ryan,

When we have the time I'll set the tasks to 1; it should indeed be interesting to see what it does.

It won't be in the next 2 weeks though, as after today I'm going on a short but necessary vacation (yay!)

I'll keep you posted once I'm back.

Sven

No rush! Enjoy your vacation!

Cheers,

  • Ryan

Hi Ryan,

I've been back for a while now, and we've made some changes that seem to work.

We switched all nodes to -nogui
We did tests with single tasks, which work with few problems.
We submitted jobs with 2 or more concurrent tasks per node, but the big difference is that this time we set a thread limit (which seems to apply per task): the number of available cores on the machine divided by the number of concurrent tasks. This seems to be the thing that solved it.
It seems VRay/MR (or CentOS) has trouble managing the threading when multiple processes each claim all cores on a machine. It all goes belly up at some point, with VRay stuck running at 0% processor power, which makes the slaves stall.
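
The thread limit itself is just cores divided by concurrent tasks; a trivial example (the core count here is only an illustration, not our exact hardware):

CORES=$(nproc)                                    # e.g. 16 on a dual 8-core node
CONCURRENT_TASKS=4
THREADS_PER_TASK=$(( CORES / CONCURRENT_TASKS ))  # 16 / 4 = 4 render threads per task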

Another problem we now need to solve is that the CentOS deadlinelaunchers seem to crash/disappear when a job they are rendering is set to suspended (I need to check the logs when this happens again.)

Sven

Hey Sven,

Glad to hear you were able to find a reliable solution!

That’s strange that the launcher would crash like that. Yes, please send us a launcher log when this happens again!

Cheers,

  • Ryan

That will probably occur somewhere in the next few weeks, as from this weekend on we go into the full render stretch with over 10,000 frames of 5K image renders. :slight_smile:

Sven