VRay 3 segfaulting, and Deadline not noticing

Hey guys,

We have a problem with VRay 3 on CentOS 6.x in combination with Deadline 6.1.

For some reason vray.bin finishes a frame (or sometimes doesn’t) and then segfaults during cleanup (as far as I can tell), or vray.bin eats up all memory (I’m guessing a leak) and the OOM killer kills the process.
Either way, Deadline doesn’t seem to pick this up; it thinks everything is hunky-dory and keeps going until the task stalls, or the Deadline Slave just up and leaves (I’m guessing the OOM killer takes that process out as well when vray.bin goes bananas).

Is this something that can be prevented (i.e. make the Deadline Launcher realize something has died), or does anyone have experience with VRay segfaulting on Linux?
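One quick way to tell an OOM kill from a plain segfault is to check the kernel log, since the kernel logs every OOM kill. A minimal sketch; the `check_oom` helper is just illustrative (not a Deadline tool), and the log wording is a CentOS 6-style guess:

```shell
# check_oom: scan kernel-log text on stdin for an OOM kill of a named process.
check_oom() {
    if grep -i "killed process" | grep -qi "$1"; then
        echo "oom-killed"
    else
        echo "no oom record"
    fi
}

# On a render node this would be:  dmesg | check_oom vray.bin
# Canned CentOS 6-style line for illustration:
echo "Killed process 4321, UID 500, (vray.bin) total-vm:33554432kB" | check_oom vray.bin
```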

with kind regards,

Sven Neve

Hello Sven,

Can you check whether, after things crash or are killed, the vray.bin process is still running? Thanks.

Hi Dwight,

I’ll submit a shot to the farm and see what happens when the tasks crash; I’ll get back to you later.

cheers,

Sven

Okay, so I let it run overnight.

I have a couple of machines that have segfaulted, and Deadline thinks they are still rendering (19 hours so far on a task that averages 7 minutes :slight_smile: )

A couple of things:
- DeadlineLauncher is still running
- DeadlineSlave is gone
- vray.bin is gone

cheers,

Sven

Also, I checked the logs after poking around a bit.

When I try to cancel a task that the Monitor and Launcher still think is rendering, neither seems to check whether the Slave even exists anymore.

For example, canceling the current task gives me the results below in the log (only a RestartSlave seems to wake the Launcher up to the fact that there is no Slave anymore).

See log snippet:

2014-06-06 11:37:30:  Checking for stalled slaves on this machine
2014-06-06 11:42:30:  Updating Repository options
2014-06-06 11:42:30:    - Remote Administration: enabled
2014-06-06 11:42:30:    - Automatic Updates: enabled
2014-06-06 11:42:30:  Checking for stalled slaves on this machine
2014-06-06 11:46:36:  ::ffff:192.168.123.63 has connected
2014-06-06 11:46:36:  Launcher Thread - Received command: OnLastTaskComplete CancelTask House162
2014-06-06 11:46:36:  Sending command to slave: OnLastTaskComplete CancelTask
2014-06-06 11:46:36:  Got reply: House162: Sent "" command. Result: "Connection refused"
2014-06-06 11:46:36:  Launcher Thread - Responded with: Success|
2014-06-06 11:46:46:  ::ffff:192.168.123.63 has connected
2014-06-06 11:46:46:  Launcher Thread - Received command: ForceStopSlave House162
2014-06-06 11:46:46:  No Slave to shutdown
2014-06-06 11:46:46:  Launcher Thread - Responded with: Success|
2014-06-06 11:47:10:  ::ffff:192.168.123.63 has connected
2014-06-06 11:47:10:  Launcher Thread - Received command: ForceRelaunchSlave House162
2014-06-06 11:47:10:  No Slave to shutdown
2014-06-06 11:47:10:  Launcher Thread - Responded with: Success|
2014-06-06 11:47:20:  Local version file: /opt/Thinkbox/Deadline6/bin/Version
2014-06-06 11:47:20:  Network version file: /mnt/DeadlineRepository/bin/Linux/Version
2014-06-06 11:47:20:  Comparing version files...
2014-06-06 11:47:20:  Version files match
2014-06-06 11:47:20:  Launching Slave: House162
2014-06-06 11:47:23:  ::ffff:192.168.123.63 has connected
2014-06-06 11:47:23:  Launcher Thread - Received command: ForceStopSlave House162
2014-06-06 11:47:23:  No Slave to shutdown
2014-06-06 11:47:23:  Launcher Thread - Responded with: Success|
2014-06-06 11:47:30:  Updating Repository options
2014-06-06 11:47:30:    - Remote Administration: enabled
2014-06-06 11:47:30:    - Automatic Updates: enabled
2014-06-06 11:47:30:  Checking for stalled slaves on this machine

So, normally the other Slaves detect a stall, re-queue the job for pickup, and mark the dead Slave as stalled. The Launcher then checks the database to see whether its Slave has been marked as stalled.

I think the housekeeping process is having trouble finding those Slaves. Can you try disabling the ‘use external process’ options to see if that helps these Slaves get detected as stalled? Just uncheck those boxes.

Hi Edwin,

I turned off those settings, and at that point all machines started bleating about cleanup processes already running, so I restarted the whole lot, and now the Deadline Launcher doesn’t even want to keep running anymore.

I’m not sure what is going on: the deadline6launcherservice claims to be running, but a “ps aux | grep -i dead” doesn’t show any Deadline programs at all.
When I check /var/run/deadline6launcherservice.pid for the process ID, the process with the PID contained in the file simply doesn’t exist.
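The mismatch is easy to demonstrate; here’s a throwaway sketch using /tmp and a deliberately bogus PID (the real file under /var/run needs root):

```shell
# Sketch of the stale-PID situation: the PID file names a process that is gone.
# /tmp stands in for /var/run here; 4999999 is above Linux's default pid_max.
PIDFILE=/tmp/deadline6launcherservice.pid.demo
echo 4999999 > "$PIDFILE"
PID=$(cat "$PIDFILE")
if kill -0 "$PID" 2>/dev/null; then
    echo "launcher is running (pid $PID)"
else
    echo "stale pid file: process $PID is gone"
fi
rm -f "$PIDFILE"
```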

I also noticed that I’m unable to start, restart, or stop the service because the .log and .pid files are inaccessible to the service (or at least to the user I try to start it with, I guess).
Does the user that runs the service, in our case ‘gast’, need to be in the sudoers list? (Right now it isn’t; the account has access to most of our file servers, so we’d like to limit its administrative rights to just what it needs to access files and render.)

$ /bin/su - gast -c "service deadline6launcherservice restart"
Password:
/etc/init.d/deadline6launcherservice: line 82: kill: (3097) - No such process
rm: cannot remove `/var/run/deadline6launcherservice.pid': Permission denied
Deadline 6 Launcher Service Stopped
/etc/init.d/deadline6launcherservice: line 91: /var/run/deadline6launcherservice.pid: Permission denied
Deadline 6 Launcher Service Started
/etc/init.d/deadline6launcherservice: line 93: /var/log/deadline6launcherservice.log: Permission denied
/etc/init.d/deadline6launcherservice: line 94: /var/log/deadline6launcherservice.log: Permission denied

It seems this is a separate issue from the vray.bin OOM problem, but I guess it somehow ties into the Deadline problem…

Still investigating; let me know if someone can point me in a direction that might solve this.

Sven

I’m so stumped right now…

All machines are set up the same; yet one machine just restarts the service and runs Deadline without a problem…

[root@House168 gast]# service deadline6launcherservice restart
Deadline 6 Launcher Service Stopped
Deadline 6 Launcher Service Started

[root@House168 gast]# ps aux|grep -i dead
gast      3176  0.3  0.1 1054480 49708 ?       Sl   10:59   0:15 mono --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui -name
root      5669  0.0  0.0  64612  1848 pts/0    S    12:14   0:00 /bin/su - gast -c "$DEADLINEBIN/deadlinelauncher" -nogui
gast      5677  0.6  0.0 773972 30844 ?        Ssl  12:14   0:00 mono --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -nogui
root      5760  0.0  0.0 103248   876 pts/0    S+   12:16   0:00 grep -i dead

The others don’t.

[root@House162 gast]#  service deadline6launcherservice restart
/etc/init.d/deadline6launcherservice: line 82: kill: (10106) - No such process
Deadline 6 Launcher Service Stopped
Deadline 6 Launcher Service Started

[root@House162 gast]# ps aux|grep -i dead
root     10195  0.0  0.0 103244   860 pts/0    S+   12:15   0:00 grep -i dead

Same environment vars, same VRay/Maya installs, same Deadline installs, same rights on folders and files (as far as I can tell), same service…

Getting more confused by the minute.

Can you add me via Skype? I’d like to work through this with you.

My account for work is “edwin.amsler”.

So, the service does need to start as root because we use su to drop down to the render user, and that PID file and log are likely owned by root, so not just anyone can write to them.

What I would do is set the write permissions of those files to rwx for everyone, then try restarting the service.
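Something like this, sketched on a throwaway file (the real targets, taken from the log above, are /var/run/deadline6launcherservice.pid and /var/log/deadline6launcherservice.log, and chmod on those needs root):

```shell
# Demonstrating the permission fix on a temp file: a root-only mode (600)
# blocks the service user, so open the file up for everyone (a+rw).
DEMO=$(mktemp)
chmod 600 "$DEMO"          # root-only, as on the broken machines
chmod a+rw "$DEMO"         # the suggested fix
stat -c %a "$DEMO"         # prints 666
rm -f "$DEMO"
```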

I’m going to dig into the init script to refresh myself on how it actually works. I should be able to provide more tips after that.

Alrighty, so I definitely wrote this darn script.

It looks like we delete the PID file when we stop the service, so go ahead and do the same, then try and start it using sudo. That should bring things back up and running.

Okay! Figured it out!

Looks like a variable we thought we were passing via the init script actually wasn’t being passed at all.

There’s a little ‘deadlineclient.sh’ file that we put in /etc/profile.d/ that sets the path to Deadline. For whatever reason, that script was missing on the problem machines, so when we told su to run the launcher, it would actually look in that user’s environment for the path to Deadline, not the environment we had set for it. Without ‘deadlineclient.sh’ there, there was no environment to point the way.
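For reference, here’s a minimal guess at what that missing file contains (the installed version may set more, but this is the variable the init script’s su line needs, with the path taken from the process listing above):

```shell
# /etc/profile.d/deadlineclient.sh (minimal sketch): export the Deadline 6
# client path so "$DEADLINEBIN/deadlinelauncher" resolves for the su'd user.
export DEADLINEBIN=/opt/Thinkbox/Deadline6/bin
```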

Whew.