
Deadline slave crashing on requeue

We’ve been seeing an issue over the last week where Deadline Slave will stall on a machine (the window shows up as “Not Responding” and has to be killed) after the job it’s rendering is requeued from Deadline Monitor. The frame requeues and renders successfully, but the slave just sits there, showing up as “Rendering” in the slave section of Deadline Monitor. Right-clicking on the slave in the Monitor and hitting “Remote Control” --> “Restart Slave” does work to kill the slave process and get it running again.

If I look at the logs from the requeue, I get something like this (I replaced the job and machine name in the log excerpt since this is a public forum). According to the job history, the actual requeue of this job took place at 20:33. I didn’t trim the log in the middle - that gap from 3:42 to the requeue at 20:33 actually exists in the log.

2013-04-27 02:52:37:  0: Loaded job: name_of_job (517b3fc72b9afb752068f379)
2013-04-27 02:52:37:  0: Skipping T: because it is already mapped
2013-04-27 02:52:37:  0: Skipping V: because it is already mapped
2013-04-27 02:52:38:  0: INFO: Executing plugin script C:\Users\slave\AppData\Local\Thinkbox\Deadline6\slave\MACHINENAME\plugins\MayaBatch.py
2013-04-27 03:42:42:  0: Slave timed out -- canceling current task...
2013-04-27 20:33:29:  Scheduler Thread - Cancelling task because task "14_43-45" could not be found
2013-04-27 20:33:29:  Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-04-27 20:33:43:  Scheduler Thread - Cancelling task because task "14_43-45" could not be found
2013-04-27 20:33:43:  Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-04-27 20:33:57:  Scheduler Thread - Cancelling task because task "14_43-45" could not be found
2013-04-27 20:33:57:  Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-04-27 20:34:08:  Scheduler Thread - Cancelling task because task "14_43-45" could not be found

And then it repeats those two lines until the slave is manually killed - it repeated them all weekend on a few of our slaves.

Is this something that you’ve seen before? We haven’t found time to upgrade from Beta 18 yet, so if it’s something that’s been dealt with since then, don’t worry about it.

So this happens when the Render thread dies without the scheduler thread noticing. I can’t remember if we added additional checks for this since Beta 18; I’ll have a look.

It’d be good to figure out why the Render thread is dying in the first place, too… Is this problem widespread, or is it just a couple of machines here and there exhibiting this behaviour? Judging from the output, it seems to be dying right after (or during) the execution of the plugin script, before actually calling any of its functions, I think. Did you guys make modifications to the MayaBatch.py script, or is it stock?

This isn’t very widespread - we had it happen to about 5 machines between Friday and today, and the rest of the farm has been rendering successfully. The problem machines have also been successfully rendering frames since I restarted the slave on them this morning.

To confirm, when you talk about the render thread dying, are you talking about the Deadline Slave process, or about the Maya process itself?

And yes, we have made a few modifications to our MayaBatch.py script. We added a “-script” argument to the render arguments, and we added some lines to the render script using scriptBuilder.AppendLine() in the area where you’re already doing that. I can look into whether either of these is causing problems next time we see this come up. My main worry at this point was the inconsistent slave status listings resulting in fewer rendering machines than what’s shown in the Monitor, but if you think it’s related to our script modifications, we can look into it ourselves instead of wasting your time on it.
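
For illustration, the AppendLine additions are along these lines (the MEL file and procedure names here are made up for this post; the real ones are specific to our pipeline):

# Added in the same area where MayaBatch.py already calls scriptBuilder.AppendLine()
# (the "-script" flag itself gets appended to the render arguments separately)
scriptBuilder.AppendLine( 'source "V:/path/to/studio_setup.mel";' )
scriptBuilder.AppendLine( 'studio_setup();' )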

I’m referring to the thread within the Slave process that executes the actual plugin script, which is then responsible for launching Maya. The Slave executable would still be around, but that particular thread isn’t, which can cause issues. The Maya process wouldn’t even have been launched yet, from the looks of it.

Given the plugin changes you described, I don’t think they would be the cause of this problem; it seems to be happening before any of the functions are called, so it’s likely on our end somewhere. I’ll look into it some more and let you know if I find anything.

So far I’ve come up short on anything that could be causing this. I think the next step is to turn on debug logging for the MayaBatch plugin, which will hopefully give us more information on where the Render thread is getting caught up. To do this, just add the line “DebugLogging=True” to your ‘[repo path]/plugins/MayaBatch/MayaBatch.dlinit’ file.
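
The tail end of that .dlinit file should then look something like this (whatever entries are already in there stay as they are; DebugLogging=True is the only new line):

About=Maya Batch Plugin for Deadline
DebugLogging=True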

If it’s still showing “Executing plugin script […]” as the last line before it hangs, I’d check your plugin again. I’d mainly be concerned with things added at the global scope (i.e., not inside a function or class). Even extra import statements could potentially be the cause, if some weird stuff is happening. From your earlier description it really didn’t sound like it, but it might be good to double-check anyway. If you’re comfortable sending us a copy of your modified script, that might be helpful, too.
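
Just to illustrate what I mean by global scope: something purely hypothetical like the snippet below, sitting at module level in MayaBatch.py, would run the moment the Slave executes the script, before GetDeadlinePlugin() or any other plugin function is called. If it blocked, “Executing plugin script […]” would be the last line you’d see.

import urllib2

# Hypothetical module-level code: this runs during "Executing plugin script",
# before the Slave has called into the plugin at all. If the server never
# answers, the Render thread just sits here.
BUILD_INFO = urllib2.urlopen( "http://tools.example.com/build.txt" ).read()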

Cheers,

- Jon

Thanks for the info - I was wondering about verbose logging this morning, so you answered that question before I asked it. :) I’ll enable debug logging and see if it turns anything up. If it doesn’t show anything obvious, I don’t think it should be a problem to send you a copy of our modified script. I’ll see what it looks like next time this happens, and I’ll keep you posted.

As for the script changes, you’re right - we didn’t change anything in the global scope. The next message on a successful job is “0: INFO: About: Maya Batch Plugin for Deadline”, so it must be hanging up somewhere between the beginning of the script and wherever that happens. But we’ll see what the debug logging turns up…

Thanks for your help!

I just had a slave hang with debug logging on. Here’s the chunk of the log before it hung:

Constructor: MayaBatch
0: Task timeout is 3000 seconds (Regular Task Timeout)
0: Loaded job: JOBNAME (517f01042b9afb45d7206067)
0: Skipping T: because it is already mapped
0: Skipping V: because it is already mapped
0: DEBUG: InitializeLocalVariables: loaded plugin config file C:\Users\slave\AppData\Local\Thinkbox\Deadline6\slave\COMPUTERNAME\plugins\MayaBatch.dlinit
0: DEBUG: InitializeLocalVariables: debug logging enabled
0: DEBUG: InitializeLocalVariables: setting job filenames
0: INFO: Executing plugin script C:\Users\slave\AppData\Local\Thinkbox\Deadline6\slave\COMPUTERNAME\plugins\MayaBatch.py
0: DEBUG: StartJob: initializing Deadline plugin

I’ll email you a copy of our MayaBatch.py script to avoid posting it here for all to see.

I also just (temporarily) removed our plugin sync program that runs from a preload script, in case it’s causing problems. I don’t think it should be related, since it seems to run after the point where the slave is crashing, but it’s one of the things that’s been changed here in the last few weeks anyway, so I figured it was worth a shot.

Alrighty, thanks for all the info. I’ll have another look at it here and keep you posted.

Hmm, I haven’t seen a slave crash since I removed the preload script earlier this afternoon. Not sure if it’s just a coincidence, but the farm should be busy all night, so that’ll be a better test. Is there any way that a preload script could be involved in the problem judging by the logs that I posted earlier?

It definitely could be, yeah. When there’s a pre-load script, Deadline creates the plugin’s scope using that pre-load script, instead of using the plugin itself to do it. It’s definitely getting to that point in the logs you’ve posted, but oddly enough the message to that effect isn’t there; it should have a line saying “Executing plugin preload script:” somewhere. What does a ‘good’ log look like for you guys?

When the plugin script runs, I do see the “Executing job preload script” message, and it runs all right. I get the log output from our sync executable, and it returns correctly and everything. Below is a log excerpt (with sync paths replaced, since this is a public forum):

0: INFO: Executing plugin script C:\Users\slave\AppData\Local\Thinkbox\Deadline6\slave\MACHINENAME\plugins\MayaBatch.py
0: DEBUG: StartJob: initializing Deadline plugin
0: DEBUG: StartJob: this is an advanced plugin job
0: INFO: About: Maya Batch Plugin for Deadline
0: INFO: The job's environment will be merged with the current environment before rendering
0: DEBUG: PreLoadJob: called
0: INFO: Executing job preload script C:\Users\slave\AppData\Local\Thinkbox\Deadline6\slave\MACHINENAME\plugins\JobPreLoad.py
0: INFO: Running opus launcher sync
0: STDOUT: Hg repository up to date
0: STDOUT: Syncing V:\path\to\source1 to C:\path\to\dest1 (recursive: 0, delete: 0)
0: STDOUT: Syncing V:\path\to\source2 to C:\path\to\dest2 (recursive: 1, delete: 0)
0: PYTHON: Opus sync exit code: 0
0: DEBUG: PreLoadJob: returning
0: INFO: Rendering with Maya version 2013.0

Strangely enough, we didn’t see the problem overnight with the pre-render script disabled, but it might not have been a great test because there were some artist-end render problems that caused a bunch of jobs to fail anyway.

Am I correct in assuming that the preload script isn’t touched until the “0: DEBUG: PreLoadJob: called” log message is shown?

I actually managed to confuse myself a bit; I thought you were referring to a plugin pre-load script, not a job pre-load script. It’s a very subtle difference, but the Job Pre-Load probably shouldn’t have an impact, since it gets loaded a bit later. You are right that it doesn’t get touched until that line gets printed out, unless there are some weird things going on with the output getting lost or not getting flushed.

Looking at the code, there really isn’t anything crazy going on between ‘StartJob: initializing Deadline plugin’ and ‘StartJob: this is an advanced plugin job’. The most notable thing is that this is the first place where we actually call functions in the plugin (first ‘GetDeadlinePlugin’, and then ‘Initialize’). It also pulls in the Environment Variables set by the job, which, judging by your ‘good log’, you guys seem to be using, so it might be related to that.
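
Very roughly, the sequence between those two debug lines is something like this (just a sketch of the order described above, with made-up names; not our actual code):

# "StartJob: initializing Deadline plugin" has just been logged
deadlinePlugin = pluginScope.GetDeadlinePlugin()   # first call into MayaBatch.py
deadlinePlugin.Initialize()                        # second call into MayaBatch.py
apply_job_environment_variables( job )             # merge the job's environment variables
# "StartJob: this is an advanced plugin job" is logged next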

Other than that, at this point I think we’ll just have to add more verbose debug logging in this section for RC2, to try and pin this down.

Ok, thanks. FYI, I still haven’t seen the problem since yesterday. It doesn’t make any sense that removing the preload script should affect it, but I guess I’ll have to wait and see how things go. We’ve got the farm working away on a bunch of jobs now, which will probably go for the rest of the day and overnight, so that’ll be a good test.

Cool, definitely keep us posted!

I saw this again on two machines last night. What I hadn’t noticed before is that, when it’s hung, the deadlineslave.exe process is constantly using more and more memory. After being hung for about 15 hours, deadlineslave.exe is using about 1GB of memory on both of these machines. And it’s definitely climbing steadily.

Note that I did re-enable the preload script last night, but the sync process that it launches isn’t running when the slave is hung (I still don’t think it makes it to the preload script judging by the logs). Just in case it affects anything, the preload script is as follows, and it works the rest of the time:

def __main__( deadlinePlugin ):
	deadlinePlugin.LogInfo( "Running opus launcher sync" )

	# Run the silent launcher and time out if it doesn't complete in 30 seconds.
	# RunProcess arguments: executable, command-line arguments, startup directory, timeout in milliseconds.
	exitcode = deadlinePlugin.RunProcess( "C:\\Program Files\\Opus VFX\\Launcher\\launcher_silent.exe", "0000_ProjectName maya_2013", "C:\\Program Files\\Opus VFX\\Launcher", 30000 )
	print "Opus sync exit code: " + str( exitcode )

Apparently we should have a fairly heavy weekend of rendering, so I’m going to disable the preload script to see if the problem comes up at all over the weekend.

Well, the farm was busy all weekend and I didn’t see the issue come up once. I’ll continue to experiment with the preload script, but it appears that things are running smoothly without it anyway.

Good to hear the weekend rendering went over smoothly.

One thing you could try, since it’s looking like it might be the pre-load script after all, is to switch it to a Plugin Pre-Load script instead of a Job Pre-Load. The main difference is that the Plugin Pre-Load runs slightly earlier than the Job Pre-Load does. If you’re syncing the actual plugin that’s currently running, that’s probably best anyway, since it will run before the plugin script gets touched at all. To do that, you should just have to rename the file from ‘JobPreLoad.py’ or ‘PreLoad.py’ to ‘PluginPreLoad.py’.

Thanks, I’ll give that a try. It does make sense to have this run as early as possible since it doesn’t need to interact with anything relating to the scene or the current job.

Not sure if I’m just missing it in the documentation, but I haven’t been able to find anything on the __main__ function parameters for the PluginPreLoad.py script. I was using the following for JobPreLoad.py:

def __main__( deadlinePlugin ):
	deadlinePlugin.LogInfo( "Running opus launcher sync" )
...etc...

But it looks like PluginPreLoad.py expects a different __main__ signature, as I’m getting the following error:

Python Exception: TypeError : __main__() takes exactly 1 argument (0 given) (Python.Runtime.PythonException)
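
If I’m reading that right, Deadline is calling __main__ with no arguments at all here, so my guess is that it wants something more like this (which would also mean launching the sync some other way, since there’s no deadlinePlugin to call RunProcess on):

def __main__():
	# Guess based on the TypeError above: nothing is passed in, so this
	# entry point takes no parameters.
	print "Running opus launcher sync"
	# ...launch the sync here without deadlinePlugin.RunProcess...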

Can you let me know what needs to be changed? Thanks!
