Pre and Post Job Script Causing Stalls

Hi there,

I added a pre-job script to update the source file in case its name was changed slightly. It generally works, but sometimes the repository doesn’t actually reflect the change and keeps the old file name, even though the script runs without throwing errors or warnings. I’ve posted about this previously here.
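
For reference, the pre-job script does something along these lines (a minimal sketch of the idea, not the actual script; the "SceneFile" plugin info key and the glob-based matching are stand-ins for whatever your plugin and naming convention actually use):

[code]import os, glob

from Deadline.Scripting import *

def __main__( *args ):
    deadlinePlugin = args[0]
    job = deadlinePlugin.GetJob()

    # "SceneFile" is the plugin info key many Deadline plugins use for the
    # source file; substitute the key your plugin actually uses.
    sourceFile = job.GetJobPluginInfoKeyValue( "SceneFile" )

    if not os.path.exists( sourceFile ):
        # The file was renamed slightly, so look for a near match
        # (illustrative rule only -- the real matching logic will differ).
        base, ext = os.path.splitext( sourceFile )
        candidates = glob.glob( base + "*" + ext )
        if candidates:
            deadlinePlugin.LogInfo( "Updating source file to " + candidates[0] )
            job.SetJobPluginInfoKeyValue( "SceneFile", candidates[0] )
            RepositoryUtils.SaveJob( job )[/code]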

To deal with this (and other potential issues) I added a post-job script that scans the job’s output and makes sure all the exported files that should exist are actually there. This is particularly important because we’re using Draft with Shotgun, so once the job is fully completed there’ll be video and filmstrip jobs created that expect to find the frames in the output folder. If the frames aren’t there, those two jobs will generate a set of errors and fail.
If the post-job script finds a missing frame, it marks the job as having failed to render once and requeues it. (I have the script set up to not requeue a job once it has been marked this way, to prevent an endless loop of errors if something goes terribly wrong.)

Both of these will, on occasion, stall the slaves. I’m not sure why the pre-job script would, unless there’s funny behaviour in the for loop. I suspect the post-job script is stalling because it requeues the very job it’s currently running under, and that may confuse the slave. For the time being I can edit it to only fail the job rather than attempt a requeue, and requeue it manually myself.
The reason I’d like the requeue command to execute safely is that another studio is running their own separate render farm. I do have occasional access to the other repository, but it’s inconvenient to reach. Most of the time, when the pre-job script doesn’t update the source file, a single requeue is enough to get it to update properly and then render the frames, so if I can resolve the stalls the system will largely handle the problematic behaviour on its own.

Even getting more information on what can trigger stalls would be helpful, both now and in the future, to avoid unsafe practices.

Here’s the export checker script:

[code]import os

from Deadline.Scripting import *
from Deadline.Jobs import *

def __main__( *args ):
    deadlinePlugin = args[0]
    job = deadlinePlugin.GetJob()
    frames = job.JobFramesList

    outputDir = job.JobOutputDirectories[0]
    outputFile = job.JobOutputFileNames[0]
    deadlinePlugin.LogInfo( "Checking for " + job.JobName + " exported frames in " + outputDir )

    # Frame-number placeholder in the output file name (e.g. "shot_#####.png").
    target = '#####'

    # Extra info key that records whether this job has already been requeued
    # once by this script, so we don't loop forever.
    previousFail = job.GetJobExtraInfoKeyValueWithDefault( "ExportFail", "False" )

    for f in frames:
        # Zero-pad the frame number to the placeholder's width.
        frame = str( f ).zfill( len( target ) )
        framePath = os.path.join( outputDir, outputFile.replace( target, frame ) )
        deadlinePlugin.LogInfo( "Checking " + framePath )

        if not os.path.exists( framePath ):
            deadlinePlugin.LogWarning( "Exported frame not found at " + framePath )
            if previousFail == "False":
                # First failure: mark the job, then requeue it for one retry.
                job.SetJobExtraInfoKeyValue( "ExportFail", "True" )
                RepositoryUtils.SaveJob( job )
                deadlinePlugin.LogWarning( "Marked Job as previously failed." )
                RepositoryUtils.RequeueJob( job )
                deadlinePlugin.LogWarning( "Requeued Job" )
            else:
                # Second failure: don't requeue again, fail the job outright.
                deadlinePlugin.LogWarning( "Job has failed before, marking as failed job." )
                RepositoryUtils.FailJob( job )
                deadlinePlugin.LogWarning( "Failed Job" )
            return

    deadlinePlugin.LogInfo( outputDir + " frames found. Job complete." )[/code]

Also, here’s a stalled slave report from the task reports for the pre-job script:

[code]STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: CS-Render-04
Version: v7.0.2.3 R (24b5c0a7f)

Stalled Slave: CS-Comp-023
Slave Version: v7.0.2.3 R (24b5c0a7f)
Last Slave Update: 2015-03-31 09:21:39
Current Time: 2015-03-31 09:34:36
Time Difference: 12.952 m
Maximum Time Allowed Between Updates: 10.000 m

Current Job Name: WIP-ANIM_pr202_sc067
Current Job ID: 550af2d592341d1768635ced
Current Job User: shotgun
Current Task Names: 0
Current Task Ids: -2

Searching for job with id "550af2d592341d1768635ced"
Found possible job: WIP-ANIM_pr202_sc067
Searching for task with id "-2"
Found possible task: -2:[0]
Task's current slave: CS-Comp-023
Slave machine names match, stopping search
Associated Job Found: WIP-ANIM_pr202_sc067
Job User: shotgun
Submission Machine: CS-Render-03
Submit Time: 03/19/2015 16:01:39
Associated Task Found: -2:[0]
Task's current slave: CS-Comp-023
Task is still rendering, attempting to fix situation.
The job is complete but this task is still rendering. This should never happen.
Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.[/code]

Hey Gary,

These scripts shouldn’t take down the Slave if there are problems, but it can happen (we’re working on fixing that in 8.0).

I think I’m going to have to test this. What OS is it running on? Windows or Linux?

If on Linux, would you be willing to start the Slave manually? It should be easy: stop the service with sudo service deadline7launcherservice stop, then run /opt/Thinkbox/Deadline7/bin/deadlineslave -nogui -console.

This is going to be one of those tough ones… The ‘-console’ flag should let us see the stack trace of why it died. The same flag works on Windows and OS X; only the paths to the Slave differ.

Hey Edwin,

They’re Windows machines (Windows Server 2008, Service Pack 1, 64-bit, if that helps!).

I’m just heading out the door now, but I can check with my supervisor in the morning about manually booting the slaves to get the detailed information.
That will only display the info as it happens, right? That’d be a bit unfortunate, because these have been rare occurrences; it could take days for it to crop up again. But I presume we wouldn’t have many other choices anyway.

Thanks!

Well, you can check the Slave logs. If a new one was created around the time of the stall, that would show that the Slave actually crashed.

I’ve seen it where, if the machine is overloaded enough, it will stall out because network connectivity was completely starved, but that was in the Windows XP days. Your script is just too simple for that to happen.

Getting it to happen reliably, or looking into the Slave logs, is going to be helpful here. If you want us to review them, just e-mail them over to support@thinkboxsoftware.com and we can take a look.

Just discovered this and the other (forums.thinkboxsoftware.com/vie … 11&t=13633) thread about Gary’s script.
It’s perfect, as I’m having the same issues with AE (files not being written even though they were rendered).

I tried the requeue version in the first post, but it requeues the complete job when only a few tasks have failed.
I assumed I could use it as a post-task script as well… but that also fails the entire job.

Is there maybe a version already that can requeue just the failed tasks (or frames)?

If not, I’ll look into it here.
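
A rough starting point might be a post-task version that checks only the current task’s frame range and requeues just that task. Something like the sketch below, though the RequeueTasks call and the TaskCollectionTasks/TaskId matching are guesses I still need to verify against the Scripting reference for our Deadline version:

[code]import os

from Deadline.Scripting import *

def __main__( *args ):
    deadlinePlugin = args[0]
    job = deadlinePlugin.GetJob()

    outputDir = job.JobOutputDirectories[0]
    outputFile = job.JobOutputFileNames[0]
    target = '#####'

    # Post-task scripts run once per task, so only this task's frame
    # range needs checking.
    missing = False
    for f in range( deadlinePlugin.GetStartFrame(), deadlinePlugin.GetEndFrame() + 1 ):
        framePath = os.path.join( outputDir, outputFile.replace( target, str( f ).zfill( len( target ) ) ) )
        if not os.path.exists( framePath ):
            deadlinePlugin.LogWarning( "Missing frame: " + framePath )
            missing = True

    if missing:
        # Requeue only this task rather than the whole job. (Unverified:
        # check RequeueTasks/TaskCollectionTasks in the Scripting reference.)
        currentTaskId = deadlinePlugin.GetCurrentTaskId()
        taskCollection = RepositoryUtils.GetJobTasks( job, True )
        toRequeue = [ t for t in taskCollection.TaskCollectionTasks if str( t.TaskId ) == currentTaskId ]
        RepositoryUtils.RequeueTasks( job, toRequeue )[/code]

The same caveat from earlier in the thread applies, though: requeuing the task the script is currently running under may be exactly what stalls the slave, so this needs testing before it goes farm-wide.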