Hi there,
I added a pre job script that updates the job’s source file in case its name was changed slightly. It generally works, but sometimes the repository doesn’t actually reflect the changes and keeps the old file name, even though the script has run without throwing errors or warnings. I’ve posted about this previously here.
To deal with this (and other potential issues) I added a post job script that scans the job’s output and makes sure all the exported files that should exist are there. This is particularly important because we’re using Draft with Shotgun, so once the job is fully completed, video and filmstrip jobs are created that expect to find the frames in the output folder. If the frames aren’t there, those two jobs will generate a set of errors and fail.
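Stripped of the Deadline API, the check boils down to something like the sketch below. The function name `missing_frames` and its parameters are hypothetical and only for illustration; it assumes the output file name contains a `#####` padding token, as in the full script further down.

```python
import os

def missing_frames(output_dir, output_pattern, frames, token="#####"):
    """Return the expected frame paths that do not exist on disk."""
    missing = []
    for f in frames:
        # Zero-pad the frame number to the width of the padding token.
        name = output_pattern.replace(token, str(f).zfill(len(token)))
        path = os.path.join(output_dir, name)
        if not os.path.exists(path):
            missing.append(path)
    return missing
```

An empty list means every expected frame was found.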
If the post job script finds a missing frame, it marks the job as having failed to render once and requeues the job. (I have the script set up to not requeue a job once it has been marked this way, to prevent an endless loop of errors if something goes terribly wrong.)
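The requeue-once guard described above can be summarised as a small decision function. This is only a sketch of the logic; `next_action` and its return values are made-up names, not part of the Deadline API.

```python
def next_action(missing_count, previously_failed):
    """Decide what the post job check should do with the job."""
    if missing_count == 0:
        # All expected frames exist, nothing to do.
        return "complete"
    if not previously_failed:
        # First failure: flag the job and give it one more attempt.
        return "requeue"
    # Already requeued once: fail the job instead of looping.
    return "fail"
```

Because the flag is stored on the job itself, a job can be requeued at most once by this check.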
Both of these will, on occasion, stall the slaves. I’m not sure why the pre job script would, unless there’s funny behaviour with the for loop. I am suspicious that maybe the post job script is stalling because it’s requeuing the job it’s currently working on and that may be confusing to it. For the time being I can edit this to only fail the job rather than attempt a requeue and I could manually requeue it myself.
The reason I’d like the requeue command to execute safely is that there’s another studio running their own separate render farm. I do have occasional access to the other repository, but it’s more inconvenient to get to. Most of the time, when the pre job script doesn’t update the source file, just one requeue is enough to get it to update properly and then render the frames, so if I can resolve the stalls, the system will largely handle the problematic behaviour on its own.
Even getting more information on what can trigger stalls would be helpful both now and in future to avoid unsafe practices.
Here’s the export checker script:
[code]import re, os
from Deadline.Scripting import *
from Deadline.Jobs import *

def main( *args ):
    deadlinePlugin = args[0]
    job = deadlinePlugin.GetJob()
    frames = job.JobFramesList
    outputDir = job.JobOutputDirectories[0]
    outputFile = job.JobOutputFileNames[0]
    deadlinePlugin.LogInfo( "Checking for " + job.JobName + " exported frames in " + outputDir)
    target = '#####'
    previousFail = job.GetJobExtraInfoKeyValueWithDefault("ExportFail", "False")
    for f in frames:
        frame = str(f)
        while len(frame) < len(target): frame = '0' + frame
        file = outputFile.replace(target, frame)
        deadlinePlugin.LogInfo( "Checking " + os.path.join(outputDir, file))
        if not os.path.exists(os.path.join(outputDir, file)):
            deadlinePlugin.LogWarning( "Exported frame not found at " + os.path.join(outputDir, file))
            if previousFail == "False":
                job.SetJobExtraInfoKeyValue("ExportFail", "True")
                RepositoryUtils.SaveJob( job )
                deadlinePlugin.LogWarning( "Marked Job as previously failed." )
                RepositoryUtils.RequeueJob(job)
                deadlinePlugin.LogWarning( "Requeued Job" )
            else:
                deadlinePlugin.LogWarning( "Job has failed before, marking as failed job." )
                RepositoryUtils.FailJob(job)
                deadlinePlugin.LogWarning( "Failed Job" )
            return
    deadlinePlugin.LogInfo( outputDir + " frames found. Job complete." )[/code]
Also, here’s a stalled slave report from the task reports for the pre job script:
[code]STALLED SLAVE REPORT
Current House Cleaner Information
Machine Performing Cleanup: CS-Render-04
Version: v7.0.2.3 R (24b5c0a7f)
Stalled Slave: CS-Comp-023
Slave Version: v7.0.2.3 R (24b5c0a7f)
Last Slave Update: 2015-03-31 09:21:39
Current Time: 2015-03-31 09:34:36
Time Difference: 12.952 m
Maximum Time Allowed Between Updates: 10.000 m
Current Job Name: WIP-ANIM_pr202_sc067
Current Job ID: 550af2d592341d1768635ced
Current Job User: shotgun
Current Task Names: 0
Current Task Ids: -2
Searching for job with id “550af2d592341d1768635ced”
Found possible job: WIP-ANIM_pr202_sc067
Searching for task with id “-2”
Found possible task: -2:[0]
Task’s current slave: CS-Comp-023
Slave machine names match, stopping search
Associated Job Found: WIP-ANIM_pr202_sc067
Job User: shotgun
Submission Machine: CS-Render-03
Submit Time: 03/19/2015 16:01:39
Associated Task Found: -2:[0]
Task’s current slave: CS-Comp-023
Task is still rendering, attempting to fix situation.
The job is complete but this task is still rendering. This should never happen.
Setting slave’s status to Stalled.
Setting last update time to now.
Slave state updated.[/code]