We’ve noticed that our Draft jobs sometimes have a problem where they’ll render a video but the upload gets interrupted due to a connection issue. Deadline handles this by generating an error report, but the Job is still considered complete. Our Draft jobs are set to be deleted upon completion (as we have a large repository, and it’d be further inflated by retaining all the Draft jobs).
So this is what seems to have been happening:
Draft Job Completed
This triggers the OnJobFinished event that uploads the video to Shotgun
This also triggers the job to be autodeleted
The upload is interrupted, generating an error report for a Job that’s already been deleted, so the report can only be found if the job is undeleted.
The problem with this in particular is that we’re in no way notified about it until we notice that there are versions on Shotgun that are missing video. However, due to the order of events, it seems like the script itself needs a way to catch the exception and undelete the job so it will show up in the repository.
Unless I’m missing something, there is no way to change the settings so a Job isn’t deleted until the OnJobFinished event has ended, is there? My assumption is that the event triggers both the deletion and the upload, so there’s no easy way to separate them. Do you have any suggestions for what might be a way to at least make these errors more visible?
The deletion of jobs on completion is actually not handled by the Event system.
Rather, it’s handled by the HouseCleaning process; it regularly checks for completed jobs with Delete/Archive OnComplete set, and performs those operations.
That said, a temporary solution could be to change the Shotgun Event Plugin so that it modifies the Job when the event fails, removing the Delete On Completion flag. At least this way, Jobs that generate errors in the Event Plugin won’t get automatically cleaned up.
Here’s a code snippet showing how to do this:
from Deadline.Scripting import RepositoryUtils

def OnJobFinished( self, job ):
    try:
        [...]
    except:
        #generated an error in OnJobFinished, make sure this job isn't flagged for deletion
        job.JobOnJobComplete = "Nothing"
        RepositoryUtils.SaveJob( job )
        #make sure the exception propagates up
        raise
I do agree though, that this is obviously a problem in general and should likely be fixed in Deadline itself. Maybe we need an option to not delete jobs automatically if they generated errors? That way clearing out the error reports would also result in the Job getting cleaned up.
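To make that idea a bit more concrete, here is a minimal sketch of the check such an option might perform. Everything in it is hypothetical: the `JobRecord` class, its fields, and `should_auto_delete` are made up for illustration and do not reflect how the HouseCleaning process is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical, simplified view of a job; not a Deadline type."""
    status: str         # e.g. "Completed"
    on_complete: str    # "Nothing", "Delete" or "Archive"
    error_reports: int  # number of error reports attached to the job


def should_auto_delete(job: JobRecord, skip_if_errors: bool = True) -> bool:
    """Decide whether a clean-up pass may delete this job."""
    # Only completed jobs flagged for deletion are candidates at all.
    if job.status != "Completed" or job.on_complete != "Delete":
        return False
    # Proposed option: keep the job around while it still has error reports.
    if skip_if_errors and job.error_reports > 0:
        return False
    return True
```

With a check like this, a completed job with error reports would stay in the repository until the reports are cleared, at which point the next clean-up pass would remove it.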
I think the problem, though, is that the jobs are deleted before the Shotgun upload script is complete. I might have been unclear before, but the issue is not that the upload fails at the start. It’s that partway through, a problem arises and closes the connection. And the jobs are actually getting deleted while the event is still running, I guess because the events aren’t considered part of the job, so the job itself is complete.
I think I could still use your solution though if I also undelete it, so that it reappears in the repository.
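A rough sketch of what I have in mind is below. It uses small stand-in classes so that it runs on its own: `FakeJob`, `FakeRepository`, and the `upload` callable are all hypothetical and are not part of the Deadline API. In the real event plugin, the flag change would go through `job.JobOnJobComplete` and `RepositoryUtils.SaveJob( job )` as in the snippet above, and the undelete step would need whatever call the Scripting API actually provides, which I haven't verified.

```python
class FakeJob:
    """Hypothetical stand-in for a Deadline job object."""
    def __init__(self, job_id, on_complete="Delete"):
        self.job_id = job_id
        self.on_complete = on_complete  # mirrors job.JobOnJobComplete


class FakeRepository:
    """Hypothetical stand-in for the repository; not the Deadline API."""
    def __init__(self):
        self.active = {}   # job_id -> job
        self.deleted = {}  # job_id -> job (the undelete window)

    def add(self, job):
        self.active[job.job_id] = job

    def delete(self, job):
        self.deleted[job.job_id] = self.active.pop(job.job_id)

    def undelete(self, job):
        self.active[job.job_id] = self.deleted.pop(job.job_id)

    def is_deleted(self, job):
        return job.job_id in self.deleted


def on_job_finished(job, repo, upload):
    """Run the upload; on failure, keep the job visible and re-raise."""
    try:
        upload(job)
    except Exception:
        # Make sure the job is no longer flagged for deletion.
        job.on_complete = "Nothing"
        # The clean-up may already have run while the upload was in progress.
        if repo.is_deleted(job):
            repo.undelete(job)
        # Let the error propagate so an error report is still generated.
        raise


repo = FakeRepository()
job = FakeJob("draft-001")
repo.add(job)


def interrupted_upload(j):
    repo.delete(j)  # simulate the clean-up running mid-upload
    raise ConnectionError("connection closed mid-upload")


try:
    on_job_finished(job, repo, interrupted_upload)
except ConnectionError as exc:
    print("upload failed:", exc)

print(repo.is_deleted(job), job.on_complete)
```

The point of the ordering is that the flag is cleared before the undelete, so the restored job isn't picked up for deletion again on the next clean-up pass.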
I agree that adding the error catching would be a good idea, especially because deleted jobs only last so long in the undelete window. It’s easy to inadvertently lose useful error logs that way.
Got it, I definitely hadn’t considered that, but in retrospect it makes a lot of sense that it would happen while the event was still going… We’ll have to make sure to take that into account as well when we implement this new feature.