I’ve been running into occasions where jobs become corrupt somehow and end up sitting “queued” until I resubmit them. This is normally pretty easy to handle, but lately I’ve been having to do this on some 600-700 task jobs. These jobs will render, say, 400 frames before they enter this limbo state, and then I have to resubmit them to get them running again. The problem with this is that there are so many tasks already marked complete, and I’d like the resubmitted job to also have these tasks marked complete.
As it stands now, I’m resubmitting the job, and then going in by hand and comparing the two jobs (which is very painful because of the auto-sorting happening every time I click on something) and ctrl-click on all of the frames and mark them complete. This is starting to take some hours out of the day especially on these huge jobs that are in the queue lately. I was wondering if there could be an option in the resubmit job dialog that would allow you to mark tasks complete on the re-submitted job so that this isn’t necessary?
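To make the request concrete, here is a rough sketch of the comparison I’m currently doing by hand. It is plain Python over made-up data, not a Deadline script: the function name, the task-ID-to-frame-range dictionary, and the sample frame numbers are all assumptions for illustration only.

```python
def tasks_to_mark_complete(old_completed_frames, new_tasks):
    """Return IDs of tasks in the resubmitted job whose frames were all
    completed in the original job.

    old_completed_frames: iterable of frame numbers finished in the old job.
    new_tasks: dict mapping task ID -> (first_frame, last_frame), inclusive.
    """
    done = set(old_completed_frames)
    return sorted(
        task_id
        for task_id, (first, last) in new_tasks.items()
        if all(frame in done for frame in range(first, last + 1))
    )


# Hypothetical data: frames 1-400 rendered before the job stalled, and the
# resubmitted job has one frame per task covering frames 1-600.
old_completed = range(1, 401)
new_job_tasks = {task_id: (task_id + 1, task_id + 1) for task_id in range(600)}

to_complete = tasks_to_mark_complete(old_completed, new_job_tasks)
print(len(to_complete), to_complete[0], to_complete[-1])  # 400 0 399
```

A task only counts as complete if every frame in its range finished in the old job, so jobs with several frames per task would not get partially rendered tasks marked done by mistake.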
Does suspending and resuming these “stuck” jobs help? Could you possibly archive a job that gets stuck like this and post it? We could import it here and see if we can reproduce.
The Task List in the Monitor has a feature where you can right-click on one or more tasks and resubmit them as a new job. Would that help here? Just select the incomplete tasks in the Task List and resubmit them, and then you don’t have to worry about the tasks that are already complete.
Cheers,
Ryan
Well, yes I’ve done some resubmitting of tasks in the past, but this approach tends to lead to confusion and some artists tend to get very upset when their job has a different frame range than what they submitted it with. This leads to angry emails circulating in the office, people requeueing frames, modifying frame ranges, suspending and resubmitting jobs, etc. In order to prevent all of that, it would be great to resubmit the job to give it a veritable kick in the pants, and automatically have all the same tasks set to complete so that the resubmission goes pretty much unnoticed.
I don’t have any jobs like this today that I can archive and send you, but next time I see this sort of thing I’ll let you guys know. Thanks!
p.s. I don’t think suspending/resuming the jobs works, but I will test again to see for sure.
Well, in this situation then, I think the best course of action is to figure out why your jobs end up in this state in the first place, rather than put in a feature that essentially does what resubmitting tasks as a new job already does.
So yeah, next time it happens, get us the job and we’ll take a look.
Thanks!
Ryan
Thanks! I imported the job here, and it seemed to pick up fine. I also inspected the exported job before importing it, and the state information looked fine. So just to confirm, did you try resuming this job after suspending it to see if that helped? Another thing worth testing is re-importing the job after exporting it to see if it picks up then.
Thanks!
Ryan
Laszlo helped me troubleshoot this a bit when I saw more jobs doing this. So I guess the jobs aren’t becoming corrupt; rather, something is causing machines to think they’re still rendering the job after the job has lost track of those machines. If one of these jobs has a machine limit of 1 and there is 1 machine out there somewhere with the job still assigned, that prevents any other machine from picking it up. Earlier this week I found a couple of these machines and noticed some errors in their logs. I was able to restart the slaves, and the job got released from them and was able to render normally again. I’ve attached a full slave log so that you can see the errors. Hope this helps.
slaveLog.txt (82.7 KB)
Seems like the job limit stub is never returned
Looks like odd behaviour resulting from when the Slave can’t connect to the DB… I’m actually rewriting a bunch of Limit stuff for 7, so I’ll make sure to include that in my tests, thanks for the log!
If you run a house cleaning (from the Monitor’s Tools menu), does it free up the limit stubs and let the Job render again? It should do so once the Slave has not been rendering the task for at least 5 minutes. If not, we might have to make the cleanup code a bit more robust.
Housecleaning is running from pulse quite regularly (every 5 mins), and these jobs are usually stuck for a long time before the artist or wrangling notices them. The previous workaround was to resubmit the job; now we know we can simply increase the machine limit by one (basically adding one more stub, since the other one never gets unstuck).
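For anyone finding this thread later, here is a toy model of what we think is going on. It is only a sketch in plain Python and not Deadline’s actual limit code: the `MachineLimit` class, its method names, the machine names, and the timestamps are all made up. The 5-minute stale window is the one mentioned above, and `house_clean` shows how the cleanup is supposed to behave, which is not what we are seeing on the stuck jobs.

```python
class MachineLimit:
    """Toy model of a per-job machine limit that hands out stubs."""

    def __init__(self, limit, stale_after=300.0):
        self.limit = limit
        self.stale_after = stale_after  # seconds (5 minutes)
        self.stubs = {}  # machine name -> time the stub was last refreshed

    def acquire(self, machine, now):
        """Give the machine a stub if it already holds one or a slot is free."""
        if machine in self.stubs or len(self.stubs) < self.limit:
            self.stubs[machine] = now
            return True
        return False

    def release(self, machine):
        """Return the stub; a slave that loses its connection never gets here."""
        self.stubs.pop(machine, None)

    def house_clean(self, now):
        """Drop stubs that have not been refreshed within the stale window."""
        stale = [m for m, seen in self.stubs.items()
                 if now - seen >= self.stale_after]
        for machine in stale:
            del self.stubs[machine]
        return stale


limit = MachineLimit(limit=1)
limit.acquire("render01", now=0)                # render01 takes the only stub
# render01 hits an error and never calls release(), so the stub is orphaned.
blocked = limit.acquire("render02", now=60)     # False: the job looks full

limit.limit += 1                                # workaround: raise the limit by one
unblocked = limit.acquire("render02", now=61)   # True: render02 can pick it up

cleaned = limit.house_clean(now=300)            # ["render01"], the orphaned stub
```

Raising the limit works because it adds a free slot alongside the orphaned stub. The cleaner fix is for the cleanup to reclaim the orphaned stub itself, as in the last line.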