job/task states inconsistent?

LaszloSebo · May 17, 2016, 1:26am

The job is rendering a task, but in the job view, it shows 0/1 and queued. Restarting the monitor didnt change the behavior.

jgaudet · May 17, 2016, 11:43pm

Strange, did the Job complete properly once the Task finished? This normally only happens when the Job’s internal task counts are off, which hasn’t been a bug in a while… Looks like it might be back

Looking at the job history and job reports, what was the sequence of events that seemingly led up to this? I.e., was it failing on a bunch of slaves beforehand, completed and got requeued by a user, etc, etc.

LaszloSebo · May 18, 2016, 1:22am

Seems like it was deleted, then undeleted within the same minute… Could that have caused it?

jgaudet · May 18, 2016, 4:17pm

Sounds likely, if it was deleted right as a Slave was picking up a task, this could happen. The Slave wouldn’t be able to find the Job to update the counts – normally this wouldn’t ever be noticed, but obviously un-deleting the job after this happened will make this surface.

I’ll do some testing locally to see if I can reproduce.

LaszloSebo · May 18, 2016, 4:51pm

Maybe the undelete operation could clear all the counters?

jgaudet · May 19, 2016, 11:27pm

Yeah, I think we’d have to recalculate the counts while doing the un-delete operation. I’m not sure if it would fix it completely, since calculating counts of documents in one collection and then setting values based on that count in another collection is inherently racy, but it should at least improve it greatly.

In the future we’ll likely need to change the way this is tracked altogether.

LaszloSebo · May 24, 2016, 6:15pm

We are finding this on a few jobs now that were never undeleted.

This one for example is in fact failed (half the tasks are complete, other half is failed)

jgaudet · May 25, 2016, 8:49pm

What was the history log on that one like? was there any end-user manipulation of the Job? Or was it all seemingly Slaves just doing their thing?

LaszloSebo · May 25, 2016, 8:51pm

This is the full history:

/05/19 11:11:20 scanlinevfx LAPRO0650 (SCANLINEVFXLA\scanlinevfx): Submitted Job.
2016/05/19 11:15:10 root lascript0004.scanlinevfxla.com (lascript0004.scanlinevfxla.com\root): [Render Node Localizer] Updating total localized data information.
2016/05/19 11:15:10 root lascript0003.scanlinevfxla.com (lascript0003.scanlinevfxla.com\root): Saved job.
2016/05/19 12:01:21 root lascript0003.scanlinevfxla.com (lascript0003.scanlinevfxla.com\root): [Long Render Cron] Updating the ETA for the task.
2016/05/19 12:01:21 root lascript0005.scanlinevfxla.com (lascript0005.scanlinevfxla.com\root): Saved job.
2016/05/19 15:10:14 root lascript0004.scanlinevfxla.com (lascript0004.scanlinevfxla.com\root): [Long Render Cron] Updating the ETA for the task.
2016/05/19 15:10:15 root lascript0004.scanlinevfxla.com (lascript0004.scanlinevfxla.com\root): Saved job.
2016/05/19 18:22:25 root lascript0003.scanlinevfxla.com (lascript0003.scanlinevfxla.com\root): [Long Render Cron] Updating the ETA for the task.
2016/05/19 18:22:27 root lascript0005.scanlinevfxla.com (lascript0005.scanlinevfxla.com\root): Saved job.
2016/05/20 09:33:51 ryan.flick LAPRO3193 (SCANLINEVFXLA\ryan.flick): Marking Tasks as Complete: [7257,7258,7259,7260,7252]
2016/05/20 20:02:35 ryan.flick LAPRO3193 (SCANLINEVFXLA\ryan.flick): Set Job Machine Limit
2016/05/20 20:02:35 ryan.flick LAPRO3193 (SCANLINEVFXLA\ryan.flick): Set Job Machine Limit Maximum
2016/05/20 20:02:35 ryan.flick LAPRO3193 (SCANLINEVFXLA\ryan.flick): Modified job priority to Tier 1
2016/05/20 20:02:35 ryan.flick LAPRO3193 (SCANLINEVFXLA\ryan.flick): Saved Job
2016/05/20 20:14:47 scanlinevfx LAPRO1897 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 155 error(s), which is more than the maximum limit of 150
2016/05/20 20:14:48 scanlinevfx LAPRO0873 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 159 error(s), which is more than the maximum limit of 150
2016/05/20 20:14:48 scanlinevfx LAPRO0775 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 155 error(s), which is more than the maximum limit of 150
2016/05/20 20:14:48 scanlinevfx LAPRO0881 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 156 error(s), which is more than the maximum limit of 150
2016/05/20 20:14:49 scanlinevfx LAPRO0456 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 156 error(s), which is more than the maximum limit of 150
2016/05/20 20:15:00 scanlinevfx LAPRO0885 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 160 error(s), which is more than the maximum limit of 150
2016/05/20 20:15:09 scanlinevfx LAPRO0991 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 162 error(s), which is more than the maximum limit of 150
2016/05/20 20:15:11 scanlinevfx LAPRO0435 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 162 error(s), which is more than the maximum limit of 150
2016/05/20 20:15:14 scanlinevfx LAPRO0515 (SCANLINEVFXLA\scanlinevfx): Failing job because it has incomplete tasks which have reported 163 error(s), which is more than the maximum limit of 150

My feeling is that there is some racing condition trying to set the job states when the failure count is over the threshold