machine restarted, deadline showing 'starting up'

LaszloSebo · August 14, 2014, 2:48am

The machine got rebooted right after it dequeued a task:

But deadline is still showing it as if it was in the startup phase… We get this a lot, and wranglers, coordinators, production people have to find these tasks manually and requeue them. Shouldnt this happen automatically? It picked up another task and its been happily rendering it for 30 mins, but it didnt clean out the previous entry, neither did pulse realize whats going on…

LaszloSebo · August 14, 2014, 2:54am

The last lines of the offending log:

2014-08-13 18:10:46: Scheduler - The 53eb9ee8c3ce0a22c0250fa1 limit is maxed out.
2014-08-13 18:10:46: Scheduler - The 53eb9eeac3ce0a0938a86b3c limit is maxed out.
2014-08-13 18:10:46: Scheduler - The 53eabec6bb649a0b14175223 limit is maxed out.
2014-08-13 18:10:46: Scheduler - The 53ebd644bb649a22e819eec2 limit is maxed out.
2014-08-13 18:10:46: Scheduler - The 53ebb728bb649a21b8330466 limit is maxed out.
2014-08-13 18:10:46: Scheduler - The 53eabeb4bb649a0ab8ad8aed limit is maxed out.
2014-08-13 18:10:46: Scheduler - The 53eabec0bb649a18e031c79a limit is maxed out.
2014-08-13 18:10:46: Scheduler - Slave has been marked bad for job 53ec01ab01f73e2e406a4ef3, skipping this job.
2014-08-13 18:10:46: Scheduler - Slave has been marked bad for job 53ea56338b53a01ccccff2a3, skipping this job.
2014-08-13 18:10:46: Waiting for threads to quit. 28 seconds until forced shutdown.
2014-08-13 18:10:47: Waiting for threads to quit. 27 seconds until forced shutdown.
2014-08-13 18:10:48: Waiting for threads to quit. 26 seconds until forced shutdown.
2014-08-13 18:10:48: Scheduler - Successfully dequeued 1 task(s). Returning.

The first lines of the next log:

2014-08-13 18:15:13: BEGIN - LAPRO0842\scanlinevfx
2014-08-13 18:15:13: Deadline Slave 6.2 [v6.2.0.32 R (2563d5bc8)]
2014-08-13 18:15:21: Auto Configuration: A ruleset has been received
2014-08-13 18:15:21: Auto Configuration: Setting Launch Slave At Startup value to ‘false’
2014-08-13 18:15:21: Auto Configuration: Setting Restart Stalled Slave value to ‘True’
2014-08-13 18:15:24: Info Thread - Created.
2014-08-13 18:15:24: Slave ‘LAPRO0842’ has stalled because it has not updated its state in 4.677 m. Performing house cleaning…
2014-08-13 18:15:25: Could not find associated job for this slave.

rrussell · August 14, 2014, 3:08pm

I think it’s the same issue as this:
viewtopic.php?f=86&t=12172&p=53699&hilit=housecleaning#p53699

Housecleaning should be taking care of these, so I think what we need to do is replicate this state here and test our housecleaning to make sure this is working properly.

Cheers,
Ryan

LaszloSebo · August 14, 2014, 4:19pm

Thanks Ryan,

I think this should be a double fail safe process, first, the slave should check if its assigned to any other tasks and clean those out, and as a safety, housecleaning should also do general sweeps (in case the slaves dont come back up to do their own cleaning). Seems like this is the idea already, but both fail…

cheers
laszlo

eamsler · August 14, 2014, 5:24pm

The Slave is supposed to check and see if it has any tasks dequeued when it starts up. That’s the source of those strange ‘stalled slave’ reports that have a timeout of less than what’s configured in the database.

It shouldn’t be too hard to replicate here. I’ll just kill my Slave and see what goes down.

eamsler · August 14, 2014, 5:35pm

Sorry, I missed the bottom text that clearly shows that it detects the stall.

The next time it happens, we’ll need to dig into what the job id is. I’m guessing there’s some point where the id hasn’t been written yet. I’ll see if Vibhu knows the secrets of the related queries. Otherwise, I’mma need Jon.

Update:

I definitely managed to break things by killing the Slave during task dequeue.

My status in the Monitor is ‘waiting to start’ while the Slave itself is sitting idle. I can’t archive the job because changing its state would likely fix things, but I’ll keep it laying around until the guys get back. Also, I also got “Could not find associated job for this slave.”

LaszloSebo · August 14, 2014, 6:26pm

Seems like lack of transactioned db handling again…

LaszloSebo · August 14, 2014, 6:47pm

To illustrate how bad this is, it affects every single job we have in the queue…

Im troubleshooting something else, and have a couple of these frames already (not ‘starting up’, but stuck in ‘rendering’, even though the machine was taken offline for maintenance 10 mins ago)

eamsler · August 14, 2014, 7:53pm

It’s a bit weirder than you think.

It looks like task creation only fill certain parts of the task, but that doesn’t seem to be the problem either. Me and Ryan will work through this once he’s back and settled in.

I’m seeing things that don’t make any sense. The difference between a stuck task and a rendering one isn’t obvious from the task (as represented in the database) alone.

eamsler · August 14, 2014, 9:49pm

Okay, so me and Ryan G went though this and found out that the issue in our case seems to be that the SlaveInfo wasn’t updated with the JobID when it dequeues, so it can’t find the job it was working on (hence the error). It does update the task id in the SlaveInfo, so an update of some description is happening, the Slave may just not have enough info yet.

He’s going to look into it more tomorrow, progress is being made.

We also should probably have some concept of a stalled task during house cleaning since we have the last write time for it…