We have a job that has a single task, and it's queued.
But in the appropriate slave logs, I see this entry for it:
2013-11-26 09:18:27: Scheduler - The 5293e1f628500289b4a42dd7 limit is maxed out.
Seems like its limit stubs are stuck?
Sounds like it. Housecleaning should take care of this, but the stub needs to be in use for at least 5 minutes before it is released. If it’s been that long, maybe try doing a manual house cleaning from the Tools menu in the Monitor to see if that takes care of it.
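If you want to see the raw state of that limit while it's stuck, here's a minimal pymongo sketch (assuming a local mongod, the default "deadlinedb" database name, and that the limit lives in the LimitGroups collection; adjust the connection and names to your setup):

# Sketch only: print the raw limit document referenced in the slave log above.
# Assumes pymongo is installed, mongod is running locally, and the database is "deadlinedb".
from pprint import pprint
from pymongo import MongoClient

db = MongoClient("localhost", 27017)["deadlinedb"]
pprint(db["LimitGroups"].find_one({"_id": "5293e1f628500289b4a42dd7"}))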
Cheers,
The job was stuck overnight in that state. I ended up increasing its machine limit to 2 to get it working (it was set to 1; it's a single-frame sim job). It was a high-priority sim, so it needed to be pushed through.
Another case of this, with different behavior now
The job is sitting there, idle, and the machines that should be rendering it (also idle) don't even list it in their job candidate list. Could it be that we have so many jobs that when the slave queries the Mongo DB, it gets a set amount (say, the first 200), but not all the jobs it could render?
Attached is a log from a slave that should be rendering this job. The job ID is 529b95d9c3f6ebb1d470ad52.
Tried a manual house cleaning; it didn't jump-start the job.
We have found several jobs that behave the exact same way… they simply stay queued and never show up in any logs.
If we simply resubmit the job, the new instance of it picks up just fine…
If we suspend the job and then resume it, it starts picking up again… Something is really odd…
The problem is, we also deleted about 5000 jobs, so it could be related to that somehow.
Hey Laszlo,
Hmm, that job ID doesn’t even show up in that log, which means that for whatever reason, the slave isn’t even considering it. Does this only seem to be happening for single task jobs? Also, are there any logs or errors for this job before it gets stuck? Just wondering if an attempt was ever made to render the job, and then somehow that messed things up.
I wouldn’t think deleting a bunch of jobs would make a difference.
Nope, that wouldn’t be the case. The slave loops through all possible jobs.
Cheers,
Yes, the job actually had a lot of tasks, and a bunch of them were already finished by a couple of other machines. I checked the log of the last machine that rendered a task on it, and there was nothing special; it finished the tasks and even logged returning the job stubs…
That’s interesting that it logged returning the stubs. It’s almost like it thought the job had no more tasks to work on. Do you remember what the Weight value for the job looked like, and if suspending/resuming the job affected the weight at all? Just wondering if there is a potential bug in the weighted scheduling algorithm…
The weight in the Monitor showed 5000, which seemed correct (100 × priority + 100 × active tasks, with the priority at 50). Not sure if that’s the same weight the slave calculated for itself, though.
Thanks. Both the slave and the Monitor call the same function to calculate the weight, so they shouldn’t be different.
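To make the arithmetic concrete, here’s a minimal sketch of that kind of weight calculation. This only illustrates the numbers Laszlo quoted (a priority weight of 100 and an active-task weight of 100), not the actual Deadline code, and it assumes “active tasks” means tasks currently rendering, which would be zero for a stuck, queued job:

# Illustration only: not the actual Deadline weighting code.
# With priority 50, a priority weight of 100, a task weight of 100, and
# no tasks currently rendering, the weight works out to 5000.
def job_weight(priority, active_tasks, priority_weight=100, task_weight=100):
    return priority * priority_weight + active_tasks * task_weight

print(job_weight(priority=50, active_tasks=0))  # 5000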
I wonder if we should add a way to take a “snapshot” of an object in the database so that we can easily view its raw data. In this case, we would get you to take a snapshot of the job, its task collection, and its machine limit to try and get a sense of what’s going on. It could be done now by using the mongo.exe application from the command line, but that’s not very user friendly.
I’ll think about this for a bit and see if it’s something we can/should add for beta 13.
Cheers,
OK, it’s not a problem for me to use the mongo command line to get this info, so if you send me a batch file or something like that, I can send you the faulty jobs. They have all been requeued over the weekend, and we cleaned out many of the old completed jobs (much to the frustration of our artists…).
Cool. Here are the commands that use mongoexport.exe (which is in the mongo bin directory) to export the job, all of its tasks, and its machine limit:
mongoexport -d deadlinedb -c Jobs -q {'_id':'529cdfa1a3c72c08c89a05fa'}
mongoexport -d deadlinedb -c JobTasks -q {$and:[{'_id':{$gte:'529cdfa1a3c72c08c89a05fa_'}},{'_id':{$lt:'529cdfa1a3c72c08c89a05fa~'}}]}
mongoexport -d deadlinedb -c LimitGroups -q {'_id':'529cdfa1a3c72c08c89a05fa'}
Note that these commands assume you’re running them on the same machine that mongo is running on, and that your database name is “deadlinedb”. They also currently use the job ID “529cdfa1a3c72c08c89a05fa”, so change that as necessary too. The JobTasks command uses an ID range because task IDs are prefixed with the job ID.
These commands will dump to stdout, so you can just redirect to a file:
mongoexport -d deadlinedb -c Jobs -q {'_id':'529cdfa1a3c72c08c89a05fa'} > job.txt
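If the command-line quoting gets awkward on your shell, a small pymongo script does the same export. This is just a sketch under the same assumptions (local mongod, “deadlinedb” database, the collection names from the commands above); swap in the real job ID:

# Sketch of the same export via pymongo; assumptions as above.
import json
from pymongo import MongoClient

job_id = "529cdfa1a3c72c08c89a05fa"  # change to the faulty job's ID
db = MongoClient("localhost", 27017)["deadlinedb"]

def dump(path, data):
    # default=str handles ObjectIds, dates, and other non-JSON types
    with open(path, "w") as f:
        json.dump(data, f, default=str, indent=2)

dump("job.json", db["Jobs"].find_one({"_id": job_id}))
# Task IDs are prefixed with the job ID, so a range query grabs all of them.
dump("tasks.json", list(db["JobTasks"].find({"_id": {"$gte": job_id + "_", "$lt": job_id + "~"}})))
dump("limit.json", db["LimitGroups"].find_one({"_id": job_id}))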
Thanks!
We have this behavior with quite a few jobs right now… around 180 to be exact.
Attached is a log from a machine that picked up a low-weight job instead of one of the 180 higher-weighted jobs. I also attached the exported data sets for a couple of the higher-weighted jobs, as well as for the low-weight job.
json__529e26fec3f6eb5178b9863d.tar (120 KB)
json__529e612a2eb2afcd68dd5c65.tar (50 KB)
low_weight_job_json__529d66f277dd1b83d8eedc58.tar (110 KB)
Another low-weight job’s DB entries + slave log
json__529cc9ff2f5d8719f81f6ff8.tar (170 KB)
deadlineslave_LAPRO0437-LAPRO0437-2013-12-03-0001.log (413 KB)
From your email, it sounds like this was due to the database being overwhelmed by too many Monitors. We’re currently investigating ways to deal with this.
5 frames were completed by lapro1220 before the job hung up.
Attached is that slave’s log file
deadlineslave_LAPRO1220-LAPRO1220-2013-12-05-0000.log (3.93 MB)
Thanks Laszlo! We were able to reproduce the problem by dropping this job into our database. It’s a problem that would only affect jobs with frame dependencies. We’ve fixed the bug, and will be including the fix with beta 13.
Cheers,
Ryan