sequential job requeuing

LaszloSebo · August 28, 2015, 1:34am

We noticed that quite often, machines don’t stick to their sequential jobs as they are supposed to.

One particular case I looked at had this error between tasks:

2015-08-27 18:19:51:  0: Render time for frame(s): 1.109 m
2015-08-27 18:19:51:  0: Total time for task: 1.144 m
2015-08-27 18:19:53:  0: Saving task log...
2015-08-27 18:19:54:  Scheduler Thread - Render Thread 0 completed its task
2015-08-27 18:19:54:  Scheduler Thread - Seconds before next job scan: 2
2015-08-27 18:19:56:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2015-08-27 18:19:56:  Scheduler - Error occurred while scanning for jobs: An error occurred while trying to Query the Database (deadline01.scanlinevfxla.com:27017,deadline02.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that Deadline failed to Authenticate properly. Please check that the Mongo Username/Password are correct.
2015-08-27 18:19:56:  Full error: QueryFailure flag was assertion src/mongo/db/structure/btree/key.cpp:433 (response was { "$err" : "assertion src/mongo/db/structure/btree/key.cpp:433" }). (FranticX.Database.DatabaseConnectionException)
2015-08-27 18:19:56:  Scheduler - Performing Job scan on Secondary Pools with scheduling order Pool, Weighted, Balanced
2015-08-27 18:19:57:  Scheduler - Using enhanced scheduler balancing
2015-08-27 18:19:57:  Scheduler - Job chooser found no jobs.
2015-08-27 18:19:57:  0: Shutdown
2015-08-27 18:19:57:  0: Exited ThreadMain(), cleaning up...
2015-08-27 18:19:57:  0: INFO: End Job called - shutting down 3dsmax plugin
2015-08-27 18:19:57:  0: Shutdown
2015-08-27 18:19:58:  0: Shutdown
2015-08-27 18:19:58:  0: Shutdown

Not sure what’s going on; we are not using authentication.

rrussell · August 28, 2015, 6:10pm

Hey Laszlo,

Do you happen to know (or are able to determine) if that same error comes up in most/all cases where the slave drops its sequential job?

Currently, if a slave is working on a job and then doesn’t find a task for that job on the next dequeue, it unloads that job. When an error occurs while getting the initial list of job candidates, it looks like that’s the equivalent of finding no tasks, so the slave unloads the job it was working on. This would normally be fine for regular jobs, since if that job is still at top priority, the slave will pick it up again during the next dequeue cycle. However, it’s a problem for sequential jobs for obvious reasons.

Regardless of which error occurs, we should try to make the dequeuing system more robust for sequential jobs in this case (i.e., don’t unload the current job if an error occurs when getting the initial list of job candidates). It would still be nice to know, though, whether you see this specific error in all cases where this problem occurs.
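
To make the idea concrete, here is a rough Python sketch of the dequeue decision described above. It is not Deadline's actual code: `JobScanError`, `next_job`, and `scan_for_candidates` are hypothetical names made up for illustration, and the real scheduler has more to consider (pools, priorities, limits).

```python
class JobScanError(Exception):
    """Hypothetical error raised when the job candidate query fails."""


def next_job(current_job, is_sequential, scan_for_candidates):
    """Return the job the slave should hold after one dequeue cycle.

    scan_for_candidates is a hypothetical callable that returns a list of
    job ids with available tasks, or raises JobScanError.
    """
    try:
        candidates = scan_for_candidates()
    except JobScanError:
        # The scan failed, so nothing is known about the queue. For a
        # sequential job, keep the current job loaded and retry next cycle.
        if is_sequential and current_job is not None:
            return current_job
        # Regular jobs: same as the current behavior (treated as "no tasks").
        return None

    if current_job in candidates:
        return current_job
    # A successful scan found no task for the current job, so unload it
    # and take the first candidate, if there is one.
    return candidates[0] if candidates else None
```

The point is only that a failed scan and an empty scan are handled separately: an empty result still unloads the job, while an error leaves a sequential job in place.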

Cheers,
Ryan

LaszloSebo · August 31, 2015, 9:04pm

Yeah, the other cases of ‘machine changes’ are all expected, as far as my sampling could determine (slave shut down by wrangling, further tasks still pending, max crashing, etc.).

I think that would be best: simply ‘keep going’.