
beta2, limit issue

Hi there,

One of the changes in beta2 was a reduction in the number of database queries for active limits. What we started seeing with the new build is that machine limits are not respected.
This job, for example, has had a machine limit of 1 ever since its submission:

[screenshot: blarg.png]

I got several reports from artists about this issue; one artist, for example, reported the same behavior on 3 of his jobs.

Here is another:

[screenshot: blarg.png]

The change we made shouldn’t cause this behavior. The limits are always loaded during the dequeue phase, so periodically reloading them once the rendering has started wouldn’t have any impact.
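For context, the way a machine limit is supposed to hold is that the check and the claim happen as one atomic operation against the limit document during dequeue. Here’s a rough sketch of that pattern in Python/pymongo (illustrative only; the LimitGroups/Stubs/StubCount names are the ones we use later in this thread, not necessarily the actual schema):

# Sketch of atomic limit-stub acquisition at dequeue time. This shows
# the pattern, not Deadline's actual implementation; the database,
# collection, and field names here are assumptions.
from pymongo import MongoClient, ReturnDocument

limits = MongoClient("mongodb://localhost:27017")["deadlinedb"]["LimitGroups"]

def try_acquire_stub(job_id, slave_name):
    # Filter and update run as a single atomic findAndModify, so two
    # slaves can never both pass the StubCount < Limit check.
    # ($expr needs MongoDB 3.6+; older servers would use $where instead.)
    doc = limits.find_one_and_update(
        {"_id": job_id, "$expr": {"$lt": ["$StubCount", "$Limit"]}},
        {"$push": {"Stubs": slave_name}, "$inc": {"StubCount": 1}},
        return_document=ReturnDocument.AFTER,
    )
    return doc is not None  # None means the limit was already full

If a stub were ever claimed without going through a check like that, or a returned stub weren’t decremented, you’d see exactly this kind of over-subscription.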

Do you have the logs from these slaves? I’m hoping there is some sort of database error that could help explain this…

Thanks!
Ryan

Weird thing is, in the second screenshot for example, I can’t find that job referenced in the slave log at all… I got the report and screenshot at 11:17 pm last night, which means the machine would already have been assigned to that task by then. But it doesn’t show up in the log until about an hour later…

I have uploaded the logs to Box, under Logs and Debug Information, as “multiple_machines_limit_1.zip”.

I wonder if it’s a case where orphaned limit stubs are being returned, but not the tasks themselves. You had reported in another thread that a task was stuck in the waiting-to-start phase, but there was no mention of it anywhere in the logs. Is there anything in the new housecleaning logs about whether the stubs for this job’s machine limit were returned? The limit name would match the job ID in this case.

These don’t seem to have been orphaned; there were actually multiple machines processing at least some of these jobs.

For example on 541c5ddecf715914c8afa0bc:

lapro821 picked it up @ 09:54:19 for frame 1000, then 1001, then 1002, then 1003

lapro927 picked it up @ 10:08:30 for frame 1004:
2014-09-19 10:11:09: 0: INFO: Lightning: Render frame 1004

2014-09-19 10:40:44: 0: Render time for frame(s): 32.172 m

lapro462 picked it up @ 10:19:56 for frame 1005 (shouldn’t have been able to, no?)

Seems like 821 crashed (timed out) sometime between 10:00:00 and 10:40:04:

2014-09-19 10:00:00:  0: INFO: Lightning: Render frame 1003
2014-09-19 10:40:04:  0: Task timed out -- canceling current task...
2014-09-19 10:40:05:  0: Unloading plugin: 3dsmax
2014-09-19 10:40:11:  Scheduler Thread - Render Thread 0 threw a major error: 
2014-09-19 10:40:11:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2014-09-19 10:40:11:  Exception Details
2014-09-19 10:40:11:  JobTimeoutException -- The Slave did not complete the task before the Regular Task Timeout limit of 00d 00h 40m 00s. The Task Timeout settings can be changed for this job by right-clicking on it in the Monitor and selecting "Modify Properties...".
2014-09-19 10:40:11:  RenderPluginException.Cause: JobError (2)
2014-09-19 10:40:11:  RenderPluginException.Level: Major (1)
2014-09-19 10:40:11:  RenderPluginException.HasSlaveLog: True
2014-09-19 10:40:11:  Exception.Data: ( )
2014-09-19 10:40:11:    Exception.StackTrace: 
2014-09-19 10:40:11:      (null)
2014-09-19 10:40:11:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

(These are all from the logs I uploaded, if you need more details.)
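In case it’s useful, this is roughly how I lined up the timeline (quick script; the log line format is guessed from the excerpts above, and it assumes one log file per slave, named after the slave):

# Quick-and-dirty timeline of "Render frame" lines across slave logs,
# to spot overlapping renders on a limit-1 job.
# Usage: python timeline.py <dir-of-slave-logs>
import re
import sys
from pathlib import Path

PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):.*Render frame (\d+)"
)

events = []
for log in Path(sys.argv[1]).glob("*.log"):
    slave = log.stem  # assumes the file is named after the slave
    for line in log.read_text(errors="ignore").splitlines():
        m = PATTERN.match(line)
        if m:
            events.append((m.group(1), slave, int(m.group(2))))

# Sort chronologically; two slaves interleaving here means the machine
# limit of 1 was not being respected at that point.
for timestamp, slave, frame in sorted(events):
    print(f"{timestamp}  {slave}  frame {frame}")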

Not the tasks, but the limit stubs themselves. Anything in the housecleaning logs about orphaned stubs getting returned?

Attached is the housecleaning log for this morning so far

I searched it for that job ID, but got no hits.
housecleaning-deadline02-2014-09-19.zip (231 KB)

Pulse seems to have been doing other tasks (archiving jobs, deleting jobs, etc.) when the situation I mentioned occurred.

The orphaned limit stub scans around that time all seem to have returned 0, though.

Thanks for confirming all of this. We’ll look into this and see if we can figure out why this is happening.

Cheers,
Ryan

Just thought of something else to check. The next time this happens, can you export the job’s machine limit object from the LimitGroups collection (its ID will match the job’s ID)? I’m curious to see whether the “Stubs” list matches the slaves that are currently rendering the job, and whether the “StubCount” value matches the number of entries in the “Stubs” list.
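Something along these lines would dump it (pymongo sketch; I’m assuming the database name here, and that the job ID is stored as a plain string):

# Sketch: fetch the job's machine limit document and sanity-check that
# StubCount agrees with the Stubs list. "deadlinedb" is an assumed
# database name; adjust for your repository settings.
import json
from pymongo import MongoClient

JOB_ID = "541c5ddecf715914c8afa0bc"  # e.g. the job from the timeline above

limits = MongoClient("mongodb://localhost:27017")["deadlinedb"]["LimitGroups"]
doc = limits.find_one({"_id": JOB_ID})

if doc is None:
    print("no limit document found for this job")
else:
    print(json.dumps(doc, default=str, indent=2))
    stubs = doc.get("Stubs", [])
    print("StubCount:", doc.get("StubCount"), "- entries in Stubs:", len(stubs))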

Thanks!

I’ll try… People have gotten into the habit of fixing their jobs when they see this (they need to get the shots out), so by the time I get to them, the jobs are usually requeued.
