
Corrupt job / limitstub inconsistency

Seems like this old issue is somehow back in Deadline 10: a job that is queued, with no Slaves rendering it, has stuck limit stubs that are never freed up:

[Screenshot: Capture.PNG]

The job's LimitGroup in Mongo:

db.LimitGroups.findOne({"_id":"5a8b8dab5a034f39c008ba7a"})
{
    "_id" : "5a8b8dab5a034f39c008ba7a",
    "LastWriteTime" : ISODate("2018-02-20T12:34:14.946Z"),
    "Props" : {
        "Limit" : 1,
        "Overage" : 0,
        "RelPer" : -1,
        "Slaves" : [ ],
        "White" : false,
        "SlavesEx" : [ ],
        "ThreePLE" : "",
        "UseThreePLE" : false,
        "ThreePLPC" : false,
        "UnLmt" : false
    },
    "Name" : "5a8b8dab5a034f39c008ba7a",
    "Holds" : [ "lapro1870" ],
    "Used" : 1,
    "InOverage" : 0,
    "IsResLimit" : false,
    "StubLevel" : 0,
    "Type" : 1
}
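
For what it's worth, a quick way to check whether lapro1870 is stuck in the Holds array of any other limits as well would be something like this (just a sketch, relying only on the fields shown above):

db.LimitGroups.find({ "Holds": "lapro1870" }, { "Name": 1, "Holds": 1, "Used": 1, "Props.Limit": 1 }).pretty()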

The machine in question (lapro1870) is currently rendering a whole different job (5a8bd6bb92eb3135703c1793). It rendered the job in question yesterday though, when it errored on it:

2018-02-19 19:45:50: 2018/02/19 19:45:43 DBG: [19296] [45952] Error loading file C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline10/slave/lapro1870/jobsData/5a8b8dab5a034f39c008ba7a/PTA_000_0360_v0470_mso_largeMi.max
2018-02-19 19:45:50: Network log file does not exist: %TEMP%\VRayLog.txt
2018-02-19 19:45:50: at Deadline.Plugins.PluginWrapper.StartJob(String& outMessage, AbortLevel& abortLevel) (Deadline.Plugins.RenderPluginException)

But it never returned the stub… nor did the housecleaning/repo repair find this and fix the job…

According to the local slave logs, it returned the stub… but the db still has it:

2018-02-19 19:45:52: 2018/02/19 19:45:43 DBG: [19296] [45952] Error loading file C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline10/slave/lapro1870/jobsData/5a8b8dab5a034f39c008ba7a/PTA_000_0360_v0470_mso_largeMi.max
2018-02-19 19:45:52: Network log file does not exist: %TEMP%\VRayLog.txt
2018-02-19 19:45:52: at Deadline.Plugins.PluginWrapper.StartJob(String& outMessage, AbortLevel& abortLevel)
2018-02-19 19:45:52: RenderPluginException.Cause: JobError (2)
2018-02-19 19:45:52: RenderPluginException.Level: Major (1)
2018-02-19 19:45:52: RenderPluginException.HasSlaveLog: True
2018-02-19 19:45:52: RenderPluginException.SlaveLogFileName: C:\ProgramData\Thinkbox\Deadline10\logs\deadlineslave_renderthread_0-LAPRO1870-0000.log
2018-02-19 19:45:52: Exception.Data: ( )
2018-02-19 19:45:52: Exception.TargetSite: Deadline.Slaves.Messaging.PluginResponseMemento d(Deadline.Net.DeadlineMessage)
2018-02-19 19:45:52: Exception.Source: deadline
2018-02-19 19:45:52: Exception.HResult: -2146233088
2018-02-19 19:45:52: Exception.StackTrace:
2018-02-19 19:45:52: at Deadline.Plugins.SandboxedPlugin.d(DeadlineMessage aqi)
2018-02-19 19:45:52: at Deadline.Plugins.SandboxedPlugin.StartJob(Job job)
2018-02-19 19:45:52: at Deadline.Slaves.SlaveRenderThread.c(TaskLogWriter aha)
2018-02-19 19:45:52: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2018-02-19 19:45:53: Scheduler Thread - Seconds before next job scan: 1
2018-02-19 19:45:54: Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2018-02-19 19:45:54: Scheduler - Using enhanced scheduler balancing
2018-02-19 19:45:54: Scheduler - Moving current job to the front of the queue because Sequential Rendering is enforced.
2018-02-19 19:45:54: Scheduler - Slave has been marked bad for job 5a8b8dab5a034f39c008ba7a, skipping this job.
2018-02-19 19:45:54: Scheduler - Preliminary check: The 5a8a900c1dbf03021c98ae9f limit is maxed out.
2018-02-19 19:45:54: Scheduler - Preliminary check: The 5a8ac37feb6081499c6ea09b limit is maxed out.
2018-02-19 19:45:54: Scheduler - Preliminary check: The 5a8ae527eb6081afd0257786 limit is maxed out.
2018-02-19 19:45:54: Scheduler - Preliminary check: The 5a8ae530eb6081c614bc17af limit is maxed out.
2018-02-19 19:45:54: Scheduler - Preliminary check: The 5a8afbda34e0905d101fb012 limit is maxed out.
2018-02-19 19:45:55: Scheduler - Successfully dequeued 1 task(s). Returning.
2018-02-19 19:45:55: 0: Shutdown
2018-02-19 19:45:55: 0: Exited ThreadMain(), cleaning up…
2018-02-19 19:45:56: Scheduler - Returning limit stubs not in use.
2018-02-19 19:45:56: Scheduler - returning mx_fastmachines
2018-02-19 19:45:56: Scheduler - returning 5a8b8dab5a034f39c008ba7a
2018-02-19 19:45:56: 0: Got task!
2018-02-19 19:45:56: 0: Plugin will be reloaded because a new job has been loaded.
2018-02-19 19:45:56: 0: Loading Job’s Plugin timeout is Disabled
2018-02-19 19:45:58: 0: Loaded plugin 3dsmax

The slave’s SlaveInfo entry lists a different stub (the one for the job it’s rendering now):

db.SlaveInfo.findOne({"_id":"lapro1870"})
{
    "_id" : "lapro1870",
    "LastWriteTime" : ISODate("2018-02-20T13:41:02.001Z"),
    "UpTime" : 112128,
    "Host" : "LAPRO1870",
    "Arch" : "x64",
    "Ver" : "v10.0.11.1 Release (b0f2000be)",
    "AWSP" : false,
    "Name" : "LAPRO1870",
    "Config" : "",
    "Stat" : 1,
    "StatDate" : ISODate("2018-02-20T08:16:34.388Z"),
    "OnTskComp" : "Continue Running",
    "CompletedTasks" : 383,
    "TskFail" : 5,
    "Limits" : [
        "mx_render",
        "vray3",
        "mx_flowlinerender",
        "5a8bd6bb92eb3135703c1793"
    ],

cheers
laszlo

Thanks for all the detail; that should definitely be enough to get us started.

I’d be curious to see what’s kicking around in the LimitStubs collection for that Slave and/or Job as well:

db.LimitStubs.findOne({"Holder":"lapro1870"})
db.LimitStubs.findOne({"_id":"5a8b8dab5a034f39c008ba7a"})

Other than that, I think we should be able to track this down – the Slave clearly meant to return it; there might just be a step we’re skipping.

The slave seems to be holding the proper stubs for its current job:

deadlineRS0:PRIMARY> db.LimitStubs.find({"Holder":"lapro1870"})
{ "_id" : "5a8bd954daa77d3bd49b264a", "Holder" : "lapro1870", "Limit" : "mx_flowlinerender", "Ref" : 1, "Use" : 1, "Time" : ISODate("2018-02-20T08:16:20.865Z"), "LastWriteTime" : ISODate("2018-02-20T08:16:20.865Z") }
{ "_id" : "5a8c734fdaa77d3bd49b2731", "Holder" : "lapro1870", "Limit" : "rendernodes", "Ref" : 1, "Use" : 1, "Time" : ISODate("2018-02-20T19:13:20.203Z"), "LastWriteTime" : ISODate("2018-02-20T19:13:20.203Z") }
{ "_id" : "5a8c7356daa77d3bd49b2740", "Holder" : "lapro1870", "Limit" : "5a8c6f45fbc9467bf06b17b6", "Ref" : 1, "Use" : 1, "Time" : ISODate("2018-02-20T19:13:26.412Z"), "LastWriteTime" : ISODate("2018-02-20T19:13:26.412Z") }

The limitstubs for the job:

deadlineRS0:PRIMARY> db.LimitStubs.find({"Limit":"5a8b8dab5a034f39c008ba7a"})
{ "_id" : "5a8c6c0d322f0514e0a5cd00", "Holder" : "lapro1241", "Limit" : "5a8b8dab5a034f39c008ba7a", "Ref" : 1, "Use" : 1, "Time" : ISODate("2018-02-20T18:42:22.311Z"), "LastWriteTime" : ISODate("2018-02-20T18:42:22.311Z") }

(just one, from the machine currently actually rendering it - we increased the limit to 2 to get the job going). So it seems like 1870 removed itself from here properly.
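
For reference, a compact way to see that mismatch side by side would be something like the following (just a sketch, using only the fields visible in the documents above):

// the stuck job's ID, which is also the name of its per-job limit
var limitId = "5a8b8dab5a034f39c008ba7a";
var group = db.LimitGroups.findOne({ "_id": limitId });
// holders that actually have a stub document for this limit
var holders = db.LimitStubs.find({ "Limit": limitId }).map(function (s) { return s.Holder; });
print("LimitGroup says: Used=" + group.Used + ", Holds=" + tojson(group.Holds));
print("Stubs actually held by: " + tojson(holders));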

Awesome, good to know. Was there anything further down in the Slave log, out of curiosity? What was the next message prefixed with “Scheduler -”? Looking at this code, an error returning the stub can end up getting printed after the next Render Thread starts up, so it might be mixed in with that output.

The Pulse/Repository Repair logs around the time the Slave returned it would be good to see as well, actually (or at least do a Ctrl + F for that job ID).

Just wanted to follow up that we continued this thread via email with Jon and provided the logs requested.

Just wanted to mention that this issue is also affecting license/resource limit groups (using 10.0.10.4).

As in Laszlo’s case, the LimitStubs entries for the slaves seem to be correct, but the LimitGroups entry for the license is incorrect.

FYI, we found a couple of bugs that might have caused this – the fixes will be in the next release (should be next week), along with some extra tracing in case things go wrong in a different way that I did not address.

Great, thanks Jon.

With 10.0.12.1, we are seeing some oddities (offsetting by 0?):

2018-03-28 17:59:18: Orphaned Limit Stub Scan - Offsetting usage count for limit '5aba3142f58f3e1ddc071531' by 0 (0 in use, 0 overage) to match holder count.
2018-03-28 17:59:18: Orphaned Limit Stub Scan - Offsetting usage count for limit '5aba4b23efed491088960d12' by 0 (0 in use, 0 overage) to match holder count.
2018-03-28 17:59:18: Orphaned Limit Stub Scan - Offsetting usage count for limit '5aba58343cb474e5b0cf8c97' by 0 (0 in use, 0 overage) to match holder count.
2018-03-28 17:59:18: Orphaned Limit Stub Scan - Offsetting usage count for limit '5aba59b5efed490b94a633cd' by 0 (0 in use, 0 overage) to match holder count.
2018-03-28 17:59:18: Orphaned Limit Stub Scan - Offsetting usage count for limit '5aba5d6dcd9ada1604a2917b' by 0 (0 in use, 0 overage) to match holder count.

While there are jobs in a state where they have 1 queued task, limit=1, and nothing picking them up:

[Screenshot: Capture.PNG]

The IDs of these jobs don’t show up in the repository repair log, btw.

Checking the Mongo DB’s LimitStubs collection, the jobs that are hung like this seem to be missing their entries completely:

db.LimitStubs.findOne({"Limit":"5aaa48b508fc590cc8cb1136"})
null

db.LimitStubs.findOne({"Limit":"5abb681233bc5a1b64320513"})
null

Hmmm, and it doesn’t correct itself (or does that get printed at every Repo Repair iteration)? From what I can see, this message happens once the listed holders on the Limit itself get out of sync with the Stub objects in the DB (which you confirmed there are none of). Definitely a symptom rather than the cause, since it’s literally trying to offset the counts by ‘0’ (atomically, so no check-and-set race conditions there).

Are all the nodes on the farm on 10.0.12? Or are there still a couple on a slightly older version?

Can you grab the Limit objects for those in the DB as well? I’d love to have a look at the full repo repair log, and the Slave logs of the Slaves that last rendered those jobs, if possible (you can send those via our email thread).
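
Something along these lines should pull those up, and also sweep for any other limits whose Used count doesn’t line up with the stub documents actually in the DB (a rough sketch based only on the fields shown earlier in this thread):

// the two hung job IDs from the queries above
db.LimitGroups.find({ "_id": { "$in": [ "5aaa48b508fc590cc8cb1136", "5abb681233bc5a1b64320513" ] } }).pretty()

// sweep every limit and flag any whose Used count disagrees with the number of stubs;
// assumes one use per stub (Use : 1), which matches the stubs shown above
db.LimitGroups.find().forEach(function (g) {
    var stubs = db.LimitStubs.count({ "Limit": g._id });
    if (g.Used !== stubs) {
        print(g._id + " (" + g.Name + "): Used=" + g.Used + " but " + stubs + " stub(s) in LimitStubs");
    }
});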

I’m not sure what the current status of this issue is, as we’re still on 10.0.10.4 for the time being, but I wanted to mention another issue I’m seeing that may be related. Basically, jobs with one task and a machine limit of 1 will sometimes just sit there and never pick up, despite plenty of candidate slaves being available; as soon as I change the machine limit to 0, the job will pick up.

I haven’t really found a pattern yet, but I’ve seen it on both of our farms.

I know that it's still possible for the limit count to pull negative, but there isn't a way to visualize / reset the counts on a per-job basis. I'm not sure if you're game to go poking for this, but I'm curious what the per-job limit count for that job actually is. I can probably cook up the MongoDB JavaScript required, but the per-job limits live alongside the farm-wide ones and just have the job ID as their name.
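
Roughly, something like this ought to do it (a sketch that assumes the same schema as the documents pasted earlier in the thread; swap in the ID of the stuck job):

var jobId = "5a8b8dab5a034f39c008ba7a";   // replace with the stuck job's ID
// per-job limits sit in LimitGroups alongside the named ones, keyed by the job ID
var perJob = db.LimitGroups.findOne({ "Name": jobId });
if (perJob === null) {
    print("No per-job limit document for " + jobId);
} else {
    print("Limit=" + perJob.Props.Limit + ", Used=" + perJob.Used + ", InOverage=" + perJob.InOverage);
    print("Holds=" + tojson(perJob.Holds) + ", stubs in DB=" + db.LimitStubs.count({ "Limit": jobId }));
}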

Another easy test would be to bump the limit from 1 to 2 and see if it picks up on only one machine when that happens.

I believe setting the machine limit to 2 will allow one slave to pick up as well. We have this issue with some regularity, so I will try and remember to test this out and dump the limit info the next time I see one.

Thanks Nathan. Whatever we can use to try and reproduce or figure this out would be much appreciated.
