AWS Thinkbox Discussion Forums

limit groups acting weird, slaves are idle

Hi there,

We have idle slaves even when jobs are not using their max machine counts… it's a bit weird.

A job has a max machine limit of 15, and has 11 rendering tasks. Yet we have idle slaves, claiming the job has a full limit group on it…

Scheduler - The 51f9988453a91f13f85cf3fe limit is maxed out.

Weird?

Hmm, sounds like some limit “stubs” weren’t cleaned out properly. From the Monitor, enter super user mode, then select Tools -> Perform House Cleaning. You can check the Console panel in the Monitor to see if anything is getting cleaned up for this limit. If that doesn’t help, let us know!
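
As a side note, if you'd rather script that than click through the Monitor, deadlinecommand should be able to kick off the same operation (this is an assumption about your build; check "deadlinecommand -Help" if the DoHouseCleaning option isn't recognized):

deadlinecommand -DoHouseCleaning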

Cheers,

  • Ryan

I'll try the house cleaning if I see this happen again…

Still, it's odd and a bit scary to see this behavior with only 15-30 machines on the farm right now, and maybe only 50-100 jobs a day. The problem is that I can only detect this because there are so few jobs and machines; once we start using it on the hundreds of render nodes and the more usual 500-600+ jobs a day, I'm afraid these issues will go undetected, or worse, fall into the "voodoo stuff is happening" category.

Complete side note: opening the Console, I get a bunch of error messages (attached file).
console_output.txt (16.2 KB)

I was able to reproduce the issue, but the house cleaning did not fix it.

I have a job with 9 machines rendering it, with a machine limit of 15, and 4 idle slaves reporting that the machine limit is reached:

2013-08-01 09:51:04: Scheduler Thread - Seconds before next job scan: 28
2013-08-01 09:51:39: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 09:51:39: Scheduler - Job chooser found no jobs.
2013-08-01 09:51:39: Scheduler Thread - Seconds before next job scan: 26

51f994f053a91f1278c7c7d4 is the job ID for the selected job in the screenshot.

After about 5 minutes, the job was picked up by this particular slave, but not by the others:

2013-08-01 09:54:57: Scheduler - Job chooser found no jobs.
2013-08-01 09:54:57: Scheduler Thread - Seconds before next job scan: 21
2013-08-01 09:55:23: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 09:55:23: Scheduler - Job chooser found no jobs.
2013-08-01 09:55:24: Scheduler Thread - Seconds before next job scan: 25
2013-08-01 09:55:55: Scheduler - Successfully dequeued 1 task(s). Returning.
2013-08-01 09:55:55: Scheduler - Returning limit stubs not in use.

When I did the manual house cleaning, it did not seem to clean up any limits:

2013-08-01 09:51:07: Purging repository temp files
2013-08-01 09:51:07: Purging limits
2013-08-01 09:51:07: Purging old job and slave reports
2013-08-01 09:51:07: Purging job reports for '51f69bf13e029f1bb881f605' because the job no longer exists
2013-08-01 09:51:07: Purging job reports for '51f69bf43e029f08101c2ea2' because the job no longer exists
2013-08-01 09:51:07: Purging old job auxiliary files
2013-08-01 09:51:07: Purging job auxiliary files '51f69bf13e029f1bb881f605' because the job no longer exists
2013-08-01 09:51:07: Purging job auxiliary files '51f69bf43e029f08101c2ea2' because the job no longer exists
2013-08-01 09:51:07: Purging old statistics
2013-08-01 09:51:07: Purging slave statistics that are older than Apr 03/13 09:53:11

Here's a different node still complaining about the same job:

2013-08-01 10:00:25: Scheduler - Job chooser found no jobs.
2013-08-01 10:00:25: Scheduler Thread - Seconds before next job scan: 28
2013-08-01 10:01:00: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 10:01:00: Scheduler - Job chooser found no jobs.
2013-08-01 10:01:01: Scheduler Thread - Seconds before next job scan: 26

The job is currently rendering on 12 machines, with a limit of 15.

Restarting the slave does not seem to make any difference:

2013-08-01 10:02:58: BEGIN - LAPRO0332\scanlinevfx
2013-08-01 10:02:58: Deadline Slave 6.0 [v6.0.0.51561 R]
2013-08-01 10:03:05: Auto Configuration: No auto configuration could be detected, using local configuration
2013-08-01 10:03:09: Info Thread - Created.
2013-08-01 10:03:11: Scheduler Thread - Slave initialization complete.
2013-08-01 10:03:11: Scheduler Thread - Performing house cleaning...
2013-08-01 10:03:11: Trying to connect using license server '****'...
2013-08-01 10:03:11: License obtained.
2013-08-01 10:03:12: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 10:03:12: Scheduler - Job chooser found no jobs.
2013-08-01 10:03:12: Scheduler Thread - Seconds before next job scan: 27

Hey Laszlo,

Thanks for testing that. So we'll need to get a look at that limit inside the database to see what's in there. To do that, open a command prompt (or terminal) on the server running the database, and then run the "mongo" program (not the "mongod" program) inside MongoDB's bin folder. This will open a shell prompt. The first thing to do is switch to your current database using its name. For example, if your database name is deadlinedb, then do this:

use deadlinedb

If you don’t know the name of the database, you can list them like this:

show dbs

Once you are “using” the correct database, you’ll want to query the specific limit. Here is an example:

db.LimitGroups.findOne( { "_id" : "51f994f053a91f1278c7c7d4" } )

This will print out that limit object, and you can post the results here. This should help us figure out what’s going on.
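
While you're in there, a query along these lines should also list every limit that currently holds stubs (the field names are the ones visible on the limit documents), which can help spot other limits in the same state:

db.LimitGroups.find( { "StubCount" : { "$gt" : 0 } }, { "Name" : 1, "StubCount" : 1, "Props.Limit" : 1 } )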

Thanks!

  • Ryan

Thanks, this is what I get:

db.LimitGroups.findOne( { "_id" : "51f994f053a91f1278c7c7d4" } )
{
    "LastWriteTime" : ISODate("2013-08-01T23:31:14.567Z"),
    "Name" : "51f994f053a91f1278c7c7d4",
    "Props" : {
        "Limit" : 15,
        "RelPer" : -1,
        "Slaves" : [ ],
        "White" : false,
        "SlavesEx" : [ ]
    },
    "StubCount" : 15,
    "StubLevel" : 0,
    "Stubs" : [
        { "Holder" : "lapro0331", "Time" : ISODate("2013-08-01T23:29:33.706Z") },
        { "Holder" : "lapro1307", "Time" : ISODate("2013-08-01T23:29:35.595Z") },
        { "Holder" : "lapro0329", "Time" : ISODate("2013-08-01T23:29:35.881Z") },
        { "Holder" : "lapro0330", "Time" : ISODate("2013-08-01T23:29:42.062Z") },
        { "Holder" : "lapro0332", "Time" : ISODate("2013-08-01T23:29:42.231Z") },
        { "Holder" : "lapro1305", "Time" : ISODate("2013-08-01T23:29:48.260Z") },
        { "Holder" : "lapro0328", "Time" : ISODate("2013-08-01T23:29:52.419Z") },
        { "Holder" : "lapro1306", "Time" : ISODate("2013-08-01T23:29:52.857Z") },
        { "Holder" : "lapro0327", "Time" : ISODate("2013-08-01T23:29:54.908Z") },
        { "Holder" : "lapro0315", "Time" : ISODate("2013-08-01T23:31:07.628Z") },
        { "Holder" : "lapro0318", "Time" : ISODate("2013-08-01T23:31:10.530Z") },
        { "Holder" : "lapro0377", "Time" : ISODate("2013-08-01T23:31:12.180Z") },
        { "Holder" : "lapro0323", "Time" : ISODate("2013-08-01T23:31:12.350Z") },
        { "Holder" : "lapro0322", "Time" : ISODate("2013-08-01T23:31:12.618Z") },
        { "Holder" : "lapro0313", "Time" : ISODate("2013-08-01T23:31:14.567Z") }
    ],
    "Type" : 1,
    "_id" : "51f994f053a91f1278c7c7d4"
}

The weird thing is, 315, 318, 323, 322 and 313 are NOT rendering the job! They can't even render it; they are not on the win7 limit group whitelist that this job has.

In fact, looking at 318's slave log, it has this message about the job:

2013-08-01 16:27:25: Scheduler - Job chooser found no jobs.
2013-08-01 16:27:25: Scheduler Thread - Seconds before next job scan: 29
2013-08-01 16:28:02: Scheduler - Slave is not whitelisted for win7 limit.
2013-08-01 16:28:02: Scheduler - Job chooser found no jobs.
2013-08-01 16:28:02: Scheduler Thread - Seconds before next job scan: 23
2013-08-01 16:28:31: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 16:28:31: Scheduler - Job chooser found no jobs.

Even after doing a house cleaning, I get the same printout for the stubs.
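
For what it's worth, here's a one-liner (using the same document layout as the printout above) that prints just the stub holders, which makes them easier to diff against the slaves that are actually rendering:

db.LimitGroups.findOne( { "_id" : "51f994f053a91f1278c7c7d4" } ).Stubs.forEach( function( stub ) { print( stub.Holder ); } )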

Hey Laszlo,

Thanks for sending the printout from that command. This helped us track down 2 bugs!

  1. The slave wasn’t returning the limit stub(s) that it had already obtained if it hit a limit that it was blacklisted (or not whitelisted) for. This is how slaves that weren’t on the win7 whitelist were getting stubs for the job.

  2. When performing housecleaning from the Monitor, it was only checking non-job limits, so the 51f994f053a91f1278c7c7d4 limit was never being checked. Eventually, Pulse (if you’re running it) or the slaves should have cleaned this up, but because of the first bug above, the limit would just get filled up again.

Both of these issues will be fixed in 6.1 beta 2. Thanks again for helping us track this down!
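
In the meantime, if a job gets wedged like this again, a stale stub can in theory be returned by hand from the mongo shell. This is just a sketch based on the document layout above (using lapro0318 from your printout as the example holder), not an officially supported operation, so only try it while the slave in question is idle, and take a backup first:

db.LimitGroups.update(
    { "_id" : "51f994f053a91f1278c7c7d4" },
    { "$pull" : { "Stubs" : { "Holder" : "lapro0318" } }, "$inc" : { "StubCount" : -1 } }
)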

Cheers,

  • Ryan

Glorious, thanks Ryan!
