AWS Thinkbox Discussion Forums

limit groups acting weird, slaves are idle

Hi there,

We have idle slaves even when jobs are not using their max machine counts… it's a bit weird.

A job has a max machine limit of 15, and has 11 rendering tasks. Yet we have idle slaves, claiming the job has a full limit group on it…

Scheduler - The 51f9988453a91f13f85cf3fe limit is maxed out.

Weird?

Hmm, sounds like some limit “stubs” weren’t cleaned out properly. From the Monitor, enter super user mode, then select Tools -> Perform House Cleaning. You can check the Console panel in the Monitor to see if anything is getting cleaned up for this limit. If that doesn’t help, let us know!
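
As a side note, if you'd rather script that than click through the Monitor, deadlinecommand should be able to kick off the same operation (this is an assumption about your build; check "deadlinecommand -Help" if the DoHouseCleaning option isn't recognized):

deadlinecommand -DoHouseCleaning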

Cheers,

  • Ryan

I'll try the house cleaning if I see this happen again…

Still, it's odd and a bit scary to see this behavior with only 15-30 machines on the farm right now, and maybe only 50-100 jobs a day. The problem is that I can only detect this because there are so few jobs and machines; once we start using it on the hundreds of render nodes and the more usual 500-600+ jobs a day, I'm afraid these issues will go undetected, or worse, fall into the "voodoo stuff is happening" category.

Complete side note: opening the Console, I get a bunch of error messages (attached file).
console_output.txt (16.2 KB)

I was able to reproduce the issue, but the house cleaning did not fix it.

I have a job with 9 machines rendering it, with a machine limit of 15, and 4 idle slaves reporting that the machine limit is reached:

2013-08-01 09:51:04: Scheduler Thread - Seconds before next job scan: 28
2013-08-01 09:51:39: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 09:51:39: Scheduler - Job chooser found no jobs.
2013-08-01 09:51:39: Scheduler Thread - Seconds before next job scan: 26

51f994f053a91f1278c7c7d4 is the job ID for the selected job in the screenshot.

After about 5 minutes, the job was picked up by this particular slave, but not by the others:

2013-08-01 09:54:57: Scheduler - Job chooser found no jobs.
2013-08-01 09:54:57: Scheduler Thread - Seconds before next job scan: 21
2013-08-01 09:55:23: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 09:55:23: Scheduler - Job chooser found no jobs.
2013-08-01 09:55:24: Scheduler Thread - Seconds before next job scan: 25
2013-08-01 09:55:55: Scheduler - Successfully dequeued 1 task(s). Returning.
2013-08-01 09:55:55: Scheduler - Returning limit stubs not in use.

When I did the manual house cleaning, it did not seem to clean up any limits:

2013-08-01 09:51:07: Purging repository temp files
2013-08-01 09:51:07: Purging limits
2013-08-01 09:51:07: Purging old job and slave reports
2013-08-01 09:51:07: Purging job reports for '51f69bf13e029f1bb881f605' because the job no longer exists
2013-08-01 09:51:07: Purging job reports for '51f69bf43e029f08101c2ea2' because the job no longer exists
2013-08-01 09:51:07: Purging old job auxiliary files
2013-08-01 09:51:07: Purging job auxiliary files '51f69bf13e029f1bb881f605' because the job no longer exists
2013-08-01 09:51:07: Purging job auxiliary files '51f69bf43e029f08101c2ea2' because the job no longer exists
2013-08-01 09:51:07: Purging old statistics
2013-08-01 09:51:07: Purging slave statistics that are older than Apr 03/13 09:53:11

Here's a different node still complaining about the same job:

2013-08-01 10:00:25: Scheduler - Job chooser found no jobs.
2013-08-01 10:00:25: Scheduler Thread - Seconds before next job scan: 28
2013-08-01 10:01:00: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 10:01:00: Scheduler - Job chooser found no jobs.
2013-08-01 10:01:01: Scheduler Thread - Seconds before next job scan: 26

The job is currently rendering on 12 machines, with a limit of 15.

Restarting the slave does not seem to make any difference:

2013-08-01 10:02:58: BEGIN - LAPRO0332\scanlinevfx
2013-08-01 10:02:58: Deadline Slave 6.0 [v6.0.0.51561 R]
2013-08-01 10:03:05: Auto Configuration: No auto configuration could be detected, using local configuration
2013-08-01 10:03:09: Info Thread - Created.
2013-08-01 10:03:11: Scheduler Thread - Slave initialization complete.
2013-08-01 10:03:11: Scheduler Thread - Performing house cleaning...
2013-08-01 10:03:11: Trying to connect using license server '****'...
2013-08-01 10:03:11: License obtained.
2013-08-01 10:03:12: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 10:03:12: Scheduler - Job chooser found no jobs.
2013-08-01 10:03:12: Scheduler Thread - Seconds before next job scan: 27

Hey Laszlo,

Thanks for testing that. So we'll need to get a look at that limit inside the database to see what's in there. To do that, open a command prompt (or terminal) on the server running the database, and then run the "mongo" program (not the "mongod" program) inside MongoDB's bin folder. This will open a shell prompt. The first thing to do is switch to your current database using its name. For example, if your database name is deadlinedb, then do this:

use deadlinedb

If you don’t know the name of the database, you can list them like this:

show dbs

Once you are “using” the correct database, you’ll want to query the specific limit. Here is an example:

db.LimitGroups.findOne( { "_id" : "51f994f053a91f1278c7c7d4" } )

This will print out that limit object, and you can post the results here. This should help us figure out what’s going on.
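
While you're in there, a query along these lines should also list every limit that currently holds stubs (the field names are the ones visible on the limit documents), which can help spot other limits in the same state:

db.LimitGroups.find( { "StubCount" : { "$gt" : 0 } }, { "Name" : 1, "StubCount" : 1, "Props.Limit" : 1 } )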

Thanks!

  • Ryan

Thanks, this is what I get:

db.LimitGroups.findOne( { "_id" : "51f994f053a91f1278c7c7d4" } )
{
    "LastWriteTime" : ISODate("2013-08-01T23:31:14.567Z"),
    "Name" : "51f994f053a91f1278c7c7d4",
    "Props" : {
        "Limit" : 15,
        "RelPer" : -1,
        "Slaves" : [ ],
        "White" : false,
        "SlavesEx" : [ ]
    },
    "StubCount" : 15,
    "StubLevel" : 0,
    "Stubs" : [
        { "Holder" : "lapro0331", "Time" : ISODate("2013-08-01T23:29:33.706Z") },
        { "Holder" : "lapro1307", "Time" : ISODate("2013-08-01T23:29:35.595Z") },
        { "Holder" : "lapro0329", "Time" : ISODate("2013-08-01T23:29:35.881Z") },
        { "Holder" : "lapro0330", "Time" : ISODate("2013-08-01T23:29:42.062Z") },
        { "Holder" : "lapro0332", "Time" : ISODate("2013-08-01T23:29:42.231Z") },
        { "Holder" : "lapro1305", "Time" : ISODate("2013-08-01T23:29:48.260Z") },
        { "Holder" : "lapro0328", "Time" : ISODate("2013-08-01T23:29:52.419Z") },
        { "Holder" : "lapro1306", "Time" : ISODate("2013-08-01T23:29:52.857Z") },
        { "Holder" : "lapro0327", "Time" : ISODate("2013-08-01T23:29:54.908Z") },
        { "Holder" : "lapro0315", "Time" : ISODate("2013-08-01T23:31:07.628Z") },
        { "Holder" : "lapro0318", "Time" : ISODate("2013-08-01T23:31:10.530Z") },
        { "Holder" : "lapro0377", "Time" : ISODate("2013-08-01T23:31:12.180Z") },
        { "Holder" : "lapro0323", "Time" : ISODate("2013-08-01T23:31:12.350Z") },
        { "Holder" : "lapro0322", "Time" : ISODate("2013-08-01T23:31:12.618Z") },
        { "Holder" : "lapro0313", "Time" : ISODate("2013-08-01T23:31:14.567Z") }
    ],
    "Type" : 1,
    "_id" : "51f994f053a91f1278c7c7d4"
}

The weird thing is, 315, 318, 323, 322 and 313 are NOT rendering the job! They can't even render it; they are not on the win7 limit group whitelist that this job has.

In fact, looking at 318's slave log, it has this message about the job:

2013-08-01 16:27:25: Scheduler - Job chooser found no jobs.
2013-08-01 16:27:25: Scheduler Thread - Seconds before next job scan: 29
2013-08-01 16:28:02: Scheduler - Slave is not whitelisted for win7 limit.
2013-08-01 16:28:02: Scheduler - Job chooser found no jobs.
2013-08-01 16:28:02: Scheduler Thread - Seconds before next job scan: 23
2013-08-01 16:28:31: Scheduler - The 51f994f053a91f1278c7c7d4 limit is maxed out.
2013-08-01 16:28:31: Scheduler - Job chooser found no jobs.

Even after doing a house cleaning, I get the same printout for the stubs.
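
For what it's worth, here's a one-liner (using the same document layout as the printout above) that prints just the stub holders, which makes them easier to diff against the slaves that are actually rendering:

db.LimitGroups.findOne( { "_id" : "51f994f053a91f1278c7c7d4" } ).Stubs.forEach( function( stub ) { print( stub.Holder ); } )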

Hey Laszlo,

Thanks for sending the printout from that command. This helped us track down 2 bugs!

  1. The slave wasn’t returning the limit stub(s) that it had already obtained if it hit a limit that it was blacklisted (or not whitelisted) for. This is how slaves that weren’t on the win7 whitelist were getting stubs for the job.

  2. When performing housecleaning from the Monitor, it was only checking non-job limits, so the 51f994f053a91f1278c7c7d4 limit was never being checked. Eventually, Pulse (if you’re running it) or the slaves should have cleaned this up, but because of the first bug above, the limit would just get filled up again.

Both of these issues will be fixed in 6.1 beta 2. Thanks again for helping us track this down!
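
In the meantime, if a job gets wedged like this again, a stale stub can in theory be returned by hand from the mongo shell. This is just a sketch based on the document layout above (using lapro0318 from your printout as the example holder), not an officially supported operation, so only try it while the slave in question is idle, and take a backup first:

db.LimitGroups.update(
    { "_id" : "51f994f053a91f1278c7c7d4" },
    { "$pull" : { "Stubs" : { "Holder" : "lapro0318" } }, "$inc" : { "StubCount" : -1 } }
)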

Cheers,

  • Ryan

Glorious, thanks Ryan!
