I have a job whose “Maximum slaves” value is set to 8, but it is being picked up by 10 slaves across 2 machines that each have 5 slaves running.
Hey Nathan, are you comfortable enough to poke at the database to see if the stub count is over? I’m curious whether it’s a problem on the stub creation side or the stub honouring side.
If not, can we have a call tomorrow?
Yes, I can check whatever you need in the DB.
So, the document we’d be interested in would be the limit group for that job.
Basic set of steps (note my non-default port):
./mongo --port=27070
> use deadline7db_Limits;
> db.LimitGroups.find({_id: "<job id goes here>"}).pretty();
Here’s an example for one of my jobs:
{
    "_id" : "5661c5d1e9faf10332f16a69",
    "LastWriteTime" : ISODate("2015-12-04T16:56:49.323Z"),
    "Props" : {
        "Limit" : 1,
        "RelPer" : -1,
        "Slaves" : [
            "mobile-029"
        ],
        "White" : true,
        "SlavesEx" : [ ]
    },
    "Name" : "5661c5d1e9faf10332f16a69",
    "Holds" : [ ],
    "Used" : 0,
    "StubLevel" : 0,
    "Type" : 1
}
I’m mainly interested in the Props section, but send the whole document anyway.
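If it’s easier to check in one shot, a query along these lines should flag any limit group that is holding more stubs than its limit allows (just a sketch against the fields in the example above, untested):

> // list limit groups where the holder count exceeds the configured limit
> db.LimitGroups.find({ $where: "this.Holds.length > this.Props.Limit" }).pretty();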
Took a few days before another job with similar properties hit the queue. Here’s the document queried while the job had 9 slaves running (it’s about what you would expect):
{
    "_id" : "5671e1192b64552a3bd4a602",
    "LastWriteTime" : ISODate("2015-12-17T00:02:51.125Z"),
    "Props" : {
        "Limit" : 8,
        "RelPer" : -1,
        "Slaves" : [ ],
        "White" : false,
        "SlavesEx" : [ ]
    },
    "Name" : "5671e1192b64552a3bd4a602",
    "Holds" : [
        "rd-206-04",
        "rd-206-02",
        "rd-206-03",
        "rd-205-02",
        "rd-205-04",
        "rd-206-01",
        "rd-206-05",
        "rd-205-01"
    ],
    "Used" : 8,
    "StubLevel" : 0,
    "Type" : 1
}
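For reference, a quick way to compare those counts straight from the shell (hypothetical snippet, using the _id above):

> var lg = db.LimitGroups.findOne({_id: "5671e1192b64552a3bd4a602"});
> print("holds=" + lg.Holds.length + ", used=" + lg.Used + ", limit=" + lg.Props.Limit);

That comes out to holds=8, used=8, limit=8, so as far as the database is concerned only 8 stubs are out and the 9th slave is rendering without one.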
Could this be related to the bug with Pulse returning valid limit stubs when it shouldn’t?
I’m actually not sure. At least the database structure is solid.
Any idea which machine shouldn’t have been in the list? I don’t remember if its log will say anything helpful for that job.
Jon thinks it’s a Repo repair problem. Can you check the Pulse logs for that? We’re looking for it removing limit stubs.
I remember we had problems in the past with the casing of machine names not being handled properly. Do any of the machine names contain capital letters? Not sure if that would show up in the Slave list or not.
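If it helps, something like this should show whether any of the stored holder or slave names contain capital letters (a hypothetical query against the same collection as above):

> db.LimitGroups.find({ $or: [ { Holds: /[A-Z]/ }, { "Props.Slaves": /[A-Z]/ } ] }).pretty();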
rd-205-05 was also rendering, so it’s not a case mismatch. All of these slaves are spread across 2 physical machines.
Yeah, there are a fair number of those in the repair logs, but there’s no contextual information (slave/machine/limit names, etc.)
Jon tells me Pulse should say something like the following (this is from the source code):
" Orphaned Limit Stub Scan - Returning orphaned limit stub ‘", limitGroup.Name, "’ for machine ", holderName, “: the slave(s) are no longer using this limit.”
or this:
" Orphaned Limit Stub Scan - Returning orphaned limit stub ‘", limitGroup.Name, "’ for machine ", currentStub.HolderName, “: the slave(s) no longer exist”
At least it should while in verbose mode.
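If you want to find those quickly, a plain grep over the Pulse log directory should turn them up (the directory placeholder here is just illustrative, adjust to your install):

grep -n "Orphaned Limit Stub Scan" <pulse log dir>/*.log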
Verbose logging isn’t normally on for Pulse, so I don’t have verbose logs for that time period. There are a lot of “orphaned” stubs being returned by the repair process in general though (more than I would expect if slaves were properly releasing them).
P.S. It would be nice to be able to control the verbosity of the different operations separately.