All slaves doing housecleaning every couple of seconds

LaszloSebo · November 21, 2014, 1:17am

Not sure what’s going on… We have Pulse running, and it’s doing its job. But I noticed that every single slave is also running deadlinecommand to do housecleaning, repo repair and pending job scans every couple of seconds (these are cropped from the task manager):

LaszloSebo · November 21, 2014, 1:24am

This seems to be new behavior. We skipped the last 2 betas and went from 41/44 to 47, so not sure when it was introduced. For now we will roll back.

rrussell · November 21, 2014, 1:56pm

Starting with RC1, you need to configure your Pulse to be the primary, which can be done by right-clicking on it in the Pulse list in the Monitor and selecting Modify Pulse Settings. Pulse now uses the same primary/standby system that the Balancer uses, which allows you to have redundant Pulse machines on your farm. If you enable the setting in the Repository Repair settings in the Repository Options, Deadline can elect a new primary Pulse if the current primary is no longer running.

Also, in your case, the slaves aren’t actually doing full housecleanings every few seconds. The slave just launches the command line, and it’s up to the command line to determine if it should actually do the housecleaning checks. If you check the slave logs themselves, the housecleaning operations should be saying things like

Skipping house cleaning because it is not required at this time

or

Cheers,
Ryan
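
One way to confirm this across many slave logs is to count how often the skip message appears compared with other housecleaning lines. Below is a minimal Python sketch, assuming the log contains the message quoted above verbatim. The helper name `summarize_housecleaning` and the loose "house cleaning" substring match for any other line are assumptions for illustration, not Deadline's documented log format.

```python
from collections import Counter

# Message text taken from the post above.
SKIP_MESSAGE = "Skipping house cleaning because it is not required at this time"


def summarize_housecleaning(lines):
    """Count skipped vs. other house cleaning log lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        if SKIP_MESSAGE in line:
            counts["skipped"] += 1
        elif "house cleaning" in line.lower():
            # Any other mention of house cleaning; assumed to deserve a closer look.
            counts["other"] += 1
    return counts


# Placeholder input for illustration; the second entry is not a real Deadline message.
sample = [
    SKIP_MESSAGE,
    "<placeholder: some other house cleaning message>",
    SKIP_MESSAGE,
]
print(summarize_housecleaning(sample))
```

In practice the lines would come from reading a slave log file, and a high "skipped" count relative to "other" would match the behavior described above.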

LaszloSebo · November 21, 2014, 5:27pm

Thanks for the pointer about the pulse setting. Missed that in the release notes. Could this be tweaked so that the first pulse made gets this set automatically? Minor thing, but probably would behave more according to expectation.

I checked the slave logs, and while the majority are ‘no housecleaning to be done at this time’ style logs, every now and then it seems to actually be doing something.
Also, every slave seems to do all these tasks every 10 or so seconds. That seems like a waste of resources; I imagine they do some sort of db connection/network lookup to determine whether they should be doing anything. With 2000+ machines, that adds up quickly.
If I set the main Pulse to be primary, will that behavior stop?

Sidenote: we try to minimize running deadlinecommand as much as possible, because it conflicts with auto-updating (the command executable holds on to DLLs).

rrussell · November 21, 2014, 6:38pm

When you launch Pulse, if a primary isn’t set, a timed popup will be displayed asking if you want to autoconfigure it to be the primary. By default though, it won’t do anything if the popup timer reaches 0 (so if you were upgrading Pulse remotely, you probably wouldn’t have seen the message).

That would make sense, since the housecleaning still needs to get done if Pulse isn’t running.

The slaves do the check in between job searches, so if everything is idle, these checks will happen more often. The check is quite small though (probably smaller than the db heartbeat), so its impact is negligible.

Yes. The slaves only do the housecleaning checks if they can’t connect to pulse. Once it’s been set to the primary, the slaves should be able to connect to it again.
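
The behavior described here can be sketched roughly as follows. This is only an illustration of the logic as explained in this thread, not Deadline's actual implementation; `slave_loop_step` and its callback parameters are hypothetical names.

```python
def slave_loop_step(connected_to_pulse, find_job, run_job, housecleaning_check):
    """One pass of a hypothetical slave loop.

    The slave searches for a job, runs it if one is found, and then, only
    when it cannot reach Pulse, launches the housecleaning check.
    Returns True if the check was launched on this pass.
    """
    job = find_job()
    if job is not None:
        run_job(job)
    if not connected_to_pulse:
        # Fallback path: Pulse is unreachable, so the slave checks for itself.
        housecleaning_check()
        return True
    return False


# Usage: an idle slave (no job found) that cannot reach Pulse.
calls = []
ran = slave_loop_step(
    connected_to_pulse=False,
    find_job=lambda: None,
    run_job=lambda job: calls.append(("job", job)),
    housecleaning_check=lambda: calls.append("check"),
)
print(ran, calls)
```

Because an idle slave finishes each pass quickly, the check is launched more often on idle machines, which matches the "every 10 or so seconds" observation.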

Cheers,
Ryan

LaszloSebo · November 21, 2014, 7:15pm

That was definitely the case.

Odd, because pulse has been running continuously, non-stop. I’ve noticed about 1/5th of our slaves are not connected to pulse. Any clues why that could be?

rrussell · November 21, 2014, 8:16pm

If pulse isn’t the primary, it won’t do things like housecleaning or pending job scans. This is so that if you have multiple pulses running, they’re not all trying to do the housecleaning operations.

If you check the logs for the slaves that can’t connect to Pulse, there should be an error message saying why.

Cheers,
Ryan

LaszloSebo · November 21, 2014, 8:46pm

Thanks for the clarification, makes sense!

Not sure what’s going on with the ‘connected to pulse’ state. On Deadline 6, we have 4-5 machines total that aren’t connected to Pulse. On Deadline 7, we have 200-400… Any ideas what could cause the problem?

rrussell · November 21, 2014, 9:22pm

Have you checked the slave logs to see if they’re printing out an error when trying to connect to pulse?

LaszloSebo · November 21, 2014, 10:25pm

Yes, there is nothing in there. Attached are 2 such logs, and a screenshot that suggests that at startup, they could connect to pulse:

deadlineslave-LAPRO1344-2014-11-21-0002.log (40.2 KB)
deadlineslave-LAPRO1343-2014-11-21-0000.log (222 KB)

LaszloSebo · November 21, 2014, 10:26pm

I’m noticing that the vast majority (99%) of the machines that report being unable to connect are not on the latest build. Only 1 machine on .47 is unable to connect; the rest are all .44 or earlier. Could that be related?

rrussell · November 24, 2014, 2:31pm

Yup, that would definitely be the reason. The way the slaves determine which port Pulse is listening on has changed, so the old slaves won’t know which port the new Pulse is listening on.
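
To spot this kind of version mismatch quickly, the slave list can be grouped by build and connection state. Below is a minimal Python sketch, assuming the slave data has already been exported into dictionaries. The field names `build` and `connected_to_pulse`, the helper name, and the sample records are all hypothetical and are not taken from Deadline's API or from this farm.

```python
from collections import defaultdict


def disconnected_by_build(slaves):
    """Return {build: (disconnected_count, total_count)} (hypothetical helper)."""
    totals = defaultdict(lambda: [0, 0])
    for slave in slaves:
        entry = totals[slave["build"]]
        entry[1] += 1
        if not slave["connected_to_pulse"]:
            entry[0] += 1
    return {build: tuple(counts) for build, counts in totals.items()}


# Made-up records for illustration only.
sample_slaves = [
    {"name": "slave-a", "build": 44, "connected_to_pulse": False},
    {"name": "slave-b", "build": 44, "connected_to_pulse": False},
    {"name": "slave-c", "build": 47, "connected_to_pulse": True},
    {"name": "slave-d", "build": 47, "connected_to_pulse": True},
]
print(disconnected_by_build(sample_slaves))
```

If nearly all disconnected machines fall under the older builds, upgrading those slaves would be the first thing to try.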