
Farm behaving oddly

We are seeing the same behavior we had a while back, where a rogue machine running an older version keeps randomly requeuing tasks. It started after a power failure 24 hours ago.

Not a single task finishes; they all get interrupted by something like this in their logs:

2015-02-19 10:36:47: 0: INFO: Preparing global light manager…
2015-02-19 10:37:25: Scheduler Thread - Task “2_1018-1027” could not be found because task has been modified:
2015-02-19 10:37:25: current status = Rendering, new status = Rendering
2015-02-19 10:37:25: current slave = LAPRO0723, new slave = LAPRO0705
2015-02-19 10:37:25: current frames = 1018-1027, new frames = 1018-1027
2015-02-19 10:37:25: Scheduler Thread - Cancelling task…

We can’t track down where this is coming from. Who triggers the cancellation, and why, is not logged anywhere. We have no “old version” slaves active; every enabled slave is on the latest version and connected to Pulse.
Pulse is doing regular housecleaning, nothing out of the ordinary there. Since the rogue slave doesn’t show up anywhere, we have no way of tracking who is doing this housecleaning. Could housecleaning operations triggered on slaves be logged in the repository history? We basically have no way right now to figure out which machine is doing this :-\
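
The best we’ve come up with so far is sweeping whatever slave logs we can reach for that “task has been modified” message, just to get a timeline of the requeues and see which slaves keep picking the tasks back up. It doesn’t point at the culprit, but it’s something. Rough sketch below (the log root is a placeholder for wherever your slave logs actually live):

import os
import re
from collections import Counter

# Placeholder path -- point this at wherever the slave logs are collected.
LOG_ROOT = r"\\fileserver\deadline\slave_logs"

# The "current slave / new slave" line that follows the cancellation message.
NEW_SLAVE_RE = re.compile(r"current slave = (\S+), new slave = (\S+)")

timeline = []            # (timestamp, log file) for each requeue we find
new_slaves = Counter()   # which slaves keep picking the requeued tasks up

for dirpath, _, filenames in os.walk(LOG_ROOT):
    for name in filenames:
        if not name.lower().endswith(".log"):
            continue
        path = os.path.join(dirpath, name)
        with open(path, "r", errors="ignore") as handle:
            for line in handle:
                if "could not be found because task has been modified" in line:
                    timeline.append((line[:19], path))
                match = NEW_SLAVE_RE.search(line)
                if match:
                    new_slaves[match.group(2)] += 1

for stamp, path in sorted(timeline):
    print(stamp, path)
print(new_slaves.most_common(10))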

We can log when a machine performs housecleaning in the repository history going forward, but unfortunately that’s not going to help this situation now.

In Deadline 7, disabled slaves can still be running; they just can’t pick up any jobs. We should do a better job of showing the actual state of a disabled slave, and prevent disabled slaves from doing housecleaning in the future.

So to figure out which rogue slaves are responsible, the best place to start would be the current set of disabled slaves in your slave list that have an older version installed.
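
If you have a lot of disabled slaves, one way to pull that list together quickly is to query the slave info collection in the repository database directly instead of clicking through the Monitor. Very rough sketch below; the connection string, database, collection, and field names are from memory, so check them against an actual document in your database before relying on it:

from pymongo import MongoClient

# Connection string and database name are assumptions -- use whatever your
# repository's database settings actually point at.
client = MongoClient("mongodb://deadline-db:27017")
db = client["deadlinedb"]

EXPECTED_VERSION = "7.0"  # whatever the farm is supposed to be running

# Collection and field names are assumptions -- inspect one document from
# the slave info collection and adjust the keys to match.
for slave in db["SlaveInfo"].find():
    name = slave.get("Name", "?")
    version = str(slave.get("Ver", ""))
    enabled = slave.get("Enable", True)
    if not enabled and not version.startswith(EXPECTED_VERSION):
        print(name, version)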

Cheers,
Ryan

Hey Ryan,

I think long term, those fixes will come in handy (adding a history entry, not doing housecleaning/etc. in a disabled slave). For now, we will have to bite the bullet, go through 2000 disabled slaves, and do a version check / force update… :-\

I’ll report back if things normalize.

cheers
l

Yeah, that’s definitely a pain, and it’s really unfortunate that this issue keeps biting you guys. On a somewhat positive note, you probably would have had to go through this process anyway as you migrate more machines over to 7.

Just a note that we are reconsidering adding a history entry for housecleaning. These entries could get pretty spammy if they are logged every minute, and keep in mind that we would have to do this for housecleaning, the pending job scan, and repository repair.

The other stuff will be in beta 3 though.

Cheers,
Ryan

Hey Ryan,

To clarify, I would not log this for processes running on Pulse, only for ones that are triggered on slaves. Our repository history has multiple entries every second anyway, so I wouldn’t mind this showing up there.

cheers
laszlo

Okay, sounds good. We’ll just do it for slaves.

Thanks!
Ryan
