AWS Thinkbox Discussion Forums

deleting slaves -> pool/group assignments scrambled?

Jon, our render wrangler ran into this last week. We removed about 250 rental machines from the farm, and then deleted them from Deadline. I'm not sure about the exact order of events, but it seems like the pool / group assignments of the remaining slaves got scrambled after this.

When he is back, I'll ask him to provide more details.

Hmmm, yeah, I tried this a bunch of times and couldn’t reproduce, so any extra info would be great. Had the pools just been modified (or were they being modified) while the Slaves were being deleted? Did he do anything else to the Slaves before deleting them (i.e., marked them as offline or something)?

In the meantime, I’ll have another look at it to see if I can’t figure it out.

Cheers,

  • Jon

Hey guys, I just got back from a break so I don’t have logs or anything to produce. This actually happened twice within 3 days. I was notified that none of our Nuke jobs were picking up. In our studio, pretty much all Nuke jobs run on secondary Deadline slaves. I noticed quickly that on all of our secondary slaves, the pools and groups were blank, suddenly and unexpectedly. I also noticed that this happened on a smaller assortment of our primary slaves which are responsible for our 3d jobs like Maya and Max. In some other cases, some slaves had partially removed pools/groups, or had the pools removed without any change to the groups. In most, if not all cases, descriptions and comments were also removed and had to be re-done.

The first time that this happened, it was a few hours after we had removed a large group of rental machines from the repository. The machines that were removed were likely shut down hard-style, i.e. with the power button or something similar. This had been done before I had gotten the chance to issue a remote command to “remove slave instance” or anything like that. Since the slaves were already offline and weren’t going to be coming back, I spent some time simply deleting them from the repository. This took a little over an hour, since I could only really do about 10-20 machines at a time while I waited for the monitor to become responsive again. I kept deleting slaves until all of the removed machines had been deleted.

I quickly re-configured all of the pools, groups, descriptions, and comments as best I could for that day. A couple of days later, the same thing happened again, seemingly unprovoked. We had not removed any more machines since the first time this happened, but the same phenomenon occurred again, almost identical to the first round. I reconfigured everything again, and have not seen the problem since. It was a little over a week ago that we noticed this the first time.

That’s pretty much all of the info I have. During this time, I was scrambling to get jobs to work again and neglected to take logs and screenshots to share, so I apologize for that. I’m wondering if this is related to the manner in which we removed machines from the farm, or just coincidence?

Hmmm, yeah, short of the wrong Slaves getting deleted (either by accident, or because of a Deadline bug), I’m not sure what could cause this.

One thing that might help figure this out would be to check some History entries. The Repository History (‘Tool’ -> ‘View Repository History’, requires Super User permissions by default) should have a list of Slaves that were deleted. It would be good to make sure those match up with the Rental Slaves you deleted.

Assuming that all matches up as expected, can you post the ‘Slave History’ for a couple of the Slaves that had their Settings wiped? You can get to it by right-clicking a Slave and selecting ‘View Slave History’ (requires Super User permissions by default). It should hopefully give us an idea of what happened, or at the very least point to how this bug can be reproduced.

Thanks in advance!

  • Jon

I checked out our slave logs and noticed some strange things in there. It looks like at some point all of the super users decided to wipe all of the pools, but I can guarantee this wasn’t the case. I am pretty much solely responsible for changing the pools on these machines, and am 100% sure that these entries aren’t from a person consciously modifying anything. Please take a look at the log entries starting on the 20th. That is the Friday that we returned the rentals. I am fairly confident that all of the entries coming from lapro0500 and lapro2009 weren’t actually being done by a user, but somehow the machine went rogue and did the changes itself.

lapro2056 - My workstation, super user access
lapro0203 - One of the farm machines that I use for a 2nd Monitor, super user access
lapro0500 - Farm machine that was loaned to a pipeline TD, super user access
lapro2009 - IT workstation, super user access

2013/12/06 16:57:55 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, 2dshared, python]
2013/12/06 17:15:32 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Group list to: [nuke, python]
2013/12/06 17:26:04 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, 2dshared, python]
2013/12/09 12:41:16 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, 2dshared, python]
2013/12/12 22:33:55 stephan.trojansky LAPRO1044 (LAPRO1044\ScanlineVFX): Modified Pool list to: [all, 2dshared, python]
2013/12/13 14:47:38 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, 2dshared, python]
2013/12/13 16:01:48 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, 2dshared, python]
2013/12/16 14:09:29 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, 2dshared, python]
2013/12/17 16:43:10 viet.nguyen lapro2009.scanlinevfxla.com (lapro2009.scanlinevfxla.com\viet): Modified Pool list to: [all, 2dshared, python]
2013/12/19 18:14:42 robert.crowther VCPRO1007 (SCANLINEVFXLA\robert.crowther): Modified Pool list to: [all, 2dshared, python]
2013/12/20 13:40:43 viet.nguyen lapro2009.scanlinevfxla.com (lapro2009.scanlinevfxla.com\viet): Modified Pool list to: []
2013/12/20 14:20:06 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: []
2013/12/20 14:20:17 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:21:12 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]
2013/12/20 14:22:12 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:22:22 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]
2013/12/20 14:24:04 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]
2013/12/20 14:24:09 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:27:05 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: []
2013/12/20 14:27:11 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:29:02 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: []
2013/12/20 14:29:28 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:30:21 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: []
2013/12/20 14:31:34 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:32:02 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: []
2013/12/20 14:34:00 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:35:46 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []
2013/12/20 14:37:37 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: []
2013/12/20 14:37:43 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: [all, python, 2dshared]
2013/12/20 14:39:00 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]
2013/12/20 14:41:03 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, python, 2dshared]
2013/12/20 14:45:24 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Group list to: [nuke, python]
2013/12/20 15:30:02 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, python, 2dshared]
2013/12/20 18:03:24 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, python, 2dshared]
2013/12/20 18:22:08 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, python, 2dshared]
2013/12/20 18:39:39 robert.crowther VCPRO1007 (SCANLINEVFXLA\robert.crowther): Modified Pool list to: [all, python, 2dshared]
2013/12/23 11:53:56 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]
2013/12/23 11:55:19 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]
2013/12/23 11:57:16 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]
2013/12/23 12:12:15 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, python, 2dshared]
2013/12/23 16:35:01 jon.bird LAPRO2056 (LAPRO2056\ScanlineVfx_user): Modified Pool list to: [all, python, 2dshared]
2013/12/23 17:46:56 viet.nguyen lapro2009.scanlinevfxla.com (lapro2009.scanlinevfxla.com\viet): Modified Pool list to: [all, python, 2dshared]
2013/12/26 16:11:45 viet.nguyen lapro2009.scanlinevfxla.com (lapro2009.scanlinevfxla.com\viet): Modified Pool list to: [all, python, 2dshared]
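For anyone sifting through a longer Slave History export, here is a rough Python sketch that picks out the entries where a pool or group list was written as empty and counts them per machine. It assumes only the line format visible in the entries above; the function names and the `SAMPLE` list (five lines copied from the log) are made up for illustration, and this is not part of Deadline itself.

```python
import re
from collections import Counter

# Matches lines shaped like the history entries quoted above:
# "YYYY/MM/DD HH:MM:SS user MACHINE (MACHINE\account): Modified Pool list to: [a, b]"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<user>\S+) (?P<machine>\S+) \((?P<account>[^)]*)\): "
    r"Modified (?P<kind>Pool|Group) list to: \[(?P<items>[^\]]*)\]$"
)

# A few lines copied verbatim from the log above, used as sample input.
SAMPLE = [
    r"2013/12/19 18:14:42 robert.crowther VCPRO1007 (SCANLINEVFXLA\robert.crowther): Modified Pool list to: [all, 2dshared, python]",
    r"2013/12/20 13:40:43 viet.nguyen lapro2009.scanlinevfxla.com (lapro2009.scanlinevfxla.com\viet): Modified Pool list to: []",
    r"2013/12/20 14:20:06 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: []",
    r"2013/12/20 14:20:17 scanlinevfx LAPRO0500 (LAPRO0500\scanlinevfx): Modified Pool list to: []",
    r"2013/12/20 14:21:12 scanlinevfx LAPRO0203 (LAPRO0203\ScanlineVFX): Modified Pool list to: [all, python, 2dshared]",
]


def parse_history(lines):
    """Turn history lines into dicts; lines that do not match are skipped."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        entry = match.groupdict()
        # Split "[a, b, c]" contents into a clean list; "[]" becomes [].
        entry["items"] = [s.strip() for s in entry["items"].split(",") if s.strip()]
        entries.append(entry)
    return entries


def empty_writes_by_machine(entries):
    """Count, per machine (case-insensitive), how often an empty list was saved."""
    return Counter(e["machine"].upper() for e in entries if not e["items"])


if __name__ == "__main__":
    for machine, count in empty_writes_by_machine(parse_history(SAMPLE)).items():
        print(machine, count)
```

Run against the full history, a count like this makes it easier to see which Monitor machines committed the empty lists and when the first one appeared.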

Thanks for the extra info, this should help! I’ll make another pass through the code keeping this in mind. I’ll let you guys know when I find something.

Cheers,

  • Jon

If the slaves are being deleted on one machine, and on another box, you modify the pool / group of any slave (with a slave list that is changing, since the slaves are being deleted in the background), could that create a weird situation like this?

Hmmm, so I did find a bug where clicking ‘OK’ on the Modify Pools dialog would save out pools even if no changes had been made. With two Manage Pool windows open in two separate monitors, this could lead to accidentally clobbering changes. Judging from the history log, I think this is what happened the second/third time pools were cleared: once the pools were cleared the first time, the pool window was opened in two different monitors. One of them fixed the pools, then the other one clicked ‘OK’, which re-committed the empty pool lists and wiped out the fixed pool lists.

I’ve fixed this bug, so that it should only save out pools if changes were made inside the window – groups were already working properly this way.
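To make the clobbering scenario concrete, here is a minimal Python sketch of two dialogs sharing one store, with and without a "save only if changed" check. This is not Deadline's actual code; `PoolStore`, `PoolDialog`, `simulate`, and the slave name `slave01` are all hypothetical names used only to illustrate the behavior described above.

```python
class PoolStore:
    """Hypothetical stand-in for the repository: slave name -> list of pools."""

    def __init__(self, assignments):
        self._assignments = {k: list(v) for k, v in assignments.items()}

    def load(self):
        # Return a copy so each dialog works on its own snapshot.
        return {k: list(v) for k, v in self._assignments.items()}

    def save(self, assignments):
        # Overwrites everything with whatever the caller holds.
        self._assignments = {k: list(v) for k, v in assignments.items()}


class PoolDialog:
    """Hypothetical pool dialog that snapshots the store when opened."""

    def __init__(self, store, save_only_if_changed):
        self.store = store
        self.save_only_if_changed = save_only_if_changed
        self.original = store.load()  # state at the time the dialog opened
        self.working = store.load()   # state the user edits

    def set_pools(self, slave, pools):
        self.working[slave] = list(pools)

    def click_ok(self):
        # With the check enabled, an untouched dialog writes nothing back.
        if self.save_only_if_changed and self.working == self.original:
            return False
        self.store.save(self.working)
        return True


def simulate(save_only_if_changed):
    store = PoolStore({"slave01": []})                    # pools already wiped
    monitor_a = PoolDialog(store, save_only_if_changed)
    monitor_b = PoolDialog(store, save_only_if_changed)   # also sees the empty list
    monitor_a.set_pools("slave01", ["all", "python", "2dshared"])
    monitor_a.click_ok()                                  # A repairs the pools
    monitor_b.click_ok()                                  # B clicks OK without editing
    return store.load()["slave01"]


if __name__ == "__main__":
    print("always save:", simulate(False))
    print("save only if changed:", simulate(True))
```

In this sketch the check only covers the case where the second dialog made no edits. If the second dialog did change something while holding a stale snapshot, it would still overwrite the first dialog's work, so a check like this narrows the problem rather than ruling out every concurrent-edit conflict.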

I still haven’t figured out what caused the pools to have been cleared in the first place, though. I’ll keep poking around.

Thanks Jon, we will keep an eye out!
