hundreds of idle machines, all doing housecleaning

LaszloSebo · December 3, 2013, 11:22pm

Hi there,

Bit of a panic situation here… seems like our farm went into a housecleaning frenzy. We have a couple hundred slaves now that refuse to render anything, and on slave restart they just do housecleaning. They seem to be connected to pulse just fine.

Attached is the 3 most recent logs from one machine for example

Slave startup log:

2013-12-03 14:43:27: Scheduler Thread - Slave initialization complete.
2013-12-03 14:43:27: Scheduler Thread - Performing house cleaning…
2013-12-03 14:43:27: Update timeout has been set to 300 seconds
2013-12-03 14:43:27: Stdout Handling Enabled: False
2013-12-03 14:43:27: Popup Handling Enabled: False
2013-12-03 14:43:27: Using Process Tree: True
2013-12-03 14:43:27: Hiding DOS Window: True
2013-12-03 14:43:27: Creating New Console: False
2013-12-03 14:43:27: Executable: “C:\Program Files\Thinkbox\Deadline6\bin\deadlinecommand.exe”
2013-12-03 14:43:27: Argument: -DoPendingJobScan True
2013-12-03 14:43:27: Startup Directory: “C:\Program Files\Thinkbox\Deadline6\bin”
2013-12-03 14:43:27: Process Priority: BelowNormal
2013-12-03 14:43:27: Process Affinity: default
2013-12-03 14:43:27: Process is now running
2013-12-03 14:43:53: Process exit code: 0
2013-12-03 14:43:53: Update timeout has been set to 300 seconds
2013-12-03 14:43:53: Stdout Handling Enabled: False
2013-12-03 14:43:53: Popup Handling Enabled: False
2013-12-03 14:43:53: Using Process Tree: True
2013-12-03 14:43:53: Hiding DOS Window: True
2013-12-03 14:43:53: Creating New Console: False
2013-12-03 14:43:53: Executable: “C:\Program Files\Thinkbox\Deadline6\bin\deadlinecommand.exe”
2013-12-03 14:43:53: Argument: -DoHouseCleaning 100 True
2013-12-03 14:43:53: Startup Directory: “C:\Program Files\Thinkbox\Deadline6\bin”
2013-12-03 14:43:53: Process Priority: BelowNormal
2013-12-03 14:43:53: Process Affinity: default
2013-12-03 14:43:53: Process is now running
2013-12-03 14:44:17: Another house cleaning process is already in progress

Then a lot of this:
2013-12-03 14:43:10: Minor exception - Object reference not set to an instance of an object. (System.NullReferenceException)
2013-12-03 14:43:10: Minor exception - Object reference not set to an instance of an object. (System.NullReferenceException)
deadlineslave_LAPRO0133-LAPRO0133-2013-12-03-0001.zip (7.46 KB)

LaszloSebo · December 3, 2013, 11:22pm

No idea what to do, we restarted pulse, restarted the slaves etc
They arent coming back, they keep doing housecleaning

LaszloSebo · December 3, 2013, 11:28pm

Seems like even pulse is being silly:

[root@deadline bin]# ps aux | grep deadline
root 2506 0.0 0.2 144804 66816 ? S Oct15 52:47 /usr/bin/Xvnc :0 -desktop deadline.scanlinevfxla.com:0 (root) -auth /root/.Xauthority -geometry 1920x1080 -rfbwait 30000 -rfbauth /root/.vnc/passwd -rfbport 5900 -fp catalogue:/etc/X11/fontpath.d -pn
root 8524 0.0 0.0 103248 848 pts/4 S+ 15:28 0:00 grep deadline
root 18532 0.1 0.2 583776 95160 ? Sl 15:05 0:02 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -DoPendingJobScan True
root 18533 0.1 0.2 476052 69800 ? Sl 15:05 0:01 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -DoHouseCleaning 10 True
root 18594 0.7 0.2 567120 92884 ? Sl 12:47 1:13 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -DoHouseCleaning 10 True
root 21457 0.0 0.3 588388 99932 ? Sl 12:56 0:08 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -DoPendingJobScan True
root 23147 0.1 0.1 925040 47088 ? Sl Nov26 14:33 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -pulse -nogui -upgraded
root 23587 5.6 0.1 1206200 56800 ? Sl 15:03 1:23 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinepulse.exe -nogui

Isnt housecleaning / pending job scanning a locking operation?

LaszloSebo · December 3, 2013, 11:39pm

The machines doing this weird housecleaning are putting stuff like this in the logs:

2013-12-03 15:28:36: Slave ‘LAPRO2102’ has stalled because it has not updated its state in 15.731 m. Performing house cleaning…
2013-12-03 15:32:08: Slave ‘LAPRO0052’ has stalled because it has not updated its state in 1.148 hrs. Performing house cleaning…
2013-12-03 15:34:21: Could not find associated job class though.
2013-12-03 15:34:33: Slave ‘LAPRO0057’ has stalled because it has not updated its state in 1.181 hrs. Performing house cleaning…
2013-12-03 15:36:33: Could not find associated job class though.
2013-12-03 15:36:50: Slave ‘LAPRO0091’ has stalled because it has not updated its state in 13.901 m. Performing house cleaning…

LaszloSebo · December 3, 2013, 11:41pm

Even if i kill the housecleaning deadlinecommand, it does not pick up any jobs…

rrussell · December 4, 2013, 3:14pm

Sounds like it’s due to the database being overwhelmed with too many connected monitors:
viewtopic.php?f=156&t=10803#p47167

LaszloSebo · December 8, 2013, 5:33am

This is happening again…

There was a power failure in LA today, and ever since then, our whole farm is stuck doing housecleaning. If i restart any slave, they just go right back into housecleaning.

I dont understand whats happening at all…

MikeOwen · December 8, 2013, 4:35pm

Hi Laszlo,

If there was a power failure, is it possible your MongoDB machine was shutdown/restarted in an ungraceful manner? ie: UPS did not cut in or it timed out and therefore the DB needs to be repaired? There’s too much to explain clearly, so I would recommend reading this link first, before attempting a repair. I would also recommend taking a full backup of the DB prior to carrying out this operation as well; just to be on the safe side.

docs.mongodb.org/manual/tutorial … d-shutdown

Essentially, you may or may not need to remove the *.lock file, which the documentation above will explain. Also, I understand that on Linux, the repair command normally carried out with root permissions, will cause a change of ownership of the DB to root, so depending on the user which normally executes the Mongo service, it may be necessary to do a change ownership command after the repair. Something along the lines of the below (note local install paths may differ) should help explain the command operations:

sudo rm /opt/DeadlineDatabase6/mongo/data/mongod.lock sudo mongod --dbpath /opt/DeadlineDatabase6/mongo/data/deadlinedb --repair sudo chown mongodb /opt/DeadlineDatabase6/mongo/data/* sudo service mongodb start

Again, I would recommend backing up the DB/jobs just in case this doesn’t actually solve your issue and of course, stopping the service to repair the DB, will take the DB/farm off-line during this process.

Mike

rrussell · December 9, 2013, 2:43pm

Hey Laszlo,

There are a few issues here that are all contributing to the behavior you’re seeing. The first is something we had already addressed internally, and it was that the “Stalled Slave Scan” section of housecleaning wasn’t covered by the housecleaning lock. This is the reason all of your slaves try to do a housecleaning at startup.

Another problem was with the “Stalled Slave Scan” itself. If a slave is detected as stalled, and there is a job ID stored in the slave state, an attempt will be made to find that job and requeue tasks for it that the stalled slave was working on. However, if there was no job ID stored in the slave state, an attempt would be made to check every active job to see if the slave was working on one of its tasks. This is a very expensive operation, and really isn’t necessary because checking for orphaned tasks is already part of normal housecleaning, so we’ll be taking that chunk of code out.

Finally, there is a problem where the initial housecleaning done by the slave is attempted before determining if the slave can connect to Pulse. Now if a housecleaning is in progress, the housecleaning lock would catch that, but because of the 1st issue above, it still results in every slave doing the “Stalled Slave Scan” at startup.

All of these issues will be addressed in beta 13.

Cheers,

Ryan

LaszloSebo · December 9, 2013, 5:31pm

Thanks Ryan. Based on what im seeing in the logs, you absolutely correct about whats causing it.

I ended up doing a voodoo dance of sending remote commands to all slaves with:

taskkill /F /IM deadlinecommand*

This would kill the housecleaning (and in a lot of cases, some other hanging queries like getting slave IDs and such, not sure what those were), without a slave restart. So they would eventually go back to rendering.

A simple slave restart never worked, because every time they would report that they themselves are stalled, and thus require housecleaning:

2013-12-07 21:44:09: BEGIN - LAPRO0047\ScanlineVFX
2013-12-07 21:44:09: Deadline Slave 6.1 [v6.1.0.53841 R]
2013-12-07 21:45:00: Auto Configuration: No auto configuration could be detected, using local configuration
2013-12-07 21:45:18: Info Thread - Created.
2013-12-07 21:45:18: Slave ‘LAPRO0047’ has stalled because it has not updated its state in 5.198 m. Performing house cleaning…
2013-12-07 21:55:10: Trying to connect using license server '@lapro0001.scanlinevfxla.com’…
2013-12-07 21:55:10: License obtained.
2013-12-07 21:55:10: Scheduler Thread - Slave initialization complete.
2013-12-07 21:55:10: Scheduler Thread - Performing house cleaning…
2013-12-07 21:55:10: Update timeout has been set to 300 seconds

LaszloSebo · December 9, 2013, 5:33pm

Thanks Mike, Ill check on the health of the database, but i think it might be fine. After around midnight on saturday, having killed 99% of the slave housecleaning, the farm was operational again.

At that point, i called it quits, so ill just go through any jobs that might be in a corrupt state this morning.