I’ve been playing around with beta 7 a bit, and my slave seems to be hanging indefinitely whenever it tries to do its house-cleaning. This is using a completely local setup (mongo server, repo, and a single slave). All processes have been started from separate terminals, and the slave is running in non-GUI mode under a deadlinelauncher process.
The deadlinecommand process is running deadlinecommand.exe -DoHouseCleaning 100 False. Once I manually kill the deadlinecommand process, the slave resumes picking up tasks as expected, but as soon as it tries to do house-cleaning again, it gets stuck.
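For the record, I'm clearing the hang with something along these lines (the pgrep-style matching is just what I happen to use; the point is that only the deadlinecommand child gets killed, not the slave itself):
[code]# Rough sketch (Python) of how I clear the hang: kill only the deadlinecommand
# child that was launched with -DoHouseCleaning, leaving the slave running.
import os
import signal
import subprocess

def kill_hung_housecleaning():
    try:
        pids = subprocess.check_output(
            ["pgrep", "-f", "deadlinecommand.*-DoHouseCleaning"])
    except subprocess.CalledProcessError:
        return  # no housecleaning process is running
    for pid in pids.split():
        os.kill(int(pid), signal.SIGKILL)

kill_hung_housecleaning()[/code]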
Could this be related to the fact that I’m only running one slave? From the “DoHouseCleaning” help:
If this value is greater than or equal to 100, only one random slave will be checked to see if it is stalled, and the job repository scan will be random based on the number of slaves in the farm.
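If I'm reading that right, the repository scan is a random roll against the slave count, so with a single slave it should run on every pass. My interpretation (obviously not Deadline's actual code, just the logic the help text seems to describe):
[code]# My reading of the "-DoHouseCleaning <N>" behaviour for N >= 100 -- not
# Deadline's actual code, just what the help text seems to describe.
import random

def should_run_repository_scan(slave_count):
    # "random based on the number of slaves in the farm": roughly a
    # 1-in-slave_count chance that this particular slave does the scan.
    return random.randrange(slave_count) == 0

# With a single slave the roll always succeeds, so the scan (and whatever is
# hanging inside it) would run on every housecleaning pass.
print(should_run_repository_scan(1))  # always True[/code]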
I should also mention that the house-cleaning process blocks the slave from being shut down cleanly, so if a user tries to stop it programmatically, it will end up being killed forcefully after the shutdown process times out.
Interesting addition: If the slave is started directly from a terminal (in my case, with -nogui), things seem to work properly. However, if the slave is started by the deadlinelauncher process (either at startup or from another call to deadlinelauncher -slave after a launcher process is already running), it stalls every time the house-cleaning runs.
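In case it helps to reproduce, the two launch paths I'm comparing boil down to the following (paths are from my local install, and "deadlineslave" is my assumption about the binary name; adjust for your setup):
[code]# The two ways I'm starting the slave; only the launcher-driven one hangs.
import os
import subprocess

BIN = "/mnt/bonus/Deadline6/client/bin"

def start_slave_direct():
    # Started straight from a terminal: housecleaning behaves normally.
    return subprocess.Popen([os.path.join(BIN, "deadlineslave"), "-nogui"])

def start_slave_via_launcher():
    # Started via the launcher (same result whether the launcher starts the
    # slave at startup or a second "deadlinelauncher -slave" call is made):
    # housecleaning hangs every time.
    return subprocess.Popen([os.path.join(BIN, "deadlinelauncher"), "-slave"])[/code]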
Hmm… I still can’t reproduce this, and I’ve had the slave running for hours after being launched from the Launcher.
Do you have slave verbose logging enabled? If not, please enable it in the Application Logging section of the repository options. Then restart the slave, and when it hangs, please send us the slave log. This should show which part of the housecleaning process it’s hanging on.
OK, so I left the whole setup running all weekend, and it remained stuck. However, after enabling verbose logging, restarting the launcher and slave, and re-queueing the same tasks, the stalling seems to be much more inconsistent.
The log is pretty uninformative:
[code]Creating New Console: False
Executable: "/mnt/bonus/Deadline6/client/bin/deadlinecommand"
Argument: -DoHouseCleaning 100 True
Startup Directory: "/mnt/bonus/Deadline6/client/bin"
Process Priority: BelowNormal
Process Affinity: default
Process is now running
Performing Job Repository Scan...
Loading jobs
Scanning jobs
Cleaning up orphaned tasks
Done.[/code]
If I then run “Stop Slave” from the monitor, the slave output looks like this:
[code]::ffff:100.100.104.82 has connected
Launcher Thread - Received command: StopSlave ws-082
Sending command to slave: StopSlave
Listener Thread - ::ffff:100.100.104.82 has connected
Listener Thread - Received message: StopSlave
Listener Thread - Responded with: Success
Got reply: ws-082.luma-pictures.com: Sent "StopSlave" command. Result: ""
Slave - slave shutdown: normal
Listener Thread - OnConnect: Listener Socket has been closed.
Info Thread - requesting slave info thread quit.
Info Thread - shutdown complete
Waiting for threads to quit. 29 seconds until forced shutdown.
Waiting for threads to quit. 28 seconds until forced shutdown.
Waiting for threads to quit. 27 seconds until forced shutdown.
Waiting for threads to quit. 26 seconds until forced shutdown.
Waiting for threads to quit. 25 seconds until forced shutdown.
Waiting for threads to quit. 24 seconds until forced shutdown.
Waiting for threads to quit. 23 seconds until forced shutdown.
Waiting for threads to quit. 22 seconds until forced shutdown.
Waiting for threads to quit. 21 seconds until forced shutdown.
Waiting for threads to quit. 20 seconds until forced shutdown.
Waiting for threads to quit. 19 seconds until forced shutdown.
Waiting for threads to quit. 18 seconds until forced shutdown.
Waiting for threads to quit. 17 seconds until forced shutdown.
Waiting for threads to quit. 16 seconds until forced shutdown.
Waiting for threads to quit. 15 seconds until forced shutdown.
Waiting for threads to quit. 14 seconds until forced shutdown.
Waiting for threads to quit. 13 seconds until forced shutdown.
Waiting for threads to quit. 12 seconds until forced shutdown.
Waiting for threads to quit. 11 seconds until forced shutdown.
Waiting for threads to quit. 10 seconds until forced shutdown.
Waiting for threads to quit. 9 seconds until forced shutdown.
Waiting for threads to quit. 8 seconds until forced shutdown.
Waiting for threads to quit. 7 seconds until forced shutdown.
Waiting for threads to quit. 6 seconds until forced shutdown.
Waiting for threads to quit. 5 seconds until forced shutdown.
Waiting for threads to quit. 4 seconds until forced shutdown.
Waiting for threads to quit. 3 seconds until forced shutdown.
Waiting for threads to quit. 2 seconds until forced shutdown.
Waiting for threads to quit. 1 seconds until forced shutdown.
Waiting for threads to quit. 0 seconds until forced shutdown.
Info Thread: Stopped
Scheduler Thread: ShuttingDown / Waiting
Render Threads:
Forcing shutdown.
Exception Details
Exception – One or more threads failed to quit successfully.
Exception.Data: ( )
Exception.StackTrace:
(null)
Slave - slave shutdown: forced
Could not query child process information for pid 9402 because: Thread was being aborted (System.Threading.ThreadAbortException)
WARNING: an error occured while trying to kill the process tree: Thread was being aborted (System.Threading.ThreadAbortException)
Could not query child process information for pid 9402 because: Thread was being aborted (System.Threading.ThreadAbortException)
WARNING: an error occured while trying to kill the process tree: Thread was being aborted (System.Threading.ThreadAbortException)
Error running housecleaning process: Thread was being aborted (System.Threading.ThreadAbortException)
Launcher Thread - Responded with: Success|[/code]
At this point, the slave process is no longer running, but the monitor still reports it as Idle.
The log is actually pretty helpful in this case. Can you let it run 5 or 10 more times and see if the log is identical when it hangs? Based on this log, the hang seems to happen after the repository scan, but I’m curious to see if that’s where it always hangs.
We have a timeout built in as well, but it doesn’t seem to be firing off here, which is strange…
After you kill deadlinecommand, does the slave print out anything before it moves on?
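For context, the slave is supposed to put a watchdog timeout on the housecleaning child and force-kill it if it runs long. Conceptually it amounts to something like the sketch below (this is not our actual code, and the timeout value is made up); the puzzle is why that kill path never seems to trigger for you:
[code]# Conceptual sketch only (not the real slave code): the housecleaning child is
# expected to be killed by a watchdog if it runs longer than some timeout.
import subprocess

def run_housecleaning(bin_dir="/mnt/bonus/Deadline6/client/bin", timeout_seconds=600):
    proc = subprocess.Popen(
        [bin_dir + "/deadlinecommand", "-DoHouseCleaning", "100", "True"],
        cwd=bin_dir)
    try:
        proc.wait(timeout=timeout_seconds)  # should bound any hang
    except subprocess.TimeoutExpired:
        proc.kill()   # force-kill the stuck child
        proc.wait()
    return proc.returncode[/code]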
Hang on, it looks like it got stuck again and then eventually output the following:
[code]Startup Directory: "/mnt/bonus/Deadline6/client/bin"
Process Priority: BelowNormal
Process Affinity: default
Process is now running
Performing Job Repository Scan...
Loading jobs
Scanning jobs
Done.
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check[/code]
I don’t have any sort of power management enabled, and I’m not sure why it wasn’t outputting those lines before… The child process still stays stuck until I kill it.
I think what we'll do for beta 9 is add an option in the House Cleaning section of the Repository Options that controls whether house cleaning runs in a separate process. The original reason for moving it to a separate process was that script dependencies could potentially cause it to crash. However, we just discovered that we weren't acquiring the Python global interpreter lock when building up the parameters to pass to these scripts, which in theory could cause some instability.
If it still locks up for you after beta 9 with house cleaning running in-process, then it must be an issue with housecleaning in general, and not related to running it as a separate process.
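To make that concrete, the new option essentially just controls whether the slave spawns deadlinecommand for the cleanup or does the same work inside its own process. Very roughly (again, not the actual implementation):
[code]# Very rough illustration of the planned toggle (not the real implementation).
import subprocess

def run_housecleaning_inline():
    # Hypothetical stand-in; the real cleanup steps live inside the Deadline client.
    pass

def do_housecleaning(run_in_separate_process, bin_dir="/mnt/bonus/Deadline6/client/bin"):
    if run_in_separate_process:
        # Current behaviour: isolate housecleaning (and any misbehaving cleanup
        # scripts) in a child process so a crash can't take the slave down too.
        subprocess.call([bin_dir + "/deadlinecommand", "-DoHouseCleaning", "100", "True"])
    else:
        # Planned alternative: run the cleanup inside the slave process itself,
        # avoiding the child process that is hanging here.
        run_housecleaning_inline()[/code]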