I’ve been playing around with beta 7 a bit, and my slave seems to be hanging indefinitely whenever it tries to do its house-cleaning. This is using a completely local setup (mongo server, repo, and a single slave). All processes have been started from separate terminals, and the slave is running in non-GUI mode under a deadlinelauncher process.
The deadlinecommand process is running deadlinecommand.exe -DoHouseCleaning 100 False. Once I manually kill the deadlinecommand process, the slave resumes picking up tasks as expected, but as soon as it tries to do house-cleaning again, it gets stuck.
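For the record, I'm clearing the hang with something along these lines (the pgrep-style matching is just what I happen to use; the point is that only the deadlinecommand child gets killed, not the slave itself):
[code]# Rough sketch (Python) of how I clear the hang: kill only the deadlinecommand
# child that was launched with -DoHouseCleaning, leaving the slave running.
import os
import signal
import subprocess

def kill_hung_housecleaning():
    try:
        pids = subprocess.check_output(
            ["pgrep", "-f", "deadlinecommand.*-DoHouseCleaning"])
    except subprocess.CalledProcessError:
        return  # no housecleaning process is running
    for pid in pids.split():
        os.kill(int(pid), signal.SIGKILL)

kill_hung_housecleaning()[/code]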
Could this be related to the fact that I’m only running one slave? From the “DoHouseCleaning” help:
If this value is greater than or equal to 100, only one random slave will be checked to see if it is stalled, and the job repository scan will be random based on the number of slaves in the farm.
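If I'm reading that right, the repository scan is a random roll against the slave count, so with a single slave it should run on every pass. My interpretation (obviously not Deadline's actual code, just the logic the help text seems to describe):
[code]# My reading of the "-DoHouseCleaning <N>" behaviour for N >= 100 -- not
# Deadline's actual code, just what the help text seems to describe.
import random

def should_run_repository_scan(slave_count):
    # "random based on the number of slaves in the farm": roughly a
    # 1-in-slave_count chance that this particular slave does the scan.
    return random.randrange(slave_count) == 0

# With a single slave the roll always succeeds, so the scan (and whatever is
# hanging inside it) would run on every housecleaning pass.
print(should_run_repository_scan(1))  # always True[/code]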
I should also mention that the house-cleaning process blocks the slave from being shut down cleanly, so if a user tries to stop it programmatically, it will end up being killed forcefully after the shutdown process times out.
Interesting addition: If the slave is started directly from a terminal (in my case, with -nogui), things seem to work properly. However, if the slave is started by the deadlinelauncher process (either at startup or from another call to deadlinelauncher -slave after a launcher process is already running), it stalls every time the house-cleaning runs.
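In case it helps to reproduce, the two launch paths I'm comparing boil down to the following (paths are from my local install, and "deadlineslave" is my assumption about the binary name; adjust for your setup):
[code]# The two ways I'm starting the slave; only the launcher-driven one hangs.
import os
import subprocess

BIN = "/mnt/bonus/Deadline6/client/bin"

def start_slave_direct():
    # Started straight from a terminal: housecleaning behaves normally.
    return subprocess.Popen([os.path.join(BIN, "deadlineslave"), "-nogui"])

def start_slave_via_launcher():
    # Started via the launcher (same result whether the launcher starts the
    # slave at startup or a second "deadlinelauncher -slave" call is made):
    # housecleaning hangs every time.
    return subprocess.Popen([os.path.join(BIN, "deadlinelauncher"), "-slave"])[/code]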
Hmm… I still can’t reproduce this, and I’ve had the slave running for hours after being launched from the Launcher.
Do you have slave verbose logging enabled? If not, please enable it in the Application Logging section of the repository options. Then restart the slave, and when it hangs, please send us the slave log. This should show which part of the housecleaning process it’s hanging on.
OK, so I left the whole setup running all weekend, and it remained stuck. However, after enabling verbose logging, restarting the launcher and slave, and re-queueing the same tasks, the stalling seems to be much more inconsistent.
The log is pretty uninformative:
[code]Creating New Console: False
Executable: "/mnt/bonus/Deadline6/client/bin/deadlinecommand"
Argument: -DoHouseCleaning 100 True
Startup Directory: "/mnt/bonus/Deadline6/client/bin"
Process Priority: BelowNormal
Process Affinity: default
Process is now running
Performing Job Repository Scan...
Loading jobs
Scanning jobs
Cleaning up orphaned tasks
Done.[/code]
If I then run “Stop Slave” from the monitor, the slave output looks like this:
[code]::ffff:100.100.104.82 has connected
Launcher Thread - Received command: StopSlave ws-082
Sending command to slave: StopSlave
Listener Thread - ::ffff:100.100.104.82 has connected
Listener Thread - Received message: StopSlave
Listener Thread - Responded with: Success
Got reply: ws-082.luma-pictures.com: Sent "StopSlave" command. Result: ""
Slave - slave shutdown: normal
Listener Thread - OnConnect: Listener Socket has been closed.
Info Thread - requesting slave info thread quit.
Info Thread - shutdown complete
Waiting for threads to quit. 29 seconds until forced shutdown.
Waiting for threads to quit. 28 seconds until forced shutdown.
Waiting for threads to quit. 27 seconds until forced shutdown.
Waiting for threads to quit. 26 seconds until forced shutdown.
Waiting for threads to quit. 25 seconds until forced shutdown.
Waiting for threads to quit. 24 seconds until forced shutdown.
Waiting for threads to quit. 23 seconds until forced shutdown.
Waiting for threads to quit. 22 seconds until forced shutdown.
Waiting for threads to quit. 21 seconds until forced shutdown.
Waiting for threads to quit. 20 seconds until forced shutdown.
Waiting for threads to quit. 19 seconds until forced shutdown.
Waiting for threads to quit. 18 seconds until forced shutdown.
Waiting for threads to quit. 17 seconds until forced shutdown.
Waiting for threads to quit. 16 seconds until forced shutdown.
Waiting for threads to quit. 15 seconds until forced shutdown.
Waiting for threads to quit. 14 seconds until forced shutdown.
Waiting for threads to quit. 13 seconds until forced shutdown.
Waiting for threads to quit. 12 seconds until forced shutdown.
Waiting for threads to quit. 11 seconds until forced shutdown.
Waiting for threads to quit. 10 seconds until forced shutdown.
Waiting for threads to quit. 9 seconds until forced shutdown.
Waiting for threads to quit. 8 seconds until forced shutdown.
Waiting for threads to quit. 7 seconds until forced shutdown.
Waiting for threads to quit. 6 seconds until forced shutdown.
Waiting for threads to quit. 5 seconds until forced shutdown.
Waiting for threads to quit. 4 seconds until forced shutdown.
Waiting for threads to quit. 3 seconds until forced shutdown.
Waiting for threads to quit. 2 seconds until forced shutdown.
Waiting for threads to quit. 1 seconds until forced shutdown.
Waiting for threads to quit. 0 seconds until forced shutdown.
Info Thread: Stopped
Scheduler Thread: ShuttingDown / Waiting
Render Threads:
Forcing shutdown.
Exception Details
Exception – One or more threads failed to quit successfully.
Exception.Data: ( )
Exception.StackTrace:
(null)
Slave - slave shutdown: forced
Could not query child process information for pid 9402 because: Thread was being aborted (System.Threading.ThreadAbortException)
WARNING: an error occured while trying to kill the process tree: Thread was being aborted (System.Threading.ThreadAbortException)
Could not query child process information for pid 9402 because: Thread was being aborted (System.Threading.ThreadAbortException)
WARNING: an error occured while trying to kill the process tree: Thread was being aborted (System.Threading.ThreadAbortException)
Error running housecleaning process: Thread was being aborted (System.Threading.ThreadAbortException)
Launcher Thread - Responded with: Success|[/code]
At this point, the slave process is no longer running, but the monitor still reports it as Idle.
The log is actually pretty helpful in this case. Can you let it run 5 or 10 more times and see if the log is identical when it hangs? Based on this log, the hang seems to happen after the repository scan, but I’m curious to see if that’s where it always hangs.
We have a timeout built in as well, but it doesn’t seem to be firing off here, which is strange…
After you kill deadlinecommand, does the slave print out anything before it moves on?
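For context, the slave is supposed to put a watchdog timeout on the housecleaning child and force-kill it if it runs long. Conceptually it amounts to something like the sketch below (this is not our actual code, and the timeout value is made up); the puzzle is why that kill path never seems to trigger for you:
[code]# Conceptual sketch only (not the real slave code): the housecleaning child is
# expected to be killed by a watchdog if it runs longer than some timeout.
import subprocess

def run_housecleaning(bin_dir="/mnt/bonus/Deadline6/client/bin", timeout_seconds=600):
    proc = subprocess.Popen(
        [bin_dir + "/deadlinecommand", "-DoHouseCleaning", "100", "True"],
        cwd=bin_dir)
    try:
        proc.wait(timeout=timeout_seconds)  # should bound any hang
    except subprocess.TimeoutExpired:
        proc.kill()   # force-kill the stuck child
        proc.wait()
    return proc.returncode[/code]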
Hang on, it looks like it got stuck again and then eventually output the following:
[code]Startup Directory: "/mnt/bonus/Deadline6/client/bin"
Process Priority: BelowNormal
Process Affinity: default
Process is now running
Performing Job Repository Scan...
Loading jobs
Scanning jobs
Done.
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check
Power Management - Thermal Shutdown: There are no temperature zones to check[/code]
I don’t have any sort of power management enabled, and I’m not sure why it wasn’t outputting those lines before… The child process still stays stuck until I kill it.
I think what we'll do for beta 9 is add an option in the House Cleaning section of the Repository Options that controls whether house cleaning runs in a separate process. The original reason for moving it to a separate process was that script dependencies could potentially cause it to crash. However, we just discovered that we weren't acquiring the Python global interpreter lock when building up the parameters to pass to these scripts, which in theory could cause some instability.
If it still locks up for you after beta 9 with house cleaning running in-process, then it must be an issue with housecleaning in general, and not related to running it as a separate process.
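To make that concrete, the new option essentially just controls whether the slave spawns deadlinecommand for the cleanup or does the same work inside its own process. Very roughly (again, not the actual implementation):
[code]# Very rough illustration of the planned toggle (not the real implementation).
import subprocess

def run_housecleaning_inline():
    # Hypothetical stand-in; the real cleanup steps live inside the Deadline client.
    pass

def do_housecleaning(run_in_separate_process, bin_dir="/mnt/bonus/Deadline6/client/bin"):
    if run_in_separate_process:
        # Current behaviour: isolate housecleaning (and any misbehaving cleanup
        # scripts) in a child process so a crash can't take the slave down too.
        subprocess.call([bin_dir + "/deadlinecommand", "-DoHouseCleaning", "100", "True"])
    else:
        # Planned alternative: run the cleanup inside the slave process itself,
        # avoiding the child process that is hanging here.
        run_housecleaning_inline()[/code]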