I’m having (again) an interesting issue with one of my slaves.
The slave first refused to wake up (WOL) on the PowerManager, then I fired him up via the Deadline Monitor which, surprisingly, woke the slave up, but, the slave currently sits in the monitor as “IDLE”, but doesn’t perform any tasks at all.
When I remote log in on the machine I see the Deadline service being in the state of “starting”, which is strange. I can’t disable/enable it manually and I can’t do much from within the Monitor. Restarting the machine didn’t help. Restarting the license server didn’t help either. The machine is just “IDLE” and that’s it. Also reinstalling the Deadline Client didn’t help. The license is being served, but the service just refuses to start.
The logs don’t show anything particularly useful:
2012-01-08 17:48:44: Starting between task wait - seconds: 10
2012-01-08 17:48:56: Scheduler - Contacting Deadline Pulse running on "RAMMSTEIN", port 5046.
2012-01-08 17:48:56: Scheduler - Requesting work from Deadline Pulse...
2012-01-08 17:49:06: Scheduler - No jobs found.
2012-01-08 17:49:06: Starting between task wait - seconds: 10
¨2012-01-08 17:36:56: BEGIN - RND04\loocas
2012-01-08 17:36:56: Start-up
2012-01-08 17:36:56: 2012-01-08 17:36:56
2012-01-08 17:36:56: Deadline Launcher 5.1 [v5.1.0.46114 R]
2012-01-08 17:36:56: WARNING: Repository is not valid, so no upgrades will occur.
2012-01-08 17:36:56: Launching Slave: Rnd04
2012-01-08 17:36:56: Launcher Thread - Launcher thread initializing...
2012-01-08 17:36:56: Perfoming remote admin check
2012-01-08 17:36:58: Remote Administration is now enabled
2012-01-08 17:36:58: Launcher Thread - Remote administration is enabled
2012-01-08 17:36:58: Launcher Thread - Launcher thread listening on port 5042
2012-01-08 17:37:58: Perfoming remote admin check
2012-01-08 17:40:00: Perfoming remote admin check
2012-01-08 17:40:50: ::ffff:192.168.0.111 has connected
2012-01-08 17:40:50: Launcher Thread - Received command: ForceRelaunchSlave Rnd04
2012-01-08 17:40:50: Stoping Slave: Rnd04
2012-01-08 17:40:50: Sending command to slave: StopSlave
2012-01-08 17:40:50: Got reply: RND04: Sent "StopSlave" command. Result: ""
2012-01-08 17:40:53: Launcher Thread - Responded with: Success|
2012-01-08 17:41:03: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version
2012-01-08 17:41:03: Network python version file: \\rammstein\deadline\python\Windows\2.6.7\Version
2012-01-08 17:41:03: Comparing python version files
2012-01-08 17:41:03: Python upgrade skipped because Version files are the same
2012-01-08 17:41:03: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version
2012-01-08 17:41:03: Network version file: \\rammstein\deadline\bin\Windows\Version
2012-01-08 17:41:03: Comparing version files
2012-01-08 17:41:03: Launching Slave: Rnd04
2012-01-08 17:41:57: fe80::f8f1:52a4:a57f:dbf3%11 has connected
2012-01-08 17:41:57: Launcher Thread - Received command: LaunchSlavePromptForUpgrade
2012-01-08 17:41:58: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version
2012-01-08 17:41:58: Network python version file: \\rammstein\deadline\python\Windows\2.6.7\Version
2012-01-08 17:41:58: Comparing python version files
2012-01-08 17:41:58: Python upgrade skipped because Version files are the same
2012-01-08 17:41:58: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version
2012-01-08 17:41:58: Network version file: \\rammstein\deadline\bin\Windows\Version
2012-01-08 17:41:58: Comparing version files
2012-01-08 17:41:58: Launcher Thread - Responded with: Success|
2012-01-08 17:43:00: Perfoming remote admin check
2012-01-08 17:43:59: ::ffff:192.168.0.111 has connected
2012-01-08 17:43:59: Launcher Thread - Received command: ForceStopSlave Rnd04
2012-01-08 17:43:59: Stoping Slave: Rnd04
2012-01-08 17:43:59: Sending command to slave: StopSlave
2012-01-08 17:43:59: Got reply: RND04: Sent "StopSlave" command. Result: ""
2012-01-08 17:44:01: Launcher Thread - Responded with: Success|
2012-01-08 17:44:15: ::ffff:192.168.0.111 has connected
2012-01-08 17:44:15: Launcher Thread - Received command: LaunchSlave Rnd04
2012-01-08 17:44:16: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version
2012-01-08 17:44:16: Network python version file: \\rammstein\deadline\python\Windows\2.6.7\Version
2012-01-08 17:44:16: Comparing python version files
2012-01-08 17:44:16: Python upgrade skipped because Version files are the same
2012-01-08 17:44:16: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version
2012-01-08 17:44:16: Network version file: \\rammstein\deadline\bin\Windows\Version
2012-01-08 17:44:16: Comparing version files
2012-01-08 17:44:16: Launching Slave: Rnd04
2012-01-08 17:44:16: Launcher Thread - Responded with: Success|
2012-01-08 17:47:01: Perfoming remote admin check
HA! What the hell! When I run the slave manually, i.e. in user mode, it runs, but it still won’t get ANY tasks assigned! It’s completely being ignored even though there are jobs to do (not limited to specific machines or anything like that)!
Did you try restarting Pulse? If not, try that to see if it makes a difference.
If it doesn’t, then my next guess is that some job setting (pool, group, whitelist/blacklist, limit group, etc) is preventing jobs from running on that machine. To confirm this, you can try using the Slave Availability Filter in the Monitor (see second paragraph here): thinkboxsoftware.com/deadlin … Slave_List
If that slave is disappearing from the Slave list when you click on jobs, then you know some job setting is preventing that slave from picking up that job.
I just noticed the single render node is being ignored by all the jobs submitted via the integrated SMTD script directly from 3ds Max, but it gets picked up for anything else (for example Nuke or RVIO jobs that I submitted via my own Python scripts).
There are no machine limits set up in the SMTD, all the machines are whitelisted etc…
That is weird. In the job properties in the Monitor, does Rnd04 show up in the whitelist there? Just wondering if the submitter is somehow setting things wrong.
Just to confirm, was the slave availability filter removing this slave from the slave list for SMTD jobs?
I also restarted Pulse just in case and ran it in the GUI mode to see the log:
Listener - Listening for auto configuration requests on port 5043
Web Service - Web Service is disabled. It can be enabled in the Pulse Settings in the Repository Options
Scheduler - Ignoring limit "999_050_999_71276e5a" because:
Scheduler - minor exception: Error in file: E:\Deadline\limitGroups\999_050_999_71276e5a.limitGroup (System.Exception)
Thermal Shutdown: Skipping zone "Idle Shutdown after 3hrs" because it is disabled
Thermal Shutdown: Skipping zone "Restart after 24hrs" because it is disabled
Idle Shutdown: Using default idle override settings for group "Idle Shutdown after 3hrs"
Idle Shutdown: Checking for slaves in group "Idle Shutdown after 3hrs" that have been idle for at least 180 minutes
Idle Shutdown: At least 0 slave(s) should be left running
Idle Shutdown: Idle slave Rnd04 in group "Idle Shutdown after 3hrs" has only been idle for 18 minutes
Idle Shutdown: There are no machines that need to be shutdown at this time
Idle Shutdown: Skipping idle shutdown group "Restart after 24hrs" because it is disabled
Machine Startup: Checking job "GT_AR_07_Max2011_loocas_002" (000_098_999_1ad7c727)
Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs"
Machine Startup: - only slaves with assigned pool "crisis" should be woken up
Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up
Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01
Machine Startup: - checking slave Rnd01
Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_07_Max2011_loocas_002"
Machine Startup: Checking job "GT_AR_05_Max2011_loocas_005" (000_097_999_78560528)
Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs"
Machine Startup: - only slaves with assigned pool "crisis" should be woken up
Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up
Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01
Machine Startup: - checking slave Rnd01
Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_05_Max2011_loocas_005"
Power Management: No power management notification address specified in Repository Options - cannot send notification
Machine Restart: Skipping machine group "Idle Shutdown after 3hrs" because it is disabled
Machine Restart: Checking machine group "Restart after 24hrs"
Machine Restart: - slave has been running on machine Rnd02 for only 386 minutes
Machine Restart: - slave has been running on machine Rnd03 for only 300 minutes
Machine Restart: - slave has been running on machine Rnd04 for only 148 minutes
Machine Restart: - slave has been running on machine Rnd05 for only 333 minutes
Machine Restart: - slave has been running on machine Rnd06 for only 305 minutes
Machine Restart: - slave has been running on machine Rnd07 for only 292 minutes
Machine Restart: - slave has been running on machine Rnd08 for only 298 minutes
Machine Restart: - slave has been running on machine Vfx01 for only 332 minutes
Machine Restart: - slave has been running on machine Vfx02 for only 298 minutes
Machine Restart: There are no machines that need to be restarted at this time
Slave Scheduling: Skipping scheduling group "Idle Shutdown after 3hrs" because it is disabled
Slave Scheduling: Skipping scheduling group "Restart after 24hrs" because it is disabled
Listener - Received request from ::ffff:192.168.0.204
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Clean Up Thread - Performing house cleaning
Checking repository integrity
Purging limits
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Listener - Received request from ::ffff:192.168.0.204
Clean Up Thread - Performing house cleaning
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Clean Up Thread - Performing house cleaning
Purging old job reports
Cleaning up orphaned tasks
Purging trash
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Listener - Received request from ::ffff:192.168.0.204
---- ledna 09 2012 -- 10:42 odp. ----
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Clean Up Thread - Performing house cleaning
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Listener - Received request from ::ffff:192.168.0.204
Clean Up Thread - Performing house cleaning
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Clean Up Thread - Performing house cleaning
Purging trash
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Listener - Received request from ::ffff:192.168.0.204
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Clean Up Thread - Performing house cleaning
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Listener - Received slaves update request from: ::ffff:192.168.0.111
Listener - Creating new thread...
Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 10 updated slaves
Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent!
---- ledna 09 2012 -- 10:43 odp. ----
Listener - Received request from ::ffff:192.168.0.204
Clean Up Thread - Performing house cleaning
Purging repository temp files
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Listener - Received request from ::ffff:192.168.0.204
Listener - Received slaves update request from: ::ffff:192.168.0.111
Listener - Creating new thread...
Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 6 updated slaves
Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent!
Thermal Shutdown: Skipping zone "Idle Shutdown after 3hrs" because it is disabled
Thermal Shutdown: Skipping zone "Restart after 24hrs" because it is disabled
Idle Shutdown: Using default idle override settings for group "Idle Shutdown after 3hrs"
Idle Shutdown: Checking for slaves in group "Idle Shutdown after 3hrs" that have been idle for at least 180 minutes
Idle Shutdown: At least 0 slave(s) should be left running
Idle Shutdown: Idle slave Rnd04 in group "Idle Shutdown after 3hrs" has only been idle for 20 minutes
Idle Shutdown: There are no machines that need to be shutdown at this time
Idle Shutdown: Skipping idle shutdown group "Restart after 24hrs" because it is disabled
Machine Startup: Checking job "GT_AR_07_Max2011_loocas_002" (000_098_999_1ad7c727)
Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs"
Machine Startup: - only slaves with assigned pool "crisis" should be woken up
Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up
Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01
Machine Startup: - checking slave Rnd01
Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_07_Max2011_loocas_002"
Machine Startup: Checking job "GT_AR_05_Max2011_loocas_005" (000_097_999_78560528)
Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs"
Machine Startup: - only slaves with assigned pool "crisis" should be woken up
Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up
Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01
Machine Startup: - checking slave Rnd01
Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_05_Max2011_loocas_005"
Power Management: No power management notification address specified in Repository Options - cannot send notification
Machine Restart: Skipping machine group "Idle Shutdown after 3hrs" because it is disabled
Machine Restart: Checking machine group "Restart after 24hrs"
Machine Restart: - slave has been running on machine Rnd02 for only 388 minutes
Machine Restart: - slave has been running on machine Rnd03 for only 302 minutes
Machine Restart: - slave has been running on machine Rnd04 for only 150 minutes
Machine Restart: - slave has been running on machine Rnd05 for only 335 minutes
Machine Restart: - slave has been running on machine Rnd06 for only 308 minutes
Machine Restart: - slave has been running on machine Rnd07 for only 294 minutes
Machine Restart: - slave has been running on machine Rnd08 for only 300 minutes
Machine Restart: - slave has been running on machine Vfx01 for only 334 minutes
Machine Restart: - slave has been running on machine Vfx02 for only 299 minutes
Machine Restart: There are no machines that need to be restarted at this time
Slave Scheduling: Skipping scheduling group "Idle Shutdown after 3hrs" because it is disabled
Slave Scheduling: Skipping scheduling group "Restart after 24hrs" because it is disabled
Clean Up Thread - Performing house cleaning
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Clean Up Thread - Performing house cleaning
Purging limits
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Listener - Received request from ::ffff:192.168.0.204
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
Clean Up Thread - Performing house cleaning
Checking repository integrity
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Listener - Received slaves update request from: ::ffff:192.168.0.111
Listener - Creating new thread...
Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 9 updated slaves
Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent!
Listener - Received slaves update request from: ::ffff:192.168.0.111
Listener - Creating new thread...
Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 4 updated slaves
Listener - Received request from ::ffff:192.168.0.204
Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent!
---- January 09 2012 -- 10:44 PM ----
Clean Up Thread - Performing house cleaning
Scanning for dropped jobs...
Done.
Performing Job Repository Scan...
Done.
Scheduler - Job chooser found no jobs.
Scheduler - No jobs found for Rnd04
It was not. I recently reinstalled this particular slave and since all has been running just fine, I forgot to add it to the pool.
My apologies, it really didn’t occur to me and wouldn’t have if you didn’t mention it. I thought that only the groups were exclusive in a sense that if a slave isn’t in a particular group, it gets excluded, however, I thought that pools were non-exclusive, i.e. if there is nothing else to do, but a particular job in a particular pool you’re not in, render anyways since being idle is a waste of time.
Doesn’t it work this way? I mean, that’s the impression I was under.