Service "starting" - slave idle, yet not rendering anything.

I’m having (again) an interesting issue with one of my slaves.

The slave first refused to wake up (WOL) on the PowerManager, then I fired him up via the Deadline Monitor which, surprisingly, woke the slave up, but, the slave currently sits in the monitor as “IDLE”, but doesn’t perform any tasks at all.

When I remote log in on the machine I see the Deadline service being in the state of “starting”, which is strange. I can’t disable/enable it manually and I can’t do much from within the Monitor. Restarting the machine didn’t help. Restarting the license server didn’t help either. The machine is just “IDLE” and that’s it. Also reinstalling the Deadline Client didn’t help. The license is being served, but the service just refuses to start.

The logs don’t show anything particularly useful:

2012-01-08 17:48:44: Starting between task wait - seconds: 10 2012-01-08 17:48:56: Scheduler - Contacting Deadline Pulse running on "RAMMSTEIN", port 5046. 2012-01-08 17:48:56: Scheduler - Requesting work from Deadline Pulse... 2012-01-08 17:49:06: Scheduler - No jobs found. 2012-01-08 17:49:06: Starting between task wait - seconds: 10

¨2012-01-08 17:36:56: BEGIN - RND04\loocas 2012-01-08 17:36:56: Start-up 2012-01-08 17:36:56: 2012-01-08 17:36:56 2012-01-08 17:36:56: Deadline Launcher 5.1 [v5.1.0.46114 R] 2012-01-08 17:36:56: WARNING: Repository is not valid, so no upgrades will occur. 2012-01-08 17:36:56: Launching Slave: Rnd04 2012-01-08 17:36:56: Launcher Thread - Launcher thread initializing... 2012-01-08 17:36:56: Perfoming remote admin check 2012-01-08 17:36:58: Remote Administration is now enabled 2012-01-08 17:36:58: Launcher Thread - Remote administration is enabled 2012-01-08 17:36:58: Launcher Thread - Launcher thread listening on port 5042 2012-01-08 17:37:58: Perfoming remote admin check 2012-01-08 17:40:00: Perfoming remote admin check 2012-01-08 17:40:50: ::ffff:192.168.0.111 has connected 2012-01-08 17:40:50: Launcher Thread - Received command: ForceRelaunchSlave Rnd04 2012-01-08 17:40:50: Stoping Slave: Rnd04 2012-01-08 17:40:50: Sending command to slave: StopSlave 2012-01-08 17:40:50: Got reply: RND04: Sent "StopSlave" command. Result: "" 2012-01-08 17:40:53: Launcher Thread - Responded with: Success| 2012-01-08 17:41:03: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version 2012-01-08 17:41:03: Network python version file: \\rammstein\deadline\python\Windows\2.6.7\Version 2012-01-08 17:41:03: Comparing python version files 2012-01-08 17:41:03: Python upgrade skipped because Version files are the same 2012-01-08 17:41:03: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version 2012-01-08 17:41:03: Network version file: \\rammstein\deadline\bin\Windows\Version 2012-01-08 17:41:03: Comparing version files 2012-01-08 17:41:03: Launching Slave: Rnd04 2012-01-08 17:41:57: fe80::f8f1:52a4:a57f:dbf3%11 has connected 2012-01-08 17:41:57: Launcher Thread - Received command: LaunchSlavePromptForUpgrade 2012-01-08 17:41:58: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version 2012-01-08 17:41:58: Network python version file: \\rammstein\deadline\python\Windows\2.6.7\Version 2012-01-08 17:41:58: Comparing python version files 2012-01-08 17:41:58: Python upgrade skipped because Version files are the same 2012-01-08 17:41:58: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version 2012-01-08 17:41:58: Network version file: \\rammstein\deadline\bin\Windows\Version 2012-01-08 17:41:58: Comparing version files 2012-01-08 17:41:58: Launcher Thread - Responded with: Success| 2012-01-08 17:43:00: Perfoming remote admin check 2012-01-08 17:43:59: ::ffff:192.168.0.111 has connected 2012-01-08 17:43:59: Launcher Thread - Received command: ForceStopSlave Rnd04 2012-01-08 17:43:59: Stoping Slave: Rnd04 2012-01-08 17:43:59: Sending command to slave: StopSlave 2012-01-08 17:43:59: Got reply: RND04: Sent "StopSlave" command. Result: "" 2012-01-08 17:44:01: Launcher Thread - Responded with: Success| 2012-01-08 17:44:15: ::ffff:192.168.0.111 has connected 2012-01-08 17:44:15: Launcher Thread - Received command: LaunchSlave Rnd04 2012-01-08 17:44:16: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version 2012-01-08 17:44:16: Network python version file: \\rammstein\deadline\python\Windows\2.6.7\Version 2012-01-08 17:44:16: Comparing python version files 2012-01-08 17:44:16: Python upgrade skipped because Version files are the same 2012-01-08 17:44:16: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version 2012-01-08 17:44:16: Network version file: \\rammstein\deadline\bin\Windows\Version 2012-01-08 17:44:16: Comparing version files 2012-01-08 17:44:16: Launching Slave: Rnd04 2012-01-08 17:44:16: Launcher Thread - Responded with: Success| 2012-01-08 17:47:01: Perfoming remote admin check

service_starting.png

service_error.png

Thanks in advance, cheers…

HA! What the hell! When I run the slave manually, i.e. in user mode, it runs, but it still won’t get ANY tasks assigned! It’s completely being ignored even though there are jobs to do (not limited to specific machines or anything like that)!

Did you try restarting Pulse? If not, try that to see if it makes a difference.

If it doesn’t, then my next guess is that some job setting (pool, group, whitelist/blacklist, limit group, etc) is preventing jobs from running on that machine. To confirm this, you can try using the Slave Availability Filter in the Monitor (see second paragraph here):
thinkboxsoftware.com/deadlin … Slave_List

If that slave is disappearing from the Slave list when you click on jobs, then you know some job setting is preventing that slave from picking up that job.

Cheers,

  • Ryan

This is weird!

I just noticed the single render node is being ignored by all the jobs submitted via the integrated SMTD script directly from 3ds Max, but it gets picked up for anything else (for example Nuke or RVIO jobs that I submitted via my own Python scripts).

There are no machine limits set up in the SMTD, all the machines are whitelisted etc…

That is weird. In the job properties in the Monitor, does Rnd04 show up in the whitelist there? Just wondering if the submitter is somehow setting things wrong.

Just to confirm, was the slave availability filter removing this slave from the slave list for SMTD jobs?

I also restarted Pulse just in case and ran it in the GUI mode to see the log:

Listener - Listening for auto configuration requests on port 5043 Web Service - Web Service is disabled. It can be enabled in the Pulse Settings in the Repository Options Scheduler - Ignoring limit "999_050_999_71276e5a" because: Scheduler - minor exception: Error in file: E:\Deadline\limitGroups\999_050_999_71276e5a.limitGroup (System.Exception) Thermal Shutdown: Skipping zone "Idle Shutdown after 3hrs" because it is disabled Thermal Shutdown: Skipping zone "Restart after 24hrs" because it is disabled Idle Shutdown: Using default idle override settings for group "Idle Shutdown after 3hrs" Idle Shutdown: Checking for slaves in group "Idle Shutdown after 3hrs" that have been idle for at least 180 minutes Idle Shutdown: At least 0 slave(s) should be left running Idle Shutdown: Idle slave Rnd04 in group "Idle Shutdown after 3hrs" has only been idle for 18 minutes Idle Shutdown: There are no machines that need to be shutdown at this time Idle Shutdown: Skipping idle shutdown group "Restart after 24hrs" because it is disabled Machine Startup: Checking job "GT_AR_07_Max2011_loocas_002" (000_098_999_1ad7c727) Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs" Machine Startup: - only slaves with assigned pool "crisis" should be woken up Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01 Machine Startup: - checking slave Rnd01 Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_07_Max2011_loocas_002" Machine Startup: Checking job "GT_AR_05_Max2011_loocas_005" (000_097_999_78560528) Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs" Machine Startup: - only slaves with assigned pool "crisis" should be woken up Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01 Machine Startup: - checking slave Rnd01 Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_05_Max2011_loocas_005" Power Management: No power management notification address specified in Repository Options - cannot send notification Machine Restart: Skipping machine group "Idle Shutdown after 3hrs" because it is disabled Machine Restart: Checking machine group "Restart after 24hrs" Machine Restart: - slave has been running on machine Rnd02 for only 386 minutes Machine Restart: - slave has been running on machine Rnd03 for only 300 minutes Machine Restart: - slave has been running on machine Rnd04 for only 148 minutes Machine Restart: - slave has been running on machine Rnd05 for only 333 minutes Machine Restart: - slave has been running on machine Rnd06 for only 305 minutes Machine Restart: - slave has been running on machine Rnd07 for only 292 minutes Machine Restart: - slave has been running on machine Rnd08 for only 298 minutes Machine Restart: - slave has been running on machine Vfx01 for only 332 minutes Machine Restart: - slave has been running on machine Vfx02 for only 298 minutes Machine Restart: There are no machines that need to be restarted at this time Slave Scheduling: Skipping scheduling group "Idle Shutdown after 3hrs" because it is disabled Slave Scheduling: Skipping scheduling group "Restart after 24hrs" because it is disabled Listener - Received request from ::ffff:192.168.0.204 Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Clean Up Thread - Performing house cleaning Checking repository integrity Purging limits Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Listener - Received request from ::ffff:192.168.0.204 Clean Up Thread - Performing house cleaning Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Clean Up Thread - Performing house cleaning Purging old job reports Cleaning up orphaned tasks Purging trash Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Listener - Received request from ::ffff:192.168.0.204 ---- ledna 09 2012 -- 10:42 odp. ---- Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Clean Up Thread - Performing house cleaning Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Listener - Received request from ::ffff:192.168.0.204 Clean Up Thread - Performing house cleaning Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Clean Up Thread - Performing house cleaning Purging trash Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Listener - Received request from ::ffff:192.168.0.204 Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Clean Up Thread - Performing house cleaning Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Listener - Received slaves update request from: ::ffff:192.168.0.111 Listener - Creating new thread... Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 10 updated slaves Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent! ---- ledna 09 2012 -- 10:43 odp. ---- Listener - Received request from ::ffff:192.168.0.204 Clean Up Thread - Performing house cleaning Purging repository temp files Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Listener - Received request from ::ffff:192.168.0.204 Listener - Received slaves update request from: ::ffff:192.168.0.111 Listener - Creating new thread... Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 6 updated slaves Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent! Thermal Shutdown: Skipping zone "Idle Shutdown after 3hrs" because it is disabled Thermal Shutdown: Skipping zone "Restart after 24hrs" because it is disabled Idle Shutdown: Using default idle override settings for group "Idle Shutdown after 3hrs" Idle Shutdown: Checking for slaves in group "Idle Shutdown after 3hrs" that have been idle for at least 180 minutes Idle Shutdown: At least 0 slave(s) should be left running Idle Shutdown: Idle slave Rnd04 in group "Idle Shutdown after 3hrs" has only been idle for 20 minutes Idle Shutdown: There are no machines that need to be shutdown at this time Idle Shutdown: Skipping idle shutdown group "Restart after 24hrs" because it is disabled Machine Startup: Checking job "GT_AR_07_Max2011_loocas_002" (000_098_999_1ad7c727) Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs" Machine Startup: - only slaves with assigned pool "crisis" should be woken up Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01 Machine Startup: - checking slave Rnd01 Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_07_Max2011_loocas_002" Machine Startup: Checking job "GT_AR_05_Max2011_loocas_005" (000_097_999_78560528) Machine Startup: Checking slaves in Machine Group "Idle Shutdown after 3hrs" Machine Startup: - only slaves with assigned pool "crisis" should be woken up Machine Startup: - job is not part of a group, any slaves that are included in the 'none' group should be woken up Machine Startup: - slaves that are candidates to be woken up based on pool and group: Rnd01 Machine Startup: - checking slave Rnd01 Machine Startup: - waking up offline slave Rnd01 because it is required to render job "GT_AR_05_Max2011_loocas_005" Power Management: No power management notification address specified in Repository Options - cannot send notification Machine Restart: Skipping machine group "Idle Shutdown after 3hrs" because it is disabled Machine Restart: Checking machine group "Restart after 24hrs" Machine Restart: - slave has been running on machine Rnd02 for only 388 minutes Machine Restart: - slave has been running on machine Rnd03 for only 302 minutes Machine Restart: - slave has been running on machine Rnd04 for only 150 minutes Machine Restart: - slave has been running on machine Rnd05 for only 335 minutes Machine Restart: - slave has been running on machine Rnd06 for only 308 minutes Machine Restart: - slave has been running on machine Rnd07 for only 294 minutes Machine Restart: - slave has been running on machine Rnd08 for only 300 minutes Machine Restart: - slave has been running on machine Vfx01 for only 334 minutes Machine Restart: - slave has been running on machine Vfx02 for only 299 minutes Machine Restart: There are no machines that need to be restarted at this time Slave Scheduling: Skipping scheduling group "Idle Shutdown after 3hrs" because it is disabled Slave Scheduling: Skipping scheduling group "Restart after 24hrs" because it is disabled Clean Up Thread - Performing house cleaning Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Clean Up Thread - Performing house cleaning Purging limits Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Listener - Received request from ::ffff:192.168.0.204 Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04 Clean Up Thread - Performing house cleaning Checking repository integrity Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Listener - Received slaves update request from: ::ffff:192.168.0.111 Listener - Creating new thread... Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 9 updated slaves Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent! Listener - Received slaves update request from: ::ffff:192.168.0.111 Listener - Creating new thread... Listener - Slaves Update: ::ffff:192.168.0.111 - Sending update for 4 updated slaves Listener - Received request from ::ffff:192.168.0.204 Listener - Slaves Update: ::ffff:192.168.0.111 - Update sent! ---- January 09 2012 -- 10:44 PM ---- Clean Up Thread - Performing house cleaning Scanning for dropped jobs... Done. Performing Job Repository Scan... Done. Scheduler - Job chooser found no jobs. Scheduler - No jobs found for Rnd04

As for the white list, this is what I see:

[attachment=0]monitor.png[/attachment]

SMTD also shows the slave as available in the whitelist.

Is that slave in the crisis pool?

I feel like an idiot now.

It was not. I recently reinstalled this particular slave and since all has been running just fine, I forgot to add it to the pool.

My apologies, it really didn’t occur to me and wouldn’t have if you didn’t mention it. I thought that only the groups were exclusive in a sense that if a slave isn’t in a particular group, it gets excluded, however, I thought that pools were non-exclusive, i.e. if there is nothing else to do, but a particular job in a particular pool you’re not in, render anyways since being idle is a waste of time.

Doesn’t it work this way? I mean, that’s the impression I was under.

Anyways, again, thank you for the help!

No worries! We’re actually glad that it was expected behavior, rather than some sort of weird bug! :slight_smile:

Pools are exclusive, just like groups. The only difference between the two is that pools are priority based, and groups are not.

Cheers,

  • Ryan