AWS Thinkbox Discussion Forums

Slave not picking up jobs

Hi,

There are two jobs active on the farm right now that are not being picked up by one of our machines (CELL-WS-26). I’m not sure what the issue could be here. I checked the pool and group assignments, and they all look correct to me. Also, when I filter the list in the Slaves panel with the Job Candidate Filter, it lists that machine.
The machine was started around 23:05 and was shut down again by Deadline around 23:10; the Pulse log is attached.
Here’s an excerpt of the log containing all the lines with the machine name in them:

Line 282: 2014-10-15 23:01:54: Power Management - Machine Startup: - slaves that are candidates to be woken up based on the job’s pool, group, and bad slave list: CELL-RS-09,CELL-RS-10,CELL-WS-25,CELL-WS-26
Line 289: 2014-10-15 23:01:54: Power Management - Machine Startup: - checking slave CELL-WS-26
Line 290: 2014-10-15 23:01:54: Power Management - Machine Startup: - waking up offline slave CELL-WS-26 because it is required to render job "wmp_wav_001_010_rndr_diamondsim_v013_an "
Line 295: 2014-10-15 23:01:54: Power Management - Machine Startup: - slaves that are candidates to be woken up based on the job’s pool, group, and bad slave list: CELL-RS-09,CELL-RS-10,CELL-WS-25,CELL-WS-26
Line 299: 2014-10-15 23:01:54: Power Management - Machine Startup: - skipping slave CELL-WS-26 because it has already been chosen to wake up for another job
Line 305: 2014-10-15 23:01:54: Power Management - Machine Startup: - slaves that are candidates to be woken up based on the job’s pool, group, and bad slave list: CELL-RS-09,CELL-RS-10,cell-ws-14,CELL-WS-25,CELL-WS-26
Line 311: 2014-10-15 23:01:54: Power Management - Machine Startup: - skipping slave CELL-WS-26 because it has already been chosen to wake up for another job
Line 316: 2014-10-15 23:01:54: Power Management - Machine Startup: - slaves that are candidates to be woken up based on the job’s pool, group, and bad slave list: CELL-RS-09,CELL-RS-10,cell-ws-14,CELL-WS-25,CELL-WS-26
Line 321: 2014-10-15 23:01:54: Power Management - Machine Startup: - skipping slave CELL-WS-26 because it has already been chosen to wake up for another job
Line 328: 2014-10-15 23:02:34: Server Thread - Auto Configuration: Picking configuration based on: CELL-WS-26 / 192.168.1.36
Line 761: 2014-10-15 23:05:03: Power Management - Idle Shutdown: Idle slave CELL-WS-26 (CELL-WS-26) in group "all" has only been idle for 2 minutes
Line 1172: 2014-10-15 23:08:03: Power Management - Idle Shutdown: Idle slave CELL-WS-26 (CELL-WS-26) in group "all" has been idle for 5 minutes and is a candidate to be shut down
Line 1173: 2014-10-15 23:08:03: Power Management - Idle Shutdown: Shutting down idle slave CELL-WS-26 (CELL-WS-26) because it has been idle for longer than 5 minutes
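For reference, the excerpt above is just the result of filtering the Pulse log for lines that mention the machine name. A minimal stand-alone sketch of that kind of filter (the file names and paths are examples only):

```python
# Minimal sketch: print every line of a Deadline Pulse log that mentions a
# given machine name (case-insensitive). Purely illustrative; the log path
# and machine name are taken from the command line.
import sys

def grep_log(log_path, machine_name):
    """Yield (line_number, line) pairs for lines mentioning machine_name."""
    needle = machine_name.lower()
    with open(log_path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            if needle in line.lower():
                yield number, line.rstrip("\n")

if __name__ == "__main__":
    # Usage: python grep_pulse_log.py <pulse_log> <machine_name>
    log_path, machine_name = sys.argv[1], sys.argv[2]
    for number, line in grep_log(log_path, machine_name):
        print("Line %d: %s" % (number, line))
```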

Not sure if this is related to this bug

http://forums.thinkboxsoftware.com/viewtopic.php?f=205&t=12422

so I decided to start a new thread.

Cheers,
Holger
deadlinepulse-cell-temp-01-2014-10-15-0000.zip (923 KB)

Did the two jobs have a machine limit set? Is the Job Dequeue Mode set to something other than All Jobs? You can check the Job Dequeue Mode in the Slave Settings by right-clicking on the slave and selecting Modify Slave Properties.

Also, it would be helpful to see a log from the slave itself (CELL-WS-26). If you don’t have Slave Verbose Logging enabled in the Application Logging section of the Repository Options, please turn it on and that might help debug any unexpected slave behavior going forward.
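If it’s easier to check the machine limit outside the Monitor UI, a rough Deadline script along these lines could print it for both jobs. This is only a sketch: RepositoryUtils.GetJob is taken from the Deadline Scripting API, but the JobMachineLimit property name and the placeholder job IDs are assumptions that should be verified against the Scripting API docs for your Deadline version.

```python
# Hedged sketch, not an official example: print the machine limit of the two
# stuck jobs from a Deadline Monitor script. RepositoryUtils.GetJob and the
# JobMachineLimit property are assumptions to verify against the Scripting
# API docs for your Deadline version.
from Deadline.Scripting import RepositoryUtils

def __main__(*args):
    # Replace these placeholders with the real IDs of the two stuck jobs.
    job_ids = ["<job id 1>", "<job id 2>"]
    for job_id in job_ids:
        job = RepositoryUtils.GetJob(job_id, True)  # True = bypass cached data
        if job is None:
            print("Job %s not found" % job_id)
            continue
        # A machine limit of 0 normally means "no limit".
        print("%s: machine limit = %s" % (job.JobName, job.JobMachineLimit))
```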

Thanks!
Ryan

No.

No.

I just activated that and will observe the behaviour and send one of the logs the next time this happens.

cheers,
Holger

It just happened again. The machines cell-ws-17 and cell-ws-18 both don’t pick up the topmost job from the screenshot, ‘wls_wir_007_210_comp_v001_gs.nk’.
When I connected to cell-ws-17 and right-clicked to open the Launcher menu so I could open the log file folder, the Launcher and Slave crashed. I’m not sure whether this is related; cell-ws-18 did not crash. Attached are the Slave logs for both machines, the Launcher log for the one that crashed, plus the Pulse log. They should all be verbose.
Hope that will help.

Cheers,
Holger
Deadline_logs_2014-10-22.zip (939 KB)

Stupid me! I just saw that we don’t have enough beta licenses for the machines that are currently powered on :wink:
So I guess that was probably also the reason last time, as I didn’t pay attention to that.
Seems this issue can be closed…

Unfortunately, I have to exhume this thread.
We just had that situation again, and this time I’m sure it wasn’t due to the licenses being maxed out.
In the attached screenshot you can see that the machine called cell-ws-15 is a candidate for the highlighted job. It had been idle for 5.5 minutes and didn’t pick it up. I also attached the log file from that day/time; it was around 19:20 when I made that screenshot.

Cheers,
Holger
deadlineslave-cell-ws-15-2015-07-20-0028.log (23.8 KB)

A couple of questions:

Were there any limits on the Jobs? Either Machine Limits or regular Limits might cause this.

Has this Slave been able to render other Jobs, or was it just this one it was having trouble with? If it’s just that one Slave having trouble with multiple different Jobs, you should check its Job Dequeuing mode (right-click the Slave -> ‘Modify Slave Properties’ -> ‘Job Dequeuing’) and make sure that it’s set to ‘All Jobs’.
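If there are more than a couple of Slaves to check, a script roughly like the one below could list the Job Dequeuing mode of every Slave in one go. Treat it as a sketch only: RepositoryUtils.GetSlaveNames and RepositoryUtils.GetSlaveSettings should be available in the Deadline Scripting API, but the dequeue-mode property name used here (SlaveJobDequeuingMode) is an assumption and needs to be checked against the Scripting API docs for your Deadline version.

```python
# Hedged sketch: list the Job Dequeuing mode of every Slave. The property
# name SlaveJobDequeuingMode is an assumption; verify it (and the exact
# RepositoryUtils signatures) against the Deadline Scripting API docs.
from Deadline.Scripting import RepositoryUtils

def __main__(*args):
    for slave_name in RepositoryUtils.GetSlaveNames(True):  # True = bypass cache
        settings = RepositoryUtils.GetSlaveSettings(slave_name, True)
        # Anything other than 'All Jobs' can make a Slave skip jobs it would
        # otherwise be a candidate for.
        print("%s: dequeue mode = %s" % (slave_name, settings.SlaveJobDequeuingMode))
```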

Hi Jon,

It was indeed the Job Dequeuing mode. So it’s not a bug then - except for the one in front of the machine :wink:
Thanks!

Cheers,
Holger

Gotcha; good to hear it was a simple one!

We’ll make sure to add some messaging to the Slave log to indicate when this is happening, though; as you’ve seen, it definitely isn’t immediately obvious what’s going on when a Slave is skipping jobs because of that setting.

Adding that info to the log is a good idea.
Is there actually any chance that you could also add this to the algorithm of the ‘Job Candidate’ filter? I think this would make sense, since a machine that can’t render the job because of its dequeuing settings is effectively not a candidate for that job anymore. What do you think? Not seeing the machine in the Slave Panel with the ‘Job Candidate Filter’ active would have made me suspect some kind of setting that disqualifies it for the selected job. Seeing the machine listed as a candidate, on the other hand, doesn’t give any hint that a setting might be stopping it from picking up the job; it rather makes you wonder “why the hell does it not start a task from this job when it’s being listed as a candidate?!” :wink:
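To illustrate the idea, here is a purely hypothetical sketch of what the extended candidate check could look like. This is not Deadline’s actual implementation, just the logic being proposed: a Slave whose dequeuing settings exclude the job would simply never show up as a candidate.

```python
# Purely illustrative sketch of the proposed Job Candidate logic; this is NOT
# Deadline's actual implementation. The idea: the Slave's dequeue mode should
# disqualify it the same way pools, groups and the bad slave list already do.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Slave:
    name: str
    pools: List[str]
    groups: List[str]
    dequeue_mode: str = "All Jobs"              # "All Jobs" or a restricted mode
    dequeue_job_ids: List[str] = field(default_factory=list)

@dataclass
class Job:
    id: str
    pool: str
    group: str
    bad_slave_list: List[str] = field(default_factory=list)

def is_job_candidate(slave: Slave, job: Job) -> bool:
    """Return True if the Slave should appear in the Job Candidate filter."""
    if job.pool not in slave.pools:
        return False
    if job.group not in slave.groups:
        return False
    if slave.name in job.bad_slave_list:
        return False
    # The proposed addition: a restrictive dequeue mode also disqualifies the Slave.
    if slave.dequeue_mode != "All Jobs" and job.id not in slave.dequeue_job_ids:
        return False
    return True
```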

Cheers,
Holger

Agreed, the Job Candidate filter should probably be updated.

On the other hand, I also agree with you that it’s not super helpful in determining why certain Slaves aren’t in that filter, so part of me just wants to replace that filter with a more comprehensive and explicit interface that tells you the reason why each slave can/can’t render a given job. That’s more of a future project though :slight_smile:

Yes, an improved version of that filter would be good to have. I wouldn’t really replace the existing one, though; it’s actually good to have this ‘one-click filter’. Rather, add some additional interface/view/panel. Where or how exactly, I can’t tell right now. I’ll have to think about it.

Cheers,
Holger
