AWS Thinkbox Discussion Forums

Remote fix of Stalled machine

Is there any way to fix all stalled machines at once from a single machine using the Monitor?

I haven’t been able to.
Currently the easiest approach seems to be to log on to the machine and restart deadline10launcherservice (does it actually keep running when it stalls?)
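For anyone who wants to avoid logging on to each box, here is a rough sketch of restarting the service remotely from one machine. It assumes Windows workers that accept remote service control through the standard `sc` tool and an account with rights on them; the function names, the host names, and the 10 second wait are made up for illustration, and it will not help if the machine itself is hung.

```python
import subprocess
import time

SERVICE = "deadline10launcherservice"  # service name as mentioned in this thread


def sc_command(host, action, service=SERVICE):
    """Build an `sc` command line targeting a service on a remote Windows host."""
    return ["sc", "\\\\" + host, action, service]


def restart_launcher(hosts, wait_seconds=10):
    """Stop, wait, then start the launcher service on each host.

    Returns a dict mapping host name to True if the start command succeeded.
    """
    results = {}
    for host in hosts:
        # A failed stop (service already stopped) is not treated as fatal.
        subprocess.run(sc_command(host, "stop"), check=False)
        # Crude fixed wait: `sc stop` returns before the service has fully stopped.
        time.sleep(wait_seconds)
        started = subprocess.run(sc_command(host, "start"), check=False)
        results[host] = started.returncode == 0
    return results
```

Usage would be something like `restart_launcher(["render01", "render02"])`, with the host list coming from wherever you track stalled workers. On Linux workers the analogue would be running the service restart over SSH instead.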

Yes, that's what we do as well currently: restart the machines.
And no, once it stalls the machine doesn't run any jobs unless we restart it…

I wonder if it’s possible to combine Power Management with events/house cleaning to power cycle stalled nodes. That would make a good feature request.

If it’s repeatedly stalling, there’s probably a bigger underlying issue, such as jobs exceeding the available memory.

My point was, it is usually sufficient to restart the service, and not the machine.
In this case I do not understand exactly why the machine is seen as stalled…

I find that if the machine has run out of either RAM or VRAM, the launcher app or the whole system hangs completely; the machine shows as stalled and the only fix is a reboot.

It would be great to find some way of checking the stalled system; if it’s not possible to restart the launcher automatically, a system reboot could then be triggered.

Again, with the way power management is handled, it’d be good to check for other processes, but this may not be possible if the launcher has crashed. So at that point, an emergency measure would be to have a power group for render nodes where, if they’ve been stalled for 30 minutes, a reboot is forced.
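The 30 minute rule itself is easy to sketch. The snippet below only covers the selection step: it assumes you already have, from somewhere, a mapping of node name to the time it was first seen as stalled. How that mapping is obtained from Deadline is not shown here, and `nodes_to_reboot` and the node names are hypothetical.

```python
from datetime import datetime, timedelta


def nodes_to_reboot(stalled_since, now, threshold=timedelta(minutes=30)):
    """Return the names of nodes that have been stalled for at least `threshold`.

    stalled_since: dict of node name -> datetime when the node was first seen stalled
                   (hypothetical input; collecting it is outside this sketch).
    """
    return sorted(
        name for name, since in stalled_since.items() if now - since >= threshold
    )


# Example: one node stalled for 40 minutes, one for 15 minutes.
now = datetime(2024, 1, 1, 12, 0)
stalled = {
    "render01": datetime(2024, 1, 1, 11, 20),
    "render02": datetime(2024, 1, 1, 11, 45),
}
candidates = nodes_to_reboot(stalled, now)
```

Only nodes past the threshold would then be handed to whatever does the actual forced reboot, so a node that stalls briefly and recovers is left alone.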

It’s possible to remote into the machine and relaunch the launcher/worker, but if this is an unmanned evening/weekend then it’s handier just to restart the machine.


One could add a script that uses IPMI (or whatever your flavor of remote management is) to connect to the machine and force a reboot.
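A minimal sketch of such a script, assuming the nodes have a BMC reachable over the network and that `ipmitool` is installed on the machine running it. The BMC host name and user are placeholders, and the password is read by `ipmitool` from the `IPMI_PASSWORD` environment variable (the `-E` option) so it does not appear on the command line.

```python
import subprocess


def ipmi_power_cycle_cmd(bmc_host, user):
    """Build an ipmitool command that power cycles a node through its BMC.

    -I lanplus : IPMI v2.0 over LAN
    -E         : take the password from the IPMI_PASSWORD environment variable
    """
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host,
        "-U", user,
        "-E",
        "chassis", "power", "cycle",
    ]


def power_cycle(bmc_host, user):
    """Run the power cycle command; return True if ipmitool exited cleanly."""
    result = subprocess.run(ipmi_power_cycle_cmd(bmc_host, user), check=False)
    return result.returncode == 0
```

Since this is a hard power cycle, any render in progress on that node is lost, so it is best kept as a last resort for nodes that no longer respond to anything else.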
