Restart worker after every task, by default?

silcy · March 7, 2024, 11:22am

Hi,

I have a strange issue with Houdini and Redshift where I have to restart the program after each render. It only happens intermittently, on specific render nodes, and it’s been incredibly hard to debug. The surefire way to make it work is to restart the worker after each render. That works great in the Deadline Worker, but unfortunately only for one task.

Is there a way to force the worker to restart after every task on a specific render node?

Thanks,
L

karpreet · March 8, 2024, 10:30pm

We can definitely hack the restart of the Worker after every task by applying a post-task script that restarts the Worker. Here is the documentation: Job Scripts — Deadline 10.3.1.4 documentation

But we should look into what causes the Worker to not cleanly dequeue the job after every task. Could you share the logs from a Worker that reproduces this behavior? Here is how you can get the Worker logs.

Are you seeing this behavior across all the Worker nodes, or only on specific machines?
Is it happening on a specific type of job, or on all the Houdini and Redshift jobs?
What version of Deadline are you running on the farm?

silcy · March 17, 2024, 6:23pm

Sorry for the slow reply, but I managed to fix the error, so I don’t need to restart the worker each time now! The worker was getting stuck on a certain machine during the Redshift pre-rendering setup stage (before it even managed to start logging anything). I fixed it by pre-caching .rs files instead. I don’t normally like to do that, as it has caused issues in the past, but it seemed to work pretty well this time, so I think I will continue to do it that way.

silcy · April 4, 2024, 1:48pm

This issue has seemingly gotten a bit worse now, and I don’t really have time to debug it. It only happens on certain machines, and only sometimes, so it’s super hard to track down. It only affects Houdini+RS jobs, and seems to happen only when I am rendering a scene with a huge number of scattered objects. The worker log also seems to be empty. I’m on Deadline 10.2.1.0, Windows 10. I wrote this code to restart the worker, in case anyone else has this issue and needs a quick fix:

from Deadline.Scripting import RepositoryUtils, SlaveUtils

def __main__(*args):
	# Deadline passes the plugin instance as the first argument to job scripts.
	deadlinePlugin = args[0]
	job = deadlinePlugin.GetJob()
	tasks = [deadlinePlugin.GetCurrentTask()]
	slave = deadlinePlugin.GetSlaveName()

	# Look up the machine this Worker runs on (True requests fresh, uncached info).
	slaveInfo = RepositoryUtils.GetSlaveInfo(slave, True)
	machine = slaveInfo.MachineRealName

	# Mark the current task complete first, then ask the Worker to relaunch.
	RepositoryUtils.CompleteTasks(job, tasks, slave)
	SlaveUtils.SendRemoteCommand(machine, "RelaunchSlave " + slave)

You just have to put this in the Post Task Script box in the Job Properties → Scripts panel.

I’m not a programmer, so please let me know if I’m doing anything bad here, but it seems to work on my machines. I haven’t had any hanging since doing this, but will report back if I do.
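
The original question asked about restarting only on a specific render node, while the script above relaunches whichever Worker ran the task. One possible way to narrow it down is a small name check, sketched below in plain Python. The Worker names and the `should_restart` helper are hypothetical, made up for illustration; they are not part of the Deadline API.

```python
# Hypothetical list of Workers that need the restart workaround.
# Replace these placeholder names with the affected render nodes.
RESTART_WORKERS = {"rendernode07", "rendernode12"}


def should_restart(worker_name, restart_workers=RESTART_WORKERS):
    """Return True if this Worker is on the restart list.

    The comparison ignores case and surrounding whitespace, on the
    assumption that Worker names may be reported with varying case.
    """
    return worker_name.strip().lower() in restart_workers
```

In the post task script, this could be called with the value of deadlinePlugin.GetSlaveName(), returning early from __main__ when it is False so that healthy machines are left alone.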