AWS Thinkbox Discussion Forums

How is "Fail Task" implemented?

I’m working on some code in a Python script that will be run as a Deadline task, and I want it to behave optimally when the user clicks on “Fail Task” in the Deadline Monitor.

Specifically, I’m trying to get it to behave gracefully when it’s killed, so that it cleans up before it exits. If this was a render task you can imagine that the machine it’s running on might get pre-empted, and I want to checkpoint the current render and copy any generated files back to the fileserver before exiting.

My expectation is that normally in Linux, when you want to end a process, you send it a SIGTERM and then after a grace period (often one second) a SIGKILL. The SIGTERM gives the process a chance to clean up before the SIGKILL forcibly kills it. Ideally the managing process has some configurable grace period, since clean up can take some time, depending on the type of task.
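For reference, the task-side half of that SIGTERM convention can be sketched in Python roughly as follows. This is a minimal sketch, not anything Deadline provides: `run`, `process` and `cleanup` are hypothetical names, and it assumes the script runs on a POSIX system, installs the handler from the main thread, and actually receives a SIGTERM.

```python
import signal
import threading

# Set when SIGTERM arrives; the work loop checks it between units of work.
stop_requested = threading.Event()


def _handle_sigterm(signum, frame):
    # Keep the handler tiny: only record that a stop was requested.
    stop_requested.set()


def run(work_items, process, cleanup):
    """Process items until finished or until SIGTERM is received.

    `process` and `cleanup` are hypothetical callables supplied by the
    caller. Returns the number of items fully processed.
    """
    signal.signal(signal.SIGTERM, _handle_sigterm)
    done = 0
    try:
        for item in work_items:
            if stop_requested.is_set():
                break  # stop between items so the state stays consistent
            process(item)
            done += 1
    finally:
        # Always runs: e.g. write a checkpoint and copy files back.
        cleanup(done)
    return done
```

The handler only sets a flag, so the cleanup happens in normal code rather than inside the signal handler. None of this helps against SIGKILL, which cannot be caught.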

Is this how “Fail Task” is implemented in Deadline?

The logs for my process do not indicate that it receives any request to exit and I’m really a bit confused as to how Deadline is causing ‘failed’ processes to exit. How has this feature been implemented?

In the log for my process (which uses a custom plugin I developed) I’m seeing this output, which includes a couple of lines of ‘Progress’ output from my process:

2018-12-10 13:56:43:  0: STDOUT: Progress: 054%
2018-12-10 13:56:44:  0: Executing plugin command of type 'Cancel Task'
2018-12-10 13:56:44:  0: Done executing plugin command of type 'Cancel Task'
2018-12-10 13:56:44:  0: STDOUT: Progress: 055%
2018-12-10 13:56:45:  0: Done executing plugin command of type 'Render Task'
2018-12-10 13:56:45:  0: In the process of canceling current task: ignoring exception thrown by PluginLoader

Why am I seeing an exception thrown by PluginLoader? Am I doing something wrong perhaps in my Plugin code?

If I have a process tree that looks like this:

deadlinesandbox -> WrapperProcess -> RenderProcess

…it seems that the “Fail Task” operation first sends SIGKILL directly to RenderProcess, and only later sends SIGTERM to WrapperProcess. Is that by design?

It’d be great to have some clear information about what to expect here.

This one doesn’t come up often. :slight_smile:

It’s interesting that the kill/term calls are inverted. I can tell you that “exception thrown by PluginLoader” is shown because the way we stop any render process is we send a special exception to the system that won’t trigger a failed job report, but kills the process. That’s for simple plugins at least where Deadline owns the lifetime of the process.

In that case, we control your WrapperProcess using a special internal class that wraps C#'s Process class. Looking at our implementation, we walk over the process tree and all of its children, calling Kill() on each of them in the order the OS listed them to us. There’s a delay of 2 seconds between each process.

If that call fails, we use the kill() function in libc on non-Windows and Kernel32.TerminateProcess() on Windows.

The ‘term’ call is likely coming from up the stack when the Slave is trying to stop its sandbox and that may be bubbling down.

If you’re trying to snapshot in the process, normally you would want to use an AdvancedPlugin as that moves the responsibility of closing the application to the plugin and you’d regularly poll the DeadlinePlugin.IsCancelled() function. I still think you’ll be forcibly closed as the Slave is going down.

Inside Deadline there really isn’t a good way to pre-empt the closing of the Slave. You can try having the WrapperProcess poll to see if the instance is going down and race the Slave to the finish. EC2 has an endpoint for this and GCP seems to have a special script that can be run.
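As a rough illustration of the polling idea on EC2, here is a minimal sketch. The URL is the instance metadata path for Spot interruption notices, which returns 404 until an interruption is scheduled. The function names are hypothetical, and the sketch assumes plain IMDSv1-style access (instances that require IMDSv2 also need a session token header, which is omitted here).

```python
import json
import time
import urllib.error
import urllib.request

# EC2 Spot interruption notice path in the instance metadata service.
EC2_SPOT_ACTION_URL = (
    "http://169.254.169.254/latest/meta-data/spot/instance-action"
)


def fetch_spot_action(url=EC2_SPOT_ACTION_URL, timeout=1.0):
    """Return the parsed interruption notice, or None if none is pending."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # A 404 means nothing is scheduled; any other failure is also
        # treated as "no notice" in this sketch.
        return None


def wait_for_interruption(fetch=fetch_spot_action, interval=5.0,
                          max_polls=None, sleep=time.sleep):
    """Poll `fetch` until it returns a notice, then return that notice.

    Returns None if `max_polls` polls pass without a notice.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = fetch()
        if notice is not None:
            return notice
        polls += 1
        sleep(interval)
    return None
```

The WrapperProcess could run `wait_for_interruption()` in a background thread and start its checkpoint as soon as a notice comes back. `fetch` and `sleep` are parameters only so the loop can be exercised without a real metadata service.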

Hey thanks for this overview - that’s very helpful.

I’ll have a crack at turning my plugin into an AdvancedPlugin.

On slave shutdown - if I add another intermediate process I should be able to get my WrapperProcess to detach from the Slave and basically daemonize so that it’s then in control of its own shutdown (modulo the entire OS going away).
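The detaching step could look something like the sketch below. It assumes a POSIX system, `spawn_detached` is a hypothetical helper name, and I haven't verified how it interacts with Deadline's process-tree walk: the child only stops being a descendant of the Slave once the intermediate launcher process itself exits.

```python
import subprocess
import sys


def spawn_detached(argv, **popen_kwargs):
    """Start `argv` in a new session, detached from our process group.

    start_new_session=True makes the child call setsid(), so signals sent
    to the launcher's process group are not delivered to it.
    """
    return subprocess.Popen(
        argv,
        start_new_session=True,
        stdin=subprocess.DEVNULL,
        **popen_kwargs,
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical launcher usage: start the real wrapper, then exit at
    # once so the wrapper is re-parented away from this process.
    spawn_detached(sys.argv[1:])
```

The intermediate process would be this launcher: it starts the WrapperProcess and returns immediately, leaving the wrapper to manage its own shutdown.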

Unfortunately Deadline will for a short time believe that the process has released any licenses it is using (while it is actually still running as a daemon), so the Limits may be temporarily incorrect. It would be optimal if the Slave had a means to be shut down gracefully :slight_smile:

I’ve been working on an AdvancedPlugin using the Lightwave plugin as a base to work from, which means I’m subclassing ManagedProcess, however this approach has some idiosyncrasies that I don’t know how to resolve.

I need to capture the PID of my child process so that I can signal it to shutdown gracefully, so -

When the subprocess is spawned, there is no event that propagates to my code so that I can take the opportunity to get its PID. The best I can do is attach a callback with AddStdoutHandlerCallback so that I get called as soon as it produces any output.

But - if the child process errors on startup, so that it just prints something to stderr (or nothing at all) and then hangs, my code will not get notified of this. The StdoutHandler doesn’t parse anything that’s printed to stderr, which is great - I wouldn’t welcome that - but there does not seem to be an AddStderrHandlerCallback that I can use to watch for this output.

So currently I think the ManagedProcess framework doesn’t give me a reliable way to find the PID of my child process so that I can take responsibility for killing it.

If that is the case - is there any guidance or example for managing my own process lifecycle with the subprocess module? It looks as though minimally I should write a RenderTasksCallback that blocks until my child process has finished one way or another. Is that correct?

So in that case I can probably assume that the code is not using any asynchronous APIs and I should just use subprocess.Popen?
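To make the question concrete, this is roughly the blocking loop I have in mind, written with plain subprocess.Popen. It is a sketch only: `run_until_done_or_cancelled` and `is_cancelled` are hypothetical names, and in a plugin `is_cancelled` would presumably wrap the IsCancelled() check mentioned earlier in the thread, which I haven't verified. The 30 second grace period is an arbitrary default.

```python
import subprocess
import sys
import time


def run_until_done_or_cancelled(argv, is_cancelled,
                                grace_seconds=30.0, poll_interval=0.5):
    """Run `argv` and block until it exits.

    Returns (returncode, was_cancelled). `is_cancelled` is a hypothetical
    callable that returns True once the task should stop.
    """
    proc = subprocess.Popen(argv)
    # proc.pid is known from this point on; no stdout callback is needed.
    while proc.poll() is None:
        if is_cancelled():
            proc.terminate()  # SIGTERM on POSIX: ask for a clean exit
            try:
                proc.wait(timeout=grace_seconds)
            except subprocess.TimeoutExpired:
                proc.kill()  # SIGKILL on POSIX: grace period ran out
                proc.wait()
            return proc.returncode, True
        time.sleep(poll_interval)
    return proc.returncode, False
```

Because Popen returns as soon as the child is spawned, the PID is available even if the child prints nothing or only writes to stderr.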

One way to approach this would be to write an implementation of ManagedProcess with the behaviour that I want, but - to do this I’d need some documentation on the Interface that the ManagedProcess class is expected to implement and that does not seem to be public. Apologies if I’ve just missed that.

Specifically, I’d need to know what method gets called on a ManagedProcess instance when the DeadlinePlugin calls self.RunManagedProcess(ManagedProcess) - is there any available information on this that might help me?
