AWS Thinkbox Discussion Forums

hung python process - requeue doesn't work

We have a Python subprocess that is initiated from the mayabatch.py plugin, and it sometimes hangs. Even if we requeue the affected frames, the slave never recovers unless it is forcefully restarted. It just keeps repeating this:

[code]
---- 2014/03/05 16:04 ----
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
Scheduler Thread - Cancelling task…
---- 2014/03/05 16:05 ----
Connecting to slave log: LAPRO0129
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
Scheduler Thread - Cancelling task…
Listener Thread - fe80::ce3:3c3d:53ed:fde2%15 has connected
Listener Thread - Received message: StreamLog
Listener Thread - Responded with: Success
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “20_1060-1060” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0129, new slave =
current frames = 1060-1060, new frames = 1060-1060
[/code]
The hang was caused by a stdout read deadlock between a subprocess started from mayabatch and another subprocess. We have since fixed the deadlock, but we figured this is a robustness issue in Deadline as well, so we wanted to report it.
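
For reference, the deadlock was the classic "wait before draining the pipe" pattern. A minimal sketch (the script name is made up, this is not our actual code):

[code]
import subprocess

# Deadlock-prone pattern: the parent waits for the child to exit without
# draining its stdout pipe. Once the OS pipe buffer fills, the child's
# writes block, it never exits, and both processes hang.
proc = subprocess.Popen(
    ["python", "convert_proxy.py"],      # hypothetical converter script
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
proc.wait()                              # can block forever
out = proc.stdout.read()                 # never reached

# Safer pattern: communicate() drains both pipes while waiting, so the
# child can never block on a full pipe buffer.
proc = subprocess.Popen(
    ["python", "convert_proxy.py"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()
[/code]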

For the slave to be forcefully restarted, we have to manually kill the mayabatch processes; otherwise the slave never restarts.

Hey Laszlo,

Do you have the log from when the slave started printing out these messages? Just curious to see the last thing the slave was doing before it got stuck.

It’s a known issue with Windows subprocesses. It’s the same issue you guys ran into with the custom image viewers in the Monitor. If a parent process passes its handles to the child (which is necessary for stdout redirection), this situation can occur. We were able to fix it for the Monitor image viewers because we don’t need to redirect their stdout. However, for rendering processes, we really don’t have a choice.
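
To illustrate the difference (a rough sketch, not Deadline’s actual internals):

[code]
# Redirecting a child's stdout means the pipe's write handle has to be
# passed (inherited) to the child. On Windows, another child launched while
# that handle is inheritable can pick it up too, and the parent's read on
# the pipe won't see EOF until every holder of the handle has exited.
import subprocess

def launch_render(cmd):
    # Rendering process: we have to capture its output, so pipe handles
    # are handed to the child. This is the case we can't avoid.
    return subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)

def launch_viewer(cmd):
    # Monitor image viewer: we don't need its output, so nothing is
    # redirected and no pipe handles are passed along. That's why the fix
    # was possible there.
    return subprocess.Popen(cmd)
[/code]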

Cheers,
Ryan

It’s running an internal process. Basically, after the render finishes, we need to convert the EXRs to a JPEG sequence, apply gamma correction, etc. to create a proxy version.

We currently use Draft for that. Because the Python environment in Deadline is not encapsulated per task, we kept running into Draft licensing issues. So recently, instead of importing Draft directly, it was changed to spawn an external Python script that does the conversion.

However, stdout/stdin was not handled properly, and this sub-script would sometimes get stuck in a deadlock. When it did, the slave itself would go into this odd hanging state.
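
One way to spawn the external conversion script without risking that deadlock (a sketch only; the wrapper script name and its flags are placeholders, not our actual code):

[code]
import subprocess
import threading

def run_proxy_conversion(exr_dir, jpeg_dir, gamma=2.2):
    # Give the child a closed stdin and drain its stdout on a background
    # thread, so the pipe buffer can never fill up while we wait.
    cmd = ["python", "make_proxy.py",            # placeholder wrapper script
           "--in", exr_dir, "--out", jpeg_dir, "--gamma", str(gamma)]
    proc = subprocess.Popen(cmd,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    proc.stdin.close()                           # child never waits for input

    def pump():
        # Forward the child's output to our own stdout (and so to the
        # slave log) as it arrives.
        for line in iter(proc.stdout.readline, b""):
            print(line.decode("utf-8", "replace").rstrip())

    reader = threading.Thread(target=pump)
    reader.daemon = True
    reader.start()
    code = proc.wait()
    reader.join()
    return code
[/code]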

(Side note: we are considering switching from Draft to rvio, due to the lack of LUT and EXR metadata support in Draft. The licensing hassles don’t help.)

Thanks for the additional details. We really want to start looking at the sandboxed Python environment in Deadline 8 (to coincide with the scheduling refactoring we want to do).

What’s the Draft licensing issue you are having?

Laszlo, can you expand on the Draft licensing issues you mentioned? I don’t want to make any assumptions about what these issues might be. :)

-i

… and CB beat me to the punch…

Deadline runs a single Python environment (instead of one encapsulated per task), so once you have imported Draft into it, it holds on to that license.

While you guys have been extremely accommodating in giving us more Draft licenses upon request, we do have license issues with Draft because of this “holding onto it” behavior. Either we run out of licenses (the Draft count does not match the Deadline license count, so over time slaves that are not even using Draft at the moment fill up our license count), or the flex ‘keepalive’ signals fail due to a temporary network outage, which then causes that Deadline slave to fail all Draft-related requests.

Would the ability to acquire and release Draft licenses by command be helpful? I think it might be slightly dangerous, but in your case it would let your custom scripts be responsible for handing the licenses back, rather than relying on the current process-exit behaviour, and would get you the flexibility you are looking for.

Yes, I think that would help; we could encapsulate its usage with acquire/release code.
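
Something along these lines, purely hypothetical since no such command exists today (Draft.AcquireLicense / Draft.ReleaseLicense are invented names for whatever Thinkbox might expose):

[code]
# Purely hypothetical sketch -- Draft.AcquireLicense / Draft.ReleaseLicense
# do not exist today; they are placeholder names for an explicit
# acquire/release command.
import contextlib
import Draft   # Thinkbox Draft module available on the render nodes

@contextlib.contextmanager
def draft_license():
    Draft.AcquireLicense()        # hypothetical: check a license out explicitly
    try:
        yield
    finally:
        Draft.ReleaseLicense()    # hypothetical: hand it straight back instead
                                  # of holding it until the slave process exits

# Usage in the conversion script:
# with draft_license():
#     ...do the EXR -> JPEG proxy conversion with Draft...
[/code]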
