Problem Report on Maya crashed state

anon5390329 · September 14, 2009, 4:01am

We’re reporting some operational problems we’ve run into that we’re trying to tamp down before putting a few more machines in the dedicated render farm here.

We have 30 render farm hosts running WinXP x64 SP2. They auto-login to a special domain user, ‘deadline’, which runs Launcher and then Slave. Typically, no one logs into these machines; they’re 1U rack-mount servers with no graphics card or monitor. I don’t know if it’s important for this bug report, but we run Pulse on a separate machine under Windows Server 2003 (32-bit).

Sometimes when mayabatch.exe crashes (we use the MayaCmd plugin), the process doesn’t seem to exit cleanly. We’ve seen this problem now and then. Most of our jobs (we’re probably running at least 200 jobs a day) exit cleanly, but out of the 3-5 jobs (-> probably 40 tasks) a day that fail, we get this sort of occurrence maybe once every 3-5 days.

Usually we notice some task is lingering on a machine for an especially long time. The task hasn’t failed yet – from the Deadline Monitor view, you wouldn’t think anything was wrong other than that it’s a slow task. There are no error reports available from the Monitor.

However, if you Remote Desktop into the machine and examine the Slave window directly, you find that mayabatch.exe is trying to crash, as indicated by this message:

0: STDOUT: Result: C:/Documents and Settings/deadline/Local Settings/Application Data/Frantic Films/Deadline/slave/jobsData/c03_xxxx_xxxxxxx_mb.ma
0: STDOUT: Fatal Error. Attempting to save in C:/DOCUME~1/deadline/LOCALS~1/Temp/deadline.20090914.1126.ma

Following this there was some indication of what the fatal error was. Today’s case was in a plugin we’re using; in previous cases it was related to Mental Ray. Either way, the crash itself has no relationship to Deadline. However, when we Remote Desktop into the machine, we find that:

CPU is at 0%, and appears to have been that way for hours on end.
mayabatch.exe is still present in the process table, at its full memory size (~3GB in this case – note these machines are Win64 with 16GB of memory, so we routinely have 6-9GB render jobs)
Deadline Slave is idle, and the app’s UI is responsive

Huh. Earlier, we had a problem with the Maya Crash Event Reporter that created situations like this. We got an environment variable from Autodesk to disable that behavior (very bad idea on their part to enable that under mayabatch). Furthermore, in those cases, there was always a Crash Reporter window visible onscreen. In the cases we see now, there’s never any window open (except Slave itself, of course).

I then kill mayabatch. Slave notices that, but becomes very busy doing… something? Usually, for the sake of getting the machine back online, I kill and restart Slave at that point (I have other things to do besides babysit the render farm), although if I waited long enough Slave might well recover on its own. However, the fact that it gets very busy could be a clue.

Any guesses as to what might be happening to prevent mayabatch.exe from exiting cleanly or how to further improve the handling of this situation would be appreciated. Obviously, one goal we have is to be able to manage machines remotely without logging into them all the time; you can’t do that in this case because from afar, nothing appears to be wrong (you can’t see the error message yet). Furthermore, when this happens it tends to swallow a render farm machine for half a day or something until we notice it and manually fix it.

Unfortunately, this isn’t something we can address with task timeouts – we do have 12+ hour tasks sometimes. We could look at using Auto Job Timeout and Force Auto Job Timeout to work around this, but that has two scary aspects: first, for effects jobs, it’s quite normal to see a 20x time difference between successive tasks in the same job due to e.g. particle counts, so it would result in false positives that are then hard to work around; and second, it would waste a lot of time on the machines before killing the job. In fact, mayabatch is trying to exit; we just want to figure out how to help it exit right away!
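
For illustration only: since the hung process sits at 0% CPU for hours, one alternative to a wall-clock timeout would be to check whether the process's cumulative CPU time is still advancing. The sketch below is not part of Deadline; the function name, the sample format, and the 30-minute window are all assumptions, and collecting the samples (e.g. from the process table) is left out.

```python
from typing import Sequence, Tuple

# Hypothetical helper, not a Deadline API. Each sample is
# (wall_clock_seconds, cumulative_cpu_seconds), oldest first.
def is_stalled(samples: Sequence[Tuple[float, float]],
               window_seconds: float = 1800.0,
               min_cpu_seconds: float = 1.0) -> bool:
    """Return True if CPU time advanced by less than min_cpu_seconds
    over at least window_seconds of wall-clock time."""
    if not samples:
        return False
    latest_wall, latest_cpu = samples[-1]
    # Walk backwards to the newest sample that is old enough to span the window.
    for wall, cpu in reversed(samples):
        if latest_wall - wall >= window_seconds:
            return (latest_cpu - cpu) < min_cpu_seconds
    # Not enough history yet to make a call.
    return False


# A task that stopped consuming CPU versus one that is slow but still working.
hung = [(0.0, 100.0), (900.0, 100.0), (1800.0, 100.2)]
busy = [(0.0, 100.0), (900.0, 800.0), (1800.0, 1500.0)]
print(is_stalled(hung), is_stalled(busy))
```

Because the check looks at CPU progress rather than elapsed time, a legitimate 12-hour task would not trip it as long as it keeps working, though a task that is blocked on slow I/O for longer than the window could still produce a false positive.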

Related to that, one possible problem is that for some unclear reason mayabatch is having problems saving the crash dump, so another possible workaround might be to disable that. Does anyone have any experience setting MAYA_DEBUG_ENABLE_CRASH_REPORTING to 0?
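
A minimal sketch of how that experiment could be set up, assuming the variable is read from the environment of the mayabatch process. Whether a value of 0 actually stops the crash save attempt is exactly the open question above; the helper name `build_render_env` is hypothetical.

```python
import os

# Hypothetical helper: copy an environment and add the variable under test.
# Whether MAYA_DEBUG_ENABLE_CRASH_REPORTING=0 avoids the hang is unverified.
def build_render_env(base_env=None):
    env = dict(os.environ if base_env is None else base_env)
    env["MAYA_DEBUG_ENABLE_CRASH_REPORTING"] = "0"
    return env


env = build_render_env({"PATH": "C:/placeholder"})
print(env["MAYA_DEBUG_ENABLE_CRASH_REPORTING"])

# The resulting dict could be passed to a child process, for example:
#   subprocess.Popen(render_command, env=build_render_env())
# where render_command is whatever command line launches mayabatch.
```

Setting it per process like this, rather than machine-wide, would make it easy to try on a single test machine first.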

Thanks for reading,

Leo

rrussell · September 14, 2009, 12:52pm

Hey Leo,

Before I get into answering any of these questions directly, I just want to confirm something. Does this message get printed out to the slave log when the problem is occurring (i.e., before you do anything like kill mayabatch.exe)?

If so, we should be able to catch this error in the MayaCmd plugin and fail the render automatically. Let us know if this is the case, and if it is, please send us this file from your repository:

\your\repository\plugins\MayaCmd\MayaCmd.py

We’ll make the appropriate modifications and send you back the file for testing. If this catches the problem going forward, we’ll make sure to include the modification in the next release.
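
As a rough illustration of the idea (not the actual MayaCmd.py change, and not using any Deadline API), catching the message amounts to matching each output line against the fatal error text and raising an error when it appears. The function names below are hypothetical; the pattern is taken from the log lines quoted in this thread.

```python
import re

# Pattern based on the message quoted above; the rest of the line is
# captured as the path Maya says it is trying to save to.
FATAL_PATTERN = re.compile(r"Fatal Error\. Attempting to save in (?P<path>.+)")

def find_fatal_error(line):
    """Return the crash-save path if the line is the fatal error message, else None."""
    match = FATAL_PATTERN.search(line)
    return match.group("path").strip() if match else None

def check_output(lines):
    """Raise as soon as a fatal error line is seen, so the task can be failed."""
    for line in lines:
        path = find_fatal_error(line)
        if path is not None:
            raise RuntimeError("mayabatch reported a fatal error; crash save target: " + path)


log = [
    "0: STDOUT: Result: C:/some/scene.ma",
    "0: STDOUT: Fatal Error. Attempting to save in C:/DOCUME~1/deadline/LOCALS~1/Temp/deadline.20090914.1126.ma",
]
try:
    check_output(log)
except RuntimeError as err:
    print(err)
```

In a real plugin, the raise would be replaced by whatever mechanism the plugin uses to fail the task and terminate the lingering process.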

Thanks!

Ryan

anon88561584 · August 10, 2011, 10:21am

Hi there,

I got the same error message:

Fatal Error. Attempting to save in C:\Users\Render~1\AppData\Local\Temp

jgaudet · August 10, 2011, 6:06pm

Hey there,

Does this happen every time a slave tries to render the scene? Does this happen on all the slaves or just specific ones? Either way, could you post the slave log from one of the failed attempts?

Cheers,

Jon