Nuke crash crashing Slave?

celluloidvfx · December 14, 2014, 9:12pm

Hi,

Today I found that a task of a job had been rendering endlessly (16 hours, to be precise). I assumed a crashed application and/or Slave and did a “Requeue Task” so it would be rendered by another Slave. That worked, but even though the task completed and the job was then marked as complete, the status of the Slave in the Slave panel was still “rendering”. I then RDP’ed into the affected machine and found that Nuke (8.0v6) had crashed and was showing one of the usual Windows crash dialogs, which contained this info:

[code]Problem signature:
Problem Event Name: APPCRASH
Application Name: Nuke8.0.exe
Application Version: 0.0.0.0
Application Timestamp: 5411d2b9
Fault Module Name: MSVCR100.dll
Fault Module Version: 10.0.30319.1
Fault Module Timestamp: 4ba220dc
Exception Code: 40000015
Exception Offset: 00000000000760d9
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1031
Additional Information 1: f50a
Additional Information 2: f50a94c46d1af12074027fcda9ee8c8f
Additional Information 3: c303
Additional Information 4: c3033e59905da9299a0c305ac348ad24

Read our privacy statement online:
http://go.microsoft.com/fwlink/?linkid=104288&clcid=0x0409

If the online privacy statement is not available, please read our privacy statement offline:
C:\Windows\system32\en-US\erofflps.txt[/code]

After I clicked the “Close Application” button, Windows also showed a crash dialog for Deadline Slave:

[code]Problem signature:
Problem Event Name: BEX64
Application Name: deadlineslave.exe
Application Version: 7.0.0.50
Application Timestamp: 547f1c07
Fault Module Name: Wacom_Tablet.dll_unloaded
Fault Module Version: 0.0.0.0
Fault Module Timestamp: 51154fcc
Exception Offset: 000007fef5415dfd
Exception Code: c0000005
Exception Data: 0000000000000008
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1031
Additional Information 1: 1751
Additional Information 2: 1751db00310023f5bc93b01cbe496fe7
Additional Information 3: 1036
Additional Information 4: 103688058abd7fd2151ef073198ffa01

[/code]

After closing this one, it showed another one:

[code]Problem signature:
Problem Event Name: APPCRASH
Application Name: deadlineslave.exe
Application Version: 7.0.0.50
Application Timestamp: 547f1c07
Fault Module Name: Wacom_Tablet.dll
Fault Module Version: 6.3.5.3
Fault Module Timestamp: 51154fcc
Exception Code: c000041d
Exception Offset: 0000000000005dfd
OS Version: 6.1.7601.2.1.0.256.48
Locale ID: 1031
Additional Information 1: e6c3
Additional Information 2: e6c3d8284bf05f0a946f68dcbc0dd3eb
Additional Information 3: fbfe
Additional Information 4: fbfe7ab2cada43151b0fa1e2e8b1787f

[/code]
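
For reference, the three exception codes in the dialogs above are standard Windows NTSTATUS values. The lookup below is only a small sketch for reading them; the function name `describe_exception_code` is made up for illustration, and the table covers just the codes that appear in this thread. 40000015 (STATUS_FATAL_APP_EXIT) is typically what the C runtime raises when an application aborts, while c0000005 is an access violation.

```python
# NTSTATUS names for the exception codes shown in the crash dialogs above.
# Only the three codes from this thread are listed; anything else is "unknown".
NTSTATUS_NAMES = {
    0x40000015: "STATUS_FATAL_APP_EXIT",
    0xC0000005: "STATUS_ACCESS_VIOLATION",
    0xC000041D: "STATUS_FATAL_USER_CALLBACK_EXCEPTION",
}


def describe_exception_code(code_hex):
    """Map a hex string from the 'Exception Code' field to its NTSTATUS name.

    describe_exception_code is a hypothetical helper, not part of any tool.
    """
    code = int(code_hex, 16)
    return NTSTATUS_NAMES.get(code, "unknown")


if __name__ == "__main__":
    for raw in ("40000015", "c0000005", "c000041d"):
        print(raw, describe_exception_code(raw))
```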

The status of the Slave still didn’t change in the Monitor. I guess this is because there’s no logic implemented for this kind of situation and because the Slave apparently can’t update its status anymore?
On the other hand, shouldn’t there be some communication going on with the deadlinelauncher process to check whether the Slave on that machine actually still exists? Just wondering…
I attached a screenshot of the Monitor with the job, task and slave being selected.

Now that I’m posting this, I remember seeing the Slave crash message almost every time I RDP into that machine (cell-ws-17); I just forgot to report it until now. After starting the Slave and Launcher again, everything usually runs fine. I don’t know whether this is something you guys can do anything about or whether I need to contact Wacom, as it seems to be related to the Wacom_Tablet.dll named in the crash message. Now that I think about it, this probably means the Slave crash wasn’t caused by the Nuke crash but rather by me RDP’ing into that machine. But as I did this a long time after Nuke crashed, I’m wondering: doesn’t Deadline Slave catch a hung render like this and kill/restart the process?

Cheers,
Holger

rrussell · December 15, 2014, 2:57pm

After the slave crashed, it would have eventually been marked as stalled by Deadline’s Repository Repair operation. A slave is considered stalled if it hasn’t updated its state in a while (the default is 10 minutes). When the slave is marked as stalled, a notification can be sent out (if you entered an email address for this in the Notification settings in the repository options), and if the slave happened to be working on a task at the time, that task would have been requeued so that another slave could pick it up.
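
As a rough illustration of the stall rule described above, here is a minimal sketch. It is not Deadline’s actual implementation: the function names, the dictionary layout, and the slave and task names in the usage example are all hypothetical; only the 10-minute default comes from this thread.

```python
from datetime import datetime, timedelta

# Default stall window mentioned in this thread; everything else is hypothetical.
STALL_TIMEOUT = timedelta(minutes=10)


def is_stalled(last_state_update, now, timeout=STALL_TIMEOUT):
    """A slave counts as stalled if its last state update is older than the timeout."""
    return now - last_state_update > timeout


def repair_pass(slaves, now, timeout=STALL_TIMEOUT):
    """Return (stalled slave names, tasks to requeue) for one repair sweep.

    `slaves` maps a slave name to a dict with 'last_update' (datetime) and
    'task' (a task id or None). Tasks held by stalled slaves are released.
    """
    stalled, requeued = [], []
    for name, info in slaves.items():
        if is_stalled(info["last_update"], now, timeout):
            stalled.append(name)
            if info.get("task") is not None:
                requeued.append(info["task"])
                info["task"] = None  # free the task so another slave can pick it up
    return stalled, requeued


if __name__ == "__main__":
    now = datetime(2014, 12, 15, 12, 0)
    slaves = {
        "cell-ws-17": {"last_update": now - timedelta(minutes=11), "task": "task_3"},
        "cell-ws-18": {"last_update": now - timedelta(minutes=2), "task": "task_4"},
    }
    print(repair_pass(slaves, now))
```

The point of the sketch is that the clock starts at the last state update, so a slave whose reporting thread is still alive never crosses the threshold, no matter how long its render has been hung.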

Regarding the tablet crash, it looks like you might just have to uninstall the current tablet driver and reinstall the latest version:
forums.adobe.com/thread/514294?tstart=0

You can also configure Windows to suppress the crash popups so that they don’t block the process that crashed. Here’s a recent post on our forums that explains how to do that:
viewtopic.php?f=11&t=12689&p=56354#p56354
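
One common way to suppress these popups is the Windows Error Reporting registry value `DontShowUI`; whether the linked post uses exactly this method is an assumption. The sketch below only generates the text of a `.reg` file for that value under the current user’s key; the function name `render_reg_file` is made up for illustration, and nothing here touches the registry itself.

```python
# Windows Error Reporting key for the current user. Setting DontShowUI to 1
# tells WER not to display its crash dialog. That the linked forum post uses
# this same setting is an assumption.
WER_KEY = r"HKEY_CURRENT_USER\Software\Microsoft\Windows\Windows Error Reporting"


def render_reg_file(key=WER_KEY, values=None):
    """Build .reg file text that sets DWORD values under `key`.

    render_reg_file is a hypothetical helper; it returns text only.
    """
    if values is None:
        values = {"DontShowUI": 1}
    lines = ["Windows Registry Editor Version 5.00", "", "[" + key + "]"]
    for name, dword in values.items():
        # .reg files write DWORDs as eight hex digits
        lines.append('"{}"=dword:{:08x}'.format(name, dword))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_reg_file(), end="")
```

The output can be saved to a file and imported on the render node, for example with `reg import`, while logged in as the user account that runs the Slave.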

Cheers,
Ryan

celluloidvfx · December 15, 2014, 3:42pm

So what’s the likely reason it wasn’t marked as stalled in this case? Shouldn’t that have happened after ten minutes? (We actually left the default setting.)

One more proof that Wacom just doesn’t know how to properly program drivers…

Great! Didn’t know that yet.

Cheers,
Holger

rrussell · December 15, 2014, 3:53pm

It looks like the slave didn’t crash, though, until you clicked through the Nuke crash dialog. Also, even when the slave crash window came up, it’s possible that some of the slave’s threads were still running (including the one that reports the slave state). Once the slave’s crash dialogs were clicked through, the slave would die, and then 10 minutes from that point it would have been marked as stalled. With Windows Error Reporting disabled, applications will be allowed to crash immediately.

Cheers,
Ryan