Slave crash then renders bad frames

Every now and then one of our render slaves has a weird crash. Unfortunately, the only warning is seeing that slave start rendering a bad frame every 2 seconds. This gets annoying because later on your render may appear complete, yet half of it is bad because of the slave crash.

I’ve noticed that when it happens, the slave on the node is still open, but the Microsoft crash dialog (the “Do you want to send error report” message) is on top of it. When I hit Don’t Send, the Deadline slave exits completely. I think this leftover slave from the crash is what’s causing the erroneous renders. Is there a possible solution for this?

Which version of Deadline are you running? If you’re running Deadline 3.0 (build 32934), we suggest upgrading to Deadline 3.0 SP1 (build 33353). This maintenance release fixes some stability issues with the slave application. You can download it from here:
franticfilms.com/software/pr … /download/

If you’re already running 3.0, you don’t need a new license to upgrade. Upgrade documentation can be found here:
franticfilms.com/software/su … rading.php

Cheers,

  • Ryan

It appears that we are current, unfortunately.

Can you post the error message (or a screenshot of it)? If the Microsoft crash dialog allows you to view more information (i.e., the exception message), that would also be helpful. Finally, if you see this happen again, close the dialog so that the slave exits, then post the most recent slave log from the client log folder. If you’re on XP, the log folder is c:\documents and settings\all users\frantic films\deadline\logs.

Thanks,

  • Ryan

PS: Just to confirm, the version in the slave’s about menu is 3.0.33353?



I wasn’t able to find the log folder on the render node with that file path. Our server runs Windows Server 2008, and it didn’t have a file path like that either.

I checked the about menu and we are definitely on 3.0.33353.

We are also running Lightwave 9.5.

thanks,

Scott

Sorry, I made a typo in the path. This is what it should be:
C:\Documents and Settings\All Users\Application Data\Frantic Films\Deadline\logs

Note that this path is on the client machine, not the repository server. I’m hoping it contains more information regarding this error. Make sure to grab the log after you have closed the dialog and the slave has completely exited.
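If picking the right file out of the log folder by hand is a hassle, a small script along these lines could do it. This is only a rough sketch: `latest_slave_log` is a hypothetical helper, not part of Deadline, and the `deadlineslave*.log` filename pattern is an assumption about how the slave logs are named. It simply returns the most recently modified matching file in the folder given above.

```python
from pathlib import Path
from typing import Optional

# Client-side log folder quoted in this post (Windows XP).
DEFAULT_LOG_DIR = Path(
    r"C:\Documents and Settings\All Users\Application Data\Frantic Films\Deadline\logs"
)


def latest_slave_log(log_dir: Path, pattern: str = "deadlineslave*.log") -> Optional[Path]:
    """Return the most recently modified file matching `pattern`, or None.

    The default pattern is an assumption about the slave log naming,
    not something taken from Deadline documentation.
    """
    candidates = [p for p in log_dir.glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # Pick the file with the newest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    newest = latest_slave_log(DEFAULT_LOG_DIR)
    print(newest if newest else f"No slave logs found in {DEFAULT_LOG_DIR}")
```

Run it on the client machine only after the crash dialog has been closed and the slave has exited, so the log is complete.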

Out of curiosity, do you guys render with anything besides Lightwave? I’m just wondering whether this is Lightwave-specific or not.

Cheers,

  • Ryan

I attached two files that were both created at the same time in the log directory.

If the error occurs again, we can run an XSI render through to see if we can reproduce it in another package.

As a temporary fix, we have gone through all the nodes and turned off the system error reports, in the hope that the slave will just crash if it needs to crash. I’ll turn them back on for a couple of machines for more testing, though.

thanks again,

Scott
deadlineslave(Render10)-2008-11-07-0000.log (909 KB)
task_2(Render10)-2008-11-07-0000.log (525 KB)

One thing that caught my eye in the log was this:

Now normally when the slave loses the connection to the repository, it will continue rendering, which seems to be the case. I’m not sure if this is even related to the Microsoft error.

I googled the Microsoft error you sent, and some people have suggested checking that all your Windows and .NET service packs are up to date.

We actually had to reboot the server this morning, which may be what caused that lost connection. I will look into the .NET updates for now, though. Thank you for your quick support :smiley:

Hi again. Here’s another log from a major culprit slave. This time it happened on an artist’s computer over the weekend, as well as on a few nodes. This log is from his computer.

I see that it says task timeout is disabled. Could that have something to do with it? In this log you can actually see it running through all the frames.
deadlineslave(Flash)-2008-11-08-0000.log (2.39 MB)

The task timeout shouldn’t have anything to do with this. It’s just a per-job feature that is disabled by default.

Were you able to confirm whether this problem only happens while rendering Lightwave jobs? More specifically, did you run into this problem with XSI jobs? When you submit the Lightwave jobs, do you have the “Use Screamernet” option enabled? If so, maybe try submitting without this option enabled to see if that makes a difference. The difference between the two is that Screamernet mode keeps the scene loaded in memory between frames, but it won’t affect the results of your renders. We’re just trying to rule out possibilities here.

I’m out of the office until Wednesday, but when I’m back in, I’ll throw a bunch of Lightwave jobs at Deadline to see if I can reproduce the problem.

Cheers,

  • Ryan

We submitted an XSI render on Friday, but it went red over the weekend and didn’t render. We will try again, though. We will also try rendering without Screamernet.

thanks,
Scott

If a job is turning red, it means that it’s accumulating errors (the normal kind, not slave crashes). To view errors, right-click on the job and select Job Reports -> View Error Reports. These reports should tell you what the problem is, but if you’re not sure what to make of them, please feel free to post them (preferably in a new thread) and we’ll take a look.

Cheers,

  • Ryan

We’re not having any luck reproducing this here. By any chance do you have Error Reporting disabled? If so, you can enable it from the Monitor while in super user mode by selecting Tools -> Configure Repository Options. Select Error Reporting Setup from the list on the left and enable it. With this enabled, we should be sent an email whenever a Deadline application throws an unhandled exception, which seems to be what is happening here. The email will contain an error message that should show us the stack trace and make it easier for us to pinpoint where the error is occurring.

Cheers,

  • Ryan

Check, I will enable the error reporting.

I am having the same issue at my shop, with the Deadline slave crashing and Lightwave continuing to render bad frames. Have either of you made any progress with this issue?

Nothing as of yet. Do you have remote error reporting enabled? Have you tried rendering with Screamernet mode on and off to see if it makes a difference?

We have not had the problem since disabling Screamernet rendering. Any ideas why this would happen only when using Screamernet?

Great, glad to hear this! No ideas why yet, but now we have confirmation about where the problem is occurring. At least there is a workaround for now.

Cheers,

  • Ryan