I had a personal project to render out over the weekend (three 180-frame turntables), and I figured I’d give D6R10 a try. I was very impressed with how snappy it feels and with the UI (thanks for adding an option for script icons!). In practice, however, I found R10 to be super unstable.
The slave app would stall or crash quite frequently on most of my render nodes. When it crashed during a render, the Monitor wouldn’t update its status to offline or stalled; it would just show it as green with 0% CPU usage. A remote slave restart wouldn’t work in those cases. When the slave stalled, its status also wouldn’t update until I killed the whole rendering job. However, remotely restarting a slave that was actually stalled did in fact restart it.
I even had the Monitor crash on my workstation once, which has never happened to me before.
If I RDP’d into one of the nodes that had a crashed slave app, I couldn’t get the slave app to start through either the launcher or the shortcut in the Start menu. Sometimes double-clicking the exe itself would restart it; other times it wouldn’t.
To make sure it wasn’t something else causing the problems, I uninstalled D6, re-installed D5.2, and submitted the same jobs. So far they’re running very smoothly.
My workstation is running Windows 8 Pro x64 and my render nodes are Win Server 2008 R2 Standard (SP1).
Can you provide any logs from sessions where the slaves or monitor crashed? The log folder can be found by selecting Help -> Explore Log Folder from any of the Deadline applications, or from the Launcher’s right-click menu.
Also, can you check to make sure your render nodes have all of their Windows updates applied?
Hi Ryan, once the crashing started I made sure all of my machines had the latest Windows updates, and it didn’t help. I’ve been running the same jobs for the past 40h without a hitch on the latest D5.2.
I have no idea which logs are from a crash and which aren’t. I’m not even sure what to look for. Each of my 5 slaves has about 15-20 log files. They’re 4.3 MB zipped - lemme know where to send them if you’re interested.
Here is what I think might be relevant:
A few slaves have this in some of their logs:
My workstation (on which the monitor is run) had this in a couple of logs:
I found this on another slave that might be relevant:
4.3 MB isn’t that large, so you can just upload them here.
None of those errors should be “fatal” and bring down the slave. I checked the code and confirmed this. The Monitor one has already been fixed internally and will be included in beta 11.
I’m surprised it’s so unstable for you, since we don’t see this instability here. Maybe it’s something specific to Win Server 2008. We have a Win Server 2008 machine here, so we installed beta 10 on it and we’re going to let it run for a while to see if the problems happen for us.
What you could do is set up beta 10 (or wait until beta 11) on a single node and wait for it to crash again. When it does, just go to the log folder and get the most recent slave log.
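If picking out the newest log by hand gets tedious, something like the following might help. It is only a rough sketch: it assumes the log folder is a flat directory and that the slave logs end in `.log`, neither of which is confirmed above, and the `most_recent_log` helper is a made-up name, not part of Deadline.

```python
from pathlib import Path
from typing import Optional


def most_recent_log(log_dir: str, pattern: str = "*.log") -> Optional[Path]:
    """Return the most recently modified file in log_dir matching pattern.

    Both the folder layout and the "*.log" pattern are assumptions;
    adjust them to whatever the log folder on the node actually contains.
    Returns None if nothing matches.
    """
    candidates = [p for p in Path(log_dir).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    # Pick the file with the latest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling `most_recent_log(...)` with the path that Help -> Explore Log Folder opens should return the file to upload, or `None` if no file matches the pattern.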