AWS Thinkbox Discussion Forums

Slave crash (OS X) possibly due to too many open files

Joy! :slight_smile:

Just you wait till I start testing Maya and modo renders! That’ll keep you busy again. :smiling_imp:

Hi Ryan, your favourite Nuke OSX bug finder here. :slight_smile:

Thanks for the new beta build. Unfortunately, it is hard to tell if the process issue has been fixed, because I cannot finish a render session. The client just stalls. What I mean is, that a job is started, but then not updated and the client sits there at 0% while the render finishes in the background. The only way out of it, is to kill the slave (actually quit X11 sometimes, because a simple file/exit won’t do), mark the batch as complete and restart the client.

The log just sits there and does nothing after a certain point.

2011-10-25 12:04:32: BEGIN - chef.local\alex 2011-10-25 12:04:32: Start-up 2011-10-25 12:04:32: 2011-10-25 12:04:30 2011-10-25 12:04:32: Deadline Slave 5.1 [v5.1.0.45606 R] 2011-10-25 12:04:38: Auto Configuration: No auto configuration could be detected, using local configuration 2011-10-25 12:04:38: slave initialization beginning. 2011-10-25 12:04:39: Slave 'Chef' has stalled because it has not updated its state in 1.384 m. Performing house cleaning... 2011-10-25 12:04:39: Could not find associated job class though. 2011-10-25 12:04:39: No stalled slave notification address specified in Repository Options - cannot send notification 2011-10-25 12:04:39: Info Thread - Created. 2011-10-25 12:04:39: Trying to connect using license server '/usr/local/FLEXnet' 2011-10-25 12:04:39: The license file being used will expire in 36 days. 2011-10-25 12:04:47: Checking repository integrity 2011-10-25 12:04:48: 2011-10-25 12:04:48: Scheduler Thread - Synchronizing job files 2011-10-25 12:04:48: Scheduler Thread - Synchronization time for job files: 5.554 ms 2011-10-25 12:04:48: Scheduler Thread - Synchronizing plugin files 2011-10-25 12:04:48: Scheduler Thread - Synchronization time for plugin files: 24.133 ms 2011-10-25 12:04:48: Constructor: Nuke 2011-10-25 12:04:48: Constructor: Nuke 2011-10-25 12:04:48: Constructor: Nuke 2011-10-25 12:04:48: Constructor: Nuke 2011-10-25 12:04:48: 1: Task timeout is disabled. 2011-10-25 12:04:48: 1: Loaded job: outro-m_fcbk_o_schn_v002.nk - Write2 (999_050_999_4ce0094b) 2011-10-25 12:04:48: 2: Task timeout is disabled. 2011-10-25 12:04:48: 2: Loaded job: outro-m_fcbk_o_schn_v002.nk - Write2 (999_050_999_4ce0094b) 2011-10-25 12:04:48: 0: Task timeout is disabled. 2011-10-25 12:04:48: 0: Loaded job: outro-m_fcbk_o_schn_v002.nk - Write2 (999_050_999_4ce0094b) 2011-10-25 12:04:48: 3: Task timeout is disabled. 2011-10-25 12:04:48: 3: Loaded job: outro-m_fcbk_o_schn_v002.nk - Write2 (999_050_999_4ce0094b) 2011-10-25 12:04:50: 1: INFO: StartJob: initializing script plugin Nuke 2011-10-25 12:04:50: 1: INFO: About: Nuke Plugin for Deadline 2011-10-25 12:04:50: 1: INFO: Scrubbing the LD and DYLD LIBRARY paths 2011-10-25 12:04:50: 1: INFO: Prepping OFX cache 2011-10-25 12:04:50: 1: INFO: Checking Nuke temp path: /var/tmp/nuke-u501 2011-10-25 12:04:50: 1: INFO: Path already exists 2011-10-25 12:04:50: 1: INFO: OFX cache prepped 2011-10-25 12:04:50: 1: Plugin rendering frame(s): 200-299 2011-10-25 12:04:50: 3: INFO: StartJob: initializing script plugin Nuke 2011-10-25 12:04:50: 3: INFO: About: Nuke Plugin for Deadline 2011-10-25 12:04:50: 3: INFO: Scrubbing the LD and DYLD LIBRARY paths 2011-10-25 12:04:50: 3: INFO: Prepping OFX cache 2011-10-25 12:04:50: 3: INFO: Checking Nuke temp path: /var/tmp/nuke-u501 2011-10-25 12:04:50: 3: INFO: Path already exists 2011-10-25 12:04:50: 3: INFO: OFX cache prepped 2011-10-25 12:04:50: 3: Plugin rendering frame(s): 400-499 2011-10-25 12:04:52: 0: INFO: StartJob: initializing script plugin Nuke 2011-10-25 12:04:52: 0: INFO: About: Nuke Plugin for Deadline 2011-10-25 12:04:52: 0: INFO: Scrubbing the LD and DYLD LIBRARY paths 2011-10-25 12:04:52: 0: INFO: Prepping OFX cache 2011-10-25 12:04:52: 0: INFO: Checking Nuke temp path: /var/tmp/nuke-u501 2011-10-25 12:04:52: 0: INFO: Path already exists 2011-10-25 12:04:52: 0: INFO: OFX cache prepped 2011-10-25 12:04:52: 0: Plugin rendering frame(s): 100-199 2011-10-25 12:04:52: 2: INFO: StartJob: initializing script plugin Nuke 2011-10-25 12:04:52: 2: INFO: About: Nuke Plugin for Deadline 2011-10-25 12:04:52: 2: INFO: Scrubbing the LD and DYLD LIBRARY paths 2011-10-25 12:04:52: 2: INFO: Prepping OFX cache 2011-10-25 12:04:52: 2: INFO: Checking Nuke temp path: /var/tmp/nuke-u501 2011-10-25 12:04:52: 2: INFO: Path already exists 2011-10-25 12:04:52: 2: INFO: OFX cache prepped 2011-10-25 12:04:52: 2: Plugin rendering frame(s): 300-399 2011-10-25 12:04:52: 0: INFO: Any stdout that matches the regular expression "READY FOR INPUT" will be handled as appropriate 2011-10-25 12:04:52: 0: INFO: Any stdout that matches the regular expression ".*ERROR:.*" will be handled as appropriate 2011-10-25 12:04:52: 0: INFO: Any stdout that matches the regular expression ".*Error:.*" will be handled as appropriate 2011-10-25 12:04:52: 0: INFO: Any stdout that matches the regular expression ".*Frame [0-9]+ \(([0-9]+) of ([0-9]+)\)" will be handled as appropriate 2011-10-25 12:04:52: 0: INFO: Stdout Handling Enabled: True 2011-10-25 12:04:52: 0: INFO: Popup Handling Enabled: True 2011-10-25 12:04:52: 0: INFO: Using Process Tree: True 2011-10-25 12:04:52: 0: INFO: Hiding DOS Window: True 2011-10-25 12:04:52: 0: INFO: Creating New Console: False 2011-10-25 12:04:52: 0: INFO: Render Executable: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS/Nuke6.3v4" 2011-10-25 12:04:52: 0: INFO: Rendering with NukeX 2011-10-25 12:04:52: 0: INFO: Rendering write node Write2 2011-10-25 12:04:52: 0: INFO: Render Argument: -V --nukex -x -X "Write2" -F 100-199 "/Users/alex/Deadline/slave/Chef/jobsData/outro-m_fcbk_o_schn_v002.nk" 2011-10-25 12:04:52: 0: INFO: Startup Directory: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS" 2011-10-25 12:04:52: 0: INFO: Process Priority: BelowNormal 2011-10-25 12:04:55: 1: INFO: Any stdout that matches the regular expression "READY FOR INPUT" will be handled as appropriate 2011-10-25 12:04:55: 1: INFO: Any stdout that matches the regular expression ".*ERROR:.*" will be handled as appropriate 2011-10-25 12:04:55: 1: INFO: Any stdout that matches the regular expression ".*Error:.*" will be handled as appropriate 2011-10-25 12:04:55: 1: INFO: Any stdout that matches the regular expression ".*Frame [0-9]+ \(([0-9]+) of ([0-9]+)\)" will be handled as appropriate 2011-10-25 12:04:55: 1: INFO: Stdout Handling Enabled: True 2011-10-25 12:04:55: 1: INFO: Popup Handling Enabled: True 2011-10-25 12:04:55: 1: INFO: Using Process Tree: True 2011-10-25 12:04:55: 1: INFO: Hiding DOS Window: True 2011-10-25 12:04:55: 1: INFO: Creating New Console: False 2011-10-25 12:04:55: 1: INFO: Render Executable: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS/Nuke6.3v4" 2011-10-25 12:04:55: 1: INFO: Rendering with NukeX 2011-10-25 12:04:55: 1: INFO: Rendering write node Write2 2011-10-25 12:04:55: 1: INFO: Render Argument: -V --nukex -x -X "Write2" -F 200-299 "/Users/alex/Deadline/slave/Chef/jobsData/outro-m_fcbk_o_schn_v002.nk" 2011-10-25 12:04:55: 1: INFO: Startup Directory: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS" 2011-10-25 12:04:55: 1: INFO: Process Priority: BelowNormal 2011-10-25 12:04:55: 2: INFO: Any stdout that matches the regular expression "READY FOR INPUT" will be handled as appropriate 2011-10-25 12:04:55: 2: INFO: Any stdout that matches the regular expression ".*ERROR:.*" will be handled as appropriate 2011-10-25 12:04:55: 2: INFO: Any stdout that matches the regular expression ".*Error:.*" will be handled as appropriate 2011-10-25 12:04:55: 2: INFO: Any stdout that matches the regular expression ".*Frame [0-9]+ \(([0-9]+) of ([0-9]+)\)" will be handled as appropriate 2011-10-25 12:04:55: 2: INFO: Stdout Handling Enabled: True 2011-10-25 12:04:55: 2: INFO: Popup Handling Enabled: True 2011-10-25 12:04:55: 2: INFO: Using Process Tree: True 2011-10-25 12:04:55: 2: INFO: Hiding DOS Window: True 2011-10-25 12:04:55: 2: INFO: Creating New Console: False 2011-10-25 12:04:55: 2: INFO: Render Executable: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS/Nuke6.3v4" 2011-10-25 12:04:55: 2: INFO: Rendering with NukeX 2011-10-25 12:04:55: 2: INFO: Rendering write node Write2 2011-10-25 12:04:55: 2: INFO: Render Argument: -V --nukex -x -X "Write2" -F 300-399 "/Users/alex/Deadline/slave/Chef/jobsData/outro-m_fcbk_o_schn_v002.nk" 2011-10-25 12:04:55: 2: INFO: Startup Directory: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS" 2011-10-25 12:04:55: 2: INFO: Process Priority: BelowNormal 2011-10-25 12:04:59: 3: INFO: Any stdout that matches the regular expression "READY FOR INPUT" will be handled as appropriate 2011-10-25 12:04:59: 3: INFO: Any stdout that matches the regular expression ".*ERROR:.*" will be handled as appropriate 2011-10-25 12:04:59: 3: INFO: Any stdout that matches the regular expression ".*Error:.*" will be handled as appropriate 2011-10-25 12:04:59: 3: INFO: Any stdout that matches the regular expression ".*Frame [0-9]+ \(([0-9]+) of ([0-9]+)\)" will be handled as appropriate 2011-10-25 12:04:59: 3: INFO: Stdout Handling Enabled: True 2011-10-25 12:04:59: 3: INFO: Popup Handling Enabled: True 2011-10-25 12:04:59: 3: INFO: Using Process Tree: True 2011-10-25 12:04:59: 3: INFO: Hiding DOS Window: True 2011-10-25 12:04:59: 3: INFO: Creating New Console: False 2011-10-25 12:04:59: 3: INFO: Render Executable: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS/Nuke6.3v4" 2011-10-25 12:04:59: 3: INFO: Rendering with NukeX 2011-10-25 12:04:59: 3: INFO: Rendering write node Write2 2011-10-25 12:04:59: 3: INFO: Render Argument: -V --nukex -x -X "Write2" -F 400-499 "/Users/alex/Deadline/slave/Chef/jobsData/outro-m_fcbk_o_schn_v002.nk" 2011-10-25 12:04:59: 3: INFO: Startup Directory: "/Applications/Nuke6.3v4/Nuke6.3v4.app/Contents/MacOS" 2011-10-25 12:04:59: 3: INFO: Process Priority: BelowNormal

That’s interesting. This seems like a different problem, and it’s not obvious why it’s occurring. Does this only happen when rendering concurrent tasks? If you set concurrent tasks to 1, does the render still hang? We’ll try running some tests here to see if we can reproduce. Looks like there may be a deadlock somewhere…

The pattern is kind of like this:

  • Mac Mini Server running Lion Server (which I think is not supported, right? bad me :blush: ) always does it. (Also, the slave on Lion seemed to run much more stable in Beta 3 then the Snow Leopard slave.)
  • Mac Pro running Snow Leopard only does it sometimes and then it is mostly on the last concurrent task. Say every other concurrent task batch.

I see if I can run a testrun with only one concurrent task this evening. Have to rely on good old terminal batches to get some renders done at the moment, so might not have time today.

I’m testing on Snow Leopard right now. Nothing yet, but I’ll keep hammering it.

We should be getting our Lion machine soon, so we’ll definitely run some tests there too.

Is there any log output that might help you further? Extended logging? Mono console?

Ok, so @ 16 concurrent tasks, I’m hitting the original process problem pretty consistently, even with OFX cache prepping turned off. There must still be something we’re missing.

I’m going to try another app besides Nuke (maybe After Effects) to see if perhaps this issue is Nuke-specific.

Happens with AE too. We may have a lead on it. I’ll keep you posted.

Cheers mate! Looking forward to it. Fingers crossed.

We narrowed down the problem to setting the process priority on the render process. After we commented this code out, I was able to hammer through thousands of nuke tasks @ 16 concurrent tasks without any problems. So we’re just going to leave this code commented out for OSX and Linux for now. I think we only started doing this in 5.0 actually.

We’ve also found a way to gather process info on the mac using the system libraries. This is another significant reduction in the number of external processes the Mac version uses.

These improvements will be rolled into beta 5, which we hope to get out next week.

Cheers,

  • Ryan

Cheers Ryan! Looking forward to testing it. I’ll be out of the office next week, but we will receive a new batch of plates when I get back home on Saturday and I’d love to test the new build with them. There should be enough plates to stress test the hell out of Deadline with Nuke. :slight_smile:

Hi there Ryan,

I finally got that boatload of plates I mentioned and therefor am back at testing and using Deadline. And it seems (knock on wood) that Beta 5 fixed the issues I was having. I have been converting files all morning and not a single crash or missing progress indicator and even canceling the task works as expected. Feels good to build trust in the farm again instead of having to constantly monitor it. :smiley:

I’ll keep an eye on it and will scream if anything goes haywire.

Hi Ryan,

I have been running beta 5 for a few days now and overall it seems pretty stable. I had a few slave crashes on Friday when I tried to create ProRes Quicktimes with sizes over 2GB (via AFP on a HFS+ volume). I tried to run those locally and most succeeded except one on which I had to render it via manual batch, because the slave kept erroring our. I’ll try and find the log files for that since I forgot to mark them in the Friday render frenzy.

The slave and task logs didn’t show anything unusual (except that is just ends mid-render), but I found a Mono crash log that roughly fits the time.

As I said, it only seems to happen with rather large Quicktime files. As far as I can tell, Nuke finalizes Quicktime renders by completely rewriting the whole file at the end, so I have a 2.4GB .mov and a growing .mov.tmp that ultimately replaces the .mov. This process can take a while up to 1-2 minutes I guess. During that time the slave seems to either crash or requeue the task. On my weaker machines over the network this appears to happen mid-render as well.
mono_2011-11-17-115800_Grumpy.txt (65.5 KB)
mono_2011-11-18-170744_Happy.txt (69.9 KB)

I googled that crash type, and found this:
forums.steampowered.com/forums/s … ?t=1803559

He said that when he installed Mono 2.6.7, the problem went away. He was using Mono 1.2.4 though, but I did notice you were on 2.10.5. Mono 2.6.7 is their LTS version, and has been shown to be the most stable version available. Maybe try downgrading to see if that helps. Note that you should uninstall all versions of Mono first, and this script makes that process easier:
mono-project.com/Mono:OSX#Un … n_Mac_OS_X

We are actually changing our documentation to recommend and link to 2.6.7, and at some point we would like to try and bundle this version with Deadline in the future.

Cheers,

  • Ryan

I vaguely remember upgrading to 2.10.x because of some interface issues with 2.6.x back in another beta (5.0?). But sure, I’ll give it a try tonight when all the renderings are done.

Just out of interest and completely off topic: which language is Deadline written in and why Mono instead of a true port? Are you using that many native Windows dlls and other dingdongs that are hard to port? I imagine that you’d have to do as much code support for Linux and OS X with Mono as if you had just ported it, wouldn’t you? But that is me not actually programming anything so I might be completely off the reservation here.

Deadline is primarily written in C# (.NET), and Mono allows us to run our .NET binaries in a Linux and OSX environment. We would have to convert all of our code to C++ if we wanted to build native binaries. At the time we decided to port from Windows, that simply wasn’t a viable option.

The majority of our code is the same across all 3 operating systems, with the exceptions being in the user interface and some low level utility functions to get system information.

The non-UI portions of Mono are very stable at this point (especially in 2.6.7). Because your problem only happens with these huge movie files, it could be possible that the system is just running out of resources, and the Slave becomes a casualty of that.

Cheers,

  • Ryan

It looks like Mono 2.6.7 might have done the trick. I am running several large r3d->mov conversions as I write this and so far the slaves are stable. Fingers crossed. I’ll report back when that batch has finished.

Ok, all slaves have been running for the last 2 days doing Nuke renders to EXR and large ProRes QuickTimes and not a single crash or failure. We might have a winner here. :slight_smile:

Sweet!! Thanks for confirming this, and a bigger thanks for your patience and help in tracking these problems down!

Cheers,

  • Ryan
Privacy | Site terms | Cookie preferences