Hi Ryan,
as per our discussion in “Slave crash (OS X)”, I am now in the Beta program. However, I’d wait to actually install it until you give me a heads-up that the issue is actually being worked on, since I am in the middle of a production and don’t really have time to spare to test the other new features, although they do sound great.
I completely understand. If you do have some downtime to test now though, it would probably be enough to just pick one of your OSX machines and remove it from production to test with the new version. It’s probably best to do that anyways once we have a new version out that tries to target the problem specifically.
Cheers,
Hi Ryan,
just letting you know that I am back in the office. I didn’t have any time to test out new builds, but I am diving into that now. I’ll let you know how it goes.
Well, that went by fast. Installed Beta 2 with high hopes, because of the new way processes are handled, but now the Slave won’t start up anymore. Only bounces once. No logs, no errors, nothing.
Monitor, Pulse, etc. all run fine though.
This is a known issue in beta 2 that will be fixed in beta 3. The workaround is to open Finder and go to /Applications/Deadline/Resources and create a folder called “slaves” (without the quotes). After creating this folder, try launching the slave and see if that helps.
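For anyone scripting this across several machines, the workaround can be sketched in Python like so (the default path is the one given above; everything else here is illustrative):

```python
import os

def create_slaves_folder(resources_dir="/Applications/Deadline/Resources"):
    """Create the "slaves" folder the beta 2 slave expects on startup.

    exist_ok makes this safe to run even if the folder is already there.
    """
    slaves_dir = os.path.join(resources_dir, "slaves")
    os.makedirs(slaves_dir, exist_ok=True)
    return slaves_dir
```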
Cheers,
Worked perfectly! Will report back after a bit of testing. Cheers Ryan.
Hi Ryan,
I ran a couple of Nuke render tests yesterday with the same renders of thousands of frames that made the slaves go pop last time. In the beginning I had some permission-related problems on the mounted share that made the slaves crash (slave one created the render directory and slave two didn’t have write permissions). Once I sorted that out, the slaves seemed stable. I ran a render for nearly two hours straight without problems. ps auxww also didn’t reveal any unusual process accumulation, so that is good as well. I will keep testing it a bit, but it seems that the too-many-files issue is closed.
One thing that should not have happened though, I think, is that the slaves crashed after several write-permission errors. It is obviously an error on my end for not keeping my ACLs straight, but that should only throw an error in the render thread, not bring down the whole slave, right?
I have attached the slave and task log for those crashes, although they don’t really show the slave crash, only the permission errors.
deadlineslave_Happy(Happy)-2011-10-06-0002.log (113 KB)
task_2(Happy)-0000.log (5.78 KB)
That’s great to hear that you aren’t hitting the “too many processes” error! You’re right though, the slave shouldn’t be crashing because nuke is unable to write the frames. Can you check the OSX Console log to see if there are any crash dumps from Mono around the same time the slave crashed? Based on the last timestamp in the slave log, it crashed at 2011-10-06 13:12:31.
Cheers,
and of course my console log only goes back to about 2pm. I’ll see if I can reproduce it.
Well, not sure if this is related, but I still get crashes from my slaves. In addition here is a list of problems I noticed (with Deadline Version: 5.1.0.45235):
- I cannot cancel or suspend Nuke renders. Instead I get: Child process with id failed to exit
- and of course, the slave crashes, but the render keeps going (which is good in my case; I just wait till the job is finished and then start the slave anew, since these are sequences that cannot be started in the middle). I have attached the stack traces for both machines in my test farm, since the Deadline logs don’t reveal anything.
Also, I’d love to setup some way of converting one of the outputs to a Quicktime. I guess that is what Draft is for, right? Need to look into that.
MacPro_console.log (9.44 KB)
iMac_console.log (7.25 KB)
Grrr! I really wish this process issue didn’t segfault Mono. This definitely looks like the same issue starting processes that we had seen originally (and thought we fixed). We’ll add some additional process cleanup code to beta 4 (which really shouldn’t be necessary, but it doesn’t hurt).
Hmm, that’s weird. That would mean that something is preventing us from killing the process. When this happens, we then try a native SIGKILL on that process, so I’m surprised that’s not working either. Are you able to kill the Nuke process from a terminal?
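For reference, the native SIGKILL path can be sketched from Python as follows, with a throwaway `sleep` standing in for the stuck Nuke process:

```python
import os
import signal
import subprocess

# A throwaway process stands in for the stuck render process
proc = subprocess.Popen(["sleep", "60"])

os.kill(proc.pid, signal.SIGKILL)  # same signal as `kill -9 <pid>` in a terminal
proc.wait()                        # reap the child so no zombie is left behind

# On POSIX, a negative return code means "killed by that signal number"
print(proc.returncode)             # -9 (SIGKILL)
```

If even this fails from a terminal, the process is usually stuck in an uninterruptible state (e.g. blocked on I/O to the mounted share), which no signal can get it out of.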
Will try the terminal kill tomorrow. Have to finish a big lump of rendering first.
Btw. I installed the update to Deadline Version: 5.1.0.45496 and that seems to be more stable on the Mac Pro, but not on the iMac. Will also post crash logs of the iMac tomorrow if I have time.
Sounds good.
I think the process issue affects some machines more than others. For example, I’ve been running a test on my 10.6.8 Mac mini all day that just executes one external process after another to try and reproduce the crash. It has launched 203,000 subsequent processes at this point, and no crash. I’m going to leave it running all night and see what happens.
Ok, the canceling issue seems to be connected to me rendering several write nodes in the script, one of them being a QuickTime. So the PID Deadline is trying to kill (and which it indeed is killing) is the QuickTime helper process “NukeQuickTimeHelper-32”. The actual Nuke process continues on undisturbed. Interestingly enough, the Slave reports the PID of the helper it just killed when it complains that it cannot kill the process. Shouldn’t it kill the process that it thinks is Nuke and then declare the process canceled? That would still be wrong, sure, since Nuke would continue running in the background, but the behavior right now seems really mixed up.
And here are the promised crash logs for the iMac with the up to date Deadline build.
iMac_monocrash.log (41.5 KB)
iMac_console.log (7.44 KB)
I think the “Child process with id failed to exit” errors may be misleading. When Deadline goes to kill the process tree, it will keep trying to kill each process until it eventually exits, and it prints out that message after each failed attempt (a failed attempt is a case where the process is still running a full second after invoking the kill command). So it looks like the QT process does get cleaned up eventually. Maybe we should just remove those messages, since they aren’t helpful in any way.
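The retry loop described above can be sketched in Python like this (names and the message wording are illustrative, not Deadline’s actual code):

```python
import subprocess
import time

def kill_with_retry(proc, wait_seconds=1.0, attempts=5):
    """Keep killing `proc` (a subprocess.Popen) until it actually exits.

    A "failed attempt" is a case where the process is still running a
    full `wait_seconds` after the kill, and a message is printed after
    each one, mirroring the behavior described above.
    """
    for _ in range(attempts):
        proc.kill()                   # SIGKILL on POSIX
        time.sleep(wait_seconds)
        if proc.poll() is not None:   # exited (and was reaped)
            return True
        print("Child process with id %d failed to exit" % proc.pid)
    return False
```

With this shape, a process that takes a couple of seconds to die produces one or two scary-looking messages even though everything is actually fine, which is exactly why they read as misleading.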
That’s strange though that the Nuke process doesn’t exit. We’ll have to try out some QT renders with Nuke to see if we can reproduce the problem.
For the crash issue, the cleanup code that we added might actually be helping. Yesterday, I ran a few tests where I was relying on the framework to do the process cleanup (which, according to the documentation, should work fine). All my test app does is repeatedly call “ls” and grab the stdout. The first time I ran it, it crashed after about 1,000 processes. The second time, it crashed after about 450,000 processes. That’s quite the difference, but regardless, I reproduced the problem.
So before I left yesterday, I ran the same test app, but this time I added code to explicitly cleanup the process resources. It was still running this morning and had launched over 32,000,000 processes. This seems very promising, so we will be rolling this change into beta 4. Hopefully this resolves the issue once and for all.
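As a rough Python analog of that test harness (Python’s subprocess module does most of this cleanup for you, so the point here is just the explicit-release shape; the .NET equivalent would be calling Process.Dispose()/Close() after each run):

```python
import subprocess

def stress(iterations):
    """Repeatedly run "ls", grab its stdout, and explicitly release the
    process resources each iteration, as in the long-running test above."""
    for _ in range(iterations):
        proc = subprocess.Popen(["ls", "/"], stdout=subprocess.PIPE)
        out, _ = proc.communicate()   # read all of stdout and wait for exit
        if proc.stdout is not None:
            proc.stdout.close()       # explicit cleanup of the pipe handle
    return iterations
```

The failure mode being chased here is resource handles (pipes, process handles) accumulating across hundreds of thousands of launches when cleanup is left to the runtime’s finalizers.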
Cheers,
Sounds great! When can I get my hands on it?
When we release it.
Haha, I couldn’t resist. We’re aiming to release beta 4 next week.
Cheers,
Hehe, all in good spirit.
Looking forward to it!
Some more good news about this problem. We were looking at the OFX cache prepping that Deadline does for Nuke jobs, and it turns out a .NET process is used here, and it ISN’T CLEANED UP after the process exits. This has probably been the main source of the process leak problem this entire time. Now that I think of it, this problem mainly affected Nuke users, and that makes sense because with the process bug in the Nuke plugin, the odds of the problem occurring were much greater than with any other plugin.
This will also be fixed in beta 4. Hopefully we’ve resolved this once and for all…