AWS Thinkbox Discussion Forums

Deadline on Apple M1 is unstable

Just wanted to follow up on this thread, has anyone found a workaround yet?

Also following up here to see if any progress has been made towards a solution? Thanks.

We’ve re-created it, it’s just the Worker that fails for us.

Beyond that there’s nothing I can share unfortunately. As soon as we release a fix you’ll see it in the Release History — Deadline 10.3.0.15 documentation

Thanks Justin. Yes, it’s just the Worker that’s giving me issues. Hoping to see that fix soon!

Hello All

Good news: we have tested this issue on the latest version of Deadline (10.3.1.3), which includes an updated .NET Core version, and it seems to be working fine. Please upgrade and test.

I was still seeing crashes after updating to 10.3.1.3 (and initially updating my MongoDB database to version 5), but I recently unchecked the setting:

Worker Settings → Gather System Resources

and it seems to be much more stable. With this setting checked, I was getting worker crashes even mid-render.

UPDATE: There are still some worker crashes, but they seem to be much less frequent.

Spoke too soon: crashes are happening even more frequently. Really hope there will be a fix soon for M1/M2 Macs.

Again, it’s just the Workers that are crashing, and on x64 machines, it works fine.

I did see a Worker crash when asking it to restart after current task completion.

I’ve been running a series of tests, and the Worker operates much better after the .net Core upgrade from 6.0.14 to 6.0.416… That’s 402 patch releases to sift through, but maybe something will fall out of that digging.

Some results of testing:

I have this script m1-check.py that will lock up in under a minute when run with /Volumes/Applications/Thinkbox/Deadline10/Resources/deadlinecommand ExecuteScriptNoGui <path>. Upgrading from 10.3.0.13 to 10.3.1.3 made no appreciable difference. For numbers, before the upgrade I was able to get 6 and 11 iterations in two runs. After the upgrade it locked at 4 and 25 iterations.

What was very interesting is after the upgrade, the Worker was able to go from running for under a minute with 10 threads to running for 58 minutes with 16!

Some facts about this that someone may be able to run with:

  • Deadline 10.1.23.6 has been reported to not have this problem (I have yet to check this)
  • Putting as little as a print statement just after “thread.join()” allows for 3x the number of threads but that may have just been a series of good runs
  • A purely C# version of the Python script does not lock up on .net Core 8. I’m not sure how to run that code inside of Deadline 10.3.1.3’s .net Core yet.
  • A pure Python implementation using popen() does not lock up at all
  • When a lock happens, every thread locks. That includes the Worker Info thread and the main GUI thread which shouldn’t be impacted at all by other threads. Especially Python that’s being launched in the deadlinesandbox these days.
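
For anyone who wants to experiment along the same lines, here is a minimal sketch of the pure-Python variant mentioned in the list above: a batch of threads that each launch one short-lived child process and are then joined. This is not the actual m1-check.py. The function names, the choice of `cat` as the child command, and the timeout are all assumptions made for illustration.

```python
import os
import subprocess
import threading


def run_iteration(thread_count, command=("cat", os.devnull), timeout=30.0):
    """Start thread_count threads that each run one child process, then join them.

    Returns one entry per thread: the child's exit code, or None if the
    child did not finish within the timeout or could not be started.
    """
    exit_codes = [None] * thread_count

    def worker(index):
        try:
            done = subprocess.run(
                list(command),
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            exit_codes[index] = done.returncode
        except (subprocess.TimeoutExpired, OSError):
            pass  # leave None so the caller can see this child stalled or failed

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(thread_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return exit_codes


def stress(iterations, thread_count):
    """Repeat run_iteration; return the first iteration with a missing exit code."""
    for iteration in range(1, iterations + 1):
        if None in run_iteration(thread_count):
            return iteration
    return None
```

Calling `stress(100, 16)` returns None if every round completed, or the number of the first round in which a child did not report an exit code. The per-child timeout is only there so a stalled child shows up as a result instead of hanging the test itself.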

My test for the Worker is to run this job, which has 5,000 tasks, no task limit (so it should render with 16 threads), and no error limit:
Ping Job (16 concurrent, no err limit).zip

I’ll keep at this, but it’s been one of those fun problems like that time a memory alignment in Mono caused a crash on Linux 2.6.something seven years ago. Real hard, real satisfying to figure out. :smiley:

All this to say, my detective hat is on and I’m most suspicious about .net Core and how it interacts with Rosetta 2 :man_detective:. Efforts like Support .NET on Apple Silicon with Rosetta 2 emulation really have my attention, but it may also be related to CPython’s GIL.

Update today. Let’s all peek into how Edwin’s brain works as he tests M1 lockup problems while doing e-mail.

At a high level, there is a serious lock-up problem while doing statistics gathering. With it enabled, the Worker reliably locks up in under 20 seconds. With it turned off, it lives for more than 20 minutes (I purposely killed it after it had run long enough). Digging in, I’m suspicious about threading/timers and how the backing thread pool works here. I think the problem now is threads creating threads that launch processes?

  • Started at 14:20 with another 5,000 tasks. Enabled CPU/Mem stats gathering
  • Locked up within a minute?! No tasks actually finished. :face_with_raised_eyebrow:
  • Restarted at 14:25. Locked up at 14:26? Claimed to be rendering for 28 seconds
  • Restarted at 14:27. Locked up in 13 seconds (using ‘Running Time’ metric here).
  • Turned off stats gathering.
  • Restarted at 14:29. Still running after 12 minutes…
  • Decided to turn stats back on to see if it dies
  • Restarted at 14:42. Locked up in 16 seconds?!
  • Restarted at 14:43. Locked up in 13 seconds.
  • Turned off stats gathering…
  • Restarted at 14:44. Didn’t lock up after 21 minutes
  • Turned on stats gathering…
  • Restarted at 15:06. Locked up in 17 seconds
  • Turned off stats gathering…
  • Code for this feature is stored in the regretfully named SlaveRenderThread.cs and is triggered by a timer elapsed event (so async operation through a thread pool). We get the managed process IDs and run some CollectValues() call on m_processMonitor which is a ScriptPluginMonitor (one of our C# classes)
  • Digging in, we’re just calling /bin/ps on Mac using the same Process2.RunProcessAndGetOutput() that I’m using. I have no idea why it’s locking up so fast. Third arg for me is True, let’s try changing it to False…
  • Didn’t make any appreciable difference. The m1_test.py script still locks up after 3 to 7 iterations…
  • The problem is seemingly more and more around running processes in sub-threads… I don’t know how C# handles timers. The stats gathering is run inside a System.Timers.Timer() object.
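
To make the shape of that stats-gathering path a bit more concrete, here is a rough Python sketch of the pattern described in the notes above: a background timer wakes up periodically, shells out to ps for a set of process IDs, and parses the output. It is not Deadline’s C# code. The class and function names are hypothetical, the ps columns and flags are my own assumption (the notes only say /bin/ps is called), and unlike System.Timers.Timer this uses one dedicated thread instead of a thread pool.

```python
import subprocess
import threading


def parse_ps_output(text):
    """Parse lines of 'pid cpu rss' into {pid: (cpu_percent, rss_kib)}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        pid, cpu, rss = fields
        stats[int(pid)] = (float(cpu), int(rss))
    return stats


def collect_values(pids):
    """Run ps once for the given process IDs and parse what it prints."""
    if not pids:
        return {}
    # Assumed flags: one -o per column with a trailing '=' to suppress headers.
    command = ["ps", "-o", "pid=", "-o", "pcpu=", "-o", "rss=",
               "-p", ",".join(str(pid) for pid in pids)]
    result = subprocess.run(command, capture_output=True, text=True, timeout=10)
    return parse_ps_output(result.stdout)


class StatsTimer:
    """Call on_sample(collect_values(get_pids())) every interval seconds."""

    def __init__(self, interval, get_pids, on_sample):
        self._interval = interval
        self._get_pids = get_pids
        self._on_sample = on_sample
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            self._on_sample(collect_values(self._get_pids()))

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
```

The relevant point is that the child process is launched from a thread other than the main one, on a timer, while render threads may be launching their own processes at the same moment.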

Hey Thinkbox Staff, really appreciate all of the time you’ve put into working on this issue, as well as these details. I can also confirm that the latest version is still hanging on our systems, though it’s happening after running for hours rather than minutes.

No problem! It’s a bit hard to find the time to dig in, so it’s sporadic right now but we’re making progress narrowing things down.

Testing from today:

  • Installed 10.1.23.6. Hit an unusual runtime error. Deleted /Applications/Thinkbox/Deadline10 folder and re-ran the installation
  • The m1_test.py script seems to be running faster and managed to complete without problems.
  • Installed Visual Studio for Mac and .net Core 6.0.416 (x86_64), and built a C# program equivalent to the m1_check.py script: m1_check.cs.zip.
  • Ran C# program for some time and while it hung briefly a few times, it continued to at least 60 iterations (I did modify what’s uploaded here to be 1000 iterations of 50 threads).

Testing results (thus far)

  • 10.3.1.5: Slightly improved. Test script still locks before 15 iterations, but Worker runs for ~50 minutes if and only if stats gathering is disabled
  • 10.3.0.15: Known problematic. Worker locks up in under 30 seconds
  • 10.2.1.1: Untested
  • 10.2.0.10: Untested
  • 10.1.23.6: No problems whatsoever. Test script went through 1,000 iterations, Worker ran for at least 20 minutes and was stopped. Reported good by customer.

Update! The C# application running on .net Core 6.0.416 x86_64 did eventually hard lock the same way after 153 iterations! Three cat processes remain and are stuck as zombies, which is what I see with the Python version. This narrows things down very concretely to .net Core.

We’re going to focus on upgrading to .net Core 8.0 in the next release for multiple reasons and it will likely clear things up as we did test this C# solution with 8.0 a month ago, but it will be some time before the next Deadline version is released. If you’re able, 10.1.23.6 will work in the mean time.

Hi @eamsler, I installed 10.1.23.6 and haven’t seen any of our M1 systems hang yet. They’ve been running for about 24 hours so far. Thanks again!

Just wanted to bump this up. We’re still experiencing intermittent hanging on 10.1.23.6. It’s much less than the newer versions but still occurring on multiple M1 machines.

Any updates?

Unfortunately, it’s going to be a while until we have another release here. As I noted before:

We’re going to focus on upgrading to .net Core 8.0 in the next release for multiple reasons and it will likely clear things up as we did test this C# solution with 8.0 a month ago, but it will be some time before the next Deadline version is released. If you’re able, 10.1.23.6 will work in the mean time.
