I have managed to reduce the frequency of the constant hanging/crashing by increasing the housecleaning interval (now set to 3600). I am also seeing this happen to the RCS about once a day, too.
Thanks AnadinMoshi, I gave that a try, but my workers are still stalling.
Seeing exactly the same issue on two different versions of Deadline on new M2 Ultra Studios running the latest Ventura.
The machines themselves do not crash, but the Deadline Worker seems to hang and cause a spinning wheel of death on each machine.
The time it takes before they seize up seems random; it is not consistent.
The machines then show as Stalled in Deadline Monitor.
Machine power settings are all set to always on, no screensaver, etc.
I have a second set of nodes running Monterey on Mac Pro 2019s, and those do not stall at all.
Hey Thinkbox, any luck reproducing this issue on your end? We’re currently unable to utilize any of our M1 Macs for Deadline rendering. Can we expect to see Deadline running natively on Apple Silicon any time soon? Thanks!
Dave
Consistent failures on both M1 and M2. I have tried multiple options:
- setting cleanup to 3600
- enabling “Prevent App Nap” for both the Monitor and the Worker
- system power settings all set to always on, Rosetta reinstalled several times
- csrutil disabled, security settings set to low, Thinkbox and C4D given Full Disk Access

Yet all 14 M2 Studio nodes will randomly become stalled, and only in the Deadline apps, both Worker and Monitor.
The Monitor crashes consistently on M1.
My second farm, which is 15 Mac Pros (2019 Intels), does not have any issues!
This is really unacceptable for a product many businesses rely on.
We’ve not had any luck reliably re-creating this issue in order to root-cause it; it is as intermittent for you folks as it is for us. We’ve got an engineering issue rolling, but due to Amazon policy we can’t share any details about when it would be resolved or when a release with a fix would happen.
Qt seems to be the common denominator in the spindumps we’ve gotten in this thread and in Worker randomly crashing on MacOS M2 Ultra. We only use Qt to create the UI, which means running the Worker with the -nogui flag should resolve the issue, but that’s not what we’ve been seeing.
If possible, could we get a spindump from a crash when running the Worker with the -nogui flag? It could be that there are two causes of the hangs and crashes, and the second only shows up with Qt removed from the equation.
Thanks!
Hey Justin, thanks so much for the response. I’ve just sent you a DM with a -nogui spindump. I had previously attempted running it with the -nogui flag, but it still stalls/hangs. It also might be worth mentioning that I’m having no issues whatsoever running the Monitor on my M1 systems; it’s just the Worker that’s problematic. Thanks!
Just wanted to follow up on this thread, has anyone found a workaround yet?
Also following up here to see if any progress has been made toward a solution. Thanks.
We’ve re-created it; it’s just the Worker that fails for us.
Beyond that, there’s nothing I can share, unfortunately. As soon as we release a fix you’ll see it in the Release History — Deadline 10.3.0.15 documentation.
Thanks Justin. Yes, it’s just the Worker that’s giving me issues. Hoping to see that fix soon!
Hello All
Good news: we have tested this issue on the latest version of Deadline (10.3.1.3), which includes an updated .NET Core version, and it seems to be working fine. Please upgrade and test.
I was still seeing crashes after updating to 10.3.1.3 (and initially updating my MongoDB database to version 5), but I recently unchecked the setting:
Worker Settings → Gather System Resources
and it seems to be much more stable. With this setting checked, I was getting worker crashes even mid-render.
UPDATE: There are still some worker crashes, but they seem to be much less frequent.
Spoke too soon: crashes are happening even more frequently. Really hoping there will be a fix soon for M1/M2 Macs.
Again, it’s just the Workers that are crashing, and on x64 machines, it works fine.
I did see a Worker crash when asking it to restart after current task completion.
I’ve been running a series of tests, and the Worker operates much better after the .NET Core upgrade from 6.0.14 to 6.0.416… That’s 402 patch releases to sift through, but maybe something will fall out of that digging.
Some results of testing:
I have this script, m1-check.py, that will lock up in under a minute when run with /Volumes/Applications/Thinkbox/Deadline10/Resources/deadlinecommand ExecuteScriptNoGui <path>. Upgrading from 10.3.0.13 to 10.3.1.3 made no appreciable difference. For numbers: before the upgrade I was able to get 6 and 11 iterations in two runs; after the upgrade it locked at 4 and 25 iterations.
What was very interesting is that after the upgrade, the Worker was able to go from running for under a minute with 10 threads to running for 58 minutes with 16!
Some facts about this that someone may be able to run with:
- Deadline 10.1.23.6 has been reported to not have this problem (I have yet to check this)
- Putting something as small as a print statement just after “thread.join()” allows for 3x the number of threads, but that may have just been a series of good runs
- A purely C# version of the Python script does not lock up on .NET Core 8. I’m not sure how to run that code inside Deadline 10.3.1.3’s .NET Core yet.
- A pure Python implementation using popen() does not lock up at all (a rough sketch of the test’s shape follows this list)
- When a lock happens, every thread locks. That includes the Worker Info thread and the main GUI thread, which shouldn’t be impacted at all by the other threads, especially the Python that’s being launched in the deadlinesandbox these days.
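For context, here is a minimal sketch of the shape of that test: threads that each launch a short-lived child process and then join. This is what both the deadlinecommand-run script and the pure-Python popen() variant appear to do; the iteration count, thread count, and the use of cat as the child process are assumptions pieced together from the notes in this thread, not the actual contents of m1-check.py.

```python
# Hypothetical reconstruction of the stress test's shape (not the real m1-check.py).
# Assumptions: 1,000 iterations of 50 threads, each spawning a short-lived child
# process; cat is used here because the hangs left zombie cat processes behind.
import subprocess
import threading

ITERATIONS = 1000
THREADS_PER_ITERATION = 50

def spawn_child():
    # Launch a child process and wait for it to exit.
    proc = subprocess.Popen(["/bin/cat", "/etc/hosts"],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    proc.wait()

for i in range(ITERATIONS):
    threads = [threading.Thread(target=spawn_child)
               for _ in range(THREADS_PER_ITERATION)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("iteration", i, "complete")  # when a hang hits, these stop appearing
```

Per the notes above, this pattern completes fine under plain CPython; the lock-ups show up when the equivalent script is executed through deadlinecommand.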
My test for the Worker is to run this job, which has 5,000 tasks, no task limit (so it should render with 16 concurrent threads), and no error limit:
Ping Job (16 concurrent, no err limit).zip
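For anyone who wants to approximate that test job without the attachment, below is a rough Python submission sketch. The job info key names and the “Ping” plugin are guesses based on the attachment name and the description above, so treat every key here as an assumption rather than the contents of the actual job files.

```python
# Hypothetical submission sketch; key names are approximate and the "Ping" plugin
# is assumed from the attachment name above, not confirmed.
import subprocess
import tempfile

DEADLINECOMMAND = "/Volumes/Applications/Thinkbox/Deadline10/Resources/deadlinecommand"

job_info = "\n".join([
    "Plugin=Ping",           # assumed trivial test plugin
    "Name=M1 lockup test",
    "Frames=0-4999",         # 5,000 tasks, one frame per task
    "ChunkSize=1",
    "ConcurrentTasks=16",    # render 16 tasks at once on one Worker
])

with tempfile.NamedTemporaryFile("w", suffix=".job", delete=False) as ji, \
     tempfile.NamedTemporaryFile("w", suffix=".job", delete=False) as pi:
    ji.write(job_info)
    pi.write("")  # plugin info left empty in this sketch
    ji_path, pi_path = ji.name, pi.name

# deadlinecommand accepts a job info file and a plugin info file to submit a job.
subprocess.run([DEADLINECOMMAND, ji_path, pi_path], check=True)
```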
I’ll keep at this, but it’s been one of those fun problems, like that time a memory alignment issue in Mono caused a crash on Linux 2.6.something seven years ago. Real hard, real satisfying to figure out.
All this to say: my detective hat is on and I’m most suspicious of .NET Core and how it interacts with Rosetta 2. Efforts like Support .NET on Apple Silicon with Rosetta 2 emulation really have my attention, but it may also be related to CPython’s GIL.
Update today. Let’s all peek into how Edwin’s brain works as he tests M1 lockup problems while doing e-mail.
At a high level, there is a serious lock-up problem while doing statistics gathering. With it enabled, the Worker reliably locks up in under 20 seconds. With it turned off, it lives for more than 20 minutes (I purposely killed it after it had run long enough). Digging in, I’m suspicious of threading/timers and how the backing thread pool works here. I think the problem now is threads creating threads that launch processes?
- Started at 14:20 with another 5,000 tasks. Enabled CPU/Mem stats gathering
- Locked up within a minute?! No tasks actually finished.
- Restarted at 14:25. Locked up at 14:26? Claimed to be rendering for 28 seconds
- Restarted at 14:27. Locked up in 13 seconds (using ‘Running Time’ metric here).
- Turned off stats gathering.
- Restarted at 14:29. Still running after 12 minutes…
- Decided to turn stats back on to see if it dies
- Restarted at 14:42. Locked up in 16 seconds?!
- Restarted at 14:43. Locked up in 13 seconds.
- Turned off stats gathering…
- Restarted at 14:44. Didn’t lock up after 21 minutes
- Turned on stats gathering…
- Restarted at 15:06. Locked up in 17 seconds
- Turned off stats gathering…
- The code for this feature lives in the regretfully named SlaveRenderThread.cs and is triggered by a timer elapsed event (so an async operation through a thread pool). We get the managed process IDs and run a CollectValues() call on m_processMonitor, which is a ScriptPluginMonitor (one of our C# classes).
- Digging in, on Mac we’re just calling /bin/ps using the same Process2.RunProcessAndGetOutput() that I’m using. I have no idea why it’s locking up so fast. The third arg for me is True; let’s try changing it to false…
- That didn’t make any appreciable difference. The m1_test.py script still locks up after 3 to 7 iterations…
- The problem seems more and more to be around running processes in sub-threads… I don’t know how C# handles timers. The stats gathering runs inside a System.Timers.Timer() object (a rough Python analogue of this whole pattern is sketched below).
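To make that pattern concrete, here is a rough Python analogue of the suspected shape of the problem: render-style threads spawning child processes while a repeating timer fires on another thread and shells out to /bin/ps for stats. This is only a sketch under those assumptions, not Deadline’s actual C# code, and the interval and thread count are made up.

```python
# Rough Python analogue of the suspected pattern: worker threads spawning child
# processes while a periodic "stats gathering" timer also spawns /bin/ps from a
# separate thread. The interval and thread count are arbitrary.
import subprocess
import threading

STATS_INTERVAL_SECONDS = 5.0
RENDER_THREADS = 16

def gather_stats():
    # Analogue of the stats-gathering timer: shell out to /bin/ps, then reschedule.
    subprocess.run(["/bin/ps", "-ax", "-o", "pid,rss,command"],
                   capture_output=True)
    timer = threading.Timer(STATS_INTERVAL_SECONDS, gather_stats)
    timer.daemon = True
    timer.start()

def render_task():
    # Analogue of a render thread launching a short-lived child process.
    subprocess.run(["/bin/cat", "/etc/hosts"], capture_output=True)

gather_stats()  # kick off the repeating "stats" timer

while True:  # stress loop; interrupt manually, a hang shows up as no further progress
    threads = [threading.Thread(target=render_task) for _ in range(RENDER_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Under plain CPython this just runs indefinitely; the point is only to illustrate the threads-plus-timer-plus-subprocess combination that the notes above single out.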
Hey Thinkbox Staff, really appreciate all of the time you’ve put into working on this issue, as well as these details. I can also confirm that the latest version is still hanging on our systems, though it’s happening after running for hours rather than minutes.
No problem! It’s a bit hard to find the time to dig in, so it’s sporadic right now but we’re making progress narrowing things down.
Testing from today:
- Installed 10.1.23.6. Hit an unusual runtime error; deleted the /Applications/Thinkbox/Deadline10 folder and re-ran the installation.
- The m1_test.py script seems to run faster and managed to complete without problems.
- Installed Visual Studio for Mac and .NET Core 6.0.416 (x86_64), and built a C# program equivalent to the m1_check.py script: m1_check.cs.zip.
- Ran the C# program for some time, and while it hung briefly a few times, it continued to at least 60 iterations (I did modify what’s uploaded here to be 1,000 iterations of 50 threads).
Testing results (thus far)
- 10.3.1.5: Slightly improved. Test script still locks before 15 iterations, but Worker runs for ~50 minutes if and only if stats gathering is disabled
- 10.3.0.15: Known problematic. Worker locks up in under 30 seconds
- 10.2.1.1: Untested
- 10.2.0.10: Untested
- 10.1.23.6: No problems whatsoever. Test script went through 1,000 iterations, Worker ran for at least 20 minutes and was stopped. Reported good by customer.
Update! The C# application running on .NET Core 6.0.416 x86_64 did eventually hard lock the same way after 153 iterations! Three cat processes remain and are stuck as zombies, which is what I see with the Python version. This narrows things down very concretely to .NET Core.
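For anyone reproducing this, a quick way to check for those leftover zombie children after a hang is to look for a process state of Z in ps output; a minimal sketch:

```python
# List zombie processes after a hang; zombies have a state beginning with "Z".
import subprocess

out = subprocess.run(["/bin/ps", "-ax", "-o", "pid,stat,command"],
                     capture_output=True, text=True).stdout

for line in out.splitlines()[1:]:
    pid, stat, *cmd = line.split(None, 2)
    if stat.startswith("Z"):
        print("zombie:", pid, " ".join(cmd))
```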
We’re going to focus on upgrading to .NET Core 8.0 in the next release for multiple reasons, and it will likely clear things up, since we did test this C# solution with 8.0 a month ago; but it will be some time before the next Deadline version is released. If you’re able, 10.1.23.6 will work in the meantime.
Hi @eamsler, I installed 10.1.23.6 and haven’t seen any of our M1 systems hang yet. They’ve been running for about 24 hours so far. Thanks again!
Just wanted to bump this up. We’re still experiencing intermittent hanging on 10.1.23.6. It’s much less frequent than with the newer versions, but it’s still occurring on multiple M1 machines.
Any updates?