Redshift in docker returned non-zero error code, 139

rur · June 1, 2023, 4:07pm

Hello,

we’ve redshift installed in docker. When execute command manually in the container it renders successfully. When same command is executed through deadline it hangs on the non-zero error code 139. Here is the log where i didn’t find anything.

Anyone has some thoughts ? Thank you so much!!

2023-06-01 17:49:40 0: INFO: Full Command: "/usr/redshift/bin/redshiftCmdLine" "/storage/temp/maya_test/redshift/rs_main/redshift_test_01.0019.rs" -oip "/storage/temp/maya_test/images"
2023-06-01 17:49:40 0: INFO: Startup Directory: "/usr/redshift/bin"
2023-06-01 17:49:40 0: INFO: Process Priority: BelowNormal
2023-06-01 17:49:40 0: INFO: Process Affinity: default
2023-06-01 17:49:40 0: INFO: Process is now running
2023-06-01 17:49:40 0: STDOUT: Redshift Command-Line Renderer (version 3.5.15 - API: 3505)
2023-06-01 17:49:40 0: STDOUT: Copyright 2023 Redshift Rendering Technologies
2023-06-01 17:49:40 0: STDOUT: ls: cannot access /dev/disk/by-id/: No such file or directory
2023-06-01 17:49:40 0: STDOUT: ls: cannot access /dev/disk/by-id/: No such file or directory
2023-06-01 17:49:40 0: STDOUT: cat: /sys/devices/virtual/dmi/id/board_vendor: No such file or directory
2023-06-01 17:49:40 0: STDOUT: cat: /sys/devices/virtual/dmi/id/board_name: No such file or directory
2023-06-01 17:49:40 0: STDOUT: cat: /sys/devices/virtual/dmi/id/board_version: No such file or directory
2023-06-01 17:49:41 0: INFO: Process exit code: 139
2023-06-01 17:49:41 0: Done executing plugin command of type 'Render Task'
2023-06-01 17:49:41 0: Executing plugin command of type 'End Job'
2023-06-01 17:49:41 0: Done executing plugin command of type 'End Job'
2023-06-01 17:49:43 Sending kill command to process tree with root process 'deadlinesandbox.exe' with process id 6261
2023-06-01 17:49:45 Scheduler Thread - Render Thread 0 threw a major error: 
2023-06-01 17:49:45 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-06-01 17:49:45 
2023-06-01 17:49:45 Exception Details
2023-06-01 17:49:45 RenderPluginException -- Error: Renderer returned non-zero error code, 139. Check the log for more information.
2023-06-01 17:49:45    at Deadline.Plugins.PluginWrapper.RenderTasks(Task task, String& outMessage, AbortLevel& abortLevel)
2023-06-01 17:49:45 RenderPluginException.Cause: JobError (2)
2023-06-01 17:49:45 RenderPluginException.Level: Major (1)
2023-06-01 17:49:45 RenderPluginException.HasSlaveLog: True
2023-06-01 17:49:45 RenderPluginException.SlaveLogFileName: /var/log/Thinkbox/Deadline10/deadlineslave_renderthread_0-21537bebe8ed-0000.log
2023-06-01 17:49:45 Exception.TargetSite: Deadline.Slaves.Messaging.PluginResponseMemento d(Deadline.Net.DeadlineMessage, System.Threading.CancellationToken)
2023-06-01 17:49:45 Exception.Data: ( )
2023-06-01 17:49:45 Exception.Source: deadline
2023-06-01 17:49:45 Exception.HResult: -2146233088
2023-06-01 17:49:45   Exception.StackTrace: 
2023-06-01 17:49:45    at Deadline.Plugins.SandboxedPlugin.d(DeadlineMessage bgt, CancellationToken bgu
2023-06-01 17:49:45    at Deadline.Plugins.SandboxedPlugin.RenderTask(Task task, CancellationToken cancellationToken
2023-06-01 17:49:45    at Deadline.Slaves.SlaveRenderThread.c(TaskLogWriter ajy, CancellationToken ajz)
2023-06-01 17:49:45 
2023-06-01 17:49:45 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Justin_B · June 1, 2023, 4:58pm

I’m not familiar with how/when Docker makes mounted storage available to a container, but given all the failures in STDOUT at least appear to have to do with disks maybe the Worker’s getting started before storage has been made available?

Is it just Redshift that behaves like this?

rur · June 1, 2023, 5:08pm

These disk errors are fine even it’s strange. And it happens only to the redshift. Maya, mantra and nuke are fine. And it just happens in deadline. If i run the redshift from console it works properly. The worst thing is i can’t debug it.

rur · June 1, 2023, 6:06pm

Is there a chance to run command from console in the deadline sandbox ? It would help a lot to know what is hapenning

rur · June 1, 2023, 6:15pm

I’ve tried to run redshift through command line job and it has a same problem so it’s not a problem of redshift deadline implementation. It seems it’s a problem with running that process through deadline python implementation. I’ll try to run it manually through python api.

Justin_B · June 1, 2023, 6:28pm

The issue with that sort of test is that Deadline doesn’t use regular python for the application plugins. We use Python for .NET, which combines c# and Python together.

Is there maybe an environment difference that’s causing Redshift to exit immediately? Given you’ve re-created this in a commandline render you could dump the environment there, and compare it to a dump run just in the container outside of Deadline.

When connecting to the container to run your test are there changes to the hardware accessible? Given exit code 139 is a segmentation fault (SIGSEGV) maybe there’s a difference there?

rur · June 1, 2023, 6:49pm

Yes i totally forgot you’re emulating on linux. Then i’ll try to run these via deadlinecommand … and maybe there is a problem with GPU as i see now that in the Monitor it says empty field on GPU. In the container terminal through the nvidia-smi i see GPu properly but there is a problem due to mono … i will investigate

rur · June 1, 2023, 6:50pm

On the other hand you have images in AWS with redshift on linux and those are containerized as well so maybe that is good point too

rur · June 1, 2023, 7:32pm

So the GPU is not the reason. Even without GPU assigned it has a same problem. I’ve found a solution how to manually execute through your process.

deadlinecommand -ExecuteScriptNooGui test.py

and test.py has

import subprocess

subprocess.run(['/usr/redshift/bin/redshiftCmdLine'])

The result is the same it hangs.

rur · June 1, 2023, 7:36pm

On the other hand i’ve created an instance of your image on AWS and did the same and it works there!

rur · June 1, 2023, 8:15pm

So finally … it’s the version of redshift … latest version of redshift 3.5.15 causes this problem … any version greater than 3.5.13 on linux has this problem! As a workaround is fine to know we can render with 3.5.13 but would be great to fix it. The question is on which side is the problem ;o)

Justin_B · June 2, 2023, 2:37pm

Well given a change in Redshift version I’d bet there’s a change in hardware detection between 3.5.15 and 3.5.13.

We don’t use subprocess to run applications, so that test also hanging is interesting. Did you test that in vanilla Python, and if you did how did it behave? If it hangs with no Deadline involvement using subprocess it might be worth a ticket to the Redshift folks.

rur · June 6, 2023, 8:07am

In vanilla python it works as expected. It only has a problem running in deadline env. So it seems it’s on your side.

eamsler · June 21, 2023, 8:24pm

This is incredible! I’m very excited to try and figure this out. I’ve asked Justin to schedule a call in the matching ticket.

anthonygelatka · September 4, 2024, 1:32pm

I’m hitting this issue on a cloud GPU, outside of Deadline it renders

“/usr/redshift/bin/redshiftCmdLine” “/mnt/test/c4d/rs/rs_0000.rs”

but in Deadline it throws a 139 error, did you ever get to the bottom of this?

Thanks

eamsler · September 4, 2024, 3:23pm

A 139 exit code on Unix corresponds to a SIGSEV (see 128+n: Fatal error signal “n”).

139-128=11. Signal 11 is SIGSEGV aka “Segmentation violation” aka, some low level program oopsie interacted with RAM in a way that the Kernel had to kill.

I don’t think this is fixable without changes in Redshift… Is there a stack trace or anything? Maxon should likely take a look.

anthonygelatka · September 4, 2024, 4:12pm

Thanks Edwin, I’ve got a ticket open with Maxon, but outside of Deadline this command runs fine.

It’s literally running /usr/rs/bin/cmdline /path/to/file.rs as the same user, but when running as deadline it gives 139. the scene is a basic sphere test.

I’m not using Docker, but searching gave me this result, this is a T4 AWS box running Rocky 8.8

michal.mocnak · September 4, 2024, 4:35pm

Hey Guys, it was caused when running deadline worker as a root user. You have to run it as non-root and it will work. Cheers Michal!

anthonygelatka · September 4, 2024, 4:49pm

Thanks Michal, much appreciated!

I’m guessing this is something Deadline is doing? as running the command directly as root has no problem

Full Command: "/usr/redshift/bin/redshiftCmdLine" "/mnt/test.rs"
Startup Directory: "/usr/redshift/bin"
Process Priority: BelowNormal
Process Affinity: default
Process is now running
Redshift Command-Line Renderer (version 3.5.19 - API: 3505)
Copyright 2023 Redshift Rendering Technologies
Process exit code: 139
Done executing plugin command of type 'Render Task'
=======================================================

eamsler · September 4, 2024, 5:44pm

This is triggering something the back of my brain, but I don’t recall… I really hope that there’s some special flag that can be provided to have it dump the stack.

If not, you can try using GDB to grab it. Some details here:

The main benefit here is that you’ll at least know where in the haystack the needle is that’s bringing the world down.