We have Redshift installed in Docker. When we execute the command manually in the container it renders successfully, but when the same command is executed through Deadline it exits with non-zero error code 139. Here is the log, where I couldn't find anything useful.
Does anyone have any thoughts? Thank you so much!!
2023-06-01 17:49:40 0: INFO: Full Command: "/usr/redshift/bin/redshiftCmdLine" "/storage/temp/maya_test/redshift/rs_main/redshift_test_01.0019.rs" -oip "/storage/temp/maya_test/images"
2023-06-01 17:49:40 0: INFO: Startup Directory: "/usr/redshift/bin"
2023-06-01 17:49:40 0: INFO: Process Priority: BelowNormal
2023-06-01 17:49:40 0: INFO: Process Affinity: default
2023-06-01 17:49:40 0: INFO: Process is now running
2023-06-01 17:49:40 0: STDOUT: Redshift Command-Line Renderer (version 3.5.15 - API: 3505)
2023-06-01 17:49:40 0: STDOUT: Copyright 2023 Redshift Rendering Technologies
2023-06-01 17:49:40 0: STDOUT: ls: cannot access /dev/disk/by-id/: No such file or directory
2023-06-01 17:49:40 0: STDOUT: ls: cannot access /dev/disk/by-id/: No such file or directory
2023-06-01 17:49:40 0: STDOUT: cat: /sys/devices/virtual/dmi/id/board_vendor: No such file or directory
2023-06-01 17:49:40 0: STDOUT: cat: /sys/devices/virtual/dmi/id/board_name: No such file or directory
2023-06-01 17:49:40 0: STDOUT: cat: /sys/devices/virtual/dmi/id/board_version: No such file or directory
2023-06-01 17:49:41 0: INFO: Process exit code: 139
2023-06-01 17:49:41 0: Done executing plugin command of type 'Render Task'
2023-06-01 17:49:41 0: Executing plugin command of type 'End Job'
2023-06-01 17:49:41 0: Done executing plugin command of type 'End Job'
2023-06-01 17:49:43 Sending kill command to process tree with root process 'deadlinesandbox.exe' with process id 6261
2023-06-01 17:49:45 Scheduler Thread - Render Thread 0 threw a major error:
2023-06-01 17:49:45 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2023-06-01 17:49:45
2023-06-01 17:49:45 Exception Details
2023-06-01 17:49:45 RenderPluginException -- Error: Renderer returned non-zero error code, 139. Check the log for more information.
2023-06-01 17:49:45 at Deadline.Plugins.PluginWrapper.RenderTasks(Task task, String& outMessage, AbortLevel& abortLevel)
2023-06-01 17:49:45 RenderPluginException.Cause: JobError (2)
2023-06-01 17:49:45 RenderPluginException.Level: Major (1)
2023-06-01 17:49:45 RenderPluginException.HasSlaveLog: True
2023-06-01 17:49:45 RenderPluginException.SlaveLogFileName: /var/log/Thinkbox/Deadline10/deadlineslave_renderthread_0-21537bebe8ed-0000.log
2023-06-01 17:49:45 Exception.TargetSite: Deadline.Slaves.Messaging.PluginResponseMemento d(Deadline.Net.DeadlineMessage, System.Threading.CancellationToken)
2023-06-01 17:49:45 Exception.Data: ( )
2023-06-01 17:49:45 Exception.Source: deadline
2023-06-01 17:49:45 Exception.HResult: -2146233088
2023-06-01 17:49:45 Exception.StackTrace:
2023-06-01 17:49:45 at Deadline.Plugins.SandboxedPlugin.d(DeadlineMessage bgt, CancellationToken bgu
2023-06-01 17:49:45 at Deadline.Plugins.SandboxedPlugin.RenderTask(Task task, CancellationToken cancellationToken
2023-06-01 17:49:45 at Deadline.Slaves.SlaveRenderThread.c(TaskLogWriter ajy, CancellationToken ajz)
2023-06-01 17:49:45
2023-06-01 17:49:45 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
I'm not familiar with how/when Docker makes mounted storage available to a container, but given that the failures in STDOUT all appear to be disk-related, maybe the Worker is getting started before storage has been made available?
Those disk errors are harmless, even if they look strange. And this only happens with Redshift; Maya, Mantra, and Nuke are fine. It also only happens in Deadline: if I run Redshift from the console it works properly. The worst part is that I can't debug it.
I've tried running Redshift through a CommandLine job and it has the same problem, so it's not an issue with the Redshift Deadline plugin. It seems to be a problem with running that process through Deadline's Python implementation. I'll try to run it manually through the Python API.
The issue with that sort of test is that Deadline doesn't use regular Python for the application plugins. We use Python for .NET, which combines C# and Python.
Is there maybe an environment difference that's causing Redshift to exit immediately? Since you've reproduced this in a CommandLine render, you could dump the environment there and compare it to a dump taken in the container outside of Deadline.
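A minimal sketch of that comparison (the `/tmp` paths are placeholders; in practice you'd point them at storage both runs can reach):

```shell
# Dump the environment from a Deadline CommandLine job and from a plain
# shell in the same container, then diff the two dumps. Any variable that
# differs (or is missing) is a candidate for the crash.
env | sort > /tmp/env_deadline.txt   # run this one as the Deadline job
env | sort > /tmp/env_shell.txt      # run this one from the container shell
diff /tmp/env_deadline.txt /tmp/env_shell.txt
```

Variables like `LD_LIBRARY_PATH`, `HOME`, and anything CUDA/GPU related are the usual suspects when a renderer behaves differently under a render manager.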
When connecting to the container to run your test, is the accessible hardware any different? Given that exit code 139 is a segmentation fault (SIGSEGV), maybe there's a difference there?
Yes, I totally forgot you're emulating on Linux. Then I'll try running these via deadlinecommand … and maybe there is a problem with the GPU, as I see now that the GPU field is empty in the Monitor. In the container terminal nvidia-smi shows the GPU properly, but there is a problem due to Mono … I'll investigate.
So finally … it's the version of Redshift. The latest version, 3.5.15, causes this problem; any version greater than 3.5.13 on Linux has it! As a workaround it's good to know we can render with 3.5.13, but it would be great to fix it properly. The question is which side the problem is on ;o)
Well, given that a Redshift version change triggers it, I'd bet there's a change in hardware detection between 3.5.13 and 3.5.15.
We don't use subprocess to run applications, so that test also failing is interesting. Did you test it in vanilla Python, and if so, how did it behave? If it fails with no Deadline involvement using subprocess, it might be worth a ticket to the Redshift folks.
139 − 128 = 11. Signal 11 is SIGSEGV, a.k.a. "segmentation violation": some low-level code interacted with RAM in a way that forced the kernel to kill the process.
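You can verify that arithmetic in any shell: a child process killed by signal N is reported to its parent with exit status 128 + N.

```shell
# Reproduce exit code 139 by sending SIGSEGV to a throwaway shell:
bash -c 'kill -SEGV $$'
echo $?     # prints 139

# Map signal 11 back to its name:
kill -l 11  # prints SEGV
```

So Deadline isn't inventing the 139; it's faithfully reporting that the renderer segfaulted.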
I don’t think this is fixable without changes in Redshift… Is there a stack trace or anything? Maxon should likely take a look.
Thanks Edwin, I’ve got a ticket open with Maxon, but outside of Deadline this command runs fine.
It's literally running /usr/rs/bin/cmdline /path/to/file.rs as the same user, but when run through Deadline it gives 139. The scene is a basic sphere test.
I'm not using Docker, but searching brought me to this thread. This is a T4 AWS box running Rocky 8.8.
I'm guessing this is something Deadline is doing, as running the command directly as root has no problem?
Full Command: "/usr/redshift/bin/redshiftCmdLine" "/mnt/test.rs"
Startup Directory: "/usr/redshift/bin"
Process Priority: BelowNormal
Process Affinity: default
Process is now running
Redshift Command-Line Renderer (version 3.5.19 - API: 3505)
Copyright 2023 Redshift Rendering Technologies
Process exit code: 139
Done executing plugin command of type 'Render Task'
=======================================================
This is triggering something in the back of my brain, but I don't recall… I really hope there's some special flag that can be provided to have it dump the stack.
If not, you can try using GDB to grab it. Some details here:
The main benefit is that you'll at least know where in the haystack the needle is that's bringing everything down.
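A command sketch of that GDB approach, assuming gdb is installed in the container (the command and paths below are the ones from the log earlier in this thread):

```shell
# Run the same command Deadline runs, under gdb, and print a backtrace
# at the moment of the SIGSEGV:
gdb --batch -ex run -ex bt \
    --args /usr/redshift/bin/redshiftCmdLine /mnt/test.rs

# Alternative: enable core dumps, reproduce the crash, then inspect the core.
ulimit -c unlimited
/usr/redshift/bin/redshiftCmdLine /mnt/test.rs
gdb /usr/redshift/bin/redshiftCmdLine core -ex bt -ex quit
```

Either way, the resulting backtrace is exactly what Maxon would want attached to the ticket.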