AWS Thinkbox Discussion Forums

Redshift in docker returned non-zero error code, 139

Actually, which Redshift version and which CUDA version are you using? In our case the problem only occurred when it was executed through the Deadline process, and we eventually discovered it was related to how Redshift prints messages to stdout, and that it only happened under the root user.


Redshift 3.5.19; the driver is the GRID driver 535.154.05 with CUDA 12.2.

The aim is to move to a newer version after the current project.

I am not 100% sure, but I remember that this version had a Linux bug… we’re now on 3.6.01 and it works well with CUDA 12.2… and as I remember, 3.5.24 was the version where they fixed the problem.


I switched to a non-root account and now the file goes through, but it then fails at the end.
I also updated to the latest version and have the same issue.

2024-09-06 11:06:24:  0: STDOUT: License for redshift-core 2024.12 valid until Dec 07 2024
2024-09-06 11:06:24:  0: STDOUT: Detected change in GPU device selection
2024-09-06 11:06:26:  0: STDOUT: Creating CUDA contexts
2024-09-06 11:06:26:  0: STDOUT: 	CUDA init ok
2024-09-06 11:06:26:  0: STDOUT: No devices available
2024-09-06 11:06:27:  0: STDOUT: PostFX: Shut down
2024-09-06 11:06:27:  0: STDOUT: Shutdown GPU Devices...
2024-09-06 11:06:27:  0: STDOUT: 	Devices shut down ok
2024-09-06 11:06:27:  0: STDOUT: Shutdown Rendering Sub-Systems...
2024-09-06 11:06:27:  0: STDOUT: License returned     
2024-09-06 11:06:27:  0: STDOUT: 	Finished Shutting down Rendering Sub-Systems
2024-09-06 11:06:27:  0: INFO: Process exit code: 1
2024-09-06 11:06:27:  0: Done executing plugin command of type 'Render Task'

Rendering outside of Deadline (and using a local license) goes through fine…

License acquired
License for net.maxon.license.app.bundle_maxonone-release~commercial valid until Oct 03 2024
Detected change in GPU device selection
Creating CUDA contexts
        CUDA init ok
        Ordinals: { 0 }
Initializing GPUComputing module (CUDA). Active device 0
        CUDA Driver Version: 12020
        CUDA API Version: 11020
        Device 1/1 : Tesla T4
        Compute capability: 7.5

It feels like it breaks after this line:

2024-09-06 11:06:26:  0: STDOUT: 	CUDA init ok
2024-09-06 11:06:26:  0: STDOUT: No devices available

I submitted the job again with GPU 0 selected and it went through (using the local license). Releasing the mx1 license and resubmitting the job now gives a different error:

2024-09-06 11:50:15:  0: STDOUT: Loading: /mnt/test/c4d/rs.rs
2024-09-06 11:50:15:  0: STDOUT: Maxon licensing error: License not activated (9)
2024-09-06 11:50:15:  0: STDOUT: Detected change in GPU device selection
2024-09-06 11:50:16:  0: STDOUT: Creating CUDA contexts
2024-09-06 11:50:16:  0: STDOUT: 	CUDA init ok
2024-09-06 11:50:16:  0: STDOUT: No devices available
2024-09-06 11:50:16:  0: STDOUT: PostFX: Shut down

Then you have a different problem: there is no device available. Run the nvidia-smi tool and see what it prints out.
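
For anyone wanting to script that check, here is a minimal sketch (assuming Python 3 is available on the render node) that runs nvidia-smi under the same account as the Deadline Worker and reports the GPUs that account can actually see. The query flags used are standard nvidia-smi options.

import subprocess

def visible_gpus():
    # Query index, name and driver version for every GPU nvidia-smi can see.
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"nvidia-smi failed: {exc}")
        return []
    return [line for line in out.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    gpus = visible_gpus()
    if not gpus:
        print("No GPUs visible -- matches Redshift's 'No devices available'")
    for gpu in gpus:
        print(gpu)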


Something weird is going on. Rendering outside of Deadline works fine, but with 3.5.20–3.6.04 I get an RLM error:

2024-09-06 12:47:26: 0: STDOUT: Loading: /mnt/test/c4d/rs.rs
2024-09-06 12:47:26: 0: STDOUT: License error: Error communicating with license server (-17)
2024-09-06 12:47:26: 0: STDOUT: License error: (RLM) Communications error with license server (-17)
2024-09-06 12:47:26: 0: STDOUT: Read error from network (-105)
2024-09-06 12:47:26: 0: STDOUT: select() system call error (comm: -15)Interrupted system call (errno: 4)
2024-09-06 12:47:26: 0: STDOUT: Detected change in GPU device selection

On 3.5.19 I don’t get this error, but it fails with exit code 1:

2024-09-06 12:51:43: 0: STDOUT: Loading: /mnt/test/c4d/rs.rs
2024-09-06 12:51:44: 0: STDOUT: License for redshift-core 2024.12 valid until Dec 07 2024
2024-09-06 12:51:44: 0: STDOUT: Detected change in GPU device selection
2024-09-06 12:51:45: 0: STDOUT: Creating CUDA contexts
2024-09-06 12:51:45: 0: STDOUT: CUDA init ok
2024-09-06 12:51:45: 0: STDOUT: No devices available
...
2024-09-06 12:51:47:  0: INFO: Process exit code: 1

But if I run the Worker under a user account rather than as a service, I get the same error exposed as with the later versions, and also confirmation that the license checks out:

0: STDOUT: Loading: /mnt/test/c4d/rs.rs
Port Forwarder (redshift:5054): Client connected to port forwarder.
Worker - Confirmed Credit Usage for "redshift".
0: STDOUT: License error: Error communicating with license server (-17)
0: STDOUT: License error: (RLM) Communications error with license server (-17)
0: STDOUT: Read error from network (-105)
0: STDOUT: select() system call error (comm: -15)Interrupted system call (errno: 4)
0: STDOUT: Detected change in GPU device selection
0: STDOUT: Creating CUDA contexts
0: STDOUT:      CUDA init ok
0: STDOUT: No devices available
0: STDOUT: PostFX: Shut down
0: STDOUT: Shutdown GPU Devices...
0: STDOUT:      Devices shut down ok

On Windows at least, GPU access is not allowed for services. I wonder how it works on Linux; perhaps some service configuration is possible?
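
For anyone hitting the same service-versus-user difference on Linux, one way to probe it is to check whether the account running the Worker can open the NVIDIA device nodes. This is only a sketch, assuming a standard driver install that exposes /dev/nvidia* nodes (the script itself is hypothetical, not part of Deadline or Redshift):

import glob
import os

def check_nvidia_device_access():
    # Assumption: the driver exposes /dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm, etc.
    nodes = sorted(glob.glob("/dev/nvidia*"))
    if not nodes:
        print("No /dev/nvidia* nodes found (driver not loaded, or not passed into the container?)")
        return
    for node in nodes:
        ok = os.access(node, os.R_OK | os.W_OK)
        print(f"{node}: {'ok' if ok else 'NO ACCESS'}")

if __name__ == "__main__":
    print(f"Running as uid={os.getuid()}")
    check_nvidia_device_access()

A service account that reports NO ACCESS here could explain “No devices available” even though nvidia-smi works for an interactive user.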


The actual issue was a port being blocked one way to the UBL license forwarder, which was confusing. I eventually changed the Global Command override to 5554 (the default is 5054) and left UBL Redshift at 5054. It’s a bit annoying having the overlap!
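
If anyone needs to debug a similar one-way block, here is a minimal sketch that probes the forwarder ports from the render node (“forwarder-host” is a placeholder for your actual UBL forwarder address; 5054/5554 mirror the ports discussed above):

import socket

def can_connect(host, port, timeout=3.0):
    # Plain TCP connect test; a one-way firewall block shows up as a timeout here.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = "forwarder-host"  # placeholder -- replace with the actual forwarder address
    for port in (5054, 5554):
        state = "open" if can_connect(host, port) else "blocked or closed"
        print(f"{host}:{port} -> {state}")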

I also moved “libredshift-core-cpu.so” to “libredshift-core-cpu.so.BKP” as recommended in another thread, so everything is now rendering fine.

Thanks, all.
