
Redshift - Intermittent 'No devices available'

We’re rendering with Redshift on AWS and have a situation where the render fails on the first couple of attempts and then, seemingly at random, succeeds. The error we get when it fails is as follows:

Loading Job's Plugin timeout is Disabled
SandboxedPlugin: Render Job As User disabled, running as current user 'ec2-user'
Executing plugin command of type 'Initialize Plugin'
INFO: Executing plugin script '/var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/plugins/64ca3a24b36cc246db88b058/Redshift.py'
INFO: Plugin execution sandbox using Python version 3
INFO: Redshift Path Mapping...
INFO: source: "D:\AWS_DEADLINE" dest: "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: source: "Z:\" dest: "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: [REDSHIFT_PATHOVERRIDE_FILE] now set to: "/var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/jobsData/64ca3a24b36cc246db88b058/RSMapping_tempSZkEF0/RSMapping.txt"
INFO: About: Redshift Plugin for Deadline
INFO: The job's environment will be merged with the current environment before rendering
Done executing plugin command of type 'Initialize Plugin'
Start Job timeout is disabled.
Task timeout is disabled.
Loaded job: Untitled (64ca3a24b36cc246db88b058)
Executing plugin command of type 'Start Job'
INFO: Sending StartTaskRequest to S3BackedCacheClient.
DEBUG: Request:
DEBUG: 	JobId: 64ca3a24b36cc246db88b058
DEBUG: 	JobUploadWhitelist: 
DEBUG: 	JobUploadWhitelistRe: ^.+\.abc$, ^.+\.avi$, ^.+\.bmp$, ^.+\.bw$, ^.+\.cin$, ^.+\.cjp$, ^.+\.cjpg$, ^.+\.cxr$, ^.+\.dds$, ^.+\.dpx$, ^.+\.dwf$, ^.+\.dwfx$, ^.+\.dwg$, ^.+\.dxf$, ^.+\.dxx$, ^.+\.eps$, ^.+\.exr$, ^.+\.fbx$, ^.+\.fxr$, ^.+\.hdr$, ^.+\.icb$, ^.+\.iff$, ^.+\.iges$, ^.+\.igs$, ^.+\.int$, ^.+\.inta$, ^.+\.iris$, ^.+\.jpe$, ^.+\.jpeg$, ^.+\.jpg$, ^.+\.jp2$, ^.+\.mcc$, ^.+\.mcx$, ^.+\.mov$, ^.+\.mxi$, ^.+\.pdf$, ^.+\.pic$, ^.+\.png$, ^.+\.prt$, ^.+\.ps$, ^.+\.psd$, ^.+\.rgb$, ^.+\.rgba$, ^.+\.rla$, ^.+\.rpf$, ^.+\.sat$, ^.+\.sgi$, ^.+\.stl$, ^.+\.sxr$, ^.+\.targa$, ^.+\.tga$, ^.+\.tif$, ^.+\.tiff$, ^.+\.tim$, ^.+\.vda$, ^.+\.vrimg$, ^.+\.vrmesh$, ^.+\.vrsm$, ^.+\.vrst$, ^.+\.vst$, ^.+\.wmf$, ^.+\.ass$, ^.+\.gz$, ^.+\.ifd$, ^.+\.mi$, ^.+\.mi2$, ^.+\.mxi$, ^.+\.rib$, ^.+\.rs$, ^.+\.vrscene$
DEBUG: S3BackedCache Client Returned Sequence: 75
INFO: Executing global asset transfer preload script '/var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/plugins/64ca3a24b36cc246db88b058/GlobalAssetTransferPreLoad.py'
INFO: Looking for legacy (pre-10.0.26) AWS Portal File Transfer...
INFO: Looking for legacy (pre-10.0.26) File Transfer controller in /opt/Thinkbox/S3BackedCache/bin/task.py...
INFO: Could not find legacy (pre-10.0.26) AWS Portal File Transfer.
INFO: Legacy (pre-10.0.26) AWS Portal File Transfer is not installed on the system.
Done executing plugin command of type 'Start Job'
Plugin rendering frame(s): 1
Executing plugin command of type 'Render Task'
INFO: Sending StartTaskRequest to S3BackedCacheClient.
DEBUG: Request:
DEBUG: 	JobId: 64ca3a24b36cc246db88b058
DEBUG: 	JobUploadWhitelist: 
DEBUG: 	JobUploadWhitelistRe: ^.+\.abc$, ^.+\.avi$, ^.+\.bmp$, ^.+\.bw$, ^.+\.cin$, ^.+\.cjp$, ^.+\.cjpg$, ^.+\.cxr$, ^.+\.dds$, ^.+\.dpx$, ^.+\.dwf$, ^.+\.dwfx$, ^.+\.dwg$, ^.+\.dxf$, ^.+\.dxx$, ^.+\.eps$, ^.+\.exr$, ^.+\.fbx$, ^.+\.fxr$, ^.+\.hdr$, ^.+\.icb$, ^.+\.iff$, ^.+\.iges$, ^.+\.igs$, ^.+\.int$, ^.+\.inta$, ^.+\.iris$, ^.+\.jpe$, ^.+\.jpeg$, ^.+\.jpg$, ^.+\.jp2$, ^.+\.mcc$, ^.+\.mcx$, ^.+\.mov$, ^.+\.mxi$, ^.+\.pdf$, ^.+\.pic$, ^.+\.png$, ^.+\.prt$, ^.+\.ps$, ^.+\.psd$, ^.+\.rgb$, ^.+\.rgba$, ^.+\.rla$, ^.+\.rpf$, ^.+\.sat$, ^.+\.sgi$, ^.+\.stl$, ^.+\.sxr$, ^.+\.targa$, ^.+\.tga$, ^.+\.tif$, ^.+\.tiff$, ^.+\.tim$, ^.+\.vda$, ^.+\.vrimg$, ^.+\.vrmesh$, ^.+\.vrsm$, ^.+\.vrst$, ^.+\.vst$, ^.+\.wmf$, ^.+\.ass$, ^.+\.gz$, ^.+\.ifd$, ^.+\.mi$, ^.+\.mi2$, ^.+\.mxi$, ^.+\.rib$, ^.+\.rs$, ^.+\.vrscene$
DEBUG: S3BackedCache Client Returned Sequence: 75
INFO: Stdout Redirection Enabled: True
INFO: Asynchronous Stdout Enabled: False
INFO: Stdout Handling Enabled: True
INFO: Popup Handling Enabled: True
INFO: QT Popup Handling Enabled: False
INFO: WindowsForms10.Window.8.app.* Popup Handling Enabled: False
INFO: Using Process Tree: True
INFO: Hiding DOS Window: True
INFO: Creating New Console: False
INFO: Running as user: ec2-user
INFO: Executable: "/usr/redshift/bin/redshiftCmdLine"
CheckPathMapping: Swapped "Z:\squab2.0001.rs" with "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs"
CheckPathMapping: Swapped "Z:\" with "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: GPUs per task is greater than 0, so the following GPUs will be used: 0,1
INFO: Argument: "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs" -oip "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: Full Command: "/usr/redshift/bin/redshiftCmdLine" "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs" -oip "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: Startup Directory: "/usr/redshift/bin"
INFO: Process Priority: BelowNormal
INFO: Process Affinity: default
INFO: Process is now running
STDOUT: Redshift Command-Line Renderer (version 3.5.15 - API: 3505)
STDOUT: Copyright 2023 Redshift Rendering Technologies
STDOUT: sh: lsb_release: command not found
STDOUT: sh: lsb_release: command not found
STDOUT: Querying texture cache buget from preferences.xml: 32 GB
STDOUT: Querying cache path from preferences.xml: $REDSHIFT_LOCALDATAPATH/cache
STDOUT: No GPUs were selected in the command line, using selected compute devices from preferences.
STDOUT: Creating cache path /home/ec2-user/redshift/cache
STDOUT: 	Enforcing shader cache budget...
STDOUT: 	Enforcing texture cache budget...
STDOUT: 		Collecting files...
STDOUT: 		Total size for 0 files 0.00MB (budget 32768.00MB)
STDOUT: 		Under budget. Done.
STDOUT: 	Creating mesh cache...
STDOUT: 	Done
STDOUT: Overriding GPU devices due to REDSHIFT_GPUDEVICES (0,1)
STDOUT: Redshift Initialized
STDOUT: 	Version: 3.5.15, May 10 2023 07:43:06 [44018]
STDOUT: 	Linux Platform
STDOUT: 	Release Build
STDOUT: 	Number of CPU HW threads: 8
STDOUT: 	CPU speed: 3.12 GHz
STDOUT: 	Total system memory: 30.95 GB
STDOUT: 	Current working dir: /usr/redshift/bin
STDOUT: redshift_LICENSE=5053@10.128.2.4
STDOUT: RLM License Search Path=/home/ec2-user/redshift:/etc/opt/maxon/rlm
STDOUT: License return timeout is disabled (license will be returned on shutdown)
STDOUT: Detected env variable REDSHIFT_PATHOVERRIDE_FILE. Loading path override data from file: /var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/jobsData/64ca3a24b36cc246db88b058/RSMapping_tempSZkEF0/RSMapping.txt
STDOUT: Loading Redshift procedural extensions...
STDOUT: 	From path: /usr/redshift/procedurals/
STDOUT: 	Done!
STDOUT:  
STDOUT: Preparing compute platforms
STDOUT: 	Found CUDA compute library in /usr/redshift/bin/libredshift-core-cuda.so
STDOUT: 	Found CPU compute library in /usr/redshift/bin/libredshift-core-cpu.so
STDOUT: 	Done
STDOUT: Creating CUDA contexts
STDOUT: 	CUDA init ok
STDOUT: 	Ordinals: { 0 }
STDOUT: Initializing GPUComputing module (CUDA). Active device 0
STDOUT: 	CUDA Driver Version: 11070
STDOUT: 	CUDA API Version: 11020
STDOUT: 	Device 1/1 : Tesla T4 
STDOUT: 	Compute capability: 7.5
STDOUT: 	Num multiprocessors: 40
STDOUT: 	PCI busID: 0, deviceID: 30, domainID: 0
STDOUT: 	Theoretical memory bandwidth: 320.063995 GB/Sec
STDOUT: 	Measured PCIe bandwidth (pinned CPU->GPU): 5.820019 GB/s
STDOUT: 	Measured PCIe bandwidth (pinned GPU->CPU): 6.133014 GB/s
STDOUT: 	Measured PCIe bandwidth (paged CPU->GPU): 4.937102 GB/s
STDOUT: 	Measured PCIe bandwidth (paged GPU->CPU): 3.883552 GB/s
STDOUT: 	Estimated GPU->CPU latency (0): 0.005209 ms
STDOUT: 	Estimated GPU->CPU latency (1): 0.004677 ms
STDOUT: 	Estimated GPU->CPU latency (2): 0.004674 ms
STDOUT: 	Estimated GPU->CPU latency (3): 0.004741 ms
STDOUT: 	New CUDA context created
STDOUT: 	Available memory: 14772.9375 MB out of 14971.8750 MB
STDOUT: CPU backend has auto selected the following arch: BASE
STDOUT: Determining peer-to-peer capability (NVLink or PCIe)
STDOUT: 	Done
STDOUT: PostFX: Initialized
STDOUT: OptiX denoiser init...
STDOUT: 	Selecting device
STDOUT: 	Selected device Tesla T4 (ordinal 0)
STDOUT: OptixRT init...
STDOUT: 	Load/set programs
STDOUT: 	Ok!
STDOUT: Loading: /mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs
STDOUT: License for redshift-core 2024.06 valid until Jun 28 2024
STDOUT: Detected change in GPU device selection
STDOUT: Creating CUDA contexts
STDOUT: 	CUDA init ok
STDOUT: No devices available
STDOUT: PostFX: Shut down
STDOUT: Shutdown GPU Devices...
STDOUT: 	Devices shut down ok
STDOUT: Shutdown Rendering Sub-Systems...
STDOUT: License returned     
STDOUT: 	Finished Shutting down Rendering Sub-Systems
INFO: Process exit code: 1
INFO: Sending EndTaskRequest to S3BackedCacheClient.
DEBUG: Request:
DEBUG: 	JobId: 64ca3a24b36cc246db88b058
Done executing plugin command of type 'Render Task'

However it’s not stable: the job gets reattempted and sometimes succeeds. The command appears to be identical, but in the first instance it can’t find a device, and in the second it does…

Loading Job's Plugin timeout is Disabled
SandboxedPlugin: Render Job As User disabled, running as current user 'ec2-user'
Executing plugin command of type 'Initialize Plugin'
INFO: Executing plugin script '/var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/plugins/64ca3a24b36cc246db88b058/Redshift.py'
INFO: Plugin execution sandbox using Python version 3
INFO: Redshift Path Mapping...
INFO: source: "D:\AWS_DEADLINE" dest: "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: source: "Z:\" dest: "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: [REDSHIFT_PATHOVERRIDE_FILE] now set to: "/var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/jobsData/64ca3a24b36cc246db88b058/RSMapping_tempoVSQM0/RSMapping.txt"
INFO: About: Redshift Plugin for Deadline
INFO: The job's environment will be merged with the current environment before rendering
Done executing plugin command of type 'Initialize Plugin'
Start Job timeout is disabled.
Task timeout is disabled.
Loaded job: Untitled (64ca3a24b36cc246db88b058)
Executing plugin command of type 'Start Job'
INFO: Sending StartTaskRequest to S3BackedCacheClient.
DEBUG: Request:
DEBUG: 	JobId: 64ca3a24b36cc246db88b058
DEBUG: 	JobUploadWhitelist: 
DEBUG: 	JobUploadWhitelistRe: ^.+\.abc$, ^.+\.avi$, ^.+\.bmp$, ^.+\.bw$, ^.+\.cin$, ^.+\.cjp$, ^.+\.cjpg$, ^.+\.cxr$, ^.+\.dds$, ^.+\.dpx$, ^.+\.dwf$, ^.+\.dwfx$, ^.+\.dwg$, ^.+\.dxf$, ^.+\.dxx$, ^.+\.eps$, ^.+\.exr$, ^.+\.fbx$, ^.+\.fxr$, ^.+\.hdr$, ^.+\.icb$, ^.+\.iff$, ^.+\.iges$, ^.+\.igs$, ^.+\.int$, ^.+\.inta$, ^.+\.iris$, ^.+\.jpe$, ^.+\.jpeg$, ^.+\.jpg$, ^.+\.jp2$, ^.+\.mcc$, ^.+\.mcx$, ^.+\.mov$, ^.+\.mxi$, ^.+\.pdf$, ^.+\.pic$, ^.+\.png$, ^.+\.prt$, ^.+\.ps$, ^.+\.psd$, ^.+\.rgb$, ^.+\.rgba$, ^.+\.rla$, ^.+\.rpf$, ^.+\.sat$, ^.+\.sgi$, ^.+\.stl$, ^.+\.sxr$, ^.+\.targa$, ^.+\.tga$, ^.+\.tif$, ^.+\.tiff$, ^.+\.tim$, ^.+\.vda$, ^.+\.vrimg$, ^.+\.vrmesh$, ^.+\.vrsm$, ^.+\.vrst$, ^.+\.vst$, ^.+\.wmf$, ^.+\.ass$, ^.+\.gz$, ^.+\.ifd$, ^.+\.mi$, ^.+\.mi2$, ^.+\.mxi$, ^.+\.rib$, ^.+\.rs$, ^.+\.vrscene$
DEBUG: S3BackedCache Client Returned Sequence: 75
INFO: Executing global asset transfer preload script '/var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/plugins/64ca3a24b36cc246db88b058/GlobalAssetTransferPreLoad.py'
INFO: Looking for legacy (pre-10.0.26) AWS Portal File Transfer...
INFO: Looking for legacy (pre-10.0.26) File Transfer controller in /opt/Thinkbox/S3BackedCache/bin/task.py...
INFO: Could not find legacy (pre-10.0.26) AWS Portal File Transfer.
INFO: Legacy (pre-10.0.26) AWS Portal File Transfer is not installed on the system.
Done executing plugin command of type 'Start Job'
Plugin rendering frame(s): 1
Executing plugin command of type 'Render Task'
INFO: Sending StartTaskRequest to S3BackedCacheClient.
DEBUG: Request:
DEBUG: 	JobId: 64ca3a24b36cc246db88b058
DEBUG: 	JobUploadWhitelist: 
DEBUG: 	JobUploadWhitelistRe: ^.+\.abc$, ^.+\.avi$, ^.+\.bmp$, ^.+\.bw$, ^.+\.cin$, ^.+\.cjp$, ^.+\.cjpg$, ^.+\.cxr$, ^.+\.dds$, ^.+\.dpx$, ^.+\.dwf$, ^.+\.dwfx$, ^.+\.dwg$, ^.+\.dxf$, ^.+\.dxx$, ^.+\.eps$, ^.+\.exr$, ^.+\.fbx$, ^.+\.fxr$, ^.+\.hdr$, ^.+\.icb$, ^.+\.iff$, ^.+\.iges$, ^.+\.igs$, ^.+\.int$, ^.+\.inta$, ^.+\.iris$, ^.+\.jpe$, ^.+\.jpeg$, ^.+\.jpg$, ^.+\.jp2$, ^.+\.mcc$, ^.+\.mcx$, ^.+\.mov$, ^.+\.mxi$, ^.+\.pdf$, ^.+\.pic$, ^.+\.png$, ^.+\.prt$, ^.+\.ps$, ^.+\.psd$, ^.+\.rgb$, ^.+\.rgba$, ^.+\.rla$, ^.+\.rpf$, ^.+\.sat$, ^.+\.sgi$, ^.+\.stl$, ^.+\.sxr$, ^.+\.targa$, ^.+\.tga$, ^.+\.tif$, ^.+\.tiff$, ^.+\.tim$, ^.+\.vda$, ^.+\.vrimg$, ^.+\.vrmesh$, ^.+\.vrsm$, ^.+\.vrst$, ^.+\.vst$, ^.+\.wmf$, ^.+\.ass$, ^.+\.gz$, ^.+\.ifd$, ^.+\.mi$, ^.+\.mi2$, ^.+\.mxi$, ^.+\.rib$, ^.+\.rs$, ^.+\.vrscene$
DEBUG: S3BackedCache Client Returned Sequence: 75
INFO: Stdout Redirection Enabled: True
INFO: Asynchronous Stdout Enabled: False
INFO: Stdout Handling Enabled: True
INFO: Popup Handling Enabled: True
INFO: QT Popup Handling Enabled: False
INFO: WindowsForms10.Window.8.app.* Popup Handling Enabled: False
INFO: Using Process Tree: True
INFO: Hiding DOS Window: True
INFO: Creating New Console: False
INFO: Running as user: ec2-user
INFO: Executable: "/usr/redshift/bin/redshiftCmdLine"
CheckPathMapping: Swapped "Z:\squab2.0001.rs" with "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs"
CheckPathMapping: Swapped "Z:\" with "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: GPUs per task is greater than 0, so the following GPUs will be used: 0,1
INFO: Argument: "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs" -oip "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: Full Command: "/usr/redshift/bin/redshiftCmdLine" "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs" -oip "/mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/"
INFO: Startup Directory: "/usr/redshift/bin"
INFO: Process Priority: BelowNormal
INFO: Process Affinity: default
INFO: Process is now running
STDOUT: Redshift Command-Line Renderer (version 3.5.15 - API: 3505)
STDOUT: Copyright 2023 Redshift Rendering Technologies
STDOUT: sh: lsb_release: command not found
STDOUT: sh: lsb_release: command not found
STDOUT: Querying texture cache buget from preferences.xml: 32 GB
STDOUT: Querying cache path from preferences.xml: $REDSHIFT_LOCALDATAPATH/cache
STDOUT: No GPUs were selected in the command line, using selected compute devices from preferences.
STDOUT: Creating cache path /home/ec2-user/redshift/cache
STDOUT: 	Enforcing shader cache budget...
STDOUT: 	Enforcing texture cache budget...
STDOUT: 		Collecting files...
STDOUT: 		Total size for 0 files 0.00MB (budget 32768.00MB)
STDOUT: 		Under budget. Done.
STDOUT: 	Creating mesh cache...
STDOUT: 	Done
STDOUT: Overriding GPU devices due to REDSHIFT_GPUDEVICES (0,1)
STDOUT: Redshift Initialized
STDOUT: 	Version: 3.5.15, May 10 2023 07:43:06 [44018]
STDOUT: 	Linux Platform
STDOUT: 	Release Build
STDOUT: 	Number of CPU HW threads: 8
STDOUT: 	CPU speed: 3.12 GHz
STDOUT: 	Total system memory: 30.95 GB
STDOUT: 	Current working dir: /usr/redshift/bin
STDOUT: redshift_LICENSE=5053@10.128.2.4
STDOUT: RLM License Search Path=/home/ec2-user/redshift:/etc/opt/maxon/rlm
STDOUT: License return timeout is disabled (license will be returned on shutdown)
STDOUT: Detected env variable REDSHIFT_PATHOVERRIDE_FILE. Loading path override data from file: /var/lib/Thinkbox/Deadline10/workers/ip-10-128-36-92/jobsData/64ca3a24b36cc246db88b058/RSMapping_tempoVSQM0/RSMapping.txt
STDOUT: Loading Redshift procedural extensions...
STDOUT: 	From path: /usr/redshift/procedurals/
STDOUT: 	Done!
STDOUT:  
STDOUT: Preparing compute platforms
STDOUT: 	Found CUDA compute library in /usr/redshift/bin/libredshift-core-cuda.so
STDOUT: 	Found CPU compute library in /usr/redshift/bin/libredshift-core-cpu.so
STDOUT: 	Done
STDOUT: Creating CUDA contexts
STDOUT: 	CUDA init ok
STDOUT: 	Ordinals: { 0 }
STDOUT: Initializing GPUComputing module (CUDA). Active device 0
STDOUT: 	CUDA Driver Version: 11070
STDOUT: 	CUDA API Version: 11020
STDOUT: 	Device 1/1 : Tesla T4 
STDOUT: 	Compute capability: 7.5
STDOUT: 	Num multiprocessors: 40
STDOUT: 	PCI busID: 0, deviceID: 30, domainID: 0
STDOUT: 	Theoretical memory bandwidth: 320.063995 GB/Sec
STDOUT: 	Measured PCIe bandwidth (pinned CPU->GPU): 5.822417 GB/s
STDOUT: 	Measured PCIe bandwidth (pinned GPU->CPU): 6.132351 GB/s
STDOUT: 	Measured PCIe bandwidth (paged CPU->GPU): 4.880897 GB/s
STDOUT: 	Measured PCIe bandwidth (paged GPU->CPU): 4.199223 GB/s
STDOUT: 	Estimated GPU->CPU latency (0): 0.005366 ms
STDOUT: 	Estimated GPU->CPU latency (1): 0.004629 ms
STDOUT: 	Estimated GPU->CPU latency (2): 0.004579 ms
STDOUT: 	Estimated GPU->CPU latency (3): 0.004565 ms
STDOUT: 	New CUDA context created
STDOUT: 	Available memory: 14772.9375 MB out of 14971.8750 MB
STDOUT: CPU backend has auto selected the following arch: BASE
STDOUT: Determining peer-to-peer capability (NVLink or PCIe)
STDOUT: 	Done
STDOUT: PostFX: Initialized
STDOUT: OptiX denoiser init...
STDOUT: 	Selecting device
STDOUT: 	Selected device Tesla T4 (ordinal 0)
STDOUT: OptixRT init...
STDOUT: 	Load/set programs
STDOUT: 	Ok!
STDOUT: Loading: /mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/squab2.0001.rs
STDOUT: License acquired
STDOUT: License for redshift-core 2024.06 valid until Jun 28 2024
STDOUT: Detected change in GPU device selection
STDOUT: Creating CUDA contexts
STDOUT: 	CUDA init ok
STDOUT: 	Ordinals: { 0 }
STDOUT: Initializing GPUComputing module (CUDA). Active device 0
STDOUT: 	CUDA Driver Version: 11070
STDOUT: 	CUDA API Version: 11020
STDOUT: 	Device 1/1 : Tesla T4 
STDOUT: 	Compute capability: 7.5
STDOUT: 	Num multiprocessors: 40
STDOUT: 	PCI busID: 0, deviceID: 30, domainID: 0
STDOUT: 	Theoretical memory bandwidth: 320.063995 GB/Sec
STDOUT: 	New CUDA context created
STDOUT: 	Available memory: 14772.9375 MB out of 14971.8750 MB
STDOUT: =================================================================================================
STDOUT: Rendering frame 1...
STDOUT: AMM enabled
STDOUT: =================================================================================================
STDOUT: 	  0ms
STDOUT: Loading OCIO config using C:\ProgramData\Redshift\Data\OCIO\config.ocio
STDOUT: 	Could not find OCIO config. Using Redshift's default instead
STDOUT: 	Full path: /usr/redshift/Data/OCIO/config.ocio
STDOUT: 	Ok
STDOUT: Creating OCIO processors between rendering and view/srgb spaces
STDOUT: 	Rendering space: ACEScg
STDOUT: 	Display: sRGB
STDOUT: 	View: ACES 1.0 SDR-video
STDOUT: 	Ok
STDOUT: Loading OCIO color space transforms for texture sampling
STDOUT: 	Found a suitable sRGB color space: "sRGB"
STDOUT: 	Found a suitable sRGB-linear color space: "scene-linear Rec.709-sRGB"
STDOUT: Device 0 (Tesla T4) uses Optix for ray tracing
STDOUT: Detected 4 new texture color spaces in use.
STDOUT: Preparing ray tracing hierarchy for meshes
STDOUT: 	Time to process 0 meshes:   0ms
STDOUT: 	Time to process textures: 0.000027 seconds
STDOUT: Preparing materials and shaders
STDOUT: 	Time to process all materials and shaders: 21.546835 seconds
STDOUT: Freeing VRAM because of an OptiX denoiser/RT reset
STDOUT: 	Freeing RT related buffers
STDOUT: Initializing OptiXRT
STDOUT: 	Freeing stack allocator
STDOUT: 	Will rebuild everything
STDOUT: 	OptixRT initialized on device 0 (appears to use 128 MB)
STDOUT: 	Time: 19ms
STDOUT: Allocating GPU mem...(device 0)
STDOUT: 	Done (Allocator size: 12755 MB. CUDA reported free mem before: 14172 MB, after: 1416 MB)
STDOUT: Allocating GPU mem for ray tracing hierarchy processing
STDOUT: 	Allocating VRAM for device 0 (Tesla T4)
STDOUT: 		Redshift can use up to 12755 MB
STDOUT: 		Fixed: 0 MB
STDOUT: 		Geo: 0 MB, Tex: 0 MB, Rays: 4917 MB, NRPR: 2886080
STDOUT: 		Done! ( 11ms). Compute API reported free mem: 1416 MB
STDOUT: Ray Tracing Hierarchy Info:
STDOUT: 	Max depth: 128. MaxNumLeafPrimitives: 8
STDOUT: 	Extents: (-1.367651 -1.468923 -2.632175) - (3.229024 1.028439 2.583128)
STDOUT: 	Time to create tree: 13 ms (0 13 0)
STDOUT: Irradiance point cloud...
STDOUT: 	Allocating VRAM for device 0 (Tesla T4)
STDOUT: 		Redshift can use up to 12755 MB
STDOUT: 		Fixed: 536 MB
STDOUT: 		Geo: 0 MB, Tex: 0 MB, Rays: 5204 MB, NRPR: 2886080
STDOUT: 		Done! ( 20ms). Compute API reported free mem: 1416 MB
STDOUT: 	Total num points before: 290 (num new: 290)
STDOUT: 	Total num points before: 574 (num new: 289)
STDOUT: 	Total num points before: 1122 (num new: 548)
STDOUT: 	Total num points before: 2121 (num new: 999)
STDOUT: 	Total num points before: 3725 (num new: 1604)
STDOUT: 	Total num points before: 6153 (num new: 2428)
STDOUT: 	Total num points before: 9408 (num new: 3255)
STDOUT: 	Total num points before: 13206 (num new: 3798)
STDOUT: 	Total num points before: 17467 (num new: 4261)
STDOUT: 	Total irradiance point cloud construction time 0.8s
STDOUT: Rendering blocks... (resolution: 1280x720, block size: 128, unified minmax: [16,8192])
STDOUT: 	Allocating VRAM for device 0 (Tesla T4)
STDOUT: 		Redshift can use up to 12755 MB
STDOUT: 		Fixed: 0 MB
STDOUT: 		Geo: 0 MB, Tex: 0 MB, Rays: 8958 MB, NRPR: 2886080
STDOUT: 		Done! ( 11ms). Compute API reported free mem: 1416 MB
STDOUT: 	Pre-streaming and uploading out-of-core textures (they fit in GPU memory cache):
STDOUT: 		Device 0 completed.
STDOUT: 	Time to pre-stream and upload all out-of-core textures: 0.0001 seconds
STDOUT: 	Block 1/60 (4,2) rendered by GPU 0 in 44ms
STDOUT: 	Block 2/60 (5,2) rendered by GPU 0 in 118ms
STDOUT: 	Block 3/60 (5,3) rendered by GPU 0 in 111ms
STDOUT: 	Block 4/60 (4,3) rendered by GPU 0 in 96ms
STDOUT: 	Block 5/60 (3,4) rendered by GPU 0 in 114ms
STDOUT: 	Block 6/60 (3,3) rendered by GPU 0 in 124ms
STDOUT: 	Block 7/60 (3,2) rendered by GPU 0 in 89ms
STDOUT: 	Block 8/60 (3,1) rendered by GPU 0 in 92ms
STDOUT: 	Block 9/60 (4,1) rendered by GPU 0 in 96ms
STDOUT: 	Block 10/60 (5,1) rendered by GPU 0 in 48ms
STDOUT: 	Block 11/60 (6,1) rendered by GPU 0 in 99ms
STDOUT: 	Block 12/60 (6,2) rendered by GPU 0 in 155ms
STDOUT: 	Block 13/60 (6,3) rendered by GPU 0 in 126ms
STDOUT: 	Block 14/60 (6,4) rendered by GPU 0 in 107ms
STDOUT: 	Block 15/60 (5,4) rendered by GPU 0 in 86ms
STDOUT: 	Block 16/60 (4,4) rendered by GPU 0 in 9ms
STDOUT: 	Block 17/60 (2,5) rendered by GPU 0 in 87ms
STDOUT: 	Block 18/60 (2,4) rendered by GPU 0 in 111ms
STDOUT: 	Block 19/60 (2,3) rendered by GPU 0 in 117ms
STDOUT: 	Block 20/60 (2,2) rendered by GPU 0 in 80ms
STDOUT: 	Block 21/60 (2,1) rendered by GPU 0 in 63ms
STDOUT: 	Block 22/60 (2,0) rendered by GPU 0 in 9ms
STDOUT: 	Block 23/60 (3,0) rendered by GPU 0 in 9ms
STDOUT: 	Block 24/60 (4,0) rendered by GPU 0 in 47ms
STDOUT: 	Block 25/60 (5,0) rendered by GPU 0 in 76ms
STDOUT: 	Block 26/60 (6,0) rendered by GPU 0 in 76ms
STDOUT: 	Block 27/60 (7,0) rendered by GPU 0 in 77ms
STDOUT: 	Block 28/60 (7,1) rendered by GPU 0 in 86ms
STDOUT: 	Block 29/60 (7,2) rendered by GPU 0 in 102ms
STDOUT: 	Block 30/60 (7,3) rendered by GPU 0 in 46ms
STDOUT: 	Block 31/60 (7,4) rendered by GPU 0 in 28ms
STDOUT: 	Block 32/60 (7,5) rendered by GPU 0 in 24ms
STDOUT: 	Block 33/60 (6,5) rendered by GPU 0 in 88ms
STDOUT: 	Block 34/60 (5,5) rendered by GPU 0 in 87ms
STDOUT: 	Block 35/60 (4,5) rendered by GPU 0 in 54ms
STDOUT: 	Block 36/60 (3,5) rendered by GPU 0 in 88ms
STDOUT: 	Block 37/60 (1,5) rendered by GPU 0 in 86ms
STDOUT: 	Block 38/60 (1,4) rendered by GPU 0 in 98ms
STDOUT: 	Block 39/60 (1,3) rendered by GPU 0 in 103ms
STDOUT: 	Block 40/60 (1,2) rendered by GPU 0 in 9ms
STDOUT: 	Block 41/60 (1,1) rendered by GPU 0 in 9ms
STDOUT: 	Block 42/60 (1,0) rendered by GPU 0 in 9ms
STDOUT: 	Block 43/60 (8,0) rendered by GPU 0 in 9ms
STDOUT: 	Block 44/60 (8,1) rendered by GPU 0 in 9ms
STDOUT: 	Block 45/60 (8,2) rendered by GPU 0 in 91ms
STDOUT: 	Block 46/60 (8,3) rendered by GPU 0 in 26ms
STDOUT: 	Block 47/60 (8,4) rendered by GPU 0 in 105ms
STDOUT: 	Block 48/60 (8,5) rendered by GPU 0 in 94ms
STDOUT: 	Block 49/60 (0,5) rendered by GPU 0 in 9ms
STDOUT: 	Block 50/60 (0,4) rendered by GPU 0 in 128ms
STDOUT: 	Block 51/60 (0,3) rendered by GPU 0 in 69ms
STDOUT: 	Block 52/60 (0,2) rendered by GPU 0 in 9ms
STDOUT: 	Block 53/60 (0,1) rendered by GPU 0 in 9ms
STDOUT: 	Block 54/60 (0,0) rendered by GPU 0 in 9ms
STDOUT: 	Block 55/60 (9,0) rendered by GPU 0 in 9ms
STDOUT: 	Block 56/60 (9,1) rendered by GPU 0 in 66ms
STDOUT: 	Block 57/60 (9,2) rendered by GPU 0 in 94ms
STDOUT: 	Block 58/60 (9,3) rendered by GPU 0 in 95ms
STDOUT: 	Block 59/60 (9,4) rendered by GPU 0 in 120ms
STDOUT: 	Block 60/60 (9,5) rendered by GPU 0 in 87ms
STDOUT: 	Processing blocks...
STDOUT: 	Time to render 60 blocks: 4.3s
STDOUT: Rendering time: 26.8s (1 GPU(s) used)
STDOUT: Scene statistics
STDOUT: 	General counts
STDOUT: 		Proxies:                                       1
STDOUT: 		Proxy instances:                               0
STDOUT: 		Meshes:                                        1 (1 TriMeshes, 0 HairMeshes)
STDOUT: 		Instances:                                     1
STDOUT: 		Point cloud points:                            0
STDOUT: 		Lights:                                        0
STDOUT: 		Volume grids:                                  0 (0 unique)
STDOUT: 		Sprite textures:                               0
STDOUT: 		In-core textures:                              0
STDOUT: 	Geometry
STDOUT: 		Unique triangles pre tessellation:         81624
STDOUT: 		Unique triangles post tessellation:        81624
STDOUT: 		Unique points:                                 0
STDOUT: 		Unique hair strands:                           0
STDOUT: 		Unique hair strand segments:                   0
STDOUT: 		Total triangles:                           81624
STDOUT: 		Total points:                                  0
STDOUT: 		Total hair strands:                            0
STDOUT: 		Total hair strand segments:                    0
STDOUT: 	Largest triangle meshes:
STDOUT: 		       81624 triangles : testgeometry_squab1
STDOUT: 	GPU Memory
STDOUT: 		Device  0 geometry PCIe uploads:               0 B  (cachesize:            0 B )
STDOUT: 		Device  0 texture PCIe uploads:                0 B  (cachesize:       256.58 KB)
STDOUT: 		Matrices (for instances/points):              48 B 
STDOUT: 		Rays:                                       8.75 GB
STDOUT: 		Sprite textures:                               0 B 
STDOUT: 		In-core textures:                              0 B 
STDOUT: 		Volume grids:                                  0 B 
STDOUT: 	Textures
STDOUT: 		Device  0 stream and upload time:              0ms
STDOUT: 			File loading time:                     0ms
STDOUT: 			File decompression time:               0ms
STDOUT: 			Average GPU cache hits:                0%
STDOUT: 	GPU Ray Accel. And Geometry Memory Stats (rough)
STDOUT: 		Acceleration Structures:                      80 B 
STDOUT: 		Main primitive data:                           0 B 
STDOUT: 		Extra primitive data:                         16 B 
STDOUT: 		Primitive loading time:                        12ms
STDOUT: Saving: /mnt/Data/DAWS_DEADLINE7ea0f5b3ceacbea92dc965977761b0a2/SQUAB.Redshift_ROP1.0001.exr
STDOUT: Shutdown mem management thread...
STDOUT: 	Shut down ok
STDOUT: PostFX: Shut down
STDOUT: Shutdown GPU Devices...
STDOUT: Freeing GPU mem...(device 0)
STDOUT: 	Done (CUDA reported free mem before: 1686 MB, after: 14442 MB)
STDOUT: 	Devices shut down ok
STDOUT: Shutdown Rendering Sub-Systems...
STDOUT: License returned 
STDOUT: 	Finished Shutting down Rendering Sub-Systems
INFO: Process exit code: 0
INFO: Sending EndTaskRequest to S3BackedCacheClient.
DEBUG: Request:
DEBUG: 	JobId: 64ca3a24b36cc246db88b058
Done executing plugin command of type 'Render Task'

We are not specifying a GPU device. The machine is a g5.4xlarge, which should have 2 x A10Gs, yet when we run ./redshiftCmdLine -listgpus we get:

List of available GPUs:
0 : NVIDIA A10G
1 : CPU 0 AMD EPYC 7R32

so I’m unsure if that is related. Could this be something to do with GPU Affinity settings? Any help would be much appreciated.

Edit: I’ve also noticed that running the command in isolation on the Worker node ("/usr/redshift/bin/redshiftCmdLine" "path/to/file.rs" -oip "/path/to/output.exr") fails on the AWS node, whereas if I specify the GPU with -gpu 0 it works. Yet setting this in the Redshift standalone submitter does not work (I am using 10.2.1.1).
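
To make the comparison concrete, here’s a minimal sketch of the two manual invocations I’m comparing (the paths are placeholders, same form as the full command in the logs above; the only difference is the explicit -gpu flag):

# Fails intermittently with "No devices available" when no device is specified:
"/usr/redshift/bin/redshiftCmdLine" "/path/to/file.rs" -oip "/path/to/output/"

# Works for me when the device index is passed explicitly:
"/usr/redshift/bin/redshiftCmdLine" "/path/to/file.rs" -gpu 0 -oip "/path/to/output/"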

RS v3.5.15 has a bug when using REDSHIFT_GPUDEVICES
Logged as a bug with ID: RS-2766

I think you’re the one who logged the Maxon/Redshift/Houdini bug with the command line in 3.5.16 and 3.5.17?

We’re still on RS 3.5.13 (H19.5.493) as 3.5.13 seems to be stable for our use.


Ahh ok, that’s good to know, thanks for the info. Yes, redshiftCmdLine seems to be broken in 3.5.16, which causes this error in Deadline too.

RS v3.5.15 has a bug when using REDSHIFT_GPUDEVICES
Logged as a bug with ID: RS-2766

Do you have a link to the bug tracker for this?

It was listed as fixed in 3.5.16


Even with 3.5.13 there are still intermittent issues where it randomly fails and then subsequently succeeds:

2023-08-07 13:46:24:  0: STDOUT: Redshift Command-Line Renderer (version 3.5.13 - API: 3504)
2023-08-07 13:46:24:  0: STDOUT: Copyright 2021 Redshift Rendering Technologies
2023-08-07 13:46:24:  0: STDOUT: Querying texture cache buget from preferences.xml: 32 GB
2023-08-07 13:46:24:  0: STDOUT: Querying cache path from preferences.xml: $REDSHIFT_LOCALDATAPATH/cache
2023-08-07 13:46:24:  0: STDOUT: No GPUs were selected in the command line, using selected compute devices from preferences.
2023-08-07 13:46:24:  0: STDOUT: Creating cache path /home/ec2-user/redshift/cache
2023-08-07 13:46:24:  0: STDOUT: 	Enforcing shader cache budget...
2023-08-07 13:46:24:  0: STDOUT: 	Enforcing texture cache budget...
2023-08-07 13:46:24:  0: STDOUT: 		Collecting files...
2023-08-07 13:46:24:  0: STDOUT: 		Total size for 0 files 0.00MB (budget 32768.00MB)
2023-08-07 13:46:24:  0: STDOUT: 		Under budget. Done.
2023-08-07 13:46:24:  0: STDOUT: 	Creating mesh cache...
2023-08-07 13:46:24:  0: STDOUT: 	Done
2023-08-07 13:46:24:  0: STDOUT: Overriding GPU devices due to REDSHIFT_GPUDEVICES (0)
2023-08-07 13:46:24:  0: STDOUT: Redshift Initialized
2023-08-07 13:46:24:  0: STDOUT: 	Version: 3.5.13, Feb  6 2023 17:02:00 [42021]
2023-08-07 13:46:24:  0: STDOUT: 	Linux Platform
2023-08-07 13:46:24:  0: STDOUT: 	Release Build
2023-08-07 13:46:24:  0: STDOUT: 	Number of CPU HW threads: 16
2023-08-07 13:46:24:  0: STDOUT: 	CPU speed: 3.31 GHz
2023-08-07 13:46:24:  0: STDOUT: 	Total system memory: 62.23 GB
2023-08-07 13:46:24:  0: STDOUT: 	Current working dir: /usr/redshift/bin
2023-08-07 13:46:24:  0: STDOUT: redshift_LICENSE=5053@10.128.2.4
2023-08-07 13:46:24:  0: STDOUT: RLM License Search Path=/home/ec2-user/redshift:/etc/opt/maxon/rlm
2023-08-07 13:46:24:  0: STDOUT: License return timeout is disabled (license will be returned on shutdown)
2023-08-07 13:46:24:  0: STDOUT: Detected env variable REDSHIFT_PATHOVERRIDE_FILE. Loading path override data from file: /var/lib/Thinkbox/Deadline10/workers/ip-10-128-111-193/jobsData/64d0f5a8b36cc246db88b175/RSMapping_tempCL1Kv0/RSMapping.txt
2023-08-07 13:46:24:  0: STDOUT: Loading Redshift procedural extensions...
2023-08-07 13:46:24:  0: STDOUT: 	From path: /usr/redshift/procedurals/
2023-08-07 13:46:24:  0: STDOUT: 	Done!
2023-08-07 13:46:24:  0: STDOUT:  
2023-08-07 13:46:24:  0: STDOUT: Preparing compute platforms
2023-08-07 13:46:24:  0: STDOUT: 	Found CUDA compute library in /usr/redshift/bin/libredshift-core-cuda.so
2023-08-07 13:46:24:  0: STDOUT: 	Found CPU compute library in /usr/redshift/bin/libredshift-core-cpu.so
2023-08-07 13:46:24:  0: STDOUT: 	Done
2023-08-07 13:46:24:  0: STDOUT: Creating CUDA contexts
2023-08-07 13:46:24:  0: STDOUT: 	CUDA init ok
2023-08-07 13:46:24:  0: STDOUT: 	Ordinals: { 0 }
2023-08-07 13:46:24:  0: STDOUT: Initializing GPUComputing module (CUDA). Active device 0
2023-08-07 13:46:24:  0: STDOUT: 	CUDA Driver Version: 11070
2023-08-07 13:46:24:  0: STDOUT: 	CUDA API Version: 11020
2023-08-07 13:46:24:  0: STDOUT: 	Device 1/1 : NVIDIA A10G 
2023-08-07 13:46:24:  0: STDOUT: 	Compute capability: 8.6
2023-08-07 13:46:24:  0: STDOUT: 	Num multiprocessors: 80
2023-08-07 13:46:24:  0: STDOUT: 	PCI busID: 0, deviceID: 30, domainID: 0
2023-08-07 13:46:24:  0: STDOUT: 	Theoretical memory bandwidth: 600.096008 GB/Sec
2023-08-07 13:46:24:  0: STDOUT: 	Measured PCIe bandwidth (pinned CPU->GPU): 12.434987 GB/s
2023-08-07 13:46:24:  0: STDOUT: 	Measured PCIe bandwidth (pinned GPU->CPU): 12.211938 GB/s
2023-08-07 13:46:25:  0: STDOUT: 	Measured PCIe bandwidth (paged CPU->GPU): 11.814685 GB/s
2023-08-07 13:46:25:  0: STDOUT: 	Measured PCIe bandwidth (paged GPU->CPU): 8.224567 GB/s
2023-08-07 13:46:25:  0: STDOUT: 	Estimated GPU->CPU latency (0): 0.006090 ms
2023-08-07 13:46:25:  0: STDOUT: 	Estimated GPU->CPU latency (1): 0.006149 ms
2023-08-07 13:46:25:  0: STDOUT: 	Estimated GPU->CPU latency (2): 0.006588 ms
2023-08-07 13:46:25:  0: STDOUT: 	Estimated GPU->CPU latency (3): 0.007063 ms
2023-08-07 13:46:25:  0: STDOUT: 	New CUDA context created
2023-08-07 13:46:25:  0: STDOUT: 	Available memory: 22099.3750 MB out of 22592.0625 MB
2023-08-07 13:46:25:  0: STDOUT: Determining peer-to-peer capability (NVLink or PCIe)
2023-08-07 13:46:25:  0: STDOUT: 	Done
2023-08-07 13:46:25:  0: STDOUT: PostFX initialized
2023-08-07 13:46:25:  0: STDOUT: OptiX denoiser init...
2023-08-07 13:46:25:  0: STDOUT: 	Selecting device
2023-08-07 13:46:25:  0: STDOUT: 	Selected device NVIDIA A10G (ordinal 0)
2023-08-07 13:46:25:  0: STDOUT: OptixRT init...
2023-08-07 13:46:25:  0: STDOUT: 	Load/set programs
2023-08-07 13:46:25:  0: STDOUT: 	Ok!
2023-08-07 13:46:25:  0: STDOUT: Loading: /mnt/Data/DDropbox (Kuva)COSMORBITAL563b996fa2797c696f5b6a7ec663c03d/ASSETS/sample_asset/proxy/squabby.0001.rs
2023-08-07 13:46:28:  0: STDOUT: License for redshift-core 2024.06 valid until Jun 28 2024
2023-08-07 13:46:28:  0: STDOUT: Detected change in GPU device selection
2023-08-07 13:46:28:  0: STDOUT: Creating CUDA contexts
2023-08-07 13:46:28:  0: STDOUT: 	CUDA init ok
2023-08-07 13:46:28:  0: STDOUT: No devices available
2023-08-07 13:46:30:  0: STDOUT: PostFX shut down
2023-08-07 13:46:30:  0: STDOUT: Shutdown GPU Devices...
2023-08-07 13:46:30:  0: STDOUT: 	Devices shut down ok
2023-08-07 13:46:30:  0: STDOUT: Shutdown Rendering Sub-Systems...
2023-08-07 13:46:30:  0: STDOUT: License returned      
2023-08-07 13:46:30:  0: STDOUT: 	Finished Shutting down Rendering Sub-Systems

What do you have for the GPU Affinity settings?
I just looked at the AWS instance type page, and the g5.4xlarge and g5g.4xlarge have 1 GPU each (an A10G and a T4G, respectively).

On the CLI of the worker, if you run nvidia-smi it should report what card and how many GPUs the system sees.

Can you disable GPU Affinity on the ec2 instances that only have 1 GPU?
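
Something like this on the worker would confirm it (just a quick sketch; nvidia-smi -L lists the driver-visible GPUs, and -listgpus is the Redshift check you already ran):

# GPUs the NVIDIA driver sees:
nvidia-smi -L

# Devices Redshift itself enumerates:
/usr/redshift/bin/redshiftCmdLine -listgpus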

Hello

It could be a GPU driver issue. Did you install the GPU drivers on the AMI or did it come pre-installed?

Also, just to confirm: are you only using two types of GPU instances, g5.4xlarge and g5g.4xlarge?

@zarak you’re completely right, those machines do only have one GPU. Here’s the output from nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A10G         Off  | 00000000:00:1E.0 Off |                    0 |
|  0%   34C    P0    58W / 300W |    754MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      9836      C   ...shift/bin/redshiftCmdLine      752MiB |
+-----------------------------------------------------------------------------+

@zainali I didn’t install any drivers on the instance; it’s a custom AMI where I just installed RS 3.5.13.

This is super interesting. Zain and I have been fighting with this on and off this week. Not making too much headway at the moment but it’s interesting that specifying the GPU index isn’t working.

For fun I tried just running it in a loop to see what happens. Forcing hybrid mode to 0 (I presume disabled) doesn’t make much difference:

$ while true; do date; "/usr/redshift/bin/redshiftCmdLine" "/mnt/Data/beep/boop.rs" -oip "/tmp/Output" -hybrid 0 2>&1 | grep "Rendering frame"; done
Fri Oct 13 21:54:10 UTC 2023
Rendering frame 1003...
Fri Oct 13 21:54:17 UTC 2023
Fri Oct 13 21:54:24 UTC 2023
Fri Oct 13 21:54:36 UTC 2023
Fri Oct 13 21:54:43 UTC 2023
Rendering frame 1003...
Fri Oct 13 21:54:50 UTC 2023
Fri Oct 13 21:55:02 UTC 2023
Rendering frame 1003...
Fri Oct 13 21:55:09 UTC 2023
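
If it helps anyone else reproduce, a small variation on that loop (same placeholder paths, nothing else changed) also greps for the failure line so the success/failure pattern is easier to read:

$ while true; do date; "/usr/redshift/bin/redshiftCmdLine" "/mnt/Data/beep/boop.rs" -oip "/tmp/Output" -hybrid 0 2>&1 | grep -E "Rendering frame|No devices available"; done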

Has anyone made any progress on this one? We’re using driver 535.104.05 (newest from the EC2 S3 bucket) and Redshift 3.5.19.
