Hi everyone,
First post here, probably the first of many!
I will include logs and screenshots along with this post.
We are having issues with our render farm.
We can send jobs to it no problem, and sometimes they render with no errors or issues. About 80% of the time, though, one of our boxes will “stall” or hang on a frame. By "hang on a frame" I mean it will say it has been rendering for 9 hours, but when you view the job reports there is nothing there. It seems to be rendering but not picking up a frame.
User machine/software setup
We are running Deadline 7.2
Redshift v1.2.27
Maya 2016 w/ Service Pack
Windows 7
Farm Specs
32 GTX 980 Ti cards
6 GB Memory
Windows 10
Here are some of the errors we are receiving in the logs:
Stalled Box Job Report (this appears multiple times across different GPUs when it stalls)
Error: Could not find report log: //license-server3/Lic_servRep\reports\jobs\35\b\56ea91a879f82730ac5e735b\56ea9f1d245c2411e834fa97.bz2
Requeue error log (only the first part is included, as it repeats itself)
=======================================================
Reason
Rendering task was requeued because the Slave was manually shut down.
=======================================================
Log
2016-03-17 11:07:35: Skipping pending job scan because it is not required at this time
2016-03-17 11:07:35: Skipping repository repair because it is not required at this time
2016-03-17 11:07:35: Skipping house cleaning because it is not required at this time
2016-03-17 11:07:35: The license file being used will expire in 22 days.
2016-03-17 11:07:41: The license file being used will expire in 22 days.
2016-03-17 11:07:50: The license file being used will expire in 22 days.
2016-03-17 11:07:56: The license file being used will expire in 22 days.
2016-03-17 11:08:03: The license file being used will expire in 22 days.
2016-03-17 11:08:11: The license file being used will expire in 22 days.
2016-03-17 11:08:18: The license file being used will expire in 22 days.
2016-03-17 11:08:24: The license file being used will expire in 22 days.
2016-03-17 11:08:30: The license file being used will expire in 22 days.
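Since the log repeats itself so much, here is a rough Python sketch of how the repeats could be collapsed to make the unusual lines easier to spot. It assumes every line has the "YYYY-MM-DD HH:MM:SS: message" shape seen in the excerpts in this post. The names summarize and LINE_RE are made up for illustration and are not part of Deadline or Redshift.

```python
import re
from collections import Counter

# Assumed line shape, based only on the excerpts in this post:
# "YYYY-MM-DD HH:MM:SS: message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): (.*)$")


def summarize(lines):
    """Collapse repeated messages, ignoring their timestamps.

    Returns a list of (first_timestamp, count, message) tuples in the
    order each distinct message first appeared.
    """
    counts = Counter()
    first_seen = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip separators, headings, and blank lines
        stamp, message = match.groups()
        counts[message] += 1
        first_seen.setdefault(message, stamp)
    return [(first_seen[msg], n, msg) for msg, n in counts.items()]


if __name__ == "__main__":
    sample = [
        "2016-03-17 11:07:35: Skipping house cleaning because it is not required at this time",
        "2016-03-17 11:07:35: The license file being used will expire in 22 days.",
        "2016-03-17 11:07:41: The license file being used will expire in 22 days.",
    ]
    for stamp, count, message in summarize(sample):
        print(f"{stamp}  x{count}  {message}")
```

Feeding it the lines of a saved log file instead of the sample list should give one row per distinct message, which makes it easier to see whether anything other than the license notice shows up before a requeue.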
Possible GPU Crash Error (we see this one a lot across all renders). Note that we are not using Remote Desktop to log into the farm; we use TeamViewer.
2016-03-17 11:16:10: 0: STDOUT: MemCpy failed (CUDA_ERROR_INVALID_VALUE). This is possibly due to a GPU crash. Please re-render this scene with the ‘Debug Capture’ option enabled (in the Redshift ‘System’ tab) and, once you get the crash again, send the developers the log file html and bin files located in C:\ProgramData\Redshift\Log/Log.Latest.2. Thanks!
When we do turn on Debug Capture as the message above suggests, we get an error about VMP Pinned memory.
Any help on these at all is greatly appreciated.
We need to get this issue sorted ASAP.
Thanks
-Ryan