Render Farm Stalling and Hanging on Frames

Hi everyone,

First post here, probably the first of many!!
I will include logs and screenshots along with this post.

We are having issues with our render farm.
We can send jobs to it with no problem, and sometimes a job will render with no errors or issues. About 80% of the time, though, one of our boxes will "stall" or hang on a frame. By "hang on a frame" I mean it will say it has been rendering for 9 hours, but when you view the job reports there is nothing there. It’s rendering, but it doesn’t seem to be picking up a frame.

User machine/software setup
We are running Deadline 7.2
Redshift v1.2.27
Maya 2016 w/ Service Pack
Windows 7

Farm Specs
32 GTX 980 Ti cards
6 GB Memory
Windows 10

Here are some of the errors we are receiving in the logs:

Stalled Box Job Report (this appears multiple times across different GPUs when it stalls)

Error: Could not find report log: //license-server3/Lic_servRep\reports\jobs\35\b\56ea91a879f82730ac5e735b\56ea9f1d245c2411e834fa97.bz2

Requeue error log (only included the first bit of it, as it repeats itself)

=======================================================
Reason

Rendering task was requeued because the Slave was manually shut down.

=======================================================
Log

2016-03-17 11:07:35: Skipping pending job scan because it is not required at this time
2016-03-17 11:07:35: Skipping repository repair because it is not required at this time
2016-03-17 11:07:35: Skipping house cleaning because it is not required at this time
2016-03-17 11:07:35: The license file being used will expire in 22 days.
2016-03-17 11:07:41: The license file being used will expire in 22 days.
2016-03-17 11:07:50: The license file being used will expire in 22 days.
2016-03-17 11:07:56: The license file being used will expire in 22 days.
2016-03-17 11:08:03: The license file being used will expire in 22 days.
2016-03-17 11:08:11: The license file being used will expire in 22 days.
2016-03-17 11:08:18: The license file being used will expire in 22 days.
2016-03-17 11:08:24: The license file being used will expire in 22 days.
2016-03-17 11:08:30: The license file being used will expire in 22 days.

Possible GPU Crash Error (we see this one a lot across all renders). Note that we are not using Remote Desktop to log into the farm; we use TeamViewer.

2016-03-17 11:16:10: 0: STDOUT: MemCpy failed (CUDA_ERROR_INVALID_VALUE). This is possibly due to a GPU crash. Please re-render this scene with the ‘Debug Capture’ option enabled (in the Redshift ‘System’ tab) and, once you get the crash again, send the developers the log file html and bin files located in C:\ProgramData\Redshift\Log/Log.Latest.2. Thanks!

^^^^^^^^^^^^^^^^^^^^
When we do turn on debug capture we get an error about VMP Pinned memory.

Any help on these at all is amazingly appreciated.

We need to get this issue sorted ASAP.

Thanks

-Ryan

Hey Ryan!

Good to meet you.

Multiple issues in a single thread is going to be really fun to juggle here… But, here goes.

Report logs

If the file really doesn’t exist there, something may have gone wrong during the copy. The database has a record of it, but the file never made it to its final destination. I usually see this on overloaded file servers.
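If you want to confirm whether a given report file is actually missing rather than empty or damaged, here is a minimal Python sketch. It uses only the standard library and is not part of Deadline; the function name `check_report` is made up for illustration, and the only assumption about the reports is that they are bz2-compressed, as the `.bz2` extension in your error suggests.

```python
import bz2
from pathlib import Path


def check_report(path):
    """Classify a .bz2 report file as 'missing', 'empty', 'corrupt' or 'ok'.

    Hypothetical helper for illustration; it only inspects the file on disk.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"   # the database may reference it, but it is not on disk
    if p.stat().st_size == 0:
        return "empty"     # the copy started but no data was written
    try:
        # Decompress the whole file in chunks to make sure it is readable.
        with bz2.open(p, "rb") as fh:
            while fh.read(1024 * 1024):
                pass
    except (OSError, EOFError):
        return "corrupt"   # invalid or truncated bz2 stream
    return "ok"
```

You would call it with the exact path from the error message, for example `check_report(r"//license-server3/Lic_servRep\reports\jobs\35\b\56ea91a879f82730ac5e735b\56ea9f1d245c2411e834fa97.bz2")`, from a machine that can see the repository share.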

Requeue log

This looks like something errored out on the Slave and the Scheduler Task started dumping text into the render log. I saw that recently with a different error. Can you look in the Slave log for stack traces?
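Since the log is mostly the same license message repeated, a small filter can make the search easier. This is a rough Python sketch using only the standard library. The patterns are my guesses at what a stack trace looks like (a Python traceback header, a .NET-style `at Namespace.Method(...)` frame, or the word "Exception"), so adjust them to whatever your Slave log actually contains. `find_trace_lines` is a made-up name, not a Deadline function.

```python
import re

# Assumed markers of a stack trace; these are guesses, not a Deadline log spec.
TRACE_PATTERNS = [
    re.compile(r"Traceback \(most recent call last\)"),  # Python traceback header
    re.compile(r"\bat [\w.]+\("),                        # .NET-style stack frame
    re.compile(r"\bException\b"),                        # any exception name or message
]


def find_trace_lines(lines):
    """Return (line_number, text) pairs for lines that look like part of a stack trace."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern.search(line) for pattern in TRACE_PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

To use it, open the Slave log from the affected machine and pass the file object in, for example `with open(slave_log_path, errors="replace") as fh: hits = find_trace_lines(fh)`, where `slave_log_path` is wherever that log lives on your setup.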

GPU Crash

That’s definitely a Redshift problem. It’s really bad about just hanging around when problems happen, and Deadline is just too curious.

Can you send a full report to support@thinkboxsoftware.com and we’ll send you a little patch to make the render plugin kill things properly? I ask for the support ticket because it keeps us a bit more organized and private than the forums do.

Thanks!