Render Farm Stalling

RyanIG · April 3, 2016, 12:17pm

Hi,

We are having issues with our render farm stalling. Nodes seem to hold onto tasks for hours and then eventually stall. Some jobs will finish frames in 5 minutes and still have 3 frames rendering for 20+ hours before stalling.

At first we thought it could have been a heat issue, so we got in extra ventilation for our farm. The machines now sit between 60 and 70 degrees while rendering. This helped a lot and reduced the number of stalls by 75%, but we still have them. We have 4 render boxes with 8 GTX 980 Ti’s in each one, so 32 GPUs in total.

We are submitting from Maya using the Deadline submission plugin, and we are running Redshift v1.2.7 on all the machines and the render farm.

When one of the nodes stalls, it has to be shut down manually. Sending commands to the machine through Deadline doesn’t work, and even trying to restart it from a command line using ‘shutdown /i’ doesn’t work. The only way to bring the slave back up is to physically turn it off, which is a major issue when we are rendering overnight.

We have used verbose debugging with Redshift, but nothing in its logs suggests what is causing the stalls.

In the slave reports this is the error we receive:

STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: License-Server
Version: v7.2.2.1 R (8d27fcaf8)

Stalled Slave: RenderBox3-GPU1
Slave Version: v7.2.2.1 R (8d27fcaf8)
Last Slave Update: 2016-04-01 18:05:39
Current Time: 2016-04-01 18:26:29
Time Difference: 20.828 m
Maximum Time Allowed Between Updates: 20.000 m

Current Job Name: ExampleJobName
Current Job ID: 56fe8d05e48d05e48d05e4
Current Job User: Users Name
Current Task Names: 1041
Current Task Ids: 40

Searching for job with id “56fe8d05e48d05e48d05e4”
Found possible job: ExampleJobName
Searching for task with id “40”
Found possible task: 40:[1041-1041]
Task’s current slave: RenderBox2-GPU4
Slave machine names do not match, continuing search
Associated job not found, it has probably been deleted.

Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.
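
For anyone reading the report above, the stall check comes down to simple timestamp arithmetic. The following is only a rough Python sketch of that arithmetic, not Deadline's actual code: the function names are made up for illustration, and the strict "greater than" comparison against the 20 minute limit is an assumption. The values are copied from the report.

```python
from datetime import datetime

# Values copied from the stalled slave report.
LAST_UPDATE = "2016-04-01 18:05:39"
CURRENT_TIME = "2016-04-01 18:26:29"
MAX_MINUTES = 20.0  # "Maximum Time Allowed Between Updates"


def minutes_between(start: str, end: str) -> float:
    """Return the elapsed time between two report timestamps, in minutes."""
    fmt = "%Y-%m-%d %H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 60.0


def is_stalled(start: str, end: str, max_minutes: float) -> bool:
    """Hypothetical check: flag the slave once the gap exceeds the limit."""
    return minutes_between(start, end) > max_minutes


elapsed = minutes_between(LAST_UPDATE, CURRENT_TIME)
print(f"{elapsed:.3f} m since last update")
print("stalled:", is_stalled(LAST_UPDATE, CURRENT_TIME, MAX_MINUTES))
```

With whole-second timestamps this gives about 20.833 m. The report shows 20.828 m, presumably because it is computed from sub-second times that the printed timestamps drop. Either way the slave went just over the 20 minute limit without reporting in.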

When we check the Deadline logs on the RenderBox, this is what it says:

.

Windows Event Logs don’t show much either.

Are there any other logs I should be looking at, or could someone point me towards a solution for this issue?

Any help is greatly appreciated.

Cheers

-Ryan

MikeOwen · April 3, 2016, 1:53pm

Hi,

Could you upgrade to Redshift v2.0.25? I’m seeing many Redshift/Maya fixes which could well help here. Could you also enable verbose logging if you haven’t done so already:
docs.thinkboxsoftware.com/produc … ation-data

Once the upgrade and verbose logging are in place, could you provide a zip of all the slave logs for a few slave machines that display this issue, as well as for slaves that don’t, together with the Redshift debug logs? Then we can see if anything stands out. Feel free to provide all this info via your support ticket.

Is there a direct relationship between lowering the number of jobs/active frames/concurrent tasks being rendered per machine and the frequency of stalled tasks per slave? Could you also confirm your hardware/network setup for all the Deadline components, particularly where and how your DB, Repo and slaves are configured, and what kind of machine spec and network they reside on?
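
If it helps with gathering the logs, something along these lines could bundle them into one zip per machine. This is just a sketch: `zip_logs` is a made-up helper, the `*.log` pattern is an assumption, and the log folder location depends on your install, so point it at wherever your slave and Redshift logs actually live.

```python
import zipfile
from pathlib import Path
from typing import List


def zip_logs(log_dir: Path, archive: Path, pattern: str = "*.log") -> List[str]:
    """Zip every file under log_dir matching pattern; return the archived names.

    log_dir, archive and pattern are supplied by the caller; nothing here
    knows where Deadline or Redshift write their logs.
    """
    files = sorted(p for p in log_dir.rglob(pattern) if p.is_file())
    names = [p.relative_to(log_dir).as_posix() for p in files]
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path, name in zip(files, names):
            # Store paths relative to log_dir so the zip unpacks cleanly.
            zf.write(path, arcname=name)
    return names


# Example call (both paths are placeholders, not real Deadline locations):
# zip_logs(Path(r"D:\farm_logs\RenderBox3"), Path(r"D:\RenderBox3-logs.zip"))
```

Running it once per render box, for both stalling and healthy machines, keeps the logs easy to compare side by side.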

Chances are a simple Redshift software upgrade will resolve this issue, but all the above information will help to start debugging this issue.

RyanIG · April 4, 2016, 12:11pm

The farm is now running the latest Redshift and everyone has been told to use verbose logging when submitting. We will submit the logs once we have enough jobs sent through!

Cheers

-Ryan