We are having issues with our render farm stalling. It seems to hold onto tasks for hours and then eventually stalls. Some jobs will finish frames in 5 minutes and still have 3 frames rendering for 20+ hours before stalling.
At first we thought it could have been a heat issue, so we installed extra ventilation for the farm, and the cards now sit between 60 and 70 degrees while rendering. This helped a lot and reduced the number of stalls by 75%, but we still have them. We have 4 render boxes with 8 GTX 980 Tis in each, so 32 GPUs in total.
We are submitting from Maya using the "Deadline Submission" plugin, and we are running Redshift v1.2.7 on all the machines and the render farm.
When one of the nodes stalls, it has to be shut down manually. Sending commands to the machine through Deadline doesn't work, and even trying to restart it from a command line using "shutdown /i" doesn't work. The only way to bring the slave back up is to physically turn it off, which is a major issue when we are rendering overnight.
We have used verbose debugging with Redshift, but its logs don't show anything to suggest what is causing the stalls.
In the slave reports this is the error we receive:
When we check the Deadline logs on the render box this is what it says:
Windows Event Logs don’t show much either.
Are there any other logs I should be looking at, or could someone point me towards a solution for this issue?
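One extra source of data that might be worth capturing is per-GPU telemetry from nvidia-smi, logged at intervals so the last readings before a stall survive the hard power-off. The following is only a minimal sketch: the helper names and the 80 degree threshold are assumptions for illustration, not anything from Deadline or Redshift, and nvidia-smi itself may hang on a wedged GPU, hence the timeout.

```python
import csv
import io
import subprocess

# Fields requested through nvidia-smi's --query-gpu option.
QUERY_FIELDS = ["timestamp", "index", "temperature.gpu", "utilization.gpu", "memory.used"]


def build_command(fields=QUERY_FIELDS):
    """Build the nvidia-smi command line for a single CSV sample."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_gpu_rows(text, fields=QUERY_FIELDS):
    """Turn nvidia-smi CSV output into one dict per GPU (values kept as strings)."""
    rows = []
    for raw in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not raw:
            continue  # skip blank lines
        rows.append(dict(zip(fields, (value.strip() for value in raw))))
    return rows


def hot_gpus(rows, limit_c=80):
    """Return the indices of GPUs at or above limit_c (hypothetical threshold)."""
    return [
        row["index"]
        for row in rows
        # Some fields can come back as "[Not Supported]", so guard the int().
        if row["temperature.gpu"].isdigit() and int(row["temperature.gpu"]) >= limit_c
    ]


def sample(timeout_s=30):
    """Take one reading; raises if nvidia-smi fails or does not return in time."""
    result = subprocess.run(
        build_command(), capture_output=True, text=True, check=True, timeout=timeout_s
    )
    return parse_gpu_rows(result.stdout)
```

Calling sample() from a scheduled task every minute or so and appending the rows to a file on a network share would leave a record of temperature, utilisation and memory use for each card right up to the point a node stops responding.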
Could you upgrade to Redshift v2.0.25? I'm seeing many Redshift/Maya fixes which could well help here. Could you also enable verbose logging if you haven't done so already: docs.thinkboxsoftware.com/produc … ation-data
Once the upgrade and verbose logging are in place, could you provide a zip of all the slave logs for a few slave machines which display this issue as well as a few which don't, together with the Redshift debug logs, so we can see if anything stands out? Feel free to provide all this info via your support ticket. Is there a direct relationship between lowering the number of jobs/active frames/concurrent tasks being rendered per machine and the frequency of stalled tasks per slave? Could you also confirm your hardware/network setup for all the Deadline components, particularly where and how your DB, Repo and slaves are configured and what kind of machine spec/network they reside on?
Chances are a simple Redshift software upgrade will resolve this, but all the above information will help us start debugging it.
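If it helps with gathering the slave logs requested above, something along these lines could bundle a machine's log folder into a single zip. This is just a sketch using the Python standard library: the zip_logs name and the "*.log" pattern are assumptions, and the log directory has to be whatever location your slaves actually write to.

```python
import zipfile
from pathlib import Path


def zip_logs(log_dir, out_zip, patterns=("*.log",)):
    """Zip every file under log_dir matching patterns; return the archived names."""
    log_dir = Path(log_dir)
    added = []
    with zipfile.ZipFile(out_zip, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for pattern in patterns:
            # rglob walks subfolders too; sorting keeps the archive order stable.
            for path in sorted(log_dir.rglob(pattern)):
                if not path.is_file():
                    continue
                # Store paths relative to log_dir so the zip unpacks cleanly.
                arcname = path.relative_to(log_dir).as_posix()
                if arcname not in added:
                    archive.write(path, arcname)
                    added.append(arcname)
    return added
```

Running it once on a stalling slave and once on a healthy one, with a different output name for each, gives two archives that can be compared side by side or attached to the ticket.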
The farm is now running the latest Redshift and everyone has been told to use verbose logging when submitting. We will submit the logs once we have enough jobs sent through!