We have recently set up and begun testing AWS Portal and Spot Fleets, and I would like to outline my experience so far to see if there are any red flags or processes I could improve.
In a nutshell, everything is set up and working without error on simple 3ds Max and VRay Jobs, but on more complex scenes with more assets we are struggling to get anywhere: again no errors, but no completed frames yet.
The process I am following is:
- Create Deadline Infrastructure
- Setup Global Path Mapping rule that is required for our Jobs
- Launch a Spot Fleet (m5.16xlarge, m4.16xlarge, r5.16xlarge) using our own AMI, which is configured the same as our racked render nodes (Windows, 3ds Max, VRay).
- Launched instances pick up the Job, and on simple test scenes (teapot + HDRI + VRay Proxy) they work as expected: frames complete, the Asset Server pulls frames down to the desired location as they finish, etc.
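For reference, the fleet we launch is roughly equivalent to the request config sketched below. This is a hypothetical illustration only: AWS Portal builds the real request for us, so the AMI ID, target capacity, and allocation strategy shown here are placeholder assumptions; only the three instance types are our actual settings.

```python
# Hypothetical sketch of the Spot Fleet request AWS Portal builds for us.
# ImageId, TargetCapacity, and AllocationStrategy are placeholder values.
fleet_config = {
    "AllocationStrategy": "capacityOptimized",  # assumed; favours pools less likely to be interrupted
    "TargetCapacity": 10,                       # assumed fleet size
    "Type": "maintain",
    "LaunchSpecifications": [
        {
            "ImageId": "ami-00000000",          # placeholder for our custom Windows/3ds Max/VRay AMI
            "InstanceType": instance_type,
        }
        for instance_type in ["m5.16xlarge", "m4.16xlarge", "r5.16xlarge"]
    ],
}
print(len(fleet_config["LaunchSpecifications"]))  # 3
```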
However, on the more complex projects with many more scene assets, we notice the instances are stuck on ‘Waiting to Start’ for over an hour before they begin rendering. When they finally do start rendering, the instances get terminated before the test completes, and the cycle repeats.
Our Jobs take 1-2 hours per frame on our racked nodes, which are similar in spec to the m5.16xlarge instances.
I am looking for some help on what I can do to improve performance, especially around the ‘Waiting to Start’ period. I assume (but am not certain) that this time is the Asset Server syncing all of the assets into AWS, and that with so many assets it simply takes a while. We do have Pre-Caching enabled on the Jobs, but I guess that because no Infrastructure exists at submission time, no Pre-Caching can happen?
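As a back-of-envelope sanity check on that assumption, the stall should roughly match total asset size divided by our uplink bandwidth. The numbers below are illustrative guesses (I have not measured our actual asset payload or upload speed), but they show that a sync on this order of size would plausibly account for the hour-long ‘Waiting to Start’:

```python
# Rough estimate of Asset Server sync time.
# Both inputs are assumptions, not measured values.
asset_size_gb = 50    # assumed total size of scene assets to upload
uplink_mbps = 100     # assumed office upload bandwidth in megabits/s

# GB -> gigabits -> megabits, divided by megabits/s gives seconds
sync_seconds = (asset_size_gb * 8 * 1000) / uplink_mbps
print(f"~{sync_seconds / 60:.0f} minutes to sync")  # ~67 minutes
```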
I would really appreciate any suggestions on how to improve things and start getting some completed frames back on our scenes.