Experience using AWS Portal + Spot Fleets. What can I improve?

Hey,

We have recently set up and begun testing AWS Portal and Spot Fleets, and I would like to outline my experience so far to see if there are any red flags or processes I could improve.

In a nutshell, we have it all set up and working without error on simple 3ds Max and VRay Jobs, but when testing more complex scenes with more assets we are struggling to get anywhere; again there are no errors, but no completed frames yet.

The process I am following is:

  1. Create Deadline Infrastructure
  2. Set up the Global Path Mapping rule that is required for our Jobs
  3. Launch Spot Fleet (m5.16xlarge, m4.16xlarge, r5.16xlarge) using our own AMI, which is configured the same as our racked render nodes (Windows, 3ds Max, VRay).
  4. Launched Instances pick up the Job, and on simple test scenes (teapot + HDRI + VRay proxy) they work as expected: frames complete, the Asset Server pulls frames down to the desired location as they finish, etc.

However, on the more complex projects with many more scene assets we notice the Instances are stuck on ‘Waiting to Start’ for over an hour before they begin Rendering. When they finally do begin Rendering, the Instances get terminated before the test completes, and the cycle continues.

Our Jobs are 1hr - 2hr per frame on our racked nodes which are similar in spec to the m5.16xlarge instances.

I am looking for some help on what I can do to try and improve the performance, especially around the ‘Waiting to Start’ period. I assume (but am not 100% sure) that this time is the Asset Server syncing all of the Assets into AWS, and as there are lots it is taking a while. We do have Pre-Caching enabled on the Jobs but I guess as there is no Infrastructure at the time of submission no Pre-Caching can happen?

I would really appreciate any suggestions on how to improve things and start to get some Completed frames back on our Scenes.

Thank you!

I have a bit of an update from today's test.

We successfully rendered an animation sequence on one of our 3ds Max & VRay productions without error, and the results match our on-premises render nodes 1:1.

The startup time for the first instance and task was 1h5m, and the startup time for the next 3 instances (I launched 4 in total) was 46m for the first task they rendered. Subsequent tasks' startup times were 01s.

I would really like to look into the long load time to narrow down exactly what is happening and see what can be done to speed this part of the process up, as the rest of it performs the same as our racked render nodes.

Any help on this would be appreciated.

Thank you

The log for the first frame, which had the longest delay, shows a 56m gap between ‘Loaded plugin 3dsmax’ and ‘Executing plugin command of type Initialize Plugin’. How can I best determine what was going on during this time?

2023-09-08 11:21:44:  0: Loaded plugin 3dsmax
2023-09-08 12:17:51:  0: Executing plugin command of type 'Initialize Plugin'

I’ll quote bits I want to address directly; there’s a bunch of info in these posts, so I’ll do what I can to be clear. :smiley:


  1. Launch Spot Fleet (m5.16xlarge, m4.16xlarge, r5.16xlarge) . . .

If possible, add more instance types. The more instance types you’ve got the more resistant to spot interruption your fleet will be. Check out this guide I wrote a while back on the placement score tool. It’s super handy if/when you run into interruptions.
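If it helps, the placement score tool can also be queried straight from the AWS CLI before you edit the fleet request. A minimal sketch, assuming a 4-instance target; the region and the extra 64-vCPU instance types here are purely examples, so swap in whatever region and shortlist you actually use:

# Compare Spot placement scores for a candidate instance type list
aws ec2 get-spot-placement-scores \
    --instance-types m5.16xlarge m5a.16xlarge m5d.16xlarge m4.16xlarge r5.16xlarge r5a.16xlarge \
    --target-capacity 4 \
    --region-names eu-west-1

Scores come back per Region (add --single-availability-zone to score individual AZs instead), so it’s a cheap way to compare candidate lists before relaunching the fleet.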


It looks like your bottleneck is in getting the job files up to the S3 bucket.

Given it’s a simple scene, however, I’m surprised it’s taking so long to get the files uploaded. I’d want to see what the asset server is up to in that time. It could be as simple as the upload does indeed take that long, but it’s better to be sure.

If the asset server log doesn’t show uploads during that time, try right-clicking the Worker in question while it’s sitting at ‘Waiting to Start’ and choosing ‘Connect to Worker Log’ to see what it’s actually doing. That state just means we’re waiting on 3dsMax to fire up.


Our Jobs are 1hr - 2hr per frame on our racked nodes which are similar in spec to the m5.16xlarge instances.

Have you looked at tile rendering to get the per-task times down? Getting your instance taken away three-quarters of the way through a 2-hour render is painful, since the whole frame would have to be redone.


We do have Pre-Caching enabled on the Jobs but I guess as there is no Infrastructure at the time of submission no Pre-Caching can happen?

You’re exactly right. Though if you have the option checked off and later start up an Infrastructure, you can manually trigger the upload. Here are details on the whole system:

The Asset Server pre-cache will select which files to upload based on key/value pairs added to the JobInfo file:

AWSAssetFile0=pathToFile
AWSAssetFile1=pathToFile
AWSAssetFile2=pathToFile

To pre-cache a job’s AWSAssetFiles run the following command:

deadlinecommand AWSPortalPrecacheJob <JobID>

This command can also be triggered through the Web Service if needed, using the following URL:

http://<WebServiceHost>:8082/AWSPortalPrecacheJob?<JobID>
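For example, with a made-up job ID and Web Service hostname (both are placeholders, not values from your farm), the two triggers would look like:

# Trigger the pre-cache from deadlinecommand
deadlinecommand AWSPortalPrecacheJob 64fa0c1b2d3e4f5a6b7c8d9e

# Or hit the Web Service endpoint instead
curl "http://deadline-webservice:8082/AWSPortalPrecacheJob?64fa0c1b2d3e4f5a6b7c8d9e"

Either way, the Asset Server starts uploading that job’s AWSAssetFile list, so the upload can begin as soon as an Infrastructure is up rather than when the first Worker picks up a task.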

The startup time for the first instance and task was 1h5m, and the startup time for the next 3 instances (I launched 4 in total) was 46m for the first task they rendered. Subsequent tasks' startup times were 01s.

Other than that giant load time this makes sense. We leave 3dsMax running after the first task, so it’s only started the first time. For the other tasks we just tell it to render a new set of frames.

2023-09-08 11:21:44:  0: Loaded plugin 3dsmax
2023-09-08 12:17:51:  0: Executing plugin command of type 'Initialize Plugin'

If you go into the Monitor under Tools → Configure Plugins → 3dsMax and enable ‘Verbose Logging’, we should see more chatter there. It won’t help diagnose render issues, but for troubleshooting an application plugin (like we’re doing here) it’s great. Turn it off afterwards, as it’ll make your task reports really hard to read.

After doing that could you try firing up one of our AMIs instead of your own? I’m not sure if this is rooted in the AMI or the 3dsMax application plugin. Thanks!


Hi Justin,

Thanks for such a detailed reply, it's much appreciated!

If possible, add more instance types. The more instance types you’ve got the more resistant to spot interruption your fleet will be. Check out this guide I wrote a while back on the placement score tool. It’s super handy if/when you run into interruptions.

Thanks, I will look into this and see if I can identify more similar-spec instances to include in the list.

It looks like your bottleneck is in getting the job files up to the S3 bucket.

This is without question a big contributing factor. I left yesterday's infrastructure open in order to test again today, knowing that all the assets are already up there, and requeued yesterday's completed Job. The startup time today was reduced to 30m. This is a big improvement on yesterday's 1h; however, 30m to load the max scene for the first time is still something I want to fix. For context, our on-prem render nodes' startup time for the same job is 3m.

Have you looked at tile rendering to get the per-task times down?

We do use Tile rendering for our high-resolution stills work, but it's not something we use on animations. I was a bit off with what our render times actually are: they are between 13m and 30m per task, which I consider acceptable for 6K at final production quality.

To pre-cache a job’s AWSAssetFiles run the following command:

Just to check that I understand this correctly: if we disable Asset PreCache on submission by default, can we manually trigger a Job to Pre-Cache by running this command? Does it just cache every asset the Job requires, as it won't have the asset list in the JobInfo file?

After doing that could you try firing up one of our AMIs instead of your own?

I’d be reluctant to go down this route, as I want 100% control over it to ensure the AMI is a mirror of our on-premises render nodes.

It seems like my main focus is now on identifying why 3ds Max takes so long to start up on the first frame of each Job.