Hello,
I am testing Spot Fleets and it has been going well. However, I have noticed that instances are being terminated seemingly at random, and a new one is then fired up to replace each of them. This happens during rendering, roughly 50% of the way through a frame.
I am looking to understand more about why this might be happening and what I can do to avoid it happening in future.
Thank you
Ok, what I think was happening is that the instances were being marked as Stalled Workers due to the 10-minute Wait Time setting. I have increased this to 60 minutes and am testing again.
The Job has lots of Assets, so I’m assuming it’s waiting for the Asset Server to sync up before it can get going.
By the sound of it, your instances are getting taken out by ‘spot interruption’. When an instance is taken out this way, Deadline isn’t aware of it, so the Worker is marked as stalled.
To be sure that it is interruption taking your instances out, go to AWS Management Console > Services > EC2 > Instances. Click the gear icon and enable State Transition Reason and Message. These fields will indicate whether the instances are being terminated due to Spot Interruption.
List of transition reasons: StateReason - Amazon Elastic Compute Cloud
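If you would rather check this from a script than from the console, here is a minimal sketch of the same check. It assumes boto3 and AWS credentials are available for the live call; the helper names (spot_interrupted, fetch_and_report) are made up for illustration. It looks for the StateReason codes Server.SpotInstanceTermination and Server.SpotInstanceShutdown in a describe_instances response. Bear in mind that terminated instances only stay visible for a short while after termination.

```python
from typing import Any

# StateReason codes that EC2 reports when Spot takes an instance out.
SPOT_REASON_CODES = {
    "Server.SpotInstanceTermination",
    "Server.SpotInstanceShutdown",
}


def spot_interrupted(response: dict) -> list:
    """Return (instance_id, message) pairs for instances whose
    StateReason code is one of the Spot codes above.

    `response` is a dict shaped like a describe_instances response.
    """
    hits = []
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            reason: Any = instance.get("StateReason") or {}
            if reason.get("Code") in SPOT_REASON_CODES:
                hits.append((instance["InstanceId"], reason.get("Message", "")))
    return hits


def fetch_and_report(region: str) -> None:
    """Illustrative helper: query EC2 and print Spot-interrupted instances.

    Requires boto3 and credentials; imported here so the pure function
    above can be used without them.
    """
    import boto3

    ec2 = boto3.client("ec2", region_name=region)
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["stopped", "terminated"]}]
    )
    for page in pages:
        for instance_id, message in spot_interrupted(page):
            print(instance_id, message)
```

Calling fetch_and_report with your fleet's region should list any instances that Spot took out, along with the message EC2 recorded for each one.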
AWS Management Console > Services > EC2 > Spot Requests > Spot History can also be used to determine if there has been fluctuation in target capacity and Spot Interruption.
I wrote this guide on how to use placement scores; with it you can build a fleet request of resilient instances. It’s possible the instances you’re requesting are in high demand from EC2 On-Demand customers. The placement score tool will help you find capacity that is less likely to be interrupted.
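Placement scores can also be pulled programmatically. The sketch below is not taken from the guide; it is just one way to do it with the boto3 get_spot_placement_scores call. The helper names (best_placements, fetch_scores) and the cutoff of 7 are assumptions for illustration only. Scores run from 1 to 10, and higher means the request is more likely to succeed.

```python
def best_placements(scores: list, minimum: int = 7) -> list:
    """Keep entries scoring at least `minimum`, best first.

    `scores` is a list shaped like the SpotPlacementScores field of a
    get_spot_placement_scores response. The default cutoff of 7 is an
    arbitrary choice for this example, not an AWS recommendation.
    """
    keep = [s for s in scores if s.get("Score", 0) >= minimum]
    return sorted(keep, key=lambda s: s["Score"], reverse=True)


def fetch_scores(instance_types: list, target_capacity: int, regions: list) -> list:
    """Illustrative helper: ask EC2 for per-AZ Spot placement scores.

    Requires boto3 and credentials. Only the first page of results is
    read; a fuller version would follow NextToken.
    """
    import boto3

    ec2 = boto3.client("ec2", region_name=regions[0])
    response = ec2.get_spot_placement_scores(
        InstanceTypes=instance_types,
        TargetCapacity=target_capacity,
        TargetCapacityUnitType="units",  # count capacity in instances
        SingleAvailabilityZone=True,     # score individual AZs
        RegionNames=regions,
    )
    return response["SpotPlacementScores"]
```

Running fetch_scores for a few candidate instance types and passing the result to best_placements gives a shortlist of Regions or Availability Zones to favour in the fleet request.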