Deadline Fleet Incorrect Termination Math

jordanphstein · April 26, 2021, 2:17pm

During our day-to-day operations, we noticed that with the Resource Tracker enabled in Deadline, our fleet was being terminated for no apparent reason.

This led us to investigate the CloudWatch logs, which revealed the following…

{'version': '0', 'id': '2c67b141-e947-bb03-ab46-a9e40ca25f70', 'detail-type': 'Deadline Resource Tracker Fleet Health State-Change Notification', 'source': 'DeadlineResourceTracker.FleetHealthReporter', 'account': '021677115756', 'time': '2021-04-22T17:14:08Z', 'region': 'us-east-1', 'resources': [], 'detail': {'fleet_id': 'sfr-01c47b06-0e61-4a69-8b4d-6efd04aca31d', 'capacity': 1, 'healthy_node_count': 84, 'unhealthy_node_count': 2, 'grace_period_start_time': '1970-01-01T00:00:00', 'message': 'Terminating fleet, 2 unhealthy node(s), 200% of 1 total capacity', 'status': 'Unhealthy-Fleet Terminated'}}

The part that stood out was the message near the end: “Terminating fleet, 2 unhealthy node(s), 200% of 1 total capacity”

We are not sure what the most common use case is for the Resource Tracker, but for us, our capacity ramps up and down dramatically throughout the day. What this means is that if our farm is cranking along, we may get up to a capacity of 200. However, within a few minutes of the queue being completed, Deadline will appropriately request that our Spot Fleets have their capacity brought way back down, usually to 0 or 1.

As you can see from the log above, the fleet was terminated because it had 2 unhealthy nodes against a total capacity of 1. However, when I looked at our Spot Fleet at the time of this log entry, we had 200 nodes active. That meant that in reality we had 198 healthy nodes and 2 unhealthy nodes, which is 1% unhealthy rather than 200%. The termination of our fleet was not in line with the threshold we had set in fleet_health_reporter.py.
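The arithmetic behind the two readings can be checked directly. This is only a sketch using the numbers quoted above (2 unhealthy nodes, a target capacity of 1, and 200 active nodes); the variable names are made up for illustration and are not taken from the Resource Tracker code.

```python
from decimal import Decimal

# Numbers quoted in the post; variable names are illustrative only.
unhealthy_nodes = Decimal(2)
target_capacity = Decimal(1)    # 'capacity' in the log: what the fleet was asked to scale to
active_nodes = Decimal(200)     # what the Spot Fleet was actually running

# Dividing by the target reproduces the figure in the log message.
ratio_vs_target = unhealthy_nodes / target_capacity

# Dividing by the nodes that are actually running gives the real share.
ratio_vs_active = unhealthy_nodes / active_nodes

print(f"{ratio_vs_target:.0%} of target capacity")
print(f"{ratio_vs_active:.0%} of active nodes")
```

The only difference between the two results is the denominator, which is why the problem only shows up while the fleet is ramping down.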

This discovery led to a subtle change in our fleet_health_reporter.py within Lambda.

# Get information using Spot Fleet keys

if 'SpotFleetRequestId' in fleet:
    parsed_fleet['fleet_id'] = fleet['SpotFleetRequestId']
    parsed_fleet['capacity'] = fleet['SpotFleetRequestConfig']['TargetCapacity']
    parsed_fleet['fulfilledcapacity'] = fleet['SpotFleetRequestConfig']['FulfilledCapacity']

We added the fulfilledcapacity variable so that later down in the function we knew how many nodes were actually active, instead of just the capacity, which was often not a realistic representation of our fleet.
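To make the parsing step concrete, here is a minimal standalone sketch. It assumes the fleet entry has the shape of one item from the EC2 describe_spot_fleet_requests response; the function name parse_spot_fleet and the sample values are hypothetical and are not the actual Resource Tracker code.

```python
from typing import Any, Dict


def parse_spot_fleet(fleet: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: pull out the fields the health check needs."""
    parsed_fleet: Dict[str, Any] = {}
    if 'SpotFleetRequestId' in fleet:
        config = fleet['SpotFleetRequestConfig']
        parsed_fleet['fleet_id'] = fleet['SpotFleetRequestId']
        # Target the fleet is scaling towards; can drop to 0 or 1 within minutes.
        parsed_fleet['capacity'] = config['TargetCapacity']
        # Capacity that is actually running right now.
        parsed_fleet['fulfilledcapacity'] = config['FulfilledCapacity']
    return parsed_fleet


# Sample entry with made-up values that mirror the ramp-down situation in the post.
sample = {
    'SpotFleetRequestId': 'sfr-01c47b06-0e61-4a69-8b4d-6efd04aca31d',
    'SpotFleetRequestConfig': {'TargetCapacity': 1, 'FulfilledCapacity': 200.0},
}
parsed = parse_spot_fleet(sample)
print(parsed['capacity'], parsed['fulfilledcapacity'])
```

Keeping both values side by side makes the mismatch visible: the target is already 1 while 200 units are still fulfilled.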

Lastly, we modified the function behind the fleet termination decision. We added an "or" condition to catch some rounding errors, and changed the division so that the percentage is the number of unhealthy nodes divided by the total active capacity, not by the fleet’s capacity goal, which it may or may not hit in the future.

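def _unhealthy_percentage(fleet: FleetDict) -> Decimal:
    # Treat a fleet with no target or no fulfilled capacity as 0% unhealthy,
    # which also avoids dividing by zero.
    if Decimal(fleet['capacity']) == 0 or Decimal(fleet['fulfilledcapacity']) == 0:
        return Decimal(0)
    return Decimal(fleet['unhealthy_node_count']) / Decimal(fleet['fulfilledcapacity'])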
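For anyone who wants to try the change outside of Lambda, the following is a self-contained sketch of the same idea. The FleetDict alias, the should_terminate helper, and the 10% threshold are assumptions made for this example; the real type and threshold live in fleet_health_reporter.py and may differ.

```python
from decimal import Decimal
from typing import Any, Dict

# Assumption: stand-in for the FleetDict type used in fleet_health_reporter.py.
FleetDict = Dict[str, Any]

# Hypothetical threshold for this example only.
UNHEALTHY_THRESHOLD = Decimal("0.10")


def unhealthy_percentage(fleet: FleetDict) -> Decimal:
    """Share of unhealthy nodes relative to the capacity that is actually running."""
    fulfilled = Decimal(fleet['fulfilledcapacity'])
    if Decimal(fleet['capacity']) == 0 or fulfilled == 0:
        return Decimal(0)
    return Decimal(fleet['unhealthy_node_count']) / fulfilled


def should_terminate(fleet: FleetDict) -> bool:
    """Hypothetical decision helper: terminate only above the threshold."""
    return unhealthy_percentage(fleet) > UNHEALTHY_THRESHOLD


# Fleet ramping down: target already 1, but 200 units still running.
ramping_down = {'capacity': 1, 'fulfilledcapacity': 200, 'unhealthy_node_count': 2}
# Fleet fully scaled in: nothing to divide by.
idle = {'capacity': 0, 'fulfilledcapacity': 0, 'unhealthy_node_count': 0}

print(unhealthy_percentage(ramping_down), should_terminate(ramping_down))
print(unhealthy_percentage(idle), should_terminate(idle))
```

With the fulfilled capacity as the denominator, the ramp-down case from the log stays well under the example threshold, and the zero-capacity case returns 0 instead of raising a division error.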

MikeOwen · February 5, 2022, 1:53pm

For anyone reading this thread after April 2021: Thinkbox took this code proposal, fully fleshed it out (a few other lines needed updating), and released it in 2021. Since Deadline v10.0.10, Resource Tracker always deploys the latest version of RT, so as long as you're running Deadline v10.0.10 or newer, you already have this fix applied to your deployed RT.

How to check your RT version: Resource Tracker Overview — Deadline 10.1.20.3 documentation