AWS RCS connection really inconsistent, tanking the remote nodes

I have a ticket open regarding this issue (780670), but I figured I would leave a forum post here as well, in case other users experiencing a similar issue might find something useful here if a fix turns up.

This past week I have watched the original problem, AWS Spot fleets seemingly shutting themselves down at random, get traced to the Deadline Resource Tracker. It monitors fleet and node health with routine checks (which I am not wholly familiar with) and will shut things down if it deems them unhealthy. You can start Spot fleets without a connection to the Resource Tracker, which keeps it from getting overzealous, so that part is not a huge deal.

What appears to have caused the Resource Tracker to go wild, though, is a very flaky connection between the local Remote Connection Server (RCS) and the AWS Gateway machine, which creates all sorts of other problems.

When the Gateway is unable to connect properly to the RCS, the repository stops getting updates from the AWS Pulse: it thinks all of the workers are stalled, the fleet is in bad health, and basically everything stops working. Tasks that are still in progress and humming along fine reset themselves, wasting hundreds of hours of render time, and the Resource Tracker shuts down whole fleets because it thinks they are not communicating properly.

I have double-checked everything and re-installed our local Deadline components on the VM that runs them (both the client with RCS and the AWS Portal Link/Asset Server), and little seems to change.

I have had the RCS connection be stable a few times, but it seems to be random luck when it actually stays that way. What consistently breaks it is starting new fleets, or resizing existing fleets to bring new nodes up or shut other nodes down. Something about adding or removing nodes from the AWS fleet causes the RCS connection to die and refuse to do anything. What gives? Why has this been such a recurring problem, and what makes the Bad Gateway/Gateway Timeout error a single point of failure for the whole system? Is there any way to make this connection more consistent or stable, or to allow for failover RCS connections?
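While waiting on support, one thing that has helped me see when the connection actually drops is polling the same RCS endpoint that shows up in the logs below. This is just a diagnostic sketch, not anything official from Deadline: the URL is taken from my own worker logs, and the 30-second interval is an arbitrary choice.

```python
import time
import urllib.error
import urllib.request

# Endpoint taken from the worker log errors; adjust to your own RCS address.
RCS_URL = "http://10.128.2.4:8888/db/environment/environment"

def probe(url=RCS_URL, timeout=10):
    """Make one request to the RCS and return (ok, detail)."""
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            elapsed = time.time() - start
            return resp.status == 200, f"HTTP {resp.status} in {elapsed:.2f}s"
    except urllib.error.HTTPError as e:
        # nginx answering for a dead upstream shows up here as 502/504.
        return False, f"HTTP {e.code} ({e.reason})"
    except Exception as e:
        # Timeouts and refused connections land here.
        return False, f"{type(e).__name__}: {e}"

if __name__ == "__main__":
    while True:
        ok, detail = probe()
        print(time.strftime("%Y-%m-%d %H:%M:%S"), "OK " if ok else "FAIL", detail)
        time.sleep(30)
```

Running this in a terminal while resizing a fleet makes it obvious whether the 504s line up with the scaling event or happen on their own.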

The AWS Worker logs and the AWS Pulse logs spit out a lot of these errors:

2020-02-07 20:09:48: Scheduler Thread - Unexpected Error Occurred
2020-02-07 20:09:48: Scheduler Thread -
2020-02-07 20:09:48: Connection Server error: GET http://10.128.2.4:8888/db/environment/environment returned GatewayTimeout "
2020-02-07 20:09:48: <html>
2020-02-07 20:09:48: <head><title>504 Gateway Time-out</title></head>
2020-02-07 20:09:48: <body bgcolor="white">
2020-02-07 20:09:48: <center><h1>504 Gateway Time-out</h1></center>
2020-02-07 20:09:48: <hr><center>nginx/1.12.2</center>
2020-02-07 20:09:48: </body>
2020-02-07 20:09:48: </html>
2020-02-07 20:09:48: "
2020-02-07 20:09:48: Connection Server error: GET http://10.128.2.4:8888/db/environment/environment [Connection Server version 10.1.3.6]
2020-02-07 20:09:48: -------- Response Body --------
2020-02-07 20:09:48: <html>
2020-02-07 20:09:48: <head><title>504 Gateway Time-out</title></head>
2020-02-07 20:09:48: <body bgcolor="white">
2020-02-07 20:09:48: <center><h1>504 Gateway Time-out</h1></center>
2020-02-07 20:09:48: <hr><center>nginx/1.12.2</center>
2020-02-07 20:09:48: </body>
2020-02-07 20:09:48: </html>
2020-02-07 20:09:48: -------- Response Body --------
2020-02-07 20:09:48: (System.Net.WebException)
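For anyone trying to line these errors up with fleet-scaling events, here is a rough sketch of how I sift the worker logs: count GatewayTimeout/504 hits per minute and compare those minutes against when fleets were started or resized. The sample lines and the counting approach are my own, not anything from Deadline.

```python
import re
from collections import Counter

# Sample lines in the same shape as the worker log excerpt above.
LOG_LINES = [
    "2020-02-07 20:09:48: Connection Server error: GET "
    "http://10.128.2.4:8888/db/environment/environment returned GatewayTimeout",
    "2020-02-07 20:09:48: 504 Gateway Time-out",
    "2020-02-07 20:10:15: Scheduler Thread - task acquired",
]

def count_gateway_errors(lines):
    """Count GatewayTimeout / 504 occurrences per minute."""
    hits = Counter()
    pat = re.compile(
        r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2})"      # timestamp to the minute
        r".*?(GatewayTimeout|504 Gateway Time-out)"
    )
    for line in lines:
        m = pat.search(line)
        if m:
            hits[m.group(1)] += 1
    return hits

print(count_gateway_errors(LOG_LINES))  # Counter({'2020-02-07 20:09': 2})
```

In my case the error minutes cluster tightly around fleet size changes, which is what makes me suspect the scaling events themselves rather than general network weather.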