Remote Connection Server (RCS) Becoming Unresponsive

The Core Issue

RCS becomes unresponsive at random. The process still appears to be running in Activity Monitor, but it refuses connections. This manifests as:

  • nc -zv server 8080 returns “Connection refused”
  • curl http://server:8080/ fails with connection timeout
  • External contractors cannot connect via Cloudflare tunnel
  • Only resolution: kill the RCS process and restart it
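
"Connection refused" and a connection timeout are different failure modes at the TCP level: a refusal usually means nothing is accepting on the port, while a timeout means the attempt got no answer at all. A minimal Python sketch for recording which one occurs at the moment of failure follows. The host `server` and port 8080 come from the commands above; the function name and the 3-second timeout are illustrative assumptions.

```python
import socket


def classify_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Make one TCP connect attempt and report how it ended.

    Returns "open", "refused", "timeout", or "error:<reason>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer actively rejected the connection (TCP RST).
        return "refused"
    except socket.timeout:
        # No reply at all within the timeout.
        return "timeout"
    except OSError as exc:
        # DNS failure, unreachable network, and similar.
        return f"error:{exc}"
```

Logging `classify_tcp("server", 8080)` alongside each failed check would show whether the hung RCS has stopped listening or is listening without answering.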

This occurs regardless of load—even with just 1-2 external users, RCS fails unpredictably at all hours, causing production delays and missed deadlines.

Environment Details

Hardware: Apple Silicon Mac mini
Deadline Version: 10.4.1.8 → 10.4.1.10
Repository: Hosted on Qumulo cluster (not local)
Usage: ~100 workers (Direct Connection) + external contractors (RCS)

Background Context

We’ve successfully operated Deadline across various configurations over the years, from VM clusters to dedicated hardware. When upgrading to the latest version, we chose a Mac mini for its proven reliability and rapid deployment capabilities compared to rebuilding our VM infrastructure—particularly given current VMware licensing changes and associated hardware costs.

Important: Rosetta must be installed, but the system requirements do not mention it. This cost us significant troubleshooting time and should be stated clearly in the documentation.

Configuration & Improvements

Initial vanilla setup: Repository installer, then three Client installations (Client, RCS, Web Services) with “Install Launcher As A Daemon” selected.

After reliability issues, we developed a more robust startup sequence by moving services to ~/Library/LaunchAgents and implementing a custom mounting script that:

  1. Verifies network connectivity
  2. Ensures repository mount is available before service startup
  3. Launches services in proper dependency order (Launcher → RCS → Web Service → Pulse)
  4. Includes retry logic and comprehensive logging
  5. Monitors every 5 minutes with automatic recovery
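
The mount check and ordered launch (steps 2 and 3, with the retry logic from step 4) can be sketched roughly as follows. This is an illustration, not the exact script in use: the `launch` callable, the retry counts, and the mount path in the usage note are assumptions, and the actual command that starts each service is deliberately left to the caller.

```python
import os
import time
from typing import Callable, List, Sequence

# Dependency order from the list above; how each service is started is site-specific.
SERVICE_ORDER = ["Launcher", "RCS", "Web Service", "Pulse"]


def wait_for_mount(path: str,
                   attempts: int = 10,
                   delay: float = 5.0,
                   is_mounted: Callable[[str], bool] = os.path.ismount,
                   sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until `path` is a mount point, giving up after `attempts` checks."""
    for attempt in range(1, attempts + 1):
        if is_mounted(path):
            return True
        if attempt < attempts:
            sleep(delay)
    return False


def start_in_order(mount_path: str,
                   services: Sequence[str],
                   launch: Callable[[str], bool],
                   **wait_kwargs) -> List[str]:
    """Start services in order, but only once the repository mount is present.

    `launch` is a hypothetical callable that starts one service and returns
    True on success. Returns the names of the services that were started.
    """
    if not wait_for_mount(mount_path, **wait_kwargs):
        return []  # repository never appeared, so start nothing
    started: List[str] = []
    for name in services:
        if not launch(name):
            break  # stop at the first failure so later services do not start without it
        started.append(name)
    return started
```

A call such as `start_in_order("/Volumes/repository", SERVICE_ORDER, launch=my_launcher)` would then refuse to start anything until the mount is present; both the path and `my_launcher` are placeholders.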

This approach addresses a fundamental oversight in the default setup, where services attempt to start before the repository is accessible.

Current Monitoring & Mitigation

We’ve implemented a 2-minute watchdog service that:

  • Tests HTTP connectivity to RCS
  • Logs failures with timestamps
  • Kills hung processes automatically
  • Restarts services and sends email alerts
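
The watchdog cycle can be sketched roughly like this. It is a simplified illustration under stated assumptions, not the production script: `restart` and `alert` are hypothetical callables standing in for the kill/restart and email steps, and the two-failure threshold is an assumption added so that a single slow response does not trigger a restart.

```python
import http.client
import logging
import urllib.error
import urllib.request
from typing import Callable

# Timestamped log lines, matching the "logs failures with timestamps" step.
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s", level=logging.INFO)
log = logging.getLogger("rcs_watchdog")


def http_ok(url: str, timeout: float = 10.0) -> bool:
    """Return True if the server gives any HTTP response within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # an error status still proves the process is answering
    except (OSError, http.client.HTTPException):
        return False  # refused, timed out, reset, or malformed response


def check_once(probe: Callable[[], bool],
               restart: Callable[[], None],
               alert: Callable[[str], None],
               failures: int,
               threshold: int = 2) -> int:
    """Run one watchdog cycle and return the updated consecutive-failure count."""
    if probe():
        return 0
    failures += 1
    log.warning("RCS probe failed (%d consecutive)", failures)
    if failures >= threshold:
        restart()  # hypothetical: kill the hung process and start it again
        alert(f"RCS restarted after {failures} failed probes")
        return 0
    return failures
```

Calling `check_once(lambda: http_ok("http://server:8080/"), restart, alert, failures)` every 120 seconds and carrying the returned count into the next call reproduces the 2-minute cycle described above.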

However, this is reactive—we need RCS to simply work reliably.

Questions & Concerns

  1. RCS Apple Silicon Compatibility: Are there known RCS reliability issues on Apple Silicon Macs, where Rosetta is required? Community reports indicate that Deadline Workers crash on M1 systems after 10-15 minutes, and our RCS stability issues may be related.

  2. Performance Limitations: AWS documents RCS performance issues with larger farms, recommending load balancing for >500 workers. Is our ~100 worker count approaching these limits?

  3. Version Status: What happened to version 10.4.1.8? Our offline documentation shows it, but the release notes now jump from 10.4.1.6 to 10.4.1.9.

Community Input Needed

Is anyone else experiencing similar RCS unresponsiveness issues, particularly on Apple Silicon Macs? The fact that RCS appears running but becomes completely unresponsive suggests something deeper than basic connectivity problems.

We’re running out of ideas on where to look next. We’ve worked through our usual debugging approaches (network, dependencies, resources), but we’re hitting a wall trying to identify what’s actually causing RCS to hang.

The RCS instability is severely impacting our ability to support remote contractors reliably. Any insights into Apple Silicon compatibility, scale limitations, or configuration improvements would be greatly appreciated.

Environment Summary:

  • macOS with Apple Silicon (Rosetta required)
  • External repository on network storage
  • Mixed internal (Direct Connection) and external (RCS) worker setup
  • Verbose logging enabled for all services