We are in the process of cycling out our old Deadline farm and have installed a new deadline setup based on 14.0.8.
The system was looking fine, but after a while we started noticing that some jobs were not getting picked up, sometimes taking up to 30 minutes.
Power management was not always happening (Wake on LANs were not being processed, machines were not powering down as expected, …)
General information:
Deadline installation split in 3 VMs on a dell R730xD host connected by 10Gb SFP+
All installations are on Rocky Linux 9.4
VM1 Mongodb installed
VM2 Deadline RCS installed
VM3 Deadline WEb + pulse installed. (this one has been accidentally updated to 9.5, rollback attempts have failed)
All connectivity by network have been verified, and all DNS checks are ok.
We noticed that it seems Pulse was crashing continuously between 5 minutes to an hour, restarts and then dies again. Some sort of looping crash.
Digging into the deadline logs we see a lot of small pulse logs, also looping with the following messages:
Feb 26 13:59:04 deadline-web-01 bash[63441]: Another repository repair process is already in progress
Feb 26 13:59:04 deadline-web-01 bash[63441]: Another house cleaning process is already in progress
Feb 26 13:59:04 deadline-web-01 bash[63441]: Another pending job scan process is already in progress
We see these a lot, is this normal behaviour?
It starts a new log after each crash (handy for timing purposes though)
Back to the issue, crashing pulse.
As the pulse log doesnt log its crashes, i started digging into /var/log/messages, finding the following concerning messages:
deadlineweb01coredump.zip (4.5 KB)
Key issues identified:
First Crash (10:03:33) - deadlinepulse process:
Error: malloc(): unaligned tcache chunk detected
Root cause: Heap corruption during OpenSSL cryptography operations
Stack trace shows failure in RSA signing operations CRYPTO_malloc → CRYPTO_zalloc → evp_md_init_internal → PKCS1_MGF1 → RSA_padding_add_PKCS1_PSS_mgf1 → rsa_sign
Second Crash (10:04:46) - deadlinewebservice process:
Error: segfault at 20 in libcrypto.so.3.2.2
Root cause: Memory access violation in SHA256_Update function
Both crashes point to problems in libcrypto.so.3 (OpenSSL)
Recovery attempt (10:07:34):
System automatically restarted both Pulse and Web Service components
Services appeared to reconnect to the repository successfully
We have applied the following mitigation steps:
Time is manual set - fix
sudo chronyd -q 'server [dc1.vfx.int](http://dc1.vfx.int/) iburst'
sudo systemctl enable --now chronyd
sudo systemctl stop chronyd
sudo chronyc -a makestep
sudo systemctl start chronyd
date
Fix package inconsistencies
sudo dnf clean all
sudo dnf check
sudo dnf distro-sync
sudo dnf reinstall openssl openssl-libs libcrypto.so.3
We have already rolled back to our 9.4 install with all fixes in place, but we still these errors appearing.