Problems with Deadline pulse crashes on version 14.0.8 SSL / lib / cert errors

TheFridge · February 27, 2025, 9:18am

We are in the process of cycling out our old Deadline farm and have installed a new deadline setup based on 14.0.8.

The system was looking fine, but after a while we started noticing that some jobs were not getting picked up, sometimes taking up to 30 minutes.

Power management was not always happening (Wake on LANs were not being processed, machines were not powering down as expected, …)

General information:

Deadline installation split in 3 VMs on a dell R730xD host connected by 10Gb SFP+

All installations are on Rocky Linux 9.4

VM1 Mongodb installed
VM2 Deadline RCS installed
VM3 Deadline WEb + pulse installed. (this one has been accidentally updated to 9.5, rollback attempts have failed)

All connectivity by network have been verified, and all DNS checks are ok.

We noticed that it seems Pulse was crashing continuously between 5 minutes to an hour, restarts and then dies again. Some sort of looping crash.

Digging into the deadline logs we see a lot of small pulse logs, also looping with the following messages:

Feb 26 13:59:04 deadline-web-01 bash[63441]: Another repository repair process is already in progress
Feb 26 13:59:04 deadline-web-01 bash[63441]: Another house cleaning process is already in progress
Feb 26 13:59:04 deadline-web-01 bash[63441]: Another pending job scan process is already in progress

We see these a lot, is this normal behaviour?
It starts a new log after each crash (handy for timing purposes though)

Back to the issue, crashing pulse.

As the pulse log doesnt log its crashes, i started digging into /var/log/messages, finding the following concerning messages:

deadlineweb01coredump.zip (4.5 KB)

Key issues identified:

First Crash (10:03:33) - deadlinepulse process:

Error: malloc(): unaligned tcache chunk detected
Root cause: Heap corruption during OpenSSL cryptography operations

Stack trace shows failure in RSA signing operations CRYPTO_malloc → CRYPTO_zalloc → evp_md_init_internal → PKCS1_MGF1 → RSA_padding_add_PKCS1_PSS_mgf1 → rsa_sign

Second Crash (10:04:46) - deadlinewebservice process:

Error: segfault at 20 in libcrypto.so.3.2.2

Root cause: Memory access violation in SHA256_Update function

Both crashes point to problems in libcrypto.so.3 (OpenSSL)

Recovery attempt (10:07:34):

System automatically restarted both Pulse and Web Service components
Services appeared to reconnect to the repository successfully

We have applied the following mitigation steps:

Time is manual set - fix

sudo chronyd -q 'server [dc1.vfx.int](http://dc1.vfx.int/) iburst'
sudo systemctl enable --now chronyd
sudo systemctl stop chronyd
sudo chronyc -a makestep
sudo systemctl start chronyd
date

Fix package inconsistencies

sudo dnf clean all
sudo dnf check
sudo dnf distro-sync
sudo dnf reinstall openssl openssl-libs libcrypto.so.3

We have already rolled back to our 9.4 install with all fixes in place, but we still these errors appearing.

TheFridge · March 20, 2025, 10:07am

As we did some more testing, it looks like pulse was fundamentally causing this. Pulse was always trying to do the housekeeping ( as these settings seem to be default enabled when pulse is installed)
Disabling all housekeeping on pulse on our webservices VM it looks to have stabilized.

We still get a log full of the rules being disabled, as being pointless, but we do need pulse for WOL and general power management.

We are going to setup a 4th for pulse only, but it seems overkill but it looks to be the best way to seperate any pulse related issues on affecting our webservices.

Root cause could be openssl getting overloaded by the pulse trying to do the housekeeping or some process or lib related to that process. tbh we havent dug deeper into this as it is weird that this isnt natively supported or setup correctly on install?