pulse restart failure

LaszloSebo · September 1, 2014, 6:21am

We have been running a cronjob that triggers a restart of pulse every 30 minutes due to issues with the dependency checking. Tonight this process failed.

After many successful restarts, the launcher started failing to start pulse with the following exception:

2014-08-31 21:00:05:  Local version file: /opt/Thinkbox/Deadline6/bin/Version
2014-08-31 21:00:05:  Network version file: /mnt/isila/deadline/repository6/bin/Linux/Version
2014-08-31 21:00:05:  Comparing version files...
2014-08-31 21:00:05:  Version files match
2014-08-31 21:00:05:  Launching Pulse
2014-08-31 21:00:05:  Launcher Thread - Responded with: Success|
2014-08-31 21:00:05:  Failed to spawn process "/opt/Thinkbox/Deadline6/bin/deadlinepulse" with "-nogui " arguments
2014-08-31 21:00:05:  Exception Details
2014-08-31 21:00:05:  Win32Exception -- ApplicationName='/opt/Thinkbox/Deadline6/bin/deadlinepulse', CommandLine='-nogui ', CurrentDirectory='/opt/Thinkbox/Deadline6/bin'
2014-08-31 21:00:05:  Win32Exception.NativeErrorCode: 14
2014-08-31 21:00:05:  ExternalException.ErrorCode: -2147467259 (mono-io-layer-error (-2147467259))
2014-08-31 21:00:05:  Exception.Source: System
2014-08-31 21:00:05:  Exception.TargetSite: Boolean Start_noshell(System.Diagnostics.ProcessStartInfo, System.Diagnostics.Process)
2014-08-31 21:00:05:  Exception.Data: ( )
2014-08-31 21:00:05:    Exception.StackTrace:
2014-08-31 21:00:05:    at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in <filename unknown>:0
2014-08-31 21:00:05:    at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo startInfo, System.Diagnostics.Process process) [0x00000] in <filename unknown>:0
2014-08-31 21:00:05:    at System.Diagnostics.Process.Start () [0x00000] in <filename unknown>:0
2014-08-31 21:00:05:    at (wrapper remoting-invoke-with-check) System.Diagnostics.Process:Start ()
2014-08-31 21:00:05:    at FranticX.Processes.Process2.SpawnProcess (System.Diagnostics.ProcessStartInfo startInfo) [0x00000] in <filename unknown>:0

It had been unable to restart pulse for ~2 hours before anyone noticed.

Once the launcher was restarted, pulse started running as well.

nrusch · September 2, 2014, 4:10pm

According to a quick Google search, that Win32 error code supposedly maps to ERROR_OUTOFMEMORY (“Not enough storage is available to complete this operation”).

rrussell · September 2, 2014, 4:35pm

What does the launcher memory look like when this starts to happen? What if the cron job killed and restarted both Pulse and Launcher?

LaszloSebo · September 2, 2014, 4:50pm

The cronjob is attempting a ‘graceful’ restart:

/opt/mono-2.10.9/bin/mono /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -RemoteControl deadline02 RestartPulse

As opposed to process killing.

Maybe there is a memory leak in the launcher? Try issuing that command to the launcher a couple hundred times and see what happens. Since this happened a couple of days ago, I can't check the RAM usage from that time anymore. But I have a feeling it will happen again in the next 1-2 days, so if I get another midnight call, I'll try to remember to take a RAM snapshot.
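The "couple hundred times" test suggested above could be sketched roughly as follows. This is only a sketch under assumptions: the command line is copied from the cron job in this post, while the function names `launcher_mem` and `stress_restart`, the round count, and the delay are made up for illustration and are not part of Deadline.

```shell
#!/bin/sh
# Sketch: send RestartPulse repeatedly and record the launcher's memory each round.
# Paths and host are copied from the cron job above; override via environment.
MONO="${MONO:-/opt/mono-2.10.9/bin/mono}"
DLCMD="${DLCMD:-/opt/Thinkbox/Deadline6/bin/deadlinecommand.exe}"
HOST="${HOST:-deadline02}"

launcher_mem() {
    # Print "pid vsz rss" (sizes in KiB) for every process whose command line
    # mentions deadlinelauncher.exe. The [.] keeps awk from matching itself.
    ps -e -o pid= -o vsz= -o rss= -o args= 2>/dev/null |
        awk '/deadlinelauncher[.]exe/ { print $1, $2, $3 }'
}

stress_restart() {
    # $1 = number of rounds, $2 = seconds to wait between rounds
    count="$1"
    delay="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        "$MONO" "$DLCMD" -RemoteControl "$HOST" RestartPulse >/dev/null 2>&1
        echo "round=$i launcher(pid vsz rss)=$(launcher_mem)"
        sleep "$delay"
        i=$((i + 1))
    done
}
```

Calling, say, `stress_restart 200 5` on the Pulse machine would give one line per round. A steady climb in the vsz or rss figures would point at the launcher rather than at Pulse itself.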

Current ram usage (after 2 days of running):
root 9262 0.3 0.1 4479988 54776 pts/2 Sl Aug31 6:15 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -nogui

That's 4+ gigs of virtual memory (the VSZ column) for the launcher!!

If i restart the launcher:

root 13739 1.6 0.1 645280 36336 pts/2 Tl 09:47 0:00 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -nogui

Still high, but only 600+megs
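A note on reading the two lines above, assuming they are `ps aux` output (columns USER, PID, %CPU, %MEM, VSZ, RSS, ...): the fifth column is the virtual size and the sixth is the resident set size, both in KiB. The small helper below, whose name `ps_mem` is made up for illustration, converts them to MiB so the two snapshots are easier to compare.

```shell
#!/bin/sh
# Assumption: input lines use the `ps aux` layout
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
ps_mem() {
    # $5 = VSZ in KiB, $6 = RSS in KiB; print both rounded to whole MiB
    awk '{ printf "pid=%s vsz=%.0fMiB rss=%.0fMiB\n", $2, $5 / 1024, $6 / 1024 }'
}

ps_mem <<'EOF'
root 9262 0.3 0.1 4479988 54776 pts/2 Sl Aug31 6:15 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -nogui
root 13739 1.6 0.1 645280 36336 pts/2 Tl 09:47 0:00 /opt/mono-2.10.9/bin/mono-sgen --gc=sgen --runtime=v4.0 /opt/Thinkbox/Deadline6/bin/deadlinelauncher.exe -nogui
EOF
# prints:
# pid=9262 vsz=4375MiB rss=53MiB
# pid=13739 vsz=630MiB rss=35MiB
```

Read this way, the figure that grew from roughly 630 MiB to roughly 4.3 GiB is the virtual size, while resident memory went from about 35 MiB to about 53 MiB.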

rrussell · September 2, 2014, 5:00pm

Thanks for the info! Definitely looks like a leak, and we’ve logged this as a bug. I’m guessing that it’s either a remote control issue, or that the launcher isn’t cleaning up resources it has for previous pulse processes.

LaszloSebo · September 2, 2014, 5:39pm

Minor sidenote (maybe another bug)

Now that I've restarted only the launcher, it fails to receive remote restart commands sent from the same machine…

On the command line I get:

[root@deadline02 ~]# /opt/mono-2.10.9/bin/mono /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -RemoteControl deadline02 RestartPulse
Sent remote command ‘RestartPulse’ to: [deadline02].

But in the logs, nothing shows up (and pulse indeed is not restarted).

2014-09-02 09:47:46:  BEGIN - deadline02.scanlinevfxla.com\root
2014-09-02 09:47:46:  Deadline Launcher 6.2 [v6.2.1.33 R  (1e480b6c5)]
2014-09-02 09:47:46:  Local version file: /opt/Thinkbox/Deadline6/bin/Version
2014-09-02 09:47:46:  Network version file: /mnt/isila/deadline/repository6/bin/Linux/Version
2014-09-02 09:47:46:  Comparing version files...
2014-09-02 09:47:46:  Version files match
2014-09-02 09:47:46:  Launcher Thread - Launcher thread initializing...
2014-09-02 09:47:46:  Updating Repository options
2014-09-02 09:47:46:    - Remote Administration: enabled
2014-09-02 09:47:46:    - Automatic Updates: enabled
2014-09-02 09:47:46:  Launcher Thread - Remote administration is enabled
2014-09-02 09:47:46:  Launcher Thread - Launcher thread listening on port 17060
2014-09-02 09:47:51:  Updating Repository options
2014-09-02 09:47:51:    - Remote Administration: enabled
2014-09-02 09:47:51:    - Automatic Updates: enabled

The restart command was issued on deadline02, the same machine running pulse etc.

LaszloSebo · September 2, 2014, 5:41pm

I don't think the command was in fact sent. If I trigger the same operation from another machine via Deadline Monitor, it actually ‘times out waiting for reply’.

rrussell · September 2, 2014, 5:53pm

Maybe the launcher’s listening port hasn’t cleaned itself up…

Is the cron job running on the same machine as pulse? If so, what if you had it do the following:

deadlinepulse -s
deadlinepulse -nogui

The first command shouldn’t exit until the existing pulse has shut down. This takes the launcher and the remote control out of the equation.
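Wrapped up for cron, the two commands above might look like the sketch below. The binary path is taken from the launcher log earlier in the thread. The function name `restart_pulse`, the warning message, and the use of `nohup` to detach the new Pulse from the cron shell are assumptions for illustration, not something Deadline prescribes.

```shell
#!/bin/sh
# Sketch: restart Pulse directly, bypassing the launcher and remote control.
# Path taken from the launcher log above; override via environment if needed.
PULSE="${PULSE:-/opt/Thinkbox/Deadline6/bin/deadlinepulse}"

restart_pulse() {
    # -s asks the running Pulse to shut down; as described above, it should
    # not return until the existing Pulse has exited.
    "$PULSE" -s || echo "warning: '$PULSE -s' exited non-zero" >&2
    # Start a fresh Pulse in the background so the cron job can finish.
    nohup "$PULSE" -nogui >/dev/null 2>&1 &
}
```

To use it, add a final `restart_pulse` line to the script and point the existing 30-minute cron entry at it instead of the `-RemoteControl` command.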

LaszloSebo · September 2, 2014, 6:07pm

You mean instead of going through the launcher, stop/restart pulse directly? Good idea… I didn't know about the -s flag.

rrussell · September 2, 2014, 6:18pm

Yup!

Note that it’s not meant to ignore the problems you reported, just a more direct way of doing this.

LaszloSebo · September 2, 2014, 6:57pm

Hehe for sure, thanks for the tip!