Pulse (v5.2) crash - timeCheck_0sDZv

MikeOwen · February 4, 2013, 10:15am

Hi,
Although this isn't related to v6, I wanted to report it, as it caused me some excitement over the weekend!
Essentially, to reproduce the situation (which I would advise against!):

Have the Deadline Slave and Launcher (both v5.2) running on a machine, with Pulse also running on the network.
Pull out the machine's network cable for more than 15 minutes.
Plug the network cable back in 30+ minutes later.
Go home, and discover 48 hours later that the repository's /temp directory contains approx. 46,000 files named “timeCheck_” + “5 random characters” + “.ComputerName”, where ComputerName is the machine whose network cable was pulled out…

Example File: “/temp/timeCheck_0sDZv.ComputerName”

So it looks like Pulse tries to check the “stalled” status of this machine, which it thinks is still running the Slave, and does a “timeCheck”. Unfortunately, once the number of files gets this large, Pulse can't cope when it tries to “purge” them, and it crashes after about 30-35 minutes. I'm guessing that is the (possibly random) interval at which Pulse runs its clean-up routine.
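For illustration only, here is a minimal sketch of a clean-up pass that could cope with a backlog like this. It is not Deadline's actual purge code. The filename pattern is inferred from the example above, and the function name `purge_timecheck_files`, the age threshold and the per-pass cap are all hypothetical. The idea is to stream the directory with `os.scandir` instead of building the full listing, and to cap the work done per call so tens of thousands of files are cleared over several short passes.

```python
import os
import re
import time

# Matches names like "timeCheck_0sDZv.ComputerName" (pattern inferred from the post).
TIMECHECK_RE = re.compile(r"^timeCheck_[A-Za-z0-9]{5}\..+$")

def purge_timecheck_files(temp_dir, max_age_seconds=3600, max_per_pass=1000, now=None):
    """Delete up to max_per_pass stale timeCheck files; return how many were removed.

    Hypothetical helper: streams directory entries one at a time and stops
    after max_per_pass deletions, so one call never grinds through the
    entire backlog in a single long operation.
    """
    now = time.time() if now is None else now
    removed = 0
    with os.scandir(temp_dir) as entries:
        for entry in entries:
            if removed >= max_per_pass:
                break
            # Skip directories and anything that isn't a timeCheck file.
            if not entry.is_file() or not TIMECHECK_RE.match(entry.name):
                continue
            try:
                # Only delete files older than the (hypothetical) age threshold.
                if now - entry.stat().st_mtime >= max_age_seconds:
                    os.remove(entry.path)
                    removed += 1
            except FileNotFoundError:
                # Another process already removed it; nothing to do.
                pass
    return removed
```

A caller could drain the backlog in bounded chunks, e.g. `while purge_timecheck_files("/path/to/repository/temp") > 0: pass` (placeholder path), or simply run one pass per clean-up interval.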

Lesson learnt by Mike: in future, always make sure the Slave is shut down before unplugging a network cable for 15+ minutes!

However, could the “timeCheck” code be improved in future? Maybe do a ping before trying to check the time on a Slave?
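A rough sketch of that suggestion, assuming the time check can be wrapped in a callback: probe the machine first and skip the round if it doesn't answer. A true ICMP ping usually needs raw-socket privileges, so this swaps in a plain TCP connect as the reachability test. The names `is_reachable`, `maybe_time_check` and `do_time_check`, and whatever port you pass in, are hypothetical and not part of Deadline's API.

```python
import socket

def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Stands in for a "ping": it only proves something is listening on that port.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False

def maybe_time_check(host, port, do_time_check):
    """Run the (hypothetical) time-check callback only if the host answers."""
    if not is_reachable(host, port):
        # Skip this round rather than leaving another timeCheck file behind.
        return False
    do_time_check(host)
    return True
```

The trade-off is one extra connection attempt per check, in exchange for not piling up work (or files) for a machine that is off the network.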

Maybe this situation is a thing of the past in v6, which would be good. Either way, I wanted to report this situation.

Thanks,
Mike

rrussell · February 4, 2013, 4:30pm

Thanks for reporting this! These time check files are no longer created in v6, so this won’t be an issue going forward.

Cheers,

Ryan

MikeOwen · February 4, 2013, 4:35pm

Hooray! Go v6