extremely unstable beta10 pulse

LaszloSebo · November 11, 2013, 2:32am

Please see this thread:

Pulse crashes every 5-10 minutes, and it seems to bring the whole farm to a halt.

LaszloSebo · November 11, 2013, 3:59am

I think it’s the housecleaning thread not being separate any more.

rrussell · November 11, 2013, 12:52pm

Did you configure the Pending Job Cleanup to be in a separate process? That does the dependency checking, which is why the housecleaning was originally moved into a separate process. You can find this option in the Repository Options on the Job Settings page under the Cleanup tab.

Cheers,

Ryan

LaszloSebo · November 11, 2013, 6:43pm

Yes, that was one of the first things I did. On a related note, it might be better to default both that option and the housecleaning one (in beta9) to run in a separate process, as things get very unstable when that’s off.

So yes, even with pending tasks being handled on separate threads, Pulse would crash very often. In fact, it would crash as often as beta9 did with housecleaning NOT in a separate process. So maybe the pending task handling was not the sole reason for the instability. Could there be something else during housecleaning that causes it to crash?

LaszloSebo · November 11, 2013, 6:51pm

I’ve attached two Pulse logs, both from beta10. You will see it’s executing the pending job scan in a separate thread:

2013-11-10 18:12:00:  Pending Job Scan Thread - Performing pending job scan
2013-11-10 18:12:00:  Update timeout has been set to 300 seconds
2013-11-10 18:12:00:  Stdout Handling Enabled: False
2013-11-10 18:12:00:  Popup Handling Enabled: False
2013-11-10 18:12:00:  Using Process Tree: True
2013-11-10 18:12:00:  Hiding DOS Window: True
2013-11-10 18:12:00:  Creating New Console: False
2013-11-10 18:12:00:  Executable: "/opt/Thinkbox/Deadline6/bin/deadlinecommand"
2013-11-10 18:12:00:  Argument: -DoPendingJobScan True
2013-11-10 18:12:00:  Startup Directory: "/opt/Thinkbox/Deadline6/bin"
2013-11-10 18:12:00:  Clean Up Thread - Performing house cleaning
2013-11-10 18:12:00:  Process Priority: BelowNormal
2013-11-10 18:12:00:  Process Affinity: default
2013-11-10 18:12:00:  Process is now running

deadlinepulse-deadline-2013-11-10-0005.log (882 KB)
deadlinepulse-deadline-2013-11-10-0006.log (1 MB)
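
For anyone digging through logs like these, here is a minimal Python sketch that pulls out the lines where a named thread starts an action, so the timing of the pending job scan and house cleaning can be compared. It assumes every line follows the "timestamp:  message" layout of the excerpt above; the function names are made up for illustration and are not part of Deadline.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt: "2013-11-10 18:12:00:  <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):\s+(.*)$")


def parse_pulse_log(text):
    """Yield (timestamp, message) for each line that matches the assumed layout."""
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            yield datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"), m.group(2)


def thread_events(text):
    """Return (timestamp, thread, action) for messages like '<name> Thread - <action>'."""
    events = []
    for ts, msg in parse_pulse_log(text):
        if " Thread - " in msg:
            thread, action = msg.split(" - ", 1)
            events.append((ts, thread, action))
    return events


# Three lines copied from the excerpt above.
sample = """2013-11-10 18:12:00:  Pending Job Scan Thread - Performing pending job scan
2013-11-10 18:12:00:  Executable: "/opt/Thinkbox/Deadline6/bin/deadlinecommand"
2013-11-10 18:12:00:  Clean Up Thread - Performing house cleaning
"""

for ts, thread, action in thread_events(sample):
    print(ts, thread, "->", action)
```

Running the same function over a full log file (read with `open(path).read()`) would show how often each thread fires before a crash.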

rrussell · November 12, 2013, 3:09pm

Hey Laszlo,

For beta 11, we’re going to allow housecleaning to be run in a separate process again. Hopefully that will improve stability. My guess is that it’s memory related, because the stack traces here (viewtopic.php?f=156&t=10667#p46277) suggest that the application ran out of memory. I’ve read that this could be due to the Boehm garbage collector, and that the SGen one can improve on this.

On your Pulse machine, can you run “mono -V” and post the results?

Thanks!

Ryan

LaszloSebo · November 12, 2013, 4:57pm

[root@deadline ~]# mono -V
Mono JIT compiler version 2.10.9 (tarball Fri Apr 12 10:29:12 PDT 2013)
Copyright © 2002-2011 Novell, Inc, Xamarin, Inc and Contributors. www.mono-project.com
TLS: __thread
SIGSEGV: altstack
Notifications: epoll
Architecture: amd64
Disabled: none
Misc: softdebug
LLVM: supported, not enabled.
GC: Included Boehm (with typed GC and Parallel Mark)
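
The "GC:" line is the part that matters here. As a rough sketch, assuming the output keeps the layout shown above, the check could be scripted like this in Python, which is handy when comparing several machines. The function name is hypothetical, and the text would normally come from running `mono -V` rather than from a pasted string.

```python
import re


def detect_mono_gc(version_output):
    """Return 'sgen', 'boehm', or 'unknown' based on the 'GC:' line of `mono -V` output."""
    for line in version_output.splitlines():
        match = re.match(r"\s*GC:\s*(.*)", line)
        if match:
            description = match.group(1).lower()
            if "sgen" in description:
                return "sgen"
            if "boehm" in description:
                return "boehm"
            return "unknown"
    # No 'GC:' line found at all.
    return "unknown"


# Lines copied from the output above.
sample = """Mono JIT compiler version 2.10.9 (tarball Fri Apr 12 10:29:12 PDT 2013)
Architecture: amd64
GC: Included Boehm (with typed GC and Parallel Mark)
"""

print(detect_mono_gc(sample))
```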

rrussell · November 12, 2013, 5:25pm

Thanks! Yeah, it is using the Boehm garbage collector. One thing that could help is to rebuild mono using these options to configure it first:

./configure --with-large-heap=yes --with-sgen=yes

Then run ‘make’ and ‘make install’ again. You probably don’t want to be doing this in the middle of production, but maybe it’s worth trying if you hit a slower period. It might turn out to be unnecessary once the housecleaning gets moved back to a separate process.

LaszloSebo · November 12, 2013, 5:30pm

We are in the process of setting up 1-2 redundant Deadline servers just in case, so we might try rebuilding mono on one of those.

But I’m not keeping my fingers crossed, to be honest; things are too busy :-\