AWS Thinkbox Discussion Forums

extremely unstable beta10 pulse

Please see this thread:

viewtopic.php?f=156&t=10667

Pulse crashes every 5-10 minutes, and it seems to bring the whole farm to a halt.

I think it’s the housecleaning thread not being separate any more.

Did you configure the Pending Job Cleanup to be in a separate process? That does the dependency checking, which is why the housecleaning was originally moved into a separate process. You can find this option in the Repository Options on the Job Settings page under the Cleanup tab.

Cheers,

  • Ryan

Yes, that was one of the first things I did. On a related note, it might be better to default both that option and the housecleaning option (in beta9) to run in a separate process, as things get very unstable when that’s off.

So yes, even with pending tasks being on separate threads, pulse would crash very often. Actually, it would crash as often as beta9 did with housecleaning NOT in a separate process. So maybe the pending task handling was not the sole reason for the instabilities; there might be something else during housecleaning that could cause it to crash?

I’ve attached 2 pulse logs, both from beta10. You will see it’s executing the pending stuff in a separate thread:

2013-11-10 18:12:00:  Pending Job Scan Thread - Performing pending job scan
2013-11-10 18:12:00:  Update timeout has been set to 300 seconds
2013-11-10 18:12:00:  Stdout Handling Enabled: False
2013-11-10 18:12:00:  Popup Handling Enabled: False
2013-11-10 18:12:00:  Using Process Tree: True
2013-11-10 18:12:00:  Hiding DOS Window: True
2013-11-10 18:12:00:  Creating New Console: False
2013-11-10 18:12:00:  Executable: "/opt/Thinkbox/Deadline6/bin/deadlinecommand"
2013-11-10 18:12:00:  Argument: -DoPendingJobScan True
2013-11-10 18:12:00:  Startup Directory: "/opt/Thinkbox/Deadline6/bin"
2013-11-10 18:12:00:  Clean Up Thread - Performing house cleaning
2013-11-10 18:12:00:  Process Priority: BelowNormal
2013-11-10 18:12:00:  Process Affinity: default
2013-11-10 18:12:00:  Process is now running
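With logs this size, it can help to pull out just the thread activity rather than read them by eye. The following is a minimal sketch, assuming every line follows the `timestamp:  message` shape seen in the excerpt above; the function names are made up for illustration and are not part of Deadline.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt, e.g.
# "2013-11-10 18:12:00:  Clean Up Thread - Performing house cleaning"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):\s+(.*)$")


def parse_pulse_line(line):
    """Return (timestamp, message), or None if the line has another shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return ts, m.group(2)


def thread_events(lines):
    """Collect (timestamp, thread name, action) for '<X> Thread - <action>' messages."""
    events = []
    for line in lines:
        parsed = parse_pulse_line(line)
        if parsed is None:
            continue
        ts, msg = parsed
        name, sep, action = msg.partition(" - ")
        if sep and name.endswith("Thread"):
            events.append((ts, name, action))
    return events


# Sample lines copied from the log excerpt in this post.
sample = [
    "2013-11-10 18:12:00:  Pending Job Scan Thread - Performing pending job scan",
    '2013-11-10 18:12:00:  Executable: "/opt/Thinkbox/Deadline6/bin/deadlinecommand"',
    "2013-11-10 18:12:00:  Argument: -DoPendingJobScan True",
    "2013-11-10 18:12:00:  Clean Up Thread - Performing house cleaning",
    "2013-11-10 18:12:00:  Process is now running",
]

for ts, name, action in thread_events(sample):
    print(ts.isoformat(sep=" "), "|", name, "|", action)
```

Run against a full log file (for example `thread_events(open(path))`), this gives a quick timeline of when each scan or cleanup pass started, which could then be lined up against the crash times.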

deadlinepulse-deadline-2013-11-10-0005.log (882 KB)
deadlinepulse-deadline-2013-11-10-0006.log (1 MB)

Hey Laszlo,

For beta 11, we’re going to allow housecleaning to be run in a separate process again. Hopefully that will improve stability. My guess is that it’s memory related: the stack traces here (viewtopic.php?f=156&t=10667#p46277) suggest that the application ran out of memory. I’ve read that this could be due to the Boehm garbage collector, and that the SGen one can improve on this.

On your Pulse machine, can you run “mono -V” and post the results?

Thanks!

  • Ryan

[root@deadline ~]# mono -V
Mono JIT compiler version 2.10.9 (tarball Fri Apr 12 10:29:12 PDT 2013)
Copyright © 2002-2011 Novell, Inc, Xamarin, Inc and Contributors. www.mono-project.com
TLS: __thread
SIGSEGV: altstack
Notifications: epoll
Architecture: amd64
Disabled: none
Misc: softdebug
LLVM: supported, not enabled.
GC: Included Boehm (with typed GC and Parallel Mark)
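The `GC:` line is the relevant part of this output. If the same check has to be repeated on several machines, something like the sketch below could do it. It assumes the output keeps a `GC:` line in the shape shown above (other Mono versions may word it differently), and the helper names are made up for illustration.

```python
import re


def mono_gc(version_output):
    """Return the text after 'GC:' in `mono -V` output, or None if there is no such line."""
    for line in version_output.splitlines():
        m = re.match(r"\s*GC:\s*(.+)", line)
        if m:
            return m.group(1).strip()
    return None


def uses_boehm(version_output):
    """True if the reported garbage collector mentions Boehm."""
    gc = mono_gc(version_output)
    return gc is not None and "boehm" in gc.lower()


# Sample text copied from the `mono -V` output in this post.
sample_output = """Mono JIT compiler version 2.10.9 (tarball Fri Apr 12 10:29:12 PDT 2013)
TLS: __thread
SIGSEGV: altstack
Notifications: epoll
Architecture: amd64
Disabled: none
Misc: softdebug
LLVM: supported, not enabled.
GC: Included Boehm (with typed GC and Parallel Mark)
"""

print(mono_gc(sample_output))
print(uses_boehm(sample_output))
```

On a live machine the text could be captured with `subprocess.run(["mono", "-V"], capture_output=True, text=True).stdout` and passed to `uses_boehm`.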

Thanks! Yeah, it is using the Boehm garbage collector. One thing that could help is to rebuild mono using these options to configure it first:

./configure --with-large-heap=yes --with-sgen=yes 

Then ‘make’ and ‘make install’ again. You probably don’t want to be doing this in the middle of production though, but maybe it’s something worth trying if you hit a slower period. It might be unnecessary though when the housecleaning gets moved back to a separate process.

We are in the process of setting up 1-2 redundant deadline servers just in case, so we might try rebuilding mono on one of those.

But I’m not keeping my fingers crossed to be honest, things are too busy :-\
