AWS Thinkbox Discussion Forums

hung housecleaning 8.1.0.2

Log:

2016-06-14 18:15:13:      Old Limit Scan - Loading machine limits
2016-06-14 18:15:13:      Old Limit Scan - Found 10640 machine limits in 217.118 ms
2016-06-14 18:15:13:      Old Limit Scan - Loading job IDs
2016-06-14 18:15:13:      Old Limit Scan - Loaded 14545 job IDs in 159.915 ms
2016-06-14 18:15:13:      Old Limit Scan - Purged 0 machine limits in 1.384 ms
2016-06-14 18:15:13:      Old Limit Scan - Done.
2016-06-14 18:15:13:  Purging Temporary Repository Files
2016-06-14 18:15:13:      Temporary File Scan - Scanning for 'connectReadWriteTest' files
2016-06-14 18:15:13:      Temporary File Scan - Deleted 0 temporary files in 1.390 ms
2016-06-14 18:15:13:      Temporary File Scan - Done.
2016-06-14 18:15:13:  Purging Old Statistics
2016-06-14 18:15:13:      Old Statistics - Skipping job statistics because the option to purge them is disabled in the Repository Options
2016-06-14 18:15:13:      Old Statistics - Purging slave statistics that are older than Apr 15/16  18:15:13
2016-06-14 18:15:13:      Old Statistics - Purged old slave statistics in 1.373 ms
2016-06-14 18:15:13:      Old Statistics - Purging repository statistics that are older than Apr 15/16  18:15:13
2016-06-14 18:15:13:      Old Statistics - Purged old repository statistics in 437.000 μs
2016-06-14 18:15:13:  Purging Deleted Document Stubs From Database
2016-06-14 18:15:13:      Deleted Document Stubs - Deleting stubs that are older than 3 days
2016-06-14 18:15:13:      Deleted Document Stubs - Deleted 0 stubs in 14.591 ms
2016-06-14 18:15:13:  Triggering House Cleaning Events

Active processes:
root 3033 1.5 0.0 971448 83736 ? Sl 18:14 0:10 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinecommand.exe -DoHouseCleaning False True False

Seems like it's hung up for some reason while trying to quit. Normally that 'Triggering House Cleaning Events' line is the last line of the log, but as you can see, the process has never quit. We had seen similar behavior from the pending job scan as well. This is a regression in 8, and a critical issue for us.

For future reference, all the processes:

[root@deadline01 ~]# ps aux | grep deadline
root       679  0.0  0.0 1247824 95680 ?       Sl   Jun10   5:48 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinesandbox.exe -socket 64062 -repository /mnt/isila/deadline/repository8 -region
root      3033  1.3  0.0 971448 83736 ?        Sl   18:14   0:10 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinecommand.exe -DoHouseCleaning False True False
root      9102  0.0  0.0 1092956 130628 ?      Sl   Jun13   0:30 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinecommand.exe -DoPendingJobScan False True False none
root     13164  0.0  0.0 1240436 88160 ?       Sl   Jun09   6:59 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinesandbox.exe -socket 39197 -repository /mnt/isila/deadline/repository8 -region
root     24334  1.4  0.0 1216632 61092 ?       Sl   18:02   0:22 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinelauncher.exe -pulse -nogui -daemon
root     24413  0.1  0.0 1217288 64476 ?       Sl   18:02   0:02 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinesandbox.exe -socket 15345 -repository /mnt/isila/deadline/repository8 -region
root     24660  0.0  0.0 103256   868 pts/3    S+   18:28   0:00 grep deadline
root     25679  6.1  0.6 3310876 809040 ?      Sl   18:03   1:31 ./mono --runtime=v4.0 --config /opt/Thinkbox/Deadline8/bin/deadline_mono_config /opt/Thinkbox/Deadline8/bin/deadlinepulse.exe -nogui
root     25738  0.1  0.0 1217252 64152 ?       Sl   18:03   0:02 /opt/Thinkbox/Deadline8/mono/bin/mono-sgen /opt/Thinkbox/Deadline8/bin/deadlinesandbox.exe -socket 15259 -repository /mnt/isila/deadline/repository8 -region

In case it helps, an strace reveals:

[root@deadline01 ~]# strace -p 3033
Process 3033 attached - interrupt to quit
futex(0x285b3fc, FUTEX_WAIT_PRIVATE, 19, NULL

Killing the process did not ‘revive’ housecleaning. Pulse never tried to start housecleaning again, even when sent explicit remote commands to do so.

Had to restart pulse completely for it to start doing housecleaning again.
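
A stopgap until the root cause is found could be a cron-run watchdog that flags housecleaning processes running far longer than normal. This is purely a sketch (not Thinkbox tooling): the function name and the 1800-second cutoff are made up, it assumes a `ps` that supports the `etimes` (elapsed seconds) output field, and note that, per the above, killing the stuck process alone didn't revive housecleaning, so any automated response would probably need to escalate to a full pulse restart:

```shell
# Sketch only. Reads `ps -eo pid=,etimes=,args=` output on stdin and
# prints the PID of every deadlinecommand housecleaning process whose
# elapsed running time exceeds max_age seconds.
find_stuck_housecleaning() {
    max_age="$1"
    while read -r pid age args; do
        case "$args" in
            *"deadlinecommand.exe -DoHouseCleaning"*)
                [ "$age" -gt "$max_age" ] && echo "$pid"
                ;;
        esac
    done
}
```

Usage: `ps -eo pid=,etimes=,args= | find_stuck_housecleaning 1800`, then alert or restart pulse on any output.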

Strange… And you guys weren’t getting this hanging problem in 8.0.2, right? That should hopefully narrow this down quite a bit; I don’t think we’ve made many changes to this code in 8.1 versus 8.0.

Does it hang like this for every single house cleaning? Or is it a sporadic issue?

We’ll have a look at this ASAP.

It’s sporadic, yeah. I think it happened in both versions.

Hmmm, that might explain why we haven’t been able to reproduce or find this in code.

Would you be able to get us the full logs of a couple of these that ended up hanging? It would be good to have context on what those specific house cleaning instances were doing (although I imagine they largely end up looking similar to the others, anyway…).

Do you guys have any events that trigger on housecleaning? If so, I wonder if we’re not cleaning up some of our Python loading/calling code properly…

We’ve since made a number of improvements to Event Plugin initialization (and to sandboxes closing properly, etc.), and still haven’t been able to reproduce this internally. I’d be curious to see whether you guys are still experiencing this on 8.0.5, and if so, with what kind of frequency.
