
Extremely slow dependency handling

I found something on these coredump errors: apparently setting

ProcessUnpackaged = yes

in

/etc/abrt/abrt-action-save-package-data.conf

will make abrt generate the dumps. So I'll keep an eye out, and if I get one, I'll send it over.
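
For reference, the change on EL6 looks roughly like this (the default in that file is "no"; I'm assuming the stock abrtd service here, and the dumps should land under /var/spool/abrt/ in the ccpp-* directories):

# /etc/abrt/abrt-action-save-package-data.conf
ProcessUnpackaged = yes

# apply the change
service abrtd restart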

And on cue, I have a core dump!
I'll send it via Dropbox.

In the meantime (it's zipping… it's 3-4 GB), here's a stack trace:

Program terminated with signal 6, Aborted.
#0  0x000000338ca328a5 in raise () from /lib64/libc.so.6
Mono support loaded.
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.80.el6_3.6.x86_64 libgcc-4.4.6-4.el6.x86_64
(gdb) bt
#0  0x000000338ca328a5 in raise () from /lib64/libc.so.6
#1  0x000000338ca34085 in abort () from /lib64/libc.so.6
#2  0x000000000048cce6 in mono_handle_native_sigsegv (signal=<value optimized out>, ctx=<value optimized out>) at mini-exceptions.c:2223
#3  <signal handler called>
#4  0x000000338ca328a5 in raise () from /lib64/libc.so.6
#5  0x000000338ca34085 in abort () from /lib64/libc.so.6
#6  0x00000000005def1b in GC_abort (msg=0x6a8465 "mmap(...PROT_NONE...) failed") at misc.c:1099
#7  0x00000000005e034d in GC_unmap (start=0x7ff15eee5000 "Ð\234", bytes=20480) at os_dep.c:2037
#8  0x00000000005f10f7 in GC_unmap_old () at allchblk.c:383
#9  0x00000000005d50c4 in GC_finish_collection () at alloc.c:776
#10 0x00000000005d473d in GC_try_to_collect_inner (stop_func=0x5d40c7 <GC_never_stop_func>) at alloc.c:393
#11 0x00000000005d57a0 in GC_collect_or_expand (needed_blocks=9, ignore_off_page=0) at alloc.c:1045
#12 0x00000000005d8dd5 in GC_alloc_large (lw=4100, k=0, flags=0) at malloc.c:60
#13 0x00000000005d919c in GC_generic_malloc (lb=32800, k=0) at malloc.c:204
#14 0x00000000005d9363 in GC_malloc_atomic (lb=32800) at malloc.c:270
#15 0x00000000005e136d in GC_local_malloc_atomic (bytes=32800) at pthread_support.c:380
#16 0x000000000055e242 in mono_object_allocate_ptrfree (vtable=0x7ff1c0109cf8, n=16384) at object.c:4220
#17 mono_array_new_specific (vtable=0x7ff1c0109cf8, n=16384) at object.c:4779

The full stack trace:

(gdb) bt full
#0  0x000000338ca328a5 in raise () from /lib64/libc.so.6
No symbol table info available.
#1  0x000000338ca34085 in abort () from /lib64/libc.so.6
No symbol table info available.
#2  0x000000000048cce6 in mono_handle_native_sigsegv (signal=<value optimized out>, ctx=<value optimized out>) at mini-exceptions.c:2223
        sa = {__sigaction_handler = {sa_handler = 0, sa_sigaction = 0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x7ff1d4dd9d70}
        jit_tls = <value optimized out>
#3  <signal handler called>
No symbol table info available.
#4  0x000000338ca328a5 in raise () from /lib64/libc.so.6
No symbol table info available.
#5  0x000000338ca34085 in abort () from /lib64/libc.so.6
No symbol table info available.
#6  0x00000000005def1b in GC_abort (msg=0x6a8465 "mmap(...PROT_NONE...) failed") at misc.c:1099
No locals.
#7  0x00000000005e034d in GC_unmap (start=0x7ff15eee5000 "Ð\234", bytes=20480) at os_dep.c:2037
        result = 0xffffffffffffffff
        start_addr = 0x7ff15eee5000 "Ð\234"
        end_addr = 0x7ff15eeea000 "Ð\234"
        len = 20480
#8  0x00000000005f10f7 in GC_unmap_old () at allchblk.c:383
        h = 0x7ff15eee5000
        hhdr = 0x7ff1862fa870
        sz = 20480
        last_rec = 3040
        threshold = 3041
        i = 5
#9  0x00000000005d50c4 in GC_finish_collection () at alloc.c:776
No locals.
#10 0x00000000005d473d in GC_try_to_collect_inner (stop_func=0x5d40c7 <GC_never_stop_func>) at alloc.c:393
        start_time = 109857679
        current_time = 140676635140912
#11 0x00000000005d57a0 in GC_collect_or_expand (needed_blocks=9, ignore_off_page=0) at alloc.c:1045
No locals.
#12 0x00000000005d8dd5 in GC_alloc_large (lw=4100, k=0, flags=0) at malloc.c:60
        h = 0x0
        n_blocks = 9
        result = 0x7ff1d4dda400 "@\244\335\324\361\177"
#13 0x00000000005d919c in GC_generic_malloc (lb=32800, k=0) at malloc.c:204
        lw = 4100
        n_blocks = 9
        init = 0
        result = 0x7ff1d4dda410 "P\244\335\324\361\177"
#14 0x00000000005d9363 in GC_malloc_atomic (lb=32800) at malloc.c:270
        op = 0x7ff1c0109cf8 "\b˹"
        opp = 0x7ff15c52e0a8
        lw = 140676660391104
#15 0x00000000005e136d in GC_local_malloc_atomic (bytes=32800) at pthread_support.c:380
No locals.
#16 0x000000000055e242 in mono_object_allocate_ptrfree (vtable=0x7ff1c0109cf8, n=16384) at object.c:4220
        o = <value optimized out>
#17 mono_array_new_specific (vtable=0x7ff1c0109cf8, n=16384) at object.c:4779
        o = <value optimized out>
        ao = <value optimized out>
        byte_len = 32800

Memory leak?

I uploaded the full core dump to the Box account you guys gave me, under Logs / ccpp-2014-09-17-09.

It definitely looks like a leak… here's a freshly started Pulse:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27560 root 20 0 1213m 38m 11m S 16.9 0.1 0:10.51 deadlinepulse

When it crashed, virtual RAM usage was around 4-5 GB; I can't recall the resident size…

I've noticed that sometime during the last couple of updates the "--gc=sgen" option disappeared from the launcher startup. I've added it back in.

I wanted to ask: what happens between the "Pending Job Scan - Released pending tasks X" and "CheckPathMapping: Swapped" lines?

2014-09-17 12:12:48: Dependency script returned 4 tasks that can start: /mnt/s2/exchange/software/managed/pythonScripts/site-packages/scl/deadline/scriptDependency.py
2014-09-17 12:12:49: Pending Job Scan - Released pending tasks [11,12,13,14] because the frames they depends on have finished and/or their required assets are available.
2014-09-17 12:12:51: CheckPathMapping: Swapped "//S2/exchange/software/managed/pythonScripts/site-packages/scl/deadline/scriptDependency.py" with "/mnt/s2/exchange/software/managed/pythonScripts/site-packages/scl/deadline/scriptDependency.py"

Just curious; it always seems to add a 2-3 second delay…

I haven't seen any crashes since I reset the garbage collector to use sgen. Could this be made the default in the launchers?
Also, the deadlinecommand processes spawned by deadlinelauncher don't seem to be using /opt/Thinkbox/Deadline6/bin/deadlinecommand, so I can't patch that to use sgen instead of the default.
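
One thing I might try in the meantime is passing the runtime option through the environment in the launcher's init script, assuming the embedded Mono runtime honors it (not sure it does):

export MONO_ENV_OPTIONS=--gc=sgen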

Speed still fluctuates heavily:

2014-09-17 12:50:29: Pending Job Scan - Released 7 pending jobs and 38 pending tasks in 17.125 s
2014-09-17 12:53:13: Pending Job Scan - Released 14 pending jobs and 132 pending tasks in 1.604 m
2014-09-17 12:59:33: Pending Job Scan - Released 9 pending jobs and 252 pending tasks in 4.905 m
2014-09-17 13:02:52: Pending Job Scan - Released 9 pending jobs and 258 pending tasks in 2.214 m
2014-09-17 13:08:27: Pending Job Scan - Released 14 pending jobs and 267 pending tasks in 4.420 m
2014-09-17 13:13:03: Pending Job Scan - Released 11 pending jobs and 394 pending tasks in 3.394 m
2014-09-17 13:21:11: Pending Job Scan - Released 13 pending jobs and 368 pending tasks in 7.023 m
2014-09-17 13:29:42: Pending Job Scan - Released 14 pending jobs and 600 pending tasks in 7.218 m
2014-09-17 13:41:41: Pending Job Scan - Released 16 pending jobs and 1165 pending tasks in 10.857 m
2014-09-17 14:05:27: Pending Job Scan - Released 47 pending jobs and 2385 pending tasks in 22.546 m
2014-09-17 14:29:02: Pending Job Scan - Released 65 pending jobs and 2207 pending tasks in 22.420 m
2014-09-17 14:39:44: Pending Job Scan - Released 87 pending jobs and 2056 pending tasks in 9.497 m
2014-09-17 14:44:05: Pending Job Scan - Released 55 pending jobs and 5523 pending tasks in 3.240 m

It's hard to tell what makes it slow or fast… sometimes it's 1-2 minutes, sometimes 22, and it seems independent of the number of tasks being processed:

2014-09-17 15:46:54: Pending Job Scan - Released 12 pending jobs and 625 pending tasks in 1.909 m
2014-09-17 15:49:45: Pending Job Scan - Released 4 pending jobs and 189 pending tasks in 1.398 m
2014-09-17 15:52:54: Pending Job Scan - Released 16 pending jobs and 180 pending tasks in 2.022 m
2014-09-17 15:57:16: Pending Job Scan - Released 23 pending jobs and 320 pending tasks in 3.129 m
2014-09-17 15:59:41: Pending Job Scan - Released 15 pending jobs and 198 pending tasks in 1.371 m
2014-09-17 16:02:35: Pending Job Scan - Released 16 pending jobs and 155 pending tasks in 1.847 m
2014-09-17 16:04:46: Pending Job Scan - Released 14 pending jobs and 103 pending tasks in 1.122 m
2014-09-17 16:06:40: Pending Job Scan - Released 10 pending jobs and 316 pending tasks in 50.573 s
2014-09-17 16:08:23: Pending Job Scan - Released 10 pending jobs and 87 pending tasks in 39.627 s
2014-09-17 16:10:23: Pending Job Scan - Released 9 pending jobs and 62 pending tasks in 53.049 s
2014-09-17 16:15:32: Pending Job Scan - Released 13 pending jobs and 190 pending tasks in 3.593 m
2014-09-17 16:26:49: Pending Job Scan - Released 20 pending jobs and 2368 pending tasks in 10.226 m
2014-09-17 16:41:29: Pending Job Scan - Released 37 pending jobs and 908 pending tasks in 13.237 m
2014-09-17 17:06:39: Pending Job Scan - Released 53 pending jobs and 3451 pending tasks in 23.977 m
2014-09-17 17:14:59: Pending Job Scan - Released 78 pending jobs and 4553 pending tasks in 7.251 m
2014-09-17 17:21:48: Pending Job Scan - Released 61 pending jobs and 1138 pending tasks in 5.656 m
2014-09-17 17:27:20: Pending Job Scan - Released 44 pending jobs and 711 pending tasks in 4.431 m
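
To make this less of an eyeball comparison, here's a rough Python sketch that pulls the "Pending Job Scan" lines out of a Pulse log and prints a tasks-per-minute rate for each scan, so duration can be compared against task count directly (the regex just matches the format of the lines above; pulse_log.txt is a placeholder path):

import re

# Matches lines like:
# 2014-09-17 16:26:49: Pending Job Scan - Released 20 pending jobs and 2368 pending tasks in 10.226 m
PATTERN = re.compile(
    r"Pending Job Scan - Released (\d+) pending jobs and (\d+) pending tasks "
    r"in ([\d.]+) (s|m|hrs)"
)
UNIT_SECONDS = {"s": 1.0, "m": 60.0, "hrs": 3600.0}  # convert the logged units to seconds

with open("pulse_log.txt") as f:  # placeholder path to a Pulse log
    for line in f:
        m = PATTERN.search(line)
        if not m:
            continue
        jobs, tasks = int(m.group(1)), int(m.group(2))
        seconds = float(m.group(3)) * UNIT_SECONDS[m.group(4)]
        rate = tasks / (seconds / 60.0) if seconds else 0.0
        print("%4d jobs  %5d tasks  %9.1f s  %8.1f tasks/min" % (jobs, tasks, seconds, rate))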

So while Pulse hasn't crashed since tweaking the GC algorithm setting, its RAM usage is growing steadily…:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31790 root 20 0 7220m 2.9g 11m S 15.7 9.3 51:35.15 deadlinepulse

It started with:
27560 root 20 0 1213m 38m 11m S 16.9 0.1 0:10.51 deadlinepulse
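
If it helps, something like this quick sketch could log the growth over time (it assumes the psutil module is installed and just grabs the first process named deadlinepulse):

import time
import psutil

# Find the running deadlinepulse process by name (first match).
pulse = None
for proc in psutil.process_iter(["name"]):
    if proc.info["name"] == "deadlinepulse":
        pulse = proc
        break

# Sample VIRT/RES every 5 minutes while the process is alive, to see whether
# the growth is steady or tied to the big pending job scans.
while pulse is not None and pulse.is_running():
    mem = pulse.memory_info()
    print("%s  VIRT %6.0f MB  RES %6.0f MB" % (
        time.strftime("%Y-%m-%d %H:%M:%S"), mem.vms / 1048576.0, mem.rss / 1048576.0))
    time.sleep(300)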

Hmmm, we'll dig around and see if we can't find out why it's leaking. Pulse honestly isn't directly responsible for much anymore (especially with the Housecleaning/Repo Scans split out into a separate process), so I suspect it might still be the logging…

We’ll try to post a new 6.2 build to the forum soon, which will have the option to split out the Housecleaning into a separate log as well (you’ll have to set this in Repo Options). Hopefully splitting out both will at least reduce the amount of leakage going on.

I don’t think we can set the default GC to be sgen for 6.2, unfortunately, because Mono isn’t necessarily built with it, so it might not be available on the build everyone’s using.

As for what happens between the two steps you asked about (basically between checking jobs), there are really only a couple of things of substance going on, and they're both DB calls (rough sketch after the list):

  • A new history entry is created for the job whose tasks were just released
  • The tasks for the next job are obtained from the DB
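
Very roughly, it's the equivalent of something like this in pymongo terms (the connection string, collection names, and fields here are made up for illustration; they're not the actual schema):

from pymongo import MongoClient

# Illustrative connection; "deadlinedb" is a placeholder database name.
db = MongoClient("mongodb://dbserver:27017")["deadlinedb"]

def after_release(job_id, released_task_ids, next_job_id):
    # 1) Write a history entry for the job whose pending tasks were just released.
    db["JobHistory"].insert_one({            # placeholder collection name
        "JobID": job_id,
        "Type": "PendingTasksReleased",
        "Tasks": released_task_ids,
    })
    # 2) Fetch the task list for the next job in the pending job scan.
    return list(db["JobTasks"].find({"JobID": next_job_id}))  # placeholder collection name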

It's hard to figure out what's going on here, to be honest… the times fluctuate wildly. There's one scan there that took 80 minutes to run, and now it finished in under a minute…:

2014-09-18 08:56:57: Pending Job Scan - Released 2 pending jobs and 147 pending tasks in 7.178 m
2014-09-18 09:06:24: Pending Job Scan - Released 1 pending jobs and 353 pending tasks in 8.172 m
2014-09-18 09:20:51: Pending Job Scan - Released 6 pending jobs and 275 pending tasks in 13.087 m
2014-09-18 11:14:30: Pending Job Scan - Released 102 pending jobs and 6621 pending tasks in 1.214 hrs
2014-09-18 11:35:59: Pending Job Scan - Released 305 pending jobs and 1788 pending tasks in 2.871 m
2014-09-18 11:38:55: Pending Job Scan - Released 87 pending jobs and 93 pending tasks in 1.896 m
2014-09-18 11:43:25: Pending Job Scan - Released 79 pending jobs and 652 pending tasks in 3.311 m
2014-09-18 11:51:46: Pending Job Scan - Released 31 pending jobs and 632 pending tasks in 7.275 m
2014-09-18 11:55:14: Pending Job Scan - Released 54 pending jobs and 835 pending tasks in 2.309 m
2014-09-18 11:57:09: Pending Job Scan - Released 18 pending jobs and 646 pending tasks in 50.640 s

I restarted Pulse around 11:32 (after that 80-minute cycle); it was using close to 7-8 GB of RAM. As you can see, the times normalized right away, but whether that's because of the restart or not is hard to tell with times fluctuating this much.

Could it be related to periods when a lot of jobs are being submitted? Artist submissions (a single publish of one scene) can result in upwards of 80-90 jobs on Deadline.

Hey Laszlo,

Given that we still think this is caused by the DB, it could honestly be anything that's really chatty with the DB; that's why it seems kind of random. I'm not sure it would necessarily be job submission, though, unless your submission process does a lot of querying. It would certainly contribute to the overall load, but it's unlikely to be the sole culprit.

Well, hopefully we can figure out why dependency checking isn't working in 7 right now; then we can do some comparison tests.
