
nuke render hanging - out of memory?

Some of our Nuke jobs have been hanging indefinitely on Deadline 7 - we haven't seen this type of error on Deadline 6. The logs are weird:

2015-02-06 17:37:48: 0: INFO: SceneFileCopyToSlave? True
2015-02-06 17:37:48: 0: INFO: Directory name: "\\inferno2\projects\fast7\scenes\PC_200_2585\2d\tcomps\PC_200_2585_2d_tcomps_main_v0006\linear_2880x2160x1"
2015-02-06 17:37:48: 0: INFO: Preparing to copy Network SceneFile to Slave: "C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline7\slave\LAPRO0626-secondary\jobsData\54d56bf33b3ece1d9cc29356"
2015-02-06 17:49:48: 0: Task timed out - canceling current task...
2015-02-12 09:45:38: Out of memory
2015-02-12 09:45:38: Out of memory
2015-02-12 09:45:38: Out of memory
2015-02-12 09:45:38: Error in read: Out of memory
2015-02-12 09:45:38: Error in writes: Out of memory
2015-02-12 09:59:01: Listener Thread - ::ffff:172.18.8.91 has connected

The slave is up and running; I can open its log and connect to it, but it's basically not doing anything.

The machine has 128 GB of RAM, so it seems unlikely that it actually ran out... but who knows; there is another slave running on that box doing 3D renders.
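For context, every hung task stops right after the "Preparing to copy Network SceneFile to Slave" line, which points at the file copy itself blocking on the file server rather than memory. A rough sketch of that failure mode, assuming the copy runs in-process (the paths and script below are illustrative, not our actual plugin code): a timeout watchdog can notice the problem, but it cannot unstick a thread that is blocked inside a network read.

# Sketch only: why a thread-level timeout cannot unstick a copy that is blocked
# inside an unresponsive SMB/UNC read. Paths below are hypothetical examples.
import shutil
import threading

SRC = r"\\inferno2\projects\fast7\scenes\example.nk"   # hypothetical network path
DST = r"C:\Deadline\jobsData\example.nk"               # hypothetical local path

def copy_scene():
    shutil.copy2(SRC, DST)  # blocks for as long as the file server stalls

worker = threading.Thread(target=copy_scene, daemon=True)
worker.start()
worker.join(timeout=600)  # roughly the 10-minute task timeout seen in the log

if worker.is_alive():
    # We can detect and report the timeout, but the worker thread is still stuck
    # in the OS call; only killing the whole process actually frees it, which is
    # why the slave looks alive yet never recovers.
    print("Copy timed out - worker thread still blocked on the file server")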

Is this random or do the same set of jobs repeatedly get stuck? Also, do they all get stuck after the “Preparing to copy Network SceneFile to Slave…” line? If the same jobs reliably get stuck with Deadline 7, can you try sending the same jobs to Deadline 6 and have them render on the same machines to see if the problem persists?

It doesn't seem job specific... the real question, though, is why doesn't stalled slave detection kick in, or why can't the slave shut itself down?
They are still there, six days later. The tasks haven't been requeued by Pulse either:

[screenshot attachment: Capture.PNG]

The problem is that the slave hasn’t stalled (it’s running fine, and it’s updating its state in the db).
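Roughly speaking, the state updates come from a housekeeping thread that is separate from the one actually running the task, so the heartbeat keeps the slave looking healthy even while the task thread is permanently blocked. A simplified illustration of that pattern (not the slave's actual code, just the general shape of the problem):

# Illustrative only: a heartbeat thread keeps a worker looking healthy even
# while the task thread is permanently blocked. Not Deadline's real code.
import threading
import time

def heartbeat(stop):
    while not stop.is_set():
        # In the real slave this would be a MongoDB state update; here we just print.
        print("slave state: running, last update", time.strftime("%H:%M:%S"))
        time.sleep(5)

def do_task():
    # Stand-in for a copy that never returns because the file server is wedged.
    threading.Event().wait()

stop = threading.Event()
threading.Thread(target=heartbeat, args=(stop,), daemon=True).start()

task = threading.Thread(target=do_task, daemon=True)
task.start()
task.join(timeout=20)  # the task never finishes...

# ...yet stalled-slave detection never fires, because the heartbeat above keeps
# updating the slave's state the whole time.
print("task still alive:", task.is_alive())
stop.set()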

I’m wondering if this is related to this problem:
viewtopic.php?f=86&t=10855&p=54227&hilit=hang#p54227

It’s happening during a copy command that I’m assuming a custom script of yours is doing, and I’m pretty sure the problem in the post I linked to above was also related to a hanging command that’s accessing one of your file servers.

Note that this is something we’re addressing in Deadline 8 as part of the python sandboxing.
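In the meantime, the general idea behind the sandboxing can be approximated in a custom script: run the hang-prone copy in a child process and kill it on a deadline, because a killed process releases a stuck file-server handle in a way a blocked thread never can. A hedged sketch (copy_scene.py and the timeout are placeholders, not a Deadline API):

# Sketch of process-level isolation: run the copy in a child process so a hung
# file-server call can be force-killed. copy_scene.py is a hypothetical helper
# script that does the actual shutil.copy2; nothing here is a Deadline API.
import subprocess

try:
    subprocess.run(
        ["python", "copy_scene.py",
         r"\\inferno2\projects\fast7\scenes\example.nk",   # hypothetical source
         r"C:\Deadline\jobsData\example.nk"],              # hypothetical destination
        timeout=600,   # give up after ten minutes
        check=True,
    )
except subprocess.TimeoutExpired:
    # subprocess.run() kills the child on timeout, so the parent process (the
    # slave, in this analogy) stays responsive instead of being wedged by the
    # blocked I/O.
    print("Scene copy killed after timeout; requeue or fail the task cleanly")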

Cheers,
Ryan

Yeah, it seems like it might be related. I couldn't even kill the Deadline Slave process from Task Manager; I had to reboot the boxes. They were both using 100% of a single core and constantly pinging the MongoDB.
