I’m seeing the slave process quit, seemingly at random, in both GUI and nogui modes. The last line of output in Terminal.app is:
Illegal instruction: 4
Meanwhile, the rendering process (lwsn) continues untracked.
The logs show nothing unusual, just the usual LWSN readout of the form:
0: STDOUT: Rendering frame 234, segment 1/1, pass 1/1.
The last line in Console.app, seemingly just before the crash, is:
07/12/2011 21:05:34.863 [0x0-0x2e02e0].net.deadline.DeadlineMonitor: Attempting to contact Deadline Pulse (Ermintrude.local)…
I checked, and Pulse still appears to be running. Earlier attempts to connect to Pulse from the nodes show no issues, so I wonder whether there is a subtle crash if Pulse fails to respond in a timely manner…
Strange. Googling this turns up nothing useful, and that error doesn’t give us much to go off of. Does it only happen in the middle of a LW render? Are they “heavy” renders?
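One small clue: the trailing 4 in that message is the signal number, and signal 4 on macOS is SIGILL, so the shell is reporting that the slave was killed by an illegal-instruction signal rather than exiting on its own. As a rough sketch only (Python is my choice here, and `describe_exit` and `run_and_report` are made-up helper names, not anything that ships with Deadline), a small wrapper along these lines could launch the slave and record how it died:

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, subprocess reports death-by-signal as the negated signal number.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by {name} (signal {signum})"
    return f"exited with status {returncode}"


def run_and_report(command):
    """Run the command, wait for it to finish, and describe how it ended."""
    completed = subprocess.run(command)
    return describe_exit(completed.returncode)


if __name__ == "__main__":
    # Hypothetical usage: everything after the script name is the command to run.
    if len(sys.argv) > 1:
        print(run_and_report(sys.argv[1:]))
```

The slave's launch command isn't shown in this thread, so pass whatever you normally run in Terminal as the arguments. Redirecting the wrapper's output to a file would at least give a timestamped record to line up against the Console.app entries.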
They are heavy renders. Each node is looking for around 20 GB of RAM, but none of them have anything like that much. That said, disk thrashing appears to be minimal, so it seems that a lot of the dataset is being generated, but not utilised (one of the 3rd party components generates data that is not evaluated during render, but it would take a rewrite to improve the efficiency there). The systems seem stable and LWSN doesn’t fail at all, just the DL slave process.
Render times vary, but average something like 1-2 hours per frame.
I’m guessing that the system is hitting its memory limit, and then when the Slave goes to allocate some new memory, it explodes…
Except that I can quite happily fire up any other application on the system without suffering the same ill effects, and the failure of the slave doesn’t seem to be correlated with any event that I can isolate. The render process continues happily. Indeed, I can also have Monitor up and running, and it doesn’t go down when the slave process fails.
(Most of what is paged out seems to be unused - there’s little evidence of disk thrashing on any of the nodes or the workstations despite the heavy page file.)