So I’ve seen a couple of stalled slave reports over the last few days that had me puzzled, but after some digging around, it seems they’re the result of slaves choking on a large amount of output from a child process.
We’ve got some renders that can get unstable. If they crash, Arnold dumps out a crash report (including stack and heap allocations and memory-mapped objects), and this can get… large. I’ve seen them run to tens of millions of lines and multiple GBs.
Now, the utility of said crash reports notwithstanding, I’ve got a few questions.
Why would an overly large amount of output from the render process take down the entire slave?
When that much stdout is being processed, our log file writer gets backed up, causing memory usage to increase until it destabilizes everything. There are a couple of things we could do to help mitigate the problem:
If we set a buffer limit for the log writer, we can have it throw away lines of output if the buffer is currently full. This ensures that the log writer will never get backed up and hold on to memory.
If we set a file size limit for task logs, we can avoid memory issues when bzipping the log before copying it to the repo (and also avoid memory issues when the Monitor tries to read in a very large log file).
I think both of these limits could be configurable.
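To make the first idea concrete, here’s a rough sketch of what a bounded log writer could look like, in Python (the class name and the limit are made up for illustration, not how Deadline actually implements it):

```python
import queue
import threading

class BoundedLogWriter:
    """Sketch: drop the newest lines once the buffer fills, so the
    writer thread can never back up and hold unbounded memory."""

    def __init__(self, log_path, max_buffered_lines=100000):
        self._queue = queue.Queue(maxsize=max_buffered_lines)
        self.dropped = 0
        writer = threading.Thread(target=self._drain, args=(log_path,), daemon=True)
        writer.start()

    def write_line(self, line):
        try:
            self._queue.put_nowait(line)
        except queue.Full:
            # Buffer is full: throw the newest line away rather than back up.
            self.dropped += 1

    def _drain(self, log_path):
        with open(log_path, "a") as f:
            while True:
                f.write(self._queue.get() + "\n")
```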
So if I understand you correctly, the log writer is trying to buffer the entire file into memory before writing it? Or are you just saying it’s getting backed up while trying to apply the job’s stdout handlers to the process’ output (before writing the output to the log file)? If it’s the former, it seems like the slave should just be writing the output directly to its (local) log directory.
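For reference, the direct approach I had in mind is just something like this (a minimal sketch, not Deadline’s actual code; the function and its arguments are hypothetical):

```python
import subprocess

def run_and_log(cmd, log_path):
    # Stream the child's stdout to a local log file one line at a time,
    # so memory stays flat no matter how much output the process produces.
    with open(log_path, "w") as log, subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    ) as proc:
        for line in proc.stdout:
            # Any stdout handlers would run here, then write straight through.
            log.write(line)
    return proc.returncode
```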
As far as solutions go, I don’t really like the idea of throwing away log lines, but I guess it might be necessary. I’m still not sure when they would get thrown away though… would this be before parsing or before writing to the temp log file? And if my job process generates 20 million lines of output, which lines are going to get thrown away? (I’m hoping the answer is “the newest ones”).
I would be surprised if bzip had memory issues… it should only require memory proportional to the block size, plus a small amount of overhead. Have you confirmed that that’s actually a significant source of memory consumption?
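For what it’s worth, bzip2’s working memory scales with the block size (set by the compression level), not with the input size, so even a multi-GB log should compress in a handful of MB if it’s streamed. A quick sketch with Python’s bz2 module (the function name is made up):

```python
import bz2
import shutil

def compress_log(src_path, dst_path, level=9):
    # Stream the log through the compressor in 1 MB chunks; memory use is
    # bounded by the chunk size plus bzip2's block-size-dependent state.
    with open(src_path, "rb") as src, bz2.open(dst_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
```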
In the test I did, there weren’t any stdout handlers (I was using the CommandLine plugin). I’m sure stdout handlers would have an impact as well.
The backup is occurring when the stdout lines are flushed to the task log file. There are some additional things we could try here that I’ll look into. If we did throw lines away though, it would be AFTER parsing, and the newer lines would get thrown away.
I didn’t look too much at how we zip up the log before copying it to the repository. It just seemed like there was a spike in memory during the operation, but it could be something else. I just wanted to throw the idea out there for discussion. Note though that it would still impact the Monitor’s memory if you were to try to view a 2 GB log file, so a cap might still be a good idea.
Are you using line-buffering for the task log file? I wonder if there’s a sweet spot where the fsync calls add a negligible amount of overhead, without the buffer being too large or too small…
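Something like this is what I’m imagining, with both knobs exposed so the sweet spot could be found by experiment (illustrative only; the defaults are guesses, not measured values):

```python
import os

def write_log_lines(lines, log_path, buffer_bytes=64 * 1024, sync_every=1000):
    # Use an explicit write buffer, and fsync every N lines instead of on
    # every write, trading a little durability for much less syscall overhead.
    with open(log_path, "w", buffering=buffer_bytes) as f:
        for count, line in enumerate(lines, 1):
            f.write(line + "\n")
            if count % sync_every == 0:
                f.flush()
                os.fsync(f.fileno())
```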
OK, in terms of throwing lines away, I like the sound of that best.
Just throwing this out there, but I kind of like the idea of adding a preference to cap the amount of data the Monitor would actually pull, and then adding an option in the report view to show the whole thing (memory be damned!).
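The capped pull could be as simple as seeking to the tail of the file (a rough sketch; the 5 MB cap is an arbitrary placeholder for whatever the preference would be):

```python
def read_log_tail(log_path, cap_bytes=5 * 1024 * 1024):
    # Read at most the last cap_bytes of the log; the "show the whole
    # thing" option would just bypass the cap and read the full file.
    with open(log_path, "rb") as f:
        f.seek(0, 2)                      # jump to the end to get the size
        size = f.tell()
        f.seek(max(0, size - cap_bytes))  # back up at most cap_bytes
        return f.read().decode("utf-8", errors="replace")
```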
Just an update. We’ve made significant improvements to the log handling for Deadline 7, so crazy amounts of output like this shouldn’t bring the slave down any more. We were also able to avoid implementing a buffer limit on our log writer. We tested the changes with 10,000,000 lines of stdout and it had no problem handling it (although it did take a few minutes to save the 1+ GB log file to the repository).
Hey Ryan, somewhat related to this, we ran into issues with nuke renders where an excessive amount of data was printed into a single line (despite the 5000-character limit we have set on this).
I don’t think it was a Deadline issue, however. It seems like nuke itself has trouble writing too much data in a single line. Basically, if you had 10k, 50k, or 100k characters in one line, nuke’s internal Python variable space would explode. Nuke finishes rendering just fine, but then when it tries to shut down, it just hangs there doing garbage collection cycles for 10+ minutes.
This happened via command line renders outside of deadline too, but if you see reports like this, might be good information.