Hi Laszlo,
Edwin has spent a lot of time playing with this error, and we have a proposed fix inside the Lightning plugin which we intend to ship with the first beta of v7.1. Edwin tells me that in his testing the error is reduced by 95% or more. We believe it’s a simple case of the local socket timing out, and the timeout values just need to be slightly increased to compensate. As this change could have far-reaching consequences for many customers, we propose to ship it with the first beta of v7.1, so customers have a good opportunity to give it a good thrashing. Fingers crossed. Edwin can provide all the gory details if needed.
When would the first beta for 7.1 come about? I wonder how badly this would affect our farm, as this was already happening with only a single job and 4 slaves.
Our plan is to get the 7.1 beta started basically ASAP, since we already have a bunch of stuff for it. I think the plan was originally to have it going by the end of January, but we’ll definitely keep you posted!
Also found a couple of errors like this (~1% of errors on d7):
“2015-01-13 23:18:36: 0: An exception occurred: Error in StartJob: RenderTask: Unexpected exception (Exception caught in 3ds max: – Runtime error: Error in GetJobInfoEntry: simple_socket: Timed out waiting for header packet to arrive.”
The socket communication between Deadline and Lightning handles a socket timeout exception as a perfectly passable event (I agree with that plan), but the way the socket was coded leaves things in an inconsistent state if a message was only half-received. It’s not a huge problem most of the time, because the timeout naturally falls after a whole message has been sent. The fix was to increase that timeout to give half-sent messages more time to pass through on extremely heavy renders.
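To illustrate the half-received case, here is a minimal Python sketch. It is not the Deadline or Lightning code, and it is not the fix described above (which only raises the timeout). It assumes a hypothetical wire format with a 4-byte length prefix, and the names `FramedReceiver`, `frame` and `HEADER` are made up for the example. The idea is that a timeout stays a passable event, but any bytes already read are kept in a buffer so the next call resumes the same message instead of starting in the middle of it.

```python
import socket
import struct

# Hypothetical wire format: 4-byte big-endian length, then the payload.
HEADER = struct.Struct("!I")


def frame(payload):
    """Prefix a payload with its length (illustrative format only)."""
    return HEADER.pack(len(payload)) + payload


class FramedReceiver:
    """Receives length-prefixed messages, keeping partial data across timeouts."""

    def __init__(self, sock, timeout=1.0):
        self.sock = sock
        self.sock.settimeout(timeout)
        self.buffer = bytearray()

    def _extract(self):
        # Return one complete payload from the buffer, or None if incomplete.
        if len(self.buffer) < HEADER.size:
            return None
        (length,) = HEADER.unpack_from(self.buffer)
        end = HEADER.size + length
        if len(self.buffer) < end:
            return None
        payload = bytes(self.buffer[HEADER.size:end])
        del self.buffer[:end]
        return payload

    def receive(self):
        """Return one complete message, or None if the timeout expired first."""
        while True:
            message = self._extract()
            if message is not None:
                return message
            try:
                chunk = self.sock.recv(4096)
            except socket.timeout:
                # Timeout is not an error; the partial bytes stay buffered.
                return None
            if not chunk:
                raise ConnectionError("peer closed the connection")
            self.buffer.extend(chunk)
```

With a receiver like this, a short timeout costs a retry rather than a corrupted stream, so the timeout value matters less. Raising the timeout, as in the actual fix, is the simpler change and leaves the existing protocol untouched.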
There’s more work we could do here, like a proper acknowledgement, but considering how elegantly simple the current system is, we’ll cross that bridge if my fix isn’t good enough.