I had a job where 4 tasks failed with one of those two status messages in the task log.
But according to Deadline they’re all still rendering and haven’t failed. I could live with an error message, because then other machines would pick up the task. Instead I get a situation where one render node is supposedly rendering 3 tasks at once, and they sit indefinitely until the auto-timeout occurs or I requeue the task.
We’re hosting the repository on a Windows Small Business Server system, so it’s not an access limitation. Any ideas on:
- why it might be failing
- solutions so that it realizes it has failed and tries again?
Is it possible to post the full slave logs that contain these errors? It would be helpful to know whereabouts during the render the problem occurred.
Also, which version of Deadline are you running? I’ve seen you on the 3.1 beta discussion, and it seems like someone else is running into this problem with this version too. We’re trying to figure out if it’s a regression from 3.0, or an ongoing problem in the 3.x releases.
From what we’ve seen, this problem only seems to affect 3dsmax jobs, so it could just be an issue with the 3dsmax plugin that is preventing the task from being requeued (which is the main problem here). Even if Deadline were just throwing random errors for no reason, the task should still get requeued.
We don’t know why this problem is occurring, and there aren’t any known workarounds. If you’re still participating in the 3.1 beta, we’d appreciate anything you can do to help us figure out this problem before the 3.1 release.
I think we may have figured this one out. The problem seems to be that the slave thinks the task file has been moved (which is usually an indication that the task has been requeued), when in fact the file hasn’t been moved at all. A network hiccup or something similar could be making the slave believe the task file has been moved, so it tells the render thread to cancel the current task on the assumption that it has already been requeued.
We’ll be adding some redundancy to the check for whether the task file has been moved. There will be 3 checks to see if the file exists, with a 10 millisecond delay between checks, and only if all three fail will the slave conclude that the task file has been moved. This alone should reduce how often this situation occurs.
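Just to illustrate the idea, here’s a rough sketch of that kind of retry logic in Python (Deadline’s actual implementation is in .NET, and the function name and path below are made up for the example):

```python
import os
import time

def task_file_present(path, checks=3, delay_seconds=0.010):
    """Return True if the task file is seen on at least one check.

    A single os.path.exists() call can return False during a brief
    network blip, so check a few times with a short delay and only
    conclude the file has been moved if every check fails.
    """
    for attempt in range(checks):
        if os.path.exists(path):
            return True
        if attempt < checks - 1:
            time.sleep(delay_seconds)
    return False

# Hypothetical task file path; only treat the task as requeued
# (and cancel the render) when all checks agree the file is gone.
if not task_file_present(r"\\repository\jobs\some_job\task_0001"):
    print("Task file appears to have been moved; cancelling the current task.")
```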
We’re also thinking we should explicitly try to requeue the task, but we’re not sure how well that would work if the slave can’t find the task in the first place. We’re still trying to figure this part out, but I just wanted to give an update on where we are with this.
So after some more digging, forcing a requeue won’t really help if Deadline thinks the file doesn’t exist. Hopefully the added redundancy will reduce the likelihood of this problem occurring in the future (all 3 checks for the file’s existence would have to fail). We’re using standard .NET code to check for a file’s existence, so I don’t think it’s a problem with our code, and while we do try to handle any case where a network issue can interrupt rendering, it appears that a rare case can still fall through the cracks.
If this problem is currently a common occurrence for you, are there any diagnostic tests you run on your network to see if there are any problems? Not too long ago, we ran into a problem where files on a network share would be there one minute and then gone the next, and while I’m not sure how our IT department solved the problem, it’s easy to see how such a problem could affect the render farm if it were to occur.
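If you want to log when the share blips, one rough approach (just an illustrative Python sketch, not an official diagnostic; the path is a placeholder) is to poll a test file and record whenever it appears to vanish:

```python
import os
import time
from datetime import datetime

# Placeholder path to a small test file on the network share.
TEST_FILE = r"\\fileserver\share\probe.txt"

def watch_share(path, interval_seconds=1.0):
    """Poll the file and print a timestamped line whenever its
    visibility changes, so brief 'file disappeared' moments show
    up in the output along with when they start and end."""
    last_seen = None
    while True:
        visible = os.path.exists(path)
        if visible != last_seen:
            state = "visible" if visible else "MISSING"
            print(f"{datetime.now().isoformat()} {state}: {path}")
            last_seen = visible
        time.sleep(interval_seconds)

if __name__ == "__main__":
    watch_share(TEST_FILE)
```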
I’d be happy to run any tests. Any suggestions on how to test the connection? It’s a pretty rampant problem, but our network is all GigE and seems reliable from a user standpoint, so I don’t know what would cause that. The switches aren’t reporting any collisions or obvious network issues. And we just installed a new file server that is about 5x faster than our old one, so if anything these problems should have gone away, not appeared in the first place.
Is this fix releasable as a patch? We’re having to practically babysit all of our renderings to make sure they finish right now. About 1/8th of our tasks are failing.
Hmm, I’m not aware of any specific tests. I’m actually surprised how often it is occurring for you. I’ve only heard of a few others that have experienced this problem, and it’s not obvious what’s different between farms that experience this and those that don’t. We’re really hoping that these “blips” that cause a file to appear missing are quick ones so that the redundancy checking will work around it.
The fix (assuming it actually works) will be included in the next 3.1 beta release, which we should be able to get out early next week. Your best bet would be to upgrade to this version (you should have beta access now).
Out of curiosity, is this something that has just started happening recently? Does the new server host the repository? Were you having these problems before upgrading to the new server?
Server info (the same server is running both Pulse and the Repository): it’s a RAID 10 array (six 1 TB drives) partitioned into two Windows drives with Shadow Copy running on top. Shadow Copy only runs every 30 minutes, takes about 2 seconds, and finishes. I’ve checked, and there is no correlation between Shadow Copy times and task stall times.
Our slaves are running Vista x64. The Slave runs as a service with Power User privileges.
Try turning off Pulse for a day, and see if the problem persists. Since the problem occurs quite often, hopefully 24 hours is enough to determine if Pulse is contributing to the problem. Also, have you noticed if the problem occurs almost directly after a slave receives a task from Pulse, or does it occur later on in the rendering at random times? I believe the log you posted earlier shows it happening very early into the render, and I’m curious if that’s the situation in all cases.
Cool, that’s what we had figured. 3.1 beta 8 includes some new checks to hopefully work around the problem, so if you have a chance to test it out, let us know how it goes with Pulse running.
I think 3.1 beta 8 has fixed it. I only had one slave fail, but then I checked the version number, and it turns out that slave had never been restarted and was still running the old version.
I don’t think it’s a coincidence that all the other slaves finished without any problems.