Since I installed Deadline 3.1, I’ve been getting a lot more errors on the render network.
I’ve kept adjusting the configuration, but there’s still a problem. For some reason certain slaves get the state “stalled” even though they aren’t stalled. Normally restarting the machine or the slave would help, but not in this case. Sometimes they’re rendering, sometimes they aren’t. Restarting the machine or slave and refreshing my Deadline Monitor didn’t solve the problem. What am I doing wrong here?
Today I installed Deadline 3.1 SP1 to see if that would help, but no…
I’m sorry to hear you’re having problems with Deadline 3.1.
A few questions:
Which operating system do you have installed on the repository machine?
Is there a time difference between your repository machine and any machines in the farm?
Were you running smoothly with a previous version of Deadline? If so, which version were you using previously?
Is it always the same few slaves that exhibit this behavior? If so, there may be network connection problems between those slaves and the repository machine.
I’m really late with my answer, but I hadn’t seen the response because I didn’t check the “notify me” checkbox, so I wasn’t notified when a reply was posted.
Ok, the problems are still happening now.
The repository machine is an ISILON cluster system. So we run the operating system “Isilon OneFS v5.0.6.4 B_5_0_6_4”.
If there is a time difference between the repository and the slaves, it can be at most 5 to 7 minutes.
We ran smoothly with an earlier version of Deadline, which was 2.7.
The problem occurs randomly for all the slaves.
Another thing: because of your question I checked the repository time, and we also looked in Deadline Pulse. I saw the following:
Pulse machine: Windows time was 11:40
repository time was 11:39
and Deadline Pulse said the repository time was 10:54???
By default, to get the repository time, Pulse will create a new file in the repository folder, and then get that file’s creation date and time. From the Pulse log, it looks like the “repository time” is 15 minutes off, which would definitely account for the random stalled slaves. By default, if a slave hasn’t updated its status in 10 minutes, it’s assumed to be stalled, so having a 15 minute time difference can definitely cause problems.
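To check the same thing by hand, the file-based method described above can be approximated with a small script. This is only a sketch, not Pulse's actual code: the function names are made up for illustration, and it reads the new file's modification time as a stand-in for the creation time, since that is what is portably available.

```python
import os
import tempfile
import time


def probe_share_time(directory):
    """Create a throwaway file in `directory` and return the timestamp
    (seconds since the epoch) that the file server stamped on it.

    Assumption: st_mtime is used as a stand-in for "creation time".
    """
    fd, path = tempfile.mkstemp(prefix="timeprobe_", dir=directory)
    try:
        os.close(fd)
        return os.stat(path).st_mtime
    finally:
        # Always clean up the probe file, even if stat() fails.
        os.remove(path)


def clock_offset_seconds(directory):
    """Local clock minus the file server's clock, in seconds."""
    local_now = time.time()
    return local_now - probe_share_time(directory)
```

Running `clock_offset_seconds()` against the repository share from a slave and from the Pulse machine should show whether the file timestamps really disagree with the local clocks, and by how much.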
There are two things you can try:
In the Monitor, enter super user mode and select Tools -> Configure Repository Options. Scroll down to the Slave Settings section and modify the Stalled Slave Delay setting. For initial testing, maybe set it to something like 60 minutes. After clicking OK to save the changes, restart all the Slave applications so that they recognize the change immediately. Now, with the delay set to 60 minutes, the 15 minute time difference shouldn’t result in improperly detected stalled slaves.
In the Repository Options, scroll down to the SNTP Date/Time Synchronization Settings. Here, you can specify a machine that will be used to sync the time, instead of using the file creation method mentioned above.
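To illustrate why the delay setting matters, here is a rough sketch of a stalled check of the kind described above. It is not Deadline's implementation: the function name, the date, and the "one minute ago" figure are assumptions for the example, while the 10 minute default, the 15 minute difference, and the 60 minute test value come from the explanation above.

```python
from datetime import datetime, timedelta


def is_stalled(last_update, now, stalled_delay_minutes):
    """A slave counts as stalled if its last status update is older
    than the configured delay."""
    return now - last_update > timedelta(minutes=stalled_delay_minutes)


# Hypothetical situation: the slave really reported one minute ago,
# but its status timestamp comes from a clock that is 15 minutes
# behind the clock used for the check. The date is arbitrary.
now = datetime(2009, 1, 1, 11, 40)
skew = timedelta(minutes=15)
observed_last_update = now - timedelta(minutes=1) - skew

stalled_with_default = is_stalled(observed_last_update, now, 10)
stalled_with_60_min = is_stalled(observed_last_update, now, 60)
```

With the 10 minute default the healthy slave appears 16 minutes overdue and is flagged as stalled; with the delay raised to 60 minutes the same time difference no longer triggers the check.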
Thank you for your quick response. I’ve now adjusted the time everywhere, and I’ll have to wait another day to see if the problem still occurs.
It does seem to be better! I’ll keep track of the errors, and if this happens again I’ll reply here. But I don’t think that will be soon.