I have no idea what happened, but this is the second job for which Deadline has incorrectly reported the render progress. The job never moves on, which is a real problem if other jobs depend on it!
See the picture for clarity.
Basically, the job still reports two or more machines as rendering (I even saw 34, although I only have 5 physical machines). The machine list shows all slaves as Idle, and the job is stuck.
However, all frames are rendered correctly and the output is fine, but a job that depends on this one never gets started!
Slave Machine = Duber-slave-04
Slave Version = v4.1.0.42706 R
Plugin Name = 3dsmax
So, basically, it finished, but it seems as if Deadline wasn’t notified about the change?
I have to say this is the first time I’ve run into this issue, and the only difference from previous jobs is that I used an inline PostLoad script: ATSOps.RetargetCommonRoot @"\messiah_UNMANAGED_PROJECTS_\Test"
The job just finished after I waited about 10 minutes, and it reports the render times of the frames correctly, even though it was stuck on some of the frames for about 10 minutes.
We’ve seen this type of problem when Pulse is running and runs into permission problems. Can you give us a bit more detail about your setup? Specifically:
What OS do you have the repository installed on?
Do you have Pulse running on the same machine as the repository?
i) If not, which OS is Pulse running on?
ii) If so, is Pulse connecting to the repository via a local path (e.g. c:\program files\prime focus\deadlinerepository)?
Thanks for the info. The first thing to try is to have Pulse use the UNC path to connect to the repository. Just use the Launcher in the system tray to change the repository path and then restart Pulse. If the problem still occurs after making this change, then please post the slave log for the session where it appears to be processing many tasks at the same time and we can look through it to see if any problems stand out.
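For anyone unsure which kind of path their Pulse is configured with, here is a minimal Python sketch that tells a UNC path from a local drive path. It is only an illustration: the helper name is_unc_path and the \\fileserver\DeadlineRepository share are made up for the example and are not part of Deadline.

```python
from pathlib import PureWindowsPath


def is_unc_path(path: str) -> bool:
    """Return True when the path starts with a UNC server/share prefix."""
    # PureWindowsPath parses Windows paths on any OS. For a UNC path the
    # "drive" is the server and share prefix, which begins with two backslashes.
    drive = PureWindowsPath(path).drive
    return drive.startswith("\\\\")


# Hypothetical share name, used only for illustration.
unc_example = r"\\fileserver\DeadlineRepository"
# Local path quoted from the question above.
local_example = r"c:\program files\prime focus\deadlinerepository"
```

With these two examples, is_unc_path(unc_example) is True and is_unc_path(local_example) is False, so the second one is the kind of path that should be swapped for its UNC equivalent.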
It seems that after upgrading to 4.1 SP1 the problem is gone, at least for now. I’ll keep an eye on whether it still reports progress incorrectly.
By the way, even after changing the repository option for Remote Administration, I still can’t remotely restart slaves etc…
I get this error:
Machine Name | Command | Timestamp | Status | Results
Duber-slave-01 | LaunchSlave | 10/06 22:39 | Failed | No connection could be made because the target machine actively refused it 192.168.0.201:5042
There is no firewall restriction on the local network, by the way, so the port is open.
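A quick way to see how that port looks from the machine issuing the command is a small TCP probe. The sketch below is plain Python, not a Deadline tool; probe_tcp is a hypothetical helper, and the address and port are simply the ones from the error above. A "refused" result means something answered and rejected the connection (typically nothing is listening on that port, or a device on the route sent a reset), while "timeout" usually means packets are being dropped silently somewhere along the way.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target (or a device in between) actively rejected the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError:
        # Anything else, e.g. no route to host or name resolution failure.
        return "error"


if __name__ == "__main__":
    # Address and port taken from the error message in this thread.
    print(probe_tcp("192.168.0.201", 5042))
```

Running it once from a machine inside the office network and once from the remote machine would show whether the two routes behave differently.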
Also, I should probably mention that I’m trying to operate the nodes remotely, via VPN from home on a completely different network, in case that changes anything.
No, the firewall is configured to let Deadline services through. As I said, I manage my jobs from home without issues, except for launching the slaves. Locally, however, i.e. running remote administration from within the company network (from another slave or one of the workstations), everything works fine. I just restarted the slaves successfully. It only fails over VPN from my home office.
But, since it works at least this way, it’s not a big deal actually.
Thank you again for your support, much appreciated!
It’s hard to say. I would have to look at the slave log from the session where it picked up multiple tasks. If you could post it, that would be a big help.
That’s from a particular job. We need the full slave log, which you can find by going to the slave machine, and selecting Help -> Explore Log Folder from the Slave UI. Just make sure to grab a log from the session where the slave picked up multiple tasks. Also, please post it as an attachment, since I’m sure it will be fairly long.
Thanks for the log. So I went through it and I’m seeing a few of these statements:
This would seem to imply that the slave is having problems connecting to the repository. Normally, if a connection is severed completely, the slave won’t assume the task has been requeued because it determines that the entire repository folder can’t be reached. However, if the slave only loses partial connectivity (i.e. it can’t see this task file, but it can see the job folder and the repository folder), then all it can do is assume someone requeued the task. You should check to see if you’re having any network issues, because that could definitely be the source of this problem.
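To make the distinction between a full and a partial loss of connectivity concrete, here is a rough Python sketch of a layered visibility check. It is an assumption-laden illustration, not Deadline's actual logic: the function name first_unreachable is hypothetical, and the three paths have to be supplied by whoever runs it, since the repository layout is not described in this thread.

```python
import os
from typing import Optional


def first_unreachable(repo_root: str, job_dir: str, task_file: str) -> Optional[str]:
    """Return the first level that cannot be seen, or None if all three are visible."""
    # Check from the outermost level inward, so a dead share is reported as
    # 'repository' rather than as a missing task file.
    levels = (
        ("repository", repo_root),
        ("job", job_dir),
        ("task", task_file),
    )
    for label, path in levels:
        if not os.path.exists(path):
            return label
    return None
```

A result of "repository" corresponds to the fully severed case, while "task" with the other two levels still visible is the ambiguous case described above: from the slave's side it looks the same as someone having requeued the task.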
However, I can’t see any issues on the network. It’s a small (8 PCs total) local network on a 1Gb line.
I have Pulse running though and I’m submitting from an external PC over a narrow 4Mbit line (it’s a pain, btw), but that’s about it. All the render machines and the repository are on the same local network.