Additional slave is hogging tasks, but not rendering them.

I’m having a problem with two license-free nodes. The Slave app says Repository has two slaves. Both slaves appear in the Monitor as idle/running. When I submit a Maya 2011 job from Slave A, the first task usually goes to Slave A and it begins to render, but then all the remaining tasks are assigned to Slave B almost immediately, none of which render. When the first task completes on Slave A, none of the additional tasks are assigned to it because they are already assigned to Slave B.

The opposite is true if I submit the job from Slave B. The first task is usually assigned to Slave B while the remaining tasks go to Slave A, which does not render. Sometimes, all the tasks are assigned to the opposite slave from the one the job was submitted from, leaving no tasks at all for the slave that initiated the job. The status of all tasks just says ‘Rendering’, but I’m not seeing any error or rendering taking place.

I tried finding similar issues, with no luck.

While I’m not positive on what is happening, it does sound like something that may occur because of permissions issues.

software.primefocusworld.com/sof … ssions.php

Try making sure all the proper permissions have been set on the repository and see if that helps.

  • Cody

I verified that all permissions on the network-shared Repository are correct but it didn’t help.

Here is a screenshot that hopefully illustrates the problem.

Also, here is a short screencast demonstrating what happens. As a bonus, you can see the Mono implementation on OS X is quite buggy, but that’s another issue.

H264, 6.5MB
http://dl.dropbox.com/u/57324/deadlineScreencast.mp4

I have seen this exact problem occur in the past, and it was related to permissions. It sounds like the job folder created when a job is submitted from Slave A has Slave A’s credentials, and the same for Slave B. That would explain why the job only renders on the machine that it is submitted from.
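
One way to check for this is to list the owner and mode of everything under the repository and look for entries that other users cannot write to. The following is only a rough Python sketch: the mount path `/Volumes/DeadlineRepository` and the function name are made up for illustration, and it only looks at the POSIX mode bits as seen from the client, not at any ACLs set on the server.

```python
import os
import stat


def find_restricted_entries(root):
    """Return (path, owner_uid, mode_string) for entries that 'other' users cannot write."""
    restricted = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            # Flag anything missing the world-writable bit.
            if not info.st_mode & stat.S_IWOTH:
                restricted.append((path, info.st_uid, stat.filemode(info.st_mode)))
    return restricted


if __name__ == "__main__":
    # Hypothetical mount point; substitute wherever the repository is mounted.
    repository_root = "/Volumes/DeadlineRepository"
    for path, uid, mode in find_restricted_entries(repository_root):
        print(mode, uid, path)
```

If job folders submitted from Slave A show up with Slave A's user ID and without write access for others, that would be consistent with the theory above.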

Which OS do you have the repository installed on, and is it a separate machine from your 2 slaves (ie: it’s not installed on one of the slaves)? If applicable, how is the repository shared (ie: nfs, samba, etc)?

Finally, you can avoid a lot of Mono buggy-ness by using the X11 driver. This was an option in the Client installer. You can run the setup again without reinstalling by going to /Applications/Deadline and running ClientSetupWizard.

Cheers,

  • Ryan

Thanks for replying guys.

The only permissions problem I found was for the deadline.ini file in the Applications/Resources folder on one of the machines. I found this because I checked to see if the X11 Driver is enabled (it’s checked), but when I clicked OK I got an error. Fixing the permissions made that error go away and another one appear. Maybe I need to reinstall it, but I’m ignoring it for now. Like you said, the Mono bugginess is not as bad on the other slave that seems to be using the X11 driver. Are there any future plans to switch from Mono to Qt?

As for the Repository, it’s stored on an Xserve running OS 10.5 and we mount the volume with afp. It is not one of the slaves. I verified the permissions again; the folder and all of its contents show ‘everyone Read & Write’, but it didn’t help.

Then I tried logging both slaves into the same default user account in OS X. The change is reflected in the Monitor under the Slave User column (both usernames are the same). THIS WORKS. Both machines can split the tasks as expected. I hope there is a way to submit jobs from our individual user accounts as opposed to using the default accounts. It does seem permissions related, but I’m not sure where else to look. Maybe that bit of info can help you help me, but for now at least this is a partial solution. :)

This probably deserves its own thread, but I can’t get either slave to respect Slave Scheduling, or any Power Management settings. I read the page which says Pulse must be running, which it is. For testing, here’s what I do:

1.) Set the Slave Scheduling Mode to Enabled
2.) Add a group with one machine, set Group Mode to Enabled.
3.) Set the start time for today to 10 minutes from now, and the stop time to 20 minutes from now.
4.) Submit a new medium-sized job.
5.) Start Pulse and then Slave a moment later.

And the slave begins rendering immediately… and does not stop when the ‘stop’ time is reached. What am I missing? It seems like maybe it worked intermittently, but the most I have got is under Pulse - Power Management, “-slave is online on machine blah-blah-blah and needs to be stopped”, but it doesn’t actually stop or start. However, after the 20 minutes is up, it says both Start and Stop tasks have run today. Should Pulse launch the Slave app automatically? Because that doesn’t work either. Similarly, if I set a machine to restart if the Slave has been running for 10 minutes, Pulse just says it’s been running for 10 minutes and will restart… then nothing.

Thanks for your prompt replies. I’m really liking the software overall. The feature set is more than I could ask for and the queuing process works great. My biggest issues are the Mono interface and the array of various applications. Configuration is mostly straightforward, but permissions are a pain, though that’s not entirely unique to Deadline.

Thanks for reporting that error. We’ll look into it. We’re also doing some digging to try and figure out the user account/permission problem, as it should be possible to submit from different accounts.

For your power management problem, do you have the Deadline Launcher application running on your render nodes? This is the application that power management communicates with when controlling machines, and is also what the Monitor uses when performing remote control commands. If it is running, check the Launcher Settings in the Repository Options to make sure that Remote Administration is enabled. You can access the Repository Options from the Tools menu in the Monitor while in super user mode. Scroll down to the Launcher settings to find the Remote Administration setting.

Cheers,

  • Ryan

I’ve done some reading regarding the permission problem you’re running into on your Mac server, and it seems like a common issue. The problem is that the file created in the share (in this case, the repository) is retaining the permissions of the user that created it, rather than inheriting the permissions defined in your AFP share settings.

I haven’t come across a step-by-step solution for the problem (yet), but maybe this information will be helpful for you. On Linux, I know that with NFS and Samba shares, there are mask settings you can specify to avoid this problem, so I would expect something similar to exist on the Mac.
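
To illustrate what a mask does in general, here is a small Python sketch showing how a process umask strips permission bits from newly created files on a POSIX system. This is only an analogy for the share-level mask settings mentioned above, not AFP, NFS or Samba configuration, and the function name is made up for the example.

```python
import os
import stat
import tempfile


def mode_of_new_file(umask_value):
    """Create a file under the given umask and return its permission bits."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "task.txt")
    previous = os.umask(umask_value)
    try:
        # Request read/write for everyone; the umask removes bits from this request.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o666)
        os.close(fd)
    finally:
        os.umask(previous)  # restore the original process umask
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    print(oct(mode_of_new_file(0o022)))  # group/other write bits are masked off
    print(oct(mode_of_new_file(0o000)))  # nothing is masked off
```

With a restrictive mask, a file created by one user ends up not writable by anyone else, which is the same kind of symptom described for the job folders.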

Cheers,

  • Ryan

Just found this:
devworld.apple.com/mac/library/d … ing.1.html

The usage mentions these arguments:

So maybe the key is to create the share manually using the “sharing” command. We only have OSX clients here, so unfortunately we can’t test this out ourselves.

Cheers,

  • Ryan