Hi,
another issue on Linux: setting concurrent tasks has no visible effect; the slave renders only 1 task at a time. It works properly on Windows, though. I tried it with Maya and Nuke. Any suggestions?
Thanks,
Gabor
Hi Gabor,
I just tested this here with Deadline 4.1, and it worked fine. In the slave list in the Monitor, how many CPUs is it showing for the Linux slave(s) that this problem is occurring on? If you have the “Limit Concurrent Tasks To Slave’s Task Limit” option enabled (which is the default), a slave will not dequeue more tasks than it has CPUs. If Deadline is showing that it can only detect 1 CPU on the Linux slaves, that could explain the problem…
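If you want to cross-check what the OS itself reports, independent of Deadline’s detection, a minimal sketch like the following can be run on a Linux slave (just an illustration; it is not part of Deadline):

```python
# Cross-check the CPU count the OS reports on a Linux slave.
# Purely a diagnostic sketch; Deadline does its own CPU detection,
# which is what actually drives the concurrent task limit.
import multiprocessing

print("multiprocessing.cpu_count(): %d" % multiprocessing.cpu_count())

# On Linux, /proc/cpuinfo lists one "processor" entry per logical CPU.
with open("/proc/cpuinfo") as f:
    logical = sum(1 for line in f if line.startswith("processor"))
print("logical CPUs in /proc/cpuinfo: %d" % logical)
```

If both of these report 16 but the Monitor shows 1, the problem is in Deadline’s detection rather than in the OS.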
Cheers,
Hi Russel,
thanks for the reply! Unfortunately, everything is set as you described, but it’s still not working: the slave shows 16 (or 4) CPUs, I tried the slave’s concurrent task limit at 0, 4, and 8, and it makes no difference whether the job’s “limit concurrent tasks…” property is on or off; it always renders 1 task at a time. Could it be a permission problem again somewhere? We are on Fedora 10.
Thanks,
Gabor
Hi,
Any other tips or suggestions regarding this? We are seriously considering investing more in Deadline for our Linux-based render farm if these issues are solved. The problem is that we have a couple of 16-core machines that we want to use for rendering more tasks in parallel, to be efficient on quick Nuke jobs.
Thanks in advance,
Gabor L. Toth
2D TD
www.digitalapes.com
Hi Gabor,
I’m away from the office this week, so I won’t be able to investigate this further until I get back next week.
Cheers,
Ryan
Hi Gabor,
I don’t think this would be permissions-related, as the slave is able to pick up one task just fine. Can you do the following: submit a simple test job, then zip up its job folder from the repository and send it to us? It can be a Maya or Nuke job. We’ll drop the job into our repository here to see if we can reproduce the problem.
Also, which version of Deadline are you currently using?
Cheers,
Hi Russel,
thanks for the reply! Here is a simple scene setup that renders only 1 task at a time on Linux; I hope you can find something.
We are using version: 4.1.0.42706
Thanks,
Gabor
deadlinejob_concurrent.zip (7.17 KB)
Thanks for this! I dropped the job folder into our repository, and without changing a thing, I had both Linux and OSX machines each picking up 4 tasks at a time. I tested with and without Pulse running just to cover all the bases, and it worked as expected. There was no delay in picking up the 4 tasks either, so it’s not as if fast render times would cause you to only see one at a time. The machines I tested on have 2 CPUs each, but that shouldn’t have been a factor because the “limit tasks to cpus” option was disabled.
So unfortunately, I’m pretty much stumped at this point. Since it is working on Windows for you, maybe permissions are playing a role, but I can’t imagine how. Which OS do you have the repository on? Ours is a Linux box running openSUSE with Samba sharing. It’s mounted on the clients using CIFS on Linux and SMB on OSX.
Cheers,
Hi Russel,
thanks for the feedback! The repository is on Ubuntu, with the same mounting and sharing setup as yours: Samba on the server, CIFS on the clients.
On Linux the slave creates only one local copy of the scene file, named [scenefile]_thread0, while on Windows it creates as many copies as the machine has CPUs ([scenefile]_thread0-3, for example); a small sketch for counting those local copies follows the log excerpt below. Comparing the slave logs of a Linux machine and a properly working Windows machine, I noticed that there is a part of the Windows log that is completely missing from the Linux log:
"
//192.168.xxx/dfsroot/DeadlineRepository\jobs\000_050_999_012c262b\000_050_999_012c262b.job) (Deadline.Jobs.JobCorruptedException)
2010-08-31 12:51:41: Scheduler - Trying to dequeue task: \\192.168.xxx\dfsroot\DeadlineRepository\jobs\000_050_999_192a585c\Queued\000_050_999_192a585c_00016_33-34.task
2010-08-31 12:51:41: Scheduler - Obtaining limit group stubs.
2010-08-31 12:51:41: Scheduler - obtaining 000_050_999_192a585c
2010-08-31 12:51:41: Scheduler - Trying to dequeue task: \\192.168.xxx\dfsroot\DeadlineRepository\jobs\000_050_999_192a585c\Queued\000_050_999_192a585c_00017_35-36.task
2010-08-31 12:51:41: Scheduler - Obtaining limit group stubs.
2010-08-31 12:51:41: Scheduler - skipping 000_050_999_192a585c because we already have it
2010-08-31 12:51:41: Scheduler - Trying to dequeue task: \\192.168.xxx\dfsroot\DeadlineRepository\jobs\000_050_999_192a585c\Queued\000_050_999_192a585c_00018_37-38.task
2010-08-31 12:51:41: Scheduler - Obtaining limit group stubs.
2010-08-31 12:51:41: Scheduler - skipping 000_050_999_192a585c because we already have it
2010-08-31 12:51:41: Scheduler - Trying to dequeue task: \\192.168.xxx\dfsroot\DeadlineRepository\jobs\000_050_999_192a585c\Queued\000_050_999_192a585c_00019_39-40.task
2010-08-31 12:51:41: Scheduler - Obtaining limit group stubs.
2010-08-31 12:51:41: Scheduler - skipping 000_050_999_192a585c because we already have it
2010-08-31 12:51:41: Scheduler - Successfully dequeued 4 task(s). Returning.
2010-08-31 12:51:41:
2010-08-31 12:51:41: Scheduler - Returning limit group stubs not in use.
"
From that point on, the file synchronization and rendering parts are similar. Is this normal? Can this scheduler be configured somewhere?
If you need it, I can post the whole log.
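For reference, here is a minimal sketch of how the local copies can be counted; note that the local job data path below is only an example and varies by install, so adjust it for your machines:

```python
# Count the local scene copies the slave made for its jobs.
# NOTE: the jobsData location below is a placeholder example; the
# slave's local job data folder varies by install and configuration.
import glob
import os

jobs_data = os.path.expanduser("~/Deadline/slave/jobsData")  # example path
copies = glob.glob(os.path.join(jobs_data, "*", "*_thread*"))
print("local scene copies found: %d" % len(copies))
for path in sorted(copies):
    print("  %s" % path)
```

On Windows this finds one copy per CPU; on our Linux slaves it only ever finds _thread0.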
Thanks,
Gabor
Hi Gabor,
That Scheduler information should be printed in the Linux logs too. Can you send us a full log from a Linux slave (one from a session where the slave was only able to render 1 task at a time)? You can find the log folder on the slave machine by selecting Help -> Explore Log Folder in the Slave application. If you would prefer not to post this log on the public forum, you can email it to Deadline support:
software.primefocusworld.com/sof … e/contact/
Cheers,
Hi Gabor,
Thanks for sending the log. I saw this appear once in the log:
2010-08-31 14:23:12: Scheduler - Trying to dequeue task: /***/DeadlineRepository/jobs/000_050_999_192a585c/Queued/000_050_999_192a585c_00001_3-4.task
2010-08-31 14:23:12: Scheduler - Obtaining limit group stubs.
2010-08-31 14:23:12: Scheduler - Job with a higher priority than the desired job was found. Not picking up new tasks for available threads.
So the machine thinks that there is a job with a higher priority sitting in the queue. I’m not sure why this is, unless there is a corrupted job (or jobs) in the queue causing problems. There is a quick test you can run to see if this is the case, but you will only want to run it if you’re not in the middle of production.
Cheers,
Hi Russel,
thanks for the tip, but unfortunately it’s not working; it does the same thing.
What is this limit group stub? Is it simply the limit group?
Thanks,
Gabor
Can you set up your queue so that it just has one job, and restart the slave application on a Linux machine so that it starts a fresh log? Let it go through a few tasks, and then send us that log. This way the log isn’t too big, and without any other jobs in the queue, it will be interesting to see what info is in the log.
Is the job using a limit group, or a machine limit (including white/black lists), or both?
Thanks!
Hi Russel,
I sent the fresh log to support. I wasn’t using limit groups or black/whitelisted machines, only pools.
Thanks,
Gabor
Hi Gabor,
Just a duplicate reply, as I have just answered your Nuke mailing list question about Deadline & Nuke concurrent processing.
Unfortunately, this issue isn’t related to Deadline; it’s more an issue with Nuke. From my own testing, concurrent processing with Nuke has mixed results, so we very much use it on a case-by-case basis when we create our Nuke scripts.
You can always test all this locally, perhaps even using a Python script to start multiple local renders (a minimal sketch of the idea follows the link below). I would recommend:
nukepedia.com/gizmos/python- … r/bgnukes/
(Nukepedia.com - the new home of all things Nuke…)
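As a rough starting point, here is a minimal sketch of the idea. It assumes the Nuke executable is on the PATH as nuke and that your version accepts -x plus a start,end frame range (and -m to cap the threads per render); check nuke --help and adjust for your version:

```python
# Minimal sketch: launch several local Nuke renders in parallel and
# wait for all of them. The flags are assumptions -- verify "-x", "-m",
# and the frame-range syntax against your Nuke version's --help.
import subprocess

SCRIPT = "/path/to/comp.nk"                        # example script path
CHUNKS = [(1, 25), (26, 50), (51, 75), (76, 100)]  # frame chunks
THREADS_PER_RENDER = "4"                           # cap each render's threads

procs = []
for start, end in CHUNKS:
    cmd = ["nuke", "-m", THREADS_PER_RENDER, "-x", SCRIPT,
           "%d,%d" % (start, end)]
    procs.append(subprocess.Popen(cmd))

# Wait for every chunk and report any non-zero exit codes.
for (start, end), proc in zip(CHUNKS, procs):
    code = proc.wait()
    print("frames %d-%d finished with exit code %d" % (start, end, code))
```

That should tell you fairly quickly whether your comps actually benefit from running concurrently, before you change anything on the farm.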
Mike
Hi Mike,
thanks for the reply. I will try what you suggested. We need the concurrent tasks feature because we have four 16-core machines with lots of memory, and on small to medium Nuke tasks the cores are not fully used. For example, if one machine can process 4 tasks at once, so 1 task on 4 cores each, that should be 1.5-2 times faster than 1 task on 16 cores, depending on the type of the comp. Last year, when we were using Afanasy, those machines could nicely process 4 Nuke tasks in parallel, so I’m sure they could handle this with Deadline too. And this feature is working here on Windows (but we use mainly Linux for comp), which is why I don’t understand it.
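To make that estimate concrete, here is a back-of-the-envelope model using Amdahl’s law; the 90% parallel fraction is an assumed number purely for illustration, since real comps vary:

```python
# Back-of-the-envelope: why 4 concurrent 4-core tasks can beat one
# 16-core task. The parallel fraction P is an ASSUMPTION; real Nuke
# comps vary, so treat these numbers as illustrative only.
P = 0.9  # assumed parallelizable fraction of a render

def speedup(cores, p=P):
    # Amdahl's law: serial part plus parallel part split across cores.
    return 1.0 / ((1.0 - p) + p / cores)

one_big = speedup(16)        # one task using all 16 cores
four_small = 4 * speedup(4)  # four tasks using 4 cores each
print("1 task  x 16 cores: %.1fx throughput" % one_big)     # ~6.4x
print("4 tasks x  4 cores: %.1fx throughput" % four_small)  # ~12.3x
print("ratio: %.2f" % (four_small / one_big))               # ~1.92
```

With those assumptions the four-at-once setup comes out roughly 1.9 times faster, which lines up with the 1.5-2x range above.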
I searched the Nuke discussions regarding Deadline, but I will take another look at this whole threading thing.
Thanks again,
Gabor
Hi,
if anyone else is struggling (like us) with concurrent tasks not rendering on Linux, I have great news for you: this issue is now fixed in the newest Deadline 5 beta 5! Thanks, Ryan! It now works as expected!
Cheers,
Gabor