AWS Thinkbox Discussion Forums

Experiences? Large scale farm performance and future updates (Houdini, Maya)

Hey Everybody!

I am taking a look at Deadline as it was recommended to me.
I have some questions that I cannot verify with a simple test setup.
So far I have 3 questions:

  1. Performance with 20tsd/40tsd jobs and 500 workers.
    How long does it take for the monitor to update a huge job list and many workers?
    Is there any difference between starting the Monitor and loading all jobs for the first time vs. having the Monitor open for some time and refreshing the job list?
    Any recommendations on what can be done to speed it up?

  2. I have heard some bad information regarding future updates.
    E.g. it took over 6 months to support the new Houdini version after SideFX switched to Python 3.
    And someone told me that Solaris LOPs are not supported?
    What about other applications?
    E.g. if there is a new Maya version. Does it work right away or do we have to wait for an update? And how long does that take?

  3. I have read that it is not possible to start 2 different jobs on one worker.
    You have to start 2 workers for that.
    But that raises 2 questions:
    a) Do these workers communicate with each other?
    E.g. I know that one job requires 60 GB of RAM and another 20 GB of RAM.
    I cannot render both on a machine with 64 GB of RAM.
    But I still want two 20 GB jobs to be able to render on one machine.
    How can this be solved?
    b) What about the AWS cloud connection?
    Does it have the option to start 2 workers on 1 VM?

Thanks

I’ll stick to your numbers here:

1: Assuming 20tsd/40tsd is short for 20/40 thousand? It’s hard to give a solid number on how often the Monitor will update a job list as it’s effectively a query against the database.

Using an RCS will make it faster as clients will be able to take advantage of request caching built into the RCS.

In terms of usability, the performance settings are used to increase/decrease how often Monitors poll the database for updated data. The idea is that if updating all the jobs takes 30 seconds, you shouldn't poll for new data every 10 seconds. The auto-scaling only scales based on the number of Workers, so I can't just punch in 20 thousand jobs and get you an estimate. However, at 1000 Workers it suggests a Job update every 46 seconds, meaning at most the data in the Monitor will be 46 seconds out of date.
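The rule of thumb above (don't poll faster than a full update takes) can be sketched as a small helper. The floor and headroom constants here are illustrative assumptions, not Deadline's actual internal formula:

```python
def poll_interval_seconds(last_update_duration, floor=10.0, headroom=3.0):
    """Pick a Monitor-style poll interval: never poll more often than
    `headroom` times the duration of the last full update, and never
    faster than `floor` seconds. Constants are illustrative, not
    Deadline's internal values."""
    return max(floor, last_update_duration * headroom)

# If a full job-list refresh takes 15 s, poll roughly every 45 s.
print(poll_interval_seconds(15.0))  # 45.0
```

A fast farm with sub-second updates would bottom out at the 10-second floor rather than hammering the database.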

However I’ve never been a render wrangler, so if someone who’s actually running a farm that big could weigh in that’d be great.

2: In general the support team can get an unofficial patch out for a new release before full support is added to the product, or publish the updated plugin code here in the forums early.

For Maya 2024, we’ve created patch files and they’re up in the forums on Maya 2024 Patch files.

Adding these patches manually isn’t usually complicated, and you can use this Help Centre guide to go through it if the support team hasn’t done it already.

Assuming the product developers (Autodesk for Maya for example) haven’t wildly changed the rendering API the patch should work immediately. New features may or may not work however, it can vary product to product and update to update.

Ideally we can get the patch files out in a couple days, and full support out in the next release. We’re not able to comment on when releases are, or what they’ll have in them due to Amazon’s policies however.

3:
a) The Workers do not communicate with each other to co-ordinate resources. It's possible to assign CPU cores and GPUs using their respective affinity settings, but we don't have a similar feature for memory usage.
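Since there's no built-in memory affinity, one workaround is to enforce a per-machine memory budget yourself, e.g. from a custom event plugin that vetoes a job start when the budget would be exceeded. The admission check itself is simple; the per-job `job_required_gb` value and the callback wiring are assumptions you'd supply from your own pipeline:

```python
def can_start(job_required_gb, running_jobs_gb, machine_total_gb, reserve_gb=4):
    """Return True if a new job fits within the machine's memory budget.
    `running_jobs_gb` lists the declared RAM of jobs already running;
    `reserve_gb` keeps headroom for the OS. All values are illustrative."""
    used = sum(running_jobs_gb)
    return used + job_required_gb + reserve_gb <= machine_total_gb

# A 64 GB machine already running one 20 GB job:
print(can_start(20, [20], 64))  # True  -> a second 20 GB job fits
print(can_start(60, [20], 64))  # False -> a 60 GB job would not fit
```

This matches the scenario in the question: two 20 GB jobs can share the 64 GB machine, but a 60 GB job is refused while anything else is running.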

b) No, not by default. The reason is that all our AWS integrations (AWS Portal — Deadline 10.2.1.1 documentation / Spot Event Plugin — Deadline 10.2.1.1 documentation) assume a single Worker per VM, and if that Worker is idle they terminate the VM. There's nothing stopping you from starting your own VMs using images that have multiple Workers configured, however!

The idle detection limitation is a byproduct of our implementation. For AWS, the instance size and price should scale linearly. Meaning two 8GB machines should cost the same as a single 16GB machine of the same family. So our usual recommendation is to size the instances to match the work versus running multiple Workers.

Hi

Thanks for the answers.

It’s hard to give a solid number on how often the Monitor will update a job list
performance settings are used to increase/decrease

The question was not about how often the jobs are updated. I wanted to know how much time an update of the Monitor takes.
A) Time from starting the monitor to the point in which I see all 20tsd jobs.
B) Monitor is already started and running since some time. But jobs are outdated, so it should update all jobs that have changed.

The idea being that if updating all the jobs takes 30 seconds

Ok, so it takes 30 seconds to get jobs into the monitor?

It does only scale based on the number of Workers, so I can’t just punch in 20 thousand jobs and get you an estimate

What is the connection between the job data in the database and the number of workers?
The database is not stored on a worker.
So it should not matter how many workers I have when getting the job list. The job data (like scene name, frames done, …) does not change if I add more workers.
Or am I missing something here?

so if someone who’s actually running a farm that big

Yes, that would definitely be better.

we’ve created patch files and they’re up in the forums on Maya 2024 Patch files

Ok, so a patch is required for a new application version even if nothing has changed on the Maya side.
Good to know that we have to get a patch before someone tries a new version.

But what about IF something has changed on the Maya/Houdini side?
Perhaps Houdini Python3 is a bad example case for updates?
But on the other side what about Houdini Solaris (LOP nodes/Husk)?

and they’re up in the forums on Maya 2024 Patch files

Hmm, the files are from 17 May.

a) Ok, thanks.
b) Ok.

There’s nothing stopping you from starting your own VMs using images that
have multiple Workers configured however!

Yes, but then they do not shut down if they are not required any more.

So our usual recommendation is to size the instances to match the work
versus running multiple Workers.

That question arose because of licensing and render apps not using all cores all the time.

1: I gotcha - I'd expect 20k jobs to load on first start in around a minute, but that can depend on hardware and network conditions. The Monitor is constantly pulling updates from the database, and the frequency of that pull is based on the performance settings. In general the number of Workers directly correlates with how much data is changing on the farm, which is why the auto-scaling uses that as its input. It's a byproduct of that tool rather than a reflection of how the product works. Keep in mind the Monitor is displaying the result of a bunch of database queries, so time to update scales with the quantity of data.

The 30 second update time was meant as an example, your experience may be different.

2: If something has changed then the support team might be able to resolve the issue and get a patch out, or you may have to wait for support to come out in the next release.

I can’t share any details about the product roadmap, so I’m not able to comment on Solaris support.

3: Exactly, you’d have to look at building your own automatic termination. Taking existing code from the Spot Event Plugin (in DeadlineRepository10\events\Spot) is my usual recommendation as a jumping off point for that.
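The selection half of that custom termination logic is straightforward to sketch. This is a hedged outline, not the Spot Event Plugin's actual code: in a real event plugin you'd pull the state and last-task timestamps from Deadline's worker info, then call EC2 TerminateInstances (e.g. via boto3) on the matching instance IDs, both of which are left out here:

```python
import time

def workers_to_terminate(worker_status, idle_cutoff_s=600, now=None):
    """Given {worker_name: (state, last_task_finished_epoch)}, return
    workers that have been idle longer than the cutoff. Feeding this
    with real Deadline worker info and wiring the result to EC2
    TerminateInstances is left to your integration."""
    now = time.time() if now is None else now
    return sorted(
        name
        for name, (state, finished) in worker_status.items()
        if state == "Idle" and now - finished > idle_cutoff_s
    )

status = {
    "render-01": ("Idle", 1000.0),      # idle well past the cutoff
    "render-02": ("Rendering", 500.0),  # busy, never terminate
    "render-03": ("Idle", 1950.0),      # idle only 50 s, keep warm
}
print(workers_to_terminate(status, idle_cutoff_s=600, now=2000.0))  # ['render-01']
```

Running a grace period like this (rather than terminating the instant a Worker goes idle) avoids churning instances between back-to-back jobs.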
