Need more threads...

anon35454328 · March 14, 2013, 10:59am

Hi,

I seem to be limited to 16 threads on the queue. I currently have 24-core nodes, which are due to be upgraded to 32-core nodes, and I would like to use all cores on a single Arnold task.

Regards

Chris

cbond · March 14, 2013, 3:05pm

I’m sure Ryan or someone will jump in - but to clarify, are you speaking about Arnold being limited to 16 cores on a specific machine?

fwiw - we did extensive testing with threaded applications [our own and others] on a 32-core and a 40-core machine, and in many cases using more than 12-16 cores was slower than [or the same speed as] using fewer. My layman’s understanding: this is due to how memory allocation works for the application across the cores, plus IO etc. For example, on a 40-core machine [4 CPUs, 10 cores each], each CPU has access to its own memory, and if the data one CPU requires is in another CPU’s memory bank, then things slow down a lot. Also, the NUMA architecture generally has one logical CPU which has control of the bus, so any time something needs to pull data from a drive or network, that CPU spends cycles doing that, again reducing performance on the whole.
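
To see which logical CPUs share a memory bank on a given node, something like the following can help. It is a minimal sketch, assuming a Linux machine that exposes its NUMA topology under `/sys/devices/system/node`; the function names `parse_cpulist` and `numa_nodes` are made up for illustration and are not part of Deadline or Arnold.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-9,20-29" into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu, ...]}; an empty dict if the path is missing."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        match = re.search(r"node(\d+)$", path)
        if not match:
            continue
        with open(os.path.join(path, "cpulist")) as handle:
            nodes[int(match.group(1))] = parse_cpulist(handle.read())
    return nodes


if __name__ == "__main__":
    for node_id, cpus in sorted(numa_nodes().items()):
        print("node %d: %d logical CPUs -> %s" % (node_id, len(cpus), cpus))
```

On a machine without that sysfs directory (Windows, OS X) `numa_nodes()` just returns an empty dict, so this only tells you something on Linux render nodes.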

what we found is that after a certain level you get worse than diminishing returns on a single machine - you get less performance with more cores than with fewer. I’ve seen cases where 4 jobs on a single machine sharing resources perform [in aggregate] faster than 1 job with all the cores, as well as the ‘aha’ case [all on the same machine] where 40 cores was much slower than 32 cores, and 32 cores was slightly slower than 24 cores.

we also tested having 40 single-threaded jobs on a 40-core machine, and to put it lightly, that didn’t work so well. The assessment after months of testing: with current architectures, more machines with fewer cores always beat fewer machines with more cores, both for performance and $$$ comparisons. Of course this means greater expense in network infrastructure etc, and some cores is generally better than none [at the time I was comparing 5 8-core machines or 10 4-core machines vs 1 40-core machine].

some applications were also unstable with that many cores - some would barely boot with >24 cores. there is also the negative of ‘all your eggs in fewer baskets’ when you have a few machines with many cores.

just some thoughts passed over. usual caveats apply, your mileage may vary etc.

cb

cbond · March 14, 2013, 3:20pm

one more thing - to deal with machines that DO have many cores:

each slave can have up to 16 concurrent tasks. so on a 40-core machine, you could run 4 V-Ray tasks of the same job concurrently [RAM permitting]. you should test that vs 1 task and see how fast the job gets done.
each task could cover a range of frames, which keeps Max, Maya etc loaded and open and just changes frames. on large jobs with lots of assets this can increase performance and decrease network load time.
you can run multiple slaves to control different jobs at the same time. this uses additional licenses, but lets you run a simulation and a comp [for example] at the same time. this will occasionally get you better performance [overall] because you are pairing a CPU-bound task with a network-bound task. I say occasionally because, with the whole NUMA architecture, this isn’t always the case. at least it doesn’t always perform as well with large datasets as you might hope.
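
The "test that vs 1 task" comparison comes down to simple arithmetic once you have timings. Here is a rough sketch, assuming every frame takes the same time and that frames are rendered in full "waves" of concurrent tasks. The function name `job_wall_time` and the timings in `measured` are placeholders for illustration, not results from any real test.

```python
import math


def job_wall_time(frame_count, concurrent_tasks, seconds_per_frame):
    """Estimate wall-clock seconds for one machine to finish a job.

    Assumes uniform frame times and that the machine renders
    `concurrent_tasks` frames at once, wave after wave.
    """
    waves = math.ceil(frame_count / concurrent_tasks)
    return waves * seconds_per_frame


# Placeholder timings only: replace with numbers measured on your own farm.
measured = {
    1: 100.0,  # seconds per frame with 1 task using all cores
    4: 320.0,  # seconds per frame with 4 concurrent tasks sharing the machine
}

if __name__ == "__main__":
    for tasks, seconds in sorted(measured.items()):
        total = job_wall_time(100, tasks, seconds)
        print("%d concurrent task(s): %.0f s for 100 frames" % (tasks, total))
```

The point is only that per-frame time going up with concurrent tasks can still mean the whole job finishes sooner, which is why the aggregate number is the one to compare.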

cb

rrussell · March 14, 2013, 5:04pm

We’ll bump that up for beta 16. We’ll bump it to 128 for now. We just want to cap the UI to something reasonable.

Cheers,

Ryan

rrussell · March 14, 2013, 5:05pm

Also, if you set the thread count to 0, Arnold should automatically detect the optimal number of threads to use.
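
To illustrate the "0 means auto-detect" convention, here is a small sketch. It assumes auto-detection simply means "one thread per logical CPU reported by the OS", which may not be what Arnold actually does internally; `resolve_thread_count` is a hypothetical helper, not an Arnold or Deadline function.

```python
import os


def resolve_thread_count(requested):
    """Map a requested thread count to a concrete number of threads.

    0 is treated as "use every logical CPU the OS reports". This only
    mimics the convention described above; it is not Arnold's own logic.
    """
    if requested < 0:
        # Negative values are not modelled in this sketch.
        raise ValueError("thread count must be 0 or greater")
    if requested == 0:
        # os.cpu_count() can return None, so fall back to a single thread.
        return os.cpu_count() or 1
    return requested


if __name__ == "__main__":
    print("auto-detected threads:", resolve_thread_count(0))
    print("explicit threads:", resolve_thread_count(12))
```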

cbond · March 14, 2013, 5:06pm

doh! threads not cores ;-/ that’s what I get for reading an email before coffee!!

cb

anon35454328 · March 14, 2013, 7:00pm

I have been finding that running 2 Arnold tasks of 12 threads each is better than 1 task of 24 threads (I changed the Arnold.py submission script to force it to pass -t 24).

I am also looking at 3 tasks using 8 threads each, and at turning off hyperthreading and using 12 threads, i.e. 100% CPU usage.
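
The combinations worth testing can be listed mechanically. This is a minimal sketch that enumerates every way to split a node's logical CPUs evenly between concurrent tasks, capped at the 16 concurrent tasks per slave mentioned earlier in the thread; `even_splits` is a made-up name for illustration.

```python
def even_splits(logical_cpus, max_concurrent_tasks=16):
    """List (concurrent_tasks, threads_per_task) pairs that use every CPU."""
    splits = []
    for tasks in range(1, min(logical_cpus, max_concurrent_tasks) + 1):
        if logical_cpus % tasks == 0:
            splits.append((tasks, logical_cpus // tasks))
    return splits


if __name__ == "__main__":
    for tasks, threads in even_splits(24):
        print("%2d task(s) x %2d thread(s)" % (tasks, threads))
```

For 24 logical CPUs this gives 1x24, 2x12, 3x8, 4x6, 6x4, 8x3 and 12x2, so the 2x12 and 3x8 cases above are two of seven candidates.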

Chris

anon35454328 · March 14, 2013, 7:25pm

update.

If I use 2 Arnold tasks with 12 cores per task per node, the Deadline slave becomes very unstable and frequently falls over (25% of the farm has “died” and we have only been running for 27 minutes) and appears as Stalled in the monitor. Memory is not used heavily (~30%); however, the CPUs are running at 99.90+% busy.

Restarting the slave process resolves it.

Tools -> check for stalled slaves reports no stalled slaves
I do have stalled slaves on the task job reports.

Chris

rrussell · March 15, 2013, 1:08pm

Can you post the logs from the slaves when they crashed?

Also, can you let us know what the Deadline version is for your slaves?

Thanks!

Ryan