Hi,
I seem to be limited to 16 threads on the queue, while I currently have 24-core nodes which are due to be upgraded to 32-core nodes, and I would like to use all cores on a single Arnold task.
Regards
Chris
I’m sure Ryan or someone will jump in - but to clarify, are you speaking about Arnold being limited to 16 cores on a specific machine?
FWIW - we did extensive testing with threaded applications [our own and others] on 32-core and 40-core machines, and in many cases using more than 12-16 cores was slower than [or the same speed as] using fewer. My layman's understanding: this is due to how memory allocation works for the application across the cores, I/O, etc. For example, on a 40-core machine [4 CPUs, 10 cores each], each CPU has access to its own memory, and if the data one CPU requires is in another CPU's memory bank, then things slow down a lot. Also, the NUMA architecture generally has one logical CPU which has control of the bus, so any time something needs to pull data from a drive or the network, that CPU spends cycles doing that, again reducing performance on the whole.
What we found is that after a certain point you get worse than diminishing returns on a single machine - you get less performance with more cores than with fewer. I've seen cases where 4 jobs on a single machine sharing resources perform [in aggregate] faster than 1 job with all the cores, as well as the 'aha' case [all on the same machine] where 40 cores was much slower than 32 cores, and 32 cores was slightly slower than 24 cores.
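[Not from the original posts - a minimal sketch of one way to act on the NUMA point above: keep one render job's threads and memory on a single socket, and run one such job per socket. It assumes a Linux host where cores 0-11 belong to socket 0 (check the real topology with lscpu first), and the scene path is a placeholder.]

import os
import subprocess

# Illustrative only: pin an Arnold kick render to the first CPU socket so its
# threads and memory allocations stay on one NUMA node. The core IDs are an
# assumption (cores 0-11 on socket 0); verify your topology before using them.
FIRST_SOCKET_CORES = set(range(12))

def pin_to_first_socket():
    # Runs in the child process just before exec, so every thread Arnold
    # later spawns inherits this affinity mask.
    os.sched_setaffinity(0, FIRST_SOCKET_CORES)

subprocess.run(
    ["kick", "-t", "12", "scene.ass"],  # placeholder scene; -t matches the pinned core count
    preexec_fn=pin_to_first_socket,
    check=True,
)

Running one job like this per socket is essentially the "4 jobs sharing resources beat 1 job with all the cores" observation, made explicit.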
We also tested having 40 single-threaded jobs on a 40-core machine, and to put it mildly, that didn't work so well. The assessment after months of testing: with current architectures, more machines with fewer cores always beat fewer machines with more cores, both on performance and on cost. Of course this means greater expense in network infrastructure etc., and some cores are generally better than none [at the time I was comparing 5 8-core machines or 10 4-core machines vs 1 40-core machine].
Some applications were also unstable with that many cores - some would barely boot with >24 cores. There is also the downside of 'all your eggs in fewer baskets' when you have only a few machines with many cores.
Just some thoughts passed along. Usual caveats apply, your mileage may vary, etc.
cb
One more thing - to deal with machines that DO have many cores:
cb
We’ll bump that up for beta 16 - to 128 for now. We just want to cap the UI at something reasonable.
Cheers,
Also, if you set the thread count to 0, Arnold should automatically detect the optimal number of threads to use.
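[A tiny illustrative sketch, not from the original posts, of how a submission script might build that argument; the helper name is made up, only the -t flag itself appears in this thread.]

def arnold_thread_arg(requested_threads=0):
    # -t 0 asks Arnold to auto-detect the thread count; any positive value forces it.
    return ["-t", str(requested_threads)]

print(arnold_thread_arg())    # ['-t', '0']  -> let Arnold decide
print(arnold_thread_arg(24))  # ['-t', '24'] -> force 24 threads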
doh! threads not cores ;-/ that’s what I get for reading an email before coffee!!
cb
I have been finding that running 2 Arnold tasks of 12 threads each is better than 1 task of 24 threads (I changed the Arnold.py submission script to force it to pass -t 24).
I am also looking at 3 tasks using 8 threads each, and at turning off hyperthreading and using 12 threads, i.e. 100% CPU usage.
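[Purely to make the arithmetic explicit - an illustrative helper, not part of the actual Arnold.py script: split the node's physical cores evenly across the concurrent Arnold tasks.]

def threads_per_task(physical_cores, concurrent_tasks):
    # Evenly divide a node's physical cores across its concurrent render tasks.
    return physical_cores // concurrent_tasks

print(threads_per_task(24, 2))  # 12 threads each with 2 tasks per node
print(threads_per_task(24, 3))  # 8 threads each with 3 tasks per node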
Chris
Update:
If I use 2 Arnold tasks and 12 cores per task per node, the Deadline slave becomes very unstable and frequently falls over (25% of the farm has “died” and we have only been running for 27 minutes) and appears as Stalled in the Monitor. Memory is not used heavily (~30%), however the CPUs are running at 99.90%+ busy.
Restarting the slave process resolves it.
Tools -> check for stalled slaves reports no stalled slaves.
I do, however, see stalled slaves in the task job reports.
Chris
Can you post the logs from the slaves when they crashed?
Also, can you let us know what the Deadline version is for your slaves?
Thanks!