We’ve just purchased the Krakatoa professional package and I’m testing Deadline with Krakatoa locally on one machine in free mode before we get it up and running over the network. When partitioning using Deadline I specify 8 tasks per machine (8 CPUs). Deadline seems to be working fine and it’s outputting the PRT files. However, my CPU utilization is around 5%-10%. When I just open 8 instances of Max and specify my own partition range I get near 100% utilization. Is there something I’m missing?
Last time I checked on our network, it took Deadline a lot of time to launch all 8 instances of 3ds Max - when the 8th instance started, the first one was pretty much done with its partition. Given that we have hundreds of computers, we generally don’t rely much on the “Concurrent Tasks” option due to this overhead. I don’t think this is a Krakatoa-specific issue though. If you actually see 8 concurrent tasks being opened and saving in parallel in the Slave dialog and you are still getting 10% utilization, then it is strange… I think I was getting around 40-50%.
What would be more interesting is to measure the actual time it takes to process all partitions on Deadline from submission to finish. Does it take longer or shorter than processing each partition sequentially on the local workstation? If it is faster, then it is probably working right.
If you have more than one slave (the professional package comes with 10 network licenses), try saving from 8 machines with one task each vs. saving from 1 machine with 8 concurrent tasks and compare the total times. I would expect one machine with sequential partitioning to be slower than one machine with concurrent tasks, which in turn would be slower than 8 machines with one task each. There could be a sweet spot where one machine with 8 cores produces partitions faster with 4 concurrent tasks than with 8 due to the Max launching overhead, but I haven’t done any benchmarking yet. I might do that today…
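Just to make that trade-off concrete, here is a rough back-of-envelope model in plain Python (every number in it is a made-up placeholder, not a measurement, and it assumes PFlow stays single-threaded so concurrent copies don’t fight over cores):
[code]# Crude model of total wall-clock time for the three partitioning strategies.
# All timings are hypothetical placeholders - plug in your own measurements.

MAX_LAUNCH_MIN = 2.0    # minutes to launch one copy of 3ds Max (assumed)
PARTITION_MIN  = 130.0  # minutes to simulate + save one partition (assumed)
PARTITIONS     = 8

def sequential_one_machine():
    # one copy of Max works through all partitions back to back
    return MAX_LAUNCH_MIN + PARTITIONS * PARTITION_MIN

def concurrent_one_machine(concurrent_tasks):
    # N copies of Max on one machine; launches are staggered and partitions
    # run in "waves" of N (PFlow is single-threaded, so N copies can coexist)
    launch_overhead = concurrent_tasks * MAX_LAUNCH_MIN
    waves = -(-PARTITIONS // concurrent_tasks)  # ceiling division
    return launch_overhead + waves * PARTITION_MIN

def one_task_per_machine(machines):
    # each machine launches Max once and saves its share of the partitions
    waves = -(-PARTITIONS // machines)
    return MAX_LAUNCH_MIN + waves * PARTITION_MIN

for label, minutes in [
    ("1 machine, sequential",         sequential_one_machine()),
    ("1 machine, 4 concurrent tasks", concurrent_one_machine(4)),
    ("1 machine, 8 concurrent tasks", concurrent_one_machine(8)),
    ("8 machines, 1 task each",       one_task_per_machine(8)),
]:
    print(f"{label}: {minutes / 60:.1f} h")[/code]
With these fake numbers the sweet spot depends entirely on how slow your Max startup is relative to the partition itself, which is exactly what would need benchmarking.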
Unfortunately, our IT guy isn’t available to help set up network rendering right now. I’ve been given the joyous task of testing out Deadline with Krakatoa locally for the time being. We’ve been using a temporary Krakatoa license while waiting for the official one up until now. When partitioning we have been opening multiple instances of Max and setting the affinity to use 1 CPU per instance, typically 5-10 instances depending on the workstation. My point is that each core that was partitioning was running near 100%. Even with the overhead of multiple instances of Max we were seeing excellent scaling in partitioning performance. With Deadline, none of the cores seem to be doing much and overall utilization is very low. Shouldn’t Deadline be as fast as opening up X instances of Max myself and partitioning?
It depends.
On our network, launching Max can take up to 2 minutes since we resync all plugins each time it starts, and having hundreds of machines hitting the network for loading data can be slow. So usually before I can get 8 copies of Max running, the first instance has already finished partitioning…
As I said, if you can SEE 8 instances of Max open and saving (as opposed to starting up) in the Deadline Slave Dialog on the slave and you are getting the particles processed but the CPU is still at 13%, then there is something wrong indeed.
I will do some tests today, although this is in the realm of Deadline support as opposed to Krakatoa.
Ok, I take it back - I created a simple PFlow which produces 20,000 particles from frame 0 to frame 100, spawning 1 particle per 1 unit and colliding with a UDeflector containing a 64K-face teapot.
I sent this to the network as 8 partitions on one machine with 8 CPUs.
I wonder if the setup you are partitioning is hitting the CPU hard enough - saving particles to disk is relatively simple and does not load the cores much; calculating the PFlow simulation is what loads the CPUs, so you need something that really challenges it. If your scene is mainly I/O bound, it might use less CPU. But that would not explain why your workstation gets loaded to 100% when you run the instances yourself.
Here are the actual stats from the job run on one machine with 8 cores vs. run on 8 machines:
One Machine, 8 partitions, 1 partition per core:
[code] started date/time: Oct 04/10 12:04:17
completed date/time: Oct 04/10 15:53:26
elapsed running time: 00d 03h 49m 09s
total task time: 00d 19h 13m 57s
average task time: 00d 02h 24m 14s
median task time: 00d 02h 23m 59s
total task startup time: 00d 00h 44m 59s
average task startup time: 00d 00h 05m 37s
median task startup time: 00d 00h 06m 05s
total task render time: 00d 18h 28m 58s
average task render time: 00d 02h 18m 37s
median task render time: 00d 02h 18m 06s[/code]
Eight Machines, one partition per machine:
[code] started date/time: Oct 04/10 16:11:25
completed date/time: Oct 04/10 18:22:20
elapsed running time: 00d 02h 10m 55s
total task time: 00d 16h 57m 49s
average task time: 00d 02h 07m 13s
median task time: 00d 02h 06m 47s
total task startup time: 00d 00h 10m 03s
average task startup time: 00d 00h 01m 15s
median task startup time: 00d 00h 01m 14s
total task render time: 00d 16h 47m 45s
average task render time: 00d 02h 05m 58s
median task render time: 00d 02h 05m 44s[/code]
What can be seen from this:
*The one machine took almost twice as long as the 8 machines, but it was more efficient - it produced on 8 cores the same output the others produced on 64 (but PFlow is single-threaded, so most of those cores sat idle). In other words, if you had 8 machines with 8 cores each and ran one partition on each core, you could produce over 4 times the amount of particles in the same time the 8 machines took to process one partition each (see the quick calculation after this list)!
*The startup time for the tasks on the one machine was much longer because launching another copy of Max with many other copies already running is slower.
*The actual processing time wasn’t that bad - average task time was 2h24m vs. 2h07m, or 2h18m vs. 2h06m of pure saving if you leave the startup time out.
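For what it’s worth, here is the arithmetic behind the “over 4 times” estimate, done in plain Python directly from the two elapsed times above:
[code]# Sanity check of the "over 4x" claim using the elapsed times from the
# two job reports above (3h49m09s and 2h10m55s, 8 partitions each).

one_machine_8_cores   = 3 * 3600 + 49 * 60 + 9    # seconds, 8 partitions on 8 cores
eight_machines_1_busy = 2 * 3600 + 10 * 60 + 55   # seconds, 8 partitions, 1 busy core each

# Partitions produced per hour of wall-clock time:
rate_one_machine  = 8 / (one_machine_8_cores / 3600.0)    # ~2.1 partitions/h
rate_eight_single = 8 / (eight_machines_1_busy / 3600.0)  # ~3.7 partitions/h

# If all 64 cores of the 8 machines each ran a partition at the
# single-machine per-core rate, the combined throughput would be roughly:
rate_eight_full = 8 * rate_one_machine                    # ~16.8 partitions/h

print(rate_eight_full / rate_eight_single)  # ~4.6x the particles in the same time[/code]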
When the mode is set to “Frames as Tasks”, all frames to be saved for one partition go into their own job. This means that in the Monitor there is one job for each partition, and the tasks of that job represent one or more frames to be saved. If Concurrent Tasks Per Machine were allowed in this mode, a render node with 8 CPUs would try to dequeue 8 tasks at once. The result would be 8 copies of 3ds Max being opened, each one rendering a different frame or set of frames from the SAME partition. But since Particle Flow is typically history dependent, EACH of the 8 copies of Max would have to preroll ALL preceding frames just to save its current frame. For example, let’s assume the machine has only 4 CPUs and tries to do 4 tasks at once:
1st copy of 3ds Max opens, calculates PFlow frame 0 and saves frame 0000
2nd copy of 3ds Max opens, calculates PFlow frames 0 and 1 and saves frame 0001
3rd copy of 3ds Max opens, calculates PFlow frames 0, 1 and 2 and saves frame 0002
4th copy of 3ds Max opens, calculates PFlow frames 0, 1, 2 and 3 and saves frame 0003
1st copy of 3ds Max calculates PFlow frames 1, 2, 3 and 4 and saves frame 0004
2nd copy of 3ds Max calculates PFlow frames 2, 3, 4 and 5 and saves frame 0005
3rd copy of 3ds Max calculates PFlow frames 3, 4, 5 and 6 and saves frame 0006
4th copy of 3ds Max calculates PFlow frames 4, 5, 6 and 7 and saves frame 0007
and so on.
As you can see, there is barely any reason to run 4 copies of Max on the same partition: each copy first wastes time launching and eating memory, then it calculates essentially the same frames as the other 3 copies while saving only one of every 4 to disk. Chances are it won’t be much faster than one copy of Max running consecutively through the frame range and saving every frame.
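To put a number on that, here is a small plain-Python sketch (assuming one frame per task and the 0-100 frame range of the test scene above) counting how many PFlow frames actually get simulated when N concurrent copies of Max split the same history-dependent partition:
[code]# How many frames get simulated in total when N copies of Max split one
# history-dependent partition in "Frames as Tasks" mode (one frame per task).

FRAMES = 101  # frames 0..100

def frames_simulated(concurrent_copies):
    # Copy k saves frames k, k + copies, k + 2*copies, ...
    # Because the sim is history dependent, saving frame f requires
    # simulating every frame from 0 to f, so each copy's work is set by
    # the LAST frame it saves.
    total = 0
    for k in range(concurrent_copies):
        last_saved = max(range(k, FRAMES, concurrent_copies))
        total += last_saved + 1  # frames 0..last_saved
    return total

print(frames_simulated(1))  # 101 -> one copy simulates the range exactly once
print(frames_simulated(4))  # 398 -> four copies simulate almost 4x the frames[/code]
So with 4 concurrent copies you burn roughly 4 times the simulation work (plus 4 Max launches and 4x the memory) to save the exact same 101 frames.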
Thus it is a MUCH better idea to let only one machine run on the whole job and let another machine deal with the next job, and so on. And if you want each partition to run on a different CPU, the “Partitions As Tasks” mode was specifically designed for that purpose and DOES support concurrent tasks…
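In case the layout difference between the two modes isn’t obvious from the Monitor, here is a simplified plain-Python sketch of how the work gets split up (the names and structures are purely illustrative, not the actual Deadline job format):
[code]# Illustrative layout of the two Krakatoa partitioning modes on Deadline.
# Purely a mental model - not the real Deadline job/task data structures.

partitions = 8
frames = range(0, 101)  # frames 0..100

# "Frames as Tasks": one job PER PARTITION, each task = one (or more) frames.
# Concurrent tasks here mean several copies of Max prerolling the SAME partition.
frames_as_tasks = {
    f"partition_{p:02d}": [f"frame_{f:04d}" for f in frames]
    for p in range(1, partitions + 1)
}

# "Partitions as Tasks": ONE job, each task = one whole partition.
# Concurrent tasks here mean 8 copies of Max working on 8 DIFFERENT partitions.
partitions_as_tasks = {
    "partitioning_job": [f"partition_{p:02d}" for p in range(1, partitions + 1)]
}

print(len(frames_as_tasks), "jobs of", len(frames_as_tasks["partition_01"]), "tasks each")  # 8 jobs of 101 tasks
print(len(partitions_as_tasks["partitioning_job"]), "tasks in a single job")                # 8 tasks[/code]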