STOKE MX RC2 Benchmarks

Here is a recent benchmark using Stoke Release Candidate 2.

I already posted some numbers in another thread, but now I have all runs with various Thread counts and would like to discuss what they show.
The setup used a fairly default FumeFX simulation of a burning Teapot. Emission was performed from the surface of the Teapot, but I also tested emission from a FumeFX Source and it produced the same results in the same time.
The simulation generated 100,000 particles per frame over 101 frames for a total of 10 million on the last frame.

Saving to PRTs amounted to 15.2GB of data on disk. It made no difference whether the saving was done to an HDD or an SSD, since the saving performance was CPU-bound the whole time. Most of the time was spent zipping the streams.
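If you want to get a feel for how lopsided that is, here is a rough Python sketch - not Stoke code, just zlib standing in for the "zipping" of the streams and a plain file write standing in for the disk, with a made-up ~160 MB payload:

```python
import os
import time
import zlib

# Hypothetical stand-in for one frame of particle data (~160 MB, half zeros
# so it compresses somewhat like real channel data might - purely illustrative).
payload = (os.urandom(512) + b"\x00" * 512) * (160 * 1024)

t0 = time.perf_counter()
compressed = zlib.compress(payload, 6)      # deflate, the same general family of compression as "zipping"
t1 = time.perf_counter()

with open("frame_test.bin", "wb") as f:     # plain sequential write to whatever disk this runs on
    f.write(compressed)
t2 = time.perf_counter()

print(f"compress: {t1 - t0:.2f} s   write: {t2 - t1:.2f} s   "
      f"compressed to {len(compressed) / len(payload):.0%} of original")
```

On a typical workstation the compress line dwarfs the write line, which matches what we saw - swapping the HDD for an SSD moves the small number, not the big one.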

The Particle Flow run used an Integration Step of Half Frame; it would have been 213 seconds faster with a One Frame step.
I used an HP Z420 Quad Xeon machine with 32 GB of RAM. The Memory Limit was set to 24GB. Simulating to the Memory Cache only produced 19GB of actual data. With less RAM, the performance of Stoke would of course have been slower - I might run some more tests with 8GB and 4GB to show what happens.

In the table / graph below, you can see the comparison of simulating with PFlow+FumeFX Follow+Saving with Krakatoa vs. Stoke simulating on 8 threads and saving on 1 to 8 threads.


You will notice that the Stoke Simulation time is around 10 times shorter than the PFlow simulation time. Switching PFlow to a 1 Frame Integration Step would reduce that to 5x. Saving with Stoke using only one thread, though, is only 1.41 times faster.

As we go to 4 cores on a Quad CPU, we get quite a nice speed-up from saving 2, 3, or 4 PRT files in parallel in the background - two cores save 1.9x faster than one, three cores 2.7x faster, and four cores 3.5x faster.

Unfortunately, this does not hold true for Hyper-Threading. Adding 1 to 4 threads via HT does not improve performance all that significantly - running on 8 threads is only 4.6x faster than running on one thread! Still, it squeezes a few more seconds out of the machine: 8 threads are still 95 seconds faster overall than 4 threads, but the addition of 4 Hyper-Threads gives us only a 30% performance boost.
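Put in terms of parallel efficiency (just dividing the speed-up factors quoted above by the thread count), the drop-off once Hyper-Threading kicks in is easy to see:

```python
# Saving speed-ups relative to a single thread, as quoted above.
speedups = {1: 1.0, 2: 1.9, 3: 2.7, 4: 3.5, 8: 4.6}

for threads, s in speedups.items():
    # Efficiency = fraction of the ideal "Nx faster with N threads" we actually get.
    print(f"{threads} saving thread(s): {s:.1f}x faster, {s / threads:.0%} efficiency")

# Roughly 95%, 90% and 88% on the physical cores, but barely above half
# once the 4 extra Hyper-Threads are counted as full threads.
```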

Compared to PFlow, 4 Threads finish partitioning nearly 5x faster, and 8 Threads are 6.5 times faster.

Note that the Simulation and Saving in PFlow are sequential - each frame is calculated, then saved to disk, then the next frame is processed. Thus, the sum of the Simulation and Saving (labeled as Flush) times equals the Total time.
In the case of Stoke, the Simulation and Saving are asynchronous - they run in parallel thanks to the memory buffering and the background caching threads. As a result, the Total time is shorter than the sum of the Simulation and Saving times. The only exception is saving with one thread, since the saving is significantly slower than the simulation and has to do most of its work after the simulation has finished.
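A tiny model of the two pipelines, with made-up per-frame costs just to show the bookkeeping (none of these numbers are measured):

```python
# Hypothetical per-frame costs in seconds, purely illustrative.
frames = 101
sim_per_frame = 1.0     # advect/generate one frame
save_per_frame = 1.4    # zip + write one frame

# Sequential (PFlow-style): each frame is simulated, then saved, then the next
# frame starts, so the total is simply the sum of both costs.
sequential_total = frames * (sim_per_frame + save_per_frame)

# Asynchronous (Stoke-style): saving runs on background threads while the next
# frames are simulated, so with enough memory the slower of the two sets the pace
# and only the left-over "flush" happens after the last frame is simulated.
async_total = max(frames * sim_per_frame, frames * save_per_frame)

print(f"sequential: {sequential_total:.0f} s   asynchronous: ~{async_total:.0f} s")
```

The single-threaded saving case from the table is the degenerate version of this - the save side is so much slower that most of it ends up running after the simulation anyway.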

Last but not least, as reported in the other thread, the final results of the Stoke simulation looked better. The FumeFX grid was set too narrow and the PFlow particles were escaping the grid, producing linear streaks. I decided not to delete them because I was interested in the final amount of data being saved.

DATA!!! Thanks :smiley:

Are you writing large enough files to saturate the O of IO? :smiley: You’re writing like 170 MB a frame give or take, right? Really, the biggest benefit of something like a SATA III SSD or Fusion IO is read IMO.

I would like to add my two cents from my experience with HT. The best increase in performance I have EVER seen is right around 30% - it has never, in any simulation or render mode, surpassed that percentage.

When generating a sequence of PRT files, all my measurements have shown that the process is limited by the CPU in the compression algorithm, so the I/O bandwidth has almost no impact. Assuming a traditional hard disk, of course - something even slower (e.g. a USB 2.0 external hard disk or a floppy drive :stuck_out_tongue: ) would impact the performance negatively.

Regardless of my silly questions, that statement right there is huge good news! Looking forward to trying RC2 - it just so happens I have a rf Quantum Force sim I am playing with, and I am curious to see how well I can upscale the particle count.

BTW, out of curiosity, what is the largest file you have tried to write?

In the posted benchmark, the 10MP on the last frame amounted to a bit over 300MB. This is not the largest file I have ever written, but a good value for comparison.
I was watching Krakatoa saving it and it took about 25 seconds.
Assuming that the last 8 frames of the Stoke simulation are close to that size, I was attempting to write 8x300 = 2.4GB at once. But the saving is performed by zipping blocks of data of a certain size. We added internal controls to test different blocks and different buffer sizes and NOTHING changed. When Darcy profiled the actual saving process, it turned out that nearly all the time is spent compressing. We would probably be able to take advantage of higher bandwidth if we weren’t compressing, but who wants even larger PRT files? :slight_smile:
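Turning those observations into throughput makes it obvious why the disk never mattered (the HDD/SSD figures below are typical ballpark write speeds I am assuming, not something we measured):

```python
# Single-threaded save of the last frame, from the numbers above.
frame_mb = 300.0          # compressed PRT size on disk, MB
save_sec = 25.0           # observed single-threaded save time, seconds
throughput = frame_mb / save_sec

# Assumed ballpark sequential write speeds (not measured in this benchmark).
hdd_mb_s = 100.0
ssd_mb_s = 400.0

print(f"PRT write rate while compressing: ~{throughput:.0f} MB/s")
print(f"HDD headroom: ~{hdd_mb_s / throughput:.0f}x, SSD headroom: ~{ssd_mb_s / throughput:.0f}x")
# ~12 MB/s trickling onto a disk that could take 100+ MB/s - the zipping is the limit.
```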

That’s what Deadline is for. :slight_smile: You could have a job that compresses the uncompressed files.

Haha I am all for the smallest file size possible!

LOL Chad, anyone hiring a particle compressionist? :smiley:

Ok, as promised, I tested the fastest settings (8 threads saving) using various Memory Limits to see how the installed RAM on your system could affect your Stoke performance.
The results were as impressive as they were surprising…

In the original benchmark, I was simulating up to 10 million particles with a Memory Cache set to 24 GB limit on a machine with 32 GB RAM. The actual amount of memory needed to fit all particles in memory for all frames was around 19 GB.
The results from that simulation printed by the Partitioning tool looked like this:

Generate Time: 10.071 sec.
Advect Time: 72.353 sec.
Update Time: 0.313 sec.
STOKE Partition 1 of 10 SIMULATION Time: 98.608 sec.
STOKE Partition 1 of 10 CACHE FLUSH Time: 139.22 sec.
TOTAL STOKE PARTITIONING TIME: 237.906 sec.

So then I set the Stoke Memory Cache Limit to 8 GB and simulated again. Here is the encouraging result:

Generate Time: 11.073 sec.
Advect Time: 63.3 sec.
Update Time: 0.267 sec.
STOKE Partition 1 of 10 SIMULATION Time: 197.871 sec.
STOKE Partition 1 of 10 CACHE FLUSH Time: 37.701 sec.
TOTAL STOKE PARTITIONING TIME: 235.857 sec.

Say what?! The time is practically identical. In fact, it is slightly shorter, but that is within the margin of error - if I ran it multiple times, I would probably get results between 235 and 237 seconds. The amazing part is that the total time to partition exactly the same particles using a lot less memory was the same. What changed is the split between the time it took to simulate and the time it took to save to disk. Since the two processes run in parallel, and the simulation had to wait for memory to be freed by the background threads before it could continue, about 100 seconds were added to the simulation time, but they were removed from the flushing time (because the saving was running at full throttle the whole time in both runs). The 37 seconds is the time it took AFTER the simulation finished to flush the remaining data to disk, but the actual saving was also running during most of the 197 seconds of simulation.
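Laying the two printouts side by side shows the same thing numerically - the saving threads do the same total amount of work either way, they just overlap a different portion of the simulation (all numbers copied from the printouts above):

```python
# Numbers from the two partitioning printouts above (seconds).
runs = {
    "24 GB limit": {"sim": 98.608,  "flush": 139.220, "total": 237.906},
    "8 GB limit":  {"sim": 197.871, "flush": 37.701,  "total": 235.857},
}

for name, r in runs.items():
    # "flush" is only the saving that happens AFTER the simulation finished;
    # the rest of the saving overlapped the simulation itself.
    print(f"{name}: sim {r['sim']:.0f} s + flush {r['flush']:.0f} s "
          f"= ~{r['sim'] + r['flush']:.0f} s (reported total {r['total']:.0f} s)")

# Less memory just moves ~100 s from the flush column into the sim column;
# the last PRT hits the disk at essentially the same moment.
```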

The implication is that if you are Partitioning, the memory limit DOES NOT MATTER, as long as you have enough memory to keep the CURRENT particle set of the current frame in memory! But if you are Simulating via the Stoke UI, it would take 100 seconds longer before you are able to scrub interactively while the flushing continues in the background. So for fast iterations, the Memory Limit plays a big role in giving you back control sooner, but for final PRT generation, what you care about is when the last PRT will be written to disk, and in that case the memory limit does not affect it!

Want further proof? Here are the Partitioning results with only 4GB memory limit:

Generate Time: 12.868 sec.
Advect Time: 62.829 sec.
Update Time: 0.472 sec.
STOKE Partition 1 of 10 SIMULATION Time: 208.743 sec.
STOKE Partition 1 of 10 CACHE FLUSH Time: 28.77 sec.
TOTAL STOKE PARTITIONING TIME: 237.728 sec.

Pretty much the same time as above!

Note that on the last frame, 10MP require about 3GB to store the current frame, so limiting the memory further would not have any effect - Stoke will go over the limit if the count of the current particle set requires it… So setting the limit to 2GB or 1GB produces exactly the same results, and ends up using about the same amount of memory - slightly over 3GB on the last frame…
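For reference, that ~3 GB floor works out to roughly 300 bytes per particle for this setup - a rough estimate backed out of the numbers above, including whatever per-particle overhead Stoke carries, not an exact figure:

```python
# Rough per-particle memory footprint on the last frame, from the numbers above.
particles = 10_000_000
frame_gb = 3.0                                 # approximate memory for the current frame
bytes_per_particle = frame_gb * 1024**3 / particles

print(f"~{bytes_per_particle:.0f} bytes per particle")   # ~320 bytes with all channels and overhead
```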

Both Seattle and Vancouver are slightly shaking as we do a happy dance on the West Coast… :smiley:

That’s pretty crazy. So in essence, allocating 4 gigs for every core would be the optimal speed/production ratio on an 8 physical core, 32 GB RAM machine - assuming of course that Windows is doing its job of correctly allocating resources (we can only hope).

I find it interesting that the Advection time keeps going down. Am I missing something there?

The allocation is global, not per core.
There is one memory limit for the whole Stoke object. Each Stoke object (if you have more than one) can allocate memory to hold at least the current frame, and if the limit is high enough, several frames.
When simulating with Disk Cache off, you want all the memory you can afford to be allocated for the Memory Cache. This way, your simulation can fit fully in memory and can be scrubbed quickly in the viewports, or rendered directly by Krakatoa. If you are only testing, you can cache every Nth frame (where N is 2, 5, or even 10) to reduce memory usage, but still get a pretty good idea about the final simulation.
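As a rough feel for how much the every-Nth-frame trick saves in this particular benchmark (100,000 new particles per frame over 101 frames, and roughly 19 GB for the full cache, per the numbers above - the per-particle cost is simply backed out of those figures, so treat this as an estimate):

```python
# Rough estimate of Memory Cache usage when caching only every Nth frame.
new_per_frame = 100_000      # particles born per frame in the benchmark above
frames = 101
full_cache_gb = 19.0         # measured size of the full Memory Cache from the post above

# Particle count grows linearly, so frame f holds roughly f * new_per_frame particles.
total_cached = sum(f * new_per_frame for f in range(1, frames + 1))
gb_per_particle = full_cache_gb / total_cached

for n in (1, 2, 5, 10):
    cached = sum(f * new_per_frame for f in range(1, frames + 1) if f % n == 0)
    print(f"caching 1 of every {n} frames: ~{cached * gb_per_particle:.1f} GB")
```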

When simulating with Disk Cache on, the memory is first given to the Memory Cache and from there the particles are moved to the Saving (Serialization) Buffer, from where they are saved by the background thread(s). During the simulation, all threads are always used to simulate, so on a machine with 4 cores + HT you always get 8 threads simming. During that time, if you set the Cache Threads limit to 0, 4 threads will be used to save 4 PRTs to disk. Once the simulation is done, 7 threads will be used to dump the rest of the buffer to disk (one thread is left for you to interact with Max). The more memory you have, the faster the simulation will finish, because it does not have to wait for the saving threads to free up memory.

If you set the memory limit to, say, 4GB, the simulation will have to stop between frames and wait for memory to become free. That's why the total simulation time is a lot longer, while the actual advection time is pretty similar. If you do have 32 GB, it is a good idea to allow Stoke to use as much memory as needed so the simulation finishes faster. Then you can start interacting with Stoke to review the simulation and decide whether you want to keep the cache or stop the saving even before it is done.
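The behavior is basically the classic bounded-buffer producer/consumer pattern. Here is a minimal Python sketch of the idea - toy code, not Stoke's implementation, and the frame costs, thread count and buffer size are all made up:

```python
import queue
import threading
import time

MEMORY_LIMIT_FRAMES = 4     # hypothetical: how many frames fit under the memory limit
SAVER_THREADS = 2           # hypothetical background cache/saving threads

# The bounded queue stands in for the Memory Cache / Serialization Buffer:
# when it is full, the "simulation" blocks until a saver frees a slot.
buffer = queue.Queue(maxsize=MEMORY_LIMIT_FRAMES)

def simulate(frame_count):
    for frame in range(frame_count):
        time.sleep(0.03)            # pretend advection work for one frame
        buffer.put(frame)           # blocks here when the buffer is full
    print("simulation done - in the Stoke UI you could scrub now")

def saver():
    while True:
        frame = buffer.get()
        if frame is None:           # shutdown signal
            break
        time.sleep(0.12)            # pretend zipping + writing one PRT
        buffer.task_done()

threads = [threading.Thread(target=saver) for _ in range(SAVER_THREADS)]
for t in threads:
    t.start()

start = time.perf_counter()
simulate(20)
sim_done = time.perf_counter()
buffer.join()                       # wait for the remaining "cache flush"
for _ in threads:
    buffer.put(None)
for t in threads:
    t.join()
print(f"sim: {sim_done - start:.2f} s   total: {time.perf_counter() - start:.2f} s")
```

Raise MEMORY_LIMIT_FRAMES and the "simulation done" line prints sooner while the total barely moves - roughly the same pattern the partitioning numbers show.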

My benchmarks used the Partitioning tools though. When partitioning, by default all 8 cores perform the simulation, and 8 saving threads are set up to zip the data and dump it to PRTs. The memory is given to the Serialization Buffer because no interactive scrubbing is needed. This is what my previous post was mostly about - the memory limit does not matter because the Partitioning dialog does not release the main thread when the simulation is done, so it does not affect you at all: you are always going to wait the same time per partition regardless of how much memory you gave Stoke. The only requirement is enough memory to process the largest frame of the simulation.

Because of this, it makes no sense to partition with concurrent tasks on Deadline - I removed those controls from the submitter in RC2. Use all CPUs you have, and let it finish as fast as it can. Throw one machine at each partition and you will get faster results than splitting partitions between Max instances on the same machine (due to the Max startup overhead).

The short version - if you have 32 GB, use them and Stoke will finish simulating faster so you can review the results earlier. If you need the memory for something else, your PRTs will take about the same time to save, but the simulation part will take longer. If you are partitioning, memory limit does not matter.

We are not sure why the advection time goes down as the memory limit goes down. The total simulation time goes up, so that makes sense. My suspicion is that when there is more memory, the 8 saving threads have more to do all the time and thus steal some resources from the 8 simulation threads. You can see the same behavior as you go from 1 to 8 saving threads - the Advection time goes up. So as you reduce the memory buffer, you are making the saving threads do less and thus allowing the simulation threads to use more CPU cycles.
But this is just a guess… Also, the 72 seconds could have been a glitch - the 8 threads test in the top post shows only 65 seconds of Advection time in that run, with the same total time. So you can assume that the Advection times row is more like 65, 63, 62…

The advection step is the most obviously CPU-hungry portion of the algorithm, so likely there is more CPU time available when the memory limit is lower. The actual cause is probably much more complicated and not really worth anyone’s time investigating.

I wouldn’t bother reading too far into the individual numbers. The only two useful numbers to track are total time (time from start to all prts being on disk) and total sim (time from start to being able to scrub the timeslider again).

Ahhh ok, not sure what you might pull out of a bag-o-tricks. Completely understood - thanks for the detailed explanation.

Sure, of course overall time and processor saturation are the most important things to observe. It just seemed like an interesting pattern developing, nothing more - and it seems obvious that it wouldn't go much further, since at some point you wouldn't be allocating enough memory :slight_smile: