Particle Sorting Benchmark from the next build (0.9.13)

Bobo · March 15, 2007, 7:07pm

As you probably know, lighting and drawing particles in Krakatoa involves the sorting of the particle buffer relatively to the light or camera.

In 0.9.12, we discovered a bug that was slowing down the sorting of particles if the buffer was already sorted. In order to fix the bug, we decided to look at alternatives.

The next build will provide a new option where you will be able to select a sorting algorithm of your choice. Right now we have 4, but one is likely to be removed as it brings no real advantage.

Attached is the PDF showing a benchmark with 1, 10 and 20 million particles, one spotlight and the times in milliseconds. There are two set of times: The loading into PCache and lighting, and lighting from PCache. I have not included the total render times, this benchmark reflects only the lighting pass.

Some words on the 3 sorting routines:

Standard Sort is the method provided by C++. It uses one thread (so it runs on one CPU only) and is the slowest one. We added it as a reference, it is generally slower than the old method in 0.9.12.

Radix Sort is the new fast method we implemented as alternative to our own sorting method. It uses one thread but twice the memory of the other sorting algorithms, so the speed is at the cost of memory usage. When rendering from a presorted PCache, this method is at least an order of magnitude faster than the old method and up to 20 times faster than Standard sort.

FF Threaded Sort is Frantic Films version of a standard sort using multiple threads. In this benchmark, we used only 2 threads. Using a quad CPU machine, its results would be potentially twice as fast and thus more or less equal to the Radix Sort results, but at cost of CPU usage instead of Memory - it does not require a second buffer and uses the same amount of memory as the Standard Sort. This method is the original sorting routine found in 0.9.12, but it was fixed to behave correctly when sorting already sorted particles.

Since the particles have to be sorted again from the camera's point of view for the Final Pass, we expect further significant speed up from the Radix and FF Threaded sorting changes in future builds (at the moment of benchmarking, the FF Threaded mode with 2 threads was used for the Final Pass).

Conclusion:

If you have a 32 bit OS (thus memory is a limitation), use FF Threaded Sort. If your machine has 2 or 4 CPUs, you should get twice or 4 times the performance of Standard Sort.

If you have a 64 bit machine with lots of memory but only 1 or 2 CPUs, Radix Sort will give you better performance at the cost of more memory usage. If you have 4 CPUs on that machine, rendering with the FF Threaded mode would be as fast but less memory intensive.

But if you are relighting a scene from PCache with just one light and its position does not change, using Radix Sort will be about 6 times faster in the second and consecutive renders than the Radix sort in the first run, so if memory is not a limitation, Radix Sort should be the preferred mode for relighting from cache!

Borislav "Bobo" Petrov
Technical Director 3D VFX
Frantic Films Winnipeg

Bobo · March 15, 2007, 6:58pm

Some notes on the PDF content.

The first page contains the times for 1, 10 and 20 million particles.

Lighting is the total time spent in the Lighting Pass.

Sorting is the time spent in the sorting routine. That is what the benchmark is about.

Drawing is the time spent drawing the particles into the attenuation buffer. It is more or less constant and scales linearly between 1,10 and 20 million particles.

In general, all 3 portions appear to scale in a linear fashion (lighting 10 million particles takes 10 times longer than 1 million, 20 million is twice as slow as 10). Which is Good News.

The Loading/Cache tables on the first page show the difference between the Sorting time when Loading and the Sorting time when using an already sorted Cache of all routines.

For example, 3.08 means that sorting using Standard Sort from PCache is 3.08 times faster than using Standard Sort without PCache, but Radix Sort from Cache is 18.95 times faster than Standard Sort when Loading and so on.

The second page shows the comparison of brute force sorting (loading from disk, not from PCache) of all methods. The diagonal always contains 1 because the sorting method is compared to itself.

You will notice that Radix Sort gets better as the particle count increases - it is 3.35 times faster than Standard Sort with 1 million, 3.48 with 10 million and 3.73 times faster with 20 million particles.

Radix Sort is twice as fast when compared to FF Threaded Sort running on two threads. (1.97, 1.9, 2.06 with 1,10 and 20M). Since FF Threaded Sort is about 1.8 times faster than Standard Sort, we can assume that running on 4 CPUs could make it around 3.5 times faster than Standard Sort and almost as fast as Radix Sort, but without the memory overhead… When we get it to run on 4 CPUs, we will post an updated benchmark reflecting our findings.

Bobo · March 16, 2007, 1:02am

The machine the above benchmarks were performed on had WindXP 64 running Max 9 32 bit, 4GB RAM and 2 DualCore Athlon Opterons, each of the 4 cores running at 2.1 GHz.

So when I went home I decided to benchmark my little home PC - it has WinXP 32, Max 9 32bit, 1 GB RAM and a 2.01 GHz Athlon 64 3200+

Using 10 million particles, I got the following results for the sorting:

Standard, Loading: 24750ms

Standard, from Cache: 8328ms

Indexed Radix, Loading: 18437ms

Indexed Radix, from Cache: 18313ms

FF Threaded, Loading: 24781ms

FF Threaded, from Cache: 8297ms

NOTE that Indexed Radix Sort was not featured in the PDF. Instead of sorting the positions directly using a full second buffer as the regular fast Radix Sort does, this method sorts using a second buffer of particle indices. It uses 8 bytes per particle so the memory usage is a lot lower than regular Radix Sort, but it is also by far not as fast. Also, when using a pre-sorted buffer, it does not speed up much. I benchmarked it here because my home machine failed to allocate enough memory for regular Radix Sort when lighting 10 million particles. After all, it has only 1 GB of RAM.

So I decided to benchmark with 5 million particles - after all, we already know that Krakatoa tends to scale in a linear fashion, so the performance with 5 million particles should be about twice as fast, right?

Here are the results from the 5M benchmark:

Standard, Loading: 11640ms

Standard, from Cache: 3953ms

Radix, Loading: 3641ms

Radix, from Cache: 468ms

Indexed Radix, Loading: 11687ms

Indexed Radix, from Cache: 8516ms

FF Threaded, Loading: 11735ms

FF Threaded, from Cache: 3938ms

As you can see, the prediction for the speed up factor of 2 with half the particle count was totally confirmed! This time, I was able to use the fast Radix (Two-Buffer) sort which again proves to be the best sorting option if memory is not an issue.

Also note that the speed of the FF Threaded Sort on a single CPU is more or less identical to the single-threaded Standard Sort, so FF Threaded mode only makes a difference on Dual and Quad systems (the final version will support any Power-Of-Two number of CPUs - 1,2,4,8,16 etc. for future compatibility with upcoming monster machines.)

The Indexed Radix sort does not provide any advantage over the other methods and might be removed in the future.

It is interesting to point out that my office machine with 4GB or RAM failed to render 40 million particles using Radix Sort just like my home machine with 1GB of RAM failed to render 10 million particles. You see the pattern?

When I get a working 64 bit build of this version of Krakatoa, I will try to benchmark some higher counts in Max 9 64 to see how they compare. I still expect Radix to be fastest as long as there is enough memory. We will also enable threading by CPU count on Monday and test FF Threaded with 4 threads to compare to Radix…

Here are the actual rendering times (including all loading, sorting, lighting, sorting, drawing phases) with 5 million particles:

Standard Loading: 52 seconds

Standard From Cache: 22 seconds

Radix Loading: 31 seconds

Radix From Cache: 19 seconds

Indexed Radix Loading: 38 seconds

Indexed Radix From Cache: 29 seconds

FF Threaded Loading: 39 seconds

FF Threaded From Cache: 22 seconds.

We can expect an additional speed up from switching the pre-drawing sort for the final pass to the user-defined sorting methods in the future, so Radix Sort rendering should be able to go down to

Radix Loading: 22 seconds

Radix From Cache: 15 seconds

with some luck on my home machine - the last pass sort is currently hard-coded to FF Threaded sort which is 8 resp. 3.5 seconds slower than Radix…

Chad · July 10, 2007, 5:27pm

Just checking Beta 17, and the FF Threaded Sort is only using 2 of my 4 CPUs. Is that expected to use all available processors or not yet?

Chad

Bobo · July 10, 2007, 5:48pm

Just checking Beta 17, and the

FF Threaded Sort is only using

2 of my 4 CPUs. Is that

expected to use all available

processors or not yet?

It is as expected.

We decided to leave these changes for a future update (along with some other tweaks that could make Krakatoa even faster) because we feel we destabilized it enough already short before release

We know how to do it, but it would be too dangerous at this point.