During a render, if the CPU usage in Task Manager is pegged at one full core (so 25% on my quad core), does that indicate that memory is not the bottleneck?
We’re considering upgrading the memory on some machines from 667MHz to 800MHz, but not if we won’t see a performance increase in Krakatoa.
Our high particle count renders (~1B) are going really slow, even when we have a valid PCache and are just generating a new LCache.
I would say if you get 25% on a quad machine, the CPU is the bottleneck.
We are multi-threading Krakatoa 1.2 heavily as we speak: culling is 3.5 times faster, and shading (calculating materials and colors) is over 4 times faster on a quad PC!
And this is just the beginning…
When generating the LCache, Krakatoa performs two steps: sorting the particles relative to the light (currently multi-threaded), and calculating the attenuation, which is a process similar to the final pass rendering - one particle at a time on one CPU. We are going to switch the sorting to Intel's Threading Building Blocks technology, which might make it even faster than it is now. Then we will have to figure out how to do buckets per thread for the drawing, which is the most difficult part of the project.
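Very roughly, the sorting step could look something like the sketch below once it is on TBB (the particle layout and names are invented here for illustration, not the actual Krakatoa data structures):

```cpp
// Sketch only: particle layout and field names are made up for illustration.
#include <tbb/parallel_sort.h>
#include <vector>

struct Particle {
    float pos[3];
    float density;
};

// Sort particles by their distance along the light's direction, so that the
// attenuation pass can later walk them front-to-back from the light.
void sort_particles_for_light(std::vector<Particle>& particles,
                              const float lightDir[3])
{
    tbb::parallel_sort(particles.begin(), particles.end(),
        [&](const Particle& a, const Particle& b) {
            float da = a.pos[0]*lightDir[0] + a.pos[1]*lightDir[1] + a.pos[2]*lightDir[2];
            float db = b.pos[0]*lightDir[0] + b.pos[1]*lightDir[1] + b.pos[2]*lightDir[2];
            return da < db; // nearest to the light first
        });
}
```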
If everything goes well, ANY portion of the Krakatoa rendering should scale with the number of CPUs.
Otherwise, we might end up having just some stages of the pipeline multi-threaded until we figure out the rest…
Borislav "Bobo" Petrov
Technical Director VFX
Frantic Films Winnipeg
Sounds great, but why don’t you just split up the particles not by XY position in screen space, but by distance to camera? I can see where buckets would be really useful, but I can also see where it would be a lot simpler to just divide the depth of the scene into a multiple of the threads desired and just alpha composite the results as a separate process. Depth of field would seem to be a lot simpler that way, for sure. Filtering too.
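A rough sketch of the kind of split I mean (the types and names are just made up to illustrate the idea): bin the particles into N depth slabs along the camera's view direction, then hand each slab to its own thread.

```cpp
// Illustrative sketch only, not Krakatoa code: partition particles into
// depth slabs along the camera direction, one slab per thread.
#include <algorithm>
#include <vector>

struct Particle { float pos[3]; };

std::vector<std::vector<Particle>> split_by_camera_depth(
    const std::vector<Particle>& particles,
    const float camDir[3], float nearDepth, float farDepth, int numSlabs)
{
    std::vector<std::vector<Particle>> slabs(numSlabs);
    float range = farDepth - nearDepth;
    for (const Particle& p : particles) {
        // Depth along the camera's view direction.
        float depth = p.pos[0]*camDir[0] + p.pos[1]*camDir[1] + p.pos[2]*camDir[2];
        int slab = static_cast<int>((depth - nearDepth) / range * numSlabs);
        slab = std::max(0, std::min(numSlabs - 1, slab)); // clamp to a valid slab
        slabs[slab].push_back(p);
    }
    return slabs;
}
```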
Guess we’ll hold off on the memory upgrade and think about adding a second quad core CPU.
We can do this with Additive mode, where the order does not matter at all and lighting is not necessary.
But for Volumetric shading, it simply is not trivial. The light attenuation for any particle depends on the attenuation calculations of ALL particles before it relative to the light source, so they cannot be processed out of order. That's why we are looking into splitting the image into N regions and processing them in parallel, then fighting with the issues at the borders.
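To illustrate the order dependency for a single ray through the volume, here is a toy sketch (not actual Krakatoa code, the names are made up): each particle's lighting depends on the running transmittance of everything sorted in front of it, which is exactly what breaks naive splitting across threads.

```cpp
// Toy illustration of why volumetric attenuation is order-dependent.
// Field and function names are invented; this is not Krakatoa code.
#include <cmath>
#include <vector>

struct LitParticle {
    float density;   // how much light this particle absorbs/scatters
    float lighting;  // light reaching this particle (output)
};

// Particles must already be sorted nearest-to-farthest from the light.
// The light seen by particle i depends on the densities of ALL particles
// 0..i-1 along the same ray - the running total is the serial dependency.
void accumulate_attenuation(std::vector<LitParticle>& sorted, float extinction)
{
    float transmittance = 1.0f; // fraction of the light surviving so far
    for (LitParticle& p : sorted) {
        p.lighting = transmittance;                          // light arriving here
        transmittance *= std::exp(-extinction * p.density);  // absorb some of it
    }
}
```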
Right now, we are speeding up those parts of the program that are well suited for threading. The drawing portion will be the hardest.
For the lighting, you don’t need DOF, and you don’t do motion blur, right? I forget. Anyway, buckets would probably be perfect for that, as you need to see the order of the points from the light.
But for the drawing, I thought that the volumetric mode just (sorta) multiplied the inverse of the distance from the camera by the density? So that 5 points per pixel (ppp) would have less density at 100 scene units from the camera than 5 ppp at 10 scene units from camera. If the lighting is pre-calculated in the lighting stage, and the distance from the camera for each “slice” is nearly the same, then what’s the complexity? Each “slice” will be rendered out of order, yes, but there’s no dependency between the slice at 10 units and the one at 100 units, no? Each would get its own density compensation for the volumetric mode. You should be able to save each slice out as an EXR and comp them together FG over BG with alpha.
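If the slices really were independent, the comp would just be repeated front-over-back “over” operations, something like this rough sketch (the pixel and image types are invented for illustration):

```cpp
// Rough sketch of compositing depth slices front-over-back with the standard
// "over" operator. Image/pixel types are invented for illustration.
#include <vector>

struct RGBA { float r, g, b, a; };

// src OVER dst, assuming premultiplied alpha.
inline RGBA over(const RGBA& src, const RGBA& dst)
{
    float k = 1.0f - src.a;
    return RGBA{ src.r + dst.r * k,
                 src.g + dst.g * k,
                 src.b + dst.b * k,
                 src.a + dst.a * k };
}

// Slices must be ordered nearest-to-camera first; each slice could have been
// rendered by a separate thread (or saved as a separate EXR) beforehand.
std::vector<RGBA> composite_slices(const std::vector<std::vector<RGBA>>& slices,
                                   size_t pixelCount)
{
    std::vector<RGBA> out(pixelCount, RGBA{0, 0, 0, 0});
    for (const auto& slice : slices)           // walk front to back
        for (size_t i = 0; i < pixelCount; ++i)
            out[i] = over(out[i], slice[i]);   // accumulated result stays in front
    return out;
}
```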
Maybe you end up with 80 or 160 slices per frame, but heck, you could assign each a Z depth based on the slice, and composite with other renders really nicely. You could even do DOF as a post-effect, as well as z attenuation, tinting, etc…
Not to be an asshat, but that’s how I thought this worked. Is there some 3D filtering going on or something else I’m missing?
And of course my analysis of the CPU and memory is all wrong, because as soon as the renders get 4x faster, the memory might be the bottleneck. So I’m obviously having a less than ideal “think day”.
Krakatoa relies heavily on particle-on-particle interactions that simulate the volumetric attenuation of light. You can think of it as solving for the number of particles that a light ray intersects between a particle and the camera/light in addition to the distance. If we split the particles into Z-based layers, almost all of the self-shadowing would be destroyed in the general case.
In this case it would be more fruitful to split the particles into buckets, similar to a ray-trace renderer. There are complications to this, however, because particles do not completely fall onto a single pixel for most filtering modes. This means that there is a certain amount of overlap between buckets. If you throw DOF into the mix, it further complicates things.
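As a rough illustration of the overlap problem (the names are invented, this is not Krakatoa code): a particle splatted with a filter radius wider than one pixel can touch more than one bucket, so neighbouring buckets need padded borders, or the particle has to be drawn into every bucket it overlaps.

```cpp
// Sketch of why screen-space buckets overlap when particles are splatted with
// a filter wider than one pixel. Names and layout are illustrative only.
#include <vector>

struct Bucket { int x0, y0, x1, y1; };  // pixel bounds, inclusive

// Return the indices of every bucket a particle touches once its footprint
// (filterRadius pixels around its projected position) is taken into account.
// With a 1-pixel filter this is usually one bucket; with wider filters, or
// with DOF blurring the footprint further, a particle can straddle several.
std::vector<size_t> buckets_touched(float px, float py, float filterRadius,
                                    const std::vector<Bucket>& buckets)
{
    std::vector<size_t> hit;
    for (size_t i = 0; i < buckets.size(); ++i) {
        const Bucket& b = buckets[i];
        if (px + filterRadius >= b.x0 && px - filterRadius <= b.x1 &&
            py + filterRadius >= b.y0 && py - filterRadius <= b.y1)
            hit.push_back(i);
    }
    return hit;
}
```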
That being said, this is not an impossible task; it just is not the “low-hanging fruit” of multi-threading efforts.