Multithreading (beta 13, 64bit)

I’ve tried the Radix and the FF Threaded Sort, and in neither case does the CPU usage on my dual opteron machine get above 50%.



Retrieval, lighting, and rendering are all running at 50%.



I’m running max 9 64 bit.



Is that normal yet?


  • Chad

Actually, there is a brief moment when the CPU goes to 99%: when the render dialog progress says “Rendering” and there is nothing in the bar yet.


  • Chad

How many particles, how many lights?



Try 20M particles and 3 lights.

Set the Sorting to FF Threaded.

Render.



You should see peaks in the CPU usage as it alternates between single-threaded and multi-threaded phases:

*The particles have to be split into two chunks - this is single-threaded

*The sorting of the two chunks happens in two threads. I am getting 50% of 4 CPUs.
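The two-phase behavior described above (single-threaded split, two-thread sort, single-threaded merge) could be sketched roughly like this. This is a toy illustration of the idea, not Krakatoa's actual FF Threaded implementation; the function name and the use of Python lists are mine.

```python
import threading

def ff_threaded_sort(depths):
    """Toy sketch of a two-thread sort: split the depth list into two
    chunks (single-threaded), sort each chunk in its own thread, then
    merge the sorted halves (single-threaded again)."""
    mid = len(depths) // 2
    chunks = [depths[:mid], depths[mid:]]   # the split is single-threaded

    def sort_chunk(chunk):
        chunk.sort()                        # each thread sorts its half

    threads = [threading.Thread(target=sort_chunk, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # merge the two sorted halves back into one list (single-threaded)
    a, b = chunks
    merged, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i]); i += 1
        else:
            merged.append(b[j]); j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged
```

With only two worker threads, this tops out at 50% usage on a four-CPU box, which matches the numbers above.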



In my earlier tests, FF Threaded was 1.7 times faster in the Sorting during the Lighting phase than Standard Sort.



Radix is generally single-threaded and should be twice as fast for lighting as FF Threaded on two CPUs, but uses a bit more memory (you should see a spike in the memory usage around each light calculation).



The Final Pass sort in 0.9.13 uses the same technique. We have optimized it more for the next build though, so it will be generally faster.



On your machine, Radix should be faster as long as you have enough memory to run it.

If you don’t, FF Threaded will be your choice.



We have not enabled 4-thread sorting yet, but we hope that it will be closer to the speed of Radix sort without the memory overhead.


I have 4 lights and 144 million particles.



The actual back-to-front painter’s-algorithm splatting part of the render is running at 50%. The actual attenuation-buffer rendering for the lights is also running at 50%. The only time I see it at 99% is during the unspecific “Rendering” stage the dialog shows right before the lighting or final rendering happens. I assume that is the sorting? But the actual rasterizing of the particles isn’t multithreaded?


  • Chad

This is correct.



Krakatoa rendering is generally more memory dependent than CPU-dependent.



Once the particles are sorted, all Krakatoa has to do is read values from the sorted buffer and draw them one by one back to front. As you can imagine, this is not very CPU-intensive; reading from memory and writing into the buffer is all that has to be done. But in order to multi-thread this, each thread would require its own memory with all the particles, which would mean double the memory requirements to speed up something that is already pretty darn fast.



As you can see, this would be a really BAD decision ;o)
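The drawing loop described above could be sketched like this: a linear walk over the sorted buffer, alpha-compositing each point over whatever is already in the frame buffer. This is a toy sketch, assuming one-pixel splats and a standard “over” operator; the particle tuple layout is illustrative, not Krakatoa's.

```python
def draw_back_to_front(sorted_particles, width, height):
    """Toy sketch: once particles are sorted far-to-near, drawing is a
    single pass over the buffer, compositing each point 'over' the
    accumulated frame buffer. Each particle is (x, y, color, alpha)."""
    color_buf = [[0.0] * width for _ in range(height)]
    alpha_buf = [[0.0] * width for _ in range(height)]
    for x, y, color, alpha in sorted_particles:  # farthest particle first
        c = color_buf[y][x]
        a = alpha_buf[y][x]
        # the nearer particle goes 'over' what was drawn behind it
        color_buf[y][x] = color * alpha + c * (1.0 - alpha)
        alpha_buf[y][x] = alpha + a * (1.0 - alpha)
    return color_buf, alpha_buf
```

Note there is no per-particle work beyond a couple of multiplies, which is why the phase is memory-bound rather than CPU-bound.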



Btw, when it says “Rendering”, it is actually sorting the particles relative to the view. If you are using FF Threaded sort, that’s when the multithreading kicks in. Same with the Sorting of each light - the drawing of the attenuation maps is the same as rendering the final pass - painting back to front, single-threaded.


Ah, I see.



Well, the sorting bit (especially with Radix) is fast. The back to front rasterizing is what’s taking ages to render.



Even when I have lights with shadow maps of only 20x20 it seems to take a long time to make their attenuation maps. That doesn’t seem like a lot of pixels to fill, and there’s no DOF or motion blur to worry about.



I tried a render with 66 million particles (I culled to particles visible to camera) no lights (lighting turned off) and with PCache turned on, with an animated camera. So after the first frame, this should only require sorting for the camera and rasterizing. I used the Radix sort since I have the memory for it.



After the first frame, it takes about 9 minutes a frame.



What’s weird is that if I turn lighting ON and use LCache, it takes 4.5 minutes.



Either way, it is going pretty slow.



When you say “double the memory” is that for 2 threads or for N threads? I’m assuming that only gets you 2 threads. But right now I could easily double my memory usage and would trade that for double speed.



We’re buying some 8 core machines in the next month, I’m wondering if that’s a bad idea for Krakatoa. Perhaps we should just get massively overclocked single CPU machines?



Assuming no DOF, and nearest neighbor filtering, could you not just break the projection matrix of the camera up into chunks when doing the sorting? On a 4 core system, have each thread do a quadrant of the frame buffer? Still draw back to front and such, but treat each quadrant as a little “crop” of the full image?



With filtering or DOF, you might have to have a little bit of overlap with the particles, where some particles would have to be duplicated to more than one thread’s sort. With filtering it would be minimal, I don’t know what DOF would require.
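The quadrant idea above (ignoring the overlap complication for filtering/DOF) could be sketched as a simple bucketing pass over the projected particles. This is a hypothetical illustration; the names and tuple layout are mine, not Krakatoa's.

```python
def bucket_by_quadrant(projected, width, height):
    """Toy sketch of the suggestion: assign each projected particle
    (x, y, depth) to one of four screen quadrants so each quadrant could
    be sorted and rasterized by its own thread as a 'crop' of the full
    image. Particles near quadrant borders are not duplicated here."""
    half_w, half_h = width // 2, height // 2
    buckets = {0: [], 1: [], 2: [], 3: []}
    for x, y, depth in projected:
        q = (1 if x >= half_w else 0) + (2 if y >= half_h else 0)
        buckets[q].append((x, y, depth))
    return buckets
```

Each bucket could then be depth-sorted and drawn independently, with the four crops stitched together at the end.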


  • Chad

Are you absolutely positive that you are not hitting the swap disk?



I tested the following using Krakatoa Beta 0.9.13



25 Million particles, no lights.

Max 9 32 bit on WinXP 64:



No PCache: 2:13 (most of the time wasted releasing the memory after render)

PCache ON: 1:27 (no mem.release, so faster)

From PCache: 0:55 (no loading into memory!)



Sorting time for all 3 cases was 19.5 sec. with FF Threaded.





Same setup as above, but

Max 9 64 bit on WinXP 64:



PCache OFF: 0:55 (memory management is faster!)

PCache ON: 0:55 (identical!)

From PCache: 0:23 (so loading is about 22 sec.)



Sorting with FF Threaded was 20 sec. in the first and second case, 8.6 seconds from Cache (since camera TM was the same)



Pure point drawing time was 13 seconds in all 3 cases.



Switching to Radix Sort gave the following results:



*PCache OFF/ON: 0:48 Sort: 12.4 Draw: 13.9

*From PCache: 0:16 Sort: 1.9 Draw: 13.9



As you can see, switching to Radix Sort helps when memory is enough (it needs nearly double the memory as it creates a second buffer for sorting!)
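The memory behavior mentioned above comes from how an LSD radix sort works: it ping-pongs the data between the input buffer and a second, equally large scratch buffer. A minimal sketch over non-negative integer depth keys, assuming 32-bit keys and 8-bit digits (the specifics of Krakatoa's implementation are not stated here, so this is illustrative only):

```python
def radix_sort_depths(keys, key_bits=32, radix_bits=8):
    """Toy LSD radix sort over non-negative integer depth keys. Note the
    second full-size buffer 'dst': this mirrors why Radix needs roughly
    double the memory of an in-place comparison sort."""
    src = list(keys)
    dst = [0] * len(src)                 # the extra full-size buffer
    mask = (1 << radix_bits) - 1
    for shift in range(0, key_bits, radix_bits):
        counts = [0] * (1 << radix_bits)
        for k in src:
            counts[(k >> shift) & mask] += 1
        # prefix sums give each digit value its starting slot in dst
        total = 0
        for d in range(len(counts)):
            counts[d], total = total, total + counts[d]
        for k in src:                    # stable scatter into dst
            d = (k >> shift) & mask
            dst[counts[d]] = k
            counts[d] += 1
        src, dst = dst, src              # ping-pong the buffers
    return src
```

The counting passes parallelize poorly compared to splitting the data, which fits the observation that Radix is fast but essentially single-threaded.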





We assume that Krakatoa scales in linear fashion as long as it does not hit the page file. To prove this, I went up to 50M particles in the same scene, Max 9 64 bit, FF Threaded Sort (Radix would not have enough memory as I have only 4GB)



PCache OFF: 1:56

PCache ON: 1:56

From PCache: 0:45



As you can see, it is scaling linearly - double the particle count, double the time.



Pure FF Threaded Sorting time without cache was 42 seconds. With pre-sorted cache - 16.5 sec.



Pure drawing time was 27 seconds in all 3 cases.



-------------



Regarding Attenuation Map generation - the size of the shadow map (just like the size of the rendered image) has almost no influence on the render time; it is the number of particles that have to be drawn that matters. So whether you are shading 20x20 or 2000x2000, the work is the same, just the final image is different.



I guess it would be worth investigating the possibility of speeding up shadow generation by drawing only a small portion of the particles (every Nth, like every 100th or something). In that case, instead of drawing 66 Million, your maps would have to shade only 600K particles into the 20x20 map. If we adjusted the volumetric density accordingly, it should give you a similar result, I guess…
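The subsampling idea above amounts to a two-line transform: take every Nth particle and multiply the per-particle density by N so the total volumetric density stays roughly the same. A minimal sketch, with hypothetical names (this is not an existing Krakatoa option):

```python
def subsample_for_shadows(particles, every_nth, density_per_particle):
    """Toy sketch: draw only every Nth particle into the attenuation
    map, scaling the per-particle density by N to compensate so the
    total density of the cloud is roughly preserved."""
    subset = particles[::every_nth]
    adjusted_density = density_per_particle * every_nth
    return subset, adjusted_density
```

For 66M particles and N=100, that is roughly 660K particles to draw per shadow map, at 100x the per-particle density.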







I am not the chief engineer of Krakatoa, so I will leave the questions about multi-threading the actual shading phase to those who know more about the problem.



Still, drawing 50 million particles in 27 seconds means 1.85 million per second.



I have no idea why you would get 66 million in 9 minutes. Something does not sound right.


The sort times for me aren’t the problem. I don’t have a breakdown (where do you get that?) but I can see from taskmanager where the memory usage spikes briefly (during Radix sort) then drops off for a long stretch (rasterizing).



Thirteen seconds was not happening for the draw phase. It was at least 8 minutes of rasterizing, seemed like about a minute between each “slice” though I didn’t time it, just looking at the clock in the corner.



As to swapping, I only can tell from taskman. Is that not good enough? It says I used ~6.3GB when sorting and had ~.5GB available. When rasterizing, it used ~3.5GB and had ~3.5GB available. I have no idea if those taskman numbers reflect truth though.


  • Chad

How much Physical Memory do you have installed?



I cannot render 66M particles here as I have only 4GB. If I try to render those, I will hit the swap disk and it will become unusably slow.



If Commit Charge (K) Total is greater than Physical Memory (K) Total, you are swapping.



If you have 4 GB RAM for example and PF Usage shows more than 4 GB being used, you are swapping.



In my case, I set 50 million particles to render with Radix sort, which is like rendering 100 million with FF Threaded sort (since it requires a second particle buffer for faster sorting, it uses as much memory as 100M). With FFT, my memory usage goes up from 1.2GB to 3GB. With Radix, it goes up to 5GB. Since I have only 4GB, Windows tries to give me more by paging. During this time, the CPU usage stays low since Windows has too much to do with memory management for Krakatoa to process anything. My available memory was reported to be about 1.5 GB, going up and down as the page file was written.



I am not sure what you are seeing is page file related though - the above setup took a lot more than 9 minutes (it is still sorting at the moment…)



It would be cool to figure out what exactly is causing the slow down on your system.



Full hardware and software specs might be useful. Then try replicating the same with artificial data - like 50 x 1M PRT partition sequences created from a simple box emitter PFlow to exclude the nature of your point cloud data from the equation.

Perhaps it is the depth of field? Rendertimes (with PCache) drop to 1.5 minutes without the depth of field. That’s a ~6x improvement.



Unfortunately, I can’t turn it off.


  • Chad

AHA!



Yes, DOF works by rendering the same particle multiple times to cover the disk area calculated using the DOF Settings. The quantity of these samples (and thus the quality) depends on the DOF Sample Rate - the higher the value, the more samples will be drawn per particle. This means that when using DOF, you are effectively drawing orders of magnitude more particles than loaded in memory - if each particle required on the average 10 samples (this varies with depth, of course), then you would be drawing 660 million points on screen.
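The multiplication described above could be sketched as follows. The circle-of-confusion formula and constants here are assumptions for illustration only; the actual DOF Settings math in Krakatoa is not specified in this thread.

```python
import math
import random

def dof_splats(particles, focal_depth, sample_rate, rng=None):
    """Toy sketch of the DOF behavior described above: each particle is
    drawn several times, jittered over a disk whose radius grows with
    distance from the focal plane, and the sample count scales with the
    DOF Sample Rate. Each particle is (x, y, depth)."""
    rng = rng or random.Random(0)
    splats = []
    for x, y, depth in particles:
        radius = abs(depth - focal_depth) * 0.1            # assumed CoC model
        n = max(1, int(sample_rate * radius * radius))     # more blur -> more samples
        for _ in range(n):
            r = radius * math.sqrt(rng.random())           # uniform over the disk
            theta = 2.0 * math.pi * rng.random()
            splats.append((x + r * math.cos(theta), y + r * math.sin(theta)))
    return splats
```

An in-focus particle draws once; a badly defocused one draws many times, which is how 66 million loaded particles can become hundreds of millions of drawn points.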



With a drawing rate of almost 2 M/sec, this would mean 360 seconds or 6 minutes just for drawing points…



YMMV, try lower DOF Sample Rates and see if the quality is usable…

I was wondering how that worked…



I figured you were changing the size of the sample filter as you got farther from the focal plane.



Duplicating the points sounds like it would work really well, but yes, slowly.



When you say it is memory bound, not CPU, how does that break down for designing systems to use this? I assume it’s the memory bandwidth for a single CPU that matters, right? Is there a benchmark that can be used, like a 1P STREAM, or do you know from experience with Frantic productions that certain system types work better than others?


  • Chad

Here’s the shot that brought up my concerns. Only the camera moves, so I was expecting the pcache and lcache to really pay off. I suppose once you turn off the DOF, it really does!


  • Chad

I did a quick test. Took the same scene as before, but zoomed the camera in. So FOV was 3 degrees instead of 30. Rendertimes were much lower. Which means only the particles visible to the camera are being rasterized. The rest are being ignored. So I guess I’ll log a wish to allow “crops” of the screen to be sent off to different threads. Maybe Mark can let me know if something like that could be feasible for a post 1.0 world?


  • Chad

The tricky thing with multithreading Krakatoa is how to split up the particle drawing task without causing high memory usage or lots of redo.



The way I’ve been imagining doing this is, after the particles are sorted, splitting them into near and far particles. One thread would render the near particles, while the other renders the far particles. The two frame buffers would then be composited together to produce the final render.
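The near/far scheme above could be sketched like this. The `draw_fn` and `composite_fn` callables stand in for the real rasterizer and the “near over far” composite; this is a toy illustration, not Krakatoa's code.

```python
import threading

def render_near_far(sorted_particles, draw_fn, composite_fn):
    """Toy sketch: after sorting, split the particle list at the
    midpoint into far and near halves, render each half into its own
    frame buffer on its own thread, then composite near over far."""
    mid = len(sorted_particles) // 2
    far_half, near_half = sorted_particles[:mid], sorted_particles[mid:]
    buffers = [None, None]

    def work(idx, chunk):
        buffers[idx] = draw_fn(chunk)   # each thread fills its own buffer

    threads = [threading.Thread(target=work, args=(0, far_half)),
               threading.Thread(target=work, args=(1, near_half))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # the near buffer goes 'over' the far buffer
    return composite_fn(buffers[1], buffers[0])
```

The cost is one extra frame buffer per thread rather than a full copy of the particle data.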



This becomes trickier to load-balance correctly when there are multiple effects, like matte objects and depth of field, in play. These effects could change the speed of the near particles relative to the far particles, so CPU usage would be split across the threads at the start, but the faster half could finish early, and so the render wouldn’t be using multiple CPUs for most of the time.



In this case, what I think we need to do is create a thread pool, split the particles into much smaller chunks, and dole them out to the threads as they finish them. Unfortunately, the threads might finish out of order, which means the intermediate frame buffers might have to wait to be composited together, requiring extra (potentially unbounded) memory usage to get maximal load balancing.
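The thread-pool variant could be sketched with a pool and ordered compositing. Out-of-order completions are simply held until their turn, which is exactly where the extra (potentially unbounded) buffer memory mentioned above comes from. Again a toy sketch; `draw_fn` and `composite_fn` are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def render_chunked(sorted_particles, chunk_size, draw_fn, composite_fn, workers=4):
    """Toy sketch: split the sorted particles into many small chunks,
    draw them concurrently on a thread pool, and composite the resulting
    buffers strictly in depth order. Finished-but-not-yet-composited
    buffers sit in the futures list, holding memory."""
    chunks = [sorted_particles[i:i + chunk_size]
              for i in range(0, len(sorted_particles), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(draw_fn, c) for c in chunks]
        result = None
        for f in futures:                 # composite in submission (depth) order
            buf = f.result()              # may block while earlier chunks finish
            result = buf if result is None else composite_fn(buf, result)
    return result
```

Smaller chunks give better load balancing but more buffers in flight, which is the trade-off described above.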



What you’re suggesting as an alternative, of splitting the scene into buckets, is much more like how most geometry renderers work. In the case of heavy DOF renders, this can provide a lot of benefit, because the duplicated calculation is small relative to the heavy DOF sampling. There could also be a benefit when really heavy matte object geometry is being used. For scenes with neither of those, the particle drawing time is small relative to the processing overhead, so the same work would be duplicated across the multiple CPUs, making the utilization look good, but not providing a really good speed up.



So basically, we want to put in multithreading, but it’s pretty tricky to balance memory usage and load, and to avoid redoing work across the different CPUs. Post-1.0, we definitely plan to do this. It needs the right amount of time to get it right, test it thoroughly, etc. It’s very easy for incorrectly implemented multithreading to introduce subtle bugs, or sporadic crashes, and we don’t want any of those.



Cheers,

Mark

Thanks, Bobo, for the timing tip. With the new log window, I stopped looking at the Listener.



Mark, what you say is encouraging, I guess I’ll stick with a multiprocessor system and just hope I get to beta test the 1.x version. I didn’t think about doing near-far splitting with a compositing step at the end. I can’t think of any way I can use that now, but it might be useful to split out renders by clip plane ranges to allow better post processing, especially with DOF. Like I could use low sample on the near and far passes, and blur them in post, and use high samples on the in-focus middle clipping region.


  • Chad

I’m attaching an image that shows the before and after of my first attempt at masking in Krakatoa. I consider it a success, though the irony is that the mask took exactly the same time to render as the beauty pass. It is possible to speed it up by using fewer particles, but as it is now, the mask is quite accurate.



Also, I had a problem with the kidney wherein one of the slices had an error in the data because of a problem in the cryomacrotome that messed up my segmentation algorithm. The error didn’t show up when I was working in Box3, because I was only using a few thousand particles to test. Once I loaded up all the PRT’s however, I could see a problem. Easy fix with the new beta… I made a box around the bad particles, clipped the PRT loader to that box, and saved that out as a NEW PRT file. Worked a treat.


  • Chad

Beautiful work, thanks for sharing!!!



We always thought of LCache and PCache as a manual way to speed up quick test renders with several million particles. The fact they could be used to render static particle clouds with moving camera was a side-effect, not a designed feature. DOF is a production effect and like Motion Blur requires a lot more calculations.



That being said, having 66 million particles might allow you to use VERY low Sample Rate values - if you set the DOF Sample Rate close to 0.0, the existing particles will be scattered in space without much duplication, thus rendering in almost the same time as without DOF.



Of course, the quality of the effect would be lower, but it is worth playing with low settings and finding the best balance between speed and quality. The more particles you have, the lower the Sample Rate could go. 1.0 might be overkill if 0.1 looks visually similar but takes 10 times less time…



(Currently, you should be able to see the times for lighting and final passes in the MAXScript Listener.)