Was just wondering if there are any plans to implement GPU rendering and particle saving?
What improvements over the current system would you expect from GPU rendering and particle saving?
The current limitation of GPU rendering is the amount of memory you can fill up with particles.
Krakatoa requires ALL particles to be present in memory to perform sorting, lighting and drawing. With a typical 1GB card today, the number of particles one could process would be somewhere between 15 and 40 megapoints. But the portions of the code that the GPU could potentially make faster are already pretty fast on the CPU, and the portions that are slow on the CPU wouldn't get much faster on the GPU.
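To put the 15-40 megapoint figure in perspective, here is a rough back-of-envelope; the per-particle byte counts below are assumptions for illustration, not Krakatoa's exact channel layout:

// Rough capacity estimate: how many particles fit in GPU memory?
// Assumed footprints: ~24 bytes/particle for a lean channel set,
// up to ~72 bytes/particle once more channels are allocated (assumption).
#include <cstdio>

int main()
{
    const double gpuMemoryBytes     = 1.0 * 1024 * 1024 * 1024; // a typical 1 GB card
    const double bytesPerParticleMin = 24.0;  // assumption: lean channel set
    const double bytesPerParticleMax = 72.0;  // assumption: full channel set

    std::printf("capacity: %.0f to %.0f megapoints\n",
                gpuMemoryBytes / bytesPerParticleMax / 1e6,   // ~15 MPts
                gpuMemoryBytes / bytesPerParticleMin / 1e6);  // ~45 MPts
}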
To give an example of the latter: drawing particles as points is currently single-threaded. Making it multi-threaded on the CPU or GPU is quite a challenge because the process itself is not well suited for parallelization. If we could multi-thread it, we would rather go with 8 cores and 16 or 32 GB of RAM instead of hundreds of cores and only 1GB of RAM.
Running 3ds Max materials and maps on the GPU might be a problem since the Max materials we support are currently not designed to run on the GPU (Autodesk and mental images are doing something in that direction, but MetaSL might not be the right answer for Krakatoa). The KCMs run like hell on the CPU (the overhead of typical KCMs is rather negligible). Making KCMs execute on the GPU would probably make them a bit faster, but since that is already one of the fastest portions of the system, the effort wouldn’t pay off much.
Saving particles depends mostly on the speed of the ZIP library. We could theoretically speed it up by possibly writing multiple streams to the same file to better use the available cores, but the GPU wouldn’t change anything in this case.
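A minimal sketch of the "multiple streams" idea, assuming zlib-style per-chunk compression (the chunking and on-disk layout here are hypothetical, not the actual PRT format): compress chunks on worker threads, then write the blocks out to the file in order.

#include <zlib.h>
#include <vector>
#include <thread>
#include <cstdio>

// Compress each chunk of particle data independently, then append the
// compressed blocks to the file in order. A reader would need the block
// sizes stored somewhere (omitted here); this is only a parallelism sketch.
void writeCompressedChunks(std::FILE* out,
                           const std::vector<std::vector<unsigned char>>& chunks)
{
    std::vector<std::vector<unsigned char>> compressed(chunks.size());
    std::vector<std::thread> workers;

    for (size_t i = 0; i < chunks.size(); ++i)
        workers.emplace_back([&, i]() {
            uLongf bound = compressBound(chunks[i].size());
            compressed[i].resize(bound);
            compress2(compressed[i].data(), &bound,
                      chunks[i].data(), chunks[i].size(), Z_BEST_SPEED);
            compressed[i].resize(bound);                // shrink to actual size
        });

    for (auto& w : workers) w.join();

    for (const auto& block : compressed)                // sequential, ordered write
        std::fwrite(block.data(), 1, block.size(), out);
}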
One area where GPU rendering could be interesting would be fast previews of fractions of the particle cloud in a real time environment (like the nVidia smoke fluid simulation demos). This would probably require an external viewer or waiting for XBR to deliver the new Max viewports.
In both cases, I am not sure the R&D effort would be worth it. Keep in mind Krakatoa is being developed mainly to serve our internal VFX needs. It was not designed primarily as a commercial product; that was a side effect, and we are happy people are embracing it. But in order to get Krakatoa on the GPU, we would first need new graphics cards in our workstations (read: $$$), time to learn CUDA or DirectCompute or whatever (time==$$$), and people to do the port (people==more $$$). Given the limited resources and high demand for the developers' time in-house, it is much better to concentrate on adding the things that Krakatoa does not do yet, and optimizing the things it already does but could do better.
We would be interested in hearing from the users what areas of Krakatoa would be improved by moving to the GPU.
Cheers!
Agree with you on the GPU stuff. Krakatoa is I/O and memory bound, and those are the two things GPUs can't solve right now.
Regarding CPU multithreading of the rasterizing… Why is that a challenge? Why can't you divide the scene up into N depth slices, render each in a separate thread, and composite the results together at the end?
I can see why bucket or scanline rendering would be hard, but it seems like the depth sorting is being done anyway. I've done tests on this myself (save lighting to emission, sort to camera, load the PRT, cull by ID or clip box, render, composite), and the results are accurate to a very high degree, just half-float rounding errors. And even if there were artifacts (and I can't think off the top of my head what they might be), they would all be in planes parallel to the camera plane, making them hard to detect (compared to bucket edges or scanlines).
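A minimal sketch of that slice compositing, assuming each slice is rendered to its own premultiplied RGBA buffer (a single scalar alpha here for simplicity; struct and function names are hypothetical):

#include <vector>

struct Pixel { float r, g, b, a; };   // premultiplied RGBA (assumption)

// "front" was rendered from the slice nearer the camera, "back" from the
// slice behind it; front over back is the standard over operator.
void compositeOver(std::vector<Pixel>& back, const std::vector<Pixel>& front)
{
    for (size_t i = 0; i < back.size(); ++i) {
        const float k = 1.0f - front[i].a;
        back[i].r = front[i].r + k * back[i].r;
        back[i].g = front[i].g + k * back[i].g;
        back[i].b = front[i].b + k * back[i].b;
        back[i].a = front[i].a + k * back[i].a;
    }
}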
- Chad
That is certainly an option, but it's not IDEAL in the sense that it can make the render slower. We were kind of holding out for an algorithm that cannot perform slower than the single-threaded implementation in the worst case.
My observations show that the lighting phase is typically the slowest, do you agree?
Actually, we find the rendering itself to be slower, but that's usually because of depth of field. I can see how large numbers of lights could make things quite slow, but we so often render passes that have no lighting whatsoever that we might do light sorting 4 times but rasterizing 12-15 times. And lighting can be baked to the emission channel, while rasterization can only really be baked to EXRs.
But the lighting can be multithreaded too, the same way. The compositing is different, you’d need to subtract the light sorted slices, but that’s all, right?
And in what cases would the depth sorted rasterization be slower? For very small point counts, maybe? But in those cases you’re talking about going from .3 seconds to .4 seconds, and no one will notice.
- Chad
Here’s my beautiful diagram:
--------------------------------------------------------------- zMax
| (4)
--------------------------------------------------------------- zMax * 3/4
| (3)
--------------------------------------------------------------- zMax * 1/2
| (2)
--------------------------------------------------------------- zMax * 1/4
| (1)
Camera z = 0
A framebuffer at 4K resolution and 16:9 aspect needs 4 (bytes per float) * 6 (floats per pixel) * 4096 * ( 4096 * 9 / 16 ) = 226492416 bytes or 216 MB.
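The same arithmetic as a quick sanity-check snippet (the 6 floats per pixel are R, G, B plus the three per-channel alphas that come up later in the thread):

#include <cstdio>

int main()
{
    const long long bytesPerFloat  = 4;
    const long long floatsPerPixel = 6;        // R,G,B + per-channel alpha
    const long long width  = 4096;
    const long long height = 4096 * 9 / 16;    // 2304 pixels at 16:9

    const long long bytes = bytesPerFloat * floatsPerPixel * width * height;
    std::printf("%lld bytes = %lld MB per framebuffer\n",
                bytes, bytes / (1024 * 1024)); // 226492416 bytes = 216 MB
}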
I have 4 CPUs, so I split the particle z-range into 4 pieces. Right off, this needs at least 4 times more frame buffer memory. 4 * 216 = 864 MB. This won’t kill anyone at this point. Also we will need to do a second pass to composite these four images together after all threads have finished (assume they all finish at the same time).
This extra work shouldn't be a problem, since I've got 4 CPUs churning, if we can assume a mostly uniform distribution of particles along z. Unfortunately that's not a good assumption under a lot of situations. For example, imagine a scene where region 1 in my diagram gets N - 1 particles, where N is the total number of particles, and the last one is at zMax. Three threads will start, finish immediately since they have at most a single particle (potentially holding onto that extra framebuffer memory), and wait for thread 1 to do the entire job. Then the second (single threaded) pass will begin compositing these images together. All in all, that's probably a good bit slower than the single-threaded case, and that kind of skewed distribution is not really all that unexpected in practice.
To deal with this, the standard technique is to chop the work into some number of chunks much greater than the number of available threads, and take pieces as you finish. This will balance the work, but potentially be ugly since each chunk needs its own framebuffer if they are randomly assigned. You could use some sort of scheduling algorithm to try to grab an adjacent work chunk when finished, but that could end up being unbalanced without careful design.
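A minimal sketch of that kind of scheduling, assuming the z-range is pre-split into many more chunks than threads (function names hypothetical); each thread pulls the next unclaimed chunk off a shared atomic counter:

#include <atomic>
#include <thread>
#include <vector>

// Split the z-range into many more chunks than threads, then let each
// thread grab the next unrendered chunk until none are left. Each chunk
// still needs its own framebuffer (or one per thread, reused and flushed).
void renderChunks(int chunkCount, int threadCount,
                  void (*renderChunk)(int chunkIndex))
{
    std::atomic<int> next(0);
    std::vector<std::thread> workers;

    for (int t = 0; t < threadCount; ++t)
        workers.emplace_back([&]() {
            for (int c = next.fetch_add(1); c < chunkCount; c = next.fetch_add(1))
                renderChunk(c);   // rasterize the particles falling in chunk c
        });

    for (auto& w : workers) w.join();
}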
The other technique that probably works really well would be to split based on a certain number of particles. Instead of zMax * X/4 being the splitting points, you could use particle N * X/4 as the splitting points so each thread gets an equal number of particles. Beyond the memory requirements of the extra framebuffers, I haven't figured out what the other drawbacks might be, but I'm sure they are out there. Waiting…
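A sketch of the equal-count split, assuming the particles are already sorted by camera-space z (which the sorted rasterization needs anyway); each thread then just takes a contiguous range of the sorted array (names hypothetical):

#include <cstddef>
#include <utility>
#include <vector>

// Given N camera-sorted particles and T threads, hand thread t the range
// [t*N/T, (t+1)*N/T). Every thread gets (nearly) the same particle count,
// regardless of how the particles are distributed along z.
std::vector<std::pair<std::size_t, std::size_t>>
equalCountRanges(std::size_t particleCount, std::size_t threadCount)
{
    std::vector<std::pair<std::size_t, std::size_t>> ranges;
    for (std::size_t t = 0; t < threadCount; ++t)
        ranges.emplace_back(t * particleCount / threadCount,
                            (t + 1) * particleCount / threadCount);
    return ranges;
}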
Yeah, the memory doesn't sound that bad at all… So for an extra 648 MB, we could have almost 4x faster rendering (after PCache)? Better yet, for an extra 2.3 GB it could be 12x faster?
I was assuming that you would divide the particles into equal chunks, possibly compensating for DOF by making the chunks closer to the focal plane larger…
Making many chunks wouldn't eat more memory; just dump the finished buffers off to disk. But I don't think small chunks are a good idea. If you have 4 CPUs, make 4 chunks and just try to get them nearly the same size.
Lighting doesn’t account for motion blur, right? So that’s the place to start…
I suppose that for the rasterizing, motion blur would require multiple samples per particle, so a particle might need to have some samples in one chunk and others in another. But it won't be any more samples than you do now.
- Chad
What’s the 6 floats/pixel? R,G,B,A, and what else?
- Chad
You need three floats for alpha (one per color channel) in order to handle Absorption correctly.
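For illustration, here is what the over operation looks like with a separate alpha per color channel, which is where the six floats per pixel come from (premultiplied color assumed; names hypothetical):

struct DeepPixel { float r, g, b, ar, ag, ab; };   // color + per-channel alpha

// Front-over-back with a separate alpha per color channel, so wavelength-
// dependent absorption composites correctly.
DeepPixel over(const DeepPixel& front, const DeepPixel& back)
{
    DeepPixel out;
    out.r  = front.r  + (1.0f - front.ar) * back.r;
    out.g  = front.g  + (1.0f - front.ag) * back.g;
    out.b  = front.b  + (1.0f - front.ab) * back.b;
    out.ar = front.ar + (1.0f - front.ar) * back.ar;
    out.ag = front.ag + (1.0f - front.ag) * back.ag;
    out.ab = front.ab + (1.0f - front.ab) * back.ab;
    return out;
}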
Thanks for all the very illuminating replies…
Ah, right.
Which reminds me, could we save out AR, AG, and AB EXR channels to allow for proper compositing of absorption?
- Chad
EDIT: AR/AG/AB are the actual “standard” channels in EXRs for absorption. Neat!