Magma output to motion blur samples & Performance

I wonder if it’s possible to somehow map the distance between a particle and the camera to the number of motion blur samples.
Obviously I’d like to optimize my rendering by increasing the number of motion blur samples (Krakatoa motion blur) only where it matters, getting better performance along with better sampling in the foreground.
I think I saw it somewhere while browsing the forum, the documentation or educational videos, but I can’t seem to find it now.
What I mean is that I can’t find the parameter to output to; I can easily get and process the distance from the particles to the camera.

Many thanks in advance.

Nope, the number of samples defines the number of passes that Krakatoa will render, so it is global and cannot be controlled per particle. What you could do is scale the Velocity channel to make particles closer to the camera travel more or less distance than particles in the background, but you cannot tell each particle how many samples to draw, as each particle is drawn once in every motion blur image buffer and then all the images are comped together…
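
To illustrate, here is a minimal MAXScript sketch (with hypothetical parameter names and an invented scaling ramp) of the per-particle math such a Velocity-scaling Magma flow would compute - in Magma itself you would build this from InputChannel, distance and arithmetic operators:

-- Hedged sketch of "scale Velocity by camera distance" per-particle math.
-- All names and the ramp are hypothetical, not a Krakatoa API.
fn scaledVelocity pos vel camPos nearDist farDist minScale maxScale =
(
    local d = distance pos camPos                  -- particle-to-camera distance
    local t = (d - nearDist) / (farDist - nearDist)
    t = amax 0.0 (amin 1.0 t)                      -- clamp blend factor to 0..1
    vel * (minScale + t * (maxScale - minScale))   -- remap distance to a Velocity scale
)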

I see. Would it be possible to duplicate particles closer to the camera along the particle’s movement path relative to the camera?

You could try cloning a PRT Loader multiple times as Reference. This will make the base object instanced, but the modifiers unique.
Then on each clone, add a Magma modifier and calculate a Soft Selection based on the camera distance (the farther the particle, the higher the selection value).
Add a Krakatoa Delete modifier to delete by Soft Selection, thus giving you more duplicates close to the camera and fewer with distance.
Then add another Magma on top of the Krakatoa Delete and shift the position of each particle by a fraction of the Velocity vector, emulating the sub-samples this way. Expose the fraction in the UI and clone this Magma to every PRT Loader as a Copy. Change the exposed factor on each PRT Loader so you have 0.1, 0.2, 0.3, 0.4… 0.9 or something like that. This will simulate the sub-samples along the Velocity, but far from the camera most of these will be missing due to the deleting, so there will be fewer samples…
You can enable the Krakatoa MBlur on top of that to blur each sample additionally.
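
As a rough sketch of the position shift each clone’s top Magma would apply (hypothetical names; this assumes the Velocity channel is stored in world units per second, as in PRT files, so it is divided by the frame rate to get a one-frame step):

-- Hedged sketch of the sub-sample shift described above (names hypothetical).
-- Assumes Velocity is in world units per second, so fraction/fps of it
-- moves the particle a fraction of one frame along its motion path.
fn subSamplePosition pos vel fraction fps:30.0 =
(
    pos + vel * (fraction / fps)
)
-- Clone k of 9 would use fraction = k * 0.1, i.e. 0.1, 0.2 ... 0.9.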

I haven’t tried this, but it sounds like it might work.

Just in theory - if I have 10 PRT Loaders referencing the same dataset, is it loaded once or 10 times?
And I believe it would only work if the particle is in fact moving. It does not seem to work when combining the Velocity channel with a transform into moving camera space.

They have to be loaded 10 times currently. The instancing is done only to simplify the management if you decide to tweak the settings of the PRT Loaders - adding/removing files, changing counts etc. If you change one, all 10 will change. But at render time, each modifier stack has to be processed individually, so each PRT Loader will have to read again from disk.

It can be made to work with a moving camera by using an InputScript operator that reads two positions of the camera at different times using MAXScript and calculates the motion vector of the camera itself, then uses that to modify the positions of the PRT Loader instances.
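
A minimal sketch of such an InputScript body might look like this (the camera name is a placeholder; the resulting Vector would be combined with the particle Position downstream in the flow):

(
    -- Hedged sketch: world-space camera motion over one frame, via MAXScript.
    -- "Camera001" is a hypothetical node name - substitute your own camera.
    local cam = getNodeByName "Camera001"
    local p0 = at time currentTime cam.transform.pos
    local p1 = at time (currentTime + 1f) cam.transform.pos
    p1 - p0   -- camera motion vector in world units per frame
)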

I see. In this case it probably does not pay off. It’s faster to draw 10 MLN particles 10 times than to load them 10 times.
In that case it looks like adding high-frequency noise to increase the particle count and then deleting by Soft Selection away from the camera would work better.

One more question though regarding performance and Magma.
Let’s assume that:

  • I have a static particle dataset (one frame, loaded via Load Single Frame Only);
  • I want to do animated culling, either via explicit culling or Magma;
  • Particles are lit in the scene, but the lighting is static;
  • There’s an animated camera.

Now what would be the best possible workflow to avoid reloading particles every frame?

  • I assume an animated volume affects lighting, so I probably need to bake the lighting into the dataset.
  • How is it with culling via Magma or explicit culling and the cache - is it possible to somehow load the particles only once and only have the Magma or culling updated in RAM based on the cached data? Or is there no way to escape reloading the dataset every frame?
  • While loading particles from the PRT file, what takes the most time? Processing and streaming, or readout? I.e. will I get significantly better performance by moving my datasets either to an SSD drive or to a RAM drive (if it’s not possible to avoid reloading in this scenario)? --> I just tested it. Loading a 20 MLN dataset from a SATA HDD and from a RAM disk makes absolutely no difference timewise.

I know it’s a lot of questions, but I believe other users will benefit from having them answered here on the forum.

Then you are doing it wrong :slight_smile:

There is a practical limit imposed by the speed of the zlib compression library. You cannot decompress particles faster than a certain limit. So if the bottleneck is the decompression and not the hard drive I/O, you will see no difference. BUT if you have multiple partitions, equal to or higher in number than the cores in your computer, you should see a significant difference, because each sequence gets its own thread and, as long as the drive can keep up with the demand, you should see many times faster performance.

With an SSD and even better, a FusionIO card, you should be able to load up to 21 million particles PER SECOND. We have tested this using a FusionIO card and it flies. You can read about it here:
thinkboxsoftware.com/news/20 … /7bpf.html

Magma is normally applied to the particles while loading, as part of the modifier stack evaluation. The RESULT of that gets cached in the PCache. So if the Magmas are changing the particle count, you have no other choice but to reload. Same applies to the regular culling feature of the PRT Loader. It is applied at the end of the modifier stack after the particles are transformed to world space, but still before they are stuffed into the PCache.

Expanding on the loading process description, here is what happens when loading with PRT Loader:

*Each PRT Loader is processed sequentially, but each PRT sequence within the same PRT Loader gets its own thread. With 8 cores, you can load 8 PRT sequences in parallel for 6 to 8 times better performance, as long as the drive can keep up I/O-wise.
*Note that there is a certain overhead related to opening multiple partition files though. So while reading from 10 or 20 files is better than loading from 1 or 2, loading from 100 might turn out to be a bit slower due to that overhead. This was even more pronounced when loading was single-threaded before v2.0. So we recommend creating partition counts close to the CPU count. If you need 100 MP, 10 partitions x 10MP are probably a better idea than 100x1MP.
*The particles from every PRT stream are unzipped in chunks of about 50,000 at a time, then passed up the stack for processing.
*The particles are passed through all deformation modifiers on the stack bottom to top. They are represented as vertices, so the Max modifiers believe they are operating on a TriMesh with no faces. If there are Velocity or Normal channels, they are passed separately from the Positions through the modifiers. This is as fast as with regular meshes in Max, but it is single-threaded due to the Max modifier stack design. If a Magma modifier can be used instead of a Max deformation modifier, it is recommended (see below).
*Magma modifiers are also evaluated if encountered. Magma modifiers are multi-threaded internally - the more nodes and more complex the flow, the better the CPUs are saturated.
*The particles are transformed into world space by the Node Transforms of the PRT Loader, if enabled. Loading directly in world space (Use Node Transforms OFF) should be faster.
*Any SpaceWarp bindings are processed to produce deformations in World Space.
*The PRT Loader culling is performed. The particles are already in world space and are tested against one or more geometry objects. This was added before we had Magma. Magma culling is more flexible, but can be slower, esp. if using InVolume.
*If a Material is assigned to the PRT Loader, the shader tree is evaluated for each particle to produce Color, Density multiplier, Emission, Absorption, SpecularLevel and SpecularPower data as needed. This can be relatively slow. If something can be done with a Magma, e.g. Color assignment, it is better to use a Magma modifier. The Krakatoa Material is a Max Material front-end to a Magma-based back-end, so it is very fast in comparison to Max Materials!
*The Global Channel Overrides are applied. These are Magma modifiers that are applied to the world space of the particles and can change their channels before the data is stuffed into the PCache.
*The Global Value Overrides are applied. These are hard-coded Color, Emission, Absorption and Density overrides that can replace the channels of ALL particles with global values before the particles end up in the PCache. In earlier versions of Krakatoa, these were actual render-time values and could be changed with PCache on. But in v1.5 we added the ability to specify global texture map overrides which made it impossible to preserve the old way of doing things.
*Only the channels requested by specific features are now stored in the PCache. See Memory Channels rollout to see what is there.
*A note about shading - some of the Phase Functions (e.g. Phong, Marschner etc.) expose additional properties AND optional per-particle channels. If the per-particle channel for, say, SpecularLevel is not checked, the value is applied at render time and is independent from the PCache. As long as there is a Normal channel, you will be able to tweak the specular highlight’s shape without reloading. But if you enable a per-particle channel for SpecularLevel and there is no such channel, the PCache will have to be rebuilt, and if it exists, it will be locked and reused if PCache is on. The actual (Phong) Shading is performed in camera space at render time, so you can move the camera and get correct speculars with an enabled PCache…

Now if you lock the PCache, all channels that were stored in the PCache will be locked and reused. You cannot change anything about them, e.g. Position, Color, Emission etc. What you can modify are the Global controls like the Density multiplier (Lighting/Final Pass Density Value/Exponent), Emission Strength etc. These are applied at render time. Also, Camera-space effects are generally calculated at render time, e.g. Environment Reflection Maps. Only the Normal is stored in PCache, the color of the Environment is looked up at render time, so moving the camera produces correct reflections with locked PCache.

The LCache only locks the Lighting channel. If engaged, the Lighting channel calculated in the previous render will be locked in place and will be reused. If a light is moving, you won’t see the effect unless the LCache is disabled to allow for dynamic lighting.

You can see a graphical representation of everything I wrote above by opening the Krakatoa Schematic Flow! :slight_smile:

We HAVE talked in the past about a possible higher level of Magma that could be run at render time over the PCache, but we don’t have this feature yet. If we had it, it would probably do what you asked for…

An additional comment about “Load Single Frame Only”.
This mode does NOT cache the particles between frames for rendering. It has two functions - it stops trying to figure out the time / frame number and loads the exact frame specified on the list, and it caches the viewport particles. So if you move the time slider, the PRT Loader will be lightning-fast. But when rendering, the particles will still be reloaded from the single frame on each frame.

The reason for this is that otherwise you could end up with twice+ the memory usage - PRT Loaders would have to keep ALL channels found in the PRT files (which are usually more than what ends up in the PCache, e.g. Mapping channels, ID channels etc.). So if you have 100 million particles loaded via a PRT Loader with Single Frame Only checked, this would mean keeping 100 MP in memory once with all channels, and then a second time in the PCache. We have resisted so far, but some users have asked multiple times to provide such a per-PRT Loader cache. This could be used even when Load Single Frame Only is unchecked, if the same frame is being re-rendered and only the modifiers are being evaluated… As long as it is optional and off by default, it could be one of those “at your own risk” features. But I am quite sure there are lots of worms in that can :wink:

Ok, thanks to Facebook I now know what you are trying to do. :wink:
First, I would suggest you add a signature to your forum account. While in theory it shouldn’t matter who I am answering to, writing to “paco” or “Patryk” makes a big difference - now I understand the context of the question!

Obviously, your data is not partitioned because it comes from a LIDAR source. So we should discuss all the possible ways to split your original point clouds into multiple sequences to allow for multi-threaded loading. Assuming you have 8 or 16 cores in your machine, this could mean between 5 and 10 times faster performance if your SSD is really fast! Even if it is twice as fast it would be useful.

So you could do several things:
*Set up a Magma modifier that culls particles by Index. Set a From / To range. Split the count by, say, 10, set the range from 1 to 1/10th of the total count and resave as a file. Move the range to 1/10th + 1 through 2/10ths and repeat with a new file name. Repeat until you have all 10 partitions, each containing a slice of the particles (see the sketch after this list).
*Write a MAXScript that reads from a PRT file and writes to N particle files. You could again save 1/10th of the particles to each file, or take every Nth and save the 1st, 11th, 21st etc. to the first file, the 2nd, 12th, 22nd etc. to the second, and so on. As a result, you could enable the first sequence to display in the viewport, giving you 1/10th of the particles in Every Nth mode, but in a single, much smaller file. You could even do test renders with it since it would be a good representation of the whole cloud. I would prefer the every-Nth approach. If you want help with the script, please let me know… It would be slower than the Magma approach though. You can do every Nth via Modulo in Magma, too!
*Process the LIDAR particles from the original source while converting them to multiple PRT files. It is probably too late for this. :slight_smile:
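
For what it’s worth, here is a hedged MAXScript sketch of the index math behind both approaches above (the counts are placeholders, and the actual PRT reading/writing is left to the Magma flow or script):

-- Hedged sketch of the partition slicing math (placeholder counts).
total = 120000000   -- total particle count of the source cloud
parts = 10          -- desired number of partitions
for p = 1 to parts do
(
    -- Contiguous-slice approach: cull to this Index range for partition p.
    local lo = (p - 1) * (total / parts)
    local hi = p * (total / parts) - 1
    format "Partition %: keep Index % through %\n" p lo hi
)
-- Every-Nth approach: keep a particle in partition p when
-- (mod Index parts) == (p - 1), e.g. InputChannel Index -> Modulo -> Equal in Magma.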

Boris,

Many thanks for the elaborate info and all your help.
I managed to make a Magma partitioner (shared in a separate thread).
I ran some benchmarks and the results are quite the opposite of what you mentioned.
I’d appreciate your help explaining what I am missing.

System:
Win 7 x64, 3ds Max 2012, Krakatoa 2.1.3 running natively on a Mac (no VM).
Mac Pro 8-core (16 cores total), Nehalem 3.2 GHz, 32 GB RAM, SATA 7200 rpm drive.

Benchmark #1 - variant A

  • 120 MLN particles total in a single loader in 2 separate files, 100 MLN + 20 MLN
  • Some Magma modifiers and particle culling
  • 28MLN particles rendered (after culling)
  • render resolution 1920x960

-> Hard disk readout on average around 12 MB/s
-> All cores used 100% during render
-> 01:46 render time

Benchmark #1 - variant B

  • 120 MLN particles in a single loader, spread across 16 partition files
  • Some Magma modifiers and particle culling
  • 28MLN particles rendered (after culling)
  • render resolution 1920x960

-> Hard disk readout on average around 14 MB/s
-> All cores used 100% during render
-> 01:56 render time

Benchmark #2 - variant A

  • 20 MLN particles total in a single loader, read from a single file
  • Some Magma modifiers and particle culling
  • 5MLN particles rendered (after culling)
  • render resolution 1920x960

-> 27s render time

Benchmark #2 - variant B

  • 20 MLN particles total in a single loader, read from 4 partitions
  • Some Magma modifiers and particle culling
  • 5MLN particles rendered (after culling)
  • render resolution 1920x960

-> 22s render time

In both cases the actual lighting, sorting and drawing of particles took a fraction of the time; most of it was “Retrieving Particles”.

:confused:

Hi Bobo,

This is another thread I’d like to explore here. After some research I found a possibly reasonable solution for optimizing motion blur.
Revision FX Motion Blur (and a few other plugins) accepts motion vectors encoded in bitmaps. The pixel speed relative to the camera is encoded into RGB and used to do motion blur in post.
I think this could give me a good performance speed-up with acceptable quality loss compared to dense motion blur sampling in Krakatoa, while adding DOF at the same time.

There’s an option to output custom Magma data to Render Elements bitmaps. That’s what I need.
The only point where I am stuck is how to produce the source motion vectors. My particles are static and only the camera is moving.
Any idea how to pull that off?

Although I did some programming in the past, I never did anything in MAXScript, so it looks like I’ll need some help from you or other clever folks here.

Actually, it would also be a good idea for you to include a proper motion vectors module in future Krakatoa builds, since this kind of workflow is quite established in the world of VFX and post, and other renderers support it directly.

I believe you are hitting the bandwidth limits of your HDD. Trying the same from a Solid State Drive would be interesting. You might notice that in the article I sent you, our BASE system for comparison was a 7200rpm drive. The better results were from a pair of SCSI 10000rpm drives, an SSD drive and a Fusion-io card. We were able to saturate all of them but the Fusion-io (but that’s too expensive to recommend for your project). I have an SSD drive in one of my machines and I can confirm that it makes a difference. You should also keep Task Manager open and watch the CPUs while rendering. If you are getting less than 100% CPU activity during loading, you are being limited by the HDD. If the CPUs are at 100%, then a faster drive would not help.

On top of that, your performance is probably affected negatively by the operations being performed on the stack, but obviously you need to do the culling etc., so there isn’t much to be done there.

A few thoughts:

*I am absolutely against rendering volumetric particle data with fake motion blur. But it is your project and your time, so if it works… My issue with it is that if you output a Velocity pass to control motion blur, it will reflect only the particles closest to the camera. Typically, there will be thousands of particles BEHIND the semi-transparent closest particle, and the Render Element will contain zero information about them. Fake MBlur can work with solid surfaces in geometry renderers, but for Krakatoa, it is just a Bad Idea IMHO.

*We were just fixing the camera motion blur to be respected by the built-in Velocity Render Pass in Krakatoa SR / Maya. We talked about fixing that in Krakatoa MX too. So if you decide to go with a Velocity Render Element, camera motion would be respected in such a build. I will have to talk to the developers on Monday to see if we can give you a better build soon.
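
In the meantime, here is a hedged sketch of the math such a camera-aware velocity pass boils down to for a static particle - it simply differences the particle’s camera-space position over one frame (all names are hypothetical):

-- Hedged sketch: apparent velocity of a STATIC particle under a MOVING camera,
-- computed as the change of its camera-space position over one frame.
fn apparentVelocity worldPos cam fps:30.0 =
(
    local m0 = inverse (at time currentTime cam.transform)
    local m1 = inverse (at time (currentTime + 1f) cam.transform)
    ((worldPos * m1) - (worldPos * m0)) * fps   -- camera-space units per second
)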

Regarding your general workflow - I would suggest (if you are not doing it already) to pre-bake the Lighting into the particles as Emission and then perform the culling on the already pre-lit particles. Otherwise the culling would affect the lighting and shadows would change dynamically. Obviously this would require a lot of disk space if the lighting is dynamic. If it is static, a single frame with pre-baked lighting would be enough. But I don’t know enough about your project, so I could be thinking along the wrong lines…

Well, I don’t think it’s a matter of hitting the drive’s performance limits. Not at all.
1 - As I said, my CPUs are ALL running at 100%.
2 - The drive can easily give about 30-50 MB/s, if not 80 MB/s, readout performance. With 13 MB/s we’re not even close to that.
So I believe it is all the processing being done in the stack.

You are right. If it were at least averaged, this could be considered. But since it is only the topmost particles, it does not make any sense. I just missed that point totally.

No rush, but if you have it I’d give it a try.

Yes, I think it is something I need to do since we can’t skip reloading each frame. At least I don’t have to relight.

Hi again,

So here’s the breakdown of my rendering time per frame.
Not sure if anything could be optimized significantly here beyond what I already found.

Summary
Static particles, static lighting, moving camera
20M particles in a PRT Loader (250 MB dataset in 4 files), 4M rendered after geometry volume culling (fast method)

Rendering breakdown
Cumulative time in seconds; bracketed values are the per-stage deltas.

11.0 - Particle loading and drawing
14.0 [+3.0] - Lighting with 2 lights
15.4 [+1.4] - Magma color processing (summed)
23.0 [+7.6] - Magma modifiers based on distance functions (that’s slow - savings here)
49.2 [+26.2] - Motion blur, 8 samples (hurts but seems necessary)
60 [+10.8] - Motion blur, only adding jitter (potential optimization here - the image even looks cleaner without it after the DOF pass)
73.5 [+13.5] - DOF, 0.5 sampling, optimized via near-camera clipping

Optimisation
61 [-12.5s] - removed motion blur jitter, but kept DOF
55 [-6s] - removed Magma modifiers based on distance

73.5 -> 55 = 18.5s saved = 25% time saved

So overall it looks like caching the lighting does not really bring a significant change here (and in my case at the moment it would be a bit time-consuming to recreate the emission effects I am using).
Hope this breakdown will be helpful for someone here. I understand it is a bit specific, but some principles are probably universal anyway.