PRT Loader I/O

We’re loading some PRTs into the PRT Loader (no surprise there).



We’re noticing that every read I/O request being sent out is 64KB. When grabbing a 2GB PRT file, this is less than ideal, as it works out to roughly 32,000 I/O requests.



So I’m wondering if your ZLIB_CHUNK is set to that, or if there is something else that limits the I/O request size? Is there some way to increase this so we can try boosting the performance of our PRT reads?


  • Chad

For fun, we tried loading PRTs from local hard drives. Not good for the pipeline, but good for testing. We were still only getting ~12MB/s for the “Retrieving Particles” part of the render. So it’s not restricted to our SAN — it’s something any user should notice as slow.


  • Chad

What kind of tool are you using to measure the I/O? I’ve tried a few values for the zlib cache but I haven’t gotten a noticeable speedup. I assume you aren’t applying materials or culling while you are measuring the I/O, since those are also lumped into the “Retrieving Particles” phase.

For speed, we can’t measure it accurately. We’re just estimating the time during the “Retrieving Particles” stage. No materials or culling, loading the first N at 100%. And that’s 1 PRT, not a collection of them.



For the I/O size, I have a dedicated Fibre line to our SAN, and I’m watching the log on that port. When Krakatoa is loading PRTs, it’s showing only 64KB reads.



What kind of speeds were you noticing on the PRT reads? Our CPU utilization was less than 20%, so I assume it is I/O bound. The time it’s taking to load 200MB of PRT should be measured in single-digit seconds, but currently it is taking much longer than that.


  • Chad

We did some more testing today - the chunk size appears to have NO effect on the input performance. We went up from 16K to 512K with no measurable effect. We will take a closer look at what can be tweaked in the zip library (if anything) to make it faster.



That being said, we benchmarked v.1.0.1 against the WIP build of v.1.1.0 and saw a significant performance increase from the new named channels core - when a channel like Velocity, Normals or Lighting is not needed, it is not being allocated. For example, loading and rendering 50M particles using the old build allocated 38 bytes per particle, resulting in 1812MB of memory being allocated (as seen in the Cache readout).

When not using Lighting, Normals and Motion Blur (Velocity channel), the new build needed 20 bytes per particle or only 953MB for the same particles. The loading time went down from 77 seconds to 52 seconds.



With 20 bytes/particle, the loading performance was around 1 million particles per second, or about 20MB/s. I will try to do some benchmarking with pure PFlow tomorrow to see how fast Krakatoa would get the same number of particles directly from PFlow. If the speed is comparable, we could assume the ZIP library is not the bottleneck and would have to look closer at our own code.

Ah, didn’t think to compare it to PFlow. Good idea.


  • Chad

I did some tests at home and getting particles from a pre-calculated PFlow is about 10 times faster, so the bottleneck is definitely the I/O or ZipLib. Will do some more benchmarking on the office machine today.

I was just informed that our programmers have discovered the cause of the slow I/O (an acute case of Microsoftus Suckitis) and they have a plan for solving the problem.


I'm looking into replacing the POSIX calls open(), read(), and close() with CreateFile(), ReadFile(), and CloseHandle(). These have extra flags that should help. FILE_FLAG_NO_BUFFERING in particular would seem to help us with our SAN. Using an I/O benchmark utility with these calls, we get much better results.
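A minimal sketch of what that replacement might look like — Win32-only, hypothetical helper names, error handling omitted, and not the actual patch. One documented catch with FILE_FLAG_NO_BUFFERING: file offsets, read lengths, and buffer addresses must all be multiples of the volume’s sector size, which is why the buffer here comes from VirtualAlloc (page-aligned):

```cpp
// Hypothetical sketch, not the actual patch: open/read/close via the Win32
// API with FILE_FLAG_NO_BUFFERING. With this flag, offsets, lengths, and
// buffer addresses must all be sector-size multiples.
#include <windows.h>

HANDLE open_unbuffered(const char* path) {
    return CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
}

// VirtualAlloc returns page-aligned (4KB) memory, which satisfies the
// sector-alignment requirement on typical 512B/4KB-sector volumes.
void* alloc_read_buffer(size_t size) {
    return VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}

// aligned_len must be a multiple of the sector size; short final reads
// need to be rounded up and trimmed by the caller.
bool read_block(HANDLE h, void* aligned_buf, DWORD aligned_len, DWORD* bytes_read) {
    return ReadFile(h, aligned_buf, aligned_len, bytes_read, NULL) != 0;
}

// Teardown: VirtualFree(buf, 0, MEM_RELEASE); CloseHandle(h);
```

The alignment requirement means the last partial sector of a file has to be read rounded-up and trimmed, which is extra bookkeeping versus plain read().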

B.

With a SAN, or any storage with many hard drives, we seem to get better performance if there are multiple asynchronous I/O requests of a minimum size (in our case, currently 512KB). Using IOmeter, http://www.iometer.org/doc/downloads.html , we clearly see the difference overlapping I/O makes.

With 0 outstanding I/Os (synchronous), top speed is 75MB/s (sequential read, 512KB pieces).

And with 4-12 outstanding I/Os (asynchronous), sustained top speed is 390MB/s (again sequential read, 512KB pieces).

Of course these numbers are artificial because there is no processing involved, but the difference is clear.

http://msdn2.microsoft.com/en-us/library/aa363858.aspx

"...FILE_FLAG_NO_BUFFERING... When combined with FILE_FLAG_OVERLAPPED, the flag gives maximum asynchronous performance, because the I/O does not rely on the synchronous operations of the memory manager..."
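The pattern behind those IOmeter numbers can be sketched as follows — Win32-only, hypothetical names, error handling omitted; the 512KB piece size and the queue depth are taken from the benchmark figures above, not from any shipping code:

```cpp
// Sketch of the "multiple outstanding I/Os" pattern: several 512KB reads
// kept in flight with FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING.
#include <windows.h>

static const DWORD PIECE = 512 * 1024;  // request size from the benchmark
static const int   DEPTH = 8;           // 4-12 outstanding I/Os was the sweet spot

void read_file_overlapped(const char* path, unsigned long long total_blocks) {
    HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, NULL);
    OVERLAPPED ov[DEPTH] = {};
    void* buf[DEPTH];
    unsigned long long issued = 0, completed = 0;

    // Prime the queue: issue DEPTH reads before waiting on any of them.
    for (int i = 0; i < DEPTH; ++i) {
        buf[i] = VirtualAlloc(NULL, PIECE, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        ov[i].hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
        if (issued < total_blocks) {
            unsigned long long off = issued * (unsigned long long)PIECE;
            ov[i].Offset     = (DWORD)(off & 0xFFFFFFFFu);
            ov[i].OffsetHigh = (DWORD)(off >> 32);
            ReadFile(h, buf[i], PIECE, NULL, &ov[i]);  // pends with ERROR_IO_PENDING
            ++issued;
        }
    }
    // Round-robin: wait for the oldest outstanding read, consume it, and
    // immediately reissue that slot at the next offset so the disk queue
    // never drains until the end of the file.
    while (completed < total_blocks) {
        int i = (int)(completed % DEPTH);
        DWORD got = 0;
        GetOverlappedResult(h, &ov[i], &got, TRUE);  // TRUE = block until done
        ++completed;
        // ... hand buf[i] (got bytes) to the decompressor here ...
        if (issued < total_blocks) {
            unsigned long long off = issued * (unsigned long long)PIECE;
            ov[i].Offset     = (DWORD)(off & 0xFFFFFFFFu);
            ov[i].OffsetHigh = (DWORD)(off >> 32);
            ReadFile(h, buf[i], PIECE, NULL, &ov[i]);
            ++issued;
        }
    }
    for (int i = 0; i < DEPTH; ++i) {
        CloseHandle(ov[i].hEvent);
        VirtualFree(buf[i], 0, MEM_RELEASE);
    }
    CloseHandle(h);
}
```

The decompression work then overlaps with the in-flight reads, which is exactly the gap between the 75MB/s and 390MB/s numbers above.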

It's more work, but since Krakatoa is so I/O bound, I think it's worth getting this as fast as possible.

Ben.