
Most people who try O_DIRECT writes give up quickly, concluding it's "slow". What's actually happening is that they're seeing the write bandwidth the system is genuinely capable of, without any of the "clever" optimizations like write caching.


I don't think this is accurate. We have a kernel bypass for disk operations: we use our own memory buffers and bypass the filesystem, block, and SCSI midlayers. Our stack is basically what O_DIRECT should be.

There are cases where we are 50% faster than O_DIRECT without any "caching". Furthermore, in high-bandwidth applications (>4 GB/sec) without O_DIRECT, it's easy to become CPU-limited in the blk/midlayer, so again we win.

Now, that said, I haven't tried the latest blk-mq, scsi-mq, etc. patches, which are tuned for higher IOP rates. Those patches were driven by people plugging in high-performance flash arrays and discovering huge performance issues in the kernel. Still, I expect that if you plug in a couple of high-end flash arrays, the kernel, rather than the IO subsystem on a modern Xeon, is going to be the limit.


Sure, if you're dealing with extremely high-bandwidth apps (4 GB/sec is pretty high bandwidth!), what I said doesn't apply.

The number of people sustaining 4 GB/sec (on a single machine/device array) is pretty small, and they have a reason to go beyond the straightforward approaches the kernel makes available through a simple API (everything you described, like kernel bypass, puts you in a rare category).

Anyway, when I was swapping to SSD, the kswapd process was using 100% of one core while swapping at 500 MB/sec. I suspect many kernel threads haven't been CPU-optimized for high throughput.


But that's also why his tests are unreliable in this case.



