What observability tools were used to confirm that the target of the test was actually being tested properly?
I've performed, and also debugged, a lot of these cloud comparison benchmarks, and it's very, very easy to have bogus results due to some misconfiguration.
What is the total file size for the fio runs? (And what is the intended working set size?) Was fio configured to bypass the file system cache and perform I/O directly to disk? (And if so, what is the rationale for bypassing it?[1]) Were iostat or other tools run during the benchmark to confirm that fio was configured and operating correctly, and that the results could be trusted? Was the same version of fio used, built with the same compiler (the same binary?)?
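To make the questions concrete: a minimal fio job file touching these settings might look like the following. The values are purely illustrative, not the report's actual settings (which weren't published):

```ini
; hypothetical example job -- all values are made up for illustration
[global]
direct=1          ; request O_DIRECT, bypassing the file system cache
ioengine=libaio   ; asynchronous I/O
size=10g          ; total file size per job -- this defines the working set
time_based
ramp_time=300
runtime=600

[randread-4k]
rw=randread
bs=4k
iodepth=16
```

One basic sanity check is to run `iostat -x 1` in another terminal while the job runs, and confirm that the device-level IOPS and request sizes match what fio reports; if direct I/O is silently falling back to the cache, the disk will show far less traffic than fio claims.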
The 118 page report does not include the actual fio commands used.
[1] Disk I/O latency can be a serious issue in cloud environments, and one that vendors can address by incorporating additional levels of storage I/O cache (e.g., in the hypervisor). Picking benchmarks that bypass the cache discourages vendors from doing this, which is not ultimately good for our industry.
The intent of the fio testing was to measure block storage, not cache I/O. Direct I/O was set in the fio configs, but it is usually not honored by the hypervisor. During run-up, a 100% fill test was performed using refill_buffers and scramble_buffers to defeat caching. Then optimal iodepth settings were determined for each workload and block size by running short tests with incrementing iodepth settings (targeting maximum IOPS). Once the iodepth was determined, 3 iterations of tests were performed, each with 36 workloads (18 block sizes, random + sequential). Each of these ran for 15 minutes (5-minute ramp_time, 10-minute runtime). Since asynchronous I/O and variable iodepth settings were used, latency wasn't compared. Total test time per instance for run-up and the 3 iterations was about 36 hours. The fio configs are available here (iodepth and device designation are added at runtime):
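The iodepth sweep described above can be sketched as follows. `run_short_test` is a hypothetical stand-in for a short fio run that returns measured IOPS; the actual harness isn't shown in the thread:

```python
def pick_iodepth(run_short_test, depths=(1, 2, 4, 8, 16, 32, 64, 128)):
    """Run a short test at each queue depth and return the depth that
    achieved the highest IOPS -- the tuning criterion described above."""
    results = {d: run_short_test(d) for d in depths}
    return max(results, key=results.get)

# Canned numbers standing in for real fio runs at each depth:
fake_iops = {1: 5200, 2: 9800, 4: 17500, 8: 24100,
             16: 24900, 32: 24700, 64: 24300, 128: 23900}
best = pick_iodepth(fake_iops.get)
# best == 16: IOPS plateaus there, so deeper queues buy nothing further
```

The plateau behavior is why a sweep is needed per workload and block size: the depth that saturates a 4k random-read job is generally not the one that saturates a 1m sequential job.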
Thanks, but I disagree with the approach of only showing storage benchmarks with disabled caches. Production workloads will see variance between providers due to their different caches and different handling of direct I/O. I'd include direct I/O results _with_ cached results, so that I wasn't misleading my customers.
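One way to report both, as suggested, is a pair of otherwise-identical fio jobs differing only in the direct flag. This is a sketch with made-up job names and sizes, not the report's methodology:

```ini
; hypothetical example: same workload, with and without the page cache
[global]
ioengine=libaio
rw=randread
bs=4k
size=10g
time_based
runtime=600
iodepth=16

[direct-io]
direct=1    ; bypass the page cache: measures the block device itself

[buffered-io]
stonewall   ; wait for the direct job to finish before starting
direct=0    ; allow the page cache: closer to many production read paths
```

Reporting both columns side by side shows how much each provider's caching layers actually help, instead of deliberately excluding that effect.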
I know what I'm suggesting is not the current norm for cloud evaluations. And I believe the current norm is wrong.
The more important question is how the benchmarks were analyzed -- what other tools were run to confirm that they measured what they were supposed to?
Good point - user experience may include both cached and non-cached I/O, so it would be beneficial to include both in this type of analysis.
The benchmark binaries, configurations, and runtime settings were generally consistent for instance types of the same size across services, but we didn't verify the efficacy of the benchmarks as they ran.
wow, I was expecting to hear "We couldn't do much to remove the caches from the equation", but it looks like you guys put a lot of work into doing what you could.
Excellent work, thanks for doing a quality job on this.