This is interesting, but I think there are a lot of caveats. 800G Ethernet NICs are very expensive, and PCIe 5.0 SSDs are much pricier than PCIe 4.0 SSDs. They also used DDR5-4800 in their comparison instead of DDR5-6400. All of this is just to say that on a "typical" server/workstation, the relative speeds of Ethernet, RAM, and disk might look very different from what the authors suggest.
The other element which goes unsaid is that in a typical datacentre, your bisection bandwidth is typically << (num computers * network bandwidth per computer). In other words, even if your computation is bandwidth-starved rather than latency-starved, you're not obviously going to be able to do a gigantic data shuffle quickly unless it stays within a rack. Which is to say, I'm not sure _just how much_ this would affect your overall target system architecture at present. Once you're switching at a couple of terabits, things are likely quite different in those terms.
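As a rough back-of-envelope (all numbers below are illustrative assumptions, not taken from the article), here's how much an oversubscribed fabric stretches a big shuffle, assuming the pessimistic case where all traffic crosses the bisection:

    # Illustrative numbers: 1024 hosts with 400 Gbit/s NICs, 4:1 oversubscription
    # at the spine, each host shuffling 1 TB, all traffic crossing the bisection.
    hosts = 1024
    nic_gbps = 400
    oversubscription = 4
    data_per_host_bits = 1e12 * 8

    aggregate_nic_bps = hosts * nic_gbps * 1e9
    bisection_bps = aggregate_nic_bps / oversubscription
    total_bits = hosts * data_per_host_bits

    print(f"NIC-limited shuffle:       {total_bits / aggregate_nic_bps:.0f} s")  # ~20 s
    print(f"bisection-limited shuffle: {total_bits / bisection_bps:.0f} s")      # ~80 s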
The other element which is a bit scary is that it's fairly rare these days for mass-market companies to index deeply on tech which isn't available in public clouds, so until AWS supports this sort of thing, it's unlikely that many folks will target it. You can kind of see this with the Optane PDIMMs: they looked absolutely fantastic, but given you couldn't get them on any AWS instance, there wasn't much point actually trying to use them outside of very specific applications. As a software engineer, this hardware lets me build my software very differently and in a simpler way, but how can I possibly risk architecting around it if it then can't support a customer's cloud migration?
Obviously very different in HPC contexts.
And it should always be said: latency is very different between these. Memory latency is still measured in nanoseconds, PCIe latency still in tens of microseconds, about three orders of magnitude of difference.
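As a quick sanity check on what that gap implies (the latency figures here are ballpark assumptions, not measurements), here is how much data a pipeline has to keep in flight to hide each latency:

    # Ballpark assumptions: ~100 ns to DRAM, ~20 us for an I/O completion over PCIe/NVMe.
    dram_latency = 100e-9
    pcie_io_latency = 20e-6
    print(f"latency ratio: ~{pcie_io_latency / dram_latency:.0f}x")

    # Data to keep in flight to hide each latency on a 100 Gbit/s stream:
    bytes_per_s = 100e9 / 8
    print(f"in flight at DRAM latency: {bytes_per_s * dram_latency:.0f} bytes")        # ~1250 bytes
    print(f"in flight at PCIe latency: {bytes_per_s * pcie_io_latency / 1e6:.2f} MB")  # ~0.25 MB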
I can see many reasons why Optane PDIMMs didn't take off other than not being cloud-available (which Intel might have subsidised with the help of a big software player and a series of killer apps). The price was excruciating, the number of write cycles (and hence the lifetime of the machine and the architecture) was never very clear, and the upgrade path / exit hatch wasn't clear either. The programming model also wasn't well established, and not many big, important applications had migrated to use them efficiently or in an interesting manner (we were still in the 'crazy interesting papers' phase of the cycle).
It still grates on me that the only interface we ended up with for high-speed durable data is NVMe over PCIe, and I regret the lost promise of fast byte-addressable persistent memory, but once again, worse-is-better seems to have won?
Very good question. Being up to my eyeballs in 400G, DPDK, GPUDirect, and SPDK right now, I see a way forward, but I also don't see how we 'progress' much further without them. Where's the 'other' path? Integrated chips like Apple's M1/M2, NVIDIA's Grace Hopper, and AMD's MI300X?
> Also they used DDR5 4800 in their comparison instead of DDR5 6400.
Are there server processors on the market yet that support DDR5-6400? Intel Sapphire Rapids and AMD Genoa only support DDR5-4800. Faster DIMMs are available (especially in the consumer market segments) but it seems quite reasonable to base this kind of analysis on what's actually officially supported and widely available.
I was running memtest recently on my brand new 13900 with DDR5-6000 and the stock performance is 20 GB/s. Overclocking from the motherboard's stock setting of 4800 MT/s to 6000 MT/s still only gets me 25 GB/s.
Is the 38 GB/s for DDR5-4800 in the article hypothetical, is my motherboard/CPU a bottleneck somehow, or is the memtest86 bundled with my motherboard just not capable of measuring bandwidth correctly on my system?
Check whether you have installed one module or two, and if two, that each sits in its own memory channel. If two modules are installed in the wrong slots, they can end up sharing the same channel, and that can result in a slowdown.
A single 13900 core, at the default clock, should be able to read at a maximum rate of 89.6/2 = 44.8 GB/s (small gigabytes = 10^9). One DDR5 module running at 6000 MT/s has a maximum throughput of 48 GB/s. Granting some unknown slowdowns due to error correction and other things, I would expect to see something above 40 GB/s. With two RAM modules and two or more cores working, you should be able to get above 80 GB/s.
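For reference, those figures come from the usual rule of thumb of transfers per second times 8 bytes per 64-bit channel; a minimal sketch of the arithmetic (the DDR5-5600 line is the 13900's officially supported maximum):

    # Theoretical peak: MT/s x 8 bytes per 64-bit channel (decimal gigabytes).
    def ddr_peak_gbs(mt_per_s, channels=1):
        return mt_per_s * 8 * channels / 1000

    print(ddr_peak_gbs(6000))               # 48.0  - one DDR5-6000 DIMM
    print(ddr_peak_gbs(5600, channels=2))   # 89.6  - dual-channel DDR5-5600
    print(ddr_peak_gbs(4800, channels=2))   # 76.8  - dual-channel DDR5-4800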
Perhaps memtest does something that limits its memory performance (maybe it uses older, less efficient instructions for reading memory instead of SSE/AVX), or maybe it makes a factor-of-2 error when calculating the speed.
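If you want a cross-check outside memtest, even summing a large array gives a rough lower bound (a sketch assuming Python with numpy; a single thread doing a plain sum won't reach the theoretical peak, but it's an independent number to compare against):

    import time
    import numpy as np

    a = np.zeros(2 * 1024**3 // 8)   # ~2 GiB of float64, far larger than any cache
    runs = 5
    t0 = time.perf_counter()
    for _ in range(runs):
        a.sum()                      # streams the whole array from DRAM each pass
    dt = (time.perf_counter() - t0) / runs
    print(f"~{a.nbytes / dt / 1e9:.1f} GB/s single-threaded read")

Something like STREAM or Intel's Memory Latency Checker would give more rigorous numbers.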
I have two slots installed correctly. The bandwidth tool from the AUR reports ~20-35 GB/s for sequential reads (with some odd spikes to 70 GB/s), which is consistent with memtest86's ~25 GB/s (the discrepancy can be explained away by memtest probably not using AVX instructions). Writes look a lot better, at ~58 GB/s bypassing the cache.
That's still a large discrepancy, though: I'm typically 30% away from the nominal speed.
Make sure all power-saving stuff is disabled in the BIOS, that XMP is being used, and that Linux is not booting with some weird ACPI/APIC flags. Try ganged/unganged mode (DCT) in the BIOS.
You have to be careful reading the charts. The author of that tool doesn't do the best job of clarifying that most of that chart is showing L1/L2/L3 cache speeds. That tool could definitely use some TLC. I don't see any results that get anywhere near that once you're out of cache range.
In the first row, for an Intel Core i7-930 with DDR3 at 2000 MT/s, he should be getting at most 16 GB/s when reading from the DRAM modules (not cache), but he's actually getting 18.4 GB/s, a good 15% more. Maybe more cores were used; in that case the theoretical limit of that processor is 25.6 GB/s and the result is not that great.
But then for the i5-520M at 1066 MT/s he should be getting 8.5 GB/s per core, yet he's getting less, only 7.1 GB/s.
Maybe contact the author (his email is in README.txt) and ask for clarification on these discrepancies and for help with your problem.
I think there is a problem somewhere with your system or measurements. At 6000 MT/s, there's no way getting 25 GB/s is fine; that's what DDR4-3200 should be able to do. You should be getting close to (but below) 48 GB/s on one core, and 89.6 GB/s multicore. And writes should be slower than reads.