How do semi-modern Linux kernels (i.e. the kind you'd get in a CentOS 8 "enterprise" distro) manage NUMA scheduling?
Are they going to make some effort to keep the same process/thread on the same or a neighboring core, or is it up to the end user to play Mel and spend some effort optimizing things?
I remember people at $job acting surprised when they compared "2 web servers in VMs on one host" against "one bare-metal web server with 2x the concurrency" and discovered that the 2-VM setup had better throughput. Of course, that was a direct result of $corp using huge 2-socket systems and $ops not paying as much attention to NUMA penalties as the hypervisor did (which happily placed all of VM1's and VM2's workloads on statically assigned cores, each properly within a NUMA performance zone). I never did get a direct answer to "what happens if we just run 2 web servers on one bare-metal server and give each its own IP to bind to...
CFS, the default Linux scheduler, has a hierarchy of scheduling "domains" that by default follows the cache topology. When load balancing, it becomes increasingly less likely to migrate tasks across domains as it goes up the hierarchy.
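If you want to see what that hierarchy looks like on a given box, the kernel exposes the per-CPU domain names when built with CONFIG_SCHED_DEBUG. A minimal sketch in C, assuming either the older /proc/sys/kernel/sched_domain/ path (the ~4.18 kernels you'd find in CentOS 8) or the newer /sys/kernel/debug/sched/domains/ location:

    /* Print the scheduling-domain hierarchy for CPU 0, innermost first.
     * A sketch: needs CONFIG_SCHED_DEBUG, and the path moved between
     * kernel versions, so try both known locations. */
    #include <stdio.h>

    int main(void)
    {
        const char *bases[] = { "/proc/sys/kernel/sched_domain",
                                "/sys/kernel/debug/sched/domains" };

        for (int b = 0; b < 2; b++) {
            int d;
            for (d = 0; ; d++) {
                char path[128], name[64];
                snprintf(path, sizeof(path), "%s/cpu0/domain%d/name",
                         bases[b], d);
                FILE *f = fopen(path, "r");
                if (!f)
                    break;
                if (fgets(name, sizeof(name), f))
                    printf("domain%d: %s", d, name);
                fclose(f);
            }
            if (d > 0)
                break;  /* found the hierarchy under this base */
        }
        return 0;
    }

On a two-socket machine you'd typically see something like SMT, MC, NUMA from the innermost domain outwards, and task migration gets progressively less likely at each level.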
The user is still in charge of deciding where the application's memory gets allocated, either via the NUMA APIs or via numactl. If the application is memory-intensive, for example, one strategy is to spin up one process per NUMA node and make sure each one only uses memory and CPUs from that node.
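As a concrete illustration of that strategy, here's a minimal sketch using libnuma, the C library behind numactl (link with -lnuma). The node argument and the 64 MiB buffer are just placeholders:

    /* Pin this process to one NUMA node's CPUs and memory.
     * Sketch only: run one copy per node, e.g. ./server 0 and ./server 1
     * on a two-node box. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc < 2 || numa_available() < 0) {
            fprintf(stderr, "usage: %s <node> (needs a NUMA-aware kernel)\n",
                    argv[0]);
            return 1;
        }
        int node = atoi(argv[1]);

        /* Restrict this process (and any threads/children it spawns)
         * to the node's CPUs... */
        if (numa_run_on_node(node) != 0) {
            perror("numa_run_on_node");
            return 1;
        }
        /* ...and prefer the node's memory for future allocations. */
        numa_set_preferred(node);

        /* A buffer can also be bound to the node explicitly: */
        size_t len = 64UL << 20;  /* placeholder 64 MiB working set */
        void *buf = numa_alloc_onnode(len, node);
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        /* ... do the real work against node-local memory ... */

        numa_free(buf, len);
        return 0;
    }

The no-code equivalent is numactl --cpunodebind=0 --membind=0 ./server for one instance and --cpunodebind=1 --membind=1 for the other, which is more or less what the hypervisor in the GP's anecdote was doing with its static core assignments.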
Confusing conclusion. A bunch of references to Rome/Milan, and then it concludes that Intel's latest has better latency than Zen, but the graphs show Rome performing better. My best guess is they mean Intel's interconnect is better than the first-generation Zen from ages ago. Anyone have a different take on the graphs?
A fabrication disadvantage of this approach is that the four chiplets all have different layouts, while AMD uses only two different CCDs (Zen 4 and 4c) across all of its chiplet product lines.
Still, that's two each of two different ~400 mm² dies instead of just one type of ~70 mm² CCD. AMD also has the IOD, of course, which IIRC is also around 400 mm². Either way, everything about the Intel solution looks and feels more expensive to make.
I don't see why it was necessary to say that the machine is "expensive" twice. There isn't another machine on EC2 with these resources at a lower price, so in that sense it's the cheapest of its class. (The m6a family is 15% cheaper for the same memory and core count, but that instance family is also slower.)
> How is the other machine family slow when it has the same resources?
They aren't the same resources. Different families have different CPU generations and/or CPUs from different manufacturers, with different performance. Just because instances are available with a common core count and memory size doesn't mean an instance with 5+ year old ARM CPUs is going to perform the same as one with new Intel CPUs.