How do semi-modern Linux kernels (i.e. the kind you'd get in a CentOS 8 "enterprise" distro) manage NUMA scheduling?
Are they going to make some effort to keep the same process/thread on the same or a neighboring core, or is it up to the end user to play Mel and spend some effort optimizing things?
I remember people at $job acting surprised when they compared "2 web servers in VMs on one host" against "one bare-metal web server with 2x the concurrency" and discovered that the 2-VM setup had better throughput. Of course, that was a direct result of $corp using huge 2-socket systems and $ops not paying as much attention to NUMA penalties as the hypervisor did (which happily placed all of VM1's and VM2's workloads on statically assigned cores, each properly within a NUMA performance zone). I never did get a direct answer to "what happens if we just run 2 web servers on one bare-metal server and give each its own IP to bind to...
CFS, the default Linux scheduler, has a hierarchy of scheduling "domains" that by default follows the cache topology. When load balancing, it becomes increasingly less likely to migrate tasks across domains as it goes up the hierarchy.
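If you want to see what that hierarchy looks like on a given box, the kernel exposes the per-CPU domain names when built with CONFIG_SCHED_DEBUG. A minimal sketch in C, assuming either the older /proc/sys/kernel/sched_domain/ path (the ~4.18 kernels you'd find in CentOS 8) or the newer /sys/kernel/debug/sched/domains/ location:

    /* Print the scheduling-domain hierarchy for CPU 0, innermost first.
     * A sketch: needs CONFIG_SCHED_DEBUG, and the path moved between
     * kernel versions, so try both known locations. */
    #include <stdio.h>

    int main(void)
    {
        const char *bases[] = { "/proc/sys/kernel/sched_domain",
                                "/sys/kernel/debug/sched/domains" };

        for (int b = 0; b < 2; b++) {
            int d;
            for (d = 0; ; d++) {
                char path[128], name[64];
                snprintf(path, sizeof(path), "%s/cpu0/domain%d/name",
                         bases[b], d);
                FILE *f = fopen(path, "r");
                if (!f)
                    break;
                if (fgets(name, sizeof(name), f))
                    printf("domain%d: %s", d, name);
                fclose(f);
            }
            if (d > 0)
                break;  /* found the hierarchy under this base */
        }
        return 0;
    }

On a two-socket machine you'd typically see something like SMT, MC, NUMA from the innermost domain outwards, and task migration gets progressively less likely at each level.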
The user is still in charge of deciding where the application's memory gets allocated, either via the NUMA APIs or via numactl. If the application is memory-intensive, for example, one strategy is to spin up one process per NUMA node and make sure each one only uses memory and CPUs from that node.
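As a concrete illustration of that strategy, here's a minimal sketch using libnuma, the C library behind numactl (link with -lnuma). The node argument and the 64 MiB buffer are just placeholders:

    /* Pin this process to one NUMA node's CPUs and memory.
     * Sketch only: run one copy per node, e.g. ./server 0 and ./server 1
     * on a two-node box. */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        if (argc < 2 || numa_available() < 0) {
            fprintf(stderr, "usage: %s <node> (needs a NUMA-aware kernel)\n",
                    argv[0]);
            return 1;
        }
        int node = atoi(argv[1]);

        /* Restrict this process (and any threads/children it spawns)
         * to the node's CPUs... */
        if (numa_run_on_node(node) != 0) {
            perror("numa_run_on_node");
            return 1;
        }
        /* ...and prefer the node's memory for future allocations. */
        numa_set_preferred(node);

        /* A buffer can also be bound to the node explicitly: */
        size_t len = 64UL << 20;  /* placeholder 64 MiB working set */
        void *buf = numa_alloc_onnode(len, node);
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        /* ... do the real work against node-local memory ... */

        numa_free(buf, len);
        return 0;
    }

The no-code equivalent is numactl --cpunodebind=0 --membind=0 ./server for one instance and --cpunodebind=1 --membind=1 for the other, which is more or less what the hypervisor in the GP's anecdote was doing with its static core assignments.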
Confusing conclusion. A bunch of references to Rome/Milan, and then it concludes that Intel's latest has better latency than Zen, but the graphs show Rome performing better. My best guess is they mean Intel's interconnect is better than the first-generation Zen from ages ago. Anyone have a different take on the graphs?
A fabrication disadvantage of this approach is that the four chiplets all have different layouts, while AMD uses only two different CCDs (Zen 4 and 4c) across all of its chiplet product lines.
Still, that's two each of two different ~400 mm² dies instead of just one type of ~70 mm² CCD. AMD also has the IOD, of course, which IIRC is also around 400 mm². Either way, everything about the Intel solution looks and feels more expensive to make.
I don't see why it was necessary to say that the machine is "expensive" twice. There isn't another machine on EC2 with these resources at a lower price, so in that sense it's the cheapest of its class. (The m6a family is 15% cheaper for the same memory and core count, but that instance family is also slower.)
> How is the other machine family slow when it has the same resources?
They aren't the same resources. Different families have different CPU generations and/or CPUs from different manufacturers, with different performance. Just because instances are available with a common core count and memory size doesn't mean an instance with 5+ year old ARM CPUs is going to perform the same as one with new Intel CPUs.