AMD’s Rome is indeed a monster

ChuckMcM · on Nov 9, 2018

Looking forward to these chips but always have concerns about AMD's ability to execute consistently. Opteron, the original 'Sledgehammer' series, was way ahead of Intel because Intel just couldn't bring themselves to put 64-bit features into their Pentium line, and AMD squirreled away that advantage by not following up, and having other issues with later spins of the Opterons.

That said, this really does look like a pretty awesome chip for data centers. I would love a dual socket motherboard that had 512G of RAM, 1TB of Optane memory as additional "RAM", then 16TB of NVMe SSD storage, and 32 SAS/SATA channels for an effective 360TB of rotating disk (dual parity RAID 6, 30 active drives, 2 parity drives). And then make a cluster of 48 of those monsters.
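The 360TB figure works out if each drive holds 12TB. That drive size is an assumption here, since the comment does not state it. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def raid6_usable_tb(total_drives, drive_tb, parity_drives=2):
    """Usable capacity of one dual-parity group, ignoring filesystem overhead."""
    if total_drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (total_drives - parity_drives) * drive_tb

# 32 SAS/SATA channels, 2 drives' worth of parity; 12 TB per drive is assumed.
print(raid6_usable_tb(32, 12))  # 360
```

Hot spares, if added, would come out of the 32 channels and reduce the usable figure accordingly.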

Ah the places we would go and the things we would do with such a system.

nemothekid · on Nov 9, 2018

>AMD squirreled away that advantage by not following up, and having other issues with later spins of the Opterons.

AMD didn't squirrel away the advantage. Intel abused their monopoly to starve out AMD by giving OEMs discounts if they agreed not to carry AMD chips. They paid a paltry 1.5B in fines for abusing their market dominance.

rasz · on Nov 10, 2018

> They paid a paltry 1.5B in fines

They didn't: https://www.lexology.com/library/detail.aspx?g=8965e7a2-ac87...

zrm · on Nov 10, 2018

> Looking forward to these chips but always have concerns about AMD's ability to execute consistently. Opteron, the original 'Sledgehammer' series, was way ahead of Intel because Intel just couldn't bring themselves to put 64-bit features into their Pentium line, and AMD squirreled away that advantage by not following up, and having other issues with later spins of the Opterons.

That isn't really what happened though. It was a combination of two things. One was this:

https://www.extremetech.com/computing/184323-intel-stuck-wit...

The other was that it happened around the time when CPU frequencies hit the power wall. That hit the Pentium 4 especially hard, which gave AMD the advantage, but Intel's anti-competitive behavior prevented AMD from capitalizing on it. Meanwhile Intel knew the Pentium 4 was too power hungry for laptops, so they kept iterating on the Pentium M, which is what became Core. It was designed for power efficiency rather than clock speed right when clock speeds unexpectedly started getting limited by power. It wasn't expected to be faster than the Pentium 4 (at half the power), but it was, so Netburst got canceled and suddenly Intel had the advantage.

The combination of the two things meant that AMD never had a chance to really profit from its investment in Sledgehammer, which meant they didn't have the money to put into R&D and fell behind for a decade.

There is no guarantee something else won't go wrong, but the chance of that same confluence of factors happening again seems pretty unlikely.

wmf · on Nov 9, 2018

At this point Intel isn't executing consistently either, so any customer who treats their processor vendor as "strategic" is going to pay the price.

vondur · on Nov 9, 2018

I'm thinking a lot of their recent success may have to do with Lisa Su becoming their CEO back in 2014. She seems really smart and focused on their core business. I hope some of this success on the CPU market will bleed over to their GPU developments.

user5994461 · on Nov 9, 2018

RAID 6 with 32 drives. All the things you would not do while the array is rebuilding.

ChuckMcM · on Nov 9, 2018

I'm not sure I understand the comment.

I've used dual parity RAID systems for over a decade now and they work superbly when rebuilding even during a drive failure. (aka in 'degraded' mode). I typically run them in 22 drive sets because that is how many drives fit on a single NetApp drive shelf.

commentor10 · on Nov 9, 2018

With 32 drives you should have at least 3 failure tolerances. Might even want to plan some hot spares in there. The idea is to build a large file system that won't die and take your data with it - nor introduce downtime (with a suitable failure tolerance). And of course, have an offsite backup for actually backing up your data.

ChuckMcM · on Nov 10, 2018

Agreed with the hot spare. Typical RAID reconstruction on these systems is limited to available I/O operations (IOPS) after accounting for the number needed to meet performance goals. If these are the only drives on a fairly beefy system you have lots of extra IOPS available so you should be able to reconstruct a drive in under a day if you wanted to. There are the usual caveats about age related failures happening in groups but at least anecdotally and from what I know from the folks who work in NetApp customer support usage across a much larger population, it is quite reliable.

montecarl · on Nov 9, 2018

I am aware of the issues one can have with rebuilding a large RAID array. It seems like you will always hit some parity issue upon reconstruction. What is the proper approach here?

wmf · on Nov 9, 2018

Either declustered RAID for faster rebuilds or RAID 60 with smaller stripes.

nine_k · on Nov 9, 2018

Why, RAID6 can rebuild online, because it turns temporarily into a RAID5. Even RAID5 can be used while rebuilding, in read-only mode, because data from the remaining disks is sufficient (else rebuilding won't be possible).
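The point that the remaining disks are sufficient can be illustrated with single XOR parity, as in RAID5. This is a toy sketch with made-up block contents, not how any particular controller implements it, and RAID6's second parity uses a different code that is not shown here:

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]  # three data blocks in one stripe (toy values)
parity = xor_blocks(data)           # what the parity block would hold

# Simulate losing the second drive and rebuilding it from survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]
```

Reads of the missing block during a rebuild can be served the same way, which is why the array stays usable at reduced performance.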

overcast · on Nov 9, 2018

RAID5 can be used normally while rebuilding. Albeit at reduced performance. It's definitely NOT read only as that is effectively the same as offline in a production environment.

nine_k · on Nov 9, 2018

Depends on your usage pattern. You might be hesitant to allow writes without redundancy, especially if you store something important which is mostly read-only anyway (e.g. 95% of operations are reads).

For a large write-heavy setup + "we will restore from yesterday's backup" mode of operation, you may likely be better off with a RAID10 which is faster. (Though RAID50 and even RAID60 are a thing.)

overcast · on Nov 12, 2018

Sure, but it's not offline, ever. The debate isn't about the best RAID level for your usage.

Dunedan · on Nov 9, 2018

AMD had enough time to learn that lesson. Let's hope they did.

bakul · on Nov 9, 2018

What would you do with such a system that is innovative and not just more of the same?!

ChuckMcM · on Nov 10, 2018

I've been noodling on what it would take to build a generally conversational dialog machine.

riskneutral · on Nov 9, 2018

That sounds like a very good start with CPUs and memory, but you need to include GPUs and network interconnect and explain what the high speed busses between all those components look like. How do you solve that?

NicoJuicy · on Nov 9, 2018

The older you are, the better you execute things. Look at AMD 2 years ago versus now

jaas · on Nov 10, 2018

Lots of companies get older and lose ability to execute. AMD was very old already when Opteron happened.

The age of the company in a case like this is irrelevant. What matters is current management and staff and market context.

NicoJuicy · on Nov 12, 2018

The CEO executed it perfectly for the new chips, because they had walked that road before.

With less success then

walrus01 · on Nov 9, 2018

Not particularly related to Rome, but looking at what you can buy right now for a single socket system, AMD is far ahead of Intel for reasonably priced workstations or small servers with a huge amount of I/O. One Threadripper CPU has 64 PCI Express 3.0 lanes. One lane is 985MB/s.

Working with a $399 Threadripper motherboard that has four x16 physical slots, it can accommodate four Intel x710-4 10GbE four port NICs (each electrically x8), for a total of sixteen 10GbE router interfaces in a Linux kernel based FRR system, as a fully software implemented router. And with the RAM capacity of a 32GB system, no worries about FIB size. Routing entirely in CPU is a very different approach from doing it in ASICs, but this can be built for under $4000.

Or one could just as easily use those four physical x16 slots for four independent 100GbE interfaces.
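A rough check that these NIC configurations fit in the slots, using the 985MB/s per-lane figure quoted above. The function names are hypothetical, and the sketch ignores protocol overhead and looks at one direction only:

```python
LANE_MB_S = 985  # per-lane PCIe 3.0 throughput quoted in the comment

def slot_gbit_s(lanes):
    """Approximate one-direction slot bandwidth in Gbit/s."""
    return lanes * LANE_MB_S * 8 / 1000

def fits(lanes, ports, port_gbit_s):
    """True if the slot can carry all ports at line rate in one direction."""
    return slot_gbit_s(lanes) >= ports * port_gbit_s

print(slot_gbit_s(8))    # bandwidth of an electrically x8 slot
print(fits(8, 4, 10))    # four-port 10GbE NIC in an x8 slot
print(fits(16, 1, 100))  # single 100GbE port in an x16 slot
```

Both configurations leave headroom on the bus, so under these assumptions the limit is more likely the CPU's per-packet work than PCIe bandwidth.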

CoolGuySteve · on Nov 9, 2018

I looked into it for a server but I wasn't able to find a motherboard with remote management features.

I suspect there might never be one so as not to muddy the waters of Epyc's intended market.

kolbe · on Nov 9, 2018

That's so frustrating. We ran into the same roadblock.

zaroth · on Nov 10, 2018

I’m sorry, you’re saying there’s no boards available with IPMI?

kolbe · on Nov 12, 2018

I couldn't find a Threadripper compatible board with IPMI, no. At least a couple are supporting ECC now.

hak8or · on Nov 10, 2018

I looked into this in the past but had issues finding information about processing overhead for routing (or even switching) that much data. Could a, say, 2950 handle routing 40 gigabit worth of data over a network, assuming no crypto?

wmf · on Nov 10, 2018

Easily. One modern core can route 10 Gbps or more using efficient software like VPP.
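To put a number on the per-packet budget, here is a sketch of the worst case for 10GbE: minimum-size 64 byte frames, plus the standard 8 byte preamble and 12 byte inter-frame gap on the wire. The helper names are hypothetical, and real traffic with larger packets leaves far more time per packet:

```python
PREAMBLE_BYTES = 8         # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # mandatory idle time between frames

def packets_per_second(link_gbit_s, frame_bytes=64):
    """Maximum frame rate on an Ethernet link for a given frame size."""
    wire_bits = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_gbit_s * 1e9 / wire_bits

def ns_per_packet(link_gbit_s, frame_bytes=64):
    """Time budget per packet if one core must keep up with line rate."""
    return 1e9 / packets_per_second(link_gbit_s, frame_bytes)

print(packets_per_second(10))   # roughly 14.88 million packets per second
print(ns_per_packet(10))        # roughly 67 ns per packet
print(ns_per_packet(10, 1500))  # much more relaxed with large frames
```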

tapoxi · on Nov 9, 2018

Every time I see SemiAccurate I remember their love for AMD and their constant, unending hate for Nvidia: https://semiaccurate.com/2009/10/01/nvidia-fakes-fermi-board...

I don't have any dog in this fight, SA is just a terrible news source.

dman · on Nov 9, 2018

It might feel one sided, but I think SemiAccurate is one of the few places remaining doing old fashioned journalism (ie pursuing sources and gathering information from the real world). Most other tech news websites have devolved into wrapping up marketing materials from tech companies into "reviews".

m_mueller · on Nov 10, 2018

Calling Jensen “Dear Leader” though? Did I catch that right? If you have good info why resort to tasteless and frankly racist language?

wmf · on Nov 9, 2018

They don't like Intel or Nvidia, but their facts tend to be mostly accurate. I wouldn't listen to any of their financial or market-share predictions since the market has remarkable inertia that disconnects it from technical factors.

zamadatix · on Nov 9, 2018

I would say they are semi accurate.

overcast · on Nov 9, 2018

Accurate facts, the best kind of facts.

zaroth · on Nov 10, 2018

I’m feeling almost nostalgic.

nv-vn · on Nov 9, 2018

But they were right in that case... I know it's a biased source but I've seen a lot of accurate info there in the past

twtw · on Nov 9, 2018

> But they were right in that case...

How do you know?

nv-vn · on Nov 9, 2018

It was widely reported on afterwards. Look at the bottom of the article, they mention Nvidia's cover-up.

pellucide · on Nov 9, 2018

It is semi-accurate after all.

ksec · on Nov 10, 2018

>SA is just a terrible news source.

Yes, and it does not belong on HN.

sixothree · on Nov 9, 2018

The article has a lot of boasting about their exclusive scoop. That and their annoyingly watermarked images.

sizeofchar · on Nov 9, 2018

They started using watermarks this way when their news were plagiarized some 5 years ago by even big sites such as WCCFTech, Tom’s and others. It is ugly but understandable, they are independent, don’t live on ads, aren’t invited for most events and barely receive pre-release hardware for reviews.

CoolGuySteve · on Nov 9, 2018

Interesting design. The 70mm^2 die is more similar to an embedded chip so I guess that explains how AMD was able to hit 7nm so much earlier than Intel.

But reading between the lines, AMD is only showing benchmarks for extremely parallel loads much like they did with ThreadRipper 2. Between the physically separate processing cores and the nature of the marketing benchmarks, I suspect workloads that have high IPC or extremely low latency requirements won't fare so well on this chip.

The weirdest part is to think what Apple might come up with if they were to take the A12X design and connect it on a die like AMD has done here. It would probably make for a pretty interesting Mac Pro.

ip26 · on Nov 9, 2018

You play to the strengths of the part and the interests of the target market. People willing to pay the price premium for ThreadRipper are interested in throughput, not latency.

stcredzero · on Nov 9, 2018

> You play to the strengths of the part and the interests of the target market.

However, the strategy of gluing smaller dies together also better fits the (long known) pragmatics of manufacturing processes. This allows AMD to hit lower price points while getting better margins. It also coincides with Intel being hit with "The Thermocline of Truth" where they were too optimistic about overcoming the challenges of larger die sizes.

"The Thermocline of Truth"

http://brucefwebster.com/2008/04/15/the-wetware-crisis-the-t...

vondur · on Nov 9, 2018

I’ve been wondering lately if the reason for the Mac Pro replacement not coming out until 2019 may be due to Apple switching to the 7nm refresh of the AMD ThreadRipper. That would certainly make for some interesting news.

jamesu · on Nov 9, 2018

But then you've got other people who seem adamant they are switching the mac to their magically scaled up ARM chips. What on earth is really going on at Apple HQ?

fipple · on Nov 9, 2018

That will take several years. AMD could happen in months.

vondur · on Nov 9, 2018

I would think that the use cases for a laptop CPU and high end desktop may call for different CPUs. Plus the x64 arch is well supported by all of the apps you would use a Mac Pro for.

ksec · on Nov 10, 2018

Apple can't use Thunderbolt with AMD, and Intel has been delaying its opening of the Thunderbolt standard for whatever reason. At this point I think Intel has been so disappointing in the past 3 years that Apple should just dump them and take matters into their own hands.

Waterluvian · on Nov 10, 2018

I read somewhere that Apple has been trying to avoid uttering the word "Intel" in their latest presentations. If that's true, maybe it lends a bit of vague support to your idea.

BonesJustice · on Nov 9, 2018

Oh man, the licensing costs for any “per core” software are going to be insane :D

PedroBatista · on Nov 10, 2018

A few years ago many companies changed it to "per socket".

Now with 64 cores in one socket things got even weirder, but then again, those licenses always have been in some parallel universe.

Twirrim · on Nov 10, 2018

There's still a disturbing amount of software that operates on a per-core pricing model. Some of them include Hyperthread in that core count. When we launched Oracle Cloud Infrastructure, (as Bare Metal Cloud, before we launched our VM product), our standard bare metal instance had 36 cores, 72 threads with HT enabled. One interested customer couldn't use our platform because just licensing their standard software on our platform was going to set them back some $250,000 due to all those 72 "cores". The product manager at the time mentioned there were several other customers facing that issue with their various bits of enterprise software.

dman · on Nov 9, 2018

I've been really enjoying my dual Epyc 64 core workstation. Looking forward to when I can upgrade to 128 cores. :)

montecarl · on Nov 9, 2018

Did you build that yourself? Do you know of any good vendors for this type of workstation? I know building it isn't that hard, but typically an employer doesn't want to go that route.

dman · on Nov 9, 2018

I built it myself but there are multiple vendors selling prebuilt workstations eg https://www.velocitymicro.com/wizard.php?iid=327

PS: I have no affiliation with velocity micro, have never bought from them

walrus01 · on Nov 10, 2018

Silicon Mechanics can build just about anything with your choice of Supermicro motherboard.

http://www.supermicro.com/en/products/aplus/solutions/SP3

spiritcat · on Nov 10, 2018

what do you do with all those cores? if i may ask

dman · on Nov 10, 2018

Simulations and randomized testing.

jesuslop · on Nov 9, 2018

Question, why is DRAM still packed separately?

wmf · on Nov 9, 2018

A lot of reasons. It's physically quite large (multiple thousands of sq. mm. of silicon). Customers want different amounts of RAM so you could end up with a cartesian product explosion of SKUs. Customers want to upgrade RAM. System vendors and integrators don't want to give up more margins to the CPU.

extrapickles · on Nov 9, 2018

For higher power parts, heat dissipation is the primary reason. Most of the high end CPUs are purely heat limited in terms of performance.

A TDP of 90W is about the limit for CPUs using cheap cooling techniques, and adding ram would lower that by 10-30W.

ericd · on Nov 9, 2018

If you interleaved the memory with the processing, wouldn’t it decrease the heat density, though, which afaik is the limiting factor? You can get rid of quite a lot more heat than that if it’s not in too small an area to conduct away quickly.

saltcured · on Nov 9, 2018

The semiconductor processing for high-end DRAM is not the same as for high-end CPU logic or cache SRAM, so it is not cost-effective to put both on the same chip. This was a recurring obstacle to past processing-in-memory R&D projects.

To mix the two for thermal reasons would lead to some crazy multi-chip modules as in the "chiplet" approach reported recently. However, I suspect you would need far too many chiplets and too many bonds to really give you this thermal benefit rather than just having individual chiplets which are still hot spots with problematic cooling.

mjevans · on Nov 9, 2018

Processors also tend to run hot, while RAM is more reliable when cool. Much like some parts in human biology benefit from being outside of the core (enough to make it an evolutionary benefit even when it's quite an obvious defect due to vulnerability), having RAM in its own cooling sub-domain makes sense, and since it's already out there the blade interconnect card design is space, cost, and cooling effective.

tmd83 · on Nov 9, 2018

I don't think that's feasible either. If you put memory in between chip elements, that means you now have to route around it, adding to wastage. It also means the distance between the elements would increase, slowing things down. At the speed that chips are running the physical distance matters (from what I recall).

nine_k · on Nov 9, 2018

I wonder if the extra computing power achievable by more extensive cooling is worth it, in computation units per watt (including both system power and cooling power).

I keep reading that some data centers switch to liquid cooling, so likely it is. Maybe with advances on this front, we'll see the increase in CPU efficiency that the fabrication process can no longer deliver.

monocasa · on Nov 9, 2018

Same reason that they split off the IO block here, but exaggerated. DRAM is very different from a process perspective, with very different constraints. Even eDRAM tends to be a compromise in a lot of ways that wouldn't be competitive if it weren't literally on the same chip as the logic.

Additionally DRAM more or less has to be separate chips due to the sheer surface area for yield reasons. The amount of DRAM in a decent sized server is close to a whole wafer.

rwmj · on Nov 9, 2018

A different question is why is DRAM still colocated with the CPUs, and not centrally located over a very fast network? There was a company doing this (RNA Networks, founded 2006) which was sold to Dell and then apparently disappeared without trace.

It might make more sense now that there are more types of memory-like storage which isn't quite as fast as DDR4, like Optane.

walrus01 · on Nov 9, 2018

Latency between a CPU that has the memory controller built directly onto the die and adjacent DRAM modules is significantly lower than if the memory is abstracted away somewhere else on the far side of a network link, even a low latency, short distance one.

uluyol · on Nov 10, 2018

Attempts to do this are being made in academia. Some recent work tries to quantify the network requirements of this, and decides that it's far more feasible to use some local memory as a cache than to have everything be remote.

Network requirements: https://www.usenix.org/conference/osdi16/technical-sessions/...

Two systems that use remote memory: https://www.usenix.org/conference/nsdi17/technical-sessions/... https://www.usenix.org/conference/osdi18/presentation/shan

wmf · on Nov 9, 2018

The cost and latency of the very fast network is still too high, but for slower stuff Gen-Z is coming.

orlp · on Nov 9, 2018

The answer to that is very simple: latency.

There is also an infinitely more tricky issue: security.

stefan_ · on Nov 9, 2018

Why not? It's not like moving it to the same chip would make DRAM perform better, the signal lines are not the limiting factor there.

wmf · on Nov 9, 2018

HBM2 does have dramatically more bandwidth than DDR4.

monocasa · on Nov 9, 2018

It has more bandwidth but about the same latency. CPUs fundamentally attack memory latency differently than GPUs do, and raw bandwidth is less important.

imtringued · on Nov 10, 2018

It is extremely important when you have 64 cores or if you use wide SIMD instructions. Intel's Xeon Phi has 16GB HMC memory with 500GB/s bandwidth to prevent the CPUs from being starved by main memory.

Imagine you have a cache miss on literally every memory access and have to pay a 100 nanosecond penalty to access main memory. SMT allows the CPU to switch to a different thread during a cache miss. Effectively you can have 128 memory loads in flight at any given time. 100ns amortized over 128 threads is less than 1 nanosecond. Since the minimum size you can read is a 64 byte cache line, this means that the program will require at least 64GB/s of memory bandwidth even though your CPU is stuck on cache misses.
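The amortization argument can be written out as a short calculation. This is a sketch under the comment's own assumptions (every access misses, one outstanding miss per hardware thread), with a hypothetical function name:

```python
def required_bandwidth_gb_s(threads, miss_latency_ns, line_bytes=64):
    """Bandwidth needed if every thread always has one cache-line miss in flight."""
    ns_per_line = miss_latency_ns / threads  # amortized time between line fetches
    return line_bytes / ns_per_line          # bytes per ns is the same as GB/s

# 64 cores with 2-way SMT give 128 threads; 100 ns miss penalty as stated above.
print(required_bandwidth_gb_s(128, 100))
```

With these inputs the amortized time per line is about 0.78 ns, so the result lands at roughly 82GB/s, consistent with the "at least 64GB/s" lower bound obtained by rounding the per-line time up to 1 ns.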

monocasa · on Nov 11, 2018

That's what I meant by "CPUs fundamentally attack memory latency differently than GPUs do"; they rely on their relatively smart caches and prefetching logic to reduce bandwidth explosion from all of the cores, and the latency of accessing external DRAM.

Of course if you thrash your cache on a modern general purpose CPU, you're going to have a really bad time. Pretty much if you're getting close to worst case memory access like in your example, you need to step back and re-evaluate quite a few of your choices.

LiterallyDoge · on Nov 9, 2018

Anyone know if this chip addresses speculative execution exploits eg Meltdown?

protomyth · on Nov 9, 2018

From the article:

The other big box to check is Spectre mitigations are now rolled in to the core but Meltdown and L1TF/Foreshadow are not. Why? Because AMD wasn’t affected by either one and never will be. Patching can work but to be immune from the start is always a better choice.

LiterallyDoge · on Nov 12, 2018

Meant Spectre.

PedroBatista · on Nov 9, 2018

They don't, because AMD is not affected by it.

https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabili...

LiterallyDoge · on Nov 12, 2018

Spectre

mainde · on Nov 9, 2018

Not the answer you're looking for but Meltdown specifically has never affected AMD.

LiterallyDoge · on Nov 12, 2018

Yeah my question was bad, was looking for Spectre

captaincoont · on Nov 9, 2018

I wish AMD could catch up in single core performance as despite having huge potential in offline parallel processing it falls short in real-time applications.

The_rationalist · on Nov 10, 2018

No it doesn’t fall short, Intel single core lead is < 10%. (except for the few programs that use AVX256)

Btw, Zen 2 will take the lead in every metric.

bitL · on Nov 10, 2018

Rumors are Zen 2 is 13% faster on average than Zen+.

Looking at fastest single-threaded CPUs:

https://www.cpubenchmark.net/singleThread.html

The fastest Zen+ is 2950x at position 85, roughly as fast as Broadwell C, with score of 2230. Now 13% more would give it 2520, roughly matching Haswell 4790k (number 15).

So no, it's not likely going to lead in every single metric.
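The projection above is a straight percentage scaling of the benchmark score. A minimal sketch, assuming the rumored uplift applies uniformly to this benchmark (which is exactly the assumption being debated), with a hypothetical function name:

```python
def projected_score(score, uplift_pct):
    """Scale a benchmark score by a percentage uplift, rounded to an integer."""
    return round(score * (1 + uplift_pct / 100))

# 2950X single-thread score of 2230 and the rumored 13% average gain.
print(projected_score(2230, 13))  # 2520
```

Any clock speed gains from the new process would stack on top of this, so the uplift parameter is the part of the estimate most open to revision.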

The_rationalist · on Nov 11, 2018

Lol, nice try, and your calculation is 100% right for this benchmark under the assumption that only the Zen IPC will increase (and new rumors say it could be higher).

7nm will allow higher frequencies too, which should make the difference. If Intel still leads in single thread, the gap would probably be negligible.

The_rationalist · on Nov 11, 2018

29% ipc gain under certain workloads !! https://old.reddit.com/r/Amd/comments/9w16y1/amd_claims_zen_...

bitL · on Nov 11, 2018

Yeah, but we know the AVX is going to be 256-bit, so if there is "higher IPC gain under certain workloads", that's completely expected. We might see a performance decrease under certain workloads as well due to the I/O chip. What's the point of picking a singular workload case and dancing around it as if it were the case with any workload?

The_rationalist · on Nov 11, 2018

Yes under certain workload doesn't mean a lot. But if it was AVX then the gain would be 100%

bitL · on Nov 11, 2018

I have a Threadripper, I don't need to be a fanboy. I like realistic estimates more.

fipple · on Nov 9, 2018

How are 7nm and 14nm mixed? Are they made separately and then... attached together? Isn’t a die made from a single wafer “burned” with a single wavelength?

atq2119 · on Nov 10, 2018

You're mixing up a bunch of things.

Each wafer, which is the disk on which many dies are produced, is produced using a single technology process. This means that Rome's IO dies (14nm process) are produced on separate wafers from the core dies (7nm process). The wafers are sliced into dies, the dies are tested and binned separately, and sufficiently good dies are then packaged together into a single package that goes into your server's CPU socket.

The wavelength of the laser used for the manufacturing is a different story entirely. Pretty much all state-of-the-art photolithography uses 193nm wavelength lasers, and this has been the case for many years. Yes, the wavelength is larger than the size of the features produced with those lasers. There's a lot of magic involved.

Since it's getting a bit too much magic, the industry has been trying to upgrade to 13.5nm (extreme ultraviolet) lasers for a long time. They're now starting to be used in actual practice. And EUV lasers are still basically magic, but of a different kind.

fipple · on Nov 10, 2018

Yeah, I guess I never thought about putting multiple dies in one package. How are they wired up anyways?

pacificmint · on Nov 10, 2018

7nm and 14nm do not refer to the wavelength of the light. They describe the size of the features on the chip (though that is getting increasingly inaccurate).

Either way, Zen chips are already multi die chips. For Zen and Zen+ each die has (up to) four cores. Chips with more cores than four contain multiple dies.

For Zen 2 they are going to be mixed differently, which is what you are referring to.

atq2119 · on Nov 10, 2018

Mostly correct, except Zen (and Zen+, and Zen 2) has eight cores per die. You may have been thinking of core complexes (CCXs), of which there are two on a die, and each CCX has four cores.