Hacker News
AMD’s Rome is indeed a monster (semiaccurate.com)
163 points by ajnin on Nov 9, 2018 | 101 comments


Looking forward to these chips but always have concerns for AMD's ability to execute consistently. Opteron, the original 'Sledgehammer' series was way ahead of Intel because Intel just couldn't bring themselves to put 64-bit features into their Pentium line, and AMD squirreled away that advantage by not following up, and having other issues with later spins of the Opterons.

That said, this really does look like a pretty awesome chip for data centers. I would love a dual socket motherboard that had 512GB of RAM, 1TB of Optane memory as additional "RAM", then 16TB of NVMe SSD storage, and 32 SAS/SATA channels for an effective 360TB of rotating disk (dual parity RAID 6, 30 active drives, 2 parity drives). And then make a cluster of 48 of those monsters.

Ah the places we would go and the things we would do with such a system.
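
The 360TB figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming one drive per SAS/SATA channel and 12TB drives (the size that makes 30 data drives come to 360TB; the comment does not state it). The function name is made up for illustration.

```python
def raid6_usable_tb(total_drives: int, drive_tb: float, parity_drives: int = 2) -> float:
    """Usable capacity of one dual-parity group: all drives minus the parity equivalents."""
    if total_drives <= parity_drives:
        raise ValueError("need more drives than parity drives")
    return (total_drives - parity_drives) * drive_tb

# Assumption: 32 channels means 32 drives, each 12 TB.
per_node_tb = raid6_usable_tb(total_drives=32, drive_tb=12)
cluster_tb = per_node_tb * 48   # the 48-node cluster from the comment

print(f"per node: {per_node_tb} TB, cluster: {cluster_tb} TB")
```

Under those assumptions the 48-node cluster would hold roughly 17PB of rotating disk.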


>AMD squirreled away that advantage by not following up, and having other issues with later spins of the Opterons.

AMD didn't squirrel away the advantage. Intel abused their monopoly to starve out AMD by giving OEMs discounts if they agreed not to carry AMD chips. They paid a paltry 1.5B in fines for abusing their market dominance.


> They paid a paltry 1.5B in fines

They didn't: https://www.lexology.com/library/detail.aspx?g=8965e7a2-ac87...


> Looking forward to these chips but always have concerns for AMD's ability to execute consistently. Opteron, the original 'Sledgehammer' series was way ahead of Intel because Intel just couldn't bring themselves to put 64-bit features into their Pentium line, and AMD squirreled away that advantage by not following up, and having other issues with later spins of the Opterons.

That isn't really what happened though. It was a combination of two things. One was this:

https://www.extremetech.com/computing/184323-intel-stuck-wit...

The other was that it happened around the time when CPU frequencies hit the power wall. That hit the Pentium 4 especially hard, which gave AMD the advantage, but Intel's anti-competitive behavior prevented AMD from capitalizing on it. Meanwhile Intel knew the Pentium 4 was too power hungry for laptops, so they kept iterating on the Pentium M, which is what became Core. It was designed for power efficiency rather than clock speed right when clock speeds unexpectedly started getting limited by power. It wasn't expected to be faster than the Pentium 4 (at half the power), but it was, so Netburst got canceled and suddenly Intel had the advantage.

The combination of the two things meant that AMD never had a chance to really profit from its investment in Sledgehammer, which meant they didn't have the money to put into R&D and fell behind for a decade.

There is no guarantee something else won't go wrong, but the chance of that same confluence of factors happening again seems pretty unlikely.


At this point Intel isn't executing consistently either, so any customer who treats their processor vendor as "strategic" is going to pay the price.


I'm thinking a lot of their recent success may have to do with Lisa Su becoming their CEO back in 2014. She seems really smart and focused on their core business. I hope some of this success in the CPU market will bleed over to their GPU developments.


RAID 6 with 32 drives. All the things you would not do while the array is rebuilding.


I'm not sure I understand the comment.

I've used dual parity RAID systems for over a decade now and they work superbly when rebuilding, even during a drive failure (aka in 'degraded' mode). I typically run them in 22 drive sets because that is how many drives fit on a single NetApp drive shelf.


With 32 drives you should have at least 3 failure tolerances. Might even want to plan some hot spares in there. The idea is to build a large file system that won't die and take your data with it - nor introduce downtime (with a suitable failure tolerance). And of course, have an offsite backup for actually backing up your data.


Agreed with the hot spare. Typical RAID reconstruction on these systems is limited to the I/O operations (IOPS) left over after accounting for the number needed to meet performance goals. If these are the only drives on a fairly beefy system you have lots of extra IOPS available, so you should be able to reconstruct a drive in under a day if you wanted to. There are the usual caveats about age-related failures happening in groups, but at least anecdotally, and from what I know from the folks in NetApp customer support who see usage across a much larger population, it is quite reliable.
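
To put a rough number on "under a day": rebuild time is about drive capacity divided by the sustained rate the array can spare for reconstruction. The sketch below assumes a 12TB drive and a few hypothetical rebuild rates; neither figure comes from the thread, and real rebuilds also depend on controller and workload.

```python
def rebuild_hours(drive_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite one whole drive at a sustained rebuild rate."""
    drive_bytes = drive_tb * 1e12                 # decimal terabytes, as drive vendors count
    seconds = drive_bytes / (rebuild_mb_per_s * 1e6)
    return seconds / 3600

# Hypothetical rates: a busy array, a mostly idle one, and close to sequential disk speed.
for rate in (50, 150, 250):
    print(f"{rate} MB/s -> {rebuild_hours(12, rate):.1f} h")
```

At the slowest assumed rate the rebuild stretches to several days, which is why the spare IOPS matter.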


I am aware of the issues one can have with rebuilding a large RAID array. It seems like you will always hit some parity issue upon reconstruction. What is the proper approach here?


Either declustered RAID for faster rebuilds or RAID 60 with smaller stripes.


Why, RAID6 can rebuild online, because it turns temporarily into a RAID5. Even RAID5 can be used while rebuilding, in read-only mode, because data from the remaining disks is sufficient (else rebuilding won't be possible).
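
The "data from the remaining disks is sufficient" point can be shown with a toy single-parity stripe. This is a minimal sketch of XOR parity as used by RAID5 (and by the first parity of RAID6), with made-up 4-byte blocks; the second RAID6 parity uses a different code and is not shown.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data disks, one toy stripe each
parity = xor_blocks(data)            # contents of the parity disk

# Disk 1 fails: its block is the XOR of every surviving block plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The same XOR also serves reads of the missing block while the array is degraded, which is why the array can stay online during the rebuild.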


RAID5 can be used normally while rebuilding, albeit at reduced performance. It's definitely NOT read only as that is effectively the same as offline in a production environment.


Depends on your usage pattern. You might be hesitant to allow writes without redundancy, especially if you store something important which is mostly read-only anyway (e.g. 95% of operations are reads).

For a large write-heavy setup + "we will restore from yesterday's backup" mode of operation, you are likely better off with RAID10, which is faster. (Though RAID50 and even RAID60 are a thing.)


Sure, but it's not offline, ever. The debate isn't about the best RAID level for your usage.


AMD had enough time to learn that lesson. Let's hope they did.


What would you do with such a system that is innovative and not just more of the same?!


I've been noodling on what it would take to build a generally conversational dialog machine.


That sounds like a very good start with CPUs and memory, but you need to include GPUs and network interconnect and explain what the high speed busses between all those components look like. How do you solve that?


The older you are, the better you execute things. Look at AMD 2 years ago versus now


Lots of companies get older and lose ability to execute. AMD was very old already when Opteron happened.

The age of the company in a case like this is irrelevant. What matters is current management and staff and market context.


The CEO executed it perfectly for the new chips, because they had walked that road before.

With less success then


Not particularly related to Rome, but in terms of what you can buy right now for a single socket system, AMD is far ahead of Intel for reasonably priced workstations or small servers with a huge amount of I/O. One Threadripper CPU has 64 PCI Express 3.0 lanes. One lane is 985MB/s.

A $399 Threadripper motherboard with four x16 physical slots can accommodate four Intel x710-4 10GbE four port NICs (each electrically x8), for a total of sixteen 10GbE router interfaces in a Linux kernel based FRR system, as a fully software implemented router. And with the RAM capacity of a 32GB system, no worries about FIB size. Doing routing entirely in CPU is a very different approach from doing it in ASICs. But this can be built for under $4000.

Or one could just as easily use those four physical x16 slots for four independent 100GbE interfaces.
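
The lane figure and the slot budgets above follow from the PCIe 3.0 signalling rate (8 GT/s per lane with 128b/130b encoding). A minimal sketch of that arithmetic, one direction only and ignoring packet and protocol overhead, so real throughput is somewhat lower:

```python
GT_PER_S = 8e9          # PCIe 3.0 raw transfers (bits) per second per lane
ENCODING = 128 / 130    # 128b/130b line encoding

def lane_bytes_per_s() -> float:
    """Payload bytes per second on one PCIe 3.0 lane, one direction."""
    return GT_PER_S * ENCODING / 8

def slot_gbps(lanes: int) -> float:
    """Bandwidth of a slot in Gbit/s, one direction, before protocol overhead."""
    return lanes * lane_bytes_per_s() * 8 / 1e9

print(f"per lane: {lane_bytes_per_s() / 1e6:.0f} MB/s")
print(f"x8 slot:  {slot_gbps(8):.1f} Gbit/s vs 40 Gbit/s for a four-port 10GbE NIC")
print(f"x16 slot: {slot_gbps(16):.1f} Gbit/s vs 100 Gbit/s for one 100GbE port")
```

Both configurations fit inside the slot bandwidth on paper, with more headroom for the x8 10GbE cards than for a 100GbE port in an x16 slot.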


I looked into it for a server but I wasn't able to find a motherboard with remote management features.

I suspect there might never be one so as not to muddy the waters of Epyc's intended market.


That's so frustrating. We ran into the same roadblock.


I’m sorry, you’re saying there’s no boards available with IPMI?


I couldn't find a Threadripper compatible board with IPMI, no. At least a couple are supporting ECC now.


I looked into this in the past but had issues finding information about processing overhead for routing (or even switching) that much data. Could a, say, 2950 handle routing 40 gigabit worth of data over a network, assuming no crypto?


Easily. One modern core can route 10 Gbps or more using efficient software like VPP.
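
For software routing the hard limit is usually packets per second rather than bits per second. The sketch below works out the line-rate packet rate for 10GbE and the per-packet cycle budget for one core; the 3 GHz clock is an assumption for illustration, and nothing here measures VPP itself.

```python
WIRE_OVERHEAD_BYTES = 20   # preamble + start delimiter (8) and inter-frame gap (12)

def packets_per_second(link_gbps: float, frame_bytes: int = 64) -> float:
    """Frames per second at line rate for a given Ethernet frame size."""
    bits_on_wire = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_gbps * 1e9 / bits_on_wire

def cycles_per_packet(link_gbps: float, core_ghz: float, frame_bytes: int = 64) -> float:
    """CPU cycles one core may spend per packet while keeping up with the link."""
    return core_ghz * 1e9 / packets_per_second(link_gbps, frame_bytes)

# Smallest and largest standard Ethernet frames; 3.0 GHz is a hypothetical core clock.
for size in (64, 1518):
    pps = packets_per_second(10, size)
    print(f"{size:>4} B frames: {pps / 1e6:.2f} Mpps, "
          f"{cycles_per_packet(10, 3.0, size):.0f} cycles/packet")
```

Minimum-size frames leave only a couple of hundred cycles per packet, so whether one core keeps up depends heavily on the traffic mix.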


Every time I see SemiAccurate I remember their love for AMD and their constant, unending hate for Nvidia: https://semiaccurate.com/2009/10/01/nvidia-fakes-fermi-board...

I don't have any dog in this fight, SA is just a terrible news source.


It might feel one sided, but I think SemiAccurate is one of the few places remaining doing old fashioned journalism (ie pursuing sources and gathering information from the real world). Most other tech news websites have devolved into wrapping up marketing materials from tech companies into "reviews".


Calling Jensen “Dear Leader” though? Did I catch that right? If you have good info why resort to tasteless and frankly racist language?


They don't like Intel or Nvidia, but their facts tend to be mostly accurate. I wouldn't listen to any of their financial or market-share predictions since the market has remarkable inertia that disconnects it from technical factors.


I would say they are semi accurate.


Accurate facts, the best kind of facts.


I’m feeling almost nostalgic.


But they were right in that case... I know it's a biased source but I've seen a lot of accurate info there in the past


> But they were right in that case...

How do you know?


It was widely reported on afterwards. Look at the bottom of the article, they mention Nvidia's cover-up.


It is semi-accurate after all.


>SA is just a terrible news source.

Yes, and it does not belong on HN.


The article has a lot of boasting about their exclusive scoop. That and their annoyingly watermarked images.


They started using watermarks this way when their news was plagiarized some 5 years ago by even big sites such as WCCFTech, Tom’s and others. It is ugly but understandable: they are independent, don’t live on ads, aren’t invited to most events and barely receive pre-release hardware for reviews.


Interesting design. The 70mm^2 die is more similar to an embedded chip so I guess that explains how AMD was able to hit 7nm so much earlier than Intel.

But reading between the lines, AMD is only showing benchmarks for extremely parallel loads much like they did with ThreadRipper 2. Between the physically separate processing cores and the nature of the marketing benchmarks, I suspect workloads that have high IPC or extremely low latency requirements won't fare so well on this chip.

The weirdest part is to think what Apple might come up with if they were to take the A12X design and connect it on a die like AMD has done here. It would probably make for a pretty interesting Mac Pro.


You play to the strengths of the part and the interests of the target market. People willing to pay the price premium for ThreadRipper are interested in throughput, not latency.


> You play to the strengths of the part and the interests of the target market.

However, the strategy of gluing smaller dies together also better fits the (long known) pragmatics of manufacturing processes. This allows AMD to hit lower price points while getting better margins. It also coincides with Intel being hit with "The Thermocline of Truth" where they were too optimistic about overcoming the challenges of larger die sizes.

"The Thermocline of Truth"

http://brucefwebster.com/2008/04/15/the-wetware-crisis-the-t...


I’ve been wondering lately if the reason for the Mac Pro replacement not coming out until 2019 may be due to Apple switching to the 7nm refresh of the AMD ThreadRipper. That would certainly make for some interesting news.


But then you've got other people who seem adamant they are switching the Mac to their magically scaled up ARM chips. What on earth is really going on at Apple HQ?


That will take several years. AMD could happen in months.


I would think that the use cases for a laptop CPU and high end desktop may call for different CPUs. Plus the x64 arch is well supported by all of the apps you would use a Mac Pro for.


Apple can't use Thunderbolt with AMD, and Intel has been delaying its opening of the Thunderbolt standard for whatever reason. At this point I think Intel has been so disappointing in the past 3 years that Apple should just dump them and take matters into their own hands.


I read somewhere that Apple has been trying to avoid uttering the word "Intel" in their latest presentations. If that's true, maybe it lends a bit of vague support to your idea.


Oh man, the licensing costs for any “per core” software are going to be insane :D


A few years ago many companies changed it to "per socket".

Now with 64 cores in one socket things got even weirder, but then again, those licenses always have been in some parallel universe.


There's still a disturbing amount of software that operates on a per-core pricing model. Some of them count hyperthreads in that core count. When we launched Oracle Cloud Infrastructure (as Bare Metal Cloud, before we launched our VM product), our standard bare metal instance had 36 cores, 72 threads with HT enabled. One interested customer couldn't use our platform because just licensing their standard software on our platform was going to set them back some $250,000 due to all those 72 "cores". The product manager at the time mentioned there were several other customers facing that issue with their various bits of enterprise software.


I've been really enjoying my dual EPYC 64 core workstation. Looking forward to when I can upgrade to 128 cores. :)


Did you build that yourself? Do you know of any good vendors for this type of workstation? I know building it isn't that hard, but typically an employer doesn't want to go that route.


I built it myself but there are multiple vendors selling prebuilt workstations eg https://www.velocitymicro.com/wizard.php?iid=327

PS: I have no affiliation with velocity micro, have never bought from them


Silicon Mechanics can build just about anything with your choice of Supermicro motherboard.

http://www.supermicro.com/en/products/aplus/solutions/SP3


What do you do with all those cores, if I may ask?


Simulations and randomized testing.


Question, why is DRAM still packed separately?


A lot of reasons. It's physically quite large (multiple thousands of sq. mm. of silicon). Customers want different amounts of RAM so you could end up with a cartesian product explosion of SKUs. Customers want to upgrade RAM. System vendors and integrators don't want to give up more margins to the CPU.


For higher power parts, heat dissipation is the primary reason. Most of the high end CPUs are purely heat limited in terms of performance.

A TDP of 90W is about the limit for CPUs using cheap cooling techniques, and adding RAM would lower that by 10-30W.


If you interleaved the memory with the processing, wouldn’t it decrease the heat density, though, which afaik is the limiting factor? You can get rid of quite a lot more heat than that if it’s not in too small an area to conduct away quickly.


The semiconductor processing for high-end DRAM is not the same as for high-end CPU logic or cache SRAM, so it is not cost-effective to put both on the same chip. This was a recurring obstacle to past processing-in-memory R&D projects.

To mix the two for thermal reasons would lead to some crazy multi-chip modules as in the "chiplet" approach reported recently. However, I suspect you would need far too many chiplets and too many bonds to really give you this thermal benefit rather than just having individual chiplets which are still hot spots with problematic cooling.


Processors also tend to run hot, while RAM is more reliable when cool. Much like some parts in human biology benefit from being outside of the core (enough to make it an evolutionary benefit even when it's quite an obvious defect due to vulnerability), having RAM in its own cooling sub-domain makes sense, and since it's already out there the blade interconnect card design is space, cost, and cooling effective.


I don't think that's feasible either. If you put memory in between chip elements, that means you now have to route around it, adding to wastage. It also means the distance between the elements would increase, slowing things down. At the speed that chips are running the physical distance matters (from what I recall).


I wonder if the extra computing power achievable by more extensive cooling is worth it, in computation units per watt (including both system power and cooling power).

I keep reading that some data centers switch to liquid cooling, so likely it does. Maybe with advances on this front, we'll see the increase in CPU efficiency that the fabrication process can no longer deliver.


Same reason that they split off the IO block here, but exaggerated. DRAM is very different from a process perspective, with very different constraints. Even eDRAM tends to be a compromise in a lot of ways that wouldn't be competitive if it weren't literally on the same chip as the logic.

Additionally DRAM more or less has to be separate chips due to the sheer surface area for yield reasons. The amount of DRAM in a decent sized server is close to a whole wafer.


A different question is why is DRAM still colocated with the CPUs, and not centrally located over a very fast network? There was a company doing this (RNA Networks, founded 2006) which was sold to Dell and then apparently disappeared without trace.

It might make more sense now that there are more types of memory-like storage which isn't quite as fast as DDR4, like Optane.


Latency between a CPU that has the memory controller built directly onto the die, and adjacent DRAM modules, is significantly less than if it is abstracted away somewhere else on the far side of a low latency, short distance network link.


Attempts to do this are being made in academia. Some recent work tries to quantify the network requirements of this, and decides that it's far more feasible to use some local memory as a cache than to have everything be remote.

Network requirements: https://www.usenix.org/conference/osdi16/technical-sessions/...

Two systems that use remote memory: https://www.usenix.org/conference/nsdi17/technical-sessions/... https://www.usenix.org/conference/osdi18/presentation/shan


The cost and latency of the very fast network is still too high, but for slower stuff Gen-Z is coming.


The answer to that is very simple: latency.

There is also an infinitely more tricky issue: security.


Why not? It's not like moving it to the same chip would make DRAM perform better, the signal lines are not the limiting factor there.


HBM2 does have dramatically more bandwidth than DDR4.


It has more bandwidth but about the same latency. CPUs fundamentally attack memory latency differently than GPUs do, and raw bandwidth is less important.


It is extremely important when you have 64 cores or if you use wide SIMD instructions. Intel's Xeon Phi has 16GB of HMC memory with 500GB/s bandwidth to prevent the CPUs from being starved by main memory.

Imagine you have a cache miss on literally every memory access and have to pay a 100 nanosecond penalty to access main memory. SMT allows the CPU to switch to a different thread during a cache miss. Effectively you can have 128 memory loads in flight at any given time. 100ns amortized over 128 threads is less than 1 nanosecond. Since the minimum size you can read is a 64 byte cache line, this means that the program will require at least 64GB/s of memory bandwidth even though your CPU is stuck on cache misses.
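
The estimate above is an application of Little's law: sustained throughput equals requests in flight divided by latency. A minimal sketch using the comment's own numbers (128 outstanding loads, 100ns latency, 64-byte cache lines); the function name is made up for illustration.

```python
CACHE_LINE_BYTES = 64

def required_bandwidth_gb_s(outstanding_loads: int, latency_ns: float) -> float:
    """Little's law: loads completed per ns times bytes per load (bytes/ns equals GB/s)."""
    loads_per_ns = outstanding_loads / latency_ns
    return loads_per_ns * CACHE_LINE_BYTES

amortized_ns = 100 / 128                       # effective time per load across 128 threads
bandwidth = required_bandwidth_gb_s(128, 100)

print(f"{amortized_ns:.2f} ns per load, {bandwidth:.1f} GB/s")
```

Without rounding the amortized latency up to 1ns, the requirement comes out a little above the 64GB/s floor quoted in the comment.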


That's what I meant by "CPUs fundamentally attack memory latency differently than GPUs do"; they rely on their relatively smart caches and prefetching logic to reduce bandwidth explosion from all of the cores, and the latency of accessing external DRAM.

Of course if you thrash your cache on a modern general purpose CPU, you're going to have a really bad time. Pretty much if you're getting close to worst case memory access like in your example, you need to step back and re-evaluate quite a few of your choices.


Anyone know if this chip addresses speculative execution exploits eg Meltdown?


From the article:

> The other big box to check is Spectre mitigations are now rolled in to the core but Meltdown and L1TF/Foreshadow are not. Why? Because AMD wasn’t affected by either one and never will be. Patching can work but to be immune from the start is always a better choice.


Meant Spectre.


They don't, because AMD is not affected by it.

https://en.wikipedia.org/wiki/Meltdown_(security_vulnerabili...


Spectre


Not the answer you're looking for but Meltdown specifically has never affected AMD.


Yeah my question was bad, was looking for Spectre


I wish AMD could catch up in single core performance as despite having huge potential in offline parallel processing it falls short in real-time applications.


No, it doesn’t fall short. Intel's single core lead is < 10% (except for the few programs that use AVX256).

Btw, Zen 2 will take the lead in every metric.


Rumors are Zen 2 is 13% faster on average than Zen+.

Looking at fastest single-threaded CPUs:

https://www.cpubenchmark.net/singleThread.html

The fastest Zen+ is the 2950X at position 85, roughly as fast as Broadwell C, with a score of 2230. Now 13% more would give it 2520, roughly matching the Haswell 4790K (number 15).

So no, it's not likely going to lead in every single metric.
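
The projection above is a straight scaling of the quoted score. A minimal sketch, assuming the rumored 13% applies uniformly to this benchmark; the optional clock-speed factor is a hypothetical addition to show how a frequency gain would compound with it.

```python
def projected_score(base_score: float, ipc_uplift: float, clock_uplift: float = 0.0) -> float:
    """Scale a benchmark score by fractional IPC and clock-speed gains (they multiply)."""
    return base_score * (1 + ipc_uplift) * (1 + clock_uplift)

zen_plus = 2230   # single-thread score quoted in the comment for the fastest Zen+ part

print(round(projected_score(zen_plus, 0.13)))
# Hypothetical: the same IPC gain plus a 5% higher clock.
print(round(projected_score(zen_plus, 0.13, 0.05)))
```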


Lol, nice try. Your calculation is 100% right for this benchmark under the assumption that only the Zen IPC will increase (and new rumors say it could be higher).

7nm will allow higher frequencies too, which should make the difference. If Intel still leads in single thread, the lead would probably be negligible.



Yeah, but we know the AVX is going to be 256-bit, so if there is "higher IPC gain under certain workloads", that's completely expected. We might see a performance decrease under certain workloads as well due to the I/O chip. What's the point of picking a singular workload case and dancing around it as if it were the case with any workload?


Yes under certain workload doesn't mean a lot. But if it was AVX then the gain would be 100%


I have a Threadripper, I don't need to be a fanboy. I like realistic estimates more.


How are 7nm and 14nm mixed? Are they made separately and then... attached together? Isn’t a die made from a single wafer “burned” with a single wavelength?


You're mixing up a bunch of things.

Each wafer, which is the disk on which many dies are produced, is produced using a single technology process. This means that Rome's IO dies (14nm process) are produced on separate wafers from the core dies (7nm process). The wafers are sliced into dies, the dies are tested and binned separately, and sufficiently good dies are then packaged together into a single package that goes into your server's CPU socket.

The wavelength of the laser used for the manufacturing is a different story entirely. Pretty much all state-of-the-art photolithography uses 193nm wavelength lasers, and this has been the case for many years. Yes, the wavelength is larger than the size of the features produced with those lasers. There's a lot of magic involved.

Since it's getting a bit too much magic, the industry has been trying to upgrade to 13.5nm (extreme ultraviolet) lasers for a long time. They're now starting to be used in actual practice. And EUV lasers are still basically magic, but of a different kind.


Yeah, I guess I never thought about putting multiple dies in one package. How are they wired up anyways?


7nm and 14nm do not refer to the wavelength of the light. They describe the size of the features on the chip (though that is getting increasingly inaccurate).

Either way, Zen chips are already multi die chips. For Zen and Zen+ each die has (up to) four cores. Chips with more cores than four contain multiple dies.

For Zen 2 they are going to be mixed differently, which is what you are referring to.


Mostly correct, except Zen (and Zen+, and Zen 2) has eight cores per die. You may have been thinking of core complexes (CCXs), of which there are two on a die, and each CCX has four cores.



