They also announced Threadripper, AMD's new HEDT platform with up to 16 cores and 32 threads, and showed the Radeon Vega Pro SSG (16 GB of HBM and a 2 TB SSD on the GPU), DeepBench results vs the Nvidia P100 (with an advantage of ~30%), and more.
I have met software engineers who could not believe me when I pointed out that multi-threading was invented, made sense, and, in fact, was thriving - on single-core computers (with no hyper-threading)!
We still haven't figured out how to parallelize most software without unreasonable effort, though. Servers happen to be the happy case for it because they can just run many copies of the same single-threaded code that we haven't figured out how to parallelize.
In my life that matters less than you'd think. It's not about speeding up a particular program; it's more about being able to run a particular program at full speed on one core without dragging the rest of them to a halt.
If I have an IDE, 2-3 VMs, a couple of browsers, continuous integration running in the background, and webpack/ts-loader with off-thread hinting, I can easily have 4-5 processes running that all benefit from having a full core to play with.
It's for that reason that when I had to build a new desktop for the new job I went with the Ryzen 1700. Each core isn't that important (as long as it's comparable with the core in my current job's i5-3570K, which it broadly is); it's having eight of them.
It's interesting that this use case of running many different programs continuously is probably quite different from the rest of the high-end desktop world, where you might be exclusively in AutoCAD, Photoshop, Maya or some complex engineering software.
For developers I think more cores is still better (I'd add Spotify and Slack to your list of things that are always running) and yet we still prefer shiny laptops to powerful desktops.
Laptops put you in horrible postural positions unless you use desktop monitors, at which point why not just use an actual desktop, which will annihilate the laptop on performance anyway?
They clearly never looked at the task manager or something similar. I'm running ~900 threads on my CPU which has only 6 cores and 12 threads. Magic! ;o)
Really not sure on the "EPYC" name for an enterprise part. It seems more enthusiast than buttoned-down enterprise. But as long as it doesn't exclusively come in servers with glowing neon and tri-colour fans, I don't exactly care...
I just imagine it will make it easier for Intel's marketing to imply these are toys rather than true enterprise-grade parts.
It might be completely deliberate, as the WoW generation might be coming into positions where they are now the "enterprise guys" that need to be targeted. Or maybe I'm just reading too much into a whimsical name.
Depending on whose definition of millennials you use, that's certainly true. I was born in 1980, I'm 37 in a couple of weeks, and some would have me as a millennial... which is kind of funny since statistically I'm damn near halfway through my life.
I personally think EPYC is a lot more tacky than Xeon or Opteron since "epic" has gaming and (internet) pop culture connotations for me. I was pretty surprised when I heard the name and heard it was an enterprise product.
It also has "disruptive performance." I am not sure if I should take it as a good sign that AMD talks more with reddit than with their marketing department.
I bet a 64c Zen chip would be amazing with mixed SQL Server workloads, but SQL Server's per-core licensing will make it crazy cost-prohibitive. $456k for one server - ouch! Here's hoping per-core licensing goes tf away.
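As a rough sketch of where that $456k figure lands, assuming SQL Server Enterprise's roughly $7,128-per-core list price (sold in 2-core packs; check current pricing before quoting anyone):

    # Back-of-the-envelope licensing cost, assuming ~$7,128/core (Enterprise list price)
    CORES = 64
    PRICE_PER_CORE = 7128  # USD, approximate; sold in 2-core packs

    total = CORES * PRICE_PER_CORE
    print(f"{CORES} cores x ${PRICE_PER_CORE:,}/core = ${total:,}")
    # -> 64 cores x $7,128/core = $456,192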
I would totally get behind that, especially if it could be based on actual wattage used and not just on the TDP of the CPU. Nothing sucks more than having to pay huge money for servers that sit at sub-5% utilization because you're trying to have some headroom.
And from time to time an MS sales rep drops by and measures your power consumption to recalculate the cost.
Or maybe have a power-consumption-tracking dongle that sends the information back to MS for billing, while the MS techs just check that those are connected to the correct machines.
Wasn't the fact that Microsoft began charging per processor one of the reasons the antitrust lawsuit against it started?
But these days Microsoft is also beginning to exclude third-party browsers from its store, so I guess they don't care they're repeating the same violations all over again. Too much money on the line, when the worst case scenario is a slap on the wrist financially and some "monitoring" by federal agencies.
Considering DB2 is bundled as part of the OS license for IBM i, 10K/core is still cheaper than SQL Server (though then you get to pay for drivers...because we're fucking IBM).
I just boggle at the idea of a server with 1TB of RAM. I'm sure the oil & gas folks are salivating, as this allows them to put an entire high-resolution 'cube' into memory and analyze it, but for us mere mortals, at what point is there so much stuff in memory that you need a couple of hours of hold-up time just to flush it out to SSD?
The big headache is not 1TB but 64TB, which is the maximum physical RAM limit of the Linux kernel / x86-64 architecture. Big NUMA systems could go higher, but they don't. Look here: https://www.sgi.com/products/servers/uv/ - "SGI UV 300 scales from 4 to 64 sockets with up to 64TB of shared memory in a single system" vs "SGI UV 3000 scales from 4 to 256 CPU sockets with up to 64TB of shared memory as a single system". See how the 256-socket and 64-socket systems both only support 64TB?
More ordinary servers typically stop at 6TB: one Xeon can support 1.5TB, so an ordinary quad-socket board, typically with 96 DIMM slots, can go up to 6TB with 64GB DIMMs. You can configure a machine like this at http://www.thinkmate.com/system/superserver-4048b-tr4ft and see that for the relatively low price of $110K you can get a machine with 6TB of RAM.
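A quick sketch of the arithmetic behind that 6TB ceiling, using the figures above:

    # Capacity sanity check for the quad-socket configuration described above.
    dimm_slots = 96            # typical for a quad-socket board
    dimm_size_gb = 64          # 64GB DIMMs
    sockets = 4
    per_socket_limit_tb = 1.5  # one Xeon supports 1.5TB

    by_dimms = dimm_slots * dimm_size_gb / 1024        # in TB
    by_sockets = sockets * per_socket_limit_tb         # in TB
    print(f"96 x 64GB DIMMs   = {by_dimms:.1f} TB")    # 6.0 TB
    print(f"4 sockets x 1.5TB = {by_sockets:.1f} TB")  # 6.0 TB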
> The big headache is not 1TB but 64TB which is the maximum physical RAM limit of the Linux kernel / x86-64 architecture.
It sounds like you know what you're talking about, so I'm sure it was inadvertent that you wrote "physical RAM limit of the Linux kernel". It's primarily the x86-64 architecture, and 5-level paging is coming, which extends the linear address space to 57 bits (128 PiB) and the physical address space up to 52 bits (4 PiB). Still, one wonders how long it will take for that to be inadequate.
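A quick sanity check of those sizes (treating the 64TB ceiling mentioned upthread as the 46 physical bits usable with today's 4-level paging, which is an assumption on my part):

    # Address-space sizes implied by the bit widths quoted above.
    def human(bits):
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
        size = 2 ** bits
        i = 0
        while size >= 1024 and i < len(units) - 1:
            size /= 1024
            i += 1
        return f"{size:g} {units[i]}"

    print("57-bit linear address space:  ", human(57))  # 128 PiB
    print("52-bit physical address space:", human(52))  # 4 PiB
    print("46-bit physical (4-level now):", human(46))  # 64 TiB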
And Linux support for 5-level paging has been actively worked on since December; a quick git search shows this in 4.11:
Merge 5-level page table prep from Kirill Shutemov:
"Here's relatively low-risk part of 5-level paging patchset. Merging it
now will make x86 5-level paging enabling in v4.12 easier."
There are still 8-socket systems, like the Supermicro 7088B (192 DIMMs), which support a theoretical 24 TB (with 128 GB modules).
(It would be interesting to know whether this backplane-based system actually manages to achieve 4.8 GHz QPI speed (9.6 GHz symbol rate), or whether the physical aspects limit it to lower speeds. Given that processors only have four QPI links, this would further increase communication overhead -- in an 8-way system some sockets are separated by two hops.)
It's true, 1TB of memory is well within the realm of an over-eager enthusiast. Motherboards that support it aren't even exotic, they're just physically large server-grade ones.
They are often based on lower-density DDR3 chips that consume more power per bit. That being said, you've been able to do 1TB on a normal dual-socket server with 64GB DDR4 LR-DIMMs for a while now, though they do come at a cost-per-bit premium.
We've had database servers with 2-3TB of RAM in production for years. You can get servers today with 24TB RAM - they're 8-way so lots of NUMA going on but still tons o' RAM.
> mere mortals at what point is there so much stuff in memory that you need a couple hours of hold up time just to flush it out to SSD
If you're buying these machines with multiple TB of RAM, then buying flash arrays that can drive multiple GB/sec of IO bandwidth shouldn't be a problem either. At, say, 8GB/sec write bandwidth, flushing 1TB takes a little over two minutes. Although why you have that much "dirty" data in RAM might be another question. A database machine using that amount of RAM is going to care more about the IOPS rate of the disks, so that it's flushing updates to disk at the same rate they arrive, meaning that the RAM won't need to be flushed to disk if the machine/power/whatever fails. Disk arrays with >1M IOPS have been around for over a decade, and, given a SAN, several can be wired together to increase aggregate performance.
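A minimal sketch of that flush-time estimate:

    # "A little over two minutes" for flushing 1TB at 8GB/s write bandwidth.
    ram_tb = 1
    write_bw_gb_s = 8

    seconds = ram_tb * 1024 / write_bw_gb_s
    print(f"{ram_tb} TB at {write_bw_gb_s} GB/s: ~{seconds:.0f} s (~{seconds / 60:.1f} minutes)")
    # -> 1 TB at 8 GB/s: ~128 s (~2.1 minutes)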
> I just boggle at the idea of a server with 1TB of RAM.
MooseFS (a scale-out storage system) keeps metadata in RAM for low latency and is known for pushing the limits of commodity hardware in big installations. I presume the same applies to any kind of low-latency database of similar size.
1.4 gigabytes per second or more for 4K flat video work (4 colour components at 16 bits each, raw, at 25fps in PAL land) for each channel on the timeline saturates a system real quick. For online finishing systems that would be sweet. 8TB would fill an average 90-minute movie if you had a finishing timeline (no additional ones over it). With over 12TB I wouldn't know what to do anymore, as of this point in time.
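A rough sketch of that data rate; the frame size here is an assumption (UHD), and the ~1.4 GB/s quoted above corresponds to a slightly different frame geometry:

    # Rough data rate for one uncompressed 4K channel on the timeline.
    width, height = 3840, 2160   # assumed frame size; DCI 4K "flat" differs slightly
    components = 4               # e.g. RGB plus an alpha/key channel
    bits_per_component = 16
    fps = 25                     # PAL-land frame rate

    bytes_per_frame = width * height * components * bits_per_component // 8
    gb_per_second = bytes_per_frame * fps / 1e9
    tb_per_90_min = gb_per_second * 90 * 60 / 1e3

    print(f"~{gb_per_second:.2f} GB/s per channel")             # ~1.66 GB/s
    print(f"~{tb_per_90_min:.1f} TB for a 90-minute timeline")  # ~9.0 TB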
Sure, you can build a RAID that fast already though.
I wish we had more details about the on-die interconnect (the HyperTransport successor) in terms of latency and bandwidth, and even topology. We run a NUMA .. challenged .. OS, and depending on the interconnect that may or may not matter so much for our workload.
Sure.. but I'm wondering about the interconnect between the 4 different dies that share the package. E.g., everything suggests that an EPYC is 4 Ryzen dies, and a "Threadripper" is 2 Ryzen dies. So it seems logical that there is something connecting those 2 or 4 dies, and that fabric has bandwidth and latency characteristics. If it is somehow infinitely fast, then that's great for me.
Zen has SHA1/2 extensions compatible with the Intel SHA extensions, yes. These are kind of new on desktops, but since they have existed for some years, some software already supports them out of the box (like OpenSSL and cryptopp); so applications will automatically benefit.
With this extension Zen does SHA1 @ 2 cpb, SHA-256 @ 3 cpb and SHA-512 @ 2 cpb (off the top of my head). (All of which are faster than the fastest BLAKE2 implementation I know of on Haswell.)
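To put those cycles-per-byte numbers into throughput terms, here's a small sketch that assumes a 3.5 GHz clock (an assumption; scale to whatever your cores actually run at):

    # Convert cycles-per-byte into rough throughput at an assumed clock.
    clock_ghz = 3.5  # assumed; throughput scales linearly with clock

    for name, cpb in [("SHA-1", 2), ("SHA-256", 3), ("SHA-512", 2)]:
        gb_per_s = clock_ghz / cpb  # (Gcycles/s) / (cycles/byte) = GB/s
        print(f"{name}: {cpb} cpb is roughly {gb_per_s:.2f} GB/s at {clock_ghz} GHz")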
Do you by chance have hard numbers about SHA2 with/without hardware instructions and BLAKE2 on a specific Intel CPU?
I've wondered about the trade-off between SHA-256 and BLAKE2. In the future there'll be no debate, since more and more computers will have SHA instructions. But right now I'm wondering how BLAKE2 compares to SHA-256 with hardware instructions. On the other hand, many computers, especially servers, won't have SHA2 instructions for the foreseeable future, which will make BLAKE2 a very good option.
However, none with SHAEXT; they just weren't there yet. But the Zen numbers should give you a good idea.
Note that these benchmarks are made using a plain C implementation of BLAKE2 (the reference one), which is not vectorized by any compiler. The fastest (AVX2) BLAKE2 implementation is about 40 % faster than the scalar C implementation (on Haswell).
As far as I'm aware no mainstream crypto library ships optimized BLAKE2 versions. I believe some Go packages do/did roll their own version (not the one from Samuel Neves), but at least one of them mixed SSE and VEX/AVX insns, with predictably bad results (60 MB/s or so) - perhaps this is fixed by now.
So in summary, BLAKE2b is imho the best candidate on perf, and if you use a good implementation it should be within ~30% of SHA2 (512) with SHAEXT, going by the numbers we have so far. I understand that Zen's aggressive (= good) power management makes it somewhat difficult to benchmark hot loops consistently, so we'll have to wait and see for practical results, I guess.
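If you want a rough feel for the relative speeds on your own machine, here's a minimal sketch using Python's hashlib (its SHA-2 and BLAKE2b code is not the hand-tuned assembly discussed above, so treat the absolute numbers as ballpark only):

    # Rough relative throughput of SHA-2 vs BLAKE2b via Python's hashlib.
    import hashlib
    import time

    data = b"\x00" * (64 * 1024 * 1024)  # 64 MiB of input

    for name in ("sha256", "sha512", "blake2b"):
        h = hashlib.new(name)
        start = time.perf_counter()
        h.update(data)
        h.digest()
        elapsed = time.perf_counter() - start
        print(f"{name}: {len(data) / elapsed / 1e9:.2f} GB/s")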
Given how much these types of processors are used for virtualization, wouldn't a lower core count at a higher clock speed be just as useful? 32 cores at 1.4Ghz only seems useful if you need to use a lot of processor affinity, but assigning faster vCPUs to your VMs doesn't seem to have a downside. Just not sure what advantage this would have over a 16core 2.8ghz chip as a comparison.
Switching tasks is expensive [1]. Twice as many cores running at half the speed can be considerably faster in the real world because you're not constantly stopping to flush the cache, save the kilobytes of registers a modern CPU has, etc. Honestly, I'm surprised that x86 has kept to just two virtual threads for this long. Architectures like SPARC and POWER have 4+ threads per core because so many modern jobs are built around hurrying up and waiting.
A core at twice the frequency is better than two cores at half the frequency every time. The problem is that the trade-off is usually not that clean (either the slower cores consume significantly less power, or they run at better than half the speed).
Regarding HT, 2 threads is really a sweet spot for a 4-wide CPU. More than that and the competition for cache resources, execution units and the register file becomes significant.
POWER8 is special: one factor of 2x comes from each POWER 'core' being pretty much two distinct smaller cores that can gang together to speed up one thread (it also helps with per-core software licensing), while the other factor of 2x is for very specialized loads (this is also true, or used to be, for SPARC).
IIRC Xeon Phi, which is also a specialized CPU, has 4x HT.
My understanding is that lower-clocked cores run cooler and consume less power. CPUs designed for high clock speeds will have "hotspots" where certain units are running very hot compared to the rest of the chip. Slower cores have more even thermal profiles.
So, if you don't need the high peak clock speeds, 32 half-speed cores would be preferable for datacenters to save money on electricity and cooling design.
Higher core count means more total cache and fewer context switches. If your workload actually has 32 threads, then the 32-core processor will probably be faster overall than a processor with sixteen of the same cores running at twice the clock speed.
On at least some Intel chips, the turbo clock is only applicable to one core - do you see any indication that all 32 of these cores can run at 2.8Ghz? And if so, sustain that for any period of time?
Intel's Turbo Boost isn't binary. CPUs typically have a base clock, a maximum turbo frequency that may only be attainable when using a single core, and numerous intermediate states depending on how many cores are active, including an all-cores turbo frequency that may be usable only for short bursts or may be indefinitely sustainable given a sufficiently lightweight instruction stream (ie. not much AVX).
As an example, the 22-core Intel Xeon E5-2696v4 has a base clock of 2.2GHz. With one or two cores active, it can turbo up to 3.7GHz. With three cores active, the maximum is 3.5GHz, and it decreases by 100MHz per active core until ten cores are active. With 10 or more active cores, the limit is 2.8GHz, provided that the chip is still within its power and thermal limits.
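A small sketch of that turbo ladder as described (the real limit also depends on power/thermal headroom and the instruction mix, e.g. AVX):

    # Per-active-core turbo limits for the E5-2696v4 as described above.
    def max_turbo_ghz(active_cores):
        if active_cores <= 2:
            return 3.7
        # 3.5 GHz at 3 cores, minus 100 MHz per extra active core,
        # with a 2.8 GHz floor at 10 or more active cores.
        return max(3.5 - 0.1 * (active_cores - 3), 2.8)

    for n in (1, 2, 3, 6, 10, 22):
        print(f"{n:2d} active cores -> up to {max_turbo_ghz(n):.1f} GHz")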
Many virtualised workloads have many VMs with light CPU utilisation on each, so this kind of core configuration allows lots of VMs to be doing low-intensity background tasks at the same time.
That makes it ideal for things like VDI and web / cloud hosting, where the quantity of VMs is very high but the load from each typically is not.
> Many Virtualised workloads have many VMs with light CPU utilisation on each
Which is why every virtualization platform out there lets you oversubscribe CPUs. That's a solved problem; what's the benefit of having 100 VMs run on 32 slow cores vs 16 fast ones?
As mentioned above, context switching is expensive and extra L1 cache is valuable. Time-sharing can also have a huge effect on latency (because requests must wait until their server is scheduled), even when the throughput is still good.
Even if time-sharing performs well most of the time, when it goes wrong the performance problems can be opaque and hard to debug. In general, a solution that "really" does something will save engineer-days compared to one that does it virtually, even at the same price/performance trade-off.
Threadripper (16c/32t, quad channel memory, 44? PCIe 3.0 lanes) will probably be the Workstation platform.
Wouldn't AMD need to license Thunderbolt from Intel? There are repeated rumors about a license agreement between Intel and AMD regarding AMD's GPU IP; if that is true, maybe they will get access to Thunderbolt.
You have to sign lots of NDAs and other docs, give company details, and describe the product you want to develop, just to get a look at the datasheets [1]. To put that in perspective, the datasheets for their latest and greatest processors are very readily available [2]. Thunderbolt is very much an Intel-only game. I really wish that the PCI-SIG had gotten an open standard for external PCIe out; it would have been rather useful.
You do, but part of that license is validating that it works, and it's not cheap. That's part of the reason why Thunderbolt cables are so expensive compared to their alternatives.
A good summary (with pictures) by a reddit user: https://www.reddit.com/r/Amd/comments/6bjvy6/amd_2017_financ...