10 years ago the fastest supercomputer was BlueGene/L which was rated at 136.8 TFlop/s. The current fastest supercomputer is rated at 33,862.7 TFlop/s, or 247 times faster.
It seems to me that the aim of taking 10 years to build a supercomputer that is only 20 times faster than the current one might fall a little short if it's aiming to take the top spot.
This isn't only about the FLOPS: the big trend among the countries ordering new supercomputers for 2020/2025 is a strong focus on power. Current supercomputers consume a great deal of it.
Also, the FLOPS measurement is a bit broken: it focuses on dense linear algebra problems, for which GPUs and other accelerators boost the results easily. If all you plan to do is run simulations that parallelize easily on GPUs, that's fine; for other types of programs it is hard to tell which is the fastest supercomputer.
Power is over-emphasized in HPC circles. According to FOIA reports (and you can back out similar from public budget information), less than 10% of the budget is going to energy. It is frequently used as an excuse to build machines that are inappropriate for the science (not just "hard to program", but actually inappropriate in the sense that even with infinite programming effort, they deliver less scientific value than a more conventional architecture). There is some value in making scientists uncomfortable so that they think of creative algorithmic solutions that may pay off as the inevitabilities of semiconductor physics become more apparent, but seeing the number of applications that are within small constant factors of proven barriers and the willingness to compromise quality of solution and/or run scientifically irrelevant configurations to demonstrate "speedup", I think it has gone too far.
Perhaps it could be the size of the matrix that can be inverted on it in an hour of time, with IEEE double precision floats, using some standard algorithm.
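A rough sketch of what such a metric could look like (hypothetical; using an LU-based solve, which is the standard way to apply an inverse in IEEE double precision):

```python
import time
import numpy as np

def time_solve(n, seed=0):
    """Time solving an n-by-n dense double-precision system.

    A toy stand-in for the proposed metric: the larger the n you can
    handle within a fixed wall-clock budget, the faster the machine.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(a, b)  # LAPACK dgesv: LU factorization + solve
    elapsed = time.perf_counter() - t0
    return elapsed, float(np.linalg.norm(a @ x - b))

elapsed, residual = time_solve(500)
```

To turn this into a ranking you would search for the largest n that completes within the time budget, rather than fixing n as above.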
The "High Performance Conjugate Gradients" benchmark was proposed a couple years ago as an alternative metric for ranking supercomputers. Its proponents claim its behavior is more similar to real applications (irregular access patterns, lower ratio of computation to memory access, etc), compared to linear algebra problems like the "High Performance Linpack" benchmark currently used by the Top500.
HPCG basically measures STREAM and has many technical flaws making it scale-dependent and difficult to adjudicate. As codeveloper of a different benchmark, I'll just cite this paper from a third party: https://hpgmg.org/static/MarjanovicGraciaGlass-PerformanceMo...
The reality is that there are many dimensions to supercomputing performance and it's impossible for one number to capture the utility of the machine. Our HPGMG benchmark (https://hpgmg.org) attempts to strike a balance and give useful supplementary information. I do think it's better than any other single benchmark for evaluating today's machines and will also prove to be more durable over time.
How would you use a benchmark like this to predict the performance of a well-designed asynchronous parallel conjugate gradient solver, like most modern deep learning neural networks that run on Internet HPC machines?
CG isn't truly asynchronous due to its reductions. It can be pipelined in various ways (we have several implementations in PETSc), but performance requires a quality implementation of asynchronous reduction (e.g., MPI_Iallreduce) which the vendors have been slow about developing (I've been working with some on fixing this and Cray has made recent progress).
With respect to deep learning and other applications using CG or related algorithms, the bottlenecks depend on the scale, and ability to expose locality, and operator/preconditioner representation. If there is no locality, then matrix-vector products require all-to-all communication which tend to dwarf the cost of the reductions in CG. Even with locality in the matrix-vector product, preconditioners often need to communicate globally in a scalable way similar to HPGMG. Operators need not be represented as a table of numbers or a sparse matrix format, but could use a tensor product, fast transform, or other information to compute the action using less storage. If they are represented explicitly (sparse or dense), then matrix-vector product performance (thus CG as a whole) is dominated by memory bandwidth for problem sizes that do not fit in cache. HPGMG tries to strike a balance between memory bandwidth demands and compute using a matrix-free representation. HPGMG also reports dynamic range expressed as Performance versus Time-to-solution as the problem size is varied, which allows applications to see performance barriers that might be relevant to them (e.g., see how Titan cannot do a solve in less than 200 ms while Edison can do 50 ms, and how that relates to climate simulation performance targets; see slide 7 of https://jedbrown.org/files/20150624-Versatility.pdf).
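To make the reduction bottleneck concrete, here's a minimal single-node CG sketch (NumPy; all distributed-memory detail elided, and the 1-D Laplacian operator is just an illustrative stand-in):

```python
import numpy as np

def cg(apply_a, b, tol=1e-10, maxit=200):
    """Plain conjugate gradients.

    The two inner products per iteration are the reductions that become
    (ideally asynchronous) MPI_Iallreduce calls in a distributed run;
    apply_a is the matrix-vector product, which needs neighbor
    communication when the operator has locality.
    """
    x = np.zeros_like(b)
    r = b - apply_a(x)
    p = r.copy()
    rr = r @ r                       # reduction #1
    for _ in range(maxit):
        ap = apply_a(p)              # halo exchange at scale
        alpha = rr / (p @ ap)        # reduction #2
        x += alpha * p
        r -= alpha * ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# 1-D Laplacian as a stand-in operator (tridiagonal, SPD)
n = 50
def lap(v):
    out = 2.0 * v
    out[1:] -= v[:-1]
    out[:-1] -= v[1:]
    return out

b = np.ones(n)
x = cg(lap, b)
```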
Is it possible to calculate the theoretical performance of a cluster under HPGMG, then do a practical run and come up with an efficiency number like in HPL?
One of the biggest reasons for use of HPL is that many sizing considerations can be based off of the theoretical calculations.
But anyway this is very interesting. I definitely need to check this out.
HPL has an abundance of flops at all scales (N^{1.5} flops on N data), so one can expect a decent fraction of peak flop/s on any architecture with enough memory and adequate cache performance. This is a problem because architectural tricks like doubling the vector registers without commensurate improvements in bandwidth, cache sizes, load/store/gather/scatter produce huge (nearly 2x) benefit for HPL and little or no benefit to a large fraction of real applications.
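A back-of-envelope illustration of that flop abundance (hypothetical function names; the constants are the standard LU operation counts):

```python
# HPL factors an n-by-n matrix: roughly (2/3)*n^3 flops on n^2 words of
# data -- i.e. N^{1.5} flops on N words -- so arithmetic intensity
# (flops per word moved) grows linearly with n, and caches can keep the
# floating-point units busy regardless of memory bandwidth.
def hpl_intensity(n):
    flops = (2.0 / 3.0) * n ** 3
    words = n * n
    return flops / words  # = (2/3) * n flops per word

# Doubling the matrix dimension doubles the flops available per word:
big, small = hpl_intensity(20_000), hpl_intensity(10_000)
```

Most real applications sit at a low, fixed arithmetic intensity instead, which is why the vector-register doubling helps HPL but not them.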
HPGMG is representative of most structure-exploiting algorithms in that it does not have this abundance of flops, thus theoretical performance is actively constrained by both memory bandwidth and flop/s. We see many active constraints in practice; e.g., improving any of peak flop/s, memory bandwidth, network latency, or network bandwidth produces a tangible improvement in HPGMG performance. Depending on the fidelity of the performance model, these dimensions can be a fairly accurate predictor of performance, but ILP, compiler quality, on-node synchronization latency, cache sizes, and similar factors also matter (more for HPGMG-FE than HPGMG-FV).
I think it is actually quite undesirable for benchmark performance to be trivially computed from one parameter in machine provisioning. No computing center has a mission statement asking for a place on a benchmark ranking list (like Top500). Instead, they have a scientific or engineering mandate. Press releases tend to overemphasize the ranking and I think it is harmful to the science any time the benchmark takes precedence over the expected scientific workload. HPGMG is intended to be representative in the sense that if you build an "HPGMG Machine", you'll get a balanced, versatile machine that scientists and engineers in most disciplines will be happy with. I'd still rather the centers focus on their workload instead of HPGMG.
The information I've seen about Sibyl (the Google ML system, not the genomics package (http://sybil.sourceforge.net/documentation.html)) says it is basically doing logistic regression using a parallel algorithm (Collins, Schapire, Singer) with a transpose on each iteration. Without knowing more about the problem sizes and data sparsity/irregularity, I expect the transpose to be a significant expense. I'd be happy to read more if you have access to further technical information, but it's not clear how this comment relates to your previous question about CG and deep learning. As it relates to HPGMG, I think my previous response covers the important performance dimensions. I'd be happy to discuss further over email.
Both logistic regression and deep learning are basically just big conjugate gradient minimizers.
What I meant by asynchronous is that not all terms in a gradient are required to be summed in the same step.
The transpose step in Sibyl is implemented in the Shuffle and Reduce phases. The filesystem is used to hold the temporary data. Nevertheless, even for large systems, very few steps are required, and step times are reasonable, even compared to modern supercomputers. This is a tribute primarily to the design of Sibyl and the implementation of MapReduce at Google.
This is all explained in online versions of the Sibyl presentation. I really wish more people from DOE who write modern solvers would pay attention to this stuff.
It depends; I meant it's broken if you plan to compute other types of problems.
For instance, some big graph problem that doesn't involve linear algebra. Then you would favor the benchmarks from http://www.graph500.org/.
But what if you want to optimize for programs that are communication intensive, or memory intensive?
Should the FLOPS of a very specific linear algebra suite be used as the metric of best computers?
The extrapolation on the top500 supercomputer list [1] estimates the first EFlop computer in 2019. The math in the article is weird. They say 20x faster, but 20 x 33.9 PFlop/s is quite a bit less than 1 EFlop/s.
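The arithmetic, spelled out (using the Rmax figure quoted at the top of the thread):

```python
# Current #1 machine: 33,862.7 TFlop/s = ~33.86 PFlop/s
current_pflops = 33_862.7 / 1000
target = 20 * current_pflops   # the stated "20x faster" goal
# ~677 PFlop/s -- well short of 1 EFlop/s (1000 PFlop/s)
```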
The extrapolation in the linked graph pretty clearly doesn't track increases after 2012 correctly. Tracking the past 3-4 years puts 2019 at around 600PF.
It could very well be that there are diminishing returns involved, but I agree - they should be aiming to surpass the current tech by at least 100x in the next ten years.
The current trend [1] would suggest that exascale is not on that kind of trajectory. The dominance of non-US countries (especially China) in the Top500 rankings is as much of a driver here.
1. The top 500 list is broken, as people who are serious about real world applications give essentially zero shits about Linpack.
2. The dominance of China is achieved through the use of Xeon Phi accelerators, which may be great for Linpack, but have not made much of a splash for applications yet. GPUs are solidly beating Intel's accelerator offering in both adoption and performance.
Computer hardware innovation is a textbook example of diminishing returns. With each improvement in processor performance, size, energy usage, and heat management, it becomes more expensive to push the tech further. We're currently witnessing this effect in action with the recent stagnation in consumer processor speeds. They are still getting better in size, energy, and heat management, but average speeds have hovered around 2.5 GHz for years now.
I think this hasn't advanced more simply for economic reasons.
For the last 10 years I have read that processing is a lot cheaper when done by a network of computers and clusters, instead of an expensive supercomputer that also demands a dedicated building and infrastructure.
Modern supercomputers are essentially clusters, but with much more advanced network topologies and technologies, shared storage, etc. It's not just one monolithic machine.
However, the types of computation performed by the top supercomputers are rarely the "embarrassingly parallel" programs you can easily distribute via an @Home-style program, or something like Hadoop. They depend heavily on very reliable, very low latency, high bandwidth networks.
A bit more informative is the actual fact sheet put out by the white house [1]. What they are really aiming for is exascale computing, which they define as being capable of applying exaFlops to exabytes. From my limited knowledge, the latter will actually be the bigger deal. As pointed out elsewhere, an exaflop supercomputer will probably come around beforehand.
The real problem is getting 1 exaflop (or around it) within a reasonable power budget. The DOE's power budget for all of their supercomputing resources is 20 Megawatts, so at a full system level we would need to be at 50 GFLOPs per watt, while the best system right now is at 5.
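The 50 GFLOPs/watt figure follows directly from the numbers above:

```python
exaflops = 1e18   # 1 EFlop/s expressed in flop/s
budget_w = 20e6   # the DOE's 20 MW power envelope
required_gflops_per_watt = exaflops / budget_w / 1e9
# -> 50 GFLOP/s per watt needed; the best systems today manage ~5,
#    so a 10x improvement in system-level efficiency is required
```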
That's single precision, which the DOE doesn't care about. The latest NVIDIA GPUs do 8 to 16 times better at single precision (32 bit) floating point compared to double precision (64 bit).
The best next-generation DP GFLOPs/watt from one of the big players will most likely be the 2016 Xeon Phi, at ~10-12 GFLOPs/watt... You are also forgetting that GPUs have a ~100W+ CPU sitting next to them, which brings down total efficiency significantly.
Shameless self promotion: My startup (http://rexcomputing.com) is aiming for 64 double precision GFLOPs/watt, and 128 GFLOPs/watts single precision for its first chip next year.
Your chip looks cool. I guess it may be tricky to adapt software to run on the thing? Or else you could try to sell Obama 4 million of them for his new computer.
A huge amount of overall system power is spent in data transport. Plus, double everything for cooling. That brings the total system efficiency way down from what the actual computational components spend.
> "I'd say they're targeting around 60 megawatts, I can't imagine they'll get below that," [Mark Parsons] commented. "That's at least £60m a year just on your electricity bill."
The story linked below about the French exascale plans explicitly states that it is the military applications division of the French nuclear energy agency (CEA) that is spearheading that effort. (I believe they work on both reactors and warheads.)
It looks like they are explicitly saying they want to make a machine that works for both types of HPC -- classic low-latency high-bandwidth internode communication (physics simulations) and modern Internet-driven high-bandwidth storage/node communications.
This is because the supercomputer community has long ignored the Internet-style of computation (MapReduce etc). But most of the new generation of scientists are adapting their codes to this new style, because dollar-for-dollar they can get more throughput than the classic style machines. Classic machines invest heavily in low-latency communication and typically require APIs like MPI to achieve it, while Internet HPC just uses well-designed TCP-based socket communications.
Building dual-design systems like this- especially when the community has little or no skill at building NG Internet HPC systems- is likely to produce a system that is good at few things.
Instead, build two systems. One is the largest (but not necessarily exaflop) machine you can afford, a classic supercomputer. Then, for the second, hire some datacenter designers from Google/Facebook and have them build a modern HPC cloud design.
The biologists will flock to the second one; they have long been underserved by the DOE supercomputing community.
It is important to not conflate "massively parallel" (HPC) and "massively distributed" (Internet-scale), they have different architectural requirements and solve different classes of problem. People with competency in either of these areas tend to overestimate their understanding of the other but they are not solving the same computer science problems even though they look similar on the surface.
Massively distributed systems do not get much benefit from low-latency interconnects. Massively parallel systems do, and in particular, it is a "throw hardware at the problem" kind of solution that helps cover for the fact that virtually no software designers can engineer efficient, non-trivial, massively parallel systems. MapReduce is a distributed model; outside of some trivial cases, it is a poor parallel model. And while the HPC community has a much better understanding of massive parallelism than the Internet-scale systems community, the HPC community largely doesn't grok massively distributed systems in the way that someone working on Google's infrastructure would.
I benefitted from having spent several years designing software for both HPC and Internet-scale systems. They are not fungible, and both communities grok things that the other is oblivious to. Even within the HPC community though, the number of people skilled at the design of massively parallel software systems is quite small, much smaller than people that know massively distributed systems.
You do not need two systems, you need one system and more people that have figured out how to design massively parallel software -- the real problem. It is difficult to overstate just how rare this skill is even within the HPC community.
> especially when the community has little or no skill at building NG Internet HPC systems
I would argue that the community of people who actually have the skills to take advantage of the interconnects in a classic HPC system is vanishingly small, and in consequence we've overbuilt them on an epic scale.
Allow me to vent. I had the good fortune to have a login on a "petascale" HPC system, and access to an allocation of hours.
The /scratch filesystem would fail weekly, which killed everybody's jobs. If you had a big run going when /scratch failed, you lost everything. Scratch failed so much because the models that were being used often did wildly inappropriate amounts of file IO --- debugging print statements, detailed intermediate calculations, excessively verbose output --- that worked all right in development but when run in parallel brought the filesystem to its knees.
Furthermore, the login nodes were almost unusably slow because of all the Python and Perl post-processing scripts running on them. This isn't even a matter of users being cheap with their hours --- post-processing would have been a tiny fraction of their allocations. Instead, it's that many of them gave no thought at all to how the post-processing might be structured and run through the batch scheduler, and saw no downside to abusing the login nodes for that purpose.
In conclusion, I can attest to at least one HPC system that was badly mismatched to its users' needs and level of sophistication, despite allocations of hours being awarded only to a small number of researchers from across the country through a highly competitive process. Building these things serves national and institutional pride far more than any utilitarian interest.
You're describing an exceptionally poorly built and used system. That said, it's not inconsistent with what I've seen as well.
My claim is that the design of classic interconnects is a big waste of money, because only a few codes need it, yet it dominates the cost (>50%) of the cluster. I've learned, from years of studying Google's papers, that there are better ways to build code that communicates, and those mechanisms are much easier to teach to scientists and computer scientists than MPI.
Here is my argument: when I worked for DOE, everybody told me I had to run my MD simulations on a supercomputer using all the processors, and I would be judged on my parallel efficiency. This meant using a code that used MPI to communicate at every (or every N) timesteps. I asked, instead, "Why not just run N independent simulations, and pool the results?" In this case, you run an M-thread simulation on each machine (where M = number of cores on the machine) with no internode communication at all except to read input files and write output files.
The short answer is, that approach works just fine, but the DOE supercomputer people won't let you run embarrassingly parallel codes because they already spent money on the interconnect to run tightly coupled codes.
In response to this, I went to Google, built Exacycle (loosely coupled HPC) and published this well-cited paper: http://www.ncbi.nlm.nih.gov/pubmed/24345941 which in my opinion put the last nail in the coffin of DOE-style physics simulations for molecular dynamics.
That said, there are systems which are so large you can't practically simulate a single instance of the system on a single machine, so you have to partition. Simulating the ribosome is a nice example. However, simulating the ribosome currently provides no valuable scientific data except to tell us that we have major problems with our simulation systems (force field errors, missing QM, electrostatic approximations, etc.).
Interesting! Would it be accurate to say that as the amount of computing power and memory per CPU has increased over the years, so also has the percentage of scientific problems where a single simulation instance will fit on a single CPU? Certainly if you can do so, it's more efficient (in both machine and human resources) to partition by one job per CPU.
Yes, for example when I did my PhD work ~2001 with a T3E I could run a simulation of a duplex DNA in a box of water by running it in parallel. This was true both for memory and CPU reasons. It limited to me studying a single sequence at a time, or 2-3 which was the practical limit on the number of concurrent jobs. This used the well-balanced design of the T3E, which had a great MPI system.
Eventually it reached the point (~2007) where I could fit the whole simulation on a single 4-core Intel box with similar performance. Then, I ran one "task" per machine, and scaled to the number of available machines. This uses only intra-node communication, which goes over a hub or crossbar on the motherboard. Much faster.
Now, I can fit many copies of DNA on a single machine (one task per core). This is far and away the best, because each processor just accesses its own memory, greatly reducing motherboard traffic, so the problem is basically CPU-bound instead of communication bound (this also now applies to GPUs, such that single GPUs can run one large simulation within its own RAM and not have to spill data back and forth over the CPU/GPU communication path).
This moves the challenge to the IO subsystem- I generate so much simulation data that I need a fat MapReduce cluster to analyze the trajectories.
none of this is news - what you're describing is really just strong scaling. and sure, most systems already have subsets of nodes set aside for post-simulation cleanup.
I'm not just describing strong scaling. I'm describing a cost-effective way to achieve it; that's what really matters.
Why have subsets of nodes for post-simulation cleanup? Why not just run that cleanup on the same nodes you used for simulation? Or other general nodes? Otherwise, you've got two sets of nodes which are used at lower utilization than they would normally be.
I know some people in the life sciences who were strongly encouraged to get Titan time. When they applied and presented ORNL with their embarrassingly parallel code, they were told to go away.
Yes, precisely my point. If I wanted to run BLAST by partitioning it to run embarrassingly parallel, they wanted me to use mpiBLAST - but mpiBLAST isn't actually any better for any real-world workload.
This is because 50% of the cost of the machine was the interconnect, and if they let those codes run, it means they wasted budget and will get less next time.
Until I hear that the funders/builders are spending the same amount of budget on machines that let biologists run embarrassingly parallel codes as they spend on TOP500 machines, it's not going to change.
I would argue that many of the failures you describe are due to a lack of libraries that can exploit these systems. You can't expect every biologist to be an expert in distributed computing. The HPC and algorithms communities failed to consider the users of these systems and happily tinkered in their academic niches. All the while, what was actually needed were easy to use libraries that allowed non-experts to benefit from the advances made. (Also, the algorithms community has lost interest in distributed memory computing).
The abuse of /scratch and the login nodes sounds like a classical mixture of not knowing, not caring, and limited time. That isn't something that has a technical solution.
I completely agree. My point is, the workload was lots of jobs using 100 nodes or less and doing massive amounts of file IO, while the cluster was essentially designed to run very large physics simulations, one at a time. This was a cluster that didn't need to exist.
/scratch on a Cray is (last I used one) a Lustre filesystem. It is the only file storage available to the compute nodes, which have no local storage of their own. No spinning disk, no SSD. So if the code is written such that it makes frequent small writes (e.g. it's peppered with print statements), the Lustre nodes get hammered by all the compute nodes, become the bottleneck, and will eventually fall over.
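The standard mitigation is to aggregate the tiny writes into a few large ones before they hit the shared filesystem. A toy sketch (the 1 MiB threshold is an illustrative choice, not a Lustre tuning value):

```python
import io

class BatchedWriter:
    """Aggregate many tiny writes into few large ones -- the way a
    code should talk to a shared parallel filesystem like Lustre."""

    def __init__(self, raw, threshold=1 << 20):
        self.raw = raw                # underlying file-like object
        self.threshold = threshold    # flush once this much is buffered
        self.buf = []
        self.buffered = 0
        self.flushes = 0              # count of actual filesystem writes

    def write(self, data: bytes):
        self.buf.append(data)
        self.buffered += len(data)
        if self.buffered >= self.threshold:
            self.flush()

    def flush(self):
        if self.buf:
            self.raw.write(b"".join(self.buf))  # one large write
            self.buf, self.buffered = [], 0
            self.flushes += 1

sink = io.BytesIO()
w = BatchedWriter(sink)
for i in range(100_000):           # 100k tiny "print statements"
    w.write(b"step %d\n" % i)
w.flush()
# ~1 MB of output reaches the "filesystem" in a handful of writes
# instead of 100,000 separate metadata-heavy operations.
```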
They're not. They're just the only ones who insist on doing it with a filesystem that can't handle the load.
The other problem is that their interconnect is relatively fragile. It's comparatively easy to crash the entire network, at which time your filesystem goes away and processing stops.
But thanks to Lustre, even when it's good, it's bad.
And it's not like there isn't precedent. I remember a couple years ago when Google showed off gene mapping at I/O (Urs' demo to visualize how easy it was to scale GCP From 1 to 10 to 1000 to 100000 cores), and now they've partnered with the Broad Institute to apply this more generally.
I was the person on stage with Urs who ran the demo (I'm a computational biologist who went to Google to help them build this kind of infrastructure, because DOE wasn't doing it). Read more about the demo here: https://cloud.google.com/compute/io
I also helped create the original idea for the Broad collaboration. It was pretty obvious that standard HPC was a waste of money and not designed for high throughput biology, while Google published numerous papers (MapReduce, GFS, Bigtable) that demonstrated they were building infrastructure that was perfect for a wide range of computational biology programs.
One of the largest challenges in building an exascale cluster is communication. Computing power increases at a higher rate than memory throughput does, and memory throughput increases faster than communication infrastructure advances.
Many argue that an exascale computer can only be cost efficient if the communication capabilities scale highly sublinearly with the computation done in the subsystems [1]. In particular, you can't move the data, and new algorithms are needed that can deal with data that is arbitrarily distributed. This is quite challenging and unfortunately the theoretical computer science community seems to have decided that distributed memory algorithms have been covered since the 90s and are not worth their time. Yet they ignore the progress that has been made in other models of computation since, and many algorithmic improvements of the last decades are not applicable. It is high time to develop communication-efficient algorithms for the basic "toolbox".
I guess what I'm trying to say is that you can't just throw MapReduce at an Exascale machine and expect it to perform well. Instead, you need an environment that is rich in primitives that have been implemented in a communication-efficient way. It's faster and cheaper to spend a little more effort on local communication if that allows for reduced communication volume (and/or the number of connections that need to be established!).
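To make "communication-efficient primitives" concrete, here's a toy counting model (not real network code) comparing sequential communication steps for a flat gather-to-root against a tree reduction:

```python
def flat_reduce_rounds(p):
    """Root collects from each of the other p-1 nodes in sequence:
    p-1 communication steps, all funneling into one endpoint."""
    return p - 1

def tree_reduce_rounds(p):
    """Binomial-tree reduction: in each round, half the remaining
    nodes send to a partner, so only ceil(log2(p)) sequential steps."""
    rounds = 0
    while p > 1:
        p = (p + 1) // 2
        rounds += 1
    return rounds

# At a million endpoints the gap is enormous: ~10^6 steps vs ~20.
flat, tree = flat_reduce_rounds(1_000_000), tree_reduce_rounds(1_000_000)
```

Real collectives (e.g. MPI's allreduce implementations) layer bandwidth- and topology-awareness on top of this, but the logarithmic-depth idea is the core of it.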
The issue I have with the MapReduce approach is that it doesn't particularly care about data locality. Thus it is very hard to achieve communication volume sublinear in the input size, which is absolutely deadly in an exascale setting.
I also understand the frustration with MPI, it is a very low-level API focused on data movement. It can be rather frustrating to use, but there do exist tools to make it more fun (Boost.MPI with C++11/14 is an excellent example). That said, with a well-engineered set of algorithmic tools, ideally you wouldn't need to use low-level MPI calls at all. However, MPI still remains a useful tool to implement these things.
Exascale computing requires us to rethink a lot of things.
You're incorrect in saying MapReduce isn't locality-aware. Hadoop supports machine, rack, row, and cluster locality scheduling.
Also, most modern Internet HPC systems dedicate a ton of design and equipment to having very high cross-sectional bandwidth, which enables the locality restrictions to be relaxed.
Well my point exactly. That "ton of design and equipment" doesn't scale particularly well, as its cost grows highly super-linearly with the computing power. You need to reduce communication volume to be cost effective at exascale.
This isn't true. You can build awesome high bandwidth clusters for extremely cheap. It takes an understanding of ethernet silicon and TCP implementations, but it can be done. Amazon for example recognized that superlinear cost scaling was killing their profits, and invested in building newer systems with better designs that solve these problems.
The main challenge is that because these are built with multistage routers, they have fairly high latency. So much of the effort in modern HPC systems used for Hadoopy workloads goes to latency hiding.
You say "awesome high bandwidth," but at 1 Gbit/s per node you're still a long way from an InfiniBand 4X FDR interconnect (54 Gbit/s and sub-microsecond latency, significantly lower than your network's). As you write, these are built with multistage routers, which add even more latency. So in effect they have reduced (but still high) communication capabilities to keep costs manageable, just as I said.
1 Gbit/sec, if you look at the Jupiter paper, was the host speed in 2004. The Jupiter system works with 10G and 40G interfaces on the host.
What's important to recognize is you simply cannot buy InfiniBand switches that let you connect a lot (10K+) of hosts together. The vendors won't sell you this, they won't do the R&D to make it, and it would cost a fortune anyway.
This is a deliberate choice: for most Internet work, it's better to have really fat bisection bandwidth and non-blocking fabrics, and latency is ignored due to the high cost of building a crossbar that supports that with high radix.
Unless you have an algorithm that absolutely requires low latency and simply cannot be fixed, you are almost always better off building a cheaper, fatter fabric and hiring engineers who know how to write latency-tolerant applications.
You're assuming the goal is to build the fastest supercomputer in the world. I think the goal is probably closer to "get the computing resources we need at the lowest cost".
You could compare raw FLOPS (Floating point operations per second) but that would only tell part of the story. These supercomputers are highly engineered for low network latency between nodes, which is necessary for many scientific workloads. Google and other companies are generally able to express their algorithms in highly parallel ways, which means there are much reduced requirements for communication between nodes.
Therefore, even if the raw performance in terms of FLOPS sound similar, the two systems will have widely differing performance on real workloads.
Capturing and indexing the entire web is certainly a real workload, even if it is massively parallelizable, so it would probably run equally well on Google's infrastructure as on a supercomputer, because those fast interconnects wouldn't provide much advantage, right?
However, when simulating a nuclear explosion or a weather system (maybe that's what you mean by "real" workloads?), the heavy node-to-node communication makes the supercomputer much, much better suited.
Comparing Bitcoin miners to any supercomputer or server farm is meaningless at best and deceiving at worst. Current Bitcoin miners cannot do anything other than one specific calculation.
The NSA's currently being sued over its metadata collection. I wrote that comment slightly tongue-in-cheek, but repurposing that supercomputer for civilian use would actually be a great way to recoup some of your losses.
That seems to be more of a storage data centre. And if Wikipedia's figure of 60 megawatts is correct, it's well in the ballpark of a datacenter for Google or Facebook.
Edit: I should also say it is amusing to hear Obama talking about exascale computing by 2025 when the NSA's goal (read the Wired article) is to get there by 2018.
I suppose that "supercomputers" are all multi-processor these days, so the colossal FLOP numbers are counted as an aggregation over many processors and one has to coordinate these processors in any application that takes advantage of the FLOP specs.
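To make that aggregation concrete: the headline figure is just a product of node count, cores per node, clock rate, and FLOPs per core per cycle. A back-of-envelope sketch, with purely illustrative numbers (not any real machine's specs):

```python
# Back-of-envelope aggregate peak FLOP/s for a hypothetical cluster.
# Every number below is a made-up placeholder, not a real machine's spec.
nodes = 16_000            # compute nodes in the system
cores_per_node = 16       # cores per node
clock_hz = 2.2e9          # per-core clock in Hz
flops_per_cycle = 8       # e.g. wide SIMD with fused multiply-add

peak_flops = nodes * cores_per_node * clock_hz * flops_per_cycle
print(f"Aggregate peak: {peak_flops / 1e15:.1f} PFLOP/s")
```

Of course that product is a theoretical ceiling; actually coordinating all those cores on one problem is the hard part.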
Now I am curious what is the fastest single processor?
I don't think knowing that is useful. "Single processors" are all superscalar or pipelined these days, so the colossal single-thread FLOP numbers are counted as an aggregation over many arithmetic units and one has to coordinate these units (mainly avoiding branch mispredictions) in any application that takes advantage of them.
It's still relevant. Some applications remain single threaded due to data dependencies.
Mechanical CAD geometry kernels are one such application. Recently, I had a use case that demanded the peak single threaded performance.
In PC land, it's this chip clocked at 4.x GHz: Intel® Core™ i7-4790K. Its multi-core performance is pretty great too, so it's not that big of a trade-off to maximize single-thread performance.
I would be very interested in knowing what faster solutions exist. Are there any, regardless of instruction set?
I guess it depends on how 'core' is defined. A single core of an Intel CPU can work on 2 threads at a time; the POWER8 can work on 8.
I assume that when most people think of 'single core' performance, they really mean 'single thread'. I think Intel's CPUs win out there, based on your linked benchmarks.
If you're memory-bandwidth limited, then yes. Otherwise you're better off with a decently clocked Intel. Anything that spends a portion of its time running out of L2 or better on the Xeon will be significantly faster.
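Whether a kernel is bandwidth-bound comes down to its arithmetic intensity (FLOPs per byte of memory traffic). A toy roofline model makes the crossover visible; the machine numbers here are invented for illustration, not the specs of any particular Xeon or POWER8:

```python
# Toy roofline model: attainable FLOP/s is capped either by peak compute
# or by memory bandwidth times arithmetic intensity (FLOPs per byte).
# Machine numbers are illustrative placeholders, not a real CPU's specs.
PEAK_FLOPS = 500e9        # 500 GFLOP/s peak compute
MEM_BW = 50e9             # 50 GB/s memory bandwidth

def attainable(intensity_flops_per_byte):
    return min(PEAK_FLOPS, MEM_BW * intensity_flops_per_byte)

# STREAM-like triad (a[i] = b[i] + s*c[i]): ~2 FLOPs per 24 bytes moved
print(attainable(2 / 24) / 1e9, "GFLOP/s  (bandwidth-bound)")
# Well-blocked dense matmul: high intensity, so compute-bound
print(attainable(50) / 1e9, "GFLOP/s  (compute-bound)")
```

The low-intensity kernel only cares about memory bandwidth (where a big-bandwidth chip wins), while the high-intensity one hits the compute ceiling, which matches the "running out of L2 or better" point above.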
I happened to catch an overclocking competition being streamed on Twitch one late night many months ago. It was really interesting to see more about the methods and techniques involved and how the competitions work.
See my comment above for exascale info. The single-chip performance will vary wildly. The important measurement is how many operations of useful work per second per watt the system will do. That's the gist of what I've learned from HPC people.
Exascale is power-hungry so power must go way down and efficiency of calculation way up.
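To put rough numbers on that efficiency gap (the ~20 MW figure is the often-cited exascale power target, and the Tianhe-2 power draw is an approximate public number):

```python
# FLOP/s-per-watt arithmetic behind the exascale efficiency gap.
# Power figures are approximate public numbers, used only for scale.
current_flops = 33.8627e15   # Tianhe-2 Linpack Rmax (figure from this thread)
current_watts = 17.8e6       # ~17.8 MW reported for Tianhe-2
target_flops = 1e18          # exascale
target_watts = 20e6          # often-cited ~20 MW exascale power goal

print(f"today:  {current_flops / current_watts / 1e9:.1f} GFLOP/s per watt")
print(f"needed: {target_flops / target_watts / 1e9:.1f} GFLOP/s per watt")
```

That's an efficiency improvement of well over an order of magnitude, which is why power dominates the exascale conversation.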
Probably an Intel Xeon (or i7?) overclocked and cooled with liquid nitrogen. There may be a specialized processor with higher performance, but I doubt it.
Water cooling from Antec; it fits into a regular desktop case. Under load it sounds like a Hoover. Some cheap RAM, some motherboard for $200, a 500-watt PSU.
I use it for an IDE (the Scala compiler is slow).
No problems with stability; it ran a few times over a weekend at full load. Maximum temperature was about 90°C. I have set quite a high voltage. Over summer it goes down to 4.8GHz.
I find it odd that a president would sign an executive order for a new kind of computer (even if the new kind of computer is technically impressive). Call me cynical, but I'll bet that there's a number of quid-pro-quo arrangements with big party donors (or soon to be donors) -- regardless of which party is in office.
It's not that sinister, it's just PR. This project likely would have been in the budget, anyway, and the executive order is just a way to get the President's name behind it and his image in front of it.
I'd really love to see speed measured by the performance of a single collective computation of an O(n) or O(n log n) algorithm. This would emphasize the importance of balancing communication performance with computation. Not holding my breath, though; the LINPACK is strong with these people...
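As a sketch of why such a benchmark would stress communication: a tree-based parallel reduction has a simple cost model where local work shrinks with node count but communication grows with log2(p). The constants below are invented purely for illustration:

```python
# Toy cost model of a tree-based parallel reduction over n elements on
# p nodes: local work on n/p elements plus log2(p) communication steps.
# t_flop and t_msg are made-up constants, chosen only for illustration.
import math

def reduce_time(n, p, t_flop=1e-9, t_msg=1e-6):
    local = (n / p) * t_flop                 # each node reduces its chunk
    comm = math.ceil(math.log2(p)) * t_msg   # tree combine across nodes
    return local + comm

# With n fixed, adding nodes eventually stops helping: communication
# (which depends on the network, not the FLOPS) starts to dominate.
for p in (1, 64, 4096, 262144):
    print(p, reduce_time(10**9, p))
```

A dense-matrix benchmark hides that second term almost entirely, which is the complaint about LINPACK in a nutshell.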
Reading over the list of priorities in the PDF linked from the Whitehouse blog, the one that I was the most pleased about was improving HPC productivity.
IMHO if you need to break down a task well enough to run on a supercomputer, there isn't a lot more to do to make it run on a regular server farm.
edit: Actually, in the scenarios you'd use a supercomputer for, the added latency and overhead (shoddy servers, network, etc.) would most likely make the run time orders of magnitude higher.
Maybe for embarrassingly parallel tasks, but if you require nontrivial interprocess communication, a server farm can't compete with the interconnect of a modern supercomputer.
And by "codes", you mean specific legacy software artifacts written in FORTRAN (note that I'm not even spelling it as Fortran)? Of course that's a problem.
No, that means specific problem domains that are not easily partitioned, and where latency or affinity are the primary performance constraints.
There are still some Fortran libraries in large-scale use for this sort of thing. They are still in use because they are very good, and replacing them would be very expensive for little gain.
Write in any language you want; that's irrelevant to the nature of the computations being done here.
By codes they mean -- at the minimum -- pretty much anything that requires frequent communication between any or all nodes as a necessary part of computation. (For example, simulations across a large 3D space, where the changing states of particles on node A directly impacts the states of particles on adjacent nodes.)
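A toy 1-D version of that coupling, with plain Python lists standing in for MPI ranks (a real code would do these halo exchanges over the interconnect every timestep, which is exactly where latency bites):

```python
# Minimal sketch of why stencil codes need neighbor communication:
# a 1-D domain is split across "nodes"; before each update, every node
# needs its neighbors' edge cells (the halo). Lists stand in for ranks;
# this is a toy model, not real MPI.

def step(chunks):
    new = []
    for i, chunk in enumerate(chunks):
        left = chunks[i - 1][-1] if i > 0 else 0.0                # halo from left node
        right = chunks[i + 1][0] if i < len(chunks) - 1 else 0.0  # halo from right node
        padded = [left] + chunk + [right]
        # 3-point averaging stencil over the node's interior cells
        new.append([(padded[j - 1] + padded[j] + padded[j + 1]) / 3
                    for j in range(1, len(padded) - 1)])
    return new

chunks = [[0.0, 0.0], [0.0, 9.0], [0.0, 0.0]]  # domain split over 3 "nodes"
print(step(chunks))  # the spike on node 1 spreads onto node 2 via the halo
```

Note that node 2's update depends on node 1's edge value, so no node can advance a timestep until its neighbors have communicated.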
You can't write your code in any language you want on a supercomputer.
Also, there is a wide range of literature about communication patterns for supercomputer apps; my argument is that oftentimes, to solve the problem that matters, you may not actually need to run the simulation you think you do. It's more that people are just used to running that way.
For example, with MD, you can run 1 sim parallelized over 100 machines using tightly coupled communication (doesn't necessarily mean the forces and positions of every particle have to be shared between node decompositions) or run 100 sims over 100 machines, with no communication except for input and output files. The latter can often answer the same question far more cheaply.
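The ensemble pattern is trivial to express; here's a minimal sketch using a local process pool, where run_sim is a hypothetical stand-in for a real MD run, not actual MD code:

```python
# Sketch of ensemble ("100 sims over 100 machines") parallelism: each
# worker runs an independent simulation, and the only communication is
# the input seed and the output observable. run_sim is a dummy stand-in.
from multiprocessing import Pool
import random

def run_sim(seed):
    rng = random.Random(seed)
    # Pretend this is a full MD trajectory; here it's a dummy observable.
    return sum(rng.random() for _ in range(1000)) / 1000

if __name__ == "__main__":
    with Pool(4) as pool:
        results = pool.map(run_sim, range(100))  # 100 independent runs
    print(f"ensemble average: {sum(results) / len(results):.3f}")
```

Because the runs never talk to each other, this scales across commodity machines with no fancy interconnect at all, which is the point of the comment above.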
No, the people who run the clusters won't let you run any language just because it has an MPI binding. They invest a lot in ensuring peak performance, and right now, only C++ and FORTRAN can achieve that. Very few, if any, major supercomputer centers support Java codes.
Oh, you're talking about a policy limitation, not a technological one. (And if you're talking about the DOE or NSF/Teragrid/XSEDE clusters, then you're probably right. Haven't touched those in years -- and even when I did, I wasn't doing anything crazy.)
To be frank, if I were running a computer that was designed for peak performance, I probably wouldn't use Java. There are some very significant performance issues with garbage collection that prevent you from making peak use of the machine.
Supercomputers aren't built so that people can squander the resource (desktop PCs, closet clusters, and phones fulfill that role).
It is a technical limitation. Oftentimes the platform is so specialized that only a tiny handful of compilers are ported to it. Say, just gcc, g++, and gfortran, and xlc, xlC, and xlf. And just one version at that. Java would require porting the JVM to the cut-down, weird Linux on the compute nodes. Some $$$ machines don't even support dynamic linking! The number of these machines is so small that extensive compiler and tool support just isn't happening unless you want to add millions to the cost.
The JVM probably calls fork() and system(), no? Not allowed. Dynamic thread creation? Not allowed. And 50% of your flops go away unless your program uses the BG/Q-specific "double hummer" floating point instructions. These are primitive machines, in terms of development environment and typically require significant rewriting to get even "standard" system software working.
The U.S. and other countries have been in a race for exascale. The thing holding us back isn't funding or political will: exascale is so ridiculously hard that it requires fundamentally different architectures. The main issues are making our CPUs do more work, eliminating memory bottlenecks, and dramatically improving the energy efficiency of both. These are very tough technical challenges that may also have to be solved on process nodes that are themselves difficult.
REX Computing is one attempt, whose founder posts here a lot (except in one thread dedicated to it, lol). I'm curious if any other exascale researchers read HN and can post their concepts, as it's probably interesting stuff. Here are some links for readers interested in this stuff.
Step 1: Nuclear test ban treaty. Step 2: Avoid nuclear test disasters. Step 3: Maintain military status quo.
Notice step 3 implies international stability and hence profit. NNSA stresses computation so heavily because stockpile stewardship cannot be done by noncomputational means.
Interesting point. Exascale is a lot more than that though: many stakeholders. And, even if none, it's still going to get funded as another international pissing contest (see Top 500). ;)
One of the issues faced in supercomputing isn't just raw horsepower or more CPUs; it's latency. It's not enough to just connect a ton of machines via Ethernet; you need specialized hardware to provide low-latency, high-throughput sharing of data.
It kills me every time I read about how the world's nth-fastest computer is just used to simulate nuclear explosions, so it was a delight to see that they're planning on using this one for some good.
I don't know if this is what you're referring to, but a common application for government-owned supercomputers is simulating the degradation of nuclear warheads. The degradation of the fissile material, as well as its surroundings, is highly critical to a nation's security, and also very hard to model well.
Of course in an ideal world those cycles would be used to help cure cancer, but given that these warheads exist, it's probably a good idea to invest resources into getting an idea of what shape they're in.
And it kills me every time someone makes a silly comment like this. Someone has to spend the large amount of money for the R&D. Your model of how the world works is flawed.
We went to the moon because of the Cold War. The military paid for the development of the Internet. GPS? Supersonic flight? Nuclear energy? Autonomous vehicles?
I wish private industry would do more. Every company should have a Bell Labs.
> Rather than tell us how much you hate the military and capitalism maybe you can figure out a better way.
Well, knowing a few scientists, I can say that I know well the creative and inquiring spirits that advance us. The world is structured so that the people who do the actual development and discovery have to sell their talents to the logic of property and capital, either directly, or to the government that enforces that structure.
So, I think we're probably in agreement on the forces wherefrom those technologies came. I think we're also in agreement on whose behalf those forces act. I think we're probably in disagreement that capital is any better a master than the government it is in collusion with.
There may be a better solution. Just because no one has found it doesn't mean we can't do better. However, I'm not one for cursing the darkness. Light a small candle and lead the way.
I think the solution is fewer solutions. When I am working well with others, there's nothing much to fuss about. When there are conflicts, it's not ideology or a system that wins the day. I think we two would collaborate just fine without such stuff, for instance. What works in conflict is the willingness to discard systems and ideas. Or, even better, the lack of willfulness to enforce systems and ideas in the first place.
Being conceptually slippery enough to get out of any problem comes naturally to skillful people in their fields. The situation at hand provides all the impetus for theory and practice there is. Conflict thrives off of superfluity: superfluous methods, superfluous justifications, and superfluous issues. Methodology and structure gets in the way of skillfulness.
I know that real human power scales on its own without armies and guns forcing it into a certain shape. People collaborate and collude very easily. People form groups very easily. People associate with people. It's probably the only culturally universal thing people do (well of course; culture only makes sense when there are associating people).
If we want association to work well, groups need to be able to disintegrate as easily as they come together. People need to be slippery, too. Without what makes groups cohere, they disintegrate on their own. When you see violence and hierarchy used to keep groups coherent, it means they've lost the essential power that makes them useful.
People like achieving status. Status is a social signal, and it communicates both ways. Status is not held like a title. Titles wax and wane in the status they confer just like every other human object and activity. When we pretend status can be held, that it can be concentrated and preserved, we get problems. If we let status come and go by its own logic, we'd have no problems with status. A society where status is maintained by a legal system will have hierarchies, and it will certainly have problems.
So, ways that our current systems (not just capitalism) fail:
Labor is not free.
Association is not free.
Status is not free.
Even skillful people who believe in property as an organizing force wish to be locally free in these ways. Skillful people want to work unimpeded by the politics of labor, associating with likeminded people unimpeded by the politics of social groups, praised for the inherent virtue of their actions and unimpeded by the politics of status. They only believe in property because it helps them manage the world that is beyond their control, beyond the means of their skill. They use property to create a bubble in which they can live that is free from its control.
People who believe in property because they enjoy its logic, enjoy the wheeling and dealing, enjoy the frantic rush to get more of it, who feel more worthwhile the more property they have are nervous people, who can never be fulfilled, because property's logic does not lead to fullness, the only conclusion is 'not enough', the only purpose is 'more'.
I don't know how many people in the latter category actually exist. I suspect enough to cause a great deal of problems. In any case they require the assent of everyone else. I think all I have to say to everyone else is this:
The world without property is still full and whole, is still just, is still full of vitality, is still nourishing, and receptive to human power.
I would not have wagered the world for what we got out of it. The fact that we're around to talk about it is fine and all, but the last century is not a good model for any century.
The term "embarrassingly parallel" doesn't refer to algorithms; it refers to problems that can be computed in such a manner, the point being that they're not very interesting for parallel-algorithms research. But yeah, it's kind of a stupid term. Would you be okay with "pleasingly parallel"? ;)
Potato, potato. I don't make the rules. I just think "embarrassingly parallel" sounds like a college freshman trying way too hard to sound smart and cool.
It's called "embarrassingly parallel" because it is embarrassing for Cray's and other supercomputer vendors' salespeople when their expensive hardware doesn't outperform a five-year-old solution that costs half as much as their fancy gear ;)
Alternatively, "data parallel" is another phrase that is used for the same concept.
There are dual-use systems which run classified and non-classified codes (they can be partitioned). They typically live at LANL or LLNL rather than ANL or Berkeley (Berkeley in particular doesn't do any classified work). Note that most classified codes are actually physics/material/explosion/plasma-physics simulations, and you can't always tell they are running on your system.
I checked the list of projects running on ALCF and I see some pretty obvious nuclear weapon stockpile stewardship and weapon design projects, such as "Validation Simulations of Macroscopic Burning-Plasma Dynamics"
Nucleosynthesis of heavier materials in supernova explosions and neutron star mergers are two cases of problems that I guess are used for "publicly" validating dark codes.
(I know a guy that Los Alamos is trying to recruit to do some dark work for them, and he does nucleosynthesis in neutron star mergers.)
Those would be useful, but typically the validation codes use terms like "multiscale combustion physics" or "coupled neutron/radiation transport".
I think the folks simulating supernova and neutron stars have a lot of physics overlaps, but I don't think that data is used directly for stockpile stewardship.
99% of the compute time on this will probably be used for weapons, combat and civil emergency wargame simulations - but that doesn't make for good PR so of course they don't mention it in this sycophantic fluff piece.
However, since the US military recently labeled climate change as the #1 national security threat to the US or something like that, maybe we could hope for some climate science on the side? ;-)
I work in this field. The short of it is: there is more money out there for PIs to get defense related grants vs cure for cancer type grants. Some of it is for good, e.g. modeling where to dispatch mobile hospitals for the Ebola outbreak.
Modeling the world's population doing things like fleeing a dirty bomb takes an astonishing amount of compute if you want answers in a timely fashion.
I think one of the primary motivations for doing this is to break cryptographic keys, and using that for surveillance and to hack into Chinese websites.