Ten years ago the fastest supercomputer was BlueGene/L, rated at 136.8 TFlop/s. The current fastest supercomputer is rated at 33,862.7 TFlop/s, or about 247 times faster.
It seems to me that the aim of taking 10 years to build a supercomputer that is only 20 times faster than the current one might fall a little short if it's aiming to take the top spot.
This isn't only about the FLOPS; the big trend among countries ordering new supercomputers for 2020/2025 is a strong focus on power. Current supercomputers consume a lot.
Also, the FLOPS measurement is a bit broken: it focuses on dense linear algebra problems, for which GPUs and other accelerators easily boost the results. If all you plan to do is run simulations that parallelize easily on GPUs, that's fine; for other types of programs it is hard to tell which is the fastest supercomputer.
Power is over-emphasized in HPC circles. According to FOIA reports (and you can back out similar from public budget information), less than 10% of the budget is going to energy. It is frequently used as an excuse to build machines that are inappropriate for the science (not just "hard to program", but actually inappropriate in the sense that even with infinite programming effort, they deliver less scientific value than a more conventional architecture). There is some value in making scientists uncomfortable so that they think of creative algorithmic solutions that may pay off as the inevitabilities of semiconductor physics become more apparent, but seeing the number of applications that are within small constant factors of proven barriers and the willingness to compromise quality of solution and/or run scientifically irrelevant configurations to demonstrate "speedup", I think it has gone too far.
Perhaps it could be the size of the matrix that can be inverted on it in an hour, with IEEE double-precision floats, using some standard algorithm.
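A back-of-envelope sketch of that metric, assuming inversion costs roughly 2*n^3 double-precision flops (LU factorization plus triangular solves) and an assumed sustained efficiency; the specific numbers here are illustrative, not a proposal:

```python
# Rough sketch: what matrix size n could be inverted in one hour,
# assuming inversion costs about 2*n^3 double-precision flops and the
# machine sustains some fraction of its rated Linpack rate.
# Peak rate and efficiency below are illustrative assumptions.

peak_flops = 33.8627e15   # current #1 Linpack rating, flop/s
efficiency = 0.6          # assumed sustained fraction of that rate
seconds = 3600.0          # one hour

flop_budget = peak_flops * efficiency * seconds
n = (flop_budget / 2.0) ** (1.0 / 3.0)
print(f"n ~ {n:,.0f}")    # on the order of a few million unknowns
```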
The "High Performance Conjugate Gradients" benchmark was proposed a couple years ago as an alternative metric for ranking supercomputers. Its proponents claim its behavior is more similar to real applications (irregular access patterns, lower ratio of computation to memory access, etc), compared to linear algebra problems like the "High Performance Linpack" benchmark currently used by the Top500.
HPCG basically measures STREAM and has many technical flaws that make it scale-dependent and difficult to adjudicate. As codeveloper of a different benchmark, I'll just cite this paper from a third party: https://hpgmg.org/static/MarjanovicGraciaGlass-PerformanceMo...
The reality is that there are many dimensions to supercomputing performance and it's impossible for one number to capture the utility of the machine. Our HPGMG benchmark (https://hpgmg.org) attempts to strike a balance and give useful supplementary information. I do think it's better than any other single benchmark for evaluating today's machines and will also prove to be more durable over time.
How would you use a benchmark like this to predict the performance of a well-designed asynchronous parallel conjugate gradient solver, like most modern deep learning neural networks that run on Internet HPC machines?
CG isn't truly asynchronous due to its reductions. It can be pipelined in various ways (we have several implementations in PETSc), but performance requires a quality implementation of asynchronous reduction (e.g., MPI_Iallreduce) which the vendors have been slow about developing (I've been working with some on fixing this and Cray has made recent progress).
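To make the point about reductions concrete, here is a minimal serial CG sketch with the blocking dot products marked; in a distributed implementation each one becomes a global reduction (e.g., MPI_Allreduce) that every rank must wait on. This is an illustration of the structure of the algorithm, not any particular library's implementation:

```python
import numpy as np

def cg(A, b, tol=1e-10, maxit=200):
    """Plain conjugate gradient; the dot products are the reductions
    that keep CG from being truly asynchronous."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r                      # reduction: global dot product
    for _ in range(maxit):
        Ap = A @ p                  # matvec: neighbor communication
        alpha = rr / (p @ Ap)       # reduction: all ranks block here
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r              # reduction: convergence check
        if rr_new < tol**2:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

# SPD test problem: 1-D Laplacian
n = 50
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x = cg(A, b)
print(np.linalg.norm(A @ x - b))
```

Pipelined variants reorder the algebra so these reductions can overlap other work (hence the interest in MPI_Iallreduce), but they cannot eliminate them.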
With respect to deep learning and other applications using CG or related algorithms, the bottlenecks depend on the scale, and ability to expose locality, and operator/preconditioner representation. If there is no locality, then matrix-vector products require all-to-all communication which tend to dwarf the cost of the reductions in CG. Even with locality in the matrix-vector product, preconditioners often need to communicate globally in a scalable way similar to HPGMG. Operators need not be represented as a table of numbers or a sparse matrix format, but could use a tensor product, fast transform, or other information to compute the action using less storage. If they are represented explicitly (sparse or dense), then matrix-vector product performance (thus CG as a whole) is dominated by memory bandwidth for problem sizes that do not fit in cache. HPGMG tries to strike a balance between memory bandwidth demands and compute using a matrix-free representation. HPGMG also reports dynamic range expressed as Performance versus Time-to-solution as the problem size is varied, which allows applications to see performance barriers that might be relevant to them (e.g., see how Titan cannot do a solve in less than 200 ms while Edison can do 50 ms, and how that relates to climate simulation performance targets; see slide 7 of https://jedbrown.org/files/20150624-Versatility.pdf).
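A tiny sketch of the matrix-free point above, using a 1-D 3-point Laplacian as a stand-in operator: the stencil version touches only the vectors, while the assembled version must also stream every stored coefficient (plus index data in a real sparse format), which is why explicit matvecs are memory-bandwidth bound. Sizes and the byte accounting are illustrative:

```python
import numpy as np

def apply_stencil(u):
    """Matrix-free action of the 1-D Laplacian (Dirichlet ends):
    v_i = 2*u_i - u_{i-1} - u_{i+1}, no stored matrix needed."""
    v = 2.0 * u
    v[1:] -= u[:-1]
    v[:-1] -= u[1:]
    return v

n = 1000
A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)  # assembled form
u = np.random.rand(n)

assert np.allclose(A @ u, apply_stencil(u))  # same operator action

# Stored-coefficient traffic per matvec (CSR-like: double + int32 index)
nnz = 3 * n - 2
print(f"assembled: ~{nnz * 12} bytes of matrix data; matrix-free: 0")
```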
Is it possible to calculate the theoretical performance of a cluster under HPGMG and then do a practical run and come with an efficiency number like in HPL ?
One of the biggest reasons for use of HPL is that many sizing considerations can be based off of the theoretical calculations.
But anyway this is very interesting. I definitely need to check this out.
HPL has an abundance of flops at all scales (N^{1.5} flops on N data), so one can expect a decent fraction of peak flop/s on any architecture with enough memory and adequate cache performance. This is a problem because architectural tricks like doubling the vector registers without commensurate improvements in bandwidth, cache sizes, load/store/gather/scatter produce huge (nearly 2x) benefit for HPL and little or no benefit to a large fraction of real applications.
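A quick check of that N^{1.5} claim: HPL factors an n-by-n matrix with ~(2/3)n^3 flops over N = n^2 entries, so flops-per-byte grows linearly with n and the benchmark can stay compute-bound at any scale:

```python
# Arithmetic intensity of HPL (dense LU): ~(2/3)*n^3 flops over
# 8*n^2 bytes of double-precision matrix data, i.e. n/12 flops per
# byte -- it grows without bound as the problem size grows.

for n in (10_000, 100_000, 1_000_000):
    flops = (2.0 / 3.0) * n**3
    bytes_ = 8 * n * n            # double-precision matrix storage
    print(f"n={n:>9,}: {flops / bytes_:8.0f} flops per matrix byte")
```

Compare a sparse or stencil matvec, which is stuck at a small constant flops-per-byte no matter the size.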
HPGMG is representative of most structure-exploiting algorithms in that it does not have this abundance of flops, thus theoretical performance is actively constrained by both memory bandwidth and flop/s. We see many active constraints in practice; e.g., improving any of peak flop/s, memory bandwidth, network latency, or network bandwidth produces a tangible improvement in HPGMG performance. Depending on the fidelity of the performance model, these dimensions can be a fairly accurate predictor of performance, but ILP, compiler quality, on-node synchronization latency, cache sizes, and similar factors also matter (more for HPGMG-FE than HPGMG-FV).
I think it is actually quite undesirable for benchmark performance to be trivially computed from one parameter in machine provisioning. No computing center has a mission statement asking for a place on a benchmark ranking list (like Top500). Instead, they have a scientific or engineering mandate. Press releases tend to overemphasize the ranking and I think it is harmful to the science any time the benchmark takes precedence over the expected scientific workload. HPGMG is intended to be representative in the sense that if you build an "HPGMG Machine", you'll get a balanced, versatile machine that scientists and engineers in most disciplines will be happy with. I'd still rather the centers focus on their workload instead of HPGMG.
The information I've seen about Sibyl (the Google ML system, not the genomics package (http://sybil.sourceforge.net/documentation.html)) says it is basically doing logistic regression using a parallel algorithm (Collins, Schapire, Singer) with a transpose on each iteration. Without knowing more about the problem sizes and data sparsity/irregularity, I expect the transpose to be a significant expense. I'd be happy to read more if you have access to further technical information, but it's not clear how this comment relates to your previous question about CG and deep learning. As it relates to HPGMG, I think my previous response covers the important performance dimensions. I'd be happy to discuss further over email.
Both logistic regression and deep learning are basically just big conjugate gradient minimizers.
What I meant by asynchronous is that not all terms in a gradient are required to be summed in the same step.
The transpose step in Sibyl is implemented in the Shuffle and Reduce phases. The filesystem is used to hold the temporary data. Nevertheless, even for large systems, very few steps are required, and step times are reasonable, even compared to modern supercomputers. This is a tribute primarily to the design of Sibyl and the implementation of MapReduce at Google.
This is all explained in online versions of the Sibyl presentation. I really wish more people from DOE who write modern solvers would pay attention to this stuff.
It depends; I meant broken if you plan to compute other types of problems. For instance, a big graph problem rather than one involving linear algebra. Then you would favor the benchmarks from http://www.graph500.org/.
But what if you want to optimize for programs that are communication intensive, or memory intensive?
Should the FLOPS of a very specific linear algebra suite be used as the metric of best computers?
The extrapolation on the Top500 supercomputer list [1] estimates the first EFlop/s computer in 2019. The math in the article is weird: they say 20x faster, but 20 x 33 PFlop/s is quite a bit less than 1 EFlop/s.
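The arithmetic behind that objection, as a quick sanity check (dates and the current rating are from this thread; the growth-rate figure follows from them):

```python
# 20x the current ~33.9 PFlop/s leader falls well short of an exaflop,
# and hitting 1 EFlop/s by 2019 starting from 2015 would require
# roughly 2.3x growth per year.

current = 33.8627e15          # flop/s of the current Top500 leader
target = 1e18                 # 1 EFlop/s
print(f"20x current = {20 * current / 1e18:.3f} EFlop/s")  # < 1.0

years = 2019 - 2015
annual = (target / current) ** (1 / years)
print(f"needed annual growth: {annual:.2f}x")
```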
The extrapolation in the linked graph pretty clearly doesn't track increases after 2012 correctly. Tracking the past 3-4 years puts 2019 at around 600PF.
It could very well be that there are diminishing returns involved, but I agree: they should be aiming to surpass the current tech by at least 100x in the next ten years.
The current trend [1] suggests that exascale is not on that kind of trajectory. The dominance of non-US countries (especially China) in the Top500 rankings is as much a driver here.
1. The top 500 list is broken, as people who are serious about real world applications give essentially zero shits about Linpack.
2. The dominance of China is achieved through the use of Xeon Phi accelerators, which may be great for Linpack but have not made much of a splash for applications yet. GPUs are solidly beating Intel's accelerator offering on both adoption and performance.
Computer hardware innovation is a textbook example of diminishing returns. With each improvement in processor performance, size, energy usage, and heat management, it becomes more expensive to push the tech further. We're currently witnessing this effect in action with the recent stagnation in consumer processor speeds: they are still getting better in size, energy, and heat management, but average clock speeds have hovered around 2.5 GHz for years now.
I think this hasn't advanced more simply for economic reasons.
I have read over the last 10 years that processing is much cheaper when done by networks of computers and clusters, instead of an expensive supercomputer that also demands an appropriate building and infrastructure.
Modern supercomputers are essentially clusters, but with much more advanced network topologies and technologies, shared storage, etc. It's not just one monolithic machine.
However, the types of computation performed by the top supercomputers are rarely the "embarrassingly parallel" programs you can easily distribute via an @Home-style program or something like Hadoop. They depend heavily on very reliable, very low latency, high bandwidth networks.