Hacker News
Supercomputers: Obama orders world's fastest computer (bbc.co.uk)
163 points by m-i-l on July 30, 2015 | hide | past | favorite | 174 comments


10 years ago the fastest supercomputer was BlueGene/L which was rated at 136.8 TFlop/s. The current fastest supercomputer is rated at 33,862.7 TFlop/s, or 247 times faster.

It seems to me that the aim of taking 10 years to build a supercomputer that is only 20 times faster than the current one might fall a little short if it's aiming to take the top spot.


This isn't only about the FLOPS; the big trend among the countries* ordering new supercomputers for 2020/2025 is a strong focus on power. Current supercomputers consume a lot of it.

Also, the FLOPS measurement is a bit broken: it focuses on a dense linear algebra problem, for which GPUs and other accelerators boost the results easily. If all you plan to do is run simulations that are easily parallelized on GPUs, that's fine; for other types of programs it is hard to tell which is the fastest supercomputer.

* France is also ordering a would-be top 10 supercomputer: http://www.hpcwire.com/off-the-wire/the-cea-agency-and-atos-...


Power is over-emphasized in HPC circles. According to FOIA reports (and you can back out similar figures from public budget information), less than 10% of the budget goes to energy. Power is frequently used as an excuse to build machines that are inappropriate for the science (not just "hard to program", but inappropriate in the sense that even with infinite programming effort, they deliver less scientific value than a more conventional architecture). There is some value in making scientists uncomfortable so that they think of creative algorithmic solutions that may pay off as the inevitabilities of semiconductor physics become more apparent. But seeing the number of applications that are within small constant factors of proven barriers, and the willingness to compromise quality of solution and/or run scientifically irrelevant configurations to demonstrate "speedup", I think it has gone too far.


> Also the FLOPS measurement is a bit broken

Does anybody have a better measurement?

Perhaps it could be the size of the matrix that can be inverted on it in an hour of time, with IEEE double precision floats, using some standard algorithm.
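The proposed metric can be sketched directly. This is a hypothetical illustration (the function names and the doubling search are mine, and the one-hour budget is shortened for testing), using NumPy's LAPACK-backed solver as the "standard algorithm" in IEEE double precision:

```python
import time
import numpy as np

def time_solve(n, seed=0):
    """Time a dense double-precision solve (LU factorization) of size n."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    np.linalg.solve(a, b)  # LAPACK dgesv: factor + solve in IEEE doubles
    return time.perf_counter() - t0

def benchmark(budget_seconds=1.0):
    """Double n until a solve exceeds the time budget; the largest n
    that fits is the machine's 'score' under this metric.
    For the original proposal, budget_seconds would be 3600."""
    n = 128
    while time_solve(2 * n) < budget_seconds:
        n *= 2
    return n
```

The single number this yields is still dominated by dense-linear-algebra throughput, which is exactly the objection raised upthread.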


The "High Performance Conjugate Gradients" benchmark was proposed a couple years ago as an alternative metric for ranking supercomputers. Its proponents claim its behavior is more similar to real applications (irregular access patterns, lower ratio of computation to memory access, etc), compared to linear algebra problems like the "High Performance Linpack" benchmark currently used by the Top500.

The different performance numbers for top systems on HPCG vs HPL are pretty striking: http://www.hpcg-benchmark.org/custom/index.html?lid=155&slid...

Original proposal to use HPCG as an alternative to HPL for supercomputer rankings: http://www.sandia.gov/~maherou/docs/HPCG-Benchmark.pdf


HPCG basically measures STREAM and has many technical flaws that make it scale-dependent and difficult to adjudicate. As co-developer of a different benchmark, I'll just cite this paper from a third party: https://hpgmg.org/static/MarjanovicGraciaGlass-PerformanceMo...

The reality is that there are many dimensions to supercomputing performance and it's impossible for one number to capture the utility of the machine. Our HPGMG benchmark (https://hpgmg.org) attempts to strike a balance and give useful supplementary information. I do think it's better than any other single benchmark for evaluating today's machines and will also prove to be more durable over time.


How would you use a benchmark like this to predict the performance of a well-designed asynchronous parallel conjugate gradient solver, like most modern deep learning neural networks that run on Internet HPC machines?


CG isn't truly asynchronous due to its reductions. It can be pipelined in various ways (we have several implementations in PETSc), but performance requires a quality implementation of asynchronous reduction (e.g., MPI_Iallreduce) which the vendors have been slow about developing (I've been working with some on fixing this and Cray has made recent progress).
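The synchronization points are easy to see in a textbook CG. This single-node NumPy sketch (not one of PETSc's pipelined variants) just marks where the global reductions would fall in a distributed run:

```python
import numpy as np

def cg(A, b, tol=1e-10, maxit=200):
    """Textbook conjugate gradients; the dot products each iteration
    are the global reductions that serialize a distributed CG."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r          # reduction: on a cluster this is an MPI_Allreduce
    for _ in range(maxit):
        Ap = A @ p      # matrix-vector product: neighbor or all-to-all communication
        alpha = rr / (p @ Ap)   # second reduction: every rank must wait here
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r  # first reduction of the next iteration
        if np.sqrt(rr_new) < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x
```

Pipelined variants reorder these steps so the reduction can overlap the matrix-vector product, which is where an asynchronous MPI_Iallreduce becomes essential.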

With respect to deep learning and other applications using CG or related algorithms, the bottlenecks depend on the scale, the ability to expose locality, and the operator/preconditioner representation. If there is no locality, then matrix-vector products require all-to-all communication, which tends to dwarf the cost of the reductions in CG. Even with locality in the matrix-vector product, preconditioners often need to communicate globally in a scalable way, similar to HPGMG.

Operators need not be represented as a table of numbers or a sparse matrix format, but could use a tensor product, fast transform, or other information to compute the action using less storage. If they are represented explicitly (sparse or dense), then matrix-vector product performance (thus CG as a whole) is dominated by memory bandwidth for problem sizes that do not fit in cache. HPGMG tries to strike a balance between memory bandwidth demands and compute using a matrix-free representation.

HPGMG also reports dynamic range, expressed as performance versus time-to-solution as the problem size is varied, which allows applications to see performance barriers that might be relevant to them (e.g., see how Titan cannot do a solve in less than 200 ms while Edison can do 50 ms, and how that relates to climate simulation performance targets; see slide 7 of https://jedbrown.org/files/20150624-Versatility.pdf).


Is it possible to calculate the theoretical performance of a cluster under HPGMG and then do a practical run and come up with an efficiency number like in HPL?

One of the biggest reasons for the use of HPL is that many sizing considerations can be based on the theoretical calculations.

But anyway this is very interesting. I definitely need to check this out.


HPL has an abundance of flops at all scales (N^{1.5} flops on N data), so one can expect a decent fraction of peak flop/s on any architecture with enough memory and adequate cache performance. This is a problem because architectural tricks like doubling the vector registers without commensurate improvements in bandwidth, cache sizes, load/store/gather/scatter produce huge (nearly 2x) benefit for HPL and little or no benefit to a large fraction of real applications.
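The N^{1.5}-flops-on-N-data claim can be put in back-of-envelope form: LU factorization of an n-by-n matrix does roughly (2/3)n^3 flops on n^2 stored values, so arithmetic intensity grows linearly with n. A small illustration (assuming 8-byte doubles; the sample sizes are mine):

```python
def hpl_intensity(n):
    """Arithmetic intensity (flops per byte) of HPL's LU factorization
    for an n-by-n matrix of 8-byte doubles: (2/3 n^3) / (8 n^2) = n/12."""
    data_words = n * n               # N data items
    flops = (2.0 / 3.0) * n ** 3     # ~ (2/3) * N**1.5
    return flops / (8 * data_words)

# Intensity grows linearly with n, so large HPL runs are compute-bound
# on essentially any machine with adequate memory and caches.
for n in (1_000, 10_000, 100_000):
    print(n, round(hpl_intensity(n), 1))
```

Most real applications sit at a low, fixed intensity, which is why doubling vector registers without touching bandwidth helps HPL far more than it helps them.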

HPGMG is representative of most structure-exploiting algorithms in that it does not have this abundance of flops, thus theoretical performance is actively constrained by both memory bandwidth and flop/s. We see many active constraints in practice; e.g., improving any of peak flop/s, memory bandwidth, network latency, or network bandwidth produces a tangible improvement in HPGMG performance. Depending on the fidelity of the performance model, these dimensions can be a fairly accurate predictor of performance, but ILP, compiler quality, on-node synchronization latency, cache sizes, and similar factors also matter (more for HPGMG-FE than HPGMG-FV).

I think it is actually quite undesirable for benchmark performance to be trivially computed from one parameter in machine provisioning. No computing center has a mission statement asking for a place on a benchmark ranking list (like Top500). Instead, they have a scientific or engineering mandate. Press releases tend to overemphasize the ranking and I think it is harmful to the science any time the benchmark takes precedence over the expected scientific workload. HPGMG is intended to be representative in the sense that if you build an "HPGMG Machine", you'll get a balanced, versatile machine that scientists and engineers in most disciplines will be happy with. I'd still rather the centers focus on their workload instead of HPGMG.


You might want to read up on Sibyl.


The information I've seen about Sibyl (the Google ML system, not the genomics package (http://sybil.sourceforge.net/documentation.html)) says it is basically doing logistic regression using a parallel algorithm (Collins, Schapire, Singer) with a transpose on each iteration. Without knowing more about the problem sizes and data sparsity/irregularity, I expect the transpose to be a significant expense. I'd be happy to read more if you have access to further technical information, but it's not clear how this comment relates to your previous question about CG and deep learning. As it relates to HPGMG, I think my previous response covers the important performance dimensions. I'd be happy to discuss further over email.


Both logistic regression and deep learning are basically just big conjugate gradient minimizers.

What I meant by asynchronous is that not all terms in a gradient are required to be summed in the same step.

The transpose step in Sibyl is implemented in the Shuffle and Reduce phases. The filesystem is used to hold the temporary data. Nevertheless, even for large systems, very few steps are required, and step times are reasonable, even compared to modern supercomputers. This is a tribute primarily to the design of Sibyl and the implementation of MapReduce at Google.

This is all explained in online versions of the Sibyl presentation. I really wish more people from DOE who write modern solvers would pay attention to this stuff.


Interesting! I'll have to check this out, thanks!


It isn't that different: 7 of the top 10 of the Top500 are in the top 10 of this HPCG benchmark. Most barely changed positions.


It depends; I meant it's broken if you plan to compute other types of problems, for instance some big graph problem instead of one involving linear algebra. Then you would favor the benchmarks from http://www.graph500.org/.

But what if you want to optimize for programs that are communication intensive, or memory intensive?

Should the FLOPS of a very specific linear algebra suite be used as the metric of best computers?


The extrapolation on the Top500 supercomputer list [1] estimates the first EFlop computer in 2019. The math in the article is weird: they say 20x faster, but 20 x 33 PFlops is quite a bit less than 1 EFlop.

[1] http://www.top500.org/statistics/perfdevel/


The extrapolation in the linked graph pretty clearly doesn't track increases after 2012 correctly. Tracking the past 3-4 years puts 2019 at around 600PF.


See my comment above. Has all the numbers you might need.


It could very well be that there are diminishing returns involved, but I agree: they should be aiming to surpass the current tech by at least 100x in the next ten years.


The current trend [1] would suggest that exascale is not on that kind of trajectory. The dominance of non-US countries (especially China) in the Top500 rankings is as much a driver here.

1. http://www.theplatform.net/2015/07/13/top-500-supercomputer-...


1. The top 500 list is broken, as people who are serious about real world applications give essentially zero shits about Linpack.

2. The dominance of China is achieved through the use of Xeon Phi accelerators, which may be great for Linpack, but have not made much of a splash for applications yet. GPUs are solidly beating Intel's accelerator offering in both adoption and performance.


Doubling time: 10 years * log(2)/log(247) ~ 15 months

A 20x improvement ought to take about five and a half years.
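The arithmetic behind both figures, using the TFlop/s numbers quoted at the top of the thread:

```python
import math

years = 10
speedup = 33862.7 / 136.8   # ~247x over the decade (Top500 figures upthread)

# 247x in 10 years -> doubling time of 10 * log(2)/log(247) years
doubling_months = years * 12 * math.log(2) / math.log(speedup)

# A further 20x at the same rate takes log2(20) doublings
years_for_20x = math.log(20) / math.log(2) * doubling_months / 12

print(f"doubling time: {doubling_months:.1f} months")   # ~15 months
print(f"20x takes: {years_for_20x:.1f} years")          # ~5.4 years
```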


Computer hardware innovation is a textbook example of diminishing returns. With each improvement in processor performance, size, energy usage, and heat management, it becomes more expensive to push the tech further. We're currently witnessing this effect in the recent stagnation of consumer processor speeds. They are still getting better in size, energy, and heat management, but average speeds have hovered around 2.5 GHz for years now.


I think this hasn't advanced more simply because of economic reasons.

I have read over the last 10 years that processing is a lot cheaper when done by a network of computers and clusters, instead of an expensive supercomputer that also demands an appropriate building and infrastructure.


Modern supercomputers are essentially clusters, but with much more advanced network topologies and technologies, shared storage, etc. It's not just one monolithic machine.

However, the types of computation performed by the top supercomputers are rarely the "embarrassingly parallel" programs you can easily distribute via an @Home-style program, or something like Hadoop. They do depend heavily on very reliable, very low latency, high bandwidth networks.


Then they can just "order" another one. We're talking government here. Declaring faster computers by government fiat is already a dumb idea.


A bit more informative is the actual fact sheet put out by the White House [1]. What they are really aiming for is exascale computing, which they define as being capable of applying exaflops to exabytes. From my limited knowledge, the latter will actually be the bigger deal. As pointed out elsewhere, an exaflop supercomputer will probably come around beforehand.

[1] https://www.whitehouse.gov/sites/default/files/microsites/os...


The real problem is getting 1 exaflop (or around it) within a reasonable power budget. The DOE's power budget for all of their supercomputing resources is 20 Megawatts, so at a full system level we would need to be at 50 GFLOPs per watt, while the best system right now is at 5.
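The 50 GFLOPs/W figure falls straight out of the quoted envelope:

```python
exaflop = 1e18            # target: 1 exaflop/s
power_budget_w = 20e6     # DOE's 20 MW power envelope, per the comment

required_gflops_per_watt = exaflop / power_budget_w / 1e9
current_best = 5.0        # rough best-system figure from the comment

print(required_gflops_per_watt)                   # 50.0
print(required_gflops_per_watt / current_best)    # 10.0 -> a 10x efficiency gap
```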


Nvidia's Pascal GPU, due next year, is supposed to do 28 teraflops at 1300 watts, which I make 21 GFLOPs per watt.


That's single precision, which the DOE doesn't care about. The latest NVIDIA GPUs do 8 to 16 times better at single precision (32 bit) floating point than at double precision (64 bit).

The best next generation DP GFLOPs/watt from one of the big players will most likely be the 2016 Xeon Phi, at ~10-12GFLOPs/watt... You are also forgetting that GPUs also have a ~100W+ CPU sitting next to it, which brings down total efficiency significantly.
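A rough illustration of the host-CPU point, reusing the (single-precision) Pascal figures quoted upthread plus an assumed 100 W host CPU; the numbers are the thread's, the function is mine:

```python
def system_gflops_per_watt(gpu_gflops, gpu_watts, host_watts=100.0):
    """Efficiency once a host CPU's power draw is charged to the accelerator."""
    return gpu_gflops / (gpu_watts + host_watts)

# Card-only vs system-level efficiency:
# 28 TFLOPS at 1300 W (single precision), plus an assumed 100 W host CPU.
card_only = 28_000 / 1300
with_host = system_gflops_per_watt(28_000, 1300)
print(round(card_only, 1), round(with_host, 1))   # 21.5 20.0
```

For accelerators whose card-level power is a smaller share of the node, the host-CPU penalty is proportionally larger.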

Shameless self promotion: My startup (http://rexcomputing.com) is aiming for 64 double precision GFLOPs/watt, and 128 GFLOPs/watt single precision, for its first chip next year.


Your chip looks cool. I guess it may be tricky to adapt software to run on the thing? Or else you could try to sell Obama 4 million of them for his new computer.


20+ GFLOPS/W for a single part isn't something new: http://streamcomputing.eu/blog/2012-08-27/processors-that-ca... (2012)

A huge amount of overall system power is spent in data transport. Plus, double everything for cooling. That brings the total system efficiency way down from what the actual computational components spend.


Not 20 real (64b) GFLOPs/watt. And no, not much power is used on interconnect, and certainly not a PUE of 2.0.


> "I'd say they're targeting around 60 megawatts, I can't imagine they'll get below that," [Mark Parsons] commented. "That's at least £60m a year just on your electricity bill."


I wouldn't trust a Brit to report the figures for the US Department of Energy. The DOE's power budget is 20 megawatts [source: http://science.energy.gov/~/media/ascr/ascac/pdf/meetings/20...]


He didn't report it, he gave an estimate. And as for "wouldn't trust a Brit", I'd trust anyone directly involved in it, regardless of nationality.


Clearly designed to run AI for cyber-warfare / cyber-defense purposes? William Gibson's Black Ice coming to life?


More likely it's for modeling nuclear weapons, which I believe is the primary purpose of the NSCI.


The story linked below about the French exascale plans explicitly states that it is the military applications division of the French nuclear energy agency (CEA) that is spearheading that effort. (I believe they work on both reactors and warheads.)


NNSA*, which is an office within the Department of Energy.


National Strategic Computing Initiative (NSCI) is the initiative within the DOE.


Serious question: what's left to model? They're already big enough to kill everyone, and we have heaps of them.


Often they model how the warhead ages, rather than how it explodes.


Yes, this is my understanding.


It looks like they are explicitly saying they want to make a machine that works for both types of HPC -- classic low-latency high-bandwidth internode communication (physics simulations) and modern Internet-driven high-bandwidth storage/node communications.

This is because the supercomputer community has long ignored the Internet-style of computation (MapReduce etc). But most of the new generation of scientists are adapting their codes to this new style, because dollar-for-dollar they can get more throughput than the classic style machines. Classic machines invest heavily in low-latency communication and typically require APIs like MPI to achieve it, while Internet HPC just uses well-designed TCP-based socket communications.

Building dual-design systems like this, especially when the community has little or no skill at building next-generation Internet HPC systems, is likely to produce a system that is good at few things.

Instead, build two systems. One is the largest (but not necessarily exaflop) classic supercomputer you can afford. Then, for the second, hire some datacenter designers from Google/Facebook and have them build a modern HPC cloud design.

The biologists will flock to the second one; they have long been underserved by the DOE supercomputing community.


It is important not to conflate "massively parallel" (HPC) and "massively distributed" (Internet-scale); they have different architectural requirements and solve different classes of problem. People with competency in either of these areas tend to overestimate their understanding of the other, but they are not solving the same computer science problems even though they look similar on the surface.

Massively distributed systems do not get much benefit from low-latency interconnects. Massively parallel systems do, and in particular, it is a "throw hardware at the problem" kind of solution that helps cover for the fact that virtually no software designers can engineer efficient, non-trivial, massively parallel systems. MapReduce is a distributed model; outside of some trivial cases, it is a poor parallel model. And while the HPC community has a much better understanding of massive parallelism than the Internet-scale systems community, the HPC community largely doesn't grok massively distributed systems in the way that someone working on Google's infrastructure would.

I benefited from having spent several years designing software for both HPC and Internet-scale systems. They are not fungible, and both communities grok things that the other is oblivious to. Even within the HPC community, though, the number of people skilled at the design of massively parallel software systems is quite small, much smaller than the number of people who know massively distributed systems.

You do not need two systems, you need one system and more people that have figured out how to design massively parallel software -- the real problem. It is difficult to overstate just how rare this skill is even within the HPC community.


Agreed. I wrote my thoughts on it going distributed here:

https://news.ycombinator.com/item?id=9976957


I don't really agree with your premise.


> especially when the community has little or no skill at building NG Internet HPC systems

I would argue that the community of people who actually have the skills to take advantage of the interconnects in a classic HPC system is vanishingly small, and in consequence we've overbuilt them on an epic scale.

Allow me to vent. I had the good fortune to have a login on a "petascale" HPC system, and access to an allocation of hours.

The /scratch filesystem would fail weekly, which killed everybody's jobs. If you had a big run going when /scratch failed, you lost everything. Scratch failed so much because the models that were being used often did wildly inappropriate amounts of file IO --- debugging print statements, detailed intermediate calculations, excessively verbose output --- that worked all right in development but when run in parallel brought the filesystem to its knees.

Furthermore, the login nodes were almost unusably slow because of all the Python and Perl post-processing scripts running on them. This isn't even a matter of users being cheap with their hours --- post-processing would have been a tiny fraction of their allocations. Instead, it's that many of them gave no thought at all to how the post-processing might be structured and run through the batch scheduler, and saw no downside to abusing the login nodes for that purpose.

In conclusion, I can attest to at least one HPC system that was badly mismatched to its users' needs and level of sophistication, despite allocations of hours being awarded only to a small number of researchers from across the country through a highly competitive process. Building these things serves national and institutional pride far more than any utilitarian interest.


You're describing an exceptionally poorly built and used system. That said, it's not inconsistent with what I've seen as well.

My claim is that the design of classic interconnects is a big waste of money, because only a few codes need it, yet the interconnect dominates the cost of the cluster (>50%). I've learned, from years of studying Google's papers, that there are better ways to build code that communicates, and those mechanisms are much easier to teach to scientists and computer scientists than MPI.


That's true for many purposes, but is it true for physical simulations? Don't they need communication on every time step?


I would say no.

Here is my argument: when I worked for DOE, everybody told me I had to run my MD simulations on a supercomputer using all the processors, and I would be judged on my parallel efficiency. This meant using a code that used MPI to communicate at every (or every N) timesteps. I asked, instead, "Why not just run N independent simulations, and pool the results?" In this case, you run an M-thread simulation on each machine (where M = number of cores on the machine) with no internode communication at all, except to read input files and write output files.
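The "run N independent simulations and pool the results" pattern needs no interconnect at all. A toy sketch, with a seeded random walk standing in for an MD replica (all names are hypothetical):

```python
import random
from multiprocessing import Pool

def run_replica(seed, steps=10_000):
    """Stand-in for one independent MD replica: a seeded 1-D random walk
    whose squared end-to-end displacement we want to estimate."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        x += rng.choice((-1.0, 1.0))
    return x * x

def pooled_estimate(n_replicas=8):
    """Each replica runs with no inter-process communication at all;
    the only 'reduction' is averaging the outputs at the end."""
    with Pool() as pool:
        results = pool.map(run_replica, range(n_replicas))
    return sum(results) / len(results)
```

On a cluster, `Pool` would be replaced by a batch scheduler submitting one job per node; the structure is identical.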

The short answer is, that approach works just fine, but the DOE supercomputer people won't let you run embarrassingly parallel codes because they already spent money on the interconnect to run tightly coupled codes.

In response to this, I went to Google, built Exacycle (loosely coupled HPC) and published this well-cited paper: http://www.ncbi.nlm.nih.gov/pubmed/24345941 which in my opinion put the last nail in the coffin of DOE-style physics simulations for molecular dynamics.

That said, there are systems which are so large you can't practically simulate a single instance of the system on a single machine, so you have to partition. Simulating the ribosome is a nice example. However, simulating the ribosome currently provides no valuable scientific data, except to tell us that we have major problems with our simulation systems (force field errors, missing QM, electrostatic approximations, etc.).


Interesting! Would it be accurate to say that as the amount of computing power and memory per CPU has increased over the years, so also has the percentage of scientific problems where a single simulation instance will fit on a single CPU? Certainly if you can do so, it's more efficient (in both machine and human resources) to partition by one job per CPU.


Yes. For example, when I did my PhD work (~2001) with a T3E, I could run a simulation of a duplex DNA in a box of water by running it in parallel. This was true for both memory and CPU reasons. It limited me to studying a single sequence at a time, or 2-3, which was the practical limit on the number of concurrent jobs. This used the well-balanced design of the T3E, which had a great MPI system.

Eventually it reached the point (~2007) where I could fit the whole simulation on a single 4-core Intel box with similar performance. Then, I ran one "task" per machine, and scaled to the number of available machines. This uses only intra-node communication, which goes over a hub or crossbar on the motherboard. Much faster.

Now, I can fit many copies of DNA on a single machine (one task per core). This is far and away the best, because each processor just accesses its own memory, greatly reducing motherboard traffic, so the problem is basically CPU-bound instead of communication bound (this also now applies to GPUs, such that single GPUs can run one large simulation within its own RAM and not have to spill data back and forth over the CPU/GPU communication path).

This moves the challenge to the IO subsystem: I generate so much simulation data that I need a fat MapReduce cluster to analyze the trajectories.


None of this is news - what you're describing is really just strong scaling. And sure, most systems already have subsets of nodes set aside for post-simulation cleanup.


Here is the news: the Jupiter paper is now published. http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183....

I'm not just describing strong scaling. I'm describing a cost-effective way to achieve it; that's what really matters.

Why have subsets of nodes for post-simulation cleanup? Why not just run that cleanup on the same nodes you used for simulation? Or other general nodes? Otherwise, you've got two sets of nodes which are used at lower utilization than they would normally be.


I know some people in the life sciences who were strongly encouraged to get Titan time. When they applied and presented ORNL with their embarrassingly parallel code, they were told to go away.


Yes, precisely my point. If I wanted to run BLAST by partitioning it to run embarrassingly parallel, they wanted me to use mpiBLAST - but mpiBLAST isn't actually any better for any real-world workload.

This is because 50% of the cost of the machine was the interconnect, and if they let those codes run, it means they wasted budget and will get less next time.

Until I hear that the funders/builders are spending the same amount of budget on machines that let biologists run embarrassingly parallel codes as they spend on TOP500 machines, it's not going to change.


Monte Carlo-type simulations are fine in their niche. It's just asinine to claim you can do all science that way.


I would argue that many of the failures you describe are due to a lack of libraries that can exploit these systems. You can't expect every biologist to be an expert in distributed computing. The HPC and algorithms communities failed to consider the users of these systems and happily tinkered in their academic niches. All the while, what was actually needed were easy to use libraries that allowed non-experts to benefit from the advances made. (Also, the algorithms community has lost interest in distributed memory computing).

The abuse of /scratch and the login nodes sounds like a classical mixture of not knowing, not caring, and limited time. That isn't something that has a technical solution.


I completely agree. My point is, the workload was lots of jobs using 100 nodes or less and doing massive amounts of file IO, while the cluster was essentially designed to run very large physics simulations, one at a time. This was a cluster that didn't need to exist.


The problem you were seeing wasn't the supercomputer per se, it was the sheer awfulness of Lustre.


Ah, a Cray. Buck up; there are better supercomputers.


It was, in fact. How did you infer that?


/scratch on a Cray (last I used one) is a Lustre filesystem. It is the only file storage available to the compute nodes, which have no local storage of their own. No spinning disk, no SSD. So if the code is written such that it makes frequent small writes (e.g. it's peppered with print statements), the Lustre nodes get hammered by all the compute nodes, become the bottleneck, and will eventually fall over.


Interesting! This was exactly the case on the system I used. I didn't realize Cray was the only vendor who went the no-local-storage route.


They're not. They're just the only ones who insist on doing it with a filesystem that can't handle the load.

The other problem is that their interconnect is relatively fragile. It's comparatively easy to crash the entire network, at which time your filesystem goes away and processing stops.

But thanks to Lustre, even when it's good, it's bad.


And it's not like there isn't precedent. I remember a couple years ago when Google showed off gene mapping at I/O (Urs' demo to visualize how easy it was to scale GCP From 1 to 10 to 1000 to 100000 cores), and now they've partnered with the Broad Institute to apply this more generally.

http://www.broadinstitute.org/google


I was the person on stage with Urs who ran the demo (I'm a computational biologist who went to Google to help them build this kind of infrastructure, because DOE wasn't doing it). Read more about the demo here: https://cloud.google.com/compute/io

I also helped create the original idea for the Broad collaboration. It was pretty obvious that standard HPC was a waste of money and not designed for high throughput biology, while Google published numerous papers (MapReduce, GFS, Bigtable) that demonstrated they were building infrastructure that was perfect for a wide range of computational biology programs.


One of the largest challenges in building an exascale cluster is communication. Computing power increases at a higher rate than memory throughput does, and memory throughput increases faster than communication infrastructure advances.

Many argue that an exascale computer can only be cost efficient if the communication capabilities scale highly sublinearly with the computation done in the subsystems [1]. In particular, you can't move the data, and new algorithms are needed that can deal with data that is arbitrarily distributed. This is quite challenging and unfortunately the theoretical computer science community seems to have decided that distributed memory algorithms have been covered since the 90s and are not worth their time. Yet they ignore the progress that has been made in other models of computation since, and many algorithmic improvements of the last decades are not applicable. It is high time to develop communication-efficient algorithms for the basic "toolbox".

I guess what I'm trying to say is that you can't just throw MapReduce at an Exascale machine and expect it to perform well. Instead, you need an environment that is rich in primitives that have been implemented in a communication-efficient way. It's faster and cheaper to spend a little more effort on local communication if that allows for reduced communication volume (and/or the number of connections that need to be established!).

The issue I have with the MapReduce approach is that it doesn't particularly care about data locality. Thus it is very hard to achieve communication volume sublinear in the input size, which is absolutely deadly in an exascale setting.
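A back-of-envelope example of why locality-oblivious shuffles hurt at scale: for a 3-D grid partitioned into P cubic blocks, a halo exchange moves only surface data, O((N/P)^{2/3}) words per process, while a shuffle moves the whole O(N/P) block. (The grid size and process counts below are my own illustrative choices.)

```python
def halo_words_per_process(n_total, p):
    """Words sent by one process in a 3-D halo exchange:
    the 6 faces of a cubic block holding n_total/p cells."""
    side = (n_total / p) ** (1.0 / 3.0)
    return 6 * side ** 2

def shuffle_words_per_process(n_total, p):
    """Locality-oblivious shuffle: every word leaves the node."""
    return n_total / p

n = 10**12  # a trillion-cell grid
for p in (10**3, 10**4, 10**5):
    halo = halo_words_per_process(n, p)
    shuffle = shuffle_words_per_process(n, p)
    print(p, f"{shuffle / halo:.0f}x more traffic without locality")
```

Note the gap shrinks as blocks get smaller (surface-to-volume ratio worsens), which is exactly the regime where exascale machines will operate; hence the need for communication-sublinear algorithms rather than just fatter networks.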

I also understand the frustration with MPI, it is a very low-level API focused on data movement. It can be rather frustrating to use, but there do exist tools to make it more fun (Boost.MPI with C++11/14 is an excellent example). That said, with a well-engineered set of algorithmic tools, ideally you wouldn't need to use low-level MPI calls at all. However, MPI still remains a useful tool to implement these things.

Exascale computing requires us to rethink a lot of things.

[1] http://www.ipdps.org/ipdps2013/SBorkar_IPDPS_May_2013.pdf Shekhar Borkar (Intel), Keynote presentation at the 2013 IEEE International Parallel & Distributed Processing Symposium


You are incorrect in saying MapReduce isn't locality-aware. Hadoop supports machine, rack, row, and cluster locality scheduling.

Also, most modern Internet HPC systems dedicate a ton of design and equipment to having very high cross-sectional bandwidth, which enables the locality restrictions to be relaxed.


Well my point exactly. That "ton of design and equipment" doesn't scale particularly well, as its cost grows highly super-linearly with the computing power. You need to reduce communication volume to be cost effective at exascale.


This isn't true. You can build awesome high bandwidth clusters for extremely cheap. It takes an understanding of ethernet silicon and TCP implementations, but it can be done. Amazon for example recognized that superlinear cost scaling was killing their profits, and invested in building newer systems with better designs that solve these problems.

See also this paper http://research.google.com/pubs/pub36740.html

The main challenge is that because these are built with multistage routers, they have fairly high latency. So much of the effort in modern HPC systems used for Hadoopy workloads goes to latency hiding.


You say "awesome high bandwidth" but at 1 Gbit/s per node you're still a long way from an InfiniBand 4X FDR Interconnect (54 Gbit/s and sub-microsecond latency, significantly lower than your network ). As you write, these are built with multistage routers, which add even more latency. So in effect they have reduced (but still high) communication capabilities to keep costs manageable, just as I said.


1 Gbit/sec, if you look at the Jupiter paper, was the host speed in 2004. The Jupiter system works with 10G and 40G interfaces on the host.

What's important to recognize is that you simply cannot buy InfiniBand switches that let you connect a lot of hosts (10K+) together. The vendors won't sell you this, they won't do the R&D to make it, and it would cost a fortune anyway.

This is a deliberate choice: for most Internet work, it's better to have really fat bisection bandwidth and non-blocking fabrics, and latency is ignored due to the high cost of building a crossbar that supports that with high radix.

Unless you have an algorithm that absolutely requires low latency and simply cannot be fixed, you are almost always better off building a cheaper, fatter fabric and hiring engineers who know how to write latency-tolerant applications.


Here we go, the Jupiter paper is now published: http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183....


> The supercomputer would be 20 times quicker than the current leading machine, which is in China.

So given Moore's Law, by the time it's finished in 2025, it will be 50 times slower than 2025's fastest?

Yes, yes, Moore's Law is slowing, transistors on a chip =/= flops, etc. Still seems like they'd want to aim higher than 20x in 10 years.


You're assuming the goal is to build the fastest supercomputer in the world. I think the goal is probably closer to "get the computing resources we need at the lowest cost".


Well, the title literally says "world's fastest computer."


Probably fluff added by the headline writer to sound more exciting. The White House press release doesn't make world's-fastest claims: https://www.whitehouse.gov/blog/2015/07/29/advancing-us-lead...


Can anyone provide back-of-the-napkin calculations on this proposed supercomputer's computing power vs Google's compute farm?


You could compare raw FLOPS (Floating point operations per second) but that would only tell part of the story. These supercomputers are highly engineered for low network latency between nodes, which is necessary for many scientific workloads. Google and other companies are generally able to express their algorithms in highly parallel ways, which means there are much reduced requirements for communication between nodes.

Therefore, even if the raw performance in terms of FLOPS sound similar, the two systems will have widely differing performance on real workloads.


Depends on what you mean by a "real" workload.

Capturing and indexing the entire web is certainly a real workload, even if it is massively parallelizable, so it would probably run equally well on Google's infrastructure as on a supercomputer because those fast interconnects wouldn't provide much advantage, right?

However, when simulating a nuclear explosion or a weather system (maybe that's what you mean by "real" workloads?), the heavy node-to-node communication makes the supercomputer much, much better suited.


This could be slightly misleading. Supercomputers tend to be used for different types of computations.


or the Bitcoin farmers?


Comparing Bitcoin miners to any supercomputer or server farm is meaningless at best and deceiving at worst. Current Bitcoin miners cannot do anything other than one specific calculation.


Don't we already have an exabyte-scale supercomputer in Utah run by the NSA?


Yeah, but everyone knows that thing's doing illegal stuff (illegal now or soon-to-be illegal), and Obama doesn't want that on his record.


Do you have a source for that information?


http://www.usnews.com/news/politics/articles/2015/07/27/nsa-...

The NSA's currently being sued over its metadata collection. I wrote that comment slightly tongue-in-cheek, but repurposing that supercomputer for civilian use would actually be a great way to recoup some of your losses.


Er, do we?



That seems to be more of a storage data centre. And if Wikipedia's figure of 60 megawatts is correct, it's well in the ballpark of a datacenter for Google or Facebook.


You may be interested in this facility: https://en.m.wikipedia.org/wiki/Multiprogram_Research_Facili...

Edit: I should also say it is amusing to hear Obama talking about exascale computing by 2025 when the NSA's goal (read the Wired article) is to get there by 2018.


I suppose that "supercomputers" are all multi-processor these days, so the colossal FLOP numbers are counted as an aggregation over many processors and one has to coordinate these processors in any application that takes advantage of the FLOP specs.

Now I am curious what is the fastest single processor?


I don't think knowing that is useful. "Single processors" are all superscalar or pipelined these days, so the colossal single-thread FLOP numbers are counted as an aggregation over many arithmetic units, and one has to coordinate these units (mainly by avoiding branch mispredictions) in any application that takes advantage of them.


It's still relevant. Some applications remain single threaded due to data dependencies.

Mechanical CAD geometry kernels are one such application. Recently, I had a use case that demanded the peak single threaded performance.

In PC land, it's this chip clocked at 4.x GHz: the Intel® Core™ i7-4790K. Its multi-core performance is pretty great too, so it's not that big of a trade-off to maximize single-thread performance.

I would be very interested in knowing what faster solutions exist. Are there any, regardless of instruction set?
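For scale, here's a back-of-the-envelope peak-FLOPS estimate for that chip, assuming Haswell's two AVX2 FMA units per core (these are theoretical peaks, not sustained rates):

```python
# Rough theoretical peak for an i7-4790K (Haswell): two AVX2 FMA ports
# per core, each retiring 4 double-precision multiply-adds per cycle.
flops_per_cycle = 2 * 4 * 2          # 2 FMA ports x 4 doubles x (mul+add)
ghz, cores = 4.4, 4                  # turbo clock, core count

single_thread_gflops = ghz * flops_per_cycle
all_core_gflops = single_thread_gflops * cores
print(round(single_thread_gflops, 1))   # ~70 GFLOP/s on one thread
print(round(all_core_gflops, 1))        # ~280 GFLOP/s across the chip
```

Real code only approaches this if it keeps both FMA units fed with vectorized, dependency-free work, which is exactly the coordination problem mentioned above.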



I guess it depends on how 'core' is defined. A single core of an Intel CPU can work on 2 threads at a time; the POWER8 can work on 8.

I assume that when most people think of 'single core' performance, they really mean 'single thread'. I think Intel's CPUs win out there, based on your linked benchmarks.


If you're memory-bandwidth limited, then yes. Otherwise you're better off with a decently clocked Intel. Anything that spends a portion of its time running out of L2 or better will be significantly faster on the Xeon.


Looks like somebody got to ~8.7 GHz by overclocking a ~4 GHz CPU.

http://hwbot.org/submission/%202615355

I happened to catch an overclocking competition being streamed on Twitch one late night many months ago. It was really interesting to see more about the methods and techniques involved and how the competitions work.


See my comment above for exascale info. The single chip performance will vary wildly. The important measurement is how many operations of useful work per second per watt the system will do. That's the gist of it I've learned from HPC people.

Exascale is power-hungry so power must go way down and efficiency of calculation way up.


Probably an intel xeon (or i7?) overclocked and cooled with liquid nitrogen. There may be a specialized processor with higher performance, but I doubt it.


I've been using an i7 2600K at 5.0 GHz for some time. Probably the best option for an ordinary human.


Interesting! I'm running this one: an Intel® Core™ i7-4790K at 4.x GHz (currently 4.2 for long-term big-problem stability).

What does your setup look like? Cooling, RAM, etc...?

And is it good for long term, like say crunch on it for a week type problems?


Water cooling from Antec; it fits into a regular desktop case. Under load it sounds like a hoover. Some cheap RAM, a motherboard for $200, a 500 W PSU. I use it for my IDE (the Scala compiler is slow).

No problems with stability; it has run over a weekend at full load a few times. Maximum temperature is about 90°C. I've set quite a high voltage. Over the summer I take it down to 4.8 GHz.


Nice! I'll need to pick up a new machine soon. Might give this a go to compare with my existing 4.3 GHz machine.


Press: "What's it for?" Obama: "Uuuh... NASA."


I find it odd that a president would sign an executive order for a new kind of computer (even if the new kind of computer is technically impressive). Call me cynical, but I'll bet that there's a number of quid-pro-quo arrangements with big party donors (or soon to be donors) -- regardless of which party is in office.


It's not that sinister, it's just PR. This project likely would have been in the budget, anyway, and the executive order is just a way to get the President's name behind it and his image in front of it.


I think he added an extra "A" there at the very last minute. :-)


Quite a large NAS, don't you think?


Nuh-SA


I'd really love to see speed measured by performance by a single collective computation of an O(n) or O(n log n) algorithm. This would emphasize the importance of balancing communication performance with computation. Not holding my breath, the LINPACK is strong with these people...


Reading over the list of priorities in the PDF linked from the Whitehouse blog, the one that I was the most pleased about was improving HPC productivity.


Let's hope this is not just bluster :(


IMHO if you need to break down a task well enough to run on a supercomputer, there isn't a lot more to do to make it run on a regular server farm.

edit: Actually, in the scenarios you'd use a supercomputer for, the added latency and overhead (shoddy servers, network, etc.) would most likely make the run time orders of magnitude higher.


Maybe for embarrassingly parallel tasks, but if you require nontrivial interprocess communication, a server farm can't compete with the interconnect of a modern supercomputer.


This is not remotely true for a wide range of codes that matter to the supercomputing industry.


And by "codes", you mean specific legacy software artifacts written in FORTRAN (note that I'm not even spelling it as Fortran)? Of course that's a problem.


No, that means specific problem domains that are not easily partitioned, and where latency or affinity are the primary performance constraints.

There are still some Fortran libraries in large-scale use for this sort of thing. They are still in use because they are very good, and replacing them would be very expensive for little gain.


Write in any language you want; that's irrelevant to the nature of the computations being done here.

By codes they mean -- at the minimum -- pretty much anything that requires frequent communication between any or all nodes as a necessary part of computation. (For example, simulations across a large 3D space, where the changing states of particles on node A directly impacts the states of particles on adjacent nodes.)


You can't write your code in any language you want on a supercomputer.

Also, there is a wide range of literature about communication patterns for supercomputer apps; my argument is that often times, to solve the problem that matters, you may not actually need to run the simulation you think you do. It's more that people are just used to running that way.

For example, with MD, you can run 1 sim parallelized over 100 machines using tightly coupled communication (doesn't necessarily mean the forces and positions of every particle have to be shared between node decompositions) or run 100 sims over 100 machines, with no communication except for input and output files. The latter can often answer the same question far more cheaply.


I'm somewhat confused -- I thought we were arguing similar points?

I don't want to drag this out, but where do you see the language constraint? You need an MPI binding, sure, but what else?


No, the people who run the clusters won't let you run any language just because it has an MPI binding. They invest a lot in ensuring peak performance, and right now, only C++ and FORTRAN can achieve that. Very few, if any, major supercomputer centers support Java codes.


Oh, you're talking about a policy limitation, not a technological one. (And if you're talking about the DOE or NSF/Teragrid/XSEDE clusters, then you're probably right. Haven't touched those in years -- and even when I did, I wasn't doing anything crazy.)


To be frank, if I were running a computer designed for peak performance, I probably wouldn't use Java. There are some very significant performance issues with garbage collection that prevent you from making peak use of the machine.

Supercomputers aren't built so that people can squander the resource (desktop PCs, closet clusters, and phones fulfill that role).


It is a technical limitation. Oftentimes the platform is so specialized that only a tiny handful of compilers are ported to it. Say, just gcc, g++, and gfortran, and xlc, xlC, and xlf. And just one version at that. Java would require porting the JVM to the cut-down, weird Linux on the compute nodes. Some $$$ machines don't even support dynamic linking! The number of these machines is so small that extensive compiler and tool support just isn't happening unless you want to add millions to the cost.


There is no problem running a JVM on cut-down linux nodes. A JVM is just a process.

Anyway, the issue with JVMs is that they don't have predictable performance, not that the compilers can't be ported.


The JVM probably calls fork() and system(), no? Not allowed. Dynamic thread creation? Not allowed. And 50% of your flops go away unless your program uses the BG/Q-specific "double hummer" floating-point instructions. These are primitive machines in terms of development environment, and they typically require significant rewriting to get even "standard" system software working.


Where to begin?

There are many existing valuable codes written in FORTRAN. They work, it's not worth the investment to replace them with something else.

Second, many of the codes are in C++, not FORTRAN. Not clear that's any less of a problem.


NSA director be making that smug frog face right about now


Too bad for him that he didn't order it earlier; maybe he could have figured out a way to stay for more than 8 years as President...


Step 1: Order exascale computer. Step 2: ??? Step 3: Profit.

U.S. and other countries have been in a race for exascale. The thing holding us back isn't funding or political will: exascale is so ridiculously hard that it requires fundamentally different architectures. The main issues are making our CPUs do more work, eliminating memory bottlenecks, and dramatically improving the energy efficiency of both. These are just very tough technical challenges that might also have to be solved on process nodes that are themselves tough.

Rex Computing is one attempt whose founder posts here a lot [except in one thread dedicated to it, lol]. I'm curious if any other exascale researchers read HN and can post their concepts, as it's probably interesting stuff. Here are some links for readers interested in this stuff.

LLNL gives data on exascale and its challenges https://asc.llnl.gov/content/assets/docs/exascale-white.pdf

Also describes problems but skip to Venray's TOMI approach http://www.edn.com/design/systems-design/4368705/The-future-...

Rex Computing's approach http://www.theplatform.net/2015/03/12/the-little-chip-that-c...

Intel's relatively conventional approach http://www.exascale-computing.eu/wp-content/uploads/2012/02/...

Architecture from Univ of Texas and NVIDIA https://www.cs.utexas.edu/users/skeckler/pubs/SC_2014_Exasca...

Boise exploring non-von-Neumann architectures with ParalleX http://cswarm.nd.edu/news-events/assets/PSAAP_II_Kick-off_CS...

Same group enlightens on details that all fight with http://sites.ieee.org/boise-cs/files/2015/04/Thomas-Sterling...

Bonus: 1,000 core, cache-coherent, optical interconnect. Sort of thing might be useful in exascale. http://dspace.mit.edu/openaccess-disseminate/1721.1/67490

Have fun with these. Submit a link if I left out any chip architecture in exascale race that's pretty cool.


Step 1: Nuclear test ban treaty. Step 2: Avoid nuclear test disasters. Step 3: Maintain military status quo.

Notice step 3 implies international stability and hence profit. NNSA stresses computation so heavily because stockpile stewardship cannot be done by noncomputational means.


Interesting point. Exascale is a lot more than that though: many stakeholders. And, even if none, it's still going to get funded as another international pissing contest (see Top 500). ;)


I wonder how many bitcoins can be mined with it :)


Why not just use commodity spot instances?


Haha.. AWS?


Sure, I'm asking sincerely.


One of the issues faced in super computing isn't just raw horsepower or more cpu's, it's latency. It's not enough to just connect a ton of machines via ethernet, you need specialized hardware to provide high-throughput sharing of data.
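A crude alpha-beta (latency + bandwidth) cost model makes the point; the latency and bandwidth figures below are rough illustrative assumptions, not benchmark numbers:

```python
# Alpha-beta model: transfer time = latency + bytes / bandwidth.
# The parameter values are rough order-of-magnitude assumptions.
def msg_time(nbytes, latency_s, bandwidth_Bps):
    return latency_s + nbytes / bandwidth_Bps

eth = dict(latency_s=100e-6, bandwidth_Bps=10e9 / 8)  # ~10GbE + TCP stack
ib  = dict(latency_s=1e-6,   bandwidth_Bps=54e9 / 8)  # ~4X FDR InfiniBand

# An 8-byte halo-exchange message is almost pure latency:
print(msg_time(8, **eth) / msg_time(8, **ib))       # huge gap
# A 100 MB bulk transfer is bandwidth-bound, so the gap shrinks:
print(msg_time(100e6, **eth) / msg_time(100e6, **ib))
```

That's why tightly coupled simulations, which exchange many small messages per timestep, need the specialized interconnect, while bulk-transfer workloads run tolerably on commodity networks.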


Thanks for the insight!


Thanks Obama (no sarcasm).


It kills me every time I read about how the world's nth-fastest computer is just used to simulate nuclear explosions, so it was a delight to see that they're planning on using this one for some good.


I don't know if this is what you're referring to, but a common application for government-owned supercomputers is simulating the degradation of nuclear warheads. The degradation of the fissile material as well as its surroundings is highly critical to a nation's security, and also very hard to model well.

Of course in an ideal world those cycles would be used to help cure cancer, but given that these warheads exist, it's probably a good idea to invest resources into getting an idea of what shape they're in.


The old way to do that was to blow one of them up every so often. The Nuclear Test Ban Treaty put a stop to that.


And it kills me every time someone makes a silly comment like this. Someone has to spend the large amount of money for the R&D. Your model of how the world works is flawed.

We went to the moon because of the Cold War. The military paid for the development of the Internet. GPS? Supersonic flight? Nuclear energy? Autonomous vehicles?

I wish private industry would do more. Every company should have a Bell Labs.


Are you extolling the virtues of a world that was shaped by the threat of nuclear annihilation for multiple generations?


No, militaries have been funding R&D since long before the nuclear age.

http://airandspace.si.edu/exhibitions/wright-brothers/online...

Rather than tell us how much you hate the military and capitalism, maybe you can figure out a better way?

I'd love to see more non-military funding. Perhaps medical research could be used to increase public interest in funding supercomputer research?


> Rather than tell us how much you hate the military and capitalism maybe you can figure out a better way.

Well, knowing a few scientists, I can say that I know well the creative and inquiring spirits that advance us. The world is structured so that the people who do the actual development and discovery have to sell their talents to the logic of property and capital, either directly, or to the government that enforces that structure.

So, I think we're probably in agreement on the forces wherefrom those technologies came. I think we're also in agreement on whose behalf those forces act. I think we're probably in disagreement that capital is any better a master than the government it is in collusion with.


There may be a better solution. Just because no one has found it doesn't mean we can't do better. However, I'm not one for cursing the darkness. Light a small candle and lead the way.


I think the solution is fewer solutions. When I am working well with others, there's nothing much to fuss about. When there are conflicts, it's not ideology or a system that wins the day. I think we two would collaborate just fine without such stuff, for instance. What works in conflict is the willingness to discard systems and ideas. Or, even better, the lack of willfulness to enforce systems and ideas in the first place.

Being conceptually slippery enough to get out of any problem comes naturally to skillful people in their fields. The situation at hand provides all the impetus for theory and practice there is. Conflict thrives off of superfluity: superfluous methods, superfluous justifications, and superfluous issues. Methodology and structure gets in the way of skillfulness.

I know that real human power scales on its own without armies and guns forcing it into a certain shape. People collaborate and collude very easily. People form groups very easily. People associate with people. It's probably the only culturally universal thing people do (well of course; culture only makes sense when there are associating people).

If we want association to work well, groups need to be able to disintegrate as easily as they come together. People need to be slippery, too. Without what makes groups cohere, they disintegrate on their own. When you see violence and hierarchy used to keep groups coherent, it means they've lost the essential power that makes them useful.

People like achieving status. Status is a social signal, and it communicates both ways. Status is not held like a title. Titles wax and wane in the status they confer just like every other human object and activity. When we pretend status can be held, that it can be concentrated and preserved, we get problems. If we let status come and go by its own logic, we'd have no problems with status. A society where status is maintained by a legal system will have hierarchies, and it will certainly have problems.

So ways that our current systems-- not just capitalism, fail: Labor is not free. Association is not free. Status is not free.

Even skillful people who believe in property as an organizing force wish to be locally free in these ways. Skillful people want to work unimpeded by the politics of labor, associating with likeminded people unimpeded by the politics of social groups, praised for the inherent virtue of their actions and unimpeded by the politics of status. They only believe in property because it helps them manage the world that is beyond their control, beyond the means of their skill. They use property to create a bubble in which they can live that is free from its control.

People who believe in property because they enjoy its logic, enjoy the wheeling and dealing, enjoy the frantic rush to get more of it, who feel more worthwhile the more property they have are nervous people, who can never be fulfilled, because property's logic does not lead to fullness, the only conclusion is 'not enough', the only purpose is 'more'.

I don't know how many people in the latter category actually exist. I suspect enough to cause a great deal of problems. In any case they require the assent of everyone else. I think all I have to say to everyone else is this:

The world without property is still full and whole, is still just, is still full of vitality, is still nourishing, and receptive to human power.


I think he was extolling the virtues of large scale R&D spending and adding that the Cold War happened to be the impetus for this spending.

He parts with: "I wish private industry would do more." That seems to suggest his main point is large R&D spending, not the virtues of the Cold War.


You can't deny that some of that stuff is pretty sweet. And ignoring where it came from requires rewriting history.


I would not have wagered the world for what we got out of it. The fact that we're around to talk about it is fine and all, but the last century is not a good model for any century.


My perspective: I'd rather haven them spend money on simulating bombs in fancy computers, than having actual nuclear tests.

Yes, a world without nuclear weapons would be nice. Unfortunately that ship has sailed.


I have access to the 6th-fastest supercomputer and I'm pretty sure that it is not used to simulate nuclear explosions.


What does it simulate?

Thanks.


That should be Piz Daint at CSCS, so superconductor behavior, weather, and many other things.

Mostly non-embarrassingly-parallel problems, where its high-speed interconnect pays off.


Can we please stop calling things "embarrassingly parallel". Just what the hell is embarrassing about highly parallelizable algorithms?


The term "embarrassingly parallel" doesn't refer to algorithms, it refers to problems which can be computed in such a manner, stating that they're not very interesting for parallel algorithms research. But yeah, it's kind of a stupid term. Would you be okay with "pleasingly parallel"? ;)


Potato, potato. I don't make the rules. I just think "embarrassingly parallel" sounds like a college freshman trying way too hard to sound smart and cool.


It's called "embarrassingly parallel" because it's embarrassing for Cray's and other supercomputer vendors' salespeople when their expensive hardware doesn't outperform a five-year-old solution that cost half as much as their fancy gear ;)

Otherwise, "data parallel" is another phrase used for the same concept.


Argonne National Lab's machine, which is the 5th fastest I think, is used for weather models, materials science, and more [1]

[1]http://www.alcf.anl.gov/


There are dual-use systems which run classified and non-classified codes (they can be partitioned). They typically live at LANL or LLNL rather than ANL or Berkeley (Berkeley in particular doesn't do any classified work). Note that most classified codes are actually physics/material/explosion/plasma-physics simulations, and you can't always tell they are running on your system.

I checked the list of projects running on ALCF and I see some pretty obvious nuclear weapon stockpile stewardship and weapon design projects, such as "Validation Simulations of Macroscopic Burning-Plasma Dynamics"


Nucleosynthesis of heavier elements in supernova explosions and neutron star mergers are two problems that I guess are used for "publicly" validating dark codes.

(I know a guy that Los Alamos is trying to recruit to do some dark work for them, and he does nucleosynthesis in neutron star mergers.)


Those would be useful, but typically the validation codes use terms like "multiscale combustion physics" or "coupled neutron/radiation transport".

I think the folks simulating supernova and neutron stars have a lot of physics overlaps, but I don't think that data is used directly for stockpile stewardship.


99% of the compute time on this will probably be used for weapons, combat and civil emergency wargame simulations - but that doesn't make for good PR so of course they don't mention it in this sycophantic fluff piece.


However, since the US military recently labeled climate change as the #1 national security threat to the US or something like that, maybe we could hope for some climate science on the side? ;-)


It's better than doing the real thing.


I work in this field. The short of it is: there is more money out there for PIs to get defense related grants vs cure for cancer type grants. Some of it is for good, e.g. modeling where to dispatch mobile hospitals for the Ebola outbreak. Modeling the world's population doing things like fleeing a dirty bomb takes an astonishing amount of compute if you want answers in a timely fashion.


I think one of the primary motivations for doing this is to break cryptographic keys, and using that for surveillance and to hack into Chinese websites.


Is this in the realm of feasibility for modern crypto? And what about past/current transmissions using older crypto?



