Forgive me if I'm wrong, but I thought supercomputer time was usually not allocated to embarrassingly parallel tasks. While they can certainly handle those workloads, running them there wastes a distributed system built around expensive, high-bandwidth fiber interconnects between nodes.
When I spent (a relatively small amount of) time working with one, this was the main thing the director drilled into my head: use it to solve large parallel problems that require lots of internode communication of intermediate results. Embarrassingly parallel problems can be solved on cheaper hardware like GPUs.
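To make the distinction concrete, here's a minimal sketch (not from the original comment, just an illustration) of an embarrassingly parallel workload: independent Monte Carlo estimates of pi fanned out with Python's `multiprocessing.Pool`. The workers never exchange data, which is exactly why this kind of job doesn't need a supercomputer's interconnect.

```python
import random
from multiprocessing import Pool

def estimate_pi(seed, samples=10_000):
    """One independent Monte Carlo task.

    No data is shared between tasks, so this is embarrassingly
    parallel -- any pile of cheap cores will do.
    """
    rng = random.Random(seed)
    inside = sum(rng.random() ** 2 + rng.random() ** 2 <= 1.0
                 for _ in range(samples))
    return 4.0 * inside / samples

if __name__ == "__main__":
    # Fan out the seeds; the workers never talk to each other.
    with Pool(4) as pool:
        estimates = pool.map(estimate_pi, range(8))
    print(sum(estimates) / len(estimates))
```

Each task touches only its own seed and sample count; the only "communication" is collecting the final scalars, which is negligible.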
When the data no longer fits inside a single compute resource (a node, or even a rack), you are by necessity going to be distributed. Communication is a fact of life as the problem size grows, and that is just as true for GPU-based computing.
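As a contrast to the embarrassingly parallel case, here is a toy sketch (my illustration, not from the comment) of why a stencil computation forces communication once the array is split across nodes: each "node" must receive a boundary ("halo") value from its neighbor before every step. Two Python lists stand in for two nodes' local data.

```python
def stencil_step(left, right):
    """One step of a 3-point averaging stencil split across two 'nodes'.

    Before computing, each node must receive one halo value from its
    neighbour -- this per-step exchange is the communication that a
    supercomputer's fast interconnect exists to serve.
    """
    # Halo exchange: one value crosses the "interconnect" each way.
    left_ext = left + [right[0]]      # left node receives right's edge
    right_ext = [left[-1]] + right    # right node receives left's edge

    def avg(a):
        return [(a[i - 1] + a[i] + a[i + 1]) / 3
                for i in range(1, len(a) - 1)]

    # Outer boundary points simply keep their old values here.
    new_left = [left[0]] + avg(left_ext)
    new_right = avg(right_ext) + [right[-1]]
    return new_left, new_right
```

On a real machine the two list slices would live on different nodes and the halo exchange would be an MPI send/receive; the point is that the exchange happens every step, so interconnect latency and bandwidth dominate at scale.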
This work is four years old (with development starting before that), so Julia's GPU capabilities probably weren't mature enough at the time. If you wanted to do it today, that'd probably be the way to do it, but it would need some benchmarking.
A lot of modern supercomputers use or include GPUs. But for a long time most GPUs had very poor fp64 compute throughput, so they weren't really used for anything requiring double precision.