The .NET community started their own version and included some other languages. I think I saw Rust at 1.4s or so, but it was disqualified because it was optimized for that dataset (no idea how they determined that). Someone wrote some SIMD code for C# that ran in about 1s. Shaving milliseconds between C#, C++, and Rust at the end, for 1 billion rows, is still impressive.
So far the top spot is not clear: .NET from noahfalk and C/C++ from lehuyduc, dzaima, and austindonisan all require more tests. The last one looks the fastest so far on the default dataset, but it has issues.
I think you're talking about a Rust solution that works by reading a small prefix of the input and generating a perfect hash with no handling for misses. So it will be wrong for any input that contains station names that do not occur in that prefix.
> Someone wrote some SIMD code for C# that ran in about 1s.
CLR performance can be considered nondeterministic given that it JITs and performs profile-guided optimizations - unless you go with AOT, but even then there's still going to be an initialization delay, and differences between cold, warm, and hot-start scenarios. In order to be fair to all languages/platforms/runtimes involved, how is that managed?
Results actually have improved quite a bit since I answered the questions for that interview. The fastest contenders are now below two seconds, and that's running on 8 CPU cores of the evaluation machine. When I ran the entries on all 32 cores / 64 threads two weeks ago, the fastest one was at 0.8 sec, and it should be quite a bit faster by now.
As a SQL-head, I enjoyed following along with Robin Moffatt's blog on doing the challenge with DuckDB. I got slightly slower numbers than he did (I have different hardware), but definitely in the same order of magnitude. He got 28 seconds to ingest the CSV, then 6 seconds to aggregate and calculate.
The official challenge is in Java, but there are unofficial solutions in the "Show and tell" categories, using other languages, databases, and stuff (even a Fortran solution that takes 3 seconds without going too far in optimization):
I like this discussion because while the original is about "what can you do in 2024 stock Java", I'm more interested in "what can you do if you use naive, obvious solutions". AWK does this in 6 minutes but, as pointed out in this comment:
> It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?".
You can have both five minutes of work and some extra JVM speed juice over awk. The following Kotlin script runs in ~2m46s on my M1 Mac. That's faster than the baseline, and the code is a bit more complex than it otherwise could be in order to avoid using too much RAM.
I think it's also more readable than the awk for people who don't already know the language whilst being of a similar length, but that might be just me. More usefully you can get type checking, autocompletion, use any libraries you want and so on.
    import kotlin.io.path.Path
    import kotlin.io.path.useLines
    import kotlin.math.max
    import kotlin.math.min

    // Running min/max/total/count per station; the mean is derived at the end.
    class Result(var lo: Double = Double.MAX_VALUE, var hi: Double = Double.MIN_VALUE, var total: Double = 0.0, var samples: Int = 0) {
        val mean get() = total / samples
    }

    val results = HashMap<String, Result>()

    Path("measurements.txt").useLines { lines ->
        lines
            .map { it.substringBefore(";") to it.substringAfter(";").toDouble() }
            .forEach { (name, measurement) ->
                with(results.getOrPut(name) { Result() }) {
                    lo = min(lo, measurement)
                    hi = max(hi, measurement)
                    total += measurement
                    samples++
                }
            }
    }

    // Sort by station name and print min/mean/max to one decimal place.
    results.toSortedMap().mapValues { (_, m) -> "%.1f/%.1f/%.1f".format(m.lo, m.mean, m.hi) }.also(::println)
You can do this sort of thing with jbang, although for some reason Gunnar's baseline is slower. Maybe I'm doing something wrong.
> although for some reason Gunnar's baseline is slower. Maybe I'm doing something wrong.
I suspect it's because the baseline code is adding each line to a TreeMap before processing, effectively sorting 1B rows and then processing them. Your code reduces to the 413 unique stations and then sorts those.
I feel like Gunnar purposefully wrote the baseline with lots of room for improvement to encourage participation.
Yes, I think you're right. The way he defines two classes for holding the results is another hint that it was built to be easily optimized :) And, well, you gotta hand it to him. That really went viral!
That's really interesting and I like seeing what can be done in more "evolved" languages than awk. Awk has the advantage of being everywhere and not needing an installation step, but that's a price you only pay once.
Your Kotlin code feels more imperative to me even though I'm used to that kind of thing; I like that awk still feels more like describing stuff, but that's just me.
I didn't do it that way because with one billion rows, it runs out of memory. The imperative version that mutates in place doesn't.
You can probably write a functional version that also doesn't run out of memory using .groupingBy{}, because it all applies to lazy sequences. I just didn't bother working out how. The Java streams framework can do the same things and has the advantage that it can sometimes make it easy to parallelize.
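For what it's worth, a minimal sketch of that groupingBy{} version could look like the following (untested; it reuses the same Result shape and file name as the snippet above, and memory stays proportional to the number of stations because the grouping folds as the lazy sequence streams past):

    import kotlin.io.path.Path
    import kotlin.io.path.useLines
    import kotlin.math.max
    import kotlin.math.min

    class Result(var lo: Double = Double.MAX_VALUE, var hi: Double = Double.MIN_VALUE, var total: Double = 0.0, var samples: Int = 0) {
        val mean get() = total / samples
    }

    val results = Path("measurements.txt").useLines { lines ->
        lines
            .map { it.substringBefore(";") to it.substringAfter(";").toDouble() }
            .groupingBy { (name, _) -> name }
            // fold() aggregates into one Result per station as the sequence is consumed
            .fold({ _, _ -> Result() }) { _, acc, (_, m) ->
                acc.apply { lo = min(lo, m); hi = max(hi, m); total += m; samples++ }
            }
    }

    println(results.toSortedMap().mapValues { (_, m) -> "%.1f/%.1f/%.1f".format(m.lo, m.mean, m.hi) })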
> It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?".
I can bet that a good chunk of the baseline's 2 min 50 s is taken by byte -> String conversion. Java is pretty bad at this, especially if the default Charset is UTF-8. Even using ASCII in the Files.lines() call may improve performance dramatically.
All the leaders work directly with bytes; all the other tricks are just for sub-second improvements.
I solved a similar problem with log files.
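To illustrate the Files.lines() point, the charset can be passed explicitly (a sketch only; the file name is made up, and note that the challenge data contains non-ASCII station names, which a strict ASCII decoder would replace - one more reason the leaders skip Strings entirely):

    import java.nio.charset.StandardCharsets
    import java.nio.file.Files
    import java.nio.file.Path

    // Files.lines(Path) always decodes as UTF-8; the two-argument overload lets you
    // pick a single-byte charset and skip the multi-byte decoding path.
    val rows = Files.lines(Path.of("measurements.txt"), StandardCharsets.US_ASCII).use { it.count() }
    println(rows)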
Yeah, strings are very expensive; I saw huge gains in my submission after removing them. Parallelization gets the runtime down to 30s; after removing strings you get to around 15s on the eval machine, and the rest of the tricks are still very relevant to get to sub-10s.
>"It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?"
It's exactly this. Is it going to be run many times or just once or a few times? Is it going to be run in production where it impacts customers or runs up the AWS bill? If it's a throwaway script, the Linux terminal and some Awk is hard to beat. There are always tradeoffs.
I imagine an indexed RDBMS could do this in a fraction of a second? I get that this is more of an academic exercise, but it also kind of explains why you might prefer one solution over another for a given task.
Maybe it could answer the question for an already indexed data set, but it probably couldn't INSERT 1 billion rows and then answer the question within that time. The ingestion is part of the challenge.
I've tried building the project on my machine, but I got an error that some library used doesn't support Java 21. On the other hand, compiling with a lower version got me an error that only Java 21 is supported.
While not directly related to this benchmark, Java doesn't really have meaningful GC pauses anymore (https://openjdk.org/jeps/439). With the new ZGC, the worst-case GC pause is under 1ms (and usually far below that).
The whole problem is solvable with a single hashtable and a fixed amount of memory. The solutions use mmap over the input file, so there's no need for a GC run or a large heap. The solutions also mostly used GraalVM and ahead-of-time compilation.
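For anyone curious, the mmap part on its own looks roughly like this with the foreign memory API (final in Java 22, preview in 21); this is only a sketch with a placeholder scan, not any particular submission:

    import java.lang.foreign.Arena
    import java.lang.foreign.ValueLayout
    import java.nio.channels.FileChannel
    import java.nio.file.Path
    import java.nio.file.StandardOpenOption

    FileChannel.open(Path.of("measurements.txt"), StandardOpenOption.READ).use { channel ->
        Arena.ofConfined().use { arena ->
            // Map the whole file read-only; nothing gets copied onto the Java heap.
            val segment = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena)
            // Placeholder scan: count rows. Real solutions parse station names and
            // temperatures straight out of the mapped bytes here.
            var rows = 0L
            for (i in 0L until segment.byteSize()) {
                if (segment.get(ValueLayout.JAVA_BYTE, i) == '\n'.code.toByte()) rows++
            }
            println(rows)
        }
    }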
All the sub-2-second ones are. There is an impressive 3.2s solution that uses neither AOT nor Unsafe. It's using the in-preview foreign memory and vector operations APIs. Browsing the solutions, it looks like the Unsafe solutions are winning because Unsafe lets them bypass bounds checks.
It's sort of wild that there's no vector load or store that doesn't include bounds checks. Even if you make a MemorySegment from 0 to 2^63-1 (~2^15 times larger than addressable memory on x64!) you'll pay for bounds checks on every load and every store.
Only for random access. The compiler will hoist bounds checks out of more regularly-accessed loops (when the inlining is right).
In fact, given that the top safe result was very close to the top unsafe result (especially considering the standard deviation), it looks like you need to write truly exceptional code to feel the cost of the random-access bounds checks. So it doesn't seem wild at all that the default is to give everyone safety at the expense of a performance cost that will only be felt by very few (and who can then enable unsafe code if they feel they need that last extra boost).
Bounds checks in a MemorySegment that contains the whole addressable range 2^15 times provide no safety. Obviously.
You've posted that users can use unsafe to avoid the bounds checks, but you're posting specifically about methods provided for MemorySegment that have no equivalent for Unsafe, so you're just incorrect.
"Only for random access" is probably also incorrect. We could maybe have a look at the codegen for any of the top openjdk submissions which currently perform sequential access but increment the read pointer by ctz(movemask(eq(*ptr, c)))+k at each step. It's trivial for a person to prove that the bounds checks are redundant with the loop termination condition given the MemorySegment's length and given that ctz is at most 64, but it seems unlikely that the runtime manages it. It is of course also trivial for a person who knows that the MemorySegment is so large that it only disallows addresses that differ from allowed addresses by up to two bits not used for addressing to prove that all possible addresses are allowed, or at least that they address memory which is allowed to be accessed by a different address if the user masks off the high two bits first.
> Bounds checks in a MemorySegment that contains the whole addressable range 2^15 times provide no safety.
That's why safe code can't obtain such a MS in the first place. To get such a segment you have to use a flag (enable-native-access) that enables unsafe operations. We're contemplating offering a MS without bounds checks in those cases (for off-heap only), but that depends on how much that would be useful.
> you're posting specifically about methods provided for MemorySegment that have no equivalent for Unsafe
I'm talking about unsafe code, not the Unsafe class specifically. The internal vector intrinsics don't perform bounds checks.
> but it seems unlikely that the runtime manages it
We're doing okay and, as always, we expect to improve.
Anyway, my point is that the top safe result was so far ahead of most submissions, that it's clearly not "wild" to prioritise safety given how few manage to get to the point where unsafety can buy them a significant boost. It also allows us to trust more invariants and perform more optimisations (once we complete "integrity by default") so that the overall performance impact is clearly positive.
> The internal vector intrinsics don't perform bounds checks.
VECTOR_ACCESS_OOB_CHECK controls some bounds checks performed by ByteVector.rearrange and similar methods. With these bounds checks enabled, ByteVector.rearrange costs ~40x more than pshufb rather than only ~30x more.
This comment of mine is pedantic and silly. Of course I agree there aren't any bounds checks on the memory addresses you access if you are using the intrinsics that don't access memory.
It's not a matter of pedantry but of a fundamental view of performance. Operation X can be 100x faster than operation Y, yet its impact on performance may be anywhere between -3% and 10000%. The precise makeup of trainers' soles may make the difference between Usain Bolt breaking the world record or not, and yet have negligible impact on people's running speed. Saying that it's "wild" that not every trainer has the exact same soles as Usain Bolt's confuses performance with specialists' gear. Making safety stronger to allow, say, better constant folding can improve performance overall and at the same time not be the best choice in some special circumstances.
Improving performance requires that there be no non-delineated code that could result in miscompilation or undefined behaviour. We don't want to tell people, well, we've introduced a new default optimisation but it means that it's possible that some transitive dependency you may not even know about could result in undefined behaviour. That might be okay for C (where deep dependency graphs are very, very rare), but it's certainly not what people want from Java.
Oh, I'm sorry; perhaps I've gone on too many tangents. The original comment said something about it being "wild" that the vector operations perform bounds checks, and I explained why that's the most sensible thing to do in Java (indeed, we currently offer no standard API for any on- or off-heap operation without bounds checks; this isn't unique to vectors), and added that we're considering adding a bounds-checks-free MemorySegment for off-heap access that would turn them off (that would, of course, require opt-in because it would be unsafe, just as the "universal" MS you mentioned requires opt-in as it, too, is unsafe [1]).
Is there anything else you wanted to know and I didn't answer?
[1]: You may ask, why that universal MS doesn't disable bounds checking, and the answer is that it complicates the implementation, but a specialised MS could do it.
To me, the more interesting aspect is not the fastest solutions, but the variance. There's an almost 2-orders-of-magnitude difference between the fastest safe solution and the baseline entry. When looking at an earlier run (before people started copying off solutions from one another), the standard deviation was 10s, which is 3x bigger than the fastest safe solution; even the standard deviation for the top 20 was more than 4s. In other words, clever algorithms and "mechanical sympathy" make a much bigger difference than unsafe code, and only a few of even the top performers got to the point where unsafe code makes a significant difference.
So it's better to write this sort of code in a language that's unsafe by default (e.g. C), rather than try to abuse Java.
Reminds me of the code that Microsoft produced when they were trying to prove their new .NET Core is one of the fastest web runtimes - it was quite fast, but had nothing to do with the way C# code is written out there.
> So it's better to write this sort of code in a language that's unsafe by default (e.g. C), rather than try to abuse Java.
I'm not necessarily disagreeing, but can you elaborate on why it's better? Using unsafe for some parts of the code doesn't throw out safety guarantees about all of the rest of the code.
> Using unsafe for some parts of the code doesn't throw out safety guarantees about all of the rest of the code.
That's actually hard to prove because you don't know which guarantees the unsafe code can break; being unsafe, it could potentially, say, mutate arbitrary final fields of any object in the program. So while such code doesn't necessarily break invariants, there's no easy way to tell whether and which it does. Once unsafe is used, no guarantee can be fully trusted.
But the direction we're going with "integrity by default" (https://openjdk.org/jeps/8305968) is that the application will need to acknowledge and approve the use of unsafe code by libraries, and so the author of the application could choose to more closely inspect the unsafe code and try to determine its blast radius.
Not all of them, and from what I remember the top ranking hasn't changed that much since everyone started updating their solutions with unsafe/mmap, which rather shows that the gain is fairly marginal.
I think this instead shows that there are many other popular techniques that achieve much more pessimization than the best-case use of MemorySegment over Unsafe or read over mmap.
You actually want GC pauses in a challenge like this, because it's a pure batch problem with no latency constraints at all (like a compiler). When that's the case you can use GC algorithms that are optimized for throughput rather than pause times, and things will go faster.
It probably won't make much difference on this problem for various reasons, but normally that's the case.
Turns out you don't need GC for processing that 13 GB file, even with relatively small heap sizes: some folks have disabled GC by means of using EpsilonGC (i.e. a no-op collector). That said, the right collector will give you sub-ms pause times these days (at the cost of lower throughput).
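For reference, those collector choices are just JVM flags; an illustrative invocation (the class name here is made up, and the actual harness wraps this in launch scripts) would be:

    # no-op Epsilon collector: no GC at all, OOMs once the heap fills up
    java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC CalculateAverage.java
    # or a pause-oriented collector such as generational ZGC (JDK 21+)
    java -XX:+UseZGC -XX:+ZGenerational CalculateAverage.java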
I would like to discourage the use of the word "billion". It means different things to different people. For reference, this title uses it to mean 10^9, but it can also mean 10^12. The official word for 10^9 is milliard, but hardly anyone knows what you mean when you say that, so I'd encourage the phrase "thousand million" - everyone will know what you mean when you say that and it's unambiguous. Or use scientific notation.
In the English speaking world "billion" only means 10^9.
See Wikipedia:
> This is now the most common sense of the word in all varieties of English; it has long been established in American English and has since become common in Britain and other English-speaking countries as well.
And as much as I hate it, when communicating in public what matters is the commonly used definition of the word, even if it's different from what that word meant a few years ago.
Nobody in a casual English conversation (and, in my experience, Asian ones too) will take billion to mean 10^12. The English sphere predominantly uses the short scale.
That being said, there is no official word, just two globally used scales.
Essentially the entirety of the anglophone world uses short scale numbering. I highly doubt anyone would interpret "billion" as anything other than 10^9 in English text.
I haven't read the article, just the headline, and what a strange...challenge...this is. Like, 1bil rows of CSV? JSON objects? Billion XML records? What does "process" mean? Parse and toss? Transform?
Who cares if Java can read 1bil "rows"? Why not describe it in terms of MB/s or something more reasonable?
Well, you should've - it's very clearly explained:
The text file has a simple structure with one measurement value per row:
    Hamburg;12.0
    Bulawayo;8.9
    Palembang;38.8
    St. John's;15.2
    Cracow;12.6
    ...
The program should print out the min, mean, and max values per station, alphabetically ordered like so:
    {
        Abha=5.0/18.0/27.4,
        Abidjan=15.7/26.0/34.1,
        Abéché=12.1/29.4/35.6,
        Accra=14.7/26.4/33.1,
        Addis Ababa=2.1/16.0/24.3,
        Adelaide=4.1/17.3/29.7,
        ...
    }
Based on the trends of "<something something> so we rewrote our entire production stack in rust!" I am surprised there is not already a comment here!
I'd also be interested in how fast JavaScript/V8 executes by comparison.