The .NET community started their own version and included some other languages. I think I saw Rust at 1.4s or so, but it was disqualified because it was optimized for that dataset (no idea how they determined that). Someone wrote some SIMD code for C# that ran in about 1s. Shaving milliseconds between C#, C++, and Rust at the end, for 1 billion rows, is still impressive.
So far the top spot is not clear: .NET from noahfalk and C/C++ from lehuyduc, dzaima, and austindonisan all require more tests. The last one looks the fastest so far on the default dataset, but it has issues.
I think you're talking about a Rust solution that works by reading a small prefix of the input and generating a perfect hash with no handling for misses. So it will be wrong for any input that contains station names that do not occur in that prefix.
> Someone wrote some SIMD code for C# that ran in about 1s.
CLR performance can be considered nondeterministic given that it JITs and performs profile-guided optimizations - unless you go with AOT, but even then there's still going to be an initialization delay, and differences between cold, warm, and hot-start scenarios. In order to be fair to all languages/platforms/runtimes involved, how is that managed?
Results actually have improved quite a bit since I answered the questions for that interview. The fastest contenders are now below two seconds, and that's running on 8 CPU cores of the evaluation machine. When I ran the entries on all 32 cores / 64 threads two weeks ago, the fastest one was at 0.8 sec, and it should be quite a bit faster by now.
As a SQL-head, I enjoyed following along with Robin Moffatt's blog on doing the challenge with DuckDB. I got slightly slower numbers than he did (I have different hardware), but definitely in the same order of magnitude. He got 28 seconds to ingest the CSV, then 6 seconds to aggregate and calculate.
The official challenge is in Java, but there are unofficial solutions in the "Show and tell" categories, using other languages, databases, and stuff (even a Fortran solution that takes 3 seconds without going too far in optimization):
I like this discussion because while the original is about "what can you do in 2024 stock Java", I'm more interested in "what can you do if you use naive, obvious solutions". AWK does this in 6 minutes but, as pointed out in this comment:
> It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?".
You can have both five minutes of work and some extra JVM speed juice over awk. The following Kotlin script runs in ~2m46s on my M1 Mac. That's faster than the baseline, and the code is a bit more complex than it otherwise could be in order to avoid using too much RAM.
I think it's also more readable than the awk for people who don't already know the language whilst being of a similar length, but that might be just me. More usefully you can get type checking, autocompletion, use any libraries you want and so on.
    import kotlin.io.path.Path
    import kotlin.io.path.useLines
    import kotlin.math.max
    import kotlin.math.min

    // Running min/max/total/count per station; the mean is derived at the end.
    class Result(var lo: Double = Double.MAX_VALUE, var hi: Double = Double.MIN_VALUE, var total: Double = 0.0, var samples: Int = 0) {
        val mean get() = total / samples
    }

    val results = HashMap<String, Result>()

    Path("measurements.txt").useLines { lines ->
        lines
            .map { it.substringBefore(";") to it.substringAfter(";").toDouble() }
            .forEach { (name, measurement) ->
                with(results.getOrPut(name) { Result() }) {
                    lo = min(lo, measurement)
                    hi = max(hi, measurement)
                    total += measurement
                    samples++
                }
            }
    }

    // Sort by station name and print min/mean/max to one decimal place.
    results.toSortedMap().mapValues { (_, m) -> "%.1f/%.1f/%.1f".format(m.lo, m.mean, m.hi) }.also(::println)
You can do this sort of thing with jbang, although for some reason Gunnar's baseline is slower. Maybe I'm doing something wrong.
> although for some reason Gunnar's baseline is slower. Maybe I'm doing something wrong.
I suspect it's because the baseline code is adding each line to a TreeMap before processing, effectively sorting 1B rows and then processing them. Your code reduces to the 413 unique stations and then sorts those.
I feel like Gunnar purposefully wrote the baseline with lots of room for improvement to encourage participation.
Yes, I think you're right. The way he defines two classes for holding the results is another hint that it was built to be easily optimized :) And, well, you gotta hand it to him. That really went viral!
That's really interesting and I like seeing what can be done in more "evolved" languages than awk. Awk has the advantage of being everywhere and not needing an installation step, but that's a price you only pay once.
Your Kotlin code feels more imperative to me even though I'm used to that kind of thing; I like that awk still feels more like describing stuff, but that's just me.
I didn't do it that way because with one billion rows, it runs out of memory. The imperative version that mutates in place doesn't.
You can probably write a functional version that also doesn't run out of memory using .groupingBy{}, because it all applies to lazy sequences. I just didn't bother working out how. The Java streams framework can do the same things and has the advantage that it can sometimes make it easy to parallelize.
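For what it's worth, a minimal sketch of that groupingBy{} version could look like the following (untested; it reuses the same Result shape and file name as the snippet above, and memory stays proportional to the number of stations because the grouping folds as the lazy sequence streams past):

    import kotlin.io.path.Path
    import kotlin.io.path.useLines
    import kotlin.math.max
    import kotlin.math.min

    class Result(var lo: Double = Double.MAX_VALUE, var hi: Double = Double.MIN_VALUE, var total: Double = 0.0, var samples: Int = 0) {
        val mean get() = total / samples
    }

    val results = Path("measurements.txt").useLines { lines ->
        lines
            .map { it.substringBefore(";") to it.substringAfter(";").toDouble() }
            .groupingBy { (name, _) -> name }
            // fold() aggregates into one Result per station as the sequence is consumed
            .fold({ _, _ -> Result() }) { _, acc, (_, m) ->
                acc.apply { lo = min(lo, m); hi = max(hi, m); total += m; samples++ }
            }
    }

    println(results.toSortedMap().mapValues { (_, m) -> "%.1f/%.1f/%.1f".format(m.lo, m.mean, m.hi) })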
> It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?".
I can bet that a good chunk of the baseline's 2 min 50 s is taken by byte -> String conversion. Java is pretty bad at this, especially if the default Charset is UTF-8. Even using ASCII in the Files.lines() call may improve performance dramatically.
All the leaders work directly with bytes; all the other tricks are just for sub-second improvements.
I solved a similar problem with log files.
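To illustrate the Files.lines() point, the charset can be passed explicitly (a sketch only; the file name is made up, and note that the challenge data contains non-ASCII station names, which a strict ASCII decoder would replace - one more reason the leaders skip Strings entirely):

    import java.nio.charset.StandardCharsets
    import java.nio.file.Files
    import java.nio.file.Path

    // Files.lines(Path) always decodes as UTF-8; the two-argument overload lets you
    // pick a single-byte charset and skip the multi-byte decoding path.
    val rows = Files.lines(Path.of("measurements.txt"), StandardCharsets.US_ASCII).use { it.count() }
    println(rows)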
Yeah, strings are very expensive; I saw huge gains in my submission after removing them. Parallelization gets the runtime down to 30s; after removing strings you get to around 15s on the eval machine, and the rest of the tricks are still very relevant to get to sub-10s.
>"It's either 6 minutes with 20 lines of AWK written in five minutes. Or 6 seconds in 300 lines of java written in five days. Classic "XKCD: Is It Worth the Time?"
It's exactly this. Is it going to be run many times or just once or a few times? Is it going to be run in production where it impacts customers or runs up the AWS bill? If it's a throwaway script, the Linux terminal and some Awk is hard to beat. There are always tradeoffs.
I imagine an indexed RDBMS could do this in a fraction of a second? I get that this is more of an academic exercise, but it also kind of explains why you might prefer one solution over another for a given task.
Maybe it could answer the question for an already indexed data set, but it probably couldn't INSERT 1 billion rows and then answer the question within that time. The ingestion is part of the challenge.
I've tried building the project on my machine, but I got an error that some library used doesn't support Java 21. On the other hand, compiling with a lower version got me an error that only Java 21 is supported.
While not directly related to this benchmark, Java doesn't really have meaningful GC pauses anymore (https://openjdk.org/jeps/439). With the new ZGC, the worst-case GC pause is under 1ms (and usually far below that).
The whole problem is solvable with a single hashtable and a fixed amount of memory. The solutions use mmap over the input file, so there's no need for a GC run or a large heap. The solutions also mostly used GraalVM and ahead-of-time compilation.
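For anyone curious, the mmap part on its own looks roughly like this with the foreign memory API (final in Java 22, preview in 21); this is only a sketch with a placeholder scan, not any particular submission:

    import java.lang.foreign.Arena
    import java.lang.foreign.ValueLayout
    import java.nio.channels.FileChannel
    import java.nio.file.Path
    import java.nio.file.StandardOpenOption

    FileChannel.open(Path.of("measurements.txt"), StandardOpenOption.READ).use { channel ->
        Arena.ofConfined().use { arena ->
            // Map the whole file read-only; nothing gets copied onto the Java heap.
            val segment = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena)
            // Placeholder scan: count rows. Real solutions parse station names and
            // temperatures straight out of the mapped bytes here.
            var rows = 0L
            for (i in 0L until segment.byteSize()) {
                if (segment.get(ValueLayout.JAVA_BYTE, i) == '\n'.code.toByte()) rows++
            }
            println(rows)
        }
    }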
All the sub-2-second ones are. There is an impressive 3.2s solution that uses neither AOT nor Unsafe. It's using the in-preview foreign memory and vector operations APIs. Browsing the solutions, it looks like the Unsafe solutions are winning because Unsafe lets them bypass bounds checks.
It's sort of wild that there's no vector load or store that doesn't include bounds checks. Even if you make a MemorySegment from 0 to 2^63-1 (~2^15 times larger than addressable memory on x64!) you'll pay for bounds checks on every load and every store.
Only for random access. The compiler will hoist bounds checks out of more regularly-accessed loops (when the inlining is right).
In fact, given that the top safe result was very close to the top unsafe result (especially considering the standard deviation), it looks like you need to write truly exceptional code to feel the cost of the random-access bounds checks. So it doesn't seem wild at all that the default is to give everyone safety at the expense of a performance cost that will only be felt by very few (and who can then enable unsafe code if they feel they need that last extra boost).
Bounds checks in a MemorySegment that contains the whole addressable range 2^15 times provide no safety. Obviously.
You've posted that users can use unsafe to avoid the bounds checks, but you're posting specifically about methods provided for MemorySegment that have no equivalent for Unsafe, so you're just incorrect.
"Only for random access" is probably also incorrect. We could maybe have a look at the codegen for any of the top openjdk submissions which currently perform sequential access but increment the read pointer by ctz(movemask(eq(*ptr, c)))+k at each step. It's trivial for a person to prove that the bounds checks are redundant with the loop termination condition given the MemorySegment's length and given that ctz is at most 64, but it seems unlikely that the runtime manages it. It is of course also trivial for a person who knows that the MemorySegment is so large that it only disallows addresses that differ from allowed addresses by up to two bits not used for addressing to prove that all possible addresses are allowed, or at least that they address memory which is allowed to be accessed by a different address if the user masks off the high two bits first.
> Bounds checks in a MemorySegment that contains the whole addressable range 2^15 times provide no safety.
That's why safe code can't obtain such a MS in the first place. To get such a segment you have to use a flag (enable-native-access) that enables unsafe operations. We're contemplating offering a MS without bounds checks in those cases (for off-heap only), but that depends on how much that would be useful.
> you're posting specifically about methods provided for MemorySegment that have no equivalent for Unsafe
I'm talking about unsafe code, not the Unsafe class specifically. The internal vector intrinsics don't perform bounds checks.
> but it seems unlikely that the runtime manages it
We're doing okay and, as always, we expect to improve.
Anyway, my point is that the top safe result was so far ahead of most submissions, that it's clearly not "wild" to prioritise safety given how few manage to get to the point where unsafety can buy them a significant boost. It also allows us to trust more invariants and perform more optimisations (once we complete "integrity by default") so that the overall performance impact is clearly positive.
> The internal vector intrinsics don't perform bounds checks.
VECTOR_ACCESS_OOB_CHECK controls some bounds checks performed by ByteVector.rearrange and similar methods. With these bounds checks enabled, ByteVector.rearrange costs ~40x more than pshufb rather than only ~30x more.
This comment of mine is pedantic and silly. Of course I agree there aren't any bounds checks on the memory addresses you access if you are using the intrinsics that don't access memory.
It's not a matter of pedantry but of a fundamental view of performance. Operation X can be 100x faster than operation Y, yet its impact on performance may be anywhere between -3% and 10000%. The precise makeup of trainers' soles may make the difference between Usain Bolt breaking the world record or not, and yet have negligible impact on people's running speed. Saying that it's "wild" that not every trainer has the exact same soles as Usain Bolt's confuses performance with specialists' gear. Making safety stronger to allow, say, better constant folding can improve performance overall and at the same time not be the best choice in some special circumstances.
Improving performance requires that there be no non-delineated code that could result in miscompilation or undefined behaviour. We don't want to tell people, well, we've introduced a new default optimisation but it means that it's possible that some transitive dependency you may not even know about could result in undefined behaviour. That might be okay for C (where deep dependency graphs are very, very rare), but it's certainly not what people want from Java.
Oh, I'm sorry; perhaps I've gone on too many tangents. The original comment said something about it being "wild" that the vector operations perform bounds checks, and I explained why that's the most sensible thing to do in Java (indeed, we currently offer no standard API for any on- or off-heap operation without bounds checks; this isn't unique to vectors), and added that we're considering adding a bounds-checks-free MemorySegment for off-heap access that would turn them off (that would, of course, require opt-in because it would be unsafe, just as the "universal" MS you mentioned requires opt-in as it, too, is unsafe [1]).
Is there anything else you wanted to know and I didn't answer?
[1]: You may ask, why that universal MS doesn't disable bounds checking, and the answer is that it complicates the implementation, but a specialised MS could do it.
To me, the more interesting aspect is not the fastest solutions, but the variance. There's an almost 2-orders-of-magnitude difference between the fastest safe solution and the baseline entry. When looking at an earlier run (before people started copying off solutions from one another), the standard deviation was 10s, which is 3x bigger than the fastest safe solution; even the standard deviation for the top 20 was more than 4s. In other words, clever algorithms and "mechanical sympathy" make a much bigger difference than unsafe code, and only a few of even the top performers got to the point where unsafe code makes a significant difference.
So it's better to write this sort of code in a language that's unsafe by default (e.g. C), rather than try to abuse Java.
Reminds me of the code that Microsoft produced when they were trying to prove their new .NET Core is one of the fastest web runtimes - it was quite fast, but had nothing to do with the way C# code is written out there.
> So it's better to write this sort of code in a language that's unsafe by default (e.g. C), rather than try to abuse Java.
I'm not necessarily disagreeing, but can you elaborate on why it's better? Using unsafe for some parts of the code doesn't throw out safety guarantees about all of the rest of the code.
> Using unsafe for some parts of the code doesn't throw out safety guarantees about all of the rest of the code.
That's actually hard to prove because you don't know which guarantees the unsafe code can break; being unsafe, it could potentially, say, mutate arbitrary final fields of any object in the program. So while such code doesn't necessarily break invariants, there's no easy way to tell whether and which it does. Once unsafe is used, no guarantee can be fully trusted.
But the direction we're going with "integrity by default" (https://openjdk.org/jeps/8305968) is that the application will need to acknowledge and approve the use of unsafe code by libraries, and so the author of the application could choose to more closely inspect the unsafe code and try to determine its blast radius.
Not all of them, and from what I remember the top ranking hasn't changed that much since everyone started updating their solutions with unsafe/mmap, which rather shows that the gain is fairly marginal.
I think this instead shows that there are many other popular techniques that achieve much more pessimization than the best-case use of MemorySegment over Unsafe or read over mmap.
You actually want GC pauses in a challenge like this, because it's a pure batch problem with no latency constraints at all (like a compiler). When that's the case you can use GC algorithms that are optimized for throughput rather than pause times, and things will go faster.
It probably won't make much difference on this problem for various reasons, but normally that's the case.
Turns out you don't need GC for processing that 13 GB file, even with relatively small heap sizes: some folks have disabled GC by means of using EpsilonGC (i.e. a no-op collector). That said, the right collector will give you sub-ms pause times these days (at the cost of lower throughput).
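For reference, those collector choices are just JVM flags; an illustrative invocation (the class name here is made up, and the actual harness wraps this in launch scripts) would be:

    # no-op Epsilon collector: no GC at all, OOMs once the heap fills up
    java -XX:+UnlockExperimentalVMOptions -XX:+UseEpsilonGC CalculateAverage.java
    # or a pause-oriented collector such as generational ZGC (JDK 21+)
    java -XX:+UseZGC -XX:+ZGenerational CalculateAverage.java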
I would like to discourage the use of the word "billion". It means different things to different people. For reference, this title uses it to mean 10^9, but it can also mean 10^12. The official word for 10^9 is milliard, but hardly anyone knows what you mean when you say that, so I'd encourage the phrase "thousand million" - everyone will know what you mean when you say that and it's unambiguous. Or use scientific notation.
In the English speaking world "billion" only means 10^9.
See Wikipedia:
> This is now the most common sense of the word in all varieties of English; it has long been established in American English and has since become common in Britain and other English-speaking countries as well.
And as much as I hate it, when communicating in public what matters is the commonly used definition of the word, even if it's different from what that word meant a few years ago.
Nobody in a casual English conversation (and, in my experience, Asian ones too) will take billion to mean 10^12. The English sphere predominantly uses the short scale.
That being said, there is no official word, just two globally used scales.
Essentially the entirety of the anglophone world uses short scale numbering. I highly doubt anyone would interpret "billion" as anything other than 10^9 in English text.
I haven't read the article, just the headline, and what a strange...challenge...this is. Like, 1bil rows of CSV? JSON objects? Billion XML records? What does "process" mean? Parse and toss? Transform?
Who cares if Java can read 1bil "rows"? Why not describe it in terms of MB/s or something more reasonable?
Well, you should've - it's very clearly explained:
The text file has a simple structure with one measurement value per row:
    Hamburg;12.0
    Bulawayo;8.9
    Palembang;38.8
    St. John's;15.2
    Cracow;12.6
    ...
The program should print out the min, mean, and max values per station, alphabetically ordered like so:
    {
        Abha=5.0/18.0/27.4,
        Abidjan=15.7/26.0/34.1,
        Abéché=12.1/29.4/35.6,
        Accra=14.7/26.4/33.1,
        Addis Ababa=2.1/16.0/24.3,
        Adelaide=4.1/17.3/29.7,
        ...
    }
Based on the trends of "<something something> so we rewrote our entire production stack in rust!" I am surprised there is not already a comment here!
I'd also be interested in how fast JavaScript/V8 executes by comparison.