
> we run on Azure SQL Hyperscale

Definitely the wrong technology, and was almost certainly picked only because Troy Hunt is a "Microsoft Regional Director and MVP".

Many other technologies scale better for this kind of workload. Heck, you could ask ChatGPT to write a short C# CLI tool to process the data on one machine; you don't even need a huge box.

This kind of thing comes up regularly here on HN for problems such as duplicate password detection, leaked password filtering, etc.

After previous brainstorming sessions, the general consensus was that it's really hard to beat a binary file containing the sorted SHA hashes. I.e.: if you have 1 billion records to search and you're using a 20-byte SHA1 hash, create a file that is exactly 20 billion bytes in size. Lookup is (naively) just binary search, but you can do even better by guessing where in the file a hash is likely to be, exploiting the essentially perfectly uniform distribution of hashes. I.e.: a hash with a first byte value of 25 is almost certainly going to be about 10% of the way into the file (25/256), etc.

It's possible to create a small (~1 MB) lookup table that can guarantee lookups into the main file with only one I/O operation of a fixed size, such as 64 KB.
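
A minimal C# sketch of that kind of index, assuming a file of sorted 20-byte SHA1 hashes; the file names, ~64 KB block size and 3-byte prefix length are illustrative choices, not anything HIBP actually uses:

    // Build a small in-memory index over "hashes-sorted.bin" (sorted 20-byte SHA1 hashes)
    // so that any lookup needs only one fixed-size read from the big file.
    using System;
    using System.IO;

    const int HashSize = 20;                          // SHA1
    const int HashesPerBlock = 64 * 1024 / HashSize;  // keep blocks record-aligned (~64 KB)
    const int PrefixSize = 3;                         // 3 bytes per block boundary ~= 1 MB index for a 20 GB file

    using var file = File.OpenRead("hashes-sorted.bin");
    long totalHashes = file.Length / HashSize;
    long blockCount = (totalHashes + HashesPerBlock - 1) / HashesPerBlock;

    var index = new byte[blockCount * PrefixSize];
    var hash = new byte[HashSize];
    for (long b = 0; b < blockCount; b++)
    {
        // Record the first few bytes of the first hash in each block.
        file.Seek(b * (long)HashesPerBlock * HashSize, SeekOrigin.Begin);
        file.ReadExactly(hash, 0, HashSize);          // .NET 7+; use a read loop on older runtimes
        Array.Copy(hash, 0, index, b * PrefixSize, PrefixSize);
    }
    File.WriteAllBytes("hashes-sorted.idx", index);

    // Lookup: binary-search the target's first 3 bytes against the index to pick a block,
    // read that one ~64 KB block (plus a neighbour if the prefix straddles a boundary),
    // then binary-search within the block.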

Sorting the data is a tiny bit fiddly, because it won't fit into memory for any reasonably interesting data size. There are tricks to this, such as splitting the data into 65,536 chunks based on the first two bytes of each hash, then sorting each chunk with a very ordinary array sort function from the standard library.
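
A rough sketch of that two-pass approach in C# (paths are made up, and real code would buffer writes and watch the open-file-handle limit):

    using System;
    using System.IO;

    const int HashSize = 20;
    Directory.CreateDirectory("chunks");

    // Pass 1: scatter the unsorted hashes into 65,536 chunk files keyed on the first two bytes.
    var writers = new FileStream[65536];
    using (var input = File.OpenRead("hashes-unsorted.bin"))
    {
        var hash = new byte[HashSize];
        while (input.Read(hash, 0, HashSize) == HashSize)
        {
            int bucket = (hash[0] << 8) | hash[1];
            writers[bucket] ??= File.OpenWrite($"chunks/{bucket:x4}.bin");
            writers[bucket].Write(hash, 0, HashSize);
        }
    }
    foreach (var w in writers) w?.Dispose();

    // Pass 2: each chunk now easily fits in memory; sort it and append to the final file.
    using var output = File.Create("hashes-sorted.bin");
    for (int bucket = 0; bucket < 65536; bucket++)
    {
        string path = $"chunks/{bucket:x4}.bin";
        if (!File.Exists(path)) continue;
        byte[] data = File.ReadAllBytes(path);
        var records = new byte[data.Length / HashSize][];
        for (int i = 0; i < records.Length; i++)
            records[i] = data.AsSpan(i * HashSize, HashSize).ToArray();
        Array.Sort(records, (a, b) => a.AsSpan().SequenceCompareTo(b));  // ordinary library sort
        foreach (var r in records) output.Write(r, 0, HashSize);
    }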

On blob storage this is super cheap to implement and host, about 50x cheaper than Azure SQL Hyperscale, even with the latter scaled down to the minimum CPU count.



Hi.

Stefán (the other HIBP developer) here :)

There are good reasons for the tech we picked. I’ll elaborate in a more detailed answer later today or tomorrow.

I love good nerd discussions.



The sorting is the slowest step by far.

Hashing is so fast that you can hand-wave it away as zero cost relative to the time taken to read such a large amount of data. Also, you only have to do it once for the whole input, which means that it's O(n) time where 'n' is the gigabytes of passwords you have.

Sorting is going to need about O(n log n) time even if it's entirely in memory, and if it has to spool to disk storage it'll take much longer than the hashing step.

PS: I just realised that 2 billion passwords is not actually that much data -- only 40 GB of hashes -- that's well within the range of what's "easy" to sort in-memory by simply creating an array of hashes that size and calling a standard library sort function.
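
For what it's worth, the straight in-memory version is only a few lines of C#: pack each hash into a comparable value type and let Array.Sort do the work. This assumes a machine with plenty of RAM (the struct pads to 24 bytes, so 2 billion entries is roughly 48 GB) and .NET 7+ for ReadExactly; file names are made up:

    using System;
    using System.Buffers.Binary;
    using System.IO;

    const int HashSize = 20;

    using var input = File.OpenRead("hashes-unsorted.bin");
    long count = input.Length / HashSize;   // must stay under ~2.1 billion (array length limit)
    var keys = new Sha1Key[count];          // >2 GB arrays are fine on modern 64-bit .NET

    var buf = new byte[HashSize];
    for (long i = 0; i < count; i++)
    {
        input.ReadExactly(buf, 0, HashSize);
        keys[i] = Sha1Key.FromBytes(buf);
    }

    Array.Sort(keys);   // plain standard-library sort, O(n log n)

    // Big-endian packing keeps integer comparison equivalent to lexicographic byte order.
    readonly record struct Sha1Key(ulong Hi, ulong Mid, uint Lo) : IComparable<Sha1Key>
    {
        public static Sha1Key FromBytes(ReadOnlySpan<byte> h) => new(
            BinaryPrimitives.ReadUInt64BigEndian(h),
            BinaryPrimitives.ReadUInt64BigEndian(h[8..]),
            BinaryPrimitives.ReadUInt32BigEndian(h[16..]));

        public int CompareTo(Sha1Key other)
        {
            int c = Hi.CompareTo(other.Hi);
            if (c != 0) return c;
            c = Mid.CompareTo(other.Mid);
            return c != 0 ? c : Lo.CompareTo(other.Lo);
        }
    }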


What other algorithms have you used? I'm really interested in big data streams. I would like to hear not only successful solutions, but also failed ones. Have you tried using Bloom filters? Is it possible to merge shards using a min-heap (k-way merge)?


Algorithm choice depends on what you're optimising for. The discussion a few years ago was about dozens of small web servers handling a large volume of password-change traffic (10K/sec!) that needed a cheap centralised service for checking against "known bad" passwords. On a cloud hosting platform, the optimal solution is a blob store of sorted binary hashes plus a small (~1 MB) precomputed index held in memory in the web server, which lets you query the main store with 1 or at most 2 reads. That's an optimal "one round trip" request, and the per-server overhead at the front end is negligible.
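
A sketch of that front-end lookup path, assuming the blob is reachable over plain HTTPS (e.g. a SAS URL, since blob storage supports ranged GETs) and that a block index like the one sketched earlier is already in memory; names and sizes are illustrative:

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    const int HashSize = 20;
    const int BlockBytes = 64 * 1024 / HashSize * HashSize;   // ~64 KB, record-aligned

    // One round trip: the in-memory index has already told us which block to fetch.
    async Task<bool> ContainsHashAsync(HttpClient http, string blobUrl, long blockIndex, byte[] target)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, blobUrl);
        long from = blockIndex * BlockBytes;
        request.Headers.Range = new RangeHeaderValue(from, from + BlockBytes - 1);

        using var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();                   // expect 206 Partial Content
        byte[] block = await response.Content.ReadAsByteArrayAsync();

        // The block is a run of sorted 20-byte hashes; binary search within it.
        int lo = 0, hi = block.Length / HashSize - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            int cmp = block.AsSpan(mid * HashSize, HashSize).SequenceCompareTo(target);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }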

However, that approach assumes a slowly changing list where you don't mind that there's a delay of maybe a few hours to merge in a new list. Large lists of password leaks are infrequent, so this works fine for this use case.

A two-layer approach, with a small frequently updated hash set on top of the large infrequently rebuilt sorted list, is applicable to a wider range of problems.

Bloom filters are probabilistic, and aren't much faster to create than a sorted list. They also require 'k' random reads to test, where k is a small constant such as 3 to 5. If the filter is large (gigabytes), then you can't efficiently load it into many front-end web servers. If you query it over the network as a blob, you need 3-5x the I/O operations. You can wrap it in an API service that holds it in-memory, but that's still much more complex than simply storing and querying a blob in S3 or Azure Storage directly.
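
To make the "k random reads" point concrete, a minimal Bloom filter membership test looks something like this (the double-hashing construction and parameters are just one common choice, not any particular library's implementation):

    using System;
    using System.Security.Cryptography;

    // Each of the k probe positions is effectively random, so if the bit array lives in a
    // remote blob instead of local memory, a lookup costs k ranged reads instead of one.
    static bool MightContain(byte[] bits, byte[] item, int k)
    {
        byte[] h = SHA256.HashData(item);                   // derive two 64-bit base hashes
        ulong h1 = BitConverter.ToUInt64(h, 0);
        ulong h2 = BitConverter.ToUInt64(h, 8);
        ulong m = (ulong)bits.Length * 8;                   // number of bits in the filter

        for (int i = 0; i < k; i++)
        {
            ulong bit = (h1 + (ulong)i * h2) % m;           // i-th probe (double hashing)
            if ((bits[bit / 8] & (1 << (int)(bit % 8))) == 0)
                return false;                               // definitely not in the set
        }
        return true;                                        // maybe in the set (false positives)
    }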

"Clever" algorithms like min-heaps or whatever are likely not worth the trouble. My decade-old PC can sort 2 billion hashes in about 30 seconds using the Rust "Rayon" crate. There are cloud VMs available for about $2/hr that have about a terabyte of RAM that could sort any reasonable sized list (10s of billions) in a few minutes at most.

The original article mentions a week of 80 vCores of SQL Hyperscale, which is about $6,000 at PAYG rates!

Sure, developer time is expensive, blah blah blah, but waiting a week ain't cheap either, and these days an AI can bang out the code for a password file hashing and sorting utility in a minute or two.


Hi. Stefán here again, the other HIBP dev and first employee: https://www.troyhunt.com/have-i-been-pwned-employee-1-0-stef...

So, here's a small blurb to clear up some misunderstandings. Excuse the typos since I'm writing in a bit of a hurry.

Most of the data processing is actually done in CLI tools we created and not in Azure SQL HyperScale. That includes things like:

* Extracting email addresses (from either CSV or other delimited files). This can be problematic because it turns out the people who gather this data aren't always thinking about encoding etc., so very often we need to jump through hoops to get it to work mostly correctly. And when we see something like this huge breach, which contains multi-terabyte CSVs, you need a tool that is both fast and memory efficient. For this breach we actually wrote our own, since other tools we tried often choked with out-of-memory errors or simply ran too slowly. We will likely be open sourcing some of these tools.

* Extracting emails is one thing; extracting passwords is another, with totally different requirements.

Emails we need to extract in a case-insensitive way; for stealer logs we also need to parse the domains associated with the email. We need to hash the email because the hashes are used for k-anonymity purposes as well as for batching in internal processes.
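
As a rough illustration of that kind of normalisation and hashing (the trimming and UTF-8 encoding here are assumptions, not HIBP's published rules; the 6-char SHA1 prefix matches the k-anonymity lookups mentioned below):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    // Normalise an address, hash it, and derive the pieces described above.
    static (string Domain, string Sha1Hex, string KAnonPrefix) ProcessEmail(string email)
    {
        string normalised = email.Trim().ToLowerInvariant();          // case-insensitive handling
        string domain = normalised[(normalised.IndexOf('@') + 1)..];  // assumes a well-formed address

        byte[] hash = SHA1.HashData(Encoding.UTF8.GetBytes(normalised));
        string hex = Convert.ToHexString(hash);
        return (domain, hex, hex[..6]);                               // first 6 chars for k-anonymity
    }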

Passwords we also need to hash (SHA1 and NTLM) for Pwned Passwords, and that means making sure we use consistent encoding. We also need to dedupe AND count them for prevalence purposes.
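
As a sketch of what that looks like (the NTLM side is MD4 over the UTF-16LE bytes of the password; .NET has no built-in MD4, so that part is left as a placeholder, and at real scale you'd partition rather than keep one giant in-memory dictionary):

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    var prevalence = new Dictionary<string, long>();        // SHA1 hex -> how many times seen

    foreach (string password in ReadPasswords())            // hypothetical source of raw passwords
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(password);     // pick one encoding and use it everywhere
        string sha1 = Convert.ToHexString(SHA1.HashData(utf8));

        // NTLM would be MD4(Encoding.Unicode.GetBytes(password)) via a third-party MD4 implementation.

        prevalence.TryGetValue(sha1, out long n);           // dedupe AND count in one structure
        prevalence[sha1] = n + 1;
    }

    static IEnumerable<string> ReadPasswords() =>
        throw new NotImplementedException("stand-in for streaming passwords out of the breach files");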

This we can all mostly do without touching Azure SQL HyperScale.

Once we have prepared the data, it needs to be inserted into the DB. It's not a case of creating simple binary lookup tables because we have different requirements and have at least three different ways of looking up an email address.

1. The full email (alias + domain)

2. Domain lookups (just domain)

3. K-anonymity lookups (first 6 chars of the SHA1 of the full email address)

This means emails can't just be indexed on the alias and the domain (which we denormalize into a domain-id); we also need indexes on things like the SHA1 prefix, and we need to take into account when people have opted out of having their emails loaded.

Reasons for Azure SQL HyperScale: The email and domain search data used to be stored in Azure Table Storage. It was very convenient since it was fast to look up (partition keys and row keys) and cheap to store. There was one big drawback though: Azure Table Storage has no backup or point-in-time restore strategy. The only way to back up/restore data is to download it and reupload it as a restore mechanism. Which is easy enough, except downloading the data was starting to take a week, even running in a VM in the same datacenter as the Table Storage account. And for a service like Have I Been Pwned, if we had a disaster or messed up a breach load and had to roll back, taking everything offline or having the wrong data for a week is unacceptable.

That's where Azure SQL HyperScale came in. The reason it was picked is not that Troy is a Microsoft RD or that I'm an MVP. We simply picked it because we both know MS SQL very well, we have good access to people who know it even better than we do (for support purposes), and it has very good, tried-and-tested backup/restore scenarios.

We do know that there are certainly better DBs we could use, and it would probably be cheaper to run our own Postgres on our own hardware, or something along those lines. But since it's just two people actively working on this, and we hardly have time for development of new features and breach loads as it is, we simply couldn't spend valuable time learning the ins and outs of a new DB engine, what it takes to run/maintain/optimize it, and all the other SRE responsibilities that come with it.

So in the end, it came down to convenience and what our time is best spent on doing.

Rest assured though, with everything we learned processing this breach, we will be much quicker to process the next really large one: we've taken away a ton of learnings and built new tools and processes that we'll be implementing. I'd expect the next breach of this size to take just a couple of days to process. Most other breaches take a lot less since they are a fraction of the size of this one.

Binary files: Pwned Passwords is currently stored in blob storage, with just the first 5 chars of the hash in the filename and the rest in a line-delimited, ordered fashion. I have already done some tests on making them binary files (since the hashes are always a fixed size and the prevalence is just an int). So we could technically have each hash entry be 17 bytes (rest of the hash) + 4 bytes for the prevalence (unsigned 32-bit int), so just 21 bytes per hash entry, and we skip newlines. And we might actually go that route in the not too distant future since it's easy to do.
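
Reading and writing that kind of fixed-width record is only a few lines. Here's a rough sketch; the 17-byte suffix length is taken from the paragraph above, and the exact split depends on how many hex chars of the hash end up in the blob name:

    using System;
    using System.Buffers.Binary;
    using System.IO;

    static class PwnedRecord
    {
        const int SuffixBytes = 17;                  // "rest of the hash", per the paragraph above
        public const int Size = SuffixBytes + 4;     // + unsigned 32-bit prevalence = 21 bytes, no newlines

        public static void Write(Stream s, ReadOnlySpan<byte> suffix, uint prevalence)
        {
            Span<byte> rec = stackalloc byte[Size];
            suffix[..SuffixBytes].CopyTo(rec);       // expects at least 17 suffix bytes
            BinaryPrimitives.WriteUInt32LittleEndian(rec[SuffixBytes..], prevalence);
            s.Write(rec);
        }

        // Fixed-width records mean entry i sits at byte offset i * Size, so a sorted file can
        // be binary searched directly, with no line parsing.
        public static (byte[] Suffix, uint Prevalence) Read(Stream s, long index)
        {
            s.Seek(index * Size, SeekOrigin.Begin);
            Span<byte> rec = stackalloc byte[Size];
            s.ReadExactly(rec);                      // .NET 7+
            return (rec[..SuffixBytes].ToArray(),
                    BinaryPrimitives.ReadUInt32LittleEndian(rec[SuffixBytes..]));
        }
    }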

Hope that clears up some of our thoughts here :) I'm planning on writing a blog post soon with most of the things we learned, so that might shed further light on how we process this.


All fair points and obviously well thought out by knowledgeable people. I approach problem solving the same way, accounting for staff familiarity with tools, not just the wall-clock time taken for some esoteric approach.

My observation is that in the last year or so the relative weight of these contributions has shifted massively because of AI code authoring.

It’s so fast and easy to whip up a few hundred lines of code with something like Gemini Pro 2.5 that I got it to make me a sorting benchmark tool just so I’d have a data point for a comment in this thread! I never would have had the time years ago.

For relatively “small” and isolated problems like password hash lookup tables, it’s amazing what you can do in mere hours with AI assistance.

If I was approaching this same problem just two years ago I would have picked SQL Hyperscale too, for the same(ish) reasons.

Now? I feel like many more avenues have been opened up…



