
> we run on Azure SQL Hyperscale

Definitely the wrong technology, and was almost certainly picked only because Troy Hunt is a "Microsoft Regional Director and MVP".

Many other technologies scale better for this kind of workload. Heck, you could ask ChatGPT to write a short C# CLI tool to process the data on one machine; you don't even need a huge box.

This kind of thing comes up regularly here on HN for problems such as duplicate password detection, leaked password filtering, etc.

After previous brainstorming sessions, the general consensus was that it's really hard to beat a binary file containing the sorted SHA hashes. I.e.: if you have 1 billion records to search and you're using a 20-byte SHA1 hash, create a file that is exactly 20 billion bytes in size. Lookup is (naively) just binary search, but you can do even better by guessing where in the file a hash is likely to be, exploiting the essentially perfectly uniform distribution of hashes. I.e.: a hash with a first byte value of 25 is almost certainly going to be about 10% of the way into the file (25/256), etc.

It's possible to create a small (~1 MB) lookup table that can guarantee lookups into the main file with only one I/O operation of a fixed size, such as 64 KB.
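
A minimal C# sketch of that kind of index, assuming a file of sorted 20-byte SHA1 hashes; the file names, ~64 KB block size and 3-byte prefix length are illustrative choices, not anything HIBP actually uses:

    // Build a small in-memory index over "hashes-sorted.bin" (sorted 20-byte SHA1 hashes)
    // so that any lookup needs only one fixed-size read from the big file.
    using System;
    using System.IO;

    const int HashSize = 20;                          // SHA1
    const int HashesPerBlock = 64 * 1024 / HashSize;  // keep blocks record-aligned (~64 KB)
    const int PrefixSize = 3;                         // 3 bytes per block boundary ~= 1 MB index for a 20 GB file

    using var file = File.OpenRead("hashes-sorted.bin");
    long totalHashes = file.Length / HashSize;
    long blockCount = (totalHashes + HashesPerBlock - 1) / HashesPerBlock;

    var index = new byte[blockCount * PrefixSize];
    var hash = new byte[HashSize];
    for (long b = 0; b < blockCount; b++)
    {
        // Record the first few bytes of the first hash in each block.
        file.Seek(b * (long)HashesPerBlock * HashSize, SeekOrigin.Begin);
        file.ReadExactly(hash, 0, HashSize);          // .NET 7+; use a read loop on older runtimes
        Array.Copy(hash, 0, index, b * PrefixSize, PrefixSize);
    }
    File.WriteAllBytes("hashes-sorted.idx", index);

    // Lookup: binary-search the target's first 3 bytes against the index to pick a block,
    // read that one ~64 KB block (plus a neighbour if the prefix straddles a boundary),
    // then binary-search within the block.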

Sorting the data is a tiny bit fiddly, because it won't fit into memory for any reasonably interesting data size. There are tricks to this, such as splitting the data into 65,536 chunks based on the first two bytes of each hash, then sorting each chunk with a very ordinary array sort function from the standard library.
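
A rough sketch of that two-pass approach in C# (paths are made up, and real code would buffer writes and watch the open-file-handle limit):

    using System;
    using System.IO;

    const int HashSize = 20;
    Directory.CreateDirectory("chunks");

    // Pass 1: scatter the unsorted hashes into 65,536 chunk files keyed on the first two bytes.
    var writers = new FileStream[65536];
    using (var input = File.OpenRead("hashes-unsorted.bin"))
    {
        var hash = new byte[HashSize];
        while (input.Read(hash, 0, HashSize) == HashSize)
        {
            int bucket = (hash[0] << 8) | hash[1];
            writers[bucket] ??= File.OpenWrite($"chunks/{bucket:x4}.bin");
            writers[bucket].Write(hash, 0, HashSize);
        }
    }
    foreach (var w in writers) w?.Dispose();

    // Pass 2: each chunk now easily fits in memory; sort it and append to the final file.
    using var output = File.Create("hashes-sorted.bin");
    for (int bucket = 0; bucket < 65536; bucket++)
    {
        string path = $"chunks/{bucket:x4}.bin";
        if (!File.Exists(path)) continue;
        byte[] data = File.ReadAllBytes(path);
        var records = new byte[data.Length / HashSize][];
        for (int i = 0; i < records.Length; i++)
            records[i] = data.AsSpan(i * HashSize, HashSize).ToArray();
        Array.Sort(records, (a, b) => a.AsSpan().SequenceCompareTo(b));  // ordinary library sort
        foreach (var r in records) output.Write(r, 0, HashSize);
    }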

On blob storage this is super cheap to implement and host, about 50x cheaper than Azure SQL Hyperscale, even with the latter scaled down to the minimum CPU count.



Hi.

Stefán (the other HIBP developer) here :)

There are good reasons for the tech we picked. I’ll elaborate in a more detailed answer later today or tomorrow.

I love good nerd discussions.



The sorting is the slowest step by far.

Hashing is so fast that you can hand-wave it away as zero cost relative to the time taken to read such a large amount of data. Also, you only have to do it once for the whole input, which means that it's O(n) time where 'n' is the gigabytes of passwords you have.

Sorting is going to need about O(n log n) time even if it's entirely in memory, and if it has to spool to disk storage it'll take much longer than the hashing step.

PS: I just realised that 2 billion passwords is not actually that much data -- only 40 GB of hashes -- that's well within the range of what's "easy" to sort in-memory by simply creating an array of hashes that size and calling a standard library sort function.
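
For what it's worth, the straight in-memory version is only a few lines of C#: pack each hash into a comparable value type and let Array.Sort do the work. This assumes a machine with plenty of RAM (the struct pads to 24 bytes, so 2 billion entries is roughly 48 GB) and .NET 7+ for ReadExactly; file names are made up:

    using System;
    using System.Buffers.Binary;
    using System.IO;

    const int HashSize = 20;

    using var input = File.OpenRead("hashes-unsorted.bin");
    long count = input.Length / HashSize;   // must stay under ~2.1 billion (array length limit)
    var keys = new Sha1Key[count];          // >2 GB arrays are fine on modern 64-bit .NET

    var buf = new byte[HashSize];
    for (long i = 0; i < count; i++)
    {
        input.ReadExactly(buf, 0, HashSize);
        keys[i] = Sha1Key.FromBytes(buf);
    }

    Array.Sort(keys);   // plain standard-library sort, O(n log n)

    // Big-endian packing keeps integer comparison equivalent to lexicographic byte order.
    readonly record struct Sha1Key(ulong Hi, ulong Mid, uint Lo) : IComparable<Sha1Key>
    {
        public static Sha1Key FromBytes(ReadOnlySpan<byte> h) => new(
            BinaryPrimitives.ReadUInt64BigEndian(h),
            BinaryPrimitives.ReadUInt64BigEndian(h[8..]),
            BinaryPrimitives.ReadUInt32BigEndian(h[16..]));

        public int CompareTo(Sha1Key other)
        {
            int c = Hi.CompareTo(other.Hi);
            if (c != 0) return c;
            c = Mid.CompareTo(other.Mid);
            return c != 0 ? c : Lo.CompareTo(other.Lo);
        }
    }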


What other algorithms have you used? I'm really interested in big data streams. I would like to hear not only successful solutions, but also failed ones. Have you tried using Bloom filters? Is it possible to merge shards using a min-heap (k-way merge)?


Algorithm choice depends on what you're optimising for. The discussion a few years ago was about dozens of small web servers handling a large volume of password-change traffic (10K/sec!) that needed a cheap centralised service for checking against "known bad" passwords. On a cloud hosting platform, the optimal solution is a blob store of sorted binary hashes plus a small (~1 MB) precomputed index held in memory in the web server, which lets you query the main store with 1 or at most 2 reads. That's an optimal "one round trip" request, and the per-server overhead at the front end is negligible.
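
A sketch of that front-end lookup path, assuming the blob is reachable over plain HTTPS (e.g. a SAS URL, since blob storage supports ranged GETs) and that a block index like the one sketched earlier is already in memory; names and sizes are illustrative:

    using System;
    using System.Net.Http;
    using System.Net.Http.Headers;
    using System.Threading.Tasks;

    const int HashSize = 20;
    const int BlockBytes = 64 * 1024 / HashSize * HashSize;   // ~64 KB, record-aligned

    // One round trip: the in-memory index has already told us which block to fetch.
    async Task<bool> ContainsHashAsync(HttpClient http, string blobUrl, long blockIndex, byte[] target)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, blobUrl);
        long from = blockIndex * BlockBytes;
        request.Headers.Range = new RangeHeaderValue(from, from + BlockBytes - 1);

        using var response = await http.SendAsync(request);
        response.EnsureSuccessStatusCode();                   // expect 206 Partial Content
        byte[] block = await response.Content.ReadAsByteArrayAsync();

        // The block is a run of sorted 20-byte hashes; binary search within it.
        int lo = 0, hi = block.Length / HashSize - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            int cmp = block.AsSpan(mid * HashSize, HashSize).SequenceCompareTo(target);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }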

However, that approach assumes a slowly changing list where you don't mind that there's a delay of maybe a few hours to merge in a new list. Large lists of password leaks are infrequent, so this works fine for this use case.

A two-layer approach, with a small frequently updated hash set on top of the large infrequently rebuilt sorted list, is applicable to a wider range of problems.

Bloom filters are probabilistic, and aren't much faster to create than a sorted list. They also require 'k' random reads to test, where k is a small constant such as 3 to 5. If the filter is large (gigabytes), then you can't efficiently load it into many front-end web servers. If you query it over the network as a blob, you need 3-5x the I/O operations. You can wrap it in an API service that holds it in-memory, but that's still much more complex than simply storing and querying a blob in S3 or Azure Storage directly.
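
To make the "k random reads" point concrete, a minimal Bloom filter membership test looks something like this (the double-hashing construction and parameters are just one common choice, not any particular library's implementation):

    using System;
    using System.Security.Cryptography;

    // Each of the k probe positions is effectively random, so if the bit array lives in a
    // remote blob instead of local memory, a lookup costs k ranged reads instead of one.
    static bool MightContain(byte[] bits, byte[] item, int k)
    {
        byte[] h = SHA256.HashData(item);                   // derive two 64-bit base hashes
        ulong h1 = BitConverter.ToUInt64(h, 0);
        ulong h2 = BitConverter.ToUInt64(h, 8);
        ulong m = (ulong)bits.Length * 8;                   // number of bits in the filter

        for (int i = 0; i < k; i++)
        {
            ulong bit = (h1 + (ulong)i * h2) % m;           // i-th probe (double hashing)
            if ((bits[bit / 8] & (1 << (int)(bit % 8))) == 0)
                return false;                               // definitely not in the set
        }
        return true;                                        // maybe in the set (false positives)
    }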

"Clever" algorithms like min-heaps or whatever are likely not worth the trouble. My decade-old PC can sort 2 billion hashes in about 30 seconds using the Rust "Rayon" crate. There are cloud VMs available for about $2/hr that have about a terabyte of RAM that could sort any reasonable sized list (10s of billions) in a few minutes at most.

The original article mentions a week of 80 vCores of SQL Hyperscale, which is about $6,000 at PAYG rates!

Sure, developer time is expensive, blah blah blah, but waiting a week ain't cheap either, and these days an AI can bang out the code for a password file hashing and sorting utility in a minute or two.


Hi. Stefán here again, the other HIBP dev and first employee: https://www.troyhunt.com/have-i-been-pwned-employee-1-0-stef...

So, here's a small blurb to clear up some misunderstandings. Excuse the typos since I'm writing in a bit of a hurry.

Most of the data processing is actually done in CLI tools we created and not in Azure SQL HyperScale. That includes things like:

* Extracting email addresses (from either CSV or other delimited files). This can be problematic because it turns out the people who gather this data aren't always thinking about encoding etc., so very often we need to jump through hoops to get it to work mostly correctly. And when we see something like this huge breach, which contains multi-terabyte CSVs, you need a tool that is both fast and memory efficient. For this breach we actually wrote our own, since other tools we tried often choked with out-of-memory errors or simply ran too slowly. We will likely be open sourcing some of these tools.

* Extracting emails is one thing; extracting passwords is another, with totally different requirements.

Emails we need to extract in a case-insensitive way; for stealer logs we also need to parse the domains associated with the email. We need to hash the email because the hashes are used for k-anonymity purposes as well as for batching in internal processes.
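
As a rough illustration of that kind of normalisation and hashing (the trimming and UTF-8 encoding here are assumptions, not HIBP's published rules; the 6-char SHA1 prefix matches the k-anonymity lookups mentioned below):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    // Normalise an address, hash it, and derive the pieces described above.
    static (string Domain, string Sha1Hex, string KAnonPrefix) ProcessEmail(string email)
    {
        string normalised = email.Trim().ToLowerInvariant();          // case-insensitive handling
        string domain = normalised[(normalised.IndexOf('@') + 1)..];  // assumes a well-formed address

        byte[] hash = SHA1.HashData(Encoding.UTF8.GetBytes(normalised));
        string hex = Convert.ToHexString(hash);
        return (domain, hex, hex[..6]);                               // first 6 chars for k-anonymity
    }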

Passwords we also need to hash (SHA1 and NTLM) for Pwned Passwords, and that means making sure we use consistent encoding. We also need to dedupe AND count them for prevalence purposes.
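
As a sketch of what that looks like (the NTLM side is MD4 over the UTF-16LE bytes of the password; .NET has no built-in MD4, so that part is left as a placeholder, and at real scale you'd partition rather than keep one giant in-memory dictionary):

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;
    using System.Text;

    var prevalence = new Dictionary<string, long>();        // SHA1 hex -> how many times seen

    foreach (string password in ReadPasswords())            // hypothetical source of raw passwords
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(password);     // pick one encoding and use it everywhere
        string sha1 = Convert.ToHexString(SHA1.HashData(utf8));

        // NTLM would be MD4(Encoding.Unicode.GetBytes(password)) via a third-party MD4 implementation.

        prevalence.TryGetValue(sha1, out long n);           // dedupe AND count in one structure
        prevalence[sha1] = n + 1;
    }

    static IEnumerable<string> ReadPasswords() =>
        throw new NotImplementedException("stand-in for streaming passwords out of the breach files");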

This we can all mostly do without touching Azure SQL HyperScale.

Once we have prepared the data, it needs to be inserted into the DB. It's not a case of creating simple binary lookup tables because we have different requirements and have at least three different ways of looking up an email address.

1. The full email (alias + domain)

2. Domain lookups (just domain)

3. K-anonymity lookups (first 6 chars of the SHA1 of the full email address)

This means emails can't just be indexed on the alias and the domain (which we denormalize into a domain-id); we also need indexes on things like the SHA1 prefix, and we need to take into account when people have opted out of having their emails loaded.

Reasons for Azure SQL HyperScale: The email and domain search data used to be stored in Azure Table Storage. It was very convenient since it was fast to look up (partition keys and row keys) and cheap to store. There was one big drawback though: Azure Table Storage has no backup or point-in-time restore strategy. The only way to back up/restore data is to download it and reupload it as a restore mechanism. Which is easy enough, except downloading the data was starting to take a week, even running in a VM in the same datacenter as the Table Storage account. And for a service like Have I Been Pwned, if we had a disaster or messed up a breach load and had to roll back, taking everything offline or having the wrong data for a week is unacceptable.

That's where Azure SQL HyperScale came in. The reason it was picked is not that Troy is a Microsoft RD or that I'm an MVP. We simply picked it because we both know MS SQL very well, we have good access to people who know it even better than we do (for support purposes), and it has very good, tried-and-tested backup/restore scenarios.

We do know that there are certainly better DBs we could use, and it would probably be cheaper to run our own Postgres on our own hardware, or something along those lines. But since it's just two people actively working on this, and we hardly have time for development of new features and breach loads as it is, we simply couldn't spend valuable time learning the ins and outs of a new DB engine, what it takes to run/maintain/optimize it, and all the other SRE responsibilities that come with it.

So in the end, it came down to convenience and what our time is best spent on doing.

Rest assured though, with everything we learned processing this breach, we will be much quicker to process the next really large one: we've taken away a ton of learnings and built new tools and processes that we'll be implementing. I'd expect the next breach of this size to take just a couple of days to process. Most other breaches take a lot less since they are a fraction of the size of this one.

Binary files: Pwned Passwords is currently stored in blob storage, with just the first 5 chars of the hash in the filename and the rest in a line-delimited, ordered fashion. I have already done some tests on making them binary files (since the hashes are always a fixed size and the prevalence is just an int). So we could technically have each hash entry be 17 bytes (rest of the hash) + 4 bytes for the prevalence (unsigned 32-bit int), so just 21 bytes per hash entry, and we skip newlines. And we might actually go that route in the not too distant future since it's easy to do.
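
Reading and writing that kind of fixed-width record is only a few lines. Here's a rough sketch; the 17-byte suffix length is taken from the paragraph above, and the exact split depends on how many hex chars of the hash end up in the blob name:

    using System;
    using System.Buffers.Binary;
    using System.IO;

    static class PwnedRecord
    {
        const int SuffixBytes = 17;                  // "rest of the hash", per the paragraph above
        public const int Size = SuffixBytes + 4;     // + unsigned 32-bit prevalence = 21 bytes, no newlines

        public static void Write(Stream s, ReadOnlySpan<byte> suffix, uint prevalence)
        {
            Span<byte> rec = stackalloc byte[Size];
            suffix[..SuffixBytes].CopyTo(rec);       // expects at least 17 suffix bytes
            BinaryPrimitives.WriteUInt32LittleEndian(rec[SuffixBytes..], prevalence);
            s.Write(rec);
        }

        // Fixed-width records mean entry i sits at byte offset i * Size, so a sorted file can
        // be binary searched directly, with no line parsing.
        public static (byte[] Suffix, uint Prevalence) Read(Stream s, long index)
        {
            s.Seek(index * Size, SeekOrigin.Begin);
            Span<byte> rec = stackalloc byte[Size];
            s.ReadExactly(rec);                      // .NET 7+
            return (rec[..SuffixBytes].ToArray(),
                    BinaryPrimitives.ReadUInt32LittleEndian(rec[SuffixBytes..]));
        }
    }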

Hope that clears up some of our thoughts here :) I'm planning on writing a blog post soon with most of the things we learned, so that might shed further light on how we process this.


All fair points and obviously well thought out by knowledgeable people. I approach problem solving the same way, accounting for staff familiarity with tools, not just the wall-clock time taken for some esoteric approach.

My observation is that in the last year or so the relative weight of these contributions has shifted massively because of AI code authoring.

It’s so fast and easy to whip up a few hundred lines of code with something like Gemini Pro 2.5 that I got it to make me a sorting benchmark tool just so I’d have a data point for a comment in this thread! I never would have had the time years ago.

For relatively “small” and isolated problems like password hash lookup tables, it’s amazing what you can do in mere hours with AI assistance.

If I was approaching this same problem just two years ago I would have picked SQL Hyperscale too, for the same(ish) reasons.

Now? I feel like many more avenues have been opened up…



