> Failing to even get the O(n log n) solution tells me that someone should never have graduated.
This is something I wonder about in my comment sidethread - to me, the natural solution is the O(n) one in O(n) space.
I see that the O(n log n) solution requires O(1) space to run. But it requires O(n) space to return a result! Is that actually better than O(n) time and O(n) space?
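For concreteness, a minimal sketch of that hash-set approach in Python, assuming the task is roughly "print every line of file A that does not appear in file B" (the file handling and exact problem statement here are my assumptions, not the article's):

    import sys

    def diff_lines(path_a: str, path_b: str) -> None:
        """Print every line of A that does not appear in B: O(n) time, O(n) space."""
        # Build a hash set of B's lines (this is the O(n) extra space).
        with open(path_b, encoding="utf-8", errors="replace") as f:
            seen = set(f)

        # One pass over A with O(1) expected-time membership tests.
        with open(path_a, encoding="utf-8", errors="replace") as f:
            for line in f:
                if line not in seen:
                    sys.stdout.write(line)

    if __name__ == "__main__":
        diff_lines(sys.argv[1], sys.argv[2])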
With the sorting solution, you can stream the result directly to a file _without_ holding the entire file in memory. So it can be better if you are memory constrained (e.g. if the files are bigger than what you can hold in memory), or if you want to hold the working set inside the CPU L3 cache as 'cperciva is alluding to.
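To illustrate the streaming point, a sketch under the assumption that both files have already been sorted (e.g. by an external sort) and the task is again "lines of A not in B"; only a couple of lines are ever held in memory:

    import sys

    def diff_sorted(path_a: str, path_b: str, out=sys.stdout) -> None:
        """Stream lines of sorted A that are absent from sorted B, in O(1) extra space."""
        with open(path_a) as fa, open(path_b) as fb:
            b_line = fb.readline()
            for a_line in fa:
                # Advance B until it catches up with the current A line.
                while b_line and b_line < a_line:
                    b_line = fb.readline()
                # If B never produced an equal line, A's line is part of the result.
                if a_line != b_line:
                    out.write(a_line)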
Either of those solutions is OK coming from a junior developer.
From a senior developer, I would expect a discussion of in-core vs. out-of-core operations, and the fact that hash tables (even if they fit into RAM) aren't really O(1) time per operation since random access to large address spaces runs into e.g. TLB misses.
Given that the depth of the page table tree is fixed at 3/4/5 depending on the architecture and kernel, I think most people would be confused if asked whether the average complexity of a hash table lookup is always really O(1).
Even senior developers that understand the TLB would likely think of it as "O(1), but with a big constant factor".
On a side note, one thing the article doesn't mention is the possibility of using a Bloom filter. It trades an extra scan of a file for a smaller hash table (or a smaller set to handle with an external sort).
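For anyone who hasn't seen the trick, a rough sketch of that trade-off in Python (the filter size, hash count, and exact problem statement are arbitrary assumptions on my part):

    import sys
    import hashlib

    class BloomFilter:
        """Minimal Bloom filter; k bit positions derived from salted blake2b digests."""

        def __init__(self, num_bits: int, num_hashes: int):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8 + 1)

        def _positions(self, item: str):
            for i in range(self.num_hashes):
                digest = hashlib.blake2b(item.encode(), salt=i.to_bytes(8, "little")).digest()
                yield int.from_bytes(digest[:8], "little") % self.num_bits

        def add(self, item: str) -> None:
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item: str) -> bool:
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

    def bloom_diff(path_a: str, path_b: str, out=sys.stdout) -> None:
        """Lines of A not in B; the Bloom filter keeps the exact hash set small."""
        bloom = BloomFilter(num_bits=8 * 1024 * 1024, num_hashes=4)

        # Pass 1: record B's lines in the filter.
        with open(path_b) as f:
            for line in f:
                bloom.add(line)

        # Pass 2: lines of A that miss the filter are definitely not in B.
        candidates = set()
        with open(path_a) as f:
            for line in f:
                if bloom.might_contain(line):
                    candidates.add(line)  # "maybe in B" -- needs an exact check
                else:
                    out.write(line)

        # Pass 3 (the extra scan): drop candidates that really are in B.
        with open(path_b) as f:
            for line in f:
                candidates.discard(line)
        for line in candidates:
            out.write(line)

The exact set now only holds the lines of A that hit the filter, instead of every line of B; that's the smaller-hash-table half of the trade, paid for with the second scan of B.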
Yup, I was surprised Bloom filters weren't even mentioned in passing. I suck, but it's still crazy to see that some people would genuinely have a problem getting the O(n) solution when they've been in the industry for more than 5-10 years.
Whether you get a cache miss or a hit, or whether those bytes are fetched over a WAN, is part of 'C'. Big oh of one just means lookup time is bounded by a constant that is independent of data size. The big oh doesn't change if you move from an SSD to a tape drive.
Thus, those operations really are O(1).
Big oh is an upper bound: a function f is big-oh of g if there is a constant C such that, for all sufficiently large N, the runtime f(N) is at most C times g(N). (Paraphrasing there a bit, and on a phone, apologies for not pasting the exact definition.)
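The exact definition, for reference:

    f(n) = O(g(n)) \iff \exists\, C > 0,\ \exists\, n_0 \ \text{such that}\ f(n) \le C \cdot g(n) \ \text{for all } n \ge n_0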
In a formal sense, anything you can compute on a computer is O(1) since there's only a finite number of states. But we tend to hand-wave that away.
Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
> In a formal sense, anything you can compute on a computer is O(1)
I don't think the definition of Big-Oh supports that conclusion.
"Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity." [1]
"For example, the time (or the number of steps) it takes to complete a problem of size n might be found to be T(n) = 4n^2 − 2n + 2.... As n grows large, the n^2 term will come to dominate,... say that the algorithm has order of n^2 time complexity" [1]
> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
Indeed, but I can choose a value "C" that is large enough to negate any of the differences between a cache miss and a hit.
Let's say "N" is the number of lines in one of our text files. Once we have put that data into a hashtable, the claim is then that lookup time is O(1). You counter, and state, no, the lookup time is potentially 30ms, or even 100ms (making up the numbers). For my claim that the lookup time of O(1) to be true, I can pick any value of 'C'. So I choose 1 million. I'm basically saying, lookup times will always be under 1 hour. Given our precondition that we have built this hashtable, no matter how we get the data, yes - it will be under 1 hour. If we used carrier pigeon with USB stick, we could just choose a larger value for C and still be correct.
The idea is basically: if you have something like n^2 + n, eventually, when n is large enough, n^2 becomes so dominant that you can multiply it by a fixed value to wash away all the other terms in the expression.
Thus, while the actual lookup time can and will vary, it is still bounded. Keep in mind as well that there are different functions for best-case and average-case performance. These notations do take into account that the per-instance lookup time can vary.
> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
Wait, don't the inner nodes of the page table contain the _physical_ addresses of their child nodes? Wouldn't that make it at most one TLB miss per hash table lookup (assuming key and value are adjacent in the hash table bucket)?
Bounded by a constant does not mean constant lookup time.
Let's say the runtime of some function is (x1·n + x2·n + x3·n), where x1, x2, and x3 represent the L1, L2, and L3 cache lookup times.
A function F is called big-oh of a function G if there exists a value C such that C·G(N) is at least F(N) for all sufficiently large values of N.
For the above example, let's say I choose the value one billion for C. Is it true that one billion times n is greater than (x1 + x2 + x3)·n? The answer is yes; thus the runtime (x1 + x2 + x3)·n is big-oh of n. In other words, at a certain size of n, we can pick a constant multiple of n that dominates the other terms. This is in part why asymptotic functions are independent of hardware. Even if L3 cache were actually virtual and fetched over a network, I could still just choose a larger constant. If you can't choose such a constant, then chances are you need a faster-growing function like n-squared.
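Written out, that claim is just:

    (x_1 + x_2 + x_3)\, n \le C \cdot n \quad \text{for all } n \ge 1 \ \text{whenever } C \ge x_1 + x_2 + x_3, \ \text{so the runtime is } O(n).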
It depends how you define "TLB miss". Read it as "paging table reads" if that's clearer.
If you have a 64-entry TLB and 4kB pages, then you can do random memory accesses (aka hash table accesses) up to 256 kB without any paging table reads; up to ~16 MB with one paging table read per operation; and up to ~1GB with two paging table reads per operation. (This, of course, is why large databases use superpages when possible; but while that pushes the limits back it doesn't eliminate them.)
In virtual machines there's further overhead since you have "virtual physical" addresses...
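A back-of-the-envelope check on the first of those numbers (the ~16 MB and ~1 GB figures depend on how much of the page-table walk stays cached, so I'm only reproducing the simple part; the page sizes below are the usual x86-64 values):

    # TLB reach: how much memory 64 TLB entries cover with no paging table reads.
    TLB_ENTRIES = 64
    PAGE_SIZE = 4 * 1024               # 4 kB base pages
    SUPERPAGE_SIZE = 2 * 1024 * 1024   # 2 MB superpages

    print(TLB_ENTRIES * PAGE_SIZE)       # 262144 bytes   = 256 kB of random access
    print(TLB_ENTRIES * SUPERPAGE_SIZE)  # 134217728 bytes = 128 MB with superpages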
I'd consider talking about TLB misses a junior-engineer thing. That's stuff you learn in school.
I'd expect senior engineers to talk about how they will monitor this system, and to bring up things like customers we know we don't trust.
A senior engineer might tell you that running into TLB issues suggests your approach is wrong and that it's about time to move the solution to your data warehouse.