If this is considered dehumanizing, I'm concerned for the future of our industry. Anyone with even a first undergraduate algorithms course should immediately spit out "sort by customer and page, then stream the output".
I wouldn't expect everyone to immediately see the optimizations -- you can sort the days individually, you can drop all but a small number of pages per customer -- but I would be disappointed if someone couldn't be hinted there by the end of a 30-minute interview.
Failing to even get the O(n log n) solution tells me that someone should never have graduated.
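For concreteness, here's a minimal sketch of that sort-then-stream approach in Python. It assumes the log has already been parsed into (customer, page) pairs, and the per-customer condition is a placeholder, since the original problem statement isn't quoted in this thread:

    import sys
    from itertools import groupby
    from operator import itemgetter

    def sort_and_stream(rows, out=sys.stdout):
        # rows: iterable of (customer, page) tuples -- an assumed input shape.
        # Sort by customer, then page, so each customer's records are contiguous.
        # In practice the sort would be an external sort (e.g. the Unix sort
        # utility), so nothing has to fit in memory at once.
        ordered = sorted(rows, key=itemgetter(0, 1))
        for customer, group in groupby(ordered, key=itemgetter(0)):
            distinct_pages = {page for _, page in group}
            # Placeholder predicate; the real condition depends on the question.
            if len(distinct_pages) >= 2:
                out.write(f"{customer}\n")

The point of the structure is that each customer's rows arrive together, so results can be emitted one customer at a time.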
> If this is considered dehumanizing, I'm concerned for the future of our industry. Anyone with even a first undergraduate algorithms course should immediately spit out "sort by customer and page, then stream the output".
Well yes, but that's not what this is about. It's about social pressure, not ability.
I could program this for you right now if I wanted to; most "optimisations" seemed the "obvious ones" to me, so I guess I would have passed. I could also program it in a hurry at ludicrous speed under pressure because all of production is down or whatever.
But ... I wouldn't be able to program this in an interview setting.
Last time I was asked a bunch of even easier things (in a timed interview, no less), where I literally have some exact code on my GitHub. I just couldn't finish it. Not because I can't "get" it, but because programming while two strangers are judging your every keystroke and commenting on what you're doing is just nerve-wracking.
> Failing to even get the O(n log n) solution tells me that someone should never have graduated.
This is something I wonder about in my comment sidethread - to me, the natural solution is the O(n) one in O(n) space.
I see that the O(n log n) solution requires O(1) space to run. But it requires O(n) space to return a result! Is that actually better than O(n) time and O(n) space?
With the sorting solution, you can stream the result directly to a file _without_ holding the entire file in memory. So it can be better if you are memory constrained (e.g. if the files are bigger than what you can hold in memory), or if you want to hold the working set inside the CPU L3 cache as 'cperciva is alluding to.
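For contrast, a minimal sketch of the hash-based O(n)-time, O(n)-space version, using the same hypothetical (customer, page) input and placeholder predicate as the sorting sketch above:

    from collections import defaultdict

    def hash_based(rows):
        # O(n) time, but the per-customer page sets live entirely in memory.
        pages_by_customer = defaultdict(set)
        for customer, page in rows:
            pages_by_customer[customer].add(page)
        # Same placeholder predicate as in the sorting sketch.
        return [c for c, pages in pages_by_customer.items() if len(pages) >= 2]

Here the whole working set sits in RAM, which is exactly the trade-off under discussion.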
Either of those solutions is fine coming from a junior developer.
From a senior developer, I would expect a discussion of in-core vs. out-of-core operations, and the fact that hash tables (even if they fit into RAM) aren't really O(1) time per operation since random access to large address spaces runs into e.g. TLB misses.
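One crude way to see the effect is to time random lookups in tables of increasing size. This is only a sketch -- Python's interpreter overhead masks much of it (a C version is starker), and the sizes and numbers are entirely machine-dependent:

    import random
    import time

    def time_lookups(n, probes=1_000_000):
        # Build a dict with n integer keys, then time random lookups.
        # Once the table outgrows cache/TLB reach, per-lookup time creeps up
        # even though the lookup is nominally O(1).
        table = {i: i for i in range(n)}
        keys = [random.randrange(n) for _ in range(probes)]
        start = time.perf_counter()
        total = 0
        for k in keys:
            total += table[k]
        elapsed = time.perf_counter() - start
        print(f"n={n:>11,}  {elapsed / probes * 1e9:7.1f} ns/lookup  (checksum {total})")

    for n in (10_000, 1_000_000, 10_000_000):
        time_lookups(n)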
Given that the depth of the page table tree is fixed at 3/4/5 depending on the architecture and kernel, I think most people would be confused if asked whether the average complexity of a hash table lookup is always really O(1).
Even senior developers that understand the TLB would likely think of it as "O(1), but with a big constant factor".
On a side note, one thing the article doesn't mention is the possibility of using a Bloom filter. It trades an extra scan of a file for smaller hash tables (or a smaller set to feed into an external sort).
Yup, I was surprised Bloom filters weren't even mentioned in passing. I suck, but it's still crazy to see that some people would genuinely have a problem getting the O(n) solution when they've been in the industry for more than 5-10 years.
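For anyone unfamiliar, here's a hand-rolled, illustrative-only Bloom filter sketch of the pre-filter idea (sizes and hash count are arbitrary, not tuned):

    import hashlib

    class BloomFilter:
        """Tiny illustrative Bloom filter -- not production-grade."""

        def __init__(self, size_bits=1 << 23, num_hashes=4):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)

        def _positions(self, item):
            # Derive several bit positions from one SHA-256 digest.
            digest = hashlib.sha256(item.encode()).digest()
            for i in range(self.num_hashes):
                chunk = digest[i * 4:(i + 1) * 4]
                yield int.from_bytes(chunk, "big") % self.size

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            # False positives are possible; false negatives are not.
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

A first pass over one file populates the filter with the customers seen; the pass over the other file skips anything might_contain() rejects, so the exact hash tables (or the external sort input) only ever hold candidates -- at the cost of an extra scan and some false positives.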
Whether you get a cache miss or a hit -- or whether those bytes are fetched over a WAN -- is part of 'c'. Big oh of one just means the lookup time is bounded by a constant, independent of data size. The big oh doesn't change if you move from an SSD to a tape drive.
Thus, those operations are really O(1).
Big oh is an upper bound: a function is big-oh of another if, for all sufficiently large N, its runtime is no larger than C times that other function, for some constant C. (Paraphrasing there a bit, and on a phone, apologies for not pasting the exact definition.)
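For the record, the standard definition is:

    f(n) = O(g(n)) \iff \exists\, C > 0,\ n_0 \ \text{such that}\ f(n) \le C \cdot g(n)\ \text{for all}\ n \ge n_0.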
In a formal sense, anything you can compute on a computer is O(1) since there's only a finite number of states. But we tend to hand-wave that away.
Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
> In a formal sense, anything you can compute on a computer is O(1)
I don't think the definition of Big-Oh supports that conclusion.
"Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity." [1]
"For example, the time (or the number of steps) it takes to complete a problem of size n might be found to be T(n) = 4n^2 − 2n + 2.... As n grows large, the n^2 term will come to dominate,... say that the algorithm has order of n^2 time complexity" [1]
> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
Indeed, but I can choose a value "C" that is large enough to negate any of the differences between cache miss or hit.
Let's say "N" is the number of lines in one of our text files. Once we have put that data into a hashtable, the claim is then that lookup time is O(1). You counter, and state, no, the lookup time is potentially 30ms, or even 100ms (making up the numbers). For my claim that the lookup time of O(1) to be true, I can pick any value of 'C'. So I choose 1 million. I'm basically saying, lookup times will always be under 1 hour. Given our precondition that we have built this hashtable, no matter how we get the data, yes - it will be under 1 hour. If we used carrier pigeon with USB stick, we could just choose a larger value for C and still be correct.
The idea is basically: if you have something like n^2 + n, eventually when n is large enough, the n^2 term will become so dominant that you can multiply it by a fixed value to wash away all the other terms in the expression.
Thus, while the actual lookup time can and will vary, it is still bounded. Keep in mind as well there are different functions for best case and average case performance. These notations do take into account that the per-instance lookup time can vary.
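Spelled out as an inequality, the claim is just that if every lookup -- however it is served -- finishes within some fixed worst case T_max (one hour in the example above), then:

    \text{lookup}(n) \le T_{\max} = T_{\max} \cdot 1 \quad\Rightarrow\quad \text{lookup}(n) = O(1).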
> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
Wait, don't the inner nodes of the page table contain the _physical_ addresses of their child nodes? Wouldn't that make it at most one TLB miss per hash table lookup (assuming key and value are adjacent in the hash table bucket)?
Bounded by a constant does not mean constant lookup time.
Let's say the runtime of some function is (x1·n + x2·n + x3·n), where x1, x2, and x3 represent the L1, L2, and L3 cache lookup times.
A function F is called big-oh of a function G if there exists a constant C such that C·G is greater than F for all sufficiently large values of N.
For the above example, let's say I choose one billion for C. Is it true that one billion times n is greater than (x1+x2+x3)·n for all sufficiently large n? The answer is yes; thus (x1+x2+x3)·n is O(n). In other words, at a certain size of n, we can pick a constant multiple that dominates the other terms. This is in part why asymptotic bounds are independent of hardware. Even if the L3 cache were actually virtual and fetched over a network, I could still just choose a larger constant. If you can't choose such a constant, then chances are you need a faster-growing function like n-squared.
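Concretely, any C at least as large as x1 + x2 + x3 already works; the billion is overkill:

    (x_1 + x_2 + x_3)\,n \le C\,n \ \text{for all } n \ge 1 \ \text{whenever } C \ge x_1 + x_2 + x_3, \ \text{so}\ (x_1 + x_2 + x_3)\,n = O(n).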
It depends how you define "TLB miss". Read it as "paging table reads" if that's clearer.
If you have a 64-entry TLB and 4kB pages, then you can do random memory accesses (aka hash table accesses) up to 256 kB without any paging table reads; up to ~16 MB with one paging table read per operation; and up to ~1GB with two paging table reads per operation. (This, of course, is why large databases use superpages when possible; but while that pushes the limits back it doesn't eliminate them.)
In virtual machines there's further overhead since you have "virtual physical" addresses...
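(The first figure is just TLB reach:

    64 \ \text{entries} \times 4\,\text{KiB per page} = 256\,\text{KiB},

so a working set up to 256 kB can keep every page resident in the TLB; beyond that, some random accesses must walk the paging tables.)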
I'd consider talking about TLB misses a junior engineer thing. That's stuff you learn in school.
I'd expect senior engineers to talk about how they will monitor this system, and to bring up things like customers we know we don't trust.
A senior engineer might tell you that running into TLB issues suggests your approach is wrong, and that it's about time to move the solution to your data warehouse.
I have a non-traditional background (non-CS), and while I understand complexity, I wouldn't have immediately arrived at the solution like this.
Rather, I think I would have done well because, after doing hundreds of these as both interviewer and interviewee, I know that reaching for some sort of Map along with some vague rumblings about a "space/time complexity tradeoff" is very often the solution to these types of questions. So I would have gone there first by instinct.