
In a formal sense, anything you can compute on a computer is O(1) since there's only a finite number of states. But we tend to hand-wave that away.

Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.
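
For instance, here's a quick sketch (my code, not a claim about any particular hardware) of how the "constant" in a Python dict lookup drifts upward as the table spills out of cache:

    # A sketch: the dict lookup is O(1), but the measured constant grows
    # once the table no longer fits in cache. Sizes, probe count, and
    # results are arbitrary and machine-dependent.
    import random
    import time

    def avg_lookup_ns(n, probes=500_000):
        table = {i: i for i in range(n)}
        keys = [random.randrange(n) for _ in range(probes)]
        start = time.perf_counter()
        total = 0
        for k in keys:
            total += table[k]      # the "O(1)" lookup being timed
        elapsed = time.perf_counter() - start
        return elapsed / probes * 1e9

    for n in (1_000, 100_000, 2_000_000):
        print(f"{n:>9} entries: {avg_lookup_ns(n):6.1f} ns/lookup")

On most machines the per-lookup time climbs noticeably across these sizes while still staying bounded, which is exactly the tension being described.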



> In a formal sense, anything you can compute on a computer is O(1)

I don't think the definition of Big-Oh supports that conclusion.

"Big O notation is a mathematical notation that describes the limiting behavior of a function when the argument tends towards a particular value or infinity." [1]

"For example, the time (or the number of steps) it takes to complete a problem of size n might be found to be T(n) = 4n^2 − 2n + 2.... As n grows large, the n^2 term will come to dominate,... say that the algorithm has order of n^2 time complexity" [1]

[1] https://en.wikipedia.org/wiki/Big_O_notation

> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.

Indeed, but I can choose a value "C" that is large enough to negate any of the differences between a cache miss and a cache hit.

Let's say "N" is the number of lines in one of our text files. Once we have put that data into a hashtable, the claim is then that lookup time is O(1). You counter, and state, no, the lookup time is potentially 30ms, or even 100ms (making up the numbers). For my claim that the lookup time of O(1) to be true, I can pick any value of 'C'. So I choose 1 million. I'm basically saying, lookup times will always be under 1 hour. Given our precondition that we have built this hashtable, no matter how we get the data, yes - it will be under 1 hour. If we used carrier pigeon with USB stick, we could just choose a larger value for C and still be correct.

The idea is basically: if you have something like n^2 + n, eventually, when n is large enough, the n^2 term becomes so dominant that you can multiply it by a fixed value to wash away all the other terms in the expression.

Thus, while the actual lookup time can and will vary, it is still bounded. Keep in mind as well that there are separate functions for best-case and average-case performance. These notations do take into account that the per-instance lookup time can vary.
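
To make the "pick a constant" step concrete, here's a toy check using the T(n) = 4n^2 - 2n + 2 example quoted from Wikipedia upthread (my choice of C, nothing canonical about it):

    # Toy check of the "choose a big enough C" argument: once n is large
    # enough, C * n^2 dominates T(n), so the lower-order terms never
    # threaten the bound.
    def t(n):
        return 4 * n**2 - 2 * n + 2   # the T(n) from the Wikipedia example

    C = 5
    for n in (10, 1_000, 1_000_000):
        print(n, t(n) <= C * n**2)    # True for all sufficiently large n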


> Hash table lookup is absolutely bounded by a constant, but going from L1 to L2 to L3 to RAM with anywhere between 0 and 5 TLB misses is far from constant.

Wait, don't the inner nodes of the page table contain the _physical_ addresses of their child nodes? Wouldn't that make it at most one TLB miss per hash table lookup (assuming the key and value are adjacent in the hash table bucket)?


Bounded by a constant does not mean constant lookup time.

Let's say the runtime of some function is (x1·n + x2·n + x3·n), where x1, x2, and x3 represent the L1, L2, and L3 cache lookup times.

A function F is called big-oh of a function G if there exists a constant C such that C·G(n) is greater than or equal to F(n) for all sufficiently large values of n.

For the above example, let's say I choose the value one billion for C. Is it true that one billion times n will be greater than the sum (x1 + x2 + x3)·n? The answer is yes; thus (x1 + x2 + x3)·n is big-oh of f(n) = n. In other words, at a certain size of n, we can pick a constant multiple that dominates the other terms. This is in part why asymptotic bounds are independent of hardware. Even if the L3 cache were actually virtual and fetched over a network, I could still just choose a larger constant. If you can't choose such a constant, then chances are you need a faster-growing function like n-squared.
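
Spelled out with made-up latencies (my placeholders, not measured values):

    # Sketch of the argument above. x1, x2, x3 stand in for L1/L2/L3
    # costs in ns; the values are arbitrary.
    x1, x2, x3 = 1, 4, 40
    C = 1_000_000_000
    for n in (1, 10**3, 10**9):
        assert (x1 + x2 + x3) * n <= C * n   # holds for every n >= 1
    print("(x1 + x2 + x3) * n is O(n) with witness C =", C)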


It depends how you define "TLB miss". Read it as "paging table reads" if that's clearer.

If you have a 64-entry TLB and 4kB pages, then you can do random memory accesses (aka hash table accesses) up to 256 kB without any paging table reads; up to ~16 MB with one paging table read per operation; and up to ~1GB with two paging table reads per operation. (This, of course, is why large databases use superpages when possible; but while that pushes the limits back it doesn't eliminate them.)

In virtual machines there's further overhead since you have "virtual physical" addresses...
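
For what it's worth, the 256 kB threshold falls straight out of TLB-coverage arithmetic, and the same arithmetic shows why superpages push the limits back (a sketch with simplifying assumptions: a single 64-entry TLB level, 4 kB base pages, 2 MB superpages, no paging-structure caches):

    # Back-of-envelope TLB coverage for the numbers above.
    TLB_ENTRIES = 64
    PAGE_4K = 4 * 1024
    PAGE_2M = 2 * 1024 * 1024      # a common superpage size on x86-64

    print("4 kB pages:", TLB_ENTRIES * PAGE_4K // 1024, "kB covered by the TLB")
    print("2 MB pages:", TLB_ENTRIES * PAGE_2M // (1024 * 1024), "MB covered")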



