The Chernoff bound needed in this work can be derived from the binomial distribution (with Stirling's approximation).
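For intuition, here is a rough sketch of the kind of derivation involved (my notation, not the paper's). For $X \sim \mathrm{Binomial}(n, \tfrac12)$,

$$\Pr\Big[X \ge \big(\tfrac12+\varepsilon\big)n\Big] = \sum_{k \ge (\frac12+\varepsilon)n} \binom{n}{k} 2^{-n} \le n \binom{n}{(\frac12+\varepsilon)n} 2^{-n} \approx 2^{-n\left(1-H(\frac12+\varepsilon)\right)} \le e^{-2\varepsilon^2 n},$$

where the middle step uses Stirling in the form $\binom{n}{\alpha n} \approx 2^{nH(\alpha)}$ up to polynomial factors ($H$ is the binary entropy), and the last step uses the Taylor bound $1 - H(\tfrac12+\varepsilon) \ge 2\varepsilon^2/\ln 2$.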
I have worked on pairwise independent hash functions for a decade, and every time I introduce such a function, it feels like magic. The notion of pairwise independence is easy to explain, but the notion of a pairwise independent hash function isn't.
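For anyone who wants something concrete, here is a sketch of the textbook construction over a prime field (illustrative only; the names and the choice of prime are mine, and this is not necessarily the family any particular paper uses):

    import random

    P = (1 << 61) - 1  # a Mersenne prime, chosen larger than the key universe

    def sample_hash(m):
        # h_{a,b}(x) = ((a*x + b) mod P) mod m. Over Z_P, for x != y the
        # pair (h(x), h(y)) is uniformly distributed, which is exactly
        # pairwise independence; the final "mod m" makes it only
        # approximately so.
        a = random.randrange(P)
        b = random.randrange(P)
        return lambda x: ((a * x + b) % P) % m

Sampling a and b picks one function from the family; pairwise independence is a property of that random choice, not of any single function -- which is perhaps part of why it feels like magic.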
The other strength of our work is that it works in general settings, including sets for which pairwise independent hash functions are not known. Please see: https://dl.acm.org/doi/10.1145/3452021.3458333
You are indeed right; the while loop has the added advantage of making the estimator unbiased -- i.e., it not only gives strong (epsilon, delta)-guarantees but also returns the correct count in expectation.
It wasn't easy to see that the loop would have this added benefit -- that's where Knuth's ingenuity comes in.
Yes, there is an error in the Quanta article [at the same time, I must add that writing popular science articles is very hard, so it would be wrong to blame them].
Your fix is indeed correct; we may want a while loop instead of "if len(mem) == thresh", as there is a very small (but non-zero) probability that the length of mem is still thresh after executing:
mem = [m for m in mem if np.random.rand() < 0.5]
["While" was Knuth's idea; and has added benefit of providing unbiased estimator.]
Didn’t realize you were here, so let me be clear: I did overall find the paper approachable enough that I could implement it with only a couple of outside pointers (plus a little clever implementation optimization around storing p, if you’re curious). The above should be read more as “even this well-written, simplified paper is not necessarily trivial for practitioners to understand”. So more of a general point about academic obscurity.
> But yes, our theorems can be reworked to estimate the confidence/error rate
I think that’s useful for practical implications. Also, for practical use, how does one decide the tradeoff between delta and epsilon? Perhaps it’s covered elsewhere, but I have a hard time intuiting their relationship.
I fully agree with you and this is indeed one of my criticisms of modern academic writing -- we tend to write papers that are just very hard for anyone to read.
So delta refers to the confidence, i.e., how often you are willing to be wrong, and epsilon is the tolerance with respect to the actual count.
We have found that, in general, setting delta=0.1 and epsilon=0.8 works fine for most practical applications.
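To make the tradeoff concrete: both parameters enter through the buffer size, which scales roughly like (1/epsilon^2) * log(m/delta) for a stream of length m. So tightening epsilon is quadratically expensive while tightening delta is only logarithmically expensive, which is why a loose epsilon with a tight delta is often the pragmatic choice. A hedged sketch, with C a placeholder constant (see the paper for the exact theorem):

    import math

    def buffer_size(eps, delta, m, C=12):
        # Indicative scaling only: thresh ~ (C / eps^2) * log(m / delta).
        # C and the exact argument/base of the log are placeholders for
        # the constants in the paper's theorem statement.
        return math.ceil((C / eps**2) * math.log2(m / delta))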
https://dl.acm.org/doi/10.1145/3452021.3458333