The Chernoff bound needed in this work can be derived from the binomial distribution (with Stirling's approximation).
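For intuition, here is a rough sketch of the kind of derivation involved (my notation, not the paper's). For $X \sim \mathrm{Binomial}(n, \tfrac12)$,

$$\Pr\Big[X \ge \big(\tfrac12+\varepsilon\big)n\Big] = \sum_{k \ge (\frac12+\varepsilon)n} \binom{n}{k} 2^{-n} \le n \binom{n}{(\frac12+\varepsilon)n} 2^{-n} \approx 2^{-n\left(1-H(\frac12+\varepsilon)\right)} \le e^{-2\varepsilon^2 n},$$

where the middle step uses Stirling in the form $\binom{n}{\alpha n} \approx 2^{nH(\alpha)}$ up to polynomial factors ($H$ is the binary entropy), and the last step uses the Taylor bound $1 - H(\tfrac12+\varepsilon) \ge 2\varepsilon^2/\ln 2$.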
I have worked on pairwise independent hash functions for a decade, and every time I introduce such a function, it feels like magic. The notion of pairwise independence is easy to explain, but the notion of a pairwise independent hash function isn't.
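For anyone who wants something concrete, here is a sketch of the textbook construction over a prime field (illustrative only; the names and the choice of prime are mine, and this is not necessarily the family any particular paper uses):

    import random

    P = (1 << 61) - 1  # a Mersenne prime, chosen larger than the key universe

    def sample_hash(m):
        # h_{a,b}(x) = ((a*x + b) mod P) mod m. Over Z_P, for x != y the
        # pair (h(x), h(y)) is uniformly distributed, which is exactly
        # pairwise independence; the final "mod m" makes it only
        # approximately so.
        a = random.randrange(P)
        b = random.randrange(P)
        return lambda x: ((a * x + b) % P) % m

Sampling a and b picks one function from the family; pairwise independence is a property of that random choice, not of any single function -- which is perhaps part of why it feels like magic.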
The other strength of our work is that it works in general settings, including sets for which pairwise independent hash functions are not known. Please see: https://dl.acm.org/doi/10.1145/3452021.3458333
You are indeed right; the while loop has the added advantage of making the estimator unbiased -- i.e., it not only gives strong (epsilon, delta)-guarantees but also returns the correct count in expectation.
It wasn't easy to see that the loop would have this added benefit -- that's where Knuth's ingenuity comes in.
Yes, there is an error in the Quanta article [at the same time, I must add that writing popular science articles is very hard, so it would be wrong to blame them].
Your fix is indeed correct; we may want a while loop instead of "if len(mem) == thresh", as there is a very small (but non-zero) probability that the length of mem is still thresh after executing:
mem = [m for m in mem if np.random.rand() < 0.5]
["While" was Knuth's idea; and has added benefit of providing unbiased estimator.]
Didn’t realize you were here, so let me be clear: I did overall find the paper approachable enough that I could implement it with only a couple of outside pointers (plus a little clever implementation optimization around storing p, if you’re curious). The above should be read more as “even this well-written, simplified paper is not necessarily trivial for practitioners to understand”. So more of a general point about academic obscurity.
> But yes, our theorems can be reworked to estimate the confidence/error rate
I think that’s useful for practical implications. Also, for practical use, how does one decide the tradeoff between delta and epsilon? Perhaps it’s covered elsewhere, but I have a hard time intuiting their relationship.
I fully agree with you and this is indeed one of my criticisms of modern academic writing -- we tend to write papers that are just very hard for anyone to read.
So delta refers to the confidence, i.e., how often you are willing to be wrong, and epsilon is the tolerance with respect to the actual count.
We have found that, in general, setting delta=0.1 and epsilon=0.8 works fine for most practical applications.
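To make the tradeoff concrete: both parameters enter through the buffer size, which scales roughly like (1/epsilon^2) * log(m/delta) for a stream of length m. So tightening epsilon is quadratically expensive while tightening delta is only logarithmically expensive, which is why a loose epsilon with a tight delta is often the pragmatic choice. A hedged sketch, with C a placeholder constant (see the paper for the exact theorem):

    import math

    def buffer_size(eps, delta, m, C=12):
        # Indicative scaling only: thresh ~ (C / eps^2) * log(m / delta).
        # C and the exact argument/base of the log are placeholders for
        # the constants in the paper's theorem statement.
        return math.ceil((C / eps**2) * math.log2(m / delta))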
https://dl.acm.org/doi/10.1145/3452021.3458333