Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
The Opposite of a Bloom Filter (2012) (somethingsimilar.com)
78 points by espeed on Nov 11, 2015 | hide | past | favorite | 15 comments


So in short, the opposite of a Bloom filter is a hash-map where, if you add two items that have the same hash, the second one replaces the first instead of them both being stored in a linked-list bucket.

> The old id at that index is compared to the new id just added. If they differ, containsAndAdd returns false and the process should write the event. If the ids are the same, containsAndAdd returns true and the process can safely drop the event on the ground. Occasionally, two byte arrays may hash to the same index. When this happens, containsAndAdd will return false and a duplicate write will be emitted. However, since duplicate items come close together in time, this is relatively rare.


So the opposite of a bloom filter is a cache with a random replacement strategy. For some applications they work okay, [1], but for other applications they're worse than FIFO, [2].

[1] http://arxiv.org/abs/1202.4880

[2] http://www.ece.uah.edu/~milenka/docs/milenkovic_acmse04r.pdf


This seems like an alright solution, but I think an LRU or LFU cache would perform better (they wouldn't have pathological cases like two colliding events that frequently alternate), and you probably wouldn't have to implement it yourself.

I also think it's a bit much to call this "the opposite of a bloom filter". A bloom filter has a lot of properties that don't come through in the algorithm proposed.

For one thing, the space required is O(MN) (size of an element * size of the filter) instead of a bloom filter's O(N). For another, bloom filters give the same results regardless of what order previous elements were added in. It seems to me that if you're looking for "the opposite of a bloom filter", you're probably wanting those properties.


It depends how you implement your hash-table, for example I've done something similar using cuckoo hashing: https://github.com/pik/time-rotated-cache-filter

> It seems to me that if you're looking for "the opposite of a bloom filter", you're probably wanting those properties.

You can't really get a theoretical opposite of a bloom-filter unless you store the entries verbatim, since any hash/compression introduces the chance of collisions meaning you can only provide probabilistic guarantees where as a bloom-filter gives a strong guarantee.


>they wouldn't have pathological cases like two colliding events that frequently alternate

To avoid this particular case you could always add a "victim cache" : In that case a similar but smaller, second hash table which uses the same hashing function and follow the same protocol as a regular victim cache. See:

http://www.ecs.umass.edu/ece/koren/architecture/VCache/home....


The space requirement is surely impossible to fulfil generically. As soon as you do any sort of compression you make it possible for two elements to map to the same thing. Unless you know that M << N, in which case you can just save exactly one element, and just replace it every time you see a new one!


That's basically true. Technically you can meet these requirements by exploiting the false negatives loophole, but if the data structure can ever report that it contains a size M element, it will need O(M) space.


So... a cache. You keep track of a bounded number of items, and can avoid doing some work when a currently-tracked item shows up again.


The first footnote in the article says:

> [1] It’s been 3 years since I wrote this article, and every time it pops back up people are like “so, a cache”, and every time I’m like “that wasn’t obvious from context?” and so now I’ve added this text.


I missed that.

It certainly wasn't clear to me from the rest of the post that the author realized they were talking about a cache. Possibly because I'd be tempted to make that the focus, since I consider "by the way we just re-invented the cache by accident!" to be an excellent climax-twist on a post like this. You think it's about this cool new data structure, but then it's actually about thinking from the ground up or about reasoning from multiple perspectives.


Combine a bloom filter with opposite-of-bloom and get: a) a lossless compression data structure (no false negatives or positives) OR b) garbage (both false negatives AND positives)


Interesting idea. Took me a moment to figure out what the actual results would be (beyond "garbage"). But unfortunately it doesn't sound very useful. What you'd end up with is a structure that reports a tri-state value: "Yes I've definitely seen it", "No I've definitely not seen it", and "I have no idea". With the added drawback of having to know your data set size in advance to tune the bloom filter (or resetting it periodically as the article suggests, but that weakens the "No" answer to "Not since my last reset").


Pipe the I have no ideas back into another set of bloom/nonbloom stages, the output of that back into this one, and the network becomes a computer.


I can't hate on this guy for sharing a real-world situation and what he designed.

Though I was expecting something exotic that can help with set membership problems, but instead of being log n in size it would be exponential... which would be comically useless (...or is there a use?)


So, every entry is either stored in memory or will be emitted again. This is very different than in Bloom filters (though, admittedly, you can't do better than that on this problem).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: