more etiennedi's comments

etiennedi · on April 11, 2021

You might be interested in checking out Weaviate (https://github.com/semi-technologies/weaviate) which supports combining vector search with boolean filters. In Weaviate, the vector index is stored alongside an inverted index, so you get the best of both worlds. At the moment on a filtered search, the inverted index is hit first, so you essentially end up with an allow-list of IDs which are then fed to the vector index. If the id isn't on the allow list the next neighbor is selected and so on.

There are also plans to further optimize this based on how restrictive the boolean filter is. For example, if the filter still matches a lot of of ids (e.g. 50% of all candidates) the approach outlined above works really well. But if if your filter is extremely restrictive and maybe only matches 1k out of 100M results, then this approach is no longer ideal, as the vector index will discard most of the ids it sees. However, in this case it would be more efficient to actually perform a brute-force search on those 1k vectors. This and other optimizations around combined vector and scalar search are on the roadmap.

hintymad · on April 11, 2021

Thanks for the explanation. Is it true that a node in HNSW can be a jump point for the next iteration of search, which means its neighbors can well be candidates of a query even though the node itself is filtered out? If so, skipping a node because it is not in an allow-list will potentially result in loss of recall? Will it work if we can keep expanding the search front until we find enough number of results that are both close enough to the given query and match the given filter?