Does Vald support boolean filtering? That is, can I attach a number of attributes to vectors, and then find the nearest neighbors that match a boolean filter on the attributes?
You might be interested in checking out Weaviate (https://github.com/semi-technologies/weaviate) which supports combining vector search with boolean filters. In Weaviate, the vector index is stored alongside an inverted index, so you get the best of both worlds. At the moment on a filtered search, the inverted index is hit first, so you essentially end up with an allow-list of IDs which are then fed to the vector index. If the id isn't on the allow list the next neighbor is selected and so on.
There are also plans to further optimize this based on how restrictive the boolean filter is. For example, if the filter still matches a lot of of ids (e.g. 50% of all candidates) the approach outlined above works really well. But if if your filter is extremely restrictive and maybe only matches 1k out of 100M results, then this approach is no longer ideal, as the vector index will discard most of the ids it sees. However, in this case it would be more efficient to actually perform a brute-force search on those 1k vectors. This and other optimizations around combined vector and scalar search are on the roadmap.
Thanks for the explanation. Is it true that a node in HNSW can be a jump point for the next iteration of search, which means its neighbors can well be candidates of a query even though the node itself is filtered out? If so, skipping a node because it is not in an allow-list will potentially result in loss of recall? Will it work if we can keep expanding the search front until we find enough number of results that are both close enough to the given query and match the given filter?
Since Vald is a pure ANN search engine, it does not support graph search of boolean filtering results.
However, it does have post-filtering capabilities.
The same use case can be achieved by using the post-filter.
It is possible to implement functions such as retrieving more TopKs than the expected value of N, and then filtering the results to get N in the end.
The documentation for Vald's Filter feature is in preparation, but should be available soon.
There are also functions that work seamlessly with Tensorflow and ONNX, so it is possible to embed text data in vector space without Boolean filtering, and perform similar searches using only ANNs by weighting vector queries.
Our ideal world would be one in which everything becomes a vector, and we could perform a variety of searches using vectors.
thanks. The challenge with post-filtering is that the K in TopK becomes unpredictable if I'd like to get a fixed page size. I may end up issuing multiple queries with progressively large Ks to get a single page of results.
Exactly. See my reply on this comment thread comment thread to see how Weaviate solves the filter-issue by using an inverted index to produce a whitelist of IDs which is then passed to the vector index where non matching IDs are simply skipped.