I'd be curious how they implement updating. AFAICT this is the thorniest part of working with existing open source solutions. When working with ANNOY in the past, my data was small enough that I could recompute the full index every few seconds in a background process and then swap the freshly built index into the process serving similarity queries.
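For anyone curious what that pattern looks like: a minimal sketch of rebuild-and-swap, with a numpy brute-force index standing in for ANNOY (class names and the toy data here are illustrative, not from any library):

```python
import threading
import numpy as np

class BruteForceIndex:
    """Stand-in for an ANNOY index: brute-force cosine search over a snapshot."""
    def __init__(self, vectors):
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self.vectors = vectors / np.clip(norms, 1e-12, None)

    def query(self, q, k=5):
        q = q / max(np.linalg.norm(q), 1e-12)
        sims = self.vectors @ q
        return np.argsort(-sims)[:k].tolist()

class SwappableIndex:
    """Serve queries from the current index while a rebuild runs elsewhere."""
    def __init__(self, vectors):
        self._index = BruteForceIndex(vectors)
        self._lock = threading.Lock()

    def query(self, q, k=5):
        with self._lock:
            index = self._index          # grab a reference under the lock
        return index.query(q, k)         # search outside the lock

    def rebuild(self, vectors):
        new_index = BruteForceIndex(vectors)  # expensive work off the serving path
        with self._lock:
            self._index = new_index      # atomic swap-in

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8))
idx = SwappableIndex(data)
hits = idx.query(data[3])
print(hits[0])  # the vector is its own nearest neighbor -> 3

# Rebuild in the background, then queries see the new snapshot after the swap:
t = threading.Thread(target=idx.rebuild,
                     args=(np.vstack([data, rng.normal(size=(10, 8))]),))
t.start()
t.join()
print(idx.query(data[3])[0])  # still 3 after the swap
```

With a real ANNOY index the rebuild step would call `add_item`/`build` on a fresh `AnnoyIndex` and the swap replaces the loaded index object; the locking pattern is the same.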
EDIT: from the insertion docs https://milvus.io/docs/guides/milvus_operation.md#Insert-vec... it seems that they still ask you to re-build your indices after you insert vectors, although in some cases they can tell that they need to re-build the indices for you. Looks like the major value adds here are potentially shifting computation to the GPU and building multiple indices. I'll certainly evaluate this next time I'm building a project around vector search.
Milvus allows users to append vectors. Vectors are stored in multiple file slices. When a file slice reaches the threshold, Milvus builds the index for that file slice, and new data is inserted into a new file slice. For details, please refer to
https://medium.com/@milvusio/managing-data-in-massive-scale-...
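As I understand the description above, the append path looks roughly like this (a toy sketch with made-up names and a tiny threshold; the real slice size and index build are of course far heavier):

```python
import numpy as np

SLICE_THRESHOLD = 4  # vectors per file slice (tiny for illustration)

class Slice:
    def __init__(self):
        self.rows = []
        self.index = None          # built lazily once the slice is sealed

    def seal(self):
        # Stand-in for index building: materialize a matrix ready for fast scans.
        self.index = np.vstack(self.rows)

class SlicedStore:
    """Writes always go to the newest slice; a slice is sealed and indexed
    once it hits the threshold, and a fresh slice takes over."""
    def __init__(self):
        self.slices = [Slice()]

    def append(self, vec):
        current = self.slices[-1]
        current.rows.append(np.asarray(vec, dtype=np.float32))
        if len(current.rows) >= SLICE_THRESHOLD:
            current.seal()
            self.slices.append(Slice())

store = SlicedStore()
for i in range(10):
    store.append([float(i)] * 3)
sealed = sum(1 for s in store.slices if s.index is not None)
print(sealed, len(store.slices))  # 10 vectors / threshold 4 -> "2 3"
```

The appeal of this design is that sealed slices are immutable, so their indexes never need rebuilding; only the open slice is unindexed at any moment.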
We are now working on vector deletion. Hopefully it will be ready by the end of Q1 this year.
If I append a single new vector, will it show up in search results without me needing to ask for the index to be rebuilt? Can I update an existing vector without having to ask for the index to be rebuilt?
EDIT: from reading the linked article, it seems like newly inserted vectors will be queried using brute force. Very interesting idea!
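Presumably a query then fans out to both the indexed slices and the not-yet-indexed buffer and merges results by distance. A toy version (all names hypothetical, with a plain matrix standing in for the real ANN index):

```python
import numpy as np

def search(indexed, buffer, q, k=3):
    """Merge hits from the indexed segment with a brute-force scan over
    vectors inserted since the last index build."""
    all_vecs = np.vstack([v for v in (indexed, buffer) if len(v)])
    dists = np.linalg.norm(all_vecs - q, axis=1)
    order = np.argsort(dists)[:k]
    return order.tolist(), dists[order].tolist()

rng = np.random.default_rng(1)
indexed = rng.normal(size=(50, 4)).astype(np.float32)   # already indexed
fresh = np.array([[0.0, 0.0, 0.0, 0.0]], dtype=np.float32)  # just inserted
q = np.zeros(4, dtype=np.float32)
ids, dists = search(indexed, fresh, q)
print(ids[0])  # 50: the freshly inserted vector (row 50 after stacking) wins
```

Since the unindexed buffer is bounded by the slice threshold, the brute-force portion stays cheap even as the total collection grows.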
See https://github.com/jolibrain/deepdetect/pull/641 that uses FAISS as a backend alternative to annoy (annoy supported as well). Deletion can be implemented by removing entries from the listing db while the vector remains within the index.
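The deletion trick described there can be sketched in a few lines (illustrative only, not the PR's actual code): the vector stays in the immutable index, but its id is dropped from a side "listing" set, so hits are filtered out at query time.

```python
import numpy as np

vectors = np.array([[0., 0.], [1., 0.], [2., 0.]])   # immutable index payload
alive = {0, 1, 2}                                    # listing db of live ids

def delete(vec_id):
    alive.discard(vec_id)          # no index rebuild needed

def query(q, k=2):
    d = np.linalg.norm(vectors - q, axis=1)
    ranked = np.argsort(d)
    return [int(i) for i in ranked if int(i) in alive][:k]

delete(0)
print(query(np.array([0., 0.])))   # id 0 is still in the index but filtered out
```

The usual caveat with tombstoning: after many deletions the index wastes space and returns candidates that get filtered, so a periodic compaction/rebuild is still needed eventually.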
Tests show that FAISS is a bit better than annoy at retrieval on both small and million-item indexes. It also includes index compression techniques that in our tests fare very well, with very low loss on mid-size 500k image indexes.
Does anyone know how to combine vector similarity search with more conventional field-based search (using elasticsearch for example)?
For example, given a set of labeled images, a user should be able to compose a query using a combination of filters (like size or description) along with a reference image (the vector).
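One common answer (not specific to any engine) is post-filtering: over-fetch ANN candidates by vector similarity, then apply the conventional field filters to the candidate list. A toy sketch:

```python
import numpy as np

# Hypothetical tiny catalog: image embeddings plus conventional metadata fields.
images = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.2, 0.1]])
meta = [{"size": "large"}, {"size": "small"},
        {"size": "small"}, {"size": "large"}]

def filtered_search(q, predicate, k=2, overfetch=4):
    d = np.linalg.norm(images - q, axis=1)
    candidates = np.argsort(d)[:overfetch]          # vector-similarity stage
    return [int(i) for i in candidates if predicate(meta[i])][:k]

hits = filtered_search(np.array([0.0, 0.0]), lambda m: m["size"] == "large")
print(hits)  # nearest "large" images -> [0, 3]
```

The catch is that for very selective filters you need a large over-fetch (or pre-filtering, i.e. restricting the ANN search to the matching subset). Elasticsearch does have a dense_vector field type that can be scored alongside regular filters, though for large collections a dedicated ANN engine is usually faster.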
Another option in this very interesting space is GNES[1], which attempts to do the encoding/decoding on its own, rather than just working with feature/embedding vectors.
Great to see another ANN tool available. FAISS and SPTAG were good, but this appears to be much better. Not sure if this supports "online" learning i.e. is a training phase required?
> "As each vector takes 2 KB space, the minimum storage space for 100 million vectors is about 200 GB"
Why are you not quantizing the vectors when you insert them? Bolt [1] and Quicker-ADC [2] make 10-100x compression basically free for approximate search (at ~100x compression you also get roughly 10x faster querying within a partition).
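For scale, here is a minimal product-quantization sketch (random-sample "codebooks" instead of trained k-means, which is enough to show the storage arithmetic; Bolt and Quicker-ADC use far smarter codebooks plus SIMD lookup tables):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 32)).astype(np.float32)  # 32 dims * 4 B = 128 B/vector

M, K = 4, 256                     # 4 subspaces, 256 centroids -> 1 byte each
sub = X.reshape(len(X), M, -1)    # (n, 4, 8): split each vector into 4 chunks
codebooks = sub[rng.choice(len(X), K)]        # (K, 4, 8) sampled "centroids"

codes = np.empty((len(X), M), dtype=np.uint8)
for m in range(M):
    # Encode chunk m of every vector as the id of its nearest centroid.
    d = np.linalg.norm(sub[:, None, m] - codebooks[None, :, m], axis=-1)
    codes[:, m] = d.argmin(axis=1)

print(X[0].nbytes // codes[0].nbytes)   # 128 B -> 4 B per vector: 32x
```

At query time, distances are approximated from per-subspace lookup tables instead of touching the raw floats, which is where the speedup comes from.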
It's about the ML scenarios. If you want to search through a huge amount of unstructured data after vectorizing it with techniques like deep learning, Milvus will help you a lot.
Our users use Milvus in the following scenarios:
1. Chemical molecule analysis, searching vectors derived from the SMILES format
2. Image retrieval type application, for example shopping website
3. NLP
4. Recommendation system
5. and more, we are collecting users' feedback
I'm a big fan of ann-benchmarks and will be the first to tell you that the research community needs way more benchmarks like this. But I do want to add a couple caveats about it for people looking into this area:
1) Most of these datasets have extremely correlated dimensions. If you plot the covariance matrices, you'll see dense blobs of entries close to 1 all over the place. This makes the ANN task much easier than it would be with, say, high-quality DNN features. As an example, I've compressed MNIST digits down to 1 byte representations with vector quantization and still gotten nearly perfect retrieval accuracy.
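To make point (1) concrete, here's a toy experiment (synthetic data, with random-sample codewords standing in for trained k-means): data with only 2 underlying factors quantizes to 1-byte codes far better than genuinely i.i.d. data of the same nominal dimensionality.

```python
import numpy as np

rng = np.random.default_rng(0)

def vq_relative_error(X, K=256):
    """1-byte vector quantization: code each row as its nearest of K codewords."""
    codewords = X[rng.choice(len(X), K, replace=False)]
    codes = np.linalg.norm(X[:, None] - codewords[None], axis=-1).argmin(1)
    return np.linalg.norm(X - codewords[codes]) / np.linalg.norm(X)

# 64 nominal dimensions but only 2 independent factors -> heavily correlated,
# like the blobby covariance matrices in many of the benchmark datasets.
correlated = (rng.normal(size=(1000, 2)) @ rng.normal(size=(2, 64))).astype(np.float32)
iid = rng.normal(size=(1000, 64)).astype(np.float32)

err_corr = vq_relative_error(correlated)
err_iid = vq_relative_error(iid)
print(err_corr < err_iid / 2)   # correlated data quantizes far better
```

Intuitively, 8 bits spread over 2 effective dimensions buys much finer cells than 8 bits spread over 64, which is why easy datasets make ANN methods look better than they are.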
2) 1M vectors is not that many. You can easily get 1k queries per second in a single thread at a decent precision/recall just brute-force scanning through them with a SIMD approximate distance function like Bolt or Quicker ADC [1]. Also worth noting that the FAISS paper (along with a lot of other work since then) focuses mostly on 100M to billions of vectors.
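This is easy to sanity-check with plain numpy: one BLAS matmul gives an exact scan over 1M vectors (so it's a lower bound; SIMD approximate-distance scans like Bolt's go faster still). Dimensions and timing here are illustrative.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, D = 1_000_000, 32
X = rng.normal(size=(N, D)).astype(np.float32)
q = (X[12345] + 0.001).astype(np.float32)   # query near a known row

norms = (X * X).sum(1)          # precomputed once, amortized across queries
t0 = time.perf_counter()
# Exact (up to a constant) squared-L2 scan: ||x-q||^2 = ||x||^2 - 2 x.q + ||q||^2
scores = norms - 2.0 * (X @ q)
best = int(scores.argmin())
dt = time.perf_counter() - t0

print(best)                     # 12345: brute force is exact
print(f"~{1 / dt:.0f} queries/sec at this rate, single thread")
```

Even this naive exact scan lands in the hundreds-to-thousands of queries per second on a modern CPU, which is the point: at 1M scale, brute force is a serious baseline any index has to beat.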
3) Related to (2), I think most of these methods aren't incorporating state-of-the-art approximate distance functions yet (though I haven't dug into all of their source code). AFAICT FAISS+Quicker ADC [2] is the actual leader on x86 CPUs. Can't comment on the production-readiness of their code though.
[1] The latter is a bit faster for ANN search, though the code is more complex IIRC.
I think the ANN benchmark should pay more attention to:
1. Index building speed, as this is very important in some production scenarios. Currently it only allots 5 hours to build the index on that 1 million vectors.
2. Memory footprint, as 1M vectors are not that many. We will have to deal with billions of vectors for chemical molecules, images and word vectors. Memory consumption will directly determine how many servers you need.
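Back-of-envelope for point 2, assuming 512-dim float32 vectors (the 2 KB/vector figure quoted elsewhere in the thread) and hypothetical 256 GiB servers:

```python
# All numbers here are assumptions for illustration, not measured figures.
vectors = 1_000_000_000
bytes_each = 512 * 4                    # 512-dim float32 -> 2048 B = 2 KB
per_server_ram = 256 * 2**30            # hypothetical 256 GiB of RAM per box

total_bytes = vectors * bytes_each
servers = -(-total_bytes // per_server_ram)   # ceiling division
print(total_bytes / 1e12, servers)      # ~2 TB raw -> 8 servers, before any index overhead
```

And that's just the raw vectors; index structures, replicas, and OS headroom push the real number higher, which is exactly why compression matters at this scale.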
(you can see the VERY "research quality" code on Github, here's a decent starting place https://github.com/hyperstudio/spectacles/blob/master/specta...)