Hacker News
Milvus – An Open-Source Vector Similarity Search Engine (milvus.io)
152 points by maximente on Jan 10, 2020 | hide | past | favorite | 26 comments


I'd be curious how they implement updating. AFAICT this is the thorniest part of working with existing open source solutions. When working with ANNOY in the past, I've had data small enough that I could recompute the full index every few seconds in a background process and then swap the newly built index into the process serving similarity queries.

(you can see the VERY "research quality" code on Github, here's a decent starting place https://github.com/hyperstudio/spectacles/blob/master/specta...)

EDIT: from the insertion docs https://milvus.io/docs/guides/milvus_operation.md#Insert-vec... it seems that they still ask you to re-build your indices after you insert vectors, although in some cases they can tell that they need to re-build the indices for you. Looks like the major value adds here are potentially shifting computation to the GPU and building multiple indices. I'll certainly evaluate this next time I'm building a project around vector search.
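The rebuild-and-swap pattern I described can be sketched roughly like this (a toy stand-in where the "index" is just a brute-force numpy matrix; the real version builds an Annoy index in a separate process and swaps the index file in):

```python
import threading
import numpy as np

class SwappableIndex:
    """Serve queries from an immutable snapshot while new vectors
    accumulate; rebuild() materializes a fresh snapshot and swaps it in."""

    def __init__(self, dim):
        self._vectors = np.empty((0, dim), dtype=np.float32)  # served snapshot
        self._pending = []                                    # since last rebuild
        self._lock = threading.Lock()

    def add(self, vec):
        with self._lock:
            self._pending.append(np.asarray(vec, dtype=np.float32))

    def rebuild(self):
        # In the real version this runs periodically in the background and
        # builds an Annoy index; here the "index" is just the raw matrix,
        # and the swap is a single reference assignment.
        with self._lock:
            pending, self._pending = self._pending, []
        if pending:
            self._vectors = np.vstack([self._vectors, np.array(pending)])

    def query(self, vec, k=1):
        snapshot = self._vectors  # readers keep the snapshot they grabbed
        if len(snapshot) == 0:
            return []
        d = np.linalg.norm(snapshot - np.asarray(vec, dtype=np.float32), axis=1)
        return np.argsort(d)[:k].tolist()

idx = SwappableIndex(dim=2)
idx.add([0.0, 0.0])
idx.add([1.0, 1.0])
idx.rebuild()
print(idx.query([0.9, 0.9], k=1))  # → [1]
```

The key property is that queries never block on a rebuild: they read whichever snapshot reference they grabbed, and a rebuild only ever replaces the reference.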


Milvus allows users to append vectors. Vectors are stored in multiple file slices. When a file slice reaches the size threshold, Milvus builds the index for that slice, and new data is inserted into a new file slice. For details, please refer to https://medium.com/@milvusio/managing-data-in-massive-scale-...

We are now working on vector deletion. Hopefully it will be ready by the end of Q1 this year.


If I append a single new vector, will it show up in search results without me needing to ask for the index to be rebuilt? Can I update an existing vector without having to ask for the index to be rebuilt?

EDIT: from reading the linked article, it seems like newly inserted vectors will be queried using brute force. Very interesting idea!


Correct, new vectors will first be searched through brute force until the index is created on that file slice.
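This isn't the actual Milvus internals, but the idea of merging brute-force hits from the unindexed slice with hits from the indexed slices can be sketched like this (with brute force also standing in for the ANN lookup):

```python
import numpy as np

def search_slice(slice_vectors, query, k):
    # Brute-force scan of one file slice: exact L2 distances.
    d = np.linalg.norm(slice_vectors - query, axis=1)
    order = np.argsort(d)[:k]
    return [(float(d[i]), int(i)) for i in order]

def search(indexed_slices, new_slice, query, k=3):
    # In Milvus the indexed slices would be queried through their ANN
    # index; here brute force stands in for that lookup as well.
    hits = []
    for s in indexed_slices + [new_slice]:
        hits.extend(search_slice(s, query, k))
    hits.sort(key=lambda h: h[0])   # merge per-slice results by distance
    return hits[:k]

rng = np.random.default_rng(0)
indexed = [rng.standard_normal((100, 8)).astype(np.float32)]
fresh = rng.standard_normal((5, 8)).astype(np.float32)  # unindexed slice
top = search(indexed, fresh, fresh[0], k=1)
print(top[0][0])  # → 0.0: a just-inserted vector is found with no rebuild
```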


See https://github.com/jolibrain/deepdetect/pull/641, which adds FAISS as a backend alternative to Annoy (Annoy is supported as well). Deletion can be implemented by removing entries from the listing DB while the vector remains in the index.

Tests show that FAISS is a bit better than Annoy at retrieval on both small and million-item indexes. It also includes index compression techniques that in our tests fare very well, with very low loss on mid-size 500k-image indexes.


Does anyone know how to combine vector similarity search with more conventional field-based search (using Elasticsearch, for example)? For example, given a set of labeled images, a user should be able to compose a query using a combination of filters (like size or description) along with a reference image (the vector).
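A common workaround I've seen is to pre-filter on the structured fields and then rank the survivors by vector similarity. A toy numpy sketch with made-up item data:

```python
import numpy as np

# Toy "labeled images": metadata rows plus one embedding per row.
items = [
    {"id": 0, "size": "small", "desc": "red shoe"},
    {"id": 1, "size": "large", "desc": "red shoe"},
    {"id": 2, "size": "small", "desc": "blue bag"},
]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)

def hybrid_search(query_vec, predicate, k=2):
    # Step 1: filter on structured fields (what Elasticsearch would do).
    keep = [i for i, it in enumerate(items) if predicate(it)]
    # Step 2: rank the survivors by cosine similarity to the reference.
    cand = embeddings[keep]
    q = query_vec / np.linalg.norm(query_vec)
    sims = cand @ q / np.linalg.norm(cand, axis=1)
    order = np.argsort(-sims)[:k]
    return [items[keep[i]]["id"] for i in order]

print(hybrid_search(np.array([1.0, 0.0]), lambda it: it["size"] == "small"))
# → [0, 2]
```

Pre-filtering only works well when the filter is selective; if it keeps most of the corpus, you want the opposite order (ANN first, filter the hits), which is the harder case.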


We are working on a feature that allows users to perform hybrid search (attribute filters plus feature vectors). And you will be able to code your own scoring rules.

Again, hopefully it will be ready by the end of Q1 this year.



I think you could also try the partition feature.


Another option in this very interesting space is GNES [1], which attempts to do the encoding/decoding on its own rather than just working with feature/embedding vectors.

[1] https://gnes.ai/


There's not a lot of information on the site about the architecture or storage solutions used. Do the authors have more info about this?


You can check our Medium page; we will be posting more technical details there.

https://medium.com/@milvusio


Great to see another ANN tool available. FAISS and SPTAG were good, but this appears to be much better. Not sure if this supports "online" learning, i.e. is a training phase required?


Please check https://medium.com/@milvusio/managing-data-in-massive-scale-...

It explains how Milvus manages vectors.


> "As each vector takes 2 KB space, the minimum storage space for 100 million vectors is about 200 GB"

Why are you not quantizing the vectors when you insert them? Bolt [1] and Quicker-ADC [2] make 10-100x compression basically free for approximate search (and at ~100x compression they also get you roughly 10x faster querying within a partition).

[1] https://github.com/dblalock/bolt

[2] https://github.com/technicolor-research/faiss-quickeradc


200 GB is the size of the original vectors. When creating an index, Milvus supports IVF SQ8 and IVF PQ (ADC).

Based on our users' experience, SQ8 is the most balanced option at the moment: it provides 8x compression with higher accuracy and better performance.
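For readers unfamiliar with SQ8: the core idea of scalar quantization is to map each float component to a single byte using a per-dimension min/max range. A toy numpy sketch of that idea (not Milvus's implementation):

```python
import numpy as np

def sq8_train(vectors):
    # Per-dimension min/max define the quantization range.
    lo, hi = vectors.min(axis=0), vectors.max(axis=0)
    return lo, np.maximum(hi - lo, 1e-12)

def sq8_encode(vectors, lo, span):
    # Each float32 component becomes one byte.
    return np.round((vectors - lo) / span * 255).astype(np.uint8)

def sq8_decode(codes, lo, span):
    return codes.astype(np.float32) / 255 * span + lo

rng = np.random.default_rng(42)
x = rng.standard_normal((1000, 128)).astype(np.float32)
lo, span = sq8_train(x)
codes = sq8_encode(x, lo, span)
err = np.abs(x - sq8_decode(codes, lo, span)).max()
print(codes.nbytes / x.nbytes)  # → 0.25
```

Note the sketch gives 4x compression relative to float32; hitting 8x as quoted above depends on what you count as the baseline storage.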


Yes on online learning, as I gather from the comparison at https://milvus.io/docs/v0.6.0/about_milvus/vector_db.md


What are good answers to this in the embedded space, e.g. on mobile?


Milvus can run on ARM CPUs. So far we have ported it to the Nvidia Jetson Nano and the Raspberry Pi 4 (4 GB memory).

Most people told us that running Milvus on ARM looked cool, but they were not sure they wanted to do it...

Please tell us your requirements and scenarios on ARM. It would really help.


Any recommendations on which machine learning platforms to use?


It's not about the ML platform.

It's about the ML scenario. If you want to search through a huge amount of unstructured data after vectorizing it (e.g. with deep learning), Milvus will help you a lot.

Our users use Milvus in the following scenarios:

1. Chemical molecule analysis, searching SMILES-format vectors

2. Image retrieval applications, for example shopping websites

3. NLP

4. Recommendation systems

5. and more; we are collecting users' feedback


How does it compare to the state of the art? (https://github.com/erikbern/ann-benchmarks)


I'm a big fan of ann-benchmarks and will be the first to tell you that the research community needs way more benchmarks like this. But I do want to add a couple of caveats about it for people looking into this area:

1) Most of these datasets have extremely correlated dimensions. If you plot the covariance matrices, you'll see dense blobs of entries close to 1 all over the place. This makes the ANN task much easier than it would be with, say, high-quality DNN features. As an example, I've compressed MNIST digits down to 1 byte representations with vector quantization and still gotten nearly perfect retrieval accuracy.

2) 1M vectors is not that many. You can easily get 1k queries per second in a single thread at a decent precision/recall just brute-force scanning through them with a SIMD approximate distance function like Bolt or Quicker ADC [1]. Also worth noting that the FAISS paper (along with a lot of other work since then) focuses mostly on 100M to billions of vectors.

3) Related to (2), I think most of these methods aren't incorporating state-of-the-art approximate distance functions yet (though I haven't dug into all of their source code). AFAICT FAISS+Quicker ADC [2] is the actual leader on x86 CPUs. Can't comment on the production-readiness of their code though.

[1] The latter is a bit faster for ANN search, though the code is more complex IIRC.

[2] https://github.com/technicolor-research/faiss-quickeradc
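To make point (2) concrete, the exact brute-force baseline that those SIMD approximate-distance methods accelerate looks like this in plain numpy (the throughput figure in my comment is illustrative, not a benchmark of this snippet):

```python
import numpy as np

# Exact top-k by brute force over 1M vectors.
n, d, k = 1_000_000, 32, 10
rng = np.random.default_rng(0)
db = rng.standard_normal((n, d)).astype(np.float32)
q = rng.standard_normal(d).astype(np.float32)

# Squared L2 distance via ||x - q||^2 = ||x||^2 - 2 x.q + ||q||^2;
# the constant ||q||^2 term is dropped since it doesn't affect ranking.
dist = (db * db).sum(axis=1) - 2.0 * (db @ q)
topk = np.argpartition(dist, k)[:k]   # O(n) selection of the k smallest
topk = topk[np.argsort(dist[topk])]   # order only the k survivors
```

Bolt and Quicker ADC replace the exact `db @ q` inner loop with byte codes and SIMD table lookups, which is where the 10x-ish speedup within a partition comes from.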


Good points. I want to add a couple more.

I think the ANN benchmark should pay more attention to:

1. Index-building speed, as this is very important in some production scenarios. Right now the benchmark effectively lets a method spend 5 hours building the index on that 1 million vectors.

2. Memory footprint, as 1M vectors are not that many. We will have to deal with billions of vectors for chemical molecules, images and word vectors, and memory consumption will definitely impact how many servers you need.


We have some test reports in https://github.com/milvus-io/milvus/tree/master/tests

At this moment, the IVF indices are based on FAISS, so the performance is the same as FAISS's.

IVF_SQ8H is a reconstruction of the FAISS IVF SQ8 index. Performance is much better, but it requires a GPU.

We provide benchmark test procedures and tools.

Please check this: https://github.com/milvus-io/bootcamp/tree/master/EN_benchma...


Cool stuff! Very easy to use, with good examples to get started.



