very cool stuff! I just read the SPFresh paper a few days ago and was wondering if it's been implemented in industry (e.g. Turbopuffer's implementation of SPANN).
I'd be curious how y'all represent the posting lists for each partition in InnoDB:
- what IDs are you storing in the posting lists?
- how are the posting lists represented on disk? are they using compression and/or some form of skip indexing? the paper seemed to use a pretty simple block-based representation, but I'm curious what works well in practice.
- how do the posting list data structures themselves handle incremental updates and MVCC?
We're storing IDs from a ghost column that is created in the table where you're inserting vector data. This works very well in practice and allows updating the value of the vectors in the table, because they're translated into a delete + insert in the vector index by updating the ghost ID.
We have abstracted away the quantization system from the index; for the initial release, vector data is stored in raw blocks, like in the paper. Query performance is good, but disk usage is high. We're actively testing different quantization algorithms to see which ones we end up offering on GA. We're hoping our beta users will help us guide this choice!
Incremental updates and MVCC are _extremely tricky_, for both correctness and performance. As you've surely noticed, the hard thing here is that the original paper is very focused on LSM trees, because it exploits the fact that LSM trees get compacted lazily to perform incremental updates to the posting lists ('merges'). MySQL (and Postgres, and all relational databases, really) are B-tree based, and in-place updates for B-trees are expensive! I think we came up with very interesting workarounds for the problem, but it's a quite a bit to drill down in a HN comment. Please stay tuned for our whitepaper. :)
I'd be curious if y'all end up supporting adding filter attributes to the inverted index that can then be pushed down into the posting list traversal.
for example, a restaurant search app may have (1) an embedding for each restaurant but also (2) a cuisine. then, if a restaurant has `cuisine = Italian`, we'd also store its ghost ID in a `cuisine:Italian` posting list.
at query time, the query planner could take a query like `SELECT * FROM t1 WHERE cuisine = 'Italian' ORDER BY DISTANCE(..)` and emit a plan that efficiently intersects the `cuisine:Italian` posting list with the union of the partitions' posting lists.
this feels to me like a potential strength of the inverted indexing approach compared to graph-based approaches, which struggle with general filtering (e.g. the Filtered-DiskANN paper).
tpuf’s ANN index uses a variant of SPFresh, yup. These are the only two production implementations I am aware of. I don’t think it is in production at MSFT yet
I'd be curious how y'all represent the posting lists for each partition in InnoDB:
- what IDs are you storing in the posting lists?
- how are the posting lists represented on disk? are they using compression and/or some form of skip indexing? the paper seemed to use a pretty simple block-based representation, but I'm curious what works well in practice.
- how do the posting list data structures themselves handle incremental updates and MVCC?