Storing OpenAI embeddings in Postgres with pgvector (supabase.com)
256 points by kiwicopple on Feb 6, 2023 | 54 comments


Hey HN, this one has a cool back story that shows the power of open source.

The author, Greg[0], wanted to use pgvector in a Postgres service, so he created a PR[1] in our Postgres repo. He then reached out and we decided it would be fun to collaborate on a project together, so he helped us build a "ChatGPT" interface for the supabase docs (which we will release tomorrow).

This article explains all the steps you'd take to implement the same functionality yourself.

I want to give a shout-out to pgvector too - it's a great extension [2]

[0] Greg: https://twitter.com/ggrdson

[1] pgvector PR: https://github.com/supabase/postgres/pull/472

[2] pgvector: https://github.com/pgvector/pgvector


To summarise the article if you're skipping to the comments: pgvector allows you to create a "vector" type in your database

    create table documents (
      id bigserial primary key,
      content text,
      embedding vector (1536)
    );
 
Then you can use OpenAI's Embedding API[0] to convert large text blocks into 1536-dimension vectors, which you store in the database. From there you can use pgvector's cosine distance operator to search for related documents

You can combine the search results into a prompt, and send that to GPT for a "ChatGPT-like" interface, where it will generate an answer from the documents provided

[0] https://platform.openai.com/docs/guides/embeddings
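
Putting it together, a minimal sketch in Python (assuming psycopg2, the pgvector-python adapter, and the pre-v1 openai client; the table matches the create table snippet above, everything else - DSN, example strings - is illustrative):

  import openai
  import numpy as np
  import psycopg2
  from pgvector.psycopg2 import register_vector

  conn = psycopg2.connect("dbname=docs")  # hypothetical DSN
  register_vector(conn)                   # teach psycopg2 the vector type
  cur = conn.cursor()

  # embed a block of text (ada-002 returns 1536 dimensions)
  text = "pgvector adds a vector type to Postgres"
  resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
  cur.execute("insert into documents (content, embedding) values (%s, %s)",
              (text, np.array(resp["data"][0]["embedding"])))
  conn.commit()

  # search: <=> is pgvector's cosine distance operator
  q = openai.Embedding.create(model="text-embedding-ada-002",
                              input="how do I store embeddings?")
  cur.execute("select content from documents order by embedding <=> %s limit 5",
              (np.array(q["data"][0]["embedding"]),))
  print(cur.fetchall())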


> From there you can use pgvector's cosine distance operator to search for related documents

How does this scale with the number of rows in the database? My first thought is that this is O(n). Does pgvector have a smarter implementation that allows performing k-nearest-neighbor searches efficiently?


Locality-sensitive hashing is the typical method here. You generate a bunch of random vectors; each one defines an infinite hyperplane (the one it is normal to), and each vector/hyperplane is associated with a bit in the final hash. An input vector is then hashed by setting bit i if it falls on one side of hyperplane i and unsetting it if it's on the other. The Hamming distance between two hashes is now correlated with the cosine similarity of the original vectors. Or something like that.
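
A toy version in Python, just to show the mechanics (sizes and counts are arbitrary):

  import numpy as np

  rng = np.random.default_rng(0)
  dim, bits = 1536, 64
  planes = rng.standard_normal((bits, dim))  # normals of 64 random hyperplanes

  def lsh_hash(v):
      # bit i is set iff v lies on the positive side of hyperplane i
      return sum(1 << i for i, side in enumerate(planes @ v > 0) if side)

  def hamming(a, b):
      return bin(a ^ b).count("1")

  u = rng.standard_normal(dim)
  v = u + 0.1 * rng.standard_normal(dim)    # nearly parallel to u
  w = rng.standard_normal(dim)              # unrelated vector
  print(hamming(lsh_hash(u), lsh_hash(v)))  # few differing bits
  print(hamming(lsh_hash(u), lsh_hash(w)))  # ~32, i.e. about half the bits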


That's the reason I come to HN every day. Thanks for the explanation, which probably could not be more compact.


That’s so smart! I read stuff like that and I’m just in awe of the cleverness.


You can add indexes per distance function to perform fast approximate nearest neighbor search: https://github.com/pgvector/pgvector#indexing
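
For reference, creating one of those indexes is a one-liner (using the documents table from the top comment; lists = 100 is just the starting point the pgvector README uses in its example):

    create index on documents using ivfflat (embedding vector_cosine_ops) with (lists = 100);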

They mention this paper in the references: https://dl.acm.org/doi/pdf/10.1145/3318464.3386131


They describe optimizations here as well: https://simonwillison.net/2023/Jan/13/semantic-search-answer...


Does this replace Pinecone?


Specialized databases will always outperform Postgres, which is a "jack of all trades". It really depends on your use-case, but I imagine if you need anything beyond basic vector support then Pinecone is a better option.


I've been experimenting with something similar to this on top of SQLite.

My experiments so far have involved storing the embeddings as binary strings representing the floating point arrays in a SQLite blob column: openai-to-sqlite is my tool for populating those: https://datasette.io/tools/openai-to-sqlite

I then query them using an in-memory FAISS vector search index using my datasette-faiss plugin: https://datasette.io/plugins/datasette-faiss
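
The packing itself is simple with struct - a sketch of the general idea (not the exact code in openai-to-sqlite):

  import sqlite3, struct

  db = sqlite3.connect("embeddings.db")
  db.execute("create table if not exists embeddings (id text primary key, embedding blob)")

  def pack(floats):
      # 1536 little-endian 4-byte floats -> one 6144-byte blob
      return struct.pack("<" + "f" * len(floats), *floats)

  def unpack(blob):
      return list(struct.unpack("<" + "f" * (len(blob) // 4), blob))

  db.execute("insert or replace into embeddings values (?, ?)",
             ("doc-1", pack([0.5] * 1536)))
  db.commit()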


Awesome, I came here to ask about SQLite and you already answered it!

I love your writing, Simon. Have you blogged about this yet?


Yeah I wrote these up in https://simonwillison.net/2023/Jan/13/semantic-search-answer... a few weeks ago.


Disclaimer: I work on Typesense [1] (an open source alternative to Algolia + Pinecone) and we recently added Vector Search as a feature to Typesense [2].

Postgres can do a lot of things, but for large enough datasets, and/or when you want to add filtering into the mix along with vector search, it becomes slow. At that point you want to use a dedicated vector search database.

It's similar to how Postgres can also do full-text search, but for large datasets, and/or when you want typo tolerance, faceting, grouping, filtering, synonyms, etc - the usual features you'd need when implementing a search feature - it becomes slow to do this in pg, and you'd then use a dedicated search engine.

In Typesense, we've now combined vector search with filtering based on attributes in your documents, so you get the best of both worlds [2], and we made it fast with an in-memory index.

[1] https://typesense.org/

[2] https://typesense.org/docs/0.24.0/api/vector-search.html


One downside of pgvector is that it currently only supports one type of index (ivfflat), while others (FAISS, Milvus, qdrant, etc.) support other types of indices that can be advantageous depending on your workload (properties of vectors, size of dataset). See [1] for some more background.

[1] https://github.com/facebookresearch/faiss/wiki/Guidelines-to...


Played with GPT-3 embeddings for the first time this week and they seem to be unusually good. Everything above 0.85 cosine similarity is a match, everything under 0.76 is not a match, and 0.76-0.85 is the boundary zone where some positives and negatives get mixed up. Training models on top works great, even averaging 2-3 phrases works great for making a more precise query.
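
(Cosine similarity here is just the normalized dot product - in Python:

  import numpy as np

  def cosine_similarity(a, b):
      a, b = np.asarray(a), np.asarray(b)
      return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

OpenAI's embeddings come back normalized to unit length, so a plain dot product gives the same number.)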


> Training models on top works great, even averaging 2-3 phrases works great for making a more precise query.

What does this mean?


If you average a query embedding with that of similar terms, you get a more focused ranking. Averaging amplifies the common component of the vectors and diminishes the spurious components.


What do you mean by "average a query embedding"?


Presumably this operation:

  def average_embeddings(embeddings):
    # dimension-wise mean; assumes all embeddings have the same length
    return [
      sum(embedding[x] for embedding in embeddings) / len(embeddings)
      for x in range(len(embeddings[0]))
    ]
Just finding the mid-point in N-dimensional space between all points. Although I believe OpenAI says these are directions rather than locations. Averaging makes sure you don't accidentally pick a target vector that happens to be on the fringes of your desired embedding space.

So if you're querying a document for predatory birds and you are using the string "Eagles" as your target you will be accidentally digging into embedding space for sports teams (go birds). But if your target is `average_embeddings([eagles_embedding, hawks_embedding, falcons_embedding])` you'll end up with less noise.


When you input a query, you get a numerical representation output. If you have 3-5 similar queries (each phrased a bit differently), they all have slightly different outputs. If you average them all together, it combines them in a way that improves results.


That’s good info. Curious if you have any guidance about how big a document chunk should be to get a good match with a natural language query. For context, I’m building a question-answering app using ada-002 that pulls in around 100 pages of text.


First time I've heard of pgvector - for folks with experience, how does it compare to other ANN plugins (e.g. Redis https://redis.io/docs/stack/search/reference/vectors/) and purpose-built vector databases (e.g. Milvus https://milvus.io)?

Curious about both performance/QPS and scale/# of vectors.


This is a really nice article by Dmitry Kan on a number of popular vector databases.

https://towardsdatascience.com/milvus-pinecone-vespa-weaviat...


This link does not cover pgvector


This is an interesting feature. But you aren't limited to OpenAI's embeddings: you can generate your own with open source SOTA transformer models, which would probably work just as well. You could generate a couple hundred thousand embeddings on a rented A100 for less than two dollars. And the point of converting text or other objects (like images) into embeddings is to compare a large number of documents to a source document very fast. It's more useful to put the embeddings in something like Redis. This pgvector data type would be good for an offline backup of vectors.


> And the point of converting text or other objects (like images) into embeddings is to compare a large number of documents to a source document very fast.

But a "source document" could also be a natural language query that you need to convert to an embedding - if you want to enable natural language search on your corpus? (maybe along with handling queries in French getting good semantic hits in English etc?)?


Do you have any examples that come to mind for this? I'd love to understand what other models exist and what Redis extensions exist to compare embeddings.


This will load the transformer models from Hugging Face (their models have a similar architecture to OpenAI embedding models): https://www.sbert.net/docs/pretrained_models.html

Redis has approximate nearest-neighbors vector similarity search: https://redis-py.readthedocs.io/en/stable/examples/search_ve...

Generate the embeddings on a rented GPU, push them to Redis, then do a similarity search. Store the vectors in Redis using ndarray.tobytes().
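
Roughly this shape, assuming redis-stack and the redis-py client (model name and index schema are just examples):

  import numpy as np
  import redis
  from redis.commands.search.field import TextField, VectorField
  from redis.commands.search.query import Query
  from sentence_transformers import SentenceTransformer

  model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sbert model
  r = redis.Redis()

  # index schema: one text field, one HNSW vector field
  r.ft("docs").create_index([
      TextField("content"),
      VectorField("embedding", "HNSW",
                  {"TYPE": "FLOAT32", "DIM": 384, "DISTANCE_METRIC": "COSINE"}),
  ])

  emb = model.encode("postgres vector search").astype(np.float32)
  r.hset("doc:1", mapping={"content": "postgres vector search",
                           "embedding": emb.tobytes()})

  # approximate k-nearest-neighbor query
  q = model.encode("how do I search vectors?").astype(np.float32)
  res = r.ft("docs").search(
      Query("*=>[KNN 3 @embedding $vec AS score]").sort_by("score").dialect(2),
      query_params={"vec": q.tobytes()},
  )
  for doc in res.docs:
      print(doc.content, doc.score)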


Does redis let you do some kind of similarity based lookup? Is that new?


I'll soon be releasing a CLI app that creates embeddings for entire YouTube channels. I actually looked at whether Supabase offered pgvector, but as of a couple of weeks ago it didn't, so I ended up going with Pinecone. I'll add a mention of this in the docs.


I tested pgvector against a vanilla FAISS index and pgvector was significantly slower with 511d vectors. If you have a small dataset (less than 100k vectors?) it's probably fine, but for larger stores I would look at a distributed vector search provider.


We merged the pgvector PR about 2 weeks ago (https://github.com/supabase/postgres/pull/472). If you're missing anything for your CLI, don't hesitate to reach out and we'll see if we can integrate it into the product (my email is in my profile).

As an aside, Pinecone looks great.


https://github.com/nmslib/hnswlib

Used it to index 40M text snippets in the legal domain. Allows incremental adding.
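
The API is pleasantly small - a rough sketch (parameters here are illustrative, not what I used for the legal corpus):

  import hnswlib
  import numpy as np

  dim, n = 768, 100_000
  data = np.float32(np.random.rand(n, dim))

  index = hnswlib.Index(space="cosine", dim=dim)
  index.init_index(max_elements=n, ef_construction=200, M=16)
  index.add_items(data, np.arange(n))  # incremental adds: call again with new
                                       # items (grow max_elements as needed)
  index.set_ef(50)                     # query-time speed/recall knob

  labels, distances = index.knn_query(data[:1], k=10)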

I love how it just works. You know, it doesn't ANNOY me or make a FAISS. ;-)


What is the memory usage with 40M vectors?


Good point! I believe it was on the order of 20 GB. Used a Hetzner 512 GB bare metal server. $50/m.

P.S. Many people seem to think that for vector search you need a GPU. You don't.


Looking forward to it. If we (Pinecone) can help with anything shoot me an email! greg@pinecone.io


Do you transcribe the YouTube videos, or what do you mean by embeddings for YouTube channels?


Yes, the CLI is a pipeline that fetches audio -> transcribes -> creates embeddings.


Question: How is pgvector applied in this way different from and maybe superior to the PostgreSQL cube extension?

https://www.postgresql.org/docs/current/cube.html

The cube datatype also stores arrays of numbers (cubes or vectors, same thing) and has functions/operators for computing distance. Granted, they're slightly different, with no direct substitute for cosine distance, but on the unit sphere the cosine distance, inner product, and Euclidean distance have direct correspondences.
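
To make the correspondence concrete: for unit vectors u and v,

    ||u - v||^2 = ||u||^2 + ||v||^2 - 2 u·v = 2 - 2 cos(u, v)

so once everything is normalized, ranking by Euclidean distance gives the same order as ranking by cosine distance.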


I looked into this, and (apparently) the cube extension has a very small hard-coded dimensionality limit (100 dimensions) that isn't sufficient for this purpose; you can recompile it to support more, but the way it stores data requires the entire vector to still fit within a single database page (which these vectors will not).


Thanks!


I agree with other comments, this is an outstanding article and you should read it if you haven't already.

Is it correct that if OpenAI alters the embeddings model, such as they did with Ada on 22 December, you need to generate all your embeddings again?


Yes, if the model's weights were retrained. The output vectors would then have different values and would not be meaningful when compared against embeddings produced by a different set of weights.


Does anyone know if it's possible to make this work on RDS (e.g. by getting Amazon to install it for you or something?) Being able to store sentence embeddings in Postgres and compute similarity on-the-fly would save a lot of work!



I was tinkering with pgvector recently but I was getting inconsistent results. Sometimes Postgres just failed to use the index and resorted to a brute force search, which is very slow. Sometimes the index worked, and IIRC I managed to sort 500K vectors (limit=10 or so) with dimension size 1536 in about 80ms, which is okay but not state of the art. I was using an artificial dataset, which might have biased the index statistics, resulting in poor query plans.
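
(For anyone who hits the same thing: explain analyze shows whether the ivfflat index was actually used - the planner can still pick a sequential scan if it estimates that's cheaper. The recall/speed knob is the probes setting:)

    set ivfflat.probes = 10;  -- default is 1; more probes = better recall but slower queries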

Mind you, this was all running in WSL, and I'm very new to vector search, so I might not know what I'm doing.


I stored sentence-transformer embeddings in Elasticsearch and used its cosineSimilarity function to find similar documents.

I'm wondering if it makes sense to use pgvector (or SQLite?) instead.


Quick question - can tsvector be repurposed to do this exact same thing? I know tsvector is for text search, but the advantage is that "similarity" semantics kick in with it - things like rankings, relevance, etc. I think you can also do trigram matching (I could be wrong.. it's been a while since I did this),

while pgvector seems to be doing cosine distance matching.

I'm not sure which is better, but tsvector will support the popular RUM plugin for similarity ranking.

Any thoughts here?


Fantastic article and I really love how the author explains embeddings - very well done.

I also love the practicality of doing this with basic Postgres - very, very powerful stuff.


It seems that it would make sense to reduce the dimensionality for similarity comparisons.


Does OpenAI have an API like this to get the embeddings of images?


If anyone wants to get an A.I. assistant for their docs page, I've been working on Pal (http://www.heypal.chat) these past few weeks.

I'm onboarding folks, and if you use Algolia, GitBook, ReadMe, or most common docs platforms, it takes minutes to set up.

It also integrates with your Slack and Discord (both in terms of letting people ask questions and get responses from the assistant in those tools, and marking conversations you want the assistant to know about for the future).

Would love more folks to try it out if anyone's interested and thinks their product would benefit from it!


How do you think this would work with a semi-large Discord channel related to programming? Discord (and IRC, etc) are great for forming vibrant communities, but past knowledge is lost pretty quickly. Forums (eg: web forum or Discord forums/topics) are great for keeping information more accessible, but in my opinion really hinders the growth of a community.



