This GitHub repo makes it pretty easy to build similar tech: you first embed whatever images you have using the released CLIP model from OpenAI, then build a Faiss index over those embeddings for quick nearest-neighbor retrieval. Because CLIP maps text and images into the same embedding space, you can then do both text->image and image->image semantic search.
https://github.com/rom1504/clip-retrieval
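
To make the retrieval step concrete, here is a rough stdlib-only sketch. It is not the repo's code: the three-dimensional vectors are made-up stand-ins for CLIP embeddings, and the hypothetical `FlatIndex` class does a brute-force inner-product scan where a real setup would use a Faiss index. The idea it illustrates is just that once everything is a normalized vector, text->image and image->image search are the same nearest-neighbor lookup.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


class FlatIndex:
    """Hypothetical brute-force inner-product index (stand-in for Faiss)."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, vec):
        self.ids.append(item_id)
        self.vectors.append(normalize(vec))

    def search(self, query, k=1):
        """Return the k best (score, item_id) pairs, highest score first."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Toy "image embeddings" (assumed values; real CLIP vectors have hundreds of dims).
index = FlatIndex()
index.add("cat.jpg", [0.9, 0.1, 0.0])
index.add("dog.jpg", [0.1, 0.9, 0.0])
index.add("car.jpg", [0.0, 0.1, 0.9])

# text->image: pretend this is the embedding of the text "a photo of a cat".
text_results = index.search([0.8, 0.2, 0.0], k=2)

# image->image: query with the embedding of an image already in the index.
image_results = index.search([0.1, 0.9, 0.0], k=1)

print(text_results)
print(image_results)
```

In a real pipeline the vectors would come from CLIP's image and text encoders, and the linear scan would be replaced by an approximate index so lookups stay fast over millions of images.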