CPU BERT inference is fast enough to embed 50 examples per second. Your large in...

jkb79 · on Dec 15, 2021

It's about choosing the right Transformer model, there are several models which are smaller, with fewer parameters than bert-base which gives the exact same accuracy as bert-base, which you can use on a modern CPU single digit ms, even with a single intra-thread. See for example, https://github.com/vespa-engine/sample-apps/blob/master/msma...

leobg · on Dec 15, 2021

Thanks for the link!

I compared BERT[1], distilbert[2], mpnet[3] and minilm[4] in the past. But the results I got "out of the box" for semantic search were not better than using fastText, which is orders of magnitude faster. BERT and distilbert are 400x slower than fastText, minilm 300x, and mpnet 700x. At least if you are using a CPU-only machine. USE, xlmroberta and elmo were even worse (5,000 - 18,000x slower).

I also love how fast and easy it is to train your own fastText model.

[1]: distiluse-base-multilingual-cased-v2

[2]: multi-qa-MiniLM-L6-cos-v1

[3]: multi-qa-mpnet-base-cos-v1

[4]: multi-qa-MiniLM-L6-cos-v1

jkb79 · on Dec 15, 2021

Vector models are nothing but representation learning and applying the model out-of-domain usually gives worse results than plain old BM25. See https://arxiv.org/abs/2104.08663

A concrete example is DPR which is a state of the art dense retriever model for wikipedia for question answering, when applying that model on MS Marco passage ranking it performs worse than plain BM25.