Not yet - these are base models, or "foundation models". They're great for molding into different use cases via finetuning (better than common models like BERT, RoBERTa, etc., in fact), but like those models, these ModernBERT checkpoints can only do one thing out of the box: mask filling.
For other tasks, such as retrieval, people still need to finetune them. The ModernBERT documentation has some scripts for finetuning with Sentence Transformers and PyLate for retrieval: https://huggingface.co/docs/transformers/main/en/model_doc/m...
But people still need to make and release these models. I have high hopes for them.
Beyond what the others have said about 1) ModernBERT-base being 149M parameters vs BERT-base's 110M and 2) most LLMs being decoder-only models, also consider that alternating attention (local vs global) only starts helping once you're processing longer texts. With short texts, local attention is equivalent to global attention.
I'm not sure what length was used in the picture, but GLUE is mostly pretty short text.
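To make the "local == global on short texts" point concrete, here's a toy numpy sketch. The window size of 128 matches what I understand ModernBERT's local attention to use, but the mask construction here is just an illustration, not the actual implementation:

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Bidirectional sliding-window mask: token i may attend to token j
    iff |i - j| <= window // 2. (Toy illustration, not ModernBERT's code.)"""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

def global_mask(seq_len: int) -> np.ndarray:
    """Full bidirectional attention: every token attends to every token."""
    return np.ones((seq_len, seq_len), dtype=bool)

# For a sequence shorter than the window, the local mask is all-True,
# i.e. identical to global attention — no savings, no difference.
print(np.array_equal(local_attention_mask(64, 128), global_mask(64)))    # True
# For a long sequence they diverge: distant token pairs get masked out.
print(np.array_equal(local_attention_mask(512, 128), global_mask(512)))  # False
```

So on short-text benchmarks like GLUE, the alternating local/global layers behave like plain full attention.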
1. an encoder takes an input (e.g. text), and turns it into a numerical representation (e.g. an embedding).
2. a decoder takes an input (e.g. text), and then extends the text.
(There are also encoder-decoders, but I won't go into those.)
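The two interfaces can be sketched in a few lines of numpy. Everything below (vocab size, dimensions, the random embedding table, the "pooling" and "logits" steps) is made up for illustration; real models mix tokens with stacked attention layers:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim, seq_len = 100, 16, 5
tokens = rng.integers(0, vocab_size, size=seq_len)  # a toy "tokenized text"
E = rng.normal(size=(vocab_size, dim))              # toy embedding table

def encode(tokens: np.ndarray) -> np.ndarray:
    """Encoder interface: text in -> a numerical representation out.
    Here: mean-pool the token vectors into one sentence embedding."""
    return E[tokens].mean(axis=0)                   # shape: (dim,)

def next_token_logits(tokens: np.ndarray) -> np.ndarray:
    """Decoder interface: text in -> scores over the *next* token out,
    which you sample from to extend the text."""
    h = E[tokens[-1]]                               # toy "hidden state"
    return h @ E.T                                  # shape: (vocab_size,)

print(encode(tokens).shape)             # (16,)  — an embedding
print(next_token_logits(tokens).shape)  # (100,) — a distribution over the vocab
```

The output types are the whole story: a vector you compute with, versus scores you sample text from.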
These two simple definitions immediately give information on how they can be used. Decoders are at the heart of text generation models, whereas encoders return embeddings with which you can do further computations.
For example, if your encoder model is finetuned for it, the embeddings can be fed through another linear layer to give you classes (e.g. token classification like NER, or sequence classification for full texts). Or the embeddings can be compared with cosine similarity to determine the similarity of questions and answers. This is at the core of information retrieval/search (see https://sbert.net/). Such similarity between embeddings can also be used for clustering, etc.
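The retrieval step above is just cosine similarity over embeddings. A minimal hand-rolled sketch (the vectors here are made up; in practice they'd come out of a finetuned encoder via a library like sentence-transformers):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Pretend these embeddings came out of a finetuned encoder (values made up).
query = np.array([0.9, 0.1, 0.0])
docs = {
    "doc_a": np.array([0.8, 0.2, 0.1]),  # points roughly the same way as the query
    "doc_b": np.array([0.0, 0.1, 0.9]),  # points elsewhere
}

# Rank documents by similarity to the query — the core of embedding-based search.
ranked = sorted(docs, key=lambda d: cosine_sim(query, docs[d]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b']
```

Swap the toy vectors for real encoder outputs and this same ranking loop is semantic search.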
In my humble opinion (but it's perhaps a dated opinion), (encoder-)decoders are for when your output is text (chatbots, summarization, translation), and encoders are for when your output is literally anything else. Embeddings are your toolbox, you can shape them into anything, and encoders are the wonderful providers of these embeddings.
I still find this explanation confusing because decoder-only transformers still embed the input and you can extract input embeddings from them.
Is there a difference here other than encoder-only transformers being bidirectional and their primary output (rather than a byproduct) being input embeddings? Is there a reason, other than that bidirectionality, that we use dedicated encoder-only embedding models instead of just cutting and pasting a decoder-only model's embedding phase?
The encoder's embedding is contextual: it depends on all the tokens. If you pull out just the embedding layer from a decoder-only model, you get a fixed embedding where each token's representation doesn't depend on the other tokens in the sequence. The bidirectionality is also important for getting a proper representation of the sequence, though you can train decoder-only models to emit a single embedding vector once they have processed the whole sequence left to right.
Fundamentally, it's the difference between bidirectional attention in the encoder and a triangular (or "causal") attention mask in the decoder.
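The two masks side by side, as a quick numpy illustration:

```python
import numpy as np

seq_len = 4

# Encoder: bidirectional — every token attends to every other token,
# so each position's representation can use both left and right context.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

# Decoder: causal — token i only attends to positions j <= i
# (the lower triangle), so it can never peek at future tokens.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The first row is why a decoder's early-token representations are impoverished as embeddings: token 0 attends only to itself, no matter what follows it.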
By "irrespective of their relevance to the language modeling task", the authors mean that the semantic meaning of the tokens is not important. These 4 tokens can be completely replaced by newlines (i.e. tokens with no semantic meaning), and the perplexity as measured on a book of 65k tokens is nearly unaffected.
The clue is really that these tokens are just used to "offload" attention scores: their semantic meaning is irrelevant.
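The mechanism behind the offloading is just softmax normalization: attention weights must sum to 1, so when no token is truly relevant, the mass has to land somewhere, and the sink token absorbs it. A toy demo (the scores are made up to illustrate the effect):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax — outputs are positive and sum to 1."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Raw attention scores of one query against [sink, tok1, tok2, tok3].
# None of the real tokens are relevant, but the weights still must sum to 1,
# so the model learns to dump the excess onto the sink (made-up numbers).
scores = np.array([4.0, 0.5, 0.3, 0.2])
weights = softmax(scores)

print(round(weights.sum(), 6))  # 1.0 — the constraint that forces offloading
print(weights[0] > 0.9)         # True — most of the mass lands on the sink
```

Since the sink's job is only to soak up this excess weight, its content (a newline, a special token, anything) doesn't matter — exactly the observation in the comment above.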
Various experiments on the recent Window Attention with Attention Sinks/StreamingLLM approach indicate that it markedly improves the inference fluency of pretrained LLMs, while also reducing KV-cache VRAM usage from linear to constant in the sequence length.
It can be applied to pretrained LLMs with little to no additional effort, and Hugging Face transformers is working on first-party support. Until then, the third-party module in the blogpost already works well.
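The cache policy itself is simple enough to sketch. This is only an illustration of the idea, not the real implementation: keep the first `n_sink` positions (the attention sinks) plus the most recent `window` positions, and evict everything in between, so memory stays constant no matter how long generation runs. The specific numbers (4 sinks, a 1020-token window) are placeholders:

```python
def evict(cache_positions: list, n_sink: int = 4, window: int = 1020) -> list:
    """StreamingLLM-style KV-cache policy sketch: retain the sink tokens at the
    start plus a sliding window of recent tokens; drop the middle."""
    if len(cache_positions) <= n_sink + window:
        return cache_positions
    return cache_positions[:n_sink] + cache_positions[-window:]

cache = list(range(2000))      # pretend we've generated 2000 tokens so far
cache = evict(cache)

print(len(cache))              # 1024 — constant, regardless of total length
print(cache[:4])               # [0, 1, 2, 3] — the sinks survive forever
print(cache[4])                # 980 — then the most recent window
```

Because only which cache entries are kept changes (not the model weights), this can be bolted onto a pretrained LLM with little to no extra effort, which is why third-party modules like the one in the blogpost work out of the box.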