Hi! I work directly on these teams as a model builder and have talked to my colleagues at the other labs as well.
All our orgs have openings, and you could also consider working for organizations such as the UK AISI team and other independent groups that are assessing these models. It's a critical field and there is a need for motivated folks.
BDA is THE book for learning Bayesian modeling in depth and rigorously. For different approaches there are a number shared here, like Statistical Rethinking from Richard McElreath, or Regression and Other Stories, which Gelman and Aki wrote as well.
I also wrote a book on the topic which takes a code-and-example-focused approach. It's available open access here: https://bayesiancomputationbook.com
It does help to figure out where in the space this model fits. I'm still a bit confused about this part:
>since it needs to be shaped to match specific tasks, we did our best to design it to be a flexible starting point for LLM-style tasks and worked with partners to put it into the right frameworks and places for you all to be able to shape it to what you need it to be.
What does shaping mean in this case? What tools are used, what requirements are there, both in terms of hardware and knowledge?
I would like to go beyond being spoonfed by large companies' high-usability products, both to improve my knowledge and to avoid being a victim of potential future rug pulls. In the classic software world, I guess the equivalent would be someone who runs open source software, navigating the extra complexity, and occasionally collaborates with the projects.
But I don't know what that looks like in the AI world. I've gone through some courses on machine learning, but learning the basics about Hessian matrices and gradient descent seems as detached from the practical point I'm searching for as taking a compilers class is from learning React, so I think I've been looking in the wrong places (?).
> What does shaping mean in this case? What tools are used, what requirements are there, both in terms of hardware and knowledge?
I'll try making an analogy to another task I like, which is cooking. In cooking, the chef has to make high-level decisions like what the overall meal is going to look like, but also detailed decisions like what the main course is versus the side, and even more detailed ones like the proportion of side dish to main dish, which ingredients to use, how long to cook something, etc.
It's kind of the same with ML models, whether AI or not. When I build smaller Bayesian models I make specific choices about the model architecture, which data I use, the array shape of the output, etc.
The tools used here are largely JAX or PyTorch, often in a framework like Flax or another higher-level NN package. You often then pair it with libraries which have NN optimizers, data loaders, etc. PyTorch is more batteries-included than the JAX ecosystem, which separates these out.
One of the best ways to get a grasp of all of this is to implement some small models yourself. These pieces will start to become more apparent and concrete, especially because as an end user you're not exposed to them, the same way most end users are not exposed to compilers.
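For instance, here's a minimal sketch of what "implementing a small model yourself" can look like in PyTorch. Every dimension and layer choice below is a made-up example of the kind of decision you end up making explicitly once you're building rather than just consuming:

    import torch
    import torch.nn as nn

    # A deliberately tiny model. Every number here (vocab size, embedding dim,
    # hidden size) is a made-up choice, the kind you make explicitly when you
    # build models yourself instead of consuming them through an API.
    class TinyClassifier(nn.Module):
        def __init__(self, vocab_size=1000, embed_dim=32, hidden_dim=64, num_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)  # lookup table
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, token_ids):
            # token_ids: (batch, seq_len) -> mean-pooled embeddings -> logits
            return self.mlp(self.embed(token_ids).mean(dim=1))

    model = TinyClassifier()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Fake batch, just to exercise one training step end to end.
    tokens = torch.randint(0, 1000, (8, 16))
    labels = torch.randint(0, 2, (8,))

    loss = loss_fn(model(tokens), labels)
    loss.backward()
    optimizer.step()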
The word reasonable is vague, but assuming you mean something that could be run in a residential unit, it would take a very long time if training from pure scratch.
This is part of the rationale for releasing this model. Now you don't have to start from scratch, and finetuning is reasonable on a wide variety of hardware, including modest GPU setups (and smaller).
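As a rough illustration (not an official recipe), a bare-bones finetuning step with the Hugging Face transformers library might look like this; the checkpoint id is a placeholder for whichever Gemma checkpoint you actually pull:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # The checkpoint id below is a placeholder; substitute whichever Gemma
    # checkpoint you actually downloaded.
    model_id = "google/gemma-3-270m"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    # One toy step; in practice you'd loop over a DataLoader of your own task data.
    batch = tokenizer("Summarize: the quick brown fox ...", return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # causal LM loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()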
The evaluations provide this indication. You'll see MMLU, GPQA, BIG-bench, etc. in reports for many models; those numbers give you the comparison you're looking for.
To answer a question you didn't ask: with small models especially, we need to make choices as to which tasks to focus on. For this model we focused on text summarization and instruction following, with the idea that users would finetune to gain performance on the task set that is relevant to them.
Yes! To me the primary value is not just as a teaching or toy model. I see a lot of value in repeatable tasks if we think about enterprise use, and as a fast local developer model for individual usage.
Here are some examples inspired by previous roles I had outside of Google, where a business I was working in needed real-time text processing.
These tutorials were made with Gemma versions from a year ago, but could now be recreated with Gemma 270M.
Hey all,
I created this model with a top-notch team. I answered many questions last week when this hit the front page, and I'm happy to answer more here as well.
I would like to know your thoughts on using 2/3 of such a small model's size for embeddings. What would be different if you used a byte-level vocabulary and spent the parameter budget on transformer parameters instead? I think you would lose performance (tok/s) but might gain accuracy.
At this small scale the embeddings indeed were a big focus. Consider this thought process.
The tokens themselves are a form of compression. Let's say we have the word "WaffleHouse": at the character level this would be 11 tokens, but with a subword tokenizer this would be perhaps 2 or 3 tokens (I didn't actually run it through the tokenizer, but we could verify precisely). This matters a lot for on-device processing especially.
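If you want to verify this yourself, a quick check with the Hugging Face tokenizer would look something like this (the checkpoint id is a placeholder for whichever Gemma tokenizer you have locally):

    from transformers import AutoTokenizer

    # Placeholder checkpoint id; use whichever Gemma tokenizer you have locally.
    tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-270m")

    tokens = tokenizer.tokenize("WaffleHouse")
    print(tokens, len(tokens))  # a couple of subword tokens vs. 11 characters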
So while we could get more intelligence out of the model by bumping up the "knowledge" parameters, the device would need to process more input and output tokens.
Another advantage on small devices is that the embeddings are just a lookup table, which requires little to no computation. It's the rest of the parameters that have the expensive matrix multiplications, so if we increased those we'd also be increasing the number of FLOPs needed for a forward pass.
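To make that concrete, here's a rough sketch (with made-up sizes, not the real config) of why the embedding table is cheap at inference time compared to the dense layers:

    import torch
    import torch.nn as nn

    vocab_size, d_model = 32_000, 512  # illustrative sizes, not the real config

    embed = nn.Embedding(vocab_size, d_model)  # lookup: row selection, no matmul
    proj = nn.Linear(d_model, d_model)         # matmul: ~d_model^2 multiply-adds per token

    ids = torch.tensor([[17, 42, 256]])  # (batch=1, seq_len=3)
    x = embed(ids)   # just indexes three rows out of the weight table
    y = proj(x)      # this is where the FLOPs live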
So all this is to say there are definite tradeoffs between model size, performance on evals, and compute cost. We ran many internal experiments with different choices to see what could work well, and then picked what we believed would work best for the open community.
How would this matrix get trained with PyTorch? I currently have a toy Transformer network; I ended up marking the matrix as sparse and using SparseAdam, which gives a bit of a performance boost, but at the same time I can't use torch.compile() on the fetch from this matrix.
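For reference, my current setup looks roughly like this (sizes are arbitrary):

    import torch
    import torch.nn as nn

    # sparse=True makes the embedding produce sparse gradients, which SparseAdam
    # expects; the rest of the network keeps a dense optimizer. Sizes are arbitrary.
    embed = nn.Embedding(50_000, 256, sparse=True)
    head = nn.Linear(256, 50_000)

    sparse_opt = torch.optim.SparseAdam(embed.parameters(), lr=1e-3)
    dense_opt = torch.optim.AdamW(head.parameters(), lr=1e-3)

    ids = torch.randint(0, 50_000, (8, 32))
    loss = head(embed(ids)).mean()
    loss.backward()
    sparse_opt.step()
    dense_opt.step()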
Does Gemma use any specific scheme to compress embeddings? Which have you considered?
For instance, it's well-known that transformer embeddings tend to form clusters. Have you considered splitting the embedding table into "cluster centroid" and "offset from centroid" tables, where the latter would presumably have a smaller range and precision?
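Roughly what I have in mind, as a sketch (all sizes arbitrary):

    import torch

    vocab_size, d_model, num_clusters = 50_000, 256, 512  # arbitrary sizes

    centroids = torch.randn(num_clusters, d_model)               # small shared table
    assignments = torch.randint(0, num_clusters, (vocab_size,))  # centroid id per token
    offsets = torch.randn(vocab_size, d_model) * 0.1             # small range -> quantizes well

    def lookup(token_ids):
        # reconstructed embedding = cluster centroid + per-token offset
        return centroids[assignments[token_ids]] + offsets[token_ids]

    emb = lookup(torch.tensor([[1, 2, 3]]))  # (1, 3, d_model)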
Very stupid question: why does the tflite model output only '[multimodal][multimodal]' when executed on GPU in the AI Edge Gallery app, while working fully on the CPU?
This was released with the initial batch of Gemma 3, so it doesn't contain the 270M details; nonetheless, you'll get a good idea of what it takes to build these models.
It is extremely valuable for researchers that commonly prototype theories using PyTorch on less powerful devices. Many of my colleagues run theory experiments using GPT-2 models. This allows for an easy transition to testing on a SOTA model instead.
I'm not an ML engineer, so I can speak to the "non-MLE" bit from my perspective.
(literal tl;dr: learning and experimentation opportunity)
1. Since it's just PyTorch, that means one can run it locally on whatever accelerator you have that PyTorch supports. For quite a few people that includes Metal Performance Shaders: https://docs.pytorch.org/docs/stable/mps.html (see the device-selection sketch after this list)
I can attest that building PyTorch from git is achievable in about 15 minutes on my M1 Pro, if you really want to chase that rabbit hole. Cloning PyTorch is its own special "please. wait.", but building it is fine.
2. Since it's (of the ones that I've looked at) approximately 500 lines long, it's much, much, much more digestible than a lot of the vomit that comes out of so-called production systems. Those systems usually have only heard about typed Python in passing, and they believe it is a fad that will blow over. The ones in this repo aren't stellar about it, but at 500 lines it's easily achievable to type hint the code yourself, which can serve as an excellent learning opportunity
5. Further related, one can play around with the fine-tuning mentioned elsewhere, to better understand what is and isn't possible to achieve using that process. Because the code is digestible, and the models are reasonably sized (Qwen 0.6B weighs only 1.4GB and is Apache 2), it brings FAFO opportunities in ways that gpt-oss-20b (or bigger!) won't
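As a concrete example of the accelerator point in item 1, the device-selection boilerplate is only a few lines (a minimal sketch):

    import torch

    # Pick whatever accelerator this machine actually has; fall back to CPU.
    if torch.backends.mps.is_available():
        device = torch.device("mps")
    elif torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    x = torch.randn(4, 4, device=device)
    print(device, x.sum())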
I do appreciate that some of what I said may skate close to "ML engineer" concerns, so obviously your situation will be different. But for me, having a better grip on how these things work enables me to have better conversations with my colleagues, and also helps trip my bullshit detector when someone claims they're the second coming and are going to cure cancer or whatever.
Thanks for making this! One of my favorite projects was having a Discord chatbot powered by the original BERT model - these 270M weights are a fine upgrade.
It can possibly perform basic prompted function calling (FC), but I wouldn't get your hopes up. It should be able to become a solid FC model if trained on specific tools and a specific format. I would not expect great MCP performance, because the context window is 32k and most MCP servers I've seen implicitly assume massive context windows.
I'm quite glad to hear it's working for you! Thank you for adding the comment here as well. We definitely try our best to make useful models, but it's fantastic to hear from actual users that we hit the mark.
https://ravinkumar.com/GenAiGuidebook/language_models/Agents... https://github.com/canyon289/ai_agent_basics/blob/main/noteb...