Anthropic's Interpretability Tools Can Now Peer Inside Google's Gemma 3

What's actually happening inside a language model when it picks its next word? For most of the industry's history, the honest answer has been that nobody really knows - not even the teams who built the model. Anthropic's interpretability researchers just made that question a bit more answerable, and they did it by applying their tools to Google's Gemma 3, a model Anthropic didn't build.

The technique at the center of the research is called Natural Language Attribution, or NLA. Here's the problem it solves: when a large language model generates text, it runs your prompt through billions of mathematical operations organized around learned "features" - internal patterns that activate when the model encounters certain inputs. A feature might represent a concept like "financial risk," a grammatical pattern like "negation," or something harder to categorize. Normally, you see the output word but have no visibility into which features fired or how much each one shaped that specific choice.

NLA traces that path. For each token (a word or word fragment) a model generates, it identifies which internal features contributed most and expresses those contributions in human-readable language rather than raw activation numbers. The shift from "feature 4,271 had activation 0.87" to something meaningful - the model strongly weighted a concept related to legal liability, for instance - is what makes the approach practically useful for diagnosing why a model behaves the way it does.
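
As a rough illustration of that shift from raw numbers to readable descriptions, here is a minimal Python sketch. It is not Anthropic's implementation: the feature IDs, labels, and contribution scores are hypothetical, and a real tool would derive them from the model's internals rather than from a hand-written table.

```python
from typing import Dict, List, Tuple

# Hypothetical feature labels, invented for illustration only.
FEATURE_LABELS: Dict[int, str] = {
    4271: "legal liability",
    88: "negation",
    1530: "financial risk",
}


def top_contributions(contribs: Dict[int, float], k: int = 2) -> List[Tuple[int, float]]:
    """Return the k features with the largest absolute contribution."""
    return sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]


def describe(token: str, contribs: Dict[int, float], k: int = 2) -> str:
    """Render the strongest contributions to one token as a readable sentence."""
    parts = []
    for feature_id, score in top_contributions(contribs, k):
        # Fall back to the raw ID when no label is known.
        label = FEATURE_LABELS.get(feature_id, f"unlabeled feature {feature_id}")
        direction = "pushed toward" if score >= 0 else "pushed away from"
        parts.append(f"'{label}' {direction} this token ({score:+.2f})")
    return f"Token {token!r}: " + "; ".join(parts)


# Made-up scores for a single generated token.
example = describe("indemnify", {4271: 0.87, 88: -0.12, 1530: 0.31})
print(example)
```

The sketch covers only the last step, turning scores into words. Computing trustworthy contribution scores in the first place is the hard part, and this example assumes they are already available.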

Why Testing on Gemma 3 Is Significant

Anthropic's interpretability work has historically focused on its own models, particularly Claude. Applying NLA to Gemma 3, Google's open-weight model, extends that work beyond Anthropic's own systems. Because Gemma 3's weights are publicly available - anyone can download and run the full model - independent researchers can verify, reproduce, or build on these findings in ways that aren't possible with closed commercial models accessible only through an API.

There's also a practical reason Gemma 3 is a good test case: it's competitive. The model performs well on many benchmarks relative to models with significantly larger parameter counts, so findings here apply to a capable, real-world system rather than a toy.

What This Means for Developers

For developers building on open-source models, better interpretability tools mean a cleaner path to diagnosing specific failures. When a model consistently hallucinates in a certain domain or produces wrong outputs on a particular reasoning pattern, NLA gives a way to identify which internal features are driving that behavior - rather than guessing at causes from the output symptoms alone.
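
One way a developer might use such attributions, sketched in Python under stated assumptions: collect the feature labels an attribution tool reports for failing outputs and for passing ones, then look for labels that show up far more often in the failures. The function names, labels, and threshold below are hypothetical and are not part of any published tool.

```python
from collections import Counter
from typing import Dict, List, Sequence


def feature_rates(cases: Sequence[List[str]]) -> Dict[str, float]:
    """Fraction of cases in which each feature label appears at least once."""
    if not cases:
        return {}
    counts = Counter(label for case in cases for label in set(case))
    return {label: n / len(cases) for label, n in counts.items()}


def suspicious_features(
    failing: Sequence[List[str]],
    passing: Sequence[List[str]],
    min_gap: float = 0.5,
) -> List[str]:
    """Labels whose appearance rate in failures exceeds the passing rate by min_gap."""
    fail_rates = feature_rates(failing)
    pass_rates = feature_rates(passing)
    gaps = {
        label: rate - pass_rates.get(label, 0.0)
        for label, rate in fail_rates.items()
    }
    flagged = [label for label, gap in gaps.items() if gap >= min_gap]
    # Largest gap first.
    return sorted(flagged, key=lambda label: -gaps[label])


# Made-up top-feature labels for three failing and two passing outputs.
failing = [
    ["citation format", "medical dosage"],
    ["medical dosage", "negation"],
    ["medical dosage"],
]
passing = [["negation"], ["citation format", "negation"]]

flagged = suspicious_features(failing, passing)
print(flagged)
```

A label that is frequent in failures is a lead to investigate, not proof of cause. The comparison only narrows down where to look.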

For the field, the more consequential implication is that interpretability research is becoming less lab-specific. When techniques developed at one lab can be meaningfully applied to models from another, findings generalize instead of remaining siloed. Testing across model families rather than within a single organization's models is how the field actually makes progress.