A GitHub Tool Strips Meta's Llama 3.3 Safety Guardrails in Under 10 Minutes

A journalist at the Financial Times recently removed all the safety restrictions from Meta's Llama 3.3 in less than 10 minutes. No specialized hardware. No machine learning expertise. Just a tool called Heretic, available on GitHub.

The FT published its account after independently testing Heretic, which was created by developer Philipp Emanuel Weidmann. The result: a fully functional copy of Llama 3.3 with its safety training stripped out - capable of generating content the original model was trained to refuse. Weidmann told the FT that Heretic has been used to produce more than 3,500 such "de-restricted" model variants.

What Guardrails Actually Do

By the time Meta releases a model like Llama 3.3, it has been trained in two broad stages. The first, pretraining, builds the base model - a neural network trained on billions of documents to predict and generate text. The second, post-training, uses methods such as RLHF (reinforcement learning from human feedback) to teach the model to follow safety guidelines, decline harmful requests, and respond appropriately to sensitive topics.

These safety behaviors aren't a filter sitting in front of the model. They're embedded in the model's weights - the billions of numerical parameters that determine how it generates responses. Removing them means modifying those weights directly, which is only possible because Llama 3.3's model files are publicly downloadable.

Heretic works on the model files themselves, systematically overriding the safety training rather than retraining the model. The project's public documentation describes the method as an automated form of directional ablation, sometimes called "abliteration." The FT's under-10-minute test fits that description: the process is highly automated, not a manual operation requiring deep ML knowledge.

The Open-Source Trade-Off

Meta has long known that releasing Llama's weights openly creates this exposure. Its defense has been consistent: open distribution enables better safety research, prevents any single company from controlling AI infrastructure, and lets developers run models locally without API dependencies. Those arguments are substantively correct.

But 3,500 de-restricted variants is a concrete number on the other side of that trade-off. When you open-source a model, you lose the ability to enforce safety constraints at the infrastructure level. Closed API models like OpenAI's GPT-4o or Anthropic's Claude run safety filters server-side on company hardware. If you're generating output through their API, they can moderate that output in real time. With a locally run open model file, that enforcement is impossible.
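The difference between the two deployment models can be sketched in a few lines of Python. This is a toy illustration only, not any provider's actual pipeline: `toy_model`, `moderate`, `HostedEndpoint`, and the blocklist are all hypothetical names invented for the example, and a real service would use far more sophisticated classifiers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical blocklist standing in for a real moderation classifier.
BLOCKED_TERMS = ("forbidden-topic",)


def toy_model(prompt: str) -> str:
    """Stand-in for any text generator; it simply echoes the prompt."""
    return f"response to: {prompt}"


def moderate(text: str) -> bool:
    """Return True when the text passes the toy policy check."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


@dataclass
class HostedEndpoint:
    """Provider-side wrapper: every request and response is checked."""
    model: Callable[[str], str]

    def complete(self, prompt: str) -> str:
        if not moderate(prompt):
            return "[request declined by provider policy]"
        output = self.model(prompt)
        if not moderate(output):
            return "[response withheld by provider policy]"
        return output


hosted = HostedEndpoint(model=toy_model)

# Through the hosted endpoint, the provider's check is always in the path.
print(hosted.complete("hello"))
print(hosted.complete("tell me about forbidden-topic"))

# A local user calls the model function directly; no wrapper is involved.
print(toy_model("tell me about forbidden-topic"))
```

The wrapper only works because the provider controls the machine the model runs on. Once the model file is on a user's own hardware, nothing obliges that user to call it through any wrapper at all.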

Individual jailbreak prompts - clever inputs designed to trick a model into ignoring its training - work inconsistently and can be patched with updates. A de-restricted model file is different. The safety training has been removed from the weights, not bypassed by a clever prompt. Subsequent updates to Llama 3.3 don't affect copies already in circulation.

Most of those 3,500 variants were likely created by security researchers, curious developers, or people who want a model that doesn't refuse edge-case instructions for legitimate use. A small fraction will be used to generate genuinely harmful content. The problem isn't the intent of the typical user - it's that the harmful fraction faces no technical barrier.

What Regulators Are Watching

The EU AI Act includes provisions covering "general-purpose AI models with systemic risk," and regulators in multiple jurisdictions have been working through how open-source models fit into safety frameworks. Cases like Heretic get cited directly in those debates.

The structural problem is that you cannot regulate away model files already downloaded millions of times. Meta could change licensing terms, make future Llama releases technically harder to de-restrict, or restrict distribution to verified parties - but none of those measures reach copies already in the wild.

Meta hasn't publicly stated whether Heretic or the FT's coverage will change its approach to future Llama releases. The company has previously maintained that broad access to open models benefits the safety research community overall. What Heretic demonstrates is that stripping the safety work from one of those models now takes anyone with a laptop and a GitHub account less time than a coffee break.