Model Distillation Attacks, Explained: How Anthropic Caught the Largest Known AI Model Theft

Aryaman Behera | Jun 30, 2026 | 9 min read

TL;DR: Model distillation attacks steal AI capabilities through the API — no weights, no architecture, just 28.8 million well-crafted questions. Anthropic's June 2026 disclosure to the US Senate revealed the largest known case: entities linked to Alibaba allegedly ran a six-week extraction campaign against Claude's coding and reasoning capabilities. The "hidden spyware" controversy that followed may actually be the defense — a watermark designed to catch exactly this kind of theft. Here is how distillation attacks work, why they are so hard to stop, and what they mean for anyone shipping an AI product.

How do you steal a model you can only talk to?

The classic image of AI model theft involves exfiltrating weights — someone copies the 405-billion-parameter file and walks out. That is a supply-chain problem with supply-chain solutions: access controls, encryption at rest, audit logs.

Model distillation is different. The attacker never touches the weights. They sit on the public side of the API, ask the model millions of questions, collect the answers, and use that corpus to train a smaller, cheaper model that reproduces the original's capabilities. The student model learns not from the teacher's internal representations but from its observed behavior — the text it produces, the reasoning patterns it exhibits, the code it writes.

This is not hypothetical. In June 2026, Anthropic told the US Senate Banking Committee that entities linked to Alibaba had conducted what Anthropic called the "largest known distillation attack" against Claude. The alleged numbers: approximately 28.8 million exchanges across roughly 25,000 fraudulent accounts, run between April 22 and June 5, 2026. The targets were Claude's coding and agentic reasoning capabilities — precisely the domains where Claude's training investment is hardest to replicate.

Alibaba has not publicly confirmed or denied the allegations. What follows is the mechanics of how such an attack works, regardless of who is running it.

What distillation actually is

Knowledge distillation was introduced by Hinton, Vinyals, and Dean in 2015 as a compression technique. A large "teacher" model produces soft probability distributions over its vocabulary — not just the final answer, but the relative confidence across all possible tokens. A smaller "student" model trains against these soft labels, learning the teacher's decision boundaries more efficiently than it could from raw training data alone.
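As a rough illustration of what "training against soft labels" means, the sketch below computes a temperature-softened distillation loss for a single token position. The logits are made-up values over a four-token vocabulary, and the function names are illustrative rather than taken from any library.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Soft-label loss: distance between the softened teacher and student distributions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl_divergence(teacher_probs, student_probs)

# Hypothetical logits over a four-token vocabulary (illustrative values only).
teacher = [4.0, 2.5, 0.5, -1.0]
student = [3.0, 3.0, 0.0, 0.0]

print(softmax(teacher, temperature=1.0))
print(softmax(teacher, temperature=2.0))  # flatter: runner-up tokens get more weight
print(distillation_loss(teacher, student))
```

Raising the temperature flattens the teacher's distribution, which is what exposes its relative confidence in the runner-up tokens to the student.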

That is the legitimate version. It requires access to the teacher's output logits (the raw scores behind the full probability vector), not just the sampled text.

Black-box distillation skips the logits entirely. The attacker queries the teacher through a standard API, collects the text outputs (hard labels), and fine-tunes the student on those input-output pairs using supervised fine-tuning (SFT). The student never sees the probability distribution — it only sees the final text. This produces a weaker copy than logit-level distillation, but a dramatically cheaper one. And it requires nothing beyond API access that any paying customer already has.
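To make the hard-label case concrete, here is a minimal sketch of the objective that supervised fine-tuning minimizes: the average negative log-likelihood of the tokens the teacher actually emitted. The student logits and token ids are invented for illustration, and a real training loop would compute this inside a deep learning framework rather than in plain Python.

```python
import math

def log_softmax(logits):
    """Log-probabilities for one position, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def hard_label_loss(student_logits_per_step, teacher_token_ids):
    """Average negative log-likelihood of the teacher's sampled tokens under the student."""
    losses = [
        -log_softmax(logits)[token_id]
        for logits, token_id in zip(student_logits_per_step, teacher_token_ids)
    ]
    return sum(losses) / len(losses)

# Hypothetical example: three positions, vocabulary of four tokens.
student_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.2, 0.2, 1.5, 0.2],
    [0.0, 0.0, 0.0, 0.0],  # student is undecided at this position
]
teacher_tokens = [0, 2, 1]  # the only signal a text-only API exposes

print(hard_label_loss(student_logits, teacher_tokens))
```

Each position contributes a single observed token, so everything the teacher "thought" about the alternatives is lost. That is the information gap between this loss and the soft-label version.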

Some campaigns add a second stage: preference optimization or reinforcement learning from the collected outputs, training the student not just to imitate the teacher's answers but to rank and prefer them over alternatives. This narrows the quality gap further.

The attack pipeline

Figure: Five-stage distillation attack pipeline mapped to the Anthropic-Alibaba case of April to June 2026. Stage 1: curate a prompt corpus targeting coding, reasoning, and agentic tasks. Stage 2: distribute across approximately 25,000 accounts with roughly 1,150 exchanges each. Stage 3 (the focal stage): 28.8 million exchanges quality-filtered to approximately 4 to 7 billion training tokens. Stage 4: fine-tune the student model via supervised fine-tuning on an open-weight base. Stage 5: evaluate against the teacher and iterate.

A distillation campaign follows a predictable sequence:

1. Curate a diverse prompt corpus. The attacker selects or generates prompts that cover the target capability space — coding problems across languages and difficulty levels, multi-step reasoning chains, cybersecurity analysis scenarios. Diversity matters more than volume: a million identical prompts produce a million identical answers and teach the student nothing. The goal is to map the teacher's behavior across the full distribution of tasks the attacker wants to replicate.

2. Distribute queries across accounts. A single account sending 28.8 million requests would be flagged in hours. The Anthropic disclosure describes approximately 25,000 accounts, averaging roughly 1,150 exchanges each — a volume that looks plausible for a legitimate developer or small team. Creating these accounts at scale requires automated signup, likely with disposable email addresses and payment methods.

3. Collect and filter responses. Not every response is useful training data. The attacker filters for high-quality completions — correct code, coherent reasoning chains, substantive analysis — and discards refusals, truncated outputs, and low-quality responses. The filtered corpus becomes the synthetic training dataset.

4. Fine-tune the student model. The student is typically an open-weight base model (Llama, Qwen, Mistral) fine-tuned on the collected input-output pairs. SFT is the minimum viable approach. More sophisticated campaigns add DPO or RLHF using paired high/low-quality responses from the same prompts.

5. Evaluate and iterate. The attacker benchmarks the student against the teacher on held-out prompts, identifies capability gaps, generates targeted prompts to fill those gaps, and runs another collection pass. Each iteration improves the student's fidelity.
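The per-account figures in step 2 follow from simple division. The sketch below reproduces them from the numbers in the disclosure as reported above, under the simplifying assumption that traffic was spread evenly across accounts and days (which a real campaign almost certainly would not do).

```python
from datetime import date

TOTAL_EXCHANGES = 28_800_000
ACCOUNTS = 25_000
START, END = date(2026, 4, 22), date(2026, 6, 5)

days = (END - START).days + 1            # count both endpoints
per_account = TOTAL_EXCHANGES / ACCOUNTS
per_account_per_day = per_account / days
campaign_per_day = TOTAL_EXCHANGES / days

print(days)                 # 45
print(per_account)          # 1152.0
print(per_account_per_day)  # 25.6
print(campaign_per_day)     # 640000.0
```

Around 26 exchanges per account per day is well inside what a single busy developer generates, which is why the per-account view reveals so little.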

The economics are stark. Training a frontier model from scratch costs hundreds of millions of dollars. A distillation campaign costs API credits — even at 28.8 million exchanges, the API bill is a rounding error compared to pretraining compute.

Why 25,000 fake accounts matter

The account volume is not just about evading rate limits. Three things drive the scale:

Corpus diversity. Different accounts can run different prompt distributions simultaneously — one cluster focuses on Python, another on Rust, a third on security analysis. Parallel coverage of the capability space compresses the calendar time from months to weeks.

Behavioral camouflage. Per-account request patterns stay within normal bounds. Any single account looks like a busy developer. The signal is in the aggregate — the coordinated coverage pattern across thousands of accounts — which is exactly the kind of anomaly that takes time to detect.

Resilience. When individual accounts get flagged and banned, the campaign continues across the remaining pool. At 25,000 accounts, losing a few hundred per day is a logistics problem, not a campaign-ending event.
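The camouflage point can be shown with synthetic data: a per-account threshold flags nobody, while the same traffic summed over a linking attribute stands out. Everything here is hypothetical, including both thresholds and the assumption that the defender has some attribute (a shared payment or signup fingerprint, say) that ties the accounts together. In practice, finding that link is the hard part.

```python
from collections import defaultdict

PER_ACCOUNT_DAILY_LIMIT = 200  # hypothetical per-account alert threshold
GROUP_DAILY_LIMIT = 5_000      # hypothetical alert threshold for a linked group

def over_limit(counts, limit):
    """Return the keys whose daily request count exceeds the limit."""
    return sorted(key for key, n in counts.items() if n > limit)

def totals_by_group(daily_counts, group_of):
    """Sum daily request counts over whatever attribute links accounts together."""
    totals = defaultdict(int)
    for account, n in daily_counts.items():
        totals[group_of[account]] += n
    return dict(totals)

# Synthetic data: 1,000 campaign accounts at 26 requests/day, plus two ordinary users.
daily_counts = {f"acct-{i:04d}": 26 for i in range(1000)}
daily_counts.update({"dev-alice": 140, "dev-bob": 12})

# Hypothetical linking attribute assigned by the defender's own tooling.
group_of = {a: "shared-fingerprint-A" for a in daily_counts if a.startswith("acct-")}
group_of.update({"dev-alice": "alice-solo", "dev-bob": "bob-solo"})

print(over_limit(daily_counts, PER_ACCOUNT_DAILY_LIMIT))                     # []
print(over_limit(totals_by_group(daily_counts, group_of), GROUP_DAILY_LIMIT))  # ['shared-fingerprint-A']
```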

Map the 28.8 million exchanges to corpus size: if the average response is 500 tokens, the raw collected corpus is roughly 14.4 billion tokens. After quality filtering, assume 30–50% survives. That leaves roughly 4.3 to 7.2 billion tokens (call it 4–7 billion) of high-quality synthetic training data, curated specifically from a frontier model's best outputs. For context, many competitive open-weight models are fine-tuned on datasets an order of magnitude smaller.
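The corpus estimate is easy to re-run with different assumptions. In the sketch below, the 500-token average and the 30–50% survival range are the assumptions stated in the text, not measured values.

```python
EXCHANGES = 28_800_000
AVG_RESPONSE_TOKENS = 500     # assumed average response length
SURVIVAL_RATES = (0.30, 0.50)  # assumed share surviving quality filtering

def corpus_tokens(exchanges, avg_tokens, survival_rate=1.0):
    """Estimated training tokens after keeping a fraction of the collected responses."""
    return round(exchanges * avg_tokens * survival_rate)

raw = corpus_tokens(EXCHANGES, AVG_RESPONSE_TOKENS)
low, high = (corpus_tokens(EXCHANGES, AVG_RESPONSE_TOKENS, r) for r in SURVIVAL_RATES)

print(f"raw:      {raw:,} tokens")
print(f"filtered: {low:,} to {high:,} tokens")
```

Halving the assumed response length halves every figure, so the estimate is only as good as the token-length assumption.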

Why it is so hard to stop at the API layer

The fundamental problem: a distillation query is indistinguishable from a legitimate query. Both are well-formed prompts asking the model to do what it was built to do. You cannot block "asking good questions" without degrading the product for real users.

Rate limiting slows the attack but doesn't stop it. The attacker adjusts velocity and distributes across more accounts. IP-based blocking fails against cloud infrastructure where IP addresses are ephemeral. Prompt fingerprinting (detecting suspiciously systematic prompt distributions) catches naive attackers but breaks against prompt generation pipelines that introduce realistic variation.
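A small simulation shows why per-account rate limiting only slows things down. The sketch uses a generic token-bucket limiter with made-up parameters (one request per second, burst of five), not any provider's actual limits. Throughput scales linearly with the number of accounts, however aggressively each one is throttled.

```python
class TokenBucket:
    """Generic token-bucket limiter: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def simulate(num_accounts, seconds=60, attempts_per_second=10, rate=1.0, capacity=5):
    """Count requests that get through when every account hammers its own bucket."""
    buckets = [TokenBucket(rate, capacity) for _ in range(num_accounts)]
    allowed = 0
    for now in range(seconds):
        for bucket in buckets:
            for _ in range(attempts_per_second):
                if bucket.allow(float(now)):
                    allowed += 1
    return allowed

print(simulate(1))    # one account: initial burst of 5, then 1 per second
print(simulate(100))  # one hundred accounts: exactly 100x the throughput
```

The limiter does its job for each bucket. The attacker simply pays for more buckets.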

The deeper problem is economic asymmetry. The defender must protect every API call while maintaining service quality for legitimate users. The attacker only needs a subset of calls to succeed — and "success" is a statistical property of the corpus, not a binary per-query outcome. Even a campaign that loses 30% of its accounts still completes the extraction.

This is why defense is shifting from prevention to proof. If you can't block distillation at the query layer without harming real users, the alternative is to embed evidence in the outputs that survives into the distilled model — evidence that the student was trained on your teacher's outputs.

The defense that may already be deployed

Days after the Anthropic-Alibaba story broke, a separate controversy surfaced. Posts on X and secondary security blogs alleged that Claude Code contained a hidden classifier — one that varied Unicode characters in the system prompt when the tool connected through non-Anthropic API proxies. A GitHub issue was filed under the label "spyware." The reaction was immediate: outrage, screenshots, heated threads.

The technical claims — that the classifier checked ANTHROPIC_BASE_URL, matched against a list of roughly 147 Chinese AI lab and reseller domains, and encoded approximately 3 bits into the "Today's date is..." line using near-identical Unicode apostrophe variants — have not been independently verified. The GitHub issue itself contains no code evidence and was self-closed by the reporter. No Anthropic employee responded on the issue thread. The detailed technical narrative traces to a single blog post and its derivatives.

If the claims are accurate, the described behavior is not spyware; it is a watermark. The alleged mechanism exfiltrates no user data and makes no new network request. It only varies characters in a string the system already transmits. That is a textbook canary token: a marker embedded in traffic that lets the model provider trace where that traffic ends up. If a distilled model's training corpus contains prompts with these specific Unicode variations, the provider has forensic evidence that the training data originated from unauthorized proxy traffic.
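To show how little it takes to hide a few bits in a string, here is a generic canary-token sketch. It is not the alleged Claude Code mechanism, whose details are unverified. The eight look-alike code points, the template string, and the function names are all chosen for illustration. With one apostrophe and eight variants, a single character carries 3 bits.

```python
# Eight visually similar code points (an illustrative choice, not a known scheme).
APOSTROPHE_VARIANTS = [
    "\u0027",  # APOSTROPHE
    "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "\u02bc",  # MODIFIER LETTER APOSTROPHE
    "\u2032",  # PRIME
    "\uff07",  # FULLWIDTH APOSTROPHE
    "\u02b9",  # MODIFIER LETTER PRIME
    "\ua78c",  # LATIN SMALL LETTER SALTILLO
    "\u055a",  # ARMENIAN APOSTROPHE
]
VARIANT_INDEX = {ch: i for i, ch in enumerate(APOSTROPHE_VARIANTS)}

def encode(template, value):
    """Hide a value in 0..7 by swapping the first ASCII apostrophe for a look-alike."""
    if not 0 <= value < len(APOSTROPHE_VARIANTS):
        raise ValueError("value must fit in 3 bits")
    return template.replace("'", APOSTROPHE_VARIANTS[value], 1)

def decode(text):
    """Recover the hidden value from the first apostrophe-like character, if any."""
    for ch in text:
        if ch in VARIANT_INDEX:
            return VARIANT_INDEX[ch]
    return None

TEMPLATE = "Today's date is <date>."  # placeholder text for the sketch
marked = encode(TEMPLATE, 5)
print(marked == TEMPLATE)  # False, though the two render almost identically
print(decode(marked))      # 5
```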

Whether or not this specific mechanism exists in Claude Code, the technique itself is well-established and likely deployed by multiple model providers. Canary tokens, output watermarks, and statistical fingerprints are the natural response when prevention fails at the API layer — they shift the game from "stop every query" to "prove the theft after the fact."

The core tension in any watermark defense: it works best when it is secret, and it stops working once the encoding scheme is public. An attacker who knows the Unicode variation pattern can normalize it out before using the corpus for training. Publicizing a watermark method is, in a real sense, destroying it.
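The fragility is easy to demonstrate. Once someone knows which look-alike characters carry the signal, a translation table maps them all back to a plain apostrophe and differently marked strings become identical. The variant list below is hypothetical. Note that standard NFKC normalization alone would not do this, since it leaves a character like U+2019 untouched; the attacker needs the actual list.

```python
import unicodedata

# Hypothetical set of apostrophe look-alikes carrying a canary signal.
KNOWN_VARIANTS = "\u2019\u02bc\u2032\uff07\u02b9\ua78c\u055a"
STRIP_TABLE = str.maketrans({ch: "'" for ch in KNOWN_VARIANTS})

def strip_canary(text):
    """Collapse every known apostrophe variant back to the ASCII apostrophe."""
    return text.translate(STRIP_TABLE)

marked_a = "Today\u2019s date is <date>."
marked_b = "Today\u02bcs date is <date>."

print(marked_a == marked_b)                              # False: two distinct marks
print(strip_canary(marked_a) == strip_canary(marked_b))  # True: signal erased
print(unicodedata.normalize("NFKC", "\u2019") == "'")    # False: NFKC keeps U+2019
```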

The limits of every defense

No single defense solves distillation. Each buys time or raises costs:

Behavioral anomaly detection clusters accounts by query patterns, timing, and capability coverage. Effective against coordinated campaigns but produces false positives against legitimate power users, API aggregators, and enterprise customers with diverse workloads. The Anthropic case presumably used this — detecting the coordinated coverage pattern across 25,000 accounts — but it took weeks, not hours.

Output watermarking embeds statistical patterns in the model's token choices. These can survive into the student model if the watermark is strong enough. The limitation: paraphrasing or back-translation can strip the signal, and character-level marks fall to simple Unicode normalization. Stronger watermarks degrade output quality; weaker ones are easier to remove.

Canary tokens (like the alleged Claude Code mechanism) embed traceable signals in specific outputs. They provide forensic evidence after the fact but do not prevent the extraction. And once the encoding scheme is public, it is trivially defeated.

Terms-of-service enforcement is the legal layer. Anthropic's ToS prohibits using outputs to train competing models. But enforcement requires identifying the attacker, proving the connection between the API usage and the resulting model, and navigating cross-border jurisdictions — none of which is straightforward.
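For the behavioral anomaly detection layer, one simple version of "clustering accounts by query patterns" is to compare each account's topic mix and group the ones that are suspiciously alike. The sketch below assumes prompts have already been labeled by topic. The profiles, the 0.98 threshold, and the account names are invented, and a production system would combine many more signals such as timing and signup metadata.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse topic-count profiles."""
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a) | set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_accounts(profiles, threshold=0.98):
    """Group accounts whose topic mixes are near-identical (connected components)."""
    parent = {acct: acct for acct in profiles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(profiles, 2):
        if cosine(profiles[a], profiles[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for acct in profiles:
        clusters.setdefault(find(acct), []).append(acct)
    return [sorted(members) for members in clusters.values() if len(members) > 1]

# Hypothetical per-account prompt counts by topic.
profiles = {
    "acct-001": {"python": 400, "rust": 390, "security": 410},
    "acct-002": {"python": 395, "rust": 405, "security": 400},
    "acct-003": {"python": 410, "rust": 400, "security": 390},
    "dev-alice": {"python": 900, "sql": 150},
    "dev-bob": {"rust": 30, "docs": 700},
}

print(cluster_accounts(profiles))
```

The false-positive problem is visible in the design: an enterprise customer running the same evaluation suite from several accounts would cluster the same way.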

The honest answer is that model distillation through public APIs is a problem that cannot be fully solved at the technical layer. It can be made expensive, detectable, and legally risky — but not impossible. The defense stack is layered: anomaly detection to catch campaigns in progress, watermarks to provide forensic evidence, and legal enforcement to raise the cost of getting caught.
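For the forensic-evidence layer, one published family of statistical watermarks biases generation toward a keyed "green" subset of the vocabulary and later tests text for an excess of green tokens. The toy below illustrates that idea only. It is not a description of what any provider deploys, the key and vocabulary are made up, and it forces every token to be green, which a real scheme would soften to protect output quality. Whether such a signal survives fine-tuning into a student model is an empirical question this sketch does not answer.

```python
import hashlib
import math
import random

GAMMA = 0.5  # fraction of the vocabulary treated as "green" at each step

def is_green(prev_token, token, key="demo-key"):
    """Keyed pseudo-random partition of the vocabulary, seeded by the previous token."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 256 < GAMMA

def watermark_z_score(tokens, key="demo-key"):
    """How many standard deviations the green-token count sits above chance."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    green = sum(is_green(prev, tok, key) for prev, tok in pairs)
    n = len(pairs)
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

def generate(length, watermark, seed=0):
    """Toy generator: random tokens, optionally restricted to green ones."""
    rng = random.Random(seed)
    vocab = [f"tok{i}" for i in range(1000)]
    tokens = [rng.choice(vocab)]
    while len(tokens) < length:
        candidate = rng.choice(vocab)
        if not watermark or is_green(tokens[-1], candidate):
            tokens.append(candidate)
    return tokens

z_marked = watermark_z_score(generate(200, watermark=True))
z_plain = watermark_z_score(generate(200, watermark=False))
print(z_marked, z_plain)
```

Without the key, the green set is unknown, so the detector only works for whoever holds it. That is the secrecy dependence described earlier.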

What this means for anyone shipping an AI product

Model distillation is categorized under LLM10: Model Theft in version 1.1 of the OWASP Top 10 for LLM Applications; the 2025 edition covers model extraction under LLM10:2025 Unbounded Consumption. If you are building on top of a frontier model, you inherit this risk indirectly: your fine-tuning data, your system prompts, your tool configurations, and your specialized behaviors can all be extracted through the same black-box distillation pipeline.

The practical implications:

Your fine-tuning investment is extractable. If you have fine-tuned a model on proprietary data to produce domain-specific outputs, an attacker can distill those outputs into their own model without ever touching your training data. The defense is the same layered stack: anomaly detection on your API, output watermarking if your provider supports it, and monitoring for models that suspiciously replicate your specialized behavior.

Your system prompts are already leaking. Distillation is the slow, expensive version of capability extraction. Prompt injection and system prompt extraction are the fast, cheap versions. If your system prompt contains proprietary logic, it is likely already accessible to a motivated attacker through direct extraction before they would bother with distillation.

Runtime monitoring catches what prevention misses. When distilled models inherit their teacher's capabilities, they also inherit behavioral fingerprints. Monitoring for models that reproduce your outputs — or your safety gaps — is a detection surface that scales better than per-query prevention.
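One crude way to look for inherited behavioral fingerprints is to send the same probe prompts to your model and a suspect model and measure how much phrasing they share. The sketch below uses word-trigram Jaccard overlap on invented outputs. It is a starting point only: high overlap is a reason to investigate, not proof of distillation, since two models can converge on similar phrasing for ordinary reasons.

```python
def ngrams(text, n=3):
    """Set of lowercase word n-grams in a response."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Share of n-grams the two sets have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def fingerprint_similarity(reference_outputs, suspect_outputs, n=3):
    """Mean n-gram overlap across paired responses to the same probe prompts."""
    scores = [
        jaccard(ngrams(ref, n), ngrams(sus, n))
        for ref, sus in zip(reference_outputs, suspect_outputs)
    ]
    return sum(scores) / len(scores)

# Hypothetical responses to the same two probe prompts.
ours = [
    "first validate the input then open the file inside a context manager",
    "the function returns early when the cache already holds the key",
]
suspect = [
    "first validate the input then open the file inside a try block",
    "the function returns early when the cache already holds the key",
]
unrelated = [
    "you should probably check arguments before doing any file work",
    "it exits immediately if a cached value exists",
]

print(fingerprint_similarity(ours, suspect))
print(fingerprint_similarity(ours, unrelated))
```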

If you are building AI products and want to understand your exposure to extraction, replication, and adversarial attacks, Repello's ARTEMIS platform maps these attack surfaces before an attacker does — and ARGUS monitors for them at runtime.


Frequently asked questions

What is a model distillation attack?

A model distillation attack extracts an AI model's capabilities by querying it at scale and using the outputs to train a cheaper student model. The attacker never needs access to the model's weights, architecture, or training data — only its API responses. The student model learns to mimic the teacher's behavior across targeted capability domains like coding, reasoning, or cybersecurity.

How did Alibaba allegedly distill Claude?

According to Anthropic's June 2026 letter to the US Senate Banking Committee, entities linked to Alibaba ran approximately 28.8 million exchanges through roughly 25,000 fraudulent accounts between April 22 and June 5, 2026. The queries targeted Claude's coding and agentic reasoning capabilities. Alibaba has not publicly confirmed or denied the allegations.

Is the Claude Code hidden classifier spyware?

The claims remain unverified. In late June 2026, posts on X and secondary blogs alleged that Claude Code contained a classifier that varied Unicode characters in the system prompt when connected through non-Anthropic API proxies. If accurate, the described behavior — no data exfiltration, no new network request, only character variation in text already being sent — would be consistent with a watermark or canary token designed to detect unauthorized API proxying, not spyware. The GitHub issue filed about it contained no code evidence and was self-closed by the reporter without any Anthropic response on record.

Can you prevent model distillation through rate limiting?

Rate limiting slows distillation but does not prevent it. Attackers distribute queries across thousands of accounts and vary request patterns to avoid triggering anomaly detectors. The fundamental challenge is that a distillation query looks identical to a legitimate query — both are well-formed prompts asking the model to perform its intended function.

What defenses exist against model distillation attacks?

Current defenses include behavioral anomaly detection (clustering accounts by query patterns), output watermarking (embedding statistical patterns that transfer into the student model), canary tokens (encoding traceable signals into outputs), and terms-of-service enforcement. Each has limits: anomaly detection produces false positives, watermarks can be stripped through paraphrasing or normalization, and ToS enforcement requires identifying the attacker across jurisdictions.