What is a Model Extraction Attack?
A model extraction attack — sometimes called model stealing — recreates a private deployed model's behavior by repeatedly querying it through its public API and training a copy on the input-output pairs. The result: the attacker walks away with a model that approximates the original closely enough to substitute for it, reverse-engineer it, or use it as a stepping stone to other attacks.
How extraction works
The simplest version is API-based knowledge distillation (a minimal collection sketch follows this list):
- Send a stream of carefully chosen queries to the target model's API
- Collect the (query, response) pairs
- Train a smaller open-source model on the collected data using standard supervised fine-tuning
- The student model approximates the teacher's behavior on the query distribution it was trained on
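A minimal sketch of the collection step, assuming a hypothetical text-completion endpoint; `TARGET_URL`, the request and response fields, and `query_target` are illustrative stand-ins rather than any real vendor's API:

```python
# Sketch of steps 1-2: stream queries at the target and log (query, response)
# pairs as JSONL. All endpoint details here are hypothetical placeholders.
import json
import requests

TARGET_URL = "https://api.example.com/v1/complete"  # hypothetical endpoint

def query_target(prompt: str, api_key: str) -> str:
    """Send one query to the target's public API and return its text output."""
    resp = requests.post(
        TARGET_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json={"prompt": prompt, "max_tokens": 256},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["text"]

def collect_pairs(prompts: list[str], api_key: str, out_path: str) -> None:
    """Record (query, response) pairs; step 3 is standard SFT on this file."""
    with open(out_path, "w") as f:
        for p in prompts:
            pair = {"prompt": p, "response": query_target(p, api_key)}
            f.write(json.dumps(pair) + "\n")
```

Step 3 then runs any standard supervised fine-tuning stack over the collected file; nothing about the training recipe itself is extraction-specific.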
For larger models, extraction typically targets specific capabilities rather than full equivalence: extract the safety classifier, extract the embeddings, extract a particular skill.
More advanced techniques include:
- Membership-inference-aided extraction — use auxiliary attacks to identify which training data the target was trained on, then collect or synthesize similar data
- Logit extraction — when the API returns probabilities (not just text), the richer signal makes extraction much faster
- Active learning — choose query inputs that maximally inform the student about decision boundaries (a toy sketch combining the last two techniques follows this list)
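To make the last two bullets concrete, here is a toy PyTorch sketch for a classification-style target: queries are selected from an unlabeled pool by the student's own predictive entropy (active learning), and the training step matches the teacher's returned probability distribution with a KL loss instead of hard labels (logit extraction). The `student` module, the pool tensor, and the optimizer are assumed stand-ins:

```python
# Toy sketch: active query selection + soft-label (logit) distillation.
# `student` is any module mapping inputs to class logits; the teacher's
# probabilities are whatever the target API returned for the chosen inputs.
import torch
import torch.nn.functional as F

def predictive_entropy(probs: torch.Tensor) -> torch.Tensor:
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

def select_queries(student: torch.nn.Module, pool: torch.Tensor, budget: int) -> torch.Tensor:
    """Active learning: spend the query budget where the student is least sure,
    which tends to be near the teacher's decision boundaries."""
    with torch.no_grad():
        probs = F.softmax(student(pool), dim=-1)
    idx = predictive_entropy(probs).topk(budget).indices
    return pool[idx]

def distill_step(student, optimizer, x: torch.Tensor, teacher_probs: torch.Tensor) -> float:
    """Logit extraction: fit the teacher's full distribution, a far denser
    signal per query than its argmax label alone."""
    log_p = F.log_softmax(student(x), dim=-1)
    loss = F.kl_div(log_p, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```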
What gets stolen
- Capability — the student can do what the teacher could do, often at lower inference cost
- Style and voice — fine-tuned models that mimic a specific brand's assistant
- Decision logic — extracted classifiers used to evade detection (anti-abuse, content moderation, fraud)
- Sometimes weights — for very small models or those exposed via insufficiently rate-limited endpoints, near-exact weight recovery has been demonstrated
Why it matters
Model extraction is both an IP problem and a security problem:
- Loss of competitive moat. A foundation model that cost millions to train can be approximated for thousands of dollars in API calls.
- Stolen safety classifiers. Extracting a moderation API lets attackers test adversarial inputs offline, finding inputs that bypass the classifier without triggering rate limits or detection (a toy search loop follows this list).
- Bypass via shadow models. Universal jailbreaks generated against an extracted shadow model often transfer back to the original.
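To see why shadow models are dangerous, consider a toy offline evasion search; `shadow_clf` stands in for an extracted moderation classifier, and the mutation step is deliberately naive:

```python
# Toy illustration: search for a bypass against a local shadow copy, so no
# probe ever hits the defender's API, rate limits, or logging.
# `shadow_clf` is an assumed extracted classifier (True = flagged).
import random
import string
from typing import Callable, Optional

def mutate(text: str) -> str:
    """Naive perturbation: swap one random character (real attacks are smarter)."""
    i = random.randrange(len(text))
    return text[:i] + random.choice(string.ascii_lowercase) + text[i + 1:]

def find_bypass(shadow_clf: Callable[[str], bool], seed: str,
                max_tries: int = 10_000) -> Optional[str]:
    candidate = seed
    for _ in range(max_tries):
        if not shadow_clf(candidate):
            return candidate  # only now worth testing against the real API
        candidate = mutate(candidate)
    return None
```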
The DeepSeek-R1 distillation lineage demonstrated this at industry scale: distilled models inherit much of the teacher's capability but typically with degraded safety training, making them both useful proxies for attackers and weak points in the broader security posture.
Defenses
- Aggressive rate limiting and abuse detection. Extraction attacks require many queries; cap and monitor.
- Per-account anomaly detection. Sustained high-volume querying with diverse inputs is a leading indicator (a toy monitor follows this list).
- Output watermarking. Embed a signal in responses that is invisible to humans but statistically detectable in models trained on them; this proves provenance and enables takedowns.
- Limit output granularity. Don't return logits or token probabilities unless strictly needed; text-only outputs give the attacker a much weaker training signal.
- Legal protection. Most foundation-model APIs include terms-of-service prohibiting training derivative models. Detection enables enforcement.
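A toy version of the per-account heuristic, flagging accounts that combine sustained volume with unusually diverse inputs; the thresholds and the assumption that each query is embedded into a vector are illustrative, not a calibrated production design:

```python
# Toy per-account extraction monitor: high volume + high input diversity.
# Embedding function, thresholds, and storage strategy are all assumptions.
from collections import defaultdict
import numpy as np

class ExtractionMonitor:
    def __init__(self, volume_threshold: int = 10_000,
                 diversity_threshold: float = 0.9):
        self.volume_threshold = volume_threshold
        self.diversity_threshold = diversity_threshold
        self.embeddings = defaultdict(list)  # account_id -> query embeddings

    def record(self, account_id: str, query_embedding: np.ndarray) -> None:
        self.embeddings[account_id].append(query_embedding)

    def is_suspicious(self, account_id: str) -> bool:
        embs = self.embeddings[account_id]
        if len(embs) < self.volume_threshold:
            return False  # low volume is benign regardless of diversity
        X = np.stack(embs)
        X /= np.linalg.norm(X, axis=1, keepdims=True)
        # Mean pairwise cosine similarity over all pairs (self-pairs included)
        # equals the squared norm of the mean unit vector.
        m = X.mean(axis=0)
        mean_sim = float(m @ m)
        return (1.0 - mean_sim) > self.diversity_threshold
```

A real deployment would calibrate the baseline per product and pair this volume-and-diversity signal with the watermark check above for after-the-fact attribution.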
See also
Repello's research on the safety of distilled models derived from DeepSeek-R1 documents the security implications when extraction-style training propagates through the AI supply chain.