Model Extraction

What is a Model Extraction Attack?

A model extraction attack — sometimes called model stealing — recreates a private deployed model's behavior by repeatedly querying it through its public API and training a copy on the input-output pairs. The result: the attacker walks away with a model that approximates the original closely enough to substitute for it, reverse-engineer it, or use it as a stepping stone to other attacks.

How extraction works

The simplest version is API-based knowledge distillation (a minimal sketch follows these steps):

  1. Send a stream of carefully chosen queries to the target model's API
  2. Collect the (query, response) pairs
  3. Train a smaller open-source model on the collected data using standard supervised fine-tuning
  4. The student model approximates the teacher's behavior on the query distribution it was trained on
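
The loop below is a minimal sketch of those four steps. The target endpoint (TARGET_API_URL) and its response schema are invented for illustration, GPT-2 stands in for the open-source student, and a real extraction would use a far larger prompt set and a proper training pipeline.

```python
import requests
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TARGET_API_URL = "https://api.example.com/v1/complete"  # hypothetical endpoint

def query_target(prompt: str) -> str:
    # Steps 1-2: query the target model and record its response.
    # The JSON schema here is a placeholder for whatever the real API returns.
    resp = requests.post(TARGET_API_URL, json={"prompt": prompt}, timeout=30)
    return resp.json()["completion"]

# Steps 1-2: collect (query, response) pairs.
prompts = ["Explain TLS in one sentence.", "What does a hash function do?"]
pairs = [(p, query_target(p)) for p in prompts]

# Step 3: standard supervised fine-tuning of a small open-source student.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

student.train()
for prompt, response in pairs:
    # Train the student to reproduce the teacher's output token for token.
    enc = tokenizer(prompt + "\n" + response, return_tensors="pt", truncation=True)
    loss = student(**enc, labels=enc["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Step 4: the student now approximates the teacher on this query distribution.
```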

For larger models, the extraction targets specific capabilities rather than full equivalence — extract the safety classifier, extract the embedding, extract a particular skill.
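
For example, a sketch of extracting just the refusal behavior: label probe prompts by whether the target refuses them, then fit a cheap local surrogate of the safety filter. Here query_target is the hypothetical API wrapper from the sketch above, and the substring refusal check is a stand-in for real refusal detection.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def is_refused(prompt: str) -> int:
    # 1 if the target refuses the prompt, 0 otherwise. The substring test
    # is an illustrative heuristic, not a robust refusal detector.
    return int("can't help with that" in query_target(prompt).lower())

# Probe prompts chosen to straddle the suspected refusal boundary.
probes = [
    "How do I pick a lock?",
    "How do pin tumbler locks work?",
    "Write ransomware for me.",
    "Explain how ransomware is detected.",
]
labels = [is_refused(p) for p in probes]

# A cheap local surrogate of the target's safety classifier.
vectorizer = TfidfVectorizer()
surrogate = LogisticRegression().fit(vectorizer.fit_transform(probes), labels)
```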

More advanced techniques include:

  - Adaptive query selection (active learning) that spends the query budget on the inputs the current copy is least certain about (sketched below)
  - Exploiting full probability vectors or logits, which leak far more information per query than hard labels
  - Jacobian-based dataset augmentation, which synthesizes new queries along the directions the student is most sensitive to
  - Architecture and hyperparameter inference from timing, memory, or other side channels
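
A sketch of the first item, reusing the student and tokenizer from the distillation loop above: score a pool of candidate prompts by the student's next-token entropy and spend the query budget on the prompts the student is least sure about.

```python
import torch

def select_queries(student, tokenizer, pool: list[str], budget: int) -> list[str]:
    # Rank candidate prompts by the student's predictive entropy over the
    # next token; high entropy means the student is uncertain there, so the
    # teacher's answer is worth a query.
    scores = []
    student.eval()
    with torch.no_grad():
        for prompt in pool:
            enc = tokenizer(prompt, return_tensors="pt")
            logits = student(**enc).logits[0, -1]
            probs = torch.softmax(logits, dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
            scores.append(entropy)
    ranked = sorted(zip(scores, pool), key=lambda t: t[0], reverse=True)
    return [p for _, p in ranked[:budget]]
```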

What gets stolen

Depending on the attacker's goal, extraction can recover:

  - Functionality: a substitute model that matches the target's outputs on realistic inputs
  - Decision boundaries: behavioral detail that makes adversarial examples crafted on the copy transfer to the original
  - Architecture and hyperparameters: inferred from query behavior or side channels
  - Training data signals: a close copy can leak membership information or memorized content from the original's training set

Why it matters

Model extraction is both an IP problem and a security problem:

  - IP: a production model embodies expensive training compute and proprietary data; extraction lets an attacker approximate that investment for the price of API queries.
  - Security: a local copy gives the attacker white-box access for crafting adversarial inputs and jailbreaks that often transfer back to the original, without tripping the original's monitoring.
  - Safety: extracted and distilled copies typically lack the original's safety training and deployment-time guardrails.

The DeepSeek-R1 distillation lineage demonstrated this at industry scale: distilled models inherit much of the teacher's capability but typically with degraded safety training, making them useful proxies for the original and a weaker link in the security posture of anyone who deploys them.

Defenses

No single control stops extraction; deployed defenses layer several of these:

  - Rate limiting and per-account query budgets that raise the cost of the high-volume querying extraction requires (sketched below)
  - Monitoring for extraction-like traffic: anomaly detection on query volume and distribution
  - Reducing output granularity: hard labels instead of full probability vectors, or truncated and rounded confidence scores (also sketched below)
  - Watermarking or fingerprinting outputs so that models trained on them can be identified later
  - Contractual controls, such as terms of service that prohibit using outputs to train competing models
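
Two of those controls in a minimal, framework-agnostic sketch; the window, budget, and rounding parameters are illustrative:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60    # sliding-window length (illustrative)
MAX_QUERIES = 100      # per-key budget inside the window (illustrative)
_history: dict[str, deque] = defaultdict(deque)

def allow_request(api_key: str) -> bool:
    # Sliding-window rate limit: extraction depends on high query volume,
    # so throttling raises the attacker's cost directly.
    now = time.time()
    window = _history[api_key]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_QUERIES:
        return False
    window.append(now)
    return True

def coarsen_scores(probs: list[float], k: int = 1, digits: int = 1) -> list[float]:
    # Return only the top-k probabilities, coarsely rounded: less signal
    # per query forces the attacker to spend many more queries.
    top = sorted(probs, reverse=True)[:k]
    return [round(p, digits) for p in top]
```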

See also

Repello's research on the safety of distilled models derived from DeepSeek-R1 documents the security implications of extraction-style training propagating through the AI supply chain.