
A primer on Model Distillation

  • Writer: Atmn Intelligence
  • Jul 21
  • 5 min read

AI teaching AI - A primer on model distillation.

Model distillation is a family of post-training techniques focused on creating efficient and effective AI models (students) by transferring knowledge from large, expensive, complex AI models (teachers).

The resulting student models can match, and in some cases surpass, the teacher model's performance while requiring far less data, compute, or time.


LLMs (Large Language Models) & LMMs (Large Multimodal Models) are emerging as the most popular class of teacher models for distillation due to their benchmark performance in general intelligence tasks.

Instances of model distillation with LLMs as teachers are thus sometimes called LLM distillation.


Another related term, knowledge distillation, refers to the set of underlying principles of knowledge transfer employed in model distillation.

Model distillation is also becoming a popular model compression technique, though other compression techniques (e.g. pruning, quantization, low-rank factorization) are still preferred for faster deployment.


Model distillation can be further categorized on the basis of training schemes used, type of knowledge transferred, and specific methods or algorithms employed.


  1. Model distillation training schemes


    1. a. Offline distillation : The most commonly used distillation scheme, in which a pre-trained teacher model guides the student while the teacher's weights typically remain frozen. A minimal code sketch of this scheme follows after this list.


    1. b. Online distillation : In this scheme, both the teacher and student models are trained and updated simultaneously in a single end-to-end process.


    1. c. Self distillation : In self-distillation, the same model teaches itself & improves its performance by retraining itself on its own outputs. Knowledge transfer occurs from deeper layers to shallower layers within the same network, or from earlier epochs of the model's training to later epochs.
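

    To make the offline scheme concrete, here is a minimal PyTorch sketch of a single offline-distillation training step, assuming a generic classification setup. The model, batch, and loss names are placeholders, and distill_loss_fn stands in for whichever knowledge-transfer loss is used (a concrete choice appears under response-based distillation below).

    import torch

    def offline_distillation_step(teacher, student, optimizer, batch, distill_loss_fn):
        """One offline-distillation step: the teacher guides, only the student learns."""
        inputs, labels = batch

        teacher.eval()                     # teacher weights stay frozen
        with torch.no_grad():              # no gradients flow into the teacher
            teacher_out = teacher(inputs)

        student.train()
        student_out = student(inputs)

        # distill_loss_fn(student_out, teacher_out, labels) is a placeholder for
        # any of the knowledge-transfer losses described later in this article.
        loss = distill_loss_fn(student_out, teacher_out, labels)

        optimizer.zero_grad()
        loss.backward()                    # gradients only reach student parameters
        optimizer.step()
        return loss.item()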


  2. Type of knowledge transferred


    2. a. Response-based distillation : Most popular due to easy implementation, it focuses on the student model mimicking the teacher's predictions or soft labels (output probabilities).


    The teacher generates soft labels for each input example, and the student is trained to predict these labels by minimising the difference in their outputs, typically using a Kullback-Leibler (KL) divergence loss function.


    Logit-based distillation is a specific application where the student mimics the logits (inputs to the final softmax layer) of the teacher on the training dataset, rather than just the final predicted token. This provides richer feedback during training.


    A "temperature" parameter can be used to produce a softer probability distribution from the teacher, which is then used when training the student to match these soft targets (a code sketch of this appears below).


    The Universal Logit Distillation (ULD) Loss leverages optimal transport theory (specifically Wasserstein distance) to enable logit distillation even across models with different tokenizers and vocabularies, which is a limitation of traditional KL divergence-based methods.
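

    Putting the response-based pieces together, here is a minimal PyTorch sketch of the classic temperature-scaled KL objective described above (the traditional formulation, not ULD), blended with an ordinary cross-entropy term on the hard labels. The temperature of 2.0 and the 0.5 blending weight are illustrative defaults, not recommendations.

    import torch
    import torch.nn.functional as F

    def response_distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Soft-target KL divergence blended with hard-label cross-entropy."""
        # Temperature-softened distributions from teacher and student.
        soft_targets = F.softmax(teacher_logits / T, dim=-1)
        soft_preds = F.log_softmax(student_logits / T, dim=-1)

        # The KL term is scaled by T^2 so its gradients stay comparable in size.
        kd_loss = F.kl_div(soft_preds, soft_targets, reduction="batchmean") * (T * T)

        # Ordinary supervised loss on the ground-truth labels.
        ce_loss = F.cross_entropy(student_logits, labels)

        return alpha * kd_loss + (1.0 - alpha) * ce_loss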


    2. b. Feature-based distillation : Student model learns the internal features or representations learned by the teacher. The student minimizes the distance between the features learned by both models, often from intermediate layers.


    Hidden States-based Distillation is an example of this, where the student aligns its hidden states with those of the teacher, providing richer, layer-wise guidance. This method allows for cross-architecture distillation.
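

    A minimal sketch of hidden-states distillation in PyTorch. The learned linear projection bridging different hidden sizes is an assumption here, though it is a common choice for cross-architecture setups.

    import torch.nn as nn
    import torch.nn.functional as F

    class HiddenStateDistiller(nn.Module):
        """Aligns a student hidden layer with a teacher hidden layer via MSE."""

        def __init__(self, student_dim, teacher_dim):
            super().__init__()
            # Learned projection bridges mismatched hidden sizes (cross-architecture).
            self.proj = nn.Linear(student_dim, teacher_dim)

        def forward(self, student_hidden, teacher_hidden):
            # student_hidden: (batch, seq, student_dim); teacher_hidden: (batch, seq, teacher_dim).
            # The teacher's states are detached so no gradients reach the teacher.
            return F.mse_loss(self.proj(student_hidden), teacher_hidden.detach())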


    2. c. Relation-based distillation : Student mimics relationships between inputs and outputs, or the relationships between different layers and data points within the teacher model.


    Examples include using a Gram matrix to summarize relations between pairs of feature maps.
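

    One way the Gram-matrix idea can be written down, as a sketch (the normalisation by the spatial size and the MSE comparison are assumptions):

    import torch
    import torch.nn.functional as F

    def gram_matrix(features):
        """Pairwise channel correlations of a (batch, channels, height, width) feature map."""
        b, c, h, w = features.shape
        flat = features.reshape(b, c, h * w)
        return flat @ flat.transpose(1, 2) / (h * w)   # shape (batch, c, c)

    def relation_distillation_loss(student_features, teacher_features):
        """The student mimics the teacher's feature-map relations, not the maps themselves."""
        return F.mse_loss(gram_matrix(student_features),
                          gram_matrix(teacher_features).detach())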


  3. Specific Distillation Methods and Algorithms


    3. a. Distilling Step-by-Step : This advanced method, introduced by Google researchers and Snorkel AI, goes beyond mimicking outputs to explicitly extract and use natural language rationales (intermediate reasoning steps) from the teacher LLM. The model learns to output both the label and the rationale, which helps it understand not just what to predict but why.


    This approach can significantly reduce both the deployed model size and the amount of data required compared to standard fine-tuning or distillation. For example, a 770M-parameter T5 model could outperform a 540B-parameter PaLM model using only 80% of the dataset, whereas standard fine-tuning struggled even with 100% of the dataset.
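

    The heart of Distilling Step-by-Step is a multi-task objective: the student is trained to produce the label and, as a separate task, the teacher's rationale. A minimal sketch, assuming a Hugging Face-style seq2seq student (e.g. T5) whose forward pass returns a loss when given labels; the task prefixes, field names, and weighting are illustrative, not the paper's exact setup.

    def seq2seq_loss(model, tokenizer, source_text, target_text):
        """Standard encoder-decoder cross-entropy for one (source, target) pair."""
        inputs = tokenizer(source_text, return_tensors="pt")
        targets = tokenizer(target_text, return_tensors="pt")
        return model(input_ids=inputs.input_ids,
                     attention_mask=inputs.attention_mask,
                     labels=targets.input_ids).loss

    def step_by_step_loss(model, tokenizer, example, rationale_weight=1.0):
        """Multi-task objective: predict the label and, separately, the rationale."""
        # Task 1: "[label]" prefix -> ground-truth label.
        label_loss = seq2seq_loss(model, tokenizer,
                                  "[label] " + example["input_text"],
                                  example["label_text"])
        # Task 2: "[rationale]" prefix -> the teacher LLM's natural-language reasoning.
        rationale_loss = seq2seq_loss(model, tokenizer,
                                      "[rationale] " + example["input_text"],
                                      example["rationale"])
        return label_loss + rationale_weight * rationale_loss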


    3. b. Context distillation : A general framework where a language model improves itself by internalising signals from various contexts, such as abstract instructions, natural language explanations, or step-by-step reasoning.


    Simultaneous distillation combines multiple context distillation operations to internalize ensembles or sequences of contexts when their total length exceeds the context window size.


    Sequential distillation performs multiple distillation operations in sequence, useful for incrementally updating or overwriting previous model knowledge.


    Recursive distillation is a variant where the student model becomes the new teacher in the next iteration, allowing incremental updates without maintaining separate teacher/student parameters.
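

    At the core of all these variants is the same objective: the model without the context is trained to match what the (frozen) same model predicts when the context is prepended. A minimal sketch, assuming a Hugging Face-style causal LM; for brevity only the next-token distribution is matched, and all names are illustrative.

    import torch
    import torch.nn.functional as F

    def context_distillation_loss(student, frozen_teacher, tokenizer, context, raw_input, T=1.0):
        """The student, given only the raw input, matches the teacher given context + input."""
        # The frozen teacher sees the full context (instructions, explanations, reasoning steps).
        with torch.no_grad():
            teacher_ids = tokenizer(context + raw_input, return_tensors="pt").input_ids
            teacher_logits = frozen_teacher(teacher_ids).logits[:, -1, :]

        # The student sees only the raw input and must internalise the context's effect.
        student_ids = tokenizer(raw_input, return_tensors="pt").input_ids
        student_logits = student(student_ids).logits[:, -1, :]

        return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        F.softmax(teacher_logits / T, dim=-1),
                        reduction="batchmean") * (T * T)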


    3. c. SDG (Synthetic Data Generation) Fine-tuning : A style of distillation where synthetic data generated by a larger teacher model is used to fine-tune a smaller, pre-trained student model. Here, the student mimics only the final tokens the teacher predicts (its hard outputs), rather than its full output distribution.
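

    A minimal sketch of the SDG loop, assuming Hugging Face-style causal language models for both teacher and student; prompts and hyperparameters are placeholders. Note that the student only ever sees the text the teacher generated, never its probabilities.

    import torch

    def generate_synthetic_pair(teacher, tokenizer, prompt, max_new_tokens=128):
        """The large teacher writes the target text for a prompt (synthetic data)."""
        with torch.no_grad():
            ids = tokenizer(prompt, return_tensors="pt").input_ids
            out = teacher.generate(ids, max_new_tokens=max_new_tokens)
        completion = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
        return prompt, completion

    def sdg_finetune_step(student, optimizer, tokenizer, prompt, completion):
        """Fine-tune the student on the teacher's text with plain next-token cross-entropy."""
        ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
        loss = student(input_ids=ids, labels=ids).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()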


    3. d. Adversarial distillation : Inspired by Generative Adversarial Networks (GANs), this involves training a student model to not only mimic the teacher's output but also to generate samples that the teacher might find challenging, improving the student's understanding of the true data distribution.


    3. e. Multi-teacher distillation : Transfers knowledge from multiple teacher models to a single student model. This can provide more diverse knowledge than a single teacher.
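

    A minimal sketch of one multi-teacher variant, where the soft targets are simply the average of the teachers' temperature-softened distributions; equal weighting is an illustrative choice.

    import torch
    import torch.nn.functional as F

    def multi_teacher_loss(student_logits, teacher_logits_list, T=2.0):
        """The student matches the averaged soft targets of several teachers."""
        with torch.no_grad():
            # Average the temperature-softened distributions across all teachers.
            soft_targets = torch.stack(
                [F.softmax(t / T, dim=-1) for t in teacher_logits_list]
            ).mean(dim=0)

        return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                        soft_targets,
                        reduction="batchmean") * (T * T)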


    3. f. Cross-modal distillation : Facilitates knowledge transfer between different data modalities (e.g., from text to images) when data or labels are available in one modality but not another.


    3. g. Graph-based distillation : Uses graphs to map and transfer intra-data relationships between data samples or layers, enriching the student model's learning process.


    3. h. Quantized distillation : Reduces the precision of numbers (weights and activations) in the model, e.g., from 32-bit floating point to 8-bit, reducing memory and computational costs. It is often combined with knowledge distillation.
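

    The arithmetic behind the float32-to-int8 step looks roughly like this (a sketch of affine quantization; real deployments rely on library tooling and per-channel scales, this only shows the numerics):

    import torch

    def quantize_int8(weights):
        """Affine quantization of float32 weights to int8 (assumes weights are not all equal)."""
        w_min, w_max = weights.min(), weights.max()
        scale = (w_max - w_min) / 255.0                   # size of one int8 step
        zero_point = torch.round(-w_min / scale) - 128    # maps w_min to -128
        q = torch.clamp(torch.round(weights / scale) + zero_point, -128, 127).to(torch.int8)
        return q, scale, zero_point

    def dequantize_int8(q, scale, zero_point):
        """Approximate reconstruction used at inference (or inside a distillation loss)."""
        return (q.float() - zero_point) * scale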


    3. i. DistilBERT / TinyBERT / MiniLM : Notable examples of distilled models, which use variations of these methods, often combining attention-based distillation and hidden states-based distillation with multi-stage learning frameworks (pre-training distillation and task-specific distillation) to significantly reduce model size and improve inference speed while retaining high accuracy.


Other model compression techniques, such as pruning, quantization, and low-rank factorization, are often combined with model distillation to achieve desired results. For instance, NVIDIA's Minitron models are obtained by combining structured weight pruning with knowledge distillation.


Now, a simple analogy:

Imagine a university with many professors (teacher models) who have varying subject matter expertise, teaching styles, coursework, etc.

While some professors are instant favorites (the LLMs), others are less popular but still teach students important skills.

Model distillation is the process of teaching students (student models).

Some students learn by memorizing previous years' question papers, some read textbooks, some take notes, some practice sample questions, some ask questions in class, and some excel at theoretical or practical experiments. These are the different model distillation techniques.

None of these students spent as much time learning those skills as their professors did, yet they go on to achieve stellar careers in business, engineering, the arts, or the sciences.




 
 
 
