In AI and machine learning systems, knowledge is typically distilled by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the teacher’s knowledge by using its outputs as labels to optimize the student, but there’s no guarantee that knowledge will be transferred to the student when the teacher is very large.
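To make the label-based step concrete, here is a minimal Python sketch of how a student might be scored against a teacher’s outputs. It assumes the teacher’s outputs are raw scores (logits) that are turned into soft labels with a softmax. The `temperature` parameter and the function names are illustrative assumptions drawn from common distillation practice, not details taken from the Amazon paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft labels.

    `temperature` is an assumed hyperparameter, not a value from the article.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


teacher = [2.0, 0.5, -1.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]

# The loss is lowest when the student reproduces the teacher's distribution.
print(label_distillation_loss(agreeing_student, teacher))
print(label_distillation_loss(disagreeing_student, teacher))
```

Nothing in this loss looks inside either model: two students with very different internals can earn the same score as long as their final outputs agree.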
That’s why a team of Amazon researchers, in a recent study, developed a technique that distills the internal representations of a large model into a simplified version of it. They claim that in experiments, adding knowledge distillation from internal representations tended to be more stable than using label distillation alone.
The proposed approach allows the student to behave internally like the teacher by transferring the teacher’s linguistic properties. The student is optimized with the labels from the teacher’s output, and it acquires the abstraction hidden in the teacher by matching the teacher’s internal representations.
In a typical AI model, neurons (mathematical functions) are arranged in interconnected layers that transmit “signals” from input data, and training slowly adjusts the synaptic strength (weights) of each connection. In the technique described above, the layers of the student are optimized to match those of the teacher such that knowledge from the lowest layer (closest to the input) is distilled before knowledge from the upper layers. This enables the student to learn and compress the abstraction in the layers of the teacher systematically.
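The bottom-up ordering can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the researchers’ implementation: mean squared error stands in for whatever similarity measure the paper actually uses, and `internal_loss`, `progressive_schedule`, and `steps_per_layer` are hypothetical names introduced here.

```python
def mse(a, b):
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def internal_loss(student_layers, teacher_layers, active_layers):
    """Average mismatch over the lowest `active_layers` layers only.

    Each element of `student_layers` / `teacher_layers` is one layer's
    representation, ordered from the input upward. MSE is an assumed stand-in.
    """
    pairs = list(zip(student_layers, teacher_layers))[:active_layers]
    return sum(mse(s, t) for s, t in pairs) / len(pairs)


def progressive_schedule(num_layers, steps_per_layer):
    """Yield (step, active_layers), unlocking one more layer every few steps."""
    for step in range(num_layers * steps_per_layer):
        yield step, step // steps_per_layer + 1


teacher = [[1.0, 2.0], [3.0, 4.0]]
student = [[1.0, 2.0], [0.0, 0.0]]  # lowest layer already matches the teacher

for step, active in progressive_schedule(num_layers=2, steps_per_layer=2):
    print(step, active, internal_loss(student, teacher, active))
```

Early steps penalize only the layer nearest the input. Upper layers enter the loss later, once the representations they build on have had a chance to settle.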
The researchers conducted experiments involving Google’s BERT on four data sets of the General Language Understanding Evaluation (GLUE) benchmark, a collection of resources for training, evaluating, and analyzing natural language processing algorithms. Even in cases where the student skipped one layer for every two layers of the teacher, they report that it was able to replicate the behavior taught by the teacher. Moreover, the generalization capabilities of the teacher were replicated in the student model, implying that the student would potentially make the same mistakes as the teacher. The student also demonstrated a 5-10% performance improvement on benchmark data sets, including a large new Reddit data set the team assembled.
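The layer-skipping setup amounts to pairing each student layer with every second teacher layer. The Python sketch below shows one way to build that pairing for a 12-layer teacher, the depth of BERT-base. The function name is hypothetical, and matching the upper layer of each pair, rather than the lower one, is an assumption made here for illustration.

```python
def map_student_to_teacher(num_teacher_layers, stride=2):
    """Return, for each student layer, the 0-indexed teacher layer it imitates.

    With stride=2 the student keeps one layer for every two teacher layers.
    Picking the upper layer of each pair is an assumption for this sketch.
    """
    return list(range(stride - 1, num_teacher_layers, stride))


mapping = map_student_to_teacher(12)
print(mapping)  # [1, 3, 5, 7, 9, 11]
```

A six-layer student built this way still ends on the teacher’s final layer, so the top of the student is compared with the representation the teacher uses for its predictions.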
“Unlike the standard [knowledge distillation] method, where a student only learns from the output probabilities of the teacher, we teach our smaller models by also revealing the internal representations of the teacher. Besides preserving a similar performance, our method effectively compresses the internal behavior of the teacher into the student,” wrote the researchers in a paper describing their work. “This is not guaranteed in the standard [knowledge distillation] method, which can potentially affect the generalization capabilities initially intended to be transferred from the teacher.”