Research on Knowledge Distillation Regularization Methods


WANG Xuechun

(School of Mathematics and System Sciences, Xinjiang University, Urumqi, Xinjiang 830017, China)

Abstract: In deep learning, regularization is extremely important because it prevents models from overfitting and improves their generalization performance. A relatively new yet increasingly popular type of regularization is knowledge distillation (KD), a set of techniques in which soft labels generated by one model serve as supervisory signals to guide the training of another model. We first explain the fundamentals of KD regularization and then categorize KD regularization strategies into two types, viz. forward distillation and mutual distillation. For each type, we discuss in detail its key components and representative methods. After comparing the pros and cons of KD regularization strategies and testing their performance on a common image classification benchmark, we provide guidelines on how to choose appropriate KD regularization techniques for specific scenarios. Finally, we identify a number of key challenges and discuss future research directions for KD regularization.

Key words: knowledge distillation; model generalization; overfitting; regularization

0 Introduction

Deep learning has been tremendously successful in computer vision [1], natural language processing [2], and many other fields. Large pre-trained models, such as the GPT family and BERT, frequently exhibit excellent performance but have millions or even billions of learnable parameters that far exceed the number of training samples. In this case, a foundation model tends to suffer from overfitting, which is essentially mistaking residual variation, such as noise, for the true structure of the problem. To prevent this from happening, it is necessary to substantially reduce the generalization error on unseen data without excessively increasing the error on the training data. Naturally, overfitting can be avoided by increasing the amount of training data. Another popular technique is regularization, which limits the complexity of the model while improving its prediction performance on unseen data. Various canonical regularization techniques exist, such as weight decay families [3-4], dropout [5], normalization [6-7], and data augmentation families [8-10].

Conventional regularization methods such as L1 and L2 commonly reduce the complexity of the model by adding extra regularization terms, based on the model parameters, to the original loss function. In contrast, KD regularization incorporates the concept of KD into the regularizer term of the loss function. In addition to the standard loss term, an extra regularization term is added to encourage the student model's output to match the soft targets of the teacher model. This term acts as a regularizer during training and helps the student model capture the knowledge contained in the teacher model.

KD and KD regularization are both techniques for transferring knowledge from a larger, more complex model to a smaller, simpler model, but they are applied differently. KD is a standalone approach in which a student model is trained to mimic a teacher model through the use of soft targets, while KD regularization integrates the principles of KD into the regularization term of the loss function to guide the training of the student model. Early studies demonstrated the effectiveness of KD in model compression [11-12] and semi-supervised learning [13], but ignored the prospect that KD also acts as a regularizer to enhance the generalizability of models. Ba et al. [14] and Hinton et al. [15] demonstrated that the soft labels generated by the teacher model can effectively regularize the training of student models and alleviate overfitting. Subsequently, scholars have delved into various KD regularization techniques to improve the generalization performance of deep models [16-21], but these fragmented techniques have not been introduced to researchers in a systematic way. For this reason, our contribution is to provide a relatively comprehensive taxonomy and exploration of KD regularization strategies for achieving model generalization.

In the literature, there already exist excellent surveys on regularization. For example, Moradi et al. [22] inspected the most effective regularization techniques and their variants without mentioning any applications. Santos et al. [23] discussed regularization for convolutional neural networks. Tang et al. [24] presented sparse regularization techniques from the perspective of model compression. Tian et al. [25] reviewed a wide range of regularization methods and applications, but they did not cover KD regularization.

To the best of our knowledge, there are no comprehensive surveys devoted exclusively to KD regularization. Yet the rapid advances in research on this topic call for review articles that explain the fundamental ideas, classify the variants into simple and exhaustive categories, compare the pros and cons of different strategies, and provide guidelines for application scientists to choose appropriate KD regularization methods for specific tasks.

This paper is a survey that answers the aforementioned demands. In section 1, we lay the foundation by explaining the basic definitions, fundamental principles, and key mechanisms of KD regularization. The main part of this survey consists of sections 2 and 3, where we discuss in detail the two main KD regularization strategies, namely forward distillation and mutual distillation. In section 4, we summarize the key points of our review in tables, which hopefully serve as a road map for this seemingly chaotic field of KD regularization schemes. In particular, we benchmark different KD regularization strategies on the practical problem of image classification. In section 5, we discuss the limitations of current KD regularization techniques and give some corresponding research prospects.

1 Fundamentals

1.1 The definition of KD regularization

In general, knowledge refers to the awareness of facts or practical skills. In this context, knowledge in a broad sense includes everything that can be utilized by other models, such as parameters, features, structures, modules, and so on. In a narrow sense, knowledge is the output of a teacher model that can be utilized in training other models.

In chemistry, distillation is a method for separating components with different boiling points by heating a compound solution. In the context of this paper, distillation is the process of obtaining pure knowledge from impure knowledge by amplifying the similarity of knowledge across different models. The name “distillation” comes from the analogy to chemical distillation: (i) during the training phase, the knowledge similarity is amplified by increasing the value of a parameter (the so-called “temperature”), and (ii) during the testing phase, the temperature is lowered to extract knowledge from the original model.

KD is a teacher-student training process in which a lightweight student model is trained by extracting knowledge from the output of a sizable, mature teacher model. Given the knowledge from the pre-trained teacher model, the student is expected to be competitive with or even superior to the teacher. KD is similar to transfer learning [26] in that both involve a transfer process, but KD emphasizes knowledge transfer over weight transfer. See Table 1 for a comparison of the two concepts in terms of data domains, architectures, learning styles, and main purposes.

Table 1 The difference between KD and transfer learning

1.2 The training process in KD regularization

The prediction of category probabilities usually relies on the output logits of the teacher model passed through a softmax layer. As an early work, Ba and Caruana [14] trained a small network by learning the logit outputs of a big network. However, when the probability distribution output by the softmax is employed directly, the probabilities of the negative labels are flattened towards zero, resulting in a loss of information. To resolve this issue, Hinton et al. [15] added a temperature parameter T to the original softmax function to reduce the difference between target and non-target categories so that the information contained in the negative labels is amplified. The softened class probability is calculated as

$$q_i^{T} = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)},$$

where z_i and z_j are the input logits of the softmax function and T is the temperature that controls the softening of the output probabilities. For T = 1, the softened softmax reduces to the standard softmax; as T increases, the class probability distribution becomes smoother.
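As an illustration, the following minimal sketch (in PyTorch; the function name and example logits are illustrative assumptions, not part of the original paper) computes the softened probabilities and shows how raising T flattens the distribution.

```python
# A minimal sketch of the temperature-scaled softmax used in KD.
import torch
import torch.nn.functional as F

def softened_probabilities(logits: torch.Tensor, temperature: float = 4.0) -> torch.Tensor:
    """Return class probabilities softened by the temperature T.

    For T = 1 this reduces to the standard softmax; larger T flattens the
    distribution and amplifies the information carried by non-target classes.
    """
    return F.softmax(logits / temperature, dim=-1)

logits = torch.tensor([[8.0, 2.0, -1.0]])
print(softened_probabilities(logits, temperature=1.0))  # sharply peaked
print(softened_probabilities(logits, temperature=4.0))  # much smoother
```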

In particular, the softened knowledge of the teacher model is employed as the first portion of the loss function to regularize the parameters of the student model, which is trained with both hard and soft targets. The distillation loss L_soft compares the outputs of the teacher and the student, both softened with T > 1, while the student loss L_hard only concerns the student model with temperature T = 1 and the ground-truth labels. The entire loss function is then

$$L = \lambda L_{\text{soft}} + (1-\lambda) L_{\text{hard}},$$

where λ is the balance factor for the soft and hard targets, the distillation loss is

$$L_{\text{soft}} = -\sum_{i} p_i^{T} \log q_i^{T},$$

with p_i^T and q_i^T the softened class probabilities of the teacher and the student, and the student loss is

$$L_{\text{hard}} = -\sum_{i} y_i \log q_i^{1},$$

where q_i^1 is the student probability at T = 1 and y_i is a one-hot encoding of the ground-truth label. See Fig 1 for an illustration of the training framework of KD.
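The sketch below implements this loss in PyTorch under the standard formulation of Hinton et al. [15]; the KL form of the soft term and the T² gradient scaling are common conventions assumed here, and all variable names are illustrative.

```python
# A minimal sketch of the KD-regularized loss of section 1.2:
# a KL term on temperature-softened outputs plus a hard cross-entropy term.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.9):
    # Distillation loss L_soft: both models softened with the same T > 1.
    # The T*T factor keeps soft-target gradients on the same scale as the hard loss.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Student loss L_hard: standard cross-entropy at T = 1 with one-hot labels.
    hard = F.cross_entropy(student_logits, labels)
    return lam * soft + (1.0 - lam) * hard

# Illustrative usage with random logits for a batch of 8 samples and 10 classes.
student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student_logits, teacher_logits, labels).item())
```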

Fig 1 The training process of KD regularization [15]

Thanks to the soft targets, KD is an effective regularization tactic to improve the generalization performance of deep learning models.

1.3 The mechanism of KD regularization

In distillation learning, we distill a high-capacity teacher model to obtain a low-capacity student model, hoping that the performance of the student is close to that of the teacher. With a capacity equal to that of the teacher, a student model may even be better than the teacher model [27]; the student may then guide the learning of the teacher model [28]. In some cases, student models can still benefit from the teacher model even if the latter performs poorly [17]. Aside from these empirical results, there also exist theoretical works explaining why knowledge transfer from an ineffective teacher model can lead to more powerful student models [29-30]. Soft targets provide regularization in the manner of label smoothing training [29,31-32]. In effect, label smoothing regularization (LSR) avoids overfitting by exploiting the relationship between the ground-truth labels and the other labels. Thus, even if the teacher does not perform as well as the student, it is still possible to improve the student model by avoiding the overconfidence of the teacher. The student network trained by Furlanello et al. [27] in a KD manner has a capacity equal to that of the teacher network but performs better than the teacher model; the reason is that soft targets provide regularization through importance-sampling weighting, which benefits greatly when the teacher model correctly predicts the sample confidence.
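To make the connection to LSR concrete, the sketch below (assuming the standard label-smoothing formulation; all names are illustrative) builds the smoothed targets that play the role of a maximally uninformative, uniform "teacher".

```python
# A minimal sketch of label smoothing regularization: the target distribution
# mixes the one-hot label with a uniform distribution, a special case of
# training against soft targets.
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, labels, eps=0.1):
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Smoothed target: (1 - eps) on the true class, eps spread uniformly over all classes.
    smoothed = torch.full_like(log_probs, eps / num_classes)
    smoothed.scatter_(1, labels.unsqueeze(1), 1.0 - eps + eps / num_classes)
    return -(smoothed * log_probs).sum(dim=-1).mean()

logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(label_smoothing_loss(logits, labels).item())
```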

In the following two sections, we present the two regularization strategies: section 2 covers forward distillation regularization; section 3 covers mutual distillation regularization, which involves peer-to-peer guidance. See Table 2 for a comparison of the two strategies.

Table 2 Application scenarios and typical tasks for KD regularization strategies

2 Forward Distillation

Forward distillation is the transmission of knowledge from a wider, deeper teacher network to a narrower, shallower student network. The training process contains two stages: the teacher network is pre-trained before distillation, and then the students draw knowledge from the teacher. The learning process is illustrated in Fig 2, and different types of knowledge transfer are shown in Table 3.

Fig 2 Forward distillation. The teacher transfers knowledge to the students [33]

Table 3 Different forms of knowledge transfer in forward distillation

Hinton et al. [15] were successful in transferring the knowledge of the teacher to the students. Ahn et al. [35] increased the mutual information (MI) between the intermediate-layer features of the teacher and the student; a higher MI indicates a stronger capability to transfer knowledge. Specifically, given an input sample x from the target data distribution p(x) and K pairs of layers {(T^(k), S^(k))}, where each layer pair (T^(k), S^(k)) is drawn from the teacher and the student, respectively, the sample x is passed through the teacher and the student, producing K pairs of features {(t^(k), s^(k))}. The MI between the teacher and the student is defined as

$$I(t; s) = H(t) - H(t \mid s) = H(t) + \mathbb{E}_{t,s}\big[\log p(t \mid s)\big],$$

where H(t) denotes the entropy of the teacher features, H(t|s) denotes the entropy of the teacher features conditioned on the known student features, and E is the expectation. To increase the MI between the output features of the teacher and the student, we minimize the loss function

$$L = L_{\text{task}} - \sum_{k=1}^{K} \lambda_k\, I\big(t^{(k)}, s^{(k)}\big),$$

where L_task is the loss of the given task and λ_k is a hyperparameter. The student thus minimizes the loss of its specific task while keeping a high MI with the teacher; maximizing the MI stimulates knowledge transfer by learning to estimate the distribution of the teacher's features.
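The sketch below illustrates this MI-based regularizer, assuming (in the spirit of the variational approach of Ahn et al. [35]) that maximizing I(t; s) is replaced by minimizing E[−log q(t|s)] with a Gaussian q whose mean is predicted from the student feature; the module structure and all names are illustrative assumptions.

```python
# A simplified sketch of an MI-based regularizer: H(t) is constant w.r.t. the
# student, so maximizing a variational lower bound on I(t; s) amounts to
# minimizing the Gaussian negative log-likelihood -log q(t | s).
import torch
import torch.nn as nn

class VariationalMIRegularizer(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.mean = nn.Linear(student_dim, teacher_dim)        # mu(s)
        self.log_var = nn.Parameter(torch.zeros(teacher_dim))  # per-dimension variance

    def forward(self, s_feat: torch.Tensor, t_feat: torch.Tensor) -> torch.Tensor:
        # -log q(t | s) for a factorized Gaussian (up to an additive constant).
        var = self.log_var.exp()
        nll = 0.5 * (((t_feat - self.mean(s_feat)) ** 2) / var + self.log_var)
        return nll.sum(dim=-1).mean()

# Illustrative usage for one pair of layers; the full objective adds L_task and
# sums lambda_k-weighted terms over the K layer pairs.
reg = VariationalMIRegularizer(student_dim=64, teacher_dim=128)
s_feat, t_feat = torch.randn(8, 64), torch.randn(8, 128)
task_loss = torch.tensor(0.0)  # placeholder for the task loss L_task
loss = task_loss + 0.5 * reg(s_feat, t_feat)
print(loss.item())
```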

Park et al. [36] proposed relational knowledge distillation (RKD), which transfers the structural relationships among the output predictions of multiple samples from the teacher model to the student model. In contrast to the traditional distillation method, which focuses only on the individual outputs of the network, combining the outputs of multiple samples into structural units better reflects the structure learned by the teacher and thus provides more efficient guidance for the student; see Fig 3 and Fig 4.

Fig 3 Traditional KD is a point-to-point transfer of knowledge [36]

Fig 4 Structural KD is the transfer of knowledge from category to category [36]

The RKD loss function takes the form

$$L_{\text{RKD}} = \sum_{(x_1,\ldots,x_n)\in\mathcal{X}^n} \ell\big(\psi(t_1,\ldots,t_n),\ \psi(s_1,\ldots,s_n)\big),$$

where X^n is a set of n-tuples of distinct samples. For a given sample x_i, the outputs of the teacher and the student are denoted by t_i and s_i, respectively; ψ denotes a relational potential function utilized to extract structural information from the individual model outputs; and ℓ is a general loss that measures the difference between the output structure of the teacher and that of the student. Firstly, the distance-wise distillation loss L_RKD-D depends on the difference between the pairwise sample distances of the teacher and those of the student,

$$L_{\text{RKD-D}} = \sum_{(x_i,x_j)\in\mathcal{X}^2} \ell_\delta\big(\psi_D(t_i,t_j),\ \psi_D(s_i,s_j)\big), \qquad \psi_D(t_i,t_j) = \frac{1}{\mu}\,\lVert t_i - t_j\rVert_2,$$

where ψ_D is the Euclidean distance between model outputs, µ is a parameter for distance normalization, and ℓ_δ(x, y) is the Huber loss,

$$\ell_\delta(x,y) = \begin{cases} \dfrac{1}{2}(x-y)^2, & |x-y|\le 1,\\[4pt] |x-y|-\dfrac{1}{2}, & \text{otherwise}. \end{cases}$$

Secondly, the angle-wise distillation loss L_RKD-A measures the angular relationships among triplets of samples, which transfer the teacher's structure to the student through the angular differences of the corresponding feature maps of the training samples:

$$L_{\text{RKD-A}} = \sum_{(x_i,x_j,x_k)\in\mathcal{X}^3} \ell_\delta\big(\psi_A(t_i,t_j,t_k),\ \psi_A(s_i,s_j,s_k)\big), \qquad \psi_A(t_i,t_j,t_k) = \cos\angle t_i t_j t_k.$$

The overall optimization objective function is then

$$L = L_{\text{task}} + \lambda_{\text{KD}}\, L_{\text{RKD}},$$

where L_task is the loss for the given task and λ_KD is a hyperparameter.
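A minimal sketch of these relational losses follows, assuming the distance-wise and angle-wise potentials defined above and using PyTorch's Smooth-L1 loss as the Huber loss ℓ_δ; the function names and loss weights are illustrative.

```python
# A minimal sketch of distance-wise and angle-wise relational distillation.
import torch
import torch.nn.functional as F

def pairwise_distances(e: torch.Tensor) -> torch.Tensor:
    # psi_D: Euclidean distances between all output pairs, normalized by their mean mu.
    d = torch.cdist(e, e)
    mu = d[d > 0].mean()
    return d / mu

def angle_potential(e: torch.Tensor) -> torch.Tensor:
    # psi_A: cosine of the angle formed by every triplet (x_i, x_j, x_k) at x_j.
    diff = e.unsqueeze(0) - e.unsqueeze(1)             # diff[j, i] = e_i - e_j
    diff = F.normalize(diff, p=2, dim=-1)
    return torch.einsum("jid,jkd->jik", diff, diff)    # <e_ij, e_kj>

def rkd_loss(t_out: torch.Tensor, s_out: torch.Tensor, ld=1.0, la=2.0) -> torch.Tensor:
    dist = F.smooth_l1_loss(pairwise_distances(s_out), pairwise_distances(t_out))
    angle = F.smooth_l1_loss(angle_potential(s_out), angle_potential(t_out))
    return ld * dist + la * angle   # added to L_task with weight lambda_KD

t_out, s_out = torch.randn(8, 128), torch.randn(8, 64)
print(rkd_loss(t_out, s_out).item())
```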

Lukman et al. [34] proposed full deep distillation mutual learning (FDDML) and half deep distillation mutual learning (HDDML). Both methods combine mutual learning and traditional KD to further improve the performance of student networks. In FDDML, the training of each of the two student networks Θ_s1 and Θ_s2 benefits not only from the knowledge of the teacher but also from the guidance of the other student. In HDDML, while Θ_s2 is trained in the same way as in FDDML, the student Θ_s1 is trained only under the guidance of the other student. Formally, given N training samples X = {x_i}_{i=1}^N from M classes and the corresponding labels Y = {y_i}_{i=1}^N with y_i ∈ {1, 2, 3, ..., M}, the total loss functions of FDDML take the form

$$L_{\Theta_{s1}} = L_{CE_1} + \lambda L_{KD_1} + \beta L_{\text{mimicry}_1}, \qquad L_{\Theta_{s2}} = L_{CE_2} + \lambda L_{KD_2} + \beta L_{\text{mimicry}_2},$$

where λ and β are balance factors, and the cross-entropy (CE) losses L_CE1 and L_CE2 are defined as

$$L_{CE_k} = -\sum_{i=1}^{N}\sum_{m=1}^{M} I(y_i, m)\,\log p_m^{k}(x_i), \qquad k \in \{1, 2\},$$

where

$$I(y_i, m) = \begin{cases} 1, & y_i = m,\\ 0, & y_i \neq m, \end{cases}$$

is an indicator function,

$$p_m^{k}(x_i) = \frac{\exp\big(z_i^{m}/T\big)}{\sum_{m'=1}^{M}\exp\big(z_i^{m'}/T\big)}$$

is the class probability, z_i is a logit, and T is the temperature. The distillation loss function is the KL divergence between the softened outputs of the teacher and of each student,

$$L_{KD_k} = \sum_{i=1}^{N} D_{KL}\big(p^{t}(x_i)\,\big\|\, p^{k}(x_i)\big), \qquad k \in \{1, 2\},$$

where p^t denotes the softened output of the teacher. The mimicry loss L_mimicry is defined analogously between the two students,

$$L_{\text{mimicry}_1} = \sum_{i=1}^{N} D_{KL}\big(p^{2}(x_i)\,\big\|\, p^{1}(x_i)\big), \qquad L_{\text{mimicry}_2} = \sum_{i=1}^{N} D_{KL}\big(p^{1}(x_i)\,\big\|\, p^{2}(x_i)\big).$$

The total loss functions of HDDML take the form

$$L_{\Theta_{s1}} = L_{CE_1} + \beta L_{\text{mimicry}_1}, \qquad L_{\Theta_{s2}} = L_{CE_2} + \lambda L_{KD_2} + \beta L_{\text{mimicry}_2},$$

where λ and β are the two balance factors.
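A rough sketch of the per-student objective follows; it assumes, per the description above, a cross-entropy term, a distillation term against the teacher, and a mimicry term against the peer, weighted by λ and β. All names are illustrative and the details may differ from the exact formulation of Lukman et al. [34].

```python
# A rough sketch of the FDDML objective for one student.
import torch
import torch.nn.functional as F

def kl_soft(p_logits, q_logits, T=3.0):
    # KL divergence between temperature-softened distributions (target p, input q).
    return F.kl_div(
        F.log_softmax(q_logits / T, dim=-1),
        F.softmax(p_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

def fddml_student_loss(s_logits, peer_logits, teacher_logits, labels, lam=0.5, beta=0.5):
    ce = F.cross_entropy(s_logits, labels)        # hard-label cross-entropy loss
    kd = kl_soft(teacher_logits, s_logits)        # distillation loss from the teacher
    mimicry = kl_soft(peer_logits, s_logits)      # mimicry loss towards the peer student
    return ce + lam * kd + beta * mimicry

# In HDDML, one of the two students would simply drop the teacher term (lam = 0).
s1, s2, t = torch.randn(8, 10), torch.randn(8, 10), torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(fddml_student_loss(s1, s2.detach(), t, labels).item())
```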

It is widely acknowledged that in forward distillation the teacher generalizes better than the student. Knowledge comes from the teacher and is applied to the student; the transfer is one-way in that the teacher is the exporter of knowledge and the student is the receiver. The training of the teacher is not only time-consuming but can also largely determine the learning outcome of the student. In addition, it is difficult for the student to fully absorb the incoming knowledge, especially when the capacity gap between the teacher and the student is large.

3 Mutual Distillation

For the scenario in Fig 5, where a large-capacity teacher model is unavailable, Zhang et al. [37] proposed the idea of mutual distillation, where student models improve each other by learning together on a common dataset.

Fig 5 Mutual distillation. Students gain knowledge by learning from each other [33]

This deep mutual learning (DML) model consists of two student networks Θ_1 and Θ_2. Given N training samples X = {x_i}_{i=1}^N from M classes and the corresponding labels Y = {y_i}_{i=1}^N with y_i ∈ {1, 2, 3, ..., M}, the probability that a sample x_i belongs to category m under networks Θ_1 and Θ_2 is

$$p_1^{m}(x_i) = \frac{\exp\big(z_1^{m}\big)}{\sum_{m'=1}^{M}\exp\big(z_1^{m'}\big)}, \qquad p_2^{m}(x_i) = \frac{\exp\big(z_2^{m}\big)}{\sum_{m'=1}^{M}\exp\big(z_2^{m'}\big)},$$

where z^m denotes the logit output of the m-th category. During the training process, learning experiences are continuously shared between the two networks to achieve simultaneous progress. The total losses of networks Θ_1 and Θ_2 take the form

$$L_{\Theta_1} = L_{CE_1} + D_{KL}\big(p_2 \,\big\|\, p_1\big), \qquad L_{\Theta_2} = L_{CE_2} + D_{KL}\big(p_1 \,\big\|\, p_2\big),$$

where L_CE1 and L_CE2 are the cross-entropy losses of the two networks with respect to the ground-truth labels.
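A minimal sketch of one DML update is shown below, assuming (as in the formulation above) that each student is trained with its cross-entropy loss plus the KL divergence towards the other student's current predictions; the tiny networks and names are illustrative.

```python
# A minimal sketch of the two DML losses L_Theta1 and L_Theta2.
import torch
import torch.nn as nn
import torch.nn.functional as F

def dml_losses(logits_1, logits_2, labels):
    # L_1 = CE(p_1, y) + KL(p_2 || p_1);  L_2 = CE(p_2, y) + KL(p_1 || p_2)
    kl_12 = F.kl_div(F.log_softmax(logits_1, dim=-1),
                     F.softmax(logits_2.detach(), dim=-1), reduction="batchmean")
    kl_21 = F.kl_div(F.log_softmax(logits_2, dim=-1),
                     F.softmax(logits_1.detach(), dim=-1), reduction="batchmean")
    loss_1 = F.cross_entropy(logits_1, labels) + kl_12
    loss_2 = F.cross_entropy(logits_2, labels) + kl_21
    return loss_1, loss_2

# Each student has its own optimizer; the peer's predictions are detached so
# that each network is updated only on its own loss.
net_1, net_2 = nn.Linear(32, 10), nn.Linear(32, 10)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
l1, l2 = dml_losses(net_1(x), net_2(x), y)
print(l1.item(), l2.item())
```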
The DML performs well when the networks are trained in an end-to-end manner, but it does not fully explore the latent knowledge in the hidden layers. Yao et al. [38] proposed dense cross-layer mutual distillation (DCM), where the student networks Θ_s1 and Θ_s2 are trained together by attaching classifiers to the hidden layers of the two networks and by carrying out dense two-way KD operations between the layers of classifiers. On the one hand, knowledge is transferred from one student to the other between same-stage layers; on the other hand, two-way KD operations between different-stage layers further stimulate knowledge transfer. The attached classifiers are used during training and discarded before inference. The total loss takes the form

$$L = L_c + \alpha L_{ds} + \beta L_{dcm_1} + \gamma L_{dcm_2},$$

where α, β, and γ are hyperparameters. L_c denotes the classification loss of the main classifiers, and L_ds is the overall cross-entropy loss generated by adding attached classifiers to the different stage layers of the students, which takes the form

$$L_{ds} = \sum_{k}\Big( L_{CE}\big(p_{s1}^{(k)}, y\big) + L_{CE}\big(p_{s2}^{(k)}, y\big) \Big),$$

where p_{s1}^{(k)} and p_{s2}^{(k)} denote the outputs of the k-th attached classifiers of the two students. L_dcm1 and L_dcm2 denote the total losses of the same-stage and different-stage bidirectional KD operations, respectively. The specific expressions take the forms

$$L_{dcm_1} = \sum_{k}\Big( D_{KL}\big(p_{s1}^{(k)}\,\big\|\, p_{s2}^{(k)}\big) + D_{KL}\big(p_{s2}^{(k)}\,\big\|\, p_{s1}^{(k)}\big) \Big), \qquad L_{dcm_2} = \sum_{k\neq l}\Big( D_{KL}\big(p_{s1}^{(k)}\,\big\|\, p_{s2}^{(l)}\big) + D_{KL}\big(p_{s2}^{(l)}\,\big\|\, p_{s1}^{(k)}\big) \Big).$$
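The sketch below illustrates this loss structure under the formulation given above (auxiliary cross-entropy terms plus same-stage and different-stage bidirectional KL terms); all names are illustrative assumptions and details may differ from Yao et al. [38].

```python
# A rough sketch of the DCM loss structure over lists of stage-wise logits.
import torch
import torch.nn.functional as F

def bidirectional_kd(logits_a, logits_b, T=3.0):
    # Symmetric KL between temperature-softened outputs of two classifiers.
    pa, pb = F.softmax(logits_a / T, -1), F.softmax(logits_b / T, -1)
    la, lb = F.log_softmax(logits_a / T, -1), F.log_softmax(logits_b / T, -1)
    return (F.kl_div(la, pb, reduction="batchmean")
            + F.kl_div(lb, pa, reduction="batchmean")) * (T * T)

def dcm_loss(stage_logits_1, stage_logits_2, labels, alpha=1.0, beta=1.0, gamma=1.0):
    # stage_logits_k: list of logits from the attached classifiers of student k,
    # with the final entry being the main classifier output.
    l_c = F.cross_entropy(stage_logits_1[-1], labels) + F.cross_entropy(stage_logits_2[-1], labels)
    l_ds = sum(F.cross_entropy(z, labels) for z in stage_logits_1[:-1] + stage_logits_2[:-1])
    l_dcm1 = sum(bidirectional_kd(z1, z2) for z1, z2 in zip(stage_logits_1, stage_logits_2))
    l_dcm2 = sum(bidirectional_kd(z1, z2)
                 for i, z1 in enumerate(stage_logits_1)
                 for j, z2 in enumerate(stage_logits_2) if i != j)
    return l_c + alpha * l_ds + beta * l_dcm1 + gamma * l_dcm2

stages_1 = [torch.randn(8, 10) for _ in range(3)]
stages_2 = [torch.randn(8, 10) for _ in range(3)]
labels = torch.randint(0, 10, (8,))
print(dcm_loss(stages_1, stages_2, labels).item())
```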
Guo et al. [39] proposed a promising KD method via collaborative learning (KDCL), where multiple student networks with different abilities are jointly trained to produce soft targets of good quality. Specifically, the logits generated by each student are aggregated into ensemble logits, which then act as the teacher to impart knowledge to each student, thereby improving the generalization performance of DML.
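A minimal sketch of this idea follows; a simple average of the student logits is used as the aggregation rule purely for illustration, whereas KDCL itself considers several aggregation variants, and all names are assumptions.

```python
# A minimal sketch of collaborative learning with ensemble soft targets.
import torch
import torch.nn.functional as F

def kdcl_losses(student_logits, labels, T=3.0, lam=1.0):
    ensemble = torch.stack(student_logits).mean(dim=0).detach()   # ensemble "teacher" logits
    soft_target = F.softmax(ensemble / T, dim=-1)
    losses = []
    for logits in student_logits:
        ce = F.cross_entropy(logits, labels)
        kd = F.kl_div(F.log_softmax(logits / T, dim=-1), soft_target,
                      reduction="batchmean") * (T * T)
        losses.append(ce + lam * kd)
    return losses

logits = [torch.randn(8, 10) for _ in range(3)]   # three students of different abilities
labels = torch.randint(0, 10, (8,))
print([l.item() for l in kdcl_losses(logits, labels)])
```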

Gao et al. [40] proposed cross-architecture online distillation, where multiple students are co-trained in a distributed manner and the logit outputs of the students are aggregated on a server to generate soft targets, which are then employed to supervise the regularized training.

Mutual distillation regularization strategies achieve better generalization without a teacher model and can be applied to all kinds of homogeneous or heterogeneous networks. However, there are two limitations to this multi-branch design: first, the training process consumes substantial storage resources, and the number of students is limited by the available memory; second, and more importantly, the small number of branches cannot account for the wide range of uncertainty and diversity in the solution space.

4 Summarizing Table

In this section, we summarize the bulk of our survey in tables, which, hopefully, provide a road map for the rapidly growing field of KD regularization. In Table 4, we compare the two KD regularization strategies with respect to their typical scenarios, advantages, and disadvantages. We also test the performance of representative mutual distillation regularization methods on image classification and collect the benchmark results in Table 5.

Table 4 A comparison of forward distillation and mutual distillation strategies

Table 5 Benchmark results of representative mutual distillation regularization methods on image classification. For a fair comparison, all experiments were performed on the same dataset with identical settings in the Python environment

The CIFAR-10 dataset [41] consists of 32×32 colour images in 10 classes, with 50 000 training samples and 10 000 test samples, i.e., 5 000 training images per class. The CIFAR-100 dataset [41] poses a more challenging recognition task, with more classes and fewer samples per class: the training and test sets likewise contain 50 000 and 10 000 coloured natural scene images (32×32 pixels each), drawn from 100 classes. ResNet-18, ResNet-50, VGG-16, VGG-19, and DenseNet-121 are trained from scratch and optimized via stochastic gradient descent (SGD) with a momentum of 0.9, a weight decay of 5×10⁻⁴, and an initial learning rate of 0.1. The learning rate is divided by 10 at the 50th and 100th epochs, and the batch size is set to 64. We use typical data augmentation techniques: 32×32 random crops and horizontal flips. We use the image classification accuracy (%) as the evaluation metric for the model's generalization performance. The best performance is shown in boldface, where S1 is one student network and S2 is the other.
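For reference, the sketch below reproduces this training configuration with PyTorch and torchvision under a few assumptions not stated in the text: a crop padding of 4, torchvision's stock ResNet-18, and CIFAR-100 as the dataset.

```python
# A sketch of the training configuration described above: SGD with momentum 0.9,
# weight decay 5e-4, initial learning rate 0.1 divided by 10 at epochs 50 and 100,
# batch size 64, random crop and horizontal flip.
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.RandomCrop(32, padding=4),   # padding of 4 is a common choice and an assumption here
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR100(root="./data", train=True,
                                          download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

# torchvision's ImageNet-style ResNet-18 is used for brevity; CIFAR experiments
# often use a CIFAR-specific variant instead.
model = torchvision.models.resnet18(num_classes=100)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 100], gamma=0.1)

# Per epoch: iterate over train_loader, compute the chosen KD loss, step the
# optimizer, and then step the scheduler.
```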

5 Conclusion and Future Work

We have systematically reviewed KD regularization strategies in the literature for model generalization; our classification of these strategies is based on the teacher-student relationship between the models involved in knowledge transfer. When a high-capacity teacher model is available, forward distillation enhances student performance by transferring knowledge from the teacher to the students. When supervising teacher models are unavailable, mutual distillation regularization methods achieve performance improvements through mutual guidance among the students. Details of these two strategies are discussed in sections 2 and 3 and summarized in section 4.

Despite its notable successes, KD regularization still has a number of limitations. In traditional KD regularization, an optimal student model can only be obtained from a given teacher model. With pre-trained teacher models, it is currently only possible to distill from a teacher that performs the same task. Most existing KD regularization methods have only been applied to classification tasks. Finally, there is a lack of theory explaining empirical observations in KD, such as the fact that the performance of student models may still be improved via knowledge transfer from a teacher model with poor performance.

Accordingly, we point out some future trends in the field of KD regularization:

(a) Designing more effective KD regularizers when there is a significant difference in the capacities of the teacher and student models.

(b) Combining KD regularization with more structural regularization techniques to achieve better generalization.

(c) Distilling an all-around teacher model that is good at tackling different challenges from a number of teacher models that perform different tasks.

(d) Deriving tighter generalization bounds for KD methods to explore the factors affecting model generalization and to guide the design of new methods.