Enhanced knowledge distillation by auxiliary classifiers
Abstract
Deep neural models have shown promising results in various areas, e.g., computer vision and natural language processing, at the cost of high computation and storage resource consumption. These characteristics of deep neural networks have acted as a barrier to their deployment in resource-constrained environments, e.g., smartphones. Among the numerous approaches proposed to mitigate this limitation, knowledge distillation has gained much attention due to its generalizability and simplicity of implementation. This thesis introduces enhanced knowledge distillation (EKD), a simple yet effective approach that outperforms canonical knowledge distillation by using multiple classifier heads at various depths of the teacher. First, multiple classifier heads are attached to the teacher model at different depths. Because the backbone teacher is frozen, the mounted heads benefit from the fully trained feature extractor and converge quickly. In the final step, the cohort of all classifiers jointly supervises the student. EKD shows superior performance in comparison with several state-of-the-art distillation frameworks.
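The "cohort of classifiers" supervision described above can be sketched as a distillation loss averaged over all teacher heads. The following is a minimal NumPy illustration, assuming a temperature-scaled softmax and a simple unweighted average over heads; the exact loss weighting and combination rule used in EKD may differ.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax: higher T produces softer distributions.
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cohort_distillation_loss(student_logits, head_logits_list, T=4.0):
    """Average KL(teacher head || student) over all auxiliary heads.

    Illustrative sketch only: function name, temperature value, and the
    unweighted mean over heads are assumptions, not the thesis's exact
    formulation.
    """
    p_s = softmax(student_logits, T)
    per_head = []
    for head_logits in head_logits_list:
        p_t = softmax(head_logits, T)
        # KL divergence between each head's soft targets and the student.
        kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)
        per_head.append(kl.mean())
    # T^2 scaling keeps gradient magnitudes comparable across temperatures,
    # as in the canonical soft-target distillation loss.
    return (T ** 2) * float(np.mean(per_head))
```

In practice this soft-target term is typically combined with the standard cross-entropy on ground-truth labels when training the student.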