$$L=CE(y,p)+aCE(q,p)$$
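As a minimal sketch of this combined loss, the following implements it in plain Python; the helper names, the weight `alpha`, and the optional temperature `T` are illustrative assumptions, not part of the original text:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """CE(target, pred) = -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def distill_loss(y_onehot, teacher_probs, student_logits, alpha=0.5, T=1.0):
    """L = CE(y, p) + alpha * CE(q, p):
    hard-label term on the true label y plus a soft-target term
    against the teacher distribution q, both over the student's p."""
    p = softmax(student_logits, T)
    return cross_entropy(y_onehot, p) + alpha * cross_entropy(teacher_probs, p)
```

For example, `distill_loss([0, 1, 0], [0.1, 0.8, 0.1], [1.0, 2.0, 0.5])` combines the student's error on the hard label with its divergence from the teacher's soft prediction.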

《Does Knowledge Distillation Really Work?》 attempts to compare the correlation between a student model's generalization ability and its fidelity to the teacher. Generalization is the model's ability, after training, to make accurate predictions on new data; fidelity better reflects how much of the teacher's knowledge the student has actually distilled.

Several factors can lead to low fidelity:

• the student model's capacity is too weak;
• the student and teacher network architectures differ substantially;
• the distillation data is insufficient, or the wrong data was chosen;
• the optimization during distillation goes wrong.
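Fidelity in this setting is commonly measured as top-1 agreement: the fraction of inputs on which the student predicts the same class as the teacher. A minimal sketch, with illustrative function names:

```python
def top1(probs):
    """Index of the highest-probability class."""
    return max(range(len(probs)), key=lambda i: probs[i])

def agreement(teacher_batch, student_batch):
    """Fraction of examples where the student's top-1 prediction
    matches the teacher's top-1 prediction."""
    matches = sum(
        1 for t, s in zip(teacher_batch, student_batch) if top1(t) == top1(s)
    )
    return matches / len(teacher_batch)
```

Note that agreement is computed against the teacher's predictions, not the ground-truth labels, which is exactly why it can diverge from generalization accuracy.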

To summarize, we have at last identified a root cause of the ineffectiveness of all our previous interventions on the knowledge distillation procedure. Knowledge distillation is unable to converge to optimal student parameters, even when we know a solution and give the initialization a small head start in the direction of an optimum. Indeed, while identifiability can be an issue, in order to match the teacher on all inputs, the student has to at least match the teacher on the data used for distillation, and achieve a near-optimal value of the distillation loss. Furthermore, the suboptimal convergence of knowledge distillation appears to be a consequence of the optimization dynamics specifically, and not simply initialization bias. In practice, optimization converges to sub-optimal solutions, leading to poor distillation fidelity.
• What we can conclude is that the student model's generalization performance and its fidelity do not move in the same direction, and fidelity in turn is strongly related to the calibration of the distilled model.
• Optimization during knowledge distillation is difficult, and this is the main cause of low fidelity.
• There appears to be some trade-off between the complexity of distillation and data quality.
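Since the summary ties fidelity to calibration, a standard way to quantify calibration is expected calibration error (ECE). The sketch below is a minimal illustrative version (the binning scheme and names are assumptions, not from the original text):

```python
def ece(confidences, correct, n_bins=10):
    """Expected calibration error: bin predictions by confidence, then
    take the weighted average gap between mean confidence and accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        err += (len(b) / total) * abs(avg_conf - acc)
    return err
```

A perfectly calibrated model whose 90%-confident predictions are right 90% of the time would score an ECE near zero; a model that is confident but wrong scores high.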