**Abstract**: In this paper we revisit the efficacy of knowledge distillation as a function matching and metric learning problem. In doing so we verify three important design decisions, namely the normalisation, soft maximum function, and projection layers, as key ingredients. We theoretically show that the projector implicitly encodes information on past examples, enabling relational gradients for the student. We then show that the normalisation of representations is tightly coupled with the training dynamics of this projector, which can have a large impact on the student's performance. Finally, we show that a simple soft maximum function can be used to address any significant capacity gap problems. Experimental results on various benchmark datasets demonstrate that using these insights can lead to superior or comparable performance to state-of-the-art knowledge distillation techniques, despite being much more computationally efficient. In particular, we obtain these results across image classification (CIFAR100 and ImageNet), object detection (COCO2017), and on more difficult distillation objectives, such as training data-efficient transformers, whereby we attain a 77.2% top-1 accuracy with DeiT-Ti on ImageNet. Code and models are publicly available.
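Below is a minimal PyTorch sketch of the three ingredients this abstract names (a learnable projector on the student, normalisation of both representations, and a soft maximum over per-dimension errors). The feature dimensions, the batch-norm choice, and the log-sum-exp form are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class FeatureDistillationLoss(nn.Module):
    """Hedged sketch of the abstract's three ingredients: a learnable
    projector on the student, normalisation of both representations,
    and a soft maximum over per-dimension errors. Details here are
    illustrative assumptions, not the paper's exact loss."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Linear projector maps student features into the teacher's space.
        self.projector = nn.Linear(student_dim, teacher_dim, bias=False)
        # Normalisation of both representations (batch norm is one plausible choice).
        self.bn_s = nn.BatchNorm1d(teacher_dim)
        self.bn_t = nn.BatchNorm1d(teacher_dim)

    def forward(self, feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        z_s = self.bn_s(self.projector(feat_s))  # (B, teacher_dim)
        z_t = self.bn_t(feat_t)                  # (B, teacher_dim)
        diff = (z_s - z_t).abs()
        # Soft maximum (log-sum-exp) over feature dimensions instead of a hard max,
        # which the abstract argues helps with large teacher-student capacity gaps.
        return torch.logsumexp(diff, dim=1).mean()
```

In practice a feature loss of this kind would be added to the usual cross-entropy (or logit-distillation) objective with a weighting coefficient.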
FrankenSplit: Efficient Neural Feature Compression With Shallow Variational Bottleneck Injection for Mobile Edge Computing
**Abstract**: The rise of mobile AI accelerators allows latency-sensitive applications to execute lightweight Deep Neural Networks (DNNs) on the client side. However, critical applications require powerful models that edge devices cannot host and must therefore offload requests, where the high-dimensional data will compete for limited bandwidth. Split Computing (SC) alleviates resource inefficiency by partitioning DNN layers across devices, but current methods are overly specific and only marginally reduce bandwidth consumption. This work proposes shifting away from focusing on executing shallow layers of partitioned DNNs. Instead, it advocates concentrating the local resources on variational compression optimized for machine interpretability. We introduce a novel framework for resource-conscious compression models and extensively evaluate our method in an environment reflecting the asymmetric resource distribution between edge devices and servers. Our method achieves 60% lower bitrate than a state-of-the-art SC method without decreasing accuracy and is up to 16x faster than offloading with existing codec standards.
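A rough PyTorch sketch of the split-computing idea described above: a lightweight edge-side encoder compresses shallow backbone features, and a server-side decoder reconstructs them for the remaining layers, with a rate term estimated from a simple prior. The channel counts, the uniform-noise quantisation proxy, and the unit-Gaussian prior are illustrative assumptions, not the FrankenSplit implementation.

```python
import torch
import torch.nn as nn


class ShallowVariationalBottleneck(nn.Module):
    """Hedged sketch of a shallow variational bottleneck for split computing:
    a small encoder runs on the edge device, a heavier decoder runs on the
    server. All architectural details are assumptions for illustration."""

    def __init__(self, in_ch: int = 64, latent_ch: int = 16):
        super().__init__()
        # Lightweight encoder intended to run on the mobile device.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, latent_ch, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(latent_ch, latent_ch, kernel_size=5, stride=2, padding=2),
        )
        # Heavier decoder intended to run on the server.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, in_ch, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(in_ch, in_ch, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, shallow_feats: torch.Tensor):
        y = self.encoder(shallow_feats)
        if self.training:
            # Additive uniform noise as the usual differentiable stand-in for quantisation.
            y_hat = y + torch.empty_like(y).uniform_(-0.5, 0.5)
        else:
            y_hat = torch.round(y)
        # Rate estimate (in nats per sample) under a unit Gaussian prior; illustrative only.
        rate = -torch.distributions.Normal(0.0, 1.0).log_prob(y_hat).sum() / shallow_feats.size(0)
        return self.decoder(y_hat), rate
```

The distortion term would then compare the reconstructed features against the frozen backbone's (or teacher's) own features rather than reconstructing pixels, in line with the abstract's emphasis on compression optimised for machine interpretability, and training would minimise distortion plus a weighted rate term.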
torchdistill Meets Hugging Face Libraries for Reproducible, Coding-Free Deep Learning Studies: A Case Study on NLP