Self-supervised learning and knowledge distillation intersect to achieve strong performance on downstream tasks across diverse network capacities. This paper introduces MIM-HD, which enhances masked image modeling (MIM) distillation in two key aspects. First, a head-level relation adaptive distillation approach for vision transformers is proposed, allowing the student to dynamically draw multi-source knowledge from the teacher according to its evolving state, even when the teacher and student transformer blocks have different head counts. Second, to address the overemphasis on the encoder and the neglect of the decoder's role in maintaining representation consistency in previous MIM distillation methods, a dual-view decoding strategy for latent visual representations is introduced, reusing the teacher's decoder to relieve the MIM burden on smaller networks. The effectiveness of MIM-HD is demonstrated on ADE20K (mIoU) and ImageNet-1K (Acc), where it improves on state-of-the-art methods by +1.4% and +0.5%, respectively, with substantial advantages on smaller pre-training datasets. Moreover, MIM-HD is markedly more efficient, reducing pre-training from 300 epochs to 100.
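To make the first contribution concrete, below is a minimal PyTorch sketch of one plausible form of head-level adaptive distillation. The class name, the choice of per-head attention maps as the "relation", and the learnable softmax mixing weights are illustrative assumptions, not the exact MIM-HD formulation; the point is only to show how each student head can adaptively draw from all teacher heads when head counts differ.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HeadRelationDistillLoss(nn.Module):
    """Hypothetical sketch: each student head distills from a learnable,
    softmax-weighted mix of all teacher heads, so a teacher with H_t heads
    can supervise a student with H_s != H_t heads."""

    def __init__(self, n_teacher_heads: int, n_student_heads: int):
        super().__init__()
        # One row of teacher-head mixing logits per student head. Because the
        # logits are learnable, the weighting can adapt as the student's
        # state evolves during pre-training.
        self.mix_logits = nn.Parameter(torch.zeros(n_student_heads, n_teacher_heads))

    def forward(self, attn_teacher: torch.Tensor, attn_student: torch.Tensor) -> torch.Tensor:
        # attn_teacher: (B, H_t, N, N) per-head attention maps, detached upstream.
        # attn_student: (B, H_s, N, N) per-head attention maps of the student.
        weights = self.mix_logits.softmax(dim=-1)  # (H_s, H_t)
        # Adaptive multi-source target: a weighted mix of teacher heads
        # assembled separately for each student head.
        target = torch.einsum("st,btij->bsij", weights, attn_teacher)  # (B, H_s, N, N)
        return F.mse_loss(attn_student, target)

# Usage with an assumed 12-head teacher distilling into a 6-head student:
# loss_fn = HeadRelationDistillLoss(n_teacher_heads=12, n_student_heads=6)
# loss = loss_fn(attn_teacher.detach(), attn_student)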