{"title":"Appendix","authors":"Xueqing Deng, Dawei Sun, S. Newsam, Peng Wang","doi":"10.2307/j.ctvkwnqg0.12","DOIUrl":null,"url":null,"abstract":"In this section, we present the implementation details on the experiments performed on transformer. We select ViT-B [2] with patch size of 16 as our teacher model and DeiT-Tiny [4] as our student model. We reproduce the baseline result with 4 GPUs and the total batch size is 1024. However, for searching the distillation process, we have to reduce the batch size to 256 due to limited GPU memory as we have pathways between the feature maps from teacher and student. Meanwhile, we keep the same batch size for retraining after searching. The most significant difference between the implementations of convolutional neural networks (CNNs) and transformers is the transform block. Our experimental results show that the proposed transform block on CNNs is not applicable to transformer yielding much worse performance on distillation compared to non-distillation. Therefore, we employ a transformer-style block to serve as a transform block for feature transfer between the teacher and student whose architectures are transformers as shown in Fig. 1. We follow similar search pipeline with a search learning rate of 1e-3. Once the distillation process is obtained, we train the models with 150 epochs for both ReviewKD [1] and our proposed DistPro following the same configurations in DeiT [4].","PeriodicalId":347866,"journal":{"name":"The Rights Paradox","volume":"2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Rights Paradox","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2307/j.ctvkwnqg0.12","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this section, we present the implementation details of the experiments performed on transformers. We select ViT-B [2] with a patch size of 16 as our teacher model and DeiT-Tiny [4] as our student model. We reproduce the baseline result with 4 GPUs and a total batch size of 1024. However, when searching for the distillation process, we have to reduce the batch size to 256 due to limited GPU memory, since we maintain pathways between the teacher and student feature maps. We keep the same batch size for retraining after the search. The most significant difference between the implementations for convolutional neural networks (CNNs) and transformers is the transform block. Our experimental results show that the transform block proposed for CNNs is not applicable to transformers, yielding much worse performance with distillation than without it. Therefore, we employ a transformer-style block as the transform block for feature transfer when both the teacher and student architectures are transformers, as shown in Fig. 1. We follow a similar search pipeline with a search learning rate of 1e-3. Once the distillation process is obtained, we train the models for 150 epochs for both ReviewKD [1] and our proposed DistPro, following the same configurations as DeiT [4].
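
To make the feature-transfer setup concrete, the following is a minimal PyTorch sketch of what a transformer-style transform block could look like: one pre-norm self-attention + MLP block applied to the student tokens, followed by a linear projection into the teacher's embedding dimension, with an MSE feature-distillation loss on top. The class and function names, the specific dimensions (192 for DeiT-Tiny, 768 for ViT-B), the choice of MSE, and the assumption of matching token counts are illustrative assumptions, not the authors' exact block.

```python
# Hedged sketch of a transformer-style transform block for teacher-student
# feature transfer; names, dims, and the MSE loss are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerTransformBlock(nn.Module):
    """Pre-norm self-attention + MLP block, then a linear projection that
    maps student tokens into the teacher's embedding space."""

    def __init__(self, student_dim=192, teacher_dim=768, num_heads=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(student_dim)
        self.attn = nn.MultiheadAttention(student_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(student_dim)
        hidden = int(student_dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(student_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, student_dim),
        )
        # Project the transformed student tokens to the teacher's dimension.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, x):                       # x: (B, N, student_dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]           # self-attention with residual
        x = x + self.mlp(self.norm2(x))         # MLP with residual
        return self.proj(x)                     # (B, N, teacher_dim)


def feature_distill_loss(student_tokens, teacher_tokens, transform):
    """MSE between transformed student tokens and detached teacher tokens."""
    return F.mse_loss(transform(student_tokens), teacher_tokens.detach())


if __name__ == "__main__":
    transform = TransformerTransformBlock()
    s = torch.randn(2, 197, 192)   # DeiT-Tiny tokens (cls + 14x14 patches), assumed layout
    t = torch.randn(2, 197, 768)   # ViT-B/16 tokens, assumed matching token count
    print(feature_distill_loss(s, t, transform).item())
```

In this sketch the transform block is trained jointly with the student, while the teacher tokens are detached so gradients only flow through the student pathway; this mirrors the general feature-distillation pattern the paragraph describes rather than a confirmed implementation detail.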