{"title":"Optimizing Vision Transformer Performance with Customizable Parameters","authors":"E. Ibrahimović","doi":"10.23919/MIPRO57284.2023.10159761","DOIUrl":null,"url":null,"abstract":"This paper experimentally examined the effects of changing the size of image patches and the number of transformer layers on the training time and accuracy of a vision transformer used for image classification. The transformer architecture was first introduced in 2017 as a new way of processing natural language and has since found applications in computer vision as well. In this experiment, we trained and tested fourteen versions of a vision transformer on the CIFAR-100 dataset using graphical processing units provided by Google Colaboratory. The results showed that increasing the number of transformer layers and decreasing the patch size both increased test accuracy and training time. However, learning curves generated by the models showed overfitting for very small patch sizes. Overall, changing patch size had a greater impact on accuracy than changing the number of transformer layers. The results also suggested that transformers are more resource-intensive to train than other models. We suppose that including a classification token could lead to shorter training times, but another experiment is needed to examine its influence on accuracy.","PeriodicalId":177983,"journal":{"name":"2023 46th MIPRO ICT and Electronics Convention (MIPRO)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 46th MIPRO ICT and Electronics Convention (MIPRO)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/MIPRO57284.2023.10159761","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
This paper experimentally examined the effects of changing the size of image patches and the number of transformer layers on the training time and accuracy of a vision transformer used for image classification. The transformer architecture was first introduced in 2017 as a new way of processing natural language and has since found applications in computer vision as well. In this experiment, we trained and tested fourteen versions of a vision transformer on the CIFAR-100 dataset using graphics processing units provided by Google Colaboratory. The results showed that increasing the number of transformer layers and decreasing the patch size each increased both test accuracy and training time. However, the learning curves showed overfitting for very small patch sizes. Overall, changing the patch size had a greater impact on accuracy than changing the number of transformer layers. The results also suggested that transformers are more resource-intensive to train than other models. We hypothesize that including a classification token could shorten training times, but a further experiment is needed to examine its influence on accuracy.
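As a rough illustration of the two parameters varied in the experiment, the following is a minimal Vision Transformer sketch in PyTorch. This is not the authors' code: the embedding dimension, head count, and mean-pooling readout are assumptions chosen for brevity. Here `patch_size` controls how many tokens a 32x32 CIFAR-100 image is split into, and `num_layers` controls the depth of the encoder.

```python
# Hypothetical sketch (not the paper's implementation) of a small Vision Transformer
# whose two parameters of interest, patch_size and num_layers, can be varied as in
# the experiments described above. CIFAR-100 images are 32x32 RGB with 100 classes.
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, patch_size=4, num_layers=6, embed_dim=128, num_heads=4,
                 image_size=32, num_classes=100):
        super().__init__()
        assert image_size % patch_size == 0, "patch size must divide image size"
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding: a strided convolution splits the image into non-overlapping
        # patches and projects each patch to embed_dim.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)
        # Learnable positional embeddings, one per patch. No classification token is
        # used here (the abstract suggests one was not included); mean pooling is assumed.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

        # Stack of standard transformer encoder layers; num_layers is the second
        # parameter varied in the experiment.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=4 * embed_dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)            # (B, embed_dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)
        x = x + self.pos_embed
        x = self.encoder(x)
        x = x.mean(dim=1)                  # average over patch tokens
        return self.head(x)

# Smaller patches yield longer token sequences; more layers yield a deeper model,
# which is why both changes increase training time.
model = MiniViT(patch_size=4, num_layers=8)
logits = model(torch.randn(2, 3, 32, 32))  # -> shape (2, 100)
```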