The Application of Vision Transformer in Image Classification
Zhixuan He
Proceedings of the 6th International Conference on Virtual and Augmented Reality Simulations, published 2022-03-25
DOI: 10.1145/3546607.3546616 (https://doi.org/10.1145/3546607.3546616)
Citation count: 0
Abstract
This project compares the performance of a Vision Transformer and a Convolutional Neural Network (CNN) on image classification. Both models are built with Keras and TensorFlow in Python and trained separately on the CIFAR-100 image dataset, with Google Colab as the environment; their performance is then compared through the training results. The experiment finds that, at the scale of 60,000 images, the CNN performs slightly better than the Vision Transformer overall. Evaluated on the test set, the CNN reaches a top-5 accuracy of 82.38%, while the Vision Transformer reaches 82.24%.
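Since the comparison rests on top-5 accuracy, a minimal sketch of how that metric is computed may help: a prediction counts as correct when the true class is among the five highest-scoring classes. The function name `top_k_accuracy` and the toy scores below are illustrative assumptions, not code or data from the paper. Plain Python is used so the example runs without TensorFlow.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]], labels: Sequence[int], k: int = 5
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row (ties go to the lower index).
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(labels)


# Toy example: 3 samples, 6 classes (hypothetical scores, not results from the paper).
toy_scores = [
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],  # true class 2 is ranked first
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 5 is ranked last
    [0.10, 0.35, 0.05, 0.30, 0.15, 0.05],  # true class 3 is ranked second
]
toy_labels = [2, 5, 3]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top5 = top_k_accuracy(toy_scores, toy_labels, k=5)
print(f"top-1: {top1:.4f}, top-5: {top5:.4f}")
```

On CIFAR-100 each row of `scores` would hold 100 class scores, and the metric would be averaged over the test images. Keras provides a built-in metric of this kind, `tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5)`; whether the paper used that specific metric is not stated in the abstract.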