Comparison of large-scale pre-trained models based on ViT, Swin Transformer and ConvNeXt

Jiapeng Yu
International Conference on Artificial Intelligence and Computer Engineering (ICAICE 2022), published 2023-04-28. DOI: 10.1117/12.2671201

Abstract: Deep learning has advanced tremendously in computer vision, and large-scale pre-training has received increasing attention from experts and researchers. Models often differ substantially in training speed and accuracy when pre-trained at scale, so choosing an appropriate model for large-scale pre-training is particularly important. This experiment uses the same image dataset and the same hardware to build an image classification model with each of three mainstream large-scale pre-trained models for image recognition, Vision Transformer (ViT), Swin Transformer and ConvNeXt, and analyzes the strengths and weaknesses of each model from the experimental results. Vision Transformer runs fastest in the classification experiments but is less accurate than the other two models; Swin Transformer is the slowest with average accuracy; ConvNeXt is the most accurate but only moderately fast. These results offer a reference for future model selection in large-scale pre-training tasks in computer vision, and can reduce training time and improve training accuracy to some extent.
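The comparison protocol described above, running each model on the same dataset and hardware while recording accuracy and wall-clock time, can be sketched with a minimal, library-free harness. This is a hypothetical illustration of the methodology only: the paper does not publish its benchmarking code, and the callables below are toy stand-ins, not the actual ViT, Swin Transformer, or ConvNeXt models.

```python
import time

def benchmark(model, dataset):
    """Run `model` over `dataset` and return (accuracy, elapsed_seconds).

    `model` is any callable mapping an input to a predicted label;
    `dataset` is a list of (input, true_label) pairs. This fixes the
    data and measures each model under identical conditions, mirroring
    the experiment's controlled comparison.
    """
    start = time.perf_counter()
    correct = sum(1 for x, y in dataset if model(x) == y)
    elapsed = time.perf_counter() - start
    return correct / len(dataset), elapsed

# Toy placeholders for the three pre-trained models (illustrative only).
models = {
    "ViT": lambda x: x % 2,
    "Swin-Transformer": lambda x: x % 2,
    "ConvNeXt": lambda x: x % 2,
}
toy_dataset = [(i, i % 2) for i in range(1000)]

for name, m in models.items():
    acc, secs = benchmark(m, toy_dataset)
    print(f"{name}: accuracy={acc:.2f}, time={secs:.6f}s")
```

In practice the callables would be replaced by the pre-trained networks (e.g. instantiated from a model zoo) evaluated on the shared image dataset; the harness itself stays the same, which is what makes the speed and accuracy figures comparable across models.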