{"title":"基于Swin变压器的形状识别框架","authors":"Tianyang Gu, Ruipeng Min","doi":"10.1145/3529836.3529894","DOIUrl":null,"url":null,"abstract":"Shape recognition is a fundamental problem in the field of computer vision, which aims to classify various shapes. The current mainstream network architecture is convolutional neural network (CNN), however, CNN offers limited ability to extract valuable information from simple shapes for shape classification. To address this problem, this paper proposes a deep learning model based on self-attention and Vision Transformers structure (ViT) to achieve shape recognition. Compared with the traditional CNN structure, ViT considers the long-distance relationship and reduces the loss of information between layers. The model utilizes a shifted-window hierarchical vision transformer (Swin Transformer) structure and an all-scale shape representation to improve the performance of the model. Experimental results show that the proposed model achieves superior accuracy compared to other methods, achieving an accuracy of 93.82% on the animal dataset, while the performance of state-of-the-art VGG-based method is only 90.02%.","PeriodicalId":285191,"journal":{"name":"2022 14th International Conference on Machine Learning and Computing (ICMLC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"A Swin Transformer based Framework for Shape Recognition\",\"authors\":\"Tianyang Gu, Ruipeng Min\",\"doi\":\"10.1145/3529836.3529894\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Shape recognition is a fundamental problem in the field of computer vision, which aims to classify various shapes. The current mainstream network architecture is convolutional neural network (CNN), however, CNN offers limited ability to extract valuable information from simple shapes for shape classification. To address this problem, this paper proposes a deep learning model based on self-attention and Vision Transformers structure (ViT) to achieve shape recognition. Compared with the traditional CNN structure, ViT considers the long-distance relationship and reduces the loss of information between layers. The model utilizes a shifted-window hierarchical vision transformer (Swin Transformer) structure and an all-scale shape representation to improve the performance of the model. 
Experimental results show that the proposed model achieves superior accuracy compared to other methods, achieving an accuracy of 93.82% on the animal dataset, while the performance of state-of-the-art VGG-based method is only 90.02%.\",\"PeriodicalId\":285191,\"journal\":{\"name\":\"2022 14th International Conference on Machine Learning and Computing (ICMLC)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-02-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 14th International Conference on Machine Learning and Computing (ICMLC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3529836.3529894\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 14th International Conference on Machine Learning and Computing (ICMLC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3529836.3529894","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Shape recognition is a fundamental problem in computer vision that aims to classify various shapes. The current mainstream architecture is the convolutional neural network (CNN); however, CNNs have limited ability to extract valuable information from simple shapes for classification. To address this problem, this paper proposes a deep learning model based on self-attention and the Vision Transformer (ViT) architecture for shape recognition. Compared with traditional CNN structures, ViT models long-distance relationships and reduces the loss of information between layers. The model employs a shifted-window hierarchical vision transformer (Swin Transformer) and an all-scale shape representation to improve performance. Experimental results show that the proposed model achieves superior accuracy compared to other methods, reaching 93.82% on the animal dataset, while a state-of-the-art VGG-based method achieves only 90.02%.
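The paper's own implementation is not reproduced here. As a rough illustration of the overall approach, the sketch below fine-tunes a pretrained Swin-Tiny backbone from torchvision for shape classification. The class count, batch size, and dummy data are hypothetical placeholders, and the all-scale shape representation described in the abstract is not modeled.

```python
# Minimal sketch (not the authors' code): fine-tuning a pretrained
# Swin Transformer for N-way shape classification with torchvision.
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

NUM_CLASSES = 20  # hypothetical: e.g., 20 animal shape categories

# Load a Swin-Tiny backbone pretrained on ImageNet.
model = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)

# Replace the classification head to match the shape dataset.
model.head = nn.Linear(model.head.in_features, NUM_CLASSES)

# Standard cross-entropy fine-tuning setup.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step on a dummy batch of 224x224 shape images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (8,))

optimizer.zero_grad()
logits = model(images)          # (8, NUM_CLASSES)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

In practice, binary shape silhouettes would be rendered (or replicated) into three-channel images and resized to the backbone's expected input resolution before being fed to the model.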