Image classification using DETR based object-level feature
Chung-Gi Ban, Dayoung Park, Youngbae Hwang
2022 22nd International Conference on Control, Automation and Systems (ICCAS), published 2022-11-27
DOI: 10.23919/ICCAS55662.2022.10003912
The objects in an image carry the main information for image classification. When the background is complex or the object is small, existing invariant features such as Scale Invariant Feature Transform (SIFT) or Speeded Up Robust Features (SURF) are difficult to use for object-level representation. Because SIFT cannot distinguish whether a feature contains relevant object information, the resulting representation may include background or otherwise uninformative features. We use Detection Transformer (DETR), a state-of-the-art object detector, to represent object-level information. By visualizing the attention maps of the Transformer decoder, we find that each output vector effectively indicates an object region. Bag of visual words (BoVW) is applied to represent the N output vectors of DETR as the feature of a query image. On scene-level and object-level datasets, we compare our method with SIFT-based BoVW on an image classification task. We show that the proposed method performs better than SIFT-based BoVW on the object-level dataset.
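The BoVW step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the codebook is built with plain k-means, each image is encoded as an L1-normalised word histogram, and random vectors stand in for real DETR decoder outputs. The sizes N = 100, D = 256, and K = 16, as well as the function names build_codebook, assign_words, and bovw_histogram, are hypothetical choices made for illustration only.

```python
import numpy as np


def assign_words(vectors, centroids):
    """Index of the nearest centroid for every vector (squared Euclidean)."""
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def build_codebook(vectors, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm); returns k centroids ("visual words").

    Assumption: the abstract does not say how the codebook is built.
    """
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = assign_words(vectors, centroids)
        for j in range(k):
            members = vectors[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids


def bovw_histogram(vectors, centroids):
    """L1-normalised histogram of visual-word counts for one image."""
    words = assign_words(vectors, centroids)
    hist = np.bincount(words, minlength=len(centroids)).astype(float)
    return hist / hist.sum()


# Stand-in data: random vectors replace real DETR decoder outputs.
rng = np.random.default_rng(0)
N, D, K = 100, 256, 16  # assumed: vectors per image, vector size, codebook size
train_images = [rng.normal(size=(N, D)) for _ in range(5)]

# Learn the visual words from all training vectors, then encode a query image.
codebook = build_codebook(np.vstack(train_images), k=K)
query_feature = bovw_histogram(rng.normal(size=(N, D)), codebook)
print(query_feature.shape)
```

In a real pipeline, the random arrays would be replaced by the N output vectors that DETR produces for each image, and the resulting fixed-length histograms would be passed to a classifier of choice. The histogram length depends only on the codebook size K, not on N, so images can be compared regardless of how many output vectors each one contributes.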