{"title":"基于多模态无监督域自适应的细粒度自中心动作识别","authors":"Xianyuan Liu, Tao Lei, Ping Jiang","doi":"10.1109/ITNEC56291.2023.10082267","DOIUrl":null,"url":null,"abstract":"Fine-grained egocentric action recognition has made significant progress because of the advancement of supervised learning. Some real-world applications require the network trained on one dataset to perform well on another unlabeled dataset due to the difficulty of annotating new data. However, due to the disparity in dataset distributions, i.e. domain shift, the network is unable to retain its good performance across datasets. Therefore, in this paper, we use unsupervised domain adaptation to address this difficult challenge, i.e. training a model on labeled source data such that it can be directly used on unlabeled target data with the same categories. First, we use Transformer to capture spatial information, and then we propose a temporal attention module to model temporal interdependence. In consideration of the fact that multi-modal data provides more kinds of important information, we build a tri-stream network for spatio-temporal information fusion. Finally, we align source data with target data using adversarial learning. Our network outperforms the baselines on the largest egocentric dataset, the EPIC-KITCHENS-100 dataset.","PeriodicalId":218770,"journal":{"name":"2023 IEEE 6th Information Technology,Networking,Electronic and Automation Control Conference (ITNEC)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Fine-Grained Egocentric Action Recognition with Multi-Modal Unsupervised Domain Adaptation\",\"authors\":\"Xianyuan Liu, Tao Lei, Ping Jiang\",\"doi\":\"10.1109/ITNEC56291.2023.10082267\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Fine-grained egocentric action recognition has made significant progress because of the advancement of supervised learning. Some real-world applications require the network trained on one dataset to perform well on another unlabeled dataset due to the difficulty of annotating new data. However, due to the disparity in dataset distributions, i.e. domain shift, the network is unable to retain its good performance across datasets. Therefore, in this paper, we use unsupervised domain adaptation to address this difficult challenge, i.e. training a model on labeled source data such that it can be directly used on unlabeled target data with the same categories. First, we use Transformer to capture spatial information, and then we propose a temporal attention module to model temporal interdependence. In consideration of the fact that multi-modal data provides more kinds of important information, we build a tri-stream network for spatio-temporal information fusion. Finally, we align source data with target data using adversarial learning. 
Our network outperforms the baselines on the largest egocentric dataset, EPIC-KITCHENS-100.
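
The abstract does not specify the temporal attention module's design. A common instantiation scores each frame, softmax-normalizes the scores over time, and pools the per-frame features into a clip-level descriptor; the PyTorch sketch below illustrates that idea. The module name, the linear scoring head, and all dimensions are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn as nn

    class TemporalAttention(nn.Module):
        # Assumed design: one scalar score per frame, softmax over the time
        # axis, then an attention-weighted sum -> one clip-level feature.
        def __init__(self, feat_dim: int):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, time, feat_dim), per-frame features from the spatial backbone
            weights = torch.softmax(self.score(x), dim=1)  # (batch, time, 1)
            return (weights * x).sum(dim=1)                # (batch, feat_dim)

    # Example: pool 8 per-frame features of each clip into one descriptor.
    clip = torch.randn(2, 8, 512)
    pooled = TemporalAttention(512)(clip)  # shape: (2, 512)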
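Likewise, "adversarial learning" for source-target alignment is commonly realized with a DANN-style gradient reversal layer feeding a domain discriminator; the sketch below shows that pattern under the assumption that the paper follows it. The class names and layer sizes are hypothetical.

    import torch
    import torch.nn as nn

    class GradReverse(torch.autograd.Function):
        # Identity in the forward pass; flips (and scales) the gradient in
        # backward, so the feature extractor learns domain-confusing features.
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lam * grad_output, None

    class DomainDiscriminator(nn.Module):
        # Predicts source (0) vs. target (1) from the fused spatio-temporal feature.
        def __init__(self, feat_dim: int, lam: float = 1.0):
            super().__init__()
            self.lam = lam
            self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 2))

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.net(GradReverse.apply(feats, self.lam))

    # Example: domain logits for a batch of fused clip features.
    logits = DomainDiscriminator(512)(torch.randn(4, 512))  # shape: (4, 2)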