Ruiqi Wang;Wonse Jo;Dezhong Zhao;Weizheng Wang;Arjun Gupte;Baijian Yang;Guohua Chen;Byung-Cheol Min
{"title":"Husformer:用于多模态人体状态识别的多模态变换器","authors":"Ruiqi Wang;Wonse Jo;Dezhong Zhao;Weizheng Wang;Arjun Gupte;Baijian Yang;Guohua Chen;Byung-Cheol Min","doi":"10.1109/TCDS.2024.3357618","DOIUrl":null,"url":null,"abstract":"Human state recognition is a critical topic with pervasive and important applications in human–machine systems. Multimodal fusion, which entails integrating metrics from various data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal-based models have shown promising results, they often fall short in fully leveraging sophisticated fusion strategies essential for modeling adequate cross-modal dependencies in the fusion representation. Instead, they rely on costly and inconsistent feature crafting and alignment. To address this limitation, we propose an end-to-end multimodal transformer framework for multimodal human state recognition called \n<italic>Husformer</i>\n. Specifically, we propose using cross-modal transformers, which inspire one modality to reinforce itself through directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets [multimodal dataset for objective cognitive workload assessment on simultaneous tasks (MOCAS) and CogLoad] demonstrate that in the recognition of the human state, our \n<italic>Husformer</i>\n outperforms both state-of-the-art multimodal baselines and the use of a single modality by a large margin, especially when dealing with raw multimodal features. We also conducted an ablation study to show the benefits of each component in \n<italic>Husformer</i>\n. Experimental details and source code are available at \n<uri>https://github.com/SMARTlab-Purdue/Husformer</uri>\n.","PeriodicalId":54300,"journal":{"name":"IEEE Transactions on Cognitive and Developmental Systems","volume":null,"pages":null},"PeriodicalIF":5.0000,"publicationDate":"2024-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Husformer: A Multimodal Transformer for Multimodal Human State Recognition\",\"authors\":\"Ruiqi Wang;Wonse Jo;Dezhong Zhao;Weizheng Wang;Arjun Gupte;Baijian Yang;Guohua Chen;Byung-Cheol Min\",\"doi\":\"10.1109/TCDS.2024.3357618\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Human state recognition is a critical topic with pervasive and important applications in human–machine systems. Multimodal fusion, which entails integrating metrics from various data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal-based models have shown promising results, they often fall short in fully leveraging sophisticated fusion strategies essential for modeling adequate cross-modal dependencies in the fusion representation. Instead, they rely on costly and inconsistent feature crafting and alignment. To address this limitation, we propose an end-to-end multimodal transformer framework for multimodal human state recognition called \\n<italic>Husformer</i>\\n. 
Specifically, we propose using cross-modal transformers, which inspire one modality to reinforce itself through directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets [multimodal dataset for objective cognitive workload assessment on simultaneous tasks (MOCAS) and CogLoad] demonstrate that in the recognition of the human state, our \\n<italic>Husformer</i>\\n outperforms both state-of-the-art multimodal baselines and the use of a single modality by a large margin, especially when dealing with raw multimodal features. We also conducted an ablation study to show the benefits of each component in \\n<italic>Husformer</i>\\n. Experimental details and source code are available at \\n<uri>https://github.com/SMARTlab-Purdue/Husformer</uri>\\n.\",\"PeriodicalId\":54300,\"journal\":{\"name\":\"IEEE Transactions on Cognitive and Developmental Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2024-01-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Cognitive and Developmental Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10413204/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Cognitive and Developmental Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10413204/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Husformer: A Multimodal Transformer for Multimodal Human State Recognition
Human state recognition is a critical topic with pervasive and important applications in human–machine systems. Multimodal fusion, which entails integrating metrics from various data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal-based models have shown promising results, they often fall short in fully leveraging the sophisticated fusion strategies essential for modeling adequate cross-modal dependencies in the fusion representation. Instead, they rely on costly and inconsistent feature crafting and alignment. To address this limitation, we propose an end-to-end multimodal transformer framework for multimodal human state recognition called Husformer. Specifically, we propose using cross-modal transformers, which inspire one modality to reinforce itself through directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets [the multimodal dataset for objective cognitive workload assessment on simultaneous tasks (MOCAS) and CogLoad] demonstrate that, in the recognition of the human state, our Husformer outperforms both state-of-the-art multimodal baselines and the use of a single modality by a large margin, especially when dealing with raw multimodal features. We also conducted an ablation study to show the benefits of each component in Husformer. Experimental details and source code are available at https://github.com/SMARTlab-Purdue/Husformer.
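To make the fusion strategy described in the abstract more concrete, below is a minimal PyTorch-style sketch of the two stages it outlines: cross-modal attention in which one modality reinforces itself by attending to another modality, followed by a self-attention transformer over the concatenated fusion sequence. All names, dimensions, and the two-modality setup (CrossModalBlock, FusionSketch, the EEG/ECG placeholders) are illustrative assumptions for exposition, not the authors' released implementation, which is in the linked repository.

```python
# Minimal sketch of cross-modal attention fusion followed by self-attention.
# Module names and shapes are illustrative assumptions, not the Husformer code.
import torch
import torch.nn as nn


class CrossModalBlock(nn.Module):
    """One modality (query) reinforces itself by attending to another (key/value)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (batch, seq_t, dim) sequence of the modality being reinforced
        # source: (batch, seq_s, dim) sequence of the modality providing context
        attended, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + attended)  # residual connection + layer norm


class FusionSketch(nn.Module):
    """Cross-modal reinforcement of each modality, then self-attention over the fusion."""

    def __init__(self, dim: int = 64, num_heads: int = 4):
        super().__init__()
        self.cross_ab = CrossModalBlock(dim, num_heads)  # modality A attends to B
        self.cross_ba = CrossModalBlock(dim, num_heads)  # modality B attends to A
        encoder_layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(encoder_layer, num_layers=1)

    def forward(self, mod_a: torch.Tensor, mod_b: torch.Tensor) -> torch.Tensor:
        a_reinforced = self.cross_ab(mod_a, mod_b)
        b_reinforced = self.cross_ba(mod_b, mod_a)
        fused = torch.cat([a_reinforced, b_reinforced], dim=1)  # join along sequence axis
        return self.self_attn(fused)  # contextual prioritization of the fusion


# Usage with two hypothetical modality streams (e.g., EEG- and ECG-derived features):
eeg = torch.randn(8, 32, 64)  # (batch, seq_len, dim)
ecg = torch.randn(8, 16, 64)
fusion = FusionSketch()(eeg, ecg)  # -> (8, 48, 64)
```

The design choice sketched here mirrors the abstract's claim: each modality is first made aware of cross-modal interactions before a shared self-attention pass weights contextual information in the fused sequence.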
Journal Introduction:
The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.