{"title":"利用 CNN 增强变换器进行条件选择,实现多模态情感分析","authors":"Jianwen Wang, Shiping Wang, Shunxin Xiao, Renjie Lin, Mianxiong Dong, Wenzhong Guo","doi":"10.1049/cit2.12320","DOIUrl":null,"url":null,"abstract":"<p>Attention mechanism has been a successful method for multimodal affective analysis in recent years. Despite the advances, several significant challenges remain in fusing language and its nonverbal context information. One is to generate sparse attention coefficients associated with acoustic and visual modalities, which helps locate critical emotional semantics. The other is fusing complementary cross-modal representation to construct optimal salient feature combinations of multiple modalities. A Conditional Transformer Fusion Network is proposed to handle these problems. Firstly, the authors equip the transformer module with CNN layers to enhance the detection of subtle signal patterns in nonverbal sequences. Secondly, sentiment words are utilised as context conditions to guide the computation of cross-modal attention. As a result, the located nonverbal features are not only salient but also complementary to sentiment words directly. Experimental results show that the authors’ method achieves state-of-the-art performance on several multimodal affective analysis datasets.</p>","PeriodicalId":46211,"journal":{"name":"CAAI Transactions on Intelligence Technology","volume":"9 4","pages":"917-931"},"PeriodicalIF":8.4000,"publicationDate":"2024-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12320","citationCount":"0","resultStr":"{\"title\":\"Conditional selection with CNN augmented transformer for multimodal affective analysis\",\"authors\":\"Jianwen Wang, Shiping Wang, Shunxin Xiao, Renjie Lin, Mianxiong Dong, Wenzhong Guo\",\"doi\":\"10.1049/cit2.12320\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Attention mechanism has been a successful method for multimodal affective analysis in recent years. Despite the advances, several significant challenges remain in fusing language and its nonverbal context information. One is to generate sparse attention coefficients associated with acoustic and visual modalities, which helps locate critical emotional semantics. The other is fusing complementary cross-modal representation to construct optimal salient feature combinations of multiple modalities. A Conditional Transformer Fusion Network is proposed to handle these problems. Firstly, the authors equip the transformer module with CNN layers to enhance the detection of subtle signal patterns in nonverbal sequences. Secondly, sentiment words are utilised as context conditions to guide the computation of cross-modal attention. As a result, the located nonverbal features are not only salient but also complementary to sentiment words directly. Experimental results show that the authors’ method achieves state-of-the-art performance on several multimodal affective analysis datasets.</p>\",\"PeriodicalId\":46211,\"journal\":{\"name\":\"CAAI Transactions on Intelligence Technology\",\"volume\":\"9 4\",\"pages\":\"917-931\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2024-03-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cit2.12320\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"CAAI Transactions on Intelligence Technology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1049/cit2.12320\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"CAAI Transactions on Intelligence Technology","FirstCategoryId":"94","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1049/cit2.12320","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Conditional selection with CNN augmented transformer for multimodal affective analysis
Attention mechanism has been a successful method for multimodal affective analysis in recent years. Despite the advances, several significant challenges remain in fusing language and its nonverbal context information. One is to generate sparse attention coefficients associated with acoustic and visual modalities, which helps locate critical emotional semantics. The other is fusing complementary cross-modal representation to construct optimal salient feature combinations of multiple modalities. A Conditional Transformer Fusion Network is proposed to handle these problems. Firstly, the authors equip the transformer module with CNN layers to enhance the detection of subtle signal patterns in nonverbal sequences. Secondly, sentiment words are utilised as context conditions to guide the computation of cross-modal attention. As a result, the located nonverbal features are not only salient but also complementary to sentiment words directly. Experimental results show that the authors’ method achieves state-of-the-art performance on several multimodal affective analysis datasets.
期刊介绍:
CAAI Transactions on Intelligence Technology is a leading venue for original research on the theoretical and experimental aspects of artificial intelligence technology. We are a fully open access journal co-published by the Institution of Engineering and Technology (IET) and the Chinese Association for Artificial Intelligence (CAAI) providing research which is openly accessible to read and share worldwide.