Multimodal Consistency-Based Teacher for Semi-Supervised Multimodal Sentiment Analysis

IF 5.1 2区计算机科学 Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-07-18 DOI:10.1109/TASLP.2024.3430543

Ziqi Yuan;Jingliang Fang;Hua Xu;Kai Gao

{"title":"Multimodal Consistency-Based Teacher for Semi-Supervised Multimodal Sentiment Analysis","authors":"Ziqi Yuan;Jingliang Fang;Hua Xu;Kai Gao","doi":"10.1109/TASLP.2024.3430543","DOIUrl":null,"url":null,"abstract":"Multimodal sentiment analysis holds significant importance within the realm of human-computer interaction. Due to the ease of collecting unlabeled online resources compared to the high costs associated with annotation, it becomes imperative for researchers to develop semi-supervised methods that leverage unlabeled data to enhance model performance. Existing semi-supervised approaches, particularly those applied to trivial image classification tasks, are not suitable for multimodal regression tasks due to their reliance on task-specific augmentation and thresholds designed for classification tasks. To address this limitation, we propose the Multimodal Consistency-based Teacher (MC-Teacher), which incorporates consistency-based pseudo-label technique into semi-supervised multimodal sentiment analysis. In our approach, we first propose synergistic consistency assumption which focus on the consistency among bimodal representation. Building upon this assumption, we develop a learnable filter network that autonomously learns how to identify misleading instances instead of threshold-based methods. This is achieved by leveraging both the implicit discriminant consistency on unlabeled instances and the explicit guidance on constructed training data with labeled instances. Additionally, we design the self-adaptive exponential moving average strategy to decouple the student and teacher networks, utilizing a heuristic momentum coefficient. Through both quantitative and qualitative experiments on two benchmark datasets, we demonstrate the outstanding performances of the proposed MC-Teacher approach. Furthermore, detailed analysis experiments and case studies are provided for each crucial component to intuitively elucidate the inner mechanism and further validate their effectiveness.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3669-3683"},"PeriodicalIF":5.1000,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10603417/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Multimodal sentiment analysis holds significant importance within the realm of human-computer interaction. Due to the ease of collecting unlabeled online resources compared to the high costs associated with annotation, it becomes imperative for researchers to develop semi-supervised methods that leverage unlabeled data to enhance model performance. Existing semi-supervised approaches, particularly those applied to trivial image classification tasks, are not suitable for multimodal regression tasks due to their reliance on task-specific augmentation and thresholds designed for classification tasks. To address this limitation, we propose the Multimodal Consistency-based Teacher (MC-Teacher), which incorporates consistency-based pseudo-label technique into semi-supervised multimodal sentiment analysis. In our approach, we first propose synergistic consistency assumption which focus on the consistency among bimodal representation. Building upon this assumption, we develop a learnable filter network that autonomously learns how to identify misleading instances instead of threshold-based methods. This is achieved by leveraging both the implicit discriminant consistency on unlabeled instances and the explicit guidance on constructed training data with labeled instances. Additionally, we design the self-adaptive exponential moving average strategy to decouple the student and teacher networks, utilizing a heuristic momentum coefficient. Through both quantitative and qualitative experiments on two benchmark datasets, we demonstrate the outstanding performances of the proposed MC-Teacher approach. Furthermore, detailed analysis experiments and case studies are provided for each crucial component to intuitively elucidate the inner mechanism and further validate their effectiveness.

查看原文本刊更多论文

基于多模态一致性的半监督多模态情感分析教师

多模态情感分析在人机交互领域具有重要意义。与标注所需的高昂成本相比，收集未标注的在线资源非常容易，因此研究人员必须开发半监督方法，利用未标注数据来提高模型性能。现有的半监督方法，尤其是那些应用于琐碎图像分类任务的方法，由于依赖于特定任务的增强和为分类任务设计的阈值，并不适用于多模态回归任务。为了解决这一局限性，我们提出了基于一致性的多模态教师（Multimodal Consistency-based Teacher，MC-Teacher），它将基于一致性的伪标签技术融入到半监督多模态情感分析中。在我们的方法中，我们首先提出了协同一致性假设，重点关注双模态表征之间的一致性。在此假设的基础上，我们开发了一种可学习的过滤网络，它能自主学习如何识别误导性实例，而不是基于阈值的方法。这是通过利用未标注实例的隐式判别一致性和带有标注实例的构建训练数据的显式指导来实现的。此外，我们还利用启发式动量系数设计了自适应指数移动平均策略，以解耦学生和教师网络。通过在两个基准数据集上进行定量和定性实验，我们证明了所提出的 MC-Teacher 方法的卓越性能。此外，我们还为每个关键组件提供了详细的分析实验和案例研究，以直观地阐明其内在机制并进一步验证其有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.