MDCNN: A multimodal dual-CNN recursive model for fake news detection via audio- and text-based speech emotion recognition

Impact Factor 3.0 · CAS Tier 3 (Computer Science) · JCR Q2 (Acoustics)
Hongchen Wu, Hongxuan Li, Xiaochang Fang, Mengqi Tang, Hongzhu Yu, Bing Yu, Meng Li, Zhaorong Jing, Yihong Meng, Wei Chen, Yu Liu, Chenfei Sun, Shuang Gao, Huaxiang Zhang
Journal: Speech Communication, Volume 175, Article 103313
DOI: 10.1016/j.specom.2025.103313
Publication date: 2025-09-24 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0167639325001281
Citations: 0

Abstract

The increasing complexity and diversity of emotional expression pose challenges when identifying fake news conveyed through text and audio formats. Integrating emotional cues derived from data offers a promising approach for balancing the tradeoff between the volume and quality of data. Leveraging recent advancements in speech emotion recognition (SER), our study proposes a Multimodal Recursive Dual-Convolutional Neural Network Model (MDCNN) for fake news detection, with a focus on sentiment analysis based on audio and text. Our proposed model employs convolutional layers to extract features from both audio and text inputs, facilitating an effective feature fusion process for sentiment classification. Through a deep bidirectional recursive encoder, the model can better understand audio and text features for determining the final emotional category. Experiments conducted on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset, which contains 5531 samples across four emotion types—anger, happiness, neutrality, and sadness—demonstrate the superior performance of the MDCNN. Its weighted average precision (WAP) is 78.8%, which is 2.5% higher than that of the best baseline. Compared with existing sentiment analysis models, our approach exhibits notable enhancements in accurately detecting the neutral category, thereby addressing a common challenge faced by prior models. These findings underscore the efficacy of the MDCNN in multimodal sentiment analysis tasks and its significant achievements in neutral-category classification, offering a robust solution for precisely detecting fake news and conducting nuanced emotional analyses in speech recognition scenarios.
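The abstract reports results as weighted average precision (WAP): per-class precision averaged with weights proportional to each class's support. As a minimal sketch of how that metric is computed, the following pure-Python function assumes the standard support-weighted definition; the label lists below are illustrative toy data, not the paper's actual IEMOCAP predictions.

```python
from collections import Counter

def weighted_average_precision(y_true, y_pred):
    """Average per-class precision, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    wap = 0.0
    for cls, n in support.items():
        predicted = sum(1 for p in y_pred if p == cls)          # predictions for this class
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        precision = correct / predicted if predicted else 0.0   # precision for this class
        wap += (n / total) * precision                          # weight by class support
    return wap

# Toy example over the four IEMOCAP emotion labels
y_true = ["anger", "anger", "happiness", "neutrality", "sadness", "sadness"]
y_pred = ["anger", "happiness", "happiness", "neutrality", "sadness", "anger"]
print(round(weighted_average_precision(y_true, y_pred), 3))  # → 0.75
```

This matches scikit-learn's `precision_score(..., average='weighted')`; a pure-Python version is shown only to make the weighting explicit.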
Source journal: Speech Communication (Engineering & Technology – Computer Science: Interdisciplinary Applications)
CiteScore: 6.80
Self-citation rate: 6.20%
Annual articles: 94
Review time: 19.2 weeks
Journal description: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.