Computer Speech and Language: Latest Articles

LRetUNet: A U-Net-based retentive network for single-channel speech enhancement
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-03-24 DOI: 10.1016/j.csl.2025.101798
Yuxuan Zhang, Zipeng Zhang, Weiwei Guo, Wei Chen, Zhaohai Liu, Houguang Liu
Abstract: Speech enhancement is an essential component of many user-oriented audio applications and a fundamental task for robust speech processing. Although numerous speech enhancement methods have shown strong performance, a notable gap persists in lightweight solutions that balance performance with computational efficiency. This paper addresses that gap by integrating a retentive mechanism within a U-Net architecture. The primary innovation is a high-frequency future filter module, which uses the Fast Fourier Transform (FFT) to improve the model's capacity to preserve and process the high-frequency information essential for speech clarity. Together with the retentive mechanism, this module enables the network to preserve essential features across layers, improving enhancement performance. The method was assessed on the DNS (Deep Noise Suppression) and VoiceBank+DEMAND datasets, widely recognized benchmarks in speech enhancement. Experimental results show that the method achieves competitive performance while maintaining relatively low computational complexity, making it particularly suitable for real-time applications, where both performance and efficiency are critical.
Citations: 0
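The abstract gives no internals for the high-frequency future filter, so the following PyTorch sketch shows only one plausible reading: re-weight the upper FFT bins of a feature map with a learnable gain. The class name HighFreqFilter, the cutoff_ratio parameter, and the per-bin gain are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class HighFreqFilter(nn.Module):
    """Hypothetical FFT-based high-frequency emphasis (illustrative only)."""
    def __init__(self, feat_dim: int, cutoff_ratio: float = 0.5):
        super().__init__()
        n_rfft = feat_dim // 2 + 1                  # bins produced by rfft
        self.cutoff = int(n_rfft * cutoff_ratio)    # start of the "high" band
        self.gain = nn.Parameter(torch.ones(n_rfft - self.cutoff))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat_dim), real-valued features
        spec = torch.fft.rfft(x, dim=-1)            # complex: (batch, time, n_rfft)
        hi = spec[..., self.cutoff:] * self.gain    # learnable high-band re-weighting
        spec = torch.cat([spec[..., :self.cutoff], hi], dim=-1)
        return torch.fft.irfft(spec, n=x.shape[-1], dim=-1)

x = torch.randn(2, 100, 64)
print(HighFreqFilter(64)(x).shape)                  # torch.Size([2, 100, 64])
```

An FFT-plus-gain module of this kind adds only O(n log n) work and a handful of parameters, which is at least consistent with the paper's lightweight aim.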
E2EPref: An end-to-end preference-based framework for speech quality assessment to alleviate bias in direct assessment scores
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-03-23 DOI: 10.1016/j.csl.2025.101799
Cheng-Hung Hu, Yusuke Yasuda, Tomoki Toda
Abstract: In speech quality assessment (SQA), direct assessment (DA) scores are frequently used as the training objective. However, because DA scores carry listener-wise bias and equal-range bias, the scores predicted by models trained on them do not always reflect true quality. In this study, we apply preference-based learning to SQA by transforming the DA score prediction framework into a preference prediction framework. Our proposed End-to-End Preference-based framework (E2EPref) predicts system-level quality scores directly and contains four components: pair generation, a preference function, threshold selection, and preference aggregation. Through these components, E2EPref mitigates the biases introduced by training directly on DA scores. Experiments show that the framework helps the SQA model alleviate biases, yielding higher system-level Spearman's rank correlation and linear correlation coefficients. We also evaluate its quality prediction capability in a zero-shot out-of-domain scenario. Finally, we collect subjective preference scores on a dataset that already contains DA scores and analyze the advantages and disadvantages of using DA scores versus subjective preference scores as the ground truth or for model training.
Citations: 0
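As a hedged sketch of the pair-generation and threshold-selection ideas, the snippet below turns DA (MOS) scores into binary preference labels and skips near-ties under a threshold. The function name, threshold rule, and toy scores are assumptions; the paper's learned preference function and preference aggregation are not reproduced here.

```python
from itertools import combinations

def make_preference_pairs(scores: dict[str, float], threshold: float = 0.1):
    """Convert direct-assessment scores into preference pairs (illustrative)."""
    pairs = []
    for (a, sa), (b, sb) in combinations(scores.items(), 2):
        if abs(sa - sb) <= threshold:              # ambiguous pair: skip it
            continue
        pairs.append((a, b, 1 if sa > sb else 0))  # label 1: a preferred over b
    return pairs

mos = {"sysA": 3.8, "sysB": 3.2, "sysC": 3.75}
print(make_preference_pairs(mos))
# [('sysA', 'sysB', 1), ('sysB', 'sysC', 0)]  (sysA vs sysC skipped as a near-tie)
```

Because only the ordering of each pair survives, any per-listener offset or range compression in the raw DA scores cancels out, which is the bias-mitigation intuition the abstract describes.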
Summary of the NOTSOFAR-1 challenge: Highlights and learnings
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-03-16 DOI: 10.1016/j.csl.2025.101796
Igor Abramovski, Alon Vinnikov, Shalev Shaer, Naoyuki Kanda, Xiaofei Wang, Amir Ivry, Eyal Krupka
Abstract: The first Natural Office Talkers in Settings of Far-field Audio Recordings (NOTSOFAR-1) Challenge is a pivotal initiative that sets new benchmarks by offering datasets more representative of real-world business applications than those previously available. The challenge provides a unique combination of 315 recorded meetings across 30 diverse environments, capturing real-world acoustic conditions and conversational dynamics, and a 1000-hour simulated training dataset, synthesized with enhanced authenticity for real-world generalization and incorporating 15,000 real acoustic transfer functions. In this paper, we provide an overview of the systems submitted to the challenge and analyze the top-performing approaches, hypothesizing the factors behind their success. Additionally, we highlight promising directions left unexplored by participants. By presenting key findings and actionable insights, this work aims to drive further innovation and progress in DASR research and applications.
Citations: 0
DDP-Unet: A mapping neural network for single-channel speech enhancement
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-03-13 DOI: 10.1016/j.csl.2025.101795
Haoxiang Chen, Yanyan Xu, Dengfeng Ke, Kaile Su
Abstract: For speech enhancement tasks, spectrum utilization in the time–frequency domain is crucial, as it improves audio feature extraction while reducing computational cost. Among current time–frequency-domain methods, DenseBlock and the dual-path transformer have shown promising results. In this paper, we optimize these two modules and propose a novel mapping neural network, DDP-Unet, comprising three components: an encoder, a decoder, and a bottleneck. First, we introduce a lightweight module, the depth-point convolutional layer (DPCL), which employs point-wise and depth-wise convolutions. DPCL is integrated into our novel DCdenseBlock, expanding DenseBlock's receptive field and enhancing feature fusion in the encoder and decoder stages. Additionally, to increase the breadth and depth of feature fusion in the dual-path transformer, we implement a deep dual-path transformer as the bottleneck. DDP-Unet is evaluated on two public datasets, VCTK+DEMAND and DNS Challenge 2020. Experimental results demonstrate that DDP-Unet outperforms most existing models, achieving state-of-the-art performance on the STOI, PESQ, and SI-SDR metrics.
Citations: 0
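The name "depth-point convolutional layer" suggests the standard depthwise-plus-pointwise factorization; below is a minimal PyTorch sketch under that assumption (the class name, kernel size, and layout are illustrative, not the paper's exact configuration).

```python
import torch
import torch.nn as nn

class DepthPointConv(nn.Module):
    """Depthwise then pointwise convolution (a sketch of a DPCL-style layer)."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3):
        super().__init__()
        # depthwise: one spatial filter per input channel (groups=in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # pointwise: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 16, 64, 64)            # (batch, channels, freq, time)
print(DepthPointConv(16, 32)(x).shape)    # torch.Size([1, 32, 64, 64])
```

The factorization reduces the weight count from in_ch*out_ch*k^2 for a full convolution to in_ch*k^2 + in_ch*out_ch, which is the usual source of the "lightweight" claim.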
A novel Adaptive Kolmogorov Arnold Sparse Masked Attention Model with multi-loss optimization for Acoustic Echo Cancellation in double-talk noisy scenario
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-03-06 DOI: 10.1016/j.csl.2025.101786
Soni Ishwarya V., Mohanaprasad K.
Abstract: In recent years, deep learning techniques have become the predominant approach for Acoustic Echo Cancellation (AEC), owing to their capacity to model complex, nonlinear patterns. This paper presents a novel Adaptive Kolmogorov Arnold Network-based Sparse Masked Attention Model (KASMA-LossNet) with multi-loss optimization, inspired by the Kolmogorov Arnold representation theorem. The model is designed to capture complex nonlinear patterns, improving speech quality and echo cancellation effectiveness while reducing computational load. It simplifies complex nonlinear multivariate functions into univariate representations, which is crucial for handling the intricate nonlinear aspects of echo. The KAN-based attention module captures dense speech patterns, analyzes the relationships between the echo, the noise, and the target signal, identifies long-range dependencies within the signal, assigns weight scores based on task relevance, and offers the flexibility to adapt to diverse acoustic conditions. To improve training efficiency, three losses (Smooth-L1 loss, magnitude loss, and log spectral distance (LSD) loss) are combined and integrated into the model, accelerating convergence and delivering more precise results. The proposed model was implemented and tested, demonstrating notable improvements in echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ). The reduced computational load of the proposed system is demonstrated through steady GPU utilization and shorter convergence time.
Citations: 0
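One plausible reading of the three-term objective is a weighted sum of time-domain Smooth-L1, an L1 magnitude-spectrogram term, and log spectral distance. The sketch below implements that reading; the weights, STFT settings, and function name are assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def multi_loss(est: torch.Tensor, ref: torch.Tensor,
               w=(1.0, 1.0, 0.1), n_fft: int = 512, eps: float = 1e-8):
    # est, ref: (batch, samples) time-domain signals
    win = torch.hann_window(n_fft, device=est.device)
    E = torch.stft(est, n_fft, window=win, return_complex=True).abs()
    R = torch.stft(ref, n_fft, window=win, return_complex=True).abs()
    l_sl1 = F.smooth_l1_loss(est, ref)             # time-domain Smooth-L1
    l_mag = (E - R).abs().mean()                   # magnitude-spectrogram L1
    l_lsd = torch.sqrt(                            # log spectral distance (dB)
        ((20 * torch.log10((E + eps) / (R + eps))) ** 2).mean(dim=1)
    ).mean()
    return w[0] * l_sl1 + w[1] * l_mag + w[2] * l_lsd

est, ref = torch.randn(2, 16000), torch.randn(2, 16000)
print(multi_loss(est, ref).item())
```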
A bias evaluation solution for multiple sensitive attribute speech recognition
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-03-04 DOI: 10.1016/j.csl.2025.101787
Zigang Chen, Yuening Zhou, Zhen Wang, Fan Liu, Tao Leng, Haihua Zhu
Abstract: Speech recognition systems are a pervasive application of AI (Artificial Intelligence) and bring significant benefits to society, yet they also face significant fairness issues. When dealing with groups of people with different sensitive attributes, these systems tend to exhibit bias, which may lead to the voices of specific groups being misinterpreted or ignored. Addressing fairness therefore requires comprehensively revealing the presence of bias in AI systems. To address the limited attribute categories and data imbalance of existing bias evaluation datasets, we propose a new method for constructing evaluation datasets. Because existing AI bias evaluation methods are not directly applicable to the particular characteristics of speech recognition systems, we introduce a bias evaluation method for speech recognition based on WER (Word Error Rate). To comprehensively quantify bias across different groups, we combine multiple evaluation metrics, including WER, fairness metrics, and CMBM (confusion-matrix-based metrics). Experiments were conducted on both single and cross-sensitive attributes. The results indicate that, for single sensitive attributes, the speech recognition system exhibits the most significant racial bias, while in the cross-sensitive-attribute evaluation, it shows the greatest bias against white males and black males. Finally, t-tests demonstrate that the WER differences between these two groups are statistically significant.
Citations: 0
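The core measurement the paper builds on, WER computed per sensitive-attribute group and then compared, can be sketched in a few lines. jiwer is a common third-party WER library; the group names and utterances below are illustrative, and the paper's full fairness and confusion-matrix-based metrics are not reproduced.

```python
import jiwer  # pip install jiwer

groups = {
    "group_a": {"refs": ["turn the lights on"], "hyps": ["turn the light on"]},
    "group_b": {"refs": ["play some music"],    "hyps": ["play some music"]},
}

# WER within each group, then the largest between-group gap as a bias signal
wers = {g: jiwer.wer(d["refs"], d["hyps"]) for g, d in groups.items()}
gap = max(wers.values()) - min(wers.values())
print(wers, f"WER gap: {gap:.3f}")   # {'group_a': 0.25, 'group_b': 0.0} WER gap: 0.250
```

On real data one would also run a significance test over per-utterance error rates, as the paper does with t-tests, before calling a gap evidence of bias.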
GenCeption: Evaluate vision LLMs with unlabeled unimodal data
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-02-28 DOI: 10.1016/j.csl.2025.101785
Lele Cao, Valentin Buchner, Zineb Senane, Fangkai Yang
Abstract: Multimodal Large Language Models (MLLMs) are typically assessed using expensive annotated multimodal benchmarks, which often lag behind the rapidly evolving demands of MLLM evaluation. This paper outlines and validates GenCeption, a novel, annotation-free evaluation method that requires only unimodal data to measure inter-modality semantic coherence and, inversely, to assess MLLMs' tendency to hallucinate. This approach eliminates the need for costly data annotation, minimizes the risk of training-data contamination, is expected to result in slower benchmark saturation, and avoids the illusion of emerging abilities. Inspired by the DrawCeption game, GenCeption begins with a non-textual sample and proceeds through iterative description and generation steps; the semantic drift across iterations is quantified using the GC@T metric. While GenCeption is in principle applicable to MLLMs across various modalities, this paper focuses on its implementation and validation for Vision LLMs (VLLMs). Based on the GenCeption method, we establish the MMECeption benchmark for evaluating VLLMs and compare the performance of several popular VLLMs and human annotators. Our empirical results validate GenCeption's effectiveness, demonstrating strong correlations with established VLLM benchmarks. VLLMs still lag significantly behind human performance and struggle especially with text-intensive tasks.
Citations: 0
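A skeleton of the iterative describe-and-regenerate loop and a GC@T-style drift score is shown below. describe, generate, and embed are stand-ins for a VLLM captioner, an image generator, and an image encoder, and the exact GC@T weighting in the paper may differ; here each iteration's cosine similarity to the seed is simply averaged.

```python
import numpy as np

def gc_at_t(seed, describe, generate, embed, T: int = 3) -> float:
    ref, sample, sims = embed(seed), seed, []
    for _ in range(T):
        sample = generate(describe(sample))   # caption, then regenerate from text
        e = embed(sample)
        sims.append(float(np.dot(ref, e)
                          / (np.linalg.norm(ref) * np.linalg.norm(e))))
    return float(np.mean(sims))               # higher = less semantic drift

# Toy stand-ins so the loop runs end to end; real use plugs in actual models.
rng = np.random.default_rng(0)
stub = lambda _: rng.normal(size=8)
print(gc_at_t(rng.normal(size=8), describe=stub, generate=stub, embed=lambda v: v))
```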
LSRD-Net: A fine-grained sentiment analysis method based on log-normalized semantic relative distance
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-02-25 DOI: 10.1016/j.csl.2025.101782
Liming Zhou, Xiaowei Xu, Xiaodong Wang
Abstract: With the development of AI technology and growing application demands, fine-grained sentiment analysis is gradually replacing coarse-grained, sentence- or document-level sentiment analysis. However, most existing fine-grained (i.e., aspect-based) sentiment analysis relies heavily on the traditional attention mechanism and does not incorporate prior knowledge to assist aspect-sentiment focusing, ignoring the importance of aligning aspect terms with sentiment information. Considering the linguistic conventions of emotional expression, we therefore propose a Log-SRD-based neural network model, LSRD-Net, which aims to improve the recognition accuracy and alignment efficiency of aspect terms and sentiment tendencies. The model normalizes the semantic relative distance (SRD) matrix with a logarithmic function and introduces the optimized matrix into the attention computation, injecting prior knowledge, and improves the alignment of aspect terms and sentiment information through an improved cross-attention mechanism. To validate LSRD-Net, we conduct several comparative and ablation experiments on four fine-grained sentiment analysis datasets. The results demonstrate that LSRD-Net achieves state-of-the-art performance.
Citations: 0
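The log-normalization step and its use inside attention can be illustrated with an additive bias: distances to the aspect term are squashed with a logarithm and added to the attention scores before the softmax. The distance definition and the way the paper fuses the matrix into its cross-attention are assumptions in this sketch.

```python
import torch

def log_srd_bias(srd: torch.Tensor) -> torch.Tensor:
    # srd: (seq, seq) non-negative semantic relative distances
    return -torch.log1p(srd)   # near tokens: bias ~0; far tokens: strongly negative

seq = 6
pos = torch.arange(seq, dtype=torch.float)
srd = (pos[None, :] - pos[:, None]).abs()            # toy distance matrix
scores = torch.randn(seq, seq) + log_srd_bias(srd)   # bias added before softmax
attn = torch.softmax(scores, dim=-1)
print(attn.sum(dim=-1))                              # rows still sum to 1
```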
MISS: Multiple information span scoring for Chinese named entity recognition
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-02-22 DOI: 10.1016/j.csl.2025.101783
Liyi Yang, Shuli Xing, Guojun Mao
Abstract: Named entity recognition (NER) has drawn much attention from researchers. In Chinese text, characters carry rich contextual and regularity-based information. Most previous work on Chinese NER mines the boundary features of phrase spans yet neglects the token information within spans and the relationships between adjacent spans, which leads to insufficient feature representations and limits model performance. In this study, we construct a span-based NER model named MISS (Multiple Information Span Scoring). The model consists of two major modules: (1) a span extractor for type-independent entity extraction, which introduces relative position information into the sequence representations; and (2) a span classifier that fuses boundary and internal information into span representations for enhanced span scoring. The span classifier also employs a convolutional layer for cross-span interaction, which rectifies the classification scores. Entity predictions are decoded from the sum of the scores computed by the two modules. Our method is simple and effective: without any external resources, MISS achieves considerable improvements on four benchmark datasets, and ablation experiments demonstrate the effectiveness of each component.
Citations: 0
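A minimal sketch of span-based scoring in the spirit of the abstract: enumerate spans up to a maximum width and represent each span by its boundary tokens plus a mean over its internal tokens. The pooling choice, scorer, and dimensions are illustrative assumptions; the paper's two-module decoding and cross-span convolution are not reproduced.

```python
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    """Scores every span from [start; end; mean-inside] features (a sketch)."""
    def __init__(self, hidden: int, n_types: int):
        super().__init__()
        self.score = nn.Linear(3 * hidden, n_types)

    def forward(self, h: torch.Tensor, max_width: int = 4):
        # h: (seq, hidden) token encodings; returns {(i, j): type logits}
        out = {}
        for i in range(h.size(0)):
            for j in range(i, min(i + max_width, h.size(0))):
                rep = torch.cat([h[i], h[j], h[i:j + 1].mean(dim=0)])
                out[(i, j)] = self.score(rep)
        return out

h = torch.randn(8, 32)
spans = SpanScorer(hidden=32, n_types=5)(h)
print(len(spans), spans[(0, 3)].shape)   # 26 torch.Size([5])
```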
Identifying offensive memes in low-resource languages: A multi-modal multi-task approach using valence and arousal
IF 3.1 · CAS Tier 3 (Computer Science)
Computer Speech and Language Pub Date: 2025-02-20 DOI: 10.1016/j.csl.2025.101781
Gitanjali Kumari, Dibyanayan Bandyopadhyay, Asif Ekbal, Arindam Chatterjee, Vinutha B.N.
Abstract: Social media platforms such as Facebook, Twitter, and Instagram provide a revolutionary communication platform with unrestricted expression, but this has also led to the propagation of offensive and abusive content, cyberbullying, and harassment. Memes, a popular form of multimodal media, have grown exponentially and are often used to spread objectionable content through dark humor. In this paper, we propose a multi-task, multi-modal framework for identifying offensive Hindi memes that leverages the auxiliary tasks of valence and arousal prediction to improve model performance. This approach yields a more nuanced understanding of offensive memes and outperforms unimodal models that consider only one modality. To facilitate future research, we present a new Hindi corpus, OffVA, containing 7,646 Hindi memes annotated with offensiveness, valence, and arousal labels; it is the first dataset of its kind for Hindi and can serve as a benchmark for future research on detecting offensive content in Hindi memes. We also emphasize the importance of incorporating high-resource-language datasets, such as English, to improve performance when identifying offensive memes in low-resource languages. Experimental results on this dataset demonstrate that the proposed framework outperforms unimodal models and that incorporating valence and arousal as auxiliary tasks yields better results, highlighting the importance of considering multiple modalities and tasks for effective offensiveness detection in memes.
Citations: 0
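The multi-task structure the abstract describes, a main offensiveness head plus auxiliary valence and arousal heads over a shared multimodal representation, can be sketched as follows; the feature dimension, loss weights, and head shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHead(nn.Module):
    """Main offense classifier with valence/arousal auxiliaries (a sketch)."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.offense = nn.Linear(dim, 2)   # main task: offensive vs. not
        self.valence = nn.Linear(dim, 1)   # auxiliary regression heads
        self.arousal = nn.Linear(dim, 1)

    def forward(self, fused):              # fused: (batch, dim) text+image features
        return self.offense(fused), self.valence(fused), self.arousal(fused)

model = MultiTaskHead()
fused = torch.randn(4, 128)                # stand-in for fused meme features
y_off = torch.randint(0, 2, (4,))
y_val, y_aro = torch.rand(4, 1), torch.rand(4, 1)
off, val, aro = model(fused)
loss = (F.cross_entropy(off, y_off)
        + 0.5 * F.mse_loss(val, y_val)     # auxiliary weights: assumed values
        + 0.5 * F.mse_loss(aro, y_aro))
print(loss.item())
```

Sharing the encoder while backpropagating the auxiliary affect losses is what lets valence and arousal act as extra supervision for the main offensiveness task.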