{"title":"Multilingual non-intrusive binaural intelligibility prediction based on phone classification","authors":"Jana Roßbach , Kirsten C. Wagener , Bernd T. Meyer","doi":"10.1016/j.csl.2024.101684","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101684","url":null,"abstract":"<div><p>Speech intelligibility (SI) prediction models are a valuable tool for the development of speech processing algorithms for hearing aids or consumer electronics. For use in realistic environments, it is desirable that an SI model be non-intrusive (requiring neither separate input of original and degraded speech, nor transcripts or <em>a-priori</em> knowledge about the signals) and perform binaural processing of the audio signals. Most existing SI models do not fulfill all of these criteria. In this study, we propose an SI model based on phone probabilities obtained from a deep neural net. The model comprises a binaural enhancement stage for prediction of the speech recognition threshold (SRT) in realistic acoustic scenes. In the first part of the study, SRT predictions in different spatial configurations are compared to results from normal-hearing listeners. On average, our approach produces lower errors and higher correlations than three intrusive baseline models. In the second part, we explore whether measures relevant to spatial hearing, i.e., the intelligibility level difference (ILD) and the binaural ILD (BILD), can be predicted with our modeling approach. We also investigate whether a language mismatch between training and testing the model plays a role when predicting ILD and BILD. This point is especially important for low-resource languages, for which thousands of hours of language material are not available for training. Binaural benefits are predicted by our model with an error of 1.5 dB. This is slightly higher than the error of the competitive baseline MBSTOI (1.1 dB), but our model does not require separate input of original and degraded speech. We also find that good binaural predictions can be obtained with models that are not specifically trained on the target language.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101684"},"PeriodicalIF":3.1,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000676/pdfft?md5=2480b19144d8254f73d5748237f56388&pid=1-s2.0-S0885230824000676-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141592967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural multi-task learning for end-to-end Arabic aspect-based sentiment analysis","authors":"Rajae Bensoltane, Taher Zaki","doi":"10.1016/j.csl.2024.101683","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101683","url":null,"abstract":"<div><p>Most existing aspect-based sentiment analysis (ABSA) methods perform the tasks of aspect extraction and sentiment classification independently, assuming that the aspect terms are already determined when handling the aspect sentiment classification task. However, such settings are neither practical nor appropriate in real-life applications, as aspects must be extracted prior to sentiment classification. This study aims to overcome this shortcoming by jointly identifying aspect terms and their corresponding sentiments using a multi-task learning approach based on a unified tagging scheme. The proposed model uses the Bidirectional Encoder Representations from Transformers (BERT) model to produce the input representations, followed by a Bidirectional Gated Recurrent Unit (BiGRU) layer for further contextual and semantic encoding. An attention layer is added on top of the BiGRU to force the model to focus on the important parts of the sentence. Finally, a Conditional Random Fields (CRF) layer is used to handle inter-label dependencies. Experiments conducted on a reference Arabic hotel dataset show that the proposed model significantly outperforms the baseline and related-work models.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101683"},"PeriodicalIF":3.1,"publicationDate":"2024-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000664/pdfft?md5=5af89b8ac3b7169819a4f2bf2d9a12ff&pid=1-s2.0-S0885230824000664-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141483685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Misogynistic attitude detection in YouTube comments and replies: A high-quality dataset and algorithmic models","authors":"Aakash Singh , Deepawali Sharma , Vivek Kumar Singh","doi":"10.1016/j.csl.2024.101682","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101682","url":null,"abstract":"<div><p>Social media platforms are now not only a medium for expressing users' views, feelings, emotions and sentiments but are also abused by people to propagate unpleasant and hateful content. Consequently, research efforts have been made to develop techniques and models for automatically detecting and identifying hateful, abusive, vulgar, and offensive content on different platforms. Although significant progress has been made on the task, research on methods to detect misogynistic attitudes in non-English and code-mixed languages is much less developed, largely because suitable datasets and resources are unavailable. This paper therefore attempts to bridge this research gap by presenting a high-quality curated dataset in the Hindi-English code-mixed language. The dataset includes 12,698 YouTube comments and replies, with each comment annotated at two levels: first as optimistic or pessimistic, and then into different types at the second level based on the content. The inter-annotator agreement is 0.84 for the first subtask and 0.79 for the second, indicating the reasonably high quality of the annotations. Different algorithmic models are explored for automatic detection of the misogynistic attitude expressed in the comments, with the mBERT model giving the best performance on both subtasks (macro average F1 scores of 0.59 and 0.52, and weighted average F1 scores of 0.66 and 0.65, respectively). The analysis and results suggest that the dataset can be used for further research on the topic and that the developed algorithmic models can be applied for automatic detection of misogynistic attitude in social media conversations and posts.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101682"},"PeriodicalIF":3.1,"publicationDate":"2024-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000652/pdfft?md5=1fb50b1ad09f16299853e9624ad9718d&pid=1-s2.0-S0885230824000652-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141483686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Turkish Coreference Resolution: Insights from deep learning, dropped pronouns, and multilingual transfer learning","authors":"Tuğba Pamay Arslan, Gülşen Eryiğit","doi":"10.1016/j.csl.2024.101681","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101681","url":null,"abstract":"<div><p>Coreference resolution (CR), the identification of in-text mentions that refer to the same entity, is a crucial step in natural language understanding. While CR in English has been studied for quite a long time, research on pro-drop and morphologically rich languages is an active area that has yet to reach sufficient maturity. Turkish, a morphologically highly rich language, poses interesting challenges for natural language processing tasks, including CR, due to its agglutinative nature and the consequent pronoun-dropping phenomenon. This article explores the use of different neural CR architectures (i.e., mention-pair, mention-ranking, and end-to-end) on Turkish by formulating multiple research questions around the impacts of dropped pronouns, data quality, and interlingual transfer. Our exploration of these questions yielded the first Turkish CR dataset that includes dropped-pronoun annotations (4K entities/22K mentions), new state-of-the-art results on Turkish CR, the first neural end-to-end Turkish CR results (70.4% F-score), the first multilingual end-to-end CR results including Turkish (a 1.0 percentage point improvement on Turkish), and the first demonstration in the literature of the positive impact of dropped pronouns on CR for pro-drop and morphologically rich languages. Our research has brought Turkish end-to-end CR performance (72.0% F-score) to levels similar to those of other languages, surpassing the baseline scores by 32.1 percentage points.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101681"},"PeriodicalIF":3.1,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000640/pdfft?md5=75cd60c63807520ee823be3bbb1025ae&pid=1-s2.0-S0885230824000640-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141444378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quality achhi hai (is good), satisfied! Towards aspect based sentiment analysis in code-mixed language","authors":"Mamta , Asif Ekbal","doi":"10.1016/j.csl.2024.101668","DOIUrl":"10.1016/j.csl.2024.101668","url":null,"abstract":"<div><p>Social media, e-commerce, and other online platforms have witnessed tremendous growth in multilingual users. This requires addressing the code-mixing phenomenon, i.e., the mixing of more than one language to provide a rich native user experience. User reviews and comments may benefit service providers in terms of customer management. Aspect-based Sentiment Analysis (ABSA) provides a fine-grained analysis of these reviews by identifying the aspects mentioned and classifying their polarities (i.e., positive, negative, neutral, and conflict). Research in this direction has mainly focused on resource-rich monolingual languages like English, which does not suffice for analyzing multilingual code-mixed reviews. In this paper, we introduce a new task to facilitate research on code-mixed ABSA. We offer a benchmark setup by creating a code-mixed Hinglish (i.e., a mix of Hindi and English) dataset for ABSA, annotated with aspect terms and their sentiment values. To demonstrate the effective usage of the dataset, we develop several deep learning based models for aspect term extraction and sentiment analysis, and establish them as baselines for further research in this direction.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101668"},"PeriodicalIF":4.3,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000512/pdfft?md5=d4cf7f510d6f46e21b19e99b8421ebc3&pid=1-s2.0-S0885230824000512-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141399023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TadaStride: Using time adaptive strides in audio data for effective downsampling","authors":"Yoonhyung Lee , Kyomin Jung","doi":"10.1016/j.csl.2024.101678","DOIUrl":"10.1016/j.csl.2024.101678","url":null,"abstract":"<div><p>In this paper, we introduce TadaStride, a new downsampling method for audio data that can adaptively adjust the downsampling ratio across an audio data instance. Unlike previous methods that use a fixed downsampling ratio, TadaStride preserves more information from task-relevant parts of a data instance by using smaller strides for those parts and larger strides for less relevant parts. We also introduce TadaStride-F, a more efficient version of TadaStride with minimal performance loss. In experiments, we evaluate TadaStride on a range of audio processing tasks. First, in audio classification experiments, TadaStride and TadaStride-F outperform other widely used standard downsampling methods while using comparable memory and time. Furthermore, through various analyses, we provide an understanding of how TadaStride learns effective adaptive strides and how this leads to improved performance. Finally, through additional experiments on automatic speech recognition and discrete speech representation learning, we demonstrate that TadaStride and TadaStride-F consistently outperform other downsampling methods and examine how the adaptive strides are learned in these tasks.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101678"},"PeriodicalIF":3.1,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000615/pdfft?md5=5861e2f1cdebf31ffd61d0cba92056f3&pid=1-s2.0-S0885230824000615-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141412883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A systematic study of DNN based speech enhancement in reverberant and reverberant-noisy environments","authors":"Heming Wang , Ashutosh Pandey , DeLiang Wang","doi":"10.1016/j.csl.2024.101677","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101677","url":null,"abstract":"<div><p>Deep learning has led to dramatic performance improvements for the task of speech enhancement, where deep neural networks (DNNs) are trained to recover clean speech from noisy and reverberant mixtures. Most existing DNN-based algorithms operate in the frequency domain, as time-domain approaches are believed to be less effective for speech dereverberation. In this study, we employ two DNNs, the attentive recurrent network (ARN) and the densely-connected convolutional recurrent network (DC-CRN), and systematically investigate the effects of different components on enhancement performance, such as window sizes, loss functions, and feature representations. We conduct evaluation experiments in two main conditions: reverberant-only and reverberant-noisy. Our findings suggest that larger window sizes are helpful for dereverberation, and that adding transform operations (either convolutional or linear) to encode and decode waveform features improves the sparsity of the learned representations and boosts the performance of time-domain models. Experimental results demonstrate that ARN and DC-CRN with the proposed techniques achieve superior performance compared with other strong enhancement baselines.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101677"},"PeriodicalIF":4.3,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000603/pdfft?md5=6f57ae0077f304562bdf74000559d71d&pid=1-s2.0-S0885230824000603-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141325435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPSA-DenseNet: A novel deep learning model for English accent classification","authors":"Tianyu Song , Linh Thi Hoai Nguyen , Ton Viet Ta","doi":"10.1016/j.csl.2024.101676","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101676","url":null,"abstract":"<div><p>This paper presents three innovative deep learning models for English accent classification: Multi-task Pyramid Split Attention-Densely Convolutional Networks (MPSA-DenseNet), Pyramid Split Attention-Densely Convolutional Networks (PSA-DenseNet), and Multi-task Densely Convolutional Networks (Multi-DenseNet), which combine multi-task learning and/or the PSA module attention mechanism with DenseNet. We applied these models to data collected from five dialects of English across native English-speaking regions (England, the United States) and non-native English-speaking regions (Hong Kong, Germany, India). Our experimental results show a significant improvement in classification accuracy, particularly with MPSA-DenseNet, which outperforms all other models, including the Densely Convolutional Networks (DenseNet) and Efficient Pyramid Squeeze Attention (EPSA) models previously used for accent identification. Our findings indicate that MPSA-DenseNet is a highly promising model for accurately identifying English accents.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101676"},"PeriodicalIF":4.3,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000597/pdfft?md5=45eac4ef8fe33cc3af54ca5ce1756899&pid=1-s2.0-S0885230824000597-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141264076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel and secured email classification using deep neural network with bidirectional long short-term memory","authors":"A. Poobalan , K. Ganapriya , K. Kalaivani , K. Parthiban","doi":"10.1016/j.csl.2024.101667","DOIUrl":"https://doi.org/10.1016/j.csl.2024.101667","url":null,"abstract":"<div><p>Email data has some characteristics that are different from other social media data, such as a large range of answers, formal language, notable length variations, high degrees of anomalies, and indirect relationships. The main goal of this research is to develop a robust and computationally efficient classifier that can distinguish between spam and regular email content. The benchmark Enron dataset, which is publicly accessible, was used for the tests. The six distinct Enron data sets we acquired were combined to generate the seven final Enron data sets. The dataset undergoes preprocessing at an early stage to remove superfluous sentences. The proposed Bidirectional Long Short-Term Memory (BiLSTM) model applies spam labels and examines email documents for spam. On the seven Enron datasets, DNN-BiLSTM outperforms the other classifiers in the performance comparison in terms of accuracy. DNN-BiLSTM and convolutional neural networks classify spam with 96.39 % and 98.69 % accuracy, respectively, compared to other machine learning classifiers. The risks associated with cloud data management and potential security flaws are also covered in the paper. This research presents hybrid encryption as a means of protecting cloud data while preserving privacy, using the hybrid AES-Rabit encryption algorithm, which is based on symmetric session key exchange.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101667"},"PeriodicalIF":4.3,"publicationDate":"2024-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000500/pdfft?md5=93a3ab04f63a63c4343031dc3b1f9eca&pid=1-s2.0-S0885230824000500-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141250220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speech emotion recognition in real static and dynamic human-robot interaction scenarios","authors":"Nicolás Grágeda , Carlos Busso , Eduardo Alvarado , Ricardo García , Rodrigo Mahu , Fernando Huenupan , Néstor Becerra Yoma","doi":"10.1016/j.csl.2024.101666","DOIUrl":"10.1016/j.csl.2024.101666","url":null,"abstract":"<div><p>The use of speech-based solutions is an appealing alternative for communication in human-robot interaction (HRI). An important challenge in this area is processing distant speech, which is often noisy and affected by reverberation and time-varying acoustic channels. It is important to investigate effective speech solutions, especially in dynamic environments where the robots and the users move, changing the distance and orientation between a speaker and the microphone. This paper addresses this problem in the context of speech emotion recognition (SER), an important task for understanding the intention of the message and the underlying mental state of the user. We propose a novel setup with a PR2 robot that moves while target speech and ambient noise are simultaneously recorded. Our study not only analyzes the detrimental effect of distant speech on SER in this dynamic robot-user setting but also provides solutions to attenuate it. We evaluate two beamforming schemes to spatially filter the speech signal, using either delay-and-sum (D&S) or minimum variance distortionless response (MVDR). We consider the original training speech recorded in controlled situations, and simulated conditions where the training utterances are processed to simulate the target acoustic environment. We consider the cases where the robot is moving (dynamic case) and not moving (static case). For SER, we explore two state-of-the-art classifiers: hand-crafted features implemented with the ladder network strategy, and learned features implemented with the wav2vec 2.0 feature representation. MVDR led to a higher signal-to-noise ratio than the basic D&S method. However, both approaches provided very similar average concordance correlation coefficient (CCC) improvements, equal to 116 %, on the HRI subsets using the ladder network trained with the original MSP-Podcast training utterances. For the wav2vec 2.0-based model, only D&S led to improvements. Surprisingly, the static and dynamic HRI testing subsets resulted in similar average CCCs. Finally, simulating the acoustic environment in the training dataset provided the highest average CCC scores on the HRI subsets, just 29 % and 22 % lower than those obtained with the original training/testing utterances, with the ladder network and wav2vec 2.0, respectively.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101666"},"PeriodicalIF":4.3,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000494/pdfft?md5=10d8a0faec641adaf8be74271eaf5174&pid=1-s2.0-S0885230824000494-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141134350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}