Guowei Jin, Yunfeng Xu, Hong Kang, Jialin Wang, Borui Miao
{"title":"DSTM: A transformer-based model with dynamic-static feature fusion in speech emotion recognition","authors":"Guowei Jin, Yunfeng Xu, Hong Kang, Jialin Wang, Borui Miao","doi":"10.1016/j.csl.2024.101733","DOIUrl":"10.1016/j.csl.2024.101733","url":null,"abstract":"<div><div>With the support of multi-head attention, the Transformer shows remarkable results in speech emotion recognition. However, existing models still cannot accurately locate important regions in semantic information at different time scales. To address this problem, we propose a Transformer-based network model for dynamic-static feature fusion, composed of a locally adaptive multi-head attention module and a global static attention module. The locally dynamic multi-head attention module adapts the attention window sizes and window centers of the different regions through speech samples and learnable parameters, enabling the model to adaptively discover and pay attention to valuable information embedded in speech. The global static attention module enables the model to use each element in the sequence fully and learn critical global feature information by establishing connections over the entire input sequence. We also use the data mixture training method to train our model and introduce the center loss function to supervise the training of the model, which speeds up model fitting and alleviates the sample imbalance problem to a certain extent. 
This method achieved good performance on the IEMOCAP and MELD datasets, proving that our proposed model structure and method have better accuracy and robustness.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101733"},"PeriodicalIF":3.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142428314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sanhe Yang, Peichao Lai, Ruixiong Fang, Yanggeng Fu, Feiyang Ye, Yilei Wang
{"title":"FE-CFNER: Feature Enhancement-based approach for Chinese Few-shot Named Entity Recognition","authors":"Sanhe Yang, Peichao Lai, Ruixiong Fang, Yanggeng Fu, Feiyang Ye, Yilei Wang","doi":"10.1016/j.csl.2024.101730","DOIUrl":"10.1016/j.csl.2024.101730","url":null,"abstract":"<div><div>Although significant progress has been made in Chinese Named Entity Recognition (NER) methods based on deep learning, their performance often falls short in few-shot scenarios. Feature enhancement is considered a promising approach to address the issue of Chinese few-shot NER. However, traditional feature fusion methods tend to lead to the loss of important information and the integration of irrelevant information. Despite the benefits of incorporating BERT for improving entity recognition, its performance is limited when training data is insufficient. To tackle these challenges, this paper proposes a Feature Enhancement-based approach for Chinese Few-shot NER called FE-CFNER. FE-CFNER uses a double cross neural network that minimizes information loss by applying feature-cross interaction twice. Additionally, adaptive weights and a top-<span><math><mi>k</mi></math></span> mechanism are introduced to sparsify attention distributions, enabling the model to prioritize important information related to entities while excluding irrelevant information. To further enhance the quality of BERT embeddings, FE-CFNER employs a contrastive template for contrastive learning pre-training of BERT, enhancing BERT’s semantic understanding capability. We evaluate the proposed method on four sampled Chinese NER datasets: Weibo, Resume, Taobao, and Youku. 
Experimental results validate the effectiveness and superiority of FE-CFNER in Chinese few-shot NER tasks.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101730"},"PeriodicalIF":3.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142428313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spoofing countermeasure for fake speech detection using brute force features","authors":"Arsalan Rahman Mirza , Abdulbasit K. Al-Talabani","doi":"10.1016/j.csl.2024.101732","DOIUrl":"10.1016/j.csl.2024.101732","url":null,"abstract":"<div><div>Due to the progress in deep learning technology, techniques that generate spoofed speech have advanced significantly. Such synthetic speech can be exploited for harmful purposes, like impersonation or disseminating false information. Researchers in the area investigate the useful features for spoof detection. This paper extensively investigates three problems in spoof detection in speech, namely, the imbalanced number of samples per class, which may negatively affect the performance of any detection model, the effect of early and late feature fusion, and the analysis of unseen attacks on the model. Regarding the imbalance issue, we have proposed two approaches (a Synthetic Minority Over Sampling Technique (SMOTE)-based and a Bootstrap-based model). We have used the OpenSMILE toolkit to extract different feature sets, and have investigated their individual results as well as their early and late fusion. The experiments are evaluated using the ASVspoof 2019 datasets which encompass synthetic, voice-conversion, and replayed speech samples. Additionally, Support Vector Machine (SVM) and Deep Neural Network (DNN) have been adopted in the classification. 
The outcomes from various test scenarios indicated that neither the imbalanced nature of the dataset nor a specific feature or their fusions outperformed the brute force version of the model, as the best Equal Error Rate (EER) achieved by the Imbalance model is 6.67% and 1.80% for Logical Access (LA) and Physical Access (PA), respectively.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101732"},"PeriodicalIF":3.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142428363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Louis Mahon , Omri Abend , Uri Berger , Katherine Demuth , Mark Johnson , Mark Steedman
{"title":"A language-agnostic model of child language acquisition","authors":"Louis Mahon , Omri Abend , Uri Berger , Katherine Demuth , Mark Johnson , Mark Steedman","doi":"10.1016/j.csl.2024.101714","DOIUrl":"10.1016/j.csl.2024.101714","url":null,"abstract":"<div><div>This work reimplements a recent semantic bootstrapping child language acquisition (CLA) model, which was originally designed for English, and trains it to learn a new language: Hebrew. The model learns from pairs of utterances and logical forms as meaning representations, and acquires both syntax and word meanings simultaneously. The results show that the model mostly transfers to Hebrew, but that a number of factors, including the richer morphology in Hebrew, make the learning slower and less robust. This suggests that a clear direction for future work is to enable the model to leverage the similarities between different word forms.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101714"},"PeriodicalIF":3.1,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142428315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evidence and Axial Attention Guided Document-level Relation Extraction","authors":"Jiawei Yuan , Hongyong Leng , Yurong Qian , Jiaying Chen , Mengnan Ma , Shuxiang Hou","doi":"10.1016/j.csl.2024.101728","DOIUrl":"10.1016/j.csl.2024.101728","url":null,"abstract":"<div><div>Document-level Relation Extraction (DocRE) aims to identify semantic relations among multiple entity pairs within a document. Most of the previous DocRE methods take the entire document as input. However, for human annotators, a small subset of sentences in the document, namely the evidence, is sufficient to infer the relation of an entity pair. Additionally, a document usually contains multiple entities, and these entities are scattered throughout various locations of the document. Previous models use these entities independently, ignoring the global interdependency among relation triples. To handle the above issues, we propose a novel framework EAAGRE (Evidence and Axial Attention Guided Relation Extraction). Firstly, we use human-annotated evidence labels to supervise the attention module of the DocRE system, making the model pay attention to the evidence sentences rather than others. Secondly, we construct an entity-level relation matrix and use axial attention to capture the global interactions among entity pairs. By doing so, we further extract the relations that require multiple entity pairs for prediction. 
We conduct various experiments on DocRED and obtain improvements over baseline models, verifying the effectiveness of our model.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101728"},"PeriodicalIF":3.1,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142533818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SPNet: A Serial and Parallel Convolutional Neural Network algorithm for the cross-language coreference resolution","authors":"Zixi Jia , Tianli Zhao , Jingyu Ru , Yanxiang Meng , Bing Xia","doi":"10.1016/j.csl.2024.101729","DOIUrl":"10.1016/j.csl.2024.101729","url":null,"abstract":"<div><div>Current models of coreference resolution always neglect the importance of hidden feature extraction, accurate scoring framework design, and the long-term influence of preceding potential antecedents on future decision-making. However, these aspects play vital roles in scoring the likelihood of coreference between an anaphora and its real antecedent. In this paper, we present a novel model named Serial and Parallel Convolutional Neural Network (SPNet). Based on the SPNet, two kinds of resolvers are proposed. Given the characteristics of reinforcement learning, we combine the reinforcement learning framework with the SPNet to solve the problem of Chinese zero pronoun resolution. Furthermore, we apply some fine-tuning to the SPNet and propose a new resolver combined with the end-to-end framework to solve the problem of coreference resolution. The experiments are conducted on the CoNLL-2012 dataset and the results show that our model is effective. Our model achieves excellent performance in the Chinese zero pronoun resolution task. 
On the other hand, compared with our baseline, our model also has an improvement of 0.3% in the coreference resolution task.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"91 ","pages":"Article 101729"},"PeriodicalIF":3.1,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142747745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aidan Pine , Erica Cooper , David Guzmán , Eric Joanis , Anna Kazantseva , Ross Krekoski , Roland Kuhn , Samuel Larkin , Patrick Littell , Delaney Lothian , Akwiratékha’ Martin , Korin Richmond , Marc Tessier , Cassia Valentini-Botinhao , Dan Wells , Junichi Yamagishi
{"title":"Speech Generation for Indigenous Language Education","authors":"Aidan Pine , Erica Cooper , David Guzmán , Eric Joanis , Anna Kazantseva , Ross Krekoski , Roland Kuhn , Samuel Larkin , Patrick Littell , Delaney Lothian , Akwiratékha’ Martin , Korin Richmond , Marc Tessier , Cassia Valentini-Botinhao , Dan Wells , Junichi Yamagishi","doi":"10.1016/j.csl.2024.101723","DOIUrl":"10.1016/j.csl.2024.101723","url":null,"abstract":"<div><div>As the quality of contemporary speech synthesis improves, so too does the interest from language communities in developing text-to-speech (TTS) systems for a variety of real-world applications. Much of the work on TTS has focused on high-resource languages, resulting in implicitly resource-intensive paths to building such systems. The goal of this paper is to provide signposts and points of reference for future low-resource speech synthesis efforts, with insights drawn from the Speech Generation for Indigenous Language Education (SGILE) project. Funded and coordinated by the National Research Council of Canada (NRC), this multi-year, multi-partner project has the goal of producing high-quality text-to-speech systems that support the teaching of Indigenous languages in a variety of educational contexts. We provide background information and motivation for the project, as well as details about our approach and project structure, including results from a multi-day requirements-gathering session. We discuss some of our key challenges, including building models with appropriate controls for educators, improving model data efficiency, and strategies for low-resource transfer learning and evaluation. 
Finally, we provide a detailed survey of existing speech synthesis software and introduce EveryVoice TTS, a toolkit designed specifically for low-resource speech synthesis.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101723"},"PeriodicalIF":3.1,"publicationDate":"2024-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142533842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yael Segal-Feldman , Kasia Hitczenko , Matthew Goldrick , Adam Buchwald , Angela Roberts , Joseph Keshet
{"title":"Enhancing analysis of diadochokinetic speech using deep neural networks","authors":"Yael Segal-Feldman , Kasia Hitczenko , Matthew Goldrick , Adam Buchwald , Angela Roberts , Joseph Keshet","doi":"10.1016/j.csl.2024.101715","DOIUrl":"10.1016/j.csl.2024.101715","url":null,"abstract":"<div><p>Diadochokinetic speech tasks (DDK) involve the repetitive production of consonant-vowel syllables. These tasks are useful in detecting impairments, differential diagnosis, and monitoring progress in speech-motor impairments. However, manual analysis of those tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models working on the raw waveform for the automatic segmentation of stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a feature extractor module, replacing conventional signal processing features. In this context, diverse deep learning architectures, such as convolutional neural networks (CNNs) and large self-supervised models like HuBERT, are applied for the extraction process. A decoder model uses derived embeddings to identify frame types. Consequently, the paper studies diverse deep architectures, ranging from linear layers to LSTMs, CNNs, and transformers. These architectures are assessed for their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and an unseen dataset of older individuals with Parkinson’s Disease. 
The results reveal that an LSTM model performs better than all other models on both datasets and is comparable to trained human annotators.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101715"},"PeriodicalIF":3.1,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142151014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhonghe Han , Jintao Liu , Yuanben Zhang , Lili Zhang , Lei Wang , Zequn Zhang , Zhihao Zhao , Zhenyu Huang
{"title":"Copiously Quote Classics: Improving Chinese Poetry Generation with historical allusion knowledge","authors":"Zhonghe Han , Jintao Liu , Yuanben Zhang , Lili Zhang , Lei Wang , Zequn Zhang , Zhihao Zhao , Zhenyu Huang","doi":"10.1016/j.csl.2024.101708","DOIUrl":"10.1016/j.csl.2024.101708","url":null,"abstract":"<div><p>Integrating allusions into poems is an advanced form of human poetry writing, which could clearly express the author’s thoughts and arouse the resonance of readers. However, existing poetry generation works mainly focus on improving the coherence and fluency of poetry, while generating poems with allusion knowledge is rarely considered. To solve this issue, we propose an <strong>A</strong>llusion-aware <strong>C</strong>hinese <strong>P</strong>oetry <strong>G</strong>eneration (ACPG) framework in this study. Concretely, we first release an <strong>A</strong>llusion-<strong>E</strong>nriched <strong>P</strong>oetry (AEP) dataset by linking poems with historical allusions, which might enable a new research direction for poetry generation. Based on this dataset, we design a three-stage learning mechanism to encourage the training stage under a low-resource setting, which can effectively exploit the knowledge of large-scale poetry and allusion data to generate informative allusive poems. Extensive experiments demonstrate the effectiveness of ACPG among a series of proposed baselines. 
Moreover, the proposed ACPG framework can also be applied to lyrics generation or other controlled text generation tasks, which can incorporate allusion knowledge into the generated results and enhance the meaning and quality of the texts.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101708"},"PeriodicalIF":3.1,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142151013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Johanan Joysingh , P. Vijayalakshmi , T. Nagarajan
{"title":"Significance of chirp MFCC as a feature in speech and audio applications","authors":"S. Johanan Joysingh , P. Vijayalakshmi , T. Nagarajan","doi":"10.1016/j.csl.2024.101713","DOIUrl":"10.1016/j.csl.2024.101713","url":null,"abstract":"<div><p>A novel feature, based on the chirp z-transform, that offers an improved representation of the underlying true spectrum is proposed. This feature, the chirp MFCC, is derived by computing the Mel frequency cepstral coefficients from the chirp magnitude spectrum, instead of the Fourier transform magnitude spectrum. The theoretical foundations for the proposal, and the experimental validation using product of likelihood Gaussians, to show the improved class separation offered by the proposed chirp MFCC, when compared with basic MFCC are discussed. Further, real world evaluation of the feature is performed using three diverse tasks, namely, speech–music classification, speaker identification, and speech commands recognition. It is shown in all three tasks that the proposed chirp MFCC offers considerable improvements.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101713"},"PeriodicalIF":3.1,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000962/pdfft?md5=9eea65049758593f74e943bfcd89ac3f&pid=1-s2.0-S0885230824000962-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}