Yael Segal-Feldman , Kasia Hitczenko , Matthew Goldrick , Adam Buchwald , Angela Roberts , Joseph Keshet
{"title":"Enhancing analysis of diadochokinetic speech using deep neural networks","authors":"Yael Segal-Feldman , Kasia Hitczenko , Matthew Goldrick , Adam Buchwald , Angela Roberts , Joseph Keshet","doi":"10.1016/j.csl.2024.101715","DOIUrl":"10.1016/j.csl.2024.101715","url":null,"abstract":"<div><p>Diadochokinetic speech tasks (DDK) involve the repetitive production of consonant-vowel syllables. These tasks are useful in detecting impairments, differential diagnosis, and monitoring progress in speech-motor impairments. However, manual analysis of those tasks is time-consuming, subjective, and provides only a rough picture of speech. This paper presents several deep neural network models working on the raw waveform for the automatic segmentation of stop consonants and vowels from unannotated and untranscribed speech. A deep encoder serves as a features extractor module, replacing conventional signal processing features. In this context, diverse deep learning architectures, such as convolutional neural networks (CNNs) and large self-supervised models like HuBERT, are applied for the extraction process. A decoder model uses derived embeddings to identify frame types. Consequently, the paper studies diverse deep architectures, ranging from linear layers, LSTM, CNN, and transformers. These architectures are assessed for their ability to detect speech rate, sound duration, and boundary locations on a dataset of healthy individuals and an unseen dataset of older individuals with Parkinson’s Disease. The results reveal that an LSTM model performs better than all other models on both datasets and is comparable to trained human annotators.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101715"},"PeriodicalIF":3.1,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142151014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhonghe Han , Jintao Liu , Yuanben Zhang , Lili Zhang , Lei Wang , Zequn Zhang , Zhihao Zhao , Zhenyu Huang
{"title":"Copiously Quote Classics: Improving Chinese Poetry Generation with historical allusion knowledge","authors":"Zhonghe Han , Jintao Liu , Yuanben Zhang , Lili Zhang , Lei Wang , Zequn Zhang , Zhihao Zhao , Zhenyu Huang","doi":"10.1016/j.csl.2024.101708","DOIUrl":"10.1016/j.csl.2024.101708","url":null,"abstract":"<div><p>Integrating allusions into poems is an advanced form of human poetry writing, which could clearly express the author’s thoughts and arouse the resonance of readers. However, existing poetry generation works mainly focus on improving the coherence and fluency of poetry, while generating poems with allusion knowledge is rarely considered. To solve this issue, we propose an <strong>A</strong>llusion-aware <strong>C</strong>hinese <strong>P</strong>oetry <strong>G</strong>eneration (ACPG) framework in this study. Concretely, we first release an <strong>A</strong>llusion-<strong>E</strong>nriched <strong>P</strong>oetry (AEP) dataset by linking poems with historical allusions, which might enable a new research direction for poetry generation. Based on this dataset, we design a three-stage learning mechanism to encourage the training stage under a low-resource setting, which can effectively exploit the knowledge of large-scale poetry and allusion data to generate informative allusive poems. Extensive experiments demonstrate the effectiveness of ACPG among a series of proposed baselines. Moreover, the proposed ACPG framework can also be applied to lyrics generation or other controlled text generation tasks, which can incorporate allusion knowledge into the generated results and enhance the meaning and quality of the texts.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101708"},"PeriodicalIF":3.1,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142151013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Johanan Joysingh , P. Vijayalakshmi , T. Nagarajan
{"title":"Significance of chirp MFCC as a feature in speech and audio applications","authors":"S. Johanan Joysingh , P. Vijayalakshmi , T. Nagarajan","doi":"10.1016/j.csl.2024.101713","DOIUrl":"10.1016/j.csl.2024.101713","url":null,"abstract":"<div><p>A novel feature, based on the chirp z-transform, that offers an improved representation of the underlying true spectrum is proposed. This feature, the chirp MFCC, is derived by computing the Mel frequency cepstral coefficients from the chirp magnitude spectrum, instead of the Fourier transform magnitude spectrum. The theoretical foundations for the proposal, and the experimental validation using product of likelihood Gaussians, to show the improved class separation offered by the proposed chirp MFCC, when compared with basic MFCC are discussed. Further, real world evaluation of the feature is performed using three diverse tasks, namely, speech–music classification, speaker identification, and speech commands recognition. It is shown in all three tasks that the proposed chirp MFCC offers considerable improvements.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101713"},"PeriodicalIF":3.1,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000962/pdfft?md5=9eea65049758593f74e943bfcd89ac3f&pid=1-s2.0-S0885230824000962-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Artificial disfluency detection, uh no, disfluency generation for the masses","authors":"Tatiana Passali , Thanassis Mavropoulos , Grigorios Tsoumakas , Georgios Meditskos , Stefanos Vrochidis","doi":"10.1016/j.csl.2024.101711","DOIUrl":"10.1016/j.csl.2024.101711","url":null,"abstract":"<div><p>Existing approaches for disfluency detection typically require the existence of large annotated datasets. However, current datasets for this task are limited, suffer from class imbalance, and lack some types of disfluencies that are encountered in real-world scenarios. At the same time, augmentation techniques for disfluency detection are not able to model complex types of disfluencies. This limits such approaches to only performing pre-training since the generated data are not indicative of disfluencies that occur in real scenarios and, as a result, cannot be directly used for training disfluency detection models, as we experimentally demonstrate. This imposes significant constraints on the usefulness of such approaches in practice since real disfluencies still have to be collected in order to train the models. In this work, we propose Large-scale ARtificial Disfluency Generation (LARD), a method for automatically generating artificial disfluencies, and more specifically repairs, from fluent text. Unlike existing augmentation techniques, LARD can simulate all the different and complex types of disfluencies. In addition, it incorporates contextual embeddings into the disfluency generation to produce realistic, context-aware artificial disfluencies. LARD can be used effectively for training disfluency detection models, bypassing the requirement of annotated disfluent data. Our empirical evaluation shows that LARD outperforms existing rule-based augmentation methods and increases the accuracy of existing disfluency detectors. In addition, experiments demonstrate that the proposed method can be effectively used in a low-resource setup.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101711"},"PeriodicalIF":3.1,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000949/pdfft?md5=3e3442312f5819775b9ad09e131a9dd3&pid=1-s2.0-S0885230824000949-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deep multi-task learning based detection of correlated mental disorders using audio modality","authors":"Rohan Kumar Gupta, Rohit Sinha","doi":"10.1016/j.csl.2024.101710","DOIUrl":"10.1016/j.csl.2024.101710","url":null,"abstract":"<div><p>The existence of correlation among mental disorders is a well-known phenomenon. Multi-task learning (MTL) has been reported to yield enhanced detection performance of a targeted mental disorder by leveraging its correlation with other related mental disorders, mainly in textual and visual modalities. The validation of the same on audio modality is yet to be explored. In this study, we explore homogeneous and heterogeneous MTL paradigms for detecting two correlated mental disorders, namely major depressive disorder (MDD) and post-traumatic stress disorder (PTSD), on a publicly available audio dataset. The detection of both disorders is interchangeably employed as an auxiliary task when the other is the main task. In addition, a few other tasks are employed as auxiliary tasks. The results show that both MTL paradigms, implemented using two considered deep-learning models, outperformed the corresponding single-task learning (STL). The best relative improvement in the detection performance of MDD and PTSD is found to be 29.9% and 28.8%, respectively. Furthermore, we analyzed the cross-corpus generalization of MTL using two distinct datasets that involve MDD/PTSD instances. The results indicate that the generalizability of MTL is significantly superior to that of STL. The best relative increment in the cross-corpus generalization performance of MDD and PTSD detection is found to be 25.0% and 56.5%, respectively.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101710"},"PeriodicalIF":3.1,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000937/pdfft?md5=abe8ab646f019a4cea29fbd4acdd6557&pid=1-s2.0-S0885230824000937-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142011678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive feature extraction for entity relation extraction","authors":"Weizhe Yang , Yongbin Qin , Ruizhang Huang , Yanping Chen","doi":"10.1016/j.csl.2024.101712","DOIUrl":"10.1016/j.csl.2024.101712","url":null,"abstract":"<div><p>Effective capturing of semantic dependencies within sentences is pivotal to support relation extraction. However, challenges such as feature sparsity, and the complexity of identifying the structure of target entity pairs brought by the traditional methods of feature extraction pose significant obstacles for relation extraction. Existing methods that rely on combined features or recurrent networks also face limitations, such as over-reliance on prior knowledge or the gradient vanishing problem. To address these limitations, we propose an Adaptive Feature Extraction (AFE) method, combining neural networks with feature engineering to capture high-order abstract and long-distance semantic dependencies. Our approach extracts atomic features from sentences, maps them into distributed representations, and categorizes these representations into multiple mixed features through adaptive combination, setting it apart from other methods. The proposed AFE-based model uses four different convolutional layers to facilitate feature learning and weighting from the adaptive feature representations, thereby enhancing the discriminative power of deep networks for relation extraction. Experimental results on the English datasets ACE05 English, SciERC, and the Chinese datasets ACE05 Chinese, and CLTC(SanWen) demonstrated the superiority of our method, the F1 scores were improved by 4.16%, 3.99%, 0.82%, and 1.60%, respectively. In summary, our AFE method provides a flexible, and effective solution to some challenges in cross-domain and cross-language relation extraction.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101712"},"PeriodicalIF":3.1,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000950/pdfft?md5=adb04036e83a59bb4a0206084d42c6c1&pid=1-s2.0-S0885230824000950-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142048663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A neural network approach for speech enhancement and noise-robust bandwidth extension","authors":"Xiang Hao , Chenglin Xu , Chen Zhang , Lei Xie","doi":"10.1016/j.csl.2024.101709","DOIUrl":"10.1016/j.csl.2024.101709","url":null,"abstract":"<div><p>When processing noisy utterances with varying frequency bandwidths using an enhancement model, the effective bandwidth of the resulting enhanced speech often remains unchanged. However, high-frequency components are crucial for perceived audio quality, underscoring the need for noise-robust bandwidth extension capabilities in speech enhancement networks. In this study, we addressed this challenge by proposing a novel network architecture and loss function based on the CAUNet, which is a state-of-the-art speech enhancement method. We introduced a multi-scale loss and implemented a coordinate embedded upsampling block to facilitate bandwidth extension while maintaining the ability of speech enhancement. Additionally, we proposed a gradient loss function to promote the neural network’s convergence, leading to significant performance improvements. Our experimental results validate these modifications and clearly demonstrate the superiority of our approach over competing methods.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101709"},"PeriodicalIF":3.1,"publicationDate":"2024-08-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000925/pdfft?md5=3c5e79967537c7a56d567c957963e01b&pid=1-s2.0-S0885230824000925-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141993007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiyuan Jia , Zongqing Mao , Zhen Zhang , Qiyun Lv , Xin Wang , Guohua Wu
{"title":"Syntax-controlled paraphrases generation with VAE and multi-task learning","authors":"Xiyuan Jia , Zongqing Mao , Zhen Zhang , Qiyun Lv , Xin Wang , Guohua Wu","doi":"10.1016/j.csl.2024.101705","DOIUrl":"10.1016/j.csl.2024.101705","url":null,"abstract":"<div><p>Paraphrase generation is an important method for augmenting text data, which has a crucial role in Natural Language Generation (NLG). However, existing methods lack the ability to capture the semantic representation of input sentences and the syntactic structure of exemplars, which can easily lead to problems such as redundant content, semantic inaccuracies, and poor diversity. To tackle these challenges, we propose a Syntax-Controlled Paraphrase Generator (SCPG), which utilizes attention networks and VAE-based hidden variables to model the semantics of input sentences and the syntax of exemplars. In addition, in order to achieve controllability of the target paraphrase structure, we propose a method for learning semantic and syntactic representations based on multi-task learning, and successfully integrate the two through a gating mechanism. Extensive experimental results show that SCPG achieves SOTA results in terms of both semantic consistency and syntactic controllability, and is able to make a better trade-off between preserving semantics and novelty of sentence structure.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101705"},"PeriodicalIF":3.1,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0885230824000883/pdfft?md5=a172f9652be80ec2012b298f58353215&pid=1-s2.0-S0885230824000883-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142011679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mehedi Hasan Bijoy , Nahid Hossain , Salekul Islam , Swakkhar Shatabda
{"title":"A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages","authors":"Mehedi Hasan Bijoy , Nahid Hossain , Salekul Islam , Swakkhar Shatabda","doi":"10.1016/j.csl.2024.101703","DOIUrl":"10.1016/j.csl.2024.101703","url":null,"abstract":"<div><p>Spelling error correction is the task of identifying and rectifying misspelled words in texts. It is a potential and active research topic in Natural Language Processing because of numerous applications in human language understanding. The phonetically or visually similar yet semantically distinct characters make it an arduous task in any language. Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods which we found rather inefficient. In particular, machine learning-based approaches, which exhibit superior performance to rule-based and statistical methods, are ineffective as they correct each character regardless of its appropriateness. In this paper, we propose a novel detector-purificator-corrector framework, DPCSpell based on denoising transformers by addressing previous issues. In addition to that, we present a method for large-scale corpus creation from scratch which in turn resolves the resource limitation problem of any left-to-right scripted language. The empirical outcomes demonstrate the effectiveness of our approach, which outperforms previous state-of-the-art methods by attaining an exact match (EM) score of 94.78%, a precision score of 0.9487, a recall score of 0.9478, an f1 score of 0.948, an f0.5 score of 0.9483, and a modified accuracy (MA) score of 95.16% for Bangla spelling error correction. The models and corpus are publicly available at <span><span>https://tinyurl.com/DPCSpell</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"89 ","pages":"Article 101703"},"PeriodicalIF":3.1,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S088523082400086X/pdfft?md5=42e971181da3ed460a728ce6126888c9&pid=1-s2.0-S088523082400086X-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141947071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling child comprehension: A case of suffixal passive construction in Korean","authors":"Gyu-Ho Shin , Seongmin Mun","doi":"10.1016/j.csl.2024.101701","DOIUrl":"10.1016/j.csl.2024.101701","url":null,"abstract":"<div><div>The present study investigates a computational model's ability to capture monolingual children's language behaviour during comprehension in Korean, an understudied language in the field. Specifically, we test whether and how two neural network architectures (LSTM, GPT-2) cope with a suffixal passive construction involving verbal morphology and required interpretive procedures (i.e., revising the mapping between thematic roles and case markers) driven by that morphology. To this end, we fine-tune our models via patching (i.e., pre-trained model + caregiver input) and hyperparameter adjustments, and measure their binary classification performance on the test sentences used in a behavioural study manifesting scrambling and omission of sentential components to varying degrees. We find that, while these models’ performance converges with the children's response patterns found in the behavioural study to some extent, the models do not faithfully simulate the children's comprehension behaviour pertaining to the suffixal passive, yielding by-model, by-condition, and by-hyperparameter asymmetries. This points to the limits of the neural networks’ capacity to address child language features. The implications of this study invite subsequent inquiries on the extent to which computational models reveal developmental trajectories of children's linguistic knowledge that have been unveiled through corpus-based or experimental research.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"90 ","pages":"Article 101701"},"PeriodicalIF":3.1,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142446405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}