{"title":"Multimodal attention for lip synthesis using conditional generative adversarial networks","authors":"Andrea Vidal, Carlos Busso","doi":"10.1016/j.specom.2023.102959","DOIUrl":"10.1016/j.specom.2023.102959","url":null,"abstract":"<div><p>The synthesis of lip movements is an important problem for a <em>socially interactive agent</em> (SIA). It is important to generate lip movements that are synchronized with speech and have realistic co-articulation. We hypothesize that combining lexical information (i.e., sequence of phonemes) and acoustic features can lead not only to models that generate the correct lip movements matching the articulatory movements, but also to trajectories that are well synchronized with the speech emphasis and emotional content. This work presents attention-based frameworks that use acoustic and lexical information to enhance the synthesis of lip movements. The lexical information is obtained from <em>automatic speech recognition</em> (ASR) transcriptions, broadening the range of applications of the proposed solution. We propose models based on <em>conditional generative adversarial networks</em> (CGAN) with self-modality attention and cross-modalities attention mechanisms. These models allow us to understand which frames are considered more in the generation of lip movements. We animate the synthesized lip movements using blendshapes. These animations are used to compare our proposed multimodal models with alternative methods, including unimodal models implemented with either text or acoustic features. We rely on subjective metrics using perceptual evaluations and an objective metric based on the LipSync model. The results show that our proposed models with attention mechanisms are preferred over the baselines on the perception of naturalness. The addition of cross-modality attentions and self-modality attentions has a significant positive impact on the performance of the generated sequences. We observe that lexical information provides valuable information even when the transcriptions are not perfect. The improved performance observed by the multimodal system confirms the complementary information provided by the speech and text modalities.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102959"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43953120","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction of whitespace and word segmentation in noisy Pashto text using CRF","authors":"Ijazul Haq, Weidong Qiu, Jie Guo, Peng Tang","doi":"10.1016/j.specom.2023.102970","DOIUrl":"10.1016/j.specom.2023.102970","url":null,"abstract":"<div><p>Word segmentation is the process of splitting up the text into words. In English and most European languages, word boundaries are identified by whitespace, while in Pashto, there is no explicit word delimiter. Pashto uses whitespace for word separation but not consistently, and it cannot be considered a reliable word-boundary identifier. This inconsistency makes the Pashto word segmentation unique and challenging. Moreover, Pashto is a low-resource, non-standardized language with no established rules for the correct usage of whitespace that leads to two typical spelling errors, space-omission, and space-insertion. These errors significantly affect the performance of the word segmenter. This study aims to develop a state-of-the-art word segmenter for Pashto, with a proofing tool to identify and correct the position of space in a noisy text. The CRF algorithm is incorporated to train two machine learning models for these tasks. For models' training, we have developed a text corpus of nearly 3.5 million words, annotated for the correct positions of spaces and explicit word boundary information using a lexicon-based technique, and then manually checked for errors. The experimental results of the model are very satisfactory, where the F1-scores are 99.2% and 96.7% for the proofing model and word segmenter, respectively.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102970"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46845071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fractional feature-based speech enhancement with deep neural network","authors":"Liyun Xu, Tong Zhang","doi":"10.1016/j.specom.2023.102971","DOIUrl":"10.1016/j.specom.2023.102971","url":null,"abstract":"<div><p>Speech enhancement (SE) has become a considerable promise application of deep learning. Commonly, the deep neural network (DNN) in the SE task is trained to learn a mapping from the noisy features to the clean. However, the features are usually extracted in the time or frequency domain. In this paper, the improved features in the fractional domain are presented based on the flexible character of fractional Fourier transform (FRFT). First, the distribution characters and differences of the speech signal and the noise in the fractional domain are investigated. Second, the L1-optimal FRFT spectrum and the feature matrix constructed from a set of FRFT spectrums are served as the training features in DNN and applied in the SE. A series of pre-experiments conducted in various different fractional transform orders illustrate that the L1-optimal FRFT-DNN-based SE method can achieve a great enhancement result compared with the methods based on another single fractional spectrum. Moreover, the matrix of FRFT-DNN-based SE performs better under the same conditions. Finally, compared with other two typically SE models, the experiment results indicate that the proposed method could reach significantly performance in different SNRs with unseen noise types. The conclusions confirm the advantages of using the proposed improved features in the fractional domain.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102971"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43924981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating prosodic entrainment from global conversations to local turns and tones in Mandarin conversations","authors":"Zhihua Xia , Julia Hirschberg , Rivka Levitan","doi":"10.1016/j.specom.2023.102961","DOIUrl":"10.1016/j.specom.2023.102961","url":null,"abstract":"<div><p>Previous research on acoustic entrainment has paid less attention to tones than to other prosodic features. This study sets a hierarchical framework by three layers of conversations, turns and tone units, investigates prosodic entrainment in Mandarin spontaneous dialogues at each level, and compares the three. Our research has found that (1) global and local entrainment exist independently, and local entrainment is more evident than global; (2) variation exists in prosodic features’ contribution to entrainment at three levels: amplitude features exhibiting more prominent entrainment at both global and local levels, and speaking-rate and F0 features showing more prominence at the local levels; and (3) no convergence is found at the conversational level, at the turn level or over tone units.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102961"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46026309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The dependence of accommodation processes on conversational experience","authors":"L. Ann Burchfield, Mark Antoniou, Anne Cutler","doi":"10.1016/j.specom.2023.102963","DOIUrl":"10.1016/j.specom.2023.102963","url":null,"abstract":"<div><p>Conversational partners accommodate to one another's speech, a process that greatly facilitates perception. This process occurs in both first (L1) and second languages (L2); however, recent research has revealed that adaptation can be language-specific, with listeners sometimes applying it in one language but not in another. Here, we investigate whether a supply of novel talkers impacts whether the adaptation is applied, testing Mandarin-English groups whose use of their two languages involves either an extensive or a restricted set of social situations. Perceptual learning in Mandarin and English is examined across two similarly-constituted groups in the same English-speaking environment: (a) heritage language users with Mandarin as family L1 and English as environmental language, and (b) international students with Mandarin as L1 and English as later-acquired L2. In English, exposure to an ambiguous sound in lexically disambiguating contexts prompted the expected retuning of phonemic boundaries in categorisation for the heritage users, but not for the students. In Mandarin, the opposite appeared: the heritage users showed no adaptation, but the students did adapt. In each case where learning did not appear, participants reported using the language in question with fewer interlocutors. The results support the view that successful retuning ability in any language requires regular conversational interaction with novel talkers.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102963"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47047534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Space-and-speaker-aware acoustic modeling with effective data augmentation for recognition of multi-array conversational speech","authors":"Li Chai , Hang Chen , Jun Du , Qing-Feng Liu , Chin-Hui Lee","doi":"10.1016/j.specom.2023.102958","DOIUrl":"https://doi.org/10.1016/j.specom.2023.102958","url":null,"abstract":"<div><p>We propose a space-and-speaker-aware (SSA) approach to acoustic modeling (AM), denoted as SSA-AM, to improve system performances of automatic speech recognition (ASR) in distant multi-array conversational scenarios. In contrast to conventional AM which only uses spectral features from a target speaker as inputs, the inputs to SSA-AM consists of speech features from both the target and interfering speakers, which contain discriminative information from different speakers, including spatial information embedded in interaural phase differences (IPDs) between individual interfering speakers and the target speaker. In the proposed SSA-AM framework, we explore four acoustic model architectures consisting of different combinations of four neural networks, namely deep residual network, factorized time delay neural network, self-attention and residual bidirectional long short-term memory neural network. Various data augmentation techniques are adopted to expand the training data to include different options of beamformed speech obtained from multi-channel speech enhancement. Evaluated on the recent CHiME-6 Challenge Track 1, our proposed SSA-AM framework achieves consistent recognition performance improvements when compared with the official baseline acoustic models. Furthermore, SSA-AM outperforms acoustic models without explicitly using the space and speaker information. Finally, our data augmentation schemes are shown to be especially effective for compact model designs. Code is released at <span>https://github.com/coalboss/SSA_AM</span><svg><path></path></svg>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102958"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49728533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Development of a hybrid word recognition system and dataset for the Azerbaijani Sign Language dactyl alphabet","authors":"Jamaladdin Hasanov , Nigar Alishzade , Aykhan Nazimzade , Samir Dadashzade , Toghrul Tahirov","doi":"10.1016/j.specom.2023.102960","DOIUrl":"10.1016/j.specom.2023.102960","url":null,"abstract":"<div><p>The paper introduces a real-time fingerspelling-to-text translation system for the Azerbaijani Sign Language (AzSL), targeted to the clarification of the words with no available or ambiguous signs. The system consists of both statistical and probabilistic models, used in the sign recognition and sequence generation phases. Linguistic, technical, and <em>human–computer interaction</em>-related challenges, which are usually not considered in publicly available sign-based recognition application programming interfaces and tools, are addressed in this study. The specifics of the AzSL are reviewed, feature selection strategies are evaluated, and a robust model for the translation of hand signs is suggested. The two-stage recognition model exhibits high accuracy during real-time inference. Considering the lack of a publicly available dataset with the benchmark, a new, comprehensive AzSL dataset consisting of 13,444 samples collected by 221 volunteers is described and made publicly available for the sign language recognition community. To extend the dataset and make the model robust to changes, augmentation methods and their effect on the performance are analyzed. A lexicon-based validation method used for the probabilistic analysis and candidate word selection enhances the probability of the recognized phrases. Experiments delivered 94% accuracy on the test dataset, which was close to the real-time user experience. The dataset and implemented software are shared in a public repository for review and further research (CeDAR, 2021; Alishzade et al., 2022). The work has been presented at TeknoFest 2022 and ranked as the first in the category of <em>social-oriented technologies</em>.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"153 ","pages":"Article 102960"},"PeriodicalIF":3.2,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46498442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new time–frequency representation based on the tight framelet packet for telephone-band speech coding","authors":"Souhir Bousselmi, Kaïs Ouni","doi":"10.1016/j.specom.2023.102954","DOIUrl":"10.1016/j.specom.2023.102954","url":null,"abstract":"<div><p>To improve the quality and intelligibility of telephone-band speech coding, a new time–frequency representation based on a tight framelet packet transform is proposed in this paper. In the context of speech coding, the effectiveness of this representation stems from its resilience to quantization noise, and reconstruction stability. Moreover, it offers a sub-band decomposition and good time–frequency localization according to the critical bands of the human ear. The coded signal is obtained using dynamic bit allocation and optimal quantization of normalized framelet coefficients. The performances of the corresponding method are compared to the critically sampled wavelet packet transform. Extensive simulation revealed that the proposed speech coding scheme, which incorporates the tight framelet packet transform performs better than that based on the critically sampled wavelet packet transform. Furthermore, it ensures a high bit-rate reduction with negligible degradation in speech quality. The proposed coder is found to outperform the standard telephone-band speech coders in term of objective measures and subjective evaluations including a formal listening test. The subjective quality of our codec at 4 kbps is almost identical to the reference G.711 codec operating at 64 kbps.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"152 ","pages":"Article 102954"},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46455896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-time intelligibility affects the realization of French word-final schwa","authors":"Georgia Zellou , Ioana Chitoran , Ziqi Zhou","doi":"10.1016/j.specom.2023.102962","DOIUrl":"10.1016/j.specom.2023.102962","url":null,"abstract":"<div><p>Speech variation has been hypothesized to reflect both speaker-internal influences of lexical access on production and adaptive modifications to make words more intelligible to the listener. The current study considers categorical and gradient variation in the production of word-final schwa in French as explained by lexical access processes, phonological, and/or listener-oriented influences on speech production, while controlling for other factors. To that end, native French speakers completed two laboratory production tasks. In Experiment 1, speakers produced 32 monosyllabic words varying in lexical frequency in a word list production task with no listener feedback. In Experiment 2, speakers produced the same words to an interlocutor while completing a map task varying listener comprehension success across trials: in half the trials, the words are correctly perceived by the interlocutor; in half, there is misunderstanding. Results reveal that speakers are more likely to produce word-final schwa when there is explicit pressure to be intelligible to the interlocutor. Also, when schwa is produced, it is longer preceding a consonant-initial word. Taken together, findings suggest that there are both phonological and clarity-oriented influences on word-final schwa realization in French.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"152 ","pages":"Article 102962"},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45132784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of virtual human sign language translation based on speech recognition","authors":"Xin Li , Shuying Yang, Haiming Guo","doi":"10.1016/j.specom.2023.06.001","DOIUrl":"10.1016/j.specom.2023.06.001","url":null,"abstract":"<div><p>For the application problem of speech recognition to sign language translation, we conducted a study in two parts: improving speech recognition's effectiveness and promoting the application of sign language translation. The mainstream frequency-domain feature has achieved great success in speech recognition. However, it fails to capture the instantaneous gap in speech, and the time-domain feature makes up for this deficiency. In order to combine the advantages of frequency and time domain features, an acoustic architecture with a joint time domain encoder and frequency domain encoder is proposed. A new time-domain feature based on SSM (State-Space-Model) is proposed in the time- domain encoder and encoded using the GRU model. A new model, ConFLASH, is proposed in the frequency domain encoder, which is a lightweight model combining CNN and FLASH (a variant of the Transformer model). It not only reduces the computational complexity of the Transformer model but also effectively integrates the global modeling advantages of the Transformer model and the local modeling advantages of CNN. The Transducer structure is used to decode speech after the encoders are joined. This acoustic model is named GRU-ConFLASH- Transducer. On the self-built dataset and open-source dataset speechocean, it achieves optimal WER (Word Error Rate) of 2.6% and 4.7%. In addition, to better realize the visual application of sign language translation, a 3D virtual human model is designed and developed.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"152 ","pages":"Article 102951"},"PeriodicalIF":3.2,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48223785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}