{"title":"The Role of Rhythm and Vowel Space in Speech Recognition","authors":"Li-Fang Lai, J. G. Hell, John M. Lipski","doi":"10.21437/speechprosody.2022-87","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-87","url":null,"abstract":"This paper explores the role of rhythm and vowel space in automatic speech recognition (ASR), with a particular focus on Midland and Southern American English in the Appalachian region. Three sets of analysis were conducted. First, we computed the word error rates between the ground truth and the transcripts generated by DARLA. Consistent with previous studies, the results show higher error rates for Southern English (59.5%) than for Midland English (47.2%), suggesting a dialect gap in speech recognition. Next, we examined whether the error rates are influenced by rhythm. The results show that neither %V nor ΔV reliably predicted ASR performance. We also sought to draw a link between vowel space, speech intelligibility, and ASR performance. Three vowel space metrics were considered: convex hull, formant dispersion, and the polygon area. We noticed that as convex hull and formant dispersion increase, the error rates decrease, particularly for Midland speakers. This aligns with our hypothesis that more expanded vowel space enhances speech intelligibility, thus reducing the error rate for the Midland cohort. No clear connection between the polygon area, speech intelligibility, and error rates was found. 
These results, albeit suggestive, point out some promising directions for improving acoustic modeling in speech recognition.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123112458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
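The word error rates quoted in the abstract above compare a ground-truth transcript with an ASR transcript. As a minimal sketch of how such a figure is commonly computed (the function name `word_error_rate` and whitespace tokenization are assumptions made here for illustration, not details from the paper or from DARLA), WER is the word-level edit distance divided by the number of reference words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no case folding or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.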
{"title":"Listener adjustment of stress cue use to fit language vocabulary structure","authors":"Laurence Bruggeman, Jenny Yu, A. Cutler","doi":"10.21437/speechprosody.2022-54","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-54","url":null,"abstract":"In lexical stress languages, phonemically identical syllables can differ suprasegmentally (in duration, amplitude, F0). Such stress cues allow listeners to speed spoken-word recognition by rejecting mismatching competitors (e.g., unstressed set- in settee rules out stressed set- in setting, setter, settle). Such processing effects have indeed been observed in Spanish, Dutch and German, but English listeners are known to largely ignore stress cues. Dutch and German listeners even outdo English listeners in distinguishing stressed versus unstressed English syllables. This has been attributed to the relative frequency across the stress languages of unstressed syllables with full vowels; in English most unstressed syllables contain schwa instead, and stress cues on full vowels are thus least often informative in this language. If only informativeness matters, would English listeners who encounter situations where such cues would pay off for them (e.g., learning one of those other stress languages) then shift to using stress cues? Likewise, would stress cue users with English as L2, if mainly using English, shift away from using the cues in English? Here we report tests of these two questions, with each receiving a yes answer. 
We propose that English listeners’ disregard of stress cues is purely pragmatic.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116625739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Prosody 2022 | Pub Date: 2022-05-23 | DOI: 10.21437/speechprosody.2022-165
Sabine Zerbian, Marlene Böttcher, Yulia Zuban
{"title":"Prosody of contrastive adjectives in mono- and bilingual speakers of English and Russian: a corpus study","authors":"Sabine Zerbian, Marlene Böttcher, Yulia Zuban","doi":"10.21437/speechprosody.2022-165","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-165","url":null,"abstract":"The study reports on the frequency of occurrence and prosodic realization of adjective-noun phrases in which the adjective is contrastively focused. The productions of bilingual speakers are investigated in both their languages, Heritage Russian and majority English. The data are extracted from a corpus of semi-spontaneous speech which was collected in a comparable way from mono- and bilingual speakers in the U.S. and Russia. Results of the analysis show that there is a language-specific difference in that Russian speakers use ADJ_CF+N combinations less frequently than English speakers despite a reported parallel between the languages in terms of semantics and prosody. Moreover, English and Russian seem to differ in their accentuation pattern in ADJ_CF+N. Speakers of Russian as a Heritage Language frequently use double accents in ADJ_CF+N. Across English and Russian, double accents in ADJ_CF+N occur more frequently in formal than in informal situations, and more frequently in bilingual than in monolingual speakers. 
The results are discussed in light of the often reported tendency in heritage language grammars to avoid ambiguity.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125371139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Prosody 2022 | Pub Date: 2022-05-23 | DOI: 10.21437/speechprosody.2022-143
Nari Rhee, Jianjing Kuang, Aoju Chen
{"title":"The effect of musicality on the development of Mandarin prosody","authors":"Nari Rhee, Jianjing Kuang, Aoju Chen","doi":"10.21437/speechprosody.2022-143","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-143","url":null,"abstract":"Past work has shown a link between children’s musicality and language learning. But research is still sparse on the effect of musicality on the development of prosody, which uses tonal and temporal cues also relevant for processing music. In particular, the questions of when and how musicality affects the development of various aspects of the prosodic grammar remain largely unknown. In this study, we investigate the effect of musicality on the development of focus-marking in Mandarin-speaking 4- to 6-year-olds using speech data elicited in a controlled but interactive setting. We have found that the development of focus-marking in Mandarin is only weakly affected by the learner’s musicality. Specifically, children produce adult-like distinctions between on-focus and pre-focus positions, regardless of musicality. A musicality effect is observed in the contrast between on-focus and post-focus positions only in the 4-year-olds. The limited musicality effect on focus-marking is in contrast with our previous work, in which we found that musicality has a salient effect on the lexical tone production by children younger than 6 years. 
Together, the current results suggest that musicality advantage in the development of prosody depends on aspects of the prosodic grammar and the stage of development.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114263293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Àlex Peiró-Lilja, Guillermo Cámbara, M. Farrús, J. Luque
{"title":"Naturalness and Intelligibility Monitoring for Text-to-Speech Evaluation","authors":"Àlex Peiró-Lilja, Guillermo Cámbara, M. Farrús, J. Luque","doi":"10.21437/speechprosody.2022-91","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-91","url":null,"abstract":"Current text-to-speech (TTS) systems are deep learning-based models capable of learning phonetic articulation and intelligibility, as well as prosodic attributes that model speaking style, providing naturalness to synthetic voices. However, the performance of these models highly depends on their training of hyper-parameters and iterations. Besides, a conventional loss function does not reflect a correct voice modeling; thus, we believe a dedicated training assessment on TTS is needed. To this end, we monitor intelligibility and naturalness during training of Tacotron2 model in a 2-step process. First, we report the analysis of a method to follow up the intelligibility of the TTS in terms of character-level token error rate (TER) by using five different automatic speech recognition (ASR) systems. Second, we extend this work with a recently published TTS naturalness predictor that estimates this aspect in terms of mean opinion scores (MOS). Finally, we unify predicted MOS with TER measurements to return, over each training checkpoint, a single score that we name Full Assessment Score (FAS). We report the relevant preference of our listeners on the checkpoint with maximum FAS rather than the one with minimum validation loss, both in intelligibility and naturalness, up to 62.3% in the latter.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117265240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The effects of prosodic prominence on the acquisition of L2 phonological features","authors":"Fabián Santiago, Paolo Mairano, Bianca De Paolis","doi":"10.21437/speechprosody.2022-77","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-77","url":null,"abstract":"Mainstream L2 phonology models do not include predictions concerning how the prosodic structure interacts with the acquisition of segments. However, many studies have shown that the realization of pitch accents or melodic contours associated with prosodic boundaries results in the hyper-articulation of segments at such prosodic boundaries. Our goal is to provide empirical evidence for the positive effects of prosodic prominence on the acquisition of challenging L2 French sounds. The prosodic-phonetic interface has been largely underestimated in second language acquisition. Few studies have investigated whether prosodic prominence may serve as an optimal context for learners to extract information on the acoustic properties of new sounds, which may then be reflected in more accurate productions. In this paper, we report the acoustic patterns of L2 French vowels produced in two different prosodic conditions: (1) in word internal position (unaccented), (2) in initial and final boundaries of Accentual Phrases and Intonation Phrases. We analyzed oral productions by 40 participants: 10 French native speakers and 30 L2 French learners with L1 Spanish, L1 English and L1 Italian (10 each). We extracted acoustic parameters for ~15k vowels and calculated the degree of acoustic overlap via Pillai scores for the following triplets: /i/~/y/~/u/, /e/~/ø/~/o/. 
Our results show that prosodic prominence results in a smaller acoustic overlap of some L2 French vowel contrasts.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128337840","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
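The abstract above measures acoustic overlap between vowel categories with Pillai scores. The sketch below illustrates the general statistic only: it assumes each vowel token is an (F1, F2) pair and computes Pillai's trace as trace(H(H+E)^-1), where H and E are the between-group and within-group sums-of-squares matrices. The function name, the choice of two formants, and the sample values are all hypothetical; the abstract does not say which acoustic parameters or software the authors used.

```python
def pillai_score(groups):
    """Pillai's trace for groups of (F1, F2) points; lower means more overlap.

    `groups` is a list of vowel categories, each a list of (f1, f2) tuples.
    Assumption: two acoustic dimensions, so all matrices are 2x2.
    """
    all_points = [p for g in groups for p in g]
    n_total = len(all_points)
    grand = (sum(p[0] for p in all_points) / n_total,
             sum(p[1] for p in all_points) / n_total)

    # h = between-group SSCP, e = within-group SSCP, stored as [xx, xy, yy].
    h = [0.0, 0.0, 0.0]
    e = [0.0, 0.0, 0.0]
    for g in groups:
        n = len(g)
        mx = sum(p[0] for p in g) / n
        my = sum(p[1] for p in g) / n
        dx, dy = mx - grand[0], my - grand[1]
        h[0] += n * dx * dx
        h[1] += n * dx * dy
        h[2] += n * dy * dy
        for x, y in g:
            e[0] += (x - mx) ** 2
            e[1] += (x - mx) * (y - my)
            e[2] += (y - my) ** 2

    # t = h + e is the total SSCP; invert the symmetric 2x2 matrix in closed form.
    t = [h[i] + e[i] for i in range(3)]
    det = t[0] * t[2] - t[1] * t[1]
    if det == 0:
        raise ValueError("total SSCP matrix is singular (collinear data)")
    # trace(H T^-1) for symmetric 2x2 H and T.
    return (h[0] * t[2] - 2 * h[1] * t[1] + h[2] * t[0]) / det


# Hypothetical formant values in Hz, for illustration only.
vowel_a = [(300, 2200), (310, 2250), (290, 2230)]
vowel_b = [(700, 1200), (710, 1260), (690, 1220)]
separation = pillai_score([vowel_a, vowel_b])
```

With two groups the score lies between 0 (complete overlap) and 1 (no overlap); with a triplet of vowels and two acoustic dimensions the upper bound is 2.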
Speech Prosody 2022 | Pub Date: 2022-05-23 | DOI: 10.21437/speechprosody.2022-153
Stella Gryllia, K. Marcoux, Kathleen Jepson, A. Arvaniti
{"title":"The many shapes of H*","authors":"Stella Gryllia, K. Marcoux, Kathleen Jepson, A. Arvaniti","doi":"10.21437/speechprosody.2022-153","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-153","url":null,"abstract":"We examined individual and task-related variability in the realization of Greek nuclear H* followed by L-L% edge tones. The accents (N = 748) were elicited from native speakers of Greek, producing scripted and unscripted speech, and examined using functional Principal Components Analysis. The accented vowel onset was used for landmark registration to capture accent shape and the alignment of the fall. The resulting PCs were analysed using LMEMs (fixed factors: speaker; task type (scripted, unscripted); accented syllable distance from the analysis window offset, to examine the effects of tonal crowding). Tonal scaling and the steepness of the fall (reflected in PC1 and PC2 respectively) changed by task in ways that differed across speakers. PC3, which captured accent shape, also varied by speaker, reflecting shape differences between a rise-fall and (the expected) plateau-plus-fall realization. Tonal crowding did not have consistent effects. In short, the overall accent shape and the alignment of the accentual fall varied by speaker and task. These results hint at substantial variability in tonal realization. 
At the same time, they indicate that tonal alignment is not as consistent as is sometimes portrayed and thus it should not be the sole criterion for tone categorization.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128556677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Hierarchical Predictive Processing Approach to Modelling Prosody","authors":"J. Šimko, Adaeze Adigwe, A. Suni, M. Vainio","doi":"10.21437/speechprosody.2022-86","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-86","url":null,"abstract":"Prosodic patterns, and linguistic structures in general, are hierarchical in nature, providing for efficient means for encoding information in temporally constrained situations where communicative events occur. However, there are no theoretical frameworks that are capable of representing the full extent of linguistic behaviour in a cohesive way that could capture the paradigmatic and syntagmatic links between the organizational levels present in everyday speech. Here we propose a novel theoretical and modelling account of perception and production of prosodic patterns in speech communication, derived from the influential Predictive Processing theory of neural implementation of perception and action based on a hierarchical system of generative models producing progressively more detailed probabilistic predictions of future events. The framework provides a conceptualization of the hierarchical organization of speech prosody as well as a principled way of unifying speech perception and production by postulating a single processing hierarchy shared by both modalities. We discuss the possible implications of the theory for prosodic analysis of speech communication, including conversational setting. 
In addition, we outline a viable computational implementation in the form of a machine learning architecture that can be used as a testbed for generating and evaluating predictions brought forth by the theory.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128958480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Prosody 2022 | Pub Date: 2022-05-23 | DOI: 10.21437/speechprosody.2022-156
Xiaoqing Wang, Wentao Gu
{"title":"Effects of Gender and Language Proficiency on Phonetic Accommodation in Chinese EFL Learners","authors":"Xiaoqing Wang, Wentao Gu","doi":"10.21437/speechprosody.2022-156","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-156","url":null,"abstract":"Phonetic accommodation is ubiquitous in cross-linguistic/cultural speech communication. The present study examined the effects of gender and language proficiency on phonetic accommodation in Chinese EFL learners. Five vowels /i/, /u/, /æ/, /ɑ/ and /ʌ/ were embedded in a pair of syllables /hVt/ and /hVd/ to compose ten target words. Three groups of Chinese EFL learners differing in the level of English language proficiency (i.e., elementary, intermediate, and advanced) participated in the experiment. To elicit spontaneous conversational speech, a Diapix task embedded with all ten target words was conducted between each participant and a model talker who was a native speaker of American English. Also, each participant read aloud the ten words before and after the Diapix task. Phonetic accommodation was measured by acoustic analysis of vowel duration and formants. For vowel duration, the higher-proficiency learners converged more than the lower-proficiency ones. 
For vowel formants, a significant interaction effect was found between gender and language proficiency, i.e., females converged less than males in the advanced learners, whereas females converged more than males in the lower-proficiency learners.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128959317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Prosody 2022 | Pub Date: 2022-05-23 | DOI: 10.21437/speechprosody.2022-135
Anindita Nath, Nigel G. Ward
{"title":"On the Predictability of the Prosody of Dialog Markers from the Prosody of the Local Context","authors":"Anindita Nath, Nigel G. Ward","doi":"10.21437/speechprosody.2022-135","DOIUrl":"https://doi.org/10.21437/speechprosody.2022-135","url":null,"abstract":"Dialog markers, such as yeah and okay, generally seem to fit smoothly in the flow of dialog, with prosody that is natural and appropriate for the local context. We here examine this effect, specifically looking at the predictability of the prosody of dialog markers from the prosody of the local context. Using 72 prosodic features representing the local context, we built simple models able to predict the average pitch, log energy, cepstral flux, and harmonic ratio for the 12 most common dialog markers of American English. The models’ predictions accounted for over a third of the variance in the observed prosody, showing a modest but meaningful context dependence.","PeriodicalId":442842,"journal":{"name":"Speech Prosody 2022","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129045062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
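The abstract above reports that predictions accounted for over a third of the variance in the observed prosody. One common way to express "variance accounted for" is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below shows that calculation under the assumption that this is the intended measure; the abstract does not name the exact metric, and the function name `r_squared` is chosen here for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: share of variance in `observed`
    that is accounted for by `predicted` (assumed equal-length sequences)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Total variation around the mean of the observations.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    # Variation left unexplained by the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1 - ss_res / ss_tot
```

Under this reading, "over a third of the variance" corresponds to a value above roughly 0.33, while a model that always predicts the mean of the observations scores 0.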