Anqi Xu , Daniel R. van Niekerk , Branislav Gerazov , Paul Konstantin Krug , Peter Birkholz , Santitham Prom-on , Lorna F. Halliday , Yi Xu
{"title":"Artificial vocal learning guided by speech recognition: What it may tell us about how children learn to speak","authors":"Anqi Xu , Daniel R. van Niekerk , Branislav Gerazov , Paul Konstantin Krug , Peter Birkholz , Santitham Prom-on , Lorna F. Halliday , Yi Xu","doi":"10.1016/j.wocn.2024.101338","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101338","url":null,"abstract":"<div><p>It has long been a mystery how children learn to speak without formal instruction. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has encountered difficulty in overcoming the speaker normalisation problem, namely, discrepancies between children’s vocalisations and those of adults due to age-related anatomical differences. Here we show that vocal learning can be successfully simulated via recognition-guided vocal exploration without explicit speaker normalisation. We trained an articulatory synthesiser with three-dimensional vocal tract models of an adult and two child configurations of different ages to learn monosyllabic English words consisting of CVC syllables, based on coarticulatory dynamics and two kinds of auditory feedback: (i) acoustic features to simulate universal phonetic perception (or direct imitation), and (ii) a deep-learning-based speech recogniser to simulate native-language phonological perception. Native listeners were invited to evaluate the learned synthetic speech with natural speech as baseline reference. Results show that the English words trained with the speech recogniser were more intelligible than those trained with acoustic features, sometimes close to natural speech. 
The successful simulation of vocal learning in this study suggests that a combination of coarticulatory dynamics and native-language phonological perception may be critical also for real-life vocal production learning.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"105 ","pages":"Article 101338"},"PeriodicalIF":1.9,"publicationDate":"2024-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0095447024000445/pdfft?md5=941cb45273d2db483f6143ef8085a741&pid=1-s2.0-S0095447024000445-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141428706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wei-Rong Chen , Michael C. Stern , D.H. Whalen , Donald Derrick , Christopher Carignan , Catherine T. Best , Mark Tiede
{"title":"Assessing ultrasound probe stabilization for quantifying speech production contrasts using the Adjustable Laboratory Probe Holder for UltraSound (ALPHUS)","authors":"Wei-Rong Chen , Michael C. Stern , D.H. Whalen , Donald Derrick , Christopher Carignan , Catherine T. Best , Mark Tiede","doi":"10.1016/j.wocn.2024.101339","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101339","url":null,"abstract":"<div><p>Ultrasound imaging of the tongue is biased by the probe movements relative to the speaker’s head. Two common remedies are restricting or algorithmically compensating for such movements, each with its own challenges. We describe these challenges in detail and evaluate an open-source, adjustable probe stabilizer for ultrasound (ALPHUS), specifically designed to address these challenges by restricting uncorrectable probe movements while allowing for correctable ones (e.g., jaw opening) to facilitate naturalness. The stabilizer is highly modular and adaptable to different users (e.g., adults and children) and different research/clinical needs (e.g., imaging in both midsagittal and coronal orientations). The results of three experiments show that probe movement over uncorrectable degrees of freedom was negligible, while movement over correctable degrees of freedom that could be compensated through post-processing alignment was relatively large, indicating unconstrained articulation over parameters relevant for natural speech. Results also showed that probe movements as small as 5 mm or 2 degrees can neutralize phonemic contrasts in ultrasound tongue positions. 
This demonstrates that while stabilized but uncorrected ultrasound imaging can provide reliable tongue shape information (e.g., curvature or complexity), accurate tongue position (e.g., height or backness) with respect to vocal tract hard structure needs correction for probe displacement relative to the head.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"105 ","pages":"Article 101339"},"PeriodicalIF":1.9,"publicationDate":"2024-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141302459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The effect of breathy voice on tone identification by listeners of different ages in Suzhou Wu Chinese","authors":"Chunyu Ge, Peggy Mok","doi":"10.1016/j.wocn.2024.101330","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101330","url":null,"abstract":"<div><p>Suzhou Wu Chinese has undergone a transphonologization of a voicing contrast in initial consonants to a tone contrast. In consequence, the tone system has split into two registers, in which the high register tones are higher in pitch and modal voiced, whilst the low register tones are lower in pitch and breathy voiced. Our previous studies have found that breathy voice in the low register tones is disappearing in younger speakers’ production. This finding motivated us to investigate the effect of breathy voice on tone identification across age groups. Participants from three age groups completed a tone identification experiment. Stimuli were constructed based on natural tokens produced by a middle-aged female speaker and an older female speaker. The manipulation of phonation was accomplished by using the base syllables of both high and low register tones, for both unchecked (T1 vs. T2) and checked (T7 vs. T8) tone pairs. The results showed that breathy voice is still used by younger listeners in their perception and its effect on their tone identification is similar to that for older and middle-aged listeners. Moreover, the effect of breathy voice is modulated by social indexical factors (i.e., talker voice). 
The implications of the results for the origin of the loss of breathy voice in Suzhou Wu and the mechanism of sound change are discussed.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"105 ","pages":"Article 101330"},"PeriodicalIF":1.9,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140950688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conceição Cunha , Phil Hoole , Dirk Voit , Jens Frahm , Jonathan Harrington
{"title":"The physiological basis of the phonologization of vowel nasalization: A real-time MRI analysis of American and Southern British English","authors":"Conceição Cunha , Phil Hoole , Dirk Voit , Jens Frahm , Jonathan Harrington","doi":"10.1016/j.wocn.2024.101329","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101329","url":null,"abstract":"<div><p>The diachronic change by which coarticulatory nasalization increases in VN (vowel-nasal) sequences has been modelled as an earlier alignment of the velum combined with oral gesture weakening of N. The model was tested by comparing American (USE) and Standard Southern British English (BRE) based on the assumption that this diachronic change is more advanced in USE. Real-time MRI data was collected from 16 USE and 27 BRE adult speakers producing monosyllables with coda /Vn, Vnd, Vnz/. For USE, nasalization was greater in V, less in N, and there was greater tongue tip lenition than for BRE. The dialects showed a similar stability of the velum gesture and a trade-off between vowel nasalization and tongue tip lenition. Velum alignment was not earlier in USE. Instead, a closer approximation of the time of the tongue tip peak velocity towards the tongue tip maximum for USE caused a shift in the acoustic boundary within VN towards N, giving the illusion that the velum gesture has an earlier alignment in USE. 
It is suggested that coda reduction which targets the tongue tip more than the velum is a principal physiological mechanism responsible for the onset of diachronic vowel nasalization.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"105 ","pages":"Article 101329"},"PeriodicalIF":1.9,"publicationDate":"2024-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0095447024000354/pdfft?md5=a796ba209e07d6d7a77d5ad1e757f23d&pid=1-s2.0-S0095447024000354-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140918808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The phonetics of vowel intrusion in Sgi Bara","authors":"Don Daniels , Zoë Haupt , Melissa M. Baese-Berk","doi":"10.1016/j.wocn.2024.101323","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101323","url":null,"abstract":"<div><p>We provide a phonetic examination of intrusive vowels in Sgi Bara [jil]. These vowels are inserted in predictable places, and their quality (either [i], [ɨ], or [u]) is also predictable, so they are not considered phonemic. We demonstrate that they differ from phonemic vowels in their duration, being shorter; and in their articulation, being more peripheral; but not in their intensity. We then demonstrate how this phonetic understanding of the difference between intrusive and phonemic vowels can be used to answer phonological questions about Sgi Bara. We offer two case studies: phonologically ambiguous sequences of high vowels, and frequent two-word combinations that may be univerbating. The results confirm the existence of a distinction between intrusive and phonemic vowels.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"104 ","pages":"Article 101323"},"PeriodicalIF":1.9,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0095447024000299/pdfft?md5=4ed2ce41979d22264153fa5638e56f22&pid=1-s2.0-S0095447024000299-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140823935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acoustic cue sensitivity in the perception of native category and their relation to nonnative phonological contrast learning","authors":"Jieun Lee , Hanyong Park","doi":"10.1016/j.wocn.2024.101327","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101327","url":null,"abstract":"<div><p>Experiment 1 investigates whether individual differences in sensitivity to acoustic cues in L1 category perception measured by the Visual Analogue Scaling (VAS) task could explain individual variability in L2 phonological contrast learning [research question (RQ1)]. f0 is a solid cue for Korean three-way stop contrasts (i.e., lenis-aspirated stop distinction) but not for English voicing contrasts. Results showed that naïve English learners of Korean with more gradient performance in the VAS task, which was used as a proxy of f0 cue sensitivity in L1, had an advantage in L2 contrast learning. More gradient learners showed more nativelike f0 utilization during and after the High Variability Phonetic Training (HVPT), suggesting the transfer of L1 acoustic cue sensitivity to L2 learning. Experiment 2 examines whether the cue-attention switching training with L1 stimuli provided before HVPT sessions could aid learners by reallocating their attention away from the L2-irrelevant to the L2-relevant acoustic dimension (RQ2). Results demonstrated the effectiveness of the cue-attention switching training with L1 stimuli, especially to learners with less sensitivity to f0 in the VAS task. 
This study emphasizes the importance of considering individual differences in L2 training and shows the possibility of utilizing the VAS task as a pretraining assessment to predict the acquisition of L2 phonological contrasts and L2 cue-weighting strategies.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"104 ","pages":"Article 101327"},"PeriodicalIF":1.9,"publicationDate":"2024-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0095447024000330/pdfft?md5=298fd21f6b274b949b25732e7a11c234&pid=1-s2.0-S0095447024000330-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140605756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Being clear about clear speech: Intelligibility of hard-of-hearing-directed, non-native-directed, and casual speech for L1- and L2-English listeners","authors":"Nicholas B. Aoki, Georgia Zellou","doi":"10.1016/j.wocn.2024.101328","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101328","url":null,"abstract":"<div><p>Relative to one’s default (casual) speech, clear speech contains acoustic modifications that are often perceptually beneficial. Clear speech encompasses many different styles, yet most work only compares clear and casual speech as a binary. Furthermore, the term “clear speech” is often <em>unclear</em> − despite variation in elicitation instructions across studies (e.g., speak clearly, imagine an L2-listener or someone with hearing loss, etc.), the generic term “clear speech” is used when interpreting results, under the tacit assumption that clear speech is monolithic. The current study examined the acoustics and intelligibility of casual speech and two clear styles (hard-of-hearing-directed and non-native-directed speech). We find: (1) the clear styles are acoustically distinct (non-native-directed speech is slower with lower mean intensity and f0); (2) the clear styles are perceptually distinct (only hard-of-hearing-directed speech enhances intelligibility); (3) no differences in intelligibility benefits are observed between L1 and L2-listeners. These results underscore the importance of considering the intended interlocutor in speaking style elicitation, leading to a discussion about the issues that arise when reference to “clear speech” lacks clarity. 
It is suggested that to be more <em>clear</em> about clear speech, greater caution should be taken when interpreting results about speaking style variation.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"104 ","pages":"Article 101328"},"PeriodicalIF":1.9,"publicationDate":"2024-04-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0095447024000342/pdfft?md5=bd035ba46dd9b5604519609b4fb5bf11&pid=1-s2.0-S0095447024000342-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140551802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A minimal dynamical model of Intonation: Tone contrast, alignment, and scaling of American English pitch accents as emergent properties","authors":"Khalil Iskarous , Jennifer Cole , Jeremy Steffman","doi":"10.1016/j.wocn.2024.101309","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101309","url":null,"abstract":"<div><p>The pitch accent system of Mainstream American English (MAE) is one of the most well-studied phenomena within the Autosegmental-Metrical (AM) approach to intonation. In this work we present an explicit model grounded in dynamical theory that predicts both qualitative phonological and quantitative phonetic generalizations about the MAE system. While the traditional AM account separates a phonological model of the structure of the accents from the F0 algorithm that interprets the phonological specification, we propose a unified dynamical model that encompasses both. The proposed model is introduced incrementally, one dynamical term at a time, to arrive at the minimal model needed to account for observed empirical generalizations, avoiding unnecessary complexity. The quantitative and qualitative properties of the MAE system that inform the dynamical model are based on an analysis of a large database of productions of the four most well-studied pitch accents of American English: three rising accents (H*, L+H*, L*+H) and a low-falling accent (L*). The dynamic model highlights the importance of velocity-based measures of F0, not typically invoked in intonational research, as key to understanding F0 differences among pitch accent categories. 
Although the focus of this work is on the MAE pitch accent system, suggestions are made for how the unified phonetic-phonological dynamical framework presented can be further developed to account for other pitch-based phenomena in a variety of languages.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"104 ","pages":"Article 101309"},"PeriodicalIF":1.9,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140533585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An acoustic study on age-related changes in vowel production of Chinese","authors":"Chao Kong, Xueqing Long, Juan Liu","doi":"10.1016/j.wocn.2024.101324","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101324","url":null,"abstract":"<div><p>This paper investigates the relationship between vowel production and age using speech data from 109 Chinese L1 speakers (61 females and 48 males) covering an age range of 20 to 80 years. Acoustical estimation of vocal tract length (VTL) as well as multiple acoustic metrics are analyzed with generalized additive mixed models (GAMM). The results indicate that: (1) After controlling for VTL, <span><math><mrow><msub><mrow><mi>F</mi></mrow><mrow><mn>0</mn></mrow></msub></mrow></math></span> and duration, vowels show a centralization trend with increasing age, with a more significant effect observed in female speakers; (2) VTL does not significantly change with age; (3) The patterns observed in vowel distinctiveness and duration may present evidence contradicting the notion of vowel lengthening as a compensatory mechanism; (4) The patterns of age-related changes in different measurements and different genders are diverse. The U-shaped change patterns are found in the male speakers and the age around 50 may serve as a turning point. Based on these findings, we have explored some possible reasons for inconsistent conclusions in previous studies. 
The physiological aging phenomena of vowel production and potential compensatory mechanisms on motor control abilities, as well as other possible influencing factors, are also discussed.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"104 ","pages":"Article 101324"},"PeriodicalIF":1.9,"publicationDate":"2024-03-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140321032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Planning for the future and reacting to the present: Proactive and reactive F0 adjustments in speech","authors":"Seung-Eun Kim , Sam Tilsen","doi":"10.1016/j.wocn.2024.101322","DOIUrl":"https://doi.org/10.1016/j.wocn.2024.101322","url":null,"abstract":"<div><p>Previous studies have examined whether speakers initiate longer utterances with higher F0. Evidence for such effects is mixed and is mostly based on point estimates of F0 at the beginning of the utterance. Moreover, it is unknown whether utterance length can influence F0 control solely at utterance onset or also during the utterance. We conducted a sentence production task to investigate how control of pitch register – F0 ceiling, floor, and span – is influenced by utterance length. Specifically, we test whether speakers adjust register both in relation to an initially planned utterance length – <em>proactive</em> F0 control – and in response to changes in utterance length that occur after response onset – <em>reactive</em> F0 control. Target sentences in the experiment had one, two, or three subject noun phrases, which were cued with visual stimuli. An experimental manipulation was tested in which some visual stimuli were delayed until after participants initiated the utterance. Evidence for both proactive and reactive control of register was observed. Participants adopted a higher register ceiling and broader span in longer utterances. Furthermore, they decreased the amount of ceiling compression upon encountering delayed stimuli. 
The findings suggest the presence of a mechanism in which speakers continuously estimate the remaining length of the utterance and use that information to adjust pitch register.</p></div>","PeriodicalId":51397,"journal":{"name":"Journal of Phonetics","volume":"104 ","pages":"Article 101322"},"PeriodicalIF":1.9,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140290735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"文学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}