{"title":"Order-aware Pairwise Intoxication Detection","authors":"Meng Ge, Ruixiong Zhang, Wei Zou, Xiangang Li, Cheng Gong, Longbiao Wang, J. Dang","doi":"10.1109/ISCSLP49672.2021.9362078","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362078","url":null,"abstract":"Alcoholic intoxication has always been and still is known as one of the major causes leading to traffic accidents and in-car conflicts. A system of intoxication detection is established to detect whether a person is intoxicated through the means of machine learning. The system would be able to provide significant assistance in the enforcement of traffic laws, which would ultimately save lives. However, most of the existing systems mainly attach great importance to the tested speaker’s characteristics of current speech, and ignore the existence of personalized differences in speech. To deal with this problem, we focus on modeling the measurable acousic change between the current state and the sober state of a speaker, instead of the current state in the existing scheme only. Furthermore, we are inspired by our discovery that the order-related cues (e.g. gender, time, location) on speaker and trip is largely relevant to alcoholic intoxication. Therefore, we incorporate order-related cues into the speechbased system in order to obtain better performance. 
Finally, extensive experimental results on the real-scene DiDi Drunk Dataset demonstrate that our proposed system achieves a significant improvement in AUC, from 74.1% to 84.9%.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129628260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
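The pairwise idea in the abstract above (model the change against the same speaker's sober state, then append order-related cues) can be sketched as follows; the function name, feature layout, and cue encoding are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def pairwise_features(current_feats, sober_feats, order_cues):
    """Hypothetical order-aware pairwise representation: compute the acoustic
    change relative to the same speaker's sober recording instead of using
    the current state alone, then append order-related cues (e.g. encoded
    gender, time, location) so a downstream classifier can exploit them."""
    delta = np.asarray(current_feats, dtype=float) - np.asarray(sober_feats, dtype=float)
    return np.concatenate([delta, np.asarray(order_cues, dtype=float)])
```

A binary classifier trained on such vectors sees speaker-relative changes rather than absolute acoustics, which is the personalization the abstract argues for.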
{"title":"Channel Interdependence Enhanced Speaker Embeddings for Far-Field Speaker Verification","authors":"Ling-jun Zhao, M. Mak","doi":"10.1109/ISCSLP49672.2021.9362108","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362108","url":null,"abstract":"Recognizing speakers from a distance using far-field microphones is difficult because of the environmental noise and reverberation distortion. In this work, we tackle these problems by strengthening the frame-level processing and feature aggregation of x-vector networks. Specifically, we restructure the dilated convolutional layers into Res2Net blocks to generate multi-scale frame-level features. To exploit the relationship between the channels, we introduce squeeze-and-excitation (SE) units to rescale the channels’ activations and investigate the best places to put these SE units in the Res2Net blocks. Based on the hypothesis that layers at different depth contain speaker information at different granularity levels, multi-block feature aggregation is introduced to propagate and aggregate the features at various depths. To optimally weight the channels and frames during feature aggregation, we propose a channel-dependent attention mechanism. Combining all of these enhancements leads to a network architecture called channel-interdependence enhanced Res2Net (CE-Res2Net). 
Results show that the proposed network achieves a relative improvement of about 16% in EER and 17% in minDCF on the VOiCES 2019 Challenge’s evaluation set.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129927377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
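A minimal numpy sketch of the squeeze-and-excitation rescaling the abstract describes, applied to a (channels, frames) feature map; the weight shapes and reduction factor are illustrative, not the paper's configuration:

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Squeeze-and-excitation over a (channels, frames) feature map:
    squeeze with a global average over frames, pass the descriptor through
    a bottleneck MLP, and rescale each channel by a sigmoid gate."""
    z = x.mean(axis=1)                          # squeeze: (C,)
    h = np.maximum(w1 @ z + b1, 0.0)            # ReLU bottleneck: (C/r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # sigmoid channel gates: (C,)
    return x * s[:, None]                       # excitation: rescale channels
```

In a trained network the gates learn to emphasize informative channels; with all-zero weights the gate is 0.5 for every channel, so the block simply halves the activations.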
{"title":"Frequency-specific Brain Network Dynamics during Perceiving Real Words and Pseudowords","authors":"Taiyang Guo, J. Dang, Gaoyan Zhang, Bin Zhao, M. Unoki","doi":"10.1109/ISCSLP49672.2021.9362052","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362052","url":null,"abstract":"Many studies used EEG to investigate the brain mechanism of semantic processing and the dynamic brain connectivity at the word level. However, it is requiring more detailed dynamic analysis within a word to clarify the onset of dynamic brain network activity and priming effect. For this reason, this study focused on syllable level within a word to investigate semantic processing brain network dynamics for perceiving spoken real words and pseudowords using EEG data with the constraint of fMRI-based network templates. Results illustrated that real words can activate speech perception brain network rapidly, then finished very soon. When perceiving pseudowords, the onset of the perception brain network was slower, but activity lasted longer. If the first syllable of a real word has clear categorical features, the semantic categorization brain network would respond to it quickly as the priming effect. 
The frequency-specific analysis showed that the theta, alpha, and beta brain rhythms contribute more to the semantic processing of Chinese real words than the gamma rhythm does.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127682096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consonantal Effects of Aspiration on Onset F0 in Cantonese","authors":"Xinran Ren, P. Mok","doi":"10.1109/ISCSLP49672.2021.9362106","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362106","url":null,"abstract":"Consonantal effects on onset f0 are implemented differently in different languages, namely, the duration of consonantal effect, the direction of f0 change and its perceptual importance vary cross-linguistically. This study aims to investigate consonantal effects of aspiration in a tone language, Cantonese. The results showed that aspiration had a raising effect on onset f0, that is, onset f0 after aspirated stops was higher than after unaspirated stops. Besides, the aspiration-related f0 perturbations can extend to around 100ms after voicing. However, unlike f0 as a secondary cue for stop contrasts in English, when voice onset time (VOT) becomes ambiguous, f0 was not strengthened for contrast enhancement in Cantonese as well as in L2 English. This indicates that although consonantal effects in Cantonese showed phonetically similar directions and comparable duration with native English, onset f0 was not used for phonological contrast enhancement.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123350619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Usability And Practicality of Speech Recording by Mobile Phones for Phonetic Analysis","authors":"Yihan Guan, Bin Li","doi":"10.1109/ISCSLP49672.2021.9362082","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362082","url":null,"abstract":"High-quality speech recording is critical to phonetic analysis. However, when professional equipment or a sound-proof booth is not accessible, such as in random sampling or during the current pandemic period, is it reliable and valid to use non-professional devices to record speech data? We selected ten devices and examined the frequency range and signal-to-noise ratio (SNR) of speech data they recorded. We also compared recordings in a quiet room with noise at a moderate level. The results showed that all devices recorded a wide frequency range, which covered speech frequency well. But, their SNRs differed significantly. Environmental noise also appeared to affect recording quality. We then analyzed fine-grained phonetic parameters of data recorded in the quiet room, including suprasegmental, segmental and phonation-related parameters. F0 was found relatively consistent in the recordings from all devices, but certain differences were captured in F1, F2 and Center of Gravity (CoG). F3 as well as parameters relevant to phonation analysis, on the other hand, showed high variations. Therefore, our findings suggest that non-professional devices such as mobile phones are reliable substitutes of professional recorders, at least in prosodic analysis for general purposes. 
Caution should be taken when values of F3 and phonation-related parameters are involved.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127028309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
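The SNR comparison described above takes only a few lines to reproduce; this sketch assumes a speech segment and a noise-only segment have already been cut from the same recording:

```python
import numpy as np

def snr_db(speech, noise):
    """Signal-to-noise ratio in dB, computed from a speech segment and a
    noise-only segment of the same recording (mean-power ratio)."""
    p_speech = np.mean(np.square(speech))
    p_noise = np.mean(np.square(noise))
    return 10.0 * np.log10(p_speech / p_noise)
```

For example, a speech segment with ten times the amplitude of the noise floor yields 20 dB, since power scales with the square of amplitude.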
{"title":"Context-dependent Label Smoothing Regularization for Attention-based End-to-End Code-Switching Speech Recognition","authors":"Zheying Huang, Peng Li, Ji Xu, Pengyuan Zhang, Yonghong Yan","doi":"10.1109/ISCSLP49672.2021.9362080","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362080","url":null,"abstract":"Previous works utilize the context-independent (CI) label smoothing regularization (LSR) method to prevent attention-based End-to-End (E2E) automatic speech recognition (ASR) model, which is trained with a cross entropy loss function and hard labels, from making over-confident predictions. But the CI LSR method does not make use of linguistic knowledge within and between languages in the case of code-switching speech recognition (CSSR). In this paper, we propose the context-dependent (CD) LSR method. According to code-switching linguistic knowledge, the output units are classified into several categories and several context dependency rules are made. Under the guidance of the context dependency rules, prior label distribution is generated dynamically according to the category of historical context, rather than being fixed. Thus, the CD LSR method can utilize the linguistic knowledge in the case of CSSR to further improve the performance of the model. Experiments on the SEAME corpus demonstrate the effects of the proposed method. 
The final system with the CD LSR method achieves the best performance with 37.21% mixed error rate (MER), obtaining up to 3.7% relative MER reduction compared to the baseline system with no LSR method.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129995938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
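One way to picture the context-dependent prior: instead of smoothing uniformly over the vocabulary, spread the smoothing mass only over output units in the category suggested by the history (here, the previous token's language). This is a hypothetical simplification; the paper's rule set and categories are richer:

```python
import numpy as np

def cd_smoothed_target(target_id, prev_lang, vocab_langs, eps=0.1):
    """Context-dependent label smoothing sketch: the target keeps 1 - eps,
    and the smoothing mass eps is spread over same-language units only (a
    stand-in for the paper's context dependency rules), falling back to
    uniform smoothing when no other same-language unit exists."""
    V = len(vocab_langs)
    mask = np.array([lang == prev_lang for lang in vocab_langs], dtype=float)
    mask[target_id] = 0.0                  # target excluded from smoothing mass
    if mask.sum() == 0.0:                  # fallback: context-independent LSR
        mask = np.ones(V)
        mask[target_id] = 0.0
    y = eps * mask / mask.sum()
    y[target_id] = 1.0 - eps
    return y
```

Training then minimizes cross-entropy against this dynamically generated soft target instead of a fixed uniform-smoothed one.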
{"title":"The Acoustic Correlates and Time Span of the Non-modal Phonation in Kunshan Wu Chinese","authors":"Wenwei Xu, P. Mok","doi":"10.1109/ISCSLP49672.2021.9362083","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362083","url":null,"abstract":"This study investigates the acoustic correlates and time span of the non-modal phonation in Kunshan Wu, a Northern Wu dialect spoken in a city neighboring Shanghai and Suzhou.While previous studies mostly believe that the non-modal phonation in the lower register in Wu dialects is breathier, the phonetic correlates and methods of measurement vary among researchers, and measurement biases render some results to be unreliable. In this study, twelve native speakers of different ages and genders were recorded for examination of the acoustics in isolated monosyllabic words.Results show that the lower register generally exhibit higher spectral tilts and more noise, which confirms that the non-modal phonation is breathier. Based on the time course of two measures that are consistently useful across age and gender, the time span of the non-modal phonation is on average eight-ninths of unchecked vowels or entire checked vowels from the onset. Moreover, in one of the lower register unchecked tones, the non-modal phonation is found to last no shorter than the modal phonation in the upper register.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130727697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Adaptive LASSO-based Sparse Time-Varying Complex AR Speech Analysis","authors":"K. Funaki","doi":"10.1109/ISCSLP49672.2021.9362085","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362085","url":null,"abstract":"Linear Prediction (LP) is commonly used in speech processing. In speech coding, the LP is used to remove the formant elements from the speech signal, and the residual is quantized by using the Algebraic code vector after removing pitch elements. In speech synthesis, the LP is also used to generate the glottal or residual excitation for the WaveNet. We have proposed a Time-Varying Complex AR (TV-CAR) speech analysis for an analytic signal to cope with the drawbacks of the LP, such as MMSE, Extended Least Square (ELS), that are the l2-norm optimization methods. We have already evaluated the performance on F0 estimation and robust automatic speech recognition. Recently, we have proposed l2-norm regularized LP-based TV-CAR analysis in the time-domain and the frequency-domain. The regularized TV-CAR method can estimate more accurate formant frequencies, and we have shown that the resulting LP residual makes it possible to estimate a more precise F0. On the other hand, sparse estimation based on l1-norm optimization has been focused on image processing that can extract meaningful information from colossal information. LASSO algorithm is an l1-norm regularized sparse algorithm. 
In this paper, adaptive LASSO-based TV-CAR analysis is proposed, and its performance is evaluated on F0 estimation.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130792625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
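As a toy illustration of l1-regularized AR estimation, the following fits a real-valued, time-invariant sparse AR model with plain LASSO via ISTA; the paper's method (adaptive LASSO on a time-varying complex AR model of the analytic signal) is considerably more involved:

```python
import numpy as np

def lasso_ar(x, order, lam, n_iter=500):
    """Fit sparse AR coefficients a minimizing ||y - X a||^2 / 2 + lam * ||a||_1
    with ISTA (proximal gradient descent with soft-thresholding)."""
    N = len(x)
    y = x[order:]
    # lagged design matrix: the row for time t holds x[t-1], ..., x[t-order]
    X = np.column_stack([x[order - k : N - k] for k in range(1, order + 1)])
    a = np.zeros(order)
    L = np.linalg.norm(X, 2) ** 2               # Lipschitz constant of the gradient
    for _ in range(n_iter):
        z = a - X.T @ (X @ a - y) / L           # gradient step on the LS term
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return a
```

On a synthetic AR(1) signal the first coefficient dominates and higher lags are shrunk toward zero, which is the sparsity the l1 penalty buys.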
{"title":"Age-Related Decline of Classifier Usage in Southwestern Mandarin","authors":"Yun Feng, Yan Feng, Chenwei Xie, William Shi-Yuan Wang","doi":"10.1109/ISCSLP49672.2021.9362071","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362071","url":null,"abstract":"This pilot study examined the age-related decline of classifier usage in Southwestern Mandarin through comparing classifier score and the number of the default classifier ge among older adults aged 50-87 years. They were grouped into healthy, mild cognitive impairment (MCI) and Alzheimer’s disease (AD) groups based on Montreal Cognitive Assessment. AD group used significantly more inappropriate classifiers and the default classifier ge than healthy group. Results also showed a significant correlation between cognitive/semantic abilities and classifier usage, indicating that the deficit of classifier usage was associated with declined cognitive/semantic abilities. Decline of semantic cognition with aging, specifically semantic store and processing, has been postulated as possible underlying explanation. However, no significant difference was found in classifier usage between healthy and MCI groups, which might reveal that their classifier usage and semantic cognition was still normal.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133609574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosodic Profiles of the Mandarin Speech Conveying Ironic Compliment","authors":"Shanpeng Li, Wentao Gu","doi":"10.1109/ISCSLP49672.2021.9362092","DOIUrl":"https://doi.org/10.1109/ISCSLP49672.2021.9362092","url":null,"abstract":"This study investigated prosodic profiles of the Mandarin speech conveying ironic compliment, an understudied subtype of irony in comparison to sarcasm. We compared two sets of utterances that shared the same text but conveyed ironic compliment or direct blaming, depending on the context. Ten prosodic parameters, including F0, intensity, speech rate, and voice quality features extracted from audio and EGG signals were analyzed. Results showed significant differences in all ten prosodic parameters between the utterances conveying these two attitudes. Moreover, the effects of (non-)keywords and speaker gender on prosodic profiles of the Mandarin speech conveying ironic compliment were also examined.","PeriodicalId":279828,"journal":{"name":"2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133952724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}