2012 IEEE Spoken Language Technology Workshop (SLT): Latest Publications

Topic n-gram count language model adaptation for speech recognition
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424216
Md. Akmal Haidar, D. O'Shaughnessy
{"title":"Topic n-gram count language model adaptation for speech recognition","authors":"Md. Akmal Haidar, D. O'Shaughnessy","doi":"10.1109/SLT.2012.6424216","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424216","url":null,"abstract":"We introduce novel language model (LM) adaptation approaches using the latent Dirichlet allocation (LDA) model. Observed n-grams in the training set are assigned to topics using soft and hard clustering. In soft clustering, each n-gram is assigned to topics such that the total count of that n-gram for all topics is equal to the global count of that n-gram in the training set. Here, the normalized topic weights of the n-gram are multiplied by the global n-gram count to form the topic n-gram count for the respective topics. In hard clustering, each n-gram is assigned to a single topic with the maximum fraction of the global n-gram count for the corresponding topic. Here, the topic is selected using the maximum topic weight for the n-gram. The topic n-gram count LMs are created using the respective topic n-gram counts and adapted by using the topic weights of a development test set. We compute the average of the confidence measures: the probability of word given topic and the probability of topic given word. The average is taken over the words in the n-grams and the development test set to form the topic weights of the n-grams and the development test set respectively. 
Our approaches show better performance over some traditional approaches using the WSJ corpus.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130771343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
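The soft and hard clustering of n-gram counts described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, and the toy weights are hypothetical, and the topic weights are assumed to be given (in the paper they are derived from LDA-based confidence measures). For hard clustering, the sketch assumes the retained count is the global count scaled by the largest normalized topic weight, which is one reading of "the maximum fraction of the global n-gram count".

```python
from collections import defaultdict


def soft_topic_counts(global_counts, topic_weights):
    """Split each global n-gram count across all topics.

    The weights of each n-gram are normalized to sum to 1, so the
    per-topic counts of an n-gram add up to its global count.
    """
    counts = defaultdict(dict)
    for ngram, count in global_counts.items():
        weights = topic_weights[ngram]
        total = sum(weights)
        for topic, weight in enumerate(weights):
            counts[topic][ngram] = count * weight / total
    return dict(counts)


def hard_topic_counts(global_counts, topic_weights):
    """Assign each n-gram to the single topic with the largest weight.

    Assumption: the assigned count is the global count times the
    largest normalized weight (the "maximum fraction").
    """
    counts = defaultdict(dict)
    for ngram, count in global_counts.items():
        weights = topic_weights[ngram]
        best = max(range(len(weights)), key=weights.__getitem__)
        counts[best][ngram] = count * weights[best] / sum(weights)
    return dict(counts)


if __name__ == "__main__":
    # Toy data (hypothetical): two bigrams, two topics.
    global_counts = {("stock", "market"): 10, ("speech", "signal"): 4}
    topic_weights = {
        ("stock", "market"): [0.6, 0.2],
        ("speech", "signal"): [0.1, 0.3],
    }
    print(soft_topic_counts(global_counts, topic_weights))
    print(hard_topic_counts(global_counts, topic_weights))
```

Each per-topic count table could then be passed to a standard n-gram LM estimator to build one topic LM per topic; that step is not shown here.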
Statistical methods for varying the degree of articulation in new HMM-based voices
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424238
B. Picart, Thomas Drugman, T. Dutoit
{"title":"Statistical methods for varying the degree of articulation in new HMM-based voices","authors":"B. Picart, Thomas Drugman, T. Dutoit","doi":"10.1109/SLT.2012.6424238","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424238","url":null,"abstract":"This paper focuses on the automatic modification of the degree of articulation (hypo/hyperarticulation) of an existing standard neutral voice in the framework of HMM-based speech synthesis. Starting from a source speaker for which neutral, hypo and hyperarticulated speech data are available, two sets of transformations are computed during the adaptation of the neutral speech synthesizer. These transformations are then applied to a new target speaker for which no hypo/hyperarticulated recordings are available. Four statistical methods are investigated, differing in the speaking style adaptation technique (MLLR vs. CMLLR) and in the speaking style transposition approach (phonetic vs. acoustic correspondence) they use. This study focuses on the prosody model although such techniques can be applied to any stream of parameters exhibiting suited interpolability properties. Two subjective evaluations are performed in order to determine which statistical transformation method achieves the better segmental quality and reproduction of the articulation degree.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133893657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Automatic classification of unequal lexical stress patterns using machine learning algorithms
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424255
M. Shahin, B. Ahmed, K. Ballard
{"title":"Automatic classification of unequal lexical stress patterns using machine learning algorithms","authors":"M. Shahin, B. Ahmed, K. Ballard","doi":"10.1109/SLT.2012.6424255","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424255","url":null,"abstract":"Technology-based speech therapy systems are severely handicapped by the absence of accurate prosodic event identification algorithms. This paper introduces an automatic method for classifying strong-weak (SW) and weak-strong (WS) stress patterns in the speech of children with an American English accent, for use in the assessment of speech dysprosody. We investigate the ability of two sets of features used to train classifiers to identify the variation in lexical stress between two consecutive syllables. The first set consists of traditional features derived from measurements of pitch, intensity and duration, whereas the second set consists of energies of different filter banks. Three different classifiers were used in the experiments: an Artificial Neural Network (ANN) classifier with a single hidden layer, a Support Vector Machine (SVM) classifier with both linear and Gaussian kernels, and a Maximum Entropy (MaxEnt) model. Best results were obtained using an ANN classifier and a combination of the two sets of features.
The system correctly classified 94% of the SW stress patterns and 76% of the WS stress patterns.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132354438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Combining multiple translation systems for Spoken Language Understanding portability
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424221
Fernando García, L. Hurtado, E. Segarra, E. Arnal, G. Riccardi
{"title":"Combining multiple translation systems for Spoken Language Understanding portability","authors":"Fernando García, L. Hurtado, E. Segarra, E. Arnal, G. Riccardi","doi":"10.1109/SLT.2012.6424221","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424221","url":null,"abstract":"We are interested in the problem of learning Spoken Language Understanding (SLU) models for multiple target languages. Learning such models requires annotated corpora, and porting to different languages would require corpora with parallel text translation and semantic annotations. In this paper we investigate how to learn an SLU model in a target language starting from no target text and no semantic annotation. Our proposed algorithm is based on the idea of exploiting the diversity (with regard to performance and coverage) of multiple translation systems to transfer statistically stable word-to-concept mappings in the case of the Romance language pair French and Spanish. Each translation system performs differently at the lexical level (with respect to BLEU). The best translation system performance for the semantic task is gained from their combination at different stages of the portability methodology. We have evaluated the portability algorithms on the French MEDIA corpus, using French as the source language and Spanish as the target language.
The experiments show the effectiveness of the proposed methods with respect to the source language SLU baseline.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114692441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Joint language models for automatic speech recognition and understanding
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424222
Ali Orkan Bayer, G. Riccardi
{"title":"Joint language models for automatic speech recognition and understanding","authors":"Ali Orkan Bayer, G. Riccardi","doi":"10.1109/SLT.2012.6424222","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424222","url":null,"abstract":"Language models (LMs) are one of the main knowledge sources used by automatic speech recognition (ASR) and Spoken Language Understanding (SLU) systems. In ASR systems they are optimized to decode words from speech for a transcription task. In SLU systems they are optimized to map words into concept constructs or interpretation representations. Performance optimization is generally designed independently for ASR and SLU models in terms of word accuracy and concept accuracy respectively. However, the best word accuracy performance does not always yield the best understanding performance. In this paper we investigate how LMs originally trained to maximize word accuracy can be parametrized to account for speech understanding constraints and maximize concept accuracy. Incremental reduction in concept error rate is observed when an LM is trained on word-to-concept mappings. We show how to optimize the joint transcription and understanding task performance in the lexical-semantic relation space.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121327277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Syllable-based prosodic analysis of Amharic read speech
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424232
O. Jokisch, Y. Gebremedhin, R. Hoffmann
{"title":"Syllable-based prosodic analysis of Amharic read speech","authors":"O. Jokisch, Y. Gebremedhin, R. Hoffmann","doi":"10.1109/SLT.2012.6424232","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424232","url":null,"abstract":"Amharic is the official language of Ethiopia and belongs to the under-resourced languages. Analyzing a new corpus of Amharic read speech, this contribution surveys syllable-based prosodic variations in f0, duration and intensity to develop suitable prosody models for speech synthesis and recognition. The article starts with a brief description of the Amharic script, the prosodic analysis methods, an accentuation experiment using resynthesis and a perceptual test. The main part summarizes stress-related analysis results for f0, duration and intensity and their interrelations. The quantitative variations of Amharic are comparable with the range in well-examined languages. The observed modifications in syllable duration and mean f0 proved to be relevant for stress perception as demonstrated in the perceptual test with resynthesis stimuli.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"11 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124615247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424213
Yosuke Kashiwagi, Masayuki Suzuki, N. Minematsu, K. Hirose
{"title":"Audio-visual feature integration based on piecewise linear transformation for noise robust automatic speech recognition","authors":"Yosuke Kashiwagi, Masayuki Suzuki, N. Minematsu, K. Hirose","doi":"10.1109/SLT.2012.6424213","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424213","url":null,"abstract":"Multimodal speech recognition is a promising approach to noise-robust automatic speech recognition (ASR) and is currently attracting the attention of many researchers. Multimodal ASR utilizes not only audio features, which are sensitive to background noise, but also non-audio features such as lip shapes to achieve noise robustness. Although various methods have been proposed to integrate audio-visual features, there is still ongoing discussion on how the best integration of audio and visual features is realized. Weights of audio and visual features should be decided according to the noise features and levels: in general, larger weights should be given to visual features when the noise level is high and vice versa, but how can this be controlled? In this paper, we propose a method based on piecewise linear transformation in feature integration. In contrast to other feature integration methods, our proposed method can appropriately change the weight depending on the state of an observed noisy feature, which has information both on uttered phonemes and environmental noise.
Experiments on noisy speech recognition are conducted following CENSREC-1-AV, and a word error reduction rate of around 24% is achieved on average compared to a decision fusion method.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116622854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
The FAU Video Lecture Browser system
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424256
K. Riedhammer, Martin Gropp, E. Nöth
{"title":"The FAU Video Lecture Browser system","authors":"K. Riedhammer, Martin Gropp, E. Nöth","doi":"10.1109/SLT.2012.6424256","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424256","url":null,"abstract":"A growing number of universities and other educational institutions provide recordings of lectures and seminars as an additional resource to the students. In contrast to educational films that are scripted, directed and often shot by film professionals, these plain recordings are typically not post-processed in an editorial sense. Thus, the videos often contain longer periods of inactivity or silence, unnecessary repetitions, or corrections of prior mistakes. This paper describes the FAU Video Lecture Browser system, a web-based platform for the interactive assessment of video lectures, that helps to close the gap between a plain recording and a useful e-learning resource by displaying automatically extracted and ranked key phrases on an augmented time line based on stream graphs. In a pilot study, users of the interface were able to complete a topic localization task about 29 % faster than users provided with the video only while achieving about the same accuracy. The user interactions can be logged on the server to collect data to evaluate the quality of the phrases and rankings, and to train systems that produce customized phrase rankings.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126697955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424229
Florian Hinterleitner, C. Norrenbrock, S. Möller, U. Heute
{"title":"What makes this voice sound so bad? A multidimensional analysis of state-of-the-art text-to-speech systems","authors":"Florian Hinterleitner, C. Norrenbrock, S. Möller, U. Heute","doi":"10.1109/SLT.2012.6424229","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424229","url":null,"abstract":"This paper presents research on perceptual quality dimensions of synthetic speech. We generated 57 stimuli from 16/19 female/male German text-to-speech systems (TTS) and asked listeners to judge the perceptual distances between them in a sorting task. Through a subsequent multidimensional scaling algorithm, we extracted three dimensions. Via expert listening and a comparison to ratings gathered on 16 attribute scales, the three dimensions can be assigned to naturalness of voice, temporal distortions and calmness. These dimensions are discussed in detail and compared to the perceptual quality dimensions from previous multidimensional analyses. Moreover, the results are analyzed depending on the type of TTS system. The identified dimensions will be used in the future to build a dimension-based quality predictor for synthetic speech.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127900811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Optimization of the DET curve in speaker verification
2012 IEEE Spoken Language Technology Workshop (SLT) Pub Date : 2012-12-01 DOI: 10.1109/SLT.2012.6424243
Leibny Paola García-Perera, J. Nolazco-Flores, B. Raj, R. Stern
{"title":"Optimization of the DET curve in speaker verification","authors":"Leibny Paola García-Perera, J. Nolazco-Flores, B. Raj, R. Stern","doi":"10.1109/SLT.2012.6424243","DOIUrl":"https://doi.org/10.1109/SLT.2012.6424243","url":null,"abstract":"Speaker verification systems are, in essence, statistical pattern detectors which can trade off false rejections for false acceptances. Any operating point characterized by a specific tradeoff between false rejections and false acceptances may be chosen. Training paradigms in speaker verification systems however either learn the parameters of the classifier employed without actually considering this tradeoff, or optimize the parameters for a particular operating point exemplified by the ratio of positive and negative training instances supplied. In this paper we investigate the optimization of training paradigms to explicitly consider the tradeoff between false rejections and false acceptances, by minimizing the area under the curve of the detection error tradeoff curve. To optimize the parameters, we explicitly minimize a mathematical characterization of the area under the detection error tradeoff curve, through generalized probabilistic descent. Experiments on the NIST 2008 database show that for clean signals the proposed optimization approach is at least as effective as conventional learning. 
On noisy data, verification performance obtained with the proposed approach is considerably better than that obtained with conventional learning methods.","PeriodicalId":375378,"journal":{"name":"2012 IEEE Spoken Language Technology Workshop (SLT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127161673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
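The quantity minimized in the abstract above, the area under the detection error tradeoff curve, can be illustrated with a small sketch. It rests on stated assumptions and is not the authors' formulation: the area is taken on linear false-acceptance and false-rejection axes, where it equals the fraction of (target, impostor) score pairs that are ranked in the wrong order, and the sigmoid-smoothed variant with its `slope` parameter is a hypothetical differentiable surrogate of the kind a gradient-based procedure would need. All function names are illustrative.

```python
import math


def det_area(target_scores, impostor_scores):
    """Empirical area under the false-rejection vs. false-acceptance curve.

    On linear axes this equals the fraction of (target, impostor) pairs
    in which the impostor scores higher than the target; ties count 0.5.
    """
    errors = 0.0
    for t in target_scores:
        for i in impostor_scores:
            if i > t:
                errors += 1.0
            elif i == t:
                errors += 0.5
    return errors / (len(target_scores) * len(impostor_scores))


def smooth_det_area(target_scores, impostor_scores, slope=5.0):
    """Smoothed surrogate: the 0/1 step is replaced by a sigmoid.

    sigmoid(x) is written as 0.5 * (1 + tanh(x / 2)) to avoid overflow.
    `slope` (an assumed parameter) controls how closely the surrogate
    follows the step function.
    """
    total = 0.0
    for t in target_scores:
        for i in impostor_scores:
            total += 0.5 * (1.0 + math.tanh(0.5 * slope * (i - t)))
    return total / (len(target_scores) * len(impostor_scores))


if __name__ == "__main__":
    # Toy scores (hypothetical): higher means "more likely the target speaker".
    targets = [1.0, 3.0]
    impostors = [2.0, 0.0]
    print(det_area(targets, impostors))
    print(smooth_det_area(targets, impostors))
```

Because the smoothed version is differentiable in the scores, its gradient with respect to the classifier parameters could drive an iterative update; how the paper defines and optimizes its own characterization is not specified in the abstract.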