{"title":"Speech synthesis using approximate matching of syllables","authors":"E. V. Raghavendra, B. Yegnanarayana, K. Prahallad","doi":"10.1109/SLT.2008.4777834","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777834","url":null,"abstract":"In this paper we propose a technique for a syllable-based speech synthesis system. While syllable-based synthesizers produce better-sounding speech than diphone- and phone-based ones, covering all syllables is a non-trivial issue. We address syllable coverage by approximating a syllable when the required one is not found. To verify our hypothesis, we conducted perceptual studies on manually modified sentences and found that our assumption is valid. Similar approaches have been used in speech synthesis, and the results show that such approximation produces intelligible speech of better quality than diphone units.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115443001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A similar content retrieval method for podcast episodes","authors":"Junta Mizuno, J. Ogata, Masataka Goto","doi":"10.1109/SLT.2008.4777899","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777899","url":null,"abstract":"Given podcasts (audio blogs) that are sets of speech files called episodes, this paper describes a method for retrieving episodes that have similar content. Although most previous retrieval methods were based on bibliographic information, tags, or users' playback behaviors without considering spoken content, our method can compute content-based similarity based on speech recognition results of podcast episodes even if the recognition results include some errors. To overcome those errors, it converts intermediate speech-recognition results to a confusion network containing competitive candidates, and then computes the similarity by using keywords extracted from the network. Experimental results with episodes that have different word accuracy and content showed that keywords obtained from competitive candidates were useful in retrieving similar episodes. To show relevant episodes, our method will be incorporated into PodCastle, a public web service that provides full-text searching of podcasts on the basis of speech recognition.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125146285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the effectiveness of features and sampling in extractive meeting summarization","authors":"Shasha Xie, Yang Liu, Hui-Ching Lin","doi":"10.1109/SLT.2008.4777864","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777864","url":null,"abstract":"Feature-based approaches are widely used in the task of extractive meeting summarization. In this paper, we analyze and evaluate the effectiveness of different types of features using forward feature selection in an SVM classifier. In addition to features used in prior studies, we introduce topic related features and demonstrate that these features are helpful for meeting summarization. We also propose a new way to resample the sentences based on their salience scores for model training and testing. The experimental results on both the human transcripts and recognition output, evaluated by the ROUGE summarization metrics, show that feature selection and data resampling help improve the system performance.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125567818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sub-word modeling of out of vocabulary words in spoken term detection","authors":"Igor Szöke, L. Burget, J. Černocký, M. Fapšo","doi":"10.1109/SLT.2008.4777893","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777893","url":null,"abstract":"This paper compares sub-word-based methods for the spoken term detection (STD) task and for phone recognition. Sub-word units are needed to search for out-of-vocabulary words. We compared words, phones, and multigrams. The maximal length and pruning of multigrams were investigated first, and two constrained methods of multigram training were then proposed. We evaluated on the NIST STD06 dev-set CTS data. The conclusion is that the proposed method improves phone accuracy by more than 9% relative and STD accuracy by more than 7% relative.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122946456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unexplored directions in spoken language technology for development","authors":"F. Weber, Kalika Bali, R. Rosenfeld, K. Toyama","doi":"10.1109/SLT.2008.4777825","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777825","url":null,"abstract":"The full range of possibilities for spoken-language technologies (SLTs) to impact poor communities has been only partially investigated, despite what appears to be strong potential. Voice interfaces raise fewer barriers for the illiterate, require less training to use, and are a natural choice for applications on cell phones, which have far greater penetration in the developing world than PCs. At the same time, critical lessons of existing technology projects in development still apply and require careful attention. We suggest how to expand the view of SLT for development, and discuss how its potential can realistically be explored.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128249959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An analysis of grammatical errors in non-native speech in English","authors":"J. Lee, S. Seneff","doi":"10.1109/SLT.2008.4777847","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777847","url":null,"abstract":"While a wide variety of grammatical mistakes may be observed in the speech of non-native speakers, the types and frequencies of these mistakes are not random. Certain parts of speech, for example, have been shown to be especially problematic for Japanese learners of English [1]. Modeling these errors can potentially enhance the performance of computer-assisted language learning systems. This paper presents an automatic method to estimate an error model from a non-native English corpus, focusing on articles and prepositions. A fine-grained analysis is achieved by conditioning the errors on appropriate words in the context.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"129 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128634944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Phonetic name matching for cross-lingual Spoken Sentence Retrieval","authors":"Heng Ji, R. Grishman, Wen Wang","doi":"10.1109/SLT.2008.4777895","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777895","url":null,"abstract":"Cross-lingual spoken sentence retrieval (CLSSR) remains a challenge, especially for queries including OOV words such as person names. This paper proposes a simple method of fuzzy matching between query names and phones of candidate audio segments. This approach has the advantage of avoiding some word decoding errors in automatic speech recognition (ASR). Experiments on Mandarin-English CLSSR show that phone-based searching and conventional translation-based searching are complementary. Adding phone matching achieved a 26.29% improvement in F-measure over searching on state-of-the-art machine translation (MT) output and 8.83% over entity translation (ET) output.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127751950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bob: A lexicon and pronunciation dictionary generator","authors":"V. Wan, J. Dines, A. Hannani, Thomas Hain","doi":"10.1109/SLT.2008.4777879","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777879","url":null,"abstract":"This paper presents Bob, a tool for managing lexicons and generating pronunciation dictionaries for automatic speech recognition systems. It aims to maintain a high level of consistency between lexicons and language modelling corpora by managing the text normalisation and lexicon generation processes in a single dedicated package. It also aims to maintain consistent pronunciation dictionaries by generating pronunciation hypotheses automatically and aiding their verification. The tool's design and functionality are described, and two case studies highlighting the importance of consistency and illustrating the use of the tool are reported.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121680972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic identification of gender & accent in spoken Hindi utterances with regional Indian accents","authors":"Kamini Malhotra, A. Khosla","doi":"10.1109/SLT.2008.4777902","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777902","url":null,"abstract":"In the past, significant effort has been focused on automatic extraction of information from speech signals. Most techniques have aimed at automatic speech recognition or speaker identification; automatic accent identification (AID) has received far less attention. This paper gives an approach to identifying the gender and accent of a speaker using the Gaussian mixture modeling technique. The proposed approach is text independent, identifies the accent among four regional Indian accents in spoken Hindi, and also identifies the gender. The accents worked upon are Kashmiri, Manipuri, Bengali and neutral Hindi. The Gaussian mixture model (GMM) approach precludes the need for speech segmentation during training and makes the implementation of the system very simple. When gender-dependent GMMs are used, the accent identification score is enhanced and gender is also correctly recognized. The results show that GMMs lend themselves to the accent and gender identification task very well. In this approach, spectral features have been incorporated in the form of mel frequency cepstral coefficients (MFCC). The approach can be extended to incorporate other regional accents in a very simple way.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129375953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sequential system combination for machine translation of speech","authors":"D. Karakos, S. Khudanpur","doi":"10.1109/SLT.2008.4777889","DOIUrl":"https://doi.org/10.1109/SLT.2008.4777889","url":null,"abstract":"System combination is a technique which has been shown to yield significant gains in speech recognition and machine translation. Most combination schemes perform an alignment between different system outputs in order to produce lattices (or confusion networks), from which a composite hypothesis is chosen, possibly with the help of a large language model. The benefit of this approach is two-fold: (i) whenever many systems agree with each other on a set of words, the combination output contains these words with high confidence; and (ii) whenever the systems disagree, the language model resolves the ambiguity based on the (probably correct) agreed upon context. The case of machine translation system combination is more challenging because of the different word orders of the translations: the alignment has to incorporate computationally expensive movements of word blocks. In this paper, we show how one can combine translation outputs efficiently, extending the incremental alignment procedure of (A-V.I. Rosti et al., 2008). A comparison between different system combination design choices is performed on an Arabic speech translation task.","PeriodicalId":186876,"journal":{"name":"2008 IEEE Spoken Language Technology Workshop","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132771317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}