M. Draman, D. C. Tee, Z. Lambak, M. R. Yahya, M. I. M. Yusoff, S. Ibrahim, S. Saidon, N. A. Haris, T. Tan
Malay speech corpus of telecommunication call center preparation for ASR
2017 5th International Conference on Information and Communication Technology (ICoIC7), 2017-05-01
DOI: 10.1109/ICOICT.2017.8074675 (https://doi.org/10.1109/ICOICT.2017.8074675)
Citations: 9
Abstract
This paper presents the methodology used to prepare a conversational speech corpus for acoustic model training of a Malay automatic speech recognition (ASR) system in a telco call center. Careful data preparation is essential to building a robust model for an ASR system. We describe the issues encountered during the filtering process and list the sensitive data that must be removed to prevent personal information from leaking to third parties. We then manually transcribed the filtered data according to a set of transcription rules designed specifically for the Malay ASR engine. Finally, we analyzed the 5 hours of transcribed data to obtain N-gram models and word occurrence frequencies for our call center voice sample, which can help us develop a symptom-cause code matching application in the future.
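
The word-frequency and N-gram analysis mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe the authors' tooling, so everything here is an assumption: the whitespace tokenizer, the function names `tokenize` and `ngram_counts`, and the two sample utterances, which are hypothetical and not taken from the corpus.

```python
from collections import Counter


def tokenize(line):
    # Assumption: lowercase and split on whitespace. Real transcripts would
    # need handling that follows the corpus's own transcription rules.
    return line.lower().split()


def ngram_counts(transcripts, n):
    """Count every run of n consecutive tokens across all transcript lines."""
    counts = Counter()
    for line in transcripts:
        tokens = tokenize(line)
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hypothetical utterances for illustration only.
sample = [
    "internet saya tidak boleh sambung",
    "internet saya perlahan",
]

unigrams = ngram_counts(sample, 1)  # word occurrence frequencies
bigrams = ngram_counts(sample, 2)   # counts of adjacent word pairs

for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
```

Raw counts like these are only the starting point for an N-gram language model; turning them into probabilities usually also involves smoothing, which this sketch leaves out.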