The effect of diacritization on Arabic speech recognition
Fawaz S. Al-Anzi, Dia AbuZeina
2017 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), 2017-10-01
DOI: 10.1109/AEECT.2017.8257758 (https://doi.org/10.1109/AEECT.2017.8257758)
Citation count: 4
Abstract
Arabic automatic speech recognition (ASR) is a successful application of natural language processing (NLP). However, formal Arabic text is generally written without diacritics, so a single written form can correspond to several pronunciations. That is, the Arabic writing system allows short vowels to be omitted, forcing the reader to rely on prior knowledge and the context of each word to infer the missing diacritics. For speech recognition, the textual training data can be either diacritized (also called vowelized) or non-diacritized. Using non-diacritized text poses a challenge for Arabic ASR, because the missing short vowels may introduce confusion into the learning process. This ambiguity yields a suboptimal acoustic model, which is one of the most important components of an ASR system. In this paper, we compare recognition performance using diacritized and non-diacritized text. In the experiments, we used the Carnegie Mellon University (CMU) PocketSphinx speech recognizer and a new in-house Modern Standard Arabic (MSA) continuous speech corpus containing 13.5 hours for training and 4.1 hours for testing. The text of the corpus was manually diacritized. For acoustic modelling, we used a phonetic tied-mixture (PTM) model. The experimental results show that the non-diacritized system achieved an accuracy of 76.4% (i.e., 1 - WER, where WER is the word error rate), while the diacritized system achieved 63.8%. The diacritized system scores lower because of slight differences in diacritics; the non-diacritized output, however, may be adequate for native Arabic speakers.
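
To make the accuracy figure (1 - WER) concrete, the following is a minimal Python sketch, not the authors' evaluation code. It assumes WER is the word-level edit distance divided by the number of reference words, and that diacritics are the Unicode marks U+064B to U+0652 plus U+0670; the helper names `strip_diacritics` and `word_error_rate` and the example sentence are hypothetical. It illustrates how a hypothesis that differs from the reference only in diacritics is penalized in the diacritized setting but not in the non-diacritized one.

```python
import re

# Assumed diacritic set: tanwin, fatha, damma, kasra, shadda, sukun
# (U+064B..U+0652) and the dagger alif (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return the text with the assumed Arabic diacritic marks removed."""
    return DIACRITICS.sub("", text)


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the first word differs only in its short vowels
# ("kataba" in the reference, "kutiba" in the hypothesis).
reference = "كَتَبَ الوَلَدُ"
hypothesis = "كُتِبَ الوَلَدُ"

diacritized_accuracy = 1 - word_error_rate(reference, hypothesis)
plain_accuracy = 1 - word_error_rate(
    strip_diacritics(reference), strip_diacritics(hypothesis)
)
print(diacritized_accuracy)  # 0.5
print(plain_accuracy)        # 1.0
```

In this toy case the same recognizer output is scored as 50% accurate against the diacritized reference and 100% accurate once diacritics are removed from both sides, which is the kind of gap the abstract attributes to slight differences in diacritics.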