{"title":"A Survey and Comparative Study of Arabic Diacritization Tools","authors":"O. Hamed, Torsten Zesch","doi":"10.21248/jlcl.32.2017.213","DOIUrl":null,"url":null,"abstract":"Modern Standard Arabic, as well as other languages based on the Arabic script, are usually written without diacritics, which complicates many language processing tasks. Although many different approaches for automatic diacritization of Arabic have been proposed, it is still unclear what performance level can be expected in a practical setting. For that purpose, we first survey the Arabic diacritization tools in the literature and group the results by the corpus used for testing. We then conduct a comparative study between the available tools for diacritization (Farasa and Madamira) as well as two baselines. We evaluate the error rates for these systems using a set of publicly available, fully-diacritized corpora in two different evaluation modes. With the help of human annotators, we conduct an additional experiment examining error categories. We find that Farasa is outperforming Madamira and the baselines in both modes.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Lang. Technol. Comput. Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21248/jlcl.32.2017.213","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citation count: 24
Abstract
Modern Standard Arabic, like other languages based on the Arabic script, is usually written without diacritics, which complicates many language processing tasks. Although many approaches to automatic diacritization of Arabic have been proposed, it is still unclear what performance level can be expected in a practical setting. To address this, we first survey the Arabic diacritization tools in the literature and group the results by the corpus used for testing. We then conduct a comparative study of the available diacritization tools (Farasa and Madamira) and two baselines. We evaluate the error rates of these systems on a set of publicly available, fully diacritized corpora in two different evaluation modes. With the help of human annotators, we conduct an additional experiment examining error categories. We find that Farasa outperforms Madamira and the baselines in both modes.
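The abstract does not define the error rates or the two evaluation modes, so the following is only an illustrative sketch under stated assumptions. It assumes two metrics commonly reported for this task: a diacritic error rate (the share of letters whose diacritics differ from the reference) and a word error rate (the share of words containing at least one such letter). The function names `strip_diacritics`, `split_letters`, and `error_rates` are hypothetical and do not come from the paper, Farasa, or Madamira.

```python
# Assumption: diacritics are the Arabic combining marks U+064B..U+0652
# (tanween, short vowels, shadda, sukun). The paper may use a different set.
DIACRITICS = frozenset(chr(c) for c in range(0x064B, 0x0653))


def strip_diacritics(text):
    """Remove diacritics, giving the undiacritized form a tool would receive."""
    return "".join(ch for ch in text if ch not in DIACRITICS)


def split_letters(word):
    """Return a list of (base_letter, diacritics) pairs for one word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS:
            if pairs:  # attach the mark to the preceding base letter
                base, marks = pairs[-1]
                pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def error_rates(reference, hypothesis):
    """Return (diacritic_error_rate, word_error_rate) for two aligned texts."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words or len(ref_words) != len(hyp_words):
        raise ValueError("texts must be non-empty and have the same word count")

    letters = letter_errors = word_errors = 0
    for ref, hyp in zip(ref_words, hyp_words):
        if strip_diacritics(ref) != strip_diacritics(hyp):
            raise ValueError("base letters differ; texts are not aligned")
        ref_pairs, hyp_pairs = split_letters(ref), split_letters(hyp)
        # Compare marks as sorted strings so that shadda+vowel order is ignored.
        wrong = sum(
            sorted(r_marks) != sorted(h_marks)
            for (_, r_marks), (_, h_marks) in zip(ref_pairs, hyp_pairs)
        )
        letters += len(ref_pairs)
        letter_errors += wrong
        word_errors += wrong > 0

    return letter_errors / letters, word_errors / len(ref_words)


# Two-word example: the hypothesis gets only the final case ending wrong.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064F"
hypothesis = "\u0643\u064E\u062A\u064E\u0628\u064E \u0627\u0644\u0648\u064E\u0644\u064E\u062F\u064E"
print(error_rates(reference, hypothesis))  # → (0.125, 0.5)
```

In the example, one of eight letters carries the wrong diacritic, and that letter falls in one of the two words. This shows why a word-level rate is typically much higher than a letter-level rate for the same output.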