Evaluation of the ambiguity caused by the absence of diacritical marks in Arabic texts: Statistical study
Authors: Mohamed Boudchiche, A. Mazroui
Published in: 2015 5th International Conference on Information & Communication Technology and Accessibility (ICTA), 2015-12-01
DOI: https://doi.org/10.1109/ICTA.2015.7426904
Citations: 13
Abstract
This work falls within the field of natural language processing. Its objective is to assess the level of ambiguity caused by the absence of diacritical marks in Arabic texts during information extraction. We carried out a statistical study based on four indicators: the root, the lemma, the stem, and the POS tag of each word. For this purpose, we used a large vowelized corpus of more than 80 million words collected from several sources. The study showed that the absence of diacritical marks is the main cause of the ambiguity observed in the information extraction process. We therefore conclude that using a vowelized corpus considerably reduces this ambiguity.
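
As a rough illustration of the kind of measurement the abstract describes, the sketch below strips diacritics from a few vowelized words, groups them by their unvowelized form, and counts how many distinct values of an indicator (root, lemma, or POS tag) each unvowelized form can take. The abstract does not give the authors' actual metrics or data format, so the `ENTRIES` records, the `average_ambiguity` function, and the per-form averaging are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Arabic diacritics: U+064B (fathatan) to U+0652 (sukun), plus U+0670 (dagger alif).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(word):
    """Return the word with its diacritical marks removed."""
    return DIACRITICS.sub("", word)


def average_ambiguity(entries, indicator):
    """Average number of distinct indicator values per unvowelized form.

    `entries` is an iterable of dicts holding a vowelized 'word' and the
    indicator fields; this record layout is hypothetical.
    """
    values = defaultdict(set)
    for entry in entries:
        values[strip_diacritics(entry["word"])].add(entry[indicator])
    return sum(len(v) for v in values.values()) / len(values)


# Toy data (hypothetical annotations, not taken from the paper's corpus).
ENTRIES = [
    {"word": "كَتَبَ", "root": "كتب", "lemma": "كَتَبَ", "pos": "verb"},
    {"word": "كُتُب", "root": "كتب", "lemma": "كِتَاب", "pos": "noun"},
    {"word": "كُتِبَ", "root": "كتب", "lemma": "كَتَبَ", "pos": "verb"},
    {"word": "قَلَم", "root": "قلم", "lemma": "قَلَم", "pos": "noun"},
]

if __name__ == "__main__":
    for indicator in ("root", "lemma", "pos"):
        print(indicator, average_ambiguity(ENTRIES, indicator))
```

A value of 1.0 means that removing the diacritics never changes which indicator value a form can have in the data; higher values mean the unvowelized form is compatible with several analyses. A corpus-scale study would likely also weight forms by their frequency, which this sketch does not do.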