Evaluation of the ambiguity caused by the absence of diacritical marks in Arabic texts: Statistical study

2015 5th International Conference on Information & Communication Technology and Accessibility (ICTA). Pub Date: 2015-12-01. DOI: 10.1109/ICTA.2015.7426904

Mohamed Boudchiche, A. Mazroui

Citations: 13

Abstract

This work falls within the framework of Natural Language Processing. Its objective is to assess the level of ambiguity that the absence of diacritical marks in Arabic texts causes during information extraction. We carried out a statistical study based on four indicators: the root, the lemma, the stem and the POS tag of the word. For this purpose, we used a large vowelized corpus of more than 80 million words collected from several sources. The study showed that the absence of diacritical marks in Arabic texts is the main cause of the ambiguity observed in the information extraction process. We therefore conclude that using a vowelized corpus considerably reduces this ambiguity.
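The abstract does not describe how the ambiguity was computed, so the following Python sketch is only an illustration of the general idea, not the authors' procedure. It assumes that ambiguity for one indicator can be approximated by removing the diacritics from each vowelized word and counting how many distinct indicator values (here, POS tags) share the same unvowelized form. The function names, the tag labels, and the five-word sample are hypothetical.

```python
import re
from collections import defaultdict

# Arabic diacritical marks: fathatan through sukun (U+064B-U+0652)
# plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(word: str) -> str:
    """Return the unvowelized form of a vowelized Arabic word."""
    return DIACRITICS.sub("", word)


def values_by_form(annotated_words):
    """Map each unvowelized form to the set of indicator values seen for it."""
    groups = defaultdict(set)
    for word, value in annotated_words:
        groups[strip_diacritics(word)].add(value)
    return dict(groups)


def mean_ambiguity(annotated_words) -> float:
    """Average number of distinct indicator values per unvowelized form."""
    groups = values_by_form(annotated_words)
    return sum(len(values) for values in groups.values()) / len(groups)


# Hypothetical toy sample of (vowelized word, POS tag) pairs.
SAMPLE = [
    ("\u0643\u064E\u062A\u064E\u0628\u064E", "VERB"),  # kataba, "he wrote"
    ("\u0643\u064F\u062A\u064F\u0628", "NOUN"),        # kutub, "books"
    ("\u0639\u0650\u0644\u0652\u0645", "NOUN"),        # 'ilm, "knowledge"
    ("\u0639\u064E\u0644\u0650\u0645\u064E", "VERB"),  # 'alima, "he knew"
    ("\u0641\u0650\u064A", "PREP"),                    # fi, "in"
]

if __name__ == "__main__":
    for form, tags in values_by_form(SAMPLE).items():
        print(form, sorted(tags))
    print(round(mean_ambiguity(SAMPLE), 2))
```

In this toy sample, two of the three unvowelized forms map to both a noun and a verb reading, while the vowelized forms each carry a single tag. The same grouping could be repeated with roots, lemmas, or stems in place of POS tags to cover the other indicators named in the abstract.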
