Building the Classical Arabic Named Entity Recognition Corpus (CANERCorpus)

Ramzi Salah, Lailatul Qadri Binti Zakaria
2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP), published 2018-09-13
DOI: 10.1109/INFRKM.2018.8464820 (https://doi.org/10.1109/INFRKM.2018.8464820)
Citations: 14

Abstract

Over the past decade, background information resources have been constructed to overcome several challenges in text mining tasks. For non-English languages with scarce knowledge sources, such as Arabic, these challenges are more salient, especially for natural language processing applications that require human annotation. In the Named Entity Recognition (NER) task, several studies have addressed the complexity of Arabic in terms of morphological and syntactic variation. However, only a small number of studies deal with Classical Arabic (CA), which is the official language of the Quran and Hadith. CA was also used for archiving Islamic topics, which contain a lot of useful information that could be of great value if extracted. Therefore, in this paper, we introduce the Classical Arabic Named Entity Recognition corpus (CANERCorpus), a new corpus of tagged data that can be useful for handling issues in the recognition of Arabic named entities. It is freely available and manually annotated by human experts, and it contains more than 7,000 Hadiths. Based on Islamic topics, we classify named entities into 20 types, which include domain-specific entities that have not been handled before, such as Allah, Prophet, Paradise, Hell, and Religion. The differences between Standard and Classical Arabic are described in detail in this work. Moreover, a comprehensive statistical analysis is presented to measure the factors that play an important role in manual human annotation.