Arabic named entity recognition using boosting method

Mohamad Bagher Sajadi, Behrooz Minaei
Published in: 2017 Artificial Intelligence and Signal Processing Conference (AISP)
Publication date: 2017-10-01
DOI: 10.1109/AISP.2017.8324098 (https://doi.org/10.1109/AISP.2017.8324098)
Citations: 2

Abstract

In Natural Language Processing (NLP), developing resources and tools extends the reach and effectiveness of research in each language. In recent years, Arabic Named Entity Recognition (ANER) has attracted the attention of NLP researchers. While most of this research is based on Modern Standard Arabic (MSA), in this paper we focus on Classical Arabic (CA) literature. We introduce NoorCorp, a corpus of 200k labeled words, manually annotated by human experts for research purposes. We also collected about 18k proper names from old Hadith books as a gazetteer, called NoorGazet. Using ensemble learning, we develop a new approach for extracting named entities (NEs), including person, location, and organization. The AdaBoost.M2 algorithm, an implementation of the multiclass boosting method, is applied to train the prediction model. Results show that the method outperforms the decision tree used as its base classifier. We use tokenizing, part-of-speech (POS) tagging, and base phrase chunking (BPC) to overcome linguistic obstacles in Arabic. An overall F-measure of 96.04 is obtained. In addition, we study the effect of preprocessing and external resources on the system's results. Finally, we apply the proposed approach to ANERCorp, an MSA corpus, and compare the results with those obtained on NoorCorp.
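The abstract reports an overall F-measure but does not say how it is computed. As a minimal sketch, the F-measure is commonly taken as the harmonic mean of precision and recall over predicted entities. The example below assumes BIO-style tags (B-PER, I-PER, O) and exact-match scoring of entity spans; the tag scheme, the function names `bio_to_spans` and `f_measure`, and the sample sequences are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); a hypothetical representation
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence such as B-PER, I-PER, O."""
    spans: Set[Span] = set()
    start, kind = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        # A span continues only on an I- tag of the same entity type.
        inside = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, kind))
            start, kind = None, None
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
    return spans


def f_measure(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) with exact-match scoring of spans."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Made-up example: the prediction misses the final organization entity.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
precision, recall, f1 = f_measure(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here both predicted entities are correct, but one of the three gold entities is missed, so recall falls below precision and the F-measure lies between the two. Token-level scoring, which some NER evaluations use instead, would give different numbers.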