ADPBC: Arabic Dependency Parsing Based Corpora for Information Extraction

Sally Mohamed, M. Hussien, Hamdy M. Mousa
International Journal of Information Technology and Computer Science
Published: 2021-02-08
DOI: 10.5815/IJITCS.2021.01.04 (https://doi.org/10.5815/IJITCS.2021.01.04)
Citations: 2

Abstract

The World Wide Web holds a massive amount of varied information and data, and the number of Arabic users and the volume of Arabic content are growing rapidly. Information extraction is essential for accessing and sorting data on the web, but it is challenging, especially for morphologically complex languages such as Arabic. Consequently, the current trend is to build new corpora that make information extraction easier and more precise. This paper presents a linguistically analyzed Arabic corpus that includes dependency relations. The collected data covers five domains: sports, religion, weather, news, and biomedicine. The output is in the CoNLL universal lattice file format (CoNLL-UL). The corpus contains an index of the sentences and their linguistic metadata to enable quick mining and search across the corpus. It has seventeen morphological annotations and eight features, based on the identification of textual structures, that help to recognize and understand the grammatical characteristics of the text and to establish the dependency relations. Parsing and dependency annotation were performed with the universal dependency model and corrected manually. The designed Arabic corpus helps users quickly obtain linguistic annotations for a text and makes information extraction techniques easier and clearer to learn. The results show an average improvement in the dependency relations of the corpus.
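
As a rough illustration of the sentence index mentioned in the abstract, the sketch below reads dependency-annotated text and maps each annotation value to the sentences that contain it. It is not the authors' implementation. It assumes the ten tab-separated columns of plain CoNLL-U rather than the corpus's actual CoNLL-UL lattice layout, and the sample sentence and the names `parse_conllu` and `build_index` are hypothetical.

```python
from collections import defaultdict

# Standard CoNLL-U column names (an assumption; the corpus's CoNLL-UL
# lattice layout may use different columns).
COLUMNS = ["id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc"]

# Hypothetical two-token sample, not taken from the corpus.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tكتب\tكتب\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tالولد\tولد\tNOUN\t_\t_\t1\tnsubj\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Return a dict mapping sentence id to a list of token dicts."""
    sentences = {}
    sent_id, tokens = None, []
    # The extra "" guarantees the last sentence is flushed.
    for line in text.splitlines() + [""]:
        if line.startswith("# sent_id"):
            sent_id = line.split("=", 1)[1].strip()
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            if len(fields) == len(COLUMNS):
                tokens.append(dict(zip(COLUMNS, fields)))
        elif not line and tokens:
            # A blank line ends the sentence; fall back to a running
            # number when no sent_id comment was given.
            sentences[sent_id or str(len(sentences) + 1)] = tokens
            sent_id, tokens = None, []
    return sentences


def build_index(sentences, key="deprel"):
    """Map each value of one column to the ids of sentences containing it."""
    index = defaultdict(set)
    for sid, tokens in sentences.items():
        for tok in tokens:
            index[tok[key]].add(sid)
    return index


sentences = parse_conllu(SAMPLE)
deprel_index = build_index(sentences, "deprel")
```

Looking up `deprel_index["nsubj"]` then returns the ids of all sentences that contain a nominal subject relation. The same function can index by `upos` or `lemma` by changing `key`.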