阿拉伯语的新语言资源:包含超过两百万单词的语料库和语料库处理工具

A. Al-Thubaity, Marwa Khan, Manal Al-Mazrua, Maram Al-Mousa
{"title":"阿拉伯语的新语言资源:包含超过两百万单词的语料库和语料库处理工具","authors":"A. Al-Thubaity, Marwa Khan, Manal Al-Mazrua, Maram Al-Mousa","doi":"10.1109/IALP.2013.21","DOIUrl":null,"url":null,"abstract":"Arabic is a resource-poor language relative to other languages with a similar number of speakers. This situation negatively affects corpus-based linguistic studies in Arabic and, to a lesser extent, Arabic language processing. This paper presents a brief overview of recent freely available Arabic corpora and corpora processing tools, and it examines some of the issues that may be preventing Arabic linguists from using the same. These issues reveal the need for new language resources to enrich and foster Arabic corpus-based studies. Accordingly, this paper introduces the design of a new Arabic corpus that includes modern standard Arabic varieties based on newspapers from all Arab countries and that comprises more than two million words, it also describes the main features of a corpus processing tool specifically designed for Arabic, called \"Khawas ÛæÇÕ\" (\"diver\" in English). Khawas provides more features than any other freely available corpus processing tool for Arabic, including n-gram frequency and concordance, collocations, and statistical comparison of two corpora. Finally, we outline modifications and improvements that could be made in future works.","PeriodicalId":413833,"journal":{"name":"2013 International Conference on Asian Language Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":"{\"title\":\"New Language Resources for Arabic: Corpus Containing More Than Two Million Words and a Corpus Processing Tool\",\"authors\":\"A. Al-Thubaity, Marwa Khan, Manal Al-Mazrua, Maram Al-Mousa\",\"doi\":\"10.1109/IALP.2013.21\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Arabic is a resource-poor language relative to other languages with a similar number of speakers. This situation negatively affects corpus-based linguistic studies in Arabic and, to a lesser extent, Arabic language processing. This paper presents a brief overview of recent freely available Arabic corpora and corpora processing tools, and it examines some of the issues that may be preventing Arabic linguists from using the same. These issues reveal the need for new language resources to enrich and foster Arabic corpus-based studies. Accordingly, this paper introduces the design of a new Arabic corpus that includes modern standard Arabic varieties based on newspapers from all Arab countries and that comprises more than two million words, it also describes the main features of a corpus processing tool specifically designed for Arabic, called \\\"Khawas ÛæÇÕ\\\" (\\\"diver\\\" in English). Khawas provides more features than any other freely available corpus processing tool for Arabic, including n-gram frequency and concordance, collocations, and statistical comparison of two corpora. Finally, we outline modifications and improvements that could be made in future works.\",\"PeriodicalId\":413833,\"journal\":{\"name\":\"2013 International Conference on Asian Language Processing\",\"volume\":\"52 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-08-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"26\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 International Conference on Asian Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IALP.2013.21\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 International Conference on Asian Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IALP.2013.21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 26

摘要

相对于使用人数相似的其他语言,阿拉伯语是一种资源贫乏的语言。这种情况对以语料库为基础的阿拉伯语语言学研究产生负面影响,并在较小程度上影响阿拉伯语处理。本文介绍了最近免费提供的阿拉伯语料库和语料库处理工具的简要概述,并研究了一些可能阻止阿拉伯语言学家使用相同的问题。这些问题表明需要新的语言资源来丰富和促进基于阿拉伯文语料库的研究。因此,本文介绍了一个新的阿拉伯语料库的设计,该语料库包括基于所有阿拉伯国家的报纸的现代标准阿拉伯语品种,包括200多万字,它还描述了专门为阿拉伯语设计的语料库处理工具的主要特征,称为“Khawas ÛæÇÕ”(英语中的“diver”)。Khawas提供了比任何其他免费的阿拉伯语语料库处理工具更多的功能,包括n-gram频率和一致性,搭配和两个语料库的统计比较。最后,我们概述了在未来的工作中可以做出的修改和改进。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
New Language Resources for Arabic: Corpus Containing More Than Two Million Words and a Corpus Processing Tool
Arabic is a resource-poor language relative to other languages with a similar number of speakers. This situation negatively affects corpus-based linguistic studies in Arabic and, to a lesser extent, Arabic language processing. This paper presents a brief overview of recent freely available Arabic corpora and corpora processing tools, and it examines some of the issues that may be preventing Arabic linguists from using the same. These issues reveal the need for new language resources to enrich and foster Arabic corpus-based studies. Accordingly, this paper introduces the design of a new Arabic corpus that includes modern standard Arabic varieties based on newspapers from all Arab countries and that comprises more than two million words, it also describes the main features of a corpus processing tool specifically designed for Arabic, called "Khawas ÛæÇÕ" ("diver" in English). Khawas provides more features than any other freely available corpus processing tool for Arabic, including n-gram frequency and concordance, collocations, and statistical comparison of two corpora. Finally, we outline modifications and improvements that could be made in future works.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信