New Language Resources for Arabic: Corpus Containing More Than Two Million Words and a Corpus Processing Tool

2013 International Conference on Asian Language Processing. Pub Date: 2013-08-17. DOI: 10.1109/IALP.2013.21

A. Al-Thubaity, Marwa Khan, Manal Al-Mazrua, Maram Al-Mousa


Citations: 26

Abstract

Arabic is a resource-poor language relative to other languages with a similar number of speakers. This situation negatively affects corpus-based linguistic studies in Arabic and, to a lesser extent, Arabic language processing. This paper presents a brief overview of recent freely available Arabic corpora and corpus processing tools, and it examines some of the issues that may be preventing Arabic linguists from using them. These issues reveal the need for new language resources to enrich and foster Arabic corpus-based studies. Accordingly, this paper introduces the design of a new Arabic corpus that includes Modern Standard Arabic varieties based on newspapers from all Arab countries and that comprises more than two million words. It also describes the main features of a corpus processing tool specifically designed for Arabic, called "Khawas غواص" ("diver" in English). Khawas provides more features than any other freely available corpus processing tool for Arabic, including n-gram frequency and concordance, collocations, and statistical comparison of two corpora. Finally, we outline modifications and improvements that could be made in future work.
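For readers unfamiliar with the corpus operations named in the abstract, the following is a minimal sketch of two of them: n-gram frequency counting and a keyword-in-context concordance. It is not Khawas's implementation, which the abstract does not describe. The function names (`tokenize`, `ngram_frequencies`, `concordance`), the whitespace tokenizer, and the English sample sentence are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace only (a simplifying assumption).

    A real Arabic tokenizer would also need to handle punctuation,
    diacritics, and attached clitics.
    """
    return text.split()


def ngram_frequencies(tokens, n):
    """Count every contiguous sequence of n tokens."""
    grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


def concordance(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


# Hypothetical sample text, not taken from the corpus described in the paper.
sample = "the diver found the pearl and the diver kept the pearl"
tokens = tokenize(sample)
bigrams = ngram_frequencies(tokens, 2)
lines = concordance(tokens, "diver", window=2)

for gram, count in bigrams.most_common(3):
    print(" ".join(gram), count)
for left, key, right in lines:
    print(f"{left:>10} [{key}] {right}")
```

The same functions work unchanged on Arabic text, because Python strings are Unicode. The quality of the counts would then depend mainly on the tokenizer.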


