Construction and Annotation of the Jordan Comprehensive Contemporary Arabic Corpus (JCCA)

M. Sawalha, Faisal Alshargi, A. AlShdaifat, S. Yagi, Mohammad A. Qudah
Venue: WANLP@ACL 2019
Publication date: 2019-08-01
DOI: 10.18653/v1/W19-4616 (https://doi.org/10.18653/v1/W19-4616)
Citations: 5

Abstract

To compile a modern dictionary that catalogues the words in current use, and to study linguistic patterns in the contemporary language, it is necessary to have a corpus of authentic texts that reflect current usage. Although there are numerous Arabic corpora, none claims to be representative of the language in terms of the combination of geographical region, genre, subject matter, mode, and medium. This paper describes a 100-million-word corpus that takes the British National Corpus (BNC) as a model. The corpus aims to be balanced, annotated, comprehensive, and representative of contemporary Arabic as written and spoken in Arab countries today. It will differ from most other corpora in that it is not heavily dominated by news text and does not mix the classical with the modern. This paper outlines the methodology adopted for the design, construction, and annotation of the corpus. DIWAN (Alshargi and Rambow, 2015) was used to annotate a one-million-word snapshot of the corpus. DIWAN is a dialectal word annotation tool, which we upgraded by adding a new tag-set based on traditional Arabic grammar and by adding the roots and morphological patterns of nouns and verbs. Moreover, the corpus we constructed covers the major spoken varieties of Arabic.