Construction and Annotation of the Jordan Comprehensive Contemporary Arabic Corpus (JCCA)
M. Sawalha, Faisal Alshargi, A. AlShdaifat, S. Yagi, Mohammad A. Qudah
WANLP@ACL 2019, published 2019-08-01
DOI: 10.18653/v1/W19-4616 (https://doi.org/10.18653/v1/W19-4616)
Citations: 5
Abstract
To compile a modern dictionary that catalogues the words in current use, and to study linguistic patterns in the contemporary language, it is necessary to have a corpus of authentic texts that reflect current usage. Although there are numerous Arabic corpora, none claims to be representative of the language in terms of the combination of geographical region, genre, subject matter, mode, and medium. This paper describes a 100-million-word corpus that takes the British National Corpus (BNC) as a model. The corpus aims to be balanced, annotated, comprehensive, and representative of contemporary Arabic as written and spoken in Arab countries today. It will differ from most others in that it is neither heavily dominated by news text nor a mixture of classical and modern Arabic. The paper outlines the methodology adopted for the design, construction, and annotation of the corpus. DIWAN (Alshargi and Rambow, 2015) was used to annotate a one-million-word snapshot of the corpus. DIWAN is a dialectal word annotation tool; we extended it with a new tag-set based on traditional Arabic grammar and with the roots and morphological patterns of nouns and verbs. Moreover, the corpus we constructed covers the major spoken varieties of Arabic.