A benchmark for evaluating Arabic word embedding models

IF 1.9 | Zone 3, Computer Science | Q3, Computer Science, Artificial Intelligence

Natural Language Engineering | Pub Date: 2022-10-17 | DOI: 10.1017/S1351324922000444

S. Yagi, A. Elnagar, Shehdeh Fareh

Published in Natural Language Engineering, vol. 29, pp. 978 - 1003.

Citations: 2

Abstract

Modelling the distributional semantics of a morphologically rich language such as Arabic needs to take into account its introflexive, fusional, and inflectional nature, attributes that make up its combinatorial sequences and substitutional paradigms. The benchmarks used thus far to evaluate word distributional models in Arabic have mimicked those in English. This paper reports on a benchmark that we designed to reflect linguistic patterns in both Contemporary Arabic and Classical Arabic, the first being a cover term for written and spoken Modern Standard Arabic and the second for pre-modern Arabic. The analogy items included in this benchmark are chosen in a transparent manner so that they capture the major features of nouns and verbs; derivational and inflectional morphology; high-, middle-, and low-frequency patterns and lexical items; and the morphosemantic, morphosyntactic, and semantic dimensions of the language. All categories included in this benchmark are carefully selected to ensure proper representation of the language. The benchmark consists of 45 roots of the triliteral, all-consonantal, and semivowel-inclusive types; six morphosemantic patterns (’af‘ala; ifta‘ala; infa‘ala; istaf‘ala; tafa‘‘ala; and tafā‘ala); five derivations (the verbal noun, the active participle, and the contrasts Masculine-Feminine, Feminine-Singular-Plural, and Masculine-Singular-Plural); morphosyntactic transformations (perfect and imperfect verbs conjugated for all pronouns); and lexical semantics (synonyms, antonyms, and hyponyms of nouns, verbs, and adjectives), as well as capital cities and currencies. All categories include an equal proportion of high-, medium-, and low-frequency items. To validate the proposed benchmark, we developed a set of embedding models from different textual sources. We then tested them intrinsically using the proposed benchmark and extrinsically using two natural language processing tasks: Arabic Named Entity Recognition and Text Classification. The evaluation leads to the conclusion that the proposed benchmark reflects this morphologically rich language and discriminates between word embeddings.
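To make the idea of intrinsic testing with analogy items concrete, the following is a minimal sketch of how one analogy category could be scored. It is not the authors' evaluation code: the abstract does not say which scoring rule was used, so the widely used 3CosAdd rule is assumed here. The toy vectors, the transliterated words, and the function names (`cosine`, `solve_analogy`, `analogy_accuracy`) are all hypothetical and chosen only to illustrate a verb to active-participle item.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def solve_analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' with the 3CosAdd rule (an assumption).

    The score of a candidate d is cos(d, b) - cos(d, a) + cos(d, c);
    the three query words themselves are excluded from the candidates.
    """
    def score(word):
        d = emb[word]
        return cosine(d, emb[b]) - cosine(d, emb[a]) + cosine(d, emb[c])

    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=score)

def analogy_accuracy(emb, items):
    """Return (accuracy, number of scored items).

    Items containing an out-of-vocabulary word are skipped rather than
    counted as errors; this is one possible convention, not the paper's.
    """
    covered = [item for item in items if all(w in emb for w in item)]
    if not covered:
        return 0.0, 0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in covered)
    return correct / len(covered), len(covered)

# Hypothetical 3-dimensional toy embeddings (real models use hundreds of dimensions).
TOY_EMBEDDINGS = {
    "kataba": [1.0, 0.0, 0.2],    # perfect verb, 'he wrote'
    "katib": [1.0, 1.0, 0.2],     # active participle, 'writer'
    "darasa": [0.0, 0.0, 1.0],    # perfect verb, 'he studied'
    "daris": [0.0, 1.0, 1.0],     # active participle, 'one who studies'
    "madrasa": [0.1, -0.5, 1.0],  # distractor, 'school'
}

# Each item is (a, b, c, expected d); the second item is out of vocabulary.
TOY_ITEMS = [
    ("kataba", "katib", "darasa", "daris"),
    ("kataba", "katib", "qara'a", "qari'"),
]

if __name__ == "__main__":
    accuracy, scored = analogy_accuracy(TOY_EMBEDDINGS, TOY_ITEMS)
    print(f"accuracy={accuracy:.2f} on {scored} scored item(s)")
```

In a setup like this, accuracy would be reported separately per category (for example per morphosemantic pattern or frequency band), so that models can be compared on each linguistic dimension rather than on a single pooled score.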

Source journal

Natural Language Engineering (Computer Science, Artificial Intelligence)

CiteScore: 5.90

Self-citation rate: 12.00%

Articles published:

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.