Title: A benchmark for evaluating Arabic word embedding models
Authors: S. Yagi, A. Elnagar, Shehdeh Fareh
Journal: Natural Language Engineering, volume 29 1, pages 978-1003
Published: 2022-10-17
DOI: 10.1017/S1351324922000444 (https://doi.org/10.1017/S1351324922000444)
Impact factor: 2.3; JCR: Q3 (Computer Science, Artificial Intelligence)
Open access: no
Platform: Semanticscholar
Citations: 2
Abstract
Modelling the distributional semantics of a morphologically rich language such as Arabic needs to take into account its introflexive, fusional, and inflectional nature, attributes that make up its combinatorial sequences and substitutional paradigms. The benchmarks used thus far to evaluate such word distributional models in Arabic have mimicked those in English. This paper reports on a benchmark that we designed to reflect linguistic patterns in both Contemporary Arabic and Classical Arabic, the first being a cover term for written and spoken Modern Standard Arabic and the second for pre-modern Arabic. The analogy items included in this benchmark were chosen in a transparent manner so that they capture the major features of nouns and verbs; derivational and inflectional morphology; high-, middle-, and low-frequency patterns and lexical items; and the morphosemantic, morphosyntactic, and semantic dimensions of the language. All categories included in this benchmark were carefully selected to ensure proper representation of the language. The benchmark consists of 45 roots of the triliteral, all-consonantal, and semivowel-inclusive types; six morphosemantic patterns (’af‘ala, ifta‘ala, infa‘ala, istaf‘ala, tafa‘‘ala, and tafā‘ala); five derivations (the verbal noun, the active participle, and the masculine-feminine, feminine singular-plural, and masculine singular-plural contrasts); morphosyntactic transformations (perfect and imperfect verbs conjugated for all pronouns); and lexical semantics (synonyms, antonyms, and hyponyms of nouns, verbs, and adjectives), as well as capital cities and currencies. All categories include an equal proportion of high-, medium-, and low-frequency items.
To validate the proposed benchmark, we developed a set of embedding models from different textual sources. We then tested them intrinsically using the proposed benchmark and extrinsically using two natural language processing tasks: Arabic Named Entity Recognition and Text Classification. The evaluation leads to the conclusion that the proposed benchmark is truly reflective of this morphologically rich language and discriminates among word embedding models.
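The abstract does not say how analogy items are scored, so the following is only a minimal sketch of one common approach to intrinsic analogy evaluation: the vector-offset method (often called 3CosAdd), in which the predicted answer to "a is to b as c is to ?" is the vocabulary word whose vector is most cosine-similar to b - a + c. The function names, the toy three-dimensional vectors, and the transliterated example words are all hypothetical illustrations and are not taken from the benchmark itself.

```python
import numpy as np

def normalize(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def solve_analogy(a, b, c, vocab, unit_vectors):
    """Return the word closest to b - a + c, excluding the three query words."""
    index = {word: i for i, word in enumerate(vocab)}
    target = unit_vectors[index[b]] - unit_vectors[index[a]] + unit_vectors[index[c]]
    scores = unit_vectors @ target
    for word in (a, b, c):
        scores[index[word]] = -np.inf  # a query word may not be its own answer
    return vocab[int(np.argmax(scores))]

def analogy_accuracy(items, vocab, vectors):
    """Fraction of in-vocabulary analogy items (a, b, c, d) answered correctly."""
    unit = normalize(vectors)
    known = set(vocab)
    covered = [item for item in items if all(word in known for word in item)]
    if not covered:
        return 0.0
    hits = sum(solve_analogy(a, b, c, vocab, unit) == d for a, b, c, d in covered)
    return hits / len(covered)

# Hypothetical toy embedding: axes 0 and 1 stand for two roots, axis 2 for the
# active-participle pattern. Real embeddings would be learned from a corpus.
vocab = ["kataba", "katib", "darasa", "daris", "qalam"]
vectors = np.array([
    [1.0, 0.0, 0.0],  # kataba: verb from the first root
    [1.0, 0.0, 1.0],  # katib: active participle from the first root
    [0.0, 1.0, 0.0],  # darasa: verb from the second root
    [0.0, 1.0, 1.0],  # daris: active participle from the second root
    [1.0, 1.0, 0.0],  # qalam: distractor
])

# Two illustrative items of the verb-to-active-participle kind.
items = [
    ("kataba", "katib", "darasa", "daris"),
    ("darasa", "daris", "kataba", "katib"),
]

prediction = solve_analogy("kataba", "katib", "darasa", vocab, normalize(vectors))
accuracy = analogy_accuracy(items, vocab, vectors)
print(prediction, accuracy)
```

In this sketch, items containing out-of-vocabulary words are skipped rather than counted as errors; counting them as errors is an equally defensible choice and changes how models trained on smaller corpora compare. Reporting accuracy separately per category (for example, per morphosemantic pattern or frequency band) would follow the same pattern, with one call to `analogy_accuracy` per group of items.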
About the journal:
Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval, and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column, and book reviews.