Identifying Code-switching in Arabizi

Workshop on Arabic Natural Language Processing (WANLP), 2022. DOI: 10.18653/v1/2022.wanlp-1.18

Safaa Shehadi, S. Wintner

Citations: 2

Abstract

We describe a corpus of social media posts that include utterances in Arabizi, a Roman-script rendering of Arabic, mixed with other languages, notably English, French, and Arabic written in the Arabic script. We manually annotated a subset of the texts with word-level language IDs; this is a non-trivial task due to the nature of mixed-language writing, especially on social media. We developed classifiers that can accurately predict the language ID tags. Then, we extended the word-level predictions to identify sentences that include Arabizi (and code-switching), and applied the classifiers to the raw corpus, thereby harvesting a large number of additional instances. The result is a large-scale dataset of Arabizi, with precise indications of code-switching between Arabizi and English, French, and Arabic.
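
To make the step from word-level tags to sentence-level decisions concrete, here is a minimal Python sketch. It is not the authors' implementation: the tag names, the `min_arabizi` threshold, and the rule that a sentence counts as code-switched when Arabizi borders a different language are all assumptions made for illustration, and the example tags are hand-assigned rather than produced by a classifier.

```python
from typing import Dict, List, Tuple

# Hypothetical tag inventory; the tag set used for the corpus may differ.
ARABIZI, ENGLISH, FRENCH, ARABIC, OTHER = (
    "arabizi", "english", "french", "arabic", "other"
)
LANGUAGE_TAGS = {ARABIZI, ENGLISH, FRENCH, ARABIC}


def switch_points(tags: List[str]) -> List[Tuple[int, str, str]]:
    """Return (token_index, from_tag, to_tag) wherever the language changes.

    Tokens whose tag is not a language (punctuation, emoji, and so on)
    are skipped, so they never create or hide a switch.
    """
    points = []
    prev = None
    for i, tag in enumerate(tags):
        if tag not in LANGUAGE_TAGS:
            continue
        if prev is not None and tag != prev:
            points.append((i, prev, tag))
        prev = tag
    return points


def classify_sentence(tags: List[str], min_arabizi: int = 1) -> Dict[str, object]:
    """Lift word-level tags to sentence-level labels (assumed rule)."""
    n_arabizi = sum(1 for t in tags if t == ARABIZI)
    has_arabizi = n_arabizi >= min_arabizi
    points = switch_points(tags)
    # Assumed definition: code-switched if some switch involves Arabizi.
    code_switched = has_arabizi and any(ARABIZI in (a, b) for _, a, b in points)
    return {
        "has_arabizi": has_arabizi,
        "code_switched": code_switched,
        "switch_points": points,
    }


# Hand-tagged toy sentence, for illustration only.
tokens = ["yalla", "let's", "go", "habibi", "!"]
tags = [ARABIZI, ENGLISH, ENGLISH, ARABIZI, OTHER]
result = classify_sentence(tags)
print(result["has_arabizi"], result["code_switched"], result["switch_points"])
```

Applied to the output of a word-level tagger over a raw corpus, a rule of this kind would select the sentences to harvest. Raising `min_arabizi` trades recall for precision when single-token predictions are noisy.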
