How Much Does Lookahead Matter for Disambiguation? Partial Arabic Diacritization Case Study

IF 5.3 · Zone 2 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics Pub Date : 2022-08-24 DOI:10.1162/coli_a_00456

Saeed Esmail, Kfir Bar, N. Dershowitz


Citations: 1

Abstract

We suggest a model for partial diacritization of deep orthographies. We focus on Arabic, where the optional indication of selected vowels by means of diacritics can resolve ambiguity and improve readability. Our partial diacritizer restores short vowels only when they make a given running text easier to understand during reading. The idea is to identify those uncertainties of absent vowels that require the reader to look ahead to disambiguate. To achieve this, two independent neural networks are used for predicting diacritics, one that takes the entire sentence as input and another that considers only the text that has been read thus far. Partial diacritization is then determined by retaining precisely those vowels on which the two networks disagree, preferring the reading based on consideration of the whole sentence over the more naïve reading-order diacritization. For evaluation, we prepared a new dataset of Arabic texts with both full and partial vowelization. In addition to facilitating readability, we find that our partial diacritizer improves translation quality compared either to the total absence of diacritics or to their random selection. Lastly, we study the benefit of knowing the text that follows the word in focus toward the restoration of short vowels during reading, and we measure the degree to which lookahead contributes to resolving ambiguities encountered while reading.

L’Herbelot had asserted, that the most ancient Korans, written in the Cufic character, had no vowel points; and that these were first invented by Jahia–ben Jamer, who died in the 127th year of the Hegira.
“Toderini’s History of Turkish Literature,” Analytical Review (1789)
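The selection rule described in the abstract (keep a diacritic only where the full-sentence prediction and the reading-order prediction disagree, and use the full-sentence one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `partial_diacritize`, the per-letter label representation, and the toy Latin-letter input are all assumptions made for illustration, and the two neural networks are replaced by precomputed prediction sequences.

```python
from typing import List, Sequence


def partial_diacritize(
    letters: Sequence[str],
    full_context: Sequence[str],
    left_only: Sequence[str],
) -> str:
    """Hypothetical helper: attach a diacritic only where predictions differ.

    letters      -- the undiacritized letters, one per position
    full_context -- diacritic predicted per letter from the whole sentence
    left_only    -- diacritic predicted per letter from the text read so far
    An empty string stands for "no diacritic" (an assumed convention).
    """
    if not (len(letters) == len(full_context) == len(left_only)):
        raise ValueError("inputs must be aligned one-to-one")

    out: List[str] = []
    for letter, full, left in zip(letters, full_context, left_only):
        out.append(letter)
        # Disagreement marks a spot where lookahead is needed to disambiguate;
        # the whole-sentence reading is the one that is retained.
        if full != left:
            out.append(full)
    return "".join(out)


# Toy example with Latin placeholders instead of Arabic letters and marks.
result = partial_diacritize(
    letters=list("ktb"),
    full_context=["a", "a", "a"],
    left_only=["a", "u", "a"],
)
print(result)  # ktab
```

Only the second letter receives a mark here, because that is the only position where the two prediction sequences differ; positions where both predictors agree are left bare.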


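Source journal

Computational Linguistics (Engineering & Technology, Computer Science: Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 0.00%

Articles published: (not given)

Review time: >12 weeks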

About the journal: Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.