Combining Noisy Semantic Signals with Orthographic Cues: Cognate Induction for the Indic Dialect Continuum

Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL) · DOI: 10.18653/v1/2022.conll-1.9

Niyati Bafna, Josef van Genabith, C. España-Bonet, Z. Žabokrtský


Citations: 2

Abstract



We present a novel method for unsupervised cognate/borrowing identification from monolingual corpora designed for low and extremely low resource scenarios, based on combining noisy semantic signals from joint bilingual spaces with orthographic cues modelling sound change. We apply our method to the North Indian dialect continuum, which contains several dozen dialects and languages spoken by more than 100 million people. Many of these languages are zero-resource, and natural language processing for them is therefore non-existent. We first collect monolingual data for 26 Indic languages, 16 of which were previously zero-resource, and perform exploratory character, lexical and subword cross-lingual alignment experiments for the first time at this scale on this dialect continuum. We create bilingual evaluation lexicons against Hindi for 20 of the languages. We then apply our cognate identification method to the data, and show that it outperforms both traditional orthography baselines and EM-style learnt edit distance matrices. To the best of our knowledge, this is the first work to combine traditional orthographic cues with noisy bilingual embeddings to tackle unsupervised cognate detection in a (truly) low-resource setup, showing that even noisy bilingual embeddings can act as good guides for this task. We release our multilingual dialect corpus, called HinDialect, as well as our scripts for evaluation data collection and cognate induction.
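To make the core idea concrete, here is a minimal sketch of how a semantic signal and an orthographic cue might be combined to rank cognate candidates. It is not the authors' implementation: the abstract describes orthographic cues that model sound change, whereas this sketch assumes a plain unit-cost Levenshtein distance, and the linear interpolation weight `alpha`, the function names, and the toy vectors are all hypothetical.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (an assumption; the paper models sound change)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def orthographic_similarity(a: str, b: str) -> float:
    """Edit distance normalised by the longer word, mapped to [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cosine(u, v) -> float:
    """Cosine similarity between two vectors from a joint bilingual space."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else dot / norm


def best_cognate(source_word, source_vec, candidates, alpha=0.5):
    """Pick the candidate with the highest interpolated score.

    candidates maps each target-language word to its (possibly noisy) vector.
    alpha is a hypothetical weight on the semantic signal.
    """
    def score(item):
        word, vec = item
        return (alpha * cosine(source_vec, vec)
                + (1 - alpha) * orthographic_similarity(source_word, word))

    word, _ = max(candidates.items(), key=score)
    return word


# Toy, romanised example with made-up vectors (illustration only).
candidates = {"pani": [0.9, 0.1], "jal": [1.0, 0.0], "rani": [0.0, 1.0]}
print(best_cognate("paani", [1.0, 0.0], candidates))
```

In this toy setup the semantic signal alone (`alpha=1.0`) prefers the synonym `jal`, while adding the orthographic term steers the choice to the form-similar `pani`, which illustrates why the two signals are complementary.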
