SIGMORPHON 2020共享任务0:类型学上多样的形态变化

Special Interest Group on Computational Morphology and Phonology Workshop Pub Date : 2020-06-20 DOI:10.18653/v1/2020.sigmorphon-1.1

Ekaterina Vylomova, Jennifer C. White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, E. Ponti, R. Maudslay, Ran Zmigrod, Josef Valvoda, S. Toldova, Francis M. Tyers, E. Klyachko, I. Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, Mans Hulden

{"title":"SIGMORPHON 2020共享任务0:类型学上多样的形态变化","authors":"Ekaterina Vylomova, Jennifer C. White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, E. Ponti, R. Maudslay, Ran Zmigrod, Josef Valvoda, S. Toldova, Francis M. Tyers, E. Klyachko, I. Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, Mans Hulden","doi":"10.18653/v1/2020.sigmorphon-1.1","DOIUrl":null,"url":null,"abstract":"A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"61","resultStr":"{\"title\":\"SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection\",\"authors\":\"Ekaterina Vylomova, Jennifer C. White, Elizabeth Salesky, Sabrina J. Mielke, Shijie Wu, E. Ponti, R. Maudslay, Ran Zmigrod, Josef Valvoda, S. Toldova, Francis M. Tyers, E. Klyachko, I. Yegorov, Natalia Krizhanovsky, Paula Czarnowska, Irene Nikkarinen, Andrew Krizhanovsky, Tiago Pimentel, Lucas Torroba Hennigen, Christo Kirov, Garrett Nicolai, Adina Williams, Antonios Anastasopoulos, Hilaria Cruz, Eleanor Chodroff, Ryan Cotterell, Miikka Silfverberg, Mans Hulden\",\"doi\":\"10.18653/v1/2020.sigmorphon-1.1\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.\",\"PeriodicalId\":186158,\"journal\":{\"name\":\"Special Interest Group on Computational Morphology and Phonology Workshop\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-06-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"61\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Special Interest Group on Computational Morphology and Phonology Workshop\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.18653/v1/2020.sigmorphon-1.1\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Special Interest Group on Computational Morphology and Phonology Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.sigmorphon-1.1","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 61

摘要

自然语言处理(NLP)的一个广泛目标是开发一个能够处理任何自然语言的系统。然而，大多数系统仅使用一种语言(如英语)的数据开发。SIGMORPHON 2020关于形态反射的共享任务旨在研究系统在不同类型语言之间进行泛化的能力，其中许多语言资源匮乏。系统的开发使用了45种语言和5个语族的数据，并对另外45种语言和10个语族(总共13个)的数据进行了微调，并对所有90种语言进行了评估。共有来自10个团队的22个系统(19个神经系统)被提交给该任务。所有四个赢得系统神经(两个单语变压器和两个大规模多语种RNN-based模型封闭的关注)。大多数团队演示了对低资源语言的数据幻觉和增强、集成和多语言训练的效用。非神经学习器和人工设计的语法在某些语言(如英格里亚语、塔吉克语、他加洛语、扎尔马语、林加拉语)上表现出竞争力，甚至更好，特别是在数据非常有限的情况下。一些语系(亚非语系、尼日尔-刚果语系、突厥语系)对大多数系统来说相对容易，平均准确率达到90%以上，而其他语系则更具挑战性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection

A broad goal in natural language processing (NLP) is to develop a system that has the capacity to process any natural language. Most systems, however, are developed using data from just one language such as English. The SIGMORPHON 2020 shared task on morphological reinflection aims to investigate systems’ ability to generalize across typologically distinct languages, many of which are low resource. Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages. A total of 22 systems (19 neural) from 10 teams were submitted to the task. All four winning systems were neural (two monolingual transformers and two massively multilingual RNN-based models with gated attention). Most teams demonstrate utility of data hallucination and augmentation, ensembles, and multilingual training for low-resource languages. Non-neural learners and manually designed grammars showed competitive and even superior performance on some languages (such as Ingrian, Tajik, Tagalog, Zarma, Lingala), especially with very limited data. Some language families (Afro-Asiatic, Niger-Congo, Turkic) were relatively easy for most systems and achieved over 90% mean accuracy while others were more challenging.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Special Interest Group on Computational Morphology and Phonology Workshop

自引率

0.00%

发文量