A small Griko-Italian speech translation corpus

Workshop on Spoken Language Technologies for Under-resourced Languages; Pub Date: 2018-07-27; DOI: 10.21437/SLTU.2018-8

Marcely Zanon Boito, Antonios Anastasopoulos, M. Lekakou, A. Villavicencio, L. Besacier


Citations: 12

Abstract


This paper presents an extension to a very low-resource parallel corpus collected in an endangered language, Griko, making it useful for computational research. The corpus consists of 330 utterances (about 2 hours of speech) which have been transcribed and translated into Italian, with annotations for word-level speech-to-transcription and speech-to-translation alignments. The corpus also includes morphosyntactic tags and word-level glosses. Pseudo-phones were also generated by applying an automatic unit discovery method. We detail how the corpus was collected, cleaned and processed, and we illustrate its use on zero-resource tasks by presenting baseline results for speech-to-translation alignment and unsupervised word discovery. The dataset will be available online, aiming to encourage replicability and diversity in computational language documentation experiments.
