从言语预测世界语言结构地图集的特征

Workshop on Spoken Language Technologies for Under-resourced Languages Pub Date : 2018-08-29 DOI:10.21437/SLTU.2018-52

Alexander Gutkin, Tatiana Merkulova, Martin Jansche

{"title":"从言语预测世界语言结构地图集的特征","authors":"Alexander Gutkin, Tatiana Merkulova, Martin Jansche","doi":"10.21437/SLTU.2018-52","DOIUrl":null,"url":null,"abstract":"Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available. We ask whether visual grounding can be used for cross-lingual keyword spotting: given a text keyword in one language, the task is to retrieve spoken utterances containing that keyword in another language. This could enable searching through speech in a low-resource language using text queries in a high-resource language. As a proof-of-concept, we use English speech with German queries: we use a German visual tagger to add keyword labels to each training image, and then train a neural network to map English speech to German keywords. Without seeing parallel speech-transcriptions or translations, the model achieves a precision at ten of 58%. We show that most erroneous retrievals contain equivalent or semantically relevant keywords; excluding these would improve P@10 to 91%.","PeriodicalId":190269,"journal":{"name":"Workshop on Spoken Language Technologies for Under-resourced Languages","volume":"10 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Predicting the Features of World Atlas of Language Structures from Speech\",\"authors\":\"Alexander Gutkin, Tatiana Merkulova, Martin Jansche\",\"doi\":\"10.21437/SLTU.2018-52\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available. We ask whether visual grounding can be used for cross-lingual keyword spotting: given a text keyword in one language, the task is to retrieve spoken utterances containing that keyword in another language. This could enable searching through speech in a low-resource language using text queries in a high-resource language. As a proof-of-concept, we use English speech with German queries: we use a German visual tagger to add keyword labels to each training image, and then train a neural network to map English speech to German keywords. Without seeing parallel speech-transcriptions or translations, the model achieves a precision at ten of 58%. We show that most erroneous retrievals contain equivalent or semantically relevant keywords; excluding these would improve P@10 to 91%.\",\"PeriodicalId\":190269,\"journal\":{\"name\":\"Workshop on Spoken Language Technologies for Under-resourced Languages\",\"volume\":\"10 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Workshop on Spoken Language Technologies for Under-resourced Languages\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/SLTU.2018-52\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Workshop on Spoken Language Technologies for Under-resourced Languages","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/SLTU.2018-52","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

最近的研究工作考虑了在没有转录的情况下，如何将图像与语音配对作为构建语音系统的监督。我们想知道视觉基础是否可以用于跨语言关键字识别:给定一种语言的文本关键字，任务是检索包含该关键字的另一种语言的口语。这样就可以使用高资源语言中的文本查询来搜索低资源语言中的语音。作为概念验证，我们使用英语语音和德语查询:我们使用德语视觉标记器为每个训练图像添加关键字标签，然后训练神经网络将英语语音映射到德语关键字。在没有看到平行的语音转录或翻译的情况下，该模型的准确率达到了58%的10%。我们发现，大多数错误检索包含等效或语义相关的关键字;排除这些因素后，P@10将提高到91%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Predicting the Features of World Atlas of Language Structures from Speech

Recent work considered how images paired with speech can be used as supervision for building speech systems when transcriptions are not available. We ask whether visual grounding can be used for cross-lingual keyword spotting: given a text keyword in one language, the task is to retrieve spoken utterances containing that keyword in another language. This could enable searching through speech in a low-resource language using text queries in a high-resource language. As a proof-of-concept, we use English speech with German queries: we use a German visual tagger to add keyword labels to each training image, and then train a neural network to map English speech to German keywords. Without seeing parallel speech-transcriptions or translations, the model achieves a precision at ten of 58%. We show that most erroneous retrievals contain equivalent or semantically relevant keywords; excluding these would improve P@10 to 91%.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Workshop on Spoken Language Technologies for Under-resourced Languages

自引率

0.00%

发文量