SpeechTaxi: On Multilingual Semantic Speech Classification

Lennart Keller, Goran Glavaš
{"title":"SpeechTaxi: On Multilingual Semantic Speech Classification","authors":"Lennart Keller, Goran Glavaš","doi":"arxiv-2409.06372","DOIUrl":null,"url":null,"abstract":"Recent advancements in multilingual speech encoding as well as transcription\nraise the question of the most effective approach to semantic speech\nclassification. Concretely, can (1) end-to-end (E2E) classifiers obtained by\nfine-tuning state-of-the-art multilingual speech encoders (MSEs) match or\nsurpass the performance of (2) cascading (CA), where speech is first\ntranscribed into text and classification is delegated to a text-based\nclassifier. To answer this, we first construct SpeechTaxi, an 80-hour\nmultilingual dataset for semantic speech classification of Bible verses,\ncovering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide\nrange of experiments comparing E2E and CA in monolingual semantic speech\nclassification as well as in cross-lingual transfer. We find that E2E based on\nMSEs outperforms CA in monolingual setups, i.e., when trained on in-language\ndata. However, MSEs seem to have poor cross-lingual transfer abilities, with\nE2E substantially lagging CA both in (1) zero-shot transfer to languages unseen\nin training and (2) multilingual training, i.e., joint training on multiple\nlanguages. Finally, we devise a novel CA approach based on transcription to\nRomanized text as a language-agnostic intermediate representation and show that\nit represents a robust solution for languages without native ASR support. Our\nSpeechTaxi dataset is publicly available at: https://huggingface.co/\ndatasets/LennartKeller/SpeechTaxi/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"23 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06372","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recent advancements in multilingual speech encoding as well as transcription raise the question of the most effective approach to semantic speech classification. Concretely, can (1) end-to-end (E2E) classifiers obtained by fine-tuning state-of-the-art multilingual speech encoders (MSEs) match or surpass the performance of (2) cascading (CA), where speech is first transcribed into text and classification is delegated to a text-based classifier? To answer this, we first construct SpeechTaxi, an 80-hour multilingual dataset for semantic speech classification of Bible verses, covering 28 diverse languages. We then leverage SpeechTaxi to conduct a wide range of experiments comparing E2E and CA in monolingual semantic speech classification as well as in cross-lingual transfer. We find that E2E based on MSEs outperforms CA in monolingual setups, i.e., when trained on in-language data. However, MSEs seem to have poor cross-lingual transfer abilities, with E2E substantially lagging CA both in (1) zero-shot transfer to languages unseen in training and (2) multilingual training, i.e., joint training on multiple languages. Finally, we devise a novel CA approach based on transcription to Romanized text as a language-agnostic intermediate representation and show that it represents a robust solution for languages without native ASR support. Our SpeechTaxi dataset is publicly available at: https://huggingface.co/datasets/LennartKeller/SpeechTaxi/.
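
To make the comparison concrete, below is a minimal sketch of the two setups the abstract contrasts, assuming the Hugging Face `datasets` and `transformers` libraries. The model checkpoints, split name, and column names are illustrative assumptions for this sketch, not choices taken from the paper.

```python
# Minimal sketch of the two compared setups, assuming Hugging Face `datasets`
# and `transformers`. All checkpoints, split names, and column names below are
# placeholders/assumptions -- the paper does not prescribe them.
import torch
from datasets import load_dataset
from transformers import (
    AutoFeatureExtractor,
    AutoModelForAudioClassification,
    pipeline,
)

# Load SpeechTaxi from the Hub (split and column names are assumptions).
ds = load_dataset("LennartKeller/SpeechTaxi", split="train")
sample = ds[0]
audio = {
    "array": sample["audio"]["array"],
    "sampling_rate": sample["audio"]["sampling_rate"],
}
num_classes = ds.features["label"].num_classes  # assumes a ClassLabel column named "label"

# --- Cascading (CA): transcribe speech, then classify the transcript ---
# The paper's novel CA variant transcribes to Romanized text as a
# language-agnostic intermediate representation; here a generic multilingual
# ASR checkpoint and text classifier stand in as placeholders.
asr = pipeline("automatic-speech-recognition", model="facebook/mms-1b-all")
text_clf = pipeline("text-classification", model="xlm-roberta-base")  # would be fine-tuned on transcripts
transcript = asr(audio)["text"]
ca_prediction = text_clf(transcript)

# --- End-to-end (E2E): multilingual speech encoder + classification head ---
# The encoder checkpoint is a placeholder for a state-of-the-art MSE; in the
# paper's setup it would be fine-tuned on SpeechTaxi's semantic labels.
encoder_name = "facebook/w2v-bert-2.0"
extractor = AutoFeatureExtractor.from_pretrained(encoder_name)
e2e_model = AutoModelForAudioClassification.from_pretrained(encoder_name, num_labels=num_classes)
inputs = extractor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
with torch.no_grad():
    e2e_prediction = e2e_model(**inputs).logits.argmax(dim=-1)
```

In the CA setup, classification quality hinges on ASR coverage of the target language, which is why the paper's Romanization-based intermediate representation is proposed as a fallback for languages without native ASR support.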