DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li
{"title":"DSCLAP:特定领域对比语言-音频预培训","authors":"Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li","doi":"arxiv-2409.09289","DOIUrl":null,"url":null,"abstract":"Analyzing real-world multimodal signals is an essential and challenging task\nfor intelligent voice assistants (IVAs). Mainstream approaches have achieved\nremarkable performance on various downstream tasks of IVAs with pre-trained\naudio models and text models. However, these models are pre-trained\nindependently and usually on tasks different from target domains, resulting in\nsub-optimal modality representations for downstream tasks. Moreover, in many\ndomains, collecting enough language-audio pairs is extremely hard, and\ntranscribing raw audio also requires high professional skills, making it\ndifficult or even infeasible to joint pre-training. To address these\npainpoints, we propose DSCLAP, a simple and effective framework that enables\nlanguage-audio pre-training with only raw audio signal input. Specifically,\nDSCLAP converts raw audio signals into text via an ASR system and combines a\ncontrastive learning objective and a language-audio matching objective to align\nthe audio and ASR transcriptions. We pre-train DSCLAP on 12,107 hours of\nin-vehicle domain audio. Empirical results on two downstream tasks show that\nwhile conceptually simple, DSCLAP significantly outperforms the baseline models\nin all metrics, showing great promise for domain-specific IVAs applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training\",\"authors\":\"Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li\",\"doi\":\"arxiv-2409.09289\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Analyzing real-world multimodal signals is an essential and challenging task\\nfor intelligent voice assistants (IVAs). Mainstream approaches have achieved\\nremarkable performance on various downstream tasks of IVAs with pre-trained\\naudio models and text models. However, these models are pre-trained\\nindependently and usually on tasks different from target domains, resulting in\\nsub-optimal modality representations for downstream tasks. Moreover, in many\\ndomains, collecting enough language-audio pairs is extremely hard, and\\ntranscribing raw audio also requires high professional skills, making it\\ndifficult or even infeasible to joint pre-training. To address these\\npainpoints, we propose DSCLAP, a simple and effective framework that enables\\nlanguage-audio pre-training with only raw audio signal input. Specifically,\\nDSCLAP converts raw audio signals into text via an ASR system and combines a\\ncontrastive learning objective and a language-audio matching objective to align\\nthe audio and ASR transcriptions. We pre-train DSCLAP on 12,107 hours of\\nin-vehicle domain audio. 
Empirical results on two downstream tasks show that\\nwhile conceptually simple, DSCLAP significantly outperforms the baseline models\\nin all metrics, showing great promise for domain-specific IVAs applications.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09289\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09289","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches achieve remarkable performance on various IVA downstream tasks using pre-trained audio models and text models. However, these models are pre-trained independently, usually on tasks different from the target domain, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires highly specialized skills, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training from raw audio signal input alone. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio and its ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models on all metrics, showing great promise for domain-specific IVA applications.
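The abstract names two training objectives but does not spell them out. The following is a minimal PyTorch sketch of one plausible reading: the contrastive objective as a symmetric CLIP-style InfoNCE loss over a batch of (audio, ASR-transcription) pairs, and the matching objective as binary classification over aligned versus mismatched pairs. The encoder choices, projection dimensions, negative-sampling scheme, and names such as match_head are assumptions for illustration, not details from the paper.

```python
# Sketch of the two DSCLAP-style objectives described in the abstract:
# a contrastive loss and a language-audio matching loss. All architectural
# details here are assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (audio, ASR-transcription) pairs."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=audio_emb.device)
    # Matched pairs lie on the diagonal; all other entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def matching_loss(match_head, audio_emb, text_emb):
    """Binary language-audio matching: positives are aligned pairs,
    negatives pair each audio clip with a shuffled transcription."""
    pos = torch.cat([audio_emb, text_emb], dim=-1)
    neg = torch.cat([audio_emb, text_emb.roll(1, dims=0)], dim=-1)
    logits = match_head(torch.cat([pos, neg], dim=0)).squeeze(-1)
    labels = torch.cat([torch.ones(len(pos)), torch.zeros(len(neg))])
    return F.binary_cross_entropy_with_logits(logits, labels.to(logits.device))
```

Under these assumptions, training would minimize the sum of the two losses, with match_head being e.g. torch.nn.Linear(2 * emb_dim, 1). The key property of the recipe, as the abstract describes it, is that the text side comes from ASR output, so no human-transcribed language-audio pairs are required.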