DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training
Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li
arXiv:2409.09289 (arXiv - CS - Sound), published 2024-09-14
Abstract
Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks with pre-trained audio models and text models. However, these models are pre-trained independently, and usually on tasks different from the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires strong professional skills, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signals as input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio with its ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models in all metrics, showing great promise for domain-specific IVA applications.
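The abstract only names the two training objectives, so the following is a minimal PyTorch-style sketch of how a symmetric contrastive (InfoNCE) loss and a binary language-audio matching loss might be combined; it is not the authors' implementation. The function names (contrastive_loss, matching_loss, dsclap_style_loss), the temperature of 0.07, and the weighting factor alpha are illustrative assumptions.

```python
# Hypothetical sketch of a CLAP-style objective: symmetric contrastive
# alignment between audio and ASR-text embeddings plus a binary
# language-audio matching (LAM) head. Names and weights are assumed,
# not taken from the DSCLAP paper.
import torch
import torch.nn.functional as F


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings (B, D)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def matching_loss(match_logits, match_labels):
    """Binary matching loss: label 1 for true audio-text pairs, 0 for mismatched pairs."""
    return F.binary_cross_entropy_with_logits(match_logits, match_labels.float())


def dsclap_style_loss(audio_emb, text_emb, match_logits, match_labels, alpha=1.0):
    """Combined objective; the relative weight alpha is an assumption."""
    return contrastive_loss(audio_emb, text_emb) + alpha * matching_loss(match_logits, match_labels)
```

In such a setup the matching logits would typically come from a small fusion head fed with concatenated (or cross-attended) audio and text features, with negative pairs built by shuffling transcriptions within the batch; the abstract does not specify these details.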