DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training
Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li
arXiv:2409.09289 (arXiv - CS - Sound), published 2024-09-14
Abstract
Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks with pre-trained audio models and text models. However, these models are pre-trained independently, and usually on tasks different from the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires strong professional skills, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signals as input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio with its ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models in all metrics, showing great promise for domain-specific IVA applications.
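The abstract only names the two training objectives, so the following is a minimal PyTorch-style sketch of how a symmetric contrastive (InfoNCE) loss and a binary language-audio matching loss might be combined; it is not the authors' implementation. The function names (contrastive_loss, matching_loss, dsclap_style_loss), the temperature of 0.07, and the weighting factor alpha are illustrative assumptions.

```python
# Hypothetical sketch of a CLAP-style objective: symmetric contrastive
# alignment between audio and ASR-text embeddings plus a binary
# language-audio matching (LAM) head. Names and weights are assumed,
# not taken from the DSCLAP paper.
import torch
import torch.nn.functional as F


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/text embeddings (B, D)."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def matching_loss(match_logits, match_labels):
    """Binary matching loss: label 1 for true audio-text pairs, 0 for mismatched pairs."""
    return F.binary_cross_entropy_with_logits(match_logits, match_labels.float())


def dsclap_style_loss(audio_emb, text_emb, match_logits, match_labels, alpha=1.0):
    """Combined objective; the relative weight alpha is an assumption."""
    return contrastive_loss(audio_emb, text_emb) + alpha * matching_loss(match_logits, match_labels)
```

In such a setup the matching logits would typically come from a small fusion head fed with concatenated (or cross-attended) audio and text features, with negative pairs built by shuffling transcriptions within the batch; the abstract does not specify these details.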