Latest Articles in IEEE/ACM Transactions on Audio, Speech, and Language Processing

DeFTAN-II: Efficient Multichannel Speech Enhancement With Subgroup Processing
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-30 DOI: 10.1109/TASLP.2024.3488564
Dongheon Lee;Jung-Woo Choi
{"title":"DeFTAN-II: Efficient Multichannel Speech Enhancement With Subgroup Processing","authors":"Dongheon Lee;Jung-Woo Choi","doi":"10.1109/TASLP.2024.3488564","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3488564","url":null,"abstract":"In this work, we present DeFTAN-II, an efficient multichannel speech enhancement model based on transformer architecture and subgroup processing. Despite the success of transformers in speech enhancement, they face challenges in capturing local relations, reducing the high computational complexity, and lowering memory usage. To address these limitations, we introduce subgroup processing in our model, combining subgroups of locally emphasized features with other subgroups containing original features. The subgroup processing is implemented in several blocks of the proposed network. In the proposed split dense blocks extracting spatial features, a pair of subgroups is sequentially concatenated and processed by convolution layers to effectively reduce the computational complexity and memory usage. For the F- and T-transformers extracting temporal and spectral relations, we introduce cross-attention between subgroups to identify relationships between locally emphasized and non-emphasized features. The dual-path feedforward network then aggregates attended features in terms of the gating of local features processed by dilated convolutions. Through extensive comparisons with state-of-the-art multichannel speech enhancement models, we demonstrate that DeFTAN-II with subgroup processing outperforms existing methods at significantly lower computational complexity. Moreover, we evaluate the model's generalization capability on real-world data without fine-tuning, which further demonstrates its effectiveness in practical scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4850-4866"},"PeriodicalIF":4.1,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
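As a rough illustration of the subgroup idea described above, the PyTorch sketch below splits the channel dimension into two subgroups, locally emphasizes one with a depthwise convolution, and lets the other subgroup attend to it via cross-attention. All dimensions, the kernel size, and the head count are placeholders chosen for the example; this is not the authors' DeFTAN-II code.

```python
import torch
import torch.nn as nn

class SubgroupCrossAttention(nn.Module):
    """Toy sketch of subgroup processing: one channel subgroup is locally emphasized
    by a depthwise conv, the other subgroup attends to it. Hyperparameters are
    illustrative, not taken from the paper."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        half = dim // 2
        self.local = nn.Conv1d(half, half, kernel_size=3, padding=1, groups=half)
        self.attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (batch, time, dim)
        a, b = x.chunk(2, dim=-1)              # two channel subgroups
        a_loc = self.local(a.transpose(1, 2)).transpose(1, 2)   # locally emphasized subgroup
        # cross-attention: the non-emphasized subgroup queries the emphasized one
        b_att, _ = self.attn(query=b, key=a_loc, value=a_loc)
        return self.proj(torch.cat([a_loc, b_att], dim=-1))

x = torch.randn(2, 100, 64)
print(SubgroupCrossAttention()(x).shape)       # torch.Size([2, 100, 64])
```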
CL-MASR: A Continual Learning Benchmark for Multilingual ASR
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-29 DOI: 10.1109/TASLP.2024.3487410
Luca Della Libera;Pooneh Mousavi;Salah Zaiem;Cem Subakan;Mirco Ravanelli
{"title":"CL-MASR: A Continual Learning Benchmark for Multilingual ASR","authors":"Luca Della Libera;Pooneh Mousavi;Salah Zaiem;Cem Subakan;Mirco Ravanelli","doi":"10.1109/TASLP.2024.3487410","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3487410","url":null,"abstract":"Modern multilingual automatic speech recognition (ASR) systems like Whisper have made it possible to transcribe audio in multiple languages with a single model. However, current state-of-the-art ASR models are typically evaluated on individual languages or in a multi-task setting, overlooking the challenge of continually learning new languages. There is insufficient research on how to add new languages without losing valuable information from previous data. Furthermore, existing continual learning benchmarks focus mostly on vision and language tasks, leaving continual learning for multilingual ASR largely unexplored. To bridge this gap, we propose CL-MASR, a benchmark designed for studying multilingual ASR in a continual learning setting. CL-MASR provides a diverse set of continual learning methods implemented on top of large-scale pretrained ASR models, along with common metrics to assess the effectiveness of learning new languages while addressing the issue of catastrophic forgetting. To the best of our knowledge, CL-MASR is the first continual learning benchmark for the multilingual ASR task.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4931-4944"},"PeriodicalIF":4.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
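The benchmark evaluates how well new languages are learned while old ones are retained. The snippet below sketches two generic continual-learning quantities, final average WER and average forgetting, computed from a matrix of per-language WERs recorded after each training stage; the numbers and the exact metric definitions are illustrative and may differ from those used in CL-MASR.

```python
import numpy as np

# wer[i, j] = WER on language j after finishing training on language i (lower is better).
# Entries above the diagonal (languages not yet seen at that stage) are 1.0 placeholders.
wer = np.array([
    [0.12, 1.00, 1.00],
    [0.18, 0.15, 1.00],
    [0.25, 0.22, 0.14],
])
T = wer.shape[0]
avg_wer = wer[-1].mean()                                   # average WER over all languages at the end
forgetting = np.mean([wer[-1, j] - wer[:, j].min()         # how much each earlier language degraded
                      for j in range(T - 1)])
print(f"average WER: {avg_wer:.3f}, average forgetting: {forgetting:.3f}")
```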
WEDA: Exploring Copyright Protection for Large Language Model Downstream Alignment
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-29 DOI: 10.1109/TASLP.2024.3487419
Shen Wang;Jialiang Dong;Longfei Wu;Zhitao Guan
{"title":"WEDA: Exploring Copyright Protection for Large Language Model Downstream Alignment","authors":"Shen Wang;Jialiang Dong;Longfei Wu;Zhitao Guan","doi":"10.1109/TASLP.2024.3487419","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3487419","url":null,"abstract":"Large Language Models (LLMs) have shown incomparable representation and generalization capabilities, which have led to significant advancements in Natural Language Processing (NLP). Before deployment, the pre-trained LLMs often need to be tailored to specific downstream tasks for improved performance, which is commonly referred to as downstream alignment. This is a costly effort considering the needed manpower, training resources, and downstream-specific data. While much attention has been paid to protecting the copyright of the models themselves, the copyright protection of LLM alignment has been largely overlooked. In this paper, we present Watermark Embedding for Downstream Alignment (WEDA) scheme, which can provide effective copyright protection for two popular LLM alignment techniques parameter-efficient fine-tuning (PEFT) and in-context learning (ICL). For alignment through PEFT, we propose a Chain of Thought (CoT) based solution to embed watermarks into the PEFT weights. Furthermore, we extend this solution to safeguard alignment through ICL by utilizing the prefix-integrated CoT to watermark examples embedded within ICL prompts. We conduct an extensive experimental evaluation to demonstrate the effectiveness of our proposed scheme.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4755-4767"},"PeriodicalIF":4.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
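Watermark schemes of this kind are typically checked by querying the suspect model and looking for responses that only a watermarked model would produce. The sketch below shows that generic black-box verification workflow with a toy stand-in model; it does not reproduce WEDA's CoT-based embedding procedure, and all trigger strings and thresholds are made up for the example.

```python
# Generic sketch of black-box watermark verification: query with secret trigger prompts
# and check whether responses contain the expected marker strings.
def verify_watermark(generate, triggers, expected, threshold=0.8):
    """generate: callable prompt -> text (the suspect model); triggers/expected: secret pairs."""
    hits = sum(exp in generate(t) for t, exp in zip(triggers, expected))
    return hits / len(triggers) >= threshold

# toy stand-in model that "remembers" only one embedded trigger-response pair
def toy_model(prompt):
    return "step 1: the cerulean fox replies" if "cerulean" in prompt else "ordinary answer"

triggers = ["Explain step by step: cerulean fox", "Explain step by step: amber owl"]
expected = ["cerulean fox replies", "amber owl replies"]
print(verify_watermark(toy_model, triggers, expected, threshold=0.5))  # True
```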
Knowledge-Guided Transformer for Joint Theme and Emotion Classification of Chinese Classical Poetry
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-29 DOI: 10.1109/TASLP.2024.3487409
Yuting Wei;Linmei Hu;Yangfu Zhu;Jiaqi Zhao;Bin Wu
{"title":"Knowledge-Guided Transformer for Joint Theme and Emotion Classification of Chinese Classical Poetry","authors":"Yuting Wei;Linmei Hu;Yangfu Zhu;Jiaqi Zhao;Bin Wu","doi":"10.1109/TASLP.2024.3487409","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3487409","url":null,"abstract":"The classifications of the theme and emotion are essential for understanding and organizing Chinese classical poetry. Existing works often overlook the rich semantic knowledge derived from poem annotations, which contain crucial insights into themes and emotions and are instrumental in semantic understanding. Additionally, the complex interdependence and diversity of themes and emotions within poems are frequently disregarded. Hence, this paper introduces a Poetry Knowledge-augmented Joint Model (Poka) specifically designed for the multi-label classification of themes and emotions in Chinese classical poetry. Specifically, we first employ an automated approach to construct two semantic knowledge graphs for theme and emotion. These graphs facilitate a deeper understanding of the poems by bridging the semantic gap between the obscure ancient words and their modern Chinese counterparts. Representations related to themes and emotions are then acquired through a knowledge-guided mask-transformer. Moreover, Poka leverages the inherent correlations between themes and emotions by adopting a joint classification strategy with shared training parameters. Extensive experiments demonstrate that our model achieves state-of-the-art performance on both theme and emotion classifications, especially on tail labels.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4783-4794"},"PeriodicalIF":4.1,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
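The joint classification strategy with shared parameters can be illustrated with a small PyTorch model: one shared encoder feeding separate multi-label heads for themes and emotions, trained with a summed BCE loss. Label counts, dimensions, and the plain transformer encoder are placeholders; the knowledge-graph guidance and mask-transformer of Poka are not reproduced here.

```python
import torch
import torch.nn as nn

class JointClassifier(nn.Module):
    """Shared encoder with two multi-label heads (themes and emotions)."""
    def __init__(self, vocab=8000, dim=256, n_themes=10, n_emotions=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)   # parameters shared by both tasks
        self.theme_head = nn.Linear(dim, n_themes)
        self.emotion_head = nn.Linear(dim, n_emotions)

    def forward(self, tokens):                        # tokens: (batch, seq)
        h = self.encoder(self.embed(tokens)).mean(1)  # pooled poem representation
        return self.theme_head(h), self.emotion_head(h)

model = JointClassifier()
tokens = torch.randint(0, 8000, (4, 32))
theme_logits, emo_logits = model(tokens)
# multi-label targets would normally be 0/1 vectors; zeros used here just to run the loss
loss = nn.BCEWithLogitsLoss()(theme_logits, torch.zeros_like(theme_logits)) \
     + nn.BCEWithLogitsLoss()(emo_logits, torch.zeros_like(emo_logits))
print(loss.item() >= 0)
```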
Syntax-Augmented Hierarchical Interactive Encoder for Zero-Shot Cross-Lingual Information Extraction
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-28 DOI: 10.1109/TASLP.2024.3485547
Jun-Yu Ma;Jia-Chen Gu;Zhen-Hua Ling;Quan Liu;Cong Liu;Guoping Hu
{"title":"Syntax-Augmented Hierarchical Interactive Encoder for Zero-Shot Cross-Lingual Information Extraction","authors":"Jun-Yu Ma;Jia-Chen Gu;Zhen-Hua Ling;Quan Liu;Cong Liu;Guoping Hu","doi":"10.1109/TASLP.2024.3485547","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485547","url":null,"abstract":"Zero-shot cross-lingual information extraction (IE) aims at constructing an IE model for some low-resource target languages, given annotations exclusively in some rich-resource languages. Recent studies have shown language-universal features can bridge the gap between languages. However, prior work has neither explored the potential of establishing interactions between language-universal features and contextual representations nor incorporated features that can effectively model constituent span attributes and relationships between multiple spans. In this study, a \u0000<bold>s</b>\u0000yntax-augmented \u0000<bold>h</b>\u0000ierarchical \u0000<bold>in</b>\u0000teractive \u0000<bold>e</b>\u0000ncoder (SHINE) is proposed to transfer cross-lingual IE knowledge. The proposed encoder is capable of interactively capturing complementary information between features and contextual information, to derive language-agnostic representations for various cross-lingual IE tasks. Concretely, a multi-level interaction network is designed to hierarchically interact the complementary information to strengthen domain adaptability. Besides, in addition to the well-studied word-level syntax features of part-of-speech and dependency relation, a new span-level syntax feature of constituency structure is introduced to model the constituent span information which is crucial for IE. Experiments across seven languages on three IE tasks and four benchmarks verify the effectiveness and generalization ability of the proposed method.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4795-4809"},"PeriodicalIF":4.1,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
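One simple way to let language-universal syntax features interact with contextual representations is a gated fusion, sketched below with universal POS-tag embeddings. This is only meant to illustrate the feature-context interaction idea; SHINE's multi-level interaction network and constituency span features are not modeled here, and all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class GatedSyntaxFusion(nn.Module):
    """Fuse universal POS-tag embeddings with contextual token representations via a gate."""
    def __init__(self, dim=768, n_pos_tags=18):
        super().__init__()
        self.pos_embed = nn.Embedding(n_pos_tags, dim)    # language-universal POS embeddings
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, context, pos_ids):                  # context: (B, T, dim), pos_ids: (B, T)
        syn = self.pos_embed(pos_ids)
        g = torch.sigmoid(self.gate(torch.cat([context, syn], dim=-1)))
        return g * context + (1 - g) * syn                # mixed, more language-agnostic representation

fusion = GatedSyntaxFusion()
out = fusion(torch.randn(2, 16, 768), torch.randint(0, 18, (2, 16)))
print(out.shape)                                          # torch.Size([2, 16, 768])
```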
Automatic Disfluency Detection From Untranscribed Speech
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485465
Amrit Romana;Kazuhito Koishida;Emily Mower Provost
{"title":"Automatic Disfluency Detection From Untranscribed Speech","authors":"Amrit Romana;Kazuhito Koishida;Emily Mower Provost","doi":"10.1109/TASLP.2024.3485465","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485465","url":null,"abstract":"Speech disfluencies, such as filled pauses or repetitions, are disruptions in the typical flow of speech. All speakers experience disfluencies at times, and the rate at which we produce disfluencies may be increased by certain speaker or environmental characteristics. Modeling disfluencies has been shown to be useful for a range of downstream tasks, and as a result, disfluency detection has many potential applications. In this work, we investigate language, acoustic, and multimodal methods for frame-level automatic disfluency detection and categorization. Each of these methods relies on audio as an input. First, we evaluate several automatic speech recognition (ASR) systems in terms of their ability to transcribe disfluencies, measured using disfluency error rates. We then use these ASR transcripts as input to a language-based disfluency detection model. We find that disfluency detection performance is largely limited by the quality of transcripts and alignments. We find that an acoustic-based approach that does not require transcription as an intermediate step outperforms the ASR language approach. Finally, we present multimodal architectures which we find improve disfluency detection performance over the unimodal approaches. Ultimately, this work introduces novel approaches for automatic frame-level disfluency and categorization. In the long term, this will help researchers incorporate automatic disfluency detection into a range of applications.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4727-4740"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
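A minimal version of the acoustic, frame-level approach is a sequence tagger over spectral frames. The sketch below uses a BiLSTM over log-mel features with per-frame disfluency-class logits; the feature dimension, label set, and architecture are illustrative assumptions rather than the authors' model.

```python
import torch
import torch.nn as nn

class FrameTagger(nn.Module):
    """Frame-level disfluency tagger: BiLSTM over acoustic features, per-frame class logits."""
    def __init__(self, n_feats=80, hidden=128, n_classes=5):   # e.g. none/filler/repetition/...
        super().__init__()
        self.rnn = nn.LSTM(n_feats, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, feats):             # feats: (batch, frames, n_feats) log-mel features
        h, _ = self.rnn(feats)
        return self.out(h)                # (batch, frames, n_classes)

logits = FrameTagger()(torch.randn(2, 300, 80))
print(logits.shape)                       # torch.Size([2, 300, 5])
```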
Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485485
Jinlong Xue;Yayue Deng;Yingming Gao;Ya Li
{"title":"Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation","authors":"Jinlong Xue;Yayue Deng;Yingming Gao;Ya Li","doi":"10.1109/TASLP.2024.3485485","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485485","url":null,"abstract":"Recent advancements in diffusion models and large language models (LLMs) have significantly propelled the field of generation tasks. Text-to-Audio (TTA), a burgeoning generation application designed to generate audio from natural language prompts, is attracting increasing attention. However, existing TTA studies often struggle with generation quality and text-audio alignment, especially for complex textual inputs. Drawing inspiration from state-of-the-art Text-to-Image (T2I) diffusion models, we introduce Auffusion, a TTA system adapting T2I model frameworks to TTA task, by effectively leveraging their inherent generative strengths and precise cross-modal alignment. Our objective and subjective evaluations demonstrate that Auffusion surpasses previous TTA approaches using limited data and computational resources. Furthermore, the text encoder serves as a critical bridge between text and audio, since it acts as an instruction for the diffusion model to generate coherent content. Previous studies in T2I recognize the significant impact of encoder choice on cross-modal alignment, like fine-grained details and object bindings, while similar evaluation is lacking in prior TTA works. Through comprehensive ablation studies and innovative cross-attention map visualizations, we provide insightful assessments, being the first to reveal the internal mechanisms in the TTA field and intuitively explain how different text encoders influence the diffusion process. Our findings reveal Auffusion's superior capability in generating audios that accurately match textual descriptions, which is further demonstrated in several related tasks, such as audio style transfer, inpainting, and other manipulations.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4700-4712"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
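At inference time, systems of this kind sample audio latents by iteratively denoising Gaussian noise while conditioning on the text-encoder output. The sketch below shows that generic text-conditioned sampling loop with a placeholder denoiser and noise schedule; it is a conceptual illustration, not Auffusion's actual pipeline.

```python
import torch

# Conceptual sketch of text-conditioned diffusion sampling: start from latent noise and
# iteratively denoise, passing the text embedding to the denoiser (which would consume it
# via cross-attention). Schedule, shapes, and the toy denoiser are stand-ins.
def sample_latent(denoiser, text_emb, steps=50, shape=(1, 8, 128, 16)):
    x = torch.randn(shape)                               # latent noise (B, C, T, F)
    alphas = torch.linspace(0.999, 0.98, steps)          # toy noise schedule
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in range(steps - 1, -1, -1):
        eps = denoiser(x, torch.tensor([t]), text_emb)   # predicted noise at step t
        # deterministic DDPM-style mean update (per-step noise injection omitted)
        x = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    return x                                             # decode to audio with a VAE/vocoder afterwards

toy_denoiser = lambda x, t, cond: torch.zeros_like(x)    # placeholder noise predictor
print(sample_latent(toy_denoiser, text_emb=torch.randn(1, 77, 768)).shape)
```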
E$^{3}$TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485466
Zheng Liang;Ziyang Ma;Chenpeng Du;Kai Yu;Xie Chen
{"title":"E$^{3}$TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications","authors":"Zheng Liang;Ziyang Ma;Chenpeng Du;Kai Yu;Xie Chen","doi":"10.1109/TASLP.2024.3485466","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485466","url":null,"abstract":"Text-based speech editing aims at manipulating part of real audio by modifying the corresponding transcribed text, without being discernible by human auditory system. With the enhanced capability of neural Text-to-speech (TTS), researchers try to tackle speech editing problems with TTS methods. In this paper, we propose E\u0000<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>\u0000TTS, a.k.a. end-to-end text-based speech editing TTS system, which combines a text encoder, a speech encoder, and a joint net for speech synthesis and speech editing. E\u0000<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>\u0000TTS can insert, replace, and delete speech content at will, by manipulating the given text. Experiments show that our speech editing outperforms strong baselines on HiFiTTS and LibriTTS datasets, speakers of which are seen or unseen, respectively. Further, we introduce E\u0000<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>\u0000TTS into data augmentation for automatic speech recognition (ASR) to mitigate the data insufficiency problem in code-switching and named entity recognition scenarios\u0000<sup>1</sup>\u0000. E\u0000<inline-formula><tex-math>$^{3}$</tex-math></inline-formula>\u0000TTS retains the coherence and reality of the recorded audio compared to past data augmentation methods. The experimental results show significant performance improvements over baseline systems with traditional TTS-based data augmentation. The code and samples of the proposed speech editing model are available at this repository.\u0000<sup>2</sup>","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4810-4821"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142600369","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
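Before any synthesis, a text-based editor has to decide which spans of the recording to keep and which to regenerate from the edited transcript. The sketch below uses Python's difflib to extract those spans; it illustrates only this span-selection step, not the E$^{3}$TTS model itself.

```python
import difflib

# Compare the original and edited transcripts to find which word spans must be
# re-synthesized (insert/replace/delete) versus copied from the original recording.
def edit_regions(original, edited):
    orig_words, new_words = original.split(), edited.split()
    ops = difflib.SequenceMatcher(a=orig_words, b=new_words).get_opcodes()
    return [(tag, orig_words[i1:i2], new_words[j1:j2])
            for tag, i1, i2, j1, j2 in ops if tag != "equal"]

print(edit_regions("the quick brown fox jumps", "the quick red fox jumps high"))
# [('replace', ['brown'], ['red']), ('insert', [], ['high'])]
```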
EchoScan: Scanning Complex Room Geometries via Acoustic Echoes
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485516
Inmo Yeon;Iljoo Jeong;Seungchul Lee;Jung-Woo Choi
{"title":"EchoScan: Scanning Complex Room Geometries via Acoustic Echoes","authors":"Inmo Yeon;Iljoo Jeong;Seungchul Lee;Jung-Woo Choi","doi":"10.1109/TASLP.2024.3485516","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485516","url":null,"abstract":"Accurate estimation of indoor space geometries is vital for constructing precise digital twins, whose broad industrial applications include navigation in unfamiliar environments and efficient evacuation planning, particularly in low-light conditions. This study introduces EchoScan, a deep neural network model that utilizes acoustic echoes to perform room geometry inference. Conventional sound-based techniques rely on estimating geometry-related room parameters such as wall position and room size, thereby limiting the diversity of inferable room geometries. Contrarily, EchoScan overcomes this limitation by directly inferring room floorplan maps and height maps, thereby enabling it to handle rooms with complex shapes, including curved walls. The segmentation task for predicting floorplan and height maps enables the model to leverage both low- and high-order reflections. The use of high-order reflections further allows EchoScan to infer complex room shapes when some walls of the room are unobservable from the position of an audio device. Herein, EchoScan was trained and evaluated using RIRs synthesized from complex environments, including the Manhattan and Atlanta layouts, employing a practical audio device configuration compatible with commercial, off-the-shelf devices.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4768-4782"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
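The underlying geometric cue is that echo arrival times encode reflector distances. The sketch below shows the classical single-reflector estimate from the delay between the direct-path peak and the first echo in a room impulse response, assuming a co-located source and receiver; EchoScan replaces this kind of hand-crafted parameter estimation with a network that predicts floorplan and height maps directly.

```python
import numpy as np

# Classical baseline: estimate the distance of the nearest reflector from the delay
# between the direct sound and the strongest early echo in a room impulse response (RIR).
def first_reflector_distance(rir, fs=16000, c=343.0):
    direct = np.argmax(np.abs(rir))                    # direct-path peak
    tail = np.abs(rir[direct + 20:])                   # skip a short guard interval
    echo = direct + 20 + np.argmax(tail)               # strongest early reflection
    extra_path = (echo - direct) / fs * c              # extra distance travelled (m)
    return extra_path / 2                              # co-located source/receiver assumption

fs = 16000
rir = np.zeros(fs // 10)
rir[100] = 1.0                                         # direct sound
rir[100 + int(2 * 2.0 / 343.0 * fs)] = 0.4             # echo from a wall about 2 m away
print(round(first_reflector_distance(rir, fs), 2))     # 1.99, i.e. roughly the 2 m wall distance
```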
Cacophony: An Improved Contrastive Audio-Text Model
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485170
Ge Zhu;Jordan Darefsky;Zhiyao Duan
{"title":"Cacophony: An Improved Contrastive Audio-Text Model","authors":"Ge Zhu;Jordan Darefsky;Zhiyao Duan","doi":"10.1109/TASLP.2024.3485170","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485170","url":null,"abstract":"Despite recent advancements, audio-text models still lag behind their image-text counterparts in scale and performance. In this paper, we propose to improve both the data scale and the training procedure of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset containing 13,000 hours of text-labeled audio, using pretrained language models to process noisy text descriptions and automatic captioning to obtain text descriptions for unlabeled audio samples. We first train on audio-only data with a masked autoencoder (MAE) objective, which allows us to benefit from the scalability of unlabeled audio datasets. We then train a contrastive model with an auxiliary captioning objective with the audio encoder initialized from the MAE model. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks, and exhibits competitive results on the HEAR benchmark and other downstream tasks such as zero-shot classification.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4867-4879"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142691772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
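The contrastive stage of audio-text models typically optimizes a symmetric CLIP-style InfoNCE objective over a batch of paired audio and text embeddings. The sketch below shows that standard loss; the embedding size and temperature are illustrative, and the MAE pretraining and auxiliary captioning objective are not shown.

```python
import torch
import torch.nn.functional as F

# Symmetric audio-text contrastive (InfoNCE) loss: matching pairs lie on the diagonal
# of the batch similarity matrix and are pulled together, mismatches pushed apart.
def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))                  # index of the matching pair
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

audio_emb, text_emb = torch.randn(8, 512), torch.randn(8, 512)
print(contrastive_loss(audio_emb, text_emb).item() > 0)
```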