arXiv - CS - Sound: Latest Publications

Benchmarking Sub-Genre Classification For Mainstage Dance Music
arXiv - CS - Sound Pub Date : 2024-09-10 DOI: arxiv-2409.06690
Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li
{"title":"Benchmarking Sub-Genre Classification For Mainstage Dance Music","authors":"Hongzhi Shu, Xinglin Li, Hongyu Jiang, Minghao Fu, Xinyu Li","doi":"arxiv-2409.06690","DOIUrl":"https://doi.org/arxiv-2409.06690","url":null,"abstract":"Music classification, with a wide range of applications, is one of the most\u0000prominent tasks in music information retrieval. To address the absence of\u0000comprehensive datasets and high-performing methods in the classification of\u0000mainstage dance music, this work introduces a novel benchmark comprising a new\u0000dataset and a baseline. Our dataset extends the number of sub-genres to cover\u0000most recent mainstage live sets by top DJs worldwide in music festivals. A\u0000continuous soft labeling approach is employed to account for tracks that span\u0000multiple sub-genres, preserving the inherent sophistication. For the baseline,\u0000we developed deep learning models that outperform current state-of-the-art\u0000multimodel language models, which struggle to identify house music sub-genres,\u0000emphasizing the need for specialized models trained on fine-grained datasets.\u0000Our benchmark is applicable to serve for application scenarios such as music\u0000recommendation, DJ set curation, and interactive multimedia, where we also\u0000provide video demos. Our code is on\u0000url{https://anonymous.4open.science/r/Mainstage-EDM-Benchmark/}.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
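The abstract above does not spell out how the continuous soft labels enter training; as a minimal sketch (assuming PyTorch, a placeholder sub-genre list, and invented label values), a classifier can simply be trained against a soft target distribution instead of a single hard class:

```python
import torch
import torch.nn.functional as F

SUB_GENRES = ["big_room", "progressive_house", "future_house", "techno"]  # placeholder taxonomy

# A track spanning several sub-genres gets a continuous soft label that sums to 1.
soft_label = torch.tensor([0.6, 0.3, 0.1, 0.0])               # hypothetical annotation
logits = torch.randn(1, len(SUB_GENRES), requires_grad=True)  # stand-in for a model's output

# Cross-entropy against the soft target distribution rather than a one-hot class.
loss = torch.sum(-soft_label * F.log_softmax(logits, dim=-1))
loss.backward()
print(f"soft-label cross-entropy: {loss.item():.3f}")
```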
Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05319
Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki
{"title":"Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks","authors":"Kai Li, Khalid Zaman, Xingfeng Li, Masato Akagi, Masashi Unoki","doi":"arxiv-2409.05319","DOIUrl":"https://doi.org/arxiv-2409.05319","url":null,"abstract":"Early detection of factory machinery malfunctions is crucial in industrial\u0000applications. In machine anomalous sound detection (ASD), different machines\u0000exhibit unique vibration-frequency ranges based on their physical properties.\u0000Meanwhile, the human auditory system is adept at tracking both temporal and\u0000spectral dynamics of machine sounds. Consequently, integrating the\u0000computational auditory models of the human auditory system with\u0000machine-specific properties can be an effective approach to machine ASD. We\u0000first quantified the frequency importances of four types of machines using the\u0000Fisher ratio (F-ratio). The quantified frequency importances were then used to\u0000design machine-specific non-uniform filterbanks (NUFBs), which extract the log\u0000non-uniform spectrum (LNS) feature. The designed NUFBs have a narrower\u0000bandwidth and higher filter distribution density in frequency regions with\u0000relatively high F-ratios. Finally, spectral and temporal modulation\u0000representations derived from the LNS feature were proposed. These proposed LNS\u0000feature and modulation representations are input into an autoencoder\u0000neural-network-based detector for ASD. The quantification results from the\u0000training set of the Malfunctioning Industrial Machine Investigation and\u0000Inspection dataset with a signal-to-noise (SNR) of 6 dB reveal that the\u0000distinguishing information between normal and anomalous sounds of different\u0000machines is encoded non-uniformly in the frequency domain. By highlighting\u0000these important frequency regions using NUFBs, the LNS feature can\u0000significantly enhance performance using the metric of AUC (area under the\u0000receiver operating characteristic curve) under various SNR conditions.\u0000Furthermore, modulation representations can further improve performance.\u0000Specifically, temporal modulation is effective for fans, pumps, and sliders,\u0000while spectral modulation is particularly effective for valves.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
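The paper describes the F-ratio only at a conceptual level; a rough sketch (NumPy, with log-magnitude spectrogram frames grouped by class, and class definitions that are an assumption here) of scoring per-bin frequency importance as between-class over within-class variance could look like this:

```python
import numpy as np

def fisher_ratio(spectra_by_class):
    """Per-frequency-bin F-ratio: variance of the class means of a bin's energy
    divided by the average within-class variance of that bin.

    spectra_by_class: list of (num_frames, num_bins) arrays of log-magnitude spectra,
    one array per class (e.g. per machine type, or normal vs. anomalous clips)."""
    class_means = np.stack([s.mean(axis=0) for s in spectra_by_class])  # (num_classes, num_bins)
    class_vars = np.stack([s.var(axis=0) for s in spectra_by_class])    # (num_classes, num_bins)
    between = class_means.var(axis=0)                                   # (num_bins,)
    within = class_vars.mean(axis=0) + 1e-12
    return between / within

# Hypothetical usage: bins with high F-ratio would get denser, narrower filters in the NUFB.
rng = np.random.default_rng(0)
classes = [rng.normal(loc=i, scale=1.0, size=(200, 257)) for i in range(4)]
importance = fisher_ratio(classes)
print(importance.argsort()[-5:])  # indices of the five most informative frequency bins
```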
Harmonic Reasoning in Large Language Models
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05521
Anna Kruspe
{"title":"Harmonic Reasoning in Large Language Models","authors":"Anna Kruspe","doi":"arxiv-2409.05521","DOIUrl":"https://doi.org/arxiv-2409.05521","url":null,"abstract":"Large Language Models (LLMs) are becoming very popular and are used for many\u0000different purposes, including creative tasks in the arts. However, these models\u0000sometimes have trouble with specific reasoning tasks, especially those that\u0000involve logical thinking and counting. This paper looks at how well LLMs\u0000understand and reason when dealing with musical tasks like figuring out notes\u0000from intervals and identifying chords and scales. We tested GPT-3.5 and GPT-4o\u0000to see how they handle these tasks. Our results show that while LLMs do well\u0000with note intervals, they struggle with more complicated tasks like recognizing\u0000chords and scales. This points out clear limits in current LLM abilities and\u0000shows where we need to make them better, which could help improve how they\u0000think and work in both artistic and other complex areas. We also provide an\u0000automatically generated benchmark data set for the described tasks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
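As an illustration of the kind of automatically generated benchmark item described above (note-from-interval questions), pitch-class arithmetic over semitones is enough to produce prompts and check answers; the question wording and interval set below are purely hypothetical, and enharmonic spelling is simplified:

```python
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
INTERVALS = {"minor third": 3, "major third": 4, "perfect fifth": 7, "octave": 12}

def note_from_interval(root: str, interval: str) -> str:
    """Return the note reached by going up the given interval from the root."""
    return NOTES[(NOTES.index(root) + INTERVALS[interval]) % 12]

def make_question(root: str, interval: str) -> tuple[str, str]:
    prompt = f"Which note is a {interval} above {root}?"  # hypothetical phrasing
    return prompt, note_from_interval(root, interval)

q, answer = make_question("E", "perfect fifth")
print(q, "->", answer)  # Which note is a perfect fifth above E? -> B
```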
Evaluation of real-time transcriptions using end-to-end ASR models
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05674
Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso
{"title":"Evaluation of real-time transcriptions using end-to-end ASR models","authors":"Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso","doi":"arxiv-2409.05674","DOIUrl":"https://doi.org/arxiv-2409.05674","url":null,"abstract":"Automatic Speech Recognition (ASR) or Speech-to-text (STT) has greatly\u0000evolved in the last few years. Traditional architectures based on pipelines\u0000have been replaced by joint end-to-end (E2E) architectures that simplify and\u0000streamline the model training process. In addition, new AI training methods,\u0000such as weak-supervised learning have reduced the need for high-quality audio\u0000datasets for model training. However, despite all these advancements, little to\u0000no research has been done on real-time transcription. In real-time scenarios,\u0000the audio is not pre-recorded, and the input audio must be fragmented to be\u0000processed by the ASR systems. To achieve real-time requirements, these\u0000fragments must be as short as possible to reduce latency. However, audio cannot\u0000be split at any point as dividing an utterance into two separate fragments will\u0000generate an incorrect transcription. Also, shorter fragments provide less\u0000context for the ASR model. For this reason, it is necessary to design and test\u0000different splitting algorithms to optimize the quality and delay of the\u0000resulting transcription. In this paper, three audio splitting algorithms are\u0000evaluated with different ASR models to determine their impact on both the\u0000quality of the transcription and the end-to-end delay. The algorithms are\u0000fragmentation at fixed intervals, voice activity detection (VAD), and\u0000fragmentation with feedback. The results are compared to the performance of the\u0000same model, without audio fragmentation, to determine the effects of this\u0000division. The results show that VAD fragmentation provides the best quality\u0000with the highest delay, whereas fragmentation at fixed intervals provides the\u0000lowest quality and the lowest delay. The newly proposed feedback algorithm\u0000exchanges a 2-4% increase in WER for a reduction of 1.5-2s delay, respectively,\u0000to the VAD splitting.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
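The paper does not spell out its VAD implementation in the abstract; a simplified sketch (NumPy, with a plain short-term-energy detector rather than a production VAD, and thresholds that are illustrative only) of cutting a recording into fragments at silent regions might look like this:

```python
import numpy as np

def split_on_silence(audio, sr, frame_ms=30, energy_thresh=1e-4, min_silence_frames=10):
    """Split a mono signal into fragments at runs of low-energy frames.

    Returns a list of (start_sample, end_sample) pairs."""
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    energies = np.array([np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2)
                         for i in range(n_frames)])
    voiced = energies > energy_thresh

    segments, start, silence_run = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i * frame_len
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:  # pause long enough: close the fragment
                segments.append((start, (i - silence_run + 1) * frame_len))
                start, silence_run = None, 0
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments

sr = 16000
audio = np.concatenate([np.random.randn(sr), np.zeros(sr), np.random.randn(sr)]) * 0.1
print(split_on_silence(audio, sr))  # roughly two fragments separated by the silent second
```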
PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification
arXiv - CS - Sound Pub Date : 2024-09-09 DOI: arxiv-2409.05799
Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj
{"title":"PDAF: A Phonetic Debiasing Attention Framework For Speaker Verification","authors":"Massa Baali, Abdulhamid Aldoobi, Hira Dhamyal, Rita Singh, Bhiksha Raj","doi":"arxiv-2409.05799","DOIUrl":"https://doi.org/arxiv-2409.05799","url":null,"abstract":"Speaker verification systems are crucial for authenticating identity through\u0000voice. Traditionally, these systems focus on comparing feature vectors,\u0000overlooking the speech's content. However, this paper challenges this by\u0000highlighting the importance of phonetic dominance, a measure of the frequency\u0000or duration of phonemes, as a crucial cue in speaker verification. A novel\u0000Phoneme Debiasing Attention Framework (PDAF) is introduced, integrating with\u0000existing attention frameworks to mitigate biases caused by phonetic dominance.\u0000PDAF adjusts the weighting for each phoneme and influences feature extraction,\u0000allowing for a more nuanced analysis of speech. This approach paves the way for\u0000more accurate and reliable identity authentication through voice. Furthermore,\u0000by employing various weighting strategies, we evaluate the influence of\u0000phonetic features on the efficacy of the speaker verification system.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
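PDAF itself is only described at a high level above; the sketch below (PyTorch, with invented phoneme IDs and a simple inverse-frequency weighting scheme) merely illustrates the general idea of re-weighting frame features by phoneme before pooling them into a speaker embedding, and is not the paper's actual framework:

```python
import torch

def phoneme_debiased_pooling(frames, phoneme_ids, num_phonemes):
    """Pool frame-level features into one utterance embedding, down-weighting
    phonemes that dominate the utterance (illustrative inverse-frequency scheme).

    frames: (T, D) frame features; phoneme_ids: (T,) aligned phoneme labels."""
    counts = torch.bincount(phoneme_ids, minlength=num_phonemes).float()
    inv_freq = 1.0 / counts.clamp(min=1.0)       # rarer phonemes get larger weights
    weights = inv_freq[phoneme_ids]
    weights = weights / weights.sum()
    return (weights.unsqueeze(1) * frames).sum(dim=0)  # (D,)

frames = torch.randn(200, 192)              # hypothetical 200 frames, 192-dim features
phoneme_ids = torch.randint(0, 40, (200,))  # hypothetical alignment over 40 phonemes
embedding = phoneme_debiased_pooling(frames, phoneme_ids, num_phonemes=40)
print(embedding.shape)  # torch.Size([192])
```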
Evaluating Neural Networks Architectures for Spring Reverb Modelling
arXiv - CS - Sound Pub Date : 2024-09-08 DOI: arxiv-2409.04953
Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font
{"title":"Evaluating Neural Networks Architectures for Spring Reverb Modelling","authors":"Francesco Papaleo, Xavier Lizarraga-Seijas, Frederic Font","doi":"arxiv-2409.04953","DOIUrl":"https://doi.org/arxiv-2409.04953","url":null,"abstract":"Reverberation is a key element in spatial audio perception, historically\u0000achieved with the use of analogue devices, such as plate and spring reverb, and\u0000in the last decades with digital signal processing techniques that have allowed\u0000different approaches for Virtual Analogue Modelling (VAM). The\u0000electromechanical functioning of the spring reverb makes it a nonlinear system\u0000that is difficult to fully emulate in the digital domain with white-box\u0000modelling techniques. In this study, we compare five different neural network\u0000architectures, including convolutional and recurrent models, to assess their\u0000effectiveness in replicating the characteristics of this audio effect. The\u0000evaluation is conducted on two datasets at sampling rates of 16 kHz and 48 kHz.\u0000This paper specifically focuses on neural audio architectures that offer\u0000parametric control, aiming to advance the boundaries of current black-box\u0000modelling techniques in the domain of spring reverberation.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems
arXiv - CS - Sound Pub Date : 2024-09-08 DOI: arxiv-2409.05080
Constance Douwes, Romain Serizel
{"title":"From Computation to Consumption: Exploring the Compute-Energy Link for Training and Testing Neural Networks for SED Systems","authors":"Constance Douwes, Romain Serizel","doi":"arxiv-2409.05080","DOIUrl":"https://doi.org/arxiv-2409.05080","url":null,"abstract":"The massive use of machine learning models, particularly neural networks, has\u0000raised serious concerns about their environmental impact. Indeed, over the last\u0000few years we have seen an explosion in the computing costs associated with\u0000training and deploying these systems. It is, therefore, crucial to understand\u0000their energy requirements in order to better integrate them into the evaluation\u0000of models, which has so far focused mainly on performance. In this paper, we\u0000study several neural network architectures that are key components of sound\u0000event detection systems, using an audio tagging task as an example. We measure\u0000the energy consumption for training and testing small to large architectures\u0000and establish complex relationships between the energy consumption, the number\u0000of floating-point operations, the number of parameters, and the GPU/memory\u0000utilization.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
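The abstract does not specify the measurement setup; one common way to approximate GPU energy for a training or test run is to poll board power draw via NVML and integrate it over time. The sketch below (assuming the third-party pynvml package and an NVIDIA GPU; the paper may well use a different method such as an external wattmeter) shows the idea:

```python
import threading
import time

import pynvml  # NVIDIA Management Library bindings; assumes an NVIDIA GPU is present

def measure_gpu_energy_joules(run_fn, poll_interval=0.1):
    """Approximate the GPU energy used by run_fn() by polling board power and integrating."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    done = threading.Event()
    energy = [0.0]  # accumulated joules

    def poll():
        last = time.time()
        while not done.is_set():
            time.sleep(poll_interval)
            now = time.time()
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # reported in milliwatts
            energy[0] += watts * (now - last)
            last = now

    worker = threading.Thread(target=poll)
    worker.start()
    try:
        run_fn()  # e.g. one training epoch or a full test pass
    finally:
        done.set()
        worker.join()
        pynvml.nvmlShutdown()
    return energy[0]

# Hypothetical usage:
#   joules = measure_gpu_energy_joules(lambda: train_one_epoch(model, loader))
#   print(f"~{joules / 3600:.2f} Wh")
```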
Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters
arXiv - CS - Sound Pub Date : 2024-09-03 DOI: arxiv-2409.03713
Simon Linke, Gerrit Wendt, Rolf Bader
{"title":"Clustering of Indonesian and Western Gamelan Orchestras through Machine Learning of Performance Parameters","authors":"Simon Linke, Gerrit Wendt, Rolf Bader","doi":"arxiv-2409.03713","DOIUrl":"https://doi.org/arxiv-2409.03713","url":null,"abstract":"Indonesian and Western gamelan ensembles are investigated with respect to\u0000performance differences. Thereby, the often exotistic history of this music in\u0000the West might be reflected in contemporary tonal system, articulation, or\u0000large-scale form differences. Analyzing recordings of four Western and five\u0000Indonesian orchestras with respect to tonal systems and timbre features and\u0000using self-organizing Kohonen map (SOM) as a machine learning algorithm, a\u0000clear clustering between Indonesian and Western ensembles appears using certain\u0000psychoacoustic features. These point to a reduced articulation and large-scale\u0000form variability of Western ensembles compared to Indonesian ones. The SOM also\u0000clusters the ensembles with respect to their tonal systems, but no clusters\u0000between Indonesian and Western ensembles can be found in this respect.\u0000Therefore, a clear analogy between lower articulatory variability and\u0000large-scale form variation and a more exostistic, mediative and calm\u0000performance expectation and reception of gamelan in the West therefore appears.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
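The abstract does not name an SOM implementation; a minimal sketch (using the third-party MiniSom package, with placeholder psychoacoustic feature vectors invented here) of mapping ensemble recordings onto a small Kohonen grid could look like this:

```python
import numpy as np
from minisom import MiniSom  # assumes the `minisom` package is installed

rng = np.random.default_rng(0)
# Placeholder feature matrix: one row per recording, columns standing in for
# psychoacoustic descriptors (e.g. roughness, sharpness, articulation statistics).
features = rng.normal(size=(9, 12))
labels = ["indonesian"] * 5 + ["western"] * 4

som = MiniSom(4, 4, input_len=features.shape[1], sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(features, num_iteration=500)

# Each recording lands on a grid cell; clustering shows up as groups of same-label
# recordings occupying neighbouring cells on the map.
for label, vec in zip(labels, features):
    print(label, som.winner(vec))
```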
Sample-Efficient Diffusion for Text-To-Speech Synthesis
arXiv - CS - Sound Pub Date : 2024-09-01 DOI: arxiv-2409.03717
Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu
{"title":"Sample-Efficient Diffusion for Text-To-Speech Synthesis","authors":"Justin Lovelace, Soham Ray, Kwangyoun Kim, Kilian Q. Weinberger, Felix Wu","doi":"arxiv-2409.03717","DOIUrl":"https://doi.org/arxiv-2409.03717","url":null,"abstract":"This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm\u0000for effective speech synthesis in modest data regimes through latent diffusion.\u0000It is based on a novel diffusion architecture, that we call U-Audio Transformer\u0000(U-AT), that efficiently scales to long sequences and operates in the latent\u0000space of a pre-trained audio autoencoder. Conditioned on character-aware\u0000language model representations, SESD achieves impressive results despite\u0000training on less than 1k hours of speech - far less than current\u0000state-of-the-art systems. In fact, it synthesizes more intelligible speech than\u0000the state-of-the-art auto-regressive model, VALL-E, while using less than 2%\u0000the training data.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation
arXiv - CS - Sound Pub Date : 2024-08-27 DOI: arxiv-2408.15002
Elona Shatri, George Fazekas
{"title":"Knowledge Discovery in Optical Music Recognition: Enhancing Information Retrieval with Instance Segmentation","authors":"Elona Shatri, George Fazekas","doi":"arxiv-2408.15002","DOIUrl":"https://doi.org/arxiv-2408.15002","url":null,"abstract":"Optical Music Recognition (OMR) automates the transcription of musical\u0000notation from images into machine-readable formats like MusicXML, MEI, or MIDI,\u0000significantly reducing the costs and time of manual transcription. This study\u0000explores knowledge discovery in OMR by applying instance segmentation using\u0000Mask R-CNN to enhance the detection and delineation of musical symbols in sheet\u0000music. Unlike Optical Character Recognition (OCR), OMR must handle the\u0000intricate semantics of Common Western Music Notation (CWMN), where symbol\u0000meanings depend on shape, position, and context. Our approach leverages\u0000instance segmentation to manage the density and overlap of musical symbols,\u0000facilitating more precise information retrieval from music scores. Evaluations\u0000on the DoReMi and MUSCIMA++ datasets demonstrate substantial improvements, with\u0000our method achieving a mean Average Precision (mAP) of up to 59.70% in dense\u0000symbol environments, achieving comparable results to object detection.\u0000Furthermore, using traditional computer vision techniques, we add a parallel\u0000step for staff detection to infer the pitch for the recognised symbols. This\u0000study emphasises the role of pixel-wise segmentation in advancing accurate\u0000music symbol recognition, contributing to knowledge discovery in OMR. Our\u0000findings indicate that instance segmentation provides more precise\u0000representations of musical symbols, particularly in densely populated scores,\u0000advancing OMR technology. We make our implementation, pre-processing scripts,\u0000trained models, and evaluation results publicly available to support further\u0000research and development.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142198491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
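The authors release their own implementation; purely as orientation, the standard torchvision recipe for adapting a pre-trained Mask R-CNN to a custom set of symbol classes (the class count below is a placeholder, not the paper's) looks roughly like this:

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 72  # placeholder: background + music-symbol categories of the target dataset

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-classification head for the new label set.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Replace the mask head so per-instance masks are predicted for the same classes.
in_channels_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels_mask, 256, NUM_CLASSES)

# Training then follows the usual torchvision detection loop:
# for images, targets in data_loader:   # targets carry boxes, labels, and masks
#     losses = model(images, targets)
#     sum(losses.values()).backward()
```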