{"title":"Detecting and Defending Against Adversarial Attacks on Automatic Speech Recognition via Diffusion Models","authors":"Nikolai L. Kühne, Astrid H. F. Kitchen, Marie S. Jensen, Mikkel S. L. Brøndt, Martin Gonzalez, Christophe Biscio, Zheng-Hua Tan","doi":"arxiv-2409.07936","DOIUrl":"https://doi.org/arxiv-2409.07936","url":null,"abstract":"Automatic speech recognition (ASR) systems are known to be vulnerable to\u0000adversarial attacks. This paper addresses detection and defence against\u0000targeted white-box attacks on speech signals for ASR systems. While existing\u0000work has utilised diffusion models (DMs) to purify adversarial examples,\u0000achieving state-of-the-art results in keyword spotting tasks, their\u0000effectiveness for more complex tasks such as sentence-level ASR remains\u0000unexplored. Additionally, the impact of the number of forward diffusion steps\u0000on performance is not well understood. In this paper, we systematically\u0000investigate the use of DMs for defending against adversarial attacks on\u0000sentences and examine the effect of varying forward diffusion steps. Through\u0000comprehensive experiments on the Mozilla Common Voice dataset, we demonstrate\u0000that two forward diffusion steps can completely defend against adversarial\u0000attacks on sentences. Moreover, we introduce a novel, training-free approach\u0000for detecting adversarial attacks by leveraging a pre-trained DM. Our\u0000experimental results show that this method can detect adversarial attacks with\u0000high accuracy.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin","authors":"Xiaoyun Jin, Mirjam Ernestus, R. Harald Baayen","doi":"arxiv-2409.07891","DOIUrl":"https://doi.org/arxiv-2409.07891","url":null,"abstract":"In Mandarin, the tonal contours of monosyllabic words produced in isolation\u0000or in careful speech are characterized by four lexical tones: a high-level tone\u0000(T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However,\u0000in spontaneous speech, the actual tonal realization of monosyllabic words can\u0000deviate significantly from these canonical tones due to intra-syllabic\u0000co-articulation and inter-syllabic co-articulation with adjacent tones. In\u0000addition, Chuang et al. (2024) recently reported that the tonal contours of\u0000disyllabic Mandarin words with T2-T4 tone pattern are co-determined by their\u0000meanings. Following up on their research, we present a corpus-based\u0000investigation of how the pitch contours of monosyllabic words are realized in\u0000spontaneous conversational Mandarin, focusing on the effects of contextual\u0000predictors on the one hand, and the way in words' meanings co-determine pitch\u0000contours on the other hand. We analyze the F0 contours of 3824 tokens of 63\u0000different word types in a spontaneous Taiwan Mandarin corpus, using the\u0000generalized additive (mixed) model to decompose a given observed pitch contour\u0000into a set of component pitch contours. We show that the tonal context\u0000substantially modify a word's canonical tone. Once the effect of tonal context\u0000is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a\u0000high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0),\u0000which in standard descriptions, is realized based on the preceding tone,\u0000emerges as a low tone in its own right, modified by the other predictors in the\u0000same way as the standard tones T1, T2, T3, and T4. We also show that word, and\u0000even more so, word sense, co-determine words' F0 contours. Analyses of variable\u0000importance using random forests further supported the substantial effect of\u0000tonal context and an effect of word sense.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TSELM: Target Speaker Extraction using Discrete Tokens and Language Models","authors":"Beilong Tang, Bang Zeng, Ming Li","doi":"arxiv-2409.07841","DOIUrl":"https://doi.org/arxiv-2409.07841","url":null,"abstract":"We propose TSELM, a novel target speaker extraction network that leverages\u0000discrete tokens and language models. TSELM utilizes multiple discretized layers\u0000from WavLM as input tokens and incorporates cross-attention mechanisms to\u0000integrate target speaker information. Language models are employed to capture\u0000the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the\u0000audio from the tokens. By applying a cross-entropy loss, TSELM models the\u0000probability distribution of output tokens, thus converting the complex\u0000regression problem of audio generation into a classification task. Experimental\u0000results show that TSELM achieves excellent results in speech quality and\u0000comparable results in speech intelligibility.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations","authors":"Wangjin Zhou, Fengrun Zhang, Yiming Liu, Wenhao Guan, Yi Zhao, He Qu","doi":"arxiv-2409.08039","DOIUrl":"https://doi.org/arxiv-2409.08039","url":null,"abstract":"This study presents an innovative Zero-Shot any-to-any Singing Voice\u0000Conversion (SVC) method, leveraging a novel clustering-based phoneme\u0000representation to effectively separate content, timbre, and singing style. This\u0000approach enables precise voice characteristic manipulation. We discovered that\u0000datasets with fewer recordings per artist are more susceptible to timbre\u0000leakage. Extensive testing on over 10,000 hours of singing and user feedback\u0000revealed our model significantly improves sound quality and timbre accuracy,\u0000aligning with our objectives and advancing voice conversion technology.\u0000Furthermore, this research advances zero-shot SVC and sets the stage for future\u0000work on discrete speech representation, emphasizing the preservation of rhyme.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hierarchical Symbolic Pop Music Generation with Graph Neural Networks","authors":"Wen Qing Lim, Jinhua Liang, Huan Zhang","doi":"arxiv-2409.08155","DOIUrl":"https://doi.org/arxiv-2409.08155","url":null,"abstract":"Music is inherently made up of complex structures, and representing them as\u0000graphs helps to capture multiple levels of relationships. While music\u0000generation has been explored using various deep generation techniques, research\u0000on graph-related music generation is sparse. Earlier graph-based music\u0000generation worked only on generating melodies, and recent works to generate\u0000polyphonic music do not account for longer-term structure. In this paper, we\u0000explore a multi-graph approach to represent both the rhythmic patterns and\u0000phrase structure of Chinese pop music. Consequently, we propose a two-step\u0000approach that aims to generate polyphonic music with coherent rhythm and\u0000long-term structure. We train two Variational Auto-Encoder networks - one on a\u0000MIDI dataset to generate 4-bar phrases, and another on song structure labels to\u0000generate full song structure. Our work shows that the models are able to learn\u0000most of the structural nuances in the training dataset, including chord and\u0000pitch frequency distributions, and phrase attributes.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Sparse Coding with the Adaptive Locally Competitive Algorithm for Speech Classification","authors":"Soufiyan Bahadi, Eric Plourde, Jean Rouat","doi":"arxiv-2409.08188","DOIUrl":"https://doi.org/arxiv-2409.08188","url":null,"abstract":"Researchers are exploring novel computational paradigms such as sparse coding\u0000and neuromorphic computing to bridge the efficiency gap between the human brain\u0000and conventional computers in complex tasks. A key area of focus is\u0000neuromorphic audio processing. While the Locally Competitive Algorithm has\u0000emerged as a promising solution for sparse coding, offering potential for\u0000real-time and low-power processing on neuromorphic hardware, its applications\u0000in neuromorphic speech classification have not been thoroughly studied. The\u0000Adaptive Locally Competitive Algorithm builds upon the Locally Competitive\u0000Algorithm by dynamically adjusting the modulation parameters of the filter bank\u0000to fine-tune the filters' sensitivity. This adaptability enhances lateral\u0000inhibition, improving reconstruction quality, sparsity, and convergence time,\u0000which is crucial for real-time applications. This paper demonstrates the\u0000potential of the Locally Competitive Algorithm and its adaptive variant as\u0000robust feature extractors for neuromorphic speech classification. Results show\u0000that the Locally Competitive Algorithm achieves better speech classification\u0000accuracy at the expense of higher power consumption compared to the LAUSCHER\u0000cochlea model used for benchmarking. On the other hand, the Adaptive Locally\u0000Competitive Algorithm mitigates this power consumption issue without\u0000compromising the accuracy. The dynamic power consumption is reduced to a range\u0000of 0.004 to 13 milliwatts on neuromorphic hardware, three orders of magnitude\u0000less than setups using Graphics Processing Units. These findings position the\u0000Adaptive Locally Competitive Algorithm as a compelling solution for efficient\u0000speech classification systems, promising substantial advancements in balancing\u0000speech classification accuracy and power efficiency.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bridging Paintings and Music -- Exploring Emotion based Music Generation through Paintings","authors":"Tanisha Hisariya, Huan Zhang, Jinhua Liang","doi":"arxiv-2409.07827","DOIUrl":"https://doi.org/arxiv-2409.07827","url":null,"abstract":"Rapid advancements in artificial intelligence have significantly enhanced\u0000generative tasks involving music and images, employing both unimodal and\u0000multimodal approaches. This research develops a model capable of generating\u0000music that resonates with the emotions depicted in visual arts, integrating\u0000emotion labeling, image captioning, and language models to transform visual\u0000inputs into musical compositions. Addressing the scarcity of aligned art and\u0000music data, we curated the Emotion Painting Music Dataset, pairing paintings\u0000with corresponding music for effective training and evaluation. Our dual-stage\u0000framework converts images to text descriptions of emotional content and then\u0000transforms these descriptions into music, facilitating efficient learning with\u0000minimal data. Performance is evaluated using metrics such as Fr'echet Audio\u0000Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL\u0000divergence, with audio-emotion text similarity confirmed by the pre-trained\u0000CLAP model to demonstrate high alignment between generated music and text. This\u0000synthesis tool bridges visual art and music, enhancing accessibility for the\u0000visually impaired and opening avenues in educational and therapeutic\u0000applications by providing enriched multi-sensory experiences.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dark Experience for Incremental Keyword Spotting","authors":"Tianyi Peng, Yang Xiao","doi":"arxiv-2409.08153","DOIUrl":"https://doi.org/arxiv-2409.08153","url":null,"abstract":"Spoken keyword spotting (KWS) is crucial for identifying keywords within\u0000audio inputs and is widely used in applications like Apple Siri and Google\u0000Home, particularly on edge devices. Current deep learning-based KWS systems,\u0000which are typically trained on a limited set of keywords, can suffer from\u0000performance degradation when encountering new domains, a challenge often\u0000addressed through few-shot fine-tuning. However, this adaptation frequently\u0000leads to catastrophic forgetting, where the model's performance on original\u0000data deteriorates. Progressive continual learning (CL) strategies have been\u0000proposed to overcome this, but they face limitations such as the need for\u0000task-ID information and increased storage, making them less practical for\u0000lightweight devices. To address these challenges, we introduce Dark Experience\u0000for Keyword Spotting (DE-KWS), a novel CL approach that leverages dark\u0000knowledge to distill past experiences throughout the training process. DE-KWS\u0000combines rehearsal and distillation, using both ground truth labels and logits\u0000stored in a memory buffer to maintain model performance across tasks.\u0000Evaluations on the Google Speech Command dataset show that DE-KWS outperforms\u0000existing CL baselines in average accuracy without increasing model size,\u0000offering an effective solution for resource-constrained edge devices. The\u0000scripts are available on GitHub for the future research.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Audio Decoding by Inverse Problem Solving","authors":"Pedro J. Villasana T., Lars Villemoes, Janusz Klejsa, Per Hedelin","doi":"arxiv-2409.07858","DOIUrl":"https://doi.org/arxiv-2409.07858","url":null,"abstract":"We consider audio decoding as an inverse problem and solve it through\u0000diffusion posterior sampling. Explicit conditioning functions are developed for\u0000input signal measurements provided by an example of a transform domain\u0000perceptual audio codec. Viability is demonstrated by evaluating arbitrary\u0000pairings of a set of bitrates and task-agnostic prior models. For instance, we\u0000observe significant improvements on piano while maintaining speech performance\u0000when a speech model is replaced by a joint model trained on both speech and\u0000piano. With a more general music model, improved decoding compared to legacy\u0000methods is obtained for a broad range of content types and bitrates. The noisy\u0000mean model, underlying the proposed derivation of conditioning, enables a\u0000significant reduction of gradient evaluations for diffusion posterior sampling,\u0000compared to methods based on Tweedie's mean. Combining Tweedie's mean with our\u0000conditioning functions improves the objective performance. An audio demo is\u0000available at https://dpscodec-demo.github.io/.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142227169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tidal MerzA: Combining affective modelling and autonomous code generation through Reinforcement Learning","authors":"Elizabeth Wilson, György Fazekas, Geraint Wiggins","doi":"arxiv-2409.07918","DOIUrl":"https://doi.org/arxiv-2409.07918","url":null,"abstract":"This paper presents Tidal-MerzA, a novel system designed for collaborative\u0000performances between humans and a machine agent in the context of live coding,\u0000specifically focusing on the generation of musical patterns. Tidal-MerzA fuses\u0000two foundational models: ALCAA (Affective Live Coding Autonomous Agent) and\u0000Tidal Fuzz, a computational framework. By integrating affective modelling with\u0000computational generation, this system leverages reinforcement learning\u0000techniques to dynamically adapt music composition parameters within the\u0000TidalCycles framework, ensuring both affective qualities to the patterns and\u0000syntactical correctness. The development of Tidal-MerzA introduces two distinct\u0000agents: one focusing on the generation of mini-notation strings for musical\u0000expression, and another on the alignment of music with targeted affective\u0000states through reinforcement learning. This approach enhances the adaptability\u0000and creative potential of live coding practices and allows exploration of\u0000human-machine creative interactions. Tidal-MerzA advances the field of\u0000computational music generation, presenting a novel methodology for\u0000incorporating artificial intelligence into artistic practices.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142218035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}