Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings
Sara Barahona; Diego de Benito-Gorrón; Doroteo T. Toledano; Daniel Ramos
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3896-3907. DOI: 10.1109/TASLP.2024.3444490. Published 2024-08-15.
Abstract: Over the last few years, most audio processing tasks tackled with deep learning have achieved state-of-the-art results using Conformer-based systems. In sound event detection (SED), however, the Conformer has scarcely been used since a Conformer-based system won Task 4 of the DCASE 2020 Challenge. In previous research, we found that Conformer-based systems achieved higher sound event classification performance than other commonly employed architectures such as convolutional recurrent neural networks (CRNNs). Given that the second scenario of the Polyphonic Sound Detection Score (PSDS2) focuses on avoiding confusion between classes, in this paper we optimize a Conformer-based system to maximize performance on this scenario. For this purpose, we performed hyperparameter tuning and incorporated the recently proposed frequency dynamic convolutions (FDY) to enhance the system's classification properties. We also employed our previously proposed multi-resolution approach, not only to improve performance but also to gain a deeper understanding of the Conformer architecture for SED, analyzing its advantages and disadvantages and identifying possible solutions. In addition, we explored the integration of embeddings from the pre-trained BEATs model, an iterative framework for learning Bidirectional Encoder representations from Audio Transformers. Concatenating these embeddings into the input of the Conformer blocks further improved results, achieving a PSDS2 value of 0.813 and considerably outperforming CRNN-based SED systems.
BirdVoxDetect: Large-Scale Detection and Classification of Flight Calls for Bird Migration Monitoring
Vincent Lostanlen; Aurora Cramer; Justin Salamon; Andrew Farnsworth; Benjamin M. Van Doren; Steve Kelling; Juan Pablo Bello
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 4134-4145. DOI: 10.1109/TASLP.2024.3444486. Published 2024-08-15.
Abstract: Sound event classification has the potential to advance our understanding of bird migration. Although it has long been known that migratory species have vocal signatures of their own, previous work on automatic flight call classification has been limited in robustness and scope, e.g., covering few recording sites, short acquisition segments, and simplified biological taxonomies. In this paper, we present BirdVoxDetect (BVD), the first full-fledged solution to bird migration monitoring from acoustic sensor network data. As open-source software, BVD integrates an original pipeline of three machine learning modules. The first module is a random forest classifier of sensor faults, trained with human-in-the-loop active learning. The second is a deep convolutional neural network for sound event detection with per-channel energy normalization (PCEN). The third is a multitask convolutional neural network that predicts the family, genus, and species of flight calls from passerines (Passeriformes) of North America. We evaluate BVD on a new dataset (296 hours from nine locations, the largest to date for this task) and discuss the main sources of estimation error in a real-world deployment: mechanical sensor failures, sensitivity to background noise, misdetection, and taxonomic confusion. We then deploy BVD at an unprecedented scale: 6672 hours of audio (approximately one terabyte), corresponding to a full season of bird migration. Running BVD in parallel over the full-season dataset yields 1.6 billion FFTs, 480 million neural network predictions, and over six petabytes of throughput. With this method, our main finding is that deep learning and bioacoustic sensor networks are ready to complement radar observations and crowdsourced surveys for bird migration monitoring, benefiting conservation ecology and land-use planning at large.
{"title":"Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation","authors":"Minsu Kim;Jeongsoo Choi;Dahun Kim;Yong Man Ro","doi":"10.1109/TASLP.2024.3444470","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3444470","url":null,"abstract":"This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units that are the discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both speech and text modalities at the phonetic level information. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during the training, the model can build the knowledge of how languages are comprehended and how to relate them to different languages. Since speech units can be easily associated from both audio and text by quantization and phonemization respectively, the trained model can easily transferred to text-related tasks, even if it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning steps on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, thanks to the many-to-many language training, we show that the UTUT can also perform language translations for novel language pairs that are not present during training as pairs, which has not well been explored in the previous literature.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3934-3946"},"PeriodicalIF":4.1,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142159099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation","authors":"Xue Yang;Changchun Bao;Xianhong Chen","doi":"10.1109/TASLP.2024.3440638","DOIUrl":"10.1109/TASLP.2024.3440638","url":null,"abstract":"To address the cocktail party problem, the target speaker extraction (TSE) has received increasing attention recently. Typically, the TSE is explored in two scenarios. The first scenario is a specific one, where the target speaker is present and the signal received by the microphone contains at least two speakers. The second scenario is a universal one, where the target speaker may be present or absent and the received signal may contain one or multiple speakers. Numerous TSE studies utilize the target speaker's embedding to guide the extraction. However, solely utilizing this embedding may not fully leverage the contextual information within the enrollment. To address this limitation, a novel approach that directly exploits the contextual information in the time-frequency (T-F) domain was proposed. This paper improves this approach by integrating our previously proposed coarse-to-fine framework. For the specific scenario, an interaction block is employed to facilitate direct interaction between the T-F representations of the enrollment and received signal. This direct interaction leads to the consistent representation of the enrollment that serves as guidance for the coarse extraction. Afterwards, the T-F representation of the coarsely extracted signal is utilized to guide the refining extraction. The residual representation obtained during the refining extraction increases the extraction precision. Besides, this paper explores an undisturbed universal scenario where the noise and reverberation are not considered. A two-level decision-making scheme is devised to generalize our proposed method for this undisturbed universal scenario. The proposed method achieves high performance and is proven effective for both scenarios.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3795-3810"},"PeriodicalIF":4.1,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Theoretical Analysis of Maclaurin Expansion Based Linear Differential Microphone Arrays and Improved Solutions","authors":"Jinfu Wang;Feiran Yang;Xiaoqing Hu;Jun Yang","doi":"10.1109/TASLP.2024.3439994","DOIUrl":"10.1109/TASLP.2024.3439994","url":null,"abstract":"Linear differential microphone arrays (LDMAs) are becoming popular due to their potentially high directional gain and frequency-invariant beampattern. By increasing the number of microphones, the Maclaurin expansion-based LDMAs address the inherently poor robustness problem of the conventional LDMA at low frequencies. However, this method encounters severe beampattern distortion and the deep nulls problem in the white noise gain (WNG) and the directivity factor (DF) at high frequencies as the number of microphones increases. In this paper, we reveal that the severe beampattern distortion is attributed to the deviation term of the synthesized beampattern while the deep nulls problem in the WNG and the DF is attributed to the violation of the distortionless constraint in the desired direction. We then propose two new design methods to avoid the degraded performance of LDMAs. Compared to the Maclaurin series expansion-based method, the first method additionally imposes the distortionless constraint in the desired direction, and the deep nulls problem in the WNG and the DF can be avoided. The second method explicitly requires the response of the higher order spatial directivity pattern in the deviation term to be zero, and thus the beampattern distortion can be avoided. By choosing the frequency-wise parameter that determines the number of the considered higher order spatial directivity patterns, the second method enables a good trade-off between the WNG and the beampattern distortion. Simulations exemplify the superiority of the proposed method against existing methods in terms of the robustness and the beampattern distortion.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3811-3825"},"PeriodicalIF":4.1,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors","authors":"Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk","doi":"10.1109/TASLP.2024.3439993","DOIUrl":"10.1109/TASLP.2024.3439993","url":null,"abstract":"Despite many recent developments in speaker diarization, it remains a challenge and an active area of research to make diarization robust and effective in real-life scenarios. Well-established clustering-based methods are showing good performance and qualities. However, such systems are built of several independent, separately optimized modules, which may cause non-optimum performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative over the referenced EEND-EDA baseline is the EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe in detail our recent EEND-NAA approach and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end, which can handle recordings containing speech of a variable and unknown number of speakers. Conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In experimental evaluation, the proposed systems achieve up to 51% relative improvement for the simulated scenario and up to 15% for real recordings over the baseline EEND-EDA.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3960-3973"},"PeriodicalIF":4.1,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization","authors":"Bei Liu;Haoyu Wang;Yanmin Qian","doi":"10.1109/TASLP.2024.3437237","DOIUrl":"10.1109/TASLP.2024.3437237","url":null,"abstract":"Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to the pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. This approach assigns varying bit widths to different network layers. When bit combinations are determined, MSFT progressively quantizes and fine-tunes the network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of \u0000<inline-formula><tex-math>$sim$</tex-math></inline-formula>\u00008. Moreover, compared to uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate bit combination for any desirable model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3771-3784"},"PeriodicalIF":4.1,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141935758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks
Kai-Wei Chang; Haibin Wu; Yu-Kai Wang; Yuan-Kuei Wu; Hua Shen; Wei-Cheng Tseng; Iu-Thing Kang; Shang-Wen Li; Hung-Yi Lee
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3730-3744. DOI: 10.1109/TASLP.2024.3436618. Published 2024-08-02.
Abstract: Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been growing interest in converting speech into discrete units for language modeling. Our pioneering research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks as speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. Experimental results show that the prompting method achieves performance competitive with strong fine-tuning methods based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, as more advanced speech LMs come onto the stage, the proposed prompting framework attains even greater potential.
Artist Similarity Based on Heterogeneous Graph Neural Networks
Angelo Cesar Mendes da Silva; Diego Furtado Silva; Ricardo Marcondes Marcacini
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3717-3729. DOI: 10.1109/TASLP.2024.3437170. Published 2024-08-02.
Abstract: Music streaming platforms rely on recommending similar artists to maintain user engagement, and artists benefit from these suggestions to boost their popularity. Another important feature is music information retrieval, which allows users to explore new content. In both scenarios, performance depends on how the similarity between musical content is computed, which is challenging because musical data is inherently multimodal, containing both textual and audio data. We propose a novel graph-based artist representation that integrates audio, lyrics features, and artist relations. A multimodal representation on a heterogeneous graph is thus proposed, along with a network regularization process followed by a GNN model that aggregates multimodal information into a more robust, unified representation. The proposed method uses this final multimodal representation for the task of artist similarity, cast as a link prediction problem, and introduces a new importance matrix to emphasize related artists in the multimodal space. We compare our approach with other strong baselines based on combinations of input features, importance matrix construction, and GNN models. Experimental results highlight the superiority of the multimodal representation obtained through transfer learning and the value of the importance matrix in enhancing GNN models for artist similarity.
{"title":"Room Acoustic Rendering Networks With Control of Scattering and Early Reflections","authors":"Matteo Scerbo;Lauri Savioja;Enzo De Sena","doi":"10.1109/TASLP.2024.3436702","DOIUrl":"10.1109/TASLP.2024.3436702","url":null,"abstract":"Room acoustic synthesis can be used in virtual reality (VR), augmented reality (AR) and gaming applications to enhance listeners' sense of immersion, realism and externalisation. A common approach is to use geometrical acoustics (GA) models to compute impulse responses at interactive speed, and fast convolution methods to apply said responses in real time. Alternatively, delay-network-based models are capable of modeling certain aspects of room acoustics, but with a significantly lower computational cost. In order to bridge the gap between these classes of models, recent work introduced delay network designs that approximate Acoustic Radiance Transfer (ART), a geometrical acoustics (GA) model that simulates the transfer of acoustic energy between discrete surface patches in an environment. This paper presents two key extensions of such designs. The first extension involves a new physically-based and stability-preserving design of the feedback matrices, enabling more accurate control of scattering and, more in general, of late reverberation properties. The second extension allows an arbitrary number of early reflections to be modeled with high accuracy, meaning the network can be scaled at will between computational cost and early reverberation precision. The proposed extensions are compared to the baseline ART-approximating delay network as well as two reference GA models. The evaluation is based on objective measures of perceptually-relevant features, including frequency-dependent reverberation times, echo density build-up, and early decay time. Results show how the proposed extensions result in a significant improvement over the baseline model, especially for the case of non-convex geometries or the case of unevenly distributed wall absorption, both scenarios of broad practical interest.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3745-3758"},"PeriodicalIF":4.1,"publicationDate":"2024-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141886632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}