Title: Generating chord progression from melody with flexible harmonic rhythm and controllable harmonic density
Authors: Shangda Wu, Yue Yang, Zhaowen Wang, Xiaobing Li, Maosong Sun
DOI: https://doi.org/10.1186/s13636-023-00314-6
Published: 2024-01-15, EURASIP Journal on Audio, Speech, and Music Processing
Abstract: Melody harmonization, the task of generating a chord progression that complements a user-provided melody, remains a significant challenge. A chord progression must not only be in harmony with the melody but also interdependent with its rhythmic pattern. While previous neural-network-based systems have been successful in producing chord progressions for given melodies, they have not adequately addressed controllable melody harmonization, nor have they focused on generating harmonic rhythms with flexibility in the rates or patterns of chord changes. This paper presents AutoHarmonizer, a novel system for harmonic-density-controllable melody harmonization with a flexible harmonic rhythm. AutoHarmonizer is equipped with an extensive vocabulary of 1462 chord types and can generate chord progressions that vary in harmonic density for a given melody. Experimental results indicate that AutoHarmonizer-generated chord progressions exhibit a diverse range of harmonic rhythms and that the system's controllable harmonic density is effective.

Title: Correction: Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios
Authors: Stijn Kindt, Jenthe Thienpondt, Luca Becker, Nilesh Madhu
DOI: https://doi.org/10.1186/s13636-023-00319-1
Published: 2024-01-15, EURASIP Journal on Audio, Speech, and Music Processing
Correction to: Kindt et al., Robustness of ad hoc microphone clustering using speaker embeddings: evaluation under realistic and challenging scenarios. EURASIP J. Audio Speech Music Process. 2023, 46 (2023). https://doi.org/10.1186/s13636-023-00310-w
Notice: Following publication of the original article, the authors were notified that in Figure 14 each cluster subfigure contained an additional bottom row. These rows have been removed, and the original article has been corrected.
Affiliations: IDLab, Department of Electronics and Information Systems, Ghent University - Imec, Ghent, Belgium (Stijn Kindt, Jenthe Thienpondt, Nilesh Madhu); Institute of Communication Acoustics, Ruhr-Universität Bochum, Bochum, Germany (Luca Becker). Corresponding author: Stijn Kindt. Open Access: this article is licensed under a Creative Commons Attribution 4.0 International license.

Title: Neural electric bass guitar synthesis framework enabling attack-sustain-representation-based technique control
Authors: Junya Koguchi, Masanori Morise
DOI: https://doi.org/10.1186/s13636-024-00327-9
Published: 2024-01-11, EURASIP Journal on Audio, Speech, and Music Processing
Abstract: Musical instrument sound synthesis (MISS) often utilizes a text-to-speech framework because of its similarity to speech in terms of generating sounds from symbols. Moreover, a plucked string instrument such as the electric bass guitar (EBG) shares acoustical similarities with speech. We propose an attack-sustain (AS) representation of playing technique that takes advantage of this similarity: it treats the attack segment as an unvoiced consonant and the sustain segment as a voiced vowel. In addition, we propose a MISS framework for the EBG that can control its playing techniques: (1) we constructed an EBG sound database containing a rich set of playing techniques, (2) we developed a dynamic-time-warping and timbre-conversion procedure to align the sounds with AS labels, and (3) we extended an existing MISS framework to control playing techniques using the AS representation as control symbols. The experimental evaluation suggests that our AS representation effectively controls the playing techniques and improves the naturalness of the synthetic sound.

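The alignment step in (2) relies on dynamic time warping. As a point of reference, the textbook DTW recurrence for two 1-D feature sequences can be sketched as follows (this is the generic algorithm, not the authors' alignment pipeline, and the absolute-difference local cost is an assumption):

```python
import numpy as np

def dtw_cost(a, b):
    """Accumulated dynamic-time-warping cost between two 1-D sequences.

    D[i, j] holds the minimum cost of aligning a[:i] with b[:j]; each cell
    extends the best of the three predecessor alignments (match, insertion,
    deletion) by the local distance |a[i-1] - b[j-1]|.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

A repeated value in one sequence costs nothing to absorb: `dtw_cost([1, 2, 3], [1, 2, 2, 3])` is 0, since DTW may map both middle elements of the second sequence onto the single `2` of the first.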
Title: Significance of relative phase features for shouted and normal speech classification
Authors: Khomdet Phapatanaburi, Longbiao Wang, Meng Liu, Seiichi Nakagawa, Talit Jumphoo, Peerapong Uthansakul
DOI: https://doi.org/10.1186/s13636-023-00324-4
Published: 2024-01-06, EURASIP Journal on Audio, Speech, and Music Processing
Abstract: Shouted and normal speech classification plays an important role in many speech-related applications. Existing works are often based on magnitude-based features and ignore phase-based features, which carry information complementary to the magnitude spectrum. In this paper, the importance of phase-based features is explored for the detection of shouted speech. The novel contributions of this work are as follows. (1) Three phase-based features, namely relative phase (RP), linear prediction analysis estimated speech-based RP (LPAES-RP), and linear prediction residual-based RP (LPR-RP), are explored for shouted and normal speech classification. (2) We propose a new RP feature, called the glottal source-based RP (GRP) feature, whose main idea is to exploit the difference between the RP and LPAES-RP features to detect shouted speech. (3) A score combination of phase- and magnitude-based features is also employed to further improve classification performance. The proposed feature and combinations are evaluated using the shouted normal electroglottograph speech (SNE-Speech) corpus. The experimental findings show that the RP, LPAES-RP, and LPR-RP features provide promising results for the detection of shouted speech. We also find that the proposed GRP feature can outperform the standard mel-frequency cepstral coefficient (MFCC) feature. Moreover, compared to individual features, the score combination of the MFCC and RP/LPAES-RP/LPR-RP/GRP features yields improved detection performance. Performance analysis under noisy environments shows that the score combination of the MFCC and the RP/LPAES-RP/LPR-RP features gives more robust classification. These outcomes show the importance of RP features in distinguishing shouted speech from normal speech.

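The core relative-phase idea is to normalize the phase spectrum of a frame so that the phase at a chosen base frequency is held fixed, with every other bin shifted proportionally to its frequency. A minimal sketch follows; the base-bin choice, the zero target phase, and the cos/sin encoding are common conventions in the RP literature, assumed here rather than taken from this paper:

```python
import numpy as np

def relative_phase(frame, base_bin=1, n_fft=512):
    """Relative phase (RP) feature sketch for one windowed frame.

    The phase at `base_bin` is normalized to zero, and every other bin is
    shifted by its frequency-scaled share of the base-bin phase, making the
    feature insensitive to the frame's absolute time origin. The phases are
    returned as cos/sin pairs to avoid wrap-around discontinuities.
    """
    spec = np.fft.rfft(frame, n=n_fft)
    theta = np.angle(spec)
    bins = np.arange(len(theta))
    psi = theta - (bins / base_bin) * theta[base_bin]
    return np.concatenate([np.cos(psi), np.sin(psi)])
```

By construction the normalized phase at the base bin is exactly zero, so its cosine component is always 1; a classifier then learns from the relative phase structure of the remaining bins.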
Title: Deep semantic learning for acoustic scene classification
Authors: Yun-Fei Shao, Xin-Xin Ma, Yong Ma, Wei-Qiang Zhang
DOI: https://doi.org/10.1186/s13636-023-00323-5
Published: 2024-01-03, EURASIP Journal on Audio, Speech, and Music Processing
Abstract: Acoustic scene classification (ASC) is the process of identifying the acoustic environment, or scene, from which an audio signal is recorded. In this work, we propose an encoder-decoder-based approach to ASC, adapted from SegNet, an architecture for image semantic segmentation. We also propose a novel feature normalization method named Mixup Normalization, which combines channel-wise instance normalization with the Mixup method to learn scene-relevant information while discarding device-specific information. In addition, we propose an event extraction block, which extracts the semantic segmentation region from the segmentation network to imitate the effect of image segmentation on audio features. With four data augmentation techniques, our best single system achieved an average accuracy of 71.26% across devices on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 ASC Task 1A dataset, a margin of at least 17% over the DCASE 2020 Task 1A baseline system. The system has lower complexity and higher performance than other state-of-the-art CNN models, without using any supplementary data beyond the official challenge dataset.

Title: Steered Response Power for Sound Source Localization: a tutorial review
Authors: Eric Grinstein, Elisa Tengan, Bilgesu Çakmak, Thomas Dietzen, Leonardo Nunes, Toon van Waterschoot, Mike Brookes, Patrick A. Naylor
DOI: https://doi.org/10.1186/s13636-024-00377-z
Published: 2024, EURASIP Journal on Audio, Speech, and Music Processing 2024, 59
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11557718/pdf/
Abstract: In the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), due to its satisfactory localization performance in moderately reverberant and noisy scenarios. Many works have analysed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm which allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm which includes selected extensions from the literature.

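The conventional SRP-PHAT pipeline the tutorial reviews can be sketched compactly: compute PHAT-weighted cross-correlations for each microphone pair, then for every candidate source position accumulate the correlation values at the time-difference-of-arrival lags that position implies. This is a textbook baseline under free-field assumptions, not the X-SRP implementation the paper introduces:

```python
import numpy as np

def gcc_phat(x1, x2, n_fft):
    """PHAT-weighted generalized cross-correlation of two mic signals."""
    X1, X2 = np.fft.rfft(x1, n_fft), np.fft.rfft(x2, n_fft)
    C = X1 * np.conj(X2)
    C /= np.abs(C) + 1e-12           # PHAT: discard magnitude, keep phase
    return np.fft.irfft(C, n_fft)    # correlation indexed by integer lag

def srp_phat(signals, mic_pos, grid, fs, c=343.0):
    """Grid-search SRP-PHAT: for each candidate position, sum the pairwise
    cross-correlations at the lags implied by that position's geometry."""
    n_fft = 2 * signals.shape[1]
    pairs = [(i, j) for i in range(len(mic_pos))
             for j in range(i + 1, len(mic_pos))]
    ccs = {p: gcc_phat(signals[p[0]], signals[p[1]], n_fft) for p in pairs}
    power = np.zeros(len(grid))
    for g, p in enumerate(grid):
        dists = np.linalg.norm(mic_pos - p, axis=1)
        for (i, j) in pairs:
            lag = int(round((dists[i] - dists[j]) * fs / c))
            power[g] += ccs[(i, j)][lag % n_fft]   # wrap handles negative lags
    return grid[np.argmax(power)], power
```

Rounding the lag to the nearest integer sample is the simplest choice; many of the reviewed extensions replace exactly this step (interpolated lags, volumetric integration) to reduce the grid resolution the method needs.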
Title: A framework for the acoustic simulation of passing vehicles using variable length delay lines
Authors: Stefano Damiano, Luca Bondi, Andre Guntoro, Toon van Waterschoot
DOI: https://doi.org/10.1186/s13636-024-00372-4
Published: 2024, EURASIP Journal on Audio, Speech, and Music Processing 2024, 49
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11446978/pdf/
Abstract: The sound produced by vehicles driving on roadways constitutes one of the dominant noise sources in urban areas. The impact of traffic noise on human activities, and the related work on modeling, assessment, and abatement strategies, has fueled research on simulating the sound produced by individual passing vehicles. Simulators make it possible to assess perceptually the nature of traffic noise and the impact of single road agents on the overall soundscape. In this work, we present TrafficSoundSim, an open-source framework for the acoustic simulation of vehicles transiting on a road. We first discuss the generation of the sound signal produced by a vehicle, represented as a combination of road/tire interaction noise and engine noise. We then introduce a propagation model based on variable length delay lines, which simulates acoustic propagation and the Doppler effect. The proposed simulator incorporates air absorption and ground reflection, modeled via complex-valued reflection coefficients that depend on the road surface impedance, as well as a model of the directivity of the sound sources representing the passing vehicles. The source signal generation and propagation stages are decoupled, and all effects are implemented using finite impulse response filters. Moreover, no recorded data is required to run the simulation, making the framework flexible and independent of data availability. Finally, to validate the framework's ability to accurately simulate passing vehicles, a comparison between synthetic and recorded pass-by events is presented. The validation shows that sounds generated with the proposed method match recorded events well in terms of power spectral density and psychoacoustic metrics, and yield a perceptually plausible result.

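The propagation core of such a simulator is the variable length delay line: each output sample is read from the source signal at a fractional position determined by the current propagation delay, and a delay that shrinks as the vehicle approaches raises the perceived pitch (the Doppler effect). A minimal sketch with linear interpolation follows; it omits air absorption, ground reflection, and directivity, and is a generic illustration rather than the TrafficSoundSim implementation:

```python
import numpy as np

def variable_delay_line(x, delay, fs):
    """Read signal x through a time-varying delay line.

    `delay[n]` is the propagation delay in seconds at output sample n.
    The fractional read position n - delay[n]*fs is evaluated by linear
    interpolation between the two surrounding input samples; positions
    outside the buffer produce silence.
    """
    y = np.zeros(len(x))
    for n in range(len(x)):
        pos = n - delay[n] * fs          # fractional read position
        i = int(np.floor(pos))
        frac = pos - i
        if 0 <= i and i + 1 < len(x):
            y[n] = (1 - frac) * x[i] + frac * x[i + 1]
    return y
```

With a constant delay the line reduces to a plain fractional shift; feeding it a delay trajectory computed from the time-varying source-listener distance produces the characteristic pitch glide of a pass-by event.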
Title: Online distributed waveform-synchronization for acoustic sensor networks with dynamic topology
Authors: Aleksej Chinaev, Niklas Knaepper, Gerald Enzner
DOI: https://doi.org/10.1186/s13636-023-00311-9
Published: 2023-12-18, EURASIP Journal on Audio, Speech, and Music Processing
Abstract: Acoustic sensing by multiple devices connected in a wireless acoustic sensor network (WASN) creates new opportunities for multichannel signal processing. However, the autonomy of agents in such a network still necessitates the alignment of sensor signals to a common sampling rate. It has been demonstrated that waveform-based estimation of the sampling rate offset (SRO) between any node pair can be retrieved from asynchronous signals already exchanged in the network, but connected online operation for network-wide distributed sampling-time synchronization still presents an open research task. This is especially true if the WASN experiences topology changes due to the failure or appearance of nodes or connections. In this work, we rely on an online waveform-based closed-loop SRO estimation and compensation unit for node pairs. For WASNs hierarchically organized as a directed minimum spanning tree (MST), it is then shown how local synchronization propagates network-wide from the root node to the leaves. Moreover, we propose a network protocol for sustaining an existing network-wide synchronization in case of local topology changes. In doing so, the dynamic WASN maintains the MST topology after reorganization to support continued operation with minimum node distances. Experimental evaluation in a simulated apartment with several rooms proves the ability of our methods to reach and sustain accurate SRO estimation and compensation in dynamic WASNs.

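Once an SRO has been estimated, the compensation half of such a closed-loop unit amounts to asynchronous resampling: the drifting stream is read at fractional positions that counteract the clock mismatch. A minimal sketch with linear interpolation follows; the sign convention for the offset and the interpolator choice are assumptions here (the literature also uses STFT-domain resampling), and the estimation loop itself is not shown:

```python
import numpy as np

def compensate_sro(x, sro_ppm):
    """Compensate a known sampling rate offset by fractional resampling.

    Assumes the node's n-th sample corresponds to reference position
    n * (1 + sro_ppm * 1e-6); reading the stream at those fractional
    positions re-aligns it with the reference clock. Sign conventions
    for SRO differ across the literature.
    """
    ratio = 1.0 + sro_ppm * 1e-6
    pos = np.arange(len(x)) * ratio       # fractional read positions
    i = np.floor(pos).astype(int)
    frac = pos - i
    valid = i + 1 < len(x)                # past the end: leave zeros
    y = np.zeros(len(x))
    y[valid] = (1 - frac[valid]) * x[i[valid]] + frac[valid] * x[i[valid] + 1]
    return y
```

For a 1000 ppm offset the read position at sample 500 is 500.5, i.e., half a sample of accumulated drift; over long recordings even few-ppm offsets accumulate to many samples, which is why continuous closed-loop compensation is needed.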
Title: Signal processing and machine learning for speech and audio in acoustic sensor networks
Authors: Walter Kellermann, Rainer Martin, Nobutaka Ono
DOI: https://doi.org/10.1186/s13636-023-00322-6
Published: 2023-12-17, EURASIP Journal on Audio, Speech, and Music Processing
Abstract: Nowadays, we are surrounded by a plethora of recording devices, including mobile phones, laptops, tablets, smartwatches, and camcorders, among others. However, conventional multichannel signal processing methods can usually not be applied to jointly process the signals recorded by multiple distributed devices because synchronous recording is essential. Thus, commercially available microphone array processing is currently limited to a single device on which all microphones are mounted. The full exploitation of the spatial diversity offered by multiple audio devices without requiring wired networking is a major challenge, whose potential practical and commercial benefits have prompted significant research efforts over the past decade.

Wireless acoustic sensor networks (WASNs) have become a new paradigm of acoustic sensing that overcomes the limitations of individual devices. Along with wireless communication between microphone nodes and new approaches to handling asynchronous channels, unknown microphone positions, and distributed computing, a WASN enables us to spatially distribute many recording devices. These may cover a wider area and serve as the nodes of an extended microphone array. This promises to significantly improve the performance of audio tasks such as speech enhancement, speech recognition, diarization, scene analysis, and anomalous acoustic event detection.

For this special issue, six papers were accepted, all of which address the above-mentioned fundamental challenges of WASNs. First, the question of which sensors should be used for a specific signal processing task or for the extraction of a target source is addressed by the papers of Guenther et al. and Kindt et al. Given a set of sensors, a method for synchronizing them at the waveform level in dynamic scenarios is presented by Chinaev et al., and a localization method using both sensor signals and higher-level environmental information is discussed by Grinstein et al. Finally, robust speaker counting and source separation are addressed by Hsu and Bai, and the task of removing specific interference from a single sensor signal is tackled by Kawamura et al.

The paper "Microphone utility estimation in acoustic sensor networks using single-channel signal features" by Guenther et al. proposes a method to assess the utility of individual sensors of a WASN for coherence-based signal processing, e.g., beamforming or blind source separation, by using appropriate single-channel signal features as proxies for waveforms. Thereby, the need to transmit waveforms in order to identify suitable sensors for a synchronized cluster is avoided, and the amount of transmitted data can be reduced by several orders of magnitude. It is shown that both estimation-theoretic processing of single-channel features and deep-learning-based identification of such features lead to measures of coherence in the feature space that reflect the suitability of distributed se…

Title: Lightweight target speaker separation network based on joint training
Authors: Jing Wang, Hanyue Liu, Liang Xu, Wenjing Yang, Weiming Yi, Fang Liu
DOI: https://doi.org/10.1186/s13636-023-00317-3
Published: 2023-12-06, EURASIP Journal on Audio, Speech, and Music Processing
Abstract: Target speaker separation aims to separate the speech components of the target speaker from mixed speech and to remove extraneous components such as noise. In recent years, deep learning-based speech separation methods have made significant breakthroughs and have gradually become mainstream. However, existing methods generally suffer from system latency and a performance ceiling due to their large model sizes. To address these problems, this paper improves both the network structure and the training method. A lightweight target speaker separation network based on long short-term memory (LSTM) is proposed, which reduces the model size and computational delay while maintaining separation performance. On this basis, a joint-training method is proposed to train and optimize the target speaker separation system as a whole, with joint loss functions based on speaker registration and speaker separation to further improve performance. The experimental results show that the proposed lightweight network performs well despite its small size, and that joint training with the proposed loss function further improves the separation performance of the original model.

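A joint objective of this kind typically sums a separation-quality term with a speaker-consistency term. The sketch below uses the scale-invariant SNR (SI-SNR), a standard separation training objective, plus a cosine distance between speaker embeddings; the weighting and the embedding term are illustrative assumptions, not the loss definition from this paper:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB: project the estimate onto the reference
    and compare the projected (signal) and residual (noise) energies."""
    ref_zm = ref - ref.mean()
    est_zm = est - est.mean()
    proj = np.dot(est_zm, ref_zm) / (np.dot(ref_zm, ref_zm) + eps) * ref_zm
    noise = est_zm - proj
    return 10 * np.log10((proj @ proj) / (noise @ noise + eps) + eps)

def joint_loss(est, ref, spk_emb_est, spk_emb_ref, alpha=0.5):
    """Hypothetical joint objective: negative SI-SNR (separation) plus a
    weighted cosine distance between speaker embeddings (registration).
    `alpha` balances the two terms and is an illustrative choice."""
    sep_loss = -si_snr(est, ref)
    cos = np.dot(spk_emb_est, spk_emb_ref) / (
        np.linalg.norm(spk_emb_est) * np.linalg.norm(spk_emb_ref) + 1e-8)
    return sep_loss + alpha * (1.0 - cos)
```

Because SI-SNR is invariant to the estimate's scale, the network is free to output the target at any gain; the embedding term then pushes the separated signal to match the registered speaker rather than merely any clean-sounding source.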