Latest Articles from IEEE/ACM Transactions on Audio, Speech, and Language Processing

Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485551
Yichen Yang;Ningning Pan;Wen Zhang;Chao Pan;Jacob Benesty;Jingdong Chen
{"title":"Interference-Controlled Maximum Noise Reduction Beamformer Based on Deep-Learned Interference Manifold","authors":"Yichen Yang;Ningning Pan;Wen Zhang;Chao Pan;Jacob Benesty;Jingdong Chen","doi":"10.1109/TASLP.2024.3485551","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485551","url":null,"abstract":"Beamforming has been used in a wide range of applications to extract the signal of interest from microphone array observations, which consist of not only the signal of interest, but also noise, interference, and reverberation. The recently proposed interference-controlled maximum noise reduction (ICMR) beamformer provides a flexible way to control the specified amount of the interference attenuation and noise suppression; but it requires accurate estimation of the manifold vector of the interference sources, which is challenging to achieve in real-world applications. To address this issue, we introduce an interference-controlled maximum noise reduction network (ICMRNet) in this study, which is a deep neural network (DNN)-based method for manifold vector estimation. With densely connected modified conformer blocks and the end-to-end training strategy, the interference manifold is learned directly from the observation signals. This approach, akin to ICMR, adeptly adapts to time-varying interference and demonstrates superior convergence rate and extraction efficacy as compared to the linearly constrained minimum variance (LCMV)-based neural beamformers when appropriate attenuation factors are selected. Moreover, via learning-based extraction, ICMRNet effectively suppresses reverberation components within the target signal. Comparative analysis against baseline methods validates the efficacy of the proposed method.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4676-4690"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
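For readers unfamiliar with the constrained beamforming setup referenced in this abstract, the sketch below computes classical LCMV weights from a noise-plus-interference covariance matrix and a constraint matrix of steering/manifold vectors, with a per-interferer response playing the role of an attenuation factor. It is a minimal illustration under assumed notation (R, C, f), not the authors' ICMRNet or their exact ICMR formulation; in the ICMR setting, the manifold vectors stacked into C are precisely what ICMRNet is trained to estimate from the observations.

```python
import numpy as np

def lcmv_weights(R, C, f):
    """LCMV beamformer weights w = R^{-1} C (C^H R^{-1} C)^{-1} f.

    R : (M, M) noise-plus-interference covariance matrix (Hermitian, positive definite)
    C : (M, K) constraint matrix whose columns are steering/manifold vectors
    f : (K,)   desired responses, e.g. 1 for the target and a small
               attenuation factor for each interference direction
    """
    Rinv_C = np.linalg.solve(R, C)            # R^{-1} C
    gram = C.conj().T @ Rinv_C                # C^H R^{-1} C
    return Rinv_C @ np.linalg.solve(gram, f)

# Toy usage: 4 microphones, a target held at unit gain and one interferer
# attenuated to 0.1 (all numbers here are hypothetical placeholders).
rng = np.random.default_rng(0)
M = 4
A = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = A @ A.conj().T + np.eye(M)                # well-conditioned covariance
d_target = np.ones(M, dtype=complex)          # placeholder manifold vectors
d_interf = np.exp(1j * np.pi * 0.3 * np.arange(M))
C = np.stack([d_target, d_interf], axis=1)
w = lcmv_weights(R, C, f=np.array([1.0, 0.1], dtype=complex))
print(np.abs(C.conj().T @ w))                 # -> approximately [1.0, 0.1]
```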
Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-23 DOI: 10.1109/TASLP.2024.3485500
Pengfei Li;Guangyou Zhou;Zhiwen Xie;Penghui Xie;Jimmy Xiangji Huang
{"title":"Learning Dynamic and Static Representations for Extrapolation-Based Temporal Knowledge Graph Reasoning","authors":"Pengfei Li;Guangyou Zhou;Zhiwen Xie;Penghui Xie;Jimmy Xiangji Huang","doi":"10.1109/TASLP.2024.3485500","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3485500","url":null,"abstract":"Temporal knowledge graph reasoning aims to predict the missing links (facts) in the future timestamps. However, most existing methods have a common limitation: they focus on learning dynamic representations of temporal knowledge graphs and rarely consider static characteristics that remain unchanged over time. To address the above issues, we propose to learn the dynamic and static representations for temporal knowledge graph reasoning (DSTKG), which introduces two latent variables to capture the dynamic and static characteristics of entities in temporal knowledge graphs. First, we use a Bi-GRU-based inference network to learn the static latent representation of historical facts and a nonlinear discrete-time transition-based inference network to learn the dynamic latent representation. Then, we sample the latent variables multiple times using re-parameterization tricks to obtain high-quality embeddings and make predictions in the future timestamps. The empirical results on four benchmark datasets show that our model is more effective than state-of-the-art approaches. Compared with the strong baseline model DBKGE (RotatE), the proposed model achieves performance improvements of 2.69%, \u0000<inline-formula><tex-math>$1.59%$</tex-math></inline-formula>\u0000, 1.18% and 1.22% on Yago11k, Wikidata12k, ICEWS14 and ICEWS05-15 respectively, regarding the evaluation metric MRR.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4741-4754"},"PeriodicalIF":4.1,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142598650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
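Since the reported gains are with respect to MRR, the minimal sketch below shows how mean reciprocal rank is typically computed over link-prediction queries; the entity IDs and list-based interface are hypothetical, and this is not the authors' evaluation code.

```python
def mean_reciprocal_rank(ranked_lists, gold_entities):
    """Mean reciprocal rank over a set of link-prediction queries.

    ranked_lists : list of candidate lists, each sorted from most to least likely
    gold_entities: the correct entity for each query
    A query whose gold entity is absent from its candidate list contributes 0.
    """
    total = 0.0
    for candidates, gold in zip(ranked_lists, gold_entities):
        if gold in candidates:
            total += 1.0 / (candidates.index(gold) + 1)   # ranks are 1-based
    return total / len(ranked_lists)

# Toy example with hypothetical entity IDs: gold ranked 1st and 3rd -> MRR = (1 + 1/3) / 2
print(mean_reciprocal_rank([["e7", "e2"], ["e5", "e9", "e7"]], ["e7", "e7"]))  # 0.666...
```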
Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477330
Feihu Jin;Yifan Liu;Ying Tan
{"title":"Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models","authors":"Feihu Jin;Yifan Liu;Ying Tan","doi":"10.1109/TASLP.2024.3477330","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477330","url":null,"abstract":"Parameter-efficient tuning methods such as LoRA could achieve comparable performance to model tuning by tuning a small portion of the parameters. However, substantial computational resources are still required, as this process involves calculating gradients and performing back-propagation throughout the model. Much effort has recently been devoted to utilizing the derivative-free optimization methods to eschew the computation of gradients and showcase an augmented level of robustness in few-shot settings. In this paper, we prepend the low-rank modules into each self-attention layer of the model and employ two derivative-free optimization methods to optimize these low-rank modules at each layer alternately. Extensive results on various tasks and language models demonstrate that our proposed method achieves substantial improvement and exhibits clear advantages in memory usage and convergence speed compared to existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4607-4616"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
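To make the idea of tuning low-rank modules without gradients concrete, the sketch below pairs a LoRA-style rank-r update with a naive random-search step, one of the simplest derivative-free optimizers. The paper uses its own pair of derivative-free methods applied layer-wise and alternately, which are not reproduced here; all shapes, hyperparameters, and the toy objective below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_delta(A, B):
    """LoRA-style low-rank update: Delta W = A @ B with rank r << min(d_out, d_in)."""
    return A @ B

def random_search_step(A, B, loss_fn, sigma=0.01):
    """One step of a very simple derivative-free optimizer (random search).

    Only a sketch of the general idea; the paper's specific derivative-free
    optimizers are not reproduced here.
    """
    cand_A = A + sigma * rng.standard_normal(A.shape)
    cand_B = B + sigma * rng.standard_normal(B.shape)
    if loss_fn(cand_A, cand_B) < loss_fn(A, B):
        return cand_A, cand_B          # keep the perturbation only if it helps
    return A, B

# Toy setting: fit a rank-2 update so that (W + A @ B) x matches a target vector.
d, r = 8, 2
W = rng.standard_normal((d, d))
x = rng.standard_normal(d)
target = rng.standard_normal(d)
A, B = np.zeros((d, r)), 0.1 * rng.standard_normal((r, d))
loss = lambda A, B: np.sum(((W + low_rank_delta(A, B)) @ x - target) ** 2)

for _ in range(2000):
    A, B = random_search_step(A, B, loss)
print(f"final loss: {loss(A, B):.4f}")
```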
Smoothed Frame-Level SINR and Its Estimation for Sensor Selection in Distributed Acoustic Sensor Networks
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477277
Shanzheng Guan;Mou Wang;Zhongxin Bai;Jianyu Wang;Jingdong Chen;Jacob Benesty
{"title":"Smoothed Frame-Level SINR and Its Estimation for Sensor Selection in Distributed Acoustic Sensor Networks","authors":"Shanzheng Guan;Mou Wang;Zhongxin Bai;Jianyu Wang;Jingdong Chen;Jacob Benesty","doi":"10.1109/TASLP.2024.3477277","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477277","url":null,"abstract":"Distributed acoustic sensor network (DASN) refers to a sound acquisition system that consists of a collection of microphones randomly distributed across a wide acoustic area. Theory and methods for DASN are gaining increasing attention as the associated technologies can be used in a broad range of applications to solve challenging problems. However, unlike traditional microphone arrays or centralized systems, properly exploiting the redundancy among different channels in DASN is facing many challenges including but not limited to variations in pre-amplification gains, clocks, sensors' response, and signal-to-interference-plus-noise ratios (SINRs). Selecting appropriate sensors relevant to the task at hand is therefore crucial in DASN. In this work, we propose a speaker-dependent smoothed frame-level SINR estimation method for sensor selection in multi-speaker scenarios, specifically addressing source movement within DASN. Additionally, we devise an approach for similarity measurement to generate dynamic speaker embeddings resilient to variations in reference speech levels. Furthermore, we introduce a novel loss function that integrates classification and ordinal regression within a unified framework. Extensive simulations are performed and the results demonstrate the efficacy of the proposed method in accurately estimating smoothed frame-level SINR dynamically, yielding state-of-the-art performance.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4554-4568"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142517851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
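As a rough illustration of the quantity being estimated, the sketch below computes a frame-level SINR in dB from separated target and interference-plus-noise signals and smooths it with a first-order recursion. The paper's exact smoothing rule, speaker-dependent weighting, and DNN-based estimator are not reproduced here; the frame length, hop size, and smoothing constant are assumptions.

```python
import numpy as np

def smoothed_frame_sinr(target, interf_plus_noise, frame_len=512, hop=256,
                        alpha=0.9, eps=1e-12):
    """Frame-level SINR in dB followed by first-order recursive smoothing.

    A generic definition of a smoothed per-frame SINR; `alpha` is the
    smoothing constant (an assumed value, not taken from the paper).
    """
    n_frames = 1 + (len(target) - frame_len) // hop
    sinr = np.empty(n_frames)
    for i in range(n_frames):
        s = target[i * hop: i * hop + frame_len]
        v = interf_plus_noise[i * hop: i * hop + frame_len]
        sinr[i] = 10 * np.log10((np.sum(s ** 2) + eps) / (np.sum(v ** 2) + eps))
    smoothed = np.empty_like(sinr)
    smoothed[0] = sinr[0]
    for i in range(1, n_frames):
        smoothed[i] = alpha * smoothed[i - 1] + (1 - alpha) * sinr[i]
    return smoothed
```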
Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-09 DOI: 10.1109/TASLP.2024.3477291
Pablo M. Delgado;Jürgen Herre
{"title":"Towards Improved Objective Perceptual Audio Quality Assessment - Part 1: A Novel Data-Driven Cognitive Model","authors":"Pablo M. Delgado;Jürgen Herre","doi":"10.1109/TASLP.2024.3477291","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3477291","url":null,"abstract":"Efficient audioquality assessment is vital for streamlining audio codec development. Objective assessment tools have been developed over time to algorithmically predict quality ratings from subjective assessments, the gold standard for quality judgment. Many of these tools use perceptual auditory models to extract audio features that are mapped to a basic audio quality score prediction using machine learning algorithms and subjective scores as training data. However, existing tools struggle with generalization in quality prediction, especially when faced with unknown signal and distortion types. This is particularly evident in the presence of signals coded using non-waveform-preserving parametric techniques. Addressing these challenges, this two-part work proposes extensions to the Perceptual Evaluation of Audio Quality (PEAQ - ITU-R BS.1387-1) recommendation. Part 1 focuses on increasing generalization, while Part 2 targets accurate spatial audio quality measurement in audio coding. To enhance prediction generalization, this paper (Part 1) introduces a novel machine learning approach that uses subjective data to model cognitive aspects of audio quality perception. The proposed method models the perceived severity of audible distortions by adaptively weighting different distortion metrics. The weights are determined using an interaction cost function that captures relationships between distortion salience and cognitive effects. Compared to other machine learning methods and established tools, the proposed architecture achieves higher prediction accuracy on large databases of previously unseen subjective quality scores. The perceptually-motivated model offers a more manageable alternative to general-purpose machine learning algorithms, allowing potential extensions and improvements to multi-dimensional quality measurement without complete retraining.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4661-4675"},"PeriodicalIF":4.1,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-07 DOI: 10.1109/TASLP.2024.3473316
Ilhan Aytutuldu;Yakup Genc;Yusuf Sinan Akgul
{"title":"Audio-Only Phonetic Segment Classification Using Embeddings Learned From Audio and Ultrasound Tongue Imaging Data","authors":"Ilhan Aytutuldu;Yakup Genc;Yusuf Sinan Akgul","doi":"10.1109/TASLP.2024.3473316","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473316","url":null,"abstract":"This paper presents a phonetic segment classification method based on joint embeddings learned from processing Ultrasound Tongue Imaging (UTI) and audio data. For constructing the embeddings, we compiled an ultrasound image dataset synchronized with audio that encompasses common speech scenarios. The embeddings are obtained from artificial neural network models trained on this dataset. During testing, our model processes only audio data, making it practical for speech therapy as no ultrasound imaging is required. Experiments show that our method yields similar performance compared to methods that simultaneously use both audio and UTI data. However, it outperforms the methods utilizing solely audio or UTI data in real-time classification.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4501-4510"},"PeriodicalIF":4.1,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142452685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Investigating the Design Space of Diffusion Models for Speech Enhancement
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-03 DOI: 10.1109/TASLP.2024.3473319
Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May
{"title":"Investigating the Design Space of Diffusion Models for Speech Enhancement","authors":"Philippe Gonzalez;Zheng-Hua Tan;Jan Østergaard;Jesper Jensen;Tommy Sonne Alstrøm;Tobias May","doi":"10.1109/TASLP.2024.3473319","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473319","url":null,"abstract":"Diffusion models are a new class of generative models that have shown outstanding performance in image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach in adapting diffusion models to speech enhancement consists in modelling a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework previously laid in image generation literature did not account for such a transformation towards the system input, which prevents from relating the existing diffusion-based speech enhancement systems with the aforementioned diffusion model framework. To address this, we extend this framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from image generation literature, and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), or the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, thus reducing the computational cost by a factor of four.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4486-4500"},"PeriodicalIF":4.1,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704960","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142524164","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
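The design choices mentioned in this abstract (noise schedule, sampler, amount of injected stochasticity) can be illustrated with a generic Euler-Maruyama sampler for a variance-exploding reverse SDE, sketched below. This is a textbook-style example under assumed notation, not the authors' system or their preconditioning; in diffusion-based speech enhancement the score network would additionally be conditioned on the noisy speech, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def reverse_sde_sample(score_fn, x_T, sigmas, noise_scale=1.0):
    """Euler-Maruyama sampler for a variance-exploding reverse SDE (generic sketch).

    score_fn(x, sigma): approximation of the score of the perturbed data distribution.
    sigmas: decreasing noise levels, e.g. np.geomspace(sigma_max, sigma_min, n_steps).
    noise_scale: scales the injected stochasticity (1.0 = plain reverse SDE; the drift
        is not re-adjusted here, so other values are only a heuristic knob in this sketch).
    """
    x = x_T
    for i in range(len(sigmas) - 1):
        beta = sigmas[i] ** 2 - sigmas[i + 1] ** 2              # decrease in noise variance
        x = x + beta * score_fn(x, sigmas[i])                   # reverse-time drift term
        x = x + noise_scale * np.sqrt(beta) * rng.standard_normal(x.shape)  # diffusion term
    return x
```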
Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473315
Nils L. Westhausen;Hendrik Kayser;Theresa Jansen;Bernd T. Meyer
{"title":"Real-Time Multichannel Deep Speech Enhancement in Hearing Aids: Comparing Monaural and Binaural Processing in Complex Acoustic Scenarios","authors":"Nils L. Westhausen;Hendrik Kayser;Theresa Jansen;Bernd T. Meyer","doi":"10.1109/TASLP.2024.3473315","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473315","url":null,"abstract":"Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While in diffuse noise, all algorithms perform similarly, the binaural deep learning approach performs best in the presence of spatial interferers. Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4596-4606"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704042","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142587649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473302
Takahiro Fukumori;Taito Ishida;Yoichi Yamashita
{"title":"RISC: A Corpus for Shout Type Classification and Shout Intensity Prediction","authors":"Takahiro Fukumori;Taito Ishida;Yoichi Yamashita","doi":"10.1109/TASLP.2024.3473302","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473302","url":null,"abstract":"The detection of shouted speech is crucial in audio surveillance and monitoring. Although it is desirable for a security system to be able to identify emergencies, existing corpora provide only a binary label (i.e., shouted or normal) for each speech sample, making it difficult to predict the shout intensity. Furthermore, most corpora comprise only utterances typical of hazardous situations, meaning that classifiers cannot learn to discriminate such utterances from shouts typical of less hazardous situations such as cheers. Thus, this paper presents a novel research source, the RItsumeikan Shout Corpus (RISC), which contains wide variety types of shouted speech samples collected in recording experiments. Each shouted speech sample in RISC has a shout type and is also assigned shout intensity ratings via a crowdsourcing service. We also present a comprehensive performance comparison among deep learning approaches for speech type classification tasks and a shout intensity prediction task. The results show that feature learning based on the spectral and cepstral domains achieves high performance, no matter which network architecture is used. The results also demonstrate that shout type classification and intensity prediction are still challenging tasks, and RISC is expected to contribute to further development in this research area.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4434-4444"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10704045","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142434604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
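As an example of the spectral- and cepstral-domain inputs of the kind compared in the paper, the sketch below extracts a log-mel spectrogram and MFCCs with librosa. The file path, sampling rate, and parameter values are placeholders, and this is not the authors' feature pipeline.

```python
import numpy as np
import librosa

# "shout.wav" is a placeholder path; parameter values are typical defaults,
# not those used in the paper.
y, sr = librosa.load("shout.wav", sr=16000)

# Spectral-domain input: log-mel spectrogram.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)            # shape: (n_mels, n_frames)

# Cepstral-domain input: MFCCs.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=1024, hop_length=256)

print(log_mel.shape, mfcc.shape)
```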
Unsupervised Speech Enhancement Using Optimal Transport and Speech Presence Probability
IF 4.1, Tier 2, Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473318
Wenbin Jiang;Kai Yu;Fei Wen
{"title":"Unsupervised Speech Enhancement Using Optimal Transport and Speech Presence Probability","authors":"Wenbin Jiang;Kai Yu;Fei Wen","doi":"10.1109/TASLP.2024.3473318","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3473318","url":null,"abstract":"Speech enhancement models based on deep learning are typically trained in a supervised manner, requiring a substantial amount of paired noisy-to-clean speech data for training. However, synthetically generated training data can only capture a limited range of realistic environments, and it is often challenging or even impractical to gather real-world pairs of noisy and ground-truth clean speech. To overcome this limitation, we propose an unsupervised learning approach for speech enhancement that eliminates the need for paired noisy-to-clean training data. Specifically, our method utilizes the optimal transport criterion to train the speech enhancement model in an unsupervised manner. It employs a fidelity loss based on noisy speech and a distribution divergence loss to minimize the difference between the distribution of the model's output and that of unpaired clean speech. Further, we use the speech presence probability as an additional optimization objective and incorporate the short-time Fourier transform (STFT) domain loss as an extra term for the unsupervised learning loss. We also apply the multi-resolution STFT loss as the validation loss to enhance the stability of the training process and improve the algorithm's performance. Experimental results on the VCTK + DEMAND benchmark demonstrate that the proposed method achieves competitive performance compared to the supervised methods. Furthermore, the speech recognition results on the CHiME4 benchmark show the superiority of the proposed method over its supervised counterpart.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4445-4455"},"PeriodicalIF":4.1,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142434621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
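The multi-resolution STFT loss used for validation in this abstract is commonly defined as a spectral-convergence term plus a log-magnitude term averaged over several FFT/hop settings. The PyTorch sketch below follows that common formulation; the resolutions and equal weighting are assumptions rather than the paper's exact variant.

```python
import torch
import torch.nn.functional as F

def stft_mag(x, n_fft, hop):
    """Magnitude STFT of a waveform tensor of shape (T,) or (batch, T)."""
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, window=window, return_complex=True)
    return spec.abs()

def multi_resolution_stft_loss(estimate, reference,
                               resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectral-convergence + log-magnitude loss averaged over several resolutions.

    A common formulation used as a sketch here; the paper's exact variant and
    term weights are not reproduced.
    """
    loss = 0.0
    for n_fft, hop in resolutions:
        est = stft_mag(estimate, n_fft, hop)
        ref = stft_mag(reference, n_fft, hop)
        sc = torch.norm(ref - est, p="fro") / torch.norm(ref, p="fro").clamp(min=1e-8)
        mag = F.l1_loss(torch.log(est.clamp(min=1e-7)), torch.log(ref.clamp(min=1e-7)))
        loss = loss + sc + mag
    return loss / len(resolutions)
```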