Shiqi Wang, Hongbing Qiu, Xiyu Song, Mei Wang, Fangzhi Yao
DOI: 10.1016/j.apacoust.2024.110384
Journal: Applied Acoustics, Volume 228, Article 110384 (JCR Q1, Acoustics; Impact Factor 3.4)
Publication date: 2024-11-06 (Journal Article)
Available at: https://www.sciencedirect.com/science/article/pii/S0003682X24005358
Ambisonics neural speech extraction with directional feature and rotary steering
In scenes with noise and overlapping speakers, directionally extracting audio tracks corresponding to individual speakers is crucial for immersive and interactive spatial audio systems. Although neural networks have been successful in this task, existing steering approaches for adjusting the direction of neural speech extraction mainly target spatial audio directly collected by microphone arrays, while directional speech extraction with Ambisonics spatial audio is less well studied. Therefore, to encode the target directional information as input for the neural network, this paper proposes two Ambisonics directional features based on the spatial feature difference and beamforming principle: the relative harmonic difference and the directional signal enhancement ratio. Using the special property of Ambisonics' rotation transform, a rotary steering pre-processing is also proposed to align the target speaker's direction with a fixed reference by inversely rotating the sound field, thereby simplifying multi-directional extraction to fixed-directional extraction. Finally, we integrate these proposed approaches with the existing temporal-spectral-spatial filtering neural networks to establish a generalized framework for steerable speech extraction and conduct experiments on a simulated Ambisonics dataset containing multiple speakers and noise sources. The experiments show that the proposed approaches outperform existing conditional steering and can be applied to various existing neural network architectures.
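The rotary steering pre-processing relies on a standard property of Ambisonics: rotating the sound field is a linear operation on the channels, so a source at a known azimuth can be turned to face a fixed reference direction before extraction. The paper's implementation details are not given in the abstract, so the following is a minimal first-order (B-format) sketch of the idea, assuming channel order W, X, Y, Z with a horizontal plane-wave source encoded as X = cos(azimuth)·s and Y = sin(azimuth)·s; the function name and convention are illustrative, not the authors' code.

```python
import numpy as np

def steer_to_front(foa, target_azimuth):
    """Inverse-rotate a first-order Ambisonics (B-format) signal about z
    so a source at `target_azimuth` (radians, counter-clockwise from the
    x axis) lands at azimuth 0, i.e. a fixed "front" reference.

    foa: array of shape (4, n_samples), channels ordered W, X, Y, Z
         (an assumed convention; real recordings may use ACN/SN3D etc.).
    """
    w, x, y, z = foa
    c, s = np.cos(target_azimuth), np.sin(target_azimuth)
    # Rotation about z by -target_azimuth; W (omni) and Z are unaffected.
    x_rot = c * x + s * y
    y_rot = -s * x + c * y
    return np.stack([w, x_rot, y_rot, z])
```

For a plane wave encoded at azimuth θ, the rotated X channel collapses to the source signal and Y to zero, so a downstream extraction network only ever has to handle one fixed direction — which is exactly the simplification from multi-directional to fixed-directional extraction described above.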
Journal introduction:
Since its launch in 1968, Applied Acoustics has been publishing high quality research papers providing state-of-the-art coverage of research findings for engineers and scientists involved in applications of acoustics in the widest sense.
Applied Acoustics looks not only at recent developments in the understanding of acoustics but also at ways of exploiting that understanding. The Journal aims to encourage the exchange of practical experience through publication, and in so doing creates a fund of technological information that can be used for solving related problems. The presentation of information in graphical or tabular form is especially encouraged. If a report of a mathematical development is a necessary part of a paper, it is important to ensure that it is there only as an integral part of a practical solution to a problem and is supported by data. Applied Acoustics encourages the exchange of practical experience in the following ways:
• Complete Papers
• Short Technical Notes
• Review Articles
and thereby provides a wealth of technological information that can be used to solve related problems.
Manuscripts that address all fields of applications of acoustics ranging from medicine and NDT to the environment and buildings are welcome.