A TCN-based Primary Ambient Extraction in Generating Ambisonics Audio from Panorama Video

Zhuliang Lv, Yi Zhou, Hongqing Liu, Xiaofeng Shu, Nannan Zhang

2020 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), December 9, 2020. DOI: 10.1109/ISSPIT51521.2020.9408696
Spatial audio is an essential component of immersive audio-visual experiences such as virtual reality (VR): it reproduces the inherent spatiality of sound and keeps what is heard consistent with what is seen. Ambisonics is the dominant spatial audio solution owing to its flexibility and fidelity. Producing Ambisonics audio is difficult for the general public, however, because it requires expensive recording equipment or professional audio production skills. In this work, an end-to-end Ambisonics generator for panoramic video is proposed. To improve the perception of directional sound, we assume that the sound field consists of a primary, directional sound source plus a non-spatial ambient component, and we propose a Temporal Convolutional Network (TCN) based Primary Ambient Extractor (PAE) to separate the two. The primary sound is spatially encoded using weights predicted by an audio-visual fusion network, and the ambient component is then added back. The network is evaluated on panoramic video clips with first-order Ambisonics audio, and the results show that the proposed approach outperforms other methods in terms of objective evaluations.
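The abstract does not detail the PAE architecture, but a stack of dilated causal convolutions operating on a learned encoding of the mono mixture is the standard TCN pattern for mask-based source separation. The following is a minimal PyTorch sketch under that assumption; the class names, layer sizes, and the two-mask output are illustrative choices, not the authors' implementation.

# Minimal sketch of a dilated-causal TCN separator in the spirit of the
# paper's Primary Ambient Extractor (PAE). All hyperparameters and the
# mask-based design are assumptions for illustration.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only looks at past samples (left padding)."""
    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                              padding=self.pad, dilation=dilation)

    def forward(self, x):
        y = self.conv(x)
        return y[..., :-self.pad] if self.pad else y  # trim look-ahead

class TCNBlock(nn.Module):
    """Residual block: two dilated causal convolutions with PReLU."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, channels, kernel_size, dilation), nn.PReLU(),
            CausalConv1d(channels, channels, kernel_size, dilation), nn.PReLU(),
        )

    def forward(self, x):
        return x + self.net(x)  # residual connection

class PrimaryAmbientExtractor(nn.Module):
    """Predicts a primary mask and an ambient mask over a learned
    1-D encoding of the mono mixture (assumed design)."""
    def __init__(self, channels=64, kernel_size=3, num_blocks=6):
        super().__init__()
        self.encode = nn.Conv1d(1, channels, kernel_size=16, stride=8)
        self.tcn = nn.Sequential(*[
            TCNBlock(channels, kernel_size, dilation=2 ** i)
            for i in range(num_blocks)])              # dilations 1, 2, 4, ...
        self.mask = nn.Conv1d(channels, 2 * channels, 1)  # primary + ambient

    def forward(self, x):                  # x: (batch, 1, samples)
        feats = self.encode(x)
        m = torch.sigmoid(self.mask(self.tcn(feats)))
        m_primary, m_ambient = m.chunk(2, dim=1)
        return feats * m_primary, feats * m_ambient

# Smoke test on a random one-second mono clip at 16 kHz.
pae = PrimaryAmbientExtractor()
primary, ambient = pae(torch.randn(1, 1, 16000))
print(primary.shape, ambient.shape)

The exponentially growing dilations give the stack a long temporal receptive field at low cost, which is the property that motivates TCNs for this kind of waveform-domain separation.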
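For the spatial-encoding step, panning a single source into first-order Ambisonics is standard. A sketch in the traditional FuMa B-format convention (the abstract does not state which convention or normalization the authors use), for a primary signal $s(t)$ at azimuth $\theta$ and elevation $\phi$:

\begin{aligned}
W(t) &= \tfrac{1}{\sqrt{2}}\, s(t), \\
X(t) &= s(t)\cos\theta\cos\phi, \\
Y(t) &= s(t)\sin\theta\cos\phi, \\
Z(t) &= s(t)\sin\phi.
\end{aligned}

The weights produced by the audio-visual fusion network can be read as estimates of such direction-dependent gains. The non-spatial ambient estimate is then mixed back in; the abstract does not specify how, though adding it to the omnidirectional $W$ channel is one common choice.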