2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC): Latest Publications

ASGAN-VC: One-Shot Voice Conversion with Additional Style Embedding and Generative Adversarial Networks
Weicheng Li, Tzer-jen Wei
{"title":"ASGAN-VC: One-Shot Voice Conversion with Additional Style Embedding and Generative Adversarial Networks","authors":"Weicheng Li, Tzer-jen Wei","doi":"10.23919/APSIPAASC55919.2022.9979975","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979975","url":null,"abstract":"In this paper, we present a voice conversion system that improves the quality of generated voice and its similarity to the target voice style significantly. Many VC systems use feature-disentangle-based learning techniques to separate speakers' voices from their linguistic content in order to translate a voice into another style. This is the approach we are taking. To prevent speaker-style information from obscuring the content embedding, some previous works quantize or reduce the dimension of the embedding. However, an imperfect disentanglement would damage the quality and similarity of the sound. In this paper, to further improve quality and similarity in voice conversion, we propose a novel style transfer method within an autoencoder-based VC system that involves generative adversarial training. The conversion process was objectively evaluated using the fair third-party speaker verification system, the results shows that ASGAN-VC outperforms VQVC + and AGAINVC in terms of speaker similarity. A subjectively observing that our proposal outperformed the VQVC + and AGAINVC in terms of naturalness and speaker similarity.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"34 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113987836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Neural Beamformer with Automatic Detection of Notable Sounds for Acoustic Scene Classification
Sota Ichikawa, Takeshi Yamada, S. Makino
{"title":"Neural Beamformer with Automatic Detection of Notable Sounds for Acoustic Scene Classification","authors":"Sota Ichikawa, Takeshi Yamada, S. Makino","doi":"10.23919/APSIPAASC55919.2022.9980351","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980351","url":null,"abstract":"Recently, acoustic scene classification using an acoustic beamformer that is applied to a multichannel input signal has been proposed. Generally, prior information such as the direction of arrival of a target sound is necessary to generate a spatial filter for beamforming. However, it is not clear which sound is notable (i.e., useful for classification) in each individual sound scene and thus in which direction the target sound is located. It is therefore difficult to simply apply a beamformer for preprocessing. To solve this problem, we propose a method using a neural beamformer composed of the neural networks of a spatial filter generator and a classifier, which are optimized in an end-to-end manner. The aim of the proposed method is to automatically find a notable sound in each individual sound scene and generate a spatial filter to emphasize that notable sound, without requiring any prior information such as the direction of arrival and the reference signal of the target sound in both training and testing. The loss functions used in the proposed method are of four types: one is for classification and the remaining loss functions are for beamforming that help in obtaining a clear directivity pattern. To evaluate the performance of the proposed method, we conducted an experiment on classifying two scenes: one is a scene where a male is speaking under noise and another is a scene where a female is speaking under noise. The experimental results showed that the segmental SNR averaged over all the test data was improved by 10.7 dB. This indicates that the proposed method could successfully find speech as a notable sound in this classification task and generate the spatial filter to emphasize it.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114038260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
MoCoVC: Non-parallel Voice Conversion with Momentum Contrastive Representation Learning
Kotaro Onishi, Toru Nakashika
{"title":"MoCoVC: Non-parallel Voice Conversion with Momentum Contrastive Representation Learning","authors":"Kotaro Onishi, Toru Nakashika","doi":"10.23919/APSIPAASC55919.2022.9979937","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979937","url":null,"abstract":"Non-parallel voice conversion with deep neural net-works often disentangle speaker individuality and speech content. However, these methods rely on external models, text data, or implicit constraints for ways to disentangle. They may require learning other models or annotating text, or may not understand how latent representations are acquired. Therefore, we pro-pose voice conversion with momentum contrastive representation learning (MoCo V C), a method of explicitly adding constraints to intermediate features using contrastive representation learning, which is a self-supervised learning method. Using contrastive rep-resentation learning with transformations that preserve utterance content allows us to explicitly constrain the intermediate features to preserve utterance content. We present transformations used for contrastive representation learning that could be used for voice conversion and verify the effectiveness of each in an exper-iment. Moreover, MoCoVC demonstrates a high or comparable performance to the vector quantization constrained method in terms of both naturalness and speaker individuality in subjective evaluation experiments.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115062338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Classification of Short Audio Acoustic Scenes Based on Data Augmentation Methods
Xuan Zhang, Yunfei Shao, Jun-Xiang Xu, Yong Ma, Wei-Qiang Zhang
{"title":"Classification of Short Audio Acoustic Scenes Based on Data Augmentation Methods","authors":"Xuan Zhang, Yunfei Shao, Jun-Xiang Xu, Yong Ma, Wei-Qiang Zhang","doi":"10.23919/APSIPAASC55919.2022.9980120","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980120","url":null,"abstract":"How to effectively classify short audio data into acoustic scenes is a new challenge proposed by task 1 of the DCASE2022 challenge. This paper details the exploration we made for this problem and the architecture we used. Our architecture is based on Segnet, adding an instance normalization layer to normalize the activations of the previous layer at conv_block 1 of encoder and deconv_block 2 of decoder. Log-mel spectrograms, delta features, and delta-delta features were extracted to train the acoustic scene classification model. A total of 6 data augmentation methods were applied as follows: mixup, time and frequency domain masking, image augmentation, auto level, pix2pix, and random crop. We applied three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieved higher classification accuracy than the baseline system. Our model can achieve an average accuracy of 60.58% when tested on the test set of TAU Urban Acoustic Scenes 2022 Mobile, development dataset. After model compression, our model achieved an average accuracy of 54.11% within the 127.2 K parameters size, 8-bit quantization, and MMACs less than 30 M.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115064845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
4G Signal RSSI Recommendation System for ISP Quality of Service Improvement
Tanatpon Duangta, Watcharaphong Yookwan, K. Chinnasarn, A. Boonsongsrikul
{"title":"4G Signal RSSI Recommendation System for ISP Quality of Service Improvement","authors":"Tanatpon Duangta, Watcharaphong Yookwan, K. Chinnasarn, A. Boonsongsrikul","doi":"10.23919/APSIPAASC55919.2022.9980030","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980030","url":null,"abstract":"4G Signal RSSI Recommendation System is one of the monitoring methods. The usage rate of local users improves the quality of traffic signals to cycle to receive increased traffic. This paper proposed a method for Prediction and the traffic of data rates used within the area at each location. The result of the proposed approach comparing the performance of models was: the RMSE Gradient Boost Tree, Decision Tree, and Random Forest were 0.291, 0.316 and 0.346, respectively. The correlation will be 0.976, 0.971, and 0.966 for Gradient Boost Tree, Decision Tree, and Random Forest, respectively, and the accuracy of Gradient Boost Tree, Decision Tree, and Random Forest were 97.8%, 97.4%, and 97%, respectively. The results of ensemble learning methods, the RMSE, correlation, and accuracy were: 0.312, 0.972, and 97.5%.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117310819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Studying Human-Based Speaker Diarization and Comparing to State-of-the-Art Systems
Simon W. McKnight, Aidan O. T. Hogg, Vincent W. Neo, P. Naylor
{"title":"Studying Human-Based Speaker Diarization and Comparing to State-of-the-Art Systems","authors":"Simon W. McKnight, Aidan O. T. Hogg, Vincent W. Neo, P. Naylor","doi":"10.23919/APSIPAASC55919.2022.9979811","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979811","url":null,"abstract":"Human-based speaker diarization experiments were carried out on a five-minute extract of a typical AMI corpus meeting to see how much variance there is in human reviews based on hearing only and to compare with state-of-the-art diarization systems on the same extract. There are three distinct experiments: (a) one with no prior information; (b) one with the ground truth speech activity detection (GT-SAD); and (c) one with the blank ground truth labels (GT-labels). The results show that most human reviews tend to be quite similar, albeit with some outliers, but the choice of GT-labels can make a dramatic difference to scored performance. Using the GT-SAD provides a big advantage and improves human review scores substantially, though small differences in the GT-SAD used can have a dramatic effect on results. The use of forgiveness collars is shown to be unhelpful. The results show that state-of-the-art systems can outperform the best human reviews when no prior information is provided. However, the best human reviews still outperform state-of-the-art systems when starting from the GT-SAD.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"264 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116040071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Investigation of noise-reverberation-robustness of modulation spectral features for speech-emotion recognition
Taiyang Guo, Sixia Li, M. Unoki, S. Okada
{"title":"Investigation of noise-reverberation-robustness of modulation spectral features for speech-emotion recognition","authors":"Taiyang Guo, Sixia Li, M. Unoki, S. Okada","doi":"10.23919/APSIPAASC55919.2022.9980032","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980032","url":null,"abstract":"Speech-emotion recognition (SER) in noisy reverber-ant environments is a fundamental technique for real-world ap-plications, including call center service and psychological disease diagnosis. However, in daily auditory environments with noise and reverberation, previous studies using acoustic features could not achieve the same emotion-recognition rates as in an ideal experimental environment (with no noise and no reverberation). To remedy this imperfection, it is necessary to find robust features against noise and reverberation for SER. However, it has been proved that a daily noisy reverberant environment (signal-to-noise ratio is greater than 10 dB and reverberation time is less than 1.0 s) does not affect humans' vocal-emotion recognition. On the basis of the auditory system of human perception, previous research proposed modulation spectral features (MSFs) that contribute to vocal-emotion recognition by humans. Using MSFs has the potential to improve SER in noisy reverberant environments. We investigated the effectiveness and robustness of MSFs for SER in noisy reverberant environments. We used noise-vocoded speech, which is synthesized speech that retains emotional components of speech signals in noisy reverberant environments as speech data. We also used a support vector machine as the classifier to carry out emotion recognition. The experimental results indicate that compared with two widely used feature sets, using MSFs improved the recognition accuracy in 13 of the 26 environments with an average improvement of 11.38%. Thus, MSFs contribute to SER and are robust against noise and reverberation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115474304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Unsupervised Disentanglement of Timbral, Pitch, and Variation Features From Musical Instrument Sounds With Random Perturbation
Keitaro Tanaka, Yoshiaki Bando, Kazuyoshi Yoshii, S. Morishima
{"title":"Unsupervised Disentanglement of Timbral, Pitch, and Variation Features From Musical Instrument Sounds With Random Perturbation","authors":"Keitaro Tanaka, Yoshiaki Bando, Kazuyoshi Yoshii, S. Morishima","doi":"10.23919/APSIPAASC55919.2022.9979893","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979893","url":null,"abstract":"This paper describes an unsupervised disentangled representation learning method for musical instrument sounds with pitched and unpitched spectra. Since conventional methods have commonly attempted to disentangle timbral features (e.g., instruments) and pitches (e.g., MIDI note numbers and FOs), they can be applied to only pitched sounds. Global timbres unique to instruments and local variations (e.g., expressions and playstyles) are also treated without distinction. Instead, we represent the spectrogram of a musical instrument sound with a variational autoencoder (VAE) that has timbral, pitch, and variation features as latent variables. The pitch clarity or percussiveness, brightness, and FOs (if existing) are considered to be represented in the abstract pitch features. The unsupervised disentanglement is achieved by extracting time-invariant and time-varying features as global timbres and local variations from randomly pitch-shifted input sounds and time-varying features as local pitch features from randomly timbre-distorted input sounds. To enhance the disentanglement of timbral and variation features from pitch features, input sounds are separated into spectral envelopes and fine structures with cepstrum analysis. The experiments showed that the proposed method can provide effective timbral and pitch features for better musical instrument classification and pitch estimation.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122507691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Evaluation of Voice Service in LEO Communication with 3GPP PUSCH Repetition Enhancement
Shou-Hong Liu, Chun-Tai Liu, Wei-Hung Chou, JenYi Pan
{"title":"Evaluation of Voice Service in LEO Communication with 3GPP PUSCH Repetition Enhancement","authors":"Shou-Hong Liu, Chun-Tai Liu, Wei-Hung Chou, JenYi Pan","doi":"10.23919/APSIPAASC55919.2022.9979986","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9979986","url":null,"abstract":"In recent years, 3GPP (3rd generation partnership project) has studied and developed the standards for Non-terrestrial networks (NTN). One of the newest working items for NTN is coverage enhancements. In this paper, we construct the NTN channel described in 3GPP. Moreover, we summarize the NTN channel model and current coverage enhancements. Since the scenario of NTN is very different from the traditional terrestrial network systems, we also summarize the challenges and phenomena of the NTN. To reach the high communication quality of voice over internet protocol (VoIP) service in NTN, we evaluate the performance and discuss the benefit of the PUSCH repetition technique in the NTN low Earth orbit (LEO) scenario.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"481 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122604068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Image Watermarking based on Saliency Detection and Multiple Transformations
Ahmed Khan, Koksheik Wong, Vishnu Monn Baskaran
{"title":"Image Watermarking based on Saliency Detection and Multiple Transformations","authors":"Ahmed Khan, Koksheik Wong, Vishnu Monn Baskaran","doi":"10.23919/APSIPAASC55919.2022.9980044","DOIUrl":"https://doi.org/10.23919/APSIPAASC55919.2022.9980044","url":null,"abstract":"An ideal image watermarking (IW) scheme aims to manage the trade-off among quality, capacity, and robustness. However, our literature survey reveals some flaws in the form of poor robustness and quality or low embedding capability. In this paper, multiple frequency domain based image watermarking scheme using salient (eye-catching) object detection is applied. Specifically, the host and the watermark images are partitioned into background and foreground regions by the proposed multi-dimension decomposition, which accumulates image patches and combining them to form the salient map. Next, the watermark image is encrypted by multiple applications of the 3D Arnold and logistic maps, then embedded into both the identified foreground and background regions of the host image by using different embedding strengths. The proposed method can embed 1 color pixel of the watermark image into 1 color pixel in the host image while maintaining high image quality. In the best case scenario, we could embed a 24-bit image as the watermark into a 24-bit image of the same dimension while maintaining an average RGB-SSIM of 0.9999. Experiments are carried out (with 10K MSRA dataset images) to verify the performance of the proposed method and to compare our proposed method against the state-of-the-art (SOTA) watermarking methods.","PeriodicalId":382967,"journal":{"name":"2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123921076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3