ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): Latest Publications

Deep Reinforcement Learning-based Rate Adaptation for Adaptive 360-Degree Video Streaming
Nuowen Kan, Junni Zou, Kexin Tang, Chenglin Li, Ning Liu, H. Xiong
DOI: 10.1109/ICASSP.2019.8683779 | Pages: 4030-4034 | Published: 2019-05-12
Abstract: In this paper, we propose a deep reinforcement learning (DRL)-based rate adaptation algorithm for adaptive 360-degree video streaming, which is able to maximize the quality of experience of viewers by adapting the transmitted video quality to the time-varying network conditions. Specifically, to reduce the possible switching latency of the field of view (FoV), we design a new QoE metric by introducing a penalty term for the large buffer occupancy. A scalable FoV method is further proposed to alleviate the combinatorial explosion of the action space in the DRL formulation. Then, we model the rate adaptation logic as a Markov decision process and employ the DRL-based algorithm to dynamically learn the optimal video transmission rate. Simulation results show the superior performance of the proposed algorithm compared to the existing algorithms.
Citations: 28
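A minimal sketch of a per-chunk QoE-style reward with a buffer-occupancy penalty, in the spirit of the metric described above. The weights, the buffer cap, and the linear penalty form are illustrative assumptions, not the authors' exact formula.

```python
# Hedged sketch: QoE reward for one downloaded chunk in a DRL
# rate-adaptation loop. All weights below are assumed values.
def qoe_reward(quality, prev_quality, rebuffer_time, buffer_level,
               buffer_cap=4.0, w_smooth=1.0, w_rebuf=4.3, w_buf=0.5):
    smoothness_penalty = w_smooth * abs(quality - prev_quality)
    rebuffer_penalty = w_rebuf * rebuffer_time
    # Penalize large buffer occupancy to keep FoV switching latency low,
    # as the abstract's penalty term suggests.
    buffer_penalty = w_buf * max(0.0, buffer_level - buffer_cap)
    return quality - smoothness_penalty - rebuffer_penalty - buffer_penalty

print(qoe_reward(quality=3.0, prev_quality=2.0,
                 rebuffer_time=0.2, buffer_level=6.0))
```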
Dnn-based Spectral Enhancement for Neural Waveform Generators with Low-bit Quantization
Yang Ai, Jing-Xuan Zhang, Liang Chen, Zhenhua Ling
DOI: 10.1109/ICASSP.2019.8683016 | Pages: 7025-7029 | Published: 2019-05-12
Abstract: This paper presents a spectral enhancement method to improve the quality of speech reconstructed by neural waveform generators with low-bit quantization. At the training stage, this method builds a multiple-target DNN, which predicts log amplitude spectra of natural high-bit waveforms together with the amplitude ratios between natural and distorted spectra. Log amplitude spectra of the waveforms reconstructed by low-bit neural waveform generators are adopted as model input. At the generation stage, the enhanced amplitude spectra are obtained by an ensemble decoding strategy, and are further combined with the phase spectra of the low-bit waveforms to produce the final waveforms by inverse STFT. In our experiments on WaveRNN vocoders, an 8-bit WaveRNN with spectral enhancement outperforms a 16-bit counterpart with the same model complexity in terms of the quality of reconstructed waveforms. In addition, the proposed spectral enhancement method can also help an 8-bit WaveRNN with reduced model complexity to achieve subjective performance similar to that of a conventional 16-bit WaveRNN.
Citations: 9
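A minimal sketch of the final reconstruction step the abstract describes: combine the DNN-enhanced amplitude spectra with the phase of the low-bit waveform via inverse STFT. `enhanced_log_amp` stands in for the DNN's ensemble-decoded output; the sampling rate and STFT parameters are illustrative assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def reconstruct(low_bit_wav, enhanced_log_amp, fs=16000, nperseg=512):
    """enhanced_log_amp must have the same (freq, frames) shape as the STFT."""
    _, _, Z = stft(low_bit_wav, fs=fs, nperseg=nperseg)
    phase = np.angle(Z)                 # keep the low-bit waveform's phase
    amp = np.exp(enhanced_log_amp)      # DNN-enhanced amplitude spectra
    _, wav = istft(amp * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return wav
```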
Speaker Recognition for Multi-speaker Conversations Using X-vectors
David Snyder, D. Garcia-Romero, Gregory Sell, A. McCree, Daniel Povey, S. Khudanpur
DOI: 10.1109/ICASSP.2019.8683760 | Pages: 5796-5800 | Published: 2019-05-12
Abstract: Recently, deep neural networks that map utterances to fixed-dimensional embeddings have emerged as the state-of-the-art in speaker recognition. Our prior work introduced x-vectors, an embedding that is very effective for both speaker recognition and diarization. This paper combines our previous work and applies it to the problem of speaker recognition on multi-speaker conversations. We measure performance on Speakers in the Wild and report what we believe are the best published error rates on this dataset. Moreover, we find that diarization substantially reduces error rate when there are multiple speakers, while maintaining excellent performance on single-speaker recordings. Finally, we introduce an easily implemented method to remove the domain-sensitive threshold typically used in the clustering stage of a diarization system. The proposed method is more robust to domain shifts, and achieves similar results to those obtained using a well-tuned threshold.
Citations: 244
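A minimal sketch of one way diarization supports recognition on multi-speaker recordings, as described above: embed each diarized speaker cluster, score the enrollment x-vector against each, and keep the best match. Cosine scoring is an illustrative stand-in for the PLDA backend typically used with x-vectors; the paper's threshold-removal method is not reproduced here.

```python
import numpy as np

def verify_multi_speaker(enroll_xvec, cluster_xvecs):
    """cluster_xvecs: one x-vector per diarized speaker cluster."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    # The target speaker, if present, should dominate one cluster,
    # so the maximum cluster score is used as the trial score.
    return max(cosine(enroll_xvec, c) for c in cluster_xvecs)
```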
Time Difference of Arrival Estimation of Speech Signals Using Deep Neural Networks with Integrated Time-frequency Masking
Pasi Pertilä, Mikko Parviainen
DOI: 10.1109/ICASSP.2019.8682574 | Pages: 436-440 | Published: 2019-05-12
Abstract: The Time Difference of Arrival (TDoA) of a sound wavefront impinging on a microphone pair carries spatial information about the source. However, captured speech typically contains dynamic non-speech interference sources and noise. Therefore, the TDoA estimates fluctuate between speech and interference. Deep Neural Networks (DNNs) have been applied for Time-Frequency (TF) masking for Acoustic Source Localization (ASL) to filter out non-speech components from a speaker location likelihood function. However, the type of TF mask for this task is not obvious. Secondly, the DNN should estimate the TDoA values, but existing solutions estimate the TF mask instead. To overcome these issues, a direct formulation of the TF masking as a part of a DNN-based ASL structure is proposed. Furthermore, the proposed network operates in an online manner, i.e., producing estimates frame-by-frame. Combined with the use of recurrent layers it exploits the sequential progression of speaker related TDoAs. Training with different microphone spacings allows model re-use for different microphone pair geometries in inference. Real-data experiments with smartphone recordings of speech in interference demonstrate the network's generalization capability.
Citations: 24
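A minimal sketch of mask-weighted GCC-PHAT, the classical mechanism that TF masking for TDoA builds on: a speech mask (here assumed given, e.g., predicted by a DNN per frame) down-weights interference-dominated frequency bins before the cross-correlation peak is picked. This illustrates the role of the mask, not the paper's integrated end-to-end network.

```python
import numpy as np

def masked_gcc_phat(X1, X2, mask, max_lag):
    """X1, X2: one-sided FFTs (np.fft.rfft) of a mic-pair frame;
    mask: per-bin speech weights in [0, 1]; returns TDoA in samples."""
    cross = X1 * np.conj(X2)
    phat = cross / (np.abs(cross) + 1e-12)   # phase transform weighting
    cc = np.fft.irfft(mask * phat)           # mask suppresses noisy bins
    lags = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])
    return int(np.argmax(lags)) - max_lag
```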
An Ensemble of Deep Recurrent Neural Networks for P-wave Detection in Electrocardiogram
A. Peimankar, S. Puthusserypady
DOI: 10.1109/ICASSP.2019.8682307 | Pages: 1284-1288 | Published: 2019-05-12
Abstract: Detection of P-waves in electrocardiogram (ECG) signals is of great importance to cardiologists in order to help them diagnose arrhythmias such as atrial fibrillation. This paper proposes an end-to-end deep learning approach for detection of P-waves in ECG signals. Four deep Recurrent Neural Networks (RNNs), namely Long Short-Term Memory (LSTM) networks, are used in an ensemble framework. Each of these networks is trained to extract useful features from raw ECG signals and determine the absence/presence of P-waves. Outputs of these classifiers are then combined for final detection of the P-waves. The proposed algorithm was trained and validated on a database consisting of more than 111,000 annotated heartbeats, and the results show consistently high classification accuracy and sensitivity of around 98.48% and 97.22%, respectively.
Citations: 18
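A minimal PyTorch sketch of the ensemble idea described above: several LSTM classifiers each emit a per-sample P-wave probability, and their outputs are combined (here by simple averaging; the paper's exact combination rule may differ). The hidden size and ensemble size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PWaveLSTM(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, ecg):                   # ecg: (batch, samples, 1)
        out, _ = self.lstm(ecg)
        return torch.sigmoid(self.head(out))  # P-wave probability per sample

ensemble = [PWaveLSTM() for _ in range(4)]
ecg = torch.randn(2, 500, 1)                  # toy batch of raw ECG segments
probs = torch.stack([m(ecg) for m in ensemble]).mean(dim=0)
```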
Investigation of Modeling Units for Mandarin Speech Recognition Using Dfsmn-ctc-smbr
Shiliang Zhang, Ming Lei, Yuan Liu, Wei Li
DOI: 10.1109/ICASSP.2019.8683859 | Pages: 7085-7089 | Published: 2019-05-12
Abstract: The choice of acoustic modeling units is critical to acoustic modeling in large vocabulary continuous speech recognition (LVCSR) tasks. The recent connectionist temporal classification (CTC) based acoustic models have more options for the choice of modeling units. In this work, we propose a DFSMN-CTC-sMBR acoustic model and investigate various modeling units for Mandarin speech recognition. In addition to the commonly used context-independent Initial/Finals (CI-IF), context-dependent Initial/Finals (CD-IF) and Syllable, we also propose hybrid Character-Syllable modeling units by mixing high-frequency Chinese characters and syllables. Experimental results show that DFSMN-CTC-sMBR models with all these types of modeling units can significantly outperform the well-trained conventional hybrid models. Moreover, we find that the proposed hybrid Character-Syllable modeling units are the best choice for CTC-based acoustic modeling for Mandarin speech recognition in our work, since they dramatically reduce substitution errors in recognition results. In a 20,000-hour Mandarin speech recognition task, the DFSMN-CTC-sMBR system with hybrid Character-Syllable units achieves a character error rate (CER) of 7.45%, compared with 9.49% for the well-trained DFSMN-CE-sMBR system.
Citations: 31
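A minimal sketch of building a hybrid Character-Syllable unit inventory as the abstract describes: keep high-frequency Chinese characters as units and fall back to syllables for the rest. The frequency cutoff and the `to_syllables` lexicon lookup (character to pinyin syllables) are illustrative assumptions.

```python
from collections import Counter

def build_hybrid_units(transcripts, to_syllables, top_k=2000):
    """transcripts: iterable of character strings;
    to_syllables(ch): returns the list of syllable units for a character."""
    freq = Counter(ch for line in transcripts for ch in line)
    frequent = {ch for ch, _ in freq.most_common(top_k)}

    def tokenize(line):
        units = []
        for ch in line:
            if ch in frequent:
                units.append(ch)                 # high-frequency character unit
            else:
                units.extend(to_syllables(ch))   # fall back to syllable units
        return units

    return frequent, tokenize
```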
Cascaded Point Network for 3D Hand Pose Estimation*
Yikun Dou, Xuguang Wang, Yuying Zhu, Xiaoming Deng, Cuixia Ma, Liang Chang, Hongan Wang
DOI: 10.1109/ICASSP.2019.8683356 | Pages: 1982-1986 | Published: 2019-05-12
Abstract: Recent PointNet-family hand pose methods offer high pose estimation performance and small model size, but obtaining effective sample points is a key problem for these methods. In this paper, we propose a two-stage, coarse-to-fine hand pose estimation method, which belongs to the PointNet family and explores a new sample point strategy. In the first stage, we use the 3D coordinates and surface normals of the normalized point cloud as input to regress coarse hand joints. In the second stage, we use the hand joints from the first stage as the initial sample points to refine the hand joints. Experiments on widely used datasets demonstrate that using joints as sample points is more effective and that our method achieves top-rank performance.
Citations: 3
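A minimal sketch of the refinement stage's sampling strategy described above: use the coarse joint predictions as sample points and gather each joint's nearest neighbors from the point cloud as input patches for the second network. The neighborhood size is an illustrative assumption.

```python
import numpy as np

def sample_around_joints(points, coarse_joints, k=64):
    """points: (N, 3) normalized cloud; coarse_joints: (J, 3) stage-1 output."""
    patches = []
    for joint in coarse_joints:
        dists = np.linalg.norm(points - joint, axis=1)
        patches.append(points[np.argsort(dists)[:k]])  # k nearest points
    return np.stack(patches)                           # (J, k, 3) patches
```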
Learning Semantic-preserving Space Using User Profile and Multimodal Media Content from Political Social Network
Wei-Hao Chang, Jeng-Lin Li, Chi-Chun Lee
DOI: 10.1109/ICASSP.2019.8682596 | Pages: 3990-3994 | Published: 2019-05-12
Abstract: The use of social media in politics has dramatically changed the way campaigns are run and how elected officials interact with their constituents. An advanced algorithm is required to analyze and understand this large amount of heterogeneous social media data to investigate several key issues, such as stance and strategy, in political science. Most previous works concentrate on a text-as-data approach, where the rich yet heterogeneous information in user profiles, social relationships, and multimodal media content is largely ignored. In this work, we propose a two-branch network that jointly maps the post contents and politician profile into the same latent space, trained using a large-margin objective that combines a cross-instance distance constraint with a within-instance semantic-preserving constraint. Our proposed political embedding space can be utilized not only in reliably identifying political spectrum and message type but also in providing a political representation space for interpretable ease-of-visualization.
Citations: 2
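A minimal PyTorch sketch of a large-margin objective in the spirit of the one described above: matched post/profile embedding pairs are pulled together while mismatched pairs are pushed at least `margin` apart (cross-instance), plus a within-instance term that keeps same-class post embeddings close (semantic-preserving). The margin, weight, and negative-pairing scheme are illustrative assumptions, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def large_margin_loss(post_emb, prof_emb, labels, margin=0.2, w=0.5):
    """post_emb, prof_emb: (N, D) aligned pairs; labels: (N,) class ids."""
    pos = F.pairwise_distance(post_emb, prof_emb)            # matched pairs
    neg = F.pairwise_distance(post_emb, prof_emb.roll(1, 0)) # shifted mismatches
    cross = F.relu(margin + pos - neg).mean()
    # Within-instance term: same-class posts should stay close in the space.
    same = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    dists = torch.cdist(post_emb, post_emb)
    within = (same * dists).sum() / same.sum().clamp(min=1)
    return cross + w * within
```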
Hierarchical Two-level Modelling of Emotional States in Spoken Dialog Systems
Oxana Verkholyak, D. Fedotov, Heysem Kaya, Yang Zhang, Alexey Karpov
DOI: 10.1109/ICASSP.2019.8683240 | Pages: 6700-6704 | Published: 2019-05-12
Abstract: Emotions occur in complex social interactions, and thus processing of isolated utterances may not be sufficient to grasp the nature of underlying emotional states. Dialog speech provides useful information about context that explains nuances of emotions and their transitions. Context can be defined on different levels; this paper proposes a hierarchical context modelling approach based on an RNN-LSTM architecture, which models acoustic context on the frame level and the partner's emotional context on the dialog level. The method proves effective together with a cross-corpus training setup and a domain adaptation technique in speaker-independent cross-validation experiments on the IEMOCAP corpus for three-level activation and valence classification. As a result, the state-of-the-art on this corpus is advanced for both dimensions using only the acoustic modality.
Citations: 5
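A minimal PyTorch sketch of the two-level hierarchy described above: a frame-level LSTM summarizes each utterance's acoustic features into a turn embedding, and a dialog-level LSTM runs over the sequence of turn embeddings (own and partner turns interleaved) to classify the emotional state. Feature and hidden dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TwoLevelEmotion(nn.Module):
    def __init__(self, feat_dim=40, hid=64, classes=3):
        super().__init__()
        self.frame_lstm = nn.LSTM(feat_dim, hid, batch_first=True)
        self.dialog_lstm = nn.LSTM(hid, hid, batch_first=True)
        self.head = nn.Linear(hid, classes)

    def forward(self, dialog):            # dialog: (turns, frames, feat_dim)
        _, (h, _) = self.frame_lstm(dialog)   # frame level: one vector per turn
        turn_emb = h[-1].unsqueeze(0)         # (1, turns, hid)
        out, _ = self.dialog_lstm(turn_emb)   # dialog level: context over turns
        return self.head(out[0])              # (turns, classes) logits

model = TwoLevelEmotion()
logits = model(torch.randn(6, 100, 40))       # 6-turn toy dialog
```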
Improved Metrical Alignment of Midi Performance Based on a Repetition-aware Online-adapted Grammar
Andrew Mcleod, Eita Nakamura, Kazuyoshi Yoshii
DOI: 10.1109/ICASSP.2019.8683808 | Pages: 186-190 | Published: 2019-05-12
Abstract: This paper presents an improvement on an existing grammar-based method for metrical structure detection and alignment, a task which involves aligning a repeated tree structure with an input stream of musical notes. The previous method achieves state-of-the-art results, but performs poorly when it lacks training data. Data annotated as it requires is not widely available, making this drawback of the method significant. We present a novel online learning technique to improve the grammar's performance on unseen rhythmic patterns using a dynamically learned piece-specific grammar. The piece-specific grammar can measure the musical well-formedness of the underlying alignment without requiring any training data. It instead relies on musical repetition and self-similarity, enabling the model to recognize repeated rhythmic patterns, even when a similar pattern was never seen in the training data. Using it, we see improved performance on a corpus containing only Bach compositions, as well as a second corpus containing works from a variety of composers, indicating that the online-learned grammar helps the model generalize to unseen rhythms and styles.
Citations: 1
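A minimal sketch of the online adaptation idea described above: maintain piece-specific counts of rhythmic patterns seen so far and interpolate them with the trained grammar's probabilities, so that patterns repeated within the piece gain probability even if they were rare in training. The linear interpolation scheme and weight are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter

class OnlineAdaptedGrammar:
    def __init__(self, trained_prob, lam=0.5):
        self.trained_prob = trained_prob  # pattern -> trained probability
        self.counts = Counter()           # piece-specific pattern counts
        self.total = 0
        self.lam = lam                    # assumed interpolation weight

    def prob(self, pattern):
        """pattern: hashable rhythm encoding, e.g. a tuple of durations."""
        piece = self.counts[pattern] / self.total if self.total else 0.0
        return (self.lam * self.trained_prob.get(pattern, 1e-6)
                + (1 - self.lam) * piece)

    def observe(self, pattern):
        """Update piece-specific counts after aligning a pattern."""
        self.counts[pattern] += 1
        self.total += 1
```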