ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): Latest Publications

Agent-Environment Network for Temporal Action Proposal Generation
Viet-Khoa Vo-Ho, Ngan T. H. Le, Kashu Yamazaki, A. Sugimoto, Minh-Triet Tran
{"title":"Agent-Environment Network for Temporal Action Proposal Generation","authors":"Viet-Khoa Vo-Ho, Ngan T. H. Le, Kashu Yamazaki, A. Sugimoto, Minh-Triet Tran","doi":"10.1109/ICASSP39728.2021.9415101","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9415101","url":null,"abstract":"Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most of existing approaches are unable to follow the human cognitive process of understanding the video context due to lack of attention mechanism to express the concept of an action or an agent who performs the action or the interaction between the agent and the environment. Based on the action definition that a human, known as an agent, interacts with the environment and performs an action that affects the environment, we propose a contextual Agent-Environment Network. Our proposed contextual AEN involves (i) agent pathway, operating at a local level to tell about which humans/agents are acting and (ii) environment pathway operating at a global level to tell about how the agents interact with the environment. Comprehensive evaluations on 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e C3D and SlowFast, show that our method robustly exhibits outperformance against state-of-the-art methods regardless of the employed backbone network.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114101872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
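The abstract describes two pathways, a local agent pathway and a global environment pathway, whose outputs are combined for proposal scoring. A minimal PyTorch-style sketch of such a two-pathway fusion is given below; the module names, feature dimensions, and the attention-based fusion are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a two-pathway (agent / environment) fusion block.
# Dimensions, module names, and the attention fusion are illustrative assumptions.
import torch
import torch.nn as nn

class AgentEnvironmentFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Local pathway: per-agent (human) features queried via attention.
        self.agent_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Global pathway: temporal convolution over clip-level environment features.
        self.env_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.proposal_head = nn.Linear(2 * dim, 1)  # actionness score per snippet

    def forward(self, env_feats, agent_feats):
        # env_feats:   (B, T, D)  snippet-level global features
        # agent_feats: (B, N, D)  per-agent local features (e.g., pooled person boxes)
        env = self.env_conv(env_feats.transpose(1, 2)).transpose(1, 2)     # (B, T, D)
        # Each snippet queries the agents to gather "who is acting" information.
        agent_ctx, _ = self.agent_attn(query=env, key=agent_feats, value=agent_feats)
        fused = torch.cat([env, agent_ctx], dim=-1)                        # (B, T, 2D)
        return torch.sigmoid(self.proposal_head(fused)).squeeze(-1)        # (B, T)

scores = AgentEnvironmentFusion()(torch.randn(2, 100, 256), torch.randn(2, 5, 256))
```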
An Improved Mean Teacher Based Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection
Xu Zheng, Yan Song, I. Mcloughlin, Lin Liu, Lirong Dai
{"title":"An Improved Mean Teacher Based Method for Large Scale Weakly Labeled Semi-Supervised Sound Event Detection","authors":"Xu Zheng, Yan Song, I. Mcloughlin, Lin Liu, Lirong Dai","doi":"10.1109/ICASSP39728.2021.9414931","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414931","url":null,"abstract":"This paper presents an improved mean teacher (MT) based method for large-scale weakly labeled semi-supervised sound event detection (SED), by focusing on learning a better student model. Two main improvements are proposed based on the authors’ previous perturbation based MT method. Firstly, an event-aware module is de-signed to allow multiple branches with different kernel sizes to be fused via an attention mechanism. By inserting this module after the convolutional layer, each neuron can adaptively adjust its receptive field to suit different sound events. Secondly, instead of using the teacher model to provide a consistency cost term, we propose using a stochastic inference of unlabeled examples to generate high quality pseudo-targets by averaging multiple predictions from the perturbed student model. MixUp of both labeled and unlabeled data is further exploited to improve the effectiveness of student model. Finally, the teacher model can be obtained via exponential moving average (EMA) of the student model, which generates final predictions for SED during inference. Experiments on the DCASE2018 task4 dataset demonstrate the ability of the proposed method. Specifically, an F1-score of 42.1% is achieved, significantly outperforming the 32.4% achieved by the winning system, or the 39.3% by the previous perturbation based method.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"55 5","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114114106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
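Two of the ingredients named in the abstract, the EMA teacher and pseudo-targets obtained by averaging several perturbed student predictions, are sketched below. The input-noise perturbation and the placeholder model are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an EMA teacher update and stochastic-averaged pseudo-targets.
# The perturbation (input noise) and the placeholder model are assumptions.
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    # teacher <- alpha * teacher + (1 - alpha) * student, parameter by parameter
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

@torch.no_grad()
def pseudo_targets(student, unlabeled_x, n_samples=4, noise_std=0.1):
    # Average multiple stochastic ("perturbed") forward passes of the student
    # to obtain higher-quality soft targets for the unlabeled data.
    preds = [torch.sigmoid(student(unlabeled_x + noise_std * torch.randn_like(unlabeled_x)))
             for _ in range(n_samples)]
    return torch.stack(preds).mean(dim=0)

student = torch.nn.Sequential(torch.nn.Linear(64, 10))    # placeholder SED model
teacher = copy.deepcopy(student)
x_unlabeled = torch.randn(8, 64)
targets = pseudo_targets(student, x_unlabeled)             # soft multi-label targets
ema_update(teacher, student)                               # teacher used at inference
```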
Improved Atomic Norm Based Channel Estimation for Time-Varying Narrowband Leaked Channels
Jianxiu Li, U. Mitra
{"title":"Improved Atomic Norm Based Channel Estimation for Time-Varying Narrowband Leaked Channels","authors":"Jianxiu Li, U. Mitra","doi":"10.1109/ICASSP39728.2021.9413804","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413804","url":null,"abstract":"In this paper, improved channel gain delay estimation strategies are investigated when practical pulse shapes with finite block length and transmission bandwidth are employed. Pilot-aided channel estimation with an improved atomic norm based approach is proposed to promote the low rank structure of the channel. All the channel parameters, i.e., delays, Doppler shifts and channel gains are recovered. Design choices which ensure unique estimates of channel parameters for root-raised-cosine pulse shapes are examined. Furthermore, a perturbation analysis is conducted. Finally, numerical results verify the theoretical analysis and show performance improvements over the previously proposed method.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114331335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
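For background on the atomic norm that the abstract relies on, the standard definition and a generic regularized estimator are shown below. This is textbook atomic-norm material, not the paper's exact formulation; the atom set, measurement operator, and regularization weight are generic placeholders.

```latex
% General definition of the atomic norm (background only, not the paper's formulation).
% \mathcal{A} is the atom set; for delay--Doppler channels the atoms a(\tau,\nu) are
% typically complex exponentials parameterized by a delay and a Doppler shift.
\[
  \|x\|_{\mathcal{A}}
  = \inf\{\, t > 0 : x \in t\,\operatorname{conv}(\mathcal{A}) \,\}
  = \inf\Big\{\, \sum_{k} c_k : x = \sum_{k} c_k\, a_k,\; c_k \ge 0,\; a_k \in \mathcal{A} \Big\}.
\]
% Generic regularized atomic-norm estimator from noisy pilot measurements y = \Phi x + w:
\[
  \hat{x} = \arg\min_{x}\; \tfrac{1}{2}\,\|y - \Phi x\|_2^2 + \lambda\, \|x\|_{\mathcal{A}}.
\]
```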
Fast Inverse Mapping of Face GANs
N. Bayat, Vahid Reza Khazaie, Y. Mohsenzadeh
{"title":"Fast Inverse Mapping of Face GANs","authors":"N. Bayat, Vahid Reza Khazaie, Y. Mohsenzadeh","doi":"10.1109/ICASSP39728.2021.9413532","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413532","url":null,"abstract":"Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. While many studies have explored various training configurations and architectures for GANs, the problem of inverting the generator of GANs has been inadequately investigated. We train a ResNet architecture to map given faces to latent vectors that can be used to generate faces nearly identical to the target. We use a perceptual loss to embed face details in the recovered latent vector while maintaining visual quality using a pixel loss. The vast majority of studies on latent vector recovery are very slow and perform well only on generated images. We argue that our method can be used to determine a fast mapping between real human faces and latent-space vectors that contain most of the important face style details. At last, we demonstrate the performance of our approach on both real and generated faces.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114374961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
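The training objective described in the abstract, a pixel loss plus a perceptual (feature-space) loss on the reconstruction produced by a frozen generator, can be sketched as follows. The encoder, generator, and feature extractor below are toy stand-ins with matching shapes, not the authors' networks, and the loss weight is an assumption.

```python
# Hypothetical sketch of encoder training for GAN inversion with pixel + perceptual losses.
import torch
import torch.nn.functional as F

def inversion_loss(encoder, generator, feat_extractor, x, perceptual_weight=0.8):
    z = encoder(x)                           # predicted latent vector
    x_rec = generator(z)                     # reconstruction from the (frozen) generator
    pixel = F.l1_loss(x_rec, x)              # maintains overall visual quality
    perceptual = F.l1_loss(feat_extractor(x_rec), feat_extractor(x))  # preserves face details
    return pixel + perceptual_weight * perceptual

# Toy stand-ins with matching shapes, just to show how the pieces connect.
encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 512))
generator = torch.nn.Sequential(torch.nn.Linear(512, 3 * 64 * 64),
                                torch.nn.Unflatten(1, (3, 64, 64)))
feat_extractor = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 128))

loss = inversion_loss(encoder, generator, feat_extractor, torch.randn(4, 3, 64, 64))
loss.backward()   # in practice only the encoder is updated; the generator stays frozen
```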
What And Where To Focus In Person Search
Tong Zhou, Kun Tian
{"title":"What And Where To Focus In Person Search","authors":"Tong Zhou, Kun Tian","doi":"10.1109/ICASSP39728.2021.9414439","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414439","url":null,"abstract":"Person search aims to locate and identify the query person from a gallery of original scene images. Almost all previous methods only consider single high-level semantic information, ignoring that the essence of identification task is to learn rich and expressive features. Additionally, large pose variations and occlusions of the target person significantly increase the difficulty of search task. For these two findings, we first propose multilevel semantic aggregation algorithm for more discriminative feature descriptors. Then, a pose-assisted attention module is designed to highlight fine-grained area of the target and simultaneously capture valuable clues for identification. Extensive experiments confirm that our framework can coordinate multilevel semantics of persons and effectively alleviate the adverse effects of occlusion and various pose. We also achieve state-of-the-art performance on two challenging datasets CUHK-SYSU and PRW.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114533795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
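A minimal sketch of the "multilevel semantic aggregation" idea, merging features from several backbone stages into one identity descriptor, is shown below. The stage dimensions and the concatenation-based merge are assumptions for illustration only.

```python
# Hypothetical sketch: project per-stage backbone features to a common size and merge them.
import torch
import torch.nn as nn

class MultiLevelAggregation(nn.Module):
    def __init__(self, stage_dims=(256, 512, 1024, 2048), out_dim=256):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in stage_dims])
        self.fuse = nn.Linear(out_dim * len(stage_dims), out_dim)

    def forward(self, stage_feats):
        # stage_feats: list of globally pooled per-stage features, each of shape (B, D_i)
        projected = [torch.relu(p(f)) for p, f in zip(self.proj, stage_feats)]
        return self.fuse(torch.cat(projected, dim=-1))   # (B, out_dim) identity descriptor

feats = [torch.randn(2, d) for d in (256, 512, 1024, 2048)]
descriptor = MultiLevelAggregation()(feats)
```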
A New Framework Based on Transfer Learning for Cross-Database Pneumonia Detection
Xinxin Shan, Y. Wen
{"title":"A New Framework Based on Transfer Learning for Cross-Database Pneumonia Detection","authors":"Xinxin Shan, Y. Wen","doi":"10.1109/ICASSP39728.2021.9414997","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414997","url":null,"abstract":"Cross-database classification means that the model is able to apply to the serious disequilibrium of data distributions, and it is trained by one database while tested by another database. Thus, cross-database pneumonia detection is a challenging task. In this paper, we proposed a new framework based on transfer learning for cross-database pneumonia detection. First, based on transfer learning, we fine-tune a backbone that pre-trained on non-medical data by using a small amount of pneumonia images, which improves the detection performance on homogeneous dataset. Then in order to make the fine-tuned model applicable to cross-database classification, the adaptation layer combined with a self-learning strategy is proposed to retrain the model. The adaptation layer is to make the heterogeneous data distributions approximate and the self-learning strategy helps to tweak the model by generating pseudo-labels. Experiments on three pneumonia databases show that our proposed model completes the cross-database detection of pneumonia and shows good performance.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121486059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
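The self-learning step described in the abstract, in which the fine-tuned model generates pseudo-labels for the target database and is retrained on the confident ones, can be sketched as follows. The placeholder model, confidence threshold, and optimizer settings are assumptions, not the paper's configuration.

```python
# Hypothetical sketch of one pseudo-label self-learning step on target-database images.
import torch
import torch.nn.functional as F

def self_learning_step(model, target_images, optimizer, threshold=0.9):
    model.eval()
    with torch.no_grad():
        probs = torch.softmax(model(target_images), dim=1)
        conf, pseudo = probs.max(dim=1)
        keep = conf > threshold               # keep only confident pseudo-labels
    if keep.sum() == 0:
        return None                           # nothing confident enough this round
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(target_images[keep]), pseudo[keep])
    loss.backward()
    optimizer.step()
    return loss.item()

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
self_learning_step(model, torch.randn(16, 3, 32, 32), opt)
```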
Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction
Mingyue Niu, J. Tao, B. Liu
{"title":"Multi-Scale and Multi-Region Facial Discriminative Representation for Automatic Depression Level Prediction","authors":"Mingyue Niu, J. Tao, B. Liu","doi":"10.1109/ICASSP39728.2021.9413504","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9413504","url":null,"abstract":"Physiological studies have shown that differences in facial activities between depressed patients and normal individuals are manifested in different local facial regions and the durations of these activities are not the same. But most previous works extract features from the entire facial region at a fixed time scale to predict the individual depression level. Thus, they are inadequate in capturing dynamic facial changes. For these reasons, we propose a multi-scale and multi-region fa-cial dynamic representation method to improve the prediction performance. In particular, we firstly use multiple time scales to divide the original long-term video into segments containing different facial regions. Secondly, the segment-level feature is extracted by 3D convolution neural network to characterize the facial activities with different durations in different facial regions. Thirdly, this paper adopts eigen evolution pooling and gradient boosting decision tree to aggregate these segment-level features and select discriminative elements to generate the video-level feature. Finally, the depression level is predicted using support vector regression. Experiments are conducted on AVEC2013 and AVEC2014. The results demonstrate that our method achieves better performance than the previous works.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121496866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
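The last two stages of the pipeline, gradient boosting based element selection followed by support vector regression, are sketched below under simplifying assumptions: the video-level features X are random stand-ins (the 3D-CNN and eigen evolution pooling stages are assumed to have produced them already), and the top-k selection rule is illustrative.

```python
# Hypothetical sketch: GBDT ranks feature elements, the most discriminative are kept,
# and SVR predicts the depression score from the selected elements.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 512))          # video-level features (one row per video)
y = rng.uniform(0, 45, size=200)         # depression scores (e.g., a BDI-II-like range)

gbdt = GradientBoostingRegressor(n_estimators=100).fit(X, y)
top = np.argsort(gbdt.feature_importances_)[-64:]    # keep the 64 most informative dims

svr = SVR(kernel="rbf", C=10.0).fit(X[:, top], y)
pred = svr.predict(X[:5, top])
```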
Teacher-Student Learning for Low-Latency Online Speech Enhancement Using Wave-U-Net
Sotaro Nakaoka, Li Li, S. Inoue, S. Makino
{"title":"Teacher-Student Learning for Low-Latency Online Speech Enhancement Using Wave-U-Net","authors":"Sotaro Nakaoka, Li Li, S. Inoue, S. Makino","doi":"10.1109/ICASSP39728.2021.9414280","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414280","url":null,"abstract":"In this paper, we propose a low-latency online extension of wave-U-net for single-channel speech enhancement, which utilizes teacher-student learning to reduce the system latency while keeping the enhancement performance high. Wave-U-net is a recently proposed end-to-end source separation method, which achieved remarkable performance in singing voice separation and speech enhancement tasks. Since the enhancement is performed in the time domain, wave-U-net can efficiently model phase information and address the domain transformation limitation, where the time-frequency domain is normally adopted. In this paper, we apply wave-U-net to face-to-face applications such as hearing aids and in-car communication systems, where a strictly low-latency of less than 10 ms is required. To this end, we investigate online versions of wave-U-net and propose the use of teacher-student learning to prevent the performance degradation caused by the reduction in input segment length such that the system delay in a CPU is less than 10 ms. The experimental results revealed that the proposed model could perform in real-time with low-latency and high performance, achieving a signal-to-distortion ratio improvement of about 8.73 dB.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121570125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
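The teacher-student idea described in the abstract, where an offline model with long input context supervises a low-latency model that only sees a short segment, is sketched below. The stand-in convolutional models, segment lengths, and L1 distillation loss are assumptions, not the paper's Wave-U-Net configuration.

```python
# Hypothetical sketch of teacher-student training for a low-latency enhancement model.
import torch
import torch.nn.functional as F

teacher = torch.nn.Conv1d(1, 1, kernel_size=1023, padding=511)   # stand-in for offline Wave-U-Net
student = torch.nn.Conv1d(1, 1, kernel_size=63, padding=31)      # stand-in for low-latency model
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

noisy = torch.randn(8, 1, 4000)            # ~0.25 s of 16 kHz noisy speech per example
with torch.no_grad():
    target = teacher(noisy)                # teacher prediction used as the training target

short = noisy[..., -1024:]                 # the student only sees the most recent samples
loss = F.l1_loss(student(short), target[..., -1024:])
opt.zero_grad(); loss.backward(); opt.step()
```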
CoughWatch: Real-World Cough Detection Using Smartwatches
D. Liaqat, S. Liaqat, Jun Lin Chen, Tina Sedaghat, Moshe Gabel, Frank Rudzicz, E. D. Lara
{"title":"Coughwatch: Real-World Cough Detection using Smartwatches","authors":"D. Liaqat, S. Liaqat, Jun Lin Chen, Tina Sedaghat, Moshe Gabel, Frank Rudzicz, E. D. Lara","doi":"10.1109/ICASSP39728.2021.9414881","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414881","url":null,"abstract":"Continuous monitoring of cough may provide insights into the health of individuals as well as the effectiveness of treatments. Smart-watches, in particular, are highly promising for such monitoring: they are inexpensive, unobtrusive, programmable, and have a variety of sensors. However, current mobile cough detection systems are not designed for smartwatches, and perform poorly when applied to real-world smartwatch data since they are often evaluated on data collected in the lab.In this work we propose CoughWatch, a lightweight cough detector for smartwatches that uses audio and movement data for in-the-wild cough detection. On our in-the-wild data, CoughWatch achieves a precision of 82% and recall of 55%, compared to 6% precision and 19% recall achieved by the current state-of-the-art approach. Furthermore, by incorporating gyroscope and accelerometer data, CoughWatch improves precision by up to 15.5 percentage points compared to an audio-only model.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114711564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
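The sensor-fusion idea in the abstract, combining audio features with gyroscope and accelerometer features for the same time window, is sketched below with a simple early-fusion classifier. The feature sizes, the classifier, and the random data are illustrative assumptions, not CoughWatch's actual pipeline.

```python
# Hypothetical sketch of early fusion of audio and IMU features for per-window cough detection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(500, 40))      # e.g., log-mel statistics per window
imu_feats = rng.normal(size=(500, 12))        # accelerometer + gyroscope statistics per window
labels = rng.integers(0, 2, size=500)         # 1 = cough window, 0 = other

X = np.concatenate([audio_feats, imu_feats], axis=1)    # concatenate the two modalities
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X[:3])[:, 1])         # cough probability for three windows
```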
History Utterance Embedding Transformer LM for Speech Recognition
Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan
{"title":"History Utterance Embedding Transformer LM for Speech Recognition","authors":"Keqi Deng, Gaofeng Cheng, Haoran Miao, Pengyuan Zhang, Yonghong Yan","doi":"10.1109/ICASSP39728.2021.9414575","DOIUrl":"https://doi.org/10.1109/ICASSP39728.2021.9414575","url":null,"abstract":"History utterances contain rich contextual information; however, better extracting information from the history utterances and using it to improve the language model (LM) is still challenging. In this paper, we propose the history utterance embedding Transformer LM (HTLM), which includes an embedding generation network for extracting contextual information contained in the history utterances and a main Transformer LM for current prediction. In addition, the two-stage attention (TSA) is proposed to encode richer contextual information into the embedding of history utterances (h-emb) while supporting GPU parallel training. Furthermore, we combine the extracted h-emb and embedding of current utterance (c-emb) through the dot-product attention and a fusion method for HTLM's current prediction. Experiments are conducted on the HKUST dataset and achieve a 23.4% character error rate (CER) on the test set. Compared with the baseline, the proposed method yields 12.86 absolute perplexity reduction and 0.8% absolute CER reduction.","PeriodicalId":347060,"journal":{"name":"ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114763095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
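The fusion step described in the abstract, where the current-utterance embedding attends over the history-utterance embeddings via dot-product attention and the result is combined with it, is sketched below. The dimensions and the additive fusion are assumptions for illustration.

```python
# Hypothetical sketch of dot-product attention fusion of c-emb with h-emb.
import math
import torch

def fuse_history(c_emb, h_emb):
    # c_emb: (B, D) current-utterance embedding; h_emb: (B, H, D) history-utterance embeddings
    scores = torch.einsum("bd,bhd->bh", c_emb, h_emb) / math.sqrt(c_emb.size(-1))
    weights = torch.softmax(scores, dim=-1)                   # attention over history utterances
    history_ctx = torch.einsum("bh,bhd->bd", weights, h_emb)  # weighted history context
    return c_emb + history_ctx                                # fused representation for prediction

fused = fuse_history(torch.randn(2, 256), torch.randn(2, 4, 256))
```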