{"title":"Human Activity Recognition From Motion and Acoustic Sensors Using Contrastive Learning","authors":"Rui Zhou, Running Zhao, Edith C. H. Ngai","doi":"10.1109/ICASSPW59220.2023.10192969","DOIUrl":null,"url":null,"abstract":"In this paper, we formulate human activity recognition as a downstream task of pretrained multimodal contrastive learning (MCL) models and break the convention of the one-modality-to-one-modality contrastive paradigm by allowing the models to have more than one source modality. Different from the prevailing assumption in MCL that one source modality and one target modality are the counterparts of each other, this work considers the possibility where it takes multiple source modalities with complementary information to match up to a target modality. In particular, we leverage a large-scale pretrained audio-language contrastive model and extend it to accepting IMU and audio input. The experiment results indicate the superiority of using complementary source modalities over using any source modality alone with 10.3% to 35.0% performance gain.","PeriodicalId":158726,"journal":{"name":"2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSPW59220.2023.10192969","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In this paper, we formulate human activity recognition as a downstream task of pretrained multimodal contrastive learning (MCL) models and break with the conventional one-modality-to-one-modality contrastive paradigm by allowing the models to have more than one source modality. Unlike the prevailing assumption in MCL that one source modality and one target modality are counterparts of each other, this work considers the possibility that multiple source modalities with complementary information are needed to match up to a target modality. In particular, we leverage a large-scale pretrained audio-language contrastive model and extend it to accept IMU and audio input. The experimental results indicate the superiority of using complementary source modalities over using any single source modality alone, with performance gains of 10.3% to 35.0%.
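To make the multi-source contrastive paradigm concrete, below is a minimal PyTorch sketch of the idea the abstract describes: embeddings from two source modalities (IMU and audio) are fused and matched against a target-modality embedding with a symmetric InfoNCE loss. This is not the authors' released code; the encoder architectures, the concatenation-based fusion, the feature dimensions, and the CLIP-style temperature are all illustrative assumptions (the paper instead builds on a large-scale pretrained audio-language contrastive model).

```python
# Hypothetical sketch of multi-source contrastive learning: fuse IMU and
# audio embeddings, then contrast the fused source embedding against a
# target-modality embedding (e.g., text) with a symmetric InfoNCE loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiSourceContrastive(nn.Module):
    def __init__(self, imu_dim=64, audio_dim=128, target_dim=512, embed_dim=512):
        super().__init__()
        # Lightweight placeholder encoders; the paper extends a pretrained
        # audio-language model rather than training encoders from scratch.
        self.imu_encoder = nn.Sequential(
            nn.Linear(imu_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        # Fuse the complementary source modalities into a single embedding.
        self.fusion = nn.Linear(2 * embed_dim, embed_dim)
        self.target_proj = nn.Linear(target_dim, embed_dim)
        # Learnable log temperature, initialized CLIP-style to log(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, imu, audio, target):
        z_src = self.fusion(torch.cat(
            [self.imu_encoder(imu), self.audio_encoder(audio)], dim=-1))
        z_tgt = self.target_proj(target)
        z_src = F.normalize(z_src, dim=-1)
        z_tgt = F.normalize(z_tgt, dim=-1)
        logits = self.logit_scale.exp() * z_src @ z_tgt.t()
        labels = torch.arange(logits.size(0), device=logits.device)
        # Symmetric InfoNCE: fused sources -> targets and targets -> sources.
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.t(), labels)) / 2

# Toy usage with random features standing in for IMU windows, audio clips,
# and precomputed target-modality embeddings.
model = MultiSourceContrastive()
loss = model(torch.randn(8, 64), torch.randn(8, 128), torch.randn(8, 512))
loss.backward()
```

Under these assumptions, the key departure from one-to-one MCL is the `fusion` step: both source modalities contribute to a single embedding on the source side of the contrastive objective, so the model can exploit their complementary information when matching the target modality.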