Aaqib Saeed, T. Ozcelebi, S. Trajanovski, J. Lukkien
{"title":"端到端多模态行为上下文识别在现实生活中的应用","authors":"Aaqib Saeed, T. Ozcelebi, S. Trajanovski, J. Lukkien","doi":"10.23919/fusion43075.2019.9011194","DOIUrl":null,"url":null,"abstract":"Smart devices of everyday use (such as smartphones and wearables) are increasingly integrated with sensors that provide immense amounts of information about a person's daily life. The automatic and unobtrusive sensing of human behavioral context can help develop solutions for assisted living, fitness tracking, sleep monitoring, and several other fields. Towards addressing this issue, we raise the question: can a machine learn to recognize a diverse set of contexts and activities in a real-life through jointly learning from raw multi-modal signals (e.g., accelerometer, gyroscope and audio)? In this paper, we propose a multi-stream network comprising of temporal convolution and fully-connected layers to address the problem of multi-label behavioral context recognition. A four-stream network architecture handles learning from each modality with a contextualization module which incorporates extracted representations to infer a user's context. Our empirical evaluation suggests that a deep convolutional network trained end-to-end achieves comparable performance to manual feature engineering with minimal effort. Furthermore, the presented architecture can be extended to include similar sensors for performance improvements and handles missing modalities through multi-task learning on a highly imbalanced and sparsely labeled dataset.","PeriodicalId":348881,"journal":{"name":"2019 22th International Conference on Information Fusion (FUSION)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"End-to-End Multi-Modal Behavioral Context Recognition in a Real-Life Setting\",\"authors\":\"Aaqib Saeed, T. Ozcelebi, S. Trajanovski, J. Lukkien\",\"doi\":\"10.23919/fusion43075.2019.9011194\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Smart devices of everyday use (such as smartphones and wearables) are increasingly integrated with sensors that provide immense amounts of information about a person's daily life. The automatic and unobtrusive sensing of human behavioral context can help develop solutions for assisted living, fitness tracking, sleep monitoring, and several other fields. Towards addressing this issue, we raise the question: can a machine learn to recognize a diverse set of contexts and activities in a real-life through jointly learning from raw multi-modal signals (e.g., accelerometer, gyroscope and audio)? In this paper, we propose a multi-stream network comprising of temporal convolution and fully-connected layers to address the problem of multi-label behavioral context recognition. A four-stream network architecture handles learning from each modality with a contextualization module which incorporates extracted representations to infer a user's context. Our empirical evaluation suggests that a deep convolutional network trained end-to-end achieves comparable performance to manual feature engineering with minimal effort. 
Furthermore, the presented architecture can be extended to include similar sensors for performance improvements and handles missing modalities through multi-task learning on a highly imbalanced and sparsely labeled dataset.\",\"PeriodicalId\":348881,\"journal\":{\"name\":\"2019 22th International Conference on Information Fusion (FUSION)\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 22th International Conference on Information Fusion (FUSION)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.23919/fusion43075.2019.9011194\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 22th International Conference on Information Fusion (FUSION)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/fusion43075.2019.9011194","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
End-to-End Multi-Modal Behavioral Context Recognition in a Real-Life Setting
Smart devices in everyday use (such as smartphones and wearables) are increasingly equipped with sensors that provide immense amounts of information about a person's daily life. Automatic and unobtrusive sensing of human behavioral context can help develop solutions for assisted living, fitness tracking, sleep monitoring, and several other fields. Toward addressing this issue, we raise the question: can a machine learn to recognize a diverse set of contexts and activities in a real-life setting by jointly learning from raw multi-modal signals (e.g., accelerometer, gyroscope, and audio)? In this paper, we propose a multi-stream network comprising temporal convolution and fully-connected layers to address the problem of multi-label behavioral context recognition. A four-stream network architecture handles learning from each modality, and a contextualization module combines the extracted representations to infer a user's context. Our empirical evaluation suggests that a deep convolutional network trained end-to-end achieves performance comparable to manual feature engineering, with minimal effort. Furthermore, the presented architecture can be extended to include similar sensors for performance improvements, and it handles missing modalities through multi-task learning on a highly imbalanced and sparsely labeled dataset.
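
The abstract describes the architecture only at a high level. As a rough illustration of what such a model could look like, the sketch below builds a multi-stream network in PyTorch: one temporal-convolution encoder per raw modality, a fully-connected fusion module standing in for the contextualization module, and a masked binary cross-entropy loss as one plausible way to treat each context label as a separate task on a sparsely labeled dataset. All names, layer sizes, modality channel counts, and the masking scheme are assumptions for illustration, not the paper's exact design.

# Illustrative sketch (not the authors' exact architecture): a multi-stream
# temporal-convolution network with a fully-connected fusion module for
# multi-label behavioral context recognition. Layer sizes and stream names
# are assumptions.
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Temporal-convolution encoder for one raw sensor modality."""
    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # global average pooling over time
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) window of the raw signal
        return self.conv(x).squeeze(-1)  # (batch, hidden)

class MultiStreamContextNet(nn.Module):
    def __init__(self, modality_channels: dict, num_labels: int, hidden: int = 64):
        super().__init__()
        # One encoder stream per modality (e.g., four streams in the paper).
        self.encoders = nn.ModuleDict(
            {name: StreamEncoder(ch, hidden) for name, ch in modality_channels.items()}
        )
        # Fully-connected "contextualization" stand-in over concatenated streams.
        self.fusion = nn.Sequential(
            nn.Linear(hidden * len(modality_channels), 128), nn.ReLU(),
            nn.Linear(128, num_labels),  # one logit per context label
        )

    def forward(self, inputs: dict) -> torch.Tensor:
        feats = [enc(inputs[name]) for name, enc in self.encoders.items()]
        return self.fusion(torch.cat(feats, dim=1))

# Masked binary cross-entropy: each label is its own binary task, and
# unobserved (missing) label entries are excluded from the loss -- one way
# to train on a sparsely labeled multi-label dataset.
def masked_bce(logits, targets, mask):
    # targets: float 0/1 labels; mask: 1 where a label was observed, else 0
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    return (loss * mask).sum() / mask.sum().clamp(min=1)

For example, with a hypothetical modality_channels = {"acc": 3, "gyro": 3, "audio": 1, "aux": 1}, the model consumes a dict of (batch, channels, time) tensors, one per stream, and emits one logit per context label; concatenation is only one simple fusion choice, and the paper's contextualization module may combine the streams differently.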