Multimodal Event Detection in User Generated Videos
Francesco Cricri, Kostadin Dabov, I. Curcio, Sujeet Mate, M. Gabbouj
2011 IEEE International Symposium on Multimedia, December 5, 2011. DOI: 10.1109/ISM.2011.49 (https://doi.org/10.1109/ISM.2011.49)
Nowadays, most camera-enabled electronic devices contain various auxiliary sensors such as accelerometers, gyroscopes, compasses, and GPS receivers. These sensors are often used during media acquisition to limit camera degradations such as shake, and to provide basic tagging information such as the location used in geo-tagging. Surprisingly, exploiting the sensor-recording modality for high-level event detection has received rather limited research attention, and existing work has been constrained to highly specialized acquisition setups. In this work, we show how these sensor modalities, alone or in combination with content-based analysis, allow us to infer information about the video content. In addition, we consider a multi-camera scenario in which multiple user-generated recordings of a common scene (e.g., music concerts, public events) are available. To understand some higher-level semantics of the recorded media, we jointly analyze the individual video recordings and sensor measurements of the multiple users. The detected semantics include generic interesting events as well as some more specific events. The detection exploits correlations in the camera motion and in the audio content of multiple users. We show that the proposed multimodal analysis methods perform well on various recordings obtained at real live music performances.
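The cross-user correlation cue mentioned in the abstract lends itself to a compact illustration. The following is a minimal Python sketch, not the authors' implementation: it derives a panning-rate signal from compass headings, computes a short-time RMS audio envelope, and flags instants where several users' signals deviate strongly at the same time. The function names, the z-score test, and all thresholds are illustrative assumptions; the paper's actual detectors are more elaborate.

```python
import numpy as np

def pan_rate(headings_deg, fs):
    """Angular panning speed (deg/s) from compass headings,
    unwrapped to handle the 359 -> 0 degree discontinuity."""
    unwrapped = np.unwrap(np.radians(headings_deg))
    return np.abs(np.diff(np.degrees(unwrapped))) * fs

def audio_envelope(samples, fs, win_s=1.0):
    """Short-time RMS energy envelope, one value per window."""
    win = int(fs * win_s)
    n = len(samples) // win
    frames = np.asarray(samples[: n * win]).reshape(n, win)
    return np.sqrt((frames ** 2).mean(axis=1))

def interesting_segments(signals, z_thresh=1.5, min_users=2):
    """Flag time indices where at least `min_users` signals deviate
    strongly (z-score above threshold) simultaneously -- a simple
    proxy for the cross-user correlation cue. Assumes the signals
    are time-aligned and of equal length."""
    zs = [(s - s.mean()) / (s.std() + 1e-9) for s in signals]
    hits = np.sum([z > z_thresh for z in zs], axis=0)
    return np.where(hits >= min_users)[0]

# Toy usage: three users with time-aligned 1 Hz pan-rate signals;
# everyone pans toward the stage at around t = 120 s.
rng = np.random.default_rng(0)
sigs = [rng.normal(0.0, 1.0, 300) for _ in range(3)]
for s in sigs:
    s[120:125] += 5.0
print(interesting_segments(sigs))  # indices near 120
```

The same `interesting_segments` test can be run on per-user audio envelopes instead of pan rates, which loosely mirrors the abstract's claim that both camera-motion and audio-content correlations across users carry event-level semantics.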