{"title":"CEA'20: The 12th Workshop on Multimedia for Cooking and Eating Activities","authors":"I. Ide, Yoko Yamakata, Atsushi Hashimoto","doi":"10.1145/3372278.3388040","DOIUrl":"https://doi.org/10.1145/3372278.3388040","url":null,"abstract":"The 12th Workshop on Multimedia for Cooking and Eating Activities presents This overview introduces the aim of the CEA'20 workshop and the list of papers presented in the workshop.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"397 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117310182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous Non-Local Fusion for Multimodal Activity Recognition","authors":"Petr Byvshev, P. Mettes, Yu Xiao","doi":"10.1145/3372278.3390675","DOIUrl":"https://doi.org/10.1145/3372278.3390675","url":null,"abstract":"In this work, we investigate activity recognition using multimodal inputs from heterogeneous sensors. Activity recognition is commonly tackled from a single-modal perspective using videos. In case multiple signals are used, they come from the same homogeneous modality, e.g. in the case of color and optical flow. Here, we propose an activity network that fuses multimodal inputs coming from completely different and heterogeneous sensors. We frame such a heterogeneous fusion as a non-local operation. The observation is that in a non-local operation, only the channel dimensions need to match. In the network, heterogeneous inputs are fused, while maintaining the shapes and dimensionalities that fit each input. We outline both asymmetric fusion, where one modality serves to enforce the other, and symmetric fusion variants. To further promote research into multimodal activity recognition, we introduce GloVid, a first-person activity dataset captured with video recordings and smart glove sensor readings. 
Experiments on GloVid show the potential of heterogeneous non-local fusion for activity recognition, outperforming individual modalities and standard fusion techniques.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124029232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond Relevance Feedback for Searching and Exploring large Multimedia Collections","authors":"M. Worring","doi":"10.1145/3372278.3390669","DOIUrl":"https://doi.org/10.1145/3372278.3390669","url":null,"abstract":"Relevance feedback was introduced over twenty years ago as a powerful tool for interactive retrieval and still is the dominant mode of interaction in multimedia retrieval systems. Over the years methods have improved and recently relevance feedback has become feasible on even the largest collections available in the multimedia community. Yet, relevance feedback typically targets the optimization of linear lists of search results and thus focuses on only one of the many tasks on the search - explore axis. Truly interactive retrieval systems have to consider the whole axis and interactive categorization is an overarching framework for many of those tasks. The multimedia analytics system MediaTable exploits this to support users in getting insight in large image collections. Categorization as a representation of the collection and user tasks does not capture the relations between items in the collection like graphs do. Hypergraphs are combining categories and relations in one model and as they are founded in set theory in fact are closely related to categorization. They, therefore, provide an elegant framework to move forward. In this talk we highlight the progress that has been made in the field of interactive retrieval and in the direction of multimedia analytics. 
We will further consider the promises that new results in deep learning, especially in the context of graph convolutional networks, and hypergraphs might bring to go beyond relevance feedback.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132321551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAENet: Boosting Feature Representation for Cross-Modal Person Re-Identification with Pairwise Supervision","authors":"Yongbiao Chen, Sheng Zhang, Zhengwei Qi","doi":"10.1145/3372278.3390699","DOIUrl":"https://doi.org/10.1145/3372278.3390699","url":null,"abstract":"Person re-identification aims at successfully retrieving the images of a specific person in the gallery dataset given a probe image. Among all the existing research areas related to person re-identification, visible to thermal person re-identification (VT-REID) has gained proliferating momentum. VT-REID is deemed to be a rather challenging task owing to the large cross-modality gap [25], cross-modality variation and intra-modality variation. Existing techniques generally tackle this problem by embedding cross-modality data with convolutional neural networks into shared feature space to bridge the cross-modality discrepancy, and subsequently, devise hinge losses on similarity learning to alleviate the variation. However, feature extraction methods based simply on convolutional neural networks may fail to capture the distinctive and modality-invariant features, resulting in noises for further re-identification techniques. In this work, we present a novel modality and appearance invariant embedding learning framework equipped with maximum likelihood learning to perform cross-modal person re-identification. Extensive and comprehensive experiments are conducted to test the effectiveness of our framework. 
Results demonstrated that the proposed framework yields state-of-the-art Re-ID accuracy on RegDB and SYSU-MM01 datasets.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130217230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"One Perceptron to Rule Them All: Language, Vision, Audio and Speech","authors":"Xavier Giró-i-Nieto","doi":"10.1145/3372278.3390740","DOIUrl":"https://doi.org/10.1145/3372278.3390740","url":null,"abstract":"Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading or video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representation. This tutorial will firstly review the basic neural architectures to encode and decode vision, text and audio, to later review the those models that have successfully translated information across modalities.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134124782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Queries over Video via Lightweight Keypoint-based Object Detection","authors":"Jiansheng Dong, Jingling Yuan, Lin Li, X. Zhong, Weiru Liu","doi":"10.1145/3372278.3390714","DOIUrl":"https://doi.org/10.1145/3372278.3390714","url":null,"abstract":"Recent advancements in convolutional neural networks based object detection have enabled analyzing the mounting video data with high accuracy. However, inference speed is a major drawback of these video analysis system because of the heavy object detectors. To address the computational and practicability challenges of video analysis, we propose FastQ, a system for efficient querying over video at scale. Given a target video, FastQ can automatically label the category and number of objects for each frame. We introduce a novel lightweight object detector named FDet to improve the efficiency of query system. First, a difference detector filters the frames whose difference is less than the threshold. Second, FDet is employed to efficiently label the remaining frames. To reduce inference time, FDet detects a center keypoint and a pair of corners from the feature map generated by a lightweight backbone to predict the bounding boxes. FDet completely avoid the complicated computation related to anchor boxes. Compared with state-of-the-art real-time detectors, FDet achieves superior performance with 29.1% AP on COCO benchmark at 25.3ms. 
Experiments show that FastQ achieves 150 times to 300 times speed-ups while maintaining more than 90% accuracy in video queries.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123216065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"iSparse: Output Informed Sparsification of Neural Network","authors":"Yash Garg, K. Candan","doi":"10.1145/3372278.3390688","DOIUrl":"https://doi.org/10.1145/3372278.3390688","url":null,"abstract":"Deep neural networks have demonstrated unprecedented success in various multimedia applications. However, the networks created are often very complex, with large numbers of trainable edges that require extensive computational resources. We note that many successful networks nevertheless often contain large numbers of redundant edges. Moreover, many of these edges may have negligible contributions towards the overall network performance. In this paper, we propose a novel iSparse framework and experimentally show, that we can sparsify the network without impacting the network performance. iSparse leverages a novel edge significance score, E, to determine the importance of an edge with respect to the final network output. Furthermore, iSparse can be applied both while training a model or on top of a pre-trained model, making it a retraining-free approach - leading to a minimal computational overhead. Comparisons of iSparse against Dropout, L1, DropConnect, Retraining-Free, and Lottery-Ticket Hypothesis on benchmark datasets show that iSparse leads to effective network sparsifications.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"114 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127408791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visual Relations Augmented Cross-modal Retrieval","authors":"Yutian Guo, Jingjing Chen, Hao Zhang, Yu-Gang Jiang","doi":"10.1145/3372278.3390709","DOIUrl":"https://doi.org/10.1145/3372278.3390709","url":null,"abstract":"Retrieving relevant samples across multiple-modalities is a primary topic that receives consistently research interests in multimedia communities, and has benefited various real-world multimedia applications (e.g., text-based image searching). Current models mainly focus on learning a unified visual semantic embedding space to bridge visual contents & text query, targeting at aligning relevant samples from different modalities as neighbors in the embedding space. However, these models did not consider relations between visual components in learning visual representations, resulting in their incapability of distinguishing images with the same visual components but different relations (i.e., Figure 1). To precisely modeling visual contents, we introduce a novel framework that enhanced visual representation with relations between components. Specifically, visual relations are represented by the scene graph extracted from an image, then encoded by the graph convolutional neural networks for learning visual relational features. We combine the relational and compositional representation together for image-text retrieval. 
Empirical results conducted on the challenging MS-COCO and Flicker 30K datasets demonstrate the effectiveness of our proposed model for cross-modal retrieval task.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122243443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QIK: A System for Large-Scale Image Retrieval on Everyday Scenes With Common Objects","authors":"Arun Zachariah, Mohamed Gharibi, P. Rao","doi":"10.1145/3372278.3390682","DOIUrl":"https://doi.org/10.1145/3372278.3390682","url":null,"abstract":"In this paper, we propose a system for large-scale image retrieval on everyday scenes with common objects by leveraging advances in deep learning and natural language processing (NLP). Unlike recent state-of-the-art approaches that extract image features from a convolutional neural network (CNN), our system exploits the predictions made by deep neural networks for image understanding tasks. Our system aims to capture the relationships between objects in an everyday scene rather than just the individual objects in the scene. It works as follows: For each image in the database, it generates most probable captions and detects objects in the image using state-of-the-art deep learning models. The captions are parsed and represented by tree structures using NLP techniques. These are stored and indexed in a database system. When a user poses a query image, its caption is generated using deep learning and parsed into its corresponding tree structures. Then an optimized tree-pattern query is constructed and executed on the database to retrieve a set of candidate images. Finally, these candidate images are ranked using the tree-edit distance metric computed on the tree structures. A query based on only objects detected in the query image can also be formulated and executed. In this case, the ranking scheme uses the probabilities of the detected objects. 
We evaluated the performance of our system on the Microsoft COCO dataset containing everyday scenes (with common objects) and observed that our system can outperform state-of-the-art techniques in terms of mean average precision for large-scale image retrieval.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"32 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114091147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fake News Detection via Knowledge-driven Multimodal Graph Convolutional Networks","authors":"Youze Wang, Shengsheng Qian, Jun Hu, Quan Fang, Changsheng Xu","doi":"10.1145/3372278.3390713","DOIUrl":"https://doi.org/10.1145/3372278.3390713","url":null,"abstract":"Nowadays, with the rapid development of social media, there is a great deal of news produced every day. How to detect fake news automatically from a large of multimedia posts has become very important for people, the government and news recommendation sites. However, most of the existing approaches either extract features from the text of the post which is a single modality or simply concatenate the visual features and textual features of a post to get a multimodal feature and detect fake news. Most of them ignore the background knowledge hidden in the text content of the post which facilitates fake news detection. To address these issues, we propose a novel Knowledge-driven Multimodal Graph Convolutional Network (KMGCN) to model the semantic representations by jointly modeling the textual information, knowledge concepts and visual information into a unified framework for fake news detection. Instead of viewing text content as word sequences normally, we convert them into a graph, which can model non-consecutive phrases for better obtaining the composition of semantics. Besides, we not only convert visual information as nodes of graphs but also retrieve external knowledge from real-world knowledge graph as nodes of graphs to provide complementary semantics information to improve fake news detection. We utilize a well-designed graph convolutional network to extract the semantic representation of these graphs. 
Extensive experiments on two public real-world datasets illustrate the validation of our approach.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"229 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115991207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}