Proceedings of the 2020 International Conference on Multimedia Retrieval: Latest Publications

CEA'20: The 12th Workshop on Multimedia for Cooking and Eating Activities
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3388040
I. Ide, Yoko Yamakata, Atsushi Hashimoto
Abstract: This overview introduces the aim of the CEA'20 workshop and lists the papers presented in the workshop.
Citations: 0

Heterogeneous Non-Local Fusion for Multimodal Activity Recognition
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390675
Petr Byvshev, P. Mettes, Yu Xiao
Abstract: In this work, we investigate activity recognition using multimodal inputs from heterogeneous sensors. Activity recognition is commonly tackled from a single-modal perspective using videos. In case multiple signals are used, they come from the same homogeneous modality, e.g., color and optical flow. Here, we propose an activity network that fuses multimodal inputs coming from completely different and heterogeneous sensors. We frame such heterogeneous fusion as a non-local operation. The observation is that in a non-local operation, only the channel dimensions need to match. In the network, heterogeneous inputs are fused while maintaining the shapes and dimensionalities that fit each input. We outline both asymmetric fusion, where one modality serves to reinforce the other, and symmetric fusion variants. To further promote research into multimodal activity recognition, we introduce GloVid, a first-person activity dataset captured with video recordings and smart glove sensor readings. Experiments on GloVid show the potential of heterogeneous non-local fusion for activity recognition, outperforming individual modalities and standard fusion techniques.
Citations: 3

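The central observation in the abstract above, that a non-local operation only requires the channel dimensions of the two inputs to match, can be illustrated with a small cross-modal attention block. The sketch below is a rough reading of that idea and not the authors' network; the module name HeteroNonLocalFusion, the layer sizes and the tensor shapes are all assumptions.

```python
import torch
import torch.nn as nn

class HeteroNonLocalFusion(nn.Module):
    """Illustrative non-local fusion of two heterogeneous modalities.

    The video stream keeps its spatio-temporal shape and the sensor stream
    keeps its temporal shape; only the embedding (channel) dimension is
    shared, which is all the non-local operation requires."""

    def __init__(self, video_channels: int, sensor_channels: int, embed_dim: int = 128):
        super().__init__()
        self.query = nn.Conv3d(video_channels, embed_dim, kernel_size=1)   # from video
        self.key = nn.Conv1d(sensor_channels, embed_dim, kernel_size=1)    # from sensor
        self.value = nn.Conv1d(sensor_channels, embed_dim, kernel_size=1)
        self.out = nn.Conv3d(embed_dim, video_channels, kernel_size=1)

    def forward(self, video, sensor):
        # video: (B, Cv, T, H, W); sensor: (B, Cs, L)
        b, _, t, h, w = video.shape
        q = self.query(video).flatten(2)                        # (B, E, T*H*W)
        k, v = self.key(sensor), self.value(sensor)             # (B, E, L) each
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)     # (B, T*H*W, L)
        fused = (attn @ v.transpose(1, 2)).transpose(1, 2)      # (B, E, T*H*W)
        fused = fused.reshape(b, -1, t, h, w)
        return video + self.out(fused)                          # residual fusion into the video stream

# Toy usage: video feature maps fused with smart-glove sensor readings
video = torch.randn(2, 64, 8, 14, 14)
sensor = torch.randn(2, 32, 100)
print(HeteroNonLocalFusion(64, 32)(video, sensor).shape)  # torch.Size([2, 64, 8, 14, 14])
```

In the abstract's terms this is an asymmetric variant: the sensor stream modulates the video stream while each input keeps its own shape.
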
Beyond Relevance Feedback for Searching and Exploring large Multimedia Collections
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390669
M. Worring
Abstract: Relevance feedback was introduced over twenty years ago as a powerful tool for interactive retrieval and still is the dominant mode of interaction in multimedia retrieval systems. Over the years methods have improved, and recently relevance feedback has become feasible on even the largest collections available in the multimedia community. Yet relevance feedback typically targets the optimization of linear lists of search results and thus focuses on only one of the many tasks on the search-explore axis. Truly interactive retrieval systems have to consider the whole axis, and interactive categorization is an overarching framework for many of those tasks. The multimedia analytics system MediaTable exploits this to support users in gaining insight into large image collections. Categorization as a representation of the collection and user tasks does not capture the relations between items in the collection the way graphs do. Hypergraphs combine categories and relations in one model and, being founded in set theory, are in fact closely related to categorization. They therefore provide an elegant framework to move forward. In this talk we highlight the progress that has been made in the field of interactive retrieval and in the direction of multimedia analytics. We further consider the promise that new results in deep learning, especially graph convolutional networks, and hypergraphs might bring in going beyond relevance feedback.
Citations: 0

MAENet: Boosting Feature Representation for Cross-Modal Person Re-Identification with Pairwise Supervision
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390699
Yongbiao Chen, Sheng Zhang, Zhengwei Qi
Abstract: Person re-identification aims at retrieving the images of a specific person from a gallery dataset given a probe image. Among the existing research areas related to person re-identification, visible-to-thermal person re-identification (VT-REID) has gained increasing momentum. VT-REID is deemed a rather challenging task owing to the large cross-modality gap [25], cross-modality variation and intra-modality variation. Existing techniques generally tackle this problem by embedding cross-modality data with convolutional neural networks into a shared feature space to bridge the cross-modality discrepancy, and subsequently devise hinge losses for similarity learning to alleviate the variation. However, feature extraction methods based simply on convolutional neural networks may fail to capture distinctive and modality-invariant features, resulting in noise for subsequent re-identification. In this work, we present a novel modality- and appearance-invariant embedding learning framework equipped with maximum likelihood learning to perform cross-modal person re-identification. Extensive and comprehensive experiments test the effectiveness of our framework. Results demonstrate that the proposed framework yields state-of-the-art Re-ID accuracy on the RegDB and SYSU-MM01 datasets.
Citations: 5

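The abstract describes the common VT-REID recipe of embedding both modalities into a shared feature space and training with hinge-style losses on similarities. The minimal sketch below shows that generic setup only, not MAENet's actual architecture or its maximum-likelihood objective; the layer sizes and the class name SharedSpaceEmbedding are invented for illustration.

```python
import torch
import torch.nn as nn

class SharedSpaceEmbedding(nn.Module):
    """Toy two-branch embedding: maps visible and thermal features into one
    shared space so that cross-modality distances become comparable.
    This is a generic VT-REID baseline, not the MAENet model itself."""

    def __init__(self, in_dim: int = 2048, embed_dim: int = 256):
        super().__init__()
        self.visible_branch = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU())
        self.thermal_branch = nn.Sequential(nn.Linear(in_dim, embed_dim), nn.ReLU())
        self.shared = nn.Linear(embed_dim, embed_dim)  # modality-shared projection

    def forward(self, x, modality: str):
        branch = self.visible_branch if modality == "visible" else self.thermal_branch
        return nn.functional.normalize(self.shared(branch(x)), dim=-1)

model = SharedSpaceEmbedding()
triplet = nn.TripletMarginLoss(margin=0.3)           # hinge-style loss on cross-modal triplets
anchor = model(torch.randn(8, 2048), "visible")      # visible images of a person
positive = model(torch.randn(8, 2048), "thermal")    # thermal images of the same person
negative = model(torch.randn(8, 2048), "thermal")    # thermal images of other people
loss = triplet(anchor, positive, negative)
```
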
One Perceptron to Rule Them All: Language, Vision, Audio and Speech
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390740
Xavier Giró-i-Nieto
Abstract: Deep neural networks have boosted the convergence of multimedia data analytics in a unified framework shared by practitioners in natural language, vision and speech. Image captioning, lip reading and video sonorization are some of the first applications of a new and exciting field of research exploiting the generalization properties of deep neural representations. This tutorial will first review the basic neural architectures used to encode and decode vision, text and audio, and then review the models that have successfully translated information across modalities.
Citations: 1

Optimizing Queries over Video via Lightweight Keypoint-based Object Detection
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390714
Jiansheng Dong, Jingling Yuan, Lin Li, X. Zhong, Weiru Liu
Abstract: Recent advances in convolutional neural network based object detection have enabled analyzing mounting volumes of video data with high accuracy. However, inference speed is a major drawback of these video analysis systems because of their heavy object detectors. To address the computational and practicability challenges of video analysis, we propose FastQ, a system for efficient querying over video at scale. Given a target video, FastQ can automatically label the category and number of objects for each frame. We introduce a novel lightweight object detector named FDet to improve the efficiency of the query system. First, a difference detector filters out frames whose difference falls below a threshold. Second, FDet is employed to efficiently label the remaining frames. To reduce inference time, FDet detects a center keypoint and a pair of corners from the feature map generated by a lightweight backbone to predict the bounding boxes. FDet completely avoids the complicated computation related to anchor boxes. Compared with state-of-the-art real-time detectors, FDet achieves superior performance with 29.1% AP on the COCO benchmark at 25.3 ms. Experiments show that FastQ achieves 150x to 300x speed-ups while maintaining more than 90% accuracy in video queries.
Citations: 6

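The two-stage pipeline above is concrete enough to sketch its first stage: frames whose change relative to the last kept frame falls below a threshold are skipped, so the detector (FDet in the paper) only runs on the rest. The difference metric and the threshold value below are assumptions, not the ones used by FastQ.

```python
import numpy as np

def frames_to_label(frames, diff_threshold: float = 12.0):
    """Illustrative difference detector: yield indices of frames that changed
    enough (mean absolute pixel difference from the last kept frame) to be
    worth sending to the object detector."""
    previous = None
    for index, frame in enumerate(frames):
        gray = frame.mean(axis=-1) if frame.ndim == 3 else frame
        if previous is None or np.abs(gray - previous).mean() > diff_threshold:
            previous = gray
            yield index  # run the detector on this frame; reuse labels for skipped ones

# Example: 100 random "frames"; real consecutive frames are mostly near-duplicates
video = np.random.randint(0, 256, size=(100, 240, 320, 3)).astype(np.float32)
kept = list(frames_to_label(video))
print(f"running the detector on {len(kept)} of {len(video)} frames")
```
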
iSparse: Output Informed Sparsification of Neural Network
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390688
Yash Garg, K. Candan
Abstract: Deep neural networks have demonstrated unprecedented success in various multimedia applications. However, the networks created are often very complex, with large numbers of trainable edges that require extensive computational resources. We note that many successful networks nevertheless often contain large numbers of redundant edges. Moreover, many of these edges may have negligible contributions towards the overall network performance. In this paper, we propose a novel iSparse framework and experimentally show that we can sparsify the network without impacting its performance. iSparse leverages a novel edge significance score, E, to determine the importance of an edge with respect to the final network output. Furthermore, iSparse can be applied either while training a model or on top of a pre-trained model, making it a retraining-free approach with minimal computational overhead. Comparisons of iSparse against Dropout, L1, DropConnect, Retraining-Free, and the Lottery-Ticket Hypothesis on benchmark datasets show that iSparse leads to effective network sparsification.
Citations: 3

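Since the abstract does not give the exact definition of the edge significance score E, the sketch below uses a simple stand-in (|weight| times mean input activation over a calibration batch) purely to show what retraining-free, score-based sparsification of a layer looks like; it is not the iSparse score itself.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def sparsify_linear(layer: nn.Linear, calibration_inputs: torch.Tensor, keep_ratio: float = 0.5):
    """Retraining-free sparsification of one linear layer.

    Significance of edge (i, j) is approximated as |w_ij| * mean|x_j| over a
    calibration batch -- a stand-in for the output-informed score E."""
    mean_activation = calibration_inputs.abs().mean(dim=0)       # (in_features,)
    significance = layer.weight.abs() * mean_activation          # (out_features, in_features)
    k = int(significance.numel() * keep_ratio)
    threshold = significance.flatten().kthvalue(significance.numel() - k + 1).values
    mask = (significance >= threshold).float()
    layer.weight.mul_(mask)                                      # zero out insignificant edges
    return mask.mean().item()                                    # fraction of edges kept

layer = nn.Linear(256, 64)
kept = sparsify_linear(layer, torch.randn(512, 256), keep_ratio=0.3)
print(f"kept {kept:.0%} of edges without retraining")
```
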
Visual Relations Augmented Cross-modal Retrieval
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390709
Yutian Guo, Jingjing Chen, Hao Zhang, Yu-Gang Jiang
Abstract: Retrieving relevant samples across multiple modalities is a primary topic that consistently receives research interest in the multimedia community and benefits various real-world multimedia applications (e.g., text-based image search). Current models mainly focus on learning a unified visual-semantic embedding space to bridge visual content and text queries, aiming to align relevant samples from different modalities as neighbors in the embedding space. However, these models do not consider relations between visual components when learning visual representations, and therefore cannot distinguish images with the same visual components but different relations (Figure 1). To model visual content precisely, we introduce a novel framework that enhances the visual representation with relations between components. Specifically, visual relations are represented by the scene graph extracted from an image and then encoded by a graph convolutional neural network to learn visual relational features. We combine the relational and compositional representations for image-text retrieval. Empirical results on the challenging MS-COCO and Flickr30K datasets demonstrate the effectiveness of our proposed model for the cross-modal retrieval task.
Citations: 16

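The relational branch described above amounts to running a graph convolution over the scene graph of an image. A minimal one-layer graph convolution over node features and an adjacency matrix is sketched below; the scene-graph extractor, the exact GCN variant and the fusion with compositional features are not reproduced here, and the toy graph is invented.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One graph-convolution layer, H' = ReLU(D^-1 (A + I) H W): a simple
    mean-aggregation form often used to encode scene-graph nodes."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_features, adjacency):
        adjacency = adjacency + torch.eye(adjacency.size(0))   # add self-loops
        degree = adjacency.sum(dim=1, keepdim=True)
        aggregated = (adjacency / degree) @ node_features      # mean over neighbours
        return torch.relu(self.linear(aggregated))

# Toy scene graph: nodes {person, bike, helmet}, edges for riding / wearing relations
node_features = torch.randn(3, 300)          # e.g. word embeddings of detected objects
adjacency = torch.tensor([[0., 1., 1.],
                          [1., 0., 0.],
                          [1., 0., 0.]])
gcn = GraphConvLayer(300, 256)
relational_feature = gcn(node_features, adjacency).mean(dim=0)   # pooled relational image feature
```
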
QIK: A System for Large-Scale Image Retrieval on Everyday Scenes With Common Objects
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390682
Arun Zachariah, Mohamed Gharibi, P. Rao
Abstract: In this paper, we propose a system for large-scale image retrieval on everyday scenes with common objects by leveraging advances in deep learning and natural language processing (NLP). Unlike recent state-of-the-art approaches that extract image features from a convolutional neural network (CNN), our system exploits the predictions made by deep neural networks for image understanding tasks. Our system aims to capture the relationships between objects in an everyday scene rather than just the individual objects in the scene. It works as follows: for each image in the database, it generates the most probable captions and detects objects in the image using state-of-the-art deep learning models. The captions are parsed and represented by tree structures using NLP techniques. These are stored and indexed in a database system. When a user poses a query image, its caption is generated using deep learning and parsed into its corresponding tree structures. Then an optimized tree-pattern query is constructed and executed on the database to retrieve a set of candidate images. Finally, these candidate images are ranked using the tree-edit distance metric computed on the tree structures. A query based only on objects detected in the query image can also be formulated and executed; in this case, the ranking scheme uses the probabilities of the detected objects. We evaluated the performance of our system on the Microsoft COCO dataset containing everyday scenes (with common objects) and observed that our system can outperform state-of-the-art techniques in terms of mean average precision for large-scale image retrieval.
Citations: 2

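The final ranking step, ordering candidates by tree-edit distance between caption parse trees, can be sketched with the Zhang-Shasha implementation in the zss package (assumed installed via pip install zss). The captions, tree shapes and image IDs below are hypothetical and far simpler than real parses; the sketch only illustrates the reranking idea, not QIK's indexing or tree-pattern querying.

```python
from zss import Node, simple_distance  # Zhang-Shasha tree edit distance

def caption_tree(spec):
    """Build a zss tree from a nested (label, children) tuple standing in for
    the parse tree of a generated caption."""
    label, children = spec
    node = Node(label)
    for child in children:
        node.addkid(caption_tree(child))
    return node

# Hypothetical parses: query "a dog rides a skateboard" vs. candidate captions
query = caption_tree(("rides", [("dog", []), ("skateboard", [])]))
candidates = {
    "img_041": caption_tree(("rides", [("dog", []), ("surfboard", [])])),
    "img_107": caption_tree(("sits", [("cat", []), ("sofa", [])])),
}

# Rank candidate images by tree-edit distance to the query parse (smaller is better)
ranking = sorted(candidates, key=lambda img: simple_distance(query, candidates[img]))
print(ranking)  # ['img_041', 'img_107']
```
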
Fake News Detection via Knowledge-driven Multimodal Graph Convolutional Networks
Proceedings of the 2020 International Conference on Multimedia Retrieval. Pub Date: 2020-06-01. DOI: 10.1145/3372278.3390713
Youze Wang, Shengsheng Qian, Jun Hu, Quan Fang, Changsheng Xu
Abstract: With the rapid development of social media, a great deal of news is produced every day. Automatically detecting fake news among a large number of multimedia posts has become very important for people, governments and news recommendation sites. However, most existing approaches either extract features from the text of the post alone (a single modality) or simply concatenate the visual and textual features of a post into a multimodal feature for fake news detection. Most of them ignore the background knowledge hidden in the text content of the post, which facilitates fake news detection. To address these issues, we propose a novel Knowledge-driven Multimodal Graph Convolutional Network (KMGCN) that models semantic representations by jointly modeling textual information, knowledge concepts and visual information in a unified framework for fake news detection. Instead of viewing text content as a plain word sequence, we convert it into a graph, which can model non-consecutive phrases to better capture the composition of semantics. Besides, we not only add visual information as graph nodes but also retrieve external knowledge from a real-world knowledge graph as additional nodes to provide complementary semantic information that improves fake news detection. We utilize a well-designed graph convolutional network to extract the semantic representation of these graphs. Extensive experiments on two public real-world datasets demonstrate the validity of our approach.
Citations: 96

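The first step described above, converting a post's text into a graph that captures non-consecutive phrases, is commonly done with a sliding-window word co-occurrence graph. The sketch below follows that assumption and simply attaches placeholder nodes for visual content and retrieved knowledge concepts; the window size, the node-linking scheme and the helper name post_to_graph are illustrative, not KMGCN's actual construction.

```python
import itertools
import numpy as np

def post_to_graph(tokens, extra_nodes=("IMAGE", "KNOWLEDGE:concept"), window: int = 3):
    """Illustrative graph construction for one post: words co-occurring inside a
    sliding window are connected (capturing non-consecutive phrases), and extra
    nodes standing in for visual regions and knowledge concepts are linked to
    every word node."""
    nodes = list(dict.fromkeys(tokens)) + list(extra_nodes)
    index = {node: i for i, node in enumerate(nodes)}
    adjacency = np.zeros((len(nodes), len(nodes)))
    for start in range(len(tokens)):
        for a, b in itertools.combinations(tokens[start:start + window], 2):
            adjacency[index[a], index[b]] = adjacency[index[b], index[a]] = 1.0
    for extra in extra_nodes:                      # tie image / knowledge nodes to all words
        for token in set(tokens):
            adjacency[index[extra], index[token]] = adjacency[index[token], index[extra]] = 1.0
    return nodes, adjacency

nodes, adjacency = post_to_graph("breaking news aliens land in paris".split())
print(len(nodes), adjacency.sum())  # this graph would then be fed to a GCN classifier
```
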