{"title":"Visual Story Ordering with a Bidirectional Writer","authors":"Wei-Rou Lin, Hen-Hsen Huang, Hsin-Hsi Chen","doi":"10.1145/3372278.3390735","DOIUrl":"https://doi.org/10.1145/3372278.3390735","url":null,"abstract":"This paper introduces visual story ordering, a challenging task in which images and text are ordered in a visual story jointly. We propose a neural network model based on the reader-processor-writer architecture with a self-attention mechanism. A novel bidirectional decoder is further proposed with bidirectional beam search. Experimental results show the effectiveness of the approach. The information gained from multimodal learning is presented and discussed. We also find that the proposed embedding narrows the distance between images and their corresponding story sentences, even though we do not align the two modalities explicitly. As it addresses a general issue in generative models, the proposed bidirectional inference mechanism applies to a variety of applications.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125163424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Interactive Multimodal Retrieval System for Memory Assistant and Life Organized Support","authors":"Van-Luon Tran, Anh-Vu Mai-Nguyen, Trong-Dat Phan, Anh-Khoa Vo, Minh-Son Dao, K. Zettsu","doi":"10.1145/3372278.3391934","DOIUrl":"https://doi.org/10.1145/3372278.3391934","url":null,"abstract":"Lifelogging is known as the new trend of writing diary digitally where both the surrounding environment and personal physiological data and cognition are collected at the same time under the first perspective. Exploring and exploiting these lifelog (i.e., data created by lifelogging) can provide useful insights for human beings, including healthcare, work, entertainment, and family, to name a few. Unfortunately, having a valuable tool working on lifelog to discover these insights is still a tough challenge. To meet this requirement, we introduce an interactive multimodal retrieval system that aims to provide people with two functions, memory assistant and life organized support, with a friendly and easy-to-use web UI. The output of the former function is a video with footages expressing all instances of events people want to recall. The latter function generates a statistical report of each event so that people can have more information to balance their lifestyle. The system relies on two major algorithms that try to match keywords/phrases to images and to run a cluster-based query using a watershed-based approach.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116899089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Synthesis from Locally Related Texts","authors":"Tianrui Niu, Fangxiang Feng, Lingxuan Li, Xiaojie Wang","doi":"10.1145/3372278.3390684","DOIUrl":"https://doi.org/10.1145/3372278.3390684","url":null,"abstract":"Text-to-image synthesis refers to generating photo-realistic images from text descriptions. Recent works focus on generating images with complex scenes and multiple objects. However, the text inputs to these models are the only captions that always describe the most apparent object or feature of the image and detailed information (e.g. visual attributes) for regions and objects are often missing. Quantitative evaluation of generation performances is still an unsolved problem, where traditional image classification- or retrieval-based metrics fail at evaluating complex images. To address these problems, we propose to generate images conditioned on locally-related texts, i.e., descriptions of local image regions or objects instead of the whole image. Specifically, questions and answers (QAs) are chosen as locally-related texts, which makes it possible to use VQA accuracy as a new evaluation metric. The intuition is simple: higher image quality and image-text consistency (both globally and locally) can help a VQA model answer questions more correctly. We purposed VQA-GAN model with three key modules: hierarchical QA encoder, QA-conditional GAN and external VQA loss. These modules help leverage the new inputs effectively. Thorough experiments on two public VQA datasets demonstrate the effectiveness of the model and the newly proposed metric.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128360945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing Response Time for Multimedia Event Processing using Domain Adaptation","authors":"Asra Aslam, E. Curry","doi":"10.1145/3372278.3390722","DOIUrl":"https://doi.org/10.1145/3372278.3390722","url":null,"abstract":"The Internet of Multimedia Things (IoMT) is an emerging concept due to the large amount of multimedia data produced by sensing devices. Existing event-based systems mainly focus on scalar data, and multimedia event-based solutions are domain-specific. Multiple applications may require handling of numerous known/unknown concepts which may belong to the same/different domains with an unbounded vocabulary. Although deep neural network-based techniques are effective for image recognition, the limitation of having to train classifiers for unseen concepts will lead to an increase in the overall response-time for users. Since it is not practical to have all trained classifiers available, it is necessary to address the problem of training of classifiers on demand for unbounded vocabulary. By exploiting transfer learning based techniques, evaluations showed that the proposed framework can answer within ~0.01 min to ~30 min of response-time with accuracy ranges from 95.14% to 98.53%, even when all subscriptions are new/unknown.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121319415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trajectory Prediction Network for Future Anticipation of Ships","authors":"Pim Dijt, P. Mettes","doi":"10.1145/3372278.3390676","DOIUrl":"https://doi.org/10.1145/3372278.3390676","url":null,"abstract":"This work investigates the anticipation of future ship locations based on multimodal sensors. Predicting future trajectories of ships is an important component for the development of safe autonomous sailing ships on water. A core challenge towards future trajectory prediction is making sense of multiple modalities from vastly different sensors, including GPS coordinates, radar images, and charts specifying water and land regions. To that end, we propose a Trajectory Prediction Network, an end-to-end approach for trajectory anticipation based on multimodal sensors. Our approach is framed as a multi-task sequence-to-sequence network, with network components for coordinate sequences and radar images. In the network, water/land segmentations from charts are integrated as an auxiliary training objective. Since future anticipation of ships has not previously been studied from such a multimodal perspective, we introduce the Inland Shipping Dataset (ISD), a novel dataset for future anticipation of ships. Experimental evaluation on ISD shows the potential of our approach, outperforming single-modal variants and baselines from related anticipation tasks.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121333827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-level Recognition on Falls from Activities of Daily Living","authors":"Jiawei Li, Shutao Xia, Qianggang Ding","doi":"10.1145/3372278.3390702","DOIUrl":"https://doi.org/10.1145/3372278.3390702","url":null,"abstract":"The falling accident is one of the largest threats to human health, which leads to broken bones, head injury, or even death. Therefore, automatic human fall recognition is vital for the Activities of Daily Living (ADL). In this paper, we try to define multi-level computer vision tasks for the visually observed fall recognition problem and study the methods and pipeline. We make frame-level labels for the fall action on several ADL datasets to test the methods and support the analysis. While current deep-learning fall recognition methods usually work on the sequence-level input, we propose a novel Dynamic Pose Motion (DPM) representation to go a step further, which can be captured by a flexible motion extraction module. Besides, a sequence-level fall recognition pipeline is proposed, which has an explicit two-branch structure for the appearance and motion feature, and has canonical LSTM to make temporal modeling and fall prediction. Finally, while current research only makes a binary classification on the fall and ADL, we further study how to detect the start time and the end time of a fall action in a video-level task. We conduct analysis experiments and ablation studies on both the simulated and real-life fall datasets. The relabelled datasets and extensive experiments form a new baseline on the recognition of falls and ADL.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134159680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forward and Backward Multimodal NMT for Improved Monolingual and Multilingual Cross-Modal Retrieval","authors":"Po-Yao (Bernie) Huang, Xiaojun Chang, Alexander Hauptmann, E. Hovy","doi":"10.1145/3372278.3390674","DOIUrl":"https://doi.org/10.1145/3372278.3390674","url":null,"abstract":"We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of \"A picture is worth a thousand words\", it would take dozens of sentences to parallel each picture's content adequately. But in fact, real-world multimodal datasets tend to provide only a few (typically, five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing the fine-grained inter-modal dependencies and intra-modal diversities in the shared VSE space. Using the fact that the encoder-decoder architectures in neural machine translation (NMT) have the capacity to enrich both monolingual and multilingual textual diversity, we propose a novel framework leveraging multimodal neural machine translation (MMT) to perform forward and backward translations based on salient visual objects to generate additional text-image pairs which enables training improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal retrieval (English-Image and German-Image) models. Experimental results show that the proposed framework can substantially and consistently improve the performance of state-of-the-art models on multiple datasets. The results also suggest that the models with multilingual VSE outperform the models with monolingual VSE.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133031306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Urban Movie Map for Walkers: Route View Synthesis using 360° Videos","authors":"Naoki Sugimoto, Toru Okubo, K. Aizawa","doi":"10.1145/3372278.3390707","DOIUrl":"https://doi.org/10.1145/3372278.3390707","url":null,"abstract":"We propose a movie map for walkers based on synthesized street walking views along routes in a particular area. From the perspectives of walkers, we captured a number of omnidirectional videos along streets in the target area (1km2 around Kyoto Station). We captured a separate video for each street. We then performed simultaneous localization and mapping to obtain camera poses from key video frames in all of the videos and adjusted the coordinates based on a map of the area using reference points. To join one video to another smoothly at intersections, we identified frames of video intersection based on camera locations and visual feature matching. Finally, we generated moving route views by connecting the omnidirectional videos based on the alignment of the cameras. To improve smoothness at intersections, we generated rotational views by mixing video intersection frames from two videos. The results demonstrate that our method can precisely identify intersection frames and generate smooth connections between videos at intersections.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115241774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ICDAR'20: Intelligent Cross-Data Analysis and Retrieval","authors":"Minh-Son Dao, M. Fjeld, F. Biljecki, U. Yavanoglu, M. Dong","doi":"10.1145/3372278.3388041","DOIUrl":"https://doi.org/10.1145/3372278.3388041","url":null,"abstract":"The First International Workshop on \"Intelligence Cross-Data Analytics and Retrieval\" (ICDAR'20) welcomes any theoretical and practical works on intelligence cross-data analytics and retrieval to bring the smart-sustainable society to human beings. We have witnessed the era of big data where almost any event that happens is recorded and stored either distributedly or centrally. The utmost requirement here is that data came from different sources, and various domains must be harmonically analyzed to get their insights immediately towards giving the ability to be retrieved thoroughly. These emerging requirements lead to the need for interdisciplinary and multidisciplinary contributions that address different aspects of the problem, such as data collection, storage, protection, processing, and transmission, as well as knowledge discovery, retrieval, and security and privacy. Hence, the goal of the workshop is to attract researchers and experts in the areas of multimedia information retrieval, machine learning, AI, data science, event-based processing and analysis, multimodal multimedia content analysis, lifelog data analysis, urban computing, environmental science, atmospheric science, and security and privacy to tackle the issues as mentioned earlier.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115913621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"surgXplore: Interactive Video Exploration for Endoscopy","authors":"Andreas Leibetseder, Klaus Schöffmann","doi":"10.1145/3372278.3391930","DOIUrl":"https://doi.org/10.1145/3372278.3391930","url":null,"abstract":"Accumulating recordings of daily conducted surgical interventions such as endoscopic procedures for the long term generates very large video archives that are both difficult to search and explore. Since physicians utilize this kind of media routinely for documentation, treatment planning or education and training, it can be considered a crucial task to make said archives manageable in regards to discovering or retrieving relevant content. We present an interactive tool including a multitude of modalities for browsing, searching and filtering medical content, demonstrating its usefulness on over 140 hours of pre-processed laparoscopic surgery videos.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122138640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}