{"title":"Visual Story Ordering with a Bidirectional Writer","authors":"Wei-Rou Lin, Hen-Hsen Huang, Hsin-Hsi Chen","doi":"10.1145/3372278.3390735","DOIUrl":"https://doi.org/10.1145/3372278.3390735","url":null,"abstract":"This paper introduces visual story ordering, a challenging task in which images and text are ordered in a visual story jointly. We propose a neural network model based on the reader-processor-writer architecture with a self-attention mechanism. A novel bidirectional decoder is further proposed with bidirectional beam search. Experimental results show the effectiveness of the approach. The information gained from multimodal learning is presented and discussed. We also find that the proposed embedding narrows the distance between images and their corresponding story sentences, even though we do not align the two modalities explicitly. As it addresses a general issue in generative models, the proposed bidirectional inference mechanism applies to a variety of applications.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125163424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Interactive Multimodal Retrieval System for Memory Assistant and Life Organized Support","authors":"Van-Luon Tran, Anh-Vu Mai-Nguyen, Trong-Dat Phan, Anh-Khoa Vo, Minh-Son Dao, K. Zettsu","doi":"10.1145/3372278.3391934","DOIUrl":"https://doi.org/10.1145/3372278.3391934","url":null,"abstract":"Lifelogging is known as the new trend of writing diary digitally where both the surrounding environment and personal physiological data and cognition are collected at the same time under the first perspective. Exploring and exploiting these lifelog (i.e., data created by lifelogging) can provide useful insights for human beings, including healthcare, work, entertainment, and family, to name a few. Unfortunately, having a valuable tool working on lifelog to discover these insights is still a tough challenge. To meet this requirement, we introduce an interactive multimodal retrieval system that aims to provide people with two functions, memory assistant and life organized support, with a friendly and easy-to-use web UI. The output of the former function is a video with footages expressing all instances of events people want to recall. The latter function generates a statistical report of each event so that people can have more information to balance their lifestyle. The system relies on two major algorithms that try to match keywords/phrases to images and to run a cluster-based query using a watershed-based approach.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116899089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Image Synthesis from Locally Related Texts","authors":"Tianrui Niu, Fangxiang Feng, Lingxuan Li, Xiaojie Wang","doi":"10.1145/3372278.3390684","DOIUrl":"https://doi.org/10.1145/3372278.3390684","url":null,"abstract":"Text-to-image synthesis refers to generating photo-realistic images from text descriptions. Recent works focus on generating images with complex scenes and multiple objects. However, the text inputs to these models are the only captions that always describe the most apparent object or feature of the image and detailed information (e.g. visual attributes) for regions and objects are often missing. Quantitative evaluation of generation performances is still an unsolved problem, where traditional image classification- or retrieval-based metrics fail at evaluating complex images. To address these problems, we propose to generate images conditioned on locally-related texts, i.e., descriptions of local image regions or objects instead of the whole image. Specifically, questions and answers (QAs) are chosen as locally-related texts, which makes it possible to use VQA accuracy as a new evaluation metric. The intuition is simple: higher image quality and image-text consistency (both globally and locally) can help a VQA model answer questions more correctly. We purposed VQA-GAN model with three key modules: hierarchical QA encoder, QA-conditional GAN and external VQA loss. These modules help leverage the new inputs effectively. Thorough experiments on two public VQA datasets demonstrate the effectiveness of the model and the newly proposed metric.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128360945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reducing Response Time for Multimedia Event Processing using Domain Adaptation","authors":"Asra Aslam, E. Curry","doi":"10.1145/3372278.3390722","DOIUrl":"https://doi.org/10.1145/3372278.3390722","url":null,"abstract":"The Internet of Multimedia Things (IoMT) is an emerging concept due to the large amount of multimedia data produced by sensing devices. Existing event-based systems mainly focus on scalar data, and multimedia event-based solutions are domain-specific. Multiple applications may require handling of numerous known/unknown concepts which may belong to the same/different domains with an unbounded vocabulary. Although deep neural network-based techniques are effective for image recognition, the limitation of having to train classifiers for unseen concepts will lead to an increase in the overall response-time for users. Since it is not practical to have all trained classifiers available, it is necessary to address the problem of training of classifiers on demand for unbounded vocabulary. By exploiting transfer learning based techniques, evaluations showed that the proposed framework can answer within ~0.01 min to ~30 min of response-time with accuracy ranges from 95.14% to 98.53%, even when all subscriptions are new/unknown.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121319415","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trajectory Prediction Network for Future Anticipation of Ships","authors":"Pim Dijt, P. Mettes","doi":"10.1145/3372278.3390676","DOIUrl":"https://doi.org/10.1145/3372278.3390676","url":null,"abstract":"This work investigates the anticipation of future ship locations based on multimodal sensors. Predicting future trajectories of ships is an important component for the development of safe autonomous sailing ships on water. A core challenge towards future trajectory prediction is making sense of multiple modalities from vastly different sensors, including GPS coordinates, radar images, and charts specifying water and land regions. To that end, we propose a Trajectory Prediction Network, an end-to-end approach for trajectory anticipation based on multimodal sensors. Our approach is framed as a multi-task sequence-to-sequence network, with network components for coordinate sequences and radar images. In the network, water/land segmentations from charts are integrated as an auxiliary training objective. Since future anticipation of ships has not previously been studied from such a multimodal perspective, we introduce the Inland Shipping Dataset (ISD), a novel dataset for future anticipation of ships. Experimental evaluation on ISD shows the potential of our approach, outperforming single-modal variants and baselines from related anticipation tasks.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121333827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-level Recognition on Falls from Activities of Daily Living","authors":"Jiawei Li, Shutao Xia, Qianggang Ding","doi":"10.1145/3372278.3390702","DOIUrl":"https://doi.org/10.1145/3372278.3390702","url":null,"abstract":"The falling accident is one of the largest threats to human health, which leads to broken bones, head injury, or even death. Therefore, automatic human fall recognition is vital for the Activities of Daily Living (ADL). In this paper, we try to define multi-level computer vision tasks for the visually observed fall recognition problem and study the methods and pipeline. We make frame-level labels for the fall action on several ADL datasets to test the methods and support the analysis. While current deep-learning fall recognition methods usually work on the sequence-level input, we propose a novel Dynamic Pose Motion (DPM) representation to go a step further, which can be captured by a flexible motion extraction module. Besides, a sequence-level fall recognition pipeline is proposed, which has an explicit two-branch structure for the appearance and motion feature, and has canonical LSTM to make temporal modeling and fall prediction. Finally, while current research only makes a binary classification on the fall and ADL, we further study how to detect the start time and the end time of a fall action in a video-level task. We conduct analysis experiments and ablation studies on both the simulated and real-life fall datasets. The relabelled datasets and extensive experiments form a new baseline on the recognition of falls and ADL.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134159680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forward and Backward Multimodal NMT for Improved Monolingual and Multilingual Cross-Modal Retrieval","authors":"Po-Yao (Bernie) Huang, Xiaojun Chang, Alexander Hauptmann, E. Hovy","doi":"10.1145/3372278.3390674","DOIUrl":"https://doi.org/10.1145/3372278.3390674","url":null,"abstract":"We explore methods to enrich the diversity of captions associated with pictures for learning improved visual-semantic embeddings (VSE) in cross-modal retrieval. In the spirit of \"A picture is worth a thousand words\", it would take dozens of sentences to parallel each picture's content adequately. But in fact, real-world multimodal datasets tend to provide only a few (typically, five) descriptions per image. For cross-modal retrieval, the resulting lack of diversity and coverage prevents systems from capturing the fine-grained inter-modal dependencies and intra-modal diversities in the shared VSE space. Using the fact that the encoder-decoder architectures in neural machine translation (NMT) have the capacity to enrich both monolingual and multilingual textual diversity, we propose a novel framework leveraging multimodal neural machine translation (MMT) to perform forward and backward translations based on salient visual objects to generate additional text-image pairs which enables training improved monolingual cross-modal retrieval (English-Image) and multilingual cross-modal retrieval (English-Image and German-Image) models. Experimental results show that the proposed framework can substantially and consistently improve the performance of state-of-the-art models on multiple datasets. The results also suggest that the models with multilingual VSE outperform the models with monolingual VSE.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133031306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Urban Movie Map for Walkers: Route View Synthesis using 360° Videos","authors":"Naoki Sugimoto, Toru Okubo, K. Aizawa","doi":"10.1145/3372278.3390707","DOIUrl":"https://doi.org/10.1145/3372278.3390707","url":null,"abstract":"We propose a movie map for walkers based on synthesized street walking views along routes in a particular area. From the perspectives of walkers, we captured a number of omnidirectional videos along streets in the target area (1km2 around Kyoto Station). We captured a separate video for each street. We then performed simultaneous localization and mapping to obtain camera poses from key video frames in all of the videos and adjusted the coordinates based on a map of the area using reference points. To join one video to another smoothly at intersections, we identified frames of video intersection based on camera locations and visual feature matching. Finally, we generated moving route views by connecting the omnidirectional videos based on the alignment of the cameras. To improve smoothness at intersections, we generated rotational views by mixing video intersection frames from two videos. The results demonstrate that our method can precisely identify intersection frames and generate smooth connections between videos at intersections.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115241774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ICDAR'20: Intelligent Cross-Data Analysis and Retrieval","authors":"Minh-Son Dao, M. Fjeld, F. Biljecki, U. Yavanoglu, M. Dong","doi":"10.1145/3372278.3388041","DOIUrl":"https://doi.org/10.1145/3372278.3388041","url":null,"abstract":"The First International Workshop on \"Intelligence Cross-Data Analytics and Retrieval\" (ICDAR'20) welcomes any theoretical and practical works on intelligence cross-data analytics and retrieval to bring the smart-sustainable society to human beings. We have witnessed the era of big data where almost any event that happens is recorded and stored either distributedly or centrally. The utmost requirement here is that data came from different sources, and various domains must be harmonically analyzed to get their insights immediately towards giving the ability to be retrieved thoroughly. These emerging requirements lead to the need for interdisciplinary and multidisciplinary contributions that address different aspects of the problem, such as data collection, storage, protection, processing, and transmission, as well as knowledge discovery, retrieval, and security and privacy. Hence, the goal of the workshop is to attract researchers and experts in the areas of multimedia information retrieval, machine learning, AI, data science, event-based processing and analysis, multimodal multimedia content analysis, lifelog data analysis, urban computing, environmental science, atmospheric science, and security and privacy to tackle the issues as mentioned earlier.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115913621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"surgXplore: Interactive Video Exploration for Endoscopy","authors":"Andreas Leibetseder, Klaus Schöffmann","doi":"10.1145/3372278.3391930","DOIUrl":"https://doi.org/10.1145/3372278.3391930","url":null,"abstract":"Accumulating recordings of daily conducted surgical interventions such as endoscopic procedures for the long term generates very large video archives that are both difficult to search and explore. Since physicians utilize this kind of media routinely for documentation, treatment planning or education and training, it can be considered a crucial task to make said archives manageable in regards to discovering or retrieving relevant content. We present an interactive tool including a multitude of modalities for browsing, searching and filtering medical content, demonstrating its usefulness on over 140 hours of pre-processed laparoscopic surgery videos.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"257 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122138640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}