ACM Multimedia Asia: Latest Papers, Page 8

Self-Adaptive Hashing for Fine-Grained Image Retrieval

ACM Multimedia Asia Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490591

Yajie Zhang, Yuxuan Dai, Wei Tang, Lu Jin, Xinguang Xiang

Citations: 0

Goldeye: Enhanced Spatial Awareness for the Visually Impaired using Mixed Reality and Vibrotactile Feedback

ACM Multimedia Asia Pub Date : 2021-12-01 DOI: 10.1145/3469877.3495636

Jun Lee, Narayanan Rajeev, A. Bhojan

Citations: 4

Visual Storytelling with Hierarchical BERT Semantic Guidance

ACM Multimedia Asia Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490604

Ruichao Fan, Hanli Wang, Jinjing Gu, Xianhui Liu

Abstract: Visual storytelling, which aims at automatically producing a narrative paragraph for photo album, remains quite challenging due to the complexity and diversity of photo album content. In addition, open-domain photo albums cover a broad range of topics and this results in highly variable vocabularies and expression styles to describe photo albums. In this work, a novel teacher-student visual storytelling framework with hierarchical BERT semantic guidance (HBSG) is proposed to address the above-mentioned challenges. The proposed teacher module consists of two joint tasks, namely, word-level latent topic generation and semantic-guided sentence generation. The first task aims to predict the latent topic of the story. As there is no ground-truth topic information, a pre-trained BERT model based on visual contents and annotated stories is utilized to mine topics. Then the topic vector is distilled to a designed image-topic prediction model. In the semantic-guided sentence generation task, HBSG is introduced for two purposes. The first is to narrow down the language complexity across topics, where the co-attention decoder with vision and semantic is designed to leverage the latent topics to induce topic-related language models. The second is to employ sentence semantic as an online external linguistic knowledge teacher module. Finally, an auxiliary loss is devised to transform linguistic knowledge into the language generation model. Extensive experiments are performed to demonstrate the effectiveness of HBSG framework, which surpasses the state-of-the-art approaches evaluated on the VIST test set.

Citations: 3

Local Self-Attention on Fine-grained Cross-media Retrieval

ACM Multimedia Asia Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490590

Chen Wang, Yazhou Yao, Qiong Wang, Zhenmin Tang

Abstract: Due to the heterogeneity gap, the data representation of different media is inconsistent and belongs to different feature spaces. Therefore, it is challenging to measure the fine-grained gap between them. To this end, we propose an attention space training method to learn common representations of different media data. Specifically, we utilize local self-attention layers to learn the common attention space between different media data. We propose a similarity concatenation method to understand the content relationship between features. To further improve the robustness of the model, we also train a local position encoding to capture the spatial relationships between features. In this way, our proposed method can effectively reduce the gap between different feature distributions on cross-media retrieval tasks. It also improves the fine-grained recognition performance by attaching attention to high-level semantic information. Extensive experiments and ablation studies demonstrate that our proposed method achieves state-of-the-art performance. At the same time, our approach provides a new pipeline for fine-grained cross-media retrieval. The source code and models are publicly available at: https://github.com/NUST-Machine-Intelligence-Laboratory/SAFGCMHN.

Citations: 2

Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages

ACM Multimedia Asia Pub Date : 2021-11-24 DOI: 10.1145/3469877.3490571

Shota Orihashi, Yoshihiro Yamazaki, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Ryo Masumura

Abstract: This paper presents a novel training method for end-to-end scene text recognition. End-to-end scene text recognition offers high recognition accuracy, especially when using the encoder-decoder model based on Transformer. To train a highly accurate end-to-end model, we need to prepare a large image-to-text paired dataset for the target language. However, it is difficult to collect this data, especially for resource-poor languages. To overcome this difficulty, our proposed method utilizes well-prepared large datasets in resource-rich languages such as English, to train the resource-poor encoder-decoder model. Our key idea is to build a model in which the encoder reflects knowledge of multiple languages while the decoder specializes in knowledge of just the resource-poor language. To this end, the proposed method pre-trains the encoder by using a multilingual dataset that combines the resource-poor language’s dataset and the resource-rich language’s dataset to learn language-invariant knowledge for scene text recognition. The proposed method also pre-trains the decoder by using the resource-poor language’s dataset to make the decoder better suited to the resource-poor language. Experiments on Japanese scene text recognition using a small, publicly available dataset demonstrate the effectiveness of the proposed method.

Citations: 1

Holodeck: Immersive 3D Displays Using Swarms of Flying Light Specks [Extended Abstract]

ACM Multimedia Asia Pub Date : 2021-11-02 DOI: 10.1145/3469877.3493698

Shahram Ghandeharizadeh

Abstract: Unmanned Aerial Vehicles (UAVs) have moved beyond a platform for hobbyists to enable environmental monitoring, journalism, film industry, search and rescue, package delivery, and entertainment. This paper describes 3D displays using swarms of flying light specks, FLSs. An FLS is a small (hundreds of micrometers in size) UAV with one or more light sources to generate different colors and textures with adjustable brightness. A synchronized swarm of FLSs renders an illumination in a pre-specified 3D volume, an FLS display. An FLS display provides true depth, enabling a user to perceive a scene more completely by analyzing its illumination from different angles. An FLS display may either be non-immersive or immersive. Both will support 3D acoustics. Non-immersive FLS displays may be the size of a 1980’s computer monitor, enabling a surgical team to observe and control micro robots performing heart surgery inside a patient’s body. Immersive FLS displays may be the size of a room, enabling users to interact with objects, e.g., a rock, a teapot. An object with behavior will be constructed using FLS-matters. FLS-matter will enable a user to touch and manipulate an object, e.g., a user may pick up a teapot or throw a rock. An immersive and interactive FLS display will approximate Star Trek’s holodeck. A successful realization of the research ideas presented in this paper will provide fundamental insights into implementing a holodeck using swarms of FLSs. A holodeck will transform the future of human communication and perception, and how we interact with information and data. It will revolutionize the future of how we work, learn, play and entertain, receive medical care, and socialize.

Citations: 6

Hierarchical Deep Residual Reasoning for Temporal Moment Localization

ACM Multimedia Asia Pub Date : 2021-10-31 DOI: 10.1145/3469877.3490595

Ziyang Ma, Xianjing Han, Xuemeng Song, Yiran Cui, Liqiang Nie

Abstract: Temporal Moment Localization (TML) in untrimmed videos is a challenging task in the field of multimedia, which aims at localizing the start and end points of the activity in the video, described by a sentence query. Existing methods mainly focus on mining the correlation between video and sentence representations or investigating the fusion manner of the two modalities. These works mainly understand the video and sentence coarsely, ignoring the fact that a sentence can be understood from various semantics, and the dominant words affecting the moment localization in the semantics are the action and object reference. Toward this end, we propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics to achieve a finer-grained localization. Furthermore, considering that videos with different resolution and sentences with different length have different difficulty in understanding, we design the simple yet effective Res-BiGRUs for feature fusion, which is able to grasp the useful information in a self-adapting manner. Extensive experiments conducted on Charades-STA and ActivityNet-Captions datasets demonstrate the superiority of our HDRR model compared with other state-of-the-art methods.

Citations: 6

Improving Camouflaged Object Detection with the Uncertainty of Pseudo-edge Labels

ACM Multimedia Asia Pub Date : 2021-10-29 DOI: 10.1145/3469877.3490587

Nobukatsu Kajiura, Hong Liu, S. Satoh

Abstract: This paper focuses on camouflaged object detection (COD), which is a task to detect objects hidden in the background. Most of the current COD models aim to highlight the target object directly while outputting ambiguous camouflaged boundaries. On the other hand, the performance of the models considering edge information is not yet satisfactory. To this end, we propose a new framework that makes full use of multiple visual cues, i.e., saliency as well as edges, to refine the predicted camouflaged map. This framework consists of three key components, i.e., a pseudo-edge generator, a pseudo-map generator, and an uncertainty-aware refinement module. In particular, the pseudo-edge generator estimates the boundary that outputs the pseudo-edge label, and the conventional COD method serves as the pseudo-map generator that outputs the pseudo-map label. Then, we propose an uncertainty-based module to reduce the uncertainty and noise of such two pseudo labels, which takes both pseudo labels as input and outputs an edge-accurate camouflaged map. Experiments on various COD datasets demonstrate the effectiveness of our method with superior performance to the existing state-of-the-art methods.

Citations: 15

Patch-Based Deep Autoencoder for Point Cloud Geometry Compression

ACM Multimedia Asia Pub Date : 2021-10-18 DOI: 10.1145/3469877.3490611

Kang-Soo You, Pan Gao

Citations: 12

Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation

ACM Multimedia Asia Pub Date : 2021-10-16 DOI: 10.1145/3469877.3490570

Yang Wu, Shirui Feng, Guanbin Li, Liang Lin

Citations: 0