Proceedings of the 2nd ACM International Conference on Multimedia in Asia: Latest Publications

Fixed-size video summarization over streaming data via non-monotone submodular maximization
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446285
Ganfeng Lu, Jiping Zheng
{"title":"Fixed-size video summarization over streaming data via non-monotone submodular maximization","authors":"Ganfeng Lu, Jiping Zheng","doi":"10.1145/3444685.3446285","DOIUrl":"https://doi.org/10.1145/3444685.3446285","url":null,"abstract":"Video summarization which potentially fast browses a large amount of emerging video data as well as saves storage cost has attracted tremendous attentions in machine learning and information retrieval. Among existing efforts, determinantal point processes (DPPs) designed for selecting a subset of video frames to represent the whole video have shown great success in video summarization. However, existing methods have shown poor performance to generate fixed-size output summaries for video data, especially when video frames arrive in streaming manner. In this paper, we provide an efficient approach k-seqLS which summarizes streaming video data with a fixed-size k in vein of DPPs. Our k-seqLS approach can fully exploit the sequential nature of video frames by setting a time window and the frames outside the window have no influence on current video frame. Since the log-style of the DPP probability for each subset of frames is a non-monotone submodular function, local search as well as greedy techniques with cardinality constraints are adopted to make k-seqLS fixed-sized, efficient and with theoretical guarantee. Our experiments show that our proposed k-seqLS exhibits higher performance while maintaining practical running time.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121059537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
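The abstract above describes greedy and local-search selection under a cardinality constraint with a DPP-style log-det objective over a sliding time window. The sketch below illustrates only the greedy half of that recipe on synthetic features; the linear kernel, the per-frame re-selection strategy, and all function names are assumptions made for illustration, not the authors' k-seqLS algorithm.

```python
# Illustrative sketch only: cardinality-constrained greedy selection with a
# log-det (DPP-style) objective over a sliding window of streaming frames.
# NOT the authors' k-seqLS (which also uses local search and carries guarantees).
import numpy as np

def logdet_score(features, subset, alpha=1e-3):
    """log det of the similarity kernel restricted to `subset` (inner-product kernel)."""
    if not subset:
        return 0.0
    X = features[list(subset)]                       # |S| x d feature matrix
    K = X @ X.T + alpha * np.eye(len(subset))        # regularized linear kernel
    sign, logdet = np.linalg.slogdet(K)
    return logdet if sign > 0 else -np.inf

def streaming_fixed_size_summary(frame_stream, k=5, window=50):
    """Keep at most `k` frames; only frames inside the time window are candidates."""
    features, buffer, summary = [], [], []
    for t, frame_feat in enumerate(frame_stream):
        features.append(frame_feat)
        buffer = [i for i in buffer if t - i < window] + [t]    # sliding window of indices
        F = np.vstack(features)
        chosen = []
        # greedy: re-pick up to k frames from the window to maximize the log-det gain
        for _ in range(min(k, len(buffer))):
            gains = [(logdet_score(F, chosen + [i]) - logdet_score(F, chosen), i)
                     for i in buffer if i not in chosen]
            best_gain, best_i = max(gains)
            if best_gain <= 0:
                break
            chosen.append(best_i)
        summary = chosen
    return summary

# toy usage: 100 random 16-d frame descriptors standing in for a video stream
rng = np.random.default_rng(0)
print(streaming_fixed_size_summary(rng.normal(size=(100, 16)), k=5, window=50))
```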
Full-resolution encoder-decoder networks with multi-scale feature fusion for human pose estimation
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446282
Jie Ou, Mingjian Chen, Hong Wu
{"title":"Full-resolution encoder-decoder networks with multi-scale feature fusion for human pose estimation","authors":"Jie Ou, Mingjian Chen, Hong Wu","doi":"10.1145/3444685.3446282","DOIUrl":"https://doi.org/10.1145/3444685.3446282","url":null,"abstract":"To achieve more accurate 2D human pose estimation, we extend the successful encoder-decoder network, simple baseline network (SBN), in three ways. To reduce the quantization errors caused by the large output stride size, two more decoder modules are appended to the end of the simple baseline network to get full output resolution. Then, the global context blocks (GCBs) are added to the encoder and decoder modules to enhance them with global context features. Furthermore, we propose a novel spatial-attention-based multi-scale feature collection and distribution module (SA-MFCD) to fuse and distribute multi-scale features to boost the pose estimation. Experimental results on the MS COCO dataset indicate that our network can remarkably improve the accuracy of human pose estimation over SBN, our network using ResNet34 as the backbone network can even achieve the same accuracy as SBN with ResNet152, and our networks can achieve superior results with big backbone networks.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115592855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
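The SA-MFCD module described above fuses multi-scale features with spatial attention. The minimal sketch below shows one plausible form of such a fusion step; the layer layout, channel sizes, and softmax weighting are assumptions, not the paper's design.

```python
# Illustrative sketch only: spatial-attention fusion of multi-scale feature maps,
# in the spirit of (but not identical to) the SA-MFCD module described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttentionFusion(nn.Module):
    def __init__(self, channels, num_scales=3):
        super().__init__()
        # one 1x1 conv per scale produces a single-channel attention logit map
        self.att = nn.ModuleList([nn.Conv2d(channels, 1, kernel_size=1)
                                  for _ in range(num_scales)])

    def forward(self, feats):
        """feats: list of (N, C, Hi, Wi) tensors at different strides."""
        target = feats[0].shape[-2:]                      # fuse at the finest resolution
        up = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
              for f in feats]
        logits = torch.cat([att(f) for att, f in zip(self.att, up)], dim=1)  # (N, S, H, W)
        weights = torch.softmax(logits, dim=1)            # per-pixel weights over scales
        return sum(weights[:, i:i+1] * up[i] for i in range(len(up)))

# toy usage: three scales of a 64-channel feature pyramid
x = [torch.randn(1, 64, 64, 48), torch.randn(1, 64, 32, 24), torch.randn(1, 64, 16, 12)]
print(SpatialAttentionFusion(64)(x).shape)    # torch.Size([1, 64, 64, 48])
```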
Scene graph generation via multi-relation classification and cross-modal attention coordinator
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446276
Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang Yang
{"title":"Scene graph generation via multi-relation classification and cross-modal attention coordinator","authors":"Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang Yang","doi":"10.1145/3444685.3446276","DOIUrl":"https://doi.org/10.1145/3444685.3446276","url":null,"abstract":"Scene graph generation intends to build graph-based representation from images, where nodes and edges respectively represent objects and relationships between them. However, scene graph generation today is heavily limited by imbalanced class prediction. Specifically, most of existing work achieves satisfying performance on simple and frequent relation classes (e.g. on), yet leaving poor performance with fine-grained and infrequent ones (e.g. walk on, stand on). To tackle this problem, in this paper, we redesign the framework as two branches, representation learning branch and classifier learning branch, for a more balanced scene graph generator. Furthermore, for representation learning branch, we propose Cross-modal Attention Coordinator (CAC) to gather consistent features from multi-modal using dynamic attention. For classifier learning branch, we first transfer relation classes' knowledge from large scale corpus, then we leverage Multi-Relationship classifier via Graph Attention neTworks (MR-GAT) to bridge the gap between frequent relations and infrequent ones. The comprehensive experimental results on VG200, a challenge dataset, indicate the competitiveness and the significant superiority of our proposed approach.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114960445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
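The CAC branch described above gathers features from multiple modalities with dynamic attention. The toy sketch below shows one way such a coordinator could weight visual and textual features; all dimensions and the gating form are assumptions, not the paper's architecture.

```python
# Illustrative sketch only: dynamic attention over two modalities, loosely in the
# spirit of the CAC module described above; sizes and gating form are assumed.
import torch
import torch.nn as nn

class CrossModalCoordinator(nn.Module):
    def __init__(self, visual_dim=512, text_dim=300, out_dim=256):
        super().__init__()
        self.v_proj = nn.Linear(visual_dim, out_dim)
        self.t_proj = nn.Linear(text_dim, out_dim)
        self.gate = nn.Linear(2 * out_dim, 2)      # one attention logit per modality

    def forward(self, visual_feat, text_feat):
        v = torch.tanh(self.v_proj(visual_feat))   # (N, out_dim)
        t = torch.tanh(self.t_proj(text_feat))     # (N, out_dim)
        a = torch.softmax(self.gate(torch.cat([v, t], dim=-1)), dim=-1)  # (N, 2)
        return a[:, :1] * v + a[:, 1:] * t         # attention-weighted fusion

# toy usage: 8 object proposals with visual and word-embedding features
fused = CrossModalCoordinator()(torch.randn(8, 512), torch.randn(8, 300))
print(fused.shape)   # torch.Size([8, 256])
```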
Efficient inter-image relation graph neural network hashing for scalable image retrieval
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446321
Hui Cui, Lei Zhu, Wentao Tan
{"title":"Efficient inter-image relation graph neural network hashing for scalable image retrieval","authors":"Hui Cui, Lei Zhu, Wentao Tan","doi":"10.1145/3444685.3446321","DOIUrl":"https://doi.org/10.1145/3444685.3446321","url":null,"abstract":"Unsupervised deep hashing is a promising technique for large-scale image retrieval, as it equips powerful deep neural networks and has advantage on label independence. However, the unsupervised deep hashing process needs to train a large amount of deep neural network parameters, which is hard to optimize when no labeled training samples are provided. How to maintain the well scalability of unsupervised hashing while exploiting the advantage of deep neural network is an interesting but challenging problem to investigate. With the motivation, in this paper, we propose a simple but effective Inter-image Relation Graph Neural Network Hashing (IRGNNH) method. Different from all existing complex models, we discover the latent inter-image semantic relations without any manual labels and exploit them further to assist the unsupervised deep hashing process. Specifically, we first parse the images to extract latent involved semantics. Then, relation graph convolutional network is constructed to model the inter-image semantic relations and visual similarity, which generates representation vectors for image relations and contents. Finally, adversarial learning is performed to seamlessly embed the constructed relations into the image hash learning process, and improve the discriminative capability of the hash codes. Experiments demonstrate that our method significantly outperforms the state-of-the-art unsupervised deep hashing methods on both retrieval accuracy and efficiency.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129932555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
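The abstract above builds a relation graph over images and learns hash codes from it. The sketch below shows a single graph-convolution step over an inter-image affinity matrix followed by sign binarization; the normalization, the tanh relaxation, and all sizes are assumptions rather than the IRGNNH model.

```python
# Illustrative sketch only: one graph-convolution step over an inter-image affinity
# graph followed by hash-code binarization; NOT the IRGNNH model itself.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphHashEncoder(nn.Module):
    def __init__(self, feat_dim=512, code_bits=64):
        super().__init__()
        self.gc = nn.Linear(feat_dim, code_bits)     # shared weight for the graph conv

    def forward(self, features, adjacency):
        """features: (N, feat_dim); adjacency: (N, N) non-negative affinities."""
        A = adjacency + torch.eye(adjacency.size(0))             # add self-loops
        d = A.sum(dim=1, keepdim=True).clamp(min=1e-8)
        A_norm = A / d                                            # row-normalize
        h = torch.tanh(self.gc(A_norm @ features))                # relaxed codes in (-1, 1)
        return torch.sign(h), h                                   # binary codes + relaxation

# toy usage: 6 images, affinity taken as clipped cosine similarity of random features
x = torch.randn(6, 512)
sim = torch.clamp(F.cosine_similarity(x.unsqueeze(1), x.unsqueeze(0), dim=-1), min=0)
codes, relaxed = GraphHashEncoder()(x, sim)
print(codes.shape)   # torch.Size([6, 64])
```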
Cross-cultural design of facial expressions for humanoids: is there cultural difference between Japan and Denmark?
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446294
I. Kanaya, Meina Tawaki, Keiko Yamamoto
{"title":"Cross-cultural design of facial expressions for humanoids: is there cultural difference between Japan and Denmark?","authors":"I. Kanaya, Meina Tawaki, Keiko Yamamoto","doi":"10.1145/3444685.3446294","DOIUrl":"https://doi.org/10.1145/3444685.3446294","url":null,"abstract":"In this research, the authors succeeded in creating facial expressions made with the minimum necessary elements for recognizing a face. The elements are two eyes and a mouth made using precise circles, which are transformed to make facial expressions geometrically, through rotation and vertically scaling transformation. The facial expression patterns made by the geometric elements and transformations were composed employing three dimensions of visual information that had been suggested by many previous researches, slantedness of the mouth, openness of the face, and slantedness of the eyes. The authors found that this minimal facial expressions can be classified into 10 emotions: happy, angry, sad, disgust, fear, surprised, angry*, fear*, neutral (pleasant) indicating positive emotion, and neutral (unpleasant) indicating negative emotion. The authors also investigate and report cultural differences of impressions of facial expressions of above-mentioned simplified face.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117017056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
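The stimuli described above are built from circles transformed by rotation and vertical scaling. The short sketch below generates one such element with those two transformations; the parameter values are arbitrary, and this is not the authors' stimulus-generation code.

```python
# Illustrative sketch only: one minimal face element (a circle) under the two
# transformations named in the abstract, rotation and vertical scaling.
import numpy as np

def circle(radius=1.0, n=100):
    t = np.linspace(0.0, 2.0 * np.pi, n)
    return np.stack([radius * np.cos(t), radius * np.sin(t)], axis=1)   # (n, 2) points

def rotate(points, angle_deg):
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    return points @ R.T

def scale_vertically(points, factor):
    return points * np.array([1.0, factor])

# a slanted eye: circle flattened to 40% height, then rotated by 20 degrees
eye = rotate(scale_vertically(circle(0.5), 0.4), 20.0)
print(eye.shape)   # (100, 2)
```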
Real-time arbitrary video style transfer
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446301
Xingyu Liu, Zongxing Ji, Piao Huang, Tongwei Ren
{"title":"Real-time arbitrary video style transfer","authors":"Xingyu Liu, Zongxing Ji, Piao Huang, Tongwei Ren","doi":"10.1145/3444685.3446301","DOIUrl":"https://doi.org/10.1145/3444685.3446301","url":null,"abstract":"Video style transfer aims to synthesize a stylized video that has similar content structure with a content video and is rendered in the style of a style image. The existing video style transfer methods cannot simultaneously realize high efficiency, arbitrary style and temporal consistency. In this paper, we propose the first real-time arbitrary video style transfer method with only one model. Specifically, we utilize a three-network architecture consisting of a prediction network, a stylization network and a loss network. Prediction network is used for extracting style parameters from a given style image; Stylization network is for generating the corresponding stylized video; Loss network is for training prediction network and stylization network, whose loss function includes content loss, style loss and temporal consistency loss. We conduct three experiments and a user study to test the effectiveness of our method. The experimental results show that our method outperforms the state-of-the-arts.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129210123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
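The loss network described above combines content, style, and temporal consistency terms. The sketch below assembles such a combined objective using a conventional Gram-matrix style loss; the loss forms and weights are common conventions assumed here, not the paper's exact formulation.

```python
# Illustrative sketch only: content + style + temporal-consistency objective as
# commonly formulated; weights and the Gram-matrix style loss are assumptions.
import torch
import torch.nn.functional as F

def gram(feat):                                  # (N, C, H, W) -> (N, C, C)
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def total_loss(stylized_feat, content_feat, style_feat,
               stylized_frame, warped_prev_stylized,
               w_content=1.0, w_style=10.0, w_temporal=1.0):
    content = F.mse_loss(stylized_feat, content_feat)            # keep content structure
    style = F.mse_loss(gram(stylized_feat), gram(style_feat))    # match style statistics
    temporal = F.mse_loss(stylized_frame, warped_prev_stylized)  # consistency between frames
    return w_content * content + w_style * style + w_temporal * temporal

# toy usage with random features and frames
loss = total_loss(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32),
                  torch.randn(1, 64, 32, 32), torch.randn(1, 3, 64, 64),
                  torch.randn(1, 3, 64, 64))
print(loss.item())
```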
Motion-transformer: self-supervised pre-training for skeleton-based action recognition
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446289
Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, Liang Lin
{"title":"Motion-transformer: self-supervised pre-training for skeleton-based action recognition","authors":"Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, Liang Lin","doi":"10.1145/3444685.3446289","DOIUrl":"https://doi.org/10.1145/3444685.3446289","url":null,"abstract":"With the development of deep learning, skeleton-based action recognition has achieved great progress in recent years. However, most of the current works focus on extracting more informative spatial representations of the human body, but haven't made full use of the temporal dependencies already contained in the sequence of human action. To this end, we propose a novel transformer-based model called Motion-Transformer to sufficiently capture the temporal dependencies via self-supervised pre-training on the sequence of human action. Besides, we propose to predict the motion flow of human skeletons for better learning the temporal dependencies in sequence. The pre-trained model is then fine-tuned on the task of action recognition. Experimental results on the large scale NTU RGB+D dataset shows our model is effective in modeling temporal relation, and the flow prediction pre-training is beneficial to expose the inherent dependencies in time dimensional. With this pre-training and fine-tuning paradigm, our final model outperforms previous state-of-the-art methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131402574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 21
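The pre-training described above asks the model to predict the motion flow of a skeleton sequence. The sketch below implements one plausible version of that objective, taking motion flow to be frame-to-frame joint displacements; the transformer sizes and the displacement target are assumptions, not the Motion-Transformer itself.

```python
# Illustrative sketch only: self-supervised pre-training in which a transformer
# encoder predicts skeleton motion flow, taken here as frame-to-frame displacements.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionFlowPretrainer(nn.Module):
    def __init__(self, joints=25, dims=3, d_model=128):
        super().__init__()
        self.embed = nn.Linear(joints * dims, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, joints * dims)     # predicts per-frame motion flow

    def forward(self, skeletons):
        """skeletons: (N, T, joints*dims) -> predicted flow (N, T-1, joints*dims)."""
        h = self.encoder(self.embed(skeletons))
        return self.head(h)[:, :-1]                       # one prediction per transition t -> t+1

def pretrain_step(model, skeletons, optimizer):
    flow_target = skeletons[:, 1:] - skeletons[:, :-1]    # ground-truth motion flow
    loss = F.mse_loss(model(skeletons), flow_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage: 4 sequences of 30 frames, 25 joints in 3-D (flattened to 75)
model = MotionFlowPretrainer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
print(pretrain_step(model, torch.randn(4, 30, 75), opt))
```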
Synthesized 3D models with smartphone based MR to modify the PreBuilt environment: interior design
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446251
Anish Bhardwaj, N. Chauhan, R. Shah
{"title":"Synthesized 3D models with smartphone based MR to modify the PreBuilt environment: interior design","authors":"Anish Bhardwaj, N. Chauhan, R. Shah","doi":"10.1145/3444685.3446251","DOIUrl":"https://doi.org/10.1145/3444685.3446251","url":null,"abstract":"The past few years have seen an increase in the number of products that use AR and VR as well as the emergence of products in both these categories i.e. Mixed Reality. However, current systems are exclusive to a market that exists in the top 1% of the population in most countries due to the expensive and heavy technology required by these systems. This project showcases a system in the field of Smartphone Based Mixed Reality through an Interior Design Solution that allows the user to visualise their design choices through the lens of a smartphone. Our system uses Image Processing algorithms to perceive room dimensions alongside a GUI which allows a user to create their own blueprints. Navigable 3D models are created from these blueprints, allowing users to view their builds. Following this, Users switch to the mobile application for the purpose of visualising their ideas in their own homes (MR). This System/POC showcases the potential of MR as a field that can be explored for a larger portion of the population through a more efficient medium.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124845606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Pulse localization networks with infrared camera
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446318
Bohong Yang, Kai Meng, Hong Lu, Xinyao Nie, Guanhao Huang, Jingjing Luo, Xing Zhu
{"title":"Pulse localization networks with infrared camera","authors":"Bohong Yang, Kai Meng, Hong Lu, Xinyao Nie, Guanhao Huang, Jingjing Luo, Xing Zhu","doi":"10.1145/3444685.3446318","DOIUrl":"https://doi.org/10.1145/3444685.3446318","url":null,"abstract":"Pulse localization is the basic task of the pulse diagnosis with robot. More accurate location can reduce the misdiagnosis caused by different types of pulse. Traditional works usually use a collection surface with a certain area for contact detection, and move the collection surface to collect changes of power for pulse localization. These methods often require the subjects place their wrist in a given position. In this paper, we propose a novel pulse localization method which uses the infrared camera as the input sensor, and locates the pulse on wrist with the neural network. This method can not only reduce the contact between the machine and the subject, reduce the discomfort of the process, but also reduce the preparation time for the test, which can improve the detection efficiency. The experiments show that our proposed method can locate the pulse with high accuracy. And we have applied this method to pulse diagnosis robot for pulse data collection.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122562301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
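The method described above locates the pulse on the wrist from an infrared image with a neural network. The sketch below shows a minimal direct-regression formulation of that idea; the layer sizes and the coordinate-regression head are assumptions, not the paper's network.

```python
# Illustrative sketch only: a small CNN regressing a normalized (x, y) pulse location
# from a single-channel infrared wrist image; NOT the paper's architecture.
import torch
import torch.nn as nn

class PulseLocator(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(64, 2), nn.Sigmoid())

    def forward(self, ir_image):
        """ir_image: (N, 1, H, W) infrared frame -> (N, 2) normalized pulse coordinates."""
        return self.head(self.features(ir_image))

# toy usage: one 128x128 infrared frame
print(PulseLocator()(torch.randn(1, 1, 128, 128)))
```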
Self-supervised adversarial learning for cross-modal retrieval
Proceedings of the 2nd ACM International Conference on Multimedia in Asia Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446269
Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen
{"title":"Self-supervised adversarial learning for cross-modal retrieval","authors":"Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen","doi":"10.1145/3444685.3446269","DOIUrl":"https://doi.org/10.1145/3444685.3446269","url":null,"abstract":"Cross-modal retrieval aims at enabling flexible retrieval across different modalities. The core of cross-modal retrieval is to learn projections for different modalities and make instances in the learned common subspace comparable to each other. Self-supervised learning automatically creates a supervision signal by transformation of input data and learns semantic features by training to predict the artificial labels. In this paper, we proposed a novel method named Self-Supervised Adversarial Learning (SSAL) for Cross-Modal Retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that can confuse an adversarial discriminator consists of two classifiers. One of the classifiers aims to predict rotation angle from image representations, while the other classifier tries to discriminate between different modalities from the learned embeddings. By confusing the self-supervised adversarial model, feature projector filters out the abundant high-level visual semantics and learns image embeddings that are better aligned with text modality in the common subspace. Through the joint exploitation of the above, an effective common subspace is learned, in which representations of different modlities are aligned better and common information of different modalities is well preserved. Comprehensive experimental results on three widely-used benchmark datasets show that the proposed method is superior in cross-modal retrieval and significantly outperforms the existing cross-modal retrieval methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129551921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
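The discriminator described above consists of a rotation-angle classifier and a modality classifier that the feature projector tries to confuse. The sketch below writes out one plausible version of those two heads and the projector's adversarial term; the dimensions, the four-angle rotation setup, and the uniform-confusion loss are assumptions, not the SSAL objective itself.

```python
# Illustrative sketch only: two discriminator heads (rotation prediction, modality
# discrimination) and a projector term that rewards confusing the modality head.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed_dim = 256
rotation_head = nn.Linear(embed_dim, 4)    # predicts one of 0/90/180/270 degrees
modality_head = nn.Linear(embed_dim, 2)    # predicts image vs. text

def discriminator_loss(img_emb, txt_emb, rotation_labels):
    rot = F.cross_entropy(rotation_head(img_emb), rotation_labels)
    emb = torch.cat([img_emb, txt_emb], dim=0)
    mod_labels = torch.cat([torch.zeros(len(img_emb)), torch.ones(len(txt_emb))]).long()
    mod = F.cross_entropy(modality_head(emb), mod_labels)
    return rot + mod

def projector_adversarial_loss(img_emb, txt_emb):
    # the projector is rewarded when the modality head is maximally confused,
    # i.e. its prediction is close to uniform over the two modalities
    emb = torch.cat([img_emb, txt_emb], dim=0)
    probs = torch.softmax(modality_head(emb), dim=-1)
    return F.mse_loss(probs, torch.full_like(probs, 0.5))

# toy usage: 8 image embeddings (with rotation labels) and 8 text embeddings
img, txt = torch.randn(8, embed_dim), torch.randn(8, embed_dim)
print(discriminator_loss(img, txt, torch.randint(0, 4, (8,))).item(),
      projector_adversarial_loss(img, txt).item())
```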