Latest Publications from ACM Multimedia Asia

Structural Knowledge Organization and Transfer for Class-Incremental Learning
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490598
Yu Liu, Xiaopeng Hong, Xiaoyu Tao, Songlin Dong, Jingang Shi, Yihong Gong
{"title":"Structural Knowledge Organization and Transfer for Class-Incremental Learning","authors":"Yu Liu, Xiaopeng Hong, Xiaoyu Tao, Songlin Dong, Jingang Shi, Yihong Gong","doi":"10.1145/3469877.3490598","DOIUrl":"https://doi.org/10.1145/3469877.3490598","url":null,"abstract":"Deep models are vulnerable to catastrophic forgetting when fine-tuned on new data. Popular distillation-based methods usually neglect the relations between data samples and may eventually forget essential structural knowledge. To solve these shortcomings, we propose a structural graph knowledge distillation based incremental learning framework to preserve both the positions of samples and their relations. Firstly, a memory knowledge graph (MKG) is generated to fully characterize the structural knowledge of historical tasks. Secondly, we develop a graph interpolation mechanism to enrich the domain of knowledge and alleviate the inter-class sample imbalance issue. Thirdly, we introduce structural graph knowledge distillation to transfer the knowledge of historical tasks. Comprehensive experiments on three datasets validate the proposed method.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114107435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
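The abstract above centers on distilling both sample positions and pairwise relations from a memory graph of exemplars. Below is a minimal PyTorch sketch of that idea, assuming the old (frozen) and new models expose exemplar embeddings; the loss terms, weights, and function names are illustrative and not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def structural_distillation_loss(feat_old, feat_new, alpha=1.0, beta=1.0):
    """Distill structural knowledge from a frozen old model to the new one.

    feat_old: (N, D) exemplar embeddings from the frozen previous-task model.
    feat_new: (N, D) embeddings of the same exemplars from the current model.
    The "position" term keeps each node of the memory graph in place; the
    "relation" term preserves pairwise cosine similarities (graph edges).
    """
    # Node positions: keep each exemplar embedding close to its old position.
    position_loss = F.mse_loss(feat_new, feat_old.detach())

    # Edge relations: match the pairwise similarity (adjacency) matrices.
    sim_old = F.normalize(feat_old, dim=1) @ F.normalize(feat_old, dim=1).T
    sim_new = F.normalize(feat_new, dim=1) @ F.normalize(feat_new, dim=1).T
    relation_loss = F.mse_loss(sim_new, sim_old.detach())

    return alpha * position_loss + beta * relation_loss


if __name__ == "__main__":
    old = torch.randn(32, 128)               # embeddings from the frozen model
    new = old + 0.1 * torch.randn(32, 128)   # embeddings from the adapting model
    print(structural_distillation_loss(old, new))
```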
MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490596
Yixin Zhang, Yoko Yamakata, Keishi Tajima
{"title":"MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients","authors":"Yixin Zhang, Yoko Yamakata, Keishi Tajima","doi":"10.1145/3469877.3490596","DOIUrl":"https://doi.org/10.1145/3469877.3490596","url":null,"abstract":"In this paper, we introduce a new recipe dataset MIRecipe (Multimedia-Instructional Recipe). It has both text and image data for every cooking step, while the conventional recipe datasets only contain final dish images, and/or images only for some of the steps. It consists of 26,725 recipes, which include 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: Since ingredients are processed during cooking, the appearance of the same ingredient is very different in the beginning and finishing stages of the cooking. The general object recognition methods, which assume the constant appearance of objects, do not perform well for such objects. To solve the problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. Our experiment with our dataset shows that our method achieves higher accuracy than the model trained using all the data without considering the stages. Our dataset is available at our GitHub repository.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121379082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
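The stage-aware curriculum described above, training first on beginning-stage data and then progressively adding later stages, can be sketched as a simple training schedule. The dataset ordering, optimizer, and hyperparameters below are assumptions for illustration, not the paper's setup.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def stage_aware_curriculum(model, stage_datasets, epochs_per_stage=2, lr=1e-3):
    """Train starting from beginning-stage data, then add later stages.

    stage_datasets: list of datasets ordered from the earliest cooking stage
    to the latest; at curriculum step k the model sees stages 0..k.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for k in range(len(stage_datasets)):
        loader = DataLoader(ConcatDataset(stage_datasets[: k + 1]),
                            batch_size=32, shuffle=True)
        for _ in range(epochs_per_stage):
            for images, labels in loader:
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()
    return model


if __name__ == "__main__":
    # Toy stand-in data: three stages of 64x64 RGB crops with 10 ingredient labels.
    stages = [TensorDataset(torch.randn(100, 3, 64, 64),
                            torch.randint(0, 10, (100,))) for _ in range(3)]
    toy_model = torch.nn.Sequential(torch.nn.Flatten(),
                                    torch.nn.Linear(3 * 64 * 64, 10))
    stage_aware_curriculum(toy_model, stages, epochs_per_stage=1)
```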
Convolutional Neural Network-Based Pure Paint Pigment Identification Using Hyperspectral Images
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3495641
Ailin Chen, R. Jesus, M. Vilarigues
{"title":"Convolutional Neural Network-Based Pure Paint Pigment Identification Using Hyperspectral Images","authors":"Ailin Chen, R. Jesus, M. Vilarigues","doi":"10.1145/3469877.3495641","DOIUrl":"https://doi.org/10.1145/3469877.3495641","url":null,"abstract":"This research presents the results of the implementation of deep learning neural networks in the identification of pure pigments of heritage artwork, namely paintings. Our paper applies an innovative three-branch deep learning model to maximise the correct identification of pure pigments. The model proposed combines the feature maps obtained from hyperspectral images through multiple convolutional neural networks, and numerical, hyperspectral metric data with respect to a set of reference reflectances. The results obtained exhibit an accurate representation of the pure predicted pigments which are confirmed through the use of analytical techniques. The model presented outperformed the compared counterparts and is deemed to be an important direction, not only in terms of utilisation of hyperspectral data and concrete pigment data in heritage analysis, but also in the application of deep learning in other fields.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128015404","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
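As a rough illustration of the three-branch design described above (CNN feature maps from hyperspectral patches combined with numerical metrics against reference reflectances), the sketch below fuses two convolutional branches with an MLP branch. All layer widths, kernel sizes, and the concatenation-based fusion are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class ThreeBranchPigmentNet(nn.Module):
    """Illustrative three-branch classifier: two CNN branches over a
    hyperspectral patch (treated as a multi-channel image) and an MLP branch
    over numerical metrics computed against reference reflectances.
    """
    def __init__(self, num_bands=64, num_metrics=16, num_pigments=12):
        super().__init__()
        def cnn_branch(kernel):
            return nn.Sequential(
                nn.Conv2d(num_bands, 32, kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten())
        self.branch_a = cnn_branch(3)   # fine spatial context
        self.branch_b = cnn_branch(7)   # coarser spatial context
        self.metric_branch = nn.Sequential(
            nn.Linear(num_metrics, 32), nn.ReLU())
        self.classifier = nn.Linear(32 * 3, num_pigments)

    def forward(self, patch, metrics):
        fused = torch.cat([self.branch_a(patch),
                           self.branch_b(patch),
                           self.metric_branch(metrics)], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    net = ThreeBranchPigmentNet()
    logits = net(torch.randn(4, 64, 32, 32), torch.randn(4, 16))
    print(logits.shape)  # torch.Size([4, 12])
```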
Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490592
Hang Yu, Weixin Li, Jiankai Li, Ye Du
{"title":"Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension","authors":"Hang Yu, Weixin Li, Jiankai Li, Ye Du","doi":"10.1145/3469877.3490592","DOIUrl":"https://doi.org/10.1145/3469877.3490592","url":null,"abstract":"Referring Expression Comprehension (REC) is the task of grounding object which is referred by the language expression. Previous one-stage REC methods usually use one single language feature vector to represent the whole query for grounding and no reasoning between different objects is performed despite the rich relation cues of objects contained in the language expression, which depresses their grounding accuracy. Additionally, these methods mostly use the feature pyramid networks for multi-scale visual object feature extraction but ground on different feature layers separately, neglecting the connections between objects with different scales. To address these problems, we propose a novel one-stage REC method, i.e. the Entity Relation Fusion Network (ERFN) to locate referred object by relation guided reasoning on different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model to utilize language to guide the fusion of representations of objects with different scales into one feature map.For modeling connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities in the language expression to enhance the referred entity feature in the visual object feature map, and further extracts relations to guide object feature fusion based on the self-attention mechanism. Experimental results show that our method is competitive with the state-of-the-art one-stage and two-stage REC methods, and can also keep inferring in real time.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133439636","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
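The Language Guided Multi-Scale Fusion idea above, fusing feature-pyramid levels into one map under guidance of the query embedding, can be sketched as language-conditioned scale weighting. Dimensions and the softmax weighting below are assumptions of this sketch, not ERFN's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedMultiScaleFusion(nn.Module):
    """Illustrative fusion of FPN-style features into one map, with per-scale
    weights predicted from the query embedding.
    """
    def __init__(self, num_scales=3, vis_dim=256, lang_dim=512):
        super().__init__()
        self.scale_weights = nn.Linear(lang_dim, num_scales)

    def forward(self, feats, lang):
        # feats: list of (B, C, Hi, Wi) maps; lang: (B, lang_dim) query embedding.
        target = feats[0].shape[-2:]
        resized = [F.interpolate(f, size=target, mode="bilinear",
                                 align_corners=False) for f in feats]
        stacked = torch.stack(resized, dim=1)               # (B, S, C, H, W)
        w = torch.softmax(self.scale_weights(lang), dim=1)  # (B, S)
        return (stacked * w[:, :, None, None, None]).sum(dim=1)


if __name__ == "__main__":
    fuse = LanguageGuidedMultiScaleFusion()
    maps = [torch.randn(2, 256, 32, 32), torch.randn(2, 256, 16, 16),
            torch.randn(2, 256, 8, 8)]
    out = fuse(maps, torch.randn(2, 512))
    print(out.shape)  # torch.Size([2, 256, 32, 32])
```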
Flat and Shallow: Understanding Fake Image Detection Models by Architecture Profiling
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490566
Jing-Fen Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
{"title":"Flat and Shallow: Understanding Fake Image Detection Models by Architecture Profiling","authors":"Jing-Fen Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei","doi":"10.1145/3469877.3490566","DOIUrl":"https://doi.org/10.1145/3469877.3490566","url":null,"abstract":"Digital image manipulations have been heavily abused to spread misinformation. Despite the great efforts dedicated in research community, prior works are mostly performance-driven, i.e., optimizing performances using standard/heavy networks designed for semantic classification. A thorough understanding for fake images detection models is still missing. This paper studies the essential ingredients for a good fake image detection model, by profiling the best-performing architectures. Specifically, we conduct a thorough analysis on a massive number of detection models, and observe how the performances are affected by different patterns of network structure. Our key findings include: 1) with the same computational budget, flat network structures (e.g., large kernel sizes, wide connections) perform better than commonly used deep networks; 2) operations in shallow layers deserve more computational capacities to trade-off performance and computational cost. These findings sketch a general profile for essential models of fake image detection, which show clear differences with those for semantic classification. Furthermore, based on our analysis, we propose a new Depth-Separable Search Space (DSS) for fake image detection. Compared to state-of-the-art methods, our model achieves competitive performance while saving more than 50% parameters.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134628278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
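To make the "flat and shallow" profile concrete, the toy model below follows the reported pattern: few stages, wide channels, large kernels, and most capacity placed in the early layers. It is only an illustration of the finding, not the searched DSS architecture.

```python
import torch
import torch.nn as nn

class FlatShallowDetector(nn.Module):
    """Toy illustration of the 'flat and shallow' profile. Channel counts and
    kernel sizes are illustrative only.
    """
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            # Shallow layers get the widest channels and largest kernels.
            nn.Conv2d(3, 128, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            # Only a couple of further stages, kept comparatively light.
            nn.Conv2d(128, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))


if __name__ == "__main__":
    model = FlatShallowDetector()
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2])
```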
Improving Hyperspectral Super-Resolution via Heterogeneous Knowledge Distillation
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490610
Ziqian Liu, Qing Ma, Junjun Jiang, Xianming Liu
{"title":"Improving Hyperspectral Super-Resolution via Heterogeneous Knowledge Distillation","authors":"Ziqian Liu, Qing Ma, Junjun Jiang, Xianming Liu","doi":"10.1145/3469877.3490610","DOIUrl":"https://doi.org/10.1145/3469877.3490610","url":null,"abstract":"Hyperspectral images (HSI) contains rich spectrum information but their spatial resolution is often limited by imaging system. Super-resolution (SR) reconstruction becomes a hot topic aiming to increase spatial resolution without extra hardware cost. The fusion-based hyperspectral image super-resolution (FHSR) methods use supplementary high-resolution multispectral images (HR-MSI) to recover spatial details, but well co-registered HR-MSI is hard to collect. Recently, single hyperspectral image super-resolution (SHSR) methods based on deep learning have made great progress. However, lack of HR-MSI input makes these SHSR methods difficult to exploit the spatial information. To take advantages of FHSR and SHSR methods, in this paper we propose a new pipeline treating HR-MSI as privilege information and try to improve our SHSR model with knowledge distillation. That is, our model uses paired MSI-HSI data to train and only needs LR-HSI as input during inference. Specifically, we combine SHSR and spectral super-resolution (SSR) and design a novel architecture, Distillation-Oriented Dual-branch Net (DODN), to make the SHSR model fully employ transferred knowledge from the SSR model. Since the main stream of SSR model are 2D CNNs and full 2D CNN causes spectral disorder in SHSR task, a new mixed 2D/3D block, called Distillation-Oriented Dual-branch Block (DODB) is proposed, where the 3D branch extracts spectral-spatial correlation while the 2D branch accepts information from the SSR model through knowledge distillation. The main idea is to distill the knowledge of spatial information from HR-MSI to the SHSR model without changing its network architecture. Extensive experiments on two benchmark datasets, CAVE and NTIRE2020, demonstrate that our proposed DODN outperforms the state-of-the-art SHSR methods, in terms of both quantitative and qualitative analysis.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132535736","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
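A minimal sketch of a mixed 2D/3D block in the spirit of the DODB described above: a 3D branch models spectral-spatial correlation, while a 2D branch yields features that can be matched to a frozen SSR teacher with a feature-distillation term. Channel sizes, the residual fusion, and the MSE distillation loss are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchBlock(nn.Module):
    """Illustrative mixed 2D/3D block: a 3D branch for spectral-spatial
    correlation plus a 2D branch whose features serve as a distillation target.
    """
    def __init__(self, bands=31, width=32):
        super().__init__()
        self.branch3d = nn.Conv3d(1, 1, kernel_size=3, padding=1)
        self.branch2d = nn.Conv2d(bands, width, kernel_size=3, padding=1)
        self.proj_back = nn.Conv2d(width, bands, kernel_size=1)

    def forward(self, hsi):
        # hsi: (B, bands, H, W) low-resolution hyperspectral input.
        out3d = self.branch3d(hsi.unsqueeze(1)).squeeze(1)   # spectral-spatial
        feat2d = self.branch2d(hsi)                          # distillation target
        return hsi + out3d + self.proj_back(feat2d), feat2d


def distillation_loss(student_feat, teacher_feat):
    """Match the 2D-branch features to the teacher's (detached) features."""
    return F.mse_loss(student_feat, teacher_feat.detach())


if __name__ == "__main__":
    block = DualBranchBlock()
    x = torch.randn(2, 31, 16, 16)
    y, feat = block(x)
    print(y.shape, distillation_loss(feat, torch.randn_like(feat)).item())
```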
Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490582
Hehe Fan, Mohan S. Kankanhalli
{"title":"Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos","authors":"Hehe Fan, Mohan S. Kankanhalli","doi":"10.1145/3469877.3490582","DOIUrl":"https://doi.org/10.1145/3469877.3490582","url":null,"abstract":"Motion, according to its definition in physics, is the change in position with respect to time, regardless of the specific moving object and background. In this paper, we aim to learn appearance-independent motion representation in an unsupervised manner. The main idea is to separate motion from videos while leaving objects and background as content. Specifically, we design an encoder-decoder model which consists of a content encoder, a motion encoder and a video generator. To train the model, we leverage a one-step cycle-consistency in reconstruction within the same video and a two-step cycle-consistency in generation across different videos as self-supervised signals, and use adversarial training to remove the content representation from the motion representation. We demonstrate that the proposed framework can be used for conditional video generation and fine-grained action recognition.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133700966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
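The decomposition described above, encoding a clip into a content code and a motion code and reconstructing the clip from the two, can be sketched with a minimal autoencoder and a one-step reconstruction loss. The fully connected encoders and the omission of the cross-video cycle and adversarial terms are simplifications for illustration.

```python
import torch
import torch.nn as nn

class MotionContentAutoencoder(nn.Module):
    """Minimal sketch of the decomposition idea: a content code from one frame,
    a motion code from the frame sequence, and a generator that reconstructs
    the clip from the two codes.
    """
    def __init__(self, channels=3, content_dim=128, motion_dim=64, frames=8, size=32):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Flatten(), nn.Linear(channels * size * size, content_dim), nn.ReLU())
        self.motion_enc = nn.Sequential(
            nn.Flatten(), nn.Linear(frames * channels * size * size, motion_dim), nn.ReLU())
        self.generator = nn.Linear(content_dim + motion_dim,
                                   frames * channels * size * size)

    def forward(self, clip):
        # clip: (B, T, C, H, W); content is taken from the first frame only.
        content = self.content_enc(clip[:, 0])
        motion = self.motion_enc(clip)
        recon = self.generator(torch.cat([content, motion], dim=1))
        return recon.view_as(clip)


if __name__ == "__main__":
    model = MotionContentAutoencoder()
    clip = torch.randn(4, 8, 3, 32, 32)
    loss = nn.functional.mse_loss(model(clip), clip)  # one-step reconstruction
    print(loss.item())
```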
PLM-IPE: A Pixel-Landmark Mutual Enhanced Framework for Implicit Preference Estimation
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490621
Federico Becattini, Xuemeng Song, C. Baecchi, S. Fang, C. Ferrari, Liqiang Nie, A. del Bimbo
{"title":"PLM-IPE: A Pixel-Landmark Mutual Enhanced Framework for Implicit Preference Estimation","authors":"Federico Becattini, Xuemeng Song, C. Baecchi, S. Fang, C. Ferrari, Liqiang Nie, A. del Bimbo","doi":"10.1145/3469877.3490621","DOIUrl":"https://doi.org/10.1145/3469877.3490621","url":null,"abstract":"In this paper, we are interested in understanding how customers perceive fashion recommendations, in particular when observing a proposed combination of garments to compose an outfit. Automatically understanding how a suggested item is perceived, without any kind of active engagement, is in fact an essential block to achieve interactive applications. We propose a pixel-landmark mutual enhanced framework for implicit preference estimation, named PLM-IPE, which is capable of inferring the user’s implicit preferences exploiting visual cues, without any active or conscious engagement. PLM-IPE consists of three key modules: pixel-based estimator, landmark-based estimator and mutual learning based optimization. The former two modules work on capturing the implicit reaction of the user from the pixel level and landmark level, respectively. The last module serves to transfer knowledge between the two parallel estimators. Towards evaluation, we collected a real-world dataset, named SentiGarment, which contains 3,345 facial reaction videos paired with suggested outfits and human labeled reaction scores. Extensive experiments show the superiority of our model over state-of-the-art approaches.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123722148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
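The mutual-learning optimization described above, where the pixel-based and landmark-based estimators teach each other, can be sketched as two regressors that each fit the labeled reaction score while being pulled toward the other's detached prediction. Treating the score as scalar regression and the peer-loss weight `mu` are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mutual_learning_step(pixel_net, landmark_net, frames, landmarks, score,
                         opt_pixel, opt_landmark, mu=0.5):
    """One optimisation step of two parallel estimators that teach each other.

    Each branch fits the labelled reaction score and is additionally pulled
    towards the other branch's (detached) prediction.
    """
    pred_p = pixel_net(frames)
    pred_l = landmark_net(landmarks)

    loss_p = F.mse_loss(pred_p, score) + mu * F.mse_loss(pred_p, pred_l.detach())
    opt_pixel.zero_grad(); loss_p.backward(); opt_pixel.step()

    pred_l = landmark_net(landmarks)  # recompute after the pixel-branch update
    loss_l = F.mse_loss(pred_l, score) + mu * F.mse_loss(pred_l, pixel_net(frames).detach())
    opt_landmark.zero_grad(); loss_l.backward(); opt_landmark.step()
    return loss_p.item(), loss_l.item()


if __name__ == "__main__":
    pixel_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 1))
    landmark_net = nn.Sequential(nn.Flatten(), nn.Linear(68 * 2, 1))
    opts = (torch.optim.Adam(pixel_net.parameters(), lr=1e-3),
            torch.optim.Adam(landmark_net.parameters(), lr=1e-3))
    print(mutual_learning_step(pixel_net, landmark_net,
                               torch.randn(8, 3, 64, 64), torch.randn(8, 68, 2),
                               torch.randn(8, 1), *opts))
```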
Multi-branch Semantic Learning Network for Text-to-Image Synthesis
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490567
Jiading Ling, Xingcai Wu, Zhenguo Yang, Xudong Mao, Qing Li, Wenyin Liu
{"title":"Multi-branch Semantic Learning Network for Text-to-Image Synthesis","authors":"Jiading Ling, Xingcai Wu, Zhenguo Yang, Xudong Mao, Qing Li, Wenyin Liu","doi":"10.1145/3469877.3490567","DOIUrl":"https://doi.org/10.1145/3469877.3490567","url":null,"abstract":"In this paper, we propose a multi-branch semantic learning network (MSLN) to generate image according to textual description by taking into account global and local textual semantics, which consists of two stages. The first stage generates a coarse-grained image based on the sentence features. In the second stage, a multi-branch fine-grained generation model is constructed to inject the sentence-level and word-level semantics into two coarse-grained images by global and local attention modules, which generate global and local fine-grained image textures, respectively. In particular, we devise a channel fusion module (CFM) to fuse the global and local fine-grained features in the multi-branch fine-grained stage and generate the output image. Extensive experiments conducted on the CUB-200 dataset and Oxford-102 dataset demonstrate the superior performance of the proposed method. (e.g., FID is reduced from 16.09 to 14.43 on CUB-200).","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"333 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124302043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
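A rough sketch of a channel fusion module in the spirit of the CFM above: concatenate the global and local fine-grained feature maps along channels, re-weight channels with a squeeze-and-excite style gate, and project back to the working width. The gating design is an assumption, not the paper's exact module.

```python
import torch
import torch.nn as nn

class ChannelFusionModule(nn.Module):
    """Illustrative channel fusion of global- and local-attention feature maps."""
    def __init__(self, channels=64, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(),
            nn.Linear(2 * channels // reduction, 2 * channels), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, global_feat, local_feat):
        x = torch.cat([global_feat, local_feat], dim=1)   # (B, 2C, H, W)
        w = self.gate(x)[:, :, None, None]                # per-channel weights
        return self.proj(x * w)                           # (B, C, H, W)


if __name__ == "__main__":
    cfm = ChannelFusionModule()
    out = cfm(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```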
Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method
ACM Multimedia Asia Pub Date: 2021-12-01 DOI: 10.1145/3469877.3495644
Ayaka Ideno, Yusuke Mukuta, Tatsuya Harada
{"title":"Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method","authors":"Ayaka Ideno, Yusuke Mukuta, Tatsuya Harada","doi":"10.1145/3469877.3495644","DOIUrl":"https://doi.org/10.1145/3469877.3495644","url":null,"abstract":"This study is aimed at finding a suitable method for generating time-series data such as video clips or avatar motions from text stating multiple events. This paper addresses the generation of variable-length time-series data considering the order and variable duration of events stated in the text. Although the use of the variant of Mean Squared Error (MSE) is a common means of training, only the gap between the element of ground-truth (GT) data and generated data at the same time are considered. Thus, variants of MSE are unsuitable for the task at hand because the loss may not be small for the generated and GT data with the same order of events if the time for each event does not overlap. To solve the problem, we propose a Dynamic Time Warping-Like method for Variable-Length data (DTWL-VL), which determines the corresponding elements of the GT and the generated data, allowing for the time difference between them, and makes them closer. We compared DTWL-VL, a variant of MSE, and an existing method for time-series data generation which considers the time difference between the corresponding part in the GT and generated data. Since the existing method is aimed at generating fixed-length data, we extend the method for generating variable-length time-series data. We conducted experiments using a dataset prepared for this study. Both DTWL-VL and the existing methods outperformed the MSE variant. Moreover, although the existing method outperformed DTWL-VL under certain settings, DTWL-VL required a smaller training period.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124849591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
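The DTWL-VL idea above, finding corresponding elements between generated and ground-truth sequences while allowing time differences and then pulling them together, can be sketched with a plain DTW alignment computed on detached costs, followed by a squared error over the aligned pairs. This is a simplified stand-in, not the paper's exact formulation.

```python
import torch

def dtwl_vl_loss(generated, target):
    """DTW-like loss for variable-length sequences (illustrative sketch).

    generated: (T1, D) tensor requiring grad; target: (T2, D) tensor.
    A monotonic alignment is found by dynamic programming on detached pairwise
    costs; the loss is the mean squared distance over the aligned pairs.
    """
    cost = torch.cdist(generated, target).pow(2)           # (T1, T2) local costs
    t1, t2 = cost.shape
    with torch.no_grad():                                   # DP table for the path only
        acc = torch.full((t1 + 1, t2 + 1), float("inf"))
        acc[0, 0] = 0.0
        for i in range(1, t1 + 1):
            for j in range(1, t2 + 1):
                acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j - 1],
                                                     acc[i - 1, j], acc[i, j - 1])
        # Backtrack the optimal warping path.
        path, i, j = [], t1, t2
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = min((acc[i - 1, j - 1], (i - 1, j - 1)),
                       (acc[i - 1, j], (i - 1, j)),
                       (acc[i, j - 1], (i, j - 1)))[1]
            i, j = step
    aligned = torch.stack([cost[a, b] for a, b in path])
    return aligned.mean()


if __name__ == "__main__":
    gen = torch.randn(12, 8, requires_grad=True)   # generated sequence, length 12
    gt = torch.randn(20, 8)                        # ground truth, length 20
    loss = dtwl_vl_loss(gen, gt)
    loss.backward()
    print(loss.item(), gen.grad.shape)
```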