{"title":"Image-Signal Correlation Network for Textile Fiber Identification","authors":"Bo Peng, Liren He, Yining Qiu, Dong Wu, M. Chi","doi":"10.1145/3503161.3548310","DOIUrl":"https://doi.org/10.1145/3503161.3548310","url":null,"abstract":"Identifying fiber compositions is an important aspect of the textile industry. In recent decades, near-infrared spectroscopy has shown its potential in the automatic detection of fiber components. However, for plant fibers such as cotton and linen, the chemical compositions are the same and thus the absorption spectra are very similar, leading to the problem of \"different materials with the same spectrum, whereas the same material with different spectrums\" and it is difficult using a single mode of NIR signals to capture the effective features to distinguish these fibers. To solve this problem, textile experts under a microscope measure the cross-sectional or longitudinal characteristics of fibers to determine fiber contents with a destructive way. In this paper, we construct the first NIR signal-microscope image textile fiber composition dataset (NIRITFC). Based on the NIRITFC dataset, we propose an image-signal correlation network (ISiC-Net) and design image-signal correlation perception and image-signal correlation attention modules, respectively, to effectively integrate the visual features (esp. local texture details of fibers) with the finer absorption spectrum information of the NIR signal to capture the deep abstract features of bimodal data for nondestructive textile fiber identification. To better learn the spectral characteristics of the fiber components, the endmember vectors of the corresponding fibers are generated by embedding encoding, and the reconstruction loss is designed to guide the model to reconstruct the NIR signals of the corresponding fiber components by a nonlinear mapping. The quantitative and qualitative results are significantly improved compared to both single and bimodal approaches, indicating the great potential of combining microscopic images and NIR signals for textile fiber composition identification.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115315768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FME '22: 2nd Workshop on Facial Micro-Expression: Advanced Techniques for Multi-Modal Facial Expression Analysis","authors":"Jingting Li, Moi Hoon Yap, Wen-Huang Cheng, John See, Xiaopeng Hong, Xiabai Li, Su-Jing Wang","doi":"10.1145/3503161.3554777","DOIUrl":"https://doi.org/10.1145/3503161.3554777","url":null,"abstract":"Micro-expressions are facial movements that are extremely short and not easily detected, which often reflect the genuine emotions of individuals. Micro-expressions are important cues for understanding real human emotions and can be used for non-contact non-perceptual deception detection, or abnormal emotion recognition. It has broad application prospects in national security, judicial practice, health prevention, clinical practice, etc. However, micro-expression feature extraction and learning are highly challenging because micro-expressions have the characteristics of short duration, low intensity, and local asymmetry. In addition, the intelligent micro-expression analysis combined with deep learning technology is also plagued by the problem of small samples. Not only is micro-expression elicitation very difficult, micro-expression annotation is also very time-consuming and laborious. More importantly, the micro-expression generation mechanism is not yet clear, which shackles the application of micro-expressions in real scenarios. FME'22 is the inaugural workshop in this area of research, with the aim of promoting interactions between researchers and scholars from within this niche area of research and also including those from broader, general areas of expression and psychology research. The complete FME'22 workshop proceedings are available at: https://dl.acm.org/doi/proceedings/10.1145/3552465.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115659629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Semi-Supervised Learning with Cross-Modal Knowledge","authors":"Hui Zhu, Yongchun Lü, Hongbin Wang, Xunyi Zhou, Qin Ma, Yanhong Liu, Ning Jiang, Xinde Wei, Linchengxi Zeng, Xiaofang Zhao","doi":"10.1145/3503161.3548026","DOIUrl":"https://doi.org/10.1145/3503161.3548026","url":null,"abstract":"Semi-supervised learning (SSL), which leverages a small number of labeled data that rely on expert knowledge and a large number of easily accessible unlabeled data, has made rapid progress recently. However, the information comes from a single modality and the corresponding labels are in form of one-hot in pre-existing SSL approaches, which can easily lead to deficiency supervision, omission of information and unsatisfactory results, especially when more categories and less labeled samples are covered. In this paper, we propose a novel method to further enhance SSL by introducing semantic modal knowledge, which contains the word embeddings of class labels and the semantic hierarchy structure among classes. The former helps retain more potential information and almost quantitatively reflects the similarities and differences between categories. The later encourages the model to construct the classification edge from simple to complex, and thus improves the generalization ability of the model. Comprehensive experiments and ablation studies are conducted on commonly-used datasets to demonstrate the effectiveness of our method.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121797014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fine-grained Micro-Expression Generation based on Thin-Plate Spline and Relative AU Constraint","authors":"Sirui Zhao, Shukang Yin, Huaying Tang, Rijin Jin, Yifan Xu, Tong Xu, Enhong Chen","doi":"10.1145/3503161.3551597","DOIUrl":"https://doi.org/10.1145/3503161.3551597","url":null,"abstract":"As a typical psychological stress reaction, micro-expression (ME) is usually quickly leaked on a human face and can reveal the true feeling and emotional cognition. Therefore,automatic ME analysis (MEA) has essential applications in safety, clinical and other fields. However, the lack of adequate ME data has severely hindered MEA research. To overcome this dilemma and encouraged by current image generation techniques, this paper proposes a fine-grained ME generation method to enhance ME data in terms of data volume and diversity. Specifically, we first estimate non-linear ME motion using thin-plate spline transformation with a dense motion network. Then, the estimated ME motion transformations, including optical flow and occlusion masks, are sent to the generation network to synthesize the target facial micro-expression. In particular, we obtain the relative action units (AUs) of the source ME to the target face as a constraint to encourage the network to ignore expression-irrelevant movements, thereby generating fine-grained MEs. Through comparative experiments on CASME II, SMIC and SAMM datasets, we demonstrate the effectiveness and superiority of our method. Source code is provided in https://github.com/MEA-LAB-421/MEGC2022-Generation.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117171631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Geometric Warping Error Aware CNN for DIBR Oriented View Synthesis","authors":"Shuaifeng Li, Kaixin Wang, Yanbo Gao, Xun Cai, Mao Ye","doi":"10.1145/3503161.3547946","DOIUrl":"https://doi.org/10.1145/3503161.3547946","url":null,"abstract":"Depth Image based Rendering (DIBR) oriented view synthesis is an important virtual view generation technique. It warps the reference view images to the target viewpoint based on their depth maps, without requiring many available viewpoints. However, in the 3D warping process, pixels are warped to fractional pixel locations and then rounded (or interpolated) to integer pixels, resulting in geometric warping error and reducing the image quality. This resembles, to some extent, the image super-resolution problem, but with unfixed fractional pixel locations. To address this problem, we propose a geometric warping error aware CNN (GWEA) framework to enhance the DIBR oriented view synthesis. First, a deformable convolution based geometric warping error aware alignment (GWEA-DCA) module is developed, by taking advantage of the geometric warping error preserved in the DIBR module. The offset learned in the deformable convolution can account for the geometric warping error to facilitate the mapping from the fractional pixels to integer pixels. Moreover, in view that the pixels in the warped images are of different qualities due to the different strengths of warping errors, an attention enhanced view blending (GWEA-AttVB) module is further developed to adaptively fuse the pixels from different warped images. Finally, a partial convolution based hole filling and refinement module fills the remaining holes and improves the quality of the overall image. Experiments show that our model can synthesize higher-quality images than the existing methods, and ablation study is also conducted, validating the effectiveness of each proposed module.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117199228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relation-enhanced Negative Sampling for Multimodal Knowledge Graph Completion","authors":"Derong Xu, Tong Xu, Shiwei Wu, Jingbo Zhou, Enhong Chen","doi":"10.1145/3503161.3548388","DOIUrl":"https://doi.org/10.1145/3503161.3548388","url":null,"abstract":"Knowledge Graph Completion (KGC), aiming to infer the missing part of Knowledge Graphs (KGs), has long been treated as a crucial task to support downstream applications of KGs, especially for the multimodal KGs (MKGs) which suffer the incomplete relations due to the insufficient accumulation of multimodal corpus. Though a few research attentions have been paid to the completion task of MKGs, there is still a lack of specially designed negative sampling strategies tailored to MKGs. Meanwhile, though effective negative sampling strategies have been widely regarded as a crucial solution for KGC to alleviate the vanishing gradient problem, we realize that, there is a unique challenge for negative sampling in MKGs about how to model the effect of KG relations during learning the complementary semantics among multiple modalities as an extra context. In this case, traditional negative sampling techniques which only consider the structural knowledge may fail to deal with the multimodal KGC task. To that end, in this paper, we propose a MultiModal Relation-enhanced Negative Sampling (MMRNS) framework for multimodal KGC task. Especially, we design a novel knowledge-guided cross-modal attention (KCA) mechanism, which provides bi-directional attention for visual & textual features via integrating relation embedding. Then, an effective contrastive semantic sampler is devised after consolidating the KCA mechanism with contrastive learning. In this way, a more similar representation of semantic features between positive samples, as well as a more diverse representation between negative samples under different relations could be learned. Afterwards, a masked gumbel-softmax optimization mechanism is utilized for solving the non-differentiability of sampling process, which provides effective parameter optimization compared with traditional sample strategies. Extensive experiments on three multimodal KGs demonstrate that our MMRNS framework could significantly outperform the state-of-the-art baseline methods, which validates the effectiveness of relation guides in multimodal KGC task.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120895565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Alexa, let's work together! How Alexa Helps Customers Complete Tasks with Verbal and Visual Guidance in the Alexa Prize TaskBot Challenge","authors":"Y. Maarek","doi":"10.1145/3503161.3549912","DOIUrl":"https://doi.org/10.1145/3503161.3549912","url":null,"abstract":"In this talk, I will present the Alexa Prize TaskBot Challenge, which allows selected academic teams to develop TaskBots. TaskBots are agents that interact with Alexa users who require assistance (via \"Alexa, let's work together\") to complete everyday tasks requiring multiple steps and decisions, such as cooking and home improvement. One of the unique elements of this challenge is its multi-modal nature, where users receive both verbal guidance and visual instructions, when a screen is available (e.g., on Echo Show devices). Some of the hard AI challenges the teams addressed included leveraging domain knowledge, tacking dialogue state, supporting adaptive and robust conversations and probably the most relevant to this conference: handling multi-modal interactions.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120901252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Granular Semantic Mining for Weakly Supervised Semantic Segmentation","authors":"Meijie Zhang, Jianwu Li, Tianfei Zhou","doi":"10.1145/3503161.3547919","DOIUrl":"https://doi.org/10.1145/3503161.3547919","url":null,"abstract":"This paper solves the problem of learning image semantic segmentation using image-level supervision. The task is promising in terms of reducing annotation efforts, yet extremely challenging due to the difficulty to directly associate high-level concepts with low-level appearance. While current efforts handle each concept independently, we take a broader perspective to harvest implicit, holistic structures of semantic concepts, which express valuable prior knowledge for accurate concept grounding. This raises multi-granular semantic mining, a new formalism allowing flexible specification of complex relations in the label space. In particular, we propose a heterogeneous graph neural network (Hgnn) to model the heterogeneity of multi-granular semantics within a set of input images. The Hgnn consists of two types of sub-graphs: 1) an external graph characterizes the relations across different images to mine inter-image contexts; and for each image, 2) an internal graph is constructed to mine inter-class semantic dependencies within each individual image. Through heterogeneous graph learning, our Hgnn is able to land a comprehensive understanding of object patterns, leading to more accurate semantic concept grounding. Extensive experimental results show that Hgnn outperforms the current state-of-the-art approaches on the popular PASCAL VOC 2012 and COCO 2014 benchmarks. Our code is available at: https://github.com/maeve07/HGNN.git.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121093269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Level Spatiotemporal Network for Video Summarization","authors":"Mingyu Yao, Yu Bai, Wei Du, Xuejun Zhang, Heng Quan, Fuli Cai, Hongwei Kang","doi":"10.1145/3503161.3548105","DOIUrl":"https://doi.org/10.1145/3503161.3548105","url":null,"abstract":"With the increasing of ubiquitous devices with cameras, video content is widely produced in the industry. Automation video summarization allows content consumers effectively retrieve the moments that capture their primary attention. Existing supervised methods mainly focus on frame-level information. As a natural phenomenon, video fragments in different shots are richer in semantics than frames. We leverage this as a free latent supervision signal and introduce a novel model named multi-level spatiotemporal network (MLSN). Our approach contains Multi-Level Feature Representations (MLFR) and Local Relative Loss (LRL). MLFR module consists of frame-level features, fragment-level features, and shot-level features with relative position encoding. For videos of different shot durations, it can flexibly capture and accommodate semantic information of different spatiotemporal granularities; LRL utilizes the partial ordering relations among frames of each fragment to capture highly discriminative features to improve the sensitivity of the model. Our method substantially improves the best existing published method by 7% on our industrial products dataset LSVD. Meanwhile, experimental results on two widely used benchmark datasets SumMe and TVSum demonstrate that our method outperforms most state-of-the-art ones.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121226083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MVSPlenOctree: Fast and Generic Reconstruction of Radiance Fields in PlenOctree from Multi-view Stereo","authors":"Wenpeng Xing, Jie Chen","doi":"10.1145/3503161.3547795","DOIUrl":"https://doi.org/10.1145/3503161.3547795","url":null,"abstract":"We present MVSPlenOctree, a novel approach that can efficiently reconstruct radiance fields for view synthesis. Unlike previous scene-specific radiance fields reconstruction methods, we present a generic pipeline that can efficiently reconstruct 360-degree-renderable radiance fields via multi-view stereo (MVS) inference from tens of sparse-spread out images. Our approach leverages variance-based statistic features for MVS inference, and combines this with image based rendering and volume rendering for radiance field reconstruction. We first train a MVS Machine for reasoning scene's density and appearance. Then, based on the spatial hierarchy of the PlenOctree and coarse-to-fine dense sampling mechanism, we design a robust and efficient sampling strategy for PlenOctree reconstruction, which handles occlusion robustly. A 360-degree-renderable radiance fields can be reconstructed in PlenOctree from MVS Machine in an efficient single forward pass. We trained our method on real-world DTU, LLFF datasets, and synthetic datasets. We validate its generalizability by evaluating on the test set of DTU dataset which are unseen in training. In summary, our radiance field reconstruction method is both efficient and generic, a coarse 360-degree-renderable radiance field can be reconstructed in seconds and a dense one within minutes. Please visit the project page for more details: https://derry-xing.github.io/projects/MVSPlenOctree.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127085150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}