Title: RF-based Multi-view Pose Machine for Multi-Person 3D Pose Estimation
Authors: Chunyang Xie, Dongheng Zhang, Zhi Wu, Cong Yu, Yang Hu, Qibin Sun, Yan Chen
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00454
Abstract: In this paper, we present the RF-based Multi-view Pose machine (RF-MvP) for multi-person 3D pose estimation using RF signals. Specifically, we first develop a lightweight anchor-free detector module to locate and crop regions of interest from horizontal and vertical RF signals. Afterward, we propose a Multi-view Fusion Network to unproject the RF signals from the horizontal and vertical millimeter-wave radars into a unified latent space and then calculate their correlation for weighted fusion. Finally, a Spatio-Temporal Attention Network is designed to reconstruct the multi-person 3D skeleton sequences, in which the spatial attention module recovers invisible body parts using non-local correlations among joints, and the temporal attention module refines the 3D pose sequences using temporal coherency learned from frame queries. We evaluate the proposed RF-MvP against state-of-the-art methods on a large-scale dataset with multi-person 3D pose labels and corresponding radar signals. The experimental results show that RF-MvP outperforms all of the baseline methods, locating multi-person 3D keypoints with an average error of 73 mm and generalizing well to new conditions such as occlusion and low illumination.
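
The correlation-based weighted fusion described above can be pictured with a small PyTorch sketch: two already-encoded radar views are projected into a shared latent space and blended location-by-location according to their cosine correlation. Module names, dimensions, and the sigmoid weighting are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of correlation-weighted fusion of two radar views, assuming
# both views have already been encoded into feature maps of the same shape.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiViewFusion(nn.Module):
    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        # Separate projections map each view into a shared latent space.
        self.proj_h = nn.Conv2d(in_dim, latent_dim, kernel_size=1)
        self.proj_v = nn.Conv2d(in_dim, latent_dim, kernel_size=1)

    def forward(self, feat_h: torch.Tensor, feat_v: torch.Tensor) -> torch.Tensor:
        zh, zv = self.proj_h(feat_h), self.proj_v(feat_v)      # (B, C, H, W) each
        # Per-location correlation between the two views (cosine similarity).
        corr = F.cosine_similarity(zh, zv, dim=1, eps=1e-6)    # (B, H, W)
        w = torch.sigmoid(corr).unsqueeze(1)                   # fusion weight in (0, 1)
        return w * zh + (1.0 - w) * zv                         # weighted fusion

fused = MultiViewFusion(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(fused.shape)  # torch.Size([2, 128, 32, 32])
```
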
Title: Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition
Authors: Jinxin Wang, Zhongwen Guo, Chao Yang, Xiaomei Li, Ziyuan Cui
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00116
Abstract: Compared to feature or decision fusion alone, hybrid fusion can further improve audio-visual speech recognition accuracy. Existing works mainly focus on designing the multi-modality feature extraction, interaction, and prediction processes, neglecting useful cross-modality information and the optimal combination of different prediction results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. MSHF consists of a feature extraction subnetwork, which uses the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features, and a hybrid fusion subnetwork, which integrates the intrinsic correlations of the different modalities and optimizes the weights of the per-modality prediction results to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset. The experimental results show that the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of our proposed modules.
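
As a rough illustration of hybrid fusion, the sketch below combines feature-level fusion with a learned weighting over three prediction heads (audio-only, visual-only, fused). This is one plausible reading of "optimizing the weights of prediction results for different modalities"; the layer sizes and head structure are assumptions, not the paper's MSFE/FRM design.

```python
# A minimal sketch of hybrid fusion, assuming audio and visual features have
# already been extracted by upstream encoders.
import torch
import torch.nn as nn

class HybridFusionHead(nn.Module):
    def __init__(self, dim_a: int, dim_v: int, num_classes: int):
        super().__init__()
        self.head_a = nn.Linear(dim_a, num_classes)            # decision from audio
        self.head_v = nn.Linear(dim_v, num_classes)            # decision from video
        self.head_f = nn.Linear(dim_a + dim_v, num_classes)    # decision from fused features
        self.weights = nn.Parameter(torch.zeros(3))            # learnable combination weights

    def forward(self, feat_a: torch.Tensor, feat_v: torch.Tensor) -> torch.Tensor:
        logits = torch.stack(
            [self.head_a(feat_a), self.head_v(feat_v),
             self.head_f(torch.cat([feat_a, feat_v], dim=-1))], dim=0)   # (3, B, C)
        w = torch.softmax(self.weights, dim=0).view(3, 1, 1)   # normalized decision weights
        return (w * logits).sum(dim=0)                          # weighted decision fusion

out = HybridFusionHead(256, 512, 1000)(torch.randn(4, 256), torch.randn(4, 512))
print(out.shape)  # torch.Size([4, 1000])
```
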
Title: Unsupervised Fashion Style Learning by Solving Fashion Jigsaw Puzzles
Authors: Jia Chen, Haidongqing Yuan, Fei Fang, Tao Peng, X. Hu
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00317
Abstract: Fashion style learning is the basis for many tasks in fashion AI, such as clothing recommendation, fashion trend analysis, and popularity prediction. Most existing methods rely on the quality and quantity of annotations. This paper proposes an efficient two-step unsupervised fashion style learning framework built on a "Fashion Jigsaw" task and a centroid-based density clustering algorithm. First, we design the "Fashion Jigsaw" unsupervised learning task according to the distribution of fashion elements in full-body fashion images. By splitting and recovering fashion images, we pre-train a model that can extract both intra-image and inter-image information. Second, we propose a centroid-based density clustering algorithm and introduce the concept of a "centroid" to cluster fashion image features and represent fashion styles. Meanwhile, we keep the noise features to discover newly emerging fashion styles. Experimental results demonstrate the effectiveness of the proposed method.
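
The "Fashion Jigsaw" pretext task follows the standard jigsaw self-supervision recipe: cut an image into a grid of tiles, shuffle them with a permutation drawn from a fixed set, and train a network to recover the permutation index. The sketch below shows only the sample construction; the 3x3 grid and the size of the permutation set are illustrative choices, not the paper's settings.

```python
# A minimal sketch of building a jigsaw pretext-task sample.
import random
import torch

def make_jigsaw_sample(image: torch.Tensor, permutations: list[tuple[int, ...]]):
    """image: (C, H, W) with H and W divisible by 3."""
    c, h, w = image.shape
    th, tw = h // 3, w // 3
    # Cut the image into 9 tiles in row-major order.
    tiles = [image[:, i * th:(i + 1) * th, j * tw:(j + 1) * tw]
             for i in range(3) for j in range(3)]
    label = random.randrange(len(permutations))          # which permutation was used
    shuffled = [tiles[k] for k in permutations[label]]   # reorder the tiles
    return torch.stack(shuffled), label                  # (9, C, th, tw), int

perms = [tuple(torch.randperm(9).tolist()) for _ in range(100)]  # fixed permutation set
tiles, label = make_jigsaw_sample(torch.randn(3, 96, 96), perms)
print(tiles.shape, label)  # torch.Size([9, 3, 32, 32]) <permutation index>
```
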
Title: Edge-Aware Mirror Network for Camouflaged Object Detection
Authors: Dongyue Sun, Shiyao Jiang, Lin Qi
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00420
Abstract: Existing edge-aware camouflaged object detection (COD) methods normally output the edge prediction at an early stage. However, edges are important and fundamental cues for the subsequent segmentation task. Due to the high visual similarity between camouflaged targets and their surroundings, an edge prior predicted at an early stage usually introduces erroneous foreground-background assignments and contaminates the features used for segmentation. To tackle this problem, we propose a novel Edge-aware Mirror Network (EAMNet), which models edge detection and camouflaged object segmentation as a cross-refinement process. More specifically, EAMNet has a two-branch architecture, in which a segmentation-induced edge aggregation module and an edge-induced integrity aggregation module cross-guide the segmentation branch and the edge detection branch. Finally, a guided-residual channel attention module, which leverages residual connections and gated convolution, better extracts structural details from low-level features. Quantitative and qualitative experimental results show that EAMNet outperforms existing cutting-edge baselines on three widely used COD datasets. Code is available at https://github.com/sdy1999/EAMNet.
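
One plausible reading of the guided-residual channel attention module is a gated convolution followed by squeeze-and-excitation style channel attention and a residual connection, as sketched below; the exact layer arrangement is an assumption, not the released EAMNet code.

```python
# A minimal sketch of a gated convolution combined with channel attention and
# a residual connection; layer sizes are illustrative.
import torch
import torch.nn as nn

class GatedResidualChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.feat = nn.Conv2d(channels, channels, 3, padding=1)   # content branch
        self.gate = nn.Conv2d(channels, channels, 3, padding=1)   # gating branch
        self.se = nn.Sequential(                                   # squeeze-and-excitation style
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gated = self.feat(x) * torch.sigmoid(self.gate(x))   # gated convolution
        attended = gated * self.se(gated)                     # channel attention
        return x + attended                                    # residual connection

y = GatedResidualChannelAttention(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```
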
Title: 2S-DFN: Dual-semantic Decoding Fusion Networks for Fine-grained Image Recognition
Authors: Pufen Zhang, Peng Shi, Song Zhang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/icme55011.2023.00012
Abstract: In previous fine-grained image recognition (FGIR) methods, a single global or local semantic fusion view may not comprehensively reveal the semantic associations between image and text. In addition, an encoding fusion strategy cannot fuse the semantics finely, because low-order text semantic dependencies and irrelevant semantic concepts are fused. To address these issues, a novel Dual-Semantic Decoding Fusion Network (2S-DFN) is proposed for FGIR. Specifically, a multilayer text semantic encoder is first constructed to extract higher-order semantic dependencies within the text. To obtain sufficient semantic associations, two decoding semantic fusion streams are symmetrically designed from the global and local perspectives. Moreover, by implanting text features into the semantic fusion layers in a decoding manner and cascading them deeply, the two streams finely fuse the semantics of text and image. Extensive experiments demonstrate the effectiveness of the proposed method, and 2S-DFN attains state-of-the-art results on two benchmark datasets.
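
The "decoding manner" of implanting text features can be read as cross-attention in which image tokens query text tokens, so text semantics flow into the visual stream at each fusion layer. The sketch below shows one such layer; dimensions and the feed-forward design are illustrative, not the paper's exact architecture.

```python
# A minimal sketch of decoding-style fusion: image tokens act as queries and
# text tokens as keys/values in cross-attention.
import torch
import torch.nn as nn

class DecodingFusionLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # Image tokens query the text tokens; the attended text is added back.
        attended, _ = self.cross_attn(img_tokens, text_tokens, text_tokens)
        x = self.norm1(img_tokens + attended)
        return self.norm2(x + self.ffn(x))

fused = DecodingFusionLayer()(torch.randn(2, 49, 256), torch.randn(2, 16, 256))
print(fused.shape)  # torch.Size([2, 49, 256])
```
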
Title: Early Diagnosis of Alzheimer's Disease Based on Multimodal Hypergraph Attention Network
Authors: Yi Li, Baoyao Yang, Dan Pan, An Zeng, Long Wu, Yang Yang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00041
Abstract: Alzheimer's disease (AD) is a typical neurodegenerative disease involving multiple pathogenic factors. Early detection is the key to effective treatment of AD. However, most methods are developed on data from a single modality and ignore the relationships among subjects. In machine learning, hypergraphs can be used to express relationships among objects. In light of this, a framework for early diagnosis of Alzheimer's disease based on a multimodal hypergraph attention network is proposed in this paper. Specifically, we combine multimodal features to construct a cross-modal hypergraph, which represents the high-order structural relationships among subjects. Finally, a hypergraph attention network is used to fuse the hypergraphs and perform the final classification. Our experimental results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) database show that the proposed method achieves better classification performance than the most advanced methods.
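
To make the hypergraph idea concrete, the sketch below builds k-NN hyperedges over subject features and applies one plain HGNN-style hypergraph convolution. It is a generic illustration: the paper's attention-based hypergraph fusion and its cross-modal construction are not reproduced here.

```python
# A minimal sketch of a k-NN hypergraph plus one hypergraph convolution.
import torch
import torch.nn as nn

def knn_incidence(feats: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Each subject spawns one hyperedge containing its k nearest neighbours."""
    dist = torch.cdist(feats, feats)                   # (N, N) pairwise distances
    idx = dist.topk(k, largest=False).indices          # k nearest neighbours (incl. self)
    H = torch.zeros(feats.size(0), feats.size(0))      # (num_nodes, num_hyperedges)
    H.scatter_(0, idx.t(), 1.0)                        # node idx[j, a] belongs to hyperedge j
    return H

class HypergraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.theta = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        Dv = H.sum(dim=1).clamp(min=1)                 # node degrees
        De = H.sum(dim=0).clamp(min=1)                 # hyperedge degrees
        # X' = Dv^-1/2 H De^-1 H^T Dv^-1/2 X Theta (uniform hyperedge weights)
        x = self.theta(x)
        x = H.t() @ (x / Dv.sqrt().unsqueeze(1))
        x = H @ (x / De.unsqueeze(1))
        return x / Dv.sqrt().unsqueeze(1)

feats = torch.randn(100, 64)                           # e.g. concatenated multimodal features
out = HypergraphConv(64, 32)(feats, knn_incidence(feats))
print(out.shape)  # torch.Size([100, 32])
```
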
Title: Be-or-Not Prompt Enhanced Hard Negatives Generating For Memes Category Detection
Authors: Jian Cui, Lin Li, Xiaohui Tao
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00038
Abstract: Memes are one of the most popular media in online disinformation campaigns. Their creators often use a variety of rhetorical and psychological techniques to misinform audiences. These characteristics lead to unsatisfactory performance on meme category detection tasks, such as predicting propaganda techniques or whether a meme is harmful. To this end, we propose a novel meme category detection model via Be-or-Not Prompt Enhanced hard Negatives generating (BNPEN). First, BNPEN reformulates category detection as a contrastive learning-based image-text matching (ITM) task through category-padded prompt engineering. Second, we design be-or-not prompt templates that keep the writing style of memes and create hard negative image-text pairs. Finally, our negative generation alleviates the negative-positive-coupling (NPC) effect in contrastive learning, thus improving image-text matching quality. Experimental results on two public datasets show that BNPEN outperforms off-the-shelf multi-modal learning models in terms of F1 and accuracy.
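
The be-or-not idea can be sketched directly: for each category, the affirmative prompt forms the matching caption and its negated counterpart forms a hard negative, and the two are scored contrastively against the image embedding. Prompt wording, encoders, and the temperature below are placeholders, not the paper's templates or loss.

```python
# A minimal sketch of be-or-not prompt pairs and a contrastive ITM loss over
# them; the embeddings stand in for outputs of unspecified encoders.
import torch
import torch.nn.functional as F

def be_or_not_prompts(meme_text: str, category: str) -> tuple[str, str]:
    positive = f"{meme_text} This meme is {category}."       # matching caption
    negative = f"{meme_text} This meme is not {category}."   # be-or-not hard negative
    return positive, negative

def itm_contrastive_loss(img_emb: torch.Tensor,
                         pos_emb: torch.Tensor,
                         neg_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """img_emb/pos_emb/neg_emb: (B, D) L2-normalized embeddings."""
    pos_sim = (img_emb * pos_emb).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = (img_emb * neg_emb).sum(-1, keepdim=True)          # (B, 1)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature  # the positive must win
    return F.cross_entropy(logits, torch.zeros(img_emb.size(0), dtype=torch.long))

print(be_or_not_prompts("when the wifi drops...", "harmful"))
loss = itm_contrastive_loss(F.normalize(torch.randn(8, 256), dim=-1),
                            F.normalize(torch.randn(8, 256), dim=-1),
                            F.normalize(torch.randn(8, 256), dim=-1))
print(loss.item())
```
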
Title: SQT: Debiased Visual Question Answering via Shuffling Question Types
Authors: Tianyu Huai, Shuwen Yang, Junhang Zhang, Guoan Wang, Xinru Yu, Tianlong Ma, Liang He
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00109
Abstract: Visual Question Answering (VQA) aims to obtain answers from image-question pairs. Current VQA models tend to derive answers from the questions alone, ignoring the information in the images. This phenomenon is caused by bias. As indicated by previous studies, the bias in VQA mainly comes from the text modality. Our analysis suggests that the question type is a crucial factor in bias formation. To break the shortcut from question type to answer and thereby de-bias the model, we propose a self-supervised method of Shuffling Question Types (SQT) that reduces bias from the text modality, overcoming the language prior problem by mitigating the question-to-answer bias without introducing external annotations. Moreover, we propose a new objective function for negative samples. Experimental results show that our approach achieves 61.76% accuracy on the VQA-CP v2 dataset, outperforming the state of the art among both self-supervised and supervised methods.
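
Shuffling question types can be approximated by permuting the leading question-type words across a batch so that the type no longer predicts the answer, producing self-supervised negative samples. The two-word prefix heuristic in the sketch is an assumption; the paper's question-type extraction may differ.

```python
# A minimal sketch of shuffling question-type prefixes within a batch.
import random

def shuffle_question_types(questions: list[str], prefix_len: int = 2) -> list[str]:
    prefixes = [" ".join(q.split()[:prefix_len]) for q in questions]
    bodies = [" ".join(q.split()[prefix_len:]) for q in questions]
    shuffled = prefixes[:]
    random.shuffle(shuffled)                      # permute the question types
    return [f"{p} {b}".strip() for p, b in zip(shuffled, bodies)]

batch = ["what color is the umbrella?",
         "how many people are on the bus?",
         "is there a dog in the picture?"]
print(shuffle_question_types(batch))              # type/body mismatch -> negative samples
```
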
Title: ATENet: Adaptive Tiny-Object Enhanced Network for Polyp Segmentation
Authors: Xiaogang Du, Yinghao Wu, Tao Lei, Dongxin Gu, Yinyin Nie, A. Nandi
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00389
Abstract: Polyp segmentation is of great importance for the diagnosis and treatment of colorectal cancer. However, it is difficult to segment polyps accurately due to the large number of tiny polyps and the low contrast between polyps and the surrounding mucosa. To address this issue, we design an Adaptive Tiny-object Enhanced Network (ATENet) for tiny polyp segmentation. The proposed ATENet has two advantages. First, we design an adaptive tiny-object encoder containing three parallel branches, which can effectively extract the shape and position features of tiny polyps and thus improve their segmentation accuracy. Second, we design a simple enhanced feature decoder, which not only suppresses the background noise in feature maps but also supplements detail information to further improve polyp segmentation accuracy. Extensive experiments on three benchmark datasets demonstrate that the proposed ATENet achieves state-of-the-art performance while maintaining low computational complexity.
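
A three-branch block with different receptive fields and learned branch weights is one way to picture the adaptive tiny-object encoder, as sketched below; the kernel choices and softmax weighting are illustrative, not the paper's exact branches.

```python
# A minimal sketch of a three-branch encoder block with adaptive branch weights.
import torch
import torch.nn as nn

class ParallelBranchBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.branch1 = nn.Conv2d(channels, channels, 1)                          # point-wise cues
        self.branch2 = nn.Conv2d(channels, channels, 3, padding=1)               # local shape cues
        self.branch3 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)   # wider context
        self.weights = nn.Parameter(torch.ones(3))                               # adaptive branch weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.weights, dim=0)
        out = w[0] * self.branch1(x) + w[1] * self.branch2(x) + w[2] * self.branch3(x)
        return out + x                                                            # residual keeps tiny details

y = ParallelBranchBlock(32)(torch.randn(1, 32, 88, 88))
print(y.shape)  # torch.Size([1, 32, 88, 88])
```
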
Title: ADATS: Adaptive RoI-Align based Transformer for End-to-End Text Spotting
Authors: Zepeng Huang, Qi Wan, Junliang Chen, Xiaodong Zhao, Kai Ye, Linlin Shen
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00243
Abstract: Scene text spotting has attracted great attention in recent years. Two-stage approaches locate scene text in the first stage and recognize it in the second, but the advantages of jointly training localization and recognition have not been fully explored. In this paper, we present an ADaptive RoI-Align based Transformer for end-to-end Text Spotting (ADATS), which simultaneously locates and recognizes text in a single forward pass. By employing an Adaptive RoI-Align, text features are extracted from the feature extraction network at their original aspect ratio, so that less information is lost when aligning arbitrarily shaped scene text. Attention-based segmentation and recognition heads allow us to optimize detection and recognition simultaneously. Experiments on ICDAR 2015, MSRA-TD500, Total-Text, and CTW1500 demonstrate the effectiveness of our method.
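
Aspect-ratio-preserving RoI extraction can be approximated with torchvision's roi_align by choosing the output width per box so that wide text lines are not squeezed into a fixed square grid. The fixed height, width cap, and feature stride below are illustrative settings, not the paper's.

```python
# A minimal sketch of aspect-ratio-aware RoI feature extraction.
import torch
from torchvision.ops import roi_align

def adaptive_roi_align(features: torch.Tensor, box: torch.Tensor,
                       out_h: int = 8, max_w: int = 64,
                       spatial_scale: float = 0.25) -> torch.Tensor:
    """features: (1, C, H, W); box: (4,) as (x1, y1, x2, y2) in image coordinates."""
    w = (box[2] - box[0]).clamp(min=1.0)
    h = (box[3] - box[1]).clamp(min=1.0)
    out_w = int((out_h * w / h).clamp(max=max_w).item())      # keep the box aspect ratio
    rois = torch.cat([torch.zeros(1), box]).unsqueeze(0)      # (1, 5): batch index + box
    return roi_align(features, rois, output_size=(out_h, max(out_w, 1)),
                     spatial_scale=spatial_scale, aligned=True)

feat = torch.randn(1, 256, 64, 64)                            # stride-4 feature map of a 256x256 image
box = torch.tensor([10.0, 40.0, 200.0, 70.0])                 # a wide text instance
print(adaptive_roi_align(feat, box).shape)                    # torch.Size([1, 256, 8, 50])
```
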