Title: ERPG: Enhancing Entity Representations with Prompt Guidance for Complex Named Entity Recognition
Authors: Xingyu Zhu, Feifei Dai, Xiaoyan Gu, Haihui Fan, B. Li, Weiping Wang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00478 (https://doi.org/10.1109/ICME55011.2023.00478)
Abstract: Recently, sequence generation methods have been widely used for complex named entity recognition. By selecting highly related tokens to generate complex named entities, these methods have achieved notable results. However, because they lack guidance in learning the output format and ignore labels when obtaining features, sequence generation methods suffer from invalid outputs and inaccurate recognition. To address this, we propose ERPG, a method that enhances entity representations with prompt guidance. Specifically, to reduce invalid outputs, we design a candidate entity generation module that generates candidate entities and their labels in the expected format. In addition, to recognize candidate entities accurately, we propose a candidate entity refinement module that obtains distinguishable candidate entity representations and filters them reliably. With these components, our method outperforms baselines by 1.20, 1.62, and 0.69 F1 points on the ACE2004, GENIA, and CADEC corpora, demonstrating its effectiveness for complex named entity recognition.
Title: Variational Information Bottleneck for Cross Domain Object Detection
Authors: Jiangming Chen, Wanxia Deng, Bo Peng, Tianpeng Liu, Yingmei Wei, Li Liu
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00381 (https://doi.org/10.1109/ICME55011.2023.00381)
Abstract: Cross domain object detection leverages a labeled source domain to learn an object detector that performs well on a novel, unlabeled target domain. To alleviate the domain discrepancy, most existing works align distributions using knowledge from the entire image, ignoring the interference of task-uncorrelated information. To tackle this issue, we propose a novel information-theoretic module called Variational Instance Disentanglement (VID), which decouples task-correlated information while filtering out task-uncorrelated factors at the instance level. Notably, the proposed VID can be used as a plug-and-play module without extra network parameter cost. We combine it with an adversarial network and a self-training network, forming the Variational Instance Disentanglement Adversarial Network (VIDAN) and the Variational Instance Disentanglement Self-training Network (VIDSN), respectively. Extensive experiments on multiple widely used scenarios show that the proposed method improves the performance of popular frameworks and outperforms state-of-the-art methods.
Title: Image Compressed Sensing Using Multi-Scale Characteristic Residual Learning
Authors: Shumian Yang, Xinxin Xiang, Fenghua Tong, Dawei Zhao, Xin Li
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00275 (https://doi.org/10.1109/ICME55011.2023.00275)
Abstract: Deep network-based image compressed sensing (CS) methods have attracted much attention in recent years due to their low reconstruction complexity and high reconstruction quality. However, existing methods usually use one or more convolution layers built from kernels of a single size to extract image features during sampling, which results in incomplete feature extraction. In addition, existing models usually focus on extracting deep features during reconstruction while ignoring the influence of shallow features. To overcome these issues, this paper proposes a multi-scale characteristic residual learning network (dubbed MSCRLNet) for image CS. In this network, convolutional kernels of different sizes capture multi-level spatial features during sampling, and a multi-scale residual network with channel attention speeds up convergence during reconstruction. Experiments show that the proposed MSCRLNet outperforms many existing state-of-the-art methods.
Title: Learning High Frequency Surface Functions In Shells
Authors: Han Guo, Yuanlong Yu, Yujie Wang, Xuelin Chen, Yixin Zhuang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00112 (https://doi.org/10.1109/ICME55011.2023.00112)
Abstract: Recently, coordinate-based MLPs have been shown to be powerful representations for 3D surfaces, where learning high-frequency details is facilitated by modulating surface functions with periodic functions [1], [2]. While shortening the period helps in learning high frequencies, it increases ambiguity: more points along the axis directions become similar in the embedded space, so many points on and off the surface receive similar predictions. In addition, a short period increases local geometric variation, leading to unexpected noisy artifacts in untrained regions. Unlike existing methods that learn surface functions in a regular cube, we learn surfaces within shells, a coarse form of the target surfaces constructed by a binary classifier. The advantage of building surfaces in shells is that the MLPs focus on regions of interest, which inherently reduces ambiguity and also improves training efficiency and test accuracy. We demonstrate the effectiveness of shells and show significant improvements over baseline methods in 3D surface reconstruction from raw point clouds.
{"title":"DSP-Net: Diverse Structure Prior Network for Image Inpainting","authors":"Lin Sun, Chao Yang, Bin Jiang","doi":"10.1109/ICME55011.2023.00088","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00088","url":null,"abstract":"The latest deep learning-based approaches have advanced diverse image inpainting task. However, existing methods limit to be aware of the structure information well, which constricts the performance of diverse generations. The intuitive representation of diversity generation is the structure change since the structure is the basis of the image. In this paper, we make full use of the structure information and propose the diverse structure prior network (DSP-Net). Specifically, there are two stages in DSP-Net to generate the diverse structure first and refine the texture next. For the diverse structure generation, we prompt the structural distribution to be similar to the Gaussian distribution to sample the diverse structural prior. With these priors, we refine the texture with a proposed propagation attention module. Meanwhile, we propose a structure diversity loss to enhance the ability of diverse structure generation further. Experiments on benchmark datasets including CelebA-HQ and Places2 indicate that DSP-Net is effective for diverse and visually realistic image restoration.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114733119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Modality-Fusion Spiking Transformer Network for Audio-Visual Zero-Shot Learning
Authors: Wenrui Li, Zhengyu Ma, Liang-Jian Deng, Hengyu Man, Xiaopeng Fan
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00080 (https://doi.org/10.1109/ICME55011.2023.00080)
Abstract: Audio-visual zero-shot learning (ZSL), which learns to classify video data from classes not observed during training, is challenging. In audio-visual ZSL, the semantic and temporal information of the different modalities are relevant to each other, yet effectively extracting and fusing information from the audio and visual modalities remains an open challenge. In this work, we propose an Audio-Visual Modality-fusion Spiking Transformer network (AVMST) for audio-visual ZSL. More specifically, AVMST provides a spiking neural network (SNN) module for extracting salient temporal information from each modality, a cross-attention block to effectively fuse the temporal and semantic information, and a transformer reasoning module to further explore the interrelationships of the fused features. To provide robust temporal features, the spiking threshold of the SNN module is adjusted dynamically based on the semantic cues of the different modalities. The generated feature map suits the zero-shot setting thanks to the proposed spiking transformer's ability to combine the robustness of SNN feature extraction with the precision of transformer feature inference. Extensive experiments on three benchmark audio-visual datasets (VGGSound, UCF and ActivityNet) validate that the proposed AVMST outperforms existing state-of-the-art methods by a significant margin. The code and pre-trained models are available at https://github.com/liwr-hit/ICME23_AVMST.
{"title":"RASNet: A Reinforcement Assistant Network for Frame Selection in Video-based Posture Recognition","authors":"Ruotong Hu, Xianzhi Wang, Xiaojun Chang, Yeqi Hu, Xiaowei Xin, Xiangqian Ding, Baoqi Guo","doi":"10.1109/ICME55011.2023.00366","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00366","url":null,"abstract":"Most existing video-based posture recognition methods treat frames equally using unified or random sampling strategies, thus losing the temporal relationship information among frames. To address this problem, we propose a lightweight framework, namely RASNet, to adaptively select informative frames for recognition. Specifically, we design a video-suited exploration environment to guide the agent in learning the selection strategy. We introduce the reparametrization method to convert the discrete action space into a continuous space, making the agent robust and random. For the reward part, we design a multi-factor function to reward the agent keeping a balance between frame usage and accuracy. Extensive experiments on three large-scale datasets prove the effectiveness of RASNet, e.g., achieving 85.9% accuracy with fewer 1.15 frames than other state-of-the-art methods on Kinetics 600.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116985538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DiST-GAN: Distillation-based Semantic Transfer for Text-Guided Face Generation","authors":"Guoxing Yang, Feifei Fu, Nanyi Fei, Hao Wu, Ruitao Ma, Zhiwu Lu","doi":"10.1109/ICME55011.2023.00149","DOIUrl":"https://doi.org/10.1109/ICME55011.2023.00149","url":null,"abstract":"Recently, large-scale pre-training has achieved great success in multi-modal tasks and shown powerful generalization ability due to superior semantic comprehension. In the field of text-to-image synthesis, recent works induce large-scale pre-training with VQ-VAE as a discrete visual tokenizer, which can synthesize realistic images from arbitrary text inputs. However, the quality of images generated by these methods is still inferior to that of images generated by GAN-based methods, especially in some specific domains. To leverage both the superior semantic comprehension of large-scale pre-training models and the powerful ability of GAN-based models in photorealistic image generation, we propose a novel knowledge distillation framework termed DiST-GAN to transfer the semantic knowledge of large-scale visual-language pre-training models (e.g., CLIP) to GAN-based generator for text-guided face image generation. Our DiST-GAN consists of two key components: (1) A new CLIP-based adaptive contrastive loss is devised to ensure the generated images are consistent with the input texts. (2) A language-to-vision (L2V) transformation module is learned to transform token embeddings of each text into an intermediate embedding that is aligned with the image embedding extracted by CLIP. With these two novel components, the semantic knowledge contained in CLIP can thus be transferred to GAN-based generator which preserves the superior ability of photorealistic image generation in the mean time. Extensive results on the Multi-Modal CelebA-HQ dataset show that our DiST-GAN achieves significant improvements over the state-of-the-arts.","PeriodicalId":321830,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo (ICME)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117326015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: A Geometrical Characterization on Feature Density of Image Datasets
Authors: Zhen Liang, Changyuan Zhao, Wanwei Liu, Bai Xue, Wenjing Yang
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00313 (https://doi.org/10.1109/ICME55011.2023.00313)
Abstract: Recently, the interpretability and verification of deep learning have attracted enormous attention from both academic and industrial communities, aiming to gain users' trust and ease their concerns. To guide learning procedures and data operations in a more interpretable way, in this paper we take a similar perspective on image datasets, the inputs of deep learning. Based on manifold learning, we derive an interpretable geometrical characterization of manifold curvature that depicts the feature density of datasets, represented as the ratio of the Euclidean distance to the geodesic distance. This is a noteworthy characteristic of image datasets, and we take the dataset compression and enhancement problems as application instances, performing sample credit assignment with the geometrical information. Experiments on typical image datasets demonstrate the effectiveness and promise of the presented geometrical characteristic.
Title: Optimizing Video Streaming for Sustainability and Quality: The Role of Preset Selection in Per-Title Encoding
Authors: Hadi Amirpour, V. V. Menon, Samira Afzal, R.-C. Prodan, C. Timmerer
Venue: 2023 IEEE International Conference on Multimedia and Expo (ICME), July 2023
DOI: 10.1109/ICME55011.2023.00289 (https://doi.org/10.1109/ICME55011.2023.00289)
Abstract: HTTP Adaptive Streaming (HAS) methods divide a video into smaller segments encoded at multiple pre-defined bitrates to construct a bitrate ladder. Bitrate ladders are usually optimized per title over several dimensions, such as bitrate, resolution, and framerate. This paper adds a new dimension to the bitrate ladder by considering the energy consumption of the encoding process. Video encoders often provide multiple pre-defined presets to balance the trade-off between encoding time, energy consumption, and compression efficiency: faster presets disable certain coding tools defined by the codec to reduce encoding time at the cost of reduced compression efficiency. First, this paper evaluates the energy consumption and compression efficiency of different x265 presets for 500 video sequences. Second, optimized presets are selected for the various representations in a bitrate ladder based on these results, guaranteeing a minimal drop in video quality while saving energy. Finally, a new per-title model that optimizes the trade-off between compression efficiency and energy consumption is proposed. The experimental results show that accepting VMAF score decreases of 0.15 and 0.39 when choosing an optimized preset yields encoding energy savings of 70% and 83%, respectively.