Title: Cheap-Fake Detection with LLM Using Prompt Engineering
Authors: Guangyang Wu, Weijie Wu, Xiaohong Liu, Kele Xu, Tianjiao Wan, Wenyi Wang
Venue: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
DOI: https://doi.org/10.1109/ICMEW59549.2023.00025
Published: 2023-06-05
Abstract: The misuse of real photographs with conflicting image captions in news items is an example of the out-of-context (OOC) misuse of media. To detect OOC media, individuals must determine the accuracy of the statement and evaluate whether the triplet (i.e., the image and two captions) relates to the same event. This paper presents a novel learnable approach for detecting OOC media in the ICME'23 Grand Challenge on Detecting Cheapfakes. The proposed method is based on the COSMOS structure, which assesses the coherence between an image and its captions, as well as between the two captions. We enhance the baseline algorithm by incorporating a Large Language Model (LLM), GPT-3.5, as a feature extractor. Specifically, we use prompt engineering to build a robust and reliable feature extractor on top of GPT-3.5. The extractor captures the correlation between the two captions and integrates effectively into the COSMOS baseline model, allowing a deeper understanding of the relationship between captions. By incorporating this module, we demonstrate the potential for significant improvements in cheapfake detection performance. The methodology also holds promise for applications such as natural language processing, image captioning, and text-to-image synthesis. A Docker image for submission is available at https://hub.docker.com/repository/docker/mulns/acmmmcheapfakes.

{"title":"Half Title Page","authors":"","doi":"10.1109/icmew59549.2023.00001","DOIUrl":"https://doi.org/10.1109/icmew59549.2023.00001","url":null,"abstract":"","PeriodicalId":111482,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130097309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Supervised Federated Learning for Keyword Spotting","authors":"Enmao Diao, Eric W. Tramel, Jie Ding, Tao Zhang","doi":"10.1109/ICMEW59549.2023.00087","DOIUrl":"https://doi.org/10.1109/ICMEW59549.2023.00087","url":null,"abstract":"Keyword Spotting (KWS) is a critical aspect of audio-based applications on mobile devices and virtual assistants. Recent developments in Federated Learning (FL) have significantly expanded the ability to train machine learning models by utilizing the computational and private data resources of numerous distributed devices. However, existing FL methods typically require that devices possess accurate ground-truth labels, which can be both expensive and impractical when dealing with local audio data. In this study, we first demonstrate the effectiveness of Semi-Supervised Federated Learning (SSL) and FL for KWS. We then extend our investigation to Semi-Supervised Federated Learning (SSFL) for KWS, where devices possess completely unlabeled data, while the server has access to a small amount of labeled data. We perform numerical analyses using state-of-the-art SSL, FL, and SSFL techniques to demonstrate that the performance of KWS models can be significantly improved by leveraging the abundant unlabeled heterogeneous data available on devices.","PeriodicalId":111482,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127499651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Prompt What You Need: Enhancing Segmentation in Rainy Scenes with Anchor-Based Prompting
Authors: Xiaoyuan Guo, Xiang Wei, Q. Su, Hui-Huang Zhao, Shunli Zhan
Venue: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
DOI: https://doi.org/10.1109/ICMEW59549.2023.00019
Published: 2023-05-06
Abstract: Semantic segmentation in rainy scenes is a challenging task due to the complex environment, class distribution imbalance, and limited annotated data. To address these challenges, we propose a novel framework that combines semi-supervised learning with a pre-trained segmentation foundation model to achieve superior performance. Specifically, the semi-supervised model serves as the basis for generating raw semantic segmentation results, while also guiding the pre-trained foundation model, via entropy-based anchor prompts, to compensate for knowledge gaps. In addition, to minimize the impact of irrelevant segmentation masks generated by the pre-trained foundation model, we propose a mask filtering and fusion mechanism that refines the raw semantic segmentation results according to the principle of minimum risk. The proposed framework achieves superior segmentation performance on the Rainy WCity dataset and won first prize in the STRAIN subtrack of the ICME 2023 Grand Challenges.

Title: Learn How to Prune Pixels for Multi-View Neural Image-Based Synthesis
Authors: Marta Milovanović, Enzo Tartaglione, Marco Cagnazzo, F. Henry
Venue: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
DOI: https://doi.org/10.1109/ICMEW59549.2023.00034
Published: 2023-05-05
Abstract: Image-based rendering techniques stand at the core of an immersive user experience, as they generate novel views from a set of multiple input images. Since they have shown good performance in terms of objective and subjective quality, the research community devotes great effort to their improvement. However, the large volume of data that must be rendered at the receiver's side hinders applications in limited-bandwidth environments and prevents their use in real-time applications. We present LeHoPP, a method for input-pixel pruning that examines the importance of each input pixel with respect to the rendered view and avoids using irrelevant pixels. Even without retraining the image-based rendering network, our approach shows a good trade-off between synthesis quality and pixel rate. When tested in the general neural rendering framework, LeHoPP gains between 0.9 dB and 3.6 dB on average compared to other pruning baselines.

Title: Conditional and Residual Methods in Scalable Coding for Humans and Machines
Authors: Anderson de Andrade, Alon Harell, Yalda Foroutan, Ivan V. Bajić
Venue: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
DOI: https://doi.org/10.1109/ICMEW59549.2023.00040
Published: 2023-05-04
Abstract: We present methods for conditional and residual coding in the context of scalable coding for humans and machines. Our focus is on optimizing the rate-distortion performance of the reconstruction task using the information available in the computer-vision task. We include an information analysis of both approaches to provide baselines, and we also propose an entropy model suitable for conditional coding with increased modelling capacity and similar tractability to previous work. We apply these methods to image reconstruction, using in one instance representations created for semantic segmentation on the Cityscapes dataset, and in another, representations created for object detection on the COCO dataset. In both experiments, the conditional and residual methods perform similarly, with the resulting rate-distortion curves contained within our baselines.

{"title":"Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation","authors":"Zihao Li, Pan Gao, Hui Yuan, Ran Wei, M. Paul","doi":"10.1109/ICMEW59549.2023.00031","DOIUrl":"https://doi.org/10.1109/ICMEW59549.2023.00031","url":null,"abstract":"Discovering inter-point connection for efficient high-dimensional feature extraction from point coordinate is a key challenge in processing point cloud. Most existing methods focus on designing efficient local feature extractors while ignoring global connection, or vice versa. In this paper, we design a new Inductive Bias-aided Transformer (IBT) method to learn 3D inter-point relations, which considers both local and global attentions. Specifically, considering local spatial coherence, local feature learning is performed through Relative Position Encoding and Attentive Feature Pooling. We incorporate the learned locality into the Transformer module. The local feature affects value component in Transformer to modulate the relationship between channels of each point, which can enhance self-attention mechanism with locality based channel interaction. We demonstrate its superiority experimentally on classification and segmentation tasks. The code is available at: https://github.com/jiamang/IBT","PeriodicalId":111482,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126506126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: An Order-Complexity Model for Aesthetic Quality Assessment of Homophony Music Performance
Authors: Xin Jin, Wu Zhou, Jinyu Wang, Duo Xu, Yiqing Rong, Jialin Sun
Venue: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
DOI: https://doi.org/10.1109/ICMEW59549.2023.00061
Published: 2023-04-23
Abstract: Although computational aesthetics evaluation has made progress in many fields, its application to music performance remains largely unexplored. At present, subjective evaluation is still the primary method in music aesthetics research, but it consumes considerable human and material resources. In addition, music performances generated by AI remain mechanical, monotonous, and lacking in beauty. To guide the generation of AI music performances, and to improve the playing of human performers, this paper uses Birkhoff's aesthetic measure to propose an objective measurement of beauty. The main contributions are as follows: first, we put forward an objective aesthetic evaluation method to measure the aesthetics of a music performance; second, we propose 10 basic music features and 4 aesthetic music features. Experiments show that our method performs well on performance assessment.

Title: XGC-VQA: A Unified Video Quality Assessment Model for User, Professionally, and Occupationally-Generated Content
Authors: Xinhui Huang, Chunyi Li, A. Bentaleb, Roger Zimmermann, Guangtao Zhai
Venue: 2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)
DOI: https://doi.org/10.1109/ICMEW59549.2023.00081
Published: 2023-03-24
Abstract: With the rapid growth in the amount and variety of Internet video data, a unified Video Quality Assessment (VQA) model is needed to guide video communication toward better perceptual quality. To meet the accompanying real-time and universality requirements, this study proposes a VQA model built on a classification of User Generated Content (UGC), Professionally Generated Content (PGC), and Occupationally Generated Content (OGC). In the time domain, the model uses non-uniform sampling, since each content type places its temporal importance on different parts of the video with respect to perceptual quality. In the spatial domain, centralized downsampling is performed before the VQA process via a patch splicing/sampling mechanism, lowering complexity enough for real-time assessment. Experimental results show that the proposed method achieves a median correlation of 0.7 while keeping computation time below 5 s for all three content types, so the communication experience of UGC, PGC, and OGC can be optimized together.

{"title":"A Perceptual Quality Assessment Exploration for AIGC Images","authors":"Zicheng Zhang, Chunyi Li, Wei Sun, Xiaohong Liu, Xiongkuo Min, Guangtao Zhai","doi":"10.1109/ICMEW59549.2023.00082","DOIUrl":"https://doi.org/10.1109/ICMEW59549.2023.00082","url":null,"abstract":"AI Generated Content (AIGC) has gained widespread attention with the increasing efficiency of deep learning in content creation. AIGC, created with the assistance of artificial intelligence technology, includes various forms of content, among which the AI-generated images (AGIs) have brought significant impact to society and have been applied to various fields such as entertainment, education, social media, etc. However, due to hardware limitations and technical proficiency, the quality of AIGC images (AGIs) varies, necessitating refinement and filtering before practical use. Consequently, there is an urgent need for developing objective models to assess the quality of AGIs. Unfortunately, no research has been carried out to investigate the perceptual quality assessment for AGIs specifically. Therefore, in this paper, we first discuss the major evaluation aspects such as technical issues, AI artifacts, unnaturalness, discrepancy, and aesthetics for AGI quality assessment. Then we present the first perceptual AGI quality assessment database, AGIQA-1K, which consists of 1,080 AGIs generated from diffusion models. A well-organized subjective experiment is followed to collect the quality labels of the AGIs. Finally, we conduct a benchmark experiment to evaluate the performance of current image quality assessment (IQA) models. The database is released on https://github.com/lcysyzxdxc/AGIQA-1k-Database.","PeriodicalId":111482,"journal":{"name":"2023 IEEE International Conference on Multimedia and Expo Workshops (ICMEW)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125937237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}