{"title":"Lighthouse: A User-Friendly Library for Reproducible Video Moment Retrieval and Highlight Detection","authors":"Taichi Nishimura, Shota Nakada, Hokuto Munakata, Tatsuya Komatsu","doi":"arxiv-2408.02901","DOIUrl":"https://doi.org/arxiv-2408.02901","url":null,"abstract":"We propose Lighthouse, a user-friendly library for reproducible video moment\u0000retrieval and highlight detection (MR-HD). Although researchers proposed\u0000various MR-HD approaches, the research community holds two main issues. The\u0000first is a lack of comprehensive and reproducible experiments across various\u0000methods, datasets, and video-text features. This is because no unified training\u0000and evaluation codebase covers multiple settings. The second is user-unfriendly\u0000design. Because previous works use different libraries, researchers set up\u0000individual environments. In addition, most works release only the training\u0000codes, requiring users to implement the whole inference process of MR-HD.\u0000Lighthouse addresses these issues by implementing a unified reproducible\u0000codebase that includes six models, three features, and five datasets. In\u0000addition, it provides an inference API and web demo to make these methods\u0000easily accessible for researchers and developers. Our experiments demonstrate\u0000that Lighthouse generally reproduces the reported scores in the reference\u0000papers. The code is available at https://github.com/line/lighthouse.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MaskAnyone Toolkit: Offering Strategies for Minimizing Privacy Risks and Maximizing Utility in Audio-Visual Data Archiving","authors":"Babajide Alamu Owoyele, Martin Schilling, Rohan Sawahn, Niklas Kaemer, Pavel Zherebenkov, Bhuvanesh Verma, Wim Pouw, Gerard de Melo","doi":"arxiv-2408.03185","DOIUrl":"https://doi.org/arxiv-2408.03185","url":null,"abstract":"This paper introduces MaskAnyone, a novel toolkit designed to navigate some\u0000privacy and ethical concerns of sharing audio-visual data in research.\u0000MaskAnyone offers a scalable, user-friendly solution for de-identifying\u0000individuals in video and audio content through face-swapping and voice\u0000alteration, supporting multi-person masking and real-time bulk processing. By\u0000integrating this tool within research practices, we aim to enhance data\u0000reproducibility and utility in social science research. Our approach draws on\u0000Design Science Research, proposing that MaskAnyone can facilitate safer data\u0000sharing and potentially reduce the storage of fully identifiable data. We\u0000discuss the development and capabilities of MaskAnyone, explore its integration\u0000into ethical research practices, and consider the broader implications of\u0000audio-visual data masking, including issues of consent and the risk of misuse.\u0000The paper concludes with a preliminary evaluation framework for assessing the\u0000effectiveness and ethical integration of masking tools in such research\u0000settings.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"74 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer","authors":"Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu","doi":"arxiv-2408.03284","DOIUrl":"https://doi.org/arxiv-2408.03284","url":null,"abstract":"Lip-syncing videos with given audio is the foundation for various\u0000applications including the creation of virtual presenters or performers. While\u0000recent studies explore high-fidelity lip-sync with different techniques, their\u0000task-orientated models either require long-term videos for clip-specific\u0000training or retain visible artifacts. In this paper, we propose a unified and\u0000effective framework ReSyncer, that synchronizes generalized audio-visual facial\u0000information. The key design is revisiting and rewiring the Style-based\u0000generator to efficiently adopt 3D facial dynamics predicted by a principled\u0000style-injected Transformer. By simply re-configuring the information insertion\u0000mechanisms within the noise and style space, our framework fuses motion and\u0000appearance with unified training. Extensive experiments demonstrate that\u0000ReSyncer not only produces high-fidelity lip-synced videos according to audio,\u0000but also supports multiple appealing properties that are suitable for creating\u0000virtual presenters and performers, including fast personalized fine-tuning,\u0000video-driven lip-syncing, the transfer of speaking styles, and even face\u0000swapping. Resources can be found at\u0000https://guanjz20.github.io/projects/ReSyncer.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multitask and Multimodal Neural Tuning for Large Models","authors":"Hao Sun, Yu Song, Jihong Hu, Yen-Wei Chen, Lanfen Lin","doi":"arxiv-2408.03001","DOIUrl":"https://doi.org/arxiv-2408.03001","url":null,"abstract":"In recent years, large-scale multimodal models have demonstrated impressive\u0000capabilities across various domains. However, enabling these models to\u0000effectively perform multiple multimodal tasks simultaneously remains a\u0000significant challenge. To address this, we introduce a novel tuning method\u0000called neural tuning, designed to handle diverse multimodal tasks concurrently,\u0000including reasoning segmentation, referring segmentation, image captioning, and\u0000text-to-image generation. Neural tuning emulates sparse distributed\u0000representation in human brain, where only specific subsets of neurons are\u0000activated for each task. Additionally, we present a new benchmark, MMUD, where\u0000each sample is annotated with multiple task labels. By applying neural tuning\u0000to pretrained large models on the MMUD benchmark, we achieve simultaneous task\u0000handling in a streamlined and efficient manner. All models, code, and datasets\u0000will be publicly available after publication, facilitating further research and\u0000development in this field.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark","authors":"Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku","doi":"arxiv-2408.02272","DOIUrl":"https://doi.org/arxiv-2408.02272","url":null,"abstract":"Procedural video understanding is gaining attention in the vision and\u0000language community. Deep learning-based video analysis requires extensive data.\u0000Consequently, existing works often use web videos as training resources, making\u0000it challenging to query instructional contents from raw video observations. To\u0000address this issue, we propose a new dataset, COM Kitchens. The dataset\u0000consists of unedited overhead-view videos captured by smartphones, in which\u0000participants performed food preparation based on given recipes. Fixed-viewpoint\u0000video datasets often lack environmental diversity due to high camera setup\u0000costs. We used modern wide-angle smartphone lenses to cover cooking counters\u0000from sink to cooktop in an overhead view, capturing activity without in-person\u0000assistance. With this setup, we collected a diverse dataset by distributing\u0000smartphones to participants. With this dataset, we propose the novel\u0000video-to-text retrieval task Online Recipe Retrieval (OnRR) and new video\u0000captioning domain Dense Video Captioning on unedited Overhead-View videos\u0000(DVC-OV). Our experiments verified the capabilities and limitations of current\u0000web-video-based SOTA methods in handling these tasks.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"467 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiple Contexts and Frequencies Aggregation Network forDeepfake Detection","authors":"Zifeng Li, Wenzhong Tang, Shijun Gao, Shuai Wang, Yanxiang Wang","doi":"arxiv-2408.01668","DOIUrl":"https://doi.org/arxiv-2408.01668","url":null,"abstract":"Deepfake detection faces increasing challenges since the fast growth of\u0000generative models in developing massive and diverse Deepfake technologies.\u0000Recent advances rely on introducing heuristic features from spatial or\u0000frequency domains rather than modeling general forgery features within\u0000backbones. To address this issue, we turn to the backbone design with two\u0000intuitive priors from spatial and frequency detectors, textit{i.e.,} learning\u0000robust spatial attributes and frequency distributions that are discriminative\u0000for real and fake samples. To this end, we propose an efficient network for\u0000face forgery detection named MkfaNet, which consists of two core modules. For\u0000spatial contexts, we design a Multi-Kernel Aggregator that adaptively selects\u0000organ features extracted by multiple convolutions for modeling subtle facial\u0000differences between real and fake faces. For the frequency components, we\u0000propose a Multi-Frequency Aggregator to process different bands of frequency\u0000components by adaptively reweighing high-frequency and low-frequency features.\u0000Comprehensive experiments on seven popular deepfake detection benchmarks\u0000demonstrate that our proposed MkfaNet variants achieve superior performances in\u0000both within-domain and across-domain evaluations with impressive efficiency of\u0000parameter usage.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"100 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MMPKUBase: A Comprehensive and High-quality Chinese Multi-modal Knowledge Graph","authors":"Xuan Yi, Yanzeng Li, Lei Zou","doi":"arxiv-2408.01679","DOIUrl":"https://doi.org/arxiv-2408.01679","url":null,"abstract":"Multi-modal knowledge graphs have emerged as a powerful approach for\u0000information representation, combining data from different modalities such as\u0000text, images, and videos. While several such graphs have been constructed and\u0000have played important roles in applications like visual question answering and\u0000recommendation systems, challenges persist in their development. These include\u0000the scarcity of high-quality Chinese knowledge graphs and limited domain\u0000coverage in existing multi-modal knowledge graphs. This paper introduces\u0000MMPKUBase, a robust and extensive Chinese multi-modal knowledge graph that\u0000covers diverse domains, including birds, mammals, ferns, and more, comprising\u0000over 50,000 entities and over 1 million filtered images. To ensure data\u0000quality, we employ Prototypical Contrastive Learning and the Isolation Forest\u0000algorithm to refine the image data. Additionally, we have developed a\u0000user-friendly platform to facilitate image attribute exploration.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"79 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection","authors":"Hong Guan, Yancheng Wang, Lulu Xie, Soham Nag, Rajeev Goel, Niranjan Erappa Narayana Swamy, Yingzhen Yang, Chaowei Xiao, Jonathan Prisby, Ross Maciejewski, Jia Zou","doi":"arxiv-2408.01690","DOIUrl":"https://doi.org/arxiv-2408.01690","url":null,"abstract":"Effective fraud detection and analysis of government-issued identity\u0000documents, such as passports, driver's licenses, and identity cards, are\u0000essential in thwarting identity theft and bolstering security on online\u0000platforms. The training of accurate fraud detection and analysis tools depends\u0000on the availability of extensive identity document datasets. However, current\u0000publicly available benchmark datasets for identity document analysis, including\u0000MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a\u0000limited number of samples, cover insufficient varieties of fraud patterns, and\u0000seldom include alterations in critical personal identifying fields like\u0000portrait images, limiting their utility in training models capable of detecting\u0000realistic frauds while preserving privacy. In response to these shortcomings, our research introduces a new benchmark\u0000dataset, IDNet, designed to advance privacy-preserving fraud detection efforts.\u0000The IDNet dataset comprises 837,060 images of synthetically generated identity\u0000documents, totaling approximately 490 gigabytes, categorized into 20 types from\u0000$10$ U.S. states and 10 European countries. We evaluate the utility and present\u0000use cases of the dataset, illustrating how it can aid in training\u0000privacy-preserving fraud detection methods, facilitating the generation of\u0000camera and video capturing of identity documents, and testing schema\u0000unification and other identity document management functionalities.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"59 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Music2P: A Multi-Modal AI-Driven Tool for Simplifying Album Cover Design","authors":"Joong Ho Choi, Geonyeong Choi, Ji-Eun Han, Wonjin Yang, Zhi-Qi Cheng","doi":"arxiv-2408.01651","DOIUrl":"https://doi.org/arxiv-2408.01651","url":null,"abstract":"In today's music industry, album cover design is as crucial as the music\u0000itself, reflecting the artist's vision and brand. However, many AI-driven album\u0000cover services require subscriptions or technical expertise, limiting\u0000accessibility. To address these challenges, we developed Music2P, an\u0000open-source, multi-modal AI-driven tool that streamlines album cover creation,\u0000making it efficient, accessible, and cost-effective through Ngrok. Music2P\u0000automates the design process using techniques such as Bootstrapping Language\u0000Image Pre-training (BLIP), music-to-text conversion (LP-music-caps), image\u0000segmentation (LoRA), and album cover and QR code generation (ControlNet). This\u0000paper demonstrates the Music2P interface, details our application of these\u0000technologies, and outlines future improvements. Our ultimate goal is to provide\u0000a tool that empowers musicians and producers, especially those with limited\u0000resources or expertise, to create compelling album covers.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses","authors":"Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu","doi":"arxiv-2408.01669","DOIUrl":"https://doi.org/arxiv-2408.01669","url":null,"abstract":"Video grounding is a fundamental problem in multimodal content understanding,\u0000aiming to localize specific natural language queries in an untrimmed video.\u0000However, current video grounding datasets merely focus on simple events and are\u0000either limited to shorter videos or brief sentences, which hinders the model\u0000from evolving toward stronger multimodal understanding capabilities. To address\u0000these limitations, we present a large-scale video grounding dataset named\u0000SynopGround, in which more than 2800 hours of videos are sourced from popular\u0000TV dramas and are paired with accurately localized human-written synopses. Each\u0000paragraph in the synopsis serves as a language query and is manually annotated\u0000with precise temporal boundaries in the long video. These paragraph queries are\u0000tightly correlated to each other and contain a wealth of abstract expressions\u0000summarizing video storylines and specific descriptions portraying event\u0000details, which enables the model to learn multimodal perception on more\u0000intricate concepts over longer context dependencies. Based on the dataset, we\u0000further introduce a more complex setting of video grounding dubbed\u0000Multi-Paragraph Video Grounding (MPVG), which takes as input multiple\u0000paragraphs and a long video for grounding each paragraph query to its temporal\u0000interval. In addition, we propose a novel Local-Global Multimodal Reasoner\u0000(LGMR) to explicitly model the local-global structures of long-term multimodal\u0000inputs for MPVG. Our method provides an effective baseline solution to the\u0000multi-paragraph video grounding problem. Extensive experiments verify the\u0000proposed model's effectiveness as well as its superiority in long-term\u0000multi-paragraph video grounding over prior state-of-the-arts. Dataset and code\u0000are publicly available. Project page: https://synopground.github.io/.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"93 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141941866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}