Optimizing Multimodal Data Queries in Data Lakes

IF 3.5, Zone 1 (Computer Science), JCR Q1 (Multidisciplinary)
Runqun Xiong;Shiyuan Zhao;Ciyuan Chen;Zhuqing Xu
Journal: Tsinghua Science and Technology, vol. 30, no. 6, pp. 2625-2637
Published: 2025-07-04
DOI: 10.26599/TST.2025.9010022
URL: https://ieeexplore.ieee.org/document/11072065/
Citations: 0

Abstract

This paper addresses the challenge of efficiently querying related multimodal data in data lakes, large-scale storage and management systems that support heterogeneous data formats, including structured, semi-structured, and unstructured data. Multimodal data queries are crucial because they enable seamless retrieval of related data across modalities, such as tables, images, and text, with applications in fields like e-commerce, healthcare, and education. However, existing methods primarily focus on single-modality queries, such as joinable or unionable table discovery, and struggle to handle the heterogeneity and lack of metadata in data lakes while balancing accuracy and efficiency. To tackle these challenges, we propose a Multimodal data Query mechanism for Data Lakes (MQDL), which employs a modality-adaptive indexing mechanism and contrastive-learning-based embeddings to unify representations across modalities. Additionally, we introduce product quantization to optimize candidate verification during queries, reducing computational overhead while maintaining precision. We evaluate MQDL on a table-image dataset across multiple business scenarios, measuring precision, recall, and F1-score. Results show that MQDL achieves an accuracy of approximately 90% while demonstrating strong scalability and reduced query response time compared to traditional methods. These findings highlight MQDL's potential to enhance multimodal data retrieval in complex data lake environments.
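The abstract names product quantization as the way candidate verification is made cheaper, but gives no implementation details. The following is a minimal, generic sketch of product quantization with asymmetric distance computation, not MQDL's actual method. All function names, the toy k-means, and the parameters (`m` subspaces, `k` centroids per subspace, 8-dimensional random vectors) are assumptions chosen for illustration.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means; a stand-in for any real clustering routine."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vec, m):
    """Cut a vector into m contiguous subvectors of equal length."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def train_pq(vectors, m, k):
    """Learn one codebook of k centroids for each of the m subspaces."""
    columns = zip(*(split(v, m) for v in vectors))  # j-th item: all j-th subvectors
    return [kmeans(list(col), k, seed=j) for j, col in enumerate(columns)]

def encode(vec, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def adc_table(query, codebooks):
    """Precompute query-to-centroid distances once per query."""
    return [[sq_dist(sub, centroid) for centroid in cb]
            for sub, cb in zip(split(query, len(codebooks)), codebooks)]

def approx_dist(code, table):
    """Approximate distance to a stored vector via m table lookups."""
    return sum(table[j][c] for j, c in enumerate(code))

# Toy data standing in for embeddings of tables or images (hypothetical).
rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
codebooks = train_pq(data, m=4, k=8)
codes = [encode(v, codebooks) for v in data]

query = data[0]
table = adc_table(query, codebooks)
ranked = sorted(range(len(codes)), key=lambda i: approx_dist(codes[i], table))
print(ranked[:5])  # indices of the five candidates closest to the query
```

Under these assumptions, each stored vector shrinks to `m` small integers, and scoring a candidate costs `m` table lookups instead of a full distance computation, which is the general reason product quantization reduces verification overhead at some cost in exactness.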
Source journal
Tsinghua Science and Technology
Categories: Computer Science, Information Systems; Computer Science, Software Engineering
CiteScore: 10.20
Self-citation rate: 10.60%
Articles published: 2340
About the journal: Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. The journal aims to present up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions from all over the world are welcome.