Optimizing Multimodal Data Queries in Data Lakes
Authors: Runqun Xiong; Shiyuan Zhao; Ciyuan Chen; Zhuqing Xu
Journal: Tsinghua Science and Technology, vol. 30, no. 6, pp. 2625-2637 (published 2025-07-04; impact factor 3.5; JCR Q1)
DOI: 10.26599/TST.2025.9010022
URL: https://ieeexplore.ieee.org/document/11072065/
This paper addresses the challenge of efficiently querying related multimodal data in data lakes, which are large-scale storage and management systems that support heterogeneous data formats, including structured, semi-structured, and unstructured data. Multimodal data queries are crucial because they enable seamless retrieval of related data across modalities, such as tables, images, and text, with applications in fields like e-commerce, healthcare, and education. However, existing methods primarily focus on single-modality queries, such as joinable or unionable table discovery, and struggle to handle the heterogeneity and lack of metadata in data lakes while balancing accuracy and efficiency. To tackle these challenges, we propose a Multimodal data Query mechanism for Data Lakes (MQDL), which employs a modality-adaptive indexing mechanism and contrastive-learning-based embeddings to unify representations across modalities. Additionally, we introduce product quantization to optimize candidate verification during queries, reducing computational overhead while maintaining precision. We evaluate MQDL on a table-image dataset across multiple business scenarios, measuring precision, recall, and F1-score. Results show that MQDL achieves an accuracy rate of approximately 90%, while demonstrating strong scalability and reduced query response time compared to traditional methods. These findings highlight MQDL's potential to enhance multimodal data retrieval in complex data lake environments.
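The abstract names product quantization as the tool for cheaper candidate verification but gives no details. The following is a minimal sketch of generic product quantization over embedding vectors, not the MQDL implementation. All function names (`kmeans`, `pq_train`, `pq_encode`, `pq_search`), the random stand-in embeddings, and the sizes (16 dimensions, 4 sub-spaces, 16 centroids each) are assumptions made for illustration.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def pq_train(x, m, k):
    """Split the d dimensions into m sub-spaces; learn k centroids in each."""
    return [kmeans(s, k, seed=i) for i, s in enumerate(np.split(x, m, axis=1))]

def pq_encode(x, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    m = len(codebooks)
    codes = np.empty((len(x), m), dtype=np.uint8)  # assumes k <= 256
    for i, (s, cb) in enumerate(zip(np.split(x, m, axis=1), codebooks)):
        d2 = ((s[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
        codes[:, i] = d2.argmin(axis=1)
    return codes

def pq_search(query, codes, codebooks, top_k):
    """Approximate nearest neighbours via per-sub-space lookup tables."""
    m = len(codebooks)
    q_subs = np.split(query, m)
    # tables[i, c] = squared distance from query sub-vector i to centroid c.
    tables = np.stack([((cb - q) ** 2).sum(axis=1)
                       for q, cb in zip(q_subs, codebooks)])
    # Sum the table entries selected by each stored code.
    approx = tables[np.arange(m), codes].sum(axis=1)
    return np.argsort(approx)[:top_k]

# Hypothetical stand-in for unified table/image embeddings.
rng = np.random.default_rng(0)
db = rng.normal(size=(500, 16)).astype(np.float32)

codebooks = pq_train(db, m=4, k=16)
codes = pq_encode(db, codebooks)          # 4 bytes per vector instead of 64
candidates = pq_search(db[42], codes, codebooks, top_k=10)
print(candidates)
```

In this sketch each query costs one small table build plus integer lookups per stored vector, rather than a full floating-point distance against every embedding. An exact distance check would then only be needed on the short candidate list.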
About the journal:
Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. The journal aims to present up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions from all over the world are welcome.