提示还是不提示？使用大型语言模型对异构数据进行整合和建模的导航

IF 2.7 3区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data & Knowledge Engineering Pub Date : 2024-05-06 DOI:10.1016/j.datak.2024.102313

Adel Remadi , Karim El Hage , Yasmina Hobeika , Francesca Bugiotti

{"title":"提示还是不提示？使用大型语言模型对异构数据进行整合和建模的导航","authors":"Adel Remadi , Karim El Hage , Yasmina Hobeika , Francesca Bugiotti","doi":"10.1016/j.datak.2024.102313","DOIUrl":null,"url":null,"abstract":"<div><p>Manually integrating data of diverse formats and languages is vital to many artificial intelligence applications. However, the task itself remains challenging and time-consuming. This paper highlights the potential of Large Language Models (LLMs) to streamline data extraction and resolution processes. Our approach aims to address the ongoing challenge of integrating heterogeneous data sources, encouraging advancements in the field of data engineering. Applied on the specific use case of learning disorders in higher education, our research demonstrates LLMs’ capability to effectively extract data from unstructured sources. It is then further highlighted that LLMs can enhance data integration by providing the ability to resolve entities originating from multiple data sources. Crucially, the paper underscores the necessity of preliminary data modeling decisions to ensure the success of such technological applications. By merging human expertise with LLM-driven automation, this study advocates for the further exploration of semi-autonomous data engineering pipelines.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"152 ","pages":"Article 102313"},"PeriodicalIF":2.7000,"publicationDate":"2024-05-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X24000375/pdfft?md5=11ee9c76542d55fac49075892a9a8c7d&pid=1-s2.0-S0169023X24000375-main.pdf","citationCount":"0","resultStr":"{\"title\":\"To prompt or not to prompt: Navigating the use of Large Language Models for integrating and modeling heterogeneous data\",\"authors\":\"Adel Remadi , Karim El Hage , Yasmina Hobeika , Francesca Bugiotti\",\"doi\":\"10.1016/j.datak.2024.102313\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Manually integrating data of diverse formats and languages is vital to many artificial intelligence applications. However, the task itself remains challenging and time-consuming. This paper highlights the potential of Large Language Models (LLMs) to streamline data extraction and resolution processes. Our approach aims to address the ongoing challenge of integrating heterogeneous data sources, encouraging advancements in the field of data engineering. Applied on the specific use case of learning disorders in higher education, our research demonstrates LLMs’ capability to effectively extract data from unstructured sources. It is then further highlighted that LLMs can enhance data integration by providing the ability to resolve entities originating from multiple data sources. Crucially, the paper underscores the necessity of preliminary data modeling decisions to ensure the success of such technological applications. By merging human expertise with LLM-driven automation, this study advocates for the further exploration of semi-autonomous data engineering pipelines.</p></div>\",\"PeriodicalId\":55184,\"journal\":{\"name\":\"Data & Knowledge Engineering\",\"volume\":\"152 \",\"pages\":\"Article 102313\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2024-05-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0169023X24000375/pdfft?md5=11ee9c76542d55fac49075892a9a8c7d&pid=1-s2.0-S0169023X24000375-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data & Knowledge Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0169023X24000375\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & Knowledge Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169023X24000375","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

手动整合不同格式和语言的数据对许多人工智能应用来说都至关重要。然而，这项任务本身仍然具有挑战性且耗时。本文强调了大型语言模型（LLM）在简化数据提取和解析过程方面的潜力。我们的方法旨在解决整合异构数据源的持续挑战，推动数据工程领域的进步。我们的研究以高等教育中的学习障碍为特定用例，展示了 LLMs 从非结构化数据源中有效提取数据的能力。论文还进一步强调，LLM 可以提供解析来自多个数据源的实体的能力，从而加强数据整合。最重要的是，本文强调了初步数据建模决策的必要性，以确保此类技术应用的成功。通过将人类专业知识与 LLM 驱动的自动化相结合，本研究主张进一步探索半自主数据工程管道。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

To prompt or not to prompt: Navigating the use of Large Language Models for integrating and modeling heterogeneous data

Manually integrating data of diverse formats and languages is vital to many artificial intelligence applications. However, the task itself remains challenging and time-consuming. This paper highlights the potential of Large Language Models (LLMs) to streamline data extraction and resolution processes. Our approach aims to address the ongoing challenge of integrating heterogeneous data sources, encouraging advancements in the field of data engineering. Applied on the specific use case of learning disorders in higher education, our research demonstrates LLMs’ capability to effectively extract data from unstructured sources. It is then further highlighted that LLMs can enhance data integration by providing the ability to resolve entities originating from multiple data sources. Crucially, the paper underscores the necessity of preliminary data modeling decisions to ensure the success of such technological applications. By merging human expertise with LLM-driven automation, this study advocates for the further exploration of semi-autonomous data engineering pipelines.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Data & Knowledge Engineering 工程技术-计算机：人工智能

CiteScore

5.00

自引率

0.00%

发文量

审稿时长

6 months

期刊介绍： Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.