LLM-assisted Labeling Function Generation for Semantic Type Detection
Chenjie Li, Dan Zhang, Jin Wang
arXiv - CS - Databases, arXiv:2408.16173, 2024-08-28
Detecting the semantic types of columns in data lake tables is an important
application. A key bottleneck in semantic type detection is the limited
availability of human annotations, owing to the inherent complexity of data
lakes. In this paper,
we propose using programmatic weak supervision to assist in annotating the
training data for semantic type detection by leveraging labeling functions. One
challenge in this process is that labeling functions are difficult to write by
hand, given the large volume and low quality of data lake tables. To address
this issue, we explore employing Large Language Models
(LLMs) for labeling function generation and introduce several prompt
engineering strategies for this purpose. We conduct experiments on real-world
web table datasets. Based on the initial results, we perform extensive analysis
and provide empirical insights and future directions for researchers in this
field.
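
To make the idea of labeling functions concrete, the following is a minimal sketch of programmatic weak supervision for column type detection. It is not the authors' implementation: the function names, the 0.8 match threshold, the tiny country lexicon, and the majority-vote aggregation are all assumptions chosen for illustration. Weak supervision pipelines often combine votes with a learned label model; plain majority voting is used here only as a simple stand-in.

```python
import re
from collections import Counter
from typing import Callable, List, Optional

# A labeling function inspects one column (a list of cell strings) and
# returns a semantic type, or None to abstain when it has no opinion.
Column = List[str]
LabelingFunction = Callable[[Column], Optional[str]]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
YEAR_RE = re.compile(r"^(1[89]|20)\d{2}$")
# Hypothetical, deliberately tiny lexicon used only for this example.
COUNTRIES = {"france", "japan", "brazil"}


def match_ratio(values: Column, predicate: Callable[[str], bool]) -> float:
    """Fraction of cells in the column that satisfy the predicate."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v.strip())) / len(values)


def lf_email(values: Column) -> Optional[str]:
    """Vote 'email' when most cells look like e-mail addresses."""
    ratio = match_ratio(values, lambda v: bool(EMAIL_RE.match(v)))
    return "email" if ratio >= 0.8 else None


def lf_year(values: Column) -> Optional[str]:
    """Vote 'year' when most cells are four-digit years from 1800 to 2099."""
    ratio = match_ratio(values, lambda v: bool(YEAR_RE.match(v)))
    return "year" if ratio >= 0.8 else None


def lf_country(values: Column) -> Optional[str]:
    """Vote 'country' when most cells appear in the small lexicon."""
    ratio = match_ratio(values, lambda v: v.lower() in COUNTRIES)
    return "country" if ratio >= 0.8 else None


LFS: List[LabelingFunction] = [lf_email, lf_year, lf_country]


def majority_vote(values: Column, lfs: List[LabelingFunction]) -> Optional[str]:
    """Aggregate non-abstaining votes; return None if every function abstains."""
    votes = [label for label in (lf(values) for lf in lfs) if label is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    print(majority_vote(["a@b.com", "c@d.org"], LFS))
    print(majority_vote(["1999", "2004", "2021"], LFS))
    print(majority_vote(["foo", "bar"], LFS))
```

In this sketch, each function is cheap to run over many unlabeled columns, and a column on which every function abstains simply stays unlabeled. Heuristics of this kind are what the abstract proposes asking an LLM to generate instead of writing them manually.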