TabSAL: Synthesizing Tabular data with Small agent Assisted Language models
Jiale Li, Run Qian, Yandan Tan, Zhixin Li, Luyu Chen, Sen Liu, Jie Wu, Hongfeng Chai
Knowledge-Based Systems (JCR Q1, Computer Science, Artificial Intelligence; impact factor 7.2). Published 2024-08-30. DOI: 10.1016/j.knosys.2024.112438
https://www.sciencedirect.com/science/article/pii/S0950705124010724
Citations: 0
Abstract
Tabular data are prevalent across many fields and widely used in machine-learning tasks; however, the risk of data breaches and privacy protection regulations render such data almost unavailable. Tabular data generation methods alleviate this unavailability by synthesizing privacy-free data, and generating such data with language models is a recent innovation. Language models can synthesize high-quality datasets by learning knowledge from nondestructive information and recognizing the semantics of table columns. However, when current language models function as generators, their encoding methods are hindered by complicated decoding processes, and their limited predictive ability restricts their generative capability. To this end, we propose an encoding method for converting tabular data that is based on interactive data structures such as JavaScript Object Notation (JSON). We design TabSAL, a pluggable tabular data generation framework with small agent assisted language models, to boost predictive capability, yielding high-quality synthetic datasets at a much lower computational resource cost. In addition, we release a benchmark that integrates eight datasets, three methods, and three assessment directions; it indicates that TabSAL surpasses the state of the art by up to 60%.
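To make the JSON-based encoding mentioned in the abstract concrete, the following is a minimal sketch of how a table row might be serialized to a JSON string for a language model and parsed back afterwards. It is not the authors' implementation: the function names `encode_row` and `decode_row`, the example columns, and the validation rule (reject any output whose keys do not match the table's columns) are all assumptions made for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def encode_row(row: Dict[str, Any]) -> str:
    """Serialize one table row as a JSON object string (hypothetical helper)."""
    return json.dumps(row, ensure_ascii=False)


def decode_row(text: str, columns: List[str]) -> Optional[Dict[str, Any]]:
    """Parse generated text back into a row.

    Returns None when the text is not valid JSON, is not an object,
    or does not contain exactly the expected column names.
    """
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != set(columns):
        return None
    # Restore the table's column order.
    return {name: obj[name] for name in columns}


# Illustrative data only; these columns are not taken from the paper.
columns = ["age", "occupation", "income"]
rows = [
    {"age": 39, "occupation": "clerk", "income": "<=50K"},
    {"age": 52, "occupation": "manager", "income": ">50K"},
]

encoded = [encode_row(r) for r in rows]
decoded = [decode_row(t, columns) for t in encoded]
```

Under these assumptions, decoding reduces to a single standard parser call plus a column check, which suggests why a structured format can be simpler to decode than free-form text encodings of rows.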
About the journal:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.