TabSAL: Synthesizing Tabular data with Small agent Assisted Language models

IF 7.2 · Zone 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jiale Li, Run Qian, Yandan Tan, Zhixin Li, Luyu Chen, Sen Liu, Jie Wu, Hongfeng Chai
Journal: Knowledge-Based Systems
Publication date: 2024-08-30
Publication type: Journal Article
DOI: 10.1016/j.knosys.2024.112438
URL: https://www.sciencedirect.com/science/article/pii/S0950705124010724
Citations: 0

Abstract

Tabular data are widely used in machine-learning tasks because of their prevalence in various fields; however, the risk of data breaches and privacy protection regulations render such data almost unavailable. Tabular data generation methods alleviate this unavailability by synthesizing data free of private information, and generating data with language models is a recent innovation. Language models can synthesize high-quality datasets by learning from nondestructive information and recognizing the semantics of table columns. However, when current language models function as generators, their encoding methods are hindered by complicated decoding processes, and the limited predictive ability of the language models restricts their generative capability. To this end, we propose an encoding method for converting tabular data that is based on interactive data structures such as JavaScript Object Notation (JSON). We design TabSAL, a pluggable tabular data generation framework with small agent assisted language models, to boost predictive capability, resulting in high-quality synthetic datasets at a much lower computational resource cost. In addition, we release a benchmark that integrates eight datasets, three methods, and three assessment directions; it indicates that TabSAL surpasses the state of the art by up to 60%.
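
The abstract does not spell out the encoding itself, so the following is only a minimal sketch of the general idea of converting table rows to JSON text and back. The function names `encode_row` and `decode_row`, the column names, and the sample rows are hypothetical and are not taken from the paper; the validity check on decoding is likewise an assumption about how malformed generated text might be filtered.

```python
import json


def encode_row(columns, values):
    """Serialize one table row as a JSON object string keyed by column name."""
    return json.dumps(dict(zip(columns, values)), ensure_ascii=False)


def decode_row(text, columns):
    """Parse a JSON string back into a row.

    Returns None when the text is not valid JSON or its keys do not
    match the expected columns (an assumed filtering rule).
    """
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or set(record) != set(columns):
        return None
    return [record[c] for c in columns]


# Hypothetical example table.
columns = ["age", "occupation", "income"]
rows = [[39, "clerk", "<=50K"], [52, "manager", ">50K"]]

encoded = [encode_row(columns, r) for r in rows]
decoded = [decode_row(s, columns) for s in encoded]
```

Because each row is a self-describing JSON object, decoding reduces to a standard parse plus a key check, which is one plausible reading of how a JSON-style encoding could sidestep a complicated decoding step.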

Source journal
Knowledge-Based Systems (Engineering & Technology, Computer Science: Artificial Intelligence)
CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months
Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.