G. Charbel N. Kindji, Lina M. Rojas-Barahona, Elisa Fromont, Tanguy Urvoy
Neurocomputing, Volume 658, Article 131655. DOI: 10.1016/j.neucom.2025.131655. Published 2025-09-26. Available at https://www.sciencedirect.com/science/article/pii/S0925231225023276
Tabular data generation models: An in-depth survey and performance benchmarks with extensive tuning
Generating realistic, safe, and useful tabular data is important for downstream tasks such as privacy preservation, imputation, oversampling, explainability, and simulation. However, the structure of tabular data, marked by heterogeneous types, non-smooth distributions, complex feature dependencies, and categorical imbalance, poses significant challenges. Although many generative approaches have been proposed, a fair and unified evaluation across datasets remains missing. This work benchmarks five recent model families on 16 diverse datasets (80K rows on average), with careful optimization of hyperparameters, feature encodings, and architectures. We show that dataset-specific tuning leads to substantial performance gains, particularly for diffusion-based models. We further introduce constrained hyperparameter spaces that retain competitive performance while significantly reducing tuning cost, enabling efficient model selection under fixed GPU budgets. Cross-domain and cross-table generation remain open directions for future work.
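The constrained-search idea described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical and not taken from the paper: the hyperparameter names, the two search spaces, and the `evaluate` objective are illustrative placeholders standing in for training a tabular generator and scoring its synthetic data. The point is only the mechanism: pruning a full hyperparameter space down to a constrained one, then running a fixed-budget search over it.

```python
import random

# Hypothetical full search space (illustrative names and values,
# not the paper's actual hyperparameters).
FULL_SPACE = {
    "lr": [1e-5, 1e-4, 1e-3, 1e-2],
    "layers": [1, 2, 4, 8],
    "batch_size": [64, 256, 1024, 4096],
}

# A constrained space keeps only values that (hypothetically) performed
# well across datasets, shrinking the number of configurations to try.
CONSTRAINED_SPACE = {
    "lr": [1e-4, 1e-3],
    "layers": [2, 4],
    "batch_size": [256, 1024],
}

def evaluate(config):
    """Placeholder objective: stands in for training a generative model
    and scoring its synthetic data (lower is better)."""
    return abs(config["lr"] - 1e-3) + abs(config["layers"] - 4) * 0.01

def random_search(space, budget, seed=0):
    """Fixed-budget random search: sample `budget` configurations from
    `space` and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(budget):
        config = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Under the same trial budget, the constrained space covers a far larger
# fraction of its configurations than the full space would.
best, score = random_search(CONSTRAINED_SPACE, budget=16)
print(best, score)
```

In practice the budget would be measured in GPU-hours rather than trial counts, and the search driven by a tuner such as Optuna, but the trade-off is the same: a smaller, well-chosen space lets a fixed budget explore a much larger share of candidate configurations.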
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Its essential topics span neurocomputing theory, practice, and applications.