CACTI: Leveraging Copy Masking and Contextual Information to Improve Tabular Data Imputation.

Proceedings of Machine Learning Research. Pub Date: 2025-07-01

Aditya Gorla, Ryan Wang, Zhengtong Liu, Ulzee An, Sriram Sankararaman


Abstract

We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness, while incorporating semantic relationships between features (captured by column names and text descriptions) to better represent feature dependence. These dual sources of inductive bias enable CACTI to outperform state-of-the-art methods across a diverse range of datasets and missingness conditions, with an average R² gain of 7.8% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, missing at random, and missing completely at random, respectively). Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance. Code is publicly available at github.com/sriramlab/CACTI.
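The abstract does not spell out how median truncated copy masking works, so the following is only an illustrative sketch of one plausible reading, not the authors' implementation. It assumes that "copy masking" means hiding a row's observed values according to the missingness pattern of another row, and that "median truncated" means capping the number of entries left visible per row at the batch median. The function names `copy_mask` and `median_truncate` are hypothetical and do not come from the CACTI repository.

```python
import numpy as np


def copy_mask(observed, rng):
    """Hypothetical copy masking: reuse other rows' missingness patterns.

    observed: bool array (n_rows, n_cols), True where a value is present.
    Returns a bool array that is True where an observed value is hidden
    from the model during training.
    """
    n_rows = observed.shape[0]
    # Pick a donor row for every row. A row may draw itself,
    # in which case nothing extra is hidden for that row.
    donors = rng.permutation(n_rows)
    donor_missing = ~observed[donors]
    # Hide only entries that are present here but missing in the donor.
    return observed & donor_missing


def median_truncate(visible, rng):
    """Assumed truncation step: cap visible entries per row at the batch median."""
    counts = visible.sum(axis=1)
    cap = int(np.median(counts))
    out = visible.copy()
    for i in np.flatnonzero(counts > cap):
        cols = np.flatnonzero(visible[i])
        # Randomly drop the surplus so the row keeps exactly `cap` entries.
        drop = rng.choice(cols, size=counts[i] - cap, replace=False)
        out[i, drop] = False
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(8, 6))
    X[rng.random(X.shape) < 0.3] = np.nan      # inject missing values
    observed = ~np.isnan(X)
    hidden = copy_mask(observed, rng)          # reconstruction targets
    visible = median_truncate(observed & ~hidden, rng)  # encoder input
    print(visible.sum(axis=1))
```

Under these assumptions, the model would be trained to reconstruct the `hidden` entries from the `visible` ones, so the masks it sees during training resemble the missingness actually present in the data rather than uniformly random masks.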
