Decoding non-linearity and complexity: deep tabular learning approaches for materials science

Impact Factor: 6.2 (Q1, Chemistry, Multidisciplinary)
Vahid Attari and Raymundo Arroyave
DOI: 10.1039/D5DD00166H
Journal: Digital Discovery, volume 10, pages 2765–2780
Published: 2025-08-01 (Journal Article)
PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00166h?page=search
Landing page: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00166h
Citations: 0

Abstract

Materials datasets, particularly those capturing high-temperature properties, pose significant challenges for learning tasks due to their skewed distributions, wide feature ranges, and multimodal behaviors. While tree-based models like XGBoost are inherently non-linear and often perform well on many tabular problems, their reliance on piecewise-constant splits can limit their effectiveness when modeling the smooth, long-tailed, or higher-order relationships prevalent in advanced materials data. To address these challenges, we investigate the effectiveness of encoder–decoder models for data transformation, using regularized Fully Dense Networks (FDN-R), Disjunctive Normal Form Networks (DNF-Net), 1D Convolutional Neural Networks (CNNs), and Variational Autoencoders, along with TabNet, a hybrid attention-based model. Our results indicate that while XGBoost remains competitive on simpler tasks, encoder–decoder models, particularly those based on regularized FDN-R and DNF-Net, generalize better on highly skewed targets such as creep resistance across small, medium, and large datasets. TabNet's attention mechanism offers moderate gains but underperforms on extreme values. These findings emphasize the importance of aligning model architecture with feature complexity and demonstrate the promise of hybrid encoder–decoder models for robust and generalizable materials-property prediction from composition data.
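To make the core idea concrete: an encoder–decoder compresses the raw tabular features into a low-dimensional code and reconstructs them from it, so the code can serve as a transformed representation for downstream regression. The sketch below is a deliberately minimal, numpy-only linear autoencoder trained by gradient descent on synthetic data; the paper's actual architectures (FDN-R, DNF-Net, 1D CNNs, VAEs, TabNet) are far richer, and all dimensions, learning rates, and data here are illustrative assumptions, not the authors' setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "tabular" data: 200 samples, 8 correlated features generated
# from a 2-D latent factor plus noise (a stand-in for composition features).
latent_true = rng.normal(size=(200, 2))
mix = rng.normal(size=(2, 8))
X = latent_true @ mix + 0.05 * rng.normal(size=(200, 8))

d, k = X.shape[1], 2                        # input dim, bottleneck dim
W_enc = rng.normal(scale=0.1, size=(d, k))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, d))  # decoder weights
lr = 0.01

def recon_mse(X, W_enc, W_dec):
    Z = X @ W_enc          # encode: compress to k-dim code
    X_hat = Z @ W_dec      # decode: reconstruct the features
    return np.mean((X - X_hat) ** 2)

initial = recon_mse(X, W_enc, W_dec)
n = X.shape[0]
for _ in range(500):
    Z = X @ W_enc
    err = Z @ W_dec - X                       # reconstruction error, (n, d)
    grad_dec = Z.T @ err * (2 / (n * d))      # dL/dW_dec
    grad_enc = X.T @ (err @ W_dec.T) * (2 / (n * d))  # dL/dW_enc
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final = recon_mse(X, W_enc, W_dec)
print(f"reconstruction MSE: {initial:.3f} -> {final:.3f}")
```

For skewed targets such as creep resistance, the learned code `X @ W_enc` would feed a regressor, often paired with a variance-stabilizing target transform (e.g. `log1p`); that pairing is a common practice, not a claim about this paper's exact pipeline.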


Journal metrics: CiteScore 2.80; self-citation rate 0.00%.