Role of High Fidelity Vs. Low Fidelity Experimental Data in Machine Learning Model Performance for Predicting Polymer Solubility.

IF 4.3 | JCR Q2 (Polymer Science) | CAS Tier 3 (Chemistry)
Mona Amrihesari, Manali Banerjee, Raul Olmedo, Blair Brettmann
{"title":"Role of High Fidelity Vs. Low Fidelity Experimental Data in Machine Learning Model Performance for Predicting Polymer Solubility.","authors":"Mona Amrihesari, Manali Banerjee, Raul Olmedo, Blair Brettmann","doi":"10.1002/marc.202500454","DOIUrl":null,"url":null,"abstract":"<p><p>Reliable classification of polymer-solvent compatibility is essential for solution formulation and materials discovery. Applying machine learning (ML) and artificial intelligence to this task is of growing interest in polymer science, but the effectiveness of such models depends on the quality/nature of the training data. This study evaluates how experimental data fidelity, as set by the experimental method, influences ML model performance by comparing classifiers trained on two experimental datasets: one generated from turbidity-based measurements using a Crystal16 parallel crystallizer as a high-fidelity source and another derived from visual solubility inspection as a low-fidelity dataset. Both datasets were encoded using one-hot encoding for polymers and Morgan fingerprints for solvents and modeled using XGBoost classifiers to predict solubility labels as soluble, insoluble, and partially soluble. Confusion matrices showed that models trained on high-fidelity data better captured partially soluble behavior and more clearly distinguished between classes, highlighting the advantage of quantitative measurements over subjective classification. We also found that adding temperature as a feature improved prediction accuracy for the low-fidelity dataset-a key consideration for literature-derived data, which often lacks this information. These findings underscore the importance of experimental rigor and completeness when developing generalizable ML-based tools for polymer solubility prediction.</p>","PeriodicalId":205,"journal":{"name":"Macromolecular Rapid Communications","volume":" ","pages":"e00454"},"PeriodicalIF":4.3000,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Macromolecular Rapid Communications","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1002/marc.202500454","RegionNum":3,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"POLYMER SCIENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Reliable classification of polymer-solvent compatibility is essential for solution formulation and materials discovery. Applying machine learning (ML) and artificial intelligence to this task is of growing interest in polymer science, but the effectiveness of such models depends on the quality and nature of the training data. This study evaluates how experimental data fidelity, as set by the experimental method, influences ML model performance by comparing classifiers trained on two experimental datasets: one generated from turbidity-based measurements using a Crystal16 parallel crystallizer as a high-fidelity source and another derived from visual solubility inspection as a low-fidelity dataset. Both datasets were encoded using one-hot encoding for polymers and Morgan fingerprints for solvents and modeled using XGBoost classifiers to predict solubility labels (soluble, insoluble, or partially soluble). Confusion matrices showed that models trained on high-fidelity data better captured partially soluble behavior and more clearly distinguished between classes, highlighting the advantage of quantitative measurements over subjective classification. We also found that adding temperature as a feature improved prediction accuracy for the low-fidelity dataset, a key consideration for literature-derived data, which often lacks this information. These findings underscore the importance of experimental rigor and completeness when developing generalizable ML-based tools for polymer solubility prediction.
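For readers who want to see the general workflow the abstract describes, the sketch below shows one way such a pipeline could be assembled in Python, assuming RDKit for Morgan fingerprints, scikit-learn for one-hot encoding and the confusion matrix, and the xgboost scikit-learn wrapper. The polymer names, solvent SMILES, labels, temperatures, fingerprint size, and hyperparameters are illustrative assumptions, not the authors' data or settings.

```python
# Minimal sketch of the featurization-and-classification pipeline described above.
# Assumes RDKit, scikit-learn (>= 1.2 for sparse_output), and xgboost are installed.
# Polymer names, SMILES, labels, fingerprint size, and hyperparameters are placeholders.
import numpy as np
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBClassifier

LABELS = {"insoluble": 0, "partially_soluble": 1, "soluble": 2}

# Morgan (circular) fingerprint generator for solvent SMILES strings.
fp_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=1024)

def solvent_fingerprint(smiles: str) -> np.ndarray:
    """Return the Morgan fingerprint of a solvent as a dense bit array."""
    return fp_gen.GetFingerprintAsNumPy(Chem.MolFromSmiles(smiles))

def build_features(records, polymer_encoder, include_temperature=False):
    """Concatenate one-hot polymer identity, solvent fingerprint, and (optionally)
    temperature, the feature the study found helpful for the low-fidelity dataset."""
    poly = polymer_encoder.transform([[r["polymer"]] for r in records])
    solv = np.vstack([solvent_fingerprint(r["solvent_smiles"]) for r in records])
    parts = [poly, solv]
    if include_temperature:
        parts.append(np.array([[r["temperature_C"]] for r in records], dtype=float))
    return np.hstack(parts)

# --- illustrative usage with placeholder records -----------------------------
records = [
    {"polymer": "PVP", "solvent_smiles": "O", "temperature_C": 25.0, "label": "soluble"},
    {"polymer": "PS", "solvent_smiles": "O", "temperature_C": 25.0, "label": "insoluble"},
    {"polymer": "PEG", "solvent_smiles": "CC(=O)C", "temperature_C": 25.0, "label": "partially_soluble"},
    # ... one row per polymer/solvent/temperature observation
]
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoder.fit([[r["polymer"]] for r in records])

X = build_features(records, encoder, include_temperature=True)
y = np.array([LABELS[r["label"]] for r in records])

clf = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
clf.fit(X, y)  # with real data, hold out a test split before fitting
print(confusion_matrix(y, clf.predict(X), labels=[0, 1, 2]))
```

With real data, labels from either the Crystal16 turbidity measurements or visual inspection would replace the placeholder rows, and the confusion matrices compared in the paper would be computed on a held-out test split rather than the training data.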

Source Journal

Macromolecular Rapid Communications (Engineering & Technology - Polymer Science)
CiteScore: 7.70
Self-citation rate: 6.50%
Articles per year: 477
Average review time: 1.4 months

Journal description: Macromolecular Rapid Communications publishes original research in polymer science, ranging from chemistry and physics of polymers to polymers in materials science and life sciences.