The prediction of crystal densities of a big data set using 1D and 2D structure features

Impact factor 2.1 · Zone 4 (Chemistry) · JCR Q3 (CHEMISTRY, MULTIDISCIPLINARY)
Xianlan Li, Dingling Kong, Yue Luan, Lili Guo, Yanhua Lu, Wei Li, Meng Tang, Qingyou Zhang, Aimin Pang
Structural Chemistry, vol. 35, no. 5, pp. 1375-1385. Published 2024-02-06.
DOI: 10.1007/s11224-024-02279-4
https://link.springer.com/article/10.1007/s11224-024-02279-4
Citation count: 0

Abstract

A large data set of more than 30,000 organic compounds containing carbon, nitrogen, oxygen, fluorine, and hydrogen was collected, and the density of each compound was predicted from 1D descriptors derived from its molecular formula and 2D descriptors derived from its constitutional structural features. The 2D structural features comprise Benson’s groups, corrected groups, and 2D structural features of the whole molecular structure. All descriptors were extracted by an in-house Java program, which includes a function ensuring that each atom (or bond) of a molecule is represented by a Benson’s group exactly once in the atom-based (or bond-based) descriptors. Partial least squares (PLS) and random forest (RF) methods were used separately to build density-prediction models, and descriptors were further selected by RF variable importance. For PLS, combining the models built from atom-based and bond-based descriptors gave the best results in this paper: in cross-validation of the training set, the Pearson correlation coefficient (R) = 0.9270, mean absolute error (MAE) = 0.0270 g·cm⁻³, and root mean squared error (RMSE) = 0.0426 g·cm⁻³; in prediction of the test set, R = 0.9454, MAE = 0.0263 g·cm⁻³, and RMSE = 0.0375 g·cm⁻³.
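
To make the terms in the abstract concrete, the following is a minimal Python sketch of two ideas it mentions: a 1D descriptor computed from the molecular formula alone (here simply the per-element atom counts), and the three reported error measures (R, MAE, RMSE). It is not the authors' in-house Java program. The function names, the choice of atom counts as the descriptor, and the toy numbers in the usage section are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Elements named in the abstract; the ordering is an arbitrary choice.
ELEMENTS = ("C", "H", "N", "O", "F")


def formula_counts(formula):
    """Count atoms per element in a flat formula such as 'C7H5N3O6'.

    Hypothetical helper: handles only simple formulas without brackets.
    """
    counts = Counter()
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return {el: counts.get(el, 0) for el in ELEMENTS}


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from the paper.
    observed = [1.0, 2.0, 3.0]
    predicted = [1.1, 1.9, 3.2]
    print(formula_counts("C7H5N3O6"))
    print(pearson_r(observed, predicted))
    print(mae(observed, predicted))
    print(rmse(observed, predicted))
```

In a workflow like the one the abstract describes, vectors such as the output of `formula_counts` would be combined with the 2D group-based descriptors before fitting a PLS or RF model; that step is omitted here because the abstract does not specify it.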

Source journal: Structural Chemistry (Chemistry, Multidisciplinary)
CiteScore: 3.80
Self-citation rate: 11.80%
Articles published: 227
Review time: 3.7 months
Journal description: Structural Chemistry is an international forum for the publication of peer-reviewed original research papers that cover the condensed and gaseous states of matter and involve numerous techniques for the determination of structure and energetics, their results, and the conclusions derived from these studies. The journal overcomes the unnatural separation in the current literature among the areas of structure determination, energetics, and applications, and builds a bridge to other chemical disciplines. Its comprehensive coverage encompasses broad discussion of results, observation of relationships among various properties, and the description and application of structure and energy information in all domains of chemistry. We welcome the broadest range of accounts of research in structural chemistry involving the discussion of methodologies and structures, experimental, theoretical, and computational, and their combinations. We encourage discussions of structural information collected for their chemical and biological significance.