Determining Domain of Machine Learning Models using Kernel Density Estimates: Applications in Materials Property Prediction

Lane E. Schultz, Yiqi Wang, Ryan Jacobs, Dane Morgan
Source: arXiv - PHYS - Other Condensed Matter
Published: 2024-05-28
Identifier: arxiv-2406.05143 (https://doi.org/arxiv-2406.05143)
Citations: 0

Abstract

Knowledge of the domain of applicability of a machine learning model is essential to ensuring accurate and reliable model predictions. In this work, we develop a new approach to assessing model domain and demonstrate that it provides an accurate and meaningful designation of in-domain versus out-of-domain when applied across multiple model types and material property data sets. Our approach uses kernel density estimation to assess the distance in feature space between a test data point and the training data, and we show that this distance provides an effective tool for domain determination. We show that chemical groups considered unrelated based on established chemical knowledge exhibit significant dissimilarities by our measure. We also show that high measures of dissimilarity are associated with poor model performance (i.e., high residual magnitudes) and poor estimates of model uncertainty (i.e., unreliable uncertainty estimation). Automated tools are provided to enable researchers to establish acceptable dissimilarity thresholds and so identify whether new predictions of their own machine learning models are in-domain or out-of-domain.
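
The abstract does not give the exact form of the dissimilarity measure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an isotropic Gaussian kernel with a fixed bandwidth, and it defines dissimilarity as one minus the ratio of the density at a test point to the largest density found at any training point. The function names (gaussian_kde_density, dissimilarity, is_in_domain), the bandwidth, the threshold, and the toy data are all hypothetical choices made for illustration.

```python
import math


def gaussian_kde_density(x, train, bandwidth):
    """Kernel density estimate at x: the mean of isotropic Gaussian
    kernels centered on each training point (assumed kernel choice)."""
    dim = len(x)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (dim / 2.0)
    total = 0.0
    for t in train:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, t))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(train) * norm)


def dissimilarity(x, train, bandwidth):
    """Illustrative dissimilarity in [0, 1]: close to 0 where training
    data are dense, approaching 1 far from all training points.
    This normalization is an assumption, not taken from the paper."""
    reference = max(gaussian_kde_density(t, train, bandwidth) for t in train)
    ratio = gaussian_kde_density(x, train, bandwidth) / reference
    # Clamp so a point in a region denser than any training point maps to 0.
    return max(0.0, 1.0 - ratio)


def is_in_domain(x, train, bandwidth, threshold):
    """Label a point in-domain when its dissimilarity is at or below a
    user-chosen threshold (hypothetical value, to be tuned per model)."""
    return dissimilarity(x, train, bandwidth) <= threshold


# Toy 2-D feature vectors clustered near the origin (made-up data).
train = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1), (0.05, 0.2), (-0.1, -0.15)]
near = (0.0, 0.05)   # inside the training cluster
far = (5.0, 5.0)     # far from every training point

d_near = dissimilarity(near, train, bandwidth=0.3)
d_far = dissimilarity(far, train, bandwidth=0.3)
print("near:", d_near, "in-domain:", is_in_domain(near, train, 0.3, 0.5))
print("far: ", d_far, "in-domain:", is_in_domain(far, train, 0.3, 0.5))
```

In practice the threshold would be chosen by checking, on held-out data, where residual magnitudes and uncertainty estimates start to degrade as dissimilarity grows, which is the association the abstract reports. A library estimator with tuned bandwidth could replace the hand-written density function here.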