Determining Domain of Machine Learning Models using Kernel Density Estimates: Applications in Materials Property Prediction

arXiv - PHYS - Other Condensed Matter | Pub Date: 2024-05-28 | DOI: arxiv-2406.05143

Lane E. Schultz, Yiqi Wang, Ryan Jacobs, Dane Morgan


Citations: 0


Determining Domain of Machine Learning Models using Kernel Density Estimates: Applications in Materials Property Prediction

Knowledge of the domain of applicability of a machine learning model is essential to ensuring accurate and reliable model predictions. In this work, we develop a new approach of assessing model domain and demonstrate that our approach provides accurate and meaningful designation of in-domain versus out-of-domain when applied across multiple model types and material property data sets. Our approach assesses the distance between a test and training data point in feature space by using kernel density estimation and shows that this distance provides an effective tool for domain determination. We show that chemical groups considered unrelated based on established chemical knowledge exhibit significant dissimilarities by our measure. We also show that high measures of dissimilarity are associated with poor model performance (i.e., high residual magnitudes) and poor estimates of model uncertainty (i.e., unreliable uncertainty estimation). Automated tools are provided to enable researchers to establish acceptable dissimilarity thresholds to identify whether new predictions of their own machine learning models are in-domain versus out-of-domain.
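The idea of a density-based dissimilarity with a threshold can be illustrated with a minimal sketch. This is not the authors' tool or their exact definition. It assumes a Gaussian kernel with a fixed bandwidth, defines dissimilarity as one minus the test-point density divided by the highest density found among the training points, and uses a hypothetical threshold of 0.9. The names `gaussian_kde`, `make_dissimilarity`, and `is_in_domain` are invented for illustration.

```python
import math


def gaussian_kde(train, bandwidth):
    """Return a function that estimates the density at a point x
    from the training feature vectors (Gaussian kernel, fixed bandwidth)."""
    n = len(train)
    dim = len(train[0])
    # Normalization constant of an isotropic Gaussian kernel in `dim` dimensions.
    norm = n * (bandwidth * math.sqrt(2.0 * math.pi)) ** dim

    def density(x):
        total = 0.0
        for t in train:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, t))
            total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
        return total / norm

    return density


def make_dissimilarity(train, bandwidth):
    """Assumed definition: 1 - density(x) / max training density, clipped to [0, 1]."""
    density = gaussian_kde(train, bandwidth)
    max_train_density = max(density(t) for t in train)

    def dissimilarity(x):
        return 1.0 - min(density(x) / max_train_density, 1.0)

    return dissimilarity


# Toy 2-D feature vectors standing in for training data.
train = [(0.0, 0.0), (0.1, -0.1), (-0.1, 0.1), (0.2, 0.0), (0.0, 0.2)]
dissimilarity = make_dissimilarity(train, bandwidth=0.5)

THRESHOLD = 0.9  # hypothetical cutoff; in practice it would be tuned per model


def is_in_domain(x):
    """Label a point in-domain when its dissimilarity is below the cutoff."""
    return dissimilarity(x) < THRESHOLD


near = (0.05, 0.05)  # close to the training cloud
far = (5.0, 5.0)     # far from every training point

print(is_in_domain(near), is_in_domain(far))
```

In this sketch a point near the training cloud receives a dissimilarity close to 0, while a distant point approaches 1. A real application would choose the bandwidth and the threshold by checking how residuals and uncertainty estimates degrade as dissimilarity grows, as the abstract describes.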
