An ensemble learning strategy for multi-source hydrogen embrittlement data by introducing missing information

Materials Genome Engineering Advances, Pub Date: 2024-05-01, DOI: 10.1002/mgea.35

Xujie Gong, Ruichao Lei, Ruize Sun, Xue Jiang, Yanjing Su, Yu Yan



Abstract

Accurate and rapid prediction of hydrogen embrittlement performance is critical for metallic materials in service. However, because existing hydrogen embrittlement data come from multiple heterogeneous sources, they have missing values, which makes it impractical to train reliable machine learning models. In this study, we propose an ensemble learning training strategy for missing data based on the AdaBoost algorithm. The method introduces a mask matrix that marks the missing data, so that each round of training generates sub-datasets that take the missing-value information into account. The strategy first trains on a subset of features, chosen from the existing dataset with a selected method, and then iterates, focusing each time on the feature combination with the highest error. A neural network takes the mask matrix of the missing data as input and fits the weight of each base learner. Compared with modeling directly on the highly sparse data, the strategy improved predictive ability by approximately 20%. In tests on new samples, the mean absolute error of the predictions fell from 0.2 to 0.09. The strategy offers good adaptability to the hydrogen embrittlement sensitivity of different sizes and avoids the interference with feature importance that filling in missing data causes.
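The masking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the fixed list of feature subsets, the least-squares base learner, and every function name are assumptions made for illustration. In particular, the sketch replaces the paper's neural network (which maps the mask to base-learner weights) with equal weights over the learners whose features are all observed, and it omits the AdaBoost-style step that picks the next feature combination by error.

```python
import numpy as np

def missing_mask(X):
    """Return a 0/1 matrix: 1 where a value is observed, 0 where it is NaN."""
    return (~np.isnan(X)).astype(float)

def sub_dataset(X, y, features):
    """Keep only rows in which every column in `features` is observed."""
    rows = ~np.isnan(X[:, features]).any(axis=1)
    return X[rows][:, features], y[rows]

def fit_linear(X, y):
    """Least-squares base learner with intercept (stand-in for any regressor)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

def fit_ensemble(X, y, subsets):
    """Train one base learner per feature subset on its complete-case rows."""
    learners = []
    for features in subsets:
        X_sub, y_sub = sub_dataset(X, y, features)
        learners.append((features, fit_linear(X_sub, y_sub)))
    return learners

def predict_ensemble(X, learners):
    """Combine base learners with weights that depend on each row's mask.

    Assumption: equal weights over usable learners, in place of the
    neural network described in the abstract. Every row must have at
    least one usable learner, otherwise the weights are undefined.
    """
    M = missing_mask(X)
    preds = np.zeros((len(X), len(learners)))
    usable = np.zeros((len(X), len(learners)))
    for j, (features, coef) in enumerate(learners):
        ok = M[:, features].all(axis=1)          # all features observed?
        usable[:, j] = ok
        preds[ok, j] = predict_linear(coef, X[ok][:, features])
    weights = usable / usable.sum(axis=1, keepdims=True)
    return (weights * preds).sum(axis=1)

# Synthetic example (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2]
X[::4, 2] = np.nan     # knock out some entries of feature 2
X[1::5, 1] = np.nan    # and some entries of feature 1

subsets = [[0], [0, 1], [0, 2], [0, 1, 2]]   # feature 0 is always observed
learners = fit_ensemble(X, y, subsets)
pred = predict_ensemble(X, learners)
print("MAE on training rows:", np.mean(np.abs(pred - y)))
```

Because every base learner sees only observed values, no missing entry is ever filled in, which is the property the abstract credits with avoiding distorted feature importance.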


