A Comparative Study of Penalized Regression and Machine Learning Algorithms in High Dimensional Scenarios

SIAM Undergraduate Research Online | Pub Date: 2023-01-01 | DOI: 10.1137/22s1538302

Connor Shrader, Gabriel Ackall


Citations: 0

Abstract



With the prevalence of big data in recent years, the importance of modeling high dimensional data and selecting important features has increased greatly. High dimensional data is common in many fields such as genome decoding, rare disease identification, and environmental modeling. However, most traditional regression machine learning models are not designed to handle high dimensional data or conduct variable selection. In this paper, we investigate penalized regression methods such as ridge, least absolute shrinkage and selection operator (LASSO), elastic net, smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP), and compare them with traditional machine learning models such as random forest, XGBoost, and support vector machines. We compare these models using factorial design methods for Monte Carlo simulations in 540 environments, with the factors being the response variable, number of predictors, number of samples, signal-to-noise ratio, covariance matrix, and correlation strength. We also compare the models on empirical data to evaluate their viability in real-world scenarios. We evaluate the models using the training and test mean squared error, variable selection accuracy, β-sensitivity, and β-specificity. We found that the performance of penalized regression models is comparable with that of traditional machine learning algorithms in most high-dimensional situations. The analysis helps to create a greater understanding of the strengths and weaknesses of each model type and provides a reference for other researchers on which machine learning techniques to use, depending on a range of factors and data environments. Our study shows that penalized regression techniques should be included in the predictive modeler's toolbox.
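The abstract names β-sensitivity and β-specificity as evaluation metrics but does not define them. The sketch below shows one common reading, which is an assumption rather than the authors' stated definition: sensitivity is the share of truly nonzero coefficients that the fitted model keeps nonzero, and specificity is the share of truly zero coefficients that the fitted model sets to zero. The function names, the `tol` threshold, and the example coefficient vectors are all hypothetical.

```python
def beta_sensitivity(true_beta, est_beta, tol=1e-8):
    """Share of truly nonzero coefficients that are also estimated as nonzero.

    Assumed definition; `tol` is a hypothetical threshold for "zero".
    """
    kept = [abs(e) > tol for t, e in zip(true_beta, est_beta) if abs(t) > tol]
    return sum(kept) / len(kept) if kept else float("nan")


def beta_specificity(true_beta, est_beta, tol=1e-8):
    """Share of truly zero coefficients that are also estimated as zero.

    Assumed definition; `tol` is a hypothetical threshold for "zero".
    """
    dropped = [abs(e) <= tol for t, e in zip(true_beta, est_beta) if abs(t) <= tol]
    return sum(dropped) / len(dropped) if dropped else float("nan")


# Illustrative (made-up) coefficients: three true signals, three true zeros.
true_beta = [2.0, -1.5, 0.0, 0.0, 0.0, 1.0]
est_beta = [1.8, 0.0, 0.0, 0.3, 0.0, 0.9]

sens = beta_sensitivity(true_beta, est_beta)  # 2 of 3 true signals are kept
spec = beta_specificity(true_beta, est_beta)  # 2 of 3 true zeros stay at zero
print(f"beta-sensitivity = {sens:.3f}, beta-specificity = {spec:.3f}")
```

Under this reading, a model such as ridge that never sets coefficients exactly to zero would score perfect sensitivity and zero specificity, which is one reason both metrics are usually reported together.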
