Lasso and Group Lasso with Categorical Predictors: Impact of Coding Strategy on Variable Selection and Prediction

Y. Huang, A. Montoya
Journal of Behavioral Data Science. Published 2020-05-05. DOI: 10.31234/osf.io/wc45u (https://doi.org/10.31234/osf.io/wc45u)
Citations: 3

Abstract

Machine learning methods are increasingly being adopted in psychological research. Lasso performs variable selection and regularization, and is particularly appealing to psychology researchers because of its connection to linear regression. Researchers conflate properties of linear regression with properties of lasso; however, we demonstrate that the two do not coincide for models with categorical predictors. Specifically, the coding strategy used for categorical predictors affects the performance of lasso but not that of linear regression. Group lasso is an alternative to lasso for models with categorical predictors. We demonstrate the inconsistency of lasso and group lasso models using a real data set: lasso selects different variables and has different prediction accuracy depending on the coding strategy, whereas group lasso selects variables consistently but still has different prediction accuracy. Additionally, group lasso may include many predictors when very few are needed, leading to overfitting. Using Monte Carlo simulation, we show that categorical variables in which one group mean differs from all others (one dominant group) are more likely to be included in the model by group lasso than by lasso, leading to overfitting. This effect is strongest when the mean difference is large and there are many categories. Researchers primarily focus on the similarity between linear regression and lasso and pay little attention to their differences. This project demonstrates that the effect of coding strategy should be considered when using lasso and group lasso. We conclude with recommended solutions to this issue and future directions of exploration to improve the implementation of machine learning approaches in psychological science.
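
The claim that coding strategy matters for lasso but not for linear regression can be illustrated with a small sketch. This is not the authors' analysis: the toy data, the penalty value `lam=0.2`, the objective scaling `(1/(2n)) * RSS + lam * sum|b_j|` with an unpenalized intercept, and the helper names `dummy_code` and `fit_lasso` are all assumptions made for illustration. One three-level factor is dummy coded twice, once with level A and once with level C as the reference category, and each design is fitted with a plain coordinate-descent lasso.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def dummy_code(groups, reference):
    """Hypothetical helper: one 0/1 column per non-reference level."""
    levels = [g for g in sorted(set(groups)) if g != reference]
    return [[1.0 if g == level else 0.0 for level in levels] for g in groups]


def fit_lasso(X, y, lam, sweeps=2000):
    """Coordinate descent for (1/(2n))*RSS + lam*sum|b_j|.

    The intercept is not penalized. Returns the fitted values;
    lam=0 reduces to ordinary least squares.
    """
    n, p = len(y), len(X[0])
    intercept, coefs = 0.0, [0.0] * p
    for _ in range(sweeps):
        # Intercept update: mean of the residuals that exclude the intercept.
        intercept = sum(
            y[i] - sum(X[i][j] * coefs[j] for j in range(p)) for i in range(n)
        ) / n
        for j in range(p):
            # Covariance of column j with the residual that excludes column j.
            rho = 0.0
            for i in range(n):
                others = sum(X[i][k] * coefs[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - intercept - others)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            coefs[j] = soft_threshold(rho, lam) / scale
    return [intercept + sum(X[i][j] * coefs[j] for j in range(p)) for i in range(n)]


# Toy data (assumed): groups A and B share a mean of 1, group C has a mean of 4.
groups = ["A", "A", "B", "B", "C", "C"]
y = [0.9, 1.1, 1.2, 0.8, 4.1, 3.9]

fits = {}
for reference in ("A", "C"):
    X = dummy_code(groups, reference)
    fits[reference] = {
        "ols": fit_lasso(X, y, lam=0.0),
        "lasso": fit_lasso(X, y, lam=0.2),
    }
    print("reference", reference)
    print("  OLS fit:  ", [round(v, 3) for v in fits[reference]["ols"]])
    print("  lasso fit:", [round(v, 3) for v in fits[reference]["lasso"]])
```

In this sketch the unpenalized fits reproduce the group means under both codings, while the penalized fits differ. With reference A only the C contrast is shrunk; with reference C both remaining contrasts are shrunk, and they are shrunk toward the mean of the dominant group. No standardization of the dummy columns is applied here, which is a further simplifying assumption.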