A comparison of model choice strategies for logistic regression

IF 1.5 | Zone 3 (Management) | JCR Q2: INFORMATION SCIENCE & LIBRARY SCIENCE

Journal of Data and Information Science | Pub Date: 2024-01-25 | DOI: 10.2478/jdis-2024-0001

Markku Karhunen

Citations: 0

Abstract

A comparison of model choice strategies for logistic regression

Purpose: The purpose of this study is to develop and compare model choice strategies in the context of logistic regression. Model choice means the choice of the covariates to be included in the model.

Design/methodology/approach: The study is based on Monte Carlo simulations. The methods are compared in terms of three measures of accuracy: specificity and two kinds of sensitivity. A loss function combining sensitivity and specificity is introduced and used for a final comparison.

Findings: The choice of method depends on how heavily users weigh sensitivity against specificity. It also depends on the sample size. For a typical logistic regression setting with a moderate sample size and a small to moderate effect size, BIC, BICc or Lasso seems to be optimal.

Research limitations: Numerical simulations cannot cover the whole range of data-generating processes occurring with real-world data. Thus, more simulations are needed.

Practical implications: Researchers can refer to these results if they believe that their data-generating process is somewhat similar to some of the scenarios presented in this paper. Alternatively, they could run their own simulations and calculate the loss function.

Originality/value: This is a systematic comparison of model choice algorithms and heuristics in the context of logistic regression. The distinction between two types of sensitivity and a comparison based on a loss function are methodological novelties.
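To make the workflow in the abstract concrete, the following is a minimal sketch of one Monte Carlo comparison, written under several assumptions that are not taken from the paper. The abstract does not give the definitions of the two kinds of sensitivity or the form of the loss function, so the sketch uses a single plain sensitivity (share of truly active covariates that are selected), a plain specificity (share of inactive covariates that are left out), and a hypothetical weighted loss `w * (1 - sensitivity) + (1 - w) * (1 - specificity)`. The selector is exhaustive-search BIC; the function names, the effect sizes, and the sample size are all illustrative.

```python
import itertools
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson and return the log-likelihood."""
    Xd = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (y - p)
        # Tiny ridge term keeps the Hessian invertible in degenerate cases.
        hess = Xd.T @ (Xd * (p * (1.0 - p))[:, None]) + 1e-8 * np.eye(Xd.shape[1])
        beta += np.linalg.solve(hess, grad)
    p = np.clip(1.0 / (1.0 + np.exp(-Xd @ beta)), 1e-12, 1 - 1e-12)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))


def select_by_bic(X, y):
    """Return the covariate subset with the lowest BIC (exhaustive search)."""
    n, p = X.shape
    best_bic, best_subset = np.inf, ()
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            loglik = fit_logistic(X[:, list(subset)], y)
            bic = -2.0 * loglik + (k + 1) * np.log(n)  # k slopes + intercept
            if bic < best_bic:
                best_bic, best_subset = bic, subset
    return set(best_subset)


def selection_accuracy(selected, true_set, p):
    """Sensitivity and specificity of a selected covariate set (assumed definitions)."""
    null_set = set(range(p)) - true_set
    sensitivity = len(selected & true_set) / len(true_set)
    specificity = len(null_set - selected) / len(null_set)
    return sensitivity, specificity


def weighted_loss(sensitivity, specificity, w=0.5):
    """Hypothetical loss; w is the weight the user places on sensitivity."""
    return w * (1.0 - sensitivity) + (1.0 - w) * (1.0 - specificity)


def simulate(n=300, beta=(1.0, 0.5, 0.0, 0.0), n_rep=50, seed=0):
    """Average sensitivity and specificity of BIC selection over simulated data sets."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    true_set = {int(j) for j in np.flatnonzero(beta)}
    results = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, len(beta)))
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))
        results.append(selection_accuracy(select_by_bic(X, y), true_set, len(beta)))
    sens, spec = np.mean(results, axis=0)
    return float(sens), float(spec)


if __name__ == "__main__":
    sens, spec = simulate()
    print("sensitivity:", sens, "specificity:", spec)
    print("loss (w=0.5):", weighted_loss(sens, spec, w=0.5))
```

Changing `w` shifts the comparison toward methods that favor sensitivity or specificity, which is the kind of trade-off the Findings describe. Exhaustive search is only feasible for a handful of covariates; with more, a stepwise or penalized method would have to be swapped in.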

Source journal

Journal of Data and Information Science (INFORMATION SCIENCE & LIBRARY SCIENCE)

CiteScore: 3.50
Self-citation rate: 6.70%
Articles published: 495

Journal description: JDIS devotes itself to the study and application of theories, methods, techniques, services, and infrastructural facilities that use big data to support knowledge discovery for decision and policy making. Its basic emphasis is big-data-based, analytics-centered, knowledge-discovery-driven, and decision-supporting. Special effort goes into knowledge discovery that detects and predicts structures, trends, behaviors, relations, evolutions and disruptions in research, innovation, business, politics, security, media and communications, and social development, where the big data may include metadata or full-content data, textual or non-textual data, structured or unstructured data, domain-specific or cross-domain data, and dynamic or interactive data.

The main areas of interest are: (1) New theories, methods, and techniques of big-data-based data mining, knowledge discovery, and informatics, including but not limited to scientometrics, communication analysis, social network analysis, tech and industry analysis, competitive intelligence, knowledge mapping, evidence-based policy analysis, and predictive analysis. (2) New methods, architectures, and facilities to develop or improve knowledge infrastructure capable of supporting knowledge organization and sophisticated analytics, including but not limited to ontology construction, knowledge organization, semantic linked data, knowledge integration and fusion, semantic retrieval, domain-specific knowledge infrastructure, and semantic sciences. (3) New mechanisms, methods, and tools to embed knowledge analytics and knowledge discovery into actual operation, service, or managerial processes, including but not limited to knowledge-assisted scientific discovery and data-mining-driven intelligent workflows in learning, communications, and management.

Specific topic areas may include: Knowledge organization; Knowledge discovery and data mining; Knowledge integration and fusion; Semantic Web metrics; Scientometrics; Analytic and diagnostic informetrics; Competitive intelligence; Predictive analysis; Social network analysis and metrics; Semantic and interactively analytic retrieval; Evidence-based policy analysis; Intelligent knowledge production; Knowledge-driven workflow management and decision-making; Knowledge-driven collaboration and its management; Domain knowledge infrastructure with knowledge fusion and analytics; Development of data and information services.