{"title":"逻辑回归的模型选择策略比较","authors":"Markku Karhunen","doi":"10.2478/jdis-2024-0001","DOIUrl":null,"url":null,"abstract":"Purpose The purpose of this study is to develop and compare model choice strategies in context of logistic regression. Model choice means the choice of the covariates to be included in the model. Design/methodology/approach The study is based on Monte Carlo simulations. The methods are compared in terms of three measures of accuracy: specificity and two kinds of sensitivity. A loss function combining sensitivity and specificity is introduced and used for a final comparison. Findings The choice of method depends on how much the users emphasize sensitivity against specificity. It also depends on the sample size. For a typical logistic regression setting with a moderate sample size and a small to moderate effect size, either BIC, BICc or Lasso seems to be optimal. Research limitations Numerical simulations cannot cover the whole range of data-generating processes occurring with real-world data. Thus, more simulations are needed. Practical implications Researchers can refer to these results if they believe that their data-generating process is somewhat similar to some of the scenarios presented in this paper. Alternatively, they could run their own simulations and calculate the loss function. Originality/value This is a systematic comparison of model choice algorithms and heuristics in context of logistic regression. The distinction between two types of sensitivity and a comparison based on a loss function are methodological novelties.","PeriodicalId":44622,"journal":{"name":"Journal of Data and Information Science","volume":"24 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2024-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A comparison of model choice strategies for logistic regression\",\"authors\":\"Markku Karhunen\",\"doi\":\"10.2478/jdis-2024-0001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Purpose The purpose of this study is to develop and compare model choice strategies in context of logistic regression. Model choice means the choice of the covariates to be included in the model. Design/methodology/approach The study is based on Monte Carlo simulations. The methods are compared in terms of three measures of accuracy: specificity and two kinds of sensitivity. A loss function combining sensitivity and specificity is introduced and used for a final comparison. Findings The choice of method depends on how much the users emphasize sensitivity against specificity. It also depends on the sample size. For a typical logistic regression setting with a moderate sample size and a small to moderate effect size, either BIC, BICc or Lasso seems to be optimal. Research limitations Numerical simulations cannot cover the whole range of data-generating processes occurring with real-world data. Thus, more simulations are needed. Practical implications Researchers can refer to these results if they believe that their data-generating process is somewhat similar to some of the scenarios presented in this paper. Alternatively, they could run their own simulations and calculate the loss function. Originality/value This is a systematic comparison of model choice algorithms and heuristics in context of logistic regression. The distinction between two types of sensitivity and a comparison based on a loss function are methodological novelties.\",\"PeriodicalId\":44622,\"journal\":{\"name\":\"Journal of Data and Information Science\",\"volume\":\"24 1\",\"pages\":\"\"},\"PeriodicalIF\":1.5000,\"publicationDate\":\"2024-01-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Data and Information Science\",\"FirstCategoryId\":\"91\",\"ListUrlMain\":\"https://doi.org/10.2478/jdis-2024-0001\",\"RegionNum\":3,\"RegionCategory\":\"管理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"INFORMATION SCIENCE & LIBRARY SCIENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Data and Information Science","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.2478/jdis-2024-0001","RegionNum":3,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"INFORMATION SCIENCE & LIBRARY SCIENCE","Score":null,"Total":0}
A comparison of model choice strategies for logistic regression
Purpose The purpose of this study is to develop and compare model choice strategies in context of logistic regression. Model choice means the choice of the covariates to be included in the model. Design/methodology/approach The study is based on Monte Carlo simulations. The methods are compared in terms of three measures of accuracy: specificity and two kinds of sensitivity. A loss function combining sensitivity and specificity is introduced and used for a final comparison. Findings The choice of method depends on how much the users emphasize sensitivity against specificity. It also depends on the sample size. For a typical logistic regression setting with a moderate sample size and a small to moderate effect size, either BIC, BICc or Lasso seems to be optimal. Research limitations Numerical simulations cannot cover the whole range of data-generating processes occurring with real-world data. Thus, more simulations are needed. Practical implications Researchers can refer to these results if they believe that their data-generating process is somewhat similar to some of the scenarios presented in this paper. Alternatively, they could run their own simulations and calculate the loss function. Originality/value This is a systematic comparison of model choice algorithms and heuristics in context of logistic regression. The distinction between two types of sensitivity and a comparison based on a loss function are methodological novelties.
期刊介绍:
JDIS devotes itself to the study and application of the theories, methods, techniques, services, infrastructural facilities using big data to support knowledge discovery for decision & policy making. The basic emphasis is big data-based, analytics centered, knowledge discovery driven, and decision making supporting. The special effort is on the knowledge discovery to detect and predict structures, trends, behaviors, relations, evolutions and disruptions in research, innovation, business, politics, security, media and communications, and social development, where the big data may include metadata or full content data, text or non-textural data, structured or non-structural data, domain specific or cross-domain data, and dynamic or interactive data.
The main areas of interest are:
(1) New theories, methods, and techniques of big data based data mining, knowledge discovery, and informatics, including but not limited to scientometrics, communication analysis, social network analysis, tech & industry analysis, competitive intelligence, knowledge mapping, evidence based policy analysis, and predictive analysis.
(2) New methods, architectures, and facilities to develop or improve knowledge infrastructure capable to support knowledge organization and sophisticated analytics, including but not limited to ontology construction, knowledge organization, semantic linked data, knowledge integration and fusion, semantic retrieval, domain specific knowledge infrastructure, and semantic sciences.
(3) New mechanisms, methods, and tools to embed knowledge analytics and knowledge discovery into actual operation, service, or managerial processes, including but not limited to knowledge assisted scientific discovery, data mining driven intelligent workflows in learning, communications, and management.
Specific topic areas may include:
Knowledge organization
Knowledge discovery and data mining
Knowledge integration and fusion
Semantic Web metrics
Scientometrics
Analytic and diagnostic informetrics
Competitive intelligence
Predictive analysis
Social network analysis and metrics
Semantic and interactively analytic retrieval
Evidence-based policy analysis
Intelligent knowledge production
Knowledge-driven workflow management and decision-making
Knowledge-driven collaboration and its management
Domain knowledge infrastructure with knowledge fusion and analytics
Development of data and information services