Linear Probability Models (LPM) and Big Data: The Good, the Bad, and the Ugly

S. Chatla, Galit Shmueli
ERN: Other Econometrics: Data Collection & Data Estimation Methodology (Topic), published 2016-10-11
DOI: 10.2139/ssrn.2353841 (https://doi.org/10.2139/ssrn.2353841)
Citations: 18

Abstract

Linear regression is among the most popular statistical models in social sciences research. Linear probability models (LPMs) - linear regression models applied to a binary outcome - are used in various disciplines. Surprisingly, LPMs are rare in the information systems (IS) literature, where logit and probit models are typically used for binary outcomes. LPMs have been examined with respect to specific aspects, but a thorough evaluation of their practical pros and cons for different research goals under different scenarios is missing. We perform an extensive simulation study to evaluate the advantages and dangers of LPMs, especially in the realm of Big Data that now affects IS research. We evaluate LPM for the three common uses of binary outcome models: inference and estimation, prediction and classification, and selection bias. We compare its performance to logit and probit under different sample sizes, error distributions, and more. We find that LPM coefficient directions, statistical significance, and marginal effects are similar to those from logit and probit. Although LPM coefficients are biased, they are consistent for the true parameters up to a multiplicative scalar. Coefficient bias can be corrected by assuming an error distribution. For classification and selection bias, LPM is on par with logit and probit in terms of class separation and ranking, and is a viable alternative in selection models. LPM falls short when the predicted probabilities themselves are of interest, because they can fall outside the unit interval. We illustrate some of these results by modeling price in online auctions, using data from eBay.
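Two of the abstract's claims can be illustrated with a small simulation. The sketch below is not the authors' simulation design: the sample size, the single standard-normal predictor, and the coefficient values are assumptions chosen only for illustration, and the helper names `simulate` and `fit_lpm` are hypothetical. It draws a binary outcome from a logit model, fits an LPM by ordinary least squares, and then compares the LPM slope with the average marginal effect implied by the true logit probabilities (no logit model is estimated here) and counts the fitted values that fall outside the unit interval.

```python
import math
import random

def simulate(n, beta0, beta1, seed=0):
    """Draw (x, y, p) from a logit model: P(y=1|x) = 1 / (1 + exp(-(beta0 + beta1*x)))."""
    rng = random.Random(seed)
    xs, ys, ps = [], [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # assumed predictor distribution
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        xs.append(x)
        ps.append(p)
        ys.append(1 if rng.random() < p else 0)
    return xs, ys, ps

def fit_lpm(xs, ys):
    """Closed-form OLS of a binary y on one predictor x; returns (intercept, slope)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

BETA0, BETA1 = 0.0, 2.0  # assumed true logit coefficients
xs, ys, ps = simulate(n=5000, beta0=BETA0, beta1=BETA1)
intercept, slope = fit_lpm(xs, ys)

# Average marginal effect under the true logit model: beta1 * mean(p * (1 - p)).
ame = BETA1 * sum(p * (1.0 - p) for p in ps) / len(ps)

# LPM fitted "probabilities" are linear in x, so nothing keeps them in [0, 1].
fitted = [intercept + slope * x for x in xs]
outside = sum(1 for f in fitted if f < 0.0 or f > 1.0)

print(f"LPM slope:                     {slope:.3f}")
print(f"logit average marginal effect: {ame:.3f}")
print(f"true logit coefficient:        {BETA1:.3f}")
print(f"fitted values outside [0, 1]:  {outside} of {len(xs)}")
```

In this setup the LPM slope has the same sign as the logit coefficient but sits on a different scale, tracking the average marginal effect rather than the coefficient itself, while a share of the fitted values lands below 0 or above 1. That is the pattern the abstract describes: usable for direction and marginal effects, unreliable when the predicted probabilities themselves are the target.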