{"title":"This is not normal! (Re-) Evaluating the lower $n$ guildelines for regression analysis","authors":"David Randahl","doi":"arxiv-2409.06413","DOIUrl":null,"url":null,"abstract":"The commonly cited rule of thumb for regression analysis, which suggests that\na sample size of $n \\geq 30$ is sufficient to ensure valid inferences, is\nfrequently referenced but rarely scrutinized. This research note evaluates the\nlower bound for the number of observations required for regression analysis by\nexploring how different distributional characteristics, such as skewness and\nkurtosis, influence the convergence of t-values to the t-distribution in linear\nregression models. Through an extensive simulation study involving over 22\nbillion regression models, this paper examines a range of symmetric,\nplatykurtic, and skewed distributions, testing sample sizes from 4 to 10,000.\nThe results reveal that it is sufficient that either the dependent or\nindependent variable follow a symmetric distribution for the t-values to\nconverge to the t-distribution at much smaller sample sizes than $n=30$. This\nis contrary to previous guidance which suggests that the error term needs to be\nnormally distributed for this convergence to happen at low $n$. On the other\nhand, if both dependent and independent variables are highly skewed the\nrequired sample size is substantially higher. In cases of extreme skewness,\neven sample sizes of 10,000 do not ensure convergence. These findings suggest\nthat the $n\\geq30$ rule is too permissive in certain cases but overly\nconservative in others, depending on the underlying distributional\ncharacteristics. This study offers revised guidelines for determining the\nminimum sample size necessary for valid regression analysis.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"28 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Methodology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06413","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound on the number of observations required for regression analysis by exploring how distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results reveal that it is sufficient for either the dependent or the independent variable to follow a symmetric distribution for the t-values to converge to the t-distribution at sample sizes much smaller than $n = 30$. This is contrary to previous guidance, which suggests that the error term must be normally distributed for this convergence to occur at low $n$. On the other hand, if both the dependent and independent variables are highly skewed, the required sample size is substantially higher; in cases of extreme skewness, even sample sizes of 10,000 do not ensure convergence. These findings suggest that the $n \geq 30$ rule is too permissive in certain cases but overly conservative in others, depending on the underlying distributional characteristics. This study offers revised guidelines for determining the minimum sample size necessary for valid regression analysis.
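
To make the simulation design concrete, below is a minimal sketch of the kind of experiment the abstract describes: regressions are simulated under the null of no relationship between the variables, with the dependent and independent variables drawn from chosen distributions, and the resulting slope t-statistics are checked against the reference t-distribution. The specific distributions, repetition count, and convergence diagnostics here are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch (assumed setup, not the paper's exact protocol):
# simulate y ~ x under the null, collect slope t-statistics, and
# compare them to a t-distribution with n - 2 degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def null_t_values(x_dist, y_dist, n, reps=5000):
    """Fit y ~ x with x and y drawn independently (so the true slope
    is zero) and return the t-statistic of the slope for each run."""
    t_vals = np.empty(reps)
    for i in range(reps):
        x = x_dist(rng, n)
        y = y_dist(rng, n)
        slope, intercept, r, p, se = stats.linregress(x, y)
        t_vals[i] = slope / se
    return t_vals

# Example distributions: symmetric (normal) vs. highly skewed (lognormal).
normal = lambda rng, n: rng.normal(size=n)
skewed = lambda rng, n: rng.lognormal(sigma=2.0, size=n)

for n in (10, 30, 100):
    t_vals = null_t_values(skewed, skewed, n)
    # Under valid inference these t-values follow a t-distribution with
    # n - 2 degrees of freedom; a KS statistic quantifies the discrepancy.
    ks = stats.kstest(t_vals, stats.t(df=n - 2).cdf)
    # Empirical size of a nominal 5% two-sided test of the slope.
    crit = stats.t(df=n - 2).ppf(0.975)
    size = np.mean(np.abs(t_vals) > crit)
    print(f"n={n:5d}  KS={ks.statistic:.3f}  empirical 5% size={size:.3f}")
```

If the abstract's findings hold, swapping one of the skewed draws for the symmetric one should bring the KS statistic and the empirical rejection rate close to their nominal values even at small $n$, while the skewed-skewed pairing should show a discrepancy that shrinks only slowly as $n$ grows.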