{"title":"This is not normal! (Re-) Evaluating the lower $n$ guildelines for regression analysis","authors":"David Randahl","doi":"arxiv-2409.06413","DOIUrl":null,"url":null,"abstract":"The commonly cited rule of thumb for regression analysis, which suggests that\na sample size of $n \\geq 30$ is sufficient to ensure valid inferences, is\nfrequently referenced but rarely scrutinized. This research note evaluates the\nlower bound for the number of observations required for regression analysis by\nexploring how different distributional characteristics, such as skewness and\nkurtosis, influence the convergence of t-values to the t-distribution in linear\nregression models. Through an extensive simulation study involving over 22\nbillion regression models, this paper examines a range of symmetric,\nplatykurtic, and skewed distributions, testing sample sizes from 4 to 10,000.\nThe results reveal that it is sufficient that either the dependent or\nindependent variable follow a symmetric distribution for the t-values to\nconverge to the t-distribution at much smaller sample sizes than $n=30$. This\nis contrary to previous guidance which suggests that the error term needs to be\nnormally distributed for this convergence to happen at low $n$. On the other\nhand, if both dependent and independent variables are highly skewed the\nrequired sample size is substantially higher. In cases of extreme skewness,\neven sample sizes of 10,000 do not ensure convergence. These findings suggest\nthat the $n\\geq30$ rule is too permissive in certain cases but overly\nconservative in others, depending on the underlying distributional\ncharacteristics. This study offers revised guidelines for determining the\nminimum sample size necessary for valid regression analysis.","PeriodicalId":501425,"journal":{"name":"arXiv - STAT - Methodology","volume":"28 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Methodology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.06413","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
The commonly cited rule of thumb for regression analysis, which suggests that a sample size of $n \geq 30$ is sufficient to ensure valid inferences, is frequently referenced but rarely scrutinized. This research note evaluates the lower bound on the number of observations required for regression analysis by exploring how distributional characteristics, such as skewness and kurtosis, influence the convergence of t-values to the t-distribution in linear regression models. Through an extensive simulation study involving over 22 billion regression models, this paper examines a range of symmetric, platykurtic, and skewed distributions, testing sample sizes from 4 to 10,000. The results reveal that it is sufficient for either the dependent or the independent variable to follow a symmetric distribution for the t-values to converge to the t-distribution at sample sizes much smaller than $n = 30$. This is contrary to previous guidance, which suggests that the error term must be normally distributed for this convergence to occur at low $n$. On the other hand, if both the dependent and independent variables are highly skewed, the required sample size is substantially higher; in cases of extreme skewness, even sample sizes of 10,000 do not ensure convergence. These findings suggest that the $n \geq 30$ rule is too permissive in certain cases but overly conservative in others, depending on the underlying distributional characteristics. This study offers revised guidelines for determining the minimum sample size necessary for valid regression analysis.
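
To make the simulation design concrete, below is a minimal sketch of the kind of experiment the abstract describes: regressions are simulated under the null of no relationship between the variables, with the dependent and independent variables drawn from chosen distributions, and the resulting slope t-statistics are checked against the reference t-distribution. The specific distributions, repetition count, and convergence diagnostics here are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch (assumed setup, not the paper's exact protocol):
# simulate y ~ x under the null, collect slope t-statistics, and
# compare them to a t-distribution with n - 2 degrees of freedom.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def null_t_values(x_dist, y_dist, n, reps=5000):
    """Fit y ~ x with x and y drawn independently (so the true slope
    is zero) and return the t-statistic of the slope for each run."""
    t_vals = np.empty(reps)
    for i in range(reps):
        x = x_dist(rng, n)
        y = y_dist(rng, n)
        slope, intercept, r, p, se = stats.linregress(x, y)
        t_vals[i] = slope / se
    return t_vals

# Example distributions: symmetric (normal) vs. highly skewed (lognormal).
normal = lambda rng, n: rng.normal(size=n)
skewed = lambda rng, n: rng.lognormal(sigma=2.0, size=n)

for n in (10, 30, 100):
    t_vals = null_t_values(skewed, skewed, n)
    # Under valid inference these t-values follow a t-distribution with
    # n - 2 degrees of freedom; a KS statistic quantifies the discrepancy.
    ks = stats.kstest(t_vals, stats.t(df=n - 2).cdf)
    # Empirical size of a nominal 5% two-sided test of the slope.
    crit = stats.t(df=n - 2).ppf(0.975)
    size = np.mean(np.abs(t_vals) > crit)
    print(f"n={n:5d}  KS={ks.statistic:.3f}  empirical 5% size={size:.3f}")
```

If the abstract's findings hold, swapping one of the skewed draws for the symmetric one should bring the KS statistic and the empirical rejection rate close to their nominal values even at small $n$, while the skewed-skewed pairing should show a discrepancy that shrinks only slowly as $n$ grows.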