Stata, Fast and Slow: Why Running Many Small Regressions in a Large Dataset Takes So Long; and What to Do About It
P. Geertsema
PSN: Econometrics, published 2014-04-11
DOI: 10.2139/ssrn.2423171 (https://doi.org/10.2139/ssrn.2423171)
Stata is fast, often very fast. However, when performing regressions on small sub-samples within a large host dataset (more than 1 million observations), performance can deteriorate by many orders of magnitude. For example, an OLS regression on a sub-sample of 100 consecutive observations takes 3.6 seconds in a host dataset with 1 billion observations, but only 3.8 milliseconds in a host dataset with 1000 observations. The difference in performance is due to the mechanism regress uses to mark estimation samples. This performance deterioration has practical implications in finance research, where many variables of interest are themselves estimated via millions of individual OLS regressions within large panel datasets. I suggest an approach that circumvents this issue by using a simple Mata implementation of regress, which I call fastreg. As a test, I estimate daily Fama and French 3-factor betas for individual stocks in the CRSP database from 1923 to 2013 using a 250-day rolling window. In this setting, fastreg is approximately 367 times faster than regress. The code for the fastreg ado-file is included in the Appendix and is open-source licensed under the GNU GPL.
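
The general idea described in the abstract can be illustrated with a minimal sketch. This is not the fastreg code from the paper's Appendix. The helper name ols_rows, the simulated variables y and x, and the host dataset size of 1 million observations are all assumptions chosen for illustration. The sketch times regress on the first 100 observations, then times a small Mata function that reads only those 100 rows and solves the normal equations directly. It assumes the selected rows contain no missing values and returns coefficients only (no standard errors). Timings will vary by machine and Stata version.

```stata
clear all
set obs 1000000                      // assumed host dataset size
set seed 12345
generate double x = rnormal()
generate double y = 1 + 2*x + rnormal()

timer clear

* Built-in regress on a 100-observation sub-sample of the host dataset
timer on 1
regress y x in 1/100
timer off 1

mata:
// Hypothetical helper (not the paper's fastreg): OLS on rows first..last.
// Assumes no missing values in the selected rows.
real colvector ols_rows(string scalar yvar, string scalar xvars, real scalar first, real scalar last)
{
    real matrix X
    real colvector y

    // A 1 x 2 row vector (first, last) is read as an observation range,
    // so only these rows are copied out of the dataset.
    y = st_data((first, last), yvar)
    X = st_data((first, last), tokens(xvars))
    X = X, J(rows(X), 1, 1)          // append the constant as the last column

    // Solve the normal equations: b = (X'X)^-1 X'y
    return(invsym(quadcross(X, X)) * quadcross(X, y))
}
end

* Same sub-sample, computed in Mata without marking an estimation sample
timer on 2
mata: b = ols_rows("y", "x", 1, 100)
timer off 2

mata: b                              // slope first, constant last
timer list
```

Comparing timers 1 and 2 from timer list, and repeating the run with different values in set obs, gives a rough sense of how each approach scales with the size of the host dataset.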