{"title":"A Needle Found: Machine Learning Does Not Significantly Improve Corporate Fraud Detection Beyond a Simple Screen on Sales Growth","authors":"S. Walker","doi":"10.2139/ssrn.3739480","DOIUrl":null,"url":null,"abstract":"Recent papers have been highly promotional of the benefits of machine learning in the detection of corporate fraud. For example, Bao, Ke, Li, Yu, and Zhang (2020) recently published in the Journal of Accounting Research report that their machine learning model increases performance by +75% above the current parsimonious standard in the accounting literature, the financial ratio-based F-Score (Dechow, et al. 2011), when measured at the highest risk levels. They also show that raw variables alone, rather than financial ratios, can achieve this task. However, a quick peak under the hood reveals an issue that, if corrected for, reduces the results to no better than the F-Score. \n \nIn this paper, I create a machine learning model applying the latest in machine learning known as XGBoost to over 100 financial ratios sourced from prior literature. I compare this model to an XGBoost model applying the 28 raw variables suggested by Bao, et al. Additional models are benchmarked include the F-Score, the M-Score (Beneish 1999), the FSD Score based on Benford’s Law (Amiram, et al. 2015), and a simple screen on 4-year sales growth. \n \nA Wilcoxon rank sum test will show that differences between the models at the top 1% of risk are not significantly different. In fact, at this level, the models fail often in any given year. At the top 10% of risk where models produce consistent annual results, advanced methods match the performance of the F-Score, or even a simple univariate screen on sales growth I measure performance using positive predictive values (PPV) also known as precision which measures the likelihood of a fraud case within the top 1% or top 10% list. My XGBoost model outperforms the models at the 1% level, but positive predictive values remain quite low to be of any practical use with PPVs in the 3% range. A discussion will follow to explain what would be required to move positive predicted values beyond the single digits for this research question.","PeriodicalId":198128,"journal":{"name":"Applied Accounting - Practitioner eJournal","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Accounting - Practitioner eJournal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2139/ssrn.3739480","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Recent papers have strongly promoted the benefits of machine learning for detecting corporate fraud. For example, Bao, Ke, Li, Yu, and Zhang (2020), recently published in the Journal of Accounting Research, report that their machine learning model improves performance by 75% over the current parsimonious standard in the accounting literature, the financial ratio-based F-Score (Dechow et al. 2011), when measured at the highest risk levels. They also show that raw variables alone, rather than financial ratios, can achieve this task. However, a quick peek under the hood reveals an issue that, once corrected for, reduces the results to no better than the F-Score.
In this paper, I build a machine learning model that applies XGBoost, a state-of-the-art gradient boosting method, to over 100 financial ratios sourced from prior literature. I compare this model to an XGBoost model built on the 28 raw variables suggested by Bao et al. Additional benchmarked models include the F-Score, the M-Score (Beneish 1999), the FSD Score based on Benford's Law (Amiram et al. 2015), and a simple screen on 4-year sales growth.
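A minimal sketch of the kind of modeling setup described above, not the paper's exact specification: it assumes a firm-year DataFrame loaded from a hypothetical file, with a binary fraud label, a fiscal-year column, and ratio feature columns whose names are illustrative.

```python
# Sketch: gradient-boosted fraud classifier on financial ratios (assumptions:
# file name, column names `fyear`, `fraud`, and `ratio_*`, and the train/test
# year split are all hypothetical, chosen only for illustration).
import pandas as pd
import xgboost as xgb

panel = pd.read_csv("firm_year_panel.csv")                    # hypothetical input
feature_cols = [c for c in panel.columns if c.startswith("ratio_")]

# Out-of-sample design: fit on earlier fiscal years, score a later test year.
train = panel[panel["fyear"] <= 2002]
test = panel[panel["fyear"] == 2003]

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    # Reweight the rare fraud class (fraud firm-years are a tiny minority).
    scale_pos_weight=(len(train) - train["fraud"].sum()) / train["fraud"].sum(),
    eval_metric="aucpr",
)
model.fit(train[feature_cols], train["fraud"])

# Fraud risk scores used to rank firm-years into top-1% / top-10% lists.
test = test.assign(score=model.predict_proba(test[feature_cols])[:, 1])
```

The same skeleton would serve for the 28-raw-variable model by swapping the feature columns; the benchmark scores (F-Score, M-Score, FSD Score, sales-growth screen) would be computed separately and ranked the same way.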
A Wilcoxon rank-sum test shows that the models' performance at the top 1% of risk does not differ significantly. In fact, at this level the models often fail in any given year. At the top 10% of risk, where models produce consistent annual results, advanced methods merely match the performance of the F-Score, or even a simple univariate screen on sales growth. I measure performance using the positive predictive value (PPV), also known as precision, which measures the likelihood that a case in the top 1% or top 10% list is a fraud case. My XGBoost model outperforms the other models at the 1% level, but positive predictive values remain too low to be of any practical use, with PPVs in the 3% range. A discussion follows to explain what would be required to move positive predictive values beyond the single digits for this research question.
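A sketch of the evaluation logic described above, under stated assumptions: yearly PPV (precision) within the top-1% or top-10% risk list, followed by a Wilcoxon rank-sum test on the yearly PPVs of two competing models. The file and column names (`fyear`, `fraud`, `score_a`, `score_b`) are hypothetical placeholders, not the paper's variables.

```python
# Sketch: per-year PPV at a top-k% cutoff, plus a Wilcoxon rank-sum comparison
# of two models' yearly PPVs (all names below are illustrative assumptions).
import numpy as np
import pandas as pd
from scipy.stats import ranksums

panel = pd.read_csv("firm_year_panel_scored.csv")   # hypothetical file holding
                                                    # both models' risk scores

def ppv_at_top(df, score_col, frac):
    """PPV = share of fraud cases within the top `frac` of risk scores."""
    cutoff = max(1, int(np.ceil(frac * len(df))))
    top = df.nlargest(cutoff, score_col)
    return top["fraud"].mean()

def yearly_ppv(df, score_col, frac):
    """One PPV per fiscal year, so models can be compared year by year."""
    return df.groupby("fyear").apply(lambda g: ppv_at_top(g, score_col, frac))

ppv_a = yearly_ppv(panel, "score_a", 0.01)   # e.g. XGBoost on financial ratios
ppv_b = yearly_ppv(panel, "score_b", 0.01)   # e.g. the F-Score benchmark

# Wilcoxon rank-sum test: do the two models' yearly top-1% PPVs differ in location?
stat, p_value = ranksums(ppv_a, ppv_b)
print(f"rank-sum statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

Changing `frac` from 0.01 to 0.10 gives the top-10% comparison discussed in the abstract.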