{"title":"A Needle Found: Machine Learning Does Not Significantly Improve Corporate Fraud Detection Beyond a Simple Screen on Sales Growth","authors":"S. Walker","doi":"10.2139/ssrn.3739480","DOIUrl":null,"url":null,"abstract":"Recent papers have been highly promotional of the benefits of machine learning in the detection of corporate fraud. For example, Bao, Ke, Li, Yu, and Zhang (2020) recently published in the Journal of Accounting Research report that their machine learning model increases performance by +75% above the current parsimonious standard in the accounting literature, the financial ratio-based F-Score (Dechow, et al. 2011), when measured at the highest risk levels. They also show that raw variables alone, rather than financial ratios, can achieve this task. However, a quick peak under the hood reveals an issue that, if corrected for, reduces the results to no better than the F-Score. \n \nIn this paper, I create a machine learning model applying the latest in machine learning known as XGBoost to over 100 financial ratios sourced from prior literature. I compare this model to an XGBoost model applying the 28 raw variables suggested by Bao, et al. Additional models are benchmarked include the F-Score, the M-Score (Beneish 1999), the FSD Score based on Benford’s Law (Amiram, et al. 2015), and a simple screen on 4-year sales growth. \n \nA Wilcoxon rank sum test will show that differences between the models at the top 1% of risk are not significantly different. In fact, at this level, the models fail often in any given year. At the top 10% of risk where models produce consistent annual results, advanced methods match the performance of the F-Score, or even a simple univariate screen on sales growth I measure performance using positive predictive values (PPV) also known as precision which measures the likelihood of a fraud case within the top 1% or top 10% list. My XGBoost model outperforms the models at the 1% level, but positive predictive values remain quite low to be of any practical use with PPVs in the 3% range. A discussion will follow to explain what would be required to move positive predicted values beyond the single digits for this research question.","PeriodicalId":198128,"journal":{"name":"Applied Accounting - Practitioner eJournal","volume":"32 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Accounting - Practitioner eJournal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2139/ssrn.3739480","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
Recent papers have strongly promoted the benefits of machine learning for detecting corporate fraud. For example, Bao, Ke, Li, Yu, and Zhang (2020), recently published in the Journal of Accounting Research, report that their machine learning model improves performance by 75% over the current parsimonious standard in the accounting literature, the financial ratio-based F-Score (Dechow et al. 2011), when measured at the highest risk levels. They also show that raw variables alone, rather than financial ratios, can achieve this task. However, a quick peek under the hood reveals an issue that, once corrected for, reduces the results to no better than the F-Score.
In this paper, I build a machine learning model that applies XGBoost, a state-of-the-art gradient boosting method, to over 100 financial ratios sourced from prior literature. I compare this model to an XGBoost model built on the 28 raw variables suggested by Bao et al. Additional benchmarked models include the F-Score, the M-Score (Beneish 1999), the FSD Score based on Benford's Law (Amiram et al. 2015), and a simple screen on 4-year sales growth.
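A minimal sketch of the kind of modeling setup described above, not the paper's exact specification: it assumes a firm-year DataFrame loaded from a hypothetical file, with a binary fraud label, a fiscal-year column, and ratio feature columns whose names are illustrative.

```python
# Sketch: gradient-boosted fraud classifier on financial ratios (assumptions:
# file name, column names `fyear`, `fraud`, and `ratio_*`, and the train/test
# year split are all hypothetical, chosen only for illustration).
import pandas as pd
import xgboost as xgb

panel = pd.read_csv("firm_year_panel.csv")                    # hypothetical input
feature_cols = [c for c in panel.columns if c.startswith("ratio_")]

# Out-of-sample design: fit on earlier fiscal years, score a later test year.
train = panel[panel["fyear"] <= 2002]
test = panel[panel["fyear"] == 2003]

model = xgb.XGBClassifier(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    # Reweight the rare fraud class (fraud firm-years are a tiny minority).
    scale_pos_weight=(len(train) - train["fraud"].sum()) / train["fraud"].sum(),
    eval_metric="aucpr",
)
model.fit(train[feature_cols], train["fraud"])

# Fraud risk scores used to rank firm-years into top-1% / top-10% lists.
test = test.assign(score=model.predict_proba(test[feature_cols])[:, 1])
```

The same skeleton would serve for the 28-raw-variable model by swapping the feature columns; the benchmark scores (F-Score, M-Score, FSD Score, sales-growth screen) would be computed separately and ranked the same way.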
A Wilcoxon rank-sum test shows that the models' performance at the top 1% of risk does not differ significantly. In fact, at this level the models often fail in any given year. At the top 10% of risk, where models produce consistent annual results, advanced methods merely match the performance of the F-Score, or even a simple univariate screen on sales growth. I measure performance using the positive predictive value (PPV), also known as precision, which measures the likelihood that a case in the top 1% or top 10% list is a fraud case. My XGBoost model outperforms the other models at the 1% level, but positive predictive values remain too low to be of any practical use, with PPVs in the 3% range. A discussion follows to explain what would be required to move positive predictive values beyond the single digits for this research question.
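A sketch of the evaluation logic described above, under stated assumptions: yearly PPV (precision) within the top-1% or top-10% risk list, followed by a Wilcoxon rank-sum test on the yearly PPVs of two competing models. The file and column names (`fyear`, `fraud`, `score_a`, `score_b`) are hypothetical placeholders, not the paper's variables.

```python
# Sketch: per-year PPV at a top-k% cutoff, plus a Wilcoxon rank-sum comparison
# of two models' yearly PPVs (all names below are illustrative assumptions).
import numpy as np
import pandas as pd
from scipy.stats import ranksums

panel = pd.read_csv("firm_year_panel_scored.csv")   # hypothetical file holding
                                                    # both models' risk scores

def ppv_at_top(df, score_col, frac):
    """PPV = share of fraud cases within the top `frac` of risk scores."""
    cutoff = max(1, int(np.ceil(frac * len(df))))
    top = df.nlargest(cutoff, score_col)
    return top["fraud"].mean()

def yearly_ppv(df, score_col, frac):
    """One PPV per fiscal year, so models can be compared year by year."""
    return df.groupby("fyear").apply(lambda g: ppv_at_top(g, score_col, frac))

ppv_a = yearly_ppv(panel, "score_a", 0.01)   # e.g. XGBoost on financial ratios
ppv_b = yearly_ppv(panel, "score_b", 0.01)   # e.g. the F-Score benchmark

# Wilcoxon rank-sum test: do the two models' yearly top-1% PPVs differ in location?
stat, p_value = ranksums(ppv_a, ppv_b)
print(f"rank-sum statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

Changing `frac` from 0.01 to 0.10 gives the top-10% comparison discussed in the abstract.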