Using real-world transaction data to identify money laundering: Leveraging traditional regression and machine learning techniques

STEM Fellowship Journal | Pub Date: 2021-11-01 | DOI: 10.17975/sfj-2021-006

Daniel Harris, Kyla Pyndiura, S. Sturrock, R. Christensen

Citations: 2

Abstract

Using real-world transaction data to identify money laundering: Leveraging traditional regression and machine learning techniques

Money laundering is a pervasive legal and economic problem that hides criminal activity. Identifying money laundering is a priority for both banks and governments; machine learning algorithms have therefore emerged as a possible strategy for detecting suspicious financial activity within financial institutions. We used traditional regression and supervised machine learning techniques to identify bank customers at an increased risk of committing money laundering. Specifically, we assessed whether model performance differed across varying operationalizations of the outcome (e.g., multinomial vs. binary classification) and determined whether the inclusion of investigator-derived novel features (e.g., averages across existing features) could improve model performance. We received two proprietary datasets from Scotiabank, a large bank headquartered in Canada. The datasets included customer account information (N = 4,469) and customers’ monthly transaction histories (N = 2,827) from April 15, 2019 to April 15, 2020. We implemented traditional logistic regression, logistic regression with LASSO regularization (LASSO), K-nearest neighbours (KNN), and extreme gradient boosted models (XGBoost). Results indicated that traditional logistic regression with a binary outcome, fitted with investigator-derived novel features, performed best, with an F1 score of 0.79 and an accuracy of 0.72. Models with a binary outcome had higher accuracy than the multinomial models, but the F1 scores yielded mixed results. For KNN and XGBoost, we observed little change or worsening performance after the introduction of the investigator-derived novel features. However, the investigator-derived novel features improved model performance for LASSO and traditional logistic regression. Our findings demonstrate that investigators should consider different operationalizations of the outcome, where possible, and include novel features derived from existing features to potentially improve the detection of customers at risk of committing money laundering.
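The abstract names three ideas that are easy to make concrete: a derived feature built as an average of existing features, a multinomial outcome collapsed to a binary one, and evaluation by accuracy and F1 score. The Python sketch below illustrates them on toy values only. It is not the authors' code: the risk labels (`"low"`, `"medium"`, `"high"`), the choice of which levels count as positive, and all function names are assumptions made for illustration, since the abstract does not specify them.

```python
from statistics import mean


def mean_feature(monthly_values):
    """Derived feature: the average of a customer's existing monthly values."""
    return mean(monthly_values)


def to_binary(label, positive_levels=("medium", "high")):
    """Collapse a multinomial risk label to a binary outcome.

    The label names and the positive set are hypothetical.
    """
    return 1 if label in positive_levels else 0


def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1 for binary labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data, not drawn from the study.
    labels = ["high", "medium", "high", "low"]
    y_true = [to_binary(label) for label in labels]
    y_pred = [1, 1, 0, 1]
    acc, f1 = accuracy_and_f1(y_true, y_pred)
    print(f"accuracy={acc:.2f}, f1={f1:.2f}")
    print("mean monthly value:", mean_feature([100.0, 200.0, 300.0]))
```

F1 ignores true negatives, so it can exceed accuracy when the positive class is the majority, as in this toy example. That is one way a model can report an F1 score above its accuracy.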
