Efficient, interpretable and automated feature engineering for bank data

Impact factor 4.2; Zone 3, Computer Science; Q2, Computer Science, Artificial Intelligence

Big Data Research Pub Date : 2025-03-28 DOI:10.1016/j.bdr.2025.100524

Atilla Karaahmetoğlu , Mehmet Yıldız , Erdem Ünal , Uğur Aydın , Murat Koraş , Barış Akgün

Big Data Research, Volume 40, Article 100524. Full text: https://www.sciencedirect.com/science/article/pii/S221457962500019X


Abstract

Banks rely on expert-generated features and simple models to achieve high performance and interpretability at the same time. Interpretability is needed for internal assessment and, for specific problems such as risk assessment, regulatory compliance; both expert-generated features and simple models satisfy this need. However, feature generation by experts is time-consuming and susceptible to bias. In addition, features need to be generated fairly often because bank data is dynamic, and when the data changes significantly or new data sources appear, expertise may take a while to build up. Complex models, such as deep neural networks, may be able to remedy this. However, interpretability/explainability approaches for complex models are not satisfactory from the banks' point of view. In addition, such models do not always work well with tabular data, which is abundant in banking applications. This paper introduces an automated feature synthesis pipeline that creates informative and domain-interpretable features and consumes significantly less time than brute-force methods. We create novel feature synthesis steps, define elimination rules to rule out uninterpretable features, and combine performance-based feature selection methods to pick desirable features for building our models. Our results on two different datasets show that the features generated with our pipeline (1) perform on par with or better than features generated by existing methods, (2) are obtained faster, and (3) are domain-interpretable.
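The three stages named in the abstract (synthesis, rule-based elimination, performance-based selection) can be illustrated with a minimal sketch. This is not the authors' pipeline: the toy columns, the unit labels, the "only add or subtract like units" rule, and the single correlation-based filter are all assumptions chosen for illustration, and the paper combines several selection methods rather than one.

```python
import math
from itertools import combinations

# Hypothetical toy table: column name -> (unit, values). Not data from the paper.
COLUMNS = {
    "balance": ("currency", [1200.0, 300.0, 5000.0, 50.0, 2500.0, 800.0]),
    "monthly_debt": ("currency", [400.0, 280.0, 500.0, 60.0, 700.0, 790.0]),
    "n_transactions": ("count", [30.0, 12.0, 45.0, 5.0, 25.0, 18.0]),
}
DEFAULTED = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]  # made-up target labels


def synthesize(columns):
    """Yield (name, values) for pairwise features that pass the unit rule."""
    for (a, (unit_a, xs)), (b, (unit_b, ys)) in combinations(columns.items(), 2):
        # Assumed elimination rule: adding or subtracting different units
        # (e.g. currency + count) has no domain meaning, so it is never built.
        if unit_a == unit_b:
            yield f"{a} + {b}", [x + y for x, y in zip(xs, ys)]
            yield f"{a} - {b}", [x - y for x, y in zip(xs, ys)]
        # Ratios stay readable across units (e.g. currency per transaction).
        if all(y != 0 for y in ys):
            yield f"{a} / {b}", [x / y for x, y in zip(xs, ys)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def select(features, target, k):
    """Keep the k features with the largest absolute correlation to the target."""
    scored = [(abs(pearson(values, target)), name) for name, values in features]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


candidates = list(synthesize(COLUMNS))
selected = select(candidates, DEFAULTED, k=2)
print(len(candidates), "candidates; selected:", selected)
```

Applying the elimination rule during synthesis, rather than after it, means uninterpretable combinations are never computed, which is one plausible way a rule-based pipeline could spend less time than brute-force enumeration.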




Source journal

Big Data Research (Computer Science: Computer Science Applications)

CiteScore: 8.40

Self-citation rate: 3.00%

Journal description: The journal aims to promote and communicate advances in big data research by providing a fast, high-quality forum for researchers, practitioners and policy makers from the many different communities working on and with this topic. The journal accepts papers on foundational aspects of dealing with big data, as well as papers on specific platforms and technologies used to deal with big data. To promote data science and interdisciplinary collaboration between fields, and to showcase the benefits of data-driven research, it also considers papers demonstrating applications of big data in domains as diverse as geoscience, the social web, finance, e-commerce, health care, environment and climate, physics and astronomy, chemistry, life sciences and drug discovery, digital libraries and scientific publications, security, and government. Occasionally the journal may publish whitepapers on policies, standards and best practices.