Double-weighted kNN: a simple and efficient variant with embedded feature selection

Impact factor: 4.0. JCR quartile: Q2. Category: BUSINESS

Journal of Marketing Analytics. Pub Date: 2024-04-06. DOI: 10.1057/s41270-024-00302-5

Almudena Moreno-Ribera, Aida Calviño


Citation count: 0

Abstract

Predictive modeling aims to estimate an unknown variable, the target, from a set of known ones, the inputs. k Nearest Neighbors (kNN) is one of the best-known predictive algorithms owing to its simplicity and good performance. However, this class of models has some drawbacks, such as a lack of robustness to irrelevant input features and the need to transform qualitative variables into dummy variables, with a corresponding loss of information for ordinal ones. In this work, a kNN regression variant, easily adaptable to classification, is proposed. The proposal handles all types of input variables while embedding feature selection in a simple and efficient manner, which shortens the tuning phase. More precisely, making use of the weighted Gower distance, we develop a powerful tool to cope with these drawbacks. Finally, to boost the tool's predictive power, a second weighting scheme is applied to the neighbors. The proposed method is applied to a collection of 20 data sets that differ in size, data type, and distribution of the target variable. The results are compared with those of previously proposed kNN variants and show that the proposed method outperforms them, particularly when the weighting scheme is based on non-linear association measures.
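The two weighting layers described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the feature weights or the neighbor weights are computed, so the sketch takes the feature weights `w` as given (for example, from some association measure with the target) and assumes inverse-distance weights for the neighbors. The function names, the `is_cat` mask, and the convention of storing categorical values as numeric codes are all hypothetical choices made for illustration.

```python
import numpy as np


def gower_distances(x, X, w, is_cat, ranges):
    """Weighted Gower distance from query row x to every row of X.

    Numeric columns contribute |difference| / range; categorical columns
    (stored as numeric codes, an assumption of this sketch) contribute
    0 on a match and 1 on a mismatch. A feature with weight 0 is ignored,
    which is how feature selection is embedded in the distance.
    Assumes at least one weight is positive.
    """
    diff = np.abs(X - x)                                   # shape (n, p)
    per_feature = np.where(is_cat, diff != 0, diff / ranges)
    return per_feature @ w / w.sum()


def dw_knn_predict(X_train, y_train, X_new, w, is_cat, k=3, eps=1e-9):
    """kNN regression with feature weights and neighbor weights."""
    ranges = X_train.max(axis=0) - X_train.min(axis=0)
    # Avoid division by zero; categorical columns do not use the range.
    ranges = np.where((ranges == 0) | is_cat, 1.0, ranges)
    preds = []
    for x in X_new:
        d = gower_distances(x, X_train, w, is_cat, ranges)
        nn = np.argsort(d)[:k]                 # indices of the k nearest rows
        # Second weighting: inverse distance (assumed, not from the paper).
        nw = 1.0 / (d[nn] + eps)
        preds.append(np.dot(nw, y_train[nn]) / nw.sum())
    return np.array(preds)


# Toy data: column 0 drives the target, column 1 is noise.
X_train = np.array([[0.0, 5.0], [1.0, 0.0], [2.0, 9.0], [3.0, 1.0]])
y_train = np.array([0.0, 10.0, 20.0, 30.0])
is_cat = np.array([False, False])
query = np.array([[1.0, 9.0]])

# Zero weight on the noisy column removes it from the distance.
pred_relevant = dw_knn_predict(X_train, y_train, query,
                               np.array([1.0, 0.0]), is_cat, k=1)
# Weighting only the noisy column picks a misleading neighbor instead.
pred_noisy = dw_knn_predict(X_train, y_train, query,
                            np.array([0.0, 1.0]), is_cat, k=1)
print(pred_relevant, pred_noisy)
```

Because every per-feature Gower term lies in [0, 1], the feature weights act directly as relative importances, and ordinal or nominal inputs need no dummy encoding. A classification version would replace the weighted mean with a weighted vote.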




Source journal

Journal of Marketing Analytics (BUSINESS)

CiteScore

5.40

Self-citation rate

16.70%

Number of articles published

Journal description: Data has become the new ore in today's knowledge economy. However, merely storing and reporting data is not enough to thrive in today's increasingly competitive markets. What is called for is the ability to make sense of these oceans of data, and to apply those insights to the way companies approach their markets, adjust to changing market conditions, and respond to new competitors. Marketing analytics lies at the heart of this contemporary wave of data-driven decision-making. Companies can no longer survive when they rely on gut instinct to make decisions. Strategic leverage of data is one of the few remaining sources of sustainable competitive advantage. New products can be copied faster than ever before. Staff are becoming less loyal as well as more mobile, and business centers themselves are moving across the globe in a world that is getting flatter and flatter. The Journal of Marketing Analytics brings together applied research and practice papers in this blossoming field. A unique blend of applied academic research and insights from commercial best practices makes the Journal of Marketing Analytics a perfect companion for academics and practitioners alike. Academics can stay in touch with the latest developments in this field. Marketing analytics professionals can read about the latest trends and cutting-edge academic research in this discipline. The Journal of Marketing Analytics will feature applied research papers on topics like targeting, segmentation, big data, customer loyalty and lifecycle management, cross-selling, CRM, data quality management, multi-channel marketing, and marketing strategy. The Journal of Marketing Analytics aims to combine the rigor of carefully controlled scientific research methods with the applicability of real-world case studies. Our double-blind review process ensures that papers are selected on their content and merits alone, selecting the best possible papers in this field.