Robust Distance Measures for kNN Classification of Cancer Data.

Impact factor: 2.4 · JCR Q2 (Mathematical & Computational Biology)

Cancer Informatics · Pub Date: 2020-10-13 · eCollection Date: 2020-01-01 · DOI: 10.1177/1176935120965542

Rezvan Ehsani, Finn Drabløs

Cancer Informatics, volume 19, article 1176935120965542. Publication type: Journal Article.

Citations: 33

Abstract

The k-Nearest Neighbor (kNN) classifier represents a simple and very general approach to classification. Still, the performance of kNN classifiers can often compete with more complex machine-learning algorithms. The core of kNN depends on a "guilt by association" principle where classification is performed by measuring the similarity between a query and a set of training patterns, often computed as distances. The relative performance of kNN classifiers is closely linked to the choice of distance or similarity measure, and it is therefore relevant to investigate the effect of using different distance measures when comparing biomedical data. In this study on classification of cancer data sets, we have used both common and novel distance measures, including the novel distance measures Sobolev and Fisher, and we have evaluated the performance of kNN with these distances on four cancer data sets of different types. We find that the performance when using the novel distance measures is comparable to the performance with better-established measures, in particular for the Sobolev distance. We define a robust ranking of all the distance measures according to overall performance. Several distance measures show robust performance in kNN over several data sets, in particular the Hassanat, Sobolev, and Manhattan measures. Some of the other measures show good performance on selected data sets but seem to be more sensitive to the nature of the classification data. It is therefore important to benchmark distance measures on similar data prior to classification to identify the most suitable measure in each case.
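The benchmarking idea in the abstract can be illustrated with a minimal sketch: a kNN classifier that takes the distance measure as a parameter, scored by leave-one-out accuracy. This is not the authors' code. The function names, the toy data, and the leave-one-out protocol are assumptions made for illustration, and the Hassanat distance is written here as it is commonly stated in the literature (a bounded per-dimension term, with a shift for negative values). The Sobolev and Fisher distances are omitted because the abstract does not define them.

```python
from collections import Counter


def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def euclidean(a, b):
    """L2 distance: square root of summed squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def hassanat(a, b):
    """Hassanat distance as commonly stated (assumed form, not taken from the abstract).

    Each dimension contributes a value in [0, 1), so no single feature can dominate.
    """
    total = 0.0
    for x, y in zip(a, b):
        lo, hi = min(x, y), max(x, y)
        shift = abs(lo) if lo < 0 else 0.0  # move negative values up to zero
        total += 1.0 - (1.0 + lo + shift) / (1.0 + hi + shift)
    return total


def knn_predict(train_x, train_y, query, k, distance):
    """Label the query by majority vote among its k nearest training patterns."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(xs, ys, k, distance):
    """Fraction of samples classified correctly when each is held out in turn."""
    hits = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, xs[i], k, distance) == ys[i]:
            hits += 1
    return hits / len(xs)


# Toy two-feature data, invented for illustration only (not from the study).
TOY_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (2.0, 2.1), (2.2, 1.9), (1.9, 2.3)]
TOY_Y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]

# Benchmark every candidate distance on the same data before choosing one.
SCORES = {
    fn.__name__: leave_one_out_accuracy(TOY_X, TOY_Y, k=3, distance=fn)
    for fn in (manhattan, euclidean, hassanat)
}
```

On real expression or clinical data the scores would differ between measures, which is the comparison the study performs; on this well-separated toy set every measure classifies all samples correctly, so the example only shows the mechanics.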

Source journal

Cancer Informatics Medicine-Oncology

CiteScore

3.00

Self-citation rate

5.00%

Articles published

Review time

8 weeks

About the journal: The field of cancer research relies on advances in many other disciplines, including omics technology, mass spectrometry, radio imaging, computer science, and biostatistics. Cancer Informatics provides open access to peer-reviewed, high-quality manuscripts reporting bioinformatics analysis of molecular genetics and/or clinical data pertaining to cancer, emphasizing the use of machine learning, artificial intelligence, statistical algorithms, advanced imaging techniques, data visualization, and high-throughput technologies. As the leading journal dedicated exclusively to reporting the use of computational methods in cancer research and practice, Cancer Informatics leverages methodological improvements in systems biology, genomics, proteomics, metabolomics, and molecular biochemistry into the fields of cancer detection, treatment, classification, risk prediction, prevention, outcome, and modeling.