An Investigation into Distance Measures in Cluster Analysis
Zoe Shapcott
arXiv - STAT - Other Statistics, 2024-04-21
DOI: arxiv-2404.13664 (https://doi.org/arxiv-2404.13664)
Abstract
This report provides an exploration of different distance measures that can
be used with the $K$-means algorithm for cluster analysis. Specifically, we
investigate the Mahalanobis distance, and critically assess any benefits it may
have over the more traditional measures of the Euclidean, Manhattan and Maximum
distances. We do this by first defining the metrics, then considering
their advantages and drawbacks as discussed in the literature.
We apply these distances, first to simulated data and then to subsets of
the Dry Bean dataset [1], to explore whether any one metric produces
detectably better clusterings than the others in these cases. One section is devoted
to analysing the information obtained from ChatGPT in response to prompts
relating to this topic.
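
As a rough illustration of the four distances named above, the following minimal Python sketch computes each of them between two points using their standard textbook definitions. It is not taken from the report: the function names are hypothetical, the Mahalanobis function assumes the caller supplies an already inverted covariance matrix, and the 2x2 inversion helper is included only to keep the example self-contained.

```python
import math


def euclidean(x, y):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def maximum(x, y):
    """Largest absolute coordinate difference (L-infinity distance)."""
    return max(abs(a - b) for a, b in zip(x, y))


def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance, given an inverse covariance matrix.

    cov_inv is a nested list (assumed symmetric and positive definite).
    With the identity matrix this reduces to the Euclidean distance.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    # Quadratic form d^T * cov_inv * d
    q = sum(d[i] * cov_inv[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(q)


def invert_2x2(m):
    """Invert a 2x2 matrix (illustrative helper for two-dimensional data)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


if __name__ == "__main__":
    p, q = (0.0, 0.0), (3.0, 4.0)
    cov = [[4.0, 0.0], [0.0, 1.0]]  # made-up covariance, for illustration only
    print(euclidean(p, q), manhattan(p, q), maximum(p, q))
    print(mahalanobis(p, q, invert_2x2(cov)))
```

In this sketch the first coordinate has the larger variance, so the Mahalanobis distance down-weights differences along it; this rescaling by the covariance structure is what distinguishes it from the other three measures.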