Selecting Key Features of Online Behaviour on South African Informative Websites Prior to Unsupervised Machine Learning

Statistics, Optimization & Information Computing. Pub Date: 2022-10-21. DOI: 10.19139/soic-2310-5070-1139

Judah Soobramoney, R. Chifurira, T. Zewotir


Citations: 1

Abstract

The main aim of the study was to explore the feature selection process for online web data prior to fitting unsupervised machine learning models. At the time of writing, no literature could be found reporting the use of feature selection in this context. Features were selected by inspecting the variability of each feature and the association between features. The variability of numeric features was quantified using the variance, mean absolute difference and dispersion ratio metrics, whilst the coefficient of unalikeability was employed for categorical features. To quantify association, correlation matrices were used for numeric features, chi-squared tests of independence for pairs of categorical features, and box-and-whisker plots for mixed pairs of numeric and categorical features. The main findings showed that the variance, mean absolute difference, dispersion ratio and coefficient of unalikeability successfully highlighted features with very low variability within the observed data, whilst the correlation matrix, chi-squared test for independence and box-and-whisker plots highlighted possible redundancy, natural relationships and insightful relationships between the features, thereby suggesting features to be considered for omission prior to unsupervised modelling. The proposed methods and findings can be applied to various other applications of feature selection and exploration.
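The four variability metrics named in the abstract can be sketched in a few lines of Python. The abstract does not give formulas, so the definitions below are assumptions based on common usage: the mean absolute difference is taken as the mean absolute deviation from the arithmetic mean, the dispersion ratio as the arithmetic mean divided by the geometric mean (defined only for strictly positive values), and the coefficient of unalikeability as one minus the sum of squared category proportions. The feature names and values are invented for illustration and are not the study's data.

```python
from collections import Counter
from statistics import fmean, geometric_mean, pvariance


def mean_absolute_difference(values):
    """Mean absolute deviation from the arithmetic mean (assumed definition)."""
    centre = fmean(values)
    return fmean(abs(v - centre) for v in values)


def dispersion_ratio(values):
    """Arithmetic mean over geometric mean (assumed definition).

    Requires strictly positive values; equals 1 for a constant feature
    and grows as the values spread out.
    """
    return fmean(values) / geometric_mean(values)


def unalikeability(labels):
    """Coefficient of unalikeability: 1 minus the sum of squared proportions.

    0 means every observation falls in the same category.
    """
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


# Hypothetical web-session features, invented for illustration only.
numeric_features = {
    "pages_per_visit": [3, 3, 3, 3, 3, 3],         # constant feature
    "session_seconds": [12, 45, 300, 8, 120, 60],  # widely spread feature
}
categorical_features = {
    "device": ["mobile"] * 6,                      # single category
    "channel": ["search", "direct", "social"] * 2, # evenly spread categories
}

for name, values in numeric_features.items():
    print(name, pvariance(values), mean_absolute_difference(values),
          dispersion_ratio(values))

for name, labels in categorical_features.items():
    print(name, unalikeability(labels))
```

In this toy data, `pages_per_visit` and `device` do not vary at all, so they could not help an unsupervised model separate observations; these are the kind of features the abstract describes as candidates for omission. The association checks (correlation matrix, chi-squared test, box-and-whisker plots) are not covered by this sketch.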

