Selecting Key Features of Online Behaviour on South African Informative Websites Prior to Unsupervised Machine Learning

Judah Soobramoney, R. Chifurira, T. Zewotir
Statistics, Optimization & Information Computing. Published 2022-10-21. DOI: 10.19139/soic-2310-5070-1139 (https://doi.org/10.19139/soic-2310-5070-1139)
Citations: 1

Abstract

The main aim of the study was to explore feature selection for online web data prior to fitting unsupervised machine learning models. At the time of writing, no literature could be found that reported the use of feature selection in this context. Features were selected by inspecting the variability of each feature and the association between features. The variability of numeric features was quantified using the variance, mean absolute difference and dispersion ratio metrics, whilst the coefficient of unalikeability was employed for categorical features. To quantify association, correlation matrices were used between numeric features, chi-squared tests of independence between categorical features, and box-and-whisker plots between mixed (numeric and categorical) features. The main findings showed that the variance, mean absolute difference, dispersion ratio and coefficient of unalikeability successfully highlighted features with very low variability within the observed data. The correlation matrix, chi-squared test of independence and box-and-whisker plots highlighted possible redundancy, natural relationships and insightful relationships between the features, thereby suggesting features to be considered for omission prior to unsupervised modelling. The proposed methods and findings can be applied to other applications of feature selection and exploration.
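The variability metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so the definitions below are assumptions taken from common usage in the feature selection literature: the mean absolute difference is computed as the average absolute deviation from the mean, the dispersion ratio as the arithmetic mean divided by the geometric mean (which needs strictly positive values), and the coefficient of unalikeability as one minus the sum of squared category proportions. The feature names and data are hypothetical and are not taken from the study.

```python
import math
from collections import Counter
from statistics import fmean, pvariance


def mean_absolute_difference(values):
    """Average absolute deviation from the mean (assumed definition)."""
    values = list(values)
    centre = fmean(values)
    return fmean([abs(v - centre) for v in values])


def dispersion_ratio(values):
    """Arithmetic mean over geometric mean (assumed definition).

    Equals 1 when every value is identical and grows with dispersion.
    """
    values = list(values)
    if any(v <= 0 for v in values):
        raise ValueError("dispersion ratio needs strictly positive values")
    geometric_mean = math.exp(fmean([math.log(v) for v in values]))
    return fmean(values) / geometric_mean


def unalikeability(categories):
    """One minus the sum of squared category proportions.

    Equals 0 when all observations share a single category.
    """
    categories = list(categories)
    n = len(categories)
    counts = Counter(categories)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


if __name__ == "__main__":
    # Hypothetical session-level features, for illustration only.
    pages_per_session = [1, 1, 1, 1, 2, 1, 1, 1]
    device = ["mobile"] * 7 + ["desktop"]

    print("variance:", pvariance(pages_per_session))
    print("mean absolute difference:", mean_absolute_difference(pages_per_session))
    print("dispersion ratio:", dispersion_ratio(pages_per_session))
    print("unalikeability:", unalikeability(device))
```

Features whose scores sit near the minimum of each metric (0 for the variance, mean absolute difference and unalikeability, 1 for the dispersion ratio) carry little information for clustering and would be candidates for omission. Any cut-off used in practice is a judgment call that this sketch does not prescribe.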