Gaining insights in datasets in the shade of “garbage in, garbage out” rationale: Feature space distribution fitting

IF 6.4 2区 计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Gürol Canbek
{"title":"Gaining insights in datasets in the shade of “garbage in, garbage out” rationale: Feature space distribution fitting","authors":"Gürol Canbek","doi":"10.1002/widm.1456","DOIUrl":null,"url":null,"abstract":"This article emphasizes comprehending the “Garbage In, Garbage Out” (GIGO) rationale and ensuring the dataset quality in Machine Learning (ML) applications to achieve high and generalizable performance. An initial step should be added in an ML workflow where researchers evaluate the insights gained by quantitative analysis of the datasets sample and feature spaces. This study contributes towards achieving such a goal by suggesting a technique to quantify datasets in terms of feature frequency distribution characteristics. Hence a unique insight is provided into how the features in the available dataset samples are frequent. The technique was demonstrated in 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications such as CALL_PHONE compose a relatively high‐dimensional binary feature space. The results showed that the distributions fit well into two of the four long right‐tail statistical distributions: log‐normal, exponential, power law, and Poisson. Precisely, log‐normal was the most exhibited statistical distribution except the two malign datasets that were in exponential. This study also explores statistical distribution fit/unfit feature analysis that enhances the insights in feature space. Finally, the study compiles phenomena examples in the literature exhibiting these statistical distributions that should be considered for interpreting the fitted distributions. In conclusion, conducting well‐formed statistical methods provides a clear understanding of the datasets and intra‐class and inter‐class differences before proceeding with selecting features and building a classifier model. Feature distribution characteristics should be one to analyze beforehand.","PeriodicalId":48970,"journal":{"name":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","volume":"18 1","pages":""},"PeriodicalIF":6.4000,"publicationDate":"2022-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1002/widm.1456","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 4

Abstract

This article emphasizes comprehending the “Garbage In, Garbage Out” (GIGO) rationale and ensuring the dataset quality in Machine Learning (ML) applications to achieve high and generalizable performance. An initial step should be added in an ML workflow where researchers evaluate the insights gained by quantitative analysis of the datasets sample and feature spaces. This study contributes towards achieving such a goal by suggesting a technique to quantify datasets in terms of feature frequency distribution characteristics. Hence a unique insight is provided into how the features in the available dataset samples are frequent. The technique was demonstrated in 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications such as CALL_PHONE compose a relatively high‐dimensional binary feature space. The results showed that the distributions fit well into two of the four long right‐tail statistical distributions: log‐normal, exponential, power law, and Poisson. Precisely, log‐normal was the most exhibited statistical distribution except the two malign datasets that were in exponential. This study also explores statistical distribution fit/unfit feature analysis that enhances the insights in feature space. Finally, the study compiles phenomena examples in the literature exhibiting these statistical distributions that should be considered for interpreting the fitted distributions. In conclusion, conducting well‐formed statistical methods provides a clear understanding of the datasets and intra‐class and inter‐class differences before proceeding with selecting features and building a classifier model. Feature distribution characteristics should be one to analyze beforehand.

Abstract Image

在“垃圾输入,垃圾输出”原理的阴影下获得数据集的见解:特征空间分布拟合
本文强调理解“垃圾输入,垃圾输出”(GIGO)的基本原理,并确保机器学习(ML)应用程序中的数据集质量,以实现高且可推广的性能。应该在ML工作流程中添加初始步骤,研究人员评估通过数据集样本和特征空间的定量分析获得的见解。本研究通过提出一种根据特征频率分布特征量化数据集的技术,为实现这一目标做出了贡献。因此,对于可用数据集样本中的特征是如何频繁出现的,提供了独特的见解。该技术在属于6个学术Android移动恶意软件分类研究的11个良性和恶意(恶意软件)Android应用程序数据集中进行了演示。CALL_PHONE等应用程序请求的权限构成了一个相对高维的二进制特征空间。结果表明,这些分布很好地符合四种长右尾统计分布中的两种:对数正态分布、指数分布、幂律分布和泊松分布。准确地说,除了两个呈指数的恶性数据集外,对数正态分布是最明显的统计分布。本研究还探讨了统计分布适合/不适合特征分析,以增强对特征空间的洞察力。最后,研究汇编了文献中显示这些统计分布的现象示例,这些统计分布应被考虑用于解释拟合分布。总之,在继续选择特征和构建分类器模型之前,执行格式良好的统计方法可以清楚地了解数据集以及类内和类间的差异。特征分布特征是需要事先分析的。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery
Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-COMPUTER SCIENCE, THEORY & METHODS
CiteScore
22.70
自引率
2.60%
发文量
39
审稿时长
>12 weeks
期刊介绍: The goals of Wiley Interdisciplinary Reviews-Data Mining and Knowledge Discovery (WIREs DMKD) are multifaceted. Firstly, the journal aims to provide a comprehensive overview of the current state of data mining and knowledge discovery by featuring ongoing reviews authored by leading researchers. Secondly, it seeks to highlight the interdisciplinary nature of the field by presenting articles from diverse perspectives, covering various application areas such as technology, business, healthcare, education, government, society, and culture. Thirdly, WIREs DMKD endeavors to keep pace with the rapid advancements in data mining and knowledge discovery through regular content updates. Lastly, the journal strives to promote active engagement in the field by presenting its accomplishments and challenges in an accessible manner to a broad audience. The content of WIREs DMKD is intended to benefit upper-level undergraduate and postgraduate students, teaching and research professors in academic programs, as well as scientists and research managers in industry.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信