An Automated Python Script for Data Cleaning and Labeling using Machine Learning Technique
M. A. Oladipupo, Princewill Chima Obuzor, Babatunde J. Bamgbade, A. Adeniyi, Kazeem M. Olagunju, S. A. Ajagbe
Informatica, published 2023-06-22. DOI: 10.31449/inf.v47i6.4474 (https://doi.org/10.31449/inf.v47i6.4474)
Abstract
The aim of this study was to obtain a financial dataset from the Kaggle repository and to create a machine learning (ML) approach in Python that automates cleaning of that dataset. The cleaning covers data ingestion, handling incomplete data, handling anomalies, one-hot and label encoding, extraction of date and time values, and data normalization. An unsupervised ML method (k-means) was then implemented to automate labeling of the financial dataset. This step includes the elbow method, k-means clustering, modeling of "age" versus "arrival", dimensionality reduction, visualization, and labeling of the dataset using the resulting clusters. The automatically cleaned and labeled dataset was assessed empirically by comparing the cleaned dataset before and after applying PCA. The results show that the developed ML technique not only improved the quality of the audit data used in this study but also classified the data after cleaning it and removing unwanted and incomplete records, as shown by the k-means clustering result and the grouping by PCA.
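The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration with pandas and scikit-learn, not the authors' script. The data is synthetic, and only "age" and "arrival" are names taken from the abstract. The columns `amount`, `channel`, and `status` are hypothetical, and so are the specific choices of median/mode imputation, IQR clipping, min-max scaling, and k = 3.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder, MinMaxScaler


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Clean a raw frame. Column names and strategies are assumptions."""
    df = df.copy()
    # Incomplete data: median for numeric columns, mode for categorical ones.
    for col in ["age", "amount"]:
        df[col] = df[col].fillna(df[col].median())
    for col in ["channel", "status"]:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    # Anomalies: clip numeric values to the 1.5 * IQR fences.
    for col in ["age", "amount"]:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Date and time values: split the timestamp into numeric parts.
    ts = pd.to_datetime(df.pop("arrival"))
    df["arrival_month"] = ts.dt.month
    df["arrival_hour"] = ts.dt.hour
    # Label encoding for one column, one-hot encoding for the other.
    df["status"] = LabelEncoder().fit_transform(df["status"])
    df = pd.get_dummies(df, columns=["channel"], dtype=float)
    # Normalization: rescale every feature to the range [0, 1].
    scaled = MinMaxScaler().fit_transform(df)
    return pd.DataFrame(scaled, columns=df.columns, index=df.index)


def inertia_curve(X: pd.DataFrame, k_values) -> list:
    """Within-cluster sum of squares per k; the 'elbow' is read off this curve."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_values
    ]


def label(X: pd.DataFrame, k: int, use_pca: bool = False) -> np.ndarray:
    """Cluster with k-means, optionally on a 2-component PCA projection."""
    data = PCA(n_components=2).fit_transform(X) if use_pca else X.to_numpy()
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)


# Synthetic stand-in for the Kaggle dataset (hypothetical values).
rng = np.random.default_rng(0)
n = 200
raw = pd.DataFrame({
    "age": rng.integers(18, 70, n).astype(float),
    "amount": rng.lognormal(4.0, 1.0, n),
    "channel": rng.choice(["web", "branch", "app"], n),
    "status": rng.choice(["ok", "flagged"], n),
    "arrival": pd.Timestamp("2023-01-01")
    + pd.to_timedelta(rng.integers(0, 8000, n), unit="h"),
})
raw.loc[::17, "age"] = np.nan       # inject missing numeric values
raw.loc[::23, "channel"] = np.nan   # inject missing categorical values

X = clean(raw)
curve = inertia_curve(X, range(1, 6))      # inspect or plot to choose k
labels_raw = label(X, k=3)                 # labels before PCA
labels_pca = label(X, k=3, use_pca=True)   # labels after PCA
labeled = raw.assign(cluster=labels_pca)   # attach cluster labels to the rows
```

Plotting `curve` against k gives the elbow plot, and comparing `labels_raw` with `labels_pca` mirrors the before-and-after-PCA comparison the abstract mentions. In practice k would be chosen from the elbow rather than fixed in advance.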
About the journal:
The quarterly journal Informatica provides an international forum for high-quality original research and publishes papers on mathematical simulation and optimization, recognition and control, programming theory and systems, automation systems and elements. Informatica provides a multidisciplinary forum for scientists and engineers involved in research and design including experts who implement and manage information systems applications.