Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature

Impact Factor: 6 · JCR Q2 (Computer Science, Artificial Intelligence)

Machine Learning and Knowledge Extraction · Pub Date: 2023-12-11 · DOI: 10.3390/make5040095

Jose Dixon, M. Rahman



Abstract

This paper demonstrates how data preprocessing, training size variation, and subsampling can change the performance metrics of imbalanced text classification. The methodology compares two supervised learning classification approaches to feature engineering and data preprocessing, using five machine learning classifiers, five imbalanced sampling techniques, and specified intervals of training and subsampling sizes, with statistical analysis performed in R and tidyverse. The dataset consists of 1000 Portable Document Format (PDF) files divided into five labels, drawn from the World Health Organization Coronavirus Research Downloadable Articles database (COVID-19 papers) and PubMed Central (non-COVID-19 papers), and is used for binary classification. Performance is measured by precision, recall, receiver operating characteristic area under the curve (ROC AUC), and accuracy. One approach, which labels rows of sentences based on regular expressions, significantly improved the performance of imbalanced sampling techniques compared with the other approach, which automatically labels sentences according to how the documents are organized into positive and negative classes; this was verified with a t-test on the performance metrics recorded across iterations. The study demonstrates the effectiveness of machine learning (ML) classifiers and sampling techniques on text classification datasets, with different performance levels and class imbalance issues observed under the manual and automatic data processing methods.
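To make the regular-expression labeling and subsampling ideas in the abstract concrete, the following is a minimal Python sketch rather than the authors' pipeline. The pattern `COVID_PATTERN`, the helper names, the example sentences, and the choice of random undersampling as the sampling technique are all assumptions for illustration; the abstract names neither the expressions nor the five sampling techniques used.

```python
import random
import re

# Hypothetical pattern: the abstract does not list the actual regular expressions.
COVID_PATTERN = re.compile(r"\b(covid-19|sars-cov-2|coronavirus)\b", re.IGNORECASE)


def label_sentence(sentence):
    """Return 1 (positive) if the sentence matches the pattern, else 0."""
    return 1 if COVID_PATTERN.search(sentence) else 0


def random_undersample(rows, seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    `rows` is a list of (sentence, label) pairs. Random undersampling is one
    common imbalanced sampling technique, assumed here for illustration.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example sentences, not taken from the paper's dataset.
sentences = [
    "COVID-19 patients showed elevated markers.",
    "The cohort was followed for two years.",
    "Samples were stored in a freezer.",
    "Statistical analysis used a t-test.",
    "SARS-CoV-2 RNA was detected in all samples.",
    "Mice were housed in standard conditions.",
]
rows = [(s, label_sentence(s)) for s in sentences]
balanced = random_undersample(rows)
```

Here `rows` holds an imbalanced set (two positive, four negative sentences), and `balanced` keeps both positives plus two randomly chosen negatives. In a fuller experiment, a classifier would be trained on `balanced` and `precision_recall` applied to its predictions on held-out data, repeated across training sizes.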


