Impact of data balancing a multiclass dataset before the creation of association rules to study bacterial vaginosis

IF 4.4 Q1 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Journal: Intelligent medicine, Volume 4 (3), Pages 188-199
Publication type: Journal Article
Publication date: 2024-08-01
DOI: 10.1016/j.imed.2023.02.001
URL: https://www.sciencedirect.com/science/article/pii/S2667102623000190

Abstract

Background

Bacterial vaginosis is a polymicrobial syndrome in which the homeostasis maintained by the Lactobacillus species that protect the vaginal mucosa has been lost. This study explored whether balancing an imbalanced multiclass dataset improves the quality of the association rules created from it.

Methods

A preconstructed dataset with 201 observations and 58 variables was analyzed. The authors collected the data between August 2016 and October 2018 in Tabasco, Mexico. The study population comprised sexually active women aged 18 to 50 who underwent gynecological examination at the infectious and metabolic diseases research laboratory of the Universidad Juarez Autonoma de Tabasco. The random-forest algorithm was used to determine the best k-value, and balancing was performed with the synthetic minority over-sampling technique (SMOTE), random over-sampling examples (ROSE), and adaptive synthetic sampling approach for imbalanced learning (ADASYN) algorithms. The Apriori algorithm created the rules. Statistically significant rules were then selected with the is.redundant(), is.significant(), and is.maximal() functions, using Fisher's exact test as the quality metric. An expert bacteriologist carried out the biological validation.
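
The balancing step can be illustrated with a minimal sketch. It is a simplified SMOTE-style interpolation in plain Python, not the ADASYN procedure the study favoured and not the authors' implementation. The function name `smote_like_balance`, the toy points, and the class labels are all hypothetical.

```python
import math
import random
from collections import Counter


def smote_like_balance(X, y, k=5, seed=0):
    """Oversample every minority class up to the majority-class count.

    Simplified SMOTE-style sketch: each synthetic point is placed on the
    segment between a random class member and one of its k nearest
    same-class neighbours. ADASYN would additionally generate more points
    for samples that are harder to learn; that weighting is omitted here.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            base = rng.choice(members)
            others = [m for m in members if m is not base]
            if not others:
                # A single-member class cannot be interpolated; duplicate it.
                X_out.append(list(base))
                y_out.append(label)
                continue
            neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
            mate = rng.choice(neighbours)
            gap = rng.random()  # position along the segment base -> mate
            X_out.append([b + gap * (m - b) for b, m in zip(base, mate)])
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: three classes with 6, 3, and 2 samples.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.2, 0.2],
     [2.0, 2.1], [2.2, 1.9], [1.9, 2.3],
     [5.0, 0.0], [5.2, 0.3]]
y = ["a"] * 6 + ["b"] * 3 + ["c"] * 2

X_bal, y_bal = smote_like_balance(X, y, k=2)
print(Counter(y_bal))
```

After balancing, every class has as many samples as the largest one, and the original observations are kept unchanged at the front of the returned lists.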

Results

With the ADASYN algorithm at K = 9, the out-of-bag (OOB) error was zero, making this the best K value. ADASYN showed the best performance in the balancing process. From the ADASYN-balanced dataset, the Apriori algorithm created the association rules. After selection with the quality metric Fisher's exact test and biological validation, 13 rules were reported. The Gram - bacteria Atopobium vaginae, Gardnerella vaginalis, Megasphaera phylotype 1, Mycoplasma hominis, and Ureaplasma parvum were detected by the Apriori algorithm from the balanced dataset.
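
The rule creation and significance filtering described in Methods can be sketched as follows. This is a standard-library Python illustration, whereas the study used the is.redundant(), is.significant(), and is.maximal() functions. It performs level-wise Apriori mining and keeps rules whose one-sided Fisher's exact test p-value falls below a threshold; the redundancy and maximality filters are not implemented. The item names, transactions, and thresholds are made up for illustration and are not the study's data.

```python
from itertools import combinations
from math import comb


def fisher_one_sided(n_xy, n_x, n_y, n):
    """Hypergeometric upper tail: P(co-occurrences >= n_xy) under independence."""
    upper = min(n_x, n_y)
    tail = sum(comb(n_x, i) * comb(n - n_x, n_y - i)
               for i in range(n_xy, upper + 1))
    return tail / comb(n, n_y)


def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori: return {itemset: count} for all frequent itemsets."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        kept = {c: cnt for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(kept)
        # Join step: merge kept itemsets that differ by exactly one item.
        size = len(next(iter(kept))) + 1 if kept else 0
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == size})
    return frequent


def significant_rules(transactions, min_support=0.3, min_conf=0.8, alpha=0.05):
    """Rules lhs -> rhs (single-item rhs) passing confidence and Fisher filters."""
    n = len(transactions)
    freq = frequent_itemsets(transactions, min_support)
    rules = []
    for itemset, n_xy in freq.items():
        if len(itemset) < 2:
            continue
        for rhs in itemset:
            lhs = itemset - {rhs}
            n_x, n_y = freq[lhs], freq[frozenset([rhs])]
            conf = n_xy / n_x
            p = fisher_one_sided(n_xy, n_x, n_y, n)
            if conf >= min_conf and p <= alpha:
                rules.append((sorted(lhs), rhs, conf, p))
    return rules


# Hypothetical transactions; the item names are placeholders, not study data.
transactions = [
    {"taxon_A", "taxon_B", "bv_pos"},
    {"taxon_A", "taxon_B", "bv_pos"},
    {"taxon_A", "taxon_B", "bv_pos"},
    {"taxon_A", "bv_pos"},
    {"bv_neg"}, {"bv_neg"}, {"taxon_B", "bv_neg"}, {"bv_neg"},
]

rules = significant_rules(transactions)
for lhs, rhs, conf, p in rules:
    print(lhs, "->", rhs, f"confidence={conf:.2f}", f"p={p:.4f}")
```

In this toy example a rule can have perfect confidence and still be dropped: a rule supported by only three transactions has too little evidence to pass the Fisher filter at alpha = 0.05.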

Conclusion

Balancing may improve the creation of association rules to efficiently model the bacteria that cause bacterial vaginosis.
