Implementation of Ensemble Method on DNA Data Using Various Cross Validation Techniques

3C Tecnología_Glosas de innovación aplicadas a la pyme
Pub Date: 2022-12-29
DOI: 10.17993/3ctecno.2022.v11n2e42.59-69

B. Bawankar, Kotadi Chinnaiah


Citations: 4


Implementation of Ensemble Method on DNA Data Using Various Cross Validation Techniques

As datasets grow to hundreds or thousands of features, feature selection has attracted considerable research interest in recent years. Not all columns carry useful information, and noisy or irrelevant columns can confound learning algorithms and degrade model performance. Various feature selection methods have therefore been developed to evaluate high-dimensional datasets and identify subsets of relevant features. Feature selection algorithms, however, are frequently skewed by the data they are applied to. Ensemble approaches have emerged as an alternative that combines the strengths of individual feature selection algorithms and compensates for their weaknesses. To handle feature selection on high-dimensional datasets, this research aims to understand the key ideas and relationships involved in aggregating feature selection methods. The proposed idea is tested with a cross-validation implementation that combines several Python packages providing the feature selection techniques. The performance of the implementation is demonstrated by identifying relevant features in human, chimpanzee, and dog DNA datasets.
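The abstract does not say which selectors, aggregation rule, or cross-validation scheme the authors used, so the following is only a minimal sketch of the general idea of aggregating feature selection methods across folds. Everything in it is an assumption made for illustration: the k-mer count features, the two simple filter scores (`mean_diff_score`, `variance_score`), the rank-sum aggregation, the interleaved folds, the binary labels, and the toy sequences. It uses only the Python standard library rather than the packages the paper combined.

```python
from collections import Counter
from itertools import product
from statistics import mean, pvariance


def kmer_vocab(k):
    """All k-mers over the DNA alphabet, in a fixed order."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, returned in kmer_vocab(k) order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts[m] for m in kmer_vocab(k)]


def mean_diff_score(X, y):
    """Hypothetical filter 1: |mean(class 0) - mean(class 1)| per feature."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(mean(a) - mean(b)))
    return scores


def variance_score(X, y):
    """Hypothetical filter 2: population variance per feature (labels unused)."""
    return [pvariance([row[j] for row in X]) for j in range(len(X[0]))]


def ranks(scores):
    """Rank 0 is the highest score; ties are broken by feature index."""
    order = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    result = [0] * len(scores)
    for position, j in enumerate(order):
        result[j] = position
    return result


def kfold_train_indices(n, n_folds):
    """Yield training indices for interleaved folds (sample i is held out in fold i % n_folds)."""
    for fold in range(n_folds):
        yield [i for i in range(n) if i % n_folds != fold]


def ensemble_select(X, y, selectors, n_folds, top):
    """Sum each feature's rank over all folds and selectors; keep the lowest totals."""
    total = [0] * len(X[0])
    for train in kfold_train_indices(len(X), n_folds):
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        for selector in selectors:
            for j, r in enumerate(ranks(selector(X_train, y_train))):
                total[j] += r
    return sorted(range(len(total)), key=lambda j: (total[j], j))[:top]


# Toy data (made up): class 0 is A-rich, class 1 is T-rich; C and G are constant.
sequences = ["AAAAAACG", "TTTTTTCG"] * 3
labels = [0, 1] * 3
X = [kmer_counts(s, 1) for s in sequences]

selected = ensemble_select(X, labels, [mean_diff_score, variance_score], n_folds=3, top=2)
selected_kmers = [kmer_vocab(1)[j] for j in selected]
print(selected_kmers)  # → ['A', 'T']
```

Summing ranks rather than raw scores keeps selectors with different score scales comparable, and repeating the selection on each training fold reduces the influence of any single data split. A real pipeline would more likely use stratified folds and established selectors from a library instead of these toy filters.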
