Investigating the overlap of machine learning algorithms in the final results of RNA-seq analysis on gene expression estimation.

IF 3.4, Zone 3 (Medicine), Q1 MEDICAL INFORMATICS

Health Information Science and Systems. Pub Date: 2024-02-29. eCollection Date: 2024-12-01. DOI: 10.1007/s13755-023-00265-4

Kalliopi-Maria Stathopoulou, Spiros Georgakopoulos, Sotiris Tasoulis, Vassilis P Plagianakos

Health Information Science and Systems, 12(1), 14. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10904690/pdf/

Citation count: 0

Abstract

Advances in computer science, combined with next-generation sequencing, have introduced a new era in biology, enabling state-of-the-art analysis of complex biological data. Bioinformatics is evolving as a field that unites computer science and biology, enabling the representation, storage, management, analysis and exploration of many types of data with a plethora of machine learning algorithms and computing tools. In this study, we used machine learning algorithms to detect differentially expressed genes between different types of cancer and to show how far the genes they select overlap with the final results of RNA-sequencing analysis. The datasets were obtained from the National Center for Biotechnology Information resource. Specifically, we used dataset GSE68086, which corresponds to PMID:200,068,086. This dataset consists of 171 blood platelet samples collected from patients with six different tumors and from healthy individuals. All steps of RNA-sequencing analysis (preprocessing, read alignment, transcriptome reconstruction, expression quantification and differential expression analysis) were followed. Machine learning-based Random Forest and Gradient Boosting algorithms were applied to predict significant genes. The RStudio statistical tool was used for the analysis.

Source journal

Health Information Science and Systems (MEDICAL INFORMATICS)

CiteScore: 11.30

Self-citation rate: 5.00%

Number of articles published

Journal description: Health Information Science and Systems is a multidisciplinary journal that integrates artificial intelligence, computer science and information technology with health science and services. It embraces information science research coupled with topics related to the modeling, design, development, integration and management of health information systems, smart health, artificial intelligence in medicine, computer-aided diagnosis, and medical expert systems. The scope includes: (i) smart health, artificial intelligence in medicine, computer-aided diagnosis, medical image processing, and medical expert systems; (ii) medical big data and medical, health and biomedical information resources such as patient medical records, devices and equipment, and software and tools to capture, store, retrieve, process, analyze and optimize the use of information in the health domain; (iii) data management, data mining, and knowledge discovery, all of which play a key role in decision making, management of public health, examination of standards, and privacy and security issues; (iv) development of new architectures and applications for health information systems.