A supervised machine learning workflow for the reduction of highly dimensional biological data

Artificial intelligence in the life sciences; Pub Date: 2023-11-25; DOI: 10.1016/j.ailsci.2023.100090

Linnea K. Andersen, Benjamin J. Reading

Volume 5, Article 100090. Full text: https://www.sciencedirect.com/science/article/pii/S266731852300034X

Abstract

Recent technological advances have transformed research across the biological sciences by enabling the collection of large data sets that give a broader picture of systems, from the cellular to the ecosystem level, at finer resolution. The rapid pace at which these data are generated has worsened bottlenecks in study design and data analysis, especially because conventional methods built on traditional statistical tests and assumptions are unsuitable or insufficient for highly dimensional data (i.e., more than 1,000 variables). Applying machine learning techniques to large data analysis is one promising and increasingly popular solution. However, a major challenge is the limited expertise available for interpreting the results of machine learning models in ways that yield meaningful biological insight. To address this challenge, this article provides a user-friendly machine learning workflow that can be applied to a wide variety of data types to reduce large data sets to the variables (attributes) most determinant of experimental and/or observed conditions, together with a general overview of data analysis and machine learning approaches and the considerations they involve. The workflow has been beta-tested with great success and is recommended for incorporation into large-data analysis pipelines as a standardized approach to reducing data dimensionality. Moreover, the workflow is flexible: its underlying concepts and steps can be modified to best suit user needs, objectives, and study parameters.
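The abstract does not name the algorithms the workflow uses, so the following is only a minimal sketch of the general idea of supervised dimensionality reduction: ranking variables by how well they separate two labeled conditions and keeping the top few. The scoring rule (an absolute Welch-style t statistic), the function names, and the synthetic data are all illustrative assumptions, not the authors' method.

```python
import random
import statistics


def class_separation_score(values, labels):
    """Absolute Welch-style t statistic between two classes for one variable."""
    group_a = [v for v, y in zip(values, labels) if y == 0]
    group_b = [v for v, y in zip(values, labels) if y == 1]
    mean_gap = abs(statistics.mean(group_a) - statistics.mean(group_b))
    # Standard error of the difference in means (unequal-variance form).
    std_err = (statistics.variance(group_a) / len(group_a)
               + statistics.variance(group_b) / len(group_b)) ** 0.5
    return mean_gap / std_err if std_err > 0 else 0.0


def select_top_variables(rows, labels, k):
    """Return the indices of the k variables that best separate the classes."""
    n_variables = len(rows[0])
    scores = []
    for j in range(n_variables):
        column = [row[j] for row in rows]
        scores.append((class_separation_score(column, labels), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def make_toy_data(n_samples=40, n_variables=1500,
                  informative=(3, 42, 777), seed=0):
    """Hypothetical data: mostly noise, with a few condition-dependent variables."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]  # two experimental conditions
    rows = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_variables)]
        for j in informative:
            row[j] += 5.0 * y  # shift informative variables in condition 1
        rows.append(row)
    return rows, labels


rows, labels = make_toy_data()
selected = select_top_variables(rows, labels, k=3)
print(sorted(selected))
```

Here 1,500 variables measured on 40 samples are reduced to the three that carry the condition signal. In practice, a selection step like this would typically be run inside cross-validation rather than on the full data set, so that the selected variables are not tuned to the samples later used for evaluation.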

Source journal

Artificial intelligence in the life sciences Pharmacology, Biochemistry, Genetics and Molecular Biology (General), Computer Science Applications, Health Informatics, Drug Discovery, Veterinary Science and Veterinary Medicine (General)

CiteScore

5.00

Self-citation rate

0.00%

Review time

15 days