A supervised machine learning workflow for the reduction of highly dimensional biological data

Linnea K. Andersen, Benjamin J. Reading
Published: 2023-11-25
DOI: 10.1016/j.ailsci.2023.100090
Article page: https://www.sciencedirect.com/science/article/pii/S266731852300034X
Citations: 0

Abstract

Recent technological advancements have revolutionized research capabilities across the biological sciences by enabling the collection of large data sets that provide a broader picture of systems, from the cellular to the ecosystem level, at a finer resolution. The rapid generation of these data has exacerbated bottlenecks in study design and data analysis, especially because conventional methods built on traditional statistical tests and assumptions are not suitable or sufficient for highly dimensional data (i.e., more than 1,000 variables). Applying machine learning techniques to large data analysis is one promising and increasingly popular solution. However, limited expertise in interpreting the results of machine learning models to gain meaningful biological insight poses a great challenge. To address this challenge, this article provides a user-friendly machine learning workflow that can be applied to a wide variety of data types to reduce these large data to the variables (attributes) most determinant of experimental and/or observed conditions, together with a general overview of data analysis and machine learning approaches and related considerations. The workflow has been beta-tested with great success and is recommended for incorporation into large-data analysis pipelines as a standardized approach to reducing data dimensionality. Moreover, the workflow is flexible: the underlying concepts and steps can be modified to best suit user needs, objectives, and study parameters.
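
The abstract does not spell out the individual steps of the workflow, so the following is only a minimal sketch of the general idea it describes: using class labels (the experimental or observed conditions) to reduce a data set with more than 1,000 variables to the few attributes that best separate those conditions. The scoring rule (difference of class means over pooled standard deviation), the function names, and the synthetic data are all assumptions made for illustration; they are not the authors' method.

```python
import random
import statistics


def rank_attributes(rows, labels):
    """Rank attributes by how well they separate two labeled conditions.

    Score = |mean(class A) - mean(class B)| / pooled standard deviation.
    This simple supervised filter is an illustrative assumption, not the
    scoring used in the published workflow.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    n_attributes = len(rows[0])
    scores = []
    for j in range(n_attributes):
        a = [row[j] for row, y in zip(rows, labels) if y == classes[0]]
        b = [row[j] for row, y in zip(rows, labels) if y == classes[1]]
        gap = abs(statistics.fmean(a) - statistics.fmean(b))
        pooled_sd = ((statistics.pvariance(a) + statistics.pvariance(b)) / 2) ** 0.5
        if pooled_sd > 0:
            scores.append(gap / pooled_sd)
        else:
            # No within-class spread: perfect separation if the means differ,
            # otherwise the attribute carries no information.
            scores.append(float("inf") if gap > 0 else 0.0)
    # Attribute indices, most discriminative first.
    return sorted(range(n_attributes), key=lambda j: scores[j], reverse=True)


def reduce_data(rows, keep):
    """Keep only the selected attribute columns."""
    return [[row[j] for j in keep] for row in rows]


def make_synthetic(n_samples=40, n_attributes=1200, informative=(7, 300), seed=0):
    """Hypothetical data: Gaussian noise plus two attributes shifted by class."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]
    rows = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_attributes)]
        for j in informative:
            row[j] += 5.0 * y  # condition-dependent shift
        rows.append(row)
    return rows, labels


rows, labels = make_synthetic()
ranking = rank_attributes(rows, labels)
top = ranking[:2]
reduced = reduce_data(rows, top)
print("selected attributes:", sorted(top))
print("reduced shape:", len(reduced), "x", len(reduced[0]))
```

In practice a filter like this would be evaluated with cross-validation, since ranking attributes on the same samples used to assess a model can overstate performance when variables far outnumber samples.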

Source journal: Artificial Intelligence in the Life Sciences
Subject areas: Pharmacology, Biochemistry, Genetics and Molecular Biology (General), Computer Science Applications, Health Informatics, Drug Discovery, Veterinary Science and Veterinary Medicine (General)
CiteScore: 5.00
Self-citation rate: 0.00%
Articles published: 0
Review time: 15 days