Getting Started with Machine Learning for Experimental Biochemists and Other Molecular Scientists

Matthew J. K. Vince, Kristin A. Hughes, Anastasiya Buzuk, Deborah L. Perlstein, Lauren A. Viarengo-Baker, Adrian Whitty
Current Protocols, volume 5, issue 4. Journal article, published 2025-04-21.
DOI: 10.1002/cpz1.70085
https://onlinelibrary.wiley.com/doi/10.1002/cpz1.70085

Abstract


Machine learning (ML) is rapidly gaining traction in many areas of experimental molecular science for elucidating relationships and patterns in large or complex data sets. Historically, ML was largely the preserve of those with specialized training in fields such as statistics or cheminformatics. Increasingly, however, ML methodologies are becoming part of the standard toolkit for experimental scientists across a range of disciplines. For scientists without a significant background in computer science or statistics, lowering the barrier to entry for these ML techniques is important to broadening access to these powerful methods. Here we provide detailed, step-by-step protocols for performing four ML methods that are particularly useful for applications in biochemistry, cell biology, and drug discovery: hierarchical clustering, principal component analysis (PCA), partial least squares discriminant analysis (PLSDA), and partial least squares regression (PLSR). The protocols are written for the widely used software MATLAB, but no prior experience with MATLAB is required to use them. We include an explanation of each step, pitched at a level to be understood by investigators without any prior experience with ML, MATLAB, or any kind of coding. We also highlight the scientific issues pertaining to selecting and scaling the data to be analyzed. Throughout, we emphasize the relationship between the scientific question and how to choose data and methods that will allow it to be addressed in a meaningful way. Our aim is to provide a basic introduction that will equip experimental chemical biologists, chemists, and other biomedical scientists with the knowledge required to use ML to aid in the design of experiments, the formulation and data-driven testing of hypotheses, and the analysis of experimental data. © 2025 Wiley Periodicals LLC.

Basic Protocol 1: Clustering

Basic Protocol 2: Principal component analysis

Basic Protocol 3: Partial least squares-discriminant analysis

Basic Protocol 4: Partial least squares regression
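
The abstract notes that how the data are scaled is a scientific decision in its own right. As a rough illustration only, the sketch below shows one common scaling choice, autoscaling (z-scoring each variable), and how it can change which samples look most similar, which is the quantity that distance-based methods such as hierarchical clustering depend on. It is written in Python rather than the MATLAB used by the protocols, and it is not taken from the article: the function names `autoscale` and `nearest_neighbor`, and the four compounds with their molecular weight and logP values, are made up for this example.

```python
import math
import statistics


def autoscale(rows):
    """Z-score each column: subtract the column mean, divide by the sample standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


def nearest_neighbor(rows, index):
    """Return the index of the row with the smallest Euclidean distance to rows[index]."""
    others = [i for i in range(len(rows)) if i != index]
    return min(others, key=lambda i: math.dist(rows[index], rows[i]))


# Hypothetical toy data (not from the article):
# one row per compound, columns are (molecular weight in Da, logP).
compounds = [
    [300.0, 1.0],
    [305.0, 5.0],
    [340.0, 1.1],
    [420.0, 3.0],
]

# Nearest neighbor of the first compound before and after scaling.
raw_neighbor = nearest_neighbor(compounds, 0)
scaled_neighbor = nearest_neighbor(autoscale(compounds), 0)

print("nearest neighbor of compound 0, raw data:   ", raw_neighbor)
print("nearest neighbor of compound 0, autoscaled: ", scaled_neighbor)
```

With these toy values, the raw distances are dominated by molecular weight because its numerical range is much larger, so the first compound appears closest to the second. After autoscaling, both variables contribute on a comparable footing and the third compound, which has a similar logP, becomes the nearest neighbor. Whether autoscaling is appropriate depends on the scientific question, which is the point the abstract makes.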
