Getting Started with Machine Learning for Experimental Biochemists and Other Molecular Scientists

IF 2.2

Current Protocols | Pub Date: 2025-04-21 | DOI: 10.1002/cpz1.70085

Matthew J. K. Vince, Kristin A. Hughes, Anastasiya Buzuk, Deborah L. Perlstein, Lauren A. Viarengo-Baker, Adrian Whitty


Citations: 0

Abstract



Machine learning (ML) is rapidly gaining traction in many areas of experimental molecular science for elucidating relationships and patterns in large or complex data sets. Historically, ML was largely the preserve of those with specialized training in fields such as statistics or cheminformatics. Increasingly, however, ML methodologies are becoming part of the standard toolkit for experimental scientists across a range of disciplines. For scientists without a significant background in computer science or statistics, lowering the barrier of entry to these ML techniques is important to broadening access to these powerful methods. Here we provide detailed, step-by-step protocols for performing four ML methods that are particularly useful for applications in biochemistry, cell biology, and drug discovery: hierarchical clustering, principal component analysis (PCA), partial least squares discriminant analysis (PLSDA), and partial least squares regression (PLSR). The protocols are written for the widely used software MATLAB, but no prior experience with MATLAB is required to use them. We include an explanation of each step, pitched at a level to be understood by investigators without any prior experience with ML, MATLAB, or any kind of coding. We also highlight the scientific issues pertaining to selecting and scaling the data to be analyzed. Throughout, we emphasize the relationship between the scientific question and how to choose data and methods that will allow it to be addressed in a meaningful way. Our aim is to provide a basic introduction that will equip experimental chemical biologists, chemists, and other biomedical scientists with the knowledge required to use ML to aid in the design of experiments, the formulation and data-driven testing of hypotheses, and the analysis of experimental data. © 2025 Wiley Periodicals LLC.

Basic Protocol 1: Clustering

Basic Protocol 2: Principal component analysis

Basic Protocol 3: Partial least squares-discriminant analysis

Basic Protocol 4: Partial least squares regression
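
The abstract stresses that selecting and scaling the data matters as much as the method itself. As a rough illustration of that point for Basic Protocol 1, the sketch below runs a simple agglomerative (hierarchical) clustering with and without autoscaling. It is not the authors' code: the published protocols use MATLAB, whereas this is plain Python using only the standard library. The data values, the function names `autoscale` and `average_linkage`, and the choice of Euclidean distance with average linkage are all assumptions made for the example.

```python
import math
from statistics import mean, stdev

# Hypothetical data: 4 compounds x 2 variables on very different scales
# (column 0 ~ molecular weight, column 1 ~ logP). Values are made up.
data = [
    [300.0, 1.0],
    [420.0, 1.3],
    [330.0, 4.9],
    [470.0, 5.2],
]


def autoscale(rows):
    """Z-score each column: subtract its mean, divide by its sample SD.

    Assumes no column is constant (a zero SD would divide by zero).
    """
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def average_linkage(rows, n_clusters):
    """Agglomerative clustering with Euclidean distance and average linkage.

    Starts with one cluster per row and repeatedly merges the two clusters
    whose mean pairwise distance is smallest, until n_clusters remain.
    Returns the clusters as sorted lists of row indices.
    """
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > n_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = mean(
                    math.dist(rows[a], rows[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # j > i, so index i is unaffected
    return sorted(sorted(c) for c in clusters)


raw_clusters = average_linkage(data, 2)
scaled_clusters = average_linkage(autoscale(data), 2)
print("unscaled:", raw_clusters)
print("autoscaled:", scaled_clusters)
```

On this toy input the unscaled run groups the rows by the large-valued first column, while the autoscaled run lets both variables contribute and groups them by the second column instead. Which grouping is meaningful depends on the scientific question, which is the choice the abstract asks the reader to make deliberately.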


Source journal