Chemical complexity challenge: Is multi-instance machine learning a solution?

IF 16.8; Zone 2 (Chemistry); Q1; CHEMISTRY, MULTIDISCIPLINARY

Wiley Interdisciplinary Reviews: Computational Molecular Science. Pub Date: 2023-11-27. DOI: 10.1002/wcms.1698

Dmitry Zankov, Timur Madzhidov, Alexandre Varnek, Pavel Polishchuk

Citations: 0

Abstract

Molecules are complex dynamic objects that can exist in different molecular forms (conformations, tautomers, stereoisomers, protonation states, etc.), and it is often unknown which form is responsible for the observed physicochemical and biological properties of a given molecule. This raises the problem of selecting the correct molecular form for machine learning modeling of target properties. The same problem is common to biological molecules (RNA, DNA, proteins): long sequences in which only key segments, which often cannot be located precisely, are involved in biological functions. Multi-instance machine learning (MIL) is an efficient approach to problems where the objects under study cannot be uniquely represented by a single instance, but only by a set of multiple alternative instances. Multi-instance learning was formalized in 1997, motivated by the problem of conformation selection in drug activity prediction tasks. Since then, MIL has found many applications in various domains, such as information retrieval, computer vision, signal processing, and bankruptcy prediction. In this review, we describe the MIL framework and its applications to tasks associated with ambiguity in the representation of small and biological molecules in chemoinformatics and bioinformatics. We have collected examples that demonstrate the advantages of MIL over the traditional single-instance learning (SIL) approach. Special attention is paid to the ability of MIL models to identify the key instances responsible for the modeled property.
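The bag-of-instances idea described in the abstract can be illustrated with a minimal sketch. It assumes the common "max-pooling" MIL assumption (a bag's score is the score of its highest-scoring instance), which is only one of several MIL formulations and is not necessarily the one used by the methods covered in the review. The descriptor values, the weights, and the function names `instance_score` and `predict_bag` are hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence, Tuple

Instance = Sequence[float]  # descriptor vector of one molecular form (e.g. one conformer)
Bag = List[Instance]        # all alternative forms of one molecule


def instance_score(x: Instance, weights: Sequence[float], bias: float = 0.0) -> float:
    """Hypothetical linear instance-level scorer (weights are illustrative, not learned)."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_bag(bag: Bag, weights: Sequence[float], bias: float = 0.0) -> Tuple[float, int]:
    """Max-pooling MIL prediction.

    Scores every instance in the bag and returns (bag_score, key_index),
    where key_index points at the instance that determined the bag score.
    """
    if not bag:
        raise ValueError("a bag must contain at least one instance")
    scores = [instance_score(x, weights, bias) for x in bag]
    key_index = max(range(len(scores)), key=scores.__getitem__)
    return scores[key_index], key_index


# Toy molecule: three conformers, each described by two made-up descriptors.
molecule = [[0.1, 0.9], [0.8, 0.2], [0.4, 0.4]]
weights = [1.0, -0.5]  # assumed values for illustration only

bag_score, key_index = predict_bag(molecule, weights)
print(f"bag score = {bag_score:.2f}, key instance = conformer {key_index}")
```

In this sketch the molecule is never collapsed into a single representation, as a single-instance model would require. The prediction is made over the whole set of forms, and the returned index indicates which form the model treats as the key instance.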

This article is categorized under:

Source journal

Wiley Interdisciplinary Reviews: Computational Molecular Science (CHEMISTRY, MULTIDISCIPLINARY; MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore

28.90

Self-citation rate

1.80%

Articles published

Review time

6-12 weeks

Journal description: Computational molecular sciences harness the power of rigorous chemical and physical theories, employing computer-based modeling, specialized hardware, software development, algorithm design, and database management to explore and illuminate every facet of molecular sciences. These interdisciplinary approaches form a bridge between chemistry, biology, and materials sciences, establishing connections with adjacent application-driven fields in both chemistry and biology. WIREs Computational Molecular Science stands as a platform to comprehensively review and spotlight research from these dynamic and interconnected fields.