Molecular design for cardiac cell differentiation using a small dataset and decorated shape features

arXiv - QuanBio - Biomolecules. Pub Date: 2024-07-22. DOI: arxiv-2407.15322

Fatemeh Etezadi, Shunichi Ito, Kosuke Yasui, Rodi Kado Abdalkader, Itsunari Minami, Motonari Uesugi, Ganesh Pandian Namasivayam, Haruko Nakano, Atsushi Nakano, Daniel M. Packwood


Abstract

The discovery of small organic compounds for inducing stem cell differentiation is a time- and resource-intensive process. While data science could, in principle, facilitate the discovery of these compounds, novel approaches are required due to the difficulty of acquiring training data from large numbers of example compounds. In this paper, we demonstrate the design of a new compound for inducing cardiomyocyte differentiation using simple regression models trained with a data set containing only 80 examples. We introduce decorated shape descriptors, an information-rich molecular feature representation that integrates both molecular shape and hydrophilicity information. These models demonstrate improved performance compared to ones using standard molecular descriptors based on shape alone. Model overtraining is diagnosed using a new type of sensitivity analysis. Our new compound is designed using a conservative molecular design strategy, and its effectiveness is confirmed through expression profiles of cardiomyocyte-related marker genes using real-time polymerase chain reaction experiments on human iPS cell lines. This work demonstrates a viable data-driven strategy for designing new compounds for stem cell differentiation protocols and will be useful in situations where training data is limited.
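As a rough illustration of why overtraining is a concern with only 80 examples, the sketch below compares the training error of a simple regression model with its leave-one-out error. This is a standard check and is not the new sensitivity analysis described in the abstract; it also does not use decorated shape descriptors. The data are synthetic, and the feature count, the ridge penalty values, and the function names (`ridge_fit`, `train_and_loo_rmse`) are all assumptions made for the example.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def train_and_loo_rmse(X, y, alpha):
    """Return (training RMSE, leave-one-out RMSE) for a ridge model."""
    w, b = ridge_fit(X, y, alpha)
    train_rmse = np.sqrt(np.mean((X @ w + b - y) ** 2))

    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i              # hold out example i
        w_i, b_i = ridge_fit(X[keep], y[keep], alpha)
        errors[i] = X[i] @ w_i + b_i - y[i]   # error on the held-out example
    return train_rmse, np.sqrt(np.mean(errors ** 2))


# Synthetic stand-in data (hypothetical): 80 "compounds", 40 descriptors,
# of which only the first three actually influence the response.
rng = np.random.default_rng(0)
n_compounds, n_features = 80, 40
X = rng.normal(size=(n_compounds, n_features))
true_w = np.zeros(n_features)
true_w[:3] = [1.0, -0.5, 0.25]
y = X @ true_w + rng.normal(scale=0.5, size=n_compounds)

for alpha in (1e-3, 1.0, 10.0, 100.0):
    train_rmse, loo_rmse = train_and_loo_rmse(X, y, alpha)
    print(f"alpha={alpha:g}  train RMSE={train_rmse:.3f}  LOO RMSE={loo_rmse:.3f}")
```

A large gap between the two errors suggests the model is fitting noise in the training set. With a dataset this small, refitting the model once per held-out example is cheap, which is why leave-one-out is used here instead of a single train/test split.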


