Molecular design for cardiac cell differentiation using a small dataset and decorated shape features
Fatemeh Etezadi, Shunichi Ito, Kosuke Yasui, Rodi Kado Abdalkader, Itsunari Minami, Motonari Uesugi, Ganesh Pandian Namasivayam, Haruko Nakano, Atsushi Nakano, Daniel M. Packwood
arXiv - QuanBio - Biomolecules, published 2024-07-22
DOI: arxiv-2407.15322
Abstract
The discovery of small organic compounds for inducing stem cell
differentiation is a time- and resource-intensive process. While data science
could, in principle, facilitate the discovery of these compounds, novel
approaches are required due to the difficulty of acquiring training data from
large numbers of example compounds. In this paper, we demonstrate the design of
a new compound for inducing cardiomyocyte differentiation using simple
regression models trained with a data set containing only 80 examples. We
introduce decorated shape descriptors, an information-rich molecular feature
representation that integrates both molecular shape and hydrophilicity
information. Models built on these descriptors outperform ones that use
standard molecular descriptors based on shape alone. Model overtraining
is diagnosed using a new type of sensitivity analysis. Our new compound is
designed using a conservative molecular design strategy, and its effectiveness
is confirmed through expression profiles of cardiomyocyte-related marker genes
using real-time polymerase chain reaction experiments on human iPS cell lines.
This work demonstrates a viable data-driven strategy for designing new
compounds for stem cell differentiation protocols and will be useful in
situations where training data is limited.