Generating synthetic multidimensional molecular time series data for machine learning: considerations

Frontiers in Systems Biology. Pub Date: 2023-07-25. DOI: 10.3389/fsysb.2023.1188009

G. An, Chase Cockrell



Abstract

The use of synthetic data is recognized as a crucial step in the development of neural network-based Artificial Intelligence (AI) systems. Methods for generating synthetic data in other domains have a role in certain biomedical AI systems, primarily those related to image processing, but there is a critical gap in the generation of time series data for AI tasks where it is necessary to know how the system works. This gap is most pronounced for synthetic multi-dimensional molecular time series data (subsequently referred to as synthetic mediator trajectories, or SMTs); this is the type of data that underpins research into biomarkers and mediator signatures for forecasting various diseases and is an essential component of the drug development pipeline. We argue that statistical and data-centric machine learning (ML) means of generating this type of synthetic data are insufficient due to a combination of factors: perpetual data sparsity due to the Curse of Dimensionality, the inapplicability of the Central Limit Theorem for making assumptions about the statistical distributions of this type of data, and the inability to use ab initio simulations due to the state of perpetual epistemic incompleteness in cellular/molecular biology. As an alternative, we present a rationale for using complex multi-scale mechanism-based simulation models, constructed and operated on to account for perpetual epistemic incompleteness and the need to provide maximal expansiveness in concordance with the Maximal Entropy Principle. These procedures provide for the generation of SMTs that minimize the known shortcomings associated with neural network AI systems, namely overfitting and lack of generalizability. The generation of synthetic data that accounts for the identified factors of multi-dimensional time series data is an essential capability for the development of mediator-biomarker based AI forecasting systems, and for therapeutic control development and optimization.
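
The data-sparsity argument can be made concrete with a small back-of-the-envelope sketch. The code below is not taken from the paper; the function name `max_coverage`, the sample count, and the choice of three bins per mediator (for example low, normal, high) are all illustrative assumptions. It computes an upper bound on the fraction of a discretized state space that a fixed number of measurements could possibly occupy as the number of measured mediators grows.

```python
def max_coverage(n_samples: int, n_dims: int, bins_per_dim: int) -> float:
    """Upper bound on the fraction of grid cells that n_samples points can occupy.

    Each of the n_dims mediators is discretized into bins_per_dim levels, so the
    state space has bins_per_dim ** n_dims cells. Even if every sample landed in
    a distinct cell, at most n_samples cells would be covered.
    """
    n_cells = bins_per_dim ** n_dims
    return min(1.0, n_samples / n_cells)


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 measurements, 3 levels per mediator.
    for n_dims in (1, 2, 5, 10, 20):
        coverage = max_coverage(10_000, n_dims, 3)
        print(f"{n_dims:>2} mediators: at most {coverage:.2e} of cells covered")
```

Under these assumed numbers the bound falls from full coverage at a handful of mediators to a vanishing fraction at twenty, which is one way to read the claim that sparsity is perpetual rather than a matter of collecting somewhat more data.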
