Scalable topological data analysis for life science applications

Proceedings of the 18th ACM International Conference on Computing Frontiers. Pub Date: 2021-05-11. DOI: 10.1145/3457388.3459983

A. Kalyanaraman



Abstract

Discovery and foundational understanding in the modern life sciences have largely come to center on our ability to effectively analyze large swathes of complex data from a diverse range of sources, capturing information encapsulated across the different layers of the nature-built system. While this data-centric approach has been the primary driver in computational life sciences and discovery pipelines for several decades, the field has decisively diverged in the last few years on how and why these data are collected. More specifically, in contrast to the genomic and other -omic projects of yesteryear, modern data collection by and large happens in an analysis-agnostic fashion: complex data are collected without any specific hypotheses to drive the collection, simply because affordable high-throughput technologies are readily available. This has led to a fundamental shift in how we process these data and what we can glean from them. In this work, we present a novel algorithmic and software framework called Hyppo-X, which uses algebraic topology to discover hidden structure within complex biological data sets [1, 3]. Topology is the field of computational mathematics that deals with structure at large. Computational topology and its applications constitute an emerging area of research with ample scope for development and data-driven discovery. We present results of our extensive collaborative studies in developing and applying our methods to analyze two types of data: plant phenomics data obtained from agricultural fields [2], and patient trajectories obtained from a network of hospitals, toward antimicrobial stewardship [4]. Topological data analysis holds tremendous promise for modeling and analyzing high-dimensional data sets in numerous scientific domains, and is likely to become part of future machine learning pipelines. These early studies demonstrate its potential while also highlighting a number of challenges and opportunities for future research. The software is available for download at https://mhmethun.com/HYPPO-X/.
