Exploring Genomic Datasets: from Batch to Interactive and Back

Proceedings of the 5th International Workshop on Exploratory Search in Databases and the Web (5th : 2018 : Houston, Tex.). Pub Date: 2018-06-15. DOI: 10.1145/3214708.3214710

Luca Nanni, Pietro Pinoli, Arif Canakoglu, S. Ceri

Citations: 6

Abstract

Genomic data management is focused on achieving high performance over big datasets using batch, cloud-based architectures; this enables the execution of massive pipelines, but hampers the capability of exploring the solution space when it is not well-defined, by choosing different experimental samples or query extraction parameters. We present PyGMQL, a Python-based interoperability software layer that enables testing of experimental pipelines; PyGMQL solves the impedance mismatch between a batch execution environment and the agile programming style of Python, and provides transparency of access when exploration requires integrating local and remote resources. Wrapping PyGMQL and Python primitives within Jupyter notebooks guarantees reproducibility of the pipeline when used in different contexts or by different scientists. The software is freely available at https://github.com/DEIB-GECO/PyGMQL.
