Feature frequency profiles for automatic sample identification using PySpark

Workshop on Python for High-Performance and Scientific Computing. Published: 2015-11-15. DOI: 10.1145/2835857.2835862

Gregory J. Zynda, N. Gaffney, Mehmet M. Dalkilic, M. Vaughn

Abstract

When the identity of a next-generation sequencing sample is lost, its reads or assembled contigs are aligned to a database of known genomes, and the sample is classified as the genome with the most hits. However, alignment-based methods are very expensive when dealing with millions of reads and several thousand genomes that share homologous sequences. Instead of relying on alignment, samples and references could be compared and classified by their feature frequency profiles (FFPs), which are similar to the word frequency profiles (n-grams) used to compare bodies of text. The FFP is also well suited to a metagenomics setting, where a mixed sample can be reconstructed from a pool of reference profiles using a linear model or optimization techniques. To test the robustness of this method, an assortment of samples will be matched to complete references from NCBI Genome. Since a MapReduce framework is ideal for calculating feature frequencies in parallel, this method will be implemented using the PySpark API and run at scale on Wrangler, an XSEDE system designed for big data analytics.
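The idea of comparing samples by feature frequency profile can be illustrated with a minimal single-machine sketch in plain Python. The abstract does not state the feature length, the distance measure, or any function names, so `k=4`, the Jensen-Shannon divergence, and the helpers `ffp`, `js_divergence`, and `classify` are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from math import log2


def ffp(sequence, k=4):
    """Return a feature frequency profile: k-mer -> relative frequency.

    k=4 is an arbitrary illustrative choice, not a value from the paper.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Keep only k-mers made of unambiguous bases (drops features containing N).
    counts = {kmer: n for kmer, n in counts.items() if set(kmer) <= set("ACGT")}
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two profiles.

    Assumed distance measure; 0 means the profiles are identical.
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2
        if pi:
            div += 0.5 * pi * log2(pi / mi)
        if qi:
            div += 0.5 * qi * log2(qi / mi)
    return div


def classify(sample, references, k=4):
    """Return the name of the reference whose profile is closest to the sample."""
    sample_profile = ffp(sample, k)
    return min(
        references,
        key=lambda name: js_divergence(sample_profile, ffp(references[name], k)),
    )


# Toy sequences invented for this example; real inputs would be reads and genomes.
references = {
    "ref_a": "ACGTACGTACGTACGTACGT",
    "ref_b": "GGGCCCGGGCCCGGGCCCTT",
}
best_match = classify("CGTACGTACGTA", references, k=3)
```

The counting step is the part that maps naturally onto MapReduce: each sequence emits its k-mers and the counts are summed per key. In PySpark this would correspond to a `flatMap` over sequences followed by `reduceByKey`, though how the authors structure their job is not described in the abstract.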
