Feature frequency profiles for automatic sample identification using PySpark

Gregory J. Zynda, N. Gaffney, Mehmet M. Dalkilic, M. Vaughn
Workshop on Python for High-Performance and Scientific Computing, published 2015-11-15. DOI: 10.1145/2835857.2835862 (https://doi.org/10.1145/2835857.2835862)

Abstract

When the identity of a next-generation sequencing sample is lost, reads or assembled contigs are aligned to a database of known genomes and classified as the match with the most hits. However, alignment-based methods are computationally expensive when dealing with millions of reads and several thousand genomes with homologous sequences. Instead of relying on alignment, samples and references could be compared and classified by their feature frequency profiles (FFPs), which are similar to the word frequency profiles (n-grams) used to compare bodies of text. The FFP is also well suited to a metagenomics setting, where a mixed sample can be reconstructed from a pool of reference profiles using a linear model or optimization techniques. To test the robustness of this method, an assortment of samples will be matched to complete references from NCBI Genome. Since a MapReduce framework is well suited to calculating feature frequencies in parallel, this method will be implemented using the PySpark API and run at scale on Wrangler, an XSEDE system designed for big data analytics.
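The idea of classifying a sample by its feature frequency profile can be illustrated with a minimal single-machine sketch in plain Python. This is not the authors' implementation: the function names (`ffp`, `js_divergence`, `classify`), the toy sequences, the choice of k = 3, and the use of Jensen-Shannon divergence as the distance are all assumptions made for illustration, since the abstract does not specify them. In a PySpark version, the counting step would presumably map onto RDD operations such as `flatMap` and `reduceByKey`.

```python
from collections import Counter
from math import log2


def ffp(sequence, k):
    """Return the normalized k-mer frequency profile of a sequence."""
    if len(sequence) < k:
        raise ValueError("sequence is shorter than k")
    # "Map": emit every overlapping k-mer; "reduce": count occurrences.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse profiles.

    Assumed distance measure; the abstract does not name one.
    """
    div = 0.0
    for kmer in set(p) | set(q):
        pi, qi = p.get(kmer, 0.0), q.get(kmer, 0.0)
        mi = (pi + qi) / 2
        if pi > 0:
            div += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            div += 0.5 * qi * log2(qi / mi)
    return div


def classify(sample, references, k=3):
    """Return the name of the reference whose profile is closest to the sample."""
    sample_profile = ffp(sample, k)
    return min(
        references,
        key=lambda name: js_divergence(sample_profile, ffp(references[name], k)),
    )


# Hypothetical toy data, not real genomes.
references = {
    "ref_a": "ACGTACGTACGTACGT",
    "ref_b": "TTTTGGGGCCCCAAAA",
}
best_match = classify("CGTACGTACG", references)
print(best_match)
```

Because each profile is a sparse dictionary of k-mer frequencies, the comparison never requires aligning the sample to a reference; the cost is dominated by counting, which is the step that parallelizes naturally.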