Software for pre-processing Illumina next-generation sequencing short read sequences.

Q2 Decision Sciences

Source Code for Biology and Medicine. Pub Date: 2014-05-03; eCollection Date: 2014-01-01; DOI: 10.1186/1751-0473-9-8

Chuming Chen, Sari S Khaleel, Hongzhan Huang, Cathy H Wu


Citations: 178

Abstract

Background: When compared to Sanger sequencing technology, next-generation sequencing (NGS) technologies are hindered by shorter sequence read length, higher base-call error rate, non-uniform coverage, and platform-specific sequencing artifacts. These characteristics lower the quality of their downstream analyses, e.g. de novo and reference-based assembly, by introducing sequencing artifacts and errors that may contribute to incorrect interpretation of data. Although many tools have been developed for quality control and pre-processing of NGS data, none of them provide flexible and comprehensive trimming options in conjunction with parallel processing to expedite pre-processing of large NGS datasets.

Methods: We developed ngsShoRT (next-generation sequencing Short Reads Trimmer), a flexible and comprehensive open-source software package written in Perl that provides a set of algorithms commonly used for pre-processing NGS short read sequences. We compared the features and performance of ngsShoRT with those of existing tools: CutAdapt, NGS QC Toolkit, and Trimmomatic. We also compared the effects of using pre-processed short read sequences generated by different algorithms on de novo and reference-based assembly for three different genomes: Caenorhabditis elegans, Saccharomyces cerevisiae S288c, and Escherichia coli O157:H7.

Results: Several combinations of ngsShoRT algorithms were tested on publicly available Illumina GA II, HiSeq 2000, and MiSeq eukaryotic and bacterial genomic short read sequences, with the focus on removing sequencing artifacts and low-quality reads and/or bases. Our results show that, across three organisms and three sequencing platforms, trimming improved the mean quality scores of the trimmed sequences. Using trimmed sequences for de novo and reference-based assembly improved assembly quality as well as assembler performance. In general, ngsShoRT outperformed comparable trimming tools in terms of trimming speed and improvement of de novo and reference-based assembly, as measured by assembly contiguity and correctness.

Conclusions: Trimming of short read sequences can improve the quality of de novo and reference-based assembly and assembler performance. The parallel processing capability of ngsShoRT reduces trimming time and improves memory efficiency when dealing with large datasets. We recommend combining sequencing artifact removal with quality-score-based read filtering and base trimming as the most consistent method for improving sequence quality and downstream assemblies. ngsShoRT source code, user guide, and tutorial are available at http://research.bioinformatics.udel.edu/genomics/ngsShoRT/. ngsShoRT can be incorporated as a pre-processing step in genome and transcriptome assembly projects.
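The recommended combination of artifact removal, base trimming, and read filtering can be illustrated with a minimal Python sketch. This is not ngsShoRT's implementation (ngsShoRT is written in Perl, and its algorithms and options are described in its user guide). The function names, the thresholds, the exact-match adapter search, and the Phred+33 quality encoding are all assumptions chosen for illustration.

```python
from statistics import mean

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality, offset=PHRED_OFFSET):
    """Convert an ASCII-encoded FASTQ quality string to integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def remove_adapter(seq, qual, adapter):
    """Clip the read at the first exact occurrence of the adapter (hypothetical helper)."""
    pos = seq.find(adapter)
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


def trim_3prime(seq, qual, min_q=20):
    """Drop bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(seq, qual, adapter, min_q=20, min_mean_q=20, min_len=5):
    """Artifact removal, then base trimming, then read filtering.

    Returns the cleaned (seq, qual) pair, or None if the read is discarded.
    All thresholds are illustrative defaults, not values from the paper.
    """
    seq, qual = remove_adapter(seq, qual, adapter)
    seq, qual = trim_3prime(seq, qual, min_q)
    if len(seq) < min_len:
        return None  # too short after trimming
    if mean(phred_scores(qual)) < min_mean_q:
        return None  # mean quality too low
    return seq, qual


if __name__ == "__main__":
    adapter = "AGATCGGAAGAGC"  # example adapter prefix, used here as an assumption
    read = "ACGTACGTAC" + adapter
    quality = "IIIIIIII##" + "I" * len(adapter)
    print(preprocess(read, quality, adapter))
```

In this sketch the adapter is clipped first so that trailing adapter bases with high quality scores do not shield low-quality read bases from the 3' trimming step. A real tool would typically also tolerate mismatches in the adapter and process paired-end reads together.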




Source journal

Source Code for Biology and Medicine (Decision Sciences: Information Systems and Management)

Self-citation rate: 0.00%

Articles published

About the journal: Source Code for Biology and Medicine is a peer-reviewed, open access, online journal that publishes articles on source code employed over a wide range of applications in biology and medicine. The journal's aim is to publish source code for distribution and use in the public domain in order to advance biological and medical research. Through this dissemination, it may be possible to shorten the time required for solving certain computational problems for which there is limited source code availability or resources.