Analysis of Compression Techniques for DNA Sequence Data

arXiv: Other Quantitative Biology. Pub Date: 2020-06-01. DOI: 10.13140/RG.2.2.14683.00806

Shakeela Bibi, Javed Iqbal, Adnan Iftekhar, Mir Hassan


Abstract

Biological data consists mainly of deoxyribonucleic acid (DNA) and protein sequences. These biomolecules are present in all human cells. Because DNA is self-replicating, it is a key constituent of the genetic material found in all living organisms, and it carries the genetic information required for their functioning and development. Storing the DNA data of a single person requires 10 CD-ROMs. Moreover, the volume of data keeps growing as more and more sequences are added to public databases. This rapid growth makes it challenging to extract precise information from the data, since many data analysis and visualization tools cannot process such large volumes. To reduce the size of DNA and protein sequences, researchers have introduced various sequence compression approaches, including compress and gzip, Context Tree Weighting (CTW), Lempel-Ziv-Welch (LZW), arithmetic coding, run-length encoding, and substitution methods. These techniques have contributed substantially to reducing the volume of biological datasets. On the other hand, traditional compression techniques are not well suited to compressing this kind of sequence data. In this paper, we explore diverse techniques for compressing large amounts of DNA sequence data. Our analysis shows that efficient techniques not only reduce the size of a sequence but also avoid any loss of information. The review of existing studies also shows that compressing a DNA sequence is significant for understanding the critical characteristics of DNA data, in addition to improving storage efficiency and data transmission. The compression of protein sequences, by contrast, remains a challenge for the research community. The major parameters for evaluating these compression algorithms include compression ratio and running time complexity.
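To make the notions of lossless compression and compression ratio concrete, the following is a minimal Python sketch of a simple substitution-style scheme that packs each base into 2 bits. It is an illustration only and is not taken from the paper: the function names (`pack_2bit`, `unpack_2bit`, `compression_ratio`) are hypothetical, the input is assumed to contain only the uppercase letters A, C, G and T, and the ratio is assumed to be defined as original size divided by compressed size (other definitions, such as bits per base, are also in use).

```python
# Illustrative sketch only; not the method of any paper discussed here.
# Assumption: the sequence contains only the four bases A, C, G, T.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def pack_2bit(seq: str) -> bytes:
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        # Left-align a short final chunk so decoding always reads top bits first.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    # Drop the padding bases introduced by a partially filled last byte.
    return "".join(bases[:length])


def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Assumed definition: original size / compressed size (higher is better)."""
    return original_size / compressed_size


seq = "ACGTACGTTTGA"
packed = pack_2bit(seq)
restored = unpack_2bit(packed, len(seq))
ratio = compression_ratio(len(seq.encode("ascii")), len(packed))
```

In this sketch the sequence length has to be stored alongside the packed bytes, because the final byte may contain padding. Real DNA files also contain symbols such as N and header lines, which this fixed 2-bit mapping does not handle.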
