Genomic Compression With Read Alignment at the Decoder

Yotam Gershon;Yuval Cassuto
IEEE Journal on Selected Areas in Information Theory, vol. 4, pp. 314-330, published 2023-08-01. DOI: 10.1109/JSAIT.2023.3300831. https://ieeexplore.ieee.org/document/10198542/
Citation count: 1

Abstract

We propose a new compression scheme for genomic data given as sequence fragments called reads. The scheme uses a reference genome at the decoder side only, freeing the encoder from the burdens of storing references and performing computationally costly alignment operations. The main ingredient of the scheme is a multi-layer code construction, delivering to the decoder sufficient information to align the reads, correct their differences from the reference, validate their reconstruction, and correct reconstruction errors. The core of the method is the well-known concept of distributed source coding with decoder side information, fortified by a generalized-concatenation code construction enabling efficient embedding of all the information needed for reliable reconstruction. We first present the scheme for the case of substitution errors only between the reads and the reference, and then extend it to support reads with a single deletion and multiple substitutions. A central tool in this extension is a new distance metric that is shown analytically to improve alignment performance over existing distance metrics.
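
The idea of compressing without a reference and aligning only at the decoder can be illustrated with a toy sketch. This is not the construction from the paper. It assumes a binary alphabet (real reads are quaternary), at most one substitution per read, a Hamming(31, 26) code for the syndrome, and a truncated CRC as a stand-in for the validation layer. The names `syndrome`, `check`, `encode`, and `decode` are hypothetical.

```python
import random
import zlib

N = 31  # toy read length in bits; Hamming(31, 26) gives a 5-bit syndrome


def syndrome(bits):
    """Syndrome under the Hamming parity-check matrix whose i-th column is i + 1."""
    s = 0
    for i, b in enumerate(bits):
        if b:
            s ^= i + 1
    return s


def check(bits):
    """Toy 8-bit validation tag (assumption: a stand-in, not the paper's code layer)."""
    return zlib.crc32(bytes(bits)) & 0xFF


def encode(read):
    """Encoder: needs no reference and no alignment; sends 5 + 8 bits instead of 31."""
    return syndrome(read), check(read)


def decode(message, reference):
    """Decoder: scan every reference window, coset-decode, keep tag-matching candidates."""
    s, tag = message
    candidates = []
    for offset in range(len(reference) - N + 1):
        window = reference[offset:offset + N]
        diff = s ^ syndrome(window)  # equals the syndrome of (read XOR window)
        candidate = list(window)
        if diff:  # a nonzero syndrome names the single position to flip
            candidate[diff - 1] ^= 1
        if check(candidate) == tag:
            candidates.append((offset, candidate))
    return candidates


# Usage: a random reference, and a read that differs from it by one substitution.
rng = random.Random(0)
reference = [rng.randint(0, 1) for _ in range(200)]
true_offset = 57
read = reference[true_offset:true_offset + N]
read[10] ^= 1
candidates = decode(encode(read), reference)
print((true_offset, read) in candidates)
```

At the true offset the window differs from the read in at most one position, so syndrome decoding recovers the read exactly and the tag matches. Other offsets can still pass the 8-bit tag by chance, which is why a practical scheme needs stronger alignment and validation information than this sketch provides.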