基于傅立叶的基因组大数据快速安全传输的数据最小化算法

2018 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2018-07-01 DOI:10.1109/BigDataCongress.2018.00024

Mohammed Aledhari, Marianne Di Pierro, F. Saeed

{"title":"基于傅立叶的基因组大数据快速安全传输的数据最小化算法","authors":"Mohammed Aledhari, Marianne Di Pierro, F. Saeed","doi":"10.1109/BigDataCongress.2018.00024","DOIUrl":null,"url":null,"abstract":"DNA sequencing plays an important role in the bioinformatics research community. DNA sequencing is important to all organisms, especially to humans and from multiple perspectives. These include understanding the correlation of specific mutations that plays a significant role in increasing or decreasing the risks of developing a disease or condition, or finding the implications and connections between the genotype and the phenotype. Advancements in the high-throughput sequencing techniques, tools, and equipment, have helped to generate big genomic datasets due to the tremendous decrease in the DNA sequence costs. However, the advancements have posed great challenges to genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Data size plays an important role in addressing these challenges. Accordingly, data minimization techniques have recently attracted much interest in the bioinformatics research community. Therefore, it is critical to develop new ways to minimize the data size. This paper presents a new real-time data minimization mechanism of big genomic datasets to shorten the transfer time in a more secure manner, despite the potential occurrence of a data breach. Our method involves the application of the random sampling of Fourier transform theory to the real-time generated big genomic datasets of both formats: FASTA and FASTQ and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm is up to 79% of FASTA datasets' size reduction, with 98-fold faster and more secure than the standard data-encoding method. Also, the results show up to 45% of FASTQ datasets' size reduction with 57-fold faster than the standard data-encoding approach. Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets.","PeriodicalId":177250,"journal":{"name":"2018 IEEE International Congress on Big Data (BigData Congress)","volume":"35 3","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets\",\"authors\":\"Mohammed Aledhari, Marianne Di Pierro, F. Saeed\",\"doi\":\"10.1109/BigDataCongress.2018.00024\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"DNA sequencing plays an important role in the bioinformatics research community. DNA sequencing is important to all organisms, especially to humans and from multiple perspectives. These include understanding the correlation of specific mutations that plays a significant role in increasing or decreasing the risks of developing a disease or condition, or finding the implications and connections between the genotype and the phenotype. Advancements in the high-throughput sequencing techniques, tools, and equipment, have helped to generate big genomic datasets due to the tremendous decrease in the DNA sequence costs. However, the advancements have posed great challenges to genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Data size plays an important role in addressing these challenges. Accordingly, data minimization techniques have recently attracted much interest in the bioinformatics research community. Therefore, it is critical to develop new ways to minimize the data size. This paper presents a new real-time data minimization mechanism of big genomic datasets to shorten the transfer time in a more secure manner, despite the potential occurrence of a data breach. Our method involves the application of the random sampling of Fourier transform theory to the real-time generated big genomic datasets of both formats: FASTA and FASTQ and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm is up to 79% of FASTA datasets' size reduction, with 98-fold faster and more secure than the standard data-encoding method. Also, the results show up to 45% of FASTQ datasets' size reduction with 57-fold faster than the standard data-encoding approach. Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets.\",\"PeriodicalId\":177250,\"journal\":{\"name\":\"2018 IEEE International Congress on Big Data (BigData Congress)\",\"volume\":\"35 3\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 IEEE International Congress on Big Data (BigData Congress)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/BigDataCongress.2018.00024\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE International Congress on Big Data (BigData Congress)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BigDataCongress.2018.00024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

DNA测序在生物信息学研究领域发挥着重要作用。DNA测序对所有生物都很重要，尤其是对人类，从多个角度来看。这些包括了解在增加或减少疾病或病症发生风险中起重要作用的特定突变的相关性，或发现基因型和表型之间的含义和联系。由于DNA测序成本的大幅下降，高通量测序技术、工具和设备的进步有助于产生大型基因组数据集。然而，这些进步对基因组数据的存储、分析和传输提出了巨大的挑战。访问、操作和共享生成的大型基因组数据集在时间、大小和隐私方面都面临着重大挑战。数据大小在应对这些挑战方面发挥着重要作用。因此，数据最小化技术最近引起了生物信息学研究界的极大兴趣。因此，开发最小化数据大小的新方法至关重要。本文提出了一种新的大基因组数据集实时数据最小化机制，在可能发生数据泄露的情况下，以更安全的方式缩短传输时间。我们的方法包括将傅立叶变换理论的随机抽样应用于实时生成的FASTA和FASTQ两种格式的大基因组数据集，并为数据集中最常见的字符分配尽可能低的码字。我们的研究结果表明，所提出的数据最小化算法最多可减少FASTA数据集大小的79%，比标准数据编码方法快98倍，安全性更高。此外，结果显示，FASTQ数据集的大小减少了45%，比标准数据编码方法快57倍。基于我们的研究结果，我们得出结论，所提出的数据最小化算法在当前的数据编码方法中为实时生成的大型基因组数据集提供了最好的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

A Fourier-Based Data Minimization Algorithm for Fast and Secure Transfer of Big Genomic Datasets

DNA sequencing plays an important role in the bioinformatics research community. DNA sequencing is important to all organisms, especially to humans and from multiple perspectives. These include understanding the correlation of specific mutations that plays a significant role in increasing or decreasing the risks of developing a disease or condition, or finding the implications and connections between the genotype and the phenotype. Advancements in the high-throughput sequencing techniques, tools, and equipment, have helped to generate big genomic datasets due to the tremendous decrease in the DNA sequence costs. However, the advancements have posed great challenges to genomic data storage, analysis, and transfer. Accessing, manipulating, and sharing the generated big genomic datasets present major challenges in terms of time and size, as well as privacy. Data size plays an important role in addressing these challenges. Accordingly, data minimization techniques have recently attracted much interest in the bioinformatics research community. Therefore, it is critical to develop new ways to minimize the data size. This paper presents a new real-time data minimization mechanism of big genomic datasets to shorten the transfer time in a more secure manner, despite the potential occurrence of a data breach. Our method involves the application of the random sampling of Fourier transform theory to the real-time generated big genomic datasets of both formats: FASTA and FASTQ and assigns the lowest possible codeword to the most frequent characters of the datasets. Our results indicate that the proposed data minimization algorithm is up to 79% of FASTA datasets' size reduction, with 98-fold faster and more secure than the standard data-encoding method. Also, the results show up to 45% of FASTQ datasets' size reduction with 57-fold faster than the standard data-encoding approach. Based on our results, we conclude that the proposed data minimization algorithm provides the best performance among current data-encoding approaches for big real-time generated genomic datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2018 IEEE International Congress on Big Data (BigData Congress)

自引率

0.00%

发文量