Privacy Preserving Principal Component Analysis Clustering for Distributed Heterogeneous Gene Expression Datasets
X. Li
Cited by: 3
Abstract
In this paper, we present approaches to perform principal component analysis (PCA) clustering for distributed heterogeneous genomic datasets with privacy protection. The approaches allow data providers to collaborate to identify gene profiles from a global viewpoint while protecting the sensitive genomic data from possible privacy leaks. We further develop a framework for privacy-preserving PCA-based gene clustering, which includes two types of participants: data providers and a trusted central site (TCS). Two different methodologies are employed: Collective PCA (C-PCA) and Repeating PCA (R-PCA). The C-PCA approach requires local sites to transmit a sample of original data to the TCS and can be applied to any heterogeneous datasets. The R-PCA approach requires that all local sites have the same or a similar number of columns, but releases no original data. Experiments on five independent genomic datasets show that both the C-PCA and R-PCA approaches maintain very good accuracy compared with the centralized scenario.

DOI: 10.4018/jcmam.2011100102
International Journal of Computational Models and Algorithms in Medicine, 2(4), 23-56, October-December 2011. Copyright © 2011, IGI Global.

Gene Expression and DNA Microarray

A DNA microarray (Wikipedia, 2010), the technology used in practice to measure gene expression (BioChemWeb.org, 2010), is a multiplex technology used in molecular biology. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles (10⁻¹² moles) of a specific DNA sequence, known as a probe (or reporter). A probe can be a short section of a gene or other DNA element that is used to hybridize a cDNA or cRNA sample (called the target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detecting fluorophore-, silver-, or chemiluminescence-labeled targets to determine the relative abundance of nucleic acid sequences in the target. Since an array can contain tens of thousands of probes, a microarray experiment can accomplish many genetic tests in parallel. Arrays have therefore dramatically accelerated many types of investigation.

The microarray data processing pipeline (Hackl, Sanchez Cabo, Sturn, Wolkenhauer, & Trajanoski, 2004) includes a variety of statistical steps: pre-processing (including background correction, normalization, and summarization), differential analysis (raw p-value computation and false discovery rate (FDR) correction), and gene clustering / profiling analysis. Figure 1(a) shows the microarray experiment process in the lab and Figure 1(b) illustrates its gene clustering result.

Gene Clustering on Collaborative Datasets on Vertical Partitions

Because a single research group or institution has limited technical resources, researchers are often required to combine multiple gene expression datasets from different research labs/groups/institutions, and to conduct meta-analysis (Griffith et al., 2006; Lu, 2009; Ramasamy et al., 2008) or

Figure 1. (a) DNA microarray experiment lab processing flow; (b) the gene clustering result (heatmap) of a microarray experiment
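
The abstract states only that C-PCA has local sites transmit a sample of original data to the TCS and that the datasets are vertically partitioned. As a minimal sketch of how such a scheme could work, and not the algorithm of the paper, the Python code below assumes that all sites hold the same genes as rows, that they agree on a common subset of sampled gene rows, and that the TCS fits PCA on the concatenated sample and returns to each site only that site's slice of the principal axes. The function names `tcs_fit_pca` and `site_partial_scores`, the three synthetic sites, and the sample size are all hypothetical.

```python
import numpy as np


def tcs_fit_pca(sample_blocks, k):
    """TCS side (assumed): PCA on the column-wise concatenation of sampled rows."""
    sample = np.hstack(sample_blocks)          # (n_sampled_genes, total_columns)
    mean = sample.mean(axis=0)
    # Right singular vectors of the centered sample are the principal axes.
    _, _, vt = np.linalg.svd(sample - mean, full_matrices=False)
    axes = vt[:k].T                            # (total_columns, k)
    # Split the mean and the axes back into per-site column blocks.
    widths = [block.shape[1] for block in sample_blocks]
    cuts = np.cumsum(widths)[:-1]
    return np.split(mean, cuts), np.split(axes, cuts, axis=0)


def site_partial_scores(local_data, local_mean, local_axes):
    """Site side (assumed): project all local rows onto this site's axis slice."""
    return (local_data - local_mean) @ local_axes


rng = np.random.default_rng(0)
n_genes = 200
# Three hypothetical sites with different numbers of columns (heterogeneous).
sites = [rng.normal(size=(n_genes, width)) for width in (4, 6, 3)]

# All sites release the same sampled gene rows (an assumption of this sketch).
sample_rows = rng.choice(n_genes, size=50, replace=False)
means, axes = tcs_fit_pca([x[sample_rows] for x in sites], k=2)

# Summing the per-site partial scores yields the global k-dimensional
# coordinates of every gene, which a clustering step could then consume.
scores = sum(site_partial_scores(x, m, a) for x, m, a in zip(sites, means, axes))
print(scores.shape)  # (200, 2)
```

In this sketch only the sampled rows leave a site as original data; the remaining rows are shared as low-dimensional projections. Whether such projections meet a given privacy requirement is a separate question that the sketch does not address.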