X. Li
Citations: 3
Privacy Preserving Principal Component Analysis Clustering for Distributed Heterogeneous Gene Expression Datasets
In this paper, we present approaches for performing principal component analysis (PCA) clustering on distributed heterogeneous genomic datasets with privacy protection. The approaches allow data providers to collaborate in identifying gene profiles from a global viewpoint while protecting sensitive genomic data from possible privacy leaks. We further develop a framework for privacy-preserving PCA-based gene clustering that includes two types of participants: data providers and a trusted central site (TCS). Two methodologies are employed: Collective PCA (C-PCA) and Repeating PCA (R-PCA). C-PCA requires local sites to transmit a sample of their original data to the TCS and can be applied to any heterogeneous datasets. R-PCA requires that all local sites have the same or a similar number of columns, but it releases no original data. Experiments on five independent genomic datasets show that both C-PCA and R-PCA maintain very good accuracy compared with the centralized scenario.

DOI: 10.4018/jcmam.2011100102

International Journal of Computational Models and Algorithms in Medicine, 2(4), 23-56, October-December 2011. Copyright © 2011, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

Gene Expression and DNA Microarray

A DNA microarray (Wikipedia, 2010), the technology used in practice to measure gene expression (BioChemWeb.org, 2010), is a multiplex technology used in molecular biology. It consists of an arrayed series of thousands of microscopic spots of DNA oligonucleotides, called features, each containing picomoles (10^-12 moles) of a specific DNA sequence, known as a probe (or reporter). A probe can be a short section of a gene or other DNA element that is used to hybridize a cDNA or cRNA sample (called the target) under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detecting fluorophore-, silver-, or chemiluminescence-labeled targets to determine the relative abundance of nucleic acid sequences in the target. Since an array can contain tens of thousands of probes, a microarray experiment can accomplish many genetic tests in parallel, and arrays have therefore dramatically accelerated many types of investigation.

The microarray data processing pipeline (Hackl, Sanchez Cabo, Sturn, Wolkenhauer, & Trajanoski, 2004) includes a variety of statistical steps: pre-processing (background correction, normalization, and summarization); differential analysis, which comprises raw p-value computation and false discovery rate (FDR) correction; and gene clustering / profiling analysis. Figure 1(a) shows the microarray experiment process in the lab, and Figure 1(b) illustrates its gene clustering result.

Gene Clustering on Collaborative Datasets on Vertical Partitions

Because a single research group or institution has limited technical resources, researchers are often required to combine multiple gene expression datasets from different research labs, groups, or institutions, and to conduct meta-analysis (Griffith et al., 2006; Lu, 2009; Ramasamy et al., 2008) or

Figure 1. (a) DNA microarray experiment lab processing flow; (b) the gene clustering result (heatmap) of a microarray experiment
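To make the "centralized scenario" mentioned in the abstract concrete, the following is a minimal sketch of PCA-based gene clustering on vertically partitioned data. It is not the C-PCA or R-PCA procedure, whose details are not given in this excerpt. It rests on several assumptions: genes are rows and each site contributes a block of columns (samples) for the same genes; k-means is used as the clustering step, which the excerpt does not specify; and the function names (pca_scores, kmeans_labels, centralized_pca_clustering) and the synthetic data are hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return the rows of X projected onto the top principal components."""
    Xc = X - X.mean(axis=0)  # center every column (sample)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def kmeans_labels(points, k, n_iter=100):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # distance of every point to its nearest already-chosen center
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dist.argmax()])
    centers = np.array(centers)
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def centralized_pca_clustering(site_blocks, n_components, k):
    """Baseline: pool the sites' column blocks, run PCA, then cluster the genes."""
    X = np.hstack(site_blocks)  # assumption: same genes (rows) at every site
    return kmeans_labels(pca_scores(X, n_components), k)

# Synthetic data (assumption): 20 genes in two expression groups, two sites.
rng = np.random.default_rng(0)
pattern = np.repeat([3.0, -3.0], 10)[:, None]
site_a = pattern + 0.1 * rng.standard_normal((20, 4))  # site A holds 4 samples
site_b = pattern + 0.1 * rng.standard_normal((20, 6))  # site B holds 6 samples
labels = centralized_pca_clustering([site_a, site_b], n_components=2, k=2)
print(labels)
```

Pooling the raw blocks as done here is exactly what a privacy-preserving scheme tries to avoid. The sketch only illustrates the baseline against which the abstract says the distributed approaches are compared.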