{"title":"微阵列数据分析与PCA在一个DBMS","authors":"W. Rinsurongkawong, C. Ordonez","doi":"10.1145/1458449.1458456","DOIUrl":null,"url":null,"abstract":"Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality, but small size. First, due to DBMS limitations on a maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high dimensional data sets can be analyzed with the classical PCA method, small, but high dimensional, data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these mathematical methods require many matrix operations, which are hard to express in SQL, query optimizations and efficient UDFs are developed to get good performance. Our proposed techniques achieve processing times comparable with those from the R package, a well-known statistical tool. We experimentally show our methods scale well with high dimensionality.","PeriodicalId":143937,"journal":{"name":"Data and Text Mining in Bioinformatics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Microarray data analysis with PCA in a DBMS\",\"authors\":\"W. Rinsurongkawong, C. Ordonez\",\"doi\":\"10.1145/1458449.1458456\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality, but small size. First, due to DBMS limitations on a maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high dimensional data sets can be analyzed with the classical PCA method, small, but high dimensional, data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these mathematical methods require many matrix operations, which are hard to express in SQL, query optimizations and efficient UDFs are developed to get good performance. Our proposed techniques achieve processing times comparable with those from the R package, a well-known statistical tool. We experimentally show our methods scale well with high dimensionality.\",\"PeriodicalId\":143937,\"journal\":{\"name\":\"Data and Text Mining in Bioinformatics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-10-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data and Text Mining in Bioinformatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1458449.1458456\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data and Text Mining in Bioinformatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1458449.1458456","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality, but small size. First, due to DBMS limitations on a maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high dimensional data sets can be analyzed with the classical PCA method, small, but high dimensional, data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these mathematical methods require many matrix operations, which are hard to express in SQL, query optimizations and efficient UDFs are developed to get good performance. Our proposed techniques achieve processing times comparable with those from the R package, a well-known statistical tool. We experimentally show our methods scale well with high dimensionality.