An Integrated R Package for Canonical Correlation Analysis and Partial Least Squares

R J. Pub Date : 2021-01-01 DOI:10.32614/rj-2021-026

Boyoung Kim, Yunju Im, Keun Yoo Jae




SEEDCCA: An Integrated R-Package for Canonical Correlation Analysis and Partial Least Squares

Abstract

Canonical correlation analysis (CCA) has a long history as an explanatory statistical method in high-dimensional data analysis and has been successfully applied in many scientific fields, such as chemometrics, pattern recognition, and genomic sequence analysis. seedCCA is a newly developed R package that implements not only the standard and seeded CCA but also partial least squares (PLS). The package makes it possible to fit CCA to large-p, small-n data. This paper provides a complete guide to the package, and the seeded CCA results are compared with those of the regularized CCA available in an existing R package. We believe that the package, along with this paper, will help practitioners in various scientific fields analyze high-dimensional data and will make the statistical methodology of multivariate analysis more fruitful.

Introduction

Explanatory studies are important for identifying patterns and special structure in data before developing a specific model. When the relationship between two sets of variables, a p-dimensional random variable $X$ ($X \in \mathbb{R}^p$) and an r-dimensional random variable $Y$ ($Y \in \mathbb{R}^r$), is of primary interest, one popular explanatory statistical method is canonical correlation analysis (CCA; Hotelling (1936)). The main goal of CCA is dimension reduction of the two sets of variables by measuring the association between them. To this end, pairs of linear combinations of the variables are constructed by maximizing the Pearson correlation. CCA has been applied successfully in many scientific fields, such as chemometrics, pattern recognition, and genomic sequence analysis. Lee and Yoo (2014) show that CCA can be used as a dimension reduction tool for high-dimensional data and that it is also connected to the least squares estimator. Therefore, CCA is not only an explanatory and dimension reduction method but can also be used as an alternative to least squares estimation.
If $\max(p, r)$ is greater than or equal to the sample size $n$, the usual CCA is not applicable because the sample covariance matrices cannot be inverted. To overcome this, a regularized CCA was developed by Leurgans et al. (1993), based on an idea first suggested in Vinod (1976). In practice, the CCA package by González et al. (2008) implements a version of the regularized CCA. To make the sample covariance matrices, say $\hat{\Sigma}_x$ and $\hat{\Sigma}_y$, invertible, González et al. (2008) replace them with

$$\hat{\Sigma}_x^{\lambda_1} = \hat{\Sigma}_x + \lambda_1 I_p \quad \text{and} \quad \hat{\Sigma}_y^{\lambda_2} = \hat{\Sigma}_y + \lambda_2 I_r.$$

The optimal values of $\lambda_1$ and $\lambda_2$ are chosen by maximizing a cross-validation score over a two-dimensional grid. Although González et al. (2008) argue that a relatively small grid of reasonable values for $\lambda_1$ and $\lambda_2$ can lessen the computational burden, the search is still time-consuming, as observed in later sections. In addition, fast regularized CCA and robust CCA via projection pursuit were recently developed in Cruz-Cano (2012) and Alfons et al. (2016), respectively.

Another version of CCA that handles $\max(p, r) > n$ is the seeded canonical correlation analysis proposed by Im et al. (2014). Because the seeded CCA does not require any computationally intensive regularization procedure, it is quite fast on larger data. The seeded CCA consists of two steps. In the initial step, a set of variables whose dimension is larger than $n$ is reduced based on iterative projections. In the next step, the standard CCA is applied to the two sets of variables obtained from the initial step to finalize the CCA of the data. Another advantage is that the seeded CCA procedure is closely related to partial least squares (PLS), one of the popular statistical methods for large-p, small-n data, so the seeded CCA can yield the PLS estimates.

The seedCCA package was recently developed mainly to implement the seeded CCA.
However, the package can fit a collection of statistical methodologies: standard canonical correlation analysis, partial least squares with uni- or multi-dimensional responses, and the seeded CCA. The package is available on CRAN (https://cran.r-project.org/web/packages/seedCCA/index.html).

The main goal of the paper is to introduce and illustrate the seedCCA package. To this end, three real data sets are fitted by the standard CCA, the seeded CCA, and partial least squares; two of the three data sets are available in the package. One of them was analyzed in González et al. (2008), so the results of the seeded and regularized CCA are closely compared.

The R Journal Vol. XX/YY, AAAA 20ZZ ISSN 2073-4859 CONTRIBUTED RESEARCH ARTICLE

The paper is organized as follows. The collection of three methodologies is discussed in Section 2. The implementation of seedCCA is illustrated and compared with CCA in Section 3. Section 4 summarizes the work.

We use the following notation throughout the rest of the paper. A p-dimensional random variable $X$ is denoted $X \in \mathbb{R}^p$; that is, $X \in \mathbb{R}^p$ stands for a random variable even when this is not stated explicitly. For $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^r$, we define $\mathrm{cov}(X) = \Sigma_x$, $\mathrm{cov}(Y) = \Sigma_y$, $\mathrm{cov}(X, Y) = \Sigma_{xy}$ and $\mathrm{cov}(Y, X) = \Sigma_{yx}$. It is assumed that $\Sigma_x$ and $\Sigma_y$ are positive definite.

Collection of implemented methodologies in seedCCA

Canonical correlation analysis

Consider two sets of variables $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}^r$ and their linear combinations $U = a^T X$ and $V = b^T Y$. Then $\mathrm{var}(U) = a^T \Sigma_x a$, $\mathrm{var}(V) = b^T \Sigma_y b$, and $\mathrm{cov}(U, V) = a^T \Sigma_{xy} b$, where $a \in \mathbb{R}^{p \times 1}$ and $b \in \mathbb{R}^{r \times 1}$. The Pearson correlation between $U$ and $V$ is

$$\mathrm{cor}(U, V) = \frac{a^T \Sigma_{xy} b}{\sqrt{a^T \Sigma_x a}\,\sqrt{b^T \Sigma_y b}}. \tag{1}$$

We seek $a$ and $b$ that maximize $\mathrm{cor}(U, V)$ subject to the following criteria.

1. The first canonical variate pair $(U_1 = a_1^T X, V_1 = b_1^T Y)$ is obtained by maximizing (1).
2. The second canonical variate pair $(U_2 = a_2^T X, V_2 = b_2^T Y)$ is constructed by maximizing (1) under the restriction that $\mathrm{var}(U_2) = \mathrm{var}(V_2) = 1$ and that $(U_1, V_1)$ and $(U_2, V_2)$ are uncorrelated.
3. At step $k$, the $k$th canonical variate pair $(U_k = a_k^T X, V_k = b_k^T Y)$ is obtained by maximizing (1) under the restriction that $\mathrm{var}(U_k) = \mathrm{var}(V_k) = 1$ and that $(U_k, V_k)$ is uncorrelated with the previous $k - 1$ canonical variate pairs.
4. Repeat Steps 1 to 3 until $k$ reaches $q = \min(p, r)$.
5. Select the first $d$ pairs of $(U_k, V_k)$ to represent the relationship between $X$ and $Y$.

Under these criteria, the pairs $(a_i, b_i)$ are constructed as $a_i = \Sigma_x^{-1/2} \psi_i$ and $b_i = \Sigma_y^{-1/2} \phi_i$ for $i = 1, \ldots, q$, where $(\psi_1, \ldots, \psi_q)$ and $(\phi_1, \ldots, \phi_q)$ are, respectively, the $q$ eigenvectors of $\Sigma_x^{-1/2} \Sigma_{xy} \Sigma_y^{-1} \Sigma_{yx} \Sigma_x^{-1/2}$ and $\Sigma_y^{-1/2} \Sigma_{yx} \Sigma_x^{-1} \Sigma_{xy} \Sigma_y^{-1/2}$, with the corresponding common ordered eigenvalues $\rho_1^{*2} \geq \cdots \geq \rho_q^{*2} \geq 0$. The matrices $M_x = (a_1, \ldots, a_d)$ and $M_y = (b_1, \ldots, b_d)$ are called canonical coefficient matrices for $d = 1, \ldots, q$, and $M_x^T X$ and $M_y^T Y$ are called canonical variates. In the sample, the population quantities are replaced by their usual moment estimators. For more details on the standard CCA, readers may refer to Johnson and Wichern (2007).

Seeded canonical correlation analysis

Since the standard CCA requires the inversion of $\hat{\Sigma}_x$ and $\hat{\Sigma}_y$ in practice, it is not applicable to high-dimensional data with $\max(p, r) > n$. Im et al. (2014) propose the seeded canonical correlation analysis to overcome this deficit. The seeded CCA is a two-step procedure consisting of an initialized step and a finalized step. In the initialized step, the original two sets of variables are reduced to m-dimensional pairs without loss of information on the CCA application; in this step it is essential to force $m \ll n$. In the finalized step, the standard CCA is applied to the initially reduced pairs for repairing and orthonormality.
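To make equation (1) in the "Canonical correlation analysis" subsection concrete, the following minimal sketch checks numerically that the covariance-based formula for $\mathrm{cor}(U, V)$ agrees with the Pearson correlation computed directly from the linear combinations $U = a^T X$ and $V = b^T Y$. It is written in plain Python purely for illustration and is not the seedCCA interface; the toy data, the coefficient vectors `a` and `b`, and all helper names are assumptions made up for this example.

```python
import math

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    mu, mv = mean(u), mean(v)
    return sum((s - mu) * (t - mv) for s, t in zip(u, v)) / (len(u) - 1)

def cov_matrix(A, B):
    """Sample cross-covariance matrix between the columns of A and B."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    return [[cov(ca, cb) for cb in cols_b] for ca in cols_a]

def bilinear(a, S, b):
    """Bilinear form a^T S b."""
    return sum(a[i] * S[i][j] * b[j]
               for i in range(len(a)) for j in range(len(b)))

def cor_from_covariances(a, b, Sx, Sy, Sxy):
    """Equation (1): correlation of U = a^T X and V = b^T Y."""
    return bilinear(a, Sxy, b) / math.sqrt(bilinear(a, Sx, a) * bilinear(b, Sy, b))

# Toy data, made up for illustration: n = 6 observations, p = 3, r = 2.
X = [[1.0, 2.0, 0.5], [2.0, 1.0, 1.5], [3.0, 4.0, 1.0],
     [4.0, 3.0, 2.5], [5.0, 6.0, 2.0], [6.0, 5.0, 3.5]]
Y = [[1.0, 0.0], [2.0, 1.5], [2.5, 1.0], [4.0, 3.0], [4.5, 2.0], [6.0, 4.0]]
a = [0.5, -1.0, 2.0]   # arbitrary coefficient vector for X
b = [1.0, 0.25]        # arbitrary coefficient vector for Y

Sx, Sy, Sxy = cov_matrix(X, X), cov_matrix(Y, Y), cov_matrix(X, Y)
rho_formula = cor_from_covariances(a, b, Sx, Sy, Sxy)

# Direct route: form U and V observation by observation, then correlate.
U = [sum(ai * xi for ai, xi in zip(a, row)) for row in X]
V = [sum(bi * yi for bi, yi in zip(b, row)) for row in Y]
rho_direct = cov(U, V) / math.sqrt(cov(U, U) * cov(V, V))

print(rho_formula, rho_direct)
```

The two printed values coincide up to rounding, because the sample moments of $U$ and $V$ are exactly the quadratic and bilinear forms in (1) evaluated at the sample covariance matrices. CCA then searches over `a` and `b` for the pair that maximizes this quantity.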
More detailed discussion of the seeded CCA follows in the next subsections.

Development

Let $S(M)$ denote the subspace spanned by the columns of $M \in \mathbb{R}^{p \times r}$. Lee and Yoo (2014) show the following relation:

$$S(M_x) = S(\Sigma_x^{-1} \Sigma_{xy}) \quad \text{and} \quad S(M_y) = S(\Sigma_y^{-1} \Sigma_{yx}). \tag{2}$$

The relation in (2) directly indicates that $M_x$ and $M_y$ form basis matrices of $S(\Sigma_x^{-1} \Sigma_{xy})$ and $S(\Sigma_y^{-1} \Sigma_{yx})$ and that $M_x$ and $M_y$ can be recovered from $\Sigma_x^{-1} \Sigma_{xy}$ and $\Sigma_y^{-1} \Sigma_{yx}$.

Now define the following two matrices:

$$R_{x,u_1} \in \mathbb{R}^{p \times r u_1} = (\Sigma_{xy}, \Sigma_x \Sigma_{xy}, \ldots, \Sigma_x^{u_1 - 1} \Sigma_{xy}) \quad \text{and} \quad R_{y,u_2} \in \mathbb{R}^{r \times p u_2} = (\Sigma_{yx}, \Sigma_y \Sigma_{yx}, \ldots, \Sigma_y^{u_2 - 1} \Sigma_{yx}). \tag{3}$$

In $R_{x,u_1}$ and $R_{y,u_2}$, the numbers $u_1$ and $u_2$ are called termination indexes, and they determine the number of projections of $\Sigma_{xy}$ and $\Sigma_{yx}$ onto $\Sigma_x$ and $\Sigma_y$, respectively. Also define

$$M^0_{x,u_1} \in \mathbb{R}^{p \times r} = R_{x,u_1} (R_{x,u_1}^T \Sigma_x R_{x,u_1})^{-1} R_{x,u_1}^T \Sigma_{xy} \quad \text{and} \quad M^0_{y,u_2} \in \mathbb{R}^{r \times p} = R_{y,u_2} (R_{y,u_2}^T \Sigma_y R_{y,u_2})^{-1} R_{y,u_2}^T \Sigma_{yx}. \tag{4}$$

Cook et al. (2007) show that $S(M^0_{x,u_1}) = S(\Sigma_x^{-1} \Sigma_{xy})$ and $S(M^0_{y,u_2}) = S(\Sigma_y^{-1} \Sigma_{yx})$ for the matrices in (4), and hence $M^0_{x,u_1}$ and $M^0_{y,u_2}$ can be used to infer $M_x$ and $M_y$, respectively. One clear advantage of using $M^0_{x,u_1}$ and $M^0_{y,u_2}$ is that $\Sigma_x$ and $\Sigma_y$ need not be inverted.

In practice, it is important to select proper values for the termination indexes $u_1$ and $u_2$. To this end, define $\Delta_{x,u_1} = M^0_{x,u_1+1} - M^0_{x,u_1}$ and $\Delta_{y,u_2} = M^0_{y,u_2+1} - M^0_{y,u_2}$. The following measure of the increment in $u_1$ and $u_2$ is then defined:

$$nF_{x,u_1} = n\,\mathrm{trace}(\Delta_{x,u_1}^T \Sigma_x \Delta_{x,u_1}) \quad \text{and} \quad nF_{y,u_2} = n\,\mathrm{trace}(\Delta_{y,u_2}^T \Sigma_y \Delta_{y,u_2}).$$

A proper value of $u$ is one for which $nF_{x,u_1}$ and $nF_{x,u_1+1}$, and likewise $nF_{y,u_2}$ and $nF_{y,u_2+1}$, differ little. The selected $u_1$ and $u_2$ for $M^0_{x,u_1}$ and $M^0_{y,u_2}$ need not be equal.

Next, the original two sets of variables $X$ and $Y$ are replaced with $M^{0T}_{x,u_1} X \in \mathbb{R}^r$ and $M^{0T}_{y,u_2} Y \in \mathbb{R}^p$.
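The initialized reduction in (3) and (4) and the increment measure $nF_{x,u}$ can be sketched for the simplest case of a univariate response ($r = 1$), where $\Sigma_{xy}$ is a single vector. The sketch below is plain Python written for illustration only, not the seedCCA implementation; the covariance matrix `Sx`, the vector `sxy`, the sample size `n`, and all function names are hypothetical. It builds $R_{x,u}$ column by column, forms $M^0_{x,u}$ by solving a small $u \times u$ system instead of inverting $\Sigma_x$, and reports the increments.

```python
def matvec(S, v):
    """Matrix-vector product S v for a list-of-rows matrix."""
    return [sum(sij * vj for sij, vj in zip(row, v)) for row in S]

def dot(u, v):
    return sum(s * t for s, t in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[piv] = M[piv], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def seeded_m0(Sx, sxy, u):
    """M^0_{x,u} = R (R^T Sx R)^{-1} R^T sxy, R = (sxy, Sx sxy, ..., Sx^{u-1} sxy)."""
    cols = [sxy[:]]                       # columns of R_{x,u}, as in (3)
    for _ in range(u - 1):
        cols.append(matvec(Sx, cols[-1]))
    SxR = [matvec(Sx, c) for c in cols]
    G = [[dot(ci, sj) for sj in SxR] for ci in cols]   # R^T Sx R  (u x u)
    g = [dot(c, sxy) for c in cols]                    # R^T sxy   (length u)
    w = solve(G, g)
    return [sum(w[k] * cols[k][i] for k in range(u)) for i in range(len(sxy))]

def increment(Sx, m_next, m_cur, n):
    """nF_{x,u} = n * trace(Delta^T Sx Delta), Delta = M^0_{x,u+1} - M^0_{x,u}."""
    d = [s - t for s, t in zip(m_next, m_cur)]
    return n * dot(d, matvec(Sx, d))

# Hypothetical population quantities with p = 3 and r = 1.
Sx = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
sxy = [1.0, 2.0, 3.0]
n = 50  # hypothetical sample size; it only rescales the increments

m0 = [seeded_m0(Sx, sxy, u) for u in (1, 2, 3)]
increments = [increment(Sx, m0[k + 1], m0[k], n) for k in range(2)]
target = solve(Sx, sxy)   # Sx^{-1} sxy, shown only for comparison

print(increments)
print(m0[-1], target)
```

In this toy setting the three columns of $R_{x,3}$ are linearly independent, so $M^0_{x,3}$ reproduces $\Sigma_x^{-1} \Sigma_{xy}$ up to rounding. In the large-p setting one would stop at a much smaller $u$, once successive increments stop changing noticeably, and only the $u \times u$ matrix $R^T \Sigma_x R$ has to be solved against.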
Replacing $X$ and $Y$ with $M^{0T}_{x,u_1} X$ and $M^{0T}_{y,u_2} Y$ does not cause any loss of information on CCA, in the sense that $S(M^0_{x,u_1}) = S(M_x)$ and $S(M^0_{y,u_2}) = S(M_y)$; this reduction is called the initialized CCA. The initialized CCA has the following two cases.

Case 1: Suppose that $\min(p, r) = r \ll n$. Then the original $X$ alone is replaced with $M^{0T}_{x,u_1} X$, and the original $Y$ is kept.

Case 2: If $\min(p, r) = r$ is not sufficiently smaller than $n$, $\Sigma_{xy}$ and $\Sigma_{yx}$
