Notes on the use of principal component analysis in agronomic research

Cory Matthew
Grassland Research 4(1): 1-6. Published 2025-03-25. DOI: 10.1002/glr2.70003. https://onlinelibrary.wiley.com/doi/10.1002/glr2.70003

Abstract

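It is common in agronomic experiments to have data on a range of plant traits across treatments comprising levels of one factor, combinations of two or more factors, and/or repeat measures across time or some other entity such as soil depth. In this case, traditional univariate ANOVA, which examines the measured traits one by one and is amenable to the development of complex statistical models, risks missing overarching patterns that might emerge if the data were analyzed as an interacting set where multivariate trait associations can be elucidated. Multivariate analyses, on the other hand, do consider multiple traits simultaneously but often struggle to accommodate complex treatment combinations. For more complex agronomic data sets, the writer has often used principal component analysis (PCA) as a data exploration and pattern detection tool, to identify the salient features of a data set from a multivariate perspective. This editorial aims to introduce PCA to readers unfamiliar with it, to illustrate by example how PCA works, and to demonstrate the versatility of PCA by outlining some applications that the writer has developed for particular data sets during a 40-year research career. It is not possible in a brief editorial to provide textbook-level, statistically robust coverage of PCA; detailed expositions have been produced by Joliffe (1986, 2002) and many others. A motivation to write has been that I often see PCA results published in ways that reflect incomplete understanding of its mathematical properties and behavior. However, these notes are not intended as a substitute for consultation with a professional statistician.

A notable example of elucidation of trait associations in a large data set through PCA is the worldwide leaf economic spectrum (Wright et al., 2004). These authors found that for a set of six leaf traits (leaf lifespan, mass per unit area, photosynthetic capacity, dark respiration rate, and N and P concentrations), PCA of data for 2548 species from 175 sites worldwide yielded a PC1 explaining 74.4% of data variation, with loading coefficient absolute values ranging from 0.79 to 0.91. This PC associated high photosynthetic capacity and dark respiration with high leaf N and P concentrations, but lower mass per unit area and shorter longevity. It can be interpreted as defining an ecological resource trade-off across environments between high nutrient-resource investment for high productivity with high turnover and low investment for low productivity with slower turnover.

In a perennial ryegrass (Lolium perenne L.) quantitative trait loci (QTL) mapping population, Sartie et al. (2011) used PCA to elucidate functional associations between leaf formation traits contributing to plant yield. Remarkably, PCA was able to resolve independent contributions of the trait leaf elongation rate (LER) to plant development. In autumn data, at PC1 (accounting for 32% of data variation), a high leaf elongation rate was associated with a compensatory shorter leaf elongation duration, more frequent leaf appearance, reduced tiller number, and increased tiller weight, and this trait association was neutral for plant yield. A near-identical PC1 accounting for 33% of data variation was observed in spring data. In autumn data, PC3, accounting for 15% of data variation, independently linked LER with increased leaf length, tiller number, and plant dry weight. Similar but not identical trait associations were observed at PC2 and PC3 in spring data, accounting for 22% and 15% of data variation, respectively (Table S9). Hence, from a plant breeding perspective, PCA was able to discriminate between genotypes in which increased LER was neutral for plant yield and genotypes in which it contributed to increased plant yield. These PCAs were generated using trait mean data averaged over three plant clonal replicates for 202 genotypes. In QTL studies, when data for the clonal replicates are entered into PCA as separate columns, the loading coefficients of each trait are typically similar across replicates and are significantly the same more often than expected from random chance (Table S10). This presumably reflects the phenotypic similarity of genetically identical plant clonal replicates. In a separate experiment with the same QTL mapping population, one of the largest contributing traits to seed yield per plant was identified in PC1 (24.5% of data variation explained) as the number of florets per spikelet. Meanwhile, thousand seed weight contributed to seed yield only at PC4, with 10.8% of data variation explained, and was largely independent of other component traits of seed yield, reminiscent of the right-handedness PC2 in Table 1 (Sartie et al., 2018; Table S11). These are insights that could not easily have been obtained from other data analysis methods.

In a study of drought tolerance differences among 220 perennial ryegrass genotypes by Weerarathne (2021), PC3 of the PCA, accounting for 13.4% of variation, was interpreted as identifying plant genotypes producing high dry weight with reduced soil water depletion, a trait association of interest in breeding for drought tolerance. Selection of 20 plants with high scores and 15 plants with low scores for this PC resulted in the selected trait association being promoted to PC1 and accounting for 68.2% of data variation in a follow-up experiment (Table S12a,b). This illustrates that the proportion of variance explained in PCA by a particular functional trait association depends on the number of individuals in the population expressing that trait association. A low proportion of variation explained can occur either where a few individuals display a prominent trait or where many individuals display a subtle trait. Eigenvalues are in that sense ambiguous.

Use of PCA for dimension reduction is illustrated by Sumanasena (2003), reporting a field study of root parameters for three soil depths (0–50, 50–100, and 100–200 mm) comparing swards of perennial ryegrass (L. perenne L.) and white clover (Trifolium repens L.) with two P fertilizer application levels, three irrigation treatments, four replicates, and repeat harvests in December, February, and April. Since this experiment had two “repeat-measures” factors, soil depth and time, the data could not be validly analyzed by standard “repeat measurement” ANOVA procedures, which accommodate only one repeat-measures factor (usually time). In this case, sets of 48 observations for root length density (cm root cm⁻³ soil) for each of the three soil depths were analyzed separately for each harvest date as three variables in a PCA. The resulting PC1 was a “size” PC indicating an increase or decrease in root length density across all three soil depths in particular treatments (79.8% of data variation explained on average across harvest dates). PC2 indicated deep- or shallow-rootedness and on average accounted for 12.7% of data variation. ANOVA of PC scores indicated treatment effects in PC1 in all months and in PC2 in April, despite that PC's eigenvalue being only 0.42 (Table S13). Although not presented in this way in the cited research, using this approach, a single PCA incorporating all three harvest dates, instead of separate PCAs for each harvest date, would have produced sets of 144 PC scores for overall root length density across soil depths and for deep-rootedness. The scores could then have been submitted to repeat-measures ANOVA of species, irrigation, and fertilizer effects on total root mass (PC1) and deep root mass (PC2), and their interactions, thus incorporating all experimental design factors in single ANOVAs performed on scores for PC1 and PC2.

The writer is cautious about analyzing small data sets by PCA, but in one case, PCA of data on 12 farm systems descriptors from survey results from 14 farmers identified highly credible associations between farmer feed supply decisions and milk production data (Ordóñez et al., 2004; Table S14).

The above examples of interpretation of PCA output contrast with the approach adopted by Liu et al. (2023). These authors collected data for 42 germplasm lines of Leymus chinensis (Trin.) Tzvelev from various geographic locations in northern China and Mongolia. Data for 26 traits representing drought tolerance, rhizome extension, soil improvement, and hay yield are submitted to PCA, and the loading coefficient matrix and PC scores of 8 PCs with eigenvalues >1 for the 42 germplasm lines are reported. There is little attempt to “interpret” the functional trait associations of the PCs as above. Rather, the PC scores are summed into a “comprehensive” index designated “F.” A high value of F is held to indicate that a germplasm line has “excellent ecological functional traits.” A cluster analysis of the same data is presented that segregates the 42 germplasm lines into four groups. A PC1–PC2 biplot is presented, which also indicates cluster separation in those PCs. The germplasm lines with the 10 numerically highest and the 10 numerically lowest values of F are identified. Correlations of F with latitude, longitude, and altitude are explored. Finally, subsets of the 26 variables that best represent drought tolerance, rhizome extension, and soil improvement are identified, and further PCAs or membership function analyses are performed to generate indices that rank the germplasm lines for these three capabilities.

By way of comment, the writer found that the presented F values of Liu et al. (2023) can be reproduced by an [A] × [B] matrix multiplication, where [A] is the set of 8 PC scores F1 to F8 in their Table 3 and [B] is a column vector formed from the eigenvalues in their Table 2, scaled by a factor of 1/0.8055, where 0.8055 is the cumulative variance explained. For the writer, it is not logical to sum PC scores across PCs as these authors do, especially when there is little or no prior interpretation of PC scores and no consideration of whether a positive or negative score would increase fitness, which is needed to determine whether addition or subtraction would be appropriate. This appears to be a practice that has recently emerged within China, and it is recommended that the validity of this procedure be confirmed with the international statistical community before wider adoption. Moreover, with this methodology, because of the eigenvalue-adjusted weighting applied to PC scores when summing them by matrix multiplication to calculate the comprehensive index, the PC1 score “F1” and the comprehensive index “F” have a correlation of r = 0.684. In addition, ANOVA of PC scores by cluster group shows that PC1 and PC2 scores F1 and F2 largely differentiate the four cluster groups, while F discriminates cluster group 2 from the other three cluster groups. There are therefore some doubts in the writer's mind as to what the described indices actually represent and how well the described methodology would perform in a commercial plant breeding operation.

The choice between PCA and alternate multivariate methods is complex, and space and time do not permit coverage here. For resolution of hidden biological signal in multiple data dimensions, PCA has distinct advantages over many other methods because of the number of independent factors output (i.e., one PC per input variable). For parsimony, techniques such as TOPSIS (Chakraborty, 2022), Redundancy Analysis (Capblancq & Forester, 2021), or similar methods designed to reduce data dimensionality may be indicated. A related question is the choice between PCA and statistical methods such as canonical discriminant analysis (CDA) that maximize the separation of treatment groups in a data set, rather than the separation of scores for each observation, as in PCA. A comparison of PCA and CDA output for a sample data set could be considered at a later date, if there is reader interest.

There is a daunting array of information to be assessed by users of PCA. PCA is a powerful tool for data pattern detection and is especially useful for identifying functional trait association patterns in agronomic data. Whereas it is often stated that PCA is a dimensionality reduction technique, in functional trait association applications the dimensionality retention capacity of PCA is a distinct advantage, allowing multiple independent trait associations to be discerned. The example presented here, based on selected biometric data of a class of 55 students, indicates that established PCA presentation conventions do not always optimize pattern detection by a PCA. Where a biplot is constructed, PCs other than PC1 and PC2 might sometimes be considered for inclusion. Rejection of PCs with eigenvalues less than 1.0 may quite often mean loss of biological signal, so this rule should be used with discernment. Varimax rotation, while superficially clarifying the contribution of traits to PCs, redistributes the mathematical signal between PCs, which, in the writer's experience, can disrupt detection of biological effects of interest to the researcher.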
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Notes on the use of principal component analysis in agronomic research

It is common in agronomic experiments to have data on a range of plant traits across treatments comprising levels of one factor, combinations of two or more factors, and/or repeat measures across time or some other entity such as soil depth. In this case, traditional univariate ANOVA, which examines the measured traits one by one and is amenable to the development of complex statistical models, risks missing overarching patterns that might emerge if the data were analyzed as an interacting set where multivariate trait associations can be elucidated. Multivariate analyses, on the other hand, do consider multiple traits simultaneously but often struggle to accommodate complex treatment combinations. For more complex agronomic data sets, the writer has often used principal component analysis (PCA) as a data exploration and pattern detection tool, to identify the salient features of a data set from a multivariate perspective. This editorial aims to introduce PCA to readers unfamiliar with it, to illustrate by example how PCA works, and to demonstrate the versatility of PCA by outlining some applications that the writer has developed for particular data sets during a 40-year research career. It is not possible in a brief editorial to provide textbook-level, statistically robust coverage of PCA; detailed expositions have been produced by Jolliffe (1986, 2002) and many others. A motivation to write has been that I often see PCA results published in ways that reflect incomplete understanding of its mathematical properties and behavior. However, these notes are not intended as a substitute for consultation with a professional statistician.

A notable example of elucidation of trait associations in a large data set through PCA is the worldwide leaf economic spectrum (Wright et al., 2004). These authors found that for a set of six leaf traits (leaf lifespan, mass per unit area, photosynthetic capacity, dark respiration rate, and N and P concentrations), PCA of data for 2548 species from 175 sites worldwide yielded a PC1, explaining 74.4% of data variation with loading coefficient absolute values ranging from 0.79 to 0.91 and associating high photosynthetic capacity and dark respiration with high leaf N and P concentrations, but lower mass per unit area and shorter longevity. This PC can be interpreted as defining an ecological resource trade-off across environments between high nutrient-resource investment for high productivity with high turnover and low investment for low productivity with slower turnover.

In a perennial ryegrass (Lolium perenne L.) quantitative trait loci (QTL) mapping population, Sartie et al. (2011) used PCA to elucidate functional associations between leaf formation traits contributing to plant yield. Remarkably, PCA was able to resolve independent contributions of the trait leaf elongation rate (LER) to plant development. In autumn data, PC1, accounting for 32% of data variation, associated a high LER with a compensatory shorter leaf elongation duration, more frequent leaf appearance, reduced tiller number, and increased tiller weight; this trait association was neutral for plant yield. A near-identical PC1 accounting for 33% of data variation was observed in spring data. In autumn data, PC3, accounting for 15% of data variation, independently linked LER with increased leaf length, tiller number, and plant dry weight. Similar but not identical trait associations were observed at PC2 and PC3 in spring data, accounting for 22% and 15% of data variation, respectively (Table S9). Hence, from a plant breeding perspective, PCA was able to distinguish genotypes in which increased LER was neutral for plant yield from those in which it contributed to increased plant yield. These PCAs were generated using trait mean data averaged over three plant clonal replicates for 202 genotypes. In QTL studies, when data for the clonal replicates are entered into PCA as separate columns, the loading coefficients of each trait are typically similar across replicates and are significantly the same more often than expected from random chance (Table S10). This presumably reflects the phenotypic similarity of genetically identical plant clonal replicates. In a separate experiment with the same QTL mapping population, the number of florets per spikelet was identified in PC1 (24.5% of data variation explained) as one of the largest contributing traits to seed yield per plant. Meanwhile, thousand seed weight contributed to seed yield only at PC4, with 10.8% of data variation explained, and was largely independent of other component traits of seed yield, reminiscent of the right-handedness PC2 in Table 1 (Sartie et al., 2018; Table S11). These are insights that could not easily have been obtained from other data analysis methods.

In a study of drought tolerance differences among 220 perennial ryegrass genotypes by Weerarathne (2021), PC3, accounting for 13.4% of data variation, was interpreted as identifying plant genotypes producing high dry weight with reduced soil water depletion, a trait association of interest in breeding for drought tolerance. Selection of 20 plants with high scores and 15 plants with low scores for this PC resulted in the selected trait association being promoted to PC1 and accounting for 68.2% of data variation in a follow-up experiment (Table S12a,b). This illustrates that the proportion of variance explained in PCA by a particular functional trait association depends on the number of individuals in the population expressing that trait association. A low proportion of variation explained can occur either where a few individuals display a prominent trait or where many individuals display a subtle trait. Eigenvalues are in that sense ambiguous.

Use of PCA for dimension reduction is illustrated by Sumanasena (2003), who reported a field study of root parameters for three soil depths (0–50, 50–100, and 100–200 mm) comparing swards of perennial ryegrass (L. perenne L.) and white clover (Trifolium repens L.) with two P fertilizer application levels, three irrigation treatments, four replicates, and repeat harvests in December, February, and April. Since this experiment had two “repeat-measures” factors, soil depth and time, the data could not be validly analyzed by standard repeat-measures ANOVA procedures, which accommodate only one repeat-measures factor (usually time). In this case, sets of 48 observations for root length density (cm root cm−3 soil) for each of the three soil depths were analyzed separately for each harvest date as three variables in a PCA. The resulting PC1 was a “size” PC indicating an increase or decrease in root length density across all three soil depths in particular treatments (79.8% of data variation explained on average across harvest dates). PC2 indicated deep- or shallow-rootedness and on average accounted for 12.7% of data variation. ANOVA of PC scores indicated treatment effects in PC1 in all months and in PC2 in April, despite the PC2 eigenvalue being only 0.42 (Table S13). Although not presented in this way in the cited research, a single PCA incorporating all three harvest dates, instead of separate PCAs for each harvest date, would have produced sets of 144 PC scores for overall root length density across soil depths and for deep-rootedness. The scores could then have been submitted to repeat-measures ANOVA of species, irrigation, and fertilizer effects on overall root length density (PC1) and deep-rootedness (PC2), and their interactions, thus incorporating all experimental design factors in single ANOVAs performed on scores for PC1 and PC2.
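As a rough illustration of the procedure described above, the following Python/NumPy sketch runs a PCA on three soil-depth variables and then computes a one-way ANOVA F statistic on the PC scores. All data are simulated; the variable names, effect sizes, and the simplified one-factor layout (three hypothetical irrigation levels, 16 plots each) are assumptions and do not reproduce Sumanasena (2003), whose design was factorial.

```python
import numpy as np

rng = np.random.default_rng(3)
irrigation = np.repeat([0, 1, 2], 16)               # 48 plots, hypothetical one-factor layout

# Simulated root length density at three soil depths (arbitrary units).
size = rng.normal(size=48)                          # overall rooting of each plot
tilt = rng.normal(size=48) + (irrigation - 1.0)     # deep-rootedness, shifted by treatment
noise = rng.normal(scale=0.4, size=(48, 3))
X = np.column_stack([
    10.0 + 2.0 * size - 0.8 * tilt + noise[:, 0],   # 0-50 mm
     8.0 + 2.0 * size              + noise[:, 1],   # 50-100 mm
     6.0 + 2.0 * size + 0.8 * tilt + noise[:, 2],   # 100-200 mm
])

# PCA of the correlation matrix of the three depth variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
evals, evecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
scores = Z @ evecs                                  # 48 x 3 matrix of PC scores

def one_way_f(y, groups):
    """One-way ANOVA F statistic for response y across the levels of groups."""
    labels = np.unique(groups)
    grand = y.mean()
    ss_between = sum((groups == g).sum() * (y[groups == g].mean() - grand) ** 2 for g in labels)
    ss_within = sum(((y[groups == g] - y[groups == g].mean()) ** 2).sum() for g in labels)
    return (ss_between / (len(labels) - 1)) / (ss_within / (len(y) - len(labels)))

f_pc1 = one_way_f(scores[:, 0], irrigation)
f_pc2 = one_way_f(scores[:, 1], irrigation)
print("loadings (rows = depths, columns = PCs):\n", np.round(evecs, 2))
print("eigenvalues:", np.round(evals, 2))
print("F for PC1:", round(float(f_pc1), 2), " F for PC2:", round(float(f_pc2), 2))
```

In this simulation the treatment acts only on the depth contrast, so PC2 carries the treatment signal even though its eigenvalue is below 1, which parallels the April PC2 result cited above. A full analysis would fit the factorial and repeat-measures structure in a statistics package rather than rely on a one-way F statistic.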

The writer is cautious about analyzing small data sets by PCA, but in one case, PCA of data on 12 farm systems descriptors from survey results from 14 farmers identified highly credible associations between farmer feed supply decisions and milk production data (Ordóñez et al., 2004; Table S14).

The above examples of interpretation of PCA output contrast with the approach adopted by Liu et al. (2023). These authors collected data for 42 germplasm lines of Leymus chinensis (Trin.) Tzvelev from various geographic locations in northern China and Mongolia. Data for 26 traits representing drought tolerance, rhizome extension, soil improvement, and hay yield are submitted to PCA, and the loading coefficient matrix and PC scores of 8 PCs with eigenvalues >1 for the 42 germplasm lines are reported. There is little attempt to “interpret” the functional trait associations of the PCs as above. Rather, the PC scores are summed into a “comprehensive” index designated “F.” A high value of F is held to indicate that a germplasm line has “excellent ecological functional traits.” A cluster analysis of the same data is presented that segregates the 42 germplasm lines into four groups. A PC1–PC2 biplot is presented, which also indicates cluster separation in those PCs. The germplasm lines with the 10 numerically highest and the 10 numerically lowest values of F are identified. Correlations of F with latitude, longitude, and altitude are explored. Finally, subsets of the 26 variables that best represent drought tolerance, rhizome extension, and soil improvement are identified, and further PCAs or membership function analyses are performed to generate indices that rank the germplasm lines for these three capabilities.

By way of comment, the writer found that the presented F values of Liu et al. (2023) can be reproduced by an [A] × [B] matrix multiplication, where [A] is the set of 8 PC scores, F1 to F8, in their Table 3, and [B] is a column vector formed from the eigenvalues in their Table 2, scaled by a factor of 1/0.8055, where 0.8055 is the cumulative variance explained. For the writer, it is not logical to sum PC scores across PCs as these authors do, especially when there is little or no prior interpretation of the PC scores, or consideration of whether a positive or negative score would increase fitness, to determine if addition or subtraction would be appropriate. This appears to be a practice that has recently emerged within China, and it is recommended that the validity of this procedure be confirmed with the international statistical community before wider adoption. Moreover, with this methodology, because of the eigenvector-adjusted weighting applied to PC scores when summing them by matrix multiplication to calculate the comprehensive index, the PC1 score “F1” and the comprehensive index “F” have a correlation of r = 0.684. In addition, ANOVA of PC scores by cluster group shows that the PC1 and PC2 scores, F1 and F2, largely differentiate the four cluster groups, while F discriminates cluster group 2 from the other three cluster groups. There are therefore some doubts in the writer's mind as to what the described indices actually represent and how well the described methodology would perform in a commercial plant breeding operation.
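The matrix multiplication described above can be written out as follows. This is a sketch of the calculation as the writer describes it, not code or data from Liu et al. (2023): the score matrix is random placeholder data, and the eight variance proportions are hypothetical values chosen only so that they sum to 0.8055. It is also assumed here that the eigenvalue terms are expressed as proportions of total variance, so that the scaled weights sum to 1.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical variance proportions for the 8 retained PCs (sum = 0.8055).
var_prop = np.array([0.24, 0.15, 0.11, 0.09, 0.07, 0.06, 0.045, 0.0405])
cum_var = var_prop.sum()
B = (var_prop / cum_var).reshape(-1, 1)      # 8 x 1 column vector of weights

# Placeholder PC scores F1..F8 for 42 germplasm lines (42 x 8 matrix A).
A = rng.normal(size=(42, 8))

F = (A @ B).ravel()                          # comprehensive index, one value per line

# The sign of any PC is arbitrary: flipping PC2 gives an equally valid PCA solution.
A_flipped = A.copy()
A_flipped[:, 1] *= -1
F_flipped = (A_flipped @ B).ravel()

top10 = set(np.argsort(F)[-10:].tolist())
top10_flipped = set(np.argsort(F_flipped)[-10:].tolist())
print("weights:", np.round(B.ravel(), 3))
print("lines in both top-10 lists:", len(top10 & top10_flipped))
```

Because F adds the PCs together with fixed positive weights, reversing the sign of a single PC changes the index and can change which lines rank highest. This is one way to see why the direction of each PC needs interpreting before scores are summed.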

The choice between PCA and alternative multivariate methods is complex, and space and time do not permit coverage here. For resolution of hidden biological signal in multiple data dimensions, PCA has distinct advantages over many other methods because of the number of independent factors output (i.e., one PC per input variable). For parsimony, techniques such as TOPSIS (Chakraborty, 2022), Redundancy Analysis (Capblancq & Forester, 2021), or similar methods designed to reduce data dimensionality may be indicated. A related question is the choice between PCA and statistical methods such as canonical discriminant analysis (CDA), which maximize the separation of treatment groups in a data set rather than the separation of scores for each observation, as in PCA. A comparison of PCA and CDA output for a sample data set could be considered at a later date, if there is reader interest.

There is a daunting array of information to be assessed by users of PCA. PCA is a powerful tool for data pattern detection and is especially useful for identifying functional trait association patterns in agronomic data. Whereas it is often stated that PCA is a dimensionality reduction technique, in functional trait association applications of PCA, the dimensionality retention capacity of PCA is a distinct advantage, allowing multiple independent trait associations to be discerned. The example presented here based on selected biometric data of a class of 55 students indicates that established PCA presentation conventions do not always optimize pattern detection by a PCA. Where a biplot is constructed, sometimes, PCs other than PC1 and PC2 might be considered for inclusion; rejection of PCs with eigenvalues less than 1.0 may quite often mean loss of biological signal, so this rule should be used with discernment; varimax rotation, while superficially clarifying the contribution of traits to PCs, redistributes the mathematical signal between PCs, which, in the writer's experience, can disrupt detection of biological effects of interest to the researcher.
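The points about dimensionality retention and the eigenvalue rule can be checked numerically. The sketch below uses simulated data (the sample size, the five variables, and their correlation structure are arbitrary assumptions) to show that a PCA of the correlation matrix returns one PC per input variable, that the PC scores are mutually uncorrelated, that the eigenvalues sum to the number of variables, and that the eigenvalue > 1 rule discards every PC below that threshold regardless of what signal it carries.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 60, 5                                  # hypothetical: 60 observations, 5 traits
X = rng.normal(size=(n, p))
X[:, 1] += X[:, 0]                            # induce a positive trait association
X[:, 3] -= 0.7 * X[:, 2]                      # and a weaker negative one

# PCA of the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
evals, evecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]
scores = Z @ evecs                            # one column of scores per PC

score_cov = np.cov(scores, rowvar=False)      # diagonal = eigenvalues, off-diagonal ~ 0
retained = evals > 1.0                        # eigenvalue > 1 rule

print("eigenvalues:", np.round(evals, 3), " sum =", round(float(evals.sum()), 3))
print("PCs produced:", len(evals), " retained by the rule:", int(retained.sum()))
```

The PCs dropped by the rule are still uncorrelated with the retained ones, so any trait association they describe is not recoverable from the retained PCs.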
