Notes on the use of principal component analysis in agronomic research

Cory Matthew
Grassland Research, 4(1), 1-6. Published 2025-03-25. DOI: 10.1002/glr2.70003

Abstract

It is common in agronomic experiments to have data on a range of plant traits across treatments comprising levels of one factor, combinations of two or more factors, and/or repeat measures across time or some other entity such as soil depth. In this case, traditional univariate ANOVA, which examines the measured traits one by one and is amenable to the development of complex statistical models, risks missing overarching patterns that might emerge if the data were analyzed as an interacting set where multivariate trait associations can be elucidated. Multivariate analyses, on the other hand, do consider multiple traits simultaneously but often struggle to accommodate complex treatment combinations. For more complex agronomic data sets, the writer has often used principal component analysis (PCA) as a data exploration and pattern detection tool, to identify the salient features of a data set from a multivariate perspective. This editorial aims to introduce PCA to readers unfamiliar with it, to illustrate by example how PCA works, and to demonstrate the versatility of PCA by outlining some applications that the writer has developed for particular data sets during a 40-year research career. It is not possible in a brief editorial to provide textbook-level and statistically robust coverage of PCA; detailed expositions have been produced by Jolliffe (1986, 2002) and many others. A motivation for writing has been that the writer often sees PCA results published in ways that reflect incomplete understanding of its mathematical properties and behavior. However, these notes are not intended as a substitute for consultation with a professional statistician.

A notable example of elucidation of trait associations in a large data set through PCA is the worldwide leaf economic spectrum (Wright et al., 2004). These authors found that for a set of six leaf traits (leaf lifespan, mass per unit area, photosynthetic capacity, dark respiration rate, and N and P concentrations), PCA of data for 2548 species from 175 sites worldwide yielded a PC1 explaining 74.4% of data variation, with loading coefficient absolute values ranging from 0.79 to 0.91. This PC associated high photosynthetic capacity and dark respiration with high leaf N and P concentrations, but lower mass per unit area and shorter longevity. It can be interpreted as defining an ecological resource trade-off across environments between high nutrient-resource investment for high productivity with high turnover and low investment for low productivity with slower turnover.
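As a minimal sketch of the quantities mentioned above (proportion of data variation explained and loading coefficients), the following Python example runs a correlation-matrix PCA on synthetic data. The data are invented for illustration and are not those of Wright et al. (2004); the function name `pca_correlation`, the hidden "fast-slow" axis, and the noise level are all assumptions. Loading coefficients are computed here as eigenvector elements multiplied by the square root of the eigenvalue, which for a correlation-matrix PCA equals the correlation between each trait and each PC.

```python
import numpy as np

def pca_correlation(X):
    """PCA of the correlation matrix of X (rows = observations, columns = traits)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each trait
    R = np.corrcoef(Z, rowvar=False)                   # trait correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)               # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                  # reorder PCs by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()                # proportion of variation per PC
    loadings = eigvecs * np.sqrt(eigvals)              # trait-PC correlations
    scores = Z @ eigvecs                               # PC scores for each observation
    return eigvals, explained, loadings, scores

# Synthetic illustration (hypothetical): six traits driven by one hidden axis.
rng = np.random.default_rng(0)
n = 500
axis = rng.normal(size=n)                              # hidden "fast-slow" axis
signs = np.array([1, 1, 1, 1, -1, -1])                 # four traits rise, two fall along it
X = axis[:, None] * signs + 0.5 * rng.normal(size=(n, 6))

eigvals, explained, loadings, scores = pca_correlation(X)
print("PC1 proportion of variation:", round(explained[0], 3))
print("PC1 loading coefficients:", np.round(loadings[:, 0], 2))
```

The sign of a whole PC is mathematically arbitrary, so only the pattern of like and opposite signs among the loading coefficients carries meaning.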

In a perennial ryegrass (Lolium perenne L.) quantitative trait loci (QTL) mapping population, Sartie et al. (2011) used PCA to elucidate functional associations between leaf formation traits contributing to plant yield. Remarkably, PCA was able to resolve independent contributions of the trait leaf elongation rate (LER) to plant development. In autumn data, in PC1 (accounting for 32% of data variation), a high LER was associated with a compensatory shorter leaf elongation duration, more frequent leaf appearance, reduced tiller number, and increased tiller weight, and this trait association was neutral for plant yield. A near-identical PC1 accounting for 33% of data variation was observed in spring data. In autumn data, PC3, accounting for 15% of data variation, independently linked LER with increased leaf length, tiller number, and plant dry weight. Similar but not identical trait associations were observed at PC2 and PC3 in spring data, accounting for 22% and 15% of data variation, respectively (Table S9). Hence, from a plant breeding perspective, PCA was able to discriminate the genotypes in which increased LER was neutral for plant yield from those in which it contributed to increased plant yield. These PCAs were generated using trait mean data averaged over three plant clonal replicates for 202 genotypes. In QTL studies, when data for the clonal replicates are entered into PCA as separate columns, the loading coefficients of each trait are typically similar across replicates and are significantly the same more often than expected from random chance (Table S10). This presumably reflects the phenotypic similarity of genetically identical plant clonal replicates. In a separate experiment with the same QTL mapping population, one of the largest contributing traits to seed yield per plant was identified in PC1 (24.5% of data variation explained) as the number of florets per spikelet. Meanwhile, thousand seed weight contributed to seed yield only at PC4, with 10.8% of data variation explained, and was largely independent of other component traits of seed yield, reminiscent of the right-handedness PC2 in Table 1 (Sartie et al., 2018; Table S11). These are insights that could not easily have been obtained from other data analysis methods.

In a study of drought tolerance differences among 220 perennial ryegrass genotypes by Weerarathne (2021), PCA-PC3 accounting for 13.4% of variation was interpreted as identifying plant genotypes producing high dry weight with reduced soil water depletion, a trait association of interest in breeding for drought tolerance. Selection of 20 plants with high scores and 15 plants with low scores for this PC resulted in the selected trait association being promoted to PC1 and accounting for 68.2% of data variation in a follow-up experiment (Table S12a,b). This illustrates that the proportion of variance explained in PCA by a particular functional trait association depends on the number of individuals in the population expressing that trait association. A low proportion of variation explained can occur either where a few individuals display a prominent trait or many individuals display a subtle trait. Eigenvalues are in that sense ambiguous.
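The dependence of variance explained on how many individuals express a trait association can be illustrated with a small simulation. The sketch below uses synthetic data only; it borrows the counts 220, 20, and 15 from the text, but the ten hypothetical traits, the generating variances, and the helper name `pca_cov` are assumptions and do not reproduce Weerarathne (2021). For comparability, the selected subset is kept on the full-population standardized scale, whereas a real follow-up experiment would be re-standardized on its own data.

```python
import numpy as np

def pca_cov(Z):
    """PCA of the covariance matrix of Z; eigenvalues returned in descending order."""
    C = np.cov(Z, rowvar=False)
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

# Hypothetical population: 220 genotypes x 10 traits with a graded variance structure.
rng = np.random.default_rng(1)
n, p = 220, 10
gen_var = np.array([2.2, 1.6, 1.3, 1.1, 1.0, 0.9, 0.7, 0.5, 0.4, 0.3])
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))           # random orthogonal rotation
X = (rng.normal(size=(n, p)) * np.sqrt(gen_var)) @ Q.T

# Full-population PCA (covariance of standardized traits = correlation-matrix PCA).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
vals_full, vecs_full = pca_cov(Z)
share_full = vals_full / vals_full.sum()
pc3 = Z @ vecs_full[:, 2]                              # PC3 scores of all genotypes

# Select the 15 lowest- and 20 highest-scoring genotypes on PC3.
ranked = np.argsort(pc3)
chosen = np.concatenate([ranked[:15], ranked[-20:]])
Z_sel = Z[chosen]

# PCA of the selected subset, on the same standardized scale (an assumption).
vals_sel, vecs_sel = pca_cov(Z_sel)
share_sel = vals_sel / vals_sel.sum()
alignment = abs(vecs_sel[:, 0] @ vecs_full[:, 2])      # |cosine| between the two axes

print("PC3 share in full population:", round(share_full[2], 3))
print("PC1 share in selected subset:", round(share_sel[0], 3))
print("alignment of subset PC1 with original PC3:", round(alignment, 3))
```

In this sketch the trait association itself is unchanged by selection; only the proportion of individuals expressing it strongly differs, which is why an eigenvalue alone cannot say whether a pattern is subtle and widespread or prominent and rare.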

Use of PCA for dimension reduction is illustrated by Sumanasena (2003), reporting a field study of root parameters for three soil depths (0–50, 50–100, and 100–200 mm) comparing swards of perennial ryegrass (L. perenne L.) and white clover (Trifolium repens L.) with two P fertilizer application levels, three irrigation treatments, four replicates, and repeat harvests in December, February, and April. Since this experiment had two “repeat-measures” factors, soil depth and time, data could not be validly statistically analyzed by standard “repeat measurement” ANOVA procedures that accommodate only one repeat-measures factor (usually time). In this case, sets of 48 observations for root length density (cm root cm⁻³ soil) for each of the three soil depths were analyzed separately for each harvest date as three variables in a PCA. The resulting PC1 was a “size” PC indicating an increase or decrease in root length density across all three soil depths in particular treatments (79.8% of data variation explained on average across harvest dates). PC2 indicated deep- or shallow-rootedness and on average accounted for 12.7% of data variation. ANOVA of PC scores indicated treatment effects in PC1 in all months and in PC2 in April, despite the eigenvalue being only 0.42 (Table S13). Although not presented in this way in the cited research, using this approach, a single PCA incorporating all three harvest dates, instead of separate PCAs for each harvest date, would have produced sets of 144 PC scores for overall root length density across soil depths and for deep-rootedness. The scores could then have been submitted to repeat-measures ANOVA analysis of species, irrigation, and fertilizer effects on total root mass (PC1) and deep root mass (PC2), and their interactions, thus incorporating all experimental design factors in single ANOVAs performed on scores for PC1 and PC2.
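The workflow described above (PCA across soil depths, then ANOVA of the PC scores) can be sketched as follows. This is a hedged illustration on synthetic data for a single harvest date: the 2 × 2 × 3 × 4 layout matches the 48 observations in the text, but the simulated effect sizes, the variable names, and the helper `one_way_f` are assumptions, and the simulated values are on an arbitrary centered scale rather than real root length densities. The ANOVA here is a simple one-way F ratio for one factor at a time; a real analysis would fit the full factorial model.

```python
import itertools
import numpy as np

def one_way_f(y, groups):
    """One-way ANOVA F ratio of y across the levels of groups."""
    levels = np.unique(groups)
    grand = y.mean()
    ss_between = sum((groups == g).sum() * (y[groups == g].mean() - grand) ** 2 for g in levels)
    ss_within = sum(((y[groups == g] - y[groups == g].mean()) ** 2).sum() for g in levels)
    df_between, df_within = len(levels) - 1, len(y) - len(levels)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical layout: 2 species x 2 P levels x 3 irrigation levels x 4 replicates.
rng = np.random.default_rng(42)
design = np.array(list(itertools.product(range(2), range(2), range(3), range(4))))
species, p_level, irrigation, rep = design.T
n = len(design)                                        # 48 plots

# Simulated (assumed) effects: species shifts overall root density,
# irrigation shifts the balance between shallow and deep roots.
size = np.where(species == 0, -1.0, 1.0) + 0.7 * rng.normal(size=n)
depth = 0.4 * (irrigation - 1) + 0.3 * rng.normal(size=n)
noise = 0.3 * rng.normal(size=(n, 3))
rld = np.column_stack([size - depth, size, size + depth]) + noise   # shallow, mid, deep

# Correlation-matrix PCA with the three depths as variables.
Z = (rld - rld.mean(axis=0)) / rld.std(axis=0, ddof=1)
vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
if vecs[:, 0].sum() < 0:                               # orient PC1 so larger = more roots
    vecs[:, 0] *= -1
if vecs[2, 1] < 0:                                     # orient PC2 so larger = deeper rooting
    vecs[:, 1] *= -1
explained = vals / vals.sum()
scores = Z @ vecs

f_species_pc1 = one_way_f(scores[:, 0], species)
f_irrigation_pc2 = one_way_f(scores[:, 1], irrigation)
print("eigenvalues:", np.round(vals, 2))
print("F (species, PC1 scores):", round(f_species_pc1, 1))
print("F (irrigation, PC2 scores):", round(f_irrigation_pc2, 1))
```

In this simulation PC2 has an eigenvalue below 1.0 yet still carries the simulated irrigation effect, which mirrors the point made in the text about the April PC2.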

The writer is cautious about analyzing small data sets by PCA, but in one case, PCA of data on 12 farm systems descriptors from survey results from 14 farmers identified highly credible associations between farmer feed supply decisions and milk production data (Ordóñez et al., 2004; Table S14).

The above examples of interpretation of PCA output contrast with the approach adopted by Liu et al. (2023). These authors collected data for 42 germplasm lines of Leymus chinensis (Trin.) Tzvelev from various geographic locations in northern China and Mongolia. Data for 26 traits representing drought tolerance, rhizome extension, and soil improvement and hay yield are submitted to PCA, and the loading coefficient matrix and PC scores of 8 PCs with eigenvalues >1 for the 42 germplasm lines are reported. There is little attempt to “interpret” the functional trait associations of the PCs as above. Rather, the PC scores are summed into a “comprehensive” index designated “F.” A high value of F is held to indicate that a germplasm line has “excellent ecological functional traits.” A cluster analysis of the same data is presented that segregates the 42 germplasm lines into four groups. A PC1–PC2 biplot is presented, which also indicates cluster separation in those PCs. The germplasm lines with the 10 numerically highest and the 10 numerically lowest values of F are identified. Correlations of F with latitude, longitude, and altitude are explored. Finally, subsets of the 26 variables that best represent drought tolerance, rhizome extension, and soil improvement are identified, and further PCAs or membership function analyses are performed to generate indices that rank the germplasm lines for these three capabilities.

By way of comment, the writer found that the presented F values of Liu et al. (2023) can be reproduced by an [A] × [B] matrix multiplication, where [A] is the set of 8 PC scores F1 to F8 in their Table 3 and [B] is a column vector formed from the eigenvalues in their Table 2, scaled by a factor of 1/0.8055, the cumulative variance explained. For the writer, it is not logical to sum PC scores across PCs as these authors do, especially when there is little or no prior interpretation of PC scores and no consideration as to whether a positive or negative score would increase fitness, to determine if addition or subtraction would be appropriate. This appears to be a practice that has recently emerged within China, and it is recommended that the validity of this procedure be confirmed with the international statistical community before wider adoption. Moreover, with this methodology, because of the eigenvalue-adjusted weighting applied to PC scores when summing them by matrix multiplication to calculate the comprehensive index, the PC1 score “F1” and the comprehensive index “F” have a correlation of r = 0.684. In addition, ANOVA of PC scores by cluster group shows that PC1 and PC2 scores F1 and F2 largely differentiate the four cluster groups, while F discriminates cluster group 2 from the other three cluster groups. There are therefore some doubts in the writer's mind as to what the described indices actually represent and how well the described methodology would perform in a commercial plant breeding operation.
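The [A] × [B] calculation discussed above can be sketched in a few lines. The example below uses random synthetic data with the same dimensions (42 lines, 26 traits) and does not reproduce the values of Liu et al. (2023). It assumes one reading of the weighting, namely that each retained PC is weighted by its proportion of variance divided by the cumulative proportion of the retained PCs, and that PC scores are unscaled (variance equal to the eigenvalue). Under those assumptions, and because PC scores are mutually uncorrelated, the correlation between F and the PC1 score follows directly from the weights and eigenvalues.

```python
import numpy as np

# Synthetic stand-in data (hypothetical): 42 lines x 26 traits of random noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(42, 26))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Correlation-matrix PCA, PCs ordered by descending eigenvalue.
vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

keep = vals > 1.0                                      # retain PCs with eigenvalue > 1
A = Z @ vecs[:, keep]                                  # [A]: scores on retained PCs
share = vals / vals.sum()                              # proportion of variance per PC
B = share[keep] / share[keep].sum()                    # [B]: assumed weights, summing to 1
F = A @ B                                              # "comprehensive" index per line

# Correlation of F with the PC1 score, observed and derived from weights alone.
lam = vals[keep]
r_observed = np.corrcoef(F, A[:, 0])[0, 1]
r_expected = B[0] * np.sqrt(lam[0]) / np.sqrt(np.sum(B ** 2 * lam))

# The sign of any PC is arbitrary; flipping PC2 changes the index.
A_flipped = A.copy()
A_flipped[:, 1] *= -1
F_flipped = A_flipped @ B

print("retained PCs:", int(keep.sum()))
print("corr(F, F1) observed vs expected:", round(r_observed, 3), round(r_expected, 3))
print("index unchanged by sign flip of PC2:", bool(np.allclose(F, F_flipped)))
```

The sign flip shows why interpretation needs to come before summation: a PC and its mirror image are equally valid PCA output, yet they give different values of the summed index.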

The choice between PCA and alternative multivariate methods is complex, and space and time do not permit coverage here. For resolution of hidden biological signal in multiple data dimensions, PCA has distinct advantages over many other methods because of the number of independent factors output (i.e., one PC per input variable). For parsimony, techniques such as TOPSIS (Chakraborty, 2022), Redundancy Analysis (Capblancq & Forester, 2021), or similar methods designed to reduce data dimensionality may be indicated. A related question is the choice between PCA and statistical methods such as canonical discriminant analysis (CDA) that maximize the separation of treatment groups in a data set, rather than the separation of scores for each observation, as in PCA. A comparison of PCA and CDA output for a sample data set could be considered at a later date, if there is reader interest.

There is a daunting array of information to be assessed by users of PCA. PCA is a powerful tool for data pattern detection and is especially useful for identifying functional trait association patterns in agronomic data. Whereas it is often stated that PCA is a dimensionality reduction technique, in functional trait association applications of PCA, the dimensionality retention capacity of PCA is a distinct advantage, allowing multiple independent trait associations to be discerned. The example presented here based on selected biometric data of a class of 55 students indicates that established PCA presentation conventions do not always optimize pattern detection by a PCA. Where a biplot is constructed, PCs other than PC1 and PC2 might sometimes be considered for inclusion. Rejection of PCs with eigenvalues less than 1.0 may quite often mean loss of biological signal, so this rule should be used with discernment. Varimax rotation, while superficially clarifying the contribution of traits to PCs, redistributes the mathematical signal between PCs, which, in the writer's experience, can disrupt detection of biological effects of interest to the researcher.
