Principal Components Analysis: Row Scaling and Compositional Data

IF 2.1 | Zone 4 (Chemistry) | Q1 SOCIAL WORK

Journal of Chemometrics Pub Date : 2025-02-20 DOI:10.1002/cem.3606

Richard G. Brereton

Journal of Chemometrics, Volume 39, Issue 3. Full text: https://onlinelibrary.wiley.com/doi/epdf/10.1002/cem.3606

Citations: 0

Abstract

Row scaling is sometimes called normalisation, but that term is also sometimes used for column standardisation, so we avoid the term "normalisation" in this article to prevent confusion.

Whether row scaling gives an improvement depends, of course, on the structure of the data. However, if the differences between samples are primarily due to relative concentrations or proportions, and the amount of sample is not easy to control, row scaling to a constant total often results in an improvement. It can be combined with other approaches for column transformation, such as standardisation, as discussed in the previous article.
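As a minimal sketch of row scaling to a constant total, optionally followed by column standardisation, the NumPy code below may help. The data and the function names `row_scale` and `standardise_columns` are hypothetical, and the choice of the sample standard deviation (`ddof=1`) is an assumption rather than something stated in the article.

```python
import numpy as np

def row_scale(X, total=1.0):
    """Scale each row (sample) so that its variables sum to `total`."""
    X = np.asarray(X, dtype=float)
    row_sums = X.sum(axis=1, keepdims=True)
    if np.any(row_sums == 0):
        raise ValueError("cannot row scale a sample whose total is zero")
    return total * X / row_sums

def standardise_columns(X, ddof=1):
    """Centre each column (variable) and divide by its standard deviation.

    ddof=1 (sample standard deviation) is an assumption made for this sketch.
    """
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=ddof)

# Hypothetical data: samples 1 and 2 have the same composition,
# but sample 2 contains twice the amount of material.
X = np.array([
    [2.0, 6.0, 2.0],
    [4.0, 12.0, 4.0],
    [6.0, 3.0, 1.0],
])

X_rs = row_scale(X)                   # each row now sums to 1
X_rs_std = standardise_columns(X_rs)  # optional column transformation

print(X_rs)
print(X_rs_std)
```

In this sketch the first two samples become identical after row scaling, which is the intended effect when the amount of sample is hard to control, and also shows why information is lost when the absolute amounts are in fact meaningful.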

If there are only two variables, the simplex is a line. In Figure 4, we illustrate the scores of the first two PCs of the dataset formed by the first two variables from Table 1. After row scaling, there is only one non-zero PC. Here, the position along the line relates to the class membership of each object, although this is not always so and depends on an appropriate choice of variables.
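The two-variable case can be checked numerically. The sketch below uses hypothetical data (not the values from Table 1) and computes PCA through the singular value decomposition of the column-centred matrix, which is one common route and is assumed here. After row scaling to a total of 1, the second variable equals one minus the first, so the centred data have rank one and the second PC vanishes up to rounding error.

```python
import numpy as np

def pca_scores(X):
    """Column-centre X and return (scores, singular_values) via the SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * s, s  # scores are the columns of U scaled by the singular values

# Hypothetical two-variable dataset (six samples)
X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 2.0],
    [6.0, 3.0],
    [2.0, 7.0],
])

# Row scale so that each sample sums to 1
X_rs = X / X.sum(axis=1, keepdims=True)

scores_raw, s_raw = pca_scores(X)
scores_rs, s_rs = pca_scores(X_rs)

print("singular values, raw data:       ", s_raw)
print("singular values, row-scaled data:", s_rs)
```

For the raw data both singular values are clearly non-zero, whereas for the row-scaled data the second is zero to within floating-point precision, so all samples lie along a single line.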

For the data in Table 1, row scaling improves visualisation of the class differences and structure in the data. However, row scaling is not always appropriate. If the absolute values of each variable are known accurately (e.g., the amount of sample extracted can be kept constant or calibrated to a known standard), converting to compositional data loses information. In addition, there may sometimes be one or two very intense variables that are of subsidiary interest, for example, a primary metabolite that is very intense but has little or no relationship to the factors of interest; the proportions will then be dominated by this uninteresting variable.
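The effect of a single intense variable can be illustrated with a small, entirely hypothetical example. In the sketch below the first column plays the role of an intense variable of subsidiary interest, and the two samples have identical values for the remaining variables. After row scaling, the proportions of those remaining variables nevertheless differ, purely because the intense variable has changed.

```python
import numpy as np

# Hypothetical data: column 0 is a very intense variable of subsidiary
# interest; columns 1 and 2 are the variables of interest and are
# identical in both samples.
X = np.array([
    [900.0, 10.0, 20.0],
    [500.0, 10.0, 20.0],
])

# Row scale to a constant total of 1
P = X / X.sum(axis=1, keepdims=True)

print(P)
# Ratio by which the proportion of variable 1 changes between the samples,
# even though its absolute value is the same in both
print(P[1, 1] / P[0, 1])
```

Here the intense variable accounts for more than 90% of each row total, so any variation in it propagates into every other proportion.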

Nevertheless, row scaling is a common procedure in many areas of chemometrics. There is a substantial statistical literature on multivariate compositional data. If the main aim of an analysis is qualitative, for example, to separate groups or find outliers, some of the more elaborate statistical considerations are often of secondary importance. If, however, the data are to be used for statistical inference, such as hypothesis tests, p values or estimation, it is a good idea to look closely at the classical literature in order to best interpret and process compositional data.


Source journal

Journal of Chemometrics (Chemistry: Analytical Chemistry)

CiteScore: 5.20

Self-citation rate: 8.30%

Articles published:

Review time: 2 months

About the journal: The Journal of Chemometrics is devoted to the rapid publication of original scientific papers, reviews and short communications on fundamental and applied aspects of chemometrics. It also provides a forum for the exchange of information on meetings and other news relevant to the growing community of scientists who are interested in chemometrics and its applications. Short, critical review papers are a particularly important feature of the journal, in view of the multidisciplinary readership at which it is aimed.