Retiring the Term “Weighted Mean Difference” in Contemporary Evidence Synthesis

Cochrane Evidence Synthesis and Methods. Pub Date: 2025-09-11. DOI: 10.1002/cesm.70051

Lifeng Lin, Xing Xing, Wenshan Han, Jiayi Tong


Abstract

Evidence synthesis frequently involves quantitative analyses of continuous outcomes. A cross-sectional study of Cochrane systematic reviews found that 6672 of 22,453 meta-analyses (29.7%) involved continuous outcomes [1]. The primary effect measures in meta-analyses of continuous outcomes are the mean difference (MD) and the standardized mean difference (SMD) [2]. The MD is appropriate when all included studies measure the outcome on the same scale (e.g., body weight in kilograms). In contrast, the SMD is used when studies measure the outcome on different scales (e.g., varied questionnaire scoring methods). Although alternative measures (e.g., the ratio of means) exist [3], they are used relatively infrequently.

Despite this conceptual clarity, the term “weighted mean difference” (WMD) appears frequently in the systematic review literature [4], which can lead to confusion about its relationship to the MD. In this article, we first clarify the distinction between MD and WMD, then describe the historical factors underlying the term's adoption and persistence, discuss why contemporary methods render it unnecessary, illustrate examples of misuse, and conclude with practical recommendations for clearer reporting.

The MD represents the straightforward difference between group means (e.g., intervention vs. control) for a continuous outcome. Although the true MD value relates to unknown population-level differences, practical research relies on sample estimates from individual studies. Meta-analysis systematically synthesizes these study-level MD estimates to derive an overall summary effect across studies.
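As a minimal illustration of the study-level quantities described above, the following Python sketch computes the MD and the SMD for a single study from summary statistics. The function names and the numbers are hypothetical and are not taken from the article. The SMD is shown as Cohen's d with a pooled standard deviation, which is one common choice among several (an assumption here, since the article does not specify a formula). No weighting enters either calculation.

```python
import math

def mean_difference(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Study-level MD (group 1 minus group 2) and its standard error."""
    md = mean_1 - mean_2
    se = math.sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    return md, se

def standardized_mean_difference(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Cohen's d: the MD divided by the pooled standard deviation (assumed variant)."""
    pooled_sd = math.sqrt(
        ((n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2) / (n_1 + n_2 - 2)
    )
    return (mean_1 - mean_2) / pooled_sd

# Hypothetical summary data: body weight in kg, intervention vs. control.
md, se = mean_difference(72.0, 10.0, 50, 75.0, 10.0, 50)
smd = standardized_mean_difference(72.0, 10.0, 50, 75.0, 10.0, 50)
print(f"MD = {md:.2f} kg (SE {se:.2f}), SMD = {smd:.2f}")
```

The MD keeps the original unit (kilograms here), whereas the SMD is unitless, which is why the SMD can be combined across studies that use different scales.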

The term WMD emerged historically to emphasize the weighted averaging process of meta-analyses, wherein each study contributes a sample MD weighted by its statistical precision (i.e., inverse variance) [5]. Typically, larger studies with smaller variances or narrower confidence intervals are assigned greater weights. Traditional meta-analytical methods, performed through either fixed-effect (also known as common-effect) or random-effects models, follow this inverse-variance weighting principle. Under fixed-effect models, study weights directly reflect the inverse of their variances, whereas random-effects models incorporate both within-study and between-study variances.
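The weighting principle described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and input values are hypothetical, and the between-study variance is estimated with the DerSimonian-Laird method of moments, which is an assumption made here because the article does not name a specific estimator.

```python
import math

def pool_inverse_variance(mds, variances, tau2=0.0):
    """Pooled MD and its standard error under inverse-variance weighting.

    tau2 = 0 gives the fixed-effect (common-effect) weights 1 / v_i;
    tau2 > 0 gives random-effects weights 1 / (v_i + tau2).
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, mds)) / total
    return pooled, math.sqrt(1.0 / total)

def dersimonian_laird_tau2(mds, variances):
    """Method-of-moments estimate of the between-study variance (assumed estimator)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fixed, _ = pool_inverse_variance(mds, variances)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, mds))
    c = total - sum(w * w for w in weights) / total
    return max(0.0, (q - (len(mds) - 1)) / c)

# Hypothetical study-level MDs and their within-study variances.
study_mds = [0.0, -6.0, -3.0]
study_vars = [1.0, 1.0, 4.0]

fixed_md, fixed_se = pool_inverse_variance(study_mds, study_vars)
tau2 = dersimonian_laird_tau2(study_mds, study_vars)
random_md, random_se = pool_inverse_variance(study_mds, study_vars, tau2)
print(f"fixed-effect pooled MD = {fixed_md:.2f} (SE {fixed_se:.2f})")
print(f"random-effects pooled MD = {random_md:.2f} (SE {random_se:.2f}), tau^2 = {tau2:.2f}")
```

With these hypothetical inputs the two pooled MDs coincide, but the random-effects standard error is larger because the between-study variance is added to every study's variance before the weights are formed. In both cases each input is a study-level MD and only the output is a pooled quantity.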

To contextualize the widespread adoption of WMD, we conducted a brief literature search using Google Scholar on June 12, 2025. For each calendar year from 1990 to 2024, we ran exact-phrase queries (in quotation marks) and recorded the count for “weighted mean difference” AND “systematic review” and, separately, the count for “systematic review” alone; the yearly proportion is the first count divided by the second (Figure 1). Google Scholar indexes titles, abstracts, and, when available, full texts, so counts reflect occurrences anywhere in the indexed record and are approximate. We did not screen individual records for correct versus incorrect usage because our objective was to describe the prevalence of the terminology rather than to quantify misuse. Figure 1 therefore documents how usage evolved over time.

This analysis revealed an observable increase in WMD usage around 1996, closely following the establishment of the Cochrane Database of Systematic Reviews (CDSR) in April 1995. The influential role of Cochrane reviews likely contributed greatly to disseminating this terminology. Chapter 6.5 of the latest Cochrane Handbook [6] confirms the prevalence of the term “weighted mean difference” in early editions of CDSR and cautions against it, in a note that has appeared in Handbook versions since at least 2008: “Analyses based on this effect measure have historically been termed [WMD] analyses in the [CDSR]. This name is potentially confusing: although the meta-analysis computes a weighted average of these differences in means, no weighting is involved in the calculation of a statistical summary of a single study. Furthermore, all meta-analyses involve a weighted combination of estimates, yet we don't use the word ‘weighted’ when referring to other methods.”

Another plausible factor contributing to the continued use of the term is the citation of Andrade's statement in 2020 that “the pooled MD is more accurately described as a weighted mean difference or WMD.” [7] While this interpretation is not technically incorrect in describing the statistical process behind meta-analytic pooling, it may inadvertently encourage broader or careless use of the term WMD.

Despite existing notes on WMD in the literature, Figure 1 illustrates the continued widespread use of WMD. Specifically, while the total number of systematic review publications increased until peaking around 2018 and then declined (Figure 1C), both the number and proportion of publications mentioning WMD continued to rise through 2024 (Figure 1A,B). Although the term is not misused in all instances, this trend suggests that existing cautions have had limited impact and underscores the value of clearer terminology. These historical and descriptive observations motivate a focus on current analytic practice and terminology, as discussed next.

The explicit emphasis on weighting inherent to the term WMD can be misleading because weighting is fundamental to conventional meta-analytical methods, regardless of the outcome type (continuous, binary, time-to-event, etc.). Nevertheless, analogous terms such as “weighted odds ratio” or “weighted hazard ratio” are rarely used. Hence, more general terms such as “pooled MD,” “combined MD,” “overall MD,” or “meta-analytical MD” may be more appropriate and consistent.
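To make the point about outcome types concrete, the sketch below applies one and the same inverse-variance pooling function to study-level MDs and to study-level log odds ratios. All names and numbers are hypothetical, and the log odds ratio variance uses the usual large-sample formula, which assumes no zero cells in the 2×2 tables.

```python
import math

def inverse_variance_pool(estimates, variances):
    """Generic fixed-effect pooling: works for any effect measure."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

def log_odds_ratio(events_t, n_t, events_c, n_c):
    """Log odds ratio and its large-sample variance (assumes no zero cells)."""
    a, b = events_t, n_t - events_t
    c, d = events_c, n_c - events_c
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d

# Continuous outcome: hypothetical study-level MDs and variances.
pooled_md, pooled_md_se = inverse_variance_pool([-2.0, -4.0], [1.0, 1.0])

# Binary outcome: hypothetical (events, total) per arm for two trials.
trials = [(10, 100, 20, 100), (15, 100, 25, 100)]
log_ors, log_or_vars = zip(*(log_odds_ratio(*t) for t in trials))
pooled_log_or, _ = inverse_variance_pool(log_ors, log_or_vars)
pooled_or = math.exp(pooled_log_or)

print(f"pooled MD = {pooled_md:.2f}, pooled OR = {pooled_or:.3f}")
```

The pooled odds ratio is weighted in exactly the same way as the pooled MD, yet it is conventionally reported as a "pooled OR" and not as a "weighted odds ratio".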

Moreover, contemporary methodological advancements in evidence synthesis frequently extend beyond traditional inverse-variance weighting. Modern meta-analyses, including pairwise and network applications, are often fit as one-stage generalized linear mixed or Bayesian hierarchical models in which treatment effects are estimated jointly from the likelihood [8-10]. In these models, precision is incorporated through the model structure rather than through explicit study-specific inverse-variance weights. Consequently, when outcome scales are identical, the pooled estimate is more clearly reported as a pooled MD or another clear descriptor, such as meta-analytic MD; the term WMD is unnecessary and may suggest a distinct effect measure. Imprecise usage nonetheless persists in current literature, as illustrated below.

Critically, MD pertains to individual study outcomes, while WMD refers only to the meta-analytical synthesis. Despite this clear distinction, some systematic reviews incorrectly label individual study effects as WMD [11-14]. For example, a systematic review published recently in JAMA inaccurately reported “pooled weighted mean differences” for systolic and diastolic blood pressures between screening and control groups [11]. Here, the pooled MD inherently indicates weighting, making the addition of “weighted” redundant and misleading. Moreover, a recent article in the American Journal of Ophthalmology captions a forest plot as “weighted mean differences (WMD) … across each study” [12], and another applied paper captions a forest plot as “WMD and 95% CI” [13]; both captions imply study-level WMDs. In addition, a methods book chapter explicitly states, “Table 3.4 presents the WMD and the 95% confidence interval for each study” [14]. Such misuse has persisted in systematic reviews over time, including many published in high-impact journals [15].

Labeling study-level effects as “WMD” can blur the distinction between a study's MD and the pooled meta-analytic estimate. For instance, a figure caption that states “WMD across each study” may suggest that each study yields a WMD rather than an MD, which can confuse evidence users about what is being pooled. Clearer labeling (e.g., “MD per study” with a “pooled MD”) reduces this risk and improves interpretability.

This article underscores the potential inappropriateness of the term WMD, particularly its incorrect application to individual studies in evidence synthesis. Originating largely from early practices in Cochrane systematic reviews, WMD no longer aligns with contemporary methodological needs and rigor. Consequently, we recommend retiring the term WMD and adopting clearer terminology, using MD for study-level effects and pooled MD or meta-analytic MD for the synthesized estimate, to promote clearer, methodologically sound communication.

Lifeng Lin: conceptualization, funding acquisition, investigation, writing – original draft, visualization, writing – review and editing. Xing Xing: investigation, writing – review and editing. Wenshan Han: data curation, writing – review and editing, visualization. Jiayi Tong: conceptualization, writing – review and editing.

The authors declare no conflicts of interest.
