MASTER scale for methodological quality assessment: Reliability assessment and update

Journal of Evidence‐Based Medicine · DOI: 10.1111/jebm.12618 · Published 2024-06-12 · IF 3.6 · Q1 (Medicine, General & Internal)
Ashraf I. Ahmed, Muhammad Zain Kaleem, Amgad Mohamed Elshoeibi, Abdalla Moustafa Elsayed, Elhassan Mahmoud, Yaman A. Khamis, Luis Furuya-Kanamori, Jennifer C. Stone, Suhail A. Doi
{"title":"MASTER scale for methodological quality assessment: Reliability assessment and update","authors":"Ashraf I. Ahmed,&nbsp;Muhammad Zain Kaleem,&nbsp;Amgad Mohamed Elshoeibi,&nbsp;Abdalla Moustafa Elsayed,&nbsp;Elhassan Mahmoud,&nbsp;Yaman A. Khamis,&nbsp;Luis Furuya-Kanamori,&nbsp;Jennifer C. Stone,&nbsp;Suhail A. Doi","doi":"10.1111/jebm.12618","DOIUrl":null,"url":null,"abstract":"<p>In evidence synthesis of analytical studies, methodological quality (mQ) assessment is necessary to determine the extent to which internal validity safeguards are implemented in the included studies against a list of selected safeguards in an assessment tool. Such mQ tools consist of internal validity safeguards that are checked against those put in place by researchers when they undertake research to guard against systematic error in the design, conduct, and analysis of a study.<span><sup>1</sup></span> However, consistency or agreement among the individuals undertaking an assessment of implemented safeguards in published research against those listed in a mQ tool needs to be documented to ensure that the tool is reliable. Therefore mQ tools need to have their interrater reliability tested in order to ensure the consistency of their use in research.<span><sup>2</sup></span></p><p>Many existing tools are available to assess mQ or risk of bias (RoB) specific to a study design, which leads to a lack of comparability across studies of different designs when using different tools and assessment results which, as a whole, may lack meaning. For example, Cochrane's Risk of Bias (RoB2) tool is used to assess the RoB in RCTs while nonrandomized trials are assessed using the ROBINS-I tool. It is difficult to compare these scales to one another, and hence, there is a need for a unified scale that is not confined to one study design. The MASTER scale was developed to overcome some of these issues by providing a comprehensive list of methodological safeguards across analytic study designs that allow for comparative assessment between these studies. It uses an assessment approach that takes the reviewer all the way from mQ assessment through to an ability to make use of this for bias adjustment.<span><sup>3, 4</sup></span> A drawback for reviewers using the MASTER scale is that there is a lack of information regarding its reliability, with no studies conducted to assess this metric.</p><p>The degree to which studies maintain their relative position in a list over repeated measurements is referred to as reliability.<span><sup>5</sup></span> For example, when assessing the reliability of a tool such as the MASTER scale, it would be considered reliable if you see that studies which scored well on the tool by the first rater also scored well on subsequent assessments by different raters.<span><sup>5, 6</sup></span> The scoring system for this scale has been discussed previously.<span><sup>7</sup></span> Such consistency across the individuals undertaking mQ assessment needs to be established to ensure that the tool is reliable across different raters. Researchers trained in clinical epidemiology were chosen for this study so that they could also examine the scale item wordings to remove ambiguity and improve the readability and applicability of the wording. 
This study therefore serves the dual purpose of evaluating the reliability of the MASTER scale across raters and examine the scale wording to see if the tool needs to be updated for clarity and readability.</p><p>As shown in Table S1, there were 11 studies<span><sup>8-18</sup></span> chosen for assessment that contained a total of 1344 patients conducted using different study designs comparing normal saline with ringers’ lactate in the treatment of acute pancreatitis. Five<span><sup>9-11, 13, 18</sup></span> of the 11 studies were randomized-controlled trials including 299 patients, three<span><sup>8, 12, 14</sup></span> were cohort studies including 433 patients, and three<span><sup>15-17</sup></span> were abstracts with 612 patients and the designs reported within the abstracts were observational in one and possibly experimental in two. The highest mean quality safeguard count (Qi) across the raters was observed in the study by De-Madaria<span><sup>10</sup></span> at 33.17 (SD 1.33). Conversely, the lowest mean Qi was reported in the study by Vasu De Van,<span><sup>17</sup></span> an abstract based on a RCT of 50 patients, with a mean of 8.83 (SD 4.45). The highest mean for the relative rank was again found in the study by De-Madaria<span><sup>10</sup></span> at 0.99 (SD 0.01), while the lowest relative rank was noted in the study by Vasu De Van<span><sup>17</sup></span> at 0.27 (SD 0.12). Similarly, for the absolute ranks, the highest mean was observed in the study by De-Madaria<span><sup>10</sup></span> at 1.17 (SD 0.41), and the lowest was in the study by Vasu De Van<span><sup>17</sup></span> with a mean of 10.67 (SD 0.52). It should be noted that the study with the highest count always has a relative rank of 1 and this would decrease as the study rank gets lower<span><sup>7</sup></span>. On the other hand, absolute ranks are also highest at 1 but increase as ranks get lower.</p><p>Figure S1 illustrates how the six raters evaluated one of the eleven studies. The graph shows the overall safeguard count that each rater assigned as well as a breakdown analysis of the overall count that demonstrates the amount of the total count that was contributed by each standard. The results indicate high internal consistency and reliability for all three measures as shown in Table S2. The total safeguard count (Qi) and relative ranks yielded an ICC of 0.90 (95% CI: 0.79–0.97) and 0.90 (95% CI: 0.80–0.97), respectively indicating excellent level of agreement between raters. The absolute ranking measure had the highest level of agreement, with an ICC of 0.93 (95% CI: 0.86–0.98). Overall, the results suggest that there is low disparity of the overall raters' evaluation of an aggregate assessment using the MASTER scale.</p><p>When looking across each of the individual standards, there was strong interrater reliability (Table S3). For instance, standard 3 (ICC 0.89, 95% CI 0.78–0.96) made the biggest contribution to the overall reliability across raters for this study. However, for all six raters, standards 1 (ICC 0.61, 95% CI 0.36–0.84), 4 (ICC 0.62, 95% CI 0.38–0.85), 6 (ICC 0.66, 95% CI 0.43–0.87), and 7 (ICC 0.61, 95% CI 0.36–0.84) had the most room for improvement in terms of reliability in this study (Table S3). Overall, these results suggest that there is moderate to excellent agreement among the raters within the MASTER scale standards.</p><p>Table 1 depicts the updated MASTER scale depicting areas of the MASTER scale where recommended changes to the wordings of safeguards were made. 
Overall, 26 safeguards had suggestions raised within the following four standards of the MASTER scale “Equal recruitment,” “Equal retention,” “Equal implementation” and “Equal prognosis.” However, the following standards, “Equal ascertainment,” “Sufficient analysis,” and “Temporal precedence”, had no suggestions raised. We present this version of the MASTER scale for future use as version 1.01.</p><p>In conclusion, the MASTER scale (updated V1.01, Table 1) appears to be a reliable unified (across analytical study designs) tool for assessing individual studies in evidence syntheses. Our study has identified some areas where the wording of the scale could be improved, which would enhance its clarity and further increase its reliability. The main issues flagged were around the wording of the questions, and how they could be improved for interpretation and understanding, especially by those not experts in clinical epidemiology. The main limitation of using any scale, not just the MASTER scale, is the time and expertise required for generating the assessment. Other than this, the MASTER scale has no other significant limitations. This opens the door for further research in examining the reliability of the MASTER scale when assessed by other health students, clinical researchers, and other health care workers. Overall, the findings of this study have significant implications for future use and wider adoption of the MASTER scale in evidence synthesis due its applicability to all types of study designs.</p><p>Authors JS and SD were responsible for the creation of the MASTER scale. There are no other interests to report.</p><p>We acknowledge the financial support provided by the College of Medicine at Qatar University, which enabled the successful completion of this research project.</p>","PeriodicalId":16090,"journal":{"name":"Journal of Evidence‐Based Medicine","volume":null,"pages":null},"PeriodicalIF":3.6000,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jebm.12618","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Evidence‐Based Medicine","FirstCategoryId":"3","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jebm.12618","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MEDICINE, GENERAL & INTERNAL","Score":null,"Total":0}
引用次数: 0

Abstract

In evidence synthesis of analytical studies, methodological quality (mQ) assessment is necessary to determine the extent to which internal validity safeguards are implemented in the included studies, checked against a list of selected safeguards in an assessment tool. Such mQ tools consist of internal validity safeguards that are checked against those put in place by researchers when they undertake research to guard against systematic error in the design, conduct, and analysis of a study.1 However, consistency, or agreement, among the individuals who assess the safeguards implemented in published research against those listed in an mQ tool needs to be documented to ensure that the tool is reliable. Therefore, mQ tools need to have their interrater reliability tested in order to ensure the consistency of their use in research.2

Many existing tools assess mQ or risk of bias (RoB) for a specific study design, which leads to a lack of comparability across studies of different designs when different tools are used, and to assessment results that, taken as a whole, may lack meaning. For example, Cochrane's Risk of Bias (RoB2) tool is used to assess RoB in RCTs, while nonrandomized trials are assessed using the ROBINS-I tool. These scales are difficult to compare with one another, hence the need for a unified scale that is not confined to one study design. The MASTER scale was developed to overcome some of these issues by providing a comprehensive list of methodological safeguards across analytic study designs, allowing comparative assessment between such studies. It uses an assessment approach that takes the reviewer all the way from mQ assessment through to using the result for bias adjustment.3, 4 A drawback for reviewers using the MASTER scale has been the lack of information regarding its reliability, with no studies conducted to assess this metric.

Reliability refers to the degree to which studies maintain their relative position in a list over repeated measurements.5 For example, when assessing the reliability of a tool such as the MASTER scale, the tool would be considered reliable if studies that scored well when assessed by the first rater also scored well on subsequent assessments by different raters.5, 6 The scoring system for this scale has been discussed previously.7 Such consistency across the individuals undertaking mQ assessment needs to be established to ensure that the tool is reliable across different raters. Researchers trained in clinical epidemiology were chosen for this study so that they could also examine the wording of the scale items to remove ambiguity and improve its readability and applicability. This study therefore serves the dual purpose of evaluating the reliability of the MASTER scale across raters and examining the scale wording to see whether the tool needs to be updated for clarity and readability.
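
To make the idea of studies "maintaining their relative position" concrete, the following minimal sketch checks rank consistency between two raters using a Spearman rank correlation. This is an illustration only: the safeguard counts are invented, and the study itself quantified agreement with ICCs as described below.

```python
# Minimal sketch: rank consistency between two raters.
# The safeguard counts (Qi) below are invented for illustration.
import numpy as np
from scipy.stats import spearmanr

rater_a = np.array([33, 28, 25, 19, 14, 9])   # Qi from rater A, six studies
rater_b = np.array([32, 29, 23, 20, 12, 10])  # Qi from rater B, same studies

rho, p = spearmanr(rater_a, rater_b)
print(f"Spearman rho = {rho:.2f} (p = {p:.4f})")
# rho near 1 means the studies keep their relative ordering across raters,
# which is the intuition behind calling the tool reliable.
```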

As shown in Table S1, 11 studies8-18 containing a total of 1344 patients were chosen for assessment; they used different study designs to compare normal saline with Ringer's lactate in the treatment of acute pancreatitis. Five9-11, 13, 18 of the 11 studies were randomized controlled trials including 299 patients, three8, 12, 14 were cohort studies including 433 patients, and three15-17 were abstracts with 612 patients; the designs reported within the abstracts were observational in one and possibly experimental in two. The highest mean quality safeguard count (Qi) across the raters was observed in the study by De-Madaria10 at 33.17 (SD 1.33). Conversely, the lowest mean Qi was reported in the study by Vasu De Van,17 an abstract based on an RCT of 50 patients, with a mean of 8.83 (SD 4.45). The highest mean relative rank was again found in the study by De-Madaria10 at 0.99 (SD 0.01), while the lowest relative rank was noted in the study by Vasu De Van17 at 0.27 (SD 0.12). Similarly, for the absolute ranks, the highest mean was observed in the study by De-Madaria10 at 1.17 (SD 0.41), and the lowest was in the study by Vasu De Van17 with a mean of 10.67 (SD 0.52). It should be noted that the study with the highest count always has a relative rank of 1, which decreases as the study rank gets lower.7 Absolute ranks, on the other hand, are also best at 1 but increase as ranks get lower.
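
As a sketch of how the two ranking measures behave, the code below derives relative and absolute ranks from a single rater's safeguard counts. It assumes the relative rank is the count divided by the largest count in the list, which is consistent with the description above (the top study scores exactly 1) but should be checked against the cited scoring paper;7 the counts themselves are invented.

```python
# Minimal sketch: relative and absolute ranks from one rater's counts (Qi).
# Assumption: relative rank = Qi / max(Qi); absolute rank = ordinal
# position when sorted by Qi, with 1 = highest count. Counts are invented.
qi = {"study A": 33, "study B": 25, "study C": 17, "study D": 9}

max_qi = max(qi.values())
relative_rank = {s: q / max_qi for s, q in qi.items()}

ordered = sorted(qi, key=qi.get, reverse=True)      # highest count first
absolute_rank = {s: i + 1 for i, s in enumerate(ordered)}

for s in qi:
    print(f"{s}: Qi={qi[s]:2d}  relative={relative_rank[s]:.2f}  "
          f"absolute={absolute_rank[s]}")
```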

Figure S1 illustrates how the six raters evaluated one of the eleven studies. The graph shows the overall safeguard count that each rater assigned, as well as a breakdown of the overall count showing how much of the total was contributed by each standard. The results indicate high internal consistency and reliability for all three measures, as shown in Table S2. The total safeguard count (Qi) and the relative ranks yielded ICCs of 0.90 (95% CI: 0.79–0.97) and 0.90 (95% CI: 0.80–0.97), respectively, indicating an excellent level of agreement between raters. The absolute ranking measure had the highest level of agreement, with an ICC of 0.93 (95% CI: 0.86–0.98). Overall, the results suggest little disparity among the raters' aggregate assessments using the MASTER scale.
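
The text does not state which ICC model was used, so the sketch below implements one common choice, the two-way random-effects ICC for the average of k raters (ICC(2,k) in the Shrout–Fleiss taxonomy), directly from the ANOVA decomposition. The 4 x 3 rating matrix is invented for illustration.

```python
# Minimal sketch of a two-way random-effects ICC for average ratings,
# ICC(2,k). This is an assumed model choice; the study's exact ICC
# specification is not given in the text.
import numpy as np

def icc_2k(x: np.ndarray) -> float:
    """x is an (n studies) x (k raters) matrix of scores."""
    n, k = x.shape
    grand = x.mean()
    ssr = k * ((x.mean(axis=1) - grand) ** 2).sum()  # between-study SS
    ssc = n * ((x.mean(axis=0) - grand) ** 2).sum()  # between-rater SS
    sse = ((x - grand) ** 2).sum() - ssr - ssc       # residual SS
    msr = ssr / (n - 1)
    msc = ssc / (k - 1)
    mse = sse / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

# Invented safeguard counts: 4 studies rated by 3 raters.
scores = np.array([
    [33.0, 32.0, 34.0],
    [25.0, 27.0, 24.0],
    [17.0, 15.0, 18.0],
    [ 9.0, 11.0, 10.0],
])
print(f"ICC(2,k) = {icc_2k(scores):.2f}")
```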

Looking across the individual standards, interrater reliability was strong (Table S3). For instance, standard 3 (ICC 0.89, 95% CI 0.78–0.96) made the biggest contribution to the overall reliability across raters in this study. However, across all six raters, standards 1 (ICC 0.61, 95% CI 0.36–0.84), 4 (ICC 0.62, 95% CI 0.38–0.85), 6 (ICC 0.66, 95% CI 0.43–0.87), and 7 (ICC 0.61, 95% CI 0.36–0.84) had the most room for improvement in terms of reliability (Table S3). Overall, these results suggest moderate to excellent agreement among the raters within the MASTER scale standards.

Table 1 presents the updated MASTER scale, highlighting the areas where changes to the wording of safeguards were recommended. Overall, 26 safeguards had suggestions raised within the following four standards of the MASTER scale: “Equal recruitment,” “Equal retention,” “Equal implementation,” and “Equal prognosis.” The remaining standards, “Equal ascertainment,” “Sufficient analysis,” and “Temporal precedence,” had no suggestions raised. We present this version of the MASTER scale for future use as version 1.01.

In conclusion, the MASTER scale (updated V1.01, Table 1) appears to be a reliable unified tool (across analytical study designs) for assessing individual studies in evidence syntheses. Our study has identified some areas where the wording of the scale could be improved, which would enhance its clarity and further increase its reliability. The main issues flagged concerned the wording of the questions and how they could be improved for interpretation and understanding, especially by those who are not experts in clinical epidemiology. The main limitation of using any scale, not just the MASTER scale, is the time and expertise required to generate the assessment. Other than this, the MASTER scale has no other significant limitations. This opens the door for further research examining the reliability of the MASTER scale when it is used by health students, clinical researchers, and other health care workers. Overall, the findings of this study have significant implications for the future use and wider adoption of the MASTER scale in evidence synthesis due to its applicability to all types of study designs.

Authors JS and SD were responsible for the creation of the MASTER scale. There are no other interests to report.

We acknowledge the financial support provided by the College of Medicine at Qatar University, which enabled the successful completion of this research project.
