MASTER scale for methodological quality assessment: Reliability assessment and update

Ashraf I. Ahmed, Muhammad Zain Kaleem, Amgad Mohamed Elshoeibi, Abdalla Moustafa Elsayed, Elhassan Mahmoud, Yaman A. Khamis, Luis Furuya-Kanamori, Jennifer C. Stone, Suhail A. Doi

Journal of Evidence-Based Medicine 2024; 17(2): 263-266. https://doi.org/10.1111/jebm.12618
In evidence synthesis of analytical studies, methodological quality (mQ) assessment is necessary to determine the extent to which internal validity safeguards were implemented in the included studies, judged against a list of selected safeguards in an assessment tool. Such mQ tools consist of internal validity safeguards that are checked against those put in place by researchers to guard against systematic error in the design, conduct, and analysis of a study.1 However, agreement among the individuals who assess implemented safeguards in published research against those listed in an mQ tool needs to be documented to ensure that the tool is reliable. Therefore, mQ tools need to have their interrater reliability tested to ensure consistency of their use in research.2
Many existing tools assess mQ or risk of bias (RoB) for a specific study design, which leads to a lack of comparability across studies of different designs when different tools are used, and to assessment results that, taken as a whole, may lack meaning. For example, Cochrane's Risk of Bias (RoB 2) tool is used to assess RoB in randomized controlled trials (RCTs), while nonrandomized studies are assessed using the ROBINS-I tool. These scales are difficult to compare with one another, and hence there is a need for a unified scale that is not confined to one study design. The MASTER scale was developed to overcome some of these issues by providing a comprehensive list of methodological safeguards across analytic study designs, allowing comparative assessment between such studies. It uses an assessment approach that takes the reviewer all the way from mQ assessment through to the use of this assessment for bias adjustment.3, 4 A drawback for reviewers using the MASTER scale has been the lack of information regarding its reliability, with no studies conducted to assess this metric.
The degree to which studies maintain their relative position in a list over repeated measurements is referred to as reliability.5 For example, the MASTER scale would be considered reliable if studies that scored well with one rater also scored well on subsequent assessments by different raters.5, 6 The scoring system for this scale has been discussed previously.7 Such consistency across individuals undertaking mQ assessment needs to be established to ensure that the tool is reliable across different raters. Researchers trained in clinical epidemiology were chosen for this study so that they could also examine the wording of the scale items to remove ambiguity and improve readability and applicability. This study therefore serves the dual purpose of evaluating the reliability of the MASTER scale across raters and examining the scale wording to see whether the tool needs to be updated for clarity and readability.
As shown in Table S1, 11 studies8-18 comprising a total of 1344 patients were chosen for assessment; they used different study designs to compare normal saline with Ringer's lactate in the treatment of acute pancreatitis. Five9-11, 13, 18 of the 11 studies were randomized controlled trials including 299 patients, three8, 12, 14 were cohort studies including 433 patients, and three15-17 were abstracts covering 612 patients; the designs reported within the abstracts were observational in one and possibly experimental in two. The highest mean quality safeguard count (Qi) across the raters was observed in the study by De-Madaria10 at 33.17 (SD 1.33). Conversely, the lowest mean Qi was reported in the study by Vasu De Van,17 an abstract based on an RCT of 50 patients, with a mean of 8.83 (SD 4.45). The highest mean relative rank was again found in the study by De-Madaria10 at 0.99 (SD 0.01), while the lowest was noted in the study by Vasu De Van17 at 0.27 (SD 0.12). Similarly, for the absolute ranks, the best (lowest) mean was observed in the study by De-Madaria10 at 1.17 (SD 0.41), and the worst in the study by Vasu De Van17 with a mean of 10.67 (SD 0.52). It should be noted that the study with the highest count always has a relative rank of 1, and this value decreases as a study ranks lower.7 Absolute ranks, in contrast, are best at 1 and increase as studies rank lower.
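These rank constructions can be illustrated in code. The following is a minimal sketch in Python that assumes the relative rank is computed as Qi divided by the maximum Qi for a given rater (consistent with the highest-count study always receiving a relative rank of 1); the exact construction is described in reference 7 and may differ in detail, and the counts shown are hypothetical.

```python
# Minimal sketch of rank computation from safeguard counts (Qi).
# Assumption: relative rank = Qi / max(Qi) for a given rater, so the
# highest-count study gets 1; see reference 7 for the exact construction.
import numpy as np
from scipy.stats import rankdata

# Hypothetical safeguard counts for one rater across five studies
qi = np.array([33, 25, 18, 12, 9], dtype=float)

relative_rank = qi / qi.max()                 # 1 for the highest count, decreasing
absolute_rank = rankdata(-qi, method="min")   # 1 for the highest count, increasing

for study, (q, rr, ar) in enumerate(zip(qi, relative_rank, absolute_rank), 1):
    print(f"Study {study}: Qi={q:.0f}, relative rank={rr:.2f}, absolute rank={ar:.0f}")
```

The means and SDs reported above would then correspond to summarizing each study's relative and absolute ranks across the six raters.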
Figure S1 illustrates how the six raters evaluated one of the eleven studies. The graph shows the overall safeguard count assigned by each rater, together with a breakdown of that count showing how much each standard contributed to the total. The results indicate high internal consistency and reliability for all three measures, as shown in Table S2. The total safeguard count (Qi) and the relative ranks yielded ICCs of 0.90 (95% CI: 0.79–0.97) and 0.90 (95% CI: 0.80–0.97), respectively, indicating an excellent level of agreement between raters. The absolute ranking measure had the highest level of agreement, with an ICC of 0.93 (95% CI: 0.86–0.98). Overall, the results suggest little disparity among raters in aggregate assessments made using the MASTER scale.
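As an illustration of how such agreement statistics can be computed, the sketch below implements ICC(2,1) (two-way random effects, absolute agreement, single rater; the Shrout and Fleiss formulation) in Python. The letter does not state which ICC form was used, so this is one common choice rather than the authors' exact method, and the score matrix is hypothetical.

```python
# Minimal sketch of an intraclass correlation computation, ICC(2,1).
# Assumption: two-way random effects, absolute agreement, single rater
# (Shrout & Fleiss); data are hypothetical (rows = studies, columns = raters).
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1) for an n-studies x k-raters matrix of scores."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # per-study means
    col_means = x.mean(axis=0)   # per-rater means
    # Mean squares from the two-way ANOVA decomposition
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between studies
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    sse = np.sum((x - row_means[:, None] - col_means[None, :] + grand) ** 2)
    mse = sse / ((n - 1) * (k - 1))                         # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical Qi counts: six studies rated by three raters
scores = np.array([
    [33, 32, 34],
    [28, 27, 29],
    [22, 24, 23],
    [18, 16, 19],
    [12, 13, 11],
    [ 9, 10,  8],
], dtype=float)
print(f"ICC(2,1) = {icc_2_1(scores):.2f}")
```

Confidence intervals such as those reported above can be obtained from the corresponding F-distribution formulas or from a statistics package (for example, pingouin's intraclass_corr reports several ICC variants with 95% CIs).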
Looking across the individual standards, interrater reliability was strong (Table S3). Standard 3 (ICC 0.89, 95% CI 0.78–0.96) made the largest contribution to the overall reliability across raters in this study. However, across all six raters, standards 1 (ICC 0.61, 95% CI 0.36–0.84), 4 (ICC 0.62, 95% CI 0.38–0.85), 6 (ICC 0.66, 95% CI 0.43–0.87), and 7 (ICC 0.61, 95% CI 0.36–0.84) had the most room for improvement in terms of reliability (Table S3). Overall, these results suggest moderate to excellent agreement among raters within the MASTER scale standards.
Table 1 presents the updated MASTER scale and highlights the areas where changes to the wording of safeguards were recommended. Overall, suggestions were raised for 26 safeguards within the following four standards of the MASTER scale: “Equal recruitment,” “Equal retention,” “Equal implementation,” and “Equal prognosis.” No suggestions were raised for the remaining three standards: “Equal ascertainment,” “Sufficient analysis,” and “Temporal precedence.” We present this updated version of the MASTER scale for future use as version 1.01.
In conclusion, the MASTER scale (updated to version 1.01, Table 1) appears to be a reliable unified tool (across analytical study designs) for assessing individual studies in evidence syntheses. Our study has identified some areas where the wording of the scale could be improved, which would enhance its clarity and further increase its reliability. The main issues flagged concerned the wording of the questions and how it could be improved for interpretation and understanding, especially by those who are not experts in clinical epidemiology. The main limitation of using any scale, not just the MASTER scale, is the time and expertise required to generate the assessment; beyond this, the MASTER scale has no other significant limitations. This opens the door for further research examining the reliability of the MASTER scale when it is used by health students, clinical researchers, and other health care workers. Overall, the findings of this study have significant implications for the future use and wider adoption of the MASTER scale in evidence synthesis, given its applicability to all types of analytical study designs.
Authors JS and SD were responsible for the creation of the MASTER scale. There are no other interests to report.
We acknowledge the financial support provided by the College of Medicine at Qatar University, which enabled the successful completion of this research project.