{"title":"Anchoring Validity Evidence for Automated Essay Scoring","authors":"Mark D. Shermis","doi":"10.1111/jedm.12336","DOIUrl":"10.1111/jedm.12336","url":null,"abstract":"<p>One of the challenges of discussing validity arguments for machine scoring of essays centers on the absence of a commonly held definition and theory of good writing. At best, the algorithms attempt to measure select attributes of writing and calibrate them against human ratings with the goal of accurate prediction of scores for new essays. Sometimes these attributes are based on the fundamentals of writing (e.g., fluency), but quite often they are based on locally developed rubrics that may be confounded with specific content coverage expectations. This lack of transparency makes it difficult to provide systematic evidence that machine scoring is assessing writing, but slices or correlates of writing performance.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46704579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Historical Perspectives on Score Comparability Issues Raised by Innovations in Testing","authors":"Peter Baldwin, Brian E. Clauser","doi":"10.1111/jedm.12318","DOIUrl":"10.1111/jedm.12318","url":null,"abstract":"<p>While score comparability across test forms typically relies on common (or randomly equivalent) examinees or items, innovations in item formats, test delivery, and efforts to extend the range of score interpretation may require a special data collection before examinees or items can be used in this way—or may be incompatible with common examinee or item designs altogether. When comparisons are necessary under these nonroutine conditions, forms still must be connected by <i>something</i> and this article focuses on these form-invariant connective <i>somethings</i>. A conceptual framework for thinking about the problem of score comparability in this way is given followed by a description of three classes of connectives. Examples from the history of innovations in testing are given for each class.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44486843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recent Challenges to Maintaining Score Comparability: A Commentary","authors":"Neil J. Dorans, Shelby J. Haberman","doi":"10.1111/jedm.12319","DOIUrl":"10.1111/jedm.12319","url":null,"abstract":"","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-05-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45896082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validating Performance Standards via Latent Class Analysis","authors":"Salih Binici, Ismail Cuhadar","doi":"10.1111/jedm.12325","DOIUrl":"10.1111/jedm.12325","url":null,"abstract":"<p>Validity of performance standards is a key element for the defensibility of standard setting results, and validating performance standards requires collecting multiple pieces of evidence at every step during the standard setting process. This study employs a statistical procedure, latent class analysis, to set performance standards and compares latent class analysis results with previously established performance standards via the modified-Angoff method for cross-validation. The context of the study is an operational large-scale science assessment administered in one of the southern states in the United States. Results show that the number of classes that emerged in the latent class analysis concurs with the number of existing performance levels. In addition, there is a substantial level of agreement between latent class analysis results and modified-Angoff method in terms of classifying students into the same performance levels. Overall, the findings establish evidence for the validity of the performance standards identified via the modified-Angoff method. Practical implications of the study findings are discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43539035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Score Comparability Issues with At-Home Testing and How to Address Them","authors":"Gautam Puhan, Sooyeon Kim","doi":"10.1111/jedm.12324","DOIUrl":"10.1111/jedm.12324","url":null,"abstract":"<p>As a result of the COVID-19 pandemic, at-home testing has become a popular delivery mode in many testing programs. When programs offer at-home testing to expand their service, the score comparability between test takers testing remotely and those testing in a test center is critical. This article summarizes statistical procedures that could be used to evaluate potential mode effects at both the item level and the total score levels. Using operational data from a licensure test, we also compared linking relationships between the test center and at-home testing groups to determine the reporting score conversion from a subpopulation invariance perspective.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43479468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Impact of Cheating on Score Comparability via Pool-Based IRT Pre-equating","authors":"Jinghua Liu, Kirk Becker","doi":"10.1111/jedm.12321","DOIUrl":"10.1111/jedm.12321","url":null,"abstract":"<p>For any testing programs that administer multiple forms across multiple years, maintaining score comparability via equating is essential. With continuous testing and high-stakes results, especially with less secure online administrations, testing programs must consider the potential for cheating on their exams. This study used empirical and simulated data to examine the impact of item exposure and prior knowledge on the estimation of item difficulty and test taker's ability via pool-based IRT preequating. Raw-to-theta transformations were derived from two groups of test takers with and without possible prior knowledge of exposed items, and these were compared to a criterion raw to theta transformation. Results indicated that item exposure has a large impact on item difficulty, not only altering the difficulty of exposed items, but also altering the difficulty of unexposed items. Item exposure makes test takers with prior knowledge appear more able. Further, theta estimation bias for test takers without prior knowledge increases when more test takers with possible prior knowledge are in the calibration population. Score inflation occurs for test takers with and without prior knowledge, especially for those with lower abilities.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46066972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Score Comparability between Online Proctored and In-Person Credentialing Exams","authors":"Paul Jones, Ye Tong, Jinghua Liu, Joshua Borglum, Vince Primoli","doi":"10.1111/jedm.12320","DOIUrl":"10.1111/jedm.12320","url":null,"abstract":"<p>This article studied two methods to detect mode effects in two credentialing exams. In Study 1, we used a “modal scale comparison approach,” where the same pool of items was calibrated separately, without transformation, within two TC cohorts (TC1 and TC2) and one OP cohort (OP1) matched on their pool-based scale score distributions. The calibrations from all three groups were used to score the TC2 cohort, designated the validation sample. The TC1 item parameters and TC1-based thetas and pass rates were more like the native TC2 values than the OP1-based values, indicating mode effects, but the score and pass/fail decision differences were small. In Study 2, we used a “cross-modal repeater approach” in which test takers who failed their first attempt in one modality took the test again in either the same or different modality. The two pairs of repeater groups (TC → TC: TC → OP, and OP → OP: OP → TC) were matched exactly on their first attempt scores. Results showed increased pass rate and greater score variability in all conditions involving OP, with mode effects noticeable in both the TC → OP condition and less-strongly in the OP → TC condition. Limitations of the study and implications for exam developers were discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43064453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Random Responders in the TIMSS 2015 Student Questionnaire: A Threat to Validity?","authors":"Saskia van Laar, Johan Braeken","doi":"10.1111/jedm.12317","DOIUrl":"https://doi.org/10.1111/jedm.12317","url":null,"abstract":"<p>The low-stakes character of international large-scale educational assessments implies that a participating student might at times provide unrelated answers as if s/he was not even reading the items and choosing a response option randomly throughout. Depending on the severity of this invalid response behavior, interpretations of the assessment results are at risk of being invalidated. Not much is known about the prevalence nor impact of such <i>random responders</i> in the context of international large-scale educational assessments. Following a mixture item response theory (IRT) approach, an initial investigation of both issues is conducted for the Confidence in and Value of Mathematics/Science (VoM/VoS) scales in the Trends in International Mathematics and Science Study (TIMSS) 2015 student questionnaire. We end with a call to facilitate further mapping of invalid response behavior in this context by the inclusion of instructed response items and survey completion speed indicators in the assessments and a habit of sensitivity checks in all secondary data studies.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/jedm.12317","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"137552821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting Differential Item Functioning Using Posterior Predictive Model Checking: A Comparison of Discrepancy Statistics","authors":"Seang-Hwane Joo, Philseok Lee","doi":"10.1111/jedm.12316","DOIUrl":"https://doi.org/10.1111/jedm.12316","url":null,"abstract":"<p>This study proposes a new Bayesian differential item functioning (DIF) detection method using posterior predictive model checking (PPMC). Item fit measures including infit, outfit, observed score distribution (OSD), and Q1 were considered as discrepancy statistics for the PPMC DIF methods. The performance of the PPMC DIF method was evaluated via a Monte Carlo simulation manipulating sample size, DIF size, DIF type, DIF percentage, and subpopulation trait distribution. Parametric DIF methods, such as Lord's chi-square and Raju's area approaches, were also included in the simulation design in order to compare the performance of the proposed PPMC DIF methods to those previously existing. Based on Type I error and power analysis, we found that PPMC DIF methods showed better-controlled Type I error rates than the existing methods and comparable power to detect uniform DIF. The implications and recommendations for applied researchers are discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"137981441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two IRT Characteristic Curve Linking Methods Weighted by Information","authors":"Shaojie Wang, Minqiang Zhang, Won-Chan Lee, Feifei Huang, Zonglong Li, Yixing Li, Sufang Yu","doi":"10.1111/jedm.12315","DOIUrl":"10.1111/jedm.12315","url":null,"abstract":"<p>Traditional IRT characteristic curve linking methods ignore parameter estimation errors, which may undermine the accuracy of estimated linking constants. Two new linking methods are proposed that take into account parameter estimation errors. The item- (IWCC) and test-information-weighted characteristic curve (TWCC) methods employ weighting components in the loss function from traditional methods by their corresponding item and test information, respectively. Monte Carlo simulation was conducted to evaluate the performances of the new linking methods and compare them with traditional ones. Ability difference between linking groups, sample size, and test length were manipulated under the common-item nonequivalent groups design. Results showed that the two information-weighted characteristic curve methods outperformed traditional methods, in general. TWCC was found to be more accurate and stable than IWCC. A pseudo-form pseudo-group analysis was also performed, and similar results were observed. Finally, guidelines for practice and future directions are discussed.</p>","PeriodicalId":47871,"journal":{"name":"Journal of Educational Measurement","volume":null,"pages":null},"PeriodicalIF":1.3,"publicationDate":"2022-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48483173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}