{"title":"Detecting Cheating in Large-Scale Assessment: The Transfer of Detectors to New Tests.","authors":"Jochen Ranger, Nico Schmidt, Anett Wolgast","doi":"10.1177/00131644221132723","DOIUrl":"10.1177/00131644221132723","url":null,"abstract":"<p><p>Recent approaches to the detection of cheaters in tests employ detectors from the field of machine learning. Detectors based on supervised learning algorithms achieve high accuracy but require labeled data sets with identified cheaters for training. Labeled data sets are usually not available at an early stage of the assessment period. In this article, we discuss the approach of adapting a detector that was trained previously with a labeled training data set to a new unlabeled data set. The training and the new data set may contain data from different tests. The adaptation of detectors to new data or tasks is denominated as transfer learning in the field of machine learning. We first discuss the conditions under which a detector of cheating can be transferred. We then investigate whether the conditions are met in a real data set. We finally evaluate the benefits of transferring a detector of cheating. We find that a transferred detector has higher accuracy than an unsupervised detector of cheating. A naive transfer that consists of a simple reuse of the detector increases the accuracy considerably. A transfer via a self-labeling (SETRED) algorithm increases the accuracy slightly more than the naive transfer. The findings suggest that the detection of cheating might be improved by using existing detectors of cheating at an early stage of an assessment period.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 5","pages":"1033-1058"},"PeriodicalIF":2.1,"publicationDate":"2023-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10470164/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10525104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Data Fusion to Detect Preknowledge Test-Taking Behavior Using Machine Learning","authors":"Kaiwen Man","doi":"10.1177/00131644231193625","DOIUrl":"https://doi.org/10.1177/00131644231193625","url":null,"abstract":"In various fields, including college admission, medical board certifications, and military recruitment, high-stakes decisions are frequently made based on scores obtained from large-scale assessments. These decisions necessitate precise and reliable scores that enable valid inferences to be drawn about test-takers. However, the ability of such tests to provide reliable, accurate inference on a test-taker’s performance could be jeopardized by aberrant test-taking practices, for instance, practicing real items prior to the test. As a result, it is crucial for administrators of such assessments to develop strategies that detect potential aberrant test-takers after data collection. The aim of this study is to explore the implementation of machine learning methods in combination with multimodal data fusion strategies that integrate bio-information technology, such as eye-tracking, and psychometric measures, including response times and item responses, to detect aberrant test-taking behaviors in technology-assisted remote testing settings.","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135014578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fixed Effects or Mixed Effects Classifiers? Evidence From Simulated and Archival Data.","authors":"Anthony A Mangino, Jocelyn H Bolin, W Holmes Finch","doi":"10.1177/00131644221108180","DOIUrl":"10.1177/00131644221108180","url":null,"abstract":"<p><p>This study seeks to compare fixed and mixed effects models for the purposes of predictive classification in the presence of multilevel data. The first part of the study utilizes a Monte Carlo simulation to compare fixed and mixed effects logistic regression and random forests. An applied examination of the prediction of student retention in the public-use U.S. PISA data set was considered to verify the simulation findings. Results of this study indicate fixed effects models performed comparably with mixed effects models across both the simulation and PISA examinations. Results broadly suggest that researchers should be cognizant of the type of predictors and data structure being used, as these factors carried more weight than did the model type.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"710-739"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311958/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of the Stacking Ensemble Machine Learning Algorithm for Cheating Detection in Large-Scale Assessment.","authors":"Todd Zhou, Hong Jiao","doi":"10.1177/00131644221117193","DOIUrl":"10.1177/00131644221117193","url":null,"abstract":"<p><p>Cheating detection in large-scale assessment received considerable attention in the extant literature. However, none of the previous studies in this line of research investigated the stacking ensemble machine learning algorithm for cheating detection. Furthermore, no study addressed the issue of class imbalance using resampling. This study explored the application of the stacking ensemble machine learning algorithm to analyze the item response, response time, and augmented data of test-takers to detect cheating behaviors. The performance of the stacking method was compared with that of two other ensemble methods (bagging and boosting) as well as six base non-ensemble machine learning algorithms. Issues related to class imbalance and input features were addressed. The study results indicated that stacking, resampling, and feature sets including augmented summary data generally performed better than its counterparts in cheating detection. Compared with other competing machine learning algorithms investigated in this study, the meta-model from stacking using discriminant analysis based on the top two base models-Gradient Boosting and Random Forest-generally performed the best when item responses and the augmented summary statistics were used as the input features with an under-sampling ratio of 10:1 among all the study conditions.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"831-854"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311957/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747522","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Comparing the Psychometric Properties of a Scale Across Three Likert and Three Alternative Formats: An Application to the Rosenberg Self-Esteem Scale.","authors":"Xijuan Zhang, Linnan Zhou, Victoria Savalei","doi":"10.1177/00131644221111402","DOIUrl":"https://doi.org/10.1177/00131644221111402","url":null,"abstract":"<p><p>Zhang and Savalei proposed an alternative scale format to the Likert format, called the Expanded format. In this format, response options are presented in complete sentences, which can reduce acquiescence bias and method effects. The goal of the current study was to compare the psychometric properties of the Rosenberg Self-Esteem Scale (RSES) in the Expanded format and in two other alternative formats, relative to several versions of the traditional Likert format. We conducted two studies to compare the psychometric properties of the RSES across the different formats. We found that compared with the Likert format, the alternative formats tend to have a unidimensional factor structure, less response inconsistency, and comparable validity. In addition, we found that the Expanded format resulted in the best factor structure among the three alternative formats. Researchers should consider the Expanded format, especially when creating short psychological scales such as the RSES.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"649-683"},"PeriodicalIF":2.7,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/0c/99/10.1177_00131644221111402.PMC10311935.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9802113","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relative Robustness of CDMs and (M)IRT in Measuring Growth in Latent Skills.","authors":"Qi Helen Huang, Daniel M Bolt","doi":"10.1177/00131644221117194","DOIUrl":"10.1177/00131644221117194","url":null,"abstract":"<p><p>Previous studies have demonstrated evidence of latent skill continuity even in tests intentionally designed for measurement of binary skills. In addition, the assumption of binary skills when continuity is present has been shown to potentially create a lack of invariance in item and latent ability parameters that may undermine applications. In this article, we examine measurement of growth as one such application, and consider multidimensional item response theory (MIRT) as a competing alternative. Motivated by prior findings concerning the effects of skill continuity, we study the relative robustness of cognitive diagnostic models (CDMs) and (M)IRT models in the measurement of growth under both binary and continuous latent skill distributions. We find CDMs to be a less robust way of quantifying growth under misspecification, and subsequently provide a real-data example suggesting underestimation of growth as a likely consequence. It is suggested that researchers should regularly attend to the assumptions associated with the use of latent binary skills and consider (M)IRT as a potentially more robust alternative if unsure of their discrete nature.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"808-830"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311955/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Are Speeded Tests Unfair? Modeling the Impact of Time Limits on the Gender Gap in Mathematics.","authors":"Andrea H Stoevenbelt, Jelte M Wicherts, Paulette C Flore, Lorraine A T Phillips, Jakob Pietschnig, Bruno Verschuere, Martin Voracek, Inga Schwabe","doi":"10.1177/00131644221111076","DOIUrl":"https://doi.org/10.1177/00131644221111076","url":null,"abstract":"<p><p>When cognitive and educational tests are administered under time limits, tests may become speeded and this may affect the reliability and validity of the resulting test scores. Prior research has shown that time limits may create or enlarge gender gaps in cognitive and academic testing. On average, women complete fewer items than men when a test is administered with a strict time limit, whereas gender gaps are frequently reduced when time limits are relaxed. In this study, we propose that gender differences in test strategy might inflate gender gaps favoring men, and relate test strategy to stereotype threat effects under which women underperform due to the pressure of negative stereotypes about their performance. First, we applied a Bayesian two-dimensional item response theory (IRT) model to data obtained from two registered reports that investigated stereotype threat in mathematics, and estimated the latent correlation between underlying test strategy (here, completion factor, a proxy for working speed) and mathematics ability. Second, we tested the gender gap and assessed potential effects of stereotype threat on female test performance. We found a positive correlation between the completion factor and mathematics ability, such that more able participants dropped out later in the test. We did not observe a stereotype threat effect but found larger gender differences on the latent completion factor than on latent mathematical ability, suggesting that test strategies affect the gender gap in timed mathematics performance. We argue that if the effect of time limits on tests is not taken into account, this may lead to test unfairness and biased group comparisons, and urge researchers to consider these effects in either their analyses or study planning.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"684-709"},"PeriodicalIF":2.7,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311959/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10299044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Robust Method for Detecting Item Misfit in Large-Scale Assessments.","authors":"Matthias von Davier, Ummugul Bezirhan","doi":"10.1177/00131644221105819","DOIUrl":"10.1177/00131644221105819","url":null,"abstract":"<p><p>Viable methods for the identification of item misfit or Differential Item Functioning (DIF) are central to scale construction and sound measurement. Many approaches rely on the derivation of a limiting distribution under the assumption that a certain model fits the data perfectly. Typical DIF assumptions such as the monotonicity and population independence of item functions are present even in classical test theory but are more explicitly stated when using item response theory or other latent variable models for the assessment of item fit. The work presented here provides a robust approach for DIF detection that does not assume perfect model data fit, but rather uses Tukey's concept of contaminated distributions. The approach uses robust outlier detection to flag items for which adequate model data fit cannot be established.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"740-765"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311954/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Bayesian General Model to Account for Individual Differences in Operation-Specific Learning Within a Test.","authors":"José H Lozano, Javier Revuelta","doi":"10.1177/00131644221109796","DOIUrl":"10.1177/00131644221109796","url":null,"abstract":"<p><p>The present paper introduces a general multidimensional model to measure individual differences in learning within a single administration of a test. Learning is assumed to result from practicing the operations involved in solving the items. The model accounts for the possibility that the ability to learn may manifest differently for correct and incorrect responses, which allows for distinguishing different types of learning effects in the data. Model estimation and evaluation is based on a Bayesian framework. A simulation study is presented that examines the performance of the estimation and evaluation methods. The results show accuracy in parameter recovery as well as good performance in model evaluation and selection. An empirical study illustrates the applicability of the model to data from a logical ability test.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"782-807"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311956/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"10300370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Importance of Coefficient Alpha for Measurement Research: Loading Equality Is Not Necessary for Alpha's Utility as a Scale Reliability Index.","authors":"Tenko Raykov, James C Anthony, Natalja Menold","doi":"10.1177/00131644221104972","DOIUrl":"10.1177/00131644221104972","url":null,"abstract":"<p><p>The population relationship between coefficient alpha and scale reliability is studied in the widely used setting of unidimensional multicomponent measuring instruments. It is demonstrated that for any set of component loadings on the common factor, regardless of the extent of their inequality, the discrepancy between alpha and reliability can be arbitrarily small in any considered population and hence practically ignorable. In addition, the set of parameter values where this discrepancy is negligible is shown to possess the same dimensionality as that of the underlying model parameter space. The article contributes to the measurement and related literature by pointing out that (a) approximate or strict loading identity is not a necessary condition for the utility of alpha as a trustworthy index of scale reliability, and (b) coefficient alpha can be a dependable reliability measure with any extent of inequality in the component loadings.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":"83 4","pages":"766-781"},"PeriodicalIF":2.1,"publicationDate":"2023-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10311953/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"9747518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}