Title: Location-Matching Adaptive Testing for Polytomous Technology-Enhanced Items
Authors: Hyeon-Ah Kang, Gregory Arbet, Joe Betts, William Muntean
Journal: Applied Psychological Measurement (published 2024-01-16)
DOI: https://doi.org/10.1177/01466216241227548
Abstract: The article presents adaptive testing strategies for polytomously scored technology-enhanced innovative items. We investigate item selection methods that match examinees' ability levels in location and explore ways to leverage test-taking speeds during item selection. Existing approaches to selecting polytomous items are mostly based on information measures and tend to suffer from an item pool usage problem. In this study, we introduce location indices for polytomous items and show that location-matched item selection significantly alleviates the usage problem and achieves more diverse item sampling. We also consider matching items' time intensities so that testing times can be regulated across examinees. A numerical experiment based on Monte Carlo simulation suggests that location-matched item selection achieves significantly better and more balanced item pool usage. Leveraging working speed in item selection distinctly reduced both the average testing time and its variation across examinees. Both procedures incurred only a marginal measurement cost (e.g., in precision and efficiency), yet showed significant improvement in the administrative outcomes. The experiment in two test settings also suggested that the procedures can yield different administrative gains depending on the test design.

Title: Comparing Test-Taking Effort Between Paper-Based and Computer-Based Tests
Authors: Sebastian Weirich, Karoline A. Sachse, Sofie Henschel, Carola Schnitzler
Journal: Applied Psychological Measurement (published 2024-01-13)
DOI: https://doi.org/10.1177/01466216241227535
Abstract: The article compares the trajectories of students' self-reported test-taking effort during a 120-minute low-stakes large-scale assessment of English comprehension between a paper-and-pencil assessment (PPA) and a computer-based assessment (CBA). Test-taking effort was measured four times during the test. Using a within-subject design, each of the N = 2,676 German ninth-grade students completed half of the test in PPA and half in CBA mode, with the sequence of modes balanced across students. Overall, students' test-taking effort decreased considerably over the course of the test. While effort was, on average, lower in CBA than in PPA, the decline during the test did not differ between the two modes. In addition, students' self-reported effort was higher when the items were easier relative to students' abilities. The consequences of these results for the further development of CBA tests and of large-scale assessments in general are discussed.

{"title":"Using Auxiliary Item Information in the Item Parameter Estimation of a Graded Response Model for a Small to Medium Sample Size: Empirical Versus Hierarchical Bayes Estimation","authors":"Matthew Naveiras, Sun-Joo Cho","doi":"10.1177/01466216231209758","DOIUrl":"https://doi.org/10.1177/01466216231209758","url":null,"abstract":"Marginal maximum likelihood estimation (MMLE) is commonly used for item response theory item parameter estimation. However, sufficiently large sample sizes are not always possible when studying rare populations. In this paper, empirical Bayes and hierarchical Bayes are presented as alternatives to MMLE in small sample sizes, using auxiliary item information to estimate the item parameters of a graded response model with higher accuracy. Empirical Bayes and hierarchical Bayes methods are compared with MMLE to determine under what conditions these Bayes methods can outperform MMLE, and to determine if hierarchical Bayes can act as an acceptable alternative to MMLE in conditions where MMLE is unable to converge. In addition, empirical Bayes and hierarchical Bayes methods are compared to show how hierarchical Bayes can result in estimates of posterior variance with greater accuracy than empirical Bayes by acknowledging the uncertainty of item parameter estimates. The proposed methods were evaluated via a simulation study. Simulation results showed that hierarchical Bayes methods can be acceptable alternatives to MMLE under various testing conditions, and we provide a guideline to indicate which methods would be recommended in different research situations. R functions are provided to implement these proposed methods.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135819514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Bayesian Random Weights Linear Logistic Test Model for Within-Test Practice Effects","authors":"José H. Lozano, Javier Revuelta","doi":"10.1177/01466216231209752","DOIUrl":"https://doi.org/10.1177/01466216231209752","url":null,"abstract":"The present paper introduces a random weights linear logistic test model for the measurement of individual differences in operation-specific practice effects within a single administration of a test. The proposed model is an extension of the linear logistic test model of learning developed by Spada (1977) in which the practice effects are considered random effects varying across examinees. A Bayesian framework was used for model estimation and evaluation. A simulation study was conducted to examine the behavior of the model in combination with the Bayesian procedures. The results demonstrated the good performance of the estimation and evaluation methods. Additionally, an empirical study was conducted to illustrate the applicability of the model to real data. The model was applied to a sample of responses from a logical ability test providing evidence of individual differences in operation-specific practice effects.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135371926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Controlling the Minimum Item Exposure Rate in Computerized Adaptive Testing: A Two-Stage Sympson–Hetter Procedure","authors":"Hsiu-Yi Chao, Jyun-Hong Chen","doi":"10.1177/01466216231209756","DOIUrl":"https://doi.org/10.1177/01466216231209756","url":null,"abstract":"Computerized adaptive testing (CAT) can improve test efficiency, but it also causes the problem of unbalanced item usage within a pool. The effect of uneven item exposure rates can not only induce a test security problem due to overexposed items but also raise economic concerns about item pool development due to underexposed items. Therefore, this study proposes a two-stage Sympson–Hetter (TSH) method to enhance balanced item pool utilization by simultaneously controlling the minimum and maximum item exposure rates. The TSH method divides CAT into two stages. While the item exposure rates are controlled above a prespecified level (e.g., r min ) in the first stage to increase the exposure rates of the underexposed items, they are controlled below another prespecified level (e.g., r max ) in the second stage to prevent items from overexposure. To reduce the effect on trait estimation, TSH only administers a minimum sufficient number of underexposed items that are generally less discriminating in the first stage of CAT. The simulation study results indicate that the TSH method can effectively improve item pool usage without clearly compromising trait estimation precision in most conditions while maintaining the required level of test security.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135567498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two Statistics for Measuring the Score Comparability of Computerized Adaptive Tests","authors":"Adam E. Wyse","doi":"10.1177/01466216231209749","DOIUrl":"https://doi.org/10.1177/01466216231209749","url":null,"abstract":"This study introduces two new statistics for measuring the score comparability of computerized adaptive tests (CATs) based on comparing conditional standard errors of measurement (CSEMs) for examinees that achieved the same scale scores. One statistic is designed to evaluate score comparability of alternate CAT forms for individual scale scores, while the other statistic is designed to evaluate the overall score comparability of alternate CAT forms. The effectiveness of the new statistics is illustrated using data from grade 3 through 8 reading and math CATs. Results suggest that both CATs demonstrated reasonably high levels of score comparability, that score comparability was less at very high or low scores where few students score, and that using random samples with fewer students per grade did not have a big impact on score comparability. Results also suggested that score comparability was sometimes higher when the bottom 20% of scorers were used to calculate overall score comparability compared to all students. Additional discussion related to applying the statistics in different contexts is provided.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135780396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficiency Analysis of Item Response Theory Kernel Equating for Mixed-Format Tests","authors":"Joakim Wallmark, Maria Josefsson, Marie Wiberg","doi":"10.1177/01466216231209757","DOIUrl":"https://doi.org/10.1177/01466216231209757","url":null,"abstract":"This study aims to evaluate the performance of Item Response Theory (IRT) kernel equating in the context of mixed-format tests by comparing it to IRT observed score equating and kernel equating with log-linear presmoothing. Comparisons were made through both simulations and real data applications, under both equivalent groups (EG) and non-equivalent groups with anchor test (NEAT) sampling designs. To prevent bias towards IRT methods, data were simulated with and without the use of IRT models. The results suggest that the difference between IRT kernel equating and IRT observed score equating is minimal, both in terms of the equated scores and their standard errors. The application of IRT models for presmoothing yielded smaller standard error of equating than the log-linear presmoothing approach. When test data were generated using IRT models, IRT-based methods proved less biased than log-linear kernel equating. However, when data were simulated without IRT models, log-linear kernel equating showed less bias. Overall, IRT kernel equating shows great promise when equating mixed-format tests.","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-10-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135779713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Comparing Person-Fit and Traditional Indices Across Careless Response Patterns in Surveys
Authors: Eli A Jones, Stefanie A Wind, Chia-Lin Tsai, Yuan Ge
Journal: Applied Psychological Measurement (published 2023-09-01)
DOI: https://doi.org/10.1177/01466216231194358
Open access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10552731/pdf/
Abstract: Methods to identify carelessness in survey research can be valuable tools in reducing bias during survey development, validation, and use. Because carelessness may take multiple forms, researchers typically use multiple indices when identifying carelessness. In the current study, we extend the literature on careless response identification by examining the usefulness of three item response theory-based person-fit indices for both random and overconsistent careless response identification: infit MSE, outfit MSE, and the polytomous l_z statistic. We compared these statistics with traditional careless response indices using both empirical data and simulated data. The empirical data included 2,049 high school student surveys of teaching effectiveness from the Network for Educator Effectiveness. In the simulated data, we manipulated the type of carelessness (random response or overconsistency) and the percentage of carelessness present (0%, 5%, 10%, 20%). Results suggest that infit and outfit MSE and the l_z statistic may provide complementary information to traditional indices such as LongString, Mahalanobis Distance, Validity Items, and Completion Time. Receiver operating characteristic curves suggested that the person-fit indices showed good sensitivity and specificity for classifying both over-consistent and under-consistent careless patterns, thus functioning in a bidirectional manner. Carelessness classifications based on low fit values correlated with carelessness classifications from LongString and completion time, and classifications based on high fit values correlated with classifications from Mahalanobis Distance. We consider implications for research and practice.

{"title":"Does Sparseness Matter? Examining the Use of Generalizability Theory and Many-Facet Rasch Measurement in Sparse Rating Designs.","authors":"Stefanie A Wind, Eli Jones, Sara Grajeda","doi":"10.1177/01466216231182148","DOIUrl":"10.1177/01466216231182148","url":null,"abstract":"<p><p>Sparse rating designs, where each examinee's performance is scored by a small proportion of raters, are prevalent in practical performance assessments. However, relatively little research has focused on the degree to which different analytic techniques alert researchers to rater effects in such designs. We used a simulation study to compare the information provided by two popular approaches: Generalizability theory (G theory) and Many-Facet Rasch (MFR) measurement. In previous comparisons, researchers used complete data that were not simulated-thus limiting their ability to manipulate characteristics such as rater effects, and to understand the impact of incomplete data on the results. Both approaches provided information about rating quality in sparse designs, but the MFR approach highlighted rater effects related to centrality and bias more readily than G theory.</p>","PeriodicalId":48300,"journal":{"name":"Applied Psychological Measurement","volume":null,"pages":null},"PeriodicalIF":1.0,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10552733/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41174005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: The Effects of Aberrant Responding on Model-Fit Assuming Different Underlying Response Processes
Authors: Jennifer Reimers, Ronna C Turner, Jorge N Tendeiro, Wen-Juo Lo, Elizabeth Keiffer
Journal: Applied Psychological Measurement (published 2023-09-01)
DOI: https://doi.org/10.1177/01466216231201987
Open access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10552732/pdf/
Abstract: Aberrant responding on tests and surveys has been shown to affect the psychometric properties of scales and the statistical analyses that use those scales in cumulative model contexts. This study extends prior research by comparing the effects of four types of aberrant responding on model fit in both cumulative and ideal point model contexts using the graded partial credit model (GPCM) and the generalized graded unfolding model (GGUM). When fitting models to data, model misfit can be a function of both misspecification and aberrant responding. Results demonstrate how varying levels of aberrant data can severely impact model fit for both cumulative and ideal point data. Specifically, longstring responses have a stronger impact on dimensionality for both ideal point and cumulative data, while random responding tends to have the most negative impact on model fit according to information criteria (AIC, BIC). The results also indicate that an ideal point model such as the GGUM may be able to fit cumulative data as well as the cumulative model itself (GPCM), whereas cumulative models may not provide sufficient fit for data simulated using an ideal point model.