In the beginning, there was an item…
Deborah J. Harris, Catherine J. Welch, Stephen B. Dunbar
Educational Measurement: Issues and Practice, 43(4), 40–45. DOI: 10.1111/emip.12647. Published 2024-11-14. Open access: https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12647

Abstract: As educational researchers, we take scored item responses, create data sets to analyze, draw inferences from those analyses, and make decisions: about students' educational knowledge and future success, about how successful educational programs are, about what to teach tomorrow, and so on. It is good to remind ourselves that all of our analyses, from simple means to complex multilevel, multidimensional modeling, the interpretations of those analyses, and the decisions we make based on them rest, at their core, on a test taker responding to an item. With all the emphasis on modeling, analyses, big data, machine learning, and the like, we need to remember that it all starts with the items we collect information on. If we get those wrong, the results of subsequent analyses are unlikely to provide the information we are seeking.

It is true that how students and educators interact with items has changed, and continues to change. More and more of the student-item interactions happen online, and the days when an educator had relatively easy access to the actual test items, often after test administration, are in the past. This lack of access also holds for the researchers analyzing the response data: instead of a single test booklet aligned to a data file of test taker responses, there are large pools of items, and while the researcher may know a test taker was administered, say, item #SK-65243-0273A and what the response was, they do not know what the text of the item actually was, which can make it challenging to interpret analysis results at times.

From having a test author write the items for an assessment, to contracting with content specialists to draft items, to cloning items from a template, to having large language models/artificial intelligence produce items, item development has evolved and will continue to evolve. Item tryouts for pretesting the quality and functioning of an item, including gathering data to generate item statistics that aid forms construction and, in some instances, scoring, are giving way to efforts to develop algorithms that can accurately predict item characteristics, including item statistics, without gathering item data in advance of operational use (or at all). We are developing more innovative item types and collecting more data, such as latencies, click streams, and other process data on student responses to those items.

Sometimes we are so enamored of what we can do with the data that the analyses seem distant from the actual experience: a test taker responding to an item. This makes it challenging at times to interpret analysis results in terms of actionable steps. Our aim here is to examine the evolution of how items are developed and considered, concentrating on large-scale, K–12 educational assessments.

The Standards for Educational and Psychological Testing (Standards; American Educational Research Association [AERA], the American Psychological Association [APA], and the National Council on Measurement in Education [NCME], 2014) …

{"title":"Measurement Invariance for Multilingual Learners Using Item Response and Response Time in PISA 2018","authors":"Jung Yeon Park, Sean Joo, Zikun Li, Hyejin Yoon","doi":"10.1111/emip.12640","DOIUrl":"https://doi.org/10.1111/emip.12640","url":null,"abstract":"<p>This study examines potential assessment bias based on students' primary language status in PISA 2018. Specifically, multilingual (MLs) and nonmultilingual (non-MLs) students in the United States are compared with regard to their response time as well as scored responses across three cognitive domains (reading, mathematics, and science). Differential item functioning (DIF) analysis reveals that 7–14% of items exhibit DIF-related problems in scored responses between the two groups, aligning with PISA technical report results. While MLs generally spend more time on the test than non-MLs across cognitive levels, differential response time (DRT) functioning identifies significant time differences in 7–10% of items for students with similar cognitive levels. It was noticeable that items with DIF and DRT issues show limited overlap, suggesting diverse reasons for student struggles in the assessment. A deeper examination of item characteristics is recommended for test developers and teachers to gain a better understanding of these nuances.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"44 1","pages":"55-65"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12640","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143423536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"You Win Some, You Lose Some","authors":"Gregory J. Cizek","doi":"10.1111/emip.12643","DOIUrl":"https://doi.org/10.1111/emip.12643","url":null,"abstract":"<p>In a 1993 EM:IP article, I made six predictions related to measurement policy issues for the approaching millenium. In this article, I evaluate the accuracy of those predictions (Spoiler: I was only modestly accurate) and I proffer a mix of seven contemporary predictions, recommendations, and aspirations regarding assessment generally, NCME as an association, and specific psychometric practices.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"126-136"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143245272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to the Special Section on the Past, Present, and Future of Educational Measurement","authors":"Zhongmin Cui","doi":"10.1111/emip.12660","DOIUrl":"https://doi.org/10.1111/emip.12660","url":null,"abstract":"","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"38-39"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evolving Educational Testing to Meet Students' Needs: Design-in-Real-Time Assessment
Stephen G. Sireci, Javier Suárez-Álvarez, April L. Zenisky, Maria Elena Oliveri
Educational Measurement: Issues and Practice, 43(4), 112–118. DOI: 10.1111/emip.12653. Published 2024-11-10.

Abstract: The goal in personalized assessment is to best fit the needs of each individual test taker, given the assessment purposes. Design-in-Real-Time (DIRTy) assessment reflects the progressive evolution in testing from a single test, to an adaptive test, to an adaptive assessment system. In this article, we lay the foundation for DIRTy assessment and illustrate how it meets the complex needs of each individual learner. The assessment framework incorporates culturally responsive assessment principles, making it innovative with respect to both technology and equity. Key aspects are (a) assessment building blocks called "assessment task modules" (ATMs) linked to multiple content standards and skill domains, (b) gathering information on test takers' characteristics and preferences and using this information to improve their testing experience, and (c) selecting, modifying, and compiling ATMs to create a personalized test that best meets the needs of the testing purpose and the individual test taker.

{"title":"AI: Can You Help Address This Issue?","authors":"Deborah J. Harris","doi":"10.1111/emip.12655","DOIUrl":"https://doi.org/10.1111/emip.12655","url":null,"abstract":"<p>Linking across test forms or pools of items is necessary to ensure scores that are reported across different administrations are comparable and lead to consistent decisions for examinees whose abilities are the same, but who were administered different items. Most of these linkages consist of equating test forms or scaling calibrated items or pools to be on the same theta scale. The typical methodology to accomplish this linking makes use of common examinees or common items, where common examinees are understood to be groups of examinees of comparable ability, whether obtained through a single group (where the same examinees are administered multiple assessments) or a random groups design, where random assignment or pseudo random assignment is done (such as spiraling the test forms, say 1, 2, 3, 4, 5, and distributing them such that every 5th examinee receives the same form). Common item methodology is usually implemented by having identical items in multiple forms and using those items to link across forms or pools. These common items may be scored or unscored in terms of whether they are treated as internal or external anchors (i.e., whether they are contributing to the examinee's score).</p><p>There are situations where it is not practical to have either common examinees nor common items. Typically, these are high-stakes settings, where the security of the assessment questions would likely be at risk if any were repeated. This would include scenarios where the entire assessment is released after administration to promote transparency. In some countries, a single form of a national test may be administered to all examinees during a single administration time. While in some cases a student who does not do as well as they had hoped may retest the following year, this may be a small sample and these students would not be considered representative of the entire body of test-takers. In addition, it is presumed they would have spent the intervening year studying for the exam, and so they could not really be considered common examinees across years and assessment forms.</p><p>Although the decisions (such as university admissions) based on the assessment scores are comparable within the year, because all examinees are administered the same set of items on the same date, it is difficult to monitor trends over time as there is no linkage between forms across years. Although the general populations may be similar (e.g., 2024 secondary school graduates versus 2023 secondary school graduates), there is no evidence that the groups are strictly equivalent across years. 
Similarly, comparing how examinees perform across years (e.g., highest scores, average raw score, and so on) is challenging as there is no adjustment for yearly fluctuations in form difficulty across years.</p><p>There have been variations of both common item and common examinee linking, such as using similar items, rather than identical items, including where perhaps these similar items are","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"9-12"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12655","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
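As a worked illustration of the random groups design described above, the sketch below spirals two simulated forms across examinees and applies mean equating to place Form 2 scores on the Form 1 scale. This is a minimal sketch under simulated Rasch data, not a procedure proposed in the editorial.

```python
# Minimal sketch of a random groups equating design: forms are spiraled
# (alternated) across examinees, so the two groups are assumed to be of
# comparable ability, and mean equating adjusts for form difficulty.
# Simulated data for illustration only.
import numpy as np

rng = np.random.default_rng(42)
n, n_items = 10_000, 40
theta = rng.normal(0, 1, n)
form = np.arange(n) % 2             # spiraling: every 2nd examinee gets Form 2
b1 = rng.normal(0.0, 1, n_items)    # Form 1 item difficulties
b2 = rng.normal(0.3, 1, n_items)    # Form 2 is slightly harder

def number_correct(th, b):
    p = 1 / (1 + np.exp(-(th[:, None] - b[None, :])))  # Rasch response probs
    return (rng.random(p.shape) < p).sum(axis=1)

x1 = number_correct(theta[form == 0], b1)
x2 = number_correct(theta[form == 1], b2)

# Mean equating: shift Form 2 scores so the two groups' means match.
shift = x1.mean() - x2.mean()
print(f"Form 2 score y maps to y + {shift:.2f} on the Form 1 scale")
```

Equipercentile methods would align the whole score distribution rather than just the mean; the design logic, spiraling to create randomly equivalent groups, is the same.
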
{"title":"Comparative Analysis of Psychometric Frameworks and Properties of Scores from Autogenerated Test Forms","authors":"Won-Chan Lee, Stella Y. Kim","doi":"10.1111/emip.12648","DOIUrl":"https://doi.org/10.1111/emip.12648","url":null,"abstract":"<p>This paper explores the psychometric properties of scores derived from autogenerated test forms by introducing three conceptual frameworks: Alternate Test Forms, Randomly Parallel Forms, and Approximately Parallel Forms. Each framework provides a distinct perspective on score comparability, definitions of true score and standard error of measurement (SEM), and the necessity of equating. Through a simulation study, we illustrate how these frameworks compare in terms of true scores and SEMs, while also assessing the impact of equating on score comparability across varying levels of form variability. Ultimately, this study seeks to lay the groundwork for implementing scoring practices in large-scale standardized assessments that use autogenerated forms.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"43 4","pages":"13-23"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/emip.12648","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143252346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Linking Unlinkable Tests: A Step Forward","authors":"Silvia Testa, Renato Miceli, Renato Miceli","doi":"10.1111/emip.12638","DOIUrl":"https://doi.org/10.1111/emip.12638","url":null,"abstract":"<p>Random Equating (RE) and Heuristic Approach (HA) are two linking procedures that may be used to compare the scores of individuals in two tests that measure the same latent trait, in conditions where there are no common items or individuals. In this study, RE—that may only be used when the individuals taking the two tests come from the same population—was used as a benchmark for evaluating HA, which, in contrast, does not require any distributional assumptions. The comparison was based on both simulated and empirical data. Simulations showed that HA was good at reproducing the link shift connecting the difficulty parameters of the two sets of items, performing similarly to RE under the condition of slight violation of the distributional assumption. Empirical results showed satisfactory correspondence between the estimates of item and person parameters obtained via the two procedures.</p>","PeriodicalId":47345,"journal":{"name":"Educational Measurement-Issues and Practice","volume":"44 1","pages":"66-72"},"PeriodicalIF":2.7,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From Mandated to Test-Optional College Admissions Testing: Where Do We Go from Here?
Kyndra V. Middleton, Comfort H. Omonkhodion, Ernest Y. Amoateng, Lucy O. Okam, Daniela Cardoza, Alexis Oakley
Educational Measurement: Issues and Practice, 43(4), 33–37. DOI: 10.1111/emip.12649. Published 2024-11-10.