{"title":"Building a Validity Argument for the TOEFL Junior® Tests","authors":"Ching‐Ni Hsieh","doi":"10.1002/ets2.12379","DOIUrl":"https://doi.org/10.1002/ets2.12379","url":null,"abstract":"The TOEFL Junior® tests are designed to evaluate young language students' English reading, listening, speaking, and writing skills in an English‐medium secondary instructional context. This paper articulates a validity argument constructed to support the use and interpretation of the TOEFL Junior test scores for the purpose of placement, progress monitoring, and evaluation of a test taker's English skills. The validity argument is built within an argument‐based approach to validation and consists of six validity inferences that provide a coherent narrative about the measurement quality and intended uses of the TOEFL Junior test scores. Each validity inference is underpinned by specific assumptions and corresponding evidential support. The claims and supporting evidence presented in the validity argument demonstrate how the TOEFL Junior research program takes a rigorous approach to supporting the uses of the tests. The compilation of validity evidence serves as a resource for score users and stakeholders, guiding them to make informed decisions regarding the use and interpretation of TOEFL Junior test scores within their educational contexts.","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"48 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-05-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140974663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Validity, Reliability, and Fairness Evidence for the JD‐Next Exam","authors":"Steven Holtzman, Jonathan Steinberg, Jonathan Weeks, Christopher Robertson, Jessica Findley, David M Klieger","doi":"10.1002/ets2.12378","DOIUrl":"https://doi.org/10.1002/ets2.12378","url":null,"abstract":"At a time when institutions of higher education are exploring alternatives to traditional admissions testing, institutions are also seeking to better support students and prepare them for academic success. Under such an engaged model, one may seek to measure not just the accumulated knowledge and skills that students would bring to a new academic program but also their ability to grow and learn through the academic program. To help prepare students for law school before they matriculate, the JD‐Next is a fully online, noncredit, 7‐ to 10‐week course to train potential juris doctor students in case reading and analysis skills. This study builds on the work presented for previous JD‐Next cohorts by introducing new scoring and reliability estimation methodologies based on a recent redesign of the assessment for the 2021 cohort, and it presents updated validity and fairness findings using first‐year grades, rather than merely first‐semester grades as in prior cohorts. Results support the claim that the JD‐Next exam is reliable and valid for predicting law school success, providing a statistically significant increase in predictive power over baseline models, including entrance exam scores and grade point averages. In terms of fairness across racial and ethnic groups, smaller score disparities are found with JD‐Next than with traditional admissions assessments, and the assessment is shown to be equally predictive for students from underrepresented minority groups and for first‐generation students. These findings, in conjunction with those from previous research, support the use of the JD‐Next exam for both preparing and admitting future law school students.","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"7 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140716447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Practical Considerations in Item Calibration With Small Samples Under Multistage Test Design: A Case Study","authors":"Hongwen Guo, Matthew S. Johnson, Daniel F. McCaffrey, Lixong Gu","doi":"10.1002/ets2.12376","DOIUrl":"https://doi.org/10.1002/ets2.12376","url":null,"abstract":"The multistage testing (MST) design has been gaining attention and popularity in educational assessments. For testing programs that have small test‐taker samples, it is challenging to calibrate new items to replenish the item pool. In the current research, we used the item pools from an operational MST program to illustrate how research studies can be built upon literature and program‐specific data to help to fill the gaps between research and practice and to make sound psychometric decisions to address the small‐sample issues. The studies included choice of item calibration methods, data collection designs to increase sample sizes, and item response theory models in producing the score conversion tables. Our results showed that, with small samples, the fixed parameter calibration (FIPC) method performed consistently the best for calibrating new items, compared to the traditional separate‐calibration with scaling method and a new approach of a calibration method based on the minimum discriminant information adjustment. In addition, the concurrent FIPC calibration with data from multiple administrations also improved parameter estimation of new items. However, because of the program‐specific settings, a simpler model may not improve current practice when the sample size was small and when the initial item pools were well‐calibrated using a two‐parameter logistic model with a large field trial data.","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"11 1-2","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139867309","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Writing Traits in a Formative Essay Corpus","authors":"Paul Deane, Duanli Yan, Katherine Castellano, Y. Attali, Michelle Lamar, Mo Zhang, Ian Blood, James V. Bruno, Chen Li, Wenju Cui, Chunyi Ruan, Colleen Appel, Kofi James, Rodolfo Long, Farah Qureshi","doi":"10.1002/ets2.12377","DOIUrl":"https://doi.org/10.1002/ets2.12377","url":null,"abstract":"This paper presents a multidimensional model of variation in writing quality, register, and genre in student essays, trained and tested via confirmatory factor analysis of 1.37 million essay submissions to ETS' digital writing service, Criterion®. The model was also validated with several other corpora, which indicated that it provides a reasonable fit for essay data from 4th grade to college. It includes an analysis of the test‐retest reliability of each trait, longitudinal trends by trait, both within the school year and from 4th to 12th grades, and analysis of genre differences by trait, using prompts from the Criterion topic library aligned with the major modes of writing (exposition, argumentation, narrative, description, process, comparison and contrast, and cause and effect). It demonstrates that many of the traits are about as reliable as overall e‐rater® scores, that the trait model can be used to build models somewhat more closely aligned with human scores than standard e‐rater models, and that there are large, significant trait differences by genre, consistent with genre differences in trait patterns described in the larger literature. Some of the traits demonstrated clear trends between successive revisions. Students using Criterion appear to have consistently improved grammar, usage, and spelling after getting Criterion feedback and to have marginally improved essay organization. Many of the traits also demonstrated clear grade level trends. These features indicate that the trait model could be used to support more detailed scoring and reporting for writing assessments and learning tools.","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":" 10","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139617004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Use of TOEFL iBT® in Admissions Decisions: Stakeholder Perceptions of Policies and Practices","authors":"Sara T. Cushing, Haoshan Ren, Yi Tan","doi":"10.1002/ets2.12375","DOIUrl":"https://doi.org/10.1002/ets2.12375","url":null,"abstract":"This paper reports partial results from a larger study of how three different groups of stakeholders—university admissions officers, faculty in graduate programs involved in admissions decisions, and Intensive English Program (IEP) faculty—interpret and use TOEFL iBT® scores in making admissions decisions or preparing students to meet minimum test score requirements. Our overall goal was to gain a better understanding of the perceived role of English language proficiency in admissions decisions and the internal and external factors that inform decisions about acceptable ways to demonstrate proficiency and minimal standards. To that end, we designed surveys for each stakeholder group that contained questions for all groups and questions specific to each group. This report focuses on the questions that were common to all three groups across two areas: (1) understandings of and participation in institutional policy making around English language proficiency tests and (2) knowledge of and attitudes toward the TOEFL iBT test itself. Our results suggested that, as predicted, university admissions staff were the most aware of and involved in policy making but frequently consulted with ESL experts such as IEP faculty when setting policies. This stakeholder group was also the most knowledgeable about the TOEFL iBT test. Faculty in graduate programs varied in their understanding of and involvement in policy making and reported the least familiarity with the test. However, they reported that more information about many aspects of the test would help them make better admissions decisions. The results of the study add to the growing literature on language assessment literacy among various stakeholder groups, especially in terms of identifying aspects of assessment literacy that are important to different groups of stakeholders.","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":" 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-01-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139619988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Culturally Responsive Assessment: Provisional Principles","authors":"Michael E. Walker, Margarita Olivera-Aguilar, Blair Lehman, Cara Laitusis, Danielle Guzman-Orth, Melissa Gholson","doi":"10.1002/ets2.12374","DOIUrl":"10.1002/ets2.12374","url":null,"abstract":"<p>Recent criticisms of large-scale summative assessments have claimed that the assessments are biased against historically excluded groups because of the assessments' lack of cultural representation. Accompanying these criticisms is a call for more culturally responsive assessments—assessments that take into account the background characteristics of the students; their beliefs, values, and ethics; their lived experiences; and everything that affects how they learn and behave and communicate. In this paper, we present provisional principles, based on a review of research, that we deem necessary for fostering cultural responsiveness in assessment. We believe the application of these principles can address the criticisms of current assessments.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2023 1","pages":"1-24"},"PeriodicalIF":0.0,"publicationDate":"2023-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12374","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45011414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpretation and Use of a Workplace English Language Proficiency Test Score Report: Perspectives of TOEIC® Test Takers and Score Users in Taiwan","authors":"Ching-Ni Hsieh","doi":"10.1002/ets2.12373","DOIUrl":"10.1002/ets2.12373","url":null,"abstract":"<p>Research in validity suggests that stakeholders' interpretation and use of test results should be an aspect of validity. Claims about the meaningfulness of test score interpretations and consequences of test use should be backed by evidence that stakeholders understand the definition of the construct assessed and the score report information. The current study explored stakeholders' uses and interpretations of the score report of a workplace English language proficiency test, the TOEIC® Listening and Reading (TOEIC L&R) test. Online surveys were administered to TOEIC L&R test takers and institutional and corporate score users in Taiwan to collect data about their uses and interpretations of the test score report. Eleven survey respondents participated in follow-up interviews to further elaborate on their uses of the different score reporting information within the stakeholders' respective contexts. Results indicated that the participants used the TOEIC L&R test scores largely as intended by the test developer although some elements of the score report appeared to be less useful and could be confusing for stakeholders. Findings from this study highlight the importance of providing score reporting information with clarity and ease to enhance appropriate use and interpretation.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2023 1","pages":"1-21"},"PeriodicalIF":0.0,"publicationDate":"2023-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12373","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48765322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Culturally Responsive Personalized Learning: Recommendations for a Working Definition and Framework","authors":"Teresa M. Ober, Blair A. Lehman, Reginald Gooch, Olasumbo Oluwalana, Jaemarie Solyst, Geoffrey Phelps, Laura S. Hamilton","doi":"10.1002/ets2.12372","DOIUrl":"10.1002/ets2.12372","url":null,"abstract":"<p>Culturally responsive personalized learning (CRPL) emphasizes the importance of aligning personalized learning approaches with previous research on culturally responsive practices to consider social, cultural, and linguistic contexts for learning. In the present discussion, we briefly summarize two bodies of literature considered in defining and developing a framework for CRPL: technology-enabled personalized learning and culturally relevant, responsive, and sustaining pedagogy. We then provide a definition and framework consisting of six key principles of CRPL, along with a brief discussion of theories and empirical evidence to support these principles. These six principles include agency, dynamic adaptation, connection to lived experiences, consideration of social movements, opportunities for collaboration, and shared power. These principles fall into three domains: fostering flexible student-centered learning experiences, leveraging relevant content and practices, and supporting meaningful interactions within a community. Finally, we conclude with some implications of this framework for researchers, policymakers, and practitioners working to ensure that all students receive high-quality learning opportunities that are both personalized and culturally responsive.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2023 1","pages":"1-14"},"PeriodicalIF":0.0,"publicationDate":"2023-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12372","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47605083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Performance Tasks to Provide Feedback and Assess Progress in Teacher Preparation","authors":"Geoffrey Phelps, Devon Kinsey, Thomas Florek, Nathan Jones","doi":"10.1002/ets2.12371","DOIUrl":"10.1002/ets2.12371","url":null,"abstract":"<p>This report presents results from a survey of 64 elementary mathematics and reading language arts teacher educators providing feedback on a new type of short performance task. The performance tasks each present a brief teaching scenario and then require a short performance as if teaching actual students. Teacher educators participating in the study first reviewed six performance tasks, followed by a more in-depth review of two of the tasks. After reviewing the tasks, teacher educators completed an online survey providing input on the value of the tasks and on potential uses to support teacher preparation. The survey responses were positive with the majority of teacher educators supporting a variety of different uses of the performance tasks to support teacher preparation. The report concludes by proposing a larger theory for how the performance tasks can be used as both formative assessment tools to support teacher learning and summative assessments to guide decisions about candidates' readiness for the classroom.</p>","PeriodicalId":11972,"journal":{"name":"ETS Research Report Series","volume":"2023 1","pages":"1-44"},"PeriodicalIF":0.0,"publicationDate":"2023-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/ets2.12371","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49083835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}