{"title":"Using Evidence-Centered Design to Support the Development of Culturally and Linguistically Sensitive Collaborative Problem-Solving Assessments","authors":"M. Oliveri, René Lawless, R. Mislevy","doi":"10.1080/15305058.2018.1543308","DOIUrl":"https://doi.org/10.1080/15305058.2018.1543308","url":null,"abstract":"Collaborative problem solving (CPS) ranks among the top five most critical skills necessary for college graduates to meet workforce demands (Hart Research Associates, 2015). It is also deemed a critical skill for educational success (Beaver, 2013). It thus deserves more prominence in the suite of courses and subjects assessed in K-16. Such inclusion, however, presents the need for improvements in the conceptualization, design, and analysis of CPS, which challenges us to think differently about assessing the skills than the current focus given to assessing individuals’ substantive knowledge. In this article, we discuss an Evidence-Centered Design approach to assess CPS in a culturally and linguistically diverse educational environment. We demonstrate ways to consider a sociocognitive perspective to conceptualize and model possible linguistic and/or cultural differences between populations along key stages of assessment development including assessment conceptualization and design to help reduce possible construct-irrelevant differences when assessing complex constructs with diverse populations.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2019-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1543308","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44350922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessment of University Students’ Critical Thinking: Next Generation Performance Assessment","authors":"R. Shavelson, O. Zlatkin‐Troitschanskaia, K. Beck, Susanne Schmidt, Julián P. Mariño","doi":"10.1080/15305058.2018.1543309","DOIUrl":"https://doi.org/10.1080/15305058.2018.1543309","url":null,"abstract":"Following employers’ criticisms and recent societal developments, policymakers and educators have called for students to develop a range of generic skills such as critical thinking (“twenty-first century skills”). So far, such skills have typically been assessed by student self-reports or with multiple-choice tests. An alternative approach is criterion-sampling measurement. This approach leads to developing performance assessments using “criterion” tasks, which are drawn from real-world situations in which students are being educated, both within and across academic or professional domains. One current project, iPAL (The international Performance Assessment of Learning), consolidates previous research and focuses on the next generation performance assessments. In this paper, we present iPAL’s assessment framework and show how it guides the development of such performance assessments, exemplify these assessments with a concrete task, and provide preliminary evidence of its reliability and validity, which allows us to draw initial implications for further test design and development.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2019-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1543309","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48194695","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Examination of Different Methods of Setting Cutoff Values in Person Fit Research","authors":"A. Mousavi, Ying Cui, Todd Rogers","doi":"10.1080/15305058.2018.1464010","DOIUrl":"https://doi.org/10.1080/15305058.2018.1464010","url":null,"abstract":"This simulation study evaluates four different methods of setting cutoff values for person fit assessment, including (a) using fixed cutoff values either from theoretical distributions of person fit statistics, or arbitrarily chosen by the researchers in the literature; (b) using the specific percentile rank of empirical sampling distribution of person fit statistics from simulated fitting responses; (c) using bootstrap method to estimate cutoff values of empirical sampling distribution of person fit statistics from simulated fitting responses; and (d) using the p-value methods to identify misfitting responses conditional on ability levels. The Snijders' (2001), as an index with known theoretical distribution, van der Flier's U3 (1982) and Sijtsma's HT coefficient (1986), as indices with unknown theoretical distribution, were chosen. According to the simulation results, different methods of setting cutoff values tend to produce different levels of Type I error and detection rates, indicating it is critical to select an appropriate method for setting cutoff values in person fit research.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2019-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1464010","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48532510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of the Relative Performance of Four IRT Models on Equating Passage-Based Tests","authors":"Kyung Yong Kim, Euijin Lim, Won‐Chan Lee","doi":"10.1080/15305058.2018.1530239","DOIUrl":"https://doi.org/10.1080/15305058.2018.1530239","url":null,"abstract":"For passage-based tests, items that belong to a common passage often violate the local independence assumption of unidimensional item response theory (UIRT). In this case, ignoring local item dependence (LID) and estimating item parameters using a UIRT model could be problematic because doing so might result in inaccurate parameter estimates, which, in turn, could impact the results of equating. Under the random groups design, the main purpose of this article was to compare the relative performance of the three-parameter logistic (3PL), graded response (GR), bifactor, and testlet models on equating passage-based tests when various degrees of LID were present due to passage. Simulation results showed that the testlet model produced the most accurate equating results, followed by the bifactor model. The 3PL model worked as well as the bifactor and testlet models when the degree of LID was low but returned less accurate equating results than the two multidimensional models as the degree of LID increased. Among the four models, the polytomous GR model provided the least accurate equating results.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2018-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1530239","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46453114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Test Instructions Do Not Moderate the Indirect Effect of Perceived Test Importance on Test Performance in Low-Stakes Testing Contexts","authors":"S. Finney, Aaron J. Myers, C. Mathers","doi":"10.1080/15305058.2017.1396466","DOIUrl":"https://doi.org/10.1080/15305058.2017.1396466","url":null,"abstract":"Assessment specialists expend a great deal of energy to promote valid inferences from test scores gathered in low-stakes testing contexts. Given the indirect effect of perceived test importance on test performance via examinee effort, assessment practitioners have manipulated test instructions with the goal of increasing perceived test importance. Importantly, no studies have investigated the impact of test instructions on this indirect effect. In the current study, students were randomly assigned to one of three test instruction conditions intended to increase test relevance while keeping the test low-stakes to examinees. Test instructions did not impact average perceived test importance, examinee effort, or test performance. Furthermore, the indirect relationship between importance and performance via effort was not moderated by instructions. Thus, the effect of perceived test importance on test scores via expended effort appears consistent across different messages regarding the personal relevance of the test to examinees. The main implication for testing practice is that the effect of instructions may be negligible when reflective of authentic low-stakes test score use. Future studies should focus on uncovering instructions that increase the value of performance to the examinee yet remain truthful regarding score use.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2017.1396466","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49293024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Reliability of the Sentence Verification Technique","authors":"Amanda M Marcotte, Francis Rick, C. Wells","doi":"10.1080/15305058.2018.1497636","DOIUrl":"https://doi.org/10.1080/15305058.2018.1497636","url":null,"abstract":"Reading comprehension plays an important role in achievement for all academic domains. The purpose of this study is to describe the sentence verification technique (SVT) (Royer, Hastings, & Hook, 1979) as an alternative method of assessing reading comprehension, which can be used with a variety of texts and across diverse populations and educational contexts. Additionally, this study adds a unique contribution to the extant literature on the SVT through an investigation of the precision of the instrument across proficiency levels. Data were gathered from a sample of 464 fourth-grade students from the Northeast region of the United States. Reliability was estimated using one, two, three, and four passage test forms. Two or three passages provided sufficient reliability. The conditional reliability analyses revealed that the SVT test scores were reliable for readers with average to below average proficiency, but did not provide reliable information for students who were very poor or strong readers.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2018-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1497636","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45868181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Item Parameter Drift in Context Questionnaires from International Large-Scale Assessments","authors":"HyeSun Lee, K. Geisinger","doi":"10.1080/15305058.2018.1481852","DOIUrl":"https://doi.org/10.1080/15305058.2018.1481852","url":null,"abstract":"The purpose of the current study was to examine the impact of item parameter drift (IPD) occurring in context questionnaires from an international large-scale assessment and determine the most appropriate way to address IPD. Focusing on the context of psychometric and educational research where scores from context questionnaires composed of polytomous items were employed for the classification of examinees, the current research investigated the impacts of IPD on the estimation of questionnaire scores and classification accuracy with five manipulated factors: the length of a questionnaire, the proportion of items exhibiting IPD, the direction and magnitude of IPD, and three decisions about IPD. The results indicated that the impact of IPD occurring in a short context questionnaire on the accuracy of score estimation and classification of examinees was substantial. The accuracy in classification considerably decreased especially at the lowest and highest categories of a trait. Unlike the recommendation from literature in educational testing, the current study demonstrated that keeping items exhibiting IPD and removing them only for transformation were appropriate when IPD occurred in relatively short context questionnaires. Using 2011 TIMSS data from Iran, an applied example demonstrated the application of provided guidance in making appropriate decisions about IPD.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2018-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1481852","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42801965","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Investigating the Comparability of Examination Difficulty Using Comparative Judgement and Rasch Modelling","authors":"Stephen D. Holmes, M. Meadows, I. Stockford, Qingping He","doi":"10.1080/15305058.2018.1486316","DOIUrl":"https://doi.org/10.1080/15305058.2018.1486316","url":null,"abstract":"The relationship of expected and actual difficulty of items on six mathematics question papers designed for 16-year olds in England was investigated through paired comparison using experts and testing with students. A variant of the Rasch model was applied to the comparison data to establish a scale of expected difficulty. In testing, the papers were taken by 2933 students using an equivalent-groups design, allowing the actual difficulty of the items to be placed on the same measurement scale. It was found that the expected difficulty derived using the comparative judgement approach and the actual difficulty derived from the test data was reasonably strongly correlated. This suggests that comparative judgement may be an effective way to investigate the comparability of difficulty of examinations. The approach could potentially be used as a proxy for pretesting high-stakes tests in situations where pretesting is not feasible due to reasons of security or other risks.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2018-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1486316","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45405533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Job Analysis Data Using Mixture Rasch Models","authors":"Adam E. Wyse","doi":"10.1080/15305058.2018.1481853","DOIUrl":"https://doi.org/10.1080/15305058.2018.1481853","url":null,"abstract":"An important piece of validity evidence to support the use of credentialing exams comes from performing a job analysis of the profession. One common job analysis method is the task inventory method, where people working in the field are surveyed using rating scales about the tasks thought necessary to safely and competently perform the job. This article describes how mixture Rasch models can be used to analyze these data, and how results from these analyses can help to identify whether different groups of people may be responding to job tasks differently. Three examples from different credentialing programs illustrate scenarios that can be found when applying mixture Rasch models to job analysis data. Discussion of what these results may imply for the development of credentialing exams and other analyses of job analysis data is provided.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2018-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2018.1481853","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47874147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Polytomous Model of Cognitive Diagnostic Assessment for Graded Data","authors":"Dongbo Tu, Chanjin Zheng, Yan Cai, Xuliang Gao, Daxun Wang","doi":"10.1080/15305058.2017.1396465","DOIUrl":"https://doi.org/10.1080/15305058.2017.1396465","url":null,"abstract":"Pursuing the line of the difference models in IRT (Thissen & Steinberg, 1986), this article proposed a new cognitive diagnostic model for graded/polytomous data based on the deterministic input, noisy, and gate (Haertel, 1989; Junker & Sijtsma, 2001), which is named the DINA model for graded data (DINA-GD). We investigated the performance of a full Bayesian estimation of the proposed model. In the simulation, the classification accuracy and item recovery for the DINA-GD model were investigated. The results indicated that the proposed model had acceptable examinees' correct attribute classification rate and item parameter recovery. In addition, a real-data example was used to illustrate the application of this new model with the graded data or polytomously scored items.","PeriodicalId":46615,"journal":{"name":"International Journal of Testing","volume":null,"pages":null},"PeriodicalIF":1.7,"publicationDate":"2018-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1080/15305058.2017.1396465","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49274990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}