{"title":"Assessing the Assessment: Rubrics Training for Pre-Service and New In-Service Teachers.","authors":"Michael G. Lovorn, A. Rezaei","doi":"10.7275/SJT6-5K13","DOIUrl":"https://doi.org/10.7275/SJT6-5K13","url":null,"abstract":"","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74739725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Best Practices in Using Large, Complex Samples: The Importance of Using Appropriate Weights and Design Effect Compensation.","authors":"J. Osborne","doi":"10.7275/2KYG-M659","DOIUrl":"https://doi.org/10.7275/2KYG-M659","url":null,"abstract":"Large surveys often use probability sampling in order to obtain representative samples, and these data sets are valuable tools for researchers in all areas of science. Yet many researchers are not formally prepared to appropriately utilize these resources. Indeed, users of one popular dataset were generally found not to have modeled the analyses to take account of the complex sample (Johnson & Elliott, 1998) even when publishing in highly-regarded journals. It is well known that failure to appropriately model the complex sample can substantially bias the results of the analysis. Examples presented in this paper highlight the risk of error of inference and mis-estimation of parameters from failure to analyze these data sets appropriately.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79354428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Graphical Transition Table for Communicating Status and Growth.","authors":"Adam E. Wyse, Ji Zeng, Joseph A. Martineau","doi":"10.7275/T9R9-D719","DOIUrl":"https://doi.org/10.7275/T9R9-D719","url":null,"abstract":"This paper introduces a simple and intuitive graphical display for transition table based accountability models that can be used to communicate information about students’ status and growth simultaneously. This graphical transition table includes the use of shading to convey year to year transitions and different sized letters for performance categories to depict yearly status. Examples based on Michigan’s transition table used on their Michigan Educational Assessment Program (MEAP) assessments are provided to illustrate the utility of the graphical transition table in practical contexts. Additional potential applications of the graphical transition table are also suggested.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81305010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Too Reliable to Be True? Response Bias as a Potential Source of Inflation in Paper-and-Pencil Questionnaire Reliability.","authors":"Eyal Péer, Eyal Gamliel","doi":"10.7275/E482-N724","DOIUrl":"https://doi.org/10.7275/E482-N724","url":null,"abstract":"When respondents answer paper-and-pencil (PP) questionnaires, they sometimes modify their responses to correspond to previously answered items. As a result, this response bias might artificially inflate the reliability of PP questionnaires. We compared the internal consistency of PP questionnaires to computerized questionnaires that presented a different number of items on a computer screen simultaneously. Study 1 showed that a PP questionnaire’s internal consistency was higher than that of the same questionnaire presented on a computer screen with one, two or four questions per screen. Study 2 replicated these findings to show that internal consistency was also relatively high when all questions were shown on one screen. This suggests that the differences found in Study 1 were not due to the difference in presentation medium. Thus, this paper suggests that reliability measures of PP questionnaires might be inflated because of a response bias resulting from participants cross-checking their answers against ones given to previous questions.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85264898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is a Picture Is Worth a Thousand Words? Creating Effective Questionnaires with Pictures.","authors":"Laura Reynolds-Keefer, Robert Johnson","doi":"10.7275/BGPE-A067","DOIUrl":"https://doi.org/10.7275/BGPE-A067","url":null,"abstract":"In developing attitudinal instruments for young children, researchers, program evaluators, and clinicians often use response scales with pictures or images (e.g., smiley faces) as anchors. This article considers highlights connections between word-based and picture based Likert scales and highlights the value in translating conventions used in word-based Likert scales to those with pictures or images.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87150928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying Tests of Equivalence for Multiple Group Comparisons: Demonstration of the Confidence Interval Approach.","authors":"Shayna A. Rusticus, C. Lovato","doi":"10.7275/D5WF-5P77","DOIUrl":"https://doi.org/10.7275/D5WF-5P77","url":null,"abstract":"Assessing the comparability of different groups is an issue facing many researchers and evaluators in a variety of settings. Commonly, null hypothesis significance testing (NHST) is incorrectly used to demonstrate comparability when a non-significant result is found. This is problematic because a failure to find a difference between groups is not equivalent to showing that the groups are comparable. This paper provides a comparison of the confidence interval approach to equivalency testing and the more traditional analysis of variance (ANOVA) method using both continuous and rating scale data from three geographically separate medical education teaching sites. Equivalency testing is recommended as a better alternative to demonstrating comparability through its examination of whether mean differences between two groups are small enough that these differences can be considered practically unimportant and thus, the groups can be treated as equivalent.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85520336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the Quantity-Quality Trade-off in the Selection of Anchor Items: a Vertical Scaling Approach","authors":"Florian Pibal, H. Cesnik","doi":"10.7275/NNCY-EW26","DOIUrl":"https://doi.org/10.7275/NNCY-EW26","url":null,"abstract":"When administering tests across grades, vertical scaling is often employed to place scores from different tests on a common overall scale so that test-takers’ progress can be tracked. In order to be able to link the results across grades, however, common items are needed that are included in both test forms. In the literature there seems to be no clear agreement about the ideal number of common items. In line with some scholars, we argue that a greater number of anchor items bear a higher risk of unwanted effects like displacement, item drift, or undesired fit statistics and that having fewer psychometrically well-functioning anchor items can sometimes be more desirable. In order to demonstrate this, a study was conducted that included the administration of a reading-comprehension test to 1,350 test-takers across grades 6 to 8. In employing a step-by-step approach, we found that the paradox of high item drift in test administrations across grades can be mitigated and eventually even be eliminated. At the same time, a positive side effect was an increase in the explanatory power of the empirical data. Moreover, it was found that scaling adjustment can be used to evaluate the effectiveness of a vertical scaling approach and, in certain cases, can lead to more accurate results than the use of calibrated anchor items.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91005580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do more online instructional ratings lead to better prediction of instructor quality","authors":"S. Sanders, Bhavneet Walia, Joel Potter, Kenneth W. Linna","doi":"10.7275/NHNN-1N13","DOIUrl":"https://doi.org/10.7275/NHNN-1N13","url":null,"abstract":"Online instructional ratings are taken by many with a grain of salt. This study analyzes the ability of said ratings to estimate the official (university-administered) instructional ratings of the same respective university instructors. Given self-selection among raters, we further test whether more online ratings of instructors lead to better prediction of official ratings in terms of both R-squared value and root mean squared error. We lastly test and correct for heteroskedastic error terms in the regression analysis to allow for the first robust estimations on the topic. Despite having a starkly different distribution of values, online ratings explain much of the variation in official ratings. This conclusion strengthens, and root mean squared error typically falls, as one considers regression subsets over which instructors have a larger number of online ratings. Though (public) online ratings do not mimic the results of (semi-private) official ratings, they provide a reliable source of information for predicting official ratings. There is strong evidence that this reliability increases in online rating usage.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85329773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Termination Criteria for Computerized Classification Testing.","authors":"Nathan A. Thompson","doi":"10.7275/WQ8M-ZK25","DOIUrl":"https://doi.org/10.7275/WQ8M-ZK25","url":null,"abstract":"Computerized classification testing (CCT) is an approach to designing tests with intelligent algorithms, similar to adaptive testing, but specifically designed for the purpose of classifying examinees into categories such as “pass” and “fail.” Like adaptive testing for point estimation of ability, the key component is the termination criterion, namely the algorithm that decides whether to classify the examinee and end the test or to continue and administer another item. This paper applies a newly suggested termination criterion, the generalized likelihood ratio (GLR), to CCT. It also explores the role of the indifference region in the specification of likelihood-ratio based termination criteria, comparing the GLR to the sequential probability ratio test. Results from simulation studies suggest that the GLR is always at least as efficient as existing methods.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74696461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FORMATIVE USE OF ASSESSMENT INFORMATION: IT'S A PROCESS, SO LET'S SAY WHAT WE MEAN","authors":"Robert Good","doi":"10.7275/3YVY-AT83","DOIUrl":"https://doi.org/10.7275/3YVY-AT83","url":null,"abstract":"The term formative assessment is often used to describe a type of assessment. The purpose of this paper is to challenge the use of this phrase given that formative assessment as a noun phrase ignores the well-established understanding that it is a process more than an object. A model that combines content, context, and strategies is presented as one way to view the process nature of assessing formatively. The alternate phrase formative use of assessment information is suggested as a more appropriate way to describe how content, context, and strategies can be used together in order to close the gap between where a student is performing currently and the intended learning goal. Let’s start with an elementary grammar review: adjectives modify nouns; adverbs modify verbs, adjectives, and other adverbs. Applied to recent assessment literature, the term formative assessment would therefore contain the adjective formative modifying the noun assessment, creating a noun phrase representing a thing or object. Indeed, formative assessment as a noun phrase is regularly juxtaposed to summative assessment in both purpose and timing. Formative assessment is commonly understood to occur during instruction with the intent to identify relative strengths and weaknesses and guide instruction, while summative assessment occurs after a unit of instruction with the intent of measuring performance levels of the skills and content related to the unit of instruction (Stiggins, Arter, Chappuis, & Chappuis, 2006). Distinguishing formative and summative assessments in this manner may have served an important introductory purpose, however using formative as a descriptor of a type of assessment has had ramifi cations that merit critical consideration. Given that formative assessment has received considerable attention in the literature over the last 20 or so years, this article contends that it is time to move beyond the well-established broad distinctions between formative and summative assessments and consider the subtle – yet important – distinction between the term formative assessment as an object and the intended meaning. The focus here is to suggest that if we want to realize the true potential of formative practices in our classrooms, then we need to start saying what we mean.","PeriodicalId":20361,"journal":{"name":"Practical Assessment, Research and Evaluation","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2011-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75781333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}