Complementary strengths? Evaluation of a hybrid human-machine scoring approach for a test of oral academic English
Authors: Larry Davis, S. Papageorgiou
Journal: Assessment in Education: Principles, Policy & Practice, pp. 437-455 (Journal Article)
Published: 2021-07-04
DOI: https://doi.org/10.1080/0969594X.2021.1979466
Impact factor: 2.7 (JCR Q1, Education & Educational Research)
Open access: no
Citations: 4
Abstract
Human raters and machine scoring systems potentially have complementary strengths in evaluating language ability; specifically, it has been suggested that automated systems might be used to make consistent measurements of specific linguistic phenomena, whilst humans evaluate more global aspects of performance. We report on an empirical study that explored the possibility of combining human and machine scores using responses from the speaking section of the TOEFL iBT® test. Human raters awarded scores for three sub-constructs: delivery, language use and topic development. The SpeechRater℠ automated scoring system produced scores for delivery and language use. Composite scores computed from three different combinations of human and automated analytic scores were equally or more reliable than human holistic scores, probably due to the inclusion of multiple observations in composite scores. However, composite scores calculated solely from human analytic scores showed the highest reliability, and reliability steadily decreased as more machine scores replaced human scores.
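The abstract attributes the higher reliability of composite scores to the inclusion of multiple observations. As a rough illustration of that general principle only, the sketch below applies the classical Spearman-Brown formula, which projects the reliability of a composite of k parallel measurements. This is not the authors' analysis: the function name, the assumption of parallel (equally reliable) measurements, and the single-score reliability of 0.70 are all hypothetical choices made for the example.

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Projected reliability of a composite of k parallel measurements,
    each with reliability r_single (classical Spearman-Brown formula)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * r_single / (1 + (k - 1) * r_single)


# Hypothetical reliability of one analytic score; not a value from the study.
single_score_reliability = 0.70

for k in (1, 2, 3):
    projected = spearman_brown(single_score_reliability, k)
    print(f"{k} observation(s): projected reliability = {projected:.3f}")
```

Under these assumptions the projected reliability rises as observations are added, which matches the direction of the explanation in the abstract. Real analytic scores are unlikely to be strictly parallel, so the actual gain may differ.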
About the journal:
Recent decades have witnessed significant developments in the field of educational assessment. New approaches to the assessment of student achievement have been complemented by the increasing prominence of educational assessment as a policy issue. In particular, there has been a growth of interest in modes of assessment that promote, as well as measure, standards and quality. These have profound implications for individual learners, institutions and the educational system itself. Assessment in Education provides a focus for scholarly output in the field of assessment. The journal is explicitly international in focus and encourages contributions from a wide range of assessment systems and cultures. The journal's intention is to explore both commonalities and differences in policy and practice.