Predicting Short Response Ratings with Non-Content Related Features: A Hierarchical Modeling Approach

Aubrey Condor
arXiv - STAT - Other Statistics, 2024-05-14. DOI: arxiv-2405.08574
We explore whether human ratings of open-ended responses can be explained
by non-content-related features, and whether such effects vary across different
mathematics-related items. When scoring is rigorously defined and rooted in a
measurement framework, educators intend that the features of a response
indicative of the respondent's level of ability are the ones contributing to
scores. However, we find that features such as response length, a grammar score
of the response, and a metric relating to key-phrase frequency are significant
predictors of response ratings. Although our findings are not causally
conclusive, they may propel us to be more critical of the way in which we assess
open-ended responses, especially in high-stakes scenarios. Educators take great
care to provide unbiased, consistent ratings, but extraneous features unrelated
to those intended to be rated may nonetheless be influencing the scores.
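The hierarchical modeling approach named in the title can be sketched as a mixed-effects regression: ratings are predicted from non-content features (response length, grammar score, key-phrase frequency), with random intercepts per item to capture item-level variation. This is an illustrative sketch only, not the paper's actual code; the feature names, coefficients, and simulated data below are assumptions for demonstration.

```python
# Illustrative sketch of a hierarchical (mixed-effects) model relating
# response ratings to non-content features, with random intercepts per item.
# All feature names and data here are synthetic, not the paper's dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_items, n_per_item = 5, 40
n = n_items * n_per_item

df = pd.DataFrame({
    "item": np.repeat(np.arange(n_items), n_per_item),  # grouping variable
    "length": rng.normal(50.0, 15.0, n),                # response length (words)
    "grammar": rng.uniform(0.0, 1.0, n),                # grammar score in [0, 1]
    "keyphrase": rng.poisson(2, n).astype(float),       # key-phrase count
})

# Simulate ratings that depend on the non-content features plus an
# item-specific intercept, mimicking the effect the abstract describes.
item_effect = rng.normal(0.0, 0.5, n_items)[df["item"]]
df["rating"] = (0.02 * df["length"] + 1.0 * df["grammar"]
                + 0.3 * df["keyphrase"] + item_effect
                + rng.normal(0.0, 0.5, n))

# Random-intercept model: rating ~ features, grouped by item.
model = smf.mixedlm("rating ~ length + grammar + keyphrase",
                    df, groups=df["item"]).fit()
print(model.summary())
```

If the fixed-effect coefficients on `length`, `grammar`, and `keyphrase` come out significant here, that is the same signature the abstract reports: variance in ratings explained by features unrelated to response content, even after accounting for between-item differences.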