King Yiu Suen, Victoria Yaneva, L. Ha, Janet Mee, Yiyun Zhou, Polina Harik
{"title":"ACTA: Short-Answer Grading in High-Stakes Medical Exams","authors":"King Yiu Suen, Victoria Yaneva, L. Ha, Janet Mee, Yiyun Zhou, Polina Harik","doi":"10.18653/v1/2023.bea-1.36","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.36","url":null,"abstract":"This paper presents the ACTA system, which performs automated short-answer grading in the domain of high-stakes medical exams. The system builds upon previous work on neural similarity-based grading approaches by applying these to the medical domain and utilizing contrastive learning as a means to optimize the similarity metric. ACTA is evaluated against three strong baselines and is developed in alignment with operational needs, where low-confidence responses are flagged for human review. Learning curves are explored to understand the effects of training data on performance. The results demonstrate that ACTA leads to substantially lower number of responses being flagged for human review, while maintaining high classification accuracy.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121107503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beata Beigman Klebanov, Mike Suhan, Zuowei Wang, T. O’Reilly
{"title":"A dynamic model of lexical experience for tracking of oral reading fluency","authors":"Beata Beigman Klebanov, Mike Suhan, Zuowei Wang, T. O’Reilly","doi":"10.18653/v1/2023.bea-1.48","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.48","url":null,"abstract":"We present research aimed at solving a problem in assessment of oral reading fluency using children’s oral reading data from our online book reading app. It is known that properties of the passage being read aloud impact fluency estimates; therefore, passage-based measures are used to remove passage-related variance when estimating growth in oral reading fluency. However, passage-based measures reported in the literature tend to treat passages as independent events, without explicitly modeling accumulation of lexical experience as one reads through a book. We propose such a model and show that it helps explain additional variance in the measurements of children’s fluency as they read through a book, improving over a strong baseline. These results have implications for measuring growth in oral reading fluency.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121679672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconciling Adaptivity and Task Orientation in the Student Dashboard of an Intelligent Language Tutoring System","authors":"Leona Colling, Tanja Heck, Walt Detmar Meurers","doi":"10.18653/v1/2023.bea-1.25","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.25","url":null,"abstract":"In intelligent language tutoring systems, student dashboards should display the learning progress and performance and support the navigation through the learning content. Designing an interface that transparently offers information on students’ learning in relation to specific learning targets while linking to the overarching functional goal, that motivates and organizes the practice in current foreign language teaching, is challenging.This becomes even more difficult in systems that adaptively expose students to different learning material and individualize system interactions. If such a system is used in an ecologically valid setting of blended learning, this generates additional requirements to incorporate the needs of students and teachers for control and customizability.We present the conceptual design of a student dashboard for a task-based, user-adaptive intelligent language tutoring system intended for use in real-life English classes in secondary schools. We highlight the key challenges and spell out open questions for future research.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129308751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transformer-based Hebrew NLP models for Short Answer Scoring in Biology","authors":"Abigail Gurin Schleifer, Beata Beigman Klebanov, Moriah Ariely, Giora Alexandron","doi":"10.18653/v1/2023.bea-1.46","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.46","url":null,"abstract":"Pre-trained large language models (PLMs) are adaptable to a wide range of downstream tasks by fine-tuning their rich contextual embeddings to the task, often without requiring much task-specific data. In this paper, we explore the use of a recently developed Hebrew PLM aleph-BERT for automated short answer grading of high school biology items. We show that the alephBERT-based system outperforms a strong CNN-based baseline, and that it general-izes unexpectedly well in a zero-shot paradigm to items on an unseen topic that address the same underlying biological concepts, opening up the possibility of automatically assessing new items without item-specific fine-tuning.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121124498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grammatical Error Correction for Sentence-level Assessment in Language Learning","authors":"Anisia Katinskaia, R. Yangarber","doi":"10.18653/v1/2023.bea-1.41","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.41","url":null,"abstract":"The paper presents experiments on using a Grammatical Error Correction (GEC) model to assess the correctness of answers that language learners give to grammar exercises. We explored whether a GEC model can be applied in the language learning context for a language with complex morphology. We empirically check a hypothesis that a GEC model corrects only errors and leaves correct answers unchanged. We perform a test on assessing learner answers in a real but constrained language-learning setup: the learners answer only fill-in-the-blank and multiple-choice exercises. For this purpose, we use ReLCo, a publicly available manually annotated learner dataset in Russian (Katinskaia et al., 2022). In this experiment, we fine-tune a large-scale T5 language model for the GEC task and estimate its performance on the RULEC-GEC dataset (Rozovskaya and Roth, 2019) to compare with top-performing models. We also release an updated version of the RULEC-GEC test set, manually checked by native speakers. Our analysis shows that the GEC model performs reasonably well in detecting erroneous answers to grammar exercises and potentially can be used for best-performing error types in a real learning setup. However, it struggles to assess answers which were tagged by human annotators as alternative-correct using the aforementioned hypothesis. This is in large part due to a still low recall in correcting errors, and the fact that the GEC model may modify even correct words—it may generate plausible alternatives, which are hard to evaluate against the gold-standard reference.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130765091","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Erfan Al-Hossami, Razvan C. Bunescu, Ryan Teehan, Laurel Powell, Khyati Mahajan, Mohsen Dorodchi
{"title":"Socratic Questioning of Novice Debuggers: A Benchmark Dataset and Preliminary Evaluations","authors":"Erfan Al-Hossami, Razvan C. Bunescu, Ryan Teehan, Laurel Powell, Khyati Mahajan, Mohsen Dorodchi","doi":"10.18653/v1/2023.bea-1.57","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.57","url":null,"abstract":"Socratic questioning is a teaching strategy where the student is guided towards solving a problem on their own, instead of being given the solution directly. In this paper, we introduce a dataset of Socratic conversations where an instructor helps a novice programmer fix buggy solutions to simple computational problems. The dataset is then used for benchmarking the Socratic debugging abilities of GPT-based language models. While GPT-4 is observed to perform much better than GPT-3.5, its precision, and recall still fall short of human expert abilities, motivating further work in this area.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128847849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Explainable Automated Scoring for Open-Ended Constructed Response Math Word Problems","authors":"Scott Hellman, Alejandro Andrade, Kyle Habermehl","doi":"10.18653/v1/2023.bea-1.12","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.12","url":null,"abstract":"Open-ended constructed response math word problems (“math plus text”, or MPT) are a powerful tool in the assessment of students’ abilities to engage in mathematical reasoning and creative thinking. Such problems ask the student to compute a value or construct an expression and then explain, potentially in prose, what steps they took and why they took them. MPT items can be scored against highly structured rubrics, and we develop a novel technique for the automated scoring of MPT items that leverages these rubrics to provide explainable scoring. We show that our approach can be trained automatically and performs well on a large dataset of 34,417 responses across 14 MPT items.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120900867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. M. Perkoff, A. Bhattacharyya, Jon Z. Cai, Jie Cao
{"title":"Comparing Neural Question Generation Architectures for Reading Comprehension","authors":"E. M. Perkoff, A. Bhattacharyya, Jon Z. Cai, Jie Cao","doi":"10.18653/v1/2023.bea-1.47","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.47","url":null,"abstract":"In recent decades, there has been a significant push to leverage technology to aid both teachers and students in the classroom. Language processing advancements have been harnessed to provide better tutoring services, automated feedback to teachers, improved peer-to-peer feedback mechanisms, and measures of student comprehension for reading. Automated question generation systems have the potential to significantly reduce teachers’ workload in the latter. In this paper, we compare three differ- ent neural architectures for question generation across two types of reading material: narratives and textbooks. For each architecture, we explore the benefits of including question attributes in the input representation. Our models show that a T5 architecture has the best overall performance, with a RougeL score of 0.536 on a narrative corpus and 0.316 on a textbook corpus. We break down the results by attribute and discover that the attribute can improve the quality of some types of generated questions, including Action and Character, but this is not true for all models.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122146549","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automated evaluation of written discourse coherence using GPT-4","authors":"Ben Naismith, Phoebe Mulcaire, J. Burstein","doi":"10.18653/v1/2023.bea-1.32","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.32","url":null,"abstract":"The popularization of large language models (LLMs) such as OpenAI’s GPT-3 and GPT-4 have led to numerous innovations in the field of AI in education. With respect to automated writing evaluation (AWE), LLMs have reduced challenges associated with assessing writing quality characteristics that are difficult to identify automatically, such as discourse coherence. In addition, LLMs can provide rationales for their evaluations (ratings) which increases score interpretability and transparency. This paper investigates one approach to producing ratings by training GPT-4 to assess discourse coherence in a manner consistent with expert human raters. The findings of the study suggest that GPT-4 has strong potential to produce discourse coherence ratings that are comparable to human ratings, accompanied by clear rationales. Furthermore, the GPT-4 ratings outperform traditional NLP coherence metrics with respect to agreement with human ratings. These results have implications for advancing AWE technology for learning and assessment.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128709967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Justin Vasselli, Christopher Vasselli, Adam Nohejl, Taro Watanabe
{"title":"NAISTeacher: A Prompt and Rerank Approach to Generating Teacher Utterances in Educational Dialogues","authors":"Justin Vasselli, Christopher Vasselli, Adam Nohejl, Taro Watanabe","doi":"10.18653/v1/2023.bea-1.63","DOIUrl":"https://doi.org/10.18653/v1/2023.bea-1.63","url":null,"abstract":"This paper presents our approach to the BEA 2023 shared task of generating teacher responses in educational dialogues, using the Teacher-Student Chatroom Corpus. Our system prompts GPT-3.5-turbo to generate initial suggestions, which are then subjected to reranking. We explore multiple strategies for candidate generation, including prompting for multiple candidates and employing iterative few-shot prompts with negative examples. We aggregate all candidate responses and rerank them based on DialogRPT scores. To handle consecutive turns in the dialogue data, we divide the task of generating teacher utterances into two components: teacher replies to the student and teacher continuations of previously sent messages. Through our proposed methodology, our system achieved the top score on both automated metrics and human evaluation, surpassing the reference human teachers on the latter.","PeriodicalId":363390,"journal":{"name":"Workshop on Innovative Use of NLP for Building Educational Applications","volume":"202 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124287535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}