Assessing students’ answers, and in particular natural language answers, is a crucial challenge in the field of education. Advances in machine learning, including transformer-based models such as Large Language Models (LLMs), have led to significant progress on various natural language tasks. Nevertheless, amid the growing trend of evaluating LLMs across diverse tasks, evaluating LLMs for automated answer assessment has received little attention. To address this gap, we explore the potential of using LLMs for the automated assessment of students’ short and open-ended answers. In particular, we use LLMs to compare students’ explanations with expert explanations in the context of line-by-line explanations of computer programs.
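To make the setup concrete, the sketch below shows one way such a comparison could be posed to an LLM as a few-shot, chain-of-thought grading prompt via the OpenAI Python client. The prompt wording, in-context examples, and model name are illustrative assumptions, not the exact prompts or models used in the study.

```python
# Hypothetical few-shot, chain-of-thought prompt for judging whether a student's
# line-by-line explanation matches an expert's. The examples and model name are
# placeholders for illustration, not the study's actual setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = """You grade whether a student's explanation of a line of code matches an expert's.
Think step by step, then end with a single line reading "Correct" or "Incorrect".

Code line: total += price * quantity
Expert: Adds the cost of the current item (price times quantity) to the running total.
Student: It increases total by the product of price and quantity.
Reasoning: The student captures both the multiplication and the accumulation. Correct.

Code line: for i in range(len(items)):
Expert: Iterates over the indices of the items list.
Student: It loops forever over the items.
Reasoning: The loop is bounded by len(items), so "forever" is wrong. Incorrect.
"""

def assess(code_line: str, expert: str, student: str) -> str:
    # Append the new case to the few-shot examples and let the model reason step by step.
    prompt = (FEW_SHOT +
              f"\nCode line: {code_line}\nExpert: {expert}\nStudent: {student}\nReasoning:")
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the study may use a different LLM
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(assess("return n * factorial(n - 1)",
             "Recursively multiplies n by the factorial of n - 1.",
             "It calls the function again with a smaller number and multiplies."))
```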

For comparison purposes, we assess both LLMs and encoder-based Semantic Textual Similarity (STS) models on the task of judging the correctness of students’ explanations of computer code. Our findings indicate that LLMs, when prompted in few-shot and chain-of-thought settings, perform comparably to fine-tuned encoder-based models in evaluating students’ short answers in the programming domain.
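For the encoder-based side of the comparison, a minimal STS-style check, assuming an off-the-shelf sentence-transformers model and an arbitrarily chosen similarity threshold, might look like the sketch below; it skips the fine-tuning step described in the study, and the model and threshold are assumptions rather than the values actually used.

```python
# Minimal sketch of an encoder-based STS check: embed the student and expert
# explanations and compare them with cosine similarity. The model name and the
# decision threshold are illustrative assumptions, not values from the study.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf STS encoder

def is_correct(student: str, expert: str, threshold: float = 0.7) -> bool:
    # Encode both explanations and score their semantic similarity in [-1, 1].
    embeddings = model.encode([student, expert], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    return similarity >= threshold

print(is_correct(
    "It increases total by the product of price and quantity.",
    "Adds the cost of the current item (price times quantity) to the running total.",
))
```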

The field of education faces a significant challenge in assessing students’ answers, particularly when it comes to evaluating natural language responses. Recent advances in machine learning, especially transformer-based models like Large Language Models (LLMs), have shown promising progress on various natural language tasks. Despite the growing trend of evaluating LLMs across diverse tasks, however, their potential in automated answer assessment has not been extensively explored.

To bridge this gap, this study focuses on leveraging LLMs for the automated assessment of students’ short and open-ended answers. Specifically, the researchers investigate the use of LLMs to compare students’ explanations with expert explanations in the context of line-by-line explanations of computer programs. This multi-disciplinary approach combines insights from education and natural language processing.

To provide a benchmark for comparison, the study also considers encoder-based Semantic Textual Similarity (STS) models alongside LLMs for assessing the correctness of students’ explanations of computer code. The experiments show that LLMs perform comparably to fine-tuned encoder-based models in evaluating students’ short answers in the programming domain, particularly when prompted in few-shot and chain-of-thought settings.
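Comparing the two approaches ultimately comes down to scoring their predicted correctness labels against human judgments. The sketch below illustrates this with standard scikit-learn metrics; the predictions and labels are invented placeholders, not results from the study.

```python
# Illustrative comparison of LLM-based and STS-based correctness predictions
# against human labels. All data here are placeholders for demonstration only.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = [1, 0, 1, 1, 0]      # 1 = correct explanation, 0 = incorrect
llm_predictions = [1, 0, 1, 0, 0]   # e.g., from a few-shot / chain-of-thought prompt
sts_predictions = [1, 0, 1, 1, 1]   # e.g., from a cosine-similarity threshold

print("LLM accuracy:", accuracy_score(human_labels, llm_predictions))
print("STS accuracy:", accuracy_score(human_labels, sts_predictions))
print("LLM vs. human kappa:", cohen_kappa_score(human_labels, llm_predictions))
```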

This research sheds light on the potential of LLMs in automating answer assessment, a task that has traditionally been labor-intensive for educators. The incorporation of LLMs and STS models in education provides an exciting avenue for developing intelligent tutoring systems and personalized feedback mechanisms. It also highlights the multi-disciplinary nature of this work, involving domains such as education, natural language processing, and machine learning.

Moving forward, there are several areas for further exploration. Firstly, investigating the generalizability of LLMs across different domains and subject areas would be valuable. Additionally, exploring the interpretability of LLMs’ assessments and providing explanations for their decisions could enhance transparency and build trust in their usage within educational settings. Lastly, addressing the challenges of bias and fairness in automated assessment systems should be a crucial consideration to ensure equitable evaluations for students from diverse backgrounds.

In conclusion, the study showcases the potential of LLMs in automating the evaluation of students’ short answers and offers insights into their comparable performance with encoder-based models in the programming domain. This research emphasizes the multi-disciplinary nature of leveraging machine learning in education and paves the way for further advancements in automated answer assessment, benefiting educators and students alike.