Evaluating Language Models for Generating and Judging Programming Feedback

Koutcheme, Charles; Dainese, Nicola; Hellas, Arto; Sarsa, Sami; Leinonen, Juho; Ashraf, Syed; Denny, Paul

arXiv:2407.04873 (cs)

[Submitted on 5 Jul 2024 (v1), last revised 22 Nov 2024 (this version, v2)]

Abstract: The emergence of large language models (LLMs) has transformed research and practice across a wide range of domains. Within the computing education research (CER) domain, LLMs have garnered significant attention, particularly in the context of learning programming. Much of the work on LLMs in CER, however, has focused on applying and evaluating proprietary models. In this article, we evaluate the efficiency of open-source LLMs in generating high-quality feedback for programming assignments and judging the quality of programming feedback, contrasting the results with proprietary models. Our evaluations on a dataset of students' submissions to introductory Python programming exercises suggest that state-of-the-art open-source LLMs are nearly on par with proprietary models in both generating and assessing programming feedback. Additionally, we demonstrate the efficiency of smaller LLMs in these tasks and highlight the wide range of LLMs accessible, even for free, to educators and practitioners.

Comments:	2 tables. Accepted for SIGCSE TS 2025
Subjects:	Artificial Intelligence (cs.AI); Computers and Society (cs.CY)
Cite as:	arXiv:2407.04873 [cs.AI]
	(or arXiv:2407.04873v2 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2407.04873

Submission history

From: Juho Leinonen
[v1] Fri, 5 Jul 2024 21:44:11 UTC (198 KB)
[v2] Fri, 22 Nov 2024 01:13:13 UTC (85 KB)