Automatic Short-Answer Grading in Sustainability Education: AI-Human Agreement


Emirtekin E., Özarslan Y.

JOURNAL OF COMPUTER ASSISTED LEARNING, vol. 42, no. 1, 2026 (SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 42 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1002/jcal.70160
  • Journal Name: JOURNAL OF COMPUTER ASSISTED LEARNING
  • Indexed in: Social Sciences Citation Index (SSCI), Scopus, CINAHL, Education Abstracts, Educational Research Abstracts (ERA), ERIC (Education Resources Information Center), INSPEC, PsycINFO
  • Affiliated with Manisa Celal Bayar Üniversitesi: Yes

Abstract

Background: Sustainability education emphasises critical thinking and interdisciplinary understanding, which makes assessing students' learning outcomes complex. While Large Language Models (LLMs) have shown promise in educational assessment, their reliability in domains that require contextual reasoning, such as sustainability, remains unclear.

Objectives: This study evaluates the agreement between human raters and several LLMs (GPT-4o, Gemini 2.0 Flash, DeepSeek V3, LLaMA 3.3) in scoring short-answer responses from a university-level Sustainability course. It also investigates how this agreement varies across cognitive skill levels.

Methods: A total of 232 short-answer responses were evaluated using a rubric aligned with Bloom's Revised Taxonomy. Consensus scores from human raters were compared with LLM-generated scores using multiple statistical measures, including Quadratic Weighted Kappa (QWK), the Intraclass Correlation Coefficient (ICC), Pearson correlation, and distributional overlap.

Results: Agreement between LLMs and human raters on total scores was moderate (QWK: 0.585-0.640; r: 0.660-0.668; η: 0.681-0.803). Inter-rater reliability among human raters was good to excellent (ICC: 0.667-0.800). Criterion-level agreement declined as cognitive complexity increased, with notably low agreement when evaluating higher-order skills.

Conclusions: Overall, LLM-human agreement was moderate on total scores but declined at higher cognitive levels, indicating that LLMs are suitable for basic comprehension checks, while human oversight remains necessary for assessing complex reasoning.
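
A minimal sketch of how agreement statistics of the kind reported above (QWK and Pearson r between human consensus and LLM scores) can be computed in Python; the score arrays, the 0-4 rubric scale, and the library choices are illustrative assumptions, not the authors' actual analysis pipeline.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-4 rubric scores for the same ten responses (illustrative only).
human_consensus = np.array([4, 3, 2, 4, 1, 3, 2, 0, 4, 3])
llm_scores = np.array([4, 2, 2, 3, 1, 3, 1, 0, 4, 4])

# Quadratic Weighted Kappa: chance-corrected agreement that penalises
# larger score discrepancies more heavily than small ones.
qwk = cohen_kappa_score(human_consensus, llm_scores, weights="quadratic")

# Pearson correlation between the two score series.
r, p = pearsonr(human_consensus, llm_scores)

print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f} (p = {p:.3f})")

# Inter-rater ICC among multiple human raters could be computed analogously,
# e.g. with pingouin.intraclass_corr on long-format (response, rater, score) data.
```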