Automatic Short-Answer Grading in Sustainability Education: AI-Human Agreement


Emirtekin E., Özarslan Y.

JOURNAL OF COMPUTER ASSISTED LEARNING, vol. 42, no. 1, 2026 (SSCI, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 42 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1002/jcal.70160
  • Journal Name: JOURNAL OF COMPUTER ASSISTED LEARNING
  • Indexed in: Social Sciences Citation Index (SSCI), Scopus, CINAHL, Education Abstracts, Educational Research Abstracts (ERA), ERIC (Education Resources Information Center), INSPEC, PsycINFO
  • Affiliated with Manisa Celal Bayar Üniversitesi: Yes

Abstract

Background: Sustainability education emphasises critical thinking and interdisciplinary understanding, which makes assessing students' learning outcomes complex. While Large Language Models (LLMs) have shown promise in educational assessment, their reliability in domains that require contextual reasoning, such as sustainability, remains unclear.

Objectives: This study evaluates the agreement between human raters and several LLMs (GPT-4o, Gemini 2.0 Flash, DeepSeek V3, LLaMA 3.3) in scoring short-answer responses from a university-level Sustainability course. It also investigates how this agreement varies across cognitive skill levels.

Methods: A total of 232 short-answer responses were evaluated using a rubric aligned with Bloom's Revised Taxonomy. Consensus scores from human raters were compared with LLM-generated scores using multiple statistical measures, including Quadratic Weighted Kappa (QWK), the Intraclass Correlation Coefficient (ICC), Pearson correlation, and distributional overlap.

Results: Agreement between LLMs and human raters on total scores was moderate (QWK: 0.585-0.640; r: 0.660-0.668; η: 0.681-0.803). Inter-rater reliability among human raters was good to excellent (ICC: 0.667-0.800). Criterion-level agreement declined as cognitive complexity increased, with notably low agreement when evaluating higher-order skills.

Conclusions: Overall, LLM-human agreement was moderate on total scores but declined at higher cognitive levels, indicating that LLMs are suitable for basic comprehension checks, while human oversight remains necessary for assessing complex reasoning.
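
A minimal sketch of how agreement statistics of the kind reported above (QWK and Pearson r between human consensus and LLM scores) can be computed in Python; the score arrays, the 0-4 rubric scale, and the library choices are illustrative assumptions, not the authors' actual analysis pipeline.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-4 rubric scores for the same ten responses (illustrative only).
human_consensus = np.array([4, 3, 2, 4, 1, 3, 2, 0, 4, 3])
llm_scores = np.array([4, 2, 2, 3, 1, 3, 1, 0, 4, 4])

# Quadratic Weighted Kappa: chance-corrected agreement that penalises
# larger score discrepancies more heavily than small ones.
qwk = cohen_kappa_score(human_consensus, llm_scores, weights="quadratic")

# Pearson correlation between the two score series.
r, p = pearsonr(human_consensus, llm_scores)

print(f"QWK = {qwk:.3f}, Pearson r = {r:.3f} (p = {p:.3f})")

# Inter-rater ICC among multiple human raters could be computed analogously,
# e.g. with pingouin.intraclass_corr on long-format (response, rater, score) data.
```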