7th International Conference on Intelligent and Fuzzy Systems, INFUS 2025, İstanbul, Türkiye, 29-31 July 2025, vol. 1529 LNNS, pp. 519-526 (Full Text Paper)
Evaluating the performance of language models is essential for determining their effectiveness across various NLP tasks. This study investigates four state-of-the-art models, Mistral:7B, LLaMA3.2:1B, LLaMA3.2:3B, and Qwen2.5:3B, using a dataset of 1,293 question-answer pairs on pet-related topics such as animal behavior, health, and care. Reference answers were generated with ChatGPT to maintain consistency and quality. The dataset comprises 12,029 words in questions (averaging 9.30 words per question) and 20,002 words in answers (averaging 15.47 words per answer), highlighting the need for detailed, context-rich responses. The models were evaluated on three key metrics: cosine similarity for semantic alignment, BLEU (with smoothing) for lexical overlap, and ROUGE-L for fluency and contextual relevance. The results indicate that Qwen2.5:3B achieved the highest semantic alignment, with an average cosine similarity of 0.87, demonstrating superior contextual understanding. LLaMA3.2:3B excelled in BLEU, achieving an average score of 76.5 and highlighting its ability to replicate lexical patterns accurately. Mistral:7B outperformed the others in ROUGE-L, with an average score of 0.81, showcasing its strength in generating coherent, fluent responses. Furthermore, an analysis of generated text length revealed substantial differences among the models: LLaMA3.2:1B produced the longest responses, averaging 26.90 words per answer, while Qwen2.5:3B generated more concise responses, averaging 19.31 words per answer. These findings underscore the impact of model architecture and training paradigms on response verbosity and informativeness. This study highlights the strengths and weaknesses of each model, demonstrating the importance of multi-metric evaluation for capturing both the semantic and lexical aspects of performance.
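The three metrics named above can be sketched in plain Python. This is a minimal, self-contained illustration, not the paper's evaluation pipeline: the study likely computes cosine similarity over sentence embeddings (a bag-of-words cosine is used here only to illustrate the idea), and production BLEU/ROUGE implementations (e.g. NLTK, rouge-score) differ in detail; the add-one smoothing below is one of several common schemes.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def cosine_similarity(ref, hyp):
    # Bag-of-words cosine; the study likely used sentence embeddings,
    # but token-count vectors illustrate the same alignment idea.
    a, b = Counter(tokenize(ref)), Counter(tokenize(hyp))
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(ref, hyp, max_n=4):
    # Single-reference BLEU with add-one smoothing on the modified
    # n-gram precisions, plus the standard brevity penalty.
    r, h = tokenize(ref), tokenize(hyp)
    if not h:
        return 0.0
    log_p = 0.0
    for n in range(1, max_n + 1):
        rn, hn = ngrams(r, n), ngrams(h, n)
        overlap = sum(min(c, rn[g]) for g, c in hn.items())
        total = max(sum(hn.values()), 1)
        log_p += math.log((overlap + 1) / (total + 1)) / max_n
    bp = 1.0 if len(h) > len(r) else math.exp(1 - len(r) / len(h))
    return bp * math.exp(log_p)

def rouge_l(ref, hyp):
    # ROUGE-L F1 from the longest common subsequence (LCS).
    r, h = tokenize(ref), tokenize(hyp)
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i, rt in enumerate(r, 1):
        for j, ht in enumerate(h, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if rt == ht \
                else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(h), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

Averaging each metric over all 1,293 question-answer pairs, as the study does, then yields one score per model per metric; reporting all three together is what exposes the divergence between semantic alignment (cosine), lexical overlap (BLEU), and sequence-level fluency (ROUGE-L).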