Computational Intelligence in Legal NLP: Evaluating Language Models on Turkish Legal Texts


Cam B., Kömürcü M. A., Kaya M., Ergün A. E., Çelikten T., Onan A.

7th International Conference on Intelligent and Fuzzy Systems, INFUS 2025, İstanbul, Türkiye, 29-31 July 2025, vol. 1529 LNNS, pp. 527-534, (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • Volume: 1529 LNNS
  • DOI: 10.1007/978-3-031-97992-7_59
  • Published City: İstanbul
  • Published Country: Türkiye
  • Pages: pp. 527-534
  • Keywords: BERTScore, BLEU, Cosine similarity, Evaluation metrics, Language models, ROUGE, Semantic similarity, Word count analysis
  • Manisa Celal Bayar University Affiliated: Yes

Abstract

The evaluation of language models is crucial in determining their effectiveness across various NLP tasks. This study investigates the performance of four prominent language models: Turkcell-LLM-7b, Trendyol-LLM-7b, Gemma-7b, and Gemma2-2b. Using a comprehensive set of evaluation metrics, including ROUGE, BLEU, BERTScore, semantic similarity, and cosine similarity, we analyzed their ability to generate high-quality responses. Our research is motivated by the need to understand how these models perform in diverse linguistic contexts and tasks, aiming to bridge the gap between lexical overlap and semantic understanding. The dataset consists of 1,446 question-answer pairs related to Turkish Rent Law, with an average of 9.03 words per question and 11.27 words per answer. The evaluation reveals distinct strengths for each model, with Gemma-7b excelling in ROUGE metrics and Turkcell-LLM-7b showing superior semantic alignment through BERTScore. Furthermore, Trendyol-LLM-7b demonstrated competitive precision in BLEU evaluations, while Gemma2-2b showcased robust performance in cosine similarity assessments. The word count analysis indicates significant differences in response length among the models, with Turkcell-LLM-7b generating the most detailed answers (51,302 words in total, averaging 35.48 words per answer), whereas Gemma-7b and Gemma2-2b produced more concise responses (12.73 and 13.15 words per answer, respectively). These findings underscore the importance of using varied evaluation metrics to capture the multifaceted nature of language generation quality while also highlighting the impact of response length on model performance in legal NLP applications.