EmbedTurk: Leveraging Large Language Models as Text Encoders for Turkish Language


Oytac D., Ergün A. E., ÇELİKTEN T., Onan A.

7th International Conference on Intelligent and Fuzzy Systems, INFUS 2025, İstanbul, Türkiye, 29–31 July 2025, vol. 1529 LNNS, pp. 593-600 (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • Volume: 1529 LNNS
  • DOI: 10.1007/978-3-031-97992-7_66
  • City: İstanbul
  • Country: Türkiye
  • Pages: pp. 593-600
  • Keywords: Large Language Models, Supervised learning, Text Embedding
  • Affiliated with Manisa Celal Bayar Üniversitesi: Yes

Abstract

Text embedding methods play a critical role in natural language processing (NLP) by transforming textual information into dense vector representations. These representations capture the meaning of, and semantic relationships between, words, phrases, and documents. Text embedding models can be applied to numerous tasks such as information retrieval, semantic textual similarity, and text classification. With increasing attention to Retrieval Augmented Generation (RAG), embedding models are becoming even more crucial. Previous studies focused on encoder-only architectures such as BERT. However, these models have limitations in understanding complex semantic relationships, especially in low-resource languages, and recent advances show that decoder-only architectures outperform them. In this study, we investigated the transformation of LLMs into text embedding models for Turkish using a two-stage training approach: 1) masked next token prediction and 2) supervised contrastive learning. The findings are expected to improve the understanding of language models in Turkish text embedding tasks. By comparing different models, this research can reveal the strengths and limitations of each language model and could contribute to understanding how embedding models can be optimized for low-resource languages.
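The supervised contrastive stage described above is commonly implemented as an InfoNCE-style loss over in-batch negatives, where each query embedding is pulled toward its paired positive and pushed away from the other positives in the batch. A minimal NumPy sketch of that idea is shown below; the temperature value, batch size, and random inputs are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """InfoNCE contrastive loss with in-batch negatives.

    queries, positives: (batch, dim) L2-normalized embeddings,
    where row i of `positives` is the true match for row i of `queries`.
    The temperature is an illustrative choice, not from the paper.
    """
    # Cosine similarity of every query against every positive in the batch.
    sims = queries @ positives.T / temperature            # (batch, batch)
    # Log-softmax over each row, with the diagonal holding the true pairs.
    logits = sims - sims.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy: maximize the log-probability of the matched pair.
    return -np.mean(np.diag(log_probs))

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 8)))
loss_random = info_nce_loss(q, normalize(rng.normal(size=(4, 8))))
loss_matched = info_nce_loss(q, q)  # perfectly aligned pairs yield a lower loss
```

Training the embedding model then amounts to minimizing this loss over (query, positive) pairs, so that matched Turkish text pairs score higher cosine similarity than the in-batch negatives.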