QUBVIS: query based multi-modal summarization system using CLIP based transformer and vision language models


ALTUNDOĞAN T. G., Karakose M.

SoftwareX, vol. 31, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 31
  • Publication Date: 2025
  • DOI: 10.1016/j.softx.2025.102303
  • Journal Name: SoftwareX
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Keywords: Query based summarization, Transformers, Video summarization, Vision language models
  • Affiliated with Manisa Celal Bayar University: Yes

Abstract

This study proposes a new approach to user-interactive summarization of online videos. A multimodal transformer architecture (QUBVIS), which also takes activity queries from the user as input, performs video-to-video summarization with a very high success rate, and the resulting summary video is then captioned by a Vision Language Model with a GPT-2 decoder. The developed models are served through a Flask API so that online video platforms can easily integrate them into their own systems. In addition, a simple web interface built on this API lets users interact with the system. Performance evaluations of both models indicate that the proposed method outperforms similar studies in the literature.
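
To make the idea of query-based summarization concrete, the following is a minimal illustrative sketch and not the QUBVIS architecture itself. It assumes that each video frame and the user's activity query have already been mapped to vectors in a shared embedding space (for example by a CLIP-style encoder), and it simply keeps the k frames most similar to the query. The names `cosine` and `select_frames`, and the toy two-dimensional embeddings, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(frame_embeddings, query_embedding, k):
    """Return the indices of the k frames most similar to the query.

    Indices are returned in temporal (ascending) order so the selected
    frames can be concatenated into a summary clip. This is an assumed,
    simplified selection rule, not the method from the paper.
    """
    scored = [(cosine(f, query_embedding), i) for i, f in enumerate(frame_embeddings)]
    # Highest similarity first; ties are broken by the earlier frame.
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:k]
    return sorted(i for _, i in top)


# Toy example: four frame embeddings and one query embedding (made-up values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
summary_indices = select_frames(frames, query, k=2)
```

In a real system the embeddings would come from trained encoders, and the selection would typically operate on shots or segments and be learned rather than thresholded on raw similarity; the sketch only shows how a user query can steer which parts of a video are kept.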