QUBVIS: query based multi-modal summarization system using CLIP based transformer and vision language models

ALTUNDOĞAN, TURAN; Karakose, Mehmet

doi:10.1016/j.softx.2025.102303

ALTUNDOĞAN T. G., Karakose M.

SoftwareX, vol. 31, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 31
Publication Date: 2025
DOI: 10.1016/j.softx.2025.102303
Journal Name: SoftwareX
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
Keywords: Query based summarization, Transformers, Video summarization, Vision language models
Affiliated with Manisa Celal Bayar University: Yes

Abstract

This study proposes a new approach to user-interactive summarization of online videos. A multimodal transformer architecture (QUBVIS) takes the video together with an activity query from the user and performs video-to-video summarization with a high success rate. The resulting summary video is then captioned by a Vision Language Model with a GPT-2 decoder. Both models are served through a Flask API so that online video platforms can integrate them into their own systems easily, and a simple web interface built on this API lets users interact with it. Performance evaluations of both models indicate that the proposed method outperforms comparable studies in the literature.
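
The query-based selection idea in the abstract can be illustrated with a minimal sketch. This is not the QUBVIS architecture, which the record does not detail. It only assumes that each frame and the user query have already been mapped to vectors in a shared embedding space (as CLIP-style encoders do), scores frames by cosine similarity to the query, and keeps the top-scoring frames in temporal order. The function names `cosine` and `select_frames` and the toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(frame_embeddings, query_embedding, k):
    """Return the indices of the k frames most similar to the query, in temporal order.

    Assumption: the embeddings stand in for features from a CLIP-style encoder;
    a real system would compute them from video frames and the query text.
    """
    scores = [cosine(f, query_embedding) for f in frame_embeddings]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy data: four "frames" and a query that points along the second axis.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [0.0, 1.0]
summary = select_frames(frames, query, k=2)
print(summary)
```

In this toy case the last two frames are closest to the query, so they form the summary. A learned transformer such as the one the abstract describes would replace the fixed similarity ranking with a trained scoring model.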