Abstract
In this paper, we introduce the fifth release of VISIONE, an advanced video retrieval system offering diverse search functionalities. The user can search for a target video using textual prompts, by drawing objects and colors that appear in the target scenes on a canvas, or by providing images as query examples to retrieve video keyframes with similar content. Compared to the previous version of our system, which was runner-up at VBS 2023, the forthcoming release, set to participate in VBS 2024, features a refined user interface that improves usability, as well as updated AI models for more effective video content analysis.
Acknowledgements
This work was partially funded by AI4Media - A European Excellence Centre for Media, Society and Democracy (EC, H2020 n. 951911), the PNRR-National Centre for HPC, Big Data and Quantum Computing project CUP B93C22000620006, and by the Horizon Europe Research & Innovation Programme under Grant agreement N. 101092612 (Social and hUman ceNtered XR - SUN project). Views and opinions expressed in this paper are those of the authors only and do not necessarily reflect those of the European Union. Neither the European Union nor the European Commission can be held responsible for them.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Amato, G. et al. (2024). VISIONE 5.0: Enhanced User Interface and AI Models for VBS2024. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14557. Springer, Cham. https://doi.org/10.1007/978-3-031-53302-0_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53301-3
Online ISBN: 978-3-031-53302-0
eBook Packages: Computer Science, Computer Science (R0)