DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization

Murtaza, Shakeeb; Belharbi, Soufiane; Pedersoli, Marco; Sarraf, Aydin; Granger, Eric

doi:10.1016/j.imavis.2023.104838

Computer Science > Computer Vision and Pattern Recognition

arXiv:2310.06196 (cs)

[Submitted on 9 Oct 2023 (v1), last revised 19 Oct 2023 (this version, v2)]

Title:DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization

Authors:Shakeeb Murtaza, Soufiane Belharbi, Marco Pedersoli, Aydin Sarraf, Eric Granger

View PDF

Abstract:Self-supervised vision transformers (SSTs) have shown great potential to yield rich localization maps that highlight different objects in an image. However, these maps remain class-agnostic since the model is unsupervised. They often tend to decompose the image into multiple maps containing different objects while being unable to distinguish the object of interest from background noise objects. In this paper, Discriminative Pseudo-label Sampling (DiPS) is introduced to leverage these class-agnostic maps for weakly-supervised object localization (WSOL), where only image-class labels are available. Given multiple attention maps, DiPS relies on a pre-trained classifier to identify the most discriminative regions of each attention map. This ensures that the selected ROIs cover the correct image object while discarding the background ones, and, as such, provides a rich pool of diverse and discriminative proposals to cover different parts of the object. Subsequently, these proposals are used as pseudo-labels to train our new transformer-based WSOL model designed to perform classification and localization tasks. Unlike standard WSOL methods, DiPS optimizes performance in both tasks by using a transformer encoder and a dedicated output head for each task, each trained using dedicated loss functions. To avoid overfitting a single proposal and promote better object coverage, a single proposal is randomly selected among the top ones for a training image at each training step. Experimental results on the challenging CUB, ILSVRC, OpenImages, and TelDrone datasets indicate that our architecture, in combination with our transformer-based proposals, can yield better localization performance than state-of-the-art methods.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2310.06196 [cs.CV]
	(or arXiv:2310.06196v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2310.06196
Journal reference:	Image and Vision Computing 140C (2023) 104838
Related DOI:	https://doi.org/10.1016/j.imavis.2023.104838

Submission history

From: Shakeeb Murtaza [view email]
[v1] Mon, 9 Oct 2023 22:52:43 UTC (28,749 KB)
[v2] Thu, 19 Oct 2023 00:11:05 UTC (28,749 KB)

Computer Science > Computer Vision and Pattern Recognition

Title:DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Computer Vision and Pattern Recognition

Title:DiPS: Discriminative Pseudo-Label Sampling with Self-Supervised Transformers for Weakly Supervised Object Localization

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators