Learning from Silence and Noise for Visual Sound Source Localization Models

Xavier Juanola

xavier.juanola@upf.edu

Universitat Pompeu Fabra,
Barcelona, Spain

Giovana Morais

gv2167@nyu.edu

New York University,
New York City, USA

Magdalena Fuentes

mf3734@nyu.edu

New York University,
New York City, USA

Gloria Haro

gloria.haro@upf.edu

Universitat Pompeu Fabra,
Barcelona, Spain

Paper accepted to BMVC 2025


Abstract

Visual sound source localization is a fundamental perception task that aims to detect the location of sounding sources in a video given its audio. Despite recent progress, we identify two shortcomings in current methods: 1) most approaches perform poorly in cases with low audio-visual semantic correspondence such as silence, noise, and offscreen sounds, i.e. in the presence of negative audio; and 2) most prior evaluations are limited to positive cases, where both datasets and metrics cover only scenarios with a single visible sound source in the scene. To address these shortcomings, we introduce three key contributions. First, we propose a new training strategy that incorporates silence and noise, which improves performance in positive cases while being more robust against negative sounds. Our resulting self-supervised model, SSL-SaN, achieves state-of-the-art performance compared to other self-supervised models, both in sound localization and in cross-modal retrieval. Second, we propose a new metric that quantifies the trade-off between alignment and separability of auditory and visual features across positive and negative audio-visual pairs. Third, we present IS3+, an extended and improved version of the IS3 synthetic dataset with negative audio. Our data, metrics and code are available on GitHub.
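To make the notion of negative audio concrete, the following is a minimal sketch of how silence and noise clips could be generated for pairing with video frames. It is an illustration under stated assumptions, not the training code of SSL-SaN: the function name `make_negative_audio`, the use of all-zero samples for silence, zero-mean Gaussian samples for noise, and the `noise_std` value are all hypothetical choices. Plain Python lists are used only to keep the example self-contained.

```python
import random


def make_negative_audio(num_samples, kind, noise_std=0.1, seed=None):
    """Return a synthetic negative audio clip as a list of float samples.

    Assumptions (not taken from the paper): silence is modeled as
    all-zero samples, noise as zero-mean Gaussian samples with
    standard deviation `noise_std`.
    """
    if kind == "silence":
        return [0.0] * num_samples
    if kind == "noise":
        rng = random.Random(seed)  # seeded generator for reproducibility
        return [rng.gauss(0.0, noise_std) for _ in range(num_samples)]
    raise ValueError(f"unknown negative audio kind: {kind!r}")


# Example: one short clip of each kind.
silence_clip = make_negative_audio(8, "silence")
noise_clip = make_negative_audio(8, "noise", seed=0)
```

In a setup like this, each negative clip would be paired with an image for which no region should be localized, so that a model is penalized for responding to audio that has no visual counterpart.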


Test set IS3+

The IS3+ test set will be made available shortly. Please check back soon for access and download instructions.

Cross Modal Retrieval VGG-SS

Audio → Image Retrieval

[Retrieval grid, media not shown. Columns: Query Audio, Top 1, Top 2, Top 3, Top 4. Four query examples.]

Figure 1. Audio → Image Retrieval examples in VGG-SS.
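A top-4 ranking like the ones shown in these retrieval examples can be produced by scoring every gallery embedding against the query embedding and keeping the best matches. The sketch below uses cosine similarity, which is a common choice but an assumption here, since this page does not state the similarity function. The helper names `cosine` and `top_k` and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, gallery, k=4):
    """Indices of the k gallery embeddings most similar to the query."""
    scores = [cosine(query, g) for g in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Toy example: one audio query embedding against four image embeddings.
audio_query = [1.0, 0.1]
image_gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
ranking = top_k(audio_query, image_gallery, k=4)
```

The same function covers the image → audio direction by swapping the roles of the query and the gallery.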

Image → Audio Retrieval

[Retrieval grid, media not shown. Columns: Query Image, Top 1, Top 2, Top 3, Top 4. Four query examples.]

Figure 2. Image → Audio Retrieval examples in VGG-SS.

Cross Modal Retrieval IS3+

Audio → Image Retrieval

[Retrieval grid, media not shown. Columns: Query Audio, Top 1, Top 2, Top 3, Top 4. Four query examples.]

Figure 3. Audio → Image Retrieval examples in IS3+.

Image → Audio Retrieval

[Retrieval grid, media not shown. Columns: Query Image, Top 1, Top 2, Top 3, Top 4. Four query examples.]

Figure 4. Image → Audio Retrieval examples in IS3+.

Qualitative results VGG-SS

[Qualitative grid, images not shown.]
Columns (audio condition): Piano, Silence, Noise, Offscreen; Chicken, Silence, Noise, Offscreen
Rows (models): LVS, EZ-VSL, FNAC, SLAVC, SSL-TIE, SSL-Align (S. Sup.), ACL, SSL-SaN

Qualitative results IS3+

[Qualitative grid, images not shown.]
Columns (audio condition): Cattle, Stream, Silence, Noise, Offscreen; Firework, Icecream Truck, Silence, Noise, Offscreen
Rows (models): LVS, EZ-VSL, FNAC, SLAVC, SSL-TIE, SSL-Align (S. Sup.), ACL, SSL-SaN

Qualitative results AVS-Bench S4

[Qualitative grid, images not shown.]
Columns (audio condition): Race Car, Silence, Noise, Offscreen; Keyboard, Silence, Noise, Offscreen
Rows (models): LVS, EZ-VSL, FNAC, SLAVC, SSL-TIE, SSL-Align (S. Sup.), ACL, SSL-SaN