One of the first things you learn when building a machine learning system is that models fail in surprising ways at the boundaries of their training distribution. For Visual Sound Source Localization (VSSL) models, one of those surprising failure modes is deceptively simple: what happens when there's nothing to hear?
The Assumption Buried in Every Benchmark
The standard VSSL evaluation protocol assumes that the audio always corresponds to something visible in the image: a bark paired with a dog in the frame, piano music paired with a pianist at the keys. This is called a positive audio case. Models are trained and evaluated almost exclusively on these positive cases.
But the real world isn't so clean. Audio can be:
- Silence — nothing is making a sound
- Noise — ambient sound with no meaningful source
- Offscreen — the sound source exists but is outside the camera frame
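The first two cases are easy to synthesize; offscreen audio cannot be, since it is real audio whose source simply falls outside the camera frame. A minimal sketch (function names and the sample rate are assumptions for illustration, not from the paper):

```python
import random

SAMPLE_RATE = 16_000  # assumed sample rate for this sketch

def make_silence(seconds: float) -> list[float]:
    """Silence: a waveform that is all zeros."""
    return [0.0] * int(seconds * SAMPLE_RATE)

def make_noise(seconds: float, amplitude: float = 0.1) -> list[float]:
    """Noise: random samples with no meaningful source."""
    n = int(seconds * SAMPLE_RATE)
    return [random.uniform(-amplitude, amplitude) for _ in range(n)]

# Offscreen audio is not synthesized: it comes from real recordings
# whose sound source lies outside the visible frame.
```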
We call these negative audio cases. And when we tested state-of-the-art models on them, the results were revealing.
What We Found
In our work at UPF and NYU, we evaluated a wide range of state-of-the-art (SOTA) VSSL models using negative audio inputs. The finding was consistent: most models confidently localize a "sound source" even when the audio is pure silence or random noise.
These models aren't localizing audio. They're localizing visually salient regions — and using audio as a justification after the fact.
This is a significant problem. It means that standard benchmarks overestimate real-world model performance. A model that scores well on positive-only evaluation might be completely useless when deployed in conditions where negative audio is possible — which is most real-world scenarios.
Why This Happens
The root cause is in the training signal. Self-supervised VSSL models learn to find correlations between audio and visual features. But because they're only ever trained on positive pairs, they learn to always produce a localization output — there's no mechanism to say "I don't hear anything relevant here."
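A toy sketch makes the mechanism concrete (this is a hypothetical illustration, not any specific model's architecture): if the heatmap is a softmax over audio-visual similarity scores, the output always sums to 1, so probability mass always lands somewhere; there is no "nothing here" output.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def localize(audio_emb: list[float],
             region_embs: list[list[float]]) -> list[float]:
    """Heatmap = softmax over dot-product similarities between the
    audio embedding and each visual region embedding. Because softmax
    normalizes to 1, some region is always highlighted, even when the
    audio carries no information."""
    sims = [sum(a * v for a, v in zip(audio_emb, r)) for r in region_embs]
    return softmax(sims)
```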
The model has never learned to abstain. It doesn't know what silence looks like — or rather, it doesn't know that silence should produce a flat, uninformative heatmap.
Our Solution: Learning from Silence and Noise
In SSL-SaN (accepted to BMVC 2025), we propose a simple but effective training strategy: include silence and noise in the training set as negative examples. When the model sees silence, it should learn to produce a near-uniform heatmap — no confident localization.
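One way to express that near-uniform target as a penalty (a sketch of the idea, not SSL-SaN's exact loss) is the KL divergence between the predicted heatmap and a uniform map, applied whenever the audio is a negative example:

```python
import math

def kl_to_uniform(heatmap: list[float]) -> float:
    """KL(heatmap || uniform). Zero for a perfectly flat heatmap and
    larger the more confidently the model localizes -- a natural
    penalty for silence and noise inputs."""
    n = len(heatmap)
    uniform = 1.0 / n
    return sum(p * math.log(p / uniform) for p in heatmap if p > 0)
```

Minimizing this term on negative examples pushes the model toward a flat map under silence, while the usual localization loss on positive pairs is left untouched.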
We also introduce new evaluation metrics that measure a model's performance on both positive and negative audio simultaneously. This gives a much more honest picture of what these models can actually do.
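The paper's exact metrics are not reproduced here, but the idea can be sketched as scoring positive and negative audio separately and combining the two, e.g. with a harmonic mean (the peak threshold and the combination rule below are assumptions for illustration):

```python
def negative_score(neg_heatmaps: list[list[float]],
                   peak_threshold: float = 0.5) -> float:
    """Fraction of negative-audio heatmaps whose peak activation stays
    below a confidence threshold, i.e. the model abstained."""
    ok = sum(1 for h in neg_heatmaps if max(h) < peak_threshold)
    return ok / len(neg_heatmaps)

def combined_score(positive_acc: float, negative_acc: float) -> float:
    """Harmonic mean: high only when the model both localizes real
    sources and abstains on silence/noise."""
    if positive_acc + negative_acc == 0:
        return 0.0
    return 2 * positive_acc * negative_acc / (positive_acc + negative_acc)
```

A model that only ever localizes scores zero on the negative side, so its combined score collapses no matter how well it does on positive-only benchmarks.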
The results are promising: SSL-SaN achieves state-of-the-art performance among self-supervised models on standard benchmarks, while being dramatically more robust to negative audio.
Why This Matters for Industry
If you're building a system that relies on audio-visual understanding — robot perception, smart surveillance, accessibility tools — you need models that know when to say "I don't know." A model that hallucinates sound sources under silence is not just inaccurate; it can be actively misleading.
The lesson generalizes beyond VSSL: always test your model at the boundaries of its training distribution, and specifically with inputs that should produce null or negative outputs. The failures there are often the most informative.
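In practice, this lesson can be encoded as a boundary test in your evaluation suite. A sketch, assuming an entropy-based flatness check (the tolerance value is an arbitrary choice): feed a null input and assert the output heatmap is near-uniform.

```python
import math

def heatmap_entropy(heatmap: list[float]) -> float:
    """Shannon entropy of a normalized heatmap."""
    return -sum(p * math.log(p) for p in heatmap if p > 0)

def abstains(heatmap: list[float], tolerance: float = 0.1) -> bool:
    """True if the heatmap's entropy is within `tolerance` of the
    maximum possible (a perfectly flat map) -- i.e. the model did not
    hallucinate a confident sound source."""
    max_entropy = math.log(len(heatmap))
    return heatmap_entropy(heatmap) >= (1 - tolerance) * max_entropy
```

A check like `assert abstains(model(silent_audio, image))` on silence and noise inputs would have surfaced the failure mode described above before deployment.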