This post discusses recent research exploring the over-smoothing effect in Vision Transformer (ViT) models — a phenomenon where deeper layers lose spatial detail and act like low-pass filters. A new study demonstrates how Fourier domain analysis and filtering can help retain critical high-frequency components, leading to more expressive and robust models.
The Problem: Transformers as Low-Pass Filters
Vision Transformers have become a cornerstone in deep learning for computer vision, but they often suffer from a subtle issue called over-smoothing. In this condition, features in deeper layers become increasingly indistinct, especially in spatial resolution. This is mathematically analogous to applying a low-pass filter — retaining only the coarse, low-frequency information while discarding high-frequency details.
Over-smoothing leads to loss of discriminative power, blurring boundaries, and reducing generalization.
Theoretical Insight
Choi et al. (2024) offer a detailed analysis using graph signal processing and Fourier transforms to show that Vision Transformers behave similarly to low-pass filters. Their solution? Enrich attention mechanisms with high-frequency representations.
They introduce frequency-tunable attention heads that can extract and amplify high-frequency signals, thereby acting as band-pass filters that preserve rich spatial structure.
Proposed Fix: Frequency-Enhanced Attention
The authors propose augmenting the standard self-attention block in Transformers with graph convolutional operations that are sensitive to frequency content.
- They integrate graph Laplacian eigenvectors to enable explicit control over frequency bands.
- The modified attention mechanism behaves like a spectral filter, mitigating the over-smoothing effect.
Why This Matters
- Preserves fine-grained features like textures and edges.
- Improves performance on tasks sensitive to spatial precision, such as segmentation or super-resolution.
- Makes ViTs more explainable by revealing which frequency bands contribute most to the output.
Here’s a visual metaphor
In the same way that overcooking a crepe can ruin its texture, over-smoothing in ViTs flattens useful features.