This post discusses recent research exploring the over-smoothing effect in Vision Transformer (ViT) models — a phenomenon where deeper layers lose spatial detail and act like low-pass filters. A new study demonstrates how Fourier domain analysis and filtering can help retain critical high-frequency components, leading to more expressive and robust models.


The Problem: Transformers as Low-Pass Filters

Vision Transformers have become a cornerstone of deep learning for computer vision, but they often suffer from a subtle issue called over-smoothing: as depth increases, token features become increasingly similar to one another, washing out spatial detail. Mathematically, this is analogous to applying a low-pass filter that retains only coarse, low-frequency information while discarding high-frequency detail.

Over-smoothing leads to a loss of discriminative power, blurred object boundaries, and reduced generalization.
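
To make the low-pass analogy concrete, here is a minimal toy sketch (not code from the paper): it repeatedly applies a smoothing, attention-like update to random token features and tracks how much spectral energy survives outside the zero-frequency (mean) component. The blended identity/uniform attention matrix, the 14×14 patch grid, and the high_freq_ratio helper are assumptions made purely for illustration.

```python
import torch

def high_freq_ratio(tokens: torch.Tensor) -> float:
    """Fraction of spectral energy outside the zero-frequency (mean) component.

    tokens: (num_tokens, dim) feature matrix for one image.
    """
    spectrum = torch.fft.fft(tokens, dim=0)      # 1-D DFT along the token axis
    energy = spectrum.abs() ** 2
    total = energy.sum()
    dc = energy[0].sum()                         # DC bin = the token mean
    return ((total - dc) / total).item()

# Toy smoothing: each layer blends every token with the global token mean,
# the simplest caricature of attention acting as a low-pass filter.
num_tokens, dim, alpha = 196, 64, 0.5            # 14x14 patches, mixing strength
tokens = torch.randn(num_tokens, dim)
attn = (1 - alpha) * torch.eye(num_tokens) \
     + alpha * torch.full((num_tokens, num_tokens), 1.0 / num_tokens)

for layer in range(12):
    tokens = attn @ tokens                       # one smoothing step
    if (layer + 1) % 4 == 0:
        print(f"after layer {layer + 1}: "
              f"high-frequency energy ratio = {high_freq_ratio(tokens):.6f}")
```

In this toy setting, each smoothing step shrinks every non-DC frequency by the same factor while the mean is untouched, so the ratio collapses toward zero with depth: over-smoothing in miniature.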

Theoretical Insight

Choi et al. (2024) offer a detailed analysis using graph signal processing and Fourier transforms to show that Vision Transformers behave similarly to low-pass filters. Their solution? Enrich attention mechanisms with high-frequency representations.

They introduce frequency-tunable attention heads that can extract and amplify high-frequency signals, thereby acting as band-pass filters that preserve rich spatial structure.
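
The paper's exact formulation is not reproduced here, but the PyTorch sketch below illustrates the general idea of a frequency-tunable head: split the attention output into a token-mean (low-frequency) part and a residual (high-frequency) part, then re-weight each band with learnable gains. The class name FrequencyRescaledAttention and the DC/residual split are assumptions for exposition only.

```python
import torch
import torch.nn as nn

class FrequencyRescaledAttention(nn.Module):
    """Illustrative only: re-weight low- and high-frequency parts of the
    attention output with learnable gains. The DC/residual split over the
    token axis is an expository assumption, not the paper's exact method."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.low_gain = nn.Parameter(torch.ones(dim))            # gain on the token mean
        self.high_gain = nn.Parameter(torch.full((dim,), 1.5))   # start by boosting high frequencies

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        out, _ = self.attn(x, x, x)
        low = out.mean(dim=1, keepdim=True)   # zero-frequency (mean) component
        high = out - low                      # everything above the mean
        return self.low_gain * low + self.high_gain * high

# Usage: dim must be divisible by num_heads.
block = FrequencyRescaledAttention(dim=64, num_heads=8)
y = block(torch.randn(2, 196, 64))            # (batch=2, 196 patch tokens, 64 channels)
```

Because the token mean is exactly the zero-frequency component of the token sequence, boosting the residual relative to the mean counteracts the averaging tendency of attention without changing the block's interface.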

Proposed Fix: Frequency-Enhanced Attention

The authors propose augmenting the standard self-attention block in Transformers with graph convolutional operations that are sensitive to frequency content.

  • They integrate graph Laplacian eigenvectors to enable explicit control over frequency bands (see the sketch after this list).
  • The modified attention mechanism behaves like a spectral filter, mitigating the over-smoothing effect.
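
As a rough sketch of the first bullet, the module below builds a normalized graph Laplacian from a token adjacency matrix, projects features onto the Laplacian eigenbasis (the graph Fourier transform), rescales each graph frequency with a learnable gain, and projects back. The fixed grid adjacency, the eigenvalue-based gain initialization, and the class name GraphSpectralFilter are illustrative choices rather than the authors' design.

```python
import itertools
import torch
import torch.nn as nn

class GraphSpectralFilter(nn.Module):
    """Hedged sketch of spectral filtering over a token graph; the adjacency,
    gain initialization, and class name are illustrative assumptions."""

    def __init__(self, adjacency: torch.Tensor):
        super().__init__()
        deg = adjacency.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.clamp(min=1e-6).rsqrt())
        # Normalized graph Laplacian: L = I - D^{-1/2} A D^{-1/2}
        lap = torch.eye(adjacency.shape[0]) - d_inv_sqrt @ adjacency @ d_inv_sqrt
        eigvals, eigvecs = torch.linalg.eigh(lap)        # graph frequencies in [0, 2]
        self.register_buffer("eigvecs", eigvecs)
        # One learnable gain per graph frequency; initializing to 1 + eigval / 2
        # gives high graph frequencies a mild boost at the start of training.
        self.gains = nn.Parameter(1.0 + eigvals / 2.0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim). Project onto the Laplacian eigenbasis,
        # rescale each frequency band, and project back.
        x_hat = torch.einsum("nk,bnd->bkd", self.eigvecs, x)
        x_hat = x_hat * self.gains[None, :, None]
        return torch.einsum("nk,bkd->bnd", self.eigvecs, x_hat)

# Usage with a 4-neighbour grid over a 14x14 patch layout (an assumed graph).
side = 14
adj = torch.zeros(side * side, side * side)
for r, c in itertools.product(range(side), range(side)):
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < side and 0 <= cc < side:
            adj[r * side + c, rr * side + cc] = 1.0
out = GraphSpectralFilter(adj)(torch.randn(2, side * side, 64))
```

Low Laplacian eigenvalues correspond to smooth, slowly varying signals over the patch graph and high eigenvalues to rapidly varying ones, so per-eigenvalue gains give direct control over which spatial frequency bands the block preserves or amplifies.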

Why This Matters

  • Preserves fine-grained features like textures and edges.
  • Improves performance on tasks sensitive to spatial precision, such as segmentation or super-resolution.
  • Makes ViTs more explainable by revealing which frequency bands contribute most to the output.

Here’s a visual metaphor

[Image: a crepe]

In the same way that overcooking a crepe can ruin its texture, over-smoothing in ViTs flattens useful features.