Faster VGGT with Block-Sparse Global Attention

RWTH Aachen University

⚡️ Global attention is the main computational cost in VGGT and \(\pi^3\). Our analysis reveals sparsity patterns in the attention maps that we exploit with a training-free approach, achieving up to \(4\times\) faster inference.
Global Attention Overview

Visualization of VGGT's global attention matrix. Only a very small number of entries are highly activated, while the vast majority are near zero. This visualization shows the average attention map over all heads of layer 15 in the VGGT aggregator, at an input resolution of \(224\times 182\). Upper highlight: the special tokens attend to each other and form a distinctive pattern. Lower highlight: patch-level attention is localized on a small subset of highly activated entries. See the supplementary material for an enlarged visualization.
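The sketch below shows one way to produce such a visualization. It is a minimal example of ours, not the released code: it assumes the attention weights of a single global attention layer have already been captured (e.g. via a forward hook) as a tensor of shape [num_heads, N, N], which is not exposed directly by the public inference API.

import torch
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

def plot_mean_attention(attn: torch.Tensor, out_path: str = "attn_layer15.png"):
    """Average a captured global-attention map over heads and plot it.

    `attn` is assumed to have shape [num_heads, N, N]. A log color scale makes
    the few highly activated entries visible next to the near-zero majority.
    """
    mean_attn = attn.float().mean(dim=0).cpu().numpy()   # [N, N], averaged over heads
    plt.figure(figsize=(6, 6))
    plt.imshow(mean_attn, cmap="viridis", norm=LogNorm(vmin=1e-6, vmax=mean_attn.max()))
    plt.colorbar(label="attention weight")
    plt.title("Mean global attention over heads")
    plt.savefig(out_path, dpi=200, bbox_inches="tight")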

Abstract

Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models like VGGT and \(\pi^3\) have achieved impressive results with simple architectures, yet they face an inherent runtime bottleneck due to the quadratic complexity of their global attention layers, which limits scalability to large image sets. In this paper, we empirically analyze the global attention matrices of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structure and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to \(\mathbf{4\times}\) faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and \(\pi^3\), and supports larger image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.

Analysis

Performance Overview

Runtime of VGGT’s forward pass. FA denotes frame-wise attention. As the number of input frames increases, global attention dominates the computational cost (measured with FlashAttention2 on an H100 GPU). We propose to adapt a block-sparse attention method that considerably reduces the cost of global attention while preserving result quality.
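The quadratic scaling can be reproduced with a rough micro-benchmark like the one below. This is our own sketch, not the paper's measurement code; it assumes a CUDA GPU, uses PyTorch's scaled_dot_product_attention as a stand-in for FlashAttention2, and picks a hypothetical per-frame token count.

import torch
import torch.nn.functional as F

def attention_ms(num_frames, tokens_per_frame=261, heads=16, dim=64, global_attn=True):
    # Frame-wise attention treats frames as a batch of short sequences;
    # global attention concatenates all frames into one long sequence.
    if global_attn:
        shape = (1, heads, num_frames * tokens_per_frame, dim)
    else:
        shape = (num_frames, heads, tokens_per_frame, dim)
    q, k, v = (torch.randn(shape, device="cuda", dtype=torch.float16) for _ in range(3))
    for _ in range(3):                                    # warm-up runs
        F.scaled_dot_product_attention(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)                        # milliseconds

for n in (50, 100, 200):
    print(f"{n} frames: frame-wise {attention_ms(n, global_attn=False):.1f} ms, "
          f"global {attention_ms(n):.1f} ms")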

Correspondences Sparsity Plot

VGGT’s global attention matrix is extremely sparse. Left: We visualize the tokens corresponding to the top-k activated entries of the attention map of layer 15. Right: Average and maximum attention scores in the global attention maps; the shorthand {S,P}2{P,S} denotes attention between special (S) and patch (P) tokens. Layers in the middle of the aggregator exhibit higher activations and increased sparsity. Note that the mean and max activations are plotted on different scales.
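The {S,P}2{P,S} statistics can be gathered as in the following sketch. It is our own illustration rather than the released evaluation code, and it assumes a captured attention map together with a boolean vector marking which token positions are special tokens; how those positions are laid out per frame is model-specific.

import torch

def attention_block_stats(attn: torch.Tensor, is_special: torch.Tensor):
    """Mean/max attention between special (S) and patch (P) tokens.

    `attn` has shape [num_heads, N, N]; `is_special` is a boolean vector of
    length N that is True at special-token positions and False at patch tokens.
    """
    groups = {"S": is_special, "P": ~is_special}
    stats = {}
    for qname, qmask in groups.items():
        for kname, kmask in groups.items():
            # e.g. "S2P": special-token queries attending to patch-token keys
            block = attn.float()[:, qmask][:, :, kmask]
            stats[f"{qname}2{kname}"] = (block.mean().item(), block.max().item())
    return stats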

Method

We employ training-free adaptive block-sparse attention in the global attention layers of the model to exploit these sparsity patterns.

VGGT Overview

Architecture overview of VGGT. The pretrained checkpoint contains a lightweight camera regression head and three DPT heads. The DINO patchifier and the aggregator contain roughly 300M parameters each, while the DPT heads contain around 32M each.

Sparse Attention Inference

Overview of the training-free adaptive sparse attention. Queries and keys are average-pooled to estimate a low-resolution approximation of the attention map. This low-resolution attention map is used to create the binary mask for block-sparse attention.
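A simplified version of this procedure is sketched below. It is not the authors' kernel: the block size, the per-row top-k heuristic, and the fallback to a dense boolean mask with PyTorch's scaled_dot_product_attention are our own simplifications, whereas the paper relies on highly optimized block-sparse attention kernels.

import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, keep_ratio=0.5):
    """Adaptive block-sparse attention sketch.

    q, k, v: [batch, heads, N, dim]; N is assumed divisible by block_size here.
    """
    B, H, N, D = q.shape
    nb = N // block_size
    # 1) Average-pool queries and keys over blocks to get a coarse attention map.
    q_pool = q.view(B, H, nb, block_size, D).mean(dim=3)
    k_pool = k.view(B, H, nb, block_size, D).mean(dim=3)
    coarse = torch.softmax(q_pool @ k_pool.transpose(-1, -2) / D**0.5, dim=-1)  # [B, H, nb, nb]
    # 2) Keep only the highest-scoring key blocks for every query block.
    k_keep = max(1, int(keep_ratio * nb))
    topk = coarse.topk(k_keep, dim=-1).indices
    block_mask = torch.zeros_like(coarse, dtype=torch.bool).scatter_(-1, topk, True)
    # 3) Expand the block mask to token level and run masked attention
    #    (a real implementation would skip the masked blocks entirely).
    mask = block_mask.repeat_interleave(block_size, dim=-2).repeat_interleave(block_size, dim=-1)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

Because the coarse map is computed on pooled queries and keys, it has roughly \(1/\text{block\_size}^2\) as many entries as the full attention map, so estimating the mask adds little overhead compared to the dense operation it replaces.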

Qualitative Examples

We show examples from the ETH3D dataset. Increasing sparsity leads to small perturbations in the reconstruction, but the overall quality stays remarkably high.

Grid layout: rows show VGGT and \(\pi^3\); columns show the original model, 50% sparsity, and 70% sparsity.

Quantitative Experiments

Relative Pose Estimation & Multi-View Reconstruction


Results for relative pose estimation (top) and multi-view reconstruction (bottom). Multi-view reconstruction performance is robust to sparsification of the global attention; even at the highest sparsity settings, the results are on par with or better than other state-of-the-art methods. We provide comprehensive tables for these results in the supplementary material.

Camera Pose Estimation

Frames  Method     RRA@5↑  RTA@5↑  ATE↓   Time [s]↓
200     VGGT       83.9    79.9    0.012  18
        VGGT-S25   83.1    79.6    0.011  8.5
        VGGT-S50   80.7    78.4    0.011  7.3
        VGGT-S75   57.1    60.8    0.013  5.5
        π3         85.4    83.9    0.009  13.9
        π3-S25     84.6    83.5    0.009  6.8
        π3-S50     82.9    82.3    0.009  5.8
        π3-S75     59.8    67.7    0.009  4.4
full    VGGT       73.4    72.5    0.008  35
        VGGT-S25   72.7    72.2    0.009  17.9
        VGGT-S50   70.3    71.1    0.008  14.4
        VGGT-S75   46.0    53.0    0.009  10.4
        π3         75.8    75.8    0.006  27.9
        π3-S25     74.8    75.3    0.006  13.6
        π3-S50     73.0    74.2    0.006  11.3
        π3-S75     50.0    59.1    0.006  7.8

Feed-Forward Camera Pose Estimation on Tanks & Temples. See the paper for the full table.

Sparsity

Results on Tanks & Temples for different input sizes and sparsity ratios.

BibTeX

@article{wang2025sparsevggt,
  title     = {{Faster VGGT with Block-Sparse Global Attention}},
  author    = {Wang, Chung-Shien Brian and Schmidt, Christian and Piekenbrinck, Jens and Leibe, Bastian},
  journal   = {arXiv preprint arXiv:2509.07120},
  year      = {2025}
}