Visualization of VGGT's global attention matrix. Only a very small number of entries are highly activated, while the vast majority are near zero. This visualization shows the average attention map over all heads of layer 15 in the VGGT aggregator, at an input resolution of \(224\times 182\). Upper highlight: The special tokens attend to each other and form a distinctive pattern. Lower highlight: Patch-level attention is localized on a small subset of highly activated entries. See the supplementary material for an enlarged visualization.
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models such as VGGT, \(\pi^3\), and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrices of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight, and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than \(\mathbf{3\times}\) while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global-attention-based architectures such as VGGT, \(\pi^3\), and MapAnything, while substantially improving scalability to large image collections.
Runtime of VGGT’s forward pass. FA denotes frame-wise attention. As the number of input frames increases, global attention dominates the computational cost (measured with FlashAttention2 on an H100 GPU). We propose to adapt a block-sparse attention method that considerably reduces the cost of global attention while preserving result quality.
VGGT’s global attention matrix is extremely sparse. Left: We visualize the tokens corresponding to the top-k activated entries of the attention map of layer 15. Right: Average & maximum attention scores in the global attention maps; the shorthand {S,P}2{P,S} denotes attention between special (S) and patch (P) tokens. Layers in the middle of the aggregator exhibit higher activations and increased sparsity. Note the different scales of the mean and max activations.
We employ training-free adaptive block-sparse attention in the model's global attention layers to exploit these sparsity patterns.
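To make the idea concrete, the following is a minimal NumPy sketch of what block-sparse attention computes, given a precomputed binary block mask: token pairs falling in masked-out blocks are excluded from the softmax. Shapes, the block size, and the function name are illustrative assumptions; the paper's method uses optimized GPU kernels rather than this dense reference.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_mask, block=4):
    """Dense reference of block-sparse attention.

    q, k, v:    (n, d) token matrices.
    block_mask: (n // block, n // block) boolean; True keeps a block.
    Each query row should keep at least one active block (e.g. the
    diagonal), otherwise its softmax is undefined.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                        # (n, n) logits
    # Expand the block-level mask to token resolution.
    full_mask = np.kron(block_mask, np.ones((block, block))).astype(bool)
    scores = np.where(full_mask, scores, -np.inf)        # drop masked blocks
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v                                         # (n, d) output
```

With an all-True mask this reduces exactly to dense attention; sparsity comes from skipping entire blocks, which is what makes the computation amenable to fast block-wise kernels.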
Architecture overview of VGGT. The pretrained checkpoint contains a lightweight camera regression head and three DPT heads. The DINO patchifier and the aggregator contain roughly 300M parameters each, while the DPT heads contain around 32M each.
Overview of the training-free adaptive sparse attention. Keys and queries are average-pooled to estimate a low-resolution approximation of the attention map. This low-resolution attention map is used to create the binary mask for block-sparse attention.
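The mask-estimation step described above can be sketched as follows: pool queries and keys over blocks, score the pooled pairs, and keep the highest-scoring fraction of blocks per pooled query. This is a minimal NumPy illustration; the block size, keep-ratio, selection rule (per-row top-k), and function name are assumptions for exposition, not the exact procedure or kernel from the paper.

```python
import numpy as np

def estimate_block_mask(q, k, block=4, keep_ratio=0.25):
    """Estimate a binary block mask from average-pooled queries/keys.

    q, k: (n, d) token matrices with n divisible by `block`.
    Returns a (n // block, n // block) boolean mask keeping roughly
    `keep_ratio` of the blocks in each pooled query row.
    """
    n, d = q.shape
    nb = n // block
    q_pool = q.reshape(nb, block, d).mean(axis=1)   # (nb, d) pooled queries
    k_pool = k.reshape(nb, block, d).mean(axis=1)   # (nb, d) pooled keys
    low_res = q_pool @ k_pool.T / np.sqrt(d)        # (nb, nb) approx. map
    keep = max(1, int(round(keep_ratio * nb)))
    # Keep the top-scoring `keep` blocks per pooled query row.
    idx = np.argsort(low_res, axis=-1)[:, -keep:]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask
```

The pooled map costs only \(O((n/\text{block})^2)\) to evaluate, so the overhead of estimating the mask is small compared to the savings from skipping masked blocks in the full-resolution attention.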
We show examples from the ETH3D dataset. Increasing sparsity leads to small perturbations in the reconstruction, but the overall quality stays remarkably high.
Results for Relative Pose Estimation (top) and Multi-View Reconstruction (bottom). Multi-view reconstruction performance is robust against sparsification of global attention; even in the highest sparsity settings, the results are on par with or better than other state-of-the-art methods. We provide comprehensive tables for these results in the supplementary material.
| Frames | Method | RRA@5↑ | RTA@5↑ | ATE↓ | Time [s]↓ |
|---|---|---|---|---|---|
| 200 | VGGT | 83.9 | 79.9 | 0.012 | 18 |
| | VGGT-S25 | 83.1 | 79.6 | 0.011 | 8.5 |
| | VGGT-S50 | 80.7 | 78.4 | 0.011 | 7.3 |
| | VGGT-S75 | 57.1 | 60.8 | 0.013 | 5.5 |
| | π3 | 85.4 | 83.9 | 0.009 | 13.9 |
| | π3-S25 | 84.6 | 83.5 | 0.009 | 6.8 |
| | π3-S50 | 82.9 | 82.3 | 0.009 | 5.8 |
| | π3-S75 | 59.8 | 67.7 | 0.009 | 4.4 |
| full | VGGT | 73.4 | 72.5 | 0.008 | 35 |
| | VGGT-S25 | 72.7 | 72.2 | 0.009 | 17.9 |
| | VGGT-S50 | 70.3 | 71.1 | 0.008 | 14.4 |
| | VGGT-S75 | 46.0 | 53.0 | 0.009 | 10.4 |
| | π3 | 75.8 | 75.8 | 0.006 | 27.9 |
| | π3-S25 | 74.8 | 75.3 | 0.006 | 13.6 |
| | π3-S50 | 73.0 | 74.2 | 0.006 | 11.3 |
| | π3-S75 | 50.0 | 59.1 | 0.006 | 7.8 |
Feed-Forward Camera Pose Estimation on Tanks & Temples. See the paper for the full table.
Results on Tanks & Temples for different input sizes and sparsity ratios.
@article{wang2025sparsevggt,
title = {{Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers}},
author = {Wang, Chung-Shien Brian and Schmidt, Christian and Piekenbrinck, Jens and Leibe, Bastian},
journal = {arXiv preprint arXiv:2509.07120},
year = {2025}
}