Visualization of VGGT's global attention matrix. A very small number of entries are highly activated, while the vast majority are near zero. This visualization shows the average attention map over all heads of layer 15 in the VGGT aggregator, at an input resolution of \(224\times 182\). Upper highlight: The special tokens attend to each other and form a distinctive pattern. Lower highlight: Patch-level attention is localized on a small subset of highly activated entries. See the supplementary material for an enlarged visualization.
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models such as VGGT and \(\pi^3\) achieve impressive results with simple architectures, yet they face an inherent runtime bottleneck: the quadratic complexity of their global attention layers limits scalability to large image sets. In this paper, we empirically analyze the global attention matrices of these models and observe that probability mass concentrates on a small subset of patch-patch interactions that correspond to cross-view geometric matches. Motivated by this structured attention and inspired by recent advances in large language models, we propose a replacement for the dense global attention operation based on highly optimized block-sparse kernels, yielding up to \(\mathbf{4\times}\) faster inference with comparable task performance. Our retrofit requires no retraining of the backbone, extends to both VGGT and \(\pi^3\), and supports larger image collections. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate the effectiveness of our approach.
Runtime of VGGT’s forward pass. FA denotes frame-wise attention. As the number of input frames increases, global attention dominates the computational cost (measured with FlashAttention2 on an H100 GPU). We propose to adapt a block-sparse attention method that considerably reduces the cost of global attention while preserving result quality.
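A rough cost model makes this trend explicit (our simplification, assuming \(F\) frames with \(T\) tokens each): frame-wise attention runs independently per frame and costs on the order of \(F \cdot T^2\), whereas global attention operates on all \(FT\) tokens jointly and costs on the order of \((FT)^2 = F^2 T^2\). Global attention therefore grows quadratically with the number of frames, while frame-wise attention grows only linearly.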
VGGT’s global attention matrix is extremely sparse. Left: We visualize the tokens corresponding to the top-\(k\) activated entries of the attention map of layer 15. Right: Average and maximum attention scores in the global attention maps; the shorthand {S,P}2{P,S} denotes attention between special (S) and patch (P) tokens. Layers in the middle of the aggregator exhibit higher activations and increased sparsity. Note the different scales of the mean and max activations.
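To make the {S,P}2{P,S} shorthand concrete, here is a minimal sketch of how such statistics could be computed from a single layer's attention map. The function name and the boolean `special_idx` layout (marking camera/register tokens) are our assumptions for illustration, not code from the paper.

```python
import torch

def attention_stats_by_token_type(attn: torch.Tensor, special_idx: torch.Tensor):
    """Mean/max attention per block type from one layer's attention map.

    attn: [H, N, N] per-head attention map of a global attention layer.
    special_idx: boolean [N], True for special (camera/register) tokens,
                 False for patch tokens (assumed token layout).
    Returns a dict mapping S2S/S2P/P2S/P2P to (mean, max) scores.
    """
    blocks = {
        "S2S": (special_idx, special_idx),
        "S2P": (special_idx, ~special_idx),
        "P2S": (~special_idx, special_idx),
        "P2P": (~special_idx, ~special_idx),
    }
    stats = {}
    for name, (rows, cols) in blocks.items():
        sub = attn[:, rows][:, :, cols]  # select the query/key sub-block
        stats[name] = (sub.mean().item(), sub.max().item())
    return stats
```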
We employ training-free adaptive block-sparse attention in the global attention layers of the model to exploit these sparsity patterns.
Architecture overview of VGGT. The pretrained checkpoint contains a lightweight camera regression head and three DPT heads. The DINO patchifier and the aggregator contain roughly 300M parameters each, while the DPT heads contain around 32M each.
Overview of the training-free adaptive sparse attention. Queries and keys are average-pooled to estimate a low-resolution approximation of the attention map, which is then used to create the binary mask for block-sparse attention.
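Below is a minimal PyTorch sketch of this idea, under our own assumptions: the function name, block size, and per-query-block top-\(k\) selection rule are illustrative, and the block mask is expanded to a dense boolean mask for `scaled_dot_product_attention` purely for readability. In practice, the mask would drive an optimized block-sparse attention kernel, which is where the speedup comes from; a `keep_ratio` of 0.25, for example, corresponds to skipping roughly 75% of all blocks.

```python
import torch
import torch.nn.functional as F

def block_sparse_global_attention(q, k, v, block_size=64, keep_ratio=0.5):
    """Hypothetical sketch of training-free adaptive block-sparse attention.

    q, k, v: [B, H, N, D] tensors of a global attention layer.
    A low-resolution attention map is estimated from average-pooled queries
    and keys; its strongest entries select which blocks exact attention is
    evaluated on.
    """
    B, H, N, D = q.shape
    n_blocks = N // block_size  # assumes N is divisible by block_size

    # 1) Average-pool queries and keys over token blocks.
    q_pool = q.reshape(B, H, n_blocks, block_size, D).mean(dim=3)
    k_pool = k.reshape(B, H, n_blocks, block_size, D).mean(dim=3)

    # 2) Low-resolution estimate of the attention map (block x block).
    est = torch.softmax(q_pool @ k_pool.transpose(-1, -2) / D ** 0.5, dim=-1)

    # 3) Keep the highest-scoring key blocks for each query block.
    k_keep = max(1, int(keep_ratio * n_blocks))
    topk = est.topk(k_keep, dim=-1).indices              # [B, H, n_blocks, k_keep]
    block_mask = torch.zeros_like(est, dtype=torch.bool)
    block_mask.scatter_(-1, topk, True)

    # 4) Expand the block mask to token level and run masked attention.
    #    (For readability only; a real deployment would pass the block mask
    #    to a block-sparse kernel instead of materializing a dense mask.)
    token_mask = block_mask.repeat_interleave(block_size, dim=-2)
    token_mask = token_mask.repeat_interleave(block_size, dim=-1)  # [B, H, N, N]
    return F.scaled_dot_product_attention(q, k, v, attn_mask=token_mask)
```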
We show examples from the ETH3D dataset. Increasing sparsity leads to small perturbations in the reconstruction, but the overall quality stays remarkably high.
Results for Relative Pose Estimation (top) and Multi-View Reconstruction (bottom). Multi-view reconstruction performance appears robust to sparsification of global attention; even in the highest-sparsity settings, the results are on par with or better than other state-of-the-art methods. We provide comprehensive tables for these results in the supplementary material.
| Frames | Method | RRA@5↑ | RTA@5↑ | ATE↓ | Time [s]↓ |
|---|---|---|---|---|---|
| 200 | VGGT | 83.9 | 79.9 | 0.012 | 18 |
| 200 | VGGT-S25 | 83.1 | 79.6 | 0.011 | 8.5 |
| 200 | VGGT-S50 | 80.7 | 78.4 | 0.011 | 7.3 |
| 200 | VGGT-S75 | 57.1 | 60.8 | 0.013 | 5.5 |
| 200 | π3 | 85.4 | 83.9 | 0.009 | 13.9 |
| 200 | π3-S25 | 84.6 | 83.5 | 0.009 | 6.8 |
| 200 | π3-S50 | 82.9 | 82.3 | 0.009 | 5.8 |
| 200 | π3-S75 | 59.8 | 67.7 | 0.009 | 4.4 |
| full | VGGT | 73.4 | 72.5 | 0.008 | 35 |
| full | VGGT-S25 | 72.7 | 72.2 | 0.009 | 17.9 |
| full | VGGT-S50 | 70.3 | 71.1 | 0.008 | 14.4 |
| full | VGGT-S75 | 46.0 | 53.0 | 0.009 | 10.4 |
| full | π3 | 75.8 | 75.8 | 0.006 | 27.9 |
| full | π3-S25 | 74.8 | 75.3 | 0.006 | 13.6 |
| full | π3-S50 | 73.0 | 74.2 | 0.006 | 11.3 |
| full | π3-S75 | 50.0 | 59.1 | 0.006 | 7.8 |
Feed-Forward Camera Pose Estimation on Tanks & Temples. See the paper for the full table.
Results on Tanks & Temples for different input sizes and sparsity ratios.
@article{wang2025sparsevggt,
title = {{Faster VGGT with Block-Sparse Global Attention}},
author = {Wang, Chung-Shien Brian and Schmidt, Christian and Piekenbrinck, Jens and Leibe, Bastian},
journal = {arXiv preprint arXiv:2509.07120},
year = {2025}
}