Visualization of VGGT's global attention matrix. Only a very small number of entries are highly activated, while the vast majority are near zero. This visualization shows the average attention map over all heads of layer 15 in the VGGT aggregator, at an input resolution of \(224\times 182\). Upper highlight: The special tokens attend to each other and form a distinctive pattern. Lower highlight: Patch-level attention is localized on a small subset of highly activated entries. See the supplementary material for an enlarged visualization.
Efficient and accurate feed-forward multi-view reconstruction has long been an important task in computer vision. Recent transformer-based models such as VGGT, \(\pi^3\), and MapAnything have demonstrated remarkable performance with relatively simple architectures. However, their scalability is fundamentally constrained by the quadratic complexity of global attention, which imposes a significant runtime bottleneck when processing large image sets. In this work, we empirically analyze the global attention matrices of these models and observe that the probability mass concentrates on a small subset of patch-patch interactions corresponding to cross-view geometric correspondences. Building on this insight, and inspired by recent advances in large language models, we propose a training-free, block-sparse replacement for dense global attention, implemented with highly optimized kernels. Our method accelerates inference by more than \(\mathbf{3\times}\) while maintaining comparable task performance. Evaluations on a comprehensive suite of multi-view benchmarks demonstrate that our approach seamlessly integrates into existing global-attention-based architectures such as VGGT, \(\pi^3\), and MapAnything, while substantially improving scalability to large image collections.
Runtime of VGGT’s forward pass. FA denotes frame-wise attention. As the number of input frames increases, global attention dominates the computational cost (measured with FlashAttention2 on an H100 GPU). We propose to adapt a block-sparse attention method that considerably reduces the cost of global attention while preserving result quality.
VGGT’s global attention matrix is extremely sparse. Left: We visualize the tokens corresponding to the top-k activated entries of the attention map of layer 15. Right: Average & maximum attention scores in the global attention maps; the shorthand {S,P}2{P,S} denotes attention between special (S) and patch (P) tokens. Layers in the middle of the aggregator exhibit higher activations and increased sparsity. Note the different scales of the mean and max activations.
We employ training-free adaptive block-sparse attention in the model's global attention layers to exploit these sparsity patterns.
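To make the idea concrete, the following is a minimal NumPy sketch of what block-sparse attention computes, given a precomputed binary block mask: token pairs falling in masked-out blocks are excluded from the softmax. Shapes, the block size, and the function name are illustrative assumptions; the paper's method uses optimized GPU kernels rather than this dense reference.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_mask, block=4):
    """Dense reference of block-sparse attention.

    q, k, v:    (n, d) token matrices.
    block_mask: (n // block, n // block) boolean; True keeps a block.
    Each query row should keep at least one active block (e.g. the
    diagonal), otherwise its softmax is undefined.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                        # (n, n) logits
    # Expand the block-level mask to token resolution.
    full_mask = np.kron(block_mask, np.ones((block, block))).astype(bool)
    scores = np.where(full_mask, scores, -np.inf)        # drop masked blocks
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v                                         # (n, d) output
```

With an all-True mask this reduces exactly to dense attention; sparsity comes from skipping entire blocks, which is what makes the computation amenable to fast block-wise kernels.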
Architecture overview of VGGT. The pretrained checkpoint contains a lightweight camera regression head and three DPT heads. The DINO patchifier and the aggregator contain roughly 300M parameters each, while the DPT heads contain around 32M each.
Overview of the training-free adaptive sparse attention. Keys and queries are average-pooled to estimate a low-resolution approximation of the attention map. This low-resolution attention map is used to create the binary mask for block-sparse attention.
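The mask-estimation step described above can be sketched as follows: pool queries and keys over blocks, score the pooled pairs, and keep the highest-scoring fraction of blocks per pooled query. This is a minimal NumPy illustration; the block size, keep-ratio, selection rule (per-row top-k), and function name are assumptions for exposition, not the exact procedure or kernel from the paper.

```python
import numpy as np

def estimate_block_mask(q, k, block=4, keep_ratio=0.25):
    """Estimate a binary block mask from average-pooled queries/keys.

    q, k: (n, d) token matrices with n divisible by `block`.
    Returns a (n // block, n // block) boolean mask keeping roughly
    `keep_ratio` of the blocks in each pooled query row.
    """
    n, d = q.shape
    nb = n // block
    q_pool = q.reshape(nb, block, d).mean(axis=1)   # (nb, d) pooled queries
    k_pool = k.reshape(nb, block, d).mean(axis=1)   # (nb, d) pooled keys
    low_res = q_pool @ k_pool.T / np.sqrt(d)        # (nb, nb) approx. map
    keep = max(1, int(round(keep_ratio * nb)))
    # Keep the top-scoring `keep` blocks per pooled query row.
    idx = np.argsort(low_res, axis=-1)[:, -keep:]
    mask = np.zeros((nb, nb), dtype=bool)
    np.put_along_axis(mask, idx, True, axis=-1)
    return mask
```

The pooled map costs only \(O((n/\text{block})^2)\) to evaluate, so the overhead of estimating the mask is small compared to the savings from skipping masked blocks in the full-resolution attention.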
We show examples from the ETH3D dataset. Increasing sparsity leads to small perturbations in the reconstruction, but the overall quality stays remarkably high.
Results for Relative Pose Estimation (top) and Multi-View Reconstruction (bottom). Multi-view reconstruction performance is robust against sparsification of global attention; even in the highest sparsity settings, the results are on par with or better than other state-of-the-art methods. We provide comprehensive tables for these results in the supplementary material.
| Frames | Method | RRA@5↑ | RTA@5↑ | ATE↓ | Time [s]↓ |
|---|---|---|---|---|---|
| 200 | VGGT | 83.9 | 79.9 | 0.012 | 18 |
| | VGGT-S25 | 83.1 | 79.6 | 0.011 | 8.5 |
| | VGGT-S50 | 80.7 | 78.4 | 0.011 | 7.3 |
| | VGGT-S75 | 57.1 | 60.8 | 0.013 | 5.5 |
| | π3 | 85.4 | 83.9 | 0.009 | 13.9 |
| | π3-S25 | 84.6 | 83.5 | 0.009 | 6.8 |
| | π3-S50 | 82.9 | 82.3 | 0.009 | 5.8 |
| | π3-S75 | 59.8 | 67.7 | 0.009 | 4.4 |
| full | VGGT | 73.4 | 72.5 | 0.008 | 35 |
| | VGGT-S25 | 72.7 | 72.2 | 0.009 | 17.9 |
| | VGGT-S50 | 70.3 | 71.1 | 0.008 | 14.4 |
| | VGGT-S75 | 46.0 | 53.0 | 0.009 | 10.4 |
| | π3 | 75.8 | 75.8 | 0.006 | 27.9 |
| | π3-S25 | 74.8 | 75.3 | 0.006 | 13.6 |
| | π3-S50 | 73.0 | 74.2 | 0.006 | 11.3 |
| | π3-S75 | 50.0 | 59.1 | 0.006 | 7.8 |
Feed-Forward Camera Pose Estimation on Tanks & Temples. See the paper for the full table.
Results on Tanks & Temples for different input sizes and sparsity ratios.
@article{wang2025sparsevggt,
title = {{Block-Sparse Global Attention for Efficient Multi-View Geometry Transformers}},
author = {Wang, Chung-Shien Brian and Schmidt, Christian and Piekenbrinck, Jens and Leibe, Bastian},
journal = {arXiv preprint arXiv:2509.07120},
year = {2025}
}