MonarchRT: Efficient Attention for Real-Time Video Generation



¹Carnegie Mellon University   ²University at Buffalo   ³Morpheus AI

TL;DR: We introduce MonarchRT, a method that sparsely parameterizes attention maps in video generation DiTs as Monarch matrices. Specifically, it addresses three key limitations of existing work on Monarch-style attention to make it feasible for video: (1) block misalignment with video dimensions, fixed by aligning block sizes with the latent axes; (2) non-monotonic accuracy under added compute, fixed by a tiled Monarch parameterization; and (3) linear iterative-refinement overhead, mitigated by 1-iteration refinement + finetuning and custom Triton kernels. Empirically, MonarchRT reaches up to 95% effective attention sparsity with no quality loss and achieves speedups over FlashAttention(-2, -3, -4) kernels in the range of 1.4–11.8$\times$. Notably, this enables, for the first time, true real-time 16 FPS video generation with Self-Forcing on a single RTX 5090.

Note: While MonarchRT sparsely parameterizes the attention map, it does not directly enforce a sparse mask, and the attention map itself can remain dense. Our speedups come from computing this dense attention via a Monarch-structured factorization, not from skipping tokens/entries.

  Motivation and Observations: Structure of 3D Attention

3D self-attention is a major computational bottleneck for Diffusion Transformers (DiTs) in video generation. Many recent works have shown that video attention maps are highly redundant, and costs can be significantly reduced through various sparse attention approximations. However, these methods generally target bidirectional models with 50 or more diffusion steps, which can easily hide approximation errors; meanwhile, real-time video sits in a harsh regime consisting of few-step diffusion and autoregressive generation, where approximation errors can propagate across steps/frames.

In this setting, we observe that sparsification can be fundamentally brittle, as video attention is not reliably sparse. In reality, attention maps must carry significantly more information over fewer (causally-masked) steps, so the redundancy assumptions from bidirectional, many-step diffusion do not hold. This is a major obstacle to adopting sparse attention here: the aggressive sparsity required for true real-time speed can cause severe quality degradation. The problem will only grow in importance as latency-critical use cases like world modeling and robotics emerge. This motivates us to explore sparse parameterizations that still capture the intrinsic structure of video attention.

Oracle top-k vs. Monarch approximation error
Figure 1: MSE of attention map approximation using oracle top-$k$ and Monarch on Self-Forcing for layer 0 head 0 (L0H0) and layer 8 head 7 (L8H7) for varying sparsity levels. Even oracle top-$k$ can incur large errors at high sparsity, while Monarch preserves structure.

In particular, we find that video attention combines (i) strong spatiotemporal periodic structure, (ii) sparse semantic correspondences, and (iii) intermittent dense mixing, which can exceed the representational capacity of even oracle top-$k$ attention. Consider a video latent with frames $f$, height $h$, width $w$ (token count $N = f h w$). Writing indices as $(f_0,h_0,w_0)$ and $(f_1,h_1,w_1)$, we model the (post-softmax) attention matrix $\mathbf{A}$ as

$$\mathbf{A}_{(f_0,h_0,w_0),(f_1,h_1,w_1)} = \mathbf{D}_{(f_0,h_0,w_0),(f_1,h_1,w_1)} + \mathbf{S}_{(f_0,h_0,w_0),(f_1,h_1,w_1)} + \varepsilon,$$ $$\mathbf{D}_{(f_0,h_0,w_0),(f_1,h_1,w_1)} = d_f(f_0,f_1)\, d_h(h_0,h_1)\, d_w(w_0,w_1).$$

Here $\mathbf{D}$ captures separable positional structure (smoothly decaying distances along time/height/width), while $\mathbf{S}$ captures sparse semantic links (retrieval-like spikes) that are independent of position.

Why top-$k$ can fail: even if $\mathbf{S}$ is sparse, $\mathbf{D}$ is not. Some heads may require a large fraction of tokens to recover most of the attention mass, so enforcing a hard $k$ can delete essential periodic mixing.
3D attention map
Figure 2: Illustration of 3D attention map, with diagonal bands from positional periodicity + sparse semantic spikes.
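The failure mode above can be simulated in a few lines of numpy. The sketch below (toy decay kernels and random spike positions of our own choosing, not the paper's data or model) builds a row-normalized map $\mathbf{A} = \mathbf{D} + \mathbf{S}$ and measures oracle top-$k$ error: because the separable $\mathbf{D}$ spreads mass across many entries, the error stays nonzero even when a large fraction of entries is kept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-axis positional kernels: smoothly decaying with distance,
# as in the separable model D = d_f * d_h * d_w.
f, h, w = 4, 4, 4
def decay(n, scale):
    idx = np.arange(n)
    return np.exp(-np.abs(idx[:, None] - idx[None, :]) / scale)

d_f, d_h, d_w = decay(f, 2.0), decay(h, 1.5), decay(w, 1.5)

# Separable positional component over N = f*h*w tokens (Kronecker structure).
D = np.kron(np.kron(d_f, d_h), d_w)
N = D.shape[0]

# Sparse semantic spikes S: a handful of position-independent links.
S = np.zeros_like(D)
for _ in range(8):
    S[rng.integers(N), rng.integers(N)] += 1.0

A = D + S
A /= A.sum(axis=1, keepdims=True)   # row-normalize like a softmax output

def topk_approx(A, k):
    """Oracle top-k: keep the k largest entries per row, then renormalize."""
    out = np.zeros_like(A)
    for i, row in enumerate(A):
        keep = np.argsort(row)[-k:]
        out[i, keep] = row[keep]
    return out / out.sum(axis=1, keepdims=True)

for k in (4, 16, 32):
    err = np.mean((A - topk_approx(A, k)) ** 2)
    print(f"k={k:3d}  kept={k/N:.0%}  MSE={err:.2e}")
```

Even at 50% of entries kept per row, the residual from the dense positional component remains; only $k \to N$ drives the error to zero.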

A useful consequence of our proposed attention map model is that, after a suitable permutation $\mathbf{P}$ that groups tokens based on spatiotemporal structure, the positional component $\mathbf{D}$ can become blockwise rank-1. For this reason, we explore the possibility of approximating these attention maps as Monarch matrices, which have sufficient representational power to capture such patterns with minimal error.

  Monarch + MonarchAttention: Background and Video Challenges

Monarch parameterization represents an $N\times N$ matrix (with $N=b_1 b_2$) as a structured factorization $$\mathbf{M} = \mathbf{P}\,\mathbf{L}\,\mathbf{P}^\top\,\mathbf{R},$$ where $\mathbf{P}$ is a fixed permutation, and $\mathbf{L}$ and $\mathbf{R}$ are block-diagonal. Equivalently, $\mathbf{M}$ can be reshaped into a 4D tensor of shape $b_1\times b_2\times b_1\times b_2$ whose local slices are rank-$1$. Writing the blocked tensor as $\widetilde{\mathbf{M}}\in\mathbb{R}^{b_1\times b_2\times b_1\times b_2}$, $$\widetilde{\mathbf{M}}_{\ell j k i}=\mathbf{L}_{j\ell k}\,\mathbf{R}_{kji},$$ so for fixed $(j,k)$, the slice $\widetilde{\mathbf{M}}_{:,j,k,:}$ is rank-$1$ in $(\ell,i)$. Choosing block sizes that align with spatiotemporal structure is crucial for effective Monarch approximation in video attention.
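As a sanity check, the factorization above can be reproduced in a few lines of numpy (random factors and small hypothetical block sizes of our own choosing): we form the dense matrix from the blockwise formula, verify that every $(j,k)$ slice is rank-1, and apply the matrix through the factored permute / block-multiply / permute path without ever materializing it.

```python
import numpy as np

rng = np.random.default_rng(0)
b1, b2 = 3, 4
N = b1 * b2

# Block-diagonal factors in tensor form:
# L has b2 blocks of size b1 x b1 (indexed L[j, l, k]),
# R has b1 blocks of size b2 x b2 (indexed R[k, j, i]).
L = rng.standard_normal((b2, b1, b1))
R = rng.standard_normal((b1, b2, b2))

# Dense Monarch matrix from the blockwise formula M~_{l j k i} = L_{j l k} R_{k j i}.
M4 = np.einsum('jlk,kji->ljki', L, R)   # shape (b1, b2, b1, b2)
M = M4.reshape(N, N)                     # row = l*b2 + j, col = k*b2 + i

# Every (j, k) slice is rank-1 in (l, i): it is an outer product of
# a column of L's j-th block and a row of R's k-th block.
for j in range(b2):
    for k in range(b1):
        assert np.linalg.matrix_rank(M4[:, j, k, :]) <= 1

def monarch_matvec(L, R, x):
    """Apply M to x via the factored path, never forming M explicitly."""
    xt = x.reshape(b1, b2)              # group input as x[k, i]
    y = np.einsum('kji,ki->kj', R, xt)  # block-diagonal R
    z = np.einsum('jlk,kj->lj', L, y)   # permute, then block-diagonal L
    return z.reshape(N)

x = rng.standard_normal(N)
print("factored matvec matches dense:", np.allclose(M @ x, monarch_matvec(L, R, x)))
```

The factored matvec costs $O(N(b_1 + b_2))$ instead of $O(N^2)$, which is where the sparsity savings come from.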

MonarchAttention

MonarchAttention (Yaras et al.) proposes to approximate the attention matrix $\mathbf{A}=\mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top)\in\mathbb{R}^{N\times N}$ as a Monarch matrix by directly optimizing the Monarch factors via an iterative alternating refinement procedure. The key practical advantage is that one can update and apply factors $\mathbf{L}$ and $\mathbf{R}$ without explicitly forming $\mathbf{A}$.

MonarchAttention pipeline
Figure 3: MonarchAttention alternates factor updates to approximate $\mathrm{softmax}(\mathbf{Q}\mathbf{K}^\top)$ without materializing the full attention matrix.

Challenges of Monarch Parameterization for Video Attention

We identify three key challenges of directly applying a MonarchAttention-style algorithm to video attention:

C1 (Block Alignment). Picking generic block sizes $b_1$ and $b_2$ such that $N = b_1 b_2$ does not yield an accuracy-preserving sparse parameterization for video. Instead, there are strict block size constraints tied to the video latent dimensions (frames $f$, height $h$, width $w$) that must be satisfied for the parameterization to properly capture the positional structure.

C2 (Non-monotonic improvement with compute). Because of the alignment constraints, only a small set of block sizes actually “work” in practice. Critically, using block sizes outside this set does not necessarily improve approximation, even if the representation becomes less sparse. Not being able to reliably achieve higher accuracy by investing more compute makes the approach fairly inflexible.

C3 (Iteration overhead). A MonarchAttention-style alternating refinement procedure scales linearly with the number of refinement steps. In few-step diffusion and autoregressive video generation, this multi-iteration overhead can erode much of the intended latency improvement.

  MonarchRT: Solutions for Real-Time Video Attention

1) Block Alignment with Spatiotemporal Structure

For video DiTs, attention rows/columns correspond to physical visual patches, and a large fraction of the attention structure is driven by a separable positional component (the $\mathbf{D}$ term in our decomposition). For Monarch to fully capture this positional structure, the block sizes must be chosen so that the three positional terms can be placed cleanly into either $\mathbf{L}$ or $\mathbf{R}$ without splitting any spatiotemporal axis.

This becomes concrete if we focus on representing the positional term $$\mathbf{D}_{(f_0,h_0,w_0),(f_1,h_1,w_1)} = d_f(f_0,f_1)\,d_h(h_0,h_1)\,d_w(w_0,w_1).$$ Under the $(b_1,b_2)$ blocking, Monarch enforces a blockwise rank-1 form $$\widetilde{\mathbf{M}}_{\ell j k i}=\mathbf{L}_{j\ell k}\,\mathbf{R}_{kji}.$$ Exact representation of $\mathbf{D}$ is possible only if we can assign the three separable terms to the two Monarch factors without mixing axes: $\mathbf{L}$ must contain (a product of) some subset of $\{d_f,d_h,d_w\}$, and $\mathbf{R}$ must contain the remaining term(s). If a spatiotemporal axis is split across both block sizes (misalignment), then neither $\mathbf{L}$ nor $\mathbf{R}$ is indexed by both coordinates of that axis. (Here, "both coordinates" refers to the indices along a given axis for a query and key pair, e.g. $(w_0, w_1)$ along the width axis.) For example, if the width dimension were split between both block sizes, then one factor would be indexed by $w_0$ while the other is indexed by $w_1$, so the full 2D positional term $d_w(w_0,w_1)$ cannot live entirely inside a single factor. In that case, capturing $d_w$ exactly would require it to be rank-1 under the induced split, which is generally not true, so $\mathbf{D}$ cannot be represented exactly.

Quality comparison of dense, top-k, and Monarch
Figure 4: On Self-Forcing, Monarch parameterization can succeed (especially over oracle top-$k$), but only if block sizes are aligned with video dimensions; misalignment can introduce permutation-like artifacts.

When each of $f$, $h$, and $w$ is fully contained in exactly one block size, the three separable terms can be assigned cleanly to $\mathbf{L}$ and $\mathbf{R}$. For example, with $(b_1,b_2)=(fh,w)$ one valid construction is $$\mathbf{L}_{w_0,(f_0,h_0),(f_1,h_1)}=d_f(f_0,f_1)\,d_h(h_0,h_1),\qquad \mathbf{R}_{(f_1,h_1),w_0,w_1}=d_w(w_0,w_1),$$ which makes $\mathbf{M}$ reproduce $\mathbf{D}$ exactly.

To generalize, for $N=f\,h\,w$, the alignment rule yields exactly six (nontrivial) aligned choices for $(b_1,b_2)$: $$ (fh,w),\ (w,fh),\ (f,hw),\ (hw,f),\ (fw,h),\ (h,fw). $$

Takeaway: Misaligned block sizes prevent exact representation of the positional term $\mathbf{D}$, so an accurate parameterization requires obeying alignment of the Monarch block sizes with the spatiotemporal dimensions $f$, $h$, and $w$.
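The alignment rule is easy to check numerically. The sketch below (illustrative decay kernels on a toy latent, not learned attention) builds the construction from the text for the aligned choice $(b_1,b_2)=(fh,w)$ and verifies it reproduces $\mathbf{D}$ exactly, then shows that a blocking which splits the width axis produces blocked slices of rank $>1$, so no Monarch matrix under that blocking can be exact.

```python
import numpy as np

f, h, w = 3, 4, 4
N = f * h * w

def decay(n, scale):
    idx = np.arange(n)
    return np.exp(-np.abs(idx[:, None] - idx[None, :]) / scale)

d_f, d_h, d_w = decay(f, 2.0), decay(h, 1.5), decay(w, 1.5)

# Separable positional component D in token order (f, h, w).
D = np.kron(np.kron(d_f, d_h), d_w)   # (N, N)

# Aligned choice (b1, b2) = (fh, w): put d_f * d_h in L and d_w in R,
# exactly as in the construction in the text.
b1, b2 = f * h, w
L = np.broadcast_to(np.kron(d_f, d_h), (b2, b1, b1))  # L[j, l, k], constant in j = w0
R = np.broadcast_to(d_w, (b1, b2, b2))                # R[k, j, i], constant in k = (f1, h1)
M = np.einsum('jlk,kji->ljki', L, R).reshape(N, N)
print("aligned (fh, w) blocking is exact:", np.allclose(M, D))

# Misaligned blocking that splits the width axis: (b1, b2) = (f*h*2, 2).
# Some blocked slices of D now have rank > 1, so exact representation fails.
b1m, b2m = f * h * 2, 2
D4 = D.reshape(b1m, b2m, b1m, b2m)
ranks = [np.linalg.matrix_rank(D4[:, j, k, :])
         for j in range(b2m) for k in range(b1m)]
print("max slice rank under misaligned blocking:", max(ranks))  # > 1
```

Note that the Monarch factors here are constant along one block index, reflecting that each factor carries only its assigned subset of $\{d_f, d_h, d_w\}$.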

2) Tiled Monarch Parameterization

From the alignment discussion above, there are only six block size configurations that exactly preserve the separable positional structure in video. This creates a practical limitation: if you want to reduce sparsity (i.e., spend more compute / parameters to improve approximation), the most obvious approach is to modify the block sizes $(b_1,b_2)$; however, this is unreliable, as block sizes outside the aligned set are misaligned with spatiotemporal axes, so they may produce worse approximations even when the representation is less sparse (see Figure 4).

MonarchRT introduces tiled Monarch to reduce sparsity while maintaining alignment: start from one of the six aligned base configurations, then subdivide the coarse rank-1 blocks induced by Monarch into smaller subtiles with factors $(c_1,c_2)$ (where $c_1\mid b_1$, $c_2\mid b_2$). So instead of estimating each block with a single set of rank-1 parameters, we estimate each subtile as its own rank-1 piece.

More concretely, tiling replaces each original Monarch block with a grid of $c_1c_2$ subtiles (and does this across all blocks), where each subtile behaves like a smaller rank-1 block with effective sizes $\bigl(\tfrac{b_1}{c_1},\tfrac{b_2}{c_2}\bigr)$. This increases the parameter count, and hence expressivity, while preserving block size alignment.
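To see the accuracy-compute tradeoff concretely, the sketch below fits a toy target matrix (our own construction, not video attention) with the best per-subtile rank-1 factors computed by SVD. This is a reference-quality fit, not the paper's iterative refinement algorithm, but it demonstrates the key property: refining the tiling never increases the approximation error.

```python
import numpy as np

rng = np.random.default_rng(0)
b1, b2 = 12, 4
N = b1 * b2

# Hypothetical target: a smooth dense component plus sparse spikes.
A = np.exp(-np.abs(np.subtract.outer(np.arange(N), np.arange(N))) / 8.0)
for _ in range(20):
    A[rng.integers(N), rng.integers(N)] += 1.0

def rank1(B):
    """Best rank-1 approximation of B via truncated SVD."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return s[0] * np.outer(U[:, 0], Vt[0])

def tiled_monarch_error(A, b1, b2, c1, c2):
    """Frobenius error of the best tiled-Monarch fit: every (j, k) slice
    is split into a c1 x c2 grid of subtiles, each fit as rank-1."""
    A4 = A.reshape(b1, b2, b1, b2)
    approx = np.zeros_like(A4)
    s1, s2 = b1 // c1, b2 // c2
    for j in range(b2):
        for k in range(b1):
            S = A4[:, j, k, :]
            for u in range(c1):
                for v in range(c2):
                    approx[u*s1:(u+1)*s1, j, k, v*s2:(v+1)*s2] = \
                        rank1(S[u*s1:(u+1)*s1, v*s2:(v+1)*s2])
    return np.linalg.norm(A4 - approx)

# Each tiling in the list refines the previous one, so error is non-increasing.
errs = [tiled_monarch_error(A, b1, b2, c1, c2)
        for (c1, c2) in [(1, 1), (2, 1), (2, 2), (4, 2)]]
print("errors for tilings (1,1) -> (4,2):", [f"{e:.3f}" for e in errs])
```

Monotonicity follows because the restriction of a rank-1 matrix to a subtile is still rank $\le 1$, so any coarse fit is feasible for the finer tiling.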

Tiled block sizes
Figure 5: Comparison of regular and Tiled Monarch. Tiled Monarch subdivides each block into smaller tiles, improving locality while keeping aligned structure.

Tiled Monarch strictly generalizes untiled Monarch. For the same base block sizes $(b_1,b_2)$, untiled Monarch is a special case of tiled Monarch: set the subtile parameters so that all subtiles within an original block share the same rank-1 factors. Therefore, $$\mathcal{M}(b_1,b_2)\subseteq \mathcal{M}_{\mathrm{tile}}(b_1,b_2;c_1,c_2).$$ Here $\mathcal{M}(b_1,b_2)$ denotes the set of (untiled) Monarch matrices with block sizes $(b_1,b_2)$, and $\mathcal{M}_{\mathrm{tile}}(b_1,b_2;c_1,c_2)$ denotes the set of tiled Monarch matrices with the same base block sizes and tiling factors $(c_1,c_2)$.

When $c_1>1$ or $c_2>1$, this containment is strict: tiled Monarch can exactly parameterize certain matrices that violate the rank-$1$ constraint of an untiled block. Concretely, let $\boldsymbol{B}^{(j,k)}\in\mathbb{R}^{b_1\times b_2}$ denote the $(j,k)$-th untiled block, and focus on the specific block $\boldsymbol{B}^{(0,0)}$. Pick two coordinates $(r_1,s_1)$ and $(r_2,s_2)$ with $r_1\neq r_2$ and $s_1\neq s_2$ that lie in different subtiles of this same block, and set all subtiles to zero except those two; within each of the two chosen subtiles, set all entries to zero except at its designated coordinate $(r_1,s_1)$ or $(r_2,s_2)$, so that $$\boldsymbol{B}^{(0,0)}_{r_1,s_1}=1,\; \boldsymbol{B}^{(0,0)}_{r_2,s_2}=1,\; \text{and } \boldsymbol{B}^{(0,0)}_{r_1,s_2}=\boldsymbol{B}^{(0,0)}_{r_2,s_1}=0.$$ Then the $2\times 2$ submatrix of $\boldsymbol{B}^{(0,0)}$ restricted to rows $\{r_1,r_2\}$ and columns $\{s_1,s_2\}$ is $$\begin{pmatrix}1 & 0\\ 0 & 1\end{pmatrix},$$ hence $\mathrm{rank}(\boldsymbol{B}^{(0,0)})\ge 2$, which is impossible for untiled Monarch to exactly represent. But since every subtile is still rank-1, tiled Monarch can represent this construction exactly.
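The counterexample in the text is easy to verify numerically (with arbitrary small sizes $b_1=b_2=4$, $c_1=c_2=2$ chosen for illustration):

```python
import numpy as np

b1, b2, c1, c2 = 4, 4, 2, 2
s1, s2 = b1 // c1, b2 // c2   # subtile sizes (2, 2)

# Block with two unit entries in different subtiles, as in the text:
# (r1, s1) = (0, 0) lives in subtile (0, 0); (r2, s2) = (2, 2) in subtile (1, 1).
B = np.zeros((b1, b2))
B[0, 0] = 1.0
B[2, 2] = 1.0

# Untiled Monarch needs the whole block to be rank <= 1: impossible here.
print("block rank:", np.linalg.matrix_rank(B))   # 2

# But every subtile is rank <= 1, so tiled Monarch represents B exactly.
for u in range(c1):
    for v in range(c2):
        sub = B[u*s1:(u+1)*s1, v*s2:(v+1)*s2]
        assert np.linalg.matrix_rank(sub) <= 1
print("all subtiles rank <= 1: True")
```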

Takeaway: Tiling restores a clear accuracy–efficiency tradeoff: increasing $(c_1,c_2)$ refines locality and improves approximation quality by investing more compute/parameters. Refining the granularity of each rank-1 partition also makes it easier to capture non-positional patterns, like sparse semantic correlations.
MonarchAttention and tiled MonarchAttention pseudocode
Figure 6: MonarchAttention vs. tiled MonarchAttention pseudocode. Tiling modifies the refinement updates to operate on smaller sub-blocks.
Note: Another way to reduce sparsity while maintaining alignment is to estimate each original Monarch block as rank-$k$ for $k>1$. However, we find this does not yield a clean MonarchAttention-style iterative refinement algorithm for estimating the factors, so we do not adopt this approach.

3) Finetuning + Triton Kernels (Reducing Iterative Overhead)

MonarchAttention uses iterative refinement to estimate factors $\mathbf{L}$ and $\mathbf{R}$. Using more iterations improves the accuracy of the approximation but linearly increases the runtime. MonarchRT reduces this overhead by employing finetuning, so that 1 iteration + finetuning matches the quality of much longer refinement without finetuning, significantly lowering the algorithmic cost.

Iterations vs. quality
Figure 7: Iterative refinement improves quality, but cost scales linearly with the number of iterations; finetuning recovers the quality of many iterations using just one.

On the systems side, we write custom Triton kernels to efficiently implement the tiled MonarchAttention forward and backward passes, so the 1-iteration refinement cost remains small.

Kernel visualization
Figure 8: Two-stage kernel visualization for tiled MonarchAttention forward pass.

  Empirical Results (Quality + Speed)

Quality at High Sparsity

MonarchRT is designed for extreme regimes where quality cannot be traded for speed. On Self-Forcing, we reach 95% attention sparsity while matching dense attention on VBench.

Example generations on autoregressive Self-Forcing, comparing dense attention and MonarchRT at 95% sparsity.
Method                    Quality   Semantic   Total
Dense Attention           0.844     0.804      0.836
MonarchRT (95% sparse)    0.846     0.805      0.838

The table above shows VBench scores on Self-Forcing for dense attention and MonarchRT at 95% (effective) sparsity. We inject MonarchRT directly into the distribution matching distillation (DMD) stage of Self-Forcing. Since MonarchRT reduces both inference and training costs, and since both models are trained for the same number of iterations, MonarchRT achieves higher performance even when using a smaller training budget than the dense model.

Efficiency: Kernel and End-to-End Speed

With optimized kernels, MonarchRT speeds up attention substantially on Self-Forcing. Below we report measured 480p results at the same sparsity level (95%) used for the quality analysis above.

Kernel Latency (Decode Final Frame)

GPU         FlashAttention (ms)   MonarchRT (ms)   Speedup
RTX 5090    4.97                  1.27             3.9×
H100        1.58                  1.04             1.5×

End-to-End Latency (Generate 81 Frames)

GPU         FlashAttention (ms)   MonarchRT (ms)   Speedup
RTX 5090    8309.06               6094.45          1.36×
H100        3661.16               3228.45          1.13×

Furthermore, applying torch.compile to MonarchRT at 95% sparsity yields an even higher practical end-to-end throughput gain: on RTX 5090 at 480p, FlashAttention-2 achieves 11 FPS while MonarchRT achieves 16 FPS, without requiring any additional lossy optimizations.

We can further push MonarchRT to achieve even higher theoretical speedups; for example, MonarchRT provides an 11.8$\times$ attention speedup over FA-2 on RTX 5090 for bidirectional attention with 720p video dimensions, at 98% effective sparsity. More detailed experiments are included in the paper.

  Conclusion and Discussion

MonarchRT reframes efficient video attention: instead of assuming attention is sparse, we treat it as a mixture of separable positional structure, sparse semantic correspondences, and dense mixing; Monarch parameterizations provide the right inductive bias for this structure. By enforcing spatiotemporal block alignment, introducing tiled Monarch for monotonic refinement, and finetuning with custom kernels, MonarchRT delivers large attention speedups without compromising real-time video fidelity.

MonarchRT is most useful when (i) attention is a dominant runtime term (typical for most video DiTs), (ii) the target regime is latency-sensitive (few-step diffusion, interactive generation), and (iii) quality must be preserved even at high sparsity. If the workload is already dominated by non-attention components, or if the model is tolerant to aggressive sparsity (e.g., highly redundant many-step diffusion), then simpler static or dynamic sparse methods may suffice.

Citation

@misc{agarwal2026monarchrtefficientattentionrealtime,
      title={MonarchRT: Efficient Attention for Real-Time Video Generation}, 
      author={Krish Agarwal and Zhuoming Chen and Cheng Luo and Yongqi Chen and Haizhong Zheng and Xun Huang and Atri Rudra and Beidi Chen},
      year={2026},
      eprint={2602.12271},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.12271}, 
}