
AstraFlow

Dataflow-Oriented Reinforcement Learning for (Multi-)Agentic LLMs

Agentic RL · Multi-Agent RL · Asynchronous · Elastic · Cross-Region · Heterogeneous · Composable Data Algorithms
Haizhong Zheng¹ · Yizhuo Di¹ · Jiahui Wang¹ · Shuowei Jin² · Xueshen Liu² · Yongji Wu³ · Z. Morley Mao² · Ion Stoica³ · Jiawei Zhao⁴ · Beidi Chen¹
¹Carnegie Mellon University · ²U. Michigan · ³UC Berkeley · ⁴Meta
Figure 1 · Elasticity falls out of the contract. Mixed-hardware nodes (H100, A100, L40S, …) across regions (US, EU, APAC, …) join and leave the pool on demand. AstraFlow contains no scheduler- or region-specific code; elasticity and heterogeneity fall out of the RaaS contract.

§What AstraFlow supports out of the box

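AstraFlow is an open-source, dataflow-oriented reinforcement learning system built for flexibility and scale. For LLM RL training, it natively supports multi-policy collaborative training, elastic rollout pools, heterogeneous and cross-region rollouts, and modular data algorithms, without any feature-specific system engineering.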

Open source · Apache 2.0

Build your own agentic workflow for RL training on AstraFlow.

§What Existing RL Systems Lack

Need

Scale RL to agentic LLMs — multi-policy collaborative training, dynamic execution, elastic / heterogeneous / cross-region compute — under one system.

Today

Existing LLM RL systems are trainer-centered. A single trainer loop owns rollout scheduling, data movement, optimization, and weight sync. Multi-agent serving systems run rich agent workflows but don't train. Recent systems add multi-policy / elastic / heterogeneous rollout as ad-hoc patches — they're hard to combine, hard to reuse, and require feature-specific engineering each time.

Why

The root cause is the lack of clean abstraction boundaries among rollout execution, dataflow management, training, and weight transfer. Compute decoupling (separating rollout from training computation) is just placement; it isn't a principled component abstraction. Without those boundaries, new capabilities can't be supported by the architecture itself; they have to be hand-engineered onto a trainer-centered loop.

§Dataflow-oriented RL: Separate control, not just compute

Figure 2 · The three abstractions in action

AstraFlow's design principle is dataflow-oriented coordination. Disaggregation does not only separate rollout and training computation; it also separates their control responsibilities. Rollout services, trainers, and the dataflow layer each run autonomous control loops and interact only through minimal data and weight interfaces. Capabilities like multi-policy collaborative training, elastic rollout pools, heterogeneous and cross-region rollouts, and modular data algorithms are then expressed by the architecture itself, not bolted on.

1. Dataflow Abstraction: Data algorithms become plug-ins

Dataflow layer. The coordination plane between rollout services and trainers. Buffers RL data in its natural units (prompts, trajectories, metadata, batches); applies sampling, filtering, and routing policies. Data algorithms become first-class plug-ins instead of pipeline rewrites.

RaaS nodes pull rollout tasks from the data layer and push completed trajectories back, while trainers independently pull batches. The layer exposes programmable dataflow policies: selective rollout, curriculum scheduling, filtering, sampling, replay, mixing, and staleness correction, without requiring changes to trainers, RaaS nodes, or orchestration. It also regulates autonomous components by throttling slow rollouts, prioritizing fresh trajectories, blocking unsuitable batches via backpressure, and routing multi-policy data using metadata such as policy, model version, timestamp, reward, and task type.
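The push/pull pattern described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not AstraFlow's actual API: the names `Trajectory`, `DataflowLayer`, `push`, `pull_batch`, and `max_staleness` are hypothetical, and the buffer is a plain in-memory dictionary. It shows routing by policy metadata, a staleness filter based on model version, and backpressure (an empty result when no suitable batch exists yet).

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class Trajectory:
    policy: str         # policy that should train on this sample (routing key)
    model_version: int  # weight version used when the rollout was generated
    reward: float


class DataflowLayer:
    """Hypothetical in-memory buffer between rollout nodes and trainers."""

    def __init__(self, current_version: Callable[[str], int], max_staleness: int = 2):
        self._buffers: Dict[str, Deque[Trajectory]] = {}
        self._current_version = current_version  # policy name -> latest version
        self._max_staleness = max_staleness      # assumed staleness budget

    def push(self, traj: Trajectory) -> None:
        # Route by policy metadata so each trainer only sees its own data.
        self._buffers.setdefault(traj.policy, deque()).append(traj)

    def pull_batch(self, policy: str, batch_size: int) -> List[Trajectory]:
        buf = self._buffers.get(policy, deque())
        latest = self._current_version(policy)
        # Staleness filter: discard trajectories generated by weights too old.
        fresh = [t for t in buf if latest - t.model_version <= self._max_staleness]
        if len(fresh) < batch_size:
            # Backpressure: keep what is usable and tell the trainer to wait.
            self._buffers[policy] = deque(fresh)
            return []
        self._buffers[policy] = deque(fresh[batch_size:])
        return fresh[:batch_size]
```

Because the trainer only ever calls `pull_batch`, swapping the staleness rule for a different sampling or replay rule would not touch trainer or rollout code, which is the property the text attributes to the dataflow layer.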

2. RaaS Abstraction: Rollout becomes a service contract

RaaS. Each node consumes tasks, produces trajectories, refreshes weights. Three operations, one contract — that's all rollout has to be.

The RaaS contract makes rollout execution substitutable: any efficient agent-serving runtime plugs in. The runtime doesn't need to know how trajectories are sampled, filtered, or assigned to trainers; the trainer doesn't need to know which runtime produced a trajectory. AstraFlow can reuse specialized agent-serving systems as backends instead of re-implementing their internal logic.

RaaS also makes capacity elastic. Adding capacity is just launching more nodes connected to the same dataflow and weight interfaces; removed nodes, slow workers, and failures only affect the rate at which trajectories arrive, not the trainer loop. Heterogeneous and cross-region settings, where rollout services differ in latency, throughput, and bandwidth, fall out of the same contract.
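As a rough sketch of the three-operation contract (consume tasks, produce trajectories, refresh weights), the node loop below wraps an arbitrary serving backend. All names here (`RolloutBackend`, `EchoBackend`, `RaaSNode`, `step`) are illustrative assumptions rather than AstraFlow's interfaces, and a shared `deque` stands in for the dataflow layer's task queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Protocol


class RolloutBackend(Protocol):
    """What a serving runtime must offer in this sketch (hypothetical)."""

    def generate(self, prompt: str) -> str: ...
    def load_weights(self, version: int) -> None: ...


class EchoBackend:
    """Toy stand-in for a real agent-serving runtime."""

    def __init__(self) -> None:
        self.version = 0

    def generate(self, prompt: str) -> str:
        return f"v{self.version}:{prompt}"

    def load_weights(self, version: int) -> None:
        self.version = version


class RaaSNode:
    def __init__(self, backend: RolloutBackend, tasks: Deque[str],
                 out: List[Dict], latest_version: Callable[[], int]) -> None:
        self.backend = backend
        self.tasks = tasks                    # shared task queue
        self.out = out                        # shared trajectory sink
        self.latest_version = latest_version  # weight interface (pull-based)
        self.version = 0

    def step(self) -> bool:
        """Run one iteration; return False when no task is available."""
        latest = self.latest_version()
        if latest > self.version:             # 1. refresh weights
            self.backend.load_weights(latest)
            self.version = latest
        if not self.tasks:
            return False
        prompt = self.tasks.popleft()         # 2. consume a task
        self.out.append({                     # 3. produce a trajectory
            "prompt": prompt,
            "response": self.backend.generate(prompt),
            "model_version": self.version,
        })
        return True
```

In this sketch, adding capacity amounts to constructing another `RaaSNode` over the same `tasks` and `out` objects; a slow or missing node only changes how quickly `out` fills.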

3. Trainer Abstraction: Training backends become substitutable

(a) Trainer abstraction.
(b) Fully-async pull-based sparse weight update.

A trainer consumes batches and publishes updated weights via a trainer-side interface. To the trainer, AstraFlow is just a streaming corpus plus a weight publication target: it does not manage rollout workers, serve weights, or coordinate with other trainers. This makes training backends interchangeable: any RL, SFT, or fault-tolerant backend can participate by consuming batches and publishing weights. For multi-policy training, each policy has its own trainer and weight stream while the dataflow layer routes trajectories.

Weight transfer is decoupled from training: model versions are stored and exposed to RaaS nodes, which pull updates when appropriate. The same interface supports full-model transfer, sparse deltas, and version-aware refresh without adding trainer-side complexity.
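The publish/pull split can be illustrated with a toy versioned store. This is a sketch under assumptions, not the system's implementation: `WeightStore`, `ToyTrainer`, `publish`, and `pull` are hypothetical names, weights are a small dictionary, and every version is kept in memory for simplicity.

```python
from typing import Dict, List, Optional, Tuple

Weights = Dict[str, float]


class WeightStore:
    """Hypothetical versioned store: trainers publish, rollout nodes pull."""

    def __init__(self, initial: Weights) -> None:
        self._versions: Dict[int, Weights] = {0: dict(initial)}
        self.latest = 0

    def publish(self, weights: Weights) -> int:
        # The trainer's only weight-side duty: hand over a snapshot.
        self.latest += 1
        self._versions[self.latest] = dict(weights)
        return self.latest

    def pull(self, have: int) -> Optional[Tuple[int, Weights]]:
        # Rollout nodes decide when to call this; None means "already current".
        if have >= self.latest:
            return None
        return self.latest, dict(self._versions[self.latest])


class ToyTrainer:
    """Consumes batches and publishes weights; knows nothing about rollout."""

    def __init__(self, store: WeightStore, lr: float = 0.1) -> None:
        self.store = store
        self.lr = lr
        self.weights: Weights = {"w": 0.0}

    def train_step(self, batch_rewards: List[float]) -> int:
        # Toy update rule: move w toward the mean reward of the batch.
        mean = sum(batch_rewards) / len(batch_rewards)
        self.weights["w"] += self.lr * (mean - self.weights["w"])
        return self.store.publish(self.weights)
```

The trainer never tracks which nodes hold which version; a node that is behind simply gets the newest snapshot the next time it calls `pull`.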

§What can AstraFlow enable?

Multi-agent RL: 2.7× faster than verl-based MAS RL, +5.4 pp better than single-policy

AstraFlow expresses multi-policy workflows as workflow-level changes, not pipeline modifications. To the best of our knowledge, AstraFlow is the first LLM RL framework to support fully asynchronous multi-policy collaborative training. The verl-based baseline, Dr.MAS, inherits verl's colocated synchronous execution, so long-tail multi-agent rollouts stall each iteration; AstraFlow's asynchronous dataflow avoids that stall.

(a) Math: Solver + Verifier.
(b) Code: Solver + Selector.
(c) Code: Solver + Test-Case Generator.
Math multi-policy training under matched conditions. AstraFlow reaches comparable / better accuracy while cutting iteration time by 2.7×.
| Method | AIME24 | AIME25 | MATH500 | Minerva | Avg Acc. | Time / iter (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Solver | 42.9 | 31.8 | 90.5 | 39.2 | 51.1 | |
| Solver + Verifier (verl) | 44.6 (+1.7) | 41.5 (+9.7) | 90.7 (+0.2) | 40.9 (+1.7) | 54.4 (+3.3) | 212.64 |
| Solver + Verifier (AstraFlow) | 47.3 (+4.4) | 40.6 (+8.8) | 92.9 (+2.4) | 45.0 (+5.8) | 56.5 (+5.4) | 77.65 |

The same abstraction generalizes beyond math. On code (LiveCodeBench v5/v6, Codeforces), Solver + Test-Case Generator improves the matched Solver baseline from 30.29% to 34.55% average accuracy (+4.26 pp). Both code workflows, Solver + Selector and Solver + Test-Case Generator, are workflow-level changes; the trainer, rollout, dataflow, and weight interfaces are identical to the math run.

Code-generation accuracy: single-policy vs. multi-policy collaborative training (Qwen3-8B).
| Method | LCB v5 | LCB v6 | Codeforces | Avg |
| --- | --- | --- | --- | --- |
| Solver | 36.83 | 32.86 | 21.20 | 30.29 |
| Solver + Selector | 38.32 (+1.49) | 35.43 (+2.57) | 22.67 (+1.47) | 32.14 (+1.85) |
| Solver + Test-Case Gen. | 41.62 (+4.79) | 36.29 (+3.43) | 25.74 (+4.54) | 34.55 (+4.26) |

Auto-scaling: Rollout auto-scaling with an agentic maintainer

Pool size and trainer waiting under different settings
Auto-scaling in action. Rollout GPU count tracks demand: scale up when trainer waiting spikes, hold inside the dead band, scale down on sustained low waiting.

The dataflow layer observes how much each rollout pool produces, how much the trainer consumes, and how long the trainer spends waiting. It exports a target pool size via a simple dead-band policy. An agentic maintainer (Claude Code in the experiment) reads the report and resizes the pool — launching or retiring RaaS instances per the suggested target. AstraFlow contains no scheduler-specific code.
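A dead-band rule of the kind described could look like the sketch below. The thresholds (`low`, `high`), the `patience` window used to model "sustained" low waiting, the one-node step size, and the class name `DeadBandPolicy` are all assumptions chosen for illustration; the text does not give the actual values or signal definition.

```python
from dataclasses import dataclass, field


@dataclass
class DeadBandPolicy:
    """Hypothetical dead-band controller over the trainer's waiting fraction."""

    low: float = 0.05    # below this, the pool may be oversized (assumed)
    high: float = 0.20   # above this, the trainer is starved (assumed)
    patience: int = 3    # consecutive low readings required to scale down
    min_size: int = 1
    max_size: int = 64
    _low_streak: int = field(default=0, init=False, repr=False)

    def target(self, current: int, wait_fraction: float) -> int:
        """Return the suggested pool size given the latest waiting fraction."""
        if wait_fraction > self.high:
            # Trainer waiting spiked: scale up immediately.
            self._low_streak = 0
            return min(current + 1, self.max_size)
        if wait_fraction < self.low:
            # Only scale down after low waiting has been sustained.
            self._low_streak += 1
            if self._low_streak >= self.patience:
                self._low_streak = 0
                return max(current - 1, self.min_size)
            return current
        # Inside the dead band: hold the current size.
        self._low_streak = 0
        return current
```

The policy only emits a number; whatever reads the report (the agentic maintainer in the experiment) is responsible for actually launching or retiring instances.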

Cross-region & Heterogeneous: Trainer never blocks on a slow remote pool

Three RaaS pools at 100% / 60% / 30% relative throughput (induced by 700W / 400W / 250W per-GPU power caps), two of them remote and shaped to 4 Gbit/s + 300 ms RTT. All three pools contribute every iteration. AstraFlow reaches 67.6 average math accuracy — comparable to a homogeneous local baseline. The trainer's downtime is essentially flat; even when remote pools finish weight transfer late, ongoing training masks the cost.

(a) Per-iteration throughput from each pool.
(b) Weight-transfer time per link.
(c) Trainer / rollout downtime per iteration.

Sparsity makes remote training feasible: Full-sync 28 GB drops to ~1.5 GB

Per-iteration weight delta sparsity across Qwen3-1.7B / 8B / 14B on math, and Qwen2.5-7B on AlfWorld, WebShop, Search. Math runs land in 0.989–0.993; Qwen2.5-7B tasks reach ≥0.996.

RL weight updates under bf16 are bit-exactly sparse: most parameters are unchanged from one iteration to the next. Sparsity is largely independent of model size and task — only learning rate moves it, and even at the most aggressive setting it stays above 0.97.
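One way to make "bit-exactly sparse" concrete is to compare the bf16 bit patterns of two weight versions and count how many are identical. The sketch below uses only the standard library; the helper names `bf16_bits` and `delta_sparsity` are hypothetical, round-to-nearest-even conversion is an assumption about how weights are cast, and NaN handling is omitted.

```python
import struct
from typing import Sequence


def bf16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern of x (round-to-nearest-even, no NaN handling)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    # bf16 keeps the top 16 bits of float32; add a rounding bias before truncating.
    bias = 0x7FFF + ((u32 >> 16) & 1)
    return ((u32 + bias) >> 16) & 0xFFFF


def delta_sparsity(old: Sequence[float], new: Sequence[float]) -> float:
    """Fraction of parameters whose bf16 bit pattern is unchanged."""
    if len(old) != len(new):
        raise ValueError("weight vectors must have the same length")
    unchanged = sum(bf16_bits(a) == bf16_bits(b) for a, b in zip(old, new))
    return unchanged / len(old)


# Toy example: 1.0001 rounds to the same bf16 value as 1.0, so only the
# last parameter counts as changed.
old_weights = [1.0, 0.5, -2.0, 0.25]
new_weights = [1.0001, 0.5, -2.0, 0.5]
toy_sparsity = delta_sparsity(old_weights, new_weights)
```

Only the positions where the bit pattern differs would need to be shipped to rollout nodes, which is what the delta payload in the next paragraph refers to.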

That drops the per-iteration payload from a ~28 GB full sync to ~1.5 GB of delta bytes on Qwen3-14B. Most remote transfers complete in tens of seconds even at 4 Gbit/s + 300 ms RTT. The request-based delta-pull keeps slow links off the trainer's critical path.
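The payload figures can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a nominal 14e9 parameters at 2 bytes each for bf16 and computes ideal line-rate transfer times only; it ignores the 300 ms RTT, protocol overhead, and delta encoding cost, so the results are lower bounds rather than predictions of the measured times.

```python
BYTES_PER_BF16 = 2
NOMINAL_PARAMS = 14e9  # assumed nominal parameter count for a 14B model
LINK_GBIT_PER_S = 4.0  # link speed quoted in the text


def transfer_seconds(payload_bytes: float, link_gbit_per_s: float) -> float:
    """Ideal transfer time at line rate (no latency or protocol overhead)."""
    return payload_bytes * 8 / (link_gbit_per_s * 1e9)


full_sync_bytes = NOMINAL_PARAMS * BYTES_PER_BF16  # 28e9 bytes, about 28 GB
delta_bytes = 1.5e9                                # delta size quoted in the text

full_sync_s = transfer_seconds(full_sync_bytes, LINK_GBIT_PER_S)
delta_s = transfer_seconds(delta_bytes, LINK_GBIT_PER_S)
reduction = full_sync_bytes / delta_bytes          # roughly 18.7x fewer bytes
```

Under these assumptions a full sync needs at least 56 s of pure transmission per iteration, while the delta needs at least 3 s, leaving headroom for the latency and overhead that the ideal figure leaves out.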

Data algorithms compose as plug-ins: Three intervention points, one interface

GRESO picks prompts before rollout; dynamic sampling filters trajectories after rollout; buffer replay reuses trajectories at batch serving. Three different points along the RL data path, all expressed through the same dataflow-layer interface.

Math accuracy vs. generated rollouts for the three data algorithms. Different trade-offs along the data path.

Dynamic sampling lifts final accuracy but inflates generation cost ~3.5× (~200k → ~700k rollouts) — post-filtering throws a lot away. GRESO and buffer replay sit on the opposite side: both reach baseline accuracy with far fewer generated rollouts.

Each algorithm is a self-contained policy class the dataflow layer imports as a plug-in — prompt selection, rollout filtering, trajectory replay become composable, not system-wide rewrites.
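The three intervention points can be pictured as three hooks on one base class. The sketch below is illustrative only: `DataPolicy`, `Pipeline`, and the hook names are hypothetical, and the three example policies are simplified stand-ins for the roles described above (pre-rollout prompt selection, post-rollout filtering, replay at batch serving), not the actual GRESO, dynamic sampling, or buffer replay algorithms.

```python
from typing import List, Sequence, Set


class DataPolicy:
    """Hypothetical plug-in base class; every hook defaults to a no-op."""

    def select_prompts(self, prompts: List[str]) -> List[str]:
        return prompts                      # hook 1: before rollout

    def keep_group(self, rewards: List[float]) -> bool:
        return True                         # hook 2: after rollout

    def serve_batch(self, fresh: List[str], replay: List[str], size: int) -> List[str]:
        return fresh[:size]                 # hook 3: at batch serving


class SkipKnownUninformative(DataPolicy):
    """Stand-in for pre-rollout selection: skip prompts flagged earlier."""

    def __init__(self, uninformative: Set[str]) -> None:
        self.uninformative = uninformative

    def select_prompts(self, prompts: List[str]) -> List[str]:
        return [p for p in prompts if p not in self.uninformative]


class DropIdenticalRewards(DataPolicy):
    """Stand-in for post-rollout filtering: drop groups with no reward spread."""

    def keep_group(self, rewards: List[float]) -> bool:
        return len(set(rewards)) > 1


class ReplayFill(DataPolicy):
    """Stand-in for replay: top up a short batch with stored trajectories."""

    def serve_batch(self, fresh: List[str], replay: List[str], size: int) -> List[str]:
        return (fresh + replay)[:size]


class Pipeline:
    """Applies each plug-in's hooks in order."""

    def __init__(self, policies: Sequence[DataPolicy]) -> None:
        self.policies = list(policies)

    def select_prompts(self, prompts: List[str]) -> List[str]:
        for policy in self.policies:
            prompts = policy.select_prompts(prompts)
        return prompts

    def keep_group(self, rewards: List[float]) -> bool:
        return all(policy.keep_group(rewards) for policy in self.policies)

    def serve_batch(self, fresh: List[str], replay: List[str], size: int) -> List[str]:
        batch = fresh[:size]
        for policy in self.policies:
            batch = policy.serve_batch(batch, replay, size)
        return batch
```

A configuration such as `Pipeline([SkipKnownUninformative({"p2"}), DropIdenticalRewards(), ReplayFill()])` combines all three without any of the classes knowing about the others, which mirrors the composability claim in the text.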

Full ablations, hyperparameters, and experimental setup are in the paper and the code repo.

§BibTeX

@article{zheng2026astraflow,
  title   = {AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs},
  author  = {Zheng, Haizhong and Di, Yizhuo and Wang, Jiahui and Jin, Shuowei
             and Liu, Xueshen and Wu, Yongji and Mao, Z. Morley and Stoica, Ion
             and Zhao, Jiawei and Chen, Beidi},
  journal = {arXiv preprint},
  year    = {2026}
}