Dataflow-Oriented Reinforcement Learning for (Multi-)Agentic LLMs
Multi-policy training, elastic scaling, heterogeneous GPUs, and cross-region rollout. No feature-specific system engineering needed.
[ 02 ] · multi-policy: Faster than verl-based training, with comparable or better accuracy.
[ 03 ] · sparse transfer: Sparse RL weight deltas shrink ~28 GB full syncs to ~1.5 GB for remote rollout.
AstraFlow is an open-source, dataflow-oriented reinforcement learning system built for flexibility and scale. It natively supports the following for LLM RL training — without any feature-specific system engineering:
Build your own agentic workflow for RL training on AstraFlow.
Scale RL to agentic LLMs — multi-policy collaborative training, dynamic execution, elastic / heterogeneous / cross-region compute — under one system.
Existing LLM RL systems are trainer-centered. A single trainer loop owns rollout scheduling, data movement, optimization, and weight sync. Multi-agent serving systems run rich agent workflows but don't train. Recent systems add multi-policy / elastic / heterogeneous rollout as ad-hoc patches — they're hard to combine, hard to reuse, and require feature-specific engineering each time.
The root cause is the lack of clean abstraction boundaries among rollout execution, dataflow management, training, and weight transfer. Compute decoupling (separating rollout from training computation) is just placement; it isn't a principled component abstraction. Without those boundaries, new capabilities can't be supported by the architecture itself; they have to be hand-engineered onto a trainer-centered loop.
RaaS nodes pull rollout tasks from the data layer and push completed trajectories back, while trainers independently pull batches. The layer exposes programmable dataflow policies: selective rollout, curriculum scheduling, filtering, sampling, replay, mixing, and staleness correction, without requiring changes to trainers, RaaS nodes, or orchestration. It also regulates autonomous components by throttling slow rollouts, prioritizing fresh trajectories, blocking unsuitable batches via backpressure, and routing multi-policy data using metadata such as policy, model version, timestamp, reward, and task type.
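The pull/push contract described above can be illustrated with a minimal in-memory sketch. It is not AstraFlow's implementation: the `DataLayer` and `Trajectory` names, the `max_staleness` and `max_buffered` parameters, and every method signature are assumptions made for illustration. The sketch shows three of the listed behaviors: backpressure on rollout, staleness filtering, and routing by policy metadata.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    policy: str          # which policy produced it (used for multi-policy routing)
    model_version: int   # weight version used during rollout
    reward: float


class DataLayer:
    """Hypothetical data layer: one task queue in, per-policy trajectory buffers out."""

    def __init__(self, max_staleness: int, max_buffered: int):
        self.tasks = deque()
        self.buffers = {}  # policy name -> deque of trajectories
        self.max_staleness = max_staleness
        self.max_buffered = max_buffered

    def add_task(self, prompt: str) -> None:
        self.tasks.append(prompt)

    def pull_task(self):
        """Called by a rollout node. Returns None when there is no work or
        when the buffers are full, which throttles rollout (backpressure)."""
        buffered = sum(len(b) for b in self.buffers.values())
        if not self.tasks or buffered >= self.max_buffered:
            return None
        return self.tasks.popleft()

    def push_trajectory(self, traj: Trajectory) -> None:
        """Called by a rollout node. Routes the trajectory by its policy tag."""
        self.buffers.setdefault(traj.policy, deque()).append(traj)

    def pull_batch(self, policy: str, current_version: int, batch_size: int):
        """Called by a trainer. Drops trajectories that are too stale and
        returns None until a full batch of fresh ones is available."""
        buf = self.buffers.setdefault(policy, deque())
        fresh = deque(
            t for t in buf
            if current_version - t.model_version <= self.max_staleness
        )
        self.buffers[policy] = fresh
        if len(fresh) < batch_size:
            return None
        return [fresh.popleft() for _ in range(batch_size)]
```

In this sketch neither side calls the other: a rollout node only sees `pull_task` and `push_trajectory`, and a trainer only sees `pull_batch`, so changing the staleness rule touches the data layer alone.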
The RaaS contract makes rollout execution substitutable: any efficient agent-serving runtime plugs in. The runtime doesn't need to know how trajectories are sampled, filtered, or assigned to trainers; the trainer doesn't need to know which runtime produced a trajectory. AstraFlow can reuse specialized agent-serving systems as backends instead of re-implementing their internal logic.
RaaS also makes capacity elastic. Adding capacity is just launching more nodes connected to the same dataflow and weight interfaces; removed nodes, slow workers, and failures affect only the rate at which trajectories arrive, not the trainer loop. Heterogeneous and cross-region settings, where rollout services differ in latency, throughput, and bandwidth, fall out of the same contract.
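A toy sketch of the substitutability and elasticity argument follows. The `RolloutBackend` protocol, `ToyBackend`, `RaaSNode`, and `drain` are hypothetical names, and standard-library queues stand in for the dataflow interfaces; none of this is AstraFlow's API. The point it illustrates is that the consumer of `results` sees the same trajectories whether one node or several are attached.

```python
import queue
from typing import Protocol


class RolloutBackend(Protocol):
    """Hypothetical contract: turn one task into one finished trajectory."""

    def generate(self, prompt: str) -> str: ...


class ToyBackend:
    """Stand-in for a real agent-serving runtime."""

    def __init__(self, name: str):
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"{self.name} answered {prompt}"


class RaaSNode:
    """A node only knows the two queues, not the trainer or its peers."""

    def __init__(self, backend: RolloutBackend, tasks: queue.Queue, results: queue.Queue):
        self.backend = backend
        self.tasks = tasks
        self.results = results

    def step(self) -> bool:
        """Pull one task, run it, push the result. False if no task was available."""
        try:
            prompt = self.tasks.get_nowait()
        except queue.Empty:
            return False
        self.results.put(self.backend.generate(prompt))
        return True


def drain(nodes):
    """Round-robin over the nodes until the task queue is empty.
    Returns how many tasks each node completed."""
    counts = [0] * len(nodes)
    progressed = True
    while progressed:
        progressed = False
        for i, node in enumerate(nodes):
            if node.step():
                counts[i] += 1
                progressed = True
    return counts
```

Adding a node here means constructing one more `RaaSNode` on the same queues; nothing on the consuming side changes.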
A trainer consumes batches and publishes updated weights via a trainer-side interface. To the trainer, AstraFlow is just a streaming corpus plus a weight publication target: it does not manage rollout workers, serve weights, or coordinate with other trainers. This makes training backends interchangeable: any RL, SFT, or fault-tolerant backend can participate by consuming batches and publishing weights. For multi-policy training, each policy has its own trainer and weight stream while the dataflow layer routes trajectories.
Weight transfer is decoupled from training: model versions are stored and exposed to RaaS nodes, which pull updates when appropriate. The same interface supports full-model transfer, sparse deltas, and version-aware refresh without adding trainer-side complexity.
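As a rough illustration of the trainer-side and weight-side contracts, the sketch below pairs a versioned store with a toy training step. `WeightStore`, `pull_if_newer`, and `trainer_step` are assumed names, and the "update" is a placeholder scalar nudge rather than a real RL optimizer step.

```python
from typing import Dict, List, Optional, Tuple


class WeightStore:
    """Hypothetical versioned store: trainers publish, rollout nodes pull."""

    def __init__(self):
        self._versions: Dict[int, Dict[str, float]] = {}
        self._latest = 0

    def publish(self, weights: Dict[str, float]) -> int:
        """Store a snapshot and return its new version number."""
        self._latest += 1
        self._versions[self._latest] = dict(weights)
        return self._latest

    def pull_if_newer(self, have_version: int) -> Optional[Tuple[int, Dict[str, float]]]:
        """Version-aware refresh: return (version, weights) only if the
        caller is behind, otherwise None."""
        if self._latest <= have_version:
            return None
        return self._latest, self._versions[self._latest]


def trainer_step(weights: Dict[str, float], rewards: List[float],
                 store: WeightStore, lr: float = 0.1) -> int:
    """Toy update: consume a batch, adjust weights, publish. The trainer
    never addresses a rollout node directly."""
    mean_reward = sum(rewards) / len(rewards)
    weights["w"] += lr * mean_reward  # placeholder for a real optimizer step
    return store.publish(weights)
```

A rollout node would call `pull_if_newer` with the version it currently holds, at a moment of its own choosing, which is what keeps slow nodes off the trainer's critical path in this sketch.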
By design, AstraFlow natively supports the following capabilities:
AstraFlow expresses multi-policy workflows as workflow-level changes, not pipeline modifications. To the best of our knowledge, AstraFlow is the first LLM RL framework to support fully asynchronous multi-policy collaborative training. Dr.MAS inherits verl's colocated synchronous execution, so long-tail multi-agent rollouts stall the iteration — AstraFlow's async dataflow doesn't.
| Method | AIME24 | AIME25 | MATH500 | Minerva | Avg Acc. | Time / iter (s) |
|---|---|---|---|---|---|---|
| Solver | 42.9 | 31.8 | 90.5 | 39.2 | 51.1 | — |
| Solver + Verifier (verl) | 44.6 (+1.7) | 41.5 (+9.7) | 90.7 (+0.2) | 40.9 (+1.7) | 54.4 (+3.3) | 212.64 |
| Solver + Verifier (AstraFlow) | 47.3 (+4.4) | 40.6 (+8.8) | 92.9 (+2.4) | 45.0 (+5.8) | 56.5 (+5.4) | 77.65 |
The same abstraction generalizes beyond math. On code (LiveCodeBench v5/v6, Codeforces), Solver + Test-Case Generator improves the matched Solver baseline from 30.29% to 34.55% average (+4.26 pp), and Solver + Selector reaches 32.14% (+1.85 pp). Both workflows are workflow-level changes; the trainer, rollout, dataflow, and weight interfaces are identical to the math run.
| Method | LCB v5 | LCB v6 | Codeforces | Avg |
|---|---|---|---|---|
| Solver | 36.83 | 32.86 | 21.20 | 30.29 |
| Solver + Selector | 38.32 (+1.49) | 35.43 (+2.57) | 22.67 (+1.47) | 32.14 (+1.85) |
| Solver + Test-Case Gen. | 41.62 (+4.79) | 36.29 (+3.43) | 25.74 (+4.54) | 34.55 (+4.26) |
The dataflow layer observes how much each rollout pool produces, how much the trainer consumes, and how long the trainer spends waiting. It exports a target pool size via a simple dead-band policy. An agentic maintainer (Claude Code in the experiment) reads the report and resizes the pool — launching or retiring RaaS instances per the suggested target. AstraFlow contains no scheduler-specific code.
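A dead-band policy of this kind can be sketched in a few lines. The function name, the two thresholds, and the one-instance step size below are illustrative assumptions; the source does not specify the actual signals' weighting or the band limits.

```python
def target_pool_size(current: int, produced: int, consumed: int,
                     wait_frac: float, *, wait_high: float = 0.10,
                     surplus_high: float = 1.25,
                     min_size: int = 1, max_size: int = 64) -> int:
    """Suggest a rollout pool size from observed dataflow statistics.

    produced / consumed: trajectories generated by the pool and used by the
    trainer over the last window. wait_frac: fraction of the window the
    trainer spent waiting for data. All thresholds are hypothetical.
    """
    if wait_frac > wait_high:
        target = current + 1   # trainer is starved: grow the pool
    elif consumed > 0 and produced / consumed > surplus_high:
        target = current - 1   # rollouts outpace training: shrink the pool
    else:
        target = current       # inside the dead band: leave the pool alone
    return max(min_size, min(max_size, target))
```

The dead band is the region where neither condition fires; it prevents the maintainer from resizing the pool on every small fluctuation.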
Three RaaS pools run at 100% / 60% / 30% relative throughput (induced by 700 W / 400 W / 250 W per-GPU power caps); two of them are remote and shaped to 4 Gbit/s + 300 ms RTT. All three pools contribute every iteration. AstraFlow reaches 67.6 average math accuracy, comparable to a homogeneous local baseline. The trainer's downtime is essentially flat; even when remote pools finish weight transfer late, ongoing training masks the cost.
RL weight updates under bf16 are bit-exactly sparse: most parameters are unchanged from one iteration to the next. Sparsity is largely independent of model size and task — only learning rate moves it, and even at the most aggressive setting it stays above 0.97.
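Bit-exact sparsity means a delta can be built by comparing raw 16-bit patterns, with no tolerance threshold. The sketch below assumes each tensor is available as a flat array of unsigned 16-bit words (the bf16 bit patterns); the function names and the index/value encoding are illustrative, not AstraFlow's wire format, and a production version would be vectorized rather than a Python loop.

```python
from array import array


def encode_delta(old: array, new: array):
    """Return (indices, values) for every position whose bit pattern changed.
    Both inputs are array('H') holding raw bf16 bit patterns."""
    if len(old) != len(new):
        raise ValueError("tensors must have the same length")
    idx, vals = array("I"), array("H")
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            idx.append(i)
            vals.append(b)
    return idx, vals


def apply_delta(old: array, idx: array, vals: array) -> array:
    """Reconstruct the new tensor on the receiving side."""
    out = array("H", old)
    for i, v in zip(idx, vals):
        out[i] = v
    return out


def sparsity(old: array, new: array) -> float:
    """Fraction of positions that are bit-identical between versions."""
    same = sum(1 for a, b in zip(old, new) if a == b)
    return same / len(old)
```

Because the comparison is on bits rather than values, applying the delta reproduces the new weights exactly on the receiver.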
That drops the per-iteration payload from a ~28 GB full sync to ~1.5 GB of delta bytes on Qwen3-14B. Most remote transfers complete in tens of seconds even at 4 Gbit/s + 300 ms RTT. The request-based delta-pull keeps slow links off the trainer's critical path.
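For a sense of scale, the line-rate arithmetic behind these numbers is worked out below. This is an idealized lower bound that assumes decimal gigabytes and a fully utilized 4 Gbit/s link; it ignores the 300 ms RTT, protocol overhead, and encoding time, so real transfers take longer.

```python
def line_rate_seconds(payload_gb: float, link_gbit_s: float) -> float:
    """Ideal transfer time: bytes to bits, divided by link rate."""
    return payload_gb * 8 / link_gbit_s


full_sync = line_rate_seconds(28.0, 4.0)   # 56.0 seconds at best
delta_sync = line_rate_seconds(1.5, 4.0)   # 3.0 seconds at best
reduction = 28.0 / 1.5                     # roughly 18.7x fewer bytes
```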
GRESO picks prompts before rollout; dynamic sampling filters trajectories after rollout; buffer replay reuses trajectories at batch serving. Three different points along the RL data path, all expressed through the same dataflow-layer interface.
Dynamic sampling lifts final accuracy but inflates generation cost ~3.5× (~200k → ~700k rollouts) — post-filtering throws a lot away. GRESO and buffer replay sit on the opposite side: both reach baseline accuracy with far fewer generated rollouts.
Each algorithm is a self-contained policy class that the dataflow layer imports as a plug-in, so prompt selection, rollout filtering, and trajectory replay become composable rather than system-wide rewrites.
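One way to picture such plug-ins is a base class with one hook per point on the data path. The hook names, `run_iteration`, and all three classes are hypothetical. `SkipKnownUninformative` is a deliberately simplified stand-in for pre-rollout selection and does not reproduce GRESO itself; `DynamicSampling` shows the common rule of dropping prompt groups whose rewards are all identical.

```python
import random
from collections import deque


class DataPolicy:
    """Hypothetical base class: every hook defaults to a pass-through."""

    def select_prompts(self, prompts):   # before rollout
        return list(prompts)

    def filter_groups(self, groups):     # after rollout: prompt -> rewards
        return dict(groups)

    def serve_batch(self, fresh):        # at batch serving
        return list(fresh)


class SkipKnownUninformative(DataPolicy):
    """Pre-rollout: skip prompts whose last group had identical rewards."""

    def __init__(self):
        self.uninformative = set()

    def select_prompts(self, prompts):
        return [p for p in prompts if p not in self.uninformative]

    def filter_groups(self, groups):
        for prompt, rewards in groups.items():
            if len(set(rewards)) <= 1:
                self.uninformative.add(prompt)  # remember, but do not drop
        return dict(groups)


class DynamicSampling(DataPolicy):
    """Post-rollout: keep only groups with non-identical rewards."""

    def filter_groups(self, groups):
        return {p: r for p, r in groups.items() if len(set(r)) > 1}


class BufferReplay(DataPolicy):
    """Batch serving: append up to replay_k previously served items."""

    def __init__(self, capacity: int, replay_k: int, seed: int = 0):
        self.buffer = deque(maxlen=capacity)
        self.replay_k = replay_k
        self.rng = random.Random(seed)

    def serve_batch(self, fresh):
        k = min(self.replay_k, len(self.buffer))
        replayed = self.rng.sample(list(self.buffer), k)
        self.buffer.extend(fresh)
        return list(fresh) + replayed


def run_iteration(policies, prompts, rollout_fn):
    """Apply every policy at each of the three points, in order."""
    for p in policies:
        prompts = p.select_prompts(prompts)
    groups = {q: rollout_fn(q) for q in prompts}
    for p in policies:
        groups = p.filter_groups(groups)
    batch = [(q, r) for q, rewards in groups.items() for r in rewards]
    for p in policies:
        batch = p.serve_batch(batch)
    return batch
```

Composing algorithms in this sketch is a matter of passing a longer `policies` list to `run_iteration`; the rollout function and the consumer of the batch stay unchanged.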
@article{zheng2026astraflow,
title = {AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs},
author = {Zheng, Haizhong and Di, Yizhuo and Wang, Jiahui and Jin, Shuowei
and Liu, Xueshen and Wu, Yongji and Mao, Z. Morley and Stoica, Ion
and Zhao, Jiawei and Chen, Beidi},
journal = {arXiv preprint},
year = {2026}
}