System Overview¶
AstraFlow is a fully asynchronous reinforcement learning system where inference, training, and data orchestration run as independent services coordinated via HTTP. The system is designed for distributed GPU clusters and supports elastic scaling of both inference and training.
Components¶
The system has four cooperating components:
Dataflow — Central orchestrator. Manages a global pool of RaaS instances, acquires rollout data, buffers and serves training batches, and coordinates weight synchronization.
RaaS (Remote Agentic Serving) — Inference service. Launches and manages vLLM/SGLang engines, executes rollout workflows with reward functions, and loads updated weights. Multiple RaaS instances can run in parallel, and instances can join or leave dynamically.
Trainer — Distributed training engine (FSDP/Megatron). Fetches batches from Dataflow, runs RL algorithm updates (GRPO, PPO, M2PO, etc.), and pushes updated weights. Multiple trainers can run in parallel for multi-model training.
WeightManager — Weight transfer subsystem. Copies FSDP weights to shared-memory buffers and transfers them to RaaS via TCP or NCCL.
Architecture¶
┌─────────────────────────────────────────────────────────┐
│ Dataflow │
│ (Orchestration + Buffering) │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Data │ │ Data │ │
│ │ Acquisition │ │ Serving │ │
│ └──────┬───────┘ └──────┬───────┘ │
└──────────────┼─────────────────────────┼────────────────┘
│ │
submit / pull │ │ get_batch
┌──────┴──────┐ ┌───────┴──────┐
│ │ │ │
┌───────▼───────┐ ┌───▼───────┐ ┌▼───────────┐ ┌▼──────────┐
│ RaaS #1 │ │ RaaS #2 │ │ Trainer │ │ Trainer │
│ (SGLang) │ │ (SGLang) │ │ (model0) │ │ (model1) │
│ 4 GPUs │ │ 4 GPUs │ │ 2 GPUs │ │ 2 GPUs │
└───────┬───────┘ └───┬───────┘ └─────┬──────┘ └──────┬────┘
│ │ │ │
│ weight sync (TCP or NCCL) │
◄─────────────────────────────┘───────────────┘
Inter-Component Call Graph¶
All HTTP calls between the three components, organized by phase.
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ Trainer │ │ Dataflow │ │ RaaS │
│ (train_worker/) │ │ (astraflow/) │ │ (raas/) │
└─────────┬───────────┘ └──────────┬──────────┘ └──────────┬──────────┘
│ │ │
╔═════╧════════════════════════════╧═══════════════════════════╧═════╗
║ STARTUP ║
╚═════╤════════════════════════════╤═══════════════════════════╤═════╝
│ │ │
│ │ POST /register_raas │
│ │ <─────────────────────────┤
│ │ │
│ │ POST /register_workflow
│ ├──────────────────────────>│
│ │ │
│ POST /ready │ │
├───────────────────────────>│ │
│ │ │
╔═════╧════════════════════════════╧═══════════════════════════╧═════╗
║ TRAINING LOOP (each iteration) ║
╚═════╤════════════════════════════╤═══════════════════════════╤═════╝
│ │ │
│ │ GET /availability │
│ ├──────────────────────────>│
│ │ │
│ │ POST /submit │
│ ├──────────────────────────>│
│ │ │
│ │ POST /pull │
│ ├──────────────────────────>│
│ │ │
│ GET /batch │ │
├───────────────────────────>│ │
│ │ │
│ ... trainer runs RL update ... │
│ │ │
│ POST /notify_version │ │
├───────────────────────────>│ │
│ │ │
│ │ POST /notify_version │
│ │ (one call per model │
│ │ per RaaS instance) │
│ ├──────────────────────────>│
│ │ │
│ POST /register_sglang_instance (first pull) │
│ <─────────────────────────────────────────────────────┤
│ │ │
│ POST /request_transfer (each pull) │
│ <─────────────────────────────────────────────────────┤
│ │ │
╔═════╧════════════════════════════╧═══════════════════════════╧═════╗
║ EVALUATION (triggered after weight sync) ║
╚═════╤════════════════════════════╤═══════════════════════════╤═════╝
│ │ │
│ │ POST /eval_start │
│ ├──────────────────────────>│
│ │ │
│ │ POST /eval_submit │
│ ├──────────────────────────>│
│ │ │
│ │ POST /eval_pull │
│ ├──────────────────────────>│
│ │ │
│ │ POST /eval_end │
│ ├──────────────────────────>│
│ │ │
╔═════╧════════════════════════════╧═══════════════════════════╧═════╗
║ LIFECYCLE ║
╚═════╤════════════════════════════╤═══════════════════════════╤═════╝
│ │ │
│ │ GET /status │
│ ├──────────────────────────>│
│ │ │
│ │ POST /shutdown │
│ ├──────────────────────────>│
│ │ │
Arrow direction indicates who initiates the HTTP call. During weight sync, RaaS calls Trainer directly (bypassing Dataflow) for the bulk TCP transfer.
Dynamic RaaS Pool¶
RaaS instances are managed by a global RaaSPool shared across all agents. Instances register and deregister at runtime via HTTP:
Join: A new RaaS instance calls
POST /register_raas. The pool initializes the engine, catches it up to the current weight version (so it never serves stale rollouts), then adds it to the live pool.Leave: An instance can deregister via
POST /deregister_raas, or the pool automatically detects failures via heartbeat monitoring.Failure detection: Data-path calls (submit, pull) mark failing instances as suspect. A background heartbeat thread confirms via
GET /status(which reflects actual engine health, not a trivial OK) and deregisters dead instances. This two-phase approach avoids false positives from transient network issues.Capacity-based routing:
submit_autoroutes each rollout request to the RaaS instance with the most available slots, balancing load across the pool.Parallel collect:
pull_completedfans out to all live instances in parallel and merges results.
This means you can scale inference throughput by adding more RaaS instances at any time — no restart required.
Multi-Model, Multi-Trainer¶
AstraFlow supports training multiple models simultaneously (e.g., a solver and a verifier in solve-and-verify). Each model has:
Its own Trainer process with dedicated GPUs and a separate weight transfer port.
Its own buffer inside Dataflow with independent staleness filtering.
Its own SGLang engine inside RaaS (e.g.,
sglang[model0]:d2+sglang[model1]:d2).
Coordination uses a version barrier: each trainer independently notifies its version after a training step. Weight sync to RaaS is triggered only after all models reach the same version. The last trainer to arrive becomes the “leader” and initiates the coordinated weight load across all RaaS instances. Other trainers block until the leader completes.
Training Loop¶
Each iteration proceeds as follows:
Data Acquisition (background, continuous): Dataflow checks RaaS pool availability, submits prompts to the least-loaded instance, and collects completed rollouts from all instances in parallel. Rollouts are filtered by staleness and reward distribution, then buffered.
Batch Fetch: Trainer requests a batch from Dataflow via
GET /batch. Dataflow mixes fresh rollouts and replay data according toreplay_ratio, applies staleness filtering, and returns a padded batch.Train Step: Trainer runs forward/backward pass with the RL algorithm (PPO/GRPO/M2PO), computes loss, and updates model weights.
Weight Sync:
Trainer stages new weights to shared memory (via
WeightManager.offload) and callsPOST /notify_versionon Dataflow.In multi-model setups, Dataflow’s Python-side version barrier waits until every registered
model_idreports the same version. For non-eval steps, each per-model weight load is fired asynchronously (fire-and-forget) and the staleness filter absorbs briefly-stale rollouts. For eval steps, the weight load is deferred to the barrier leader so every model updates atomically.The leader (or each async firer) fans out one
POST /notify_versionper model to every live RaaS instance viaRaaSPool.notify_version(). Each RaaS pulls the model’s weights over TCP from the sender agent, pauses that model’s engine, loads from/dev/shm, and resumes.
Repeat with the next iteration.
Async Execution Model¶
Each component runs as a separate process:
Component |
Execution Model |
Typical GPUs |
|---|---|---|
Dataflow |
Threaded (Flask + ThreadPoolExecutor) |
CPU-only |
RaaS (×N) |
Async (FastAPI + asyncio) |
4+ GPUs each |
Trainer (×M) |
Synchronous (PyTorch DDP/FSDP) |
2–8 GPUs each |
Data acquisition runs two background threads inside Dataflow:
Submit thread: Checks pool availability, gathers prompts, submits in parallel via ThreadPoolExecutor.
Collect thread: Polls all RaaS instances for completed rollouts, applies filtering, ingests into buffer.
This decoupling means inference and training overlap — while the trainer processes one batch, the next batch of rollouts is already being generated across all RaaS instances.
Version Tracking and Staleness¶
Every rollout is tagged with the model weight version that produced it. During batch retrieval, samples older than current_version - max_staleness are dropped. This prevents training on outdated rollouts while allowing a configurable lag for throughput.
When a new RaaS instance joins mid-training, the pool sends it the current weight version and sender endpoints so it loads the latest weights before serving any traffic.
Fault Tolerance¶
RaaS heartbeat: Background thread monitors all instances via
GET /statusevery 10s. Two consecutive failures trigger deregistration. The pool continues operating with remaining instances.Suspect-and-confirm: Data-path failures mark instances as suspect (skipped for routing) but don’t immediately deregister — the heartbeat thread confirms via
GET /statusfirst.Empty pool: If all RaaS instances fail, submit/collect become no-ops until a new instance registers. Training pauses at the next
get_batchcall (blocks until data is available).Checkpointing: Both model weights and buffer state can be saved and restored. On recovery, trainers report their recovered version so staleness filtering works correctly.