Code

Reinforcement learning for code-generation reasoning with RLVR, rewarded by test-case execution.

Code recipes: examples/code/ and examples/code-multi-agent/

Each recipe ships an all-in-one launch script under scripts/ and its config under yaml/.

LiveCodeBench v5 eval data needs a one-time manual download — see Eval Dataset Setup below.

Eval Dataset Setup

The recipes evaluate on HumanEval, LiveCodeBench v5, and DeepCoder Codeforces. Only LiveCodeBench v5 needs a one-time manual download before launching — otherwise the periodic eval steps fail.

Download the AReaL-boba-2-RL-Code dataset from the repo root:

huggingface-cli download inclusionAI/AReaL-boba-2-RL-Code \
  --repo-type dataset \
  --local-dir ./data-data/AReaL-boba-2-RL-Code

This provides ./data-data/AReaL-boba-2-RL-Code/code_benchmark/lcb_v5/test.jsonl.

HumanEval ships vendored in the repo (astraEnv/human-eval/data/HumanEval.jsonl), and DeepCoder Codeforces is loaded directly from Hugging Face during eval — neither needs a download.

The training dataset (DeepCoder-Preview, primeintellect subset) is fetched from Hugging Face automatically on first run, so LiveCodeBench v5 above is the only manual download needed to run a recipe end to end.

Qwen3-8B — 8 GPUs (single-agent)

Single-agent code-generation RL on one 8-GPU node — 4 GPUs for inference, 4 for training. It comes in two variants that differ only in weight transfer mode:

A near-identical single-agent baseline also lives at code-multi-agent/qwen3-8b-single-agent-m2po-full/ as the fair-comparison reference for the codegen+verifier recipe below.

Run

One script launches all three processes — the AstraFlow service, the RaaS inference server, and the trainer:

# delta weight transfer
bash examples/code/qwen3-8b-m2po-delta/scripts/run_qwen3-8b-m2po-delta.sh

# full weight transfer
bash examples/code/qwen3-8b-m2po-full/scripts/run_qwen3-8b-m2po-full.sh

Settings

Setting

Value

Model

Qwen3-8B

GPUs

8 — RaaS ×4 (SGLang, DP=4), Trainer ×4 (FSDP, DP=4)

Algorithm

M2PO (m2_threshold 0.01)

Weight transfer

TCP — full, or delta (delta_full_sync_interval 10)

Context length

12288

Max new tokens

4096

Rollouts per prompt

8 (temperature 1.0)

Train batch size

128

Learning rate

3e-6 (Adam, constant schedule)

Train steps

1200

Workflow / reward

livecodebench_single_turn / livecodebench_reward

Train dataset

DeepCoder-Preview (primeintellect subset)

Eval datasets

HumanEval, LiveCodeBench v5, DeepCoder Codeforces

Qwen3-8B codegen + verifier — 2 nodes × 8 GPUs (multi-agent)

A two-agent recipe: model0 generates code, model1 generates verification test cases, and the two are co-trained against each other’s outputs. It spans two 8-GPU nodes — both nodes host both models for inference, while the two trainers live on the starting node:

Run

Launch the all-in-one script on the first node; it starts the AstraFlow service, RaaS-B, and both trainers, then prints the exact command to run on the second node (RaaS-A):

bash examples/code-multi-agent/qwen3-8b-codegen-verifier-m2po-full-2node/scripts/run_qwen3-8b-codegen-verifier-m2po-full-2node.sh

Settings

Setting

Value

Models

model0 codegen (Qwen3-8B), model1 testcase verifier (Qwen3-8B)

GPUs

16 — RaaS-A ×8 (SGLang, model0 DP=6 / model1 DP=2), RaaS-B ×4 (SGLang, model0 DP=3 / model1 DP=1), Trainer model0 ×2 (FSDP, DP=2), Trainer model1 ×2 (FSDP, DP=2)

Algorithm

M2PO (m2_threshold 0.01)

Weight transfer

TCP — full

Context length

12288

Max new tokens

4096

Rollouts per prompt

8 (temperature 1.0)

Train batch size

256

Learning rate

3e-6 (Adam, constant schedule)

Train steps

800

Workflow / reward

code_actor_and_verify_v2 / livecodebench_reward (2 generated test cases, verify_timeout 6s)

Train dataset

DeepCoder-Preview (primeintellect subset)

Eval datasets

HumanEval, LiveCodeBench v5, DeepCoder Codeforces