Code¶
Reinforcement learning for code-generation reasoning with RLVR, rewarded by test-case execution.
Code recipes: examples/code/ and examples/code-multi-agent/
Each recipe ships an all-in-one launch script under scripts/ and its config under yaml/.
LiveCodeBench v5 eval data needs a one-time manual download — see Eval Dataset Setup below.
Eval Dataset Setup¶
The recipes evaluate on HumanEval, LiveCodeBench v5, and DeepCoder Codeforces. Only LiveCodeBench v5 needs a one-time manual download before launching — otherwise the periodic eval steps fail.
Download the AReaL-boba-2-RL-Code dataset from the repo root:
huggingface-cli download inclusionAI/AReaL-boba-2-RL-Code \
--repo-type dataset \
--local-dir ./data-data/AReaL-boba-2-RL-Code
This provides ./data-data/AReaL-boba-2-RL-Code/code_benchmark/lcb_v5/test.jsonl.
HumanEval ships vendored in the repo (astraEnv/human-eval/data/HumanEval.jsonl), and
DeepCoder Codeforces is loaded directly from Hugging Face during eval — neither needs
a download.
The training dataset (DeepCoder-Preview, primeintellect subset) is fetched from
Hugging Face automatically on first run, so LiveCodeBench v5 above is the only manual
download needed to run a recipe end to end.
Qwen3-8B — 8 GPUs (single-agent)¶
Single-agent code-generation RL on one 8-GPU node — 4 GPUs for inference, 4 for training. It comes in two variants that differ only in weight transfer mode:
code/qwen3-8b-m2po-full/— full weight transfercode/qwen3-8b-m2po-delta/— delta weight transfer (only changed weights are sent)
A near-identical single-agent baseline also lives at code-multi-agent/qwen3-8b-single-agent-m2po-full/ as the fair-comparison reference for the codegen+verifier recipe below.
Run¶
One script launches all three processes — the AstraFlow service, the RaaS inference server, and the trainer:
# delta weight transfer
bash examples/code/qwen3-8b-m2po-delta/scripts/run_qwen3-8b-m2po-delta.sh
# full weight transfer
bash examples/code/qwen3-8b-m2po-full/scripts/run_qwen3-8b-m2po-full.sh
Settings¶
Setting |
Value |
|---|---|
Model |
Qwen3-8B |
GPUs |
8 — RaaS ×4 (SGLang, DP=4), Trainer ×4 (FSDP, DP=4) |
Algorithm |
M2PO ( |
Weight transfer |
TCP — full, or delta ( |
Context length |
12288 |
Max new tokens |
4096 |
Rollouts per prompt |
8 ( |
Train batch size |
128 |
Learning rate |
3e-6 (Adam, constant schedule) |
Train steps |
1200 |
Workflow / reward |
|
Train dataset |
DeepCoder-Preview ( |
Eval datasets |
HumanEval, LiveCodeBench v5, DeepCoder Codeforces |
Qwen3-8B codegen + verifier — 2 nodes × 8 GPUs (multi-agent)¶
A two-agent recipe: model0 generates code, model1 generates verification test cases, and the two are co-trained against each other’s outputs. It spans two 8-GPU nodes — both nodes host both models for inference, while the two trainers live on the starting node:
code-multi-agent/qwen3-8b-codegen-verifier-m2po-full-2node/— codegen + verifier, full weight transfer
Run¶
Launch the all-in-one script on the first node; it starts the AstraFlow service, RaaS-B, and both trainers, then prints the exact command to run on the second node (RaaS-A):
bash examples/code-multi-agent/qwen3-8b-codegen-verifier-m2po-full-2node/scripts/run_qwen3-8b-codegen-verifier-m2po-full-2node.sh
Settings¶
Setting |
Value |
|---|---|
Models |
model0 codegen (Qwen3-8B), model1 testcase verifier (Qwen3-8B) |
GPUs |
16 — RaaS-A ×8 (SGLang, model0 DP=6 / model1 DP=2), RaaS-B ×4 (SGLang, model0 DP=3 / model1 DP=1), Trainer model0 ×2 (FSDP, DP=2), Trainer model1 ×2 (FSDP, DP=2) |
Algorithm |
M2PO ( |
Weight transfer |
TCP — full |
Context length |
12288 |
Max new tokens |
4096 |
Rollouts per prompt |
8 ( |
Train batch size |
256 |
Learning rate |
3e-6 (Adam, constant schedule) |
Train steps |
800 |
Workflow / reward |
|
Train dataset |
DeepCoder-Preview ( |
Eval datasets |
HumanEval, LiveCodeBench v5, DeepCoder Codeforces |