AgentBench
These recipes train multi-turn, interactive agents with M2PO reinforcement learning on the ALFWorld and WebShop environments from AgentBench.
AgentBench recipes:
ALFWorld: examples/alfworld/
WebShop: examples/webshop/
Each recipe ships an all-in-one launch script under scripts/ and its config under yaml/.
Environment setup
The AgentBench recipes need the AgentBench environment installed in its own conda env, plus the task container image:
cd astraEnv/AgentBench
conda create -n agent-bench python=3.9
conda activate agent-bench
pip install -r requirements.txt
conda install podman
# pull the image for the environment you want to train on
docker pull longinyu/agentbench-alfworld # for ALFWorld
docker pull longinyu/agentbench-webshop # for WebShop
The all-in-one run_*.sh starts the task server itself; the standalone 0_<env>_server.sh is provided for split launches.
ALFWorld: Qwen2.5-7B-Instruct, 8 GPUs
Trains an agent to complete embodied household tasks in a text-based environment. Each rollout is a multi-turn episode (up to 15 turns) against an AgentBench task server. The recipe comes in two variants that differ only in weight transfer mode:
qwen2.5-7b-instruct-m2po-full/: full weight transfer
qwen2.5-7b-instruct-m2po-delta/: delta weight transfer (full sync every 10 steps)
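The two variants differ in what the trainer ships to the inference server after a step. The sketch below shows one plausible reading of "delta weight transfer (full sync every 10 steps)": send only the element-wise difference on most steps, and the complete weights on every tenth step. The function names, the dictionary-of-lists weight layout, and the exact step on which the full sync fires are assumptions made for illustration, not the recipe's implementation.

```python
from typing import Dict, List

# Hypothetical weight layout: parameter name -> flat list of values.
Weights = Dict[str, List[float]]

FULL_SYNC_INTERVAL = 10  # from the recipe description: full sync every 10 steps


def transfer_mode(step: int, interval: int = FULL_SYNC_INTERVAL) -> str:
    """Return "full" on every `interval`-th step, "delta" otherwise (assumed schedule)."""
    return "full" if step % interval == 0 else "delta"


def make_delta(old: Weights, new: Weights) -> Weights:
    """Element-wise difference new - old: the payload a delta transfer would ship."""
    return {k: [n - o for n, o in zip(new[k], old[k])] for k in new}


def apply_delta(old: Weights, delta: Weights) -> Weights:
    """Reconstruct the new weights on the receiving side from old weights plus delta."""
    return {k: [o + d for o, d in zip(old[k], delta[k])] for k in old}


if __name__ == "__main__":
    old = {"layer.weight": [1.0, 2.0, 3.0]}
    new = {"layer.weight": [1.5, 2.0, 2.0]}
    assert apply_delta(old, make_delta(old, new)) == new
    print([transfer_mode(step) for step in range(1, 12)])
```

In a real system the periodic full sync would typically bound any drift that accumulates from repeated delta updates; whether that is the motivation here is not stated in the recipe.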
Run
One script launches four processes — the AgentBench ALFWorld environment server, the AstraFlow service, the RaaS inference server, and the trainer. The all-in-one script starts the local environment server itself; the standalone 0_alfworld_server.sh is provided for split launches.
# full weight transfer
bash examples/alfworld/qwen2.5-7b-instruct-m2po-full/scripts/run_qwen2.5-7b-instruct-m2po-full.sh
# delta weight transfer
bash examples/alfworld/qwen2.5-7b-instruct-m2po-delta/scripts/run_qwen2.5-7b-instruct-m2po-delta.sh
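To illustrate the single-script pattern, here is a minimal sketch of a launcher that starts four processes in the order listed above and collects their exit codes. The commands are placeholders that only print a message; the real commands live in the recipe's run script and are not reproduced here. The service names, the start order, and the helper functions `launch_all` and `wait_all` are assumptions for illustration.

```python
import subprocess
import sys
from typing import Dict, List, Tuple

# Placeholder commands: each one just prints its role and exits.
SERVICES: List[Tuple[str, List[str]]] = [
    (name, [sys.executable, "-c", f"print('{name} started')"])
    for name in ("env_server", "astraflow_service", "raas_inference", "trainer")
]


def launch_all(services: List[Tuple[str, List[str]]]) -> List[Tuple[str, subprocess.Popen]]:
    """Start every service in order; terminate the ones already running if a spawn fails."""
    procs: List[Tuple[str, subprocess.Popen]] = []
    try:
        for name, cmd in services:
            procs.append((name, subprocess.Popen(cmd)))
    except OSError:
        for _, proc in procs:
            proc.terminate()
        raise
    return procs


def wait_all(procs: List[Tuple[str, subprocess.Popen]]) -> Dict[str, int]:
    """Block until every process exits and return a name -> exit code mapping."""
    return {name: proc.wait() for name, proc in procs}


if __name__ == "__main__":
    print(wait_all(launch_all(SERVICES)))
```

A split launch would correspond to running one of these entries separately, for example the environment server, and starting the rest afterwards.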
Settings
| Setting | Value |
|---|---|
| Model | Qwen2.5-7B-Instruct |
| GPUs | 8: RaaS ×4 (SGLang, DP=4), Trainer ×4 (FSDP, DP=4) |
| Algorithm | M2PO ( |
| Weight transfer | TCP: full, or delta ( |
| Context length | 16384 |
| Max new tokens | 512 |
| Rollouts per prompt | 8 ( |
| Train batch size | 128 |
| Learning rate | 1e-6 (Adam, constant schedule) |
| Train steps | 1200 |
| Workflow / reward | |
| Train dataset | ALFWorld train indices |
| Eval datasets | ALFWorld valid indices (eval max 10 turns, every 10 steps) |
| Environment | ALFWorld AgentBench task server ( |
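The turn cap, context length, and generation limit together bound how much of the context the environment's observations can occupy. The rough budget below assumes that all turns of an episode accumulate in a single 16384-token context, that "Max new tokens" applies per turn, and that every response uses its full 512 tokens. These are assumptions about how a rollout is assembled, not statements from the recipe.

```python
CONTEXT_LENGTH = 16384   # from the settings table
MAX_NEW_TOKENS = 512     # from the settings table, assumed to be a per-turn limit
MAX_TURNS = 15           # training turn cap for ALFWorld


def observation_budget(context_length: int, max_new_tokens: int, max_turns: int) -> int:
    """Tokens left for prompts and observations if every turn generates the maximum."""
    return context_length - max_new_tokens * max_turns


if __name__ == "__main__":
    generated = MAX_NEW_TOKENS * MAX_TURNS
    remaining = observation_budget(CONTEXT_LENGTH, MAX_NEW_TOKENS, MAX_TURNS)
    print(f"worst-case generated tokens: {generated}")            # 7680
    print(f"left for prompt and observations: {remaining}")       # 8704
    print(f"average per turn: {remaining // MAX_TURNS}")          # 580
```

Under these assumptions, roughly half of the context remains for the system prompt and environment observations in the worst case.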
WebShop: Qwen2.5-7B-Instruct, 8 GPUs
Trains an agent to navigate and purchase products on a simulated e-commerce site. Each rollout is a multi-turn episode (up to 10 turns) against an AgentBench task server. It also comes in full and delta transfer variants:
qwen2.5-7b-instruct-m2po-full/: full weight transfer
qwen2.5-7b-instruct-m2po-delta/: delta weight transfer (full sync every 10 steps)
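The shape of a turn-capped, multi-turn rollout can be sketched with a toy environment standing in for the task server. `ToyEnv`, `run_episode`, the string actions, and the reward values are all hypothetical and do not reflect the AgentBench server protocol or the recipe's workflow code; only the 10-turn cap comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyEnv:
    """Stand-in for a task server: the episode succeeds once the agent says "buy"."""
    turns_taken: int = 0

    def reset(self) -> str:
        self.turns_taken = 0
        return "search page"

    def step(self, action: str) -> Tuple[str, float, bool]:
        self.turns_taken += 1
        if action == "buy":
            return "order placed", 1.0, True
        return f"results for {action!r}", 0.0, False


def run_episode(env: ToyEnv, policy: Callable[[List[str]], str],
                max_turns: int = 10) -> Tuple[float, List[str]]:
    """Alternate policy actions and environment observations until done or the cap."""
    history = [env.reset()]
    reward = 0.0
    for _ in range(max_turns):
        action = policy(history)
        observation, reward, done = env.step(action)
        history += [action, observation]
        if done:
            break
    return reward, history


if __name__ == "__main__":
    # This policy searches twice, then buys on the third turn.
    reward, history = run_episode(ToyEnv(), lambda h: "buy" if len(h) >= 5 else "search")
    print(reward, len(history))
```

An episode that never reaches a terminal state simply stops at the cap and keeps the last reward, which is 0.0 in this toy setup.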
Run
The same single-script pattern launches four processes — the AgentBench WebShop environment server, the AstraFlow service, the RaaS inference server, and the trainer. The all-in-one script starts the local environment server itself; the standalone 0_webshop_server.sh is provided for split launches.
# full weight transfer
bash examples/webshop/qwen2.5-7b-instruct-m2po-full/scripts/run_qwen2.5-7b-instruct-m2po-full.sh
# delta weight transfer
bash examples/webshop/qwen2.5-7b-instruct-m2po-delta/scripts/run_qwen2.5-7b-instruct-m2po-delta.sh
Settings
| Setting | Value |
|---|---|
| Model | Qwen2.5-7B-Instruct |
| GPUs | 8: RaaS ×4 (SGLang, DP=4), Trainer ×4 (FSDP, DP=4) |
| Algorithm | M2PO ( |
| Weight transfer | TCP: full, or delta ( |
| Context length | 16384 |
| Max new tokens | 512 |
| Rollouts per prompt | 8 ( |
| Train batch size | 256 |
| Learning rate | 1e-6 (Adam, constant schedule) |
| Train steps | 1200 |
| Workflow / reward | |
| Train dataset | WebShop train indices |
| Eval datasets | WebShop valid indices (eval max 10 turns, every 10 steps) |
| Environment | WebShop AgentBench task server ( |
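A few schedule quantities follow directly from these settings. The sketch below assumes evaluations fire on steps 10, 20, ..., 1200. Because the table does not say whether the train batch size counts prompts or individual rollouts, both readings are shown; neither is confirmed by the recipe.

```python
TRAIN_STEPS = 1200         # from the settings table
EVAL_EVERY = 10            # "every 10 steps"
TRAIN_BATCH_SIZE = 256     # from the settings table
ROLLOUTS_PER_PROMPT = 8    # from the settings table

# Number of evaluation passes over a full run (assuming none at step 0).
num_evals = TRAIN_STEPS // EVAL_EVERY

# Reading A: the batch counts rollouts, so each step covers this many prompts.
prompts_per_step_if_batch_is_rollouts = TRAIN_BATCH_SIZE // ROLLOUTS_PER_PROMPT

# Reading B: the batch counts prompts, so each step consumes this many rollouts.
rollouts_per_step_if_batch_is_prompts = TRAIN_BATCH_SIZE * ROLLOUTS_PER_PROMPT

if __name__ == "__main__":
    print(num_evals)                               # 120
    print(prompts_per_step_if_batch_is_rollouts)   # 32
    print(rollouts_per_step_if_batch_is_prompts)   # 2048
```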