TextCraft (Recursive Agent)#

A multi-turn recursive-agent recipe on TextCraft, reproducing the design from Recursive Agent Optimization (Gandhi et al., 2026). The agent acts in a stateful crafting environment (Minecraft-style recipes + inventory) and can recursively spawn up to 4 sub-agents in parallel per turn — each shares the parent’s inventory by reference, so their work mutates the same state.

A root TextCraft agent recursively spawning sub-agents that share inventory and report back via finish messages

Recipe: examples/textcraft-recursive-agent/qwen3-4b-recursive/

Workflow class: astraflow/core/workflow/impl/textcraft/workflow.py — registered as recursive_agent.

Results#

Validation accuracy (eval-avg/textcraft_val/avg@1) over a 500-step run. Starting from the base Qwen3-4B-Instruct-2507, the recursive agent climbs from ~41% at the first eval (step 20) to ~80% by step 500, peaking at 85% around step 440 — the team-reward broadcast and shared-inventory spawning are enough to learn the multi-turn crafting policy with no SFT bootstrap.

TextCraft validation accuracy (avg@1) rising from 41% to ~80% over 500 training steps

Run#

One-time prep (synthesizes 1000 train + 100 val tasks locally from the bundled recipe DB; no network required):

# Generated automatically on first launch; or force-regenerate:
python -c "from astraflow.dataflow.dataset.textcraft import download_dataset; download_dataset()"

Pre-fetch the model (one-time, ~8 GB):

huggingface-cli download Qwen/Qwen3-4B-Instruct-2507

Run:

bash examples/textcraft-recursive-agent/qwen3-4b-recursive/scripts/run_qwen3-4b-recursive.sh

How it works#

Tool-call protocol#

Each turn the model emits exactly one action block:

<action type="get_info">{"items": ["stick", "oak_planks"]}</action>
<action type="view_inventory">{}</action>
<action type="craft">{"ingredients": {"oak_log": 1}, "target": ["oak_planks", 4]}</action>
<action type="spawn">{"subtasks": [
  {"targets": {"oak_planks": 16}, "max_steps": 8},
  {"targets": {"stick": 8}, "max_steps": 5}
]}</action>
<action type="finish">{"message": "crafted 4 wooden_pickaxe"}</action>

We use an XML/JSON action surface (rather than executable code) because SGLang runs with --skip-tokenizer-init (can’t do string-stop) and we want zero sandbox infrastructure. Base Qwen3 reads the format from the system prompt with no SFT bootstrap needed.

Stateful environment#

TextCraftEnv holds a mutable inventory: dict[str, int] and a shared read-only recipe database (~860 Minecraft recipes bundled in astraflow/core/workflow/impl/textcraft/recipes/).

env.fork(child_task) returns a child env whose inventory is the same dict object as the parent’s. When a sub-agent calls craft, the mutation is visible to the parent. Single asyncio loop → no race.

Spawning#

A <action type="spawn"> block runs all subtasks in parallel via asyncio.gather, each as a full sub-episode with its own forked env and trajectory. Up to 4 children per spawn, up to depth 3 (root + 2 levels of nesting). Sub-agents share the root’s step budget.

Aggregation — finish-message only#

The parent’s view of a spawn is bounded — only each child’s finish_message:

<spawn_result>
<sub_agent_0 task="craft 16 oak_planks">crafted 16 oak_planks</sub_agent_0>
<sub_agent_1 task="craft 8 stick">crafted 8 stick</sub_agent_1>
</spawn_result>

The sub-agent’s intermediate turns (other craft / get_info calls) are NOT shown to the parent. This forces sub-agents to summarize their work in finish messages and bounds context growth across recursion.

Settings#

Setting	Value
Model	Qwen/Qwen3-4B-Instruct-2507
`enable_thinking`	`false`
Algorithm	M2PO
Fine-tuning	Full-FT
Inference backend	SGLang + RaaS + AstraFlow
Tool-call protocol	XML / JSON
`group_size` (n_samples)	8 train / 1 eval
`train_batch_size`	512
`max_steps_per_episode`	50
`lr`	3e-6
Adam (β₁, β₂)	(0.9, 0.95)
`grad_clip`	0 (off)
`max_staleness`	8
`total_train_steps`	1000
Eval cadence	every 20 steps
`max_depth`	3 (safety cap)
`max_breadth`	4 (safety cap)
`max_concurrent_subagents`	8 (bounds K^N RaaS queue blowup)
`delegation_reward_cap`	0.0
`depth_level_weighting`	false
Dataset	TextCraft 1000 train / 100 val (original Minecraft recipes)
SGLang context_length	32768 (bumped from math recipe’s 16k for recursion overhead)

Reference#

Gandhi, Chakraborty, Wang, Kumar, Neubig. Recursive Agent Optimization. arXiv:2605.06639, 2026. https://arxiv.org/abs/2605.06639