Search
Reinforcement learning for search-augmented agents (ASearcher) that interleave reasoning with local retrieval against a Wikipedia knowledge base.
Search recipes: examples/search/
Each recipe ships an all-in-one launch script under scripts/ and its config under yaml/.
Environment setup
The search recipes query a local FAISS retrieval server over the Wikipedia 2018 corpus. Set it up once before training.
Install the retrieval server dependencies:
conda create -n rag-retriever python=3.10 -y
conda activate rag-retriever
pip install -r astraEnv/ASearcher/requirements-rag-server.txt
Download the knowledge corpus and build the index (the build can take hours):
cd astraEnv/ASearcher
conda activate rag-retriever
export WIKI2018_WORK_DIR=data/wiki2018
mkdir -p "$WIKI2018_WORK_DIR"
huggingface-cli download inclusionAI/ASearcher-Local-Knowledge \
  --repo-type dataset --local-dir "$WIKI2018_WORK_DIR" --local-dir-use-symlinks False
bash scripts/build_index.sh
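Because the index build can run for hours, it may be worth confirming that the download actually produced files before starting it. The following is a minimal sketch, assuming that a non-empty directory is a good enough signal; `corpus_ready` is a hypothetical helper and not part of ASearcher.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of ASearcher): succeeds only if
# the corpus directory exists and contains at least one regular file.
corpus_ready() {
  local dir="$1"
  # Fail if the directory is missing.
  if [ ! -d "$dir" ]; then
    echo "missing corpus dir: $dir" >&2
    return 1
  fi
  # Fail if no regular file is found (for example, the download was aborted).
  if [ -z "$(find "$dir" -type f -print -quit)" ]; then
    echo "corpus dir is empty: $dir" >&2
    return 1
  fi
  return 0
}
```

One possible use is `corpus_ready "$WIKI2018_WORK_DIR" && bash scripts/build_index.sh`, so the build only starts when the check passes. The check does not validate file contents or completeness.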
Start the retrieval server before training. It runs on the 2 GPUs the training launcher leaves free (passed as 6,7 below):
cd astraEnv/ASearcher
conda activate rag-retriever
export RAG_SERVER_ADDR_DIR=./tmp-log/rag_server_addrs
export PORT=7000
export USE_FAISS_GPU=1 # set 0 to disable GPU FAISS
bash scripts/launch_rag_server.sh 6,7
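Loading the index can take a while, so a script that starts training right after the server may want to wait until the server has registered itself. The sketch below assumes the server writes at least one entry into `RAG_SERVER_ADDR_DIR` once it is up; the format of those entries is not documented here, and `wait_for_rag_addrs` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of ASearcher): polls a directory until it
# contains at least one entry, or gives up after a timeout in seconds.
# Assumption: the retrieval server writes its address into this directory.
wait_for_rag_addrs() {
  local dir="$1" timeout="${2:-600}" interval="${3:-5}" waited=0
  while :; do
    # ls -A lists every entry except . and ..; non-empty output means
    # something has been written.
    if [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      break
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "no server address found in $dir after ${timeout}s" >&2
  return 1
}
```

For example, `wait_for_rag_addrs "$RAG_SERVER_ADDR_DIR" 900` would wait up to 15 minutes. A stale entry left over from a previous run would also satisfy the check, so clearing the directory before launching the server may be sensible.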
Qwen2.5-7B-Instruct: 8 GPUs
The search recipe trains an ASearcher agent on an 8-GPU node: 4 GPUs for inference, 2 for training, and 2 left free for a local retrieval (RAG) server. It comes in two variants that differ only in weight transfer mode:
qwen2.5-7b-instruct-m2po-full/: full weight transfer
qwen2.5-7b-instruct-m2po-delta/: delta weight transfer (only changed weights are sent)
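The GPU split can be written down as a quick sanity check. This is an illustrative sketch only: the text gives the counts (4 + 2 + 2) and the retrieval server is launched on 6,7, but which of the remaining IDs go to inference and which to training is an assumption here.

```shell
#!/usr/bin/env bash
# Illustrative only: the inference and trainer ID ranges are assumptions.
NUM_GPUS=8
INFER_GPUS="0,1,2,3"   # assumed: inference, 4 GPUs
TRAIN_GPUS="4,5"       # assumed: trainer, 2 GPUs
RAG_GPUS="6,7"         # matches the retrieval server launch command

# Count the comma-separated IDs in $1.
count_ids() {
  local IFS=','
  # Word-splitting on commas is intentional here.
  set -- $1
  echo "$#"
}

total=$(( $(count_ids "$INFER_GPUS") + $(count_ids "$TRAIN_GPUS") + $(count_ids "$RAG_GPUS") ))
if [ "$total" -eq "$NUM_GPUS" ]; then
  echo "GPU split covers all $NUM_GPUS devices"
else
  echo "GPU split covers $total of $NUM_GPUS devices" >&2
fi
```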
The agent uses the async-search-access search client, which queries the local FAISS retrieval server from Environment setup above. The launch script reads the server addresses from astraEnv/ASearcher/tmp-log/rag_server_addrs and aborts if the server is not running.
Run
With the retrieval server already running, one script launches all three processes — the AstraFlow service, the RaaS inference server, and the trainer:
# delta weight transfer
bash examples/search/qwen2.5-7b-instruct-m2po-delta/scripts/run_qwen2.5-7b-instruct-m2po-delta.sh
# full weight transfer
bash examples/search/qwen2.5-7b-instruct-m2po-full/scripts/run_qwen2.5-7b-instruct-m2po-full.sh
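Both launch scripts follow the same naming pattern, so the variant can be chosen with a single argument. A minimal sketch, where `recipe_script` is a hypothetical convenience function rather than something the repository provides:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: maps a weight transfer mode (full or delta) to the
# path of the matching launch script, following the naming used above.
recipe_script() {
  local mode="$1"
  case "$mode" in
    full|delta)
      local name="qwen2.5-7b-instruct-m2po-${mode}"
      echo "examples/search/${name}/scripts/run_${name}.sh"
      ;;
    *)
      echo "usage: recipe_script full|delta" >&2
      return 1
      ;;
  esac
}
```

From the repository root, `bash "$(recipe_script delta)"` would then start the delta variant. Any other argument prints a usage message and returns a non-zero status.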
Settings
| Setting | Value |
|---|---|
| Model | Qwen2.5-7B-Instruct |
| GPUs | 6 of 8: RaaS ×4 (SGLang, DP=4), Trainer ×2 (FSDP, DP=2) |
| Algorithm | M2PO ( |
| Weight transfer | TCP: full, or delta ( |
| Context length | 16384 |
| Max new tokens | 1024 |
| Rollouts per prompt | 8 ( |
| Train batch size | 256 |
| Learning rate | 5e-6 (Adam, constant schedule) |
| Train steps | 1000 |
| Workflow / reward | |
| Retrieval | Local FAISS RAG server over Wikipedia 2018, |
| Train dataset | ASearcher-Base-35k |
| Eval datasets | TriviaQA, PopQA, HotpotQA, Bamboogle |
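
To get a feel for the scale these settings imply, the numbers can be combined with some simple arithmetic. This is a rough sketch under one stated assumption: that the train batch size counts individual rollouts rather than prompts, which the settings do not confirm.

```shell
#!/usr/bin/env bash
# Values copied from the settings table.
TRAIN_BATCH_SIZE=256
ROLLOUTS_PER_PROMPT=8
TRAIN_STEPS=1000

# Assumption: the batch size counts individual rollouts, not prompts.
prompts_per_step=$(( TRAIN_BATCH_SIZE / ROLLOUTS_PER_PROMPT ))
total_rollouts=$(( TRAIN_BATCH_SIZE * TRAIN_STEPS ))

echo "prompts per step: $prompts_per_step"
echo "rollouts over the run: $total_rollouts"
```

If the batch size instead counts prompts, each step would consume 256 prompts and 2048 rollouts, so the recipe's config remains the place to confirm which reading applies.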