Installation#

For AMD ROCm (MI300/MI325) installs, see Installation (AMD ROCm).

Prerequisites#

Linux (Ubuntu 20.04+ recommended)
NVIDIA GPU with CUDA support

Option A: Custom Installation (Conda)#

These steps install AstraFlow into a local conda environment.

Step 1: Create conda environment#

conda create -n astraflow python=3.12 -y
conda activate astraflow

Step 2: Install uv (fast pip replacement)#

pip install -U "uv>=0.10"

uv ≥ 0.10 is required. pyproject.toml uses [tool.uv] settings (extra-build-dependencies, override-dependencies) that older uv releases don’t recognize. When uv hits an unknown [tool.uv] key it silently ignores the entire [tool.uv] table, so the transformers==5.8.1 override (the version the pinned sglang release requires) is dropped and the install fails with an unsolvable transformers conflict. The Docker images install the latest uv via the official installer and are unaffected.

Step 3: Install AstraFlow (core + dev tools)#

uv pip install -e ".[dev]"

This installs all core dependencies (~260 packages) including PyTorch 2.11.0, Transformers 5.8.1, Megatron-Core 0.13.1, Ray, W&B, and dev tools (pytest, ruff, ipython).

Step 4: Install Flash Attention and SGLang#

Flash Attention#

This is FlashAttention-2 (import flash_attn), used by the FSDP trainer. It is excluded from uv resolution (see pyproject.toml [tool.uv]) and built from source, so it needs the CUDA 13 toolchain and a roomy build-temp directory:

Run this install from OUTSIDE the repo directory. Inside the repo, uv picks up pyproject.toml’s [tool.uv] exclude-dependencies = ["flash-attn"] (which exists so project installs don’t try to resolve/build flash-attn) and applies it to this explicit install too — the command exits 0 without installing anything. Worse, import flash_attn still appears to succeed afterwards, because the flash-attn-4 package (pulled in by sglang) leaves a namespace stub at site-packages/flash_attn/ — see the verification in Step 6 that actually catches this.

# nvcc must be on PATH and match torch's CUDA (13.0 for torch 2.11+cu130)
export CUDA_HOME=/usr/local/cuda-13.0
export PATH="$CUDA_HOME/bin:$PATH"

# nvcc writes GBs of intermediate files to $TMPDIR. Point it at local scratch
# with plenty of space — NOT a small/NFS-quota'd home, or the build fails with
# "nvFatbin error: empty input" or "Disk quota exceeded" from truncated temps.
export TMPDIR=/tmp/fa-build && mkdir -p "$TMPDIR"

# IMPORTANT: leave the repo so uv ignores the project's [tool.uv] table
# (same reason the Dockerfile installs flash-attn from /tmp).
cd /tmp
uv pip install "flash-attn==2.8.3" --no-build-isolation
cd -

On a single-GPU-arch box you can speed up the build and shrink its footprint with FLASH_ATTN_CUDA_ARCHS=<arch> NVCC_THREADS=1 (e.g. 90 for H100, 80 for A100, 89 for L40/4090). These are optional — the real requirement is a roomy TMPDIR.

SGLang (inference backend)#

Install via the project extra so uv applies the [tool.uv] overrides (the transformers==5.8.1 pin and the flash-attn-4 pre-release allowance). SGLang pulls in FlashAttention-4 (flash-attn-4, a pre-release wheel) automatically for its own attention backend — you do not install that one yourself.

uv pip install -e ".[sglang]"

Step 5 (optional): Install the Megatron training backend#

Only needed if you want to train with the Megatron-LM backend (tensor / pipeline / expert parallelism, MoE models). The default FSDP backend and all inference need nothing here — skip to Step 6.

Prefer Docker? Skip this entire step with the pre-built astraflowai/astraflow:v0.1.2.megatron image (see Option B below), which already bundles Transformer Engine + apex.

megatron-core and mbridge are already installed by Step 3. The Megatron backend additionally uses Transformer Engine (fused LayerNorm + sequence parallelism) and benefits from apex (fused LayerNorm / Adam). Both are compiled from source against the installed PyTorch:

# nvcc must be on PATH (same CUDA toolchain as the flash-attn build above)
export CUDA_HOME=/usr/local/cuda-13.0
export PATH="$CUDA_HOME/bin:$PATH"
export NVTE_FRAMEWORK=pytorch

# Transformer Engine (required for the Megatron backend with TP/SP).
# The prebuilt transformer-engine wheels link libcublas.so.12 and do NOT load
# on a CUDA 13 install (ImportError: libcublas.so.12). Build TE from source
# against your CUDA 13 toolkit instead; nvidia-mathdx provides the build-time
# cuBLASDx / cuDNN frontend headers.
uv pip install nvidia-mathdx==25.6.0
uv pip install -v --no-build-isolation \
  "git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.13"

# apex (optional — Megatron falls back to Torch Norm / torch Adam if absent).
# APEX_CPP_EXT/APEX_CUDA_EXT select the fused kernels; FORCE_CUDA=1 builds them
# without a visible GPU.
git clone --depth 1 https://github.com/NVIDIA/apex.git /tmp/apex
cd /tmp/apex
FORCE_CUDA=1 APEX_CPP_EXT=1 APEX_CUDA_EXT=1 \
  uv pip install -v --no-build-isolation .
cd - && rm -rf /tmp/apex

If apex’s build complains about a CUDA toolkit vs. PyTorch CUDA minor-version mismatch, the difference is safe to ignore — comment out the check_cuda_torch_binary_vs_bare_metal guard in apex’s setup.py, or just skip apex (Transformer Engine is the only hard requirement). The docker/Dockerfile.sglang.megatron image automates all of this.

Verify the Megatron extras:

python -c "
import transformer_engine.pytorch  # noqa: F401
print('transformer-engine: OK')
try:
    from apex.normalization import FusedLayerNorm  # noqa: F401
    print('apex: OK')
except ImportError:
    print('apex: not installed (Torch Norm fallback)')
"

Step 6: Verify installation#

python -c "
import astraflow, torch, transformers
print(f'astraflow:    {astraflow.version.__version__}')
print(f'torch:        {torch.__version__}, CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')
print(f'transformers: {transformers.__version__}')
"

Verify Flash Attention and SGLang:

python -c "
# NOTE: a bare 'import flash_attn' succeeding is NOT proof FA2 is installed —
# flash-attn-4 (sglang's attention backend) leaves a namespace stub at
# site-packages/flash_attn/ that imports fine but is empty. Import a real
# symbol and read the wheel metadata instead.
from importlib.metadata import version
from flash_attn import flash_attn_func  # fails if FA2 was silently skipped
import sglang
print(f'flash-attn: {version(\"flash-attn\")}')
print(f'sglang:     {sglang.__version__}')
"

If the flash_attn_func import fails with cannot import name ... (unknown location), flash-attn was silently skipped — re-run the Step 4 install from a directory outside the repo.

Option B: Docker#

Pre-built images are published on Docker Hub — they skip the from-source steps above entirely. Requires the NVIDIA Container Toolkit so --gpus all works. Choose the image by training backend:

The recommended workflow: the image provides the environment, your local checkout provides the code. astraflow is installed editable from /workspace/astraflow, so mounting your repo over that path runs your local code — no rebuild needed for code changes; rebuild only when dependency pins or the CUDA stack change.

# FSDP backend (default) — run a recipe from YOUR local checkout
docker run --gpus all --net=host --shm-size=512g --ulimit nofile=65536:65536 \
  -v /path/to/astraflow:/workspace/astraflow \
  -v ~/.cache/huggingface:/hf -e HF_HOME=/hf \
  -e WANDB_API_KEY=<your-key> \
  astraflowai/astraflow:v0.1.2 \
  bash examples/math/qwen3-8b-m2po-full/scripts/run_qwen3-8b-m2po-full.sh

# Megatron-LM backend — adds Transformer Engine + apex (Step 5 above, pre-built in).
# Use this for `backend: megatron` (TP/PP/EP, MoE, large models).
docker run --gpus all --net=host --shm-size=512g --ulimit nofile=65536:65536 -it astraflowai/astraflow:v0.1.2.megatron

Drop the -v /path/to/astraflow:... mount to use the code baked into the image instead; add -it for an interactive shell. The -v ~/.cache/huggingface mount reuses your host model/dataset cache instead of re-downloading in the container.

Note on --shm-size: this sets the size of the container’s /dev/shm. A recipe run co-locates the trainer, RaaS, and SGLang in a single container, all sharing one /dev/shm — in particular RaaS stages received model weights under /dev/shm/astraflow_weights during weight transfer. The container default (64 MB) and small values such as 16g are far too small and cause OSError: [Errno 28] No space left on device during training on 8B-scale recipes. Size /dev/shm generously (512g above). It is a tmpfs cap, not a reservation, so it only consumes host RAM as actually used — set it to a value comfortably below host RAM.

Note on --ulimit nofile: a recipe run drives many concurrent rollouts whose reward workers open a large number of file descriptors. The container’s default nofile soft limit (1024) is far too low and the reward pool fails with [Errno 24] Too many open files. Raise it with --ulimit nofile=65536:65536.

The image bundles astraflow, SGLang, and flash-attn. Pin a version tag (v0.1.2) for reproducibility; :latest tracks the most recent release. See docker/README.md for build details and the NVIDIA Container Toolkit install guide.