Installation¶
Prerequisites¶
Linux (Ubuntu 20.04+ recommended)
NVIDIA GPU with CUDA support
Option A: Custom Installation (Conda)¶
These steps install AstraFlow into a local conda environment.
Step 1: Create conda environment¶
conda create -n astraflow python=3.12 -y
conda activate astraflow
Step 2: Install uv (fast pip replacement)¶
pip install -U "uv>=0.10"
uv ≥ 0.10 is required.
pyproject.tomluses[tool.uv]settings (extra-build-dependencies,override-dependencies) that older uv releases don’t recognize. When uv hits an unknown[tool.uv]key it silently ignores the entire[tool.uv]table, so thetransformers==5.6.1override (which must beat sglang’s==5.6.0pin) is dropped and the install fails with an unsolvabletransformersconflict. The Docker images install the latest uv via the official installer and are unaffected.
Step 3: Install AstraFlow (core + dev tools)¶
uv pip install -e ".[dev]"
This installs all core dependencies (~260 packages) including PyTorch 2.11.0, Transformers 5.6.1, Megatron-Core 0.13.1, Ray, W&B, and dev tools (pytest, ruff, ipython).
Step 4: Install Flash Attention and SGLang¶
Flash Attention¶
This is FlashAttention-2 (import flash_attn), used by the FSDP trainer. It
is excluded from uv resolution (see pyproject.toml [tool.uv]) and built from
source, so it needs the CUDA 13 toolchain and a roomy build-temp directory:
# nvcc must be on PATH and match torch's CUDA (13.0 for torch 2.11+cu130)
export CUDA_HOME=/usr/local/cuda-13.0
export PATH="$CUDA_HOME/bin:$PATH"
# nvcc writes GBs of intermediate files to $TMPDIR. Point it at local scratch
# with plenty of space — NOT a small/NFS-quota'd home, or the build fails with
# "nvFatbin error: empty input" or "Disk quota exceeded" from truncated temps.
export TMPDIR=/tmp/fa-build && mkdir -p "$TMPDIR"
uv pip install "flash-attn==2.8.3" --no-build-isolation
On a single-GPU-arch box you can speed up the build and shrink its footprint with
FLASH_ATTN_CUDA_ARCHS=<arch> NVCC_THREADS=1(e.g.90for H100,80for A100,89for L40/4090). These are optional — the real requirement is a roomyTMPDIR.
SGLang (inference backend)¶
Install via the project extra so uv applies the [tool.uv] overrides (the
transformers==5.6.1 pin and the flash-attn-4 pre-release allowance). SGLang
pulls in FlashAttention-4 (flash-attn-4, a pre-release wheel) automatically
for its own attention backend — you do not install that one yourself.
uv pip install -e ".[sglang]"
Step 5 (optional): Install the Megatron training backend¶
Only needed if you want to train with the Megatron-LM backend (tensor / pipeline / expert parallelism, MoE models). The default FSDP backend and all inference need nothing here — skip to Step 6.
Prefer Docker? Skip this entire step with the pre-built
astraflowai/astraflow:v0.1.1.megatronimage (see Option B below), which already bundles Transformer Engine + apex.
megatron-core and mbridge are already installed by Step 3. The Megatron
backend additionally uses Transformer Engine (fused LayerNorm + sequence
parallelism) and benefits from apex (fused LayerNorm / Adam). Both are
compiled from source against the installed PyTorch:
# nvcc must be on PATH (same CUDA toolchain as the flash-attn build above)
export CUDA_HOME=/usr/local/cuda-13.0
export PATH="$CUDA_HOME/bin:$PATH"
export NVTE_FRAMEWORK=pytorch
# Transformer Engine (required for the Megatron backend with TP/SP).
# The prebuilt transformer-engine wheels link libcublas.so.12 and do NOT load
# on a CUDA 13 install (ImportError: libcublas.so.12). Build TE from source
# against your CUDA 13 toolkit instead; nvidia-mathdx provides the build-time
# cuBLASDx / cuDNN frontend headers.
uv pip install nvidia-mathdx==25.6.0
uv pip install -v --no-build-isolation \
"git+https://github.com/NVIDIA/TransformerEngine.git@release_v2.13"
# apex (optional — Megatron falls back to Torch Norm / torch Adam if absent).
# APEX_CPP_EXT/APEX_CUDA_EXT select the fused kernels; FORCE_CUDA=1 builds them
# without a visible GPU.
git clone --depth 1 https://github.com/NVIDIA/apex.git /tmp/apex
cd /tmp/apex
FORCE_CUDA=1 APEX_CPP_EXT=1 APEX_CUDA_EXT=1 \
uv pip install -v --no-build-isolation .
cd - && rm -rf /tmp/apex
If apex’s build complains about a CUDA toolkit vs. PyTorch CUDA minor-version mismatch, the difference is safe to ignore — comment out the
check_cuda_torch_binary_vs_bare_metalguard in apex’ssetup.py, or just skip apex (Transformer Engine is the only hard requirement). Thedocker/Dockerfile.sglang.megatronimage automates all of this.
Verify the Megatron extras:
python -c "
import transformer_engine.pytorch # noqa: F401
print('transformer-engine: OK')
try:
from apex.normalization import FusedLayerNorm # noqa: F401
print('apex: OK')
except ImportError:
print('apex: not installed (Torch Norm fallback)')
"
Step 6: Verify installation¶
python -c "
import astraflow, torch, transformers
print(f'astraflow: {astraflow.version.__version__}')
print(f'torch: {torch.__version__}, CUDA: {torch.cuda.is_available()}, GPUs: {torch.cuda.device_count()}')
print(f'transformers: {transformers.__version__}')
"
Verify Flash Attention and SGLang:
python -c "
import flash_attn, sglang
print(f'flash-attn: {flash_attn.__version__}')
print(f'sglang: {sglang.__version__}')
"
Option B: Docker¶
Pre-built images are published on Docker Hub — they skip the from-source steps above
entirely. Requires the NVIDIA Container Toolkit so --gpus all works. Choose the image
by training backend:
# FSDP backend (default) — covers most recipes
docker run --gpus all --net=host --shm-size=512g --ulimit nofile=65536:65536 -it astraflowai/astraflow:v0.1.1
# Megatron-LM backend — adds Transformer Engine + apex (Step 5 above, pre-built in).
# Use this for `backend: megatron` (TP/PP/EP, MoE, large models).
docker run --gpus all --net=host --shm-size=512g --ulimit nofile=65536:65536 -it astraflowai/astraflow:v0.1.1.megatron
Note on
--shm-size: this sets the size of the container’s/dev/shm. A recipe run co-locates the trainer, RaaS, and SGLang in a single container, all sharing one/dev/shm— in particular RaaS stages received model weights under/dev/shm/astraflow_weightsduring weight transfer. The container default (64 MB) and small values such as16gare far too small and causeOSError: [Errno 28] No space left on deviceduring training on 8B-scale recipes. Size/dev/shmgenerously (512gabove). It is a tmpfs cap, not a reservation, so it only consumes host RAM as actually used — set it to a value comfortably below host RAM.
Note on
--ulimit nofile: a recipe run drives many concurrent rollouts whose reward workers open a large number of file descriptors. The container’s defaultnofilesoft limit (1024) is far too low and the reward pool fails with[Errno 24] Too many open files. Raise it with--ulimit nofile=65536:65536.
The image bundles astraflow, SGLang, and flash-attn. Pin a version tag (v0.1.1) for
reproducibility; :latest tracks the most recent release. See docker/README.md for
build details and the NVIDIA Container Toolkit install guide.