Axolotl¶
This guide will help you get started with post-training (SFT, RLHF, RM, PRM) for Qwen3 / Qwen3_MOE using Axolotl, and covers optimizations to enable for better performance.
Requirements¶
GPU: NVIDIA Ampere (or newer) for bf16 and Flash Attention, or AMD GPU
Python: ≥3.11
CUDA: ≥12.4 (for NVIDIA GPUs)
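The requirements above can be checked before installing anything else. The following is a minimal sketch, not part of Axolotl: the helper names `parse_version` and `check_environment` are made up for illustration, and the Ampere check assumes that Ampere corresponds to CUDA compute capability 8.x or higher. It reports `None` where a requirement cannot be verified (for example, when PyTorch is not installed yet or the build is not a CUDA build).

```python
import re
import sys


def parse_version(text):
    """Turn a dotted version string such as '12.4' into a tuple of ints."""
    return tuple(int(number) for number in re.findall(r"\d+", text))


def check_environment():
    """Map each requirement to True, False, or None (could not be verified)."""
    report = {"python>=3.11": sys.version_info >= (3, 11)}
    try:
        import torch  # only needed for the CUDA and GPU checks
    except ImportError:
        report["cuda>=12.4"] = None
        report["ampere_or_newer"] = None
        return report

    cuda = torch.version.cuda  # None on CPU-only and ROCm builds
    report["cuda>=12.4"] = parse_version(cuda) >= (12, 4) if cuda else None
    if cuda and torch.cuda.is_available():
        major, _minor = torch.cuda.get_device_capability(0)
        # Assumption: Ampere and newer report compute capability 8.x or above.
        report["ampere_or_newer"] = major >= 8
    else:
        report["ampere_or_newer"] = None
    return report


if __name__ == "__main__":
    for requirement, status in check_environment().items():
        print(f"{requirement}: {status}")
```

A `None` for the CUDA entries on an AMD system is expected, since the CUDA version requirement applies to NVIDIA GPUs only.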
Installation¶
You can install Axolotl using PyPI, Conda, Git, Docker, or launch a cloud environment.
Important
Install PyTorch before installing Axolotl to ensure CUDA compatibility.
For the latest instructions, see the official Axolotl Installation Guide.
Quickstart¶
SFT¶
We have provided a sample YAML config for SFT with Qwen/Qwen3-32B: SFT 32B QLoRA config.
# Train the model
axolotl train path/to/32b-qlora.yaml
# Merge LoRA weights with the base model
# This will create a new `merged` directory under `{output_dir}`
axolotl merge-lora path/to/32b-qlora.yaml
Tip
To train a smaller model, edit the base_model in your config:
base_model: Qwen/Qwen3-8B
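If you switch between model sizes often, the edit can be scripted. This is a minimal sketch that rewrites the `base_model` line as plain text, so comments and the other settings stay untouched; the helper name `set_base_model` is hypothetical, and the sample text is a stand-in rather than the contents of the provided config.

```python
import re


def set_base_model(config_text, model_id):
    """Return config_text with its top-level base_model value replaced."""
    pattern = re.compile(r"^base_model:.*$", flags=re.MULTILINE)
    # A callable replacement avoids interpreting backslashes in model_id.
    new_text, count = pattern.subn(
        lambda _match: f"base_model: {model_id}", config_text, count=1
    )
    if count == 0:
        raise ValueError("no top-level base_model key found")
    return new_text


if __name__ == "__main__":
    # Stand-in config text; a real config has many more keys.
    sample = "base_model: Qwen/Qwen3-32B\nload_in_4bit: true\n"
    print(set_base_model(sample, "Qwen/Qwen3-8B"))
```

Write the result to a new file and pass that path to `axolotl train` so the original config stays available.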
Qwen3 works with all Axolotl features including Flash Attention, bf16, LoRA, torch_compile, and QLoRA.
To run on more than one GPU, see the Multi-GPU Training Guide or the Multi-node Training Guide.
RLHF¶
See the RLHF Guide for required dataset formats and examples for each method.
RM/PRM¶
Please refer to the Reward Modelling Guide for required dataset formats and config examples.
Dataset¶
By default, the example config uses the mlabonne/FineTome-100k dataset from the Hugging Face Hub. You can substitute your own dataset.
SFT Dataset Format¶
Axolotl handles various SFT dataset formats, but the current recommended format (for use with chat_template) is the OpenAI Messages format:
[
  {
    "messages": [
      {
        "role": "user",
        "content": "What is Qwen3?"
      },
      {
        "role": "assistant",
        "content": "Qwen3 is a language model..."
      }
    ]
  }
]
Use this in your config:
datasets:
  - path: path/to/your/dataset.json
    type: chat_template
You can also load datasets from multiple sources: HuggingFace Hub, local files, directories, S3, GCS, Azure, etc.
See the Dataset Loading Guide for more details.
To load different dataset formats, refer to the SFT Dataset Formats Guide.
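When building your own dataset in the Messages format shown above, it helps to check each record before training. The sketch below is illustrative and not part of Axolotl: `validate_record` and `to_dataset_json` are hypothetical helpers, and the accepted role set (`system`, `user`, `assistant`) is an assumption that your chat template may extend.

```python
import json

# Assumption: roles accepted by this sketch; adjust to match your chat template.
VALID_ROLES = {"system", "user", "assistant"}


def validate_record(record):
    """Return a list of problems found in one dataset record (empty if none)."""
    if (
        not isinstance(record, dict)
        or not isinstance(record.get("messages"), list)
        or not record["messages"]
    ):
        return ["'messages' must be a non-empty list"]
    problems = []
    for i, message in enumerate(record["messages"]):
        if not isinstance(message, dict):
            problems.append(f"message {i}: must be an object")
            continue
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {i}: 'content' must be a string")
    return problems


def to_dataset_json(records):
    """Serialize records to a JSON array, refusing any invalid record."""
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            raise ValueError(f"record {index}: " + "; ".join(problems))
    return json.dumps(records, indent=2, ensure_ascii=False)


if __name__ == "__main__":
    records = [
        {
            "messages": [
                {"role": "user", "content": "What is Qwen3?"},
                {"role": "assistant", "content": "Qwen3 is a language model..."},
            ]
        }
    ]
    print(to_dataset_json(records))
```

Saving the returned string to a `.json` file gives a dataset you can reference from the `path` entry of a `chat_template` dataset config.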
Optimizations¶
With Qwen3/Qwen3_MOE, you can leverage Axolotl’s custom optimizations for improved speed and reduced memory usage:
(LoRA/QLoRA only): LoRA Kernels Optimization
Additional Suggestions¶
Troubleshooting¶
Ensure your CUDA version matches your GPU and PyTorch version.
If you run into out-of-memory errors, try reducing your batch size, enabling the optimizations above, or reducing the sequence length.
Qwen3 MoE may train more slowly because of how the upstream Transformers library handles MoE layers.
For help, check the help channel on Axolotl Discord or create a Discussion on Axolotl GitHub.
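For the out-of-memory advice above, a common way to reduce the batch size without changing the optimization behavior is to lower the per-GPU micro batch and raise gradient accumulation by the same factor. The sketch below only does the arithmetic. It assumes your config uses the `micro_batch_size` and `gradient_accumulation_steps` keys; the function names are illustrative.

```python
def effective_batch_size(micro_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Samples contributing to each optimizer step across all GPUs."""
    return micro_batch_size * gradient_accumulation_steps * num_gpus


def shrink_micro_batch(micro_batch_size, gradient_accumulation_steps, factor=2):
    """Divide the micro batch by factor and scale accumulation up to compensate."""
    if micro_batch_size % factor != 0:
        raise ValueError("micro_batch_size must be divisible by factor")
    return micro_batch_size // factor, gradient_accumulation_steps * factor


if __name__ == "__main__":
    micro, accum = 4, 4
    new_micro, new_accum = shrink_micro_batch(micro, accum)
    print("before:", micro, accum, effective_batch_size(micro, accum, num_gpus=2))
    print("after: ", new_micro, new_accum, effective_batch_size(new_micro, new_accum, num_gpus=2))
```

The smaller micro batch lowers peak activation memory per step, while the effective batch size stays the same, at the cost of more forward and backward passes per optimizer step.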