Speed Benchmark

We report the inference speed (tokens/s) and GPU memory footprint (MB) of the Qwen3 series for bfloat16 models and quantized models (FP8, GPTQ, and AWQ) at different input lengths.

Environments

Hugging Face Transformers

  • Hardware:

    • NVIDIA H20 96GB

  • Software for runs without AutoAWQ:

    • PyTorch 2.6.0

    • Flash Attention 2.7.4

    • Transformers 4.51.3

    • GPTQModel 2.2.0+cu128torch2.6

  • Software for AutoAWQ:

    • PyTorch 2.6.0+cu124

    • Transformers 4.51.3

    • AutoAWQ 0.2.9

    • AutoAWQ_kernels 0.0.9

SGLang

  • Hardware:

    • NVIDIA H20 96GB

  • Software:

    • PyTorch 2.6.0+cu124

    • Transformers 4.51.3

    • SGLang 0.4.6.post1

    • SGL-kernel 0.1.0

    • vLLM 0.7.2 (Required by SGLang for AWQ quantization)

Notes

  • Inference Speed (tokens/s) is calculated as:

    \[\text{Speed} = \frac{\text{tokens}_{\text{prompt}} + \text{tokens}_{\text{generation}}}{\text{time}}\]
  • We use a batch size of 1 and the minimum number of GPUs possible for evaluation.

  • We test the speed and memory usage when generating 2048 tokens, with input lengths of 1, 6144, 14336, 30720, 63488, and 129024 tokens.

  • For SGLang:

    • Memory usage is not reported because SGLang pre-allocates GPU memory up front. By default, we set mem_fraction_static=0.85.

    • We set context_length=140000 and enable_mixed_chunk=True.

    • For AWQ quantization, we use the awq_marlin backend.

    • We set skip_tokenizer_init=True and perform generation using input_ids instead of raw text prompts.

  • FP8 Performance in Transformers: FP8 inference in Transformers is currently slow and needs further optimization.

  • GPTQ-INT4 Performance in SGLang: GPTQ-INT4 inference in SGLang also needs improvement, and we are actively working with the team on it.
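The speed definition and the test lengths in the notes above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark script itself: the function names and the timing in the usage example are hypothetical, and only the constants (2048 generated tokens, the six input lengths, context_length=140000) come from the notes.

```python
GENERATED_TOKENS = 2048  # generation length used for every run (from the notes)
INPUT_LENGTHS = [1, 6144, 14336, 30720, 63488, 129024]  # prompt lengths (from the notes)
SGLANG_CONTEXT_LENGTH = 140000  # context_length configured for SGLang (from the notes)


def inference_speed(prompt_tokens: int, generated_tokens: int, elapsed_seconds: float) -> float:
    """Return tokens/s as (prompt tokens + generated tokens) / wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (prompt_tokens + generated_tokens) / elapsed_seconds


def total_sequence_lengths(input_lengths=INPUT_LENGTHS, generated=GENERATED_TOKENS):
    """Total tokens handled per run: prompt length plus generated tokens."""
    return [n + generated for n in input_lengths]


if __name__ == "__main__":
    # Hypothetical timing: a 6144-token prompt plus 2048 generated tokens in 8.0 s.
    print(inference_speed(6144, GENERATED_TOKENS, 8.0))
    # Every total must fit inside the configured SGLang context length.
    totals = total_sequence_lengths()
    print(totals, max(totals) <= SGLANG_CONTEXT_LENGTH)
```

Apart from the single-token prompt, each input length plus the 2048 generated tokens gives a power-of-two total (8192 up to 131072), and the largest total fits within context_length=140000. Because prompt tokens count toward the numerator, the figures at long input lengths likely reflect prefill throughput more than decoding speed.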

Results

Qwen3-0.6B (SGLang)

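| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 414.17 | |
| Qwen3-0.6B | 1 | FP8 | 1 | 458.03 | |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 344.92 | |
| Qwen3-0.6B | 6144 | BF16 | 1 | 1426.46 | |
| Qwen3-0.6B | 6144 | FP8 | 1 | 1572.95 | |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 1234.29 | |
| Qwen3-0.6B | 14336 | BF16 | 1 | 2478.02 | |
| Qwen3-0.6B | 14336 | FP8 | 1 | 2689.08 | |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 2198.82 | |
| Qwen3-0.6B | 30720 | BF16 | 1 | 3577.42 | |
| Qwen3-0.6B | 30720 | FP8 | 1 | 3819.86 | |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 3342.06 | |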

Qwen3-0.6B (Transformers)

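| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-0.6B | 1 | BF16 | 1 | 58.57 | 1394 |
| Qwen3-0.6B | 1 | FP8 | 1 | 24.60 | 1217 |
| Qwen3-0.6B | 1 | GPTQ-Int8 | 1 | 26.56 | 986 |
| Qwen3-0.6B | 6144 | BF16 | 1 | 154.82 | 2066 |
| Qwen3-0.6B | 6144 | FP8 | 1 | 73.96 | 1943 |
| Qwen3-0.6B | 6144 | GPTQ-Int8 | 1 | 93.84 | 1658 |
| Qwen3-0.6B | 14336 | BF16 | 1 | 168.48 | 2963 |
| Qwen3-0.6B | 14336 | FP8 | 1 | 104.99 | 2839 |
| Qwen3-0.6B | 14336 | GPTQ-Int8 | 1 | 219.61 | 2554 |
| Qwen3-0.6B | 30720 | BF16 | 1 | 175.93 | 4755 |
| Qwen3-0.6B | 30720 | FP8 | 1 | 132.78 | 4632 |
| Qwen3-0.6B | 30720 | GPTQ-Int8 | 1 | 345.71 | 4347 |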

Qwen3-1.7B (SGLang)

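| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 227.80 | |
| Qwen3-1.7B | 1 | FP8 | 1 | 333.90 | |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 257.40 | |
| Qwen3-1.7B | 6144 | BF16 | 1 | 838.28 | |
| Qwen3-1.7B | 6144 | FP8 | 1 | 1198.20 | |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 945.91 | |
| Qwen3-1.7B | 14336 | BF16 | 1 | 1525.71 | |
| Qwen3-1.7B | 14336 | FP8 | 1 | 2095.61 | |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 1707.63 | |
| Qwen3-1.7B | 30720 | BF16 | 1 | 2439.03 | |
| Qwen3-1.7B | 30720 | FP8 | 1 | 3165.32 | |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 2706.16 | |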

Qwen3-1.7B (Transformers)

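| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-1.7B | 1 | BF16 | 1 | 59.83 | 3412 |
| Qwen3-1.7B | 1 | FP8 | 1 | 23.83 | 2726 |
| Qwen3-1.7B | 1 | GPTQ-Int8 | 1 | 28.06 | 2229 |
| Qwen3-1.7B | 6144 | BF16 | 1 | 238.53 | 4213 |
| Qwen3-1.7B | 6144 | FP8 | 1 | 90.87 | 3462 |
| Qwen3-1.7B | 6144 | GPTQ-Int8 | 1 | 110.82 | 2901 |
| Qwen3-1.7B | 14336 | BF16 | 1 | 352.59 | 5109 |
| Qwen3-1.7B | 14336 | FP8 | 1 | 153.37 | 4359 |
| Qwen3-1.7B | 14336 | GPTQ-Int8 | 1 | 222.78 | 3798 |
| Qwen3-1.7B | 30720 | BF16 | 1 | 418.13 | 6902 |
| Qwen3-1.7B | 30720 | FP8 | 1 | 235.61 | 6151 |
| Qwen3-1.7B | 30720 | GPTQ-Int8 | 1 | 386.85 | 5590 |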

Qwen3-4B (SGLang)

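| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 133.13 | |
| Qwen3-4B | 1 | FP8 | 1 | 200.61 | |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 199.71 | |
| Qwen3-4B | 6144 | BF16 | 1 | 466.19 | |
| Qwen3-4B | 6144 | FP8 | 1 | 662.26 | |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 640.07 | |
| Qwen3-4B | 14336 | BF16 | 1 | 789.25 | |
| Qwen3-4B | 14336 | FP8 | 1 | 1066.23 | |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 1006.23 | |
| Qwen3-4B | 30720 | BF16 | 1 | 1165.75 | |
| Qwen3-4B | 30720 | FP8 | 1 | 1467.71 | |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 1358.84 | |
| Qwen3-4B | 63488 | BF16 | 1 | 1423.98 | |
| Qwen3-4B | 63488 | FP8 | 1 | 1660.67 | |
| Qwen3-4B | 63488 | AWQ-INT4 | 1 | 1513.97 | |
| Qwen3-4B | 129024 | BF16 | 1 | 1371.04 | |
| Qwen3-4B | 129024 | FP8 | 1 | 1497.27 | |
| Qwen3-4B | 129024 | AWQ-INT4 | 1 | 1375.71 | |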

Qwen3-4B (Transformers)

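| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-4B | 1 | BF16 | 1 | 45.94 | 7973 |
| Qwen3-4B | 1 | FP8 | 1 | 17.33 | 5281 |
| Qwen3-4B | 1 | AWQ-INT4 | 1 | 51.57 | 2915 |
| Qwen3-4B | 6144 | BF16 | 1 | 159.95 | 8860 |
| Qwen3-4B | 6144 | FP8 | 1 | 60.55 | 6144 |
| Qwen3-4B | 6144 | AWQ-INT4 | 1 | 183.04 | 3881 |
| Qwen3-4B | 14336 | BF16 | 1 | 195.31 | 10012 |
| Qwen3-4B | 14336 | FP8 | 1 | 96.81 | 7297 |
| Qwen3-4B | 14336 | AWQ-INT4 | 1 | 265.22 | 5151 |
| Qwen3-4B | 30720 | BF16 | 1 | 217.97 | 12317 |
| Qwen3-4B | 30720 | FP8 | 1 | 138.84 | 9611 |
| Qwen3-4B | 30720 | AWQ-INT4 | 1 | 481.69 | 7742 |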

Qwen3-8B (SGLang)

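| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 81.73 | |
| Qwen3-8B | 1 | FP8 | 1 | 150.25 | |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 144.11 | |
| Qwen3-8B | 6144 | BF16 | 1 | 296.25 | |
| Qwen3-8B | 6144 | FP8 | 1 | 516.64 | |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 477.89 | |
| Qwen3-8B | 14336 | BF16 | 1 | 524.70 | |
| Qwen3-8B | 14336 | FP8 | 1 | 859.92 | |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 770.44 | |
| Qwen3-8B | 30720 | BF16 | 1 | 832.67 | |
| Qwen3-8B | 30720 | FP8 | 1 | 1242.24 | |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 1075.91 | |
| Qwen3-8B | 63488 | BF16 | 1 | 1112.78 | |
| Qwen3-8B | 63488 | FP8 | 1 | 1476.46 | |
| Qwen3-8B | 63488 | AWQ-INT4 | 1 | 1254.91 | |
| Qwen3-8B | 129024 | BF16 | 1 | 1173.32 | |
| Qwen3-8B | 129024 | FP8 | 1 | 1393.21 | |
| Qwen3-8B | 129024 | AWQ-INT4 | 1 | 1198.06 | |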

Qwen3-8B (Transformers)

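| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-8B | 1 | BF16 | 1 | 45.32 | 15947 |
| Qwen3-8B | 1 | FP8 | 1 | 15.46 | 9323 |
| Qwen3-8B | 1 | AWQ-INT4 | 1 | 51.33 | 6177 |
| Qwen3-8B | 6144 | BF16 | 1 | 146.12 | 16811 |
| Qwen3-8B | 6144 | FP8 | 1 | 55.07 | 10187 |
| Qwen3-8B | 6144 | AWQ-INT4 | 1 | 163.23 | 7113 |
| Qwen3-8B | 14336 | BF16 | 1 | 183.29 | 17963 |
| Qwen3-8B | 14336 | FP8 | 1 | 89.64 | 11340 |
| Qwen3-8B | 14336 | AWQ-INT4 | 1 | 242.97 | 8409 |
| Qwen3-8B | 30720 | BF16 | 1 | 208.98 | 20267 |
| Qwen3-8B | 30720 | FP8 | 1 | 130.93 | 13644 |
| Qwen3-8B | 30720 | AWQ-INT4 | 1 | 438.62 | 11001 |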

Qwen3-14B (SGLang)

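| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 47.10 | |
| Qwen3-14B | 1 | FP8 | 1 | 97.11 | |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 96.49 | |
| Qwen3-14B | 6144 | BF16 | 1 | 174.85 | |
| Qwen3-14B | 6144 | FP8 | 1 | 342.95 | |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 321.62 | |
| Qwen3-14B | 14336 | BF16 | 1 | 317.56 | |
| Qwen3-14B | 14336 | FP8 | 1 | 587.33 | |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 525.74 | |
| Qwen3-14B | 30720 | BF16 | 1 | 525.80 | |
| Qwen3-14B | 30720 | FP8 | 1 | 880.72 | |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 744.74 | |
| Qwen3-14B | 63488 | BF16 | 1 | 742.36 | |
| Qwen3-14B | 63488 | FP8 | 1 | 1089.04 | |
| Qwen3-14B | 63488 | AWQ-INT4 | 1 | 884.06 | |
| Qwen3-14B | 129024 | BF16 | 1 | 826.15 | |
| Qwen3-14B | 129024 | FP8 | 1 | 1049.64 | |
| Qwen3-14B | 129024 | AWQ-INT4 | 1 | 857.56 | |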

Qwen3-14B (Transformers)

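| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-14B | 1 | BF16 | 1 | 40.66 | 28402 |
| Qwen3-14B | 1 | FP8 | 1 | 13.02 | 16012 |
| Qwen3-14B | 1 | AWQ-INT4 | 1 | 44.67 | 9962 |
| Qwen3-14B | 6144 | BF16 | 1 | 108.52 | 29495 |
| Qwen3-14B | 6144 | FP8 | 1 | 44.86 | 16972 |
| Qwen3-14B | 6144 | AWQ-INT4 | 1 | 128.08 | 11020 |
| Qwen3-14B | 14336 | BF16 | 1 | 136.36 | 30775 |
| Qwen3-14B | 14336 | FP8 | 1 | 71.96 | 18253 |
| Qwen3-14B | 14336 | AWQ-INT4 | 1 | 220.62 | 12438 |
| Qwen3-14B | 30720 | BF16 | 1 | 155.38 | 33336 |
| Qwen3-14B | 30720 | FP8 | 1 | 102.63 | 20813 |
| Qwen3-14B | 30720 | AWQ-INT4 | 1 | 363.25 | 15323 |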

Qwen3-32B (SGLang)

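| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 20.72 | |
| Qwen3-32B | 1 | FP8 | 1 | 46.17 | |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 47.67 | |
| Qwen3-32B | 6144 | BF16 | 1 | 77.82 | |
| Qwen3-32B | 6144 | FP8 | 1 | 165.71 | |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 159.99 | |
| Qwen3-32B | 14336 | BF16 | 1 | 143.08 | |
| Qwen3-32B | 14336 | FP8 | 1 | 287.60 | |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 260.44 | |
| Qwen3-32B | 30720 | BF16 | 1 | 240.75 | |
| Qwen3-32B | 30720 | FP8 | 1 | 436.59 | |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 366.84 | |
| Qwen3-32B | 63488 | BF16 | 1 | 342.96 | |
| Qwen3-32B | 63488 | FP8 | 1 | 532.18 | |
| Qwen3-32B | 63488 | AWQ-INT4 | 1 | 425.23 | |
| Qwen3-32B | 129024 | BF16 | 2 | 711.40 | TP=2 |
| Qwen3-32B | 129024 | FP8 | 1 | 491.45 | |
| Qwen3-32B | 129024 | AWQ-INT4 | 1 | 395.96 | |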

Qwen3-32B (Transformers)

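| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) |
|---|---|---|---|---|---|
| Qwen3-32B | 1 | BF16 | 1 | 26.24 | 62751 |
| Qwen3-32B | 1 | FP8 | 1 | 7.37 | 33379 |
| Qwen3-32B | 1 | AWQ-INT4 | 1 | 41.8 | 19109 |
| Qwen3-32B | 6144 | BF16 | 1 | 51.41 | 64583 |
| Qwen3-32B | 6144 | FP8 | 1 | 23.57 | 34915 |
| Qwen3-32B | 6144 | AWQ-INT4 | 1 | 68.71 | 20795 |
| Qwen3-32B | 14336 | BF16 | 1 | 62.41 | 66632 |
| Qwen3-32B | 14336 | FP8 | 1 | 36.30 | 36963 |
| Qwen3-32B | 14336 | AWQ-INT4 | 1 | 107.02 | 23105 |
| Qwen3-32B | 30720 | BF16 | 1 | 69.16 | 70728 |
| Qwen3-32B | 30720 | FP8 | 1 | 49.44 | 41060 |
| Qwen3-32B | 30720 | AWQ-INT4 | 1 | 188.11 | 27718 |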

Qwen3-30B-A3B (SGLang)

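| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 137.18 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 155.55 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | 1 | 31.29 | GPTQ-Marlin |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 490.10 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 551.34 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | 1 | 120.13 | GPTQ-Marlin |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 849.62 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 945.13 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | 1 | 227.27 | GPTQ-Marlin |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 1283.94 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 1405.91 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | 1 | 404.45 | GPTQ-Marlin |
| Qwen3-30B-A3B | 63488 | BF16 | 1 | 1538.79 | |
| Qwen3-30B-A3B | 63488 | FP8 | 1 | 1647.89 | |
| Qwen3-30B-A3B | 63488 | GPTQ-INT4 | 1 | 617.09 | GPTQ-Marlin |
| Qwen3-30B-A3B | 129024 | BF16 | 1 | 1385.65 | |
| Qwen3-30B-A3B | 129024 | FP8 | 1 | 1442.14 | |
| Qwen3-30B-A3B | 129024 | GPTQ-INT4 | 1 | 704.82 | GPTQ-Marlin |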

Qwen3-30B-A3B (Transformers)

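| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | GPU Memory (MB) | Note |
|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1 | BF16 | 1 | 1.89 | 58462 | |
| Qwen3-30B-A3B | 1 | FP8 | 1 | 0.44 | 30296 | |
| Qwen3-30B-A3B | 1 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 6144 | BF16 | 1 | 7.45 | 59037 | |
| Qwen3-30B-A3B | 6144 | FP8 | 1 | 1.77 | 30872 | |
| Qwen3-30B-A3B | 6144 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 14336 | BF16 | 1 | 14.47 | 59806 | |
| Qwen3-30B-A3B | 14336 | FP8 | 1 | 3.5 | 31641 | |
| Qwen3-30B-A3B | 14336 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |
| Qwen3-30B-A3B | 30720 | BF16 | 1 | 27.03 | 61342 | |
| Qwen3-30B-A3B | 30720 | FP8 | 1 | 6.86 | 33177 | |
| Qwen3-30B-A3B | 30720 | GPTQ-INT4 | - | - | - | MoE Kernel Unsupported |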

Qwen3-235B-A22B (SGLang)

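| Model | Input Length | Quantization | GPU Num | Speed (tokens/s) | Note |
|---|---|---|---|---|---|
| Qwen3-235B-A22B | 1 | BF16 | 8 | 74.50 | TP=8 |
| Qwen3-235B-A22B | 1 | FP8 | 4 | 71.65 | TP=4 |
| Qwen3-235B-A22B | 1 | GPTQ-INT4 | 4 | 14.69 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 6144 | BF16 | 8 | 289.03 | TP=8 |
| Qwen3-235B-A22B | 6144 | FP8 | 4 | 275.16 | TP=4 |
| Qwen3-235B-A22B | 6144 | GPTQ-INT4 | 4 | 56.97 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 14336 | BF16 | 8 | 546.73 | TP=8 |
| Qwen3-235B-A22B | 14336 | FP8 | 4 | 514.23 | TP=4 |
| Qwen3-235B-A22B | 14336 | GPTQ-INT4 | 4 | 109.13 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 30720 | BF16 | 8 | 979.41 | TP=8 |
| Qwen3-235B-A22B | 30720 | FP8 | 4 | 887.90 | TP=4 |
| Qwen3-235B-A22B | 30720 | GPTQ-INT4 | 4 | 198.99 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 63488 | BF16 | 8 | 1493.91 | TP=8 |
| Qwen3-235B-A22B | 63488 | FP8 | 4 | 1269.34 | TP=4 |
| Qwen3-235B-A22B | 63488 | GPTQ-INT4 | 4 | 422.77 | TP=4, GPTQ-Marlin |
| Qwen3-235B-A22B | 129024 | BF16 | 8 | 1639.54 | TP=8 |
| Qwen3-235B-A22B | 129024 | FP8 | 4 | 1319.66 | TP=4 |
| Qwen3-235B-A22B | 129024 | GPTQ-INT4 | 4 | 552.28 | TP=4, GPTQ-Marlin |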
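
The Qwen3-235B-A22B (SGLang) rows use different GPU counts (8 for BF16, 4 for FP8 and GPTQ-INT4), so the raw speeds are not directly comparable. One way to read them is to divide each speed by its GPU count. The sketch below does this for the input-length-1 rows. The per-GPU figure is an illustrative normalization assumed here, not a metric reported by the benchmark, and the helper names are hypothetical; it also ignores differences in tensor-parallel communication overhead.

```python
# (quantization, GPU count, speed in tokens/s) at input length 1,
# copied from the Qwen3-235B-A22B (SGLang) results.
ROWS = [
    ("BF16", 8, 74.50),
    ("FP8", 4, 71.65),
    ("GPTQ-INT4", 4, 14.69),
]


def per_gpu_speed(speed: float, gpu_num: int) -> float:
    """Illustrative normalization: tokens/s divided by the number of GPUs used."""
    if gpu_num <= 0:
        raise ValueError("gpu_num must be positive")
    return speed / gpu_num


def normalized(rows=ROWS):
    """Map each quantization to its per-GPU speed."""
    return {quant: per_gpu_speed(speed, gpus) for quant, gpus, speed in rows}


if __name__ == "__main__":
    for quant, value in normalized().items():
        print(f"{quant}: {value:.2f} tokens/s per GPU")
```

Read this way, FP8 reaches nearly the same total speed as BF16 on half the GPUs at this input length.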