---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---
## Evaluation

The results in the following table were measured on the MMLU benchmark.

To speed up evaluation, we cap the length of the model's chain of thought, so the scores may differ from those obtained with longer reasoning chains.

In our experiments, **the accuracy of the FP4 quantized version is nearly identical to that of the BF16 version, while offering faster inference.**

| Data Format | MMLU Score |
|:---|:---|
| BF16 (official) | 88.21 |
| FP4 (quantized) | 87.43 |

## Quickstart

We recommend using the [Chitu](https://github.com/thu-pacman/chitu) inference framework to run this model.
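
If Chitu is not installed yet, a from-source install along these lines should work. This is a minimal sketch, not the authoritative procedure: the exact build requirements (matching torch/CUDA versions, build flags) are documented in the Chitu repository.

```bash
# Minimal install sketch; see the Chitu README for the authoritative steps
# (build dependencies, matching torch/CUDA versions, build flags).
git clone --recursive https://github.com/thu-pacman/chitu
cd chitu
pip install --no-build-isolation .
```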

Here is a simple command showing how to run Qwen3-32B-fp4:
```bash
# Serve Qwen3-32B-fp4 with Chitu on a single GPU (tp=1, pp=1).
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=Qwen3-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```
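
Once the server is up, you can send it a request. The sketch below assumes Chitu exposes an OpenAI-compatible chat completions route on the `serve.port` configured above; check the Chitu documentation for the exact API surface.

```bash
# Hypothetical usage sketch: query the server started above (port 21002).
# The /v1/chat/completions route is an assumption based on common
# OpenAI-compatible serving conventions; adjust if Chitu's API differs.
curl http://localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen3-32B-fp4",
        "messages": [{"role": "user", "content": "Hello, who are you?"}],
        "max_tokens": 64
    }'
```

Note that the serve command above sets `request.max_new_tokens=100` and `infer.max_reqs=1`, so generation length and concurrency are tightly capped; raise these values for longer outputs or parallel requests.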
## Contact
solution@qingcheng.ai