---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---

## Evaluation

The results in the following table are based on the MMLU benchmark.

To speed up evaluation, we cap the length of the model's chain-of-thought, so the scores may differ from those obtained with longer reasoning chains.

In our experiments, **the accuracy of the FP4 quantized version is nearly identical to that of the BF16 version, while allowing faster inference.**

| Data Format | MMLU Score |
|:---|:---|
| BF16 (official) | 88.21 |
| FP4 (quantized) | 87.43 |
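
This card does not specify the exact evaluation harness used. As an illustrative sketch only, the command below shows one way an MMLU run could be reproduced with EleutherAI's lm-evaluation-harness pointed at the Chitu server from the Quickstart section, assuming the server exposes an OpenAI-compatible completions endpoint that returns logprobs; because of the capped reasoning length described above, the numbers will not exactly match the table.

```bash
# Illustrative sketch, not the authors' evaluation setup.
# Assumes a Chitu server is already running on port 21002 (see Quickstart)
# and that it serves an OpenAI-compatible /v1/completions API with logprobs.
pip install lm-eval
lm_eval --model local-completions \
    --model_args model=Qwen3-32B-fp4,base_url=http://localhost:21002/v1/completions,num_concurrent=1 \
    --tasks mmlu \
    --batch_size 1
```
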
## Quickstart

We recommend using the [Chitu inference framework](https://github.com/thu-pacman/chitu) to run this model.

The following is a simple command showing how to run Qwen3-32B-fp4:

```bash
# Launch a Chitu server for Qwen3-32B-fp4 on port 21002 (single GPU: tp_size=1, pp_size=1).
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=Qwen3-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```
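
Once the server is up, you can send it a request. This is a minimal sketch assuming Chitu exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the `serve.port` configured above; adjust the URL if your Chitu version serves a different API.

```bash
# Assumes an OpenAI-compatible chat API on serve.port=21002 (see the launch command above).
curl http://localhost:21002/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-32B-fp4",
          "messages": [{"role": "user", "content": "What is FP4 quantization?"}]
        }'
```

Note that `request.max_new_tokens=100` in the launch command caps the length of each response.
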
## Contact
solution@qingcheng.ai