---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---

## Evaluation

The results in the following table are based on the MMLU benchmark.

To speed up the evaluation, we limit the length of the model's chain of thought, so the scores may differ from those obtained with longer reasoning chains.

In our experiments, **the accuracy of the FP4 quantized version is nearly identical to that of the BF16 version, while enabling faster inference.**

| Data Format | MMLU Score |
|:---|:---|
| BF16 Official | 88.21 |
| FP4 Quantized | 87.43 |

## Quickstart

We recommend using the [Chitu](https://github.com/thu-pacman/chitu) inference framework to run this model.

The following command shows how to serve Qwen3-32B-fp4 (an example request is sketched at the end of this card):

```bash
# Replace "your model path" with the local directory that contains the model weights and tokenizer.
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=Qwen3-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```

## Contact

solution@qingcheng.ai
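
## Example request

Once the serving command above is running, you can send it a test request over HTTP. The sketch below assumes that Chitu exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the configured `serve.port` (21002 in the Quickstart command); the endpoint path, request fields, and model name are assumptions, so adjust them to match your Chitu version.

```bash
# Assumption: Chitu serves an OpenAI-compatible chat completions endpoint on serve.port (21002 above).
curl http://localhost:21002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-32B-fp4",
        "messages": [
          {"role": "user", "content": "Give me a short introduction to FP4 quantization."}
        ],
        "max_tokens": 64
      }'
```

Generation length will also be limited by the `request.max_new_tokens=100` setting in the serving command.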