---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
base_model_relation: quantized
library_name: transformers
tags:
- Qwen
- fp4
---

## Evaluation

The results in the following table are based on the MMLU benchmark.

To speed up the evaluation, we limit the length of the model's chain of thought, so the scores may differ from those obtained with longer reasoning chains.

In our experiments, **the accuracy of the FP4 quantized version is nearly identical to that of the BF16 version, while enabling faster inference.**

| Data Format | MMLU Score |
|:---|:---|
| BF16 Official | 88.21 |
| FP4 Quantized | 87.43 |

## Quickstart

We recommend using the [Chitu](https://github.com/thu-pacman/chitu) inference framework to run this model.

The following command shows how to serve Qwen3-32B-fp4 (an example request is sketched at the end of this card):

```bash
# Replace "your model path" with the local directory that contains the model weights and tokenizer.
torchrun --nproc_per_node 1 \
    --master_port=22525 \
    -m chitu \
    serve.port=21002 \
    infer.cache_type=paged \
    infer.pp_size=1 \
    infer.tp_size=1 \
    models=Qwen3-32B-fp4 \
    models.ckpt_dir="your model path" \
    models.tokenizer_path="your model path" \
    dtype=float16 \
    infer.do_load=True \
    infer.max_reqs=1 \
    scheduler.prefill_first.num_tasks=100 \
    infer.max_seq_len=4096 \
    request.max_new_tokens=100 \
    infer.use_cuda_graph=True
```

## Contact

solution@qingcheng.ai
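
## Example request

Once the serving command above is running, you can send it a test request over HTTP. The sketch below assumes that Chitu exposes an OpenAI-compatible `/v1/chat/completions` endpoint on the configured `serve.port` (21002 in the Quickstart command); the endpoint path, request fields, and model name are assumptions, so adjust them to match your Chitu version.

```bash
# Assumption: Chitu serves an OpenAI-compatible chat completions endpoint on serve.port (21002 above).
curl http://localhost:21002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen3-32B-fp4",
        "messages": [
          {"role": "user", "content": "Give me a short introduction to FP4 quantization."}
        ],
        "max_tokens": 64
      }'
```

Generation length will also be limited by the `request.max_new_tokens=100` setting in the serving command.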