From 30b8421510892303dc5ddd6cd0ac90ca2053478d Mon Sep 17 00:00:00 2001
From: An Yang
Date: Tue, 29 Apr 2025 08:52:30 +0000
Subject: [PATCH] Update README.md

---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 3a1d398..a16bb91 100644
--- a/README.md
+++ b/README.md
@@ -90,7 +90,7 @@
 print("thinking content:", thinking_content)
 print("content:", content)
 ```
-For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.4` or to create an OpenAI-compatible API endpoint:
+For deployment, you can use `sglang>=0.4.6.post1` or `vllm>=0.8.5` to create an OpenAI-compatible API endpoint:
 - SGLang:
 ```shell
 python -m sglang.launch_server --model-path Qwen/Qwen3-32B --reasoning-parser qwen3
@@ -100,7 +100,7 @@
 vllm serve Qwen/Qwen3-32B --enable-reasoning --reasoning-parser deepseek_r1
 ```
 
-For local use, applications such as llama.cpp, Ollama, LMStudio, and MLX-LM have also supported Qwen3.
+For local use, applications such as Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers also support Qwen3.
 
 ## Switching Between Thinking and Non-Thinking Mode
@@ -274,7 +274,7 @@ YaRN is currently supported by several inference frameworks, e.g., `transformers`
 {
     ...,
     "rope_scaling": {
-        "type": "yarn",
+        "rope_type": "yarn",
         "factor": 4.0,
         "original_max_position_embeddings": 32768
     }
@@ -286,12 +286,12 @@
 For `vllm`, you can use
 ```shell
-    vllm serve ... --rope-scaling '{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
+    vllm serve ... --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' --max-model-len 131072
 ```
 
 For `sglang`, you can use
 ```shell
-    python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
+    python -m sglang.launch_server ... --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
 ```
 
 For `llama-server` from `llama.cpp`, you can use
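
Note: once either server from the first hunk is running, it exposes an OpenAI-compatible API. Below is a minimal client sketch, assuming vLLM's default port 8000 (SGLang defaults to 30000) and no API key configured; the base URL and the `api_key` placeholder are assumptions, not part of the patch.

```python
# Minimal sketch: call the OpenAI-compatible endpoint started by
# `vllm serve Qwen/Qwen3-32B ...` or `python -m sglang.launch_server ...`.
# Assumes vLLM's default port 8000 (SGLang defaults to 30000); the
# api_key value is a placeholder, as these servers accept any key
# unless one is explicitly configured.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Briefly introduce large language models."}],
)
print(response.choices[0].message.content)
```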
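Note: the last two hunks rename the YaRN key in `config.json` and in the serving flags. As an alternative to editing `config.json` on disk, the same settings can be applied at load time with `transformers`; this is a sketch under that assumption, with the override dict mirroring the JSON in the patch and the 131072 context length derived from 32768 x 4.0.

```python
# Minimal sketch: apply the patched YaRN rope scaling at load time
# instead of editing config.json. The rope_scaling dict mirrors the
# hunk above; passing it via `config=` is an alternative route, not
# the method documented in the patched README.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen3-32B")
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
config.max_position_embeddings = 131072  # 32768 * 4.0

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-32B",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```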