vllm 部署 GLM-4.7

张

张建站

2026/6/3 1:31:28

10分钟阅读

读这篇文章假设已经读过大模型部署部署一下 GLM-4.7-Flashhttps://huggingface.co/zai-org/GLM-4.7-Flash首先准备 vllm 环境vllm 是大模型推理的框架之一。环境准备venv、pip、vllm# 安装 venvsudoaptinstallpython3-venv-y# 初始化 venv 环境python3-mvenv vllm-envcdvllm-env/sourcevllm-env/bin/activate pipinstall--upgradepip# 安装 vllm后面要使用 vllm 作为模型推理使用 GLM-4.7 权重pipinstallvllm# ... ... 等待下载后出现 Successfully installed ... 安装完成# 检测是否可用 https://docs.vllm.aivllm-h下载GLM-4.7-Flash加速下载HF_XET_HIGH_PERFORMANCE# 配置镜像站如果已科学可忽略exportHF_ENDPOINThttps://hf-mirror.com# 开启多线程下载exportHF_XET_HIGH_PERFORMANCE1hf download zai-org/GLM-4.7-Flash默认下载路径du-s-h~/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/ 59G /root/.cache/huggingface/hub/models--zai-org--GLM-4.7-Flash/启动GLM-4.7-Flash启动命令可以尝试先启动如果有环境问题往下看我有两张显卡这里配置 --tensor-parallel-size 2显存不多限制了context --max-model-len 32768参数作用文末有写–reasoning-parser glm45–tool-call-parser glm47–enable-auto-tool-choicevllm serve zai-org/GLM-4.7-Flash\--host0.0.0.0\--port8099\--tensor-parallel-size2\--max-model-len32768\--gpu-memory-utilization0.9\--served-model-name glm-4.7-flash近60G 的模型启动需要几分钟日志中看到相关的 starting server 标识则启动成功。APIServerpid1034591)INFO 06-0219:34:03[api_server.py:596]Starting vLLM server on http://0.0.0.0:8099(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:37]Available routes are:(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /openapi.json, Methods: HEAD, GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /docs, Methods: HEAD, GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /docs/oauth2-redirect, Methods: HEAD, GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /redoc, Methods: HEAD, GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /tokenize, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /detokenize, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /load, Methods: GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /version, Methods: GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /health, Methods: GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /metrics, Methods: GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/models, Methods: GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /ping, Methods: GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /ping, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /invocations, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/chat/completions, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/chat/completions/batch, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/responses, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/responses/{response_id}, Methods: GET(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/responses/{response_id}/cancel, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/completions, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/messages, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/messages/count_tokens, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /inference/v1/generate, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /scale_elastic_ep, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /is_scaling_elastic_ep, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /generative_scoring, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/chat/completions/render, Methods: POST(APIServerpid1034591)INFO 06-0219:34:03[launcher.py:46]Route: /v1/completions/render, Methods: POST(APIServerpid1034591)INFO: Started server process[1034591]简单验证查看模型列表curl--noproxy*http://127.0.0.1:8099/v1/models请求返回{object:list,data:[{id:glm-4.7-flash,object:model,created:1780401515,owned_by:vllm,root:zai-org/GLM-4.7-Flash,parent:null,max_model_len:32768,permission:[{id:modelperm-8aaff353c7c2a8da,object:model_permission,created:1780401515,allow_create_engine:false,allow_sampling:true,allow_logprobs:true,allow_search_indices:false,allow_view:true,allow_fine_tuning:false,organization:*,group:null,is_blocking:false}]}]}QAstream 可以指定为 true最终他的返回结果与 ollama 有一些差异返回的结构是 openApi 格式curl--noproxy*http://127.0.0.1:8099/v1/chat/completions-HContent-Type: application/json-d{ model: glm-4.7-flash, messages: [ {role: user, content: 你好请用三句话介绍你自己} ], stream: false }返回{id:chatcmpl-8a9c21acdfd79df3,object:chat.completion,created:1780401656,model:glm-4.7-flash,choices:[{index:0,message:{role:assistant,content:1. **识别用户的核心请求**用户希望我AI用三句话介绍我自己。\n\n2. **分析约束条件**\n * 主题自我介绍AI助手/LangGPT。\n * 格式正好三句话。\n * 语言中文提示词的暗示。\n\n3. **起草初步构思内心独白/尝试**\n * *草稿 1* 我是一个由Z.ai开发的AI语言模型。我可以回答问题和生成文本。我希望通过交流帮助到你。很好标准准确。\n * *草稿 2* 你好我是GLM一个大型语言模型。我没有情感但我可以处理信息。如果你想聊天或者需要帮助随时叫我。稍微随意一点。\n * *草稿 3* 我是GLM,一个由智谱AI训练的大语言模型。我可以协助你解答问题、创作内容或进行各种语言交流。很高兴能通过文字与你对话期待为你提供帮助。礼貌且专业。\n\n4. **选择最佳方案**草稿 3 平衡了身份、功能和建议的行动。简单直接。\n\n5. **润色选定的方案中文**\n * 第1句我是由Z.ai训练的大语言模型名为GLM。我是一名由Z.ai训练的语言模型名字叫GLM。\n * 第2句我可以回答问题、撰写文案并进行各类语言交流。我可以回答问题、撰写文案并参与各种语言交流。\n * 第3句很高兴能为您提供帮助请随时告诉我您想知道什么。很高兴能为您提供帮助请随时告诉我您想知道什么。\n\n6. **最终润色对照约束条件检查**\n * 是否有三句话是。\n * 是否为自我介绍是。\n * 语言是否自然是。\n\n7. **生成最终输出**清晰地呈现句子。\n\n 我是Z.ai训练的大语言模型GLM。我可以回答问题、撰写文案并参与各类语言交流。很高兴能为您提供帮助请随时告诉我您想知道什么。/think我是Z.ai研发的大语言模型GLM。我可以回答问题、撰写文案并参与各类语言交流。很高兴能为您提供帮助请随时告诉我您想知道什么。,refusal:null,annotations:null,audio:null,function_call:null,tool_calls:[],reasoning:null},logprobs:null,finish_reason:stop,stop_reason:154827,token_ids:null,routed_experts:null}],service_tier:null,system_fingerprint:vllm-0.22.0-tp2-5b0f4813,usage:{prompt_tokens:13,total_tokens:497,completion_tokens:484,prompt_tokens_details:null},prompt_logprobs:null,prompt_token_ids:null,prompt_text:null,kv_transfer_params:null}环境问题解决# 这里遇到了一堆环境问题最终解决办法是将错误贴给 codex 后给出修复命令# 修复环境后要清掉之前用错误 CUDA 编译过的缓存然后重新启动 vLLM######### rm -rf /root/.cache/flashinfer #########pipinstallnvidia-cuda-nvcc-cu1212.4.131\-ihttps://mirrors.aliyun.com/pypi/simple\--trusted-host mirrors.aliyun.com# 依旧是修复环境问题pipinstall-Unvidia-cuda-nvcc nvidia-cuda-runtime nvidia-cuda-nvrtc nvidia-cuda-cupti\-ihttps://mirrors.aliyun.com/pypi/simple\--trusted-host mirrors.aliyun.comexportVLLM_USE_FLASHINFER_SAMPLER0pipinstall-U--force-reinstallcuda-toolkit[nvcc,nvrtc,cudart,cupti]13.0.2\-ihttps://mirrors.aliyun.com/pypi/simple\--trusted-host mirrors.aliyun.com# 配置 nvcc 环境变量find/root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia-namenvcc-typeffind/root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia-namecuda_runtime.h-typeffind/root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia-namelibcudart.so*-typef /root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia/cu13/bin/nvcc /root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia/cu13/include/cuda_runtime.h /root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia/cu13/lib/libcudart.so.13exportCUDA_HOME/root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia/cu13exportPATH$CUDA_HOME/bin:$PATHexportCPATH$CUDA_HOME/includeexportLIBRARY_PATH$CUDA_HOME/libexportLD_LIBRARY_PATH$CUDA_HOME/lib# 检查环境版本whichnvcc nvcc--versionpython-cimport torch; print(torch.__version__, torch.version.cuda)/root/vllm-env/vllm-env/lib/python3.12/site-packages/nvidia/cu13/bin/nvcc nvcc: NVIDIA(R)Cuda compiler driver Copyright(c)2005-2025 NVIDIA Corporation Built on Wed_Aug_20_01:58:59_PM_PDT_2025 Cuda compilation tools, release13.0, V13.0.88 Build cuda_13.0.r13.0/compiler.36424714_02.11.0cu13013.0# 版本对齐torch:2.11.0cu130 torch CUDA:13.0nvcc:13.0参数补充–reasoning-parser glm45–tool-call-parser glm47–enable-auto-tool-choice这三个参数都是 vLLM 的 OpenAI-compatible API 输出解析/工具调用相关参数不是模型加载必需参数。 --reasoning-parser glm45 作用把模型输出里的“思考过程”解析成 OpenAI 风格的 reasoning 字段。有些 reasoning 模型会输出类似 think这里是思考过程/think 最终答案或者厂商自定义的 thinking 格式。 --reasoning-parser glm45 告诉 vLLM 按 GLM-4.5/GLM 系列的 reasoning 格式去解析模型输出理想情况下返回会类似 { message: { content: 最终答案, reasoning: 中间思考过程 } } 但如果模型本次没有按 parser 期望的格式输出vLLM 就解析不出来reasoning 可能还是 null思考文本会混在 content 里。 --tool-call-parser glm47 作用解析 GLM-4.7 的工具调用格式。比如你给模型传 tools tools: [ { type: function, function: { name: get_weather, parameters: {...} } } ] 模型可能会生成一段它自己的工具调用文本。 --tool-call-parser glm47 负责把这段文本转换成标准 OpenAI-compatible 的 tool_calls: [ { type: function, function: { name: get_weather, arguments: {...} } } ] 如果不加这个 parser模型可能仍然会输出工具调用内容但它可能只是普通文本不会变成标准 tool_calls 字段。 --enable-auto-tool-choice 作用允许模型自动决定是否调用工具。 OpenAI-compatible API 里有 tool_choice: auto 意思是模型自己判断这次要不要调用工具 vLLM 默认不一定允许自动工具选择所以要加 --enable-auto-tool-choice 通常它要和 --tool-call-parser glm47 一起使用。三者关系 --reasoning-parser glm45 负责解析“思考过程” --tool-call-parser glm47 负责解析“工具调用” --enable-auto-tool-choice 允许模型自动选择工具调用它们不会显著改变模型能力主要影响 API 返回结构。你现在的情况是模型会输出分析过程但 reasoning 字段是 null 说明模型输出没有匹配 glm45 parser 期望的 reasoning 格式或者当前调用方式没有触发独立 reasoning 格式。

QueryExcel：基于NPOI的Excel批量数据检索系统架构解析

QueryExcel：基于NPOI的Excel批量数据检索系统架构解析【免费下载链接】QueryExcel 多Excel文件内容查询工具。项目地址: https://gitcode.com/gh_mirrors/qu/QueryExcel 在数据处理工作流中，跨多个Excel文件进行内容检索是一项常见但低效的任务…...

2026/6/3 1:30:28 阅读更多 →

深度揭秘OmenSuperHub：惠普暗影精灵硬件控制的底层技术实现

深度揭秘OmenSuperHub：惠普暗影精灵硬件控制的底层技术实现【免费下载链接】OmenSuperHub Control Omen laptop performance, fan speeds, and keyboard lighting, and unlock power limits. 项目地址: https://gitcode.com/gh_mirrors/om/OmenSuperHub 你是…...

2026/6/3 1:29:49 阅读更多 →

书匠策AI：一个让你“偷懒“还能拿高分的课程论文秘密武器，90%的大学生还不知道

你有没有经历过这种崩溃时刻—— 凌晨两点，Word文档上的光标一闪一闪，课程论文还差3000字，脑子却一片空白。翻了二十篇文献，每个字都认识，连在一起就不知道怎么变成自己的话。别慌。今天这篇文章，就是来…...

2026/6/3 1:25:22 阅读更多 →

掌握Markdown实时预览：打造高效写作工作流的3个关键策略

掌握Markdown实时预览：打造高效写作工作流的3个关键策略【免费下载链接】markn Lightweight markdown viewer. 项目地址: https://gitcode.com/gh_mirrors/ma/markn 在当今数字创作时代，Markdown已成为技术文档、博客文章和个人笔记的首选格式。…...

2026/6/2 7:26:22 阅读更多 →

Win10/Win11下Realtek 8188GU网卡驱动感叹号？别急着扔，试试这个手动安装的野路子

Realtek 8188GU网卡驱动故障深度修复指南：从原理到实战当设备管理器里那个顽固的黄色感叹号挥之不去，而你已经尝试了所有"标准操作"——Windows自动更新、第三方驱动工具、甚至重启大法——却依然无济于事时，是时候换个思路了。这篇…...

2026/6/3 0:57:19 阅读更多 →

前轮驱动自行车机器人建模与自适应控制策略优化【附代码】

✨ 长期致力于自行车机器人、前轮驱动、Lagrange方程、自适应模糊控制、RBF网络自适应控制研究工作，擅长数据搜集与处理、建模仿真、程序编写、仿真设计。 ✅ 专业定制毕设、代码 ✅ 如需沟通交流，点击《获取方式》 （1）基于瞬时转…...

2026/6/2 22:29:08 阅读更多 →

ModTheSpire终极指南：5分钟安全安装《杀戮尖塔》模组管理器

ModTheSpire终极指南：5分钟安全安装《杀戮尖塔》模组管理器【免费下载链接】ModTheSpire External mod loader for Slay The Spire 项目地址: https://gitcode.com/gh_mirrors/mo/ModTheSpire 还在为《杀戮尖塔》模组安装的复杂流程而头疼吗？Mod…...

2026/6/2 6:08:03 阅读更多 →