Inference Deployment of DeepSeek-R1 on CANN


张建站

2026/5/23 20:14:53

10 min read

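This article is based on Ascend CANN and Ascend NPUs and covers the techniques around the cann-recipes-infer repository. DeepSeek-R1 is an MoE model: 671B total parameters, of which only 37B are activated per token. That is a structural challenge for the inference system, because MoE routing and expert scheduling depend on communication. CANN's collective communication library HCCL and its one-sided communication library hixl form the communication foundation for MoE inference.

The MoE inference computation graph

    import torch

    # DeepSeek-R1 MoE layer: routing + expert computation.
    # The default sizes follow the original sketch; use much smaller values
    # (e.g. num_experts=8, top_k=2, expert_dim=16) to try it locally.
    class MoELayer(torch.nn.Module):
        def __init__(self, num_experts=256, top_k=8, expert_dim=4096):
            super().__init__()
            self.num_experts = num_experts
            self.top_k = top_k
            self.expert_dim = expert_dim
            # Routing gate: decides which experts each token is sent to
            self.gate = torch.nn.Linear(expert_dim, num_experts, bias=False)
            # Expert networks: 256 FFNs, each a 2-layer MLP
            self.experts = torch.nn.ModuleList([
                torch.nn.Sequential(
                    # Up-projection sized for SwiGLU; the gated activation
                    # itself is left out of this simplified sketch
                    torch.nn.Linear(expert_dim, 2 * expert_dim * 4),
                    torch.nn.Linear(2 * expert_dim * 4, expert_dim),
                )
                for _ in range(num_experts)
            ])

        def forward(self, x):
            # x: [batch * seq_len, expert_dim]
            # Step 1: the gate computes routing scores
            gate_logits = self.gate(x)  # [B, num_experts]
            gate_scores = torch.softmax(gate_logits, dim=-1)
            # Step 2: top-k selection, 8 experts per token by default
            topk_weights, topk_indices = torch.topk(gate_scores, self.top_k, dim=-1)
            topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
            # Step 3: dispatch tokens to their experts. Each expert receives
            # its own set of tokens. In a real deployment this step needs
            # All-to-All communication, because experts live on different cards.
            dispatched = self.dispatch_tokens(x, topk_indices)
            # Step 4: each expert computes its own share
            expert_outputs = []
            for i, expert in enumerate(self.experts):
                if len(dispatched[i]) > 0:
                    out = expert(dispatched[i])
                    expert_outputs.append((i, out))
            # Step 5: collect expert outputs (the reverse All-to-All), putting
            # each card's expert results back at the matching token positions
            output = self.collect_outputs(expert_outputs, topk_indices, topk_weights)
            return output

        # The two helpers below are single-device stand-ins for the All-to-All
        # steps. They are an assumption: the original sketch leaves them undefined.
        def dispatch_tokens(self, x, topk_indices):
            dispatched = []
            for e in range(self.num_experts):
                token_idx, _ = torch.where(topk_indices == e)
                dispatched.append(x[token_idx])
            return dispatched

        def collect_outputs(self, expert_outputs, topk_indices, topk_weights):
            output = topk_weights.new_zeros(topk_indices.shape[0], self.expert_dim)
            for e, out in expert_outputs:
                token_idx, slot_idx = torch.where(topk_indices == e)
                weights = topk_weights[token_idx, slot_idx].unsqueeze(-1)
                output.index_add_(0, token_idx, out * weights)
            return output

MoE communication pattern on CANN

    // DeepSeek-R1 MoE inference: communication pattern for 8-card expert parallelism.
    // HCCL headers, communicator setup and buffer allocation are omitted.
    class MoEInferenceExecutor {
        // Each card hosts 256 / 8 = 32 experts
        const int experts_per_device = 32;
        int top_k = 8;
        int hidden_dim = 0;
        int num_tokens = 0;
        int num_devices = 8;
        // Shared by TokenDispatch and TokenCollect, so kept as members
        int send_counts[8] = {0};
        int send_displs[9] = {0};
        // Assumed to be exchanged between cards before the payload is sent
        int recv_counts[8] = {0};
        int recv_displs[9] = {0};
        float* recv_buffer = nullptr;
        float* final_buffer = nullptr;
        HcclComm hccl_comm;

        // Dispatch tokens according to the routing result, using HCCL All-to-All
        void TokenDispatch(int* topk_indices, float* hidden_states,
                           int num_tokens, int num_devices) {
            this->num_tokens = num_tokens;
            this->num_devices = num_devices;
            // Step 1: count how many tokens go to each device
            for (int d = 0; d < num_devices; d++) {
                send_counts[d] = 0;
            }
            for (int t = 0; t < num_tokens; t++) {
                for (int k = 0; k < top_k; k++) {
                    int expert_id = topk_indices[t * top_k + k];
                    int device_id = expert_id / experts_per_device;
                    send_counts[device_id]++;
                }
            }
            // Compute the displacements used for the scatter
            for (int d = 1; d < num_devices; d++) {
                send_displs[d] = send_displs[d - 1] + send_counts[d - 1];
            }
            // Step 2: All-to-All through HCCL, the counterpart of all-to-all
            // on NVIDIA's stack. HCCL supports HcclAlltoAllV for variable-length
            // sends and receives. The argument list is schematic; consult the
            // HCCL API reference for the exact signature. In practice the tokens
            // must first be packed by destination device, and the counts are in
            // tokens here, not elements.
            HcclAlltoAllV(hidden_states, send_counts, send_displs,  // send side
                          recv_buffer, recv_counts, recv_displs,    // receive side
                          HCCL_FLOAT, num_devices, hccl_comm);
        }

        // Step 3: once every card has run its experts, a reverse All-to-All
        // brings the results back
        void TokenCollect(float* expert_output, int* topk_indices,
                          float* topk_weights, float* final_output) {
            // Expert outputs travel back through the same All-to-All, with the
            // send and receive roles swapped
            HcclAlltoAllV(expert_output, recv_counts, recv_displs,
                          final_buffer, send_counts, send_displs,
                          HCCL_FLOAT, num_devices, hccl_comm);
            // Weighted merge by top-k weight: 8 expert results per token.
            // Assumes final_buffer has been restored to routing-slot order,
            // i.e. the row for (token t, slot k) sits at index t * top_k + k.
            for (int i = 0; i < num_tokens * hidden_dim; i++) {
                final_output[i] = 0.0f;
            }
            for (int t = 0; t < num_tokens; t++) {
                for (int k = 0; k < top_k; k++) {
                    int slot = t * top_k + k;
                    float weight = topk_weights[slot];
                    // Accumulate
                    for (int d = 0; d < hidden_dim; d++) {
                        final_output[t * hidden_dim + d] +=
                            final_buffer[slot * hidden_dim + d] * weight;
                    }
                }
            }
        }
    };

All-to-All communication is the bottleneck of MoE inference. Each MoE layer in DeepSeek-R1 performs two All-to-All operations (dispatch and collect); with 58 MoE layers among the model's 61 Transformer layers, that is 116 per forward pass. CANN's HCCL optimizes AlltoAllV using the inter-card interconnect topology (the NVLink-equivalent link on Ascend), so that when tokens are evenly distributed across devices, the routes avoid crossing switches.

DeepSeek-R1 under a PD-separated architecture

    # PD (Prefill-Decode) separated deployment of DeepSeek-R1.
    # This listing is pseudocode: the hixl calls and the pool methods are
    # written as in the original sketch and are not a verified interface.
    class DeepSeekPDSeparation:
        """
        PD-separated deployment is recommended for DeepSeek-R1 671B.
        - Prefill stage: compute-bound, fewer cards with large batches
        - Decode stage: memory-bound, more cards with small batches
        CANN's hixl supports zero-copy one-sided communication: the KV Cache
        produced by Prefill is exposed directly to the Decode cards for
        reading, with no explicit transfer.
        """

        def __init__(self):
            # Prefill pool: 4 cards, each handling the Prefill of 16 requests
            self.prefill_pool = [4, "ascend910", "prefill"]
            # Decode pool: 16 cards, each handling the Decode of 8 requests
            self.decode_pool = [16, "ascend910", "decode"]
            # hixl initialization: zero-copy shared memory
            import hixl
            self.shared_kv = hixl.SharedMemory(
                size_per_token=128 * 128 * 2 * 2,  # 128 heads x 128 dim x FP16 x (K and V)
                num_devices=20,  # 4 + 16
            )

        def handoff(self, request_id):
            """
            After Prefill completes, pass the KV Cache handle to a Decode card.
            hixl maps the remote memory directly, so nothing is moved via HCCL.
            """
            # The pools above are plain descriptors; export_kv and import_kv
            # stand for methods of real pool objects
            kv_handle = self.prefill_pool.export_kv(request_id)
            # kv_handle carries the physical address, length and device ID.
            # The Decode card reads it directly through hixl's rdma_read.
            self.decode_pool.import_kv(request_id, kv_handle)
            # Zero-copy: in practice a single read over the PCIe or
            # NVLink-class interconnect

DeepSeek-R1's sparse activation (256 experts, top-8 routing) makes PD separation combined with MoE All-to-All the key to inference system design. CANN's distinctive advantage in this scenario is hixl's one-sided communication: with PD separation, zero-copy transfer of the KV Cache can save 30% of inter-card bandwidth.

Reference repositories:
- DeepSeek-R1 inference recipes
- MoE-related Transformer operators
- hixl one-sided communication library
- HCCL collective communication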
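
The bookkeeping in the TokenDispatch and TokenCollect steps can be checked without any NPU. The following is a minimal pure-Python sketch, not the repository's implementation: `plan_dispatch` and `combine` are hypothetical helper names, and the toy sizes (8 experts on 4 devices, top-2 routing) are assumptions chosen to keep the numbers readable. It shows how per-device send counts become displacements through an exclusive prefix sum, how routing slots are packed by destination device, and how returned expert outputs are merged with the top-k weights.

```python
from itertools import accumulate


def plan_dispatch(topk_indices, num_experts, num_devices):
    """Plan an AlltoAllV-style dispatch.

    topk_indices: one list of expert ids per token.
    Returns (send_counts, send_displs, packed_slots), where packed_slots
    lists the (token, k) routing slots ordered by destination device.
    """
    experts_per_device = num_experts // num_devices
    send_counts = [0] * num_devices
    slots_by_device = [[] for _ in range(num_devices)]
    for t, experts in enumerate(topk_indices):
        for k, expert_id in enumerate(experts):
            device_id = expert_id // experts_per_device
            send_counts[device_id] += 1
            slots_by_device[device_id].append((t, k))
    # Exclusive prefix sum: where each device's segment starts in the send buffer
    send_displs = [0] + list(accumulate(send_counts))[:-1]
    packed_slots = [slot for slots in slots_by_device for slot in slots]
    return send_counts, send_displs, packed_slots


def combine(packed_slots, packed_outputs, topk_weights, num_tokens):
    """Weighted sum of expert outputs that come back in packed order."""
    hidden_dim = len(packed_outputs[0])
    final = [[0.0] * hidden_dim for _ in range(num_tokens)]
    for (t, k), out in zip(packed_slots, packed_outputs):
        weight = topk_weights[t][k]
        for d in range(hidden_dim):
            final[t][d] += weight * out[d]
    return final


# Toy routing: 3 tokens, top-2, 8 experts spread over 4 devices (2 per device)
topk_indices = [[0, 5], [1, 2], [7, 4]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
send_counts, send_displs, packed_slots = plan_dispatch(topk_indices, 8, 4)
print(send_counts)  # [2, 1, 2, 1]
print(send_displs)  # [0, 2, 3, 5]

# Pretend every expert returns the same vector, so the merge is easy to verify
packed_outputs = [[1.0, 2.0] for _ in packed_slots]
final = combine(packed_slots, packed_outputs, topk_weights, num_tokens=3)
print(final)  # [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
```

Keeping `packed_slots` around is the design point: the results come back in packed (per-device) order, so the merge needs the slot map rather than the expert id to find each token's rows.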

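To get a feel for how much data the KV Cache handoff involves, the per-token size from the PD-separation listing can be turned into a small calculation. This sketch uses the breakdown given in that listing's comment (128 heads x 128 dim x FP16 x K and V); treating that figure as a per-layer size, the 61-layer count, and the 4096-token prompt are assumptions made only for illustration, and the function names are hypothetical. The result is an illustrative estimate, not a measurement.

```python
def kv_bytes_per_token(num_heads=128, head_dim=128, bytes_per_elem=2, kv_tensors=2):
    """Bytes of KV Cache one token occupies in one layer (FP16, K and V)."""
    return num_heads * head_dim * bytes_per_elem * kv_tensors


def kv_bytes_per_request(seq_len, num_layers, per_token=None):
    """Total KV Cache bytes a Decode card would read for one request."""
    if per_token is None:
        per_token = kv_bytes_per_token()
    return seq_len * num_layers * per_token


per_token = kv_bytes_per_token()
print(per_token)  # 65536 bytes, i.e. 64 KiB per token per layer

# Assumed workload: a 4096-token prompt across 61 layers
total = kv_bytes_per_request(seq_len=4096, num_layers=61)
print(total / 2**30)  # 15.25 (GiB)
```

At this scale a single request's handoff is in the gigabyte range, which is why avoiding an extra staging copy between the Prefill and Decode pools matters.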