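A Practical Guide to Configuring NVIDIA GPUs in a Kubernetes 1.25 Cluster

GPU acceleration has become a key way to improve compute efficiency in AI training, high-performance computing, and similar fields. This guide walks step by step through the complete process of configuring NVIDIA GPUs in a Kubernetes 1.25 cluster, from environment checks to troubleshooting.

1. Environment Preparation and Prerequisite Checks

Before configuring anything, make sure the base environment meets the requirements. Use the following checklist.

Hardware:

- The node has an NVIDIA Tesla, GeForce, or RTX series GPU installed.
- Running `lspci | grep -i nvidia` shows the GPU device.

Software dependencies:

- A healthy Kubernetes 1.25 cluster
- NVIDIA driver version ≥ 450.80.02 (installing from the official repository is recommended)
- nvidia-container-toolkit ≥ 1.7.0
- containerd or Docker as the container runtime (dockershim was removed in Kubernetes 1.24, so Docker Engine needs the cri-dockerd adapter on 1.25)

Important: for production, use an operating system certified by NVIDIA, such as Ubuntu 20.04/22.04 or CentOS 7/8.

Verify that the driver is installed:

    nvidia-smi  # should print the GPU status

If the output contains something like the following, the driver is installed correctly:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+

2. Container Runtime Configuration

A correctly configured container runtime is the core of GPU support. The steps below use containerd as the example.

Install nvidia-container-toolkit:

    distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
      && curl -s -L https://nvidia.github.io/libnvidia-container/gpgkey | sudo apt-key add - \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update
    sudo apt-get install -y nvidia-container-toolkit

Configure containerd to use nvidia as the default runtime. Edit /etc/containerd/config.toml and, in the `[plugins."io.containerd.grpc.v1.cri".containerd]` section, set the default runtime and add the nvidia runtime definition:

    # Existing section: registering the runtime alone does not make it the default
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = "io.containerd.runc.v2"

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
          BinaryName = "nvidia-container-runtime"

Restart the service so the configuration takes effect:

    sudo systemctl restart containerd

Common problems:

| Symptom | Likely cause | Fix |
|---|---|---|
| `docker info` still shows runc as the default runtime (Docker hosts only) | Configuration not applied | Check the syntax of daemon.json and make sure the docker service was restarted |
| GPU is not visible inside the container | Runtime misconfigured | Verify the output of `nvidia-container-cli info` |
| `nvidia-smi`: command not found | Driver not installed correctly | Reinstall the driver and check that the kernel modules are loaded |

3. Deploying the NVIDIA Device Plugin

Kubernetes manages GPU resources through the Device Plugin mechanism. The following is a tuned deployment.

Create the DaemonSet manifest for the device plugin:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: nvidia-device-plugin-daemonset
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          name: nvidia-device-plugin-ds
      updateStrategy:
        type: RollingUpdate
      template:
        metadata:
          labels:
            name: nvidia-device-plugin-ds
        spec:
          tolerations:
            - key: nvidia.com/gpu
              operator: Exists
              effect: NoSchedule
          priorityClassName: system-node-critical
          containers:
            - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
              name: nvidia-device-plugin-ctr
              securityContext:
                allowPrivilegeEscalation: false
                capabilities:
                  drop: ["ALL"]
              volumeMounts:
                - name: device-plugin
                  mountPath: /var/lib/kubelet/device-plugins
          volumes:
            - name: device-plugin
              hostPath:
                path: /var/lib/kubelet/device-plugins

Apply the manifest and verify:

    kubectl apply -f nvidia-device-plugin.yaml
    kubectl get pods -n kube-system | grep nvidia-device-plugin

Check the resources the node reports:

    kubectl describe node <node-name> | grep nvidia.com/gpu

4. Hands-On Testing and Verification

Run a real workload to confirm that the GPU configuration works.

Create a test Pod:

    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-test-pod
    spec:
      restartPolicy: Never
      containers:
        - name: cuda-vectoradd
          image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda12.5.0
          resources:
            limits:
              nvidia.com/gpu: 1
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule

This image is built against CUDA 12, so the node needs a driver recent enough for CUDA 12 (a 525-series or later driver), which is stricter than the 450.80.02 minimum listed in section 1.

Deploy the Pod and read its logs:

    kubectl apply -f gpu-pod.yaml
    kubectl logs gpu-test-pod

A successful run prints:

    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED

5. Advanced Configuration and Performance Tuning

For production there are a few more things to consider.

Multi-GPU scheduling:

- Use the nvidia.com/gpu.product node label to target a specific GPU model (the label is published by NVIDIA GPU Feature Discovery, which must be deployed separately).
- Use node labels to implement GPU affinity scheduling.

GPU sharing. The device plugin reads its configuration as YAML and expresses sharing as time-slicing, where `replicas` is the number of Pods that may share each physical GPU:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nvidia-device-plugin-config
      namespace: kube-system
    data:
      config.yaml: |
        version: v1
        sharing:
          timeSlicing:
            resources:
              - name: nvidia.com/gpu
                replicas: 10

The ConfigMap has no effect until the plugin is pointed at it, for example through the configuration options of the plugin's Helm chart. The DaemonSet in section 3 does not mount it.

Monitoring and metrics:

- Deploy DCGM Exporter to collect GPU metrics.
- Integrate Prometheus and Grafana for dashboards.

Performance tuning summary:

| Setting | Default | Recommendation | Expected benefit |
|---|---|---|---|
| GPU memory allocation granularity | 1 GB | Adjust to the application | About 30% higher resource utilization |
| Compute mode | DEFAULT | Set to EXCLUSIVE_PROCESS | Less context-switching overhead |
| Power limit | Maximum power | Adjust dynamically to the load | 20-40% energy savings |

6. In-Depth Troubleshooting Guide

When something goes wrong, work through the following steps.

Basic checks:

- Is the GPU recognized on the node? Run `nvidia-smi`.
- Is the device plugin Pod running? Run `kubectl get pods -n kube-system`.

Log analysis:

    # View the device plugin logs
    kubectl logs -n kube-system <nvidia-device-plugin-pod-name>
    # Check whether the device plugin registered with the kubelet
    journalctl -u kubelet | grep -i device.plugin

Common problems and solutions:

Problem: the Pod cannot be scheduled, with the message `0/1 nodes are available: 1 Insufficient nvidia.com/gpu`

- Investigate: check the node's resource allocation with `kubectl describe node`.
- Fix: confirm that the device plugin DaemonSet has been deployed to the target node.

Problem: `nvidia-smi` fails inside the container

- Investigate: verify the container runtime configuration and the mounted library files.
- Fix: make sure libraries such as /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 are mounted correctly.

Problem: GPU utilization shows 0%

- Investigate: check CUDA version compatibility.
- Fix: make sure the CUDA version of the container image is compatible with the driver.

In a recent customer case, we found that after Kubernetes was upgraded to 1.25, the existing device plugin configuration left GPU resources unrecognized. The kubelet logs pointed to an API version compatibility problem, and updating the device plugin to v0.14.1 resolved it.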
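
The first troubleshooting case in section 6 (`Insufficient nvidia.com/gpu`) usually comes down to what the node itself advertises to the scheduler. The following is a minimal shell sketch of that check, assuming `kubectl` is already configured for the cluster. The function names `diagnose_gpu_allocatable` and `check_node_gpu` are hypothetical helpers written for this illustration and are not part of any NVIDIA or Kubernetes tooling.

```shell
# Classify the allocatable nvidia.com/gpu value reported for a node.
# An empty value means the node does not advertise the resource at all.
diagnose_gpu_allocatable() {
  case "$1" in
    "")        echo "not-advertised" ;;
    *[!0-9]*)  echo "unparseable" ;;
    *)         if [ "$1" -eq 0 ]; then echo "zero"; else echo "ok"; fi ;;
  esac
}

# Query one node and print a hint that matches the troubleshooting steps.
# The dot inside the resource name has to be escaped in the jsonpath expression.
check_node_gpu() {
  node="$1"
  count=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
  case "$(diagnose_gpu_allocatable "$count")" in
    ok)
      echo "$node: $count GPU(s) allocatable" ;;
    zero)
      echo "$node: nvidia.com/gpu is 0; run nvidia-smi on the node and read the device plugin logs" ;;
    not-advertised)
      echo "$node: nvidia.com/gpu is not advertised; confirm the device plugin DaemonSet has a Pod on this node" ;;
    *)
      echo "$node: unexpected allocatable value '$count'" ;;
  esac
}

# Usage (gpu-node-1 is a placeholder node name):
#   check_node_gpu gpu-node-1
```

Keeping the classification separate from the `kubectl` call means the decision logic can be exercised without a cluster, and the same helper can be reused if the value comes from another source such as a saved `kubectl get node -o json` dump.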