# Hands-On Kubernetes Capacity Planning: From Forecasting to Implementation
## Preface

Folks, let's skip the fancy theory. Today we go straight to the good stuff: my real-world experience doing Kubernetes capacity planning on the front lines at a large company. As an engineer who writes frontend code by day and plays drums by night, I hold capacity planning to the same standard I hold a drum groove: tight and precise.

## Background

Recently our team's Kubernetes clusters kept running into resource shortages and capacity bottlenecks. After a month of capacity-planning work we built out a complete capacity-management practice: resource-shortage incidents dropped by 80% and system stability improved noticeably. Here is everything we learned.

## Capacity Assessment

### 1. Analyzing Current Capacity

Problem: how do you assess the cluster's current capacity?

Solution: straight to the commands:

```bash
# Inspect node capacity
kubectl describe nodes | grep -A 5 Capacity

# Check node-level resource usage
kubectl top nodes

# Check Pod resource usage
kubectl top pods --all-namespaces

# Generate a capacity report
kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, capacity: .status.capacity, allocatable: .status.allocatable}'
```

### 2. Calculating Capacity Utilization

Problem: how do you calculate capacity utilization?

Solution:

```yaml
# Prometheus capacity metrics
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-metrics
  namespace: monitoring
spec:
  groups:
  - name: capacity
    rules:
    - record: cluster:cpu_utilization:ratio
      expr: |
        sum(kube_pod_container_resource_requests{resource="cpu"})
        /
        sum(kube_node_status_allocatable{resource="cpu"})
    - record: cluster:memory_utilization:ratio
      expr: |
        sum(kube_pod_container_resource_requests{resource="memory"})
        /
        sum(kube_node_status_allocatable{resource="memory"})
    - record: cluster:pod_utilization:ratio
      expr: |
        count(kube_pod_info)
        /
        sum(kube_node_status_capacity{resource="pods"})
```

## Capacity Forecasting

### 1. Growth Trend Analysis

Problem: how do you forecast capacity demand growth?

Solution:

```python
# Capacity forecasting script
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Load historical usage data
data = pd.read_csv("resource_usage.csv")

# Train one linear model per resource
X = data[["day"]].values
y_cpu = data["cpu_usage"].values
y_memory = data["memory_usage"].values

cpu_model = LinearRegression()
cpu_model.fit(X, y_cpu)

memory_model = LinearRegression()
memory_model.fit(X, y_memory)

# Forecast the 30 days beyond the last observed day
last_day = data["day"].max()
future_days = np.array([[last_day + i] for i in range(1, 31)])
cpu_prediction = cpu_model.predict(future_days)
memory_prediction = memory_model.predict(future_days)

print(f"Predicted CPU demand 30 days out: {cpu_prediction[-1]:.2f} cores")
print(f"Predicted memory demand 30 days out: {memory_prediction[-1]:.2f} GB")
```
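A point forecast like the one above should not be turned directly into node counts; you pad it with a safety margin first and round up to whole nodes. A minimal sketch (the 20% default margin and the helper names `forecast_with_margin` / `nodes_for_forecast` are my own illustration, not from the original setup):

```python
import math

def forecast_with_margin(predicted_cores: float, margin: float = 0.2) -> float:
    """Apply a safety margin to a point forecast.

    margin=0.2 reserves 20% headroom on top of the predicted demand.
    """
    return predicted_cores * (1.0 + margin)

def nodes_for_forecast(predicted_cores: float, node_cores: float, margin: float = 0.2) -> int:
    """Round the padded forecast up to whole nodes of a given size."""
    return math.ceil(forecast_with_margin(predicted_cores, margin) / node_cores)

# Example: forecast says 100 cores, nodes have 16 cores each
print(nodes_for_forecast(100, 16))  # 120 cores padded -> 8 nodes
```

Rounding up rather than to the nearest node is deliberate: running one node over is cheap, running one node short during a spike is not.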
### 2. Seasonal Analysis

Problem: how do you handle seasonal traffic swings?

Solution:

```yaml
# Seasonal capacity configuration
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: seasonal-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: music-app
  minReplicas: 3
  maxReplicas: 100
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```

## Capacity Planning Implementation

### 1. Node Pool Planning

Problem: how do you plan node pools with different specs?

Solution:

```yaml
# Node pool configuration (Karpenter)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: general-purpose
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand", "spot"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.large", "m5.xlarge", "m5.2xlarge"]
  limits:
    resources:
      cpu: 1000
      memory: 4000Gi
  ttlSecondsAfterEmpty: 300
  ttlSecondsUntilExpired: 86400
---
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: compute-optimized
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["c5.large", "c5.xlarge", "c5.2xlarge"]
  taints:
  - key: compute-optimized
    value: "true"
    effect: NoSchedule
  limits:
    resources:
      cpu: 500
      memory: 1000Gi
```

### 2. Reserved Capacity

Problem: how do you reserve headroom for traffic bursts?

Solution:

```yaml
# Priority class for overprovisioning
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: Used for overprovisioning
---
# Placeholder pods that get evicted as soon as real workloads need the space
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1
        resources:
          requests:
            cpu: 2000m
            memory: 4Gi
```

## Capacity Monitoring
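When sizing a node pool, either CPU or memory will be the binding dimension, and whichever one it is drives the node count. A back-of-the-envelope sketch (the `nodes_needed` helper, the 15% default headroom, and the example instance shape are my own illustration):

```python
import math

def nodes_needed(total_cpu: float, total_mem_gib: float,
                 node_cpu: float, node_mem_gib: float,
                 headroom: float = 0.15) -> int:
    """Estimate the node count for a pool: pad total requests by a headroom
    fraction, take the binding dimension (CPU or memory), round up."""
    by_cpu = total_cpu * (1 + headroom) / node_cpu
    by_mem = total_mem_gib * (1 + headroom) / node_mem_gib
    return math.ceil(max(by_cpu, by_mem))

# Example: 200 cores / 500 GiB of requests on 4 vCPU / 16 GiB nodes.
# CPU is the binding dimension here (57.5 nodes by CPU vs ~36 by memory).
print(nodes_needed(200, 500, 4, 16))  # -> 58
```

If the two dimensions disagree badly (one resource nearly idle while the other is the constraint), that is usually the signal to split workloads across differently shaped pools, which is exactly what the general-purpose and compute-optimized provisioners above do.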
### 1. Capacity Alerts

Problem: how do you monitor capacity usage?

Solution:

```yaml
# Capacity alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: capacity-alerts
  namespace: monitoring
spec:
  groups:
  - name: capacity
    rules:
    - alert: ClusterCPUHigh
      expr: |
        cluster:cpu_utilization:ratio > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Cluster CPU utilization is high
        description: Cluster CPU utilization is {{ $value | humanizePercentage }}
    - alert: ClusterMemoryHigh
      expr: |
        cluster:memory_utilization:ratio > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Cluster memory utilization is high
        description: Cluster memory utilization is {{ $value | humanizePercentage }}
    - alert: ClusterPodCountHigh
      expr: |
        cluster:pod_utilization:ratio > 0.9
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: Cluster pod count is high
        description: Cluster pod utilization is {{ $value | humanizePercentage }}
```

## Best Practices

- Capacity assessment: assess current capacity regularly, establish a capacity baseline, monitor capacity trends.
- Capacity forecasting: forecast from historical data, account for seasonality, keep a safety margin.
- Capacity implementation: scale out in stages, use multiple node pools, configure autoscaling.
- Capacity monitoring: set capacity alerts, hold regular capacity reviews, keep optimizing capacity usage.

## Common Problems and Solutions

### 1. Capacity Shortage

Problem: the cluster cannot meet demand.

Solution: expand node pools ahead of time, optimize resource usage, use Cluster Autoscaler.

### 2. Wasted Capacity

Problem: large amounts of capacity sit idle.

Solution: tune resource requests, use Spot instances, configure scale-down.

### 3. Inaccurate Forecasts

Problem: capacity forecasts deviate significantly from actual demand.

Solution: tune the forecasting model, gather more data samples, factor in business events.

### 4. Slow Scale-Up

Problem: scaling lags behind traffic spikes.

Solution: scale out ahead of time, configure fast scale-up, use reserved capacity.
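On "inaccurate forecasts": you only know your model has drifted if you measure it, so keep scoring each forecast against what actually happened. A minimal sketch using mean absolute percentage error (the `mape` helper and the sample numbers are my own illustration):

```python
def mape(actual, predicted):
    """Mean absolute percentage error between observed and forecast usage.

    Lower is better; a sustained rise is the cue to retrain or rethink
    the model (e.g. add seasonality or business-event features).
    """
    assert len(actual) == len(predicted) and len(actual) > 0
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

# Example: four days of observed vs forecast CPU cores
actual = [100, 110, 120, 130]
predicted = [95, 115, 118, 140]
print(f"MAPE: {mape(actual, predicted):.1%}")
```

A simple operational rule is to alert on the forecast itself: if MAPE over the last review window exceeds your chosen threshold, the capacity plan built on that model is suspect and the safety margin should be widened until the model is fixed.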