# etcd Performance Optimization

## 📖 Overview

This document describes performance tuning and architecture options for etcd in a hybrid-cloud environment.

## 🎯 Problem Analysis

etcd is extremely sensitive to latency:

- The Raft protocol requires acknowledgment from a majority of voting members (2 of 3 in a three-member cluster); learners are non-voting and do not count toward the quorum
- Recommended latency is <10 ms; <50 ms is acceptable
- If latency from the corporate site to the cloud exceeds 100 ms, the cluster will time out frequently

## 📄 Plan Comparison

| Plan | Write latency | Disaster tolerance | Complexity | Use case |
|------|---------------|--------------------|------------|----------|
| **Plan A: 2 cloud members + 1 corporate learner** | Low | ⭐⭐⭐⭐ | ⭐⭐⭐ | Recommended |
| **Plan B: External etcd** | Low | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Best |
| **Plan C: Cloud-only etcd** | Lowest | ⭐⭐⭐ | ⭐ | Simple setups |

## 📄 Full Configuration

### Plan A: 2 cloud members + 1 corporate learner (recommended)

```yaml
# kubeadm-config.yaml (used when initializing cloud A)
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "10.255.0.1:6443"  # via the HAProxy setup
networking:
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12
etcd:
  local:
    dataDir: /var/lib/etcd
    serverCertSANs:
      - 10.255.0.1
      - 10.255.0.2
    peerCertSANs:
      - 10.255.0.1
      - 10.255.0.2
    extraArgs:
      # initial-cluster and initial-cluster-state are not overridden here:
      # kubeadm generates them (member name = node name) during `kubeadm init`
      # and again when cloud B runs `kubeadm join --control-plane`. Forcing a
      # two-member initial-cluster at init time leaves etcd waiting for a peer
      # that does not exist yet.
      heartbeat-interval: "200"   # default 100 ms; raised for WAN links
      election-timeout: "2000"    # default 1000 ms; raised to 2 s for WAN links
```

### Plan B: External etcd (best)

```bash
# etcd service configuration on cloud A (10.255.0.1)
cat > /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=etcd
After=network.target

[Service]
Type=notify
ExecStart=/usr/local/bin/etcd \
  --name=cloud-a \
  --data-dir=/var/lib/etcd \
  --listen-peer-urls=https://10.255.0.1:2380 \
  --listen-client-urls=https://10.255.0.1:2379,https://127.0.0.1:2379 \
  --advertise-client-urls=https://10.255.0.1:2379 \
  --initial-advertise-peer-urls=https://10.255.0.1:2380 \
  --initial-cluster=cloud-a=https://10.255.0.1:2380,cloud-b=https://10.255.0.2:2380 \
  --initial-cluster-state=new \
  --initial-cluster-token=k8s-etcd-cluster \
  --cert-file=/etc/etcd/pki/server.crt \
  --key-file=/etc/etcd/pki/server.key \
  --peer-cert-file=/etc/etcd/pki/peer.crt \
  --peer-key-file=/etc/etcd/pki/peer.key \
  --trusted-ca-file=/etc/etcd/pki/ca.crt \
  --peer-trusted-ca-file=/etc/etcd/pki/ca.crt \
  --peer-client-cert-auth \
  --client-cert-auth
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now etcd
```

```yaml
# kubeadm configuration using external etcd
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "10.255.0.1:6443"
etcd:
  external:
    endpoints:
      - https://10.255.0.1:2379
      - https://10.255.0.2:2379
      # The learner (10.255.0.3) is intentionally not listed: etcd learners
      # reject writes and linearizable reads, which kube-apiserver depends on.
    caFile: /etc/kubernetes/pki/etcd/ca.crt
    certFile: /etc/kubernetes/pki/apiserver-etcd-client.crt
    keyFile: /etc/kubernetes/pki/apiserver-etcd-client.key
```

### Plan C: Cloud-only etcd (simplest)

```yaml
# If latency from the corporate site to the cloud exceeds 100 ms, run etcd only in the cloud.
# Corporate nodes act as workers only.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "10.255.0.1:6443"
etcd:
  local:
    dataDir: /var/lib/etcd
    # kubeadm manages initial-cluster itself; the cloud B member is added
    # when cloud B joins as a control-plane node.
```

### etcd performance tuning

```bash
#!/bin/bash
# etcd performance tuning script

# 1. Disk I/O (etcd is sensitive to disk latency)
# Make sure the etcd data directory is on SSD/NVMe.
findmnt -n -o SOURCE -T /var/lib/etcd   # device backing the data directory
lsblk -d -o NAME,ROTA                   # ROTA=0 means non-rotational (SSD/NVMe)

# 2. Set the I/O scheduler (replace sda with the device found above)
# Kernels using blk-mq name the schedulers "mq-deadline" and "none";
# older kernels use "deadline" and "noop".
echo "mq-deadline" > /sys/block/sda/queue/scheduler

# 3. Disable transparent huge pages
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# 4. Raise the file descriptor limit
# ulimit only affects the current shell; for an etcd systemd service,
# set LimitNOFILE=65536 in the unit file instead.
ulimit -n 65536

# 5. etcd compaction and defragmentation (run periodically)
export ETCDCTL_API=3
# Shared connection flags, so every call (including the revision lookup) authenticates.
ETCDCTL=(etcdctl
  --endpoints=https://127.0.0.1:2379
  --cacert=/etc/kubernetes/pki/etcd/ca.crt
  --cert=/etc/kubernetes/pki/etcd/server.crt
  --key=/etc/kubernetes/pki/etcd/server.key)

# Compact up to the current revision, then defragment every member.
rev=$("${ETCDCTL[@]}" endpoint status --write-out=json | jq -r '.[0].Status.header.revision')
"${ETCDCTL[@]}" compact "$rev"
"${ETCDCTL[@]}" defrag --cluster
```

### etcd monitoring and alerting

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: etcd-alerts
  namespace: monitoring
spec:
  groups:
    - name: etcd
      interval: 30s
      rules:
        - alert: EtcdHighLatency
          expr: histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5
          for: 5m
          annotations:
            summary: "etcd WAL fsync latency is above 500 ms"
        - alert: EtcdInsufficientMembers
          expr: sum(up{job="etcd"}) < 2
          for: 3m
          annotations:
            summary: "Fewer than 2 etcd members are available"
        - alert: EtcdNoLeader
          expr: etcd_server_has_leader == 0
          for: 1m
          annotations:
            summary: "etcd cluster has no leader"
        - alert: EtcdHighNumberOfLeaderChanges
          expr: increase(etcd_server_leader_changes_seen_total[15m]) > 3
          annotations:
            summary: "Frequent etcd leader changes; check the network"
        - alert: EtcdDatabaseQuotaExceeded
          # 8 GB. etcd's default --quota-backend-bytes is 2 GiB, so lower this
          # threshold unless the quota has been raised.
          expr: etcd_mvcc_db_total_size_in_bytes > 8e9
          for: 10m
          annotations:
            summary: "etcd database exceeds 8 GB; compact and defragment"
```

## 🔧 Deployment Steps

### Test latency

```bash
# Test latency from the corporate node to the cloud (must be run first)
ping -c 100 10.255.0.1

# If the average latency is >50 ms, Plan B or C must be used
```

### External etcd deployment

```bash
# 1. Deploy the etcd cluster on cloud A and cloud B
# See the systemd configuration above

# 2. Add the corporate node as a learner
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.255.0.1:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/client.crt \
  --key=/etc/etcd/pki/client.key \
  member add corp --learner --peer-urls=https://10.255.0.3:2380

# 3. Verify cluster health
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.255.0.1:2379,https://10.255.0.2:2379 \
  --cacert=/etc/etcd/pki/ca.crt \
  --cert=/etc/etcd/pki/client.crt \
  --key=/etc/etcd/pki/client.key \
  endpoint health
```

## ⚠️ Notes

- **Latency test**: network latency must be measured before deployment
- **SSD storage**: etcd must run on SSD storage
- **Regular maintenance**: run compaction and defragmentation weekly
- **Monitoring and alerting**: configure Prometheus to monitor etcd health
- **Backup policy**: take an automatic snapshot backup every 6 hours

## 📁 Original Files

The original YAML configuration file is located at `solutions/etcd-optimization.yaml`.

## 🔗 Related Documents

- Architecture plan: [1.1、最终架构方案.md](./1.1、最终架构方案.md)
- Disaster recovery: [2.7、灾难恢复配置.md](./2.7、灾难恢复配置.md)

---

**Updated:** 2025-01-22
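
The latency thresholds quoted in this document (<10 ms, <50 ms, >50 ms, >100 ms) can be turned into a small pre-deployment check. The following is a minimal sketch, not part of the original configuration: the function names `avg_rtt_ms` and `classify_latency` and the band labels are hypothetical, the treatment of exactly 50 ms as acceptable is an assumption, and the parser assumes the Linux iputils `ping` summary line (`rtt min/avg/max/mdev = ...`).

```bash
#!/usr/bin/env bash
# Sketch only: helper names and band labels are illustrative assumptions.

# avg_rtt_ms: read `ping` output on stdin and print the average RTT in ms.
# Assumes the iputils/BSD summary line, e.g.
#   rtt min/avg/max/mdev = 10.123/12.456/15.789/1.234 ms
avg_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}

# classify_latency AVG_MS: map an average RTT to the bands used in this document.
classify_latency() {
  awk -v ms="$1" 'BEGIN {
    if (ms < 10)        print "recommended"   # <10 ms: recommended
    else if (ms <= 50)  print "acceptable"    # up to 50 ms: acceptable
    else if (ms <= 100) print "plan-b-or-c"   # >50 ms: Plan B or C required
    else                print "cloud-only"    # >100 ms: keep etcd in the cloud (Plan C)
  }'
}
```

A possible invocation from the corporate node is `classify_latency "$(ping -c 100 10.255.0.1 | avg_rtt_ms)"`.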
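
The majority rule from the problem analysis (3 nodes need 2 acknowledgments) generalizes to any number of voting members. The sketch below, with hypothetical helper names `quorum` and `fault_tolerance`, shows the arithmetic; it counts voting members only, since etcd learners do not vote.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative.

# quorum N: acknowledgments required among N voting members (floor(N/2) + 1).
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: voting members that can fail while writes still succeed.
fault_tolerance() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

For example, `quorum 3` gives 2 and `fault_tolerance 3` gives 1, while `quorum 2` gives 2 and `fault_tolerance 2` gives 0. A layout with two voting members plus a learner therefore stops accepting writes if either voting member is lost, until that member recovers or the learner is promoted.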
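
The 6-hour snapshot policy in the notes could be implemented roughly as follows. This is a sketch under stated assumptions: the backup directory `/var/backups/etcd`, the file-name pattern, and the function names are hypothetical; the endpoint and certificate paths are reused from the tuning script in this document; and `date -d @...` assumes GNU date.

```bash
#!/usr/bin/env bash
# Sketch only: BACKUP_DIR and the naming scheme are assumptions, not from the original.
BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"

# snapshot_path EPOCH_SECONDS: file path for a snapshot taken at that time (UTC).
snapshot_path() {
  printf '%s/etcd-snapshot-%s.db\n' "$BACKUP_DIR" \
    "$(date -u -d "@$1" +%Y%m%dT%H%M%SZ)"
}

# take_snapshot: save a snapshot from the local member into BACKUP_DIR.
take_snapshot() {
  mkdir -p "$BACKUP_DIR"
  ETCDCTL_API=3 etcdctl \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key \
    snapshot save "$(snapshot_path "$(date +%s)")"
}
```

Saved as a script that ends with a call to `take_snapshot`, this could be scheduled with a cron expression such as `0 */6 * * *`. Restore procedures belong to the disaster recovery document.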
arise
2025-11-22 09:49