Skip to content

ETCD 备份还原

使用 ETCD 备份功能创建备份策略,可以将指定集群的 etcd 数据定时备份到 S3 存储中。本文主要介绍如何将已经备份的数据还原到当前集群中。

Note

  • AI 算力中心ETCD 备份还原仅限于针对同一集群(节点数和 IP 地址没有变化)进行备份与还原。 例如,备份了 A 集群 的 etcd 数据后,只能将备份数据还原到 A 集群中,不能还原到 B 集群。
  • 对于跨集群的备份与还原,建议使用应用备份还原功能。
  • 首先创建备份策略,备份当前状态,建议参考ETCD 备份功能。

下面通过具体的案例来说明备份还原的整个过程。

环境信息

首先介绍还原的目标集群和 S3 存储的基本信息。这里以 MinIo 作为 S3 存储,整个集群有 3 个控制面(3 个 etcd 副本)。

IP 主机 角色 备注
10.6.212.10 host01 k8s-master01 k8s 节点 1
10.6.212.11 host02 k8s-master02 k8s 节点 2
10.6.212.12 host03 k8s-master03 k8s 节点 3
10.6.212.13 host04 minio minio 服务

前提条件

安装 etcdbrctl 工具

为了实现 ETCD 数据备份还原,需要在上述任意一个 Kubernetes 节点上安装 etcdbrctl 开源工具。 此工具暂时没有二进制文件,需要自行编译。编译方式请参考 Gardener / etcd-backup-restore 本地开发文档

安装完成后用如下命令检查工具是否可用:

etcdbrctl -v

预期输出如下:

INFO[0000] etcd-backup-restore Version: v0.23.0-dev
INFO[0000] Git SHA: b980beec
INFO[0000] Go Version: go1.19.3
INFO[0000] Go OS/Arch: linux/amd64

检查备份数据

还原之前需要检查下列事项:

  • 是否已经在 AI 算力中心中成功备份了数据
  • 检查 S3 存储中备份数据是否存在

Note

AI 算力中心的备份是全量数据备份,还原时将还原最后一次备份的全量数据。

minio

关闭集群

在备份之前,必须要先关闭集群。默认集群 etcdkube-apiserver 都是以静态 Pod 的形式启动的。 这里的关闭集群是指将静态 Pod manifest 文件移动到 /etc/kubernetes/manifest 目录外,集群就会移除对应 Pod,达到关闭服务的作用。

  1. 首先删除之前的备份数据,移除数据并非将现有 etcd 数据删除,而是指修改 etcd 数据目录的名称。 等备份还原成功之后再删除此目录。这样做的目的是,如果 etcd 备份还原失败,还可以尝试还原当前集群。此步骤每个节点均需执行。

    rm -rf /var/lib/etcd_bak
    
  2. 然后需要关闭 kube-apiserver 的服务,确保 etcd 的数据没有新变化。此步骤每个节点均需执行。

    mv /etc/kubernetes/manifests/kube-apiserver.yaml /tmp/kube-apiserver.yaml
    
  3. 同时还需要关闭 etcd 的服务。此步骤每个节点均需执行。

    mv /etc/kubernetes/manifests/etcd.yaml /tmp/etcd.yaml
    
  4. 确保所有控制平面的 kube-apiserveretcd 服务都已经关闭。

  5. 关闭所有的节点后,使用如下命令检查 etcd 集群状态。此命令在任意一个节点执行即可。

    endpoints 的值需要替换为实际节点名称

    etcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \
      --cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
      --cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
      --key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
    

    预期输出如下,表示所有的 etcd 节点都被销毁:

    {"level":"warn","ts":"2023-03-29T17:51:50.817+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba000/controller-node-1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.5.14.31:2379: connect: connection refused\""}
    Failed to get the status of endpoint controller-node-1:2379 (context deadline exceeded)
    {"level":"warn","ts":"2023-03-29T17:51:55.818+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba000/controller-node-2:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.5.14.32:2379: connect: connection refused\""}
    Failed to get the status of endpoint controller-node-2:2379 (context deadline exceeded)
    {"level":"warn","ts":"2023-03-29T17:52:00.820+0800","logger":"etcd-client","caller":"v3@v3.5.6/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001ba000/controller-node-1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.5.14.33:2379: connect: connection refused\""}
    Failed to get the status of endpoint controller-node-3:2379 (context deadline exceeded)
    +----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    | ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
    +----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    +----------+----+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
    

还原备份

只需要还原一个节点的数据,其他节点的 etcd 数据就会自动进行同步。

  1. 设置环境变量

    使用 etcdbrctl 还原数据之前,执行如下命令将连接 S3 的认证信息设置为环境变量:

    export ECS_ENDPOINT=http://10.6.212.13:9000 # (1)!
    export ECS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE # (2)!
    export ECS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY # (3)!
    
    1. S3 存储的访问点
    2. S3 存储的用户名
    3. S3 存储的密码
  2. 执行还原操作

    执行 etcdbrctl 命令行工具执行还原,这是最关键的一步。

    etcdbrctl restore --data-dir /var/lib/etcd/ --store-container="etcd-backup" \ 
      --storage-provider=ECS \
      --initial-cluster=controller-node1=https://10.6.212.10:2380 \
      --initial-advertise-peer-urls=https://10.6.212.10:2380 
    

    参数说明如下:

    • --data-dir: etcd 数据目录,此目录必须跟 etcd 数据目录一致,etcd 才能正常加载数据。
    • --store-container:S3 存储的位置,MinIO 中对应的 bucket,必须跟数据备份的 bucket 相对应。
    • --initial-cluster:etcd 初始化配置, etcd 集群的名称必须跟原来一致。
    • --initial-advertise-peer-urls:etcd member 集群之间访问地址。必须跟 etcd 的配置保持一致。

    预期输出如下:

    INFO[0000] Finding latest set of snapshot to recover from...
    INFO[0000] Restoring from base snapshot: Full-00000000-00111147-1679991074  actor=restorer
    INFO[0001] successfully fetched data of base snapshot in 1.241380207 seconds  actor=restorer
    {"level":"info","ts":1680011221.2511616,"caller":"mvcc/kvstore.go:380","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":110327}
    {"level":"info","ts":1680011221.3045986,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"66638454b9dd7b8a","local-member-id":"0","added-peer-id":"123c2503a378fc46","added-peer-peer-urls":["https://10.6.212.10:2380"]}
    INFO[0001] Starting embedded etcd server...              actor=restorer
    ....
    
    {"level":"info","ts":"2023-03-28T13:47:02.922Z","caller":"embed/etcd.go:565","msg":"stopped serving peer traffic","address":"127.0.0.1:37161"}
    {"level":"info","ts":"2023-03-28T13:47:02.922Z","caller":"embed/etcd.go:367","msg":"closed etcd server","name":"default","data-dir":"/var/lib/etcd","advertise-peer-urls":["http://localhost:0"],"advertise-client-urls":["http://localhost:0"]}
    INFO[0003] Successfully restored the etcd data directory.
    

    可以查看 etcd 的 YAML 文件进行对照,以免配置错误

    cat /tmp/etcd.yaml | grep initial-
    - --experimental-initial-corrupt-check=true
    - --initial-advertise-peer-urls=https://10.6.212.10:2380
    - --initial-cluster=controller-node-1=https://10.6.212.10:2380
    
  3. 以下命令在节点 01 上执行,为了恢复节点 01 的 etcd 服务。

    首先将 etcd 静态 Pod 的 manifest 文件移动到 /etc/kubernetes/manifests 目录下,kubelet 将会重启 etcd:

    mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
    

    然后等待 etcd 服务启动完成以后,检查 etcd 的状态,etcd 相关证书默认目录: /etc/kubernetes/ssl 。如果集群证书存放在其他位置,请指定对应路径。

    • 检查 etcd 集群列表:

      etcdctl member list -w table \
      --cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
      --cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
      --key="/etc/kubernetes/ssl/apiserver-etcd-client.key" 
      

      预期输出如下:

      +------------------+---------+-------------------+--------------------------+--------------------------+------------+
      |        ID        | STATUS  |       NAME        |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
      +------------------+---------+-------------------+--------------------------+--------------------------+------------+
      | 123c2503a378fc46 | started | controller-node-1 | https://10.6.212.10:2380 | https://10.6.212.10:2379 |      false |
      +------------------+---------+-------------------+--------------------------+--------------------------+------------+
      
    • 查看 controller-node-1 状态:

      etcdctl endpoint status --endpoints=controller-node-1:2379 -w table \
      --cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
      --cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
      --key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
      

      预期输出如下:

      +------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |        ENDPOINT        |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      | controller-node-1:2379 | 123c2503a378fc46 |   3.5.6 |   15 MB |      true |      false |         3 |       1200 |               1199 |        |
      +------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      
  4. 恢复其他节点数据

    上述步骤已经还原了节点 01 的数据,若想要还原其他节点数据,只需要将 etcd 的 Pod 启动起来,让 etcd 自己完成数据同步。

    • 在节点 02 和节点 03 都执行相同的操作:

      mv /tmp/etcd.yaml /etc/kubernetes/manifests/etcd.yaml
      
    • etcd member 集群之间的数据同步需要一定的时间,可以查看 etcd 集群状态,确保所有 etcd 集群正常:

      检查 etcd 集群状态是否正常:

      etcdctl member list -w table \
      --cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
      --cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
      --key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
      

      预期输出如下:

      +------------------+---------+-------------------+-------------------------+-------------------------+------------+
      |        ID        | STATUS  |    NAME           |       PEER ADDRS        |      CLIENT ADDRS       | IS LEARNER |
      +------------------+---------+-------------------+-------------------------+-------------------------+------------+
      | 6ea47110c5a87c03 | started | controller-node-1 | https://10.5.14.31:2380 | https://10.5.14.31:2379 |      false |
      | e222e199f1e318c4 | started | controller-node-2 | https://10.5.14.32:2380 | https://10.5.14.32:2379 |      false |
      | f64eeda321aabe2d | started | controller-node-3 | https://10.5.14.33:2380 | https://10.5.14.33:2379 |      false |
      +------------------+---------+-------------------+-------------------------+-------------------------+------------+
      

      检查 3 个 member 节点是否正常:

      etcdctl endpoint status --endpoints=controller-node-1:2379,controller-node-2:2379,controller-node-3:2379 -w table \
      --cacert="/etc/kubernetes/ssl/etcd/ca.crt" \
      --cert="/etc/kubernetes/ssl/apiserver-etcd-client.crt" \
      --key="/etc/kubernetes/ssl/apiserver-etcd-client.key"
      

      预期输出如下:

      +------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |     ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      | controller-node-1:2379 | 6ea47110c5a87c03 |   3.5.6 |   88 MB |      true |      false |         6 |     199008 |             199008 |        |
      | controller-node-2:2379 | e222e199f1e318c4 |   3.5.6 |   88 MB |     false |      false |         6 |     199114 |             199114 |        |
      | controller-node-3:2379 | f64eeda321aabe2d |   3.5.6 |   88 MB |     false |      false |         6 |     199316 |             199316 |        |
      +------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      

恢复集群

等所有节点的 etcd 数据同步完成后,则可以将 kube-apiserver 进行重新启动,将整个集群恢复到可访问状态:

  1. 重新启动 node1 的 kube-apiserver 服务

    mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
    
  2. 重新启动 node2 的 kube-apiserver 服务

    mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
    
  3. 重新启动 node3 的 kube-apiserver 服务

    mv /tmp/kube-apiserver.yaml /etc/kubernetes/manifests/kube-apiserver.yaml
    
  4. 等待 kubelet 将 kube-apiserver 启动后,检查还原的 k8s 数据是否正常:

    kubectl get nodes
    

    预期输出如下:

    NAME                STATUS     ROLES           AGE     VERSION
    controller-node-1   Ready      <none>          3h30m   v1.25.4
    controller-node-3   Ready      control-plane   3h29m   v1.25.4
    controller-node-3   Ready      control-plane   3h28m   v1.25.4