V2EX = way to explore
V2EX 是一个关于分享和探索的地方
现在注册
已注册用户请  登录
seagull7558
V2EX  ›  Kubernetes

etcd 断电数据文件损坏如何恢复?

  •  
  •   seagull7558 · 2022-04-22 10:30:50 +08:00 · 2869 次点击
    这是一个创建于 944 天前的主题,其中的信息可能已经有所发展或是发生改变。

    虽然提前通知会断电重启机器,但是因为之前也遇到过断电的问题,并且一般情况下都是会自己启动 kubelet 和 docker 及相关上层服务,我就没在意,也没有说提前关闭集群,可是万万没想到啊,今天开机后发现 Master 节点居然都挂掉了

    排查过程

    1. 首先登陆主节点,查看各节点状态
    [root@k8s-master wal]# kubectl get nodes
    The connection to the server 192.168.113.152:6443 was refused - did you specify the right host or port?
    

    1. 发现 kubelet 好像是有问题,去看看状态
    systemctl status kubelet
    
    [root@k8s-master wal]# systemctl status kubelet
    ● kubelet.service - kubelet: The Kubernetes Node Agent
       Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
      Drop-In: /usr/lib/systemd/system/kubelet.service.d
               └─10-kubeadm.conf
       Active: active (running) since Fri 2022-04-22 09:18:34 CST; 1h 4min ago
         Docs: https://kubernetes.io/docs/
     Main PID: 1017 (kubelet)
        Tasks: 17 (limit: 75136)
       Memory: 152.0M
       CGroup: /system.slice/kubelet.service
               └─1017 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --network-plugin=cni --pod-infra-c>
    
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.289217    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.390603    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.491934    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.593267    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.693429    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.793744    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.894905    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:55 k8s-master kubelet[1017]: E0421 22:22:55.995215    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:56 k8s-master kubelet[1017]: E0421 22:22:56.095825    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    4 月 22 10:22:56 k8s-master kubelet[1017]: E0421 22:22:56.196740    1017 kubelet.go:2422] "Error getting node" err="node \"k8s-master\" not found"
    lines 1-22/22 (END)
    

    1. 获取主节点失败,看了眼各组件容器日志,发现 apiserver 有点问题 查看对应的 apiserver 容器日志
    2022-04-22T02:21:31.203199623Z I0422 02:21:31.202994       1 server.go:565] external host was not specified, using 192.168.113.152
    2022-04-22T02:21:31.231588697Z I0422 02:21:31.231374       1 server.go:172] Version: v1.23.5
    2022-04-22T02:21:31.389807666Z I0422 02:21:31.389666       1 shared_informer.go:240] Waiting for caches to sync for node_authorizer
    2022-04-22T02:21:31.390038398Z I0422 02:21:31.389913       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
    2022-04-22T02:21:31.390047004Z I0422 02:21:31.389934       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
    2022-04-22T02:21:31.390767833Z I0422 02:21:31.390608       1 plugins.go:158] Loaded 12 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,DefaultStorageClass,StorageObjectInUseProtection,RuntimeClass,DefaultIngressClass,MutatingAdmissionWebhook.
    2022-04-22T02:21:31.390778978Z I0422 02:21:31.390639       1 plugins.go:161] Loaded 11 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,PodSecurity,Priority,PersistentVolumeClaimResize,RuntimeClass,CertificateApproval,CertificateSigning,CertificateSubjectRestriction,ValidatingAdmissionWebhook,ResourceQuota.
    2022-04-22T02:21:31.392738574Z W0422 02:21:31.392578       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:32.391488027Z W0422 02:21:32.391315       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:32.394099214Z W0422 02:21:32.393918       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:33.393065852Z W0422 02:21:33.392633       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:33.835046849Z W0422 02:21:33.833992       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:35.158712152Z W0422 02:21:35.158257       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:36.424633514Z W0422 02:21:36.424302       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:38.117720953Z W0422 02:21:38.117282       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:41.173338381Z W0422 02:21:41.172823       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:42.894615034Z W0422 02:21:42.894174       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:47.511910137Z W0422 02:21:47.511585       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:50.459764076Z W0422 02:21:50.459450       1 clientconn.go:1331] [core] grpc: addrConn.createTransport failed to connect to {127.0.0.1:2379 127.0.0.1 <nil> 0 <nil>}. Err: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused". Reconnecting...
    2022-04-22T02:21:51.393624485Z E0422 02:21:51.393263       1 run.go:74] "command failed" err="context deadline exceeded"
    

    1. 端口号 2379 接着往下排查 很明显是连接 etcd 报错,这时候我特么很慌啊,etcd 存储了很多 k8s 配置信息,这玩意要是修不好,那许多服务还得重新搭建
    2022-04-22T02:26:23.131382248Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:72","msg":"Running: ","args":["etcd","--advertise-client-urls=https://192.168.113.152:2379","--cert-file=/etc/kubernetes/pki/etcd/server.crt","--client-cert-auth=true","--data-dir=/var/lib/etcd","--initial-advertise-peer-urls=https://192.168.113.152:2380","--initial-cluster=k8s-master=https://192.168.113.152:2380","--key-file=/etc/kubernetes/pki/etcd/server.key","--listen-client-urls=https://127.0.0.1:2379,https://192.168.113.152:2379","--listen-metrics-urls=http://127.0.0.1:2381","--listen-peer-urls=https://192.168.113.152:2380","--name=k8s-master","--peer-cert-file=/etc/kubernetes/pki/etcd/peer.crt","--peer-client-cert-auth=true","--peer-key-file=/etc/kubernetes/pki/etcd/peer.key","--peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt","--snapshot-count=10000","--trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt"]}
    2022-04-22T02:26:23.131412789Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"etcdmain/etcd.go:115","msg":"server has been already initialized","data-dir":"/var/lib/etcd","dir-type":"member"}
    2022-04-22T02:26:23.131416304Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:131","msg":"configuring peer listeners","listen-peer-urls":["https://192.168.113.152:2380"]}
    2022-04-22T02:26:23.131418664Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:478","msg":"starting with peer TLS","tls-info":"cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, client-cert=, client-key=, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file = ","cipher-suites":[]}
    2022-04-22T02:26:23.131421030Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:139","msg":"configuring client listeners","listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"]}
    2022-04-22T02:26:23.131423955Z {"level":"info","ts":"2022-04-22T02:26:23.130Z","caller":"embed/etcd.go:307","msg":"starting an etcd server","etcd-version":"3.5.1","git-sha":"e8732fb5f","go-version":"go1.16.3","go-os":"linux","go-arch":"amd64","max-cpu-set":4,"max-cpu-available":4,"member-initialized":true,"name":"k8s-master","data-dir":"/var/lib/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/var/lib/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":10000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://192.168.113.152:2380"],"listen-peer-urls":["https://192.168.113.152:2380"],"advertise-client-urls":["https://192.168.113.152:2379"],"listen-client-urls":["https://127.0.0.1:2379","https://192.168.113.152:2379"],"listen-metrics-urls":["http://127.0.0.1:2381"],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-size-bytes":2147483648,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","auto-compaction-mode":"periodic","auto-compaction-retention":"0s","auto-compaction-interval":"0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
    2022-04-22T02:26:23.133621480Z panic: freepages: failed to get all reachable pages (page 478: multiple references)
    2022-04-22T02:26:23.133635080Z 
    2022-04-22T02:26:23.133637664Z goroutine 120 [running]:
    2022-04-22T02:26:23.133639420Z go.etcd.io/bbolt.(*DB).freepages.func2(0xc00047e540)
    2022-04-22T02:26:23.133648190Z 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/[email protected]/db.go:1056 +0xe9
    2022-04-22T02:26:23.133650552Z created by go.etcd.io/bbolt.(*DB).freepages
    2022-04-22T02:26:23.133652184Z 	/home/remote/sbatsche/.gvm/pkgsets/go1.16.3/global/pkg/mod/go.etcd.io/[email protected]/db.go:1054 +0x1cd
    

    然后重启 etcd 也不好使,一番搜索发现是装在 vmware 里会有问题,可能导致 etcd 文件损坏,寄!

    关联 issues

    9 条回复    2022-07-17 09:10:31 +08:00
    rrfeng
        1
    rrfeng  
       2022-04-22 10:41:43 +08:00 via Android
    用之前的 snapshot 恢复

    这玩意也要定时离线备份的…
    seagull7558
        2
    seagull7558  
    OP
       2022-04-22 10:43:40 +08:00
    @rrfeng 老哥是指删掉 /var/lib/etcd/member/snap 文件夹下最新的.snap 文件吗?
    FullBridgeRect
        3
    FullBridgeRect  
       2022-04-22 10:57:26 +08:00
    etcd 有做高可用么,达到容灾上限之前可以直接开新节点重建吧?
    seagull7558
        4
    seagull7558  
    OP
       2022-04-22 11:05:26 +08:00
    @FullBridgeRect 没做高可用 本来就是开发环境 为了模拟配置的 确实偷懒没做高可用 也没做备份
    FullBridgeRect
        5
    FullBridgeRect  
       2022-04-22 11:06:34 +08:00
    @seagull7558 那没了啊😑
    disk
        6
    disk  
       2022-04-22 11:14:54 +08:00
    备份下,*.wal 或*.snap 删了看好不好使,不行看看能不能从最新的.snap 恢复,再不行就没必要再试图恢复了,成本太高。
    seagull7558
        7
    seagull7558  
    OP
       2022-04-22 11:29:39 +08:00
    @disk 试过直接删掉 wal 和 snap 文件夹 会丢失 K8S 集群下的所有配置信息
    zhaoyeye
        8
    zhaoyeye  
       2022-04-23 00:07:41 +08:00 via Android
    @seagull7558 我前两天刚通过 snapshot 恢复了一次,反正是挺多坑的
    stevefan1999
        9
    stevefan1999  
       2022-07-17 09:10:31 +08:00 via Android
    etcd 背後用的是 BoltDB 理論上是毋須數據恢復的 如果集羣有其他 HA master 那把有問題的節點完整地刪了 再不行找個備份
    关于   ·   帮助文档   ·   博客   ·   API   ·   FAQ   ·   实用小工具   ·   3404 人在线   最高记录 6679   ·     Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 · 24ms · UTC 00:07 · PVG 08:07 · LAX 16:07 · JFK 19:07
    Developed with CodeLauncher
    ♥ Do have faith in what you're doing.