TiDB 6.5.12: a PD with corrupted files cannot be recovered by clearing its data and rebuilding it

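【TiDB environment】Test
【TiDB version】6.5.12
【Operating system】Deployed on K8s
【Deployment method】Deployed with TiDB Operator
【Cluster data size】
【Number of cluster nodes】3
【Steps to reproduce】Simulate file corruption on one node so that PD fails to start; delete the physical directory of the PV mounted through the PVC bound to PD; then delete the pod again.
【Problem: symptoms and impact】On 6.5.8 the PD could rejoin the cluster after this procedure, but on this version it cannot and keeps reporting errors.
【Resource configuration】
PD error log: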
[2025/09/30 01:36:57.220 +00:00] [INFO] [join.go:218] ["failed to open directory, maybe start for the first time"] [error="open /var/lib/pd/member: no such file or directory"]
[2025/09/30 01:36:57.227 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000525c00/basic-pd-0.basic-pd-peer.namespace.svc:2379] [attempt=0] [error="rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"]
[2025/09/30 01:36:57.228 +00:00] [FATAL] [main.go:91] ["join meet error"] [error="etcdserver: unhealthy cluster"] [stack="main.main\n\t/workspace/source/pd/cmd/pd-server/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

Meanwhile, pd-1 keeps logging errors while trying to connect to pd-0:
[2025/09/29 11:33:44.373 +00:00] [WARN] [cluster_util.go:321] ["failed to reach the peer URL"] [address=http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version] [remote-member-id=b7628fdafcdacb48] [error="Get \"http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version\": dial tcp 100.79.165.128:2380: connect: connection refused"]
[2025/09/29 11:33:44.373 +00:00] [WARN] [cluster_util.go:171] ["failed to get version"] [remote-member-id=b7628fdafcdacb48] [error="Get \"http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version\": dial tcp 100.79.165.128:2380: connect: connection refused"]
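
The log lines above print the member ID in hexadecimal (remote-member-id=b7628fdafcdacb48), while the health output and the etcd key paths quoted later in this post use decimal. A minimal Python sketch for converting between the two forms so the IDs can be compared; the helper names are made up for illustration and are not part of any PD or etcd tooling.

```python
def hex_to_decimal(member_id_hex: str) -> int:
    """Convert a hex member ID, as printed in the PD log, to decimal."""
    return int(member_id_hex, 16)


def decimal_to_hex(member_id: int) -> str:
    """Convert a decimal member ID (health output, key paths) to lowercase hex."""
    return format(member_id, "x")


if __name__ == "__main__":
    # ID copied from the pd-1 log line quoted in this post.
    print(hex_to_decimal("b7628fdafcdacb48"))
```

For the ID in the log this yields 13214282427366296392, which can then be looked up in the health output and in the etcd key listing.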

The original plan was to delete pd-0 from pd-1, but pd-ctl does not show pd-0 at all:

curl -s http://127.0.0.1:2379/pd/api/v1/health

[
  {
    "name": "basic-pd-2",
    "member_id": 13214282427366296392,
    "client_urls": [
      "http://basic-pd-2.basic-pd-peer.namespace.svc:2379"
    ],
    "health": true
  },
  {
    "name": "basic-pd-1",
    "member_id": 16019942294483605917,
    "client_urls": [
      "http://basic-pd-1.basic-pd-peer.namespace.svc:2379"
    ],
    "health": true
  }
]

With etcdctl I can see a member ID for pd-0:
ETCDCTL_API=3 ./etcdctl --endpoints=http://127.0.0.1:2379 get /pd/7554968709352511942 --prefix --keys-only
/pd/7554968709352511942/alloc_id
/pd/7554968709352511942/config
/pd/7554968709352511942/gc/safe_point
/pd/7554968709352511942/gc/safe_point/service/gc_worker
/pd/7554968709352511942/gc/safe_point/service/ticdc-default-4886466985454171141
/pd/7554968709352511942/keyspaces/id/DEFAULT
/pd/7554968709352511942/keyspaces/meta/00000000
/pd/7554968709352511942/leader
/pd/7554968709352511942/member/13214282427366296392/binary_version
/pd/7554968709352511942/member/13214282427366296392/deploy_path
/pd/7554968709352511942/member/13214282427366296392/git_hash
/pd/7554968709352511942/member/16019942294483605917/binary_version
/pd/7554968709352511942/member/16019942294483605917/deploy_path
/pd/7554968709352511942/member/16019942294483605917/git_hash
/pd/7554968709352511942/member/8084451159922055542/binary_version
/pd/7554968709352511942/member/8084451159922055542/deploy_path
/pd/7554968709352511942/member/8084451159922055542/git_hash

However, when I finally tried to remove this member ID with delete, the member ID could not be found.
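
The comparison described above (which member IDs appear in the key listing but not in the health output) can be sketched in Python. This is only an illustration: the inputs are abridged copies of the output pasted in this post, and the function names are hypothetical. Note that the keys under /pd/<cluster-id>/member/ are data stored in etcd's keyspace; whether etcd's own membership list still contains an entry is a separate question, which `etcdctl member list` answers.

```python
import json
import re

# Abridged copies of the health output and key listing from this post.
HEALTH_JSON = """
[
  {"name": "basic-pd-2", "member_id": 13214282427366296392, "health": true},
  {"name": "basic-pd-1", "member_id": 16019942294483605917, "health": true}
]
"""

ETCD_KEYS = """
/pd/7554968709352511942/leader
/pd/7554968709352511942/member/13214282427366296392/binary_version
/pd/7554968709352511942/member/16019942294483605917/binary_version
/pd/7554968709352511942/member/8084451159922055542/binary_version
"""

# Matches /pd/<cluster-id>/member/<member-id>/... and captures the member ID.
MEMBER_KEY = re.compile(r"^/pd/\d+/member/(\d+)/")


def ids_from_health(health_json: str) -> set[int]:
    """Member IDs reported by the health endpoint."""
    return {member["member_id"] for member in json.loads(health_json)}


def ids_from_keys(key_listing: str) -> set[int]:
    """Member IDs that appear in the key paths."""
    ids = set()
    for line in key_listing.splitlines():
        match = MEMBER_KEY.match(line.strip())
        if match:
            ids.add(int(match.group(1)))
    return ids


def leftover_member_ids(health_json: str, key_listing: str) -> list[int]:
    """IDs that have keys in etcd but are absent from the health output."""
    return sorted(ids_from_keys(key_listing) - ids_from_health(health_json))


if __name__ == "__main__":
    for member_id in leftover_member_ids(HEALTH_JSON, ETCD_KEYS):
        # Print decimal and hex; logs and etcdctl member list show hex IDs.
        print(member_id, format(member_id, "x"))
```

With the data from this post, the only leftover ID is 8084451159922055542.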

At this point I cannot find any way to recover this node.

You could try recovering it with this tool:

https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/pd-recover

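I tried it once before and it failed. I will try again; my only concern is that it requires the --from-old-member parameter, while pd-0's data has been wiped completely.

I followed the manual, entered debug mode, and started PD by hand, but it failed.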

[2025/09/30 09:59:02.167 +00:00] [INFO] [etcd.go:377] ["closed etcd server"] [name=basic-pd-0] [data-dir=/var/lib/pd] [advertise-peer-urls="[http://basic-pd-0.basic-pd-peer.namespace.svc:2380]"] [advertise-client-urls="[http://basic-pd-0.basic-pd-peer.namespace.svc:2379]"]
[2025/09/30 09:59:02.167 +00:00] [FATAL] [main.go:120] ["run server failed"] [error="[PD:etcd:ErrStartEtcd]couldn't find local name "basic-pd-0" in the initial cluster configuration: couldn't find local name "basic-pd-0" in the initial cluster configuration"] [stack="main.main\n\t/workspace/source/pd/cmd/pd-server/main.go:120\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
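
The fatal error above is the embedded etcd complaining that the name the server was started with (basic-pd-0) does not appear in its initial cluster list, which is written as comma-separated name=peer-URL pairs. The post does not show which flags were used for the manual start, so the value below is purely hypothetical. The Python sketch only illustrates the kind of check that fails; it is not etcd's or PD's actual code.

```python
def parse_initial_cluster(initial_cluster: str) -> dict[str, list[str]]:
    """Parse 'name1=url1,name2=url2' into {name: [peer URLs]}."""
    members: dict[str, list[str]] = {}
    for entry in initial_cluster.split(","):
        entry = entry.strip()
        if not entry:
            continue
        name, sep, url = entry.partition("=")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        members.setdefault(name, []).append(url)
    return members


def local_name_present(local_name: str, initial_cluster: str) -> bool:
    """True if the local member name is listed in the initial cluster string."""
    return local_name in parse_initial_cluster(initial_cluster)


if __name__ == "__main__":
    # Hypothetical value: a list that names only the two surviving members.
    initial = (
        "basic-pd-1=http://basic-pd-1.basic-pd-peer.namespace.svc:2380,"
        "basic-pd-2=http://basic-pd-2.basic-pd-peer.namespace.svc:2380"
    )
    print(local_name_present("basic-pd-0", initial))  # False
```

If the list used for the manual start really did omit basic-pd-0, adding its own name=peer-URL pair would be the first thing to check; whether that alone lets the node rejoin is not something the logs in this post can confirm.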

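Backing up the data and restoring from the backup is also a fairly good option.

Apart from high availability, this is the most reliable option, bar none.

Because the etcd cluster is in an unhealthy state, the new pod cannot join. Try intervening manually: clean up the residual member in etcd, then rebuild the PD node.

Isn't this a directory-not-found error?

In this situation we usually handle it by scaling out and scaling in.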
