TiDB 6.5.12: a PD with corrupted files cannot be recovered by clearing its data and rebuilding it

TiDBer_G64jJ9u8 · September 30, 2025, 06:30

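[TiDB environment] Test
[TiDB version] 6.5.12
[Operating system] Deployed on K8S
[Deployment method] Deployed with TiDB Operator
[Cluster data volume]
[Number of cluster nodes] 3
[Steps to reproduce] Simulate file corruption on one node so that PD fails to start; delete the physical directory of the PV mounted through the PVC bound to that PD; then delete the pod again so that it is recreated.
[Problem: symptoms and impact] On 6.5.8 the PD could rejoin the cluster after these steps, but on this version it cannot and keeps reporting errors.
[Resource configuration]
PD error log: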
[2025/09/30 01:36:57.220 +00:00] [INFO] [join.go:218] ["failed to open directory, maybe start for the first time"] [error="open /var/lib/pd/member: no such file or directory"]
[2025/09/30 01:36:57.227 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000525c00/basic-pd-0.basic-pd-peer.namespace.svc:2379] [attempt=0] [error="rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"]
[2025/09/30 01:36:57.228 +00:00] [FATAL] [main.go:91] ["join meet error"] [error="etcdserver: unhealthy cluster"] [stack="main.main\n\t/workspace/source/pd/cmd/pd-server/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

Meanwhile, pd-1 keeps logging errors while trying to connect to pd-0:
[2025/09/29 11:33:44.373 +00:00] [WARN] [cluster_util.go:321] ["failed to reach the peer URL"] [address=http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version] [remote-member-id=b7628fdafcdacb48] [error="Get \"http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version\": dial tcp 100.79.165.128:2380: connect: connection refused"]
[2025/09/29 11:33:44.373 +00:00] [WARN] [cluster_util.go:171] ["failed to get version"] [remote-member-id=b7628fdafcdacb48] [error="Get \"http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version\": dial tcp 100.79.165.128:2380: connect: connection refused"]

The original plan was to remove pd-0 from pd-1, but pd-0 turned out not to be visible through pd-ctl:

curl -s http://127.0.0.1:2379/pd/api/v1/health

[
  {
    "name": "basic-pd-2",
    "member_id": 13214282427366296392,
    "client_urls": [
      "http://basic-pd-2.basic-pd-peer.namespace.svc:2379"
    ],
    "health": true
  },
  {
    "name": "basic-pd-1",
    "member_id": 16019942294483605917,
    "client_urls": [
      "http://basic-pd-1.basic-pd-peer.namespace.svc:2379"
    ],
    "health": true
  }
]
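
The health output above lists only two members. A minimal Python sketch for spotting which expected PD pods are absent from such a response is shown here. The expected names basic-pd-0 through basic-pd-2 are an assumption based on the 3-node cluster and the pod names that appear in the logs, and the helper names are hypothetical.

```python
import json

# Health response as posted in the thread, with plain ASCII quotes so it parses.
HEALTH_JSON = """
[
  {"name": "basic-pd-2", "member_id": 13214282427366296392,
   "client_urls": ["http://basic-pd-2.basic-pd-peer.namespace.svc:2379"],
   "health": true},
  {"name": "basic-pd-1", "member_id": 16019942294483605917,
   "client_urls": ["http://basic-pd-1.basic-pd-peer.namespace.svc:2379"],
   "health": true}
]
"""

# Assumption: a 3-replica PD set in a cluster named "basic" uses these pod names.
EXPECTED = ["basic-pd-0", "basic-pd-1", "basic-pd-2"]


def missing_members(health, expected):
    """Return the expected member names that do not appear in the health list."""
    seen = {member["name"] for member in health}
    return [name for name in expected if name not in seen]


def unhealthy_members(health):
    """Return the names of listed members whose "health" field is not true."""
    return [member["name"] for member in health if not member.get("health", False)]


health = json.loads(HEALTH_JSON)
print("missing:", missing_members(health, EXPECTED))
print("unhealthy:", unhealthy_members(health))
```

With the response from this thread, basic-pd-0 is reported as missing rather than unhealthy, which matches the observation that it cannot be seen at all.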

Through etcdctl I can see what looks like a member ID for pd-0:
ETCDCTL_API=3 ./etcdctl --endpoints=http://127.0.0.1:2379 get /pd/7554968709352511942 --prefix --keys-only
/pd/7554968709352511942/alloc_id
/pd/7554968709352511942/config
/pd/7554968709352511942/gc/safe_point
/pd/7554968709352511942/gc/safe_point/service/gc_worker
/pd/7554968709352511942/gc/safe_point/service/ticdc-default-4886466985454171141
/pd/7554968709352511942/keyspaces/id/DEFAULT
/pd/7554968709352511942/keyspaces/meta/00000000
/pd/7554968709352511942/leader
/pd/7554968709352511942/member/13214282427366296392/binary_version
/pd/7554968709352511942/member/13214282427366296392/deploy_path
/pd/7554968709352511942/member/13214282427366296392/git_hash
/pd/7554968709352511942/member/16019942294483605917/binary_version
/pd/7554968709352511942/member/16019942294483605917/deploy_path
/pd/7554968709352511942/member/16019942294483605917/git_hash
/pd/7554968709352511942/member/8084451159922055542/binary_version
/pd/7554968709352511942/member/8084451159922055542/deploy_path
/pd/7554968709352511942/member/8084451159922055542/git_hash

However, when I finally tried to remove this member ID with delete, the member ID could not be found.

At this point I cannot find any way to recover this node.
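
A note on cross-checking the IDs in this post: the remote-member-id in the pd-1 log is hexadecimal, while the health API and the /pd/.../member/ keys use decimal. The sketch below (function names are hypothetical) converts between the two. For the values posted here, b7628fdafcdacb48 converts to 13214282427366296392, which is the ID that the health output lists for basic-pd-2 and not 8084451159922055542. What that implies for this cluster is not established in the thread.

```python
def etcd_hex_to_decimal(hex_id: str) -> int:
    """Convert a hexadecimal etcd member ID (as printed in logs) to decimal."""
    return int(hex_id, 16)


def decimal_to_etcd_hex(dec_id: int) -> str:
    """Convert a decimal member ID (as shown by the PD API) to hexadecimal."""
    return format(dec_id, "x")


# Values copied from the thread.
log_remote_member_id = "b7628fdafcdacb48"
health_ids = {
    "basic-pd-2": 13214282427366296392,
    "basic-pd-1": 16019942294483605917,
}
extra_key_id = 8084451159922055542  # third ID under /pd/.../member/

dec = etcd_hex_to_decimal(log_remote_member_id)
owner = [name for name, member_id in health_ids.items() if member_id == dec]
print(dec, owner)

# Hexadecimal form of the third ID, for comparison with etcd-side output.
print(decimal_to_etcd_hex(extra_key_id))
```

etcdctl's member subcommands print and accept hexadecimal IDs, so a decimal ID passed there would not match. Whether that is why the delete reported a missing member ID is not known from the post.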

Billmay表妹 · September 30, 2025, 08:18

You can try recovering it with this tool:

https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/pd-recover

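TiDBer_G64jJ9u8 · September 30, 2025, 08:27

I tried it once before and it failed. I will try again, but my concern is that it requires the --from-old-member parameter, while all of pd-0's data has already been wiped.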

TiDBer_G64jJ9u8 · September 30, 2025, 10:03

Following the manual, I entered debug mode and started PD manually. It failed.

[2025/09/30 09:59:02.167 +00:00] [INFO] [etcd.go:377] ["closed etcd server"] [name=basic-pd-0] [data-dir=/var/lib/pd] [advertise-peer-urls="[http://basic-pd-0.basic-pd-peer.namespace.svc:2380]"] [advertise-client-urls="[http://basic-pd-0.basic-pd-peer.namespace.svc:2379]"]
[2025/09/30 09:59:02.167 +00:00] [FATAL] [main.go:120] ["run server failed"] [error="[PD:etcd:ErrStartEtcd]couldn't find local name \"basic-pd-0\" in the initial cluster configuration: couldn't find local name \"basic-pd-0\" in the initial cluster configuration"] [stack="main.main\n\t/workspace/source/pd/cmd/pd-server/main.go:120\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
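
The FATAL line above comes from etcd's bootstrap check, which rejects a configuration where the local member name is not one of the names in the initial cluster list. A rough Python sketch of that check follows, for illustration only. Both initial-cluster strings are hypothetical, since the thread does not show the flags that were actually used, and the function names are made up here.

```python
def parse_initial_cluster(value: str) -> dict:
    """Parse a "name=url,name=url" initial-cluster string into {name: [urls]}."""
    members = {}
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, url = item.partition("=")
        members.setdefault(name, []).append(url)
    return members


def check_local_name(local_name: str, initial_cluster: str):
    """Return an error message if local_name is not listed, else None."""
    members = parse_initial_cluster(initial_cluster)
    if local_name not in members:
        return (f'couldn\'t find local name "{local_name}" '
                "in the initial cluster configuration")
    return None


# Hypothetical value that lists only the two surviving members.
WITHOUT_PD0 = (
    "basic-pd-1=http://basic-pd-1.basic-pd-peer.namespace.svc:2380,"
    "basic-pd-2=http://basic-pd-2.basic-pd-peer.namespace.svc:2380"
)
# Hypothetical value that also lists the member being started.
WITH_PD0 = "basic-pd-0=http://basic-pd-0.basic-pd-peer.namespace.svc:2380," + WITHOUT_PD0

print(check_local_name("basic-pd-0", WITHOUT_PD0))
print(check_local_name("basic-pd-0", WITH_PD0))
```

If the manual start used a list like the first one, the error would be expected. Comparing the name passed to PD against the initial cluster value is a reasonable first check.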

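xfworld · October 3, 2025, 03:30

Backing up the data and restoring from the backup is also a fairly good option.

Apart from high availability, it is the most reliable approach, bar none.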

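lllzd · October 3, 2025, 09:33

Because the etcd cluster is in an unhealthy state, the new pod cannot join. Try intervening manually: clean up the stale member left in etcd, then rebuild the PD node.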

独善其身 · January 28, 2026, 02:55

Isn't this just a directory-not-found error?

Kongdom · January 28, 2026, 06:04

We usually handle this kind of situation by scaling in and then scaling out.

system · February 4, 2026, 06:05

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.