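【TiDB environment】Test
【TiDB version】6.5.12
【Operating system】Deployed on Kubernetes
【Deployment method】Deployed with TiDB Operator
【Cluster data volume】
【Number of cluster nodes】3
【Steps to reproduce】To simulate file corruption on one node that prevents PD from starting, I deleted the physical directory of the PV mounted by the PVC bound to PD, then deleted the pod again.
【Problem: symptoms and impact】On version 6.5.8, PD could rejoin the cluster after this procedure. On this version it cannot, and it keeps reporting errors.
【Resource configuration】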
PD error log:
[2025/09/30 01:36:57.220 +00:00] [INFO] [join.go:218] ["failed to open directory, maybe start for the first time"] [error="open /var/lib/pd/member: no such file or directory"]
[2025/09/30 01:36:57.227 +00:00] [WARN] [retry_interceptor.go:62] ["retrying of unary invoker failed"] [target=etcd-endpoints://0xc000525c00/basic-pd-0.basic-pd-peer.namespace.svc:2379] [attempt=0] [error="rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"]
[2025/09/30 01:36:57.228 +00:00] [FATAL] [main.go:91] ["join meet error"] [error="etcdserver: unhealthy cluster"] [stack="main.main\n\t/workspace/source/pd/cmd/pd-server/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]
Meanwhile, the pd-1 log shows repeated failures to connect to pd-0:
[2025/09/29 11:33:44.373 +00:00] [WARN] [cluster_util.go:321] ["failed to reach the peer URL"] [address=http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version] [remote-member-id=b7628fdafcdacb48] [error="Get \"http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version\": dial tcp 100.79.165.128:2380: connect: connection refused"]
[2025/09/29 11:33:44.373 +00:00] [WARN] [cluster_util.go:171] ["failed to get version"] [remote-member-id=b7628fdafcdacb48] [error="Get \"http://basic-pd-0.basic-pd-peer.namespace.svc:2380/version\": dial tcp 100.79.165.128:2380: connect: connection refused"]
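One detail that may matter when matching these logs against the PD API: the remote-member-id field is printed in hexadecimal, while the member_id values returned by the PD HTTP API are decimal. The following is a minimal Python sketch for converting between the two. The helper name remote_member_id is made up for illustration, and the log line is shortened to the fields used here.

```python
import re

# Log line from pd-1, shortened to the fields this sketch needs.
LOG_LINE = (
    '[WARN] [cluster_util.go:321] ["failed to reach the peer URL"] '
    "[remote-member-id=b7628fdafcdacb48]"
)


def remote_member_id(line: str) -> int:
    """Extract the hexadecimal remote-member-id field and return it as an int."""
    match = re.search(r"\[remote-member-id=([0-9a-f]+)\]", line)
    if match is None:
        raise ValueError("no remote-member-id field in log line")
    return int(match.group(1), 16)


member_id = remote_member_id(LOG_LINE)
print(member_id)  # decimal form, comparable with member_id in the PD API
```

For the value in these logs, the conversion gives 13214282427366296392. That is the same number the health output further down reports for basic-pd-2, even though the log entry is about the basic-pd-0 address, so it may be worth double-checking which member this ID belongs to.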
My original plan was to remove pd-0 from pd-1, but pd-ctl does not list pd-0 at all:
curl -s http://127.0.0.1:2379/pd/api/v1/health
[
  {
    "name": "basic-pd-2",
    "member_id": 13214282427366296392,
    "client_urls": [
      "http://basic-pd-2.basic-pd-peer.namespace.svc:2379"
    ],
    "health": true
  },
  {
    "name": "basic-pd-1",
    "member_id": 16019942294483605917,
    "client_urls": [
      "http://basic-pd-1.basic-pd-peer.namespace.svc:2379"
    ],
    "health": true
  }
]
With etcdctl, I can see a third member ID, which appears to belong to pd-0:
ETCDCTL_API=3 ./etcdctl --endpoints=http://127.0.0.1:2379 get /pd/7554968709352511942 --prefix --keys-only
/pd/7554968709352511942/alloc_id
/pd/7554968709352511942/config
/pd/7554968709352511942/gc/safe_point
/pd/7554968709352511942/gc/safe_point/service/gc_worker
/pd/7554968709352511942/gc/safe_point/service/ticdc-default-4886466985454171141
/pd/7554968709352511942/keyspaces/id/DEFAULT
/pd/7554968709352511942/keyspaces/meta/00000000
/pd/7554968709352511942/leader
/pd/7554968709352511942/member/13214282427366296392/binary_version
/pd/7554968709352511942/member/13214282427366296392/deploy_path
/pd/7554968709352511942/member/13214282427366296392/git_hash
/pd/7554968709352511942/member/16019942294483605917/binary_version
/pd/7554968709352511942/member/16019942294483605917/deploy_path
/pd/7554968709352511942/member/16019942294483605917/git_hash
/pd/7554968709352511942/member/8084451159922055542/binary_version
/pd/7554968709352511942/member/8084451159922055542/deploy_path
/pd/7554968709352511942/member/8084451159922055542/git_hash
However, when I tried to remove this member ID with a delete command, the member ID could not be found.
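To cross-check the IDs, here is a minimal Python sketch that compares the member IDs under the /member/ keys with the ones returned by the health API and prints the leftover ID in both decimal and hexadecimal. The helper names are made up for illustration, and the data is copied from the output above. As far as I understand, these /pd/.../member/... entries are ordinary key-value data written by PD, separate from etcd's own membership list, so a leftover key does not necessarily mean the member is still registered. Also, etcdctl member list and etcdctl member remove work with hexadecimal IDs, while pd-ctl member delete id takes the decimal form.

```python
import re

# Keys copied from the etcdctl listing (one key per member is enough).
ETCD_KEYS = """
/pd/7554968709352511942/member/13214282427366296392/binary_version
/pd/7554968709352511942/member/16019942294483605917/binary_version
/pd/7554968709352511942/member/8084451159922055542/binary_version
"""

# name and member_id copied from the /pd/api/v1/health response.
HEALTH = [
    {"name": "basic-pd-2", "member_id": 13214282427366296392},
    {"name": "basic-pd-1", "member_id": 16019942294483605917},
]


def ids_in_keys(keys: str) -> set[int]:
    """Collect the decimal member IDs that appear in /member/<id>/ key paths."""
    return {int(m) for m in re.findall(r"/member/(\d+)/", keys)}


def leftover_ids(keys: str, health: list[dict]) -> set[int]:
    """Return IDs present in the etcd keys but absent from the health output."""
    return ids_in_keys(keys) - {m["member_id"] for m in health}


for member_id in sorted(leftover_ids(ETCD_KEYS, HEALTH)):
    # The hex form is what etcdctl member commands expect.
    print(f"decimal={member_id} hex={member_id:x}")
```

With the data above, the only leftover ID is 8084451159922055542, the one I tried to delete.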
At this point I cannot find any way to recover this node.