【TiDB 使用环境】生产环境
【TiDB 版本】v8.1.1
【操作系统】Ubuntu 24.04.1 aarch64
【部署方式】aws云上部署 PD节点配置为8C16G
【集群数据量】110GB
【集群节点数】3 pd/3 tikv/3 tidb
【问题复现路径】没办法手动触发
【遇到的问题:问题现象及影响】 PD leader节点老是切换,影响可用性
【资源配置】进入到 TiDB Dashboard -集群信息 (Cluster Info) -主机(Hosts) 截图此页面
【复制黏贴 ERROR 报错的日志】
[2025/08/07 14:26:49.192 +08:00] [WARN] [client.go:121] [“SendRequest failed”] [uri=http://10.xxx.xxx.51:2379/pd/api/v2/ms/members/tso]
[2025/08/07 14:26:49.192 +08:00] [WARN] [error.go:89] [“Error when handling request”] [uri=/dashboard/api/topology/tso] [remoteAddr=10.xxx.xxx.153:44144] [errorFullText=]
[2025/08/07 14:26:49.193 +08:00] [WARN] [client.go:121] [“SendRequest failed”] [uri=http://10.xxx.xxx.51:2379/pd/api/v2/ms/members/scheduling]
[2025/08/07 14:26:49.193 +08:00] [WARN] [error.go:89] [“Error when handling request”] [uri=/dashboard/api/topology/scheduling] [remoteAddr=10.xxx.xxx.153:44160] [errorFullText=]
[2025/08/07 14:26:54.263 +08:00] [WARN] [client.go:121] [“SendRequest failed”] [uri=http://10.xxx.xxx.51:2379/pd/api/v2/ms/members/tso]
[2025/08/07 14:26:54.263 +08:00] [WARN] [error.go:89] [“Error when handling request”] [uri=/dashboard/api/topology/tso] [remoteAddr=10.xxx.xxx.153:44206] [errorFullText=]
[2025/08/07 14:26:54.265 +08:00] [WARN] [client.go:121] [“SendRequest failed”] [uri=http://10.xxx.xxx.51:2379/pd/api/v2/ms/members/scheduling]
[2025/08/07 14:26:54.265 +08:00] [WARN] [error.go:89] [“Error when handling request”] [uri=/dashboard/api/topology/scheduling] [remoteAddr=10.xxx.xxx.153:44246] [errorFullText=]
[2025/08/07 14:26:54.517 +08:00] [WARN] [tso.go:334] [“clock offset”] [jet-lag=318.496387ms] [prev-physical=2025/08/07 14:26:54.199 +08:00] [now=2025/08/07 14:26:54.517 +0
8:00] [update-physical-interval=50ms]
—接着就很多类似如下的报错
[2025/08/07 14:26:55.626 +08:00] [WARN] [util.go:167] [“apply request took too long”] [took=146.613908ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:
"/pd/7506434620956800431/member/13250362261525450493/git_hash" “] [response=“range_response_count:1 size:123”]
[2025/08/07 14:26:55.626 +08:00] [INFO] [trace.go:152] [“trace[153750403] range”] [detail=”{range_begin:/pd/7506434620956800431/member/13250362261525450493/git_hash; range_en
d:; response_count:1; response_revision:20735107; }”] [duration=146.656934ms] [start=2025/08/07 14:26:55.479 +08:00] [end=2025/08/07 14:26:55.626 +08:00] [steps=”["trace[153
750403] ‘agreement among raft nodes before linearized reading’ (duration: 146.558048ms)"]”]
[2025/08/07 14:26:56.458 +08:00] [WARN] [tso.go:334] [“clock offset”] [jet-lag=848.850783ms] [prev-physical=2025/08/07 14:26:55.609 +08:00] [now=2025/08/07 14:26:56.458 +0
8:00] [update-physical-interval=50ms]
[2025/08/07 14:26:56.765 +08:00] [INFO] [trace.go:152] [“trace[6604888] linearizableReadLoop”] [detail=“{readStateIndex:20738039; appliedIndex:20738039; }”] [duration=385.977
681ms] [start=2025/08/07 14:26:56.379 +08:00] [end=2025/08/07 14:26:56.765 +08:00] [steps=“["trace[6604888] ‘read index received’ (duration: 385.974845ms)","trace[6604888
] ‘applied index is now lower than readState.Index’ (duration: 2.385µs)"]”]
[2025/08/07 14:27:02.048 +08:00] [WARN] [raft.go:441] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=b7e2be45f
f774efd] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=2.303206548s]
--------------------然后就是发生选举了
[2025/08/07 14:27:09.004 +08:00] [WARN] [lease.go:173] [“lease keep alive failed”] [purpose=“leader election”] [start=2025/08/07 14:27:08.659 +08:00] [error=“context canceled
“]
[2025/08/07 14:27:09.005 +08:00] [INFO] [lease.go:190] [“stop lease keep alive worker”] [purpose=“leader election”]
[2025/08/07 14:27:09.005 +08:00] [WARN] [lease.go:173] [“lease keep alive failed”] [purpose=“leader election”] [start=2025/08/07 14:27:09.005 +08:00] [error=“context canceled
“]
[2025/08/07 14:27:09.005 +08:00] [WARN] [lease.go:173] [“lease keep alive failed”] [purpose=“leader election”] [start=2025/08/07 14:27:06.873 +08:00] [error=“context canceled
“]
[2025/08/07 14:27:09.327 +08:00] [WARN] [tso.go:334] [“clock offset”] [jet-lag=1.372870775s] [prev-physical=2025/08/07 14:27:07.954 +08:00] [now=2025/08/07 14:27:09.327 +0
8:00] [update-physical-interval=50ms]
[2025/08/07 14:27:09.415 +08:00] [WARN] [raft.go:441] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=b7e2be45f
f774efd] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=1.419305471s]
[2025/08/07 14:27:09.416 +08:00] [WARN] [raft.go:441] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=835032772
834fce9] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=1.419356449s]
[2025/08/07 14:27:09.416 +08:00] [INFO] [trace.go:152] [“trace[226867292] linearizableReadLoop”] [detail=”{readStateIndex:20738052; appliedIndex:20738052; }”] [duration=2.630
75935s] [start=2025/08/07 14:27:06.785 +08:00] [end=2025/08/07 14:27:09.416 +08:00] [steps=”["trace[226867292] ‘read index received’ (duration: 2.630756381s)","trace[2268
67292] ‘applied index is now lower than readState.Index’ (duration: 2.549µs)"]”]
[2025/08/07 14:27:09.582 +08:00] [WARN] [util.go:167] [“apply request took too long”] [took=251.805091ms] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:
"/tidb/server/info" range_end:"/tidb/server/infp" “] [response=] [error=“context canceled”]
[2025/08/07 14:27:09.582 +08:00] [INFO] [trace.go:152] [“trace[1543858108] range”] [detail=”{range_begin:/tidb/server/info; range_end:/tidb/server/infp; }”] [duration=251.865
614ms] [start=2025/08/07 14:27:09.330 +08:00] [end=2025/08/07 14:27:09.582 +08:00] [steps=”["trace[1543858108] ‘agreement among raft nodes before linearized reading’ (durat
ion: 251.804422ms)"]”]
[2025/08/07 14:27:10.721 +08:00] [WARN] [tidb_requests.go:39] [“failed to get tidb schema version”] [error=“context deadline exceeded”]
[2025/08/07 14:27:11.445 +08:00] [INFO] [server.go:1815] [“no longer a leader because lease has expired, pd leader will step down"]
[2025/08/07 14:27:11.445 +08:00] [WARN] [util.go:167] [“apply request took too long”] [took=1.862660743s] [expected-duration=100ms] [prefix=“read-only range “] [request=“key:
"/tidb/ddl/global_schema_version" “] [response=] [error=“context deadline exceeded”]
[2025/08/07 14:27:11.445 +08:00] [INFO] [trace.go:152] [“trace[576701383] range”] [detail=”{range_begin:/tidb/ddl/global_schema_version; range_end:; }”] [duration=1.862698572
s] [start=2025/08/07 14:27:09.583 +08:00] [end=2025/08/07 14:27:11.445 +08:00] [steps=”["trace[576701383] ‘agreement among raft nodes before linearized reading’ (duration:
1.862659542s)"]”]
[2025/08/07 14:27:11.445 +08:00] [WARN] [context.go:174] [trace] [elapsed=1.550975391s] [rows=0] [sql=“SELECT * FROM log_previews WHERE task_group_id = "128" ORDER BY tim
e LIMIT 500”]
[2025/08/07 14:27:12.142 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az1-1] [server=pd-az2-1] [error=”[PD:apiutil:ErrRedirectToNotLeade
r]redirect to not leader”]
[2025/08/07 14:27:13.929 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az2-1] [server=pd-az2-1] [error=”[PD:apiutil:ErrRedirectToNotLeade
r]redirect to not leader"]
[2025/08/07 14:27:13.930 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az2-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeade
r]redirect to not leader”]
[2025/08/07 14:27:15.204 +08:00] [WARN] [stat_persist.go:29] [trace] [elapsed=6.543094639s] [rows=1] [sql=“DELETE FROM xxx WHERE layer_num = 0 AND time = "2025-08-
07 13:26:07.575"”]
[2025/08/07 14:27:15.275 +08:00] [WARN] [raft.go:441] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=b7e2be45f
f774efd] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=3.158787322s]
[2025/08/07 14:27:15.275 +08:00] [WARN] [raft.go:441] [“leader failed to send out heartbeat on time; took too long, leader is overloaded likely from slow disk”] [to=835032772
834fce9] [heartbeat-interval=500ms] [expected-duration=1s] [exceeded-duration=3.158808306s]
[2025/08/07 14:27:15.275 +08:00] [INFO] [cluster.go:708] [“update store stats background jobs has been stopped”]
[2025/08/07 14:27:15.275 +08:00] [INFO] [cluster.go:2314] [“min resolved ts background jobs has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [cluster.go:666] [“metrics are reset”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [cluster.go:668] [“metrics collection job has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [cluster.go:456] [“gc tuner is stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [cluster.go:387] [“service check job is stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [server.go:127] [“region syncer has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [cluster.go:690] [“node state check job has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [cluster.go:500] [“sync store config job is stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [labeler.go:70] [“region labeler GC stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [scheduling_controller.go:160] [“scheduling metrics are reset”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [scheduling_controller.go:162] [“scheduling metrics collection job has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [scheduling_controller.go:138] [“statistics background jobs has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [coordinator.go:341] [“drive push operator has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [coordinator.go:277] [“check suspect key ranges has been stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [coordinator.go:359] [“drive slow node scheduler is stopped”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [coordinator.go:389] [“coordinator is stopping”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [scheduler_controller.go:363] [“scheduler has been stopped”] [scheduler-name=evict-slow-store-scheduler] [error=“context canceled”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [scheduler_controller.go:363] [“scheduler has been stopped”] [scheduler-name=balance-region-scheduler] [error=“context canceled”]
[2025/08/07 14:27:15.276 +08:00] [INFO] [scheduler_controller.go:363] [“scheduler has been stopped”] [scheduler-name=balance-leader-scheduler] [error=“context canceled”]
[2025/08/07 14:27:15.277 +08:00] [WARN] [etcdutil.go:153] [“kv gets too slow”] [request-key=/pd/7506434620956800431/config] [cost=2.727653733s]
[2025/08/07 14:27:15.318 +08:00] [WARN] [context.go:174] [trace] [elapsed=3.626012478s] [rows=1] [sql=“SELECT * FROM log_search_task_groups WHERE id = "128" ORDER BY log_search_task_groups.id LIMIT 1”]
[2025/08/07 14:27:16.142 +08:00] [INFO] [raft.go:865] [“690db2d1cf394467 [term: 11] received a MsgAppResp message with higher term from 835032772834fce9 [term: 12]”]
[2025/08/07 14:27:16.142 +08:00] [INFO] [raft.go:706] [“690db2d1cf394467 became follower at term 12”]
[2025/08/07 14:27:16.142 +08:00] [INFO] [node.go:333] [“raft.node: 690db2d1cf394467 lost leader 690db2d1cf394467 at term 12”]
[2025/08/07 14:27:16.187 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az4-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader”]
[2025/08/07 14:27:16.187 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az4-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader”]
[2025/08/07 14:27:16.187 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az2-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader”]
[2025/08/07 14:27:16.187 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az2-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader”]
[2025/08/07 14:27:16.188 +08:00] [INFO] [scheduler_controller.go:363] [“scheduler has been stopped”] [scheduler-name=balance-hot-region-scheduler] [error=“context canceled”]
[2025/08/07 14:27:16.569 +08:00] [WARN] [etcdutil.go:153] [“kv gets too slow”] [request-key=/pd/7506434620956800431/config] [cost=1.292322833s]
[2025/08/07 14:27:16.570 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az2-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader”]
[2025/08/07 14:27:16.571 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az1-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader”]
[2025/08/07 14:27:16.571 +08:00] [ERROR] [middleware.go:219] [“redirect but server is not leader”] [from=pd-az1-1] [server=pd-az2-1] [error=“[PD:apiutil:ErrRedirectToNotLeader]redirect to not leader”]
[2025/08/07 14:27:16.571 +08:00] [WARN] [util.go:167] [“apply request took too long”] [took=9.786204457s] [expected-duration=100ms] [prefix="read-only range "] [request="key:"health" "] [response=] [error=“context canceled”]
【其他附件:截图/日志/监控】
操作系统日志有
2025-08-07T14:26:50.498355+08:00 db-tidb-prod-az2-2 systemd[1]: node_exporter-9100.service: Main process exited, code=exited, status=1/FAILURE
2025-08-07T14:26:50.498491+08:00 db-tidb-prod-az2-2 systemd[1]: node_exporter-9100.service: Failed with result ‘exit-code’.
2025-08-07T14:26:52.019723+08:00 db-tidb-prod-az2-2 snapd[2915804]: storehelpers.go:916: cannot refresh: snap has no updates available: “amazon-ssm-agent”, “core18”, “core22”
, “snapd”
2025-08-07T14:27:06.260048+08:00 db-tidb-prod-az2-2 systemd[1]: node_exporter-9100.service: Scheduled restart job, restart counter is at 351095.
2025-08-07T14:27:07.813857+08:00 db-tidb-prod-az2-2 kernel: systemd-journald[404263]: Under memory pressure, flushing caches.
2025-08-07T14:27:09.444696+08:00 db-tidb-prod-az2-2 systemd[1]: Started node_exporter-9100.service - node_exporter service.
2025-08-07T14:27:14.252324+08:00 db-tidb-prod-az2-2 bash[1162229]: ts=2025-08-07T06:27:14.251Z caller=node_exporter.go:180 level=info msg=“Starting node_exporter” version=“(version=1.5.0, branch=HEAD, revision=1b48970ffcf5630534fb00bb0687d73c66d1c959)”
2025-08-07T14:27:15.157025+08:00 db-tidb-prod-az2-2 bash[1162229]: ts=2025-08-07T06:27:14.251Z caller=node_exporter.go:181 level=info msg=“Build context” build_context=“(go=go1.19.3, user=root@6e7732a7b81b, date=20221129-18:59:41)”
2025-08-07T14:27:15.157141+08:00 db-tidb-prod-az2-2 bash[1162229]: ts=2025-08-07T06:27:14.643Z caller=diskstats_common.go:111 level=info collector=diskstats msg=“Parsed flag --collector.diskstats.device-exclude” flag=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
2025-08-07T14:27:15.157170+08:00 db-tidb-prod-az2-2 bash[1162229]: ts=2025-08-07T06:27:14.643Z caller=filesystem_common.go:111 level=info collector=filesystem msg=“Parsed flag --collector.filesystem.mount-points-exclude” flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
2025-08-07T14:27:15.157189+08:00 db-tidb-prod-az2-2 bash[1162229]: ts=2025-08-07T06:27:14.643Z caller=filesystem_common.go:113 level=info collector=filesystem msg=“Parsed flag --collector.filesystem.fs-types-exclude” flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
这个集群没有什么业务量,pd的leader老是切换,压测环境配置更低,反而没有怎么切换,找aws的问了,说EC2 EBS和网络都没性能问题,这是哪里有问题吗?麻烦懂的大佬帮忙看看,谢谢!








