Uneven Region distribution in TiDB, and the corresponding PD schedulers show status halt

【TiDB Environment】Production
【TiDB Version】7.5.1
【Operating System】Anolis OS 7.9
【Deployment Method】Deployed on machines (160 GB RAM, 1.1 TB disk)
【Cluster Data Size】173.5 GB
【Cluster Node Count】1
【Problem Reproduction Path】
Single-node TiDB cluster that originally had three TiKV nodes. Jobs would occasionally fail with errors indicating resource pressure, and backups also reported region is unavailable, so two TiKV nodes were force-decommissioned (this was the landmine). Errors followed, so the cluster was scaled back out to three TiKV nodes. After the scale-out the cluster returned to normal, but the Region distribution is uneven: the data directory on the original, healthy TiKV node holds about 200 GB, while the two scaled-out nodes hold only 10+ GB each. Jobs now fail more often than before, with region is unavailable or please make sure tidb can connect to tikv.
【Problem Encountered: Symptoms and Impact】
Same as the reproduction path above; the impact is intermittent Region unavailability.

【Resource Configuration】Go to TiDB Dashboard -> Cluster Info -> Hosts and screenshot that page
【Copy and Paste the ERROR Logs】
PD error log:
[2025/10/23 16:16:13.710 +08:00] [INFO] [prepare_checker.go:65] ["not loaded from storage region number is satisfied, finish prepare checker"] [not-from-storage-region=17584] [total-region=17584]
[2025/10/23 16:16:13.710 +08:00] [INFO] [coordinator.go:390] ["coordinator has finished cluster information preparation"]
[2025/10/23 16:16:13.710 +08:00] [INFO] [coordinator.go:400] ["coordinator starts to run schedulers"]
[2025/10/23 16:16:13.711 +08:00] [INFO] [coordinator.go:461] ["create scheduler with independent configuration"] [scheduler-name=balance-hot-region-scheduler]
[2025/10/23 16:16:13.713 +08:00] [INFO] [coordinator.go:461] ["create scheduler with independent configuration"] [scheduler-name=balance-leader-scheduler]
[2025/10/23 16:16:13.715 +08:00] [INFO] [coordinator.go:461] ["create scheduler with independent configuration"] [scheduler-name=balance-region-scheduler]
[2025/10/23 16:16:13.716 +08:00] [INFO] [coordinator.go:461] ["create scheduler with independent configuration"] [scheduler-name=balance-witness-scheduler]
[2025/10/23 16:16:13.720 +08:00] [INFO] [coordinator.go:461] ["create scheduler with independent configuration"] [scheduler-name=evict-leader-scheduler]
[2025/10/23 16:16:13.720 +08:00] [ERROR] [coordinator.go:464] ["can not add scheduler with independent configuration"] [scheduler-name=evict-leader-scheduler] [scheduler-args="[4422294]"] [error="[PD:scheduler:ErrSchedulerExisted]scheduler existed"]
[2025/10/23 16:16:13.720 +08:00] [INFO] [coordinator.go:461] ["create scheduler with independent configuration"] [scheduler-name=transfer-witness-leader-scheduler]
[2025/10/23 16:16:13.721 +08:00] [INFO] [coordinator.go:487] ["create scheduler"] [scheduler-name=balance-region-scheduler] [scheduler-args=""]
[2025/10/23 16:16:13.721 +08:00] [INFO] [coordinator.go:487] ["create scheduler"] [scheduler-name=balance-leader-scheduler] [scheduler-args=""]
[2025/10/23 16:16:13.722 +08:00] [INFO] [coordinator.go:487] ["create scheduler"] [scheduler-name=balance-witness-scheduler] [scheduler-args=""]
[2025/10/23 16:16:13.722 +08:00] [INFO] [coordinator.go:487] ["create scheduler"] [scheduler-name=balance-hot-region-scheduler] [scheduler-args=""]
[2025/10/23 16:16:13.722 +08:00] [INFO] [coordinator.go:487] ["create scheduler"] [scheduler-name=transfer-witness-leader-scheduler] [scheduler-args=""]
[2025/10/23 16:16:13.722 +08:00] [INFO] [coordinator.go:487] ["create scheduler"] [scheduler-name=evict-leader-scheduler] [scheduler-args="[4422294]"]
[2025/10/23 16:16:13.722 +08:00] [INFO] [coordinator.go:487] ["create scheduler"] [scheduler-name=evict-leader-scheduler] [scheduler-args="[1]"]
[2025/10/23 16:16:13.726 +08:00] [INFO] [coordinator.go:256] ["coordinator begins to check suspect key ranges"]
[2025/10/23 16:16:13.726 +08:00] [INFO] [coordinator.go:320] ["coordinator begins to actively drive push operator"]
[2025/10/23 16:16:13.726 +08:00] [INFO] [coordinator.go:147] ["coordinator starts patrol regions"]
[2025/10/23 16:19:03.542 +08:00] [WARN] [grpclog.go:60] ["grpc: Server.processUnaryRPC failed to write status: connection error: desc = \"transport is closing\""]
[2025/10/23 16:19:22.445 +08:00] [WARN] [heartbeat_streams.go:165] ["send keepalive message fail, store maybe disconnected"] [target-store-id=12250001] [error=EOF]

【Other Attachments: Screenshots/Logs/Monitoring】

My guess is that the Regions never migrated over, the cluster's data distribution is out of balance, and some Regions may carry single-point risk because their replicas are concentrated on one node. Check the Region state via TIKV_REGION_STATUS; for example:
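A quick way to quantify this is through the mysql client (a sketch; `<tidb-host>`, the port, and the credentials are placeholders for your environment). TIKV_REGION_STATUS gives per-Region size and key details; its companion table TIKV_REGION_PEERS is the easier one for per-store counts:

```
# Peers and leaders per store; on a balanced 3-node cluster the counts should be close
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT STORE_ID, COUNT(*) AS peer_count, SUM(IS_LEADER) AS leader_count
  FROM INFORMATION_SCHEMA.TIKV_REGION_PEERS
  GROUP BY STORE_ID;"

# Regions that currently have fewer than 3 replicas; these are the ones
# most at risk of 'region is unavailable'
mysql -h <tidb-host> -P 4000 -u root -p -e "
  SELECT REGION_ID, COUNT(*) AS replica_count
  FROM INFORMATION_SCHEMA.TIKV_REGION_PEERS
  GROUP BY REGION_ID
  HAVING replica_count < 3
  LIMIT 20;"
```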

Data skew?

Check the Region status first.
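You can also ask PD directly with pd-ctl (a sketch; the tiup version tag and `<pd-host>` are placeholders) to list unhealthy Regions:

```
# Regions missing a replica, with a down peer, or with a pending peer
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 region check miss-peer
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 region check down-peer
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 region check pending-peer
```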

A "transport is closing" error?

  • After the new TiKV nodes were scaled out, PD should by default trigger balance-region scheduling, but metadata left over from the earlier forced decommission, plus possibly over-conservative scheduling parameters (concurrency, rate limits), kept the scheduling from making real progress, so the original node's Regions (200 GB) barely migrated to the new nodes (only 10+ GB); see the pd-ctl sketch after this list;
  • A single TiDB node plus the original TiKV node carries almost all of the data and requests; the single-point overload (CPU, memory, IO) further stalls PD scheduling, forming a vicious cycle of "high pressure -> slow scheduling -> worse imbalance -> higher pressure".
  1. Single-point risk from the uneven Region distribution: the original TiKV node holds over 95% of the Regions, so any brief resource exhaustion on it (busy IO, memory jitter) leaves a large number of its Regions unable to serve requests, producing region is unavailable; at the same time, TiDB's connections to that TiKV node drop briefly under the load, producing please make sure tidb can connect to tikv
  2. The single-node TiDB resource constraint amplifies the problem: the cluster has only one TiDB node, so all business requests, DDL operations, and statistics analysis land on it; once the original TiKV node is overloaded, TiDB's interactions with TiKV time out, further increasing the frequency of the intermittent failures.
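To check the first point, inspect what PD is actually running (a sketch; `<pd-host>` is a placeholder, and whether the `halt-scheduling` key appears in your config depends on the version):

```
# List the schedulers PD has loaded; the startup log above shows a leftover
# evict-leader-scheduler, which actively keeps leaders off a store
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 scheduler show

# If it targets a store that is back in service (store 1 per the log args),
# remove it so leaders can move back; the store id is part of the name
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 scheduler remove evict-leader-scheduler-1

# Scheduling can also be halted globally, which is worth ruling out given
# the "halt" status mentioned in the title
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 config show | grep -i halt
```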

PD's default balance policy is fairly gentle.
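So if the schedulers are running but merely throttled, the throughput limits can be raised temporarily (a sketch; the values are illustrative and `<pd-host>` is a placeholder; dial the limits back once the cluster is balanced):

```
# Current per-store add/remove-peer rate limits and global schedule limits
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 store limit
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 config show | grep -i schedule-limit

# Raise them so Regions migrate to the new stores faster
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 store limit all 30
tiup ctl:v7.5.1 pd -u http://<pd-host>:2379 config set region-schedule-limit 8
```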