GC is not running properly

Production environment, deployed with tiup.
GC has been unable to run normally since January 8; we modified tikv_gc_safe_point.
[root@ecs-de9c-1219806 log]# grep gc_work *.log
tidb.log:[2026/01/26 17:21:32.902 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:22:31.517 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:23:42.275 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:24:41.468 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:25:33.538 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:26:33.124 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:27:33.824 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:28:31.799 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:29:30.452 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:30:30.262 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:31:30.624 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:32:29.680 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:33:52.321 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:34:52.737 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:35:31.613 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:36:30.915 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:37:31.547 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:38:35.984 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:39:36.475 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:40:29.864 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:41:34.779 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:42:30.601 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:44:03.665 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:44:43.820 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:45:35.285 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:46:31.554 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:47:31.268 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:48:32.124 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:49:33.793 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:50:33.526 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:51:33.980 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
tidb.log:[2026/01/26 17:52:40.513 +08:00] [INFO] [gc_worker.go:379] ["there's already a gc job running, skipped"] [category="gc worker"] ["leaderTick on"=66fe5c5885c0058]
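
A minimal read-only check of the GC bookkeeping in the mysql.tidb system table can confirm whether GC is actually stuck (these are TiDB's standard GC state variables; tikv_gc_last_run_time and tikv_gc_safe_point should advance every GC interval, 10 minutes by default):

-- Dump all GC worker state variables
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc_%';
-- tikv_gc_leader_uuid should match the "leaderTick on" UUID in the log above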

Back up the metadata first, then reset the GC state through a tool (such as tidb-ctl) or by operating on the system tables directly. Also check the Dashboard for that day to see whether IO, CPU and the like were saturated, and analyze the cause from there.
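
If you take the system-table route, the GC worker's leader state also lives in mysql.tidb. Here is a read-only sketch to identify which tidb-server currently holds the (possibly stuck) GC leader lease; back up before attempting any direct UPDATE:

-- Show the current GC leader and its lease
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME IN ('tikv_gc_leader_uuid', 'tikv_gc_leader_desc', 'tikv_gc_leader_lease');
-- tikv_gc_leader_desc gives host:port; restarting that tidb-server
-- lets the lease expire and forces a new GC leader election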

A GC job is stuck.

"GC has been unable to run normally since January 8; we modified tikv_gc_safe_point." Can that actually be modified? It should be a status value. Use this value to check whether GC is advancing.
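
One way to watch for advancement, assuming read access to mysql.tidb: run the query below twice, roughly ten minutes apart (the default tidb_gc_run_interval), and compare the values.

SELECT VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME = 'tikv_gc_safe_point';
-- if the value does not move forward between runs, GC is not advancing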

If there is no business traffic at night, you could try restarting.

Feels like the cluster is short on resources.

The GC run is taking too long; it is probably blocked.

Core triggers of GC blockage: in production, GC failing to run is mostly caused by a long-running unfinished transaction, lock contention, a blocking backup or replication task, or an abnormal TiKV node.

Long-running unfinished transactions

  1. Locate the long transaction: from the results of INFORMATION_SCHEMA.CLUSTER_TIDB_TRX, note START_TIME (when the transaction started), SESSION_ID (the owning session), INSTANCE (the TiDB node executing it), and CURRENT_SQL_DIGEST_TEXT (the SQL it is running); see the query sketch after this list.
  2. Handling
  • If it is a legitimate long business transaction (e.g. a bulk data import): wait for it to commit or roll back, or have the business side split the transaction, so that no single transaction runs longer than the GC life time.
  • If it is an abnormally hung transaction (e.g. the application crashed and left it uncommitted): forcibly kill its session (requires sufficient privileges):

-- Replace <session_id> with the SESSION_ID found above; run this on the
-- INSTANCE that owns the session (or on any node if global kill is enabled)
KILL TIDB <session_id>;

  3. Verify: query CLUSTER_TIDB_TRX again after the kill; if no matching row remains, the transaction has been cleaned up.
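
A sketch of the lookup for step 1, assuming TiDB v5.1 or later, where INFORMATION_SCHEMA.CLUSTER_TIDB_TRX is available; the one-hour threshold is only an example:

-- List transactions that have been open for over an hour, oldest first
SELECT INSTANCE, SESSION_ID, START_TIME, STATE, CURRENT_SQL_DIGEST_TEXT
FROM INFORMATION_SCHEMA.CLUSTER_TIDB_TRX
WHERE START_TIME < NOW() - INTERVAL 1 HOUR
ORDER BY START_TIME;
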
Yes, long-running unfinished transactions are indeed one of the main causes of GC (garbage collection) being blocked in TiDB production environments.

The cluster may be running out of resources; restart it and see.

Check whether there are any long transactions.

The log already makes it obvious; clear out the previous GC job first.

The log shows a stuck GC job: first find and kill any long-running transactions (INFORMATION_SCHEMA.CLUSTER_TIDB_TRX), then restart the TiDB node that holds the GC leader to release it.

There is tiup ctl:v7.5.0 pd -u http://127.0.0.1:2379 service-gc-safepoint, which lets you see which task is holding the safe point back.
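
The safe_point values that command prints are TSOs. To see what wall-clock time a stuck service safe point corresponds to, TiDB's built-in TIDB_PARSE_TSO function can decode it (the TSO below is a placeholder; substitute the value from your own output):

-- Decode a TSO into a readable timestamp
SELECT TIDB_PARSE_TSO(446123456789012345);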