Problem summary
GC does not clean up old MVCC versions as expected.
Environment
mysql> select version();
+--------------------+
| version()          |
+--------------------+
| 5.7.25-TiDB-v7.1.0 |
+--------------------+
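PD, TiKV, and TiDB are deployed together on the same hosts.
Problem details
Today is 2026-04-27, but tikv_gc_safe_point is still at 2026-04-25 and tikv_gc_last_run_time is stale as well, which means GC has not advanced for about 2 days.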
mysql> select * from mysql.tidb d where d.VARIABLE_NAME in ('tikv_gc_safe_point','tikv_gc_life_time','tikv_gc_last_run_time');
+-----------------------+-----------------------------+----------------------------------------------------------------------------------------+
| VARIABLE_NAME         | VARIABLE_VALUE              | COMMENT                                                                                |
+-----------------------+-----------------------------+----------------------------------------------------------------------------------------+
| tikv_gc_safe_point    | 20260425-10:24:59.290 +0800 | All versions after safe point can be accessed. (DO NOT EDIT)                           |
| tikv_gc_life_time     | 10m0s                       | All versions within life time will not be collected by GC, at least 10m, in Go format. |
| tikv_gc_last_run_time | 20260425-12:25:34.589 +0800 | The time when last GC starts. (DO NOT EDIT)                                            |
+-----------------------+-----------------------------+----------------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
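The staleness visible in the query result can be quantified by parsing the timestamp stored in mysql.tidb. The following is a minimal sketch, assuming the value always uses the `YYYYMMDD-HH:MM:SS.mmm +ZZZZ` layout shown above; the function name `gc_lag` and the hard-coded "now" are illustrative and not part of TiDB.

```python
from datetime import datetime, timedelta, timezone

# Layout of tikv_gc_safe_point / tikv_gc_last_run_time as it appears in the
# query output above (assumed to be stable for this TiDB version).
TIDB_GC_TIME_FORMAT = "%Y%m%d-%H:%M:%S.%f %z"


def gc_lag(safe_point: str, now: datetime) -> timedelta:
    """Return how far the GC safe point is behind `now`."""
    return now - datetime.strptime(safe_point, TIDB_GC_TIME_FORMAT)


# "now" is an assumed wall-clock time on 2026-04-27, chosen for illustration.
now = datetime(2026, 4, 27, 14, 0, 0, tzinfo=timezone(timedelta(hours=8)))
lag = gc_lag("20260425-10:24:59.290 +0800", now)
print(f"safe point lag: {lag}")
```

With tikv_gc_life_time at 10m0s, a healthy cluster would be expected to show a lag of roughly that size plus one GC interval, so a lag of days is a clear sign that something is holding the safe point back.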
/*** Supplementary TiDB log (gc_worker entries) from a similar scenario
[2026/05/09 15:09:34.632 +08:00] [INFO] [gc_worker.go:1250] ["[gc worker] start resolve locks"] [uuid=6770a0e28ac0001] [safePoint=0] [try-resolve-locks-ts=466173368622120974] [concurrency=3]
[2026/05/09 15:09:34.632 +08:00] [INFO] [range_task.go:137] ["range task started"] [name=resolve-locks-runner] [startKey=] [endKey=] [concurrency=3]
[2026/05/09 15:09:34.838 +08:00] [INFO] [range_task.go:246] ["range task finished"] [name=resolve-locks-runner] [startKey=] [endKey=] ["cost time"=205.679483ms] ["completed regions"=230]
[2026/05/09 15:09:34.838 +08:00] [INFO] [gc_worker.go:1272] ["[gc worker] finish resolve locks"] [uuid=6770a0e28ac0001] [safePoint=0] [try-resolve-locks-ts=466173368622120974] [regions=230]
[2026/05/09 15:10:34.638 +08:00] [INFO] [gc_worker.go:750] ["[gc worker] there's another service in the cluster requires an earlier safe point. gc will continue with the earlier one"] [uuid=6770a0e28ac0001] [ourSafePoint=466173305707298816] [minSafePoint=466172647188463622]
[2026/05/09 15:10:34.638 +08:00] [INFO] [gc_worker.go:726] ["[gc worker] last safe point is later than current one.No need to gc.This might be caused by manually enlarging gc lifetime"] ["leaderTick on"=6770a0e28ac0001] ["last safe point"=2026/05/09 14:18:42.539 +08:00] ["current safe point"=2026/05/09 14:18:42.539 +08:00]
[2026/05/09 15:10:34.639 +08:00] [INFO] [gc_worker.go:1250] ["[gc worker] start resolve locks"] [uuid=6770a0e28ac0001] [safePoint=0] [try-resolve-locks-ts=466173384363606017] [concurrency=3]
[2026/05/09 15:10:34.639 +08:00] [INFO] [range_task.go:137] ["range task started"] [name=resolve-locks-runner] [startKey=] [endKey=] [concurrency=3]
[2026/05/09 15:10:34.848 +08:00] [INFO] [range_task.go:246] ["range task finished"] [name=resolve-locks-runner] [startKey=] [endKey=] ["cost time"=209.279219ms] ["completed regions"=230]
[2026/05/09 15:10:34.848 +08:00] [INFO] [gc_worker.go:1272] ["[gc worker] finish resolve locks"] [uuid=6770a0e28ac0001] [safePoint=0] [try-resolve-locks-ts=466173384363606017] [regions=230]
[2026/05/09 15:11:27.308 +08:00] [INFO] [domain.go:2652] ["refreshServerIDTTL succeed"] [serverID=2982014] ["lease id"=4f2c9dc28364930a]
***/
Solution
1) Check whether a long-running transaction is preventing GC from advancing.
select * from INFORMATION_SCHEMA.CLUSTER_PROCESSLIST d where d.COMMAND!='Sleep' order by d.TIME desc;
2) Check the GC-related variables and confirm that GC has not been disabled.
mysql> show variables like 'tidb_gc_enable%';
+----------------+-------+
| Variable_name  | Value |
+----------------+-------+
| tidb_gc_enable | ON    |
+----------------+-------+
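3) Search the TiDB log for gc_worker entries and look for ERROR or WARN messages.
4) Use pd-ctl to check whether another service (such as TiCDC or Backup) holds an earlier service safe point that blocks GC from advancing.
Problem 1: caused by BR log backup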
» service-gc-safepoint
{
"service_gc_safe_points": [
{
"service_id": "gc_worker",
"expired_at": 9223372036854775807,
"safe_point": 465900634004258816
},
{
"service_id": "log-backup-coordinator",
"expired_at": 1777356899,
"safe_point": 465851881695477760
}
],
"gc_safe_point": 465851881695477760
}
» tso 465900634004258816
system: 2026-04-27 14:04:34.589 +0800 CST
logic: 0
» tso 465851881695477760
system: 2026-04-25 10:24:59.29 +0800 CST -- same as tikv_gc_safe_point, so the blocker is log-backup-coordinator
logic: 0
»
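The `tso` subcommand above splits a TSO into a physical timestamp and a logical counter. The same conversion can be sketched offline, assuming the usual TiDB layout in which the low 18 bits are the logical counter and the remaining high bits are Unix milliseconds; the function name `decode_tso` and the fixed +0800 display zone are choices made here for illustration.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # assumed TSO layout: low 18 bits hold the logical counter


def decode_tso(tso: int) -> tuple[datetime, int]:
    """Split a TSO into (physical time, logical counter)."""
    physical_ms = tso >> LOGICAL_BITS
    logical = tso & ((1 << LOGICAL_BITS) - 1)
    tz = timezone(timedelta(hours=8))  # display in +0800, matching the pd-ctl output
    # Integer arithmetic avoids float rounding of the millisecond part.
    physical = datetime.fromtimestamp(physical_ms // 1000, tz=tz)
    physical += timedelta(milliseconds=physical_ms % 1000)
    return physical, logical


for tso in (465900634004258816, 465851881695477760):
    physical, logical = decode_tso(tso)
    print(tso, physical.isoformat(timespec="milliseconds"), logical)
```

Decoding every safe_point in the service-gc-safepoint output this way makes it easy to see at a glance which entry is days behind the others.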
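log-backup-coordinator is a core component of TiDB log backup (the incremental backup used for PITR). Once a log backup task is started, the coordinator registers a service safe point so that the historical data (MVCC versions) the backup still needs is not removed by GC, which keeps the backup continuous and restorable.
This check led to the root cause: the minio service was not configured to start automatically, so after the BR backup node went down, minio stayed stopped and the log backup task did not resume. The fix is to start minio and resume the task.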
[root@localhost ~]# systemctl status minio
● minio.service - MinIO
Loaded: loaded (/etc/systemd/system/minio.service; disabled; vendor preset: disabled)
Active: inactive (dead)
Docs: https://min.io/docs/minio/linux/index.html
[root@localhost ~]# systemctl start minio
[root@localhost ~]# systemctl status minio
[tidb@localhost log]$ tiup br log status --pd 10.10.30.203:14279
Starting component br: /home/tidb/.tiup/components/br/v7.1.0/br log status --pd 10.10.30.203:14279
Detail BR log in /tmp/br.log.2026-04-27T14.19.57+0800
● Total 1 Tasks.
> #1 <
name: logbak_minio
status: ○ ERROR
start: 2026-03-16 14:13:53.539 +0800
end: 2090-11-18 22:07:45.624 +0800
storage: s3://testbuct/minio_logbak_2603161409
speed(est.): 0.00 ops/s
checkpoint[global]: 2026-04-25 10:24:59.29 +0800; gap=51h55m0s
error[store=164]: KV:LogBackup:Io
error-happen-at[store=164]: 2026-04-25 13:18:57.29 +0800; gap=49h1m2s
error-message[store=164]: I/O Error: failed to put object rusoto error Error during dispatch: error trying to connect: tcp connect error: Connection refused (os error 111)
[tidb@localhost log]$ tiup br log resume --task-name logbak_minio --pd 10.10.30.203:14279
Starting component br: /home/tidb/.tiup/components/br/v7.1.0/br log resume --task-name logbak_minio --pd 10.10.30.203:14279
Detail BR log in /tmp/br.log.2026-04-27T14.20.07+0800
[2026/04/27 14:20:09.064 +08:00] [INFO] [collector.go:77] ["log resume"] [streamTaskInfo="{taskName=logbak_minio,startTs=464949512382447622,endTS=999999999999999999,tableFilter=*.*}"]
[2026/04/27 14:20:25.120 +08:00] [INFO] [collector.go:77] ["log resume success summary"] [total-ranges=0] [ranges-succeed=0] [ranges-failed=0] [total-take=18.088103452s]
After the steps above, wait for the next GC round and check service-gc-safepoint with pd-ctl again; GC advances normally.
After the br task is resumed, service-gc-safepoint returns:
» service-gc-safepoint
{
"service_gc_safe_points": [
{
"service_id": "backup-stream-logbak_minio-1",
"expired_at": 1777279492,
"safe_point": 465901229490569216
},
{
"service_id": "backup-stream-logbak_minio-164",
"expired_at": 1777279491,
"safe_point": 465901229490569216
},
{
"service_id": "backup-stream-logbak_minio-508005",
"expired_at": 1777279486,
"safe_point": 465901228271861762
},
{
"service_id": "gc_worker",
"expired_at": 9223372036854775807,
"safe_point": 465901137320738816
},
{
"service_id": "log-backup-coordinator",
"expired_at": 1777358795,
"safe_point": 465901266322587648
},
{
"service_id": "logbak_minio_pause_safepoint", -- this entry is cleaned up automatically on the next run; it can also be removed manually
"expired_at": 1777272609,
"safe_point": 465851881695477760
}
],
"gc_safe_point": 465851881695477760
}
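Reading the service-gc-safepoint output by eye gets tedious once several services are registered. The sketch below picks the entry with the smallest safe_point and reports how far it lags behind gc_worker. It assumes the JSON shape shown above, that the inline `--` comment has been stripped so the text is valid JSON, and that expired_at is a Unix timestamp in seconds; the helper names are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # assumed TSO layout: low 18 bits hold the logical counter


def find_gc_blocker(pd_output: str) -> dict:
    """Return the service safe point entry with the smallest safe_point."""
    data = json.loads(pd_output)
    return min(data["service_gc_safe_points"], key=lambda e: e["safe_point"])


def lag_behind_gc_worker_ms(pd_output: str) -> int:
    """Milliseconds between gc_worker's safe point and the blocker's."""
    data = json.loads(pd_output)
    entries = {e["service_id"]: e["safe_point"] for e in data["service_gc_safe_points"]}
    blocker = min(entries.values())
    return (entries["gc_worker"] >> LOGICAL_BITS) - (blocker >> LOGICAL_BITS)


# Trimmed copy of the output above, without the inline comment.
SAMPLE = """
{
  "service_gc_safe_points": [
    {"service_id": "gc_worker", "expired_at": 9223372036854775807, "safe_point": 465901137320738816},
    {"service_id": "log-backup-coordinator", "expired_at": 1777358795, "safe_point": 465901266322587648},
    {"service_id": "logbak_minio_pause_safepoint", "expired_at": 1777272609, "safe_point": 465851881695477760}
  ],
  "gc_safe_point": 465851881695477760
}
"""

blocker = find_gc_blocker(SAMPLE)
expires = datetime.fromtimestamp(blocker["expired_at"], tz=timezone(timedelta(hours=8)))
print("blocker:", blocker["service_id"])
print("lag behind gc_worker (ms):", lag_behind_gc_worker_ms(SAMPLE))
print("blocker expires at:", expires.isoformat())
```

In this sample the remaining blocker is the pause safe point, which matches the note on that entry: GC resumes once it expires or is removed.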
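Problem 2: caused by a TiCDC task
All TiCDC changefeeds had already been removed, but the service safe point registered by TiCDC still held GC back. Even stopping the cdc component did not let GC advance.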
"service_id": "ticdc-default-13416319017619663283",
"expired_at": 1778641765,
"safe_point": 466237428698710015
}
],
"gc_safe_point": 466237428698710015
}
» tso 466237428698710015
system: 2026-05-12 10:57:24.388 +0800 CST
logic: 262143
The service GC safe points can also be queried with curl:
curl http://10.10.30.203:14279/pd/api/v1/gc/safepoint
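The stale entry has to be force-deleted with curl.
Extension
Can a stale service_id be removed manually through the API? Yes. First list the current service GC safe points: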
curl http://10.10.30.203:14279/pd/api/v1/gc/safepoint
Then delete the stale service safe point with curl:
[tidb@localhost ~]$ curl -X DELETE http://10.10.30.203:14279/pd/api/v1/gc/safepoint/ticdc-default-13416319017619663283
"Delete service GC safepoint successfully."
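The same DELETE call can be scripted with a small guard against removing the wrong entry. This is a sketch only: it assumes the endpoint path used in the curl command above, and the `PROTECTED` set and both function names are hypothetical safeguards added here, not part of PD. Building the request is separated from sending it so the URL can be reviewed first.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Hypothetical safeguard: never delete the safe point owned by TiDB's own GC worker.
PROTECTED = {"gc_worker"}


def build_delete_request(pd_addr: str, service_id: str) -> Request:
    """Build (but do not send) the DELETE request for one service safe point."""
    if service_id in PROTECTED:
        raise ValueError(f"refusing to delete protected service_id: {service_id}")
    url = f"http://{pd_addr}/pd/api/v1/gc/safepoint/{quote(service_id, safe='')}"
    return Request(url, method="DELETE")


def delete_service_safepoint(pd_addr: str, service_id: str) -> str:
    """Send the DELETE request and return PD's response body as text."""
    with urlopen(build_delete_request(pd_addr, service_id), timeout=10) as resp:
        return resp.read().decode()


# Only builds the request here; call delete_service_safepoint() to actually send it.
req = build_delete_request("10.10.30.203:14279", "ticdc-default-13416319017619663283")
print(req.get_method(), req.full_url)
```

Deleting a service safe point that a live TiCDC changefeed or backup task still depends on can let GC remove data that task needs, so it is worth confirming the owner is really gone before sending the request.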