TiDB: disk space is not reclaimed after deleting data from a large table

【TiDB environment】Test environment
【TiDB version】8.5.1
【Cluster size】2 TiDB, 3 TiKV, and 3 PD nodes, six servers in total
【Problem: symptoms and impact】
After a large table is cleaned up, the space it occupies does not shrink. Querying with SELECT * FROM mysql.tidb WHERE variable_name IN ('tikv_gc_safe_point', 'tikv_gc_last_run_time'); returns normal results, as shown below.

Two things look abnormal:
1. The "gc_safe_point": 450204182675718148 returned by tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 service-gc-safepoint show is stuck at 2024-06-03 13:31:04.626 +0800 CST, which is abnormal.

2. The TiDB log reports errors about gaps between regions. The log is as follows:
[2026/02/03 15:08:17.464 +08:00] [INFO] [gc_worker.go:415] ["starts the whole job"] [category="gc worker"] [uuid=67021f41a540018] [safePoint=464021591790714880] [concurrency=3]
[2026/02/03 15:08:17.464 +08:00] [INFO] [gc_worker.go:1201] ["start resolve locks"] [category="gc worker"] [uuid=67021f41a540018] [safePoint=464021591790714880] [concurrency=3]
[2026/02/03 15:08:17.464 +08:00] [INFO] [range_task.go:167] ["range task started"] [name=resolve-locks-runner] [startKey=] [endKey=] [concurrency=3]
[2026/02/03 15:08:28.414 +08:00] [INFO] [domain.go:345] ["diff load InfoSchema success"] [isV2=false] [currentSchemaVersion=269919] [neededSchemaVersion=269920] ["elapsed time"=2.184694ms] [gotSchemaVersion=269920] [phyTblIDs="[454052,454054]"] [actionTypes="[11,11]"] [diffTypes="[\"truncate table\"]"]
[2026/02/03 15:08:28.421 +08:00] [INFO] [domain.go:1079] ["mdl gets lock, update self version to owner"] [jobID=454055] [version=269920]
[2026/02/03 15:09:06.978 +08:00] [WARN] [backoff.go:179] ["pdRPC backoffer.maxSleep 40000ms is exceeded, errors:\nPD returned regions have gaps, limit: 128 at 2026-02-03T15:08:59.959162437+08:00\nPD returned regions have gaps, limit: 128 at 2026-02-03T15:09:02.843258856+08:00\nPD returned regions have gaps, limit: 128 at 2026-02-03T15:09:05.214818078+08:00\ntotal-backoff-times: 19, backoff-detail: pdRPC:19, maxBackoffTimeExceeded: true, maxExcludedTimeExceeded: false\nlongest sleep type: pdRPC, time: 41626ms"]
[2026/02/03 15:09:06.978 +08:00] [INFO] [range_task.go:223] ["range task try to get range end key failure"] [name=resolve-locks-runner] [startKey=] [endKey=] [loadRegionKey=74800000000003150d] ["cost time"=49.514091412s] [error=]
[2026/02/03 15:09:06.978 +08:00] [ERROR] [gc_worker.go:1219] ["resolve locks failed"] [category="gc worker"] [uuid=67021f41a540018] [safePoint=464021591790714880] [error=]
[2026/02/03 15:09:06.978 +08:00] [ERROR] [gc_worker.go:750] ["resolve locks returns an error"] [category="gc worker"] [uuid=67021f41a540018] [error=]
[2026/02/03 15:09:06.979 +08:00] [ERROR] [gc_worker.go:220] [runGCJob] [category="gc worker"] [error=]
Could anyone help figure out where the problem is and how to fix it?


After GC finishes, you need to manually tell TiKV to run a compaction right away in order to release the space.
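
For reference, a minimal sketch of triggering a manual compaction with tikv-ctl, assuming the TiKV instances listen on the default port 20160; the 192.168.181.x address below is a placeholder for each TiKV host, and the exact flags are best double-checked against tikv-ctl --help for v8.5.1:

# Compact the write CF (where old MVCC versions live), then the default CF, on one TiKV node
tiup ctl:v8.5.1 tikv --host 192.168.181.x:20160 compact -d kv -c write --bottommost force
tiup ctl:v8.5.1 tikv --host 192.168.181.x:20160 compact -d kv -c default --bottommost force
# Repeat for each TiKV instance; only versions already below the GC safe point get physically removed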

# Connect to PD
tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379
# Run the command
>> region check
Check the cluster's health; check Region distribution and overlaps.
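
One note on the above, based on my understanding of pd-ctl in v8.5.1: region check expects a concrete check type as an argument, so the checks relevant to this thread would look like the following (please verify against your version's help output):

>> region check miss-peer
>> region check extra-peer
>> region check down-peer
>> region check pending-peer
>> region check empty-region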

From the information so far, the reason the space cannot be reclaimed is that gc_safe_point is stuck. Common causes of a stuck gc_safe_point are: a large transaction that has not finished, a stuck CDC changefeed, or an ongoing backup.

The error log contains:
[2026/02/03 15:09:06.978 +08:00] [ERROR] [gc_worker.go:1219] ["resolve locks failed"]

And since CDC has already been removed, it's worth suspecting an unfinished large transaction; can you check with SHOW PROCESSLIST?
Also, since this is a test cluster, you could consider restarting the whole cluster and then see whether gc_safe_point starts advancing.
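
To make those checks concrete, here is a rough checklist in command form; the <tidb-host> and <cdc-host> values are placeholders, and the cdc cli flags can differ slightly between versions:

# 1. Long-running or stuck transactions across all TiDB nodes
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT * FROM information_schema.cluster_tidb_trx\G"

# 2. Service safepoints registered with PD (TiCDC / BR / Lightning each hold one while active)
tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 service-gc-safepoint

# 3. Leftover TiCDC changefeeds, if a cdc component is still deployed
tiup cdc:v8.5.1 cli changefeed list --server=http://<cdc-host>:8300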


GC probably has not finished its cleanup yet.

TiDB's GC (garbage collection) has not been triggered properly or has not finished cleaning up.

  • tikv_gc_last_run_time shows a normal time, but the gc_safe_point in the service-gc-safepoint show output is stuck at 2024-06-03, far earlier than the current time, which means the GC process is blocked.
  • TiDB does not release disk space immediately after data is deleted; it has to wait for GC to clean up expired MVCC versions, and only then does TiKV's RocksDB background compaction reclaim the physical space.
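
A quick way to see where the GC side of that pipeline currently stands, as commands; <tidb-host> is a placeholder for one of the TiDB servers:

# GC bookkeeping kept by TiDB: safe point, last run time, life time, run interval
mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT variable_name, variable_value, comment FROM mysql.tidb WHERE variable_name LIKE 'tikv_gc%';"

# Service safepoints registered in PD; the smallest one caps how far GC can advance
tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 service-gc-safepoint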

The whole TiDB cluster has already been restarted, and SHOW PROCESSLIST is empty. Here are two more pieces of output; I'm not sure whether they help explain the log errors:

[tidb@test4 ~]$ tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 region range-holes
Starting component ctl: /home/tidb/.tiup/components/ctl/v8.5.1/ctl pd -u http://192.168.181.57:2379 region range-holes
[
  [
    "748000000000032AFF6F00000000000000F8",
    "748000000000032AFF7000000000000000F8"
  ]
]

[tidb@test4 ~]$ tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 region check extra-peer
Starting component ctl: /home/tidb/.tiup/components/ctl/v8.5.1/ctl pd -u http://192.168.181.57:2379 region check extra-peer
{"count":1,"regions":[{"id":91970761,"start_key":"74800000000006E6FF9700000000000000F8","end_key":"748000FFFFFFFFFFFFF900000000000000F8","epoch":{"conf_ver":41,"version":154974},"peers":[{"role_name":"Voter","id":91970762,"store_id":5},{"role_name":"Voter","id":91970763,"store_id":4},{"role_name":"Voter","id":91970764,"store_id":1}],"leader":{"role_name":"Voter","id":91970764,"store_id":1},"cpu_usage":0,"written_bytes":0,"read_bytes":239701,"written_keys":0,"read_keys":1573,"approximate_size":482,"approximate_keys":157380}]}

range-holes and extra-peer are two important warnings about the consistency of the cluster's underlying data. They indicate that although the cluster has started, its internal state is not healthy. It's recommended to investigate these two issues first, because they are very likely the root cause of the errors.
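
If it helps, a few standard pd-ctl queries for digging into those two warnings (same PD endpoint; region id 91970761 is taken from the extra-peer output above); this is a starting point for investigation rather than a fix:

# Full state of the flagged region, including its peers
tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 region 91970761

# Whether PD is already scheduling operators to repair it
tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 operator show

# Store states: an offline or disconnected store can explain both range holes and extra peers
tiup ctl:v8.5.1 pd -u http://192.168.181.57:2379 store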