Adding a column with DEFAULT CURRENT_TIMESTAMP online to a large table is too slow

TiDBer_nxTFcVWq · June 19, 2025, 08:13

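【TiDB environment】Production
【TiDB version】8.1
【Operating system】Red Hat 7.6
【Deployment】lxc, SSD + NVMe
【Cluster data volume】Several hundred GB
【Cluster nodes】7 TiKV nodes
【Reproduction path】Operations performed before the problem appeared
【Problem: symptoms and impact】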
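The same DDL finishes within seconds in the test environment but took more than 40 minutes in production. Production had already stopped update operations and was serving queries only. The test data was imported from production, so the data volumes are comparable. Both tables to be altered have more than 300 million rows. Table A has fewer columns and no large fields; table B has one longtext column and several long varchar columns. In the test environment both A and B finished within seconds. In production, table A took 3 minutes and table B took more than 40 minutes. I compared the parameters of the two environments: both essentially run on defaults with no special tuning.
The test environment has only 4 TiKV nodes, all on SSD.
Table A ran: add column createTime datetime DEFAULT CURRENT_TIMESTAMP
Table B ran: add column insertTime datetime DEFAULT CURRENT_TIMESTAMP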

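Production has CDC deployed, replicating data to MySQL and to a TiDB cluster in another data center. Before the DDL I paused CDC replication and temporarily disabled tidb_gc_enable, so that a long-running operation would not break CDC afterwards. I tested the same sequence on the test cluster, disabling it first and then running the DDL, and it was still fast.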

I also collected the region distribution of table B: production has more than 7,000 regions and test has more than 5,000, so the region counts do not differ much.

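The I/O on each node was not busy at the time either.
【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page
【Copy and paste the ERROR log】
There was no error. The two environments simply differ enormously in execution time. It looks as if the test environment did not trigger a data backfill for the new column with a dynamic default value, while production did.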

The tidb nodes log a large number of entries like these:
[2025/06/19 00:33:13.620 +08:00] [INFO] [syncer.go:352] ["syncer check all versions, someone is not synced"] [category=ddl] [info="instance ip 10.18.160.170, port 4000, id eb5abdf5-d0fe-4513-bd55-2f72d63729d3"] ["ddl job id"=235] [ver=226]
[2025/06/19 00:42:00.116 +08:00] [INFO] [syncer.go:352] ["syncer check all versions, someone is not synced"] [category=ddl] [info="instance ip 10.18.160.170, port 4000, id eb5abdf5-d0fe-4513-bd55-2f72d63729d3"] ["ddl job id"=235] [ver=226]
[2025/06/19 00:42:00.224 +08:00] [INFO] [ddl_worker.go:1454] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] [category=ddl] [ver=226] ["take time"=8m46.656811475s] [job="ID:235, Type:add column, State:running, SchemaState:delete only, SchemaID:108, TableID:213, RowCount:0, ArgLen:4, start time: 2025-06-19 00:33:13.507 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0, LocalMode: false"]
[2025/06/19 00:42:00.230 +08:00] [INFO] [ddl_worker.go:1206] ["run DDL job"] [worker="worker 1, tp general"] [category=ddl] [jobID=235] [conn=2725483390] [category=ddl] [job="ID:235, Type:add column, State:running, SchemaState:delete only, SchemaID:108, TableID:213, RowCount:0, ArgLen:0, start time: 2025-06-19 00:33:13.507 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0, LocalMode: false"]
[2025/06/19 00:42:00.242 +08:00] [INFO] [domain.go:280] ["diff load InfoSchema success"] [currentSchemaVersion=226] [neededSchemaVersion=227] ["start time"=1.498309ms] [gotSchemaVersion=227] [phyTblIDs="[213]"] [actionTypes="[5]"] [diffTypes="[\"add column\"]"]
[2025/06/19 00:42:00.245 +08:00] [INFO] [domain.go:886] ["mdl gets lock, update self version to owner"] [jobID=235] [version=227]
[2025/06/19 00:42:00.290 +08:00] [INFO] [syncer.go:390] ["syncer check all versions, someone is not synced, continue checking"] [category=ddl] [ddl=/tidb/ddl/all_schema_by_job_versions/235/eb5abdf5-d0fe-4513-bd55-2f72d63729d3] [currentVer=226] [latestVer=227]
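One way to check the backfill guess against these logs is to read the job="..." field directly. The sketch below is only an illustration: it assumes that RowCount in the job string counts rows processed by the job (as the ROW_COUNT column of ADMIN SHOW DDL JOBS does), that log lines start with a bracketed timestamp, and that the log and the job use the same time zone. The names parse_job and seconds_since_job_start are made up for this example.

```python
import re
from datetime import datetime

LOG_TS = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")
JOB_FIELD = re.compile(r'\[job="([^"]*)"\]')

# The "run DDL job" line from the post, with straight quotes and spacing restored.
SAMPLE = (
    '[2025/06/19 00:42:00.230 +08:00] [INFO] [ddl_worker.go:1206] ["run DDL job"] '
    '[worker="worker 1, tp general"] [category=ddl] [jobID=235] [conn=2725483390] '
    '[category=ddl] '
    '[job="ID:235, Type:add column, State:running, SchemaState:delete only, '
    'SchemaID:108, TableID:213, RowCount:0, ArgLen:0, '
    'start time: 2025-06-19 00:33:13.507 +0800 CST, Err:, ErrCount:0, '
    'SnapshotVersion:0, LocalMode: false"]'
)


def parse_job(line):
    """Return the key/value pairs of the job="..." field, or None if absent.

    Splits on ", " so it would mis-parse an Err value that itself contains commas.
    """
    m = JOB_FIELD.search(line)
    if m is None:
        return None
    job = {}
    for part in m.group(1).split(", "):
        key, _, value = part.partition(":")
        job[key.strip()] = value.strip()
    return job


def seconds_since_job_start(line):
    """Seconds between the job's 'start time' and the timestamp of this log line."""
    job = parse_job(line)
    ts = LOG_TS.match(line)
    if job is None or ts is None:
        return None
    logged = datetime.strptime(ts.group(1), "%Y/%m/%d %H:%M:%S.%f")
    # 'start time' looks like "2025-06-19 00:33:13.507 +0800 CST";
    # the first 23 characters hold the date and time.
    started = datetime.strptime(job["start time"][:23], "%Y-%m-%d %H:%M:%S.%f")
    return (logged - started).total_seconds()


if __name__ == "__main__":
    job = parse_job(SAMPLE)
    print(job["SchemaState"], job["RowCount"], seconds_since_job_start(SAMPLE))
```

If the assumption about RowCount holds, a job that is still in an early schema state with RowCount 0 many minutes after its start time would point to waiting rather than to row processing.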
【Other attachments: screenshots/logs/monitoring】
Table B data

Table A data

有猫万事足 · June 19, 2025, 14:06

github.com/pingcap/tidb

drop index blocked and caused that adding index job always queueing

Opened 06:29AM - 03 Apr 24 UTC

Closed 04:00PM - 09 Apr 24 UTC

Lily2025

type/bug priority/release-blocker severity/critical affects-6.5 affects-7.1 component/ddl affects-7.5 affects-8.1

## Bug Report
Please answer these questions before submitting your issue. Thanks!
### 1. Minimal reproduce step (Required)
1、run sysbench
2、inject one of tikv network partition during adding index and then drop index from the tidb logs，may there was some fault on pd logs： [endless-ha-test-add-index-tps-7563381-1-794.tar.gz](https://github.com/pingcap/tidb/files/14847966/endless-ha-test-add-index-tps-7563381-1-794.tar.gz)
### 2. What did you expect to see? (Required)
add index and drop index can success
### 3. What did you see instead (Required)
drop index blocked and caused that add index job queueing
![img_v3_029j_e6e7b4c1-e210-4646-b2bb-50405b121deg](https://github.com/pingcap/tidb/assets/84712107/00f3f7d5-e221-430c-bc10-8ab234c884f1)
![img_v3_029j_1338fe4a-eed5-4ea2-ae41-5409394a287g](https://github.com/pingcap/tidb/assets/84712107/ee5332cf-7468-4def-b3a0-d30678825260)
### 4. What is your TiDB version? (Required)
./tidb-server -V
Release Version: v8.1.0-alpha
Edition: Community
Git Commit Hash: 3cfea6a32a3a1fbce7ff11ea43ef78ecdd976a4d
Git Branch: heads/refs/tags/v8.1.0-alpha
UTC Build Time: 2024-04-01 13:19:27
GoVersion: go1.21.6
Race Enabled: false
Check Table Before Drop: false
Store: unistore

It may be related to this bug.

The bug was fixed in 8.1.1.

https://docs.pingcap.com/zh/tidb/stable/release-8.1.1/

Fix the issue that DDL jobs queue up because DDL uses etcd incorrectly #52335 @wjhuang2016

github.com/pingcap/tidb

ddl: make sure put key into ETCD monotonously

master ← wjhuang2016:fix_etcd_mono

Opened 08:36AM - 07 Apr 24 UTC

wjhuang2016

+92 -1

### What problem does this PR solve?
Issue Number: close #47060 and close #52335
Problem Summary:
### What changed and how does it work?
### Check List
Tests
- [x] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test
  > - [ ] I checked and no code files have been changed.

Side effects
- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility

Documentation
- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility
### Release note
Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.
```release-note
None
```

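Judging from the fix, the cause is the metadata lock (MDL): its reads and writes on etcd were not guaranteed to be monotonic. In your logs, this message shows up many times: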

mdl gets lock, update self version to owner

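The test case added with the bug fix is also in the syncer file. If you have no other leads, I would still suggest upgrading and seeing whether a similar problem occurs again.

After all, concurrency problems are hard to reproduce.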

随缘天空 · June 20, 2025, 02:23

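Could it be a transaction conflict? If many concurrent transactions are writing to the target table when the column is added in production, the ADD COLUMN operation may have to wait to avoid conflicts, which lengthens the execution time. A test environment is generally used for verification, and nobody is working on it when the column is added. So it would be best to add the column in production when there is no business activity and compare the timing.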

TiDBer_nxTFcVWq · June 20, 2025, 07:07

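Before starting the DDL I had already stopped the main write services, and production kept only a small number of update operations, so the MDL wait should not have been caused by updates. My feeling is that it conflicted with some long-running queries. But for adding a column with a default value to a large table, is the MDL acquired several times, stage by stage? That is how it looks from the logs.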

WalterWj · June 25, 2025, 03:34

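About the logs:

While checking the progress of the DDL job, the TiDB DDL worker found that some node in the cluster had not yet synced its schema version (ver=226) to the latest. In that case the DDL job waits until every node has finished syncing the schema.

[syncer.go:352] ["syncer check all versions, someone is not synced"] ... [ver=226]

The DDL worker is waiting for the metadata lock (MDL) so that it can update the schema. The wait time is recorded (more than 8 minutes here), which indicates lock waiting or even blocking.

[ddl_worker.go:1454] ["wait latest schema version changed(get the metadata lock if tidb_enable_metadata_lock is true)"] ... ["take time"=8m46.656811475s] ...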
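To see how much time goes into these waits across a whole log file, the "take time" values can be pulled out and summed per job. This is a rough sketch, not a TiDB tool: it assumes the field layout shown in the lines quoted in this thread (["take time"=...] in Go duration format and job="ID:...") and the function names are invented for the example.

```python
import re

# One component of a Go-style duration, e.g. "8m", "46.656811475s", "1.498309ms".
_PART = re.compile(r"(\d+(?:\.\d+)?)(h|ms|us|µs|ns|m|s)")
_WHOLE = re.compile(r"(?:\d+(?:\.\d+)?(?:h|ms|us|µs|ns|m|s))+")
_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0,
            "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}

TAKE_TIME = re.compile(r'\["take time"=([^\]]+)\]')
JOB_ID = re.compile(r'\[job="ID:(\d+),')
WAIT_MSG = "wait latest schema version changed"


def parse_go_duration(text):
    """Convert a Go-style duration string such as '8m46.656811475s' to seconds."""
    if _WHOLE.fullmatch(text) is None:
        raise ValueError(f"not a Go-style duration: {text!r}")
    return sum(float(number) * _SECONDS[unit] for number, unit in _PART.findall(text))


def schema_waits(lines, min_seconds=0.0):
    """Return (job_id, seconds) for every schema-version wait of at least min_seconds."""
    waits = []
    for line in lines:
        if WAIT_MSG not in line:
            continue
        took = TAKE_TIME.search(line)
        job = JOB_ID.search(line)
        if took is None or job is None:
            continue
        seconds = parse_go_duration(took.group(1))
        if seconds >= min_seconds:
            waits.append((int(job.group(1)), seconds))
    return waits


def total_wait_by_job(lines):
    """Sum the waited seconds per DDL job id."""
    totals = {}
    for job_id, seconds in schema_waits(lines):
        totals[job_id] = totals.get(job_id, 0.0) + seconds
    return totals
```

Feeding it the lines of tidb.log (for example via open(path, encoding="utf-8")) and comparing the per-job total with the overall DDL duration gives a rough idea of how much of the 40 minutes was spent waiting for schema versions to sync.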

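The metadata lock (MDL) was acquired successfully and this node's schema version was updated to the latest (ver=227).
Only after this step succeeds can the DDL move forward.

[domain.go:886] ["mdl gets lock, update self version to owner"] [jobID=235] [version=227]

The check still finds a node that has not synced the latest schema, so the wait continues.
This shows that a DDL schema change has to be known to every node; otherwise the whole DDL process is held up.

[syncer.go:390] ["syncer check all versions, someone is not synced, continue checking"] ...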

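If nothing else works, try turning off the MDL lock. Also, does production have comparatively more tidb-server instances? Is there one node that is noticeably slow?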
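To follow up on the question of whether one tidb-server is lagging, the "someone is not synced" lines can be grouped by the instance they name. The sketch below assumes the log format of the syncer.go:352 lines quoted in this thread (a leading bracketed timestamp and an info="instance ip ..., port ..., id ..." field); unsynced_by_instance is a hypothetical helper name. Lines without an instance field, such as the syncer.go:390 ones, are skipped.

```python
import re
from collections import defaultdict
from datetime import datetime

LOG_TS = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")
INSTANCE = re.compile(r"instance ip ([0-9.]+), port (\d+), id ([0-9a-f-]+)")
UNSYNCED_MSG = "someone is not synced"


def unsynced_by_instance(lines):
    """Map 'ip:port' to (count, first_seen, last_seen) over the 'not synced' lines."""
    seen = defaultdict(list)
    for line in lines:
        if UNSYNCED_MSG not in line:
            continue
        ts = LOG_TS.match(line)
        inst = INSTANCE.search(line)
        if ts is None or inst is None:
            continue  # no timestamp or no instance info on this line
        when = datetime.strptime(ts.group(1), "%Y/%m/%d %H:%M:%S.%f")
        seen[f"{inst.group(1)}:{inst.group(2)}"].append(when)
    return {addr: (len(times), min(times), max(times)) for addr, times in seen.items()}


if __name__ == "__main__":
    import sys

    # Usage: python unsynced.py tidb.log
    with open(sys.argv[1], encoding="utf-8") as log:
        for addr, (count, first, last) in unsynced_by_instance(log).items():
            print(addr, count, first, last, (last - first).total_seconds())
```

If a single address accounts for nearly all of the entries over a long span, that instance is a reasonable place to look for long-running transactions or queries holding up the schema sync.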

lllzd · June 27, 2025, 03:40

Your log shows "syncer check all versions, someone is not synced". This means that during DDL synchronization some nodes did not sync the schema version in time, which can lengthen the wait.