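【TiDB Environment】Production / Test
【TiDB Version】v7.1.1
【Problem Encountered: Symptoms and Impact】
Problem 1: CDC lag
I created a changefeed with Kafka as the downstream. The 2 replicated tables only receive a few hundred rows of incremental changes, and only in the early morning. In the early morning the changefeed started to lag: the checkpoint TSO stopped advancing and checkpoint lag kept increasing, while resolved ts lag stayed normal with no delay. After a manual pause and resume, it recovered instantly.
CDC log: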
[2026/05/07 05:08:45.845 +08:00] [WARN] [kafka_manager.go:110] ["Get metadata of topics failed"] [namespace=default] [changefeed=social-hermes-71] [error="[CDC:ErrReachMaxTry]reach maximum try: 3, error: write tcp xxxx:33764->xxxx:9092: write: broken pipe: write tcp xxxx:33764->xxxx:9092: write: broken pipe"] [errorVerbose="[CDC:ErrReachMaxTry]reach maximum try: 3, error: write tcp xxxx:33764->xxxx:9092: write: broken pipe: write tcp xxxx:33764->xxxx:9092: write: broken pipe\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:69\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).queryClusterWithRetry\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:92\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).GetTopicsMeta\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:251\ngithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager.(*kafkaTopicManager).getMetadataOfTopics\n\tgithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager/kafka_manager.go:165\ngithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager.(*kafkaTopicManager).backgroundRefreshMeta\n\tgithub.com/pingcap/tiflow/cdc/sink/dmlsink/mq/manager/kafka_manager.go:106\nruntime.goexit\n\truntime/asm_amd64.s:1598"
[2026/05/07 05:10:31.717 +08:00] [WARN] [admin.go:100] ["query kafka cluster meta failed, retry it"] [namespace=default] [changefeed=social-arvo-71] [error="write tcp xxxx:56178->xxxx:9092: write: broken pipe"]
[2026/05/07 05:10:31.718 +08:00] [INFO] [table_sink_impl.go:171] ["Stopping table sink"] [namespace=default] [changefeed=xxx] [span={table_id:8449,start_key:7480000000000021ff015f720000000000fa,end_key:7480000000000021ff015f730000000000fa}] [checkpointTs=466118726239649828]
[2026/05/07 05:10:42.689 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=224.798627ms]
[2026/05/07 05:10:44.869 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=405.069348ms]
[2026/05/07 05:11:02.018 +08:00] [WARN] [progress_tracker.go:310] ["Close table doesn't return in time, may be stuck"] [span={table_id:8449,start_key:7480000000000021ff015f720000000000fa,end_key:7480000000000021ff015f730000000000fa}] [trackingCount=100] [lastMinResolvedTs="{"Mode":0,"Ts":466118726239649828,"BatchID":18446744073709551615}"]
[2026/05/07 05:11:33.018 +08:00] [WARN] [progress_tracker.go:310] ["Close table doesn't return in time, may be stuck"] [span={table_id:8449,start_key:7480000000000021ff015f720000000000fa,end_key:7480000000000021ff015f730000000000fa}] [trackingCount=100] [lastMinResolvedTs="{"Mode":0,"Ts":466118726239649828,"BatchID":18446744073709551615}"]
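To check whether the checkpoint in these log lines is really frozen, the TSO can be decoded into wall-clock time and compared against the log timestamps. The following is only a minimal sketch: it relies on the standard TSO layout (physical Unix milliseconds in the high bits, an 18-bit logical counter in the low bits), and the helper names `decode_tso`, `tso_to_datetime`, and `lag_seconds` are made up for illustration and are not part of TiCDC.

```python
from datetime import datetime, timedelta, timezone

# A TSO packs a physical timestamp (Unix milliseconds) in the high bits
# and an 18-bit logical counter in the low bits.
LOGICAL_BITS = 18
LOGICAL_MASK = (1 << LOGICAL_BITS) - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_tso(tso: int) -> tuple[int, int]:
    """Split a TSO into (physical milliseconds, logical counter)."""
    return tso >> LOGICAL_BITS, tso & LOGICAL_MASK


def tso_to_datetime(tso: int, tz: timezone = timezone.utc) -> datetime:
    """Convert the physical part of a TSO to an aware datetime in tz."""
    physical_ms, _ = decode_tso(tso)
    return (EPOCH + timedelta(milliseconds=physical_ms)).astimezone(tz)


def lag_seconds(checkpoint_tso: int, observed_at: datetime) -> float:
    """Seconds between the checkpoint TSO and an aware observation time."""
    return (observed_at - tso_to_datetime(checkpoint_tso)).total_seconds()


if __name__ == "__main__":
    cst = timezone(timedelta(hours=8))  # the logs use +08:00
    checkpoint_ts = 466118726239649828  # checkpointTs / lastMinResolvedTs.Ts from the logs
    print("checkpoint:", tso_to_datetime(checkpoint_ts, cst).isoformat())

    # Timestamps of the three log lines that carry this TSO.
    for stamp in (
        "2026/05/07 05:10:31.718",
        "2026/05/07 05:11:02.018",
        "2026/05/07 05:11:33.018",
    ):
        observed = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S.%f").replace(tzinfo=cst)
        print(stamp, "lag:", round(lag_seconds(checkpoint_ts, observed), 3), "s")
```

With the values from the logs above, the TSO decodes to 05:10:30.443 +08:00, so the gap grows from about 1.3 s at 05:10:31 to about 62.6 s at 05:11:33. That matches a checkpoint that stopped advancing while the table sink was being closed.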
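Monitoring:
While the lag was happening, the metrics from puller through table sink output all looked normal, but sink flush rows showed no change.
It looks like writes to Kafka got stuck, which caused the lag. However, a changefeed created on another v7.1.1 cluster works normally and has not shown any lag.
Problem 2: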
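After a changefeed is removed, its metrics do not disappear from the Grafana monitoring. The status of the removed changefeed changes to stop, which triggers an alert. I tried creating a changefeed with the same name and removing it again, but that did not help. The other v7.1.1 cluster does not have this problem either.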