Early this morning, a CDC monitoring alert fired in our WeCom group.
Checking the cdc logs, I found the following WARN:
[WARN] [client.go:279] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=4h4m27.999698335s] [role=processor]
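To gauge how long the processor had been stuck, one quick way is to pull the matching lines from the log (a minimal sketch; the log path assumes a default tiup deployment and will differ per topology):

```shell
# Log path assumes tiup's default deploy dir; adjust to your topology.
grep "etcdWorker may be stuck" /tidb-deploy/cdc-8300/log/cdc.log | tail -n 5
```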
After searching the forum, I found this is a known bug:
pingcap/tiflow issue, opened 09:26 AM, 22 Mar 22 UTC; closed 06:13 AM, 24 Mar 22 UTC. Labels: type/bug, severity/major, found/automation, area/ticdc, affects-6.0.
### What did you do?
1. create a changefeed with kafka sink
2. stop the changefeed
3. prepare `go-tpc` workload
4. resume the changefeed
5. run `go-tpc` workload
Periodically use `kill -s STOP` to pause the Kafka process for around 40~50s, every few minutes (a sketch of these steps follows).
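For context, a minimal shell sketch of these reproduction steps; the server address, sink URI, changefeed ID, and the `pgrep` pattern are all placeholder assumptions, and on older TiCDC versions the `--server` flag is spelled `--pd` and points at PD instead:

```shell
# Create, pause, and resume the changefeed (IDs and URIs are placeholders).
cdc cli changefeed create \
  --server=http://127.0.0.1:8300 \
  --sink-uri="kafka://127.0.0.1:9092/cdc-test?protocol=open-protocol" \
  --changefeed-id=kafka-test
cdc cli changefeed pause --server=http://127.0.0.1:8300 -c kafka-test
# ... prepare the go-tpc workload here, then:
cdc cli changefeed resume --server=http://127.0.0.1:8300 -c kafka-test

# While go-tpc runs, periodically freeze the Kafka broker for ~45s,
# once every few minutes, to stall the sink.
KAFKA_PID=$(pgrep -f kafka.Kafka | head -n 1)
while true; do
  kill -s STOP "$KAFKA_PID"   # broker frozen; sink writes stall
  sleep 45
  kill -s CONT "$KAFKA_PID"   # broker resumes
  sleep 180
done
```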
### What did you expect to see?
The whole CDC works properly.
### What did you see instead?
Processor gets blocked for a long time.
When the problem happens, the Kafka process itself is healthy, having been resumed by `kill -s CONT`:
```
[2022/03/22 14:00:30.348 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=23.000155792s] [role=processor]
[2022/03/22 14:00:31.348 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=23.9999115s] [role=processor]
[2022/03/22 14:00:32.348 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=24.9999985s] [role=processor]
[2022/03/22 14:00:33.349 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=26.000115375s] [role=processor]
[2022/03/22 14:00:34.349 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=27.000068583s] [role=processor]
[2022/03/22 14:00:35.349 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=28.000139292s] [role=processor]
[2022/03/22 14:00:36.348 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=28.999832417s] [role=processor]
[2022/03/22 14:00:37.349 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=29.99986875s] [role=processor]
[2022/03/22 14:00:38.349 +08:00] [WARN] [client.go:263] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=30.999821875s] [role=processor]
```
### Versions of the cluster
Upstream TiDB cluster version (execute `SELECT tidb_version();` in a MySQL client):
```console
master
```
Upstream TiKV version (execute `tikv-server --version`):
```console
master
```
TiCDC version (execute `cdc version`):
```console
master
```
[etcd_worker.log](https://github.com/pingcap/tiflow/files/8322983/etcd_worker.log)
[goroutine2.txt](https://github.com/pingcap/tiflow/files/8322990/goroutine2.txt)
Action taken: restarting the faulty cdc node restored the changefeed.
We have three cdc nodes and only one of them failed; of our three changefeeds, one got stuck. Restarting the faulty node resolved it (example below).
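As a concrete example of that recovery step (assuming a tiup-managed cluster; the cluster name `prod-cluster` and node ID `10.0.0.3:8300` are placeholders):

```shell
# Find the stuck cdc node, then restart just that one instance.
tiup cluster display prod-cluster | grep cdc
tiup cluster restart prod-cluster -N 10.0.0.3:8300
```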
Look at this thread: it still occurs on 6.5.2, and 6.5 is a version newly released this year.
TiCDC cannot replicate and shows no error message; the log warns as follows:
[2023/08/03 17:29:00.180 +08:00] [WARN] [client.go:259] ["etcd client outCh blocking too long, the etcdWorker may be stuck"] [duration=10.599113445s] [role=processor]
[2…
This bug is not triggered unless the downstream Kafka fails; when it is triggered, restart the node.