CDC data replication is not working properly

xmlianfeng · July 7, 2021 09:16

Data is not being replicated properly. The log keeps printing, but the TSO never changes.
TiDB and CDC version:
5.0.3
Changefeed information:

Related log:
cdc.txt (7.4 MB)

HHHHHHULK · July 7, 2021 09:54

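How long has the task been running? If it is still in the normal state, there is probably no serious problem; the state would change if something were wrong.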

xmlianfeng · July 8, 2021 09:00

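The state is still normal,
but an exception occurred during replication.
How should I handle this? It has been stuck for two days.
The logs are attached below: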
2.tar (2).gz (3.1 MB)

hey-hoho · July 8, 2021 09:36

The log contains many "loadStore from PD failed" and "context deadline exceeded" messages. Could something be wrong with the cluster? Run display on the cluster and check the PD log as well.
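
To get a rough idea of how often these two messages occur, the log can be scanned for them. The following is only a minimal sketch: the two substrings come from the messages quoted above, while the sample lines and the function name are made up for illustration.

```python
# Substrings taken from the messages mentioned in this thread.
PATTERNS = ("loadStore from PD failed", "context deadline exceeded")


def count_patterns(lines, patterns=PATTERNS):
    """Return how many lines contain each pattern (substring match)."""
    counts = {p: 0 for p in patterns}
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


# Hypothetical, shortened sample lines; real TiCDC log lines are longer.
sample_lines = [
    "[WARN] loadStore from PD failed",
    "[WARN] rpc error: code = DeadlineExceeded desc = context deadline exceeded",
    "[INFO] something unrelated",
    "[WARN] loadStore from PD failed: context deadline exceeded",
]

pattern_counts = count_patterns(sample_lines)
print(pattern_counts)
```

For a real file, any iterable of lines works, for example passing an open file object for the downloaded cdc log to count_patterns.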

xmlianfeng · July 8, 2021 09:51

display shows every component in a normal state.

The CDC replication status has changed. It now reports "[CDC:ErrPDBatchLoadRegions][tikv:9001]PD server timeout",
but I do not see any port 9001 in use...
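
For readers who hit the same message: the 9001 in [tikv:9001] is an error code rather than a port (9001 is the TiDB error code for "PD server timeout"). A minimal sketch for splitting such a message into its tags, assuming the bracketed component:code layout seen in the message above; the function name is hypothetical:

```python
import re

# Matches tags such as [CDC:ErrPDBatchLoadRegions] or [tikv:9001].
ERROR_TAG = re.compile(r"\[([A-Za-z]+):([A-Za-z0-9]+)\]")


def parse_error_tags(message):
    """Return (component, code) pairs for every bracketed tag in the message."""
    return [(m.group(1), m.group(2)) for m in ERROR_TAG.finditer(message)]


msg = "[CDC:ErrPDBatchLoadRegions][tikv:9001]PD server timeout"
tags = parse_error_tags(msg)
print(tags)
```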
Part of the CDC log:
[2021/07/08 17:47:18.462 +08:00] [WARN] [base_client.go:284] ["[pd] cannot update member"] [address=http://172.19.16.135:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:172.19.16.135:2379 status:READY"]
[2021/07/08 17:47:20.463 +08:00] [WARN] [owner.go:823] ["failed to update service safe point"] [error="rpc error: code = DeadlineExceeded desc = context deadline exceeded"] [errorVerbose="rpc error: code = DeadlineExceeded desc = context deadline exceeded
github.com/tikv/pd/client.(*client).UpdateServiceGCSafePoint
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210527030735-a544782ee076/client/client.go:1253
github.com/pingcap/ticdc/cdc.(*Owner).flushChangeFeedInfos
\tgithub.com/pingcap/ticdc@/cdc/owner.go:820
github.com/pingcap/ticdc/cdc.(*Owner).run
\tgithub.com/pingcap/ticdc@/cdc/owner.go:1432
github.com/pingcap/ticdc/cdc.(*Owner).Run
\tgithub.com/pingcap/ticdc@/cdc/owner.go:1295
github.com/pingcap/ticdc/cdc.(*Server).campaignOwnerLoop
\tgithub.com/pingcap/ticdc@/cdc/server.go:228
github.com/pingcap/ticdc/cdc.(*Server).run.func2
\tgithub.com/pingcap/ticdc@/cdc/server.go:316
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
\truntime/asm_amd64.s:1357"] [since-last-update=42m24.42831357s]
[2021/07/08 17:47:21.463 +08:00] [WARN] [base_client.go:284] ["[pd] cannot update member"] [address=http://172.19.16.135:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = DeadlineExceeded desc = context deadline exceeded target:172.19.16.135:2379 status:READY"]
[2021/07/08 17:47:23.501 +08:00] [ERROR] [client.go:346] ["tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2021/07/08 17:47:23.501 +08:00] [ERROR] [client.go:599] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled"]
[2021/07/08 17:47:23.501 +08:00] [WARN] [owner.go:84] ["Fail to update minGCSafePointCache."] [error="rpc error: code = Canceled desc = context canceled"] [errorVerbose="rpc error: code = Canceled desc = context canceled
github.com/tikv/pd/client.(*client).processTSORequests
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210527030735-a544782ee076/client/client.go:717
github.com/tikv/pd/client.(*client).handleDispatcher
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210527030735-a544782ee076/client/client.go:587
runtime.goexit
\truntime/asm_amd64.s:1357
github.com/tikv/pd/client.(*tsoRequest).Wait
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210527030735-a544782ee076/client/client.go:913
github.com/tikv/pd/client.(*client).GetTS
\tgithub.com/tikv/pd@v1.1.0-beta.0.20210527030735-a544782ee076/client/client.go:933
github.com/pingcap/ticdc/cdc.(*Owner).getMinGCSafePointCache
\tgithub.com/pingcap/ticdc@/cdc/owner.go:82
github.com/pingcap/ticdc/cdc.(*Owner).flushChangeFeedInfos
\tgithub.com/pingcap/ticdc@/cdc/owner.go:758
github.com/pingcap/ticdc/cdc.(*Owner).run
\tgithub.com/pingcap/ticdc@/cdc/owner.go:1432
github.com/pingcap/ticdc/cdc.(*Owner).Run
\tgithub.com/pingcap/ticdc@/cdc/owner.go:1295
github.com/pingcap/ticdc/cdc.(*Server).campaignOwnerLoop
\tgithub.com/pingcap/ticdc@/cdc/server.go:228
github.com/pingcap/ticdc/cdc.(*Server).run.func2
\tgithub.com/pingcap/ticdc@/cdc/server.go:316
golang.org/x/sync/errgroup.(*Group).Go.func1
\tgolang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
\truntime/asm_amd64.s:1357"]

PD log:
1.txt (4.7 MB)

hey-hoho · July 8, 2021 10:08

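Judging from the log, the problem is still on the PD side. Check whether communication between PD and CDC is working, and whether Dashboard and pd ctl can be used normally.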

xmlianfeng · July 8, 2021 10:30

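Dashboard and pd ctl both work normally, and querying service-gc-safepoint returns data as expected.
I did notice one thing:
while it is running, the CDC machine holds around 12,000 network connections to TiKV. Could that be a problem?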
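
A connection count like this can be broken down per TiKV node, which may help show whether the connections are spread evenly or piling up on one store. The sketch below is only an illustration, not an answer to whether 12,000 is too many: it assumes text in the column layout printed by "ss -tn" and TiKV listening on its default port 20160, and the sample addresses are made up.

```python
from collections import Counter

TIKV_PORT = 20160  # TiKV's default service port; adjust if your cluster differs


def count_peers_on_port(ss_output, port=TIKV_PORT):
    """Count ESTAB connections per peer IP whose peer port equals `port`.

    Expects the column layout of `ss -tn`:
    State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
    """
    per_peer = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip the header row and anything that is not an established connection.
        if len(fields) < 5 or fields[0] != "ESTAB":
            continue
        peer_ip, _, peer_port = fields[4].rpartition(":")
        if peer_port == str(port):
            per_peer[peer_ip] += 1
    return per_peer


# Hypothetical sample; the addresses are invented for illustration.
SAMPLE = """\
State  Recv-Q Send-Q  Local Address:Port   Peer Address:Port
ESTAB  0      0       10.0.0.5:51234       10.0.0.11:20160
ESTAB  0      0       10.0.0.5:51236       10.0.0.11:20160
ESTAB  0      0       10.0.0.5:51240       10.0.0.12:20160
ESTAB  0      0       10.0.0.5:40022       10.0.0.21:2379
"""

counts = count_peers_on_port(SAMPLE)
print(sum(counts.values()), dict(counts))
```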

xmlianfeng · July 9, 2021 04:37

Could someone help take a look at this?

Billmay表妹 · July 12, 2021 04:57

@Ricklee please help take a look~

这道题我不会 · July 12, 2021 05:00

1. Please share the details of this changefeed, using the following command:

cdc cli changefeed query -s --pd=http://{pd-ip}:2379 --changefeed-id={changefeed-id}

2. Please also provide the TiCDC monitoring dashboard data. Thanks.
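
When comparing the TSO reported by the changefeed query across runs, it can help to decode it into wall-clock time. A TiDB TSO packs a physical timestamp in milliseconds into its high bits and an 18-bit logical counter into its low bits. The sketch below relies only on that layout; the function names and the example TSO are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # the low 18 bits of a TSO hold the logical counter


def tso_to_datetime(tso):
    """Decode the physical part of a TSO into a UTC datetime."""
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)


def checkpoint_lag_seconds(checkpoint_tso, now):
    """Seconds between `now` (timezone-aware) and the checkpoint's physical time."""
    return (now - tso_to_datetime(checkpoint_tso)).total_seconds()


# Hypothetical TSO built from a known instant: 2021-07-08 09:47:18 UTC,
# with an arbitrary logical counter of 5.
example_ms = 1625737638000
example_tso = (example_ms << LOGICAL_BITS) | 5

decoded = tso_to_datetime(example_tso)
lag = checkpoint_lag_seconds(example_tso, decoded + timedelta(seconds=60))
print(decoded.isoformat(), lag)
```

If the decoded checkpoint time stays the same between two queries taken some minutes apart, the changefeed is not advancing; if it moves forward slowly, it is progressing but lagging.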

xmlianfeng · July 12, 2021 10:09

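It is sorted out now. My initial assessment is that it was caused by having too many tables: there are three databases with more than 2,000 tables.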

这道题我不会 · July 12, 2021 10:11

Would you mind sharing exactly what you adjusted? That way others who run into a similar problem can refer to it.

db_user · January 27, 2022 11:11

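I think I roughly understand the cause now, though I am not sure whether your case is the same as mine. I ran into the same situation with CDC: the TSO never changed, and I also assumed nothing was being replicated. I then ran tcpdump on the CDC side and found that the port was in fact receiving data, a huge number of deletes. After talking to the upstream team, I learned that they had run a bulk delete. CDC replicates that as multiple transactions, so it stayed stuck at that point the whole time; once the processing finished, everything was fine again.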

system · October 31, 2022 19:20

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.