Large number of errors in TiKV logs: cdc initialize fail

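[TiDB environment] Production
[TiDB version] 7.5.5
[Operating system]
[Deployment method]
This cluster was recently upgraded from 6.5.9 to 7.5.5. Since the upgrade, the TiKV node logs have contained a large number of error logs like the one below, and TiKV nodes restart from time to time for no apparent reason.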

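The TiKV node logs contain many errors like the following. Judging by the content, they appear to be related to CDC. This cluster does have CDC tasks, but those tasks are running normally.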


[2025/10/30 16:46:50.156 +08:00] [FATAL] [lib.rs:512] [“region 253923838 commit_ts: TimeStamp(461848971686707330), resolved_ts: TimeStamp(461848971686707333)”] [backtrace=" 0: tikv_util::set_panic_hook::{{closure}}\n at /workspace/source/tikv/components/tikv_util/src/lib.rs:511:18\n 1: <alloc::boxed::Box<F,A> as core::ops::function::Fn>::call\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2032:9\n std::panicking::rust_panic_with_hook\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:692:13\n 2: std::panicking::begin_panic_handler::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:579:13\n 3: std::sys_common::backtrace::__rust_end_short_backtrace\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:137:18\n 4: rust_begin_unwind\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:575:5\n 5: core::panicking::panic_fmt\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panicking.rs:65:14\n 6: cdc::delegate::Delegate::sink_txn_put\n at /workspace/source/tikv/components/cdc/src/delegate.rs:929:21\n cdc::delegate::Delegate::sink_put\n at /workspace/source/tikv/components/cdc/src/delegate.rs:889:13\n cdc::delegate::Delegate::sink_data\n at /workspace/source/tikv/components/cdc/src/delegate.rs:694:21\n 7: cdc::delegate::Delegate::on_batch\n at /workspace/source/tikv/components/cdc/src/delegate.rs:561:17\n 8: cdc::endpoint::Endpoint<T,E,S>::on_multi_batch\n at /workspace/source/tikv/components/cdc/src/endpoint.rs:889:33\n <cdc::endpoint::Endpoint<T,E,S> as tikv_util::worker::pool::Runnable>::run\n at 
/workspace/source/tikv/components/cdc/src/endpoint.rs:1283:18\n 9: tikv_util::worker::pool::Worker::start_with_timer_impl::{{closure}}\n at /workspace/source/tikv/components/tikv_util/src/worker/pool.rs:506:25\n <core::future::from_generator::GenFuture as core::future::future::Future>::poll\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/future/mod.rs:91:19\n <tracker::tls::TrackedFuture as core::future::future::Future>::poll::{{closure}}\n at /workspace/source/tikv/components/tracker/src/tls.rs:64:23\n std::thread::local::LocalKey::try_with\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:446:16\n std::thread::local::LocalKey::with\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/local.rs:422:9\n <tracker::tls::TrackedFuture as core::future::future::Future>::poll\n at /workspace/source/tikv/components/tracker/src/tls.rs:62:9\n <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll\n at /workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/futures-util-0.3.31/src/future/future/map.rs:55:37\n <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll\n at /workspace/.cargo/registry/src/mirrors.tuna.tsinghua.edu.cn-df7c3c540f42cdbd/futures-util-0.3.31/src/lib.rs:86:13\n yatp::task::future::RawTask::poll\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/task/future.rs:59:9\n 10: yatp::task::future::TaskCell::poll\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/task/future.rs:103:9\n <yatp::task::future::Runner as yatp::pool::runner::Runner>::handle\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/task/future.rs:387:20\n 11: <tikv_util::yatp_pool::YatpPoolRunner as yatp::pool::runner::Runner>::handle\n at 
/workspace/source/tikv/components/tikv_util/src/yatp_pool/mod.rs:199:24\n yatp::pool::worker::WorkerThread<T,R>::run\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/pool/worker.rs:48:13\n yatp::pool::builder::LazyBuilder::build::{{closure}}\n at /workspace/.cargo/git/checkouts/yatp-e704b73c3ee279b6/5572a78/src/pool/builder.rs:114:25\n std::sys_common::backtrace::rust_begin_short_backtrace\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:121:18\n 12: std::thread::Builder::spawn_unchecked::{{closure}}::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:551:17\n <core::panic::unwind_safe::AssertUnwindSafe as core::ops::function::FnOnce<()>>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/panic/unwind_safe.rs:271:9\n std::panicking::try::do_call\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:483:40\n std::panicking::try\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panicking.rs:447:19\n std::panic::catch_unwind\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/panic.rs:137:14\n std::thread::Builder::spawn_unchecked::{{closure}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:550:30\n core::ops::function::FnOnce::call_once{{vtable.shim}}\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:513:5\n 13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at 
/root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n <alloc::boxed::Box<F,A> as core::ops::function::FnOnce>::call_once\n at /root/.rustup/toolchains/nightly-2022-11-15-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/boxed.rs:2000:9\n std::sys::unix::thread::thread::new::thread_start\n
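
For anyone who wants to read the two timestamps in the panic message, here is a minimal Python sketch. It assumes the usual TiDB TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits). The helper names `split_tso` and `describe_panic` are made up for illustration and are not part of any TiKV tool.

```python
import re
from datetime import datetime, timezone

# Assumption: standard TSO layout, low 18 bits hold the logical counter.
LOGICAL_BITS = 18

# Matches the panic message format seen in the log above.
PANIC_RE = re.compile(
    r"region (\d+) commit_ts: TimeStamp\((\d+)\), resolved_ts: TimeStamp\((\d+)\)"
)


def split_tso(ts):
    """Split a TSO into (physical milliseconds, logical counter)."""
    return ts >> LOGICAL_BITS, ts & ((1 << LOGICAL_BITS) - 1)


def describe_panic(line):
    """Decode one panic message; return None if the line does not match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    region_id, commit_ts, resolved_ts = map(int, m.groups())
    commit_phys, commit_logical = split_tso(commit_ts)
    resolved_phys, resolved_logical = split_tso(resolved_ts)
    return {
        "region_id": region_id,
        "commit_time_utc": datetime.fromtimestamp(commit_phys / 1000, tz=timezone.utc),
        "resolved_time_utc": datetime.fromtimestamp(resolved_phys / 1000, tz=timezone.utc),
        "physical_gap_ms": resolved_phys - commit_phys,
        "logical_gap": resolved_logical - commit_logical,
        # True when the commit is not strictly newer than the resolved timestamp.
        "violates_invariant": commit_ts <= resolved_ts,
    }


sample = (
    "region 253923838 commit_ts: TimeStamp(461848971686707330), "
    "resolved_ts: TimeStamp(461848971686707333)"
)
info = describe_panic(sample)
print(info)
```

For the message in this thread, both timestamps share the same physical millisecond and resolved_ts is ahead of commit_ts by 3 in the logical counter.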


It looks like you hit this bug: #17656 introduces a panic into TiKV cdc module · Issue #18142 · tikv/tikv · GitHub

Haven't you upgraded to v8? CDC was reworked in v8.

Can anyone explain how Raft works?

What is the difference between Raft and Paxos?

Which is better, Raft or Paxos?

Try redeploying CDC.

By redeploying, do you mean scaling in and then adding the nodes back? If this is a bug, I don't think redeploying would help.

How about upgrading to 7.5.7 (released 2025-09-04)?

It is only two patch versions; I think you can upgrade without a second thought.

@seiang Take a look at this suggestion. I think you could upgrade to the latest patch version, which fixes TiCDC bugs, and see how it goes.

I recommend upgrading to 7.5.6 or later to resolve this issue.

We upgraded to 7.5.7 this morning, but the CDC-related errors do not seem to be resolved.

We have already upgraded to 7.5.7, but the CDC-related errors do not seem to be resolved; the TiKV logs still contain a large number of CDC-related errors.

It should be similar to this code:

An assertion in the CDC component at delegate.rs:878 (corresponding to lines 630-636 of the provided code) fails.

Solution

PR #17274 - “cdc: re-design delegate and downstream to support partial region subscription better”

This PR refactored CDC's Delegate and Downstream components between v8.0 and v8.3:

Main improvements:

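  1. Removed this overly strict assertion
  2. Redesigned to better support partial region subscription:
  • Incremental scans can now run within the subscribed range instead of over the whole region
  • If a lock is inside the subscribed region but outside the subscribed range, resolved-ts is no longer blocked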
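
To illustrate the second bullet, here is a toy Python model of how a lock outside the subscribed range stops holding back resolved-ts. It assumes the common simplification that resolved-ts is the minimum of the current timestamp and the earliest start_ts among the locks that are tracked. `Lock` and `resolved_ts` are hypothetical names, and this is not TiKV's actual resolver code.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Lock(NamedTuple):
    key: bytes
    start_ts: int


def resolved_ts(
    locks: Sequence[Lock],
    current_ts: int,
    key_range: Optional[Tuple[bytes, bytes]] = None,
) -> int:
    """Return min(current_ts, earliest start_ts among the locks that count).

    key_range is a half-open [start, end) pair; None means "track every lock
    in the region", which models the behavior before the redesign.
    """
    if key_range is not None:
        start, end = key_range
        # Only locks inside the subscribed range can hold resolved-ts back.
        locks = [lock for lock in locks if start <= lock.key < end]
    return min([lock.start_ts for lock in locks] + [current_ts])


# One lock inside the subscribed range [a, m), one old lock outside it.
locks = [Lock(b"a", 100), Lock(b"x", 40)]
whole_region = resolved_ts(locks, current_ts=200)
subscribed_only = resolved_ts(locks, current_ts=200, key_range=(b"a", b"m"))
print(whole_region, subscribed_only)
```

In this model, the old lock on key `x` holds the whole-region value down, while the range-scoped value is limited only by the lock on key `a`.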

Recommended actions:

  • Upgrading TiKV to v8.5 resolves this issue :upside_down_face:
  • The fix was merged into the master branch on July 30, 2024

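Technical details

Why resolved_ts ≥ commit_ts can occur:

  • Under fault injection, the resolved_ts maintained by CDC's EventFeed (cdc::Delegate) can end up greater than a commit_ts observed later
  • This can happen in certain abnormal situations, so the original assertion was too strict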
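
The invariant behind the panic can be sketched with a small Python model. This is only an illustration of the check described above (a commit must be strictly newer than the resolved-ts already published for the region). The class and method names are hypothetical, and TiKV's real implementation is the Rust code in delegate.rs.

```python
class CommitBelowResolvedTs(Exception):
    """Raised when a commit arrives at or below the published resolved-ts."""


class RegionFeed:
    """Toy model of one region's change feed; not TiKV's Delegate."""

    def __init__(self, region_id):
        self.region_id = region_id
        self.resolved_ts = 0

    def advance_resolved_ts(self, ts):
        # resolved-ts only moves forward.
        self.resolved_ts = max(self.resolved_ts, ts)

    def sink_commit(self, commit_ts):
        # Downstream consumers were promised no more commits at or below
        # resolved_ts, so such a commit breaks the contract.
        if commit_ts <= self.resolved_ts:
            raise CommitBelowResolvedTs(
                f"region {self.region_id} commit_ts: {commit_ts}, "
                f"resolved_ts: {self.resolved_ts}"
            )
        return commit_ts


feed = RegionFeed(253923838)
feed.advance_resolved_ts(461848971686707333)
try:
    feed.sink_commit(461848971686707330)  # values from the log in this thread
    violated = False
except CommitBelowResolvedTs as err:
    violated = True
    print(err)
```

With the values from the log, the commit is older than the resolved-ts, so the model raises. The thread's backtrace shows the same condition surfacing as a panic in `cdc::delegate::Delegate::sink_txn_put`.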

Related issue: #16526 other tikv panic when inject one of tikv failure with ticdc changfeed running · Issue #16526 · tikv/tikv · GitHub is a better place to track the details of this problem

Can this bug cause TiKV nodes to restart abnormally? We are not going to consider upgrading to v8.5 in the short term.

Did your TiKV crash?

If the impact is tolerable, maybe there is no rush?

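On 7.5.5, some TiKV nodes restarted abnormally four times last week. We upgraded to 7.5.7 just this morning and have not seen any abnormal TiKV restarts so far; the logs just keep showing the same CDC errors as on 7.5.5.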

If that is acceptable, there is no hurry to upgrade.

If that doesn't work, try upgrading to 8.5.3... The code that raises this error has already been removed in 8.5.

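OK, we will hold off on 8.5.3 for now and keep observing 7.5.7 for a while. If TiKV still crashes frequently, we will then decide whether to upgrade or to downgrade back to the original 6.5 version.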

Why downgrade... Upgrading beats downgrading. Next year it may well be version 9.0 :grimacing: