TiFlash restarts unexpectedly

TiDB version: 8.1.1, deployed on Huawei Cloud.
TiFlash restarted. Looking at tiflash.log, I found the following error:

2131036:[2026/03/11 11:01:32.451 +08:00] [FATAL] [Exception.cpp:106] ["Code: 0, e.displayText() = DB::TiFlashException: Memory limit exceeded caused by 'RSS(Resident Set Size) much larger than limit' : process memory size would be 137.91 GiB for (attempt to allocate chunk of 2097152 bytes), limit of memory for data computing : 136.63 GiB. Memory Usage of Storage: non-query: peak=30.95 GiB, amount=1.92 MiB; kvstore: peak=904.57 MiB, amount=10.27 KiB; query-storage-task: peak=13.20 GiB, amount=12.75 GiB; fetch-pages: peak=0.00 B, amount=0.00 B; shared-column-data: peak=13.20 GiB, amount=12.75 GiB., e.what() = DB::TiFlashException, Stack trace:\n\n\n       0x1b97e0c\tDB::TiFlashException::TiFlashException(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::TiFlashError const&) [tiflash+28933644]\n                \tdbms/src/Common/TiFlashException.h:263\n       0x1b97120\tMemoryTracker::alloc(long, bool) [tiflash+28930336]\n                \tdbms/src/Common/MemoryTracker.cpp:219\n       0x1b96cf8\tMemoryTracker::alloc(long, bool) [tiflash+28929272]\n                \tdbms/src/Common/MemoryTracker.cpp:230\n       0x1ba1c0c\tAllocator<false>::alloc(unsigned long, unsigned long) [tiflash+28974092]\n                \tdbms/src/Common/Allocator.cpp:68\n       0x1c084c0\tvoid DB::PODArrayBase<1ul, 4096ul, Allocator<false>, 15ul, 16ul>::alloc<>(unsigned long) [tiflash+29394112]\n                \tdbms/src/Common/PODArray.h:145\n       0x7291568\tDB::ColumnString::insertRangeFrom(DB::IColumn const&, unsigned long, unsigned long) [tiflash+120132968]\n                \tdbms/src/Columns/ColumnString.cpp:97\n       0x6a6a800\tDB::DM::ColumnFileInMemory::readDataForFlush() const [tiflash+111585280]\n                \tdbms/src/Storages/DeltaMerge/ColumnFile/ColumnFileInMemory.cpp:106\n       0x6a9ec44\tDB::DM::MemTableSet::buildFlushTask(DB::DM::DMContext&, unsigned long, unsigned long, unsigned long) [tiflash+111799364]\n                
\tdbms/src/Storages/DeltaMerge/Delta/MemTableSet.cpp:310\n       0x6a8cb90\tDB::DM::DeltaValueSpace::flush(DB::DM::DMContext&) [tiflash+111725456]\n                \tdbms/src/Storages/DeltaMerge/Delta/DeltaValueSpace.cpp:365\n       0x695d65c\tDB::DM::Segment::flushCache(DB::DM::DMContext&) [tiflash+110483036]\n                \tdbms/src/Storages/DeltaMerge/Segment.cpp:2279\n       0x690008c\tDB::DM::DeltaMergeStore::flushCache(std::__1::shared_ptr<DB::DM::DMContext> const&, DB::DM::RowKeyRange const&, bool) [tiflash+110100620]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp:774\n       0x69028e0\tDB::DM::DeltaMergeStore::flushCache(DB::Context const&, DB::DM::RowKeyRange const&, bool) [tiflash+110110944]\n                \tdbms/src/Storages/DeltaMerge/DeltaMergeStore.cpp:747\n       0x7bd939c\tDB::KVStore::tryFlushRegionCacheInStorage(DB::TMTContext&, DB::Region const&, std::__1::shared_ptr<DB::Logger> const&, bool) [tiflash+129864604]\n                \tdbms/src/Storages/KVStore/KVStore.cpp:227\n       0x7c340c4\tDB::KVStore::forceFlushRegionDataImpl(DB::Region&, bool, DB::TMTContext&, DB::RegionTaskLock const&, unsigned long, unsigned long) const [tiflash+130236612]\n                \tdbms/src/Storages/KVStore/MultiRaft/Persistence.cpp:255\n       0x7c3363c\tDB::KVStore::canFlushRegionDataImpl(std::__1::shared_ptr<DB::Region> const&, unsigned char, bool, DB::TMTContext&, DB::RegionTaskLock const&, unsigned long, unsigned long, unsigned long, unsigned long) [tiflash+130233916]\n                \tdbms/src/Storages/KVStore/MultiRaft/Persistence.cpp:230\n       0x7c33dd4\tDB::KVStore::tryFlushRegionData(unsigned long, bool, bool, DB::TMTContext&, unsigned long, unsigned long, unsigned long, unsigned long) [tiflash+130235860]\n                \tdbms/src/Storages/KVStore/MultiRaft/Persistence.cpp:123\n       0x7c0ea08\tTryFlushData [tiflash+130083336]\n                \tdbms/src/Storages/KVStore/FFI/ProxyFFI.cpp:161\n  
0xffff9a840f9c\t_$LT$engine_store_ffi..observer..TiFlashObserver$LT$T$C$ER$GT$$u20$as$u20$raftstore..coprocessor..AdminObserver$GT$::pre_exec_admin::h2f5bf67dbdf7c90f [libtiflash_proxy.so+26152860]\n                \tcontrib/tiflash-proxy/proxy_components/engine_store_ffi/src/observer.rs:120\n  0xffff9b665724\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::apply_raft_cmd::h9308910d47c3ade6 [libtiflash_proxy.so+40982308]\n                \tcontrib/tiflash-proxy/components/raftstore/src/store/fsm/apply.rs:1429\n  0xffff9b67aa94\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::process_raft_cmd::he5587c01a9599a25 [libtiflash_proxy.so+41069204]\n                \tcontrib/tiflash-proxy/components/raftstore/src/store/fsm/apply.rs:1377\n  0xffff9b67cb6c\traftstore::store::fsm::apply::ApplyDelegate$LT$EK$GT$::handle_raft_committed_entries::h849a05848402ae24 [libtiflash_proxy.so+41077612]\n                \tcontrib/tiflash-proxy/components/raftstore/src/store/fsm/apply.rs:1129\n  0xffff9b65bce4\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_apply::h915edb389d0ce878 [libtiflash_proxy.so+40942820]\n                \tcontrib/tiflash-proxy/components/raftstore/src/store/fsm/apply.rs:4020\n  0xffff9b65ec14\traftstore::store::fsm::apply::ApplyFsm$LT$EK$GT$::handle_tasks::hc0f710a21a8448f8 [libtiflash_proxy.so+40954900]\n                \tcontrib/tiflash-proxy/components/raftstore/src/store/fsm/apply.rs:4351\n  0xffff9a91e7b8\t_$LT$raftstore..store..fsm..apply..ApplyPoller$LT$EK$GT$$u20$as$u20$batch_system..batch..PollHandler$LT$raftstore..store..fsm..apply..ApplyFsm$LT$EK$GT$$C$raftstore..store..fsm..apply..ControlFsm$GT$$GT$::handle_normal::h474edac058d2c646 [libtiflash_proxy.so+27060152]\n                \tcontrib/tiflash-proxy/components/raftstore/src/store/fsm/apply.rs:4633\n  0xffff9a89b618\tbatch_system::batch::Poller$LT$N$C$C$C$Handler$GT$::poll::hdbfc86c50b98d3ed [libtiflash_proxy.so+26523160]\n                
\tcontrib/tiflash-proxy/components/batch-system/src/batch.rs:380\n  0xffff9a970444\tstd::sys_common::backtrace::__rust_begin_short_backtrace::h209bcd90e7cc37ca [libtiflash_proxy.so+27395140]\n                \t/root/.rustup/toolchains/nightly-2022-11-15-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys_common/backtrace.rs:121\n  0xffff9a9b2284\tcore::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::h159d73113cffcc67 [libtiflash_proxy.so+27665028]\n                \t/root/.rustup/toolchains/nightly-2022-11-15-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/ops/function.rs:513\n  0xffff9bd9f29c\tstd::sys::unix::thread::Thread::new::thread_start::h45f22376cc6c77f8 [libtiflash_proxy.so+48558748]\n                \t/root/.rustup/toolchains/nightly-2022-11-15-aarch64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/sys/unix/thread.rs:108\n  0xffff98e17d38\tstart_thread [libpthread.so.0+32056]\n  0xffff98c0f680\tthread_start [libc.so.6+915072]"] [source="uint8_t DB::TryFlushData(DB::EngineStoreServerWrap *, uint64_t, uint8_t, uint64_t, uint64_t, uint64_t, uint64_t)"] [thread_id=9637]
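
The FATAL message above packs several numbers into one line: the projected process size, the limit, and a per-component storage breakdown. A minimal sketch for pulling them out in Python, assuming the message keeps exactly the format shown in the log above. The function names are made up for this example and are not part of TiFlash.

```python
import re

# Excerpt copied from the FATAL message above (stack trace omitted).
SAMPLE = (
    "Memory limit exceeded caused by 'RSS(Resident Set Size) much larger than limit' : "
    "process memory size would be 137.91 GiB for (attempt to allocate chunk of 2097152 bytes), "
    "limit of memory for data computing : 136.63 GiB. Memory Usage of Storage: "
    "non-query: peak=30.95 GiB, amount=1.92 MiB; kvstore: peak=904.57 MiB, amount=10.27 KiB; "
    "query-storage-task: peak=13.20 GiB, amount=12.75 GiB; fetch-pages: peak=0.00 B, amount=0.00 B; "
    "shared-column-data: peak=13.20 GiB, amount=12.75 GiB., e.what() = DB::TiFlashException"
)

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}


def to_bytes(text):
    """Convert a string such as '904.57 MiB' to a number of bytes."""
    value, unit = text.split()
    return float(value) * UNITS[unit]


def parse_process_and_limit(message):
    """Return (projected process size, limit) in bytes."""
    m = re.search(
        r"process memory size would be ([\d.]+ \w+)"
        r".*?limit of memory for data computing : ([\d.]+ \w+)",
        message,
    )
    if m is None:
        raise ValueError("message does not match the assumed format")
    return to_bytes(m.group(1)), to_bytes(m.group(2))


def parse_memory_breakdown(message):
    """Return {component: {'peak': bytes, 'amount': bytes}} for the storage section."""
    _, _, tail = message.partition("Memory Usage of Storage:")
    pattern = re.compile(r"([\w-]+): peak=([\d.]+ \w+), amount=([\d.]+ \w+)")
    return {
        name: {"peak": to_bytes(peak), "amount": to_bytes(amount)}
        for name, peak, amount in pattern.findall(tail)
    }


if __name__ == "__main__":
    process, limit = parse_process_and_limit(SAMPLE)
    print(f"over limit by {(process - limit) / UNITS['GiB']:.2f} GiB")
    for name, usage in parse_memory_breakdown(SAMPLE).items():
        print(f"{name:20s} peak={usage['peak'] / UNITS['GiB']:8.2f} GiB "
              f"amount={usage['amount'] / UNITS['GiB']:8.2f} GiB")
```

On this log line the projected process size is about 1.28 GiB above the limit, while the storage components listed report current amounts of at most 12.75 GiB each. The message alone does not say what accounts for the rest of the resident memory.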

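It looks like a statement needed more memory than TiFlash had available, which caused the restart, but the slow query log shows no statement that used an unusually large amount of memory. Has anyone run into something like this?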

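Check the log files on the TiFlash server for SQL statements that ran for a long time or used a lot of memory. The slow SQL shown on the Dashboard only covers statements that have already finished; statements still executing are not counted.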

Does tiflash_tikv.log record SQL statements that use a lot of memory?

What do the resources on your TiFlash nodes look like?

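One more question: is FATAL really the right log level here? TiFlash restarts as soon as the memory limit is exceeded, and that mechanism does not seem very reasonable to me.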

Close to 200 GB.

Can you check with free -m?

Use the information_schema.tiflash_replica table to check the replica sync progress (the PROGRESS column).

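During a TiFlash flush operation, a memory allocation went past the preset 136.63 GiB limit and triggered the memory-limit crash. You need to put a limit on it.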

It should. Pull the logs covering the time of the error, then go through the entries in that time window.
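
Narrowing a large log down to the minutes around the crash can be scripted. A minimal sketch, assuming every entry starts with a timestamp in the same `[2026/03/11 11:01:32.451 +08:00]` form as the FATAL line above. The function names, the window sizes, and the level names other than FATAL are assumptions for this example.

```python
import re
from datetime import datetime, timedelta

# Timestamp shape taken from the FATAL line above, e.g. [2026/03/11 11:01:32.451 +08:00]
TS = re.compile(r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\]")


def parse_ts(line):
    """Return the first timestamp found in the line, or None if there is none."""
    m = TS.search(line)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f %z")


def lines_in_window(lines, center,
                    before=timedelta(minutes=10),
                    after=timedelta(minutes=1),
                    levels=("WARN", "ERROR", "FATAL")):
    """Yield log lines inside [center - before, center + after] with a matching level.

    Lines without a timestamp (for example continuation lines) are skipped.
    """
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue
        in_window = center - before <= ts <= center + after
        if in_window and any(f"[{level}]" in line for level in levels):
            yield line.rstrip("\n")


if __name__ == "__main__":
    crash = parse_ts("[2026/03/11 11:01:32.451 +08:00]")
    # "tiflash.log" is the file named in the thread; adjust the path as needed.
    with open("tiflash.log", encoding="utf-8", errors="replace") as f:
        for hit in lines_in_window(f, crash):
            print(hit[:300])  # truncate very long entries such as stack traces
```

Using `search` rather than `match` means a leading `grep -n` prefix such as `2131036:` does not get in the way.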

v8.1.1 has some bugs related to memory management. Could a bug be causing the abnormal memory growth?

Memory usage has probably hit its peak.

Merge tasks are using a lot of memory.

For example:

background merge

when many small parts are being merged.

You can check with:

SELECT *
FROM system.merges;


If it is the merge:

You can limit it with:

SET max_memory_usage=0;

Or adjust the server configuration:

max_server_memory_usage


What does this mean?

  • v8.1.1 has some known issues related to memory management
    • Upgrading to v8.1.2 or later is recommended

It is probably the background data merge (Merge/Flush) tasks, or an unreasonable memory configuration.

Take a look at the merge configuration. Try loosening the conditions a bit and see whether it starts normally.

You could upgrade to v8.5 and see whether that helps.