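【TiDB environment】Production
【TiDB version】5.3.0
【Deployment method】Deployed on machines
【OS / CPU architecture / chip details】
CentOS 7.9
【Machine details】CPU 20 cores / 256 GB memory / 3.0 TB disk
【Cluster data size】1.3T+225G2 (the data directories on the three nodes differ in size)
【Cluster nodes】3 nodes, with IPs ending in 161/162/163, co-located: 3 PD, 3 TiDB, 3 TiKV, plus alertmanager + prometheus + grafana
【Steps to reproduce】Operations that led to the problem
This cluster had been out of use for several months and I only logged in today. It is not yet clear what happened to the servers in the meantime.
【Problem encountered: symptoms and impact】
After logging in to the server I ran display on the cluster. The TiKV on node 163 was not running, and attempts to start it failed. The TiKV log on 163 keeps printing the following: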
174061:[2026/01/04 10:30:06.562 +08:00] [ERROR] [server.rs:1052] [“failed to init io snooper”] [err_code=KV:Unknown] [err=“"IO snooper is not started due to not compiling with BCC"”]
1452439:[2026/01/04 10:32:45.677 +08:00] [FATAL] [server.rs:843] [“failed to start node: Engine(Other("[components/raftstore/src/store/fsm/store.rs:1004]: \"[components/raftstore/src/store/peer_storage.rs:695]: [region 140264249] 140264250 validate state fail: Other(\\\"[components/raftstore/src/store/peer_storage.rs:605]: term of raft state < commit term of apply state, region 140264249, raft state hard_state { term: 7 commit: 9 } last_index: 9, apply state applied_index: 9 commit_index: 9 commit_term: 8 truncated_state { index: 5 term: 5 }\\\")\""))”]
The other two TiKV nodes keep printing logs like the following:
[2026/01/04 10:27:29.714 +08:00] [INFO] [raft.rs:2609] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {42866800, 138879260, 42866798} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }”] [raft_id=42866800] [region_id=42866797]
[2026/01/04 10:27:29.714 +08:00] [INFO] [raft.rs:1092] [“became follower at term 141”] [term=141] [raft_id=42866800] [region_id=42866797]
[2026/01/04 10:27:29.714 +08:00] [INFO] [raft.rs:384] [newRaft] [peers=“Configuration { incoming: Configuration { voters: {42866800, 138879260, 42866798} }, outgoing: Configuration { voters: {} } }”] [“last term”=141] [“last index”=181] [applied=181] [commit=181] [term=141] [raft_id=42866800] [region_id=42866797]
[2026/01/04 10:27:29.714 +08:00] [INFO] [raw_node.rs:315] [“RawNode created with id 42866800.”] [id=42866800] [raft_id=42866800] [region_id=42866797]
[2026/01/04 10:27:29.750 +08:00] [INFO] [peer.rs:208] [“create peer”] [peer_id=42877204] [region_id=42877201]
[2026/01/04 10:27:29.750 +08:00] [INFO] [raft.rs:2609] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {42877204, 42877202, 139105818} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }”] [raft_id=42877204] [region_id=42877201]
[2026/01/04 10:27:29.750 +08:00] [INFO] [raft.rs:1092] [“became follower at term 147”] [term=147] [raft_id=42877204] [region_id=42877201]
[2026/01/04 10:27:29.750 +08:00] [INFO] [raft.rs:384] [newRaft] [peers=“Configuration { incoming: Configuration { voters: {42877204, 42877202, 139105818} }, outgoing: Configuration { voters: {} } }”] [“last term”=147] [“last index”=196] [applied=196] [commit=196] [term=147] [raft_id=42877204] [region_id=42877201]
[2026/01/04 10:27:29.750 +08:00] [INFO] [raw_node.rs:315] [“RawNode created with id 42877204.”] [id=42877204] [raft_id=42877204] [region_id=42877201]
[2026/01/04 10:27:29.808 +08:00] [INFO] [peer.rs:208] [“create peer”] [peer_id=42890188] [region_id=42890185]
[2026/01/04 10:27:29.808 +08:00] [INFO] [raft.rs:2609] [“switched to configuration”] [config=“Configuration { voters: Configuration { incoming: Configuration { voters: {42890188, 42890186, 139068687} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }”] [raft_id=42890188] [region_id=42890185]
[2026/01/04 10:27:29.808 +08:00] [INFO] [raft.rs:1092] [“became follower at term 142”] [term=142] [raft_id=42890188] [region_id=42890185]
I checked and found damaged regions, a very large number of them. This is only a small sample:
140971013: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971013] 140971015 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971013, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971017: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971017] 140971019 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971017, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971021: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971021] 140971023 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971021, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971025: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971025] 140971027 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971025, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971029: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971029] 140971031 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971029, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971033: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971033] 140971035 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971033, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971037: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971037] 140971039 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971037, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971041: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971041] 140971043 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971041, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971045: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971045] 140971047 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971045, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971049: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971049] 140971051 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971049, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
140971053: “[src/server/debug.rs:511]: "[components/raftstore/src/store/peer_storage.rs:695]: [region 140971053] 140971055 validate state fail: Other(\"[components/raftstore/src/store/peer_storage.rs:578]: log at recorded commit index [6] 6 doesn’t exist, may lose data, region 140971053, raft state hard_state { term: 5 commit: 5 } last_index: 5, apply state applied_index: 6 commit_index: 6 commit_term: 6 truncated_state { index: 5 term: 5 }\")"”
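To make the two error messages easier to compare, here is a minimal Python sketch of the consistency checks they appear to describe. It is inferred only from the wording of the log messages, not taken from the TiKV source; the class names and the `check_states` function are hypothetical, and the real validation in `peer_storage.rs` is more involved and stops at the first failure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RaftState:
    term: int        # hard_state.term
    commit: int      # hard_state.commit
    last_index: int  # index of the last raft log entry on disk


@dataclass
class ApplyState:
    applied_index: int
    commit_index: int
    commit_term: int


def check_states(raft: RaftState, apply: ApplyState) -> List[str]:
    """Return every consistency problem found (empty list means none).

    Simplified assumption: unlike TiKV, this collects all failures
    instead of returning at the first one.
    """
    problems = []
    # The apply state records a commit index; the raft log must reach it.
    if raft.last_index < apply.commit_index:
        problems.append("log at recorded commit index doesn't exist")
    # The persisted raft term should not be behind the term that was
    # recorded when entries were applied.
    if raft.term < apply.commit_term:
        problems.append("term of raft state < commit term of apply state")
    return problems


# Values copied from the log lines above.
region_140264249 = check_states(RaftState(7, 9, 9), ApplyState(9, 9, 8))
region_140971013 = check_states(RaftState(5, 5, 5), ApplyState(6, 6, 6))
print(region_140264249)
print(region_140971013)
```

In both cases the raft state on 163 is older than the apply state, which would be consistent with raft data having been lost or rolled back on that node while the KV data was not.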
[root@node3 log]# dmesg | grep -i "memory|page|oom"
(no output)
Store status afterwards:
[tidb@node1 tidb-community-server-v5.3.0-linux-amd64]$ ./pd-ctl -u http://172.17.11.161:2381 store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 5,
        "address": "172.17.11.161:20160",
        "version": "5.3.0",
        "status_address": "172.17.11.161:20180",
        "git_hash": "6c1424706f3d5885faa668233f34c9f178302f36",
        "start_timestamp": 1767496164,
        "deploy_path": "/app/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1767497057173780660,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.999TiB",
        "available": "568.9GiB",
        "used_size": "1.234TiB",
        "leader_count": 149014,
        "leader_weight": 1,
        "leader_score": 149014,
        "leader_size": 1642486,
        "region_count": 224972,
        "region_weight": 1,
        "region_score": 2833978.1840664097,
        "region_size": 2405349,
        "slow_score": 8,
        "start_ts": "2026-01-04T11:09:24+08:00",
        "last_heartbeat_ts": "2026-01-04T11:24:17.17378066+08:00",
        "uptime": "14m53.17378066s"
      }
    },
    {
      "store": {
        "id": 138495900,
        "address": "172.17.11.162:20160",
        "version": "5.3.0",
        "status_address": "172.17.11.162:20180",
        "git_hash": "6c1424706f3d5885faa668233f34c9f178302f36",
        "start_timestamp": 1767496986,
        "deploy_path": "/app/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1767497050613457241,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.499TiB",
        "available": "851.9GiB",
        "used_size": "224.7GiB",
        "leader_count": 75958,
        "leader_weight": 1,
        "leader_score": 75958,
        "leader_size": 762863,
        "region_count": 224972,
        "region_weight": 1,
        "region_score": 2950266.8211040213,
        "region_size": 2405349,
        "slow_score": 4,
        "start_ts": "2026-01-04T11:23:06+08:00",
        "last_heartbeat_ts": "2026-01-04T11:24:10.613457241+08:00",
        "uptime": "1m4.613457241s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "172.17.11.163:20160",
        "version": "5.3.0",
        "status_address": "172.17.11.163:20180",
        "git_hash": "6c1424706f3d5885faa668233f34c9f178302f36",
        "start_timestamp": 1767495278,
        "deploy_path": "/app/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1742400514180401910,
        "state_name": "Down"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 217046,
        "region_weight": 1,
        "region_score": 2103347,
        "region_size": 2103347,
        "slow_score": 0,
        "start_ts": "2026-01-04T10:54:38+08:00",
        "last_heartbeat_ts": "2025-03-20T00:08:34.18040191+08:00"
      }
    }
  ]
}
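For anyone reading the store output, the following is a small Python sketch that lists each store's state and converts `last_heartbeat` to a readable time. It assumes the raw `pd-ctl ... store` output is valid JSON with plain ASCII quotes and that `last_heartbeat` is in nanoseconds since the Unix epoch (which matches the `last_heartbeat_ts` values shown). The function name `summarize_stores` and the trimmed sample are illustrative only.

```python
import json
from datetime import datetime, timezone


def summarize_stores(pd_output: str):
    """Return (id, address, state_name, last_heartbeat_utc) per store."""
    data = json.loads(pd_output)
    rows = []
    for item in data["stores"]:
        store = item["store"]
        # Assumption: last_heartbeat is nanoseconds since the Unix epoch.
        seconds = store.get("last_heartbeat", 0) // 1_000_000_000
        heartbeat = datetime.fromtimestamp(seconds, tz=timezone.utc)
        rows.append((store["id"], store["address"],
                     store["state_name"], heartbeat))
    return rows


# Trimmed sample using values from the output above.
sample = """
{
  "count": 2,
  "stores": [
    {"store": {"id": 5, "address": "172.17.11.161:20160",
               "last_heartbeat": 1767497057173780660, "state_name": "Up"}},
    {"store": {"id": 1, "address": "172.17.11.163:20160",
               "last_heartbeat": 1742400514180401910, "state_name": "Down"}}
  ]
}
"""

rows = summarize_stores(sample)
not_up = [row for row in rows if row[2] != "Up"]
for store_id, address, state, heartbeat in rows:
    print(store_id, address, state, heartbeat.isoformat())
```

With the values above, store 1 (163) last sent a heartbeat on 2025-03-19 16:08:34 UTC, that is 2025-03-20 00:08:34 +08:00, so it had already been offline for months before this login.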
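I wanted to remove the TiKV node on 163 and scale out a new one to replace it. After I forced it offline, the TiKV on 161 also went to disconnected.
The cluster is currently unavailable. Could anyone suggest an approach for recovering it?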