# Migrating Alibaba Cloud Databases to TiDB: A Practical Guide

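1. Overview

1.1 Background

As the business grows, a traditional single-instance MySQL architecture runs into storage-capacity and concurrency bottlenecks. TiDB, an open-source distributed HTAP database, offers the following core advantages:

Feature	Description
Horizontal scaling	Compute and storage are separated; nodes can be added or removed online, with no manual sharding
MySQL compatibility	Highly compatible with the MySQL 5.7/8.0 protocol and syntax
Financial-grade high availability	Multi-replica storage based on the Raft protocol, RPO = 0
HTAP mixed workloads	OLTP and OLAP in one system, with support for real-time analytics
Cloud-native architecture	Runs naturally on Kubernetes and supports multi-tenancy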

2. Pre-migration Assessment

2.1 Collecting Source Information

2.1.1 Basic Instance Information

-- Check the server version
SELECT VERSION();

-- List databases and their sizes
SELECT 
    table_schema AS 'database',
    ROUND(SUM(data_length + index_length) / 1024 / 1024 / 1024, 2) AS 'size_gb'
FROM information_schema.tables
GROUP BY table_schema
ORDER BY SUM(data_length + index_length) DESC;

-- List row counts and sizes per table (table_rows is an estimate for InnoDB)
SELECT 
    table_schema,
    table_name,
    table_rows,
    ROUND((data_length + index_length) / 1024 / 1024 / 1024, 4) AS 'total_size_gb'
FROM information_schema.tables
WHERE table_schema NOT IN ('information_schema', 'mysql', 'performance_schema', 'sys')
ORDER BY (data_length + index_length) DESC
LIMIT 50;

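2.1.2 Collecting Core Monitoring Metrics

Metric	How to collect	What to record
QPS	`SHOW GLOBAL STATUS LIKE 'Questions'` (cumulative counter; sample twice and divide the delta by the interval)	Peak QPS
TPS	`SHOW GLOBAL STATUS LIKE 'Com_commit'` (cumulative counter; sampled the same way)	Peak TPS
Slow queries	Slow query log / Alibaba Cloud CloudDBA	Top 100 slow queries
Connections	`SHOW GLOBAL STATUS LIKE 'Threads_connected'`	Peak concurrent connections
Disk usage	Alibaba Cloud console	Actual space used by data + logs
Binlog generation rate	`SHOW BINARY LOGS`	MB/hour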
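
`Questions` and `Com_commit` are cumulative counters, so a rate needs two samples taken a known interval apart. The following is a minimal sketch, assuming the two counter values have already been read from `SHOW GLOBAL STATUS`; the function name `counter_rate` is illustrative and not part of any tool.

```shell
#!/usr/bin/env bash
# counter_rate BEFORE AFTER INTERVAL_SECONDS
# Prints the average per-second rate between two samples of a cumulative
# MySQL status counter (for example Questions or Com_commit).
counter_rate() {
  local before=$1 after=$2 interval=$3
  if [ "$interval" -le 0 ]; then
    echo "interval must be positive" >&2
    return 1
  fi
  # Integer division is precise enough for capacity planning.
  echo $(( (after - before) / interval ))
}

# Example: Questions went from 1,200,000 to 1,500,000 over 60 seconds.
counter_rate 1200000 1500000 60   # prints 5000
```

Sampling during the busiest hour gives the peak figure that the sizing formulas in section 2.3 expect.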

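2.1.3 Collecting Alibaba Cloud RDS-specific Parameters

-- Check Alibaba Cloud-specific parameters and plugins
SHOW VARIABLES LIKE '%alisql%';
SHOW VARIABLES LIKE '%rds%';
SHOW PLUGINS;

-- Check whether Alibaba Cloud TDE (transparent data encryption) is in use
SHOW VARIABLES LIKE '%encrypt%';

-- Check whether this connection lands on a read-only node (for example, through a read/write splitting endpoint)
SHOW VARIABLES LIKE '%read_only%';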

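2.2 Application-layer Assessment

2.2.1 Assessment Checklist

Item	Details	Risk
SQL dialect differences	Stored procedures, triggers, user-defined functions, event scheduler	🔴 High
ORM frameworks	Whether MyBatis / Hibernate / GORM etc. are compatible with TiDB	🟡 Medium
Connection pool settings	Whether pool size and timeout parameters need adjusting	🟢 Low
Transaction model	Whether large transactions (> 10s) or lock waits exist	🔴 High
Read/write splitting	Whether the application depends on MySQL primary/replica read/write splitting	🟡 Medium
Sharding middleware	Whether ShardingSphere / MyCAT etc. can be removed	🔴 High
Scheduled jobs	Jobs driven by stored procedures or the Event Scheduler	🔴 High
Full-text search	Scenarios that depend on MySQL FULLTEXT indexes	🟡 Medium

2.2.2 Capturing the Workload Profile

# Analyze the slow query log with pt-query-digest
pt-query-digest /path/to/slow.log --limit=100 > slow_analysis.txt

# Analyze index usage with pt-index-usage
pt-index-usage /path/to/slow.log --host=xxx --user=xxx --password=xxx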

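2.3 TiDB Resource Planning

2.3.1 Estimating Cluster Size

Dimension	Formula	Notes
TiKV nodes	`ceil(data size in GB × replicas / per-node capacity)`	Keep each node at ≤ 2TB
TiDB nodes	`ceil(peak QPS / per-node QPS)` + headroom	At least 2 nodes, with 50% headroom
PD nodes	3/5/7 (odd number)	≥ 3 recommended for production
TiFlash nodes	As needed (HTAP scenarios)	Add when OLAP queries require it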
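
The sizing formulas can be turned into a small calculator. This is a sketch under stated assumptions: the function names are illustrative, the per-node QPS figure in the example is hypothetical and should come from your own benchmark, and the floor of three TiKV nodes is added here so that three Raft replicas can sit on distinct stores.

```shell
#!/usr/bin/env bash
# ceil_div A B: integer ceiling of A / B for positive integers.
ceil_div() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

# estimate_tikv_nodes DATA_GB REPLICAS NODE_CAPACITY_GB
# ceil(data size x replicas / per-node capacity), never fewer than 3 nodes
# (assumption of this sketch: one store per replica).
estimate_tikv_nodes() {
  local n
  n=$(ceil_div $(( $1 * $2 )) "$3")
  if [ "$n" -lt 3 ]; then n=3; fi
  echo "$n"
}

# estimate_tidb_nodes PEAK_QPS PER_NODE_QPS
# ceil(peak QPS / per-node QPS) plus 50% headroom, never fewer than 2 nodes.
estimate_tidb_nodes() {
  local base n
  base=$(ceil_div "$1" "$2")
  n=$(ceil_div $(( base * 3 )) 2)   # base x 1.5, rounded up
  if [ "$n" -lt 2 ]; then n=2; fi
  echo "$n"
}

estimate_tikv_nodes 1500 3 2000   # 1.5 TB of data, 3 replicas, 2 TB per node: prints 3
estimate_tidb_nodes 40000 15000   # hypothetical 15k QPS per TiDB node: prints 5
```

The result ignores TiKV compression and future growth, so treat it as a starting point rather than a final capacity plan.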

2.3.2 Suggested Mapping from Common Alibaba Cloud RDS Specs

Alibaba Cloud RDS spec	Data size	Recommended TiDB cluster
4C16G (general-purpose)	< 200GB	TiDB×2 + TiKV×3 (8C32G) + PD×3
8C32G (dedicated)	200GB-1TB	TiDB×3 + TiKV×5 (16C64G) + PD×3
16C64G (dedicated)	1TB-3TB	TiDB×4 + TiKV×7 (16C64G) + PD×5
32C128G+ (dedicated cluster)	> 3TB	TiDB×6+ + TiKV×9+ + PD×5 + TiFlash

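3. MySQL/TiDB Compatibility Analysis

3.1 Incompatible Features in Detail

3.1.1 Unsupported Features

Feature	Description	Alternative
Stored procedures	TiDB does not support stored procedures	Implement the business logic in the application
Triggers	Not supported	Application layer + TiCDC
User-defined functions (UDF)	Not supported	Replace in the application layer
Event Scheduler	Scheduled events are not supported	External scheduling (for example CronJob)
Foreign key constraints	Before v6.6, `FOREIGN KEY` syntax is parsed but not enforced; v6.6 and later enforce foreign keys (experimental until v8.5)	Enforce integrity in the application, or test the foreign key feature on the target version
Full-text indexes (FULLTEXT)	Not supported	Elasticsearch / TiDB + TiSpark

3.1.2 Features with Behavioral Differences

Feature	MySQL behavior	TiDB behavior	Impact
AUTO_INCREMENT	Monotonically increasing, unique	Globally unique but possibly non-contiguous (IDs are allocated in batches per TiDB node)	Do not rely on the ordering of auto-increment IDs
AUTO_RANDOM	Does not exist	TiDB-specific replacement for AUTO_INCREMENT with better write performance	Recommended for distributed workloads
Transaction size	No hard limit (bounded by binlog settings)	A single transaction is size-limited (100MB by default through `txn-total-size-limit`; recent versions bound it through memory quotas instead)	Large transactions must be split
DDL execution	Takes metadata locks and may block	Online DDL that does not block reads and writes	Be aware of DDL queueing
Character set	`utf8` = `utf8mb3`	`utf8` = `utf8mb4`	Standardize on `utf8mb4`
Default collation	`utf8mb4_general_ci` (5.7) / `utf8mb4_0900_ai_ci` (8.0)	`utf8mb4_bin` (default)	Case-sensitivity differences
sql_mode	Depends on version and instance settings; the source may run with a relaxed mode	Strict mode with `ONLY_FULL_GROUP_BY` by default	SQL that ran under a relaxed mode may fail; align `sql_mode` on both sides
Query planning	Cost-based + heuristics	Cost-based optimizer (CBO) with its own statistics and distributed execution	Re-evaluate indexes and execution plans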

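3.2 Compatibility Checking Tools

3.2.1 Checking with TiDB Dashboard

# TiDB Dashboard is built into PD and starts with the cluster
# Open http://{pd-ip}:2379/dashboard
# Use the SQL statement analysis and slow query pages

3.2.2 Verifying During a Dumpling Export

# Export the schema only (--no-data) and review the DDL for incompatible features
tiup dumpling \
  -h {rds-host} -P 3306 \
  -u {user} -p {password} \
  --filetype sql \
  --output /data/migration-check/ \
  --no-data \
  -B {database}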

3.2.3 Key Compatibility Detection SQL

-- Check stored procedures and stored functions
SELECT ROUTINE_NAME, ROUTINE_TYPE 
FROM information_schema.ROUTINES 
WHERE ROUTINE_SCHEMA NOT IN ('sys', 'mysql', 'information_schema', 'performance_schema');

-- Check triggers
SELECT TRIGGER_SCHEMA, TRIGGER_NAME, EVENT_MANIPULATION, EVENT_OBJECT_TABLE
FROM information_schema.TRIGGERS;

-- Check scheduled events
SELECT EVENT_SCHEMA, EVENT_NAME, STATUS, EVENT_TYPE
FROM information_schema.EVENTS;

-- Check foreign keys
SELECT TABLE_SCHEMA, TABLE_NAME, CONSTRAINT_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_NAME IS NOT NULL;

-- Check views
SELECT TABLE_SCHEMA, TABLE_NAME
FROM information_schema.VIEWS
WHERE TABLE_SCHEMA NOT IN ('sys', 'mysql', 'information_schema', 'performance_schema');

-- Check loadable user-defined functions (requires read access to the mysql schema)
SELECT name, ret, dl, type
FROM mysql.func;

4. Overall Migration Design

4.1 Migration Architecture

                    ┌─────────────────────────────────────────┐
                    │         Alibaba Cloud RDS MySQL         │
                    │  ┌─────────┐  ┌─────────┐  ┌─────────┐  │
                    │  │ Primary │──│ Replica │──│ Replica │  │
                    │  └──┬──────┘  └─────────┘  └─────────┘  │
                    └─────┼───────────────────────────────────┘
                          │ Binlog Stream
                          ▼
               ┌─────────────────────┐
               │   TiDB DM Cluster   │
               │ ┌────────┐ ┌──────┐ │
               │ │DM-master│ │DM-   │ │
               │ │        │ │worker│ │
               │ └────────┘ └──────┘ │
               │    Schema Mapping   │
               │    Filter Rules     │
               └──────────┬──────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌──────────┐   ┌──────────┐   ┌──────────┐
    │   TiDB   │──│   TiDB   │──│   TiDB   │   (SQL layer)
    └────┬─────┘   └────┬─────┘   └────┬─────┘
         │               │               │
    ┌────┴───────────────┴───────────────┴────┐
    │              PD Cluster                  │   (scheduling layer)
    └──────────────────┬──────────────────────┘
         ┌─────────────┼──────────────┐
    ┌────┴────┐   ┌────┴────┐   ┌────┴────┐
    │  TiKV   │   │  TiKV   │   │  TiKV   │    (storage layer)
    │(Region) │   │(Region) │   │(Region) │
    └─────────┘   └─────────┘   └─────────┘

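4.2 Recommended Toolchain

Stage	Tool	Purpose
Data export	Dumpling	Full export of MySQL data to SQL/CSV
Full import	TiDB Lightning	High-speed parallel import into TiKV
Incremental replication	TiDB Data Migration (DM)	Real-time binlog replication
Data validation	sync-diff-inspector	Upstream/downstream consistency comparison
Backup and restore	BR (Backup & Restore)	Distributed backup and restore
Cluster management	TiUP	Cluster deployment, upgrade, and operations

4.3 Choosing a Migration Strategy

Strategy	Suitable for	Downtime	Complexity
Full + incremental (recommended)	Most online workloads	Minutes (only at cutover)	⭐⭐⭐
Full only	Non-core systems that can tolerate long downtime	Hours to days	⭐
Dual write	Financial scenarios with very strict consistency requirements	Zero downtime	⭐⭐⭐⭐⭐

Recommended: full + incremental migration, which fits more than 90% of scenarios.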

5. Environment Preparation and Deployment

5.1 Deploying the TiDB Cluster

5.1.1 Example Topology File (topology.yaml)

global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/data/tidb-deploy"
  data_dir: "/data/tidb-data"

pd_servers:
  - host: 10.0.1.1
  - host: 10.0.1.2
  - host: 10.0.1.3

tidb_servers:
  - host: 10.0.2.1
  - host: 10.0.2.2
  - host: 10.0.2.3

tikv_servers:
  - host: 10.0.3.1
    data_dir: "/data1/tikv"
  - host: 10.0.3.2
    data_dir: "/data1/tikv"
  - host: 10.0.3.3
    data_dir: "/data1/tikv"
  - host: 10.0.3.4
    data_dir: "/data1/tikv"
  - host: 10.0.3.5
    data_dir: "/data1/tikv"

monitoring_servers:
  - host: 10.0.4.1

grafana_servers:
  - host: 10.0.4.1

alertmanager_servers:
  - host: 10.0.4.1

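5.1.2 Deployment Commands

# Deploy the cluster with TiUP
tiup cluster deploy tidb-prod v7.5.0 ./topology.yaml \
  --user root -p

# Start the cluster
tiup cluster start tidb-prod

# Check the cluster status
tiup cluster display tidb-prod

# Set the root password (run the SQL statements inside the mysql client session)
mysql -h {tidb-host} -P 4000 -u root
ALTER USER 'root'@'%' IDENTIFIED BY 'your-password';
FLUSH PRIVILEGES;

5.1.3 Initial Key Parameters

-- TiDB-level tuning
SET GLOBAL tidb_gc_life_time = '24h';          -- Relax GC during the migration
SET GLOBAL tidb_dml_batch_size = 20000;         -- Batch-write optimization
-- tidb_enable_amend_pessimistic_txn is deprecated in recent TiDB versions; leave it at its default (OFF)

-- Check the current values (these are system variables, not SHOW CONFIG items)
SHOW VARIABLES LIKE 'tidb_gc%';
SHOW VARIABLES LIKE '%batch%';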

5.2 Deploying the DM Cluster

5.2.1 DM Topology File (dm-topology.yaml)

global:
  user: "tidb"
  deploy_dir: "/data/dm-deploy"

master_servers:
  - host: 10.0.5.1
  - host: 10.0.5.2
  - host: 10.0.5.3

worker_servers:
  - host: 10.0.6.1
  - host: 10.0.6.2
  - host: 10.0.6.3

# If this host already runs the TiDB cluster's Prometheus/Grafana,
# assign different ports here or use a separate host.
monitoring_servers:
  - host: 10.0.4.1

grafana_servers:
  - host: 10.0.4.1
5.2.2 Deployment Commands

# Deploy the DM cluster
tiup dm deploy dm-prod v7.5.0 ./dm-topology.yaml --user root -p

# Start DM
tiup dm start dm-prod

# Verify
tiup dm display dm-prod

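5.3 Preparing the Source Alibaba Cloud RDS

5.3.1 Creating a Dedicated Migration Account

-- Create a dedicated migration account on Alibaba Cloud RDS
CREATE USER 'tidb_migration'@'%' IDENTIFIED BY 'your-secure-password';

-- Grant the required privileges
GRANT 
    SELECT, 
    RELOAD, 
    REPLICATION SLAVE, 
    REPLICATION CLIENT, 
    LOCK TABLES,
    SHOW VIEW,
    PROCESS
ON *.* TO 'tidb_migration'@'%';

FLUSH PRIVILEGES;

5.3.2 Confirming the Binlog Configuration

-- Check that binlog is enabled
SHOW VARIABLES LIKE 'log_bin';
-- Must be ON

-- Check the binlog format
SHOW VARIABLES LIKE 'binlog_format';
-- Must be ROW

-- Check binlog retention (Alibaba Cloud RDS default: 18h; retention is managed in the RDS console)
SHOW VARIABLES LIKE 'expire_logs_days';              -- MySQL 5.7
SHOW VARIABLES LIKE 'binlog_expire_logs_seconds';    -- MySQL 8.0

-- Make sure binlog_row_image = FULL
SHOW VARIABLES LIKE 'binlog_row_image';

5.3.3 Alibaba Cloud Whitelist Configuration

In the Alibaba Cloud RDS console, add the IP addresses of the DM-worker servers (and of the host that runs Dumpling) to the RDS whitelist.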

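6. Full Data Migration

6.1 Choosing an Approach

Option A: Dumpling + TiDB Lightning (recommended)

Suited to larger data volumes (> 50GB); fast, with little load on the source database.

Option B: DM full migration

Suited to small and medium data volumes; combines full and incremental migration with simple configuration.

6.2 Full Export with Dumpling

6.2.1 Export Commands

# Full export of all databases
# --consistency auto resolves to flush on MySQL; use lock if
# FLUSH TABLES WITH READ LOCK is not permitted on the RDS instance.
# The snapshot mode and --snapshot apply only when the source is TiDB.
tiup dumpling \
  -h {rds-host} \
  -P 3306 \
  -u tidb_migration \
  -p 'your-password' \
  --filetype sql \
  --threads 8 \
  --rows 200000 \
  --consistency auto \
  --output /data/migration/full-dump/

# Export selected databases
tiup dumpling \
  -h {rds-host} \
  -P 3306 \
  -u tidb_migration \
  -p 'your-password' \
  -B db1,db2,db3 \
  --filetype sql \
  --threads 8 \
  --output /data/migration/full-dump/

# Exclude specific tables
tiup dumpling \
  -h {rds-host} \
  -P 3306 \
  -u tidb_migration \
  -p 'your-password' \
  --filter 'db1.*' \
  --filter '!db1.temp_table' \
  --filetype csv \
  --output /data/migration/full-dump/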

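6.2.2 Key Parameters

Parameter	Description	Suggested value
`--threads`	Number of concurrent threads	8-16
`--rows`	Splits a table into chunks of about this many rows for concurrent export	200000
`--consistency`	Consistency mode	auto (resolves to flush on MySQL) / lock; snapshot applies only to TiDB sources
`--snapshot`	Snapshot TSO, valid only when the source is TiDB	Not used for RDS MySQL
`--filetype`	Output format	sql / csv (both can be imported by Lightning)
`--params`	Session variables for the connection to the source database	Leave unset unless a session variable is required

Note: Dumpling writes a `metadata` file to the output directory that records the source binlog file name and position (and the GTID set) at export time. Use these values as the starting point for DM incremental replication.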
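
For a MySQL source, the starting point for incremental replication is the binlog position recorded in the `metadata` file that Dumpling writes to the output directory. The sketch below extracts it; it assumes the `Log:` / `Pos:` layout under `SHOW MASTER STATUS:` and uses made-up sample values, so check it against a real dump first. The function name is illustrative.

```shell
#!/usr/bin/env bash
# binlog_position_from_metadata FILE
# Prints "<binlog-file> <position>" parsed from a Dumpling metadata file.
# Assumes the "Log:" / "Pos:" lines that follow "SHOW MASTER STATUS:".
binlog_position_from_metadata() {
  awk '
    /SHOW MASTER STATUS:/ { in_master = 1; next }
    in_master && $1 == "Log:" { log_name = $2 }
    in_master && $1 == "Pos:" { pos = $2; in_master = 0 }
    END {
      if (log_name == "" || pos == "") exit 1
      print log_name, pos
    }
  ' "$1"
}

# Demo with a sample file (values are made up).
sample=$(mktemp)
cat > "$sample" <<'EOF'
Started dump at: 2024-01-01 10:00:00
SHOW MASTER STATUS:
	Log: mysql-bin.000123
	Pos: 4567
	GTID:

Finished dump at: 2024-01-01 10:30:00
EOF
binlog_position_from_metadata "$sample"   # prints: mysql-bin.000123 4567
rm -f "$sample"
```

The two values map to `binlog-name` and `binlog-pos` in the `meta` block of a DM task that runs with `task-mode: "incremental"`.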

6.3 Importing with TiDB Lightning

6.3.1 Lightning Configuration (tidb-lightning.toml)

[lightning]
# Concurrency
index-concurrency = 8
table-concurrency = 8
# region-concurrency defaults to the number of CPU cores; set it only to lower the load
# region-concurrency = 16

# Checkpoints (resume after interruption)
[checkpoint]
enable = true
schema = "tidb_lightning_checkpoint"
driver = "mysql"
dsn = "root:password@tcp(127.0.0.1:4000)/"

[tikv-importer]
# Use local mode (recommended)
backend = "local"
sorted-kv-dir = "/data/tidb-lightning/sorted-kv"

[mydumper]
# Source data directory (the Dumpling output directory)
data-source-dir = "/data/migration/full-dump"
no-schema = false
filter = ['*.*']

[tidb]
host = "127.0.0.1"
port = 4000
user = "root"
password = "your-password"
status-port = 10080
pd-addr = "127.0.0.1:2379"

[post-restore]
# Run checksum and ANALYZE after the import
checksum = true
analyze = true

6.3.2 Running the Import

# Import in local mode (fast, but consumes TiKV resources)
tiup tidb-lightning -config tidb-lightning.toml

# Monitor import progress (adjust the path to the log file configured for Lightning)
grep "progress" /tmp/lightning.log

6.3.3 Post-import Steps

-- Collect statistics
ANALYZE TABLE db1.table1, db1.table2;

-- Restore the GC setting
SET GLOBAL tidb_gc_life_time = '10m';

-- Check row counts (TABLE_ROWS is an estimate; use COUNT(*) for exact numbers)
SELECT 
    TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS
FROM information_schema.tables
WHERE TABLE_SCHEMA IN ('db1', 'db2', 'db3')
ORDER BY TABLE_SCHEMA, TABLE_NAME;

7. Incremental Data Replication

7.1 DM Task Configuration

7.1.1 Basic Replication Task (task.yaml)

name: "rds-to-tidb-prod"
task-mode: "all"        # all = full + incremental, incremental = incremental only
online-ddl: true        # handle gh-ost / pt-osc online DDL shadow tables

# Target TiDB
target-database:
  host: "10.0.2.1"
  port: 4000
  user: "root"
  password: "your-tidb-password"

# Source MySQL instances
mysql-instances:
  - source-id: "aliyun-rds-prod"
    
    # Block/allow list
    block-allow-list: "instance-filter"
    
    # Binlog event filters
    filter-rules: ["replicate-filter"]
    
    # Table routing (table name mapping)
    route-rules: ["mapping-rule"]
    
    # Full export settings (used only when task-mode is all)
    mydumper-config-name: "global"
    
    # Incremental replication settings
    syncer-config-name: "global"
    
    # Continuous data validation settings
    continuous-validator-config-name: "validator-1"
    
    # With task-mode: incremental, set the starting binlog position,
    # for example from the Dumpling metadata file:
    # meta:
    #   binlog-name: "mysql-bin.000001"
    #   binlog-pos: 4

# Binlog event filters
filters:
  replicate-filter:
    schema-pattern: "*"
    table-pattern: "*"
    events: ["all"]
    action: "Do"

# Table routing rules
routes:
  mapping-rule:
    schema-pattern: "source_db"
    table-pattern: "source_table_*"
    target-schema: "target_db"
    target-table: "target_table"   # merge sharded tables into one target table

# Block/allow list
block-allow-list:
  instance-filter:
    do-dbs: 
      - "your_database_1"
      - "your_database_2"
    ignore-dbs:
      - "mysql"
      - "information_schema"
      - "performance_schema"
      - "sys"
    do-tables: []
    ignore-tables:
      - db-name: "your_database_1"
        tbl-name: "temp_table"
      - db-name: "your_database_1"
        tbl-name: "~^backup_"      # "~" marks a regular expression

# Full export (dump) settings
mydumpers:
  global:
    mydumper-path: "dumpling"
    threads: 8
    chunk-filesize: "128"
    skip-tz-utc: true
    extra-args: "--consistency=auto"

# Incremental replication (syncer) settings
syncers:
  global:
    worker-count: 16
    batch: 100
    safe-mode: false
    # Adjust for tables with large data volumes
    # max-retry: 100

# Continuous data validation
validators:
  validator-1:
    mode: "full"      # full/fast/none
    worker-count: 4
    row-error-delay: "30m"

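7.1.2 Configuration for Merging Sharded Databases and Tables

# Use this when several RDS instances or sharded tables on Alibaba Cloud need to be merged
mysql-instances:
  - source-id: "aliyun-rds-shard-1"
    route-rules: ["shard-merge"]
    block-allow-list: "shard-filter"
    
  - source-id: "aliyun-rds-shard-2"
    route-rules: ["shard-merge"]
    block-allow-list: "shard-filter"

routes:
  shard-merge:
    schema-pattern: "order_db_*"
    table-pattern: "t_order_*"
    target-schema: "order_db"
    target-table: "t_order"

# Sharding DDL coordination
# With shard-mode ("pessimistic" or "optimistic") set at the top level of the task,
# DM coordinates schema changes across the upstream shards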

7.2 DM Task Management

# Start the task (the upstream source must already be registered with operate-source create)
tiup dmctl --master-addr 10.0.5.1:8261 \
  start-task task.yaml

# Check the task status
tiup dmctl --master-addr 10.0.5.1:8261 \
  query-status rds-to-tidb-prod

# Pause the task
tiup dmctl --master-addr 10.0.5.1:8261 \
  pause-task rds-to-tidb-prod

# Resume the task
tiup dmctl --master-addr 10.0.5.1:8261 \
  resume-task rds-to-tidb-prod

# Show the task configuration
tiup dmctl --master-addr 10.0.5.1:8261 \
  get-task-config rds-to-tidb-prod

# Replace a failed DDL (DM v6.0 and later use the binlog command; earlier v2.x versions use handle-error)
tiup dmctl --master-addr 10.0.5.1:8261 \
  binlog replace rds-to-tidb-prod \
  -b "mysql-bin.000001:12345" \
  "ALTER TABLE t1 ADD COLUMN c1 INT DEFAULT 0"

# Skip a failed DDL
tiup dmctl --master-addr 10.0.5.1:8261 \
  binlog skip rds-to-tidb-prod \
  -b "mysql-bin.000001:12345"

7.3 Monitoring Incremental Replication

# DM does not expose replication status through TiDB's information_schema;
# check lag with dmctl or the DM Grafana dashboards.
# In the query-status output, compare masterBinlog with syncerBinlog and
# check secondsBehindMaster for each subtask.
tiup dmctl --master-addr 10.0.5.1:8261 query-status rds-to-tidb-prod

# Show more detailed internal status
tiup dmctl --master-addr 10.0.5.1:8261 query-status --more

8. Data Consistency Validation

8.1 sync-diff-inspector Configuration

8.1.1 Configuration File (diff-config.toml)

# Number of threads used for comparison
check-thread-count = 8

# Write SQL that repairs the target to files under output-dir
export-fix-sql = true

# Data sources
[data-sources]
[data-sources.mysql-rds]
    host = "{rds-host}"
    port = 3306
    user = "tidb_migration"
    password = "your-password"

[data-sources.tidb-target]
    host = "10.0.2.1"
    port = 4000
    user = "root"
    password = "your-password"

# Validation task
[task]
    # Output directory
    output-dir = "/data/sync-diff/output"
    
    # Source
    source-instances = ["mysql-rds"]
    
    # Target
    target-instance = "tidb-target"
    
    # Tables to compare; a leading "!" excludes a table
    target-check-tables = [
        "db1.table1",
        "db1.table2",
        "db2.*",
        "!db1.ignore_table",
        "!db1.float_table"    # floating-point data may not compare exactly
    ]
    
    # Per-table settings to apply
    target-configs = ["config1"]

# Per-table settings
[table-configs.config1]
    target-tables = ["db1.table1", "db1.table2"]
    # Rows per comparison chunk
    chunk-size = 1000
    # Columns to ignore (for example, update timestamps)
    ignore-columns = ["updated_at", "version"]

8.1.2 Running the Validation

# Run the data validation
tiup sync-diff-inspector --config=diff-config.toml

# While incremental replication is running, rows in flight can show up as
# differences; rerun the check and treat only persistent differences as real.

# Fix SQL is written under output-dir when export-fix-sql = true is set in the config.

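8.2 Validation Strategy

Stage	Method	Frequency
After the full migration	Full table-by-table validation	Once
During incremental replication	Sampled validation of core tables	Once a day
Before cutover	Full data validation	Before the cutover window
After cutover	Write verification	Continuously for 24h

8.3 Business-level Validation

-- Row count (run on the source)
SELECT COUNT(*) FROM source_db.orders;
-- Row count (run on the target; one statement cannot span both servers)
SELECT COUNT(*) FROM target_db.orders;

-- Key business metrics (run the equivalent query on both sides and compare the results)
SELECT 
    DATE(create_time) AS date,
    COUNT(*) AS order_count,
    SUM(amount) AS total_amount
FROM target_db.orders
GROUP BY DATE(create_time)
ORDER BY date;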
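
Because the source and the target are separate servers, row counts have to be collected from each side and compared on the client. Here is a minimal sketch; the helper names are illustrative, the hosts and accounts are placeholders, and passing the password through the `MYSQL_PWD` environment variable is an assumption of this example.

```shell
#!/usr/bin/env bash
# compare_counts TABLE SOURCE_COUNT TARGET_COUNT
# Prints one report line and returns non-zero when the counts differ.
compare_counts() {
  local table=$1 src=$2 tgt=$3
  if [ "$src" = "$tgt" ]; then
    echo "OK $table rows=$src"
  else
    echo "MISMATCH $table source=$src target=$tgt"
    return 1
  fi
}

# count_rows HOST PORT USER TABLE
# Runs SELECT COUNT(*) through the mysql client and prints the bare number.
# The password is expected in the MYSQL_PWD environment variable.
count_rows() {
  mysql -h "$1" -P "$2" -u "$3" -N -B -e "SELECT COUNT(*) FROM $4"
}

# Usage (hosts and credentials are placeholders):
#   src=$(count_rows rds-host 3306 tidb_migration source_db.orders)
#   tgt=$(count_rows 10.0.2.1 4000 root target_db.orders)
#   compare_counts orders "$src" "$tgt"
compare_counts orders 1000 1000   # prints: OK orders rows=1000
```

Counts taken while writes are still flowing will differ slightly, so this check is most meaningful after source writes have stopped.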

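9. Business Cutover Plan

9.1 Pre-cutover Checklist

9.1.1 Technical Checks

Full data migration completed and imported without errors
Incremental replication lag < 5 seconds
sync-diff-inspector validation passed (core tables 100% consistent)
Application compatibility tests passed (unit tests + integration tests)
Load test results meet business requirements (TPS/QPS no lower than 80% of the source database)
Slow queries optimized, with no performance regression
Monitoring and alerting configured
Backup strategy configured and verified
Contingency plan rehearsed

9.1.2 Business Checks

All application owners have confirmed their participation in the cutover
The cutover time window has been approved
Business teams have prepared regression test cases
The rollback plan has been confirmed

9.2 Phased Cutover Process

Phase 1: Cut over read-only traffic (low risk)

Traffic allocation: TiDB 10% → 50% → 100%
Observation: watch each step for 30 minutes

# DNS switch / load balancer switch
# Shift traffic gradually through the configuration center or DNS weights

Phase 2: Cut over write traffic (core)

Cutover steps:
  1. Stop all write traffic to the source (application-level write-off)
  2. Wait for DM incremental replication lag to reach zero (binlog position caught up)
  3. Run the final data validation (sync-diff-inspector)
  4. Switch the application database connection string to TiDB
  5. Resume write traffic
  6. Monitor continuously for 30 minutes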
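
Step 2 is a polling loop in practice. The sketch below waits until a lag probe reports zero; the probe is passed in as a command name, because how the lag is obtained (for example by reading `secondsBehindMaster` from `dmctl query-status`) depends on the environment. The function name and the probe are hypothetical.

```shell
#!/usr/bin/env bash
# wait_for_zero_lag GET_LAG_CMD MAX_ATTEMPTS SLEEP_SECONDS
# Polls GET_LAG_CMD (any command or function that prints the current lag in
# seconds) until it reports 0. Returns 1 if the lag never reaches 0 in time.
wait_for_zero_lag() {
  local get_lag=$1 max_attempts=$2 sleep_seconds=$3 attempt lag
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    lag=$("$get_lag")
    if [ "$lag" -eq 0 ]; then
      echo "lag is 0 after $attempt check(s)"
      return 0
    fi
    sleep "$sleep_seconds"
  done
  echo "lag still $lag after $max_attempts checks" >&2
  return 1
}

# Usage (get_dm_lag is a hypothetical function that prints the lag in seconds):
#   wait_for_zero_lag get_dm_lag 60 5 && echo "safe to run the final validation"
```

Bounding the number of attempts keeps the cutover window predictable: if the lag does not drain in time, the safer choice is to abort and resume writes on the source.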

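Connection string change example:
  # From
  jdbc:mysql://aliyun-rds.internal:3306/db?useSSL=true
  # To
  jdbc:mysql://tidb-lb.internal:4000/db?useSSL=true&useServerPrepStmts=true&cachePrepStmts=true

Phase 3: Full cutover

After the core workloads are confirmed stable, migrate the remaining non-core workloads step by step.

9.3 TiDB Connection String Best Practices

# JDBC connection string (recommended settings; shown one parameter per line with
# annotations for readability, so join the parameters into a single line before use)
jdbc:mysql://{tidb-lb}:4000/db?
  useSSL=true&
  useServerPrepStmts=true&            # Server-side prepared statements (supported by TiDB)
  cachePrepStmts=true&                # Cache prepared statements on the client
  rewriteBatchedStatements=true&       # Batch-write optimization
  socketTimeout=30000&                # Socket timeout (ms)
  connectTimeout=5000&                # Connect timeout (ms)
  characterEncoding=UTF-8             # Character encoding (Java charset name)

# Connection pool suggestions (HikariCP example)
hikari:
  maximum-pool-size: 100              # TiDB's compute layer is stateless, so the pool can be fairly large
  minimum-idle: 10
  idle-timeout: 600000               # 10 minutes
  max-lifetime: 1800000              # 30 minutes
  connection-timeout: 30000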

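10. Rollback and Contingency Plans

10.1 Rollback Plans

Plan A: Fast rollback (within 1 hour after cutover)

Condition: data written after the cutover is absent or small enough to reconcile
Steps:
  1. Stop writes to TiDB immediately
  2. Switch the application connection string back to Alibaba Cloud RDS
  3. Verify data integrity on the source database
  4. Resume business traffic
Time: < 15 minutes

Plan B: Rollback via reverse replication (longer after cutover)

# DM cannot use TiDB as a source, so reverse replication uses TiCDC with a MySQL sink.
# Requires TiCDC nodes (cdc_servers) in the TiDB cluster and a sink account with write privileges on RDS.
# {cdc-host} is the address of any TiCDC node.
tiup cdc cli changefeed create \
  --server=http://{cdc-host}:8300 \
  --changefeed-id="tidb-to-rds-reverse" \
  --sink-uri="mysql://tidb_migration:your-password@{rds-host}:3306/"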

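10.2 Emergency Scenarios and Handling

Scenario	Symptom	Handling steps
Severe performance degradation	Latency spikes, timeouts	1. Analyze slow queries 2. Adjust indexes 3. Scale out TiDB/TiKV if necessary
Data inconsistency	Row count or content differences	1. Stop writes 2. Locate differences with sync-diff-inspector 3. Repair incrementally or resynchronize
Cluster failure	Node down	1. TiDB fails over automatically 2. Check Raft Leader distribution 3. Recover the failed node
DDL replication blocked	DM task stuck	1. Skip with `binlog skip` (`handle-error skip` before DM v6.0) 2. Run the DDL manually 3. Resume the task
Connection pool exhausted	`Too many connections`	1. Increase `max_connections` 2. Check for connection leaks 3. Restart the offending application

10.3 Rollback Decision Tree

Anomaly detected
    │
    ├─ Core business affected? ─ Yes ─→ Roll back immediately (within 15 minutes)
    │       │
    │       No
    │       │
    ├─ Performance drop > 50%? ─ Yes ─→ Assess the cause → No quick fix? ─ Yes ─→ Roll back
    │       │
    │       No
    │       │
    └─ Data inconsistent? ─ Yes ─→ Repairable? ─ Yes ─→ Repair and continue
            │                          │
            No                         No
            │                          │
        Keep observing             Roll back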
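
The decision tree can be encoded as a small function so that the on-call runbook and the diagram stay in agreement. This is an illustrative sketch: the function name and the yes/no argument convention are made up here, and the "quick fix exists" branch, which the tree leaves implicit, is assumed to mean fix and continue.

```shell
#!/usr/bin/env bash
# rollback_decision CORE_IMPACTED PERF_DROP_PERCENT QUICK_FIX DATA_MISMATCH REPAIRABLE
# Flags are "yes" or "no"; PERF_DROP_PERCENT is an integer.
# Prints one of: rollback, fix-and-continue, observe.
rollback_decision() {
  local core=$1 perf_drop=$2 quick_fix=$3 mismatch=$4 repairable=$5
  if [ "$core" = "yes" ]; then
    echo "rollback"                 # core business affected: roll back at once
  elif [ "$perf_drop" -gt 50 ]; then
    if [ "$quick_fix" = "no" ]; then
      echo "rollback"               # severe slowdown with no quick fix
    else
      echo "fix-and-continue"       # assumption: a quick fix means carry on
    fi
  elif [ "$mismatch" = "yes" ]; then
    if [ "$repairable" = "yes" ]; then
      echo "fix-and-continue"
    else
      echo "rollback"
    fi
  else
    echo "observe"
  fi
}

rollback_decision no 60 no no yes   # prints: rollback
```

The checks run in the same top-to-bottom order as the tree, so a core-business impact always wins over the later branches.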

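11. Performance Tuning Guide

11.1 Schema Optimization

11.1.1 Primary Key Design

-- ❌ Not recommended: a traditional auto-increment primary key (write hotspot)
CREATE TABLE orders (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    ...
);

-- ✅ Recommended: AUTO_RANDOM or SHARD_ROW_ID_BITS
CREATE TABLE orders (
    id BIGINT AUTO_RANDOM PRIMARY KEY,  -- requires TiDB >= v4.0 and a clustered primary key
    ...
);

-- ✅ Or use SHARD_ROW_ID_BITS (applies only to tables without a clustered primary key)
CREATE TABLE orders (
    id BIGINT NOT NULL AUTO_INCREMENT,
    ...,
    PRIMARY KEY (id) NONCLUSTERED
) SHARD_ROW_ID_BITS = 4 PRE_SPLIT_REGIONS = 4;

11.1.2 Index Optimization

-- ❌ Avoid too many secondary indexes in TiDB (high write overhead)
-- ✅ Keep only the indexes that are needed

-- List index definitions to review for redundancy
SELECT 
    table_name,
    index_name,
    seq_in_index,
    column_name
FROM information_schema.statistics
WHERE table_schema = 'your_db'
ORDER BY table_name, index_name, seq_in_index;

-- Drop redundant indexes
-- DROP INDEX idx_redundant ON your_table;

11.1.3 Splitting Large Transactions

-- ❌ Not recommended: one large bulk delete
DELETE FROM logs WHERE created_at < '2024-01-01';

-- ✅ Recommended: process in batches
-- Delete 10000 rows per statement, committing between batches
DELETE FROM logs WHERE created_at < '2024-01-01' LIMIT 10000;
-- Repeat until affected rows = 0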
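
The repeat-until-zero loop can be driven from a shell script. In this sketch the statement runner is passed in as a command name so the loop itself stays independent of connection details; `delete_in_batches` and `delete_old_logs` are illustrative names, the host and account are placeholders, and the password is assumed to come from `MYSQL_PWD` or a client option file.

```shell
#!/usr/bin/env bash
# delete_in_batches DELETE_CMD MAX_BATCHES
# Calls DELETE_CMD repeatedly. DELETE_CMD must run one LIMIT-ed DELETE and
# print the number of affected rows. Stops when a batch affects 0 rows or
# after MAX_BATCHES batches.
delete_in_batches() {
  local delete_cmd=$1 max_batches=$2 batch=0 affected total=0
  while [ "$batch" -lt "$max_batches" ]; do
    affected=$("$delete_cmd") || return 1
    if [ "$affected" -eq 0 ]; then
      break
    fi
    total=$(( total + affected ))
    batch=$(( batch + 1 ))
  done
  echo "deleted $total rows in $batch batches"
}

# Example DELETE_CMD using the mysql client (connection details are placeholders).
# ROW_COUNT() returns the rows affected by the preceding DELETE in the same session.
delete_old_logs() {
  mysql -h 10.0.2.1 -P 4000 -u root -N -B -e \
    "DELETE FROM logs WHERE created_at < '2024-01-01' LIMIT 10000; SELECT ROW_COUNT();"
}

# Usage:
#   delete_in_batches delete_old_logs 1000
```

Each batch commits on its own under autocommit, which keeps every transaction well below the size limit described in section 3.1.2.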

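11.2 SQL Tuning

11.2.1 Common Optimization Patterns

Problem	Optimization
Large-table JOIN	Make sure the join keys are indexed; use the smaller table as the build side
`SELECT *`	Select only the columns you need
`OR` conditions	Rewrite as `UNION ALL` when the branches cannot return the same row (otherwise use `UNION`)
Implicit type conversion	Make sure both sides of a comparison have the same type
Subqueries	Rewrite as JOIN (works for most cases)
`LIMIT` with a large offset	Use a deferred join or cursor-based pagination

11.2.2 Viewing Execution Plans

-- EXPLAIN ANALYZE shows the actual execution plan
EXPLAIN ANALYZE 
SELECT o.*, u.name 
FROM orders o 
JOIN users u ON o.user_id = u.id 
WHERE o.created_at > '2026-01-01';

-- View execution statistics per statement digest
SELECT 
    digest_text,
    sum_latency,
    avg_latency,
    exec_count
FROM information_schema.cluster_statements_summary
WHERE schema_name = 'your_db'
ORDER BY sum_latency DESC
LIMIT 20;

11.3 TiKV Optimization

-- View Region distribution
SHOW TABLE your_db.your_table REGIONS;

-- View hot Regions
SELECT 
    db_name,
    table_name,
    index_name,
    type,
    max_hot_degree,
    flow_bytes
FROM information_schema.tidb_hot_regions
ORDER BY flow_bytes DESC
LIMIT 20;

11.4 TiDB Parameter Tuning

Parameter	Default	Recommended (after migration)	Notes
`tidb_gc_life_time`	10m	10m	Restore after the migration completes
`tidb_store_limit`	0	0	No limit on writes
`tidb_dml_batch_size`	0	20000	Batch DML
`tidb_enable_amend_pessimistic_txn`	OFF	OFF	Deprecated in recent versions; leave at the default
`tidb_txn_mode`	"pessimistic" (new clusters)	"pessimistic"	Pessimistic transaction mode
`tidb_enable_tso_follower_proxy`	OFF	ON only if the PD leader is a TSO bottleneck	TSO load balancing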

12. Operations and Monitoring

12.1 Key TiDB Monitoring Metrics

12.1.1 Core Grafana Panels

Panel	Key metrics	Alert threshold
Overview	QPS, Duration, Failed Query	Duration P99 > 500ms
TiDB	Connection Count, Transaction OPS	Connection > 80% max
TiKV-Details	Region, Scheduler, Raft IO	Abnormal Regions > 100
PD	Store Status, Balance Leader	Store offline > 5m
DM	Syncer Binlog File Gap, Queue	Lag > 5min

12.1.2 Custom Alert Rules

# prometheus-alert-rules.yaml
groups:
  - name: tidb-migration
    rules:
      # Modeled on DM's built-in binlog file gap rule; verify the metric names against your DM version
      - alert: DMHighLag
        expr: dm_syncer_binlog_file{node="master", task="rds-to-tidb-prod"} - ON(instance, task) dm_syncer_binlog_file{node="syncer", task="rds-to-tidb-prod"} > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DM syncer has been more than 5 binlog files behind the upstream for 5 minutes"
          
      - alert: TiKVStoreDown
        expr: up{job="tikv"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "TiKV node has been down for more than 5 minutes"
          
      - alert: TiDBHighQueryDuration
        expr: histogram_quantile(0.99, sum(rate(tidb_server_handle_query_duration_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "TiDB P99 query latency is above 1 second"

12.2 Routine Inspection

-- Daily inspection SQL

-- 1. Check cluster status
SELECT * FROM information_schema.cluster_info;

-- 2. Check store state and Leader/Region distribution
SELECT 
    store_id,
    address,
    store_state_name,
    leader_count,
    region_count
FROM information_schema.tikv_store_status;

-- 3. Check the top 10 slow queries of the last day
SELECT 
    query,
    query_time,
    db
FROM information_schema.cluster_slow_query
WHERE time > DATE_SUB(NOW(), INTERVAL 1 DAY)
ORDER BY query_time DESC
LIMIT 10;

-- 4. Check disk capacity (capacity and available are human-readable strings such as "1.8TiB")
SELECT 
    store_id,
    address,
    capacity,
    available
FROM information_schema.tikv_store_status;

-- 5. Check the tables and indexes with the most Regions
SELECT 
    table_name,
    index_name,
    COUNT(*) as region_count
FROM information_schema.tikv_region_status
GROUP BY table_name, index_name
ORDER BY region_count DESC
LIMIT 20;

12.3 Backup Strategy

# Full backup (BR)
tiup br backup full \
  --pd "10.0.1.1:2379" \
  --storage "s3://backup-bucket/tidb/full/$(date +%Y%m%d)" \
  --s3.endpoint "http://s3.internal" \
  --ratelimit 128 \
  --log-file /var/log/tidb/backup.log

# Incremental backup (BR): --lastbackupts is the end TSO of the previous backup,
# stored here in /tmp/last_backup_ts
tiup br backup full \
  --pd "10.0.1.1:2379" \
  --storage "s3://backup-bucket/tidb/inc/$(date +%Y%m%d%H)" \
  --lastbackupts "$(cat /tmp/last_backup_ts)"

# Logical backup (Dumpling)
tiup dumpling \
  -h 10.0.2.1 -P 4000 -u root -p password \
  --output /backup/dumpling/$(date +%Y%m%d)/ \
  --filetype sql \
  --threads 4

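13. Common Problems and Solutions

13.1 Migration Phase

Problem	Cause	Solution
Dumpling export is slow	Insufficient network bandwidth, too few threads	Increase `--threads`, use CSV format, compress during transfer
Lightning import fails	Insufficient TiKV resources	Lower `region-concurrency`, add TiKV nodes
DM replication lag is high	High binlog volume, too few workers	Increase `worker-count`, check network latency
Source binlog purged	RDS binlog retention too short	Ask Alibaba Cloud to extend retention (up to 168h)
DDL replication fails	Incompatible syntax	Skip with `binlog skip` (`handle-error skip` before DM v6.0), then run a compatible DDL manually
Tables with foreign keys fail to migrate	Foreign keys are not supported or not enforced on the target version	Disable checks in the target session (`SET foreign_key_checks = 0`) or drop the foreign keys and enforce integrity in the application

13.2 Cutover Phase

Problem	Cause	Solution
Connection failures	Network access not opened	Make sure all application IPs are allowed by the security group or firewall rules in front of TiDB
Sudden performance drop	Inaccurate statistics	Run `ANALYZE TABLE`
Write hotspots	Auto-increment primary key / monotonically increasing index	Use `AUTO_RANDOM` or `SHARD_ROW_ID_BITS`
Transaction conflicts	Pessimistic lock contention	Reduce transaction scope, add retry logic

13.3 Operation Phase

Problem	Cause	Solution
High GC pressure	Large amount of old-version data	Keep `tidb_gc_life_time` short and check for long-running transactions that block GC
Uneven Region distribution	Scheduling policy	Inspect with `SHOW TABLE REGIONS` and split manually
OOM	Large queries use too much memory	Limit with `tidb_mem_quota_query`
Individual SQL statements are slow	Missing indexes or stale statistics	Add indexes + `ANALYZE TABLE`

13.4 Support Contacts

Channel	Contact
Official documentation	https://docs.pingcap.com/zh/tidb/stable
Community forum	https://asktug.com
GitHub Issues	https://github.com/pingcap/tidb/issues
Commercial support	Contact the PingCAP technical support team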

14. Appendix

14.1 Command Quick Reference

# ===== TiUP management commands =====
tiup cluster list                            # List all clusters
tiup cluster display tidb-prod               # Show cluster details
tiup cluster start tidb-prod                 # Start the cluster
tiup cluster stop tidb-prod                  # Stop the cluster
tiup cluster restart tidb-prod               # Restart the cluster
tiup cluster edit-config tidb-prod            # Edit the configuration
tiup cluster reload tidb-prod                 # Reload the configuration
tiup cluster upgrade tidb-prod v7.5.0         # Upgrade the version

# ===== DM management commands =====
tiup dm list                                 # List DM clusters
tiup dm display dm-prod                       # Show DM details
tiup dmctl --master-addr {addr}:8261 query-status  # Task status

# ===== Data operations =====
tiup dumpling -h {host} -P 3306 -u {user} -p {pass} -o /data/dump/
tiup tidb-lightning -config lightning.toml
tiup sync-diff-inspector --config diff-config.toml

# ===== BR backup and restore =====
tiup br backup full --pd "{pd}:2379" --storage "s3://bucket/backup"
tiup br restore full --pd "{pd}:2379" --storage "s3://bucket/backup"

# ===== Diagnostic commands =====
tiup cluster check tidb-prod --cluster       # Cluster health check
tiup ctl:v7.5.0 pd -u http://{pd}:2379 store    # Store status (pd-ctl)
tiup ctl:v7.5.0 pd -u http://{pd}:2379 region   # Region information (pd-ctl)

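14.2 TiDB and MySQL Version Compatibility

TiDB version	Compatible MySQL versions	Release year	Notes
v6.5 LTS	5.7 / 8.0	2022	Long-term support release
v7.1 LTS	5.7 / 8.0	2023	Long-term support release, enhanced HTAP
v7.5 LTS	5.7 / 8.0	2023	Long-term support release (recommended)
v8.1 / v8.5 LTS	5.7 / 8.0	2024	Long-term support releases in the v8 series

14.3 Feature Comparison: Alibaba Cloud RDS and TiDB

Feature	Alibaba Cloud RDS MySQL	TiDB	Notes
Automatic backup	✅ Automatic backup	✅ BR tool	Must be configured yourself
SQL audit	✅ DAS audit	Statement and slow query analysis in TiDB Dashboard	Audit logging is an enterprise feature in TiDB
Performance insight	✅ CloudDBA	✅ Grafana + slow queries
Read/write splitting	✅ Read-only instances	✅ Stateless TiDB compute layer	No special configuration needed
Elastic scaling	Upgrade the instance spec	✅ Online scale-out	TiDB is more flexible
Global multi-active	Partially supported	✅ Multi-region deployment with Placement Rules in SQL
TDE encryption	✅ Built-in	✅ Encryption at rest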