Newly deployed TiDB fails on tiup cluster start: pd-2379.service fails to start, timed out waiting for port 2379 to be started after 2m0s

To help us work efficiently, please provide the following information; a clear problem description gets resolved faster:
[TiDB Environment]
[Summary] A newly deployed TiDB cluster reports an error on startup
Error: failed to start pd: failed to start: 192.168.38.64 pd-2379.service, please check the instance's log(/tidb-deploy/pd-2379/log) for more detail.: timed out waiting for port 2379 to be started after 2m0s

[Background] Operations performed so far
Nothing yet; this was the first attempt to start the cluster
[Symptoms] Application and database symptoms
[Business Impact]
[TiDB Version]
[Attachments]
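A minimal way to inspect the log path named in the error, on the PD host 192.168.38.64 (a sketch; the exact file names under the log directory may differ, pd.log is only the usual name):

```shell
# On the PD host; path taken from the error message above
ls -l /tidb-deploy/pd-2379/log/
tail -n 100 /tidb-deploy/pd-2379/log/*.log   # pd.log, if PD got far enough to write one
```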

Contents of topology.yaml:

# Global variables are applied to all deployments and used as the default value of
# the deployments if a specific deployment value is missing.
global:
  user: "tidb"
  ssh_port: 22
  deploy_dir: "/tidb-deploy"
  data_dir: "/tidb-data"

# Monitored variables are applied to all the machines.
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: "/tidb-deploy/monitored-9100"
  data_dir: "/tidb-data/monitored-9100"
  log_dir: "/tidb-deploy/monitored-9100/log"

# Server configs are used to specify the runtime configuration of TiDB components.
# All configuration items can be found in TiDB docs:
# - TiDB: https://pingcap.com/docs/stable/reference/configuration/tidb-server/configuration-file/
# - TiKV: https://pingcap.com/docs/stable/reference/configuration/tikv-server/configuration-file/
# - PD: https://pingcap.com/docs/stable/reference/configuration/pd-server/configuration-file/
# All configuration items use points to represent the hierarchy, e.g:
#   readpool.storage.use-unified-pool
# You can overwrite this configuration via the instance-level config field.

server_configs:
  tidb:
    log.slow-threshold: 300
    binlog.enable: false
    binlog.ignore-error: false
  tikv:
    # server.grpc-concurrency: 4
    # raftstore.apply-pool-size: 2
    # raftstore.store-pool-size: 2
    # rocksdb.max-sub-compactions: 1
    # storage.block-cache.capacity: "16GB"
    # readpool.unified.max-thread-count: 12
    readpool.storage.use-unified-pool: false
    readpool.coprocessor.use-unified-pool: true
  pd:
    replication.enable-placement-rules: true
    schedule.leader-schedule-limit: 4
    schedule.region-schedule-limit: 2048
    schedule.replica-schedule-limit: 64

tidb_servers:
  - host: 192.168.38.63
    ssh_port: 22
    port: 4000
    status_port: 10080
    deploy_dir: "/tidb-deploy/tidb-4000"
    log_dir: "/tidb-deploy/tidb-4000/log"
    numa_node: "0,1"
    # The following configs are used to overwrite the server_configs.tidb values.
    config:
      log.slow-query-file: tidb-slow-overwrited.log

pd_servers:
  - host: 192.168.38.64
    ssh_port: 22
    name: "pd-1"
    client_port: 2379
    peer_port: 2380
    deploy_dir: "/tidb-deploy/pd-2379"
    data_dir: "/tidb-data/pd-2379"
    log_dir: "/tidb-deploy/pd-2379/log"
    numa_node: "0,1"
    # The following configs are used to overwrite the server_configs.pd values.
    config:
      schedule.max-merge-region-size: 20
      schedule.max-merge-region-keys: 200000

tikv_servers:
  - host: 192.168.38.60
    ssh_port: 22
    port: 20160
    status_port: 20180
    deploy_dir: "/tidb-deploy/tikv-20160"
    data_dir: "/tidb-data/tikv-20160"
    log_dir: "/tidb-deploy/tikv-20160/log"
    numa_node: "0,1"
    # The following configs are used to overwrite the server_configs.tikv values.
    config:
      server.grpc-concurrency: 4
      server.labels: { zone: "zone1", dc: "dc1", host: "host1" }
  - host: 192.168.38.61
  - host: 192.168.38.62

tiflash_servers:
  - host: 192.168.38.65

monitoring_servers:
  - host: 192.168.38.64
    ssh_port: 22
    port: 9090
    deploy_dir: "/tidb-deploy/prometheus-8249"
    data_dir: "/tidb-data/prometheus-8249"
    log_dir: "/tidb-deploy/prometheus-8249/log"

grafana_servers:
  - host: 192.168.38.64
    port: 3000
    deploy_dir: /tidb-deploy/grafana-3000

alertmanager_servers:
  - host: 192.168.38.64
    ssh_port: 22
    web_port: 9093
    cluster_port: 9094
    deploy_dir: "/tidb-deploy/alertmanager-9093"
    data_dir: "/tidb-data/alertmanager-9093"
    log_dir: "/tidb-deploy/alertmanager-9093/log"
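Side note: this topology and the target hosts can be validated with tiup's built-in checker before starting the cluster (a sketch, assuming a tiup cluster version that provides the check subcommand; tidb-test is the cluster name used in the deploy command later in this thread):

```shell
# Run the environment checks against the topology file (before deploy)...
tiup cluster check ./topology.yaml --user root -p
# ...or against the already-deployed cluster
tiup cluster check tidb-test --cluster
```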

/tidb-deploy/pd-2379/log
There is no log information in this directory.
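Since the PD log directory is empty, the systemd journal on the PD host may be the only place with output (a sketch, using the unit name from the error message):

```shell
systemctl status pd-2379.service --no-pager
journalctl -u pd-2379.service -n 100 --no-pager   # anything PD printed before it could open its log file
```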

First check the firewall, network connectivity, and whether the port is already in use.
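For example, on the PD host those checks could look like this (a sketch; firewalld is assumed as the firewall, and 2379/2380 are the PD ports from the topology):

```shell
systemctl status firewalld          # is a firewall running at all?
ss -lntp | grep -E '2379|2380'      # is anything already listening on the PD ports?
```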

The firewall is turned off on all the machines.
What exactly should I check about the network?
These are freshly created virtual machines, so the port is not in use.

Network connectivity.

The network looks fine too... every host can ping the others.

Try running systemctl start pd-2379.service manually on the PD host.
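On 192.168.38.64 that would look roughly like this (a sketch; the unit name comes from the error message):

```shell
sudo systemctl start pd-2379.service
systemctl status pd-2379.service --no-pager
ps -ef | grep pd-server      # did a pd-server process actually come up?
```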

Could someone help me figure out what the problem is? I followed the same steps as before, and earlier deployments in the development environment all worked, so I don't know why it failed this time.

Anyone still around?

Could it be that the NTP service is not running?

It still feels like a network problem. After systemctl start, can you see the PD process?

Check the PD log to see what the specific cause of the error is.

The PD log is empty.

What is the NTP service?

As @h5n1 said, after starting the service, run ps to look for the pd-server process. If the process exists, run telnet pdhost pdport from the other two PD hosts to confirm network connectivity.
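Concretely, something along these lines (a sketch; 2379/2380 are the PD client/peer ports from the topology):

```shell
ps -ef | grep pd-server       # on the PD host: is the process there?
telnet 192.168.38.64 2379     # from another host: is the client port reachable?
telnet 192.168.38.64 2380     # and the peer port
```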

Let me try again and see.

If it were a network problem, wouldn't it have failed already when I ran the deployment?
tiup cluster deploy tidb-test v5.2.0 ./topology.yaml --user root -p completed successfully; it is only the start step that times out.

A port conflict can behave exactly like this: deployment succeeds, but starting or stopping the cluster fails. You could try switching to a different port.
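For reference, the PD ports are set in the pd_servers entry of topology.yaml; the values below are only an example (not a recommendation) and would mean redeploying with the changed file:

```yaml
pd_servers:
  - host: 192.168.38.64
    client_port: 12379   # example alternative to 2379
    peer_port: 12380     # example alternative to 2380
```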

I ran into this once before: after NTP went down, the nodes would not start.
https://docs.pingcap.com/zh/tidb/stable/check-before-deployment#检测及安装-ntp-服务
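A quick check along the lines of that doc, on each host (a sketch; whether the daemon is ntpd or chronyd depends on the OS):

```shell
ntpstat                    # reports "synchronised to NTP server ..." when time sync is healthy
systemctl status ntpd      # or: systemctl status chronyd / chronyc tracking
```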