Kubernetes集群监控之exporter

[TOC]

参考

https://prometheus.io/docs/instrumenting/exporters/

blackbox_exporter:

node_exporter

mysql_exporter:

jmx_exporter:

前言

大部分exporter都可以在https://github.com/prometheus中搜到

以及可以使用helm search node-exporter来进行搜索

blackbox_exporter黑盒监测

Prometheus 官方提供的 exporter 之一,可以提供 http、dns、tcp、icmp 的监控数据采集

部署

安装包部署

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
[sss@prometheus01 ]$ cd /usr/local/blackbox_exporter/
[sss@prometheus01 ]$ wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.12.0/blackbox_exporter-0.12.0.linux-amd64.tar.gz
[sss@prometheus01 ]$ tar zxvf blackbox_exporter-0.12.0.linux-amd64.tar.gz
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ cd blackbox_exporter-0.12.0.linux-amd64
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ ll
total 15720
-rwxr-xr-x. 1 1000 1000 16074005 Feb 27 2018 blackbox_exporter
-rw-rw-r--. 1 1000 1000 932 Nov 21 16:05 blackbox.yml
-rw-rw-r--. 1 1000 1000 11357 Feb 27 2018 LICENSE
-rw-rw-r--. 1 1000 1000 94 Feb 27 2018 NOTICE
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$cp -r blackbox_exporter /usr/local/bin
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ cat /etc/supervisord.conf|grep blackbox -A 20
[program:blackbox_exporter]
command=/usr/local/bin/blackbox_exporter --config.file=/usr/local/prometheus/blackbox_exporter/blackbox_exporter-0.12.0.linux-amd64/blackbox.yml
stdout_logfile=/tmp/prometheus/blackbox_exporter.log
autostart=true
autorestart=true
startsecs=5
priority=1
user=root
stopasgroup=true
killasgroup=true
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ supervisorctl status |grep blackbox
blackbox_exporter RUNNING pid 25343, uptime 0:19:25

blackbox的配置文件

  • 通过 blackbox.yml 定义模块详细信息
  • 在 Prometheus 配置文件中引用该模块以及配置被监控目标主机
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
modules:
http_2xx:
prober: http
timeout: 10s
http:
preferred_ip_protocol: "ip4" ##如果http监测是使用ipv4 就要写上,目前国内使用ipv6很少。
http_post_2xx_query: ##用于post请求使用的模块)由于每个接口传参不同 可以定义多个module 用于不同接口(例如此命名为http_post_2xx_query 用于监测query.action接口
prober: http
timeout: 15s
http:
preferred_ip_protocol: "ip4" ##使用ipv4
method: POST
headers:
Content-Type: application/json ##header头
body: '{"hmac":"","params":{"publicFundsKeyWords":"xxx"}}' ##传参
tcp_connect:
prober: tcp
pop3s_banner:
prober: tcp
tcp:
query_response:
- expect: "^+OK"
tls: true
tls_config:
insecure_skip_verify: false
ssh_banner:
prober: tcp
tcp:
query_response:
- expect: "^SSH-2.0-"
irc_banner:
prober: tcp
tcp:
query_response:
- send: "NICK prober"
- send: "USER prober prober prober :prober"
- expect: "PING :([^ ]+)"
send: "PONG ${1}"
- expect: "^:[^ ]+ 001"
icmp:
prober: icmp
timeout: 5s
icmp:

在kubernetes集群中部署

blackbox_exporter-deploy.yml

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: prometheus-blackbox-exporter
namespace: monitoring
labels:
k8s-app: prometheus-blackbox-exporter
spec:
selector:
matchLabels:
k8s-app: prometheus-blackbox-exporter
replicas: 1
template:
metadata:
labels:
k8s-app: prometheus-blackbox-exporter
spec:
restartPolicy: Always
containers:
- name: prometheus-blackbox-exporter
image: prom/blackbox-exporter:v0.16.0
imagePullPolicy: IfNotPresent
ports:
- name: blackbox-port
containerPort: 9115
readinessProbe:
tcpSocket:
port: 9115
initialDelaySeconds: 5
timeoutSeconds: 5
resources:
requests:
memory: 50Mi
cpu: 50m
limits:
memory: 60Mi
cpu: 100m
volumeMounts:
- name: config
mountPath: /etc/blackbox_exporter
args:
- --config.file=/etc/blackbox_exporter/blackbox.yml
- --log.level=debug
- --web.listen-address=:9115
volumes:
- name: config
configMap:
name: prometheus-blackbox-exporter

---
apiVersion: v1
kind: Service
metadata:
labels:
k8s-app: prometheus-blackbox-exporter
name: prometheus-blackbox-exporter
namespace: monitoring
annotations:
prometheus.io/scrape: 'true'
spec:
type: ClusterIP
selector:
k8s-app: prometheus-blackbox-exporter
ports:
- name: blackbox
port: 9115
targetPort: 9115
protocol: TCP

---
apiVersion: v1
kind: ConfigMap
metadata:
labels:
k8s-app: prometheus-blackbox-exporter
name: prometheus-blackbox-exporter
namespace: monitoring
data:
blackbox.yml: |-
modules:
http_2xx:
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: []
method: GET
preferred_ip_protocol: "ip4"
http_post_2xx: # http post 监测模块
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
method: POST
preferred_ip_protocol: "ip4"
tcp_connect:
prober: tcp
timeout: 10s
icmp:
prober: icmp
timeout: 10s
icmp:
preferred_ip_protocol: "ip4"

应用场景

  • HTTP 测试

    定义 Request Header 信息
    判断 Http status / Http Respones Header / Http Body 内容

  • TCP 测试

    业务组件端口状态监听

    应用层协议定义与监听

  • ICMP 测试

    主机探活机制

  • POST 测试

    接口联通性

  • SSL 证书过期时间

http测试

  • 相关代码块添加到 Prometheus 文件内
  • 对应 blackbox.yml文件的 http_2xx 模块
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
- job_name: 'blackbox_http_2xx'
scrape_interval: 45s
metrics_path: /probe
params:
module: [http_2xx] # Look for a HTTP 200 response.
static_configs:
- targets:
- https://www.baidu.com/
- 172.0.0.1:9090
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 10.XXX.XX.XX:9115 # The blackbox exporter's real hostname:port.

TCP 测试

  • 监听 业务端口地址,用来判断服务是否在线,我觉的和telnet 差不多
  • 相关代码块添加到 Prometheus 文件内
  • 对应 blackbox.yml文件的 tcp_connect 模块
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
- job_name: "blackbox_telnet_port]"
scrape_interval: 5s
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets: [ '1x3.x1.xx.xx4:443' ]
labels:
group: 'xxxidc机房ip监控'
- targets: ['10.xx.xx.xxx:443']
labels:
group: 'Process status of nginx(main) server'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 10.xxx.xx.xx:9115

ICMP 测试

  • 相关代码块添加到 Prometheus 配置文件内
  • 对应 blackbox.yml文件的 icmp 模块
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
- job_name: 'blackbox00_ping_idc_ip'
scrape_interval: 10s
metrics_path: /probe
params:
module: [icmp] #ping
static_configs:
- targets: [ '1x.xx.xx.xx' ]
labels:
group: 'xxnginx 虚拟IP'
relabel_configs:
- source_labels: [__address__]
regex: (.*)(:80)?
target_label: __param_target
replacement: ${1}
- source_labels: [__param_target]
regex: (.*)
target_label: ping
replacement: ${1}
- source_labels: []
regex: .*
target_label: __address__
replacement: 1x.xxx.xx.xx:9115

POST 测试

  • 监听业务接口地址,用来判断接口是否在线
  • 相关代码块添加到 Prometheus 文件内
  • 对应 blackbox.yml文件的 http_post_2xx_query 模块(监听query.action这个接口)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
- job_name: 'blackbox_http_2xx_post'
scrape_interval: 10s
metrics_path: /probe
params:
module: [http_post_2xx_query]
static_configs:
- targets:
- https://xx.xxx.com/api/xx/xx/fund/query.action
labels:
group: 'Interface monitoring'
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 1x.xx.xx.xx:9115 # The blackbox exporter's real hostname:port.

查看监听过程

类似于

1
curl http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true

告警应用测试

icmp、tcp、http、post 监测是否正常可以观察probe_success 这一指标
probe_success == 0 ##联通性异常
probe_success == 1 ##联通性正常
告警也是判断这个指标是否等于0,如等于0 则触发异常报警

1
2
3
4
5
6
7
8
9
10
11
12
[sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules 
groups:
- name: blackbox_network_stats
rules:
- alert: blackbox_network_stats
expr: probe_success == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Instance {{ $labels.instance }} is down"
description: "This requires immediate action!"

SSL 证书过期时间监测

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
cat << 'EOF' > prometheus.yml
rule_files:
- ssl_expiry.rules
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx] # Look for a HTTP 200 response.
static_configs:
- targets:
- example.com # Target to probe
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: 127.0.0.1:9115 # Blackbox exporter.
EOF
cat << 'EOF' > ssl_expiry.rules
groups:
- name: ssl_expiry.rules
rules:
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
for: 10m
EOF

自定义探针测试

HTTP服务通常会以不同的形式对外展现,有些可能就是一些简单的网页,而有些则可能是一些基于REST的API服务。 对于不同类型的HTTP的探测需要管理员能够对HTTP探针的行为进行更多的自定义设置,包括:HTTP请求方法、HTTP头信息、请求参数等。对于某些启用了安全认证的服务还需要能够对HTTP探测设置相应的Auth支持。对于HTTPS类型的服务还需要能够对证书进行自定义设置。

如下所示,这里通过method定义了探测时使用的请求方法,对于一些需要请求参数的服务,还可以通过headers定义相关的请求头信息,使用body定义请求内容:

1
2
3
4
5
6
7
8
http_post_2xx:
prober: http
timeout: 5s
http:
method: POST
headers:
Content-Type: application/json
body: '{}'

如果HTTP服务启用了安全认证,Blockbox Exporter内置了对basic_auth的支持,可以直接设置相关的认证信息即可:

1
2
3
4
5
6
7
8
9
10
http_basic_auth_example:
prober: http
timeout: 5s
http:
method: POST
headers:
Host: "login.example.com"
basic_auth:
username: "username"
password: "mysecret"

对于使用了Bear Token的服务也可以通过bearer_token配置项直接指定令牌字符串,或者通过bearer_token_file指定令牌文件。

对于一些启用了HTTPS的服务,但是需要自定义证书的服务,可以通过tls_config指定相关的证书信息:

1
2
3
4
5
6
http_custom_ca_example:
prober: http
http:
method: GET
tls_config:
ca_file: "/certs/my_cert.crt"

自定义探针行为

在默认情况下HTTP探针只会对HTTP返回状态码进行校验,如果状态码为2XX(200 <= StatusCode < 300)则表示探测成功,并且探针返回的指标probe_success值为1。

如果用户需要指定HTTP返回状态码,或者对HTTP版本有特殊要求,如下所示,可以使用valid_http_versions和valid_status_codes进行定义:

1
2
3
4
5
6
http_2xx_example:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2"]
valid_status_codes: []

默认情况下,Blockbox返回的样本数据中也会包含指标probe_http_ssl,用于表明当前探针是否使用了SSL:

1
2
3
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0

而如果用户对于HTTP服务是否启用SSL有强制的标准。则可以使用fail_if_ssl和fail_if_not_ssl进行配置。fail_if_ssl为true时,表示如果站点启用了SSL则探针失败,反之成功。fail_if_not_ssl刚好相反。

1
2
3
4
5
6
7
8
9
http_2xx_example:
prober: http
timeout: 5s
http:
valid_status_codes: []
method: GET
no_follow_redirects: false
fail_if_ssl: false
fail_if_not_ssl: false

除了基于HTTP状态码,HTTP协议版本以及是否启用SSL作为控制探针探测行为成功与否的标准以外,还可以匹配HTTP服务的响应内容。使用fail_if_matches_regexp和fail_if_not_matches_regexp用户可以定义一组正则表达式,用于验证HTTP返回内容是否符合或者不符合正则表达式的内容。

1
2
3
4
5
6
7
8
9
http_2xx_example:
prober: http
timeout: 5s
http:
method: GET
fail_if_matches_regexp:
- "Could not connect to database"
fail_if_not_matches_regexp:
- "Download the latest version here"

最后需要提醒的时,默认情况下HTTP探针会走IPV6的协议。 在大多数情况下,可以使用preferred_ip_protocol=ip4强制通过IPV4的方式进行探测。在Bloackbox响应的监控样本中,也会通过指标probe_ip_protocol,表明当前的协议使用情况:

1
2
3
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6

除了支持对HTTP协议进行网络探测以外,Blackbox还支持对TCP、DNS、ICMP等其他网络协议,感兴趣的读者可以从Blackbox的Github项目中获取更多使用信息.

node_exporter

功能对照表

默认开启的功能

名称 说明 系统
arp /proc/net/arp 中收集 ARP 统计信息 Linux
conntrack /proc/sys/net/netfilter/ 中收集 conntrack 统计信息 Linux
cpu 收集 cpu 统计信息 Darwin, Dragonfly,FreeBSD, Linux
diskstats /proc/diskstats 中收集磁盘 I/O 统计信息 Linux
edac 错误检测与纠正统计信息 Linux
entropy 可用内核熵信息 Linux
exec execution 统计信息 Dragonfly, FreeBSD
filefd /proc/sys/fs/file-nr 中收集文件描述符统计信息 Linux
filesystem 文件系统统计信息,例如磁盘已使用空间 Darwin, Dragonfly,FreeBSD, Linux, OpenBSD
hwmon /sys/class/hwmon/ 中收集监控器或传感器数据信息 Linux
infiniband 从 InfiniBand 配置中收集网络统计信息 Linux
loadavg 收集系统负载信息 Darwin, Dragonfly, FreeBSD,Linux, NetBSD, OpenBSD, Solaris
mdadm /proc/mdstat 中获取设备统计信息 Linux
meminfo 内存统计信息 Darwin, Dragonfly,FreeBSD, Linux
netdev 网口流量统计信息,单位 bytes Darwin, Dragonfly,FreeBSD, Linux, OpenBSD
netstat /proc/net/netstat 收集网络统计数据,等同于 netstat -s Linux
sockstat /proc/net/sockstat 中收集 socket 统计信息 Linux
stat /proc/stat 中收集各种统计信息,包含系统启动时间,forks, 中断等 Linux
textfile 通过 --collector.textfile.directory参数指定本地文本收集路径,收集文本信息 any
time 系统当前时间 any
uname 通过 uname 系统调用, 获取系统信息 any
vmstat /proc/vmstat 中收集统计信息 Linux
wifi 收集 wifi 设备相关统计数据 Linux
xfs 收集 xfs 运行时统计信息 Linux (kernel 4.4+)
zfs 收集 zfs 性能统计信息 Linux

默认关闭的功能

名称 说明 系统
bonding 收集系统配置以及激活的绑定网卡数量 Linux
buddyinfo /proc/buddyinfo 中收集内存碎片统计信息 Linux
devstat 收集设备统计信息 Dragonfly, FreeBSD
drbd 收集远程镜像块设备(DRBD)统计信息 Linux
interrupts 收集更具体的中断统计信息 Linux,OpenBSD
ipvs /proc/net/ip_vs 中收集 IPVS 状态信息,从 /proc/net/ip_vs_stats 获取统计信息 Linux
ksmd /sys/kernel/mm/ksm 中获取内核和系统统计信息 Linux
logind logind 中收集会话统计信息 Linux
meminfo_numa /proc/meminfo_numa 中收集内存统计信息 Linux
mountstats /proc/self/mountstat 中收集文件系统统计信息,包括 NFS 客户端统计信息 Linux
nfs /proc/net/rpc/nfs 中收集 NFS 统计信息,等同于 nfsstat -c Linux
qdisc 收集队列推定统计信息 Linux
runit 收集 runit 状态信息 any
supervisord 收集 supervisord 状态信息 any
systemd systemd 中收集设备系统状态信息 Linux
tcpstat /proc/net/tcp/proc/net/tcp6 收集 TCP 连接状态信息 Linux

注意:我们可以使用 --collectors.enabled 运行参数指定 node_exporter 收集的功能模块, 如果不指定,将使用默认模块。

安装

下载地址:https://prometheus.io/download/#node_exporter

安装包安装

安装node exporter

1
2
3
tar -zxvf node_exporter-0.16.0.linux-amd64.tar.gz

mv node_exporter-0.16.0.linux-amd64 /usr/local/node_exporter

创建systemd服务

1
2
3
4
5
6
7
8
9
10
11
12
13
14
vim /etc/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

启动node_exporter

1
2
3
4
systemctl daemon-reload
systemctl start node_exporter
systemctl status node_exporter
systemctl enable node_exporter

验证启动成功

1
curl 127.0.0.1:9100/metrics

kubernetes安装

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
name: prometheus-node-exporter
namespace: monitoring
labels:
k8s-app: prometheus-node-exporter
spec:
template:
metadata:
name: prometheus-node-exporter
labels:
k8s-app: prometheus-node-exporter
spec:
containers:
- image: prom/node-exporter:v0.18.0
imagePullPolicy: IfNotPresent
name: prometheus-node-exporter
ports:
- name: prom-node-exp
containerPort: 9100
hostPort: 9100
livenessProbe:
failureThreshold: 3
httpGet:
path: /
port: 9100
scheme: HTTP
readinessProbe:
failureThreshold: 3
httpGet:
path: /
port: 9100
scheme: HTTP
resources:
limits:
cpu: 20m
memory: 2Gi
requests:
cpu: 10m
memory: 1Gi
dnsPolicy: ClusterFirst
hostNetwork: true
hostPID: true

---
apiVersion: v1
kind: Service
metadata:
annotations:
prometheus.io/scrape: 'true'
prometheus.io/app-metrics: 'true'
prometheus.io/app-metrics-path: '/metrics'
name: prometheus-node-exporter
namespace: monitoring
labels:
k8s-app: prometheus-node-exporter
spec:
ports:
- name: prometheus-node-exporter
port: 9100
protocol: TCP
selector:
k8s-app: prometheus-node-exporter
type: ClusterIP

配置

可以利用 Prometheus 的 static_configs 来拉取 node_exporter 的数据。

1
2
3
- job_name: 'node'
static_configs:
- targets: ['localhost:9100']

重启prometheus,然后在Prometheus页面中的Targets中就能看到新加入的node

常用查询语句

收集到 node_exporter 的数据后,我们可以使用 PromQL 进行一些业务查询和监控,下面是一些比较常见的查询

以下查询均以单个节点作为例子,如果大家想查看所有节点,将 instance="xxx" 去掉即可。

cpu使用率

1
100 - (avg by (instance) (irate(node_cpu{instance="172.16.8.153:9100", mode="idle"}[5m])) * 100)

CPU各个mode使用率

1
avg by (instance, mode) (irate(node_cpu{instance="172.16.8.153:9100"}[5m])) * 100
  • User:CPU一共花了多少比例的时间运行在用户态空间或者说是用户进程(running user space processes)。典型的用户态空间程序有:Shells、数据库、web服务器等
  • Nice:可理解为,用户空间进程的CPU的调度优先级,范围为[-20,19]
  • System:System的含义与User相似。System表示:CPU花了多少比例的时间在内核空间运行。分配内存、IO操作、创建子进程……都是内核操作。这也表明,当IO操作频繁时,System参数会很高
  • ioWait:在计算机中,读写磁盘的操作远比CPU运行的速度要慢,CPU负载处理数据,而数据一般在磁盘上需要读到内存中才能处理。当CPU发起读写操作后,需要等着磁盘驱动器将数据读入内存,从而导致CPU 在等待的这一段时间内无事可做。CPU处于这种等待状态的时间由Wait参数来衡量
  • Idle:CPU处于空闲状态时间比例。一般而言,idel + user + nice 约等于100%

机器平均负载

1
2
3
node_load1{instance="172.16.8.153:9100"}   // 1分钟负载
node_load5{instance="172.16.8.153:9100"} // 5分钟负载
node_load15{instance="172.16.8.153:9100"} // 15分钟负载

内存使用率

1
100-(node_memory_MemFree{instance="172.16.8.172:9100"}+node_memory_Cached{instance="172.16.8.172:9100"}+node_memory_Buffers{instance="172.16.8.172:9100"})/node_memory_MemTotal{instance="172.16.8.172:9100"} * 100

磁盘使用率

1
100 - node_filesystem_free{instance="172.16.8.153:9100",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} / node_filesystem_size{instance="172.16.8.153:9100",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} * 100

网卡出入包

1
2
3
4
5
// 入包量
sum by (instance) (rate(node_network_receive_bytes{instance="172.16.8.153:9100",device!="lo"}[5m]))

// 出包量
sum by (instance) (rate(node_network_transmit_bytes{instance="172.16.8.153:9100",device!="lo"}[5m]))

dashboard模板

可以在grafana官网中搜索对应模板

Mysql exporter

安装

使用helm安装部署

1
helm --fetch mysql_exporter

拉取下来之后修改values文件

授权

mysqld_exporter需要连接Mysql,首先为它创建用户并赋予所需要的权限:

1
2
3
CREATE USER 'exporter'@'%' IDENTIFIED BY '123456' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
flush privileges;

prometheus配置

1
2
3
4
- job_name: 'mysql'
static_configs:
- targets:
- localhost:9104

验证状态

1
curl localhost:9104/metrics

jmx_exporter

举例监控kafka

jmx_prometheus_javaagent 方式收集kafka指标

下载jmx_prometheus_javaagent和kafka.yml

1
2
wget https://raw.githubusercontent.com/prometheus/jmx_exporter/master/example_configs/kafka-0-8-2.yml
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.6/jmx_prometheus_javaagent-0.6.jar

编辑kafka的启动文件 kafka-server-start.sh

添加几行代码:

1
2
export JMX_PORT="9999"
export KAFKA_OPTS="-javaagent:/path/jmx_prometheus_javaagent-0.6.jar=9991:/path/kafka-0-8-2.yml"

然后重启kafka。
访问 http://localhost:9991/metrics 可以看到各种指标了。

举例监控hadoop

下载jar包

1
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.3.1/jmx_prometheus_javaagent-0.3.1.jar

创建配置文件

创建namenode.yaml(datanode.yaml)放在任意位置,内容为你想要的metrics

1
2
3
4
5
6
7
---
startDelaySeconds: 0
hostPort: master:1234 #master为本机IP(一般可设置为localhost);1234为想设置的jmx端口(可设置为未被占用的端口)
#jmxUrl: service:jmx:rmi:///jndi/rmi://127.0.0.1:1234/jmxrmi
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false

参数说明

Name Description
startDelaySeconds start delay before serving requests. Any requests within the delay period will result in an empty metrics set.
hostPort The host and port to connect to via remote JMX. If neither this nor jmxUrl is specified, will talk to the local JVM.
username The username to be used in remote JMX password authentication.
password The password to be used in remote JMX password authentication.
jmxUrl A full JMX URL to connect to. Should not be specified if hostPort is.
ssl Whether JMX connection should be done over SSL. To configure certificates you have to set following system properties: -Djavax.net.ssl.keyStore=/home/user/.keystore -Djavax.net.ssl.keyStorePassword=changeit -Djavax.net.ssl.trustStore=/home/user/.truststore -Djavax.net.ssl.trustStorePassword=changeit
lowercaseOutputName Lowercase the output metric name. Applies to default format and name. Defaults to false.
lowercaseOutputLabelNames Lowercase the output metric label names. Applies to default format and labels. Defaults to false.
whitelistObjectNames A list of ObjectNames to query. Defaults to all mBeans.
blacklistObjectNames A list of ObjectNames to not query. Takes precedence over whitelistObjectNames. Defaults to none.
rules A list of rules to apply in order, processing stops at the first matching rule. Attributes that aren’t matched aren’t collected. If not specified, defaults to collecting everything in the default format.
pattern Regex pattern to match against each bean attribute. The pattern is not anchored. Capture groups can be used in other options. Defaults to matching everything.
attrNameSnakeCase Converts the attribute name to snake case. This is seen in the names matched by the pattern and the default format. For example, anAttrName to an_attr_name. Defaults to false.
name The metric name to set. Capture groups from the pattern can be used. If not specified, the default format will be used. If it evaluates to empty, processing of this attribute stops with no output.
value Value for the metric. Static values and capture groups from the pattern can be used. If not specified the scraped mBean value will be used.
valueFactor Optional number that value (or the scraped mBean value if value is not specified) is multiplied by, mainly used to convert mBean values from milliseconds to seconds.
labels A map of label name to label value pairs. Capture groups from pattern can be used in each. name must be set to use this. Empty names and values are ignored. If not specified and the default format is not being used, no labels are set.
help Help text for the metric. Capture groups from pattern can be used. name must be set to use this. Defaults to the mBean attribute decription and the full name of the attribute.
type The type of the metric, can be GAUGE, COUNTER or UNTYPED. name must be set to use this. Defaults to UNTYPED.

修改$HADOOP_HOME/etc/hadoop/hadoop-env.sh

NameNode节点添加:

1
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false   -Dcom.sun.management.jmxremote.port=1234 $HADOOP_NAMENODE_OPTS "

DataNode节点添加:

1
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false   -Dcom.sun.management.jmxremote.port=1235 $HADOOP_DATANODE_OPTS "

ps:

端口1234(1235)要与之前设置的jmx端口保持一致

修改$HADOOP_HOME/bin/hdfs

NameNode:

1
export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -javaagent:/home/hadoop/jmx_prometheus_javaagent-0.3.1.jar=9200:/home/hadoop/namenode.yaml"

DataNode:

1
export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -javaagent:/home/hadoop/jmx_prometheus_javaagent-0.3.1.jar=9300:/home/hadoop/datanode.yaml"

提示:9200(9300)为jmx_exporter提供metrics数据端口,后续Prometheus从此端口获取数据

访问http://master:9200/metrics就能获得需要的metrics数据:

1
2
3
4
5
6
7
8
9
10
# HELP jvm_buffer_pool_used_bytes Used bytes of a given JVM buffer pool.
# TYPE jvm_buffer_pool_used_bytes gauge
jvm_buffer_pool_used_bytes{pool="direct",} 1181032.0
jvm_buffer_pool_used_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_capacity_bytes Bytes capacity of a given JVM buffer pool.
# TYPE jvm_buffer_pool_capacity_bytes gauge
jvm_buffer_pool_capacity_bytes{pool="direct",} 1181032.0
jvm_buffer_pool_capacity_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_used_buffers Used buffers of a given JVM buffer pool.
...
--------------------本文结束,感谢您的阅读--------------------