[TOC]

References

https://prometheus.io/docs/instrumenting/exporters/

blackbox_exporter:

node_exporter

mysql_exporter:

https://github.com/prometheus/mysqld_exporter

jmx_exporter:

Preface

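Most exporters can be found under https://github.com/prometheus.

You can also search for a chart with helm search node-exporter (Helm 2 syntax; Helm 3 uses helm search repo node-exporter).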

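blackbox_exporter: black-box probing

One of the official Prometheus exporters. It collects monitoring data by probing targets over HTTP, DNS, TCP, and ICMP.

Deployment

Deploying from the release tarball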

[sss@prometheus01 ]$ cd /usr/local/blackbox_exporter/
[sss@prometheus01 ]$ wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.12.0/blackbox_exporter-0.12.0.linux-amd64.tar.gz 
[sss@prometheus01 ]$ tar zxvf blackbox_exporter-0.12.0.linux-amd64.tar.gz
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ cd blackbox_exporter-0.12.0.linux-amd64
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ ll
total 15720
-rwxr-xr-x. 1 1000 1000 16074005 Feb 27  2018 blackbox_exporter
-rw-rw-r--. 1 1000 1000      932 Nov 21 16:05 blackbox.yml
-rw-rw-r--. 1 1000 1000    11357 Feb 27  2018 LICENSE
-rw-rw-r--. 1 1000 1000       94 Feb 27  2018 NOTICE
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ cp -r blackbox_exporter /usr/local/bin
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ cat /etc/supervisord.conf|grep blackbox -A 20
[program:blackbox_exporter]
command=/usr/local/bin/blackbox_exporter   --config.file=/usr/local/prometheus/blackbox_exporter/blackbox_exporter-0.12.0.linux-amd64/blackbox.yml 
stdout_logfile=/tmp/prometheus/blackbox_exporter.log
autostart=true
autorestart=true
startsecs=5
priority=1
user=root
stopasgroup=true
killasgroup=true
[sss@prometheus01 blackbox_exporter-0.12.0.linux-amd64]$ supervisorctl  status |grep blackbox
blackbox_exporter                RUNNING   pid 25343, uptime 0:19:25

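The blackbox configuration file

Modules are defined in detail in blackbox.yml.
The Prometheus configuration file then references a module and lists the target hosts to probe.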

modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      preferred_ip_protocol: "ip4" # set this if HTTP probes should use IPv4; IPv6 is still rarely used in China
  http_post_2xx_query: # module for POST requests; each endpoint takes different parameters, so define one module per endpoint (this one, http_post_2xx_query, probes the query.action endpoint)
    prober: http
    timeout: 15s
    http:
      preferred_ip_protocol: "ip4" # use IPv4
      method: POST
      headers:
        Content-Type: application/json # request header
      body: '{"hmac":"","params":{"publicFundsKeyWords":"xxx"}}' # request body
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
    timeout: 5s
    icmp:

Deploying in a Kubernetes cluster

blackbox_exporter-deploy.yml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-blackbox-exporter
  namespace: monitoring
  labels:
    k8s-app: prometheus-blackbox-exporter
spec:
  selector:
    matchLabels:
      k8s-app: prometheus-blackbox-exporter
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: prometheus-blackbox-exporter
    spec:
      restartPolicy: Always
      containers:
      - name: prometheus-blackbox-exporter
        image: prom/blackbox-exporter:v0.16.0
        imagePullPolicy: IfNotPresent
        ports:
        - name: blackbox-port
          containerPort: 9115
        readinessProbe:
          tcpSocket:
            port: 9115
          initialDelaySeconds: 5
          timeoutSeconds: 5
        resources:
          requests:
            memory: 50Mi
            cpu: 50m
          limits:
            memory: 60Mi
            cpu: 100m
        volumeMounts:
        - name: config
          mountPath: /etc/blackbox_exporter
        args:
        - --config.file=/etc/blackbox_exporter/blackbox.yml
        - --log.level=debug
        - --web.listen-address=:9115
      volumes:
      - name: config
        configMap:
          name: prometheus-blackbox-exporter
          
---
apiVersion: v1
kind: Service
metadata:
  labels:
    k8s-app: prometheus-blackbox-exporter
  name: prometheus-blackbox-exporter
  namespace: monitoring
  annotations:
    prometheus.io/scrape: 'true'
spec:
  type: ClusterIP
  selector:
    k8s-app: prometheus-blackbox-exporter
  ports:
  - name: blackbox
    port: 9115
    targetPort: 9115
    protocol: TCP
    
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    k8s-app: prometheus-blackbox-exporter
  name: prometheus-blackbox-exporter
  namespace: monitoring
data:
  blackbox.yml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 10s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          valid_status_codes: []
          method: GET
          preferred_ip_protocol: "ip4"
      http_post_2xx: # HTTP POST probe module
        prober: http
        timeout: 10s
        http:
          valid_http_versions: ["HTTP/1.1", "HTTP/2"]
          method: POST
          preferred_ip_protocol: "ip4"
      tcp_connect:
        prober: tcp
        timeout: 10s
      icmp:
        prober: icmp
        timeout: 10s
        icmp:
          preferred_ip_protocol: "ip4"

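Use cases

- HTTP test
  - Define request header information
  - Check the HTTP status, HTTP response headers, and HTTP body content
- TCP test
  - Monitor the port status of business components
  - Define and monitor application-layer protocols
- ICMP test
  - Host liveness checks
- POST test
  - Endpoint connectivity
- SSL certificate expiry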

HTTP test

Add the following block to the Prometheus configuration file.
It corresponds to the http_2xx module in blackbox.yml.

- job_name: 'blackbox_http_2xx'
  scrape_interval: 45s
  metrics_path: /probe
  params:
    module: [http_2xx]  # Look for a HTTP 200 response.
  static_configs:
      - targets:
        - https://www.baidu.com/
        - 172.0.0.1:9090
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.XXX.XX.XX:9115  # The blackbox exporter's real hostname:port.
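
The three relabel rules above can be hard to follow at first. As a rough sketch in plain Python (this is not Prometheus code; the function name `relabel_for_blackbox` and the exporter address `10.0.0.1:9115` are made up for illustration), they amount to the following label rewriting:

```python
def relabel_for_blackbox(target: str, exporter_addr: str) -> dict:
    """Mimic the three relabel rules used in the scrape job above."""
    labels = {"__address__": target}
    # Rule 1: copy the original target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the probed target as the instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at the blackbox exporter.
    labels["__address__"] = exporter_addr
    return labels


# Hypothetical exporter address, used only for this illustration.
labels = relabel_for_blackbox("https://www.baidu.com/", "10.0.0.1:9115")
print(labels)
```

The net effect is that Prometheus scrapes the exporter while the time series stay labelled with the probed target.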

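TCP test

Probe a service's port to determine whether the service is up; in my view this is much like a telnet check.
Add the following block to the Prometheus configuration file.
It corresponds to the tcp_connect module in blackbox.yml.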

- job_name: "blackbox_telnet_port"
  scrape_interval: 5s
  metrics_path: /probe
  params:
    module: [tcp_connect]
  static_configs:
      - targets: [ '1x3.x1.xx.xx4:443' ]
        labels:
          group: 'xxx IDC datacenter IP monitoring'
      - targets: ['10.xx.xx.xxx:443']
        labels:
          group: 'Process status of nginx(main) server'
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 10.xxx.xx.xx:9115

ICMP test

Add the following block to the Prometheus configuration file.
It corresponds to the icmp module in blackbox.yml.

- job_name: 'blackbox00_ping_idc_ip'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [icmp]  #ping
  static_configs:
      - targets: [ '1x.xx.xx.xx' ]
        labels:
          group: 'xx nginx virtual IP'
  relabel_configs:
      - source_labels: [__address__]
        regex: (.*)(:80)?
        target_label: __param_target
        replacement: ${1}
      - source_labels: [__param_target]
        regex: (.*)
        target_label: ping
        replacement: ${1}
      - source_labels: []
        regex: .*
        target_label: __address__
        replacement: 1x.xxx.xx.xx:9115
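
One detail of the first rule deserves a check. Prometheus anchors relabel regexes at both ends, and `(.*)` is greedy, so if a target ever carried a `:80` suffix the first group would still capture it. The sketch below uses Python's `re.fullmatch` to stand in for that anchored match; the address `10.0.0.8:80` is a made-up example and the lazy variant is only one possible alternative:

```python
import re

greedy = re.compile(r"(.*)(:80)?")   # pattern used in the job above
lazy = re.compile(r"(.*?)(:80)?")    # alternative that lets the suffix group match

addr = "10.0.0.8:80"  # hypothetical target with an explicit port

greedy_host = greedy.fullmatch(addr).group(1)
lazy_host = lazy.fullmatch(addr).group(1)

print(greedy_host)  # the port is still part of group 1
print(lazy_host)    # the port has been split off
```

For targets written without a port, as in the job above, both patterns return the address unchanged.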

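POST test

Probe a business API endpoint to determine whether it is up.
Add the following block to the Prometheus configuration file.
It corresponds to the http_post_2xx_query module in blackbox.yml (which probes the query.action endpoint).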

- job_name: 'blackbox_http_2xx_post'
  scrape_interval: 10s
  metrics_path: /probe
  params:
    module: [http_post_2xx_query]
  static_configs:
      - targets:
        - https://xx.xxx.com/api/xx/xx/fund/query.action
        labels:
          group: 'Interface monitoring'
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 1x.xx.xx.xx:9115  # The blackbox exporter's real hostname:port.

Inspecting a probe

For example:

curl 'http://172.16.10.65:9115/probe?target=prometheus.io&module=http_2xx&debug=true'
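
When the target is itself a URL, its special characters need escaping in the query string. A minimal sketch, assuming the exporter address from the example above; `probe_url` is a hypothetical helper, not part of any Prometheus tooling:

```python
from urllib.parse import urlencode


def probe_url(exporter: str, target: str, module: str, debug: bool = False) -> str:
    """Build the /probe URL for a blackbox exporter listening at `exporter`."""
    params = {"target": target, "module": module}
    if debug:
        params["debug"] = "true"
    # urlencode percent-escapes characters such as ':' and '/' in the target.
    return f"http://{exporter}/probe?{urlencode(params)}"


url = probe_url("172.16.10.65:9115", "prometheus.io", "http_2xx", debug=True)
print(url)
```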

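Alerting on probe results

Whether an ICMP, TCP, HTTP, or POST probe succeeded is reported by the probe_success metric:
probe_success == 0 ## connectivity failure
probe_success == 1 ## connectivity OK
Alerting checks the same metric: when it equals 0, an alert fires.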

[sss@prometheus01 prometheus]$ cat rules/blackbox-alert.rules 
groups:
- name: blackbox_network_stats
  rules:
  - alert: blackbox_network_stats
    expr: probe_success == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }}  is down"
      description: "This requires immediate action!"
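
To check the same condition by hand, the text returned by the /probe endpoint can be scanned for probe_success. This is only a sketch: `probe_succeeded` is a hypothetical helper and the sample text is illustrative, not captured from a real exporter.

```python
def probe_succeeded(metrics_text: str) -> bool:
    """Return True if the exposition text reports probe_success 1."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, _, value = line.rpartition(" ")
        if name == "probe_success":
            return float(value) == 1.0
    raise ValueError("probe_success not found in metrics text")


# Illustrative sample, written by hand for this example.
sample = """# TYPE probe_success gauge
probe_success 0
"""
print(probe_succeeded(sample))
```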

Monitoring SSL certificate expiry

cat << 'EOF' > prometheus.yml
rule_files:
  - ssl_expiry.rules
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - example.com  # Target to probe
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # Blackbox exporter.
EOF
cat << 'EOF' > ssl_expiry.rules 
groups: 
  - name: ssl_expiry.rules 
    rules: 
      - alert: SSLCertExpiringSoon 
        expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30 
        for: 10m
EOF
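
The alert expression compares the certificate's expiry timestamp with the current time. The arithmetic can be sketched as follows; the timestamps are invented for the example and `cert_expiring_soon` is a hypothetical name:

```python
SECONDS_PER_DAY = 86400


def cert_expiring_soon(expiry_ts: float, now: float, days: int = 30) -> bool:
    """Same comparison as: probe_ssl_earliest_cert_expiry - time() < 86400 * 30."""
    return expiry_ts - now < SECONDS_PER_DAY * days


now = 1_700_000_000  # arbitrary Unix timestamp used as "now"
ten_days_left = cert_expiring_soon(now + 10 * SECONDS_PER_DAY, now)
ninety_days_left = cert_expiring_soon(now + 90 * SECONDS_PER_DAY, now)
print(ten_days_left, ninety_days_left)
```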

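Customizing HTTP probes

HTTP services are exposed in different forms: some are simple web pages, others are REST-based API services. Probing these different kinds of HTTP services requires that administrators be able to customize the probe's behavior, including the request method, the request headers, and the request parameters. Services with authentication enabled need matching auth settings on the probe, and HTTPS services may need custom certificate settings.

As shown below, method sets the request method used by the probe. For services that require request parameters, headers sets the request headers and body sets the request content: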

http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'

If the HTTP service has authentication enabled, Blackbox Exporter has built-in support for basic_auth, so the credentials can be set directly:

http_basic_auth_example:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"

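For services that use a bearer token, specify the token string directly with bearer_token, or point to a token file with bearer_token_file.

For HTTPS services that need a custom certificate, provide the certificate information through tls_config: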

http_custom_ca_example:
   prober: http
   http:
     method: GET
     tls_config:
       ca_file: "/certs/my_cert.crt"

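Customizing probe behavior

By default the HTTP probe validates only the HTTP status code: a 2xx code (200 <= StatusCode < 300) counts as success, and the probe reports probe_success with value 1.

To accept specific status codes, or to require particular HTTP versions, use valid_http_versions and valid_status_codes as shown below (an empty valid_status_codes list keeps the 2xx default):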

http_2xx_example:
  prober: http
  timeout: 5s
  http:
    valid_http_versions: ["HTTP/1.1", "HTTP/2"]
    valid_status_codes: []
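
The status-code rule described above can be written out as a small function. This is a sketch of the behavior as described in this section, not the exporter's actual source; `status_ok` is a hypothetical name:

```python
def status_ok(status_code: int, valid_status_codes=()) -> bool:
    """Apply the rule above: an empty list means 'any 2xx code is accepted'."""
    if valid_status_codes:
        return status_code in valid_status_codes
    return 200 <= status_code < 300


print(status_ok(204))              # default rule, 2xx accepted
print(status_ok(301))              # default rule, redirect rejected
print(status_ok(301, [301, 302]))  # explicit list accepts 301
```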

By default, the samples returned by Blackbox also include the probe_http_ssl metric, which indicates whether the probe used SSL:

# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0

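If there is a hard requirement on whether the HTTP service uses SSL, configure fail_if_ssl and fail_if_not_ssl. With fail_if_ssl set to true, the probe fails if the site uses SSL and succeeds otherwise; fail_if_not_ssl is the opposite.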

http_2xx_example:
  prober: http
  timeout: 5s
  http:
    valid_status_codes: []
    method: GET
    no_follow_redirects: false
    fail_if_ssl: false
    fail_if_not_ssl: false

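Besides the HTTP status code, the HTTP protocol version, and whether SSL is in use, the probe's success can also depend on the response content. With fail_if_matches_regexp and fail_if_not_matches_regexp you define a list of regular expressions that the HTTP response body must not match or must match, respectively. (Newer releases of the exporter name these options fail_if_body_matches_regexp and fail_if_body_not_matches_regexp.)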

http_2xx_example:
  prober: http
  timeout: 5s
  http:
    method: GET
    fail_if_matches_regexp:
      - "Could not connect to database"
    fail_if_not_matches_regexp:
      - "Download the latest version here"
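
The two regex lists combine as follows: any match in the first list fails the probe, and any miss in the second list fails it too. A minimal sketch of that logic in Python, where `body_check_passes` is a hypothetical helper and Python's `re` stands in for the exporter's regex engine:

```python
import re


def body_check_passes(body: str, fail_if_matches=(), fail_if_not_matches=()) -> bool:
    """Return False as soon as one of the two rules is violated."""
    for pattern in fail_if_matches:
        if re.search(pattern, body):
            return False  # a forbidden pattern was found
    for pattern in fail_if_not_matches:
        if not re.search(pattern, body):
            return False  # a required pattern is missing
    return True


page = "<p>Download the latest version here</p>"
print(body_check_passes(
    page,
    fail_if_matches=["Could not connect to database"],
    fail_if_not_matches=["Download the latest version here"],
))
```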

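Finally, note that by default the HTTP probe prefers IPv6. In most cases you can set preferred_ip_protocol=ip4 to force probing over IPv4. The samples returned by Blackbox also include the probe_ip_protocol metric, which shows the protocol in use: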

# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 6

Besides HTTP, Blackbox can also probe other network protocols such as TCP, DNS, and ICMP; see the Blackbox GitHub project for more details.

node_exporter

Collector reference

Collectors enabled by default

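| Name | Description | OS |
| --- | --- | --- |
| arp | ARP statistics from `/proc/net/arp` | Linux |
| conntrack | conntrack statistics from `/proc/sys/net/netfilter/` | Linux |
| cpu | CPU statistics | Darwin, Dragonfly, FreeBSD, Linux |
| diskstats | Disk I/O statistics from `/proc/diskstats` | Linux |
| edac | Error detection and correction statistics | Linux |
| entropy | Available kernel entropy | Linux |
| exec | Execution statistics | Dragonfly, FreeBSD |
| filefd | File descriptor statistics from `/proc/sys/fs/file-nr` | Linux |
| filesystem | Filesystem statistics, such as disk space used | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD |
| hwmon | Hardware monitoring and sensor data from `/sys/class/hwmon/` | Linux |
| infiniband | Network statistics specific to InfiniBand configurations | Linux |
| loadavg | System load averages | Darwin, Dragonfly, FreeBSD, Linux, NetBSD, OpenBSD, Solaris |
| mdadm | Device statistics from `/proc/mdstat` | Linux |
| meminfo | Memory statistics | Darwin, Dragonfly, FreeBSD, Linux |
| netdev | Network interface traffic statistics, in bytes | Darwin, Dragonfly, FreeBSD, Linux, OpenBSD |
| netstat | Network statistics from `/proc/net/netstat`, equivalent to `netstat -s` | Linux |
| sockstat | Socket statistics from `/proc/net/sockstat` | Linux |
| stat | Various statistics from `/proc/stat`, including boot time, forks, and interrupts | Linux |
| textfile | Metrics read from text files in the local directory given by `--collector.textfile.directory` | any |
| time | Current system time | any |
| uname | System information from the `uname` system call | any |
| vmstat | Statistics from `/proc/vmstat` | Linux |
| wifi | WiFi device statistics | Linux |
| xfs | XFS runtime statistics | Linux (kernel 4.4+) |
| zfs | ZFS performance statistics | Linux |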

Collectors disabled by default

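| Name | Description | OS |
| --- | --- | --- |
| bonding | Number of configured and active bonded network interfaces | Linux |
| buddyinfo | Memory fragmentation statistics from `/proc/buddyinfo` | Linux |
| devstat | Device statistics | Dragonfly, FreeBSD |
| drbd | Statistics for DRBD (remotely mirrored block devices) | Linux |
| interrupts | Detailed interrupt statistics | Linux, OpenBSD |
| ipvs | IPVS status from `/proc/net/ip_vs` and statistics from `/proc/net/ip_vs_stats` | Linux |
| ksmd | Kernel and system statistics from `/sys/kernel/mm/ksm` | Linux |
| logind | Session statistics from `logind` | Linux |
| meminfo_numa | Memory statistics from `/proc/meminfo_numa` | Linux |
| mountstats | Filesystem statistics from `/proc/self/mountstats`, including NFS client statistics | Linux |
| nfs | NFS statistics from `/proc/net/rpc/nfs`, equivalent to `nfsstat -c` | Linux |
| qdisc | Queuing discipline statistics | Linux |
| runit | Service status from runit | any |
| supervisord | Service status from supervisord | any |
| systemd | Service and system status from `systemd` | Linux |
| tcpstat | TCP connection status from `/proc/net/tcp` and `/proc/net/tcp6` | Linux |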

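Note: the set of collectors can be chosen on the command line. Releases before 0.15 take a --collectors.enabled list; later releases, including the 0.16.0 installed below, use per-collector flags of the form --collector.<name> and --no-collector.<name>. If nothing is specified, the default collectors are used.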

Installation

Download: https://prometheus.io/download/#node_exporter

Installing from the release tarball

Install node_exporter

tar -zxvf node_exporter-0.16.0.linux-amd64.tar.gz

mv node_exporter-0.16.0.linux-amd64 /usr/local/node_exporter

Create the systemd service

vim /etc/systemd/system/node_exporter.service

[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

Start node_exporter

systemctl daemon-reload
systemctl start node_exporter
systemctl status node_exporter
systemctl enable node_exporter

Verify that it started

curl 127.0.0.1:9100/metrics

Installing on Kubernetes

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prometheus-node-exporter
  namespace: monitoring
  labels:
    k8s-app: prometheus-node-exporter
spec:
  selector:
    matchLabels:
      k8s-app: prometheus-node-exporter
  template:
    metadata:
      name: prometheus-node-exporter
      labels:
        k8s-app: prometheus-node-exporter
    spec:
      containers:
      - image: prom/node-exporter:v0.18.0
        imagePullPolicy: IfNotPresent
        name: prometheus-node-exporter
        ports:
        - name: prom-node-exp
          containerPort: 9100
          hostPort: 9100
        livenessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 9100
            scheme: HTTP
        readinessProbe:
          failureThreshold: 3
          httpGet:
            path: /
            port: 9100
            scheme: HTTP
        resources:
          limits:
            cpu: 20m
            memory: 2Gi
          requests:
            cpu: 10m
            memory: 1Gi
      dnsPolicy: ClusterFirst
      hostNetwork: true
      hostPID: true
      
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/scrape: 'true'
    prometheus.io/app-metrics: 'true'
    prometheus.io/app-metrics-path: '/metrics'
  name: prometheus-node-exporter
  namespace: monitoring
  labels:
    k8s-app: prometheus-node-exporter
spec:
  ports:
    - name: prometheus-node-exporter
      port: 9100
      protocol: TCP
  selector:
    k8s-app: prometheus-node-exporter
  type: ClusterIP

Configuration

Prometheus can scrape node_exporter through static_configs.

- job_name: 'node'
  static_configs:
  - targets: ['localhost:9100']

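Restart Prometheus; the new node target then appears on the Targets page of the Prometheus UI.

Common queries

Once node_exporter data is being collected, PromQL can be used for queries and monitoring. Some of the more common queries follow.

Each query below uses a single node as the example; to see all nodes, remove the instance="xxx" matcher.

CPU usage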

100 - (avg by (instance) (irate(node_cpu_seconds_total{instance="172.16.8.153:9100", mode="idle"}[5m])) * 100)
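
The arithmetic behind this query is easier to see on two raw samples of the idle counter. A sketch with made-up numbers for a single core; `cpu_usage_percent` is a hypothetical helper, and irate itself uses the last two samples inside the range:

```python
def cpu_usage_percent(idle_start: float, idle_end: float, elapsed: float) -> float:
    """Idle counters are in CPU-seconds for one core; elapsed is wall-clock seconds."""
    idle_rate = (idle_end - idle_start) / elapsed  # fraction of time spent idle
    return 100 - idle_rate * 100


# Made-up samples: the idle counter grew by 7.5 s over a 15 s interval.
usage = cpu_usage_percent(1000.0, 1007.5, 15.0)
print(usage)
```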

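CPU usage by mode

avg by (instance, mode) (irate(node_cpu_seconds_total{instance="172.16.8.153:9100"}[5m])) * 100

User: the share of time the CPU spends in user space, that is, running user processes. Typical user-space programs are shells, databases, and web servers.
Nice: the share of time spent running user-space processes whose scheduling priority was adjusted with nice (nice values range over [-20, 19]).
System: analogous to User, the share of time the CPU spends in kernel space. Allocating memory, performing I/O, and creating child processes are all kernel operations, so System rises when I/O is frequent.
ioWait: disk reads and writes are far slower than the CPU. The CPU processes data, and data on disk must be read into memory before it can be processed. After the CPU issues a read or write it waits for the disk to deliver the data, with nothing else to do in the meantime. The time spent in this waiting state is what iowait measures.
Idle: the share of time the CPU is idle. As a rule of thumb, idle + user + nice is roughly 100%.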

Load average

node_load1{instance="172.16.8.153:9100"}   # 1-minute load average
node_load5{instance="172.16.8.153:9100"}   # 5-minute load average
node_load15{instance="172.16.8.153:9100"}  # 15-minute load average

Memory usage

100-(node_memory_MemFree_bytes{instance="172.16.8.172:9100"}+node_memory_Cached_bytes{instance="172.16.8.172:9100"}+node_memory_Buffers_bytes{instance="172.16.8.172:9100"})/node_memory_MemTotal_bytes{instance="172.16.8.172:9100"} * 100
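
The expression treats free, cached, and buffer memory as available and reports the rest as used. The same formula on made-up numbers (values in MiB purely for readability; `memory_usage_percent` is a hypothetical helper):

```python
def memory_usage_percent(total: float, free: float, cached: float, buffers: float) -> float:
    """Mirror of the query: 100 - (free + cached + buffers) / total * 100."""
    return 100 - (free + cached + buffers) / total * 100


# Made-up values for illustration.
used = memory_usage_percent(total=8192, free=1024, cached=2048, buffers=1024)
print(used)
```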

Disk usage

100 - node_filesystem_free_bytes{instance="172.16.8.153:9100",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} / node_filesystem_size_bytes{instance="172.16.8.153:9100",fstype!~"rootfs|selinuxfs|autofs|rpc_pipefs|tmpfs|udev|none|devpts|sysfs|debugfs|fuse.*"} * 100

Network traffic in and out

# inbound bytes per second
sum by (instance) (rate(node_network_receive_bytes_total{instance="172.16.8.153:9100",device!="lo"}[5m]))

# outbound bytes per second
sum by (instance) (rate(node_network_transmit_bytes_total{instance="172.16.8.153:9100",device!="lo"}[5m]))

Dashboard templates

Matching dashboard templates can be found by searching the Grafana website.

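MySQL exporter

Installation

Deploying with Helm

helm fetch mysql_exporter

After fetching the chart, edit its values file.

Granting privileges

mysqld_exporter needs to connect to MySQL, so first create a user for it and grant the required privileges: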

CREATE USER 'exporter'@'%' IDENTIFIED BY '123456' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'%';
flush privileges;

Prometheus configuration

- job_name: 'mysql'
  static_configs:
    - targets:
      - localhost:9104

Verifying

curl localhost:9104/metrics

jmx_exporter

Example: monitoring Kafka

Collecting Kafka metrics with jmx_prometheus_javaagent

Download jmx_prometheus_javaagent and kafka.yml

wget https://raw.githubusercontent.com/prometheus/jmx_exporter/master/example_configs/kafka-0-8-2.yml
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.6/jmx_prometheus_javaagent-0.6.jar

Edit the Kafka startup script kafka-server-start.sh

Add these lines:

export JMX_PORT="9999"
export KAFKA_OPTS="-javaagent:/path/jmx_prometheus_javaagent-0.6.jar=9991:/path/kafka-0-8-2.yml"
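
The -javaagent value packs three things into one string: the jar path, the port on which metrics are served, and the config file. A small sketch that assembles it, using the placeholder paths from the line above; `javaagent_opt` is a hypothetical helper:

```python
def javaagent_opt(jar: str, port: int, config: str) -> str:
    """Format: -javaagent:<jar>=<metrics port>:<config file>."""
    return f"-javaagent:{jar}={port}:{config}"


opt = javaagent_opt("/path/jmx_prometheus_javaagent-0.6.jar", 9991, "/path/kafka-0-8-2.yml")
print(opt)
```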

Then restart Kafka.
The metrics are then available at http://localhost:9991/metrics.

Example: monitoring Hadoop

Download the jar

wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.3.1/jmx_prometheus_javaagent-0.3.1.jar

Create the configuration files

Create namenode.yaml (and datanode.yaml) in any location; their content selects the metrics you want.

---
startDelaySeconds: 0
hostPort: master:1234 # master is this host's address (localhost usually works); 1234 is the JMX port to use (any free port)
#jmxUrl: service:jmx:rmi:///jndi/rmi://127.0.0.1:1234/jmxrmi
ssl: false
lowercaseOutputName: false
lowercaseOutputLabelNames: false

Parameter reference

| Name | Description |
| --- | --- |
| startDelaySeconds | Start delay before serving requests. Any requests within the delay period will result in an empty metrics set. |
| hostPort | The host and port to connect to via remote JMX. If neither this nor jmxUrl is specified, will talk to the local JVM. |
| username | The username to be used in remote JMX password authentication. |
| password | The password to be used in remote JMX password authentication. |
| jmxUrl | A full JMX URL to connect to. Should not be specified if hostPort is. |
| ssl | Whether JMX connection should be done over SSL. To configure certificates you have to set following system properties: `-Djavax.net.ssl.keyStore=/home/user/.keystore` `-Djavax.net.ssl.keyStorePassword=changeit` `-Djavax.net.ssl.trustStore=/home/user/.truststore` `-Djavax.net.ssl.trustStorePassword=changeit` |
| lowercaseOutputName | Lowercase the output metric name. Applies to default format and `name`. Defaults to false. |
| lowercaseOutputLabelNames | Lowercase the output metric label names. Applies to default format and `labels`. Defaults to false. |
| whitelistObjectNames | A list of ObjectNames to query. Defaults to all mBeans. |
| blacklistObjectNames | A list of ObjectNames to not query. Takes precedence over `whitelistObjectNames`. Defaults to none. |
| rules | A list of rules to apply in order, processing stops at the first matching rule. Attributes that aren't matched aren't collected. If not specified, defaults to collecting everything in the default format. |
| pattern | Regex pattern to match against each bean attribute. The pattern is not anchored. Capture groups can be used in other options. Defaults to matching everything. |
| attrNameSnakeCase | Converts the attribute name to snake case. This is seen in the names matched by the pattern and the default format. For example, anAttrName to an_attr_name. Defaults to false. |
| name | The metric name to set. Capture groups from the `pattern` can be used. If not specified, the default format will be used. If it evaluates to empty, processing of this attribute stops with no output. |
| value | Value for the metric. Static values and capture groups from the `pattern` can be used. If not specified the scraped mBean value will be used. |
| valueFactor | Optional number that `value` (or the scraped mBean value if `value` is not specified) is multiplied by, mainly used to convert mBean values from milliseconds to seconds. |
| labels | A map of label name to label value pairs. Capture groups from `pattern` can be used in each. `name` must be set to use this. Empty names and values are ignored. If not specified and the default format is not being used, no labels are set. |
| help | Help text for the metric. Capture groups from `pattern` can be used. `name` must be set to use this. Defaults to the mBean attribute description and the full name of the attribute. |
| type | The type of the metric, can be `GAUGE`, `COUNTER` or `UNTYPED`. `name` must be set to use this. Defaults to `UNTYPED`. |

Edit $HADOOP_HOME/etc/hadoop/hadoop-env.sh

On the NameNode, add:

export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false   -Dcom.sun.management.jmxremote.port=1234 $HADOOP_NAMENODE_OPTS "

On the DataNodes, add:

export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.local.only=false   -Dcom.sun.management.jmxremote.port=1235 $HADOOP_DATANODE_OPTS "

Note:

Port 1234 (1235) must match the JMX port set in the configuration file above.

Edit $HADOOP_HOME/bin/hdfs

NameNode:

export HADOOP_NAMENODE_OPTS="$HADOOP_NAMENODE_OPTS -javaagent:/home/hadoop/jmx_prometheus_javaagent-0.3.1.jar=9200:/home/hadoop/namenode.yaml"

DataNode:

export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -javaagent:/home/hadoop/jmx_prometheus_javaagent-0.3.1.jar=9300:/home/hadoop/datanode.yaml"

Tip: 9200 (9300) is the port on which jmx_exporter serves metrics; Prometheus scrapes it later.

Requesting http://master:9200/metrics returns the metrics:

# HELP jvm_buffer_pool_used_bytes Used bytes of a given JVM buffer pool.
# TYPE jvm_buffer_pool_used_bytes gauge
jvm_buffer_pool_used_bytes{pool="direct",} 1181032.0
jvm_buffer_pool_used_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_capacity_bytes Bytes capacity of a given JVM buffer pool.
# TYPE jvm_buffer_pool_capacity_bytes gauge
jvm_buffer_pool_capacity_bytes{pool="direct",} 1181032.0
jvm_buffer_pool_capacity_bytes{pool="mapped",} 0.0
# HELP jvm_buffer_pool_used_buffers Used buffers of a given JVM buffer pool.
...
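
The exposition text above is line-oriented, so it is easy to inspect with a script. A minimal sketch, assuming simple sample lines only (no escaped quotes in label values, no timestamps); `parse_sample` is a hypothetical helper, not part of jmx_exporter:

```python
import re

# name, optional {labels}, then the value
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line: str):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, value = match.groups()
    labels = dict(LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(value)


parsed = parse_sample('jvm_buffer_pool_used_bytes{pool="direct",} 1181032.0')
print(parsed)
```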