This document is published at https://mrdoc.fun
node_exporter host monitoring
# 1. Download node_exporter

Download the release from the official site: <https://prometheus.io/download/#node_exporter>

```bash
mkdir -p /opt/agent
cd /opt/agent
wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
# Unpack the archive; the node_exporter binary is inside the extracted directory
tar -xzf node_exporter-1.1.2.linux-amd64.tar.gz
```

# 2. Manage the service with systemd

## 2.1 Create the service user and group

```bash
useradd -M -s /sbin/nologin prometheus
```

## 2.2 Create node-exporter.service

```ini
# vim /etc/systemd/system/node-exporter.service
[Unit]
Description=node-export service agent by jonnyan404
Requires=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Restart=on-failure
# Replace /path/to with the directory that contains the node_exporter binary
ExecStart=/path/to/node_exporter --collector.tcpstat
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
TimeoutStopSec=5

[Install]
WantedBy=multi-user.target
```

## 2.3 Enable at boot and start the service

```bash
# Make systemd pick up the new unit file
systemctl daemon-reload
systemctl enable node-exporter.service
systemctl start node-exporter.service
```

## 2.4 View the logs

```bash
journalctl -u node-exporter.service
```

# 3. Configure the host list for automatic discovery

Target files for file_sd_configs can be written in either YAML or JSON. This guide uses YAML.

- YAML format

```yaml
# vim /opt/jonnyan404/prometheus/target/linux.yml (choose any file name)
- targets: ['192.168.1.220:9100']
  labels:
    app: 'app1'
    env: 'game1'
    region: 'us-west-2'
- targets: ['192.168.1.221:9100']
  labels:
    app: 'app2'
    env: 'game2'
    region: 'ap-southeast-1'
```

- JSON format

```json
[
  {
    "targets": ["192.168.1.221:29090"],
    "labels": {
      "app": "app1",
      "env": "game1",
      "region": "us-west-2"
    }
  },
  {
    "targets": ["192.168.1.222:29090"],
    "labels": {
      "app": "app2",
      "env": "game2",
      "region": "ap-southeast-1"
    }
  }
]
```

# 4. Configure recording and alerting rules

- vim /opt/jonnyan404/prometheus/rules/node-exporter-record.yml

```yaml
groups:
  - name: node_exporter-record
    rules:
      - expr: up
        record: node_exporter:up
        labels:
          desc: "Whether the node is online: 1 = online, 0 = offline"
          unit: " "
          job: "aws_ec2"
      - expr: time() - node_boot_time_seconds * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:node_uptime
        labels:
          desc: "Node uptime"
          unit: "s"
          job: "aws_ec2"
      ##########################################################################
      # cpu
      - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job="aws_ec2",mode="idle"}[5m]))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:cpu:total:percent
        labels:
          desc: "Total CPU usage percentage of the node"
          unit: "%"
          job: "aws_ec2"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="aws_ec2",mode="idle"}[5m]))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:cpu:idle:percent
        labels:
          desc: "CPU idle percentage of the node"
          unit: "%"
          job: "aws_ec2"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="aws_ec2",mode="iowait"}[5m]))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:cpu:iowait:percent
        labels:
          desc: "CPU iowait percentage of the node"
          unit: "%"
          job: "aws_ec2"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="aws_ec2",mode="system"}[5m]))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:cpu:system:percent
        labels:
          desc: "CPU system percentage of the node"
          unit: "%"
          job: "aws_ec2"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="aws_ec2",mode="user"}[5m]))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:cpu:user:percent
        labels:
          desc: "CPU user percentage of the node"
          unit: "%"
          job: "aws_ec2"
      - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="aws_ec2",mode=~"softirq|nice|irq|steal"}[5m]))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:cpu:other:percent
        labels:
          desc: "CPU percentage spent in other modes (softirq, nice, irq, steal)"
          unit: "%"
          job: "aws_ec2"
      ##########################################################################
      # memory
      - expr: node_memory_MemTotal_bytes{job="aws_ec2"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:memory:total
        labels:
          desc: "Total memory of the node"
          unit: byte
          job: "aws_ec2"
      - expr: node_memory_MemFree_bytes{job="aws_ec2"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:memory:free
        labels:
          desc: "Free memory of the node"
          unit: byte
          job: "aws_ec2"
      # The subtraction is parenthesized so the join applies to the difference;
      # without parentheses, "*" binds tighter than "-" and the two sides no longer share a label set.
      - expr: (node_memory_MemTotal_bytes{job="aws_ec2"} - node_memory_MemFree_bytes{job="aws_ec2"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:memory:used
        labels:
          desc: "Used memory of the node"
          unit: byte
          job: "aws_ec2"
      - expr: (node_memory_MemTotal_bytes{job="aws_ec2"} - node_memory_MemAvailable_bytes{job="aws_ec2"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:memory:actualused
        labels:
          desc: "Memory actually in use on the node (total minus available)"
          unit: byte
          job: "aws_ec2"
      - expr: (1 - (node_memory_MemAvailable_bytes{job="aws_ec2"} / (node_memory_MemTotal_bytes{job="aws_ec2"}))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:memory:used:percent
        labels:
          desc: "Memory usage percentage of the node"
          unit: "%"
          job: "aws_ec2"
      - expr: ((node_memory_MemAvailable_bytes{job="aws_ec2"} / (node_memory_MemTotal_bytes{job="aws_ec2"}))) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:memory:free:percent
        labels:
          desc: "Available memory percentage of the node"
          unit: "%"
          job: "aws_ec2"
      ##########################################################################
      # load
      - expr: sum by (instance) (node_load1{job="aws_ec2"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:load:load1
        labels:
          desc: "1-minute system load"
          unit: " "
          job: "aws_ec2"
      - expr: sum by (instance) (node_load5{job="aws_ec2"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:load:load5
        labels:
          desc: "5-minute system load"
          unit: " "
          job: "aws_ec2"
      - expr: sum by (instance) (node_load15{job="aws_ec2"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:load:load15
        labels:
          desc: "15-minute system load"
          unit: " "
          job: "aws_ec2"
      ##########################################################################
      # disk
      - expr: node_filesystem_size_bytes{job="aws_ec2",fstype=~"ext4|xfs"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:usage:total
        labels:
          desc: "Total disk size of the node"
          unit: byte
          job: "aws_ec2"
      - expr: node_filesystem_avail_bytes{job="aws_ec2",fstype=~"ext4|xfs"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:usage:free
        labels:
          desc: "Free disk space of the node"
          unit: byte
          job: "aws_ec2"
      - expr: (node_filesystem_size_bytes{job="aws_ec2",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job="aws_ec2",fstype=~"ext4|xfs"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:usage:used
        labels:
          desc: "Used disk space of the node"
          unit: byte
          job: "aws_ec2"
      - expr: (1 - node_filesystem_avail_bytes{job="aws_ec2",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job="aws_ec2",fstype=~"ext4|xfs"}) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:used:percent
        labels:
          desc: "Disk usage percentage of the node"
          unit: "%"
          job: "aws_ec2"
      - expr: irate(node_disk_reads_completed_total{job="aws_ec2"}[1m]) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:read:count:rate
        labels:
          desc: "Disk read operations per second"
          unit: "ops/s"
          job: "aws_ec2"
      - expr: irate(node_disk_writes_completed_total{job="aws_ec2"}[1m]) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:write:count:rate
        labels:
          desc: "Disk write operations per second"
          unit: "ops/s"
          job: "aws_ec2"
      # The original had the read and write byte metrics swapped between these two rules.
      - expr: (irate(node_disk_read_bytes_total{job="aws_ec2"}[1m])) / 1024 / 1024 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:read:mb:rate
        labels:
          desc: "Device read throughput in MB"
          unit: "MB/s"
          job: "aws_ec2"
      - expr: (irate(node_disk_written_bytes_total{job="aws_ec2"}[1m])) / 1024 / 1024 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:disk:write:mb:rate
        labels:
          desc: "Device write throughput in MB"
          unit: "MB/s"
          job: "aws_ec2"
      ##########################################################################
      # filesystem
      - expr: (1 - node_filesystem_files_free{job="aws_ec2",fstype=~"ext4|xfs"} / node_filesystem_files{job="aws_ec2",fstype=~"ext4|xfs"}) * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:filesystem:used:percent
        labels:
          desc: "Percentage of inodes in use on the node"
          unit: "%"
          job: "aws_ec2"
      ##########################################################################
      # filefd
      - expr: node_filefd_allocated{job="aws_ec2"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:filefd_allocated:count
        labels:
          desc: "Number of open file descriptors on the node"
          unit: "count"
          job: "aws_ec2"
      - expr: node_filefd_allocated{job="aws_ec2"} / node_filefd_maximum{job="aws_ec2"} * 100 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:filefd_allocated:percent
        labels:
          desc: "Percentage of file descriptors in use on the node"
          unit: "%"
          job: "aws_ec2"
      ##########################################################################
      # network
      # node_network_*_bytes_total counts bytes; multiply by 8 so the value matches the bit/s unit.
      - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:netin:bit:rate
        labels:
          desc: "Bits received per second on each network interface"
          unit: "bit/s"
          job: "aws_ec2"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8 * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:netout:bit:rate
        labels:
          desc: "Bits sent per second on each network interface"
          unit: "bit/s"
          job: "aws_ec2"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:netin:packet:rate
        labels:
          desc: "Packets received per second on each network interface"
          unit: "packets/s"
          job: "aws_ec2"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:netout:packet:rate
        labels:
          desc: "Packets sent per second on each network interface"
          unit: "packets/s"
          job: "aws_ec2"
      - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:netin:error:rate
        labels:
          desc: "Receive errors per second detected by the device driver"
          unit: "packets/s"
          job: "aws_ec2"
      - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:netout:error:rate
        labels:
          desc: "Transmit errors per second detected by the device driver"
          unit: "packets/s"
          job: "aws_ec2"
      - expr: node_tcp_connection_states{job="aws_ec2", state="established"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:tcp:established:count
        labels:
          desc: "Number of TCP connections in the established state"
          unit: "count"
          job: "aws_ec2"
      - expr: node_tcp_connection_states{job="aws_ec2", state="time_wait"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:tcp:timewait:count
        labels:
          desc: "Number of TCP connections in the time_wait state"
          unit: "count"
          job: "aws_ec2"
      - expr: sum by (environment,instance) (node_tcp_connection_states{job="aws_ec2"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:network:tcp:total:count
        labels:
          desc: "Total number of TCP connections on the node"
          unit: "count"
          job: "aws_ec2"
      ##########################################################################
      # process
      # node_processes_state comes from the processes collector, which is disabled by default
      # (start node_exporter with --collector.processes to get this metric).
      - expr: node_processes_state{state="Z"} * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:process:zombie:total:count
        labels:
          desc: "Number of zombie processes on the node"
          unit: "count"
          job: "aws_ec2"
      ##########################################################################
      # other
      - expr: abs(node_timex_offset_seconds{job="aws_ec2"}) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:time:offset
        labels:
          desc: "Clock offset of the node"
          unit: "s"
          job: "aws_ec2"
      ##########################################################################
      - expr: count by (instance) (count by (instance,cpu) (node_cpu_seconds_total{mode='system'})) * on(instance) group_left(nodename) (node_uname_info)
        record: node_exporter:cpu:count
```

- vim /opt/jonnyan404/prometheus/rules/node-exporter-alert.yml

```yaml
# node-exporter-alert.yml
# Alerting rules.
# The expr conditions use the record names defined in the previous rules file.
# When one of these alerts fires it is passed to Alertmanager, which extracts the
# required data from the alert and sends it to the target recipients.
groups:
  - name: node-alert
    rules:
      - alert: node-down
        expr: node_exporter:up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "instance: {{ $labels.instance }} is down"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: prometheus-alertmanager-not-connected
        expr: prometheus_notifications_alertmanagers_discovered < 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Prometheus not connected to alertmanager
          description: "Prometheus cannot connect the alertmanager\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: alertmanager-notification-failed
        expr: rate(alertmanager_notifications_failed_total[1m]) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: Prometheus AlertManager notification failing
          description: "Alertmanager is failing sending notifications\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
      - alert: node-cpu-high
        expr: node_exporter:cpu:total:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU usage is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-cpu-iowait-high
        expr: node_exporter:cpu:iowait:percent >= 12
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} CPU iowait is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      # on(instance) is needed here: load1 carries the desc/unit/job labels added by its
      # recording rule, while cpu:count does not, so the default all-label match would never pair them.
      - alert: node-load-load1-high
        expr: node_exporter:load:load1 > on(instance) node_exporter:cpu:count * 1.2
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} load1 is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-memory-high
        expr: node_exporter:memory:used:percent > 85
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "Memory usage is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-disk-high
        expr: node_exporter:disk:used:percent > 80
        for: 3m
        labels:
          severity: info
        annotations:
          summary: "{{ $labels.device }}:{{ $labels.mountpoint }} usage is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-disk-read-count-high
        expr: node_exporter:disk:read:count:rate > 3000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read IOPS is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-disk-write-count-high
        expr: node_exporter:disk:write:count:rate > 3000
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write IOPS is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-disk-read-mb-high
        expr: node_exporter:disk:read:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} read throughput is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-disk-write-mb-high
        expr: node_exporter:disk:write:mb:rate > 60
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} write throughput is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-filefd-allocated-percent-high
        expr: node_exporter:filefd_allocated:percent > 80
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} open file descriptors are above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-network-netin-error-rate-high
        expr: node_exporter:network:netin:error:rate > 4
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet error rate is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-network-netin-packet-rate-high
        expr: node_exporter:network:netin:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} inbound packet rate is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-network-netout-packet-rate-high
        expr: node_exporter:network:netout:packet:rate > 35000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} outbound packet rate is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-network-tcp-total-count-high
        expr: node_exporter:network:tcp:total:count > 40000
        for: 1m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} TCP connection count is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-process-zombie-total-count-high
        expr: node_exporter:process:zombie:total:count > 10
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} zombie process count is above {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-time-offset-high
        expr: node_exporter:time:offset > 0.03
        for: 2m
        labels:
          severity: info
        annotations:
          summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
      - alert: node-disk-space-low
        expr: node_exporter:disk:used:percent > 80
        for: 2m
        labels:
          severity: warn
        annotations:
          summary: "instance: {{ $labels.instance }} disk usage has exceeded {{ $value }}{{ $labels.unit }}"
          grafana: "http://x.x.x.x:3000/d/9CWBz0bik/zhu-ji-jian-kong?orgId=1&var-node={{ $labels.instance }}"
```

# 5. Restart Prometheus to apply the rules

```bash
docker restart prometheus
```

# 6. Import the Grafana dashboard template

- ID: 8919
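When there are more than a handful of hosts, the YAML target file from section 3 can be generated instead of typed by hand. The following is only a minimal sketch: `make_target` is a hypothetical helper name, and the argument order (address, app, env, region) is an assumption chosen to mirror the labels used in the example target file.

```shell
#!/bin/sh
# Hypothetical helper (not part of Prometheus): print one file_sd target entry
# in the same YAML layout as the linux.yml example in section 3.
# Arguments: <host:port> <app> <env> <region>
make_target() {
  printf '%s\n' "- targets: ['$1']"
  printf '%s\n' "  labels:"
  printf '%s\n' "    app: '$2'"
  printf '%s\n' "    env: '$3'"
  printf '%s\n' "    region: '$4'"
}

# Print the two entries used in section 3 to standard output
make_target 192.168.1.220:9100 app1 game1 us-west-2
make_target 192.168.1.221:9100 app2 game2 ap-southeast-1
```

Redirecting the output to the target file (for example `/opt/jonnyan404/prometheus/target/linux.yml`) replaces the hand-written list. Prometheus watches file_sd target files and picks up changes without a restart, so regenerating the file is enough to add or remove hosts.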
Jonny
May 20, 2021, 10:55