Mirror of https://github.com/ccfos/nightingale.git (synced 2026-03-03 14:38:55 +00:00)
Compare commits: ForceUseSe...fix_mute (1 commit)
| Author | SHA1 | Date |
|---|---|---|
| | de8f4dd93d | |
10
Makefile
@@ -2,15 +2,10 @@

NOW = $(shell date -u '+%Y%m%d%I%M%S')

RELEASE_VERSION = 5.9.6

APP = n9e
SERVER_BIN = $(APP)
ROOT:=$(shell pwd -P)
GIT_COMMIT:=$(shell git --work-tree ${ROOT} rev-parse 'HEAD^{commit}')
_GIT_VERSION:=$(shell git --work-tree ${ROOT} describe --tags --abbrev=14 "${GIT_COMMIT}^{commit}" 2>/dev/null)
TAG=$(shell echo "${_GIT_VERSION}" | awk -F"-" '{print $$1}')
RELEASE_VERSION:="$(TAG)-$(GIT_COMMIT)"

# RELEASE_ROOT = release
# RELEASE_SERVER = release/${APP}
# GIT_COUNT = $(shell git rev-list --all --count)
@@ -22,9 +17,6 @@ all: build
build:
	go build -ldflags "-w -s -X github.com/didi/nightingale/v5/src/pkg/version.VERSION=$(RELEASE_VERSION)" -o $(SERVER_BIN) ./src

build-linux:
	GOOS=linux GOARCH=amd64 go build -ldflags "-w -s -X github.com/didi/nightingale/v5/src/pkg/version.VERSION=$(RELEASE_VERSION)" -o $(SERVER_BIN) ./src

# start:
# @go run -ldflags "-X main.VERSION=$(RELEASE_TAG)" ./cmd/${APP}/main.go web -c ./configs/config.toml -m ./configs/model.conf --menu ./configs/menu.yaml
run_webapi:
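The git-derived version string in the Makefile hunk can be traced with a small Python sketch. It mirrors the `awk -F"-" '{print $$1}'` step (keep the first hyphen-separated field of the `git describe` output) and the `"$(TAG)-$(GIT_COMMIT)"` join. The function name and the sample values are illustrative assumptions, not part of the repository.

```python
def release_version(describe_output: str, commit: str) -> str:
    """Mimic TAG=$(... | awk -F"-" '{print $1}') followed by "$(TAG)-$(GIT_COMMIT)"."""
    # awk -F"-" '{print $1}' keeps everything before the first hyphen.
    tag = describe_output.split("-")[0]
    return f"{tag}-{commit}"


if __name__ == "__main__":
    # Hypothetical `git describe --tags --abbrev=14` output:
    # <tag>-<commits since tag>-g<14 hex digits>
    print(release_version("v5.9.6-3-g0123456789abcd", "0123456789abcdef"))
```

One consequence of splitting on the first hyphen: a tag that itself contains a hyphen, such as a hypothetical `v5.10.0-rc1`, would be shortened to `v5.10.0`.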
132
README.md
@@ -1,122 +1,102 @@
<p align="center">
<a href="https://github.com/ccfos/nightingale">
<img src="doc/img/ccf-n9e.png" alt="nightingale - cloud native monitoring" width="240" /></a>
<p align="center">Nightingale is an open-source, cloud-native monitoring system with an all-in-one design, offering enterprise-grade features and an out-of-the-box product experience. We recommend upgrading your Prometheus + AlertManager + Grafana stack to Nightingale.</p>
</p>

<p align="center">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<a href="https://n9e.github.io">
<img alt="Docs" src="https://img.shields.io/badge/docs-get%20started-brightgreen"/></a>
<a href="https://hub.docker.com/u/flashcatcloud">
<img alt="Docker pulls" src="https://img.shields.io/docker/pulls/flashcatcloud/nightingale"/></a>
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ccfos/nightingale">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/ccfos/nightingale">
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
</p>
<img src="doc/img/ccf-n9e.png" width="240">

[English](./README_EN.md) | [中文](./README.md)

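## Highlighted Features
> Nightingale is an open-source, cloud-native monitoring system with an All-In-One design, offering enterprise-grade features and an out-of-the-box product experience. We recommend upgrading your Prometheus + AlertManager + Grafana stack to Nightingale.

- **Out of the box**
  - Supports multiple deployment methods such as Docker, Helm Chart, and cloud services. It combines data collection, monitoring and alerting, and visualization, and ships with built-in dashboards, quick views, and alert rule templates that can be imported and used right away, **greatly reducing the cost of building, learning, and using a cloud-native monitoring system**;
- **Professional alerting**
  - Visual alert configuration and management, rich alert rules, configurable mute rules and subscription rules, multiple notification channels, alert self-healing, alert event management, and more;
- **Cloud native**
  - Quickly build an enterprise-grade cloud-native monitoring system in a turnkey fashion. Supports collectors such as [**Categraf**](https://github.com/flashcatcloud/categraf), Telegraf, and Grafana-agent; supports databases such as Prometheus, VictoriaMetrics, M3DB, and ElasticSearch; supports importing Grafana dashboards; **integrates seamlessly with the cloud-native ecosystem**;
- **High performance, high availability**
  - Thanks to Nightingale's multi-data-source management engine and the architecture of the engine side, and backed by high-performance time-series databases, it can handle collection, storage, and alert analysis for hundreds of millions of time series, saving substantial cost;
  - All Nightingale components scale horizontally with no single point of failure. Nightingale has been deployed at more than a thousand enterprises and proven in demanding production environments. Many leading internet companies rely heavily on it, running clusters of around a hundred machines that handle hundreds of millions of time series;
- **Flexible scaling, centralized management**
  - Nightingale can run on a 1-core, 1 GB cloud host, be deployed as a cluster across hundreds of machines, or run in K8s. Components such as the time-series database and the alert engine can also be pushed down to individual data centers and regions, balancing edge deployment with centralized, unified management and **solving the problem of fragmented data and the lack of a unified view**;
- **Open community**
  - Hosted by the [China Computer Federation (CCF) Open Source Development Committee](https://www.ccf.org.cn/kyfzwyh/), with sustained investment from [**Flashcat (快猫星云)**](https://flashcat.cloud) and many other companies, active participation from thousands of community users, and a clear project positioning, all of which ensure the healthy, long-term development of the Nightingale open-source community. Active, professional community users also keep iterating on the product and contributing best practices to it;
**Nightingale has the following characteristics:**

> If you use Prometheus and have one or more of the following needs, we recommend upgrading seamlessly to Nightingale: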
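#### 1. Out of the box
Supports multiple deployment methods such as Docker and Helm Chart, and ships with built-in dashboards, quick views, and alert rule templates that can be imported and used right away. Active, professional community users also keep iterating on the product and contributing best practices to it;

#### 2. Broad compatibility
Supports collectors such as [Categraf](https://github.com/flashcatcloud/categraf), Telegraf, and Grafana-agent; supports time-series databases such as Prometheus, VictoriaMetrics, and M3DB; integrates with Grafana and fits seamlessly into the cloud-native ecosystem;

#### 3. Open community
Hosted by the [China Computer Federation (CCF) Open Source Development Committee](https://www.ccf.org.cn/kyfzwyh/), with sustained investment from [Flashcat (快猫星云)](https://flashcat.cloud), active participation from thousands of community users, and a clear project positioning, all of which ensure the healthy, long-term development of the Nightingale open-source community;

#### 4. High performance
Thanks to Nightingale's multi-data-source management engine and the architecture of the engine side, and backed by high-performance time-series databases, it can handle collection, storage, and alert analysis for hundreds of millions of time series, saving substantial cost;

#### 5. High availability
All Nightingale components scale horizontally with no single point of failure. Nightingale has been deployed at more than a thousand enterprises and proven in demanding production environments. Many leading internet companies rely heavily on it, running clusters of around a hundred machines that handle billions of time series;

#### 6. Flexible scaling
Nightingale can run on a 1-core, 1 GB cloud host, be deployed as a cluster across hundreds of machines, or run in K8s. Components such as the time-series database and the alert engine can also be pushed down to individual data centers and regions, balancing edge deployment with centralized management;


#### If you use Prometheus and have one or more of the following needs, we recommend upgrading to Nightingale:
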
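- Prometheus, Alertmanager, Grafana, and other systems are fragmented, lack a unified view, and cannot be used out of the box;
- Managing Prometheus and Alertmanager by editing configuration files has a steep learning curve and makes collaboration difficult;
- Your data volume is too large for your Prometheus cluster to scale;
- You run multiple Prometheus clusters in production and face high management and usage costs;

> If you use Zabbix and have the following needs, we recommend upgrading to Nightingale:
#### If you use Zabbix and have the following needs, we recommend upgrading to Nightingale:

- The volume of monitoring data is too large, and you want a better scaling solution;
- The learning curve is steep, and you want more efficient collaboration across multiple people and teams;
- Under microservice and cloud-native architectures, monitoring data has a volatile lifecycle and high-cardinality dimensions, which the Zabbix data model does not fit easily;

> If you use [Open-Falcon](https://github.com/open-falcon/falcon-plus), we recommend upgrading to Nightingale even more strongly:
#### If you use [open-falcon](https://github.com/open-falcon/falcon-plus), we recommend upgrading to Nightingale even more strongly:
- For a detailed introduction to open-falcon and Nightingale, see [Ten Characteristics and Trends of Cloud-Native Monitoring](https://mp.weixin.qq.com/s?__biz=MzkzNjI5OTM5Nw==&mid=2247483738&idx=1&sn=e8bdbb974a2cd003c1abcc2b5405dd18&chksm=c2a19fb0f5d616a63185cd79277a79a6b80118ef2185890d0683d2bb20451bd9303c78d083c5#rd).

- For a detailed introduction to Open-Falcon and Nightingale, see: [Ten Characteristics and Trends of Cloud-Native Monitoring](https://mp.weixin.qq.com/s?__biz=MzkzNjI5OTM5Nw==&mid=2247483738&idx=1&sn=e8bdbb974a2cd003c1abcc2b5405dd18&chksm=c2a19fb0f5d616a63185cd79277a79a6b80118ef2185890d0683d2bb20451bd9303c78d083c5#rd).

> We recommend [Categraf](https://github.com/flashcatcloud/categraf) as the preferred monitoring data collector:

- [Categraf](https://github.com/flashcatcloud/categraf) is Nightingale's default collector. It uses an open plugin mechanism and an all-in-one design, and collects metrics, logs, traces, and events. Categraf collects system-level metrics such as CPU, memory, and network, integrates collection for many open-source components, and supports the K8s ecosystem. It ships with matching dashboards and alert rules that work out of the box.
#### We recommend [Categraf](https://github.com/flashcatcloud/categraf) as the preferred monitoring data collector:
- Categraf is Nightingale's default collector. It uses an open plugin mechanism and an all-in-one design, and collects metrics, logs, traces, and events. Categraf collects system-level metrics such as CPU, memory, and network, integrates collection for many open-source components, and supports the K8s ecosystem. It ships with matching dashboards and alert rules that work out of the box.

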
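## Getting Started
## Resources

- [Quick installation](https://mp.weixin.qq.com/s/iEC4pfL1TgjMDOWYh8H-FA)
- [Documentation](https://n9e.github.io/)
- [Community articles](https://n9e.github.io/docs/prologue/share/)

## Screenshots
## Product demo

<img src="doc/img/intro.gif" width="480">
<img src="doc/img/intro.gif" width="680">

## Architecture overview

<img src="doc/img/arch-product.png" width="680">

Nightingale receives monitoring data reported by various collectors (such as [Categraf](https://github.com/flashcatcloud/categraf), telegraf, grafana-agent, and Prometheus) and writes it to a range of popular time-series databases (Prometheus, M3DB, VictoriaMetrics, Thanos, TDEngine, and others). It provides configuration of alert rules, mute rules, and subscription rules; viewing of monitoring data; an alert self-healing mechanism (after an alert fires, it automatically calls back a webhook address or runs a script); and storage, management, and grouped viewing of historical alert events.


## Architecture

<img src="doc/img/arch-product.png" width="480">

Nightingale Monitoring receives monitoring data reported by various collectors (such as [Categraf](https://github.com/flashcatcloud/categraf), telegraf, grafana-agent, and Prometheus) and writes it to a range of popular time-series databases (Prometheus, M3DB, VictoriaMetrics, Thanos, TDEngine, and others). It provides configuration of alert rules, mute rules, and subscription rules; viewing of monitoring data; an alert self-healing mechanism (after an alert fires, it automatically calls back a webhook address or runs a script); and storage, management, and grouped viewing of historical alert events.

<img src="doc/img/arch-system.png" width="480">
<img src="doc/img/arch-system.png" width="680">

The design of Nightingale v5 is very simple. Its core consists of two modules, server and webapi. webapi is stateless and deployed centrally; it handles front-end requests and writes user configuration to the database. server is the alert engine and data forwarding module. It generally follows the time-series database: each time-series database has its own set of server instances, which can be a single instance or several instances forming a cluster. server accepts data reported by Categraf, Telegraf, Grafana-Agent, Datadog-Agent, and Falcon-Plugins, writes it to the backend time-series database, periodically syncs alert rules from the database, and then queries the time-series database to evaluate alerts. Each set of server instances depends on one redis.


<img src="doc/img/install-vm.png" width="480">
<img src="doc/img/install-vm.png" width="680">

If a standalone time-series database (such as Prometheus) hits a performance bottleneck or has weak disaster recovery, we recommend [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics). VictoriaMetrics has a fairly simple architecture and excellent performance, and is easy to deploy and operate; see the architecture diagram above. For more detailed documentation, see its [official website](https://victoriametrics.com/).
If standalone Prometheus lacks performance or has weak disaster recovery, we recommend [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics). VictoriaMetrics has a fairly simple architecture and excellent performance, and is easy to deploy and operate; see the architecture diagram above. For more detailed documentation, see its [official website](https://victoriametrics.com/).

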
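## Community
## How to participate

For an open-source project to thrive, it needs an open governance structure and a steady stream of developers and users taking part. We are committed to building an open, neutral open-source governance structure and to bringing in more developers from enterprises, universities, and elsewhere who are interested in and passionate about cloud-native monitoring, to build a vibrant Nightingale open-source community together. For the "Nightingale Open-Source Project and Community Governance Structure (Draft)", see [COMMUNITY GOVERNANCE](./doc/community-governance.md).
For an open-source project to thrive, it needs an open governance structure and a steady stream of developers and users taking part. We are committed to building an open, neutral open-source governance structure and to bringing in more computing professionals from enterprises, universities, and elsewhere who are interested in and passionate about cloud-native monitoring, to build a professional, vibrant developer community. For the "Nightingale Open-Source Project and Community Governance Structure (Draft)", see [doc/community-governance.md](./doc/community-governance.md).

**We welcome you to take part in the Nightingale open-source project and community in any way, including but not limited to**:
- Add to and improve the documentation => [n9e.github.io](https://n9e.github.io/)
- Share your best practices and experience with Nightingale => [articles](https://n9e.github.io/docs/prologue/share/)
- Submit product suggestions => [github issue](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md)
- Submit code to make Nightingale faster, more stable, and easier to use => [github pull request](https://github.com/didi/nightingale/pulls)
- Add to and improve the documentation => [n9e.github.io](https://n9e.github.io/);
- Share your best practices and experience with Nightingale => [articles](https://n9e.github.io/docs/prologue/share/);
- Submit product suggestions => [github issue](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md);
- Submit code to make Nightingale faster, more stable, and easier to use => [github pull request](https://github.com/didi/nightingale/pulls);


**Respecting, recognizing, and recording the work of every contributor** is the first guiding principle of the Nightingale open-source community. We encourage **asking questions efficiently**, which respects developers' time and also adds to the community's accumulated knowledge:
- Before asking, check the [FAQ](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq)
- Before asking, search the [github issues](https://github.com/ccfos/nightingale/issues)
- We recommend asking questions by submitting a github issue first: [report a problem here](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&template=bug_report.yml) | [submit a feature request here](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md)
- Finally, we recommend joining the WeChat group to discuss open-ended questions (first add [UlricGO](https://www.gitlink.org.cn/UlricQin/gist/tree/master/self.jpeg) as a friend with the note "夜莺加群+姓名+公司", that is, "join Nightingale group + name + company"; the developer team and knowledgeable, helpful members answer questions in the group)
1. Before asking, check the [FAQ](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq);
2. Before asking, search the [github issues](https://github.com/ccfos/nightingale/issues);
3. We recommend asking questions by submitting a github issue first: [report a problem here](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&template=bug_report.yml) | [submit a feature request here](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md);
4. Finally, we recommend joining the WeChat group to discuss open-ended questions (first add [UlricGO](https://www.gitlink.org.cn/UlricQin/gist/tree/master/self.jpeg) as a friend with the note "夜莺加群+姓名+公司", that is, "join Nightingale group + name + company"; the developer team and knowledgeable, helpful members answer questions in the group);

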
## Who is using
## Contact us
- We recommend following the Nightingale official WeChat account for the latest product and community news

You can register your usage and share your experience at **[Who is Using Nightingale](https://github.com/ccfos/nightingale/issues/897)**.
<img src="doc/img/n9e-vx-new.png" width="180">

## Stargazers
## Stargazers over time
[Stargazers over time](https://starchart.cc/ccfos/nightingale)

## Contributors
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>

## License
[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)

## Contact Us
We recommend following the Nightingale official WeChat account for the latest product and community news:

<img src="doc/img/n9e-vx-new.png" width="120">
- [Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)
@@ -1,5 +0,0 @@
## Active Contributors

- [xiaoziv](https://github.com/xiaoziv)
- [tanxiao1990](https://github.com/tanxiao1990)
- [bbaobelief](https://github.com/bbaobelief)
@@ -1,5 +0,0 @@
## Committers

- [YeningQin](https://github.com/710leo)
- [FeiKong](https://github.com/kongfei605)
- [XiaqingDai](https://github.com/jsers)
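@@ -1,73 +1,52 @@
# Nightingale Open-Source Project and Community Governance Structure (Draft)

## Community structure
#### User

### User
>Any individual, company, or organization is welcome to use Nightingale, actively report bugs, submit feature requests, and help one another. We recommend github issues for tracking bugs and managing requirements.

> Any individual, company, or organization is welcome to use Nightingale Monitoring, actively report bugs, submit feature requests, and help one another. We recommend [github issue](https://github.com/ccfos/nightingale/issues) for tracking bugs and managing requirements.
#### Contributor

Community users can register their usage at **[Who is Using Nightingale](https://github.com/ccfos/nightingale/issues/897)** and share their experience with Nightingale. They are then automatically added to the **[END USERS](./end-users.md)** list and receive **VIP Support** from the community.
>Every user is welcome to participate in and contribute to the Nightingale open-source project, including but not limited to the following ways:
>1. Take an active part in discussions in [github issue](https://github.com/ccfos/nightingale/issues);
>2. Submit code patches;
>3. Revise, supplement, and improve the documentation;
>4. Submit suggestions or criticism;

### Contributor
#### Committer

> Every user is welcome to participate in and contribute to the Nightingale open-source community, including but not limited to the following ways:
>A Committer is a contributor with write access to the Nightingale code repository who has also signed the Nightingale project Contributor License Agreement (CLA) and has an email address ending in ccf.org.cn. In principle, a Committer can decide independently whether a code patch may be merged into the Nightingale repository, but the Project Management Committee has the final say.

1. Take an active part in discussions in [github issue](https://github.com/ccfos/nightingale/issues) and in community activities;
1. Submit code patches;
1. Translate, revise, supplement, and improve the [documentation](https://n9e.github.io);
1. Share your experience with Nightingale and actively advocate for it;
1. Submit suggestions or criticism;
#### Project Management Committee member (PMC Member)

Anyone who has had **5** PRs merged into [CCFOS/NIGHTINGALE](https://github.com/ccfos/nightingale) within a year, or whose other contributions are unanimously recognized by the **Project Management Committee**, is automatically added to the **[ACTIVE CONTRIBUTORS](./active-contributors.md)** list, receives an electronic certificate issued by **[CCF ODC](https://www.ccf.org.cn/kyfzwyh/)**, and enjoys certain rights and benefits in the Nightingale open-source community.
> PMC members are elected from among contributors or Committers. They have write access to the Nightingale code repository, an email address ending in ccf.org.cn, the right to vote on Nightingale community matters, and the right to nominate Committer candidates. The PMC as an entity is fully responsible for the development of the whole project.

#### Project Management Committee chair (PMC Chair)

### Committer
> The PMC Chair is appointed by [CCF ODC](https://www.ccf.org.cn/kyfzwyh/) from among the PMC members. The PMC, as a single entity, manages and leads the Nightingale project. The chair is the bridge between CCF ODC and the PMC and carries out specific project management duties.

> A Committer is a contributor with write access to the [CCFOS/NIGHTINGALE](https://github.com/ccfos/nightingale) code repository, with an email address ending in ccf.org.cn (coming soon). In principle, a Committer can decide independently whether a code patch may be merged into the Nightingale repository, but the Project Management Committee has the final say.

A Committer takes on one or more of the following responsibilities:
- Respond actively to issues;
- Review PRs;
- Attend regular developer meetings and actively discuss project planning and technical proposals;
- Represent the Nightingale open-source community at technical conferences and give talks;

Committers are recorded and published in the **[COMMITTERS](./committers.md)** list, receive an electronic certificate issued by **[CCF ODC](https://www.ccf.org.cn/kyfzwyh/)**, and enjoy the rights and benefits of the Nightingale open-source community.


### Project Management Committee (PMC)

> The PMC, as an entity, manages and leads the Nightingale project and is fully responsible for its development. PMC information is recorded and published in [PMC](./pmc.md).

- PMC Members are elected from among Contributors or Committers. They have write access to the [CCFOS/NIGHTINGALE](https://github.com/ccfos/nightingale) code repository, an email address ending in ccf.org.cn (coming soon), the right to vote on Nightingale community matters, and the right to nominate Committer candidates.
- The PMC Chair is appointed by **[CCF ODC](https://www.ccf.org.cn/kyfzwyh/)** from among the PMC members. The chair is the bridge between CCF ODC and the PMC and carries out specific project management duties.
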
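## Communication
# Communication
1. We recommend a mailing list for feedback and suggestions (to be announced);
2. We recommend [github issue](https://github.com/ccfos/nightingale/issues) for tracking bugs and managing requirements;
3. We recommend [github milestone](https://github.com/ccfos/nightingale/milestones) for managing project progress and planning;
4. We recommend Tencent Meeting for regular project meetings (meeting ID to be announced);
4. We recommend Tencent Meeting for regular project meetings;

## Documentation
# Documentation
1. We recommend [github pages](https://n9e.github.io) for maintaining documentation;
2. We recommend [gitlink wiki](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq) for maintaining the FAQ;


## Operation

# Operation
1. We regularly hold meetings among users, contributors, and PMC members to discuss project goals, proposals, and progress, as well as the merit and priority of requirements;
2. We regularly organize meetups (online and offline) to create a good environment for users to exchange and share, and we record the content on the documentation site;
3. We regularly hold the Nightingale developer conference to share best user stories, align on annual development goals and plans, and discuss new technical directions;
3. We regularly hold the Nightingale developer conference to share best user stories, align on annual development goals and plans, and discuss new technical directions;

## Community guiding principle (Philosophy)

**Respect, recognize, and record the work of every contributor.**

## Principles for asking questions
# Community guiding principle (Philosophy)
- Respect, recognize, and record the work of every contributor;

# Principles for asking questions
In keeping with the principle of **respecting, recognizing, and recording the work of every contributor**, we encourage **asking questions efficiently**, which respects developers' time and also adds to the community's accumulated knowledge:

1. Before asking, check the [FAQ](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq);
2. Before asking, search the [github issues](https://github.com/ccfos/nightingale/issues);
3. We recommend asking questions by submitting a github issue first: [report a problem here](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&template=bug_report.yml) | [submit a feature request here](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md);

Finally, we recommend joining the WeChat group to discuss open-ended questions (first add [UlricGO](https://www.gitlink.org.cn/UlricQin/gist/tree/master/self.jpeg) as a friend with the note "夜莺加群+姓名+公司", that is, "join Nightingale group + name + company"; the developer team and knowledgeable, helpful members answer questions in the group);
4. Finally, we recommend joining the WeChat group to discuss open-ended questions (first add [UlricGO](https://www.gitlink.org.cn/UlricQin/gist/tree/master/self.jpeg) as a friend with the note "夜莺加群+姓名+公司", that is, "join Nightingale group + name + company"; the developer team and knowledgeable, helpful members answer questions in the group);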
@@ -1,5 +0,0 @@
## Contributors

<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>
@@ -1,5 +0,0 @@
## End Users

- [China Mobile (中移动)](https://github.com/ccfos/nightingale/issues/897#issuecomment-1086573166)
- [inke](https://github.com/ccfos/nightingale/issues/897#issuecomment-1099840636)
- [Founder Securities (方正证券)](https://github.com/ccfos/nightingale/issues/897#issuecomment-1110492461)
@@ -1,7 +0,0 @@
## PMC Chair

- [laiwei](https://github.com/laiwei)

## PMC Member

- [UlricQin](https://github.com/UlricQin)
@@ -1,234 +0,0 @@
{
  "name": "Nightingale dashboard",
  "tags": "",
  "configs": {
    "var": [],
    "panels": [
      {
        "targets": [
          {
            "refId": "A",
            "expr": "rate(n9e_server_samples_received_total[1m])"
          }
        ],
        "name": "Data points received per second",
        "options": {
          "tooltip": {
            "mode": "all",
            "sort": "none"
          },
          "legend": {
            "displayMode": "hidden"
          },
          "standardOptions": {},
          "thresholds": {}
        },
        "custom": {
          "drawStyle": "lines",
          "lineInterpolation": "smooth",
          "fillOpacity": 0.5,
          "stack": "off"
        },
        "version": "2.0.0",
        "type": "timeseries",
        "layout": {
          "h": 4,
          "w": 12,
          "x": 0,
          "y": 0,
          "i": "53fcb9dc-23f9-41e0-bc5e-121eed14c3a4",
          "isResizable": true
        },
        "id": "53fcb9dc-23f9-41e0-bc5e-121eed14c3a4"
      },
      {
        "targets": [
          {
            "refId": "A",
            "expr": "rate(n9e_server_alerts_total[10m])"
          }
        ],
        "name": "Alert events generated per second",
        "options": {
          "tooltip": {
            "mode": "all",
            "sort": "none"
          },
          "legend": {
            "displayMode": "hidden"
          },
          "standardOptions": {},
          "thresholds": {}
        },
        "custom": {
          "drawStyle": "lines",
          "lineInterpolation": "smooth",
          "fillOpacity": 0.5,
          "stack": "off"
        },
        "version": "2.0.0",
        "type": "timeseries",
        "layout": {
          "h": 4,
          "w": 12,
          "x": 12,
          "y": 0,
          "i": "47fc6252-9cc8-4b53-8e27-0c5c59a47269",
          "isResizable": true
        },
        "id": "f70dcb8b-b58b-4ef9-9e48-f230d9e17140"
      },
      {
        "targets": [
          {
            "refId": "A",
            "expr": "n9e_server_alert_queue_size"
          }
        ],
        "name": "In-memory alert event queue length",
        "options": {
          "tooltip": {
            "mode": "all",
            "sort": "none"
          },
          "legend": {
            "displayMode": "hidden"
          },
          "standardOptions": {},
          "thresholds": {}
        },
        "custom": {
          "drawStyle": "lines",
          "lineInterpolation": "smooth",
          "fillOpacity": 0.5,
          "stack": "off"
        },
        "version": "2.0.0",
        "type": "timeseries",
        "layout": {
          "h": 4,
          "w": 12,
          "x": 0,
          "y": 4,
          "i": "ad1af16c-de0c-45f4-8875-cea4e85d51d0",
          "isResizable": true
        },
        "id": "caf23e58-d907-42b0-9ed6-722c8c6f3c5f"
      },
      {
        "targets": [
          {
            "refId": "A",
            "expr": "n9e_server_http_request_duration_seconds_sum/n9e_server_http_request_duration_seconds_count"
          }
        ],
        "name": "Average response time of the data ingestion API (unit: seconds)",
        "options": {
          "tooltip": {
            "mode": "all",
            "sort": "desc"
          },
          "legend": {
            "displayMode": "hidden"
          },
          "standardOptions": {},
          "thresholds": {}
        },
        "custom": {
          "drawStyle": "lines",
          "lineInterpolation": "smooth",
          "fillOpacity": 0.5,
          "stack": "noraml"
        },
        "version": "2.0.0",
        "type": "timeseries",
        "layout": {
          "h": 4,
          "w": 12,
          "x": 12,
          "y": 4,
          "i": "64c3abc2-404c-4462-a82f-c109a21dac91",
          "isResizable": true
        },
        "id": "6b8d2db1-efca-4b9e-b429-57a9d2272bc5"
      },
      {
        "targets": [
          {
            "refId": "A",
            "expr": "n9e_server_sample_queue_size"
          }
        ],
        "name": "In-memory sample queue length",
        "options": {
          "tooltip": {
            "mode": "all",
            "sort": "desc"
          },
          "legend": {
            "displayMode": "hidden"
          },
          "standardOptions": {},
          "thresholds": {}
        },
        "custom": {
          "drawStyle": "lines",
          "lineInterpolation": "smooth",
          "fillOpacity": 0.5,
          "stack": "off"
        },
        "version": "2.0.0",
        "type": "timeseries",
        "layout": {
          "h": 4,
          "w": 12,
          "x": 0,
          "y": 8,
          "i": "1c7da942-58c2-40dc-b42f-983e4a35b89b",
          "isResizable": true
        },
        "id": "bd41677d-40d3-482e-bb6e-fbd25df46d87"
      },
      {
        "targets": [
          {
            "refId": "A",
            "expr": "avg(n9e_server_forward_duration_seconds_sum/n9e_server_forward_duration_seconds_count)"
          }
        ],
        "name": "Average time to forward data to the TSDB (unit: seconds)",
        "options": {
          "tooltip": {
            "mode": "all",
            "sort": "desc"
          },
          "legend": {
            "displayMode": "hidden"
          },
          "standardOptions": {
            "decimals": 8
          },
          "thresholds": {}
        },
        "custom": {
          "drawStyle": "lines",
          "lineInterpolation": "smooth",
          "fillOpacity": 0.5,
          "stack": "noraml"
        },
        "version": "2.0.0",
        "type": "timeseries",
        "layout": {
          "h": 4,
          "w": 12,
          "x": 12,
          "y": 8,
          "i": "eed94a0b-954f-48ac-82e5-a2eada1c8a3d",
          "isResizable": true
        },
        "id": "c8642e72-f384-46a5-8410-1e6be2953c3c"
      }
    ],
    "version": "2.0.0"
  }
}
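Two of the dashboard panels divide a `_sum` series by a `_count` series. A minimal Python sketch of that arithmetic follows, assuming the usual Prometheus convention that `_sum` and `_count` are cumulative totals. The function names and sample numbers are illustrative and do not come from the repository.

```python
from typing import Optional, Tuple

# (cumulative duration in seconds, cumulative number of observations)
Sample = Tuple[float, float]


def lifetime_average(sample: Sample) -> Optional[float]:
    """Average duration over everything observed so far: sum / count."""
    total, count = sample
    return total / count if count else None


def window_average(prev: Sample, curr: Sample) -> Optional[float]:
    """Average duration of only the observations made between two scrapes."""
    d_sum = curr[0] - prev[0]
    d_count = curr[1] - prev[1]
    return d_sum / d_count if d_count > 0 else None


if __name__ == "__main__":
    earlier, later = (12.0, 48.0), (15.0, 58.0)
    print(lifetime_average(later))        # average since the counters started
    print(window_average(earlier, later)) # average over the latest interval
```

Dividing the raw series, as the panel expressions do, corresponds to `lifetime_average`; the windowed form is the variant to reach for when only recent behaviour matters.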
@@ -1,5 +1,4 @@
FROM python:2.7.8-slim
#FROM python:2
FROM python:2
#FROM ubuntu:21.04

WORKDIR /app

@@ -1,4 +1,4 @@
FROM --platform=$BUILDPLATFORM python:2.7.8-slim
FROM --platform=$BUILDPLATFORM python:2


WORKDIR /app

@@ -43,9 +43,3 @@ basic_auth_pass = ""
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100

[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"
@@ -6,7 +6,6 @@ networks:

services:
  mysql:
    platform: linux/x86_64
    image: "mysql:5.7"
    container_name: mysql
    hostname: mysql
@@ -80,7 +79,7 @@ services:
      sh -c "/wait && /app/ibex server"

  nwebapi:
    image: flashcatcloud/nightingale:latest
    image: ulric2019/nightingale:5.9.4
    container_name: nwebapi
    hostname: nwebapi
    restart: always
@@ -108,7 +107,7 @@ services:
      sh -c "/wait && /app/n9e webapi"

  nserver:
    image: flashcatcloud/nightingale:latest
    image: ulric2019/nightingale:5.9.4
    container_name: nserver
    hostname: nserver
    restart: always
@@ -136,7 +135,7 @@ services:
      sh -c "/wait && /app/n9e server"

  categraf:
    image: "flashcatcloud/categraf:latest"
    image: "flashcatcloud/categraf:v0.1.9"
    container_name: "categraf"
    hostname: "categraf01"
    restart: always
@@ -150,7 +149,7 @@ services:
      - /:/hostfs
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - "9100:9100/tcp"
      - "8094:8094/tcp"
    networks:
      - nightingale
    depends_on:

@@ -52,7 +52,7 @@ insert into user_group_member(group_id, user_id) values(1, 1);
CREATE TABLE `configs` (
    `id` bigint unsigned not null auto_increment,
    `ckey` varchar(191) not null,
    `cval` varchar(4096) not null default '',
    `cval` varchar(1024) not null default '',
    PRIMARY KEY (`id`),
    UNIQUE KEY (`ckey`)
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;
@@ -226,7 +226,6 @@ CREATE TABLE `chart_share` (
CREATE TABLE `alert_rule` (
    `id` bigint unsigned not null auto_increment,
    `group_id` bigint not null default 0 comment 'busi group id',
    `cate` varchar(128) not null,
    `cluster` varchar(128) not null,
    `name` varchar(255) not null,
    `note` varchar(1024) not null default '',
@@ -265,7 +264,6 @@ CREATE TABLE `alert_mute` (
    `id` bigint unsigned not null auto_increment,
    `group_id` bigint not null default 0 comment 'busi group id',
    `prod` varchar(255) not null default '',
    `cate` varchar(128) not null,
    `cluster` varchar(128) not null,
    `tags` varchar(4096) not null default '' comment 'json,map,tagkey->regexp|value',
    `cause` varchar(255) not null default '',
@@ -281,7 +279,6 @@ CREATE TABLE `alert_mute` (
CREATE TABLE `alert_subscribe` (
    `id` bigint unsigned not null auto_increment,
    `group_id` bigint not null default 0 comment 'busi group id',
    `cate` varchar(128) not null,
    `cluster` varchar(128) not null,
    `rule_id` bigint not null default 0,
    `tags` varchar(4096) not null default '' comment 'json,map,tagkey->regexp|value',
@@ -383,7 +380,6 @@ insert into alert_aggr_view(name, rule, cate) values('By RuleName', 'field:rule_

CREATE TABLE `alert_cur_event` (
    `id` bigint unsigned not null comment 'use alert_his_event.id',
    `cate` varchar(128) not null,
    `cluster` varchar(128) not null,
    `group_id` bigint unsigned not null comment 'busi group id of rule',
    `group_name` varchar(255) not null default '' comment 'busi group name',
@@ -406,7 +402,6 @@ CREATE TABLE `alert_cur_event` (
    `notify_cur_number` int not null default 0 comment '',
    `target_ident` varchar(191) not null default '' comment 'target ident, also in tags',
    `target_note` varchar(191) not null default '' comment 'target note',
    `first_trigger_time` bigint,
    `trigger_time` bigint not null,
    `trigger_value` varchar(255) not null,
    `tags` varchar(1024) not null default '' comment 'merge data_tags rule_tags, split by ,,',
@@ -420,7 +415,6 @@ CREATE TABLE `alert_cur_event` (
CREATE TABLE `alert_his_event` (
    `id` bigint unsigned not null AUTO_INCREMENT,
    `is_recovered` tinyint(1) not null,
    `cate` varchar(128) not null,
    `cluster` varchar(128) not null,
    `group_id` bigint unsigned not null comment 'busi group id of rule',
    `group_name` varchar(255) not null default '' comment 'busi group name',
@@ -442,7 +436,6 @@ CREATE TABLE `alert_his_event` (
    `notify_cur_number` int not null default 0 comment '',
    `target_ident` varchar(191) not null default '' comment 'target ident, also in tags',
    `target_note` varchar(191) not null default '' comment 'target note',
    `first_trigger_time` bigint,
    `trigger_time` bigint not null,
    `trigger_value` varchar(255) not null,
    `recover_time` bigint not null default 0,
@@ -505,13 +498,3 @@ CREATE TABLE `task_record`
    KEY (`create_at`, `group_id`),
    KEY (`create_by`)
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;

CREATE TABLE `alerting_engines`
(
    `id` int unsigned NOT NULL AUTO_INCREMENT,
    `instance` varchar(128) not null default '' comment 'instance identification, e.g. 10.9.0.9:9090',
    `cluster` varchar(128) not null default '' comment 'target reader cluster',
    `clock` bigint not null,
    PRIMARY KEY (`id`),
    UNIQUE KEY (`instance`)
) ENGINE = InnoDB DEFAULT CHARSET = utf8mb4;

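The `alerting_engines` table has a unique `instance` column and a `clock` column, which suggests a heartbeat registry. The sketch below is only an assumption about how such a table could be used, not the project's actual code: each engine upserts its own row, and a reader treats rows with a recent `clock` as alive. It uses Python's built-in `sqlite3` as a stand-in for MySQL (where the upsert would be written with `ON DUPLICATE KEY UPDATE`); the function names and the 30-second window are hypothetical.

```python
import sqlite3
import time

# SQLite rendering of the columns shown in the diff (types simplified).
SCHEMA = """
CREATE TABLE alerting_engines (
    id       INTEGER PRIMARY KEY AUTOINCREMENT,
    instance TEXT NOT NULL DEFAULT '' UNIQUE,
    cluster  TEXT NOT NULL DEFAULT '',
    clock    INTEGER NOT NULL
)
"""


def make_db():
    """Create an in-memory database with the alerting_engines table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def heartbeat(conn, instance, cluster, now=None):
    """Insert or refresh this instance's row (assumed usage of `clock`)."""
    clock = int(time.time()) if now is None else now
    conn.execute(
        "INSERT INTO alerting_engines (instance, cluster, clock) VALUES (?, ?, ?) "
        "ON CONFLICT(instance) DO UPDATE SET "
        "cluster = excluded.cluster, clock = excluded.clock",
        (instance, cluster, clock),
    )
    conn.commit()


def alive_instances(conn, cluster, now, max_age=30):
    """Instances of a cluster whose last heartbeat is newer than max_age seconds."""
    rows = conn.execute(
        "SELECT instance FROM alerting_engines "
        "WHERE cluster = ? AND clock > ? ORDER BY instance",
        (cluster, now - max_age),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = make_db()
    heartbeat(db, "10.9.0.9:9090", "Default", now=120)
    heartbeat(db, "10.9.0.10:9090", "Default", now=50)
    print(alive_instances(db, "Default", now=125))
```

The unique key on `instance` is what makes the upsert safe to repeat: a second heartbeat from the same address updates the existing row instead of adding a new one.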
@@ -54,7 +54,7 @@ insert into user_group_member(group_id, user_id) values(1, 1);
CREATE TABLE configs (
    id bigserial,
    ckey varchar(191) not null,
    cval varchar(4096) not null default ''
    cval varchar(1024) not null default ''
) ;
ALTER TABLE configs ADD CONSTRAINT configs_pk PRIMARY KEY (id);
ALTER TABLE configs ADD CONSTRAINT configs_un UNIQUE (ckey);
@@ -436,7 +436,6 @@ CREATE TABLE alert_cur_event (
    notify_cur_number int4 not null default 0,
    target_ident varchar(191) NOT NULL DEFAULT ''::character varying,
    target_note varchar(191) NOT NULL DEFAULT ''::character varying,
    first_trigger_time int8,
    trigger_time int8 NOT NULL,
    trigger_value varchar(255) NOT NULL,
    tags varchar(1024) NOT NULL DEFAULT ''::character varying,
@@ -488,7 +487,6 @@ CREATE TABLE alert_his_event (
    notify_cur_number int4 not null default 0,
    target_ident varchar(191) NOT NULL DEFAULT ''::character varying,
    target_note varchar(191) NOT NULL DEFAULT ''::character varying,
    first_trigger_time int8,
    trigger_time int8 NOT NULL,
    trigger_value varchar(255) NOT NULL,
    recover_time int8 NOT NULL DEFAULT 0,
@@ -580,15 +578,3 @@ CREATE INDEX task_record_create_by_idx ON task_record (create_by);

COMMENT ON COLUMN task_record.id IS 'ibex task id';
COMMENT ON COLUMN task_record.group_id IS 'busi group id';

CREATE TABLE alerting_engines
(
    id bigserial NOT NULL,
    instance varchar(128) not null default '',
    cluster varchar(128) not null default '',
    clock bigint not null
) ;
ALTER TABLE alerting_engines ADD CONSTRAINT alerting_engines_pk PRIMARY KEY (id);
ALTER TABLE alerting_engines ADD CONSTRAINT alerting_engines_un UNIQUE (instance);
COMMENT ON COLUMN alerting_engines.instance IS 'instance identification, e.g. 10.9.0.9:9090';
COMMENT ON COLUMN alerting_engines.cluster IS 'target reader cluster';

@@ -174,10 +174,4 @@ Address = "http://ibex:10090"
|
||||
BasicAuthUser = "ibex"
|
||||
BasicAuthPass = "ibex"
|
||||
# unit: ms
|
||||
Timeout = 3000
|
||||
|
||||
[TargetMetrics]
|
||||
TargetUp = '''max(max_over_time(target_up{ident=~"(%s)"}[%dm])) by (ident)'''
|
||||
LoadPerCore = '''max(max_over_time(system_load_norm_1{ident=~"(%s)"}[%dm])) by (ident)'''
|
||||
MemUtil = '''100-max(max_over_time(mem_available_percent{ident=~"(%s)"}[%dm])) by (ident)'''
|
||||
DiskUtil = '''max(max_over_time(disk_used_percent{ident=~"(%s)", path="/"}[%dm])) by (ident)'''
|
||||
Timeout = 3000
|
||||
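The `[TargetMetrics]` templates above each carry two printf-style placeholders: `%s` inside the `ident=~"(...)"` matcher and `%d` in front of the `m` range unit. A minimal sketch of how such a template could be expanded, assuming the identifiers are joined with `|` into a regex alternation and the window is given in minutes (the helper name `buildTargetQuery` and the sample idents are hypothetical and do not come from the repository):

```go
package main

import (
	"fmt"
	"strings"
)

// targetUpTpl is copied from the TargetUp entry in the config hunk above.
const targetUpTpl = `max(max_over_time(target_up{ident=~"(%s)"}[%dm])) by (ident)`

// buildTargetQuery is a hypothetical helper: it joins the idents into a
// regex alternation for %s and uses the window in minutes for %d.
// It does not escape regex metacharacters in the idents.
func buildTargetQuery(tpl string, idents []string, minutes int) string {
	return fmt.Sprintf(tpl, strings.Join(idents, "|"), minutes)
}

func main() {
	fmt.Println(buildTargetQuery(targetUpTpl, []string{"host-a", "host-b"}, 5))
}
```

Identifiers that contain regex metacharacters such as `.` would need escaping before they are joined; the sketch leaves that out.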
496
etc/metrics.yaml
@@ -1,383 +1,131 @@
|
||||
zh:
|
||||
cpu_usage_idle: CPU空闲率(单位:%)
|
||||
cpu_usage_active: CPU使用率(单位:%)
|
||||
cpu_usage_system: CPU内核态时间占比(单位:%)
|
||||
cpu_usage_user: CPU用户态时间占比(单位:%)
|
||||
cpu_usage_nice: 低优先级用户态CPU时间占比,也就是进程nice值被调整为1-19之间的CPU时间。这里注意,nice可取值范围是-20到19,数值越大,优先级反而越低(单位:%)
|
||||
cpu_usage_iowait: CPU等待I/O的时间占比(单位:%)
|
||||
cpu_usage_irq: CPU处理硬中断的时间占比(单位:%)
|
||||
cpu_usage_softirq: CPU处理软中断的时间占比(单位:%)
|
||||
cpu_usage_steal: 在虚拟机环境下有该指标,表示CPU被其他虚拟机争用的时间占比,超过20就表示争抢严重(单位:%)
|
||||
cpu_usage_guest: 通过虚拟化运行其他操作系统的时间,也就是运行虚拟机的CPU时间占比(单位:%)
|
||||
cpu_usage_guest_nice: 以低优先级运行虚拟机的时间占比(单位:%)
|
||||
cpu_usage_idle: CPU空闲率(单位:%)
|
||||
cpu_usage_active: CPU使用率(单位:%)
|
||||
cpu_usage_system: CPU内核态时间占比(单位:%)
|
||||
cpu_usage_user: CPU用户态时间占比(单位:%)
|
||||
cpu_usage_nice: 低优先级用户态CPU时间占比,也就是进程nice值被调整为1-19之间的CPU时间。这里注意,nice可取值范围是-20到19,数值越大,优先级反而越低(单位:%)
|
||||
cpu_usage_iowait: CPU等待I/O的时间占比(单位:%)
|
||||
cpu_usage_irq: CPU处理硬中断的时间占比(单位:%)
|
||||
cpu_usage_softirq: CPU处理软中断的时间占比(单位:%)
|
||||
cpu_usage_steal: 在虚拟机环境下有该指标,表示CPU被其他虚拟机争用的时间占比,超过20就表示争抢严重(单位:%)
|
||||
cpu_usage_guest: 通过虚拟化运行其他操作系统的时间,也就是运行虚拟机的CPU时间占比(单位:%)
|
||||
cpu_usage_guest_nice: 以低优先级运行虚拟机的时间占比(单位:%)
|
||||
|
||||
disk_free: 硬盘分区剩余量(单位:byte)
|
||||
disk_used: 硬盘分区使用量(单位:byte)
|
||||
disk_used_percent: 硬盘分区使用率(单位:%)
|
||||
disk_total: 硬盘分区总量(单位:byte)
|
||||
disk_inodes_free: 硬盘分区inode剩余量
|
||||
disk_inodes_used: 硬盘分区inode使用量
|
||||
disk_inodes_total: 硬盘分区inode总量
|
||||
disk_free: 硬盘分区剩余量(单位:byte)
|
||||
disk_used: 硬盘分区使用量(单位:byte)
|
||||
disk_used_percent: 硬盘分区使用率(单位:%)
|
||||
disk_total: 硬盘分区总量(单位:byte)
|
||||
disk_inodes_free: 硬盘分区inode剩余量
|
||||
disk_inodes_used: 硬盘分区inode使用量
|
||||
disk_inodes_total: 硬盘分区inode总量
|
||||
|
||||
diskio_io_time: 从设备视角来看I/O请求总时间,队列中有I/O请求就计数(单位:毫秒),counter类型,需要用函数求rate才有使用价值
|
||||
diskio_iops_in_progress: 已经分配给设备驱动且尚未完成的IO请求,不包含在队列中但尚未分配给设备驱动的IO请求,gauge类型
|
||||
diskio_merged_reads: 相邻读请求merge读的次数,counter类型
|
||||
diskio_merged_writes: 相邻写请求merge写的次数,counter类型
|
||||
diskio_read_bytes: 读取的byte数量,counter类型,需要用函数求rate才有使用价值
|
||||
diskio_read_time: 读请求总时间(单位:毫秒),counter类型,需要用函数求rate才有使用价值
|
||||
diskio_reads: 读请求次数,counter类型,需要用函数求rate才有使用价值
|
||||
diskio_weighted_io_time: 从I/O请求视角来看I/O等待总时间,如果同时有多个I/O请求,时间会叠加(单位:毫秒)
|
||||
diskio_write_bytes: 写入的byte数量,counter类型,需要用函数求rate才有使用价值
|
||||
diskio_write_time: 写请求总时间(单位:毫秒),counter类型,需要用函数求rate才有使用价值
|
||||
diskio_writes: 写请求次数,counter类型,需要用函数求rate才有使用价值
|
||||
diskio_io_time: 从设备视角来看I/O请求总时间,队列中有I/O请求就计数(单位:毫秒),counter类型,需要用函数求rate才有使用价值
|
||||
diskio_iops_in_progress: 已经分配给设备驱动且尚未完成的IO请求,不包含在队列中但尚未分配给设备驱动的IO请求,gauge类型
|
||||
diskio_merged_reads: 相邻读请求merge读的次数,counter类型
|
||||
diskio_merged_writes: 相邻写请求merge写的次数,counter类型
|
||||
diskio_read_bytes: 读取的byte数量,counter类型,需要用函数求rate才有使用价值
|
||||
diskio_read_time: 读请求总时间(单位:毫秒),counter类型,需要用函数求rate才有使用价值
|
||||
diskio_reads: 读请求次数,counter类型,需要用函数求rate才有使用价值
|
||||
diskio_weighted_io_time: 从I/O请求视角来看I/O等待总时间,如果同时有多个I/O请求,时间会叠加(单位:毫秒)
|
||||
diskio_write_bytes: 写入的byte数量,counter类型,需要用函数求rate才有使用价值
|
||||
diskio_write_time: 写请求总时间(单位:毫秒),counter类型,需要用函数求rate才有使用价值
|
||||
diskio_writes: 写请求次数,counter类型,需要用函数求rate才有使用价值
|
||||
|
||||
kernel_boot_time: 内核启动时间
|
||||
kernel_context_switches: 内核上下文切换次数
|
||||
kernel_entropy_avail: linux系统内部的熵池
|
||||
kernel_interrupts: 内核中断次数
|
||||
kernel_processes_forked: fork的进程数
|
||||
kernel_boot_time: 内核启动时间
|
||||
kernel_context_switches: 内核上下文切换次数
|
||||
kernel_entropy_avail: linux系统内部的熵池
|
||||
kernel_interrupts: 内核中断次数
|
||||
kernel_processes_forked: fork的进程数
|
||||
|
||||
mem_active: 活跃使用的内存总数(包括cache和buffer内存)
|
||||
mem_available: 应用程序可用内存数
|
||||
mem_available_percent: 内存剩余百分比(0~100)
|
||||
mem_buffered: 用来给文件做缓冲大小
|
||||
mem_cached: 被高速缓冲存储器(cache memory)用的内存的大小(等于 diskcache minus SwapCache )
|
||||
mem_commit_limit: 根据超额分配比率('vm.overcommit_ratio'),这是当前在系统上分配可用的内存总量,这个限制只是在模式2('vm.overcommit_memory')时启用
|
||||
mem_committed_as: 目前在系统上分配的内存量。是所有进程申请的内存的总和
|
||||
mem_dirty: 等待被写回到磁盘的内存大小
|
||||
mem_free: 空闲内存数
|
||||
mem_high_free: 未被使用的高位内存大小
|
||||
mem_high_total: 高位内存总大小(Highmem是指所有内存高于860MB的物理内存,Highmem区域供用户程序使用,或用于页面缓存。该区域不是直接映射到内核空间。内核必须使用不同的手法使用该段内存)
|
||||
mem_huge_page_size: 每个大页的大小
|
||||
mem_huge_pages_free: 池中尚未分配的 HugePages 数量
|
||||
mem_huge_pages_total: 预留HugePages的总个数
|
||||
mem_inactive: 空闲的内存数(包括free和available的内存)
|
||||
mem_low_free: 未被使用的低位大小
|
||||
mem_low_total: 低位内存总大小,低位可以达到高位内存一样的作用,而且它还能够被内核用来记录一些自己的数据结构
|
||||
mem_mapped: 设备和文件等映射的大小
|
||||
mem_page_tables: 管理内存分页页面的索引表的大小
|
||||
mem_shared: 多个进程共享的内存总额
|
||||
mem_slab: 内核数据结构缓存的大小,可以减少申请和释放内存带来的消耗
|
||||
mem_sreclaimable: 可收回Slab的大小
|
||||
mem_sunreclaim: 不可收回Slab的大小(SUnreclaim+SReclaimable=Slab)
|
||||
mem_swap_cached: 被高速缓冲存储器(cache memory)用的交换空间的大小,已经被交换出来的内存,但仍然被存放在swapfile中。用来在需要的时候很快的被替换而不需要再次打开I/O端口
|
||||
mem_swap_free: 未被使用交换空间的大小
|
||||
mem_swap_total: 交换空间的总大小
|
||||
mem_total: 内存总数
|
||||
mem_used: 已用内存数
|
||||
mem_used_percent: 已用内存数百分比(0~100)
|
||||
mem_vmalloc_chunk: 最大的连续未被使用的vmalloc区域
|
||||
mem_vmalloc_totalL: 可以vmalloc虚拟内存大小
|
||||
mem_vmalloc_used: vmalloc已使用的虚拟内存大小
|
||||
mem_write_back: 正在被写回到磁盘的内存大小
|
||||
mem_write_back_tmp: FUSE用于临时写回缓冲区的内存
|
||||
mem_active: 活跃使用的内存总数(包括cache和buffer内存)
|
||||
mem_available: 应用程序可用内存数
|
||||
mem_available_percent: 内存剩余百分比(0~100)
|
||||
mem_buffered: 用来给文件做缓冲大小
|
||||
mem_cached: 被高速缓冲存储器(cache memory)用的内存的大小(等于 diskcache minus SwapCache )
|
||||
mem_commit_limit: 根据超额分配比率('vm.overcommit_ratio'),这是当前在系统上分配可用的内存总量,这个限制只是在模式2('vm.overcommit_memory')时启用
|
||||
mem_committed_as: 目前在系统上分配的内存量。是所有进程申请的内存的总和
|
||||
mem_dirty: 等待被写回到磁盘的内存大小
|
||||
mem_free: 空闲内存数
|
||||
mem_high_free: 未被使用的高位内存大小
|
||||
mem_high_total: 高位内存总大小(Highmem是指所有内存高于860MB的物理内存,Highmem区域供用户程序使用,或用于页面缓存。该区域不是直接映射到内核空间。内核必须使用不同的手法使用该段内存)
|
||||
mem_huge_page_size: 每个大页的大小
|
||||
mem_huge_pages_free: 池中尚未分配的 HugePages 数量
|
||||
mem_huge_pages_total: 预留HugePages的总个数
|
||||
mem_inactive: 空闲的内存数(包括free和available的内存)
|
||||
mem_low_free: 未被使用的低位大小
|
||||
mem_low_total: 低位内存总大小,低位可以达到高位内存一样的作用,而且它还能够被内核用来记录一些自己的数据结构
|
||||
mem_mapped: 设备和文件等映射的大小
|
||||
mem_page_tables: 管理内存分页页面的索引表的大小
|
||||
mem_shared: 多个进程共享的内存总额
|
||||
mem_slab: 内核数据结构缓存的大小,可以减少申请和释放内存带来的消耗
|
||||
mem_sreclaimable: 可收回Slab的大小
|
||||
mem_sunreclaim: 不可收回Slab的大小(SUnreclaim+SReclaimable=Slab)
|
||||
mem_swap_cached: 被高速缓冲存储器(cache memory)用的交换空间的大小,已经被交换出来的内存,但仍然被存放在swapfile中。用来在需要的时候很快的被替换而不需要再次打开I/O端口
|
||||
mem_swap_free: 未被使用交换空间的大小
|
||||
mem_swap_total: 交换空间的总大小
|
||||
mem_total: 内存总数
|
||||
mem_used: 已用内存数
|
||||
mem_used_percent: 已用内存数百分比(0~100)
|
||||
mem_vmalloc_chunk: 最大的连续未被使用的vmalloc区域
|
||||
mem_vmalloc_totalL: 可以vmalloc虚拟内存大小
|
||||
mem_vmalloc_used: vmalloc已使用的虚拟内存大小
|
||||
mem_write_back: 正在被写回到磁盘的内存大小
|
||||
mem_write_back_tmp: FUSE用于临时写回缓冲区的内存
|
||||
|
||||
net_bytes_recv: 网卡收包总数(bytes)
|
||||
net_bytes_sent: 网卡发包总数(bytes)
|
||||
net_drop_in: 网卡收丢包数量
|
||||
net_drop_out: 网卡发丢包数量
|
||||
net_err_in: 网卡收包错误数量
|
||||
net_err_out: 网卡发包错误数量
|
||||
net_packets_recv: 网卡收包数量
|
||||
net_packets_sent: 网卡发包数量
|
||||
net_bytes_recv: 网卡收包总数(bytes)
|
||||
net_bytes_sent: 网卡发包总数(bytes)
|
||||
net_drop_in: 网卡收丢包数量
|
||||
net_drop_out: 网卡发丢包数量
|
||||
net_err_in: 网卡收包错误数量
|
||||
net_err_out: 网卡发包错误数量
|
||||
net_packets_recv: 网卡收包数量
|
||||
net_packets_sent: 网卡发包数量
|
||||
|
||||
netstat_tcp_established: ESTABLISHED状态的网络链接数
|
||||
netstat_tcp_fin_wait1: FIN_WAIT1状态的网络链接数
|
||||
netstat_tcp_fin_wait2: FIN_WAIT2状态的网络链接数
|
||||
netstat_tcp_last_ack: LAST_ACK状态的网络链接数
|
||||
netstat_tcp_listen: LISTEN状态的网络链接数
|
||||
netstat_tcp_syn_recv: SYN_RECV状态的网络链接数
|
||||
netstat_tcp_syn_sent: SYN_SENT状态的网络链接数
|
||||
netstat_tcp_time_wait: TIME_WAIT状态的网络链接数
|
||||
netstat_udp_socket: UDP状态的网络链接数
|
||||
netstat_tcp_established: ESTABLISHED状态的网络链接数
|
||||
netstat_tcp_fin_wait1: FIN_WAIT1状态的网络链接数
|
||||
netstat_tcp_fin_wait2: FIN_WAIT2状态的网络链接数
|
||||
netstat_tcp_last_ack: LAST_ACK状态的网络链接数
|
||||
netstat_tcp_listen: LISTEN状态的网络链接数
|
||||
netstat_tcp_syn_recv: SYN_RECV状态的网络链接数
|
||||
netstat_tcp_syn_sent: SYN_SENT状态的网络链接数
|
||||
netstat_tcp_time_wait: TIME_WAIT状态的网络链接数
|
||||
netstat_udp_socket: UDP状态的网络链接数
|
||||
|
||||
#[ping]
|
||||
ping_percent_packet_loss: ping数据包丢失百分比(%)
|
||||
ping_result_code: ping返回码('0','1')
|
||||
processes_blocked: 不可中断的睡眠状态下的进程数('U','D','L')
|
||||
processes_dead: 回收中的进程数('X')
|
||||
processes_idle: 挂起的空闲进程数('I')
|
||||
processes_paging: 分页进程数('P')
|
||||
processes_running: 运行中的进程数('R')
|
||||
processes_sleeping: 可中断进程数('S')
|
||||
processes_stopped: 暂停状态进程数('T')
|
||||
processes_total: 总进程数
|
||||
processes_total_threads: 总线程数
|
||||
processes_unknown: 未知状态进程数
|
||||
processes_zombies: 僵尸态进程数('Z')
|
||||
|
||||
processes_blocked: 不可中断的睡眠状态下的进程数('U','D','L')
|
||||
processes_dead: 回收中的进程数('X')
|
||||
processes_idle: 挂起的空闲进程数('I')
|
||||
processes_paging: 分页进程数('P')
|
||||
processes_running: 运行中的进程数('R')
|
||||
processes_sleeping: 可中断进程数('S')
|
||||
processes_stopped: 暂停状态进程数('T')
|
||||
processes_total: 总进程数
|
||||
processes_total_threads: 总线程数
|
||||
processes_unknown: 未知状态进程数
|
||||
processes_zombies: 僵尸态进程数('Z')
|
||||
swap_used_percent: Swap空间换出数据量
|
||||
|
||||
swap_used_percent: Swap空间换出数据量
|
||||
system_load1: 1分钟平均load值
|
||||
system_load5: 5分钟平均load值
|
||||
system_load15: 15分钟平均load值
|
||||
system_n_users: 用户数
|
||||
system_n_cpus: CPU核数
|
||||
system_uptime: 系统启动时间
|
||||
|
||||
system_load1: 1分钟平均load值
|
||||
system_load5: 5分钟平均load值
|
||||
system_load15: 15分钟平均load值
|
||||
system_n_users: 用户数
|
||||
system_n_cpus: CPU核数
|
||||
system_uptime: 系统启动时间
|
||||
nginx_accepts: 自nginx启动起,与客户端建立过得连接总数
|
||||
nginx_active: 当前nginx正在处理的活动连接数,等于Reading/Writing/Waiting总和
|
||||
nginx_handled: 自nginx启动起,处理过的客户端连接总数
|
||||
nginx_reading: 正在读取HTTP请求头部的连接总数
|
||||
nginx_requests: 自nginx启动起,处理过的客户端请求总数,由于存在HTTP Keep-Alive请求,该值会大于handled值
|
||||
nginx_upstream_check_fall: upstream_check模块检测到后端失败的次数
|
||||
nginx_upstream_check_rise: upstream_check模块对后端的检测次数
|
||||
nginx_upstream_check_status_code: 后端upstream的状态,up为1,down为0
|
||||
nginx_waiting: 开启 keep-alive 的情况下,这个值等于 active – (reading+writing), 意思就是 Nginx 已经处理完正在等候下一次请求指令的驻留连接
|
||||
nginx_writing: 正在向客户端发送响应的连接总数
|
||||
|
||||
nginx_accepts: 自nginx启动起,与客户端建立过得连接总数
|
||||
nginx_active: 当前nginx正在处理的活动连接数,等于Reading/Writing/Waiting总和
|
||||
nginx_handled: 自nginx启动起,处理过的客户端连接总数
|
||||
nginx_reading: 正在读取HTTP请求头部的连接总数
|
||||
nginx_requests: 自nginx启动起,处理过的客户端请求总数,由于存在HTTP Keep-Alive请求,该值会大于handled值
|
||||
nginx_upstream_check_fall: upstream_check模块检测到后端失败的次数
|
||||
nginx_upstream_check_rise: upstream_check模块对后端的检测次数
|
||||
nginx_upstream_check_status_code: 后端upstream的状态,up为1,down为0
|
||||
nginx_waiting: 开启 keep-alive 的情况下,这个值等于 active – (reading+writing), 意思就是 Nginx 已经处理完正在等候下一次请求指令的驻留连接
|
||||
nginx_writing: 正在向客户端发送响应的连接总数
|
||||
|
||||
http_response_content_length: HTTP消息实体的传输长度
|
||||
http_response_http_response_code: http响应状态码
|
||||
http_response_response_time: http响应用时
|
||||
http_response_result_code: url探测结果0为正常否则url无法访问
|
||||
|
||||
# [aws cloudwatch rds]
|
||||
cloudwatch_aws_rds_bin_log_disk_usage_average: rds 磁盘使用平均值
|
||||
cloudwatch_aws_rds_bin_log_disk_usage_maximum: rds 磁盘使用量最大值
|
||||
cloudwatch_aws_rds_bin_log_disk_usage_minimum: rds binlog 磁盘使用量最低
|
||||
cloudwatch_aws_rds_bin_log_disk_usage_sample_count: rds binlog 磁盘使用情况样本计数
|
||||
cloudwatch_aws_rds_bin_log_disk_usage_sum: rds binlog 磁盘使用总和
|
||||
cloudwatch_aws_rds_burst_balance_average: rds 突发余额平均值
|
||||
cloudwatch_aws_rds_burst_balance_maximum: rds 突发余额最大值
|
||||
cloudwatch_aws_rds_burst_balance_minimum: rds 突发余额最低
|
||||
cloudwatch_aws_rds_burst_balance_sample_count: rds 突发平衡样本计数
|
||||
cloudwatch_aws_rds_burst_balance_sum: rds 突发余额总和
|
||||
cloudwatch_aws_rds_cpu_utilization_average: rds cpu 利用率平均值
|
||||
cloudwatch_aws_rds_cpu_utilization_maximum: rds cpu 利用率最大值
|
||||
cloudwatch_aws_rds_cpu_utilization_minimum: rds cpu 利用率最低
|
||||
cloudwatch_aws_rds_cpu_utilization_sample_count: rds cpu 利用率样本计数
|
||||
cloudwatch_aws_rds_cpu_utilization_sum: rds cpu 利用率总和
|
||||
cloudwatch_aws_rds_database_connections_average: rds 数据库连接平均值
|
||||
cloudwatch_aws_rds_database_connections_maximum: rds 数据库连接数最大值
|
||||
cloudwatch_aws_rds_database_connections_minimum: rds 数据库连接最小
|
||||
cloudwatch_aws_rds_database_connections_sample_count: rds 数据库连接样本数
|
||||
cloudwatch_aws_rds_database_connections_sum: rds 数据库连接总和
|
||||
cloudwatch_aws_rds_db_load_average: rds db 平均负载
|
||||
cloudwatch_aws_rds_db_load_cpu_average: rds db 负载 cpu 平均值
|
||||
cloudwatch_aws_rds_db_load_cpu_maximum: rds db 负载 cpu 最大值
|
||||
cloudwatch_aws_rds_db_load_cpu_minimum: rds db 负载 cpu 最小值
|
||||
cloudwatch_aws_rds_db_load_cpu_sample_count: rds db 加载 CPU 样本数
|
||||
cloudwatch_aws_rds_db_load_cpu_sum: rds db 加载cpu总和
|
||||
cloudwatch_aws_rds_db_load_maximum: rds 数据库负载最大值
|
||||
cloudwatch_aws_rds_db_load_minimum: rds 数据库负载最小值
|
||||
cloudwatch_aws_rds_db_load_non_cpu_average: rds 加载非 CPU 平均值
|
||||
cloudwatch_aws_rds_db_load_non_cpu_maximum: rds 加载非 cpu 最大值
|
||||
cloudwatch_aws_rds_db_load_non_cpu_minimum: rds 加载非 cpu 最小值
|
||||
cloudwatch_aws_rds_db_load_non_cpu_sample_count: rds 加载非 cpu 样本计数
|
||||
cloudwatch_aws_rds_db_load_non_cpu_sum: rds 加载非cpu总和
|
||||
cloudwatch_aws_rds_db_load_sample_count: rds db 加载样本计数
|
||||
cloudwatch_aws_rds_db_load_sum: rds db 负载总和
|
||||
cloudwatch_aws_rds_disk_queue_depth_average: rds 磁盘队列深度平均值
|
||||
cloudwatch_aws_rds_disk_queue_depth_maximum: rds 磁盘队列深度最大值
|
||||
cloudwatch_aws_rds_disk_queue_depth_minimum: rds 磁盘队列深度最小值
|
||||
cloudwatch_aws_rds_disk_queue_depth_sample_count: rds 磁盘队列深度样本计数
|
||||
cloudwatch_aws_rds_disk_queue_depth_sum: rds 磁盘队列深度总和
|
||||
cloudwatch_aws_rds_ebs_byte_balance__average: rds ebs 字节余额平均值
|
||||
cloudwatch_aws_rds_ebs_byte_balance__maximum: rds ebs 字节余额最大值
|
||||
cloudwatch_aws_rds_ebs_byte_balance__minimum: rds ebs 字节余额最低
|
||||
cloudwatch_aws_rds_ebs_byte_balance__sample_count: rds ebs 字节余额样本数
|
||||
cloudwatch_aws_rds_ebs_byte_balance__sum: rds ebs 字节余额总和
|
||||
cloudwatch_aws_rds_ebsio_balance__average: rds ebsio 余额平均值
|
||||
cloudwatch_aws_rds_ebsio_balance__maximum: rds ebsio 余额最大值
|
||||
cloudwatch_aws_rds_ebsio_balance__minimum: rds ebsio 余额最低
|
||||
cloudwatch_aws_rds_ebsio_balance__sample_count: rds ebsio 平衡样本计数
|
||||
cloudwatch_aws_rds_ebsio_balance__sum: rds ebsio 余额总和
|
||||
cloudwatch_aws_rds_free_storage_space_average: rds 免费存储空间平均
|
||||
cloudwatch_aws_rds_free_storage_space_maximum: rds 最大可用存储空间
|
||||
cloudwatch_aws_rds_free_storage_space_minimum: rds 最低可用存储空间
|
||||
cloudwatch_aws_rds_free_storage_space_sample_count: rds 可用存储空间样本数
|
||||
cloudwatch_aws_rds_free_storage_space_sum: rds 免费存储空间总和
|
||||
cloudwatch_aws_rds_freeable_memory_average: rds 可用内存平均值
|
||||
cloudwatch_aws_rds_freeable_memory_maximum: rds 最大可用内存
|
||||
cloudwatch_aws_rds_freeable_memory_minimum: rds 最小可用内存
|
||||
cloudwatch_aws_rds_freeable_memory_sample_count: rds 可释放内存样本数
|
||||
cloudwatch_aws_rds_freeable_memory_sum: rds 可释放内存总和
|
||||
cloudwatch_aws_rds_lvm_read_iops_average: rds lvm 读取 iops 平均值
|
||||
cloudwatch_aws_rds_lvm_read_iops_maximum: rds lvm 读取 iops 最大值
|
||||
cloudwatch_aws_rds_lvm_read_iops_minimum: rds lvm 读取 iops 最低
|
||||
cloudwatch_aws_rds_lvm_read_iops_sample_count: rds lvm 读取 iops 样本计数
|
||||
cloudwatch_aws_rds_lvm_read_iops_sum: rds lvm 读取 iops 总和
|
||||
cloudwatch_aws_rds_lvm_write_iops_average: rds lvm 写入 iops 平均值
|
||||
cloudwatch_aws_rds_lvm_write_iops_maximum: rds lvm 写入 iops 最大值
|
||||
cloudwatch_aws_rds_lvm_write_iops_minimum: rds lvm 写入 iops 最低
|
||||
cloudwatch_aws_rds_lvm_write_iops_sample_count: rds lvm 写入 iops 样本计数
|
||||
cloudwatch_aws_rds_lvm_write_iops_sum: rds lvm 写入 iops 总和
|
||||
cloudwatch_aws_rds_network_receive_throughput_average: rds 网络接收吞吐量平均
|
||||
cloudwatch_aws_rds_network_receive_throughput_maximum: rds 网络接收吞吐量最大值
|
||||
cloudwatch_aws_rds_network_receive_throughput_minimum: rds 网络接收吞吐量最小值
|
||||
cloudwatch_aws_rds_network_receive_throughput_sample_count: rds 网络接收吞吐量样本计数
|
||||
cloudwatch_aws_rds_network_receive_throughput_sum: rds 网络接收吞吐量总和
|
||||
cloudwatch_aws_rds_network_transmit_throughput_average: rds 网络传输吞吐量平均值
|
||||
cloudwatch_aws_rds_network_transmit_throughput_maximum: rds 网络传输吞吐量最大
|
||||
cloudwatch_aws_rds_network_transmit_throughput_minimum: rds 网络传输吞吐量最小值
|
||||
cloudwatch_aws_rds_network_transmit_throughput_sample_count: rds 网络传输吞吐量样本计数
|
||||
cloudwatch_aws_rds_network_transmit_throughput_sum: rds 网络传输吞吐量总和
|
||||
cloudwatch_aws_rds_read_iops_average: rds 读取 iops 平均值
|
||||
cloudwatch_aws_rds_read_iops_maximum: rds 最大读取 iops
|
||||
cloudwatch_aws_rds_read_iops_minimum: rds 读取 iops 最低
|
||||
cloudwatch_aws_rds_read_iops_sample_count: rds 读取 iops 样本计数
|
||||
cloudwatch_aws_rds_read_iops_sum: rds 读取 iops 总和
|
||||
cloudwatch_aws_rds_read_latency_average: rds 读取延迟平均值
|
||||
cloudwatch_aws_rds_read_latency_maximum: rds 读取延迟最大值
|
||||
cloudwatch_aws_rds_read_latency_minimum: rds 最小读取延迟
|
||||
cloudwatch_aws_rds_read_latency_sample_count: rds 读取延迟样本计数
|
||||
cloudwatch_aws_rds_read_latency_sum: rds 读取延迟总和
|
||||
cloudwatch_aws_rds_read_throughput_average: rds 读取吞吐量平均值
|
||||
cloudwatch_aws_rds_read_throughput_maximum: rds 最大读取吞吐量
|
||||
cloudwatch_aws_rds_read_throughput_minimum: rds 最小读取吞吐量
|
||||
cloudwatch_aws_rds_read_throughput_sample_count: rds 读取吞吐量样本计数
|
||||
cloudwatch_aws_rds_read_throughput_sum: rds 读取吞吐量总和
|
||||
cloudwatch_aws_rds_swap_usage_average: rds 交换使用平均值
|
||||
cloudwatch_aws_rds_swap_usage_maximum: rds 交换使用最大值
|
||||
cloudwatch_aws_rds_swap_usage_minimum: rds 交换使用量最低
|
||||
cloudwatch_aws_rds_swap_usage_sample_count: rds 交换使用示例计数
|
||||
cloudwatch_aws_rds_swap_usage_sum: rds 交换使用总和
|
||||
cloudwatch_aws_rds_write_iops_average: rds 写入 iops 平均值
|
||||
cloudwatch_aws_rds_write_iops_maximum: rds 写入 iops 最大值
|
||||
cloudwatch_aws_rds_write_iops_minimum: rds 写入 iops 最低
|
||||
cloudwatch_aws_rds_write_iops_sample_count: rds 写入 iops 样本计数
|
||||
cloudwatch_aws_rds_write_iops_sum: rds 写入 iops 总和
|
||||
cloudwatch_aws_rds_write_latency_average: rds 写入延迟平均值
|
||||
cloudwatch_aws_rds_write_latency_maximum: rds 最大写入延迟
|
||||
cloudwatch_aws_rds_write_latency_minimum: rds 写入延迟最小值
|
||||
cloudwatch_aws_rds_write_latency_sample_count: rds 写入延迟样本计数
|
||||
cloudwatch_aws_rds_write_latency_sum: rds 写入延迟总和
|
||||
cloudwatch_aws_rds_write_throughput_average: rds 写入吞吐量平均值
|
||||
cloudwatch_aws_rds_write_throughput_maximum: rds 最大写入吞吐量
|
||||
cloudwatch_aws_rds_write_throughput_minimum: rds 写入吞吐量最小值
|
||||
cloudwatch_aws_rds_write_throughput_sample_count: rds 写入吞吐量样本计数
|
||||
cloudwatch_aws_rds_write_throughput_sum: rds 写入吞吐量总和
|
||||
|
||||
en:
|
||||
cpu_usage_idle: "CPU idle percentage (unit: %)"
cpu_usage_active: "CPU utilization (unit: %)"
cpu_usage_system: "Percentage of CPU time spent in kernel mode (unit: %)"
cpu_usage_user: "Percentage of CPU time spent in user mode (unit: %)"
cpu_usage_nice: "Percentage of CPU time spent on low-priority user-mode work, i.e. processes whose nice value has been set between 1 and 19. Nice values range from -20 to 19; the larger the value, the lower the priority (unit: %)"
cpu_usage_iowait: "Percentage of CPU time spent waiting for I/O (unit: %)"
cpu_usage_irq: "Percentage of CPU time spent handling hardware interrupts (unit: %)"
cpu_usage_softirq: "Percentage of CPU time spent handling software interrupts (unit: %)"
cpu_usage_steal: "Only present in virtual machine environments: percentage of time the CPU was taken by other virtual machines; a value above 20 indicates heavy contention (unit: %)"
cpu_usage_guest: "Percentage of CPU time spent running other operating systems through virtualization, i.e. running virtual machines (unit: %)"
cpu_usage_guest_nice: "Percentage of CPU time spent running virtual machines at low priority (unit: %)"
|
||||
|
||||
disk_free: "Free space on the disk partition (unit: byte)"
disk_used: "Used space on the disk partition (unit: byte)"
disk_used_percent: "Disk partition usage (unit: %)"
disk_total: "Total size of the disk partition (unit: byte)"
disk_inodes_free: "Free inodes on the disk partition"
disk_inodes_used: "Used inodes on the disk partition"
disk_inodes_total: "Total inodes on the disk partition"
|
||||
|
||||
diskio_io_time: "Total I/O time from the device's perspective; time is counted whenever there are I/O requests in the queue (unit: millisecond). Counter type; only useful after applying a rate function"
diskio_iops_in_progress: "I/O requests already issued to the device driver and not yet completed; excludes requests that are queued but not yet issued to the driver. Gauge type"
diskio_merged_reads: "Number of times adjacent read requests were merged. Counter type"
diskio_merged_writes: "Number of times adjacent write requests were merged. Counter type"
diskio_read_bytes: "Number of bytes read. Counter type; only useful after applying a rate function"
diskio_read_time: "Total time spent on read requests (unit: millisecond). Counter type; only useful after applying a rate function"
diskio_reads: "Number of read requests. Counter type; only useful after applying a rate function"
diskio_weighted_io_time: "Total I/O wait time from the I/O requests' perspective; when several I/O requests are in flight at once, their times add up (unit: millisecond)"
diskio_write_bytes: "Number of bytes written. Counter type; only useful after applying a rate function"
diskio_write_time: "Total time spent on write requests (unit: millisecond). Counter type; only useful after applying a rate function"
diskio_writes: "Number of write requests. Counter type; only useful after applying a rate function"
|
||||
|
||||
kernel_boot_time: "Kernel boot time"
kernel_context_switches: "Number of kernel context switches"
kernel_entropy_avail: "Entropy pool inside the Linux system"
kernel_interrupts: "Number of kernel interrupts"
kernel_processes_forked: "Number of processes forked"
|
||||
|
||||
mem_active: "Total memory in active use (including cache and buffer memory)"
mem_available: "Memory available to applications"
mem_available_percent: "Percentage of memory available (0 ~ 100)"
mem_buffered: "Memory used for file buffers"
mem_cached: "Memory used by the cache (equal to diskcache minus SwapCache)"
mem_commit_limit: "Based on the overcommit ratio ('vm.overcommit_ratio'), the total amount of memory currently available for allocation on the system. This limit is only enforced in mode 2 ('vm.overcommit_memory')"
mem_committed_as: "Amount of memory currently allocated on the system: the sum of the memory requested by all processes"
mem_dirty: "Memory waiting to be written back to disk"
mem_free: "Free memory"
mem_high_free: "Unused high memory"
mem_high_total: "Total high memory (Highmem is all physical memory above 860 MB; the Highmem area is used by user programs or for the page cache. It is not directly mapped into kernel space, so the kernel must use different techniques to access it)"
mem_huge_page_size: "Size of each huge page"
mem_huge_pages_free: "Number of HugePages in the pool that have not been allocated"
mem_huge_pages_total: "Total number of reserved HugePages"
mem_inactive: "Idle memory (including free and available memory)"
mem_low_free: "Unused low memory"
mem_low_total: "Total low memory. Low memory can serve the same purposes as high memory, and the kernel can also use it for its own data structures"
mem_mapped: "Memory mapped for devices, files, etc."
mem_page_tables: "Size of the index tables used to manage memory pages"
mem_shared: "Total memory shared by multiple processes"
mem_slab: "Size of the kernel data structure cache, which reduces the overhead of allocating and freeing memory"
mem_sreclaimable: "Size of the reclaimable part of Slab"
mem_sunreclaim: "Size of the unreclaimable part of Slab (SUnreclaim+SReclaimable=Slab)"
mem_swap_cached: "Swap space used by the cache: memory that has been swapped out but is still stored in the swapfile, so it can be replaced quickly when needed without issuing I/O again"
mem_swap_free: "Unused swap space"
mem_swap_total: "Total swap space"
mem_total: "Total memory"
mem_used: "Used memory"
mem_used_percent: "Percentage of memory used (0 ~ 100)"
mem_vmalloc_chunk: "Largest contiguous unused vmalloc area"
mem_vmalloc_totalL: "Total size of the vmalloc virtual memory area"
mem_vmalloc_used: "Size of vmalloc virtual memory in use"
mem_write_back: "Memory currently being written back to disk"
mem_write_back_tmp: "Memory used by FUSE for temporary writeback buffers"
|
||||
|
||||
net_bytes_recv: "Total received by the network interface (bytes)"
net_bytes_sent: "Total sent by the network interface (bytes)"
net_drop_in: "Number of incoming packets dropped by the network interface"
net_drop_out: "Number of outgoing packets dropped by the network interface"
net_err_in: "Number of receive errors on the network interface"
net_err_out: "Number of transmit errors on the network interface"
net_packets_recv: "Number of packets received by the network interface"
net_packets_sent: "Number of packets sent by the network interface"
|
||||
|
||||
netstat_tcp_established: "Number of network connections in the ESTABLISHED state"
netstat_tcp_fin_wait1: "Number of network connections in the FIN_WAIT1 state"
netstat_tcp_fin_wait2: "Number of network connections in the FIN_WAIT2 state"
netstat_tcp_last_ack: "Number of network connections in the LAST_ACK state"
netstat_tcp_listen: "Number of network connections in the LISTEN state"
netstat_tcp_syn_recv: "Number of network connections in the SYN_RECV state"
netstat_tcp_syn_sent: "Number of network connections in the SYN_SENT state"
netstat_tcp_time_wait: "Number of network connections in the TIME_WAIT state"
netstat_udp_socket: "Number of UDP network connections"
|
||||
|
||||
processes_blocked: "Number of processes in uninterruptible sleep ('U','D','L')"
processes_dead: "Number of processes being reaped ('X')"
processes_idle: "Number of idle (suspended) processes ('I')"
processes_paging: "Number of paging processes ('P')"
processes_running: "Number of running processes ('R')"
processes_sleeping: "Number of processes in interruptible sleep ('S')"
processes_stopped: "Number of stopped processes ('T')"
processes_total: "Total number of processes"
processes_total_threads: "Total number of threads"
processes_unknown: "Number of processes in an unknown state"
processes_zombies: "Number of zombie processes ('Z')"
|
||||
|
||||
swap_used_percent: "Amount of data swapped out to swap space"
|
||||
|
||||
system_load1: "1-minute load average"
system_load5: "5-minute load average"
system_load15: "15-minute load average"
system_n_users: "Number of users"
system_n_cpus: "Number of CPU cores"
system_uptime: "System uptime"
|
||||
|
||||
nginx_accepts: "Total number of client connections accepted since nginx started"
nginx_active: "Number of active connections nginx is currently handling; equals the sum of Reading/Writing/Waiting"
nginx_handled: "Total number of client connections handled since nginx started"
nginx_reading: "Number of connections where nginx is reading the HTTP request header"
nginx_requests: "Total number of client requests handled since nginx started; because of HTTP Keep-Alive requests, this value can be greater than the handled value"
nginx_upstream_check_fall: "Number of backend failures detected by the upstream_check module"
nginx_upstream_check_rise: "Number of backend checks performed by the upstream_check module"
nginx_upstream_check_status_code: "Status of the upstream backend: 1 for up, 0 for down"
nginx_waiting: "With keep-alive enabled, this value equals active - (reading+writing): idle connections that nginx has finished processing and that are waiting for the next request"
nginx_writing: "Number of connections where nginx is sending a response to the client"
|
||||
|
||||
http_response_content_length: "Transfer length of the HTTP message body"
http_response_http_response_code: "HTTP response status code"
http_response_response_time: "HTTP response time"
http_response_result_code: "URL probe result: 0 means OK; any other value means the URL could not be reached"
|
||||
http_response_content_length: HTTP消息实体的传输长度
|
||||
http_response_http_response_code: http响应状态码
|
||||
http_response_response_time: http响应用时
|
||||
http_response_result_code: url探测结果0为正常否则url无法访问
|
||||
|
||||
# [mysqld_exporter]
|
||||
mysql_global_status_uptime: The number of seconds that the server has been up.(Gauge)
|
||||
@@ -489,7 +237,7 @@ redis_last_key_groups_scrape_duration_milliseconds: Duration of the last key gro
|
||||
redis_last_slow_execution_duration_seconds: The amount of time needed for last slow execution, in seconds.
|
||||
redis_latest_fork_seconds: The amount of time needed for last fork, in seconds.
|
||||
redis_lazyfree_pending_objects: The number of objects waiting to be freed (as a result of calling UNLINK, or FLUSHDB and FLUSHALL with the ASYNC option).
|
||||
redis_master_repl_offset: The server's current replication offset.
|
||||
redis_master_repl_offset: The server's current replication offset.
|
||||
redis_mem_clients_normal: Memory used by normal clients.(Gauge)
|
||||
redis_mem_clients_slaves: Memory used by replica clients - Starting Redis 7.0, replica buffers share memory with the replication backlog, so this field can show 0 when replicas don't trigger an increase of memory usage.
|
||||
redis_mem_fragmentation_bytes: Delta between used_memory_rss and used_memory. Note that when the total fragmentation bytes is low (few megabytes), a high ratio (e.g. 1.5 and above) is not an indication of an issue.
|
||||
@@ -622,6 +370,8 @@ node_load15: cpu load 15m
|
||||
|
||||
# MEM
|
||||
# kernel space
|
||||
# memory that tracks pages fetched from swap but not yet modified
|
||||
node_memory_SwapCached_bytes: Memory that keeps track of pages that have been fetched from swap but not yet been modified
|
||||
# memory the kernel uses to cache data structures for its own use
|
||||
node_memory_Slab_bytes: Memory used by the kernel to cache data structures for its own use
|
||||
# reclaimable part of the slab
|
||||
@@ -683,7 +433,7 @@ node_memory_SwapTotal_bytes: Memory information field SwapTotal_bytes
|
||||
node_memory_SwapFree_bytes: Memory information field SwapFree_bytes
|
||||
|
||||
# DISK
|
||||
node_filesystem_avail_bytes: Filesystem space available to non-root users in byte
|
||||
node_filesystem_files_free: Filesystem space available to non-root users in byte
|
||||
node_filesystem_free_bytes: Filesystem free space in bytes
|
||||
node_filesystem_size_bytes: Filesystem size in bytes
|
||||
node_filesystem_files_free: Filesystem total free file nodes
|
||||
@@ -729,7 +479,7 @@ kafka_consumer_lag_millis: Current approximation of consumer lag for a ConsumerG
|
||||
kafka_topic_partition_under_replicated_partition: 1 if Topic/Partition is under Replicated
|
||||
|
||||
# [zookeeper_exporter]
|
||||
zk_znode_count: The total count of znodes stored
|
||||
zk_znode_count: The total count of znodes stored
|
||||
zk_ephemerals_count: The number of Ephemerals nodes
|
||||
zk_watch_count: The number of watchers setup over Zookeeper nodes.
|
||||
zk_approximate_data_size: Size of data in bytes that a zookeeper server has in its data tree
|
||||
@@ -741,4 +491,4 @@ zk_open_file_descriptor_count: Number of file descriptors that a zookeeper serve
|
||||
zk_max_file_descriptor_count: Maximum number of file descriptors that a zookeeper server can open
|
||||
zk_avg_latency: Average time in milliseconds for requests to be processed
|
||||
zk_min_latency: Minimum time in milliseconds for a request to be processed
|
||||
zk_max_latency: Maximum time in milliseconds for a request to be processed
|
||||
zk_max_latency: Maximum time in milliseconds for a request to be processed
|
||||
@@ -7,6 +7,12 @@ import (
|
||||
"github.com/tidwall/gjson"
|
||||
)
|
||||
|
||||
// implement this interface so that the caller can be invoked for alert notifications
|
||||
type inter interface {
|
||||
Descript() string
|
||||
Notify([]byte)
|
||||
}
|
||||
|
||||
// N9E implementation
|
||||
type N9EPlugin struct {
|
||||
Name string
|
||||
@@ -31,16 +37,9 @@ func (n *N9EPlugin) Notify(bs []byte) {
|
||||
}
|
||||
}
|
||||
|
||||
func (n *N9EPlugin) NotifyMaintainer(bs []byte) {
|
||||
fmt.Println("do something... begin")
|
||||
result := string(bs)
|
||||
fmt.Println(result)
|
||||
fmt.Println("do something... end")
|
||||
}
|
||||
|
||||
// loaded for alertingCall; the first letter must be capitalized so that the symbol is exported
|
||||
var N9eCaller = N9EPlugin{
|
||||
Name: "N9EPlugin",
|
||||
Description: "Notify by lib",
|
||||
Name: "n9e",
|
||||
Description: "演示告警通过动态链接库方式通知",
|
||||
BuildAt: time.Now().Local().Format("2006/01/02 15:04:05"),
|
||||
}
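A minimal sketch of how a caller might drive any value that satisfies the two-method interface above. Only the `Descript`/`Notify` method set comes from the diff; the `notifier` name, the `demoPlugin` type and the `dispatch` helper are hypothetical and are not part of the repository.

```go
package main

import "fmt"

// notifier mirrors the two-method interface shown in the diff (hypothetical name).
type notifier interface {
	Descript() string
	Notify([]byte)
}

// demoPlugin is an illustrative implementation, not the repository's N9EPlugin.
type demoPlugin struct {
	Name     string
	Received [][]byte
}

// Descript returns a short human-readable description of the plugin.
func (p *demoPlugin) Descript() string { return "demo plugin: " + p.Name }

// Notify records the payload; a real plugin would forward it to a notification channel.
func (p *demoPlugin) Notify(bs []byte) {
	p.Received = append(p.Received, bs)
}

// dispatch hands one event payload to every registered notifier.
func dispatch(ns []notifier, payload []byte) {
	for _, n := range ns {
		n.Notify(payload)
	}
}

func main() {
	p := &demoPlugin{Name: "n9e"}
	dispatch([]notifier{p}, []byte(`{"rule_name":"cpu high"}`))
	fmt.Println(p.Descript(), len(p.Received))
}
```

Because the caller depends only on the interface, a plugin built as a shared library could be swapped in without changing `dispatch`.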
|
||||
|
||||
@@ -1,193 +0,0 @@
|
||||
import json
import yaml

'''
Convert Prometheus/vmalert rules into n9e rules.
Kubernetes rule ConfigMaps are supported.
'''

rule_file = 'rules.yaml'


def convert_interval(interval):
    if interval.endswith('s') or interval.endswith('S'):
        return int(interval[:-1])
    if interval.endswith('m') or interval.endswith('M'):
        return int(interval[:-1]) * 60
    if interval.endswith('h') or interval.endswith('H'):
        return int(interval[:-1]) * 60 * 60
    if interval.endswith('d') or interval.endswith('D'):
        return int(interval[:-1]) * 60 * 60 * 24
    return int(interval)


def convert_alert(rule, interval):
    name = rule['alert']
    prom_ql = rule['expr']
    if 'for' in rule:
        prom_for_duration = convert_interval(rule['for'])
    else:
        prom_for_duration = 0

    prom_eval_interval = convert_interval(interval)
    note = ''
    if 'annotations' in rule:
        for v in rule['annotations'].values():
            note = v
            break

    append_tags = []
    severity = 2
    if 'labels' in rule:
        for k, v in rule['labels'].items():
            if k != 'severity':
                append_tags.append('{}={}'.format(k, v))
                continue
            if v == 'critical':
                severity = 1
            elif v == 'info':
                severity = 3
            # elif v == 'warning':
            #     severity = 2


    n9e_alert_rule = {
        "name": name,
        "note": note,
        "severity": severity,
        "disabled": 0,
        "prom_for_duration": prom_for_duration,
        "prom_ql": prom_ql,
        "prom_eval_interval": prom_eval_interval,
        "enable_stime": "00:00",
        "enable_etime": "23:59",
        "enable_days_of_week": [
            "1",
            "2",
            "3",
            "4",
            "5",
            "6",
            "0"
        ],
        "enable_in_bg": 0,
        "notify_recovered": 1,
        "notify_channels": [],
        "notify_repeat_step": 60,
        "recover_duration": 0,
        "callbacks": [],
        "runbook_url": "",
        "append_tags": append_tags
    }
    return n9e_alert_rule


def convert_record(rule, interval):
    name = rule['record']
    prom_ql = rule['expr']
    prom_eval_interval = convert_interval(interval)
    note = ''
    append_tags = []
    if 'labels' in rule:
        for k, v in rule['labels'].items():
            append_tags.append('{}={}'.format(k, v))

    n9e_record_rule = {
        "name": name,
        "note": note,
        "disabled": 0,
        "prom_ql": prom_ql,
        "prom_eval_interval": prom_eval_interval,
        "append_tags": append_tags
    }
    return n9e_record_rule


'''
example of rule group file
---
groups:
- name: example
  rules:
  - alert: HighRequestLatency
    expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
    for: 10m
    labels:
      severity: page
    annotations:
      summary: High request latency
'''
def deal_group(group):
    """
    parse single prometheus/vmalert rule group
    """
    alert_rules = []
    record_rules = []

    for rule_segment in group['groups']:
        if 'interval' in rule_segment:
            interval = rule_segment['interval']
        else:
            interval = '15s'
        for rule in rule_segment['rules']:
            if 'alert' in rule:
                alert_rules.append(convert_alert(rule, interval))
            else:
                record_rules.append(convert_record(rule, interval))

    return alert_rules, record_rules


'''
example of k8s rule configmap
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rulefiles-0
data:
  etcdrules.yaml: |
    groups:
    - name: etcd
      rules:
      - alert: etcdInsufficientMembers
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value}}).'
        expr: sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"})
          by (job) + 1) / 2)
        for: 3m
        labels:
          severity: critical
'''
def deal_configmap(rule_configmap):
    """
    parse rule configmap from k8s
    """
    all_record_rules = []
    all_alert_rules = []
    for _, rule_group_str in rule_configmap['data'].items():
        rule_group = yaml.load(rule_group_str, Loader=yaml.FullLoader)
        alert_rules, record_rules = deal_group(rule_group)
        all_alert_rules.extend(alert_rules)
        all_record_rules.extend(record_rules)

    return all_alert_rules, all_record_rules


def main():
    with open(rule_file, 'r') as f:
        rule_config = yaml.load(f, Loader=yaml.FullLoader)

    # If the file is a Kubernetes ConfigMap, use the call below instead
    # alert_rules, record_rules = deal_configmap(rule_config)
    alert_rules, record_rules = deal_group(rule_config)

    with open("alert-rules.json", 'w') as fw:
        json.dump(alert_rules, fw, indent=2, ensure_ascii=False)

    with open("record-rules.json", 'w') as fw:
        json.dump(record_rules, fw, indent=2, ensure_ascii=False)


if __name__ == '__main__':
    main()
|
||||
@@ -13,9 +13,6 @@ EngineDelay = 120
|
||||
|
||||
DisableUsageReport = false
|
||||
|
||||
# config | database
|
||||
ReaderFrom = "config"
|
||||
|
||||
[Log]
|
||||
# log write dir
|
||||
Dir = "logs"
|
||||
@@ -158,8 +155,15 @@ BasicAuthUser = ""
|
||||
BasicAuthPass = ""
|
||||
# timeout settings, unit: ms
|
||||
Timeout = 30000
|
||||
DialTimeout = 3000
|
||||
MaxIdleConnsPerHost = 100
|
||||
DialTimeout = 10000
|
||||
TLSHandshakeTimeout = 30000
|
||||
ExpectContinueTimeout = 1000
|
||||
IdleConnTimeout = 90000
|
||||
# time duration, unit: ms
|
||||
KeepAlive = 30000
|
||||
MaxConnsPerHost = 0
|
||||
MaxIdleConns = 100
|
||||
MaxIdleConnsPerHost = 10
|
||||
|
||||
[WriterOpt]
|
||||
# queue channel count
|
||||
@@ -186,12 +190,6 @@ KeepAlive = 30000
|
||||
MaxConnsPerHost = 0
|
||||
MaxIdleConns = 100
|
||||
MaxIdleConnsPerHost = 100
|
||||
# [[Writers.WriteRelabels]]
|
||||
# Action = "replace"
|
||||
# SourceLabels = ["__address__"]
|
||||
# Regex = "([^:]+)(?::\\d+)?"
|
||||
# Replacement = "$1:80"
|
||||
# TargetLabel = "__address__"
|
||||
|
||||
# [[Writers]]
|
||||
# Url = "http://127.0.0.1:7201/api/v1/prom/remote/write"
|
||||
|
||||
@@ -1,26 +0,0 @@
|
||||
# Alert message template files

The variables available in templates are the fields of the `AlertCurEvent` object.
For template syntax, see [html/template](https://pkg.go.dev/html/template).

## How to add a monitoring-details URL to an alert template

Assume the web address is http://127.0.0.1:18000/; replace it with your actual web address.

Add the following lines to the template:

* dingtalk / wecom / feishu
```markdown
[Monitoring details](http://127.0.0.1:18000/metric/explorer?promql={{ .PromQl | escape }})
```

* mailbody

```html
<tr>
  <th>Monitoring details:</th>
  <td>
    <a href="http://127.0.0.1:18000/metric/explorer?promql={{ .PromQl | escape }}" target="_blank">View</a>
  </td>
</tr>
```
|
||||
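The template lines in the README hunk above pipe `.PromQl` through an `escape` function. The sketch below shows how such a pipeline can be rendered in Go. It makes two assumptions: `escape` is taken to behave like `url.QueryEscape` (the repository's actual function is not shown in this diff), and `text/template` is used instead of `html/template` to keep the output easy to read. `linkData`, `linkTpl` and `renderDetailLink` are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
	"text/template"
)

// linkData holds the values the template needs. PromQl mirrors the field name
// used in the README; WebAddr is a hypothetical addition for this sketch.
type linkData struct {
	WebAddr string
	PromQl  string
}

const linkTpl = `[details]({{ .WebAddr }}/metric/explorer?promql={{ .PromQl | escape }})`

// renderDetailLink renders a markdown link to the metric explorer.
func renderDetailLink(d linkData) (string, error) {
	// Assumption: "escape" is equivalent to url.QueryEscape.
	funcs := template.FuncMap{"escape": url.QueryEscape}
	tpl, err := template.New("link").Funcs(funcs).Parse(linkTpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := tpl.Execute(&sb, d); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := renderDetailLink(linkData{
		WebAddr: "http://127.0.0.1:18000",
		PromQl:  `up{job="n9e"} == 0`,
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(out)
}
```

Escaping matters here because PromQL routinely contains braces, quotes and spaces, none of which are safe inside a URL query string.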
@@ -4,9 +4,6 @@ RunMode = "release"
|
||||
# # custom i18n dict config
|
||||
# I18N = "./etc/i18n.json"
|
||||
|
||||
# # custom i18n request header key
|
||||
# I18NHeaderKey = "X-Language"
|
||||
|
||||
# metrics descriptions
|
||||
MetricsYamlFile = "./etc/metrics.yaml"
|
||||
|
||||
@@ -201,10 +198,4 @@ Address = "http://127.0.0.1:10090"
|
||||
BasicAuthUser = "ibex"
|
||||
BasicAuthPass = "ibex"
|
||||
# unit: ms
|
||||
Timeout = 3000
|
||||
|
||||
[TargetMetrics]
|
||||
TargetUp = '''max(max_over_time(target_up{ident=~"(%s)"}[%dm])) by (ident)'''
|
||||
LoadPerCore = '''max(max_over_time(system_load_norm_1{ident=~"(%s)"}[%dm])) by (ident)'''
|
||||
MemUtil = '''100-max(max_over_time(mem_available_percent{ident=~"(%s)"}[%dm])) by (ident)'''
|
||||
DiskUtil = '''max(max_over_time(disk_used_percent{ident=~"(%s)", path="/"}[%dm])) by (ident)'''
|
||||
Timeout = 3000
|
||||
@@ -12,7 +12,6 @@ import (
|
||||
|
||||
type AlertCurEvent struct {
|
||||
Id int64 `json:"id" gorm:"primaryKey"`
|
||||
Cate string `json:"cate"`
|
||||
Cluster string `json:"cluster"`
|
||||
GroupId int64 `json:"group_id"` // busi group id
|
||||
GroupName string `json:"group_name"` // busi group name
|
||||
@@ -47,7 +46,6 @@ type AlertCurEvent struct {
|
||||
LastEvalTime int64 `json:"last_eval_time" gorm:"-"` // for notify.py: time of the last evaluation
|
||||
LastSentTime int64 `json:"last_sent_time" gorm:"-"` // time the last notification was sent
|
||||
NotifyCurNumber int `json:"notify_cur_number"` // notify: current number
|
||||
FirstTriggerTime int64 `json:"first_trigger_time"` // time of the first alert in a run of consecutive alerts
|
||||
}
|
||||
|
||||
func (e *AlertCurEvent) TableName() string {
|
||||
@@ -156,7 +154,6 @@ func (e *AlertCurEvent) ToHis() *AlertHisEvent {
|
||||
|
||||
return &AlertHisEvent{
|
||||
IsRecovered: isRecovered,
|
||||
Cate: e.Cate,
|
||||
Cluster: e.Cluster,
|
||||
GroupId: e.GroupId,
|
||||
GroupName: e.GroupName,
|
||||
@@ -183,7 +180,6 @@ func (e *AlertCurEvent) ToHis() *AlertHisEvent {
|
||||
RecoverTime: recoverTime,
|
||||
LastEvalTime: e.LastEvalTime,
|
||||
NotifyCurNumber: e.NotifyCurNumber,
|
||||
FirstTriggerTime: e.FirstTriggerTime,
|
||||
}
|
||||
}
|
||||
|
||||
@@ -251,7 +247,7 @@ func (e *AlertCurEvent) FillNotifyGroups(cache map[int64]*UserGroup) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func AlertCurEventTotal(prod string, bgid, stime, etime int64, severity int, clusters, cates []string, query string) (int64, error) {
|
||||
func AlertCurEventTotal(prod string, bgid, stime, etime int64, severity int, clusters []string, query string) (int64, error) {
|
||||
session := DB().Model(&AlertCurEvent{}).Where("trigger_time between ? and ? and rule_prod = ?", stime, etime, prod)
|
||||
|
||||
if bgid > 0 {
|
||||
@@ -266,10 +262,6 @@ func AlertCurEventTotal(prod string, bgid, stime, etime int64, severity int, clu
|
||||
session = session.Where("cluster in ?", clusters)
|
||||
}
|
||||
|
||||
if len(cates) > 0 {
|
||||
session = session.Where("cate in ?", cates)
|
||||
}
|
||||
|
||||
if query != "" {
|
||||
arr := strings.Fields(query)
|
||||
for i := 0; i < len(arr); i++ {
|
||||
@@ -281,7 +273,7 @@ func AlertCurEventTotal(prod string, bgid, stime, etime int64, severity int, clu
|
||||
return Count(session)
|
||||
}
|
||||
|
||||
func AlertCurEventGets(prod string, bgid, stime, etime int64, severity int, clusters, cates []string, query string, limit, offset int) ([]AlertCurEvent, error) {
|
||||
func AlertCurEventGets(prod string, bgid, stime, etime int64, severity int, clusters []string, query string, limit, offset int) ([]AlertCurEvent, error) {
|
||||
session := DB().Where("trigger_time between ? and ? and rule_prod = ?", stime, etime, prod)
|
||||
|
||||
if bgid > 0 {
|
||||
@@ -296,10 +288,6 @@ func AlertCurEventGets(prod string, bgid, stime, etime int64, severity int, clus
|
||||
session = session.Where("cluster in ?", clusters)
|
||||
}
|
||||
|
||||
if len(cates) > 0 {
|
||||
session = session.Where("cate in ?", cates)
|
||||
}
|
||||
|
||||
if query != "" {
|
||||
arr := strings.Fields(query)
|
||||
for i := 0; i < len(arr); i++ {
|
||||
|
||||
@@ -7,7 +7,6 @@ import (
|
||||
|
||||
type AlertHisEvent struct {
|
||||
Id int64 `json:"id" gorm:"primaryKey"`
|
||||
Cate string `json:"cate"`
|
||||
IsRecovered int `json:"is_recovered"`
|
||||
Cluster string `json:"cluster"`
|
||||
GroupId int64 `json:"group_id"`
|
||||
@@ -39,8 +38,7 @@ type AlertHisEvent struct {
|
||||
LastEvalTime int64 `json:"last_eval_time"`
|
||||
Tags string `json:"-"`
|
||||
TagsJSON []string `json:"tags" gorm:"-"`
|
||||
NotifyCurNumber int `json:"notify_cur_number"` // notify: current number
|
||||
FirstTriggerTime int64 `json:"first_trigger_time"` // time of the first alert in a run of consecutive alerts
|
||||
NotifyCurNumber int `json:"notify_cur_number"` // notify: current number
|
||||
}
|
||||
|
||||
func (e *AlertHisEvent) TableName() string {
|
||||
@@ -92,7 +90,7 @@ func (e *AlertHisEvent) FillNotifyGroups(cache map[int64]*UserGroup) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func AlertHisEventTotal(prod string, bgid, stime, etime int64, severity int, recovered int, clusters, cates []string, query string) (int64, error) {
|
||||
func AlertHisEventTotal(prod string, bgid, stime, etime int64, severity int, recovered int, clusters []string, query string) (int64, error) {
|
||||
session := DB().Model(&AlertHisEvent{}).Where("last_eval_time between ? and ? and rule_prod = ?", stime, etime, prod)
|
||||
|
||||
if bgid > 0 {
|
||||
@@ -111,10 +109,6 @@ func AlertHisEventTotal(prod string, bgid, stime, etime int64, severity int, rec
|
||||
session = session.Where("cluster in ?", clusters)
|
||||
}
|
||||
|
||||
if len(cates) > 0 {
|
||||
session = session.Where("cate in ?", cates)
|
||||
}
|
||||
|
||||
if query != "" {
|
||||
arr := strings.Fields(query)
|
||||
for i := 0; i < len(arr); i++ {
|
||||
@@ -126,7 +120,7 @@ func AlertHisEventTotal(prod string, bgid, stime, etime int64, severity int, rec
|
||||
return Count(session)
|
||||
}
|
||||
|
||||
func AlertHisEventGets(prod string, bgid, stime, etime int64, severity int, recovered int, clusters, cates []string, query string, limit, offset int) ([]AlertHisEvent, error) {
|
||||
func AlertHisEventGets(prod string, bgid, stime, etime int64, severity int, recovered int, clusters []string, query string, limit, offset int) ([]AlertHisEvent, error) {
|
||||
session := DB().Where("last_eval_time between ? and ? and rule_prod = ?", stime, etime, prod)
|
||||
|
||||
if bgid > 0 {
|
||||
@@ -145,10 +139,6 @@ func AlertHisEventGets(prod string, bgid, stime, etime int64, severity int, reco
|
||||
session = session.Where("cluster in ?", clusters)
|
||||
}
|
||||
|
||||
if len(cates) > 0 {
|
||||
session = session.Where("cate in ?", cates)
|
||||
}
|
||||
|
||||
if query != "" {
|
||||
arr := strings.Fields(query)
|
||||
for i := 0; i < len(arr); i++ {
|
||||
|
||||
@@ -13,18 +13,17 @@ import (
|
||||
|
||||
type TagFilter struct {
|
||||
Key string `json:"key"` // tag key
|
||||
Func string `json:"func"` // `==` | `=~` | `in` | `!=` | `!~` | `not in`
|
||||
Func string `json:"func"` // == | =~ | in
|
||||
Value string `json:"value"` // tag value
|
||||
Regexp *regexp.Regexp // parse value to regexp if func = '=~' or '!~'
|
||||
Vset map[string]struct{} // parse value to a set if func = 'in' or 'not in'
|
||||
Regexp *regexp.Regexp // parse value to regexp if func = '=~'
|
||||
Vset map[string]struct{} // parse value to a set if func = 'in'
|
||||
}
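The struct comments above list six operators and say that `Regexp` and `Vset` are precomputed from `Value`. The sketch below shows one way that precomputation and the subsequent match could work. The `tagFilter` type repeats the fields from the diff, but the `parse` and `match` methods and their exact semantics are assumptions made for illustration, not the repository's implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// tagFilter repeats the fields shown in the diff (lowercased to mark it as a sketch).
type tagFilter struct {
	Key    string
	Func   string // == | =~ | in | != | !~ | not in
	Value  string
	Regexp *regexp.Regexp      // set when Func is =~ or !~
	Vset   map[string]struct{} // set when Func is in or not in
}

// parse precomputes the regexp or value set so that match stays cheap.
func (f *tagFilter) parse() error {
	switch f.Func {
	case "=~", "!~":
		re, err := regexp.Compile(f.Value)
		if err != nil {
			return err
		}
		f.Regexp = re
	case "in", "not in":
		f.Vset = make(map[string]struct{})
		for _, v := range strings.Fields(f.Value) {
			f.Vset[v] = struct{}{}
		}
	}
	return nil
}

// match reports whether a tag value satisfies the filter (assumed semantics).
// Unknown operators never match.
func (f *tagFilter) match(value string) bool {
	switch f.Func {
	case "==":
		return value == f.Value
	case "!=":
		return value != f.Value
	case "=~":
		return f.Regexp.MatchString(value)
	case "!~":
		return !f.Regexp.MatchString(value)
	case "in":
		_, ok := f.Vset[value]
		return ok
	case "not in":
		_, ok := f.Vset[value]
		return !ok
	}
	return false
}

func main() {
	f := tagFilter{Key: "env", Func: "in", Value: "prod staging"}
	if err := f.parse(); err != nil {
		fmt.Println("parse failed:", err)
		return
	}
	fmt.Println(f.match("prod"), f.match("dev"))
}
```

Note that `MatchString` is unanchored, so a pattern such as `db` also matches `mydb-01`; anchor the pattern with `^` and `$` when a full match is intended.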
|
||||
|
||||
type AlertMute struct {
|
||||
Id int64 `json:"id" gorm:"primaryKey"`
|
||||
GroupId int64 `json:"group_id"`
|
||||
Cate string `json:"cate"`
|
||||
Prod string `json:"prod"` // product empty means n9e
|
||||
Cluster string `json:"cluster"` // take effect by clusters, separated by space
|
||||
Prod string `json:"prod"` // product empty means n9e
|
||||
Cluster string `json:"cluster"`
|
||||
Tags ormx.JSONArr `json:"tags"`
|
||||
Cause string `json:"cause"`
|
||||
Btime int64 `json:"btime"`
|
||||
@@ -45,7 +44,7 @@ func AlertMuteGets(prods []string, bgid int64, query string) (lst []AlertMute, e
|
||||
arr := strings.Fields(query)
|
||||
for i := 0; i < len(arr); i++ {
|
||||
qarg := "%" + arr[i] + "%"
|
||||
session = session.Where("cause like ?", qarg)
|
||||
session = session.Where("cause like ?", qarg, qarg)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -67,12 +66,8 @@ func (m *AlertMute) Verify() error {
|
||||
return errors.New("cluster invalid")
|
||||
}
|
||||
|
||||
if IsClusterAll(m.Cluster) {
|
||||
m.Cluster = ClusterAll
|
||||
}
|
||||
|
||||
if m.Etime <= m.Btime {
|
||||
return fmt.Errorf("oops... etime(%d) <= btime(%d)", m.Etime, m.Btime)
|
||||
return fmt.Errorf("Oops... etime(%d) <= btime(%d)", m.Etime, m.Btime)
|
||||
}
|
||||
|
||||
if err := m.Parse(); err != nil {
|
||||
@@ -128,7 +123,7 @@ func AlertMuteDel(ids []int64) error {
|
||||
func AlertMuteStatistics(cluster string) (*Statistics, error) {
|
||||
session := DB().Model(&AlertMute{}).Select("count(*) as total", "max(create_at) as last_updated")
|
||||
if cluster != "" {
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var stats []*Statistics
|
||||
@@ -151,19 +146,10 @@ func AlertMuteGetsByCluster(cluster string) ([]*AlertMute, error) {
|
||||
// get my cluster's mutes
|
||||
session := DB().Model(&AlertMute{})
|
||||
if cluster != "" {
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var lst []*AlertMute
|
||||
var mlst []*AlertMute
|
||||
err = session.Find(&lst).Error
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
for _, m := range lst {
|
||||
if MatchCluster(m.Cluster, cluster) {
|
||||
mlst = append(mlst, m)
|
||||
}
|
||||
}
|
||||
return mlst, err
|
||||
return lst, err
|
||||
}
|
||||
|
||||
@@ -16,8 +16,7 @@ import (
|
||||
type AlertRule struct {
|
||||
Id int64 `json:"id" gorm:"primaryKey"`
|
||||
GroupId int64 `json:"group_id"` // busi group id
|
||||
Cate string `json:"cate"` // alert rule cate (prometheus|elasticsearch)
|
||||
Cluster string `json:"cluster"` // take effect by clusters, separated by space
|
||||
Cluster string `json:"cluster"` // take effect by cluster
|
||||
Name string `json:"name"` // rule name
|
||||
Note string `json:"note"` // will sent in notify
|
||||
Prod string `json:"prod"` // product empty means n9e
|
||||
@@ -68,10 +67,6 @@ func (ar *AlertRule) Verify() error {
|
||||
return errors.New("cluster is blank")
|
||||
}
|
||||
|
||||
if IsClusterAll(ar.Cluster) {
|
||||
ar.Cluster = ClusterAll
|
||||
}
|
||||
|
||||
if str.Dangerous(ar.Name) {
|
||||
return errors.New("Name has invalid characters")
|
||||
}
|
||||
@@ -129,7 +124,7 @@ func (ar *AlertRule) Add() error {
|
||||
return err
|
||||
}
|
||||
|
||||
exists, err := AlertRuleExists(0, ar.GroupId, ar.Cluster, ar.Name)
|
||||
exists, err := AlertRuleExists("group_id=? and cluster=? and name=?", ar.GroupId, ar.Cluster, ar.Name)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
@@ -147,7 +142,7 @@ func (ar *AlertRule) Add() error {
|
||||
|
||||
func (ar *AlertRule) Update(arf AlertRule) error {
|
||||
if ar.Name != arf.Name {
|
||||
exists, err := AlertRuleExists(ar.Id, ar.GroupId, ar.Cluster, arf.Name)
|
||||
exists, err := AlertRuleExists("group_id=? and cluster=? and name=? and id <> ?", ar.GroupId, ar.Cluster, arf.Name, ar.Id)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
@@ -267,25 +262,8 @@ func AlertRuleDels(ids []int64, bgid ...int64) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func AlertRuleExists(id, groupId int64, cluster, name string) (bool, error) {
|
||||
session := DB().Where("id <> ? and group_id = ? and name = ?", id, groupId, name)
|
||||
|
||||
var lst []AlertRule
|
||||
err := session.Find(&lst).Error
|
||||
if err != nil {
|
||||
return false, err
|
||||
}
|
||||
if len(lst) == 0 {
|
||||
return false, nil
|
||||
}
|
||||
|
||||
// match cluster
|
||||
for _, r := range lst {
|
||||
if MatchCluster(r.Cluster, cluster) {
|
||||
return true, nil
|
||||
}
|
||||
}
|
||||
return false, nil
|
||||
func AlertRuleExists(where string, args ...interface{}) (bool, error) {
|
||||
return Exists(DB().Model(&AlertRule{}).Where(where, args...))
|
||||
}
|
||||
|
||||
func AlertRuleGets(groupId int64) ([]AlertRule, error) {
|
||||
@@ -306,39 +284,22 @@ func AlertRuleGetsByCluster(cluster string) ([]*AlertRule, error) {
|
||||
session := DB().Where("disabled = ? and prod = ?", 0, "")
|
||||
|
||||
if cluster != "" {
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var lst []*AlertRule
|
||||
err := session.Find(&lst).Error
|
||||
if err != nil {
|
||||
return lst, err
|
||||
}
|
||||
|
||||
if len(lst) == 0 {
|
||||
return lst, nil
|
||||
}
|
||||
|
||||
if cluster == "" {
|
||||
if err == nil {
|
||||
for i := 0; i < len(lst); i++ {
|
||||
lst[i].DB2FE()
|
||||
}
|
||||
return lst, nil
|
||||
}
|
||||
|
||||
lr := make([]*AlertRule, 0, len(lst))
|
||||
for _, r := range lst {
|
||||
if MatchCluster(r.Cluster, cluster) {
|
||||
r.DB2FE()
|
||||
lr = append(lr, r)
|
||||
}
|
||||
}
|
||||
|
||||
return lr, err
|
||||
return lst, err
|
||||
}
|
||||
|
||||
func AlertRulesGetsBy(prods []string, query, algorithm, cluster string, cates []string, disabled int) ([]*AlertRule, error) {
|
||||
session := DB().Where("prod in (?)", prods)
|
||||
func AlertRulesGetsBy(prods []string, query string) ([]*AlertRule, error) {
|
||||
session := DB().Where("disabled = ? and prod in (?)", 0, prods)
|
||||
|
||||
if query != "" {
|
||||
arr := strings.Fields(query)
|
||||
@@ -348,22 +309,6 @@ func AlertRulesGetsBy(prods []string, query, algorithm, cluster string, cates []
|
||||
}
|
||||
}
|
||||
|
||||
if algorithm != "" {
|
||||
session = session.Where("algorithm = ?", algorithm)
|
||||
}
|
||||
|
||||
if cluster != "" {
|
||||
session = session.Where("cluster like ?", "%"+cluster+"%")
|
||||
}
|
||||
|
||||
if len(cates) != 0 {
|
||||
session = session.Where("cate in (?)", cates)
|
||||
}
|
||||
|
||||
if disabled != -1 {
|
||||
session = session.Where("disabled = ?", disabled)
|
||||
}
|
||||
|
||||
var lst []*AlertRule
|
||||
err := session.Find(&lst).Error
|
||||
if err == nil {
|
||||
@@ -413,8 +358,7 @@ func AlertRuleStatistics(cluster string) (*Statistics, error) {
|
||||
session := DB().Model(&AlertRule{}).Select("count(*) as total", "max(update_at) as last_updated").Where("disabled = ? and prod = ?", 0, "")
|
||||
|
||||
if cluster != "" {
|
||||
// rough match: when one cluster name is a substring of another, the stats may differ from what is expected; this does not affect usage
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var stats []*Statistics
|
||||
|
||||
@@ -14,8 +14,7 @@ import (
|
||||
type AlertSubscribe struct {
|
||||
Id int64 `json:"id" gorm:"primaryKey"`
|
||||
GroupId int64 `json:"group_id"`
|
||||
Cate string `json:"cate"`
|
||||
Cluster string `json:"cluster"` // take effect by clusters, separated by space
|
||||
Cluster string `json:"cluster"`
|
||||
RuleId int64 `json:"rule_id"`
|
||||
RuleName string `json:"rule_name" gorm:"-"` // for fe
|
||||
Tags ormx.JSONArr `json:"tags"`
|
||||
@@ -60,10 +59,6 @@ func (s *AlertSubscribe) Verify() error {
|
||||
return errors.New("cluster invalid")
|
||||
}
|
||||
|
||||
if IsClusterAll(s.Cluster) {
|
||||
s.Cluster = ClusterAll
|
||||
}
|
||||
|
||||
if err := s.Parse(); err != nil {
|
||||
return err
|
||||
}
|
||||
@@ -89,12 +84,12 @@ func (s *AlertSubscribe) Parse() error {
|
||||
}
|
||||
|
||||
for i := 0; i < len(s.ITags); i++ {
|
||||
if s.ITags[i].Func == "=~" || s.ITags[i].Func == "!~" {
|
||||
if s.ITags[i].Func == "=~" {
|
||||
s.ITags[i].Regexp, err = regexp.Compile(s.ITags[i].Value)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
} else if s.ITags[i].Func == "in" || s.ITags[i].Func == "not in" {
|
||||
} else if s.ITags[i].Func == "in" {
|
||||
arr := strings.Fields(s.ITags[i].Value)
|
||||
s.ITags[i].Vset = make(map[string]struct{})
|
||||
for j := 0; j < len(arr); j++ {
|
||||
@@ -207,7 +202,7 @@ func AlertSubscribeStatistics(cluster string) (*Statistics, error) {
|
||||
session := DB().Model(&AlertSubscribe{}).Select("count(*) as total", "max(update_at) as last_updated")
|
||||
|
||||
if cluster != "" {
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var stats []*Statistics
|
||||
@@ -223,19 +218,10 @@ func AlertSubscribeGetsByCluster(cluster string) ([]*AlertSubscribe, error) {
|
||||
// get my cluster's subscribes
|
||||
session := DB().Model(&AlertSubscribe{})
|
||||
if cluster != "" {
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var lst []*AlertSubscribe
|
||||
var slst []*AlertSubscribe
|
||||
err := session.Find(&lst).Error
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
for _, s := range lst {
|
||||
if MatchCluster(s.Cluster, cluster) {
|
||||
slst = append(slst, s)
|
||||
}
|
||||
}
|
||||
return slst, err
|
||||
return lst, err
|
||||
}
|
||||
|
||||
@@ -1,94 +0,0 @@
|
||||
package models
|
||||
|
||||
import "time"
|
||||
|
||||
type AlertingEngines struct {
|
||||
Id int64 `json:"id" gorm:"primaryKey"`
|
||||
Instance string `json:"instance"`
|
||||
Cluster string `json:"cluster"` // reader cluster
|
||||
Clock int64 `json:"clock"`
|
||||
}
|
||||
|
||||
func (e *AlertingEngines) TableName() string {
|
||||
return "alerting_engines"
|
||||
}
|
||||
|
||||
// UpdateCluster: on the web page, users assign each n9e-server instance the target cluster it is associated with
|
||||
func (e *AlertingEngines) UpdateCluster(c string) error {
|
||||
e.Cluster = c
|
||||
return DB().Model(e).Select("cluster").Updates(e).Error
|
||||
}
|
||||
|
||||
// AlertingEngineGetCluster returns the cluster name for the given instance name
|
||||
func AlertingEngineGetCluster(instance string) (string, error) {
|
||||
var objs []AlertingEngines
|
||||
err := DB().Where("instance=?", instance).Find(&objs).Error
|
||||
if err != nil {
|
||||
return "", err
|
||||
}
|
||||
|
||||
if len(objs) == 0 {
|
||||
return "", nil
|
||||
}
|
||||
|
||||
return objs[0].Cluster, nil
|
||||
}
|
||||
|
||||
// AlertingEngineGets fetches the list; users need to see all n9e-server instances on the page and then assign a cluster to each
|
||||
func AlertingEngineGets(where string, args ...interface{}) ([]*AlertingEngines, error) {
|
||||
var objs []*AlertingEngines
|
||||
var err error
|
||||
session := DB().Order("instance")
|
||||
if where == "" {
|
||||
err = session.Find(&objs).Error
|
||||
} else {
|
||||
err = session.Where(where, args...).Find(&objs).Error
|
||||
}
|
||||
return objs, err
|
||||
}
|
||||
|
||||
func AlertingEngineGet(where string, args ...interface{}) (*AlertingEngines, error) {
|
||||
lst, err := AlertingEngineGets(where, args...)
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
if len(lst) == 0 {
|
||||
return nil, nil
|
||||
}
|
||||
|
||||
return lst[0], nil
|
||||
}
|
||||
|
||||
func AlertingEngineGetsInstances(where string, args ...interface{}) ([]string, error) {
|
||||
var arr []string
|
||||
var err error
|
||||
session := DB().Model(new(AlertingEngines)).Order("instance")
|
||||
if where == "" {
|
||||
err = session.Pluck("instance", &arr).Error
|
||||
} else {
|
||||
err = session.Where(where, args...).Pluck("instance", &arr).Error
|
||||
}
|
||||
return arr, err
|
||||
}
|
||||
|
||||
func AlertingEngineHeartbeat(instance string) error {
|
||||
var total int64
|
||||
err := DB().Model(new(AlertingEngines)).Where("instance=?", instance).Count(&total).Error
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
if total == 0 {
|
||||
// insert
|
||||
err = DB().Create(&AlertingEngines{
|
||||
Instance: instance,
|
||||
Clock: time.Now().Unix(),
|
||||
}).Error
|
||||
} else {
|
||||
// update
|
||||
err = DB().Model(new(AlertingEngines)).Where("instance=?", instance).Update("clock", time.Now().Unix()).Error
|
||||
}
|
||||
|
||||
return err
|
||||
}
|
||||
@@ -71,20 +71,6 @@ func (b *Board) Del() error {
|
||||
})
|
||||
}
|
||||
|
||||
func BoardGetByID(id int64) (*Board, error) {
|
||||
var lst []*Board
|
||||
err := DB().Where("id = ?", id).Find(&lst).Error
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
if len(lst) == 0 {
|
||||
return nil, nil
|
||||
}
|
||||
|
||||
return lst[0], nil
|
||||
}
|
||||
|
||||
// BoardGet for detail page
|
||||
func BoardGet(where string, args ...interface{}) (*Board, error) {
|
||||
var lst []*Board
|
||||
|
||||
@@ -119,7 +119,7 @@ func (bg *BusiGroup) Del() error {
|
||||
return errors.New("Some targets still in the BusiGroup")
|
||||
}
|
||||
|
||||
has, err = Exists(DB().Model(&Board{}).Where("group_id=?", bg.Id))
|
||||
has, err = Exists(DB().Model(&Dashboard{}).Where("group_id=?", bg.Id))
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
@@ -1,8 +1,6 @@
|
||||
package models
|
||||
|
||||
import (
|
||||
"strings"
|
||||
|
||||
"github.com/toolkits/pkg/str"
|
||||
"gorm.io/gorm"
|
||||
|
||||
@@ -11,9 +9,6 @@ import (
|
||||
|
||||
const AdminRole = "Admin"
|
||||
|
||||
// if rule's cluster field contains `ClusterAll`, means it take effect in all clusters
|
||||
const ClusterAll = "$all"
|
||||
|
||||
func DB() *gorm.DB {
|
||||
return storage.DB
|
||||
}
|
||||
@@ -47,26 +42,3 @@ type Statistics struct {
|
||||
Total int64 `gorm:"total"`
|
||||
LastUpdated int64 `gorm:"last_updated"`
|
||||
}
|
||||
|
||||
func MatchCluster(ruleCluster, targetCluster string) bool {
|
||||
if targetCluster == ClusterAll {
|
||||
return true
|
||||
}
|
||||
clusters := strings.Fields(ruleCluster)
|
||||
for _, c := range clusters {
|
||||
if c == ClusterAll || c == targetCluster {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
|
||||
|
||||
func IsClusterAll(ruleCluster string) bool {
|
||||
clusters := strings.Fields(ruleCluster)
|
||||
for _, c := range clusters {
|
||||
if c == ClusterAll {
|
||||
return true
|
||||
}
|
||||
}
|
||||
return false
|
||||
}
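To make the matching rules of the two helpers above concrete, the sketch below repeats them unchanged in a self-contained program and runs them on a few made-up cluster names (`prom-a`, `prom-b` and `prom` are illustrative only).

```go
package main

import (
	"fmt"
	"strings"
)

// ClusterAll marks a rule that takes effect in all clusters.
const ClusterAll = "$all"

// MatchCluster reports whether a rule's space-separated cluster list covers targetCluster.
func MatchCluster(ruleCluster, targetCluster string) bool {
	if targetCluster == ClusterAll {
		return true
	}
	clusters := strings.Fields(ruleCluster)
	for _, c := range clusters {
		if c == ClusterAll || c == targetCluster {
			return true
		}
	}
	return false
}

// IsClusterAll reports whether the cluster list contains the $all marker.
func IsClusterAll(ruleCluster string) bool {
	clusters := strings.Fields(ruleCluster)
	for _, c := range clusters {
		if c == ClusterAll {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(MatchCluster("prom-a prom-b", "prom-b")) // exact entry in the list
	fmt.Println(MatchCluster("prom-a", "prom"))          // substring only, not an entry
	fmt.Println(MatchCluster("prom-a $all", "other"))    // $all covers every cluster
	fmt.Println(IsClusterAll("prom-a $all"))
}
```

The second call is the case that a SQL `cluster like '%prom%'` filter alone would accept, which is presumably why the `...GetsByCluster` functions in this diff run `MatchCluster` over the query results.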
|
||||
|
||||
@@ -13,7 +13,7 @@ import (
|
||||
type RecordingRule struct {
|
||||
Id int64 `json:"id" gorm:"primaryKey"`
|
||||
GroupId int64 `json:"group_id"` // busi group id
|
||||
Cluster string `json:"cluster"` // take effect by cluster, separated by space
|
||||
Cluster string `json:"cluster"` // take effect by cluster
|
||||
Name string `json:"name"` // new metric name
|
||||
Note string `json:"note"` // note
|
||||
Disabled int `json:"disabled"` // 0: enabled, 1: disabled
|
||||
@@ -40,7 +40,6 @@ func (re *RecordingRule) DB2FE() {
|
||||
//re.ClusterJSON = strings.Fields(re.Cluster)
|
||||
re.AppendTagsJSON = strings.Fields(re.AppendTags)
|
||||
}
|
||||
|
||||
func (re *RecordingRule) Verify() error {
|
||||
if re.GroupId < 0 {
|
||||
return fmt.Errorf("GroupId(%d) invalid", re.GroupId)
|
||||
@@ -50,10 +49,6 @@ func (re *RecordingRule) Verify() error {
|
||||
return errors.New("cluster is blank")
|
||||
}
|
||||
|
||||
if IsClusterAll(re.Cluster) {
|
||||
re.Cluster = ClusterAll
|
||||
}
|
||||
|
||||
if !model.MetricNameRE.MatchString(re.Name) {
|
||||
return errors.New("Name has invalid characters")
|
||||
}
|
||||
@@ -83,15 +78,14 @@ func (re *RecordingRule) Add() error {
|
||||
return err
|
||||
}
|
||||
|
||||
// Recording rules with duplicate names occur in practice, so no duplicate check is needed
|
||||
//exists, err := RecordingRuleExists(0, re.GroupId, re.Cluster, re.Name)
|
||||
//if err != nil {
|
||||
// return err
|
||||
//}
|
||||
//
|
||||
//if exists {
|
||||
// return errors.New("RecordingRule already exists")
|
||||
//}
|
||||
exists, err := RecordingRuleExists("group_id=? and cluster=? and name=?", re.GroupId, re.Cluster, re.Name)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
if exists {
|
||||
return errors.New("RecordingRule already exists")
|
||||
}
|
||||
|
||||
now := time.Now().Unix()
|
||||
re.CreateAt = now
|
||||
@@ -101,16 +95,15 @@ func (re *RecordingRule) Add() error {
|
||||
}
|
||||
|
||||
func (re *RecordingRule) Update(ref RecordingRule) error {
|
||||
// Recording rules with duplicate names occur in practice, so no duplicate check is needed
|
||||
//if re.Name != ref.Name {
|
||||
// exists, err := RecordingRuleExists(re.Id, re.GroupId, re.Cluster, ref.Name)
|
||||
// if err != nil {
|
||||
// return err
|
||||
// }
|
||||
// if exists {
|
||||
// return errors.New("RecordingRule already exists")
|
||||
// }
|
||||
//}
|
||||
if re.Name != ref.Name {
|
||||
exists, err := RecordingRuleExists("group_id=? and cluster=? and name=? and id <> ?", re.GroupId, re.Cluster, ref.Name, re.Id)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
if exists {
|
||||
return errors.New("RecordingRule already exists")
|
||||
}
|
||||
}
|
||||
|
||||
ref.FE2DB()
|
||||
ref.Id = re.Id
|
||||
@@ -140,27 +133,9 @@ func RecordingRuleDels(ids []int64, groupId int64) error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func RecordingRuleExists(id, groupId int64, cluster, name string) (bool, error) {
|
||||
session := DB().Where("id <> ? and group_id = ? and name =? ", id, groupId, name)
|
||||
|
||||
var lst []RecordingRule
|
||||
err := session.Find(&lst).Error
|
||||
if err != nil {
|
||||
return false, err
|
||||
}
|
||||
if len(lst) == 0 {
|
||||
return false, nil
|
||||
}
|
||||
|
||||
// match cluster
|
||||
for _, r := range lst {
|
||||
if MatchCluster(r.Cluster, cluster) {
|
||||
return true, nil
|
||||
}
|
||||
}
|
||||
return false, nil
|
||||
func RecordingRuleExists(where string, regs ...interface{}) (bool, error) {
|
||||
return Exists(DB().Model(&RecordingRule{}).Where(where, regs...))
|
||||
}
|
||||
|
||||
func RecordingRuleGets(groupId int64) ([]RecordingRule, error) {
|
||||
session := DB().Where("group_id=?", groupId).Order("name")
|
||||
|
||||
@@ -196,45 +171,26 @@ func RecordingRuleGetById(id int64) (*RecordingRule, error) {
|
||||
}
|
||||
|
||||
func RecordingRuleGetsByCluster(cluster string) ([]*RecordingRule, error) {
|
||||
session := DB().Where("disabled = ?", 0)
|
||||
|
||||
session := DB()
|
||||
if cluster != "" {
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var lst []*RecordingRule
|
||||
err := session.Find(&lst).Error
|
||||
if err != nil {
|
||||
return lst, err
|
||||
}
|
||||
|
||||
if len(lst) == 0 {
|
||||
return lst, nil
|
||||
}
|
||||
|
||||
if cluster == "" {
|
||||
if err == nil {
|
||||
for i := 0; i < len(lst); i++ {
|
||||
lst[i].DB2FE()
|
||||
}
|
||||
return lst, nil
|
||||
}
|
||||
|
||||
lr := make([]*RecordingRule, 0, len(lst))
|
||||
for _, r := range lst {
|
||||
if MatchCluster(r.Cluster, cluster) {
|
||||
r.DB2FE()
|
||||
lr = append(lr, r)
|
||||
}
|
||||
}
|
||||
|
||||
return lr, err
|
||||
return lst, err
|
||||
}
|
||||
|
||||
func RecordingRuleStatistics(cluster string) (*Statistics, error) {
|
||||
session := DB().Model(&RecordingRule{}).Select("count(*) as total", "max(update_at) as last_updated")
|
||||
if cluster != "" {
|
||||
// Rough check: when one cluster name is a substring of another, the stats may differ from what is expected; this does not affect usage
|
||||
session = session.Where("(cluster like ? or cluster = ?)", "%"+cluster+"%", ClusterAll)
|
||||
session = session.Where("cluster = ?", cluster)
|
||||
}
|
||||
|
||||
var stats []*Statistics
|
||||
|
||||
@@ -1,198 +0,0 @@
|
||||
package models
|
||||
|
||||
import (
|
||||
"crypto/md5"
|
||||
"fmt"
|
||||
"regexp"
|
||||
"sort"
|
||||
"strings"
|
||||
|
||||
"github.com/prometheus/common/model"
|
||||
"github.com/prometheus/prometheus/prompb"
|
||||
)
|
||||
|
||||
const (
|
||||
Replace Action = "replace"
|
||||
Keep Action = "keep"
|
||||
Drop Action = "drop"
|
||||
HashMod Action = "hashmod"
|
||||
LabelMap Action = "labelmap"
|
||||
LabelDrop Action = "labeldrop"
|
||||
LabelKeep Action = "labelkeep"
|
||||
Lowercase Action = "lowercase"
|
||||
Uppercase Action = "uppercase"
|
||||
)
|
||||
|
||||
type Action string
|
||||
|
||||
type Regexp struct {
|
||||
*regexp.Regexp
|
||||
}
|
||||
|
||||
type RelabelConfig struct {
|
||||
SourceLabels model.LabelNames
|
||||
Separator string
|
||||
Regex interface{}
|
||||
Modulus uint64
|
||||
TargetLabel string
|
||||
Replacement string
|
||||
Action Action
|
||||
}
|
||||
|
||||
func Process(labels []*prompb.Label, cfgs ...*RelabelConfig) []*prompb.Label {
|
||||
for _, cfg := range cfgs {
|
||||
labels = relabel(labels, cfg)
|
||||
if labels == nil {
|
||||
return nil
|
||||
}
|
||||
}
|
||||
return labels
|
||||
}
|
||||
|
||||
func getValue(ls []*prompb.Label, name model.LabelName) string {
|
||||
for _, l := range ls {
|
||||
if l.Name == string(name) {
|
||||
return l.Value
|
||||
}
|
||||
}
|
||||
return ""
|
||||
}
|
||||
|
||||
type LabelBuilder struct {
|
||||
LabelSet map[string]string
|
||||
}
|
||||
|
||||
func newBuilder(ls []*prompb.Label) *LabelBuilder {
|
||||
lset := make(map[string]string, len(ls))
|
||||
for _, l := range ls {
|
||||
lset[l.Name] = l.Value
|
||||
}
|
||||
return &LabelBuilder{LabelSet: lset}
|
||||
}
|
||||
|
||||
func (l *LabelBuilder) set(k, v string) *LabelBuilder {
|
||||
if v == "" {
|
||||
return l.del(k)
|
||||
}
|
||||
|
||||
l.LabelSet[k] = v
|
||||
return l
|
||||
}
|
||||
|
||||
func (l *LabelBuilder) del(ns ...string) *LabelBuilder {
|
||||
for _, n := range ns {
|
||||
delete(l.LabelSet, n)
|
||||
}
|
||||
return l
|
||||
}
|
||||
|
||||
func (l *LabelBuilder) labels() []*prompb.Label {
|
||||
ls := make([]*prompb.Label, 0, len(l.LabelSet))
|
||||
if len(l.LabelSet) == 0 {
|
||||
return ls
|
||||
}
|
||||
|
||||
for k, v := range l.LabelSet {
|
||||
ls = append(ls, &prompb.Label{
|
||||
Name: k,
|
||||
Value: v,
|
||||
})
|
||||
}
|
||||
|
||||
sort.Slice(ls, func(i, j int) bool {
|
||||
return ls[i].Name > ls[j].Name
|
||||
})
|
||||
return ls
|
||||
}
|
||||
|
||||
func relabel(lset []*prompb.Label, cfg *RelabelConfig) []*prompb.Label {
|
||||
values := make([]string, 0, len(cfg.SourceLabels))
|
||||
for _, ln := range cfg.SourceLabels {
|
||||
values = append(values, getValue(lset, ln))
|
||||
}
|
||||
|
||||
regx := cfg.Regex.(Regexp)
|
||||
|
||||
val := strings.Join(values, cfg.Separator)
|
||||
lb := newBuilder(lset)
|
||||
switch cfg.Action {
|
||||
case Drop:
|
||||
if regx.MatchString(val) {
|
||||
return nil
|
||||
}
|
||||
case Keep:
|
||||
if !regx.MatchString(val) {
|
||||
return nil
|
||||
}
|
||||
case Replace:
|
||||
indexes := regx.FindStringSubmatchIndex(val)
|
||||
if indexes == nil {
|
||||
break
|
||||
}
|
||||
target := model.LabelName(regx.ExpandString([]byte{}, cfg.TargetLabel, val, indexes))
|
||||
if !target.IsValid() {
|
||||
lb.del(cfg.TargetLabel)
|
||||
break
|
||||
}
|
||||
res := regx.ExpandString([]byte{}, cfg.Replacement, val, indexes)
|
||||
if len(res) == 0 {
|
||||
lb.del(cfg.TargetLabel)
|
||||
break
|
||||
}
|
||||
lb.set(string(target), string(res))
|
||||
case Lowercase:
|
||||
lb.set(cfg.TargetLabel, strings.ToLower(val))
|
||||
case Uppercase:
|
||||
lb.set(cfg.TargetLabel, strings.ToUpper(val))
|
||||
case HashMod:
|
||||
mod := sum64(md5.Sum([]byte(val))) % cfg.Modulus
|
||||
lb.set(cfg.TargetLabel, fmt.Sprintf("%d", mod))
|
||||
case LabelMap:
|
||||
for _, l := range lset {
|
||||
if regx.MatchString(l.Name) {
|
||||
res := regx.ReplaceAllString(l.Name, cfg.Replacement)
|
||||
lb.set(res, l.Value)
|
||||
}
|
||||
}
|
||||
case LabelDrop:
|
||||
for _, l := range lset {
|
||||
if regx.MatchString(l.Name) {
|
||||
lb.del(l.Name)
|
||||
}
|
||||
}
|
||||
case LabelKeep:
|
||||
for _, l := range lset {
|
||||
if !regx.MatchString(l.Name) {
|
||||
lb.del(l.Name)
|
||||
}
|
||||
}
|
||||
default:
|
||||
panic(fmt.Errorf("relabel: unknown relabel action type %q", cfg.Action))
|
||||
}
|
||||
|
||||
return lb.labels()
|
||||
}
|
||||
|
||||
func sum64(hash [md5.Size]byte) uint64 {
|
||||
var s uint64
|
||||
|
||||
for i, b := range hash {
|
||||
shift := uint64((md5.Size - i - 1) * 8)
|
||||
|
||||
s |= uint64(b) << shift
|
||||
}
|
||||
return s
|
||||
}
|
||||
|
||||
func NewRegexp(s string) (Regexp, error) {
|
||||
regex, err := regexp.Compile("^(?:" + s + ")$")
|
||||
return Regexp{Regexp: regex}, err
|
||||
}
|
||||
|
||||
func MustNewRegexp(s string) Regexp {
|
||||
re, err := NewRegexp(s)
|
||||
if err != nil {
|
||||
panic(err)
|
||||
}
|
||||
return re
|
||||
}
|
||||
@@ -20,11 +20,6 @@ type Target struct {
|
||||
TagsJSON []string `json:"tags" gorm:"-"`
|
||||
TagsMap map[string]string `json:"-" gorm:"-"` // internal use, append tags to series
|
||||
UpdateAt int64 `json:"update_at"`
|
||||
|
||||
TargetUp float64 `json:"target_up" gorm:"-"`
|
||||
LoadPerCore float64 `json:"load_per_core" gorm:"-"`
|
||||
MemUtil float64 `json:"mem_util" gorm:"-"`
|
||||
DiskUtil float64 `json:"disk_util" gorm:"-"`
|
||||
}
|
||||
|
||||
func (t *Target) TableName() string {
|
||||
@@ -116,10 +111,6 @@ func buildTargetWhere(bgid int64, clusters []string, query string) *gorm.DB {
|
||||
return session
|
||||
}
|
||||
|
||||
func TargetTotalCount() (int64, error) {
|
||||
return Count(DB().Model(new(Target)))
|
||||
}
|
||||
|
||||
func TargetTotal(bgid int64, clusters []string, query string) (int64, error) {
|
||||
return Count(buildTargetWhere(bgid, clusters, query))
|
||||
}
|
||||
|
||||
@@ -450,21 +450,6 @@ func (u *User) BusiGroups(limit int, query string, all ...bool) ([]BusiGroup, er
|
||||
var lst []BusiGroup
|
||||
if u.IsAdmin() || (len(all) > 0 && all[0]) {
|
||||
err := session.Where("name like ?", "%"+query+"%").Find(&lst).Error
|
||||
if err != nil {
|
||||
return lst, err
|
||||
}
|
||||
|
||||
if len(lst) == 0 && len(query) > 0 {
|
||||
// Hidden feature, not told to most people, haha. The query may be an ident, so if the SQL above found nothing, try looking it up as an ident
|
||||
var t *Target
|
||||
t, err = TargetGet("ident=?", query)
|
||||
if err != nil {
|
||||
return lst, err
|
||||
}
|
||||
|
||||
err = DB().Order("name").Limit(limit).Where("id=?", t.GroupId).Find(&lst).Error
|
||||
}
|
||||
|
||||
return lst, err
|
||||
}
|
||||
|
||||
@@ -483,22 +468,6 @@ func (u *User) BusiGroups(limit int, query string, all ...bool) ([]BusiGroup, er
|
||||
}
|
||||
|
||||
err = session.Where("id in ?", busiGroupIds).Where("name like ?", "%"+query+"%").Find(&lst).Error
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
if len(lst) == 0 && len(query) > 0 {
|
||||
var t *Target
|
||||
t, err = TargetGet("ident=?", query)
|
||||
if err != nil {
|
||||
return lst, err
|
||||
}
|
||||
|
||||
if slice.ContainsInt64(busiGroupIds, t.GroupId) {
|
||||
err = DB().Order("name").Limit(limit).Where("id=?", t.GroupId).Find(&lst).Error
|
||||
}
|
||||
}
|
||||
|
||||
return lst, err
|
||||
}
|
||||
|
||||
|
||||
@@ -1,9 +0,0 @@
|
||||
package notifier
|
||||
|
||||
type Notifier interface {
|
||||
Descript() string
|
||||
Notify([]byte)
|
||||
NotifyMaintainer([]byte)
|
||||
}
|
||||
|
||||
var Instance Notifier
|
||||
@@ -6,11 +6,9 @@ import (
|
||||
"io/ioutil"
|
||||
"net/http"
|
||||
"time"
|
||||
|
||||
"github.com/toolkits/pkg/logger"
|
||||
)
|
||||
|
||||
func PostJSON(url string, timeout time.Duration, v interface{}, retries ...int) (response []byte, code int, err error) {
|
||||
func PostJSON(url string, timeout time.Duration, v interface{}) (response []byte, code int, err error) {
|
||||
var bs []byte
|
||||
|
||||
bs, err = json.Marshal(v)
|
||||
@@ -28,29 +26,7 @@ func PostJSON(url string, timeout time.Duration, v interface{}, retries ...int)
|
||||
req.Header.Set("Content-Type", "application/json")
|
||||
|
||||
var resp *http.Response
|
||||
|
||||
if len(retries) > 0 {
|
||||
for i := 0; i < retries[0]; i++ {
|
||||
resp, err = client.Do(req)
|
||||
if err == nil {
|
||||
break
|
||||
}
|
||||
|
||||
tryagain := ""
|
||||
if i+1 < retries[0] {
|
||||
tryagain = " try again"
|
||||
}
|
||||
|
||||
logger.Warningf("failed to curl %s error: %s"+tryagain, url, err)
|
||||
|
||||
if i+1 < retries[0] {
|
||||
time.Sleep(time.Millisecond * 200)
|
||||
}
|
||||
}
|
||||
} else {
|
||||
resp, err = client.Do(req)
|
||||
}
|
||||
|
||||
resp, err = client.Do(req)
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
|
||||
@@ -1,7 +0,0 @@
|
||||
package prom
|
||||
|
||||
type ClientOptions struct {
|
||||
BasicAuthUser string
|
||||
BasicAuthPass string
|
||||
Headers []string
|
||||
}
|
||||
@@ -12,26 +12,25 @@ import (
|
||||
|
||||
// ClientConfig represents the standard client TLS config.
|
||||
type ClientConfig struct {
|
||||
TLSCA string `toml:"tls_ca"`
|
||||
TLSCert string `toml:"tls_cert"`
|
||||
TLSKey string `toml:"tls_key"`
|
||||
TLSKeyPwd string `toml:"tls_key_pwd"`
|
||||
InsecureSkipVerify bool `toml:"insecure_skip_verify"`
|
||||
ServerName string `toml:"tls_server_name"`
|
||||
TLSMinVersion string `toml:"tls_min_version"`
|
||||
TLSMaxVersion string `toml:"tls_max_version"`
|
||||
TLSCA string
|
||||
TLSCert string
|
||||
TLSKey string
|
||||
TLSKeyPwd string
|
||||
InsecureSkipVerify bool
|
||||
ServerName string
|
||||
TLSMinVersion string
|
||||
}
|
||||
|
||||
// ServerConfig represents the standard server TLS config.
|
||||
type ServerConfig struct {
|
||||
TLSCert string `toml:"tls_cert"`
|
||||
TLSKey string `toml:"tls_key"`
|
||||
TLSKeyPwd string `toml:"tls_key_pwd"`
|
||||
TLSAllowedCACerts []string `toml:"tls_allowed_cacerts"`
|
||||
TLSCipherSuites []string `toml:"tls_cipher_suites"`
|
||||
TLSMinVersion string `toml:"tls_min_version"`
|
||||
TLSMaxVersion string `toml:"tls_max_version"`
|
||||
TLSAllowedDNSNames []string `toml:"tls_allowed_dns_names"`
|
||||
TLSCert string
|
||||
TLSKey string
|
||||
TLSKeyPwd string
|
||||
TLSAllowedCACerts []string
|
||||
TLSCipherSuites []string
|
||||
TLSMinVersion string
|
||||
TLSMaxVersion string
|
||||
TLSAllowedDNSNames []string
|
||||
}
|
||||
|
||||
// TLSConfig returns a tls.Config, may be nil without error if TLS is not
|
||||
@@ -71,16 +70,6 @@ func (c *ClientConfig) TLSConfig() (*tls.Config, error) {
|
||||
tlsConfig.MinVersion = tls.VersionTLS13
|
||||
}
|
||||
|
||||
if c.TLSMaxVersion == "1.0" {
|
||||
tlsConfig.MaxVersion = tls.VersionTLS10
|
||||
} else if c.TLSMaxVersion == "1.1" {
|
||||
tlsConfig.MaxVersion = tls.VersionTLS11
|
||||
} else if c.TLSMaxVersion == "1.2" {
|
||||
tlsConfig.MaxVersion = tls.VersionTLS12
|
||||
} else if c.TLSMaxVersion == "1.3" {
|
||||
tlsConfig.MaxVersion = tls.VersionTLS13
|
||||
}
|
||||
|
||||
return tlsConfig, nil
|
||||
}
|
||||
|
||||
|
||||
@@ -2,13 +2,11 @@ package tplx
|
||||
|
||||
import (
|
||||
"html/template"
|
||||
"net/url"
|
||||
"regexp"
|
||||
"strings"
|
||||
)
|
||||
|
||||
var TemplateFuncMap = template.FuncMap{
|
||||
"escape": url.PathEscape,
|
||||
"unescaped": Unescaped,
|
||||
"urlconvert": Urlconvert,
|
||||
"timeformat": Timeformat,
|
||||
|
||||
@@ -66,7 +66,7 @@ func SendDingtalk(message DingtalkMessage) {
|
||||
}
|
||||
}
|
||||
|
||||
res, code, err := poster.PostJSON(ur, time.Second*5, body, 3)
|
||||
res, code, err := poster.PostJSON(ur, time.Second*5, body)
|
||||
if err != nil {
|
||||
logger.Errorf("dingtalk_sender: result=fail url=%s code=%d error=%v response=%s", ur, code, err, string(res))
|
||||
} else {
|
||||
|
||||
@@ -42,7 +42,7 @@ func SendFeishu(message FeishuMessage) {
|
||||
},
|
||||
}
|
||||
|
||||
res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
|
||||
res, code, err := poster.PostJSON(url, time.Second*5, body)
|
||||
if err != nil {
|
||||
logger.Errorf("feishu_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
|
||||
} else {
|
||||
|
||||
@@ -31,7 +31,7 @@ func SendWecom(message WecomMessage) {
|
||||
},
|
||||
}
|
||||
|
||||
res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
|
||||
res, code, err := poster.PostJSON(url, time.Second*5, body)
|
||||
if err != nil {
|
||||
logger.Errorf("wecom_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
|
||||
} else {
|
||||
|
||||
@@ -2,11 +2,8 @@ package config
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"log"
|
||||
"net"
|
||||
"os"
|
||||
"plugin"
|
||||
"runtime"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
@@ -14,8 +11,6 @@ import (
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/koding/multiconfig"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/notifier"
|
||||
"github.com/didi/nightingale/v5/src/pkg/httpx"
|
||||
"github.com/didi/nightingale/v5/src/pkg/logx"
|
||||
"github.com/didi/nightingale/v5/src/pkg/ormx"
|
||||
@@ -70,10 +65,6 @@ func MustLoad(fpaths ...string) {
|
||||
C.EngineDelay = 120
|
||||
}
|
||||
|
||||
if C.ReaderFrom == "" {
|
||||
C.ReaderFrom = "config"
|
||||
}
|
||||
|
||||
if C.Heartbeat.IP == "" {
|
||||
// auto detect
|
||||
// C.Heartbeat.IP = fmt.Sprint(GetOutboundIP())
|
||||
@@ -85,11 +76,7 @@ func MustLoad(fpaths ...string) {
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
if strings.Contains(hostname, "localhost") {
|
||||
fmt.Println("Warning! hostname contains substring localhost, setting a more unique hostname is recommended")
|
||||
}
|
||||
|
||||
C.Heartbeat.IP = hostname
|
||||
C.Heartbeat.IP = hostname + "+" + fmt.Sprint(os.Getpid())
|
||||
|
||||
// if C.Heartbeat.IP == "" {
|
||||
// fmt.Println("heartbeat ip auto got is blank")
|
||||
@@ -98,6 +85,7 @@ func MustLoad(fpaths ...string) {
|
||||
}
|
||||
|
||||
C.Heartbeat.Endpoint = fmt.Sprintf("%s:%d", C.Heartbeat.IP, C.HTTP.Port)
|
||||
C.Alerting.RedisPub.ChannelKey = C.Alerting.RedisPub.ChannelPrefix + C.ClusterName
|
||||
|
||||
if C.Alerting.Webhook.Enable {
|
||||
if C.Alerting.Webhook.Timeout == "" {
|
||||
@@ -112,33 +100,6 @@ func MustLoad(fpaths ...string) {
|
||||
}
|
||||
}
|
||||
|
||||
if C.Alerting.CallPlugin.Enable {
|
||||
if runtime.GOOS == "windows" {
|
||||
fmt.Println("notify plugin on unsupported os:", runtime.GOOS)
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
p, err := plugin.Open(C.Alerting.CallPlugin.PluginPath)
|
||||
if err != nil {
|
||||
fmt.Println("failed to load plugin:", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
caller, err := p.Lookup(C.Alerting.CallPlugin.Caller)
|
||||
if err != nil {
|
||||
fmt.Println("failed to lookup plugin Caller:", err)
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
ins, ok := caller.(notifier.Notifier)
|
||||
if !ok {
|
||||
log.Println("notifier interface not implemented")
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
notifier.Instance = ins
|
||||
}
|
||||
|
||||
if C.WriterOpt.QueueMaxSize <= 0 {
|
||||
C.WriterOpt.QueueMaxSize = 100000
|
||||
}
|
||||
@@ -151,33 +112,6 @@ func MustLoad(fpaths ...string) {
|
||||
C.WriterOpt.QueueCount = 100
|
||||
}
|
||||
|
||||
for _, write := range C.Writers {
|
||||
for _, relabel := range write.WriteRelabels {
|
||||
regex, ok := relabel.Regex.(string)
|
||||
if !ok {
|
||||
log.Println("Regex field must be a string")
|
||||
os.Exit(1)
|
||||
}
|
||||
|
||||
if regex == "" {
|
||||
regex = "(.*)"
|
||||
}
|
||||
relabel.Regex = models.MustNewRegexp(regex)
|
||||
|
||||
if relabel.Separator == "" {
|
||||
relabel.Separator = ";"
|
||||
}
|
||||
|
||||
if relabel.Action == "" {
|
||||
relabel.Action = "replace"
|
||||
}
|
||||
|
||||
if relabel.Replacement == "" {
|
||||
relabel.Replacement = "$1"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fmt.Println("heartbeat.ip:", C.Heartbeat.IP)
|
||||
fmt.Printf("heartbeat.interval: %dms\n", C.Heartbeat.Interval)
|
||||
})
|
||||
@@ -187,10 +121,9 @@ type Config struct {
|
||||
RunMode string
|
||||
ClusterName string
|
||||
BusiGroupLabelKey string
|
||||
AnomalyDataApi []string
|
||||
EngineDelay int64
|
||||
DisableUsageReport bool
|
||||
ReaderFrom string
|
||||
ForceUseServerTS bool
|
||||
Log logx.Config
|
||||
HTTP httpx.Config
|
||||
BasicAuth gin.Accounts
|
||||
@@ -202,10 +135,29 @@ type Config struct {
|
||||
DB ormx.DBConfig
|
||||
WriterOpt WriterGlobalOpt
|
||||
Writers []WriterOptions
|
||||
Reader PromOption
|
||||
Reader ReaderOptions
|
||||
Ibex Ibex
|
||||
}
|
||||
|
||||
type ReaderOptions struct {
|
||||
Url string
|
||||
BasicAuthUser string
|
||||
BasicAuthPass string
|
||||
|
||||
Timeout int64
|
||||
DialTimeout int64
|
||||
TLSHandshakeTimeout int64
|
||||
ExpectContinueTimeout int64
|
||||
IdleConnTimeout int64
|
||||
KeepAlive int64
|
||||
|
||||
MaxConnsPerHost int
|
||||
MaxIdleConns int
|
||||
MaxIdleConnsPerHost int
|
||||
|
||||
Headers []string
|
||||
}
|
||||
|
||||
type WriterOptions struct {
|
||||
Url string
|
||||
BasicAuthUser string
|
||||
@@ -223,8 +175,6 @@ type WriterOptions struct {
|
||||
MaxIdleConnsPerHost int
|
||||
|
||||
Headers []string
|
||||
|
||||
WriteRelabels []*models.RelabelConfig
|
||||
}
|
||||
|
||||
type WriterGlobalOpt struct {
|
||||
@@ -304,7 +254,7 @@ func (c *Config) IsDebugMode() bool {
|
||||
|
||||
// Get preferred outbound ip of this machine
|
||||
func GetOutboundIP() net.IP {
|
||||
conn, err := net.Dial("udp", "223.5.5.5:80")
|
||||
conn, err := net.Dial("udp", "8.8.8.8:80")
|
||||
if err != nil {
|
||||
fmt.Println("auto get outbound ip fail:", err)
|
||||
os.Exit(1)
|
||||
|
||||
@@ -1,59 +0,0 @@
|
||||
package config
|
||||
|
||||
import (
|
||||
"sync"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/pkg/prom"
|
||||
)
|
||||
|
||||
type PromClient struct {
|
||||
prom.API
|
||||
ClusterName string
|
||||
sync.RWMutex
|
||||
}
|
||||
|
||||
var ReaderClient *PromClient = &PromClient{}
|
||||
|
||||
func (pc *PromClient) Set(clusterName string, c prom.API) {
|
||||
pc.Lock()
|
||||
defer pc.Unlock()
|
||||
pc.ClusterName = clusterName
|
||||
pc.API = c
|
||||
}
|
||||
|
||||
func (pc *PromClient) Get() (string, prom.API) {
|
||||
pc.RLock()
|
||||
defer pc.RUnlock()
|
||||
return pc.ClusterName, pc.API
|
||||
}
|
||||
|
||||
func (pc *PromClient) GetClusterName() string {
|
||||
pc.RLock()
|
||||
defer pc.RUnlock()
|
||||
return pc.ClusterName
|
||||
}
|
||||
|
||||
func (pc *PromClient) GetCli() prom.API {
|
||||
pc.RLock()
|
||||
defer pc.RUnlock()
|
||||
return pc.API
|
||||
}
|
||||
|
||||
func (pc *PromClient) IsNil() bool {
|
||||
if pc == nil {
|
||||
return true
|
||||
}
|
||||
|
||||
pc.RLock()
|
||||
defer pc.RUnlock()
|
||||
|
||||
return pc.API == nil
|
||||
}
|
||||
|
||||
func (pc *PromClient) Reset() {
|
||||
pc.Lock()
|
||||
defer pc.Unlock()
|
||||
|
||||
pc.ClusterName = ""
|
||||
pc.API = nil
|
||||
}
|
||||
@@ -1,81 +0,0 @@
|
||||
package config
|
||||
|
||||
import "sync"
|
||||
|
||||
type PromOption struct {
|
||||
Url string
|
||||
BasicAuthUser string
|
||||
BasicAuthPass string
|
||||
|
||||
Timeout int64
|
||||
DialTimeout int64
|
||||
|
||||
MaxIdleConnsPerHost int
|
||||
|
||||
Headers []string
|
||||
}
|
||||
|
||||
func (po *PromOption) Equal(target PromOption) bool {
|
||||
if po.Url != target.Url {
|
||||
return false
|
||||
}
|
||||
|
||||
if po.BasicAuthUser != target.BasicAuthUser {
|
||||
return false
|
||||
}
|
||||
|
||||
if po.BasicAuthPass != target.BasicAuthPass {
|
||||
return false
|
||||
}
|
||||
|
||||
if po.Timeout != target.Timeout {
|
||||
return false
|
||||
}
|
||||
|
||||
if po.DialTimeout != target.DialTimeout {
|
||||
return false
|
||||
}
|
||||
|
||||
if po.MaxIdleConnsPerHost != target.MaxIdleConnsPerHost {
|
||||
return false
|
||||
}
|
||||
|
||||
if len(po.Headers) != len(target.Headers) {
|
||||
return false
|
||||
}
|
||||
|
||||
for i := 0; i < len(po.Headers); i++ {
|
||||
if po.Headers[i] != target.Headers[i] {
|
||||
return false
|
||||
}
|
||||
}
|
||||
|
||||
return true
|
||||
}
|
||||
|
||||
type PromOptionsStruct struct {
|
||||
Data map[string]PromOption
|
||||
sync.RWMutex
|
||||
}
|
||||
|
||||
func (pos *PromOptionsStruct) Set(clusterName string, po PromOption) {
|
||||
pos.Lock()
|
||||
pos.Data[clusterName] = po
|
||||
pos.Unlock()
|
||||
}
|
||||
|
||||
func (pos *PromOptionsStruct) Sets(clusterName string, po PromOption) {
|
||||
pos.Lock()
|
||||
pos.Data = map[string]PromOption{clusterName: po}
|
||||
pos.Unlock()
|
||||
}
|
||||
|
||||
func (pos *PromOptionsStruct) Get(clusterName string) (PromOption, bool) {
|
||||
pos.RLock()
|
||||
defer pos.RUnlock()
|
||||
ret, has := pos.Data[clusterName]
|
||||
return ret, has
|
||||
}
|
||||
|
||||
// Data key is cluster name
|
||||
var PromOptions = &PromOptionsStruct{Data: make(map[string]PromOption)}
|
||||
@@ -1,131 +0,0 @@
|
||||
package config
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"net"
|
||||
"net/http"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/pkg/prom"
|
||||
"github.com/prometheus/client_golang/api"
|
||||
"github.com/toolkits/pkg/logger"
|
||||
)
|
||||
|
||||
func InitReader() error {
|
||||
rf := strings.ToLower(strings.TrimSpace(C.ReaderFrom))
|
||||
if rf == "" || rf == "config" {
|
||||
return setClientFromPromOption(C.ClusterName, C.Reader)
|
||||
}
|
||||
|
||||
if rf == "database" {
|
||||
return initFromDatabase()
|
||||
}
|
||||
|
||||
return fmt.Errorf("invalid configuration ReaderFrom: %s", rf)
|
||||
}
|
||||
|
||||
func initFromDatabase() error {
|
||||
go func() {
|
||||
for {
|
||||
loadFromDatabase()
|
||||
time.Sleep(time.Second)
|
||||
}
|
||||
}()
|
||||
return nil
|
||||
}
|
||||
|
||||
func loadFromDatabase() {
|
||||
cluster, err := models.AlertingEngineGetCluster(C.Heartbeat.Endpoint)
|
||||
if err != nil {
|
||||
logger.Errorf("failed to get current cluster, error: %v", err)
|
||||
return
|
||||
}
|
||||
|
||||
if cluster == "" {
|
||||
ReaderClient.Reset()
|
||||
logger.Warning("no datasource binded to me")
|
||||
return
|
||||
}
|
||||
|
||||
ckey := "prom." + cluster + ".option"
|
||||
cval, err := models.ConfigsGet(ckey)
|
||||
if err != nil {
|
||||
logger.Errorf("failed to get ckey: %s, error: %v", ckey, err)
|
||||
return
|
||||
}
|
||||
|
||||
if cval == "" {
|
||||
ReaderClient.Reset()
|
||||
return
|
||||
}
|
||||
|
||||
var po PromOption
|
||||
err = json.Unmarshal([]byte(cval), &po)
|
||||
if err != nil {
|
||||
logger.Errorf("failed to unmarshal PromOption: %s", err)
|
||||
return
|
||||
}
|
||||
|
||||
if ReaderClient.IsNil() {
|
||||
// first time
|
||||
if err = setClientFromPromOption(cluster, po); err != nil {
|
||||
logger.Errorf("failed to setClientFromPromOption: %v", err)
|
||||
return
|
||||
}
|
||||
|
||||
PromOptions.Sets(cluster, po)
|
||||
return
|
||||
}
|
||||
|
||||
localPo, has := PromOptions.Get(cluster)
|
||||
if !has || !localPo.Equal(po) {
|
||||
if err = setClientFromPromOption(cluster, po); err != nil {
|
||||
logger.Errorf("failed to setClientFromPromOption: %v", err)
|
||||
return
|
||||
}
|
||||
|
||||
PromOptions.Sets(cluster, po)
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
func newClientFromPromOption(po PromOption) (api.Client, error) {
|
||||
return api.NewClient(api.Config{
|
||||
Address: po.Url,
|
||||
RoundTripper: &http.Transport{
|
||||
// TLSClientConfig: tlsConfig,
|
||||
Proxy: http.ProxyFromEnvironment,
|
||||
DialContext: (&net.Dialer{
|
||||
Timeout: time.Duration(po.DialTimeout) * time.Millisecond,
|
||||
}).DialContext,
|
||||
ResponseHeaderTimeout: time.Duration(po.Timeout) * time.Millisecond,
|
||||
MaxIdleConnsPerHost: po.MaxIdleConnsPerHost,
|
||||
},
|
||||
})
|
||||
}
|
||||
|
||||
func setClientFromPromOption(clusterName string, po PromOption) error {
|
||||
if clusterName == "" {
|
||||
return fmt.Errorf("argument clusterName is blank")
|
||||
}
|
||||
|
||||
if po.Url == "" {
|
||||
return fmt.Errorf("prometheus url is blank")
|
||||
}
|
||||
|
||||
cli, err := newClientFromPromOption(po)
|
||||
if err != nil {
|
||||
return fmt.Errorf("failed to newClientFromPromOption: %v", err)
|
||||
}
|
||||
|
||||
ReaderClient.Set(clusterName, prom.NewAPI(cli, prom.ClientOptions{
|
||||
BasicAuthUser: po.BasicAuthUser,
|
||||
BasicAuthPass: po.BasicAuthPass,
|
||||
Headers: po.Headers,
|
||||
}))
|
||||
|
||||
return nil
|
||||
}
|
||||
@@ -32,7 +32,7 @@ func callback(event *models.AlertCurEvent) {
|
||||
url = "http://" + url
|
||||
}
|
||||
|
||||
resp, code, err := poster.PostJSON(url, 5*time.Second, event, 3)
|
||||
resp, code, err := poster.PostJSON(url, 5*time.Second, event)
|
||||
if err != nil {
|
||||
logger.Errorf("event_callback(rule_id=%d url=%s) fail, resp: %s, err: %v, code: %d", event.RuleId, url, string(resp), err, code)
|
||||
} else {
|
||||
|
||||
@@ -2,18 +2,15 @@ package engine
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"time"
|
||||
|
||||
"github.com/toolkits/pkg/logger"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/server/common/sender"
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
promstat "github.com/didi/nightingale/v5/src/server/stat"
|
||||
)
|
||||
|
||||
func Start(ctx context.Context) error {
|
||||
err := reloadTpls()
|
||||
err := initTpls()
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
@@ -28,35 +25,12 @@ func Start(ctx context.Context) error {
|
||||
|
||||
go sender.StartEmailSender()
|
||||
|
||||
go initReporter(func(em map[ErrorType]uint64) {
|
||||
if len(em) == 0 {
|
||||
return
|
||||
}
|
||||
title := fmt.Sprintf("server %s has some errors, please check server logs for detail", config.C.Heartbeat.IP)
|
||||
msg := ""
|
||||
for k, v := range em {
|
||||
msg += fmt.Sprintf("error: %s, count: %d\n", k, v)
|
||||
}
|
||||
notifyToMaintainer(title, msg)
|
||||
})
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
func Reload() {
|
||||
err := reloadTpls()
|
||||
if err != nil {
|
||||
logger.Error("engine reload err:", err)
|
||||
}
|
||||
}
|
||||
|
||||
func reportQueueSize() {
|
||||
for {
|
||||
time.Sleep(time.Second)
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
continue
|
||||
}
|
||||
promstat.GaugeAlertQueueSize.WithLabelValues(clusterName).Set(float64(EventQueue.Len()))
|
||||
promstat.GaugeAlertQueueSize.WithLabelValues(config.C.ClusterName).Set(float64(EventQueue.Len()))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -6,7 +6,7 @@ import (
|
||||
)
|
||||
|
||||
// If the optional clock argument is passed, use the time it represents; otherwise take TriggerTime from the event's fields
|
||||
func IsMuted(event *models.AlertCurEvent, clock ...int64) bool {
|
||||
func isMuted(event *models.AlertCurEvent, clock ...int64) bool {
|
||||
mutes, has := memsto.AlertMuteCache.Gets(event.GroupId)
|
||||
if !has || len(mutes) == 0 {
|
||||
return false
|
||||
|
||||
@@ -9,8 +9,9 @@ import (
|
||||
"net/http"
|
||||
"os/exec"
|
||||
"path"
|
||||
"plugin"
|
||||
"runtime"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"github.com/pkg/errors"
|
||||
@@ -21,7 +22,6 @@ import (
|
||||
"github.com/toolkits/pkg/slice"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/notifier"
|
||||
"github.com/didi/nightingale/v5/src/pkg/sys"
|
||||
"github.com/didi/nightingale/v5/src/pkg/tplx"
|
||||
"github.com/didi/nightingale/v5/src/server/common/sender"
|
||||
@@ -30,12 +30,9 @@ import (
|
||||
"github.com/didi/nightingale/v5/src/storage"
|
||||
)
|
||||
|
||||
var (
|
||||
tpls map[string]*template.Template
|
||||
rwLock sync.RWMutex
|
||||
)
|
||||
var tpls = make(map[string]*template.Template)
|
||||
|
||||
func reloadTpls() error {
|
||||
func initTpls() error {
|
||||
if config.C.Alerting.TemplatesDir == "" {
|
||||
config.C.Alerting.TemplatesDir = path.Join(runner.Cwd, "etc", "template")
|
||||
}
|
||||
@@ -60,7 +57,6 @@ func reloadTpls() error {
|
||||
return errors.New("no tpl files under " + config.C.Alerting.TemplatesDir)
|
||||
}
|
||||
|
||||
tmpTpls := make(map[string]*template.Template)
|
||||
for i := 0; i < len(tplFiles); i++ {
|
||||
tplpath := path.Join(config.C.Alerting.TemplatesDir, tplFiles[i])
|
||||
|
||||
@@ -69,12 +65,9 @@ func reloadTpls() error {
|
||||
return errors.WithMessage(err, "failed to parse tpl: "+tplpath)
|
||||
}
|
||||
|
||||
tmpTpls[tplFiles[i]] = tpl
|
||||
tpls[tplFiles[i]] = tpl
|
||||
}
|
||||
|
||||
rwLock.Lock()
|
||||
tpls = tmpTpls
|
||||
rwLock.Unlock()
|
||||
return nil
|
||||
}
|
||||
|
||||
@@ -86,9 +79,6 @@ type Notice struct {
|
||||
func genNotice(event *models.AlertCurEvent) Notice {
|
||||
// build notice body with templates
|
||||
ntpls := make(map[string]string)
|
||||
|
||||
rwLock.RLock()
|
||||
defer rwLock.RUnlock()
|
||||
for filename, tpl := range tpls {
|
||||
var body bytes.Buffer
|
||||
if err := tpl.Execute(&body, event); err != nil {
|
||||
@@ -101,19 +91,19 @@ func genNotice(event *models.AlertCurEvent) Notice {
|
||||
return Notice{Event: event, Tpls: ntpls}
|
||||
}
|
||||
|
||||
func alertingRedisPub(clusterName string, bs []byte) {
|
||||
channelKey := config.C.Alerting.RedisPub.ChannelPrefix + clusterName
|
||||
func alertingRedisPub(bs []byte) {
|
||||
// pub all alerts to redis
|
||||
if config.C.Alerting.RedisPub.Enable {
|
||||
err := storage.Redis.Publish(context.Background(), channelKey, bs).Err()
|
||||
err := storage.Redis.Publish(context.Background(), config.C.Alerting.RedisPub.ChannelKey, bs).Err()
|
||||
if err != nil {
|
||||
logger.Errorf("event_notify: redis publish %s err: %v", channelKey, err)
|
||||
logger.Errorf("event_notify: redis publish %s err: %v", config.C.Alerting.RedisPub.ChannelKey, err)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func handleNotice(notice Notice, bs []byte) {
|
||||
alertingCallScript(bs)
|
||||
|
||||
alertingCallPlugin(bs)
|
||||
|
||||
if len(config.C.Alerting.NotifyBuiltinChannels) == 0 {
|
||||
@@ -250,7 +240,7 @@ func notify(event *models.AlertCurEvent) {
|
||||
return
|
||||
}
|
||||
|
||||
alertingRedisPub(event.Cluster, stdinBytes)
|
||||
alertingRedisPub(stdinBytes)
|
||||
alertingWebhook(event)
|
||||
|
||||
handleNotice(notice, stdinBytes)
|
||||
@@ -408,6 +398,11 @@ func alertingCallScript(stdinBytes []byte) {
|
||||
logger.Infof("event_notify: exec %s output: %s", fpath, buf.String())
|
||||
}
|
||||
|
||||
type Notifier interface {
|
||||
Descript() string
|
||||
Notify([]byte)
|
||||
}
|
||||
|
||||
// call notify.so via golang plugin build
|
||||
// e.g. etc/script/notify/notify.so
|
||||
func alertingCallPlugin(stdinBytes []byte) {
|
||||
@@ -415,8 +410,26 @@ func alertingCallPlugin(stdinBytes []byte) {
|
||||
return
|
||||
}
|
||||
|
||||
logger.Debugf("alertingCallPlugin begin")
|
||||
logger.Debugf("payload: %s", string(stdinBytes))
|
||||
notifier.Instance.Notify(stdinBytes)
|
||||
logger.Debugf("alertingCallPlugin done")
|
||||
if runtime.GOOS == "windows" {
|
||||
logger.Errorf("call notify plugin on unsupported os: %s", runtime.GOOS)
|
||||
return
|
||||
}
|
||||
|
||||
p, err := plugin.Open(config.C.Alerting.CallPlugin.PluginPath)
|
||||
if err != nil {
|
||||
logger.Errorf("failed to open notify plugin: %v", err)
|
||||
return
|
||||
}
|
||||
caller, err := p.Lookup(config.C.Alerting.CallPlugin.Caller)
|
||||
if err != nil {
|
||||
logger.Errorf("failed to load caller: %v", err)
|
||||
return
|
||||
}
|
||||
notifier, ok := caller.(Notifier)
|
||||
if !ok {
|
||||
logger.Errorf("notifier interface not implemented")
|
||||
return
|
||||
}
|
||||
notifier.Notify(stdinBytes)
|
||||
logger.Debugf("alertingCallPlugin done. %s", notifier.Descript())
|
||||
}
|
||||
|
||||
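One side of the notify.go diff above builds the parsed templates into a temporary map and swaps it in under a write lock, while genNotice reads under a read lock. A minimal, self-contained sketch of that pattern follows. The `tplStore` type, its method names, and the in-memory `sources` map are hypothetical stand-ins for illustration (the real code reads files from `config.C.Alerting.TemplatesDir`), so treat this as a sketch under those assumptions rather than the project's implementation.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"text/template"
)

// tplStore is a hypothetical holder for parsed templates (name not taken from the diff).
type tplStore struct {
	mu   sync.RWMutex
	tpls map[string]*template.Template
}

// reload parses every source into a fresh map and swaps it in under the write lock.
// A parse failure returns early, so the previously loaded templates stay in place.
func (s *tplStore) reload(sources map[string]string) error {
	tmp := make(map[string]*template.Template, len(sources))
	for name, src := range sources {
		tpl, err := template.New(name).Parse(src)
		if err != nil {
			return fmt.Errorf("failed to parse tpl %s: %w", name, err)
		}
		tmp[name] = tpl
	}
	s.mu.Lock()
	s.tpls = tmp
	s.mu.Unlock()
	return nil
}

// render executes every loaded template against data while holding the read lock.
func (s *tplStore) render(data interface{}) map[string]string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string]string, len(s.tpls))
	for name, tpl := range s.tpls {
		var body bytes.Buffer
		if err := tpl.Execute(&body, data); err != nil {
			out[name] = err.Error()
			continue
		}
		out[name] = body.String()
	}
	return out
}

func main() {
	s := &tplStore{}
	if err := s.reload(map[string]string{"mail.tpl": "rule: {{.RuleName}}"}); err != nil {
		panic(err)
	}
	fmt.Println(s.render(map[string]string{"RuleName": "cpu_high"})["mail.tpl"])
}
```

Building into a temporary map first means readers never observe a half-populated set of templates, which is the likely motivation for the swap seen in the diff.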
@@ -1,11 +1,8 @@
|
||||
package engine
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"time"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/notifier"
|
||||
"github.com/didi/nightingale/v5/src/server/common/sender"
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
"github.com/didi/nightingale/v5/src/server/memsto"
|
||||
@@ -13,59 +10,27 @@ import (
|
||||
"github.com/toolkits/pkg/logger"
|
||||
)
|
||||
|
||||
type MaintainMessage struct {
|
||||
Tos []*models.User `json:"tos"`
|
||||
Title string `json:"title"`
|
||||
Content string `json:"content"`
|
||||
}
|
||||
|
||||
// notify to maintainer to handle the error
|
||||
func notifyToMaintainer(title, msg string) {
|
||||
logger.Errorf("notifyToMaintainer, msg: %s", msg)
|
||||
func notifyToMaintainer(e error, title string) {
|
||||
|
||||
users := memsto.UserCache.GetMaintainerUsers()
|
||||
if len(users) == 0 {
|
||||
return
|
||||
}
|
||||
logger.Errorf("notifyToMaintainer,title:%s, error:%v", title, e)
|
||||
|
||||
triggerTime := time.Now().Format("2006/01/02 - 15:04:05")
|
||||
|
||||
notifyMaintainerWithPlugin(title, msg, triggerTime, users)
|
||||
notifyMaintainerWithBuiltin(title, msg, triggerTime, users)
|
||||
}
|
||||
|
||||
func notifyMaintainerWithPlugin(title, msg, triggerTime string, users []*models.User) {
|
||||
if !config.C.Alerting.CallPlugin.Enable {
|
||||
return
|
||||
}
|
||||
|
||||
stdinBytes, err := json.Marshal(MaintainMessage{
|
||||
Tos: users,
|
||||
Title: title,
|
||||
Content: "Title: " + title + "\nContent: " + msg + "\nTime: " + triggerTime,
|
||||
})
|
||||
|
||||
if err != nil {
|
||||
logger.Error("failed to marshal MaintainMessage:", err)
|
||||
return
|
||||
}
|
||||
|
||||
notifier.Instance.NotifyMaintainer(stdinBytes)
|
||||
logger.Debugf("notify maintainer with plugin done")
|
||||
}
|
||||
|
||||
func notifyMaintainerWithBuiltin(title, msg, triggerTime string, users []*models.User) {
|
||||
if len(config.C.Alerting.NotifyBuiltinChannels) == 0 {
|
||||
return
|
||||
}
|
||||
|
||||
maintainerUsers := memsto.UserCache.GetMaintainerUsers()
|
||||
if len(maintainerUsers) == 0 {
|
||||
return
|
||||
}
|
||||
|
||||
emailset := make(map[string]struct{})
|
||||
phoneset := make(map[string]struct{})
|
||||
wecomset := make(map[string]struct{})
|
||||
dingtalkset := make(map[string]struct{})
|
||||
feishuset := make(map[string]struct{})
|
||||
|
||||
for _, user := range users {
|
||||
for _, user := range maintainerUsers {
|
||||
if user.Email != "" {
|
||||
emailset[user.Email] = struct{}{}
|
||||
}
|
||||
@@ -97,6 +62,7 @@ func notifyMaintainerWithBuiltin(title, msg, triggerTime string, users []*models
|
||||
}
|
||||
|
||||
phones := StringSetKeys(phoneset)
|
||||
triggerTime := time.Now().Format("2006/01/02 - 15:04:05")
|
||||
|
||||
for _, ch := range config.C.Alerting.NotifyBuiltinChannels {
|
||||
switch ch {
|
||||
@@ -104,13 +70,13 @@ func notifyMaintainerWithBuiltin(title, msg, triggerTime string, users []*models
|
||||
if len(emailset) == 0 {
|
||||
continue
|
||||
}
|
||||
content := "Title: " + title + "\nContent: " + msg + "\nTime: " + triggerTime
|
||||
content := "【内部处理错误】当前标题: " + title + "\n【内部处理错误】当前异常: " + e.Error() + "\n【内部处理错误】发送时间: " + triggerTime
|
||||
sender.WriteEmail(title, content, StringSetKeys(emailset))
|
||||
case "dingtalk":
|
||||
if len(dingtalkset) == 0 {
|
||||
continue
|
||||
}
|
||||
content := "**Title: **" + title + "\n**Content: **" + msg + "\n**Time: **" + triggerTime
|
||||
content := "**【内部处理错误】当前标题: **" + title + "\n**【内部处理错误】当前异常: **" + e.Error() + "\n**【内部处理错误】发送时间: **" + triggerTime
|
||||
sender.SendDingtalk(sender.DingtalkMessage{
|
||||
Title: title,
|
||||
Text: content,
|
||||
@@ -121,7 +87,7 @@ func notifyMaintainerWithBuiltin(title, msg, triggerTime string, users []*models
|
||||
if len(wecomset) == 0 {
|
||||
continue
|
||||
}
|
||||
content := "**Title: **" + title + "\n**Content: **" + msg + "\n**Time: **" + triggerTime
|
||||
content := "**【内部处理错误】当前标题: **" + title + "\n**【内部处理错误】当前异常: **" + e.Error() + "\n**【内部处理错误】发送时间: **" + triggerTime
|
||||
sender.SendWecom(sender.WecomMessage{
|
||||
Text: content,
|
||||
Tokens: StringSetKeys(wecomset),
|
||||
@@ -131,7 +97,7 @@ func notifyMaintainerWithBuiltin(title, msg, triggerTime string, users []*models
|
||||
continue
|
||||
}
|
||||
|
||||
content := "Title: " + title + "\nContent: " + msg + "\nTime: " + triggerTime
|
||||
content := "【内部处理错误】当前标题: " + title + "\n【内部处理错误】当前异常: " + e.Error() + "\n【内部处理错误】发送时间: " + triggerTime
|
||||
sender.SendFeishu(sender.FeishuMessage{
|
||||
Text: content,
|
||||
AtMobiles: phones,
|
||||
|
||||
@@ -1,65 +0,0 @@
|
||||
package engine
|
||||
|
||||
import (
|
||||
"sync"
|
||||
"time"
|
||||
)
|
||||
|
||||
type ErrorType string
|
||||
|
||||
// register new error here
|
||||
const (
|
||||
QueryPrometheusError ErrorType = "QueryPrometheusError"
|
||||
RuntimeError ErrorType = "RuntimeError"
|
||||
)
|
||||
|
||||
type reporter struct {
|
||||
sync.Mutex
|
||||
em map[ErrorType]uint64
|
||||
cb func(em map[ErrorType]uint64)
|
||||
}
|
||||
|
||||
var rp reporter
|
||||
|
||||
func initReporter(cb func(em map[ErrorType]uint64)) {
|
||||
rp = reporter{cb: cb, em: make(map[ErrorType]uint64)}
|
||||
rp.Start()
|
||||
}
|
||||
|
||||
func Report(errorType ErrorType) {
|
||||
rp.report(errorType)
|
||||
}
|
||||
|
||||
func (r *reporter) reset() map[ErrorType]uint64 {
|
||||
r.Lock()
|
||||
defer r.Unlock()
|
||||
if len(r.em) == 0 {
|
||||
return nil
|
||||
}
|
||||
|
||||
oem := r.em
|
||||
r.em = make(map[ErrorType]uint64)
|
||||
return oem
|
||||
}
|
||||
|
||||
func (r *reporter) report(errorType ErrorType) {
|
||||
r.Lock()
|
||||
defer r.Unlock()
|
||||
if count, has := r.em[errorType]; has {
|
||||
r.em[errorType] = count + 1
|
||||
} else {
|
||||
r.em[errorType] = 1
|
||||
}
|
||||
}
|
||||
|
||||
func (r *reporter) Start() {
|
||||
for {
|
||||
select {
|
||||
case <-time.After(time.Minute):
|
||||
cur := r.reset()
|
||||
if cur != nil {
|
||||
r.cb(cur)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
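The reporter file above counts errors per type under a mutex and hands the accumulated map to a callback once a minute, replacing it with a fresh map. The count-and-swap core can be sketched as below. The lower-case names and the `newReporter` constructor are illustrative assumptions, and the timer loop and callback are left out so the example stays deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// errorType mirrors the string-typed error categories in the diff.
type errorType string

// reporter accumulates error counts between flushes.
type reporter struct {
	mu sync.Mutex
	em map[errorType]uint64
}

// newReporter is a hypothetical constructor for this sketch.
func newReporter() *reporter {
	return &reporter{em: make(map[errorType]uint64)}
}

// report increments the counter for one error type.
func (r *reporter) report(t errorType) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.em[t]++ // a missing key reads as zero, so no existence check is needed
}

// reset returns the counts gathered so far and starts a fresh map.
// It returns nil when nothing was reported, so callers can skip the notification.
func (r *reporter) reset() map[errorType]uint64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	if len(r.em) == 0 {
		return nil
	}
	old := r.em
	r.em = make(map[errorType]uint64)
	return old
}

func main() {
	r := newReporter()
	r.report("QueryPrometheusError")
	r.report("QueryPrometheusError")
	r.report("RuntimeError")
	for k, v := range r.reset() {
		fmt.Printf("error: %s, count: %d\n", k, v)
	}
}
```

Swapping the map instead of clearing it in place lets the caller format and send the old counts without holding the lock.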
@@ -3,23 +3,23 @@ package engine
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"log"
|
||||
"math/rand"
|
||||
"sort"
|
||||
"strings"
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/server/writer"
|
||||
"github.com/prometheus/common/model"
|
||||
"github.com/toolkits/pkg/logger"
|
||||
"github.com/toolkits/pkg/net/httplib"
|
||||
"github.com/toolkits/pkg/str"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/pkg/prom"
|
||||
"github.com/didi/nightingale/v5/src/server/common/conv"
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
"github.com/didi/nightingale/v5/src/server/memsto"
|
||||
"github.com/didi/nightingale/v5/src/server/naming"
|
||||
"github.com/didi/nightingale/v5/src/server/reader"
|
||||
promstat "github.com/didi/nightingale/v5/src/server/stat"
|
||||
)
|
||||
|
||||
@@ -59,88 +59,25 @@ func filterRules() {
|
||||
}
|
||||
|
||||
Workers.Build(mines)
|
||||
RuleEvalForExternal.Build()
|
||||
}
|
||||
|
||||
type RuleEval struct {
|
||||
rule *models.AlertRule
|
||||
fires *AlertCurEventMap
|
||||
pendings *AlertCurEventMap
|
||||
fires map[string]*models.AlertCurEvent
|
||||
pendings map[string]*models.AlertCurEvent
|
||||
quit chan struct{}
|
||||
}
|
||||
|
||||
type AlertCurEventMap struct {
|
||||
sync.RWMutex
|
||||
Data map[string]*models.AlertCurEvent
|
||||
}
|
||||
|
||||
func (a *AlertCurEventMap) SetAll(data map[string]*models.AlertCurEvent) {
|
||||
a.Lock()
|
||||
defer a.Unlock()
|
||||
a.Data = data
|
||||
}
|
||||
|
||||
func (a *AlertCurEventMap) Set(key string, value *models.AlertCurEvent) {
|
||||
a.Lock()
|
||||
defer a.Unlock()
|
||||
a.Data[key] = value
|
||||
}
|
||||
|
||||
func (a *AlertCurEventMap) Get(key string) (*models.AlertCurEvent, bool) {
|
||||
a.RLock()
|
||||
defer a.RUnlock()
|
||||
event, exists := a.Data[key]
|
||||
return event, exists
|
||||
}
|
||||
|
||||
func (a *AlertCurEventMap) UpdateLastEvalTime(key string, lastEvalTime int64) {
|
||||
a.Lock()
|
||||
defer a.Unlock()
|
||||
event, exists := a.Data[key]
|
||||
if !exists {
|
||||
return
|
||||
}
|
||||
event.LastEvalTime = lastEvalTime
|
||||
}
|
||||
|
||||
func (a *AlertCurEventMap) Delete(key string) {
|
||||
a.Lock()
|
||||
defer a.Unlock()
|
||||
delete(a.Data, key)
|
||||
}
|
||||
|
||||
func (a *AlertCurEventMap) Keys() []string {
|
||||
a.RLock()
|
||||
defer a.RUnlock()
|
||||
keys := make([]string, 0, len(a.Data))
|
||||
for k := range a.Data {
|
||||
keys = append(keys, k)
|
||||
}
|
||||
return keys
|
||||
}
|
||||
|
||||
func (a *AlertCurEventMap) GetAll() map[string]*models.AlertCurEvent {
|
||||
a.RLock()
|
||||
defer a.RUnlock()
|
||||
return a.Data
|
||||
}
|
||||
|
||||
func NewAlertCurEventMap() *AlertCurEventMap {
|
||||
return &AlertCurEventMap{
|
||||
Data: make(map[string]*models.AlertCurEvent),
|
||||
}
|
||||
}
|
||||
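The AlertCurEventMap shown above guards a map with a sync.RWMutex, but its GetAll returns the internal map itself, so a caller ranges over it after the read lock has been released. One possible alternative, offered here only as an assumption and not as the project's approach, is to return a copy taken under the lock. The `eventMap` type and `Snapshot` method below are hypothetical names, and int64 values stand in for `*models.AlertCurEvent`.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// eventMap is a hypothetical RWMutex-guarded map used for illustration.
type eventMap struct {
	mu   sync.RWMutex
	data map[string]int64
}

func newEventMap() *eventMap {
	return &eventMap{data: make(map[string]int64)}
}

// Set stores a value under the write lock.
func (m *eventMap) Set(key string, v int64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = v
}

// Delete removes a key under the write lock.
func (m *eventMap) Delete(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.data, key)
}

// Snapshot copies the map under the read lock, so the caller can iterate
// the result while other goroutines keep calling Set and Delete.
func (m *eventMap) Snapshot() map[string]int64 {
	m.mu.RLock()
	defer m.mu.RUnlock()
	out := make(map[string]int64, len(m.data))
	for k, v := range m.data {
		out[k] = v
	}
	return out
}

// Keys returns the current keys in sorted order.
func (m *eventMap) Keys() []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	keys := make([]string, 0, len(m.data))
	for k := range m.data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	m := newEventMap()
	m.Set("a", 1)
	m.Set("b", 2)
	// Deleting while ranging over the snapshot never touches the map being iterated.
	for k := range m.Snapshot() {
		m.Delete(k)
	}
	fmt.Println(len(m.Keys()))
}
```

The copy costs an allocation per call, which may or may not be acceptable depending on how many events a rule holds.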
|
||||
func (r *RuleEval) Stop() {
|
||||
func (r RuleEval) Stop() {
|
||||
logger.Infof("rule_eval:%d stopping", r.RuleID())
|
||||
close(r.quit)
|
||||
}
|
||||
|
||||
func (r *RuleEval) RuleID() int64 {
|
||||
func (r RuleEval) RuleID() int64 {
|
||||
return r.rule.Id
|
||||
}
|
||||
|
||||
func (r *RuleEval) Start() {
|
||||
func (r RuleEval) Start() {
|
||||
logger.Infof("rule_eval:%d started", r.RuleID())
|
||||
for {
|
||||
select {
|
||||
@@ -149,7 +86,7 @@ func (r *RuleEval) Start() {
|
||||
return
|
||||
default:
|
||||
r.Work()
|
||||
logger.Debugf("rule executed, rule_eval:%d", r.RuleID())
|
||||
logger.Debugf("rule executed,rule_id=%d", r.RuleID())
|
||||
interval := r.rule.PromEvalInterval
|
||||
if interval <= 0 {
|
||||
interval = 10
|
||||
@@ -159,29 +96,27 @@ func (r *RuleEval) Start() {
|
||||
}
|
||||
}
|
||||
|
||||
func (r *RuleEval) Work() {
|
||||
type AnomalyPoint struct {
|
||||
Data model.Matrix `json:"data"`
|
||||
Err string `json:"error"`
|
||||
}
|
||||
|
||||
func (r RuleEval) Work() {
|
||||
promql := strings.TrimSpace(r.rule.PromQl)
|
||||
if promql == "" {
|
||||
logger.Errorf("rule_eval:%d promql is blank", r.RuleID())
|
||||
return
|
||||
}
|
||||
|
||||
if config.ReaderClient.IsNil() {
|
||||
logger.Error("reader client is nil")
|
||||
return
|
||||
}
|
||||
|
||||
clusterName, readerClient := config.ReaderClient.Get()
|
||||
|
||||
var value model.Value
|
||||
var err error
|
||||
if r.rule.Algorithm == "" && (r.rule.Cate == "" || r.rule.Cate == "prometheus") {
|
||||
var warnings prom.Warnings
|
||||
value, warnings, err = readerClient.Query(context.Background(), promql, time.Now())
|
||||
if r.rule.Algorithm == "" {
|
||||
var warnings reader.Warnings
|
||||
value, warnings, err = reader.Reader.Client.Query(context.Background(), promql, time.Now())
|
||||
if err != nil {
|
||||
logger.Errorf("rule_eval:%d promql:%s, error:%v", r.RuleID(), promql, err)
|
||||
//notifyToMaintainer(err, "failed to query prometheus")
|
||||
Report(QueryPrometheusError)
|
||||
// The alert query against Prometheus failed; send a notification to the maintainers
|
||||
notifyToMaintainer(err, "查询prometheus出错")
|
||||
return
|
||||
}
|
||||
|
||||
@@ -189,18 +124,34 @@ func (r *RuleEval) Work() {
|
||||
logger.Errorf("rule_eval:%d promql:%s, warnings:%v", r.RuleID(), promql, warnings)
|
||||
return
|
||||
}
|
||||
logger.Debugf("rule_eval:%d promql:%s, value:%v", r.RuleID(), promql, value)
|
||||
} else {
|
||||
var res AnomalyPoint
|
||||
count := len(config.C.AnomalyDataApi)
|
||||
for _, i := range rand.Perm(count) {
|
||||
url := fmt.Sprintf("%s?rid=%d", config.C.AnomalyDataApi[i], r.rule.Id)
|
||||
err = httplib.Get(url).SetTimeout(time.Duration(3000) * time.Millisecond).ToJSON(&res)
|
||||
if err != nil {
|
||||
logger.Errorf("curl %s fail: %v", url, err)
|
||||
continue
|
||||
}
|
||||
if res.Err != "" {
|
||||
logger.Errorf("curl %s fail: %s", url, res.Err)
|
||||
continue
|
||||
}
|
||||
value = res.Data
|
||||
logger.Debugf("curl %s get: %+v", url, res.Data)
|
||||
}
|
||||
}
|
||||
|
||||
r.Judge(clusterName, conv.ConvertVectors(value))
|
||||
r.judge(conv.ConvertVectors(value))
|
||||
}
|
||||
|
||||
type WorkersType struct {
|
||||
rules map[string]*RuleEval
|
||||
rules map[string]RuleEval
|
||||
recordRules map[string]RecordingRuleEval
|
||||
}
|
||||
|
||||
var Workers = &WorkersType{rules: make(map[string]*RuleEval), recordRules: make(map[string]RecordingRuleEval)}
|
||||
var Workers = &WorkersType{rules: make(map[string]RuleEval), recordRules: make(map[string]RecordingRuleEval)}
|
||||
|
||||
func (ws *WorkersType) Build(rids []int64) {
|
||||
rules := make(map[string]*models.AlertRule)
|
||||
@@ -238,6 +189,7 @@ func (ws *WorkersType) Build(rids []int64) {
|
||||
elst, err := models.AlertCurEventGetByRule(rules[hash].Id)
|
||||
if err != nil {
|
||||
logger.Errorf("worker_build: AlertCurEventGetByRule failed: %v", err)
|
||||
notifyToMaintainer(err, "AlertCurEventGetByRule Error,ruleID="+fmt.Sprint(rules[hash].Id))
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -246,13 +198,12 @@ func (ws *WorkersType) Build(rids []int64) {
|
||||
elst[i].DB2Mem()
|
||||
firemap[elst[i].Hash] = elst[i]
|
||||
}
|
||||
fires := NewAlertCurEventMap()
|
||||
fires.SetAll(firemap)
|
||||
re := &RuleEval{
|
||||
|
||||
re := RuleEval{
|
||||
rule: rules[hash],
|
||||
quit: make(chan struct{}),
|
||||
fires: fires,
|
||||
pendings: NewAlertCurEventMap(),
|
||||
fires: firemap,
|
||||
pendings: make(map[string]*models.AlertCurEvent),
|
||||
}
|
||||
|
||||
go re.Start()
|
||||
@@ -307,31 +258,20 @@ func (ws *WorkersType) BuildRe(rids []int64) {
|
||||
}
|
||||
}
|
||||
|
||||
func (r *RuleEval) Judge(clusterName string, vectors []conv.Vector) {
|
||||
now := time.Now().Unix()
|
||||
|
||||
alertingKeys, ruleExists := r.MakeNewEvent("inner", now, clusterName, vectors)
|
||||
if !ruleExists {
|
||||
return
|
||||
}
|
||||
|
||||
// handle recovered events
|
||||
r.recoverRule(alertingKeys, now)
|
||||
}
|
||||
|
||||
func (r *RuleEval) MakeNewEvent(from string, now int64, clusterName string, vectors []conv.Vector) (map[string]struct{}, bool) {
|
||||
func (r RuleEval) judge(vectors []conv.Vector) {
|
||||
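// Some of the rule's settings may have changed, e.g. alert receivers or callbacks.
// Such changes do not restart the worker, but they do affect how alerts are handled,
// so fetch the rule from memsto.AlertRuleCache here and overwrite the local copy.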
|
||||
curRule := memsto.AlertRuleCache.Get(r.rule.Id)
|
||||
if curRule == nil {
|
||||
return map[string]struct{}{}, false
|
||||
return
|
||||
}
|
||||
|
||||
r.rule = curRule
|
||||
|
||||
count := len(vectors)
|
||||
alertingKeys := make(map[string]struct{})
|
||||
now := time.Now().Unix()
|
||||
for i := 0; i < count; i++ {
|
||||
// compute hash
|
||||
hash := str.MD5(fmt.Sprintf("%d_%s", r.rule.Id, vectors[i].Key))
|
||||
@@ -339,7 +279,6 @@ func (r *RuleEval) MakeNewEvent(from string, now int64, clusterName string, vect
|
||||
|
||||
// rule disabled in this time span?
|
||||
if isNoneffective(vectors[i].Timestamp, r.rule) {
|
||||
logger.Debugf("event_disabled: rule_eval:%d rule:%v timestamp:%d", r.rule.Id, r.rule, vectors[i].Timestamp)
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -368,7 +307,6 @@ func (r *RuleEval) MakeNewEvent(from string, now int64, clusterName string, vect
|
||||
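// For alert events that carry an ident, check whether the ident's BG is the same as the rule's BG.
// If the rule is set to take effect only in its own BG, machines in other BGs must not raise alerts from it.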
|
||||
if r.rule.EnableInBG == 1 && target.GroupId != r.rule.GroupId {
|
||||
logger.Debugf("event_enable_in_bg: rule_eval:%d", r.rule.Id)
|
||||
continue
|
||||
}
|
||||
}
|
||||
@@ -387,7 +325,7 @@ func (r *RuleEval) MakeNewEvent(from string, now int64, clusterName string, vect
|
||||
}
|
||||
|
||||
// isMuted only need TriggerTime RuleName and TagsMap
|
||||
if IsMuted(event) {
|
||||
if isMuted(event) {
|
||||
logger.Infof("event_muted: rule_id=%d %s", r.rule.Id, vectors[i].Key)
|
||||
continue
|
||||
}
|
||||
@@ -395,8 +333,7 @@ func (r *RuleEval) MakeNewEvent(from string, now int64, clusterName string, vect
|
||||
tagsArr := labelMapToArr(tagsMap)
|
||||
sort.Strings(tagsArr)
|
||||
|
||||
event.Cluster = clusterName
|
||||
event.Cate = r.rule.Cate
|
||||
event.Cluster = r.rule.Cluster
|
||||
event.Hash = hash
|
||||
event.RuleId = r.rule.Id
|
||||
event.RuleName = r.rule.Name
|
||||
@@ -422,15 +359,12 @@ func (r *RuleEval) MakeNewEvent(from string, now int64, clusterName string, vect
|
||||
event.Tags = strings.Join(tagsArr, ",,")
|
||||
event.IsRecovered = false
|
||||
event.LastEvalTime = now
|
||||
if from != "inner" {
|
||||
event.LastEvalTime = event.TriggerTime
|
||||
}
|
||||
|
||||
r.handleNewEvent(event)
|
||||
|
||||
}
|
||||
|
||||
return alertingKeys, true
|
||||
// handle recovered events
|
||||
r.recoverRule(alertingKeys, now)
|
||||
}
|
||||
|
||||
func readableValue(value float64) string {
|
||||
@@ -454,30 +388,26 @@ func labelMapToArr(m map[string]string) []string {
|
||||
return labelStrings
|
||||
}
|
||||
|
||||
func (r *RuleEval) handleNewEvent(event *models.AlertCurEvent) {
|
||||
func (r RuleEval) handleNewEvent(event *models.AlertCurEvent) {
|
||||
if event.PromForDuration == 0 {
|
||||
r.fireEvent(event)
|
||||
return
|
||||
}
|
||||
|
||||
var preTriggerTime int64
|
||||
preEvent, has := r.pendings.Get(event.Hash)
|
||||
_, has := r.pendings[event.Hash]
|
||||
if has {
|
||||
r.pendings.UpdateLastEvalTime(event.Hash, event.LastEvalTime)
|
||||
preTriggerTime = preEvent.TriggerTime
|
||||
r.pendings[event.Hash].LastEvalTime = event.LastEvalTime
|
||||
} else {
|
||||
r.pendings.Set(event.Hash, event)
|
||||
preTriggerTime = event.TriggerTime
|
||||
r.pendings[event.Hash] = event
|
||||
}
|
||||
|
||||
if event.LastEvalTime-preTriggerTime+int64(event.PromEvalInterval) >= int64(event.PromForDuration) {
|
||||
if r.pendings[event.Hash].LastEvalTime-r.pendings[event.Hash].TriggerTime+int64(event.PromEvalInterval) >= int64(event.PromForDuration) {
|
||||
r.fireEvent(event)
|
||||
}
|
||||
}
|
||||
|
||||
func (r *RuleEval) fireEvent(event *models.AlertCurEvent) {
|
||||
if fired, has := r.fires.Get(event.Hash); has {
|
||||
r.fires.UpdateLastEvalTime(event.Hash, event.LastEvalTime)
|
||||
func (r RuleEval) fireEvent(event *models.AlertCurEvent) {
|
||||
if fired, has := r.fires[event.Hash]; has {
|
||||
r.fires[event.Hash].LastEvalTime = event.LastEvalTime
|
||||
|
||||
if r.rule.NotifyRepeatStep == 0 {
|
||||
// Repeat notification is not wanted, so return directly; nothing to do
|
||||
@@ -489,7 +419,6 @@ func (r *RuleEval) fireEvent(event *models.AlertCurEvent) {
|
||||
if r.rule.NotifyMaxNumber == 0 {
|
||||
// A max send count of 0 means the number of sends is not limited; keep sending
|
||||
event.NotifyCurNumber = fired.NotifyCurNumber + 1
|
||||
event.FirstTriggerTime = fired.FirstTriggerTime
|
||||
r.pushEventToQueue(event)
|
||||
} else {
|
||||
// There is a max send count, so check how many have been sent and whether the limit has been reached
|
||||
@@ -497,7 +426,6 @@ func (r *RuleEval) fireEvent(event *models.AlertCurEvent) {
|
||||
return
|
||||
} else {
|
||||
event.NotifyCurNumber = fired.NotifyCurNumber + 1
|
||||
event.FirstTriggerTime = fired.FirstTriggerTime
|
||||
r.pushEventToQueue(event)
|
||||
}
|
||||
}
|
||||
@@ -505,92 +433,70 @@ func (r *RuleEval) fireEvent(event *models.AlertCurEvent) {
|
||||
}
|
||||
} else {
|
||||
event.NotifyCurNumber = 1
|
||||
event.FirstTriggerTime = event.TriggerTime
|
||||
r.pushEventToQueue(event)
|
||||
}
|
||||
}
|
||||
|
||||
func (r *RuleEval) recoverRule(alertingKeys map[string]struct{}, now int64) {
|
||||
for _, hash := range r.pendings.Keys() {
|
||||
if _, has := alertingKeys[hash]; has {
|
||||
continue
|
||||
}
|
||||
r.pendings.Delete(hash)
|
||||
}
|
||||
|
||||
for hash, event := range r.fires.GetAll() {
|
||||
func (r RuleEval) recoverRule(alertingKeys map[string]struct{}, now int64) {
|
||||
for hash := range r.pendings {
|
||||
if _, has := alertingKeys[hash]; has {
|
||||
continue
|
||||
}
|
||||
|
||||
r.recoverEvent(hash, event, now)
|
||||
delete(r.pendings, hash)
|
||||
}
|
||||
|
||||
for hash, event := range r.fires {
|
||||
if _, has := alertingKeys[hash]; has {
|
||||
continue
|
||||
}
|
||||
|
||||
// If an observation (recover) duration is configured, the event must not recover immediately
|
||||
if r.rule.RecoverDuration > 0 && now-event.LastEvalTime < r.rule.RecoverDuration {
|
||||
continue
|
||||
}
|
||||
|
||||
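// No vector crossing the threshold was found, so assume this vector's value has recovered.
// There is no way to tell whether Prometheus has a value that did not meet the threshold and so was not returned, or whether it lost some points and had no data to return. Awkward.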
|
||||
delete(r.fires, hash)
|
||||
delete(r.pendings, hash)
|
||||
|
||||
event.IsRecovered = true
|
||||
event.LastEvalTime = now
|
||||
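// The recovery may be due to an adjusted promql, so the event should carry the latest promql; otherwise users would be confused.
// Any field of the rule may have changed, so update them all.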
|
||||
event.RuleName = r.rule.Name
|
||||
event.RuleNote = r.rule.Note
|
||||
event.RuleProd = r.rule.Prod
|
||||
event.RuleAlgo = r.rule.Algorithm
|
||||
event.Severity = r.rule.Severity
|
||||
event.PromForDuration = r.rule.PromForDuration
|
||||
event.PromQl = r.rule.PromQl
|
||||
event.PromEvalInterval = r.rule.PromEvalInterval
|
||||
event.Callbacks = r.rule.Callbacks
|
||||
event.CallbacksJSON = r.rule.CallbacksJSON
|
||||
event.RunbookUrl = r.rule.RunbookUrl
|
||||
event.NotifyRecovered = r.rule.NotifyRecovered
|
||||
event.NotifyChannels = r.rule.NotifyChannels
|
||||
event.NotifyChannelsJSON = r.rule.NotifyChannelsJSON
|
||||
event.NotifyGroups = r.rule.NotifyGroups
|
||||
event.NotifyGroupsJSON = r.rule.NotifyGroupsJSON
|
||||
r.pushEventToQueue(event)
|
||||
}
|
||||
}
|
||||
|
||||
func (r *RuleEval) RecoverEvent(hash string, now int64, value float64) {
|
||||
curRule := memsto.AlertRuleCache.Get(r.rule.Id)
|
||||
if curRule == nil {
|
||||
return
|
||||
}
|
||||
r.rule = curRule
|
||||
|
||||
r.pendings.Delete(hash)
|
||||
event, has := r.fires.Get(hash)
|
||||
if !has {
|
||||
return
|
||||
}
|
||||
|
||||
event.TriggerValue = fmt.Sprintf("%.5f", value)
|
||||
r.recoverEvent(hash, event, now)
|
||||
}
|
||||
|
||||
func (r *RuleEval) recoverEvent(hash string, event *models.AlertCurEvent, now int64) {
|
||||
// If an observation (recover) duration is configured, the event must not recover immediately
|
||||
if r.rule.RecoverDuration > 0 && now-event.LastEvalTime < r.rule.RecoverDuration {
|
||||
return
|
||||
}
|
||||
|
||||
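// No vector crossing the threshold was found, so assume this vector's value has recovered.
// There is no way to tell whether Prometheus has a value that did not meet the threshold and so was not returned, or whether it lost some points and had no data to return. Awkward.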
|
||||
r.fires.Delete(hash)
|
||||
r.pendings.Delete(hash)
|
||||
|
||||
event.IsRecovered = true
|
||||
event.LastEvalTime = now
|
||||
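// The recovery may be due to an adjusted promql, so the event should carry the latest promql; otherwise users would be confused.
// Any field of the rule may have changed, so update them all.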
|
||||
event.RuleName = r.rule.Name
|
||||
event.RuleNote = r.rule.Note
|
||||
event.RuleProd = r.rule.Prod
|
||||
event.RuleAlgo = r.rule.Algorithm
|
||||
event.Severity = r.rule.Severity
|
||||
event.PromForDuration = r.rule.PromForDuration
|
||||
event.PromQl = r.rule.PromQl
|
||||
event.PromEvalInterval = r.rule.PromEvalInterval
|
||||
event.Callbacks = r.rule.Callbacks
|
||||
event.CallbacksJSON = r.rule.CallbacksJSON
|
||||
event.RunbookUrl = r.rule.RunbookUrl
|
||||
event.NotifyRecovered = r.rule.NotifyRecovered
|
||||
event.NotifyChannels = r.rule.NotifyChannels
|
||||
event.NotifyChannelsJSON = r.rule.NotifyChannelsJSON
|
||||
event.NotifyGroups = r.rule.NotifyGroups
|
||||
event.NotifyGroupsJSON = r.rule.NotifyGroupsJSON
|
||||
r.pushEventToQueue(event)
|
||||
}
|
||||
|
||||
func (r *RuleEval) pushEventToQueue(event *models.AlertCurEvent) {
|
||||
func (r RuleEval) pushEventToQueue(event *models.AlertCurEvent) {
|
||||
if !event.IsRecovered {
|
||||
event.LastSentTime = event.LastEvalTime
|
||||
r.fires.Set(event.Hash, event)
|
||||
r.fires[event.Hash] = event
|
||||
}
|
||||
|
||||
promstat.CounterAlertsTotal.WithLabelValues(event.Cluster).Inc()
|
||||
promstat.CounterAlertsTotal.WithLabelValues(config.C.ClusterName).Inc()
|
||||
LogEvent(event, "push_queue")
|
||||
if !EventQueue.PushFront(event) {
|
||||
logger.Warningf("event_push_queue: queue is full")
|
||||
}
|
||||
}
|
||||
|
||||
func filterRecordingRules() {
|
||||
ids := memsto.RecordingRuleCache.GetRuleIds()
|
||||
|
||||
@@ -651,12 +557,7 @@ func (r RecordingRuleEval) Work() {
|
||||
return
|
||||
}
|
||||
|
||||
if config.ReaderClient.IsNil() {
|
||||
log.Println("reader client is nil")
|
||||
return
|
||||
}
|
||||
|
||||
value, warnings, err := config.ReaderClient.GetCli().Query(context.Background(), promql, time.Now())
|
||||
value, warnings, err := reader.Reader.Client.Query(context.Background(), promql, time.Now())
|
||||
if err != nil {
|
||||
logger.Errorf("recording_rule_eval:%d promql:%s, error:%v", r.RuleID(), promql, err)
|
||||
return
|
||||
@@ -673,82 +574,3 @@ func (r RecordingRuleEval) Work() {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
type RuleEvalForExternalType struct {
|
||||
sync.RWMutex
|
||||
rules map[int64]RuleEval
|
||||
}
|
||||
|
||||
var RuleEvalForExternal = RuleEvalForExternalType{rules: make(map[int64]RuleEval)}
|
||||
|
||||
func (re *RuleEvalForExternalType) Build() {
|
||||
rids := memsto.AlertRuleCache.GetRuleIds()
|
||||
rules := make(map[int64]*models.AlertRule)
|
||||
|
||||
for i := 0; i < len(rids); i++ {
|
||||
rule := memsto.AlertRuleCache.Get(rids[i])
|
||||
if rule == nil {
|
||||
continue
|
||||
}
|
||||
|
||||
re.Lock()
|
||||
rules[rule.Id] = rule
|
||||
re.Unlock()
|
||||
}
|
||||
|
||||
// stop old
|
||||
for rid := range re.rules {
|
||||
if _, has := rules[rid]; !has {
|
||||
re.Lock()
|
||||
delete(re.rules, rid)
|
||||
re.Unlock()
|
||||
}
|
||||
}
|
||||
|
||||
// start new
|
||||
re.Lock()
|
||||
defer re.Unlock()
|
||||
for rid := range rules {
|
||||
if _, has := re.rules[rid]; has {
|
||||
// already exists
|
||||
continue
|
||||
}
|
||||
|
||||
elst, err := models.AlertCurEventGetByRule(rules[rid].Id)
|
||||
if err != nil {
|
||||
logger.Errorf("worker_build: AlertCurEventGetByRule failed: %v", err)
|
||||
continue
|
||||
}
|
||||
|
||||
firemap := make(map[string]*models.AlertCurEvent)
|
||||
for i := 0; i < len(elst); i++ {
|
||||
elst[i].DB2Mem()
|
||||
firemap[elst[i].Hash] = elst[i]
|
||||
}
|
||||
fires := NewAlertCurEventMap()
|
||||
fires.SetAll(firemap)
|
||||
newRe := RuleEval{
|
||||
rule: rules[rid],
|
||||
quit: make(chan struct{}),
|
||||
fires: fires,
|
||||
pendings: NewAlertCurEventMap(),
|
||||
}
|
||||
|
||||
re.rules[rid] = newRe
|
||||
}
|
||||
}
|
||||
|
||||
func (re *RuleEvalForExternalType) Get(rid int64) (RuleEval, bool) {
|
||||
rule := memsto.AlertRuleCache.Get(rid)
|
||||
if rule == nil {
|
||||
return RuleEval{}, false
|
||||
}
|
||||
|
||||
re.RLock()
|
||||
defer re.RUnlock()
|
||||
if ret, has := re.rules[rid]; has {
|
||||
// already exists
|
||||
return ret, has
|
||||
}
|
||||
return RuleEval{}, false
|
||||
}
|
||||
|
||||
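In handleNewEvent in the worker diff above, an event with a non-zero PromForDuration is first parked in `pendings`, and it fires only once `LastEvalTime - TriggerTime + PromEvalInterval >= PromForDuration`, where TriggerTime is that of the first pending occurrence. A small sketch of that check follows. The `event` and `evaluator` types and the `shouldFire` name are simplified, hypothetical stand-ins for `models.AlertCurEvent` and `RuleEval`.

```go
package main

import "fmt"

// event carries only the fields the for-duration check needs (simplified stand-in).
type event struct {
	Hash         string
	TriggerTime  int64 // unix seconds of this evaluation's sample
	LastEvalTime int64 // unix seconds when the rule was evaluated
	EvalInterval int64 // seconds between evaluations
	ForDuration  int64 // seconds the condition must hold before firing
}

// evaluator keeps the first pending occurrence of each event hash.
type evaluator struct {
	pendings map[string]*event
}

// shouldFire reports whether the condition has held long enough to fire.
func (e *evaluator) shouldFire(ev *event) bool {
	if ev.ForDuration == 0 {
		return true // no for-duration configured: fire immediately
	}
	first := ev.TriggerTime
	if prev, ok := e.pendings[ev.Hash]; ok {
		prev.LastEvalTime = ev.LastEvalTime
		first = prev.TriggerTime // measure from the first pending occurrence
	} else {
		e.pendings[ev.Hash] = ev
	}
	return ev.LastEvalTime-first+ev.EvalInterval >= ev.ForDuration
}

func main() {
	e := &evaluator{pendings: make(map[string]*event)}
	for _, ts := range []int64{100, 110, 120} {
		ev := &event{Hash: "h1", TriggerTime: ts, LastEvalTime: ts, EvalInterval: 10, ForDuration: 30}
		fmt.Println(ts, e.shouldFire(ev))
	}
}
```

Adding the evaluation interval to the elapsed time means an event first seen at one evaluation is treated as having already held for one interval, so a 30-second for-duration with a 10-second interval fires on the third consecutive evaluation.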
@@ -41,10 +41,6 @@ func toRedis() {
|
||||
return
|
||||
}
|
||||
|
||||
if config.ReaderClient.IsNil() {
|
||||
return
|
||||
}
|
||||
|
||||
now := time.Now().Unix()
|
||||
|
||||
// clean old idents
|
||||
@@ -53,7 +49,7 @@ func toRedis() {
|
||||
Idents.Remove(key)
|
||||
} else {
|
||||
// use now as timestamp to redis
|
||||
err := storage.Redis.HSet(context.Background(), redisKey(config.ReaderClient.GetClusterName()), key, now).Err()
|
||||
err := storage.Redis.HSet(context.Background(), redisKey(config.C.ClusterName), key, now).Err()
|
||||
if err != nil {
|
||||
logger.Errorf("redis hset idents failed: %v", err)
|
||||
}
|
||||
@@ -107,14 +103,8 @@ func pushMetrics() {
|
||||
return
|
||||
}
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
logger.Warning("cluster name is blank")
|
||||
return
|
||||
}
|
||||
|
||||
// get all the target heartbeat timestamp
|
||||
ret, err := storage.Redis.HGetAll(context.Background(), redisKey(clusterName)).Result()
|
||||
ret, err := storage.Redis.HGetAll(context.Background(), redisKey(config.C.ClusterName)).Result()
|
||||
if err != nil {
|
||||
logger.Errorf("handle_idents: redis hgetall fail: %v", err)
|
||||
return
|
||||
@@ -131,7 +121,7 @@ func pushMetrics() {
|
||||
}
|
||||
|
||||
if now-clock > dur {
|
||||
clearDeadIdent(context.Background(), clusterName, ident)
|
||||
clearDeadIdent(context.Background(), config.C.ClusterName, ident)
|
||||
} else {
|
||||
actives[ident] = struct{}{}
|
||||
}
|
||||
@@ -163,7 +153,7 @@ func pushMetrics() {
|
||||
if !has {
|
||||
// target not exists
|
||||
target = &models.Target{
|
||||
Cluster: clusterName,
|
||||
Cluster: config.C.ClusterName,
|
||||
Ident: active,
|
||||
Tags: "",
|
||||
TagsJSON: []string{},
|
||||
|
||||
@@ -27,15 +27,6 @@ var AlertMuteCache = AlertMuteCacheType{
|
||||
mutes: make(map[int64][]*models.AlertMute),
|
||||
}
|
||||
|
||||
func (amc *AlertMuteCacheType) Reset() {
|
||||
amc.Lock()
|
||||
defer amc.Unlock()
|
||||
|
||||
amc.statTotal = -1
|
||||
amc.statLastUpdated = -1
|
||||
amc.mutes = make(map[int64][]*models.AlertMute)
|
||||
}
|
||||
|
||||
func (amc *AlertMuteCacheType) StatChanged(total, lastUpdated int64) bool {
|
||||
if amc.statTotal == total && amc.statLastUpdated == lastUpdated {
|
||||
return false
|
||||
@@ -99,26 +90,19 @@ func loopSyncAlertMutes() {
|
||||
func syncAlertMutes() error {
|
||||
start := time.Now()
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
AlertMuteCache.Reset()
|
||||
logger.Warning("cluster name is blank")
|
||||
return nil
|
||||
}
|
||||
|
||||
stat, err := models.AlertMuteStatistics(clusterName)
|
||||
stat, err := models.AlertMuteStatistics(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec AlertMuteStatistics")
|
||||
}
|
||||
|
||||
if !AlertMuteCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_alert_mutes").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_alert_mutes").Set(0)
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_alert_mutes").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_alert_mutes").Set(0)
|
||||
logger.Debug("alert mutes not changed")
|
||||
return nil
|
||||
}
|
||||
|
||||
lst, err := models.AlertMuteGetsByCluster(clusterName)
|
||||
lst, err := models.AlertMuteGetsByCluster(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec AlertMuteGetsByCluster")
|
||||
}
|
||||
@@ -138,8 +122,8 @@ func syncAlertMutes() error {
|
||||
AlertMuteCache.Set(oks, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_alert_mutes").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_alert_mutes").Set(float64(len(lst)))
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_alert_mutes").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_alert_mutes").Set(float64(len(lst)))
|
||||
logger.Infof("timer: sync mutes done, cost: %dms, number: %d", ms, len(lst))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -27,15 +27,6 @@ var AlertRuleCache = AlertRuleCacheType{
|
||||
rules: make(map[int64]*models.AlertRule),
|
||||
}
|
||||
|
||||
func (arc *AlertRuleCacheType) Reset() {
|
||||
arc.Lock()
|
||||
defer arc.Unlock()
|
||||
|
||||
arc.statTotal = -1
|
||||
arc.statLastUpdated = -1
|
||||
arc.rules = make(map[int64]*models.AlertRule)
|
||||
}
|
||||
|
||||
func (arc *AlertRuleCacheType) StatChanged(total, lastUpdated int64) bool {
|
||||
if arc.statTotal == total && arc.statLastUpdated == lastUpdated {
|
||||
return false
|
||||
@@ -96,26 +87,19 @@ func loopSyncAlertRules() {
|
||||
func syncAlertRules() error {
|
||||
start := time.Now()
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
AlertRuleCache.Reset()
|
||||
logger.Warning("cluster name is blank")
|
||||
return nil
|
||||
}
|
||||
|
||||
stat, err := models.AlertRuleStatistics(clusterName)
|
||||
stat, err := models.AlertRuleStatistics(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec AlertRuleStatistics")
|
||||
}
|
||||
|
||||
if !AlertRuleCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_alert_rules").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_alert_rules").Set(0)
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_alert_rules").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_alert_rules").Set(0)
|
||||
logger.Debug("alert rules not changed")
|
||||
return nil
|
||||
}
|
||||
|
||||
lst, err := models.AlertRuleGetsByCluster(clusterName)
|
||||
lst, err := models.AlertRuleGetsByCluster(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec AlertRuleGetsByCluster")
|
||||
}
|
||||
@@ -128,8 +112,8 @@ func syncAlertRules() error {
|
||||
AlertRuleCache.Set(m, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_alert_rules").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_alert_rules").Set(float64(len(m)))
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_alert_rules").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_alert_rules").Set(float64(len(m)))
|
||||
logger.Infof("timer: sync rules done, cost: %dms, number: %d", ms, len(m))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -27,15 +27,6 @@ var AlertSubscribeCache = AlertSubscribeCacheType{
|
||||
subs: make(map[int64][]*models.AlertSubscribe),
|
||||
}
|
||||
|
||||
func (c *AlertSubscribeCacheType) Reset() {
|
||||
c.Lock()
|
||||
defer c.Unlock()
|
||||
|
||||
c.statTotal = -1
|
||||
c.statLastUpdated = -1
|
||||
c.subs = make(map[int64][]*models.AlertSubscribe)
|
||||
}
|
||||
|
||||
func (c *AlertSubscribeCacheType) StatChanged(total, lastUpdated int64) bool {
|
||||
if c.statTotal == total && c.statLastUpdated == lastUpdated {
|
||||
return false
|
||||
@@ -102,26 +93,19 @@ func loopSyncAlertSubscribes() {
|
||||
func syncAlertSubscribes() error {
|
||||
start := time.Now()
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
AlertSubscribeCache.Reset()
|
||||
logger.Warning("cluster name is blank")
|
||||
return nil
|
||||
}
|
||||
|
||||
stat, err := models.AlertSubscribeStatistics(clusterName)
|
||||
stat, err := models.AlertSubscribeStatistics(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec AlertSubscribeStatistics")
|
||||
}
|
||||
|
||||
if !AlertSubscribeCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_alert_subscribes").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_alert_subscribes").Set(0)
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_alert_subscribes").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_alert_subscribes").Set(0)
|
||||
logger.Debug("alert subscribes not changed")
|
||||
return nil
|
||||
}
|
||||
|
||||
lst, err := models.AlertSubscribeGetsByCluster(clusterName)
|
||||
lst, err := models.AlertSubscribeGetsByCluster(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec AlertSubscribeGetsByCluster")
|
||||
}
|
||||
@@ -141,8 +125,8 @@ func syncAlertSubscribes() error {
|
||||
AlertSubscribeCache.Set(subs, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_alert_subscribes").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_alert_subscribes").Set(float64(len(lst)))
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_alert_subscribes").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_alert_subscribes").Set(float64(len(lst)))
|
||||
logger.Infof("timer: sync subscribes done, cost: %dms, number: %d", ms, len(lst))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -79,14 +79,9 @@ func syncBusiGroups() error {
|
||||
return errors.WithMessage(err, "failed to exec BusiGroupStatistics")
|
||||
}
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
|
||||
if !BusiGroupCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
if clusterName != "" {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_busi_groups").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_busi_groups").Set(0)
|
||||
}
|
||||
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_busi_groups").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_busi_groups").Set(0)
|
||||
logger.Debug("busi_group not changed")
|
||||
return nil
|
||||
}
|
||||
@@ -99,11 +94,8 @@ func syncBusiGroups() error {
|
||||
BusiGroupCache.Set(m, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
if clusterName != "" {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_busi_groups").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_busi_groups").Set(float64(len(m)))
|
||||
}
|
||||
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_busi_groups").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_busi_groups").Set(float64(len(m)))
|
||||
logger.Infof("timer: sync busi groups done, cost: %dms, number: %d", ms, len(m))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -26,15 +26,6 @@ var RecordingRuleCache = RecordingRuleCacheType{
|
||||
rules: make(map[int64]*models.RecordingRule),
|
||||
}
|
||||
|
||||
func (rrc *RecordingRuleCacheType) Reset() {
|
||||
rrc.Lock()
|
||||
defer rrc.Unlock()
|
||||
|
||||
rrc.statTotal = -1
|
||||
rrc.statLastUpdated = -1
|
||||
rrc.rules = make(map[int64]*models.RecordingRule)
|
||||
}
|
||||
|
||||
func (rrc *RecordingRuleCacheType) StatChanged(total, lastUpdated int64) bool {
|
||||
if rrc.statTotal == total && rrc.statLastUpdated == lastUpdated {
|
||||
return false
|
||||
@@ -95,26 +86,19 @@ func loopSyncRecordingRules() {
|
||||
func syncRecordingRules() error {
|
||||
start := time.Now()
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
RecordingRuleCache.Reset()
|
||||
logger.Warning("cluster name is blank")
|
||||
return nil
|
||||
}
|
||||
|
||||
stat, err := models.RecordingRuleStatistics(clusterName)
|
||||
stat, err := models.RecordingRuleStatistics(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec RecordingRuleStatistics")
|
||||
}
|
||||
|
||||
if !RecordingRuleCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_recording_rules").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_recording_rules").Set(0)
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_recording_rules").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_recording_rules").Set(0)
|
||||
logger.Debug("recoding rules not changed")
|
||||
return nil
|
||||
}
|
||||
|
||||
lst, err := models.RecordingRuleGetsByCluster(clusterName)
|
||||
lst, err := models.RecordingRuleGetsByCluster(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec RecordingRuleGetsByCluster")
|
||||
}
|
||||
@@ -127,8 +111,8 @@ func syncRecordingRules() error {
|
||||
RecordingRuleCache.Set(m, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_recording_rules").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_recording_rules").Set(float64(len(m)))
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_recording_rules").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_recording_rules").Set(float64(len(m)))
|
||||
logger.Infof("timer: sync recording rules done, cost: %dms, number: %d", ms, len(m))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -31,15 +31,6 @@ var TargetCache = TargetCacheType{
|
||||
targets: make(map[string]*models.Target),
|
||||
}
|
||||
|
||||
func (tc *TargetCacheType) Reset() {
|
||||
tc.Lock()
|
||||
defer tc.Unlock()
|
||||
|
||||
tc.statTotal = -1
|
||||
tc.statLastUpdated = -1
|
||||
tc.targets = make(map[string]*models.Target)
|
||||
}
|
||||
|
||||
func (tc *TargetCacheType) StatChanged(total, lastUpdated int64) bool {
|
||||
if tc.statTotal == total && tc.statLastUpdated == lastUpdated {
|
||||
return false
|
||||
@@ -103,26 +94,19 @@ func loopSyncTargets() {
|
||||
func syncTargets() error {
|
||||
start := time.Now()
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
TargetCache.Reset()
|
||||
logger.Warning("cluster name is blank")
|
||||
return nil
|
||||
}
|
||||
|
||||
stat, err := models.TargetStatistics(clusterName)
|
||||
stat, err := models.TargetStatistics(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec TargetStatistics")
|
||||
}
|
||||
|
||||
if !TargetCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_targets").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_targets").Set(0)
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_targets").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_targets").Set(0)
|
||||
logger.Debug("targets not changed")
|
||||
return nil
|
||||
}
|
||||
|
||||
lst, err := models.TargetGetsByCluster(clusterName)
|
||||
lst, err := models.TargetGetsByCluster(config.C.ClusterName)
|
||||
if err != nil {
|
||||
return errors.WithMessage(err, "failed to exec TargetGetsByCluster")
|
||||
}
|
||||
@@ -145,8 +129,8 @@ func syncTargets() error {
|
||||
TargetCache.Set(m, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_targets").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_targets").Set(float64(len(lst)))
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_targets").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_targets").Set(float64(len(lst)))
|
||||
logger.Infof("timer: sync targets done, cost: %dms, number: %d", ms, len(lst))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -124,14 +124,9 @@ func syncUsers() error {
|
||||
return errors.WithMessage(err, "failed to exec UserStatistics")
|
||||
}
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
|
||||
if !UserCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
if clusterName != "" {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_users").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_users").Set(0)
|
||||
}
|
||||
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_users").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_users").Set(0)
|
||||
logger.Debug("users not changed")
|
||||
return nil
|
||||
}
|
||||
@@ -149,11 +144,8 @@ func syncUsers() error {
|
||||
UserCache.Set(m, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
if clusterName != "" {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_users").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_users").Set(float64(len(m)))
|
||||
}
|
||||
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_users").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_users").Set(float64(len(m)))
|
||||
logger.Infof("timer: sync users done, cost: %dms, number: %d", ms, len(m))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -106,14 +106,9 @@ func syncUserGroups() error {
|
||||
return errors.WithMessage(err, "failed to exec UserGroupStatistics")
|
||||
}
|
||||
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
|
||||
if !UserGroupCache.StatChanged(stat.Total, stat.LastUpdated) {
|
||||
if clusterName != "" {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_user_groups").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_user_groups").Set(0)
|
||||
}
|
||||
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_user_groups").Set(0)
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_user_groups").Set(0)
|
||||
logger.Debug("user_group not changed")
|
||||
return nil
|
||||
}
|
||||
@@ -150,11 +145,8 @@ func syncUserGroups() error {
|
||||
UserGroupCache.Set(m, stat.Total, stat.LastUpdated)
|
||||
|
||||
ms := time.Since(start).Milliseconds()
|
||||
if clusterName != "" {
|
||||
promstat.GaugeCronDuration.WithLabelValues(clusterName, "sync_user_groups").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(clusterName, "sync_user_groups").Set(float64(len(m)))
|
||||
}
|
||||
|
||||
promstat.GaugeCronDuration.WithLabelValues(config.C.ClusterName, "sync_user_groups").Set(float64(ms))
|
||||
promstat.GaugeSyncNumber.WithLabelValues(config.C.ClusterName, "sync_user_groups").Set(float64(len(m)))
|
||||
logger.Infof("timer: sync user groups done, cost: %dms, number: %d", ms, len(m))
|
||||
|
||||
return nil
|
||||
|
||||
@@ -4,45 +4,57 @@ import (
|
||||
"context"
|
||||
"fmt"
|
||||
"sort"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/toolkits/pkg/logger"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
"github.com/didi/nightingale/v5/src/storage"
|
||||
)
|
||||
|
||||
// local servers
|
||||
var localss string
|
||||
|
||||
func Heartbeat(ctx context.Context) error {
|
||||
if err := heartbeat(); err != nil {
|
||||
if err := heartbeat(ctx); err != nil {
|
||||
fmt.Println("failed to heartbeat:", err)
|
||||
return err
|
||||
}
|
||||
|
||||
go loopHeartbeat()
|
||||
go loopHeartbeat(ctx)
|
||||
return nil
|
||||
}
|
||||
|
||||
func loopHeartbeat() {
|
||||
func loopHeartbeat(ctx context.Context) {
|
||||
interval := time.Duration(config.C.Heartbeat.Interval) * time.Millisecond
|
||||
for {
|
||||
time.Sleep(interval)
|
||||
if err := heartbeat(); err != nil {
|
||||
if err := heartbeat(ctx); err != nil {
|
||||
logger.Warning(err)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func heartbeat() error {
|
||||
err := models.AlertingEngineHeartbeat(config.C.Heartbeat.Endpoint)
|
||||
// hash struct:
|
||||
// /server/heartbeat/Default -> {
|
||||
// 10.2.3.4:19000 => $timestamp
|
||||
// 10.2.3.5:19000 => $timestamp
|
||||
// }
|
||||
func redisKey(cluster string) string {
|
||||
return fmt.Sprintf("/server/heartbeat/%s", cluster)
|
||||
}
|
||||
|
||||
func heartbeat(ctx context.Context) error {
|
||||
now := time.Now().Unix()
|
||||
key := redisKey(config.C.ClusterName)
|
||||
err := storage.Redis.HSet(ctx, key, config.C.Heartbeat.Endpoint, now).Err()
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
servers, err := ActiveServers()
|
||||
servers, err := ActiveServers(ctx, config.C.ClusterName)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
@@ -57,12 +69,37 @@ func heartbeat() error {
|
||||
return nil
|
||||
}
|
||||
|
||||
func ActiveServers() ([]string, error) {
|
||||
cluster, err := models.AlertingEngineGetCluster(config.C.Heartbeat.Endpoint)
|
||||
func clearDeadServer(ctx context.Context, cluster, endpoint string) {
|
||||
key := redisKey(cluster)
|
||||
err := storage.Redis.HDel(ctx, key, endpoint).Err()
|
||||
if err != nil {
|
||||
logger.Warningf("failed to hdel %s %s, error: %v", key, endpoint, err)
|
||||
}
|
||||
}
|
||||
|
||||
func ActiveServers(ctx context.Context, cluster string) ([]string, error) {
|
||||
ret, err := storage.Redis.HGetAll(ctx, redisKey(cluster)).Result()
|
||||
if err != nil {
|
||||
return nil, err
|
||||
}
|
||||
|
||||
// a server with a heartbeat within the last 30 seconds is considered alive
|
||||
return models.AlertingEngineGetsInstances("cluster = ? and clock > ?", cluster, time.Now().Unix()-30)
|
||||
now := time.Now().Unix()
|
||||
dur := int64(20)
|
||||
|
||||
actives := make([]string, 0, len(ret))
|
||||
for endpoint, clockstr := range ret {
|
||||
clock, err := strconv.ParseInt(clockstr, 10, 64)
|
||||
if err != nil {
|
||||
continue
|
||||
}
|
||||
|
||||
if now-clock > dur {
|
||||
clearDeadServer(ctx, cluster, endpoint)
|
||||
continue
|
||||
}
|
||||
|
||||
actives = append(actives, endpoint)
|
||||
}
|
||||
|
||||
return actives, nil
|
||||
}
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
package naming
|
||||
|
||||
import (
|
||||
"context"
|
||||
"sort"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
@@ -8,7 +9,7 @@ import (
|
||||
)
|
||||
|
||||
func IamLeader() (bool, error) {
|
||||
servers, err := ActiveServers()
|
||||
servers, err := ActiveServers(context.Background(), config.C.ClusterName)
|
||||
if err != nil {
|
||||
logger.Errorf("failed to get active servers: %v", err)
|
||||
return false, err
|
||||
|
||||
@@ -13,7 +13,7 @@
|
||||
|
||||
// Package v1 provides bindings to the Prometheus HTTP API v1:
|
||||
// http://prometheus.io/docs/querying/api/
|
||||
package prom
|
||||
package reader
|
||||
|
||||
import (
|
||||
"context"
|
||||
@@ -558,11 +558,10 @@ func (qr *queryResult) UnmarshalJSON(b []byte) error {
|
||||
// NewAPI returns a new API for the client.
|
||||
//
|
||||
// It is safe to use the returned API from multiple goroutines.
|
||||
func NewAPI(c api.Client, opt ClientOptions) API {
|
||||
func NewAPI(c api.Client) API {
|
||||
return &httpAPI{
|
||||
client: &apiClientImpl{
|
||||
client: c,
|
||||
opt: opt,
|
||||
},
|
||||
}
|
||||
}
|
||||
@@ -892,7 +891,6 @@ type apiClient interface {
|
||||
|
||||
type apiClientImpl struct {
|
||||
client api.Client
|
||||
opt ClientOptions
|
||||
}
|
||||
|
||||
type apiResponse struct {
|
||||
@@ -923,16 +921,16 @@ func (h *apiClientImpl) URL(ep string, args map[string]string) *url.URL {
|
||||
}
|
||||
|
||||
func (h *apiClientImpl) Do(ctx context.Context, req *http.Request) (*http.Response, []byte, Warnings, error) {
|
||||
if h.opt.BasicAuthUser != "" && h.opt.BasicAuthPass != "" {
|
||||
req.SetBasicAuth(h.opt.BasicAuthUser, h.opt.BasicAuthPass)
|
||||
if Reader.Opts.BasicAuthUser != "" && Reader.Opts.BasicAuthPass != "" {
|
||||
req.SetBasicAuth(Reader.Opts.BasicAuthUser, Reader.Opts.BasicAuthPass)
|
||||
}
|
||||
|
||||
headerCount := len(h.opt.Headers)
|
||||
headerCount := len(Reader.Opts.Headers)
|
||||
if headerCount > 0 && headerCount%2 == 0 {
|
||||
for i := 0; i < len(h.opt.Headers); i += 2 {
|
||||
req.Header.Add(h.opt.Headers[i], h.opt.Headers[i+1])
|
||||
if h.opt.Headers[i] == "Host" {
|
||||
req.Host = h.opt.Headers[i+1]
|
||||
for i := 0; i < len(Reader.Opts.Headers); i += 2 {
|
||||
req.Header.Add(Reader.Opts.Headers[i], Reader.Opts.Headers[i+1])
|
||||
if Reader.Opts.Headers[i] == "Host" {
|
||||
req.Host = Reader.Opts.Headers[i+1]
|
||||
}
|
||||
}
|
||||
}
|
||||
49
src/server/reader/reader.go
Normal file
@@ -0,0 +1,49 @@
|
||||
package reader
|
||||
|
||||
import (
|
||||
"net"
|
||||
"net/http"
|
||||
"time"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
"github.com/prometheus/client_golang/api"
|
||||
)
|
||||
|
||||
type ReaderType struct {
|
||||
Opts config.ReaderOptions
|
||||
Client API
|
||||
}
|
||||
|
||||
var Reader ReaderType
|
||||
|
||||
func Init(opts config.ReaderOptions) error {
|
||||
cli, err := api.NewClient(api.Config{
|
||||
Address: opts.Url,
|
||||
RoundTripper: &http.Transport{
|
||||
// TLSClientConfig: tlsConfig,
|
||||
Proxy: http.ProxyFromEnvironment,
|
||||
DialContext: (&net.Dialer{
|
||||
Timeout: time.Duration(opts.DialTimeout) * time.Millisecond,
|
||||
KeepAlive: time.Duration(opts.KeepAlive) * time.Millisecond,
|
||||
}).DialContext,
|
||||
ResponseHeaderTimeout: time.Duration(opts.Timeout) * time.Millisecond,
|
||||
TLSHandshakeTimeout: time.Duration(opts.TLSHandshakeTimeout) * time.Millisecond,
|
||||
ExpectContinueTimeout: time.Duration(opts.ExpectContinueTimeout) * time.Millisecond,
|
||||
MaxConnsPerHost: opts.MaxConnsPerHost,
|
||||
MaxIdleConns: opts.MaxIdleConns,
|
||||
MaxIdleConnsPerHost: opts.MaxIdleConnsPerHost,
|
||||
IdleConnTimeout: time.Duration(opts.IdleConnTimeout) * time.Millisecond,
|
||||
},
|
||||
})
|
||||
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
Reader = ReaderType{
|
||||
Opts: opts,
|
||||
Client: NewAPI(cli),
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
@@ -18,7 +18,7 @@ import (
|
||||
promstat "github.com/didi/nightingale/v5/src/server/stat"
|
||||
)
|
||||
|
||||
func New(version string, reloadFunc func()) *gin.Engine {
|
||||
func New(version string) *gin.Engine {
|
||||
gin.SetMode(config.C.RunMode)
|
||||
|
||||
loggerMid := aop.Logger()
|
||||
@@ -37,12 +37,12 @@ func New(version string, reloadFunc func()) *gin.Engine {
|
||||
r.Use(loggerMid)
|
||||
}
|
||||
|
||||
configRoute(r, version, reloadFunc)
|
||||
configRoute(r, version)
|
||||
|
||||
return r
|
||||
}
|
||||
|
||||
func configRoute(r *gin.Engine, version string, reloadFunc func()) {
|
||||
func configRoute(r *gin.Engine, version string) {
|
||||
if config.C.HTTP.PProf {
|
||||
pprof.Register(r, "/api/debug/pprof")
|
||||
}
|
||||
@@ -63,13 +63,8 @@ func configRoute(r *gin.Engine, version string, reloadFunc func()) {
|
||||
c.String(200, version)
|
||||
})
|
||||
|
||||
r.POST("/-/reload", func(c *gin.Context) {
|
||||
reloadFunc()
|
||||
c.String(200, "reload success")
|
||||
})
|
||||
|
||||
r.GET("/servers/active", func(c *gin.Context) {
|
||||
lst, err := naming.ActiveServers()
|
||||
lst, err := naming.ActiveServers(c.Request.Context(), config.C.ClusterName)
|
||||
ginx.NewRender(c).Data(lst, err)
|
||||
})
|
||||
|
||||
@@ -103,8 +98,6 @@ func configRoute(r *gin.Engine, version string, reloadFunc func()) {
|
||||
|
||||
service := r.Group("/v1/n9e")
|
||||
service.POST("/event", pushEventToQueue)
|
||||
service.POST("/make-event", makeEvent)
|
||||
service.POST("/judge-event", judgeEvent)
|
||||
}
|
||||
|
||||
func stat() gin.HandlerFunc {
|
||||
|
||||
@@ -269,10 +269,7 @@ func datadogSeries(c *gin.Context) {
|
||||
}
|
||||
|
||||
if succ > 0 {
|
||||
cn := config.ReaderClient.GetClusterName()
|
||||
if cn != "" {
|
||||
promstat.CounterSampleTotal.WithLabelValues(cn, "datadog").Add(float64(succ))
|
||||
}
|
||||
promstat.CounterSampleTotal.WithLabelValues(config.C.ClusterName, "datadog").Add(float64(succ))
|
||||
idents.Idents.MSet(ids)
|
||||
}
|
||||
|
||||
|
||||
@@ -2,11 +2,8 @@ package router
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/server/common/conv"
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
"github.com/didi/nightingale/v5/src/server/engine"
|
||||
promstat "github.com/didi/nightingale/v5/src/server/stat"
|
||||
@@ -14,61 +11,17 @@ import (
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/toolkits/pkg/ginx"
|
||||
"github.com/toolkits/pkg/logger"
|
||||
"github.com/toolkits/pkg/str"
|
||||
)
|
||||
|
||||
func pushEventToQueue(c *gin.Context) {
|
||||
var event *models.AlertCurEvent
|
||||
var event models.AlertCurEvent
|
||||
ginx.BindJSON(c, &event)
|
||||
if event.RuleId == 0 {
|
||||
ginx.Bomb(200, "event is illegal")
|
||||
}
|
||||
|
||||
event.TagsMap = make(map[string]string)
|
||||
for i := 0; i < len(event.TagsJSON); i++ {
|
||||
pair := strings.TrimSpace(event.TagsJSON[i])
|
||||
if pair == "" {
|
||||
continue
|
||||
}
|
||||
|
||||
arr := strings.Split(pair, "=")
|
||||
if len(arr) != 2 {
|
||||
continue
|
||||
}
|
||||
|
||||
event.TagsMap[arr[0]] = arr[1]
|
||||
}
|
||||
|
||||
// isMuted only need TriggerTime RuleName and TagsMap
|
||||
if engine.IsMuted(event) {
|
||||
logger.Infof("event_muted: rule_id=%d %s", event.RuleId, event.Hash)
|
||||
ginx.NewRender(c).Message(nil)
|
||||
return
|
||||
}
|
||||
|
||||
if err := event.ParseRuleNote(); err != nil {
|
||||
event.RuleNote = fmt.Sprintf("failed to parse rule note: %v", err)
|
||||
}
|
||||
|
||||
// if rule_note has a ";" prefix, use rule_note to replace the contents of tags
|
||||
if strings.HasPrefix(event.RuleNote, ";") {
|
||||
event.RuleNote = strings.TrimPrefix(event.RuleNote, ";")
|
||||
event.Tags = strings.ReplaceAll(event.RuleNote, " ", ",,")
|
||||
event.TagsJSON = strings.Split(event.Tags, ",,")
|
||||
} else {
|
||||
event.Tags = strings.Join(event.TagsJSON, ",,")
|
||||
}
|
||||
|
||||
event.Callbacks = strings.Join(event.CallbacksJSON, " ")
|
||||
event.NotifyChannels = strings.Join(event.NotifyChannelsJSON, " ")
|
||||
event.NotifyGroups = strings.Join(event.NotifyGroupsJSON, " ")
|
||||
|
||||
cn := config.ReaderClient.GetClusterName()
|
||||
if cn != "" {
|
||||
promstat.CounterAlertsTotal.WithLabelValues(cn).Inc()
|
||||
}
|
||||
|
||||
engine.LogEvent(event, "http_push_queue")
|
||||
promstat.CounterAlertsTotal.WithLabelValues(config.C.ClusterName).Inc()
|
||||
engine.LogEvent(&event, "http_push_queue")
|
||||
if !engine.EventQueue.PushFront(event) {
|
||||
msg := fmt.Sprintf("event:%+v push_queue err: queue is full", event)
|
||||
ginx.Bomb(200, msg)
|
||||
@@ -76,45 +29,3 @@ func pushEventToQueue(c *gin.Context) {
|
||||
}
|
||||
ginx.NewRender(c).Message(nil)
|
||||
}
|
||||
|
||||
type eventForm struct {
|
||||
Alert bool `json:"alert"`
|
||||
Vectors []conv.Vector `json:"vectors"`
|
||||
RuleId int64 `json:"rule_id"`
|
||||
Cluster string `json:"cluster"`
|
||||
}
|
||||
|
||||
func judgeEvent(c *gin.Context) {
|
||||
var form eventForm
|
||||
ginx.BindJSON(c, &form)
|
||||
re, exists := engine.RuleEvalForExternal.Get(form.RuleId)
|
||||
if !exists {
|
||||
ginx.Bomb(200, "rule not exists")
|
||||
}
|
||||
re.Judge(form.Cluster, form.Vectors)
|
||||
ginx.NewRender(c).Message(nil)
|
||||
}
|
||||
|
||||
func makeEvent(c *gin.Context) {
|
||||
var events []*eventForm
|
||||
ginx.BindJSON(c, &events)
|
||||
now := time.Now().Unix()
|
||||
for i := 0; i < len(events); i++ {
|
||||
re, exists := engine.RuleEvalForExternal.Get(events[i].RuleId)
|
||||
logger.Debugf("handle event:%+v exists:%v", events[i], exists)
|
||||
if !exists {
|
||||
ginx.Bomb(200, "rule not exists")
|
||||
}
|
||||
|
||||
if events[i].Alert {
|
||||
go re.MakeNewEvent("http", now, events[i].Cluster, events[i].Vectors)
|
||||
} else {
|
||||
for _, vector := range events[i].Vectors {
|
||||
hash := str.MD5(fmt.Sprintf("%d_%s", events[i].RuleId, vector.Key))
|
||||
now := vector.Timestamp
|
||||
go re.RecoverEvent(hash, now, vector.Value)
|
||||
}
|
||||
}
|
||||
}
|
||||
ginx.NewRender(c).Message(nil)
|
||||
}
|
||||
|
||||
@@ -214,11 +214,7 @@ func falconPush(c *gin.Context) {
|
||||
}
|
||||
|
||||
if succ > 0 {
|
||||
cn := config.ReaderClient.GetClusterName()
|
||||
if cn != "" {
|
||||
promstat.CounterSampleTotal.WithLabelValues(cn, "openfalcon").Add(float64(succ))
|
||||
}
|
||||
|
||||
promstat.CounterSampleTotal.WithLabelValues(config.C.ClusterName, "openfalcon").Add(float64(succ))
|
||||
idents.Idents.MSet(ids)
|
||||
}
|
||||
|
||||
|
||||
@@ -12,7 +12,6 @@ import (
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/prometheus/common/model"
|
||||
"github.com/prometheus/prometheus/prompb"
|
||||
"github.com/toolkits/pkg/logger"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/server/common"
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
@@ -157,7 +156,6 @@ func handleOpenTSDB(c *gin.Context) {
|
||||
}
|
||||
|
||||
if err != nil {
|
||||
logger.Debugf("opentsdb msg format error: %s", err.Error())
|
||||
c.String(400, err.Error())
|
||||
return
|
||||
}
|
||||
@@ -172,20 +170,12 @@ func handleOpenTSDB(c *gin.Context) {
|
||||
|
||||
for i := 0; i < len(arr); i++ {
|
||||
if err := arr[i].Clean(ts); err != nil {
|
||||
logger.Debugf("opentsdb msg clean error: %s", err.Error())
|
||||
if fail == 0 {
|
||||
msg = fmt.Sprintf("%s , Error clean: %s", msg, err.Error())
|
||||
}
|
||||
fail++
|
||||
continue
|
||||
}
|
||||
|
||||
pt, err := arr[i].ToProm()
|
||||
if err != nil {
|
||||
logger.Debugf("opentsdb msg to tsdb error: %s", err.Error())
|
||||
if fail == 0 {
|
||||
msg = fmt.Sprintf("%s , Error toprom: %s", msg, err.Error())
|
||||
}
|
||||
fail++
|
||||
continue
|
||||
}
|
||||
@@ -208,17 +198,10 @@ func handleOpenTSDB(c *gin.Context) {
|
||||
}
|
||||
|
||||
if succ > 0 {
|
||||
cn := config.ReaderClient.GetClusterName()
|
||||
if cn != "" {
|
||||
promstat.CounterSampleTotal.WithLabelValues(cn, "opentsdb").Add(float64(succ))
|
||||
}
|
||||
promstat.CounterSampleTotal.WithLabelValues(config.C.ClusterName, "opentsdb").Add(float64(succ))
|
||||
idents.Idents.MSet(ids)
|
||||
}
|
||||
|
||||
if fail > 0 {
|
||||
logger.Debugf("opentsdb msg process error , msg is : %s", string(bs))
|
||||
}
|
||||
|
||||
c.JSON(200, gin.H{
|
||||
"succ": succ,
|
||||
"fail": fail,
|
||||
|
||||
@@ -17,6 +17,7 @@ import (
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
"github.com/didi/nightingale/v5/src/server/idents"
|
||||
"github.com/didi/nightingale/v5/src/server/memsto"
|
||||
"github.com/didi/nightingale/v5/src/server/reader"
|
||||
promstat "github.com/didi/nightingale/v5/src/server/stat"
|
||||
"github.com/didi/nightingale/v5/src/server/writer"
|
||||
)
|
||||
@@ -37,12 +38,7 @@ func queryPromql(c *gin.Context) {
|
||||
var f promqlForm
|
||||
ginx.BindJSON(c, &f)
|
||||
|
||||
if config.ReaderClient.IsNil() {
|
||||
c.String(500, "reader client is nil")
|
||||
return
|
||||
}
|
||||
|
||||
value, warnings, err := config.ReaderClient.GetCli().Query(c.Request.Context(), f.PromQL, time.Now())
|
||||
value, warnings, err := reader.Reader.Client.Query(c.Request.Context(), f.PromQL, time.Now())
|
||||
if err != nil {
|
||||
c.String(500, "promql:%s error:%v", f.PromQL, err)
|
||||
return
|
||||
@@ -146,11 +142,7 @@ func remoteWrite(c *gin.Context) {
|
||||
writer.Writers.PushSample(metric, req.Timeseries[i])
|
||||
}
|
||||
|
||||
cn := config.ReaderClient.GetClusterName()
|
||||
if cn != "" {
|
||||
promstat.CounterSampleTotal.WithLabelValues(cn, "prometheus").Add(float64(count))
|
||||
}
|
||||
|
||||
promstat.CounterSampleTotal.WithLabelValues(config.C.ClusterName, "prometheus").Add(float64(count))
|
||||
idents.Idents.MSet(ids)
|
||||
}
|
||||
|
||||
|
||||
@@ -9,7 +9,6 @@ import (
|
||||
"syscall"
|
||||
|
||||
"github.com/toolkits/pkg/i18n"
|
||||
"github.com/toolkits/pkg/logger"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/pkg/httpx"
|
||||
"github.com/didi/nightingale/v5/src/pkg/logx"
|
||||
@@ -18,6 +17,7 @@ import (
|
||||
"github.com/didi/nightingale/v5/src/server/idents"
|
||||
"github.com/didi/nightingale/v5/src/server/memsto"
|
||||
"github.com/didi/nightingale/v5/src/server/naming"
|
||||
"github.com/didi/nightingale/v5/src/server/reader"
|
||||
"github.com/didi/nightingale/v5/src/server/router"
|
||||
"github.com/didi/nightingale/v5/src/server/stat"
|
||||
"github.com/didi/nightingale/v5/src/server/usage"
|
||||
@@ -75,7 +75,6 @@ EXIT:
|
||||
break EXIT
|
||||
case syscall.SIGHUP:
|
||||
// reload configuration?
|
||||
reload()
|
||||
default:
|
||||
break EXIT
|
||||
}
|
||||
@@ -124,7 +123,7 @@ func (s Server) initialize() (func(), error) {
|
||||
}
|
||||
|
||||
// init prometheus remote reader
|
||||
if err = config.InitReader(); err != nil {
|
||||
if err = reader.Init(config.C.Reader); err != nil {
|
||||
return fns.Ret(), err
|
||||
}
|
||||
|
||||
@@ -144,7 +143,7 @@ func (s Server) initialize() (func(), error) {
|
||||
stat.Init()
|
||||
|
||||
// init http server
|
||||
r := router.New(s.Version, reload)
|
||||
r := router.New(s.Version)
|
||||
httpClean := httpx.Init(config.C.HTTP, r)
|
||||
fns.Add(httpClean)
|
||||
|
||||
@@ -174,9 +173,3 @@ func (fs *Functions) Ret() func() {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func reload() {
|
||||
logger.Info("start reload configs")
|
||||
engine.Reload()
|
||||
logger.Info("reload configs finished")
|
||||
}
|
||||
|
||||
@@ -2,6 +2,7 @@ package usage
|
||||
|
||||
import (
|
||||
"bytes"
|
||||
"context"
|
||||
"encoding/json"
|
||||
"fmt"
|
||||
"io/ioutil"
|
||||
@@ -11,6 +12,8 @@ import (
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/pkg/version"
|
||||
"github.com/didi/nightingale/v5/src/server/common/conv"
|
||||
"github.com/didi/nightingale/v5/src/server/reader"
|
||||
)
|
||||
|
||||
const (
|
||||
@@ -26,6 +29,24 @@ type Usage struct {
|
||||
Version string `json:"version"`
|
||||
}
|
||||
|
||||
func getSamples() (float64, error) {
|
||||
value, warns, err := reader.Reader.Client.Query(context.Background(), request, time.Now())
|
||||
if err != nil {
|
||||
return 0, err
|
||||
}
|
||||
|
||||
if len(warns) > 0 {
|
||||
return 0, fmt.Errorf("occur some warnings: %v", warns)
|
||||
}
|
||||
|
||||
lst := conv.ConvertVectors(value)
|
||||
if len(lst) == 0 {
|
||||
return 0, fmt.Errorf("convert result is empty")
|
||||
}
|
||||
|
||||
return lst[0].Value, nil
|
||||
}
|
||||
|
||||
func Report() {
|
||||
for {
|
||||
time.Sleep(time.Minute * 10)
|
||||
@@ -34,7 +55,7 @@ func Report() {
|
||||
}
|
||||
|
||||
func report() {
|
||||
tnum, err := models.TargetTotalCount()
|
||||
sps, err := getSamples()
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
@@ -44,7 +65,7 @@ func report() {
|
||||
return
|
||||
}
|
||||
|
||||
unum, err := models.UserTotal("")
|
||||
num, err := models.UserTotal("")
|
||||
if err != nil {
|
||||
return
|
||||
}
|
||||
@@ -52,8 +73,8 @@ func report() {
|
||||
maintainer := "blank"
|
||||
|
||||
u := Usage{
|
||||
Samples: float64(tnum),
|
||||
Users: float64(unum),
|
||||
Samples: sps,
|
||||
Users: float64(num),
|
||||
Hostname: hostname,
|
||||
Maintainer: maintainer,
|
||||
Version: version.VERSION,
|
||||
|
||||
@@ -9,7 +9,6 @@ import (
|
||||
"net/http"
|
||||
"time"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/server/config"
|
||||
"github.com/golang/protobuf/proto"
|
||||
"github.com/golang/snappy"
|
||||
@@ -25,46 +24,16 @@ type WriterType struct {
|
||||
Client api.Client
|
||||
}
|
||||
|
||||
func (w WriterType) writeRelabel(items []*prompb.TimeSeries) []*prompb.TimeSeries {
|
||||
ritems := make([]*prompb.TimeSeries, 0, len(items))
|
||||
for _, item := range items {
|
||||
lbls := models.Process(item.Labels, w.Opts.WriteRelabels...)
|
||||
if len(lbls) == 0 {
|
||||
continue
|
||||
}
|
||||
ritems = append(ritems, item)
|
||||
}
|
||||
return ritems
|
||||
}
|
||||
|
||||
func (w WriterType) Write(index int, items []*prompb.TimeSeries, headers ...map[string]string) {
|
||||
if len(items) == 0 {
|
||||
return
|
||||
}
|
||||
|
||||
items = w.writeRelabel(items)
|
||||
if len(items) == 0 {
|
||||
return
|
||||
}
|
||||
|
||||
start := time.Now()
|
||||
defer func() {
|
||||
cn := config.ReaderClient.GetClusterName()
|
||||
if cn != "" {
|
||||
promstat.ForwardDuration.WithLabelValues(cn, fmt.Sprint(index)).Observe(time.Since(start).Seconds())
|
||||
}
|
||||
promstat.ForwardDuration.WithLabelValues(config.C.ClusterName, fmt.Sprint(index)).Observe(time.Since(start).Seconds())
|
||||
}()
|
||||
|
||||
if config.C.ForceUseServerTS {
|
||||
ts := start.UnixMilli()
|
||||
for i := 0; i < len(items); i++ {
|
||||
if len(items[i].Samples) == 0 {
|
||||
continue
|
||||
}
|
||||
items[i].Samples[0].Timestamp = ts
|
||||
}
|
||||
}
|
||||
|
||||
req := &prompb.WriteRequest{
|
||||
Timeseries: items,
|
||||
}
|
||||
@@ -253,16 +222,11 @@ func Init(opts []config.WriterOptions, globalOpt config.WriterGlobalOpt) error {
|
||||
}
|
||||
|
||||
func reportChanSize() {
|
||||
clusterName := config.ReaderClient.GetClusterName()
|
||||
if clusterName == "" {
|
||||
return
|
||||
}
|
||||
|
||||
for {
|
||||
time.Sleep(time.Second * 3)
|
||||
for i, c := range Writers.chans {
|
||||
size := len(c)
|
||||
promstat.GaugeSampleQueueSize.WithLabelValues(clusterName, fmt.Sprint(i)).Set(float64(size))
|
||||
promstat.GaugeSampleQueueSize.WithLabelValues(config.C.ClusterName, fmt.Sprint(i)).Set(float64(size))
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
@@ -14,7 +14,6 @@ import (
|
||||
"github.com/didi/nightingale/v5/src/pkg/logx"
|
||||
"github.com/didi/nightingale/v5/src/pkg/oidcc"
|
||||
"github.com/didi/nightingale/v5/src/pkg/ormx"
|
||||
"github.com/didi/nightingale/v5/src/pkg/tls"
|
||||
"github.com/didi/nightingale/v5/src/storage"
|
||||
)
|
||||
|
||||
@@ -78,7 +77,6 @@ func MustLoad(fpaths ...string) {
|
||||
type Config struct {
|
||||
RunMode string
|
||||
I18N string
|
||||
I18NHeaderKey string
|
||||
AdminRole string
|
||||
MetricsYamlFile string
|
||||
BuiltinAlertsDir string
|
||||
@@ -99,7 +97,6 @@ type Config struct {
|
||||
Clusters []ClusterOptions
|
||||
Ibex Ibex
|
||||
OIDC oidcc.Config
|
||||
TargetMetrics map[string]string
|
||||
}
|
||||
|
||||
type ClusterOptions struct {
|
||||
@@ -113,9 +110,7 @@ type ClusterOptions struct {
|
||||
|
||||
Timeout int64
|
||||
DialTimeout int64
|
||||
|
||||
UseTLS bool
|
||||
tls.ClientConfig
|
||||
KeepAlive int64
|
||||
|
||||
MaxIdleConnsPerHost int
|
||||
}
|
||||
|
||||
@@ -3,43 +3,28 @@ package config
|
||||
import (
|
||||
"path"
|
||||
|
||||
cmap "github.com/orcaman/concurrent-map"
|
||||
"github.com/toolkits/pkg/file"
|
||||
"github.com/toolkits/pkg/runner"
|
||||
)
|
||||
|
||||
// metricDesc: the map is loaded before it is ever read, so there is no need to use a concurrent map for the metric description store
|
||||
type metricDesc struct {
|
||||
CommonDesc map[string]string `yaml:",inline" json:"common"`
|
||||
Zh map[string]string `yaml:"zh" json:"zh"`
|
||||
En map[string]string `yaml:"en" json:"en"`
|
||||
}
|
||||
|
||||
var MetricDesc metricDesc
|
||||
|
||||
// GetMetricDesc returns an empty string if the metric is not registered
|
||||
func GetMetricDesc(lang, metric string) string {
|
||||
var m map[string]string
|
||||
if lang == "zh" {
|
||||
m = MetricDesc.Zh
|
||||
} else {
|
||||
m = MetricDesc.En
|
||||
}
|
||||
if m != nil {
|
||||
if desc, has := m[metric]; has {
|
||||
return desc
|
||||
}
|
||||
}
|
||||
|
||||
return MetricDesc.CommonDesc[metric]
|
||||
}
|
||||
var Metrics = cmap.New()
|
||||
|
||||
func loadMetricsYaml() error {
|
||||
fp := C.MetricsYamlFile
|
||||
if fp == "" {
|
||||
fp = path.Join(runner.Cwd, "etc", "metrics.yaml")
|
||||
}
|
||||
fp := path.Join(runner.Cwd, "etc", "metrics.yaml")
|
||||
if !file.IsExist(fp) {
|
||||
return nil
|
||||
}
|
||||
return file.ReadYaml(fp, &MetricDesc)
|
||||
|
||||
nmap := make(map[string]string)
|
||||
err := file.ReadYaml(fp, &nmap)
|
||||
if err != nil {
|
||||
return err
|
||||
}
|
||||
|
||||
for key, val := range nmap {
|
||||
Metrics.Set(key, val)
|
||||
}
|
||||
|
||||
return nil
|
||||
}
|
||||
|
||||
@@ -11,18 +11,14 @@ import (
|
||||
"sync"
|
||||
"time"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/pkg/prom"
|
||||
"github.com/didi/nightingale/v5/src/webapi/config"
|
||||
"github.com/prometheus/client_golang/api"
|
||||
"github.com/toolkits/pkg/logger"
|
||||
"github.com/toolkits/pkg/net/httplib"
|
||||
)
|
||||
|
||||
type ClusterType struct {
|
||||
Opts config.ClusterOptions
|
||||
Transport *http.Transport
|
||||
PromClient prom.API
|
||||
Opts config.ClusterOptions
|
||||
Transport *http.Transport
|
||||
}
|
||||
|
||||
type ClustersType struct {
|
||||
@@ -30,44 +26,10 @@ type ClustersType struct {
|
||||
mutex *sync.RWMutex
|
||||
}
|
||||
|
||||
type PromOption struct {
|
||||
Url string
|
||||
User string
|
||||
Pass string
|
||||
Headers []string
|
||||
Timeout int64
|
||||
DialTimeout int64
|
||||
MaxIdleConnsPerHost int
|
||||
}
|
||||
|
||||
func (cs *ClustersType) Put(name string, cluster *ClusterType) {
|
||||
cs.mutex.Lock()
|
||||
defer cs.mutex.Unlock()
|
||||
|
||||
cs.datas[name] = cluster
|
||||
|
||||
// Write a copy of the configuration to the DB so that n9e-server can read it directly from the DB
|
||||
po := PromOption{
|
||||
Url: cluster.Opts.Prom,
|
||||
User: cluster.Opts.BasicAuthUser,
|
||||
Pass: cluster.Opts.BasicAuthPass,
|
||||
Headers: cluster.Opts.Headers,
|
||||
Timeout: cluster.Opts.Timeout,
|
||||
DialTimeout: cluster.Opts.DialTimeout,
|
||||
MaxIdleConnsPerHost: cluster.Opts.MaxIdleConnsPerHost,
|
||||
}
|
||||
|
||||
bs, err := json.Marshal(po)
|
||||
if err != nil {
|
||||
logger.Fatal("failed to marshal PromOption:", err)
|
||||
return
|
||||
}
|
||||
|
||||
key := "prom." + name + ".option"
|
||||
err = models.ConfigsSet(key, string(bs))
|
||||
if err != nil {
|
||||
logger.Fatal("failed to set PromOption ", key, " to database, error: ", err)
|
||||
}
|
||||
cs.mutex.Unlock()
|
||||
}
|
||||
|
||||
func (cs *ClustersType) Get(name string) (*ClusterType, bool) {
|
||||
@@ -99,9 +61,17 @@ func initClustersFromConfig() error {
|
||||
opts := config.C.Clusters
|
||||
|
||||
for i := 0; i < len(opts); i++ {
|
||||
cluster := newClusterByOption(opts[i])
|
||||
if cluster == nil {
|
||||
continue
|
||||
cluster := &ClusterType{
|
||||
Opts: opts[i],
|
||||
Transport: &http.Transport{
|
||||
// TLSClientConfig: tlsConfig,
|
||||
Proxy: http.ProxyFromEnvironment,
|
||||
DialContext: (&net.Dialer{
|
||||
Timeout: time.Duration(opts[i].DialTimeout) * time.Millisecond,
|
||||
}).DialContext,
|
||||
ResponseHeaderTimeout: time.Duration(opts[i].Timeout) * time.Millisecond,
|
||||
MaxIdleConnsPerHost: opts[i].MaxIdleConnsPerHost,
|
||||
},
|
||||
}
|
||||
Clusters.Put(opts[i].Name, cluster)
|
||||
}
|
||||
@@ -203,14 +173,17 @@ func loadClustersFromAPI() {
|
||||
MaxIdleConnsPerHost: 32,
|
||||
}
|
||||
|
||||
if strings.HasPrefix(opt.Prom, "https") {
|
||||
opt.UseTLS = true
|
||||
opt.InsecureSkipVerify = true
|
||||
}
|
||||
|
||||
cluster := newClusterByOption(opt)
|
||||
if cluster == nil {
|
||||
continue
|
||||
cluster := &ClusterType{
|
||||
Opts: opt,
|
||||
Transport: &http.Transport{
|
||||
// TLSClientConfig: tlsConfig,
|
||||
Proxy: http.ProxyFromEnvironment,
|
||||
DialContext: (&net.Dialer{
|
||||
Timeout: time.Duration(opt.DialTimeout) * time.Millisecond,
|
||||
}).DialContext,
|
||||
ResponseHeaderTimeout: time.Duration(opt.Timeout) * time.Millisecond,
|
||||
MaxIdleConnsPerHost: opt.MaxIdleConnsPerHost,
|
||||
},
|
||||
}
|
||||
|
||||
Clusters.Put(item.Name, cluster)
|
||||
@@ -218,45 +191,3 @@ func loadClustersFromAPI() {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
func newClusterByOption(opt config.ClusterOptions) *ClusterType {
|
||||
transport := &http.Transport{
|
||||
Proxy: http.ProxyFromEnvironment,
|
||||
DialContext: (&net.Dialer{
|
||||
Timeout: time.Duration(opt.DialTimeout) * time.Millisecond,
|
||||
}).DialContext,
|
||||
ResponseHeaderTimeout: time.Duration(opt.Timeout) * time.Millisecond,
|
||||
MaxIdleConnsPerHost: opt.MaxIdleConnsPerHost,
|
||||
}
|
||||
|
||||
if opt.UseTLS {
|
||||
tlsConfig, err := opt.TLSConfig()
|
||||
if err != nil {
|
||||
logger.Errorf("new cluster %s fail: %v", opt.Name, err)
|
||||
return nil
|
||||
}
|
||||
transport.TLSClientConfig = tlsConfig
|
||||
}
|
||||
|
||||
cli, err := api.NewClient(api.Config{
|
||||
Address: opt.Prom,
|
||||
RoundTripper: transport,
|
||||
})
|
||||
|
||||
if err != nil {
|
||||
logger.Errorf("new client fail: %v", err)
|
||||
return nil
|
||||
}
|
||||
|
||||
cluster := &ClusterType{
|
||||
Opts: opt,
|
||||
Transport: transport,
|
||||
PromClient: prom.NewAPI(cli, prom.ClientOptions{
|
||||
BasicAuthUser: opt.BasicAuthUser,
|
||||
BasicAuthPass: opt.BasicAuthPass,
|
||||
Headers: opt.Headers,
|
||||
}),
|
||||
}
|
||||
|
||||
return cluster
|
||||
}
|
||||
|
||||
@@ -31,25 +31,6 @@ func stat() gin.HandlerFunc {
|
||||
}
|
||||
}
|
||||
|
||||
func languageDetector() gin.HandlerFunc {
|
||||
headerKey := config.C.I18NHeaderKey
|
||||
return func(c *gin.Context) {
|
||||
if headerKey != "" {
|
||||
lang := c.GetHeader(headerKey)
|
||||
if lang != "" {
|
||||
if strings.HasPrefix(lang, "*") || strings.HasPrefix(lang, "zh") {
|
||||
c.Request.Header.Set("X-Language", "zh")
|
||||
} else if strings.HasPrefix(lang, "en") {
|
||||
c.Request.Header.Set("X-Language", "en")
|
||||
} else {
|
||||
c.Request.Header.Set("X-Language", lang)
|
||||
}
|
||||
}
|
||||
}
|
||||
c.Next()
|
||||
}
|
||||
}
|
||||
|
||||
func New(version string) *gin.Engine {
|
||||
gin.SetMode(config.C.RunMode)
|
||||
|
||||
@@ -60,7 +41,6 @@ func New(version string) *gin.Engine {
|
||||
r := gin.New()
|
||||
|
||||
r.Use(stat())
|
||||
r.Use(languageDetector())
|
||||
r.Use(aop.Recovery())
|
||||
|
||||
// whether print access log
|
||||
@@ -120,10 +100,8 @@ func configRoute(r *gin.Engine, version string) {
|
||||
{
|
||||
if config.C.AnonymousAccess.PromQuerier {
|
||||
pages.Any("/prometheus/*url", prometheusProxy)
|
||||
pages.POST("/query-range-batch", promBatchQueryRange)
|
||||
} else {
|
||||
pages.Any("/prometheus/*url", auth(), prometheusProxy)
|
||||
pages.POST("/query-range-batch", auth(), promBatchQueryRange)
|
||||
}
|
||||
|
||||
pages.GET("/version", func(c *gin.Context) {
|
||||
@@ -196,7 +174,6 @@ func configRoute(r *gin.Engine, version string) {
|
||||
pages.POST("/busi-group/:id/board/:bid/clone", auth(), user(), perm("/dashboards/add"), bgrw(), boardClone)
|
||||
|
||||
pages.GET("/board/:bid", auth(), user(), boardGet)
|
||||
pages.GET("/board/:bid/pure", boardPureGet)
|
||||
pages.PUT("/board/:bid", auth(), user(), perm("/dashboards/put"), boardPut)
|
||||
pages.PUT("/board/:bid/configs", auth(), user(), perm("/dashboards/put"), boardPutConfigs)
|
||||
pages.DELETE("/boards", auth(), user(), perm("/dashboards/del"), boardDel)
|
||||
@@ -290,9 +267,6 @@ func configRoute(r *gin.Engine, version string) {
|
||||
pages.POST("/busi-group/:id/tasks", auth(), user(), perm("/job-tasks/add"), bgrw(), taskAdd)
|
||||
pages.GET("/busi-group/:id/task/*url", auth(), user(), perm("/job-tasks"), taskProxy)
|
||||
pages.PUT("/busi-group/:id/task/*url", auth(), user(), perm("/job-tasks/put"), bgrw(), taskProxy)
|
||||
|
||||
pages.GET("/servers", auth(), admin(), serversGet)
|
||||
pages.PUT("/server/:id", auth(), admin(), serverBindCluster)
|
||||
}
|
||||
|
||||
service := r.Group("/v1/n9e")
|
||||
@@ -321,6 +295,5 @@ func configRoute(r *gin.Engine, version string) {
|
||||
|
||||
service.GET("/alert-cur-events", alertCurEventsList)
|
||||
service.GET("/alert-his-events", alertHisEventsList)
|
||||
service.GET("/alert-his-event/:eid", alertHisEventGet)
|
||||
}
|
||||
}
|
||||
|
||||
@@ -46,14 +46,9 @@ func alertCurEventsCard(c *gin.Context) {
|
||||
clusters := queryClusters(c)
|
||||
rules := parseAggrRules(c)
|
||||
prod := ginx.QueryStr(c, "prod", "")
|
||||
cate := ginx.QueryStr(c, "cate", "$all")
|
||||
cates := []string{}
|
||||
if cate != "$all" {
|
||||
cates = strings.Split(cate, ",")
|
||||
}
|
||||
|
||||
// Fetch at most 50000 events; fetching more is pointless
|
||||
list, err := models.AlertCurEventGets(prod, busiGroupId, stime, etime, severity, clusters, cates, query, 50000, 0)
|
||||
list, err := models.AlertCurEventGets(prod, busiGroupId, stime, etime, severity, clusters, query, 50000, 0)
|
||||
ginx.Dangerous(err)
|
||||
|
||||
cardmap := make(map[string]*AlertCard)
|
||||
@@ -128,16 +123,11 @@ func alertCurEventsList(c *gin.Context) {
|
||||
busiGroupId := ginx.QueryInt64(c, "bgid", 0)
|
||||
clusters := queryClusters(c)
|
||||
prod := ginx.QueryStr(c, "prod", "")
|
||||
cate := ginx.QueryStr(c, "cate", "$all")
|
||||
cates := []string{}
|
||||
if cate != "$all" {
|
||||
cates = strings.Split(cate, ",")
|
||||
}
|
||||
|
||||
total, err := models.AlertCurEventTotal(prod, busiGroupId, stime, etime, severity, clusters, cates, query)
|
||||
total, err := models.AlertCurEventTotal(prod, busiGroupId, stime, etime, severity, clusters, query)
|
||||
ginx.Dangerous(err)
|
||||
|
||||
list, err := models.AlertCurEventGets(prod, busiGroupId, stime, etime, severity, clusters, cates, query, limit, ginx.Offset(c, limit))
|
||||
list, err := models.AlertCurEventGets(prod, busiGroupId, stime, etime, severity, clusters, query, limit, ginx.Offset(c, limit))
|
||||
ginx.Dangerous(err)
|
||||
|
||||
cache := make(map[int64]*models.UserGroup)
|
||||
|
||||
@@ -1,7 +1,6 @@
|
||||
package router
|
||||
|
||||
import (
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/gin-gonic/gin"
|
||||
@@ -36,16 +35,11 @@ func alertHisEventsList(c *gin.Context) {
|
||||
busiGroupId := ginx.QueryInt64(c, "bgid", 0)
|
||||
clusters := queryClusters(c)
|
||||
prod := ginx.QueryStr(c, "prod", "")
|
||||
cate := ginx.QueryStr(c, "cate", "$all")
|
||||
cates := []string{}
|
||||
if cate != "$all" {
|
||||
cates = strings.Split(cate, ",")
|
||||
}
|
||||
|
||||
total, err := models.AlertHisEventTotal(prod, busiGroupId, stime, etime, severity, recovered, clusters, cates, query)
|
||||
total, err := models.AlertHisEventTotal(prod, busiGroupId, stime, etime, severity, recovered, clusters, query)
|
||||
ginx.Dangerous(err)
|
||||
|
||||
list, err := models.AlertHisEventGets(prod, busiGroupId, stime, etime, severity, recovered, clusters, cates, query, limit, ginx.Offset(c, limit))
|
||||
list, err := models.AlertHisEventGets(prod, busiGroupId, stime, etime, severity, recovered, clusters, query, limit, ginx.Offset(c, limit))
|
||||
ginx.Dangerous(err)
|
||||
|
||||
cache := make(map[int64]*models.UserGroup)
|
||||
|
||||
@@ -26,18 +26,10 @@ func alertRuleGets(c *gin.Context) {
|
||||
}
|
||||
|
||||
func alertRulesGetByService(c *gin.Context) {
|
||||
prods := strings.Split(ginx.QueryStr(c, "prods", ""), ",")
|
||||
prods := strings.Fields(ginx.QueryStr(c, "prods", ""))
|
||||
query := ginx.QueryStr(c, "query", "")
|
||||
algorithm := ginx.QueryStr(c, "algorithm", "")
|
||||
cluster := ginx.QueryStr(c, "cluster", "")
|
||||
cate := ginx.QueryStr(c, "cate", "$all")
|
||||
cates := []string{}
|
||||
if cate != "$all" {
|
||||
cates = strings.Split(cate, ",")
|
||||
}
|
||||
|
||||
disabled := ginx.QueryInt(c, "disabled", -1)
|
||||
ars, err := models.AlertRulesGetsBy(prods, query, algorithm, cluster, cates, disabled)
|
||||
ars, err := models.AlertRulesGetsBy(prods, query)
|
||||
if err == nil {
|
||||
cache := make(map[int64]*models.UserGroup)
|
||||
for i := 0; i < len(ars); i++ {
|
||||
|
||||
@@ -74,7 +74,6 @@ func alertSubscribePut(c *gin.Context) {
|
||||
fs[i].UpdateBy = username
|
||||
fs[i].UpdateAt = timestamp
|
||||
ginx.Dangerous(fs[i].Update(
|
||||
"cluster",
|
||||
"rule_id",
|
||||
"tags",
|
||||
"redefine_severity",
|
||||
|
||||
@@ -51,17 +51,6 @@ func boardGet(c *gin.Context) {
|
||||
ginx.NewRender(c).Data(board, nil)
|
||||
}
|
||||
|
||||
func boardPureGet(c *gin.Context) {
|
||||
board, err := models.BoardGetByID(ginx.UrlParamInt64(c, "bid"))
|
||||
ginx.Dangerous(err)
|
||||
|
||||
if board == nil {
|
||||
ginx.Bomb(http.StatusNotFound, "No such dashboard")
|
||||
}
|
||||
|
||||
ginx.NewRender(c).Data(board, nil)
|
||||
}
|
||||
|
||||
// bgrwCheck
|
||||
func boardDel(c *gin.Context) {
|
||||
var f idsForm
|
||||
|
||||
@@ -69,12 +69,6 @@ func busiGroupMemberAdd(c *gin.Context) {
|
||||
username := c.MustGet("username").(string)
|
||||
targetbg := c.MustGet("busi_group").(*models.BusiGroup)
|
||||
|
||||
for i := 0; i < len(members); i++ {
|
||||
if members[i].BusiGroupId != targetbg.Id {
|
||||
ginx.Bomb(http.StatusBadRequest, "business group id invalid")
|
||||
}
|
||||
}
|
||||
|
||||
ginx.NewRender(c).Message(targetbg.AddMembers(members, username))
|
||||
}
|
||||
|
||||
@@ -85,12 +79,6 @@ func busiGroupMemberDel(c *gin.Context) {
|
||||
username := c.MustGet("username").(string)
|
||||
targetbg := c.MustGet("busi_group").(*models.BusiGroup)
|
||||
|
||||
for i := 0; i < len(members); i++ {
|
||||
if members[i].BusiGroupId != targetbg.Id {
|
||||
ginx.Bomb(http.StatusBadRequest, "business group id invalid")
|
||||
}
|
||||
}
|
||||
|
||||
ginx.NewRender(c).Message(targetbg.DelMembers(members, username))
|
||||
}
|
||||
|
||||
|
||||
@@ -3,7 +3,6 @@ package router
|
||||
import (
|
||||
"fmt"
|
||||
"net/http"
|
||||
"strconv"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
@@ -32,7 +31,6 @@ func loginPost(c *gin.Context) {
|
||||
if config.C.LDAP.Enable {
|
||||
user, err = models.LdapLogin(f.Username, f.Password)
|
||||
if err != nil {
|
||||
logger.Debugf("ldap login failed: %v username: %s", err, f.Username)
|
||||
ginx.NewRender(c).Message(err)
|
||||
return
|
||||
}
|
||||
@@ -117,24 +115,6 @@ func refreshPost(c *gin.Context) {
|
||||
return
|
||||
}
|
||||
|
||||
userid, err := strconv.ParseInt(strings.Split(userIdentity, "-")[0], 10, 64)
|
||||
if err != nil {
|
||||
ginx.NewRender(c, http.StatusUnauthorized).Message("failed to parse user_identity from jwt")
|
||||
return
|
||||
}
|
||||
|
||||
u, err := models.UserGetById(userid)
|
||||
if err != nil {
|
||||
ginx.NewRender(c, http.StatusInternalServerError).Message("failed to query user by id")
|
||||
return
|
||||
}
|
||||
|
||||
if u == nil {
|
||||
// user already deleted
|
||||
ginx.NewRender(c, http.StatusUnauthorized).Message("user already deleted")
|
||||
return
|
||||
}
|
||||
|
||||
// Delete the previous Refresh Token
|
||||
err = deleteAuth(c.Request.Context(), refreshUuid)
|
||||
if err != nil {
|
||||
|
||||
@@ -1,14 +1,35 @@
|
||||
package router
|
||||
|
||||
import (
|
||||
"path"
|
||||
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/toolkits/pkg/file"
|
||||
"github.com/toolkits/pkg/ginx"
|
||||
"github.com/toolkits/pkg/runner"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/webapi/config"
|
||||
)
|
||||
|
||||
func metricsDescGetFile(c *gin.Context) {
|
||||
c.JSON(200, config.MetricDesc)
|
||||
fp := config.C.MetricsYamlFile
|
||||
if fp == "" {
|
||||
fp = path.Join(runner.Cwd, "etc", "metrics.yaml")
|
||||
}
|
||||
|
||||
if !file.IsExist(fp) {
|
||||
c.String(404, "%s not found", fp)
|
||||
return
|
||||
}
|
||||
|
||||
ret := make(map[string]string)
|
||||
err := file.ReadYaml(fp, &ret)
|
||||
if err != nil {
|
||||
c.String(500, err.Error())
|
||||
return
|
||||
}
|
||||
|
||||
c.JSON(200, ret)
|
||||
}
|
||||
|
||||
// The frontend passes an array of metrics; the backend looks up the description of each one and returns a map
|
||||
@@ -17,8 +38,13 @@ func metricsDescGetMap(c *gin.Context) {
|
||||
ginx.BindJSON(c, &arr)
|
||||
|
||||
ret := make(map[string]string)
|
||||
for _, key := range arr {
|
||||
ret[key] = config.GetMetricDesc(c.GetHeader("X-Language"), key)
|
||||
for i := 0; i < len(arr); i++ {
|
||||
desc, has := config.Metrics.Get(arr[i])
|
||||
if !has {
|
||||
ret[arr[i]] = ""
|
||||
} else {
|
||||
ret[arr[i]] = desc.(string)
|
||||
}
|
||||
}
|
||||
|
||||
ginx.NewRender(c).Data(ret, nil)
|
||||
|
||||
@@ -59,7 +59,7 @@ func proxyAuth() gin.HandlerFunc {
|
||||
return func(c *gin.Context) {
|
||||
user := handleProxyUser(c)
|
||||
c.Set("userid", user.Id)
|
||||
c.Set("username", user.Username)
|
||||
c.Set("username", user)
|
||||
c.Next()
|
||||
}
|
||||
}
|
||||
@@ -119,6 +119,7 @@ func jwtMock() gin.HandlerFunc {
|
||||
"refresh_token": "",
|
||||
}, nil)
|
||||
c.Abort()
|
||||
return
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
@@ -1,66 +1,18 @@
|
||||
package router
|
||||
|
||||
import (
|
||||
"context"
|
||||
|
||||
"net/http"
|
||||
"net/http/httputil"
|
||||
"net/url"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/toolkits/pkg/ginx"
|
||||
|
||||
pkgprom "github.com/didi/nightingale/v5/src/pkg/prom"
|
||||
"github.com/didi/nightingale/v5/src/webapi/config"
|
||||
"github.com/didi/nightingale/v5/src/webapi/prom"
|
||||
"github.com/prometheus/common/model"
|
||||
)
|
||||
|
||||
type queryFormItem struct {
|
||||
Start int64 `json:"start" binding:"required"`
|
||||
End int64 `json:"end" binding:"required"`
|
||||
Step int64 `json:"step" binding:"required"`
|
||||
Query string `json:"query" binding:"required"`
|
||||
}
|
||||
|
||||
type batchQueryForm struct {
|
||||
Queries []queryFormItem `json:"queries" binding:"required"`
|
||||
}
|
||||
|
||||
func promBatchQueryRange(c *gin.Context) {
|
||||
xcluster := c.GetHeader("X-Cluster")
|
||||
if xcluster == "" {
|
||||
ginx.Bomb(http.StatusBadRequest, "header(X-Cluster) is blank")
|
||||
}
|
||||
|
||||
var f batchQueryForm
|
||||
ginx.Dangerous(c.BindJSON(&f))
|
||||
|
||||
cluster, exist := prom.Clusters.Get(xcluster)
|
||||
if !exist {
|
||||
ginx.Bomb(http.StatusBadRequest, "cluster(%s) not found", xcluster)
|
||||
}
|
||||
|
||||
var lst []model.Value
|
||||
|
||||
for _, item := range f.Queries {
|
||||
r := pkgprom.Range{
|
||||
Start: time.Unix(item.Start, 0),
|
||||
End: time.Unix(item.End, 0),
|
||||
Step: time.Duration(item.Step) * time.Second,
|
||||
}
|
||||
|
||||
resp, _, err := cluster.PromClient.QueryRange(context.Background(), item.Query, r)
|
||||
ginx.Dangerous(err)
|
||||
|
||||
lst = append(lst, resp)
|
||||
}
|
||||
|
||||
ginx.NewRender(c).Data(lst, nil)
|
||||
}
|
||||
|
||||
func prometheusProxy(c *gin.Context) {
|
||||
xcluster := c.GetHeader("X-Cluster")
|
||||
if xcluster == "" {
|
||||
|
||||
@@ -1,35 +0,0 @@
|
||||
package router
|
||||
|
||||
import (
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/toolkits/pkg/ginx"
|
||||
)
|
||||
|
||||
// Fetch the server list for the page
|
||||
func serversGet(c *gin.Context) {
|
||||
list, err := models.AlertingEngineGets("")
|
||||
ginx.NewRender(c).Data(list, err)
|
||||
}
|
||||
|
||||
type serverBindClusterForm struct {
|
||||
Cluster string `json:"cluster"`
|
||||
}
|
||||
|
||||
// Assign a cluster to an n9e-server; to clear the assignment, set cluster to an empty string
|
||||
// Clearing it means the server is no longer in use: it may be about to go offline, or it only serves as a forwarder
|
||||
func serverBindCluster(c *gin.Context) {
|
||||
id := ginx.UrlParamInt64(c, "id")
|
||||
|
||||
ae, err := models.AlertingEngineGet("id = ?", id)
|
||||
ginx.Dangerous(err)
|
||||
|
||||
if ae == nil {
|
||||
ginx.Dangerous("no such server")
|
||||
}
|
||||
|
||||
var f serverBindClusterForm
|
||||
ginx.BindJSON(c, &f)
|
||||
|
||||
ginx.NewRender(c).Message(ae.UpdateCluster(f.Cluster))
|
||||
}
|
||||
@@ -1,27 +1,21 @@
|
||||
package router
|
||||
|
||||
import (
|
||||
"context"
|
||||
"fmt"
|
||||
"net/http"
|
||||
"strings"
|
||||
"time"
|
||||
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/prometheus/common/model"
|
||||
"github.com/toolkits/pkg/ginx"
|
||||
|
||||
"github.com/didi/nightingale/v5/src/models"
|
||||
"github.com/didi/nightingale/v5/src/server/common/conv"
|
||||
"github.com/didi/nightingale/v5/src/webapi/config"
|
||||
"github.com/didi/nightingale/v5/src/webapi/prom"
|
||||
)
|
||||
|
||||
func targetGets(c *gin.Context) {
|
||||
bgid := ginx.QueryInt64(c, "bgid", -1)
|
||||
query := ginx.QueryStr(c, "query", "")
|
||||
limit := ginx.QueryInt(c, "limit", 30)
|
||||
mins := ginx.QueryInt(c, "mins", 2)
|
||||
clusters := queryClusters(c)
|
||||
|
||||
total, err := models.TargetTotal(bgid, clusters, query)
|
||||
@@ -32,60 +26,8 @@ func targetGets(c *gin.Context) {

	if err == nil {
		cache := make(map[int64]*models.BusiGroup)
		targetsMap := make(map[string]*models.Target)
		for i := 0; i < len(list); i++ {
			ginx.Dangerous(list[i].FillGroup(cache))
			targetsMap[list[i].Cluster+list[i].Ident] = list[i]
		}

		now := time.Now()

		// query LoadPerCore / MemUtil / TargetUp / DiskUsedPercent from prometheus
		// map key: cluster, map value: ident list
		targets := make(map[string][]string)
		for i := 0; i < len(list); i++ {
			targets[list[i].Cluster] = append(targets[list[i].Cluster], list[i].Ident)
		}

		for cluster := range targets {
			cc, has := prom.Clusters.Get(cluster)
			if !has {
				continue
			}

			targetArr := targets[cluster]
			if len(targetArr) == 0 {
				continue
			}

			targetRe := strings.Join(targetArr, "|")
			valuesMap := make(map[string]map[string]float64)

			for metric, ql := range config.C.TargetMetrics {
				promql := fmt.Sprintf(ql, targetRe, mins)
				values, err := instantQuery(context.Background(), cc, promql, now)
				ginx.Dangerous(err)
				valuesMap[metric] = values
			}

			// handle values
			for metric, values := range valuesMap {
				for ident := range values {
					mapkey := cluster + ident
					if t, has := targetsMap[mapkey]; has {
						switch metric {
						case "LoadPerCore":
							t.LoadPerCore = values[ident]
						case "MemUtil":
							t.MemUtil = values[ident]
						case "TargetUp":
							t.TargetUp = values[ident]
						case "DiskUtil":
							t.DiskUtil = values[ident]
						}
					}
				}
			}
		}
	}
@@ -95,29 +37,6 @@ func targetGets(c *gin.Context) {
	}, nil)
}

func instantQuery(ctx context.Context, c *prom.ClusterType, promql string, ts time.Time) (map[string]float64, error) {
	ret := make(map[string]float64)

	val, warnings, err := c.PromClient.Query(ctx, promql, ts)
	if err != nil {
		return ret, err
	}

	if len(warnings) > 0 {
		return ret, fmt.Errorf("instant query occur warnings, promql: %s, warnings: %v", promql, warnings)
	}

	vectors := conv.ConvertVectors(val)
	for i := range vectors {
		ident, has := vectors[i].Labels["ident"]
		if has {
			ret[string(ident)] = vectors[i].Value
		}
	}

	return ret, nil
}

func targetGetTags(c *gin.Context) {
	idents := ginx.QueryStr(c, "idents")
	idents = strings.ReplaceAll(idents, ",", " ")