Compare commits: docker_upd...v6.2.0 (122 commits)
| Author | SHA1 | Date |
|---|---|---|
|  | c5cd6c0337 |  |
|  | fe1d566326 |  |
|  | cedc918a09 |  |
|  | 1e6c0865dd |  |
|  | 7649986b55 |  |
|  | 86a82b409a |  |
|  | f6ad9bdf82 |  |
|  | a647526084 |  |
|  | 44ed90e181 |  |
|  | 3e7273701d |  |
|  | d77ed30940 |  |
|  | 5ae80e67a3 |  |
|  | 184389be33 |  |
|  | c1f022001f |  |
|  | 616d56d515 |  |
|  | 10a0b5099e |  |
|  | 0815605298 |  |
|  | 2df3216b32 |  |
|  | 74491c666d |  |
|  | 29a2eb6f2f |  |
|  | baf56746ce |  |
|  | 5867c5af8f |  |
|  | 4a358f5cff |  |
|  | 13f2b008fd |  |
|  | 84400cd657 |  |
|  | f2a3a6933e |  |
|  | 0a4d1cad4c |  |
|  | 08f472f9ee |  |
|  | 7f73945c8d |  |
|  | 56a7860b5a |  |
|  | 25dab86b8e |  |
|  | 35b90ca162 |  |
|  | 5babee6de3 |  |
|  | 7567d440a9 |  |
|  | 2ecd799dab |  |
|  | 5b3561f983 |  |
|  | cce3711c02 |  |
|  | 9cdbda0828 |  |
|  | 9c4775fd38 |  |
|  | 212e0aa4c3 |  |
|  | 05300ec0e9 |  |
|  | 67fb49e54e |  |
|  | 7164b696b1 |  |
|  | 8728167733 |  |
|  | 6e80a63b68 |  |
|  | 9e43a22ec3 |  |
|  | 49d8ed4a6f |  |
|  | c7b537e6c7 |  |
|  | f1cdd2fa46 |  |
|  | 3d5ad02274 |  |
|  | 1cb9f4becf |  |
|  | 0d0dafbe49 |  |
|  | 048d1df2d1 |  |
|  | 4fb4154e30 |  |
|  | 0be69bbccd |  |
|  | 7015a40256 |  |
|  | 03cca642e9 |  |
|  | 579fd3780b |  |
|  | a85d91c10e |  |
|  | af31c496a1 |  |
|  | f9efbaa954 |  |
|  | d541ec7f20 |  |
|  | 1d847e2c6f |  |
|  | 2fedf4f075 |  |
|  | e9a02c4c80 |  |
|  | 8beaccdded |  |
|  | af6003da6d |  |
|  | 76ac2cd013 |  |
|  | 859876e3f8 |  |
|  | 7d49e7fb34 |  |
|  | 6c42ae9077 |  |
|  | 15dcc60407 |  |
|  | 5b811b7003 |  |
|  | 55d670fe3c |  |
|  | ac3a5e52c7 |  |
|  | 2abe00e251 |  |
|  | 1bd3c29e39 |  |
|  | 1a8087bda7 |  |
|  | 72b4c2b1ec |  |
|  | 38e6820d7b |  |
|  | 765b3a57fe |  |
|  | 1c4a32f8fa |  |
|  | 3f258fcebf |  |
|  | 140f2cbfa8 |  |
|  | 6aacd77492 |  |
|  | ef3f46f8b7 |  |
|  | 0cdd25d2cf |  |
|  | 5d02ce0636 |  |
|  | 0cd1228ba7 |  |
|  | 0595401d14 |  |
|  | d724f8cc8e |  |
|  | a3f5d458d7 |  |
|  | 76bfb130b0 |  |
|  | 184bb78e3b |  |
|  | 6a41af2cb2 |  |
|  | faa149cc87 |  |
|  | 24592fe480 |  |
|  | 4be53082e0 |  |
|  | ae8c9c668c |  |
|  | b0c15af04f |  |
|  | c05b710aff |  |
|  | 4299c48aef |  |
|  | ae0523dec0 |  |
|  | e18a6bda7b |  |
|  | e64be95f1c |  |
|  | a1aa0150f8 |  |
|  | 32f9cb5996 |  |
|  | 3b7e692b01 |  |
|  | 6491eba1da |  |
|  | bb7ea7e809 |  |
|  | 169930e3b8 |  |
|  | 8e14047f36 |  |
|  | fd29a96f7b |  |
|  | 820c12f230 |  |
|  | ff3550e7b3 |  |
|  | b65e43351d |  |
|  | 3fb74b632b |  |
|  | 253e54344d |  |
|  | f1ee7d24a6 |  |
|  | 475673b3e7 |  |
|  | dd49afef01 |  |
|  | d0c842fe87 |  |
README.md (104 changed lines)

```
@@ -4,71 +4,101 @@
</p>

<p align="center">
<a href="https://flashcat.cloud/docs/">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<a href="https://n9e.github.io">
<img alt="Docs" src="https://img.shields.io/badge/docs-get%20started-brightgreen"/></a>
<a href="https://hub.docker.com/u/flashcatcloud">
<img alt="Docker pulls" src="https://img.shields.io/docker/pulls/flashcatcloud/nightingale"/></a>
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ccfos/nightingale">
<br/><img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues closed" src="https://img.shields.io/github/issues-closed/ccfos/nightingale">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/ccfos/nightingale">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<a href="https://n9e-talk.slack.com/">
<img alt="GitHub contributors" src="https://img.shields.io/badge/join%20slack-%23n9e-brightgreen.svg"/></a>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
</p>

<p align="center">
告警管理专家,一体化的开源可观测平台
An open-source cloud-native monitoring system that is <b>all-in-one</b> <br/>
<b>Out-of-the-box</b>, it integrates data collection, visualization, and monitoring alert <br/>
We recommend upgrading your <b>Prometheus + AlertManager + Grafana</b> combination to Nightingale!
</p>

[English](./README_en.md) | [中文](./README.md)

夜莺Nightingale是中国计算机学会托管的开源云原生可观测工具,最早由滴滴于 2020 年孵化并开源,并于 2022 年正式捐赠予中国计算机学会。夜莺采用 All-in-One 的设计理念,集数据采集、可视化、监控告警、数据分析于一体,与云原生生态紧密集成,融入了顶级互联网公司可观测性最佳实践,沉淀了众多社区专家经验,开箱即用。

## 资料

- 文档:[flashcat.cloud/docs](https://flashcat.cloud/docs/)
- 提问:[answer.flashcat.cloud](https://answer.flashcat.cloud/)
- 报Bug:[github.com/ccfos/nightingale/issues](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml)
[English](./README.md) | [中文](./README_zh.md)


## 功能和特点
## Highlighted Features

- 统一接入各种时序库:支持对接 Prometheus、VictoriaMetrics、Thanos、Mimir、M3DB 等多种时序库,实现统一告警管理
- 专业告警能力:内置支持多种告警规则,可以扩展支持所有通知媒介,支持告警屏蔽、告警抑制、告警自愈、告警事件管理
- 高性能可视化引擎:支持多种图表样式,内置众多Dashboard模版,也可导入Grafana模版,开箱即用,开源协议商业友好
- 无缝搭配 [Flashduty](https://flashcat.cloud/product/flashcat-duty/):实现告警聚合收敛、认领、升级、排班、IM集成,确保告警处理不遗漏,减少打扰,更好协同
- 支持所有常见采集器:支持 [Categraf](https://flashcat.cloud/product/categraf)、telegraf、grafana-agent、datadog-agent、各种 exporter 作为采集器,没有什么数据是不能监控的
- 一体化观测平台:从 v6 版本开始,支持接入 ElasticSearch、Jaeger 数据源,实现日志、链路、指标多维度的统一可观测
- **Out-of-the-box**
  - Supports multiple deployment methods such as **Docker, Helm Chart, and cloud services**, integrates data collection, monitoring, and alerting into one system, and comes with various monitoring dashboards, quick views, and alert rule templates. **It greatly reduces the construction cost, learning cost, and usage cost of cloud-native monitoring systems**.
- **Professional Alerting**
  - Provides visual alert configuration and management, supports various alert rules, offers the ability to configure silence and subscription rules, supports multiple alert delivery channels, and has features such as alert self-healing and event management.
- **Cloud-Native**
  - Quickly builds an enterprise-level cloud-native monitoring system through a turnkey approach, supports multiple collectors such as [Categraf](https://github.com/flashcatcloud/categraf), Telegraf, and Grafana-agent, supports multiple data sources such as Prometheus, VictoriaMetrics, M3DB, ElasticSearch, and Jaeger, and is compatible with importing Grafana dashboards. **It seamlessly integrates with the cloud-native ecosystem**.
- **High Performance and High Availability**
  - Due to the multi-data-source management engine of Nightingale and its excellent architecture design, and utilizing a high-performance time-series database, it can handle data collection, storage, and alert analysis scenarios with billions of time-series data, saving a lot of costs.
  - Nightingale components can be horizontally scaled with no single point of failure. It has been deployed in thousands of enterprises and tested in harsh production practices. Many leading Internet companies have used Nightingale for cluster machines with hundreds of nodes, processing billions of time-series data.
- **Flexible Extension and Centralized Management**
  - Nightingale can be deployed on a 1-core 1G cloud host, deployed in a cluster of hundreds of machines, or run in Kubernetes. Time-series databases, alert engines, and other components can also be decentralized to various data centers and regions, balancing edge deployment with centralized management. **It solves the problem of data fragmentation and lack of unified views**.


## 产品演示
#### If you are using Prometheus and have one or more of the following requirement scenarios, it is recommended that you upgrade to Nightingale:



- Multiple systems such as Prometheus, Alertmanager, Grafana, etc. are fragmented and lack a unified view and cannot be used out of the box;
- The way to manage Prometheus and Alertmanager by modifying configuration files has a big learning curve and is difficult to collaborate;
- Too much data to scale-up your Prometheus cluster;
- Multiple Prometheus clusters running in production environments, which faced high management and usage costs;

## 部署架构
#### If you are using Zabbix and have the following scenarios, it is recommended that you upgrade to Nightingale:



- Monitoring too much data and wanting a better scalable solution;
- A high learning curve and a desire for better efficiency of collaborative use in a multi-person, multi-team model;
- Microservice and cloud-native architectures with variable monitoring data lifecycles and high monitoring data dimension bases, which are not easily adaptable to the Zabbix data model;

## 加入交流群

欢迎加入 QQ 交流群,群号:479290895,QQ 群适合群友互助,夜莺研发人员通常不在群里。如果要报 bug 请到[这里](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml),提问到[这里](https://answer.flashcat.cloud/)。

#### If you are using [open-falcon](https://github.com/open-falcon/falcon-plus), we recommend you to upgrade to Nightingale:
- For more information about open-falcon and Nightingale, please read [Ten features and trends of cloud-native monitoring](https://mp.weixin.qq.com/s?__biz=MzkzNjI5OTM5Nw==&mid=2247483738&idx=1&sn=e8bdbb974a2cd003c1abcc2b5405dd18&chksm=c2a19fb0f5d616a63185cd79277a79a6b80118ef2185890d0683d2bb20451bd9303c78d083c5#rd)

## Getting Started

[https://n9e.github.io/](https://n9e.github.io/)

## Screenshots

https://user-images.githubusercontent.com/792850/216888712-2565fcea-9df5-47bd-a49e-d60af9bd76e8.mp4

## Architecture

<img src="doc/img/arch-product.png" width="600">

Nightingale monitoring can receive monitoring data reported by various collectors (such as [Categraf](https://github.com/flashcatcloud/categraf), telegraf, grafana-agent, Prometheus, etc.) and write them to various popular time-series databases (such as Prometheus, M3DB, VictoriaMetrics, Thanos, TDEngine, etc.). It provides configuration capabilities for alert rules, silence rules, and subscription rules, as well as the ability to view monitoring data. It also provides automatic alarm self-healing mechanisms (such as automatically calling back to a webhook address or executing a script after an alarm is triggered), and the ability to store and manage historical alarm events and view them in groups.

If the performance of a standalone time-series database (such as Prometheus) has bottlenecks or poor disaster recovery, we recommend using [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics). The VictoriaMetrics architecture is relatively simple, has excellent performance, and is easy to deploy and maintain. The architecture diagram is as shown above. For more detailed documentation on VictoriaMetrics, please refer to its [official website](https://victoriametrics.com/).

**We welcome you to participate in the Nightingale open-source project and community in various ways, including but not limited to**:
- Adding and improving documentation => [n9e.github.io](https://n9e.github.io/)
- Sharing your best practices and experience in using Nightingale monitoring => [Article sharing](https://n9e.github.io/docs/prologue/share/)
- Submitting product suggestions => [github issue](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md)
- Submitting code to make Nightingale monitoring faster, more stable, and easier to use => [github pull request](https://github.com/didi/nightingale/pulls)


**Respecting, recognizing, and recording the work of every contributor** is the first guiding principle of the Nightingale open-source community. We advocate effective questioning, which not only respects the developer's time but also contributes to the accumulation of knowledge in the entire community.
- Before asking a question, please first refer to the [FAQ](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq)
- We use [GitHub Discussions](https://github.com/ccfos/nightingale/discussions) as the communication forum. You can search and ask questions here.
- We also recommend that you join our [Slack channel](https://n9e-talk.slack.com/) to exchange experiences with other Nightingale users.


## Who is using Nightingale

You can register your usage and share your experience by posting on **[Who is Using Nightingale](https://github.com/ccfos/nightingale/issues/897)**.

## Stargazers over time

[](https://star-history.com/#ccfos/nightingale&Date)

[](https://starchart.cc/ccfos/nightingale)

## Contributors

<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>

## 社区治理

[夜莺开源项目和社区治理架构(草案)](./doc/community-governance.md)

## License

[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)
[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)
```
README_en.md (104 changed lines, file deleted)

```
@@ -1,104 +0,0 @@
<p align="center">
<a href="https://github.com/ccfos/nightingale">
<img src="doc/img/nightingale_logo_h.png" alt="nightingale - cloud native monitoring" width="240" /></a>
</p>

<p align="center">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<a href="https://n9e.github.io">
<img alt="Docs" src="https://img.shields.io/badge/docs-get%20started-brightgreen"/></a>
<a href="https://hub.docker.com/u/flashcatcloud">
<img alt="Docker pulls" src="https://img.shields.io/docker/pulls/flashcatcloud/nightingale"/></a>
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ccfos/nightingale">
<img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues closed" src="https://img.shields.io/github/issues-closed/ccfos/nightingale">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/ccfos/nightingale">
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<a href="https://n9e-talk.slack.com/">
<img alt="GitHub contributors" src="https://img.shields.io/badge/join%20slack-%23n9e-brightgreen.svg"/></a>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
</p>
<p align="center">
An open-source cloud-native monitoring system that is <b>all-in-one</b> <br/>
<b>Out-of-the-box</b>, it integrates data collection, visualization, and monitoring alert <br/>
We recommend upgrading your <b>Prometheus + AlertManager + Grafana</b> combination to Nightingale!
</p>

[English](./README.md) | [中文](./README_ZH.md)


## Highlighted Features

- **Out-of-the-box**
  - Supports multiple deployment methods such as **Docker, Helm Chart, and cloud services**, integrates data collection, monitoring, and alerting into one system, and comes with various monitoring dashboards, quick views, and alert rule templates. **It greatly reduces the construction cost, learning cost, and usage cost of cloud-native monitoring systems**.
- **Professional Alerting**
  - Provides visual alert configuration and management, supports various alert rules, offers the ability to configure silence and subscription rules, supports multiple alert delivery channels, and has features such as alert self-healing and event management.
- **Cloud-Native**
  - Quickly builds an enterprise-level cloud-native monitoring system through a turnkey approach, supports multiple collectors such as [Categraf](https://github.com/flashcatcloud/categraf), Telegraf, and Grafana-agent, supports multiple data sources such as Prometheus, VictoriaMetrics, M3DB, ElasticSearch, and Jaeger, and is compatible with importing Grafana dashboards. **It seamlessly integrates with the cloud-native ecosystem**.
- **High Performance and High Availability**
  - Due to the multi-data-source management engine of Nightingale and its excellent architecture design, and utilizing a high-performance time-series database, it can handle data collection, storage, and alert analysis scenarios with billions of time-series data, saving a lot of costs.
  - Nightingale components can be horizontally scaled with no single point of failure. It has been deployed in thousands of enterprises and tested in harsh production practices. Many leading Internet companies have used Nightingale for cluster machines with hundreds of nodes, processing billions of time-series data.
- **Flexible Extension and Centralized Management**
  - Nightingale can be deployed on a 1-core 1G cloud host, deployed in a cluster of hundreds of machines, or run in Kubernetes. Time-series databases, alert engines, and other components can also be decentralized to various data centers and regions, balancing edge deployment with centralized management. **It solves the problem of data fragmentation and lack of unified views**.


#### If you are using Prometheus and have one or more of the following requirement scenarios, it is recommended that you upgrade to Nightingale:

- Multiple systems such as Prometheus, Alertmanager, Grafana, etc. are fragmented and lack a unified view and cannot be used out of the box;
- The way to manage Prometheus and Alertmanager by modifying configuration files has a big learning curve and is difficult to collaborate;
- Too much data to scale-up your Prometheus cluster;
- Multiple Prometheus clusters running in production environments, which faced high management and usage costs;

#### If you are using Zabbix and have the following scenarios, it is recommended that you upgrade to Nightingale:

- Monitoring too much data and wanting a better scalable solution;
- A high learning curve and a desire for better efficiency of collaborative use in a multi-person, multi-team model;
- Microservice and cloud-native architectures with variable monitoring data lifecycles and high monitoring data dimension bases, which are not easily adaptable to the Zabbix data model;


#### If you are using [open-falcon](https://github.com/open-falcon/falcon-plus), we recommend you to upgrade to Nightingale:
- For more information about open-falcon and Nightingale, please read [Ten features and trends of cloud-native monitoring](https://mp.weixin.qq.com/s?__biz=MzkzNjI5OTM5Nw==&mid=2247483738&idx=1&sn=e8bdbb974a2cd003c1abcc2b5405dd18&chksm=c2a19fb0f5d616a63185cd79277a79a6b80118ef2185890d0683d2bb20451bd9303c78d083c5#rd)

## Getting Started

[English Doc](https://n9e.github.io/) | [中文文档](http://n9e.flashcat.cloud/)

## Screenshots

https://user-images.githubusercontent.com/792850/216888712-2565fcea-9df5-47bd-a49e-d60af9bd76e8.mp4

## Architecture

<img src="doc/img/arch-product.png" width="600">

Nightingale monitoring can receive monitoring data reported by various collectors (such as [Categraf](https://github.com/flashcatcloud/categraf), telegraf, grafana-agent, Prometheus, etc.) and write them to various popular time-series databases (such as Prometheus, M3DB, VictoriaMetrics, Thanos, TDEngine, etc.). It provides configuration capabilities for alert rules, silence rules, and subscription rules, as well as the ability to view monitoring data. It also provides automatic alarm self-healing mechanisms (such as automatically calling back to a webhook address or executing a script after an alarm is triggered), and the ability to store and manage historical alarm events and view them in groups.

If the performance of a standalone time-series database (such as Prometheus) has bottlenecks or poor disaster recovery, we recommend using [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics). The VictoriaMetrics architecture is relatively simple, has excellent performance, and is easy to deploy and maintain. The architecture diagram is as shown above. For more detailed documentation on VictoriaMetrics, please refer to its [official website](https://victoriametrics.com/).

**We welcome you to participate in the Nightingale open-source project and community in various ways, including but not limited to**:
- Adding and improving documentation => [n9e.github.io](https://n9e.github.io/)
- Sharing your best practices and experience in using Nightingale monitoring => [Article sharing](https://n9e.github.io/docs/prologue/share/)
- Submitting product suggestions => [github issue](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md)
- Submitting code to make Nightingale monitoring faster, more stable, and easier to use => [github pull request](https://github.com/didi/nightingale/pulls)


**Respecting, recognizing, and recording the work of every contributor** is the first guiding principle of the Nightingale open-source community. We advocate effective questioning, which not only respects the developer's time but also contributes to the accumulation of knowledge in the entire community.
- Before asking a question, please first refer to the [FAQ](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq)
- We use [GitHub Discussions](https://github.com/ccfos/nightingale/discussions) as the communication forum. You can search and ask questions here.
- We also recommend that you join our [Slack channel](https://n9e-talk.slack.com/) to exchange experiences with other Nightingale users.


## Who is using Nightingale

You can register your usage and share your experience by posting on **[Who is Using Nightingale](https://github.com/ccfos/nightingale/issues/897)**.

## Stargazers over time

[](https://starchart.cc/ccfos/nightingale)

## Contributors

<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>

## License

[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)
```
README_zh.md (74 lines, new file)

```
@@ -0,0 +1,74 @@
<p align="center">
<a href="https://github.com/ccfos/nightingale">
<img src="doc/img/nightingale_logo_h.png" alt="nightingale - cloud native monitoring" width="240" /></a>
</p>

<p align="center">
<a href="https://flashcat.cloud/docs/">
<img alt="Docs" src="https://img.shields.io/badge/docs-get%20started-brightgreen"/></a>
<a href="https://hub.docker.com/u/flashcatcloud">
<img alt="Docker pulls" src="https://img.shields.io/docker/pulls/flashcatcloud/nightingale"/></a>
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ccfos/nightingale">
<br/><img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues closed" src="https://img.shields.io/github/issues-closed/ccfos/nightingale">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/ccfos/nightingale">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
<a href="https://n9e-talk.slack.com/">
<img alt="GitHub contributors" src="https://img.shields.io/badge/join%20slack-%23n9e-brightgreen.svg"/></a>
</p>

<p align="center">
告警管理专家,一体化的开源可观测平台
</p>

[English](./README.md) | [中文](./README_zh.md)

夜莺Nightingale是中国计算机学会托管的开源云原生可观测工具,最早由滴滴于 2020 年孵化并开源,并于 2022 年正式捐赠予中国计算机学会。夜莺采用 All-in-One 的设计理念,集数据采集、可视化、监控告警、数据分析于一体,与云原生生态紧密集成,融入了顶级互联网公司可观测性最佳实践,沉淀了众多社区专家经验,开箱即用。

## 资料

- 文档:[flashcat.cloud/docs](https://flashcat.cloud/docs/)
- 提问:[answer.flashcat.cloud](https://answer.flashcat.cloud/)
- 报Bug:[github.com/ccfos/nightingale/issues](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml)


## 功能和特点

- 统一接入各种时序库:支持对接 Prometheus、VictoriaMetrics、Thanos、Mimir、M3DB 等多种时序库,实现统一告警管理
- 专业告警能力:内置支持多种告警规则,可以扩展支持所有通知媒介,支持告警屏蔽、告警抑制、告警自愈、告警事件管理
- 高性能可视化引擎:支持多种图表样式,内置众多Dashboard模版,也可导入Grafana模版,开箱即用,开源协议商业友好
- 无缝搭配 [Flashduty](https://flashcat.cloud/product/flashcat-duty/):实现告警聚合收敛、认领、升级、排班、IM集成,确保告警处理不遗漏,减少打扰,更好协同
- 支持所有常见采集器:支持 [Categraf](https://flashcat.cloud/product/categraf)、telegraf、grafana-agent、datadog-agent、各种 exporter 作为采集器,没有什么数据是不能监控的
- 一体化观测平台:从 v6 版本开始,支持接入 ElasticSearch、Jaeger 数据源,实现日志、链路、指标多维度的统一可观测


## 产品演示



## 部署架构



## 加入交流群

欢迎加入 QQ 交流群,群号:479290895,QQ 群适合群友互助,夜莺研发人员通常不在群里。如果要报 bug 请到[这里](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml),提问到[这里](https://answer.flashcat.cloud/)。

## Stargazers over time

[](https://star-history.com/#ccfos/nightingale&Date)


## Contributors

<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>

## 社区治理

[夜莺开源项目和社区治理架构(草案)](./doc/community-governance.md)

## License

[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)
```
```diff
@@ -2,8 +2,6 @@ package aconf
 
 import (
 	"path"
-
-	"github.com/toolkits/pkg/runner"
 )
 
 type Alert struct {
@@ -55,9 +53,9 @@ type Ibex struct {
 	Timeout int64
 }
 
-func (a *Alert) PreCheck() {
+func (a *Alert) PreCheck(configDir string) {
 	if a.Alerting.TemplatesDir == "" {
-		a.Alerting.TemplatesDir = path.Join(runner.Cwd, "etc", "template")
+		a.Alerting.TemplatesDir = path.Join(configDir, "template")
 	}
 
 	if a.Alerting.NotifyConcurrency == 0 {
```
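The change above makes template resolution follow the configuration directory instead of the process working directory. A minimal sketch of the effect, assuming a config dir of /etc/n9e (both paths below are illustrative, not taken from this diff):

```go
package main

import (
	"fmt"
	"path"
)

func main() {
	configDir := "/etc/n9e" // illustrative: whatever directory is passed to PreCheck

	// Before: templates were resolved relative to the process working directory.
	oldDir := path.Join("/opt/n9e", "etc", "template") // stand-in for runner.Cwd
	// After: templates live next to the rest of the configuration.
	newDir := path.Join(configDir, "template")

	fmt.Println(oldDir) // /opt/n9e/etc/template
	fmt.Println(newDir) // /etc/n9e/template
}
```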
```diff
@@ -24,6 +24,7 @@ import (
 	"github.com/ccfos/nightingale/v6/prom"
 	"github.com/ccfos/nightingale/v6/pushgw/pconf"
 	"github.com/ccfos/nightingale/v6/pushgw/writer"
+	"github.com/ccfos/nightingale/v6/tdengine"
 )
 
 func Initialize(configDir string, cryptoKey string) (func(), error) {
@@ -52,10 +53,11 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
 	userGroupCache := memsto.NewUserGroupCache(ctx, syncStats)
 
 	promClients := prom.NewPromClient(ctx, config.Alert.Heartbeat)
+	tdengineClients := tdengine.NewTdengineClient(ctx, config.Alert.Heartbeat)
 
 	externalProcessors := process.NewExternalProcessors()
 
-	Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, userCache, userGroupCache)
+	Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, tdengineClients, userCache, userGroupCache)
 
 	r := httpx.GinEngine(config.Global.RunMode, config.HTTP)
 	rt := router.New(config.HTTP, config.Alert, alertMuteCache, targetCache, busiGroupCache, alertStats, ctx, externalProcessors)
@@ -71,7 +73,8 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
 }
 
 func Start(alertc aconf.Alert, pushgwc pconf.Pushgw, syncStats *memsto.Stats, alertStats *astats.Stats, externalProcessors *process.ExternalProcessorsType, targetCache *memsto.TargetCacheType, busiGroupCache *memsto.BusiGroupCacheType,
-	alertMuteCache *memsto.AlertMuteCacheType, alertRuleCache *memsto.AlertRuleCacheType, notifyConfigCache *memsto.NotifyConfigCacheType, datasourceCache *memsto.DatasourceCacheType, ctx *ctx.Context, promClients *prom.PromClientMap, userCache *memsto.UserCacheType, userGroupCache *memsto.UserGroupCacheType) {
+	alertMuteCache *memsto.AlertMuteCacheType, alertRuleCache *memsto.AlertRuleCacheType, notifyConfigCache *memsto.NotifyConfigCacheType, datasourceCache *memsto.DatasourceCacheType, ctx *ctx.Context,
+	promClients *prom.PromClientMap, tdendgineClients *tdengine.TdengineClientMap, userCache *memsto.UserCacheType, userGroupCache *memsto.UserGroupCacheType) {
 	alertSubscribeCache := memsto.NewAlertSubscribeCache(ctx, syncStats)
 	recordingRuleCache := memsto.NewRecordingRuleCache(ctx, syncStats)
 
@@ -82,14 +85,14 @@ func Start(alertc aconf.Alert, pushgwc pconf.Pushgw, syncStats *memsto.Stats, al
 	writers := writer.NewWriters(pushgwc)
 	record.NewScheduler(alertc, recordingRuleCache, promClients, writers, alertStats)
 
-	eval.NewScheduler(alertc, externalProcessors, alertRuleCache, targetCache, busiGroupCache, alertMuteCache, datasourceCache, promClients, naming, ctx, alertStats)
+	eval.NewScheduler(alertc, externalProcessors, alertRuleCache, targetCache, busiGroupCache, alertMuteCache, datasourceCache, promClients, tdendgineClients, naming, ctx, alertStats)
 
-	dp := dispatch.NewDispatch(alertRuleCache, userCache, userGroupCache, alertSubscribeCache, targetCache, notifyConfigCache, alertc.Alerting, ctx)
+	dp := dispatch.NewDispatch(alertRuleCache, userCache, userGroupCache, alertSubscribeCache, targetCache, notifyConfigCache, alertc.Alerting, ctx, alertStats)
 	consumer := dispatch.NewConsumer(alertc.Alerting, ctx, dp)
 
 	go dp.ReloadTpls()
 	go consumer.LoopConsume()
 
 	go queue.ReportQueueSize(alertStats)
-	go sender.StartEmailSender(notifyConfigCache.GetSMTP()) // todo
+	go sender.InitEmailSender(notifyConfigCache.GetSMTP())
 }
```
```diff
@@ -10,22 +10,52 @@ const (
 )
 
 type Stats struct {
-	CounterSampleTotal   *prometheus.CounterVec
-	CounterAlertsTotal   *prometheus.CounterVec
-	GaugeAlertQueueSize  prometheus.Gauge
-	GaugeSampleQueueSize *prometheus.GaugeVec
-	RequestDuration      *prometheus.HistogramVec
-	ForwardDuration      *prometheus.HistogramVec
+	AlertNotifyTotal            *prometheus.CounterVec
+	AlertNotifyErrorTotal       *prometheus.CounterVec
+	CounterAlertsTotal          *prometheus.CounterVec
+	GaugeAlertQueueSize         prometheus.Gauge
+	CounterRuleEval             *prometheus.CounterVec
+	CounterQueryDataErrorTotal  *prometheus.CounterVec
+	CounterRecordEval           *prometheus.CounterVec
+	CounterRecordEvalErrorTotal *prometheus.CounterVec
+	CounterMuteTotal            *prometheus.CounterVec
 }
 
 func NewSyncStats() *Stats {
-	// total monitoring samples received across all ingestion endpoints
-	CounterSampleTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
+	CounterRuleEval := prometheus.NewCounterVec(prometheus.CounterOpts{
 		Namespace: namespace,
 		Subsystem: subsystem,
-		Name:      "samples_received_total",
-		Help:      "Total number samples received.",
-	}, []string{"cluster", "channel"})
+		Name:      "rule_eval_total",
+		Help:      "Number of rule eval.",
+	}, []string{})
+
+	CounterRecordEval := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Namespace: namespace,
+		Subsystem: subsystem,
+		Name:      "record_eval_total",
+		Help:      "Number of record eval.",
+	}, []string{})
+
+	CounterRecordEvalErrorTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Namespace: namespace,
+		Subsystem: subsystem,
+		Name:      "record_eval_error_total",
+		Help:      "Number of record eval error.",
+	}, []string{})
+
+	AlertNotifyTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Namespace: namespace,
+		Subsystem: subsystem,
+		Name:      "alert_notify_total",
+		Help:      "Number of send msg.",
+	}, []string{"channel"})
+
+	AlertNotifyErrorTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Namespace: namespace,
+		Subsystem: subsystem,
+		Name:      "alert_notify_error_total",
+		Help:      "Number of send msg.",
+	}, []string{"channel"})
 
 	// total number of alert events generated
 	CounterAlertsTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
@@ -33,7 +63,7 @@ func NewSyncStats() *Stats {
 		Subsystem: subsystem,
 		Name:      "alerts_total",
 		Help:      "Total number alert events.",
-	}, []string{"cluster"})
+	}, []string{"cluster", "type", "busi_group"})
 
 	// length of the in-memory alert event queue
 	GaugeAlertQueueSize := prometheus.NewGauge(prometheus.GaugeOpts{
@@ -43,51 +73,41 @@ func NewSyncStats() *Stats {
 		Help: "The size of alert queue.",
 	})
 
-	// size of each data-forwarding queue
-	GaugeSampleQueueSize := prometheus.NewGaugeVec(prometheus.GaugeOpts{
+	CounterQueryDataErrorTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
 		Namespace: namespace,
 		Subsystem: subsystem,
-		Name:      "sample_queue_size",
-		Help:      "The size of sample queue.",
-	}, []string{"cluster", "channel_number"})
+		Name:      "query_data_error_total",
+		Help:      "Number of query data error.",
+	}, []string{"datasource"})
 
-	// latency of important requests, e.g. data-ingestion requests
-	RequestDuration := prometheus.NewHistogramVec(
-		prometheus.HistogramOpts{
-			Namespace: namespace,
-			Subsystem: subsystem,
-			Buckets:   []float64{.01, .1, 1},
-			Name:      "http_request_duration_seconds",
-			Help:      "HTTP request latencies in seconds.",
-		}, []string{"code", "path", "method"},
-	)
-
-	// latency of forwarding samples to the backend TSDB
-	ForwardDuration := prometheus.NewHistogramVec(
-		prometheus.HistogramOpts{
-			Namespace: namespace,
-			Subsystem: subsystem,
-			Buckets:   []float64{.1, 1, 10},
-			Name:      "forward_duration_seconds",
-			Help:      "Forward samples to TSDB. latencies in seconds.",
-		}, []string{"cluster", "channel_number"},
-	)
+	CounterMuteTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
+		Namespace: namespace,
+		Subsystem: subsystem,
+		Name:      "mute_total",
+		Help:      "Number of mute.",
+	}, []string{"group"})
 
 	prometheus.MustRegister(
-		CounterSampleTotal,
 		CounterAlertsTotal,
 		GaugeAlertQueueSize,
-		GaugeSampleQueueSize,
-		RequestDuration,
-		ForwardDuration,
+		AlertNotifyTotal,
+		AlertNotifyErrorTotal,
+		CounterRuleEval,
+		CounterQueryDataErrorTotal,
+		CounterRecordEval,
+		CounterRecordEvalErrorTotal,
+		CounterMuteTotal,
 	)
 
 	return &Stats{
-		CounterSampleTotal:   CounterSampleTotal,
-		CounterAlertsTotal:   CounterAlertsTotal,
-		GaugeAlertQueueSize:  GaugeAlertQueueSize,
-		GaugeSampleQueueSize: GaugeSampleQueueSize,
-		RequestDuration:      RequestDuration,
-		ForwardDuration:      ForwardDuration,
+		CounterAlertsTotal:          CounterAlertsTotal,
+		GaugeAlertQueueSize:         GaugeAlertQueueSize,
+		AlertNotifyTotal:            AlertNotifyTotal,
+		AlertNotifyErrorTotal:       AlertNotifyErrorTotal,
+		CounterRuleEval:             CounterRuleEval,
+		CounterQueryDataErrorTotal:  CounterQueryDataErrorTotal,
+		CounterRecordEval:           CounterRecordEval,
+		CounterRecordEvalErrorTotal: CounterRecordEvalErrorTotal,
+		CounterMuteTotal:            CounterMuteTotal,
 	}
 }
```
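For readers less familiar with the Prometheus Go client used throughout this file, here is a minimal, self-contained sketch of the define/register/increment cycle behind counters such as alert_notify_total; the namespace, subsystem, metric name, and label values below are illustrative, not Nightingale's:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// Define a counter vector partitioned by notification channel,
	// mirroring the shape of the alert_notify_total metric in the diff.
	notifyTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "n9e",   // illustrative namespace
		Subsystem: "alert", // illustrative subsystem
		Name:      "notify_demo_total",
		Help:      "Number of notifications sent (demo).",
	}, []string{"channel"})

	// Registration makes the metric visible to the default /metrics exposition.
	prometheus.MustRegister(notifyTotal)

	// Increment the child series for a given label value.
	notifyTotal.WithLabelValues("email").Inc()

	fmt.Println(testutil.ToFloat64(notifyTotal.WithLabelValues("email"))) // 1
}
```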
```diff
@@ -22,6 +22,14 @@ func MatchTags(eventTagsMap map[string]string, itags []models.TagFilter) bool {
 	}
 	return true
 }
 
+func MatchGroupsName(groupName string, groupFilter []models.TagFilter) bool {
+	for _, filter := range groupFilter {
+		if !matchTag(groupName, filter) {
+			return false
+		}
+	}
+	return true
+}
 
 func matchTag(value string, filter models.TagFilter) bool {
 	switch filter.Func {
```
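MatchGroupsName applies every filter and fails closed, so all filters must match (AND semantics). A self-contained sketch of the same behavior; the filter type below is a hypothetical stand-in for models.TagFilter, whose matchTag helper is not shown in this diff:

```go
package main

import (
	"fmt"
	"regexp"
)

// filter is a hypothetical stand-in for models.TagFilter.
type filter struct {
	fn    string // "==" exact match, "=~" regexp match
	value string
}

func match(v string, f filter) bool {
	switch f.fn {
	case "==":
		return v == f.value
	case "=~":
		ok, _ := regexp.MatchString(f.value, v)
		return ok
	}
	return false
}

// All filters must pass, the same AND semantics as MatchGroupsName above.
func matchGroupsName(group string, filters []filter) bool {
	for _, f := range filters {
		if !match(group, f) {
			return false
		}
	}
	return true
}

func main() {
	fs := []filter{{fn: "=~", value: "^infra-"}}
	fmt.Println(matchGroupsName("infra-db", fs))  // true
	fmt.Println(matchGroupsName("web-front", fs)) // false
}
```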
```diff
@@ -61,6 +61,13 @@ func (e *Consumer) consume(events []interface{}, sema *semaphore.Semaphore) {
 func (e *Consumer) consumeOne(event *models.AlertCurEvent) {
 	LogEvent(event, "consume")
 
+	eventType := "alert"
+	if event.IsRecovered {
+		eventType = "recovery"
+	}
+
+	e.dispatch.astats.CounterAlertsTotal.WithLabelValues(event.Cluster, eventType, event.GroupName).Inc()
+
 	if err := event.ParseRule("rule_name"); err != nil {
 		event.RuleName = fmt.Sprintf("failed to parse rule name: %v", err)
 	}
```
```diff
@@ -9,6 +9,7 @@ import (
 	"time"
 
 	"github.com/ccfos/nightingale/v6/alert/aconf"
+	"github.com/ccfos/nightingale/v6/alert/astats"
 	"github.com/ccfos/nightingale/v6/alert/common"
 	"github.com/ccfos/nightingale/v6/alert/sender"
 	"github.com/ccfos/nightingale/v6/memsto"
@@ -33,7 +34,8 @@ type Dispatch struct {
 	ExtraSenders     map[string]sender.Sender
 	BeforeSenderHook func(*models.AlertCurEvent) bool
 
-	ctx *ctx.Context
+	ctx    *ctx.Context
+	astats *astats.Stats
 
 	RwLock sync.RWMutex
 }
@@ -41,7 +43,7 @@ type Dispatch struct {
 // NewDispatch creates a Notify instance
 func NewDispatch(alertRuleCache *memsto.AlertRuleCacheType, userCache *memsto.UserCacheType, userGroupCache *memsto.UserGroupCacheType,
 	alertSubscribeCache *memsto.AlertSubscribeCacheType, targetCache *memsto.TargetCacheType, notifyConfigCache *memsto.NotifyConfigCacheType,
-	alerting aconf.Alerting, ctx *ctx.Context) *Dispatch {
+	alerting aconf.Alerting, ctx *ctx.Context, astats *astats.Stats) *Dispatch {
 	notify := &Dispatch{
 		alertRuleCache: alertRuleCache,
 		userCache:      userCache,
@@ -57,7 +59,8 @@ func NewDispatch(alertRuleCache *memsto.AlertRuleCacheType, userCache *memsto.Us
 		ExtraSenders:     make(map[string]sender.Sender),
 		BeforeSenderHook: func(*models.AlertCurEvent) bool { return true },
 
-		ctx: ctx,
+		ctx:    ctx,
+		astats: astats,
 	}
 	return notify
 }
@@ -86,17 +89,17 @@ func (e *Dispatch) relaodTpls() error {
 	senders := map[string]sender.Sender{
 		models.Email:      sender.NewSender(models.Email, tmpTpls, smtp),
-		models.Dingtalk:   sender.NewSender(models.Dingtalk, tmpTpls, smtp),
-		models.Wecom:      sender.NewSender(models.Wecom, tmpTpls, smtp),
-		models.Feishu:     sender.NewSender(models.Feishu, tmpTpls, smtp),
-		models.Mm:         sender.NewSender(models.Mm, tmpTpls, smtp),
-		models.Telegram:   sender.NewSender(models.Telegram, tmpTpls, smtp),
-		models.FeishuCard: sender.NewSender(models.FeishuCard, tmpTpls, smtp),
+		models.Dingtalk:   sender.NewSender(models.Dingtalk, tmpTpls),
+		models.Wecom:      sender.NewSender(models.Wecom, tmpTpls),
+		models.Feishu:     sender.NewSender(models.Feishu, tmpTpls),
+		models.Mm:         sender.NewSender(models.Mm, tmpTpls),
+		models.Telegram:   sender.NewSender(models.Telegram, tmpTpls),
+		models.FeishuCard: sender.NewSender(models.FeishuCard, tmpTpls),
 	}
 
 	e.RwLock.RLock()
-	for channel, sender := range e.ExtraSenders {
-		senders[channel] = sender
+	for channelName, extraSender := range e.ExtraSenders {
+		senders[channelName] = extraSender
 	}
 	e.RwLock.RUnlock()
@@ -170,12 +173,25 @@ func (e *Dispatch) handleSubs(event *models.AlertCurEvent) {
 
 // handleSub applies one subscription rule to the event; note the event is passed by value because its state is modified below
 func (e *Dispatch) handleSub(sub *models.AlertSubscribe, event models.AlertCurEvent) {
-	if sub.IsDisabled() || !sub.MatchCluster(event.DatasourceId) {
+	if sub.IsDisabled() {
+		return
+	}
+
+	if !sub.MatchCluster(event.DatasourceId) {
 		return
 	}
+
 	if !sub.MatchProd(event.RuleProd) {
 		return
 	}
+
 	if !common.MatchTags(event.TagsMap, sub.ITags) {
 		return
 	}
+	// event BusiGroups filter
+	if !common.MatchGroupsName(event.GroupName, sub.IBusiGroups) {
+		return
+	}
 	if sub.ForDuration > (event.TriggerTime - event.FirstTriggerTime) {
 		return
 	}
@@ -204,7 +220,7 @@ func (e *Dispatch) Send(rule *models.AlertRule, event *models.AlertCurEvent, not
 	needSend := e.BeforeSenderHook(event)
 	if needSend {
 		for channel, uids := range notifyTarget.ToChannelUserMap() {
-			ctx := sender.BuildMessageContext(rule, []*models.AlertCurEvent{event}, uids, e.userCache)
+			msgCtx := sender.BuildMessageContext(rule, []*models.AlertCurEvent{event}, uids, e.userCache, e.astats)
 			e.RwLock.RLock()
 			s := e.Senders[channel]
 			e.RwLock.RUnlock()
@@ -212,18 +228,18 @@ func (e *Dispatch) Send(rule *models.AlertRule, event *models.AlertCurEvent, not
 				logger.Debugf("no sender for channel: %s", channel)
 				continue
 			}
-			s.Send(ctx)
+			s.Send(msgCtx)
 		}
 	}
 
 	// handle event callbacks
-	sender.SendCallbacks(e.ctx, notifyTarget.ToCallbackList(), event, e.targetCache, e.userCache, e.notifyConfigCache.GetIbex())
+	sender.SendCallbacks(e.ctx, notifyTarget.ToCallbackList(), event, e.targetCache, e.userCache, e.notifyConfigCache.GetIbex(), e.astats)
 
 	// handle global webhooks
-	sender.SendWebhooks(notifyTarget.ToWebhookList(), event)
+	sender.SendWebhooks(notifyTarget.ToWebhookList(), event, e.astats)
 
 	// handle plugin call
-	go sender.MayPluginNotify(e.genNoticeBytes(event), e.notifyConfigCache.GetNotifyScript())
+	go sender.MayPluginNotify(e.genNoticeBytes(event), e.notifyConfigCache.GetNotifyScript(), e.astats)
 }
 
 type Notice struct {
```
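Dispatch guards its sender maps with an RWMutex: template reloads and ExtraSenders registration write under the exclusive lock, while the send path only takes read locks. A minimal sketch of that registry pattern, with a hypothetical Sender interface standing in for Nightingale's sender.Sender:

```go
package main

import (
	"fmt"
	"sync"
)

// Sender is a hypothetical stand-in for sender.Sender.
type Sender interface {
	Send(msg string)
}

type consoleSender struct{ name string }

func (c consoleSender) Send(msg string) { fmt.Printf("[%s] %s\n", c.name, msg) }

// Registry is a read-mostly map of named senders guarded by an RWMutex.
type Registry struct {
	mu      sync.RWMutex
	senders map[string]Sender
}

func NewRegistry() *Registry {
	return &Registry{senders: make(map[string]Sender)}
}

// Register is the write path: exclusive lock.
func (r *Registry) Register(channel string, s Sender) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.senders[channel] = s
}

// Dispatch is the hot read path: shared lock, mirroring the
// RLock/RUnlock bracket around the map read in Dispatch.Send.
func (r *Registry) Dispatch(channel, msg string) {
	r.mu.RLock()
	s, ok := r.senders[channel]
	r.mu.RUnlock()
	if !ok {
		fmt.Printf("no sender for channel: %s\n", channel)
		return
	}
	s.Send(msg)
}

func main() {
	r := NewRegistry()
	r.Register("email", consoleSender{name: "email"})
	r.Dispatch("email", "disk usage over 90%")
	r.Dispatch("sms", "unreachable") // no sender registered
}
```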
```diff
@@ -12,6 +12,8 @@ import (
 	"github.com/ccfos/nightingale/v6/memsto"
 	"github.com/ccfos/nightingale/v6/pkg/ctx"
 	"github.com/ccfos/nightingale/v6/prom"
+	"github.com/ccfos/nightingale/v6/tdengine"
 
 	"github.com/toolkits/pkg/logger"
 )
@@ -29,7 +31,8 @@ type Scheduler struct {
 	alertMuteCache  *memsto.AlertMuteCacheType
 	datasourceCache *memsto.DatasourceCacheType
 
-	promClients *prom.PromClientMap
+	promClients     *prom.PromClientMap
+	tdengineClients *tdengine.TdengineClientMap
 
 	naming *naming.Naming
 
@@ -38,8 +41,8 @@ type Scheduler struct {
 }
 
 func NewScheduler(aconf aconf.Alert, externalProcessors *process.ExternalProcessorsType, arc *memsto.AlertRuleCacheType, targetCache *memsto.TargetCacheType,
-	busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType, promClients *prom.PromClientMap, naming *naming.Naming,
-	ctx *ctx.Context, stats *astats.Stats) *Scheduler {
+	busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType,
+	promClients *prom.PromClientMap, tdengineClients *tdengine.TdengineClientMap, naming *naming.Naming, ctx *ctx.Context, stats *astats.Stats) *Scheduler {
 	scheduler := &Scheduler{
 		aconf:      aconf,
 		alertRules: make(map[string]*AlertRuleWorker),
@@ -52,8 +55,9 @@ func NewScheduler(aconf aconf.Alert, externalProcessors *process.ExternalProcess
 		alertMuteCache:  alertMuteCache,
 		datasourceCache: datasourceCache,
 
-		promClients: promClients,
-		naming:      naming,
+		promClients:     promClients,
+		tdengineClients: tdengineClients,
+		naming:          naming,
 
 		ctx:   ctx,
 		stats: stats,
@@ -85,8 +89,10 @@ func (s *Scheduler) syncAlertRules() {
 		if rule == nil {
 			continue
 		}
-		if rule.IsPrometheusRule() {
+
+		if rule.IsPrometheusRule() || rule.IsLokiRule() || rule.IsTdengineRule() {
 			datasourceIds := s.promClients.Hit(rule.DatasourceIdsJson)
+			datasourceIds = append(datasourceIds, s.tdengineClients.Hit(rule.DatasourceIdsJson)...)
 			for _, dsId := range datasourceIds {
 				if !naming.DatasourceHashRing.IsHit(dsId, fmt.Sprintf("%d", rule.Id), s.aconf.Heartbeat.Endpoint) {
 					continue
@@ -101,9 +107,9 @@ func (s *Scheduler) syncAlertRules() {
 					logger.Debugf("datasource %d status is %s", dsId, ds.Status)
 					continue
 				}
-				processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.promClients, s.ctx, s.stats)
+				processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.ctx, s.stats)
 
-				alertRule := NewAlertRuleWorker(rule, dsId, processor, s.promClients, s.ctx)
+				alertRule := NewAlertRuleWorker(rule, dsId, processor, s.promClients, s.tdengineClients, s.ctx)
 				alertRuleWorkers[alertRule.Hash()] = alertRule
 			}
 		} else if rule.IsHostRule() && s.ctx.IsCenter {
@@ -111,8 +117,8 @@ func (s *Scheduler) syncAlertRules() {
 			if !naming.DatasourceHashRing.IsHit(naming.HostDatasource, fmt.Sprintf("%d", rule.Id), s.aconf.Heartbeat.Endpoint) {
 				continue
 			}
-			processor := process.NewProcessor(rule, 0, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.promClients, s.ctx, s.stats)
-			alertRule := NewAlertRuleWorker(rule, 0, processor, s.promClients, s.ctx)
+			processor := process.NewProcessor(rule, 0, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.ctx, s.stats)
+			alertRule := NewAlertRuleWorker(rule, 0, processor, s.promClients, s.tdengineClients, s.ctx)
 			alertRuleWorkers[alertRule.Hash()] = alertRule
 		} else {
 			// if the rule is not evaluated by the prometheus engine, create it as an externalRule
@@ -128,7 +134,7 @@ func (s *Scheduler) syncAlertRules() {
 				logger.Debugf("datasource %d status is %s", dsId, ds.Status)
 				continue
 			}
-			processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.promClients, s.ctx, s.stats)
+			processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.ctx, s.stats)
 			externalRuleWorkers[processor.Key()] = processor
 		}
 	}
```
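The naming.DatasourceHashRing.IsHit checks above shard rule evaluation across alert-engine nodes, so each rule and datasource pair is evaluated by exactly one node. A toy sketch of the idea using a plain FNV hash modulo the node count; a real consistent-hash ring additionally keeps assignments stable when nodes join or leave, which plain modulo does not:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// isHit reports whether this node should evaluate the given rule on the
// given datasource. Simplified stand-in for a consistent-hash ring.
func isHit(nodes []string, self string, dsID int64, ruleID string) bool {
	h := fnv.New32a()
	fmt.Fprintf(h, "%d/%s", dsID, ruleID)
	return nodes[int(h.Sum32())%len(nodes)] == self
}

func main() {
	nodes := []string{"n9e-alert-0", "n9e-alert-1", "n9e-alert-2"} // illustrative endpoints
	for ruleID := 1; ruleID <= 5; ruleID++ {
		hit := isHit(nodes, "n9e-alert-0", 1, fmt.Sprintf("%d", ruleID))
		fmt.Printf("rule %d evaluated on this node: %v\n", ruleID, hit)
	}
}
```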
@@ -11,8 +11,11 @@ import (
|
||||
"github.com/ccfos/nightingale/v6/alert/process"
|
||||
"github.com/ccfos/nightingale/v6/models"
|
||||
"github.com/ccfos/nightingale/v6/pkg/ctx"
|
||||
"github.com/ccfos/nightingale/v6/pkg/hash"
|
||||
"github.com/ccfos/nightingale/v6/pkg/parser"
|
||||
promsdk "github.com/ccfos/nightingale/v6/pkg/prom"
|
||||
"github.com/ccfos/nightingale/v6/prom"
|
||||
"github.com/ccfos/nightingale/v6/tdengine"
|
||||
|
||||
"github.com/toolkits/pkg/logger"
|
||||
"github.com/toolkits/pkg/str"
|
||||
@@ -28,19 +31,21 @@ type AlertRuleWorker struct {
|
||||
|
||||
processor *process.Processor
|
||||
|
||||
promClients *prom.PromClientMap
|
||||
ctx *ctx.Context
|
||||
promClients *prom.PromClientMap
|
||||
tdengineClients *tdengine.TdengineClientMap
|
||||
ctx *ctx.Context
|
||||
}
|
||||
|
||||
func NewAlertRuleWorker(rule *models.AlertRule, datasourceId int64, processor *process.Processor, promClients *prom.PromClientMap, ctx *ctx.Context) *AlertRuleWorker {
|
||||
func NewAlertRuleWorker(rule *models.AlertRule, datasourceId int64, processor *process.Processor, promClients *prom.PromClientMap, tdengineClients *tdengine.TdengineClientMap, ctx *ctx.Context) *AlertRuleWorker {
|
||||
arw := &AlertRuleWorker{
|
||||
datasourceId: datasourceId,
|
||||
quit: make(chan struct{}),
|
||||
rule: rule,
|
||||
processor: processor,
|
||||
|
||||
promClients: promClients,
|
||||
ctx: ctx,
|
||||
promClients: promClients,
|
||||
tdengineClients: tdengineClients,
|
||||
ctx: ctx,
|
||||
}
|
||||
|
||||
return arw
|
||||
@@ -87,17 +92,23 @@ func (arw *AlertRuleWorker) Start() {
|
||||
func (arw *AlertRuleWorker) Eval() {
|
||||
cachedRule := arw.rule
|
||||
if cachedRule == nil {
|
||||
//logger.Errorf("rule_eval:%s rule not found", arw.Key())
|
||||
// logger.Errorf("rule_eval:%s rule not found", arw.Key())
|
||||
return
|
||||
}
|
||||
arw.processor.Stats.CounterRuleEval.WithLabelValues().Inc()
|
||||
|
||||
typ := cachedRule.GetRuleType()
|
||||
var lst []common.AnomalyPoint
|
||||
var anomalyPoints []common.AnomalyPoint
|
||||
var recoverPoints []common.AnomalyPoint
|
||||
switch typ {
|
||||
case models.PROMETHEUS:
|
||||
lst = arw.GetPromAnomalyPoint(cachedRule.RuleConfig)
|
||||
anomalyPoints = arw.GetPromAnomalyPoint(cachedRule.RuleConfig)
|
||||
case models.HOST:
|
||||
lst = arw.GetHostAnomalyPoint(cachedRule.RuleConfig)
|
||||
anomalyPoints = arw.GetHostAnomalyPoint(cachedRule.RuleConfig)
|
||||
case models.TDENGINE:
|
||||
anomalyPoints, recoverPoints = arw.GetTdengineAnomalyPoint(cachedRule, arw.processor.DatasourceId())
|
||||
case models.LOKI:
|
||||
anomalyPoints = arw.GetPromAnomalyPoint(cachedRule.RuleConfig)
|
||||
default:
|
||||
return
|
||||
}
|
||||
@@ -107,7 +118,11 @@ func (arw *AlertRuleWorker) Eval() {
|
||||
return
|
||||
}
|
||||
|
||||
arw.processor.Handle(lst, "inner", arw.inhibit)
|
||||
arw.processor.Handle(anomalyPoints, "inner", arw.inhibit)
|
||||
for _, point := range recoverPoints {
|
||||
str := fmt.Sprintf("%v", point.Value)
|
||||
arw.processor.RecoverSingle(process.Hash(cachedRule.Id, arw.processor.DatasourceId(), point), point.Timestamp, &str)
|
||||
}
|
||||
}
|
||||
|
||||
func (arw *AlertRuleWorker) Stop() {
|
||||
@@ -153,11 +168,13 @@ func (arw *AlertRuleWorker) GetPromAnomalyPoint(ruleConfig string) []common.Anom
|
||||
value, warnings, err := readerClient.Query(context.Background(), promql, time.Now())
|
||||
if err != nil {
|
||||
logger.Errorf("rule_eval:%s promql:%s, error:%v", arw.Key(), promql, err)
|
||||
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
|
||||
continue
|
||||
}
|
||||
|
||||
if len(warnings) > 0 {
|
||||
logger.Errorf("rule_eval:%s promql:%s, warnings:%v", arw.Key(), promql, warnings)
|
||||
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
|
||||
continue
|
||||
}
|
||||
|
||||
@@ -172,6 +189,110 @@ func (arw *AlertRuleWorker) GetPromAnomalyPoint(ruleConfig string) []common.Anom
|
||||
return lst
|
||||
}
|
||||
|
||||
func (arw *AlertRuleWorker) GetTdengineAnomalyPoint(rule *models.AlertRule, dsId int64) ([]common.AnomalyPoint, []common.AnomalyPoint) {
|
||||
// 获取查询和规则判断条件
|
||||
points := []common.AnomalyPoint{}
|
||||
recoverPoints := []common.AnomalyPoint{}
|
||||
ruleConfig := strings.TrimSpace(rule.RuleConfig)
|
||||
if ruleConfig == "" {
|
||||
logger.Warningf("rule_eval:%d promql is blank", rule.Id)
|
||||
return points, recoverPoints
|
||||
}
|
||||
|
||||
var ruleQuery models.RuleQuery
|
||||
err := json.Unmarshal([]byte(ruleConfig), &ruleQuery)
|
||||
if err != nil {
|
||||
logger.Warningf("rule_eval:%d promql parse error:%s", rule.Id, err.Error())
|
||||
return points, recoverPoints
|
||||
}
|
||||
|
||||
if len(ruleQuery.Queries) > 0 {
|
||||
seriesStore := make(map[uint64]*models.DataResp)
|
||||
seriesTagIndex := make(map[uint64][]uint64)
|
||||
|
||||
for _, query := range ruleQuery.Queries {
|
||||
cli := arw.tdengineClients.GetCli(dsId)
|
||||
if cli == nil {
|
||||
logger.Warningf("rule_eval:%d tdengine client is nil", rule.Id)
|
||||
continue
|
||||
}
|
||||
|
||||
series, err := cli.Query(query)
|
||||
if err != nil {
|
||||
logger.Warningf("rule_eval rid:%d query data error: %v", rule.Id, err)
|
||||
continue
|
||||
}
|
||||
|
||||
// 此条日志很重要,是告警判断的现场值
|
||||
logger.Debugf("rule_eval rid:%d req:%+v resp:%+v", rule.Id, query, series)
|
||||
for i := 0; i < len(series); i++ {
|
||||
serieHash := hash.GetHash(series[i].Metric, series[i].Ref)
|
||||
tagHash := hash.GetTagHash(series[i].Metric)
|
||||
seriesStore[serieHash] = series[i]
|
||||
|
||||
// 将曲线按照相同的 tag 分组
|
||||
if _, exists := seriesTagIndex[tagHash]; !exists {
|
||||
seriesTagIndex[tagHash] = make([]uint64, 0)
|
||||
}
|
||||
seriesTagIndex[tagHash] = append(seriesTagIndex[tagHash], serieHash)
|
||||
}
|
||||
}
|
||||
|
||||
// 判断
|
||||
for _, trigger := range ruleQuery.Triggers {
|
||||
for _, seriesHash := range seriesTagIndex {
|
||||
m := make(map[string]float64)
|
||||
var ts int64
|
||||
var sample *models.DataResp
|
||||
var value float64
|
||||
for _, serieHash := range seriesHash {
|
||||
series, exists := seriesStore[serieHash]
|
||||
if !exists {
|
||||
logger.Warningf("rule_eval rid:%d series:%+v not found", rule.Id, series)
|
||||
continue
|
||||
}
|
||||
t, v, exists := series.Last()
|
||||
if !exists {
|
||||
logger.Warningf("rule_eval rid:%d series:%+v value not found", rule.Id, series)
|
||||
continue
|
||||
}
|
||||
|
||||
if !strings.Contains(trigger.Exp, "$"+series.Ref) {
|
||||
// 表达式中不包含该变量
|
||||
continue
|
||||
}
|
||||
|
||||
m["$"+series.Ref] = v
|
||||
m["$"+series.Ref+"."+series.MetricName()] = v
|
||||
ts = int64(t)
|
||||
sample = series
|
||||
value = v
|
||||
}
|
||||
isTriggered := parser.Calc(trigger.Exp, m)
|
||||
// 此条日志很重要,是告警判断的现场值
|
||||
logger.Debugf("rule_eval rid:%d trigger:%+v exp:%s res:%v m:%v", rule.Id, trigger, trigger.Exp, isTriggered, m)
|
||||
|
||||
point := common.AnomalyPoint{
|
||||
Key: sample.MetricName(),
|
||||
Labels: sample.Metric,
|
||||
Timestamp: int64(ts),
|
||||
Value: value,
|
||||
Severity: trigger.Severity,
|
||||
Triggered: isTriggered,
|
||||
}
|
||||
|
||||
if isTriggered {
|
||||
points = append(points, point)
|
||||
} else {
|
||||
recoverPoints = append(recoverPoints, point)
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return points, recoverPoints
|
||||
}
|
||||
|
||||
func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.AnomalyPoint {
	var lst []common.AnomalyPoint
	var severity int

@@ -201,6 +322,7 @@ func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.Anom
			targets, err := models.MissTargetGetsByFilter(arw.ctx, query, t)
			if err != nil {
				logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
				arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
				continue
			}
			for _, target := range targets {

@@ -222,6 +344,7 @@ func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.Anom
			targets, err := models.TargetGetsByFilter(arw.ctx, query, 0, 0)
			if err != nil {
				logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
				arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
				continue
			}
			var targetMap = make(map[string]*models.Target)

@@ -253,12 +376,14 @@ func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.Anom
			count, err := models.MissTargetCountByFilter(arw.ctx, query, t)
			if err != nil {
				logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
				arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
				continue
			}

			total, err := models.TargetCountByFilter(arw.ctx, query)
			if err != nil {
				logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
				arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
				continue
			}
			pct := float64(count) / float64(total) * 100
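The percentage above divides missing targets by all matched targets. The hunk does not show a guard for an empty match, so the sketch below is a hedged variant that avoids a NaN/Inf percentage when `total` is zero:

```go
// Share of matched hosts currently offline; 0/0 would otherwise
// yield NaN and n/0 would yield +Inf in Go's float division.
pct := 0.0
if total > 0 {
	pct = float64(count) / float64(total) * 100
}
```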
@@ -69,7 +69,7 @@ type Processor struct {

	promClients *prom.PromClientMap
	ctx         *ctx.Context
	stats       *astats.Stats
	Stats       *astats.Stats

	HandleFireEventHook    HandleEventFunc
	HandleRecoverEventHook HandleEventFunc

@@ -94,7 +94,7 @@ func (p *Processor) Hash() string {
}

func NewProcessor(rule *models.AlertRule, datasourceId int64, atertRuleCache *memsto.AlertRuleCacheType, targetCache *memsto.TargetCacheType,
	busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType, promClients *prom.PromClientMap, ctx *ctx.Context,
	busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType, ctx *ctx.Context,
	stats *astats.Stats) *Processor {

	p := &Processor{

@@ -107,9 +107,8 @@ func NewProcessor(rule *models.AlertRule, datasourceId int64, atertRuleCache *me
		atertRuleCache:  atertRuleCache,
		datasourceCache: datasourceCache,

		promClients: promClients,
		ctx:         ctx,
		stats:       stats,
		ctx:         ctx,
		Stats:       stats,

		HandleFireEventHook:    func(event *models.AlertCurEvent) {},
		HandleRecoverEventHook: func(event *models.AlertCurEvent) {},

@@ -142,6 +141,7 @@ func (p *Processor) Handle(anomalyPoints []common.AnomalyPoint, from string, inh
		hash := event.Hash
		alertingKeys[hash] = struct{}{}
		if mute.IsMuted(cachedRule, event, p.TargetCache, p.alertMuteCache) {
			p.Stats.CounterMuteTotal.WithLabelValues(event.GroupName).Inc()
			logger.Debugf("rule_eval:%s event:%v is muted", p.Key(), event)
			continue
		}

@@ -350,7 +350,6 @@ func (p *Processor) pushEventToQueue(e *models.AlertCurEvent) {
		p.fires.Set(e.Hash, e)
	}

	p.stats.CounterAlertsTotal.WithLabelValues(fmt.Sprintf("%d", e.DatasourceId)).Inc()
	dispatch.LogEvent(e, "push_queue")
	if !queue.EventQueue.PushFront(e) {
		logger.Warningf("event_push_queue: queue is full, event:%+v", e)

@@ -428,7 +427,13 @@ func (p *Processor) mayHandleIdent() {
		if target, exists := p.TargetCache.Get(ident); exists {
			p.target = target.Ident
			p.targetNote = target.Note
		} else {
			p.target = ident
			p.targetNote = ""
		}
	} else {
		p.target = ""
		p.targetNote = ""
	}
}
@@ -6,6 +6,7 @@ import (
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/prom"
	"github.com/ccfos/nightingale/v6/pushgw/writer"

@@ -18,18 +19,18 @@ type RecordRuleContext struct {
	datasourceId int64
	quit         chan struct{}

	rule *models.RecordingRule
	// writers *writer.WritersType
	rule        *models.RecordingRule
	promClients *prom.PromClientMap
	stats       *astats.Stats
}

func NewRecordRuleContext(rule *models.RecordingRule, datasourceId int64, promClients *prom.PromClientMap, writers *writer.WritersType) *RecordRuleContext {
func NewRecordRuleContext(rule *models.RecordingRule, datasourceId int64, promClients *prom.PromClientMap, writers *writer.WritersType, stats *astats.Stats) *RecordRuleContext {
	return &RecordRuleContext{
		datasourceId: datasourceId,
		quit:         make(chan struct{}),
		rule:         rule,
		promClients:  promClients,
		//writers: writers,
		stats: stats,
	}
}

@@ -70,6 +71,7 @@ func (rrc *RecordRuleContext) Start() {
}

func (rrc *RecordRuleContext) Eval() {
	rrc.stats.CounterRecordEval.WithLabelValues().Inc()
	promql := strings.TrimSpace(rrc.rule.PromQl)
	if promql == "" {
		logger.Errorf("eval:%s promql is blank", rrc.Key())

@@ -78,17 +80,20 @@ func (rrc *RecordRuleContext) Eval() {

	if rrc.promClients.IsNil(rrc.datasourceId) {
		logger.Errorf("eval:%s reader client is nil", rrc.Key())
		rrc.stats.CounterRecordEvalErrorTotal.WithLabelValues().Inc()
		return
	}

	value, warnings, err := rrc.promClients.GetCli(rrc.datasourceId).Query(context.Background(), promql, time.Now())
	if err != nil {
		logger.Errorf("eval:%s promql:%s, error:%v", rrc.Key(), promql, err)
		rrc.stats.CounterRecordEvalErrorTotal.WithLabelValues().Inc()
		return
	}

	if len(warnings) > 0 {
		logger.Errorf("eval:%s promql:%s, warnings:%v", rrc.Key(), promql, warnings)
		rrc.stats.CounterRecordEvalErrorTotal.WithLabelValues().Inc()
		return
	}
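On success, a recording rule would typically turn the instant-vector result into renamed samples for the write path (the `writers` field is commented out in this diff, so the forwarding itself is out of scope here). A rough sketch of that conversion, assuming the result is a `model.Vector` from the Prometheus client libraries:

```go
package main

import (
	"fmt"

	"github.com/prometheus/common/model"
)

// sample is a hypothetical carrier type for the write path.
type sample struct {
	Labels    map[string]string
	Value     float64
	Timestamp int64
}

// toSamples renames each vector element to the recording rule's metric
// name and flattens its labels, mirroring what a recording rule emits.
func toSamples(value model.Value, newName string) []sample {
	vector, ok := value.(model.Vector)
	if !ok {
		return nil // only instant vectors are handled in this sketch
	}
	out := make([]sample, 0, len(vector))
	for _, s := range vector {
		labels := make(map[string]string, len(s.Metric)+1)
		for k, v := range s.Metric {
			labels[string(k)] = string(v)
		}
		labels["__name__"] = newName
		out = append(out, sample{Labels: labels, Value: float64(s.Value), Timestamp: s.Timestamp.Unix()})
	}
	return out
}

func main() {
	v := model.Vector{{Metric: model.Metric{"host": "web-01"}, Value: 0.42, Timestamp: model.Now()}}
	fmt.Printf("%+v\n", toSamples(v, "host:cpu_usage:avg"))
}
```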
@@ -72,7 +72,7 @@ func (s *Scheduler) syncRecordRules() {
			continue
		}

		recordRule := NewRecordRuleContext(rule, dsId, s.promClients, s.writers)
		recordRule := NewRecordRuleContext(rule, dsId, s.promClients, s.writers, s.stats)
		recordRules[recordRule.Hash()] = recordRule
	}
}
@@ -72,8 +72,6 @@ func (rt *Router) pushEventToQueue(c *gin.Context) {
	event.NotifyChannels = strings.Join(event.NotifyChannelsJSON, " ")
	event.NotifyGroups = strings.Join(event.NotifyGroupsJSON, " ")

	rt.AlertStats.CounterAlertsTotal.WithLabelValues(event.Cluster).Inc()

	dispatch.LogEvent(event, "http_push_queue")
	if !queue.EventQueue.PushFront(event) {
		msg := fmt.Sprintf("event:%+v push_queue err: queue is full", event)
@@ -7,6 +7,7 @@ import (
	"time"

	"github.com/ccfos/nightingale/v6/alert/aconf"
	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/memsto"
	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/pkg/ctx"

@@ -16,7 +17,8 @@ import (
	"github.com/toolkits/pkg/logger"
)

func SendCallbacks(ctx *ctx.Context, urls []string, event *models.AlertCurEvent, targetCache *memsto.TargetCacheType, userCache *memsto.UserCacheType, ibexConf aconf.Ibex) {
func SendCallbacks(ctx *ctx.Context, urls []string, event *models.AlertCurEvent, targetCache *memsto.TargetCacheType, userCache *memsto.UserCacheType,
	ibexConf aconf.Ibex, stats *astats.Stats) {
	for _, url := range urls {
		if url == "" {
			continue

@@ -33,9 +35,11 @@ func SendCallbacks(ctx *ctx.Context, urls []string, event *models.AlertCurEvent,
			url = "http://" + url
		}

		stats.AlertNotifyTotal.WithLabelValues("rule_callback").Inc()
		resp, code, err := poster.PostJSON(url, 5*time.Second, event, 3)
		if err != nil {
			logger.Errorf("event_callback_fail(rule_id=%d url=%s), resp: %s, err: %v, code: %d", event.RuleId, url, string(resp), err, code)
			stats.AlertNotifyErrorTotal.WithLabelValues("rule_callback").Inc()
		} else {
			logger.Infof("event_callback_succ(rule_id=%d url=%s), resp: %s, code: %d", event.RuleId, url, string(resp), code)
		}

@@ -92,7 +96,7 @@ func handleIbex(ctx *ctx.Context, url string, event *models.AlertCurEvent, targe
		return
	}

	tpl, err := models.TaskTplGet(ctx, "id = ?", id)
	tpl, err := models.TaskTplGetById(ctx, id)
	if err != nil {
		logger.Errorf("event_callback_ibex: failed to get tpl: %v", err)
		return
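The `AlertNotifyTotal`/`AlertNotifyErrorTotal` pairs incremented in this and the following sender hunks follow the standard Prometheus CounterVec pattern. A minimal sketch of how such a stats bundle could be declared (the actual `astats` definitions are not part of this diff, so anything beyond the field names used above is an assumption):

```go
package astats

import "github.com/prometheus/client_golang/prometheus"

// Stats groups the notify counters used by the senders.
type Stats struct {
	AlertNotifyTotal      *prometheus.CounterVec
	AlertNotifyErrorTotal *prometheus.CounterVec
}

func NewStats() *Stats {
	s := &Stats{
		AlertNotifyTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "alert_notify_total",
			Help: "Notify attempts, labeled by channel.",
		}, []string{"channel"}),
		AlertNotifyErrorTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
			Name: "alert_notify_error_total",
			Help: "Failed notify attempts, labeled by channel.",
		}, []string{"channel"}),
	}
	// Register so the counters appear on the /metrics endpoint.
	prometheus.MustRegister(s.AlertNotifyTotal, s.AlertNotifyErrorTotal)
	return s
}
```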
@@ -5,6 +5,7 @@ import (
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/pkg/poster"

@@ -66,7 +67,8 @@ func (ds *DingtalkSender) Send(ctx MessageContext) {
				},
			}
		}
		ds.doSend(url, body)
		doSend(url, body, models.Dingtalk, ctx.Stats)
	}
}

@@ -81,7 +83,7 @@ func (ds *DingtalkSender) extract(users []*models.User) ([]string, []string) {
	}
	if token, has := user.ExtractToken(models.Dingtalk); has {
		url := token
		if !strings.HasPrefix(token, "https://") {
		if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
			url = "https://oapi.dingtalk.com/robot/send?access_token=" + token
		}
		urls = append(urls, url)

@@ -90,11 +92,14 @@ func (ds *DingtalkSender) extract(users []*models.User) ([]string, []string) {
	return urls, ats
}

func (ds *DingtalkSender) doSend(url string, body dingtalk) {
func doSend(url string, body interface{}, channel string, stats *astats.Stats) {
	stats.AlertNotifyTotal.WithLabelValues(channel).Inc()

	res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
	if err != nil {
		logger.Errorf("dingtalk_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
		logger.Errorf("%s_sender: result=fail url=%s code=%d error=%v response=%s", channel, url, code, err, string(res))
		stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
	} else {
		logger.Infof("dingtalk_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
		logger.Infof("%s_sender: result=succ url=%s code=%d response=%s", channel, url, code, string(res))
	}
}
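With the per-channel `doSend` methods collapsed into this one helper, every sender differs only in payload construction and webhook URL; logging and the success/error counters become uniform. The call shape at a sender site then looks like this (the payload value is illustrative only; any JSON-serializable body works):

```go
// Build the channel-specific payload, then hand off to the shared
// helper, which posts it and bumps the per-channel counters.
body := map[string]string{"msgtype": "text"} // illustrative payload
doSend(url, body, models.Dingtalk, ctx.Stats)
```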
@@ -2,6 +2,7 @@ package sender

import (
	"crypto/tls"
	"errors"
	"html/template"
	"time"

@@ -35,6 +36,8 @@ func (es *EmailSender) Send(ctx MessageContext) {
	}
	content := BuildTplMessage(es.contentTpl, ctx.Events)
	es.WriteEmail(subject, content, tos)

	ctx.Stats.AlertNotifyTotal.WithLabelValues(models.Email).Add(float64(len(tos)))
}

func extract(users []*models.User) []string {

@@ -47,7 +50,7 @@ func extract(users []*models.User) []string {
	return tos
}

func (es *EmailSender) SendEmail(subject, content string, tos []string, stmp aconf.SMTPConfig) {
func SendEmail(subject, content string, tos []string, stmp aconf.SMTPConfig) error {
	conf := stmp

	d := gomail.NewDialer(conf.Host, conf.Port, conf.User, conf.Pass)

@@ -64,8 +67,9 @@ func (es *EmailSender) SendEmail(subject, content string, tos []string, stmp aco

	err := d.DialAndSend(m)
	if err != nil {
		logger.Errorf("email_sender: failed to send: %v", err)
		return errors.New("email_sender: failed to send: " + err.Error())
	}
	return nil
}

func (es *EmailSender) WriteEmail(subject, content string, tos []string) {

@@ -96,14 +100,16 @@ var mailQuit = make(chan struct{})
func RestartEmailSender(smtp aconf.SMTPConfig) {
	close(mailQuit)
	mailQuit = make(chan struct{})
	StartEmailSender(smtp)
	startEmailSender(smtp)
}

func StartEmailSender(smtp aconf.SMTPConfig) {
func InitEmailSender(smtp aconf.SMTPConfig) {
	mailch = make(chan *gomail.Message, 100000)
	startEmailSender(smtp)
}

func startEmailSender(smtp aconf.SMTPConfig) {
	conf := smtp

	if conf.Host == "" || conf.Port == 0 {
		logger.Warning("SMTP configurations invalid")
		return
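`InitEmailSender` allocates the buffered mail channel exactly once, while `RestartEmailSender` only swaps the worker: closing `mailQuit` stops the old goroutine and a new one is started against the fresh SMTP config, so messages still buffered in `mailch` survive the reload. A simplified sketch of that quit-channel worker pattern (strings stand in for `*gomail.Message`):

```go
package main

import (
	"fmt"
	"time"
)

var (
	mailch   = make(chan string, 100) // stands in for chan *gomail.Message
	mailQuit = make(chan struct{})
)

// startWorker drains mailch until quit is closed.
func startWorker(quit chan struct{}, cfg string) {
	go func() {
		for {
			select {
			case m := <-mailch:
				fmt.Println("send via", cfg+":", m) // real code dials SMTP here
			case <-quit:
				return
			}
		}
	}()
}

// restart stops the old worker and starts a new one with new config;
// queued messages are not lost because mailch itself is untouched.
func restart(cfg string) {
	close(mailQuit)
	mailQuit = make(chan struct{})
	startWorker(mailQuit, cfg)
}

func main() {
	startWorker(mailQuit, "smtp-old")
	mailch <- "hello"
	restart("smtp-new")
	mailch <- "world"
	time.Sleep(100 * time.Millisecond) // let the worker drain
}
```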
@@ -3,12 +3,8 @@ package sender
import (
	"html/template"
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/pkg/poster"

	"github.com/toolkits/pkg/logger"
)

type feishuContent struct {

@@ -49,7 +45,7 @@ func (fs *FeishuSender) Send(ctx MessageContext) {
			IsAtAll: false,
		}
	}
	fs.doSend(url, body)
	doSend(url, body, models.Feishu, ctx.Stats)
	}
}

@@ -63,7 +59,7 @@ func (fs *FeishuSender) extract(users []*models.User) ([]string, []string) {
	}
	if token, has := user.ExtractToken(models.Feishu); has {
		url := token
		if !strings.HasPrefix(token, "https://") {
		if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
			url = "https://open.feishu.cn/open-apis/bot/v2/hook/" + token
		}
		urls = append(urls, url)

@@ -71,12 +67,3 @@ func (fs *FeishuSender) extract(users []*models.User) ([]string, []string) {
	}
	return urls, ats
}

func (fs *FeishuSender) doSend(url string, body feishu) {
	res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
	if err != nil {
		logger.Errorf("feishu_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
	} else {
		logger.Infof("feishu_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
	}
}
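The same two-scheme prefix check is patched identically in the Dingtalk, Feishu, FeishuCard, Telegram, and Wecom extractors: a stored token counts as a full webhook URL only when it carries an `https://` or `http://` scheme, otherwise it is appended to the channel's default endpoint. That repeated logic could be factored into one helper; a hedged sketch (not part of the diff):

```go
package sender

import "strings"

// webhookURL treats token as a complete URL when it already carries an
// HTTP(S) scheme; otherwise it joins it onto the channel's base endpoint.
func webhookURL(token, base string) string {
	if strings.HasPrefix(token, "https://") || strings.HasPrefix(token, "http://") {
		return token
	}
	return base + strings.TrimSpace(token)
}

// e.g. webhookURL(token, "https://open.feishu.cn/open-apis/bot/v2/hook/")
```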
@@ -4,12 +4,8 @@ import (
	"fmt"
	"html/template"
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/pkg/poster"

	"github.com/toolkits/pkg/logger"
)

type Conf struct {

@@ -115,7 +111,7 @@ func (fs *FeishuCardSender) Send(ctx MessageContext) {
	body.Card.Elements[0].Text.Content = message
	body.Card.Elements[2].Elements[0].Content = SendTitle
	for _, url := range urls {
		fs.doSend(url, body)
		doSend(url, body, models.FeishuCard, ctx.Stats)
	}
}

@@ -125,7 +121,7 @@ func (fs *FeishuCardSender) extract(users []*models.User) ([]string, []string) {
	for i := range users {
		if token, has := users[i].ExtractToken(models.FeishuCard); has {
			url := token
			if !strings.HasPrefix(token, "https://") {
			if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
				url = "https://open.feishu.cn/open-apis/bot/v2/hook/" + strings.TrimSpace(token)
			}
			urls = append(urls, url)

@@ -133,12 +129,3 @@ func (fs *FeishuCardSender) extract(users []*models.User) ([]string, []string) {
	}
	return urls, ats
}

func (fs *FeishuCardSender) doSend(url string, body feishuCard) {
	res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
	if err != nil {
		logger.Errorf("feishucard_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
	} else {
		logger.Debugf("feishucard_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
	}
}
@@ -4,10 +4,9 @@ import (
	"html/template"
	"net/url"
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/pkg/poster"

	"github.com/toolkits/pkg/logger"
)

@@ -15,6 +14,7 @@ import (
type MatterMostMessage struct {
	Text   string
	Tokens []string
	Stats  *astats.Stats
}

type mm struct {

@@ -41,6 +41,7 @@ func (ms *MmSender) Send(ctx MessageContext) {
	SendMM(MatterMostMessage{
		Text:   message,
		Tokens: urls,
		Stats:  ctx.Stats,
	})
}

@@ -87,13 +88,7 @@ func SendMM(message MatterMostMessage) {
			Username: username,
			Text:     txt + message.Text,
		}

		res, code, err := poster.PostJSON(ur, time.Second*5, body, 3)
		if err != nil {
			logger.Errorf("mm_sender: result=fail url=%s code=%d error=%v response=%s", ur, code, err, string(res))
		} else {
			logger.Infof("mm_sender: result=succ url=%s code=%d response=%s", ur, code, string(res))
		}
		doSend(ur, body, models.Mm, message.Stats)
	}
}
}
@@ -6,6 +6,7 @@ import (
	"os/exec"
	"time"

	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/models"

	"github.com/toolkits/pkg/file"

@@ -13,20 +14,22 @@ import (
	"github.com/toolkits/pkg/sys"
)

func MayPluginNotify(noticeBytes []byte, notifyScript models.NotifyScript) {
func MayPluginNotify(noticeBytes []byte, notifyScript models.NotifyScript, stats *astats.Stats) {
	if len(noticeBytes) == 0 {
		return
	}
	alertingCallScript(noticeBytes, notifyScript)
	alertingCallScript(noticeBytes, notifyScript, stats)
}

func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {
func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript, stats *astats.Stats) {
	// not enabled or no notify.py? do nothing
	config := notifyScript
	if !config.Enable || config.Content == "" {
		return
	}

	channel := "script"
	stats.AlertNotifyTotal.WithLabelValues(channel).Inc()
	fpath := ".notify_scriptt"
	if config.Type == 1 {
		fpath = config.Content

@@ -36,6 +39,7 @@ func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {
	oldContent, err := file.ToString(fpath)
	if err != nil {
		logger.Errorf("event_script_notify_fail: read script file err: %v", err)
		stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
		return
	}

@@ -48,12 +52,14 @@ func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {
	_, err := file.WriteString(fpath, config.Content)
	if err != nil {
		logger.Errorf("event_script_notify_fail: write script file err: %v", err)
		stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
		return
	}

	err = os.Chmod(fpath, 0777)
	if err != nil {
		logger.Errorf("event_script_notify_fail: chmod script file err: %v", err)
		stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
		return
	}
}

@@ -83,13 +89,14 @@ func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {

	if err != nil {
		logger.Errorf("event_script_notify_fail: kill process %s occur error %v", fpath, err)
		stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
	}

	return
}

if err != nil {
	logger.Errorf("event_script_notify_fail: exec script %s occur error: %v, output: %s", fpath, err, buf.String())
	stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
	return
}
@@ -5,6 +5,7 @@ import (
	"html/template"

	"github.com/ccfos/nightingale/v6/alert/aconf"
	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/memsto"
	"github.com/ccfos/nightingale/v6/models"
)

@@ -20,10 +21,11 @@ type (
		Users  []*models.User
		Rule   *models.AlertRule
		Events []*models.AlertCurEvent
		Stats  *astats.Stats
	}
)

func NewSender(key string, tpls map[string]*template.Template, smtp aconf.SMTPConfig) Sender {
func NewSender(key string, tpls map[string]*template.Template, smtp ...aconf.SMTPConfig) Sender {
	switch key {
	case models.Dingtalk:
		return &DingtalkSender{tpl: tpls[models.Dingtalk]}

@@ -34,7 +36,7 @@ func NewSender(key string, tpls map[string]*template.Template, smtp aconf.SMTPCo
	case models.FeishuCard:
		return &FeishuCardSender{tpl: tpls[models.FeishuCard]}
	case models.Email:
		return &EmailSender{subjectTpl: tpls["mailsubject"], contentTpl: tpls[models.Email], smtp: smtp}
		return &EmailSender{subjectTpl: tpls[models.EmailSubject], contentTpl: tpls[models.Email], smtp: smtp[0]}
	case models.Mm:
		return &MmSender{tpl: tpls[models.Mm]}
	case models.Telegram:

@@ -43,12 +45,13 @@ func NewSender(key string, tpls map[string]*template.Template, smtp aconf.SMTPCo
	return nil
}

func BuildMessageContext(rule *models.AlertRule, events []*models.AlertCurEvent, uids []int64, userCache *memsto.UserCacheType) MessageContext {
func BuildMessageContext(rule *models.AlertRule, events []*models.AlertCurEvent, uids []int64, userCache *memsto.UserCacheType, stats *astats.Stats) MessageContext {
	users := userCache.GetByUserIds(uids)
	return MessageContext{
		Rule:   rule,
		Events: events,
		Users:  users,
		Stats:  stats,
	}
}
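Making the SMTP argument variadic lets webhook-style senders be built without a config while email reads `smtp[0]`. The two call shapes this enables (note that constructing the email sender without a config would panic on the `smtp[0]` index, so callers must pass one for `models.Email`):

```go
// Webhook-style channels need no SMTP config at all.
ding := NewSender(models.Dingtalk, tpls)

// Email still requires one; it is read as smtp[0] inside the constructor.
mail := NewSender(models.Email, tpls, smtpConfig)
```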
@@ -3,10 +3,9 @@ package sender
import (
	"html/template"
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/pkg/poster"

	"github.com/toolkits/pkg/logger"
)

@@ -14,6 +13,7 @@ import (
type TelegramMessage struct {
	Text   string
	Tokens []string
	Stats  *astats.Stats
}

type telegram struct {

@@ -35,6 +35,7 @@ func (ts *TelegramSender) Send(ctx MessageContext) {
	SendTelegram(TelegramMessage{
		Text:   message,
		Tokens: tokens,
		Stats:  ctx.Stats,
	})
}

@@ -55,7 +56,7 @@ func SendTelegram(message TelegramMessage) {
			continue
		}
		var url string
		if strings.HasPrefix(message.Tokens[i], "https://") {
		if strings.HasPrefix(message.Tokens[i], "https://") || strings.HasPrefix(message.Tokens[i], "http://") {
			url = message.Tokens[i]
		} else {
			array := strings.Split(message.Tokens[i], "/")

@@ -72,11 +73,6 @@ func SendTelegram(message TelegramMessage) {
			Text: message.Text,
		}

		res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
		if err != nil {
			logger.Errorf("telegram_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
		} else {
			logger.Infof("telegram_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
		}
		doSend(url, body, models.Telegram, message.Stats)
	}
}
@@ -3,16 +3,17 @@ package sender
import (
	"bytes"
	"encoding/json"
	"io/ioutil"
	"io"
	"net/http"
	"time"

	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/models"

	"github.com/toolkits/pkg/logger"
)

func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent) {
func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent, stats *astats.Stats) {
	for _, conf := range webhooks {
		if conf.Url == "" || !conf.Enable {
			continue

@@ -50,9 +51,11 @@ func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent) {
			Timeout: time.Duration(conf.Timeout) * time.Second,
		}

		stats.AlertNotifyTotal.WithLabelValues("webhook").Inc()
		var resp *http.Response
		resp, err = client.Do(req)
		if err != nil {
			stats.AlertNotifyErrorTotal.WithLabelValues("webhook").Inc()
			logger.Errorf("event_webhook_fail, ruleId: [%d], eventId: [%d], url: [%s], error: [%s]", event.RuleId, event.Id, conf.Url, err)
			continue
		}

@@ -60,7 +63,7 @@ func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent) {
		var body []byte
		if resp.Body != nil {
			defer resp.Body.Close()
			body, _ = ioutil.ReadAll(resp.Body)
			body, _ = io.ReadAll(resp.Body)
		}

		logger.Debugf("event_webhook_succ, url: %s, response code: %d, body: %s", conf.Url, resp.StatusCode, string(body))
@@ -3,12 +3,8 @@ package sender
import (
	"html/template"
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/models"
	"github.com/ccfos/nightingale/v6/pkg/poster"

	"github.com/toolkits/pkg/logger"
)

type wecomMarkdown struct {

@@ -37,7 +33,7 @@ func (ws *WecomSender) Send(ctx MessageContext) {
			Content: message,
		},
	}
	ws.doSend(url, body)
	doSend(url, body, models.Wecom, ctx.Stats)
	}
}

@@ -46,7 +42,7 @@ func (ws *WecomSender) extract(users []*models.User) []string {
	for _, user := range users {
		if token, has := user.ExtractToken(models.Wecom); has {
			url := token
			if !strings.HasPrefix(token, "https://") {
			if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
				url = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=" + token
			}
			urls = append(urls, url)

@@ -54,12 +50,3 @@ func (ws *WecomSender) extract(users []*models.User) []string {
	}
	return urls
}

func (ws *WecomSender) doSend(url string, body wecom) {
	res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
	if err != nil {
		logger.Errorf("wecom_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
	} else {
		logger.Infof("wecom_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
	}
}
@@ -15,8 +15,14 @@ var Plugins = []Plugin{
	},
	{
		Id:       3,
		Category: "logging",
		Type:     "jaeger",
		TypeName: "Jaeger",
		Category: "loki",
		Type:     "loki",
		TypeName: "Loki",
	},
	{
		Id:       4,
		Category: "timeseries",
		Type:     "tdengine",
		TypeName: "TDengine",
	},
}
center/cconf/sql_tpl.go (new file, 15 lines)
@@ -0,0 +1,15 @@
package cconf

var TDengineSQLTpl = map[string]string{
	"load5":                "SELECT _wstart as ts, last(load5) FROM $database.system WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) fill(null)",
	"process_total":        "SELECT _wstart as ts, last(total) FROM $database.processes WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) fill(null)",
	"thread_total":         "SELECT _wstart as ts, last(total) FROM $database.threads WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) fill(null)",
	"cpu_idle":             "SELECT _wstart as ts, last(usage_idle) * -1 + 100 FROM $database.cpu WHERE (host = '$server' and cpu = 'cpu-total') and _ts >= $from and _ts <= $to interval($interval) fill(null)",
	"mem_used_percent":     "SELECT _wstart as ts, last(used_percent) FROM $database.mem WHERE (host = '$server') and _ts >= $from and _ts <= $to interval($interval) fill(null)",
	"disk_used_percent":    "SELECT _wstart as ts, last(used_percent) FROM $database.disk WHERE (host = '$server' and path = '/') and _ts >= $from and _ts <= $to interval($interval) fill(null)",
	"cpu_context_switches": "select ts, derivative(context_switches, 1s, 0) as context FROM (SELECT _wstart as ts, avg(context_switches) as context_switches FROM $database.kernel WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) )",
	"tcp":                  "SELECT _wstart as ts, avg(tcp_close) as CLOSED, avg(tcp_close_wait) as CLOSE_WAIT, avg(tcp_closing) as CLOSING, avg(tcp_established) as ESTABLISHED, avg(tcp_fin_wait1) as FIN_WAIT1, avg(tcp_fin_wait2) as FIN_WAIT2, avg(tcp_last_ack) as LAST_ACK, avg(tcp_syn_recv) as SYN_RECV, avg(tcp_syn_sent) as SYN_SENT, avg(tcp_time_wait) as TIME_WAIT FROM $database.netstat WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval)",
	"net_bytes_recv":       "SELECT _wstart as ts, derivative(bytes_recv,1s, 1) as bytes_in FROM $database.net WHERE host = '$server' and interface = '$netif' and _ts >= $from and _ts <= $to group by tbname",
	"net_bytes_sent":       "SELECT _wstart as ts, derivative(bytes_sent,1s, 1) as bytes_out FROM $database.net WHERE host = '$server' and interface = '$netif' and _ts >= $from and _ts <= $to group by tbname",
	"disk_total":           "SELECT _wstart as ts, avg(total) AS total, avg(used) as used FROM $database.disk WHERE path = '$mountpoint' and _ts >= $from and _ts <= $to interval($interval) group by host",
}
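The templates carry `$database`, `$server`, `$from`, `$to`, `$interval`, `$netif`, and `$mountpoint` placeholders, so the serving side presumably substitutes user-selected values before the SQL reaches TDengine. A minimal sketch of such a substitution using plain string replacement (the actual rendering code is not shown in this diff):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	tpl := "SELECT _wstart as ts, last(load5) FROM $database.system " +
		"WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) fill(null)"

	// One pass over the template replaces every placeholder.
	r := strings.NewReplacer(
		"$database", "telegraf",
		"$server", "web-01",
		"$from", "1700000000000",
		"$to", "1700003600000",
		"$interval", "60s",
	)
	fmt.Println(r.Replace(tpl))
}
```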
@@ -8,6 +8,7 @@ import (
	"github.com/ccfos/nightingale/v6/alert/astats"
	"github.com/ccfos/nightingale/v6/alert/process"
	"github.com/ccfos/nightingale/v6/center/cconf"
	"github.com/ccfos/nightingale/v6/center/cstats"
	"github.com/ccfos/nightingale/v6/center/metas"
	"github.com/ccfos/nightingale/v6/center/sso"
	"github.com/ccfos/nightingale/v6/conf"

@@ -19,10 +20,12 @@ import (
	"github.com/ccfos/nightingale/v6/pkg/httpx"
	"github.com/ccfos/nightingale/v6/pkg/i18nx"
	"github.com/ccfos/nightingale/v6/pkg/logx"
	"github.com/ccfos/nightingale/v6/pkg/version"
	"github.com/ccfos/nightingale/v6/prom"
	"github.com/ccfos/nightingale/v6/pushgw/idents"
	"github.com/ccfos/nightingale/v6/pushgw/writer"
	"github.com/ccfos/nightingale/v6/storage"
	"github.com/ccfos/nightingale/v6/tdengine"

	alertrt "github.com/ccfos/nightingale/v6/alert/router"
	centerrt "github.com/ccfos/nightingale/v6/center/router"

@@ -43,7 +46,8 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
		return nil, err
	}

	i18nx.Init()
	i18nx.Init(configDir)
	cstats.Init()

	db, err := storage.New(config.DB)
	if err != nil {

@@ -76,16 +80,19 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
	userGroupCache := memsto.NewUserGroupCache(ctx, syncStats)

	promClients := prom.NewPromClient(ctx, config.Alert.Heartbeat)
	tdengineClients := tdengine.NewTdengineClient(ctx, config.Alert.Heartbeat)

	externalProcessors := process.NewExternalProcessors()
	alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, userCache, userGroupCache)
	alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, tdengineClients, userCache, userGroupCache)

	writers := writer.NewWriters(config.Pushgw)

	httpx.InitRSAConfig(&config.HTTP.RSA)
	go version.GetGithubVersion()

	alertrtRouter := alertrt.New(config.HTTP, config.Alert, alertMuteCache, targetCache, busiGroupCache, alertStats, ctx, externalProcessors)
	centerRouter := centerrt.New(config.HTTP, config.Center, cconf.Operations, dsCache, notifyConfigCache, promClients, redis, sso, ctx, metas, idents, targetCache, userCache, userGroupCache)
	centerRouter := centerrt.New(config.HTTP, config.Center, cconf.Operations, dsCache, notifyConfigCache, promClients, tdengineClients,
		redis, sso, ctx, metas, idents, targetCache, userCache, userGroupCache)
	pushgwRouter := pushgwrt.New(config.HTTP, config.Pushgw, targetCache, busiGroupCache, idents, writers, ctx)

	r := httpx.GinEngine(config.Global.RunMode, config.HTTP)
@@ -17,12 +17,15 @@ import (
	"github.com/ccfos/nightingale/v6/pkg/aop"
	"github.com/ccfos/nightingale/v6/pkg/ctx"
	"github.com/ccfos/nightingale/v6/pkg/httpx"
	"github.com/ccfos/nightingale/v6/pkg/version"
	"github.com/ccfos/nightingale/v6/prom"
	"github.com/ccfos/nightingale/v6/pushgw/idents"
	"github.com/ccfos/nightingale/v6/storage"
	"github.com/ccfos/nightingale/v6/tdengine"

	"github.com/gin-gonic/gin"
	"github.com/rakyll/statik/fs"
	"github.com/toolkits/pkg/ginx"
	"github.com/toolkits/pkg/logger"
	"github.com/toolkits/pkg/runner"
)

@@ -34,6 +37,7 @@ type Router struct {
	DatasourceCache   *memsto.DatasourceCacheType
	NotifyConfigCache *memsto.NotifyConfigCacheType
	PromClients       *prom.PromClientMap
	TdendgineClients  *tdengine.TdengineClientMap
	Redis             storage.Redis
	MetaSet           *metas.Set
	IdentSet          *idents.Set

@@ -42,10 +46,12 @@ type Router struct {
	UserCache      *memsto.UserCacheType
	UserGroupCache *memsto.UserGroupCacheType
	Ctx            *ctx.Context

	DatasourceCheckHook func(*gin.Context) bool
}

func New(httpConfig httpx.Config, center cconf.Center, operations cconf.Operation, ds *memsto.DatasourceCacheType, ncc *memsto.NotifyConfigCacheType,
	pc *prom.PromClientMap, redis storage.Redis, sso *sso.SsoClient, ctx *ctx.Context, metaSet *metas.Set, idents *idents.Set, tc *memsto.TargetCacheType,
	pc *prom.PromClientMap, tdendgineClients *tdengine.TdengineClientMap, redis storage.Redis, sso *sso.SsoClient, ctx *ctx.Context, metaSet *metas.Set, idents *idents.Set, tc *memsto.TargetCacheType,
	uc *memsto.UserCacheType, ugc *memsto.UserGroupCacheType) *Router {
	return &Router{
		HTTP: httpConfig,

@@ -54,6 +60,7 @@ func New(httpConfig httpx.Config, center cconf.Center, operations cconf.Operatio
		DatasourceCache:   ds,
		NotifyConfigCache: ncc,
		PromClients:       pc,
		TdendgineClients:  tdendgineClients,
		Redis:             redis,
		MetaSet:           metaSet,
		IdentSet:          idents,

@@ -62,6 +69,8 @@ func New(httpConfig httpx.Config, center cconf.Center, operations cconf.Operatio
		UserCache:      uc,
		UserGroupCache: ugc,
		Ctx:            ctx,

		DatasourceCheckHook: func(ctx *gin.Context) bool { return false },
	}
}

@@ -160,11 +169,27 @@ func (rt *Router) Config(r *gin.Engine) {
		pages.POST("/query-range-batch", rt.promBatchQueryRange)
		pages.POST("/query-instant-batch", rt.promBatchQueryInstant)
		pages.GET("/datasource/brief", rt.datasourceBriefs)

		pages.POST("/ds-query", rt.QueryData)
		pages.POST("/logs-query", rt.QueryLog)

		pages.POST("/tdengine-databases", rt.tdengineDatabases)
		pages.POST("/tdengine-tables", rt.tdengineTables)
		pages.POST("/tdengine-columns", rt.tdengineColumns)

		pages.GET("/sql-template", rt.QuerySqlTemplate)
	} else {
		pages.Any("/proxy/:id/*url", rt.auth(), rt.dsProxy)
		pages.POST("/query-range-batch", rt.auth(), rt.promBatchQueryRange)
		pages.POST("/query-instant-batch", rt.auth(), rt.promBatchQueryInstant)
		pages.GET("/datasource/brief", rt.auth(), rt.datasourceBriefs)

		pages.POST("/ds-query", rt.auth(), rt.QueryData)
		pages.POST("/logs-query", rt.auth(), rt.QueryLog)

		pages.POST("/tdengine-databases", rt.auth(), rt.tdengineDatabases)
		pages.POST("/tdengine-tables", rt.auth(), rt.tdengineTables)
		pages.POST("/tdengine-columns", rt.auth(), rt.tdengineColumns)
	}

	pages.POST("/auth/login", rt.jwtMock(), rt.loginPost)

@@ -243,6 +268,7 @@ func (rt *Router) Config(r *gin.Engine) {
	pages.GET("/builtin-boards-cates", rt.auth(), rt.user(), rt.builtinBoardCateGets)
	pages.POST("/builtin-boards-detail", rt.auth(), rt.user(), rt.builtinBoardDetailGets)
	pages.GET("/integrations/icon/:cate/:name", rt.builtinIcon)
	pages.GET("/integrations/makedown/:cate", rt.builtinMarkdown)

	pages.GET("/busi-group/:id/boards", rt.auth(), rt.user(), rt.perm("/dashboards"), rt.bgro(), rt.boardGets)
	pages.POST("/busi-group/:id/boards", rt.auth(), rt.user(), rt.perm("/dashboards/add"), rt.bgrw(), rt.boardAdd)

@@ -268,7 +294,7 @@ func (rt *Router) Config(r *gin.Engine) {
	pages.PUT("/busi-group/:id/alert-rules/fields", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.bgrw(), rt.alertRulePutFields)
	pages.PUT("/busi-group/:id/alert-rule/:arid", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.alertRulePutByFE)
	pages.GET("/alert-rule/:arid", rt.auth(), rt.user(), rt.perm("/alert-rules"), rt.alertRuleGet)
	pages.PUT("/busi-group/:id/alert-rule/:arid/validate", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.alertRuleValidation)
	pages.PUT("/busi-group/alert-rule/validate", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.alertRuleValidation)

	pages.GET("/busi-group/:id/recording-rules", rt.auth(), rt.user(), rt.perm("/recording-rules"), rt.recordingRuleGets)
	pages.POST("/busi-group/:id/recording-rules", rt.auth(), rt.user(), rt.perm("/recording-rules/add"), rt.bgrw(), rt.recordingRuleAddByFE)

@@ -278,6 +304,7 @@ func (rt *Router) Config(r *gin.Engine) {
	pages.PUT("/busi-group/:id/recording-rules/fields", rt.auth(), rt.user(), rt.perm("/recording-rules/put"), rt.recordingRulePutFields)

	pages.GET("/busi-group/:id/alert-mutes", rt.auth(), rt.user(), rt.perm("/alert-mutes"), rt.bgro(), rt.alertMuteGetsByBG)
	pages.POST("/busi-group/:id/alert-mutes/preview", rt.auth(), rt.user(), rt.perm("/alert-mutes/add"), rt.bgrw(), rt.alertMutePreview)
	pages.POST("/busi-group/:id/alert-mutes", rt.auth(), rt.user(), rt.perm("/alert-mutes/add"), rt.bgrw(), rt.alertMuteAdd)
	pages.DELETE("/busi-group/:id/alert-mutes", rt.auth(), rt.user(), rt.perm("/alert-mutes/del"), rt.bgrw(), rt.alertMuteDel)
	pages.PUT("/busi-group/:id/alert-mute/:amid", rt.auth(), rt.user(), rt.perm("/alert-mutes/put"), rt.alertMutePutByFE)

@@ -303,6 +330,7 @@ func (rt *Router) Config(r *gin.Engine) {
	pages.POST("/alert-cur-events/card/details", rt.auth(), rt.alertCurEventsCardDetails)
	pages.GET("/alert-his-events/list", rt.auth(), rt.alertHisEventsList)
	pages.DELETE("/alert-cur-events", rt.auth(), rt.user(), rt.perm("/alert-cur-events/del"), rt.alertCurEventDel)
	pages.GET("/alert-cur-events/stats", rt.auth(), rt.alertCurEventsStatistics)

	pages.GET("/alert-aggr-views", rt.auth(), rt.alertAggrViewGets)
	pages.DELETE("/alert-aggr-views", rt.auth(), rt.user(), rt.alertAggrViewDel)

@@ -365,14 +393,28 @@ func (rt *Router) Config(r *gin.Engine) {

	pages.GET("/notify-config", rt.auth(), rt.admin(), rt.notifyConfigGet)
	pages.PUT("/notify-config", rt.auth(), rt.admin(), rt.notifyConfigPut)
	pages.PUT("/smtp-config-test", rt.auth(), rt.admin(), rt.attemptSendEmail)

	pages.GET("/es-index-pattern", rt.auth(), rt.esIndexPatternGet)
	pages.GET("/es-index-pattern-list", rt.auth(), rt.esIndexPatternGetList)
	pages.POST("/es-index-pattern", rt.auth(), rt.admin(), rt.esIndexPatternAdd)
	pages.PUT("/es-index-pattern", rt.auth(), rt.admin(), rt.esIndexPatternPut)
	pages.DELETE("/es-index-pattern", rt.auth(), rt.admin(), rt.esIndexPatternDel)

	pages.GET("/config", rt.auth(), rt.admin(), rt.configGetByKey)
	pages.PUT("/config", rt.auth(), rt.admin(), rt.configPutByKey)
}

r.GET("/api/n9e/versions", func(c *gin.Context) {
	v := version.Version
	lastIndex := strings.LastIndex(version.Version, "-")
	if lastIndex != -1 {
		v = version.Version[:lastIndex]
	}

	ginx.NewRender(c).Data(gin.H{"version": v, "github_verison": version.GithubVersion.Load().(string)}, nil)
})

if rt.HTTP.APIForService.Enable {
	service := r.Group("/v1/n9e")
	if len(rt.HTTP.APIForService.BasicAuth) > 0 {

@@ -418,6 +460,8 @@ func (rt *Router) Config(r *gin.Engine) {
	service.GET("/alert-his-events", rt.alertHisEventsList)
	service.GET("/alert-his-event/:eid", rt.alertHisEventGet)

	service.GET("/task-tpl/:tid", rt.taskTplGetByService)

	service.GET("/config/:id", rt.configGet)
	service.GET("/configs", rt.configsGet)
	service.GET("/config", rt.configGetByKey)
@@ -4,6 +4,7 @@ import (
	"net/http"
	"sort"
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/models"

@@ -182,10 +183,19 @@ func (rt *Router) alertCurEventDel(c *gin.Context) {
	ginx.BindJSON(c, &f)
	f.Verify()

	rt.checkCurEventBusiGroupRWPermission(c, f.Ids)

	ginx.NewRender(c).Message(models.AlertCurEventDel(rt.Ctx, f.Ids))
}

func (rt *Router) checkCurEventBusiGroupRWPermission(c *gin.Context, ids []int64) {
	set := make(map[int64]struct{})

	for i := 0; i < len(f.Ids); i++ {
		event, err := models.AlertCurEventGetById(rt.Ctx, f.Ids[i])
	// event group id is 0, ignore perm check
	set[0] = struct{}{}

	for i := 0; i < len(ids); i++ {
		event, err := models.AlertCurEventGetById(rt.Ctx, ids[i])
		ginx.Dangerous(err)

		if _, has := set[event.GroupId]; !has {

@@ -193,8 +203,6 @@ func (rt *Router) alertCurEventDel(c *gin.Context) {
			set[event.GroupId] = struct{}{}
		}
	}

	ginx.NewRender(c).Message(models.AlertCurEventDel(rt.Ctx, f.Ids))
}

func (rt *Router) alertCurEventGet(c *gin.Context) {

@@ -208,3 +216,8 @@ func (rt *Router) alertCurEventGet(c *gin.Context) {

	ginx.NewRender(c).Data(event, nil)
}

func (rt *Router) alertCurEventsStatistics(c *gin.Context) {
	ginx.NewRender(c).Data(models.AlertCurEventStatistics(rt.Ctx, time.Now()), nil)
}
@@ -273,22 +273,11 @@ func (rt *Router) alertRuleGet(c *gin.Context) {
	ginx.NewRender(c).Data(ar, err)
}

//pre validation before save rule
// pre validation before save rule
func (rt *Router) alertRuleValidation(c *gin.Context) {
	var f models.AlertRule //new
	ginx.BindJSON(c, &f)

	arid := ginx.UrlParamInt64(c, "arid")
	ar, err := models.AlertRuleGetById(rt.Ctx, arid)
	ginx.Dangerous(err)

	if ar == nil {
		ginx.NewRender(c, http.StatusNotFound).Message("No such AlertRule")
		return
	}

	rt.bgrwCheck(c, ar.GroupId)

	if len(f.NotifyChannelsJSON) > 0 && len(f.NotifyGroupsJSON) > 0 { //Validation NotifyChannels
		ngids := make([]int64, 0, len(f.NotifyChannelsJSON))
		for i := range f.NotifyGroupsJSON {

@@ -305,6 +294,15 @@ func (rt *Router) alertRuleValidation(c *gin.Context) {
	ancs := make([]string, 0, len(f.NotifyChannelsJSON)) //absent Notify Channels
	for i := range f.NotifyChannelsJSON {
		flag := true
		//ignore non-default channels
		switch f.NotifyChannelsJSON[i] {
		case models.Dingtalk, models.Wecom, models.Feishu, models.Mm,
			models.Telegram, models.Email, models.FeishuCard:
			// do nothing
		default:
			continue
		}
		//default channels
		for ui := range users {
			if _, b := users[ui].ExtractToken(f.NotifyChannelsJSON[i]); b {
				flag = false

@@ -317,7 +315,7 @@ func (rt *Router) alertRuleValidation(c *gin.Context) {
	}

	if len(ancs) > 0 {
		ginx.NewRender(c).Message(i18n.Sprintf(c.GetHeader("X-Language"), "All users are missing notify channel configurations. Please check for missing tokens (each channel should be configured with at least one user). %s", ancs))
		ginx.NewRender(c).Message("All users are missing notify channel configurations. Please check for missing tokens (each channel should be configured with at least one user). %s", ancs)
		return
	}
@@ -14,21 +14,18 @@ import (
func (rt *Router) alertSubscribeGets(c *gin.Context) {
	bgid := ginx.UrlParamInt64(c, "id")
	lst, err := models.AlertSubscribeGets(rt.Ctx, bgid)
	if err == nil {
		ugcache := make(map[int64]*models.UserGroup)
		for i := 0; i < len(lst); i++ {
			ginx.Dangerous(lst[i].FillUserGroups(rt.Ctx, ugcache))
		}
	ginx.Dangerous(err)

		rulecache := make(map[int64]string)
		for i := 0; i < len(lst); i++ {
			ginx.Dangerous(lst[i].FillRuleName(rt.Ctx, rulecache))
		}
	ugcache := make(map[int64]*models.UserGroup)
	rulecache := make(map[int64]string)

		for i := 0; i < len(lst); i++ {
			ginx.Dangerous(lst[i].FillDatasourceIds(rt.Ctx))
		}
	for i := 0; i < len(lst); i++ {
		ginx.Dangerous(lst[i].FillUserGroups(rt.Ctx, ugcache))
		ginx.Dangerous(lst[i].FillRuleName(rt.Ctx, rulecache))
		ginx.Dangerous(lst[i].FillDatasourceIds(rt.Ctx))
		ginx.Dangerous(lst[i].DB2FE())
	}

	ginx.NewRender(c).Data(lst, err)
}

@@ -101,6 +98,7 @@ func (rt *Router) alertSubscribePut(c *gin.Context) {
		"redefine_webhooks",
		"severities",
		"extra_config",
		"busi_groups",
	))
}
@@ -315,3 +315,26 @@ func (rt *Router) builtinIcon(c *gin.Context) {
	iconPath := fp + "/" + cate + "/icon/" + ginx.UrlParamStr(c, "name")
	c.File(path.Join(iconPath))
}

func (rt *Router) builtinMarkdown(c *gin.Context) {
	fp := rt.Center.BuiltinIntegrationsDir
	if fp == "" {
		fp = path.Join(runner.Cwd, "integrations")
	}
	cate := ginx.UrlParamStr(c, "cate")

	var markdown []byte
	markdownDir := fp + "/" + cate + "/markdown"
	markdownFiles, err := file.FilesUnder(markdownDir)
	if err != nil {
		logger.Warningf("get markdown fail: %v", err)
	} else if len(markdownFiles) > 0 {
		f := markdownFiles[0]
		fn := markdownDir + "/" + f
		markdown, err = file.ReadBytes(fn)
		if err != nil {
			logger.Warningf("get collect fail: %v", err)
		}
	}
	ginx.NewRender(c).Data(string(markdown), nil)
}
@@ -25,6 +25,12 @@ func (rt *Router) configGetByKey(c *gin.Context) {
	ginx.NewRender(c).Data(config, err)
}

func (rt *Router) configPutByKey(c *gin.Context) {
	var f models.Configs
	ginx.BindJSON(c, &f)
	ginx.NewRender(c).Message(models.ConfigsSet(rt.Ctx, f.Ckey, f.Cval))
}

func (rt *Router) configsDel(c *gin.Context) {
	var f idsForm
	ginx.BindJSON(c, &f)
@@ -3,6 +3,7 @@ package router
import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"

@@ -25,6 +26,11 @@ type listReq struct {
}

func (rt *Router) datasourceList(c *gin.Context) {
	if rt.DatasourceCheckHook(c) {
		Render(c, []int{}, nil)
		return
	}

	var req listReq
	ginx.BindJSON(c, &req)

@@ -65,6 +71,11 @@ func (rt *Router) datasourceBriefs(c *gin.Context) {
}

func (rt *Router) datasourceUpsert(c *gin.Context) {
	if rt.DatasourceCheckHook(c) {
		Render(c, []int{}, nil)
		return
	}

	var req models.Datasource
	ginx.BindJSON(c, &req)
	username := Username(c)

@@ -127,14 +138,33 @@ func DatasourceCheck(ds models.Datasource) error {
	if ds.PluginType == models.PROMETHEUS {
		subPath := "/api/v1/query"
		query := url.Values{}
		if strings.Contains(fullURL, "loki") {
		if ds.HTTPJson.IsLoki() {
			subPath = "/api/v1/labels"
		} else {
			query.Add("query", "1+1")
		}
		fullURL = fmt.Sprintf("%s%s?%s", ds.HTTPJson.Url, subPath, query.Encode())

		req, err = http.NewRequest("POST", fullURL, nil)
		req, err = http.NewRequest("GET", fullURL, nil)
		if err != nil {
			logger.Errorf("Error creating request: %v", err)
			return fmt.Errorf("request url:%s failed", fullURL)
		}
	} else if ds.PluginType == models.TDENGINE {
		fullURL = fmt.Sprintf("%s/rest/sql", ds.HTTPJson.Url)
		req, err = http.NewRequest("POST", fullURL, strings.NewReader("show databases"))
		if err != nil {
			logger.Errorf("Error creating request: %v", err)
			return fmt.Errorf("request url:%s failed", fullURL)
		}
	}

	if ds.PluginType == models.LOKI {
		subPath := "/api/v1/labels"

		fullURL = fmt.Sprintf("%s%s", ds.HTTPJson.Url, subPath)

		req, err = http.NewRequest("GET", fullURL, nil)
		if err != nil {
			logger.Errorf("Error creating request: %v", err)
			return fmt.Errorf("request url:%s failed", fullURL)

@@ -158,13 +188,19 @@ func DatasourceCheck(ds models.Datasource) error {

	if resp.StatusCode != 200 {
		logger.Errorf("Error making request: %v\n", resp.StatusCode)
		return fmt.Errorf("request url:%s failed code:%d", fullURL, resp.StatusCode)
		body, _ := io.ReadAll(resp.Body)
		return fmt.Errorf("request url:%s failed code:%d body:%s", fullURL, resp.StatusCode, string(body))
	}

	return nil
}

func (rt *Router) datasourceGet(c *gin.Context) {
	if rt.DatasourceCheckHook(c) {
		Render(c, []int{}, nil)
		return
	}

	var req models.Datasource
	ginx.BindJSON(c, &req)
	err := req.Get(rt.Ctx)

@@ -172,6 +208,11 @@ func (rt *Router) datasourceGet(c *gin.Context) {
}

func (rt *Router) datasourceUpdataStatus(c *gin.Context) {
	if rt.DatasourceCheckHook(c) {
		Render(c, []int{}, nil)
		return
	}

	var req models.Datasource
	ginx.BindJSON(c, &req)
	username := Username(c)

@@ -181,6 +222,11 @@ func (rt *Router) datasourceUpdataStatus(c *gin.Context) {
}

func (rt *Router) datasourceDel(c *gin.Context) {
	if rt.DatasourceCheckHook(c) {
		Render(c, []int{}, nil)
		return
	}

	var ids []int64
	ginx.BindJSON(c, &ids)
	err := models.DatasourceDel(rt.Ctx, ids)
@@ -41,6 +41,7 @@ func (rt *Router) esIndexPatternPut(c *gin.Context) {
	}

	f.UpdateBy = c.MustGet("username").(string)

	ginx.NewRender(c).Message(esIndexPattern.Update(rt.Ctx, f))
}

@@ -67,7 +68,7 @@ func (rt *Router) esIndexPatternGetList(c *gin.Context) {
	} else {
		lst, err = models.EsIndexPatternGets(rt.Ctx, "")
	}

	ginx.NewRender(c).Data(lst, err)
}
@@ -4,12 +4,14 @@ import (
	"compress/gzip"
	"encoding/json"
	"io/ioutil"
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/models"

	"github.com/gin-gonic/gin"
	"github.com/toolkits/pkg/ginx"
	"github.com/toolkits/pkg/logger"
)

func (rt *Router) heartbeat(c *gin.Context) {

@@ -49,12 +51,16 @@ func (rt *Router) heartbeat(c *gin.Context) {
	items[req.Hostname] = struct{}{}
	rt.IdentSet.MSet(items)

	gid := ginx.QueryInt64(c, "gid", 0)

	if gid != 0 {
		target, has := rt.TargetCache.Get(req.Hostname)
		if has && target.GroupId != gid {
			err = models.TargetUpdateBgid(rt.Ctx, []string{req.Hostname}, gid, false)
	if target, has := rt.TargetCache.Get(req.Hostname); has && target != nil {
		var defGid int64 = -1
		gid := ginx.QueryInt64(c, "gid", defGid)
		hostIpStr := strings.TrimSpace(req.HostIp)
		if gid == defGid { //set gid value from cache
			gid = target.GroupId
		}
		logger.Debugf("heartbeat gid: %v, host_ip: '%v', target: %v", gid, hostIpStr, *target)
		if gid != target.GroupId || hostIpStr != target.HostIp { // if either gid or host_ip has a new value
			err = models.TargetUpdateHostIpAndBgid(rt.Ctx, req.Hostname, hostIpStr, gid)
		}
	}
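The reworked heartbeat defaults `gid` to -1 so "no gid supplied" is distinguishable from group 0, falls back to the cached group when the agent sent none, and touches the database only when the group or host IP actually changed. That decision reads as a small pure function; a hedged sketch:

```go
// effectiveGidAndDirty reports the group id to store and whether the
// heartbeat carries a change. gid == -1 means the agent sent no gid,
// so the cached assignment is kept.
func effectiveGidAndDirty(gid int64, hostIP string, cachedGid int64, cachedIP string) (int64, bool) {
	if gid == -1 {
		gid = cachedGid
	}
	return gid, gid != cachedGid || hostIP != cachedIP
}
```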
@@ -230,7 +230,7 @@ func (rt *Router) loginCallback(c *gin.Context) {

	ret, err := rt.Sso.OIDC.Callback(rt.Redis, c.Request.Context(), code, state)
	if err != nil {
		logger.Debugf("sso.callback() get ret %+v error %v", ret, err)
		logger.Errorf("sso_callback fail. code:%s, state:%s, get ret: %+v. error: %v", code, state, ret, err)
		ginx.NewRender(c).Data(CallbackOutput{}, err)
		return
	}

@@ -515,10 +515,23 @@ type SsoConfigOutput struct {
}

func (rt *Router) ssoConfigNameGet(c *gin.Context) {
	var oidcDisplayName, casDisplayName, oauthDisplayName string
	if rt.Sso.OIDC != nil {
		oidcDisplayName = rt.Sso.OIDC.GetDisplayName()
	}

	if rt.Sso.CAS != nil {
		casDisplayName = rt.Sso.CAS.GetDisplayName()
	}

	if rt.Sso.OAuth2 != nil {
		oauthDisplayName = rt.Sso.OAuth2.GetDisplayName()
	}

	ginx.NewRender(c).Data(SsoConfigOutput{
		OidcDisplayName:  rt.Sso.OIDC.GetDisplayName(),
		CasDisplayName:   rt.Sso.CAS.GetDisplayName(),
		OauthDisplayName: rt.Sso.OAuth2.GetDisplayName(),
		OidcDisplayName:  oidcDisplayName,
		CasDisplayName:   casDisplayName,
		OauthDisplayName: oauthDisplayName,
	}, nil)
}

@@ -543,8 +556,7 @@ func (rt *Router) ssoConfigUpdate(c *gin.Context) {
	var config oidcx.Config
	err := toml.Unmarshal([]byte(f.Content), &config)
	ginx.Dangerous(err)

	err = rt.Sso.OIDC.Reload(config)
	rt.Sso.OIDC, err = oidcx.New(config)
	ginx.Dangerous(err)
case "CAS":
	var config cas.Config
@@ -5,6 +5,7 @@ import (
	"strings"
	"time"

	"github.com/ccfos/nightingale/v6/alert/common"
	"github.com/ccfos/nightingale/v6/models"

	"github.com/gin-gonic/gin"

@@ -29,16 +30,41 @@ func (rt *Router) alertMuteGets(c *gin.Context) {
}

func (rt *Router) alertMuteAdd(c *gin.Context) {
	var f models.AlertMute
	ginx.BindJSON(c, &f)

	username := c.MustGet("username").(string)
	f.CreateBy = username
	f.GroupId = ginx.UrlParamInt64(c, "id")

	ginx.NewRender(c).Message(f.Add(rt.Ctx))
}

// Preview events (alert_cur_event) that match the mute strategy based on the following criteria:
// business group ID (group_id, group_id), product (prod, rule_prod),
// alert event severity (severities, severity), and event tags (tags, tags).
// For products of type other than 'host', also consider the category (cate, cate) and datasource ID (datasource_ids, datasource_id).
func (rt *Router) alertMutePreview(c *gin.Context) {
	// Generally only a small number of events will match.
	var f models.AlertMute
	ginx.BindJSON(c, &f)
	f.GroupId = ginx.UrlParamInt64(c, "id")
	ginx.Dangerous(f.Verify()) // verify and parse tags json to ITags
	events, err := models.AlertCurEventGetsFromAlertMute(rt.Ctx, &f)
	ginx.Dangerous(err)

	matchEvents := make([]*models.AlertCurEvent, 0, len(events))
	for i := 0; i < len(events); i++ {
		events[i].DB2Mem()
		if common.MatchTags(events[i].TagsMap, f.ITags) {
			matchEvents = append(matchEvents, events[i])
		}
	}
	ginx.NewRender(c).Data(matchEvents, err)
}
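`alertMutePreview` narrows candidates by the coarse columns (group, product, severity, datasource) in SQL and then filters in memory with `common.MatchTags` against the mute rule's parsed tag filters. A simplified sketch of equality-only tag matching (the real `ITags` filters support more operators, so this is an assumption):

```go
// matchTags is satisfied only if every filter key is present in the
// event's tags with exactly the wanted value; "==" matching only.
func matchTags(eventTags map[string]string, filters map[string]string) bool {
	for k, want := range filters {
		if got, ok := eventTags[k]; !ok || got != want {
			return false
		}
	}
	return true
}
```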
func (rt *Router) alertMuteAddByService(c *gin.Context) {
	var f models.AlertMute
	ginx.BindJSON(c, &f)
@@ -2,15 +2,17 @@ package router
|
||||
|
||||
import (
|
||||
"encoding/json"
|
||||
"strings"
|
||||
|
||||
"github.com/ccfos/nightingale/v6/alert/aconf"
|
||||
"github.com/ccfos/nightingale/v6/alert/sender"
|
||||
"github.com/ccfos/nightingale/v6/memsto"
|
||||
"github.com/ccfos/nightingale/v6/models"
|
||||
"github.com/pelletier/go-toml/v2"
|
||||
|
||||
"github.com/gin-gonic/gin"
|
||||
"github.com/pelletier/go-toml/v2"
|
||||
"github.com/toolkits/pkg/ginx"
|
||||
"github.com/toolkits/pkg/str"
|
||||
)
|
||||
|
||||
func (rt *Router) webhookGets(c *gin.Context) {
|
||||
@@ -178,16 +180,42 @@ func (rt *Router) notifyConfigPut(c *gin.Context) {

	if f.Ckey == models.SMTP {
		// reset the email sender
-		var smtp aconf.SMTPConfig
-		err := toml.Unmarshal([]byte(f.Cval), &smtp)
-		ginx.Dangerous(err)
-
-		if smtp.Host == "" || smtp.Port == 0 {
-			ginx.Bomb(200, "smtp host or port can not be empty")
-		}
+		smtp := smtpValidate(f.Cval)

		go sender.RestartEmailSender(smtp)
	}

	ginx.NewRender(c).Message(nil)
}

func smtpValidate(smtpStr string) aconf.SMTPConfig {
	var smtp aconf.SMTPConfig
	ginx.Dangerous(toml.Unmarshal([]byte(smtpStr), &smtp))

	if smtp.Host == "" || smtp.Port == 0 {
		ginx.Bomb(200, "smtp host or port can not be empty")
	}
	return smtp
}

type form struct {
	models.Configs
	Email string `json:"email"`
}

// After configuring the aconf.SMTPConfig, users can optionally run a test:
// this handler attempts to send a test email to the given address.
func (rt *Router) attemptSendEmail(c *gin.Context) {
	var f form
	ginx.BindJSON(c, &f)

	if f.Email = strings.TrimSpace(f.Email); f.Email == "" || !str.IsMail(f.Email) {
		ginx.Bomb(200, "email(%s) invalid", f.Email)
	}

	if f.Ckey != models.SMTP {
		ginx.Bomb(200, "config(%v) invalid", f)
	}
	smtp := smtpValidate(f.Cval)
	ginx.NewRender(c).Message(sender.SendEmail("Email test", "email content", []string{f.Email}, smtp))
}
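
A minimal standalone sketch of what `smtpValidate` does — TOML-decode, then reject an empty host or port. The `smtpConfig` struct here is a stand-in with only the two checked fields; the real `aconf.SMTPConfig` carries more:

```go
package main

import (
	"fmt"

	"github.com/pelletier/go-toml/v2"
)

// smtpConfig mirrors only the fields the validator checks.
type smtpConfig struct {
	Host string
	Port int
}

func main() {
	raw := `
Host = "smtp.example.com"
Port = 465
`
	var smtp smtpConfig
	if err := toml.Unmarshal([]byte(raw), &smtp); err != nil {
		panic(err)
	}
	// Same guard as smtpValidate: both fields must be set.
	if smtp.Host == "" || smtp.Port == 0 {
		panic("smtp host or port can not be empty")
	}
	fmt.Printf("%+v\n", smtp)
}
```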
@@ -21,7 +21,7 @@ func (rt *Router) notifyTplGets(c *gin.Context) {
	for _, channel := range models.DefaultChannels {
		m[channel] = struct{}{}
	}
-	m["mailsubject"] = struct{}{}
+	m[models.EmailSubject] = struct{}{}

	lst, err := models.NotifyTplGets(rt.Ctx)
	for i := 0; i < len(lst); i++ {
@@ -3,6 +3,7 @@ package router

import (
	"context"
	"crypto/tls"
+	"fmt"
	"net"
	"net/http"
	"net/http/httputil"
@@ -164,10 +165,18 @@ func (rt *Router) dsProxy(c *gin.Context) {
		transportPut(dsId, ds.UpdatedAt, transport)
	}

+	modifyResponse := func(r *http.Response) error {
+		if r.StatusCode == http.StatusUnauthorized {
+			return fmt.Errorf("unauthorized access")
+		}
+		return nil
+	}

	proxy := &httputil.ReverseProxy{
-		Director:     director,
-		Transport:    transport,
-		ErrorHandler: errFunc,
+		Director:       director,
+		Transport:      transport,
+		ErrorHandler:   errFunc,
+		ModifyResponse: modifyResponse,
	}

	proxy.ServeHTTP(c.Writer, c.Request)
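
For context, `ModifyResponse` is a standard `net/http/httputil.ReverseProxy` hook: it runs after the upstream answers, and returning a non-nil error routes the request into `ErrorHandler`, which is how the hunk above surfaces 401s from the datasource. A minimal runnable sketch (the upstream address is hypothetical):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical upstream; any datasource URL would do.
	upstream, err := url.Parse("http://127.0.0.1:9090")
	if err != nil {
		log.Fatal(err)
	}

	proxy := &httputil.ReverseProxy{
		// Director rewrites the inbound request to point at the upstream.
		Director: func(r *http.Request) {
			r.URL.Scheme = upstream.Scheme
			r.URL.Host = upstream.Host
		},
		// Returning an error here hands the request to ErrorHandler.
		ModifyResponse: func(r *http.Response) error {
			if r.StatusCode == http.StatusUnauthorized {
				return fmt.Errorf("unauthorized access")
			}
			return nil
		},
		ErrorHandler: func(w http.ResponseWriter, r *http.Request, err error) {
			http.Error(w, err.Error(), http.StatusBadGateway)
		},
	}

	log.Fatal(http.ListenAndServe(":8080", proxy))
}
```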
@@ -45,29 +45,32 @@ func (rt *Router) targetGets(c *gin.Context) {
	bgid := ginx.QueryInt64(c, "bgid", -1)
	query := ginx.QueryStr(c, "query", "")
	limit := ginx.QueryInt(c, "limit", 30)
+	downtime := ginx.QueryInt64(c, "downtime", 0)
	dsIds := queryDatasourceIds(c)

	var bgids []int64
	var err error
	if bgid == -1 {
-		// when listing all objects, find the business groups the user has permission on
		user := c.MustGet("user").(*models.User)
-		userGroupIds, err := models.MyGroupIds(rt.Ctx, user.Id)
-		ginx.Dangerous(err)
+		if !user.IsAdmin() {
+			// for non-admin users listing all objects, find the business groups the user has permission on
+			userGroupIds, err := models.MyGroupIds(rt.Ctx, user.Id)
+			ginx.Dangerous(err)

-		bgids, err = models.BusiGroupIds(rt.Ctx, userGroupIds)
-		ginx.Dangerous(err)
+			bgids, err = models.BusiGroupIds(rt.Ctx, userGroupIds)
+			ginx.Dangerous(err)

-		// also include objects that are not assigned to any business group
-		bgids = append(bgids, 0)
+			// also include objects that are not assigned to any business group
+			bgids = append(bgids, 0)
+		}
	} else {
		bgids = append(bgids, bgid)
	}

-	total, err := models.TargetTotal(rt.Ctx, bgids, dsIds, query)
+	total, err := models.TargetTotal(rt.Ctx, bgids, dsIds, query, downtime)
	ginx.Dangerous(err)

-	list, err := models.TargetGets(rt.Ctx, bgids, dsIds, query, limit, ginx.Offset(c, limit))
+	list, err := models.TargetGets(rt.Ctx, bgids, dsIds, query, downtime, limit, ginx.Offset(c, limit))
	ginx.Dangerous(err)

	if err == nil {

@@ -78,6 +81,12 @@ func (rt *Router) targetGets(c *gin.Context) {
	for i := 0; i < len(list); i++ {
		ginx.Dangerous(list[i].FillGroup(rt.Ctx, cache))
		keys = append(keys, models.WrapIdent(list[i].Ident))

+		if now.Unix()-list[i].UpdateAt < 60 {
+			list[i].TargetUp = 2
+		} else if now.Unix()-list[i].UpdateAt < 180 {
+			list[i].TargetUp = 1
+		}
	}

	if len(keys) > 0 {

@@ -103,12 +112,6 @@ func (rt *Router) targetGets(c *gin.Context) {
			// hosts that have never reported metadata keep cpuNum = -1, which the frontend renders as "unknown"
			list[i].CpuNum = -1
		}

-		if now.Unix()-list[i].UnixTime/1000 < 60 {
-			list[i].TargetUp = 2
-		} else if now.Unix()-list[i].UnixTime/1000 < 180 {
-			list[i].TargetUp = 1
-		}
	}
}
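
The heartbeat thresholds above read in isolation: a target counts as up (TargetUp = 2) if its last update is younger than 60 s, stale (1) if younger than 180 s, and down (0, the zero value) otherwise. A small sketch of that classification:

```go
package main

import (
	"fmt"
	"time"
)

// targetUp classifies a target by the age of its last heartbeat,
// mirroring the thresholds in the hunk above: 2 = up, 1 = stale, 0 = down.
func targetUp(lastUpdateUnix int64, now time.Time) int {
	age := now.Unix() - lastUpdateUnix
	switch {
	case age < 60:
		return 2
	case age < 180:
		return 1
	default:
		return 0
	}
}

func main() {
	now := time.Now()
	fmt.Println(targetUp(now.Unix()-30, now))  // 2: heartbeat 30s ago
	fmt.Println(targetUp(now.Unix()-120, now)) // 1: heartbeat 2min ago
	fmt.Println(targetUp(now.Unix()-600, now)) // 0: heartbeat 10min ago
}
```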
@@ -48,6 +48,19 @@ func (rt *Router) taskTplGet(c *gin.Context) {
	}, err)
}

+func (rt *Router) taskTplGetByService(c *gin.Context) {
+	tid := ginx.UrlParamInt64(c, "tid")
+
+	tpl, err := models.TaskTplGetById(rt.Ctx, tid)
+	ginx.Dangerous(err)
+
+	if tpl == nil {
+		ginx.Bomb(404, "no such task template")
+	}
+
+	ginx.NewRender(c).Data(tpl, err)
+}

type taskTplForm struct {
	Title string `json:"title" binding:"required"`
	Batch int    `json:"batch"`
117 center/router/router_tdengine.go (new file)
@@ -0,0 +1,117 @@
package router

import (
	"net/http"

	"github.com/ccfos/nightingale/v6/center/cconf"
	"github.com/ccfos/nightingale/v6/models"

	"github.com/gin-gonic/gin"
	"github.com/toolkits/pkg/ginx"
	"github.com/toolkits/pkg/logger"
)

type databasesQueryForm struct {
	Cate         string `json:"cate" form:"cate"`
	DatasourceId int64  `json:"datasource_id" form:"datasource_id"`
}

func (rt *Router) tdengineDatabases(c *gin.Context) {
	var f databasesQueryForm
	ginx.BindJSON(c, &f)

	tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
	if tdClient == nil {
		ginx.NewRender(c, http.StatusNotFound).Message("No such datasource")
		return
	}

	databases, err := tdClient.GetDatabases()
	ginx.NewRender(c).Data(databases, err)
}

type tablesQueryForm struct {
	Cate         string `json:"cate"`
	DatasourceId int64  `json:"datasource_id"`
	Database     string `json:"db"`
	IsStable     bool   `json:"is_stable"`
}

// get tdengine tables
func (rt *Router) tdengineTables(c *gin.Context) {
	var f tablesQueryForm
	ginx.BindJSON(c, &f)

	tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
	if tdClient == nil {
		ginx.NewRender(c, http.StatusNotFound).Message("No such datasource")
		return
	}

	tables, err := tdClient.GetTables(f.Database, f.IsStable)
	ginx.NewRender(c).Data(tables, err)
}

type columnsQueryForm struct {
	Cate         string `json:"cate"`
	DatasourceId int64  `json:"datasource_id"`
	Database     string `json:"db"`
	Table        string `json:"table"`
}

// get tdengine columns
func (rt *Router) tdengineColumns(c *gin.Context) {
	var f columnsQueryForm
	ginx.BindJSON(c, &f)

	tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
	if tdClient == nil {
		ginx.NewRender(c, http.StatusNotFound).Message("No such datasource")
		return
	}

	columns, err := tdClient.GetColumns(f.Database, f.Table)
	ginx.NewRender(c).Data(columns, err)
}

func (rt *Router) QueryData(c *gin.Context) {
	var f models.QueryParam
	ginx.BindJSON(c, &f)

	var resp []*models.DataResp
	var err error
	tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
	for _, q := range f.Querys {
		datas, err := tdClient.Query(q)
		ginx.Dangerous(err)
		resp = append(resp, datas...)
	}

	ginx.NewRender(c).Data(resp, err)
}

func (rt *Router) QueryLog(c *gin.Context) {
	var f models.QueryParam
	ginx.BindJSON(c, &f)

	tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
	if len(f.Querys) == 0 {
		ginx.Bomb(200, "querys is empty")
		return
	}

	data, err := tdClient.QueryLog(f.Querys[0])
	logger.Debugf("tdengine query:%s result: %+v", f.Querys[0], data)
	ginx.NewRender(c).Data(data, err)
}

// query sql template
func (rt *Router) QuerySqlTemplate(c *gin.Context) {
	cate := ginx.QueryStr(c, "cate")
	m := make(map[string]string)
	switch cate {
	case models.TDENGINE:
		m = cconf.TDengineSQLTpl
	}
	ginx.NewRender(c).Data(m, nil)
}
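
A hypothetical client-side sketch of calling one of these handlers. The JSON field names mirror the struct tags above, but the route path is a placeholder — the actual mount point is defined in the router setup, which is outside this diff:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Mirrors tablesQueryForm from the new file above.
type tablesQueryForm struct {
	Cate         string `json:"cate"`
	DatasourceId int64  `json:"datasource_id"`
	Database     string `json:"db"`
	IsStable     bool   `json:"is_stable"`
}

func main() {
	body, _ := json.Marshal(tablesQueryForm{
		Cate:         "tdengine",
		DatasourceId: 1,
		Database:     "log",
		IsStable:     true,
	})
	// Placeholder URL and path: the real route is registered elsewhere.
	resp, err := http.Post("http://127.0.0.1:17000/hypothetical/tdengine-tables",
		"application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```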
@@ -29,6 +29,8 @@ Port = 389
BaseDn = 'dc=example,dc=org'
BindUser = 'cn=manager,dc=example,dc=org'
BindPass = '*******'
+# openldap format e.g. (&(uid=%s))
+# AD format e.g. (&(sAMAccountName=%s))
AuthFilter = '(&(uid=%s))'
CoverAttributes = true
TLS = false

@@ -2,6 +2,7 @@ package main

import (
	"context"
+	"errors"
	"fmt"

	"github.com/ccfos/nightingale/v6/alert"

@@ -16,6 +17,7 @@ import (
	"github.com/ccfos/nightingale/v6/prom"
	"github.com/ccfos/nightingale/v6/pushgw/idents"
	"github.com/ccfos/nightingale/v6/pushgw/writer"
+	"github.com/ccfos/nightingale/v6/tdengine"

	alertrt "github.com/ccfos/nightingale/v6/alert/router"
	pushgwrt "github.com/ccfos/nightingale/v6/pushgw/router"

@@ -31,7 +33,10 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
	if err != nil {
		return nil, err
	}
+	// check whether CenterApi is still at its default (empty) value
+	if len(config.CenterApi.Addrs) < 1 {
+		return nil, errors.New("failed to init config: the CenterApi configuration is missing")
+	}
	ctx := ctx.NewContext(context.Background(), nil, false, config.CenterApi)

	syncStats := memsto.NewSyncStats()

@@ -54,9 +59,11 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
	userGroupCache := memsto.NewUserGroupCache(ctx, syncStats)

	promClients := prom.NewPromClient(ctx, config.Alert.Heartbeat)
+	tdengineClients := tdengine.NewTdengineClient(ctx, config.Alert.Heartbeat)
	externalProcessors := process.NewExternalProcessors()

-	alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, userCache, userGroupCache)
+	alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache,
+		alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, tdengineClients, userCache, userGroupCache)

	alertrtRouter := alertrt.New(config.HTTP, config.Alert, alertMuteCache, targetCache, busiGroupCache, alertStats, ctx, externalProcessors)

@@ -48,7 +48,7 @@ func InitConfig(configDir, cryptoKey string) (*ConfigType, error) {
	}

	config.Pushgw.PreCheck()
-	config.Alert.PreCheck()
+	config.Alert.PreCheck(configDir)
	config.Center.PreCheck()

	err := decryptConfig(config, cryptoKey)

@@ -77,4 +77,3 @@ Committers are recorded and published in **[COMMITTERS](https://github.com/ccfos/nightingale
2. Before asking a question, please search [Github Issues](https://github.com/ccfos/nightingale/issues "Github Issue") first;
3. We recommend asking questions by opening a [Github Issue](https://github.com/ccfos/nightingale/issues "Github Issue"): [click here for bug reports](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&template=bug_report.yml "click here for bug reports") | [click here for feature requests](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md "click here for feature requests");

-Finally, we recommend joining the WeChat group to discuss open-ended questions (first add [UlricGO](https://www.gitlink.org.cn/UlricQin/gist/tree/master/self.jpeg "UlricGO") as a friend, with the note "Nightingale + name + company"; the group includes the developer team and professional, helpful members who answer questions).

@@ -2,8 +2,7 @@ version: "3.7"

services:
  mysql:
    # platform: linux/x86_64
-    image: "mysql:5.7"
+    image: "mysql:8"
    container_name: mysql
    hostname: mysql
    restart: always

@@ -58,7 +57,7 @@ services:
    depends_on:
      - mysql
    command: >
-      sh -c "/wait && /app/ibex server"
+      sh -c "/app/ibex server"

  n9e:
    image: flashcatcloud/nightingale:latest

@@ -76,7 +75,6 @@ services:
      - mysql
      - redis
      - prometheus
      - ibex
    command: >
      sh -c "/wait && /app/n9e"

@@ -96,5 +94,4 @@ services:
      - /var/run/docker.sock:/var/run/docker.sock
    network_mode: host
    depends_on:
-      - n9e
-      - ibex
+      - n9e
@@ -53,7 +53,7 @@ insert into user_group_member(group_id, user_id) values(1, 1);
CREATE TABLE configs (
    id bigserial,
    ckey varchar(191) not null,
-   cval varchar(4096) not null default '',
+   cval text not null default '',
    PRIMARY KEY (id),
    UNIQUE (ckey)
) ;
@@ -94,10 +94,18 @@ insert into role_operation(role_name, operation) values('Standard', '/log/explor
insert into role_operation(role_name, operation) values('Standard', '/trace/explorer');
insert into role_operation(role_name, operation) values('Standard', '/help/version');
insert into role_operation(role_name, operation) values('Standard', '/help/contact');
insert into role_operation(role_name, operation) values('Standard', '/help/servers');
insert into role_operation(role_name, operation) values('Standard', '/help/migrate');

insert into role_operation(role_name, operation) values('Standard', '/alert-rules-built-in');
insert into role_operation(role_name, operation) values('Standard', '/dashboards-built-in');
insert into role_operation(role_name, operation) values('Standard', '/trace/dependencies');

insert into role_operation(role_name, operation) values('Admin', '/help/source');
insert into role_operation(role_name, operation) values('Admin', '/help/sso');
insert into role_operation(role_name, operation) values('Admin', '/help/notification-tpls');
insert into role_operation(role_name, operation) values('Admin', '/help/notification-settings');

insert into role_operation(role_name, operation) values('Standard', '/users');
insert into role_operation(role_name, operation) values('Standard', '/user-groups');
insert into role_operation(role_name, operation) values('Standard', '/user-groups/add');
@@ -292,6 +300,7 @@ CREATE TABLE alert_rule (
    runbook_url varchar(255),
    append_tags varchar(255) not null default '',
    annotations text not null,
+   extra_config text not null,
    create_at bigint not null default 0,
    create_by varchar(64) not null default '',
    update_at bigint not null default 0,
@@ -320,7 +329,7 @@ COMMENT ON COLUMN alert_rule.recover_duration IS 'unit: s';
COMMENT ON COLUMN alert_rule.callbacks IS 'split by space: http://a.com/api/x http://a.com/api/y';
COMMENT ON COLUMN alert_rule.append_tags IS 'split by space: service=n9e mod=api';
COMMENT ON COLUMN alert_rule.annotations IS 'annotations';
+COMMENT ON COLUMN alert_rule.extra_config IS 'extra_config';

CREATE TABLE alert_mute (
    id bigserial,
@@ -337,6 +346,7 @@ CREATE TABLE alert_mute (
    disabled smallint not null default 0,
    mute_time_type smallint not null default 0,
    periodic_mutes varchar(4096) not null default '',
+   severities varchar(32) not null default '',
    create_at bigint not null default 0,
    create_by varchar(64) not null default '',
    update_at bigint not null default 0,
@@ -363,13 +373,15 @@ CREATE TABLE alert_subscribe (
    datasource_ids varchar(255) not null default '',
    cluster varchar(128) not null,
    rule_id bigint not null default 0,
-   tags varchar(4096) not null default '',
+   severities varchar(32) not null default '',
+   tags varchar(4096) not null default '[]',
    redefine_severity smallint default 0,
    new_severity smallint not null,
    redefine_channels smallint default 0,
    new_channels varchar(255) not null default '',
    user_group_ids varchar(250) not null,
    webhooks text not null,
+   extra_config text not null,
    redefine_webhooks smallint default 0,
    for_duration bigint not null default 0,
    create_at bigint not null default 0,
@@ -389,8 +401,9 @@ COMMENT ON COLUMN alert_subscribe.new_severity IS '0:Emergency 1:Warning 2:Notic
COMMENT ON COLUMN alert_subscribe.redefine_channels IS 'is redefine channels?';
COMMENT ON COLUMN alert_subscribe.new_channels IS 'split by space: sms voice email dingtalk wecom';
COMMENT ON COLUMN alert_subscribe.user_group_ids IS 'split by space 1 34 5, notify cc to user_group_ids';
+COMMENT ON COLUMN alert_subscribe.extra_config IS 'extra_config';

CREATE TABLE target (
    id bigserial,
    group_id bigint not null default 0,
@@ -456,6 +469,7 @@ CREATE TABLE recording_rule (
    prom_ql varchar(8192) not null,
    prom_eval_interval int not null,
    append_tags varchar(255) default '',
+   query_configs text not null,
    create_at bigint default '0',
    create_by varchar(64) default '',
    update_at bigint default '0',
@@ -472,6 +486,7 @@ COMMENT ON COLUMN recording_rule.disabled IS '0:enabled 1:disabled';
COMMENT ON COLUMN recording_rule.prom_ql IS 'promql';
COMMENT ON COLUMN recording_rule.prom_eval_interval IS 'evaluate interval';
COMMENT ON COLUMN recording_rule.append_tags IS 'split by space: service=n9e mod=api';
+COMMENT ON COLUMN recording_rule.query_configs IS 'query configs';

CREATE TABLE alert_aggr_view (
@@ -732,4 +747,21 @@ CREATE TABLE sso_config (
    content text not null,
    PRIMARY KEY (id),
    UNIQUE (name)
) ;

+CREATE TABLE es_index_pattern (
+    id bigserial,
+    datasource_id bigint not null default 0,
+    name varchar(191) not null,
+    time_field varchar(128) not null default '@timestamp',
+    allow_hide_system_indices smallint not null default 0,
+    fields_format varchar(4096) not null default '',
+    create_at bigint default '0',
+    create_by varchar(64) default '',
+    update_at bigint default '0',
+    update_by varchar(64) default '',
+    PRIMARY KEY (id),
+    UNIQUE (datasource_id, name)
+) ;
+COMMENT ON COLUMN es_index_pattern.datasource_id IS 'datasource id';

@@ -9,7 +9,7 @@ Level = "DEBUG"
# stdout, stderr, file
Output = "stdout"
# # rotate by time
-# KeepHours: 4
+# KeepHours = 4
# # rotate by size
# RotateNum = 3
# # unit: MB
@@ -41,24 +41,17 @@ WriteTimeout = 40
# http server idle timeout, unit: s
IdleTimeout = 120

-[HTTP.Pushgw]
+[HTTP.ShowCaptcha]
Enable = false

+[HTTP.APIForAgent]
+Enable = true
-# [HTTP.Pushgw.BasicAuth]
+# [HTTP.APIForAgent.BasicAuth]
# user001 = "ccc26da7b9aba533cbb263a36c07dcc5"

-[HTTP.Alert]
+[HTTP.APIForService]
Enable = true
-[HTTP.Alert.BasicAuth]
-user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
-
-[HTTP.Heartbeat]
-Enable = true
-# [HTTP.Heartbeat.BasicAuth]
-# user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
-
-[HTTP.Service]
-Enable = true
-[HTTP.Service.BasicAuth]
+[HTTP.APIForService.BasicAuth]
user001 = "ccc26da7b9aba533cbb263a36c07dcc5"

[HTTP.JWTAuth]
@@ -77,6 +70,16 @@ Enable = false
HeaderUserNameKey = "X-User-Name"
DefaultRoles = ["Standard"]

+[HTTP.RSA]
+# open RSA
+OpenRSA = false
+# RSA public key
+RSAPublicKeyPath = "/etc/n9e/public.pem"
+# RSA private key
+RSAPrivateKeyPath = "/etc/n9e/private.pem"
+# RSA private key password
+RSAPassWord = ""

[DB]
DSN="host=postgres port=5432 user=root dbname=n9e_v6 password=1234 sslmode=disable"
# enable debug mode or not

@@ -115,7 +118,7 @@ RedisType = "standalone"
IP = ""
# unit ms
Interval = 1000
-ClusterName = "default"
+EngineName = "default"

# [Alert.Alerting]
# NotifyConcurrency = 10

@@ -128,16 +131,49 @@ I18NHeaderKey = "X-Language"
PromQuerier = true
AlertDetail = true

+[Center.Ibex]
+Address = "http://ibex:10090"
+# basic auth
+BasicAuthUser = "ibex"
+BasicAuthPass = "ibex"
+# unit: ms
+Timeout = 3000

[Pushgw]
# use target labels in database instead of in series
LabelRewrite = true
# # default busigroup key name
# BusiGroupLabelKey = "busigroup"
# ForceUseServerTS = false

-[[Pushgw.Writers]]
-Url = "http://victoriametrics:8428/api/v1/write"
# [Pushgw.DebugSample]
# ident = "xx"
# __name__ = "xx"

# [Pushgw.WriterOpt]
# QueueMaxSize = 1000000
# QueuePopSize = 1000

+[[Pushgw.Writers]]
+# Url = "http://127.0.0.1:8480/insert/0/prometheus/api/v1/write"
+Url = "http://victoriametrics:8428/api/v1/write"
+# Basic auth username
+BasicAuthUser = ""
+# Basic auth password
+BasicAuthPass = ""
+# timeout settings, unit: ms
+Headers = ["X-From", "n9e"]
+Timeout = 10000
+DialTimeout = 3000
+TLSHandshakeTimeout = 30000
+ExpectContinueTimeout = 1000
+IdleConnTimeout = 90000
+# time duration, unit: ms
+KeepAlive = 30000
+MaxConnsPerHost = 0
+MaxIdleConns = 100
+MaxIdleConnsPerHost = 100
+## Optional TLS Config
+# UseTLS = false
+# TLSCA = "/etc/n9e/ca.pem"
+# TLSCert = "/etc/n9e/cert.pem"
+# TLSKey = "/etc/n9e/key.pem"
+# InsecureSkipVerify = false
+# [[Writers.WriteRelabels]]
+# Action = "replace"
+# SourceLabels = ["__address__"]
+# Regex = "([^:]+)(?::\\d+)?"
+# Replacement = "$1:80"
+# TargetLabel = "__address__"
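
These writer knobs correspond closely to Go's standard `net/http` transport settings. The sketch below is not n9e's actual writer code, just an illustration of how the millisecond values above would map onto an `http.Client`:

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newWriterClient builds an http.Client whose transport mirrors the
// [[Pushgw.Writers]] timeout knobs above (TOML values are in ms).
func newWriterClient() *http.Client {
	return &http.Client{
		Timeout: 10000 * time.Millisecond, // Timeout
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   3000 * time.Millisecond,  // DialTimeout
				KeepAlive: 30000 * time.Millisecond, // KeepAlive
			}).DialContext,
			TLSHandshakeTimeout:   30000 * time.Millisecond, // TLSHandshakeTimeout
			ExpectContinueTimeout: 1000 * time.Millisecond,  // ExpectContinueTimeout
			IdleConnTimeout:       90000 * time.Millisecond, // IdleConnTimeout
			MaxConnsPerHost:       0,                        // MaxConnsPerHost (0 = unlimited)
			MaxIdleConns:          100,                      // MaxIdleConns
			MaxIdleConnsPerHost:   100,                      // MaxIdleConnsPerHost
		},
	}
}

func main() { _ = newWriterClient() }
```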
@@ -214,7 +214,7 @@

<footer>
	<div class="copyright" style="font-style: italic">
-		We hope to work with you to take monitoring to the ultimate level!
+		Too many alerts? Use <a href="https://flashcat.cloud/product/flashduty/" target="_blank">FlashDuty</a> for alert aggregation, noise reduction, and on-call scheduling!
	</div>
</footer>
</div>

@@ -82,6 +82,7 @@ RSAPassWord = ""

[DB]
# postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
+# postgres: DSN="host=127.0.0.1 port=5432 user=root dbname=n9e_v6 password=1234 sslmode=disable"
DSN="root:1234@tcp(127.0.0.1:3306)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
# enable debug mode or not
Debug = false

742 etc/metrics.yaml
@@ -74,14 +74,16 @@ zh:
  mem_write_back: 正在被写回到磁盘的内存大小
  mem_write_back_tmp: FUSE用于临时写回缓冲区的内存

-  net_bytes_recv: 网卡收包总数(bytes)
-  net_bytes_sent: 网卡发包总数(bytes)
+  net_bytes_recv: 网卡收包总数(bytes),计算每秒速率时需要用到rate/irate函数
+  net_bytes_sent: 网卡发包总数(bytes),计算每秒速率时需要用到rate/irate函数
  net_drop_in: 网卡收丢包数量
  net_drop_out: 网卡发丢包数量
  net_err_in: 网卡收包错误数量
  net_err_out: 网卡发包错误数量
  net_packets_recv: 网卡收包数量
  net_packets_sent: 网卡发包数量
+  net_bits_recv: 网卡收包总数(bits),计算每秒速率时需要用到rate/irate函数
+  net_bits_sent: 网卡发包总数(bits),计算每秒速率时需要用到rate/irate函数

  netstat_tcp_established: ESTABLISHED状态的网络链接数
  netstat_tcp_fin_wait1: FIN_WAIT1状态的网络链接数
@@ -93,10 +95,26 @@ zh:
  netstat_tcp_time_wait: TIME_WAIT状态的网络链接数
  netstat_udp_socket: UDP状态的网络链接数

+  netstat_sockets_used: 已使用的所有协议套接字总量
+  netstat_tcp_inuse: 正在使用(正在侦听)的TCP套接字数量
+  netstat_tcp_orphan: 无主(不属于任何进程)的TCP连接数(无用、待销毁的TCP socket数)
+  netstat_tcp_tw: TIME_WAIT状态的TCP连接数
+  netstat_tcp_alloc: 已分配(已建立、已申请到sk_buff)的TCP套接字数量
+  netstat_tcp_mem: TCP套接字内存Page使用量
+  netstat_udp_inuse: 在使用的UDP套接字数量
+  netstat_udp_mem: UDP套接字内存Page使用量
+  netstat_udplite_inuse: 正在使用的 udp lite 数量
+  netstat_raw_inuse: 正在使用的 raw socket 数量
+  netstat_frag_inuse: ip fragement 数量
+  netstat_frag_memory: ip fragement 已经分配的内存(byte)

  #[ping]
  ping_percent_packet_loss: ping数据包丢失百分比(%)
  ping_result_code: ping返回码('0','1')

+  net_response_result_code: 网络探测结果,0表示正常,非0表示异常
+  net_response_response_time: 网络探测时延,单位:秒

  processes_blocked: 不可中断的睡眠状态下的进程数('U','D','L')
  processes_dead: 回收中的进程数('X')
  processes_idle: 挂起的空闲进程数('I')
@@ -114,6 +132,9 @@ zh:
  system_load1: 1分钟平均load值
  system_load5: 5分钟平均load值
  system_load15: 15分钟平均load值
+  system_load_norm_1: 1分钟平均load值/逻辑CPU个数
+  system_load_norm_5: 5分钟平均load值/逻辑CPU个数
+  system_load_norm_15: 15分钟平均load值/逻辑CPU个数
  system_n_users: 用户数
  system_n_cpus: CPU核数
  system_uptime: 系统启动时间
@@ -327,8 +348,10 @@ en:
  mem_write_back: "The memory size of the disk is being written back to the disk"
  mem_write_back_tmp: "Fuse is used to temporarily write back the memory of the buffer area"

-  net_bytes_recv: "The total number of packaging of the network card (bytes)"
-  net_bytes_sent: "Total number of network cards (bytes)"
+  net_bytes_recv: "Total inbound traffic(bytes) of network card"
+  net_bytes_sent: "Total outbound traffic(bytes) of network card"
+  net_bits_recv: "Total inbound traffic(bits) of network card"
+  net_bits_sent: "Total outbound traffic(bits) of network card"
  net_drop_in: "The number of packets for network cards"
  net_drop_out: "The number of packets issued by the network card"
  net_err_in: "The number of incorrect packets of the network card"
@@ -363,6 +386,9 @@ en:
  system_load1: "1 minute average load value"
  system_load5: "5 minutes average load value"
  system_load15: "15 minutes average load value"
+  system_load_norm_1: "1 minute average load value/logical CPU number"
+  system_load_norm_5: "5 minutes average load value/logical CPU number"
+  system_load_norm_15: "15 minutes average load value/logical CPU number"
  system_n_users: "User number"
  system_n_cpus: "CPU nuclear number"
  system_uptime: "System startup time"
@@ -383,366 +409,366 @@ en:
  http_response_response_time: "When http ring application"
  http_response_result_code: "URL detection result 0 is normal, otherwise the URL cannot be accessed"

  # [mysqld_exporter]
  mysql_global_status_uptime: The number of seconds that the server has been up.(Gauge)
  mysql_global_status_uptime_since_flush_status: The number of seconds since the most recent FLUSH STATUS statement.(Gauge)
  mysql_global_status_queries: The number of statements executed by the server. This variable includes statements executed within stored programs, unlike the Questions variable. It does not count COM_PING or COM_STATISTICS commands.(Counter)
  mysql_global_status_threads_connected: The number of currently open connections.(Counter)
  mysql_global_status_connections: The number of connection attempts (successful or not) to the MySQL server.(Gauge)
  mysql_global_status_max_used_connections: The maximum number of connections that have been in use simultaneously since the server started.(Gauge)
  mysql_global_status_threads_running: The number of threads that are not sleeping.(Gauge)
  mysql_global_status_questions: The number of statements executed by the server. This includes only statements sent to the server by clients and not statements executed within stored programs, unlike the Queries variable. This variable does not count COM_PING, COM_STATISTICS, COM_STMT_PREPARE, COM_STMT_CLOSE, or COM_STMT_RESET commands.(Counter)
  mysql_global_status_threads_cached: The number of threads in the thread cache.(Counter)
  mysql_global_status_threads_created: The number of threads created to handle connections. If Threads_created is big, you may want to increase the thread_cache_size value. The cache miss rate can be calculated as Threads_created/Connections.(Counter)
  mysql_global_status_created_tmp_tables: The number of internal temporary tables created by the server while executing statements.(Counter)
  mysql_global_status_created_tmp_disk_tables: The number of internal on-disk temporary tables created by the server while executing statements. You can compare the number of internal on-disk temporary tables created to the total number of internal temporary tables created by comparing Created_tmp_disk_tables and Created_tmp_tables values.(Counter)
  mysql_global_status_created_tmp_files: How many temporary files mysqld has created.(Counter)
  mysql_global_status_select_full_join: The number of joins that perform table scans because they do not use indexes. If this value is not 0, you should carefully check the indexes of your tables.(Counter)
  mysql_global_status_select_full_range_join: The number of joins that used a range search on a reference table.(Counter)
  mysql_global_status_select_range: The number of joins that used ranges on the first table. This is normally not a critical issue even if the value is quite large.(Counter)
  mysql_global_status_select_range_check: The number of joins without keys that check for key usage after each row. If this is not 0, you should carefully check the indexes of your tables.(Counter)
  mysql_global_status_select_scan: The number of joins that did a full scan of the first table.(Counter)
  mysql_global_status_sort_rows: The number of sorted rows.(Counter)
  mysql_global_status_sort_range: The number of sorts that were done using ranges.(Counter)
  mysql_global_status_sort_merge_passes: The number of merge passes that the sort algorithm has had to do. If this value is large, you should consider increasing the value of the sort_buffer_size system variable.(Counter)
  mysql_global_status_sort_scan: The number of sorts that were done by scanning the table.(Counter)
  mysql_global_status_slow_queries: The number of queries that have taken more than long_query_time seconds. This counter increments regardless of whether the slow query log is enabled.(Counter)
  mysql_global_status_aborted_connects: The number of failed attempts to connect to the MySQL server.(Counter)
  mysql_global_status_aborted_clients: The number of connections that were aborted because the client died without closing the connection properly.(Counter)
  mysql_global_status_table_locks_immediate: The number of times that a request for a table lock could be granted immediately. Locks Immediate rising and falling is normal activity.(Counter)
  mysql_global_status_table_locks_waited: The number of times that a request for a table lock could not be granted immediately and a wait was needed. If this is high and you have performance problems, you should first optimize your queries, and then either split your table or tables or use replication.(Counter)
  mysql_global_status_bytes_received: The number of bytes received from all clients.(Counter)
  mysql_global_status_bytes_sent: The number of bytes sent to all clients.(Counter)
  mysql_global_status_innodb_page_size: InnoDB page size (default 16KB). Many values are counted in pages; the page size enables them to be easily converted to bytes.(Gauge)
  mysql_global_status_buffer_pool_pages: The number of pages in the InnoDB buffer pool.(Gauge)
  mysql_global_status_commands_total: The number of times each xxx statement has been executed.(Counter)
  mysql_global_status_handlers_total: Handler statistics are internal statistics on how MySQL is selecting, updating, inserting, and modifying rows, tables, and indexes. This is in fact the layer between the Storage Engine and MySQL.(Counter)
  mysql_global_status_opened_files: The number of files that have been opened with my_open() (a mysys library function). Parts of the server that open files without using this function do not increment the count.(Counter)
  mysql_global_status_open_tables: The number of tables that are open.(Gauge)
  mysql_global_status_opened_tables: The number of tables that have been opened. If Opened_tables is big, your table_open_cache value is probably too small.(Counter)
  mysql_global_status_table_open_cache_hits: The number of hits for open tables cache lookups.(Counter)
  mysql_global_status_table_open_cache_misses: The number of misses for open tables cache lookups.(Counter)
  mysql_global_status_table_open_cache_overflows: The number of overflows for the open tables cache.(Counter)
  mysql_global_status_innodb_num_open_files: The number of files InnoDB currently holds open.(Gauge)
  mysql_global_status_connection_errors_total: These variables provide information about errors that occur during the client connection process.(Counter)
  mysql_global_status_innodb_buffer_pool_read_requests: The number of logical read requests.(Counter)
  mysql_global_status_innodb_buffer_pool_reads: The number of logical reads that InnoDB could not satisfy from the buffer pool, and had to read directly from disk.(Counter)
  mysql_global_variables_thread_cache_size: How many threads the server should cache for reuse.(Gauge)
  mysql_global_variables_max_connections: The maximum permitted number of simultaneous client connections.(Gauge)
  mysql_global_variables_innodb_buffer_pool_size: The size in bytes of the buffer pool, the memory area where InnoDB caches table and index data. The default value is 134217728 bytes (128MB).(Gauge)
  mysql_global_variables_innodb_log_buffer_size: The size in bytes of the buffer that InnoDB uses to write to the log files on disk.(Gauge)
  mysql_global_variables_key_buffer_size: Index blocks for MyISAM tables are buffered and are shared by all threads.(Gauge)
  mysql_global_variables_query_cache_size: The amount of memory allocated for caching query results.(Gauge)
  mysql_global_variables_table_open_cache: The number of open tables for all threads.(Gauge)
  mysql_global_variables_open_files_limit: The number of file descriptors available to mysqld from the operating system.(Gauge)

  # [redis_exporter]
  redis_active_defrag_running: When activedefrag is enabled, this indicates whether defragmentation is currently active, and the CPU percentage it intends to utilize.
  redis_allocator_active_bytes: Total bytes in the allocator active pages, this includes external-fragmentation.
  redis_allocator_allocated_bytes: Total bytes allocated from the allocator, including internal-fragmentation. Normally the same as used_memory.
  redis_allocator_frag_bytes: Delta between allocator_active and allocator_allocated. See note about mem_fragmentation_bytes.
  redis_allocator_frag_ratio: Ratio between allocator_active and allocator_allocated. This is the true (external) fragmentation metric (not mem_fragmentation_ratio).
  redis_allocator_resident_bytes: Total bytes resident (RSS) in the allocator, this includes pages that can be released to the OS (by MEMORY PURGE, or just waiting).
  redis_allocator_rss_bytes: Delta between allocator_resident and allocator_active.
  redis_allocator_rss_ratio: Ratio between allocator_resident and allocator_active. This usually indicates pages that the allocator can and probably will soon release back to the OS.
  redis_aof_current_rewrite_duration_sec: Duration of the on-going AOF rewrite operation if any.
  redis_aof_enabled: Flag indicating AOF logging is activated.
  redis_aof_last_bgrewrite_status: Status of the last AOF rewrite operation.
  redis_aof_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last AOF rewrite operation.
  redis_aof_last_rewrite_duration_sec: Duration of the last AOF rewrite operation in seconds.
  redis_aof_last_write_status: Status of the last write operation to the AOF.
  redis_aof_rewrite_in_progress: Flag indicating an AOF rewrite operation is on-going.
  redis_aof_rewrite_scheduled: Flag indicating an AOF rewrite operation will be scheduled once the on-going RDB save is complete.
  redis_blocked_clients: Number of clients pending on a blocking call (BLPOP, BRPOP, BRPOPLPUSH, BLMOVE, BZPOPMIN, BZPOPMAX).
  redis_client_recent_max_input_buffer_bytes: Biggest input buffer among current client connections.
  redis_client_recent_max_output_buffer_bytes: Biggest output buffer among current client connections.
  redis_cluster_enabled: Indicates Redis cluster is enabled.
  redis_commands_duration_seconds_total: The total CPU time consumed by these commands.(Counter)
  redis_commands_processed_total: Total number of commands processed by the server.(Counter)
  redis_commands_total: The number of calls that reached command execution (not rejected).(Counter)
  redis_config_maxclients: The value of the maxclients configuration directive. This is the upper limit for the sum of connected_clients, connected_slaves and cluster_connections.
  redis_config_maxmemory: The value of the maxmemory configuration directive.
  redis_connected_clients: Number of client connections (excluding connections from replicas).
  redis_connected_slaves: Number of connected replicas.
  redis_connections_received_total: Total number of connections accepted by the server.(Counter)
  redis_cpu_sys_children_seconds_total: System CPU consumed by the background processes.(Counter)
  redis_cpu_sys_seconds_total: System CPU consumed by the Redis server, which is the sum of system CPU consumed by all threads of the server process (main thread and background threads).(Counter)
  redis_cpu_user_children_seconds_total: User CPU consumed by the background processes.(Counter)
  redis_cpu_user_seconds_total: User CPU consumed by the Redis server, which is the sum of user CPU consumed by all threads of the server process (main thread and background threads).(Counter)
  redis_db_keys: Total number of keys by DB.
  redis_db_keys_expiring: Total number of expiring keys by DB
  redis_defrag_hits: Number of value reallocations performed by the active defragmentation process.
  redis_defrag_misses: Number of aborted value reallocations started by the active defragmentation process.
  redis_defrag_key_hits: Number of keys that were actively defragmented.
  redis_defrag_key_misses: Number of keys that were skipped by the active defragmentation process.
  redis_evicted_keys_total: Number of evicted keys due to maxmemory limit.(Counter)
  redis_expired_keys_total: Total number of key expiration events.(Counter)
  redis_expired_stale_percentage: The percentage of keys probably expired.
  redis_expired_time_cap_reached_total: The count of times that active expiry cycles have stopped early.
  redis_exporter_last_scrape_connect_time_seconds: The duration (in seconds) to connect when scraping.
  redis_exporter_last_scrape_duration_seconds: The last scrape duration.
  redis_exporter_last_scrape_error: The last scrape error status.
  redis_exporter_scrape_duration_seconds_count: Durations of scrapes by the exporter
  redis_exporter_scrape_duration_seconds_sum: Durations of scrapes by the exporter
  redis_exporter_scrapes_total: Current total redis scrapes.(Counter)
  redis_instance_info: Information about the Redis instance.
  redis_keyspace_hits_total: Hits total.(Counter)
  redis_keyspace_misses_total: Misses total.(Counter)
  redis_last_key_groups_scrape_duration_milliseconds: Duration of the last key group metrics scrape in milliseconds.
  redis_last_slow_execution_duration_seconds: The amount of time needed for last slow execution, in seconds.
  redis_latest_fork_seconds: The amount of time needed for last fork, in seconds.
  redis_lazyfree_pending_objects: The number of objects waiting to be freed (as a result of calling UNLINK, or FLUSHDB and FLUSHALL with the ASYNC option).
  redis_master_repl_offset: The server's current replication offset.
  redis_mem_clients_normal: Memory used by normal clients.(Gauge)
  redis_mem_clients_slaves: Memory used by replica clients - Starting Redis 7.0, replica buffers share memory with the replication backlog, so this field can show 0 when replicas don't trigger an increase of memory usage.
  redis_mem_fragmentation_bytes: Delta between used_memory_rss and used_memory. Note that when the total fragmentation bytes is low (few megabytes), a high ratio (e.g. 1.5 and above) is not an indication of an issue.
  redis_mem_fragmentation_ratio: Ratio between used_memory_rss and used_memory. Note that this doesn't only include fragmentation, but also other process overheads (see the allocator_* metrics), and also overheads like code, shared libraries, stack, etc.
  redis_mem_not_counted_for_eviction_bytes: (Gauge)
  redis_memory_max_bytes: Max memory limit in bytes.
  redis_memory_used_bytes: Total number of bytes allocated by Redis using its allocator (either standard libc, jemalloc, or an alternative allocator such as tcmalloc)
  redis_memory_used_dataset_bytes: The size in bytes of the dataset (used_memory_overhead subtracted from used_memory)
  redis_memory_used_lua_bytes: Number of bytes used by the Lua engine.
  redis_memory_used_overhead_bytes: The sum in bytes of all overheads that the server allocated for managing its internal data structures.
  redis_memory_used_peak_bytes: Peak memory consumed by Redis (in bytes)
  redis_memory_used_rss_bytes: Number of bytes that Redis allocated as seen by the operating system (a.k.a resident set size). This is the number reported by tools such as top(1) and ps(1)
  redis_memory_used_scripts_bytes: Number of bytes used by cached Lua scripts
  redis_memory_used_startup_bytes: Initial amount of memory consumed by Redis at startup in bytes
  redis_migrate_cached_sockets_total: The number of sockets open for MIGRATE purposes
  redis_net_input_bytes_total: Total input bytes(Counter)
  redis_net_output_bytes_total: Total output bytes(Counter)
  redis_process_id: Process ID
  redis_pubsub_channels: Global number of pub/sub channels with client subscriptions
  redis_pubsub_patterns: Global number of pub/sub patterns with client subscriptions
  redis_rdb_bgsave_in_progress: Flag indicating an RDB save is on-going
  redis_rdb_changes_since_last_save: Number of changes since the last dump
  redis_rdb_current_bgsave_duration_sec: Duration of the on-going RDB save operation if any
  redis_rdb_last_bgsave_duration_sec: Duration of the last RDB save operation in seconds
  redis_rdb_last_bgsave_status: Status of the last RDB save operation
  redis_rdb_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last RDB save operation
  redis_rdb_last_save_timestamp_seconds: Epoch-based timestamp of last successful RDB save
  redis_rejected_connections_total: Number of connections rejected because of maxclients limit(Counter)
  redis_repl_backlog_first_byte_offset: The master offset of the replication backlog buffer
  redis_repl_backlog_history_bytes: Size in bytes of the data in the replication backlog buffer
  redis_repl_backlog_is_active: Flag indicating replication backlog is active
  redis_replica_partial_resync_accepted: The number of accepted partial resync requests(Gauge)
  redis_replica_partial_resync_denied: The number of denied partial resync requests(Gauge)
  redis_replica_resyncs_full: The number of full resyncs with replicas
  redis_replication_backlog_bytes: Memory used by replication backlog
  redis_second_repl_offset: The offset up to which replication IDs are accepted.
  redis_slave_expires_tracked_keys: The number of keys tracked for expiry purposes (applicable only to writable replicas)(Gauge)
  redis_slowlog_last_id: Last id of slowlog
  redis_slowlog_length: Total slowlog
  redis_start_time_seconds: Start time of the Redis instance since unix epoch in seconds.
  redis_target_scrape_request_errors_total: Errors in requests to the exporter
  redis_up: Flag indicating redis instance is up
  redis_uptime_in_seconds: Number of seconds since Redis server start
# [redis_exporter]
redis_active_defrag_running: When activedefrag is enabled, this indicates whether defragmentation is currently active, and the CPU percentage it intends to utilize.
redis_allocator_active_bytes: Total bytes in the allocator active pages; this includes external fragmentation.
redis_allocator_allocated_bytes: Total bytes allocated from the allocator, including internal fragmentation. Normally the same as used_memory.
redis_allocator_frag_bytes: Delta between allocator_active and allocator_allocated. See note about mem_fragmentation_bytes.
redis_allocator_frag_ratio: Ratio between allocator_active and allocator_allocated. This is the true (external) fragmentation metric (not mem_fragmentation_ratio).
redis_allocator_resident_bytes: Total bytes resident (RSS) in the allocator; this includes pages that can be released to the OS (by MEMORY PURGE, or just waiting).
redis_allocator_rss_bytes: Delta between allocator_resident and allocator_active.
redis_allocator_rss_ratio: Ratio between allocator_resident and allocator_active. This usually indicates pages that the allocator can and probably will soon release back to the OS.
redis_aof_current_rewrite_duration_sec: Duration of the ongoing AOF rewrite operation, if any.
redis_aof_enabled: Flag indicating AOF logging is activated.
redis_aof_last_bgrewrite_status: Status of the last AOF rewrite operation.
redis_aof_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last AOF rewrite operation.
redis_aof_last_rewrite_duration_sec: Duration of the last AOF rewrite operation in seconds.
redis_aof_last_write_status: Status of the last write operation to the AOF.
redis_aof_rewrite_in_progress: Flag indicating an AOF rewrite operation is ongoing.
redis_aof_rewrite_scheduled: Flag indicating an AOF rewrite operation will be scheduled once the ongoing RDB save is complete.
redis_blocked_clients: Number of clients pending on a blocking call (BLPOP, BRPOP, BRPOPLPUSH, BLMOVE, BZPOPMIN, BZPOPMAX).
redis_client_recent_max_input_buffer_bytes: Biggest input buffer among current client connections.
redis_client_recent_max_output_buffer_bytes: Biggest output buffer among current client connections.
redis_cluster_enabled: Indicates whether Redis cluster is enabled.
redis_commands_duration_seconds_total: The total CPU time consumed by these commands. (Counter)
redis_commands_processed_total: Total number of commands processed by the server. (Counter)
redis_commands_total: The number of calls that reached command execution (not rejected). (Counter)
redis_config_maxclients: The value of the maxclients configuration directive. This is the upper limit for the sum of connected_clients, connected_slaves and cluster_connections.
redis_config_maxmemory: The value of the maxmemory configuration directive.
redis_connected_clients: Number of client connections (excluding connections from replicas).
redis_connected_slaves: Number of connected replicas.
redis_connections_received_total: Total number of connections accepted by the server. (Counter)
redis_cpu_sys_children_seconds_total: System CPU consumed by the background processes. (Counter)
redis_cpu_sys_seconds_total: System CPU consumed by the Redis server, which is the sum of system CPU consumed by all threads of the server process (main thread and background threads). (Counter)
redis_cpu_user_children_seconds_total: User CPU consumed by the background processes. (Counter)
redis_cpu_user_seconds_total: User CPU consumed by the Redis server, which is the sum of user CPU consumed by all threads of the server process (main thread and background threads). (Counter)
redis_db_keys: Total number of keys by DB.
redis_db_keys_expiring: Total number of expiring keys by DB.
redis_defrag_hits: Number of value reallocations performed by the active defragmentation process.
redis_defrag_misses: Number of aborted value reallocations started by the active defragmentation process.
redis_defrag_key_hits: Number of keys that were actively defragmented.
redis_defrag_key_misses: Number of keys that were skipped by the active defragmentation process.
redis_evicted_keys_total: Number of keys evicted due to the maxmemory limit. (Counter)
redis_expired_keys_total: Total number of key expiration events. (Counter)
redis_expired_stale_percentage: The percentage of keys probably expired.
redis_expired_time_cap_reached_total: The count of times that active expiry cycles have stopped early.
redis_exporter_last_scrape_connect_time_seconds: The time (in seconds) it took to connect during the last scrape.
redis_exporter_last_scrape_duration_seconds: The last scrape duration.
redis_exporter_last_scrape_error: The last scrape error status.
redis_exporter_scrape_duration_seconds_count: Durations of scrapes by the exporter.
redis_exporter_scrape_duration_seconds_sum: Durations of scrapes by the exporter.
redis_exporter_scrapes_total: Current total Redis scrapes. (Counter)
redis_instance_info: Information about the Redis instance.
redis_keyspace_hits_total: Total keyspace hits. (Counter)
redis_keyspace_misses_total: Total keyspace misses. (Counter)
redis_last_key_groups_scrape_duration_milliseconds: Duration of the last key group metrics scrape in milliseconds.
redis_last_slow_execution_duration_seconds: The amount of time needed for the last slow execution, in seconds.
redis_latest_fork_seconds: The amount of time needed for the last fork, in seconds.
redis_lazyfree_pending_objects: The number of objects waiting to be freed (as a result of calling UNLINK, or FLUSHDB and FLUSHALL with the ASYNC option).
redis_master_repl_offset: The server's current replication offset.
redis_mem_clients_normal: Memory used by normal clients. (Gauge)
redis_mem_clients_slaves: Memory used by replica clients. Starting with Redis 7.0, replica buffers share memory with the replication backlog, so this field can show 0 when replicas don't trigger an increase of memory usage.
redis_mem_fragmentation_bytes: Delta between used_memory_rss and used_memory. Note that when the total fragmentation bytes is low (a few megabytes), a high ratio (e.g. 1.5 and above) is not an indication of an issue.
redis_mem_fragmentation_ratio: Ratio between used_memory_rss and used_memory. Note that this doesn't only include fragmentation, but also other process overheads (see the allocator_* metrics), as well as overheads like code, shared libraries, stack, etc.
redis_mem_not_counted_for_eviction_bytes: (Gauge)
redis_memory_max_bytes: Max memory limit in bytes.
redis_memory_used_bytes: Total number of bytes allocated by Redis using its allocator (either standard libc, jemalloc, or an alternative allocator such as tcmalloc).
redis_memory_used_dataset_bytes: The size in bytes of the dataset (used_memory_overhead subtracted from used_memory).
redis_memory_used_lua_bytes: Number of bytes used by the Lua engine.
redis_memory_used_overhead_bytes: The sum in bytes of all overheads that the server allocated for managing its internal data structures.
redis_memory_used_peak_bytes: Peak memory consumed by Redis (in bytes).
redis_memory_used_rss_bytes: Number of bytes that Redis allocated as seen by the operating system (a.k.a. resident set size). This is the number reported by tools such as top(1) and ps(1).
redis_memory_used_scripts_bytes: Number of bytes used by cached Lua scripts.
redis_memory_used_startup_bytes: Initial amount of memory consumed by Redis at startup, in bytes.
redis_migrate_cached_sockets_total: The number of sockets open for MIGRATE purposes.
redis_net_input_bytes_total: Total input bytes. (Counter)
redis_net_output_bytes_total: Total output bytes. (Counter)
redis_process_id: Process ID.
redis_pubsub_channels: Global number of pub/sub channels with client subscriptions.
redis_pubsub_patterns: Global number of pub/sub patterns with client subscriptions.
redis_rdb_bgsave_in_progress: Flag indicating an RDB save is ongoing.
redis_rdb_changes_since_last_save: Number of changes since the last dump.
redis_rdb_current_bgsave_duration_sec: Duration of the ongoing RDB save operation, if any.
redis_rdb_last_bgsave_duration_sec: Duration of the last RDB save operation in seconds.
redis_rdb_last_bgsave_status: Status of the last RDB save operation.
redis_rdb_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last RDB save operation.
redis_rdb_last_save_timestamp_seconds: Epoch-based timestamp of the last successful RDB save.
redis_rejected_connections_total: Number of connections rejected because of the maxclients limit. (Counter)
redis_repl_backlog_first_byte_offset: The master offset of the replication backlog buffer.
redis_repl_backlog_history_bytes: Size in bytes of the data in the replication backlog buffer.
redis_repl_backlog_is_active: Flag indicating the replication backlog is active.
redis_replica_partial_resync_accepted: The number of accepted partial resync requests. (Gauge)
redis_replica_partial_resync_denied: The number of denied partial resync requests. (Gauge)
redis_replica_resyncs_full: The number of full resyncs with replicas.
redis_replication_backlog_bytes: Memory used by the replication backlog.
redis_second_repl_offset: The offset up to which replication IDs are accepted.
redis_slave_expires_tracked_keys: The number of keys tracked for expiry purposes (applicable only to writable replicas). (Gauge)
redis_slowlog_last_id: Last ID of the slowlog.
redis_slowlog_length: Total slowlog length.
redis_start_time_seconds: Start time of the Redis instance since Unix epoch, in seconds.
redis_target_scrape_request_errors_total: Errors in requests to the exporter.
redis_up: Flag indicating the Redis instance is up.
redis_uptime_in_seconds: Number of seconds since Redis server start.
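
The hit and miss counters above combine into a cache hit ratio; a minimal PromQL sketch using standard rate arithmetic:

```promql
# Keyspace hit ratio over the last 5 minutes (0..1)
rate(redis_keyspace_hits_total[5m])
  / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))
```
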
# [windows_exporter]
windows_cpu_clock_interrupts_total: Total number of received and serviced clock tick interrupts (counter)
windows_cpu_core_frequency_mhz: Core frequency in megahertz (gauge)
windows_cpu_cstate_seconds_total: Time spent in low-power idle state (counter)
windows_cpu_dpcs_total: Total number of received and serviced deferred procedure calls (DPCs) (counter)
windows_cpu_idle_break_events_total: Total number of times the processor was woken from idle (counter)
windows_cpu_interrupts_total: Total number of received and serviced hardware interrupts (counter)
windows_cpu_parking_status: Parking status represents whether a processor is parked or not (gauge)
windows_cpu_processor_performance: Processor performance is the average performance of the processor while it is executing instructions, as a percentage of the nominal performance of the processor. On some processors it may exceed 100% (gauge)
windows_cpu_time_total: Time that the processor spent in different modes (idle, user, system, ...) (counter)
windows_cs_hostname: Labeled system hostname information as provided by ComputerSystem.DNSHostName and ComputerSystem.Domain (gauge)
windows_cs_logical_processors: ComputerSystem.NumberOfLogicalProcessors (gauge)
windows_cs_physical_memory_bytes: ComputerSystem.TotalPhysicalMemory (gauge)
windows_exporter_build_info: A metric with a constant '1' value labeled by version, revision, branch, and goversion from which windows_exporter was built. (gauge)
windows_exporter_collector_duration_seconds: Duration of a collection. (gauge)
windows_exporter_collector_success: Whether the collector was successful. (gauge)
windows_exporter_collector_timeout: Whether the collector timed out. (gauge)
windows_exporter_perflib_snapshot_duration_seconds: Duration of perflib snapshot capture (gauge)
windows_logical_disk_free_bytes: Free space in bytes (LogicalDisk.PercentFreeSpace) (gauge)
windows_logical_disk_idle_seconds_total: Seconds that the disk was idle (LogicalDisk.PercentIdleTime) (counter)
windows_logical_disk_read_bytes_total: The number of bytes transferred from the disk during read operations (LogicalDisk.DiskReadBytesPerSec) (counter)
windows_logical_disk_read_latency_seconds_total: Shows the average time, in seconds, of a read operation from the disk (LogicalDisk.AvgDiskSecPerRead) (counter)
windows_logical_disk_read_seconds_total: Seconds that the disk was busy servicing read requests (LogicalDisk.PercentDiskReadTime) (counter)
windows_logical_disk_read_write_latency_seconds_total: Shows the time, in seconds, of the average disk transfer (LogicalDisk.AvgDiskSecPerTransfer) (counter)
windows_logical_disk_reads_total: The number of read operations on the disk (LogicalDisk.DiskReadsPerSec) (counter)
windows_logical_disk_requests_queued: The number of requests queued to the disk (LogicalDisk.CurrentDiskQueueLength) (gauge)
windows_logical_disk_size_bytes: Total space in bytes (LogicalDisk.PercentFreeSpace_Base) (gauge)
windows_logical_disk_split_ios_total: The number of I/Os to the disk that were split into multiple I/Os (LogicalDisk.SplitIOPerSec) (counter)
windows_logical_disk_write_bytes_total: The number of bytes transferred to the disk during write operations (LogicalDisk.DiskWriteBytesPerSec) (counter)
windows_logical_disk_write_latency_seconds_total: Shows the average time, in seconds, of a write operation to the disk (LogicalDisk.AvgDiskSecPerWrite) (counter)
windows_logical_disk_write_seconds_total: Seconds that the disk was busy servicing write requests (LogicalDisk.PercentDiskWriteTime) (counter)
windows_logical_disk_writes_total: The number of write operations on the disk (LogicalDisk.DiskWritesPerSec) (counter)
windows_net_bytes_received_total: (Network.BytesReceivedPerSec) (counter)
windows_net_bytes_sent_total: (Network.BytesSentPerSec) (counter)
windows_net_bytes_total: (Network.BytesTotalPerSec) (counter)
windows_net_current_bandwidth: (Network.CurrentBandwidth) (gauge)
windows_net_packets_outbound_discarded_total: (Network.PacketsOutboundDiscarded) (counter)
windows_net_packets_outbound_errors_total: (Network.PacketsOutboundErrors) (counter)
windows_net_packets_received_discarded_total: (Network.PacketsReceivedDiscarded) (counter)
windows_net_packets_received_errors_total: (Network.PacketsReceivedErrors) (counter)
windows_net_packets_received_total: (Network.PacketsReceivedPerSec) (counter)
windows_net_packets_received_unknown_total: (Network.PacketsReceivedUnknown) (counter)
windows_net_packets_sent_total: (Network.PacketsSentPerSec) (counter)
windows_net_packets_total: (Network.PacketsPerSec) (counter)
windows_os_info: OperatingSystem.Caption, OperatingSystem.Version (gauge)
windows_os_paging_free_bytes: OperatingSystem.FreeSpaceInPagingFiles (gauge)
windows_os_paging_limit_bytes: OperatingSystem.SizeStoredInPagingFiles (gauge)
windows_os_physical_memory_free_bytes: OperatingSystem.FreePhysicalMemory (gauge)
windows_os_process_memory_limix_bytes: OperatingSystem.MaxProcessMemorySize (gauge)
windows_os_processes: OperatingSystem.NumberOfProcesses (gauge)
windows_os_processes_limit: OperatingSystem.MaxNumberOfProcesses (gauge)
windows_os_time: OperatingSystem.LocalDateTime (gauge)
windows_os_timezone: OperatingSystem.LocalDateTime (gauge)
windows_os_users: OperatingSystem.NumberOfUsers (gauge)
windows_os_virtual_memory_bytes: OperatingSystem.TotalVirtualMemorySize (gauge)
windows_os_virtual_memory_free_bytes: OperatingSystem.FreeVirtualMemory (gauge)
windows_os_visible_memory_bytes: OperatingSystem.TotalVisibleMemorySize (gauge)
windows_service_info: A metric with a constant '1' value labeled with service information (gauge)
windows_service_start_mode: The start mode of the service (StartMode) (gauge)
windows_service_state: The state of the service (State) (gauge)
windows_service_status: The status of the service (Status) (gauge)
windows_system_context_switches_total: Total number of context switches (WMI source is PerfOS_System.ContextSwitchesPersec) (counter)
windows_system_exception_dispatches_total: Total number of exceptions dispatched (WMI source is PerfOS_System.ExceptionDispatchesPersec) (counter)
windows_system_processor_queue_length: Length of processor queue (WMI source is PerfOS_System.ProcessorQueueLength) (gauge)
windows_system_system_calls_total: Total number of system calls (WMI source is PerfOS_System.SystemCallsPersec) (counter)
windows_system_system_up_time: System boot time (WMI source is PerfOS_System.SystemUpTime) (gauge)
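
The per-mode CPU counter at the top of this list is usually turned into a utilization percentage; a minimal PromQL sketch (assuming the standard mode label exposed by the cpu collector):

```promql
# Per-instance CPU utilization: 100% minus the idle fraction
100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[5m])) * 100)
```
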
# [node_exporter]
# SYSTEM
# Number of CPU context switches
node_context_switches_total: Total number of context switches
# Number of interrupts
node_intr_total: Total number of interrupts serviced
# Number of processes in runnable state
node_procs_running: Processes in runnable state
# Size of the entropy pool
node_entropy_available_bits: Entropy available to random number generators
node_time_seconds: System time in seconds since epoch (1970)
node_boot_time_seconds: Node boot time, in unixtime

# CPU
node_cpu_seconds_total: Seconds the CPUs spent in each mode
node_load1: CPU load average over 1 minute
node_load5: CPU load average over 5 minutes
node_load15: CPU load average over 15 minutes
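
node_cpu_seconds_total is usually read the same way as other per-mode counters; a minimal PromQL sketch (assuming the standard mode label):

```promql
# Per-instance CPU usage percentage over the last 5 minutes
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```
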
# MEM
# Kernel space
# Memory the kernel uses to cache data structures for its own use
node_memory_Slab_bytes: Memory used by the kernel to cache data structures for its own use
# Reclaimable part of Slab
node_memory_SReclaimable_bytes: SReclaimable - part of Slab that might be reclaimed, such as caches
# Non-reclaimable part of Slab
node_memory_SUnreclaim_bytes: Part of Slab that cannot be reclaimed under memory pressure
# Total size of the vmalloc memory area
node_memory_VmallocTotal_bytes: Total size of vmalloc memory area
# Memory already allocated by vmalloc (contiguous in virtual address space)
node_memory_VmallocUsed_bytes: Amount of vmalloc area which is used
# Size of the largest free contiguous block in the vmalloc area, i.e. the largest contiguous allocation vmalloc can still satisfy
node_memory_VmallocChunk_bytes: Largest contiguous block of vmalloc area which is free
# Total size of memory pages taken offline due to hardware memory faults
node_memory_HardwareCorrupted_bytes: Amount of RAM that the kernel identified as corrupted / not working
# Memory used to map between virtual and physical memory addresses
node_memory_PageTables_bytes: Memory used to map between virtual and physical memory addresses (gauge)
# Kernel stack memory; resident and not reclaimable
node_memory_KernelStack_bytes: Kernel memory stack. This is not reclaimable
# Temporary buffers used to access and copy high memory ("bounce buffering"); degrades I/O performance
node_memory_Bounce_bytes: Memory used for block device bounce buffers

# User space
# Size of a single huge page
node_memory_Hugepagesize_bytes: Huge page size
# Number of resident huge pages allocated by the system
node_memory_HugePages_Total: Total size of the pool of huge pages
# Number of free huge pages
node_memory_HugePages_Free: Huge pages in the pool that are not yet allocated
# Huge pages that processes have reserved but not yet used
node_memory_HugePages_Rsvd: Huge pages for which a commitment to allocate from the pool has been made, but no allocation has occurred
# Number of huge pages above the configured resident pool size
node_memory_HugePages_Surp: Huge pages in the pool above the value in /proc/sys/vm/nr_hugepages
# Transparent HugePages (THP)
node_memory_AnonHugePages_bytes: Memory in anonymous huge pages
# File-backed memory on the inactive LRU list
node_memory_Inactive_file_bytes: File-backed memory on inactive LRU list
# Anonymous memory on the inactive LRU list
node_memory_Inactive_anon_bytes: Anonymous and swap cache on inactive LRU list, including tmpfs (shmem)
# File-backed memory on the active LRU list
node_memory_Active_file_bytes: File-backed memory on active LRU list
# Anonymous memory on the active LRU list
node_memory_Active_anon_bytes: Anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs
# Pages that cannot be swapped out (the Unevictable LRU list)
node_memory_Unevictable_bytes: Amount of unevictable memory that can't be swapped out for a variety of reasons
# Shared memory
node_memory_Shmem_bytes: Used shared memory (shared between several processes, thus including RAM disks)
# Size of anonymous pages
node_memory_AnonPages_bytes: Memory in user pages not backed by files
# Size of mapped memory pages
node_memory_Mapped_bytes: Used memory in mapped pages, i.e. files that have been mmapped, such as libraries
# Size of the file-backed page cache
node_memory_Cached_bytes: Page cache for file data (file contents)
# Anonymous pages that were swapped out, later swapped back in, and have not been modified since
node_memory_SwapCached_bytes: Memory that keeps track of pages that have been fetched from swap but not yet modified
# Memory locked with the mlock() system call
node_memory_Mlocked_bytes: Size of pages locked to memory using the mlock() system call
# Cache pages used by block devices
node_memory_Buffers_bytes: Block device (e.g. hard disk) cache
node_memory_SwapTotal_bytes: Memory information field SwapTotal_bytes
node_memory_SwapFree_bytes: Memory information field SwapFree_bytes
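
The two swap gauges above combine into a swap usage percentage; a minimal PromQL sketch:

```promql
# Percentage of swap in use, per instance
100 * (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes
```
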
# DISK
node_filesystem_avail_bytes: Filesystem space available to non-root users in bytes
node_filesystem_free_bytes: Filesystem free space in bytes
node_filesystem_size_bytes: Filesystem size in bytes
node_filesystem_files_free: Filesystem total free file nodes
node_filesystem_files: Filesystem total file nodes
node_filefd_maximum: Maximum number of open file descriptors
node_filefd_allocated: Number of allocated file descriptors
node_filesystem_readonly: Filesystem read-only status
node_filesystem_device_error: Whether an error occurred while getting statistics for the given device
node_disk_reads_completed_total: The total number of reads completed successfully
node_disk_writes_completed_total: The total number of writes completed successfully
node_disk_reads_merged_total: The number of reads merged
node_disk_writes_merged_total: The number of writes merged
node_disk_read_bytes_total: The total number of bytes read successfully
node_disk_written_bytes_total: The total number of bytes written successfully
node_disk_io_time_seconds_total: Total seconds spent doing I/Os
node_disk_read_time_seconds_total: The total number of seconds spent by all reads
node_disk_write_time_seconds_total: The total number of seconds spent by all writes
node_disk_io_time_weighted_seconds_total: The weighted number of seconds spent doing I/Os
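
The filesystem gauges and the io_time counter above are commonly combined as follows; a minimal PromQL sketch:

```promql
# Filesystem used-space percentage (as seen by non-root users)
100 * (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes)

# Approximate disk utilization: fraction of wall time the device was busy
rate(node_disk_io_time_seconds_total[5m]) * 100
```
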
# NET
node_network_receive_bytes_total: Network device statistic receive_bytes (counter)
node_network_transmit_bytes_total: Network device statistic transmit_bytes (counter)
node_network_receive_packets_total: Network device statistic receive_packets
node_network_transmit_packets_total: Network device statistic transmit_packets
node_network_receive_errs_total: Network device statistic receive_errs
node_network_transmit_errs_total: Network device statistic transmit_errs
node_network_receive_drop_total: Network device statistic receive_drop
node_network_transmit_drop_total: Network device statistic transmit_drop
node_nf_conntrack_entries: Number of currently allocated flow entries for connection tracking
node_sockstat_TCP_alloc: Number of TCP sockets in state alloc
node_sockstat_TCP_inuse: Number of TCP sockets in state inuse
node_sockstat_TCP_orphan: Number of TCP sockets in state orphan
node_sockstat_TCP_tw: Number of TCP sockets in state tw (TIME_WAIT)
node_netstat_Tcp_CurrEstab: Statistic TcpCurrEstab (currently established TCP connections)
node_sockstat_sockets_used: Number of IPv4 sockets in use
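
Throughput is usually derived from the byte counters above as a rate; a minimal PromQL sketch (loopback excluded):

```promql
# Inbound bandwidth in bits per second, per network device
rate(node_network_receive_bytes_total{device!="lo"}[5m]) * 8
```
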
# [kafka_exporter]
kafka_brokers: Number of Kafka brokers (gauge)
kafka_topic_partitions: Number of partitions for this topic (gauge)
kafka_topic_partition_current_offset: Current offset of a broker at topic/partition (gauge)
kafka_consumergroup_current_offset: Current offset of a consumer group at topic/partition (gauge)
kafka_consumer_lag_millis: Current approximation of consumer lag for a consumer group at topic/partition (gauge)
kafka_topic_partition_under_replicated_partition: 1 if the topic/partition is under-replicated
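
Consumer lag in messages can be approximated by joining the two offset gauges above; a minimal PromQL sketch (label names follow kafka_exporter conventions, and a single exporter instance is assumed):

```promql
# Messages each consumer group is behind, per topic
sum by (consumergroup, topic) (
  kafka_topic_partition_current_offset
    - on (topic, partition) group_right ()
  kafka_consumergroup_current_offset
)
```
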
# [zookeeper_exporter]
zk_znode_count: The total count of znodes stored
zk_ephemerals_count: The number of ephemeral nodes
zk_watch_count: The number of watchers set up on ZooKeeper nodes
zk_approximate_data_size: Size in bytes of the data a ZooKeeper server holds in its data tree
zk_outstanding_requests: Number of currently executing requests
zk_packets_sent: Count of ZooKeeper packets sent by a server
zk_packets_received: Count of ZooKeeper packets received by a server
zk_num_alive_connections: Number of active clients connected to a ZooKeeper server
zk_open_file_descriptor_count: Number of file descriptors that a ZooKeeper server has open
zk_max_file_descriptor_count: Maximum number of file descriptors that a ZooKeeper server can open
zk_avg_latency: Average time in milliseconds for requests to be processed
zk_min_latency: Minimum time in milliseconds for a request to be processed
zk_max_latency: Maximum time in milliseconds for a request to be processed
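
The two descriptor gauges above make a useful saturation alert; a minimal PromQL sketch (the threshold is just an example):

```promql
# Fire when more than 85% of the allowed file descriptors are open
zk_open_file_descriptor_count / zk_max_file_descriptor_count > 0.85
```
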
@@ -35,6 +35,12 @@ def convert_alert(rule, interval):
    for v in rule['annotations'].values():
        note = v
        break

+    annotations = {}
+    if 'annotations' in rule:
+        for k, v in rule['annotations'].items():
+            annotations[k] = v
+
    append_tags = []
    severity = 2
@@ -50,7 +56,7 @@ def convert_alert(rule, interval):
    # elif v == 'warning':
    #     severity = 2

    n9e_alert_rule = {
        "name": name,
        "note": note,
@@ -77,7 +83,8 @@ def convert_alert(rule, interval):
        "recover_duration": 0,
        "callbacks": [],
        "runbook_url": "",
-        "append_tags": append_tags
+        "append_tags": append_tags,
+        "annotations": annotations
    }
    return n9e_alert_rule
@@ -6,9 +6,7 @@ import (
	"github.com/rakyll/statik/fs"
)

func init() {
	data := "PK\x05\x06\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00"
	fs.Register(data)
}
3
go.mod
@@ -5,6 +5,7 @@ go 1.18
require (
	github.com/BurntSushi/toml v0.3.1
	github.com/coreos/go-oidc v2.2.1+incompatible
	github.com/davecgh/go-spew v1.1.1
	github.com/dgrijalva/jwt-go v3.2.0+incompatible
	github.com/gin-contrib/pprof v1.4.0
	github.com/gin-gonic/gin v1.9.1
@@ -27,6 +28,7 @@ require (
	github.com/prometheus/prometheus v2.5.0+incompatible
	github.com/rakyll/statik v0.1.7
	github.com/redis/go-redis/v9 v9.0.2
	github.com/spaolacci/murmur3 v1.1.0
	github.com/tidwall/gjson v1.14.0
	github.com/toolkits/pkg v1.3.3
	golang.org/x/oauth2 v0.4.0
@@ -74,7 +76,6 @@ require (
	github.com/pquerna/cachecontrol v0.1.0 // indirect
	github.com/prometheus/client_model v0.3.0 // indirect
	github.com/prometheus/procfs v0.8.0 // indirect
	github.com/spaolacci/murmur3 v1.1.0 // indirect
	github.com/tidwall/match v1.1.1 // indirect
	github.com/tidwall/pretty v1.2.0 // indirect
	github.com/twitchyliquid64/golang-asm v0.15.1 // indirect
21
integrations/AliYun/collect/cloud.toml
Normal file
@@ -0,0 +1,21 @@
# # collect interval
# interval = 60
[[instances]]
# # endpoint region, see https://help.aliyun.com/document_detail/28616.html#section-72p-xhs-6qt
# region="cn-beijing"
# endpoint="metrics.cn-hangzhou.aliyuncs.com"
# access_key_id="your-access-key-id"
# access_key_secret="your-access-key-secret"
# interval_times=4
# delay="10m"
# period="60s"
# # namespace, see https://help.aliyun.com/document_detail/163515.htm?spm=a2c4g.11186623.0.0.44d65c58mhgNw3
# namespaces=["acs_ecs_dashboard"]
# [[instances.metric_filters]]
# # metric name, see https://help.aliyun.com/document_detail/163515.htm?spm=a2c4g.11186623.0.0.401d15c73Z0dZh
# # fill metricName below with the "Metric Id" from that page; the Chinese "Metric Name" there corresponds to the Description field of the API
# metric_names=["cpu_cores","vm.TcpCount"]
# namespace=""
# ratelimit=25
# catch_ttl="1h"
# timeout="5s"
@@ -1,34 +1,43 @@
-## AliYun Dashboard & Configurable
+# aliyun plugin

-Use the [input.aliyun](https://github.com/flashcatcloud/categraf/blob/main/conf/input.aliyun/cloud.toml) plugin of [categraf](https://github.com/flashcatcloud/categraf) to collect metric data:
+## Introduction

-1. Create an AK/SK in the AliYun console and grant it CloudMonitor permissions in IAM;
-2. Put the created AK/SK into the aliyun plugin configuration file of Categraf.
+Use the [aliyun](https://github.com/flashcatcloud/categraf/tree/main/inputs/aliyun) plugin of [categraf](https://github.com/flashcatcloud/categraf) to pull AliYun CloudMonitor data (via the OpenAPI).

-### The conf/input.aliyun/cloud.toml configuration file in Categraf:
+## Authorization

-Obtain credentials at [https://usercenter.console.aliyun.com/#/manage/ak](https://usercenter.console.aliyun.com/#/manage/ak)
+RAM user authorization: before a RAM user can call the CloudMonitor API, the owning AliYun account must grant it the required policy, see [RAM user permissions](https://help.aliyun.com/document_detail/43170.html?spm=a2c4g.11186623.0.0.30c841feqsoAAn).
+On the [authorization page](https://ram.console.aliyun.com/permissions), add a grant for the user in question with the CloudMonitor read-only policy `AliyunCloudMonitorReadOnlyAccess`, then create an accessKey for that user.
+
+## The conf/input.aliyun/cloud.toml configuration file in Categraf:

```toml
# # categraf collect interval; AliYun metric granularity is usually 60 seconds, so do not set this below 60 seconds
-interval = 60
+interval = 120
[[instances]]
-## region of your AliYun resources
+## endpoint region, see https://help.aliyun.com/document_detail/28616.html#section-72p-xhs-6qt
region="cn-beijing"
-#endpoint="metrics.cn-hangzhou.aliyuncs.com"
-endpoint="metrics.aliyuncs.com"
-## fill in your acces_key_id
-access_key_id="admin"
+endpoint="metrics.cn-hangzhou.aliyuncs.com"
+## fill in your access_key_id
+access_key_id=""
## fill in your access_key_secret
-access_key_secret="admin"
+access_key_secret=""

## the latest data points may not be available yet; this sets how far behind "now" the collection window ends
-delay="2m"
+delay="50m"
## minimum granularity of AliYun metrics; 60s is the recommended value, some metrics do not support smaller periods
period="60s"
## namespaces the metrics belong to; empty means metrics of all namespaces are collected
## namespace reference: https://help.aliyun.com/document_detail/163515.htm?spm=a2c4g.11186623.0.0.44d65c58mhgNw3
-#namespaces=["waf"]
-namespaces=["waf","acs_ecs_dashboard","acs_rds_dashboard","acs_slb_dashboard","acs_kvstore"]
+namespaces=["acs_ecs_dashboard"]
-## filter one or more metrics under a namespace
-## metric name reference: https://help.aliyun.com/document_detail/163515.htm?spm=a2c4g.11186623.0.0.401d15c73Z0dZh
-## fill metricName with the "Metric Id" from that page; the Chinese "Metric Name" there corresponds to the Description field of the API
-[[instances.metric_filters]]
-namespace=""
-metric_names=["cpu_cores","vm.TcpCount", "cpu_idle"]

# the QPS of the AliYun metric query API is 50; the default here is half of that
ratelimit=25
@@ -36,23 +45,26 @@ ratelimit=25
catch_ttl="1h"
# timeout of each request to the AliYun endpoint
timeout="5s"

+## filter one or more metrics under a namespace
+## metric name reference: https://help.aliyun.com/document_detail/163515.htm?spm=a2c4g.11186623.0.0.401d15c73Z0dZh
+## fill metricName with the "Metric Id" from that page; the Chinese "Metric Name" there corresponds to the Description field of the API
+#[[instances.metric_filters]]
+#namespace=""
+#metric_names=["cpu_cores","vm.TcpCount", "cpu_idle"]
```

-### Screenshots
+## Screenshots



+### aliyun ecs




+### aliyun rds




+### aliyun redis


+### aliyun slb


+### aliyun waf


@@ -1,31 +1,34 @@
-### Ceph Dashboard & Alerts
-Enable ceph's built-in Prometheus support
+# ceph plugin
+
+Enable ceph's prometheus support:

```bash
ceph mgr module enable prometheus
```

-### Collection configuration
-Add the scrape configuration to categraf's prometheus plugin
+## Collection configuration
+
+Since ceph can expose metrics in the prometheus protocol, simply scrape them with the prometheus plugin.
+
+categraf configuration file: `conf/input.prometheus/prometheus.toml`

```yaml
-cat /opt/categraf/conf/input.prometheus/prometheus.toml
[[instances]]
urls = [
    "http://192.168.11.181:9283/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
-labels = {service="ceph",cluster="ceph"}
+labels = {service="ceph",cluster="ceph-cluster-001"}
```

-Dashboard:
+## Dashboard

-[dashboard](../dashboards/ceph_by_categraf.json)
+Nightingale's built-in dashboards already include one for ceph; import it to use it.

-
+

-Alerts:
+## Alert rules

-[alerts](../alerts/ceph_by_categraf.json)
+Nightingale's built-in alert rules already include rules for ceph; import them to use them.

-
+
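
On top of the scraped metrics, cluster health can be alerted on directly; a minimal PromQL sketch (assuming the `ceph_health_status` gauge exposed by the mgr prometheus module, where 0 means healthy):

```promql
# Cluster is in HEALTH_WARN (1) or HEALTH_ERR (2)
ceph_health_status != 0
```
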
BIN
integrations/Ceph/markdown/ceph-alerts.png
Normal file
After Width: | Height: | Size: 194 KiB |
BIN
integrations/Ceph/markdown/ceph-dash.png
Normal file
After Width: | Height: | Size: 187 KiB |
63
integrations/ElasticSearch/collect/elasticsearch.toml
Normal file
@@ -0,0 +1,63 @@
# # collect interval
# interval = 15

############################################################################
# !!! uncomment [[instances]] to enable this plugin
[[instances]]
# # interval = global.interval * interval_times
# interval_times = 1

# append some labels to metrics
# labels = { cluster="cloud-n9e-es" }

## specify a list of one or more Elasticsearch servers
# servers = ["http://localhost:9200"]
servers = []

## Timeout for HTTP requests to the elastic search server(s)
http_timeout = "10s"

# either /_nodes/stats or /_nodes/_local/stats depending on this setting
local = false

## Set cluster_health to true when you want to obtain cluster health stats
cluster_health = true

## Adjust cluster_health_level when you want to obtain detailed health stats
## The options are
##  - indices (default)
##  - cluster
cluster_health_level = "cluster"

## Set cluster_stats to true when you want to obtain cluster stats.
cluster_stats = true

## Indices to collect; can be one or more indices names or _all
## Use of wildcards is allowed. Use a wildcard at the end to retrieve index names that end with a changing value, like a date.
# indices_include = ["zipkin*"]

## use "shards" or blank string for indices level
indices_level = ""

## node_stats is a list of sub-stats that you want to have gathered. Valid options
## are "indices", "os", "process", "jvm", "thread_pool", "fs", "transport", "http",
## "breaker". Per default, all stats are gathered.
node_stats = ["jvm", "breaker", "process", "os", "fs", "indices", "thread_pool", "transport"]

## HTTP Basic Authentication username and password.
username = "elastic"
password = "password"

## Optional TLS Config
# use_tls = false
# tls_ca = "/etc/categraf/ca.pem"
# tls_cert = "/etc/categraf/cert.pem"
# tls_key = "/etc/categraf/key.pem"
## Use TLS but skip chain & host verification
# insecure_skip_verify = true

## Sets the number of most recent indices to return for indices that are configured with a date-stamped suffix.
## Each 'indices_include' entry ending with a wildcard (*) or glob matching pattern will group together all indices that match it, and
## sort them by the date or number after the wildcard. Metrics then are gathered for only the 'num_most_recent_indices' amount of most
## recent indices.
num_most_recent_indices = 1
@@ -1,26 +1,33 @@
-### Collection method
+# elasticsearch plugin

-Use the Categraf plugin [elasticsearch](https://github.com/flashcatcloud/categraf/blob/main/conf/input.elasticsearch/elasticsearch.toml) to collect ES metrics;
+ElasticSearch exposes its own monitoring metrics over HTTP JSON; scrape them with the categraf [elasticsearch](https://github.com/flashcatcloud/categraf/tree/main/inputs/elasticsearch) plugin.

-### Configuration example
+For a small cluster, set `local=false` and scrape any single node to obtain monitoring data for all nodes in the cluster. For a large cluster, it is recommended to set `local=true` and deploy a scraper on every node, each collecting from its local elasticsearch process.
+
+For a detailed discussion of ElasticSearch monitoring, see this [article](https://time.geekbang.org/column/article/628847).
+
+## Configuration example
+
+categraf configuration file: `conf/input.elasticsearch/elasticsearch.toml`

```yaml
-cat conf/input.elasticsearch/elasticsearch.toml | egrep -v "^#|^$"
[[instances]]
servers = ["http://192.168.11.177:9200"]
-http_timeout = "5s"
-local = true
+http_timeout = "10s"
+local = false
cluster_health = true
cluster_health_level = "cluster"
cluster_stats = true
indices_level = ""
-node_stats = ["jvm", "breaker", "process", "os", "fs", "indices"]
+node_stats = ["jvm", "breaker", "process", "os", "fs", "indices", "thread_pool", "transport"]
username = "elastic"
password = "xxxxxxxx"
num_most_recent_indices = 1
-labels = { instance="default-es" , service="es" }
+labels = { service="es" }
```

-### Screenshot:
+## Dashboard

-
+Nightingale's built-in dashboards already include one for Elasticsearch; import it to use it.
+
+
Before Width: | Height: | Size: 263 KiB |
Before Width: | Height: | Size: 203 KiB |
Before Width: | Height: | Size: 141 KiB |
Before Width: | Height: | Size: 264 KiB |
@@ -1,127 +1,74 @@
-### Gitlab Dashboard & Alerts
-Use the [inputs.prometheus](https://github.com/flashcatcloud/categraf/tree/main/inputs/prometheus) plugin of [categraf](https://github.com/flashcatcloud/categraf) to collect the metrics exposed by the [Gitlab](https://docs.gitlab.com/) service components:
+# Gitlab

-Enable Gitlab's built-in Prometheus support:
+Gitlab provides monitoring data in the Prometheus protocol by default, see [Monitoring GitLab with Prometheus](https://docs.gitlab.com/ee/administration/monitoring/prometheus/), so it can be collected with the categraf prometheus plugin.

-[Monitoring GitLab with Prometheus](https://docs.gitlab.com/ee/administration/monitoring/prometheus/)
+## Collection configuration

-### Collection configuration
-Add the scrape configuration to categraf's prometheus plugin
-```yaml
-cat /opt/categraf/conf/input.prometheus/prometheus.toml
-# # collect interval
-# interval = 15
+Configuration file: categraf's `conf/input.prometheus/prometheus.toml`

+```toml
[[instances]]
urls = [
    "http://192.168.11.77:9236/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="gitaly"}

[[instances]]
urls = [
    "http://192.168.11.77:9168/sidekiq"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="gitlab-exporter-sidekiq"}

[[instances]]
urls = [
    "http://192.168.11.77:9168/database"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="gitlab-exporter-database"}

[[instances]]
urls = [
    "http://192.168.11.77:8082/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="gitlab-sidekiq"}

[[instances]]
urls = [
    "http://192.168.11.77:9229/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="gitlab-workhorse"}

[[instances]]
urls = [
    "http://192.168.11.77:9100/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="node"}

[[instances]]
urls = [
    "http://192.168.11.77:9187/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="postgres"}

[[instances]]
urls = [
    "http://192.168.11.77:9121/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="redis"}

[[instances]]
urls = [
    "http://192.168.11.77:9999/metrics"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {service="gitlab", job="nginx"}
```

+## Dashboards and alert rules
+
+Nightingale ships built-in dashboards and alert rules for the gitlab components; import them into your own business group to use them.

-Dashboards:
-
-[MachinePerformance](../dashboards/MachinePerformance.json)
-
-
-
-[NGINXVTS](../dashboards/NGINXVTS.json)
-
-
-
-[Overview](../dashboards/Overview.json)
-
-
-
-[PostgreSQL](../dashboards/PostgreSQL.json)
-
-
-
-[Redis](../dashboards/Redis.json)
-
-
-
-Alerts:
-
-[alerts](../alerts/gitlab_by_categraf.json)
-
-
Before Width: | Height: | Size: 312 KiB |
Before Width: | Height: | Size: 144 KiB |
@@ -69,5 +69,76 @@
    "append_tags": [],
    "annotations": null,
    "extra_config": null
  },
  {
    "cate": "prometheus",
    "datasource_ids": [
      52
    ],
    "name": "https certificate will expire within 7 days",
    "note": "",
    "prod": "metric",
    "algorithm": "",
    "algo_params": null,
    "delay": 0,
    "severity": 2,
    "severities": [
      2
    ],
    "disabled": 1,
    "prom_for_duration": 60,
    "prom_ql": "",
    "rule_config": {
      "algo_params": null,
      "inhibit": false,
      "prom_ql": "",
      "queries": [
        {
          "prom_ql": "(http_response_cert_expire_timestamp - time())/86400 <= 7",
          "severity": 2
        }
      ],
      "severity": 0
    },
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_stimes": [
      "00:00"
    ],
    "enable_etime": "23:59",
    "enable_etimes": [
      "23:59"
    ],
    "enable_days_of_week": [
      "1",
      "2",
      "3",
      "4",
      "5",
      "6",
      "0"
    ],
    "enable_days_of_weeks": [
      [
        "1",
        "2",
        "3",
        "4",
        "5",
        "6",
        "0"
      ]
    ],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "notify_max_number": 0,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": [],
    "annotations": null,
    "extra_config": null
  }
]
BIN integrations/HTTP/icon/http.png Normal file
105 integrations/HTTP/markdown/http.md Normal file
@@ -0,0 +1,105 @@
# http_response plugin

An HTTP probing plugin that checks the reachability, latency, and HTTPS certificate expiry of HTTP endpoints. Because time-series databases in the Prometheus ecosystem can only store float64 values, the probe result is also a float64, and its value encodes the outcome:

```
Success          = 0
ConnectionFailed = 1
Timeout          = 2
DNSError         = 3
AddressError     = 4
BodyMismatch     = 5
CodeMismatch     = 6
```

If everything is fine the value is 0; on failure it is a value from 1 to 6 with the meanings above. The metric carrying this value is named `http_response_result_code`.

## Configuration

The config file is categraf's `conf/input.http_response/http_response.toml`. The core setting is `targets`, which lists the addresses to probe. For example, to monitor two addresses:

```toml
[[instances]]
targets = [
    "http://localhost:8080",
    "https://www.baidu.com"
]
```

All targets under one `[[instances]]` section share that section's settings, such as the timeout and the HTTP method. If some targets need different settings, split them into separate `[[instances]]` sections, for example:

```toml
[[instances]]
targets = [
    "http://localhost:8080",
    "https://www.baidu.com"
]
method = "GET"

[[instances]]
targets = [
    "http://localhost:9090"
]
method = "POST"
```

The complete annotated configuration is as follows:

```toml
[[instances]]
targets = [
    # "http://localhost",
    # "https://www.baidu.com"
]

# # append some labels for series
# labels = { region="cloud", product="n9e" }

# # interval = global.interval * interval_times
# interval_times = 1

## Set http_proxy (categraf uses the system wide proxy settings if it is not set)
# http_proxy = "http://localhost:8888"

## Interface to use when dialing an address
# interface = "eth0"

## HTTP Request Method
# method = "GET"

## Set response_timeout (default 5 seconds)
# response_timeout = "5s"

## Whether to follow redirects from the server (defaults to false)
# follow_redirects = false

## Optional HTTP Basic Auth Credentials
# username = "username"
# password = "pa$$word"

## Optional headers
# headers = ["Header-Key-1", "Header-Value-1", "Header-Key-2", "Header-Value-2"]

## Optional HTTP Request Body
# body = '''
# {'fake':'data'}
# '''

## Optional substring match in body of the response (case sensitive)
# expect_response_substring = "ok"

## Optional expected response status code.
# expect_response_status_code = 0

## Optional TLS Config
# use_tls = false
# tls_ca = "/etc/categraf/ca.pem"
# tls_cert = "/etc/categraf/cert.pem"
# tls_key = "/etc/categraf/key.pem"
## Use TLS but skip chain & host verification
# insecure_skip_verify = false
```
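
For instance, a minimal sketch of a single probe instance (the target URL and label values below are illustrative) that expects a 200 response whose body contains the substring "ok":

```toml
[[instances]]
targets = [
    "https://status.example.internal/healthz"   # hypothetical endpoint
]
method = "GET"
response_timeout = "5s"
expect_response_status_code = 200
expect_response_substring = "ok"
# these labels are attached to every series this instance produces
labels = { service="status", env="prod" }
```

With this in place, a non-zero `http_response_result_code` for the target tells you which check failed, per the code table above.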

## Dashboard and alerts

Nightingale ships with a built-in dashboard and alert rules; clone them into your own business group to use them.
43 integrations/IPMI/collect/conf.toml Normal file
@@ -0,0 +1,43 @@
# Read metrics from the bare metal servers via IPMI
[[instances]]
## optionally specify the path to the ipmitool executable
# path = "/usr/bin/ipmitool"
##
## Setting 'use_sudo' to true will make use of sudo to run ipmitool.
## Sudo must be configured to allow the telegraf user to run ipmitool
## without a password.
# use_sudo = false
##
## optionally force session privilege level. Can be CALLBACK, USER, OPERATOR, ADMINISTRATOR
# privilege = "ADMINISTRATOR"
##
## optionally specify one or more servers via a url matching
##  [username[:password]@][protocol[(address)]]
##  e.g.
##    root:passwd@lan(127.0.0.1)
##
## if no servers are specified, local machine sensor stats will be queried
##
# servers = ["USERID:PASSW0RD@lan(192.168.1.1)"]

## Recommended: use metric 'interval' that is a multiple of 'timeout' to avoid
## gaps or overlap in pulled data
interval = "30s"

## Timeout for the ipmitool command to complete. Default is 20 seconds.
timeout = "20s"

## Schema Version: (Optional, defaults to version 1)
metric_version = 2

## Optionally provide the hex key for the IPMI connection.
# hex_key = ""

## If ipmitool should use a cache
## for me ipmitool runs about 2 to 10 times faster with cache enabled on HP G10 servers (when using ubuntu20.04)
## the cache file may not work well for you if some sensors come up late
# use_cache = false

## Path to the ipmitools cache file (defaults to OS temp dir)
## The provided path must exist and must be writable
# cache_path = ""
@@ -1,10 +1,8 @@
# IPMI plugin

How it works:
It runs the `ipmitool sdr` command to collect hardware temperature, power, and voltage readings and turns them into metrics. It depends on the `ipmitool` utility, so `ipmitool` must be installed.

## IPMI configuration

```bash
# The host must support an IPMI BMC, otherwise openipmi will not start
@@ -100,98 +98,56 @@ MAC Address : xx:xx:52:xx:xx:81
SNMP Community String : public
```

## Collection configuration

categraf's `conf/input.ipmi/conf.toml`:

```toml
[[instances]]
## optionally specify the path to the ipmitool executable
# path = "/usr/bin/ipmitool"
##
## Setting 'use_sudo' to true will make use of sudo to run ipmitool.
## Sudo must be configured to allow the telegraf user to run ipmitool
## without a password.
# use_sudo = false
##
## optionally force session privilege level. Can be CALLBACK, USER, OPERATOR, ADMINISTRATOR
# privilege = "ADMINISTRATOR"
##
## optionally specify one or more servers via a url matching
##  [username[:password]@][protocol[(address)]]
##  e.g.
##    root:passwd@lan(127.0.0.1)
##
## if no servers are specified, local machine sensor stats will be queried
##
servers = ["ADMIN:1234567@lan(192.168.1.123)"]

## Recommended: use metric 'interval' that is a multiple of 'timeout' to avoid
## gaps or overlap in pulled data
interval = "30s"

## Timeout for the ipmitool command to complete. Default is 20 seconds.
timeout = "20s"

## Schema Version: (Optional, defaults to version 1)
metric_version = 2

## Optionally provide the hex key for the IPMI connection.
# hex_key = ""

## If ipmitool should use a cache
## for me ipmitool runs about 2 to 10 times faster with cache enabled on HP G10 servers (when using ubuntu20.04)
## the cache file may not work well for you if some sensors come up late
# use_cache = false

## Path to the ipmitools cache file (defaults to OS temp dir)
## The provided path must exist and must be writable
```

## Dashboard

Nightingale ships with a built-in IPMI dashboard and alert rules; clone them into your own business group to use them.

![ipmi-dash](./ipmi-dash.png)
99 integrations/Kafka/collect/kafka.toml Normal file
@@ -0,0 +1,99 @@
# # collect interval
# interval = 15

############################################################################
# !!! uncomment [[instances]] to enable this plugin
[[instances]]
# # interval = global.interval * interval_times
# interval_times = 1

# append some labels to metrics
# cluster is a preferred tag with the cluster name. If none is provided, the first of kafka_uris will be used
labels = { cluster="kafka-cluster-01" }

# log level only for kafka exporter
log_level = "error"

# Address (host:port) of Kafka server.
# kafka_uris = ["127.0.0.1:9092","127.0.0.1:9092","127.0.0.1:9092"]
kafka_uris = []

# Connect using SASL/PLAIN
# Default is false
# use_sasl = false

# Only set this to false if using a non-Kafka SASL proxy
# Default is true
# use_sasl_handshake = false

# SASL user name
# sasl_username = "username"

# SASL user password
# sasl_password = "password"

# The SASL SCRAM SHA algorithm sha256 or sha512 as mechanism
# sasl_mechanism = ""

# Connect using TLS
# use_tls = false

# The optional certificate authority file for TLS client authentication
# ca_file = ""

# The optional certificate file for TLS client authentication
# cert_file = ""

# The optional key file for TLS client authentication
# key_file = ""

# If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure
# insecure_skip_verify = true

# Kafka broker version
# Default is 2.0.0
# kafka_version = "2.0.0"

# if you need to use a group from zookeeper
# Default is false
# use_zookeeper_lag = false

# Address array (hosts) of zookeeper server.
# zookeeper_uris = []

# Metadata refresh interval
# Default is 1m
# metadata_refresh_interval = "1m"

# Whether show the offset/lag for all consumer group, otherwise, only show connected consumer groups, default is true
# Default is true
# offset_show_all = true

# If true, all scrapes will trigger kafka operations otherwise, they will share results. WARN: This should be disabled on large clusters
# Default is false
# allow_concurrency = false

# Maximum number of offsets to store in the interpolation table for a partition
# Default is 1000
# max_offsets = 1000

# How frequently should the interpolation table be pruned, in seconds.
# Default is 30
# prune_interval_seconds = 30

# Regex filter for topics to be monitored
# Default is ".*"
# topics_filter_regex = ".*"

# Regex filter for consumer groups to be monitored
# Default is ".*"
# groups_filter_regex = ".*"

# if rename kafka_consumergroup_uncommitted_offsets to kafka_consumergroup_lag
# Default is false
# rename_uncommit_offset_to_lag = false

# if disable calculating lag rate
# Default is false
# disable_calculate_lag_rate = false
@@ -1,26 +1,155 @@
# kafka plugin

Kafka's core metrics are all exposed via JMX; see this [article](https://time.geekbang.org/column/article/628498). JMX-exposed metrics can be collected with jolokia or the jmx_exporter jar; this plugin is not needed for that.

What this plugin mainly collects is consumer lag data, which cannot be obtained from the Kafka broker's JMX.

This plugin is forked from [https://github.com/davidmparrott/kafka_exporter](https://github.com/davidmparrott/kafka_exporter) (the "davidmparrott version" below), which is in turn forked from [https://github.com/danielqsj/kafka_exporter](https://github.com/danielqsj/kafka_exporter) (the "danielqsj version" below).

The danielqsj version is the original; its GitHub repo is more active and it is the one most used in the Prometheus ecosystem. The davidmparrott version differs from it in the following metric names:

| davidmparrott version | danielqsj version |
| ---- | ---- |
| kafka_consumergroup_uncommit_offsets | kafka_consumergroup_lag |
| kafka_consumergroup_uncommit_offsets_sum | kafka_consumergroup_lag_sum |
| kafka_consumergroup_uncommitted_offsets_zookeeper | kafka_consumergroup_lag_zookeeper |

To get the danielqsj metric names, set the following in `[[instances]]`:

```toml
rename_uncommit_offset_to_lag = true
```

The davidmparrott version also adds the following metrics, which estimate the lag rate:

- kafka_consumer_lag_millis
- kafka_consumer_lag_interpolation
- kafka_consumer_lag_extrapolation

Why compute a rate? A large lag with fast consumption will not back up, while a small lag with slow consumption still will, so lag size alone cannot tell you the backlog risk; judging it from the historical consumption rate is more sound. Computing this rate takes a fair amount of memory and can be switched off with:

```toml
disable_calculate_lag_rate = true
```

## Collection configuration

categraf config file: `conf/input.kafka/kafka.toml`. If you run multiple Kafka clusters, write one `[[instances]]` section per cluster. A sample configuration:

```toml
[[instances]]
log_level = "error"
kafka_uris = ["192.168.0.250:9092"]
labels = { cluster="kafka-cluster-01", service="kafka" }
```
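
A minimal sketch of that multi-cluster layout (the broker addresses below are illustrative), with a distinct `cluster` label per section so the series stay distinguishable:

```toml
[[instances]]
kafka_uris = ["10.0.1.1:9092", "10.0.1.2:9092"]
labels = { cluster="kafka-cluster-01", service="kafka" }

[[instances]]
kafka_uris = ["10.0.2.1:9092"]
labels = { cluster="kafka-cluster-02", service="kafka" }
```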

The complete annotated configuration is as follows:

```toml
[[instances]]
# # interval = global.interval * interval_times
# interval_times = 1

# append some labels to metrics
# cluster is a preferred tag with the cluster name. If none is provided, the first of kafka_uris will be used
labels = { cluster="kafka-cluster-01" }

# log level only for kafka exporter
log_level = "error"

# Address (host:port) of Kafka server.
# kafka_uris = ["127.0.0.1:9092","127.0.0.1:9092","127.0.0.1:9092"]
kafka_uris = []

# Connect using SASL/PLAIN
# Default is false
# use_sasl = false

# Only set this to false if using a non-Kafka SASL proxy
# Default is true
# use_sasl_handshake = false

# SASL user name
# sasl_username = "username"

# SASL user password
# sasl_password = "password"

# The SASL SCRAM SHA algorithm sha256 or sha512 as mechanism
# sasl_mechanism = ""

# Connect using TLS
# use_tls = false

# The optional certificate authority file for TLS client authentication
# ca_file = ""

# The optional certificate file for TLS client authentication
# cert_file = ""

# The optional key file for TLS client authentication
# key_file = ""

# If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure
# insecure_skip_verify = true

# Kafka broker version
# Default is 2.0.0
# kafka_version = "2.0.0"

# if you need to use a group from zookeeper
# Default is false
# use_zookeeper_lag = false

# Address array (hosts) of zookeeper server.
# zookeeper_uris = []

# Metadata refresh interval
# Default is 1m
# metadata_refresh_interval = "1m"

# Whether show the offset/lag for all consumer group, otherwise, only show connected consumer groups, default is true
# Default is true
# offset_show_all = true

# If true, all scrapes will trigger kafka operations otherwise, they will share results. WARN: This should be disabled on large clusters
# Default is false
# allow_concurrency = false

# Maximum number of offsets to store in the interpolation table for a partition
# Default is 1000
# max_offsets = 1000

# How frequently should the interpolation table be pruned, in seconds.
# Default is 30
# prune_interval_seconds = 30

# Regex filter for topics to be monitored
# Default is ".*"
# topics_filter_regex = ".*"

# Regex filter for consumer groups to be monitored
# Default is ".*"
# groups_filter_regex = ".*"

# if rename kafka_consumergroup_uncommitted_offsets to kafka_consumergroup_lag
# Default is false
# rename_uncommit_offset_to_lag = false

# if disable calculating lag rate
# Default is false
# disable_calculate_lag_rate = false
```

## Alert rules

Nightingale ships with built-in Kafka alert rules; clone them into your own business group to use them.

![kafka-alerts](./alerts-kafka.png)

## Dashboard

Nightingale ships with a built-in Kafka dashboard; clone it into your own business group to use it.

![kafka-dash](./dash-kafka.png)
BIN integrations/Kafka/markdown/alerts-kafka.png Normal file
BIN integrations/Kafka/markdown/dash-kafka.png Normal file
266 integrations/Kubernetes/alerts/apiserver.json Normal file
@@ -0,0 +1,266 @@
[
  {
    "name": "KubeClientCertificateExpiration-S2",
    "note": "A client certificate used to authenticate to the apiserver is expiring in less than 7.0 days.",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 0,
    "prom_ql": "apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 604800\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "KubeClientCertificateExpiration-S1",
    "note": "A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours.",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 0,
    "prom_ql": "apiserver_client_certificate_expiration_seconds_count{job=\"apiserver\"} > 0 and on(job) histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job=\"apiserver\"}[5m]))) < 86400\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "AggregatedAPIErrors",
    "note": "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors has increased for it in the past five minutes. High values indicate that the availability of the service changes too often.",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 0,
    "prom_ql": "sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "AggregatedAPIDown",
    "note": "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has been only {{ $value | humanize }}% available over the last 10m.",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 300,
    "prom_ql": "(1 - max by(name, namespace)(avg_over_time(aggregator_unavailable_apiservice[10m]))) * 100 < 85\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "KubeAPIDown",
    "note": "KubeAPI has disappeared from Prometheus target discovery.",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 900,
    "prom_ql": "absent(up{job=\"apiserver\"} == 1)\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": []
  },
  {
    "name": "KubeAPIErrorBudgetBurn-S1-120秒",
    "note": "The API server is burning too much error budget.",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 120,
    "prom_ql": "sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)\nand\nsum(apiserver_request:burnrate5m) > (14.40 * 0.01000)\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": ["long=1h", "short=5m"]
  },
  {
    "name": "KubeAPIErrorBudgetBurn-S1-900秒",
    "note": "The API server is burning too much error budget.",
    "severity": 1,
    "disabled": 0,
    "prom_for_duration": 900,
    "prom_ql": "sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)\nand\nsum(apiserver_request:burnrate30m) > (6.00 * 0.01000)\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": ["long=6h", "short=30m"]
  },
  {
    "name": "KubeAPIErrorBudgetBurn-S2-3600秒",
    "note": "The API server is burning too much error budget.",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 3600,
    "prom_ql": "sum(apiserver_request:burnrate1d) > (3.00 * 0.01000)\nand\nsum(apiserver_request:burnrate2h) > (3.00 * 0.01000)\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": ["long=1d", "short=2h"]
  },
  {
    "name": "KubeAPIErrorBudgetBurn-S2-10800秒",
    "note": "The API server is burning too much error budget.",
    "severity": 2,
    "disabled": 0,
    "prom_for_duration": 10800,
    "prom_ql": "sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)\nand\nsum(apiserver_request:burnrate6h) > (1.00 * 0.01000)\n",
    "prom_eval_interval": 15,
    "enable_stime": "00:00",
    "enable_etime": "23:59",
    "enable_days_of_week": ["1", "2", "3", "4", "5", "6", "0"],
    "enable_in_bg": 0,
    "notify_recovered": 1,
    "notify_channels": [],
    "notify_repeat_step": 60,
    "recover_duration": 0,
    "callbacks": [],
    "runbook_url": "",
    "append_tags": ["long=3d", "short=6h"]
  }
]
42 integrations/Kubernetes/collect/kubernetes.toml Normal file
@@ -0,0 +1,42 @@
# # collect interval
# interval = 15

[[instances]]
# # append some labels for series
# labels = { region="cloud", product="n9e" }

# # interval = global.interval * interval_times
# interval_times = 1

# URL for the kubelet
# url = "https://$HOSTIP:10250"
url = ""

gather_system_container_metrics = true
gather_node_metrics = true
gather_pod_container_metrics = true
gather_pod_volume_metrics = true
gather_pod_network_metrics = true

## Use bearer token for authorization. ('bearer_token' takes priority)
## If both of these are empty, we'll use the default serviceaccount:
## at: /var/run/secrets/kubernetes.io/serviceaccount/token
# bearer_token = "/path/to/bearer/token"
## OR
# bearer_token_string = "abc_123"

## Pod labels to be added as tags. An empty array for both include and
## exclude will include all labels.
# label_include = []
# label_exclude = ["*"]

## Set response_timeout (default 5 seconds)
# response_timeout = "5s"

## Optional TLS Config
use_tls = true
# tls_ca = "/etc/categraf/ca.pem"
# tls_cert = "/etc/categraf/cert.pem"
# tls_key = "/etc/categraf/key.pem"
## Use TLS but skip chain & host verification
insecure_skip_verify = true
42 integrations/Kubernetes/markdown/README.md Normal file
@@ -0,0 +1,42 @@
# Kubernetes

This plugin is deprecated. For Kubernetes monitoring, see this [series of articles](https://flashcat.cloud/categories/kubernetes%E7%9B%91%E6%8E%A7%E4%B8%93%E6%A0%8F/) or this [column](https://time.geekbang.org/column/article/630306).

The built-in alert rules and dashboards under the Kubernetes category are still usable, though.

---

The old plugin docs follow:

Forked from telegraf/kubernetes. This plugin pulls monitoring data from the kubelet API: system containers, the node, pod volumes, pod network, and pod containers.

## Change

Some switches were added:

`gather_system_container_metrics = true`

Whether to collect system containers (kubelet, runtime, misc, pods); kubelet, for example, is typically a static container rather than a business container.

`gather_node_metrics = true`

Whether to collect node-level metrics. Machine-level metrics are already collected by categraf itself, so in theory this is redundant and can be set to false; collecting them anyway is also fine and produces little data.

`gather_pod_container_metrics = true`

Whether to collect metrics for the containers inside Pods; these are usually business containers.

`gather_pod_volume_metrics = true`

Whether to collect Pod volume metrics.

`gather_pod_network_metrics = true`

Whether to collect Pod network metrics.

## Container monitoring

As these switches suggest, this kubernetes plugin only collects pod and container metrics, sourced from kubelet endpoints such as `/stats/summary` and `/pods`. So should container monitoring read `/metrics/cadvisor`, or use this kubernetes plugin? A few points to weigh (see the sketch after this list):

1. Data from `/metrics/cadvisor` carries no business-defined labels, while this plugin attaches them automatically. Business labels can get messy, so each company should set a convention, e.g. allow only project, region, env, service, app, and job labels and filter out the rest via the plugin's label_include / label_exclude settings.
2. This plugin collects fewer metrics than `/metrics/cadvisor` exposes, but the common cpu, mem, net, and volume ones are all there.
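
A minimal sketch of the label filtering from point 1, assuming the allowlist below matches your company's convention (the label names are illustrative):

```toml
[[instances]]
url = "https://$HOSTIP:10250"
# keep only the standardized business labels as tags, drop everything else
label_include = ["project", "region", "env", "service", "app", "job"]
label_exclude = ["*"]
```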
@@ -474,7 +474,7 @@
"datasource_ids": [
0
],
"name": "A disk larger than 200G is running out of space",
"note": "",
"prod": "metric",
"algorithm": "",
@@ -543,7 +543,7 @@
"datasource_ids": [
0
],
"name": "A disk smaller than 200G is running out of space",
"note": "",
"prod": "metric",
"algorithm": "",
1882 integrations/Linux/alerts/linux_by_categraf_zh.json Normal file
@@ -1,5 +1,5 @@
{
"name": "机器台账表格视图",
"tags": "",
"ident": "",
"configs": {
@@ -17,14 +17,14 @@
],
"panels": [
{
"type": "hexbin",
"id": "21b8b3ab-26aa-47cb-b814-f310f2d143aa",
"layout": {
"h": 5,
"w": 12,
"x": 0,
"y": 0,
"i": "21b8b3ab-26aa-47cb-b814-f310f2d143aa",
"isResizable": true
},
"version": "3.0.0",
@@ -33,23 +33,159 @@
"targets": [
{
"refId": "A",
"expr": "cpu_usage_active{cpu=\"cpu-total\", ident=~\"$ident\"}",
"legend": "{{ident}}",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {}
}
],
"name": "CPU利用率",
"maxPerRow": 4,
"custom": {
"textMode": "valueAndName",
"calc": "lastNotNull",
"colorRange": [
"thresholds"
],
"detailUrl": "/dashboards/linux-host-by-categraf?ident=${__field.labels.ident}"
},
"options": {
"thresholds": {
"steps": [
{ "color": "#ef3c3c", "value": 95, "type": "" },
{ "color": "#ff656b", "value": 85, "type": "" },
{ "color": "#ffae39", "value": 75, "type": "" },
{ "color": "#2c9d3d", "value": null, "type": "base" }
]
},
"standardOptions": {
"util": "percent"
}
}
},
{
"type": "hexbin",
"id": "86d4a502-21f7-4981-9b38-ed8e696b6f49",
"layout": {
"h": 5,
"w": 12,
"x": 12,
"y": 0,
"i": "872b2040-c5b0-43fe-92c7-e37cb77edffc",
"isResizable": true
},
"version": "3.0.0",
"datasourceCate": "prometheus",
"datasourceValue": "${prom}",
"targets": [
{
"refId": "A",
"expr": "mem_used_percent{ident=~\"$ident\"}",
"legend": "{{ident}}",
"instant": true
}
],
"transformations": [
{
"id": "organize",
"options": {}
}
],
"name": "内存利用率",
"maxPerRow": 4,
"custom": {
"textMode": "valueAndName",
"calc": "lastNotNull",
"colorRange": [
"thresholds"
],
"detailUrl": "/dashboards/linux-host-by-categraf?ident=${__field.labels.ident}"
},
"options": {
"thresholds": {
"steps": [
{ "color": "#ef3c3c", "value": 95, "type": "" },
{ "color": "#ff656b", "value": 85, "type": "" },
{ "color": "#ffae39", "value": 75, "type": "" },
{ "color": "#2c9d3d", "value": null, "type": "base" }
]
},
"standardOptions": {
"util": "percent"
}
}
},
{
"type": "table",
"id": "77bf513a-8504-4d33-9efe-75aaf9abc9e4",
"layout": {
"h": 11,
"w": 24,
"x": 0,
"y": 5,
"i": "77bf513a-8504-4d33-9efe-75aaf9abc9e4",
"isResizable": true
},
"version": "3.0.0",
"datasourceCate": "prometheus",
"datasourceValue": "${prom}",
"targets": [
{
"expr": "avg(cpu_usage_active{cpu=\"cpu-total\", ident=~\"$ident\"}) by (ident)",
"legend": "CPU使用率",
"refId": "A"
},
{
"expr": "avg(mem_used_percent{ident=~\"$ident\"}) by (ident)",
"legend": "内存使用率",
"refId": "B"
},
{
"expr": "avg(mem_total{ident=~\"$ident\"}) by (ident)",
"legend": "总内存",
"refId": "C"
},
{
"expr": "avg(disk_used_percent{ident=~\"$ident\",path=\"/\"}) by (ident)",
"legend": "根分区使用率",
"refId": "D"
}
],
"transformations": [
@@ -62,7 +198,8 @@
}
}
],
"name": "机器列表",
"maxPerRow": 4,
"custom": {
"showHeader": true,
"colorMode": "background",
@@ -70,7 +207,13 @@
"displayMode": "labelValuesToRows",
"aggrDimension": "ident",
"sortColumn": "ident",
"sortOrder": "ascend",
"links": [
{
"title": "详情",
"url": "/dashboards/linux-host-by-categraf?ident=${__field.labels.ident}"
}
]
},
"options": {
"standardOptions": {}
@@ -81,104 +224,124 @@
"value": "A"
},
"properties": {
"valueMappings": [
{
"type": "range",
"match": { "to": 65 },
"result": { "color": "#2c9d3d" }
},
{
"type": "range",
"match": { "to": 90 },
"result": { "color": "#ff656b" }
},
{
"type": "range",
"match": { "from": 90 },
"result": { "color": "#f50505" }
}
],
"standardOptions": {
"util": "percent"
}
}
},
{
"matcher": {
"value": "B"
},
"properties": {
"valueMappings": [
{
"type": "range",
"match": { "to": 65 },
"result": { "color": "#2c9d3d" }
},
{
"type": "range",
"match": { "to": 90 },
"result": { "color": "#ff656b" }
},
{
"type": "range",
"match": { "from": 90 },
"result": { "color": "#fa0a0a" }
}
],
"standardOptions": {
"util": "percent"
}
},
"type": "special"
},
{
"matcher": {
"value": "C"
},
"properties": {
"standardOptions": {
"decimals": 2,
"util": "bytesIEC"
},
"valueMappings": []
},
"type": "special"
},
{
"matcher": {
"value": "D"
},
"properties": {
"standardOptions": {
"decimals": 2,
"util": "percent"
},
"valueMappings": [
{
"type": "range",
"result": { "color": "#2c9d3d" },
"match": { "to": 90 }
},
{
"type": "range",
"result": { "color": "#ff656b" },
"match": { "from": 90 }
}
]
},
"type": "special"
}
]
}
11 integrations/Linux/markdown/README.md Normal file
@@ -0,0 +1,11 @@
# Linux

Once categraf is deployed, it automatically collects CPU, memory, disk, IO, and network metrics; no extra configuration is needed.

## Built-in dashboards

Nightingale ships with built-in dashboards; filenames ending in `_categraf` use categraf as the collector, and filenames ending in `_exporter` use node-exporter.

## Built-in alert rules

Nightingale ships with built-in alert rules; filenames ending in `_categraf` use categraf as the collector, and filenames ending in `_exporter` use node-exporter.
@@ -1,31 +1,34 @@
# MinIO

See [Collect MinIO metrics using Prometheus](https://min.io/docs/minio/linux/operations/monitoring/collect-minio-metrics-using-prometheus.html?ref=docs-redirect#minio-metrics-collect-using-prometheus).

Enable MinIO's Prometheus endpoint:

```bash
# Add the following variable when starting the MinIO service:
MINIO_PROMETHEUS_AUTH_TYPE=public
```

## Collection configuration

categraf's `conf/input.prometheus/prometheus.toml`:

```toml
[[instances]]
urls = [
    "http://192.168.1.188:9000/minio/v2/metrics/cluster"
]
url_label_key = "instance"
url_label_value = "{{.Host}}"
labels = {job="minio-cluster"}
```

## Dashboard

Nightingale ships with a built-in MinIO dashboard; clone it into your own business group to use it.

![dash-minio](./dash-minio.png)

## Alerts

Nightingale ships with built-in MinIO alert rules; clone them into your own business group to use them.

![alerts-minio](./alerts-minio.png)
BIN integrations/MinIO/markdown/alerts-minio.png Normal file
BIN integrations/MinIO/markdown/dash-minio.png Normal file
57 integrations/MongoDB/collect/mongodb.toml Normal file
@@ -0,0 +1,57 @@
[[instances]]
# log level, enum: panic, fatal, error, warn, warning, info, debug, trace, defaults to info.
log_level = "info"
# append some const labels to metrics
# NOTICE! the instance label is required for dashboards
labels = { instance="mongo-cluster-01" }

# mongodb dsn, see https://www.mongodb.com/docs/manual/reference/connection-string/
# mongodb_uri = "mongodb://127.0.0.1:27017"
mongodb_uri = ""
# if you don't specify the username or password in the mongodb_uri, you can set here.
# This will overwrite the dsn, it would be helpful when special characters existing in the username or password and you don't want to encode them.
# NOTICE! this user must be granted enough rights to query needed stats, see ../inputs/mongodb/README.md
username = "username@Bj"
password = "password@Bj"
# if set to true, use the direct connection way
# direct_connect = true

# collect all means you collect all the metrics, if set, all below enable_xxx flags in this section will be ignored
collect_all = true
# if set to true, collect databases metrics
# enable_db_stats = true
# if set to true, collect getDiagnosticData metrics
# enable_diagnostic_data = true
# if set to true, collect replSetGetStatus metrics
# enable_replicaset_status = true
# if set to true, collect top metrics by admin command
# enable_top_metrics = true
# if set to true, collect index metrics. You should specify one of the coll_stats_namespaces and the discovering_mode flags.
# enable_index_stats = true
# if set to true, collect collections metrics. You should specify one of the coll_stats_namespaces and the discovering_mode flags.
# enable_coll_stats = true

# Only get stats for the collections matching this list of namespaces. if none set, discovering_mode will be enabled.
# Example: db1.col1,db.col1
# coll_stats_namespaces = []
# Only get stats for index with the collections matching this list of namespaces.
# Example: db1.col1,db.col1
# index_stats_collections = []
# if set to true, replace -1 to DESC for label key_name of the descending_index metrics
# enable_override_descending_index = true

# which exposes metrics with 0.1x compatible metric names has been implemented which simplifies migration from the old version to the current version.
# compatible_mode = true


# [[instances]]
# # interval = global.interval * interval_times
# interval_times = 1

# log_level = "error"

# append some labels to metrics
# labels = { instance="mongo-cluster-02" }
# mongodb_uri = "mongodb://username:password@127.0.0.1:27017"
# collect_all = true
# compatible_mode = true
92 integrations/MongoDB/markdown/README.md Normal file
@@ -0,0 +1,92 @@
# mongodb

A MongoDB collection plugin, wrapped around [mongodb-exporter](https://github.com/percona/mongodb_exporter).

## Configuration

Sample configuration:

```toml
[[instances]]
# log level, enum: panic, fatal, error, warn, warning, info, debug, trace, defaults to info.
log_level = "info"
# append some const labels to metrics
# NOTICE! the instance label is required for dashboards
labels = { instance="mongo-cluster-01" }

# mongodb dsn, see https://www.mongodb.com/docs/manual/reference/connection-string/
# mongodb_uri = "mongodb://127.0.0.1:27017"
mongodb_uri = ""
# if you don't specify the username or password in the mongodb_uri, you can set here.
# This will overwrite the dsn, it would be helpful when special characters existing in the username or password and you don't want to encode them.
# NOTICE! this user must be granted enough rights to query needed stats, see ../inputs/mongodb/README.md
username = "username@Bj"
password = "password@Bj"
# if set to true, use the direct connection way
# direct_connect = true

# collect all means you collect all the metrics, if set, all below enable_xxx flags in this section will be ignored
collect_all = true
# if set to true, collect databases metrics
# enable_db_stats = true
# if set to true, collect getDiagnosticData metrics
# enable_diagnostic_data = true
# if set to true, collect replSetGetStatus metrics
# enable_replicaset_status = true
# if set to true, collect top metrics by admin command
# enable_top_metrics = true
# if set to true, collect index metrics. You should specify one of the coll_stats_namespaces and the discovering_mode flags.
# enable_index_stats = true
# if set to true, collect collections metrics. You should specify one of the coll_stats_namespaces and the discovering_mode flags.
# enable_coll_stats = true

# Only get stats for the collections matching this list of namespaces. if none set, discovering_mode will be enabled.
# Example: db1.col1,db.col1
# coll_stats_namespaces = []
# Only get stats for index with the collections matching this list of namespaces.
# Example: db1.col1,db.col1
# index_stats_collections = []
# if set to true, replace -1 to DESC for label key_name of the descending_index metrics
# enable_override_descending_index = true

# which exposes metrics with 0.1x compatible metric names has been implemented which simplifies migration from the old version to the current version.
# compatible_mode = true


# [[instances]]
# # interval = global.interval * interval_times
# interval_times = 1

# log_level = "error"

# append some labels to metrics
# labels = { instance="mongo-cluster-02" }
# mongodb_uri = "mongodb://username:password@127.0.0.1:27017"
# collect_all = true
# compatible_mode = true
```

categraf connects to MongoDB as a client and must be granted enough privileges to query the needed stats; see the [official docs](https://www.mongodb.com/docs/manual/reference/built-in-roles/#mongodb-authrole-clusterMonitor) for role details. At minimum the following roles are required:

```json
{
"role":"clusterMonitor",
"db":"admin"
},
{
"role":"read",
"db":"local"
}
```

A sample grant:

```shell
mongo -h xxx -u xxx -p xxx --authenticationDatabase admin
> use admin
> db.createUser({user:"categraf",pwd:"categraf",roles: [{role:"read",db:"local"},{"role":"clusterMonitor","db":"admin"}]})
```

## Dashboard and alert rules

Nightingale ships with built-in MongoDB alert rules and a dashboard; clone them into your own business group to use them. They apply even though the filenames end in `_exporter`, because this categraf plugin is built on mongodb-exporter.
65 integrations/MySQL/collect/mysql.toml Normal file
@@ -0,0 +1,65 @@
# # collect interval
# interval = 15

# [[queries]]
# mesurement = "users"
# metric_fields = [ "total" ]
# label_fields = [ "service" ]
# timeout = "3s"
# request = '''
# select 'n9e' as service, count(*) as total from n9e_v5.users
# '''


[[instances]]
# address = "127.0.0.1:3306"
# username = "root"
# password = "1234"

# # set tls=custom to enable tls
# parameters = "tls=false"

# extra_status_metrics = true
# extra_innodb_metrics = false
# gather_processlist_processes_by_state = false
# gather_processlist_processes_by_user = false
# gather_schema_size = true
# gather_table_size = false
# gather_system_table_size = false
# gather_slave_status = true

# # timeout
# timeout_seconds = 3

# # interval = global.interval * interval_times
# interval_times = 1

# important! use global unique string to specify instance
# labels = { instance="n9e-10.2.3.4:3306" }

## Optional TLS Config
# use_tls = false
# tls_min_version = "1.2"
# tls_ca = "/etc/categraf/ca.pem"
# tls_cert = "/etc/categraf/cert.pem"
# tls_key = "/etc/categraf/key.pem"
## Use TLS but skip chain & host verification
# insecure_skip_verify = true

#[[instances.queries]]
# mesurement = "lock_wait"
# metric_fields = [ "total" ]
# timeout = "3s"
# request = '''
#SELECT count(*) as total FROM information_schema.innodb_trx WHERE trx_state='LOCK WAIT'
#'''

# [[instances.queries]]
# mesurement = "users"
# metric_fields = [ "total" ]
# label_fields = [ "service" ]
# # field_to_append = ""
# timeout = "3s"
# request = '''
# select 'n9e' as service, count(*) as total from n9e_v5.users
# '''
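# A minimal sketch of an enabled instance, kept commented out here (the
# address, credentials, and instance label below are illustrative; keep the
# instance label globally unique so dashboards can tell servers apart):
# [[instances]]
# address = "127.0.0.1:3306"
# username = "categraf"
# password = "categraf"
# extra_status_metrics = true
# labels = { instance="n9e-127.0.0.1:3306" }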
745 integrations/MySQL/dashboards/dashboard-by-aws-rds.json Normal file
@@ -0,0 +1,745 @@
{
"name": "AWS RDS Telegraf",
"tags": "AWS Cloudwatch Telegraf",
"configs": {
  "var": [
    {
      "name": "region",
      "definition": "label_values(cloudwatch_aws_rds_cpu_utilization_average, region)",
      "multi": false,
      "type": "query"
    },
    {
      "type": "query",
      "definition": "label_values(cloudwatch_aws_rds_cpu_utilization_average{region=\"$region\"}, db_instance_identifier)",
      "name": "instance"
    }
  ],
  "panels": [
    {
      "type": "row",
      "id": "2ceac4da-53d8-432d-ad43-51a25cf63b21",
      "name": "Common metrics",
      "collapsed": true,
      "layout": { "h": 1, "w": 24, "x": 0, "y": 0, "i": "2ceac4da-53d8-432d-ad43-51a25cf63b21", "isResizable": false },
      "panels": []
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_cpu_utilization_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS CPU利用率(百分比)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds cpu 利用率平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [ { "value": 80, "color": "#d0021b" } ] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 0, "y": 1, "i": "2002c9f5-6177-4239-a0c6-2981edacae5a", "isResizable": true },
      "id": "2002c9f5-6177-4239-a0c6-2981edacae5a"
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_database_connections_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 数据库连接数",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 数据库连接平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [ { "value": 100, "color": "#d0021b" } ] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 12, "y": 1, "i": "05ddf798-e5f8-4b34-96f1-aaa2a45d1207", "isResizable": true },
      "id": "c54b9dca-88ce-425a-bf75-6d8b363f6ebb"
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_free_storage_space_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 可用存储空间(MB/秒)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 可用存储空间平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [ { "value": 10000000000, "color": "#d0021b" } ] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 0, "y": 7, "i": "2d42ff70-a867-4f02-9980-5f20c017a21e", "isResizable": true },
      "id": "997a6214-2ac0-46c6-a0b9-046810b2b8cf"
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_freeable_memory_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 可用内存(MB)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 可用内存平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [ { "value": 2000000000, "color": "#d0021b" } ] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 12, "y": 7, "i": "89bbb148-7fb3-4492-a5d6-abd0bb5df667", "isResizable": true },
      "id": "6c00311c-e931-487f-b088-3a3bfafc84ef"
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_lvm_write_iops_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 写入IOPS(次数/秒)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds lvm 写入 iops 平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 0, "y": 13, "i": "18640a88-13c0-4ce7-8456-60b20f8c7422", "isResizable": true },
      "id": "990ab5a1-4aa5-47c3-b7b7-a65f63459119"
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_read_iops_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 读取IOPS(次数/秒)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 读取 iops 平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 12, "y": 13, "i": "010a63f8-2a08-4d56-9131-0f9e50a7e2f4", "isResizable": true },
      "id": "a61a80da-7d0a-45a5-a868-bd442b3aa4cf"
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_write_throughput_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 写入吞吐量(MB/秒)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 写入吞吐量平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 0, "y": 19, "i": "58987f8f-09d3-445f-b22f-5f872f5b9dde", "isResizable": true },
      "id": "2e605342-3413-4004-9fcf-3dbbfa7e7be3"
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_read_throughput_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 读取吞吐量(MB/秒)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 读取吞吐量平均值",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 12, "y": 19, "i": "23e7b924-d638-4293-9840-78fb129d5410", "isResizable": true },
      "id": "1ef3f98d-1b54-408a-8cc2-4570c327d705"
    },
    {
      "type": "row",
      "id": "07e3cd80-1984-4ebe-a037-526e6a186ebb",
      "name": "NetWork metrics",
      "collapsed": true,
      "layout": { "h": 1, "w": 24, "x": 0, "y": 25, "i": "07e3cd80-1984-4ebe-a037-526e6a186ebb", "isResizable": false },
      "panels": []
    },
    {
      "targets": [
        {
          "expr": "cloudwatch_aws_rds_network_receive_throughput_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
          "refId": "A",
          "legend": "{{db_instance_identifier}}"
        }
      ],
      "name": "RDS 网络接收吞吐量(MB/秒)",
      "description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 网络接收吞吐量平均",
      "options": {
        "tooltip": { "mode": "all", "sort": "desc" },
        "legend": { "displayMode": "hidden" },
        "standardOptions": {},
        "thresholds": { "steps": [] }
      },
      "custom": { "drawStyle": "lines", "lineInterpolation": "smooth", "fillOpacity": 0, "stack": "off" },
      "version": "2.0.0",
      "type": "timeseries",
      "layout": { "h": 6, "w": 12, "x": 0, "y": 26, "i": "e1573095-990a-468d-bf2f-7bbf5a6dcb42", "isResizable": true },
      "id": "4ba500c9-e87e-41e4-bbc1-82fec507da9d"
|
||||
},
|
||||
{
|
||||
"targets": [
|
||||
{
|
||||
"expr": "cloudwatch_aws_rds_network_transmit_throughput_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
|
||||
"refId": "A",
|
||||
"legend": "{{db_instance_identifier}}"
|
||||
}
|
||||
],
|
||||
"name": "RDS 网络传输吞吐量(MB/秒)",
|
||||
"description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 网络传输吞吐量平均值",
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "all",
|
||||
"sort": "desc"
|
||||
},
|
||||
"legend": {
|
||||
"displayMode": "hidden"
|
||||
},
|
||||
"standardOptions": {},
|
||||
"thresholds": {
|
||||
"steps": []
|
||||
}
|
||||
},
|
||||
"custom": {
|
||||
"drawStyle": "lines",
|
||||
"lineInterpolation": "smooth",
|
||||
"fillOpacity": 0,
|
||||
"stack": "off"
|
||||
},
|
||||
"version": "2.0.0",
|
||||
"type": "timeseries",
|
||||
"layout": {
|
||||
"h": 6,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 26,
|
||||
"i": "0493a01d-d066-482a-b677-2d9ae1d9a30b",
|
||||
"isResizable": true
|
||||
},
|
||||
"id": "edee8285-1274-4ddc-b166-fb773c764c2b"
|
||||
},
|
||||
{
|
||||
"targets": [
|
||||
{
|
||||
"expr": "cloudwatch_aws_rds_write_latency_average{region=\"$region\",db_instance_identifier=\"$instance\"} * 1000",
|
||||
"refId": "A",
|
||||
"legend": "{{db_instance_identifier}}"
|
||||
}
|
||||
],
|
||||
"name": "RDS 写入延迟(毫秒)",
|
||||
"description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 写入延迟平均值",
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "all",
|
||||
"sort": "desc"
|
||||
},
|
||||
"legend": {
|
||||
"displayMode": "hidden"
|
||||
},
|
||||
"standardOptions": {},
|
||||
"thresholds": {
|
||||
"steps": []
|
||||
}
|
||||
},
|
||||
"custom": {
|
||||
"drawStyle": "lines",
|
||||
"lineInterpolation": "smooth",
|
||||
"fillOpacity": 0,
|
||||
"stack": "off"
|
||||
},
|
||||
"version": "2.0.0",
|
||||
"type": "timeseries",
|
||||
"layout": {
|
||||
"h": 6,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 32,
|
||||
"i": "fb7ee87d-7bec-4123-ab16-7ef2b6838d8c",
|
||||
"isResizable": true
|
||||
},
|
||||
"id": "ecb9b8a5-b168-4a65-b7f6-7912ab6c6b22"
|
||||
},
|
||||
{
|
||||
"targets": [
|
||||
{
|
||||
"expr": "cloudwatch_aws_rds_read_latency_average{region=\"$region\",db_instance_identifier=\"$instance\"} * 1000",
|
||||
"refId": "A",
|
||||
"legend": "{{db_instance_identifier}}"
|
||||
}
|
||||
],
|
||||
"name": "RDS 读取延迟(毫秒)",
|
||||
"description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 读取延迟平均值",
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "all",
|
||||
"sort": "desc"
|
||||
},
|
||||
"legend": {
|
||||
"displayMode": "hidden"
|
||||
},
|
||||
"standardOptions": {},
|
||||
"thresholds": {
|
||||
"steps": []
|
||||
}
|
||||
},
|
||||
"custom": {
|
||||
"drawStyle": "lines",
|
||||
"lineInterpolation": "smooth",
|
||||
"fillOpacity": 0,
|
||||
"stack": "off"
|
||||
},
|
||||
"version": "2.0.0",
|
||||
"type": "timeseries",
|
||||
"layout": {
|
||||
"h": 6,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 32,
|
||||
"i": "d652843b-4005-4448-8342-b3761f58677b",
|
||||
"isResizable": true
|
||||
},
|
||||
"id": "60d009fa-e547-45be-a862-9b156c15b675"
|
||||
},
|
||||
{
|
||||
"type": "row",
|
||||
"id": "3fafd89f-e6dc-4666-96b7-9f2dc216f496",
|
||||
"name": "Additional metrics",
|
||||
"collapsed": true,
|
||||
"layout": {
|
||||
"h": 1,
|
||||
"w": 24,
|
||||
"x": 0,
|
||||
"y": 38,
|
||||
"i": "3fafd89f-e6dc-4666-96b7-9f2dc216f496",
|
||||
"isResizable": false
|
||||
},
|
||||
"panels": []
|
||||
},
|
||||
{
|
||||
"targets": [
|
||||
{
|
||||
"expr": "cloudwatch_aws_rds_disk_queue_depth_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
|
||||
"refId": "A",
|
||||
"legend": "{{db_instance_identifier}}"
|
||||
}
|
||||
],
|
||||
"name": "RDS 队列深度(数量)",
|
||||
"description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 磁盘队列深度平均值",
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "all",
|
||||
"sort": "desc"
|
||||
},
|
||||
"legend": {
|
||||
"displayMode": "hidden"
|
||||
},
|
||||
"standardOptions": {},
|
||||
"thresholds": {
|
||||
"steps": []
|
||||
}
|
||||
},
|
||||
"custom": {
|
||||
"drawStyle": "lines",
|
||||
"lineInterpolation": "smooth",
|
||||
"fillOpacity": 0,
|
||||
"stack": "off"
|
||||
},
|
||||
"version": "2.0.0",
|
||||
"type": "timeseries",
|
||||
"layout": {
|
||||
"h": 6,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 39,
|
||||
"i": "b36508a8-057d-44fe-9899-74862407fd03",
|
||||
"isResizable": true
|
||||
},
|
||||
"id": "7edcf2a8-16f3-49ef-9026-e53dc5e72c69"
|
||||
},
|
||||
{
|
||||
"targets": [
|
||||
{
|
||||
"expr": "cloudwatch_aws_rds_bin_log_disk_usage_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
|
||||
"refId": "A",
|
||||
"legend": "{{db_instance_identifier}}"
|
||||
}
|
||||
],
|
||||
"name": "RDS 二进制日志磁盘使用情况 (MB)",
|
||||
"description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 二进制日志磁盘使用情况 (MB)",
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "all",
|
||||
"sort": "desc"
|
||||
},
|
||||
"legend": {
|
||||
"displayMode": "hidden"
|
||||
},
|
||||
"standardOptions": {},
|
||||
"thresholds": {
|
||||
"steps": []
|
||||
}
|
||||
},
|
||||
"custom": {
|
||||
"drawStyle": "lines",
|
||||
"lineInterpolation": "smooth",
|
||||
"fillOpacity": 0,
|
||||
"stack": "off"
|
||||
},
|
||||
"version": "2.0.0",
|
||||
"type": "timeseries",
|
||||
"layout": {
|
||||
"h": 6,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 39,
|
||||
"i": "ca09fee2-6496-444a-937d-3fc2d7483630",
|
||||
"isResizable": true
|
||||
},
|
||||
"id": "42143731-22a9-45b4-bb1e-ddb8f2c11a70"
|
||||
},
|
||||
{
|
||||
"targets": [
|
||||
{
|
||||
"expr": "cloudwatch_aws_rds_swap_usage_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
|
||||
"refId": "A",
|
||||
"legend": "{{db_instance_identifier}}"
|
||||
}
|
||||
],
|
||||
"name": "RDS 交换分区使用情况(MB)",
|
||||
"description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 交换分区使用平均值",
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "all",
|
||||
"sort": "desc"
|
||||
},
|
||||
"legend": {
|
||||
"displayMode": "hidden"
|
||||
},
|
||||
"standardOptions": {},
|
||||
"thresholds": {
|
||||
"steps": []
|
||||
}
|
||||
},
|
||||
"custom": {
|
||||
"drawStyle": "lines",
|
||||
"lineInterpolation": "smooth",
|
||||
"fillOpacity": 0,
|
||||
"stack": "off"
|
||||
},
|
||||
"version": "2.0.0",
|
||||
"type": "timeseries",
|
||||
"layout": {
|
||||
"h": 6,
|
||||
"w": 12,
|
||||
"x": 0,
|
||||
"y": 45,
|
||||
"i": "1252f5b7-278b-4cd9-9f36-8fb5ccf6ee51",
|
||||
"isResizable": true
|
||||
},
|
||||
"id": "51c6f9d9-30db-4514-a54d-712e1a570b23"
|
||||
},
|
||||
{
|
||||
"targets": [
|
||||
{
|
||||
"expr": "cloudwatch_aws_rds_burst_balance_average{region=\"$region\",db_instance_identifier=\"$instance\"}",
|
||||
"refId": "A",
|
||||
"legend": "{{db_instance_identifier}}"
|
||||
}
|
||||
],
|
||||
"name": "RDS 突发信用余额平均值(百分比)",
|
||||
"description": "* Telegraf Gather AWS Cloudwatch RDS\n* cloudwatch aws rds 突发余额平均值",
|
||||
"options": {
|
||||
"tooltip": {
|
||||
"mode": "all",
|
||||
"sort": "desc"
|
||||
},
|
||||
"legend": {
|
||||
"displayMode": "hidden"
|
||||
},
|
||||
"standardOptions": {
|
||||
"max": 110
|
||||
},
|
||||
"thresholds": {
|
||||
"steps": []
|
||||
}
|
||||
},
|
||||
"custom": {
|
||||
"drawStyle": "lines",
|
||||
"lineInterpolation": "smooth",
|
||||
"fillOpacity": 0,
|
||||
"stack": "off"
|
||||
},
|
||||
"version": "2.0.0",
|
||||
"type": "timeseries",
|
||||
"layout": {
|
||||
"h": 6,
|
||||
"w": 12,
|
||||
"x": 12,
|
||||
"y": 45,
|
||||
"i": "05473d8c-ea01-40c7-b4d4-47378a42aa3e",
|
||||
"isResizable": true
|
||||
},
|
||||
"id": "767bcc71-3f71-443a-9713-03f587ccc350"
|
||||
}
|
||||
],
|
||||
"version": "2.0.0"
|
||||
}
|
||||
}
|
||||
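Note that the two latency panels multiply the raw series by 1000: CloudWatch reports the RDS `ReadLatency` and `WriteLatency` metrics in seconds, while the panel titles are in milliseconds. As a minimal sketch of how the panel structure above can be consumed programmatically, the Go program below parses the dashboard JSON and prints each panel's name next to its PromQL expression. The filename `aws_rds_dashboard.json` is a placeholder assumption, and the generic walk deliberately avoids assuming the export's wrapper keys, which are not shown here:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// walk recursively visits every JSON value and prints the name and
// first target expression of anything that looks like a panel, i.e.
// an object carrying both "name" and "targets" keys.
func walk(v any) {
	switch node := v.(type) {
	case map[string]any:
		name, hasName := node["name"].(string)
		targets, hasTargets := node["targets"].([]any)
		if hasName && hasTargets && len(targets) > 0 {
			if t, ok := targets[0].(map[string]any); ok {
				fmt.Printf("%-45s %v\n", name, t["expr"])
			}
		}
		for _, child := range node {
			walk(child)
		}
	case []any:
		for _, child := range node {
			walk(child)
		}
	}
}

func main() {
	// "aws_rds_dashboard.json" is a placeholder name for the file above.
	raw, err := os.ReadFile("aws_rds_dashboard.json")
	if err != nil {
		panic(err)
	}
	var doc any
	if err := json.Unmarshal(raw, &doc); err != nil {
		panic(err)
	}
	walk(doc)
}
```

Row panels ("Network metrics", "Additional metrics") are skipped automatically because they carry a `name` but no `targets`, which also makes the same walk usable as a quick sanity check that every timeseries panel defines at least one query.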