Compare commits

..

155 Commits

Author SHA1 Message Date
ning
17cbfb8453 change alert_rule extra_config type 2023-10-20 11:26:30 +08:00
ning
6b89b7b4a5 change configs cval type 2023-10-20 10:20:08 +08:00
shardingHe
4fe5828d8d feat: macro variable in Configs (#1725)
* configs form user crud

* configs form user crud

* code refactor

* code refactor

* configs for user variable update & add InitRSAConfig for alert.go

* configs for user variable update

* remove InitRSAConfig for alert.go

* add config of Center.Encryption

* migrate for config and set default value

* add annotation for Center.Encryption

* code refactor

* code refactor

* code refactor again

* code refactor

* remove userVariableCheck bool return

* remove userVariableCheck bool return

* optimize InitRSAConfig

* optimize InitRSAConfig

* optimize InitRSAConfig

* macro variable

* optimize config

* remove test function user-variable-ras

* code refactor config_cache

* code refactor again

* code refactor again

* ReplaceMacroVariables return string value & optimize code

* add user variable check on ckey, it is not recommended to use the period "."

* change configs ckey+external=uniqueness

* configs add add external value

* configs remove isInternal check(ckey is no longer unique field)

* update userVariableCheck

* migrate drop unique filed limit

* refactor rsa generate logic. read key pairs from config.toml(HTTP.RSA) if file is not exist generate rsa key paris and saving to database(configs).

* Improve the RSA key generation logic. In the first step, attempt to read key pairs from the database or the 'config.toml' file. If they don't exist, generate a new set of key pairs and store them in the database.

* optimize code

* ckey use C-style naming convention and optimize smtp validations

* change email logger. print host & port

* update init rsa config logically

* update migrate about DropUniqueFiled

* update migrate about DropUniqueFiled

* make migrating at the first step

* move config of rsa. generate in db.

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-10-19 15:02:59 +08:00
Yening Qin
98422d696e refactor: heartbeat api (#1747) 2023-10-17 17:19:37 +08:00
shardingHe
e3103faeae refactor: Integrations sync from categraf (#1739)
* sync fist step

* integrations sync 14

* integrations sync 20

* integrations sync

* integrations sync 1 & add aliyun alerts

* update v2 dashboard to v3, and move dashboard to cloudwatch, cAdvisor, nginx_upstream_check. make sure all dashboards's name is unique.

* update readme

* delete nsq readme file

* rename switch_legacy,delete sockstat

* Rename the directory to match the lowercase plugin name

* add AMD_ROCm_SMI

* add ipvs

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-10-17 11:05:52 +08:00
Ulric Qin
0b23ddffb2 ignore rsa 2023-10-16 16:23:50 +08:00
Ulric Qin
37fa12e214 remove rust wait tool 2023-10-16 13:29:14 +08:00
Ulric Qin
328f8ac125 Merge branch 'main' of github.com:ccfos/nightingale 2023-10-16 12:25:21 +08:00
Ulric Qin
744749d22b remove rust wait tool 2023-10-16 12:25:09 +08:00
Yening Qin
a56fd039b4 fix eval query err (#1743) 2023-10-16 12:18:27 +08:00
Ulric Qin
e16867b72a code refactor 2023-10-16 12:12:09 +08:00
Ulric Qin
ff6756447b refactor docker compose configurations 2023-10-15 17:05:35 +08:00
Ulric Qin
546980a906 add tcpx.WaitHosts 2023-10-15 14:26:54 +08:00
ning
f93e2ad4b6 fix: push sample 2023-10-13 11:54:31 +08:00
ning
68732d6b31 refactor: rsa generate file 2023-10-12 20:33:42 +08:00
shardingHe
c3b8146e7f fix: rsa decrypt & generate method (#1740)
* Maintain consistency between encryption and decryption methods

* Maintain consistency between encryption and decryption methods

* Maintain consistency between encryption and decryption methods

* optimize code

* optimize code

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-10-12 19:34:55 +08:00
ning
271b7ca8a5 refactor get rule type 2023-10-12 19:33:05 +08:00
ning
2fa87cc428 fix: tdengine alert inhibit 2023-10-12 16:44:39 +08:00
ning
ffa7c4ee79 docs: delete ops.yaml 2023-10-12 15:50:33 +08:00
ning
86c4374238 refactor: unified heartbeat api 2023-10-12 15:27:02 +08:00
dependabot[bot]
4ee8f1b9ad build(deps): bump golang.org/x/net from 0.10.0 to 0.17.0 (#1738)
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.10.0 to 0.17.0.
- [Commits](https://github.com/golang/net/compare/v0.10.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-10-12 14:47:43 +08:00
ning
47dcd2b054 refactor: update go.mod 2023-10-12 14:44:45 +08:00
yuweizzz
d099e6b85c refactor: convert task script line endings (#1717) 2023-10-09 10:30:39 +08:00
shardingHe
320401e8f3 fix: sync event ID for edge model (#1710)
* sync event id for edge model

* update PostByUrlsWithResp

* update test file

* update test file

* update mock struct

* update mock struct

* update mock struct

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-10-09 10:29:04 +08:00
Ulric Qin
6d75244f8f Merge branch 'main' of github.com:ccfos/nightingale 2023-10-09 07:01:27 +08:00
Ulric Qin
05ac5d51b5 forward metrics one by one 2023-10-09 07:01:13 +08:00
shardingHe
2b7c9a9673 feat: merge operation permission points from built-in (#1729)
* merge operation permission points from build-in

* optimizing spelling

* optimizing ops

* remove duplicate role operation

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-10-07 11:49:52 +08:00
Yening Qin
f76bfdf6b3 Datasource add is_default column (#1728)
* ds add is_default

* code refactor
2023-09-28 15:05:14 +08:00
Lars Lehtonen
7ebb70b896 prom: fix dropped errors (#1727) 2023-09-27 20:27:18 +08:00
Ulric Qin
960ed6bf70 Merge branch 'main' of github.com:ccfos/nightingale 2023-09-27 10:00:30 +08:00
Ulric Qin
b4fd4f1087 code refactor 2023-09-27 09:59:49 +08:00
shardingHe
109d6db1fc fix: Optimize user config (#1724)
* optimize initRSAFile

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-09-26 10:18:38 +08:00
shardingHe
07e2d9ed10 feat: configs from user crud (#1714)
* configs form user crud

* configs for user variable update & add InitRSAConfig for alert.go

* configs for user variable update

* remove InitRSAConfig for alert.go

* add config of Center.Encryption

* migrate for config and set default value

* add annotation for Center.Encryption

* remove userVariableCheck bool return

* optimize InitRSAConfig

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-09-25 20:44:07 +08:00
ning
c5cd6c0337 docs: update integrations 2023-09-25 13:27:35 +08:00
Yening Qin
fe1d566326 feat: add tdengine datasource (#1722)
* refactor: add alert stats

* add tdengine

* add sql tpl api
2023-09-25 11:40:20 +08:00
Lars Lehtonen
cedc918a09 pkg/secu: fix dropped error (#1698) 2023-09-25 11:03:00 +08:00
shardingHe
1e6c0865dd feat: merge i18n configuration (#1701)
* merge expand config and build-in, prioritize the settings within the expand config options in case of conflicts

* code refactor

* code refactor

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-09-22 17:49:22 +08:00
shardingHe
7649986b55 fix: oauth2 add debug log (#1706)
* oauth2 add CallbackOutput debug log

* code refactor

* code refactor

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-09-22 17:48:14 +08:00
yanli
86a82b409a Update user_group.go (#1718)
Fix :修复同步用户组接口BUG
2023-09-21 17:15:11 +08:00
710leo
f6ad9bdf82 add oidc log 2023-09-19 17:27:50 +08:00
Yening Qin
a647526084 config api (#1716) 2023-09-19 16:29:12 +08:00
qifenggang
44ed90e181 set target_ident when cache is not sync (#1708) 2023-09-18 17:51:59 +08:00
shardingHe
3e7273701d feat: add host ip for target (#1711)
* add host ip for target

* update host ip from heartbeat

* update host ip from heartbeat

* batch update host ip from heartbeat

* batch update host ip from heartbeat

* refactor code

* update target when either gid or host_ip has a new value

* rollback

* add debug log

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-09-14 14:45:16 +08:00
Ulric Qin
d77ed30940 add some sso logs 2023-09-11 18:30:30 +08:00
yuweizzz
5ae80e67a3 Update http_response_by_categraf.json (#1704) 2023-09-08 08:34:46 +08:00
idcdog
184389be33 1、增加victoriametrics单级版本视图(视图项定义参考官方grafana版本);2、针对集群版本视图调整datasource命名以便适配最新版本n9e (#1702) 2023-09-07 08:42:02 +08:00
ning
c1f022001f refactor: loki datasource 2023-09-01 23:14:08 +08:00
Tripitakav
616d56d515 feat: support loki datasources (#1700) 2023-09-01 23:07:27 +08:00
ning
10a0b5099e add debug log 2023-08-30 17:38:01 +08:00
chixianliangGithub
0815605298 Update reader.go (#1689)
remove  invalid code
2023-08-30 10:24:26 +08:00
ning
2df3216b32 refactor: event stats api 2023-08-29 15:33:24 +08:00
shardingHe
74491c666d Compute statistics for alert_cur_events (#1697)
* add statistics of alert_cur_events

* code refactor

* code refactor

* code refactor

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-08-29 15:15:32 +08:00
Ulric Qin
29a2eb6f2f code refactor 2023-08-28 11:13:29 +08:00
Ulric Qin
baf56746ce Merge branch 'main' of github.com:ccfos/nightingale 2023-08-28 10:33:48 +08:00
Ulric Qin
5867c5af8f update dashboard table demo 2023-08-28 10:33:34 +08:00
shardingHe
4a358f5cff fix: Automatically migrate recent table structure changes. (#1696)
* auto migrate alerting_engines[engine_cluster],task_record[event_id],chart_share[datasource_id],recording_rule[datasource_ids]

* auto migrate code refactor

* auto migrate drop chart_share

* auto migrate drop dashboard_id

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-08-25 18:46:22 +08:00
Ulric Qin
13f2b008fd code refactor 2023-08-25 17:52:09 +08:00
Ulric Qin
84400cd657 code refactor 2023-08-25 10:20:41 +08:00
Ulric Qin
f2a3a6933e Merge branch 'main' of github.com:ccfos/nightingale 2023-08-25 10:15:04 +08:00
Ulric Qin
0a4d1cad4c add linux_by_categraf_zh.json 2023-08-25 10:14:52 +08:00
李明
08f472f9ee fix: PostgreSQL bool int convert error (#1695)
* fix: postgres bool int

* fix: code optimize

* fix: updateby

* optimize

* Update es_index_pattern.go

---------

Co-authored-by: Yening Qin <710leo@gmail.com>
2023-08-24 20:53:14 +08:00
ning
7f73945c8d Merge branch 'main' of github.com:ccfos/nightingale 2023-08-24 16:09:43 +08:00
ning
56a7860b5a docs: update integrations 2023-08-24 16:07:55 +08:00
shardingHe
25dab86b8e feat: add mute preview and an option to delete the active alerts (#1692)
* add mute preview and an option to delete these active alerts(del_alert_cur:true)


---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-08-24 15:56:42 +08:00
Ulric Qin
35b90ca162 upgrade mysql version in docker-compose 2023-08-23 16:43:09 +08:00
Ulric Qin
5babee6de3 Merge branch 'main' of github.com:ccfos/nightingale 2023-08-22 20:43:38 +08:00
Ulric Qin
7567d440a9 update metrics.yaml 2023-08-22 20:43:17 +08:00
shardingHe
2ecd799dab fix: attempt to send an email (#1691)
* After configuring the SMTP, attempt to send the email.

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-08-22 17:43:12 +08:00
shardingHe
5b3561f983 fix: alert pre-save validation (#1690)
* add interface of validation rule
---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-08-22 17:42:05 +08:00
shardingHe
cce3711c02 fix: Edge config check (#1686)
* check the configuration for the CenterApi in the edge model

* set default timeout 5000 ms

* Update edge.go

* Update post.go

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-08-21 16:46:55 +08:00
ning
9cdbda0828 refactor: alert rule mute strategy 2023-08-19 20:17:24 +08:00
ning
9c4775fd38 refactor: alert subscribe gets api 2023-08-18 11:20:13 +08:00
ning
212e0aa4c3 refactor: alert subscribe verify 2023-08-18 10:20:33 +08:00
xtan
05300ec0e9 fix: compatible with pg (#1683)
* fix: fix pg migrate

* fix: compatible with pg

* fix: compatible with pg
2023-08-18 10:00:00 +08:00
shardingHe
67fb49e54e After configuring the SMTP, attempt to send the email. (#1684)
* After configuring the SMTP, attempt to send the email.

* refactor code

* refactor code

* Update router_notify_config.go

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-08-17 20:30:53 +08:00
shardingHe
7164b696b1 Change validate rule ignore non-default channels (#1680)
* change validate rule ignore non-default channels

* refactor code

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
Co-authored-by: Yening Qin <710leo@gmail.com>
2023-08-17 18:18:20 +08:00
Yening Qin
8728167733 feat: sso support skip tls verify (#1685)
* refactor: sso support skip tls verify

* fix: update oidc

* fix: cas init enable
2023-08-17 18:03:26 +08:00
ning
6e80a63b68 refactor: cur event del 2023-08-17 14:20:03 +08:00
kongfei605
9e43a22ec3 use redis.Cmdable instead of Redis (#1681) 2023-08-16 18:37:57 +08:00
Ulric Qin
49d8ed4a6f Merge branch 'main' of github.com:ccfos/nightingale 2023-08-16 17:44:03 +08:00
Ulric Qin
c7b537e6c7 expose tags_map 2023-08-16 17:43:48 +08:00
shardingHe
f1cdd2fa46 refactor: modify the alert_subscribe to make the datasource optional (#1679)
* subscribe change 'pord','datasource_ids' to optional item

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-08-16 14:23:16 +08:00
ning
3d5ad02274 feat: notification proxy supports http 2023-08-14 18:00:51 +08:00
Ulric Qin
1cb9f4becf code refactor 2023-08-14 15:01:36 +08:00
Ulric Qin
0d0dafbe49 code refactor 2023-08-14 15:00:05 +08:00
Ulric Qin
048d1df2d1 code refactor 2023-08-14 14:59:28 +08:00
ning
4fb4154e30 feat: add FormatDecimal 2023-08-10 23:15:09 +08:00
ning
0be69bbccd feat: add FormatDecimal 2023-08-10 22:58:36 +08:00
shardingHe
7015a40256 refactor: alert subscribe verify check (#1666)
* add BusiGroupFilter for alert_subscribe ,copy from TagFiler

* refactor BusiGroupFilter

* refactor BusiGroupFilter

* refactor BusiGroupFilter

* AlertSubscribe verify check

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-08-09 13:26:00 +08:00
Ulric Qin
03cca642e9 modify email words 2023-08-08 16:30:51 +08:00
ulricqin
579fd3780b Update community-governance.md 2023-08-08 10:55:14 +08:00
Ulric Qin
a85d91c10e Merge branch 'main' of github.com:ccfos/nightingale 2023-08-08 07:55:40 +08:00
Ulric Qin
af31c496a1 datasource checker for loki 2023-08-08 07:55:27 +08:00
shardingHe
f9efbaa954 refactor: use config arguments (#1665)
Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-08-07 19:03:15 +08:00
Ulric Qin
d541ec7f20 Merge branch 'main' of github.com:ccfos/nightingale 2023-08-07 09:16:05 +08:00
Ulric Qin
1d847e2c6f refactor datasource check of loki 2023-08-07 09:15:52 +08:00
xtan
2fedf4f075 docs: pg init sql (#1663) 2023-08-07 08:33:11 +08:00
Tripitakav
e9a02c4c80 refactor: sync rule to scheduler (#1657) 2023-08-07 08:29:50 +08:00
ning
8beaccdded refactor: GetTagFilters 2023-08-05 12:39:10 +08:00
shardingHe
af6003da6d feat: Add BusiGroupFilter for alert_subscribe (#1660)
* add BusiGroupFilter for alert_subscribe ,copy from TagFiler

* refactor BusiGroupFilter

* refactor BusiGroupFilter

---------

Co-authored-by: shardingHe <wangzihe@flashcat.cloud>
2023-08-04 18:30:00 +08:00
ning
76ac2cd013 refactor version api 2023-08-03 18:06:09 +08:00
ning
859876e3f8 change version api 2023-08-03 16:34:42 +08:00
ning
7d49e7fb34 feat: add github version api 2023-08-03 15:46:18 +08:00
ning
6c42ae9077 fix: query-range skip tls verify 2023-08-03 14:42:45 +08:00
Yening Qin
15dcc60407 refactor: proxy api (#1656)
* refactor: proxy api
2023-08-02 17:20:06 +08:00
Ulric Qin
5b811b7003 Merge branch 'main' of github.com:ccfos/nightingale 2023-08-02 16:22:09 +08:00
Ulric Qin
55d670fe3c code refactor 2023-08-02 16:21:57 +08:00
ning
ac3a5e52c7 docs: update ldap config 2023-08-02 14:16:20 +08:00
李明
2abe00e251 fix: post err process (#1653) 2023-08-02 13:33:44 +08:00
热心网友吴溢豪
1bd3c29e39 fix: open cstats init (#1654)
Co-authored-by: wuyh_1 <wuyh_1@chinatelecom.cn>
2023-08-02 13:28:36 +08:00
Ulric Qin
1a8087bda7 update zookeeper markdown 2023-08-02 09:36:24 +08:00
Ulric Qin
72b4c2b1ec update markdown of vmware 2023-08-02 09:20:09 +08:00
Ulric Qin
38e6820d7b update markdown of VictoriaMetrics 2023-08-02 09:10:14 +08:00
Ulric Qin
765b3a57fe update markdown of tomcat 2023-08-02 09:03:59 +08:00
Ulric Qin
1c4a32f8fa code refactor 2023-08-02 09:01:39 +08:00
Ulric Qin
3f258fcebf update markdown of springboot 2023-08-02 08:57:39 +08:00
Ulric Qin
140f2cbfa8 update markdown if snmp 2023-08-02 08:44:45 +08:00
Ulric Qin
6aacd77492 update markdown of redis 2023-08-02 08:34:51 +08:00
Ulric Qin
ef3f46f8b7 update markdown of integration RabbitMQ 2023-08-02 08:26:06 +08:00
Ulric Qin
0cdd25d2cf update markdown of integration Processes 2023-08-02 08:20:51 +08:00
Ulric Qin
5d02ce0636 update markdown of integration procstat 2023-08-02 08:17:41 +08:00
Ulric Qin
0cd1228ba7 update postgres markdown 2023-08-02 07:38:46 +08:00
Ulric Qin
0595401d14 update oracle markdown 2023-08-02 07:30:03 +08:00
Yening Qin
d724f8cc8e fix get tpl (#1655) 2023-08-02 00:48:02 +08:00
Ulric Qin
a3f5d458d7 add nginx markdown 2023-08-01 18:21:17 +08:00
Ulric Qin
76bfb130b0 code refactor 2023-08-01 18:05:50 +08:00
Ulric Qin
184bb78e3b add markdown of integration n9e 2023-08-01 17:49:49 +08:00
Ulric Qin
6a41af2cb2 update markdown of integration mysql 2023-08-01 17:42:30 +08:00
Ulric Qin
faa149cc87 code refactor 2023-08-01 17:26:10 +08:00
Ulric Qin
24592fe480 code refactor 2023-08-01 17:18:12 +08:00
Ulric Qin
4be53082e0 code refactor 2023-08-01 17:04:36 +08:00
Ulric Qin
ae8c9c668c code refactor 2023-08-01 16:52:21 +08:00
Ulric Qin
b0c15af04f code refactor 2023-08-01 16:39:46 +08:00
Ulric Qin
c05b710aff update markdown of kafka integration 2023-08-01 16:21:45 +08:00
Ulric Qin
4299c48aef update markdown of IPMI integration 2023-08-01 15:58:27 +08:00
Ulric Qin
ae0523dec0 code refactor 2023-08-01 15:50:26 +08:00
Ulric Qin
e18a6bda7b update markdown of integration http_response 2023-08-01 15:47:45 +08:00
Ulric Qin
e64be95f1c code refactor 2023-08-01 15:25:14 +08:00
Ulric Qin
a1aa0150f8 update markdown of integration elasticsearch 2023-08-01 15:14:27 +08:00
Ulric Qin
32f9cb5996 update markdown of ceph integration 2023-08-01 14:59:10 +08:00
Ulric Qin
3b7e692b01 update markdown of aliyun integration 2023-08-01 14:54:52 +08:00
Yening Qin
6491eba1da check datasource (#1651) 2023-07-31 15:32:03 +08:00
ning
bb7ea7e809 code refactor 2023-07-27 16:36:04 +08:00
ning
169930e3b8 docs: add markdown 2023-07-27 16:33:38 +08:00
ning
8e14047f36 docs: add markdown 2023-07-27 16:25:07 +08:00
ning
fd29a96f7b docs: remove jaeger 2023-07-27 15:24:58 +08:00
ning
820c12f230 docs: update markdown 2023-07-27 15:13:54 +08:00
ning
ff3550e7b3 docs: update markdown 2023-07-27 15:09:32 +08:00
ning
b65e43351d Merge branch 'main' of github.com:ccfos/nightingale 2023-07-27 15:02:58 +08:00
ning
3fb74b632b docs: update markdown 2023-07-27 15:02:35 +08:00
xtan
253e54344d docs: fix docker-compose for pg-vm (#1649) 2023-07-27 10:50:38 +08:00
ning
f1ee7d24a6 fix: sub rule filter 2023-07-26 18:14:45 +08:00
ning
475673b3e7 fix: admin role get targets 2023-07-26 16:49:38 +08:00
Yening Qin
dd49afef01 support markdown api and downtime select (#1645) 2023-07-25 17:06:25 +08:00
kongfei605
d0c842fe87 Merge pull request #1644 from ccfos/docker_update
install requests lib for python3
2023-07-25 11:19:14 +08:00
454 changed files with 71115 additions and 16422 deletions

7
.gitignore vendored
View File

@@ -30,6 +30,7 @@ _test
/dist
/etc/*.local.yml
/etc/*.local.conf
/etc/rsa/*
/etc/plugins/*.local.yml
/etc/script/rules.yaml
/etc/script/alert-rules.json
@@ -43,10 +44,12 @@ _test
/n9e
/docker/pub
/docker/n9e
/docker/mysqldata
/docker/experience_pg_vm/pgdata
/docker/compose-bridge/mysqldata
/docker/compose-host-network/mysqldata
/docker/compose-postgres/pgdata
/etc.local*
/front/statik/statik.go
/docker/compose-bridge/etc-nightingale/rsa/
.alerts
.idea

104
README.md
View File

@@ -4,71 +4,101 @@
</p>
<p align="center">
<a href="https://flashcat.cloud/docs/">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<a href="https://n9e.github.io">
<img alt="Docs" src="https://img.shields.io/badge/docs-get%20started-brightgreen"/></a>
<a href="https://hub.docker.com/u/flashcatcloud">
<img alt="Docker pulls" src="https://img.shields.io/docker/pulls/flashcatcloud/nightingale"/></a>
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ccfos/nightingale">
<br/><img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues closed" src="https://img.shields.io/github/issues-closed/ccfos/nightingale">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/ccfos/nightingale">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<a href="https://n9e-talk.slack.com/">
<img alt="GitHub contributors" src="https://img.shields.io/badge/join%20slack-%23n9e-brightgreen.svg"/></a>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
</p>
<p align="center">
告警管理专家,一体化的开源可观测平台
An open-source cloud-native monitoring system that is <b>all-in-one</b> <br/>
<b>Out-of-the-box</b>, it integrates data collection, visualization, and monitoring alert <br/>
We recommend upgrading your <b>Prometheus + AlertManager + Grafana</b> combination to Nightingale!
</p>
[English](./README_en.md) | [中文](./README.md)
夜莺Nightingale是中国计算机学会托管的开源云原生可观测工具最早由滴滴于 2020 年孵化并开源,并于 2022 年正式捐赠予中国计算机学会。夜莺采用 All-in-One 的设计理念,集数据采集、可视化、监控告警、数据分析于一体,与云原生生态紧密集成,融入了顶级互联网公司可观测性最佳实践,沉淀了众多社区专家经验,开箱即用。
## 资料
- 文档:[flashcat.cloud/docs](https://flashcat.cloud/docs/)
- 提问:[answer.flashcat.cloud](https://answer.flashcat.cloud/)
- 报Bug[github.com/ccfos/nightingale/issues](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml)
[English](./README.md) | [中文](./README_zh.md)
## 功能和特点
## Highlighted Features
- 统一接入各种时序库:支持对接 Prometheus、VictoriaMetrics、Thanos、Mimir、M3DB 等多种时序库,实现统一告警管理
- 专业告警能力:内置支持多种告警规则,可以扩展支持所有通知媒介,支持告警屏蔽、告警抑制、告警自愈、告警事件管理
- 高性能可视化引擎支持多种图表样式内置众多Dashboard模版也可导入Grafana模版开箱即用开源协议商业友好
- 无缝搭配 [Flashduty](https://flashcat.cloud/product/flashcat-duty/)实现告警聚合收敛、认领、升级、排班、IM集成确保告警处理不遗漏减少打扰更好协同
- 支持所有常见采集器:支持 [Categraf](https://flashcat.cloud/product/categraf)、telegraf、grafana-agent、datadog-agent、各种 exporter 作为采集器,没有什么数据是不能监控的
- 一体化观测平台:从 v6 版本开始,支持接入 ElasticSearch、Jaeger 数据源,实现日志、链路、指标多维度的统一可观测
- **Out-of-the-box**
- Supports multiple deployment methods such as **Docker, Helm Chart, and cloud services**, integrates data collection, monitoring, and alerting into one system, and comes with various monitoring dashboards, quick views, and alert rule templates. **It greatly reduces the construction cost, learning cost, and usage cost of cloud-native monitoring systems**.
- **Professional Alerting**
- Provides visual alert configuration and management, supports various alert rules, offers the ability to configure silence and subscription rules, supports multiple alert delivery channels, and has features such as alert self-healing and event management.
- **Cloud-Native**
- Quickly builds an enterprise-level cloud-native monitoring system through a turnkey approach, supports multiple collectors such as [Categraf](https://github.com/flashcatcloud/categraf), Telegraf, and Grafana-agent, supports multiple data sources such as Prometheus, VictoriaMetrics, M3DB, ElasticSearch, and Jaeger, and is compatible with importing Grafana dashboards. **It seamlessly integrates with the cloud-native ecosystem**.
- **High Performance and High Availability**
- Due to the multi-data-source management engine of Nightingale and its excellent architecture design, and utilizing a high-performance time-series database, it can handle data collection, storage, and alert analysis scenarios with billions of time-series data, saving a lot of costs.
- Nightingale components can be horizontally scaled with no single point of failure. It has been deployed in thousands of enterprises and tested in harsh production practices. Many leading Internet companies have used Nightingale for cluster machines with hundreds of nodes, processing billions of time-series data.
- **Flexible Extension and Centralized Management**
- Nightingale can be deployed on a 1-core 1G cloud host, deployed in a cluster of hundreds of machines, or run in Kubernetes. Time-series databases, alert engines, and other components can also be decentralized to various data centers and regions, balancing edge deployment with centralized management. **It solves the problem of data fragmentation and lack of unified views**.
## 产品演示
#### If you are using Prometheus and have one or more of the following requirement scenarios, it is recommended that you upgrade to Nightingale:
![演示](doc/img/n9e-screenshot-gif-v6.gif)
- Multiple systems such as Prometheus, Alertmanager, Grafana, etc. are fragmented and lack a unified view and cannot be used out of the box;
- The way to manage Prometheus and Alertmanager by modifying configuration files has a big learning curve and is difficult to collaborate;
- Too much data to scale-up your Prometheus cluster;
- Multiple Prometheus clusters running in production environments, which faced high management and usage costs;
## 部署架构
#### If you are using Zabbix and have the following scenarios, it is recommended that you upgrade to Nightingale:
![架构](doc/img/n9e-arch-latest.png)
- Monitoring too much data and wanting a better scalable solution;
- A high learning curve and a desire for better efficiency of collaborative use in a multi-person, multi-team model;
- Microservice and cloud-native architectures with variable monitoring data lifecycles and high monitoring data dimension bases, which are not easily adaptable to the Zabbix data model;
## 加入交流群
欢迎加入 QQ 交流群群号479290895QQ 群适合群友互助,夜莺研发人员通常不在群里。如果要报 bug 请到[这里](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml),提问到[这里](https://answer.flashcat.cloud/)。
#### If you are using [open-falcon](https://github.com/open-falcon/falcon-plus), we recommend you to upgrade to Nightingale
- For more information about open-falcon and Nightingale, please refer to read [Ten features and trends of cloud-native monitoring](https://mp.weixin.qq.com/s?__biz=MzkzNjI5OTM5Nw==&mid=2247483738&idx=1&sn=e8bdbb974a2cd003c1abcc2b5405dd18&chksm=c2a19fb0f5d616a63185cd79277a79a6b80118ef2185890d0683d2bb20451bd9303c78d083c5#rd)。
## Getting Started
[https://n9e.github.io/](https://n9e.github.io/)
## Screenshots
https://user-images.githubusercontent.com/792850/216888712-2565fcea-9df5-47bd-a49e-d60af9bd76e8.mp4
## Architecture
<img src="doc/img/arch-product.png" width="600">
Nightingale monitoring can receive monitoring data reported by various collectors (such as [Categraf](https://github.com/flashcatcloud/categraf) , telegraf, grafana-agent, Prometheus, etc.) and write them to various popular time-series databases (such as Prometheus, M3DB, VictoriaMetrics, Thanos, TDEngine, etc.). It provides configuration capabilities for alert rules, silence rules, and subscription rules, as well as the ability to view monitoring data. It also provides automatic alarm self-healing mechanisms (such as automatically calling back to a webhook address or executing a script after an alarm is triggered), and the ability to store and manage historical alarm events and view them in groups.
If the performance of a standalone time-series database (such as Prometheus) has bottlenecks or poor disaster recovery, we recommend using [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics). The VictoriaMetrics architecture is relatively simple, has excellent performance, and is easy to deploy and maintain. The architecture diagram is as shown above. For more detailed documentation on VictoriaMetrics, please refer to its [official website](https://victoriametrics.com/).
**We welcome you to participate in the Nightingale open-source project and community in various ways, including but not limited to**
- Adding and improving documentation => [n9e.github.io](https://n9e.github.io/)
- Sharing your best practices and experience in using Nightingale monitoring => [Article sharing]((https://n9e.github.io/docs/prologue/share/))
- Submitting product suggestions => [github issue](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md)
- Submitting code to make Nightingale monitoring faster, more stable, and easier to use => [github pull request](https://github.com/didi/nightingale/pulls)
**Respecting, recognizing, and recording the work of every contributor** is the first guiding principle of the Nightingale open-source community. We advocate effective questioning, which not only respects the developer's time but also contributes to the accumulation of knowledge in the entire community
- Before asking a question, please first refer to the [FAQ](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq)
- We use [GitHub Discussions](https://github.com/ccfos/nightingale/discussions) as the communication forum. You can search and ask questions here.
- We also recommend that you join ours [Slack channel](https://n9e-talk.slack.com/) to exchange experiences with other Nightingale users.
## Who is using Nightingale
You can register your usage and share your experience by posting on **[Who is Using Nightingale](https://github.com/ccfos/nightingale/issues/897)**.
## Stargazers over time
[![Stargazers over time](https://api.star-history.com/svg?repos=ccfos/nightingale&type=Date)](https://star-history.com/#ccfos/nightingale&Date)
[![Stargazers over time](https://starchart.cc/ccfos/nightingale.svg)](https://starchart.cc/ccfos/nightingale)
## Contributors
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>
## 社区治理
[夜莺开源项目和社区治理架构(草案)](./doc/community-governance.md)
## License
[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)
[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)

View File

@@ -1,104 +0,0 @@
<p align="center">
<a href="https://github.com/ccfos/nightingale">
<img src="doc/img/nightingale_logo_h.png" alt="nightingale - cloud native monitoring" width="240" /></a>
</p>
<p align="center">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<a href="https://n9e.github.io">
<img alt="Docs" src="https://img.shields.io/badge/docs-get%20started-brightgreen"/></a>
<a href="https://hub.docker.com/u/flashcatcloud">
<img alt="Docker pulls" src="https://img.shields.io/docker/pulls/flashcatcloud/nightingale"/></a>
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ccfos/nightingale">
<img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues closed" src="https://img.shields.io/github/issues-closed/ccfos/nightingale">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/ccfos/nightingale">
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<a href="https://n9e-talk.slack.com/">
<img alt="GitHub contributors" src="https://img.shields.io/badge/join%20slack-%23n9e-brightgreen.svg"/></a>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
</p>
<p align="center">
An open-source cloud-native monitoring system that is <b>all-in-one</b> <br/>
<b>Out-of-the-box</b>, it integrates data collection, visualization, and monitoring alert <br/>
We recommend upgrading your <b>Prometheus + AlertManager + Grafana</b> combination to Nightingale!
</p>
[English](./README.md) | [中文](./README_ZH.md)
## Highlighted Features
- **Out-of-the-box**
- Supports multiple deployment methods such as **Docker, Helm Chart, and cloud services**, integrates data collection, monitoring, and alerting into one system, and comes with various monitoring dashboards, quick views, and alert rule templates. **It greatly reduces the construction cost, learning cost, and usage cost of cloud-native monitoring systems**.
- **Professional Alerting**
- Provides visual alert configuration and management, supports various alert rules, offers the ability to configure silence and subscription rules, supports multiple alert delivery channels, and has features such as alert self-healing and event management.
- **Cloud-Native**
- Quickly builds an enterprise-level cloud-native monitoring system through a turnkey approach, supports multiple collectors such as [Categraf](https://github.com/flashcatcloud/categraf), Telegraf, and Grafana-agent, supports multiple data sources such as Prometheus, VictoriaMetrics, M3DB, ElasticSearch, and Jaeger, and is compatible with importing Grafana dashboards. **It seamlessly integrates with the cloud-native ecosystem**.
- **High Performance and High Availability**
- Due to the multi-data-source management engine of Nightingale and its excellent architecture design, and utilizing a high-performance time-series database, it can handle data collection, storage, and alert analysis scenarios with billions of time-series data, saving a lot of costs.
- Nightingale components can be horizontally scaled with no single point of failure. It has been deployed in thousands of enterprises and tested in harsh production practices. Many leading Internet companies have used Nightingale for cluster machines with hundreds of nodes, processing billions of time-series data.
- **Flexible Extension and Centralized Management**
- Nightingale can be deployed on a 1-core 1G cloud host, deployed in a cluster of hundreds of machines, or run in Kubernetes. Time-series databases, alert engines, and other components can also be decentralized to various data centers and regions, balancing edge deployment with centralized management. **It solves the problem of data fragmentation and lack of unified views**.
#### If you are using Prometheus and have one or more of the following requirement scenarios, it is recommended that you upgrade to Nightingale:
- Multiple systems such as Prometheus, Alertmanager, Grafana, etc. are fragmented and lack a unified view and cannot be used out of the box;
- The way to manage Prometheus and Alertmanager by modifying configuration files has a big learning curve and is difficult to collaborate;
- Too much data to scale-up your Prometheus cluster;
- Multiple Prometheus clusters running in production environments, which faced high management and usage costs;
#### If you are using Zabbix and have the following scenarios, it is recommended that you upgrade to Nightingale:
- Monitoring too much data and wanting a better scalable solution;
- A high learning curve and a desire for better efficiency of collaborative use in a multi-person, multi-team model;
- Microservice and cloud-native architectures with variable monitoring data lifecycles and high monitoring data dimension bases, which are not easily adaptable to the Zabbix data model;
#### If you are using [open-falcon](https://github.com/open-falcon/falcon-plus), we recommend you to upgrade to Nightingale
- For more information about open-falcon and Nightingale, please refer to read [Ten features and trends of cloud-native monitoring](https://mp.weixin.qq.com/s?__biz=MzkzNjI5OTM5Nw==&mid=2247483738&idx=1&sn=e8bdbb974a2cd003c1abcc2b5405dd18&chksm=c2a19fb0f5d616a63185cd79277a79a6b80118ef2185890d0683d2bb20451bd9303c78d083c5#rd)。
## Getting Started
[English Doc](https://n9e.github.io/) | [中文文档](http://n9e.flashcat.cloud/)
## Screenshots
https://user-images.githubusercontent.com/792850/216888712-2565fcea-9df5-47bd-a49e-d60af9bd76e8.mp4
## Architecture
<img src="doc/img/arch-product.png" width="600">
Nightingale monitoring can receive monitoring data reported by various collectors (such as [Categraf](https://github.com/flashcatcloud/categraf) , telegraf, grafana-agent, Prometheus, etc.) and write them to various popular time-series databases (such as Prometheus, M3DB, VictoriaMetrics, Thanos, TDEngine, etc.). It provides configuration capabilities for alert rules, silence rules, and subscription rules, as well as the ability to view monitoring data. It also provides automatic alarm self-healing mechanisms (such as automatically calling back to a webhook address or executing a script after an alarm is triggered), and the ability to store and manage historical alarm events and view them in groups.
If the performance of a standalone time-series database (such as Prometheus) has bottlenecks or poor disaster recovery, we recommend using [VictoriaMetrics](https://github.com/VictoriaMetrics/VictoriaMetrics). The VictoriaMetrics architecture is relatively simple, has excellent performance, and is easy to deploy and maintain. The architecture diagram is as shown above. For more detailed documentation on VictoriaMetrics, please refer to its [official website](https://victoriametrics.com/).
**We welcome you to participate in the Nightingale open-source project and community in various ways, including but not limited to**
- Adding and improving documentation => [n9e.github.io](https://n9e.github.io/)
- Sharing your best practices and experience in using Nightingale monitoring => [Article sharing]((https://n9e.github.io/docs/prologue/share/))
- Submitting product suggestions => [github issue](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md)
- Submitting code to make Nightingale monitoring faster, more stable, and easier to use => [github pull request](https://github.com/didi/nightingale/pulls)
**Respecting, recognizing, and recording the work of every contributor** is the first guiding principle of the Nightingale open-source community. We advocate effective questioning, which not only respects the developer's time but also contributes to the accumulation of knowledge in the entire community
- Before asking a question, please first refer to the [FAQ](https://www.gitlink.org.cn/ccfos/nightingale/wiki/faq)
- We use [GitHub Discussions](https://github.com/ccfos/nightingale/discussions) as the communication forum. You can search and ask questions here.
- We also recommend that you join ours [Slack channel](https://n9e-talk.slack.com/) to exchange experiences with other Nightingale users.
## Who is using Nightingale
You can register your usage and share your experience by posting on **[Who is Using Nightingale](https://github.com/ccfos/nightingale/issues/897)**.
## Stargazers over time
[![Stargazers over time](https://starchart.cc/ccfos/nightingale.svg)](https://starchart.cc/ccfos/nightingale)
## Contributors
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>
## License
[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)

74
README_zh.md Normal file
View File

@@ -0,0 +1,74 @@
<p align="center">
<a href="https://github.com/ccfos/nightingale">
<img src="doc/img/nightingale_logo_h.png" alt="nightingale - cloud native monitoring" width="240" /></a>
</p>
<p align="center">
<a href="https://flashcat.cloud/docs/">
<img alt="Docs" src="https://img.shields.io/badge/docs-get%20started-brightgreen"/></a>
<a href="https://hub.docker.com/u/flashcatcloud">
<img alt="Docker pulls" src="https://img.shields.io/docker/pulls/flashcatcloud/nightingale"/></a>
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img alt="GitHub contributors" src="https://img.shields.io/github/contributors-anon/ccfos/nightingale"/></a>
<img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/ccfos/nightingale">
<br/><img alt="GitHub Repo issues" src="https://img.shields.io/github/issues/ccfos/nightingale">
<img alt="GitHub Repo issues closed" src="https://img.shields.io/github/issues-closed/ccfos/nightingale">
<img alt="GitHub forks" src="https://img.shields.io/github/forks/ccfos/nightingale">
<img alt="GitHub latest release" src="https://img.shields.io/github/v/release/ccfos/nightingale"/>
<img alt="License" src="https://img.shields.io/badge/license-Apache--2.0-blue"/>
<a href="https://n9e-talk.slack.com/">
<img alt="GitHub contributors" src="https://img.shields.io/badge/join%20slack-%23n9e-brightgreen.svg"/></a>
</p>
<p align="center">
告警管理专家,一体化的开源可观测平台
</p>
[English](./README.md) | [中文](./README_zh.md)
夜莺Nightingale是中国计算机学会托管的开源云原生可观测工具最早由滴滴于 2020 年孵化并开源,并于 2022 年正式捐赠予中国计算机学会。夜莺采用 All-in-One 的设计理念,集数据采集、可视化、监控告警、数据分析于一体,与云原生生态紧密集成,融入了顶级互联网公司可观测性最佳实践,沉淀了众多社区专家经验,开箱即用。
## 资料
- 文档:[flashcat.cloud/docs](https://flashcat.cloud/docs/)
- 提问:[answer.flashcat.cloud](https://answer.flashcat.cloud/)
- 报Bug[github.com/ccfos/nightingale/issues](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml)
## 功能和特点
- 统一接入各种时序库:支持对接 Prometheus、VictoriaMetrics、Thanos、Mimir、M3DB 等多种时序库,实现统一告警管理
- 专业告警能力:内置支持多种告警规则,可以扩展支持所有通知媒介,支持告警屏蔽、告警抑制、告警自愈、告警事件管理
- 高性能可视化引擎支持多种图表样式内置众多Dashboard模版也可导入Grafana模版开箱即用开源协议商业友好
- 无缝搭配 [Flashduty](https://flashcat.cloud/product/flashcat-duty/)实现告警聚合收敛、认领、升级、排班、IM集成确保告警处理不遗漏减少打扰更好协同
- 支持所有常见采集器:支持 [Categraf](https://flashcat.cloud/product/categraf)、telegraf、grafana-agent、datadog-agent、各种 exporter 作为采集器,没有什么数据是不能监控的
- 一体化观测平台:从 v6 版本开始,支持接入 ElasticSearch、Jaeger 数据源,实现日志、链路、指标多维度的统一可观测
## 产品演示
![演示](doc/img/n9e-screenshot-gif-v6.gif)
## 部署架构
![架构](doc/img/n9e-arch-latest.png)
## 加入交流群
欢迎加入 QQ 交流群群号479290895QQ 群适合群友互助,夜莺研发人员通常不在群里。如果要报 bug 请到[这里](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&projects=&template=bug_report.yml),提问到[这里](https://answer.flashcat.cloud/)。
## Stargazers over time
[![Stargazers over time](https://api.star-history.com/svg?repos=ccfos/nightingale&type=Date)](https://star-history.com/#ccfos/nightingale&Date)
## Contributors
<a href="https://github.com/ccfos/nightingale/graphs/contributors">
<img src="https://contrib.rocks/image?repo=ccfos/nightingale" />
</a>
## 社区治理
[夜莺开源项目和社区治理架构(草案)](./doc/community-governance.md)
## License
[Apache License V2.0](https://github.com/didi/nightingale/blob/main/LICENSE)

View File

@@ -2,8 +2,6 @@ package aconf
import (
"path"
"github.com/toolkits/pkg/runner"
)
type Alert struct {
@@ -55,9 +53,9 @@ type Ibex struct {
Timeout int64
}
func (a *Alert) PreCheck() {
func (a *Alert) PreCheck(configDir string) {
if a.Alerting.TemplatesDir == "" {
a.Alerting.TemplatesDir = path.Join(runner.Cwd, "etc", "template")
a.Alerting.TemplatesDir = path.Join(configDir, "template")
}
if a.Alerting.NotifyConcurrency == 0 {

View File

@@ -24,6 +24,7 @@ import (
"github.com/ccfos/nightingale/v6/prom"
"github.com/ccfos/nightingale/v6/pushgw/pconf"
"github.com/ccfos/nightingale/v6/pushgw/writer"
"github.com/ccfos/nightingale/v6/tdengine"
)
func Initialize(configDir string, cryptoKey string) (func(), error) {
@@ -42,20 +43,22 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
syncStats := memsto.NewSyncStats()
alertStats := astats.NewSyncStats()
configCache := memsto.NewConfigCache(ctx, syncStats, nil, "")
targetCache := memsto.NewTargetCache(ctx, syncStats, nil)
busiGroupCache := memsto.NewBusiGroupCache(ctx, syncStats)
alertMuteCache := memsto.NewAlertMuteCache(ctx, syncStats)
alertRuleCache := memsto.NewAlertRuleCache(ctx, syncStats)
notifyConfigCache := memsto.NewNotifyConfigCache(ctx)
notifyConfigCache := memsto.NewNotifyConfigCache(ctx, configCache)
dsCache := memsto.NewDatasourceCache(ctx, syncStats)
userCache := memsto.NewUserCache(ctx, syncStats)
userGroupCache := memsto.NewUserGroupCache(ctx, syncStats)
promClients := prom.NewPromClient(ctx, config.Alert.Heartbeat)
tdengineClients := tdengine.NewTdengineClient(ctx, config.Alert.Heartbeat)
externalProcessors := process.NewExternalProcessors()
Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, userCache, userGroupCache)
Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, tdengineClients, userCache, userGroupCache)
r := httpx.GinEngine(config.Global.RunMode, config.HTTP)
rt := router.New(config.HTTP, config.Alert, alertMuteCache, targetCache, busiGroupCache, alertStats, ctx, externalProcessors)
@@ -71,7 +74,8 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
}
func Start(alertc aconf.Alert, pushgwc pconf.Pushgw, syncStats *memsto.Stats, alertStats *astats.Stats, externalProcessors *process.ExternalProcessorsType, targetCache *memsto.TargetCacheType, busiGroupCache *memsto.BusiGroupCacheType,
alertMuteCache *memsto.AlertMuteCacheType, alertRuleCache *memsto.AlertRuleCacheType, notifyConfigCache *memsto.NotifyConfigCacheType, datasourceCache *memsto.DatasourceCacheType, ctx *ctx.Context, promClients *prom.PromClientMap, userCache *memsto.UserCacheType, userGroupCache *memsto.UserGroupCacheType) {
alertMuteCache *memsto.AlertMuteCacheType, alertRuleCache *memsto.AlertRuleCacheType, notifyConfigCache *memsto.NotifyConfigCacheType, datasourceCache *memsto.DatasourceCacheType, ctx *ctx.Context,
promClients *prom.PromClientMap, tdendgineClients *tdengine.TdengineClientMap, userCache *memsto.UserCacheType, userGroupCache *memsto.UserGroupCacheType) {
alertSubscribeCache := memsto.NewAlertSubscribeCache(ctx, syncStats)
recordingRuleCache := memsto.NewRecordingRuleCache(ctx, syncStats)
@@ -82,14 +86,14 @@ func Start(alertc aconf.Alert, pushgwc pconf.Pushgw, syncStats *memsto.Stats, al
writers := writer.NewWriters(pushgwc)
record.NewScheduler(alertc, recordingRuleCache, promClients, writers, alertStats)
eval.NewScheduler(alertc, externalProcessors, alertRuleCache, targetCache, busiGroupCache, alertMuteCache, datasourceCache, promClients, naming, ctx, alertStats)
eval.NewScheduler(alertc, externalProcessors, alertRuleCache, targetCache, busiGroupCache, alertMuteCache, datasourceCache, promClients, tdendgineClients, naming, ctx, alertStats)
dp := dispatch.NewDispatch(alertRuleCache, userCache, userGroupCache, alertSubscribeCache, targetCache, notifyConfigCache, alertc.Alerting, ctx)
dp := dispatch.NewDispatch(alertRuleCache, userCache, userGroupCache, alertSubscribeCache, targetCache, notifyConfigCache, alertc.Alerting, ctx, alertStats)
consumer := dispatch.NewConsumer(alertc.Alerting, ctx, dp)
go dp.ReloadTpls()
go consumer.LoopConsume()
go queue.ReportQueueSize(alertStats)
go sender.StartEmailSender(notifyConfigCache.GetSMTP()) // todo
go sender.InitEmailSender(notifyConfigCache.GetSMTP())
}

View File

@@ -10,22 +10,52 @@ const (
)
type Stats struct {
CounterSampleTotal *prometheus.CounterVec
CounterAlertsTotal *prometheus.CounterVec
GaugeAlertQueueSize prometheus.Gauge
GaugeSampleQueueSize *prometheus.GaugeVec
RequestDuration *prometheus.HistogramVec
ForwardDuration *prometheus.HistogramVec
AlertNotifyTotal *prometheus.CounterVec
AlertNotifyErrorTotal *prometheus.CounterVec
CounterAlertsTotal *prometheus.CounterVec
GaugeAlertQueueSize prometheus.Gauge
CounterRuleEval *prometheus.CounterVec
CounterQueryDataErrorTotal *prometheus.CounterVec
CounterRecordEval *prometheus.CounterVec
CounterRecordEvalErrorTotal *prometheus.CounterVec
CounterMuteTotal *prometheus.CounterVec
}
func NewSyncStats() *Stats {
// 从各个接收接口接收到的监控数据总量
CounterSampleTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
CounterRuleEval := prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "samples_received_total",
Help: "Total number samples received.",
}, []string{"cluster", "channel"})
Name: "rule_eval_total",
Help: "Number of rule eval.",
}, []string{})
CounterRecordEval := prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "record_eval_total",
Help: "Number of record eval.",
}, []string{})
CounterRecordEvalErrorTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "record_eval_error_total",
Help: "Number of record eval error.",
}, []string{})
AlertNotifyTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "alert_notify_total",
Help: "Number of send msg.",
}, []string{"channel"})
AlertNotifyErrorTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "alert_notify_error_total",
Help: "Number of send msg.",
}, []string{"channel"})
// 产生的告警总量
CounterAlertsTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
@@ -33,7 +63,7 @@ func NewSyncStats() *Stats {
Subsystem: subsystem,
Name: "alerts_total",
Help: "Total number alert events.",
}, []string{"cluster"})
}, []string{"cluster", "type", "busi_group"})
// 内存中的告警事件队列的长度
GaugeAlertQueueSize := prometheus.NewGauge(prometheus.GaugeOpts{
@@ -43,51 +73,41 @@ func NewSyncStats() *Stats {
Help: "The size of alert queue.",
})
// 数据转发队列,各个队列的长度
GaugeSampleQueueSize := prometheus.NewGaugeVec(prometheus.GaugeOpts{
CounterQueryDataErrorTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "sample_queue_size",
Help: "The size of sample queue.",
}, []string{"cluster", "channel_number"})
Name: "query_data_error_total",
Help: "Number of query data error.",
}, []string{"datasource"})
// 一些重要的请求,比如接收数据的请求,应该统计一下延迟情况
RequestDuration := prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Namespace: namespace,
Subsystem: subsystem,
Buckets: []float64{.01, .1, 1},
Name: "http_request_duration_seconds",
Help: "HTTP request latencies in seconds.",
}, []string{"code", "path", "method"},
)
// 发往后端TSDB延迟如何
ForwardDuration := prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Namespace: namespace,
Subsystem: subsystem,
Buckets: []float64{.1, 1, 10},
Name: "forward_duration_seconds",
Help: "Forward samples to TSDB. latencies in seconds.",
}, []string{"cluster", "channel_number"},
)
CounterMuteTotal := prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: namespace,
Subsystem: subsystem,
Name: "mute_total",
Help: "Number of mute.",
}, []string{"group"})
prometheus.MustRegister(
CounterSampleTotal,
CounterAlertsTotal,
GaugeAlertQueueSize,
GaugeSampleQueueSize,
RequestDuration,
ForwardDuration,
AlertNotifyTotal,
AlertNotifyErrorTotal,
CounterRuleEval,
CounterQueryDataErrorTotal,
CounterRecordEval,
CounterRecordEvalErrorTotal,
CounterMuteTotal,
)
return &Stats{
CounterSampleTotal: CounterSampleTotal,
CounterAlertsTotal: CounterAlertsTotal,
GaugeAlertQueueSize: GaugeAlertQueueSize,
GaugeSampleQueueSize: GaugeSampleQueueSize,
RequestDuration: RequestDuration,
ForwardDuration: ForwardDuration,
CounterAlertsTotal: CounterAlertsTotal,
GaugeAlertQueueSize: GaugeAlertQueueSize,
AlertNotifyTotal: AlertNotifyTotal,
AlertNotifyErrorTotal: AlertNotifyErrorTotal,
CounterRuleEval: CounterRuleEval,
CounterQueryDataErrorTotal: CounterQueryDataErrorTotal,
CounterRecordEval: CounterRecordEval,
CounterRecordEvalErrorTotal: CounterRecordEvalErrorTotal,
CounterMuteTotal: CounterMuteTotal,
}
}

View File

@@ -22,6 +22,14 @@ func MatchTags(eventTagsMap map[string]string, itags []models.TagFilter) bool {
}
return true
}
func MatchGroupsName(groupName string, groupFilter []models.TagFilter) bool {
for _, filter := range groupFilter {
if !matchTag(groupName, filter) {
return false
}
}
return true
}
func matchTag(value string, filter models.TagFilter) bool {
switch filter.Func {

View File

@@ -61,6 +61,13 @@ func (e *Consumer) consume(events []interface{}, sema *semaphore.Semaphore) {
func (e *Consumer) consumeOne(event *models.AlertCurEvent) {
LogEvent(event, "consume")
eventType := "alert"
if event.IsRecovered {
eventType = "recovery"
}
e.dispatch.astats.CounterAlertsTotal.WithLabelValues(event.Cluster, eventType, event.GroupName).Inc()
if err := event.ParseRule("rule_name"); err != nil {
event.RuleName = fmt.Sprintf("failed to parse rule name: %v", err)
}
@@ -89,9 +96,10 @@ func (e *Consumer) persist(event *models.AlertCurEvent) {
if !e.ctx.IsCenter {
event.DB2FE()
err := poster.PostByUrls(e.ctx, "/v1/n9e/event-persist", event)
var err error
event.Id, err = poster.PostByUrlsWithResp[int64](e.ctx, "/v1/n9e/event-persist", event)
if err != nil {
logger.Errorf("event%+v persist err:%v", event, err)
logger.Errorf("event:%+v persist err:%v", event, err)
}
return
}

View File

@@ -9,6 +9,7 @@ import (
"time"
"github.com/ccfos/nightingale/v6/alert/aconf"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/alert/common"
"github.com/ccfos/nightingale/v6/alert/sender"
"github.com/ccfos/nightingale/v6/memsto"
@@ -33,7 +34,8 @@ type Dispatch struct {
ExtraSenders map[string]sender.Sender
BeforeSenderHook func(*models.AlertCurEvent) bool
ctx *ctx.Context
ctx *ctx.Context
astats *astats.Stats
RwLock sync.RWMutex
}
@@ -41,7 +43,7 @@ type Dispatch struct {
// 创建一个 Notify 实例
func NewDispatch(alertRuleCache *memsto.AlertRuleCacheType, userCache *memsto.UserCacheType, userGroupCache *memsto.UserGroupCacheType,
alertSubscribeCache *memsto.AlertSubscribeCacheType, targetCache *memsto.TargetCacheType, notifyConfigCache *memsto.NotifyConfigCacheType,
alerting aconf.Alerting, ctx *ctx.Context) *Dispatch {
alerting aconf.Alerting, ctx *ctx.Context, astats *astats.Stats) *Dispatch {
notify := &Dispatch{
alertRuleCache: alertRuleCache,
userCache: userCache,
@@ -57,7 +59,8 @@ func NewDispatch(alertRuleCache *memsto.AlertRuleCacheType, userCache *memsto.Us
ExtraSenders: make(map[string]sender.Sender),
BeforeSenderHook: func(*models.AlertCurEvent) bool { return true },
ctx: ctx,
ctx: ctx,
astats: astats,
}
return notify
}
@@ -86,17 +89,17 @@ func (e *Dispatch) relaodTpls() error {
senders := map[string]sender.Sender{
models.Email: sender.NewSender(models.Email, tmpTpls, smtp),
models.Dingtalk: sender.NewSender(models.Dingtalk, tmpTpls, smtp),
models.Wecom: sender.NewSender(models.Wecom, tmpTpls, smtp),
models.Feishu: sender.NewSender(models.Feishu, tmpTpls, smtp),
models.Mm: sender.NewSender(models.Mm, tmpTpls, smtp),
models.Telegram: sender.NewSender(models.Telegram, tmpTpls, smtp),
models.FeishuCard: sender.NewSender(models.FeishuCard, tmpTpls, smtp),
models.Dingtalk: sender.NewSender(models.Dingtalk, tmpTpls),
models.Wecom: sender.NewSender(models.Wecom, tmpTpls),
models.Feishu: sender.NewSender(models.Feishu, tmpTpls),
models.Mm: sender.NewSender(models.Mm, tmpTpls),
models.Telegram: sender.NewSender(models.Telegram, tmpTpls),
models.FeishuCard: sender.NewSender(models.FeishuCard, tmpTpls),
}
e.RwLock.RLock()
for channel, sender := range e.ExtraSenders {
senders[channel] = sender
for channelName, extraSender := range e.ExtraSenders {
senders[channelName] = extraSender
}
e.RwLock.RUnlock()
@@ -170,12 +173,25 @@ func (e *Dispatch) handleSubs(event *models.AlertCurEvent) {
// handleSub 处理订阅规则的event,注意这里event要使用值传递,因为后面会修改event的状态
func (e *Dispatch) handleSub(sub *models.AlertSubscribe, event models.AlertCurEvent) {
if sub.IsDisabled() || !sub.MatchCluster(event.DatasourceId) {
if sub.IsDisabled() {
return
}
if !sub.MatchCluster(event.DatasourceId) {
return
}
if !sub.MatchProd(event.RuleProd) {
return
}
if !common.MatchTags(event.TagsMap, sub.ITags) {
return
}
// event BusiGroups filter
if !common.MatchGroupsName(event.GroupName, sub.IBusiGroups) {
return
}
if sub.ForDuration > (event.TriggerTime - event.FirstTriggerTime) {
return
}
@@ -204,7 +220,7 @@ func (e *Dispatch) Send(rule *models.AlertRule, event *models.AlertCurEvent, not
needSend := e.BeforeSenderHook(event)
if needSend {
for channel, uids := range notifyTarget.ToChannelUserMap() {
ctx := sender.BuildMessageContext(rule, []*models.AlertCurEvent{event}, uids, e.userCache)
msgCtx := sender.BuildMessageContext(rule, []*models.AlertCurEvent{event}, uids, e.userCache, e.astats)
e.RwLock.RLock()
s := e.Senders[channel]
e.RwLock.RUnlock()
@@ -212,18 +228,18 @@ func (e *Dispatch) Send(rule *models.AlertRule, event *models.AlertCurEvent, not
logger.Debugf("no sender for channel: %s", channel)
continue
}
s.Send(ctx)
s.Send(msgCtx)
}
}
// handle event callbacks
sender.SendCallbacks(e.ctx, notifyTarget.ToCallbackList(), event, e.targetCache, e.userCache, e.notifyConfigCache.GetIbex())
sender.SendCallbacks(e.ctx, notifyTarget.ToCallbackList(), event, e.targetCache, e.userCache, e.notifyConfigCache.GetIbex(), e.astats)
// handle global webhooks
sender.SendWebhooks(notifyTarget.ToWebhookList(), event)
sender.SendWebhooks(notifyTarget.ToWebhookList(), event, e.astats)
// handle plugin call
go sender.MayPluginNotify(e.genNoticeBytes(event), e.notifyConfigCache.GetNotifyScript())
go sender.MayPluginNotify(e.genNoticeBytes(event), e.notifyConfigCache.GetNotifyScript(), e.astats)
}
type Notice struct {

View File

@@ -12,6 +12,8 @@ import (
"github.com/ccfos/nightingale/v6/memsto"
"github.com/ccfos/nightingale/v6/pkg/ctx"
"github.com/ccfos/nightingale/v6/prom"
"github.com/ccfos/nightingale/v6/tdengine"
"github.com/toolkits/pkg/logger"
)
@@ -29,7 +31,8 @@ type Scheduler struct {
alertMuteCache *memsto.AlertMuteCacheType
datasourceCache *memsto.DatasourceCacheType
promClients *prom.PromClientMap
promClients *prom.PromClientMap
tdengineClients *tdengine.TdengineClientMap
naming *naming.Naming
@@ -38,8 +41,8 @@ type Scheduler struct {
}
func NewScheduler(aconf aconf.Alert, externalProcessors *process.ExternalProcessorsType, arc *memsto.AlertRuleCacheType, targetCache *memsto.TargetCacheType,
busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType, promClients *prom.PromClientMap, naming *naming.Naming,
ctx *ctx.Context, stats *astats.Stats) *Scheduler {
busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType,
promClients *prom.PromClientMap, tdengineClients *tdengine.TdengineClientMap, naming *naming.Naming, ctx *ctx.Context, stats *astats.Stats) *Scheduler {
scheduler := &Scheduler{
aconf: aconf,
alertRules: make(map[string]*AlertRuleWorker),
@@ -52,8 +55,9 @@ func NewScheduler(aconf aconf.Alert, externalProcessors *process.ExternalProcess
alertMuteCache: alertMuteCache,
datasourceCache: datasourceCache,
promClients: promClients,
naming: naming,
promClients: promClients,
tdengineClients: tdengineClients,
naming: naming,
ctx: ctx,
stats: stats,
@@ -85,8 +89,11 @@ func (s *Scheduler) syncAlertRules() {
if rule == nil {
continue
}
if rule.IsPrometheusRule() {
ruleType := rule.GetRuleType()
if rule.IsPrometheusRule() || rule.IsLokiRule() || rule.IsTdengineRule() {
datasourceIds := s.promClients.Hit(rule.DatasourceIdsJson)
datasourceIds = append(datasourceIds, s.tdengineClients.Hit(rule.DatasourceIdsJson)...)
for _, dsId := range datasourceIds {
if !naming.DatasourceHashRing.IsHit(dsId, fmt.Sprintf("%d", rule.Id), s.aconf.Heartbeat.Endpoint) {
continue
@@ -97,13 +104,18 @@ func (s *Scheduler) syncAlertRules() {
continue
}
if ds.PluginType != ruleType {
logger.Debugf("datasource %d category is %s not %s", dsId, ds.PluginType, ruleType)
continue
}
if ds.Status != "enabled" {
logger.Debugf("datasource %d status is %s", dsId, ds.Status)
continue
}
processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.promClients, s.ctx, s.stats)
processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.ctx, s.stats)
alertRule := NewAlertRuleWorker(rule, dsId, processor, s.promClients, s.ctx)
alertRule := NewAlertRuleWorker(rule, dsId, processor, s.promClients, s.tdengineClients, s.ctx)
alertRuleWorkers[alertRule.Hash()] = alertRule
}
} else if rule.IsHostRule() && s.ctx.IsCenter {
@@ -111,8 +123,8 @@ func (s *Scheduler) syncAlertRules() {
if !naming.DatasourceHashRing.IsHit(naming.HostDatasource, fmt.Sprintf("%d", rule.Id), s.aconf.Heartbeat.Endpoint) {
continue
}
processor := process.NewProcessor(rule, 0, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.promClients, s.ctx, s.stats)
alertRule := NewAlertRuleWorker(rule, 0, processor, s.promClients, s.ctx)
processor := process.NewProcessor(rule, 0, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.ctx, s.stats)
alertRule := NewAlertRuleWorker(rule, 0, processor, s.promClients, s.tdengineClients, s.ctx)
alertRuleWorkers[alertRule.Hash()] = alertRule
} else {
// 如果 rule 不是通过 prometheus engine 来告警的,则创建为 externalRule
@@ -128,7 +140,7 @@ func (s *Scheduler) syncAlertRules() {
logger.Debugf("datasource %d status is %s", dsId, ds.Status)
continue
}
processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.promClients, s.ctx, s.stats)
processor := process.NewProcessor(rule, dsId, s.alertRuleCache, s.targetCache, s.busiGroupCache, s.alertMuteCache, s.datasourceCache, s.ctx, s.stats)
externalRuleWorkers[processor.Key()] = processor
}
}

View File

@@ -11,8 +11,11 @@ import (
"github.com/ccfos/nightingale/v6/alert/process"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/ctx"
"github.com/ccfos/nightingale/v6/pkg/hash"
"github.com/ccfos/nightingale/v6/pkg/parser"
promsdk "github.com/ccfos/nightingale/v6/pkg/prom"
"github.com/ccfos/nightingale/v6/prom"
"github.com/ccfos/nightingale/v6/tdengine"
"github.com/toolkits/pkg/logger"
"github.com/toolkits/pkg/str"
@@ -28,19 +31,21 @@ type AlertRuleWorker struct {
processor *process.Processor
promClients *prom.PromClientMap
ctx *ctx.Context
promClients *prom.PromClientMap
tdengineClients *tdengine.TdengineClientMap
ctx *ctx.Context
}
func NewAlertRuleWorker(rule *models.AlertRule, datasourceId int64, processor *process.Processor, promClients *prom.PromClientMap, ctx *ctx.Context) *AlertRuleWorker {
func NewAlertRuleWorker(rule *models.AlertRule, datasourceId int64, processor *process.Processor, promClients *prom.PromClientMap, tdengineClients *tdengine.TdengineClientMap, ctx *ctx.Context) *AlertRuleWorker {
arw := &AlertRuleWorker{
datasourceId: datasourceId,
quit: make(chan struct{}),
rule: rule,
processor: processor,
promClients: promClients,
ctx: ctx,
promClients: promClients,
tdengineClients: tdengineClients,
ctx: ctx,
}
return arw
@@ -87,17 +92,23 @@ func (arw *AlertRuleWorker) Start() {
func (arw *AlertRuleWorker) Eval() {
cachedRule := arw.rule
if cachedRule == nil {
//logger.Errorf("rule_eval:%s rule not found", arw.Key())
// logger.Errorf("rule_eval:%s rule not found", arw.Key())
return
}
arw.processor.Stats.CounterRuleEval.WithLabelValues().Inc()
typ := cachedRule.GetRuleType()
var lst []common.AnomalyPoint
var anomalyPoints []common.AnomalyPoint
var recoverPoints []common.AnomalyPoint
switch typ {
case models.PROMETHEUS:
lst = arw.GetPromAnomalyPoint(cachedRule.RuleConfig)
anomalyPoints = arw.GetPromAnomalyPoint(cachedRule.RuleConfig)
case models.HOST:
lst = arw.GetHostAnomalyPoint(cachedRule.RuleConfig)
anomalyPoints = arw.GetHostAnomalyPoint(cachedRule.RuleConfig)
case models.TDENGINE:
anomalyPoints, recoverPoints = arw.GetTdengineAnomalyPoint(cachedRule, arw.processor.DatasourceId())
case models.LOKI:
anomalyPoints = arw.GetPromAnomalyPoint(cachedRule.RuleConfig)
default:
return
}
@@ -107,7 +118,11 @@ func (arw *AlertRuleWorker) Eval() {
return
}
arw.processor.Handle(lst, "inner", arw.inhibit)
arw.processor.Handle(anomalyPoints, "inner", arw.inhibit)
for _, point := range recoverPoints {
str := fmt.Sprintf("%v", point.Value)
arw.processor.RecoverSingle(process.Hash(cachedRule.Id, arw.processor.DatasourceId(), point), point.Timestamp, &str)
}
}
func (arw *AlertRuleWorker) Stop() {
@@ -153,11 +168,13 @@ func (arw *AlertRuleWorker) GetPromAnomalyPoint(ruleConfig string) []common.Anom
value, warnings, err := readerClient.Query(context.Background(), promql, time.Now())
if err != nil {
logger.Errorf("rule_eval:%s promql:%s, error:%v", arw.Key(), promql, err)
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
continue
}
if len(warnings) > 0 {
logger.Errorf("rule_eval:%s promql:%s, warnings:%v", arw.Key(), promql, warnings)
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
continue
}
@@ -172,6 +189,111 @@ func (arw *AlertRuleWorker) GetPromAnomalyPoint(ruleConfig string) []common.Anom
return lst
}
func (arw *AlertRuleWorker) GetTdengineAnomalyPoint(rule *models.AlertRule, dsId int64) ([]common.AnomalyPoint, []common.AnomalyPoint) {
// 获取查询和规则判断条件
points := []common.AnomalyPoint{}
recoverPoints := []common.AnomalyPoint{}
ruleConfig := strings.TrimSpace(rule.RuleConfig)
if ruleConfig == "" {
logger.Warningf("rule_eval:%d promql is blank", rule.Id)
return points, recoverPoints
}
var ruleQuery models.RuleQuery
err := json.Unmarshal([]byte(ruleConfig), &ruleQuery)
if err != nil {
logger.Warningf("rule_eval:%d promql parse error:%s", rule.Id, err.Error())
return points, recoverPoints
}
arw.inhibit = ruleQuery.Inhibit
if len(ruleQuery.Queries) > 0 {
seriesStore := make(map[uint64]*models.DataResp)
seriesTagIndex := make(map[uint64][]uint64)
for _, query := range ruleQuery.Queries {
cli := arw.tdengineClients.GetCli(dsId)
if cli == nil {
logger.Warningf("rule_eval:%d tdengine client is nil", rule.Id)
continue
}
series, err := cli.Query(query)
if err != nil {
logger.Warningf("rule_eval rid:%d query data error: %v", rule.Id, err)
continue
}
// 此条日志很重要,是告警判断的现场值
logger.Debugf("rule_eval rid:%d req:%+v resp:%+v", rule.Id, query, series)
for i := 0; i < len(series); i++ {
serieHash := hash.GetHash(series[i].Metric, series[i].Ref)
tagHash := hash.GetTagHash(series[i].Metric)
seriesStore[serieHash] = series[i]
// 将曲线按照相同的 tag 分组
if _, exists := seriesTagIndex[tagHash]; !exists {
seriesTagIndex[tagHash] = make([]uint64, 0)
}
seriesTagIndex[tagHash] = append(seriesTagIndex[tagHash], serieHash)
}
}
// 判断
for _, trigger := range ruleQuery.Triggers {
for _, seriesHash := range seriesTagIndex {
m := make(map[string]float64)
var ts int64
var sample *models.DataResp
var value float64
for _, serieHash := range seriesHash {
series, exists := seriesStore[serieHash]
if !exists {
logger.Warningf("rule_eval rid:%d series:%+v not found", rule.Id, series)
continue
}
t, v, exists := series.Last()
if !exists {
logger.Warningf("rule_eval rid:%d series:%+v value not found", rule.Id, series)
continue
}
if !strings.Contains(trigger.Exp, "$"+series.Ref) {
// 表达式中不包含该变量
continue
}
m["$"+series.Ref] = v
m["$"+series.Ref+"."+series.MetricName()] = v
ts = int64(t)
sample = series
value = v
}
isTriggered := parser.Calc(trigger.Exp, m)
// 此条日志很重要,是告警判断的现场值
logger.Debugf("rule_eval rid:%d trigger:%+v exp:%s res:%v m:%v", rule.Id, trigger, trigger.Exp, isTriggered, m)
point := common.AnomalyPoint{
Key: sample.MetricName(),
Labels: sample.Metric,
Timestamp: int64(ts),
Value: value,
Severity: trigger.Severity,
Triggered: isTriggered,
}
if isTriggered {
points = append(points, point)
} else {
recoverPoints = append(recoverPoints, point)
}
}
}
}
return points, recoverPoints
}
func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.AnomalyPoint {
var lst []common.AnomalyPoint
var severity int
@@ -201,6 +323,7 @@ func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.Anom
targets, err := models.MissTargetGetsByFilter(arw.ctx, query, t)
if err != nil {
logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
continue
}
for _, target := range targets {
@@ -222,6 +345,7 @@ func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.Anom
targets, err := models.TargetGetsByFilter(arw.ctx, query, 0, 0)
if err != nil {
logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
continue
}
var targetMap = make(map[string]*models.Target)
@@ -253,12 +377,14 @@ func (arw *AlertRuleWorker) GetHostAnomalyPoint(ruleConfig string) []common.Anom
count, err := models.MissTargetCountByFilter(arw.ctx, query, t)
if err != nil {
logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
continue
}
total, err := models.TargetCountByFilter(arw.ctx, query)
if err != nil {
logger.Errorf("rule_eval:%s query:%v, error:%v", arw.Key(), query, err)
arw.processor.Stats.CounterQueryDataErrorTotal.WithLabelValues(fmt.Sprintf("%d", arw.datasourceId)).Inc()
continue
}
pct := float64(count) / float64(total) * 100

View File

@@ -54,7 +54,7 @@ func (chr *DatasourceHashRingType) IsHit(datasourceId int64, pk string, currentN
node, err := chr.GetNode(datasourceId, pk)
if err != nil {
if !errors.Is(err, consistent.ErrEmptyCircle) {
logger.Debugf("rule id:%s is not work, datasource id:%d failed to get node from hashring:%v", pk, datasourceId, err)
logger.Errorf("rule id:%s is not work, datasource id:%d failed to get node from hashring:%v", pk, datasourceId, err)
}
return false
}

View File

@@ -69,7 +69,7 @@ type Processor struct {
promClients *prom.PromClientMap
ctx *ctx.Context
stats *astats.Stats
Stats *astats.Stats
HandleFireEventHook HandleEventFunc
HandleRecoverEventHook HandleEventFunc
@@ -94,7 +94,7 @@ func (p *Processor) Hash() string {
}
func NewProcessor(rule *models.AlertRule, datasourceId int64, atertRuleCache *memsto.AlertRuleCacheType, targetCache *memsto.TargetCacheType,
busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType, promClients *prom.PromClientMap, ctx *ctx.Context,
busiGroupCache *memsto.BusiGroupCacheType, alertMuteCache *memsto.AlertMuteCacheType, datasourceCache *memsto.DatasourceCacheType, ctx *ctx.Context,
stats *astats.Stats) *Processor {
p := &Processor{
@@ -107,9 +107,8 @@ func NewProcessor(rule *models.AlertRule, datasourceId int64, atertRuleCache *me
atertRuleCache: atertRuleCache,
datasourceCache: datasourceCache,
promClients: promClients,
ctx: ctx,
stats: stats,
ctx: ctx,
Stats: stats,
HandleFireEventHook: func(event *models.AlertCurEvent) {},
HandleRecoverEventHook: func(event *models.AlertCurEvent) {},
@@ -142,6 +141,7 @@ func (p *Processor) Handle(anomalyPoints []common.AnomalyPoint, from string, inh
hash := event.Hash
alertingKeys[hash] = struct{}{}
if mute.IsMuted(cachedRule, event, p.TargetCache, p.alertMuteCache) {
p.Stats.CounterMuteTotal.WithLabelValues(event.GroupName).Inc()
logger.Debugf("rule_eval:%s event:%v is muted", p.Key(), event)
continue
}
@@ -350,7 +350,6 @@ func (p *Processor) pushEventToQueue(e *models.AlertCurEvent) {
p.fires.Set(e.Hash, e)
}
p.stats.CounterAlertsTotal.WithLabelValues(fmt.Sprintf("%d", e.DatasourceId)).Inc()
dispatch.LogEvent(e, "push_queue")
if !queue.EventQueue.PushFront(e) {
logger.Warningf("event_push_queue: queue is full, event:%+v", e)
@@ -428,7 +427,13 @@ func (p *Processor) mayHandleIdent() {
if target, exists := p.TargetCache.Get(ident); exists {
p.target = target.Ident
p.targetNote = target.Note
} else {
p.target = ident
p.targetNote = ""
}
} else {
p.target = ""
p.targetNote = ""
}
}

View File

@@ -6,6 +6,7 @@ import (
"strings"
"time"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/prom"
"github.com/ccfos/nightingale/v6/pushgw/writer"
@@ -18,18 +19,18 @@ type RecordRuleContext struct {
datasourceId int64
quit chan struct{}
rule *models.RecordingRule
// writers *writer.WritersType
rule *models.RecordingRule
promClients *prom.PromClientMap
stats *astats.Stats
}
func NewRecordRuleContext(rule *models.RecordingRule, datasourceId int64, promClients *prom.PromClientMap, writers *writer.WritersType) *RecordRuleContext {
func NewRecordRuleContext(rule *models.RecordingRule, datasourceId int64, promClients *prom.PromClientMap, writers *writer.WritersType, stats *astats.Stats) *RecordRuleContext {
return &RecordRuleContext{
datasourceId: datasourceId,
quit: make(chan struct{}),
rule: rule,
promClients: promClients,
//writers: writers,
stats: stats,
}
}
@@ -70,6 +71,7 @@ func (rrc *RecordRuleContext) Start() {
}
func (rrc *RecordRuleContext) Eval() {
rrc.stats.CounterRecordEval.WithLabelValues().Inc()
promql := strings.TrimSpace(rrc.rule.PromQl)
if promql == "" {
logger.Errorf("eval:%s promql is blank", rrc.Key())
@@ -78,17 +80,20 @@ func (rrc *RecordRuleContext) Eval() {
if rrc.promClients.IsNil(rrc.datasourceId) {
logger.Errorf("eval:%s reader client is nil", rrc.Key())
rrc.stats.CounterRecordEvalErrorTotal.WithLabelValues().Inc()
return
}
value, warnings, err := rrc.promClients.GetCli(rrc.datasourceId).Query(context.Background(), promql, time.Now())
if err != nil {
logger.Errorf("eval:%s promql:%s, error:%v", rrc.Key(), promql, err)
rrc.stats.CounterRecordEvalErrorTotal.WithLabelValues().Inc()
return
}
if len(warnings) > 0 {
logger.Errorf("eval:%s promql:%s, warnings:%v", rrc.Key(), promql, warnings)
rrc.stats.CounterRecordEvalErrorTotal.WithLabelValues().Inc()
return
}

View File

@@ -15,7 +15,7 @@ const (
LabelName = "__name__"
)
func ConvertToTimeSeries(value model.Value, rule *models.RecordingRule) (lst []*prompb.TimeSeries) {
func ConvertToTimeSeries(value model.Value, rule *models.RecordingRule) (lst []prompb.TimeSeries) {
switch value.Type() {
case model.ValVector:
items, ok := value.(model.Vector)
@@ -31,7 +31,7 @@ func ConvertToTimeSeries(value model.Value, rule *models.RecordingRule) (lst []*
s.Timestamp = time.Unix(item.Timestamp.Unix(), 0).UnixNano() / 1e6
s.Value = float64(item.Value)
l := labelsToLabelsProto(item.Metric, rule)
lst = append(lst, &prompb.TimeSeries{
lst = append(lst, prompb.TimeSeries{
Labels: l,
Samples: []prompb.Sample{s},
})
@@ -63,7 +63,7 @@ func ConvertToTimeSeries(value model.Value, rule *models.RecordingRule) (lst []*
Value: float64(v.Value),
})
}
lst = append(lst, &prompb.TimeSeries{
lst = append(lst, prompb.TimeSeries{
Labels: l,
Samples: slst,
})
@@ -78,7 +78,7 @@ func ConvertToTimeSeries(value model.Value, rule *models.RecordingRule) (lst []*
return
}
lst = append(lst, &prompb.TimeSeries{
lst = append(lst, prompb.TimeSeries{
Labels: nil,
Samples: []prompb.Sample{{Value: float64(item.Value), Timestamp: time.Unix(item.Timestamp.Unix(), 0).UnixNano() / 1e6}},
})
@@ -89,9 +89,9 @@ func ConvertToTimeSeries(value model.Value, rule *models.RecordingRule) (lst []*
return
}
func labelsToLabelsProto(labels model.Metric, rule *models.RecordingRule) (result []*prompb.Label) {
func labelsToLabelsProto(labels model.Metric, rule *models.RecordingRule) (result []prompb.Label) {
//name
nameLs := &prompb.Label{
nameLs := prompb.Label{
Name: LabelName,
Value: rule.Name,
}
@@ -101,7 +101,7 @@ func labelsToLabelsProto(labels model.Metric, rule *models.RecordingRule) (resul
continue
}
if model.LabelNameRE.MatchString(string(k)) {
result = append(result, &prompb.Label{
result = append(result, prompb.Label{
Name: string(k),
Value: string(v),
})
@@ -111,7 +111,7 @@ func labelsToLabelsProto(labels model.Metric, rule *models.RecordingRule) (resul
for _, v := range rule.AppendTagsJSON {
index := strings.Index(v, "=")
if model.LabelNameRE.MatchString(v[:index]) {
result = append(result, &prompb.Label{
result = append(result, prompb.Label{
Name: v[:index],
Value: v[index+1:],
})

View File

@@ -72,7 +72,7 @@ func (s *Scheduler) syncRecordRules() {
continue
}
recordRule := NewRecordRuleContext(rule, dsId, s.promClients, s.writers)
recordRule := NewRecordRuleContext(rule, dsId, s.promClients, s.writers, s.stats)
recordRules[recordRule.Hash()] = recordRule
}
}

View File

@@ -72,8 +72,6 @@ func (rt *Router) pushEventToQueue(c *gin.Context) {
event.NotifyChannels = strings.Join(event.NotifyChannelsJSON, " ")
event.NotifyGroups = strings.Join(event.NotifyGroupsJSON, " ")
rt.AlertStats.CounterAlertsTotal.WithLabelValues(event.Cluster).Inc()
dispatch.LogEvent(event, "http_push_queue")
if !queue.EventQueue.PushFront(event) {
msg := fmt.Sprintf("event:%+v push_queue err: queue is full", event)
@@ -87,7 +85,8 @@ func (rt *Router) eventPersist(c *gin.Context) {
var event *models.AlertCurEvent
ginx.BindJSON(c, &event)
event.FE2DB()
ginx.NewRender(c).Message(models.EventPersist(rt.Ctx, event))
err := models.EventPersist(rt.Ctx, event)
ginx.NewRender(c).Data(event.Id, err)
}
type eventForm struct {

View File

@@ -7,6 +7,7 @@ import (
"time"
"github.com/ccfos/nightingale/v6/alert/aconf"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/memsto"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/ctx"
@@ -16,7 +17,8 @@ import (
"github.com/toolkits/pkg/logger"
)
func SendCallbacks(ctx *ctx.Context, urls []string, event *models.AlertCurEvent, targetCache *memsto.TargetCacheType, userCache *memsto.UserCacheType, ibexConf aconf.Ibex) {
func SendCallbacks(ctx *ctx.Context, urls []string, event *models.AlertCurEvent, targetCache *memsto.TargetCacheType, userCache *memsto.UserCacheType,
ibexConf aconf.Ibex, stats *astats.Stats) {
for _, url := range urls {
if url == "" {
continue
@@ -33,9 +35,11 @@ func SendCallbacks(ctx *ctx.Context, urls []string, event *models.AlertCurEvent,
url = "http://" + url
}
stats.AlertNotifyTotal.WithLabelValues("rule_callback").Inc()
resp, code, err := poster.PostJSON(url, 5*time.Second, event, 3)
if err != nil {
logger.Errorf("event_callback_fail(rule_id=%d url=%s), resp: %s, err: %v, code: %d", event.RuleId, url, string(resp), err, code)
stats.AlertNotifyErrorTotal.WithLabelValues("rule_callback").Inc()
} else {
logger.Infof("event_callback_succ(rule_id=%d url=%s), resp: %s, code: %d", event.RuleId, url, string(resp), code)
}
@@ -92,7 +96,7 @@ func handleIbex(ctx *ctx.Context, url string, event *models.AlertCurEvent, targe
return
}
tpl, err := models.TaskTplGet(ctx, "id = ?", id)
tpl, err := models.TaskTplGetById(ctx, id)
if err != nil {
logger.Errorf("event_callback_ibex: failed to get tpl: %v", err)
return

View File

@@ -5,6 +5,7 @@ import (
"strings"
"time"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/poster"
@@ -66,7 +67,8 @@ func (ds *DingtalkSender) Send(ctx MessageContext) {
},
}
}
ds.doSend(url, body)
doSend(url, body, models.Dingtalk, ctx.Stats)
}
}
@@ -81,7 +83,7 @@ func (ds *DingtalkSender) extract(users []*models.User) ([]string, []string) {
}
if token, has := user.ExtractToken(models.Dingtalk); has {
url := token
if !strings.HasPrefix(token, "https://") {
if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
url = "https://oapi.dingtalk.com/robot/send?access_token=" + token
}
urls = append(urls, url)
@@ -90,11 +92,14 @@ func (ds *DingtalkSender) extract(users []*models.User) ([]string, []string) {
return urls, ats
}
func (ds *DingtalkSender) doSend(url string, body dingtalk) {
func doSend(url string, body interface{}, channel string, stats *astats.Stats) {
stats.AlertNotifyTotal.WithLabelValues(channel).Inc()
res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
if err != nil {
logger.Errorf("dingtalk_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
logger.Errorf("%s_sender: result=fail url=%s code=%d error=%v response=%s", channel, url, code, err, string(res))
stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
} else {
logger.Infof("dingtalk_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
logger.Infof("%s_sender: result=succ url=%s code=%d response=%s", channel, url, code, string(res))
}
}

View File

@@ -2,6 +2,7 @@ package sender
import (
"crypto/tls"
"errors"
"html/template"
"time"
@@ -35,6 +36,8 @@ func (es *EmailSender) Send(ctx MessageContext) {
}
content := BuildTplMessage(es.contentTpl, ctx.Events)
es.WriteEmail(subject, content, tos)
ctx.Stats.AlertNotifyTotal.WithLabelValues(models.Email).Add(float64(len(tos)))
}
func extract(users []*models.User) []string {
@@ -47,7 +50,7 @@ func extract(users []*models.User) []string {
return tos
}
func (es *EmailSender) SendEmail(subject, content string, tos []string, stmp aconf.SMTPConfig) {
func SendEmail(subject, content string, tos []string, stmp aconf.SMTPConfig) error {
conf := stmp
d := gomail.NewDialer(conf.Host, conf.Port, conf.User, conf.Pass)
@@ -64,8 +67,9 @@ func (es *EmailSender) SendEmail(subject, content string, tos []string, stmp aco
err := d.DialAndSend(m)
if err != nil {
logger.Errorf("email_sender: failed to send: %v", err)
return errors.New("email_sender: failed to send: " + err.Error())
}
return nil
}
func (es *EmailSender) WriteEmail(subject, content string, tos []string) {
@@ -96,19 +100,21 @@ var mailQuit = make(chan struct{})
func RestartEmailSender(smtp aconf.SMTPConfig) {
close(mailQuit)
mailQuit = make(chan struct{})
StartEmailSender(smtp)
startEmailSender(smtp)
}
func StartEmailSender(smtp aconf.SMTPConfig) {
func InitEmailSender(smtp aconf.SMTPConfig) {
mailch = make(chan *gomail.Message, 100000)
startEmailSender(smtp)
}
func startEmailSender(smtp aconf.SMTPConfig) {
conf := smtp
if conf.Host == "" || conf.Port == 0 {
logger.Warning("SMTP configurations invalid")
return
}
logger.Infof("start email sender... %+v", conf)
logger.Infof("start email sender... conf.Host:%+v,conf.Port:%+v", conf.Host, conf.Port)
d := gomail.NewDialer(conf.Host, conf.Port, conf.User, conf.Pass)
if conf.InsecureSkipVerify {

View File

@@ -3,12 +3,8 @@ package sender
import (
"html/template"
"strings"
"time"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/poster"
"github.com/toolkits/pkg/logger"
)
type feishuContent struct {
@@ -49,7 +45,7 @@ func (fs *FeishuSender) Send(ctx MessageContext) {
IsAtAll: false,
}
}
fs.doSend(url, body)
doSend(url, body, models.Feishu, ctx.Stats)
}
}
@@ -63,7 +59,7 @@ func (fs *FeishuSender) extract(users []*models.User) ([]string, []string) {
}
if token, has := user.ExtractToken(models.Feishu); has {
url := token
if !strings.HasPrefix(token, "https://") {
if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
url = "https://open.feishu.cn/open-apis/bot/v2/hook/" + token
}
urls = append(urls, url)
@@ -71,12 +67,3 @@ func (fs *FeishuSender) extract(users []*models.User) ([]string, []string) {
}
return urls, ats
}
func (fs *FeishuSender) doSend(url string, body feishu) {
res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
if err != nil {
logger.Errorf("feishu_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
} else {
logger.Infof("feishu_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
}
}

View File

@@ -4,12 +4,8 @@ import (
"fmt"
"html/template"
"strings"
"time"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/poster"
"github.com/toolkits/pkg/logger"
)
type Conf struct {
@@ -115,7 +111,7 @@ func (fs *FeishuCardSender) Send(ctx MessageContext) {
body.Card.Elements[0].Text.Content = message
body.Card.Elements[2].Elements[0].Content = SendTitle
for _, url := range urls {
fs.doSend(url, body)
doSend(url, body, models.FeishuCard, ctx.Stats)
}
}
@@ -125,7 +121,7 @@ func (fs *FeishuCardSender) extract(users []*models.User) ([]string, []string) {
for i := range users {
if token, has := users[i].ExtractToken(models.FeishuCard); has {
url := token
if !strings.HasPrefix(token, "https://") {
if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
url = "https://open.feishu.cn/open-apis/bot/v2/hook/" + strings.TrimSpace(token)
}
urls = append(urls, url)
@@ -133,12 +129,3 @@ func (fs *FeishuCardSender) extract(users []*models.User) ([]string, []string) {
}
return urls, ats
}
func (fs *FeishuCardSender) doSend(url string, body feishuCard) {
res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
if err != nil {
logger.Errorf("feishucard_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
} else {
logger.Debugf("feishucard_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
}
}

View File

@@ -4,10 +4,9 @@ import (
"html/template"
"net/url"
"strings"
"time"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/poster"
"github.com/toolkits/pkg/logger"
)
@@ -15,6 +14,7 @@ import (
type MatterMostMessage struct {
Text string
Tokens []string
Stats *astats.Stats
}
type mm struct {
@@ -41,6 +41,7 @@ func (ms *MmSender) Send(ctx MessageContext) {
SendMM(MatterMostMessage{
Text: message,
Tokens: urls,
Stats: ctx.Stats,
})
}
@@ -87,13 +88,7 @@ func SendMM(message MatterMostMessage) {
Username: username,
Text: txt + message.Text,
}
res, code, err := poster.PostJSON(ur, time.Second*5, body, 3)
if err != nil {
logger.Errorf("mm_sender: result=fail url=%s code=%d error=%v response=%s", ur, code, err, string(res))
} else {
logger.Infof("mm_sender: result=succ url=%s code=%d response=%s", ur, code, string(res))
}
doSend(ur, body, models.Mm, message.Stats)
}
}
}

View File

@@ -6,6 +6,7 @@ import (
"os/exec"
"time"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/models"
"github.com/toolkits/pkg/file"
@@ -13,20 +14,22 @@ import (
"github.com/toolkits/pkg/sys"
)
func MayPluginNotify(noticeBytes []byte, notifyScript models.NotifyScript) {
func MayPluginNotify(noticeBytes []byte, notifyScript models.NotifyScript, stats *astats.Stats) {
if len(noticeBytes) == 0 {
return
}
alertingCallScript(noticeBytes, notifyScript)
alertingCallScript(noticeBytes, notifyScript, stats)
}
func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {
func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript, stats *astats.Stats) {
// not enable or no notify.py? do nothing
config := notifyScript
if !config.Enable || config.Content == "" {
return
}
channel := "script"
stats.AlertNotifyTotal.WithLabelValues(channel).Inc()
fpath := ".notify_scriptt"
if config.Type == 1 {
fpath = config.Content
@@ -36,6 +39,7 @@ func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {
oldContent, err := file.ToString(fpath)
if err != nil {
logger.Errorf("event_script_notify_fail: read script file err: %v", err)
stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
return
}
@@ -48,12 +52,14 @@ func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {
_, err := file.WriteString(fpath, config.Content)
if err != nil {
logger.Errorf("event_script_notify_fail: write script file err: %v", err)
stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
return
}
err = os.Chmod(fpath, 0777)
if err != nil {
logger.Errorf("event_script_notify_fail: chmod script file err: %v", err)
stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
return
}
}
@@ -83,13 +89,14 @@ func alertingCallScript(stdinBytes []byte, notifyScript models.NotifyScript) {
if err != nil {
logger.Errorf("event_script_notify_fail: kill process %s occur error %v", fpath, err)
stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
}
return
}
if err != nil {
logger.Errorf("event_script_notify_fail: exec script %s occur error: %v, output: %s", fpath, err, buf.String())
stats.AlertNotifyErrorTotal.WithLabelValues(channel).Inc()
return
}

View File

@@ -5,6 +5,7 @@ import (
"html/template"
"github.com/ccfos/nightingale/v6/alert/aconf"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/memsto"
"github.com/ccfos/nightingale/v6/models"
)
@@ -20,10 +21,11 @@ type (
Users []*models.User
Rule *models.AlertRule
Events []*models.AlertCurEvent
Stats *astats.Stats
}
)
func NewSender(key string, tpls map[string]*template.Template, smtp aconf.SMTPConfig) Sender {
func NewSender(key string, tpls map[string]*template.Template, smtp ...aconf.SMTPConfig) Sender {
switch key {
case models.Dingtalk:
return &DingtalkSender{tpl: tpls[models.Dingtalk]}
@@ -34,7 +36,7 @@ func NewSender(key string, tpls map[string]*template.Template, smtp aconf.SMTPCo
case models.FeishuCard:
return &FeishuCardSender{tpl: tpls[models.FeishuCard]}
case models.Email:
return &EmailSender{subjectTpl: tpls["mailsubject"], contentTpl: tpls[models.Email], smtp: smtp}
return &EmailSender{subjectTpl: tpls[models.EmailSubject], contentTpl: tpls[models.Email], smtp: smtp[0]}
case models.Mm:
return &MmSender{tpl: tpls[models.Mm]}
case models.Telegram:
@@ -43,12 +45,13 @@ func NewSender(key string, tpls map[string]*template.Template, smtp aconf.SMTPCo
return nil
}
func BuildMessageContext(rule *models.AlertRule, events []*models.AlertCurEvent, uids []int64, userCache *memsto.UserCacheType) MessageContext {
func BuildMessageContext(rule *models.AlertRule, events []*models.AlertCurEvent, uids []int64, userCache *memsto.UserCacheType, stats *astats.Stats) MessageContext {
users := userCache.GetByUserIds(uids)
return MessageContext{
Rule: rule,
Events: events,
Users: users,
Stats: stats,
}
}

View File

@@ -3,10 +3,9 @@ package sender
import (
"html/template"
"strings"
"time"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/poster"
"github.com/toolkits/pkg/logger"
)
@@ -14,6 +13,7 @@ import (
type TelegramMessage struct {
Text string
Tokens []string
Stats *astats.Stats
}
type telegram struct {
@@ -35,6 +35,7 @@ func (ts *TelegramSender) Send(ctx MessageContext) {
SendTelegram(TelegramMessage{
Text: message,
Tokens: tokens,
Stats: ctx.Stats,
})
}
@@ -55,7 +56,7 @@ func SendTelegram(message TelegramMessage) {
continue
}
var url string
if strings.HasPrefix(message.Tokens[i], "https://") {
if strings.HasPrefix(message.Tokens[i], "https://") || strings.HasPrefix(message.Tokens[i], "http://") {
url = message.Tokens[i]
} else {
array := strings.Split(message.Tokens[i], "/")
@@ -72,11 +73,6 @@ func SendTelegram(message TelegramMessage) {
Text: message.Text,
}
res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
if err != nil {
logger.Errorf("telegram_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
} else {
logger.Infof("telegram_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
}
doSend(url, body, models.Telegram, message.Stats)
}
}

View File

@@ -3,16 +3,17 @@ package sender
import (
"bytes"
"encoding/json"
"io/ioutil"
"io"
"net/http"
"time"
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/models"
"github.com/toolkits/pkg/logger"
)
func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent) {
func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent, stats *astats.Stats) {
for _, conf := range webhooks {
if conf.Url == "" || !conf.Enable {
continue
@@ -50,9 +51,11 @@ func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent) {
Timeout: time.Duration(conf.Timeout) * time.Second,
}
stats.AlertNotifyTotal.WithLabelValues("webhook").Inc()
var resp *http.Response
resp, err = client.Do(req)
if err != nil {
stats.AlertNotifyErrorTotal.WithLabelValues("webhook").Inc()
logger.Errorf("event_webhook_fail, ruleId: [%d], eventId: [%d], url: [%s], error: [%s]", event.RuleId, event.Id, conf.Url, err)
continue
}
@@ -60,7 +63,7 @@ func SendWebhooks(webhooks []*models.Webhook, event *models.AlertCurEvent) {
var body []byte
if resp.Body != nil {
defer resp.Body.Close()
body, _ = ioutil.ReadAll(resp.Body)
body, _ = io.ReadAll(resp.Body)
}
logger.Debugf("event_webhook_succ, url: %s, response code: %d, body: %s", conf.Url, resp.StatusCode, string(body))

View File

@@ -3,12 +3,8 @@ package sender
import (
"html/template"
"strings"
"time"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/poster"
"github.com/toolkits/pkg/logger"
)
type wecomMarkdown struct {
@@ -37,7 +33,7 @@ func (ws *WecomSender) Send(ctx MessageContext) {
Content: message,
},
}
ws.doSend(url, body)
doSend(url, body, models.Wecom, ctx.Stats)
}
}
@@ -46,7 +42,7 @@ func (ws *WecomSender) extract(users []*models.User) []string {
for _, user := range users {
if token, has := user.ExtractToken(models.Wecom); has {
url := token
if !strings.HasPrefix(token, "https://") {
if !strings.HasPrefix(token, "https://") && !strings.HasPrefix(token, "http://") {
url = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=" + token
}
urls = append(urls, url)
@@ -54,12 +50,3 @@ func (ws *WecomSender) extract(users []*models.User) []string {
}
return urls
}
func (ws *WecomSender) doSend(url string, body wecom) {
res, code, err := poster.PostJSON(url, time.Second*5, body, 3)
if err != nil {
logger.Errorf("wecom_sender: result=fail url=%s code=%d error=%v response=%s", url, code, err, string(res))
} else {
logger.Infof("wecom_sender: result=succ url=%s code=%d response=%s", url, code, string(res))
}
}

View File

@@ -1,9 +1,11 @@
package cconf
import (
"fmt"
"path"
"github.com/toolkits/pkg/file"
"gopkg.in/yaml.v2"
)
var Operations = Operation{}
@@ -36,3 +38,134 @@ func GetAllOps(ops []Ops) []string {
}
return ret
}
func MergeOperationConf() error {
opsBuiltIn := Operation{}
err := yaml.Unmarshal([]byte(builtInOps), &opsBuiltIn)
if err != nil {
return fmt.Errorf("cannot parse builtInOps: %s", err.Error())
}
configOpsMap := make(map[string]struct{})
for _, op := range Operations.Ops {
configOpsMap[op.Name] = struct{}{}
}
//If the opBu.Name is not a constant in the target (Operations.Ops), add Ops from the built-in options
for _, opBu := range opsBuiltIn.Ops {
if _, has := configOpsMap[opBu.Name]; !has {
Operations.Ops = append(Operations.Ops, opBu)
}
}
return nil
}
const (
builtInOps = `
ops:
- name: dashboards
cname: 仪表盘
ops:
- "/dashboards"
- "/dashboards/add"
- "/dashboards/put"
- "/dashboards/del"
- "/dashboards-built-in"
- name: alert
cname: 告警规则
ops:
- "/alert-rules"
- "/alert-rules/add"
- "/alert-rules/put"
- "/alert-rules/del"
- "/alert-rules-built-in"
- name: alert-mutes
cname: 告警静默管理
ops:
- "/alert-mutes"
- "/alert-mutes/add"
- "/alert-mutes/put"
- "/alert-mutes/del"
- name: alert-subscribes
cname: 告警订阅管理
ops:
- "/alert-subscribes"
- "/alert-subscribes/add"
- "/alert-subscribes/put"
- "/alert-subscribes/del"
- name: alert-events
cname: 告警事件管理
ops:
- "/alert-cur-events"
- "/alert-cur-events/del"
- "/alert-his-events"
- name: recording-rules
cname: 记录规则管理
ops:
- "/recording-rules"
- "/recording-rules/add"
- "/recording-rules/put"
- "/recording-rules/del"
- name: metric
cname: 时序指标
ops:
- "/metric/explorer"
- "/object/explorer"
- name: log
cname: 日志分析
ops:
- "/log/explorer"
- "/log/index-patterns"
- name: targets
cname: 基础设施
ops:
- "/targets"
- "/targets/add"
- "/targets/put"
- "/targets/del"
- name: job
cname: 任务管理
ops:
- "/job-tpls"
- "/job-tpls/add"
- "/job-tpls/put"
- "/job-tpls/del"
- "/job-tasks"
- "/job-tasks/add"
- "/job-tasks/put"
- name: user
cname: 用户管理
ops:
- "/users"
- "/user-groups"
- "/user-groups/add"
- "/user-groups/put"
- "/user-groups/del"
- name: busi-groups
cname: 业务分组管理
ops:
- "/busi-groups"
- "/busi-groups/add"
- "/busi-groups/put"
- "/busi-groups/del"
- name: system
cname: 系统信息
ops:
- "/help/version"
- "/help/servers"
- "/help/source"
- "/help/sso"
- "/help/notification-tpls"
- "/help/notification-settings"
- "/help/migrate"
`
)

View File

@@ -15,8 +15,14 @@ var Plugins = []Plugin{
},
{
Id: 3,
Category: "logging",
Type: "jaeger",
TypeName: "Jaeger",
Category: "loki",
Type: "loki",
TypeName: "Loki",
},
{
Id: 4,
Category: "timeseries",
Type: "tdengine",
TypeName: "TDengine",
},
}

View File

@@ -0,0 +1,105 @@
package rsa
import (
"os"
"github.com/ccfos/nightingale/v6/models"
"github.com/ccfos/nightingale/v6/pkg/ctx"
"github.com/ccfos/nightingale/v6/pkg/httpx"
"github.com/ccfos/nightingale/v6/pkg/secu"
"github.com/pkg/errors"
"github.com/toolkits/pkg/file"
"github.com/toolkits/pkg/logger"
)
func InitRSAConfig(ctx *ctx.Context, rsaConfig *httpx.RSAConfig) error {
// 1.Load RSA keys from Database
rsaPassWord, err := models.ConfigsGet(ctx, models.RSA_PASSWORD)
if err != nil {
return errors.WithMessagef(err, "cannot query config(%s)", models.RSA_PASSWORD)
}
privateKeyVal, err := models.ConfigsGet(ctx, models.RSA_PRIVATE_KEY)
if err != nil {
return errors.WithMessagef(err, "cannot query config(%s)", models.RSA_PRIVATE_KEY)
}
publicKeyVal, err := models.ConfigsGet(ctx, models.RSA_PUBLIC_KEY)
if err != nil {
return errors.WithMessagef(err, "cannot query config(%s)", models.RSA_PUBLIC_KEY)
}
if rsaPassWord != "" && privateKeyVal != "" && publicKeyVal != "" {
rsaConfig.RSAPassWord = rsaPassWord
rsaConfig.RSAPrivateKey = []byte(privateKeyVal)
rsaConfig.RSAPublicKey = []byte(publicKeyVal)
return nil
}
// 2.Read RSA configuration from file if exists
if file.IsExist(rsaConfig.RSAPrivateKeyPath) && file.IsExist(rsaConfig.RSAPublicKeyPath) {
//password already read from config
rsaConfig.RSAPrivateKey, rsaConfig.RSAPublicKey, err = readConfigFile(rsaConfig)
if err != nil {
return errors.WithMessage(err, "failed to read rsa config from file")
}
return nil
}
// 3.Generate RSA keys if not exist
rsaConfig.RSAPassWord, rsaConfig.RSAPrivateKey, rsaConfig.RSAPublicKey, err = initRSAKeyPairs(ctx, rsaConfig.RSAPassWord)
if err != nil {
return errors.WithMessage(err, "failed to generate rsa key pair")
}
return nil
}
func initRSAKeyPairs(ctx *ctx.Context, rsaPassWord string) (password string, privateByte, publicByte []byte, err error) {
// Generate RSA keys
// Generate RSA password
if rsaPassWord != "" {
logger.Debug("Using existing RSA password")
password = rsaPassWord
err = models.ConfigsSet(ctx, models.RSA_PASSWORD, password)
if err != nil {
err = errors.WithMessagef(err, "failed to set config(%s)", models.RSA_PASSWORD)
return
}
} else {
password, err = models.InitRSAPassWord(ctx)
if err != nil {
err = errors.WithMessage(err, "failed to generate rsa password")
return
}
}
privateByte, publicByte, err = secu.GenerateRsaKeyPair(password)
if err != nil {
err = errors.WithMessage(err, "failed to generate rsa key pair")
return
}
// Save generated RSA keys
err = models.ConfigsSet(ctx, models.RSA_PRIVATE_KEY, string(privateByte))
if err != nil {
err = errors.WithMessagef(err, "failed to set config(%s)", models.RSA_PRIVATE_KEY)
return
}
err = models.ConfigsSet(ctx, models.RSA_PUBLIC_KEY, string(publicByte))
if err != nil {
err = errors.WithMessagef(err, "failed to set config(%s)", models.RSA_PUBLIC_KEY)
return
}
return
}
func readConfigFile(rsaConfig *httpx.RSAConfig) (privateBuf, publicBuf []byte, err error) {
publicBuf, err = os.ReadFile(rsaConfig.RSAPublicKeyPath)
if err != nil {
err = errors.WithMessagef(err, "could not read RSAPublicKeyPath %q", rsaConfig.RSAPublicKeyPath)
return
}
privateBuf, err = os.ReadFile(rsaConfig.RSAPrivateKeyPath)
if err != nil {
err = errors.WithMessagef(err, "could not read RSAPrivateKeyPath %q", rsaConfig.RSAPrivateKeyPath)
}
return
}

15
center/cconf/sql_tpl.go Normal file
View File

@@ -0,0 +1,15 @@
package cconf
var TDengineSQLTpl = map[string]string{
"load5": "SELECT _wstart as ts, last(load5) FROM $database.system WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) fill(null)",
"process_total": "SELECT _wstart as ts, last(total) FROM $database.processes WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) fill(null)",
"thread_total": "SELECT _wstart as ts, last(total) FROM $database.threads WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) fill(null)",
"cpu_idle": "SELECT _wstart as ts, last(usage_idle) * -1 + 100 FROM $database.cpu WHERE (host = '$server' and cpu = 'cpu-total') and _ts >= $from and _ts <= $to interval($interval) fill(null)",
"mem_used_percent": "SELECT _wstart as ts, last(used_percent) FROM $database.mem WHERE (host = '$server') and _ts >= $from and _ts <= $to interval($interval) fill(null)",
"disk_used_percent": "SELECT _wstart as ts, last(used_percent) FROM $database.disk WHERE (host = '$server' and path = '/') and _ts >= $from and _ts <= $to interval($interval) fill(null)",
"cpu_context_switches": "select ts, derivative(context_switches, 1s, 0) as context FROM (SELECT _wstart as ts, avg(context_switches) as context_switches FROM $database.kernel WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval) )",
"tcp": "SELECT _wstart as ts, avg(tcp_close) as CLOSED, avg(tcp_close_wait) as CLOSE_WAIT, avg(tcp_closing) as CLOSING, avg(tcp_established) as ESTABLISHED, avg(tcp_fin_wait1) as FIN_WAIT1, avg(tcp_fin_wait2) as FIN_WAIT2, avg(tcp_last_ack) as LAST_ACK, avg(tcp_syn_recv) as SYN_RECV, avg(tcp_syn_sent) as SYN_SENT, avg(tcp_time_wait) as TIME_WAIT FROM $database.netstat WHERE host = '$server' and _ts >= $from and _ts <= $to interval($interval)",
"net_bytes_recv": "SELECT _wstart as ts, derivative(bytes_recv,1s, 1) as bytes_in FROM $database.net WHERE host = '$server' and interface = '$netif' and _ts >= $from and _ts <= $to group by tbname",
"net_bytes_sent": "SELECT _wstart as ts, derivative(bytes_sent,1s, 1) as bytes_out FROM $database.net WHERE host = '$server' and interface = '$netif' and _ts >= $from and _ts <= $to group by tbname",
"disk_total": "SELECT _wstart as ts, avg(total) AS total, avg(used) as used FROM $database.disk WHERE path = '$mountpoint' and _ts >= $from and _ts <= $to interval($interval) group by host",
}

View File

@@ -8,6 +8,8 @@ import (
"github.com/ccfos/nightingale/v6/alert/astats"
"github.com/ccfos/nightingale/v6/alert/process"
"github.com/ccfos/nightingale/v6/center/cconf"
"github.com/ccfos/nightingale/v6/center/cconf/rsa"
"github.com/ccfos/nightingale/v6/center/cstats"
"github.com/ccfos/nightingale/v6/center/metas"
"github.com/ccfos/nightingale/v6/center/sso"
"github.com/ccfos/nightingale/v6/conf"
@@ -19,10 +21,12 @@ import (
"github.com/ccfos/nightingale/v6/pkg/httpx"
"github.com/ccfos/nightingale/v6/pkg/i18nx"
"github.com/ccfos/nightingale/v6/pkg/logx"
"github.com/ccfos/nightingale/v6/pkg/version"
"github.com/ccfos/nightingale/v6/prom"
"github.com/ccfos/nightingale/v6/pushgw/idents"
"github.com/ccfos/nightingale/v6/pushgw/writer"
"github.com/ccfos/nightingale/v6/storage"
"github.com/ccfos/nightingale/v6/tdengine"
alertrt "github.com/ccfos/nightingale/v6/alert/router"
centerrt "github.com/ccfos/nightingale/v6/center/router"
@@ -38,22 +42,31 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
cconf.LoadMetricsYaml(configDir, config.Center.MetricsYamlFile)
cconf.LoadOpsYaml(configDir, config.Center.OpsYamlFile)
cconf.MergeOperationConf()
logxClean, err := logx.Init(config.Log)
if err != nil {
return nil, err
}
i18nx.Init()
i18nx.Init(configDir)
cstats.Init()
db, err := storage.New(config.DB)
if err != nil {
return nil, err
}
ctx := ctx.NewContext(context.Background(), db, true)
models.InitRoot(ctx)
migrate.Migrate(db)
models.InitRoot(ctx)
redis, err := storage.NewRedis(config.Redis)
err = rsa.InitRSAConfig(ctx, &config.HTTP.RSA)
if err != nil {
return nil, err
}
var redis storage.Redis
redis, err = storage.NewRedis(config.Redis)
if err != nil {
return nil, err
}
@@ -66,26 +79,29 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
sso := sso.Init(config.Center, ctx)
configCache := memsto.NewConfigCache(ctx, syncStats, config.HTTP.RSA.RSAPrivateKey, config.HTTP.RSA.RSAPassWord)
busiGroupCache := memsto.NewBusiGroupCache(ctx, syncStats)
targetCache := memsto.NewTargetCache(ctx, syncStats, redis)
dsCache := memsto.NewDatasourceCache(ctx, syncStats)
alertMuteCache := memsto.NewAlertMuteCache(ctx, syncStats)
alertRuleCache := memsto.NewAlertRuleCache(ctx, syncStats)
notifyConfigCache := memsto.NewNotifyConfigCache(ctx)
notifyConfigCache := memsto.NewNotifyConfigCache(ctx, configCache)
userCache := memsto.NewUserCache(ctx, syncStats)
userGroupCache := memsto.NewUserGroupCache(ctx, syncStats)
promClients := prom.NewPromClient(ctx, config.Alert.Heartbeat)
tdengineClients := tdengine.NewTdengineClient(ctx, config.Alert.Heartbeat)
externalProcessors := process.NewExternalProcessors()
alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, userCache, userGroupCache)
alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, tdengineClients, userCache, userGroupCache)
writers := writer.NewWriters(config.Pushgw)
httpx.InitRSAConfig(&config.HTTP.RSA)
go version.GetGithubVersion()
alertrtRouter := alertrt.New(config.HTTP, config.Alert, alertMuteCache, targetCache, busiGroupCache, alertStats, ctx, externalProcessors)
centerRouter := centerrt.New(config.HTTP, config.Center, cconf.Operations, dsCache, notifyConfigCache, promClients, redis, sso, ctx, metas, idents, targetCache, userCache, userGroupCache)
centerRouter := centerrt.New(config.HTTP, config.Center, cconf.Operations, dsCache, notifyConfigCache, promClients, tdengineClients,
redis, sso, ctx, metas, idents, targetCache, userCache, userGroupCache)
pushgwRouter := pushgwrt.New(config.HTTP, config.Pushgw, targetCache, busiGroupCache, idents, writers, ctx)
r := httpx.GinEngine(config.Global.RunMode, config.HTTP)

View File

@@ -17,12 +17,15 @@ import (
"github.com/ccfos/nightingale/v6/pkg/aop"
"github.com/ccfos/nightingale/v6/pkg/ctx"
"github.com/ccfos/nightingale/v6/pkg/httpx"
"github.com/ccfos/nightingale/v6/pkg/version"
"github.com/ccfos/nightingale/v6/prom"
"github.com/ccfos/nightingale/v6/pushgw/idents"
"github.com/ccfos/nightingale/v6/storage"
"github.com/ccfos/nightingale/v6/tdengine"
"github.com/gin-gonic/gin"
"github.com/rakyll/statik/fs"
"github.com/toolkits/pkg/ginx"
"github.com/toolkits/pkg/logger"
"github.com/toolkits/pkg/runner"
)
@@ -34,6 +37,7 @@ type Router struct {
DatasourceCache *memsto.DatasourceCacheType
NotifyConfigCache *memsto.NotifyConfigCacheType
PromClients *prom.PromClientMap
TdendgineClients *tdengine.TdengineClientMap
Redis storage.Redis
MetaSet *metas.Set
IdentSet *idents.Set
@@ -42,11 +46,11 @@ type Router struct {
UserCache *memsto.UserCacheType
UserGroupCache *memsto.UserGroupCacheType
Ctx *ctx.Context
DatasourceCheckHook func(*gin.Context) bool
}
func New(httpConfig httpx.Config, center cconf.Center, operations cconf.Operation, ds *memsto.DatasourceCacheType, ncc *memsto.NotifyConfigCacheType,
pc *prom.PromClientMap, redis storage.Redis, sso *sso.SsoClient, ctx *ctx.Context, metaSet *metas.Set, idents *idents.Set, tc *memsto.TargetCacheType,
uc *memsto.UserCacheType, ugc *memsto.UserGroupCacheType) *Router {
func New(httpConfig httpx.Config, center cconf.Center, operations cconf.Operation, ds *memsto.DatasourceCacheType, ncc *memsto.NotifyConfigCacheType, pc *prom.PromClientMap, tdendgineClients *tdengine.TdengineClientMap, redis storage.Redis, sso *sso.SsoClient, ctx *ctx.Context, metaSet *metas.Set, idents *idents.Set, tc *memsto.TargetCacheType, uc *memsto.UserCacheType, ugc *memsto.UserGroupCacheType) *Router {
return &Router{
HTTP: httpConfig,
Center: center,
@@ -54,6 +58,7 @@ func New(httpConfig httpx.Config, center cconf.Center, operations cconf.Operatio
DatasourceCache: ds,
NotifyConfigCache: ncc,
PromClients: pc,
TdendgineClients: tdendgineClients,
Redis: redis,
MetaSet: metaSet,
IdentSet: idents,
@@ -62,6 +67,8 @@ func New(httpConfig httpx.Config, center cconf.Center, operations cconf.Operatio
UserCache: uc,
UserGroupCache: ugc,
Ctx: ctx,
DatasourceCheckHook: func(ctx *gin.Context) bool { return false },
}
}
@@ -160,11 +167,27 @@ func (rt *Router) Config(r *gin.Engine) {
pages.POST("/query-range-batch", rt.promBatchQueryRange)
pages.POST("/query-instant-batch", rt.promBatchQueryInstant)
pages.GET("/datasource/brief", rt.datasourceBriefs)
pages.POST("/ds-query", rt.QueryData)
pages.POST("/logs-query", rt.QueryLog)
pages.POST("/tdengine-databases", rt.tdengineDatabases)
pages.POST("/tdengine-tables", rt.tdengineTables)
pages.POST("/tdengine-columns", rt.tdengineColumns)
pages.GET("/sql-template", rt.QuerySqlTemplate)
} else {
pages.Any("/proxy/:id/*url", rt.auth(), rt.dsProxy)
pages.POST("/query-range-batch", rt.auth(), rt.promBatchQueryRange)
pages.POST("/query-instant-batch", rt.auth(), rt.promBatchQueryInstant)
pages.GET("/datasource/brief", rt.auth(), rt.datasourceBriefs)
pages.POST("/ds-query", rt.auth(), rt.QueryData)
pages.POST("/logs-query", rt.auth(), rt.QueryLog)
pages.POST("/tdengine-databases", rt.auth(), rt.tdengineDatabases)
pages.POST("/tdengine-tables", rt.auth(), rt.tdengineTables)
pages.POST("/tdengine-columns", rt.auth(), rt.tdengineColumns)
}
pages.POST("/auth/login", rt.jwtMock(), rt.loginPost)
@@ -243,6 +266,7 @@ func (rt *Router) Config(r *gin.Engine) {
pages.GET("/builtin-boards-cates", rt.auth(), rt.user(), rt.builtinBoardCateGets)
pages.POST("/builtin-boards-detail", rt.auth(), rt.user(), rt.builtinBoardDetailGets)
pages.GET("/integrations/icon/:cate/:name", rt.builtinIcon)
pages.GET("/integrations/makedown/:cate", rt.builtinMarkdown)
pages.GET("/busi-group/:id/boards", rt.auth(), rt.user(), rt.perm("/dashboards"), rt.bgro(), rt.boardGets)
pages.POST("/busi-group/:id/boards", rt.auth(), rt.user(), rt.perm("/dashboards/add"), rt.bgrw(), rt.boardAdd)
@@ -268,7 +292,7 @@ func (rt *Router) Config(r *gin.Engine) {
pages.PUT("/busi-group/:id/alert-rules/fields", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.bgrw(), rt.alertRulePutFields)
pages.PUT("/busi-group/:id/alert-rule/:arid", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.alertRulePutByFE)
pages.GET("/alert-rule/:arid", rt.auth(), rt.user(), rt.perm("/alert-rules"), rt.alertRuleGet)
pages.PUT("/busi-group/:id/alert-rule/:arid/validate", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.alertRuleValidation)
pages.PUT("/busi-group/alert-rule/validate", rt.auth(), rt.user(), rt.perm("/alert-rules/put"), rt.alertRuleValidation)
pages.GET("/busi-group/:id/recording-rules", rt.auth(), rt.user(), rt.perm("/recording-rules"), rt.recordingRuleGets)
pages.POST("/busi-group/:id/recording-rules", rt.auth(), rt.user(), rt.perm("/recording-rules/add"), rt.bgrw(), rt.recordingRuleAddByFE)
@@ -278,6 +302,7 @@ func (rt *Router) Config(r *gin.Engine) {
pages.PUT("/busi-group/:id/recording-rules/fields", rt.auth(), rt.user(), rt.perm("/recording-rules/put"), rt.recordingRulePutFields)
pages.GET("/busi-group/:id/alert-mutes", rt.auth(), rt.user(), rt.perm("/alert-mutes"), rt.bgro(), rt.alertMuteGetsByBG)
pages.POST("/busi-group/:id/alert-mutes/preview", rt.auth(), rt.user(), rt.perm("/alert-mutes/add"), rt.bgrw(), rt.alertMutePreview)
pages.POST("/busi-group/:id/alert-mutes", rt.auth(), rt.user(), rt.perm("/alert-mutes/add"), rt.bgrw(), rt.alertMuteAdd)
pages.DELETE("/busi-group/:id/alert-mutes", rt.auth(), rt.user(), rt.perm("/alert-mutes/del"), rt.bgrw(), rt.alertMuteDel)
pages.PUT("/busi-group/:id/alert-mute/:amid", rt.auth(), rt.user(), rt.perm("/alert-mutes/put"), rt.alertMutePutByFE)
@@ -303,6 +328,7 @@ func (rt *Router) Config(r *gin.Engine) {
pages.POST("/alert-cur-events/card/details", rt.auth(), rt.alertCurEventsCardDetails)
pages.GET("/alert-his-events/list", rt.auth(), rt.alertHisEventsList)
pages.DELETE("/alert-cur-events", rt.auth(), rt.user(), rt.perm("/alert-cur-events/del"), rt.alertCurEventDel)
pages.GET("/alert-cur-events/stats", rt.auth(), rt.alertCurEventsStatistics)
pages.GET("/alert-aggr-views", rt.auth(), rt.alertAggrViewGets)
pages.DELETE("/alert-aggr-views", rt.auth(), rt.user(), rt.alertAggrViewDel)
@@ -365,14 +391,39 @@ func (rt *Router) Config(r *gin.Engine) {
pages.GET("/notify-config", rt.auth(), rt.admin(), rt.notifyConfigGet)
pages.PUT("/notify-config", rt.auth(), rt.admin(), rt.notifyConfigPut)
pages.PUT("/smtp-config-test", rt.auth(), rt.admin(), rt.attemptSendEmail)
pages.GET("/es-index-pattern", rt.auth(), rt.esIndexPatternGet)
pages.GET("/es-index-pattern-list", rt.auth(), rt.esIndexPatternGetList)
pages.POST("/es-index-pattern", rt.auth(), rt.admin(), rt.esIndexPatternAdd)
pages.PUT("/es-index-pattern", rt.auth(), rt.admin(), rt.esIndexPatternPut)
pages.DELETE("/es-index-pattern", rt.auth(), rt.admin(), rt.esIndexPatternDel)
pages.GET("/user-variable-configs", rt.auth(), rt.admin(), rt.userVariableConfigGets)
pages.POST("/user-variable-config", rt.auth(), rt.admin(), rt.userVariableConfigAdd)
pages.PUT("/user-variable-config/:id", rt.auth(), rt.admin(), rt.userVariableConfigPut)
pages.DELETE("/user-variable-config/:id", rt.auth(), rt.admin(), rt.userVariableConfigDel)
pages.GET("/config", rt.auth(), rt.admin(), rt.configGetByKey)
pages.PUT("/config", rt.auth(), rt.admin(), rt.configPutByKey)
}
r.GET("/api/n9e/versions", func(c *gin.Context) {
v := version.Version
lastIndex := strings.LastIndex(version.Version, "-")
if lastIndex != -1 {
v = version.Version[:lastIndex]
}
gv := version.GithubVersion.Load()
if gv != nil {
ginx.NewRender(c).Data(gin.H{"version": v, "github_verison": gv.(string)}, nil)
} else {
ginx.NewRender(c).Data(gin.H{"version": v, "github_verison": ""}, nil)
}
})
if rt.HTTP.APIForService.Enable {
service := r.Group("/v1/n9e")
if len(rt.HTTP.APIForService.BasicAuth) > 0 {
@@ -418,6 +469,8 @@ func (rt *Router) Config(r *gin.Engine) {
service.GET("/alert-his-events", rt.alertHisEventsList)
service.GET("/alert-his-event/:eid", rt.alertHisEventGet)
service.GET("/task-tpl/:tid", rt.taskTplGetByService)
service.GET("/config/:id", rt.configGet)
service.GET("/configs", rt.configsGet)
service.GET("/config", rt.configGetByKey)
@@ -433,6 +486,8 @@ func (rt *Router) Config(r *gin.Engine) {
service.GET("/notify-tpls", rt.notifyTplGets)
service.POST("/task-record-add", rt.taskRecordAdd)
service.GET("/user-variable/decrypt", rt.userVariableGetDecryptByService)
}
}

View File

@@ -4,6 +4,7 @@ import (
"net/http"
"sort"
"strings"
"time"
"github.com/ccfos/nightingale/v6/models"
@@ -182,10 +183,19 @@ func (rt *Router) alertCurEventDel(c *gin.Context) {
ginx.BindJSON(c, &f)
f.Verify()
rt.checkCurEventBusiGroupRWPermission(c, f.Ids)
ginx.NewRender(c).Message(models.AlertCurEventDel(rt.Ctx, f.Ids))
}
func (rt *Router) checkCurEventBusiGroupRWPermission(c *gin.Context, ids []int64) {
set := make(map[int64]struct{})
for i := 0; i < len(f.Ids); i++ {
event, err := models.AlertCurEventGetById(rt.Ctx, f.Ids[i])
// event group id is 0, ignore perm check
set[0] = struct{}{}
for i := 0; i < len(ids); i++ {
event, err := models.AlertCurEventGetById(rt.Ctx, ids[i])
ginx.Dangerous(err)
if _, has := set[event.GroupId]; !has {
@@ -193,8 +203,6 @@ func (rt *Router) alertCurEventDel(c *gin.Context) {
set[event.GroupId] = struct{}{}
}
}
ginx.NewRender(c).Message(models.AlertCurEventDel(rt.Ctx, f.Ids))
}
func (rt *Router) alertCurEventGet(c *gin.Context) {
@@ -208,3 +216,8 @@ func (rt *Router) alertCurEventGet(c *gin.Context) {
ginx.NewRender(c).Data(event, nil)
}
func (rt *Router) alertCurEventsStatistics(c *gin.Context) {
ginx.NewRender(c).Data(models.AlertCurEventStatistics(rt.Ctx, time.Now()), nil)
}

View File

@@ -273,22 +273,11 @@ func (rt *Router) alertRuleGet(c *gin.Context) {
ginx.NewRender(c).Data(ar, err)
}
//pre validation before save rule
// pre validation before save rule
func (rt *Router) alertRuleValidation(c *gin.Context) {
var f models.AlertRule //new
ginx.BindJSON(c, &f)
arid := ginx.UrlParamInt64(c, "arid")
ar, err := models.AlertRuleGetById(rt.Ctx, arid)
ginx.Dangerous(err)
if ar == nil {
ginx.NewRender(c, http.StatusNotFound).Message("No such AlertRule")
return
}
rt.bgrwCheck(c, ar.GroupId)
if len(f.NotifyChannelsJSON) > 0 && len(f.NotifyGroupsJSON) > 0 { //Validation NotifyChannels
ngids := make([]int64, 0, len(f.NotifyChannelsJSON))
for i := range f.NotifyGroupsJSON {
@@ -305,6 +294,15 @@ func (rt *Router) alertRuleValidation(c *gin.Context) {
ancs := make([]string, 0, len(f.NotifyChannelsJSON)) //absent Notify Channels
for i := range f.NotifyChannelsJSON {
flag := true
//ignore non-default channels
switch f.NotifyChannelsJSON[i] {
case models.Dingtalk, models.Wecom, models.Feishu, models.Mm,
models.Telegram, models.Email, models.FeishuCard:
// do nothing
default:
continue
}
//default channels
for ui := range users {
if _, b := users[ui].ExtractToken(f.NotifyChannelsJSON[i]); b {
flag = false
@@ -317,7 +315,7 @@ func (rt *Router) alertRuleValidation(c *gin.Context) {
}
if len(ancs) > 0 {
ginx.NewRender(c).Message(i18n.Sprintf(c.GetHeader("X-Language"), "All users are missing notify channel configurations. Please check for missing tokens (each channel should be configured with at least one user). %s", ancs))
ginx.NewRender(c).Message("All users are missing notify channel configurations. Please check for missing tokens (each channel should be configured with at least one user). %s", ancs)
return
}

View File

@@ -14,21 +14,18 @@ import (
func (rt *Router) alertSubscribeGets(c *gin.Context) {
bgid := ginx.UrlParamInt64(c, "id")
lst, err := models.AlertSubscribeGets(rt.Ctx, bgid)
if err == nil {
ugcache := make(map[int64]*models.UserGroup)
for i := 0; i < len(lst); i++ {
ginx.Dangerous(lst[i].FillUserGroups(rt.Ctx, ugcache))
}
ginx.Dangerous(err)
rulecache := make(map[int64]string)
for i := 0; i < len(lst); i++ {
ginx.Dangerous(lst[i].FillRuleName(rt.Ctx, rulecache))
}
ugcache := make(map[int64]*models.UserGroup)
rulecache := make(map[int64]string)
for i := 0; i < len(lst); i++ {
ginx.Dangerous(lst[i].FillDatasourceIds(rt.Ctx))
}
for i := 0; i < len(lst); i++ {
ginx.Dangerous(lst[i].FillUserGroups(rt.Ctx, ugcache))
ginx.Dangerous(lst[i].FillRuleName(rt.Ctx, rulecache))
ginx.Dangerous(lst[i].FillDatasourceIds(rt.Ctx))
ginx.Dangerous(lst[i].DB2FE())
}
ginx.NewRender(c).Data(lst, err)
}
@@ -101,6 +98,7 @@ func (rt *Router) alertSubscribePut(c *gin.Context) {
"redefine_webhooks",
"severities",
"extra_config",
"busi_groups",
))
}

View File

@@ -78,7 +78,7 @@ func (rt *Router) builtinBoardCateGets(c *gin.Context) {
}
me := c.MustGet("user").(*models.User)
buildinFavoritesMap, err := models.BuiltinCateGetByUserId(rt.Ctx, me.Id)
builtinFavoritesMap, err := models.BuiltinCateGetByUserId(rt.Ctx, me.Id)
if err != nil {
logger.Warningf("get builtin favorites fail: %v", err)
}
@@ -117,7 +117,7 @@ func (rt *Router) builtinBoardCateGets(c *gin.Context) {
}
boardCate.Boards = boards
if _, ok := buildinFavoritesMap[dir]; ok {
if _, ok := builtinFavoritesMap[dir]; ok {
boardCate.Favorite = true
}
@@ -173,7 +173,7 @@ func (rt *Router) builtinAlertCateGets(c *gin.Context) {
}
me := c.MustGet("user").(*models.User)
buildinFavoritesMap, err := models.BuiltinCateGetByUserId(rt.Ctx, me.Id)
builtinFavoritesMap, err := models.BuiltinCateGetByUserId(rt.Ctx, me.Id)
if err != nil {
logger.Warningf("get builtin favorites fail: %v", err)
}
@@ -210,7 +210,7 @@ func (rt *Router) builtinAlertCateGets(c *gin.Context) {
alertCate.IconUrl = fmt.Sprintf("/api/n9e/integrations/icon/%s/%s", dir, iconFiles[0])
}
if _, ok := buildinFavoritesMap[dir]; ok {
if _, ok := builtinFavoritesMap[dir]; ok {
alertCate.Favorite = true
}
@@ -233,7 +233,7 @@ func (rt *Router) builtinAlertRules(c *gin.Context) {
}
me := c.MustGet("user").(*models.User)
buildinFavoritesMap, err := models.BuiltinCateGetByUserId(rt.Ctx, me.Id)
builtinFavoritesMap, err := models.BuiltinCateGetByUserId(rt.Ctx, me.Id)
if err != nil {
logger.Warningf("get builtin favorites fail: %v", err)
}
@@ -274,7 +274,7 @@ func (rt *Router) builtinAlertRules(c *gin.Context) {
alertCate.IconUrl = fmt.Sprintf("/api/n9e/integrations/icon/%s/%s", dir, iconFiles[0])
}
if _, ok := buildinFavoritesMap[dir]; ok {
if _, ok := builtinFavoritesMap[dir]; ok {
alertCate.Favorite = true
}
@@ -315,3 +315,26 @@ func (rt *Router) builtinIcon(c *gin.Context) {
iconPath := fp + "/" + cate + "/icon/" + ginx.UrlParamStr(c, "name")
c.File(path.Join(iconPath))
}
func (rt *Router) builtinMarkdown(c *gin.Context) {
fp := rt.Center.BuiltinIntegrationsDir
if fp == "" {
fp = path.Join(runner.Cwd, "integrations")
}
cate := ginx.UrlParamStr(c, "cate")
var markdown []byte
markdownDir := fp + "/" + cate + "/markdown"
markdownFiles, err := file.FilesUnder(markdownDir)
if err != nil {
logger.Warningf("get markdown fail: %v", err)
} else if len(markdownFiles) > 0 {
f := markdownFiles[0]
fn := markdownDir + "/" + f
markdown, err = file.ReadBytes(fn)
if err != nil {
logger.Warningf("get collect fail: %v", err)
}
}
ginx.NewRender(c).Data(string(markdown), nil)
}

View File

@@ -2,6 +2,7 @@ package router
import (
"github.com/ccfos/nightingale/v6/models"
"time"
"github.com/gin-gonic/gin"
"github.com/toolkits/pkg/ginx"
@@ -25,28 +26,49 @@ func (rt *Router) configGetByKey(c *gin.Context) {
ginx.NewRender(c).Data(config, err)
}
func (rt *Router) configPutByKey(c *gin.Context) {
var f models.Configs
ginx.BindJSON(c, &f)
username := c.MustGet("username").(string)
ginx.NewRender(c).Message(models.ConfigsSetWithUname(rt.Ctx, f.Ckey, f.Cval, username))
}
func (rt *Router) configsDel(c *gin.Context) {
var f idsForm
ginx.BindJSON(c, &f)
ginx.NewRender(c).Message(models.ConfigsDel(rt.Ctx, f.Ids))
}
func (rt *Router) configsPut(c *gin.Context) {
func (rt *Router) configsPut(c *gin.Context) { //for APIForService
var arr []models.Configs
ginx.BindJSON(c, &arr)
username := c.GetString("user")
if username == "" {
username = "default"
}
now := time.Now().Unix()
for i := 0; i < len(arr); i++ {
arr[i].UpdateBy = username
arr[i].UpdateAt = now
ginx.Dangerous(arr[i].Update(rt.Ctx))
}
ginx.NewRender(c).Message(nil)
}
func (rt *Router) configsPost(c *gin.Context) {
func (rt *Router) configsPost(c *gin.Context) { //for APIForService
var arr []models.Configs
ginx.BindJSON(c, &arr)
username := c.GetString("user")
if username == "" {
username = "default"
}
now := time.Now().Unix()
for i := 0; i < len(arr); i++ {
arr[i].CreateBy = username
arr[i].UpdateBy = username
arr[i].CreateAt = now
arr[i].UpdateAt = now
ginx.Dangerous(arr[i].Add(rt.Ctx))
}

View File

@@ -3,6 +3,7 @@ package router
import (
"crypto/tls"
"fmt"
"io"
"net/http"
"net/url"
"strings"
@@ -25,6 +26,11 @@ type listReq struct {
}
func (rt *Router) datasourceList(c *gin.Context) {
if rt.DatasourceCheckHook(c) {
Render(c, []int{}, nil)
return
}
var req listReq
ginx.BindJSON(c, &req)
@@ -65,6 +71,11 @@ func (rt *Router) datasourceBriefs(c *gin.Context) {
}
func (rt *Router) datasourceUpsert(c *gin.Context) {
if rt.DatasourceCheckHook(c) {
Render(c, []int{}, nil)
return
}
var req models.Datasource
ginx.BindJSON(c, &req)
username := Username(c)
@@ -127,14 +138,33 @@ func DatasourceCheck(ds models.Datasource) error {
if ds.PluginType == models.PROMETHEUS {
subPath := "/api/v1/query"
query := url.Values{}
if strings.Contains(fullURL, "loki") {
if ds.HTTPJson.IsLoki() {
subPath = "/api/v1/labels"
} else {
query.Add("query", "1+1")
}
fullURL = fmt.Sprintf("%s%s?%s", ds.HTTPJson.Url, subPath, query.Encode())
req, err = http.NewRequest("POST", fullURL, nil)
req, err = http.NewRequest("GET", fullURL, nil)
if err != nil {
logger.Errorf("Error creating request: %v", err)
return fmt.Errorf("request url:%s failed", fullURL)
}
} else if ds.PluginType == models.TDENGINE {
fullURL = fmt.Sprintf("%s/rest/sql", ds.HTTPJson.Url)
req, err = http.NewRequest("POST", fullURL, strings.NewReader("show databases"))
if err != nil {
logger.Errorf("Error creating request: %v", err)
return fmt.Errorf("request url:%s failed", fullURL)
}
}
if ds.PluginType == models.LOKI {
subPath := "/api/v1/labels"
fullURL = fmt.Sprintf("%s%s", ds.HTTPJson.Url, subPath)
req, err = http.NewRequest("GET", fullURL, nil)
if err != nil {
logger.Errorf("Error creating request: %v", err)
return fmt.Errorf("request url:%s failed", fullURL)
@@ -158,13 +188,19 @@ func DatasourceCheck(ds models.Datasource) error {
if resp.StatusCode != 200 {
logger.Errorf("Error making request: %v\n", resp.StatusCode)
return fmt.Errorf("request url:%s failed code:%d", fullURL, resp.StatusCode)
body, _ := io.ReadAll(resp.Body)
return fmt.Errorf("request url:%s failed code:%d body:%s", fullURL, resp.StatusCode, string(body))
}
return nil
}
func (rt *Router) datasourceGet(c *gin.Context) {
if rt.DatasourceCheckHook(c) {
Render(c, []int{}, nil)
return
}
var req models.Datasource
ginx.BindJSON(c, &req)
err := req.Get(rt.Ctx)
@@ -172,6 +208,11 @@ func (rt *Router) datasourceGet(c *gin.Context) {
}
func (rt *Router) datasourceUpdataStatus(c *gin.Context) {
if rt.DatasourceCheckHook(c) {
Render(c, []int{}, nil)
return
}
var req models.Datasource
ginx.BindJSON(c, &req)
username := Username(c)
@@ -181,6 +222,11 @@ func (rt *Router) datasourceUpdataStatus(c *gin.Context) {
}
func (rt *Router) datasourceDel(c *gin.Context) {
if rt.DatasourceCheckHook(c) {
Render(c, []int{}, nil)
return
}
var ids []int64
ginx.BindJSON(c, &ids)
err := models.DatasourceDel(rt.Ctx, ids)

View File

@@ -41,6 +41,7 @@ func (rt *Router) esIndexPatternPut(c *gin.Context) {
}
f.UpdateBy = c.MustGet("username").(string)
ginx.NewRender(c).Message(esIndexPattern.Update(rt.Ctx, f))
}
@@ -67,7 +68,7 @@ func (rt *Router) esIndexPatternGetList(c *gin.Context) {
} else {
lst, err = models.EsIndexPatternGets(rt.Ctx, "")
}
ginx.NewRender(c).Data(lst, err)
}

View File

@@ -44,6 +44,10 @@ func (rt *Router) statistic(c *gin.Context) {
statistics, err = models.DatasourceStatistics(rt.Ctx)
ginx.NewRender(c).Data(statistics, err)
return
case "user_variable":
statistics, err = models.ConfigsUserVariableStatistics(rt.Ctx)
ginx.NewRender(c).Data(statistics, err)
return
default:
ginx.Bomb(http.StatusBadRequest, "invalid name")
}

View File

@@ -4,12 +4,15 @@ import (
"compress/gzip"
"encoding/json"
"io/ioutil"
"sort"
"strings"
"time"
"github.com/ccfos/nightingale/v6/models"
"github.com/gin-gonic/gin"
"github.com/toolkits/pkg/ginx"
"github.com/toolkits/pkg/logger"
)
func (rt *Router) heartbeat(c *gin.Context) {
@@ -49,13 +52,38 @@ func (rt *Router) heartbeat(c *gin.Context) {
items[req.Hostname] = struct{}{}
rt.IdentSet.MSet(items)
gid := ginx.QueryInt64(c, "gid", 0)
if target, has := rt.TargetCache.Get(req.Hostname); has && target != nil {
gid := ginx.QueryInt64(c, "gid", 0)
hostIp := strings.TrimSpace(req.HostIp)
if gid != 0 {
target, has := rt.TargetCache.Get(req.Hostname)
if has && target.GroupId != gid {
err = models.TargetUpdateBgid(rt.Ctx, []string{req.Hostname}, gid, false)
filed := make(map[string]interface{})
if gid != 0 && gid != target.GroupId {
filed["group_id"] = gid
}
if hostIp != "" && hostIp != target.HostIp {
filed["host_ip"] = hostIp
}
if len(req.GlobalLabels) > 0 {
lst := []string{}
for k, v := range req.GlobalLabels {
lst = append(lst, k+"="+v)
}
sort.Strings(lst)
labels := strings.Join(lst, " ")
if target.Tags != labels {
filed["tags"] = labels
}
}
if len(filed) > 0 {
err := target.UpdateFieldsMap(rt.Ctx, filed)
if err != nil {
logger.Errorf("update target fields failed, err: %v", err)
}
}
logger.Debugf("heartbeat field:%+v target: %v", filed, *target)
}
ginx.NewRender(c).Message(err)

View File

@@ -230,7 +230,7 @@ func (rt *Router) loginCallback(c *gin.Context) {
ret, err := rt.Sso.OIDC.Callback(rt.Redis, c.Request.Context(), code, state)
if err != nil {
logger.Debugf("sso.callback() get ret %+v error %v", ret, err)
logger.Errorf("sso_callback fail. code:%s, state:%s, get ret: %+v. error: %v", code, state, ret, err)
ginx.NewRender(c).Data(CallbackOutput{}, err)
return
}
@@ -515,10 +515,23 @@ type SsoConfigOutput struct {
}
func (rt *Router) ssoConfigNameGet(c *gin.Context) {
var oidcDisplayName, casDisplayName, oauthDisplayName string
if rt.Sso.OIDC != nil {
oidcDisplayName = rt.Sso.OIDC.GetDisplayName()
}
if rt.Sso.CAS != nil {
casDisplayName = rt.Sso.CAS.GetDisplayName()
}
if rt.Sso.OAuth2 != nil {
oauthDisplayName = rt.Sso.OAuth2.GetDisplayName()
}
ginx.NewRender(c).Data(SsoConfigOutput{
OidcDisplayName: rt.Sso.OIDC.GetDisplayName(),
CasDisplayName: rt.Sso.CAS.GetDisplayName(),
OauthDisplayName: rt.Sso.OAuth2.GetDisplayName(),
OidcDisplayName: oidcDisplayName,
CasDisplayName: casDisplayName,
OauthDisplayName: oauthDisplayName,
}, nil)
}
@@ -543,8 +556,7 @@ func (rt *Router) ssoConfigUpdate(c *gin.Context) {
var config oidcx.Config
err := toml.Unmarshal([]byte(f.Content), &config)
ginx.Dangerous(err)
err = rt.Sso.OIDC.Reload(config)
rt.Sso.OIDC, err = oidcx.New(config)
ginx.Dangerous(err)
case "CAS":
var config cas.Config
@@ -568,7 +580,7 @@ type RSAConfigOutput struct {
func (rt *Router) rsaConfigGet(c *gin.Context) {
publicKey := ""
if rt.HTTP.RSA.OpenRSA {
if len(rt.HTTP.RSA.RSAPublicKey) > 0 {
publicKey = base64.StdEncoding.EncodeToString(rt.HTTP.RSA.RSAPublicKey)
}
ginx.NewRender(c).Data(RSAConfigOutput{

View File

@@ -5,6 +5,7 @@ import (
"strings"
"time"
"github.com/ccfos/nightingale/v6/alert/common"
"github.com/ccfos/nightingale/v6/models"
"github.com/gin-gonic/gin"
@@ -29,16 +30,41 @@ func (rt *Router) alertMuteGets(c *gin.Context) {
}
func (rt *Router) alertMuteAdd(c *gin.Context) {
var f models.AlertMute
ginx.BindJSON(c, &f)
username := c.MustGet("username").(string)
f.CreateBy = username
f.GroupId = ginx.UrlParamInt64(c, "id")
ginx.NewRender(c).Message(f.Add(rt.Ctx))
}
// Preview events (alert_cur_event) that match the mute strategy based on the following criteria:
// business group ID (group_id, group_id), product (prod, rule_prod),
// alert event severity (severities, severity), and event tags (tags, tags).
// For products of type not 'host', also consider the category (cate, cate) and datasource ID (datasource_ids, datasource_id).
func (rt *Router) alertMutePreview(c *gin.Context) {
//Generally the match of events would be less.
var f models.AlertMute
ginx.BindJSON(c, &f)
f.GroupId = ginx.UrlParamInt64(c, "id")
ginx.Dangerous(f.Verify()) //verify and parse tags json to ITags
events, err := models.AlertCurEventGetsFromAlertMute(rt.Ctx, &f)
ginx.Dangerous(err)
matchEvents := make([]*models.AlertCurEvent, 0, len(events))
for i := 0; i < len(events); i++ {
events[i].DB2Mem()
if common.MatchTags(events[i].TagsMap, f.ITags) {
matchEvents = append(matchEvents, events[i])
}
}
ginx.NewRender(c).Data(matchEvents, err)
}
func (rt *Router) alertMuteAddByService(c *gin.Context) {
var f models.AlertMute
ginx.BindJSON(c, &f)

View File

@@ -2,15 +2,19 @@ package router
import (
"encoding/json"
"fmt"
"strings"
"github.com/ccfos/nightingale/v6/alert/aconf"
"github.com/ccfos/nightingale/v6/alert/sender"
"github.com/ccfos/nightingale/v6/memsto"
"github.com/ccfos/nightingale/v6/models"
"github.com/pelletier/go-toml/v2"
"github.com/ccfos/nightingale/v6/pkg/tplx"
"github.com/gin-gonic/gin"
"github.com/pelletier/go-toml/v2"
"github.com/toolkits/pkg/ginx"
"github.com/toolkits/pkg/str"
)
func (rt *Router) webhookGets(c *gin.Context) {
@@ -41,8 +45,8 @@ func (rt *Router) webhookPuts(c *gin.Context) {
data, err := json.Marshal(webhooks)
ginx.Dangerous(err)
ginx.NewRender(c).Message(models.ConfigsSet(rt.Ctx, models.WEBHOOKKEY, string(data)))
username := c.MustGet("username").(string)
ginx.NewRender(c).Message(models.ConfigsSetWithUname(rt.Ctx, models.WEBHOOKKEY, string(data), username))
}
func (rt *Router) notifyScriptGet(c *gin.Context) {
@@ -65,8 +69,8 @@ func (rt *Router) notifyScriptPut(c *gin.Context) {
data, err := json.Marshal(notifyScript)
ginx.Dangerous(err)
ginx.NewRender(c).Message(models.ConfigsSet(rt.Ctx, models.NOTIFYSCRIPT, string(data)))
username := c.MustGet("username").(string)
ginx.NewRender(c).Message(models.ConfigsSetWithUname(rt.Ctx, models.NOTIFYSCRIPT, string(data), username))
}
func (rt *Router) notifyChannelGets(c *gin.Context) {
@@ -101,8 +105,8 @@ func (rt *Router) notifyChannelPuts(c *gin.Context) {
data, err := json.Marshal(notifyChannels)
ginx.Dangerous(err)
ginx.NewRender(c).Message(models.ConfigsSet(rt.Ctx, models.NOTIFYCHANNEL, string(data)))
username := c.MustGet("username").(string)
ginx.NewRender(c).Message(models.ConfigsSetWithUname(rt.Ctx, models.NOTIFYCHANNEL, string(data), username))
}
func (rt *Router) notifyContactGets(c *gin.Context) {
@@ -137,8 +141,8 @@ func (rt *Router) notifyContactPuts(c *gin.Context) {
data, err := json.Marshal(notifyContacts)
ginx.Dangerous(err)
ginx.NewRender(c).Message(models.ConfigsSet(rt.Ctx, models.NOTIFYCONTACT, string(data)))
username := c.MustGet("username").(string)
ginx.NewRender(c).Message(models.ConfigsSetWithUname(rt.Ctx, models.NOTIFYCONTACT, string(data), username))
}
func (rt *Router) notifyConfigGet(c *gin.Context) {
@@ -158,10 +162,12 @@ func (rt *Router) notifyConfigGet(c *gin.Context) {
func (rt *Router) notifyConfigPut(c *gin.Context) {
var f models.Configs
ginx.BindJSON(c, &f)
userVariableMap := rt.NotifyConfigCache.ConfigCache.Get()
text := tplx.ReplaceMacroVariables(f.Ckey, f.Cval, userVariableMap)
switch f.Ckey {
case models.SMTP:
var smtp aconf.SMTPConfig
err := toml.Unmarshal([]byte(f.Cval), &smtp)
err := toml.Unmarshal([]byte(text), &smtp)
ginx.Dangerous(err)
case models.IBEX:
var ibex aconf.Ibex
@@ -170,24 +176,55 @@ func (rt *Router) notifyConfigPut(c *gin.Context) {
default:
ginx.Bomb(200, "key %s can not modify", f.Ckey)
}
err := models.ConfigsSet(rt.Ctx, f.Ckey, f.Cval)
if err != nil {
ginx.Bomb(200, err.Error())
}
username := c.MustGet("username").(string)
//insert or update build-in config
ginx.Dangerous(models.ConfigsSetWithUname(rt.Ctx, f.Ckey, f.Cval, username))
if f.Ckey == models.SMTP {
// 重置邮件发送器
var smtp aconf.SMTPConfig
err := toml.Unmarshal([]byte(f.Cval), &smtp)
ginx.Dangerous(err)
if smtp.Host == "" || smtp.Port == 0 {
ginx.Bomb(200, "smtp host or port can not be empty")
}
smtp, errSmtp := SmtpValidate(text)
ginx.Dangerous(errSmtp)
go sender.RestartEmailSender(smtp)
}
ginx.NewRender(c).Message(nil)
}
func SmtpValidate(text string) (aconf.SMTPConfig, error) {
var smtp aconf.SMTPConfig
var err error
err = toml.Unmarshal([]byte(text), &smtp)
if err != nil {
return smtp, err
}
if smtp.Host == "" || smtp.Port == 0 {
return smtp, fmt.Errorf("smtp host or port can not be empty")
}
return smtp, err
}
type form struct {
models.Configs
Email string `json:"email"`
}
// After configuring the aconf.SMTPConfig, users can choose to perform a test. In this test, the function attempts to send an email
func (rt *Router) attemptSendEmail(c *gin.Context) {
var f form
ginx.BindJSON(c, &f)
if f.Email = strings.TrimSpace(f.Email); f.Email == "" || !str.IsMail(f.Email) {
ginx.Bomb(200, "email(%s) invalid", f.Email)
}
if f.Ckey != models.SMTP {
ginx.Bomb(200, "config(%v) invalid", f)
}
userVariableMap := rt.NotifyConfigCache.ConfigCache.Get()
text := tplx.ReplaceMacroVariables(f.Ckey, f.Cval, userVariableMap)
smtp, err := SmtpValidate(text)
ginx.Dangerous(err)
ginx.NewRender(c).Message(sender.SendEmail("Email test", "email content", []string{f.Email}, smtp))
}

View File

@@ -21,7 +21,7 @@ func (rt *Router) notifyTplGets(c *gin.Context) {
for _, channel := range models.DefaultChannels {
m[channel] = struct{}{}
}
m["mailsubject"] = struct{}{}
m[models.EmailSubject] = struct{}{}
lst, err := models.NotifyTplGets(rt.Ctx)
for i := 0; i < len(lst); i++ {

View File

@@ -3,6 +3,7 @@ package router
import (
"context"
"crypto/tls"
"fmt"
"net"
"net/http"
"net/http/httputil"
@@ -164,10 +165,18 @@ func (rt *Router) dsProxy(c *gin.Context) {
transportPut(dsId, ds.UpdatedAt, transport)
}
modifyResponse := func(r *http.Response) error {
if r.StatusCode == http.StatusUnauthorized {
return fmt.Errorf("unauthorized access")
}
return nil
}
proxy := &httputil.ReverseProxy{
Director: director,
Transport: transport,
ErrorHandler: errFunc,
Director: director,
Transport: transport,
ErrorHandler: errFunc,
ModifyResponse: modifyResponse,
}
proxy.ServeHTTP(c.Writer, c.Request)

View File

@@ -45,29 +45,32 @@ func (rt *Router) targetGets(c *gin.Context) {
bgid := ginx.QueryInt64(c, "bgid", -1)
query := ginx.QueryStr(c, "query", "")
limit := ginx.QueryInt(c, "limit", 30)
downtime := ginx.QueryInt64(c, "downtime", 0)
dsIds := queryDatasourceIds(c)
var bgids []int64
var err error
if bgid == -1 {
// 全部对象的情况,找到用户有权限的业务组
user := c.MustGet("user").(*models.User)
userGroupIds, err := models.MyGroupIds(rt.Ctx, user.Id)
ginx.Dangerous(err)
if !user.IsAdmin() {
// 如果是非 admin 用户,全部对象的情况,找到用户有权限的业务组
userGroupIds, err := models.MyGroupIds(rt.Ctx, user.Id)
ginx.Dangerous(err)
bgids, err = models.BusiGroupIds(rt.Ctx, userGroupIds)
ginx.Dangerous(err)
bgids, err = models.BusiGroupIds(rt.Ctx, userGroupIds)
ginx.Dangerous(err)
// 将未分配业务组的对象也加入到列表中
bgids = append(bgids, 0)
// 将未分配业务组的对象也加入到列表中
bgids = append(bgids, 0)
}
} else {
bgids = append(bgids, bgid)
}
total, err := models.TargetTotal(rt.Ctx, bgids, dsIds, query)
total, err := models.TargetTotal(rt.Ctx, bgids, dsIds, query, downtime)
ginx.Dangerous(err)
list, err := models.TargetGets(rt.Ctx, bgids, dsIds, query, limit, ginx.Offset(c, limit))
list, err := models.TargetGets(rt.Ctx, bgids, dsIds, query, downtime, limit, ginx.Offset(c, limit))
ginx.Dangerous(err)
if err == nil {
@@ -78,6 +81,12 @@ func (rt *Router) targetGets(c *gin.Context) {
for i := 0; i < len(list); i++ {
ginx.Dangerous(list[i].FillGroup(rt.Ctx, cache))
keys = append(keys, models.WrapIdent(list[i].Ident))
if now.Unix()-list[i].UpdateAt < 60 {
list[i].TargetUp = 2
} else if now.Unix()-list[i].UpdateAt < 180 {
list[i].TargetUp = 1
}
}
if len(keys) > 0 {
@@ -103,12 +112,6 @@ func (rt *Router) targetGets(c *gin.Context) {
// 未上报过元数据的主机cpuNum默认为-1, 用于前端展示 unknown
list[i].CpuNum = -1
}
if now.Unix()-list[i].UnixTime/1000 < 60 {
list[i].TargetUp = 2
} else if now.Unix()-list[i].UnixTime/1000 < 180 {
list[i].TargetUp = 1
}
}
}

View File

@@ -92,6 +92,7 @@ func (f *taskForm) Verify() error {
if f.Script == "" {
return fmt.Errorf("arg(script) is required")
}
f.Script = strings.Replace(f.Script, "\r\n", "\n", -1)
if str.Dangerous(f.Args) {
return fmt.Errorf("arg(args) is dangerous")

View File

@@ -48,6 +48,19 @@ func (rt *Router) taskTplGet(c *gin.Context) {
}, err)
}
func (rt *Router) taskTplGetByService(c *gin.Context) {
tid := ginx.UrlParamInt64(c, "tid")
tpl, err := models.TaskTplGetById(rt.Ctx, tid)
ginx.Dangerous(err)
if tpl == nil {
ginx.Bomb(404, "no such task template")
}
ginx.NewRender(c).Data(tpl, err)
}
type taskTplForm struct {
Title string `json:"title" binding:"required"`
Batch int `json:"batch"`

View File

@@ -0,0 +1,117 @@
package router
import (
"net/http"
"github.com/ccfos/nightingale/v6/center/cconf"
"github.com/ccfos/nightingale/v6/models"
"github.com/gin-gonic/gin"
"github.com/toolkits/pkg/ginx"
"github.com/toolkits/pkg/logger"
)
type databasesQueryForm struct {
Cate string `json:"cate" form:"cate"`
DatasourceId int64 `json:"datasource_id" form:"datasource_id"`
}
func (rt *Router) tdengineDatabases(c *gin.Context) {
var f databasesQueryForm
ginx.BindJSON(c, &f)
tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
if tdClient == nil {
ginx.NewRender(c, http.StatusNotFound).Message("No such datasource")
return
}
databases, err := tdClient.GetDatabases()
ginx.NewRender(c).Data(databases, err)
}
type tablesQueryForm struct {
Cate string `json:"cate"`
DatasourceId int64 `json:"datasource_id" `
Database string `json:"db"`
IsStable bool `json:"is_stable"`
}
// get tdengine tables
func (rt *Router) tdengineTables(c *gin.Context) {
var f tablesQueryForm
ginx.BindJSON(c, &f)
tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
if tdClient == nil {
ginx.NewRender(c, http.StatusNotFound).Message("No such datasource")
return
}
tables, err := tdClient.GetTables(f.Database, f.IsStable)
ginx.NewRender(c).Data(tables, err)
}
type columnsQueryForm struct {
Cate string `json:"cate"`
DatasourceId int64 `json:"datasource_id" `
Database string `json:"db"`
Table string `json:"table"`
}
// get tdengine columns
func (rt *Router) tdengineColumns(c *gin.Context) {
var f columnsQueryForm
ginx.BindJSON(c, &f)
tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
if tdClient == nil {
ginx.NewRender(c, http.StatusNotFound).Message("No such datasource")
return
}
columns, err := tdClient.GetColumns(f.Database, f.Table)
ginx.NewRender(c).Data(columns, err)
}
func (rt *Router) QueryData(c *gin.Context) {
var f models.QueryParam
ginx.BindJSON(c, &f)
var resp []*models.DataResp
var err error
tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
for _, q := range f.Querys {
datas, err := tdClient.Query(q)
ginx.Dangerous(err)
resp = append(resp, datas...)
}
ginx.NewRender(c).Data(resp, err)
}
func (rt *Router) QueryLog(c *gin.Context) {
var f models.QueryParam
ginx.BindJSON(c, &f)
tdClient := rt.TdendgineClients.GetCli(f.DatasourceId)
if len(f.Querys) == 0 {
ginx.Bomb(200, "querys is empty")
return
}
data, err := tdClient.QueryLog(f.Querys[0])
logger.Debugf("tdengine query:%s result: %+v", f.Querys[0], data)
ginx.NewRender(c).Data(data, err)
}
// query sql template
func (rt *Router) QuerySqlTemplate(c *gin.Context) {
cate := ginx.QueryStr(c, "cate")
m := make(map[string]string)
switch cate {
case models.TDENGINE:
m = cconf.TDengineSQLTpl
}
ginx.NewRender(c).Data(m, nil)
}

View File

@@ -0,0 +1,59 @@
package router
import (
"strings"
"time"
"github.com/ccfos/nightingale/v6/models"
"github.com/gin-gonic/gin"
"github.com/toolkits/pkg/ginx"
)
func (rt *Router) userVariableConfigGets(context *gin.Context) {
userVariables, err := models.ConfigsGetUserVariable(rt.Ctx)
ginx.NewRender(context).Data(userVariables, err)
}
func (rt *Router) userVariableConfigAdd(context *gin.Context) {
var f models.Configs
ginx.BindJSON(context, &f)
f.Ckey = strings.TrimSpace(f.Ckey)
//insert external config. needs to make sure not plaintext for an encrypted type config
username := context.MustGet("username").(string)
now := time.Now().Unix()
f.CreateBy = username
f.UpdateBy = username
f.CreateAt = now
f.UpdateAt = now
ginx.NewRender(context).Message(models.ConfigsUserVariableInsert(rt.Ctx, f))
}
func (rt *Router) userVariableConfigPut(context *gin.Context) {
var f models.Configs
ginx.BindJSON(context, &f)
f.Id = ginx.UrlParamInt64(context, "id")
f.Ckey = strings.TrimSpace(f.Ckey)
//update external config. needs to make sure not plaintext for an encrypted type config
//updating with struct it will update all fields ("ckey", "cval", "note", "encrypted", "update_by", "update_at"), not non-zero fields.
f.UpdateBy = context.MustGet("username").(string)
f.UpdateAt = time.Now().Unix()
ginx.NewRender(context).Message(models.ConfigsUserVariableUpdate(rt.Ctx, f))
}
func (rt *Router) userVariableConfigDel(context *gin.Context) {
id := ginx.UrlParamInt64(context, "id")
configs, err := models.ConfigGet(rt.Ctx, id)
ginx.Dangerous(err)
if configs != nil && configs.External == models.ConfigExternal {
ginx.NewRender(context).Message(models.ConfigsDel(rt.Ctx, []int64{id}))
} else {
ginx.NewRender(context).Message(nil)
}
}
func (rt *Router) userVariableGetDecryptByService(context *gin.Context) {
decryptMap, decryptErr := models.ConfigUserVariableGetDecryptMap(rt.Ctx, rt.HTTP.RSA.RSAPrivateKey, rt.HTTP.RSA.RSAPassWord)
ginx.NewRender(context).Data(decryptMap, decryptErr)
}

View File

@@ -29,6 +29,8 @@ Port = 389
BaseDn = 'dc=example,dc=org'
BindUser = 'cn=manager,dc=example,dc=org'
BindPass = '*******'
# openldap format e.g. (&(uid=%s))
# AD format e.g. (&(sAMAccountName=%s))
AuthFilter = '(&(uid=%s))'
CoverAttributes = true
TLS = false

View File

@@ -12,6 +12,7 @@ import (
"github.com/ccfos/nightingale/v6/pkg/osx"
"github.com/ccfos/nightingale/v6/pkg/version"
"github.com/toolkits/pkg/net/tcpx"
"github.com/toolkits/pkg/runner"
)
@@ -31,6 +32,8 @@ func main() {
printEnv()
tcpx.WaitHosts()
cleanFunc, err := center.Initialize(*configDir, *cryptoKey)
if err != nil {
log.Fatalln("failed to initialize:", err)

View File

@@ -2,6 +2,7 @@ package main
import (
"context"
"errors"
"fmt"
"github.com/ccfos/nightingale/v6/alert"
@@ -16,6 +17,7 @@ import (
"github.com/ccfos/nightingale/v6/prom"
"github.com/ccfos/nightingale/v6/pushgw/idents"
"github.com/ccfos/nightingale/v6/pushgw/writer"
"github.com/ccfos/nightingale/v6/tdengine"
alertrt "github.com/ccfos/nightingale/v6/alert/router"
pushgwrt "github.com/ccfos/nightingale/v6/pushgw/router"
@@ -31,7 +33,10 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
if err != nil {
return nil, err
}
//check CenterApi is default value
if len(config.CenterApi.Addrs) < 1 {
return nil, errors.New("failed to init config: the CenterApi configuration is missing")
}
ctx := ctx.NewContext(context.Background(), nil, false, config.CenterApi)
syncStats := memsto.NewSyncStats()
@@ -45,18 +50,21 @@ func Initialize(configDir string, cryptoKey string) (func(), error) {
pushgwRouter.Config(r)
if !config.Alert.Disable {
configCache := memsto.NewConfigCache(ctx, syncStats, nil, "")
alertStats := astats.NewSyncStats()
dsCache := memsto.NewDatasourceCache(ctx, syncStats)
alertMuteCache := memsto.NewAlertMuteCache(ctx, syncStats)
alertRuleCache := memsto.NewAlertRuleCache(ctx, syncStats)
notifyConfigCache := memsto.NewNotifyConfigCache(ctx)
notifyConfigCache := memsto.NewNotifyConfigCache(ctx, configCache)
userCache := memsto.NewUserCache(ctx, syncStats)
userGroupCache := memsto.NewUserGroupCache(ctx, syncStats)
promClients := prom.NewPromClient(ctx, config.Alert.Heartbeat)
tdengineClients := tdengine.NewTdengineClient(ctx, config.Alert.Heartbeat)
externalProcessors := process.NewExternalProcessors()
alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache, alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, userCache, userGroupCache)
alert.Start(config.Alert, config.Pushgw, syncStats, alertStats, externalProcessors, targetCache, busiGroupCache, alertMuteCache,
alertRuleCache, notifyConfigCache, dsCache, ctx, promClients, tdengineClients, userCache, userGroupCache)
alertrtRouter := alertrt.New(config.HTTP, config.Alert, alertMuteCache, targetCache, busiGroupCache, alertStats, ctx, externalProcessors)

View File

@@ -48,7 +48,7 @@ func InitConfig(configDir, cryptoKey string) (*ConfigType, error) {
}
config.Pushgw.PreCheck()
config.Alert.PreCheck()
config.Alert.PreCheck(configDir)
config.Center.PreCheck()
err := decryptConfig(config, cryptoKey)

View File

@@ -77,4 +77,3 @@ Committer 记录并公示于 **[COMMITTERS](https://github.com/ccfos/nightingale
2. 提问之前请先搜索 [Github Issues](https://github.com/ccfos/nightingale/issues "Github Issue")
3. 我们优先推荐通过提交 [Github Issue](https://github.com/ccfos/nightingale/issues "Github Issue") 来提问,如果[有问题点击这里](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Fbug&template=bug_report.yml "有问题点击这里") | [有需求建议点击这里](https://github.com/ccfos/nightingale/issues/new?assignees=&labels=kind%2Ffeature&template=enhancement.md "有需求建议点击这里")
最后,我们推荐你加入微信群,针对相关开放式问题,相互交流咨询 (请先加好友:[UlricGO](https://www.gitlink.org.cn/UlricQin/gist/tree/master/self.jpeg "UlricGO") 备注:夜莺加群+姓名+公司,交流群里会有开发者团队和专业、热心的群友回答问题)。

View File

@@ -1,7 +1,5 @@
ibexetc
compose-host-network
compose-postgres
compose-bridge
initsql
mysqletc
n9eetc
prometc
build.sh
docker-compose.yaml

View File

@@ -5,8 +5,7 @@ WORKDIR /app
ADD n9e /app/
ADD etc /app/
ADD integrations /app/integrations/
ADD --chmod=755 https://github.com/ufoscout/docker-compose-wait/releases/download/2.11.0/wait_x86_64 /wait
RUN chmod +x /wait && pip install requests
RUN pip install requests
EXPOSE 17000

View File

@@ -1,4 +1,3 @@
FROM flashcatcloud/toolbox:v0.0.1 as toolbox
FROM --platform=$TARGETPLATFORM python:3-slim
@@ -6,7 +5,6 @@ WORKDIR /app
ADD n9e /app/
ADD etc /app/
ADD integrations /app/integrations/
COPY --chmod=755 --from=toolbox /toolbox/wait_aarch64 /wait
EXPOSE 17000

View File

@@ -0,0 +1,132 @@
version: "3.7"
networks:
nightingale:
driver: bridge
services:
mysql:
image: "mysql:8"
container_name: mysql
hostname: mysql
restart: always
environment:
TZ: Asia/Shanghai
MYSQL_ROOT_PASSWORD: 1234
volumes:
- ./mysqldata:/var/lib/mysql/
- ../initsql:/docker-entrypoint-initdb.d/
- ./etc-mysql/my.cnf:/etc/my.cnf
networks:
- nightingale
ports:
- "3306:3306"
redis:
image: "redis:6.2"
container_name: redis
hostname: redis
restart: always
environment:
TZ: Asia/Shanghai
networks:
- nightingale
ports:
- "6379:6379"
# prometheus:
# image: prom/prometheus
# container_name: prometheus
# hostname: prometheus
# restart: always
# environment:
# TZ: Asia/Shanghai
# volumes:
# - ./etc-prometheus:/etc/prometheus
# command:
# - "--config.file=/etc/prometheus/prometheus.yml"
# - "--storage.tsdb.path=/prometheus"
# - "--web.console.libraries=/usr/share/prometheus/console_libraries"
# - "--web.console.templates=/usr/share/prometheus/consoles"
# - "--enable-feature=remote-write-receiver"
# - "--query.lookback-delta=2m"
# networks:
# - nightingale
# ports:
# - "9090:9090"
victoriametrics:
image: victoriametrics/victoria-metrics:v1.79.12
container_name: victoriametrics
hostname: victoriametrics
restart: always
environment:
TZ: Asia/Shanghai
ports:
- "8428:8428"
networks:
- nightingale
command:
- "--loggerTimezone=Asia/Shanghai"
ibex:
image: flashcatcloud/ibex:v1.2.0
container_name: ibex
hostname: ibex
restart: always
environment:
GIN_MODE: release
TZ: Asia/Shanghai
WAIT_HOSTS: mysql:3306
volumes:
- ./etc-ibex:/app/etc
networks:
- nightingale
ports:
- "10090:10090"
- "20090:20090"
depends_on:
- mysql
command: >
sh -c "/app/ibex server"
nightingale:
image: flashcatcloud/nightingale:latest
container_name: nightingale
hostname: nightingale
restart: always
environment:
GIN_MODE: release
TZ: Asia/Shanghai
WAIT_HOSTS: mysql:3306, redis:6379
volumes:
- ./etc-nightingale:/app/etc
networks:
- nightingale
ports:
- "17000:17000"
depends_on:
- mysql
- redis
- victoriametrics
command: >
sh -c "/app/n9e"
categraf:
image: "flashcatcloud/categraf:latest"
container_name: "categraf"
hostname: "categraf01"
restart: always
environment:
TZ: Asia/Shanghai
HOST_PROC: /hostfs/proc
HOST_SYS: /hostfs/sys
HOST_MOUNT_PREFIX: /hostfs
WAIT_HOSTS: nightingale:17000, ibex:20090
volumes:
- ./etc-categraf:/etc/categraf/conf
- /:/hostfs
networks:
- nightingale
depends_on:
- nightingale

View File

@@ -0,0 +1,83 @@
[global]
# whether print configs
print_configs = false
# add label(agent_hostname) to series
# "" -> auto detect hostname
# "xx" -> use specified string xx
# "$hostname" -> auto detect hostname
# "$ip" -> auto detect ip
# "$hostname-$ip" -> auto detect hostname and ip to replace the vars
hostname = "$HOSTNAME"
# will not add label(agent_hostname) if true
omit_hostname = false
# s | ms
precision = "ms"
# global collect interval
interval = 15
[global.labels]
source="categraf"
# region = "shanghai"
# env = "localhost"
[writer_opt]
# default: 2000
batch = 2000
# channel(as queue) size
chan_size = 10000
[[writers]]
url = "http://nightingale:17000/prometheus/v1/write"
# Basic auth username
basic_auth_user = ""
# Basic auth password
basic_auth_pass = ""
# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100
[http]
enable = false
address = ":9100"
print_access = false
run_mode = "release"
[heartbeat]
enable = true
# report os version cpu.util mem.util metadata
url = "http://nightingale:17000/v1/n9e/heartbeat"
# interval, unit: s
interval = 10
# Basic auth username
basic_auth_user = ""
# Basic auth password
basic_auth_pass = ""
## Optional headers
# headers = ["X-From", "categraf", "X-Xyz", "abc"]
# timeout settings, unit: ms
timeout = 5000
dial_timeout = 2500
max_idle_conns_per_host = 100
[ibex]
enable = true
## ibex flush interval
interval = "1000ms"
## n9e ibex server rpc address
servers = ["ibex:20090"]
## temp script dir
meta_dir = "./meta"

View File

@@ -0,0 +1,10 @@
# # collect interval
# interval = 15
# # By default stats will be gathered for all mount points.
# # Set mount_points will restrict the stats to only the specified mount points.
mount_points = ["/"]
# Ignore mount points by filesystem type.
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs", "nsfs"]

View File

@@ -0,0 +1,4 @@
[[instances]]
urls = [
"http://nightingale:17000/metrics"
]

View File

@@ -0,0 +1,97 @@
# debug, release
RunMode = "release"
[Log]
# log write dir
Dir = "logs-server"
# log level: DEBUG INFO WARNING ERROR
Level = "DEBUG"
# stdout, stderr, file
Output = "stdout"
# # rotate by time
# KeepHours: 4
# # rotate by size
# RotateNum = 3
# # unit: MB
# RotateSize = 256
[HTTP]
Enable = true
# http listening address
Host = "0.0.0.0"
# http listening port
Port = 10090
# https cert file path
CertFile = ""
# https key file path
KeyFile = ""
# whether print access log
PrintAccessLog = true
# whether enable pprof
PProf = false
# http graceful shutdown timeout, unit: s
ShutdownTimeout = 30
# max content length: 64M
MaxContentLength = 67108864
# http server read timeout, unit: s
ReadTimeout = 20
# http server write timeout, unit: s
WriteTimeout = 40
# http server idle timeout, unit: s
IdleTimeout = 120
[BasicAuth]
# using when call apis
ibex = "ibex"
[RPC]
Listen = "0.0.0.0:20090"
[Heartbeat]
# auto detect if blank
IP = ""
# unit: ms
Interval = 1000
[Output]
# database | remote
ComeFrom = "database"
AgtdPort = 2090
[Gorm]
# enable debug mode or not
Debug = false
# mysql postgres
DBType = "mysql"
# unit: s
MaxLifetime = 7200
# max open connections
MaxOpenConns = 150
# max idle connections
MaxIdleConns = 50
# table prefix
TablePrefix = ""
[MySQL]
# mysql address host:port
Address = "mysql:3306"
# mysql username
User = "root"
# mysql password
Password = "1234"
# database name
DBName = "ibex"
# connection params
Parameters = "charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
[Postgres]
# pg address host:port
Address = "postgres:5432"
# pg user
User = "root"
# pg password
Password = "1234"
# database name
DBName = "ibex"
# ssl mode
SSLMode = "disable"

View File

@@ -0,0 +1,183 @@
[Global]
RunMode = "release"
[Log]
# log write dir
Dir = "logs"
# log level: DEBUG INFO WARNING ERROR
Level = "DEBUG"
# stdout, stderr, file
Output = "stdout"
# # rotate by time
# KeepHours = 4
# # rotate by size
# RotateNum = 3
# # unit: MB
# RotateSize = 256
[HTTP]
# http listening address
Host = "0.0.0.0"
# http listening port
Port = 17000
# https cert file path
CertFile = ""
# https key file path
KeyFile = ""
# whether print access log
PrintAccessLog = false
# whether enable pprof
PProf = false
# expose prometheus /metrics?
ExposeMetrics = true
# http graceful shutdown timeout, unit: s
ShutdownTimeout = 30
# max content length: 64M
MaxContentLength = 67108864
# http server read timeout, unit: s
ReadTimeout = 20
# http server write timeout, unit: s
WriteTimeout = 40
# http server idle timeout, unit: s
IdleTimeout = 120
[HTTP.ShowCaptcha]
Enable = false
[HTTP.APIForAgent]
Enable = true
# [HTTP.APIForAgent.BasicAuth]
# user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
[HTTP.APIForService]
Enable = true
[HTTP.APIForService.BasicAuth]
user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
[HTTP.JWTAuth]
# signing key
SigningKey = "5b94a0fd640fe2765af826acfe42d151"
# unit: min
AccessExpired = 1500
# unit: min
RefreshExpired = 10080
RedisKeyPrefix = "/jwt/"
[HTTP.ProxyAuth]
# if proxy auth enabled, jwt auth is disabled
Enable = false
# username key in http proxy header
HeaderUserNameKey = "X-User-Name"
DefaultRoles = ["Standard"]
[HTTP.RSA]
# open RSA
OpenRSA = false
# Before replacing the key file, make sure that there are no encrypted variables in the database "configs".
# It is recommended to decrypt and remove all encrypted values from the database before replacing the key file.
# This will prevent any potential issues with accessing or decrypting the variables using the new key file.
# RSA public key (auto carete)
RSAPublicKeyPath = "etc/rsa/public.pem"
# RSA private key (auto carete)
RSAPrivateKeyPath = "etc/rsa/private.pem"
# RSA private key password
RSAPassWord = "n9e@n9e!"
[DB]
# postgres: DSN="host=127.0.0.1 port=5432 user=root dbname=n9e_v6 password=1234 sslmode=disable"
DSN="root:1234@tcp(mysql:3306)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
# enable debug mode or not
Debug = false
# mysql postgres
DBType = "mysql"
# unit: s
MaxLifetime = 7200
# max open connections
MaxOpenConns = 150
# max idle connections
MaxIdleConns = 50
# table prefix
TablePrefix = ""
# enable auto migrate or not
# EnableAutoMigrate = false
[Redis]
# address, ip:port or ip1:port,ip2:port for cluster and sentinel(SentinelAddrs)
Address = "redis:6379"
# Username = ""
# Password = ""
# DB = 0
# UseTLS = false
# TLSMinVersion = "1.2"
# standalone cluster sentinel
RedisType = "standalone"
# Mastername for sentinel type
# MasterName = "mymaster"
# SentinelUsername = ""
# SentinelPassword = ""
[Alert]
[Alert.Heartbeat]
# auto detect if blank
IP = ""
# unit ms
Interval = 1000
EngineName = "default"
# [Alert.Alerting]
# NotifyConcurrency = 10
[Center]
MetricsYamlFile = "./etc/metrics.yaml"
I18NHeaderKey = "X-Language"
[Center.AnonymousAccess]
PromQuerier = false
AlertDetail = false
[Pushgw]
# use target labels in database instead of in series
LabelRewrite = true
# # default busigroup key name
# BusiGroupLabelKey = "busigroup"
# ForceUseServerTS = false
# [Pushgw.DebugSample]
# ident = "xx"
# __name__ = "xx"
# [Pushgw.WriterOpt]
# QueueMaxSize = 1000000
# QueuePopSize = 1000
[[Pushgw.Writers]]
# Url = "http://127.0.0.1:8480/insert/0/prometheus/api/v1/write"
Url = "http://victoriametrics:8428/api/v1/write"
# Basic auth username
BasicAuthUser = ""
# Basic auth password
BasicAuthPass = ""
# timeout settings, unit: ms
Headers = ["X-From", "n9e"]
Timeout = 10000
DialTimeout = 3000
TLSHandshakeTimeout = 30000
ExpectContinueTimeout = 1000
IdleConnTimeout = 90000
# time duration, unit: ms
KeepAlive = 30000
MaxConnsPerHost = 0
MaxIdleConns = 100
MaxIdleConnsPerHost = 100
## Optional TLS Config
# UseTLS = false
# TLSCA = "/etc/n9e/ca.pem"
# TLSCert = "/etc/n9e/cert.pem"
# TLSKey = "/etc/n9e/key.pem"
# InsecureSkipVerify = false
# [[Writers.WriteRelabels]]
# Action = "replace"
# SourceLabels = ["__address__"]
# Regex = "([^:]+)(?::\\d+)?"
# Replacement = "$1:80"
# TargetLabel = "__address__"

View File

@@ -0,0 +1,774 @@
zh:
ip_conntrack_count: 连接跟踪表条目总数单位int, count
ip_conntrack_max: 连接跟踪表最大容量单位int, size
cpu_usage_idle: CPU空闲率单位%
cpu_usage_active: CPU使用率单位%
cpu_usage_system: CPU内核态时间占比单位%
cpu_usage_user: CPU用户态时间占比单位%
cpu_usage_nice: 低优先级用户态CPU时间占比也就是进程nice值被调整为1-19之间的CPU时间。这里注意nice可取值范围是-20到19数值越大优先级反而越低单位%
cpu_usage_iowait: CPU等待I/O的时间占比单位%
cpu_usage_irq: CPU处理硬中断的时间占比单位%
cpu_usage_softirq: CPU处理软中断的时间占比单位%
cpu_usage_steal: 在虚拟机环境下有该指标表示CPU被其他虚拟机争用的时间占比超过20就表示争抢严重单位%
cpu_usage_guest: 通过虚拟化运行其他操作系统的时间也就是运行虚拟机的CPU时间占比单位%
cpu_usage_guest_nice: 以低优先级运行虚拟机的时间占比(单位:%
disk_free: 硬盘分区剩余量单位byte
disk_used: 硬盘分区使用量单位byte
disk_used_percent: 硬盘分区使用率(单位:%
disk_total: 硬盘分区总量单位byte
disk_inodes_free: 硬盘分区inode剩余量
disk_inodes_used: 硬盘分区inode使用量
disk_inodes_total: 硬盘分区inode总量
diskio_io_time: 从设备视角来看I/O请求总时间队列中有I/O请求就计数单位毫秒counter类型需要用函数求rate才有使用价值
diskio_iops_in_progress: 已经分配给设备驱动且尚未完成的IO请求不包含在队列中但尚未分配给设备驱动的IO请求gauge类型
diskio_merged_reads: 相邻读请求merge读的次数counter类型
diskio_merged_writes: 相邻写请求merge写的次数counter类型
diskio_read_bytes: 读取的byte数量counter类型需要用函数求rate才有使用价值
diskio_read_time: 读请求总时间单位毫秒counter类型需要用函数求rate才有使用价值
diskio_reads: 读请求次数counter类型需要用函数求rate才有使用价值
diskio_weighted_io_time: 从I/O请求视角来看I/O等待总时间如果同时有多个I/O请求时间会叠加单位毫秒
diskio_write_bytes: 写入的byte数量counter类型需要用函数求rate才有使用价值
diskio_write_time: 写请求总时间单位毫秒counter类型需要用函数求rate才有使用价值
diskio_writes: 写请求次数counter类型需要用函数求rate才有使用价值
kernel_boot_time: 内核启动时间
kernel_context_switches: 内核上下文切换次数
kernel_entropy_avail: linux系统内部的熵池
kernel_interrupts: 内核中断次数
kernel_processes_forked: fork的进程数
mem_active: 活跃使用的内存总数(包括cache和buffer内存)
mem_available: 可用内存大小(bytes)
mem_available_percent: 内存剩余百分比(0~100)
mem_buffered: 用来给文件做缓冲大小
mem_cached: 被高速缓冲存储器cache memory用的内存的大小等于 diskcache minus SwapCache
mem_commit_limit: 根据超额分配比率('vm.overcommit_ratio'这是当前在系统上分配可用的内存总量这个限制只是在模式2('vm.overcommit_memory')时启用
mem_committed_as: 目前在系统上分配的内存量。是所有进程申请的内存的总和
mem_dirty: 等待被写回到磁盘的内存大小
mem_free: 空闲内存大小(bytes)
mem_high_free: 未被使用的高位内存大小
mem_high_total: 高位内存总大小Highmem是指所有内存高于860MB的物理内存,Highmem区域供用户程序使用或用于页面缓存。该区域不是直接映射到内核空间。内核必须使用不同的手法使用该段内存
mem_huge_page_size: 每个大页的大小
mem_huge_pages_free: 池中尚未分配的 HugePages 数量
mem_huge_pages_total: 预留HugePages的总个数
mem_inactive: 空闲的内存数(包括free和avalible的内存)
mem_low_free: 未被使用的低位大小
mem_low_total: 低位内存总大小,低位可以达到高位内存一样的作用,而且它还能够被内核用来记录一些自己的数据结构
mem_mapped: 设备和文件等映射的大小
mem_page_tables: 管理内存分页页面的索引表的大小
mem_shared: 多个进程共享的内存总额
mem_slab: 内核数据结构缓存的大小,可以减少申请和释放内存带来的消耗
mem_sreclaimable: 可收回Slab的大小
mem_sunreclaim: 不可收回Slab的大小SUnreclaim+SReclaimableSlab
mem_swap_cached: 被高速缓冲存储器cache memory用的交换空间的大小已经被交换出来的内存但仍然被存放在swapfile中。用来在需要的时候很快的被替换而不需要再次打开I/O端口
mem_swap_free: 未被使用交换空间的大小
mem_swap_total: 交换空间的总大小
mem_total: 内存总数
mem_used: 已用内存数
mem_used_percent: 已用内存数百分比(0~100)
mem_vmalloc_chunk: 最大的连续未被使用的vmalloc区域
mem_vmalloc_totalL: 可以vmalloc虚拟内存大小
mem_vmalloc_used: vmalloc已使用的虚拟内存大小
mem_write_back: 正在被写回到磁盘的内存大小
mem_write_back_tmp: FUSE用于临时写回缓冲区的内存
net_bytes_recv: 网卡收包总数(bytes)计算每秒速率时需要用到rate/irate函数
net_bytes_sent: 网卡发包总数(bytes)计算每秒速率时需要用到rate/irate函数
net_drop_in: 网卡收丢包数量
net_drop_out: 网卡发丢包数量
net_err_in: 网卡收包错误数量
net_err_out: 网卡发包错误数量
net_packets_recv: 网卡收包数量
net_packets_sent: 网卡发包数量
net_bits_recv: 网卡收包总数(bits)计算每秒速率时需要用到rate/irate函数
net_bits_sent: 网卡发包总数(bits)计算每秒速率时需要用到rate/irate函数
netstat_tcp_established: ESTABLISHED状态的网络链接数
netstat_tcp_fin_wait1: FIN_WAIT1状态的网络链接数
netstat_tcp_fin_wait2: FIN_WAIT2状态的网络链接数
netstat_tcp_last_ack: LAST_ACK状态的网络链接数
netstat_tcp_listen: LISTEN状态的网络链接数
netstat_tcp_syn_recv: SYN_RECV状态的网络链接数
netstat_tcp_syn_sent: SYN_SENT状态的网络链接数
netstat_tcp_time_wait: TIME_WAIT状态的网络链接数
netstat_udp_socket: UDP状态的网络链接数
netstat_sockets_used: 已使用的所有协议套接字总量
netstat_tcp_inuse: 正在使用正在侦听的TCP套接字数量
netstat_tcp_orphan: 无主不属于任何进程的TCP连接数无用、待销毁的TCP socket数
netstat_tcp_tw: TIME_WAIT状态的TCP连接数
netstat_tcp_alloc: 已分配已建立、已申请到sk_buff的TCP套接字数量
netstat_tcp_mem: TCP套接字内存Page使用量
netstat_udp_inuse: 在使用的UDP套接字数量
netstat_udp_mem: UDP套接字内存Page使用量
netstat_udplite_inuse: 正在使用的 udp lite 数量
netstat_raw_inuse: 正在使用的 raw socket 数量
netstat_frag_inuse: ip fragement 数量
netstat_frag_memory: ip fragement 已经分配的内存(byte
#[ping]
ping_percent_packet_loss: ping数据包丢失百分比(%)
ping_result_code: ping返回码('0','1')
net_response_result_code: 网络探测结果0表示正常非0表示异常
net_response_response_time: 网络探测时延,单位:秒
processes_blocked: 不可中断的睡眠状态下的进程数('U','D','L')
processes_dead: 回收中的进程数('X')
processes_idle: 挂起的空闲进程数('I')
processes_paging: 分页进程数('P')
processes_running: 运行中的进程数('R')
processes_sleeping: 可中断进程数('S')
processes_stopped: 暂停状态进程数('T')
processes_total: 总进程数
processes_total_threads: 总线程数
processes_unknown: 未知状态进程数
processes_zombies: 僵尸态进程数('Z')
swap_used_percent: Swap空间换出数据量
system_load1: 1分钟平均load值
system_load5: 5分钟平均load值
system_load15: 15分钟平均load值
system_load_norm_1: 1分钟平均load值/逻辑CPU个数
system_load_norm_5: 5分钟平均load值/逻辑CPU个数
system_load_norm_15: 15分钟平均load值/逻辑CPU个数
system_n_users: 用户数
system_n_cpus: CPU核数
system_uptime: 系统启动时间
nginx_accepts: 自nginx启动起,与客户端建立过得连接总数
nginx_active: 当前nginx正在处理的活动连接数,等于Reading/Writing/Waiting总和
nginx_handled: 自nginx启动起,处理过的客户端连接总数
nginx_reading: 正在读取HTTP请求头部的连接总数
nginx_requests: 自nginx启动起,处理过的客户端请求总数,由于存在HTTP Krrp-Alive请求,该值会大于handled值
nginx_upstream_check_fall: upstream_check模块检测到后端失败的次数
nginx_upstream_check_rise: upstream_check模块对后端的检测次数
nginx_upstream_check_status_code: 后端upstream的状态,up为1,down为0
nginx_waiting: 开启 keep-alive 的情况下,这个值等于 active (reading+writing), 意思就是 Nginx 已经处理完正在等候下一次请求指令的驻留连接
nginx_writing: 正在向客户端发送响应的连接总数
http_response_content_length: HTTP消息实体的传输长度
http_response_http_response_code: http响应状态码
http_response_response_time: http响应用时
http_response_result_code: url探测结果0为正常否则url无法访问
# [aws cloudwatch rds]
cloudwatch_aws_rds_bin_log_disk_usage_average: rds 磁盘使用平均值
cloudwatch_aws_rds_bin_log_disk_usage_maximum: rds 磁盘使用量最大值
cloudwatch_aws_rds_bin_log_disk_usage_minimum: rds binlog 磁盘使用量最低
cloudwatch_aws_rds_bin_log_disk_usage_sample_count: rds binlog 磁盘使用情况样本计数
cloudwatch_aws_rds_bin_log_disk_usage_sum: rds binlog 磁盘使用总和
cloudwatch_aws_rds_burst_balance_average: rds 突发余额平均值
cloudwatch_aws_rds_burst_balance_maximum: rds 突发余额最大值
cloudwatch_aws_rds_burst_balance_minimum: rds 突发余额最低
cloudwatch_aws_rds_burst_balance_sample_count: rds 突发平衡样本计数
cloudwatch_aws_rds_burst_balance_sum: rds 突发余额总和
cloudwatch_aws_rds_cpu_utilization_average: rds cpu 利用率平均值
cloudwatch_aws_rds_cpu_utilization_maximum: rds cpu 利用率最大值
cloudwatch_aws_rds_cpu_utilization_minimum: rds cpu 利用率最低
cloudwatch_aws_rds_cpu_utilization_sample_count: rds cpu 利用率样本计数
cloudwatch_aws_rds_cpu_utilization_sum: rds cpu 利用率总和
cloudwatch_aws_rds_database_connections_average: rds 数据库连接平均值
cloudwatch_aws_rds_database_connections_maximum: rds 数据库连接数最大值
cloudwatch_aws_rds_database_connections_minimum: rds 数据库连接最小
cloudwatch_aws_rds_database_connections_sample_count: rds 数据库连接样本数
cloudwatch_aws_rds_database_connections_sum: rds 数据库连接总和
cloudwatch_aws_rds_db_load_average: rds db 平均负载
cloudwatch_aws_rds_db_load_cpu_average: rds db 负载 cpu 平均值
cloudwatch_aws_rds_db_load_cpu_maximum: rds db 负载 cpu 最大值
cloudwatch_aws_rds_db_load_cpu_minimum: rds db 负载 cpu 最小值
cloudwatch_aws_rds_db_load_cpu_sample_count: rds db 加载 CPU 样本数
cloudwatch_aws_rds_db_load_cpu_sum: rds db 加载cpu总和
cloudwatch_aws_rds_db_load_maximum: rds 数据库负载最大值
cloudwatch_aws_rds_db_load_minimum: rds 数据库负载最小值
cloudwatch_aws_rds_db_load_non_cpu_average: rds 加载非 CPU 平均值
cloudwatch_aws_rds_db_load_non_cpu_maximum: rds 加载非 cpu 最大值
cloudwatch_aws_rds_db_load_non_cpu_minimum: rds 加载非 cpu 最小值
cloudwatch_aws_rds_db_load_non_cpu_sample_count: rds 加载非 cpu 样本计数
cloudwatch_aws_rds_db_load_non_cpu_sum: rds 加载非cpu总和
cloudwatch_aws_rds_db_load_sample_count: rds db 加载样本计数
cloudwatch_aws_rds_db_load_sum: rds db 负载总和
cloudwatch_aws_rds_disk_queue_depth_average: rds 磁盘队列深度平均值
cloudwatch_aws_rds_disk_queue_depth_maximum: rds 磁盘队列深度最大值
cloudwatch_aws_rds_disk_queue_depth_minimum: rds 磁盘队列深度最小值
cloudwatch_aws_rds_disk_queue_depth_sample_count: rds 磁盘队列深度样本计数
cloudwatch_aws_rds_disk_queue_depth_sum: rds 磁盘队列深度总和
cloudwatch_aws_rds_ebs_byte_balance__average: rds ebs 字节余额平均值
cloudwatch_aws_rds_ebs_byte_balance__maximum: rds ebs 字节余额最大值
cloudwatch_aws_rds_ebs_byte_balance__minimum: rds ebs 字节余额最低
cloudwatch_aws_rds_ebs_byte_balance__sample_count: rds ebs 字节余额样本数
cloudwatch_aws_rds_ebs_byte_balance__sum: rds ebs 字节余额总和
cloudwatch_aws_rds_ebsio_balance__average: rds ebsio 余额平均值
cloudwatch_aws_rds_ebsio_balance__maximum: rds ebsio 余额最大值
cloudwatch_aws_rds_ebsio_balance__minimum: rds ebsio 余额最低
cloudwatch_aws_rds_ebsio_balance__sample_count: rds ebsio 平衡样本计数
cloudwatch_aws_rds_ebsio_balance__sum: rds ebsio 余额总和
cloudwatch_aws_rds_free_storage_space_average: rds 免费存储空间平均
cloudwatch_aws_rds_free_storage_space_maximum: rds 最大可用存储空间
cloudwatch_aws_rds_free_storage_space_minimum: rds 最低可用存储空间
cloudwatch_aws_rds_free_storage_space_sample_count: rds 可用存储空间样本数
cloudwatch_aws_rds_free_storage_space_sum: rds 免费存储空间总和
cloudwatch_aws_rds_freeable_memory_average: rds 可用内存平均值
cloudwatch_aws_rds_freeable_memory_maximum: rds 最大可用内存
cloudwatch_aws_rds_freeable_memory_minimum: rds 最小可用内存
cloudwatch_aws_rds_freeable_memory_sample_count: rds 可释放内存样本数
cloudwatch_aws_rds_freeable_memory_sum: rds 可释放内存总和
cloudwatch_aws_rds_lvm_read_iops_average: rds lvm 读取 iops 平均值
cloudwatch_aws_rds_lvm_read_iops_maximum: rds lvm 读取 iops 最大值
cloudwatch_aws_rds_lvm_read_iops_minimum: rds lvm 读取 iops 最低
cloudwatch_aws_rds_lvm_read_iops_sample_count: rds lvm 读取 iops 样本计数
cloudwatch_aws_rds_lvm_read_iops_sum: rds lvm 读取 iops 总和
cloudwatch_aws_rds_lvm_write_iops_average: rds lvm 写入 iops 平均值
cloudwatch_aws_rds_lvm_write_iops_maximum: rds lvm 写入 iops 最大值
cloudwatch_aws_rds_lvm_write_iops_minimum: rds lvm 写入 iops 最低
cloudwatch_aws_rds_lvm_write_iops_sample_count: rds lvm 写入 iops 样本计数
cloudwatch_aws_rds_lvm_write_iops_sum: rds lvm 写入 iops 总和
cloudwatch_aws_rds_network_receive_throughput_average: rds 网络接收吞吐量平均
cloudwatch_aws_rds_network_receive_throughput_maximum: rds 网络接收吞吐量最大值
cloudwatch_aws_rds_network_receive_throughput_minimum: rds 网络接收吞吐量最小值
cloudwatch_aws_rds_network_receive_throughput_sample_count: rds 网络接收吞吐量样本计数
cloudwatch_aws_rds_network_receive_throughput_sum: rds 网络接收吞吐量总和
cloudwatch_aws_rds_network_transmit_throughput_average: rds 网络传输吞吐量平均值
cloudwatch_aws_rds_network_transmit_throughput_maximum: rds 网络传输吞吐量最大
cloudwatch_aws_rds_network_transmit_throughput_minimum: rds 网络传输吞吐量最小值
cloudwatch_aws_rds_network_transmit_throughput_sample_count: rds 网络传输吞吐量样本计数
cloudwatch_aws_rds_network_transmit_throughput_sum: rds 网络传输吞吐量总和
cloudwatch_aws_rds_read_iops_average: rds 读取 iops 平均值
cloudwatch_aws_rds_read_iops_maximum: rds 最大读取 iops
cloudwatch_aws_rds_read_iops_minimum: rds 读取 iops 最低
cloudwatch_aws_rds_read_iops_sample_count: rds 读取 iops 样本计数
cloudwatch_aws_rds_read_iops_sum: rds 读取 iops 总和
cloudwatch_aws_rds_read_latency_average: rds 读取延迟平均值
cloudwatch_aws_rds_read_latency_maximum: rds 读取延迟最大值
cloudwatch_aws_rds_read_latency_minimum: rds 最小读取延迟
cloudwatch_aws_rds_read_latency_sample_count: rds 读取延迟样本计数
cloudwatch_aws_rds_read_latency_sum: rds 读取延迟总和
cloudwatch_aws_rds_read_throughput_average: rds 读取吞吐量平均值
cloudwatch_aws_rds_read_throughput_maximum: rds 最大读取吞吐量
cloudwatch_aws_rds_read_throughput_minimum: rds 最小读取吞吐量
cloudwatch_aws_rds_read_throughput_sample_count: rds 读取吞吐量样本计数
cloudwatch_aws_rds_read_throughput_sum: rds 读取吞吐量总和
cloudwatch_aws_rds_swap_usage_average: rds 交换使用平均值
cloudwatch_aws_rds_swap_usage_maximum: rds 交换使用最大值
cloudwatch_aws_rds_swap_usage_minimum: rds 交换使用量最低
cloudwatch_aws_rds_swap_usage_sample_count: rds 交换使用示例计数
cloudwatch_aws_rds_swap_usage_sum: rds 交换使用总和
cloudwatch_aws_rds_write_iops_average: rds 写入 iops 平均值
cloudwatch_aws_rds_write_iops_maximum: rds 写入 iops 最大值
cloudwatch_aws_rds_write_iops_minimum: rds 写入 iops 最低
cloudwatch_aws_rds_write_iops_sample_count: rds 写入 iops 样本计数
cloudwatch_aws_rds_write_iops_sum: rds 写入 iops 总和
cloudwatch_aws_rds_write_latency_average: rds 写入延迟平均值
cloudwatch_aws_rds_write_latency_maximum: rds 最大写入延迟
cloudwatch_aws_rds_write_latency_minimum: rds 写入延迟最小值
cloudwatch_aws_rds_write_latency_sample_count: rds 写入延迟样本计数
cloudwatch_aws_rds_write_latency_sum: rds 写入延迟总和
cloudwatch_aws_rds_write_throughput_average: rds 写入吞吐量平均值
cloudwatch_aws_rds_write_throughput_maximum: rds 最大写入吞吐量
cloudwatch_aws_rds_write_throughput_minimum: rds 写入吞吐量最小值
cloudwatch_aws_rds_write_throughput_sample_count: rds 写入吞吐量样本计数
cloudwatch_aws_rds_write_throughput_sum: rds 写入吞吐量总和
en:
ip_conntrack_count: the number of entries in the conntrack tableunitint, count
ip_conntrack_max: the max capacity of the conntrack tableunitint, size
cpu_usage_idle: "CPU idle rate(unit%)"
cpu_usage_active: "CPU usage rate(unit%)"
cpu_usage_system: "CPU kernel state time proportion(unit%)"
cpu_usage_user: "CPU user attitude time proportion(unit%)"
cpu_usage_nice: "The proportion of low priority CPU time, that is, the process NICE value is adjusted to the CPU time between 1-19. Note here that the value range of NICE is -20 to 19, the larger the value, the lower the priority, the lower the priority(unit%)"
cpu_usage_iowait: "CPU waiting for I/O time proportion(unit%)"
cpu_usage_irq: "CPU processing hard interrupt time proportion(unit%)"
cpu_usage_softirq: "CPU processing soft interrupt time proportion(unit%)"
cpu_usage_steal: "In the virtual machine environment, there is this indicator, which means that the CPU is used by other virtual machines for the proportion of time.(unit%)"
cpu_usage_guest: "The time to run other operating systems by virtualization, that is, the proportion of CPU time running the virtual machine(unit%)"
cpu_usage_guest_nice: "The proportion of time to run the virtual machine at low priority(unit%)"
disk_free: "The remaining amount of the hard disk partition (unit: byte)"
disk_used: "Hard disk partitional use (unit: byte)"
disk_used_percent: "Hard disk partitional use rate (unit:%)"
disk_total: "Total amount of hard disk partition (unit: byte)"
disk_inodes_free: "Hard disk partition INODE remaining amount"
disk_inodes_used: "Hard disk partition INODE usage amount"
disk_inodes_total: "The total amount of hard disk partition INODE"
diskio_io_time: "From the perspective of the device perspective, the total time of I/O request, the I/O request in the queue is count (unit: millisecond), the counter type, you need to use the function to find the value"
diskio_iops_in_progress: "IO requests that have been assigned to device -driven and have not yet been completed, not included in the queue but not yet assigned to the device -driven IO request, Gauge type"
diskio_merged_reads: "The number of times of adjacent reading request Merge, the counter type"
diskio_merged_writes: "The number of times the request Merge writes, the counter type"
diskio_read_bytes: "The number of byte reads, the counter type, you need to use the function to find the Rate to use the value"
diskio_read_time: "The total time of reading request (unit: millisecond), the counter type, you need to use the function to find the Rate to have the value of use"
diskio_reads: "Read the number of requests, the counter type, you need to use the function to find the Rate to use the value"
diskio_weighted_io_time: "From the perspective of the I/O request perspective, I/O wait for the total time. If there are multiple I/O requests at the same time, the time will be superimposed (unit: millisecond)"
diskio_write_bytes: "The number of bytes written, the counter type, you need to use the function to find the Rate to use the value"
diskio_write_time: "The total time of the request (unit: millisecond), the counter type, you need to use the function to find the rate to have the value of use"
diskio_writes: "Write the number of requests, the counter type, you need to use the function to find the rate to use value"
kernel_boot_time: "Kernel startup time"
kernel_context_switches: "Number of kernel context switching times"
kernel_entropy_avail: "Entropy pool inside the Linux system"
kernel_interrupts: "Number of kernel interruption"
kernel_processes_forked: "ForK's process number"
mem_active: "The total number of memory (including Cache and BUFFER memory)"
mem_available: "Application can use memory numbers"
mem_available_percent: "Memory remaining percentage (0 ~ 100)"
mem_buffered: "Used to make buffer size for the file"
mem_cached: "The size of the memory used by the cache memory (equal to diskcache minus Swap Cache )"
mem_commit_limit: "According to the over allocation ratio ('vm.overCommit _ Ratio'), this is the current total memory that can be allocated on the system."
mem_committed_as: "Currently allocated on the system. It is the sum of the memory of all process applications"
mem_dirty: "Waiting to be written back to the memory size of the disk"
mem_free: "Senior memory number"
mem_high_free: "Unused high memory size"
mem_high_total: "The total memory size of the high memory (Highmem refers to all the physical memory that is higher than 860 MB of memory, the HighMem area is used for user programs, or for page cache. This area is not directly mapped to the kernel space. The kernels must use different methods to use this section of memory. )"
mem_huge_page_size: "The size of each big page"
mem_huge_pages_free: "The number of Huge Pages in the pool that have not been allocated"
mem_huge_pages_total: "Reserve the total number of Huge Pages"
mem_inactive: "Free memory (including the memory of free and avalible)"
mem_low_free: "Unused low size"
mem_low_total: "The total size of the low memory memory can achieve the same role of high memory, and it can be used by the kernel to record some of its own data structure"
mem_mapped: "The size of the mapping of equipment and files"
mem_page_tables: "The size of the index table of the management of the memory paging page"
mem_shared: "The total memory shared by multiple processes"
mem_slab: "The size of the kernel data structure cache can reduce the consumption of application and release memory"
mem_sreclaimable: "The size of the SLAB can be recovered"
mem_sunreclaim: "The size of the SLAB cannot be recovered(SUnreclaim+SReclaimableSlab)"
mem_swap_cached: "The size of the swap space used by the cache memory (cache memory), the memory that has been swapped out, but is still stored in the swapfile. Used to be quickly replaced when needed without opening the I/O port again"
mem_swap_free: "The size of the switching space is not used"
mem_swap_total: "The total size of the exchange space"
mem_total: "Total memory"
mem_used: "Memory number"
mem_used_percent: "The memory has been used by several percentage (0 ~ 100)"
mem_vmalloc_chunk: "The largest continuous unused vmalloc area"
mem_vmalloc_totalL: "You can vmalloc virtual memory size"
mem_vmalloc_used: "Vmalloc's virtual memory size"
mem_write_back: "The memory size of the disk is being written back to the disk"
mem_write_back_tmp: "Fuse is used to temporarily write back the memory of the buffer area"
net_bytes_recv: "Total inbound traffic(bytes) of network card"
net_bytes_sent: "Total outbound traffic(bytes) of network card"
net_bits_recv: "Total inbound traffic(bits) of network card"
net_bits_sent: "Total outbound traffic(bits) of network card"
net_drop_in: "The number of packets for network cards"
net_drop_out: "The number of packets issued by the network card"
net_err_in: "The number of incorrect packets of the network card"
net_err_out: "Number of incorrect number of network cards"
net_packets_recv: "Net card collection quantity"
net_packets_sent: "Number of network card issuance"
netstat_tcp_established: "ESTABLISHED status network link number"
netstat_tcp_fin_wait1: "FIN _ WAIT1 status network link number"
netstat_tcp_fin_wait2: "FIN _ WAIT2 status number of network links"
netstat_tcp_last_ack: "LAST_ ACK status number of network links"
netstat_tcp_listen: "Number of network links in Listen status"
netstat_tcp_syn_recv: "SYN _ RECV status number of network links"
netstat_tcp_syn_sent: "SYN _ SENT status number of network links"
netstat_tcp_time_wait: "Time _ WAIT status network link number"
netstat_udp_socket: "Number of network links in UDP status"
processes_blocked: "The number of processes in the unreprudible sleep state('U','D','L')"
processes_dead: "Number of processes in recycling('X')"
processes_idle: "Number of idle processes hanging('I')"
processes_paging: "Number of paging processes('P')"
processes_running: "Number of processes during operation('R')"
processes_sleeping: "Can interrupt the number of processes('S')"
processes_stopped: "Pushing status process number('T')"
processes_total: "Total process number"
processes_total_threads: "Number of threads"
processes_unknown: "Unknown status process number"
processes_zombies: "Number of zombies('Z')"
swap_used_percent: "SWAP space replace the data volume"
system_load1: "1 minute average load value"
system_load5: "5 minutes average load value"
system_load15: "15 minutes average load value"
system_load_norm_1: "1 minute average load value/logical CPU number"
system_load_norm_5: "5 minutes average load value/logical CPU number"
system_load_norm_15: "15 minutes average load value/logical CPU number"
system_n_users: "User number"
system_n_cpus: "CPU nuclear number"
system_uptime: "System startup time"
nginx_accepts: "Since Nginx started, the total number of connections has been established with the client"
nginx_active: "The current number of activity connections that Nginx is being processed is equal to Reading/Writing/Waiting"
nginx_handled: "Starting from Nginx, the total number of client connections that have been processed"
nginx_reading: "Reading the total number of connections on the http request header"
nginx_requests: "Since nginx is started, the total number of client requests processed, due to the existence of HTTP Krrp - Alive requests, this value will be greater than the handled value"
nginx_upstream_check_fall: "UPStream_CHECK module detects the number of back -end failures"
nginx_upstream_check_rise: "UPSTREAM _ Check module to detect the number of back -end"
nginx_upstream_check_status_code: "The state of the backstream is 1, and the down is 0"
nginx_waiting: "When keep-alive is enabled, this value is equal to active (reading+writing), which means that Nginx has processed the resident connection that is waiting for the next request command"
nginx_writing: "The total number of connections to send a response to the client"
http_response_content_length: "HTTP message entity transmission length"
http_response_http_response_code: "http response status code"
http_response_response_time: "When http ring application"
http_response_result_code: "URL detection result 0 is normal, otherwise the URL cannot be accessed"
# [mysqld_exporter]
mysql_global_status_uptime: The number of seconds that the server has been up.(Gauge)
mysql_global_status_uptime_since_flush_status: The number of seconds since the most recent FLUSH STATUS statement.(Gauge)
mysql_global_status_queries: The number of statements executed by the server. This variable includes statements executed within stored programs, unlike the Questions variable. It does not count COM_PING or COM_STATISTICS commands.(Counter)
mysql_global_status_threads_connected: The number of currently open connections.(Counter)
mysql_global_status_connections: The number of connection attempts (successful or not) to the MySQL server.(Gauge)
mysql_global_status_max_used_connections: The maximum number of connections that have been in use simultaneously since the server started.(Gauge)
mysql_global_status_threads_running: The number of threads that are not sleeping.(Gauge)
mysql_global_status_questions: The number of statements executed by the server. This includes only statements sent to the server by clients and not statements executed within stored programs, unlike the Queries variable. This variable does not count COM_PING, COM_STATISTICS, COM_STMT_PREPARE, COM_STMT_CLOSE, or COM_STMT_RESET commands.(Counter)
mysql_global_status_threads_cached: The number of threads in the thread cache.(Counter)
mysql_global_status_threads_created: The number of threads created to handle connections. If Threads_created is big, you may want to increase the thread_cache_size value. The cache miss rate can be calculated as Threads_created/Connections.(Counter)
mysql_global_status_created_tmp_tables: The number of internal temporary tables created by the server while executing statements.(Counter)
mysql_global_status_created_tmp_disk_tables: The number of internal on-disk temporary tables created by the server while executing statements. You can compare the number of internal on-disk temporary tables created to the total number of internal temporary tables created by comparing Created_tmp_disk_tables and Created_tmp_tables values.(Counter)
mysql_global_status_created_tmp_files: How many temporary files mysqld has created.(Counter)
mysql_global_status_select_full_join: The number of joins that perform table scans because they do not use indexes. If this value is not 0, you should carefully check the indexes of your tables.(Counter)
mysql_global_status_select_full_range_join: The number of joins that used a range search on a reference table.(Counter)
mysql_global_status_select_range: The number of joins that used ranges on the first table. This is normally not a critical issue even if the value is quite large.(Counter)
mysql_global_status_select_range_check: The number of joins without keys that check for key usage after each row. If this is not 0, you should carefully check the indexes of your tables.(Counter)
mysql_global_status_select_scan: The number of joins that did a full scan of the first table.(Counter)
mysql_global_status_sort_rows: The number of sorted rows.(Counter)
mysql_global_status_sort_range: The number of sorts that were done using ranges.(Counter)
mysql_global_status_sort_merge_passes: The number of merge passes that the sort algorithm has had to do. If this value is large, you should consider increasing the value of the sort_buffer_size system variable.(Counter)
mysql_global_status_sort_scan: The number of sorts that were done by scanning the table.(Counter)
mysql_global_status_slow_queries: The number of queries that have taken more than long_query_time seconds. This counter increments regardless of whether the slow query log is enabled.(Counter)
mysql_global_status_aborted_connects: The number of failed attempts to connect to the MySQL server.(Counter)
mysql_global_status_aborted_clients: The number of connections that were aborted because the client died without closing the connection properly.(Counter)
mysql_global_status_table_locks_immediate: The number of times that a request for a table lock could be granted immediately. Locks Immediate rising and falling is normal activity.(Counter)
mysql_global_status_table_locks_waited: The number of times that a request for a table lock could not be granted immediately and a wait was needed. If this is high and you have performance problems, you should first optimize your queries, and then either split your table or tables or use replication.(Counter)
mysql_global_status_bytes_received: The number of bytes received from all clients.(Counter)
mysql_global_status_bytes_sent: The number of bytes sent to all clients.(Counter)
mysql_global_status_innodb_page_size: InnoDB page size (default 16KB). Many values are counted in pages; the page size enables them to be easily converted to bytes.(Gauge)
mysql_global_status_buffer_pool_pages: The number of pages in the InnoDB buffer pool.(Gauge)
mysql_global_status_commands_total: The number of times each xxx statement has been executed.(Counter)
mysql_global_status_handlers_total: Handler statistics are internal statistics on how MySQL is selecting, updating, inserting, and modifying rows, tables, and indexes. This is in fact the layer between the Storage Engine and MySQL.(Counter)
mysql_global_status_opened_files: The number of files that have been opened with my_open() (a mysys library function). Parts of the server that open files without using this function do not increment the count.(Counter)
mysql_global_status_open_tables: The number of tables that are open.(Gauge)
mysql_global_status_opened_tables: The number of tables that have been opened. If Opened_tables is big, your table_open_cache value is probably too small.(Counter)
mysql_global_status_table_open_cache_hits: The number of hits for open tables cache lookups.(Counter)
mysql_global_status_table_open_cache_misses: The number of misses for open tables cache lookups.(Counter)
mysql_global_status_table_open_cache_overflows: The number of overflows for the open tables cache.(Counter)
mysql_global_status_innodb_num_open_files: The number of files InnoDB currently holds open.(Gauge)
mysql_global_status_connection_errors_total: These variables provide information about errors that occur during the client connection process.(Counter)
mysql_global_status_innodb_buffer_pool_read_requests: The number of logical read requests.(Counter)
mysql_global_status_innodb_buffer_pool_reads: The number of logical reads that InnoDB could not satisfy from the buffer pool, and had to read directly from disk.(Counter)
mysql_global_variables_thread_cache_size: How many threads the server should cache for reuse.(Gauge)
mysql_global_variables_max_connections: The maximum permitted number of simultaneous client connections.(Gauge)
mysql_global_variables_innodb_buffer_pool_size: The size in bytes of the buffer pool, the memory area where InnoDB caches table and index data. The default value is 134217728 bytes (128MB).(Gauge)
mysql_global_variables_innodb_log_buffer_size: The size in bytes of the buffer that InnoDB uses to write to the log files on disk.(Gauge)
mysql_global_variables_key_buffer_size: Index blocks for MyISAM tables are buffered and are shared by all threads.(Gauge)
mysql_global_variables_query_cache_size: The amount of memory allocated for caching query results.(Gauge)
mysql_global_variables_table_open_cache: The number of open tables for all threads.(Gauge)
mysql_global_variables_open_files_limit: The number of file descriptors available to mysqld from the operating system.(Gauge)
# [redis_exporter]
redis_active_defrag_running: When activedefrag is enabled, this indicates whether defragmentation is currently active, and the CPU percentage it intends to utilize.
redis_allocator_active_bytes: Total bytes in the allocator active pages, this includes external-fragmentation.
redis_allocator_allocated_bytes: Total bytes allocated form the allocator, including internal-fragmentation. Normally the same as used_memory.
redis_allocator_frag_bytes: Delta between allocator_active and allocator_allocated. See note about mem_fragmentation_bytes.
redis_allocator_frag_ratio: Ratio between allocator_active and allocator_allocated. This is the true (external) fragmentation metric (not mem_fragmentation_ratio).
redis_allocator_resident_bytes: Total bytes resident (RSS) in the allocator, this includes pages that can be released to the OS (by MEMORY PURGE, or just waiting).
redis_allocator_rss_bytes: Delta between allocator_resident and allocator_active.
redis_allocator_rss_ratio: Ratio between allocator_resident and allocator_active. This usually indicates pages that the allocator can and probably will soon release back to the OS.
redis_aof_current_rewrite_duration_sec: Duration of the on-going AOF rewrite operation if any.
redis_aof_enabled: Flag indicating AOF logging is activated.
redis_aof_last_bgrewrite_status: Status of the last AOF rewrite operation.
redis_aof_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last AOF rewrite operation.
redis_aof_last_rewrite_duration_sec: Duration of the last AOF rewrite operation in seconds.
redis_aof_last_write_status: Status of the last write operation to the AOF.
redis_aof_rewrite_in_progress: Flag indicating a AOF rewrite operation is on-going.
redis_aof_rewrite_scheduled: Flag indicating an AOF rewrite operation will be scheduled once the on-going RDB save is complete.
redis_blocked_clients: Number of clients pending on a blocking call (BLPOP, BRPOP, BRPOPLPUSH, BLMOVE, BZPOPMIN, BZPOPMAX).
redis_client_recent_max_input_buffer_bytes: Biggest input buffer among current client connections.
redis_client_recent_max_output_buffer_bytes: Biggest output buffer among current client connections.
redis_cluster_enabled: Indicate Redis cluster is enabled.
redis_commands_duration_seconds_total: The total CPU time consumed by these commands.(Counter)
redis_commands_processed_total: Total number of commands processed by the server.(Counter)
redis_commands_total: The number of calls that reached command execution (not rejected).(Counter)
redis_config_maxclients: The value of the maxclients configuration directive. This is the upper limit for the sum of connected_clients, connected_slaves and cluster_connections.
redis_config_maxmemory: The value of the maxmemory configuration directive.
redis_connected_clients: Number of client connections (excluding connections from replicas).
redis_connected_slaves: Number of connected replicas.
redis_connections_received_total: Total number of connections accepted by the server.(Counter)
redis_cpu_sys_children_seconds_total: System CPU consumed by the background processes.(Counter)
redis_cpu_sys_seconds_total: System CPU consumed by the Redis server, which is the sum of system CPU consumed by all threads of the server process (main thread and background threads).(Counter)
redis_cpu_user_children_seconds_total: User CPU consumed by the background processes.(Counter)
redis_cpu_user_seconds_total: User CPU consumed by the Redis server, which is the sum of user CPU consumed by all threads of the server process (main thread and background threads).(Counter)
redis_db_keys: Total number of keys by DB.
redis_db_keys_expiring: Total number of expiring keys by DB
redis_defrag_hits: Number of value reallocations performed by active the defragmentation process.
redis_defrag_misses: Number of aborted value reallocations started by the active defragmentation process.
redis_defrag_key_hits: Number of keys that were actively defragmented.
redis_defrag_key_misses: Number of keys that were skipped by the active defragmentation process.
redis_evicted_keys_total: Number of evicted keys due to maxmemory limit.(Counter)
redis_expired_keys_total: Total number of key expiration events.(Counter)
redis_expired_stale_percentage: The percentage of keys probably expired.
redis_expired_time_cap_reached_total: The count of times that active expiry cycles have stopped early.
redis_exporter_last_scrape_connect_time_seconds: The duration(in seconds) to connect when scrape.
redis_exporter_last_scrape_duration_seconds: The last scrape duration.
redis_exporter_last_scrape_error: The last scrape error status.
redis_exporter_scrape_duration_seconds_count: Durations of scrapes by the exporter
redis_exporter_scrape_duration_seconds_sum: Durations of scrapes by the exporter
redis_exporter_scrapes_total: Current total redis scrapes.(Counter)
redis_instance_info: Information about the Redis instance.
redis_keyspace_hits_total: Hits total.(Counter)
redis_keyspace_misses_total: Misses total.(Counter)
redis_last_key_groups_scrape_duration_milliseconds: Duration of the last key group metrics scrape in milliseconds.
redis_last_slow_execution_duration_seconds: The amount of time needed for last slow execution, in seconds.
redis_latest_fork_seconds: The amount of time needed for last fork, in seconds.
redis_lazyfree_pending_objects: The number of objects waiting to be freed (as a result of calling UNLINK, or FLUSHDB and FLUSHALL with the ASYNC option).
redis_master_repl_offset: The server's current replication offset.
redis_mem_clients_normal: Memory used by normal clients.(Gauge)
redis_mem_clients_slaves: Memory used by replica clients - Starting Redis 7.0, replica buffers share memory with the replication backlog, so this field can show 0 when replicas don't trigger an increase of memory usage.
redis_mem_fragmentation_bytes: Delta between used_memory_rss and used_memory. Note that when the total fragmentation bytes is low (few megabytes), a high ratio (e.g. 1.5 and above) is not an indication of an issue.
redis_mem_fragmentation_ratio: Ratio between used_memory_rss and used_memory. Note that this doesn't only includes fragmentation, but also other process overheads (see the allocator_* metrics), and also overheads like code, shared libraries, stack, etc.
redis_mem_not_counted_for_eviction_bytes: (Gauge)
redis_memory_max_bytes: Max memory limit in bytes.
redis_memory_used_bytes: Total number of bytes allocated by Redis using its allocator (either standard libc, jemalloc, or an alternative allocator such as tcmalloc)
redis_memory_used_dataset_bytes: The size in bytes of the dataset (used_memory_overhead subtracted from used_memory)
redis_memory_used_lua_bytes: Number of bytes used by the Lua engine.
redis_memory_used_overhead_bytes: The sum in bytes of all overheads that the server allocated for managing its internal data structures.
redis_memory_used_peak_bytes: Peak memory consumed by Redis (in bytes)
redis_memory_used_rss_bytes: Number of bytes that Redis allocated as seen by the operating system (a.k.a resident set size). This is the number reported by tools such as top(1) and ps(1)
redis_memory_used_scripts_bytes: Number of bytes used by cached Lua scripts
redis_memory_used_startup_bytes: Initial amount of memory consumed by Redis at startup in bytes
redis_migrate_cached_sockets_total: The number of sockets open for MIGRATE purposes
redis_net_input_bytes_total: Total input bytes(Counter)
redis_net_output_bytes_total: Total output bytes(Counter)
redis_process_id: Process ID
redis_pubsub_channels: Global number of pub/sub channels with client subscriptions
redis_pubsub_patterns: Global number of pub/sub pattern with client subscriptions
redis_rdb_bgsave_in_progress: Flag indicating a RDB save is on-going
redis_rdb_changes_since_last_save: Number of changes since the last dump
redis_rdb_current_bgsave_duration_sec: Duration of the on-going RDB save operation if any
redis_rdb_last_bgsave_duration_sec: Duration of the last RDB save operation in seconds
redis_rdb_last_bgsave_status: Status of the last RDB save operation
redis_rdb_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last RDB save operation
redis_rdb_last_save_timestamp_seconds: Epoch-based timestamp of last successful RDB save
redis_rejected_connections_total: Number of connections rejected because of maxclients limit(Counter)
redis_repl_backlog_first_byte_offset: The master offset of the replication backlog buffer
redis_repl_backlog_history_bytes: Size in bytes of the data in the replication backlog buffer
redis_repl_backlog_is_active: Flag indicating replication backlog is active
redis_replica_partial_resync_accepted: The number of accepted partial resync requests(Gauge)
redis_replica_partial_resync_denied: The number of denied partial resync requests(Gauge)
redis_replica_resyncs_full: The number of full resyncs with replicas
redis_replication_backlog_bytes: Memory used by replication backlog
redis_second_repl_offset: The offset up to which replication IDs are accepted.
redis_slave_expires_tracked_keys: The number of keys tracked for expiry purposes (applicable only to writable replicas)(Gauge)
redis_slowlog_last_id: Last id of slowlog
redis_slowlog_length: Total slowlog
redis_start_time_seconds: Start time of the Redis instance since unix epoch in seconds.
redis_target_scrape_request_errors_total: Errors in requests to the exporter
redis_up: Flag indicating redis instance is up
redis_uptime_in_seconds: Number of seconds since Redis server start
# [windows_exporter]
windows_cpu_clock_interrupts_total: Total number of received and serviced clock tick interrupts(counter)
windows_cpu_core_frequency_mhz: Core frequency in megahertz(gauge)
windows_cpu_cstate_seconds_total: Time spent in low-power idle state(counter)
windows_cpu_dpcs_total: Total number of received and serviced deferred procedure calls (DPCs)(counter)
windows_cpu_idle_break_events_total: Total number of time processor was woken from idle(counter)
windows_cpu_interrupts_total: Total number of received and serviced hardware interrupts(counter)
windows_cpu_parking_status: Parking Status represents whether a processor is parked or not(gauge)
windows_cpu_processor_performance: Processor Performance is the average performance of the processor while it is executing instructions, as a percentage of the nominal performance of the processor. On some processors, Processor Performance may exceed 100%(gauge)
windows_cpu_time_total: Time that processor spent in different modes (idle, user, system, ...)(counter)
windows_cs_hostname: Labeled system hostname information as provided by ComputerSystem.DNSHostName and ComputerSystem.Domain(gauge)
windows_cs_logical_processors: ComputerSystem.NumberOfLogicalProcessors(gauge)
windows_cs_physical_memory_bytes: ComputerSystem.TotalPhysicalMemory(gauge)
windows_exporter_build_info: A metric with a constant '1' value labeled by version, revision, branch, and goversion from which windows_exporter was built.(gauge)
windows_exporter_collector_duration_seconds: Duration of a collection.(gauge)
windows_exporter_collector_success: Whether the collector was successful.(gauge)
windows_exporter_collector_timeout: Whether the collector timed out.(gauge)
windows_exporter_perflib_snapshot_duration_seconds: Duration of perflib snapshot capture(gauge)
windows_logical_disk_free_bytes: Free space in bytes (LogicalDisk.PercentFreeSpace)(gauge)
windows_logical_disk_idle_seconds_total: Seconds that the disk was idle (LogicalDisk.PercentIdleTime)(counter)
windows_logical_disk_read_bytes_total: The number of bytes transferred from the disk during read operations (LogicalDisk.DiskReadBytesPerSec)(counter)
windows_logical_disk_read_latency_seconds_total: Shows the average time, in seconds, of a read operation from the disk (LogicalDisk.AvgDiskSecPerRead)(counter)
windows_logical_disk_read_seconds_total: Seconds that the disk was busy servicing read requests (LogicalDisk.PercentDiskReadTime)(counter)
windows_logical_disk_read_write_latency_seconds_total: Shows the time, in seconds, of the average disk transfer (LogicalDisk.AvgDiskSecPerTransfer)(counter)
windows_logical_disk_reads_total: The number of read operations on the disk (LogicalDisk.DiskReadsPerSec)(counter)
windows_logical_disk_requests_queued: The number of requests queued to the disk (LogicalDisk.CurrentDiskQueueLength)(gauge)
windows_logical_disk_size_bytes: Total space in bytes (LogicalDisk.PercentFreeSpace_Base)(gauge)
windows_logical_disk_split_ios_total: The number of I/Os to the disk were split into multiple I/Os (LogicalDisk.SplitIOPerSec)(counter)
windows_logical_disk_write_bytes_total: The number of bytes transferred to the disk during write operations (LogicalDisk.DiskWriteBytesPerSec)(counter)
windows_logical_disk_write_latency_seconds_total: Shows the average time, in seconds, of a write operation to the disk (LogicalDisk.AvgDiskSecPerWrite)(counter)
windows_logical_disk_write_seconds_total: Seconds that the disk was busy servicing write requests (LogicalDisk.PercentDiskWriteTime)(counter)
windows_logical_disk_writes_total: The number of write operations on the disk (LogicalDisk.DiskWritesPerSec)(counter)
windows_net_bytes_received_total: (Network.BytesReceivedPerSec)(counter)
windows_net_bytes_sent_total: (Network.BytesSentPerSec)(counter)
windows_net_bytes_total: (Network.BytesTotalPerSec)(counter)
windows_net_current_bandwidth: (Network.CurrentBandwidth)(gauge)
windows_net_packets_outbound_discarded_total: (Network.PacketsOutboundDiscarded)(counter)
windows_net_packets_outbound_errors_total: (Network.PacketsOutboundErrors)(counter)
windows_net_packets_received_discarded_total: (Network.PacketsReceivedDiscarded)(counter)
windows_net_packets_received_errors_total: (Network.PacketsReceivedErrors)(counter)
windows_net_packets_received_total: (Network.PacketsReceivedPerSec)(counter)
windows_net_packets_received_unknown_total: (Network.PacketsReceivedUnknown)(counter)
windows_net_packets_sent_total: (Network.PacketsSentPerSec)(counter)
windows_net_packets_total: (Network.PacketsPerSec)(counter)
windows_os_info: OperatingSystem.Caption, OperatingSystem.Version(gauge)
windows_os_paging_free_bytes: OperatingSystem.FreeSpaceInPagingFiles(gauge)
windows_os_paging_limit_bytes: OperatingSystem.SizeStoredInPagingFiles(gauge)
windows_os_physical_memory_free_bytes: OperatingSystem.FreePhysicalMemory(gauge)
windows_os_process_memory_limix_bytes: OperatingSystem.MaxProcessMemorySize(gauge)
windows_os_processes: OperatingSystem.NumberOfProcesses(gauge)
windows_os_processes_limit: OperatingSystem.MaxNumberOfProcesses(gauge)
windows_os_time: OperatingSystem.LocalDateTime(gauge)
windows_os_timezone: OperatingSystem.LocalDateTime(gauge)
windows_os_users: OperatingSystem.NumberOfUsers(gauge)
windows_os_virtual_memory_bytes: OperatingSystem.TotalVirtualMemorySize(gauge)
windows_os_virtual_memory_free_bytes: OperatingSystem.FreeVirtualMemory(gauge)
windows_os_visible_memory_bytes: OperatingSystem.TotalVisibleMemorySize(gauge)
windows_service_info: A metric with a constant '1' value labeled with service information(gauge)
windows_service_start_mode: The start mode of the service (StartMode)(gauge)
windows_service_state: The state of the service (State)(gauge)
windows_service_status: The status of the service (Status)(gauge)
windows_system_context_switches_total: Total number of context switches (WMI source is PerfOS_System.ContextSwitchesPersec)(counter)
windows_system_exception_dispatches_total: Total number of exceptions dispatched (WMI source is PerfOS_System.ExceptionDispatchesPersec)(counter)
windows_system_processor_queue_length: Length of processor queue (WMI source is PerfOS_System.ProcessorQueueLength)(gauge)
windows_system_system_calls_total: Total number of system calls (WMI source is PerfOS_System.SystemCallsPersec)(counter)
windows_system_system_up_time: System boot time (WMI source is PerfOS_System.SystemUpTime)(gauge)
windows_system_threads: Current number of threads (WMI source is PerfOS_System.Threads)(gauge)
# [node_exporter]
# SYSTEM
# CPU context switch 次数
node_context_switches_total: context_switches
# Interrupts 次数
node_intr_total: Interrupts
# 运行的进程数
node_procs_running: Processes in runnable state
# 熵池大小
node_entropy_available_bits: Entropy available to random number generators
node_time_seconds: System time in seconds since epoch (1970)
node_boot_time_seconds: Node boot time, in unixtime
# CPU
node_cpu_seconds_total: Seconds the CPUs spent in each mode
node_load1: cpu load 1m
node_load5: cpu load 5m
node_load15: cpu load 15m
# MEM
# 内核态
# 内核用于缓存数据结构供自己使用的内存
node_memory_Slab_bytes: Memory used by the kernel to cache data structures for its own use
# slab中可回收的部分
node_memory_SReclaimable_bytes: SReclaimable - Part of Slab, that might be reclaimed, such as caches
# slab中不可回收的部分
node_memory_SUnreclaim_bytes: Part of Slab, that cannot be reclaimed on memory pressure
# Vmalloc内存区的大小
node_memory_VmallocTotal_bytes: Total size of vmalloc memory area
# vmalloc已分配的内存虚拟地址空间上的连续的内存
node_memory_VmallocUsed_bytes: Amount of vmalloc area which is used
# vmalloc区可用的连续最大快的大小通过此指标可以知道vmalloc可分配连续内存的最大值
node_memory_VmallocChunk_bytes: Largest contigious block of vmalloc area which is free
# 内存的硬件故障删除掉的内存页的总大小
node_memory_HardwareCorrupted_bytes: Amount of RAM that the kernel identified as corrupted / not working
# 用于在虚拟和物理内存地址之间映射的内存
node_memory_PageTables_bytes: Memory used to map between virtual and physical memory addresses (gauge)
# 内核栈内存,常驻内存,不可回收
node_memory_KernelStack_bytes: Kernel memory stack. This is not reclaimable
# 用来访问高端内存复制高端内存的临时buffer称为“bounce buffering”会降低I/O 性能
node_memory_Bounce_bytes: Memory used for block device bounce buffers
#用户态
# 单个巨页大小
node_memory_Hugepagesize_bytes: Huge Page size
# 系统分配的常驻巨页数
node_memory_HugePages_Total: Total size of the pool of huge pages
# 系统空闲的巨页数
node_memory_HugePages_Free: Huge pages in the pool that are not yet allocated
# 进程已申请但未使用的巨页数
node_memory_HugePages_Rsvd: Huge pages for which a commitment to allocate from the pool has been made, but no allocation
# 超过系统设定的常驻HugePages数量的个数
node_memory_HugePages_Surp: Huge pages in the pool above the value in /proc/sys/vm/nr_hugepages
# 透明巨页 Transparent HugePages (THP)
node_memory_AnonHugePages_bytes: Memory in anonymous huge pages
# inactivelist中的File-backed内存
node_memory_Inactive_file_bytes: File-backed memory on inactive LRU list
# inactivelist中的Anonymous内存
node_memory_Inactive_anon_bytes: Anonymous and swap cache on inactive LRU list, including tmpfs (shmem)
# activelist中的File-backed内存
node_memory_Active_file_bytes: File-backed memory on active LRU list
# activelist中的Anonymous内存
node_memory_Active_anon_bytes: Anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs
# 禁止换出的页,对应 Unevictable 链表
node_memory_Unevictable_bytes: Amount of unevictable memory that can't be swapped out for a variety of reasons
# 共享内存
node_memory_Shmem_bytes: Used shared memory (shared between several processes, thus including RAM disks)
# 匿名页内存大小
node_memory_AnonPages_bytes: Memory in user pages not backed by files
# 被关联的内存页大小
node_memory_Mapped_bytes: Used memory in mapped pages files which have been mmaped, such as libraries
# file-backed内存页缓存大小
node_memory_Cached_bytes: Parked file data (file content) cache
# 系统中有多少匿名页曾经被swap-out、现在又被swap-in并且swap-in之后页面中的内容一直没发生变化
node_memory_SwapCached_bytes: Memory that keeps track of pages that have been fetched from swap but not yet been modified
# 被mlock()系统调用锁定的内存大小
node_memory_Mlocked_bytes: Size of pages locked to memory using the mlock() system call
# 块设备(block device)所占用的缓存页
node_memory_Buffers_bytes: Block device (e.g. harddisk) cache
node_memory_SwapTotal_bytes: Memory information field SwapTotal_bytes
node_memory_SwapFree_bytes: Memory information field SwapFree_bytes
# DISK
node_filesystem_avail_bytes: Filesystem space available to non-root users in byte
node_filesystem_free_bytes: Filesystem free space in bytes
node_filesystem_size_bytes: Filesystem size in bytes
node_filesystem_files_free: Filesystem total free file nodes
node_filesystem_files: Filesystem total free file nodes
node_filefd_maximum: Max open files
node_filefd_allocated: Open files
node_filesystem_readonly: Filesystem read-only status
node_filesystem_device_error: Whether an error occurred while getting statistics for the given device
node_disk_reads_completed_total: The total number of reads completed successfully
node_disk_writes_completed_total: The total number of writes completed successfully
node_disk_reads_merged_total: The number of reads merged
node_disk_writes_merged_total: The number of writes merged
node_disk_read_bytes_total: The total number of bytes read successfully
node_disk_written_bytes_total: The total number of bytes written successfully
node_disk_io_time_seconds_total: Total seconds spent doing I/Os
node_disk_read_time_seconds_total: The total number of seconds spent by all reads
node_disk_write_time_seconds_total: The total number of seconds spent by all writes
node_disk_io_time_weighted_seconds_total: The weighted of seconds spent doing I/Os
# NET
node_network_receive_bytes_total: Network device statistic receive_bytes (counter)
node_network_transmit_bytes_total: Network device statistic transmit_bytes (counter)
node_network_receive_packets_total: Network device statistic receive_bytes
node_network_transmit_packets_total: Network device statistic transmit_bytes
node_network_receive_errs_total: Network device statistic receive_errs
node_network_transmit_errs_total: Network device statistic transmit_errs
node_network_receive_drop_total: Network device statistic receive_drop
node_network_transmit_drop_total: Network device statistic transmit_drop
node_nf_conntrack_entries: Number of currently allocated flow entries for connection tracking
node_sockstat_TCP_alloc: Number of TCP sockets in state alloc
node_sockstat_TCP_inuse: Number of TCP sockets in state inuse
node_sockstat_TCP_orphan: Number of TCP sockets in state orphan
node_sockstat_TCP_tw: Number of TCP sockets in state tw
node_netstat_Tcp_CurrEstab: Statistic TcpCurrEstab
node_sockstat_sockets_used: Number of IPv4 sockets in use
# [kafka_exporter]
kafka_brokers: count of kafka_brokers (gauge)
kafka_topic_partitions: Number of partitions for this Topic (gauge)
kafka_topic_partition_current_offset: Current Offset of a Broker at Topic/Partition (gauge)
kafka_consumergroup_current_offset: Current Offset of a ConsumerGroup at Topic/Partition (gauge)
kafka_consumer_lag_millis: Current approximation of consumer lag for a ConsumerGroup at Topic/Partition (gauge)
kafka_topic_partition_under_replicated_partition: 1 if Topic/Partition is under Replicated
# [zookeeper_exporter]
zk_znode_count: The total count of znodes stored
zk_ephemerals_count: The number of Ephemerals nodes
zk_watch_count: The number of watchers setup over Zookeeper nodes.
zk_approximate_data_size: Size of data in bytes that a zookeeper server has in its data tree
zk_outstanding_requests: Number of currently executing requests
zk_packets_sent: Count of the number of zookeeper packets sent from a server
zk_packets_received: Count of the number of zookeeper packets received by a server
zk_num_alive_connections: Number of active clients connected to a zookeeper server
zk_open_file_descriptor_count: Number of file descriptors that a zookeeper server has open
zk_max_file_descriptor_count: Maximum number of file descriptors that a zookeeper server can open
zk_avg_latency: Average time in milliseconds for requests to be processed
zk_min_latency: Maximum time in milliseconds for a request to be processed
zk_max_latency: Minimum time in milliseconds for a request to be processed

View File

@@ -0,0 +1,216 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import json
import urllib2
import smtplib
from email.mime.text import MIMEText
reload(sys)
sys.setdefaultencoding('utf8')
notify_channel_funcs = {
"email":"email",
"sms":"sms",
"voice":"voice",
"dingtalk":"dingtalk",
"wecom":"wecom",
"feishu":"feishu"
}
mail_host = "smtp.163.com"
mail_port = 994
mail_user = "ulricqin"
mail_pass = "password"
mail_from = "ulricqin@163.com"
class Sender(object):
@classmethod
def send_email(cls, payload):
if mail_user == "ulricqin" and mail_pass == "password":
print("invalid smtp configuration")
return
users = payload.get('event').get("notify_users_obj")
emails = {}
for u in users:
if u.get("email"):
emails[u.get("email")] = 1
if not emails:
return
recipients = emails.keys()
mail_body = payload.get('tpls').get("email.tpl", "email.tpl not found")
message = MIMEText(mail_body, 'html', 'utf-8')
message['From'] = mail_from
message['To'] = ", ".join(recipients)
message["Subject"] = payload.get('tpls').get("subject.tpl", "subject.tpl not found")
try:
smtp = smtplib.SMTP_SSL(mail_host, mail_port)
smtp.login(mail_user, mail_pass)
smtp.sendmail(mail_from, recipients, message.as_string())
smtp.close()
except smtplib.SMTPException, error:
print(error)
@classmethod
def send_wecom(cls, payload):
users = payload.get('event').get("notify_users_obj")
tokens = {}
for u in users:
contacts = u.get("contacts")
if contacts.get("wecom_robot_token", ""):
tokens[contacts.get("wecom_robot_token", "")] = 1
opener = urllib2.build_opener(urllib2.HTTPHandler())
method = "POST"
for t in tokens:
url = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key={}".format(t)
body = {
"msgtype": "markdown",
"markdown": {
"content": payload.get('tpls').get("wecom.tpl", "wecom.tpl not found")
}
}
request = urllib2.Request(url, data=json.dumps(body))
request.add_header("Content-Type",'application/json;charset=utf-8')
request.get_method = lambda: method
try:
connection = opener.open(request)
print(connection.read())
except urllib2.HTTPError, error:
print(error)
@classmethod
def send_dingtalk(cls, payload):
event = payload.get('event')
users = event.get("notify_users_obj")
rule_name = event.get("rule_name")
event_state = "Triggered"
if event.get("is_recovered"):
event_state = "Recovered"
tokens = {}
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
contacts = u.get("contacts")
if contacts.get("dingtalk_robot_token", ""):
tokens[contacts.get("dingtalk_robot_token", "")] = 1
opener = urllib2.build_opener(urllib2.HTTPHandler())
method = "POST"
for t in tokens:
url = "https://oapi.dingtalk.com/robot/send?access_token={}".format(t)
body = {
"msgtype": "markdown",
"markdown": {
"title": "{} - {}".format(event_state, rule_name),
"text": payload.get('tpls').get("dingtalk.tpl", "dingtalk.tpl not found") + ' '.join(["@"+i for i in phones.keys()])
},
"at": {
"atMobiles": phones.keys(),
"isAtAll": False
}
}
request = urllib2.Request(url, data=json.dumps(body))
request.add_header("Content-Type",'application/json;charset=utf-8')
request.get_method = lambda: method
try:
connection = opener.open(request)
print(connection.read())
except urllib2.HTTPError, error:
print(error)
@classmethod
def send_feishu(cls, payload):
users = payload.get('event').get("notify_users_obj")
tokens = {}
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
contacts = u.get("contacts")
if contacts.get("feishu_robot_token", ""):
tokens[contacts.get("feishu_robot_token", "")] = 1
opener = urllib2.build_opener(urllib2.HTTPHandler())
method = "POST"
for t in tokens:
url = "https://open.feishu.cn/open-apis/bot/v2/hook/{}".format(t)
body = {
"msg_type": "text",
"content": {
"text": payload.get('tpls').get("feishu.tpl", "feishu.tpl not found")
},
"at": {
"atMobiles": phones.keys(),
"isAtAll": False
}
}
request = urllib2.Request(url, data=json.dumps(body))
request.add_header("Content-Type",'application/json;charset=utf-8')
request.get_method = lambda: method
try:
connection = opener.open(request)
print(connection.read())
except urllib2.HTTPError, error:
print(error)
@classmethod
def send_sms(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_sms not implemented, phones: {}".format(phones.keys()))
@classmethod
def send_voice(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_voice not implemented, phones: {}".format(phones.keys()))
def main():
payload = json.load(sys.stdin)
with open(".payload", 'w') as f:
f.write(json.dumps(payload, indent=4))
for ch in payload.get('event').get('notify_channels'):
send_func_name = "send_{}".format(notify_channel_funcs.get(ch.strip()))
if not hasattr(Sender, send_func_name):
print("function: {} not found", send_func_name)
continue
send_func = getattr(Sender, send_func_name)
send_func(payload)
def hello():
print("hello nightingale")
if __name__ == "__main__":
if len(sys.argv) == 1:
main()
elif sys.argv[1] == "hello":
hello()
else:
print("I am confused")

View File

@@ -0,0 +1,73 @@
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import sys
import json
class Sender(object):
@classmethod
def send_email(cls, payload):
# already done in go code
pass
@classmethod
def send_wecom(cls, payload):
# already done in go code
pass
@classmethod
def send_dingtalk(cls, payload):
# already done in go code
pass
@classmethod
def send_feishu(cls, payload):
# already done in go code
pass
@classmethod
def send_mm(cls, payload):
# already done in go code
pass
@classmethod
def send_sms(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_sms not implemented, phones: {}".format(phones.keys()))
@classmethod
def send_voice(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_voice not implemented, phones: {}".format(phones.keys()))
def main():
payload = json.load(sys.stdin)
with open(".payload", 'w') as f:
f.write(json.dumps(payload, indent=4))
for ch in payload.get('event').get('notify_channels'):
send_func_name = "send_{}".format(ch.strip())
if not hasattr(Sender, send_func_name):
print("function: {} not found", send_func_name)
continue
send_func = getattr(Sender, send_func_name)
send_func(payload)
def hello():
print("hello nightingale")
if __name__ == "__main__":
if len(sys.argv) == 1:
main()
elif sys.argv[1] == "hello":
hello()
else:
print("I am confused")

View File

@@ -0,0 +1,92 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import json
import requests
class Sender(object):
@classmethod
def send_email(cls, payload):
# already done in go code
pass
@classmethod
def send_wecom(cls, payload):
# already done in go code
pass
@classmethod
def send_dingtalk(cls, payload):
# already done in go code
pass
@classmethod
def send_ifeishu(cls, payload):
users = payload.get('event').get("notify_users_obj")
tokens = {}
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
contacts = u.get("contacts")
if contacts.get("feishu_robot_token", ""):
tokens[contacts.get("feishu_robot_token", "")] = 1
headers = {
"Content-Type": "application/json;charset=utf-8",
"Host": "open.feishu.cn"
}
for t in tokens:
url = "https://open.feishu.cn/open-apis/bot/v2/hook/{}".format(t)
body = {
"msg_type": "text",
"content": {
"text": payload.get('tpls').get("feishu", "feishu not found")
},
"at": {
"atMobiles": list(phones.keys()),
"isAtAll": False
}
}
response = requests.post(url, headers=headers, data=json.dumps(body))
print(f"notify_ifeishu: token={t} status_code={response.status_code} response_text={response.text}")
@classmethod
def send_mm(cls, payload):
# already done in go code
pass
@classmethod
def send_sms(cls, payload):
pass
@classmethod
def send_voice(cls, payload):
pass
def main():
payload = json.load(sys.stdin)
with open(".payload", 'w') as f:
f.write(json.dumps(payload, indent=4))
for ch in payload.get('event').get('notify_channels'):
send_func_name = "send_{}".format(ch.strip())
if not hasattr(Sender, send_func_name):
print("function: {} not found", send_func_name)
continue
send_func = getattr(Sender, send_func_name)
send_func(payload)
def hello():
print("hello nightingale")
if __name__ == "__main__":
if len(sys.argv) == 1:
main()
elif sys.argv[1] == "hello":
hello()
else:
print("I am confused")

View File

@@ -0,0 +1,200 @@
import json
import yaml
'''
将promtheus/vmalert的rule转换为n9e中的rule
支持k8s的rule configmap
'''
rule_file = 'rules.yaml'
def convert_interval(interval):
if interval.endswith('s') or interval.endswith('S'):
return int(interval[:-1])
if interval.endswith('m') or interval.endswith('M'):
return int(interval[:-1]) * 60
if interval.endswith('h') or interval.endswith('H'):
return int(interval[:-1]) * 60 * 60
if interval.endswith('d') or interval.endswith('D'):
return int(interval[:-1]) * 60 * 60 * 24
return int(interval)
def convert_alert(rule, interval):
name = rule['alert']
prom_ql = rule['expr']
if 'for' in rule:
prom_for_duration = convert_interval(rule['for'])
else:
prom_for_duration = 0
prom_eval_interval = convert_interval(interval)
note = ''
if 'annotations' in rule:
for v in rule['annotations'].values():
note = v
break
annotations = {}
if 'annotations' in rule:
for k, v in rule['annotations'].items():
annotations[k] = v
append_tags = []
severity = 2
if 'labels' in rule:
for k, v in rule['labels'].items():
if k != 'severity':
append_tags.append('{}={}'.format(k, v))
continue
if v == 'critical':
severity = 1
elif v == 'info':
severity = 3
# elif v == 'warning':
# severity = 2
n9e_alert_rule = {
"name": name,
"note": note,
"severity": severity,
"disabled": 0,
"prom_for_duration": prom_for_duration,
"prom_ql": prom_ql,
"prom_eval_interval": prom_eval_interval,
"enable_stime": "00:00",
"enable_etime": "23:59",
"enable_days_of_week": [
"1",
"2",
"3",
"4",
"5",
"6",
"0"
],
"enable_in_bg": 0,
"notify_recovered": 1,
"notify_channels": [],
"notify_repeat_step": 60,
"recover_duration": 0,
"callbacks": [],
"runbook_url": "",
"append_tags": append_tags,
"annotations":annotations
}
return n9e_alert_rule
def convert_record(rule, interval):
name = rule['record']
prom_ql = rule['expr']
prom_eval_interval = convert_interval(interval)
note = ''
append_tags = []
if 'labels' in rule:
for k, v in rule['labels'].items():
append_tags.append('{}={}'.format(k, v))
n9e_record_rule = {
"name": name,
"note": note,
"disabled": 0,
"prom_ql": prom_ql,
"prom_eval_interval": prom_eval_interval,
"append_tags": append_tags
}
return n9e_record_rule
'''
example of rule group file
---
groups:
- name: example
rules:
- alert: HighRequestLatency
expr: job:request_latency_seconds:mean5m{job="myjob"} > 0.5
for: 10m
labels:
severity: page
annotations:
summary: High request latency
'''
def deal_group(group):
"""
parse single prometheus/vmalert rule group
"""
alert_rules = []
record_rules = []
for rule_segment in group['groups']:
if 'interval' in rule_segment:
interval = rule_segment['interval']
else:
interval = '15s'
for rule in rule_segment['rules']:
if 'alert' in rule:
alert_rules.append(convert_alert(rule, interval))
else:
record_rules.append(convert_record(rule, interval))
return alert_rules, record_rules
'''
example of k8s rule configmap
---
apiVersion: v1
kind: ConfigMap
metadata:
name: rulefiles-0
data:
etcdrules.yaml: |
groups:
- name: etcd
rules:
- alert: etcdInsufficientMembers
annotations:
message: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value}}).'
expr: sum(up{job=~".*etcd.*"} == bool 1) by (job) < ((count(up{job=~".*etcd.*"})
by (job) + 1) / 2)
for: 3m
labels:
severity: critical
'''
def deal_configmap(rule_configmap):
"""
parse rule configmap from k8s
"""
all_record_rules = []
all_alert_rules = []
for _, rule_group_str in rule_configmap['data'].items():
rule_group = yaml.load(rule_group_str, Loader=yaml.FullLoader)
alert_rules, record_rules = deal_group(rule_group)
all_alert_rules.extend(alert_rules)
all_record_rules.extend(record_rules)
return all_alert_rules, all_record_rules
def main():
with open(rule_file, 'r') as f:
rule_config = yaml.load(f, Loader=yaml.FullLoader)
# 如果文件是k8s中的configmap,使用下面的方法
# alert_rules, record_rules = deal_configmap(rule_config)
alert_rules, record_rules = deal_group(rule_config)
with open("alert-rules.json", 'w') as fw:
json.dump(alert_rules, fw, indent=2, ensure_ascii=False)
with open("record-rules.json", 'w') as fw:
json.dump(record_rules, fw, indent=2, ensure_ascii=False)
if __name__ == '__main__':
main()

View File

@@ -2,8 +2,7 @@ version: "3.7"
services:
mysql:
# platform: linux/x86_64
image: "mysql:5.7"
image: "mysql:8"
container_name: mysql
hostname: mysql
restart: always
@@ -12,8 +11,8 @@ services:
MYSQL_ROOT_PASSWORD: 1234
volumes:
- ./mysqldata:/var/lib/mysql/
- ./initsql:/docker-entrypoint-initdb.d/
- ./mysqletc/my.cnf:/etc/my.cnf
- ../initsql:/docker-entrypoint-initdb.d/
- ./etc-mysql/my.cnf:/etc/my.cnf
network_mode: host
redis:
@@ -33,7 +32,7 @@ services:
environment:
TZ: Asia/Shanghai
volumes:
- ./prometc:/etc/prometheus
- ./etc-prometheus:/etc/prometheus
network_mode: host
command:
- "--config.file=/etc/prometheus/prometheus.yml"
@@ -44,7 +43,7 @@ services:
- "--query.lookback-delta=2m"
ibex:
image: ulric2019/ibex:0.3
image: flashcatcloud/ibex:v0.5.0
container_name: ibex
hostname: ibex
restart: always
@@ -53,12 +52,12 @@ services:
TZ: Asia/Shanghai
WAIT_HOSTS: 127.0.0.1:3306
volumes:
- ./ibexetc:/app/etc
- ./etc-ibex:/app/etc
network_mode: host
depends_on:
- mysql
command: >
sh -c "/wait && /app/ibex server"
sh -c "/app/ibex server"
n9e:
image: flashcatcloud/nightingale:latest
@@ -70,15 +69,14 @@ services:
TZ: Asia/Shanghai
WAIT_HOSTS: 127.0.0.1:3306, 127.0.0.1:6379
volumes:
- ../etc:/app/etc
- ./etc-nightingale:/app/etc
network_mode: host
depends_on:
- mysql
- redis
- prometheus
- ibex
command: >
sh -c "/wait && /app/n9e"
sh -c "/app/n9e"
categraf:
image: "flashcatcloud/categraf:latest"
@@ -90,11 +88,10 @@ services:
HOST_PROC: /hostfs/proc
HOST_SYS: /hostfs/sys
HOST_MOUNT_PREFIX: /hostfs
WAIT_HOSTS: 127.0.0.1:17000, 127.0.0.1:20090
volumes:
- ./categraf/conf:/etc/categraf/conf
- ./etc-categraf:/etc/categraf/conf
- /:/hostfs
- /var/run/docker.sock:/var/run/docker.sock
network_mode: host
depends_on:
- n9e
- ibex
- n9e

View File

@@ -0,0 +1,5 @@
[mysqld]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
datadir = /var/lib/mysql
bind-address = 127.0.0.1

View File

@@ -0,0 +1,184 @@
[Global]
RunMode = "release"
[Log]
# log write dir
Dir = "logs"
# log level: DEBUG INFO WARNING ERROR
Level = "DEBUG"
# stdout, stderr, file
Output = "stdout"
# # rotate by time
# KeepHours = 4
# # rotate by size
# RotateNum = 3
# # unit: MB
# RotateSize = 256
[HTTP]
# http listening address
Host = "0.0.0.0"
# http listening port
Port = 17000
# https cert file path
CertFile = ""
# https key file path
KeyFile = ""
# whether print access log
PrintAccessLog = false
# whether enable pprof
PProf = false
# expose prometheus /metrics?
ExposeMetrics = true
# http graceful shutdown timeout, unit: s
ShutdownTimeout = 30
# max content length: 64M
MaxContentLength = 67108864
# http server read timeout, unit: s
ReadTimeout = 20
# http server write timeout, unit: s
WriteTimeout = 40
# http server idle timeout, unit: s
IdleTimeout = 120
[HTTP.ShowCaptcha]
Enable = false
[HTTP.APIForAgent]
Enable = true
# [HTTP.APIForAgent.BasicAuth]
# user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
[HTTP.APIForService]
Enable = true
[HTTP.APIForService.BasicAuth]
user001 = "ccc26da7b9aba533cbb263a36c07dcc5"
[HTTP.JWTAuth]
# signing key
SigningKey = "5b94a0fd640fe2765af826acfe42d151"
# unit: min
AccessExpired = 1500
# unit: min
RefreshExpired = 10080
RedisKeyPrefix = "/jwt/"
[HTTP.ProxyAuth]
# if proxy auth enabled, jwt auth is disabled
Enable = false
# username key in http proxy header
HeaderUserNameKey = "X-User-Name"
DefaultRoles = ["Standard"]
[HTTP.RSA]
# open RSA
OpenRSA = false
# Before replacing the key file, make sure that there are no encrypted variables in the database "configs".
# It is recommended to decrypt and remove all encrypted values from the database before replacing the key file.
# This will prevent any potential issues with accessing or decrypting the variables using the new key file.
# RSA public key (auto carete)
RSAPublicKeyPath = "etc/rsa/public.pem"
# RSA private key (auto carete)
RSAPrivateKeyPath = "etc/rsa/private.pem"
# RSA private key password
RSAPassWord = "n9e@n9e!"
[DB]
# postgres: host=%s port=%s user=%s dbname=%s password=%s sslmode=%s
# postgres: DSN="host=127.0.0.1 port=5432 user=root dbname=n9e_v6 password=1234 sslmode=disable"
DSN="root:1234@tcp(127.0.0.1:3306)/n9e_v6?charset=utf8mb4&parseTime=True&loc=Local&allowNativePasswords=true"
# enable debug mode or not
Debug = false
# mysql postgres
DBType = "mysql"
# unit: s
MaxLifetime = 7200
# max open connections
MaxOpenConns = 150
# max idle connections
MaxIdleConns = 50
# table prefix
TablePrefix = ""
# enable auto migrate or not
# EnableAutoMigrate = false
[Redis]
# address, ip:port or ip1:port,ip2:port for cluster and sentinel(SentinelAddrs)
Address = "127.0.0.1:6379"
# Username = ""
# Password = ""
# DB = 0
# UseTLS = false
# TLSMinVersion = "1.2"
# standalone cluster sentinel
RedisType = "standalone"
# Mastername for sentinel type
# MasterName = "mymaster"
# SentinelUsername = ""
# SentinelPassword = ""
[Alert]
[Alert.Heartbeat]
# auto detect if blank
IP = ""
# unit ms
Interval = 1000
EngineName = "default"
# [Alert.Alerting]
# NotifyConcurrency = 10
[Center]
MetricsYamlFile = "./etc/metrics.yaml"
I18NHeaderKey = "X-Language"
[Center.AnonymousAccess]
PromQuerier = true
AlertDetail = true
[Pushgw]
# use target labels in database instead of in series
LabelRewrite = true
# # default busigroup key name
# BusiGroupLabelKey = "busigroup"
# ForceUseServerTS = false
# [Pushgw.DebugSample]
# ident = "xx"
# __name__ = "xx"
# [Pushgw.WriterOpt]
# QueueMaxSize = 1000000
# QueuePopSize = 1000
[[Pushgw.Writers]]
# Url = "http://127.0.0.1:8480/insert/0/prometheus/api/v1/write"
Url = "http://127.0.0.1:9090/api/v1/write"
# Basic auth username
BasicAuthUser = ""
# Basic auth password
BasicAuthPass = ""
# timeout settings, unit: ms
Headers = ["X-From", "n9e"]
Timeout = 10000
DialTimeout = 3000
TLSHandshakeTimeout = 30000
ExpectContinueTimeout = 1000
IdleConnTimeout = 90000
# time duration, unit: ms
KeepAlive = 30000
MaxConnsPerHost = 0
MaxIdleConns = 100
MaxIdleConnsPerHost = 100
## Optional TLS Config
# UseTLS = false
# TLSCA = "/etc/n9e/ca.pem"
# TLSCert = "/etc/n9e/cert.pem"
# TLSKey = "/etc/n9e/key.pem"
# InsecureSkipVerify = false
# [[Writers.WriteRelabels]]
# Action = "replace"
# SourceLabels = ["__address__"]
# Regex = "([^:]+)(?::\\d+)?"
# Replacement = "$1:80"
# TargetLabel = "__address__"

View File

@@ -0,0 +1,774 @@
zh:
ip_conntrack_count: 连接跟踪表条目总数单位int, count
ip_conntrack_max: 连接跟踪表最大容量单位int, size
cpu_usage_idle: CPU空闲率单位%
cpu_usage_active: CPU使用率单位%
cpu_usage_system: CPU内核态时间占比单位%
cpu_usage_user: CPU用户态时间占比单位%
cpu_usage_nice: 低优先级用户态CPU时间占比也就是进程nice值被调整为1-19之间的CPU时间。这里注意nice可取值范围是-20到19数值越大优先级反而越低单位%
cpu_usage_iowait: CPU等待I/O的时间占比单位%
cpu_usage_irq: CPU处理硬中断的时间占比单位%
cpu_usage_softirq: CPU处理软中断的时间占比单位%
cpu_usage_steal: 在虚拟机环境下有该指标表示CPU被其他虚拟机争用的时间占比超过20就表示争抢严重单位%
cpu_usage_guest: 通过虚拟化运行其他操作系统的时间也就是运行虚拟机的CPU时间占比单位%
cpu_usage_guest_nice: 以低优先级运行虚拟机的时间占比(单位:%
disk_free: 硬盘分区剩余量单位byte
disk_used: 硬盘分区使用量单位byte
disk_used_percent: 硬盘分区使用率(单位:%
disk_total: 硬盘分区总量单位byte
disk_inodes_free: 硬盘分区inode剩余量
disk_inodes_used: 硬盘分区inode使用量
disk_inodes_total: 硬盘分区inode总量
diskio_io_time: 从设备视角来看I/O请求总时间队列中有I/O请求就计数单位毫秒counter类型需要用函数求rate才有使用价值
diskio_iops_in_progress: 已经分配给设备驱动且尚未完成的IO请求不包含在队列中但尚未分配给设备驱动的IO请求gauge类型
diskio_merged_reads: 相邻读请求merge读的次数counter类型
diskio_merged_writes: 相邻写请求merge写的次数counter类型
diskio_read_bytes: 读取的byte数量counter类型需要用函数求rate才有使用价值
diskio_read_time: 读请求总时间单位毫秒counter类型需要用函数求rate才有使用价值
diskio_reads: 读请求次数counter类型需要用函数求rate才有使用价值
diskio_weighted_io_time: 从I/O请求视角来看I/O等待总时间如果同时有多个I/O请求时间会叠加单位毫秒
diskio_write_bytes: 写入的byte数量counter类型需要用函数求rate才有使用价值
diskio_write_time: 写请求总时间单位毫秒counter类型需要用函数求rate才有使用价值
diskio_writes: 写请求次数counter类型需要用函数求rate才有使用价值
kernel_boot_time: 内核启动时间
kernel_context_switches: 内核上下文切换次数
kernel_entropy_avail: linux系统内部的熵池
kernel_interrupts: 内核中断次数
kernel_processes_forked: fork的进程数
mem_active: 活跃使用的内存总数(包括cache和buffer内存)
mem_available: 可用内存大小(bytes)
mem_available_percent: 内存剩余百分比(0~100)
mem_buffered: 用来给文件做缓冲大小
mem_cached: 被高速缓冲存储器cache memory用的内存的大小等于 diskcache minus SwapCache
mem_commit_limit: 根据超额分配比率('vm.overcommit_ratio'这是当前在系统上分配可用的内存总量这个限制只是在模式2('vm.overcommit_memory')时启用
mem_committed_as: 目前在系统上分配的内存量。是所有进程申请的内存的总和
mem_dirty: 等待被写回到磁盘的内存大小
mem_free: 空闲内存大小(bytes)
mem_high_free: 未被使用的高位内存大小
mem_high_total: 高位内存总大小Highmem是指所有内存高于860MB的物理内存,Highmem区域供用户程序使用或用于页面缓存。该区域不是直接映射到内核空间。内核必须使用不同的手法使用该段内存
mem_huge_page_size: 每个大页的大小
mem_huge_pages_free: 池中尚未分配的 HugePages 数量
mem_huge_pages_total: 预留HugePages的总个数
mem_inactive: 空闲的内存数(包括free和avalible的内存)
mem_low_free: 未被使用的低位大小
mem_low_total: 低位内存总大小,低位可以达到高位内存一样的作用,而且它还能够被内核用来记录一些自己的数据结构
mem_mapped: 设备和文件等映射的大小
mem_page_tables: 管理内存分页页面的索引表的大小
mem_shared: 多个进程共享的内存总额
mem_slab: 内核数据结构缓存的大小,可以减少申请和释放内存带来的消耗
mem_sreclaimable: 可收回Slab的大小
mem_sunreclaim: 不可收回Slab的大小SUnreclaim+SReclaimableSlab
mem_swap_cached: 被高速缓冲存储器cache memory用的交换空间的大小已经被交换出来的内存但仍然被存放在swapfile中。用来在需要的时候很快的被替换而不需要再次打开I/O端口
mem_swap_free: 未被使用交换空间的大小
mem_swap_total: 交换空间的总大小
mem_total: 内存总数
mem_used: 已用内存数
mem_used_percent: 已用内存数百分比(0~100)
mem_vmalloc_chunk: 最大的连续未被使用的vmalloc区域
mem_vmalloc_totalL: 可以vmalloc虚拟内存大小
mem_vmalloc_used: vmalloc已使用的虚拟内存大小
mem_write_back: 正在被写回到磁盘的内存大小
mem_write_back_tmp: FUSE用于临时写回缓冲区的内存
net_bytes_recv: 网卡收包总数(bytes)计算每秒速率时需要用到rate/irate函数
net_bytes_sent: 网卡发包总数(bytes)计算每秒速率时需要用到rate/irate函数
net_drop_in: 网卡收丢包数量
net_drop_out: 网卡发丢包数量
net_err_in: 网卡收包错误数量
net_err_out: 网卡发包错误数量
net_packets_recv: 网卡收包数量
net_packets_sent: 网卡发包数量
net_bits_recv: 网卡收包总数(bits)计算每秒速率时需要用到rate/irate函数
net_bits_sent: 网卡发包总数(bits)计算每秒速率时需要用到rate/irate函数
netstat_tcp_established: ESTABLISHED状态的网络链接数
netstat_tcp_fin_wait1: FIN_WAIT1状态的网络链接数
netstat_tcp_fin_wait2: FIN_WAIT2状态的网络链接数
netstat_tcp_last_ack: LAST_ACK状态的网络链接数
netstat_tcp_listen: LISTEN状态的网络链接数
netstat_tcp_syn_recv: SYN_RECV状态的网络链接数
netstat_tcp_syn_sent: SYN_SENT状态的网络链接数
netstat_tcp_time_wait: TIME_WAIT状态的网络链接数
netstat_udp_socket: UDP状态的网络链接数
netstat_sockets_used: 已使用的所有协议套接字总量
netstat_tcp_inuse: 正在使用正在侦听的TCP套接字数量
netstat_tcp_orphan: 无主不属于任何进程的TCP连接数无用、待销毁的TCP socket数
netstat_tcp_tw: TIME_WAIT状态的TCP连接数
netstat_tcp_alloc: 已分配已建立、已申请到sk_buff的TCP套接字数量
netstat_tcp_mem: TCP套接字内存Page使用量
netstat_udp_inuse: 在使用的UDP套接字数量
netstat_udp_mem: UDP套接字内存Page使用量
netstat_udplite_inuse: 正在使用的 udp lite 数量
netstat_raw_inuse: 正在使用的 raw socket 数量
netstat_frag_inuse: ip fragement 数量
netstat_frag_memory: ip fragement 已经分配的内存(byte
#[ping]
ping_percent_packet_loss: ping数据包丢失百分比(%)
ping_result_code: ping返回码('0','1')
net_response_result_code: 网络探测结果0表示正常非0表示异常
net_response_response_time: 网络探测时延,单位:秒
processes_blocked: 不可中断的睡眠状态下的进程数('U','D','L')
processes_dead: 回收中的进程数('X')
processes_idle: 挂起的空闲进程数('I')
processes_paging: 分页进程数('P')
processes_running: 运行中的进程数('R')
processes_sleeping: 可中断进程数('S')
processes_stopped: 暂停状态进程数('T')
processes_total: 总进程数
processes_total_threads: 总线程数
processes_unknown: 未知状态进程数
processes_zombies: 僵尸态进程数('Z')
swap_used_percent: Swap空间换出数据量
system_load1: 1分钟平均load值
system_load5: 5分钟平均load值
system_load15: 15分钟平均load值
system_load_norm_1: 1分钟平均load值/逻辑CPU个数
system_load_norm_5: 5分钟平均load值/逻辑CPU个数
system_load_norm_15: 15分钟平均load值/逻辑CPU个数
system_n_users: 用户数
system_n_cpus: CPU核数
system_uptime: 系统启动时间
nginx_accepts: 自nginx启动起,与客户端建立过得连接总数
nginx_active: 当前nginx正在处理的活动连接数,等于Reading/Writing/Waiting总和
nginx_handled: 自nginx启动起,处理过的客户端连接总数
nginx_reading: 正在读取HTTP请求头部的连接总数
nginx_requests: 自nginx启动起,处理过的客户端请求总数,由于存在HTTP Krrp-Alive请求,该值会大于handled值
nginx_upstream_check_fall: upstream_check模块检测到后端失败的次数
nginx_upstream_check_rise: upstream_check模块对后端的检测次数
nginx_upstream_check_status_code: 后端upstream的状态,up为1,down为0
nginx_waiting: 开启 keep-alive 的情况下,这个值等于 active (reading+writing), 意思就是 Nginx 已经处理完正在等候下一次请求指令的驻留连接
nginx_writing: 正在向客户端发送响应的连接总数
http_response_content_length: HTTP消息实体的传输长度
http_response_http_response_code: http响应状态码
http_response_response_time: http响应用时
http_response_result_code: url探测结果0为正常否则url无法访问
# [aws cloudwatch rds]
cloudwatch_aws_rds_bin_log_disk_usage_average: rds 磁盘使用平均值
cloudwatch_aws_rds_bin_log_disk_usage_maximum: rds 磁盘使用量最大值
cloudwatch_aws_rds_bin_log_disk_usage_minimum: rds binlog 磁盘使用量最低
cloudwatch_aws_rds_bin_log_disk_usage_sample_count: rds binlog 磁盘使用情况样本计数
cloudwatch_aws_rds_bin_log_disk_usage_sum: rds binlog 磁盘使用总和
cloudwatch_aws_rds_burst_balance_average: rds 突发余额平均值
cloudwatch_aws_rds_burst_balance_maximum: rds 突发余额最大值
cloudwatch_aws_rds_burst_balance_minimum: rds 突发余额最低
cloudwatch_aws_rds_burst_balance_sample_count: rds 突发平衡样本计数
cloudwatch_aws_rds_burst_balance_sum: rds 突发余额总和
cloudwatch_aws_rds_cpu_utilization_average: rds cpu 利用率平均值
cloudwatch_aws_rds_cpu_utilization_maximum: rds cpu 利用率最大值
cloudwatch_aws_rds_cpu_utilization_minimum: rds cpu 利用率最低
cloudwatch_aws_rds_cpu_utilization_sample_count: rds cpu 利用率样本计数
cloudwatch_aws_rds_cpu_utilization_sum: rds cpu 利用率总和
cloudwatch_aws_rds_database_connections_average: rds 数据库连接平均值
cloudwatch_aws_rds_database_connections_maximum: rds 数据库连接数最大值
cloudwatch_aws_rds_database_connections_minimum: rds 数据库连接最小
cloudwatch_aws_rds_database_connections_sample_count: rds 数据库连接样本数
cloudwatch_aws_rds_database_connections_sum: rds 数据库连接总和
cloudwatch_aws_rds_db_load_average: rds db 平均负载
cloudwatch_aws_rds_db_load_cpu_average: rds db 负载 cpu 平均值
cloudwatch_aws_rds_db_load_cpu_maximum: rds db 负载 cpu 最大值
cloudwatch_aws_rds_db_load_cpu_minimum: rds db 负载 cpu 最小值
cloudwatch_aws_rds_db_load_cpu_sample_count: rds db 加载 CPU 样本数
cloudwatch_aws_rds_db_load_cpu_sum: rds db 加载cpu总和
cloudwatch_aws_rds_db_load_maximum: rds 数据库负载最大值
cloudwatch_aws_rds_db_load_minimum: rds 数据库负载最小值
cloudwatch_aws_rds_db_load_non_cpu_average: rds 加载非 CPU 平均值
cloudwatch_aws_rds_db_load_non_cpu_maximum: rds 加载非 cpu 最大值
cloudwatch_aws_rds_db_load_non_cpu_minimum: rds 加载非 cpu 最小值
cloudwatch_aws_rds_db_load_non_cpu_sample_count: rds 加载非 cpu 样本计数
cloudwatch_aws_rds_db_load_non_cpu_sum: rds 加载非cpu总和
cloudwatch_aws_rds_db_load_sample_count: rds db 加载样本计数
cloudwatch_aws_rds_db_load_sum: rds db 负载总和
cloudwatch_aws_rds_disk_queue_depth_average: rds 磁盘队列深度平均值
cloudwatch_aws_rds_disk_queue_depth_maximum: rds 磁盘队列深度最大值
cloudwatch_aws_rds_disk_queue_depth_minimum: rds 磁盘队列深度最小值
cloudwatch_aws_rds_disk_queue_depth_sample_count: rds 磁盘队列深度样本计数
cloudwatch_aws_rds_disk_queue_depth_sum: rds 磁盘队列深度总和
cloudwatch_aws_rds_ebs_byte_balance__average: rds ebs 字节余额平均值
cloudwatch_aws_rds_ebs_byte_balance__maximum: rds ebs 字节余额最大值
cloudwatch_aws_rds_ebs_byte_balance__minimum: rds ebs 字节余额最低
cloudwatch_aws_rds_ebs_byte_balance__sample_count: rds ebs 字节余额样本数
cloudwatch_aws_rds_ebs_byte_balance__sum: rds ebs 字节余额总和
cloudwatch_aws_rds_ebsio_balance__average: rds ebsio 余额平均值
cloudwatch_aws_rds_ebsio_balance__maximum: rds ebsio 余额最大值
cloudwatch_aws_rds_ebsio_balance__minimum: rds ebsio 余额最低
cloudwatch_aws_rds_ebsio_balance__sample_count: rds ebsio 平衡样本计数
cloudwatch_aws_rds_ebsio_balance__sum: rds ebsio 余额总和
cloudwatch_aws_rds_free_storage_space_average: rds 免费存储空间平均
cloudwatch_aws_rds_free_storage_space_maximum: rds 最大可用存储空间
cloudwatch_aws_rds_free_storage_space_minimum: rds 最低可用存储空间
cloudwatch_aws_rds_free_storage_space_sample_count: rds 可用存储空间样本数
cloudwatch_aws_rds_free_storage_space_sum: rds 免费存储空间总和
cloudwatch_aws_rds_freeable_memory_average: rds 可用内存平均值
cloudwatch_aws_rds_freeable_memory_maximum: rds 最大可用内存
cloudwatch_aws_rds_freeable_memory_minimum: rds 最小可用内存
cloudwatch_aws_rds_freeable_memory_sample_count: rds 可释放内存样本数
cloudwatch_aws_rds_freeable_memory_sum: rds 可释放内存总和
cloudwatch_aws_rds_lvm_read_iops_average: rds lvm 读取 iops 平均值
cloudwatch_aws_rds_lvm_read_iops_maximum: rds lvm 读取 iops 最大值
cloudwatch_aws_rds_lvm_read_iops_minimum: rds lvm 读取 iops 最低
cloudwatch_aws_rds_lvm_read_iops_sample_count: rds lvm 读取 iops 样本计数
cloudwatch_aws_rds_lvm_read_iops_sum: rds lvm 读取 iops 总和
cloudwatch_aws_rds_lvm_write_iops_average: rds lvm 写入 iops 平均值
cloudwatch_aws_rds_lvm_write_iops_maximum: rds lvm 写入 iops 最大值
cloudwatch_aws_rds_lvm_write_iops_minimum: rds lvm 写入 iops 最低
cloudwatch_aws_rds_lvm_write_iops_sample_count: rds lvm 写入 iops 样本计数
cloudwatch_aws_rds_lvm_write_iops_sum: rds lvm 写入 iops 总和
cloudwatch_aws_rds_network_receive_throughput_average: rds 网络接收吞吐量平均
cloudwatch_aws_rds_network_receive_throughput_maximum: rds 网络接收吞吐量最大值
cloudwatch_aws_rds_network_receive_throughput_minimum: rds 网络接收吞吐量最小值
cloudwatch_aws_rds_network_receive_throughput_sample_count: rds 网络接收吞吐量样本计数
cloudwatch_aws_rds_network_receive_throughput_sum: rds 网络接收吞吐量总和
cloudwatch_aws_rds_network_transmit_throughput_average: rds 网络传输吞吐量平均值
cloudwatch_aws_rds_network_transmit_throughput_maximum: rds 网络传输吞吐量最大
cloudwatch_aws_rds_network_transmit_throughput_minimum: rds 网络传输吞吐量最小值
cloudwatch_aws_rds_network_transmit_throughput_sample_count: rds 网络传输吞吐量样本计数
cloudwatch_aws_rds_network_transmit_throughput_sum: rds 网络传输吞吐量总和
cloudwatch_aws_rds_read_iops_average: rds 读取 iops 平均值
cloudwatch_aws_rds_read_iops_maximum: rds 最大读取 iops
cloudwatch_aws_rds_read_iops_minimum: rds 读取 iops 最低
cloudwatch_aws_rds_read_iops_sample_count: rds 读取 iops 样本计数
cloudwatch_aws_rds_read_iops_sum: rds 读取 iops 总和
cloudwatch_aws_rds_read_latency_average: rds 读取延迟平均值
cloudwatch_aws_rds_read_latency_maximum: rds 读取延迟最大值
cloudwatch_aws_rds_read_latency_minimum: rds 最小读取延迟
cloudwatch_aws_rds_read_latency_sample_count: rds 读取延迟样本计数
cloudwatch_aws_rds_read_latency_sum: rds 读取延迟总和
cloudwatch_aws_rds_read_throughput_average: rds 读取吞吐量平均值
cloudwatch_aws_rds_read_throughput_maximum: rds 最大读取吞吐量
cloudwatch_aws_rds_read_throughput_minimum: rds 最小读取吞吐量
cloudwatch_aws_rds_read_throughput_sample_count: rds 读取吞吐量样本计数
cloudwatch_aws_rds_read_throughput_sum: rds 读取吞吐量总和
cloudwatch_aws_rds_swap_usage_average: rds 交换使用平均值
cloudwatch_aws_rds_swap_usage_maximum: rds 交换使用最大值
cloudwatch_aws_rds_swap_usage_minimum: rds 交换使用量最低
cloudwatch_aws_rds_swap_usage_sample_count: rds 交换使用示例计数
cloudwatch_aws_rds_swap_usage_sum: rds 交换使用总和
cloudwatch_aws_rds_write_iops_average: rds 写入 iops 平均值
cloudwatch_aws_rds_write_iops_maximum: rds 写入 iops 最大值
cloudwatch_aws_rds_write_iops_minimum: rds 写入 iops 最低
cloudwatch_aws_rds_write_iops_sample_count: rds 写入 iops 样本计数
cloudwatch_aws_rds_write_iops_sum: rds 写入 iops 总和
cloudwatch_aws_rds_write_latency_average: rds 写入延迟平均值
cloudwatch_aws_rds_write_latency_maximum: rds 最大写入延迟
cloudwatch_aws_rds_write_latency_minimum: rds 写入延迟最小值
cloudwatch_aws_rds_write_latency_sample_count: rds 写入延迟样本计数
cloudwatch_aws_rds_write_latency_sum: rds 写入延迟总和
cloudwatch_aws_rds_write_throughput_average: rds 写入吞吐量平均值
cloudwatch_aws_rds_write_throughput_maximum: rds 最大写入吞吐量
cloudwatch_aws_rds_write_throughput_minimum: rds 写入吞吐量最小值
cloudwatch_aws_rds_write_throughput_sample_count: rds 写入吞吐量样本计数
cloudwatch_aws_rds_write_throughput_sum: rds 写入吞吐量总和
en:
ip_conntrack_count: the number of entries in the conntrack tableunitint, count
ip_conntrack_max: the max capacity of the conntrack tableunitint, size
cpu_usage_idle: "CPU idle rate(unit%)"
cpu_usage_active: "CPU usage rate(unit%)"
cpu_usage_system: "CPU kernel state time proportion(unit%)"
cpu_usage_user: "CPU user attitude time proportion(unit%)"
cpu_usage_nice: "The proportion of low priority CPU time, that is, the process NICE value is adjusted to the CPU time between 1-19. Note here that the value range of NICE is -20 to 19, the larger the value, the lower the priority, the lower the priority(unit%)"
cpu_usage_iowait: "CPU waiting for I/O time proportion(unit%)"
cpu_usage_irq: "CPU processing hard interrupt time proportion(unit%)"
cpu_usage_softirq: "CPU processing soft interrupt time proportion(unit%)"
cpu_usage_steal: "In the virtual machine environment, there is this indicator, which means that the CPU is used by other virtual machines for the proportion of time.(unit%)"
cpu_usage_guest: "The time to run other operating systems by virtualization, that is, the proportion of CPU time running the virtual machine(unit%)"
cpu_usage_guest_nice: "The proportion of time to run the virtual machine at low priority(unit%)"
disk_free: "The remaining amount of the hard disk partition (unit: byte)"
disk_used: "Hard disk partitional use (unit: byte)"
disk_used_percent: "Hard disk partitional use rate (unit:%)"
disk_total: "Total amount of hard disk partition (unit: byte)"
disk_inodes_free: "Hard disk partition INODE remaining amount"
disk_inodes_used: "Hard disk partition INODE usage amount"
disk_inodes_total: "The total amount of hard disk partition INODE"
diskio_io_time: "From the perspective of the device perspective, the total time of I/O request, the I/O request in the queue is count (unit: millisecond), the counter type, you need to use the function to find the value"
diskio_iops_in_progress: "IO requests that have been assigned to device -driven and have not yet been completed, not included in the queue but not yet assigned to the device -driven IO request, Gauge type"
diskio_merged_reads: "The number of times of adjacent reading request Merge, the counter type"
diskio_merged_writes: "The number of times the request Merge writes, the counter type"
diskio_read_bytes: "The number of byte reads, the counter type, you need to use the function to find the Rate to use the value"
diskio_read_time: "The total time of reading request (unit: millisecond), the counter type, you need to use the function to find the Rate to have the value of use"
diskio_reads: "Read the number of requests, the counter type, you need to use the function to find the Rate to use the value"
diskio_weighted_io_time: "From the perspective of the I/O request perspective, I/O wait for the total time. If there are multiple I/O requests at the same time, the time will be superimposed (unit: millisecond)"
diskio_write_bytes: "The number of bytes written, the counter type, you need to use the function to find the Rate to use the value"
diskio_write_time: "The total time of the request (unit: millisecond), the counter type, you need to use the function to find the rate to have the value of use"
diskio_writes: "Write the number of requests, the counter type, you need to use the function to find the rate to use value"
kernel_boot_time: "Kernel startup time"
kernel_context_switches: "Number of kernel context switching times"
kernel_entropy_avail: "Entropy pool inside the Linux system"
kernel_interrupts: "Number of kernel interruption"
kernel_processes_forked: "ForK's process number"
mem_active: "The total number of memory (including Cache and BUFFER memory)"
mem_available: "Application can use memory numbers"
mem_available_percent: "Memory remaining percentage (0 ~ 100)"
mem_buffered: "Used to make buffer size for the file"
mem_cached: "The size of the memory used by the cache memory (equal to diskcache minus Swap Cache )"
mem_commit_limit: "According to the over allocation ratio ('vm.overCommit _ Ratio'), this is the current total memory that can be allocated on the system."
mem_committed_as: "Currently allocated on the system. It is the sum of the memory of all process applications"
mem_dirty: "Waiting to be written back to the memory size of the disk"
mem_free: "Senior memory number"
mem_high_free: "Unused high memory size"
mem_high_total: "The total memory size of the high memory (Highmem refers to all the physical memory that is higher than 860 MB of memory, the HighMem area is used for user programs, or for page cache. This area is not directly mapped to the kernel space. The kernels must use different methods to use this section of memory. )"
mem_huge_page_size: "The size of each big page"
mem_huge_pages_free: "The number of Huge Pages in the pool that have not been allocated"
mem_huge_pages_total: "Reserve the total number of Huge Pages"
mem_inactive: "Free memory (including the memory of free and avalible)"
mem_low_free: "Unused low size"
mem_low_total: "The total size of the low memory memory can achieve the same role of high memory, and it can be used by the kernel to record some of its own data structure"
mem_mapped: "The size of the mapping of equipment and files"
mem_page_tables: "The size of the index table of the management of the memory paging page"
mem_shared: "The total memory shared by multiple processes"
mem_slab: "The size of the kernel data structure cache can reduce the consumption of application and release memory"
mem_sreclaimable: "The size of the SLAB can be recovered"
mem_sunreclaim: "The size of the SLAB cannot be recovered(SUnreclaim+SReclaimableSlab)"
mem_swap_cached: "The size of the swap space used by the cache memory (cache memory), the memory that has been swapped out, but is still stored in the swapfile. Used to be quickly replaced when needed without opening the I/O port again"
mem_swap_free: "The size of the switching space is not used"
mem_swap_total: "The total size of the exchange space"
mem_total: "Total memory"
mem_used: "Memory number"
mem_used_percent: "The memory has been used by several percentage (0 ~ 100)"
mem_vmalloc_chunk: "The largest continuous unused vmalloc area"
mem_vmalloc_totalL: "You can vmalloc virtual memory size"
mem_vmalloc_used: "Vmalloc's virtual memory size"
mem_write_back: "The memory size of the disk is being written back to the disk"
mem_write_back_tmp: "Fuse is used to temporarily write back the memory of the buffer area"
net_bytes_recv: "Total inbound traffic(bytes) of network card"
net_bytes_sent: "Total outbound traffic(bytes) of network card"
net_bits_recv: "Total inbound traffic(bits) of network card"
net_bits_sent: "Total outbound traffic(bits) of network card"
net_drop_in: "The number of packets for network cards"
net_drop_out: "The number of packets issued by the network card"
net_err_in: "The number of incorrect packets of the network card"
net_err_out: "Number of incorrect number of network cards"
net_packets_recv: "Net card collection quantity"
net_packets_sent: "Number of network card issuance"
netstat_tcp_established: "ESTABLISHED status network link number"
netstat_tcp_fin_wait1: "FIN _ WAIT1 status network link number"
netstat_tcp_fin_wait2: "FIN _ WAIT2 status number of network links"
netstat_tcp_last_ack: "LAST_ ACK status number of network links"
netstat_tcp_listen: "Number of network links in Listen status"
netstat_tcp_syn_recv: "SYN _ RECV status number of network links"
netstat_tcp_syn_sent: "SYN _ SENT status number of network links"
netstat_tcp_time_wait: "Time _ WAIT status network link number"
netstat_udp_socket: "Number of network links in UDP status"
processes_blocked: "The number of processes in the unreprudible sleep state('U','D','L')"
processes_dead: "Number of processes in recycling('X')"
processes_idle: "Number of idle processes hanging('I')"
processes_paging: "Number of paging processes('P')"
processes_running: "Number of processes during operation('R')"
processes_sleeping: "Can interrupt the number of processes('S')"
processes_stopped: "Pushing status process number('T')"
processes_total: "Total process number"
processes_total_threads: "Number of threads"
processes_unknown: "Unknown status process number"
processes_zombies: "Number of zombies('Z')"
swap_used_percent: "SWAP space replace the data volume"
system_load1: "1 minute average load value"
system_load5: "5 minutes average load value"
system_load15: "15 minutes average load value"
system_load_norm_1: "1 minute average load value/logical CPU number"
system_load_norm_5: "5 minutes average load value/logical CPU number"
system_load_norm_15: "15 minutes average load value/logical CPU number"
system_n_users: "User number"
system_n_cpus: "CPU nuclear number"
system_uptime: "System startup time"
nginx_accepts: "Since Nginx started, the total number of connections has been established with the client"
nginx_active: "The current number of activity connections that Nginx is being processed is equal to Reading/Writing/Waiting"
nginx_handled: "Starting from Nginx, the total number of client connections that have been processed"
nginx_reading: "Reading the total number of connections on the http request header"
nginx_requests: "Since nginx is started, the total number of client requests processed, due to the existence of HTTP Krrp - Alive requests, this value will be greater than the handled value"
nginx_upstream_check_fall: "UPStream_CHECK module detects the number of back -end failures"
nginx_upstream_check_rise: "UPSTREAM _ Check module to detect the number of back -end"
nginx_upstream_check_status_code: "The state of the backstream is 1, and the down is 0"
nginx_waiting: "When keep-alive is enabled, this value is equal to active (reading+writing), which means that Nginx has processed the resident connection that is waiting for the next request command"
nginx_writing: "The total number of connections to send a response to the client"
http_response_content_length: "HTTP message entity transmission length"
http_response_http_response_code: "http response status code"
http_response_response_time: "When http ring application"
http_response_result_code: "URL detection result 0 is normal, otherwise the URL cannot be accessed"
# [mysqld_exporter]
mysql_global_status_uptime: The number of seconds that the server has been up.(Gauge)
mysql_global_status_uptime_since_flush_status: The number of seconds since the most recent FLUSH STATUS statement.(Gauge)
mysql_global_status_queries: The number of statements executed by the server. This variable includes statements executed within stored programs, unlike the Questions variable. It does not count COM_PING or COM_STATISTICS commands.(Counter)
mysql_global_status_threads_connected: The number of currently open connections.(Counter)
mysql_global_status_connections: The number of connection attempts (successful or not) to the MySQL server.(Gauge)
mysql_global_status_max_used_connections: The maximum number of connections that have been in use simultaneously since the server started.(Gauge)
mysql_global_status_threads_running: The number of threads that are not sleeping.(Gauge)
mysql_global_status_questions: The number of statements executed by the server. This includes only statements sent to the server by clients and not statements executed within stored programs, unlike the Queries variable. This variable does not count COM_PING, COM_STATISTICS, COM_STMT_PREPARE, COM_STMT_CLOSE, or COM_STMT_RESET commands.(Counter)
mysql_global_status_threads_cached: The number of threads in the thread cache.(Counter)
mysql_global_status_threads_created: The number of threads created to handle connections. If Threads_created is big, you may want to increase the thread_cache_size value. The cache miss rate can be calculated as Threads_created/Connections.(Counter)
mysql_global_status_created_tmp_tables: The number of internal temporary tables created by the server while executing statements.(Counter)
mysql_global_status_created_tmp_disk_tables: The number of internal on-disk temporary tables created by the server while executing statements. You can compare the number of internal on-disk temporary tables created to the total number of internal temporary tables created by comparing Created_tmp_disk_tables and Created_tmp_tables values.(Counter)
mysql_global_status_created_tmp_files: How many temporary files mysqld has created.(Counter)
mysql_global_status_select_full_join: The number of joins that perform table scans because they do not use indexes. If this value is not 0, you should carefully check the indexes of your tables.(Counter)
mysql_global_status_select_full_range_join: The number of joins that used a range search on a reference table.(Counter)
mysql_global_status_select_range: The number of joins that used ranges on the first table. This is normally not a critical issue even if the value is quite large.(Counter)
mysql_global_status_select_range_check: The number of joins without keys that check for key usage after each row. If this is not 0, you should carefully check the indexes of your tables.(Counter)
mysql_global_status_select_scan: The number of joins that did a full scan of the first table.(Counter)
mysql_global_status_sort_rows: The number of sorted rows.(Counter)
mysql_global_status_sort_range: The number of sorts that were done using ranges.(Counter)
mysql_global_status_sort_merge_passes: The number of merge passes that the sort algorithm has had to do. If this value is large, you should consider increasing the value of the sort_buffer_size system variable.(Counter)
mysql_global_status_sort_scan: The number of sorts that were done by scanning the table.(Counter)
mysql_global_status_slow_queries: The number of queries that have taken more than long_query_time seconds. This counter increments regardless of whether the slow query log is enabled.(Counter)
mysql_global_status_aborted_connects: The number of failed attempts to connect to the MySQL server.(Counter)
mysql_global_status_aborted_clients: The number of connections that were aborted because the client died without closing the connection properly.(Counter)
mysql_global_status_table_locks_immediate: The number of times that a request for a table lock could be granted immediately. Locks Immediate rising and falling is normal activity.(Counter)
mysql_global_status_table_locks_waited: The number of times that a request for a table lock could not be granted immediately and a wait was needed. If this is high and you have performance problems, you should first optimize your queries, and then either split your table or tables or use replication.(Counter)
mysql_global_status_bytes_received: The number of bytes received from all clients.(Counter)
mysql_global_status_bytes_sent: The number of bytes sent to all clients.(Counter)
mysql_global_status_innodb_page_size: InnoDB page size (default 16KB). Many values are counted in pages; the page size enables them to be easily converted to bytes.(Gauge)
mysql_global_status_buffer_pool_pages: The number of pages in the InnoDB buffer pool.(Gauge)
mysql_global_status_commands_total: The number of times each xxx statement has been executed.(Counter)
mysql_global_status_handlers_total: Handler statistics are internal statistics on how MySQL is selecting, updating, inserting, and modifying rows, tables, and indexes. This is in fact the layer between the Storage Engine and MySQL.(Counter)
mysql_global_status_opened_files: The number of files that have been opened with my_open() (a mysys library function). Parts of the server that open files without using this function do not increment the count.(Counter)
mysql_global_status_open_tables: The number of tables that are open.(Gauge)
mysql_global_status_opened_tables: The number of tables that have been opened. If Opened_tables is big, your table_open_cache value is probably too small.(Counter)
mysql_global_status_table_open_cache_hits: The number of hits for open tables cache lookups.(Counter)
mysql_global_status_table_open_cache_misses: The number of misses for open tables cache lookups.(Counter)
mysql_global_status_table_open_cache_overflows: The number of overflows for the open tables cache.(Counter)
mysql_global_status_innodb_num_open_files: The number of files InnoDB currently holds open.(Gauge)
mysql_global_status_connection_errors_total: These variables provide information about errors that occur during the client connection process.(Counter)
mysql_global_status_innodb_buffer_pool_read_requests: The number of logical read requests.(Counter)
mysql_global_status_innodb_buffer_pool_reads: The number of logical reads that InnoDB could not satisfy from the buffer pool, and had to read directly from disk.(Counter)
mysql_global_variables_thread_cache_size: How many threads the server should cache for reuse.(Gauge)
mysql_global_variables_max_connections: The maximum permitted number of simultaneous client connections.(Gauge)
mysql_global_variables_innodb_buffer_pool_size: The size in bytes of the buffer pool, the memory area where InnoDB caches table and index data. The default value is 134217728 bytes (128MB).(Gauge)
mysql_global_variables_innodb_log_buffer_size: The size in bytes of the buffer that InnoDB uses to write to the log files on disk.(Gauge)
mysql_global_variables_key_buffer_size: Index blocks for MyISAM tables are buffered and are shared by all threads.(Gauge)
mysql_global_variables_query_cache_size: The amount of memory allocated for caching query results.(Gauge)
mysql_global_variables_table_open_cache: The number of open tables for all threads.(Gauge)
mysql_global_variables_open_files_limit: The number of file descriptors available to mysqld from the operating system.(Gauge)
# [redis_exporter]
redis_active_defrag_running: When activedefrag is enabled, this indicates whether defragmentation is currently active, and the CPU percentage it intends to utilize.
redis_allocator_active_bytes: Total bytes in the allocator active pages, this includes external-fragmentation.
redis_allocator_allocated_bytes: Total bytes allocated form the allocator, including internal-fragmentation. Normally the same as used_memory.
redis_allocator_frag_bytes: Delta between allocator_active and allocator_allocated. See note about mem_fragmentation_bytes.
redis_allocator_frag_ratio: Ratio between allocator_active and allocator_allocated. This is the true (external) fragmentation metric (not mem_fragmentation_ratio).
redis_allocator_resident_bytes: Total bytes resident (RSS) in the allocator, this includes pages that can be released to the OS (by MEMORY PURGE, or just waiting).
redis_allocator_rss_bytes: Delta between allocator_resident and allocator_active.
redis_allocator_rss_ratio: Ratio between allocator_resident and allocator_active. This usually indicates pages that the allocator can and probably will soon release back to the OS.
redis_aof_current_rewrite_duration_sec: Duration of the on-going AOF rewrite operation if any.
redis_aof_enabled: Flag indicating AOF logging is activated.
redis_aof_last_bgrewrite_status: Status of the last AOF rewrite operation.
redis_aof_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last AOF rewrite operation.
redis_aof_last_rewrite_duration_sec: Duration of the last AOF rewrite operation in seconds.
redis_aof_last_write_status: Status of the last write operation to the AOF.
redis_aof_rewrite_in_progress: Flag indicating a AOF rewrite operation is on-going.
redis_aof_rewrite_scheduled: Flag indicating an AOF rewrite operation will be scheduled once the on-going RDB save is complete.
redis_blocked_clients: Number of clients pending on a blocking call (BLPOP, BRPOP, BRPOPLPUSH, BLMOVE, BZPOPMIN, BZPOPMAX).
redis_client_recent_max_input_buffer_bytes: Biggest input buffer among current client connections.
redis_client_recent_max_output_buffer_bytes: Biggest output buffer among current client connections.
redis_cluster_enabled: Indicate Redis cluster is enabled.
redis_commands_duration_seconds_total: The total CPU time consumed by these commands.(Counter)
redis_commands_processed_total: Total number of commands processed by the server.(Counter)
redis_commands_total: The number of calls that reached command execution (not rejected).(Counter)
redis_config_maxclients: The value of the maxclients configuration directive. This is the upper limit for the sum of connected_clients, connected_slaves and cluster_connections.
redis_config_maxmemory: The value of the maxmemory configuration directive.
redis_connected_clients: Number of client connections (excluding connections from replicas).
redis_connected_slaves: Number of connected replicas.
redis_connections_received_total: Total number of connections accepted by the server.(Counter)
redis_cpu_sys_children_seconds_total: System CPU consumed by the background processes.(Counter)
redis_cpu_sys_seconds_total: System CPU consumed by the Redis server, which is the sum of system CPU consumed by all threads of the server process (main thread and background threads).(Counter)
redis_cpu_user_children_seconds_total: User CPU consumed by the background processes.(Counter)
redis_cpu_user_seconds_total: User CPU consumed by the Redis server, which is the sum of user CPU consumed by all threads of the server process (main thread and background threads).(Counter)
redis_db_keys: Total number of keys by DB.
redis_db_keys_expiring: Total number of expiring keys by DB
redis_defrag_hits: Number of value reallocations performed by active the defragmentation process.
redis_defrag_misses: Number of aborted value reallocations started by the active defragmentation process.
redis_defrag_key_hits: Number of keys that were actively defragmented.
redis_defrag_key_misses: Number of keys that were skipped by the active defragmentation process.
redis_evicted_keys_total: Number of evicted keys due to maxmemory limit.(Counter)
redis_expired_keys_total: Total number of key expiration events.(Counter)
redis_expired_stale_percentage: The percentage of keys probably expired.
redis_expired_time_cap_reached_total: The count of times that active expiry cycles have stopped early.
redis_exporter_last_scrape_connect_time_seconds: The duration(in seconds) to connect when scrape.
redis_exporter_last_scrape_duration_seconds: The last scrape duration.
redis_exporter_last_scrape_error: The last scrape error status.
redis_exporter_scrape_duration_seconds_count: Durations of scrapes by the exporter
redis_exporter_scrape_duration_seconds_sum: Durations of scrapes by the exporter
redis_exporter_scrapes_total: Current total redis scrapes.(Counter)
redis_instance_info: Information about the Redis instance.
redis_keyspace_hits_total: Hits total.(Counter)
redis_keyspace_misses_total: Misses total.(Counter)
redis_last_key_groups_scrape_duration_milliseconds: Duration of the last key group metrics scrape in milliseconds.
redis_last_slow_execution_duration_seconds: The amount of time needed for last slow execution, in seconds.
redis_latest_fork_seconds: The amount of time needed for last fork, in seconds.
redis_lazyfree_pending_objects: The number of objects waiting to be freed (as a result of calling UNLINK, or FLUSHDB and FLUSHALL with the ASYNC option).
redis_master_repl_offset: The server's current replication offset.
redis_mem_clients_normal: Memory used by normal clients.(Gauge)
redis_mem_clients_slaves: Memory used by replica clients - Starting Redis 7.0, replica buffers share memory with the replication backlog, so this field can show 0 when replicas don't trigger an increase of memory usage.
redis_mem_fragmentation_bytes: Delta between used_memory_rss and used_memory. Note that when the total fragmentation bytes is low (few megabytes), a high ratio (e.g. 1.5 and above) is not an indication of an issue.
redis_mem_fragmentation_ratio: Ratio between used_memory_rss and used_memory. Note that this doesn't only includes fragmentation, but also other process overheads (see the allocator_* metrics), and also overheads like code, shared libraries, stack, etc.
redis_mem_not_counted_for_eviction_bytes: (Gauge)
redis_memory_max_bytes: Max memory limit in bytes.
redis_memory_used_bytes: Total number of bytes allocated by Redis using its allocator (either standard libc, jemalloc, or an alternative allocator such as tcmalloc)
redis_memory_used_dataset_bytes: The size in bytes of the dataset (used_memory_overhead subtracted from used_memory)
redis_memory_used_lua_bytes: Number of bytes used by the Lua engine.
redis_memory_used_overhead_bytes: The sum in bytes of all overheads that the server allocated for managing its internal data structures.
redis_memory_used_peak_bytes: Peak memory consumed by Redis (in bytes)
redis_memory_used_rss_bytes: Number of bytes that Redis allocated as seen by the operating system (a.k.a resident set size). This is the number reported by tools such as top(1) and ps(1)
redis_memory_used_scripts_bytes: Number of bytes used by cached Lua scripts
redis_memory_used_startup_bytes: Initial amount of memory consumed by Redis at startup in bytes
redis_migrate_cached_sockets_total: The number of sockets open for MIGRATE purposes
redis_net_input_bytes_total: Total input bytes(Counter)
redis_net_output_bytes_total: Total output bytes(Counter)
redis_process_id: Process ID
redis_pubsub_channels: Global number of pub/sub channels with client subscriptions
redis_pubsub_patterns: Global number of pub/sub pattern with client subscriptions
redis_rdb_bgsave_in_progress: Flag indicating a RDB save is on-going
redis_rdb_changes_since_last_save: Number of changes since the last dump
redis_rdb_current_bgsave_duration_sec: Duration of the on-going RDB save operation if any
redis_rdb_last_bgsave_duration_sec: Duration of the last RDB save operation in seconds
redis_rdb_last_bgsave_status: Status of the last RDB save operation
redis_rdb_last_cow_size_bytes: The size in bytes of copy-on-write memory during the last RDB save operation
redis_rdb_last_save_timestamp_seconds: Epoch-based timestamp of last successful RDB save
redis_rejected_connections_total: Number of connections rejected because of maxclients limit(Counter)
redis_repl_backlog_first_byte_offset: The master offset of the replication backlog buffer
redis_repl_backlog_history_bytes: Size in bytes of the data in the replication backlog buffer
redis_repl_backlog_is_active: Flag indicating replication backlog is active
redis_replica_partial_resync_accepted: The number of accepted partial resync requests(Gauge)
redis_replica_partial_resync_denied: The number of denied partial resync requests(Gauge)
redis_replica_resyncs_full: The number of full resyncs with replicas
redis_replication_backlog_bytes: Memory used by replication backlog
redis_second_repl_offset: The offset up to which replication IDs are accepted.
redis_slave_expires_tracked_keys: The number of keys tracked for expiry purposes (applicable only to writable replicas)(Gauge)
redis_slowlog_last_id: Last id of slowlog
redis_slowlog_length: Total slowlog
redis_start_time_seconds: Start time of the Redis instance since unix epoch in seconds.
redis_target_scrape_request_errors_total: Errors in requests to the exporter
redis_up: Flag indicating redis instance is up
redis_uptime_in_seconds: Number of seconds since Redis server start
# [windows_exporter]
windows_cpu_clock_interrupts_total: Total number of received and serviced clock tick interrupts(counter)
windows_cpu_core_frequency_mhz: Core frequency in megahertz(gauge)
windows_cpu_cstate_seconds_total: Time spent in low-power idle state(counter)
windows_cpu_dpcs_total: Total number of received and serviced deferred procedure calls (DPCs)(counter)
windows_cpu_idle_break_events_total: Total number of time processor was woken from idle(counter)
windows_cpu_interrupts_total: Total number of received and serviced hardware interrupts(counter)
windows_cpu_parking_status: Parking Status represents whether a processor is parked or not(gauge)
windows_cpu_processor_performance: Processor Performance is the average performance of the processor while it is executing instructions, as a percentage of the nominal performance of the processor. On some processors, Processor Performance may exceed 100%(gauge)
windows_cpu_time_total: Time that processor spent in different modes (idle, user, system, ...)(counter)
windows_cs_hostname: Labeled system hostname information as provided by ComputerSystem.DNSHostName and ComputerSystem.Domain(gauge)
windows_cs_logical_processors: ComputerSystem.NumberOfLogicalProcessors(gauge)
windows_cs_physical_memory_bytes: ComputerSystem.TotalPhysicalMemory(gauge)
windows_exporter_build_info: A metric with a constant '1' value labeled by version, revision, branch, and goversion from which windows_exporter was built.(gauge)
windows_exporter_collector_duration_seconds: Duration of a collection.(gauge)
windows_exporter_collector_success: Whether the collector was successful.(gauge)
windows_exporter_collector_timeout: Whether the collector timed out.(gauge)
windows_exporter_perflib_snapshot_duration_seconds: Duration of perflib snapshot capture(gauge)
windows_logical_disk_free_bytes: Free space in bytes (LogicalDisk.PercentFreeSpace)(gauge)
windows_logical_disk_idle_seconds_total: Seconds that the disk was idle (LogicalDisk.PercentIdleTime)(counter)
windows_logical_disk_read_bytes_total: The number of bytes transferred from the disk during read operations (LogicalDisk.DiskReadBytesPerSec)(counter)
windows_logical_disk_read_latency_seconds_total: Shows the average time, in seconds, of a read operation from the disk (LogicalDisk.AvgDiskSecPerRead)(counter)
windows_logical_disk_read_seconds_total: Seconds that the disk was busy servicing read requests (LogicalDisk.PercentDiskReadTime)(counter)
windows_logical_disk_read_write_latency_seconds_total: Shows the time, in seconds, of the average disk transfer (LogicalDisk.AvgDiskSecPerTransfer)(counter)
windows_logical_disk_reads_total: The number of read operations on the disk (LogicalDisk.DiskReadsPerSec)(counter)
windows_logical_disk_requests_queued: The number of requests queued to the disk (LogicalDisk.CurrentDiskQueueLength)(gauge)
windows_logical_disk_size_bytes: Total space in bytes (LogicalDisk.PercentFreeSpace_Base)(gauge)
windows_logical_disk_split_ios_total: The number of I/Os to the disk were split into multiple I/Os (LogicalDisk.SplitIOPerSec)(counter)
windows_logical_disk_write_bytes_total: The number of bytes transferred to the disk during write operations (LogicalDisk.DiskWriteBytesPerSec)(counter)
windows_logical_disk_write_latency_seconds_total: Shows the average time, in seconds, of a write operation to the disk (LogicalDisk.AvgDiskSecPerWrite)(counter)
windows_logical_disk_write_seconds_total: Seconds that the disk was busy servicing write requests (LogicalDisk.PercentDiskWriteTime)(counter)
windows_logical_disk_writes_total: The number of write operations on the disk (LogicalDisk.DiskWritesPerSec)(counter)
windows_net_bytes_received_total: (Network.BytesReceivedPerSec)(counter)
windows_net_bytes_sent_total: (Network.BytesSentPerSec)(counter)
windows_net_bytes_total: (Network.BytesTotalPerSec)(counter)
windows_net_current_bandwidth: (Network.CurrentBandwidth)(gauge)
windows_net_packets_outbound_discarded_total: (Network.PacketsOutboundDiscarded)(counter)
windows_net_packets_outbound_errors_total: (Network.PacketsOutboundErrors)(counter)
windows_net_packets_received_discarded_total: (Network.PacketsReceivedDiscarded)(counter)
windows_net_packets_received_errors_total: (Network.PacketsReceivedErrors)(counter)
windows_net_packets_received_total: (Network.PacketsReceivedPerSec)(counter)
windows_net_packets_received_unknown_total: (Network.PacketsReceivedUnknown)(counter)
windows_net_packets_sent_total: (Network.PacketsSentPerSec)(counter)
windows_net_packets_total: (Network.PacketsPerSec)(counter)
windows_os_info: OperatingSystem.Caption, OperatingSystem.Version(gauge)
windows_os_paging_free_bytes: OperatingSystem.FreeSpaceInPagingFiles(gauge)
windows_os_paging_limit_bytes: OperatingSystem.SizeStoredInPagingFiles(gauge)
windows_os_physical_memory_free_bytes: OperatingSystem.FreePhysicalMemory(gauge)
windows_os_process_memory_limix_bytes: OperatingSystem.MaxProcessMemorySize(gauge)
windows_os_processes: OperatingSystem.NumberOfProcesses(gauge)
windows_os_processes_limit: OperatingSystem.MaxNumberOfProcesses(gauge)
windows_os_time: OperatingSystem.LocalDateTime(gauge)
windows_os_timezone: OperatingSystem.LocalDateTime(gauge)
windows_os_users: OperatingSystem.NumberOfUsers(gauge)
windows_os_virtual_memory_bytes: OperatingSystem.TotalVirtualMemorySize(gauge)
windows_os_virtual_memory_free_bytes: OperatingSystem.FreeVirtualMemory(gauge)
windows_os_visible_memory_bytes: OperatingSystem.TotalVisibleMemorySize(gauge)
windows_service_info: A metric with a constant '1' value labeled with service information(gauge)
windows_service_start_mode: The start mode of the service (StartMode)(gauge)
windows_service_state: The state of the service (State)(gauge)
windows_service_status: The status of the service (Status)(gauge)
windows_system_context_switches_total: Total number of context switches (WMI source is PerfOS_System.ContextSwitchesPersec)(counter)
windows_system_exception_dispatches_total: Total number of exceptions dispatched (WMI source is PerfOS_System.ExceptionDispatchesPersec)(counter)
windows_system_processor_queue_length: Length of processor queue (WMI source is PerfOS_System.ProcessorQueueLength)(gauge)
windows_system_system_calls_total: Total number of system calls (WMI source is PerfOS_System.SystemCallsPersec)(counter)
windows_system_system_up_time: System boot time (WMI source is PerfOS_System.SystemUpTime)(gauge)
windows_system_threads: Current number of threads (WMI source is PerfOS_System.Threads)(gauge)
# [node_exporter]
# SYSTEM
# CPU context switch 次数
node_context_switches_total: context_switches
# Interrupts 次数
node_intr_total: Interrupts
# 运行的进程数
node_procs_running: Processes in runnable state
# 熵池大小
node_entropy_available_bits: Entropy available to random number generators
node_time_seconds: System time in seconds since epoch (1970)
node_boot_time_seconds: Node boot time, in unixtime
# CPU
node_cpu_seconds_total: Seconds the CPUs spent in each mode
node_load1: cpu load 1m
node_load5: cpu load 5m
node_load15: cpu load 15m
# MEM
# 内核态
# 内核用于缓存数据结构供自己使用的内存
node_memory_Slab_bytes: Memory used by the kernel to cache data structures for its own use
# slab中可回收的部分
node_memory_SReclaimable_bytes: SReclaimable - Part of Slab, that might be reclaimed, such as caches
# slab中不可回收的部分
node_memory_SUnreclaim_bytes: Part of Slab, that cannot be reclaimed on memory pressure
# Vmalloc内存区的大小
node_memory_VmallocTotal_bytes: Total size of vmalloc memory area
# vmalloc已分配的内存虚拟地址空间上的连续的内存
node_memory_VmallocUsed_bytes: Amount of vmalloc area which is used
# vmalloc区可用的连续最大快的大小通过此指标可以知道vmalloc可分配连续内存的最大值
node_memory_VmallocChunk_bytes: Largest contigious block of vmalloc area which is free
# 内存的硬件故障删除掉的内存页的总大小
node_memory_HardwareCorrupted_bytes: Amount of RAM that the kernel identified as corrupted / not working
# 用于在虚拟和物理内存地址之间映射的内存
node_memory_PageTables_bytes: Memory used to map between virtual and physical memory addresses (gauge)
# 内核栈内存,常驻内存,不可回收
node_memory_KernelStack_bytes: Kernel memory stack. This is not reclaimable
# 用来访问高端内存复制高端内存的临时buffer称为“bounce buffering”会降低I/O 性能
node_memory_Bounce_bytes: Memory used for block device bounce buffers
#用户态
# 单个巨页大小
node_memory_Hugepagesize_bytes: Huge Page size
# 系统分配的常驻巨页数
node_memory_HugePages_Total: Total size of the pool of huge pages
# 系统空闲的巨页数
node_memory_HugePages_Free: Huge pages in the pool that are not yet allocated
# 进程已申请但未使用的巨页数
node_memory_HugePages_Rsvd: Huge pages for which a commitment to allocate from the pool has been made, but no allocation
# 超过系统设定的常驻HugePages数量的个数
node_memory_HugePages_Surp: Huge pages in the pool above the value in /proc/sys/vm/nr_hugepages
# 透明巨页 Transparent HugePages (THP)
node_memory_AnonHugePages_bytes: Memory in anonymous huge pages
# inactivelist中的File-backed内存
node_memory_Inactive_file_bytes: File-backed memory on inactive LRU list
# inactivelist中的Anonymous内存
node_memory_Inactive_anon_bytes: Anonymous and swap cache on inactive LRU list, including tmpfs (shmem)
# activelist中的File-backed内存
node_memory_Active_file_bytes: File-backed memory on active LRU list
# activelist中的Anonymous内存
node_memory_Active_anon_bytes: Anonymous and swap cache on active least-recently-used (LRU) list, including tmpfs
# 禁止换出的页,对应 Unevictable 链表
node_memory_Unevictable_bytes: Amount of unevictable memory that can't be swapped out for a variety of reasons
# 共享内存
node_memory_Shmem_bytes: Used shared memory (shared between several processes, thus including RAM disks)
# 匿名页内存大小
node_memory_AnonPages_bytes: Memory in user pages not backed by files
# 被关联的内存页大小
node_memory_Mapped_bytes: Used memory in mapped pages files which have been mmaped, such as libraries
# file-backed内存页缓存大小
node_memory_Cached_bytes: Parked file data (file content) cache
# 系统中有多少匿名页曾经被swap-out、现在又被swap-in并且swap-in之后页面中的内容一直没发生变化
node_memory_SwapCached_bytes: Memory that keeps track of pages that have been fetched from swap but not yet been modified
# 被mlock()系统调用锁定的内存大小
node_memory_Mlocked_bytes: Size of pages locked to memory using the mlock() system call
# 块设备(block device)所占用的缓存页
node_memory_Buffers_bytes: Block device (e.g. harddisk) cache
node_memory_SwapTotal_bytes: Memory information field SwapTotal_bytes
node_memory_SwapFree_bytes: Memory information field SwapFree_bytes
# DISK
node_filesystem_avail_bytes: Filesystem space available to non-root users in byte
node_filesystem_free_bytes: Filesystem free space in bytes
node_filesystem_size_bytes: Filesystem size in bytes
node_filesystem_files_free: Filesystem total free file nodes
node_filesystem_files: Filesystem total free file nodes
node_filefd_maximum: Max open files
node_filefd_allocated: Open files
node_filesystem_readonly: Filesystem read-only status
node_filesystem_device_error: Whether an error occurred while getting statistics for the given device
node_disk_reads_completed_total: The total number of reads completed successfully
node_disk_writes_completed_total: The total number of writes completed successfully
node_disk_reads_merged_total: The number of reads merged
node_disk_writes_merged_total: The number of writes merged
node_disk_read_bytes_total: The total number of bytes read successfully
node_disk_written_bytes_total: The total number of bytes written successfully
node_disk_io_time_seconds_total: Total seconds spent doing I/Os
node_disk_read_time_seconds_total: The total number of seconds spent by all reads
node_disk_write_time_seconds_total: The total number of seconds spent by all writes
node_disk_io_time_weighted_seconds_total: The weighted of seconds spent doing I/Os
# NET
node_network_receive_bytes_total: Network device statistic receive_bytes (counter)
node_network_transmit_bytes_total: Network device statistic transmit_bytes (counter)
node_network_receive_packets_total: Network device statistic receive_bytes
node_network_transmit_packets_total: Network device statistic transmit_bytes
node_network_receive_errs_total: Network device statistic receive_errs
node_network_transmit_errs_total: Network device statistic transmit_errs
node_network_receive_drop_total: Network device statistic receive_drop
node_network_transmit_drop_total: Network device statistic transmit_drop
node_nf_conntrack_entries: Number of currently allocated flow entries for connection tracking
node_sockstat_TCP_alloc: Number of TCP sockets in state alloc
node_sockstat_TCP_inuse: Number of TCP sockets in state inuse
node_sockstat_TCP_orphan: Number of TCP sockets in state orphan
node_sockstat_TCP_tw: Number of TCP sockets in state tw
node_netstat_Tcp_CurrEstab: Statistic TcpCurrEstab
node_sockstat_sockets_used: Number of IPv4 sockets in use
# [kafka_exporter]
kafka_brokers: count of kafka_brokers (gauge)
kafka_topic_partitions: Number of partitions for this Topic (gauge)
kafka_topic_partition_current_offset: Current Offset of a Broker at Topic/Partition (gauge)
kafka_consumergroup_current_offset: Current Offset of a ConsumerGroup at Topic/Partition (gauge)
kafka_consumer_lag_millis: Current approximation of consumer lag for a ConsumerGroup at Topic/Partition (gauge)
kafka_topic_partition_under_replicated_partition: 1 if Topic/Partition is under Replicated
# [zookeeper_exporter]
zk_znode_count: The total count of znodes stored
zk_ephemerals_count: The number of Ephemerals nodes
zk_watch_count: The number of watchers setup over Zookeeper nodes.
zk_approximate_data_size: Size of data in bytes that a zookeeper server has in its data tree
zk_outstanding_requests: Number of currently executing requests
zk_packets_sent: Count of the number of zookeeper packets sent from a server
zk_packets_received: Count of the number of zookeeper packets received by a server
zk_num_alive_connections: Number of active clients connected to a zookeeper server
zk_open_file_descriptor_count: Number of file descriptors that a zookeeper server has open
zk_max_file_descriptor_count: Maximum number of file descriptors that a zookeeper server can open
zk_avg_latency: Average time in milliseconds for requests to be processed
zk_min_latency: Maximum time in milliseconds for a request to be processed
zk_max_latency: Minimum time in milliseconds for a request to be processed

View File

@@ -0,0 +1,216 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import json
import urllib2
import smtplib
from email.mime.text import MIMEText
reload(sys)
sys.setdefaultencoding('utf8')
notify_channel_funcs = {
"email":"email",
"sms":"sms",
"voice":"voice",
"dingtalk":"dingtalk",
"wecom":"wecom",
"feishu":"feishu"
}
mail_host = "smtp.163.com"
mail_port = 994
mail_user = "ulricqin"
mail_pass = "password"
mail_from = "ulricqin@163.com"
class Sender(object):
@classmethod
def send_email(cls, payload):
if mail_user == "ulricqin" and mail_pass == "password":
print("invalid smtp configuration")
return
users = payload.get('event').get("notify_users_obj")
emails = {}
for u in users:
if u.get("email"):
emails[u.get("email")] = 1
if not emails:
return
recipients = emails.keys()
mail_body = payload.get('tpls').get("email.tpl", "email.tpl not found")
message = MIMEText(mail_body, 'html', 'utf-8')
message['From'] = mail_from
message['To'] = ", ".join(recipients)
message["Subject"] = payload.get('tpls').get("subject.tpl", "subject.tpl not found")
try:
smtp = smtplib.SMTP_SSL(mail_host, mail_port)
smtp.login(mail_user, mail_pass)
smtp.sendmail(mail_from, recipients, message.as_string())
smtp.close()
except smtplib.SMTPException, error:
print(error)
@classmethod
def send_wecom(cls, payload):
users = payload.get('event').get("notify_users_obj")
tokens = {}
for u in users:
contacts = u.get("contacts")
if contacts.get("wecom_robot_token", ""):
tokens[contacts.get("wecom_robot_token", "")] = 1
opener = urllib2.build_opener(urllib2.HTTPHandler())
method = "POST"
for t in tokens:
url = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key={}".format(t)
body = {
"msgtype": "markdown",
"markdown": {
"content": payload.get('tpls').get("wecom.tpl", "wecom.tpl not found")
}
}
request = urllib2.Request(url, data=json.dumps(body))
request.add_header("Content-Type",'application/json;charset=utf-8')
request.get_method = lambda: method
try:
connection = opener.open(request)
print(connection.read())
except urllib2.HTTPError, error:
print(error)
@classmethod
def send_dingtalk(cls, payload):
event = payload.get('event')
users = event.get("notify_users_obj")
rule_name = event.get("rule_name")
event_state = "Triggered"
if event.get("is_recovered"):
event_state = "Recovered"
tokens = {}
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
contacts = u.get("contacts")
if contacts.get("dingtalk_robot_token", ""):
tokens[contacts.get("dingtalk_robot_token", "")] = 1
opener = urllib2.build_opener(urllib2.HTTPHandler())
method = "POST"
for t in tokens:
url = "https://oapi.dingtalk.com/robot/send?access_token={}".format(t)
body = {
"msgtype": "markdown",
"markdown": {
"title": "{} - {}".format(event_state, rule_name),
"text": payload.get('tpls').get("dingtalk.tpl", "dingtalk.tpl not found") + ' '.join(["@"+i for i in phones.keys()])
},
"at": {
"atMobiles": phones.keys(),
"isAtAll": False
}
}
request = urllib2.Request(url, data=json.dumps(body))
request.add_header("Content-Type",'application/json;charset=utf-8')
request.get_method = lambda: method
try:
connection = opener.open(request)
print(connection.read())
except urllib2.HTTPError, error:
print(error)
@classmethod
def send_feishu(cls, payload):
users = payload.get('event').get("notify_users_obj")
tokens = {}
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
contacts = u.get("contacts")
if contacts.get("feishu_robot_token", ""):
tokens[contacts.get("feishu_robot_token", "")] = 1
opener = urllib2.build_opener(urllib2.HTTPHandler())
method = "POST"
for t in tokens:
url = "https://open.feishu.cn/open-apis/bot/v2/hook/{}".format(t)
body = {
"msg_type": "text",
"content": {
"text": payload.get('tpls').get("feishu.tpl", "feishu.tpl not found")
},
"at": {
"atMobiles": phones.keys(),
"isAtAll": False
}
}
request = urllib2.Request(url, data=json.dumps(body))
request.add_header("Content-Type",'application/json;charset=utf-8')
request.get_method = lambda: method
try:
connection = opener.open(request)
print(connection.read())
except urllib2.HTTPError, error:
print(error)
@classmethod
def send_sms(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_sms not implemented, phones: {}".format(phones.keys()))
@classmethod
def send_voice(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_voice not implemented, phones: {}".format(phones.keys()))
def main():
payload = json.load(sys.stdin)
with open(".payload", 'w') as f:
f.write(json.dumps(payload, indent=4))
for ch in payload.get('event').get('notify_channels'):
send_func_name = "send_{}".format(notify_channel_funcs.get(ch.strip()))
if not hasattr(Sender, send_func_name):
print("function: {} not found", send_func_name)
continue
send_func = getattr(Sender, send_func_name)
send_func(payload)
def hello():
print("hello nightingale")
if __name__ == "__main__":
if len(sys.argv) == 1:
main()
elif sys.argv[1] == "hello":
hello()
else:
print("I am confused")

View File

@@ -0,0 +1,73 @@
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
import sys
import json
class Sender(object):
@classmethod
def send_email(cls, payload):
# already done in go code
pass
@classmethod
def send_wecom(cls, payload):
# already done in go code
pass
@classmethod
def send_dingtalk(cls, payload):
# already done in go code
pass
@classmethod
def send_feishu(cls, payload):
# already done in go code
pass
@classmethod
def send_mm(cls, payload):
# already done in go code
pass
@classmethod
def send_sms(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_sms not implemented, phones: {}".format(phones.keys()))
@classmethod
def send_voice(cls, payload):
users = payload.get('event').get("notify_users_obj")
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
if phones:
print("send_voice not implemented, phones: {}".format(phones.keys()))
def main():
payload = json.load(sys.stdin)
with open(".payload", 'w') as f:
f.write(json.dumps(payload, indent=4))
for ch in payload.get('event').get('notify_channels'):
send_func_name = "send_{}".format(ch.strip())
if not hasattr(Sender, send_func_name):
print("function: {} not found", send_func_name)
continue
send_func = getattr(Sender, send_func_name)
send_func(payload)
def hello():
print("hello nightingale")
if __name__ == "__main__":
if len(sys.argv) == 1:
main()
elif sys.argv[1] == "hello":
hello()
else:
print("I am confused")

View File

@@ -0,0 +1,92 @@
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import sys
import json
import requests
class Sender(object):
@classmethod
def send_email(cls, payload):
# already done in go code
pass
@classmethod
def send_wecom(cls, payload):
# already done in go code
pass
@classmethod
def send_dingtalk(cls, payload):
# already done in go code
pass
@classmethod
def send_ifeishu(cls, payload):
users = payload.get('event').get("notify_users_obj")
tokens = {}
phones = {}
for u in users:
if u.get("phone"):
phones[u.get("phone")] = 1
contacts = u.get("contacts")
if contacts.get("feishu_robot_token", ""):
tokens[contacts.get("feishu_robot_token", "")] = 1
headers = {
"Content-Type": "application/json;charset=utf-8",
"Host": "open.feishu.cn"
}
for t in tokens:
url = "https://open.feishu.cn/open-apis/bot/v2/hook/{}".format(t)
body = {
"msg_type": "text",
"content": {
"text": payload.get('tpls').get("feishu", "feishu not found")
},
"at": {
"atMobiles": list(phones.keys()),
"isAtAll": False
}
}
response = requests.post(url, headers=headers, data=json.dumps(body))
print(f"notify_ifeishu: token={t} status_code={response.status_code} response_text={response.text}")
@classmethod
def send_mm(cls, payload):
# already done in go code
pass
@classmethod
def send_sms(cls, payload):
pass
@classmethod
def send_voice(cls, payload):
pass
def main():
payload = json.load(sys.stdin)
with open(".payload", 'w') as f:
f.write(json.dumps(payload, indent=4))
for ch in payload.get('event').get('notify_channels'):
send_func_name = "send_{}".format(ch.strip())
if not hasattr(Sender, send_func_name):
print("function: {} not found", send_func_name)
continue
send_func = getattr(Sender, send_func_name)
send_func(payload)
def hello():
print("hello nightingale")
if __name__ == "__main__":
if len(sys.argv) == 1:
main()
elif sys.argv[1] == "hello":
hello()
else:
print("I am confused")

Some files were not shown because too many files have changed in this diff Show More