Compare commits

...

15 Commits

Author SHA1 Message Date
Andrei Kvapil
0daa7605af Prepare release v0.16.1 (#390)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Introduced the `cozystack` application with necessary Kubernetes
resources, including a new namespace, service account, and deployment.
- Updated container images for `cozystack` and associated services to
version `v0.16.1`.

- **Bug Fixes**
- Resolved issues related to image versioning across various components,
ensuring consistency and reliability.

- **Documentation**
- Updated configuration files to reflect new image tags and versions for
multiple components, enhancing clarity for users.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-04 12:34:40 +02:00
Andrei Kvapil
4eaca42ce9 fix node-exporter alerts (#389)
to show node hostname instead of ip address

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-03 16:14:42 +02:00
Andrei Kvapil
b605c85eb2 Rework alerts; Add fluxcd alerts (#388)
- Rework alerts
- Add fluxcd alerts

---------

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-03 15:59:49 +02:00
Andrei Kvapil
929ab5c5eb cilium: enable native routing in distro-full bundle (#384)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-02 15:21:59 +02:00
Andrei Kvapil
4b90bf5aac Prepare release v0.16.0 (#375)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-01 18:53:30 +02:00
Andrei Kvapil
7a1b56fa78 postgres: fix setting max_connections (#382)
fix regression introduced by
https://github.com/aenix-io/cozystack/pull/376

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Enhanced flexibility in PostgreSQL configuration with conditional
handling of the `max_connections` parameter.

- **Bug Fixes**
- Improved parameter assignment logic for better configuration
management.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
2024-10-01 18:38:03 +02:00
Andrei Kvapil
7161b4db06 Disable Kamaji default datastore check (#381)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-01 17:52:07 +02:00
Andrei Kvapil
b6e3203446 Update Talos Linux v1.8.0 (#380)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-01 17:12:07 +02:00
Andrei Kvapil
ab8394140c Update fluxcd v2.4.0 (#379)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-01 13:35:47 +02:00
Andrei Kvapil
d657ca62b8 Update Cilium v1.16.2 and enable genev_sys_6081 interface (#378)
This PR includes the upstream fix:
- https://github.com/kubeovn/kube-ovn/pull/4575

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-10-01 13:32:18 +02:00
klinch0
3d928611ed fix postgres max_connections (#376)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

- **New Features**
- Updated the `max_connections` parameter to accept numeric values for
improved clarity and correctness in PostgreSQL configurations.

- **Bug Fixes**
- Corrected the data type for `max_connections` from string to number in
both schema and configuration files to ensure proper interpretation by
the PostgreSQL server.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Co-authored-by: Kirill Klinchenkov <Kirill.Klinchenkov@mvideo.ru>
2024-09-30 18:03:23 +02:00
Andrei Kvapil
8cb2256042 Nginx-ingress: fix tls-passthrough if ClientHello is fragmented (#372)
Fixed nginx-ingress image to include this patch:
- https://github.com/kubernetes/ingress-nginx/pull/11843

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-09-27 15:47:55 +02:00
Andrei Kvapil
ecfa4f8005 Seaweedfs: Fix attributes for bucket creation (#371)
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
2024-09-27 11:49:25 +02:00
Kingdon Barrett
01ce122ada Adopt flux-instance from upstream (#363)
Builds on #362 

The main issue we will have to solve (maybe with a patch) is that
`cluster.domain` is always specified in this chart;

I'm reading to try to recall how we solved this last time.

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

- **New Features**
- Updated the Flux Operator Helm chart to version 0.9.0, introducing
enhanced configuration options for service monitoring and resource
management.
	- Added a new `ServiceMonitor` resource for Prometheus integration.
- Introduced a `serviceMonitor` configuration option with default values
for scraping settings.
- New `FluxInstance` resource configuration file added for deploying a
Flux instance.

- **Documentation**
- Updated README files to reflect new version and provide installation
instructions for the Flux instance.
- Added a `NOTES.txt` file directing users to Flux CD operator
documentation.

- **Bug Fixes**
- Corrected links in documentation and ensured proper metadata for the
new chart.

- **Chores**
- Restructured configuration files for improved organization and
clarity.
	- Introduced a `.helmignore` file to streamline package building.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Kingdon Barrett <kingdon+github@tuesdaystudios.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Andrei Kvapil <kvapss@gmail.com>
2024-09-26 20:40:34 +02:00
Andrei Kvapil
00b2834efc Fix rabbitmq users creation (#367) 2024-09-26 20:26:28 +02:00
136 changed files with 3911 additions and 8114 deletions

View File

@@ -114,7 +114,7 @@ machine:
- name: zfs
- name: spl
install:
image: ghcr.io/aenix-io/cozystack/talos:v1.7.1
image: ghcr.io/aenix-io/cozystack/talos:v1.8.0
files:
- content: |
[plugins]

View File

@@ -68,7 +68,7 @@ spec:
serviceAccountName: cozystack
containers:
- name: cozystack
image: "ghcr.io/aenix-io/cozystack/cozystack:v0.15.0"
image: "ghcr.io/aenix-io/cozystack/cozystack:v0.16.1"
env:
- name: KUBERNETES_SERVICE_HOST
value: localhost
@@ -87,7 +87,7 @@ spec:
fieldRef:
fieldPath: metadata.name
- name: darkhttpd
image: "ghcr.io/aenix-io/cozystack/cozystack:v0.15.0"
image: "ghcr.io/aenix-io/cozystack/cozystack:v0.16.1"
command:
- /usr/bin/darkhttpd
- /cozystack/assets

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/postgres-backup:0.6.2@sha256:d2015c6dba92293bda652d055e97d1be80e8414c2dc78037c12812d1a2e2cba1
ghcr.io/aenix-io/cozystack/postgres-backup:0.7.0@sha256:d2015c6dba92293bda652d055e97d1be80e8414c2dc78037c12812d1a2e2cba1

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/nginx-cache:0.3.1@sha256:556bc8d29ee9e90b3d64d0481dcfc66483d055803315bba3d9ece17c0d97f32b
ghcr.io/aenix-io/cozystack/nginx-cache:0.3.1@sha256:cd744b2d1d50191f4908f2db83079b32973d1c009fe9468627be72efbfa0a107

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/cluster-autoscaler:latest@sha256:7f617de5a24de790a15d9e97c6287ff2b390922e6e74c7a665cbf498f634514d
ghcr.io/aenix-io/cozystack/cluster-autoscaler:0.11.0@sha256:7f617de5a24de790a15d9e97c6287ff2b390922e6e74c7a665cbf498f634514d

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/kubevirt-cloud-provider:latest@sha256:735aa8092501fc0f2904b685b15bc0137ea294cb08301ca1185d3dec5f467f0f
ghcr.io/aenix-io/cozystack/kubevirt-cloud-provider:0.11.0@sha256:91e6843afa704ba7c513842bc3a612f2c0b295ce95aebe60fbb6be09709a1947

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/kubevirt-csi-driver:latest@sha256:e56b46591cdf9140e97c3220a0c2681aadd4a4b3f7ea8473fb2504dc96e8b53a
ghcr.io/aenix-io/cozystack/kubevirt-csi-driver:0.11.0@sha256:1a9e6592fc035dbaae27f308b934206858c2e0025d4c99cd906b51615cc9766c

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/ubuntu-container-disk:v1.30.1@sha256:5ce80a453073c4f44347409133fc7b15f1d2f37a564d189871a4082fc552ff0f
ghcr.io/aenix-io/cozystack/ubuntu-container-disk:v1.30.1@sha256:1f249fbe52821a62f706c6038b13401234e1b758ac498e53395b8f9a642b015f

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/mariadb-backup:0.5.1@sha256:fa2b3195521cffa55eb6d71a50b875d3c234a45e5dff71b2b9002674175bea93
ghcr.io/aenix-io/cozystack/mariadb-backup:0.5.1@sha256:793edb25a29cbc00781e40af883815ca36937e736e2b0d202ea9c9619fb6ca11

View File

@@ -1 +1 @@
ghcr.io/aenix-io/cozystack/postgres-backup:0.6.2@sha256:d2015c6dba92293bda652d055e97d1be80e8414c2dc78037c12812d1a2e2cba1
ghcr.io/aenix-io/cozystack/postgres-backup:0.7.0@sha256:d2015c6dba92293bda652d055e97d1be80e8414c2dc78037c12812d1a2e2cba1

View File

@@ -10,7 +10,9 @@ spec:
postgresql:
parameters:
max_wal_senders: "30"
max_connections: “{{ .Values.postgresql.parameters.max_connections }}
{{- with .Values.postgresql.parameters.max_connections }}
max_connections: "{{ . }}"
{{- end }}
minSyncReplicas: {{ .Values.quorum.minSyncReplicas }}
maxSyncReplicas: {{ .Values.quorum.maxSyncReplicas }}

View File

@@ -29,9 +29,9 @@
"type": "object",
"properties": {
"max_connections": {
"type": "string",
"type": "number",
"description": "Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections",
"default": "100"
"default": 100
}
}
}
@@ -103,4 +103,4 @@
}
}
}
}
}

View File

@@ -14,7 +14,7 @@ storageClass: ""
## @param postgresql.parameters.max_connections Determines the maximum number of concurrent connections to the database server. The default is typically 100 connections
postgresql:
parameters:
max_connections: "100"
max_connections: 100
## Configuration for the quorum-based synchronous replication
## @param quorum.minSyncReplicas Minimum number of synchronous replicas that must acknowledge a transaction before it is considered committed.

View File

@@ -16,7 +16,7 @@ type: application
# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.4.1
version: 0.4.2
# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to

View File

@@ -47,7 +47,7 @@ metadata:
config: '{{ printf "%s %s" $user $password | sha256sum }}'
spec:
importCredentialsSecret:
name: {{ $.Release.Name }}-{{ $user }}-credentials
name: {{ $.Release.Name }}-{{ kebabcase $user }}-credentials
rabbitmqClusterReference:
name: {{ $.Release.Name }}
---

View File

@@ -31,7 +31,8 @@ kubernetes 0.8.0 ac11056e
kubernetes 0.8.1 e54608d8
kubernetes 0.8.2 5ca8823
kubernetes 0.9.0 9b6dd19
kubernetes 0.10.0 HEAD
kubernetes 0.10.0 ac5c38b
kubernetes 0.11.0 HEAD
mysql 0.1.0 f642698
mysql 0.2.0 8b975ff0
mysql 0.3.0 5ca8823
@@ -48,12 +49,14 @@ postgres 0.4.0 ec283c33
postgres 0.4.1 5ca8823
postgres 0.5.0 c07c4bbd
postgres 0.6.0 2a4768a
postgres 0.6.2 HEAD
postgres 0.6.2 54fd61c
postgres 0.7.0 HEAD
rabbitmq 0.1.0 f642698
rabbitmq 0.2.0 5ca8823
rabbitmq 0.3.0 9e33dc0
rabbitmq 0.4.0 36d8855
rabbitmq 0.4.1 HEAD
rabbitmq 0.4.1 35536bb
rabbitmq 0.4.2 HEAD
redis 0.1.1 f642698
redis 0.2.0 5ca8823
redis 0.3.0 HEAD

View File

@@ -3,24 +3,24 @@
arch: amd64
platform: metal
secureboot: false
version: v1.7.6
version: v1.8.0
input:
kernel:
path: /usr/install/amd64/vmlinuz
initramfs:
path: /usr/install/amd64/initramfs.xz
baseInstaller:
imageRef: ghcr.io/siderolabs/installer:v1.7.6
imageRef: ghcr.io/siderolabs/installer:v1.8.0
systemExtensions:
- imageRef: ghcr.io/siderolabs/amd-ucode:20240811
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240811
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240811
- imageRef: ghcr.io/siderolabs/i915-ucode:20240811
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240811
- imageRef: ghcr.io/siderolabs/intel-ucode:20240813
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240811
- imageRef: ghcr.io/siderolabs/drbd:9.2.8-v1.7.6
- imageRef: ghcr.io/siderolabs/zfs:2.2.4-v1.7.6
- imageRef: ghcr.io/siderolabs/amd-ucode:20240909
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240909
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240909
- imageRef: ghcr.io/siderolabs/i915-ucode:20240909
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240909
- imageRef: ghcr.io/siderolabs/intel-ucode:20240910
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240909
- imageRef: ghcr.io/siderolabs/drbd:9.2.11-v1.8.0
- imageRef: ghcr.io/siderolabs/zfs:2.2.6-v1.8.0
output:
kind: initramfs
imageOptions: {}

View File

@@ -3,24 +3,24 @@
arch: amd64
platform: metal
secureboot: false
version: v1.7.6
version: v1.8.0
input:
kernel:
path: /usr/install/amd64/vmlinuz
initramfs:
path: /usr/install/amd64/initramfs.xz
baseInstaller:
imageRef: ghcr.io/siderolabs/installer:v1.7.6
imageRef: ghcr.io/siderolabs/installer:v1.8.0
systemExtensions:
- imageRef: ghcr.io/siderolabs/amd-ucode:20240811
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240811
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240811
- imageRef: ghcr.io/siderolabs/i915-ucode:20240811
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240811
- imageRef: ghcr.io/siderolabs/intel-ucode:20240813
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240811
- imageRef: ghcr.io/siderolabs/drbd:9.2.8-v1.7.6
- imageRef: ghcr.io/siderolabs/zfs:2.2.4-v1.7.6
- imageRef: ghcr.io/siderolabs/amd-ucode:20240909
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240909
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240909
- imageRef: ghcr.io/siderolabs/i915-ucode:20240909
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240909
- imageRef: ghcr.io/siderolabs/intel-ucode:20240910
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240909
- imageRef: ghcr.io/siderolabs/drbd:9.2.11-v1.8.0
- imageRef: ghcr.io/siderolabs/zfs:2.2.6-v1.8.0
output:
kind: installer
imageOptions: {}

View File

@@ -3,24 +3,24 @@
arch: amd64
platform: metal
secureboot: false
version: v1.7.6
version: v1.8.0
input:
kernel:
path: /usr/install/amd64/vmlinuz
initramfs:
path: /usr/install/amd64/initramfs.xz
baseInstaller:
imageRef: ghcr.io/siderolabs/installer:v1.7.6
imageRef: ghcr.io/siderolabs/installer:v1.8.0
systemExtensions:
- imageRef: ghcr.io/siderolabs/amd-ucode:20240811
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240811
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240811
- imageRef: ghcr.io/siderolabs/i915-ucode:20240811
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240811
- imageRef: ghcr.io/siderolabs/intel-ucode:20240813
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240811
- imageRef: ghcr.io/siderolabs/drbd:9.2.8-v1.7.6
- imageRef: ghcr.io/siderolabs/zfs:2.2.4-v1.7.6
- imageRef: ghcr.io/siderolabs/amd-ucode:20240909
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240909
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240909
- imageRef: ghcr.io/siderolabs/i915-ucode:20240909
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240909
- imageRef: ghcr.io/siderolabs/intel-ucode:20240910
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240909
- imageRef: ghcr.io/siderolabs/drbd:9.2.11-v1.8.0
- imageRef: ghcr.io/siderolabs/zfs:2.2.6-v1.8.0
output:
kind: iso
imageOptions: {}

View File

@@ -3,24 +3,24 @@
arch: amd64
platform: metal
secureboot: false
version: v1.7.6
version: v1.8.0
input:
kernel:
path: /usr/install/amd64/vmlinuz
initramfs:
path: /usr/install/amd64/initramfs.xz
baseInstaller:
imageRef: ghcr.io/siderolabs/installer:v1.7.6
imageRef: ghcr.io/siderolabs/installer:v1.8.0
systemExtensions:
- imageRef: ghcr.io/siderolabs/amd-ucode:20240811
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240811
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240811
- imageRef: ghcr.io/siderolabs/i915-ucode:20240811
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240811
- imageRef: ghcr.io/siderolabs/intel-ucode:20240813
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240811
- imageRef: ghcr.io/siderolabs/drbd:9.2.8-v1.7.6
- imageRef: ghcr.io/siderolabs/zfs:2.2.4-v1.7.6
- imageRef: ghcr.io/siderolabs/amd-ucode:20240909
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240909
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240909
- imageRef: ghcr.io/siderolabs/i915-ucode:20240909
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240909
- imageRef: ghcr.io/siderolabs/intel-ucode:20240910
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240909
- imageRef: ghcr.io/siderolabs/drbd:9.2.11-v1.8.0
- imageRef: ghcr.io/siderolabs/zfs:2.2.6-v1.8.0
output:
kind: kernel
imageOptions: {}

View File

@@ -3,24 +3,24 @@
arch: amd64
platform: metal
secureboot: false
version: v1.7.6
version: v1.8.0
input:
kernel:
path: /usr/install/amd64/vmlinuz
initramfs:
path: /usr/install/amd64/initramfs.xz
baseInstaller:
imageRef: ghcr.io/siderolabs/installer:v1.7.6
imageRef: ghcr.io/siderolabs/installer:v1.8.0
systemExtensions:
- imageRef: ghcr.io/siderolabs/amd-ucode:20240811
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240811
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240811
- imageRef: ghcr.io/siderolabs/i915-ucode:20240811
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240811
- imageRef: ghcr.io/siderolabs/intel-ucode:20240813
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240811
- imageRef: ghcr.io/siderolabs/drbd:9.2.8-v1.7.6
- imageRef: ghcr.io/siderolabs/zfs:2.2.4-v1.7.6
- imageRef: ghcr.io/siderolabs/amd-ucode:20240909
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240909
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240909
- imageRef: ghcr.io/siderolabs/i915-ucode:20240909
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240909
- imageRef: ghcr.io/siderolabs/intel-ucode:20240910
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240909
- imageRef: ghcr.io/siderolabs/drbd:9.2.11-v1.8.0
- imageRef: ghcr.io/siderolabs/zfs:2.2.6-v1.8.0
output:
kind: image
imageOptions: { diskSize: 1306525696, diskFormat: raw }

View File

@@ -3,24 +3,24 @@
arch: amd64
platform: nocloud
secureboot: false
version: v1.7.6
version: v1.8.0
input:
kernel:
path: /usr/install/amd64/vmlinuz
initramfs:
path: /usr/install/amd64/initramfs.xz
baseInstaller:
imageRef: ghcr.io/siderolabs/installer:v1.7.6
imageRef: ghcr.io/siderolabs/installer:v1.8.0
systemExtensions:
- imageRef: ghcr.io/siderolabs/amd-ucode:20240811
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240811
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240811
- imageRef: ghcr.io/siderolabs/i915-ucode:20240811
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240811
- imageRef: ghcr.io/siderolabs/intel-ucode:20240813
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240811
- imageRef: ghcr.io/siderolabs/drbd:9.2.8-v1.7.6
- imageRef: ghcr.io/siderolabs/zfs:2.2.4-v1.7.6
- imageRef: ghcr.io/siderolabs/amd-ucode:20240909
- imageRef: ghcr.io/siderolabs/amdgpu-firmware:20240909
- imageRef: ghcr.io/siderolabs/bnx2-bnx2x:20240909
- imageRef: ghcr.io/siderolabs/i915-ucode:20240909
- imageRef: ghcr.io/siderolabs/intel-ice-firmware:20240909
- imageRef: ghcr.io/siderolabs/intel-ucode:20240910
- imageRef: ghcr.io/siderolabs/qlogic-firmware:20240909
- imageRef: ghcr.io/siderolabs/drbd:9.2.11-v1.8.0
- imageRef: ghcr.io/siderolabs/zfs:2.2.6-v1.8.0
output:
kind: image
imageOptions: { diskSize: 1306525696, diskFormat: raw }

View File

@@ -1,2 +1,2 @@
cozystack:
image: ghcr.io/aenix-io/cozystack/cozystack:v0.15.0@sha256:aeff26a80f84b4323578e613b3bf03caa842d617ec8d9ca98706867c1e70609f
image: ghcr.io/aenix-io/cozystack/cozystack:v0.16.1@sha256:f27695d23d449f10888295bd2ba6c084c8fa4b81f109d4836ec9db528b943b62

View File

@@ -29,6 +29,7 @@ releases:
enableIdentityMark: true
ipv4NativeRoutingCIDR: "{{ index $cozyConfig.data "ipv4-pod-cidr" }}"
autoDirectNodeRoutes: true
routingMode: native
- name: cert-manager
releaseName: cert-manager

View File

@@ -1,2 +1,2 @@
e2e:
image: ghcr.io/aenix-io/cozystack/e2e-sandbox:v0.15.0@sha256:20cc84e4a11db31434881355c070113a7823501a28a6114ca02830b18607ad21
image: ghcr.io/aenix-io/cozystack/e2e-sandbox:v0.16.1@sha256:25b298d621ec79431d106184d59849bbae634588742583d111628126ad8615c5

View File

@@ -3,4 +3,4 @@ name: monitoring
description: Monitoring and observability stack
icon: /logos/monitoring.svg
type: application
version: 1.4.0
version: 1.5.0

View File

@@ -12,6 +12,7 @@ monitoring 1.1.0 15478a88
monitoring 1.2.0 c9e0d63b
monitoring 1.2.1 4471b4ba
monitoring 1.3.0 6c5cf5b
monitoring 1.4.0 HEAD
monitoring 1.4.0 adaf603b
monitoring 1.5.0 HEAD
seaweedfs 0.1.0 5ca8823
seaweedfs 0.2.0 HEAD

View File

@@ -79,7 +79,7 @@ annotations:
Pod IP Pool\n description: |\n CiliumPodIPPool defines an IP pool that can
be used for pooled IPAM (i.e. the multi-pool IPAM mode).\n"
apiVersion: v2
appVersion: 1.16.1
appVersion: 1.16.2
description: eBPF-based Networking, Security, and Observability
home: https://cilium.io/
icon: https://cdn.jsdelivr.net/gh/cilium/cilium@main/Documentation/images/logo-solo.svg
@@ -95,4 +95,4 @@ kubeVersion: '>= 1.21.0-0'
name: cilium
sources:
- https://github.com/cilium/cilium
version: 1.16.1
version: 1.16.2

View File

@@ -1,6 +1,6 @@
# cilium
![Version: 1.16.1](https://img.shields.io/badge/Version-1.16.1-informational?style=flat-square) ![AppVersion: 1.16.1](https://img.shields.io/badge/AppVersion-1.16.1-informational?style=flat-square)
![Version: 1.16.2](https://img.shields.io/badge/Version-1.16.2-informational?style=flat-square) ![AppVersion: 1.16.2](https://img.shields.io/badge/AppVersion-1.16.2-informational?style=flat-square)
Cilium is open source software for providing and transparently securing
network connectivity and loadbalancing between application workloads such as
@@ -83,7 +83,7 @@ contributors across the globe, there is almost always someone available to help.
| authentication.mutual.spire.install.agent.tolerations | list | `[{"effect":"NoSchedule","key":"node.kubernetes.io/not-ready"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"},{"effect":"NoSchedule","key":"node-role.kubernetes.io/control-plane"},{"effect":"NoSchedule","key":"node.cloudprovider.kubernetes.io/uninitialized","value":"true"},{"key":"CriticalAddonsOnly","operator":"Exists"}]` | SPIRE agent tolerations configuration By default it follows the same tolerations as the agent itself to allow the Cilium agent on this node to connect to SPIRE. ref: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ |
| authentication.mutual.spire.install.enabled | bool | `true` | Enable SPIRE installation. This will only take effect only if authentication.mutual.spire.enabled is true |
| authentication.mutual.spire.install.existingNamespace | bool | `false` | SPIRE namespace already exists. Set to true if Helm should not create, manage, and import the SPIRE namespace. |
| authentication.mutual.spire.install.initImage | object | `{"digest":"sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7","override":null,"pullPolicy":"IfNotPresent","repository":"docker.io/library/busybox","tag":"1.36.1","useDigest":true}` | init container image of SPIRE agent and server |
| authentication.mutual.spire.install.initImage | object | `{"digest":"sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140","override":null,"pullPolicy":"IfNotPresent","repository":"docker.io/library/busybox","tag":"1.36.1","useDigest":true}` | init container image of SPIRE agent and server |
| authentication.mutual.spire.install.namespace | string | `"cilium-spire"` | SPIRE namespace to install into |
| authentication.mutual.spire.install.server.affinity | object | `{}` | SPIRE server affinity configuration |
| authentication.mutual.spire.install.server.annotations | object | `{}` | SPIRE server annotations |
@@ -182,7 +182,7 @@ contributors across the globe, there is almost always someone available to help.
| clustermesh.apiserver.extraVolumeMounts | list | `[]` | Additional clustermesh-apiserver volumeMounts. |
| clustermesh.apiserver.extraVolumes | list | `[]` | Additional clustermesh-apiserver volumes. |
| clustermesh.apiserver.healthPort | int | `9880` | TCP port for the clustermesh-apiserver health API. |
| clustermesh.apiserver.image | object | `{"digest":"sha256:e9c77417cd474cc943b2303a76c5cf584ac7024dd513ebb8d608cb62fe28896f","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/clustermesh-apiserver","tag":"v1.16.1","useDigest":true}` | Clustermesh API server image. |
| clustermesh.apiserver.image | object | `{"digest":"sha256:cc84190fed92e03a2b3a33bc670b2447b521ee258ad9b076baaad13be312ea73","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/clustermesh-apiserver","tag":"v1.16.2","useDigest":true}` | Clustermesh API server image. |
| clustermesh.apiserver.kvstoremesh.enabled | bool | `true` | Enable KVStoreMesh. KVStoreMesh caches the information retrieved from the remote clusters in the local etcd instance. |
| clustermesh.apiserver.kvstoremesh.extraArgs | list | `[]` | Additional KVStoreMesh arguments. |
| clustermesh.apiserver.kvstoremesh.extraEnv | list | `[]` | Additional KVStoreMesh environment variables. |
@@ -353,7 +353,7 @@ contributors across the globe, there is almost always someone available to help.
| envoy.extraVolumes | list | `[]` | Additional envoy volumes. |
| envoy.healthPort | int | `9878` | TCP port for the health API. |
| envoy.idleTimeoutDurationSeconds | int | `60` | Set Envoy upstream HTTP idle connection timeout seconds. Does not apply to connections with pending requests. Default 60s |
| envoy.image | object | `{"digest":"sha256:bd5ff8c66716080028f414ec1cb4f7dc66f40d2fb5a009fff187f4a9b90b566b","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/cilium-envoy","tag":"v1.29.7-39a2a56bbd5b3a591f69dbca51d3e30ef97e0e51","useDigest":true}` | Envoy container image. |
| envoy.image | object | `{"digest":"sha256:9762041c3760de226a8b00cc12f27dacc28b7691ea926748f9b5c18862db503f","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/cilium-envoy","tag":"v1.29.9-1726784081-a90146d13b4cd7d168d573396ccf2b3db5a3b047","useDigest":true}` | Envoy container image. |
| envoy.livenessProbe.failureThreshold | int | `10` | failure threshold of liveness probe |
| envoy.livenessProbe.periodSeconds | int | `30` | interval between checks of the liveness probe |
| envoy.log.format | string | `"[%Y-%m-%d %T.%e][%t][%l][%n] [%g:%#] %v"` | The format string to use for laying out the log message metadata of Envoy. |
@@ -484,7 +484,7 @@ contributors across the globe, there is almost always someone available to help.
| hubble.relay.extraVolumes | list | `[]` | Additional hubble-relay volumes. |
| hubble.relay.gops.enabled | bool | `true` | Enable gops for hubble-relay |
| hubble.relay.gops.port | int | `9893` | Configure gops listen port for hubble-relay |
| hubble.relay.image | object | `{"digest":"sha256:2e1b4c739a676ae187d4c2bfc45c3e865bda2567cc0320a90cb666657fcfcc35","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/hubble-relay","tag":"v1.16.1","useDigest":true}` | Hubble-relay container image. |
| hubble.relay.image | object | `{"digest":"sha256:4b559907b378ac18af82541dafab430a857d94f1057f2598645624e6e7ea286c","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/hubble-relay","tag":"v1.16.2","useDigest":true}` | Hubble-relay container image. |
| hubble.relay.listenHost | string | `""` | Host to listen to. Specify an empty string to bind to all the interfaces. |
| hubble.relay.listenPort | string | `"4245"` | Port to listen to. |
| hubble.relay.nodeSelector | object | `{"kubernetes.io/os":"linux"}` | Node labels for pod assignment ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector |
@@ -590,7 +590,7 @@ contributors across the globe, there is almost always someone available to help.
| hubble.ui.updateStrategy | object | `{"rollingUpdate":{"maxUnavailable":1},"type":"RollingUpdate"}` | hubble-ui update strategy. |
| identityAllocationMode | string | `"crd"` | Method to use for identity allocation (`crd` or `kvstore`). |
| identityChangeGracePeriod | string | `"5s"` | Time to wait before using new identity on endpoint identity change. |
| image | object | `{"digest":"sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/cilium","tag":"v1.16.1","useDigest":true}` | Agent container image. |
| image | object | `{"digest":"sha256:4386a8580d8d86934908eea022b0523f812e6a542f30a86a47edd8bed90d51ea","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/cilium","tag":"v1.16.2","useDigest":true}` | Agent container image. |
| imagePullSecrets | list | `[]` | Configure image pull secrets for pulling container images |
| ingressController.default | bool | `false` | Set cilium ingress controller to be the default ingress controller This will let cilium ingress controller route entries without ingress class set |
| ingressController.defaultSecretName | string | `nil` | Default secret name for ingresses without .spec.tls[].secretName set. |
@@ -717,7 +717,7 @@ contributors across the globe, there is almost always someone available to help.
| operator.hostNetwork | bool | `true` | HostNetwork setting |
| operator.identityGCInterval | string | `"15m0s"` | Interval for identity garbage collection. |
| operator.identityHeartbeatTimeout | string | `"30m0s"` | Timeout for identity heartbeats. |
| operator.image | object | `{"alibabacloudDigest":"sha256:4381adf48d76ec482551183947e537d44bcac9b6c31a635a9ac63f696d978804","awsDigest":"sha256:e3876fcaf2d6ccc8d5b4aaaded7b1efa971f3f4175eaa2c8a499878d58c39df4","azureDigest":"sha256:e55c222654a44ceb52db7ade3a7b9e8ef05681ff84c14ad1d46fea34869a7a22","genericDigest":"sha256:3bc7e7a43bc4a4d8989cb7936c5d96675dd2d02c306adf925ce0a7c35aa27dc4","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/operator","suffix":"","tag":"v1.16.1","useDigest":true}` | cilium-operator image. |
| operator.image | object | `{"alibabacloudDigest":"sha256:16e33abb6b8381e2f66388b6d7141399f06c9b51b9ffa08fd159b8d321929716","awsDigest":"sha256:b6a73ec94407a56cccc8a395225e2aecc3ca3611e7acfeec86201c19fc0727dd","azureDigest":"sha256:fde7cf8bb887e106cd388bb5c3327e92682b2ec3ab4f03bb57b87f495b99f727","genericDigest":"sha256:cccfd3b886d52cb132c06acca8ca559f0fce91a6bd99016219b1a81fdbc4813a","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/operator","suffix":"","tag":"v1.16.2","useDigest":true}` | cilium-operator image. |
| operator.nodeGCInterval | string | `"5m0s"` | Interval for cilium node garbage collection. |
| operator.nodeSelector | object | `{"kubernetes.io/os":"linux"}` | Node labels for cilium-operator pod assignment ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector |
| operator.podAnnotations | object | `{}` | Annotations to be added to cilium-operator pods |
@@ -767,7 +767,7 @@ contributors across the globe, there is almost always someone available to help.
| preflight.extraEnv | list | `[]` | Additional preflight environment variables. |
| preflight.extraVolumeMounts | list | `[]` | Additional preflight volumeMounts. |
| preflight.extraVolumes | list | `[]` | Additional preflight volumes. |
| preflight.image | object | `{"digest":"sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/cilium","tag":"v1.16.1","useDigest":true}` | Cilium pre-flight image. |
| preflight.image | object | `{"digest":"sha256:4386a8580d8d86934908eea022b0523f812e6a542f30a86a47edd8bed90d51ea","override":null,"pullPolicy":"IfNotPresent","repository":"quay.io/cilium/cilium","tag":"v1.16.2","useDigest":true}` | Cilium pre-flight image. |
| preflight.nodeSelector | object | `{"kubernetes.io/os":"linux"}` | Node labels for preflight pod assignment ref: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector |
| preflight.podAnnotations | object | `{}` | Annotations to be added to preflight pods |
| preflight.podDisruptionBudget.enabled | bool | `false` | enable PodDisruptionBudget ref: https://kubernetes.io/docs/concepts/workloads/pods/disruptions/ |

View File

@@ -26,10 +26,6 @@ spec:
template:
metadata:
annotations:
{{- if and .Values.envoy.prometheus.enabled (not .Values.envoy.prometheus.serviceMonitor.enabled) }}
prometheus.io/port: "{{ .Values.envoy.prometheus.port }}"
prometheus.io/scrape: "true"
{{- end }}
{{- if .Values.envoy.rollOutPods }}
# ensure pods roll when configmap updates
cilium.io/cilium-envoy-configmap-checksum: {{ include (print $.Template.BasePath "/cilium-envoy/configmap.yaml") . | sha256sum | quote }}

View File

@@ -0,0 +1,33 @@
{{- $envoyDS := eq (include "envoyDaemonSetEnabled" .) "true" -}}
{{- if and $envoyDS (not .Values.preflight.enabled) .Values.envoy.prometheus.enabled }}
apiVersion: v1
kind: Service
metadata:
name: cilium-envoy
namespace: {{ .Release.Namespace }}
{{- if or (not .Values.envoy.prometheus.serviceMonitor.enabled) .Values.envoy.annotations }}
annotations:
{{- if not .Values.envoy.prometheus.serviceMonitor.enabled }}
prometheus.io/scrape: "true"
prometheus.io/port: {{ .Values.envoy.prometheus.port | quote }}
{{- end }}
{{- with .Values.envoy.annotations }}
{{- toYaml . | nindent 4 }}
{{- end }}
{{- end }}
labels:
k8s-app: cilium-envoy
app.kubernetes.io/name: cilium-envoy
app.kubernetes.io/part-of: cilium
io.cilium/app: proxy
spec:
clusterIP: None
type: ClusterIP
selector:
k8s-app: cilium-envoy
ports:
- name: envoy-metrics
port: {{ .Values.envoy.prometheus.port }}
protocol: TCP
targetPort: envoy-metrics
{{- end }}

View File

@@ -362,7 +362,7 @@ spec:
name: cilium-clustermesh
optional: true
# note: items are not explicitly listed here, since the entries of this secret
# depend on the peers configured, and that would cause a restart of all agents
# depend on the peers configured, and that would cause a restart of all operators
# at every addition/removal. Leaving the field empty makes each secret entry
# to be automatically projected into the volume as a file whose name is the key.
- secret:
@@ -384,5 +384,28 @@ spec:
- key: {{ .Values.tls.caBundle.key }}
path: common-etcd-client-ca.crt
{{- end }}
# note: we configure the volume for the kvstoremesh-specific certificate
# regardless of whether KVStoreMesh is enabled or not, so that it can be
# automatically mounted in case KVStoreMesh gets subsequently enabled,
# without requiring an operator restart.
- secret:
name: clustermesh-apiserver-local-cert
optional: true
items:
- key: tls.key
path: local-etcd-client.key
- key: tls.crt
path: local-etcd-client.crt
{{- if not .Values.tls.caBundle.enabled }}
- key: ca.crt
path: local-etcd-client-ca.crt
{{- else }}
- {{ .Values.tls.caBundle.useSecret | ternary "secret" "configMap" }}:
name: {{ .Values.tls.caBundle.name }}
optional: true
items:
- key: {{ .Values.tls.caBundle.key }}
path: local-etcd-client-ca.crt
{{- end }}
{{- end }}
{{- end }}

View File

@@ -1,3 +1,47 @@
{{/* validate deprecated options are not being used */}}
{{/* Options deprecated in v1.15 and removed in v1.16 */}}
{{- if or
(dig "encryption" "keyFile" "" .Values.AsMap)
(dig "encryption" "mountPath" "" .Values.AsMap)
(dig "encryption" "secretName" "" .Values.AsMap)
(dig "encryption" "interface" "" .Values.AsMap)
}}
{{ fail "encryption.{keyFile,mountPath,secretName,interface} were deprecated in v1.14 and has been removed in v1.16. For details please refer to https://docs.cilium.io/en/v1.16/operations/upgrade/#helm-options" }}
{{- end }}
{{- if or
((dig "proxy" "prometheus" "enabled" "" .Values.AsMap) | toString)
(dig "proxy" "prometheus" "port" "" .Values.AsMap)
}}
{{ fail "proxy.prometheus.enabled and proxy.prometheus.port were deprecated in v1.14 and has been removed in v1.16. For details please refer to https://docs.cilium.io/en/v1.16/operations/upgrade/#helm-options" }}
{{- end }}
{{- if (dig "endpointStatus" "" .Values.AsMap) }}
{{ fail "endpointStatus has been removed in v1.16. For details please refer to https://docs.cilium.io/en/v1.16/operations/upgrade/#helm-options" }}
{{- end }}
{{- if (dig "remoteNodeIdentity" "" .Values.AsMap) }}
{{ fail "remoteNodeIdentity was deprecated in v1.15 and has been removed in v1.16. For details please refer to https://docs.cilium.io/en/v1.16/operations/upgrade/#helm-options" }}
{{- end }}
{{- if (dig "containerRuntime" "integration" "" .Values.AsMap) }}
{{ fail "containerRuntime.integration was deprecated in v1.14 and has been removed in v1.16. For details please refer to https://docs.cilium.io/en/v1.16/operations/upgrade/#helm-options" }}
{{- end }}
{{- if (dig "etcd" "managed" "" .Values.AsMap) }}
{{ fail "etcd.managed was deprecated in v1.10 has been removed in v1.16. For details please refer to https://docs.cilium.io/en/v1.16/operations/upgrade/#helm-options" }}
{{- end }}
{{/* Options deprecated in v1.14 and removed in v1.15 */}}
{{- if .Values.tunnel }}
{{ fail "tunnel was deprecated in v1.14 and has been removed in v1.15. For details please refer to https://docs.cilium.io/en/v1.15/operations/upgrade/#helm-options" }}
{{- end }}
{{- if or (dig "clustermesh" "apiserver" "tls" "ca" "cert" "" .Values.AsMap) (dig "clustermesh" "apiserver" "tls" "ca" "key" "" .Values.AsMap) }}
{{ fail "clustermesh.apiserver.tls.ca.cert and clustermesh.apiserver.tls.ca.key were deprecated in v1.14 and has been removed in v1.15. For details please refer to https://docs.cilium.io/en/v1.15/operations/upgrade/#helm-options" }}
{{- end }}
{{- if .Values.enableK8sEventHandover }}
{{ fail "enableK8sEventHandover was deprecated in v1.14 and has been removed in v1.15. For details please refer to https://docs.cilium.io/en/v1.15/operations/upgrade/#helm-options" }}
{{- end }}
{{- if .Values.enableCnpStatusUpdates }}
{{ fail "enableCnpStatusUpdates was deprecated in v1.14 and has been removed in v1.15. For details please refer to https://docs.cilium.io/en/v1.15/operations/upgrade/#helm-options" }}
{{- end }}
{{/* validate hubble config */}}
{{- if and .Values.hubble.ui.enabled (not .Values.hubble.ui.standalone.enabled) }}
{{- if not .Values.hubble.relay.enabled }}

View File

@@ -153,10 +153,10 @@ image:
# @schema
override: ~
repository: "quay.io/cilium/cilium"
tag: "v1.16.1"
tag: "v1.16.2"
pullPolicy: "IfNotPresent"
# cilium-digest
digest: "sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39"
digest: "sha256:4386a8580d8d86934908eea022b0523f812e6a542f30a86a47edd8bed90d51ea"
useDigest: true
# -- Affinity for cilium-agent.
affinity:
@@ -1309,9 +1309,9 @@ hubble:
# @schema
override: ~
repository: "quay.io/cilium/hubble-relay"
tag: "v1.16.1"
tag: "v1.16.2"
# hubble-relay-digest
digest: "sha256:2e1b4c739a676ae187d4c2bfc45c3e865bda2567cc0320a90cb666657fcfcc35"
digest: "sha256:4b559907b378ac18af82541dafab430a857d94f1057f2598645624e6e7ea286c"
useDigest: true
pullPolicy: "IfNotPresent"
# -- Specifies the resources for the hubble-relay pods
@@ -2158,9 +2158,9 @@ envoy:
# @schema
override: ~
repository: "quay.io/cilium/cilium-envoy"
tag: "v1.29.7-39a2a56bbd5b3a591f69dbca51d3e30ef97e0e51"
tag: "v1.29.9-1726784081-a90146d13b4cd7d168d573396ccf2b3db5a3b047"
pullPolicy: "IfNotPresent"
digest: "sha256:bd5ff8c66716080028f414ec1cb4f7dc66f40d2fb5a009fff187f4a9b90b566b"
digest: "sha256:9762041c3760de226a8b00cc12f27dacc28b7691ea926748f9b5c18862db503f"
useDigest: true
# -- Additional containers added to the cilium Envoy DaemonSet.
extraContainers: []
@@ -2474,15 +2474,15 @@ operator:
# @schema
override: ~
repository: "quay.io/cilium/operator"
tag: "v1.16.1"
tag: "v1.16.2"
# operator-generic-digest
genericDigest: "sha256:3bc7e7a43bc4a4d8989cb7936c5d96675dd2d02c306adf925ce0a7c35aa27dc4"
genericDigest: "sha256:cccfd3b886d52cb132c06acca8ca559f0fce91a6bd99016219b1a81fdbc4813a"
# operator-azure-digest
azureDigest: "sha256:e55c222654a44ceb52db7ade3a7b9e8ef05681ff84c14ad1d46fea34869a7a22"
azureDigest: "sha256:fde7cf8bb887e106cd388bb5c3327e92682b2ec3ab4f03bb57b87f495b99f727"
# operator-aws-digest
awsDigest: "sha256:e3876fcaf2d6ccc8d5b4aaaded7b1efa971f3f4175eaa2c8a499878d58c39df4"
awsDigest: "sha256:b6a73ec94407a56cccc8a395225e2aecc3ca3611e7acfeec86201c19fc0727dd"
# operator-alibabacloud-digest
alibabacloudDigest: "sha256:4381adf48d76ec482551183947e537d44bcac9b6c31a635a9ac63f696d978804"
alibabacloudDigest: "sha256:16e33abb6b8381e2f66388b6d7141399f06c9b51b9ffa08fd159b8d321929716"
useDigest: true
pullPolicy: "IfNotPresent"
suffix: ""
@@ -2756,9 +2756,9 @@ preflight:
# @schema
override: ~
repository: "quay.io/cilium/cilium"
tag: "v1.16.1"
tag: "v1.16.2"
# cilium-digest
digest: "sha256:0b4a3ab41a4760d86b7fc945b8783747ba27f29dac30dd434d94f2c9e3679f39"
digest: "sha256:4386a8580d8d86934908eea022b0523f812e6a542f30a86a47edd8bed90d51ea"
useDigest: true
pullPolicy: "IfNotPresent"
# -- The priority class to use for the preflight pod.
@@ -2905,9 +2905,9 @@ clustermesh:
# @schema
override: ~
repository: "quay.io/cilium/clustermesh-apiserver"
tag: "v1.16.1"
tag: "v1.16.2"
# clustermesh-apiserver-digest
digest: "sha256:e9c77417cd474cc943b2303a76c5cf584ac7024dd513ebb8d608cb62fe28896f"
digest: "sha256:cc84190fed92e03a2b3a33bc670b2447b521ee258ad9b076baaad13be312ea73"
useDigest: true
pullPolicy: "IfNotPresent"
# -- TCP port for the clustermesh-apiserver health API.
@@ -3406,7 +3406,7 @@ authentication:
override: ~
repository: "docker.io/library/busybox"
tag: "1.36.1"
digest: "sha256:9ae97d36d26566ff84e8893c64a6dc4fe8ca6d1144bf5b87b2b85a32def253c7"
digest: "sha256:c230832bd3b0be59a6c47ed64294f9ce71e91b327957920b6929a0caa8353140"
useDigest: true
pullPolicy: "IfNotPresent"
# SPIRE agent configuration

View File

@@ -1,2 +1,2 @@
ARG VERSION=v1.16.1
ARG VERSION=v1.16.2
FROM quay.io/cilium/cilium:${VERSION}

View File

@@ -15,4 +15,4 @@ cilium:
enableIdentityMark: false
enableRuntimeDeviceDetection: true
forceDeviceDetection: true
devices: ovn0
devices: "ovn0 genev_sys_6081"

View File

@@ -12,7 +12,7 @@ cilium:
mode: "kubernetes"
image:
repository: ghcr.io/aenix-io/cozystack/cilium
tag: 1.16.1
digest: "sha256:9593dbc3bd25487b52d8f43330d4a308e450605479a8384a32117e9613289892"
tag: 1.16.2
digest: "sha256:534c5b04fef356a6be59234243c23c0c09702fe1e2c8872012afb391ce2965c4"
envoy:
enabled: false

View File

@@ -33,11 +33,11 @@ kubeapps:
image:
registry: ghcr.io/aenix-io/cozystack
repository: dashboard
tag: v0.15.0
tag: v0.16.1
digest: "sha256:4818712e9fc9c57cc321512760c3226af564a04e69d4b3ec9229ab91fd39abeb"
kubeappsapis:
image:
registry: ghcr.io/aenix-io/cozystack
repository: kubeapps-apis
tag: v0.15.0
digest: "sha256:70c095c8f7e3ecfa11433a3a2c8f57f6ff5a0053f006939a2c171c180cc50baf"
tag: v0.16.1
digest: "sha256:55bc8e2495933112c7cb4bb9e3b1fcb8df46aa14e27fa007f78388a9757e3238"

View File

@@ -1,7 +1,11 @@
NAME=fluxcd
NAMESPACE=cozy-$(NAME)
include ../../../scripts/package.mk
apply-locally:
helm upgrade -i -n $(NAMESPACE) $(NAME) .
include ../../../scripts/package.mk
update:
rm -rf charts
helm pull oci://ghcr.io/controlplaneio-fluxcd/charts/flux-instance --untar --untardir charts

View File

@@ -21,6 +21,4 @@
.idea/
*.tmproj
.vscode/
# Ignore img folder used for documentation
img/
helmdocs.gotmpl

View File

@@ -0,0 +1,28 @@
annotations:
artifacthub.io/license: AGPL-3.0
artifacthub.io/links: |
- name: Documentation
url: https://fluxcd.control-plane.io/operator
- name: Chart Source
url: https://github.com/controlplaneio-fluxcd/charts
- name: Upstream Project
url: https://github.com/controlplaneio-fluxcd/flux-operator
apiVersion: v2
appVersion: v0.9.0
description: 'A Helm chart for deploying a Flux instance managed by Flux Operator. '
home: https://github.com/controlplaneio-fluxcd
icon: https://raw.githubusercontent.com/cncf/artwork/main/projects/flux/icon/color/flux-icon-color.png
keywords:
- flux
- fluxcd
- gitops
kubeVersion: '>=1.22.0-0'
maintainers:
- email: flux-enterprise@control-plane.io
name: ControlPlane Flux Team
name: flux-instance
sources:
- https://github.com/controlplaneio-fluxcd/flux-operator
- https://github.com/controlplaneio-fluxcd/charts
type: application
version: 0.9.0

View File

@@ -0,0 +1,52 @@
# flux-instance
![Version: 0.9.0](https://img.shields.io/badge/Version-0.9.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: v0.9.0](https://img.shields.io/badge/AppVersion-v0.9.0-informational?style=flat-square)
This chart is a thin wrapper around the `FluxInstance` custom resource, which is
used by the [Flux Operator](https://github.com/controlplaneio-fluxcd/flux-operator)
to install, configure and automatically upgrade Flux.
## Prerequisites
- Kubernetes 1.22+
- Helm 3.8+
## Installing the Chart
To deploy Flux in the `flux-system` namespace:
```console
helm -n flux-system install flux oci://ghcr.io/controlplaneio-fluxcd/charts/flux-instance
```
For more information on the available configuration options,
see the [Flux Instance documentation](https://fluxcd.control-plane.io/operator/fluxinstance/).
## Uninstalling the Chart
To uninstall Flux without affecting the resources it manages:
```console
helm -n flux-system uninstall flux
```
## Values
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| commonAnnotations | object | `{}` | Common annotations to add to all deployed objects including pods. |
| commonLabels | object | `{}` | Common labels to add to all deployed objects including pods. |
| fullnameOverride | string | `"flux"` | |
| instance.cluster | object | `{"domain":"cluster.local","multitenant":false,"networkPolicy":true,"tenantDefaultServiceAccount":"default","type":"kubernetes"}` | Cluster https://fluxcd.control-plane.io/operator/fluxinstance/#cluster-configuration |
| instance.components | list | `["source-controller","kustomize-controller","helm-controller","notification-controller"]` | Components https://fluxcd.control-plane.io/operator/fluxinstance/#components-configuration |
| instance.distribution | object | `{"artifact":"oci://ghcr.io/controlplaneio-fluxcd/flux-operator-manifests:latest","imagePullSecret":"","registry":"ghcr.io/fluxcd","version":"2.x"}` | Distribution https://fluxcd.control-plane.io/operator/fluxinstance/#distribution-configuration |
| instance.kustomize.patches | list | `[]` | Kustomize patches https://fluxcd.control-plane.io/operator/fluxinstance/#kustomize-patches |
| instance.sharding | object | `{"key":"sharding.fluxcd.io/key","shards":[]}` | Sharding https://fluxcd.control-plane.io/operator/fluxinstance/#sharding-configuration |
| instance.storage | object | `{"class":"","size":""}` | Storage https://fluxcd.control-plane.io/operator/fluxinstance/#storage-configuration |
| instance.sync | object | `{"kind":"GitRepository","path":"","pullSecret":"","ref":"","url":""}` | Sync https://fluxcd.control-plane.io/operator/fluxinstance/#sync-configuration |
| nameOverride | string | `""` | |
## Source Code
* <https://github.com/controlplaneio-fluxcd/flux-operator>
* <https://github.com/controlplaneio-fluxcd/charts>

View File

@@ -0,0 +1 @@
Documentation at https://fluxcd.control-plane.io/operator/

View File

@@ -0,0 +1,51 @@
{{/*
Expand the name of the chart.
*/}}
{{- define "flux-instance.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Create a default fully qualified app name.
We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).
If release name contains chart name it will be used as a full name.
*/}}
{{- define "flux-instance.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Create chart name and version as used by the chart label.
*/}}
{{- define "flux-instance.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Common labels
*/}}
{{- define "flux-instance.labels" -}}
helm.sh/chart: {{ include "flux-instance.chart" . }}
{{ include "flux-instance.selectorLabels" . }}
{{- if .Chart.AppVersion }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
{{- end }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/*
Selector labels
*/}}
{{- define "flux-instance.selectorLabels" -}}
app.kubernetes.io/name: {{ include "flux-instance.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}

View File

@@ -0,0 +1,43 @@
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
name: {{ include "flux-instance.fullname" . }}
namespace: {{ .Release.Namespace }}
labels:
{{- include "flux-instance.labels" . | nindent 4 }}
{{- with .Values.commonLabels }}
{{- toYaml . | nindent 4 }}
{{- end }}
{{- with .Values.commonAnnotations }}
annotations:
{{- toYaml . | nindent 4 }}
{{- end }}
spec:
distribution:
version: {{ .Values.instance.distribution.version }}
registry: {{ .Values.instance.distribution.registry }}
artifact: {{ .Values.instance.distribution.artifact }}
{{- if .Values.instance.distribution.imagePullSecret }}
imagePullSecret: {{ .Values.instance.distribution.imagePullSecret }}
{{- end }}
components: {{ .Values.instance.components | toYaml | nindent 4 }}
cluster: {{ .Values.instance.cluster | toYaml | nindent 4 }}
kustomize: {{ .Values.instance.kustomize | toYaml | nindent 4 }}
{{- if .Values.instance.sync.url }}
sync:
kind: {{ .Values.instance.sync.kind }}
url: {{ .Values.instance.sync.url }}
ref: {{ .Values.instance.sync.ref }}
path: {{ .Values.instance.sync.path }}
{{- if .Values.instance.sync.pullSecret }}
pullSecret: {{ .Values.instance.sync.pullSecret }}
{{- end }}
{{- end }}
{{- if .Values.instance.storage.size }}
storage: {{ .Values.instance.storage | toYaml | nindent 4 }}
{{- end }}
{{- if .Values.instance.sharding.shards }}
sharding:
key: {{ .Values.instance.sharding.key }}
shards: {{ .Values.instance.sharding.shards | toYaml | nindent 4 }}
{{- end }}

View File

@@ -0,0 +1,153 @@
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"properties": {
"commonAnnotations": {
"properties": {},
"type": "object"
},
"commonLabels": {
"properties": {},
"type": "object"
},
"fullnameOverride": {
"type": "string"
},
"instance": {
"properties": {
"cluster": {
"properties": {
"domain": {
"type": "string"
},
"multitenant": {
"type": "boolean"
},
"networkPolicy": {
"type": "boolean"
},
"tenantDefaultServiceAccount": {
"type": "string"
},
"type": {
"enum": [
"kubernetes",
"openshift",
"aws",
"azure",
"gcp"
],
"type": "string"
}
},
"type": "object"
},
"components": {
"items": {
"enum": [
"source-controller",
"kustomize-controller",
"helm-controller",
"notification-controller",
"image-reflector-controller",
"image-automation-controller"
],
"type": "string"
},
"type": "array",
"uniqueItems": true
},
"distribution": {
"properties": {
"artifact": {
"type": "string"
},
"imagePullSecret": {
"type": "string"
},
"registry": {
"type": "string"
},
"version": {
"type": "string"
}
},
"required": [
"version",
"registry"
],
"type": "object"
},
"kustomize": {
"properties": {
"patches": {
"items": {
"type": "object"
},
"type": "array"
}
},
"type": "object"
},
"sharding": {
"properties": {
"key": {
"type": "string"
},
"shards": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"storage": {
"properties": {
"class": {
"type": "string"
},
"size": {
"type": "string"
}
},
"type": "object"
},
"sync": {
"properties": {
"kind": {
"enum": [
"GitRepository",
"OCIRepository",
"Bucket"
],
"type": "string"
},
"path": {
"type": "string"
},
"pullSecret": {
"type": "string"
},
"ref": {
"type": "string"
},
"url": {
"type": "string"
}
},
"type": "object"
}
},
"required": [
"distribution",
"cluster"
],
"type": "object"
},
"nameOverride": {
"type": "string"
}
},
"type": "object"
}

View File

@@ -0,0 +1,49 @@
# Default values for flux-instance.
nameOverride: ""
fullnameOverride: "flux"
instance:
# -- Distribution https://fluxcd.control-plane.io/operator/fluxinstance/#distribution-configuration
distribution: # @schema required: true
version: "2.x" # @schema required: true
registry: "ghcr.io/fluxcd" # @schema required: true
artifact: "oci://ghcr.io/controlplaneio-fluxcd/flux-operator-manifests:latest"
imagePullSecret: ""
# -- Components https://fluxcd.control-plane.io/operator/fluxinstance/#components-configuration
components: # @schema item: string; uniqueItems: true; itemEnum: [source-controller,kustomize-controller,helm-controller,notification-controller,image-reflector-controller,image-automation-controller]
- source-controller
- kustomize-controller
- helm-controller
- notification-controller
# -- Cluster https://fluxcd.control-plane.io/operator/fluxinstance/#cluster-configuration
cluster: # @schema required: true
type: kubernetes # @schema enum:[kubernetes,openshift,aws,azure,gcp]
domain: "cluster.local"
networkPolicy: true
multitenant: false
tenantDefaultServiceAccount: "default"
# -- Storage https://fluxcd.control-plane.io/operator/fluxinstance/#storage-configuration
storage: # @schema required: false
class: ""
size: ""
# -- Sharding https://fluxcd.control-plane.io/operator/fluxinstance/#sharding-configuration
sharding: # @schema required: false
key: "sharding.fluxcd.io/key"
shards: [] # @schema item: string
# -- Sync https://fluxcd.control-plane.io/operator/fluxinstance/#sync-configuration
sync: # @schema required: false
kind: "GitRepository" # @schema enum:[GitRepository,OCIRepository,Bucket]
url: ""
ref: ""
path: ""
pullSecret: ""
kustomize: # @schema required: false
# -- Kustomize patches https://fluxcd.control-plane.io/operator/fluxinstance/#kustomize-patches
patches: [] # @schema item: object
# -- Common annotations to add to all deployed objects including pods.
commonAnnotations: { }
# -- Common labels to add to all deployed objects including pods.
commonLabels: { }

View File

@@ -1,25 +0,0 @@
apiVersion: fluxcd.controlplane.io/v1
kind: FluxInstance
metadata:
name: flux
spec:
{{- with .Values.cluster }}
cluster:
{{- with .networkPolicy }}
networkPolicy: {{ . }}
{{- end }}
{{- with .domain }}
domain: {{ . }}
{{- end }}
{{- end }}
distribution:
version: {{ .Values.distribution.version }}
registry: {{ .Values.distribution.registry }}
components:
{{- if .Values.components }}
{{- toYaml .Values.components | nindent 4 }}
{{- end }}
kustomize:
{{- if .Values.kustomize }}
{{- toYaml .Values.kustomize | nindent 4 }}
{{- end }}

View File

@@ -1,47 +1,49 @@
cluster:
networkPolicy: true
# domain: cozy.local
distribution:
version: 2.3.x
registry: ghcr.io/fluxcd
components:
- source-controller
- kustomize-controller
- helm-controller
- notification-controller
- image-reflector-controller
- image-automation-controller
kustomize:
patches:
- target:
kind: Deployment
name: "(kustomize-controller|helm-controller|source-controller)"
patch: |
- op: add
path: /spec/template/spec/containers/0/args/-
value: --concurrent=20
- op: add
path: /spec/template/spec/containers/0/args/-
value: --requeue-dependency=5s
- op: replace
path: /spec/template/spec/containers/0/resources/limits
value:
cpu: 2000m
memory: 2048Mi
- target:
kind: Deployment
name: source-controller
patch: |
- op: add
path: /spec/template/spec/containers/0/args/-
value: --storage-adv-addr=source-controller.cozy-fluxcd.svc
- op: add
path: /spec/template/spec/containers/0/args/-
value: --events-addr=http://notification-controller.cozy-fluxcd.svc/
- target:
kind: Deployment
name: (kustomize-controller|helm-controller|image-reflector-controller|image-automation-controller)
patch: |
- op: add
path: /spec/template/spec/containers/0/args/-
value: --events-addr=http://notification-controller.cozy-fluxcd.svc/
flux-instance:
instance:
cluster:
networkPolicy: true
domain: cozy.local # -- default value is overriden in patches
distribution:
version: 2.3.x
registry: ghcr.io/fluxcd
components:
- source-controller
- kustomize-controller
- helm-controller
- notification-controller
- image-reflector-controller
- image-automation-controller
kustomize:
patches:
- target:
kind: Deployment
name: "(kustomize-controller|helm-controller|source-controller)"
patch: |
- op: add
path: /spec/template/spec/containers/0/args/-
value: --concurrent=20
- op: add
path: /spec/template/spec/containers/0/args/-
value: --requeue-dependency=5s
- op: replace
path: /spec/template/spec/containers/0/resources/limits
value:
cpu: 2000m
memory: 2048Mi
- target:
kind: Deployment
name: source-controller
patch: |
- op: add
path: /spec/template/spec/containers/0/args/-
value: --storage-adv-addr=source-controller.cozy-fluxcd.svc
- op: add
path: /spec/template/spec/containers/0/args/-
value: --events-addr=http://notification-controller.cozy-fluxcd.svc/
- target:
kind: Deployment
name: (kustomize-controller|helm-controller|image-reflector-controller|image-automation-controller)
patch: |
- op: add
path: /spec/template/spec/containers/0/args/-
value: --events-addr=http://notification-controller.cozy-fluxcd.svc/

View File

@@ -6,7 +6,7 @@ ingress-nginx:
registry: ghcr.io
image: kvaps/ingress-nginx-with-protobuf-exporter/controller
tag: v1.11.2
digest: sha256:f4194edb06a43c82405167427ebd552b90af9698bd295845418680aebc13f600
digest: sha256:e80856ece4e30e9646d65c8d92c25a3446a0bba1c2468cd026f17df9e60d2c0f
allowSnippetAnnotations: true
replicaCount: 2
admissionWebhooks:

View File

@@ -1 +1,25 @@
FROM clastix/kamaji:edge-24.9.2
# Build the manager binary
FROM golang:1.22 as builder
ARG VERSION=edge-24.9.2
ARG TARGETOS TARGETARCH
WORKDIR /workspace
RUN curl -sSL https://github.com/clastix/kamaji/archive/refs/tags/${VERSION}.tar.gz | tar -xzvf- --strip=1
COPY patches /patches
RUN git apply /patches/disable-datastore-check.diff
RUN CGO_ENABLED=0 GOOS=linux GOARCH=$TARGETARCH go build \
-ldflags "-X github.com/clastix/kamaji/internal.GitRepo=$GIT_REPO -X github.com/clastix/kamaji/internal.GitTag=$GIT_LAST_TAG -X github.com/clastix/kamaji/internal.GitCommit=$GIT_HEAD_COMMIT -X github.com/clastix/kamaji/internal.GitDirty=$GIT_MODIFIED -X github.com/clastix/kamaji/internal.BuildTime=$BUILD_DATE" \
-a -o kamaji main.go
# Use distroless as minimal base image to package the manager binary
# Refer to https://github.com/GoogleContainerTools/distroless for more details
FROM gcr.io/distroless/static:nonroot
WORKDIR /
COPY --from=builder /workspace/kamaji .
USER 65532:65532
ENTRYPOINT ["/kamaji"]

View File

@@ -0,0 +1,23 @@
diff --git a/cmd/manager/cmd.go b/cmd/manager/cmd.go
index 9a24d4e..a03a4e0 100644
--- a/cmd/manager/cmd.go
+++ b/cmd/manager/cmd.go
@@ -31,7 +31,6 @@ import (
"github.com/clastix/kamaji/controllers/soot"
"github.com/clastix/kamaji/internal"
"github.com/clastix/kamaji/internal/builders/controlplane"
- datastoreutils "github.com/clastix/kamaji/internal/datastore/utils"
"github.com/clastix/kamaji/internal/webhook"
"github.com/clastix/kamaji/internal/webhook/handlers"
"github.com/clastix/kamaji/internal/webhook/routes"
@@ -80,10 +79,6 @@ func NewCmd(scheme *runtime.Scheme) *cobra.Command {
return fmt.Errorf("unable to read webhook CA: %w", err)
}
- if err = datastoreutils.CheckExists(ctx, scheme, datastore); err != nil {
- return err
- }
-
if controllerReconcileTimeout.Seconds() == 0 {
return fmt.Errorf("the controller reconcile timeout must be greater than zero")
}

View File

@@ -3,7 +3,7 @@ kamaji:
deploy: false
image:
pullPolicy: IfNotPresent
tag: latest@sha256:bb45d953a8ba46a19c8941ccc9fc8498d91435c77db439d8b1d6bde9fea8802a
tag: v0.16.1@sha256:95a9658cbbe1cbfbc42b9ab1df4f2a39342d7a8f1ff10a10b81b8656f3744c39
repository: ghcr.io/aenix-io/cozystack/kamaji
resources:
limits:

View File

@@ -22,4 +22,4 @@ global:
images:
kubeovn:
repository: kubeovn
tag: v1.13.0@sha256:11c4ef0f71c73df4703743c0f63b7ff0ec67af6342caf1e7db8ebd5546071855
tag: v1.13.0@sha256:d13ac4f916cd88d33d1d64c949978165272998d6594441a9dd4be5e6892caf4e

View File

@@ -19,26 +19,3 @@ update:
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update fluent
helm pull fluent/fluent-bit --untar --untardir charts
# alerts from victoria-metrics-k8s-stack
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update vm
helm pull vm/victoria-metrics-k8s-stack --untar --untardir charts
rm -rf charts/victoria-metrics-k8s-stack/charts
rm -rf charts/victoria-metrics-k8s-stack/hack
rm -rf charts/victoria-metrics-k8s-stack/templates/victoria-metrics-operator
rm -rf charts/victoria-metrics-k8s-stack/templates/grafana
rm -rf charts/victoria-metrics-k8s-stack/templates/ingress.yaml
rm -rf charts/victoria-metrics-k8s-stack/files/dashboards
rm -f charts/victoria-metrics-k8s-stack/templates/servicemonitors.yaml
rm -f charts/victoria-metrics-k8s-stack/templates/serviceaccount.yaml
rm -f charts/victoria-metrics-k8s-stack/templates/rules/additionalVictoriaMetricsRules.yml
sed -i '/ namespace:/d' charts/victoria-metrics-k8s-stack/templates/rules/rule.yaml
sed -i 's|job="apiserver"|job="kube-apiserver"|g' `grep -rl 'job="apiserver"' charts/victoria-metrics-k8s-stack/files/rules/generated`
sed -i 's|severity: info|severity: informational|g' `grep -rl 'severity: info' ./charts/victoria-metrics-k8s-stack/files/rules/generated`
sed -i 's|severity: none|severity: ok|g' ./charts/victoria-metrics-k8s-stack/files/rules/generated/general.rules.yaml
sed -i ./charts/victoria-metrics-k8s-stack/files/rules/generated/general.rules.yaml \
-e '/Watchdog/,/severity:/s/severity: none/severity: ok/' \
-e '/InfoInhibitor/,/severity:/s/severity: none/severity: major/'
# TODO
rm -f charts/victoria-metrics-k8s-stack/files/rules/generated/alertmanager.rules.yaml
rm -f charts/victoria-metrics-k8s-stack/files/rules/generated/vm*.yaml

View File

@@ -0,0 +1,221 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-etcd
spec:
groups:
- name: etcd
params: {}
rules:
- alert: etcdMembersDown
annotations:
description: 'etcd cluster "{{ $labels.job }}": members are down ({{ $value
}}).'
summary: etcd cluster members are down.
expr: |-
max without (endpoint) (
sum without (instance) (up{job=~".*etcd.*"} == bool 0)
or
count without (To) (
sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[120s])) > 0.01
)
)
> 0
for: 10m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdInsufficientMembers
annotations:
description: 'etcd cluster "{{ $labels.job }}": insufficient members ({{ $value
}}).'
summary: etcd cluster has insufficient number of members.
expr: sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"})
without (instance) + 1) / 2)
for: 3m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdNoLeader
annotations:
description: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance
}} has no leader.'
summary: etcd cluster has no leader.
expr: etcd_server_has_leader{job=~".*etcd.*"} == 0
for: 1m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdHighNumberOfLeaderChanges
annotations:
description: 'etcd cluster "{{ $labels.job }}": {{ $value }} leader changes
within the last 15 minutes. Frequent elections may be a sign of insufficient
resources, high network latency, or disruptions by other components and
should be investigated.'
summary: etcd cluster has high number of leader changes.
expr: increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"})
or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m])
>= 4
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdHighNumberOfFailedGRPCRequests
annotations:
description: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests
for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance
}}.'
summary: etcd cluster has high number of failed grpc requests.
expr: |-
100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
/
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
> 1
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}/{{ $labels.grpc_method }}'
service: etcd
- alert: etcdHighNumberOfFailedGRPCRequests
annotations:
description: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests
for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance
}}.'
summary: etcd cluster has high number of failed grpc requests.
expr: |-
100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
/
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
> 5
for: 5m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}/{{ $labels.grpc_method }}'
service: etcd
- alert: etcdGRPCRequestsSlow
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile of gRPC requests
is {{ $value }}s on etcd instance {{ $labels.instance }} for {{ $labels.grpc_method
}} method.'
summary: etcd grpc requests are slow
expr: |-
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_method!="Defragment", grpc_type="unary"}[5m])) without(grpc_type))
> 0.15
for: 10m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}/{{ $labels.grpc_method }}'
service: etcd
- alert: etcdMemberCommunicationSlow
annotations:
description: 'etcd cluster "{{ $labels.job }}": member communication with
{{ $labels.To }} is taking {{ $value }}s on etcd instance {{ $labels.instance
}}.'
summary: etcd cluster member communication is slow.
expr: |-
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.15
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}/{{ $labels.member }}'
service: etcd
- alert: etcdHighNumberOfFailedProposals
annotations:
description: 'etcd cluster "{{ $labels.job }}": {{ $value }} proposal failures
within the last 30 minutes on etcd instance {{ $labels.instance }}.'
summary: etcd cluster has high number of proposal failures.
expr: rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations
are {{ $value }}s on etcd instance {{ $labels.instance }}.'
summary: etcd cluster 99th percentile fsync durations are too high.
expr: |-
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.5
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile fsync durations
are {{ $value }}s on etcd instance {{ $labels.instance }}.'
summary: etcd cluster 99th percentile fsync durations are too high.
expr: |-
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 1
for: 10m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdHighCommitDurations
annotations:
description: 'etcd cluster "{{ $labels.job }}": 99th percentile commit durations
{{ $value }}s on etcd instance {{ $labels.instance }}.'
summary: etcd cluster 99th percentile commit durations are too high.
expr: |-
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.25
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdDatabaseQuotaLowSpace
annotations:
description: 'etcd cluster "{{ $labels.job }}": database size exceeds the
defined quota on etcd instance {{ $labels.instance }}, please defrag or
increase the quota as the writes to etcd will be disabled when it is full.'
summary: etcd cluster database is running full.
expr: (last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])
/ last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m]))*100
> 95
for: 10m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdExcessiveDatabaseGrowth
annotations:
description: 'etcd cluster "{{ $labels.job }}": Predicting running out of
disk space in the next four hours, based on write observations within the
past four hours on etcd instance {{ $labels.instance }}, please check as
it might be disruptive.'
summary: etcd cluster database growing very fast.
expr: predict_linear(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[4h],
4*60*60) > etcd_server_quota_backend_bytes{job=~".*etcd.*"}
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}'
service: etcd
- alert: etcdDatabaseHighFragmentationRatio
annotations:
description: 'etcd cluster "{{ $labels.job }}": database size in use on instance
{{ $labels.instance }} is {{ $value | humanizePercentage }} of the actual
allocated disk space, please run defragmentation (e.g. etcdctl defrag) to
retrieve the unused fragmented disk space.'
runbook_url: https://etcd.io/docs/v3.5/op-guide/maintenance/#defragmentation
summary: etcd database size in use is less than 50% of the actual allocated
storage.
expr: (last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"}[5m])
/ last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])) <
0.5 and etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"} > 104857600
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}'
service: etcd

View File

@@ -0,0 +1,128 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
annotations:
meta.helm.sh/release-name: monitoring
meta.helm.sh/release-namespace: cozy-monitoring
labels:
app: victoria-metrics-k8s-stack
app.kubernetes.io/instance: monitoring
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: victoria-metrics-k8s-stack
app.kubernetes.io/version: v1.102.1
helm.sh/chart: victoria-metrics-k8s-stack-0.25.17
name: alerts-flux-resources
namespace: cozy-monitoring
spec:
groups:
- name: flux-resources-alerts
rules:
- alert: HelmReleaseNotReady
expr: gotk_resource_info{customresource_kind="HelmRelease", ready!="True"} > 0
for: 5m
labels:
severity: major
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "HelmRelease {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready"
description: "HelmRelease {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is in an unready state for more than 15 minutes."
- alert: GitRepositorySyncFailed
expr: gotk_resource_info{customresource_kind="GitRepository", ready!="True"} > 0
for: 5m
labels:
severity: major
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "GitRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} sync failed"
description: "GitRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has not been successfully synced for more than 15 minutes."
- alert: KustomizationNotApplied
expr: gotk_resource_info{customresource_kind="Kustomization", ready!="True"} > 0
for: 5m
labels:
severity: major
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "Kustomization {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not applied"
description: "Kustomization {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not successfully applied for more than 15 minutes."
- alert: ImageRepositorySyncFailed
expr: gotk_resource_info{customresource_kind="ImageRepository", ready!="True"} > 0
for: 5m
labels:
severity: major
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "ImageRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} sync failed"
description: "ImageRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has not been successfully synced for more than 15 minutes."
- alert: HelmChartFailed
expr: gotk_resource_info{customresource_kind="HelmChart", ready!="True"} > 0
for: 5m
labels:
severity: major
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "HelmChart {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has failed"
description: "HelmChart {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is not ready for more than 15 minutes."
- alert: HelmReleaseSuspended
expr: gotk_resource_info{customresource_kind="HelmRelease", suspended="true"} > 0
for: 5m
labels:
severity: warning
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "HelmRelease {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is suspended"
description: "HelmRelease {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been suspended."
- alert: GitRepositorySuspended
expr: gotk_resource_info{customresource_kind="GitRepository", suspended="true"} > 0
for: 5m
labels:
severity: warning
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "GitRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is suspended"
description: "GitRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been suspended."
- alert: KustomizationSuspended
expr: gotk_resource_info{customresource_kind="Kustomization", suspended="true"} > 0
for: 5m
labels:
severity: warning
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "Kustomization {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is suspended"
description: "Kustomization {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been suspended."
- alert: ImageRepositorySuspended
expr: gotk_resource_info{customresource_kind="ImageRepository", suspended="true"} > 0
for: 5m
labels:
severity: warning
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "ImageRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is suspended"
description: "ImageRepository {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been suspended."
- alert: HelmChartSuspended
expr: gotk_resource_info{customresource_kind="HelmChart", suspended="true"} > 0
for: 5m
labels:
severity: warning
service: fluxcd
exported_instance: '{{ $labels.exported_namespace }}/{{ $labels.name }}'
annotations:
summary: "HelmChart {{ $labels.name }} in namespace {{ $labels.exported_namespace }} is suspended"
description: "HelmChart {{ $labels.name }} in namespace {{ $labels.exported_namespace }} has been suspended."

View File

@@ -0,0 +1,57 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-general.rules
spec:
groups:
- name: general.rules
params: {}
rules:
- alert: TargetDown
annotations:
description: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service
}} targets in {{ $labels.namespace }} namespace are down.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/targetdown
summary: One or more targets are unreachable.
expr: 100 * (count(up == 0) BY (job,namespace,service,cluster) / count(up) BY
(job,namespace,service,cluster)) > 10
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}'
service: general.rules
- alert: Watchdog
annotations:
description: |
This is an alert meant to ensure that the entire alerting pipeline is functional.
This alert is always firing, therefore it should always be firing in Alertmanager
and always fire against a receiver. There are integrations with various notification
mechanisms that send a notification when this alert is not firing. For example the
"DeadMansSnitch" integration in PagerDuty.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/watchdog
summary: An alert that should always be firing to certify that Alertmanager
is working properly.
expr: vector(1)
labels:
severity: ok
exported_instance: global
service: general.rules
event: Heartbeat
- alert: InfoInhibitor
annotations:
description: |
This is an alert that is used to inhibit info alerts.
By themselves, the info-level alerts are sometimes very noisy, but they are relevant when combined with
other alerts.
This alert fires whenever there's a severity="info" alert, and stops firing when another alert with a
severity of 'warning' or 'critical' starts firing on the same namespace.
This alert should be routed to a null receiver and configured to inhibit alerts with severity="info".
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/infoinhibitor
summary: Info-level alert inhibition.
expr: ALERTS{severity = "info"} == 1 unless on (namespace,cluster) ALERTS{alertname
!= "InfoInhibitor", severity =~ "warning|critical", alertstate="firing"} ==
1
labels:
severity: major
exported_instance: global
service: general.rules

View File

@@ -0,0 +1,18 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-k8s.rules.containercpuusagesecondstotal
spec:
groups:
- name: k8s.rules.container_cpu_usage_seconds_total
params: {}
rules:
- annotations: {}
expr: |-
sum by (namespace,pod,container,cluster) (
irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
) * on (namespace,pod,cluster) group_left(node) topk by (namespace,pod,cluster) (
1, max by (namespace,pod,node,cluster) (kube_pod_info{node!=""})
)
labels: {}
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

View File

@@ -0,0 +1,17 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-k8s.rules.containermemorycache
spec:
groups:
- name: k8s.rules.container_memory_cache
params: {}
rules:
- annotations: {}
expr: |-
container_memory_cache{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
* on (namespace,pod,cluster) group_left(node) topk by (namespace,pod,cluster) (1,
max by (namespace,pod,node,cluster) (kube_pod_info{node!=""})
)
labels: {}
record: node_namespace_pod_container:container_memory_cache

View File

@@ -0,0 +1,17 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-k8s.rules.containermemoryrss
spec:
groups:
- name: k8s.rules.container_memory_rss
params: {}
rules:
- annotations: {}
expr: |-
container_memory_rss{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
* on (namespace,pod,cluster) group_left(node) topk by (namespace,pod,cluster) (1,
max by (namespace,pod,node,cluster) (kube_pod_info{node!=""})
)
labels: {}
record: node_namespace_pod_container:container_memory_rss

View File

@@ -0,0 +1,17 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-k8s.rules.containermemoryswap
spec:
groups:
- name: k8s.rules.container_memory_swap
params: {}
rules:
- annotations: {}
expr: |-
container_memory_swap{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
* on (namespace,pod,cluster) group_left(node) topk by (namespace,pod,cluster) (1,
max by (namespace,pod,node,cluster) (kube_pod_info{node!=""})
)
labels: {}
record: node_namespace_pod_container:container_memory_swap

View File

@@ -0,0 +1,17 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-k8s.rules.containermemoryworkingsetbytes
spec:
groups:
- name: k8s.rules.container_memory_working_set_bytes
params: {}
rules:
- annotations: {}
expr: |-
container_memory_working_set_bytes{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
* on (namespace,pod,cluster) group_left(node) topk by (namespace,pod,cluster) (1,
max by (namespace,pod,node,cluster) (kube_pod_info{node!=""})
)
labels: {}
record: node_namespace_pod_container:container_memory_working_set_bytes

View File

@@ -0,0 +1,93 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-k8s.rules.containerresource
spec:
groups:
- name: k8s.rules.container_resource
params: {}
rules:
- annotations: {}
expr: |-
kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"} * on (namespace,pod,cluster)
group_left() max by (namespace,pod,cluster) (
(kube_pod_status_phase{phase=~"Pending|Running"} == 1)
)
labels: {}
record: cluster:namespace:pod_memory:active:kube_pod_container_resource_requests
- annotations: {}
expr: |-
sum by (namespace,cluster) (
sum by (namespace,pod,cluster) (
max by (namespace,pod,container,cluster) (
kube_pod_container_resource_requests{resource="memory",job="kube-state-metrics"}
) * on (namespace,pod,cluster) group_left() max by (namespace,pod,cluster) (
kube_pod_status_phase{phase=~"Pending|Running"} == 1
)
)
)
labels: {}
record: namespace_memory:kube_pod_container_resource_requests:sum
- annotations: {}
expr: |-
kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"} * on (namespace,pod,cluster)
group_left() max by (namespace,pod,cluster) (
(kube_pod_status_phase{phase=~"Pending|Running"} == 1)
)
labels: {}
record: cluster:namespace:pod_cpu:active:kube_pod_container_resource_requests
- annotations: {}
expr: |-
sum by (namespace,cluster) (
sum by (namespace,pod,cluster) (
max by (namespace,pod,container,cluster) (
kube_pod_container_resource_requests{resource="cpu",job="kube-state-metrics"}
) * on (namespace,pod,cluster) group_left() max by (namespace,pod,cluster) (
kube_pod_status_phase{phase=~"Pending|Running"} == 1
)
)
)
labels: {}
record: namespace_cpu:kube_pod_container_resource_requests:sum
- annotations: {}
expr: |-
kube_pod_container_resource_limits{resource="memory",job="kube-state-metrics"} * on (namespace,pod,cluster)
group_left() max by (namespace,pod,cluster) (
(kube_pod_status_phase{phase=~"Pending|Running"} == 1)
)
labels: {}
record: cluster:namespace:pod_memory:active:kube_pod_container_resource_limits
- annotations: {}
expr: |-
sum by (namespace,cluster) (
sum by (namespace,pod,cluster) (
max by (namespace,pod,container,cluster) (
kube_pod_container_resource_limits{resource="memory",job="kube-state-metrics"}
) * on (namespace,pod,cluster) group_left() max by (namespace,pod,cluster) (
kube_pod_status_phase{phase=~"Pending|Running"} == 1
)
)
)
labels: {}
record: namespace_memory:kube_pod_container_resource_limits:sum
- annotations: {}
expr: |-
kube_pod_container_resource_limits{resource="cpu",job="kube-state-metrics"} * on (namespace,pod,cluster)
group_left() max by (namespace,pod,cluster) (
(kube_pod_status_phase{phase=~"Pending|Running"} == 1)
)
labels: {}
record: cluster:namespace:pod_cpu:active:kube_pod_container_resource_limits
- annotations: {}
expr: |-
sum by (namespace,cluster) (
sum by (namespace,pod,cluster) (
max by (namespace,pod,container,cluster) (
kube_pod_container_resource_limits{resource="cpu",job="kube-state-metrics"}
) * on (namespace,pod,cluster) group_left() max by (namespace,pod,cluster) (
kube_pod_status_phase{phase=~"Pending|Running"} == 1
)
)
)
labels: {}
record: namespace_cpu:kube_pod_container_resource_limits:sum

View File

@@ -0,0 +1,60 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-k8s.rules.podowner
spec:
groups:
- name: k8s.rules.pod_owner
params: {}
rules:
- annotations: {}
expr: |-
max by (namespace,workload,pod,cluster) (
label_replace(
label_replace(
kube_pod_owner{job="kube-state-metrics", owner_kind="ReplicaSet"},
"replicaset", "$1", "owner_name", "(.*)"
) * on (replicaset,namespace,cluster) group_left(owner_name) topk by (replicaset,namespace,cluster) (
1, max by (replicaset,namespace,owner_name,cluster) (
kube_replicaset_owner{job="kube-state-metrics"}
)
),
"workload", "$1", "owner_name", "(.*)"
)
)
labels:
workload_type: deployment
record: namespace_workload_pod:kube_pod_owner:relabel
- annotations: {}
expr: |-
max by (namespace,workload,pod,cluster) (
label_replace(
kube_pod_owner{job="kube-state-metrics", owner_kind="DaemonSet"},
"workload", "$1", "owner_name", "(.*)"
)
)
labels:
workload_type: daemonset
record: namespace_workload_pod:kube_pod_owner:relabel
- annotations: {}
expr: |-
max by (namespace,workload,pod,cluster) (
label_replace(
kube_pod_owner{job="kube-state-metrics", owner_kind="StatefulSet"},
"workload", "$1", "owner_name", "(.*)"
)
)
labels:
workload_type: statefulset
record: namespace_workload_pod:kube_pod_owner:relabel
- annotations: {}
expr: |-
max by (namespace,workload,pod,cluster) (
label_replace(
kube_pod_owner{job="kube-state-metrics", owner_kind="Job"},
"workload", "$1", "owner_name", "(.*)"
)
)
labels:
workload_type: job
record: namespace_workload_pod:kube_pod_owner:relabel

View File

@@ -0,0 +1,146 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-apiserver-availability.rules
spec:
groups:
- interval: 3m
name: kube-apiserver-availability.rules
params: {}
rules:
- annotations: {}
expr: avg_over_time(code_verb:apiserver_request_total:increase1h[30d]) * 24
* 30
labels: {}
record: code_verb:apiserver_request_total:increase30d
- annotations: {}
expr: sum by (code,cluster) (code_verb:apiserver_request_total:increase30d{verb=~"LIST|GET"})
labels:
verb: read
record: code:apiserver_request_total:increase30d
- annotations: {}
expr: sum by (code,cluster) (code_verb:apiserver_request_total:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
labels:
verb: write
record: code:apiserver_request_total:increase30d
- annotations: {}
expr: sum by (verb,scope,cluster) (increase(apiserver_request_sli_duration_seconds_count{job="kube-apiserver"}[1h]))
labels: {}
record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h
- annotations: {}
expr: sum by (verb,scope,cluster) (avg_over_time(cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase1h[30d])
* 24 * 30)
labels: {}
record: cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d
- annotations: {}
expr: sum by (verb,scope,le,cluster) (increase(apiserver_request_sli_duration_seconds_bucket[1h]))
labels: {}
record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h
- annotations: {}
expr: sum by (verb,scope,le,cluster) (avg_over_time(cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase1h[30d])
* 24 * 30)
labels: {}
record: cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d
- annotations: {}
expr: |-
1 - (
(
# write too slow
sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
-
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"POST|PUT|PATCH|DELETE",le="1"})
) +
(
# read too slow
sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"LIST|GET"})
-
(
(
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
or
vector(0)
)
+
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
+
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
)
) +
# errors
sum by (cluster) (code:apiserver_request_total:increase30d{code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase30d)
labels:
verb: all
record: apiserver_request:availability30d
- annotations: {}
expr: |-
1 - (
sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"LIST|GET"})
-
(
# too slow
(
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope=~"resource|",le="1"})
or
vector(0)
)
+
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="namespace",le="5"})
+
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"LIST|GET",scope="cluster",le="30"})
)
+
# errors
sum by (cluster) (code:apiserver_request_total:increase30d{verb="read",code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase30d{verb="read"})
labels:
verb: read
record: apiserver_request:availability30d
- annotations: {}
expr: |-
1 - (
(
# too slow
sum by (cluster) (cluster_verb_scope:apiserver_request_sli_duration_seconds_count:increase30d{verb=~"POST|PUT|PATCH|DELETE"})
-
sum by (cluster) (cluster_verb_scope_le:apiserver_request_sli_duration_seconds_bucket:increase30d{verb=~"POST|PUT|PATCH|DELETE",le="1"})
)
+
# errors
sum by (cluster) (code:apiserver_request_total:increase30d{verb="write",code=~"5.."} or vector(0))
)
/
sum by (cluster) (code:apiserver_request_total:increase30d{verb="write"})
labels:
verb: write
record: apiserver_request:availability30d
- annotations: {}
expr: sum by (code,resource,cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[5m]))
labels:
verb: read
record: code_resource:apiserver_request_total:rate5m
- annotations: {}
expr: sum by (code,resource,cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
labels:
verb: write
record: code_resource:apiserver_request_total:rate5m
- annotations: {}
expr: sum by (code,verb,cluster) (increase(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"2.."}[1h]))
labels: {}
record: code_verb:apiserver_request_total:increase1h
- annotations: {}
expr: sum by (code,verb,cluster) (increase(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"3.."}[1h]))
labels: {}
record: code_verb:apiserver_request_total:increase1h
- annotations: {}
expr: sum by (code,verb,cluster) (increase(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"4.."}[1h]))
labels: {}
record: code_verb:apiserver_request_total:increase1h
- annotations: {}
expr: sum by (code,verb,cluster) (increase(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET|POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
labels: {}
record: code_verb:apiserver_request_total:increase1h

View File

@@ -0,0 +1,324 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-apiserver-burnrate.rules
spec:
groups:
- name: kube-apiserver-burnrate.rules
params: {}
rules:
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[1d]))
-
(
(
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope=~"resource|",le="1"}[1d]))
or
vector(0)
)
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="namespace",le="5"}[1d]))
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="cluster",le="30"}[1d]))
)
)
+
# errors
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET",code=~"5.."}[1d]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[1d]))
labels:
verb: read
record: apiserver_request:burnrate1d
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[1h]))
-
(
(
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope=~"resource|",le="1"}[1h]))
or
vector(0)
)
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="namespace",le="5"}[1h]))
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="cluster",le="30"}[1h]))
)
)
+
# errors
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET",code=~"5.."}[1h]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[1h]))
labels:
verb: read
record: apiserver_request:burnrate1h
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[2h]))
-
(
(
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope=~"resource|",le="1"}[2h]))
or
vector(0)
)
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="namespace",le="5"}[2h]))
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="cluster",le="30"}[2h]))
)
)
+
# errors
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET",code=~"5.."}[2h]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[2h]))
labels:
verb: read
record: apiserver_request:burnrate2h
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[30m]))
-
(
(
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope=~"resource|",le="1"}[30m]))
or
vector(0)
)
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="namespace",le="5"}[30m]))
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="cluster",le="30"}[30m]))
)
)
+
# errors
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET",code=~"5.."}[30m]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[30m]))
labels:
verb: read
record: apiserver_request:burnrate30m
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[3d]))
-
(
(
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope=~"resource|",le="1"}[3d]))
or
vector(0)
)
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="namespace",le="5"}[3d]))
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="cluster",le="30"}[3d]))
)
)
+
# errors
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET",code=~"5.."}[3d]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[3d]))
labels:
verb: read
record: apiserver_request:burnrate3d
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[5m]))
-
(
(
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope=~"resource|",le="1"}[5m]))
or
vector(0)
)
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="namespace",le="5"}[5m]))
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="cluster",le="30"}[5m]))
)
)
+
# errors
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET",code=~"5.."}[5m]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[5m]))
labels:
verb: read
record: apiserver_request:burnrate5m
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[6h]))
-
(
(
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope=~"resource|",le="1"}[6h]))
or
vector(0)
)
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="namespace",le="5"}[6h]))
+
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward",scope="cluster",le="30"}[6h]))
)
)
+
# errors
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET",code=~"5.."}[6h]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[6h]))
labels:
verb: read
record: apiserver_request:burnrate6h
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[1d]))
-
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward",le="1"}[1d]))
)
+
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1d]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d]))
labels:
verb: write
record: apiserver_request:burnrate1d
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[1h]))
-
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward",le="1"}[1h]))
)
+
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[1h]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))
labels:
verb: write
record: apiserver_request:burnrate1h
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[2h]))
-
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward",le="1"}[2h]))
)
+
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[2h]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h]))
labels:
verb: write
record: apiserver_request:burnrate2h
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[30m]))
-
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward",le="1"}[30m]))
)
+
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[30m]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m]))
labels:
verb: write
record: apiserver_request:burnrate30m
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[3d]))
-
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward",le="1"}[3d]))
)
+
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[3d]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d]))
labels:
verb: write
record: apiserver_request:burnrate3d
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[5m]))
-
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward",le="1"}[5m]))
)
+
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[5m]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))
labels:
verb: write
record: apiserver_request:burnrate5m
- annotations: {}
expr: |-
(
(
# too slow
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[6h]))
-
sum by (cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward",le="1"}[6h]))
)
+
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",code=~"5.."}[6h]))
)
/
sum by (cluster) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))
labels:
verb: write
record: apiserver_request:burnrate6h

View File

@@ -0,0 +1,23 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-apiserver-histogram.rules
spec:
groups:
- name: kube-apiserver-histogram.rules
params: {}
rules:
- annotations: {}
expr: histogram_quantile(0.99, sum by (le,resource,cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET",subresource!~"proxy|attach|log|exec|portforward"}[5m])))
> 0
labels:
quantile: '0.99'
verb: read
record: cluster_quantile:apiserver_request_sli_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.99, sum by (le,resource,cluster) (rate(apiserver_request_sli_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE",subresource!~"proxy|attach|log|exec|portforward"}[5m])))
> 0
labels:
quantile: '0.99'
verb: write
record: cluster_quantile:apiserver_request_sli_duration_seconds:histogram_quantile

View File

@@ -0,0 +1,73 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-apiserver-slos
spec:
groups:
- name: kube-apiserver-slos
params: {}
rules:
- alert: KubeAPIErrorBudgetBurn
annotations:
description: The API server is burning too much error budget.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn
summary: The API server is burning too much error budget.
expr: |-
sum(apiserver_request:burnrate1h) > (14.40 * 0.01000)
and
sum(apiserver_request:burnrate5m) > (14.40 * 0.01000)
for: 2m
labels:
long: 1h
severity: critical
short: 5m
exported_instance: '{{ $labels.namespace }}/{{ $labels.apiserver }}'
service: kube-apiserver-slos
- alert: KubeAPIErrorBudgetBurn
annotations:
description: The API server is burning too much error budget.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn
summary: The API server is burning too much error budget.
expr: |-
sum(apiserver_request:burnrate6h) > (6.00 * 0.01000)
and
sum(apiserver_request:burnrate30m) > (6.00 * 0.01000)
for: 15m
labels:
long: 6h
severity: critical
short: 30m
exported_instance: '{{ $labels.namespace }}/{{ $labels.apiserver }}'
service: kube-apiserver-slos
- alert: KubeAPIErrorBudgetBurn
annotations:
description: The API server is burning too much error budget.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn
summary: The API server is burning too much error budget.
expr: |-
sum(apiserver_request:burnrate1d) > (3.00 * 0.01000)
and
sum(apiserver_request:burnrate2h) > (3.00 * 0.01000)
for: 1h
labels:
long: 1d
severity: warning
short: 2h
exported_instance: '{{ $labels.namespace }}/{{ $labels.apiserver }}'
service: kube-apiserver-slos
- alert: KubeAPIErrorBudgetBurn
annotations:
description: The API server is burning too much error budget.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapierrorbudgetburn
summary: The API server is burning too much error budget.
expr: |-
sum(apiserver_request:burnrate3d) > (1.00 * 0.01000)
and
sum(apiserver_request:burnrate6h) > (1.00 * 0.01000)
for: 3h
labels:
long: 3d
severity: warning
short: 6h
exported_instance: '{{ $labels.namespace }}/{{ $labels.apiserver }}'
service: kube-apiserver-slos

View File

@@ -0,0 +1,17 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-prometheus-general.rules
spec:
groups:
- name: kube-prometheus-general.rules
params: {}
rules:
- annotations: {}
expr: count without(instance, pod, node) (up == 1)
labels: {}
record: count:up1
- annotations: {}
expr: count without(instance, pod, node) (up == 0)
labels: {}
record: count:up0

View File

@@ -0,0 +1,37 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-prometheus-node-recording.rules
spec:
groups:
- name: kube-prometheus-node-recording.rules
params: {}
rules:
- annotations: {}
expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[3m]))
BY (instance)
labels: {}
record: instance:node_cpu:rate:sum
- annotations: {}
expr: sum(rate(node_network_receive_bytes_total[3m])) BY (instance)
labels: {}
record: instance:node_network_receive_bytes:rate:sum
- annotations: {}
expr: sum(rate(node_network_transmit_bytes_total[3m])) BY (instance)
labels: {}
record: instance:node_network_transmit_bytes:rate:sum
- annotations: {}
expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m]))
WITHOUT (cpu, mode) / ON(instance) GROUP_LEFT() count(sum(node_cpu_seconds_total)
BY (instance, cpu)) BY (instance)
labels: {}
record: instance:node_cpu:ratio
- annotations: {}
expr: sum(rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[5m]))
labels: {}
record: cluster:node_cpu:sum_rate5m
- annotations: {}
expr: cluster:node_cpu:sum_rate5m / count(sum(node_cpu_seconds_total) BY (instance,
cpu))
labels: {}
record: cluster:node_cpu:ratio

View File

@@ -0,0 +1,63 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-scheduler.rules
spec:
groups:
- name: kube-scheduler.rules
params: {}
rules:
- annotations: {}
expr: histogram_quantile(0.99, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.99'
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.99, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.99'
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.99, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.99'
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.9, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.9'
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.9, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.9'
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.9, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.9'
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.5, sum(rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.5'
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.5, sum(rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.5'
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.5, sum(rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))
without(instance, pod))
labels:
quantile: '0.5'
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile

View File

@@ -0,0 +1,73 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kube-state-metrics
spec:
groups:
- name: kube-state-metrics
params: {}
rules:
- alert: KubeStateMetricsListErrors
annotations:
description: kube-state-metrics is experiencing errors at an elevated rate
in list operations. This is likely causing it to not be able to expose metrics
about Kubernetes objects correctly or at all.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricslisterrors
summary: kube-state-metrics is experiencing errors in list operations.
expr: |-
(sum(rate(kube_state_metrics_list_total{job="kube-state-metrics",result="error"}[5m])) by (cluster)
/
sum(rate(kube_state_metrics_list_total{job="kube-state-metrics"}[5m])) by (cluster))
> 0.01
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.cluster }}/kube-state-metrics'
service: kube-state-metrics
- alert: KubeStateMetricsWatchErrors
annotations:
description: kube-state-metrics is experiencing errors at an elevated rate
in watch operations. This is likely causing it to not be able to expose
metrics about Kubernetes objects correctly or at all.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricswatcherrors
summary: kube-state-metrics is experiencing errors in watch operations.
expr: |-
(sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics",result="error"}[5m])) by (cluster)
/
sum(rate(kube_state_metrics_watch_total{job="kube-state-metrics"}[5m])) by (cluster))
> 0.01
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.cluster }}/kube-state-metrics'
service: kube-state-metrics
- alert: KubeStateMetricsShardingMismatch
annotations:
description: kube-state-metrics pods are running with different --total-shards
configuration, some Kubernetes objects may be exposed multiple times or
not exposed at all.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricsshardingmismatch
summary: kube-state-metrics sharding is misconfigured.
expr: stdvar (kube_state_metrics_total_shards{job="kube-state-metrics"}) by
(cluster) != 0
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.cluster }}/kube-state-metrics'
service: kube-state-metrics
- alert: KubeStateMetricsShardsMissing
annotations:
description: kube-state-metrics shards are missing, some Kubernetes objects
are not being exposed.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kube-state-metrics/kubestatemetricsshardsmissing
summary: kube-state-metrics shards are missing.
expr: |-
2^max(kube_state_metrics_total_shards{job="kube-state-metrics"}) by (cluster) - 1
-
sum( 2 ^ max by (shard_ordinal,cluster) (kube_state_metrics_shard_ordinal{job="kube-state-metrics"}) ) by (cluster)
!= 0
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.cluster }}/kube-state-metrics'
service: kube-state-metrics

View File

@@ -0,0 +1,30 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubelet.rules
spec:
groups:
- name: kubelet.rules
params: {}
rules:
- annotations: {}
expr: histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
metrics_path="/metrics"}[5m])) by (instance,le,cluster) * on (instance,cluster)
group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
labels:
quantile: '0.99'
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.9, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
metrics_path="/metrics"}[5m])) by (instance,le,cluster) * on (instance,cluster)
group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
labels:
quantile: '0.9'
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile
- annotations: {}
expr: histogram_quantile(0.5, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",
metrics_path="/metrics"}[5m])) by (instance,le,cluster) * on (instance,cluster)
group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"})
labels:
quantile: '0.5'
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile

View File

@@ -0,0 +1,304 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-apps
spec:
groups:
- name: kubernetes-apps
params: {}
rules:
- alert: KubePodCrashLooping
annotations:
description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container
}}) is in waiting state (reason: "CrashLoopBackOff").'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodcrashlooping
summary: Pod is crash looping.
expr: max_over_time(kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff",
job="kube-state-metrics", namespace=~".*"}[5m]) >= 1
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.pod }}'
service: kubernetes-apps
- alert: KubePodNotReady
annotations:
description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready
state for longer than 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepodnotready
summary: Pod has been in a non-ready state for more than 15 minutes.
expr: |-
sum by (namespace,pod,cluster) (
max by (namespace,pod,cluster) (
kube_pod_status_phase{job="kube-state-metrics", namespace=~".*", phase=~"Pending|Unknown|Failed"}
) * on (namespace,pod,cluster) group_left(owner_kind) topk by (namespace,pod,cluster) (
1, max by (namespace,pod,owner_kind,cluster) (kube_pod_owner{owner_kind!="Job"})
)
) > 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.pod }}'
service: kubernetes-apps
- alert: KubeDeploymentGenerationMismatch
annotations:
description: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment
}} does not match, this indicates that the Deployment has failed but has
not been rolled back.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentgenerationmismatch
summary: Deployment generation mismatch due to possible roll-back
expr: |-
kube_deployment_status_observed_generation{job="kube-state-metrics", namespace=~".*"}
!=
kube_deployment_metadata_generation{job="kube-state-metrics", namespace=~".*"}
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.deployment }}'
service: kubernetes-apps
- alert: KubeDeploymentReplicasMismatch
annotations:
description: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has
not matched the expected number of replicas for longer than 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentreplicasmismatch
summary: Deployment has not matched the expected number of replicas.
expr: |-
(
kube_deployment_spec_replicas{job="kube-state-metrics", namespace=~".*"}
>
kube_deployment_status_replicas_available{job="kube-state-metrics", namespace=~".*"}
) and (
changes(kube_deployment_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[10m])
==
0
)
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.deployment }}'
service: kubernetes-apps
- alert: KubeDeploymentRolloutStuck
annotations:
description: Rollout of deployment {{ $labels.namespace }}/{{ $labels.deployment
}} is not progressing for longer than 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedeploymentrolloutstuck
summary: Deployment rollout is not progressing.
expr: |-
kube_deployment_status_condition{condition="Progressing", status="false",job="kube-state-metrics", namespace=~".*"}
!= 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.deployment }}'
service: kubernetes-apps
- alert: KubeStatefulSetReplicasMismatch
annotations:
description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }}
has not matched the expected number of replicas for longer than 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetreplicasmismatch
summary: StatefulSet has not matched the expected number of replicas.
expr: |-
(
kube_statefulset_status_replicas_ready{job="kube-state-metrics", namespace=~".*"}
!=
kube_statefulset_status_replicas{job="kube-state-metrics", namespace=~".*"}
) and (
changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[10m])
==
0
)
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.statefulset }}'
service: kubernetes-apps
- alert: KubeStatefulSetGenerationMismatch
annotations:
description: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset
}} does not match, this indicates that the StatefulSet has failed but has
not been rolled back.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetgenerationmismatch
summary: StatefulSet generation mismatch due to possible roll-back
expr: |-
kube_statefulset_status_observed_generation{job="kube-state-metrics", namespace=~".*"}
!=
kube_statefulset_metadata_generation{job="kube-state-metrics", namespace=~".*"}
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.statefulset }}'
service: kubernetes-apps
- alert: KubeStatefulSetUpdateNotRolledOut
annotations:
description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }}
update has not been rolled out.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubestatefulsetupdatenotrolledout
summary: StatefulSet update has not been rolled out.
expr: |-
(
max by (namespace,statefulset,cluster) (
kube_statefulset_status_current_revision{job="kube-state-metrics", namespace=~".*"}
unless
kube_statefulset_status_update_revision{job="kube-state-metrics", namespace=~".*"}
)
*
(
kube_statefulset_replicas{job="kube-state-metrics", namespace=~".*"}
!=
kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}
)
) and (
changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics", namespace=~".*"}[5m])
==
0
)
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.statefulset }}'
service: kubernetes-apps
- alert: KubeDaemonSetRolloutStuck
annotations:
description: DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} has
not finished or progressed for at least 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetrolloutstuck
summary: DaemonSet rollout is stuck.
expr: |-
(
(
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics", namespace=~".*"}
!=
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
) or (
kube_daemonset_status_number_misscheduled{job="kube-state-metrics", namespace=~".*"}
!=
0
) or (
kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics", namespace=~".*"}
!=
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
) or (
kube_daemonset_status_number_available{job="kube-state-metrics", namespace=~".*"}
!=
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
)
) and (
changes(kube_daemonset_status_updated_number_scheduled{job="kube-state-metrics", namespace=~".*"}[5m])
==
0
)
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.daemonset }}'
service: kubernetes-apps
- alert: KubeContainerWaiting
annotations:
description: pod/{{ $labels.pod }} in namespace {{ $labels.namespace }} on
container {{ $labels.container}} has been in waiting state for longer than
1 hour.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontainerwaiting
summary: Pod container waiting longer than 1 hour
expr: sum by (namespace,pod,container,cluster) (kube_pod_container_status_waiting_reason{job="kube-state-metrics",
namespace=~".*"}) > 0
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container
}}'
service: kubernetes-apps
- alert: KubeDaemonSetNotScheduled
annotations:
description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
}} are not scheduled.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetnotscheduled
summary: DaemonSet pods are not scheduled.
expr: |-
kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics", namespace=~".*"}
-
kube_daemonset_status_current_number_scheduled{job="kube-state-metrics", namespace=~".*"} > 0
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.daemonset }}'
service: kubernetes-apps
- alert: KubeDaemonSetMisScheduled
annotations:
description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset
}} are running where they are not supposed to run.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubedaemonsetmisscheduled
summary: DaemonSet pods are misscheduled.
expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics", namespace=~".*"}
> 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.daemonset }}'
service: kubernetes-apps
- alert: KubeJobNotCompleted
annotations:
description: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking
more than {{ "43200" | humanizeDuration }} to complete.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobnotcompleted
summary: Job did not complete in time
expr: |-
time() - max by (namespace,job_name,cluster) (kube_job_status_start_time{job="kube-state-metrics", namespace=~".*"}
and
kube_job_status_active{job="kube-state-metrics", namespace=~".*"} > 0) > 43200
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.job_name }}'
service: kubernetes-apps
- alert: KubeJobFailed
annotations:
description: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to
complete. Removing failed job after investigation should clear this alert.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubejobfailed
summary: Job failed to complete.
expr: kube_job_failed{job="kube-state-metrics", namespace=~".*"} > 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.job_name }}'
service: kubernetes-apps
- alert: KubeHpaReplicasMismatch
annotations:
description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }}
has not matched the desired number of replicas for longer than 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpareplicasmismatch
summary: HPA has not matched desired number of replicas.
expr: |-
(kube_horizontalpodautoscaler_status_desired_replicas{job="kube-state-metrics", namespace=~".*"}
!=
kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"})
and
(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
>
kube_horizontalpodautoscaler_spec_min_replicas{job="kube-state-metrics", namespace=~".*"})
and
(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
<
kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", namespace=~".*"})
and
changes(kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}[15m]) == 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler
}}'
service: kubernetes-apps
- alert: KubeHpaMaxedOut
annotations:
description: HPA {{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler }}
has been running at max replicas for longer than 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubehpamaxedout
summary: HPA is running at max replicas
expr: |-
kube_horizontalpodautoscaler_status_current_replicas{job="kube-state-metrics", namespace=~".*"}
==
kube_horizontalpodautoscaler_spec_max_replicas{job="kube-state-metrics", namespace=~".*"}
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.horizontalpodautoscaler
}}'
service: kubernetes-apps

View File

@@ -0,0 +1,138 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-resources
spec:
groups:
- name: kubernetes-resources
params: {}
rules:
- alert: KubeCPUOvercommit
annotations:
description: Cluster {{ $labels.cluster }} has overcommitted CPU resource
requests for Pods by {{ $value }} CPU shares and cannot tolerate node failure.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuovercommit
summary: Cluster has overcommitted CPU resource requests.
expr: |-
sum(namespace_cpu:kube_pod_container_resource_requests:sum{}) by (cluster) - (sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0
and
(sum(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster) - max(kube_node_status_allocatable{job="kube-state-metrics",resource="cpu"}) by (cluster)) > 0
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.cluster }}'
service: kubernetes-resources
- alert: KubeMemoryOvercommit
annotations:
description: Cluster {{ $labels.cluster }} has overcommitted memory resource
requests for Pods by {{ $value | humanize }} bytes and cannot tolerate node
failure.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubememoryovercommit
summary: Cluster has overcommitted memory resource requests.
expr: |-
sum(namespace_memory:kube_pod_container_resource_requests:sum{}) by (cluster) - (sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) > 0
and
(sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster) - max(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)) > 0
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.cluster }}'
service: kubernetes-resources
- alert: KubeCPUQuotaOvercommit
annotations:
description: Cluster {{ $labels.cluster }} has overcommitted CPU resource
requests for Namespaces.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecpuquotaovercommit
summary: Cluster has overcommitted CPU resource requests.
expr: |-
sum(min without(resource) (kube_resourcequota{job="kube-state-metrics", type="hard", resource=~"(cpu|requests.cpu)"})) by (cluster)
/
sum(kube_node_status_allocatable{resource="cpu", job="kube-state-metrics"}) by (cluster)
> 1.5
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.cluster }}'
service: kubernetes-resources
- alert: KubeMemoryQuotaOvercommit
annotations:
description: Cluster {{ $labels.cluster }} has overcommitted memory resource
requests for Namespaces.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubememoryquotaovercommit
summary: Cluster has overcommitted memory resource requests.
expr: |-
sum(min without(resource) (kube_resourcequota{job="kube-state-metrics", type="hard", resource=~"(memory|requests.memory)"})) by (cluster)
/
sum(kube_node_status_allocatable{resource="memory", job="kube-state-metrics"}) by (cluster)
> 1.5
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.cluster }}'
service: kubernetes-resources
- alert: KubeQuotaAlmostFull
annotations:
description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage
}} of its {{ $labels.resource }} quota.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaalmostfull
summary: Namespace quota is going to be full.
expr: |-
kube_resourcequota{job="kube-state-metrics", type="used"}
/ ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
> 0.9 < 1
for: 15m
labels:
severity: informational
exported_instance: '{{ $labels.namespace }}'
service: kubernetes-resources
- alert: KubeQuotaFullyUsed
annotations:
description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage
}} of its {{ $labels.resource }} quota.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotafullyused
summary: Namespace quota is fully used.
expr: |-
kube_resourcequota{job="kube-state-metrics", type="used"}
/ ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
== 1
for: 15m
labels:
severity: informational
exported_instance: '{{ $labels.namespace }}'
service: kubernetes-resources
- alert: KubeQuotaExceeded
annotations:
description: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage
}} of its {{ $labels.resource }} quota.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubequotaexceeded
summary: Namespace quota has exceeded the limits.
expr: |-
kube_resourcequota{job="kube-state-metrics", type="used"}
/ ignoring(instance, job, type)
(kube_resourcequota{job="kube-state-metrics", type="hard"} > 0)
> 1
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}'
service: kubernetes-resources
- alert: CPUThrottlingHigh
annotations:
description: '{{ $value | humanizePercentage }} throttling of CPU in namespace
{{ $labels.namespace }} for container {{ $labels.container }} in pod {{
$labels.pod }}.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/cputhrottlinghigh
summary: Processes experience elevated CPU throttling.
expr: |-
sum(increase(container_cpu_cfs_throttled_periods_total{container!="", }[5m])) by (container,pod,namespace,cluster)
/
sum(increase(container_cpu_cfs_periods_total{}[5m])) by (container,pod,namespace,cluster)
> ( 25 / 100 )
for: 15m
labels:
severity: informational
exported_instance: '{{ $labels.namespace }}/{{ $labels.pod }}/{{ $labels.container
}}'
service: kubernetes-resources

View File

@@ -0,0 +1,130 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-storage
spec:
groups:
- name: kubernetes-storage
params: {}
rules:
- alert: KubePersistentVolumeFillingUp
annotations:
description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
}} in Namespace {{ $labels.namespace }} {{ with $labels.cluster -}} on Cluster
{{ . }} {{- end }} is only {{ $value | humanizePercentage }} free.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup
summary: PersistentVolume is filling up.
expr: |-
(
kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.03
and
kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
for: 1m
labels:
severity: critical
exported_instance: '{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim
}}'
service: kubernetes-storage
- alert: KubePersistentVolumeFillingUp
annotations:
description: Based on recent sampling, the PersistentVolume claimed by {{
$labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} {{
with $labels.cluster -}} on Cluster {{ . }} {{- end }} is expected to fill
up within four days. Currently {{ $value | humanizePercentage }} is available.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumefillingup
summary: PersistentVolume is filling up.
expr: |-
(
kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
/
kubelet_volume_stats_capacity_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.15
and
kubelet_volume_stats_used_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
and
predict_linear(kubelet_volume_stats_available_bytes{job="kubelet", namespace=~".*", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim
}}'
service: kubernetes-storage
- alert: KubePersistentVolumeInodesFillingUp
annotations:
description: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim
}} in Namespace {{ $labels.namespace }} {{ with $labels.cluster -}} on Cluster
{{ . }} {{- end }} only has {{ $value | humanizePercentage }} free inodes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeinodesfillingup
summary: PersistentVolumeInodes are filling up.
expr: |-
(
kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}
/
kubelet_volume_stats_inodes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.03
and
kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
for: 1m
labels:
severity: critical
exported_instance: '{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim
}}'
service: kubernetes-storage
- alert: KubePersistentVolumeInodesFillingUp
annotations:
description: Based on recent sampling, the PersistentVolume claimed by {{
$labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} {{
with $labels.cluster -}} on Cluster {{ . }} {{- end }} is expected to run
out of inodes within four days. Currently {{ $value | humanizePercentage
}} of its inodes are free.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeinodesfillingup
summary: PersistentVolumeInodes are filling up.
expr: |-
(
kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}
/
kubelet_volume_stats_inodes{job="kubelet", namespace=~".*", metrics_path="/metrics"}
) < 0.15
and
kubelet_volume_stats_inodes_used{job="kubelet", namespace=~".*", metrics_path="/metrics"} > 0
and
predict_linear(kubelet_volume_stats_inodes_free{job="kubelet", namespace=~".*", metrics_path="/metrics"}[6h], 4 * 24 * 3600) < 0
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_access_mode{ access_mode="ReadOnlyMany"} == 1
unless on (namespace,persistentvolumeclaim,cluster)
kube_persistentvolumeclaim_labels{label_excluded_from_alerts="true"} == 1
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.persistentvolumeclaim
}}'
service: kubernetes-storage
- alert: KubePersistentVolumeErrors
annotations:
description: The persistent volume {{ $labels.persistentvolume }} {{ with
$labels.cluster -}} on Cluster {{ . }} {{- end }} has status {{ $labels.phase
}}.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubepersistentvolumeerrors
summary: PersistentVolume is having issues with provisioning.
expr: kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"}
> 0
for: 5m
labels:
severity: critical
exported_instance: '{{ $labels.persistentvolume }}'
service: kubernetes-storage

View File

@@ -0,0 +1,91 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-system-apiserver
spec:
groups:
- name: kubernetes-system-apiserver
params: {}
rules:
- alert: KubeClientCertificateExpiration
annotations:
description: A client certificate used to authenticate to kubernetes apiserver
is expiring in less than 7.0 days.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration
summary: Client certificate is about to expire.
expr: apiserver_client_certificate_expiration_seconds_count{job="kube-apiserver"}
> 0 and on (job,cluster) histogram_quantile(0.01, sum by (job,le,cluster)
(rate(apiserver_client_certificate_expiration_seconds_bucket{job="kube-apiserver"}[5m])))
< 604800
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.namespace }}/{{ $labels.pod }}'
service: kubernetes-system-apiserver
- alert: KubeClientCertificateExpiration
annotations:
description: A client certificate used to authenticate to kubernetes apiserver
is expiring in less than 24.0 hours.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclientcertificateexpiration
summary: Client certificate is about to expire.
expr: apiserver_client_certificate_expiration_seconds_count{job="kube-apiserver"}
> 0 and on (job,cluster) histogram_quantile(0.01, sum by (job,le,cluster)
(rate(apiserver_client_certificate_expiration_seconds_bucket{job="kube-apiserver"}[5m])))
< 86400
for: 5m
labels:
severity: critical
exported_instance: '{{ $labels.namespace }}/{{ $labels.pod }}'
service: kubernetes-system-apiserver
- alert: KubeAggregatedAPIErrors
annotations:
description: Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace
}} has reported errors. It has appeared unavailable {{ $value | humanize
}} times averaged over the past 10m.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapierrors
summary: Kubernetes aggregated API has reported errors.
expr: sum by (name,namespace,cluster)(increase(aggregator_unavailable_apiservice_total{job="kube-apiserver"}[10m]))
> 4
labels:
severity: warning
exported_instance: '{{ $labels.name }}/{{ $labels.namespace }}'
service: kubernetes-system-apiserver
- alert: KubeAggregatedAPIDown
annotations:
description: Kubernetes aggregated API {{ $labels.name }}/{{ $labels.namespace
}} has been only {{ $value | humanize }}% available over the last 10m.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeaggregatedapidown
summary: Kubernetes aggregated API is down.
expr: (1 - max by (name,namespace,cluster)(avg_over_time(aggregator_unavailable_apiservice{job="kube-apiserver"}[10m])))
* 100 < 85
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.name }}/{{ $labels.namespace }}'
service: kubernetes-system-apiserver
- alert: KubeAPIDown
annotations:
description: KubeAPI has disappeared from Prometheus target discovery.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapidown
summary: Target disappeared from Prometheus target discovery.
expr: absent(up{job="kube-apiserver"} == 1)
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.cluster }}/apiserver'
service: kubernetes-system-apiserver
- alert: KubeAPITerminatedRequests
annotations:
description: The kubernetes apiserver has terminated {{ $value | humanizePercentage
}} of its incoming requests.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeapiterminatedrequests
summary: The kubernetes apiserver has terminated {{ $value | humanizePercentage
}} of its incoming requests.
expr: sum(rate(apiserver_request_terminations_total{job="kube-apiserver"}[10m])) /
( sum(rate(apiserver_request_total{job="kube-apiserver"}[10m])) + sum(rate(apiserver_request_terminations_total{job="kube-apiserver"}[10m]))
) > 0.20
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.cluster }}/apiserver'
service: kubernetes-system-apiserver

View File

@@ -0,0 +1,21 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-system-controller-manager
spec:
groups:
- name: kubernetes-system-controller-manager
params: {}
rules:
- alert: KubeControllerManagerDown
annotations:
description: KubeControllerManager has disappeared from Prometheus target
discovery.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubecontrollermanagerdown
summary: Target disappeared from Prometheus target discovery.
expr: absent(up{job="kube-controller-manager"} == 1)
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.instance }}/controller-manager'
service: kubernetes-system-controller-manager

View File

@@ -0,0 +1,175 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-system-kubelet
spec:
groups:
- name: kubernetes-system-kubelet
params: {}
rules:
- alert: KubeNodeNotReady
annotations:
description: '{{ $labels.node }} has been unready for more than 15 minutes.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodenotready
summary: Node is not ready.
expr: kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"}
== 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeNodeUnreachable
annotations:
description: '{{ $labels.node }} is unreachable and some workloads may be
rescheduled.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodeunreachable
summary: Node is unreachable.
expr: (kube_node_spec_taint{job="kube-state-metrics",key="node.kubernetes.io/unreachable",effect="NoSchedule"}
unless ignoring(key,value) kube_node_spec_taint{job="kube-state-metrics",key=~"ToBeDeletedByClusterAutoscaler|cloud.google.com/impending-node-termination|aws-node-termination-handler/spot-itn"})
== 1
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletTooManyPods
annotations:
description: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage
}} of its Pod capacity.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubelettoomanypods
summary: Kubelet is running at capacity.
expr: |-
count by (node,cluster) (
(kube_pod_status_phase{job="kube-state-metrics",phase="Running"} == 1) * on (instance,pod,namespace,cluster) group_left(node) topk by (instance,pod,namespace,cluster) (1, kube_pod_info{job="kube-state-metrics"})
)
/
max by (node,cluster) (
kube_node_status_capacity{job="kube-state-metrics",resource="pods"} != 1
) > 0.95
for: 15m
labels:
severity: informational
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeNodeReadinessFlapping
annotations:
description: The readiness status of node {{ $labels.node }} has changed {{
$value }} times in the last 15 minutes.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubenodereadinessflapping
summary: Node readiness status is flapping.
expr: sum(changes(kube_node_status_condition{job="kube-state-metrics",status="true",condition="Ready"}[15m]))
by (node,cluster) > 2
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletPlegDurationHigh
annotations:
description: The Kubelet Pod Lifecycle Event Generator has a 99th percentile
duration of {{ $value }} seconds on node {{ $labels.node }}.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletplegdurationhigh
summary: Kubelet Pod Lifecycle Event Generator is taking too long to relist.
expr: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"}
>= 10
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletPodStartUpLatencyHigh
annotations:
description: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds
on node {{ $labels.node }}.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletpodstartuplatencyhigh
summary: Kubelet Pod startup latency is too high.
expr: histogram_quantile(0.99, sum(rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet",
metrics_path="/metrics"}[5m])) by (instance,le,cluster)) * on (instance,cluster)
group_left(node) kubelet_node_name{job="kubelet", metrics_path="/metrics"}
> 60
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletClientCertificateExpiration
annotations:
description: Client certificate for Kubelet on node {{ $labels.node }} expires
in {{ $value | humanizeDuration }}.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration
summary: Kubelet client certificate is about to expire.
expr: kubelet_certificate_manager_client_ttl_seconds < 604800
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletClientCertificateExpiration
annotations:
description: Client certificate for Kubelet on node {{ $labels.node }} expires
in {{ $value | humanizeDuration }}.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificateexpiration
summary: Kubelet client certificate is about to expire.
expr: kubelet_certificate_manager_client_ttl_seconds < 86400
labels:
severity: critical
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletServerCertificateExpiration
annotations:
description: Server certificate for Kubelet on node {{ $labels.node }} expires
in {{ $value | humanizeDuration }}.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration
summary: Kubelet server certificate is about to expire.
expr: kubelet_certificate_manager_server_ttl_seconds < 604800
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletServerCertificateExpiration
annotations:
description: Server certificate for Kubelet on node {{ $labels.node }} expires
in {{ $value | humanizeDuration }}.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificateexpiration
summary: Kubelet server certificate is about to expire.
expr: kubelet_certificate_manager_server_ttl_seconds < 86400
labels:
severity: critical
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletClientCertificateRenewalErrors
annotations:
description: Kubelet on node {{ $labels.node }} has failed to renew its client
certificate ({{ $value | humanize }} errors in the last 5 minutes).
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletclientcertificaterenewalerrors
summary: Kubelet has failed to renew its client certificate.
expr: increase(kubelet_certificate_manager_client_expiration_renew_errors[5m])
> 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletServerCertificateRenewalErrors
annotations:
description: Kubelet on node {{ $labels.node }} has failed to renew its server
certificate ({{ $value | humanize }} errors in the last 5 minutes).
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletservercertificaterenewalerrors
summary: Kubelet has failed to renew its server certificate.
expr: increase(kubelet_server_expiration_renew_errors[5m]) > 0
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet
- alert: KubeletDown
annotations:
description: Kubelet has disappeared from Prometheus target discovery.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeletdown
summary: Target disappeared from Prometheus target discovery.
expr: absent(up{job="kubelet", metrics_path="/metrics"} == 1)
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.node }}'
service: kubernetes-system-kubelet

View File

@@ -0,0 +1,20 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-system-scheduler
spec:
groups:
- name: kubernetes-system-scheduler
params: {}
rules:
- alert: KubeSchedulerDown
annotations:
description: KubeScheduler has disappeared from Prometheus target discovery.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeschedulerdown
summary: Target disappeared from Prometheus target discovery.
expr: absent(up{job="kube-scheduler"} == 1)
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.scheduler }}'
service: kubernetes-system-scheduler

View File

@@ -0,0 +1,37 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-kubernetes-system
spec:
groups:
- name: kubernetes-system
params: {}
rules:
- alert: KubeVersionMismatch
annotations:
description: There are {{ $value }} different semantic versions of Kubernetes
components running.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeversionmismatch
summary: Different semantic versions of Kubernetes components running.
expr: count by (cluster) (count by (git_version,cluster) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"},"git_version","$1","git_version","(v[0-9]*.[0-9]*).*")))
> 1
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.cluster }}'
service: kubernetes-system
- alert: KubeClientErrors
annotations:
description: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance
}}' is experiencing {{ $value | humanizePercentage }} errors.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/kubernetes/kubeclienterrors
summary: Kubernetes API server client is experiencing errors.
expr: |-
(sum(rate(rest_client_requests_total{job="kube-apiserver",code=~"5.."}[5m])) by (instance,job,namespace,cluster)
/
sum(rate(rest_client_requests_total{job="kube-apiserver"}[5m])) by (instance,job,namespace,cluster))
> 0.01
for: 15m
labels:
severity: warning
service: kubernetes-system

View File

@@ -0,0 +1,93 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-node-exporter.rules
spec:
groups:
- name: node-exporter.rules
params: {}
rules:
- annotations: {}
expr: |-
count without (cpu, mode) (
node_cpu_seconds_total{job="node-exporter",mode="idle"}
)
labels: {}
record: instance:node_num_cpu:sum
- annotations: {}
expr: |-
1 - avg without (cpu) (
sum without (mode) (rate(node_cpu_seconds_total{job="node-exporter", mode=~"idle|iowait|steal"}[5m]))
)
labels: {}
record: instance:node_cpu_utilisation:rate5m
- annotations: {}
expr: |-
(
node_load1{job="node-exporter"}
/
instance:node_num_cpu:sum{job="node-exporter"}
)
labels: {}
record: instance:node_load1_per_cpu:ratio
- annotations: {}
expr: |-
1 - (
(
node_memory_MemAvailable_bytes{job="node-exporter"}
or
(
node_memory_Buffers_bytes{job="node-exporter"}
+
node_memory_Cached_bytes{job="node-exporter"}
+
node_memory_MemFree_bytes{job="node-exporter"}
+
node_memory_Slab_bytes{job="node-exporter"}
)
)
/
node_memory_MemTotal_bytes{job="node-exporter"}
)
labels: {}
record: instance:node_memory_utilisation:ratio
- annotations: {}
expr: rate(node_vmstat_pgmajfault{job="node-exporter"}[5m])
labels: {}
record: instance:node_vmstat_pgmajfault:rate5m
- annotations: {}
expr: rate(node_disk_io_time_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m])
labels: {}
record: instance_device:node_disk_io_time_seconds:rate5m
- annotations: {}
expr: rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m])
labels: {}
record: instance_device:node_disk_io_time_weighted_seconds:rate5m
- annotations: {}
expr: |-
sum without (device) (
rate(node_network_receive_bytes_total{job="node-exporter", device!="lo"}[5m])
)
labels: {}
record: instance:node_network_receive_bytes_excluding_lo:rate5m
- annotations: {}
expr: |-
sum without (device) (
rate(node_network_transmit_bytes_total{job="node-exporter", device!="lo"}[5m])
)
labels: {}
record: instance:node_network_transmit_bytes_excluding_lo:rate5m
- annotations: {}
expr: |-
sum without (device) (
rate(node_network_receive_drop_total{job="node-exporter", device!="lo"}[5m])
)
labels: {}
record: instance:node_network_receive_drop_excluding_lo:rate5m
- annotations: {}
expr: |-
sum without (device) (
rate(node_network_transmit_drop_total{job="node-exporter", device!="lo"}[5m])
)
labels: {}
record: instance:node_network_transmit_drop_excluding_lo:rate5m

View File

@@ -0,0 +1,396 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-node-exporter
spec:
groups:
- name: node-exporter
params: {}
rules:
- alert: NodeFilesystemSpaceFillingUp
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
space left and is filling up.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
summary: Filesystem is predicted to run out of space within the next 24 hours.
expr: |-
(
node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 15
and
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""}[6h], 24*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFilesystemSpaceFillingUp
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
space left and is filling up fast.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemspacefillingup
summary: Filesystem is predicted to run out of space within the next 4 hours.
expr: |-
(
node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 10
and
predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""}[6h], 4*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 1h
labels:
severity: critical
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFilesystemAlmostOutOfSpace
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
space left.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
summary: Filesystem has less than 5% space left.
expr: |-
(
node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 5
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 30m
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFilesystemAlmostOutOfSpace
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
space left.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutofspace
summary: Filesystem has less than 3% space left.
expr: |-
(
node_filesystem_avail_bytes{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_size_bytes{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 3
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 30m
labels:
severity: critical
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFilesystemFilesFillingUp
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
inodes left and is filling up.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
summary: Filesystem is predicted to run out of inodes within the next 24 hours.
expr: |-
(
node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 40
and
predict_linear(node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""}[6h], 24*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFilesystemFilesFillingUp
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
inodes left and is filling up fast.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemfilesfillingup
summary: Filesystem is predicted to run out of inodes within the next 4 hours.
expr: |-
(
node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 20
and
predict_linear(node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""}[6h], 4*60*60) < 0
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 1h
labels:
severity: critical
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFilesystemAlmostOutOfFiles
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
inodes left.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
summary: Filesystem has less than 5% inodes left.
expr: |-
(
node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 5
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFilesystemAlmostOutOfFiles
annotations:
description: Filesystem on {{ $labels.device }}, mounted on {{ $labels.mountpoint
}}, at {{ $labels.node }} has only {{ printf "%.2f" $value }}% available
inodes left.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefilesystemalmostoutoffiles
summary: Filesystem has less than 3% inodes left.
expr: |-
(
node_filesystem_files_free{job="node-exporter",fstype!="",mountpoint!=""} / node_filesystem_files{job="node-exporter",fstype!="",mountpoint!=""} * 100 < 3
and
node_filesystem_readonly{job="node-exporter",fstype!="",mountpoint!=""} == 0
)
for: 1h
labels:
severity: critical
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeNetworkReceiveErrs
annotations:
description: '{{ $labels.node }} interface {{ $labels.device }} has encountered
{{ printf "%.0f" $value }} receive errors in the last two minutes.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworkreceiveerrs
summary: Network interface is reporting many receive errors.
expr: rate(node_network_receive_errs_total{job="node-exporter"}[2m]) / rate(node_network_receive_packets_total{job="node-exporter"}[2m])
> 0.01
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeNetworkTransmitErrs
annotations:
description: '{{ $labels.node }} interface {{ $labels.device }} has encountered
{{ printf "%.0f" $value }} transmit errors in the last two minutes.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodenetworktransmiterrs
summary: Network interface is reporting many transmit errors.
expr: rate(node_network_transmit_errs_total{job="node-exporter"}[2m]) / rate(node_network_transmit_packets_total{job="node-exporter"}[2m])
> 0.01
for: 1h
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeHighNumberConntrackEntriesUsed
annotations:
description: '{{ $value | humanizePercentage }} of conntrack entries are used.'
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodehighnumberconntrackentriesused
summary: Number of conntrack are getting close to the limit.
expr: (node_nf_conntrack_entries{job="node-exporter"} / node_nf_conntrack_entries_limit)
> 0.75
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeTextFileCollectorScrapeError
annotations:
description: Node Exporter text file collector on {{ $labels.node }} failed
to scrape.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodetextfilecollectorscrapeerror
summary: Node Exporter text file collector failed to scrape.
expr: node_textfile_scrape_error{job="node-exporter"} == 1
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeClockSkewDetected
annotations:
description: Clock at {{ $labels.node }} is out of sync by more than 0.05s.
Ensure NTP is configured correctly on this host.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclockskewdetected
summary: Clock skew detected.
expr: |-
(
node_timex_offset_seconds{job="node-exporter"} > 0.05
and
deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) >= 0
)
or
(
node_timex_offset_seconds{job="node-exporter"} < -0.05
and
deriv(node_timex_offset_seconds{job="node-exporter"}[5m]) <= 0
)
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeClockNotSynchronising
annotations:
description: Clock at {{ $labels.node }} is not synchronising. Ensure
NTP is configured on this host.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodeclocknotsynchronising
summary: Clock not synchronising.
expr: |-
min_over_time(node_timex_sync_status{job="node-exporter"}[5m]) == 0
and
node_timex_maxerror_seconds{job="node-exporter"} >= 16
for: 10m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeRAIDDegraded
annotations:
description: RAID array '{{ $labels.device }}' at {{ $labels.node }} is
in degraded state due to one or more disks failures. Number of spare drives
is insufficient to fix issue automatically.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddegraded
summary: RAID Array is degraded.
expr: node_md_disks_required{job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}
- ignoring (state) (node_md_disks{state="active",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"})
> 0
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeRAIDDiskFailure
annotations:
description: At least one device in RAID array at {{ $labels.node }} failed.
Array '{{ $labels.device }}' needs attention and possibly a disk swap.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/noderaiddiskfailure
summary: Failed device in RAID array.
expr: node_md_disks{state="failed",job="node-exporter",device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}
> 0
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeFileDescriptorLimit
annotations:
description: File descriptors limit at {{ $labels.node }} is currently
at {{ printf "%.2f" $value }}%.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
summary: Kernel is predicted to exhaust file descriptors limit soon.
expr: |-
(
node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 70
)
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeFileDescriptorLimit
annotations:
description: File descriptors limit at {{ $labels.node }} is currently
at {{ printf "%.2f" $value }}%.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodefiledescriptorlimit
summary: Kernel is predicted to exhaust file descriptors limit soon.
expr: |-
(
node_filefd_allocated{job="node-exporter"} * 100 / node_filefd_maximum{job="node-exporter"} > 90
)
for: 15m
labels:
severity: critical
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeCPUHighUsage
annotations:
description: |
CPU usage at {{ $labels.node }} has been above 90% for the last 15 minutes, is currently at {{ printf "%.2f" $value }}%.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodecpuhighusage
summary: High CPU usage.
expr: sum without(mode) (avg without (cpu) (rate(node_cpu_seconds_total{job="node-exporter",
mode!="idle"}[2m]))) * 100 > 90
for: 15m
labels:
severity: informational
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeSystemSaturation
annotations:
description: |
System load per core at {{ $labels.node }} has been above 2 for the last 15 minutes, is currently at {{ printf "%.2f" $value }}.
This might indicate this instance resources saturation and can cause it becoming unresponsive.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemsaturation
summary: System saturated, load per core is very high.
expr: |-
node_load1{job="node-exporter"}
/ count without (cpu, mode) (node_cpu_seconds_total{job="node-exporter", mode="idle"}) > 2
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeMemoryMajorPagesFaults
annotations:
description: |
Memory major pages are occurring at very high rate at {{ $labels.node }}, 500 major page faults per second for the last 15 minutes, is currently at {{ printf "%.2f" $value }}.
Please check that there is enough memory available at this instance.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodememorymajorpagesfaults
summary: Memory major page faults are occurring at very high rate.
expr: rate(node_vmstat_pgmajfault{job="node-exporter"}[5m]) > 500
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeMemoryHighUtilization
annotations:
description: |
Memory is filling up at {{ $labels.node }}, has been above 90% for the last 15 minutes, is currently at {{ printf "%.2f" $value }}%.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodememoryhighutilization
summary: Host is running out of memory.
expr: 100 - (node_memory_MemAvailable_bytes{job="node-exporter"} / node_memory_MemTotal_bytes{job="node-exporter"}
* 100) > 90
for: 15m
labels:
severity: warning
exported_instance: '{{ $labels.node }}'
service: node-exporter
- alert: NodeDiskIOSaturation
annotations:
description: |
Disk IO queue (aqu-sq) is high on {{ $labels.device }} at {{ $labels.node }}, has been above 10 for the last 30 minutes, is currently at {{ printf "%.2f" $value }}.
This symptom might indicate disk saturation.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodediskiosaturation
summary: Disk IO queue is high.
expr: rate(node_disk_io_time_weighted_seconds_total{job="node-exporter", device=~"(/dev/)?(mmcblk.p.+|nvme.+|rbd.+|sd.+|vd.+|xvd.+|dm-.+|md.+|dasd.+)"}[5m])
> 10
for: 30m
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.device }}'
service: node-exporter
- alert: NodeSystemdServiceFailed
annotations:
description: Systemd service {{ $labels.name }} has entered failed state at
{{ $labels.node }}
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodesystemdservicefailed
summary: Systemd service has entered failed state.
expr: node_systemd_unit_state{job="node-exporter", state="failed"} == 1
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.name }}'
service: node-exporter
- alert: NodeBondingDegraded
annotations:
description: Bonding interface {{ $labels.master }} on {{ $labels.node
}} is in degraded state due to one or more slave failures.
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/node/nodebondingdegraded
summary: Bonding interface is degraded
expr: (node_bonding_slaves - node_bonding_active) != 0
for: 5m
labels:
severity: warning
exported_instance: '{{ $labels.node }}/{{ $labels.master }}'
service: node-exporter

View File

@@ -0,0 +1,21 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-node-network
spec:
groups:
- name: node-network
params: {}
rules:
- alert: NodeNetworkInterfaceFlapping
annotations:
description: Network interface "{{ $labels.device }}" changing its up status
often on node-exporter {{ $labels.namespace }}/{{ $labels.pod }}
runbook_url: https://runbooks.prometheus-operator.dev/runbooks/general/nodenetworkinterfaceflapping
summary: Network interface is often changing its status
expr: changes(node_network_up{job="node-exporter",device!~"veth.+"}[2m]) > 2
for: 2m
labels:
severity: warning
exported_instance: '{{ $labels.instance }}/{{ $labels.device }}'
service: node-network

View File

@@ -0,0 +1,55 @@
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
name: alerts-node.rules
spec:
groups:
- name: node.rules
params: {}
rules:
- annotations: {}
expr: |-
topk by (namespace,pod,cluster) (1,
max by (node,namespace,pod,cluster) (
label_replace(kube_pod_info{job="kube-state-metrics",node!=""}, "pod", "$1", "pod", "(.*)")
))
labels: {}
record: 'node_namespace_pod:kube_pod_info:'
- annotations: {}
expr: |-
count by (node,cluster) (
node_cpu_seconds_total{mode="idle",job="node-exporter"}
* on (namespace,pod,cluster) group_left(node)
topk by (namespace,pod,cluster) (1, node_namespace_pod:kube_pod_info:)
)
labels: {}
record: node:node_num_cpu:sum
- annotations: {}
expr: |-
sum(
node_memory_MemAvailable_bytes{job="node-exporter"} or
(
node_memory_Buffers_bytes{job="node-exporter"} +
node_memory_Cached_bytes{job="node-exporter"} +
node_memory_MemFree_bytes{job="node-exporter"} +
node_memory_Slab_bytes{job="node-exporter"}
)
) by (cluster)
labels: {}
record: :node_memory_MemAvailable_bytes:sum
- annotations: {}
expr: |-
avg by (node,cluster) (
sum without (mode) (
rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal",job="node-exporter"}[5m])
)
)
labels: {}
record: node:node_cpu_utilization:ratio_rate5m
- annotations: {}
expr: |-
avg by (cluster) (
node:node_cpu_utilization:ratio_rate5m
)
labels: {}
record: cluster:node_cpu:ratio_rate5m

View File

@@ -1,688 +0,0 @@
## Next release
- TODO
## 0.25.17
**Release date:** 2024-09-20
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Added VMAuth to k8s stack. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/829)
- Fixed ETCD dashboard
- Use path prefix from args as a default path prefix for ingress. Related [issue](https://github.com/VictoriaMetrics/helm-charts/issues/1260)
- Allow using vmalert without notifiers configuration. Note that it is required to use `.vmalert.spec.extraArgs["notifiers.blackhole"]: true` in order to start vmalert with a blackhole configuration.
## 0.25.16
**Release date:** 2024-09-10
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Do not truncate servicemonitor, datasources, rules, dashboard, alertmanager & vmalert templates names
- Use service label for node-exporter instead of podLabel. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1458)
- Added common chart to a k8s-stack. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1456)
- Fixed value of custom alertmanager configSecret. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1461)
## 0.25.15
**Release date:** 2024-09-05
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Drop empty endpoints param from scrape configuration
- Fixed proto when TLS is enabled. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1449)
## 0.25.14
**Release date:** 2024-09-04
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fixed alertmanager templates
## 0.25.13
**Release date:** 2024-09-04
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Use operator's own service monitor
## 0.25.12
**Release date:** 2024-09-03
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Fixed dashboards rendering. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1414)
- Fixed service monitor label name.
## 0.25.11
**Release date:** 2024-09-03
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Merged ingress templates
- Removed custom VMServiceScrape for operator
- Added ability to override default Prometheus-compatible datatasources with all available parameters. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/860).
- Do not use `grafana.dashboards` and `grafana.dashboardProviders`. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1312).
- Migrated Node Exporter dashboard into chart
- Deprecated `grafana.sidecar.jsonData`, `grafana.provisionDefaultDatasource` in a favour of `grafana.sidecar.datasources.default` slice of datasources.
- Fail if no notifiers are set, do not set `notifiers` to null if empty
## 0.25.10
**Release date:** 2024-08-31
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fixed ingress extraPaths and externalVM urls rendering
## 0.25.9
**Release date:** 2024-08-31
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fixed vmalert ingress name typo
- Added ability to override default Prometheus-compatible datatasources with all available parameters. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/860).
- Do not use `grafana.dashboards` and `grafana.dashboardProviders`. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1312).
## 0.25.8
**Release date:** 2024-08-30
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fixed external notifiers rendering, when alertmanager is disabled. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1378)
## 0.25.7
**Release date:** 2024-08-30
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fixed extra rules template context
## 0.25.6
**Release date:** 2024-08-29
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
**Update note**: Update `kubeProxy.spec` to `kubeProxy.vmScrape.spec`
**Update note**: Update `kubeScheduler.spec` to `kubeScheduler.vmScrape.spec`
**Update note**: Update `kubeEtcd.spec` to `kubeEtcd.vmScrape.spec`
**Update note**: Update `coreDns.spec` to `coreDns.vmScrape.spec`
**Update note**: Update `kubeDns.spec` to `kubeDns.vmScrape.spec`
**Update note**: Update `kubeProxy.spec` to `kubeProxy.vmScrape.spec`
**Update note**: Update `kubeControllerManager.spec` to `kubeControllerManager.vmScrape.spec`
**Update note**: Update `kubeApiServer.spec` to `kubeApiServer.vmScrape.spec`
**Update note**: Update `kubelet.spec` to `kubelet.vmScrape.spec`
**Update note**: Update `kube-state-metrics.spec` to `kube-state-metrics.vmScrape.spec`
**Update note**: Update `prometheus-node-exporter.spec` to `prometheus-node-exporter.vmScrape.spec`
**Update note**: Update `grafana.spec` to `grafana.vmScrape.spec`
- bump version of VM components to [v1.103.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.103.0)
- Added `dashboards.<dashboardName>` bool flag to enable dashboard even if component it is for is not installed.
- Allow extra `vmalert.notifiers` without dropping default notifier if `alertmanager.enabled: true`
- Do not drop default notifier, when vmalert.additionalNotifierConfigs is set
- Replaced static url proto with a template, which selects proto depending on a present tls configuration
- Moved kubernetes components monitoring config from `spec` config to `vmScrape.spec`
- Merged servicemonitor templates
## 0.25.5
**Release date:** 2024-08-26
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- TODO
## 0.25.4
**Release date:** 2024-08-26
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- updates operator to [v0.47.2](https://github.com/VictoriaMetrics/operator/releases/tag/v0.47.2)
- kube-state-metrics - 5.16.4 -> 5.25.1
- prometheus-node-exporter - 4.27.0 -> 4.29.0
- grafana - 8.3.8 -> 8.4.7
- added configurable `.Values.global.clusterLabel` to all alerting and recording rules `by` and `on` expressions
## 0.25.3
**Release date:** 2024-08-23
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- updated operator to v0.47.1 release
- Build `app.kubernetes.io/instance` label consistently. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1282)
## 0.25.2
**Release date:** 2024-08-21
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fixed vmalert ingress name. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1271)
- fixed alertmanager ingress host template rendering. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1270)
## 0.25.1
**Release date:** 2024-08-21
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Added `.Values.global.license` configuration
- Fixed extraLabels rendering. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1248)
- Fixed vmalert url to alertmanager by including its path prefix
- Removed `networking.k8s.io/v1beta1/Ingress` and `extensions/v1beta1/Ingress` support
- Fixed kubedns servicemonitor template. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1255)
## 0.25.0
**Release date:** 2024-08-16
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
**Update note**: it requires to update CRD dependency manually before upgrade
**Update note**: requires Helm 3.14+
- Moved dashboards templating logic out of sync script to Helm template
- Allow to disable default grafana datasource
- Synchronize Etcd dashboards and rules with mixin provided by Etcd
- Add alerting rules for VictoriaMetrics operator.
- Updated alerting rules for VictoriaMetrics components.
- Fixed exact rule annotations propagation to other rules.
- Set minimal kubernetes version to 1.25
- updates operator to v0.47.0 version
## 0.24.5
**Release date:** 2024-08-01
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.102.1](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.102.1)
## 0.24.4
**Release date:** 2024-08-01
![AppVersion: v1.102.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update dependencies: grafana -> 8.3.6.
- Added `.Values.defaultRules.alerting` and `.Values.defaultRules.recording` to setup common properties for all alerting an recording rules
## 0.24.3
**Release date:** 2024-07-23
![AppVersion: v1.102.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.102.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.102.0)
## 0.24.2
**Release date:** 2024-07-15
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fix vmalertmanager configuration when using `.VMAlertmanagerSpec.ConfigRawYaml`. See [this pull request](https://github.com/VictoriaMetrics/helm-charts/pull/1136).
## 0.24.1
**Release date:** 2024-07-10
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- updates operator to v0.46.4
## 0.24.0
**Release date:** 2024-07-10
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- added ability to override alerting rules labels and annotations:
- globally - `.Values.defaultRules.rule.spec.labels` (before it was `.Values.defaultRules.additionalRuleLabels`) and `.Values.defaultRules.rule.spec.annotations`
- for all rules in a group - `.Values.defaultRules.groups.<groupName>.rules.spec.labels` and `.Valeus.defaultRules.groups.<groupName>.rules.spec.annotations`
- for each rule individually - `.Values.defaultRules.rules.<ruleName>.spec.labels` and `.Values.defaultRules.rules.<ruleName>.spec.annotations`
- changed `.Values.defaultRules.rules.<groupName>` to `.Values.defaultRules.groups.<groupName>.create`
- changed `.Values.defaultRules.appNamespacesTarget` to `.Values.defaultRules.groups.<groupName>.targetNamespace`
- changed `.Values.defaultRules.params` to `.Values.defaultRules.group.spec.params` with ability to override it at `.Values.defaultRules.groups.<groupName>.spec.params`
## 0.23.6
**Release date:** 2024-07-08
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- added ability to override alerting rules labels and annotations:
- globally - `.Values.defaultRules.rule.spec.labels` (before it was `.Values.defaultRules.additionalRuleLabels`) and `.Values.defaultRules.rule.spec.annotations`
- for all rules in a group - `.Values.defaultRules.groups.<groupName>.rules.spec.labels` and `.Valeus.defaultRules.groups.<groupName>.rules.spec.annotations`
- for each rule individually - `.Values.defaultRules.rules.<ruleName>.spec.labels` and `.Values.defaultRules.rules.<ruleName>.spec.annotations`
- changed `.Values.defaultRules.rules.<groupName>` to `.Values.defaultRules.groups.<groupName>.create`
- changed `.Values.defaultRules.appNamespacesTarget` to `.Values.defaultRules.groups.<groupName>.targetNamespace`
- changed `.Values.defaultRules.params` to `.Values.defaultRules.group.spec.params` with ability to override it at `.Values.defaultRules.groups.<groupName>.spec.params`
## 0.23.5
**Release date:** 2024-07-04
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Support configuring vmalert `-notifier.config` with `.Values.vmalert.additionalNotifierConfigs`.
## 0.23.4
**Release date:** 2024-07-02
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Add `extraObjects` to allow deploying additional resources with the chart release.
## 0.23.3
**Release date:** 2024-06-26
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Enable [conversion of Prometheus CRDs](https://docs.victoriametrics.com/operator/migration/#objects-conversion) by default. See [this](https://github.com/VictoriaMetrics/helm-charts/pull/1069) pull request for details.
- use bitnami/kubectl image for cleanup instead of deprecated gcr.io/google_containers/hyperkube
## 0.23.2
**Release date:** 2024-06-14
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Do not add `cluster` external label at VMAgent by default. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/774) for the details.
## 0.23.1
**Release date:** 2024-06-10
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- updates operator to v0.45.0 release
- sync latest vm alerts and dashboards.
## 0.23.0
**Release date:** 2024-05-30
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- sync latest etcd v3.5.x rules from [upstream](https://github.com/etcd-io/etcd/blob/release-3.5/contrib/mixin/mixin.libsonnet).
- add Prometheus operator CRDs as an optional dependency. See [this PR](https://github.com/VictoriaMetrics/helm-charts/pull/1022) and [related issue](https://github.com/VictoriaMetrics/helm-charts/issues/341) for the details.
## 0.22.1
**Release date:** 2024-05-14
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- fix missing serviceaccounts patch permission in VM operator, see [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/1012) for details.
## 0.22.0
**Release date:** 2024-05-10
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM operator to [0.44.0](https://github.com/VictoriaMetrics/operator/releases/tag/v0.44.0)
## 0.21.3
**Release date:** 2024-04-26
![AppVersion: v1.101.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.101.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.101.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.101.0)
## 0.21.2
**Release date:** 2024-04-23
![AppVersion: v1.100.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.100.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM operator to [0.43.3](https://github.com/VictoriaMetrics/operator/releases/tag/v0.43.3)
## 0.21.1
**Release date:** 2024-04-18
![AppVersion: v1.100.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.100.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
## 0.21.0
**Release date:** 2024-04-18
![AppVersion: v1.100.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.100.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- TODO
- bump version of VM operator to [0.43.0](https://github.com/VictoriaMetrics/operator/releases/tag/v0.43.0)
- updates CRDs definitions.
## 0.20.1
**Release date:** 2024-04-16
![AppVersion: v1.100.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.100.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- upgraded dashboards and alerting rules, added values file for local (Minikube) setup
- bump version of VM components to [v1.100.1](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.100.1)
## 0.20.0
**Release date:** 2024-04-02
![AppVersion: v1.99.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.99.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM operator to [0.42.3](https://github.com/VictoriaMetrics/operator/releases/tag/v0.42.3)
## 0.19.4
**Release date:** 2024-03-05
![AppVersion: v1.99.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.99.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.99.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.99.0)
## 0.19.3
**Release date:** 2024-03-05
![AppVersion: v1.98.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.98.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Commented default configuration for alertmanager. It simplifies configuration and makes it more explicit. See this [issue](https://github.com/VictoriaMetrics/helm-charts/issues/473) for details.
- Allow enabling/disabling default k8s rules when installing. See [#904](https://github.com/VictoriaMetrics/helm-charts/pull/904) by @passie.
## 0.19.2
**Release date:** 2024-02-26
![AppVersion: v1.98.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.98.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Fix templating of VMAgent `remoteWrite` in case both `VMSingle` and `VMCluster` are disabled. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/865) for details.
## 0.19.1
**Release date:** 2024-02-21
![AppVersion: v1.98.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.98.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update dependencies: victoria-metrics-operator -> 0.28.1, grafana -> 7.3.1.
- Update victoriametrics CRD resources yaml.
## 0.19.0
**Release date:** 2024-02-09
![AppVersion: v1.97.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.97.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Do not store original labels in `vmagent`'s memory by default. This reduces memory usage of `vmagent` but makes `vmagent`'s debugging UI less informative. See [this docs](https://docs.victoriametrics.com/vmagent/#relabel-debug) for details on relabeling debug.
- Update dependencies: kube-state-metrics -> 5.16.0, prometheus-node-exporter -> 4.27.0, grafana -> 7.3.0.
- Update victoriametrics CRD resources yaml.
- Update builtin dashboards and rules.
## 0.18.12
**Release date:** 2024-02-01
![AppVersion: v1.97.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.97.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.97.1](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.97.1)
- Fix helm lint when ingress resources enabled - split templates of resources per kind. See [#820](https://github.com/VictoriaMetrics/helm-charts/pull/820) by @MemberIT.
## 0.18.11
**Release date:** 2023-12-15
![AppVersion: v1.96.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.96.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Fix missing `.Values.defaultRules.rules.vmcluster` value. See [#801](https://github.com/VictoriaMetrics/helm-charts/pull/801) by @MemberIT.
## 0.18.10
**Release date:** 2023-12-12
![AppVersion: v1.96.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.96.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.96.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.96.0)
- Add optional allowCrossNamespaceImport to GrafanaDashboard(s) (#788)
## 0.18.9
**Release date:** 2023-12-08
![AppVersion: v1.95.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.95.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Properly use variable from values file for Grafana datasource type. (#769)
- Update dashboards from upstream sources. (#780)
## 0.18.8
**Release date:** 2023-11-16
![AppVersion: v1.95.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.95.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.95.1](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.95.1)
## 0.18.7
**Release date:** 2023-11-15
![AppVersion: v1.95.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.95.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.95.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.95.0)
- Support adding extra group parameters for default vmrules. (#752)
## 0.18.6
**Release date:** 2023-11-01
![AppVersion: v1.94.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.94.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Fix kube scheduler default scraping port from 10251 to 10259, Kubernetes changed it since 1.23.0. See [this pr](https://github.com/VictoriaMetrics/helm-charts/pull/736) for details.
- Bump version of operator chart to [0.27.4](https://github.com/VictoriaMetrics/helm-charts/releases/tag/victoria-metrics-operator-0.27.4)
## 0.18.5
**Release date:** 2023-10-08
![AppVersion: v1.94.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.94.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update operator chart to [v0.27.3](https://github.com/VictoriaMetrics/helm-charts/releases/tag/victoria-metrics-operator-0.27.3) for fixing [#708](https://github.com/VictoriaMetrics/helm-charts/issues/708)
## 0.18.4
**Release date:** 2023-10-04
![AppVersion: v1.94.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.94.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update dependencies: [victoria-metrics-operator -> 0.27.2](https://github.com/VictoriaMetrics/helm-charts/releases/tag/victoria-metrics-operator-0.27.2), prometheus-node-exporter -> 4.23.2, grafana -> 6.59.5.
## 0.18.3
**Release date:** 2023-10-04
![AppVersion: v1.94.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.94.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- bump version of VM components to [v1.94.0](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.94.0)
## 0.18.2
**Release date:** 2023-09-28
![AppVersion: v1.93.5](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.5&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Fix behavior of `vmalert.remoteWriteVMAgent` - remoteWrite.url for VMAlert is correctly generated considering endpoint, name, port and http.pathPrefix of VMAgent
## 0.18.1
**Release date:** 2023-09-21
![AppVersion: v1.93.5](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.5&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Bump version of VM components to [v1.93.5](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.93.5)
## 0.18.0
**Release date:** 2023-09-12
![AppVersion: v1.93.4](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.4&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Bump version of `grafana` helm-chart to `6.59.*`
- Bump version of `prometheus-node-exporter` helm-chart to `4.23.*`
- Bump version of `kube-state-metrics` helm-chart to `0.59.*`
- Update alerting rules
- Update grafana dashboards
- Add `make` commands `sync-rules` and `sync-dashboards`
- Add support of VictoriaMetrics datasource
## 0.17.8
**Release date:** 2023-09-11
![AppVersion: v1.93.4](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.4&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Bump version of VM components to [v1.93.4](https://github.com/VictoriaMetrics/VictoriaMetrics/releases/tag/v1.93.4)
- Bump version of operator chart to [0.27.0](https://github.com/VictoriaMetrics/helm-charts/releases/tag/victoria-metrics-operator-0.27.0)
## 0.17.7
**Release date:** 2023-09-07
![AppVersion: v1.93.3](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.3&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Bump version of operator helm-chart to `0.26.2`
## 0.17.6
**Release date:** 2023-09-04
![AppVersion: v1.93.3](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.3&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Move `cleanupCRD` option to victoria-metrics-operator chart (#593)
- Disable `honorTimestamps` for cadvisor scrape job by default (#617)
- For vmalert all replicas of alertmanager are added to notifiers (only if alertmanager is enabled) (#619)
- Add `grafanaOperatorDashboardsFormat` option (#615)
- Fix query expression for memory calculation in `k8s-views-global` dashboard (#636)
- Bump version of Victoria Metrics components to `v1.93.3`
- Bump version of operator helm-chart to `0.26.0`
## 0.17.5
**Release date:** 2023-08-23
![AppVersion: v1.93.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update VictoriaMetrics components from v1.93.0 to v1.93.1
## 0.17.4
**Release date:** 2023-08-12
![AppVersion: v1.93.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.93.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update VictoriaMetrics components from v1.92.1 to v1.93.0
- delete an obsolete parameter remaining by mistake (see <https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-k8s-stack#upgrade-to-0130>) (#602)
## 0.17.3
**Release date:** 2023-07-28
![AppVersion: v1.92.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.92.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update VictoriaMetrics components from v1.92.0 to v1.92.1 (#599)
## 0.17.2
**Release date:** 2023-07-27
![AppVersion: v1.92.0](https://img.shields.io/static/v1?label=AppVersion&message=v1.92.0&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Update VictoriaMetrics components from v1.91.3 to v1.92.0

View File

@@ -1,24 +0,0 @@
dependencies:
- name: victoria-metrics-common
repository: https://victoriametrics.github.io/helm-charts
version: 0.0.11
- name: victoria-metrics-operator
repository: https://victoriametrics.github.io/helm-charts
version: 0.34.8
- name: kube-state-metrics
repository: https://prometheus-community.github.io/helm-charts
version: 5.25.1
- name: prometheus-node-exporter
repository: https://prometheus-community.github.io/helm-charts
version: 4.39.0
- name: grafana
repository: https://grafana.github.io/helm-charts
version: 8.4.9
- name: crds
repository: ""
version: 0.0.0
- name: prometheus-operator-crds
repository: https://prometheus-community.github.io/helm-charts
version: 11.0.0
digest: sha256:11b119ebabf4ff0ea2951e7c72f51d0223dc3f50fb061a43b01fe7856491b836
generated: "2024-09-12T11:50:51.935071545Z"

View File

@@ -1,66 +0,0 @@
annotations:
artifacthub.io/category: monitoring-logging
artifacthub.io/changes: |
- Added VMAuth to k8s stack. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/829)
- Fixed ETCD dashboard
- Use path prefix from args as a default path prefix for ingress. Related [issue](https://github.com/VictoriaMetrics/helm-charts/issues/1260)
- 'Allow using vmalert without notifiers configuration. Note that it is required to use `.vmalert.spec.extraArgs["notifiers.blackhole"]: true` in order to start vmalert with a blackhole configuration.'
artifacthub.io/license: Apache-2.0
artifacthub.io/links: |
- name: Sources
url: https://docs.victoriametrics.com/vmgateway
- name: Charts repo
url: https://victoriametrics.github.io/helm-charts/
- name: Docs
url: https://docs.victoriametrics.com
artifacthub.io/operator: "true"
apiVersion: v2
appVersion: v1.102.1
dependencies:
- name: victoria-metrics-common
repository: https://victoriametrics.github.io/helm-charts
version: 0.0.*
- condition: victoria-metrics-operator.enabled
name: victoria-metrics-operator
repository: https://victoriametrics.github.io/helm-charts
version: 0.34.*
- condition: kube-state-metrics.enabled
name: kube-state-metrics
repository: https://prometheus-community.github.io/helm-charts
version: 5.25.*
- condition: prometheus-node-exporter.enabled
name: prometheus-node-exporter
repository: https://prometheus-community.github.io/helm-charts
version: 4.39.*
- condition: grafana.enabled
name: grafana
repository: https://grafana.github.io/helm-charts
version: 8.4.*
- condition: crds.enabled
name: crds
repository: ""
version: 0.0.0
- condition: prometheus-operator-crds.enabled
name: prometheus-operator-crds
repository: https://prometheus-community.github.io/helm-charts
version: 11.0.*
description: Kubernetes monitoring on VictoriaMetrics stack. Includes VictoriaMetrics
Operator, Grafana dashboards, ServiceScrapes and VMRules
home: https://github.com/VictoriaMetrics/helm-charts
icon: https://avatars.githubusercontent.com/u/43720803?s=200&v=4
keywords:
- victoriametrics
- operator
- monitoring
- kubernetes
- observability
- tsdb
- metrics
- metricsql
- timeseries
kubeVersion: '>=1.25.0-0'
name: victoria-metrics-k8s-stack
sources:
- https://github.com/VictoriaMetrics/helm-charts
type: application
version: 0.25.17

View File

@@ -1,300 +0,0 @@
{{ template "chart.typeBadge" . }} {{ template "chart.versionBadge" . }}
[![Artifact Hub](https://img.shields.io/endpoint?url=https://artifacthub.io/badge/repository/victoriametrics)](https://artifacthub.io/packages/helm/victoriametrics/victoria-metrics-k8s-stack)
{{ template "chart.description" . }}
* [Overview](#Overview)
* [Configuration](#Configuration)
* [Prerequisites](#Prerequisites)
* [Dependencies](#Dependencies)
* [Quick Start](#How-to-install)
* [Uninstall](#How-to-uninstall)
* [Version Upgrade](#Upgrade-guide)
* [Troubleshooting](#Troubleshooting)
* [Values](#Parameters)
## Overview
This chart is an All-in-one solution to start monitoring kubernetes cluster.
It installs multiple dependency charts like [grafana](https://github.com/grafana/helm-charts/tree/main/charts/grafana), [node-exporter](https://github.com/prometheus-community/helm-charts/tree/main/charts/prometheus-node-exporter), [kube-state-metrics](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-state-metrics) and [victoria-metrics-operator](https://github.com/VictoriaMetrics/helm-charts/tree/master/charts/victoria-metrics-operator).
Also it installs Custom Resources like [VMSingle](https://docs.victoriametrics.com/operator/quick-start#vmsingle), [VMCluster](https://docs.victoriametrics.com/operator/quick-start#vmcluster), [VMAgent](https://docs.victoriametrics.com/operator/quick-start#vmagent), [VMAlert](https://docs.victoriametrics.com/operator/quick-start#vmalert).
By default, the operator [converts all existing prometheus-operator API objects](https://docs.victoriametrics.com/operator/quick-start#migration-from-prometheus-operator-objects) into corresponding VictoriaMetrics Operator objects.
To enable metrics collection for kubernetes this chart installs multiple scrape configurations for kuberenetes components like kubelet and kube-proxy, etc. Metrics collection is done by [VMAgent](https://docs.victoriametrics.com/operator/quick-start#vmagent). So if want to ship metrics to external VictoriaMetrics database you can disable VMSingle installation by setting `vmsingle.enabled` to `false` and setting `vmagent.vmagentSpec.remoteWrite.url` to your external VictoriaMetrics database.
This chart also installs bunch of dashboards and recording rules from [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus) project.
![Overview](img/k8s-stack-overview.png)
## Configuration
Configuration of this chart is done through helm values.
### Dependencies
Dependencies can be enabled or disabled by setting `enabled` to `true` or `false` in `values.yaml` file.
**!Important:** for dependency charts anything that you can find in values.yaml of dependency chart can be configured in this chart under key for that dependency. For example if you want to configure `grafana` you can find all possible configuration options in [values.yaml](https://github.com/grafana/helm-charts/blob/main/charts/grafana/values.yaml) and you should set them in values for this chart under grafana: key. For example if you want to configure `grafana.persistence.enabled` you should set it in values.yaml like this:
```yaml
#################################################
### dependencies #####
#################################################
# Grafana dependency chart configuration. For possible values refer to https://github.com/grafana/helm-charts/tree/main/charts/grafana#configuration
grafana:
enabled: true
persistence:
type: pvc
enabled: false
```
### VictoriaMetrics components
This chart installs multiple VictoriaMetrics components using Custom Resources that are managed by [victoria-metrics-operator](https://docs.victoriametrics.com/operator/design)
Each resource can be configured using `spec` of that resource from API docs of [victoria-metrics-operator](https://docs.victoriametrics.com/operator/api). For example if you want to configure `VMAgent` you can find all possible configuration options in [API docs](https://docs.victoriametrics.com/operator/api#vmagent) and you should set them in values for this chart under `vmagent.spec` key. For example if you want to configure `remoteWrite.url` you should set it in values.yaml like this:
```yaml
vmagent:
spec:
remoteWrite:
- url: "https://insert.vmcluster.domain.com/insert/0/prometheus/api/v1/write"
```
### ArgoCD issues
#### Operator self signed certificates
When deploying K8s stack using ArgoCD without Cert Manager (`.Values.victoria-metrics-operator.admissionWebhooks.certManager.enabled: false`)
it will rerender operator's webhook certificates on each sync since Helm `lookup` function is not respected by ArgoCD.
To prevent this please update you K8s stack Application `spec.syncPolicy` and `spec.ignoreDifferences` with a following:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
...
spec:
...
syncPolicy:
syncOptions:
# https://argo-cd.readthedocs.io/en/stable/user-guide/sync-options/#respect-ignore-difference-configs
# argocd must also ignore difference during apply stage
# otherwise it ll silently override changes and cause a problem
- RespectIgnoreDifferences=true
ignoreDifferences:
- group: ""
kind: Secret
name: <fullname>-validation
namespace: kube-system
jsonPointers:
- /data
- group: admissionregistration.k8s.io
kind: ValidatingWebhookConfiguration
name: <fullname>-admission
jqPathExpressions:
- '.webhooks[]?.clientConfig.caBundle'
```
where `<fullname>` is output of `{{"{{"}} include "vm-operator.fullname" {{"}}"}}` for your setup
#### `metadata.annotations: Too long: must have at most 262144 bytes` on dashboards
If one of dashboards ConfigMap is failing with error `Too long: must have at most 262144 bytes`, please make sure you've added `argocd.argoproj.io/sync-options: ServerSideApply=true` annotation to your dashboards:
```yaml
grafana:
sidecar:
dashboards:
additionalDashboardAnnotations
argocd.argoproj.io/sync-options: ServerSideApply=true
```
argocd.argoproj.io/sync-options: ServerSideApply=true
### Rules and dashboards
This chart by default install multiple dashboards and recording rules from [kube-prometheus](https://github.com/prometheus-operator/kube-prometheus)
you can disable dashboards with `defaultDashboardsEnabled: false` and `experimentalDashboardsEnabled: false`
and rules can be configured under `defaultRules`
### Prometheus scrape configs
This chart installs multiple scrape configurations for kubernetes monitoring. They are configured under `#ServiceMonitors` section in `values.yaml` file. For example if you want to configure scrape config for `kubelet` you should set it in values.yaml like this:
```yaml
kubelet:
enabled: true
# spec for VMNodeScrape crd
# https://docs.victoriametrics.com/operator/api#vmnodescrapespec
spec:
interval: "30s"
```
### Using externally managed Grafana
If you want to use an externally managed Grafana instance but still want to use the dashboards provided by this chart you can set
`grafana.enabled` to `false` and set `defaultDashboardsEnabled` to `true`. This will install the dashboards
but will not install Grafana.
For example:
```yaml
defaultDashboardsEnabled: true
grafana:
enabled: false
```
This will create ConfigMaps with dashboards to be imported into Grafana.
If additional configuration for labels or annotations is needed in order to import dashboard to an existing Grafana you can
set `.grafana.sidecar.dashboards.additionalDashboardLabels` or `.grafana.sidecar.dashboards.additionalDashboardAnnotations` in `values.yaml`:
For example:
```yaml
defaultDashboardsEnabled: true
grafana:
enabled: false
sidecar:
dashboards:
additionalDashboardLabels:
key: value
additionalDashboardAnnotations:
key: value
```
## Prerequisites
* Install the follow packages: ``git``, ``kubectl``, ``helm``, ``helm-docs``. See this [tutorial](../../REQUIREMENTS.md).
* Add dependency chart repositories
```console
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
```
* PV support on underlying infrastructure.
{{ include "chart.installSection" . }}
### Install locally (Minikube)
To run VictoriaMetrics stack locally it's possible to use [Minikube](https://github.com/kubernetes/minikube). To avoid dashboards and alert rules issues please follow the steps below:
Run Minikube cluster
```
minikube start --container-runtime=containerd --extra-config=scheduler.bind-address=0.0.0.0 --extra-config=controller-manager.bind-address=0.0.0.0
```
Install helm chart
```
helm install [RELEASE_NAME] vm/victoria-metrics-k8s-stack -f values.yaml -f values.minikube.yaml -n NAMESPACE --debug --dry-run
```
{{ include "chart.uninstallSection" . }}
CRDs created by this chart are not removed by default and should be manually cleaned up:
```console
kubectl get crd | grep victoriametrics.com | awk '{print $1 }' | xargs -i kubectl delete crd {}
```
## Troubleshooting
- If you cannot install helm chart with error `configmap already exist`. It could happen because of name collisions, if you set too long release name.
Kubernetes by default, allows only 63 symbols at resource names and all resource names are trimmed by helm to 63 symbols.
To mitigate it, use shorter name for helm chart release name, like:
```bash
# stack - is short enough
helm upgrade -i stack vm/victoria-metrics-k8s-stack
```
Or use override for helm chart release name:
```bash
helm upgrade -i some-very-long-name vm/victoria-metrics-k8s-stack --set fullnameOverride=stack
```
## Upgrade guide
Usually, helm upgrade doesn't requires manual actions. Just execute command:
```console
$ helm upgrade [RELEASE_NAME] vm/victoria-metrics-k8s-stack
```
But release with CRD update can only be patched manually with kubectl.
Since helm does not perform a CRD update, we recommend that you always perform this when updating the helm-charts version:
```console
# 1. check the changes in CRD
$ helm show crds vm/victoria-metrics-k8s-stack --version [YOUR_CHART_VERSION] | kubectl diff -f -
# 2. apply the changes (update CRD)
$ helm show crds vm/victoria-metrics-k8s-stack --version [YOUR_CHART_VERSION] | kubectl apply -f - --server-side
```
All other manual actions upgrades listed below:
### Upgrade to 0.13.0
- node-exporter starting from version 4.0.0 is using the Kubernetes recommended labels. Therefore you have to delete the daemonset before you upgrade.
```bash
kubectl delete daemonset -l app=prometheus-node-exporter
```
- scrape configuration for kubernetes components was moved from `vmServiceScrape.spec` section to `spec` section. If you previously modified scrape configuration you need to update your `values.yaml`
- `grafana.defaultDashboardsEnabled` was renamed to `defaultDashboardsEnabled` (moved to top level). You may need to update it in your `values.yaml`
### Upgrade to 0.6.0
All `CRD` must be update to the lastest version with command:
```bash
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/helm-charts/master/charts/victoria-metrics-k8s-stack/crds/crd.yaml
```
### Upgrade to 0.4.0
All `CRD` must be update to `v1` version with command:
```bash
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/helm-charts/master/charts/victoria-metrics-k8s-stack/crds/crd.yaml
```
### Upgrade from 0.2.8 to 0.2.9
Update `VMAgent` crd
command:
```bash
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.16.0/config/crd/bases/operator.victoriametrics.com_vmagents.yaml
```
### Upgrade from 0.2.5 to 0.2.6
New CRD added to operator - `VMUser` and `VMAuth`, new fields added to exist crd.
Manual commands:
```bash
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.15.0/config/crd/bases/operator.victoriametrics.com_vmusers.yaml
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.15.0/config/crd/bases/operator.victoriametrics.com_vmauths.yaml
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.15.0/config/crd/bases/operator.victoriametrics.com_vmalerts.yaml
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.15.0/config/crd/bases/operator.victoriametrics.com_vmagents.yaml
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.15.0/config/crd/bases/operator.victoriametrics.com_vmsingles.yaml
kubectl apply -f https://raw.githubusercontent.com/VictoriaMetrics/operator/v0.15.0/config/crd/bases/operator.victoriametrics.com_vmclusters.yaml
```
{{ include "chart.helmDocs" . }}
## Parameters
The following tables lists the configurable parameters of the chart and their default values.
Change the values according to the need of the environment in ``victoria-metrics-k8s-stack/values.yaml`` file.
{{ template "chart.valuesTableHtml" . }}

View File

@@ -1,40 +0,0 @@
# Release process guidance
## Update version for VictoriaMetrics kubernetes monitoring stack
1. Update dependency requirements in [Chart.yml](https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-metrics-k8s-stack/Chart.yaml)
2. Apply changes via `helm dependency update`
3. Update image tag in chart values:
<div class="with-copy" markdown="1">
```console
make sync-rules
make sync-dashboards
```
</div>
4. Bump version of the victoria-metrics-k8s-stack [Chart.yml](https://github.com/VictoriaMetrics/helm-charts/blob/master/charts/victoria-metrics-k8s-stack/Chart.yaml)
5. Run linter:
<div class="with-copy" markdown="1">
```console
make lint
```
</div>
6. Render templates locally to check for errors:
<div class="with-copy" markdown="1">
```console
helm template vm-k8s-stack ./charts/victoria-metrics-k8s-stack --output-dir out --values ./charts/victoria-metrics-k8s-stack/values.yaml --debug
```
</div>
7. Test updated chart by installing it to your kubernetes cluster.
8. Update docs with
```console
helm-docs
```
9. Commit the changes and send a [PR](https://github.com/VictoriaMetrics/helm-charts/pulls)

View File

@@ -1,12 +0,0 @@
# Release notes for version 0.25.17
**Release date:** 2024-09-20
![AppVersion: v1.102.1](https://img.shields.io/static/v1?label=AppVersion&message=v1.102.1&color=success&logo=)
![Helm: v3](https://img.shields.io/static/v1?label=Helm&message=v3&color=informational&logo=helm)
- Added VMAuth to k8s stack. See [this issue](https://github.com/VictoriaMetrics/helm-charts/issues/829)
- Fixed ETCD dashboard
- Use path prefix from args as a default path prefix for ingress. Related [issue](https://github.com/VictoriaMetrics/helm-charts/issues/1260)
- Allow using vmalert without notifiers configuration. Note that it is required to use `.vmalert.spec.extraArgs["notifiers.blackhole"]: true` in order to start vmalert with a blackhole configuration.

View File

@@ -1,13 +0,0 @@
---
weight: 1
title: CHANGELOG
menu:
docs:
weight: 1
identifier: helm-victoriametrics-k8s-stack-changelog
parent: helm-victoriametrics-k8s-stack
url: /helm/victoriametrics-k8s-stack/changelog
aliases:
- /helm/victoriametrics-k8s-stack/changelog/index.html
---
{{% content "CHANGELOG.md" %}}

View File

@@ -1,13 +0,0 @@
---
weight: 9
title: VictoriaMetrics K8s Stack
menu:
docs:
parent: helm
weight: 9
identifier: helm-victoriametrics-k8s-stack
url: /helm/victoriametrics-k8s-stack
aliases:
- /helm/victoriametrics-k8s-stack/index.html
---
{{% content "README.md" %}}

View File

@@ -1,165 +0,0 @@
condition: '{{ .Values.kubeEtcd.enabled }}'
name: etcd
rules:
- alert: etcdMembersDown
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": members are down ({{`{{`}} $value {{`}}`}}).'
summary: 'etcd cluster members are down.'
condition: '{{ true }}'
expr: |-
max without (endpoint) (
sum without (instance) (up{job=~".*etcd.*"} == bool 0)
or
count without (To) (
sum without (instance) (rate(etcd_network_peer_sent_failures_total{job=~".*etcd.*"}[120s])) > 0.01
)
)
> 0
for: 10m
labels:
severity: critical
- alert: etcdInsufficientMembers
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": insufficient members ({{`{{`}} $value {{`}}`}}).'
summary: 'etcd cluster has insufficient number of members.'
condition: '{{ true }}'
expr: sum(up{job=~".*etcd.*"} == bool 1) without (instance) < ((count(up{job=~".*etcd.*"}) without (instance) + 1) / 2)
for: 3m
labels:
severity: critical
- alert: etcdNoLeader
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": member {{`{{`}} $labels.instance {{`}}`}} has no leader.'
summary: 'etcd cluster has no leader.'
condition: '{{ true }}'
expr: etcd_server_has_leader{job=~".*etcd.*"} == 0
for: 1m
labels:
severity: critical
- alert: etcdHighNumberOfLeaderChanges
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": {{`{{`}} $value {{`}}`}} leader changes within the last 15 minutes. Frequent elections may be a sign of insufficient resources, high network latency, or disruptions by other components and should be investigated.'
summary: 'etcd cluster has high number of leader changes.'
condition: '{{ true }}'
expr: increase((max without (instance) (etcd_server_leader_changes_seen_total{job=~".*etcd.*"}) or 0*absent(etcd_server_leader_changes_seen_total{job=~".*etcd.*"}))[15m:1m]) >= 4
for: 5m
labels:
severity: warning
- alert: etcdHighNumberOfFailedGRPCRequests
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": {{`{{`}} $value {{`}}`}}% of requests for {{`{{`}} $labels.grpc_method {{`}}`}} failed on etcd instance {{`{{`}} $labels.instance {{`}}`}}.'
summary: 'etcd cluster has high number of failed grpc requests.'
condition: '{{ true }}'
expr: |-
100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
/
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
> 1
for: 10m
labels:
severity: warning
- alert: etcdHighNumberOfFailedGRPCRequests
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": {{`{{`}} $value {{`}}`}}% of requests for {{`{{`}} $labels.grpc_method {{`}}`}} failed on etcd instance {{`{{`}} $labels.instance {{`}}`}}.'
summary: 'etcd cluster has high number of failed grpc requests.'
condition: '{{ true }}'
expr: |-
100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code)
/
sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) without (grpc_type, grpc_code)
> 5
for: 5m
labels:
severity: critical
- alert: etcdGRPCRequestsSlow
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": 99th percentile of gRPC requests is {{`{{`}} $value {{`}}`}}s on etcd instance {{`{{`}} $labels.instance {{`}}`}} for {{`{{`}} $labels.grpc_method {{`}}`}} method.'
summary: 'etcd grpc requests are slow'
condition: '{{ true }}'
expr: |-
histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job=~".*etcd.*", grpc_method!="Defragment", grpc_type="unary"}[5m])) without(grpc_type))
> 0.15
for: 10m
labels:
severity: critical
- alert: etcdMemberCommunicationSlow
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": member communication with {{`{{`}} $labels.To {{`}}`}} is taking {{`{{`}} $value {{`}}`}}s on etcd instance {{`{{`}} $labels.instance {{`}}`}}.'
summary: 'etcd cluster member communication is slow.'
condition: '{{ true }}'
expr: |-
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.15
for: 10m
labels:
severity: warning
- alert: etcdHighNumberOfFailedProposals
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": {{`{{`}} $value {{`}}`}} proposal failures within the last 30 minutes on etcd instance {{`{{`}} $labels.instance {{`}}`}}.'
summary: 'etcd cluster has high number of proposal failures.'
condition: '{{ true }}'
expr: rate(etcd_server_proposals_failed_total{job=~".*etcd.*"}[15m]) > 5
for: 15m
labels:
severity: warning
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": 99th percentile fsync durations are {{`{{`}} $value {{`}}`}}s on etcd instance {{`{{`}} $labels.instance {{`}}`}}.'
summary: 'etcd cluster 99th percentile fsync durations are too high.'
condition: '{{ true }}'
expr: |-
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.5
for: 10m
labels:
severity: warning
- alert: etcdHighFsyncDurations
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": 99th percentile fsync durations are {{`{{`}} $value {{`}}`}}s on etcd instance {{`{{`}} $labels.instance {{`}}`}}.'
summary: 'etcd cluster 99th percentile fsync durations are too high.'
condition: '{{ true }}'
expr: |-
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 1
for: 10m
labels:
severity: critical
- alert: etcdHighCommitDurations
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": 99th percentile commit durations {{`{{`}} $value {{`}}`}}s on etcd instance {{`{{`}} $labels.instance {{`}}`}}.'
summary: 'etcd cluster 99th percentile commit durations are too high.'
condition: '{{ true }}'
expr: |-
histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket{job=~".*etcd.*"}[5m]))
> 0.25
for: 10m
labels:
severity: warning
- alert: etcdDatabaseQuotaLowSpace
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": database size exceeds the defined quota on etcd instance {{`{{`}} $labels.instance {{`}}`}}, please defrag or increase the quota as the writes to etcd will be disabled when it is full.'
summary: 'etcd cluster database is running full.'
condition: '{{ true }}'
expr: (last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m]) / last_over_time(etcd_server_quota_backend_bytes{job=~".*etcd.*"}[5m]))*100 > 95
for: 10m
labels:
severity: critical
- alert: etcdExcessiveDatabaseGrowth
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": Predicting running out of disk space in the next four hours, based on write observations within the past four hours on etcd instance {{`{{`}} $labels.instance {{`}}`}}, please check as it might be disruptive.'
summary: 'etcd cluster database growing very fast.'
condition: '{{ true }}'
expr: predict_linear(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[4h], 4*60*60) > etcd_server_quota_backend_bytes{job=~".*etcd.*"}
for: 10m
labels:
severity: warning
- alert: etcdDatabaseHighFragmentationRatio
annotations:
description: 'etcd cluster "{{`{{`}} $labels.job {{`}}`}}": database size in use on instance {{`{{`}} $labels.instance {{`}}`}} is {{`{{`}} $value | humanizePercentage {{`}}`}} of the actual allocated disk space, please run defragmentation (e.g. etcdctl defrag) to retrieve the unused fragmented disk space.'
runbook_url: 'https://etcd.io/docs/v3.5/op-guide/maintenance/#defragmentation'
summary: 'etcd database size in use is less than 50% of the actual allocated storage.'
condition: '{{ true }}'
expr: (last_over_time(etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"}[5m]) / last_over_time(etcd_mvcc_db_total_size_in_bytes{job=~".*etcd.*"}[5m])) < 0.5 and etcd_mvcc_db_total_size_in_use_in_bytes{job=~".*etcd.*"} > 104857600
for: 10m
labels:
severity: warning

View File

@@ -1,53 +0,0 @@
condition: '{{ true }}'
name: general.rules
rules:
- alert: TargetDown
annotations:
description: '{{`{{`}} printf "%.4g" $value {{`}}`}}% of the {{`{{`}} $labels.job {{`}}`}}/{{`{{`}} $labels.service {{`}}`}} targets in {{`{{`}} $labels.namespace {{`}}`}} namespace are down.'
runbook_url: '{{ .Values.defaultRules.runbookUrl }}/general/targetdown'
summary: 'One or more targets are unreachable.'
condition: '{{ true }}'
expr: 100 * (count(up == 0) BY (job,namespace,service,{{ .Values.global.clusterLabel }}) / count(up) BY (job,namespace,service,{{ .Values.global.clusterLabel }})) > 10
for: 10m
labels:
severity: warning
- alert: Watchdog
annotations:
description: 'This is an alert meant to ensure that the entire alerting pipeline is functional.
This alert is always firing, therefore it should always be firing in Alertmanager
and always fire against a receiver. There are integrations with various notification
mechanisms that send a notification when this alert is not firing. For example the
"DeadMansSnitch" integration in PagerDuty.
'
runbook_url: '{{ .Values.defaultRules.runbookUrl }}/general/watchdog'
summary: 'An alert that should always be firing to certify that Alertmanager is working properly.'
condition: '{{ true }}'
expr: vector(1)
labels:
severity: ok
- alert: InfoInhibitor
annotations:
description: 'This is an alert that is used to inhibit info alerts.
By themselves, the info-level alerts are sometimes very noisy, but they are relevant when combined with
other alerts.
This alert fires whenever there''s a severity="info" alert, and stops firing when another alert with a
severity of ''warning'' or ''critical'' starts firing on the same namespace.
This alert should be routed to a null receiver and configured to inhibit alerts with severity="info".
'
runbook_url: '{{ .Values.defaultRules.runbookUrl }}/general/infoinhibitor'
summary: 'Info-level alert inhibition.'
condition: '{{ true }}'
expr: ALERTS{severity = "info"} == 1 unless on (namespace,{{ .Values.global.clusterLabel }}) ALERTS{alertname != "InfoInhibitor", severity =~ "warning|critical", alertstate="firing"} == 1
labels:
severity: major

View File

@@ -1,11 +0,0 @@
condition: '{{ true }}'
name: k8s.rules.container_cpu_usage_seconds_total
rules:
- condition: '{{ true }}'
expr: |-
sum by (namespace,pod,container,{{ .Values.global.clusterLabel }}) (
irate(container_cpu_usage_seconds_total{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}[5m])
) * on (namespace,pod,{{ .Values.global.clusterLabel }}) group_left(node) topk by (namespace,pod,{{ .Values.global.clusterLabel }}) (
1, max by (namespace,pod,node,{{ .Values.global.clusterLabel }}) (kube_pod_info{node!=""})
)
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

View File

@@ -1,10 +0,0 @@
condition: '{{ true }}'
name: k8s.rules.container_memory_cache
rules:
- condition: '{{ true }}'
expr: |-
container_memory_cache{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
* on (namespace,pod,{{ .Values.global.clusterLabel }}) group_left(node) topk by (namespace,pod,{{ .Values.global.clusterLabel }}) (1,
max by (namespace,pod,node,{{ .Values.global.clusterLabel }}) (kube_pod_info{node!=""})
)
record: node_namespace_pod_container:container_memory_cache

View File

@@ -1,10 +0,0 @@
condition: '{{ true }}'
name: k8s.rules.container_memory_rss
rules:
- condition: '{{ true }}'
expr: |-
container_memory_rss{job="kubelet", metrics_path="/metrics/cadvisor", image!=""}
* on (namespace,pod,{{ .Values.global.clusterLabel }}) group_left(node) topk by (namespace,pod,{{ .Values.global.clusterLabel }}) (1,
max by (namespace,pod,node,{{ .Values.global.clusterLabel }}) (kube_pod_info{node!=""})
)
record: node_namespace_pod_container:container_memory_rss

Some files were not shown because too many files have changed in this diff Show More