157 Commits

Author SHA1 Message Date
Kubernetes Prow Robot
7b6c56e5fb Merge pull request #130135 from saschagrunert/image-volume-beta
[KEP-4639] Graduate image volume sources to beta
2025-03-12 18:03:58 -07:00
Sascha Grunert
f9e5dd84ad Graduate image volume sources to beta
Graduate the feature to beta, by:

- Allowing `subPath`/`subPathExpr` for image volumes
- Modifying the CRI to pass down the (resolved) sub path
- Adding metrics which are outlined in the KEP

Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
2025-03-11 13:41:45 +01:00
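The `subPathExpr` resolution described in the commit above can be sketched with a small, self-contained Go example. The helper name and the regex-based expansion are purely illustrative assumptions; the kubelet uses its own expansion utilities before passing the resolved sub path down the CRI:

```go
package main

import (
	"fmt"
	"regexp"
)

// expandSubPathExpr sketches how a subPathExpr like "logs/$(POD_NAME)"
// could be resolved against the container's environment before being
// handed to the CRI. Illustrative only; not the kubelet's real helper.
func expandSubPathExpr(expr string, env map[string]string) string {
	re := regexp.MustCompile(`\$\(([A-Za-z_][A-Za-z0-9_]*)\)`)
	return re.ReplaceAllStringFunc(expr, func(m string) string {
		name := re.FindStringSubmatch(m)[1]
		if v, ok := env[name]; ok {
			return v
		}
		return m // unknown variables are left untouched
	})
}

func main() {
	env := map[string]string{"POD_NAME": "web-0"}
	fmt.Println(expandSubPathExpr("logs/$(POD_NAME)", env)) // logs/web-0
}
```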
Francesco Romani
04129d1dc8 node: metrics for alignment failures
Add metrics to report alignment allocation failures
See: https://github.com/kubernetes/enhancements/pull/5108

Signed-off-by: Francesco Romani <fromani@redhat.com>
2025-03-04 19:50:08 +01:00
Anish Shah
d4f05fdda5 Introduce a metric to track kubelet admission failure. 2024-11-06 00:07:17 -08:00
Kubernetes Prow Robot
8c5472ce66 Merge pull request #128189 from zylxjtu/bug
Fix the incorrect metrics setting/naming in nodeshutdown manager
2024-11-06 02:29:29 +00:00
Talor Itzhak
37bcd38785 memorymanager: FG is locked to ON by default
We can remove the `if()` guards, since
the feature is always available.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2024-11-04 09:57:21 +02:00
Kubernetes Prow Robot
a339a36a36 Merge pull request #127506 from ffromani/cpu-pool-size-metrics
node: metrics: add metrics about cpu pool sizes
2024-10-30 00:17:24 +00:00
Ed Bartosh
c1cd8495a5 kubelet: define custom buckets for DRA metrics 2024-10-28 18:04:51 +02:00
Ed Bartosh
2048b7b8fd kubelet: add DRAGRPCOperationsDuration metric 2024-10-27 10:47:06 +02:00
Ed Bartosh
a21f3f0a04 kubelet: add DRAOperationsDuration metric 2024-10-27 10:46:59 +02:00
zylxjtu
04b1378804 Fix the incorrect metrics setting/naming in nodeshutdown manager 2024-10-25 10:16:47 -07:00
Francesco Romani
14ec0edd10 node: metrics: add metrics about cpu pool sizes
Add metrics about the sizing of the cpu pools.
Currently the cpumanager maintains 2 cpu pools:
- shared pool: this is where all pods with non-exclusive
  cpu allocation run
- exclusive pool: this is the union of the set of exclusive
  cpus allocated to containers, if any (requires static policy in use).

By reporting the size of the pools, users (humans or machines)
can get better insight and more feedback about how resources are
actually allocated to the workload and how the node's resources are used.
2024-10-24 15:35:51 +02:00
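The two pools described above can be sketched with a minimal, self-contained Go example. The `cpuSet` type and `poolSizes` helper are simplified stand-ins for the cpumanager's internals, not its actual API:

```go
package main

import "fmt"

// cpuSet is a simplified stand-in for the cpumanager's CPUSet type.
type cpuSet map[int]struct{}

func newCPUSet(cpus ...int) cpuSet {
	s := cpuSet{}
	for _, c := range cpus {
		s[c] = struct{}{}
	}
	return s
}

// poolSizes returns the sizes of the two pools: the exclusive pool is
// the union of exclusively allocated CPUs, and the shared pool is
// everything else in the allocatable set.
func poolSizes(allocatable, exclusive cpuSet) (shared, excl int) {
	for c := range allocatable {
		if _, ok := exclusive[c]; !ok {
			shared++
		}
	}
	return shared, len(exclusive)
}

func main() {
	allocatable := newCPUSet(0, 1, 2, 3, 4, 5, 6, 7)
	exclusive := newCPUSet(2, 3) // pinned by the static policy
	s, e := poolSizes(allocatable, exclusive)
	// These two numbers are what the new gauges would report.
	fmt.Println("shared:", s, "exclusive:", e) // shared: 6 exclusive: 2
}
```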
Francesco Romani
c025861e0c node: metrics: add resource alignment metrics
In order to improve the observability of resource management in the
kubelet (CPU allocation and NUMA alignment), we add more metrics
to report whether resource alignment is in effect.

More precise reporting would probably use the pod status,
but that would require more invasive and riskier changes,
and possibly extra interactions with the API server.

We start adding metrics to report if containers got their
compute resources aligned.
If the metrics are growing, the assignment is working as expected;
if they stay flat, perhaps at zero, no resource
alignment is being done.

Extra fixes brought by this work
- retroactively add labels for existing tests
- running metrics test demands precision accounting to avoid flakes;
  ensure the node state is restored pristine between each test, to
  minimize the aforementioned risk of flakes.
- The test pod command line was wrong; with it, the pod could not
  reach Running state. That went unnoticed so far because
  no test using this utility function actually needed a pod
  in Running state.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2024-10-23 08:05:38 +02:00
googs1025
754ad2001f fix(kubelet): register ImageGarbageCollectedTotal metrics 2024-08-30 22:45:07 +08:00
Harshal Patil
68d317a8d1 Add a warning log, event and metric for cgroup version 1
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2024-07-09 11:34:46 -04:00
Kubernetes Prow Robot
cbfebf02e8 Merge pull request #121720 from aojea/first_pod_network_startup
kubelet: add internal metric for the first pod with network latency
2024-02-22 07:13:25 -08:00
Kubernetes Prow Robot
0f7cc6fcaa Merge pull request #121778 from Tal-or/mm_metrics
kubelet: memorymanager: metrics:  add metrics about static allocation
2024-02-20 09:41:50 -08:00
Kubernetes Prow Robot
5d776f935c Merge pull request #123345 from haircommander/image-gc-metric-reason
KEP-4210: kubelet: add reason field to image gc metric
2024-02-19 18:56:59 -08:00
AxeZhan
c74ec3df09 graduate PodLifecycleSleepAction to beta 2024-02-19 19:40:52 +08:00
Peter Hunt
c8b4d8ebed kubelet: add reason field to image gc metric
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2024-02-16 16:02:41 -05:00
Kubernetes Prow Robot
14f8f5519d Merge pull request #121719 from ruiwen-zhao/metric-size
Add image pull duration metric with bucketed image size
2024-02-13 16:23:50 -08:00
ruiwen-zhao
0f5cf6c1cd Add image pull duration metric with bucketed image size
Signed-off-by: ruiwen-zhao <ruiwen@google.com>
2024-02-08 00:30:31 +00:00
carlory
55c5db172e lock GA feature-gate ConsistentHTTPGetHandlers to default 2024-01-04 15:12:08 +08:00
Antonio Ojea
b8533f7976 kubelet: add metric for the first pod with network latency
The latency of the first pod with networking impacts user workloads;
however, it is difficult to understand where this latency comes from,
since it depends on the CNI plugin being ready at the moment of
pod creation.

Add a new internal metric in the kubelet that allows developers and
cluster administrators to understand the source of latency problems
on node startup.

kubelet_first_network_pod_start_sli_duration_seconds

Change-Id: I4cdb55b0df72c96a3a65b78ce2aae404c5195006
2023-11-15 06:09:49 +00:00
Talor Itzhak
ddd60de3f3 memorymanager:metrics: add metrics
As part of the memory manager GA graduation effort, we should add
metrics in order to improve observability.

The metrics are also mentioned in PR https://github.com/kubernetes/enhancements/pull/4251 (which has not been merged yet).

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2023-11-12 09:34:55 +02:00
ruiwen-zhao
1165609036 Add metric for e2e pod startup latency including image pull
Signed-off-by: ruiwen-zhao <ruiwen@google.com>
2023-10-25 20:34:17 +00:00
Kubernetes Prow Robot
12b01aff1b Merge pull request #121275 from haircommander/image-max-gc
KEP-4210: add support for ImageMaximumGCAge field
2023-10-25 21:29:10 +02:00
Kubernetes Prow Robot
f82670d8ec Merge pull request #120680 from ruiwen-zhao/pod-start-bucket
Use a wider-range of metric buckets for PodStartDuration
2023-10-25 20:16:34 +02:00
Peter Hunt
49c947ba15 metrics: add and use ImageGarbageCollectedTotal
to help find MaxAge thresholds and detect image addition/removal thrashing

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2023-10-20 12:23:31 -04:00
ruiwen-zhao
9b50af1f4f Use a wider-range of metric buckets for PodStartDuration
Signed-off-by: ruiwen-zhao <ruiwen@google.com>
2023-09-14 21:32:14 +00:00
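Widening a histogram's range typically means generating exponentially spaced bucket bounds. The sketch below mirrors the shape of Prometheus's bucket helpers in self-contained Go; the start value, factor, and count are illustrative, not the exact bounds chosen for PodStartDuration:

```go
package main

import "fmt"

// exponentialBuckets returns `count` upper bounds starting at `start`,
// each `factor` times the previous — the same shape as Prometheus's
// histogram bucket helpers. Values below are illustrative only.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	// A wide range: 0.5s doubling up past half an hour in 13 steps.
	fmt.Println(exponentialBuckets(0.5, 2, 13))
	// [0.5 1 2 4 8 16 32 64 128 256 512 1024 2048]
}
```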
Qiutong Song
d3eb082568 Create a node startup latency tracker
Signed-off-by: Qiutong Song <songqt01@gmail.com>
2023-09-11 05:54:25 +00:00
Francesco Romani
01c3a51a78 node: podresources: getallocatable: move to GA
lock the feature gate to GA, and remove the now-redundant code.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-07-12 14:11:22 +02:00
Kubernetes Prow Robot
cfeb83d56b Merge pull request #116525 from ffromani/kubelet-podresources-endpoint-ga
node: podresources: graduate to GA
2023-05-25 16:38:50 -07:00
Mark Rossetti
ab9c8eb1e8 Removing WindowsHostProcessContainers feature-gate
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
2023-05-01 13:30:38 -07:00
Francesco Romani
69bc685556 node: podresources: graduate to GA
Lock the feature gate to ON and simplify the code
accordingly.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2023-05-01 16:23:28 +02:00
Moshe Levi
71d6e4d53c kubelet metrics: add pod resources get metrics
Signed-off-by: Moshe Levi <moshele@nvidia.com>
2023-03-14 19:33:03 +02:00
Kubernetes Prow Robot
c6f3007071 Merge pull request #115967 from harche/evented_pleg_metrics
Graduate Evented PLEG to Beta
2023-03-10 17:34:40 -08:00
Kubernetes Prow Robot
a408be817f Merge pull request #115972 from jsafrane/add-orphan-pod-metrics
Add metric for failed orphan pod cleanup
2023-03-09 22:43:26 -08:00
Clayton Coleman
6b9a381185 kubelet: Force deleted pods can fail to move out of terminating
If a CRI error occurs during the terminating phase after a pod is
force deleted (API or static) then the housekeeping loop will not
deliver updates to the pod worker which prevents the pod's state
machine from progressing. The pod will remain in the terminating
phase but no further attempts to terminate or cleanup will occur
until the kubelet is restarted.

The pod worker now maintains a store of the pods state that it is
attempting to reconcile and uses that to resync unknown pods when
SyncKnownPods() is invoked, so that failures in sync methods for
unknown pods no longer hang forever.

The pod worker's store tracks desired updates and the last update
applied on podSyncStatuses. Each goroutine now synchronizes to
acquire the next work item, context, and whether the pod can start.
This synchronization moves the pending update to the stored last
update, which will ensure third parties accessing pod worker state
don't see updates before the pod worker begins synchronizing them.

As a consequence, the update channel becomes a simple notifier
(struct{}) so that SyncKnownPods can coordinate with the pod worker
to create a synthetic pending update for unknown pods (i.e. no one
besides the pod worker has data about those pods). Otherwise the
pending update info would be hidden inside the channel.

In order to properly track pending updates, we have to be very
careful not to mix RunningPods (which are calculated from the
container runtime and are missing all spec info) and config-
sourced pods. Update the pod worker to avoid using ToAPIPod()
and instead require the pod worker to directly use
update.Options.Pod or update.Options.RunningPod for the
correct methods. Add a new SyncTerminatingRuntimePod to prevent
accidental invocations of runtime only pod data.

Finally, fix SyncKnownPods to replay the last valid update for
undesired pods which drives the pod state machine towards
termination, and alter HandlePodCleanups to:

- terminate runtime pods that aren't known to the pod worker
- launch admitted pods that aren't known to the pod worker

Any started pods receive a replay until they reach the finished
state, and then are removed from the pod worker. When a desired
pod is detected as not being in the worker, the usual cause is
that the pod was deleted and recreated with the same UID (almost
always a static pod since API UID reuse is statistically
unlikely). This simplifies the previous restartable pod support.
We are careful to filter for active pods (those not already
terminal or those which have been previously rejected by
admission). We also force a refresh of the runtime cache to
ensure we don't see an older version of the state.

Future changes will allow other components that need to view the
pod worker's actual state (not the desired state the podManager
represents) to retrieve that info from the pod worker.

Several bugs in pod lifecycle have been undetectable at runtime
because the kubelet does not clearly describe the number of pods
in use. To better report, add the following metrics:

  kubelet_desired_pods: Pods the pod manager sees
  kubelet_active_pods: "Admitted" pods that gate new pods
  kubelet_mirror_pods: Mirror pods the kubelet is tracking
  kubelet_working_pods: Breakdown of pods from the last sync in
    each phase, orphaned state, and static or not
  kubelet_restarted_pods_total: A counter for pods that saw a
    CREATE before the previous pod with the same UID was finished
  kubelet_orphaned_runtime_pods_total: A counter for pods detected
    at runtime that were not known to the kubelet. Will be
    populated at Kubelet startup and should never be incremented
    after.

Add a metric check to our e2e tests that verifies the values are
captured correctly during a serial test, and then verify them in
detail in unit tests.

Adds 23 series to the kubelet /metrics endpoint.
2023-03-08 22:03:51 -06:00
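The relationship between the desired/active/mirror gauges listed above can be sketched with a heavily simplified Go example. The flat `pod` struct and counting rules are illustrative assumptions; the real kubelet derives these values from the pod manager and pod worker state:

```go
package main

import "fmt"

// pod is a toy stand-in carrying only the fields this sketch needs.
type pod struct {
	mirror   bool // mirror pod the kubelet tracks for a static pod
	admitted bool // passed admission and gates new pods
}

// counts is an illustrative breakdown in the spirit of
// kubelet_desired_pods, kubelet_active_pods, and kubelet_mirror_pods.
func counts(pods []pod) (desired, active, mirror int) {
	for _, p := range pods {
		if p.mirror {
			mirror++
			continue
		}
		desired++
		if p.admitted {
			active++
		}
	}
	return
}

func main() {
	pods := []pod{
		{admitted: true},
		{admitted: true},
		{admitted: false}, // rejected by admission: desired, not active
		{mirror: true},
	}
	d, a, m := counts(pods)
	fmt.Println(d, a, m) // 3 2 1
}
```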
Harshal Patil
412b4b3329 Add connection related metrics to EventedPLEG
Signed-off-by: Harshal Patil <harpatil@redhat.com>
2023-03-01 11:35:27 -05:00
Jan Safranek
7bf9991389 Add metric for failed orphan pod cleanup 2023-02-22 18:43:38 +01:00
Swati Sehgal
bc941633c1 node: topology-mgr: add metric to measure topology mgr admission latency
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-02-15 13:59:47 +00:00
Kubernetes Prow Robot
4df945853e Merge pull request #115137 from swatisehgal/topologymgr-metrics
node: topologymgr: add metrics about admission requests and errors
2023-01-30 18:43:00 -08:00
Swati Sehgal
172c55d310 node: topologymgr: add metrics about admission requests and errors
Signed-off-by: Swati Sehgal <swsehgal@redhat.com>
2023-01-17 17:50:29 +00:00
Paco Xu
70e56fa71a cleanup: EphemeralContainers feature gate related codes 2023-01-15 21:15:01 +08:00
Kubernetes Prow Robot
1bf4af4584 Merge pull request #111930 from azylinski/new-histogram-pod_start_sli_duration_seconds
New histogram: Pod start SLI duration
2022-11-04 07:28:14 -07:00
Francesco Romani
ff44dc1932 cpumanager: the FG is locked to default (ON)
hence we can remove the `if()` guards; the feature
is always available.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-11-02 18:41:41 +01:00
Mark Rossetti
498d065cc5 Promoting WindowsHostProcessContainers to stable
Signed-off-by: Mark Rossetti <marosset@microsoft.com>
2022-11-01 14:06:25 -07:00
Francesco Romani
47d3299781 node: metrics: cpumanager: add pinning metrics
In order to improve the observability of the cpumanager,
add and populate metrics to track if the combination of
the kubelet configuration and podspec would trigger
exclusive core allocation and pinning.

We should avoid leaking any node/machine specific information
(e.g. core ids, even though this is admittedly an extreme example);
tracking these metrics seems to be a good first step, because
it allows us to get feedback without exposing details.

Signed-off-by: Francesco Romani <fromani@redhat.com>
2022-10-27 14:40:40 +02:00
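The "combination of kubelet configuration and podspec" that triggers exclusive allocation can be sketched in self-contained Go. The static policy grants exclusive cores to Guaranteed-QoS containers requesting whole CPUs; the helper name and millicpu representation are simplifications of the real check:

```go
package main

import "fmt"

// triggersExclusivePinning sketches the condition under which the
// static cpumanager policy grants exclusive cores: the policy must be
// enabled and the container must request an integer number of CPUs
// with requests == limits (Guaranteed QoS). Millicpu integers avoid
// floats; this is a simplification of the kubelet's real logic.
func triggersExclusivePinning(staticPolicy bool, requestMilli, limitMilli int64) bool {
	return staticPolicy &&
		requestMilli == limitMilli &&
		requestMilli%1000 == 0 &&
		requestMilli > 0
}

func main() {
	fmt.Println(triggersExclusivePinning(true, 2000, 2000))  // true: 2 whole CPUs
	fmt.Println(triggersExclusivePinning(true, 1500, 1500))  // false: fractional CPU
	fmt.Println(triggersExclusivePinning(false, 2000, 2000)) // false: policy off
}
```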
Artur Żyliński
9f31669a53 New histogram: Pod start SLI duration 2022-10-26 11:28:17 +02:00