Graduate the feature to beta by:
- Allowing `subPath`/`subPathExpr` for image volumes (see the sketch below)
- Modifying the CRI to pass down the (resolved) sub path
- Adding metrics which are outlined in the KEP
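For illustration, a minimal sketch of a pod consuming a sub path of
an image volume, written against the core/v1 Go types; the image
references and paths are placeholders:

    package main

    import (
        "fmt"

        v1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func main() {
        pod := &v1.Pod{
            ObjectMeta: metav1.ObjectMeta{Name: "image-volume-subpath"},
            Spec: v1.PodSpec{
                Containers: []v1.Container{{
                    Name:  "app",
                    Image: "registry.k8s.io/e2e-test-images/busybox:1.29-2",
                    VolumeMounts: []v1.VolumeMount{{
                        Name:      "artifact",
                        MountPath: "/data",
                        // The kubelet resolves this sub path and passes
                        // it down to the container runtime via the CRI.
                        SubPath: "dir/in/artifact",
                    }},
                }},
                Volumes: []v1.Volume{{
                    Name: "artifact",
                    VolumeSource: v1.VolumeSource{
                        Image: &v1.ImageVolumeSource{
                            Reference:  "quay.io/example/artifact:v1",
                            PullPolicy: v1.PullIfNotPresent,
                        },
                    },
                }},
            },
        }
        fmt.Println("pod spec:", pod.Name)
    }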
Signed-off-by: Sascha Grunert <sgrunert@redhat.com>
Add metrics about the sizing of the cpu pools.
Currently the cpumanager maintains 2 cpu pools:
- shared pool: this is where all pods with non-exclusive
cpu allocation run
- exclusive pool: this is the union of the sets of exclusive
cpus allocated to containers, if any (requires the static policy to be in use).
By reporting the size of the pools, users (humans or machines)
can get better insight and more feedback about how resources are
actually allocated to the workload and how the node resources are used.
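As a sketch, such gauges could be registered through the
component-base metrics machinery the kubelet already uses; the metric
names below are illustrative, the authoritative ones are defined in
the code and the KEP:

    package metrics

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    var (
        // Size of the shared pool, in millicores.
        SharedPoolSizeMilliCores = metrics.NewGauge(&metrics.GaugeOpts{
            Subsystem:      "kubelet",
            Name:           "cpu_manager_shared_pool_size_millicores",
            Help:           "The size of the shared CPU pool, in millicores.",
            StabilityLevel: metrics.ALPHA,
        })

        // Number of exclusively allocated CPUs; stays at zero unless
        // the static policy is in use.
        ExclusiveCPUsAllocationCount = metrics.NewGauge(&metrics.GaugeOpts{
            Subsystem:      "kubelet",
            Name:           "cpu_manager_exclusive_cpu_allocation_count",
            Help:           "The number of CPUs exclusively allocated to containers.",
            StabilityLevel: metrics.ALPHA,
        })
    )

    func init() {
        legacyregistry.MustRegister(
            SharedPoolSizeMilliCores,
            ExclusiveCPUsAllocationCount,
        )
    }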
In order to improve the observability of resource management in
the kubelet, specifically cpu allocation and NUMA alignment, we add
more metrics to report whether resource alignment is in effect.
More precise reporting would probably use the pod status,
but this would require more invasive and riskier changes,
and possibly extra interactions with the API server.
We start by adding metrics to report whether containers got their
compute resources aligned.
If the metrics grow, the assignment is working as expected;
if they stay constant, perhaps at zero, no resource
alignment is taking place.
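A minimal sketch of such a counter; the name and label are
assumptions based on the description above, not the exact ones:

    package metrics

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    // Incremented every time a container gets its compute resources
    // aligned; a flat value means no alignment is taking place.
    var ContainerAlignedComputeResources = metrics.NewCounterVec(
        &metrics.CounterOpts{
            Subsystem:      "kubelet",
            Name:           "container_aligned_compute_resources_count",
            Help:           "Cumulative number of containers getting aligned compute resources.",
            StabilityLevel: metrics.ALPHA,
        },
        // Hypothetical label: the alignment boundary (e.g. numa_node).
        []string{"boundary"},
    )

    func init() {
        legacyregistry.MustRegister(ContainerAlignedComputeResources)
    }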
Extra fixes brought by this work:
- retroactively add labels for existing tests
- running metrics tests demands precise accounting to avoid flakes;
  ensure the node state is restored pristine between each test to
  minimize the aforementioned risk of flakes.
- The test pod command line was wrong; with it, the pod could not
  reach the Running state. That went unnoticed so far because
  no test using this utility function actually needed a pod
  in the Running state.
Signed-off-by: Francesco Romani <fromani@redhat.com>
The latency to start the first pod with network can impact user
workloads; however, it is difficult to understand the source of this
latency, since it depends on the CNI plugin being ready at the moment
of pod creation.
Add a new internal metric in the kubelet that allows developers and
cluster administrators to understand the source of the latency
problems on node startup:
kubelet_first_network_pod_start_sli_duration_seconds
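A sketch of how such an internal histogram could be declared with
k8s.io/component-base/metrics; the bucket boundaries here are
illustrative:

    package metrics

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    // Observed once, when the first pod with network since kubelet
    // start reaches Running; internal, so not covered by the metrics
    // stability guarantees.
    var FirstNetworkPodStartSLIDuration = metrics.NewHistogram(
        &metrics.HistogramOpts{
            Subsystem:      "kubelet",
            Name:           "first_network_pod_start_sli_duration_seconds",
            Help:           "Duration in seconds to start the first pod with network since kubelet boot.",
            Buckets:        []float64{0.5, 1, 2, 5, 10, 20, 30, 45, 60, 120, 300},
            StabilityLevel: metrics.INTERNAL,
        },
    )

    func init() {
        legacyregistry.MustRegister(FirstNetworkPodStartSLIDuration)
    }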
Change-Id: I4cdb55b0df72c96a3a65b78ce2aae404c5195006
If a CRI error occurs during the terminating phase after a pod is
force deleted (API or static) then the housekeeping loop will not
deliver updates to the pod worker which prevents the pod's state
machine from progressing. The pod will remain in the terminating
phase but no further attempts to terminate or cleanup will occur
until the kubelet is restarted.
The pod worker now maintains a store of the state of each pod it is
attempting to reconcile, and uses that to resync unknown pods when
SyncKnownPods() is invoked, so that failures in sync methods for
unknown pods no longer hang forever.
The pod worker's store tracks desired updates and the last update
applied on podSyncStatuses. Each goroutine now synchronizes to
acquire the next work item, context, and whether the pod can start.
This synchronization moves the pending update to the stored last
update, which will ensure third parties accessing pod worker state
don't see updates before the pod worker begins synchronizing them.
As a consequence, the update channel becomes a simple notifier
(struct{}) so that SyncKnownPods can coordinate with the pod worker
to create a synthetic pending update for unknown pods (i.e. no one
besides the pod worker has data about those pods). Otherwise the
pending update info would be hidden inside the channel.
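A greatly simplified sketch of that bookkeeping; names are
paraphrased from the pod worker, not copied verbatim:

    package podworkers

    import (
        "sync"

        "k8s.io/apimachinery/pkg/types"
    )

    // UpdatePodOptions stands in for the pod worker's update payload.
    type UpdatePodOptions struct{ /* elided */ }

    type podSyncStatus struct {
        pendingUpdate *UpdatePodOptions // delivered, not yet picked up
        activeUpdate  *UpdatePodOptions // last update being synchronized;
        // third parties reading pod worker state only observe this one
    }

    type podWorkers struct {
        podLock         sync.Mutex
        podSyncStatuses map[types.UID]*podSyncStatus
    }

    // startPodSync is called by the per-pod goroutine once notified via
    // the payload-free update channel (chan struct{}); it atomically
    // promotes the pending update to the active one under the lock.
    func (p *podWorkers) startPodSync(uid types.UID) (*UpdatePodOptions, bool) {
        p.podLock.Lock()
        defer p.podLock.Unlock()
        status, ok := p.podSyncStatuses[uid]
        if !ok || status.pendingUpdate == nil {
            return nil, false // nothing to do
        }
        update := status.pendingUpdate
        status.activeUpdate, status.pendingUpdate = update, nil
        return update, true
    }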
In order to properly track pending updates, we have to be very
careful not to mix RunningPods (which are calculated from the
container runtime and are missing all spec info) and config-
sourced pods. Update the pod worker to avoid using ToAPIPod()
and instead require the pod worker to directly use
update.Options.Pod or update.Options.RunningPod for the
correct methods. Add a new SyncTerminatingRuntimePod method to
prevent spec-dependent code paths from being accidentally invoked on
runtime-only pod data.
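A sketch of the split; the signatures are abbreviated relative to
the real podSyncer interface:

    package podworkers

    import (
        "context"

        v1 "k8s.io/api/core/v1"
        kubecontainer "k8s.io/kubernetes/pkg/kubelet/container"
    )

    type podSyncer interface {
        // SyncTerminatingPod handles pods for which config (spec)
        // data exists.
        SyncTerminatingPod(ctx context.Context, pod *v1.Pod,
            podStatus *kubecontainer.PodStatus, gracePeriod *int64) error
        // SyncTerminatingRuntimePod handles orphaned pods known only
        // from the container runtime; it must not rely on spec data.
        SyncTerminatingRuntimePod(ctx context.Context,
            runningPod *kubecontainer.Pod) error
    }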
Finally, fix SyncKnownPods to replay the last valid update for
undesired pods which drives the pod state machine towards
termination, and alter HandlePodCleanups to:
- terminate runtime pods that aren't known to the pod worker
- launch admitted pods that aren't known to the pod worker
Any started pods receive a replay until they reach the finished
state, and then are removed from the pod worker. When a desired
pod is detected as not being in the worker, the usual cause is
that the pod was deleted and recreated with the same UID (almost
always a static pod since API UID reuse is statistically
unlikely). This simplifies the previous restartable pod support.
We are careful to filter for active pods (those that are neither
already terminal nor previously rejected by
admission). We also force a refresh of the runtime cache to
ensure we don't see an older version of the state.
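Condensed toy sketch of the resulting reconciliation; stub types
stand in for the kubelet's real ones:

    package main

    import "fmt"

    type UID string

    type runtimePod struct{ ID UID } // reconstructed from the runtime
    type configPod struct{ UID UID } // sourced from config (API/static)

    type podWorker struct{ known map[UID]bool }

    func (w *podWorker) terminate(p runtimePod) { fmt.Println("terminate orphan", p.ID) }
    func (w *podWorker) start(p configPod)      { fmt.Println("start admitted", p.UID) }

    func (w *podWorker) handlePodCleanups(running []runtimePod, desired []configPod) {
        // Terminate runtime pods that aren't known to the pod worker.
        for _, rp := range running {
            if !w.known[rp.ID] {
                w.terminate(rp)
            }
        }
        // Launch admitted pods that aren't known, e.g. a static pod
        // recreated with the same UID.
        for _, dp := range desired {
            if !w.known[dp.UID] {
                w.start(dp)
            }
        }
    }

    func main() {
        w := &podWorker{known: map[UID]bool{"a": true}}
        w.handlePodCleanups(
            []runtimePod{{ID: "a"}, {ID: "b"}},
            []configPod{{UID: "c"}},
        )
    }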
Future changes will allow other components that need to view the
pod worker's actual state (not the desired state the podManager
represents) to retrieve that info from the pod worker.
Several bugs in pod lifecycle have been undetectable at runtime
because the kubelet does not clearly describe the number of pods
in use. To improve reporting, add the following metrics (one is
sketched after the list):
kubelet_desired_pods: Pods the pod manager sees
kubelet_active_pods: "Admitted" pods that gate new pods
kubelet_mirror_pods: Mirror pods the kubelet is tracking
kubelet_working_pods: Breakdown of pods from the last sync in
  each phase, orphaned state, and static or not
kubelet_restarted_pods_total: A counter for pods that saw a
  CREATE before the previous pod with the same UID was finished
kubelet_orphaned_runtime_pods_total: A counter for pods detected
  at runtime that were not known to the kubelet. Will be
  populated at Kubelet startup and should never be incremented
  after.
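For instance, the working pods breakdown could be declared roughly
as below; the label names are assumptions derived from the
description above:

    package metrics

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    var WorkingPods = metrics.NewGaugeVec(
        &metrics.GaugeOpts{
            Subsystem:      "kubelet",
            Name:           "working_pods",
            Help:           "Number of pods the pod worker is handling, by lifecycle, config source, and static or not.",
            StabilityLevel: metrics.ALPHA,
        },
        []string{"lifecycle", "config", "static"},
    )

    func init() {
        legacyregistry.MustRegister(WorkingPods)
    }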
Add a metric check to our e2e tests that verifies the values are
captured correctly during a serial test, and then verify them in
detail in unit tests.
Adds 23 series to the kubelet /metrics endpoint.
In order to improve the observability of the cpumanager,
add and populate metrics to track whether the combination of
the kubelet configuration and pod spec would trigger
exclusive core allocation and pinning.
We should avoid leaking any node/machine specific information
(e.g. core ids, even though this is admittedly an extreme example);
tracking these metrics seems to be a good first step, because
it allows us to get feedback without exposing details.
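Roughly, the condition being tracked looks like this hypothetical
helper; the real check lives in the cpumanager policies and also
accounts for the pod QoS class:

    package cpumanager

    import v1 "k8s.io/api/core/v1"

    // triggersExclusiveAllocation is a simplified, illustrative
    // predicate: under the static policy, a container is granted
    // exclusive (pinned) CPUs only when it requests an integer amount
    // of CPU with request == limit (the Guaranteed QoS check over the
    // whole pod is elided here).
    func triggersExclusiveAllocation(policyName string, container *v1.Container) bool {
        if policyName != "static" {
            return false
        }
        request := container.Resources.Requests[v1.ResourceCPU]
        limit := container.Resources.Limits[v1.ResourceCPU]
        if request.IsZero() || request.Cmp(limit) != 0 {
            return false
        }
        return request.MilliValue()%1000 == 0
    }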
Signed-off-by: Francesco Romani <fromani@redhat.com>