2371 Commits

Author SHA1 Message Date
Tim Allclair
660bd6b42d Track actuated resources in the allocation manager 2025-03-10 09:58:29 -07:00
Tim Allclair
ed326fea13 Always report pod status resources consistent with the current pod sync 2025-03-05 16:01:03 -08:00
Tim Allclair
cb5c8d159c Don't automatically clear in-progress status when resize is not allowed 2025-03-03 15:26:40 -08:00
Tim Allclair
523a19aa44 Extract isInPlacePodVerticalScalingAllowed to shared function 2025-03-03 14:08:49 -08:00
Tim Allclair
460db5c137 Always use allocated resources for pods that don't support resize 2025-03-03 14:07:30 -08:00
Kubernetes Prow Robot
3560950041 Merge pull request #130254 from tallclair/allocation-manager-2
[FG:InPlacePodVerticalScaling] Move pod resource allocation management out of the status manager
2025-02-28 11:30:56 -08:00
Tim Allclair
fe4671356c Call allocationManager directly 2025-02-21 09:28:37 -08:00
Antonio Ojea
2418b54ee2 Revert "Add random interval to nodeStatusReport interval every time after an actual node status change" 2025-02-21 17:29:08 +01:00
Kubernetes Prow Robot
0634e21fb5 Merge pull request #128367 from vivzbansal/sidecar-2
[FG:InPlacePodVerticalScaling] Implement resize for sidecar containers
2025-02-05 14:38:15 -08:00
Ed Bartosh
71b9114840 kubelet: Migrate pkg/kubelet/sysctl to contextual logging 2025-01-30 10:31:58 +02:00
Kubernetes Prow Robot
8294abc599 Merge pull request #128998 from bart0sh/PR165-migrate-oom-to-contextual-logging
kubelet: Migrate pkg/kubelet/oom to contextual logging
2025-01-28 13:33:22 -08:00
vivzbansal
6c5cf68722 Resolved latest review comments 2025-01-27 19:46:33 +00:00
vivzbansal
d1fac494f4 resolve merge conflicts 2025-01-27 19:42:13 +00:00
Ed Bartosh
f622be0333 kubelet: Migrate pkg/kubelet/oom to contextual logging 2024-11-28 17:47:02 +02:00
Talor Itzhak
dc258e65ac memmanager:cleanup: drop Experimental prefix
Since MemoryManager goes GA, we should drop the
`Experimental` prefix from its fields.

Signed-off-by: Talor Itzhak <titzhak@redhat.com>
2024-11-12 09:45:17 +02:00
Kubernetes Prow Robot
6b031e50b2 Merge pull request #128713 from tallclair/ippr-debug-events
[FG:InPlacePodVerticalScaling] Emit events for Deferred and Infeasible statuses
2024-11-11 23:22:45 +00:00
lauralorenz
7fe41da522 KEP-4603: Node specific kubelet config for maximum backoff down to 1 second (#128374)
* Add feature gate, API, and conflict validation tests for enablecrashloopbackoffmax

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Handle when current base is longer than node max

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Update pkg/features/kube_features.go

Co-authored-by: Tsubasa Nagasawa <toversus2357@gmail.com>

* Fix indentation

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Follow convention for success test

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Normalize casing, and change field to Duration

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Fix json name and some other casing errors

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Another one I missed before

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Don't clobber global max function

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Change to flat value in defaults.go

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Streamline validation and defaults

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Fix typecheck

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Lint

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Tighten up validation for subsecond values

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Rename field from MaxBackOffPeriod to MaxContainerRestartPeriod

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* A few missed references to renames

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Only compare flags in flags test

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Don't mess with SetDefault signature

Nobody messes with SetDefault signature

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Fix stale signature change, and update test data

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Inspect current feature gates at defaulting time

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Don't use the global feature gate for temp usage

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Expose default error, and some comments

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Hint fuzzer for less arbitrary values to FeatureGates

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

---------

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Co-authored-by: Tsubasa Nagasawa <toversus2357@gmail.com>
2024-11-09 01:44:43 +00:00
Tim Allclair
3a2555ee93 Emit events for resize error states 2024-11-08 16:43:55 -08:00
Tim Allclair
61e6242967 Move windows infeasible resize check into canResizePod 2024-11-08 16:42:10 -08:00
Kubernetes Prow Robot
0fff5bbe7d Merge pull request #128680 from tallclair/min-cpu
[FG:InPlacePodVerticalScaling] Handle edge cases around CPU MinShares
2024-11-08 05:24:51 +00:00
Kubernetes Prow Robot
81dc4538db Merge pull request #128287 from Nordix/esotsal/128068
[FG:InPlacePodVerticalScaling] Gate Disallow in-place resize for guaranteed pods on nodes with a static topology policy
2024-11-08 05:24:44 +00:00
Tim Allclair
5a3a40cd19 Handle resize edge cases around min CPU shares 2024-11-07 17:02:25 -08:00
Kubernetes Prow Robot
8504758a2e Merge pull request #125757 from Nordix/esotsal/125205
[FG:InPlacePodVerticalScaling] Fix backoff problem when quickly reverting resize patch
2024-11-07 23:32:42 +00:00
Kubernetes Prow Robot
ab30adcbae Merge pull request #128356 from lauralorenz/crashloopbackoff-maintain10minuterecoverythreshold
KEP-4603: Maintain current 10 minute recovery threshold for container backoff regardless of changes to the maximum duration
2024-11-07 22:20:50 +00:00
Kubernetes Prow Robot
1ce20b2b6f Merge pull request #126336 from HirazawaUi/remove-runonce-mode
Kubelet: Remove runonce mode
2024-11-07 21:06:46 +00:00
Kubernetes Prow Robot
25101d33bc Merge pull request #128518 from tallclair/pleg-watch-conditions
[FG:InPlacePodVerticalScaling] PLEG watch conditions: rapid polling for expected changes
2024-11-07 19:45:01 +00:00
Sotiris Salloumis
68fcc9cf8a Fix slow reconcile when quickly reverting resize patch 2024-11-07 19:51:47 +01:00
Laura Lorenz
a0b83a7741 Maintain 10 minute recovery threshold for container backoff
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-07 18:46:11 +00:00
Sotiris Salloumis
2d8939c4ae Gate: disallow in-place resize for guaranteed pods on nodes with a static topology policy
New gate "InPlacePodVerticalScalingExclusiveCPUs" is off by default,
but can be enabled to unblock development of Static CPU management alongside
InPlacePodVerticalScaling.
2024-11-07 16:59:23 +00:00
Kubernetes Prow Robot
c9024e7ae6 Merge pull request #128640 from mengqiy/spreadkubeletlaod
Add random interval to nodeStatusReport interval every time after an actual node status change
2024-11-07 13:48:03 +00:00
HirazawaUi
ecf2b402be remove runonce mode 2024-11-07 19:54:11 +08:00
Kubernetes Prow Robot
c462d4c8e5 Merge pull request #126096 from utam0k/support-disabling-oom-group-kill
kubelet: new kubelet config option for disabling group oom kill
2024-11-07 06:29:36 +00:00
Mengqi (David) Yu
1003d36870 Add random interval to nodeStatusReport interval every time after an actual node status change
update TestUpdateNodeStatusWithLease this time to avoid flakiness
2024-11-07 04:33:59 +00:00
Kubernetes Prow Robot
3184eb3d1b Merge pull request #128629 from liggitt/revert-spreadkubeletload
Revert "Add random interval to nodeStatusReport interval every time after an actual node status change"
2024-11-07 03:53:42 +00:00
utam0k
4f909c14a0 kubelet: new kubelet config option for disabling group oom kill
Signed-off-by: utam0k <k0ma@utam0k.jp>
2024-11-07 12:03:04 +09:00
Jordan Liggitt
4850b31bda Revert "Add random interval to nodeStatusReport interval every time after an actual node status change"
This reverts commit d6e17ad808.
2024-11-06 17:12:13 -05:00
Anish Shah
207842d3e0 drop InPlacePodVerticalScaling support in windows 2024-11-06 12:57:55 -08:00
Kubernetes Prow Robot
099449954e Merge pull request #128556 from AnishShah/kubelet-reject-metric
Introduce a metric to track kubelet admission failure.
2024-11-06 20:10:33 +00:00
Tim Allclair
da9c2c553b Set pod watch conditions for resize 2024-11-06 11:05:24 -08:00
Anish Shah
d4f05fdda5 Introduce a metric to track kubelet admission failure. 2024-11-06 00:07:17 -08:00
Mengqi (David) Yu
d6e17ad808 Add random interval to nodeStatusReport interval every time after an actual node status change 2024-11-06 06:11:05 +00:00
Kubernetes Prow Robot
5e0b818ff9 Merge pull request #128551 from tallclair/allocated-checkpoint
[FG:InPlacePodVerticalScaling] Don't checkpoint ResizeStatus
2024-11-06 04:19:36 +00:00
Kubernetes Prow Robot
bf75546494 Merge pull request #128432 from zhifei92/integrating-health-check
Integrate device plugin registration gRPC server health checks.
2024-11-06 04:19:29 +00:00
Tim Allclair
ea53083c14 Don't checkpoint ResizeStatus 2024-11-05 15:48:35 -08:00
Tim Allclair
4a4748d23c Determine resize status from state in handlePodResourcesResize 2024-11-05 15:41:49 -08:00
Kubernetes Prow Robot
f81a68f488 Merge pull request #128377 from tallclair/allocated-status-2
[FG:InPlacePodVerticalScaling] Implement AllocatedResources status changes for Beta
2024-11-05 23:21:49 +00:00
Kubernetes Prow Robot
e57618970e Merge pull request #126870 from AnishShah/outofcpu-fix
Ensure mirror pods are created as soon as node is registered
2024-11-05 19:15:29 +00:00
zhangzhifei16
1381e41f28 feat: Integrate device plugin registration gRPC server health checks. 2024-11-05 19:59:56 +08:00
Anish Shah
dcafd93b68 kubelet: try registering mirror pods as soon as node is registered.
Mirror pods for static pods may not be created immediately during node startup
because either the node is not registered or the node informer is not synced.
They will be created eventually when static pods are resynced (every 1-1.5 minutes).

However, during this delay of 1-1.5 mins, kube-scheduler might overcommit resources
to the node and eventually cause kubelet to reject pods with
OutOfCPU/OutOfMemory/OutOfPods error.

To ensure kube-scheduler is aware of static pod resource usage faster,
mirror pods are created as soon as the node registers.
2024-11-05 00:56:21 -08:00
lauralorenz
4965a7a8a0 KEP-4603: Refactor various hardcoded backoffs into separate constants (#128369)
* Refactor various hardcoded backoffs into separate constants

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

* Fix comment formatting

Signed-off-by: Laura Lorenz <lauralorenz@google.com>

---------

Signed-off-by: Laura Lorenz <lauralorenz@google.com>
2024-11-05 06:07:28 +00:00