Commit Graph

85 Commits

Author SHA1 Message Date
Tim Hockin
e54719bb66 Use randfill, do API renames 2025-03-08 15:18:00 -08:00
Kubernetes Prow Robot
9d45ea8b9d Merge pull request #128586 from mortent/DRAPrioritizedList
Prioritized Alternatives in Device Requests
2025-03-06 21:01:44 -08:00
Kubernetes Prow Robot
0556b20d3d Merge pull request #129435 from googs1025/dra/validation
chore: add more error info for validateResourceSliceSpec
2025-03-01 02:16:55 -08:00
Morten Torkildsen
e2d1fcc162 Addressed comments 2025-02-28 20:47:35 +00:00
Morten Torkildsen
a716095a8a DRA: Update validation for Prioritized Alternatives in Device Requests 2025-02-28 19:28:50 +00:00
Morten Torkildsen
68040a3173 Run make update 2025-02-28 19:28:26 +00:00
Morten Torkildsen
8f7b43b6fd DRA: Update types and defaults for Prioritized Alternatives in Device Requests 2025-02-28 19:13:48 +00:00
Kubernetes Prow Robot
803e9d6495 Merge pull request #130355 from yongruilin/validation_origin
validation: Add Origin field to field.Error for more precise error tracking
2025-02-28 00:04:23 -08:00
yongruilin
a488509197 test: Improve error comparison in resource validation tests
Replace manual error logging with cmp.Diff for more precise error comparisons, using cmpopts to ignore Origin field and support UniqueString comparison.
2025-02-27 05:20:54 +00:00
Kubernetes Prow Robot
a18b4a8d97 Merge pull request #129158 from LionelJouin/fix-128831
Fix ResourceClaim status API inconsistency
2025-02-26 20:32:30 -08:00
googs1025
f540197768 chore: add more error info for validateResourceSliceSpec 2025-02-22 08:47:58 +08:00
Kubernetes Prow Robot
9bf60d06e0 Merge pull request #129219 from danwinship/networkdevicedata-validation
Require canonicalization of NetworkDeviceData IPs
2025-02-20 16:14:26 -08:00
Dan Winship
2636aa35e3 Require canonicalization of NetworkDeviceData IPs
There's no reason to allow non-standard or non-canonical IP values in
new APIs.
2025-02-20 12:49:03 -05:00
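A minimal sketch of what "canonical" could mean for the IP values in commit 2636aa35e3, assuming plain address strings (the real NetworkDeviceData field format and the validation helper Kubernetes uses may differ). The function name isCanonicalIP is hypothetical. The check is a round trip: parse the string with the standard library and require that formatting the result gives back the same text.

```go
package main

import (
	"fmt"
	"net/netip"
)

// isCanonicalIP is a hypothetical helper, not the Kubernetes implementation.
// It reports whether s parses as an IP address and is already written in the
// form that netip.Addr.String produces for that address.
func isCanonicalIP(s string) bool {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		// Not a valid IP address at all (this includes IPv4 octets
		// with leading zeros, which netip rejects).
		return false
	}
	// Canonical means parsing and re-formatting changes nothing.
	return addr.String() == s
}

func main() {
	inputs := []string{
		"192.0.2.5",
		"2001:db8::1",
		"2001:DB8::1",          // upper-case hex digits
		"2001:db8:0:0:0:0:0:1", // zero run not compressed
		"192.0.2.05",           // leading zero in an octet
	}
	for _, s := range inputs {
		fmt.Printf("%-22s canonical=%v\n", s, isCanonicalIP(s))
	}
}
```

Only the first two inputs pass; the others either fail to parse or differ from their re-formatted form.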
Kubernetes Prow Robot
481cc1a392 Merge pull request #129560 from bart0sh/PR168-DRA-fix-All-allocation-mode
DRA: fix allocation mode `All`
2025-02-05 00:38:16 -08:00
Ed Bartosh
829fa63b5b DRA: fix allocation mode All
`All` allocation mode should mean 'at least one' for DRA.
Allocation should fail if `All` devices requested and none found.
2025-01-30 16:34:25 +02:00
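The rule from commit 829fa63b5b ("All" means at least one) can be illustrated with a small sketch. Devices are represented as plain strings, and the names allocateAll and errNoDevices are hypothetical; the actual allocator is considerably more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoDevices is a hypothetical error for an "All" request that matched nothing.
var errNoDevices = errors.New(`allocation mode "All": no matching devices found`)

// allocateAll is an illustrative stand-in for the allocator. It returns every
// matching device, and fails when there is none, because "All" is taken to
// mean "at least one".
func allocateAll(matching []string) ([]string, error) {
	if len(matching) == 0 {
		return nil, errNoDevices
	}
	// Return a copy so callers cannot modify the input slice.
	return append([]string(nil), matching...), nil
}

func main() {
	devices, err := allocateAll([]string{"gpu-0", "gpu-1"})
	fmt.Println(devices, err)

	devices, err = allocateAll(nil)
	fmt.Println(devices, err)
}
```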
Patrick Ohly
2cc3dbf225 DRA CEL: add missing size estimator
Not implementing a size estimator had the effect that strings retrieved from
the attributes were treated as "unknown size", leading to wildly overestimating
the cost and validation errors even for simple expressions like this:

    device.attributes["qat.intel.com"].services.matches("[^a]?sym")

The maximum number of elements in maps was ignored and the maximum length of the
driver name string was missing. Pre-defined types like
apiservercel.StringType must be avoided because they are defined as having
a zero maximum size.
2025-01-16 16:36:43 +01:00
Patrick Ohly
1cee3682da DRA API: bump maximum size of ReservedFor to 256
The original limit of 32 seemed sufficient for a single GPU on a node. But for
shared non-local resources it is too low. For example, a ResourceClaim might be
used to allocate an interconnect channel that connects all pods of a workload
running on several different nodes, in which case the number of pods can be
considerably larger.

256 is high enough for currently planned systems. If we need something even
higher in the future, an alternative approach might be needed to avoid
scalability problems.

Normally, increasing such a limit would have to be done incrementally over two
releases. In this case we decided on
Slack (https://kubernetes.slack.com/archives/CJUQN3E4T/p1734593174791519) to
make an exception and apply this change to current master for 1.33 and backport
it to the next 1.32.x patch release for production usage.

This breaks downgrades to a 1.32 release without this change if there are
ResourceClaims with a number of consumers > 32 in ReservedFor. In practice,
this breakage is very unlikely because there are no workloads yet which need so
many consumers and such downgrades to a previous patch release are also
unlikely. Downgrades to 1.31 already weren't supported when using DRA v1beta1.
2025-01-09 14:26:01 +01:00
Lionel Jouin
5f4d646ea3 Add Device status const comments
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-12-29 12:29:58 +01:00
Lionel Jouin
1d13ff2a05 make update
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-12-14 19:00:06 +01:00
Lionel Jouin
11d68ecc4e ResourceClaim.Status.Devices.Data as pointer
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-12-14 18:59:59 +01:00
Lionel Jouin
ca5f1deed4 Fix ResourceClaim status API inconsistency
* Add constant for limits
* Fix comments in API

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-12-13 14:44:09 +01:00
Patrick Ohly
8a908e0c0b remove import doc comments
The "// import <path>" comment has been superseded by Go modules.
We don't have to remove them, but doing so has some advantages:

- They are used inconsistently, which is confusing.
- We can then also remove the (currently broken) hack/update-vanity-imports.sh.
- Last but not least, it would be a first step towards avoiding the k8s.io domain.

This commit was generated with
   sed -i -e 's;^package \(.*\) // import.*;package \1;' $(git grep -l '^package.*// import' | grep -v 'vendor/')

Everything was included, except for
   package labels // import k8s.io/kubernetes/pkg/util/labels
because that package is marked as "read-only".
2024-12-02 16:59:34 +01:00
AxeZhan
3075a9ae96 DRA API: validate node selector labels
Previously, ValidateNodeSelector did not check that labels are valid. Now it
does for resource.k8s.io, regardless of whether an object was already created with
invalid labels in an earlier Kubernetes release. Theoretically this is a
breaking change and could cause problems during an upgrade, but that is highly
unlikely in practice.

In contrast to node affinity, DRA does not ignore parse errors
(= uses NewNodeSelector, not NewLazyErrorNodeSelector), so invalid labels would
have been found instead of being silently ignored.

Even if some object has invalid labels, this only affects an alpha -> beta
upgrade which isn't guaranteed to work seamlessly.
2024-11-22 09:10:02 +01:00
Lionel Jouin
118356175d [KEP-4817] Add limits on conditions and IPs + fix documentation
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 22:18:53 +01:00
Lionel Jouin
d28b50e0a0 [KEP-4817] make update
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 10:36:09 +01:00
Lionel Jouin
39f55e1cd0 [KEP-4817] Add data length limit (from #128601)
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 10:35:29 +01:00
Lionel Jouin
4b76ba1a87 [KEP-4817] Rename Addresses to IPs
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:59:56 +01:00
Lionel Jouin
43d23b8994 [KEP-4817] Use structured.MakeDeviceID
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:59:56 +01:00
Lionel Jouin
8ab33b8413 [KEP-4817] Improve NetworkData Validation
* Add max length for InterfaceName and HardwareAddress
* Prevent duplicated Addresses

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:59:56 +01:00
Lionel Jouin
a062f91106 [KEP-4817] Fixes based on review
* Rename HWAddress to HardwareAddress
* Fix condition validation
* Remove feature gate validation
* Fix drop field on disabled feature gate

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:59:56 +01:00
Lionel Jouin
5df47a64d3 [KEP-4817] Remove unnecessary DeepCopy in validation tests
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:59:56 +01:00
Lionel Jouin
cb9ee1d4fe [KEP-4817] Remove pointer on Data, InterfaceName and HWAddress fields
Adapt validation and tests based on these API changes

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:59:51 +01:00
Lionel Jouin
5d7a16b0a5 [KEP-4817] improve testing
* Test feature-gate enabled/disabled for validation
* Test pkg/registry/resource/resourceclaim
* Add Data and NetworkData to integration test

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:54:19 +01:00
Lionel Jouin
4bd62e5234 [KEP-4817] Fix fuzz API tests and ./hack/update-featuregates.sh
Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:54:19 +01:00
Lionel Jouin
3e595db0af [KEP-4817] API, validation and feature-gate
* Add status
* Add validation to check if fields are correct (Network field, device
  has been allocated)
* Add feature-gate
* Drop field if feature-gate not set

Signed-off-by: Lionel Jouin <lionel.jouin@est.tech>
2024-11-07 09:54:17 +01:00
Patrick Ohly
446f20aa3e DRA API: add maximum length of opaque parameters
This had been left out unintentionally earlier. Because theoretically there
might now be existing objects with parameters that are larger than whatever
limit gets enforced now, the limit only gets checked when parameters get
created or modified.

This is similar to the validation of CEL expressions and for consistency, the
same 10 Ki limit as for those is chosen.

Because the limit is not enforced for stored parameters, it can be increased in
the future, with the caveat that users who need larger parameters then depend
on the newer Kubernetes release with a higher limit. Lowering the limit is
harder because creating deployments that worked in older Kubernetes will not
work anymore with newer Kubernetes.
2024-11-06 17:29:51 +01:00
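The "only checked when parameters get created or modified" behavior described in commit 446f20aa3e can be sketched as follows. This is an illustration under assumptions: the function name, the constant name, and the use of raw byte slices are hypothetical, and only the 10 Ki figure comes from the commit message.

```go
package main

import (
	"bytes"
	"fmt"
)

// maxOpaqueParametersLength is a hypothetical constant name; the 10 Ki value
// is the limit named in the commit message.
const maxOpaqueParametersLength = 10 * 1024

// validateOpaqueParameters is an illustrative sketch. oldParams is nil on
// create. On update, parameters that are byte-for-byte unchanged are accepted
// even if they exceed the limit, so objects stored before the limit existed
// remain updatable.
func validateOpaqueParameters(newParams, oldParams []byte) error {
	if oldParams != nil && bytes.Equal(newParams, oldParams) {
		// Unchanged stored value: the limit is not enforced.
		return nil
	}
	if len(newParams) > maxOpaqueParametersLength {
		return fmt.Errorf("parameters are %d bytes, limit is %d",
			len(newParams), maxOpaqueParametersLength)
	}
	return nil
}

func main() {
	tooBig := make([]byte, maxOpaqueParametersLength+1)

	fmt.Println("create oversized:  ", validateOpaqueParameters(tooBig, nil))
	fmt.Println("update, unchanged: ", validateOpaqueParameters(tooBig, tooBig))
	fmt.Println("update, modified:  ", validateOpaqueParameters(append(tooBig, 'x'), tooBig))
}
```

Because stored values are never re-checked, raising the constant later would not invalidate existing objects, which matches the reasoning in the commit message.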
Patrick Ohly
30f5282656 DRA API: rename DeviceCapacity.Quantity to DeviceCapacity.Value
Based on review
feedback (https://github.com/kubernetes/kubernetes/pull/127511#discussion_r1823521172).
2024-11-06 13:03:20 +01:00
Patrick Ohly
81fd64256c DRA API: use DeviceCapacity struct instead of plain Quantity
This enables a future extension where capacity of a single device gets consumed
by different claims. The semantics without any additional fields are the same as
before: a capacity cannot be split up and is only an attribute of a device.

Because it's semantically the same as before, two-way conversion to v1alpha3 is
possible.
2024-11-06 13:03:19 +01:00
Patrick Ohly
142319bd92 DRA API: use v1beta1 as storage version
This is meant to make it easier to remove the v1alpha3 because it won't be used
in clusters that started with DRA as beta in Kubernetes 1.32 when all clients
support v1beta1.
2024-11-06 13:03:19 +01:00
Patrick Ohly
2e64c72249 DRA API: register v1beta1
This is the minimal set of changes that are needed to make the new version
usable. The storage version is still v1alpha3. More changes will follow.
2024-11-06 13:03:18 +01:00
Patrick Ohly
d685064ff7 DRA API: search/replace v1alpha3 -> v1beta1 2024-11-06 13:03:18 +01:00
Patrick Ohly
f1e5616f05 DRA API: verbatim copy of v1alpha3 -> v1beta1 2024-11-06 13:03:18 +01:00
Patrick Ohly
99acb67c68 DRA API: enhance validation testing
The line coverage is now at 98.5% and several more corner cases are
covered. The remaining lines are hard or impossible to reach.

The actual validation is the same as before, with some small tweaks to the
generated errors.

When failures are not as expected, it is useful to show what the expected and
actual failures look like to a user. Perhaps even better would be to put the
expected texts into the test files instead of the error structs. That would
be easier to review and shorter.
2024-11-06 13:03:18 +01:00
Patrick Ohly
51d5992335 DRA API: fix some comments
Wording in one case was wrong. The tombstone comment should use
the same field definition as before the removal.
2024-11-06 11:05:05 +01:00
Tim Hockin
c8eeb486f4 Call-site comments: the "" arg to TooLong is unused 2024-11-05 15:10:24 -08:00
Tim Hockin
8a7af90300 Clarify that value arg to field.TooLong is unused 2024-11-05 15:10:23 -08:00
Tim Hockin
4d0e1c8fd4 Kill TooLongMaxLength() in favor of TooLong() 2024-11-05 15:10:22 -08:00
Kubernetes Prow Robot
daef8c2419 Merge pull request #127266 from pohly/dra-admin-access-in-status
DRA API: AdminAccess in DeviceRequestAllocationResult + DRAAdminAccess feature gate
2024-10-30 03:41:25 +00:00
Patrick Ohly
4419568259 DRA: treat AdminAccess as a new feature gated field
Using the "normal" logic for a feature gated field simplifies the
implementation of the feature gate.

There is one (entirely theoretical!) problem with updating from 1.31: if a claim
was allocated in 1.31 with admin access, the status field was not set because
it didn't exist yet. If a driver now follows the current definition of "unset =
off", then it will not grant admin access even though it should. This is
theoretical because drivers are starting to support admin access with 1.32, so
there shouldn't be any claim where this problem could occur.
2024-10-29 10:22:31 +01:00
Patrick Ohly
9a7e4ccab2 DRA admin access: add feature gate
The new DRAAdminAccess feature gate has the following effects:
- If disabled in the apiserver, the spec.devices.requests[*].adminAccess
  field gets cleared. Same in the status. In both cases the scenario
  that it was already set and a claim or claim template get updated
  is special: in those cases, the field is not cleared.

  Also, allocating a claim with admin access is allowed regardless of the
  feature gate and the field is not cleared. In practice, the scheduler
  will not do that.
- If disabled in the resource claim controller, creating ResourceClaims
  with the field set gets rejected. This prevents running workloads
  which depend on admin access.
- If disabled in the scheduler, claims with admin access don't get
  allocated. The effect is the same.

The alternative would have been to ignore the fields in claim controller and
scheduler. This is bad because a monitoring workload then runs, blocking
resources that probably were meant for production workloads.
2024-10-29 09:50:11 +01:00
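The apiserver behavior described in commit 9a7e4ccab2 (clear the field when the gate is disabled, except on updates where it was already set) can be sketched as below. The type deviceRequest and both function names are hypothetical simplifications, not the actual Kubernetes types or registry code.

```go
package main

import "fmt"

// deviceRequest is a hypothetical, simplified stand-in for one entry of
// spec.devices.requests.
type deviceRequest struct {
	Name        string
	AdminAccess *bool
}

// adminAccessInUse reports whether any request already has the field set.
func adminAccessInUse(reqs []deviceRequest) bool {
	for _, r := range reqs {
		if r.AdminAccess != nil {
			return true
		}
	}
	return false
}

// dropDisabledAdminAccess clears AdminAccess in newReqs when the feature gate
// is disabled. oldReqs is nil on create. If the old object already used the
// field, the new object is left untouched.
func dropDisabledAdminAccess(newReqs, oldReqs []deviceRequest, gateEnabled bool) {
	if gateEnabled || adminAccessInUse(oldReqs) {
		return
	}
	for i := range newReqs {
		newReqs[i].AdminAccess = nil
	}
}

func main() {
	on := true

	created := []deviceRequest{{Name: "req-0", AdminAccess: &on}}
	dropDisabledAdminAccess(created, nil, false)
	fmt.Println("create, gate off: field kept =", created[0].AdminAccess != nil)

	old := []deviceRequest{{Name: "req-0", AdminAccess: &on}}
	updated := []deviceRequest{{Name: "req-0", AdminAccess: &on}}
	dropDisabledAdminAccess(updated, old, false)
	fmt.Println("update, gate off: field kept =", updated[0].AdminAccess != nil)
}
```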