mirror of
				https://github.com/optim-enterprises-bv/kubernetes.git
				synced 2025-11-04 04:08:16 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			281 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			281 lines
		
	
	
		
			13 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
 | 
						|
 | 
						|
<!-- BEGIN STRIP_FOR_RELEASE -->
 | 
						|
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
 | 
						|
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
 | 
						|
 | 
						|
If you are using a released version of Kubernetes, you should
 | 
						|
refer to the docs that go with that version.
 | 
						|
 | 
						|
<!-- TAG RELEASE_LINK, added by the munger automatically -->
 | 
						|
<strong>
 | 
						|
The latest release of this document can be found
 | 
						|
[here](http://releases.k8s.io/release-1.4/docs/design/nodeaffinity.md).
 | 
						|
 | 
						|
Documentation for other releases can be found at
 | 
						|
[releases.k8s.io](http://releases.k8s.io).
 | 
						|
</strong>
 | 
						|
--
 | 
						|
 | 
						|
<!-- END STRIP_FOR_RELEASE -->
 | 
						|
 | 
						|
<!-- END MUNGE: UNVERSIONED_WARNING -->
 | 
						|
 | 
						|
# Node affinity and NodeSelector
 | 
						|
 | 
						|
## Introduction
 | 
						|
 | 
						|
This document proposes a new label selector representation, called
 | 
						|
`NodeSelector`, that is similar in many ways to `LabelSelector`, but is a bit
 | 
						|
more flexible and is intended to be used only for selecting nodes.
 | 
						|
 | 
						|
In addition, we propose to replace the `map[string]string` in `PodSpec` that the
 | 
						|
scheduler currently uses as part of restricting the set of nodes onto which a
 | 
						|
pod is eligible to schedule, with a field of type `Affinity` that contains one
 | 
						|
or more affinity specifications. In this document we discuss `NodeAffinity`,
 | 
						|
which contains one or more of the following:
 | 
						|
* a field called `RequiredDuringSchedulingRequiredDuringExecution` that will be
 | 
						|
represented by a `NodeSelector`, and thus generalizes the scheduling behavior of
 | 
						|
the current `map[string]string` but still serves the purpose of restricting
 | 
						|
the set of nodes onto which the pod can schedule. In addition, unlike the
 | 
						|
behavior of the current `map[string]string`, when it becomes violated the system
 | 
						|
will try to eventually evict the pod from its node.
 | 
						|
* a field called `RequiredDuringSchedulingIgnoredDuringExecution` which is
 | 
						|
identical to `RequiredDuringSchedulingRequiredDuringExecution` except that the
 | 
						|
system may or may not try to eventually evict the pod from its node.
 | 
						|
* a field called `PreferredDuringSchedulingIgnoredDuringExecution` that
 | 
						|
specifies which nodes are preferred for scheduling among those that meet all
 | 
						|
scheduling requirements.
 | 
						|
 | 
						|
(In practice, as discussed later, we will actually *add* the `Affinity` field
 | 
						|
rather than replacing `map[string]string`, due to backward compatibility
 | 
						|
requirements.)
 | 
						|
 | 
						|
The affinity specifications described above allow a pod to request various
 | 
						|
properties that are inherent to nodes, for example "run this pod on a node with
 | 
						|
an Intel CPU" or, in a multi-zone cluster, "run this pod on a node in zone Z."
 | 
						|
([This issue](https://github.com/kubernetes/kubernetes/issues/9044) describes
 | 
						|
some of the properties that a node might publish as labels, which affinity
 | 
						|
expressions can match against.) They do *not* allow a pod to request to schedule
 | 
						|
(or not schedule) on a node based on what other pods are running on the node.
 | 
						|
That feature is called "inter-pod topological affinity/anti-affinity" and is
 | 
						|
described [here](https://github.com/kubernetes/kubernetes/pull/18265).
 | 
						|
 | 
						|
## API
 | 
						|
 | 
						|
### NodeSelector
 | 
						|
 | 
						|
```go
 | 
						|
// A node selector represents the union of the results of one or more label queries
 | 
						|
// over a set of nodes; that is, it represents the OR of the selectors represented
 | 
						|
// by the nodeSelectorTerms.
 | 
						|
type NodeSelector struct {
 | 
						|
	// nodeSelectorTerms is a list of node selector terms. The terms are ORed.
 | 
						|
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms,omitempty"`
 | 
						|
}
 | 
						|
 | 
						|
// An empty node selector term matches all objects. A null node selector term
 | 
						|
// matches no objects.
 | 
						|
type NodeSelectorTerm struct {
 | 
						|
	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
 | 
						|
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
 | 
						|
}
 | 
						|
 | 
						|
// A node selector requirement is a selector that contains values, a key, and an operator
 | 
						|
// that relates the key and values.
 | 
						|
type NodeSelectorRequirement struct {
 | 
						|
	// key is the label key that the selector applies to.
 | 
						|
	Key string `json:"key" patchStrategy:"merge" patchMergeKey:"key"`
 | 
						|
	// operator represents a key's relationship to a set of values.
 | 
						|
	// Valid operators are In, NotIn, Exists, DoesNotExist. Gt, and Lt.
 | 
						|
	Operator NodeSelectorOperator `json:"operator"`
 | 
						|
	// values is an array of string values. If the operator is In or NotIn,
 | 
						|
	// the values array must be non-empty. If the operator is Exists or DoesNotExist,
 | 
						|
	// the values array must be empty. If the operator is Gt or Lt, the values
 | 
						|
	// array must have a single element, which will be interpreted as an integer.
 | 
						|
    // This array is replaced during a strategic merge patch.
 | 
						|
	Values []string `json:"values,omitempty"`
 | 
						|
}
 | 
						|
 | 
						|
// A node selector operator is the set of operators that can be used in
 | 
						|
// a node selector requirement.
 | 
						|
type NodeSelectorOperator string
 | 
						|
 | 
						|
const (
 | 
						|
	NodeSelectorOpIn           NodeSelectorOperator = "In"
 | 
						|
	NodeSelectorOpNotIn        NodeSelectorOperator = "NotIn"
 | 
						|
	NodeSelectorOpExists       NodeSelectorOperator = "Exists"
 | 
						|
	NodeSelectorOpDoesNotExist NodeSelectorOperator = "DoesNotExist"
 | 
						|
	NodeSelectorOpGt           NodeSelectorOperator = "Gt"
 | 
						|
	NodeSelectorOpLt           NodeSelectorOperator = "Lt"
 | 
						|
)
 | 
						|
```
 | 
						|
 | 
						|
### NodeAffinity
 | 
						|
 | 
						|
We will add one field to `PodSpec`
 | 
						|
 | 
						|
```go
 | 
						|
Affinity *Affinity  `json:"affinity,omitempty"`
 | 
						|
```
 | 
						|
 | 
						|
The `Affinity` type is defined as follows
 | 
						|
 | 
						|
```go
 | 
						|
type Affinity struct {
 | 
						|
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
 | 
						|
}
 | 
						|
 | 
						|
type NodeAffinity struct {
 | 
						|
	// If the affinity requirements specified by this field are not met at
 | 
						|
	// scheduling time, the pod will not be scheduled onto the node.
 | 
						|
	// If the affinity requirements specified by this field cease to be met
 | 
						|
	// at some point during pod execution (e.g. due to a node label update),
 | 
						|
	// the system will try to eventually evict the pod from its node.
 | 
						|
	RequiredDuringSchedulingRequiredDuringExecution *NodeSelector  `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
 | 
						|
	// If the affinity requirements specified by this field are not met at
 | 
						|
	// scheduling time, the pod will not be scheduled onto the node.
 | 
						|
	// If the affinity requirements specified by this field cease to be met
 | 
						|
	// at some point during pod execution (e.g. due to a node label update),
 | 
						|
	// the system may or may not try to eventually evict the pod from its node.
 | 
						|
	RequiredDuringSchedulingIgnoredDuringExecution  *NodeSelector  `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
 | 
						|
	// The scheduler will prefer to schedule pods to nodes that satisfy
 | 
						|
	// the affinity expressions specified by this field, but it may choose
 | 
						|
	// a node that violates one or more of the expressions. The node that is
 | 
						|
	// most preferred is the one with the greatest sum of weights, i.e.
 | 
						|
	// for each node that meets all of the scheduling requirements (resource
 | 
						|
	// request, RequiredDuringScheduling affinity expressions, etc.),
 | 
						|
	// compute a sum by iterating through the elements of this field and adding
 | 
						|
	// "weight" to the sum if the node matches the corresponding MatchExpressions; the
 | 
						|
	// node(s) with the highest sum are the most preferred.
 | 
						|
	PreferredDuringSchedulingIgnoredDuringExecution []PreferredSchedulingTerm  `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
 | 
						|
}
 | 
						|
 | 
						|
// An empty preferred scheduling term matches all objects with implicit weight 0
 | 
						|
// (i.e. it's a no-op). A null preferred scheduling term matches no objects.
 | 
						|
type PreferredSchedulingTerm struct {
 | 
						|
    // weight is in the range 1-100
 | 
						|
	Weight int  `json:"weight"`
 | 
						|
	// matchExpressions is a list of node selector requirements. The requirements are ANDed.
 | 
						|
	MatchExpressions []NodeSelectorRequirement  `json:"matchExpressions,omitempty"`
 | 
						|
}
 | 
						|
```
 | 
						|
 | 
						|
Unfortunately, the name of the existing `map[string]string` field in PodSpec is
 | 
						|
`NodeSelector` and we can't change it since this name is part of the API.
 | 
						|
Hopefully this won't cause too much confusion.
 | 
						|
 | 
						|
## Examples
 | 
						|
 | 
						|
** TODO: fill in this section **
 | 
						|
 | 
						|
* Run this pod on a node with an Intel or AMD CPU
 | 
						|
 | 
						|
* Run this pod on a node in availability zone Z
 | 
						|
 | 
						|
 | 
						|
## Backward compatibility
 | 
						|
 | 
						|
When we add `Affinity` to PodSpec, we will deprecate, but not remove, the
 | 
						|
current field in PodSpec
 | 
						|
 | 
						|
```go
 | 
						|
NodeSelector map[string]string `json:"nodeSelector,omitempty"`
 | 
						|
```
 | 
						|
 | 
						|
Old version of the scheduler will ignore the `Affinity` field. New versions of
 | 
						|
the scheduler will apply their scheduling predicates to both `Affinity` and
 | 
						|
`nodeSelector`, i.e. the pod can only schedule onto nodes that satisfy both sets
 | 
						|
of requirements. We will not attempt to convert between `Affinity` and
 | 
						|
`nodeSelector`.
 | 
						|
 | 
						|
Old versions of non-scheduling clients will not know how to do anything
 | 
						|
semantically meaningful with `Affinity`, but we don't expect that this will
 | 
						|
cause a problem.
 | 
						|
 | 
						|
See [this comment](https://github.com/kubernetes/kubernetes/issues/341#issuecomment-140809259)
 | 
						|
for more discussion.
 | 
						|
 | 
						|
Users should not start using `NodeAffinity` until the full implementation has
 | 
						|
been in Kubelet and the master for enough binary versions that we feel
 | 
						|
comfortable that we will not need to roll back either Kubelet or master to a
 | 
						|
version that does not support them. Longer-term we will use a programatic
 | 
						|
approach to enforcing this ([#4855](https://github.com/kubernetes/kubernetes/issues/4855)).
 | 
						|
 | 
						|
## Implementation plan
 | 
						|
 | 
						|
1. Add the `Affinity` field to PodSpec and the `NodeAffinity`,
 | 
						|
`PreferredDuringSchedulingIgnoredDuringExecution`, and
 | 
						|
`RequiredDuringSchedulingIgnoredDuringExecution` types to the API.
 | 
						|
2. Implement a scheduler predicate that takes
 | 
						|
`RequiredDuringSchedulingIgnoredDuringExecution` into account.
 | 
						|
3. Implement a scheduler priority function that takes
 | 
						|
`PreferredDuringSchedulingIgnoredDuringExecution` into account.
 | 
						|
4. At this point, the feature can be deployed and `PodSpec.NodeSelector` can be
 | 
						|
marked as deprecated.
 | 
						|
5. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to the API.
 | 
						|
6. Modify the scheduler predicate from step 2 to also take
 | 
						|
`RequiredDuringSchedulingRequiredDuringExecution` into account.
 | 
						|
7. Add `RequiredDuringSchedulingRequiredDuringExecution` to Kubelet's admission
 | 
						|
decision.
 | 
						|
8. Implement code in Kubelet *or* the controllers that evicts a pod that no
 | 
						|
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution` (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
 | 
						|
 | 
						|
We assume Kubelet publishes labels describing the node's membership in all of
 | 
						|
the relevant scheduling domains (e.g. node name, rack name, availability zone
 | 
						|
name, etc.). See [#9044](https://github.com/kubernetes/kubernetes/issues/9044).
 | 
						|
 | 
						|
## Extensibility
 | 
						|
 | 
						|
The design described here is the result of careful analysis of use cases, a
 | 
						|
decade of experience with Borg at Google, and a review of similar features in
 | 
						|
other open-source container orchestration systems. We believe that it properly
 | 
						|
balances the goal of expressiveness against the goals of simplicity and
 | 
						|
efficiency of implementation. However, we recognize that use cases may arise in
 | 
						|
the future that cannot be expressed using the syntax described here. Although we
 | 
						|
are not implementing an affinity-specific extensibility mechanism for a variety
 | 
						|
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
 | 
						|
for Kubernetes users to get a consistent experience, etc.), the regular
 | 
						|
Kubernetes annotation mechanism can be used to add or replace affinity rules.
 | 
						|
The way this work would is:
 | 
						|
 | 
						|
1. Define one or more annotations to describe the new affinity rule(s)
 | 
						|
1. User (or an admission controller) attaches the annotation(s) to pods to
 | 
						|
request the desired scheduling behavior. If the new rule(s) *replace* one or
 | 
						|
more fields of `Affinity` then the user would omit those fields from `Affinity`;
 | 
						|
if they are *additional rules*, then the user would fill in `Affinity` as well
 | 
						|
as the annotation(s).
 | 
						|
1. Scheduler takes the annotation(s) into account when scheduling.
 | 
						|
 | 
						|
If some particular new syntax becomes popular, we would consider upstreaming it
 | 
						|
by integrating it into the standard `Affinity`.
 | 
						|
 | 
						|
## Future work
 | 
						|
 | 
						|
Are there any other fields we should convert from `map[string]string` to
 | 
						|
`NodeSelector`?
 | 
						|
 | 
						|
## Related issues
 | 
						|
 | 
						|
The review for this proposal is in [#18261](https://github.com/kubernetes/kubernetes/issues/18261).
 | 
						|
 | 
						|
The main related issue is [#341](https://github.com/kubernetes/kubernetes/issues/341).
 | 
						|
Issue [#367](https://github.com/kubernetes/kubernetes/issues/367) is also related.
 | 
						|
Those issues reference other related issues.
 | 
						|
 | 
						|
 | 
						|
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
 | 
						|
[]()
 | 
						|
<!-- END MUNGE: GENERATED_ANALYTICS -->
 |