<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<strong>
Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Inter-pod topological affinity and anti-affinity

## Introduction

NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.

This document describes a proposal for specifying and implementing inter-pod topological affinity and
anti-affinity. By that we mean: rules that specify that certain pods should be placed
in the same topological domain (e.g. same node, same rack, same zone, same
power domain, etc.) as some other pods, or, conversely, should *not* be placed in the
same topological domain as some other pods.

Here are a few example rules; we explain how to express them using the API described
in this doc later, in the section "Examples."
* Affinity
  * Co-locate the pods from a particular service or Job in the same availability zone,
	without specifying which zone that should be.
  * Co-locate the pods from service S1 with pods from service S2 because S1 uses S2
	and thus it is useful to minimize the network latency between them. Co-location
	might mean same nodes and/or same availability zone.
* Anti-affinity
  * Spread the pods of a service across nodes and/or availability zones,
	e.g. to reduce correlated failures.
  * Give a pod "exclusive" access to a node to guarantee resource isolation -- it must never share the node with other pods.
  * Don't schedule the pods of a particular service on the same nodes as pods of
	another service that are known to interfere with the performance of the pods of the first service.

For both affinity and anti-affinity, there are three variants. Two variants have the
property of requiring the affinity/anti-affinity to be satisfied for the pod to be allowed
to schedule onto a node; the difference between them is that if the condition ceases to
be met later on at runtime, for one of them the system will try to eventually evict the pod,
while for the other the system may not try to do so. The third variant
simply provides scheduling-time *hints* that the scheduler will try
to satisfy but may not be able to. These three variants are directly analogous to the three
variants of [node affinity](nodeaffinity.md).

Note that this proposal is only about *inter-pod* topological affinity and anti-affinity.
There are other forms of topological affinity and anti-affinity. For example,
you can use [node affinity](nodeaffinity.md) to require (prefer)
that a set of pods all be scheduled in some specific zone Z. Node affinity is not
capable of expressing inter-pod dependencies, and conversely the API
we describe in this document is not capable of expressing node affinity rules.
For simplicity, we will use the terms "affinity" and "anti-affinity" to mean
"inter-pod topological affinity" and "inter-pod topological anti-affinity," respectively,
in the remainder of this document.

## API

We will add one field to `PodSpec`:

```go
Affinity *Affinity `json:"affinity,omitempty"`
```

The `Affinity` type is defined as follows:

```go
type Affinity struct {
	PodAffinity     *PodAffinity     `json:"podAffinity,omitempty"`
	PodAntiAffinity *PodAntiAffinity `json:"podAntiAffinity,omitempty"`
}

type PodAffinity struct {
	// If the affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a pod label update), the
	// system will try to eventually evict the pod from its node.
	// When there are multiple elements, the lists of nodes corresponding to each
	// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
	RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
	// If the affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a pod label update), the
	// system may or may not try to eventually evict the pod from its node.
	// When there are multiple elements, the lists of nodes corresponding to each
	// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
	RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
	// The scheduler will prefer to schedule pods to nodes that satisfy
	// the affinity expressions specified by this field, but it may choose
	// a node that violates one or more of the expressions. The node that is
	// most preferred is the one with the greatest sum of weights, i.e.
	// for each node that meets all of the scheduling requirements (resource
	// request, RequiredDuringScheduling affinity expressions, etc.),
	// compute a sum by iterating through the elements of this field and adding
	// "weight" to the sum if the node matches the corresponding MatchExpressions; the
	// node(s) with the highest sum are the most preferred.
	PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type PodAntiAffinity struct {
	// If the anti-affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the anti-affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a pod label update), the
	// system will try to eventually evict the pod from its node.
	// When there are multiple elements, the lists of nodes corresponding to each
	// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
	RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
	// If the anti-affinity requirements specified by this field are not met at
	// scheduling time, the pod will not be scheduled onto the node.
	// If the anti-affinity requirements specified by this field cease to be met
	// at some point during pod execution (e.g. due to a pod label update), the
	// system may or may not try to eventually evict the pod from its node.
	// When there are multiple elements, the lists of nodes corresponding to each
	// PodAffinityTerm are intersected, i.e. all terms must be satisfied.
	RequiredDuringSchedulingIgnoredDuringExecution []PodAffinityTerm `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
	// The scheduler will prefer to schedule pods to nodes that satisfy
	// the anti-affinity expressions specified by this field, but it may choose
	// a node that violates one or more of the expressions. The node that is
	// most preferred is the one with the greatest sum of weights, i.e.
	// for each node that meets all of the scheduling requirements (resource
	// request, RequiredDuringScheduling anti-affinity expressions, etc.),
	// compute a sum by iterating through the elements of this field and adding
	// "weight" to the sum if the node matches the corresponding MatchExpressions; the
	// node(s) with the highest sum are the most preferred.
	PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type WeightedPodAffinityTerm struct {
	// weight is in the range 1-100
	Weight          int             `json:"weight"`
	PodAffinityTerm PodAffinityTerm `json:"podAffinityTerm"`
}

type PodAffinityTerm struct {
	LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
	// namespaces specifies which namespaces the LabelSelector applies to (matches against);
	// nil list means "this pod's namespace," empty list means "all namespaces"
	// The json tag here is not "omitempty" since we need to distinguish nil and empty.
	// See https://golang.org/pkg/encoding/json/#Marshal for more details.
	Namespaces []api.Namespace `json:"namespaces"`
	// empty topology key is interpreted by the scheduler as "all topologies"
	TopologyKey string `json:"topologyKey,omitempty"`
}
```

Note that the `Namespaces` field is necessary because normal `LabelSelector` is scoped
to the pod's namespace, but we need to be able to match against all pods globally.

To explain how this API works, let's say that the `PodSpec` of a pod `P` has an `Affinity`
that is configured as follows (note that we've omitted and collapsed some fields for
simplicity, but this should sufficiently convey the intent of the design):

```go
PodAffinity {
	RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}},
	PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}},
}
PodAntiAffinity {
	RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}},
	PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}}
}
```

Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`. (Assumes all nodes have a label with key "node" and value specifying their node name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`. (Assumes all nodes have a label with key "zone" and value specifying their zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P3`. (Assumes all nodes have a label with key "rack" and value specifying their rack name.)
* Should try not to schedule P onto any power domains that are running pods that satisfy `P4`. (Assumes all nodes have a label with key "power" and value specifying their power domain.)
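
For concreteness, the collapsed example above could be written out against the full field names roughly as follows. This is a sketch, not part of the proposal: the `service` label values and the weight of 50 are made up, we use the `IgnoredDuringExecution` variants, and we assume `LabelSelector` is the existing Kubernetes label selector type with a `MatchLabels` field.

```go
// Sketch only: pod P's Affinity from the example above, spelled out.
var podPAffinity = &Affinity{
	PodAffinity: &PodAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
			// P1: pods we must share a node with
			{LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "s1"}}, TopologyKey: "node"},
		},
		PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
			// P2: pods we would like to share a zone with
			{Weight: 50, PodAffinityTerm: PodAffinityTerm{
				LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "s2"}},
				TopologyKey:   "zone",
			}},
		},
	},
	PodAntiAffinity: &PodAntiAffinity{
		RequiredDuringSchedulingIgnoredDuringExecution: []PodAffinityTerm{
			// P3: pods we must not share a rack with
			{LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "s3"}}, TopologyKey: "rack"},
		},
		PreferredDuringSchedulingIgnoredDuringExecution: []WeightedPodAffinityTerm{
			// P4: pods we would prefer not to share a power domain with
			{Weight: 50, PodAffinityTerm: PodAffinityTerm{
				LabelSelector: &LabelSelector{MatchLabels: map[string]string{"service": "s4"}},
				TopologyKey:   "power",
			}},
		},
	},
}
```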

When `RequiredDuringScheduling` has multiple elements, the requirements are ANDed.
For `PreferredDuringScheduling` the weights are added for the terms that are satisfied for each node, and
the node(s) with the highest weight(s) are the most preferred.

In reality there are two variants of `RequiredDuringScheduling`: one suffixed with
`RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`. For the
first variant, if the affinity/anti-affinity ceases to be met at some point during
pod execution (e.g. due to a pod label update), the system will try to eventually evict the pod
from its node. In the second variant, the system may or may not try to eventually
evict the pod from its node.

## A comment on symmetry

One thing that makes affinity and anti-affinity tricky is symmetry.

Imagine a cluster that is running pods from two services, S1 and S2. Imagine that the pods of S1 have a RequiredDuringScheduling anti-affinity rule
"do not run me on nodes that are running pods from S2." It is not sufficient just to check that there are no S2 pods on a node when
you are scheduling an S1 pod. You also need to ensure that there are no S1 pods on a node when you are scheduling an S2 pod,
*even though the S2 pod does not have any anti-affinity rules*. Otherwise if an S1 pod schedules before an S2 pod, the S1
pod's RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving S2 pod (a sketch of this symmetric check appears after the list below). More specifically, if S1 has the aforementioned
RequiredDuringScheduling anti-affinity rule, then
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node
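
To make the symmetric check concrete, here is a minimal sketch in Go. The `Pod` struct layout and the `matches` helper are placeholders for illustration; real scheduler code would also honor the `Namespaces` field and topology keys other than "node".

```go
// Sketch only: the symmetry check for RequiredDuringScheduling anti-affinity,
// simplified to a "node" topology. Placing candidate pod Z on a node is
// infeasible if any pod B already on that node has a required anti-affinity
// term whose selector matches Z.
func violatesExistingAntiAffinity(candidate *Pod, podsOnNode []*Pod) bool {
	for _, b := range podsOnNode {
		if b.Spec.Affinity == nil || b.Spec.Affinity.PodAntiAffinity == nil {
			continue
		}
		anti := b.Spec.Affinity.PodAntiAffinity
		terms := append(append([]PodAffinityTerm{},
			anti.RequiredDuringSchedulingRequiredDuringExecution...),
			anti.RequiredDuringSchedulingIgnoredDuringExecution...)
		for _, term := range terms {
			// matches stands in for label-selector matching plus the
			// Namespaces handling described for PodAffinityTerm.
			if matches(term, candidate) {
				return true // scheduling the candidate here would break B's rule
			}
		}
	}
	return false
}
```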

Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to schedule an S2 pod onto that node. More
specifically, if S1 has the aforementioned RequiredDuringScheduling affinity rule, then
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1 (eventually)

However, although RequiredDuringScheduling affinity is not symmetric, there is an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2" then it is not required that there be S1 pods on a node in order to schedule an S2 pod onto that node,
but it would be better if there are.

PreferredDuringScheduling is symmetric.
If the pods of S1 had a PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that are running pods from S2"
then we would prefer to keep an S1 pod that we are scheduling off of nodes that are running S2 pods, and also
to keep an S2 pod that we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that are running pods from S2" then we would prefer
to place an S1 pod that we are scheduling onto a node that is running an S2 pod, and also to place
an S2 pod that we are scheduling onto a node that is running an S1 pod.

## Examples

Here are some examples of how you would express various affinity and anti-affinity rules using the API we described.

### Affinity

In the examples below, the word "put" is intentionally ambiguous; the rules are the same
whether "put" means "must put" (RequiredDuringScheduling) or "try to put"
(PreferredDuringScheduling)--all that changes is which field the rule goes into.
Also, we only discuss scheduling time and ignore execution time.
Finally, some of the examples
use "zone" and some use "node," just to make the examples more interesting; any of the examples
with "zone" will also work for "node" if you change the `TopologyKey`, and vice-versa.

* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here. For this you should use node affinity.

* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`

* **Put the pod on a node that is already running a pod that requires a license for software package P**:
Assuming pods that require a license for software package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`

* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
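
Written out against the `PodAffinityTerm` type, and assuming the existing label selector syntax with `MatchExpressions` and an `In` operator, this last rule might look like the following sketch:

```go
// Sketch only: "same zone as other pods of my service," expressed as a term.
var sameZoneAsService = PodAffinityTerm{
	LabelSelector: &LabelSelector{
		MatchExpressions: []LabelSelectorRequirement{
			// "In" is the label selector's set-based In operator.
			{Key: "service", Operator: "In", Values: []string{"S"}},
		},
	},
	// nil Namespaces means "this pod's namespace".
	TopologyKey: "zone",
}
```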

This last example illustrates a small issue with this API when it is used
with a scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler. The RequiredDuringScheduling rule
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
only "works" once one pod from service S has been scheduled. But if all pods in service
S have this RequiredDuringScheduling rule in their PodSpec, then the RequiredDuringScheduling rule
will block the first
pod of the service from ever scheduling, since it is only allowed to run in a zone with another pod from
the same service. And of course that means none of the pods of the service will be able
to schedule. This problem *only* applies to RequiredDuringScheduling affinity, not
PreferredDuringScheduling affinity or any variant of anti-affinity.
There are at least three ways to solve this problem:
* **short-term**: have the scheduler use a rule that if the RequiredDuringScheduling affinity requirement
matches a pod's own labels, and there are no other such pods anywhere, then disregard the requirement (a sketch of this check appears after this list).
This approach has a corner case when running parallel schedulers that are allowed to
schedule pods from the same replicated set (e.g. a single PodTemplate): both schedulers may try to
schedule pods from the set
at the same time and think there are no other pods from that set scheduled yet (e.g. they are
trying to schedule the first two pods from the set), but by the time
the second binding is committed, the first one has already been committed, leaving you with
two pods running that do not respect their RequiredDuringScheduling affinity. There is no
simple way to detect this "conflict" at scheduling time given the current system implementation.
* **longer-term**: when a controller creates pods from a PodTemplate, for exactly *one* of those
pods, it should omit any RequiredDuringScheduling affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a group of pods from
the same PodTemplate as a single unit. This is similar to the first approach described above but
avoids the corner case. No special logic is needed in the controllers. Moreover, this would allow
the scheduler to do proper [gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845)
since it could receive an entire gang simultaneously as a single unit.
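
Here is a minimal sketch of the short-term workaround from the first bullet. The helpers `matchesOwnLabels`, `existsOtherMatchingPod`, and `existsMatchingPodInTopology` are placeholders for the obvious label-selector and pod-listing logic, not proposed function names:

```go
// Sketch only: a required affinity term that selects the pod's own labels is
// disregarded while no other matching pod exists, so the first pod of a
// replicated set is not deadlocked by its own rule.
func requiredTermSatisfied(term PodAffinityTerm, pod *Pod, allPods []*Pod) bool {
	if matchesOwnLabels(term, pod) && !existsOtherMatchingPod(term, pod, allPods) {
		return true // nothing to co-locate with yet; treat as satisfied
	}
	return existsMatchingPodInTopology(term, pod, allPods)
}
```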

### Anti-affinity

As with the affinity examples, the examples here can be RequiredDuringScheduling or
PreferredDuringScheduling anti-affinity, i.e.
"don't" can be interpreted as "must not" or as "try not to" depending on whether the rule appears
in `RequiredDuringScheduling` or `PreferredDuringScheduling`.

* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"}, {LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity, then the first clause is redundant, since the second
clause will force the scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming each node is in one zone.
This rule is more useful as PreferredDuringScheduling anti-affinity, e.g. one might expect it to be common in
[Ubernetes](../../docs/proposals/federation.md) clusters.)

* **Don't co-locate pods of this service with pods from service "evilService"**:
`{LabelSelector: selector that matches evilService's pods, TopologyKey: "node"}`

* **Don't co-locate pods of this service with any other pods including pods of this service**:
`{LabelSelector: empty, TopologyKey: "node"}`

* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key "service"
as well as pods with key "service" and a corresponding value that is not "S."
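
Expanded against `PodAffinityTerm`, and again assuming the existing label selector `MatchExpressions` syntax, this last rule might be written as follows; the `NotIn` operator matches pods that either lack the key entirely or carry a different value:

```go
// Sketch only: anti-affinity against everything except pods of this service.
var onlyMyServiceOnNode = PodAffinityTerm{
	LabelSelector: &LabelSelector{
		MatchExpressions: []LabelSelectorRequirement{
			{Key: "service", Operator: "NotIn", Values: []string{"S"}},
		},
	},
	TopologyKey: "node",
}
```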

## Algorithm

An example algorithm a scheduler might use to implement affinity and anti-affinity rules is as follows.
There are certainly more efficient ways to do it; this is just intended to demonstrate that the API's
semantics are implementable.

Terminology definition: We say a pod P is "feasible" on a node N if P meets all of the scheduler
predicates for scheduling P onto N. Note that this algorithm is only concerned with scheduling
time, thus it makes no distinction between RequiredDuringExecution and IgnoredDuringExecution.

To make the algorithm slightly more readable, we use the term "HardPodAffinity" as shorthand
for "RequiredDuringScheduling pod affinity" and "SoftPodAffinity" as shorthand for
"PreferredDuringScheduling pod affinity." Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."

**TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity} into account;
currently it assumes all terms have weight 1.**

```
Z = the pod you are scheduling
{N} = the set of all nodes in the system  // this algorithm will reduce it to the set of all nodes feasible for Z
// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
X = {Z's PodSpec's HardPodAffinity}
foreach element H of {X}
	P = {all pods in the system that match H.LabelSelector}
	M map[string]int  // topology value -> number of pods running on nodes with that topology value
	foreach pod Q of {P}
		L = {labels of the node on which Q is running, represented as a map from label key to label value}
		M[L[H.TopologyKey]]++
	{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
X = {Z's PodSpec's HardPodAntiAffinity}
foreach element H of {X}
	P = {all pods in the system that match H.LabelSelector}
	M map[string]int  // topology value -> number of pods running on nodes with that topology value
	foreach pod Q of {P}
		L = {labels of the node on which Q is running, represented as a map from label key to label value}
		M[L[H.TopologyKey]]++
	{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
foreach node A of {N}
	foreach pod B that is bound to A
		if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
// At this point, all nodes in {N} are feasible for Z.
// Step 3a: Soft version of Step 1a
Y map[string]int  // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node
Initialize the keys of Y to all of the nodes in {N}, and the values to 0
X = {Z's PodSpec's SoftPodAffinity}
Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 3b: Soft version of Step 1b
X = {Z's PodSpec's SoftPodAntiAffinity}
Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft
foreach node A of {N}
	foreach pod B that is bound to A
		increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A
// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is
// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with
// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better).
```
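
As a cross-check that Step 1a is straightforward to implement, here is a rough Go sketch of that step alone. The `Pod` and `Node` shapes and the `matches` helper are placeholders, and error handling is omitted:

```go
// Sketch only: Step 1a. Keep the nodes whose value for H.TopologyKey names a
// topology domain that already hosts at least one pod matching H.LabelSelector.
func filterByHardPodAffinity(nodes []*Node, terms []PodAffinityTerm, allPods []*Pod) []*Node {
	for _, h := range terms {
		counts := map[string]int{} // topology value -> number of matching pods
		for _, q := range allPods {
			if !matches(h, q) {
				continue
			}
			// NodeLabels stands for the labels of the node Q is running on.
			if v, ok := q.NodeLabels[h.TopologyKey]; ok {
				counts[v]++
			}
		}
		var kept []*Node
		for _, n := range nodes {
			if v, ok := n.Labels[h.TopologyKey]; ok && counts[v] > 0 {
				kept = append(kept, n)
			}
		}
		nodes = kept // terms are ANDed, so each term further intersects {N}
	}
	return nodes
}
```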

## Special considerations for RequiredDuringScheduling anti-affinity

In this section we discuss three issues with RequiredDuringScheduling anti-affinity:
Denial of Service (DoS), co-existing with daemons, and determining which pod(s) to kill.
See issue #18265 for additional discussion of these topics.

### Denial of Service

Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity can intentionally
or unintentionally cause various problems for other pods, due to the symmetry property of anti-affinity.

The most notable danger is that a pod that arrives first in some topology domain can block all other pods from
scheduling there by stating a conflict with all other pods.
The standard approach
to preventing resource hogging is quota, but simple resource quota cannot prevent
this scenario because the pod may request very few resources. Addressing this
using quota requires a quota scheme that charges based on "opportunity cost" rather
than based simply on requested resources. For example, when handling a pod that expresses
RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
(i.e. exclusive access to a node), it could charge for the resources of the
average or largest node in the cluster. Likewise if a pod expresses RequiredDuringScheduling
anti-affinity for all pods using a "cluster" `TopologyKey`, it could charge for the resources of the
entire cluster. If node affinity is used to
constrain the pod to a particular topology domain, then the admission-time quota
charging should take that into account (e.g. not charge for the average/largest machine
if the PodSpec constrains the pod to a specific machine with a known size; instead charge
for the size of the actual machine that the pod was constrained to). In all cases
once the pod is scheduled, the quota charge should be adjusted down to the
actual amount of resources allocated (e.g. the size of the actual machine that was
assigned, not the average/largest). If a cluster administrator wants to overcommit quota, for
example to allow more than N pods across all users to request exclusive node
access in a cluster with N nodes, then a priority/preemption scheme should be added
so that the most important pods run when resource demand exceeds supply.

An alternative approach, which is a bit of a blunt hammer, is to use a
capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
to trusted users. A more complex capability mechanism might only restrict it when
using a non-"node" TopologyKey.

Our initial implementation will use a variant of the capability approach, which
requires no configuration: we will simply reject ALL requests, regardless of user,
that specify "all namespaces" with non-"node" TopologyKey for RequiredDuringScheduling anti-affinity.
This allows the "exclusive node" use case while prohibiting the more dangerous ones.
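
A minimal sketch of that safeguard, written as a validation helper; the function name and error wording are illustrative only, and per the implementation plan below the real check would live in an admission controller:

```go
// Sketch only: reject RequiredDuringScheduling anti-affinity terms that select
// across all namespaces with a topology broader than a single node. Per
// PodAffinityTerm, a non-nil but empty Namespaces list means "all namespaces".
func validateRequiredAntiAffinity(terms []PodAffinityTerm) error {
	for _, t := range terms {
		allNamespaces := t.Namespaces != nil && len(t.Namespaces) == 0
		if allNamespaces && t.TopologyKey != "node" {
			return fmt.Errorf("anti-affinity across all namespaces is only allowed with topology key %q, got %q",
				"node", t.TopologyKey)
		}
	}
	return nil
}
```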

A weaker variant of the problem described in the previous paragraph is a pod's ability to use anti-affinity to degrade
the scheduling quality of another pod, but not completely block it from scheduling.
For example, a set of pods S1 could use node affinity to request to schedule onto a set
of nodes that some other set of pods S2 prefers to schedule onto. If the pods in S1
have RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for S2,
then due to the symmetry property of anti-affinity, they can prevent the pods in S2 from
scheduling onto their preferred nodes if they arrive first (for sure in the RequiredDuringScheduling case, and
with some probability that depends on the weighting scheme for the PreferredDuringScheduling case).
A very sophisticated priority and/or quota scheme could mitigate this, or alternatively
we could eliminate the symmetry property of the implementation of PreferredDuringScheduling anti-affinity.
Then only RequiredDuringScheduling anti-affinity could affect scheduling quality
of another pod, and as we described in the previous paragraph, such pods could be charged
quota for the full topology domain, thereby reducing the potential for abuse.

We won't try to address this issue in our initial implementation; we can consider one
of the approaches mentioned above if it turns out to be a problem in practice.

### Co-existing with daemons

A cluster administrator
may wish to allow pods that express anti-affinity against all pods to nonetheless co-exist with
system daemon pods, such as those run by DaemonSet. In principle, we would like the specification
for RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or more
other pods (see #18263 for a more detailed explanation of the toleration concept). There are
at least two ways to accomplish this:

* Scheduler special-cases the namespace(s) where daemons live, in the
  sense that it ignores pods in those namespaces when it is
  determining feasibility for pods with anti-affinity. The name(s) of
  the special namespace(s) could be a scheduler configuration
  parameter, and default to `kube-system`. We could allow
  multiple namespaces to be specified if we want cluster admins to be
  able to give their own daemons this special power (they would add
  their namespace to the list in the scheduler configuration). And of
  course this would be symmetric, so daemons could schedule onto a node
  that is already running a pod with anti-affinity.

* We could add an explicit "toleration" concept/field to allow the
  user to specify namespaces that are excluded when they use
  RequiredDuringScheduling anti-affinity, and use an admission
  controller/defaulter to ensure these namespaces are always listed.

Our initial implementation will use the first approach.
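
A sketch of how the first approach might look inside the scheduler's anti-affinity feasibility check; the `ignoredNamespaces` set (defaulting to `kube-system`) would come from scheduler configuration, and the surrounding `Pod` type is a placeholder:

```go
// Sketch only: pods in configured "daemon" namespaces are invisible to
// anti-affinity feasibility checks, in both directions.
var ignoredNamespaces = map[string]bool{"kube-system": true}

func podsVisibleToAntiAffinity(allPods []*Pod) []*Pod {
	var visible []*Pod
	for _, p := range allPods {
		if !ignoredNamespaces[p.Namespace] {
			visible = append(visible, p)
		}
	}
	return visible
}
```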

### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

Because anti-affinity is symmetric, in the case of RequiredDuringSchedulingRequiredDuringExecution
anti-affinity, the system must determine which pod(s) to kill when a pod's labels are updated in
such a way as to cause them to conflict with one or more other pods' RequiredDuringSchedulingRequiredDuringExecution
anti-affinity rules. In the absence of a priority/preemption scheme, our rule will be that the pod
with the anti-affinity rule that becomes violated should be the one killed.
A pod should only specify constraints that apply to
namespaces it trusts to not do malicious things. Once we have priority/preemption, we can
change the rule to say that the lowest-priority pod(s) are killed until all
RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.

## Special considerations for RequiredDuringScheduling affinity

The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its symmetry:
if a pod P requests anti-affinity, P cannot schedule onto a node with conflicting pods,
and pods that conflict with P cannot schedule onto the node once P has been scheduled there.
The design we have described says that the symmetry property for RequiredDuringScheduling *affinity*
is weaker: if a pod P says it can only schedule onto nodes running pod Q, this
does not mean Q can only run on a node that is running P, but the scheduler will try
to schedule Q onto a node that is running P (i.e. treats the reverse direction as
preferred). This raises the same scheduling quality concern as we mentioned at the
end of the Denial of Service section above, and can be addressed in similar ways.

The nature of affinity (as opposed to anti-affinity) means that there is no issue of
determining which pod(s) to kill
when a pod's labels change: it is obviously the pod with the affinity rule that becomes
violated that must be killed. (Killing a pod never "fixes" violation of an affinity rule;
it can only "fix" violation of an anti-affinity rule.) However, affinity does have a
different question related to killing: how long should the system wait before declaring
that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met at runtime?
For example, if a pod P has such an affinity for a pod Q and pod Q is temporarily killed
so that it can be updated to a new binary version, should that trigger killing of P? More
generally, how long should the system wait before declaring that P's affinity is
violated? (Of course affinity is expressed in terms of label selectors, not for a specific
pod, but the scenario is easier to describe using a concrete pod.) This is closely related to
the concept of forgiveness (see issue #1574). In theory we could make this time duration be
configurable by the user on a per-pod basis, but for the first version of this feature we will
make it a configurable property of whichever component does the killing, applied uniformly across
all pods that use the feature. Making it configurable by the user would require a nontrivial change
to the API syntax (since the field would only apply to RequiredDuringSchedulingRequiredDuringExecution
affinity).

## Implementation plan

1. Add the `Affinity` field to PodSpec and the `PodAffinity` and `PodAntiAffinity` types to the API along with all of their descendant types.
2. Implement a scheduler predicate that takes `RequiredDuringSchedulingIgnoredDuringExecution`
affinity and anti-affinity into account. Include a workaround for the issue described at the end of the Affinity section of the Examples section (can't schedule first pod).
3. Implement a scheduler priority function that takes `PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into account.
4. Implement an admission controller that rejects requests that specify "all namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling` anti-affinity.
This admission controller should be enabled by default.
5. Implement the recommended solution to the "co-existing with daemons" issue.
6. At this point, the feature can be deployed.
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity and anti-affinity, and make sure
the pieces of the system already implemented for `RequiredDuringSchedulingIgnoredDuringExecution` also take
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the scheduler predicate, the quota mechanism,
the "co-existing with daemons" solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node" `TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no longer satisfies
`RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet, then only for "node" `TopologyKey`;
if controller, then potentially for all `TopologyKey`s
(see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.

We assume Kubelet publishes labels describing the node's membership in all of the relevant scheduling
domains (e.g. node name, rack name, availability zone name, etc.). See #9044.
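
For illustration, such labels might look like the following; the exact label keys are not part of this proposal, only the assumption that every node carries one label per topology domain whose key can be used as a `TopologyKey`:

```go
// Sketch only: per-node labels acting as topology domains.
var nodeLabels = map[string]string{
	"node": "node-17",       // node-level domain (the node's name)
	"rack": "rack-3",        // rack-level domain
	"zone": "us-central1-a", // availability-zone-level domain
}
```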

## Backward compatibility

Old versions of the scheduler will ignore `Affinity`.

Users should not start using `Affinity` until the full implementation has
been in Kubelet and the master for enough binary versions that we feel
comfortable that we will not need to roll back either Kubelet or
master to a version that does not support them. Longer-term we will
use a programmatic approach to enforcing this (#4855).

## Extensibility

The design described here is the result of careful analysis of use cases, a decade of experience
with Borg at Google, and a review of similar features in other open-source container orchestration
systems. We believe that it properly balances the goal of expressiveness against the goals of
simplicity and efficiency of implementation. However, we recognize that
use cases may arise in the future that cannot be expressed using the syntax described here.
Although we are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire for Kubernetes
users to get a consistent experience, etc.), the regular Kubernetes
annotation mechanism can be used to add or replace affinity rules. The way this would work is as follows (a brief sketch appears after the list):
1. Define one or more annotations to describe the new affinity rule(s).
1. User (or an admission controller) attaches the annotation(s) to pods to request the desired scheduling behavior.
If the new rule(s) *replace* one or more fields of `Affinity` then the user would omit those fields
from `Affinity`; if they are *additional rules*, then the user would fill in `Affinity` as well as the
annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.
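
As a sketch of what that could look like, a custom or modified scheduler might decode an extra rule carried in an annotation into the `Affinity` type; the annotation key below is purely illustrative and not part of this proposal:

```go
// Sketch only: reading a hypothetical affinity annotation.
const experimentalAffinityAnnotation = "example.com/experimental-affinity"

func affinityFromAnnotation(annotations map[string]string) (*Affinity, error) {
	raw, ok := annotations[experimentalAffinityAnnotation]
	if !ok {
		return nil, nil // no extra rules requested
	}
	var a Affinity
	if err := json.Unmarshal([]byte(raw), &a); err != nil {
		return nil, err
	}
	return &a, nil
}
```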

If some particular new syntax becomes popular, we would consider upstreaming it by integrating
it into the standard `Affinity`.

## Future work and non-work

One can imagine that in the anti-affinity RequiredDuringScheduling case
one might want to associate a number with the rule,
for example "do not allow this pod to share a rack with more than three other
pods (in total, or from the same service as the pod)." We could allow this to be
specified by adding an integer `Limit` to `PodAffinityTerm` just for the
`RequiredDuringScheduling` case. However, this flexibility complicates the
system and we do not intend to implement it.

It is likely that the specification and implementation of pod anti-affinity
can be unified with [taints and tolerations](taint-toleration-dedicated.md),
and likewise that the specification and implementation of pod affinity
can be unified with [node affinity](nodeaffinity.md).
The basic idea is that pod labels would be "inherited" by the node, and pods
would only be able to specify affinity and anti-affinity for a node's labels.
Our main motivation for not unifying taints and tolerations with
pod anti-affinity is that we foresee taints and tolerations as being a concept that
only cluster administrators need to understand (and indeed in some setups taints and
tolerations wouldn't even be directly manipulated by a cluster administrator;
instead, they would only be set by an admission controller that is implementing the administrator's
high-level policy about different classes of special machines and the users who belong to the groups
allowed to access them). Moreover, the concept of nodes "inheriting" labels
from pods seems complicated; it seems conceptually simpler to separate rules involving
relatively static properties of nodes from rules involving which other pods are running
on the same node or larger topology domain.

Data/storage affinity is related to pod affinity, and is likely to draw on some of the
ideas we have used for pod affinity. Today, data/storage affinity is expressed using
node affinity, on the assumption that the pod knows which node(s) store(s) the data
it wants. But a more flexible approach would allow the pod to name the data rather than
the node.

## Related issues

The review for this proposal is in #18265.

The topic of affinity/anti-affinity has generated a lot of discussion. The main issue
is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341, #1965, and #2906
all have additional discussion and use cases.

As the examples in this document have demonstrated, topological affinity is very useful
in clusters that are spread across availability zones, e.g. to co-locate pods of a service
in the same zone to avoid a wide-area network hop, or to spread pods across zones for
failure tolerance. #17059, #13056, #13063, and #4235 are relevant.

Issue #15675 describes connection affinity, which is vaguely related.

This proposal is to satisfy #14816.

## Related work

**TODO: cite references**


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->