<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.2/docs/design/podaffinity.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Inter-pod topological affinity and anti-affinity

## Introduction

NOTE: It is useful to read about [node affinity](nodeaffinity.md) first.

This document describes a proposal for specifying and implementing inter-pod
topological affinity and anti-affinity. By that we mean: rules that specify that
certain pods should be placed in the same topological domain (e.g. same node,
same rack, same zone, same power domain, etc.) as some other pods, or,
conversely, should *not* be placed in the same topological domain as some other
pods.

Here are a few example rules; we explain how to express them using the API
described in this doc later, in the section "Examples."
* Affinity
  * Co-locate the pods from a particular service or Job in the same availability
zone, without specifying which zone that should be.
  * Co-locate the pods from service S1 with pods from service S2 because S1 uses
S2 and thus it is useful to minimize the network latency between them.
Co-location might mean same nodes and/or same availability zone.
* Anti-affinity
  * Spread the pods of a service across nodes and/or availability zones, e.g. to
reduce correlated failures.
  * Give a pod "exclusive" access to a node to guarantee resource isolation --
it must never share the node with other pods.
  * Don't schedule the pods of a particular service on the same nodes as pods of
another service that are known to interfere with the performance of the pods of
the first service.

For both affinity and anti-affinity, there are three variants. Two variants have
the property of requiring the affinity/anti-affinity to be satisfied for the pod
to be allowed to schedule onto a node; the difference between them is that if
the condition ceases to be met later on at runtime, for one of them the system
will try to eventually evict the pod, while for the other the system may not try
to do so. The third variant simply provides scheduling-time *hints* that the
scheduler will try to satisfy but may not be able to. These three variants are
directly analogous to the three variants of [node affinity](nodeaffinity.md).

Note that this proposal is only about *inter-pod* topological affinity and
anti-affinity. There are other forms of topological affinity and anti-affinity.
For example, you can use [node affinity](nodeaffinity.md) to require (prefer)
that a set of pods all be scheduled in some specific zone Z. Node affinity is
not capable of expressing inter-pod dependencies, and conversely the API we
describe in this document is not capable of expressing node affinity rules. For
simplicity, we will use the terms "affinity" and "anti-affinity" to mean
"inter-pod topological affinity" and "inter-pod topological anti-affinity,"
respectively, in the remainder of this document.

## API

We will add one field to `PodSpec`:

```go
Affinity *Affinity  `json:"affinity,omitempty"`
```

The `Affinity` type is defined as follows:

```go
type Affinity struct {
    PodAffinity     *PodAffinity  `json:"podAffinity,omitempty"`
    PodAntiAffinity *PodAntiAffinity  `json:"podAntiAffinity,omitempty"`
}

type PodAffinity struct {
    // If the affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system will try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm  `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
    // If the affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system may or may not try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingIgnoredDuringExecution  []PodAffinityTerm  `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
    // The scheduler will prefer to schedule pods to nodes that satisfy
    // the affinity expressions specified by this field, but it may choose
    // a node that violates one or more of the expressions. The node that is
    // most preferred is the one with the greatest sum of weights, i.e.
    // for each node that meets all of the scheduling requirements (resource
    // request, RequiredDuringScheduling affinity expressions, etc.),
    // compute a sum by iterating through the elements of this field and adding
    // "weight" to the sum if the node satisfies the corresponding PodAffinityTerm; the
    // node(s) with the highest sum are the most preferred.
    PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm  `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type PodAntiAffinity struct {
    // If the anti-affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the anti-affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system will try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingRequiredDuringExecution []PodAffinityTerm  `json:"requiredDuringSchedulingRequiredDuringExecution,omitempty"`
    // If the anti-affinity requirements specified by this field are not met at
    // scheduling time, the pod will not be scheduled onto the node.
    // If the anti-affinity requirements specified by this field cease to be met
    // at some point during pod execution (e.g. due to a pod label update), the
    // system may or may not try to eventually evict the pod from its node.
    // When there are multiple elements, the lists of nodes corresponding to each
    // PodAffinityTerm are intersected, i.e. all terms must be satisfied.
    RequiredDuringSchedulingIgnoredDuringExecution  []PodAffinityTerm  `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
    // The scheduler will prefer to schedule pods to nodes that satisfy
    // the anti-affinity expressions specified by this field, but it may choose
    // a node that violates one or more of the expressions. The node that is
    // most preferred is the one with the greatest sum of weights, i.e.
    // for each node that meets all of the scheduling requirements (resource
    // request, RequiredDuringScheduling anti-affinity expressions, etc.),
    // compute a sum by iterating through the elements of this field and adding
    // "weight" to the sum if the node satisfies the corresponding PodAffinityTerm; the
    // node(s) with the highest sum are the most preferred.
    PreferredDuringSchedulingIgnoredDuringExecution []WeightedPodAffinityTerm  `json:"preferredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type WeightedPodAffinityTerm struct {
    // weight is in the range 1-100
    Weight int  `json:"weight"`
    PodAffinityTerm PodAffinityTerm  `json:"podAffinityTerm"`
}

type PodAffinityTerm struct {
    LabelSelector *LabelSelector `json:"labelSelector,omitempty"`
    // namespaces specifies which namespaces the LabelSelector applies to (matches against);
    // nil list means "this pod's namespace," empty list means "all namespaces"
    // The json tag here is not "omitempty" since we need to distinguish nil and empty.
    // See https://golang.org/pkg/encoding/json/#Marshal for more details.
    Namespaces []api.Namespace  `json:"namespaces"`
    // empty topology key is interpreted by the scheduler as "all topologies"
    TopologyKey string `json:"topologyKey,omitempty"`
}
```

Note that the `Namespaces` field is necessary because normal `LabelSelector` is
scoped to the pod's namespace, but we need to be able to match against all pods
globally.
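
The nil-versus-empty distinction that the `Namespaces` comment relies on can be checked directly against `encoding/json`. The sketch below uses two hypothetical stand-in types (`termKeep` and `termOmit`, not part of the proposed API) that differ only in their json tag, with `[]string` substituted for `[]api.Namespace` to keep it self-contained.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// termKeep and termOmit are hypothetical stand-ins for PodAffinityTerm that
// differ only in the json tag on Namespaces.
type termKeep struct {
	Namespaces []string `json:"namespaces"`
}

type termOmit struct {
	Namespaces []string `json:"namespaces,omitempty"`
}

// encode marshals v and returns the JSON text. It panics on error, which is
// not expected for these simple types.
func encode(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	// Without omitempty, nil and empty remain distinguishable on the wire.
	fmt.Println(encode(termKeep{Namespaces: nil}))        // {"namespaces":null}
	fmt.Println(encode(termKeep{Namespaces: []string{}})) // {"namespaces":[]}
	// With omitempty, both collapse to the same encoding.
	fmt.Println(encode(termOmit{Namespaces: nil}))        // {}
	fmt.Println(encode(termOmit{Namespaces: []string{}})) // {}
}
```

With `omitempty`, a reader of the serialized object could not tell "this pod's namespace" apart from "all namespaces."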

To explain how this API works, let's say that the `PodSpec` of a pod `P` has an
`Affinity` that is configured as follows (note that we've omitted and collapsed
some fields for simplicity, but this should sufficiently convey the intent of
the design):

```go
PodAffinity {
	RequiredDuringScheduling: {{LabelSelector: P1, TopologyKey: "node"}},
	PreferredDuringScheduling: {{LabelSelector: P2, TopologyKey: "zone"}},
}
PodAntiAffinity {
	RequiredDuringScheduling: {{LabelSelector: P3, TopologyKey: "rack"}},
	PreferredDuringScheduling: {{LabelSelector: P4, TopologyKey: "power"}}
}
```

Then when scheduling pod P, the scheduler:
* Can only schedule P onto nodes that are running pods that satisfy `P1`.
(Assumes all nodes have a label with key `node` and value specifying their node
name.)
* Should try to schedule P onto zones that are running pods that satisfy `P2`.
(Assumes all nodes have a label with key `zone` and value specifying their
zone.)
* Cannot schedule P onto any racks that are running pods that satisfy `P3`.
(Assumes all nodes have a label with key `rack` and value specifying their rack
name.)
* Should try not to schedule P onto any power domains that are running pods that
satisfy `P4`. (Assumes all nodes have a label with key `power` and value
specifying their power domain.)

When `RequiredDuringScheduling` has multiple elements, the requirements are
ANDed. For `PreferredDuringScheduling` the weights are added for the terms that
are satisfied for each node, and the node(s) with the highest weight(s) are the
most preferred.
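
As a rough illustration of the weight summation just described, the following sketch scores nodes against a list of weighted terms. The `weightedTerm` type and its `satisfied` callback are simplifications we assume for the example; they stand in for evaluating a real `WeightedPodAffinityTerm` against a node.

```go
package main

import "fmt"

// weightedTerm is a simplified, hypothetical stand-in for
// WeightedPodAffinityTerm: satisfied reports whether the term holds for a node.
type weightedTerm struct {
	weight    int // assumed to be in the range 1-100, as in the API comment
	satisfied func(node string) bool
}

// score returns the sum of the weights of the terms that node satisfies.
func score(node string, terms []weightedTerm) int {
	sum := 0
	for _, t := range terms {
		if t.satisfied(node) {
			sum += t.weight
		}
	}
	return sum
}

// mostPreferred returns the nodes with the highest score, in input order.
func mostPreferred(nodes []string, terms []weightedTerm) []string {
	var best []string
	bestScore := -1
	for _, n := range nodes {
		s := score(n, terms)
		switch {
		case s > bestScore:
			best, bestScore = []string{n}, s
		case s == bestScore:
			best = append(best, n)
		}
	}
	return best
}

func main() {
	terms := []weightedTerm{
		{weight: 10, satisfied: func(n string) bool { return n != "n3" }},
		{weight: 5, satisfied: func(n string) bool { return n == "n2" }},
	}
	// n1 scores 10, n2 scores 15, n3 scores 0.
	fmt.Println(mostPreferred([]string{"n1", "n2", "n3"}, terms))
}
```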

In reality there are two variants of `RequiredDuringScheduling`: one suffixed
with `RequiredDuringExecution` and one suffixed with `IgnoredDuringExecution`.
For the first variant, if the affinity/anti-affinity ceases to be met at some
point during pod execution (e.g. due to a pod label update), the system will try
to eventually evict the pod from its node. In the second variant, the system may
or may not try to eventually evict the pod from its node.

## A comment on symmetry

One thing that makes affinity and anti-affinity tricky is symmetry.

Imagine a cluster that is running pods from two services, S1 and S2. Imagine
that the pods of S1 have a RequiredDuringScheduling anti-affinity rule "do not
run me on nodes that are running pods from S2." It is not sufficient just to
check that there are no S2 pods on a node when you are scheduling an S1 pod. You
also need to ensure that there are no S1 pods on a node when you are scheduling
an S2 pod, *even though the S2 pod does not have any anti-affinity rules*.
Otherwise if an S1 pod schedules before an S2 pod, the S1 pod's
RequiredDuringScheduling anti-affinity rule can be violated by a later-arriving
S2 pod. More specifically, if S1 has the aforementioned RequiredDuringScheduling
anti-affinity rule, then:
* if a node is empty, you can schedule S1 or S2 onto the node
* if a node is running S1 (S2), you cannot schedule S2 (S1) onto the node

Note that while RequiredDuringScheduling anti-affinity is symmetric,
RequiredDuringScheduling affinity is *not* symmetric. That is, if the pods of S1
have a RequiredDuringScheduling affinity rule "run me on nodes that are running
pods from S2," it is not required that there be S1 pods on a node in order to
schedule an S2 pod onto that node. More specifically, if S1 has the
aforementioned RequiredDuringScheduling affinity rule, then:
* if a node is empty, you can schedule S2 onto the node
* if a node is empty, you cannot schedule S1 onto the node
* if a node is running S2, you can schedule S1 onto the node
* if a node is running S1+S2 and S1 terminates, S2 continues running
* if a node is running S1+S2 and S2 terminates, the system terminates S1
(eventually)

However, although RequiredDuringScheduling affinity is not symmetric, there is
an implicit PreferredDuringScheduling affinity rule corresponding to every
RequiredDuringScheduling affinity rule: if the pods of S1 have a
RequiredDuringScheduling affinity rule "run me on nodes that are running pods
from S2" then it is not required that there be S1 pods on a node in order to
schedule an S2 pod onto that node, but it would be better if there are.

PreferredDuringScheduling is symmetric. If the pods of S1 had a
PreferredDuringScheduling anti-affinity rule "try not to run me on nodes that
are running pods from S2" then we would prefer to keep an S1 pod that we are
scheduling off of nodes that are running S2 pods, and also to keep an S2 pod that
we are scheduling off of nodes that are running S1 pods. Likewise if the pods of
S1 had a PreferredDuringScheduling affinity rule "try to run me on nodes that
are running pods from S2" then we would prefer to place an S1 pod that we are
scheduling onto a node that is running an S2 pod, and also to place an S2 pod that
we are scheduling onto a node that is running an S1 pod.

## Examples

Here are some examples of how you would express various affinity and
anti-affinity rules using the API we described.

### Affinity

In the examples below, the word "put" is intentionally ambiguous; the rules are
the same whether "put" means "must put" (RequiredDuringScheduling) or "try to
put" (PreferredDuringScheduling)--all that changes is which field the rule goes
into. Also, we only discuss scheduling-time, and ignore the execution-time.
Finally, some of the examples use "zone" and some use "node," just to make the
examples more interesting; any of the examples with "zone" will also work for
"node" if you change the `TopologyKey`, and vice-versa.

* **Put the pod in zone Z**:
Tricked you! It is not possible to express this using the API described here. For
this you should use node affinity.

* **Put the pod in a zone that is running at least one pod from service S**:
`{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}`

* **Put the pod on a node that is already running a pod that requires a license
for software package P**: Assuming pods that require a license for software
package P have a label `{key=license, value=P}`:
`{LabelSelector: "license" In "P", TopologyKey: "node"}`

* **Put this pod in the same zone as other pods from its same service**:
Assuming pods from this pod's service have some label `{key=service, value=S}`:
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`

This last example illustrates a small issue with this API when it is used with a
scheduler that processes the pending queue one pod at a time, like the current
Kubernetes scheduler. The RequiredDuringScheduling rule
`{LabelSelector: "service" In "S", TopologyKey: "zone"}`
only "works" once one pod from service S has been scheduled. But if all pods in
service S have this RequiredDuringScheduling rule in their PodSpec, then the
RequiredDuringScheduling rule will block the first pod of the service from ever
scheduling, since it is only allowed to run in a zone with another pod from the
same service. And of course that means none of the pods of the service will be
able to schedule. This problem *only* applies to RequiredDuringScheduling
affinity, not PreferredDuringScheduling affinity or any variant of
anti-affinity. There are at least three ways to solve this problem:
* **short-term**: have the scheduler use a rule that if the
RequiredDuringScheduling affinity requirement matches a pod's own labels, and
there are no other such pods anywhere, then disregard the requirement. This
approach has a corner case when running parallel schedulers that are allowed to
schedule pods from the same replicated set (e.g. a single PodTemplate): both
schedulers may try to schedule pods from the set at the same time and think
there are no other pods from that set scheduled yet (e.g. they are trying to
schedule the first two pods from the set), but by the time the second binding is
committed, the first one has already been committed, leaving you with two pods
running that do not respect their RequiredDuringScheduling affinity. There is no
simple way to detect this "conflict" at scheduling time given the current system
implementation.
* **longer-term**: when a controller creates pods from a PodTemplate, for
exactly *one* of those pods, it should omit any RequiredDuringScheduling
affinity rules that select the pods of that PodTemplate.
* **very long-term/speculative**: controllers could present the scheduler with a
group of pods from the same PodTemplate as a single unit. This is similar to the
first approach described above but avoids the corner case. No special logic is
needed in the controllers. Moreover, this would allow the scheduler to do proper
[gang scheduling](https://github.com/kubernetes/kubernetes/issues/16845) since
it could receive an entire gang simultaneously as a single unit.
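
The short-term rule in the first bullet could look roughly like the check below. This is only a sketch under simplifying assumptions: selectors are reduced to exact key/value matches, and `shouldDisregard` is a hypothetical helper name. It does not address the parallel-scheduler corner case described above.

```go
package main

import "fmt"

// matches reports whether labels carry every key/value pair in selector.
// Real label selectors are richer; equality-only matching is an assumption.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// shouldDisregard reports whether a RequiredDuringScheduling affinity term
// may be ignored for a pod: the term must select the pod's own labels, and no
// other pod anywhere may currently match it.
func shouldDisregard(selector, own map[string]string, others []map[string]string) bool {
	if !matches(selector, own) {
		return false
	}
	for _, o := range others {
		if matches(selector, o) {
			return false
		}
	}
	return true
}

func main() {
	sel := map[string]string{"service": "S"}
	self := map[string]string{"service": "S"}

	// First pod of service S: nothing else matches, so the rule is waived.
	fmt.Println(shouldDisregard(sel, self, nil))
	// A second pod of S exists: the rule applies normally.
	fmt.Println(shouldDisregard(sel, self, []map[string]string{{"service": "S"}}))
}
```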

### Anti-affinity

As with the affinity examples, the examples here can be RequiredDuringScheduling
or PreferredDuringScheduling anti-affinity, i.e. "don't" can be interpreted as
"must not" or as "try not to" depending on whether the rule appears in
`RequiredDuringScheduling` or `PreferredDuringScheduling`.

* **Spread the pods of this service S across nodes and zones**:
`{{LabelSelector: <selector that matches S's pods>, TopologyKey: "node"},
{LabelSelector: <selector that matches S's pods>, TopologyKey: "zone"}}`
(note that if this is specified as a RequiredDuringScheduling anti-affinity,
then the first clause is redundant, since the second clause will force the
scheduler to not put more than one pod from S in the same zone, and thus by
definition it will not put more than one pod from S on the same node, assuming
each node is in one zone. This rule is more useful as PreferredDuringScheduling
anti-affinity, e.g. one might expect it to be common in
[Ubernetes](../../docs/proposals/federation.md) clusters.)

* **Don't co-locate pods of this service with pods from service "evilService"**:
`{LabelSelector: <selector that matches evilService's pods>, TopologyKey: "node"}`

* **Don't co-locate pods of this service with any other pods including pods of this service**:
`{LabelSelector: empty, TopologyKey: "node"}`

* **Don't co-locate pods of this service with any other pods except other pods of this service**:
Assuming pods from the service have some label `{key=service, value=S}`:
`{LabelSelector: "service" NotIn "S", TopologyKey: "node"}`
Note that this works because `"service" NotIn "S"` matches pods with no key
"service" as well as pods with key "service" and a corresponding value that is
not "S."
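
The `NotIn` behavior that the last example depends on can be sketched in a few lines. `notIn` is a hypothetical helper written for illustration, not the label selector implementation itself.

```go
package main

import "fmt"

// notIn reports whether labels satisfy the requirement `key NotIn values`.
// A pod that lacks the key entirely satisfies the requirement.
func notIn(labels map[string]string, key string, values ...string) bool {
	v, ok := labels[key]
	if !ok {
		return true // missing key: matches
	}
	for _, x := range values {
		if v == x {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(notIn(map[string]string{"service": "S"}, "service", "S")) // same service: no match
	fmt.Println(notIn(map[string]string{"service": "T"}, "service", "S")) // other service: match
	fmt.Println(notIn(map[string]string{"app": "db"}, "service", "S"))    // no "service" key: match
}
```

Because unlabeled pods match, the anti-affinity rule keeps them off the node as well, which is what "any other pods except other pods of this service" requires.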

## Algorithm

An example algorithm a scheduler might use to implement affinity and
anti-affinity rules is as follows. There are certainly more efficient ways to
do it; this is just intended to demonstrate that the API's semantics are
implementable.

Terminology definition: We say a pod P is "feasible" on a node N if P meets all
of the scheduler predicates for scheduling P onto N. Note that this algorithm is
only concerned about scheduling time, thus it makes no distinction between
RequiredDuringExecution and IgnoredDuringExecution.

To make the algorithm slightly more readable, we use the term "HardPodAffinity"
as shorthand for "RequiredDuringScheduling pod affinity" and
"SoftPodAffinity" as shorthand for "PreferredDuringScheduling pod affinity."
Analogously for "HardPodAntiAffinity" and "SoftPodAntiAffinity."

**TODO: Update this algorithm to take weight for SoftPod{Affinity,AntiAffinity}
into account; currently it assumes all terms have weight 1.**

```
Z = the pod you are scheduling
{N} = the set of all nodes in the system  // this algorithm will reduce it to the set of all nodes feasible for Z
// Step 1a: Reduce {N} to the set of nodes satisfying Z's HardPodAffinity in the "forward" direction
X = {Z's PodSpec's HardPodAffinity}
foreach element H of {X}
	P = {all pods in the system that match H.LabelSelector}
	M map[string]int  // topology value -> number of pods running on nodes with that topology value
	foreach pod Q of {P}
		L = {labels of the node on which Q is running, represented as a map from label key to label value}
		M[L[H.TopologyKey]]++
	{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]>0]}
// Step 1b: Further reduce {N} to the set of nodes also satisfying Z's HardPodAntiAffinity
// This step is identical to Step 1a except the M[K] > 0 comparison becomes M[K] == 0
X = {Z's PodSpec's HardPodAntiAffinity}
foreach element H of {X}
	P = {all pods in the system that match H.LabelSelector}
	M map[string]int  // topology value -> number of pods running on nodes with that topology value
	foreach pod Q of {P}
		L = {labels of the node on which Q is running, represented as a map from label key to label value}
		M[L[H.TopologyKey]]++
	{N} = {N} intersect {all nodes of N with label [key=H.TopologyKey, value=any K such that M[K]==0]}
// Step 2: Further reduce {N} by enforcing symmetry requirement for other pods' HardPodAntiAffinity
foreach node A of {N}
	foreach pod B that is bound to A
		if any of B's HardPodAntiAffinity are currently satisfied but would be violated if Z runs on A, then remove A from {N}
// At this point, all nodes in {N} are feasible for Z.
// Step 3a: Soft version of Step 1a
Y map[string]int  // node -> number of Z's soft affinity/anti-affinity preferences satisfied by that node
Initialize the keys of Y to all of the nodes in {N}, and the values to 0
X = {Z's PodSpec's SoftPodAffinity}
Repeat Step 1a except replace the last line with "foreach node W of {N} having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 3b: Soft version of Step 1b
X = {Z's PodSpec's SoftPodAntiAffinity}
Repeat Step 1b except replace the last line with "foreach node W of {N} not having label [key=H.TopologyKey, value=any K such that M[K]>0], Y[W]++"
// Step 4: Symmetric soft, plus treat forward direction of hard affinity as a soft
foreach node A of {N}
	foreach pod B that is bound to A
		increment Y[A] by the number of B's SoftPodAffinity, SoftPodAntiAffinity, and HardPodAffinity that are satisfied if Z runs on A but are not satisfied if Z does not run on A
// We're done. {N} contains all of the nodes that satisfy the affinity/anti-affinity rules, and Y is
// a map whose keys are the elements of {N} and whose values are how "good" of a choice N is for Z with
// respect to the explicit and implicit affinity/anti-affinity rules (larger number is better).
```
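
To make Steps 1a and 1b more concrete, here is one way they might be written in Go. It is a sketch, not the scheduler's implementation: the `node`, `pod`, and `term` types are simplified assumptions, selectors are reduced to exact key/value matches, and Steps 2 through 4 are omitted.

```go
package main

import "fmt"

type node struct {
	name   string
	labels map[string]string
}

type pod struct {
	labels map[string]string
	node   string // name of the node the pod runs on
}

// term is a simplified PodAffinityTerm with an equality-only selector.
type term struct {
	selector    map[string]string
	topologyKey string
}

func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// countByTopology builds the map M from the pseudocode: topology value ->
// number of matching pods running on nodes with that topology value.
func countByTopology(t term, pods []pod, all map[string]node) map[string]int {
	m := map[string]int{}
	for _, q := range pods {
		if !matches(t.selector, q.labels) {
			continue
		}
		if v, ok := all[q.node].labels[t.topologyKey]; ok {
			m[v]++
		}
	}
	return m
}

// filter keeps the candidates that satisfy t. With anti == false it is Step 1a
// (M[K] > 0); with anti == true it is Step 1b (M[K] == 0). As in the
// pseudocode, nodes without the topology label are dropped.
func filter(candidates []node, t term, pods []pod, all map[string]node, anti bool) []node {
	m := countByTopology(t, pods, all)
	var out []node
	for _, n := range candidates {
		v, ok := n.labels[t.topologyKey]
		if ok && (m[v] > 0) != anti {
			out = append(out, n)
		}
	}
	return out
}

func names(ns []node) []string {
	var out []string
	for _, n := range ns {
		out = append(out, n.name)
	}
	return out
}

func main() {
	all := map[string]node{
		"n1": {name: "n1", labels: map[string]string{"zone": "a"}},
		"n2": {name: "n2", labels: map[string]string{"zone": "a"}},
		"n3": {name: "n3", labels: map[string]string{"zone": "b"}},
	}
	candidates := []node{all["n1"], all["n2"], all["n3"]}
	pods := []pod{{labels: map[string]string{"app": "db"}, node: "n1"}}
	zoneTerm := term{selector: map[string]string{"app": "db"}, topologyKey: "zone"}

	fmt.Println(names(filter(candidates, zoneTerm, pods, all, false))) // affinity: zone a
	fmt.Println(names(filter(candidates, zoneTerm, pods, all, true)))  // anti-affinity: zone b
}
```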

## Special considerations for RequiredDuringScheduling anti-affinity

In this section we discuss three issues with RequiredDuringScheduling
anti-affinity: Denial of Service (DoS), co-existing with daemons, and
determining which pod(s) to kill. See issue #18265 for additional discussion of
these topics.

### Denial of Service

Without proper safeguards, a pod using RequiredDuringScheduling anti-affinity
can intentionally or unintentionally cause various problems for other pods, due
to the symmetry property of anti-affinity.

The most notable danger is that a pod that arrives first in some topology
domain can block all other pods from scheduling there by stating a conflict
with all other pods. The standard approach to preventing resource hogging is
quota, but simple resource quota cannot prevent this scenario because the pod
may request very few resources. Addressing this using quota requires
a quota scheme that charges based on "opportunity cost" rather than based simply
on requested resources. For example, when handling a pod that expresses
RequiredDuringScheduling anti-affinity for all pods using a "node" `TopologyKey`
(i.e. exclusive access to a node), it could charge for the resources of the
average or largest node in the cluster. Likewise if a pod expresses
RequiredDuringScheduling anti-affinity for all pods using a "cluster"
`TopologyKey`, it could charge for the resources of the entire cluster. If node
affinity is used to constrain the pod to a particular topology domain, then the
admission-time quota charging should take that into account (e.g. not charge for
the average/largest machine if the PodSpec constrains the pod to a specific
machine with a known size; instead charge for the size of the actual machine
that the pod was constrained to). In all cases once the pod is scheduled, the
quota charge should be adjusted down to the actual amount of resources allocated
(e.g. the size of the actual machine that was assigned, not the
average/largest). If a cluster administrator wants to overcommit quota, for
example to allow more than N pods across all users to request exclusive node
access in a cluster with N nodes, then a priority/preemption scheme should be
added so that the most important pods run when resource demand exceeds supply.

An alternative approach, which is a bit of a blunt hammer, is to use a
capability mechanism to restrict use of RequiredDuringScheduling anti-affinity
to trusted users. A more complex capability mechanism might only restrict it
when using a non-"node" TopologyKey.

Our initial implementation will use a variant of the capability approach, which
requires no configuration: we will simply reject ALL requests, regardless of
user, that specify "all namespaces" with non-"node" TopologyKey for
RequiredDuringScheduling anti-affinity. This allows the "exclusive node" use
case while prohibiting the more dangerous ones.
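
The rejection rule for the initial implementation might be expressed as a validation function along the following lines. The `term` type and the function name are hypothetical, and we assume the node-level topology key is literally `"node"` as in the examples in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// term is a simplified PodAffinityTerm. Following the API comment, a nil
// namespaces list means "this pod's namespace" and an empty, non-nil list
// means "all namespaces".
type term struct {
	namespaces  []string
	topologyKey string
}

// validateRequiredAntiAffinity rejects RequiredDuringScheduling anti-affinity
// terms that select all namespaces with a topology key other than "node".
func validateRequiredAntiAffinity(t term) error {
	allNamespaces := t.namespaces != nil && len(t.namespaces) == 0
	if allNamespaces && t.topologyKey != "node" {
		return errors.New(`"all namespaces" requires TopologyKey "node"`)
	}
	return nil
}

func main() {
	// Exclusive node access: allowed.
	fmt.Println(validateRequiredAntiAffinity(term{namespaces: []string{}, topologyKey: "node"}))
	// Exclusive zone access across all namespaces: rejected.
	fmt.Println(validateRequiredAntiAffinity(term{namespaces: []string{}, topologyKey: "zone"}))
	// Same topology key, but scoped to the pod's own namespace: allowed.
	fmt.Println(validateRequiredAntiAffinity(term{namespaces: nil, topologyKey: "zone"}))
}
```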

A weaker variant of the problem described in the previous paragraph is a pod's
ability to use anti-affinity to degrade the scheduling quality of another pod,
but not completely block it from scheduling. For example, a set of pods S1 could
use node affinity to request to schedule onto a set of nodes that some other set
of pods S2 prefers to schedule onto. If the pods in S1 have
RequiredDuringScheduling or even PreferredDuringScheduling pod anti-affinity for
S2, then due to the symmetry property of anti-affinity, they can prevent the
pods in S2 from scheduling onto their preferred nodes if they arrive first (for
sure in the RequiredDuringScheduling case, and with some probability that
depends on the weighting scheme for the PreferredDuringScheduling case). A very
sophisticated priority and/or quota scheme could mitigate this, or alternatively
we could eliminate the symmetry property of the implementation of
PreferredDuringScheduling anti-affinity. Then only RequiredDuringScheduling
anti-affinity could affect scheduling quality of another pod, and as we
described in the previous paragraph, such pods could be charged quota for the
full topology domain, thereby reducing the potential for abuse.

We won't try to address this issue in our initial implementation; we can
consider one of the approaches mentioned above if it turns out to be a problem
in practice.

### Co-existing with daemons

A cluster administrator may wish to allow pods that express anti-affinity
against all pods to nonetheless co-exist with system daemon pods, such as those
run by DaemonSet. In principle, we would like the specification for
RequiredDuringScheduling inter-pod anti-affinity to allow "toleration" of one or
more other pods (see #18263 for a more detailed explanation of the toleration
concept). There are at least two ways to accomplish this:

* Scheduler special-cases the namespace(s) where daemons live, in the
  sense that it ignores pods in those namespaces when it is
  determining feasibility for pods with anti-affinity. The name(s) of
  the special namespace(s) could be a scheduler configuration
  parameter, and default to `kube-system`. We could allow
  multiple namespaces to be specified if we want cluster admins to be
  able to give their own daemons this special power (they would add
  their namespace to the list in the scheduler configuration). And of
  course this would be symmetric, so daemons could schedule onto a node
  that is already running a pod with anti-affinity.

* We could add an explicit "toleration" concept/field to allow the
  user to specify namespaces that are excluded when they use
  RequiredDuringScheduling anti-affinity, and use an admission
  controller/defaulter to ensure these namespaces are always listed.

Our initial implementation will use the first approach.
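
A minimal sketch of the first approach: before evaluating anti-affinity, drop pods that live in the configured daemon namespaces. The `pod` type, the `visiblePods` helper, and the shape of the configuration (a set of namespace names) are assumptions made for illustration.

```go
package main

import "fmt"

// pod is a hypothetical, simplified pod record.
type pod struct {
	name      string
	namespace string
}

// visiblePods returns the pods that anti-affinity feasibility checks should
// consider, skipping any pod in one of the ignored (daemon) namespaces.
func visiblePods(pods []pod, ignored map[string]bool) []pod {
	var out []pod
	for _, p := range pods {
		if !ignored[p.namespace] {
			out = append(out, p)
		}
	}
	return out
}

func main() {
	// Assumed scheduler configuration; kube-system is the default named above.
	ignored := map[string]bool{"kube-system": true}
	onNode := []pod{
		{name: "log-agent", namespace: "kube-system"},
		{name: "web-1", namespace: "default"},
	}
	for _, p := range visiblePods(onNode, ignored) {
		fmt.Println(p.name) // only web-1 counts toward anti-affinity
	}
}
```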

### Determining which pod(s) to kill (for RequiredDuringSchedulingRequiredDuringExecution)

Because anti-affinity is symmetric, in the case of
RequiredDuringSchedulingRequiredDuringExecution anti-affinity, the system must
determine which pod(s) to kill when a pod's labels are updated in such a way as
to cause them to conflict with one or more other pods'
RequiredDuringSchedulingRequiredDuringExecution anti-affinity rules. In the
absence of a priority/preemption scheme, our rule will be that the pod with the
anti-affinity rule that becomes violated should be the one killed. A pod should
only specify constraints that apply to namespaces it trusts to not do malicious
things. Once we have priority/preemption, we can change the rule to say that the
lowest-priority pod(s) are killed until all
RequiredDuringSchedulingRequiredDuringExecution anti-affinity is satisfied.

## Special considerations for RequiredDuringScheduling affinity

The DoS potential of RequiredDuringScheduling *anti-affinity* stemmed from its
symmetry: if a pod P requests anti-affinity, P cannot schedule onto a node with
conflicting pods, and pods that conflict with P cannot schedule onto the node
once P has been scheduled there. The design we have described says that the
symmetry property for RequiredDuringScheduling *affinity* is weaker: if a pod P
says it can only schedule onto nodes running pod Q, this does not mean Q can
only run on a node that is running P, but the scheduler will try to schedule Q
onto a node that is running P (i.e. treats the reverse direction as preferred).
This raises the same scheduling quality concern as we mentioned at the end of
the Denial of Service section above, and can be addressed in similar ways.

The nature of affinity (as opposed to anti-affinity) means that there is no
issue of determining which pod(s) to kill when a pod's labels change: it is
obviously the pod with the affinity rule that becomes violated that must be
killed. (Killing a pod never "fixes" violation of an affinity rule; it can only
"fix" violation of an anti-affinity rule.) However, affinity does have a different
question related to killing: how long should the system wait before declaring
that RequiredDuringSchedulingRequiredDuringExecution affinity is no longer met
at runtime? For example, if a pod P has such an affinity for a pod Q and pod Q
is temporarily killed so that it can be updated to a new binary version, should
that trigger killing of P? More generally, how long should the system wait
before declaring that P's affinity is violated? (Of course affinity is expressed
in terms of label selectors, not for a specific pod, but the scenario is easier
to describe using a concrete pod.) This is closely related to the concept of
forgiveness (see issue #1574). In theory we could make this time duration be
configurable by the user on a per-pod basis, but for the first version of this
feature we will make it a configurable property of whichever component does the
killing and that applies across all pods using the feature. Making it
configurable by the user would require a nontrivial change to the API syntax
(since the field would only apply to
RequiredDuringSchedulingRequiredDuringExecution affinity).
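
As a rough sketch of the component-wide waiting period, the killing component could track how long each pod's affinity has been unsatisfied and act only once a single configured duration has elapsed. Everything named below (`affinityTracker`, `observe`, `defaultGrace`, the helper `at`, and the 30-second value) is hypothetical and chosen only for illustration; the proposal does not specify a value or a data structure.

```go
package main

import (
	"fmt"
	"time"
)

// defaultGrace is an illustrative value, not one taken from the proposal.
const defaultGrace = 30 * time.Second

// affinityTracker records since when a pod's required affinity has been
// unsatisfied. The grace period is shared by all pods using the feature.
type affinityTracker struct {
	grace            time.Duration
	unsatisfiedSince time.Time // zero value while the affinity is satisfied
}

// observe records one evaluation of the affinity rule at time now and reports
// whether the pod should be killed.
func (a *affinityTracker) observe(satisfied bool, now time.Time) bool {
	if satisfied {
		a.unsatisfiedSince = time.Time{} // reset the clock
		return false
	}
	if a.unsatisfiedSince.IsZero() {
		a.unsatisfiedSince = now // first observation of the violation
	}
	return now.Sub(a.unsatisfiedSince) >= a.grace
}

// at returns a fixed base time plus the given number of seconds; it exists
// only to keep the example deterministic.
func at(seconds int) time.Time {
	base := time.Date(2016, 1, 1, 0, 0, 0, 0, time.UTC)
	return base.Add(time.Duration(seconds) * time.Second)
}

func main() {
	tr := &affinityTracker{grace: defaultGrace}
	fmt.Println(tr.observe(false, at(0)))  // Q disappears: start the clock
	fmt.Println(tr.observe(true, at(20)))  // Q is back within the grace period
	fmt.Println(tr.observe(false, at(40))) // Q disappears again: clock restarts
	fmt.Println(tr.observe(false, at(70))) // unsatisfied for the full period
}
```

With this shape, a brief restart of Q for a binary update does not kill P as long as a matching pod reappears before the grace period expires.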

## Implementation plan

1. Add the `Affinity` field to PodSpec and the `PodAffinity` and
`PodAntiAffinity` types to the API along with all of their descendant types.
2. Implement a scheduler predicate that takes
`RequiredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity into
account. Include a workaround for the issue described at the end of the Affinity
section of the Examples section (can't schedule first pod).
3. Implement a scheduler priority function that takes
`PreferredDuringSchedulingIgnoredDuringExecution` affinity and anti-affinity
into account.
4. Implement an admission controller that rejects requests that specify "all
namespaces" with non-"node" TopologyKey for `RequiredDuringScheduling`
anti-affinity. This admission controller should be enabled by default.
5. Implement the recommended solution to the "co-existing with daemons" issue.
6. At this point, the feature can be deployed.
7. Add the `RequiredDuringSchedulingRequiredDuringExecution` field to affinity
and anti-affinity, and make sure the pieces of the system already implemented
for `RequiredDuringSchedulingIgnoredDuringExecution` also take
`RequiredDuringSchedulingRequiredDuringExecution` into account (e.g. the
scheduler predicate, the quota mechanism, the "co-existing with daemons"
solution).
8. Add `RequiredDuringSchedulingRequiredDuringExecution` for "node"
`TopologyKey` to Kubelet's admission decision.
9. Implement code in Kubelet *or* the controllers that evicts a pod that no
longer satisfies `RequiredDuringSchedulingRequiredDuringExecution`. If Kubelet,
then only for "node" `TopologyKey`; if controller, then potentially for all
`TopologyKey`s (see [this comment](https://github.com/kubernetes/kubernetes/issues/12744#issuecomment-164372008)).
Do so in a way that addresses the "determining which pod(s) to kill" issue.
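
To make step 4 concrete, here is a minimal sketch of the admission check. It is not the real admission plugin interface: `AntiAffinityTerm` is a hypothetical flattened view of one `RequiredDuringScheduling` anti-affinity term, the `AllNamespaces` boolean stands in for however the API expresses "all namespaces," and `nodeTopologyKey` is a placeholder for the label key that identifies a single node.

```go
package main

import (
	"errors"
	"fmt"
)

// nodeTopologyKey is a placeholder for the label key identifying one node.
const nodeTopologyKey = "node"

// AntiAffinityTerm is a hypothetical, flattened view of one
// RequiredDuringScheduling anti-affinity term.
type AntiAffinityTerm struct {
	AllNamespaces bool
	TopologyKey   string
}

var errTooBroad = errors.New(
	"required anti-affinity across all namespaces must use the node topology key")

// admit rejects terms that combine "all namespaces" with a topology domain
// larger than a single node, since such a rule could block other users' pods
// from an entire rack or zone.
func admit(terms []AntiAffinityTerm) error {
	for _, term := range terms {
		if term.AllNamespaces && term.TopologyKey != nodeTopologyKey {
			return errTooBroad
		}
	}
	return nil
}

func main() {
	zoneWide := []AntiAffinityTerm{{AllNamespaces: true, TopologyKey: "zone"}}
	nodeOnly := []AntiAffinityTerm{{AllNamespaces: true, TopologyKey: nodeTopologyKey}}
	ownNamespace := []AntiAffinityTerm{{AllNamespaces: false, TopologyKey: "zone"}}

	fmt.Println(admit(zoneWide))     // rejected
	fmt.Println(admit(nodeOnly))     // admitted
	fmt.Println(admit(ownNamespace)) // admitted
}
```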

We assume Kubelet publishes labels describing the node's membership in all of
the relevant scheduling domains (e.g. node name, rack name, availability zone
name, etc.). See #9044.
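
Given such labels, deciding whether two nodes share a topology domain reduces to comparing one label value. The sketch below assumes node labels are available as a plain map; the function `sameDomain` and the label keys `node`, `rack`, and `zone` are illustrative names, not keys defined by this proposal.

```go
package main

import "fmt"

// sameDomain reports whether two nodes, described by their labels, belong to
// the same topology domain for the given key. A node lacking the label is
// treated as belonging to no domain for that key.
func sameDomain(a, b map[string]string, topologyKey string) bool {
	va, okA := a[topologyKey]
	vb, okB := b[topologyKey]
	return okA && okB && va == vb
}

func main() {
	n1 := map[string]string{"node": "n1", "rack": "r1", "zone": "z1"}
	n2 := map[string]string{"node": "n2", "rack": "r2", "zone": "z1"}

	// The two nodes differ by node and rack but share a zone.
	fmt.Println(sameDomain(n1, n2, "node"))
	fmt.Println(sameDomain(n1, n2, "rack"))
	fmt.Println(sameDomain(n1, n2, "zone"))
}
```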

## Backward compatibility

Old versions of the scheduler will ignore `Affinity`.

Users should not start using `Affinity` until the full implementation has been
in Kubelet and the master for enough binary versions that we feel comfortable
that we will not need to roll back either Kubelet or master to a version that
does not support it. Longer-term we will use a programmatic approach to
enforcing this (#4855).

## Extensibility

The design described here is the result of careful analysis of use cases, a
decade of experience with Borg at Google, and a review of similar features in
other open-source container orchestration systems. We believe that it properly
balances the goal of expressiveness against the goals of simplicity and
efficiency of implementation. However, we recognize that use cases may arise in
the future that cannot be expressed using the syntax described here. Although we
are not implementing an affinity-specific extensibility mechanism for a variety
of reasons (simplicity of the codebase, simplicity of cluster deployment, desire
for Kubernetes users to get a consistent experience, etc.), the regular
Kubernetes annotation mechanism can be used to add or replace affinity rules.
The way this would work is:
1. Define one or more annotations to describe the new affinity rule(s)
1. User (or an admission controller) attaches the annotation(s) to pods to
request the desired scheduling behavior. If the new rule(s) *replace* one or
more fields of `Affinity` then the user would omit those fields from `Affinity`;
if they are *additional rules*, then the user would fill in `Affinity` as well
as the annotation(s).
1. Scheduler takes the annotation(s) into account when scheduling.

If some particular new syntax becomes popular, we would consider upstreaming it
by integrating it into the standard `Affinity`.
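
A minimal sketch of the "additional rules" case follows, in which a scheduler consults an annotation on top of the standard `Affinity` result. The annotation key `example.com/avoid-node`, the simplified `Pod` type, and the functions `fitsCustom` and `fits` are all invented for this illustration; the outcome of the standard `Affinity` evaluation is passed in as a boolean rather than computed.

```go
package main

import "fmt"

// avoidNodeAnnotation is a made-up annotation key used only for this sketch.
const avoidNodeAnnotation = "example.com/avoid-node"

// Pod is a hypothetical, simplified pod carrying annotations.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// fitsCustom applies the annotation-based rule: if the annotation is present,
// the pod must not be placed on the node it names.
func fitsCustom(p Pod, nodeName string) bool {
	avoid, ok := p.Annotations[avoidNodeAnnotation]
	return !ok || avoid != nodeName
}

// fits combines the standard Affinity result with the custom rule, so the
// annotation acts as an additional constraint rather than a replacement.
func fits(p Pod, nodeName string, standardAffinityOK bool) bool {
	return standardAffinityOK && fitsCustom(p, nodeName)
}

func main() {
	plain := Pod{Name: "plain"}
	picky := Pod{
		Name:        "picky",
		Annotations: map[string]string{avoidNodeAnnotation: "n1"},
	}

	fmt.Println(fits(plain, "n1", true)) // no annotation: standard result stands
	fmt.Println(fits(picky, "n1", true)) // annotation excludes n1
	fmt.Println(fits(picky, "n2", true)) // annotation does not apply to n2
}
```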

## Future work and non-work

One can imagine that in the anti-affinity RequiredDuringScheduling case one
might want to associate a number with the rule, for example "do not allow this
pod to share a rack with more than three other pods (in total, or from the same
service as the pod)." We could allow this to be specified by adding an integer
`Limit` to `PodAffinityTerm` just for the `RequiredDuringScheduling` case.
However, this flexibility complicates the system and we do not intend to
implement it.

It is likely that the specification and implementation of pod anti-affinity
can be unified with [taints and tolerations](taint-toleration-dedicated.md),
and likewise that the specification and implementation of pod affinity
can be unified with [node affinity](nodeaffinity.md). The basic idea is that pod
labels would be "inherited" by the node, and pods would only be able to specify
affinity and anti-affinity for a node's labels. Our main motivation for not
unifying taints and tolerations with pod anti-affinity is that we foresee taints
and tolerations as being a concept that only cluster administrators need to
understand (and indeed in some setups taints and tolerations wouldn't even be
directly manipulated by a cluster administrator, instead they would only be set
by an admission controller that is implementing the administrator's high-level
policy about different classes of special machines and the users who belong to
the groups allowed to access them). Moreover, the concept of nodes "inheriting"
labels from pods seems complicated; it seems conceptually simpler to separate
rules involving relatively static properties of nodes from rules involving which
other pods are running on the same node or larger topology domain.

Data/storage affinity is related to pod affinity, and is likely to draw on some
of the ideas we have used for pod affinity. Today, data/storage affinity is
expressed using node affinity, on the assumption that the pod knows which
node(s) store(s) the data it wants. But a more flexible approach would allow the
pod to name the data rather than the node.

## Related issues

The review for this proposal is in #18265.

The topic of affinity/anti-affinity has generated a lot of discussion. The main
issue is #367 but #14484/#14485, #9560, #11369, #14543, #11707, #3945, #341,
and #1965 as well as #2906 all have additional discussion and use cases.

As the examples in this document have demonstrated, topological affinity is very
useful in clusters that are spread across availability zones, e.g. to co-locate
pods of a service in the same zone to avoid a wide-area network hop, or to
spread pods across zones for failure tolerance. Issues #17059, #13056, #13063,
and #4235 are relevant.

Issue #15675 describes connection affinity, which is vaguely related.

This proposal is to satisfy #14816.

## Related work

**TODO: cite references**

