RFC design docs for Cluster Federation/Ubernetes.

docs/design/control-plane-resilience.md (new file, 269 lines)

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes/Ubernetes Control Plane Resilience

## Long Term Design and Current Status

### by Quinton Hoole, Mike Danese and Justin Santa-Barbara

### December 14, 2015

## Summary

Some amount of confusion exists around how we currently, and in
future, want to ensure resilience of the Kubernetes (and by
implication Ubernetes) control plane.  This document is an attempt to
capture that definitively.  It covers areas including self-healing,
high availability, bootstrapping and recovery.  Most of the
information in this document already exists in the form of github
comments, PR's/proposals, scattered documents, and corridor
conversations, so this document is primarily a consolidation and
clarification of existing ideas.

## Terms

* **Self-healing:** automatically restarting or replacing failed
  processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
  even if some components are down or uncontactable.  This typically
  involves multiple replicas of critical services, and a reliable way
  to find available replicas.  Note that it's possible (but not
  desirable) to have high availability properties (e.g. multiple
  replicas) in the absence of self-healing properties (e.g. if a
  replica fails, nothing replaces it).  Fairly obviously, given enough
  time, such systems typically become unavailable (after enough
  replicas have failed).
* **Bootstrapping**: creating an empty cluster from nothing
* **Recovery**: recreating a non-empty cluster after perhaps
  catastrophic failure/unavailability/data corruption

## Overall Goals

1. **Resilience to single failures:** Kubernetes clusters constrained
   to single availability zones should be resilient to individual
   machine and process failures by being both self-healing and highly
   available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
   scripts for (at least) GCE, AWS and basic bare metal should adhere
   to the above (self-healing and high availability) by default (with
   options available to disable these features to reduce control plane
   resource requirements if so required).  It is hoped that other
   cloud providers will also follow the above guidelines, but the
   above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
   which span multiple availability zones in a region should by
   default be resilient to complete failure of one entire availability
   zone (by similarly providing self-healing and high availability in
   the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
   differences between the default implementations of the above for
   GCE, AWS and basic bare metal should be minimized.  This implies
   using shared libraries across these providers in the default
   scripts in preference to highly customized implementations per
   cloud provider.  This is not to say that highly differentiated,
   customized per-cloud cluster creation processes (e.g. for GKE on
   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
   But those fall squarely outside the basic cross-platform OSS
   Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
   for achieving system resilience (replication controllers, health
   checking, service load balancing etc) should be used in preference
   to building a separate set of mechanisms to achieve the same thing.
   This implies that self-hosting (the Kubernetes control plane on
   Kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
   reliably recover a cluster from catastrophic failure is critical,
   and should not be compromised by the above goal to self-host
   (i.e. it goes without saying that the cluster should be quickly and
   reliably recoverable, even if the cluster control plane is
   broken).  This implies that such catastrophic failure scenarios
   should be carefully thought out, and be the subject of regular
   continuous integration testing and disaster recovery exercises.

## Relative Priorities

1. **(Possibly manual) recovery from catastrophic failures:** having a
   Kubernetes cluster, and all applications running inside it,
   disappear forever is perhaps the worst possible failure mode.  So
   it is critical that we be able to recover the applications running
   inside a cluster from such failures in some well-bounded time
   period.
    1. In theory a cluster can be recovered by replaying all API calls
       that have ever been executed against it, in order, but most
       often that state has been lost, and/or is scattered across
       multiple client applications or groups.  So in general it is
       probably infeasible.
    1. In theory a cluster can also be recovered to some relatively
       recent non-corrupt backup/snapshot of the disk(s) backing the
       etcd cluster state.  But we have no default consistent
       backup/snapshot, verification or restoration process.  And we
       don't routinely test restoration, so even if we did routinely
       perform and verify backups, we have no hard evidence that we
       can in practice effectively recover from catastrophic cluster
       failure or data corruption by restoring from these backups.  So
       there's more work to be done here (see the sketch after this
       list).
1. **Self-healing:** Most major cloud providers provide the ability to
   easily and automatically replace failed virtual machines within a
   small number of minutes (e.g. GCE
   [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
   and Managed Instance Groups,
   AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
   and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
   can fairly trivially be used to reduce control-plane down-time due
   to machine failure to a small number of minutes per failure
   (i.e. typically around "3 nines" availability), provided that:
    1. cluster persistent state (i.e. etcd disks) is either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using etcd [dynamic member
           addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
           or [backup and
           recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).

    1. and boot disks are either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using boot-from-snapshot,
           boot-from-pre-configured-image or
           boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
   availability above the approximately "3 nines" level provided by
   automated self-healing, but it's somewhat more complex, and
   requires additional resources (e.g. redundant API servers and etcd
   quorum members).  In environments where cloud-assisted automatic
   self-healing might be infeasible (e.g. on-premise bare-metal
   deployments), it also gives cluster administrators more time to
   respond (e.g. replace/repair failed machines) without incurring
   system downtime.
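
To make the backup/restore gap in item 1 above concrete, the kind of
routine that would need to be automated and continuously tested looks
roughly like the following, using etcd's own tooling (a sketch only:
the paths and dates are placeholders, and nothing like this exists in
the current "kube-up" automation):

    # Take a consistent snapshot of the etcd data directory.
    etcdctl backup \
      --data-dir /var/lib/etcd \
      --backup-dir /var/backups/etcd-$(date +%Y%m%d)

    # Disaster-recovery drill: restore the snapshot onto a fresh node as a
    # new single-member cluster, then re-grow the quorum via runtime
    # configuration (see the etcd admin guide linked above).
    etcd --data-dir /var/backups/etcd-20151214 --force-new-cluster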

## Design and Status (as of December 2015)

<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>

Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM).  Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API
server can connect to it will no longer work, so alternative security
will be needed in this regard (either using firewall rules, SSL certs,
or something else).  All necessary flags are currently supported to
enable SSL between API server and etcd (OpenShift runs like this out
of the box), but this needs to be woven into the "kube-up" and related
scripts.  Detailed design of self-hosting and related bootstrapping
and catastrophic failure recovery will be detailed in a separate
design doc.
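
For illustration, the etcd-related flags on the API server side look
something like this (the endpoint names and certificate file paths are
placeholders; the point is only that the plumbing exists and merely
needs to be wired into "kube-up"):

<pre>
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --etcd-cafile=/srv/kubernetes/etcd-ca.crt \
  --etcd-certfile=/srv/kubernetes/apiserver-etcd-client.crt \
  --etcd-keyfile=/srv/kubernetes/apiserver-etcd-client.key
  # ...plus the usual serving, authentication and cloud-provider flags.
</pre>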

</td>
<td>

No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro.  To be clear, "no self-healing"
means that even if multiple e.g. API servers are provisioned for HA
purposes, if they fail, nothing replaces them, so eventually the
system will fail.  Self-healing and HA can be set up manually by
following documented instructions, but this is not currently an
automated process, and it is not tested as part of continuous
integration.  So it's probably safest to assume that it doesn't
actually work in practice.

</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>

Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of
API server clients, automatically installed by default "kube-up"
automation.
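
Concretely, the leader election referred to here is what the
--leader-elect flag on these components enables; whether and how the
default "kube-up" scripts pass it is exactly the automation gap noted
in the status column.  A sketch (the master address is a placeholder):

<pre>
kube-controller-manager --master=https://my-apiserver:6443 --leader-elect=true
kube-scheduler          --master=https://my-apiserver:6443 --leader-elect=true
</pre>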

</td>
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>

Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).

Regarding self-healing, if a node running etcd goes down, it is always
necessary to do three things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors),
<li>start an etcd replica on that new node,
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine),
the etcd state must be recovered from the other replicas.  This is
called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
addition</A>.
In the case of remote persistent disk, the etcd state can be recovered
by attaching the remote persistent disk to the replacement node, so
the state is recoverable even if all other replicas are down.

There are also significant performance differences between local disks
and remote persistent disks.  For example, the
<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput of
local disks in GCE is approximately 20x that of remote disks</A>.

Hence we suggest that self-healing be provided by remotely mounted
persistent disks in non-performance-critical, single-zone cloud
deployments.  For performance-critical installations, faster local
SSD's should be used, in which case remounting on node failure is not
an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
should be used to replace the failed machine.  Similarly, for
cross-zone self-healing, cloud persistent disks are zonal, so automatic
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
is required.  Similarly, basic bare metal deployments cannot generally
rely on remote persistent disks, so the same approach applies there.
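
To make the runtime-reconfiguration path concrete, replacing a failed
member by hand follows the procedure in the etcd documentation linked
above (the member names, ID and peer URL's below are placeholders):

<pre>
# On any healthy member: drop the dead member, then register its replacement.
etcdctl member remove 6e3bd23ae5f1eae0
etcdctl member add etcd-3 http://10.240.0.7:2380

# On the replacement node: join the existing cluster, don't bootstrap a new one.
etcd --name etcd-3 \
  --initial-cluster "etcd-1=http://10.240.0.4:2380,etcd-2=http://10.240.0.5:2380,etcd-3=http://10.240.0.7:2380" \
  --initial-cluster-state existing
</pre>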
</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A>
on how to set some of this up manually in a self-hosted
configuration.  But automatic bootstrapping and self-healing is not
described (and is not implemented for the non-PD cases).  This all
still needs to be automated and continuously tested.
</td>
</tr>
</table>


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

docs/design/federated-services.md (new file, 550 lines)

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes Cluster Federation (a.k.a. "Ubernetes")

## Cross-cluster Load Balancing and Service Discovery

### Requirements and System Design

### by Quinton Hoole, Dec 3 2015

## Requirements

### Discovery, Load-balancing and Failover

1. **Internal discovery and connection**: Pods/containers (running in
   a Kubernetes cluster) must be able to easily discover and connect
   to endpoints for Kubernetes services on which they depend in a
   consistent way, irrespective of whether those services exist in a
   different Kubernetes cluster within the same cluster federation.
   Henceforth referred to as "cluster-internal clients", or simply
   "internal clients".
1. **External discovery and connection**: External clients (running
   outside a Kubernetes cluster) must be able to discover and connect
   to endpoints for Kubernetes services on which they depend.
   1. **External clients predominantly speak HTTP(S)**: External
      clients are most often, but not always, web browsers, or at
      least speak HTTP(S) - notable exceptions include Enterprise
      Message Buses (Java, TLS), DNS servers (UDP), SIP servers and
      databases.
1. **Find the "best" endpoint:** Upon initial discovery and
   connection, both internal and external clients should ideally find
   "the best" endpoint if multiple eligible endpoints exist.  "Best"
   in this context implies the closest (by network topology) endpoint
   that is both operational (as defined by some positive health check)
   and not overloaded (by some published load metric).  For example:
   1. An internal client should find an endpoint which is local to its
      own cluster if one exists, in preference to one in a remote
      cluster (if both are operational and non-overloaded).
      Similarly, one in a nearby cluster (e.g. in the same zone or
      region) is preferable to one further afield.
   1. An external client (e.g. in New York City) should find an
      endpoint in a nearby cluster (e.g. U.S. East Coast) in
      preference to one further away (e.g. Japan).
1. **Easy fail-over:** If the endpoint to which a client is connected
   becomes unavailable (no network response/disconnected) or
   overloaded, the client should reconnect to a better endpoint,
   somehow.
   1. In the case where there exist one or more connection-terminating
      load balancers between the client and the serving Pod, failover
      might be completely automatic (i.e. the client's end of the
      connection remains intact, and the client is completely
      oblivious of the fail-over).  This approach incurs network speed
      and cost penalties (by traversing possibly multiple load
      balancers), but requires zero smarts in clients, DNS libraries,
      recursing DNS servers etc, as the IP address of the endpoint
      remains constant over time.
   1. In a scenario where clients need to choose between multiple load
      balancer endpoints (e.g. one per cluster), multiple DNS A
      records associated with a single DNS name enable even relatively
      dumb clients to try the next IP address in the list of returned
      A records (without even necessarily re-issuing a DNS resolution
      request).  For example, all major web browsers will try all A
      records in sequence until a working one is found (TBD: justify
      this claim with details for Chrome, IE, Safari, Firefox).
   1. In a slightly more sophisticated scenario, upon disconnection, a
      smarter client might re-issue a DNS resolution query, and
      (modulo DNS record TTL's which can typically be set as low as 3
      minutes, and buggy DNS resolvers, caches and libraries which
      have been known to completely ignore TTL's), receive updated A
      records specifying a new set of IP addresses to which to
      connect.
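
The multiple-A-record behaviour described above can be exercised from
the client side with nothing more than standard DNS and HTTP tooling;
a minimal sketch (the hostname and health-check path are placeholders):

    # Resolve all A records for the service name, then try each endpoint in
    # turn until one answers, exactly as a relatively dumb client would.
    for ip in $(dig +short my-service.example.com A); do
        curl --connect-timeout 2 "http://${ip}/healthz" && break
    done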

### Portability

A Kubernetes application configuration (e.g. for a Pod, Replication
Controller, Service etc) should be able to be successfully deployed
into any Kubernetes Cluster or Ubernetes Federation of Clusters,
without modification.  More specifically, a typical configuration
should work correctly (although possibly not optimally) across any of
the following environments:

1. A single Kubernetes Cluster on one cloud provider (e.g. Google
   Compute Engine, GCE)
1. A single Kubernetes Cluster on a different cloud provider
   (e.g. Amazon Web Services, AWS)
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
1. A Federation of Kubernetes Clusters all on the same cloud provider
   (e.g. GCE)
1. A Federation of Kubernetes Clusters across multiple different cloud
   providers and/or on-premise data centers (e.g. one cluster on
   GCE/GKE, one on AWS, and one on-premise).

### Trading Portability for Optimization

It should be possible to explicitly opt out of portability across some
subset of the above environments in order to take advantage of
non-portable load balancing and DNS features of one or more
environments.  More specifically, for example:

1. For HTTP(S) applications running on GCE-only Federations,
   [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
   should be usable.  These provide single, static global IP addresses
   which load balance and fail over globally (i.e. across both regions
   and zones).  These allow for really dumb clients, but they only
   work on GCE, and only for HTTP(S) traffic.
1. For non-HTTP(S) applications running on GCE-only Federations within
   a single region,
   [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
   should be usable.  These provide TCP (i.e. both HTTP/S and
   non-HTTP/S) load balancing and failover, but only on GCE, and only
   within a single region.
   [Google Cloud DNS](https://cloud.google.com/dns) can be used to
   route traffic between regions (and between different cloud
   providers and on-premise clusters, as it's plain DNS, IP only).
1. For applications running on AWS-only Federations,
   [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
   should be usable.  These provide both L7 (HTTP(S)) and L4 load
   balancing, but only within a single region, and only on AWS
   ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
   used to load balance and fail over across multiple regions, and is
   also capable of resolving to non-AWS endpoints).
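
The "plain DNS, IP only" routing mentioned above for Google Cloud DNS
amounts to publishing the per-region (or per-provider) load balancer
IP's as A records under a single name.  A rough sketch with the gcloud
CLI follows; the zone, record name and IP's are invented, and the
exact flag spelling should be checked against the gcloud release in
use:

    gcloud dns record-sets transaction start --zone=my-federation-zone
    gcloud dns record-sets transaction add --zone=my-federation-zone \
        --name=my-service.my-domain.com. --type=A --ttl=180 \
        "104.197.117.10" "54.93.77.24"
    gcloud dns record-sets transaction execute --zone=my-federation-zone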

## Component Cloud Services

Ubernetes cross-cluster load balancing is built on top of the following:

1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
   provide single, static global IP addresses which load balance and
   fail over globally (i.e. across both regions and zones).  These
   allow for really dumb clients, but they only work on GCE, and only
   for HTTP(S) traffic.
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
   provide both HTTP(S) and non-HTTP(S) load balancing and failover,
   but only on GCE, and only within a single region.
1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
   provide both L7 (HTTP(S)) and L4 load balancing, but only within a
   single region, and only on AWS.
1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
   programmable DNS service, like
   [CloudFlare](http://www.cloudflare.com)) can be used to route
   traffic between regions (and between different cloud providers and
   on-premise clusters, as it's plain DNS, IP only).  Google Cloud DNS
   doesn't provide any built-in geo-DNS, latency-based routing, health
   checking, weighted round robin or other advanced capabilities.
   It's plain old DNS.  We would need to build all the aforementioned
   on top of it.  It can provide internal DNS services (i.e. serve RFC
   1918 addresses).
   1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
   be used to load balance and fail over across regions, and is also
   capable of routing to non-AWS endpoints.  It provides built-in
   geo-DNS, latency-based routing, health checking, weighted
   round robin and optional tight integration with some other
   AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
   [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
   and a
   [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
   service IP which is load-balanced (currently simple round-robin)
   across the healthy pods comprising a service within a single
   Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HA Proxy.

## Ubernetes API

The Ubernetes API for load balancing should be compatible with the
equivalent Kubernetes API, to ease porting of clients between
Ubernetes and Kubernetes.  Further details below.

## Common Client Behavior

To be useful, our load balancing solution needs to work properly with
real client applications.  There are a few different classes of
those...

### Browsers

These are the most common external clients.  These are all
well-written.  See below.

### Well-written clients

1. Do a DNS resolution every time they connect.
1. Don't cache beyond TTL (although a small percentage of the DNS
   servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.

Examples:

+  all common browsers (except for SRV records)
+  ...

### Dumb clients

1. Don't do a DNS resolution every time they connect (or do cache
   beyond the TTL).
1. Do try multiple A records.

Examples:

+  ...

### Dumber clients

1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.

Examples:

+  ...

### Dumbest clients

1. Never do a DNS lookup - are pre-configured with a single (or
   possibly multiple) fixed server IP(s).  Nothing else matters.

## Architecture and Implementation

### General control plane architecture

Each cluster hosts one or more Ubernetes master components (Ubernetes
API servers, controller managers with leader election, and etcd quorum
members).  This is documented in more detail in a
[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).

In the description below, assume that 'n' clusters, named
'cluster-1'... 'cluster-n' have been registered against an Ubernetes
Federation "federation-1", each with their own set of Kubernetes API
endpoints, e.g.
[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n).

### Federated Services

Ubernetes Services are pretty straightforward.  They're comprised of
multiple equivalent underlying Kubernetes Services, each with their
own external endpoint, and a load balancing mechanism across them.
Let's work through how exactly that works in practice.

Our user creates the following Ubernetes Service (against an Ubernetes
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where `my-service.yaml` contains the following:

    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer

Ubernetes in turn creates one equivalent service (identical config to
the above) in each of the underlying Kubernetes clusters, each of
which results in something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10

Similar services are created in `cluster-2` and `cluster-3`, each of
which is allocated its own `spec.clusterIP` and
`status.loadBalancer.ingress.ip`.

In Ubernetes `federation-1`, the resulting federated service looks as follows:

    $ kubectl get -o yaml --context="federation-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:23Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "157333"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00007
    spec:
      clusterIP:
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - hostname: my-service.my-namespace.my-federation.my-domain.com

Note that the federated service:

1. is API-compatible with a vanilla Kubernetes service,
1. has no clusterIP (as it is cluster-independent), and
1. has a federation-wide load balancer hostname.

In addition to the set of underlying Kubernetes services (one per
cluster) described above, Ubernetes has also created a DNS name
(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
[AWS Route 53](https://aws.amazon.com/route53/), depending on
configuration) which provides load balancing across all of those
services.  For example, in a very basic configuration:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.38.157

Each of the above IP addresses (which are just the external load
balancer ingress IP's of each cluster service) is of course load
balanced across the pods comprising the service in each cluster.
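
Those per-cluster ingress IP's are exactly what Ubernetes programs
into the DNS record above.  Purely for illustration, they could be
gathered by hand as follows (the grep is a crude stand-in for
structured output):

    for c in cluster-1 cluster-2 cluster-3; do
        # Print the external load balancer ingress IP of each cluster's service.
        kubectl get service my-service --context="$c" -o yaml | grep "ip:"
    done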

In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
automatically creates a
[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
which exposes a single, globally load-balanced IP:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 107.194.17.44

Optionally, Ubernetes also configures the local DNS servers (SkyDNS)
in each Kubernetes cluster to preferentially return the local
clusterIP for the service in that cluster, with other clusters'
external service IP's (or a global load-balanced IP) also configured
for failover purposes:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 10.0.153.185
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.38.157

If Ubernetes Global Service Health Checking is enabled, multiple
service health checkers running across the federated clusters
collaborate to monitor the health of the service endpoints, and
automatically remove unhealthy endpoints from the DNS record (e.g. a
majority quorum is required to vote a service endpoint unhealthy, to
avoid false positives due to individual health checker network
isolation).

### Federated Replication Controllers

So far we have a federated service defined, with a resolvable load
balancer hostname by which clients can reach it, but no pods serving
traffic directed there.  So now we need a Federated Replication
Controller.  These are also fairly straightforward, being comprised
of multiple underlying Kubernetes Replication Controllers which do the
hard work of keeping the desired number of Pod replicas alive in each
Kubernetes cluster.

    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP

Ubernetes in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ kubectl get -o yaml rc my-service --context="cluster-1"
    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2

The exact number of replicas created in each underlying cluster will
of course depend on what scheduling policy is in force.  In the above
example, the scheduler created an equal number of replicas (2) in each
of the three underlying clusters, to make up the total of 6 replicas
required.  To handle entire cluster failures, various approaches are
possible, including:

1. **simple overprovisioning**, such that sufficient replicas remain
   even if a cluster fails.  This wastes some resources, but is simple
   and reliable.
2. **pod autoscaling**, where the replication controller in each
   cluster automatically and autonomously increases the number of
   replicas in its cluster in response to the additional traffic
   diverted from the failed cluster.  This saves resources and is
   relatively simple, but there is some delay in the autoscaling.
3. **federated replica migration**, where the Ubernetes Federation
   Control Plane detects the cluster failure and automatically
   increases the replica count in the remaining clusters to make up
   for the lost replicas in the failed cluster.  This does not seem to
   offer any benefits relative to pod autoscaling above, and is
   arguably more complex to implement, but we note it here as a
   possibility.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
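
As a rough illustration of the arithmetic behind option 1, here is a minimal
sketch (our own illustration, not part of any Kubernetes or Ubernetes API) of
how many replicas each cluster would need to carry so that the desired total
survives a given number of simultaneous cluster failures:

```go
package main

import "fmt"

// replicasPerCluster returns how many replicas each of numClusters clusters
// should run so that desiredTotal replicas remain available even if
// maxClusterFailures clusters fail at the same time.
func replicasPerCluster(desiredTotal, numClusters, maxClusterFailures int) int {
	surviving := numClusters - maxClusterFailures
	if surviving < 1 {
		surviving = 1 // degenerate case: keep everything in one surviving cluster
	}
	// Ceiling division: the surviving clusters must carry the whole total.
	return (desiredTotal + surviving - 1) / surviving
}

func main() {
	// 6 replicas across 3 clusters, tolerating one cluster outage:
	// each cluster runs 3 replicas (9 in total), i.e. 50% overprovisioning.
	fmt.Println(replicasPerCluster(6, 3, 1))
}
```

For the example above (6 replicas across 3 clusters, tolerating one cluster
outage), this works out to 3 replicas per cluster, i.e. 50% overprovisioning.
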
### Implementation Details

The implementation approach and architecture is very similar to
Kubernetes, so if you're familiar with how Kubernetes works, none of
what follows will be surprising.  One additional design driver not
present in Kubernetes is that Ubernetes aims to be resilient to
individual cluster and availability zone failures.  So the control
plane spans multiple clusters.  More specifically:

+ Ubernetes runs its own distinct set of API servers (typically one
   or more per underlying Kubernetes cluster).  These are completely
   distinct from the Kubernetes API servers for each of the underlying
   clusters.
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
   by default).  Approximately one quorum member runs in each underlying
   cluster ("approximately" because we aim for an odd number of quorum
   members, and typically don't want more than 5 quorum members, even
   if we have a larger number of federated clusters, so 2 clusters -> 3
   quorum members, 3 -> 3, 4 -> 3, 5 -> 5, 6 -> 5, 7 -> 5, etc.; see the
   sketch after this list).
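
The quorum-sizing rule above can be captured in a few lines; this is purely
illustrative (the helper below is not part of the codebase):

```go
package main

import "fmt"

// quorumMembers maps the number of federated clusters to the number of etcd
// quorum members to run: always odd, at least 3, and never more than 5.
func quorumMembers(numClusters int) int {
	n := numClusters
	if n < 3 {
		n = 3 // a meaningful quorum needs at least 3 members
	}
	if n > 5 {
		n = 5 // cap the quorum size to keep writes fast
	}
	if n%2 == 0 {
		n-- // prefer an odd number of members (e.g. 4 clusters -> 3 members)
	}
	return n
}

func main() {
	for clusters := 2; clusters <= 7; clusters++ {
		fmt.Printf("%d clusters -> %d quorum members\n", clusters, quorumMembers(clusters))
	}
}
```
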
Cluster Controllers in Ubernetes watch against the Ubernetes API
server/etcd state, and apply changes to the underlying Kubernetes
clusters accordingly.  They also implement an anti-entropy mechanism
that reconciles the Ubernetes "desired desired" state against the
Kubernetes "actual desired" state.
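
Roughly speaking, the anti-entropy loop looks like the following sketch; every
type and function in it is hypothetical and stands in for calls to the
Ubernetes store and to the underlying clusters' API servers:

```go
package main

import "time"

// SubRC is a hypothetical stand-in for a replication controller that the
// federation control plane expects to exist in one underlying cluster.
type SubRC struct {
	Cluster  string
	Name     string
	Replicas int
}

// In reality these would watch the Ubernetes etcd ("desired desired" state)
// and query each underlying Kubernetes API server ("actual desired" state).
func desiredState() map[string]SubRC { return map[string]SubRC{} }
func actualState() map[string]SubRC  { return map[string]SubRC{} }

func createOrUpdate(rc SubRC)    { /* call the underlying cluster's API server */ }
func deleteFromCluster(rc SubRC) { /* call the underlying cluster's API server */ }

// reconcile drives the "actual desired" state in the underlying clusters
// towards the "desired desired" state recorded by the federation control plane.
func reconcile() {
	desired := desiredState()
	actual := actualState()
	for key, want := range desired {
		if have, ok := actual[key]; !ok || have.Replicas != want.Replicas {
			createOrUpdate(want)
		}
	}
	for key, have := range actual {
		if _, ok := desired[key]; !ok {
			deleteFromCluster(have)
		}
	}
}

func main() {
	for {
		reconcile()
		time.Sleep(30 * time.Second) // periodic anti-entropy pass
	}
}
```
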
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

docs/design/federation-phase-1.md (new file)
@@ -0,0 +1,434 @@

# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For the background of
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list the scenarios
and use cases that motivate the K8S federation work. These use cases drive
the design, and they also serve to validate it. We summarize the
functionality requirements derived from these use cases, and define the “in
scope” functionalities that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and we walk through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
   a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
   cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
   primary cluster. But if the capacity of the cluster is not
   sufficient, workloads should be automatically distributed to other
   clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
   workloads across different cloud providers, and be able to easily
   increase or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only scale
   to a limited size. While the community is actively improving this, it
   can be expected that cluster size will be a problem if K8S is used for
   large workloads or public PaaS infrastructure. While we can separate
   different tenants onto different clusters, it would be good to have a
   unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and deregister clusters.
+ Workloads should be spread to different clusters according to the
   workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
   the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
   activities.

## SCOPE

It’s difficult to produce in one pass a perfect design that implements
all of the above requirements. Therefore we will go with an iterative
approach to design and build the system. This document describes
phase one of the whole work.  In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
   + Clients register federated clusters
   + Clients submit a workload
   + The workload is distributed to different clusters
   + Service discovery
   + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the Ubernetes API, and from the Ubernetes control
  plane to the underlying Kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown in the diagram below:

![Ubernetes control plane architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects.  This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters.  We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters.  For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work.  For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates
Kubernetes Replication Controllers on one or more underlying clusters,
and posts them back to the `etcd` storage.

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user (which might
include, for example, placement constraints) and the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").
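
To make that combination concrete, here is a purely illustrative sketch of
filtering by the user's placement constraints and then ordering by an
administrator preference; none of these types or names exist in the real code:

```go
package main

import (
	"fmt"
	"sort"
)

// userConstraint is the application-specific part of the request, e.g. the
// set of clusters the workload may run in (empty means "any cluster").
type userConstraint struct {
	AcceptableClusters map[string]bool
}

// adminPolicy is the federation-wide part, e.g. "prefer on-premise clusters
// over AWS clusters", expressed here as a simple preference weight.
type adminPolicy struct {
	Preference map[string]int // higher means more preferred
}

// candidateClusters applies the user's constraint first, then orders the
// remaining clusters by the administrator's preference.
func candidateClusters(all []string, u userConstraint, p adminPolicy) []string {
	var out []string
	for _, c := range all {
		if len(u.AcceptableClusters) == 0 || u.AcceptableClusters[c] {
			out = append(out, c)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return p.Preference[out[i]] > p.Preference[out[j]]
	})
	return out
}

func main() {
	all := []string{"aws-east", "onprem-1", "onprem-2"}
	user := userConstraint{} // no placement constraint from the user
	admin := adminPolicy{Preference: map[string]int{"onprem-1": 10, "onprem-2": 10, "aws-east": 1}}
	fmt.Println(candidateClusters(all, user, admin)) // on-premise clusters first
}
```
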

## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S clusters, and updates them in the status of the
   `cluster` API object (see the sketch after this list).  An
   alternative design might be to run a pod in each underlying cluster
   that reports metrics for that cluster to the Ubernetes control
   plane.  Which approach is better remains an open topic of discussion.
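
The second kind of work can be pictured as a simple polling loop that
aggregates per-node metrics into a single cluster capacity; this is only a
sketch, with hypothetical helpers standing in for calls to the underlying
cluster's API server and to the Ubernetes API server:

```go
package main

import "time"

// NodeCapacity is the free CPU (millicores) and memory (bytes) reported by
// one node of an underlying cluster (illustrative type).
type NodeCapacity struct {
	CPUMillis   int64
	MemoryBytes int64
}

// listNodeCapacities would in reality query the underlying cluster's API
// server; updateClusterStatus would write the aggregate back into the
// `cluster` object's status through the Ubernetes API server.
func listNodeCapacities(cluster string) []NodeCapacity   { return nil }
func updateClusterStatus(cluster string, c NodeCapacity) {}

// pollCluster aggregates per-node metrics into a single cluster capacity,
// which is how phase one fills in the capacity part of the cluster status.
func pollCluster(cluster string) {
	var total NodeCapacity
	for _, n := range listNodeCapacities(cluster) {
		total.CPUMillis += n.CPUMillis
		total.MemoryBytes += n.MemoryBytes
	}
	updateClusterStatus(cluster, total)
}

func main() {
	for {
		for _, cluster := range []string{"Foo", "Bar"} {
			pollCluster(cluster)
		}
		time.Sleep(time.Minute) // illustrative polling interval
	}
}
```
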

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane and creates corresponding K8S services on each involved K8S
cluster.  Besides interacting with the service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. The way clients register or deregister a cluster is to
send corresponding REST requests to the following URL:
`/api/{$version}/clusters`.  Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains the necessary properties like the K8S cluster address and
credentials.  The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata like the version of the cluster

`$version.clusterSpec`

| Name | Description | Required | Schema | Default |
| ---- | ----------- | -------- | ------ | ------- |
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

`$version.clusterStatus`

| Name | Description | Required | Schema | Default |
| ---- | ----------- | -------- | ------ | ------- |
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | other cluster metadata like the version | yes | ClusterMeta | |

**For simplicity we didn’t introduce a separate “cluster metrics” API
object here**. The cluster resource metrics are stored in the cluster
status section, just like what we do for nodes in K8S. In phase one it
only contains available CPU and memory resources.  The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it gets the metrics by
simply aggregating the metrics from all nodes. In future we will improve
this with more efficient approaches, like leveraging Heapster, and more
metrics will be supported.  Similar to node phases in K8S, the “phase”
field includes the following values:

+ pending: newly registered clusters or clusters suspended by the admin
   for various reasons. They are not eligible for accepting workloads.
+ running: clusters in normal status that can accept workloads.
+ offline: clusters temporarily down or not reachable.
+ terminated: clusters removed from the federation.

Below is the state transition diagram.

![Cluster state transition diagram](ubernetes-cluster-state.png)
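
Pulling the two tables and the phase list together, a rough Go sketch of the
Cluster API object might look like the following; all field and type names
here are illustrative rather than the final API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterPhase mirrors the lifecycle phases listed above.
type ClusterPhase string

const (
	ClusterPending    ClusterPhase = "pending"
	ClusterRunning    ClusterPhase = "running"
	ClusterOffline    ClusterPhase = "offline"
	ClusterTerminated ClusterPhase = "terminated"
)

// ClusterSpec holds what the control plane needs in order to reach and
// authenticate to a member cluster (cf. $version.clusterSpec above).
type ClusterSpec struct {
	Address    string `json:"address"`
	Credential string `json:"credential"`
}

// ClusterStatus is reported by the cluster controller
// (cf. $version.clusterStatus above).
type ClusterStatus struct {
	Phase       ClusterPhase      `json:"phase"`
	Capacity    map[string]string `json:"capacity"`    // e.g. aggregated cpu/memory
	ClusterMeta map[string]string `json:"clusterMeta"` // e.g. cluster version
}

// Cluster is what a client would POST to /api/{$version}/clusters to
// register a new cluster (status omitted on registration).
type Cluster struct {
	Name   string         `json:"name"`
	Spec   ClusterSpec    `json:"spec"`
	Status *ClusterStatus `json:"status,omitempty"`
}

func main() {
	body, _ := json.MarshalIndent(Cluster{
		Name: "Foo",
		Spec: ClusterSpec{Address: "https://foo.example.com", Credential: "bearer <token>"},
	}, "", "  ")
	fmt.Println(string(body)) // the kind of payload a registration request might carry
}
```
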

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller.  When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use case this
may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
   scheduled to any other clusters (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
   capacity on cluster Foo, it’s OK for it to be scheduled to cluster Bar
   (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
   and thirty percent should be scheduled to cluster Bar (use case:
   vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. In the default case there is no such selector and it
means any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
      clusterSelector:
        name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one (see
the sketch below). After phase one we will define syntax to represent
more advanced constraints, like cluster preference ordering, desired
number of split workloads, desired ratio of workloads spread across
different clusters, etc.
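
The phase-one behaviour (a plain list of acceptable clusters plus an even
split of replicas) can be sketched as follows; the helper is ours, not part of
the API:

```go
package main

import "fmt"

// splitEvenly distributes totalReplicas across the acceptable clusters as
// evenly as possible, giving any remainder to the first clusters in the list.
func splitEvenly(totalReplicas int, clusters []string) map[string]int {
	out := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return out
	}
	base := totalReplicas / len(clusters)
	extra := totalReplicas % len(clusters)
	for i, c := range clusters {
		out[c] = base
		if i < extra {
			out[c]++
		}
	}
	return out
}

func main() {
	// A clusterSelector of `name in (Foo, Bar)` reduces to the list below;
	// the 5 replicas of the sample RC then split as Foo=3, Bar=2.
	fmt.Println(splitEvenly(5, []string{"Foo", "Bar"}))
}
```
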

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on the underlying clusters.  These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes control plane:

![Ubernetes scheduling workflow](ubernetes-scheduling.png)

1. A replication controller is created by the client.
1. The APIServer persists it into the storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC.  In phase one we don’t support complex
   distribution policies. The scheduling rule is basically:
    1. If an RC does not specify any nodeSelector, it will be scheduled
       to the least loaded K8S cluster(s) that has enough available
       resources (see the sketch after this list).
    1. If an RC specifies _N_ acceptable clusters in the
       clusterSelector, all replicas will be evenly distributed among
       these clusters.
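
Rule 1 ("least loaded cluster with enough available resources") can be
sketched as follows, interpreting "least loaded" as "most free capacity"; the
types and helper are illustrative only:

```go
package main

import "fmt"

// clusterLoad is an illustrative view of one cluster's free capacity,
// derived from the capacity metrics reported by the cluster controller.
type clusterLoad struct {
	Name         string
	AvailableCPU int // free millicores
}

// leastLoaded returns the cluster with the most free CPU among those that
// can still fit the request, or "" if no cluster has enough headroom.
func leastLoaded(clusters []clusterLoad, requiredCPU int) string {
	best := ""
	bestFree := -1
	for _, c := range clusters {
		if c.AvailableCPU < requiredCPU {
			continue // not enough available resources in this cluster
		}
		if c.AvailableCPU > bestFree {
			best, bestFree = c.Name, c.AvailableCPU
		}
	}
	return best
}

func main() {
	clusters := []clusterLoad{
		{Name: "Foo", AvailableCPU: 4000},
		{Name: "Bar", AvailableCPU: 9000},
	}
	fmt.Println(leastLoaded(clusters, 1000)) // Bar: the most free capacity that fits
}
```
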

There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently, it still accepts workload
requests from other K8S clients or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this data of
available resources. However, when the actual RC creation happens in
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with some
proposed solutions like resource reservation mechanisms.

## Service Discovery

This part has been included in the section “Federated Service” of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. Please
refer to that document for details.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

docs/design/ubernetes-cluster-state.png (new binary file, 14 KiB)
docs/design/ubernetes-design.png (new binary file, 20 KiB)
docs/design/ubernetes-scheduling.png (new binary file, 38 KiB)