RFC design docs for Cluster Federation/Ubernetes.

docs/design/control-plane-resilience.md (new file, 269 lines)

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes/Ubernetes Control Plane Resilience

## Long Term Design and Current Status

### by Quinton Hoole, Mike Danese and Justin Santa-Barbara

### December 14, 2015

## Summary

Some amount of confusion exists around how we currently, and in
future, want to ensure resilience of the Kubernetes (and by
implication Ubernetes) control plane.  This document is an attempt to
capture that definitively.  It covers areas including self-healing,
high availability, bootstrapping and recovery.  Most of the
information in this document already exists in the form of github
comments, PR's/proposals, scattered documents, and corridor
conversations, so this document is primarily a consolidation and
clarification of existing ideas.

## Terms

* **Self-healing:** automatically restarting or replacing failed
  processes and machines without human intervention
* **High availability:** continuing to be available and work correctly
  even if some components are down or uncontactable.  This typically
  involves multiple replicas of critical services, and a reliable way
  to find available replicas.  Note that it's possible (but not
  desirable) to have high availability properties (e.g. multiple
  replicas) in the absence of self-healing properties (e.g. if a
  replica fails, nothing replaces it).  Fairly obviously, given enough
  time, such systems typically become unavailable (after enough
  replicas have failed).
* **Bootstrapping**: creating an empty cluster from nothing
* **Recovery**: recreating a non-empty cluster after perhaps
  catastrophic failure/unavailability/data corruption

## Overall Goals

1. **Resilience to single failures:** Kubernetes clusters constrained
   to single availability zones should be resilient to individual
   machine and process failures by being both self-healing and highly
   available (within the context of such individual failures).
1. **Ubiquitous resilience by default:** The default cluster creation
   scripts for (at least) GCE, AWS and basic bare metal should adhere
   to the above (self-healing and high availability) by default (with
   options available to disable these features to reduce control plane
   resource requirements if so required).  It is hoped that other
   cloud providers will also follow the above guidelines, but the
   above 3 are the primary canonical use cases.
1. **Resilience to some correlated failures:** Kubernetes clusters
   which span multiple availability zones in a region should by
   default be resilient to complete failure of one entire availability
   zone (by similarly providing self-healing and high availability in
   the default cluster creation scripts as above).
1. **Default implementation shared across cloud providers:** The
   differences between the default implementations of the above for
   GCE, AWS and basic bare metal should be minimized.  This implies
   using shared libraries across these providers in the default
   scripts in preference to highly customized implementations per
   cloud provider.  This is not to say that highly differentiated,
   customized per-cloud cluster creation processes (e.g. for GKE on
   GCE, or some hosted Kubernetes provider on AWS) are discouraged.
   But those fall squarely outside the basic cross-platform OSS
   Kubernetes distro.
1. **Self-hosting:** Where possible, Kubernetes's existing mechanisms
   for achieving system resilience (replication controllers, health
   checking, service load balancing etc) should be used in preference
   to building a separate set of mechanisms to achieve the same thing.
   This implies that self-hosting (the Kubernetes control plane on
   Kubernetes) is strongly preferred, with the caveat below.
1. **Recovery from catastrophic failure:** The ability to quickly and
   reliably recover a cluster from catastrophic failure is critical,
   and should not be compromised by the above goal to self-host
   (i.e. it goes without saying that the cluster should be quickly and
   reliably recoverable, even if the cluster control plane is
   broken).  This implies that such catastrophic failure scenarios
   should be carefully thought out, and be the subject of regular
   continuous integration testing and disaster recovery exercises.

## Relative Priorities

1. **(Possibly manual) recovery from catastrophic failures:** having a
   Kubernetes cluster, and all applications running inside it,
   disappear forever is perhaps the worst possible failure mode.  So
   it is critical that we be able to recover the applications running
   inside a cluster from such failures in some well-bounded time
   period.
    1. In theory a cluster can be recovered by replaying all API calls
       that have ever been executed against it, in order, but most
       often that state has been lost, and/or is scattered across
       multiple client applications or groups.  So in general it is
       probably infeasible.
    1. In theory a cluster can also be recovered to some relatively
       recent non-corrupt backup/snapshot of the disk(s) backing the
       etcd cluster state.  But we have no default consistent
       backup/snapshot, verification or restoration process.  And we
       don't routinely test restoration, so even if we did routinely
       perform and verify backups, we have no hard evidence that we
       can in practice effectively recover from catastrophic cluster
       failure or data corruption by restoring from these backups.  So
       there's more work to be done here (see the sketch after this
       list).
1. **Self-healing:** Most major cloud providers provide the ability to
   easily and automatically replace failed virtual machines within a
   small number of minutes (e.g. GCE
   [Auto-restart](https://cloud.google.com/compute/docs/instances/setting-instance-scheduling-options#autorestart)
   and Managed Instance Groups,
   AWS [Auto-recovery](https://aws.amazon.com/blogs/aws/new-auto-recovery-for-amazon-ec2/)
   and [Auto scaling](https://aws.amazon.com/autoscaling/) etc). This
   can fairly trivially be used to reduce control-plane down-time due
   to machine failure to a small number of minutes per failure
   (i.e. typically around "3 nines" availability), provided that:
    1. cluster persistent state (i.e. etcd disks) is either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using etcd [dynamic member
           addition](https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member)
           or [backup and
           recovery](https://github.com/coreos/etcd/blob/master/Documentation/admin_guide.md#disaster-recovery)).

    1. and boot disks are either:
        1. truly persistent (i.e. remote persistent disks), or
        1. reconstructible (e.g. using boot-from-snapshot,
           boot-from-pre-configured-image or
           boot-from-auto-initializing image).
1. **High Availability:** This has the potential to increase
   availability above the approximately "3 nines" level provided by
   automated self-healing, but it's somewhat more complex, and
   requires additional resources (e.g. redundant API servers and etcd
   quorum members).  In environments where cloud-assisted automatic
   self-healing might be infeasible (e.g. on-premise bare-metal
   deployments), it also gives cluster administrators more time to
   respond (e.g. replace/repair failed machines) without incurring
   system downtime.
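
To make the backup/restore gap in item 1 above concrete, the kind of
routine that would need to be automated and continuously tested looks
roughly like the following, using etcd's own tooling (a sketch only:
the paths and dates are placeholders, and nothing like this exists in
the current "kube-up" automation):

    # Take a consistent snapshot of the etcd data directory.
    etcdctl backup \
      --data-dir /var/lib/etcd \
      --backup-dir /var/backups/etcd-$(date +%Y%m%d)

    # Disaster-recovery drill: restore the snapshot onto a fresh node as a
    # new single-member cluster, then re-grow the quorum via runtime
    # configuration (see the etcd admin guide linked above).
    etcd --data-dir /var/backups/etcd-20151214 --force-new-cluster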

## Design and Status (as of December 2015)

<table>
<tr>
<td><b>Control Plane Component</b></td>
<td><b>Resilience Plan</b></td>
<td><b>Current Status</b></td>
</tr>
<tr>
<td><b>API Server</b></td>
<td>

Multiple stateless, self-hosted, self-healing API servers behind an HA
load balancer, built out by the default "kube-up" automation on GCE,
AWS and basic bare metal (BBM).  Note that the single-host approach of
having etcd listen only on localhost to ensure that only the API
server can connect to it will no longer work, so alternative security
will be needed in this regard (either using firewall rules, SSL certs,
or something else).  All necessary flags are currently supported to
enable SSL between API server and etcd (OpenShift runs like this out
of the box), but this needs to be woven into the "kube-up" and related
scripts.  Detailed design of self-hosting and related bootstrapping
and catastrophic failure recovery will be detailed in a separate
design doc.
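
For illustration, the etcd-related flags on the API server side look
something like this (the endpoint names and certificate file paths are
placeholders; the point is only that the plumbing exists and merely
needs to be wired into "kube-up"):

<pre>
kube-apiserver \
  --etcd-servers=https://etcd-0:2379,https://etcd-1:2379,https://etcd-2:2379 \
  --etcd-cafile=/srv/kubernetes/etcd-ca.crt \
  --etcd-certfile=/srv/kubernetes/apiserver-etcd-client.crt \
  --etcd-keyfile=/srv/kubernetes/apiserver-etcd-client.key
  # ...plus the usual serving, authentication and cloud-provider flags.
</pre>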

</td>
<td>

No scripted self-healing or HA on GCE, AWS or basic bare metal
currently exists in the OSS distro.  To be clear, "no self-healing"
means that even if multiple e.g. API servers are provisioned for HA
purposes, if they fail, nothing replaces them, so eventually the
system will fail.  Self-healing and HA can be set up manually by
following documented instructions, but this is not currently an
automated process, and it is not tested as part of continuous
integration.  So it's probably safest to assume that it doesn't
actually work in practice.

</td>
</tr>
<tr>
<td><b>Controller manager and scheduler</b></td>
<td>

Multiple self-hosted, self-healing, warm-standby, stateless controller
managers and schedulers with leader election and automatic failover of
API server clients, automatically installed by default "kube-up"
automation.
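
Concretely, the leader election referred to here is what the
--leader-elect flag on these components enables; whether and how the
default "kube-up" scripts pass it is exactly the automation gap noted
in the status column.  A sketch (the master address is a placeholder):

<pre>
kube-controller-manager --master=https://my-apiserver:6443 --leader-elect=true
kube-scheduler          --master=https://my-apiserver:6443 --leader-elect=true
</pre>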

</td>
<td>As above.</td>
</tr>
<tr>
<td><b>etcd</b></td>
<td>

Multiple (3-5) etcd quorum members behind a load balancer with session
affinity (to prevent clients from being bounced from one to another).

Regarding self-healing, if a node running etcd goes down, it is always
necessary to do three things:
<ol>
<li>allocate a new node (not necessary if running etcd as a pod, in
which case specific measures are required to prevent user pods from
interfering with system pods, for example using node selectors),
<li>start an etcd replica on that new node,
<li>have the new replica recover the etcd state.
</ol>
In the case of local disk (which fails in concert with the machine),
the etcd state must be recovered from the other replicas.  This is
called <A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md#add-a-new-member">dynamic member
addition</A>.
In the case of remote persistent disk, the etcd state can be recovered
by attaching the remote persistent disk to the replacement node, so
the state is recoverable even if all other replicas are down.

There are also significant performance differences between local disks
and remote persistent disks.  For example, the
<A HREF="https://cloud.google.com/compute/docs/disks/#comparison_of_disk_types">sustained throughput of
local disks in GCE is approximately 20x that of remote disks</A>.

Hence we suggest that self-healing be provided by remotely mounted
persistent disks in non-performance-critical, single-zone cloud
deployments.  For performance-critical installations, faster local
SSD's should be used, in which case remounting on node failure is not
an option, so
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">etcd runtime configuration</A>
should be used to replace the failed machine.  Similarly, for
cross-zone self-healing, cloud persistent disks are zonal, so automatic
<A HREF="https://github.com/coreos/etcd/blob/master/Documentation/runtime-configuration.md">runtime configuration</A>
is required.  Similarly, basic bare metal deployments cannot generally
rely on remote persistent disks, so the same approach applies there.
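
To make the runtime-reconfiguration path concrete, replacing a failed
member by hand follows the procedure in the etcd documentation linked
above (the member names, ID and peer URL's below are placeholders):

<pre>
# On any healthy member: drop the dead member, then register its replacement.
etcdctl member remove 6e3bd23ae5f1eae0
etcdctl member add etcd-3 http://10.240.0.7:2380

# On the replacement node: join the existing cluster, don't bootstrap a new one.
etcd --name etcd-3 \
  --initial-cluster "etcd-1=http://10.240.0.4:2380,etcd-2=http://10.240.0.5:2380,etcd-3=http://10.240.0.7:2380" \
  --initial-cluster-state existing
</pre>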
</td>
<td>
<A HREF="http://kubernetes.io/v1.1/docs/admin/high-availability.html">
Somewhat vague instructions exist</A>
on how to set some of this up manually in a self-hosted
configuration.  But automatic bootstrapping and self-healing is not
described (and is not implemented for the non-PD cases).  This all
still needs to be automated and continuously tested.
</td>
</tr>
</table>


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

docs/design/federated-services.md (new file, 550 lines)

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Kubernetes Cluster Federation (a.k.a. "Ubernetes")

## Cross-cluster Load Balancing and Service Discovery

### Requirements and System Design

### by Quinton Hoole, Dec 3 2015

## Requirements

### Discovery, Load-balancing and Failover

1. **Internal discovery and connection**: Pods/containers (running in
   a Kubernetes cluster) must be able to easily discover and connect
   to endpoints for Kubernetes services on which they depend in a
   consistent way, irrespective of whether those services exist in a
   different Kubernetes cluster within the same cluster federation.
   Henceforth referred to as "cluster-internal clients", or simply
   "internal clients".
1. **External discovery and connection**: External clients (running
   outside a Kubernetes cluster) must be able to discover and connect
   to endpoints for Kubernetes services on which they depend.
   1. **External clients predominantly speak HTTP(S)**: External
      clients are most often, but not always, web browsers, or at
      least speak HTTP(S) - notable exceptions include Enterprise
      Message Buses (Java, TLS), DNS servers (UDP), SIP servers and
      databases.
1. **Find the "best" endpoint:** Upon initial discovery and
   connection, both internal and external clients should ideally find
   "the best" endpoint if multiple eligible endpoints exist.  "Best"
   in this context implies the closest (by network topology) endpoint
   that is both operational (as defined by some positive health check)
   and not overloaded (by some published load metric).  For example:
   1. An internal client should find an endpoint which is local to its
      own cluster if one exists, in preference to one in a remote
      cluster (if both are operational and non-overloaded).
      Similarly, one in a nearby cluster (e.g. in the same zone or
      region) is preferable to one further afield.
   1. An external client (e.g. in New York City) should find an
      endpoint in a nearby cluster (e.g. U.S. East Coast) in
      preference to one further away (e.g. Japan).
1. **Easy fail-over:** If the endpoint to which a client is connected
   becomes unavailable (no network response/disconnected) or
   overloaded, the client should reconnect to a better endpoint,
   somehow.
   1. In the case where there exist one or more connection-terminating
      load balancers between the client and the serving Pod, failover
      might be completely automatic (i.e. the client's end of the
      connection remains intact, and the client is completely
      oblivious of the fail-over).  This approach incurs network speed
      and cost penalties (by traversing possibly multiple load
      balancers), but requires zero smarts in clients, DNS libraries,
      recursing DNS servers etc, as the IP address of the endpoint
      remains constant over time.
   1. In a scenario where clients need to choose between multiple load
      balancer endpoints (e.g. one per cluster), multiple DNS A
      records associated with a single DNS name enable even relatively
      dumb clients to try the next IP address in the list of returned
      A records (without even necessarily re-issuing a DNS resolution
      request).  For example, all major web browsers will try all A
      records in sequence until a working one is found (TBD: justify
      this claim with details for Chrome, IE, Safari, Firefox).
   1. In a slightly more sophisticated scenario, upon disconnection, a
      smarter client might re-issue a DNS resolution query, and
      (modulo DNS record TTL's which can typically be set as low as 3
      minutes, and buggy DNS resolvers, caches and libraries which
      have been known to completely ignore TTL's), receive updated A
      records specifying a new set of IP addresses to which to
      connect.
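
The multiple-A-record behaviour described above can be exercised from
the client side with nothing more than standard DNS and HTTP tooling;
a minimal sketch (the hostname and health-check path are placeholders):

    # Resolve all A records for the service name, then try each endpoint in
    # turn until one answers, exactly as a relatively dumb client would.
    for ip in $(dig +short my-service.example.com A); do
        curl --connect-timeout 2 "http://${ip}/healthz" && break
    done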

### Portability

A Kubernetes application configuration (e.g. for a Pod, Replication
Controller, Service etc) should be able to be successfully deployed
into any Kubernetes Cluster or Ubernetes Federation of Clusters,
without modification.  More specifically, a typical configuration
should work correctly (although possibly not optimally) across any of
the following environments:

1. A single Kubernetes Cluster on one cloud provider (e.g. Google
   Compute Engine, GCE)
1. A single Kubernetes Cluster on a different cloud provider
   (e.g. Amazon Web Services, AWS)
1. A single Kubernetes Cluster on a non-cloud, on-premise data center
1. A Federation of Kubernetes Clusters all on the same cloud provider
   (e.g. GCE)
1. A Federation of Kubernetes Clusters across multiple different cloud
   providers and/or on-premise data centers (e.g. one cluster on
   GCE/GKE, one on AWS, and one on-premise).

### Trading Portability for Optimization

It should be possible to explicitly opt out of portability across some
subset of the above environments in order to take advantage of
non-portable load balancing and DNS features of one or more
environments.  More specifically, for example:

1. For HTTP(S) applications running on GCE-only Federations,
   [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
   should be usable.  These provide single, static global IP addresses
   which load balance and fail over globally (i.e. across both regions
   and zones).  These allow for really dumb clients, but they only
   work on GCE, and only for HTTP(S) traffic.
1. For non-HTTP(S) applications running on GCE-only Federations within
   a single region,
   [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
   should be usable.  These provide TCP (i.e. both HTTP/S and
   non-HTTP/S) load balancing and failover, but only on GCE, and only
   within a single region.
   [Google Cloud DNS](https://cloud.google.com/dns) can be used to
   route traffic between regions (and between different cloud
   providers and on-premise clusters, as it's plain DNS, IP only).
1. For applications running on AWS-only Federations,
   [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
   should be usable.  These provide both L7 (HTTP(S)) and L4 load
   balancing, but only within a single region, and only on AWS
   ([AWS Route 53 DNS service](https://aws.amazon.com/route53/) can be
   used to load balance and fail over across multiple regions, and is
   also capable of resolving to non-AWS endpoints).
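
The "plain DNS, IP only" routing mentioned above for Google Cloud DNS
amounts to publishing the per-region (or per-provider) load balancer
IP's as A records under a single name.  A rough sketch with the gcloud
CLI follows; the zone, record name and IP's are invented, and the
exact flag spelling should be checked against the gcloud release in
use:

    gcloud dns record-sets transaction start --zone=my-federation-zone
    gcloud dns record-sets transaction add --zone=my-federation-zone \
        --name=my-service.my-domain.com. --type=A --ttl=180 \
        "104.197.117.10" "54.93.77.24"
    gcloud dns record-sets transaction execute --zone=my-federation-zone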

## Component Cloud Services

Ubernetes cross-cluster load balancing is built on top of the following:

1. [GCE Global L7 Load Balancers](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
   provide single, static global IP addresses which load balance and
   fail over globally (i.e. across both regions and zones).  These
   allow for really dumb clients, but they only work on GCE, and only
   for HTTP(S) traffic.
1. [GCE L4 Network Load Balancers](https://cloud.google.com/compute/docs/load-balancing/network/)
   provide both HTTP(S) and non-HTTP(S) load balancing and failover,
   but only on GCE, and only within a single region.
1. [AWS Elastic Load Balancers (ELB's)](https://aws.amazon.com/elasticloadbalancing/details/)
   provide both L7 (HTTP(S)) and L4 load balancing, but only within a
   single region, and only on AWS.
1. [Google Cloud DNS](https://cloud.google.com/dns) (or any other
   programmable DNS service, like
   [CloudFlare](http://www.cloudflare.com)) can be used to route
   traffic between regions (and between different cloud providers and
   on-premise clusters, as it's plain DNS, IP only).  Google Cloud DNS
   doesn't provide any built-in geo-DNS, latency-based routing, health
   checking, weighted round robin or other advanced capabilities.
   It's plain old DNS.  We would need to build all the aforementioned
   on top of it.  It can provide internal DNS services (i.e. serve RFC
   1918 addresses).
   1. [AWS Route 53 DNS service](https://aws.amazon.com/route53/) can
   be used to load balance and fail over across regions, and is also
   capable of routing to non-AWS endpoints.  It provides built-in
   geo-DNS, latency-based routing, health checking, weighted
   round robin and optional tight integration with some other
   AWS services (e.g. Elastic Load Balancers).
1. Kubernetes L4 Service Load Balancing: This provides both a
   [virtual cluster-local](http://kubernetes.io/v1.1/docs/user-guide/services.html#virtual-ips-and-service-proxies)
   and a
   [real externally routable](http://kubernetes.io/v1.1/docs/user-guide/services.html#type-loadbalancer)
   service IP which is load-balanced (currently simple round-robin)
   across the healthy pods comprising a service within a single
   Kubernetes cluster.
1. [Kubernetes Ingress](http://kubernetes.io/v1.1/docs/user-guide/ingress.html): A generic wrapper around cloud-provided L4 and L7 load balancing services, and roll-your-own load balancers run in pods, e.g. HA Proxy.

## Ubernetes API

The Ubernetes API for load balancing should be compatible with the
equivalent Kubernetes API, to ease porting of clients between
Ubernetes and Kubernetes.  Further details below.

## Common Client Behavior

To be useful, our load balancing solution needs to work properly with
real client applications.  There are a few different classes of
those...

### Browsers

These are the most common external clients.  These are all
well-written.  See below.

### Well-written clients

1. Do a DNS resolution every time they connect.
1. Don't cache beyond TTL (although a small percentage of the DNS
   servers on which they rely might).
1. Do try multiple A records (in order) to connect.
1. (in an ideal world) Do use SRV records rather than hard-coded port numbers.

Examples:

+  all common browsers (except for SRV records)
+  ...

### Dumb clients

1. Don't do a DNS resolution every time they connect (or do cache
   beyond the TTL).
1. Do try multiple A records.

Examples:

+  ...

### Dumber clients

1. Only do a DNS lookup once on startup.
1. Only try the first returned DNS A record.

Examples:

+  ...

### Dumbest clients

1. Never do a DNS lookup - are pre-configured with a single (or
   possibly multiple) fixed server IP(s).  Nothing else matters.

## Architecture and Implementation

### General control plane architecture

Each cluster hosts one or more Ubernetes master components (Ubernetes
API servers, controller managers with leader election, and etcd quorum
members).  This is documented in more detail in a
[separate design doc: Kubernetes/Ubernetes Control Plane Resilience](https://docs.google.com/document/d/1jGcUVg9HDqQZdcgcFYlWMXXdZsplDdY6w3ZGJbU7lAw/edit#).

In the description below, assume that 'n' clusters, named
'cluster-1'... 'cluster-n' have been registered against an Ubernetes
Federation "federation-1", each with their own set of Kubernetes API
endpoints, e.g.
[http://endpoint-1.cluster-1](http://endpoint-1.cluster-1),
[http://endpoint-2.cluster-1](http://endpoint-2.cluster-1)
... [http://endpoint-m.cluster-n](http://endpoint-m.cluster-n).

### Federated Services

Ubernetes Services are pretty straightforward.  They're comprised of
multiple equivalent underlying Kubernetes Services, each with their
own external endpoint, and a load balancing mechanism across them.
Let's work through how exactly that works in practice.

Our user creates the following Ubernetes Service (against an Ubernetes
API endpoint):

    $ kubectl create -f my-service.yaml --context="federation-1"

where `my-service.yaml` contains the following:

    kind: Service
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      ports:
      - port: 2379
        protocol: TCP
        targetPort: 2379
        name: client
      - port: 2380
        protocol: TCP
        targetPort: 2380
        name: peer
      selector:
        run: my-service
      type: LoadBalancer

Ubernetes in turn creates one equivalent service (identical config to
the above) in each of the underlying Kubernetes clusters, each of
which results in something like this:

    $ kubectl get -o yaml --context="cluster-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:25Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "147365"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00002
    spec:
      clusterIP: 10.0.153.185
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - ip: 104.197.117.10

Similar services are created in `cluster-2` and `cluster-3`, each of
which is allocated its own `spec.clusterIP` and
`status.loadBalancer.ingress.ip`.

In Ubernetes `federation-1`, the resulting federated service looks as follows:

    $ kubectl get -o yaml --context="federation-1" service my-service

    apiVersion: v1
    kind: Service
    metadata:
      creationTimestamp: 2015-11-25T23:35:23Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      resourceVersion: "157333"
      selfLink: /api/v1/namespaces/my-namespace/services/my-service
      uid: 33bfc927-93cd-11e5-a38c-42010af00007
    spec:
      clusterIP:
      ports:
      - name: client
        nodePort: 31333
        port: 2379
        protocol: TCP
        targetPort: 2379
      - name: peer
        nodePort: 31086
        port: 2380
        protocol: TCP
        targetPort: 2380
      selector:
        run: my-service
      sessionAffinity: None
      type: LoadBalancer
    status:
      loadBalancer:
        ingress:
        - hostname: my-service.my-namespace.my-federation.my-domain.com

Note that the federated service:

1. is API-compatible with a vanilla Kubernetes service,
1. has no clusterIP (as it is cluster-independent), and
1. has a federation-wide load balancer hostname.

In addition to the set of underlying Kubernetes services (one per
cluster) described above, Ubernetes has also created a DNS name
(e.g. on [Google Cloud DNS](https://cloud.google.com/dns) or
[AWS Route 53](https://aws.amazon.com/route53/), depending on
configuration) which provides load balancing across all of those
services.  For example, in a very basic configuration:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.117.10
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.38.157

Each of the above IP addresses (which are just the external load
balancer ingress IP's of each cluster service) is of course load
balanced across the pods comprising the service in each cluster.
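
Those per-cluster ingress IP's are exactly what Ubernetes programs
into the DNS record above.  Purely for illustration, they could be
gathered by hand as follows (the grep is a crude stand-in for
structured output):

    for c in cluster-1 cluster-2 cluster-3; do
        # Print the external load balancer ingress IP of each cluster's service.
        kubectl get service my-service --context="$c" -o yaml | grep "ip:"
    done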

In a more sophisticated configuration (e.g. on GCE or GKE), Ubernetes
automatically creates a
[GCE Global L7 Load Balancer](https://cloud.google.com/compute/docs/load-balancing/http/global-forwarding-rules)
which exposes a single, globally load-balanced IP:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 107.194.17.44

Optionally, Ubernetes also configures the local DNS servers (SkyDNS)
in each Kubernetes cluster to preferentially return the local
clusterIP for the service in that cluster, with other clusters'
external service IP's (or a global load-balanced IP) also configured
for failover purposes:

    $ dig +noall +answer my-service.my-namespace.my-federation.my-domain.com
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 10.0.153.185
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.74.77
    my-service.my-namespace.my-federation.my-domain.com 180 IN	A 104.197.38.157

If Ubernetes Global Service Health Checking is enabled, multiple
service health checkers running across the federated clusters
collaborate to monitor the health of the service endpoints, and
automatically remove unhealthy endpoints from the DNS record (e.g. a
majority quorum is required to vote a service endpoint unhealthy, to
avoid false positives due to individual health checker network
isolation).

### Federated Replication Controllers

So far we have a federated service defined, with a resolvable load
balancer hostname by which clients can reach it, but no pods serving
traffic directed there.  So now we need a Federated Replication
Controller.  These are also fairly straightforward, being comprised
of multiple underlying Kubernetes Replication Controllers which do the
hard work of keeping the desired number of Pod replicas alive in each
Kubernetes cluster.

    $ kubectl create -f my-service-rc.yaml --context="federation-1"

where `my-service-rc.yaml` contains the following:

    kind: ReplicationController
    metadata:
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
    spec:
      replicas: 6
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP

Ubernetes in turn creates one equivalent replication controller
(identical config to the above, except for the replica count) in each
of the underlying Kubernetes clusters, each of which results in
something like this:

    $ kubectl get -o yaml rc my-service --context="cluster-1"
    kind: ReplicationController
    metadata:
      creationTimestamp: 2015-12-02T23:00:47Z
      labels:
        run: my-service
      name: my-service
      namespace: my-namespace
      selfLink: /api/v1/namespaces/my-namespace/replicationcontrollers/my-service
      uid: 86542109-9948-11e5-a38c-42010af00002
    spec:
      replicas: 2
      selector:
        run: my-service
      template:
        metadata:
          labels:
            run: my-service
        spec:
          containers:
          - image: gcr.io/google_samples/my-service:v1
            name: my-service
            ports:
            - containerPort: 2379
              protocol: TCP
            - containerPort: 2380
              protocol: TCP
            resources: {}
          dnsPolicy: ClusterFirst
          restartPolicy: Always
    status:
      replicas: 2

The exact number of replicas created in each underlying cluster will
of course depend on what scheduling policy is in force.  In the above
example, the scheduler created an equal number of replicas (2) in each
of the three underlying clusters, to make up the total of 6 replicas
required.  To handle entire cluster failures, various approaches are
possible, including:

1. **simple overprovisioning**, such that sufficient replicas remain
   even if a cluster fails.  This wastes some resources, but is simple
   and reliable.
2. **pod autoscaling**, where the replication controller in each
   cluster automatically and autonomously increases the number of
   replicas in its cluster in response to the additional traffic
   diverted from the failed cluster.  This saves resources and is
   relatively simple, but there is some delay in the autoscaling.
3. **federated replica migration**, where the Ubernetes Federation
   Control Plane detects the cluster failure and automatically
   increases the replica count in the remaining clusters to make up
   for the lost replicas in the failed cluster.  This does not seem to
   offer any benefits relative to pod autoscaling above, and is
   arguably more complex to implement, but we note it here as a
   possibility.
 | 
				
			||||||
 | 
					
 | 
				
			||||||
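
As a rough illustration of the arithmetic behind option 1, here is a minimal
sketch (our own illustration, not part of any Kubernetes or Ubernetes API) of
how many replicas each cluster would need to carry so that the desired total
survives a given number of simultaneous cluster failures:

```go
package main

import "fmt"

// replicasPerCluster returns how many replicas each of numClusters clusters
// should run so that desiredTotal replicas remain available even if
// maxClusterFailures clusters fail at the same time.
func replicasPerCluster(desiredTotal, numClusters, maxClusterFailures int) int {
	surviving := numClusters - maxClusterFailures
	if surviving < 1 {
		surviving = 1 // degenerate case: keep everything in one surviving cluster
	}
	// Ceiling division: the surviving clusters must carry the whole total.
	return (desiredTotal + surviving - 1) / surviving
}

func main() {
	// 6 replicas across 3 clusters, tolerating one cluster outage:
	// each cluster runs 3 replicas (9 in total), i.e. 50% overprovisioning.
	fmt.Println(replicasPerCluster(6, 3, 1))
}
```

For the example above (6 replicas across 3 clusters, tolerating one cluster
outage), this works out to 3 replicas per cluster, i.e. 50% overprovisioning.
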
### Implementation Details

The implementation approach and architecture is very similar to
Kubernetes, so if you're familiar with how Kubernetes works, none of
what follows will be surprising.  One additional design driver not
present in Kubernetes is that Ubernetes aims to be resilient to
individual cluster and availability zone failures.  So the control
plane spans multiple clusters.  More specifically:

+ Ubernetes runs its own distinct set of API servers (typically one
   or more per underlying Kubernetes cluster).  These are completely
   distinct from the Kubernetes API servers for each of the underlying
   clusters.
+ Ubernetes runs its own distinct quorum-based metadata store (etcd,
   by default).  Approximately one quorum member runs in each underlying
   cluster ("approximately" because we aim for an odd number of quorum
   members, and typically don't want more than 5 quorum members, even
   if we have a larger number of federated clusters, so 2 clusters -> 3
   quorum members, 3 -> 3, 4 -> 3, 5 -> 5, 6 -> 5, 7 -> 5, etc.; see the
   sketch after this list).
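
The quorum-sizing rule above can be captured in a few lines; this is purely
illustrative (the helper below is not part of the codebase):

```go
package main

import "fmt"

// quorumMembers maps the number of federated clusters to the number of etcd
// quorum members to run: always odd, at least 3, and never more than 5.
func quorumMembers(numClusters int) int {
	n := numClusters
	if n < 3 {
		n = 3 // a meaningful quorum needs at least 3 members
	}
	if n > 5 {
		n = 5 // cap the quorum size to keep writes fast
	}
	if n%2 == 0 {
		n-- // prefer an odd number of members (e.g. 4 clusters -> 3 members)
	}
	return n
}

func main() {
	for clusters := 2; clusters <= 7; clusters++ {
		fmt.Printf("%d clusters -> %d quorum members\n", clusters, quorumMembers(clusters))
	}
}
```
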
Cluster Controllers in Ubernetes watch against the Ubernetes API
server/etcd state, and apply changes to the underlying Kubernetes
clusters accordingly.  They also implement an anti-entropy mechanism
that reconciles the Ubernetes "desired desired" state against the
Kubernetes "actual desired" state.
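
Roughly speaking, the anti-entropy loop looks like the following sketch; every
type and function in it is hypothetical and stands in for calls to the
Ubernetes store and to the underlying clusters' API servers:

```go
package main

import "time"

// SubRC is a hypothetical stand-in for a replication controller that the
// federation control plane expects to exist in one underlying cluster.
type SubRC struct {
	Cluster  string
	Name     string
	Replicas int
}

// In reality these would watch the Ubernetes etcd ("desired desired" state)
// and query each underlying Kubernetes API server ("actual desired" state).
func desiredState() map[string]SubRC { return map[string]SubRC{} }
func actualState() map[string]SubRC  { return map[string]SubRC{} }

func createOrUpdate(rc SubRC)    { /* call the underlying cluster's API server */ }
func deleteFromCluster(rc SubRC) { /* call the underlying cluster's API server */ }

// reconcile drives the "actual desired" state in the underlying clusters
// towards the "desired desired" state recorded by the federation control plane.
func reconcile() {
	desired := desiredState()
	actual := actualState()
	for key, want := range desired {
		if have, ok := actual[key]; !ok || have.Replicas != want.Replicas {
			createOrUpdate(want)
		}
	}
	for key, have := range actual {
		if _, ok := desired[key]; !ok {
			deleteFromCluster(have)
		}
	}
}

func main() {
	for {
		reconcile()
		time.Sleep(30 * time.Second) // periodic anti-entropy pass
	}
}
```
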
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

docs/design/federation-phase-1.md (new file)
@@ -0,0 +1,434 @@

# Ubernetes Design Spec (phase one)

**Huawei PaaS Team**

## INTRODUCTION

In this document we propose a design for the “Control Plane” of
Kubernetes (K8S) federation (a.k.a. “Ubernetes”). For the background of
this work please refer to
[this proposal](../../docs/proposals/federation.md).
The document is arranged as follows. First we briefly list the scenarios
and use cases that motivate the K8S federation work. These use cases drive
the design, and they also serve to validate it. We summarize the
functionality requirements derived from these use cases, and define the “in
scope” functionalities that will be covered by this design (phase
one). After that we give an overview of the proposed architecture, API
and building blocks, and we walk through several activity flows to
see how these building blocks work together to support the use cases.

## REQUIREMENTS

There are many reasons why customers may want to build a K8S
federation:

+ **High Availability:** Customers want to be immune to the outage of
   a single availability zone, region or even a cloud provider.
+ **Sensitive workloads:** Some workloads can only run on a particular
   cluster. They cannot be scheduled to or migrated to other clusters.
+ **Capacity overflow:** Customers prefer to run workloads on a
   primary cluster. But if the capacity of the cluster is not
   sufficient, workloads should be automatically distributed to other
   clusters.
+ **Vendor lock-in avoidance:** Customers want to spread their
   workloads across different cloud providers, and be able to easily
   increase or decrease the workload proportion of a specific provider.
+ **Cluster Size Enhancement:** Currently a K8S cluster can only scale
   to a limited size. While the community is actively improving this, it
   can be expected that cluster size will be a problem if K8S is used for
   large workloads or public PaaS infrastructure. While we can separate
   different tenants onto different clusters, it would be good to have a
   unified view.

Here are the functionality requirements derived from the above use cases:

+ Clients of the federation control plane API server can register and deregister clusters.
+ Workloads should be spread to different clusters according to the
   workload distribution policy.
+ Pods are able to discover and connect to services hosted in other
  clusters (in cases where inter-cluster networking is necessary,
  desirable and implemented).
+ Traffic to these pods should be spread across clusters (in a manner
  similar to load balancing, although it might not be strictly
  speaking balanced).
+ The control plane needs to know when a cluster is down, and migrate
   the workloads to other clusters.
+ Clients have a unified view and a central control point for the above
   activities.

## SCOPE

It’s difficult to produce in one pass a perfect design that implements
all of the above requirements. Therefore we will go with an iterative
approach to design and build the system. This document describes
phase one of the whole work.  In phase one we will cover only the
following objectives:

+ Define the basic building blocks and API objects of the control plane
+ Implement a basic end-to-end workflow
   + Clients register federated clusters
   + Clients submit a workload
   + The workload is distributed to different clusters
   + Service discovery
   + Load balancing

The following parts are NOT covered in phase one:

+ Authentication and authorization (other than basic client
  authentication against the Ubernetes API, and from the Ubernetes control
  plane to the underlying Kubernetes clusters).
+ Deployment units other than replication controller and service
+ Complex distribution policies for workloads
+ Service affinity and migration

## ARCHITECTURE

The overall architecture of the control plane is shown in the diagram below:

![Ubernetes control plane architecture](ubernetes-design.png)

Some design principles we are following in this architecture:

1. Keep the underlying K8S clusters independent. They should have no
   knowledge of the control plane or of each other.
1. Keep the Ubernetes API interface compatible with the K8S API as much as
   possible.
1. Re-use concepts from K8S as much as possible. This reduces
   customers’ learning curve and is good for adoption.

Below is a brief description of each module contained in the above diagram.

## Ubernetes API Server

The API Server in the Ubernetes control plane works just like the API
Server in K8S. It talks to a distributed key-value store to persist,
retrieve and watch API objects.  This store is completely distinct
from the Kubernetes key-value stores (etcd) in the underlying
Kubernetes clusters.  We still use `etcd` as the distributed
storage so customers don’t need to learn and manage a different
storage system, although it is envisaged that other storage systems
(Consul, ZooKeeper) will probably be developed and supported over
time.

## Ubernetes Scheduler

The Ubernetes Scheduler schedules resources onto the underlying
Kubernetes clusters.  For example, it watches for unscheduled Ubernetes
replication controllers (those that have not yet been scheduled onto
underlying Kubernetes clusters) and performs the global scheduling
work.  For each unscheduled replication controller, it calls the policy
engine to decide how to split workloads among clusters. It creates
Kubernetes Replication Controllers on one or more underlying clusters,
and posts them back to the `etcd` storage.

One subtlety worth noting here is that the scheduling decision is
arrived at by combining the application-specific request from the user (which might
include, for example, placement constraints) and the global policy specified
by the federation administrator (for example, "prefer on-premise
clusters over AWS clusters" or "spread load equally across clusters").
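
To make that combination concrete, here is a purely illustrative sketch of
filtering by the user's placement constraints and then ordering by an
administrator preference; none of these types or names exist in the real code:

```go
package main

import (
	"fmt"
	"sort"
)

// userConstraint is the application-specific part of the request, e.g. the
// set of clusters the workload may run in (empty means "any cluster").
type userConstraint struct {
	AcceptableClusters map[string]bool
}

// adminPolicy is the federation-wide part, e.g. "prefer on-premise clusters
// over AWS clusters", expressed here as a simple preference weight.
type adminPolicy struct {
	Preference map[string]int // higher means more preferred
}

// candidateClusters applies the user's constraint first, then orders the
// remaining clusters by the administrator's preference.
func candidateClusters(all []string, u userConstraint, p adminPolicy) []string {
	var out []string
	for _, c := range all {
		if len(u.AcceptableClusters) == 0 || u.AcceptableClusters[c] {
			out = append(out, c)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return p.Preference[out[i]] > p.Preference[out[j]]
	})
	return out
}

func main() {
	all := []string{"aws-east", "onprem-1", "onprem-2"}
	user := userConstraint{} // no placement constraint from the user
	admin := adminPolicy{Preference: map[string]int{"onprem-1": 10, "onprem-2": 10, "aws-east": 1}}
	fmt.Println(candidateClusters(all, user, admin)) // on-premise clusters first
}
```
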

## Ubernetes Cluster Controller

The cluster controller performs the following two kinds of work:

1. It watches all the sub-resources that are created by Ubernetes
   components, like a sub-RC or a sub-service, and creates the
   corresponding API objects on the underlying K8S clusters.
1. It periodically retrieves the available resource metrics from the
   underlying K8S clusters, and updates them in the status of the
   `cluster` API object (see the sketch after this list).  An
   alternative design might be to run a pod in each underlying cluster
   that reports metrics for that cluster to the Ubernetes control
   plane.  Which approach is better remains an open topic of discussion.
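
The second kind of work can be pictured as a simple polling loop that
aggregates per-node metrics into a single cluster capacity; this is only a
sketch, with hypothetical helpers standing in for calls to the underlying
cluster's API server and to the Ubernetes API server:

```go
package main

import "time"

// NodeCapacity is the free CPU (millicores) and memory (bytes) reported by
// one node of an underlying cluster (illustrative type).
type NodeCapacity struct {
	CPUMillis   int64
	MemoryBytes int64
}

// listNodeCapacities would in reality query the underlying cluster's API
// server; updateClusterStatus would write the aggregate back into the
// `cluster` object's status through the Ubernetes API server.
func listNodeCapacities(cluster string) []NodeCapacity   { return nil }
func updateClusterStatus(cluster string, c NodeCapacity) {}

// pollCluster aggregates per-node metrics into a single cluster capacity,
// which is how phase one fills in the capacity part of the cluster status.
func pollCluster(cluster string) {
	var total NodeCapacity
	for _, n := range listNodeCapacities(cluster) {
		total.CPUMillis += n.CPUMillis
		total.MemoryBytes += n.MemoryBytes
	}
	updateClusterStatus(cluster, total)
}

func main() {
	for {
		for _, cluster := range []string{"Foo", "Bar"} {
			pollCluster(cluster)
		}
		time.Sleep(time.Minute) // illustrative polling interval
	}
}
```
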

## Ubernetes Service Controller

The Ubernetes service controller is a federation-level implementation
of the K8S service controller. It watches service resources created on
the control plane and creates corresponding K8S services on each involved K8S
cluster.  Besides interacting with the service resources on each
individual K8S cluster, the Ubernetes service controller also
performs some global DNS registration work.

## API OBJECTS

## Cluster

Cluster is a new first-class API object introduced in this design. For
each registered K8S cluster there will be such an API resource in the
control plane. The way clients register or deregister a cluster is to
send corresponding REST requests to the following URL:
`/api/{$version}/clusters`.  Because the control plane behaves like a
regular K8S client to the underlying clusters, the spec of a cluster
object contains the necessary properties like the K8S cluster address and
credentials.  The status of a cluster API object will contain the
following information:

1. Which phase of its lifecycle it is in
1. Cluster resource metrics for scheduling decisions
1. Other metadata like the version of the cluster

`$version.clusterSpec`

| Name | Description | Required | Schema | Default |
| ---- | ----------- | -------- | ------ | ------- |
| Address | address of the cluster | yes | address | |
| Credential | the type (e.g. bearer token, client certificate etc.) and data of the credential used to access the cluster. It’s used for system routines (not on behalf of users) | yes | string | |

`$version.clusterStatus`

| Name | Description | Required | Schema | Default |
| ---- | ----------- | -------- | ------ | ------- |
| Phase | the recently observed lifecycle phase of the cluster | yes | enum | |
| Capacity | represents the available resources of a cluster | yes | any | |
| ClusterMeta | other cluster metadata like the version | yes | ClusterMeta | |

**For simplicity we didn’t introduce a separate “cluster metrics” API
object here**. The cluster resource metrics are stored in the cluster
status section, just like what we do for nodes in K8S. In phase one it
only contains available CPU and memory resources.  The
cluster controller will periodically poll the underlying cluster API
Server to get the cluster capacity. In phase one it gets the metrics by
simply aggregating the metrics from all nodes. In future we will improve
this with more efficient approaches, like leveraging Heapster, and more
metrics will be supported.  Similar to node phases in K8S, the “phase”
field includes the following values:

+ pending: newly registered clusters or clusters suspended by the admin
   for various reasons. They are not eligible for accepting workloads.
+ running: clusters in normal status that can accept workloads.
+ offline: clusters temporarily down or not reachable.
+ terminated: clusters removed from the federation.

Below is the state transition diagram.

![Cluster state transition diagram](ubernetes-cluster-state.png)
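
Pulling the two tables and the phase list together, a rough Go sketch of the
Cluster API object might look like the following; all field and type names
here are illustrative rather than the final API:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterPhase mirrors the lifecycle phases listed above.
type ClusterPhase string

const (
	ClusterPending    ClusterPhase = "pending"
	ClusterRunning    ClusterPhase = "running"
	ClusterOffline    ClusterPhase = "offline"
	ClusterTerminated ClusterPhase = "terminated"
)

// ClusterSpec holds what the control plane needs in order to reach and
// authenticate to a member cluster (cf. $version.clusterSpec above).
type ClusterSpec struct {
	Address    string `json:"address"`
	Credential string `json:"credential"`
}

// ClusterStatus is reported by the cluster controller
// (cf. $version.clusterStatus above).
type ClusterStatus struct {
	Phase       ClusterPhase      `json:"phase"`
	Capacity    map[string]string `json:"capacity"`    // e.g. aggregated cpu/memory
	ClusterMeta map[string]string `json:"clusterMeta"` // e.g. cluster version
}

// Cluster is what a client would POST to /api/{$version}/clusters to
// register a new cluster (status omitted on registration).
type Cluster struct {
	Name   string         `json:"name"`
	Spec   ClusterSpec    `json:"spec"`
	Status *ClusterStatus `json:"status,omitempty"`
}

func main() {
	body, _ := json.MarshalIndent(Cluster{
		Name: "Foo",
		Spec: ClusterSpec{Address: "https://foo.example.com", Credential: "bearer <token>"},
	}, "", "  ")
	fmt.Println(string(body)) // the kind of payload a registration request might carry
}
```
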

## Replication Controller

A global workload submitted to the control plane is represented as an
Ubernetes replication controller.  When a replication controller
is submitted to the control plane, clients need a way to express its
requirements or preferences on clusters. Depending on the use case this
may be complex. For example:

+ This workload can only be scheduled to cluster Foo. It cannot be
   scheduled to any other clusters (use case: sensitive workloads).
+ This workload prefers cluster Foo. But if there is no available
   capacity on cluster Foo, it’s OK for it to be scheduled to cluster Bar
   (use case: capacity overflow).
+ Seventy percent of this workload should be scheduled to cluster Foo,
   and thirty percent should be scheduled to cluster Bar (use case:
   vendor lock-in avoidance).

In phase one, we only introduce a _clusterSelector_ field to filter
acceptable clusters. In the default case there is no such selector and it
means any cluster is acceptable.

Below is a sample of the YAML to create such a replication controller.

```
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx-controller
spec:
  replicas: 5
  selector:
    app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
      clusterSelector:
        name in (Foo, Bar)
```

Currently clusterSelector (implemented as a
[LabelSelector](../../pkg/apis/extensions/v1beta1/types.go#L704))
only supports a simple list of acceptable clusters. Workloads will be
evenly distributed across these acceptable clusters in phase one (see
the sketch below). After phase one we will define syntax to represent
more advanced constraints, like cluster preference ordering, desired
number of split workloads, desired ratio of workloads spread across
different clusters, etc.
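
The phase-one behaviour (a plain list of acceptable clusters plus an even
split of replicas) can be sketched as follows; the helper is ours, not part of
the API:

```go
package main

import "fmt"

// splitEvenly distributes totalReplicas across the acceptable clusters as
// evenly as possible, giving any remainder to the first clusters in the list.
func splitEvenly(totalReplicas int, clusters []string) map[string]int {
	out := make(map[string]int, len(clusters))
	if len(clusters) == 0 {
		return out
	}
	base := totalReplicas / len(clusters)
	extra := totalReplicas % len(clusters)
	for i, c := range clusters {
		out[c] = base
		if i < extra {
			out[c]++
		}
	}
	return out
}

func main() {
	// A clusterSelector of `name in (Foo, Bar)` reduces to the list below;
	// the 5 replicas of the sample RC then split as Foo=3, Bar=2.
	fmt.Println(splitEvenly(5, []string{"Foo", "Bar"}))
}
```
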

Besides this explicit “clusterSelector” filter, a workload may have
some implicit scheduling restrictions. For example, it may define a
“nodeSelector” which can only be satisfied on some particular
clusters. How to handle this will be addressed after phase one.

## Ubernetes Services

The Service API object exposed by Ubernetes is similar to service
objects on Kubernetes. It defines the access to a group of pods. The
Ubernetes service controller will create corresponding Kubernetes
service objects on the underlying clusters.  These are detailed in a
separate design document: [Federated Services](federated-services.md).

## Pod

In phase one we only support scheduling replication controllers. Pod
scheduling will be supported in a later phase. This is primarily in
order to keep the Ubernetes API compatible with the Kubernetes API.

## ACTIVITY FLOWS

## Scheduling

The diagram below shows how workloads are scheduled on the Ubernetes control plane:

![Ubernetes scheduling workflow](ubernetes-scheduling.png)

1. A replication controller is created by the client.
1. The APIServer persists it into the storage.
1. The cluster controller periodically polls the latest available resource
   metrics from the underlying clusters.
1. The scheduler watches all pending RCs. It picks up an RC, makes
   policy-driven decisions and splits it into different sub-RCs.
1. Each cluster controller watches the sub-RCs bound to its
   corresponding cluster. It picks up the newly created sub-RC.
1. The cluster controller issues requests to the underlying cluster
   API Server to create the RC.  In phase one we don’t support complex
   distribution policies. The scheduling rule is basically:
    1. If an RC does not specify any nodeSelector, it will be scheduled
       to the least loaded K8S cluster(s) that has enough available
       resources (see the sketch after this list).
    1. If an RC specifies _N_ acceptable clusters in the
       clusterSelector, all replicas will be evenly distributed among
       these clusters.
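
Rule 1 ("least loaded cluster with enough available resources") can be
sketched as follows, interpreting "least loaded" as "most free capacity"; the
types and helper are illustrative only:

```go
package main

import "fmt"

// clusterLoad is an illustrative view of one cluster's free capacity,
// derived from the capacity metrics reported by the cluster controller.
type clusterLoad struct {
	Name         string
	AvailableCPU int // free millicores
}

// leastLoaded returns the cluster with the most free CPU among those that
// can still fit the request, or "" if no cluster has enough headroom.
func leastLoaded(clusters []clusterLoad, requiredCPU int) string {
	best := ""
	bestFree := -1
	for _, c := range clusters {
		if c.AvailableCPU < requiredCPU {
			continue // not enough available resources in this cluster
		}
		if c.AvailableCPU > bestFree {
			best, bestFree = c.Name, c.AvailableCPU
		}
	}
	return best
}

func main() {
	clusters := []clusterLoad{
		{Name: "Foo", AvailableCPU: 4000},
		{Name: "Bar", AvailableCPU: 9000},
	}
	fmt.Println(leastLoaded(clusters, 1000)) // Bar: the most free capacity that fits
}
```
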

There is a potential race condition here. Say at time _T1_ the control
plane learns there are _m_ available resources in a K8S cluster. As
the cluster is working independently, it still accepts workload
requests from other K8S clients or even another Ubernetes control
plane. The Ubernetes scheduling decision is based on this data of
available resources. However, when the actual RC creation happens in
the cluster at time _T2_, the cluster may not have enough resources
at that time. We will address this problem in later phases with some
proposed solutions like resource reservation mechanisms.

## Service Discovery

This part has been included in the section “Federated Service” of the
document
“[Ubernetes Cross-cluster Load Balancing and Service Discovery Requirements and System Design](federated-services.md)”. Please
refer to that document for details.

<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[]()
<!-- END MUNGE: GENERATED_ANALYTICS -->

docs/design/ubernetes-cluster-state.png (new binary file, 14 KiB)
docs/design/ubernetes-design.png (new binary file, 20 KiB)
docs/design/ubernetes-scheduling.png (new binary file, 38 KiB)