mirror of
				https://github.com/optim-enterprises-bv/kubernetes.git
				synced 2025-11-04 04:08:16 +00:00 
			
		
		
		
	
		
			
				
	
	
		
			282 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
			
		
		
	
	
			282 lines
		
	
	
		
			12 KiB
		
	
	
	
		
			Markdown
		
	
	
	
	
	
<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
 | 
						||
 | 
						||
<!-- BEGIN STRIP_FOR_RELEASE -->
 | 
						||
 | 
						||
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						||
     width="25" height="25">
 | 
						||
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						||
     width="25" height="25">
 | 
						||
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						||
     width="25" height="25">
 | 
						||
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						||
     width="25" height="25">
 | 
						||
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						||
     width="25" height="25">
 | 
						||
 | 
						||
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
 | 
						||
 | 
						||
If you are using a released version of Kubernetes, you should
 | 
						||
refer to the docs that go with that version.
 | 
						||
 | 
						||
Documentation for other releases can be found at
 | 
						||
[releases.k8s.io](http://releases.k8s.io).
 | 
						||
</strong>
 | 
						||
--
 | 
						||
 | 
						||
<!-- END STRIP_FOR_RELEASE -->
 | 
						||
 | 
						||
<!-- END MUNGE: UNVERSIONED_WARNING -->
 | 
						||
 | 
						||
# Redefine Container Runtime Interface
 | 
						||
 | 
						||
The umbrella issue: [#22964](https://issues.k8s.io/22964)
 | 
						||
 | 
						||
## Motivation
 | 
						||
 | 
						||
Kubelet employs a declarative pod-level interface, which acts as the sole
 | 
						||
integration point for container runtimes (e.g., `docker` and `rkt`). The
 | 
						||
high-level, declarative interface has caused higher integration and maintenance
 | 
						||
cost, and also slowed down feature velocity for the following reasons.
 | 
						||
  1. **Not every container runtime supports the concept of pods natively**.
 | 
						||
     When integrating with Kubernetes, a significant amount of work needs to
 | 
						||
     go into implementing a shim of significant size to support all pod
 | 
						||
     features. This also adds maintenance overhead (e.g., `docker`).
 | 
						||
  2. **High-level interface discourages code sharing and reuse among runtimes**.
 | 
						||
     E.g, each runtime today implements an all-encompassing `SyncPod()`
 | 
						||
     function, with the Pod Spec as the input argument. The runtime implements
 | 
						||
     logic to determine how to achieve the desired state based on the current
 | 
						||
     status, (re-)starts pods/containers and manages lifecycle hooks
 | 
						||
     accordingly.
 | 
						||
  3. **Pod Spec is evolving rapidly**. New features are being added constantly.
 | 
						||
     Any pod-level change or addition requires changing of all container
 | 
						||
     runtime shims. E.g., init containers and volume containers.
 | 
						||
 | 
						||
## Goals and Non-Goals
 | 
						||
 | 
						||
The goals of defining the interface are to
 | 
						||
 - **improve extensibility**: Easier container runtime integration.
 | 
						||
 - **improve feature velocity**
 | 
						||
 - **improve code maintainability**
 | 
						||
 | 
						||
The non-goals include
 | 
						||
 - proposing *how* to integrate with new runtimes, i.e., where the shim
 | 
						||
   resides. The discussion of adopting a client-server architecture is tracked
 | 
						||
   by [#13768](https://issues.k8s.io/13768), where benefits and shortcomings of
 | 
						||
   such an architecture is discussed.
 | 
						||
 - versioning the new interface/API. We intend to provide API versioning to
 | 
						||
   offer stability for runtime integrations, but the details are beyond the
 | 
						||
   scope of this proposal.
 | 
						||
 - adding support to Windows containers. Windows container support is a
 | 
						||
   parallel effort and is tracked by [#22623](https://issues.k8s.io/22623).
 | 
						||
   The new interface will not be augmented to support Windows containers, but
 | 
						||
   it will be made extensible such that the support can be added in the future.
 | 
						||
 - re-defining Kubelet's internal interfaces. These interfaces, though, may
 | 
						||
   affect Kubelet's maintainability, is not relevant to runtime integration.
 | 
						||
 - improving Kubelet's efficiency or performance, e.g., adopting event stream
 | 
						||
   from the container runtime [#8756](https://issues.k8s.io/8756),
 | 
						||
   [#16831](https://issues.k8s.io/16831).
 | 
						||
 | 
						||
## Requirements
 | 
						||
 | 
						||
 * Support the already integrated container runtime: `docker` and `rkt`
 | 
						||
 * Support hypervisor-based container runtimes: `hyper`.
 | 
						||
 | 
						||
The existing pod-level interface will remain as it is in the near future to
 | 
						||
ensure supports of all existing runtimes are continued. Meanwhile, we will
 | 
						||
work with all parties involved to switching to the proposed interface.
 | 
						||
 | 
						||
 | 
						||
## Container Runtime Interface
 | 
						||
 | 
						||
The main idea of this proposal is to adopt an imperative container-level
 | 
						||
interface, which allows Kubelet to directly control the lifecycles of the
 | 
						||
containers.
 | 
						||
 | 
						||
Pod is composed of a group of containers in an isolated environment with
 | 
						||
resource constraints. In Kubernetes, pod is also the smallest schedulable unit.
 | 
						||
After a pod has been scheduled to the node, Kubelet will create the environment
 | 
						||
for the pod, and add/update/remove containers in that environment to meet the
 | 
						||
Pod Spec. To distinguish between the environment and the pod as a whole, we
 | 
						||
will call the pod environment **PodSandbox.**
 | 
						||
 | 
						||
The container runtimes may interpret the PodSandBox concept differently based
 | 
						||
on how it operates internally. For runtimes relying on hypervisor, sandbox
 | 
						||
represents a virtual machine naturally. For others, it can be Linux namespaces.
 | 
						||
 | 
						||
In short, a PodSandbox should have the following features.
 | 
						||
 | 
						||
 * **Isolation**: E.g., Linux namespaces or a full virtual machine, or even
 | 
						||
   support additional security features.
 | 
						||
 * **Compute resource specifications**: A PodSandbox should implement pod-level
 | 
						||
   resource demands and restrictions.
 | 
						||
 | 
						||
*NOTE: The resource specification does not include externalized costs to
 | 
						||
container setup that are not currently trackable as Pod constraints, e.g.,
 | 
						||
filesystem setup, container image pulling, etc.*
 | 
						||
 | 
						||
A container in a PodSandbox maps to an application in the Pod Spec. For Linux
 | 
						||
containers, they are expected to share at least network and IPC namespaces,
 | 
						||
with sharing more namespaces discussed in [#1615](https://issues.k8s.io/1615).
 | 
						||
 | 
						||
 | 
						||
Below is an example of the proposed interfaces.
 | 
						||
 | 
						||
```go
 | 
						||
// PodSandboxManager contains basic operations for sandbox.
 | 
						||
type PodSandboxManager interface {
 | 
						||
    Create(config *PodSandboxConfig) (string, error)
 | 
						||
    Delete(id string) (string, error)
 | 
						||
    List(filter PodSandboxFilter) []PodSandboxListItem
 | 
						||
    Status(id string) PodSandboxStatus
 | 
						||
}
 | 
						||
 | 
						||
// ContainerRuntime contains basic operations for containers.
 | 
						||
type ContainerRuntime interface {
 | 
						||
    Create(config *ContainerConfig, sandboxConfig *PodSandboxConfig, PodSandboxID string) (string, error)
 | 
						||
    Start(id string) error
 | 
						||
    Stop(id string, timeout int) error
 | 
						||
    Remove(id string) error
 | 
						||
    List(filter ContainerFilter) ([]ContainerListItem, error)
 | 
						||
    Status(id string) (ContainerStatus, error)
 | 
						||
    Exec(id string, cmd []string, streamOpts StreamOptions) error
 | 
						||
}
 | 
						||
 | 
						||
// ImageService contains image-related operations.
 | 
						||
type ImageService interface {
 | 
						||
    List() ([]Image, error)
 | 
						||
    Pull(image ImageSpec, auth AuthConfig) error
 | 
						||
    Remove(image ImageSpec) error
 | 
						||
    Status(image ImageSpec) (Image, error)
 | 
						||
    Metrics(image ImageSpec) (ImageMetrics, error)
 | 
						||
}
 | 
						||
 | 
						||
type ContainerMetricsGetter interface {
 | 
						||
    ContainerMetrics(id string) (ContainerMetrics, error)
 | 
						||
}
 | 
						||
 | 
						||
All functions listed above are expected to be thread-safe.
 | 
						||
```
 | 
						||
 | 
						||
### Pod/Container Lifecycle
 | 
						||
 | 
						||
The PodSandbox’s lifecycle is decoupled from the containers, i.e., a sandbox
 | 
						||
is created before any containers, and can exist after all containers in it have
 | 
						||
terminated.
 | 
						||
 | 
						||
Assume there is a pod with a single container C. To start a pod:
 | 
						||
 | 
						||
```
 | 
						||
  create sandbox Foo --> create container C --> start container C
 | 
						||
```
 | 
						||
 | 
						||
To delete a pod:
 | 
						||
 | 
						||
```
 | 
						||
  stop container C --> remove container C --> delete sandbox Foo
 | 
						||
```
 | 
						||
 | 
						||
The restart policy in the Pod Spec defines how indiviual containers should
 | 
						||
be handled when they terminated. Kubelet is responsible to ensure that the
 | 
						||
restart policy is enforced. In other words, once Kubelet discovers that a
 | 
						||
container terminates (e.g., through `List()`), it will create and start a new
 | 
						||
container if needed.
 | 
						||
 | 
						||
Kubelet is also responsible for gracefully terminating all the containers
 | 
						||
in the sandbox before deleting the sandbox. If Kubelet chooses to delete
 | 
						||
the sandbox with running containers in it, those containers should be forcibly
 | 
						||
deleted.
 | 
						||
 | 
						||
Note that every PodSandbox/container lifecycle operation (create, start,
 | 
						||
stop, delete) should either return an error or block until the operation
 | 
						||
succeeds. A successful operation should include a state transition of the
 | 
						||
PodSandbox/container. E.g., if a `Create` call for a container does not
 | 
						||
return an error, the container state should be "created" when the runtime is
 | 
						||
queried.
 | 
						||
 | 
						||
### Updates to PodSandbox or Containers
 | 
						||
 | 
						||
Kubernetes support updates only to a very limited set of fields in the Pod
 | 
						||
Spec.  These updates may require containers to be re-created by Kubelet. This
 | 
						||
can be achieved through the proposed, imperative container-level interface.
 | 
						||
On the other hand, PodSandbox update currently is not required.
 | 
						||
 | 
						||
 | 
						||
### Container Lifecycle Hooks
 | 
						||
 | 
						||
Kubernetes supports post-start and pre-stop lifecycle hooks, with ongoing
 | 
						||
discussion for supporting pre-start and post-stop hooks in
 | 
						||
[#140](https://issues.k8s.io/140).
 | 
						||
 | 
						||
These lifecycle hooks will be implemented by Kubelet via `Exec` calls to the
 | 
						||
container runtime. This frees the runtimes from having to support hooks
 | 
						||
natively.
 | 
						||
 | 
						||
Illustration of the container lifecycle and hooks:
 | 
						||
 | 
						||
```
 | 
						||
            pre-start post-start    pre-stop post-stop
 | 
						||
               |        |              |       |
 | 
						||
              exec     exec           exec    exec
 | 
						||
               |        |              |       |
 | 
						||
 create --------> start ----------------> stop --------> remove
 | 
						||
```
 | 
						||
 | 
						||
In order for the lifecycle hooks to function as expected, the `Exec` call
 | 
						||
will need access to the container's filesystem (e.g., mount namespaces).
 | 
						||
 | 
						||
### Extensibility
 | 
						||
 | 
						||
There are several dimensions for container runtime extensibility.
 | 
						||
 - Host OS (e.g., Linux)
 | 
						||
 - PodSandbox isolation mechanism (e.g., namespaces or VM)
 | 
						||
 - PodSandbox OS (e.g., Linux)
 | 
						||
 | 
						||
As mentioned previously, this proposal will only address the Linux based
 | 
						||
PodSandbox and containers. All Linux-specific configuration will be grouped
 | 
						||
into one field. A container runtime is required to enforce all configuration
 | 
						||
applicable to its platform, and should return an error otherwise.
 | 
						||
 | 
						||
### Keep it minimal
 | 
						||
 | 
						||
The proposed interface is experimental, i.e., it will go through (many) changes
 | 
						||
until it stabilizes. The principle is to to keep the interface minimal and
 | 
						||
extend it later if needed. This includes a several features that are still in
 | 
						||
discussion and may be achieved alternatively:
 | 
						||
 | 
						||
 * `AttachContainer`: [#23335](https://issues.k8s.io/23335)
 | 
						||
 * `PortForward`: [#25113](https://issues.k8s.io/25113)
 | 
						||
 | 
						||
## Alternatives
 | 
						||
 | 
						||
**[Status quo] Declarative pod-level interface**
 | 
						||
 - Pros: No changes needed.
 | 
						||
 - Cons: All the issues stated in #motivation
 | 
						||
 | 
						||
**Allow integration at both pod- and container-level interfaces**
 | 
						||
 - Pros: Flexibility.
 | 
						||
 - Cons: All the issues stated in #motivation
 | 
						||
 | 
						||
**Imperative pod-level interface**
 | 
						||
The interface contains only CreatePod(), StartPod(), StopPod() and RemovePod().
 | 
						||
This implies that the runtime needs to take over container lifecycle
 | 
						||
manangement (i.e., enforce restart policy), lifecycle hooks, liveness checks,
 | 
						||
etc. Kubelet will mainly be responsible for interfacing with the apiserver, and
 | 
						||
can potentially become a very thin daemon.
 | 
						||
 - Pros: Lower maintenance overhead for the Kubernetes maintainers if `Docker`
 | 
						||
   shim maintenance cost is discounted.
 | 
						||
 - Cons: This will incur higher integration cost because every new container
 | 
						||
   runtime needs to implement all the features and need to understand the
 | 
						||
   concept of pods. This would also lead to lower feature velocity because the
 | 
						||
   interface will need to be changed, and the new pod-level feature will need
 | 
						||
   to be supported in each runtime.
 | 
						||
 | 
						||
## Related Issues
 | 
						||
 | 
						||
 * Metrics: [#27097](https://issues.k8s.io/27097)
 | 
						||
 * Log management: [#24677](https://issues.k8s.io/24677)
 | 
						||
 | 
						||
 | 
						||
<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
 | 
						||
[]()
 | 
						||
<!-- END MUNGE: GENERATED_ANALYTICS -->
 |