<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->
 | 
						|
 | 
						|
<!-- BEGIN STRIP_FOR_RELEASE -->
 | 
						|
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
 | 
						|
     width="25" height="25">
 | 
						|
 | 
						|
<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>
 | 
						|
 | 
						|
If you are using a released version of Kubernetes, you should
 | 
						|
refer to the docs that go with that version.
 | 
						|
 | 
						|
<!-- TAG RELEASE_LINK, added by the munger automatically -->
 | 
						|
<strong>
 | 
						|
The latest release of this document can be found
 | 
						|
[here](http://releases.k8s.io/release-1.3/docs/design/indexed-job.md).
 | 
						|
 | 
						|
Documentation for other releases can be found at
 | 
						|
[releases.k8s.io](http://releases.k8s.io).
 | 
						|
</strong>
 | 
						|
--
 | 
						|
 | 
						|
<!-- END STRIP_FOR_RELEASE -->
 | 
						|
 | 
						|
<!-- END MUNGE: UNVERSIONED_WARNING -->
 | 
						|
 | 
						|
# Design: Indexed Feature of Job object
 | 
						|
 | 
						|
 | 
						|
## Summary
 | 
						|
 | 
						|
This design extends kubernetes with user-friendly support for
 | 
						|
running embarrassingly parallel jobs.
 | 
						|
 | 
						|
Here, *parallel* means on multiple nodes, which means multiple pods.
 | 
						|
By *embarrassingly parallel*,  it is meant that the pods
 | 
						|
have no dependencies between each other. In particular, neither
ordering between pods nor gang scheduling is supported.
 | 
						|
 | 
						|
Users already have two other options for running embarrassingly parallel
 | 
						|
Jobs (described in the next section), but both have ease-of-use issues.
 | 
						|
 | 
						|
Therefore, this document proposes extending the Job resource type to support
 | 
						|
a third way to run embarrassingly parallel programs, with a focus on
 | 
						|
ease of use.
 | 
						|
 | 
						|
This new style of Job is called an *indexed job*, because each Pod of the Job
 | 
						|
is specialized to work on a particular *index* from a fixed length array of work
 | 
						|
items.
 | 
						|
 | 
						|
## Background
 | 
						|
 | 
						|
The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
 | 
						|
the embarrassingly parallel use case through *workqueue jobs*.
 | 
						|
While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very
 | 
						|
flexible, they can be difficult to use. They (1) typically require running a
message queue or other database service, (2) typically require modifications
to existing binaries and images, and (3) are prone to subtle race conditions
that are easy to overlook.
 | 
						|
 | 
						|
Users also have another option for parallel jobs: creating [multiple Job objects
 | 
						|
from a template](hdocs/design/indexed-job.md#job-patterns). For small numbers of
 | 
						|
Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job
 | 
						|
objects at once. But that approach also has its drawbacks: (1) for large levels
of parallelism (hundreds or thousands of pods), listing all jobs presents too
much information, and (2) users want a single source of information about the
success or failure of what they view as a single logical process.
 | 
						|
 | 
						|
Indexed job provides a third option with better ease-of-use for common
 | 
						|
use cases.
 | 
						|
 | 
						|
## Requirements
 | 
						|
 | 
						|
### User Requirements
 | 
						|
 | 
						|
- Users want an easy way to run a Pod to completion *for each* item within a
 | 
						|
[work list](#example-use-cases).
 | 
						|
 | 
						|
- Users want to run these pods in parallel for speed, but to vary the level of
 | 
						|
parallelism as needed, independent of the number of work items.
 | 
						|
 | 
						|
- Users want to do this without requiring changes to existing images,
 | 
						|
or source-to-image pipelines.
 | 
						|
 | 
						|
- Users want a single object that encompasses the lifetime of the parallel
 | 
						|
program. Deleting it should delete all dependent objects. It should report the
 | 
						|
status of the overall process. Users should be able to wait for it to complete,
 | 
						|
and can refer to it from other resource types, such as
 | 
						|
[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).
 | 
						|
 | 
						|
 | 
						|
### Example Use Cases
 | 
						|
 | 
						|
Here are several examples of *work lists*: lists of command lines that the user
 | 
						|
wants to run, each line its own Pod. (Note that in practice, a work list may not
 | 
						|
ever be written out in this form, but it exists in the mind of the Job creator,
 | 
						|
and it is a useful way to talk about the intent of the user when discussing
 | 
						|
alternatives for specifying Indexed Jobs).
 | 
						|
 | 
						|
Note that we will not have the user express their requirements in work list
 | 
						|
form; it is just a format for presenting use cases. Subsequent discussion will
 | 
						|
reference these work lists.
 | 
						|
 | 
						|
#### Work List 1
 | 
						|
 | 
						|
Process several files with the same program:
 | 
						|
 | 
						|
```
 | 
						|
/usr/local/bin/process_file 12342.dat
 | 
						|
/usr/local/bin/process_file 97283.dat
 | 
						|
/usr/local/bin/process_file 38732.dat
 | 
						|
```
 | 
						|
 | 
						|
#### Work List 2
 | 
						|
 | 
						|
Process a matrix (or image, etc) in rectangular blocks:
 | 
						|
 | 
						|
```
 | 
						|
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
 | 
						|
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
 | 
						|
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
 | 
						|
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
 | 
						|
```
 | 
						|
 | 
						|
#### Work List 3
 | 
						|
 | 
						|
Build a program at several different git commits:
 | 
						|
 | 
						|
```
 | 
						|
HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH
 | 
						|
HASH=fe97ef90b git checkout $HASH && make clean && make VERSION=$HASH
 | 
						|
HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH
 | 
						|
```
 | 
						|
 | 
						|
#### Work List 4
 | 
						|
 | 
						|
Render several frames of a movie:
 | 
						|
 | 
						|
```
 | 
						|
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1
 | 
						|
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2
 | 
						|
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3
 | 
						|
```
 | 
						|
 | 
						|
#### Work List 5
 | 
						|
 | 
						|
Render several blocks of frames (Render blocks to avoid Pod startup overhead for
 | 
						|
every frame):
 | 
						|
 | 
						|
```
 | 
						|
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100
 | 
						|
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200
 | 
						|
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300
 | 
						|
```
 | 
						|
 | 
						|
## Design Discussion
 | 
						|
 | 
						|
### Converting Work Lists into Indexed Jobs
 | 
						|
 | 
						|
Given a work list, like in the [work list examples](#example-use-cases),
 | 
						|
the information from the work list needs to get into each Pod of the Job.
 | 
						|
 | 
						|
Users will typically not want to create a new image for each job they
 | 
						|
run. They will want to use existing images. So, the image is not the place
 | 
						|
for the work list.
 | 
						|
 | 
						|
A work list can be stored on networked storage, and mounted by pods of the job.
 | 
						|
Also, as a shortcut, for small worklists, it can be included in an annotation on
 | 
						|
the Job object, which is then exposed as a volume in the pod via the downward
 | 
						|
API.
 | 
						|
 | 
						|
### What Varies Between Pods of a Job
 | 
						|
 | 
						|
Pods need to differ in some way to do something different. (They do not differ
 | 
						|
in the work-queue style of Job, but that style has ease-of-use issues).
 | 
						|
 | 
						|
A general approach would be to allow pods to differ from each other in arbitrary
 | 
						|
ways. For example, the Job object could have a list of PodSpecs to run.
 | 
						|
However, this is so general that it provides little value. It would:
 | 
						|
 | 
						|
- make the Job Spec very verbose, especially for jobs with thousands of work
items
- make Job such a vague concept that it is hard to explain to users
- address a need we do not see in practice: many pods which differ across many
fields of their specs, yet need to run as a group with no ordering constraints
- require CLIs and UIs to support more options for creating a Job
- make aggregation less meaningful: monitoring and accounting databases want to
aggregate data for pods with the same controller, but pods with very different
Specs may not make sense to aggregate
- prevent profiling, debugging, accounting, auditing and monitoring tools from
assuming common images/files, behaviors, provenance and so on between Pods of
a Job.
 | 
						|
 | 
						|
Also, variety has another cost. Pods which differ in ways that affect scheduling
 | 
						|
(node constraints, resource requirements, labels) prevent the scheduler from
 | 
						|
treating them as fungible, which is an important optimization for the scheduler.
 | 
						|
 | 
						|
Therefore, we will not allow Pods from the same Job to differ arbitrarily
 | 
						|
(anyway, users can use multiple Job objects for that case).  We will try to
 | 
						|
allow as little as possible to differ between pods of the same Job, while still
 | 
						|
allowing users to express common parallel patterns easily. For users who need to
 | 
						|
run jobs which differ in other ways, they can create multiple Jobs, and manage
 | 
						|
them as a group using labels.
 | 
						|
 | 
						|
From the above work lists, we see a need for Pods which differ in their command
 | 
						|
lines, and in their environment variables.  These work lists do not require the
 | 
						|
pods to differ in other ways.
 | 
						|
 | 
						|
Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
 | 
						|
has shown this model to be applicable to a very broad range of problems, despite
 | 
						|
this restriction.
 | 
						|
 | 
						|
Therefore, we propose to allow pods in the same Job to differ **only** in the
following aspects:
 | 
						|
- command line
 | 
						|
- environment variables
 | 
						|
 | 
						|
### Composition of existing images
 | 
						|
 | 
						|
The docker image that is used in a job may not be maintained by the person
 | 
						|
running the job.  Over time, the Dockerfile may change the ENTRYPOINT or CMD.
 | 
						|
If we require people to specify the complete command line to use Indexed Job,
 | 
						|
then they will not automatically pick up changes in the default
 | 
						|
command or args.
 | 
						|
 | 
						|
This needs more thought.
 | 
						|
 | 
						|
### Running Ad-Hoc Jobs using kubectl
 | 
						|
 | 
						|
A user should be able to easily start an Indexed Job using `kubectl`. For
 | 
						|
example to run [work list 1](#work-list-1), a user should be able to type
 | 
						|
something simple like:
 | 
						|
 | 
						|
```
 | 
						|
kubectl run process-files --image=myfileprocessor \
 | 
						|
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
 | 
						|
   --restart=OnFailure  \
 | 
						|
   -- \
 | 
						|
   /usr/local/bin/process_file '$F'
 | 
						|
```
 | 
						|
 | 
						|
In the above example:
 | 
						|
 | 
						|
- `--restart=OnFailure` implies creating a job instead of a replicationController.
- Each pod's command line is `/usr/local/bin/process_file $F`.
- `--per-completion-env=` implies the job's `.spec.completions` is set to the
length of the argument array (3 in the example).
- `--per-completion-env=F=<values>` causes an env var named `F` to be available in
the environment when the command line is evaluated.
 | 
						|
 | 
						|
How exactly this happens is discussed later in the doc: this is a sketch of the
 | 
						|
user experience.
 | 
						|
 | 
						|
In practice, the list of files might be much longer and stored in a file on the
 | 
						|
user's local host, like:
 | 
						|
 | 
						|
```
 | 
						|
$ cat files-to-process.txt
 | 
						|
12342.dat
 | 
						|
97283.dat
 | 
						|
38732.dat
 | 
						|
...
 | 
						|
```
 | 
						|
 | 
						|
So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.
 | 
						|
 | 
						|
However, `kubectl` should also support a format like:
 | 
						|
 `--per-completion-env=F=@files-to-process.txt`.
 | 
						|
That allows `kubectl` to parse the file, point out any syntax errors, and would
 | 
						|
not run up against command line length limits (2MB is common, as low as 4kB is
 | 
						|
POSIX compliant).
 | 
						|
 | 
						|
One case we do not try to handle is where the file of work is stored on a cloud
 | 
						|
filesystem, and not accessible from the user's local host. Then we cannot easily
 | 
						|
use indexed job, because we do not know the number of completions.  The user
 | 
						|
needs to copy the file locally first or use the Work-Queue style of Job (already
 | 
						|
supported).
 | 
						|
 | 
						|
Another case we do not try to handle is where the input file does not exist yet
 | 
						|
because this Job is to be run at a future time, or depends on another job. The
 | 
						|
workflow and scheduled job proposal need to consider this case. For that case,
 | 
						|
you could use an indexed job which runs a program which shards the input file
 | 
						|
(map-reduce-style).
 | 
						|
 | 
						|
#### Multiple parameters
 | 
						|
 | 
						|
The user may also have multiple parameters, like in [work list 2](#work-list-2).
 | 
						|
One way is to just list all the command lines already expanded, one per line, in
 | 
						|
a file, like this:
 | 
						|
 | 
						|
```
 | 
						|
$ cat matrix-commandlines.txt
 | 
						|
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
 | 
						|
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
 | 
						|
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
 | 
						|
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
 | 
						|
```
 | 
						|
 | 
						|
and run the Job like this:
 | 
						|
 | 
						|
```
 | 
						|
kubectl run process-matrix --image=my/matrix \
 | 
						|
   --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \
 | 
						|
   --restart=OnFailure  \
 | 
						|
   -- \
 | 
						|
   'eval "$COMMAND_LINE"'
 | 
						|
```
 | 
						|
 | 
						|
However, this may have some subtleties with shell escaping.  Also, it depends on
 | 
						|
the user knowing all the correct arguments to the docker image being used (more
 | 
						|
on this later).
 | 
						|
 | 
						|
Instead, kubectl should support multiple instances of the `--per-completion-env`
 | 
						|
flag. For example, to implement work list 2, a user could do:
 | 
						|
 | 
						|
```
 | 
						|
kubectl run process-matrix --image=my/matrix \
 | 
						|
   --per-completion-env=SR="0 16 0 16" \
 | 
						|
   --per-completion-env=ER="15 31 15 31" \
 | 
						|
   --per-completion-env=SC="0 0 16 16" \
 | 
						|
   --per-completion-env=EC="15 15 31 31" \
 | 
						|
   --restart=OnFailure  \
 | 
						|
   -- \
 | 
						|
   /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $SC --end_col $EC
 | 
						|
```
 | 
						|
 | 
						|
### Composition With Workflows and ScheduledJob
 | 
						|
 | 
						|
A user should be able to create a job (Indexed or not) which runs at a specific
 | 
						|
time(s). For example:
 | 
						|
 | 
						|
```
 | 
						|
$ kubectl run process-files --image=myfileprocessor \
 | 
						|
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
 | 
						|
   --restart=OnFailure  \
 | 
						|
   --runAt=2015-07-21T14:00:00Z \
 | 
						|
   -- \
 | 
						|
   /usr/local/bin/process_file '$F'
 | 
						|
created "scheduledJob/process-files-37dt3"
 | 
						|
```
 | 
						|
 | 
						|
Kubectl should build the same JobSpec, and then put it into a ScheduledJob
 | 
						|
(#11980) and create that.
 | 
						|
 | 
						|
For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
 | 
						|
complete workflow from a single command line would be messy, because of the need
 | 
						|
to specify all the arguments multiple times.
 | 
						|
 | 
						|
For that use case, the user could create a workflow message by hand. Or the user
 | 
						|
could create a job template, and then make a workflow from the templates,
 | 
						|
perhaps like this:
 | 
						|
 | 
						|
```
 | 
						|
$ kubectl run process-files --image=myfileprocessor \
 | 
						|
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
 | 
						|
   --restart=OnFailure  \
 | 
						|
   --asTemplate \
 | 
						|
   -- \
 | 
						|
   /usr/local/bin/process_file '$F'
 | 
						|
created "jobTemplate/process-files"
 | 
						|
$ kubectl run merge-files --image=mymerger \
 | 
						|
   --restart=OnFailure  \
 | 
						|
   --asTemplate \
 | 
						|
   -- \
 | 
						|
   /usr/local/bin/mergefiles 12342.out 97283.out 38732.out
 | 
						|
created "jobTemplate/merge-files"
 | 
						|
$ kubectl create-workflow process-and-merge \
 | 
						|
   --job=jobTemplate/process-files \
   --job=jobTemplate/merge-files \
 | 
						|
   --dependency=process-files:merge-files
 | 
						|
created "workflow/process-and-merge"
 | 
						|
```
 | 
						|
 | 
						|
### Completion Indexes
 | 
						|
 | 
						|
A JobSpec specifies the number of times a pod needs to complete successfully,
 | 
						|
through the `job.Spec.Completions` field. The number of completions will be
 | 
						|
equal to the number of work items in the work list.
 | 
						|
 | 
						|
Each pod that the job controller creates is intended to complete one work item
 | 
						|
from the work list. Since a pod may fail, several pods may, serially, attempt to
 | 
						|
complete the same index. Therefore, we call it a *completion index* (or just
 | 
						|
*index*), but not a *pod index*.
 | 
						|
 | 
						|
For each completion index, in the range 1 to `.job.Spec.Completions`, the job
 | 
						|
controller will create a pod with that index, and keep creating them on failure,
 | 
						|
until each index is completed.
 | 
						|
 | 
						|
A dense integer index, rather than a sparse string index (e.g. using just
`metadata.generateName`) makes it easy to use the index to look up parameters
 | 
						|
in, for example, an array in shared storage.
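
As a minimal sketch of such a lookup (not part of the proposal), the Python
below maps a completion index to one entry of a work list. The helper name
`work_item_for_index` and the zero-based indexing are assumptions made for
illustration; in a pod, the index would come from the annotation described
later in this document.

```python
def work_item_for_index(lines, index):
    """Return the work item for a completion index (zero-based by assumption)."""
    # Ignore blank lines so a trailing newline in the file is harmless.
    items = [line.strip() for line in lines if line.strip()]
    if not 0 <= index < len(items):
        raise IndexError(
            f"completion index {index} out of range for {len(items)} items"
        )
    return items[index]


# Lines as they might be read from a work-list file on shared storage.
work_list = ["12342.dat\n", "97283.dat\n", "38732.dat\n"]
print(work_item_for_index(work_list, 1))
```

Because the index is dense, a plain list is enough; a sparse string identity
would need a separate mapping from names to work items.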
 | 
						|
 | 
						|
### Pod Identity and Template Substitution in Job Controller
 | 
						|
 | 
						|
The JobSpec contains a single pod template.  When the job controller creates a
 | 
						|
particular pod, it copies the pod template and modifies it in some way to make
 | 
						|
that pod distinctive. Whatever is distinctive about that pod is its *identity*.
 | 
						|
 | 
						|
We consider several options.
 | 
						|
 | 
						|
#### Index Substitution Only
 | 
						|
 | 
						|
The job controller substitutes only the *completion index* of the pod into the
 | 
						|
pod template when creating it. The JSON it POSTs differs only in a single
field.
 | 
						|
 | 
						|
We would put the completion index as a stringified integer, into an annotation
 | 
						|
of the pod. The user can extract it from the annotation into an env var via the
 | 
						|
downward API, or put it in a file via a Downward API volume, and parse it
themselves.
 | 
						|
 | 
						|
Once it is an environment variable in the pod (say `$INDEX`), then one of two
 | 
						|
things can happen.
 | 
						|
 | 
						|
First, the main program can know how to map from an integer index to what it
 | 
						|
needs to do. For example, from Work List 4 above:
 | 
						|
 | 
						|
```
 | 
						|
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
 | 
						|
```
 | 
						|
 | 
						|
Second, a shell script can be prepended to the original command line which maps
 | 
						|
the index to one or more string parameters. For example, to implement Work List
 | 
						|
5 above, you could do:
 | 
						|
 | 
						|
```
 | 
						|
. /vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
 | 
						|
```
 | 
						|
 | 
						|
In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX`
 | 
						|
and exports `$START_FRAME` and `$END_FRAME`.
 | 
						|
 | 
						|
The shell script could be part of the image, but more usefully, it could be generated
 | 
						|
by a program and stuffed in an annotation or a configMap, and from there added
 | 
						|
to a volume.
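
A generator for such a script might look like the following sketch. It is only
an illustration under stated assumptions: the function name
`generate_setupenv`, the zero-based index, and the `case`-based script layout
are hypothetical and not defined by this proposal. The generated script has to
be sourced by the shell that runs the main command so that the exported
variables are visible to it.

```python
def generate_setupenv(blocks):
    """Build the text of a POSIX shell script that maps $INDEX to a frame range.

    `blocks` is a list of (start_frame, end_frame) pairs; the position of each
    pair in the list is the completion index it belongs to.
    """
    lines = ['case "$INDEX" in']
    for index, (start, end) in enumerate(blocks):
        lines.append(f"  {index}) export START_FRAME={start} END_FRAME={end} ;;")
    # Fail loudly for an index that has no entry.
    lines.append('  *) echo "unknown index: $INDEX" >&2; exit 1 ;;')
    lines.append("esac")
    return "\n".join(lines) + "\n"


# The three blocks of Work List 5.
script = generate_setupenv([(1, 100), (101, 200), (201, 300)])
print(script)
```

The resulting text could then be stored in an annotation or a configMap and
mounted into the pod, as described above.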
 | 
						|
 | 
						|
The first approach may require the user to modify an existing image (see next
 | 
						|
section) to be able to accept an `$INDEX` env var or argument. The second
 | 
						|
approach requires that the image have a shell. We think that together these two
 | 
						|
options cover a wide range of use cases (though not all).
 | 
						|
 | 
						|
#### Multiple Substitution
 | 
						|
 | 
						|
In this option, the JobSpec is extended to include a list of values to
 | 
						|
substitute, and which fields to substitute them into. For example, a worklist
 | 
						|
like this:
 | 
						|
 | 
						|
```
 | 
						|
FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
 | 
						|
FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
 | 
						|
FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
 | 
						|
```
 | 
						|
 | 
						|
can be broken down into a template like this, with three parameters:
 | 
						|
 | 
						|
```
<custom env var 1>; process-fruit -a -b -c <custom arg 1> <custom arg 2>
```
 | 
						|
 | 
						|
and a list of parameter tuples, like this:
 | 
						|
 | 
						|
```
 | 
						|
("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds")
 | 
						|
("FRUIT_COLOR=yellow", "-f banana.txt", "")
 | 
						|
("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
 | 
						|
```
 | 
						|
 | 
						|
The JobSpec can be extended to hold a list of parameter tuples (which are more
 | 
						|
easily expressed as a list of lists of individual parameters). For example:
 | 
						|
 | 
						|
```
 | 
						|
apiVersion: extensions/v1beta1
 | 
						|
kind: Job
 | 
						|
...
 | 
						|
spec:
 | 
						|
  completions: 3
 | 
						|
  ...
 | 
						|
  template:
 | 
						|
    ...
 | 
						|
  perCompletionArgs:
 | 
						|
    container: 0
 | 
						|
      -
 | 
						|
        - "-f apple.txt"
 | 
						|
        - "-f banana.txt"
 | 
						|
        - "-f cherry.txt"
 | 
						|
      -
 | 
						|
        - "--remove-seeds"
 | 
						|
        - ""
 | 
						|
        - "--remove-pit"
 | 
						|
  perCompletionEnvVars:
 | 
						|
    - name: "FRUIT_COLOR"
 | 
						|
      - "green"
 | 
						|
      - "yellow"
 | 
						|
      - "red"
 | 
						|
```
 | 
						|
 | 
						|
However, just providing custom env vars, and not arguments, is sufficient for
 | 
						|
many use cases: parameters can be put into env vars, and then substituted on the
 | 
						|
command line.
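
To make the template-plus-tuples idea in this section concrete, here is a
small sketch that expands the fruit example back into its work list. The
function name `expand` and the way empty parameters are dropped are
assumptions for illustration only; the proposal does not define an expansion
algorithm.

```python
def expand(command, tuples):
    """Expand (env var, arg 1, arg 2) tuples against a fixed command.

    Empty parameters are skipped so that an optional argument can be omitted.
    """
    command_lines = []
    for env_var, arg1, arg2 in tuples:
        parts = [env_var, command, arg1, arg2]
        command_lines.append(" ".join(part for part in parts if part))
    return command_lines


tuples = [
    ("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds"),
    ("FRUIT_COLOR=yellow", "-f banana.txt", ""),
    ("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit"),
]
for line in expand("process-fruit -a -b -c", tuples):
    print(line)
```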
 | 
						|
 | 
						|
#### Comparison
 | 
						|
 | 
						|
The multiple substitution approach:
 | 
						|
 | 
						|
- keeps the *per completion parameters* in the JobSpec.
 | 
						|
- Drawback: makes the job spec large for jobs with thousands of completions. (But
 | 
						|
for very large jobs, the work-queue style or another type of controller, such as
 | 
						|
map-reduce or spark, may be a better fit.)
 | 
						|
- Drawback: is a form of server-side templating, which we want in Kubernetes but
 | 
						|
have not fully designed (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).
 | 
						|
 | 
						|
The index-only approach:
 | 
						|
 | 
						|
- Requires that the user keep the *per completion parameters* in a separate
 | 
						|
storage, such as a ConfigMap or networked storage.
 | 
						|
- Makes no changes to the JobSpec.
 | 
						|
- Drawback: while in separate storage, they could be mutated, which would have
 | 
						|
unexpected effects.
 | 
						|
- Drawback: Logic for using index to lookup parameters needs to be in the Pod.
 | 
						|
- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
 | 
						|
pod from a job. They cannot easily say, for example `repeated failures on the
 | 
						|
pod processing banana.txt`.
 | 
						|
 | 
						|
Index-only approach relies on at least one of the following being true:
 | 
						|
 | 
						|
1. Image containing a shell and certain shell commands (not all images have
 | 
						|
this).
 | 
						|
1. The user directly consumes the index from annotations (file or env var) and
expands it to specific behavior in the main program.
 | 
						|
 | 
						|
Also, using the index-only approach from non-kubectl clients requires that they
 | 
						|
mimic the script-generation step, or only use the second style.
 | 
						|
 | 
						|
#### Decision
 | 
						|
 | 
						|
It is decided to implement the Index-only approach now. Once the server-side
 | 
						|
templating design is complete for Kubernetes, and we have feedback from users,
 | 
						|
we can consider whether to add Multiple Substitution.
 | 
						|
 | 
						|
## Detailed Design
 | 
						|
 | 
						|
#### Job Resource Schema Changes
 | 
						|
 | 
						|
No changes are made to the JobSpec.
 | 
						|
 | 
						|
 | 
						|
The JobStatus is also not changed. The user can gauge the progress of the job by
 | 
						|
the `.status.succeeded` count.
 | 
						|
 | 
						|
 | 
						|
#### Job Spec Compatibility
 | 
						|
 | 
						|
A job spec written before this change will work exactly the same as before with
 | 
						|
the new controller. The Pods it creates will have the same environment as
 | 
						|
before. They will have a new annotation, but pods are expected to tolerate
 | 
						|
unfamiliar annotations.
 | 
						|
 | 
						|
However, if the job controller version is reverted, to a version before this
 | 
						|
change, the jobs whose pod specs depend on the new annotation will fail.
 | 
						|
This is okay for a Beta resource.
 | 
						|
 | 
						|
#### Job Controller Changes
 | 
						|
 | 
						|
The Job controller will maintain for each Job a data structure which
 | 
						|
indicates the status of each completion index. We call this the
 | 
						|
*scoreboard* for short. It is an array of length `.spec.completions`.
 | 
						|
Elements of the array are `enum` type with possible values including
 | 
						|
`complete`, `running`, and `notStarted`.
 | 
						|
 | 
						|
The scoreboard is stored in Job Controller memory for efficiency. It can be
reconstructed by watching the pods of the job (such as on
 | 
						|
a controller manager restart). The index of the pods can be extracted from the
 | 
						|
pod annotation.
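
The reconstruction step could be sketched as follows. This is an illustration,
not the controller's code: the function name `rebuild_scoreboard` is
hypothetical, pods are represented as plain dictionaries shaped like the pod
JSON (`metadata.annotations`, `status.phase`), and the index is assumed to be
zero-based.

```python
ANNOTATION = "kubernetes.io/job/completion-index"


def rebuild_scoreboard(completions, pods):
    """Rebuild the per-index status array from the pods observed for a job."""
    board = ["notStarted"] * completions
    for pod in pods:
        index = int(pod["metadata"]["annotations"][ANNOTATION])
        phase = pod["status"]["phase"]
        if phase == "Succeeded":
            board[index] = "complete"
        elif phase in ("Pending", "Running") and board[index] != "complete":
            board[index] = "running"
        # Failed pods leave the entry untouched, so the index is retried
        # unless another pod for the same index is running or has succeeded.
    return board


pods = [
    {"metadata": {"annotations": {ANNOTATION: "0"}}, "status": {"phase": "Failed"}},
    {"metadata": {"annotations": {ANNOTATION: "0"}}, "status": {"phase": "Succeeded"}},
    {"metadata": {"annotations": {ANNOTATION: "1"}}, "status": {"phase": "Running"}},
]
print(rebuild_scoreboard(3, pods))
```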
 | 
						|
 | 
						|
When Job controller sees that the number of running pods is less than the
 | 
						|
desired parallelism of the job, it finds the first index in the scoreboard with
value `notStarted`. It creates a pod with this completion index.
 | 
						|
 | 
						|
When it creates a pod with completion index `i`, it makes a copy of the
 | 
						|
`.spec.template`, and sets
 | 
						|
`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to
 | 
						|
`i`. It does this in both the index-only and multiple-substitutions options.
 | 
						|
 | 
						|
Then it creates the pod.
 | 
						|
 | 
						|
When the controller notices that a pod has completed or is running or failed,
 | 
						|
it updates the scoreboard.
 | 
						|
 | 
						|
When all entries in the scoreboard are `complete`, then the job is complete.
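
The controller behavior described in this section can be sketched in a few
lines of Python. This is a simplified model under stated assumptions, not the
real Job controller: the class and method names are hypothetical, indexes are
zero-based, an entry is marked `running` as soon as its pod is picked for
creation, and a failed pod returns its index to `notStarted` so it is retried.

```python
from enum import Enum


class State(Enum):
    NOT_STARTED = "notStarted"
    RUNNING = "running"
    COMPLETE = "complete"


class Scoreboard:
    """Tracks one State per completion index."""

    def __init__(self, completions):
        self.states = [State.NOT_STARTED] * completions

    def fill(self, parallelism):
        """Pick indexes to start until `parallelism` entries are running.

        Returns the picked indexes, lowest first.
        """
        started = []
        while self.states.count(State.RUNNING) < parallelism:
            try:
                index = self.states.index(State.NOT_STARTED)
            except ValueError:
                break  # nothing left to start
            self.states[index] = State.RUNNING
            started.append(index)
        return started

    def record(self, index, phase):
        """Update one entry from an observed pod phase."""
        if phase == "Succeeded":
            self.states[index] = State.COMPLETE
        elif self.states[index] is not State.COMPLETE:
            if phase == "Failed":
                self.states[index] = State.NOT_STARTED
            elif phase == "Running":
                self.states[index] = State.RUNNING

    def is_complete(self):
        return all(state is State.COMPLETE for state in self.states)


board = Scoreboard(completions=3)
print(board.fill(parallelism=2))  # starts the first two indexes
board.record(0, "Succeeded")
board.record(1, "Failed")
print(board.fill(parallelism=2))  # retries index 1 and starts index 2
```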
 | 
						|
 | 
						|
 | 
						|
#### Downward API Changes
 | 
						|
 | 
						|
The downward API is changed to support extracting specific key names into a
 | 
						|
single environment variable. So, the following would be supported:
 | 
						|
 | 
						|
```
 | 
						|
kind: Pod
 | 
						|
apiVersion: v1
 | 
						|
spec:
 | 
						|
  containers:
 | 
						|
  - name: foo
 | 
						|
    env:
 | 
						|
    - name: MY_INDEX
 | 
						|
      valueFrom:
 | 
						|
        fieldRef:
 | 
						|
          fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
 | 
						|
```
 | 
						|
 | 
						|
This requires kubelet changes.
 | 
						|
 | 
						|
Users who fail to upgrade their kubelets at the same time as they upgrade their
 | 
						|
controller manager will see a failure for pods to run when they are created by
 | 
						|
the controller. The Kubelet will send an event about failure to create the pod.
 | 
						|
`kubectl describe job` will show many failed pods.
 | 
						|
 | 
						|
 | 
						|
#### Kubectl Interface Changes
 | 
						|
 | 
						|
The `--completions` and `--completion-index-var-name` flags are added to
 | 
						|
kubectl.
 | 
						|
 | 
						|
For example, this command:
 | 
						|
 | 
						|
```
 | 
						|
kubectl run say-number --image=busybox \
 | 
						|
   --completions=3 \
 | 
						|
   --completion-index-var-name=I \
 | 
						|
   -- \
 | 
						|
   sh -c 'echo "My index is $I" && sleep 5' 
 | 
						|
```
 | 
						|
 | 
						|
will run 3 pods to completion, each printing one of the following lines:
 | 
						|
 | 
						|
```
 | 
						|
My index is 1
 | 
						|
My index is 2
 | 
						|
My index is 0
 | 
						|
```
 | 
						|
 | 
						|
Kubectl would create the following pod:
 | 
						|
 | 
						|
 | 
						|
 | 
						|
Kubectl will also support the `--per-completion-env` flag, as described
 | 
						|
previously. For example, this command:
 | 
						|
 | 
						|
```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="apple banana cherry" \
   --per-completion-env=COLOR="green yellow red" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
 | 
						|
 | 
						|
or equivalently:
 | 
						|
 | 
						|
```
echo "apple banana cherry" > fruits.txt
echo "green yellow red" > colors.txt

kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="$(cat fruits.txt)" \
   --per-completion-env=COLOR="$(cat colors.txt)" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
 | 
						|
 | 
						|
or similarly:
 | 
						|
 | 
						|
```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT=@fruits.txt \
   --per-completion-env=COLOR=@colors.txt \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
 | 
						|
 | 
						|
will all run 3 pods in parallel. The pod with index 0 will log:

```
Have a nice green apple
```
 | 
						|
 | 
						|
and so on.
 | 
						|
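To make "and so on" concrete, the index-to-value mapping can be simulated locally. This is only an illustrative sketch: `nth_word` and `simulate_pod` are hypothetical helpers, no cluster or kubectl is involved, and the lists are the ones from the `say-fruit` example.

```sh
#!/bin/sh
# Local simulation only; nothing here talks to a cluster.
FRUITS="apple banana cherry"
COLORS="green yellow red"

# nth_word INDEX WORD... prints the word at the zero-based INDEX.
nth_word() {
  n=$1
  shift        # drop INDEX, leaving only the words
  shift "$n"   # drop the first INDEX words
  echo "$1"
}

# simulate_pod INDEX prints what the pod with that completion index would log.
simulate_pod() {
  FRUIT=$(nth_word "$1" $FRUITS)
  COLOR=$(nth_word "$1" $COLORS)
  echo "Have a nice $COLOR $FRUIT"
}

for i in 0 1 2; do
  simulate_pod "$i"
done
```

Run as is, this prints `Have a nice green apple`, `Have a nice yellow banana`, and `Have a nice red cherry`, one line per index.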
 | 
						|
 | 
						|
Notes:
 | 
						|
 | 
						|
- `--per-completion-env=` is of the form `KEY=VALUES`, where `VALUES` is either a
quoted, space-separated list or `@` followed by the name of a text file
containing a list.
- `--per-completion-env=` can be specified several times, but all lists must
have the same length.
- `--completions=N` with `N` equal to the list length is implied.
- The flag `--completions=3` sets `job.spec.completions=3`.
- The flag `--completion-index-var-name=I` causes an env var named `I` to be
created in each pod, containing the index.
- The flag `--restart=OnFailure` is implied by `--completions` or any
job-specific arguments. The user can also specify `--restart=Never` if they
desire, but may not specify `--restart=Always` with job-related flags.
- Setting any of these flags in turn tells kubectl to create a Job, not a
replicationController.
 | 
						|
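The rules in these notes for `--per-completion-env` could be checked roughly as follows. This is a minimal sketch, not kubectl's implementation: `expand_per_completion_env` and `implied_completions` are hypothetical names, and values are assumed to contain no glob characters or embedded quotes.

```sh
#!/bin/sh
# Hypothetical helper: turn one KEY=VALUES argument into "KEY value value ...".
# VALUES is a space-separated list, or "@file" naming a file holding the list.
expand_per_completion_env() {
  key=${1%%=*}
  values=${1#*=}
  case $values in
    @*) values=$(cat "${values#@}") || return 1 ;;
  esac
  # Unquoted on purpose: word splitting normalizes spaces and newlines.
  echo "$key" $values
}

# Print the implied --completions value, or fail if list lengths differ.
implied_completions() {
  expected=""
  for arg in "$@"; do
    expanded=$(expand_per_completion_env "$arg") || return 1
    # Number of values = number of words minus the key.
    count=$(( $(echo "$expanded" | wc -w) - 1 ))
    if [ -z "$expected" ]; then
      expected=$count
    elif [ "$count" -ne "$expected" ]; then
      echo "per-completion-env lists differ in length" >&2
      return 1
    fi
  done
  echo "$expected"
}

implied_completions 'FRUIT=apple banana cherry' 'COLOR=green yellow red'
```

For the two lists from the `say-fruit` example this prints `3`, the completion count the notes say is implied.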
 | 
						|
#### How Kubectl Creates Job Specs
 | 
						|
 | 
						|
To pass in the parameters, kubectl will generate a shell script which
can:
- parse the index from the annotation.
- hold all the parameter lists.
- look up the correct index in each parameter list and set an env var.
 | 
						|
 | 
						|
For example, consider this command:
 | 
						|
 | 
						|
```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="apple banana cherry" \
   --per-completion-env=COLOR="green yellow red" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```
 | 
						|
 | 
						|
First, kubectl generates the PodSpec as it normally does for `kubectl run`.
 | 
						|
 | 
						|
But, then it will generate this script:
 | 
						|
 | 
						|
```sh
#!/bin/sh
# Generated by kubectl run ...
# Check for needed commands
if ! type cat >/dev/null 2>&1
then
  echo "$0: Image does not include required command: cat"
  exit 2
fi
if ! type grep >/dev/null 2>&1
then
  echo "$0: Image does not include required command: grep"
  exit 2
fi
# Check that annotations are mounted from downward API
if [ ! -e /etc/annotations ]
then
  echo "$0: Cannot find /etc/annotations"
  exit 2
fi
# Get our index from annotations file
I=$(cat /etc/annotations | grep job.kubernetes.io/index | cut -f 2 -d '"')
if [ -z "$I" ]
then
  echo "$0: failed to extract index"
  exit 2
fi
export I

# Our parameter lists are stored inline in this script.
FRUIT_0="apple"
FRUIT_1="banana"
FRUIT_2="cherry"
# Extract the right parameter value based on our index.
# This works on any Bourne-based shell.
FRUIT=$(eval echo \$"FRUIT_$I")
export FRUIT

COLOR_0="green"
COLOR_1="yellow"
COLOR_2="red"

COLOR=$(eval echo \$"COLOR_$I")
export COLOR
```
 | 
						|
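The per-parameter part of such a script is regular enough to be generated mechanically. The following is a sketch of one way kubectl might emit it, not the actual generator: `emit_param_block` is a hypothetical name, and values are assumed to contain no double quotes, backslashes, or `$`, which a real generator would have to escape.

```sh
#!/bin/sh
# Hypothetical generator: emit_param_block KEY VALUE... prints the shell lines
# that define KEY_0, KEY_1, ... and then select the entry for index $I.
emit_param_block() {
  key=$1
  shift
  n=0
  for value in "$@"; do
    printf '%s_%d="%s"\n' "$key" "$n" "$value"
    n=$((n + 1))
  done
  # The emitted line performs the indirect lookup at run time, inside the pod.
  printf '%s=$(eval echo \\$"%s_$I")\n' "$key" "$key"
  printf 'export %s\n' "$key"
}

emit_param_block FRUIT apple banana cherry
emit_param_block COLOR green yellow red
```

Each call prints one block of the form shown in the script above: the numbered assignments, the `eval` lookup keyed on `$I`, and the `export`.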
 | 
						|
Then it POSTs this script, encoded, inside a ConfigData.
It attaches this volume to the PodSpec.

Then it edits the command line of the Pod to run this script before the rest of
the command line.

Then it appends a DownwardAPI volume to the pod spec to get the annotations in a file.
It also appends the Secret (later configData) volume with the script in it.
 | 
						|
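A DownwardAPI volume writes annotations to the file one per line, in `key="value"` form, which is why the index can be recovered with `grep` and `cut`. The sketch below illustrates that extraction against a temporary stand-in file. The annotation key `job.kubernetes.io/index` is the one used in the generated script, while `example.com/owner` and the helper `extract_index` are made up for illustration.

```sh
#!/bin/sh
# Stand-in for the downward API file; in a pod this would be /etc/annotations.
ANNOTATIONS_FILE=$(mktemp)
cat > "$ANNOTATIONS_FILE" <<'EOF'
example.com/owner="batch-team"
job.kubernetes.io/index="2"
EOF

# extract_index FILE prints the value of the index annotation, or fails
# if the annotation is not present in FILE.
extract_index() {
  line=$(grep '^job\.kubernetes\.io/index=' "$1") || return 1
  # The value is the text between the first pair of double quotes.
  echo "$line" | cut -f 2 -d '"'
}

I=$(extract_index "$ANNOTATIONS_FILE") || {
  echo "$0: failed to extract index" >&2
  exit 2
}
export I
echo "index is $I"
rm -f "$ANNOTATIONS_FILE"
```

With the sample file above, this prints `index is 2`. Checking the `grep` result separately means a missing annotation is reported instead of silently yielding an empty index.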
 | 
						|
So, the Pod template that kubectl creates (inside the job template) looks like this:
 | 
						|
 | 
						|
```
apiVersion: extensions/v1beta1
kind: Job
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: c
          image: gcr.io/google_containers/busybox
          command:
            - 'sh'
            - '-c'
            - '/etc/job-params.sh; echo "this is the rest of the command"'
          volumeMounts:
            - name: annotations
              mountPath: /etc
            - name: script
              mountPath: /etc
      volumes:
        - name: annotations
          downwardAPI:
            items:
              - path: "annotations"
                fieldRef:
                  fieldPath: metadata.annotations
        - name: script
          secret:
            secretName: jobparams-abc123
```
 | 
						|
 | 
						|
###### Alternatives
 | 
						|
 | 
						|
Kubectl could append a `valueFrom` line like this to
get the index into the environment:
 | 
						|
 | 
						|
```yaml
apiVersion: extensions/v1beta1
kind: Job
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: foo
        ...
        env:
          # following block added:
          - name: I
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations."kubernetes.io/job-idx"
```
 | 
						|
 | 
						|
However, in order to inject other env vars from a parameter list,
kubectl still needs to edit the command line.

Parameter lists could be passed via a configData volume instead of a secret.
Kubectl can be changed to work that way once the configData implementation is
complete.
 | 
						|
 | 
						|
Parameter lists could be passed inside an EnvVar. This would have length
limitations and would pollute the output of `kubectl describe pods` and
`kubectl get pods -o json`.

Parameter lists could be passed inside an annotation. This would have length
limitations and would pollute the output of `kubectl describe pods` and
`kubectl get pods -o json`. Also, currently annotations can only be extracted
into a single file. Complex logic is then needed to filter out exactly the
desired annotation data.
 | 
						|
 | 
						|
Bash array variables could simplify extraction of a particular parameter from a
list of parameters. However, some popular base images do not include
`/bin/bash`. For example, `busybox` uses a compact `/bin/sh` implementation
that does not support array syntax.
 | 
						|
 | 
						|
Kubelet does support [expanding variables without a
shell](http://kubernetes.io/v1.1/docs/design/expansion.html). But it does not
allow for recursive substitution, which is required to extract the correct
parameter from a list based on the completion index of the pod. The syntax
could be extended, but doing so seems complex and would be an unfamiliar syntax
for users.
 | 
						|
 | 
						|
Putting all the command line editing into a script and running that causes
the least pollution to the original command line, and it allows
for complex error handling.

Kubectl could store the script in an
[Inline Volume](https://github.com/kubernetes/kubernetes/issues/13610) if that
proposal is approved. That would remove the need to manage the lifetime of the
configData/secret, and prevent the case where someone changes the
configData mid-job and breaks things in a hard-to-debug way.
 | 
						|
 | 
						|
 | 
						|
## Interactions with other features
 | 
						|
 | 
						|
#### Supporting Work Queue Jobs too
 | 
						|
 | 
						|
For Work Queue Jobs, `.spec.completions` has no meaning: parallelism should be
allowed to exceed it, and pods have no identity. So, the job controller should
not create a scoreboard in the JobStatus, just a count. Therefore, we need to
add one of the following to JobSpec:
 | 
						|
 | 
						|
- allow unset `.spec.completions` to indicate no scoreboard, and no index for
tasks (identical tasks).
- allow `.spec.completions=-1` to indicate the same.
- add `.spec.indexed` to job to indicate need for scoreboard.
 | 
						|
 | 
						|
#### Interaction with vertical autoscaling
 | 
						|
 | 
						|
Since pods of the same job will not be created with different resources,
a vertical autoscaler will need to:
 | 
						|
 | 
						|
- if it has index-specific initial resource suggestions, suggest those at
admission time; it will need to understand indexes.
- mutate resource requests on already created pods based on usage trend or
previous container failures.
- modify the job template, affecting all indexes.
 | 
						|
 | 
						|
#### Comparison to PetSets
 | 
						|
 | 
						|
The *Index substitution-only* option corresponds roughly to PetSet Proposal 1b.
The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more
restrictive and thus less verbose.
 | 
						|
 | 
						|
It would be easier for users if Indexed Job and PetSet were similar where
possible. However, PetSet differs in several key respects:
 | 
						|
 | 
						|
- PetSet is for ones to tens of instances. Indexed job should work with tens of
thousands of instances.
- When you have few instances, you may want to give them pet names. When you
have many instances, integer indexes make more sense.
- When you have thousands of instances, storing the work-list in the JobSpec
is verbose. For PetSet, this is less of a problem.
- PetSets (apparently) need to differ in more fields than indexed Jobs.
 | 
						|
 | 
						|
This differs from PetSet in that PetSet uses names and not indexes. PetSet is
intended to support ones to tens of things.
 | 
						|
 | 
						|
 | 
						|