kubernetes/docs/design/indexed-job.md

<!-- BEGIN MUNGE: UNVERSIONED_WARNING -->

<!-- BEGIN STRIP_FOR_RELEASE -->

<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">
<img src="http://kubernetes.io/kubernetes/img/warning.png" alt="WARNING"
     width="25" height="25">

<h2>PLEASE NOTE: This document applies to the HEAD of the source tree</h2>

If you are using a released version of Kubernetes, you should
refer to the docs that go with that version.

<!-- TAG RELEASE_LINK, added by the munger automatically -->
<strong>
The latest release of this document can be found
[here](http://releases.k8s.io/release-1.3/docs/design/indexed-job.md).

Documentation for other releases can be found at
[releases.k8s.io](http://releases.k8s.io).
</strong>
--

<!-- END STRIP_FOR_RELEASE -->

<!-- END MUNGE: UNVERSIONED_WARNING -->

# Design: Indexed Feature of Job object


## Summary

This design extends kubernetes with user-friendly support for
running embarrassingly parallel jobs.

Here, *parallel* means on multiple nodes, which means multiple pods.
By *embarrassingly parallel*,  it is meant that the pods
have no dependencies between each other.  In particular, neither
ordering between pods nor gang scheduling are supported.

Users already have two other options for running embarrassingly parallel
Jobs (described in the next section), but both have ease-of-use issues.

Therefore, this document proposes extending the Job resource type to support
a third way to run embarrassingly parallel programs, with a focus on
ease of use.

This new style of Job is called an *indexed job*, because each Pod of the Job
is specialized to work on a particular *index* from a fixed length array of work
items.

## Background

The Kubernetes [Job](../../docs/user-guide/jobs.md) already supports
the embarrassingly parallel use case through *workqueue jobs*.
While [workqueue jobs](../../docs/user-guide/jobs.md#job-patterns) are very
flexible, they can be difficult to use. They: (1) typically require running a
message queue or other database service, (2) typically require modifications
to existing binaries and images and (3) subtle race conditions are easy to
 overlook.

Users also have another option for parallel jobs: creating [multiple Job objects
from a template](hdocs/design/indexed-job.md#job-patterns). For small numbers of
Jobs, this is a fine choice. Labels make it easy to view and delete multiple Job
objects at once. But, that approach also has its drawbacks: (1) for large levels
of parallelism (hundreds or thousands of pods) this approach means that listing
all jobs presents too much information, (2) users want a single source of
information about the success or failure of what the user views as a single
logical process.

Indexed job fills provides a third option with better ease-of-use for common
use cases.

## Requirements

### User Requirements

- Users want an easy way to run a Pod to completion *for each* item within a
[work list](#example-use-cases).

- Users want to run these pods in parallel for speed, but to vary the level of
parallelism as needed, independent of the number of work items.

- Users want to do this without requiring changes to existing images,
or source-to-image pipelines.

- Users want a single object that encompasses the lifetime of the parallel
program. Deleting it should delete all dependent objects. It should report the
status of the overall process. Users should be able to wait for it to complete,
and can refer to it from other resource types, such as
[ScheduledJob](https://github.com/kubernetes/kubernetes/pull/11980).


### Example Use Cases

Here are several examples of *work lists*: lists of command lines that the user
wants to run, each line its own Pod. (Note that in practice, a work list may not
ever be written out in this form, but it exists in the mind of the Job creator,
and it is a useful way to talk about the intent of the user when discussing
alternatives for specifying Indexed Jobs).

Note that we will not have the user express their requirements in work list
form; it is just a format for presenting use cases. Subsequent discussion will
reference these work lists.

#### Work List 1

Process several files with the same program:

```
/usr/local/bin/process_file 12342.dat
/usr/local/bin/process_file 97283.dat
/usr/local/bin/process_file 38732.dat
```

#### Work List 2

Process a matrix (or image, etc) in rectangular blocks:

```
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
```

#### Work List 3

Build a program at several different git commits:

```
HASH=3cab5cb4a git checkout $HASH && make clean && make VERSION=$HASH
HASH=fe97ef90b git checkout $HASH && make clean && make VERSION=$HASH
HASH=a8b5e34c5 git checkout $HASH && make clean && make VERSION=$HASH
```

#### Work List 4

Render several frames of a movie:

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 1
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 2
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f 3
```

#### Work List 5

Render several blocks of frames (Render blocks to avoid Pod startup overhead for
every frame):

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 1 --frame-end 100
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 101 --frame-end 200
./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start 201 --frame-end 300
```

## Design Discussion

### Converting Work Lists into Indexed Jobs.

Given a work list, like in the [work list examples](#work-list-examples),
the information from the work list needs to get into each Pod of the Job.

Users will typically not want to create a new image for each job they
run. They will want to use existing images. So, the image is not the place
for the work list.

A work list can be stored on networked storage, and mounted by pods of the job.
Also, as a shortcut, for small worklists, it can be included in an annotation on
the Job object, which is then exposed as a volume in the pod via the downward
API.

### What Varies Between Pods of a Job

Pods need to differ in some way to do something different. (They do not differ
in the work-queue style of Job, but that style has ease-of-use issues).

A general approach would be to allow pods to differ from each other in arbitrary
ways. For example, the Job object could have a list of PodSpecs to run.
However, this is so general that it provides little value. It would:

- make the Job Spec very verbose, especially for jobs with thousands of work
items
- Job becomes such a vague concept that it is hard to explain to users
- in practice, we do not see cases where many pods which differ across many
fields of their specs, and need to run as a group, with no ordering constraints.
- CLIs and UIs need to support more options for creating Job
- it is useful for monitoring and accounting databases want to aggregate data
for pods with the same controller. However, pods with very different Specs may
not make sense to aggregate.
- profiling, debugging, accounting, auditing and monitoring tools cannot assume
common images/files, behaviors, provenance and so on between Pods of a Job.

Also, variety has another cost. Pods which differ in ways that affect scheduling
(node constraints, resource requirements, labels) prevent the scheduler from
treating them as fungible, which is an important optimization for the scheduler.

Therefore, we will not allow Pods from the same Job to differ arbitrarily
(anyway, users can use multiple Job objects for that case).  We will try to
allow as little as possible to differ between pods of the same Job, while still
allowing users to express common parallel patterns easily. For users who need to
run jobs which differ in other ways, they can create multiple Jobs, and manage
them as a group using labels.

From the above work lists, we see a need for Pods which differ in their command
lines, and in their environment variables.  These work lists do not require the
pods to differ in other ways.

Experience in [similar systems](http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43438.pdf)
has shown this model to be applicable to a very broad range of problems, despite
this restriction.

Therefore we to allow pods in the same Job to differ **only** in the following
 aspects:
- command line
- environment variables

### Composition of existing images

The docker image that is used in a job may not be maintained by the person
running the job.  Over time, the Dockerfile may change the ENTRYPOINT or CMD.
If we require people to specify the complete command line to use Indexed Job,
then they will not automatically pick up changes in the default
command or args.

This needs more thought.

### Running Ad-Hoc Jobs using kubectl

A user should be able to easily start an Indexed Job using `kubectl`. For
example to run [work list 1](#work-list-1), a user should be able to type
something simple like:

```
kubectl run process-files --image=myfileprocessor \
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
   --restart=OnFailure  \
   -- \
   /usr/local/bin/process_file '$F'
```

In the above example:

- `--restart=OnFailure` implies creating a job instead of replicationController.
- Each pods command line is `/usr/local/bin/process_file $F`.
- `--per-completion-env=` implies the jobs `.spec.completions` is set to the
length of the argument array (3 in the example).
- `--per-completion-env=F=<values>` causes env var with `F` to be available in
the environment when the command line is evaluated.

How exactly this happens is discussed later in the doc: this is a sketch of the
user experience.

In practice, the list of files might be much longer and stored in a file on the
users local host, like:

```
$ cat files-to-process.txt
12342.dat
97283.dat
38732.dat
...
```

So, the user could specify instead: `--per-completion-env=F="$(cat files-to-process.txt)"`.

However, `kubectl` should also support a format like:
 `--per-completion-env=F=@files-to-process.txt`.
That allows `kubectl` to parse the file, point out any syntax errors, and would
not run up against command line length limits (2MB is common, as low as 4kB is
POSIX compliant).

One case we do not try to handle is where the file of work is stored on a cloud
filesystem, and not accessible from the users local host.  Then we cannot easily
use indexed job, because we do not know the number of completions.  The user
needs to copy the file locally first or use the Work-Queue style of Job (already
supported).

Another case we do not try to handle is where the input file does not exist yet
because this Job is to be run at a future time, or depends on another job. The
workflow and scheduled job proposal need to consider this case. For that case,
you could use an indexed job which runs a program which shards the input file
(map-reduce-style).

#### Multiple parameters

The user may also have multiple parameters, like in [work list 2](#work-list-2).
One way is to just list all the command lines already expanded, one per line, in
a file, like this:

```
$ cat matrix-commandlines.txt
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 0 --end_col 15
/usr/local/bin/process_matrix_block -start_row 0 -end_row 15 -start_col 16 --end_col 31
/usr/local/bin/process_matrix_block -start_row 16 -end_row 31 -start_col 16 --end_col 31
```

and run the Job like this:

```
kubectl run process-matrix --image=my/matrix \
   --per-completion-env=COMMAND_LINE=@matrix-commandlines.txt \
   --restart=OnFailure  \
   -- \
   'eval "$COMMAND_LINE"'
```

However, this may have some subtleties with shell escaping.  Also, it depends on
the user knowing all the correct arguments to the docker image being used (more
on this later).

Instead, kubectl should support multiple instances of the `--per-completion-env`
flag. For example, to implement work list 2, a user could do:

```
kubectl run process-matrix --image=my/matrix \
   --per-completion-env=SR="0 16 0 16" \
   --per-completion-env=ER="15 31 15 31" \
   --per-completion-env=SC="0 0 16 16" \
   --per-completion-env=EC="15 15 31 31" \
   --restart=OnFailure  \
   -- \
   /usr/local/bin/process_matrix_block -start_row $SR -end_row $ER -start_col $ER --end_col $EC
```

### Composition With Workflows and ScheduledJob

A user should be able to create a job (Indexed or not) which runs at a specific
time(s). For example:

```
$ kubectl run process-files --image=myfileprocessor \
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
   --restart=OnFailure  \
   --runAt=2015-07-21T14:00:00Z
   -- \
   /usr/local/bin/process_file '$F'
created "scheduledJob/process-files-37dt3"
```

Kubectl should build the same JobSpec, and then put it into a ScheduledJob
(#11980) and create that.

For [workflow type jobs](../../docs/user-guide/jobs.md#job-patterns), creating a
complete workflow from a single command line would be messy, because of the need
to specify all the arguments multiple times.

For that use case, the user could create a workflow message by hand. Or the user
could create a job template, and then make a workflow from the templates,
perhaps like this:

```
$ kubectl run process-files --image=myfileprocessor \
   --per-completion-env=F="12342.dat 97283.dat 38732.dat" \
   --restart=OnFailure  \
   --asTemplate \
   -- \
   /usr/local/bin/process_file '$F'
created "jobTemplate/process-files"
$ kubectl run merge-files --image=mymerger \
   --restart=OnFailure  \
   --asTemplate \
   -- \
   /usr/local/bin/mergefiles 12342.out 97283.out 38732.out \
created "jobTemplate/merge-files"
$ kubectl create-workflow process-and-merge \
   --job=jobTemplate/process-files
   --job=jobTemplate/merge-files
   --dependency=process-files:merge-files
created "workflow/process-and-merge"
```

### Completion Indexes

A JobSpec specifies the number of times a pod needs to complete successfully,
through the `job.Spec.Completions` field. The number of completions will be
equal to the number of work items in the work list.

Each pod that the job controller creates is intended to complete one work item
from the work list. Since a pod may fail, several pods may, serially, attempt to
complete the same index. Therefore, we call it a *completion index* (or just
*index*), but not a *pod index*.

For each completion index, in the range 1 to `.job.Spec.Completions`, the job
controller will create a pod with that index, and keep creating them on failure,
until each index is completed.

An dense integer index, rather than a sparse string index (e.g. using just
`metadata.generate-name`) makes it easy to use the index to lookup parameters
in, for example, an array in shared storage.

### Pod Identity and Template Substitution in Job Controller

The JobSpec contains a single pod template.  When the job controller creates a
particular pod, it copies the pod template and modifies it in some way to make
that pod distinctive. Whatever is distinctive about that pod is its *identity*.

We consider several options.

#### Index Substitution Only

The job controller substitutes only the *completion index* of the pod into the
pod template when creating it.  The JSON it POSTs differs only in a single
fields.

We would put the completion index as a stringified integer, into an annotation
of the pod. The user can extract it from the annotation into an env var via the
downward API, or put it in a file via a Downward API volume, and parse it
himself.

Once it is an environment variable in the pod (say `$INDEX`), then one of two
things can happen.

First, the main program can know how to map from an integer index to what it
needs to do. For example, from Work List 4 above:

```
./blender /vol1/mymodel.blend -o /vol2/frame_#### -f $INDEX
```

Second, a shell script can be prepended to the original command line which maps
the index to one or more string parameters. For example, to implement Work List
5 above, you could do:

```
/vol0/setupenv.sh && ./blender /vol1/mymodel.blend -o /vol2/frame_#### --frame-start $START_FRAME --frame-end $END_FRAME
```

In the above example, `/vol0/setupenv.sh` is a shell script that reads `$INDEX`
and exports `$START_FRAME` and `$END_FRAME`.

The shell could be part of the image, but more usefully, it could be generated
by a program and stuffed in an annotation or a configMap, and from there added
to a volume.

The first approach may require the user to modify an existing image (see next
section) to be able to accept an `$INDEX` env var or argument. The second
approach requires that the image have a shell. We think that together these two
options cover a wide range of use cases (though not all).

#### Multiple Substitution

In this option, the JobSpec is extended to include a list of values to
substitute, and which fields to substitute them into. For example, a worklist
like this:

```
FRUIT_COLOR=green process-fruit -a -b -c -f apple.txt --remove-seeds
FRUIT_COLOR=yellow process-fruit -a -b -c -f banana.txt
FRUIT_COLOR=red process-fruit -a -b -c -f cherry.txt --remove-pit
```

Can be broken down into a template like this, with three parameters:

```
<custom env var 1>; process-fruit -a -b -c <custom arg 1> <custom arg 1>
```

and a list of parameter tuples, like this:

```
("FRUIT_COLOR=green", "-f apple.txt", "--remove-seeds")
("FRUIT_COLOR=yellow", "-f banana.txt", "")
("FRUIT_COLOR=red", "-f cherry.txt", "--remove-pit")
```

The JobSpec can be extended to hold a list of parameter tuples (which are more
easily expressed as a list of lists of individual parameters). For example:

```
apiVersion: extensions/v1beta1
kind: Job
...
spec:
  completions: 3
  ...
  template:
    ...
  perCompletionArgs:
    container: 0
      -
        - "-f apple.txt"
        - "-f banana.txt"
        - "-f cherry.txt"
      -
        - "--remove-seeds"
        - ""
        - "--remove-pit"
  perCompletionEnvVars:
    - name: "FRUIT_COLOR"
      - "green"
      - "yellow"
      - "red"
```

However, just providing custom env vars, and not arguments, is sufficient for
many use cases: parameter can be put into env vars, and then substituted on the
command line.

#### Comparison

The multiple substitution approach:

- keeps the *per completion parameters* in the JobSpec.
- Drawback: makes the job spec large for job with thousands of completions. (But
for very large jobs, the work-queue style or another type of controller, such as
map-reduce or spark, may be a better fit.)
- Drawback: is a form of server-side templating, which we want in Kubernetes but
have not fully designed (see the [PetSets proposal](https://github.com/kubernetes/kubernetes/pull/18016/files?short_path=61f4179#diff-61f41798f4bced6e42e45731c1494cee)).

The index-only approach:

- Requires that the user keep the *per completion parameters* in a separate
storage, such as a configData or networked storage.
- Makes no changes to the JobSpec.
- Drawback: while in separate storage, they could be mutated, which would have
unexpected effects.
- Drawback: Logic for using index to lookup parameters needs to be in the Pod.
- Drawback: CLIs and UIs are limited to using the "index" as the identity of a
pod from a job. They cannot easily say, for example `repeated failures on the
pod processing banana.txt`.

Index-only approach relies on at least one of the following being true:

1. Image containing a shell and certain shell commands (not all images have
this).
1. Use directly consumes the index from annotations (file or env var) and
expands to specific behavior in the main program.

Also Using the index-only approach from non-kubectl clients requires that they
mimic the script-generation step, or only use the second style.

#### Decision

It is decided to implement the Index-only approach now. Once the server-side
templating design is complete for Kubernetes, and we have feedback from users,
we can consider if Multiple Substitution.

## Detailed Design

#### Job Resource Schema Changes

No changes are made to the JobSpec.


The JobStatus is also not changed. The user can gauge the progress of the job by
the `.status.succeeded` count.


#### Job Spec Compatilibity

A job spec written before this change will work exactly the same as before with
the new controller. The Pods it creates will have the same environment as
before. They will have a new annotation, but pod are expected to tolerate
unfamiliar annotations.

However, if the job controller version is reverted, to a version before this
change, the jobs whose pod specs depend on the new annotation will fail.
This is okay for a Beta resource.

#### Job Controller Changes

The Job controller will maintain for each Job a data structed which
indicates the status of each completion index. We call this the
*scoreboard* for short. It is an array of length `.spec.completions`.
Elements of the array are `enum` type with possible values including
`complete`, `running`, and `notStarted`.

The scoreboard is stored in Job Controller memory for efficiency. In either
case, the Status can be reconstructed from watching pods of the job (such as on
a controller manager restart). The index of the pods can be extracted from the
pod annotation.

When Job controller sees that the number of running pods is less than the
desired parallelism of the job, it finds the first index in the scoreboard with
value `notRunning`. It creates a pod with this creation index.

When it creates a pod with creation index `i`,  it makes a copy of the
`.spec.template`, and sets
`.spec.template.metadata.annotations.[kubernetes.io/job/completion-index]` to
`i`. It does this in both the index-only and multiple-substitutions options.

Then it creates the pod.

When the controller notices that a pod has completed or is running or failed,
it updates the scoreboard.

When all entries in the scoreboard are `complete`, then the job is complete.


#### Downward API Changes

The downward API is changed to support extracting specific key names into a
single environment variable. So, the following would be supported:

```
kind: Pod
version: v1
spec:
  containers:
  - name: foo
    env:
    - name: MY_INDEX
      valueFrom:
        fieldRef:
          fieldPath: metadata.annotations[kubernetes.io/job/completion-index]
```

This requires kubelet changes.

Users who fail to upgrade their kubelets at the same time as they upgrade their
controller manager will see a failure for pods to run when they are created by
the controller. The Kubelet will send an event about failure to create the pod.
The `kubectl describe job` will show many failed pods.


#### Kubectl Interface Changes

The `--completions` and `--completion-index-var-name` flags are added to
kubectl.

For example, this command:

```
kubectl run say-number --image=busybox \
   --completions=3 \
   --completion-index-var-name=I \
   -- \
   sh -c 'echo "My index is $I" && sleep 5'
```

will run 3 pods to completion, each printing one of the following lines:

```
My index is 1
My index is 2
My index is 0
```

Kubectl would create the following pod:


Kubectl will also support the `--per-completion-env` flag, as described
previously. For example, this command:

```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="apple banana cherry" \
   --per-completion-env=COLOR="green yellow red" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

or equivalently:

```
echo "apple banana cherry" > fruits.txt
echo "green yellow red" > colors.txt

kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="$(cat fruits.txt)" \
   --per-completion-env=COLOR="$(cat fruits.txt)" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

or similarly:

```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT=@fruits.txt \
   --per-completion-env=COLOR=@fruits.txt \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

will all run 3 pods in parallel. Index 0 pod will log:

```
Have a nice grenn apple
```

and so on.


Notes:

- `--per-completion-env=` is of form `KEY=VALUES` where `VALUES` is either a
quoted space separated list or `@` and the name of a text file containing a
list.
- `--per-completion-env=` can be specified several times, but all must have the
same length list.
- `--completions=N` with `N` equal to list length is implied.
- The flag `--completions=3` sets `job.spec.completions=3`.
- The flag `--completion-index-var-name=I` causes an env var to be created named
I in each pod, with the index in it.
- The flag `--restart=OnFailure` is implied by `--completions` or any
job-specific arguments. The user can also specify `--restart=Never` if they
desire but may not specify `--restart=Always` with job-related flags.
- Setting any of these flags in turn tells kubectl to create a Job, not a
replicationController.

#### How Kubectl Creates Job Specs.

To pass in the parameters, kubectl will generate a shell script which
can:
- parse the index from the annotation
- hold all the parameter lists.
- lookup the correct index in each parameter list and set an env var.

For example, consider this command:

```
kubectl run say-fruit --image=busybox \
   --per-completion-env=FRUIT="apple banana cherry" \
   --per-completion-env=COLOR="green yellow red" \
   -- \
   sh -c 'echo "Have a nice $COLOR $FRUIT" && sleep 5'
```

First, kubectl generates the PodSpec as it normally does for `kubectl run`.

But, then it will generate this script:

```sh
#!/bin/sh
# Generated by kubectl run ...
# Check for needed commands
if [[ ! type cat ]]
then
  echo "$0: Image does not include required command: cat"
  exit 2
fi
if [[ ! type grep ]]
then
  echo "$0: Image does not include required command: grep"
  exit 2
fi
# Check that annotations are mounted from downward API
if [[ ! -e /etc/annotations ]]
then
  echo "$0: Cannot find /etc/annotations"
  exit 2
fi
# Get our index from annotations file
I=$(cat /etc/annotations | grep job.kubernetes.io/index | cut -f 2 -d '\"') || echo "$0: failed to extract index"
export I

# Our parameter lists are stored inline in this script.
FRUIT_0="apple"
FRUIT_1="banana"
FRUIT_2="cherry"
# Extract the right parameter value based on our index.
# This works on any Bourne-based shell.
FRUIT=$(eval echo \$"FRUIT_$I")
export FRUIT

COLOR_0="green"
COLOR_1="yellow"
COLOR_2="red"

COLOR=$(eval echo \$"FRUIT_$I")
export COLOR
```

Then it POSTs this script, encoded, inside a ConfigData.
It attaches this volume to the PodSpec.

Then it will edit the command line of the Pod to run this script before the rest of
the command line.

Then it appends a DownwardAPI volume to the pod spec to get the annotations in a file, like this:
It also appends the Secret (later configData) volume with the script in it.

So, the Pod template that kubectl creates (inside the job template) looks like this:

```
apiVersion: v1
kind: Job
...
spec:
  ...
  template:
    ...
    spec:
      containers:
        - name: c
          image: gcr.io/google_containers/busybox
          command:
            - 'sh'
            - '-c'
            - '/etc/job-params.sh; echo "this is the rest of the command"'
          volumeMounts:
            - name: annotations
              mountPath: /etc
            - name: script
              mountPath: /etc
      volumes:
        - name: annotations
          downwardAPI:
            items:
              - path: "annotations"
                ieldRef:
                  fieldPath: metadata.annotations
        - name: script
          secret:
            secretName: jobparams-abc123
```

###### Alternatives

Kubectl could append a `valueFrom` line like this to
get the index into the environment:

```yaml
apiVersion: extensions/v1beta1
kind: Job
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: foo
        ...
        env:
 # following block added:
          - name: I
            valueFrom:
             fieldRef:
               fieldPath:  metadata.annotations."kubernetes.io/job-idx"
```

However, in order to inject other env vars from parameter list,
kubectl still needs to edit the command line.

Parameter lists could be passed via a configData volume instead of a secret.
Kubectl can be changed to work that way once the configData implementation is
complete.

Parameter lists could be passed inside an EnvVar.  This would have length
limitations, would pollute the output of `kubectl describe pods` and `kubectl
get pods -o json`.

Parameter lists could be passed inside an annotation.  This would have length
limitations, would pollute the output of `kubectl describe pods` and `kubectl
get pods -o json`.  Also, currently annotations can only be extracted into a
single file.  Complex logic is then needed to filter out exactly the desired
annotation data.

Bash array variables could simplify extraction of a particular parameter from a
list of parameters.  However, some popular base images do not include
`/bin/bash`.  For example, `busybox` uses a compact `/bin/sh` implementation
that does not support array syntax.

Kubelet does support [expanding varaibles without a
shell](http://kubernetes.io/v1.1/docs/design/expansion.html).  But it does not
allow for recursive substitution, which is required to extract the correct
parameter from a list based on the completion index of the pod.  The syntax
could be extended, but doing so seems complex and will be an unfamiliar syntax
for users.

Putting all the command line editing into a script and running that causes
the least pollution to the original command line, and it allows
for complex error handling.

Kubectl could store the script in an [Inline Volume](
https://github.com/kubernetes/kubernetes/issues/13610) if that proposal
is approved. That would remove the need to manage the lifetime of the
configData/secret, and prevent the case where someone changes the
configData mid-job, and breaks things in a hard-to-debug way.


## Interactions with other features

#### Supporting Work Queue Jobs too

For Work Queue Jobs, completions has no meaning. Parallelism should be allowed
to be greater than it, and pods have no identity. So, the job controller should
not create a scoreboard in the JobStatus, just a count.  Therefore, we need to
add one of the following to JobSpec:

- allow unset `.spec.completions` to indicate no scoreboard, and no index for
tasks (identical tasks).
- allow `.spec.completions=-1` to indicate the same.
- add `.spec.indexed` to job to indicate need for scoreboard.

#### Interaction with vertical autoscaling

Since pods of the same job will not be created with different resources,
a vertical autoscaler will need to:

- if it has index-specific initial resource suggestions, suggest those at
admission time; it will need to understand indexes.
- mutate resource requests on already created pods based on usage trend or
previous container failures.
- modify the job template, affecting all indexes.

#### Comparison to PetSets

The *Index substitution-only* option corresponds roughly to PetSet Proposal 1b.
The `perCompletionArgs` approach is similar to PetSet Proposal 1e, but more
restrictive and thus less verbose.

It would be easier for users if Indexed Job and PetSet are similar where
possible. However, PetSet differs in several key respects:

- PetSet is for ones to tens of instances.  Indexed job should work with tens of
thousands of instances.
- When you have few instances, you may want to given them pet names. When you
have many instances, you that many instances, integer indexes make more sense.
- When you have thousands of instances, storing the work-list in the JobSpec
is verbose.  For PetSet, this is less of a problem.
- PetSets (apparently) need to differ in more fields than indexed Jobs.

This differs from PetSet in that PetSet uses names and not indexes. PetSet is
intended to support ones to tens of things.


<!-- BEGIN MUNGE: GENERATED_ANALYTICS -->
[![Analytics](https://kubernetes-site.appspot.com/UA-36037335-10/GitHub/docs/design/indexed-job.md?pixel)]()
<!-- END MUNGE: GENERATED_ANALYTICS -->