Job lifecycle

Each job goes through three principal phases during its lifecycle. When this article refers to events, see the job events documentation for a more detailed description of the event payload. Each event is of type dhi.waterdata.jobs.events.jobevent.

State "Pending"

This is the first stage in each job's lifecycle. The job enters this phase when it is created, and the job service signals this by publishing an event with State = "Pending". The job stays in this state until the underlying cluster finds or allocates an appropriate computational resource (a VM node).

State "Running"

The job enters this state when its execution actually starts, and an event with State = "Running" and Code = 120 is fired. Execution begins with preprocessing handled by the platform / job service, such as downloading the specified inputs. The transition to the running state therefore does not necessarily mean that the user-provided container(s) are running; that happens only after preprocessing has finished, at which point an event with Code = 130 is triggered.

Once the user-provided container(s) finish running, an event with Code = 140 is fired. The job is then responsible for uploading any outputs back to the Platform (or potentially to other destinations).
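The Running-phase event codes above can be summarized as a simple lookup. This is a minimal illustration only: the code values 120, 130 and 140 come from this article, while the dictionary and function names are hypothetical, not part of the job service API.

```python
# Mapping of Running-phase event codes to the lifecycle step they signal.
# The codes (120, 130, 140) are taken from this article; names are illustrative.
RUNNING_PHASE_CODES = {
    120: "running: preprocessing started (inputs being downloaded)",
    130: "running: user-provided container(s) started",
    140: "running: container(s) finished, outputs being uploaded",
}

def describe_code(code: int) -> str:
    """Translate a Running-phase event code into a human-readable step."""
    return RUNNING_PHASE_CODES.get(code, f"unknown code {code}")
```

A consumer of the event stream could use such a table for logging or progress reporting without hard-coding the transitions in several places.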

State "Finished"

In happy-path scenarios this state is entered once the postprocessing (i.e. the upload of results described in the previous section) is complete, and an event with State = "Finished" and Code = 150 is fired. This is the terminal state of the job. The state may also be entered in various failure scenarios, such as when the job service or the cluster determines that the job will not be able to start at all. Cancelled running jobs likewise eventually enter the finished state, with the status message indicating that the job has been cancelled.
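Putting the three phases together, a consumer could fold the event stream into the job's latest state as sketched below. The State values and Code numbers follow this article; the event dicts are illustrative stand-ins for dhi.waterdata.jobs.events.jobevent payloads, and the tracker function is hypothetical.

```python
# Minimal tracker that folds a stream of job events into the last observed state.
TERMINAL_STATE = "Finished"

def final_state(events):
    """Return the last observed State, or None if no events were seen."""
    state = None
    for event in events:
        state = event["State"]
        if state == TERMINAL_STATE:
            break  # terminal state of the job; no further transitions occur
    return state

# A happy-path sequence as described in this article:
stream = [
    {"State": "Pending"},
    {"State": "Running", "Code": 120},   # execution starts, preprocessing
    {"State": "Running", "Code": 130},   # user container(s) running
    {"State": "Running", "Code": 140},   # containers done, uploading outputs
    {"State": "Finished", "Code": 150},  # postprocessing complete
]
```

Calling final_state(stream) on the happy-path sequence above yields "Finished".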


Note: It is important to point out that jobs should be idempotent. Any job can be restarted for a variety of reasons (one of its containers fails, the underlying node is being removed, etc.), and the user logic should be designed with this in mind.
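One common way to make job logic idempotent is to skip work whose output already exists, writing each result atomically so a restarted job resumes safely. This is an illustrative sketch only; the paths and function names are hypothetical and not part of the job service API.

```python
import os

def process_item(item: str, out_dir: str) -> str:
    """Compute (or skip) one work item; safe to re-run after a job restart."""
    out_path = os.path.join(out_dir, f"{item}.result")
    if os.path.exists(out_path):
        return out_path  # already produced by a previous attempt; skip
    tmp_path = out_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(f"result for {item}\n")  # placeholder for the real computation
    os.replace(tmp_path, out_path)  # atomic rename: never leaves partial output
    return out_path
```

Because the write goes through a temporary file and an atomic rename, a restart that interrupts the computation never leaves a half-written result behind, and a second run simply reuses the completed outputs.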