Job Executors may perform long-running tasks and/or rely on external processes, services, or products to execute jobs.
JEF has a number of mechanisms providing resilience against potential outage scenarios across the microservice architecture.
One of the JEF strategies is an active internal Watcher running on each job instance. The Watcher can be disabled through configuration. It works with both ExecutionStep-based and non-ExecutionStep-based jobs.
The Watchers are responsible for monitoring all running jobs (across all job executors) in order to detect the following scenarios:
`QUEUED` jobs may be considered “stalled” if they have not moved to a `RUNNING` state before the timeout deadline.
`QUEUED` jobs that have stalled can be taken by another job executor with an available slot.
This logic is handled by the JEF framework; no specific work is required by plugin developers.
`RUNNING` jobs may be considered “stalled” if they have not updated their execution step before the timeout deadline.
If a running job stalls, it is moved to a `TIMED_OUT` state.
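The stall checks described above can be illustrated with a minimal sketch. This is not the JEF implementation: the `Job` class, the `last_update` field, the `watch` function, and the returned action names are all hypothetical, and a single timeout value is assumed for both `QUEUED` and `RUNNING` jobs.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    QUEUED = "QUEUED"
    RUNNING = "RUNNING"
    TIMED_OUT = "TIMED_OUT"
    FAILED = "FAILED"


@dataclass
class Job:
    # Hypothetical model: last_update is the enqueue time for QUEUED jobs
    # and the time of the last execution-step update for RUNNING jobs.
    state: JobState
    last_update: float  # seconds


def is_stalled(job: Job, now: float, timeout: float) -> bool:
    """A QUEUED or RUNNING job is stalled once `timeout` seconds pass without an update."""
    watched = job.state in (JobState.QUEUED, JobState.RUNNING)
    return watched and now - job.last_update >= timeout


def watch(job: Job, now: float, timeout: float) -> str:
    """Return the (hypothetical) action a watcher would take for a single job."""
    if not is_stalled(job, now, timeout):
        return "none"
    if job.state is JobState.QUEUED:
        # A stalled QUEUED job stays QUEUED but may be taken by another executor.
        return "offer_to_other_executor"
    # A stalled RUNNING job is moved to TIMED_OUT.
    job.state = JobState.TIMED_OUT
    return "mark_timed_out"
```

In this sketch, a `RUNNING` job whose last update is older than the timeout ends up in `TIMED_OUT`, while a stalled `QUEUED` job keeps its state and is only flagged as available to other executors.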
Timed Out Jobs
`TIMED_OUT` jobs are moved back to a `RUNNING` state if the original job execution progresses before another Job Executor takes the job.
`TIMED_OUT` jobs may be considered definitely “stalled” if they still have not updated their progress after a second timeout deadline.
Jobs which remain `TIMED_OUT` after two deadlines can be taken by any other Job Executor instance which supports the defined action and has an available job execution slot.
If a `TIMED_OUT` job is picked up by another job executor, it is moved to a `RUNNING` state.
JEF Executors implement a basic back-pressure strategy: they stop moving new jobs into `RUNNING` when no execution slot is available.
Finally, `TIMED_OUT` jobs are moved to a `FAILED` state if no other Job Executor takes responsibility for them (e.g. because none has a job execution slot left).
If the original Job Executor owning the job execution is still running and attempts to progress the job after another Job Executor has taken it, the original Job Executor's attempt is treated as a fatal error and ignored.
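The `TIMED_OUT` transitions above can be summarised in a small sketch. It is an illustration under assumptions, not JEF code: the `Job` fields, the function names, and `StaleOwnerError` are hypothetical, states are plain strings for brevity, the check that the executor supports the job's action is omitted, and the moment at which the framework gives up on a job (`give_up`) is not specified by the text above.

```python
from dataclasses import dataclass


@dataclass
class Job:
    state: str                 # "RUNNING", "TIMED_OUT" or "FAILED"
    owner: str                 # id of the executor currently responsible for the job
    timed_out_at: float = 0.0  # when the job entered TIMED_OUT (seconds)


class StaleOwnerError(Exception):
    """Raised when an executor that no longer owns the job tries to progress it."""


def report_progress(job: Job, executor: str) -> None:
    """Progress from the current owner moves a TIMED_OUT job back to RUNNING."""
    if executor != job.owner:
        # The job was taken over: the attempt is a fatal error and changes nothing.
        raise StaleOwnerError(f"{executor} no longer owns this job")
    if job.state == "TIMED_OUT":
        job.state = "RUNNING"


def try_take_over(job: Job, executor: str, free_slots: int,
                  now: float, timeout: float) -> bool:
    """Another executor may take the job only after the second deadline, with a free slot."""
    second_deadline_passed = (
        job.state == "TIMED_OUT" and now - job.timed_out_at >= timeout
    )
    if not second_deadline_passed or free_slots <= 0:
        return False  # back-pressure: without a free slot, no new job is accepted
    job.owner, job.state = executor, "RUNNING"
    return True


def give_up(job: Job) -> None:
    """If no executor takes a TIMED_OUT job, it ends up FAILED (trigger is an assumption)."""
    if job.state == "TIMED_OUT":
        job.state = "FAILED"
```

Checking ownership inside `report_progress` is one simple way to model the last rule: once `try_take_over` has reassigned `owner`, any later call from the original executor raises instead of modifying the job.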