Job Executors may perform long-running tasks and/or use external processing services or products to execute jobs.
JEF has a number of mechanisms providing resilience against potential outage scenarios across the microservice architecture.
Watcher mechanism
One of the JEF strategies is an active internal Watcher running on each job instance. The Watcher can be disabled through configuration. It works with both ExecutionStep-based and non-ExecutionStep-based jobs.
The Watchers are responsible for monitoring all running jobs (across all job executors) in order to detect the following scenarios:
Queued Jobs
QUEUED jobs may be considered “stalled” if they have not moved to a RUNNING state before a timeout deadline.
QUEUED jobs that have stalled can be taken by another job executor with an available slot.
This logic is handled by the JEF framework; no specific work is required by plugin developers.
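The queued-job check above can be pictured with a minimal sketch. This is not JEF code: the `Job` and `Executor` classes, the `QUEUED_TIMEOUT` value and the function names are all hypothetical, and the sketch assumes the new owner starts the job immediately after taking it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical deadline; the real value is whatever JEF is configured with.
QUEUED_TIMEOUT = timedelta(minutes=5)


@dataclass
class Job:
    job_id: str
    state: str              # "QUEUED", "RUNNING", ...
    owner: Optional[str]    # name of the executor currently responsible
    queued_at: datetime


@dataclass
class Executor:
    name: str
    max_slots: int
    used_slots: int = 0

    def has_free_slot(self) -> bool:
        return self.used_slots < self.max_slots


def is_stalled_in_queue(job: Job, now: datetime) -> bool:
    """A QUEUED job is stalled once it has waited longer than the deadline."""
    return job.state == "QUEUED" and now - job.queued_at > QUEUED_TIMEOUT


def try_take_over(job: Job, executor: Executor, now: datetime) -> bool:
    """Let an executor claim a stalled QUEUED job, but only if it has a free slot."""
    if not is_stalled_in_queue(job, now) or not executor.has_free_slot():
        return False
    job.owner = executor.name
    job.state = "RUNNING"   # assumption: the new owner starts the job right away
    executor.used_slots += 1
    return True
```

An executor with no free slot simply gets `False` back and the job stays available for another executor.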
Running Jobs
RUNNING jobs may be considered as “stalled” if they haven’t updated the execution step before the timeout deadline.
If a running job stalls, it is moved to a TIMED_OUT state.
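As an illustration only, the running-job check could look like the sketch below. The `RUNNING_TIMEOUT` value and the `check_running_job` function are hypothetical names, and the sketch assumes the watcher can read the time of the last execution step update.

```python
from datetime import datetime, timedelta

# Hypothetical deadline; the real value is whatever JEF is configured with.
RUNNING_TIMEOUT = timedelta(minutes=10)


def check_running_job(state: str, last_step_update: datetime, now: datetime) -> str:
    """Return the state a watcher pass would leave the job in."""
    if state == "RUNNING" and now - last_step_update > RUNNING_TIMEOUT:
        # No execution step update before the deadline: the job has stalled.
        return "TIMED_OUT"
    return state
```

Jobs in any other state pass through unchanged, so the same check can be applied to every job the watcher sees.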
Timed Out Jobs
TIMED_OUT jobs are moved back to a RUNNING state if the original job execution progresses before another Job Executor takes the job.
TIMED_OUT jobs may be considered definitively “stalled” if they still haven’t reported progress after a second timeout deadline.
Jobs which remain TIMED_OUT after 2 deadlines can be taken by any other Job Executor instance which supports the defined action and has an available job execution slot.
If a TIMED_OUT job is picked up by another job executor, it is moved to a RUNNING state.
JEF Executors implement a basic back-pressure strategy: they stop moving new jobs into RUNNING when no execution slot is available.
Finally, TIMED_OUT jobs are moved to a FAILED state if no other Job Executor takes responsibility for them (e.g. because none has a free job execution slot).
If the original Job Executor that owned the job execution is still running and attempts to progress the job after another Job Executor has taken it, that attempt is treated as a fatal error and is ignored.
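The TIMED_OUT transitions described in this section can be summarised in a small sketch. Everything here is hypothetical and is not the JEF API: the `Job` class, the `TIMEOUT` value, `StaleOwnerError` and the function names are illustrative. The sketch assumes the second deadline is measured from the moment the job entered TIMED_OUT, and it leaves the decision of when to give up and fail the job to the caller, since that point is not specified here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical length of the second deadline.
TIMEOUT = timedelta(minutes=10)


class StaleOwnerError(Exception):
    """An executor that no longer owns the job tried to progress it."""


@dataclass
class Job:
    job_id: str
    state: str
    owner: str
    timed_out_at: Optional[datetime] = None


def report_progress(job: Job, executor: str) -> None:
    """Progress from the owner revives a TIMED_OUT job; anyone else is rejected."""
    if executor != job.owner:
        # Fatal for the caller; the job itself is left untouched.
        raise StaleOwnerError(f"{executor} no longer owns {job.job_id}")
    if job.state == "TIMED_OUT":
        job.state = "RUNNING"
        job.timed_out_at = None


def second_deadline_passed(job: Job, now: datetime) -> bool:
    return (job.state == "TIMED_OUT" and job.timed_out_at is not None
            and now - job.timed_out_at > TIMEOUT)


def take_over(job: Job, executor: str, supports_action: bool,
              has_free_slot: bool, now: datetime) -> bool:
    """Another executor may claim the job only after the second deadline."""
    if not (second_deadline_passed(job, now) and supports_action and has_free_slot):
        return False
    job.owner = executor
    job.state = "RUNNING"
    job.timed_out_at = None
    return True


def fail_if_unclaimed(job: Job, now: datetime) -> None:
    """Called once no executor has taken the job: move it to FAILED."""
    if second_deadline_passed(job, now):
        job.state = "FAILED"
```

Raising an exception for the stale owner is one way to model "fatal error, and ignored": the original executor's attempt stops, and the job state written by the new owner is not changed.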