Designing a Distributed Job Scheduler (Cron Cluster / Task Orchestrator)

1) Problem Clarification

We need a system that runs scheduled or recurring jobs across multiple nodes and guarantees that:

- each scheduled run takes effect exactly once, even if it is delivered or attempted more than once
- failed runs are retried
- nodes coordinate safely, with no conflicting decisions about who schedules or runs a job
- execution capacity scales with the number of workers

2) Requirements

Functional

✔ Cron scheduling
✔ Distributed execution
✔ Idempotent job semantics
✔ Retry with visibility timeout
✔ Monitoring, pause, and stop controls

Non-functional

✔ High reliability
✔ Fault tolerance
✔ Extensible job types
✔ Predictable execution timing

3) High-Level Architecture

Scheduler Manager → Job Queue → Worker Executors → Result Store

Components:
- Scheduler Manager: decides when each job is due
- Job Queue: holds due runs until a worker takes them
- Worker Executors: agents that run the jobs
- Result Store: records the outcome of each run

Coordinator handles:
- leader election
- job dispatch

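4) Leader Election

Only one node, the leader, computes schedules, which avoids duplicate scheduling.

Techniques:

- ZooKeeper
- etcd
- Consul lock
- Redis Redlock

If the leader fails, another node takes over automatically. Lock- and lease-based election can briefly leave two nodes believing they are the leader (for example after a long pause or a network partition), so election makes duplicate scheduling unlikely rather than impossible, and job execution still has to be idempotent.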
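A lease with a time-to-live is one common way to get the automatic takeover described above. The following is a minimal sketch, assuming an in-memory stand-in for the coordination service; `LeaseStore`, `try_acquire`, and the fencing token are hypothetical names for illustration and do not match the actual ZooKeeper, etcd, Consul, or Redis APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float
    token: int  # fencing token, increases on every change of leadership


class LeaseStore:
    """In-memory stand-in for a coordination service (hypothetical API)."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None
        self._next_token = 1

    def try_acquire(self, node_id: str, now: float, ttl: float) -> Optional[int]:
        """Acquire or renew the lease; return the fencing token, or None if another node holds it."""
        lease = self._lease
        if lease is not None and lease.expires_at > now:
            if lease.holder != node_id:
                return None  # another node holds a live lease
            lease.expires_at = now + ttl  # renewal keeps the same token
            return lease.token
        # No lease yet, or it expired: this node takes over with a fresh token.
        token = self._next_token
        self._next_token += 1
        self._lease = Lease(node_id, now + ttl, token)
        return token


store = LeaseStore()
a_token = store.try_acquire("node-a", now=0.0, ttl=10.0)   # node-a becomes leader
b_denied = store.try_acquire("node-b", now=5.0, ttl=10.0)  # lease is still live
a_renew = store.try_acquire("node-a", now=8.0, ttl=10.0)   # renewal, lease now expires at 18.0
b_token = store.try_acquire("node-b", now=20.0, ttl=10.0)  # node-a stopped renewing, node-b takes over
```

The increasing token lets downstream components reject writes from a stale leader that has not yet noticed it lost the lease.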

5) Job Dispatching

When a job is due, the leader pushes a job message to a queue:

- Kafka
- RabbitMQ
- SQS

Workers consume from the queue and execute the jobs.
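The dispatch step can be sketched as a loop the leader runs on every tick. This is an illustration under simplifying assumptions: jobs use a fixed interval rather than a full cron expression, the queue is an in-memory `deque` instead of Kafka, RabbitMQ, or SQS, and `JobSpec`, `dispatch_due`, and the `run_key` format are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class JobSpec:
    job_id: str
    interval_s: int  # fixed interval (must be > 0), standing in for a cron expression
    next_run: int    # epoch seconds of the next scheduled run


def dispatch_due(jobs: list[JobSpec], queue: deque, now: int) -> int:
    """Enqueue one message per due run and advance each job's next_run."""
    sent = 0
    for job in jobs:
        while job.next_run <= now:
            # The run key identifies this specific scheduled run.
            queue.append({"job_id": job.job_id, "run_key": f"{job.job_id}@{job.next_run}"})
            job.next_run += job.interval_s
            sent += 1
    return sent


queue: deque = deque()
jobs = [
    JobSpec("report", interval_s=60, next_run=120),
    JobSpec("cleanup", interval_s=300, next_run=300),
]
sent_first = dispatch_due(jobs, queue, now=130)   # only "report" (run at 120) is due
sent_second = dispatch_due(jobs, queue, now=300)  # "report" at 180, 240, 300 and "cleanup" at 300
```

This sketch catches up on every missed run after a gap. Skipping missed runs or collapsing them into one is an equally valid policy and depends on the job.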

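6) Visibility Timeout

To keep two workers from processing the same job at the same time, and to recover jobs from crashed workers:

- A worker receives a job
- The queue marks the job invisible to other workers for X seconds
- If the worker finishes, it deletes (acknowledges) the job
- If the worker crashes, the timeout expires and the job becomes visible again so another node can process it

X must exceed the expected run time; otherwise a slow but healthy worker will see its job redelivered while it is still running. SQS provides visibility timeouts natively; RabbitMQ and Kafka approximate this behavior through acknowledgements and offset commits.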
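The steps above can be sketched with a small in-memory queue. This is a toy model, not the SQS API: `VisibilityQueue`, `receive`, and `ack` are hypothetical names, and the clock is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    msg_id: int
    body: str
    invisible_until: float = 0.0
    receive_count: int = 0


@dataclass
class VisibilityQueue:
    timeout_s: float
    _messages: dict = field(default_factory=dict)
    _next_id: int = 1

    def send(self, body: str) -> int:
        msg = Message(self._next_id, body)
        self._messages[msg.msg_id] = msg
        self._next_id += 1
        return msg.msg_id

    def receive(self, now: float) -> Optional[Message]:
        """Return the first visible message and hide it for timeout_s seconds."""
        for msg in self._messages.values():
            if msg.invisible_until <= now:
                msg.invisible_until = now + self.timeout_s
                msg.receive_count += 1
                return msg
        return None

    def ack(self, msg_id: int) -> None:
        """Delete a message once the worker has finished it."""
        self._messages.pop(msg_id, None)


q = VisibilityQueue(timeout_s=30.0)
q.send("nightly-report")
first = q.receive(now=0.0)    # worker A takes the job, then crashes without acking
hidden = q.receive(now=10.0)  # still invisible, so worker B gets nothing
again = q.receive(now=31.0)   # timeout expired, the job is delivered a second time
q.ack(again.msg_id)           # worker B finishes and deletes the job
empty = q.receive(now=100.0)
```

A receive count like the one tracked here is also what a queue would use to decide when a message has been retried too often.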

7) Idempotency in Job Execution

Use:

- a job run signature that identifies one specific run (for example, the job ID plus the scheduled time)
- a job result store

job_result:<job_id> = completed + output hash

Here <job_id> must identify a single run, not the recurring job definition. If a job is retried after it has already completed, return the stored result instead of re-running logic that may be destructive.
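A minimal sketch of the result-store check, assuming an in-memory dictionary in place of a durable store. `ResultStore`, `run_once`, and the run key format are hypothetical; the `job_result:` prefix follows the key shown above.

```python
import hashlib
from typing import Callable


class ResultStore:
    """In-memory stand-in for a durable result store (hypothetical)."""

    def __init__(self) -> None:
        self._results: dict[str, dict] = {}

    def run_once(self, run_key: str, job: Callable[[], str]) -> dict:
        """Run the job unless a completed result for run_key already exists."""
        key = f"job_result:{run_key}"
        cached = self._results.get(key)
        if cached is not None:
            return cached  # retry or duplicate delivery: return the earlier result
        output = job()
        record = {
            "status": "completed",
            "output_hash": hashlib.sha256(output.encode()).hexdigest(),
        }
        self._results[key] = record
        return record


calls = []


def send_invoices() -> str:
    calls.append(1)  # stands in for a side effect that must not repeat
    return "sent 42 invoices"


store = ResultStore()
first = store.run_once("invoices@2024-01-01T00:00", send_invoices)
retry = store.run_once("invoices@2024-01-01T00:00", send_invoices)
```

The check-then-run sequence here is not atomic. With several workers, a real store would need an atomic claim (such as a conditional write) before the job starts, so that two workers cannot both pass the check.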

8) Retry and Dead Letter Queue (DLQ)

If a job fails on a worker:

- retry with exponential backoff
- after the retry limit is reached, move the job to a DLQ for investigation
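As a rough illustration of the retry policy, the sketch below doubles the delay on each attempt up to a cap and parks the job in a DLQ once attempts run out. The function names, the base delay, the cap, and the attempt limit are assumptions chosen for the example. Random jitter, which is commonly added to backoff, is left out to keep the output deterministic.

```python
from typing import Callable


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Delay after failed attempt number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(cap_s, base_s * 2 ** (attempt - 1))


def run_with_retries(job: Callable[[], None], max_attempts: int, dlq: list, job_id: str) -> list:
    """Try the job up to max_attempts times; return the delays that would be waited."""
    delays = []
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return delays
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: park the job for manual investigation.
                dlq.append({"job_id": job_id, "attempts": attempt, "error": str(exc)})
            else:
                # A real worker would sleep or requeue the job with this delay.
                delays.append(backoff_delay(attempt))
    return delays


def always_fails() -> None:
    raise RuntimeError("upstream unavailable")


dlq: list = []
delays = run_with_retries(always_fails, max_attempts=4, dlq=dlq, job_id="sync-users")
```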

9) Job Types & Execution Model

Types:

- one-shot
- scheduled recurring
- dependent / chained

Execution:

- worker pool
- task isolation
- sandbox for security
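Dependent jobs on a worker pool can be sketched with the standard library's `graphlib.TopologicalSorter` (Python 3.9+) and a thread pool. The job names and the `run_job` function are hypothetical. A thread pool gives concurrency only, so the isolation and sandboxing listed above would need separate processes or containers.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical job names).
dependencies = {
    "extract": set(),
    "transform_a": {"extract"},
    "transform_b": {"extract"},
    "load": {"transform_a", "transform_b"},
}

completed = []


def run_job(name: str) -> str:
    completed.append(name)  # stands in for the real job body
    return name


sorter = TopologicalSorter(dependencies)
sorter.prepare()
with ThreadPoolExecutor(max_workers=4) as pool:
    while sorter.is_active():
        ready = sorter.get_ready()             # jobs whose dependencies are all done
        for name in pool.map(run_job, ready):  # run the ready batch on the pool
            sorter.done(name)                  # unblock jobs that depend on this one
```

The two transform jobs become ready together and may run in either order, while `load` starts only after both have finished.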

10) Monitoring & Coordination UI

Expose a UI to:

- pause a job
- retry a job
- inspect logs
- view live metrics
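Behind such a UI there is usually some control state that the scheduler consults before dispatching. The sketch below is one possible shape for it; `JobControl` and its methods are hypothetical names, and a real system would persist this state rather than keep it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class JobControl:
    """Hypothetical control-plane state that a UI would read and write."""
    paused: set = field(default_factory=set)
    retry_requests: list = field(default_factory=list)

    def pause(self, job_id: str) -> None:
        self.paused.add(job_id)

    def resume(self, job_id: str) -> None:
        self.paused.discard(job_id)

    def request_retry(self, job_id: str) -> None:
        self.retry_requests.append(job_id)

    def dispatchable(self, job_ids: list) -> list:
        """Filter out paused jobs before the scheduler enqueues anything."""
        return [j for j in job_ids if j not in self.paused]


control = JobControl()
control.pause("cleanup")
runnable = control.dispatchable(["report", "cleanup", "backup"])
control.resume("cleanup")
runnable_after = control.dispatchable(["report", "cleanup", "backup"])
control.request_retry("report")
```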

11) Failure Scenarios

- leader crash → re-election
- job duplication → resolved by idempotency
- job stuck → timeout + requeue
- worker overload → scale out the worker pool

12) Observability

Track:

- job latency
- failure rate
- retry count
- queue lag
- worker utilization
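Several of these metrics can be derived from per-run timestamps. The sketch below assumes a hypothetical `RunRecord` with enqueue, start, and finish times. It treats queue lag as the wait between enqueue and start, and job latency as the run duration, which is one reasonable reading of the terms above.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    enqueued_at: float
    started_at: float
    finished_at: float
    attempts: int
    succeeded: bool


def summarize(runs: list[RunRecord]) -> dict:
    """Aggregate basic scheduler metrics over a non-empty list of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "avg_latency_s": sum(r.finished_at - r.started_at for r in runs) / n,
        "avg_queue_lag_s": sum(r.started_at - r.enqueued_at for r in runs) / n,
        "failure_rate": sum(not r.succeeded for r in runs) / n,
        "retry_count": sum(r.attempts - 1 for r in runs),
    }


runs = [
    RunRecord(0.0, 2.0, 6.0, attempts=1, succeeded=True),
    RunRecord(10.0, 14.0, 16.0, attempts=3, succeeded=True),
    RunRecord(20.0, 23.0, 29.0, attempts=2, succeeded=False),
    RunRecord(30.0, 33.0, 35.0, attempts=1, succeeded=True),
]
summary = summarize(runs)
```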

13) Alternative Architecture: Orchestrated Scheduler (Airflow style)

Instead of workers independently pulling jobs from a queue, a central orchestration engine tracks task state and assigns tasks:

- DAG-based scheduling
- task retries
- state machine transitions

Tools in this style include Airflow, Prefect, and Temporal.
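The state machine idea can be sketched as a table of allowed transitions that the orchestrator checks before changing a task's state. The states and transitions below are an illustrative set chosen for this example, not the actual state model of Airflow, Prefect, or Temporal.

```python
from enum import Enum


class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    RETRYING = "retrying"
    FAILED = "failed"


# Allowed transitions; anything else is rejected (illustrative, not any tool's real states).
TRANSITIONS = {
    TaskState.PENDING: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.SUCCEEDED, TaskState.RETRYING, TaskState.FAILED},
    TaskState.RETRYING: {TaskState.RUNNING},
    TaskState.SUCCEEDED: set(),
    TaskState.FAILED: set(),
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


# A task that fails once, is retried, and then succeeds.
state = TaskState.PENDING
for step in (TaskState.RUNNING, TaskState.RETRYING, TaskState.RUNNING, TaskState.SUCCEEDED):
    state = transition(state, step)
```

Keeping terminal states without outgoing transitions means a completed task cannot be restarted by accident; a manual retry would create a new run instead.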

14) Future Enhancements

- ML-based scheduling
- auto-scaling decisions
- SLA-driven prioritization
- job outcome prediction