
Designing A Distributed Job Scheduler (Cron Cluster / Task Orchestrator)

1) Problem Clarification

EN

We need a system that runs scheduled or recurring jobs across multiple nodes, ensuring:

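  • each scheduled run takes effect exactly once (at-least-once delivery combined with idempotent execution)
  • failed runs are retried
  • nodes coordinate safely, so no run is scheduled twice
  • execution scales with the number of workers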

VI

Design a system that runs periodic jobs across multiple nodes:

  • each job runs exactly once
  • retry on failure
  • distributed synchronization
  • scalable

2) Requirements

EN – Functional

✔ Cron scheduling
✔ Distributed execution
✔ Idempotent job semantics
✔ Retry + visibility timeout
✔ Monitoring / Pause / Stop

VI – Functional

✔ cron scheduling
✔ distributed execution
✔ idempotent jobs
✔ retry + visibility timeout
✔ monitoring / pause / stop

EN – Non-functional

✔ high reliability
✔ fault tolerance
✔ extensible job types
✔ predictable execution timing

VI – Non-functional

✔ high reliability
✔ fault tolerance
✔ extensible job types
✔ predictable timing

3) High-Level Architecture

Scheduler Manager → Job Queue → Worker Executors → Result Store

Coordinator handles:

  • leader election
  • job dispatch

VI

Components:

  • Scheduler manager (timing)
  • Job queue
  • Worker executor (the agent that runs jobs)
  • Result store

The coordinator handles:

  • leader election
  • job dispatch

4) Distributed Leader Election

EN

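We elect a single leader so that only one node evaluates schedules and enqueues each run; without it, every scheduler node would enqueue the same job.

Techniques:

  • ZooKeeper (ephemeral nodes)
  • etcd (leases)
  • Consul (session-backed locks)
  • Redis Redlock (its safety depends on timing assumptions, so pair it with fencing tokens)

Leader failure → the lease or session expires → another node takes over automatically.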
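The takeover rule can be illustrated with a small lease model. This is only a sketch under stated assumptions: `LeaseStore` is a hypothetical in-memory stand-in for a coordination service, not the API of ZooKeeper, etcd, Consul, or Redis, and the clock is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float
    token: int  # fencing token, increases every time leadership changes hands


class LeaseStore:
    """Hypothetical in-memory stand-in for a coordination service."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None
        self._next_token = 1

    def try_acquire(self, node: str, now: float, ttl: float) -> Optional[Lease]:
        current = self._lease
        if current is not None and current.expires_at > now:
            if current.holder != node:
                return None  # another node holds a live lease
            # The current leader renews: same token, later expiry.
            current.expires_at = now + ttl
            return current
        # No lease yet, or it expired: this node takes over with a new token.
        self._lease = Lease(node, now + ttl, self._next_token)
        self._next_token += 1
        return self._lease


store = LeaseStore()
first = store.try_acquire("node-a", now=0.0, ttl=10.0)     # node-a becomes leader
blocked = store.try_acquire("node-b", now=5.0, ttl=10.0)   # lease still live
takeover = store.try_acquire("node-b", now=10.0, ttl=10.0) # lease expired
```

In this model the leader has to renew before `ttl` elapses, and downstream components can reject writes that carry an older token than the latest one they have seen.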

VI

To avoid running a job twice → elect a leader.

Options:

  • Zookeeper
  • Etcd
  • Consul lock
  • Redis Redlock

Leader dies → fail over to another node.

5) Job Dispatching

EN

Once a run is due, the leader pushes a job message to a queue:

  • Kafka
  • RabbitMQ
  • SQS

Workers consume messages from the queue and execute the jobs.
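A minimal sketch of the leader's dispatch step, assuming fixed-interval jobs rather than full cron expressions. `JobSpec`, `due_messages`, and the message fields are hypothetical names chosen for illustration; the returned messages would then be published to whichever queue is in use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JobSpec:
    job_id: str
    interval_s: int  # fixed interval, a simplification of a cron expression


def due_messages(jobs: List[JobSpec], last_run: Dict[str, int], now: int) -> List[dict]:
    """Return one message per due (or missed) run and advance last_run."""
    messages = []
    for job in jobs:
        # A job with no recorded run is treated as due right now.
        scheduled = last_run.get(job.job_id, now - job.interval_s) + job.interval_s
        while scheduled <= now:
            # run_id is deterministic, so a message sent twice can be deduplicated.
            messages.append({
                "job_id": job.job_id,
                "scheduled_at": scheduled,
                "run_id": f"{job.job_id}@{scheduled}",
            })
            last_run[job.job_id] = scheduled
            scheduled += job.interval_s
    return messages


jobs = [JobSpec("report", 60)]
last_run = {"report": 0}
batch = due_messages(jobs, last_run, now=130)
```

Deriving `run_id` from the job ID and the scheduled time means a new leader that re-enqueues a run after failover produces the same identifier, which the idempotency layer can use to discard the duplicate.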

VI

The leader schedules → pushes the job to a queue (Kafka/RabbitMQ/SQS).
Workers subscribe and execute.

6) Visibility Timeout

EN

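To limit duplicate execution while still recovering from worker crashes:

  • A worker receives a job
  • The queue marks the job invisible to other workers for X seconds
  • On success, the worker deletes (acknowledges) the job
  • If the worker crashes → the timeout expires → the job becomes visible again

X must exceed the expected run time; a job that runs longer than X is redelivered while still running, which is why execution must also be idempotent.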
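The receive, hide, and reappear cycle can be modeled with a toy in-memory queue. This is a sketch for illustration only: `VisibilityQueue` is a hypothetical class, not the SQS API, and time is passed in explicitly instead of being read from a clock.

```python
from typing import Dict, Optional


class VisibilityQueue:
    """Toy in-memory queue with a per-message visibility timeout."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        # message id -> time at which the message becomes visible
        self._visible_at: Dict[str, float] = {}

    def put(self, msg_id: str, now: float) -> None:
        self._visible_at[msg_id] = now

    def receive(self, now: float) -> Optional[str]:
        for msg_id, visible_at in self._visible_at.items():
            if visible_at <= now:
                # Hide the message; it reappears if nobody deletes it in time.
                self._visible_at[msg_id] = now + self.timeout_s
                return msg_id
        return None

    def delete(self, msg_id: str) -> None:
        """Called by the worker after the job finished successfully."""
        self._visible_at.pop(msg_id, None)


q = VisibilityQueue(timeout_s=30)
q.put("run-1", now=0)
claimed = q.receive(now=0)      # worker A claims the job
hidden = q.receive(now=10)      # worker B sees nothing while A is working
redelivered = q.receive(now=30) # A never deleted it, so it is visible again
```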

VI

To avoid duplicate runs:

  • A worker takes the job
  • The job is "invisible" for X seconds
  • Worker crashes → the timeout expires → the job becomes visible again so another node can process it

7) Idempotency in Job Execution

EN

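Use:

  • a job run signature (for example, job ID plus scheduled time) that identifies each run
  • a job result store keyed by that signature

job_result:<job_id> = completed + output hash

If a job is retried → return the stored result instead of running it again.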
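The lookup-before-run pattern can be sketched as follows. `ResultStore` and `run_once` are hypothetical names, and the dictionary stands in for a shared store; a real deployment would also need an atomic set-if-absent operation so that two workers cannot both pass the check at the same time.

```python
import hashlib
from typing import Callable, Dict


class ResultStore:
    """In-memory stand-in for a shared result store (hypothetical)."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}

    def run_once(self, run_id: str, job: Callable[[], str]) -> dict:
        key = f"job_result:{run_id}"
        if key in self._results:
            return self._results[key]  # retry: reuse the stored result
        output = job()
        record = {
            "status": "completed",
            "output_hash": hashlib.sha256(output.encode()).hexdigest(),
        }
        self._results[key] = record
        return record


calls = []


def nightly_report() -> str:
    calls.append("ran")  # records how many times the job body actually ran
    return "42 rows exported"


results = ResultStore()
first_result = results.run_once("report@60", nightly_report)
retry_result = results.run_once("report@60", nightly_report)  # job body is skipped
```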

VI

Job idempotency:

  • store the job result by job_id

If a job is retried → return the previous result and avoid re-running destructive logic.

8) Retry and Dead Letter Queue (DLQ)

EN

If a job fails on a worker:

  • retry with exponential backoff
  • cap the number of attempts
  • move jobs that exhaust their retries to a DLQ for investigation
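A minimal sketch of the retry decision, assuming a base delay of 1 second, a 60 second cap, and at most 5 attempts. These values and the function names are illustrative assumptions, not taken from any particular queue.

```python
from typing import Optional, Tuple


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Delay after failed attempt number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(cap_s, base_s * 2 ** (attempt - 1))


def next_step(attempt: int, max_attempts: int = 5) -> Tuple[str, Optional[float]]:
    """Decide what happens after failed attempt number `attempt`."""
    if attempt >= max_attempts:
        return ("dlq", None)  # retries exhausted: park the job for investigation
    return ("retry", backoff_delay(attempt))


delays = [backoff_delay(n) for n in range(1, 5)]
```

Production systems often add random jitter to each delay so that many failed jobs do not retry at the same instant; that is left out here to keep the arithmetic visible.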

VI

Retry with:

  • backoff
  • exponential delays
  • a DLQ that holds failed jobs

9) Job Types & Execution Model

EN

Types:

  • one-shot
  • scheduled recurring
  • dependent / chained

Execution:

  • Worker pool
  • Task isolation
  • Sandbox for security
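For dependent or chained jobs, the scheduler needs an execution order that respects the dependencies. One way to sketch this is with `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the job names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical names).
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

For this linear chain the only valid order is extract, transform, load, notify. A cycle in the dependencies raises `graphlib.CycleError`, which is a convenient place to reject an invalid job definition.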

VI

Job types:

  • one-shot
  • recurring
  • chains of dependent jobs

Execution:

  • worker pool
  • isolation
  • sandbox

10) Monitoring & Coordination UI

EN

Expose a UI to:

  • pause a job
  • retry a job
  • inspect logs
  • view live metrics

VI

UI for:

  • pausing jobs
  • retrying jobs
  • viewing logs
  • live metrics

11) Failure Scenarios

EN

  • leader crash → re-election
  • job duplication → resolved by the idempotency check
  • job stuck → timeout + requeue
  • worker overload → scale out the worker pool

VI

  • leader crash → elect a new leader
  • job runs twice → idempotency
  • job stuck → timeout and requeue
  • workers overloaded → scale the pool

12) Observability

EN

Track:

  • job latency
  • failure rate
  • retry count
  • queue lag
  • worker utilization
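Most of these metrics can be derived from per-run records. The sketch below assumes a hypothetical `RunRecord` with scheduled, start, and finish timestamps; worker utilization is omitted because it needs per-worker data that this record does not carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    scheduled_at: float
    started_at: float
    finished_at: float
    succeeded: bool
    retries: int


def summarize(runs: List[RunRecord]) -> dict:
    """Aggregate job latency, failure rate, retry count, and queue lag."""
    if not runs:
        raise ValueError("no runs to summarize")
    total = len(runs)
    return {
        "avg_latency_s": sum(r.finished_at - r.started_at for r in runs) / total,
        "failure_rate": sum(not r.succeeded for r in runs) / total,
        "total_retries": sum(r.retries for r in runs),
        # Queue lag: how long a run waited between its schedule and its start.
        "max_queue_lag_s": max(r.started_at - r.scheduled_at for r in runs),
    }


runs = [
    RunRecord(scheduled_at=0, started_at=1, finished_at=3, succeeded=True, retries=0),
    RunRecord(scheduled_at=60, started_at=64, finished_at=70, succeeded=False, retries=2),
]
summary = summarize(runs)
```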

VI

Track:

  • job latency
  • failure rate
  • retries
  • queue lag
  • worker utilization

13) Alternative Architecture: Orchestrated Scheduler (Airflow style)

EN

Instead of workers pulling jobs independently from a queue, a central orchestration engine assigns tasks and tracks their state:

  • DAG-based scheduling
  • task retries
  • state machine transitions

Used by Airflow / Prefect / Temporal.
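The state machine idea can be sketched as a table of allowed transitions. The state names and the `transition` helper below are hypothetical and deliberately simplified; they are not the actual task states of Airflow, Prefect, or Temporal.

```python
from enum import Enum


class TaskState(Enum):
    SCHEDULED = "scheduled"
    RUNNING = "running"
    RETRYING = "retrying"
    SUCCEEDED = "succeeded"
    FAILED = "failed"


# For each state, the states the engine may move a task to next.
ALLOWED = {
    TaskState.SCHEDULED: {TaskState.RUNNING},
    TaskState.RUNNING: {TaskState.SUCCEEDED, TaskState.FAILED, TaskState.RETRYING},
    TaskState.RETRYING: {TaskState.RUNNING},
    TaskState.SUCCEEDED: set(),  # terminal
    TaskState.FAILED: set(),     # terminal
}


def transition(current: TaskState, target: TaskState) -> TaskState:
    """Return the new state, or raise if the engine requests an illegal move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = transition(TaskState.SCHEDULED, TaskState.RUNNING)
state = transition(state, TaskState.RETRYING)   # first attempt failed, retry
state = transition(state, TaskState.RUNNING)
state = transition(state, TaskState.SUCCEEDED)
```

Keeping the transitions in one table lets the engine reject impossible updates, such as a late worker report that tries to restart a task that already succeeded.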

VI

Instead of distributed workers, there is a central engine:

  • DAG-based scheduling
  • retry
  • state machine

Similar to Airflow, Prefect, and Temporal.

14) Future Enhancements

EN

  • ML-based scheduling
  • automated scaling decisions
  • SLA-driven prioritization
  • job outcome prediction

VI

  • ML scheduling
  • auto-scaling
  • SLA-based prioritization
  • job outcome prediction