1) Problem Clarification
We need a system that runs scheduled or recurring jobs across multiple nodes, ensuring:
- each job run takes effect exactly once
- failed jobs are retried
- safe coordination between nodes
- scalable execution
2) Requirements
Functional
✔ Cron scheduling
✔ Distributed execution
✔ Idempotent job semantics
✔ Retry + visibility timeout
✔ Monitoring / Pause / Stop
Non-functional
✔ High reliability
✔ Fault tolerance
✔ Extensible job types
✔ Predictable execution timing
3) High-Level Architecture
Scheduler Manager → Job Queue → Worker Executors → Result Store
Components:
- Scheduler Manager: decides when each job is due
- Job Queue: holds dispatched jobs until a worker takes them
- Worker Executor: agent that runs jobs
- Result Store: keeps job results
Coordinator handles:
- leader election
- job dispatch
4) Leader Election
We elect a single leader among the scheduler nodes so that each job is scheduled only once.
Techniques:
- ZooKeeper
- etcd
- Consul lock
- Redis Redlock
Leader failure → another node takes over automatically.
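A minimal sketch of lease-based election, assuming an in-memory `LeaseStore` as a stand-in for the coordination service. The class, its `try_acquire` method, and the fencing token are hypothetical names for illustration, not the API of ZooKeeper, etcd, Consul, or Redis. Time is passed in explicitly so the takeover is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float
    token: int  # fencing token, increases on every change of leader


class LeaseStore:
    """In-memory stand-in for a coordination service (hypothetical API)."""

    def __init__(self) -> None:
        self._lease: Optional[Lease] = None
        self._next_token = 1

    def try_acquire(self, node_id: str, ttl: float, now: float) -> Optional[int]:
        """Return a fencing token if node_id holds the lease after this call."""
        lease = self._lease
        live = lease is not None and lease.expires_at > now
        if live and lease.holder != node_id:
            return None  # another node still holds a live lease
        if live and lease.holder == node_id:
            lease.expires_at = now + ttl  # renewal keeps the same token
            return lease.token
        # No lease yet, or the old one expired: this node becomes leader.
        token = self._next_token
        self._next_token += 1
        self._lease = Lease(node_id, now + ttl, token)
        return token


store = LeaseStore()
t_a = store.try_acquire("node-a", ttl=10.0, now=0.0)    # node-a becomes leader
t_b = store.try_acquire("node-b", ttl=10.0, now=5.0)    # rejected: lease still live
t_a2 = store.try_acquire("node-a", ttl=10.0, now=8.0)   # renewal, lease now ends at 18.0
t_b2 = store.try_acquire("node-b", ttl=10.0, now=20.0)  # node-a stopped renewing: takeover
```

The token grows on each change of leader, so downstream components can reject messages from a stale leader that has not yet noticed its lease expired.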
5) Job Dispatching
When a job comes due, the leader pushes a job message to a queue:
- Kafka
- RabbitMQ
- SQS
Workers subscribe to the queue and execute the jobs.
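The dispatch step could look roughly like the sketch below. It assumes a fixed interval in place of a real cron expression and a `deque` in place of Kafka, RabbitMQ, or SQS; `ScheduledJob`, `dispatch_due`, and the `run_key` field are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ScheduledJob:
    job_id: str
    interval_s: int   # simplified stand-in for a cron expression
    next_run: int     # epoch seconds of the next due run


def dispatch_due(jobs, queue, now):
    """Enqueue one message per due run and advance each job's next_run."""
    for job in jobs:
        while job.next_run <= now:
            # The run key names the scheduled slot. A new leader that re-dispatches
            # the same slot produces the same key, so workers can deduplicate.
            queue.append({"job_id": job.job_id,
                          "run_key": f"{job.job_id}@{job.next_run}"})
            job.next_run += job.interval_s


queue = deque()
jobs = [ScheduledJob("report", interval_s=60, next_run=60),
        ScheduledJob("cleanup", interval_s=300, next_run=300)]
dispatch_due(jobs, queue, now=120)   # enqueues the "report" runs due at 60 and 120
```

Only the current leader would call `dispatch_due`; the loop also catches up on slots missed while no leader was active.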
6) Visibility Timeout
To keep two workers from running the same job at the same time:
- A worker receives a job
- The queue marks the job invisible for X seconds
- If the worker finishes in time, it acknowledges (deletes) the job
- If the worker crashes → the timeout expires → the job becomes visible again so another node can process it
X must exceed the expected run time; otherwise a slow job is redelivered while it is still running.
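The mechanism can be sketched with a toy in-memory queue. `VisibilityQueue` and its methods are hypothetical and only imitate the receive / acknowledge cycle; a real broker would do this server-side.

```python
import itertools


class VisibilityQueue:
    """Toy in-memory queue with a per-message visibility timeout."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._messages = {}            # receipt id -> [payload, invisible_until]
        self._ids = itertools.count(1)

    def send(self, payload) -> None:
        self._messages[next(self._ids)] = [payload, 0.0]

    def receive(self, now: float):
        """Return (receipt, payload) for the first visible message, or None."""
        for receipt, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout  # hide from other workers
                return receipt, entry[0]
        return None

    def ack(self, receipt) -> None:
        """Delete the message after successful processing."""
        self._messages.pop(receipt, None)


q = VisibilityQueue(visibility_timeout=30.0)
q.send("job-1")
first = q.receive(now=0.0)     # worker A takes the job
hidden = q.receive(now=10.0)   # worker B sees nothing while the job is invisible
again = q.receive(now=31.0)    # worker A crashed without ack: job is redelivered
q.ack(again[0])                # worker B finishes and deletes the job
empty = q.receive(now=100.0)
```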
7) Idempotency in Job Execution
Use:
- job run signature
- job result store, keyed by job_id
job_result:<job_id> = completed + output hash
If a job is retried → return the previous result instead of re-running destructive logic.
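A minimal sketch of the result-store check, assuming a plain dict in place of a durable store. `run_once`, `send_invoice`, and the record layout are hypothetical; the key format follows the `job_result:<job_id>` convention above.

```python
import hashlib

result_store = {}   # stand-in for a durable store, keyed "job_result:<job_id>"
calls = []          # records how often the job body really runs


def run_once(job_id, job_fn):
    """Run job_fn unless a completed result for job_id is already stored."""
    key = f"job_result:{job_id}"
    if key in result_store:
        return result_store[key]          # retry: return the previous result
    output = job_fn()
    record = {"status": "completed",
              "output_hash": hashlib.sha256(output.encode()).hexdigest()}
    result_store[key] = record
    return record


def send_invoice():
    calls.append("send_invoice")
    return "invoice-42 sent"


first = run_once("job-7", send_invoice)
retry = run_once("job-7", send_invoice)   # redelivered message, body is skipped
```

In a multi-worker deployment the check and the write would need to be atomic (for example a conditional insert), since two workers could otherwise both pass the lookup before either stores a result.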
8) Retry and Dead Letter Queue (DLQ)
If a worker fails:
- retry with exponential backoff
- move jobs that keep failing to the DLQ for investigation
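As a sketch of how the two pieces fit together, the loop below retries with a capped exponential delay and parks the job in a list standing in for the DLQ. The function names, the base delay of 1 second, and the 60 second cap are assumptions chosen for illustration.

```python
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Delay after failed attempt number `attempt` (1-based): base * 2^(attempt-1), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def run_with_retry(job_fn, max_attempts, dead_letter_queue, sleep=lambda s: None):
    """Try job_fn up to max_attempts times; park the job in the DLQ if all fail."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job_fn()
        except Exception as exc:
            if attempt == max_attempts:
                dead_letter_queue.append({"error": str(exc), "attempts": attempt})
                return None
            sleep(backoff_delay(attempt))   # wait longer after each failure


dlq = []
delays = []


def always_fails():
    raise RuntimeError("boom")


result = run_with_retry(always_fails, max_attempts=4,
                        dead_letter_queue=dlq, sleep=delays.append)
# delays now holds the waits between attempts: 1.0, 2.0, 4.0
```

The `sleep` parameter is injected so the example runs instantly; passing `time.sleep` would make it wait for real. Production systems often add random jitter to the delay so that many failed jobs do not retry at the same instant.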
9) Job Types & Execution Model
Types:
- one-shot
- scheduled recurring
- dependent / chained
Execution:
- Worker pool
- Task isolation
- Sandbox for security
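The worker pool and task isolation can be sketched with the standard library's thread pool. This only isolates failures (one job's exception does not affect the others); it is not a security sandbox, which would need separate processes or containers. `execute_isolated` and `run_pool` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_isolated(job):
    """Run one job and capture its failure so it cannot take down the pool."""
    job_id, fn = job
    try:
        return job_id, "ok", fn()
    except Exception as exc:
        return job_id, "failed", str(exc)


def run_pool(jobs, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, which keeps the result list deterministic
        return list(pool.map(execute_isolated, jobs))


jobs = [("sum", lambda: 1 + 2),
        ("bad", lambda: 1 / 0),          # raises ZeroDivisionError
        ("upper", lambda: "ok".upper())]
results = run_pool(jobs)
```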
10) Monitoring & Coordination UI
Expose a UI to:
- pause jobs
- retry jobs
- inspect logs
- view live metrics
11) Failure Scenarios
- leader crash → re-election of a new leader
- job duplication → resolved by idempotent execution
- job stuck → timeout + requeue
- worker overload → scale out the worker pool
12) Observability
Track:
- job latency
- failure rate
- retry count
- queue lag
- worker utilization
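As a rough illustration, most of these metrics can be derived from per-run timestamps. The `RunRecord` fields and the `summarize` function are hypothetical; worker utilization is left out because it needs pool-level data, and the function assumes a non-empty list of runs.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    enqueued_at: float
    started_at: float
    finished_at: float
    succeeded: bool
    retries: int


def summarize(runs):
    """Aggregate per-run records into latency, failure, retry, and lag metrics."""
    n = len(runs)
    return {
        "avg_latency_s": sum(r.finished_at - r.started_at for r in runs) / n,
        "failure_rate": sum(not r.succeeded for r in runs) / n,
        "total_retries": sum(r.retries for r in runs),
        # queue lag: how long a job waited before a worker picked it up
        "max_queue_lag_s": max(r.started_at - r.enqueued_at for r in runs),
    }


runs = [RunRecord(0.0, 1.0, 3.0, True, 0),
        RunRecord(0.0, 4.0, 5.0, False, 2)]
metrics = summarize(runs)
```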
13) Alternative Architecture: Orchestrated Scheduler (Airflow style)
Instead of independent distributed workers, a central orchestration engine assigns tasks:
- DAG-based scheduling
- task retries
- state machine transitions
Used by Airflow / Prefect / Temporal.
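DAG-based scheduling with state transitions can be sketched using Python's `graphlib` (standard library since 3.9). The four-task pipeline and the state names are made up for illustration and are not taken from Airflow, Prefect, or Temporal.

```python
from graphlib import TopologicalSorter

# task -> set of tasks it depends on (hypothetical pipeline)
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

state = {task: "pending" for task in dag}
order = []

sorter = TopologicalSorter(dag)
sorter.prepare()                      # raises CycleError if the graph has a cycle
while sorter.is_active():
    for task in sorter.get_ready():   # tasks whose dependencies are all done
        state[task] = "running"
        order.append(task)            # a real engine would hand the task to a worker
        state[task] = "success"
        sorter.done(task)             # unblocks tasks that depend on this one
```

A real orchestrator would persist `state`, run ready tasks in parallel, and add a failed / retrying branch to the state machine.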
14) Future Enhancements
- ML-based scheduling
- auto-scaling decisions
- SLA-driven prioritization
- job outcome prediction
[…] Designing A Distributed Job Scheduler (Cron Cluster / Task Orchestrator) […]