1) Problem Clarification / Làm rõ bài toán
EN
A distributed task queue executes background jobs asynchronously:
- send email
- resize images
- process payments
- run ML workloads
- schedule jobs
- retry failed tasks
VI
Distributed task queue xử lý job nền không đồng bộ:
- gửi email
- resize ảnh
- xử lý thanh toán
- chạy ML
- job theo lịch
- retry job bị lỗi
2) General Architecture / Kiến trúc tổng quan
Producer → Queue/Broker → Workers → Result Store / Callback
VI
Producer → Queue → Worker → Lưu kết quả hoặc callback.
3) Push vs Pull Model
EN
Push-based (RabbitMQ consumers, SNS, webhooks)
Broker delivers messages to workers.
Pull-based (SQS, Kafka, Redis BRPOP)
Workers poll for new messages.
VI
Push (RabbitMQ/SNS)
Broker đẩy job sang worker.
Pull (SQS/Kafka/Redis BRPOP)
Worker chủ động kéo job.
4) Core Requirements
EN
✔ scalable workers
✔ retry logic
✔ idempotency
✔ dead-letter handling
✔ visibility timeout
✔ task timeouts
✔ ordering rules
VI
✔ scale worker
✔ retry
✔ idempotent
✔ dead-letter
✔ visibility timeout
✔ timeout job
✔ thứ tự xử lý
PART A — REDIS / CELERY / SIDEKIQ STYLE QUEUES
5) Redis Queue Model
EN
Redis lists/zsets store tasks:
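LPUSH queue task
BRPOP queue 0
Workers block on BRPOP (timeout 0 waits indefinitely), pop one task at a time, and process it. A task popped this way is lost if the worker crashes before finishing it.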
VI
Redis list/zset lưu job:
- LPUSH → push
- BRPOP → worker lấy job
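The LPUSH/BRPOP pairing gives first-in, first-out behaviour because tasks enter at the head of the list and leave from the tail. A minimal in-memory sketch of that ordering follows; `ListQueue` is a hypothetical stand-in for a single Redis list, not the redis-py API, and its `rpop` returns `None` where a real BRPOP would block.

```python
from collections import deque


class ListQueue:
    """In-memory stand-in for one Redis list (hypothetical helper)."""

    def __init__(self):
        self._items = deque()

    def lpush(self, task):
        # LPUSH adds to the head (left) of the list and returns the new length.
        self._items.appendleft(task)
        return len(self._items)

    def rpop(self):
        # RPOP removes from the tail (right); BRPOP would block instead of
        # returning None when the list is empty.
        return self._items.pop() if self._items else None


queue = ListQueue()
for task in ["send_email:1", "resize_image:2", "send_email:3"]:
    queue.lpush(task)

processed = []
while (task := queue.rpop()) is not None:
    processed.append(task)
```

After the loop, `processed` holds the tasks in the order they were pushed.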
6) Celery Architecture
EN
Components:
- Broker: Redis / RabbitMQ
- Worker
- Beat: scheduler
- Result backend: Redis / DB
Supports:
- ETA, countdown
- retry, exponential backoff
- task composition: chain, group, chord
VI
Celery gồm:
- Broker
- Worker
- Beat (scheduler)
- Result backend
Hỗ trợ:
- ETA
- retry/backoff
- chain, group, chord
7) Sidekiq Model
EN
Ruby workers, Redis queue.
Features:
- multi-threaded concurrency
- automatic retries with exponential backoff (retry set)
- scheduling via sorted sets
VI
Sidekiq dùng Redis:
- đa luồng
- retry thông minh
- scheduled job dùng sorted set
PART B — AWS SQS STYLE QUEUES
8) SQS Key Concepts
EN
- At-least-once delivery
- Visibility timeout
- Dead-letter queue
- Long polling
VI
- At-least-once delivery
- Visibility timeout
- Dead-letter queue (DLQ)
- Long polling
9) Visibility Timeout
EN
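When a worker receives a message:
- the message becomes invisible to other consumers for N seconds
- the worker deletes the message after processing it successfully
- if the worker crashes or does not delete it before the timeout expires → the message becomes visible again and is redelivered (retry)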
VI
Worker nhận job → job bị ẩn tạm thời.
Nếu worker chết → job xuất hiện lại → retry.
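The receive, hide, delete-or-reappear cycle can be modelled in a few lines. The sketch below is a toy with an explicit clock passed in as `now`; `VisibilityQueue` and its methods are hypothetical names for illustration and do not mirror the SQS API.

```python
import itertools


class VisibilityQueue:
    """Toy model of visibility-timeout semantics (not the SQS API)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}         # message id -> body
        self._invisible_until = {}  # message id -> time it becomes visible again
        self._ids = itertools.count(1)

    def send(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        self._invisible_until[msg_id] = 0
        return msg_id

    def receive(self, now):
        # Return the first visible message and hide it for the timeout period.
        for msg_id, body in self._messages.items():
            if self._invisible_until[msg_id] <= now:
                self._invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        # A worker acknowledges success by deleting the message.
        self._messages.pop(msg_id, None)
        self._invisible_until.pop(msg_id, None)


q = VisibilityQueue(visibility_timeout=30)
q.send("resize image 42")

first = q.receive(now=0)    # worker A takes the message, then crashes
hidden = q.receive(now=10)  # still invisible to other workers
retry = q.receive(now=30)   # timeout expired: visible again
q.delete(retry[0])          # worker B finishes and deletes it
after_delete = q.receive(now=100)
```

In practice the timeout should exceed the longest expected processing time. Otherwise a slow but healthy worker sees its message handed to a second worker.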
10) Dead Letter Queue (DLQ)
EN
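After a message has been received X times without being deleted (the queue's maxReceiveCount) → SQS moves it to the DLQ.
This stops endless retries and allows investigation of failing tasks.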
VI
Sau X lần retry → đẩy vào DLQ để điều tra thủ công.
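A rough sketch of the retry-then-park rule, assuming a threshold of three receives. `drain`, `handler`, and `MAX_RECEIVES` are illustrative names; a managed queue applies the same rule on the broker side rather than in worker code.

```python
from collections import deque

MAX_RECEIVES = 3  # assumed threshold, analogous to a max receive count


def drain(tasks, handler, max_receives=MAX_RECEIVES):
    """Process tasks, retrying failures; return (completed, dead_letter)."""
    queue = deque((task, 0) for task in tasks)  # (task, receives so far)
    completed, dead_letter = [], []
    while queue:
        task, receives = queue.popleft()
        receives += 1
        try:
            handler(task)
            completed.append(task)
        except Exception as exc:
            if receives >= max_receives:
                # Give up: park the task with its last error for inspection.
                dead_letter.append((task, repr(exc)))
            else:
                queue.append((task, receives))
    return completed, dead_letter


def handler(task):
    # Hypothetical handler that always fails on one malformed payload.
    if task == "bad-payload":
        raise ValueError("cannot parse payload")


completed, dead_letter = drain(["ok-1", "bad-payload", "ok-2"], handler)
```

Keeping the last error next to the parked task makes the manual investigation step easier.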
PART C — KAFKA WORKER MODEL
11) Kafka Worker Characteristics
EN
✔ partition-level ordering
✔ consumer groups for parallelism
✔ at-least-once or exactly-once
✔ offset management
✔ no automatic retry → must implement logic
VI
✔ ordering theo partition
✔ consumer group để scale
✔ at-least-once hoặc exactly-once
✔ quản lý offset
✔ không retry tự động → phải tự code
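Partition-level ordering is useful only when related messages land on the same partition, which is normally arranged by giving them the same key. The sketch below illustrates the idea with a CRC32 hash. This is an assumption made for a deterministic example; Kafka's default partitioner uses a different hash (murmur2), and `partition_for` is a hypothetical helper.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Stable hash of the key, so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
]

partitions = {p: [] for p in range(3)}
for key, event in events:
    partitions[partition_for(key, 3)].append((key, event))

# All events for order-1 sit on one partition, in the order they were produced.
order_1_partition = partitions[partition_for("order-1", 3)]
order_1_events = [event for key, event in order_1_partition if key == "order-1"]
```

Ordering across different keys is not guaranteed, since they may sit on different partitions consumed by different workers.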
12) Exactly-Once with Kafka
EN
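Combine:
- idempotent producer (enable.idempotence=true)
- transactional producer (transactional.id) that writes output and commits consumed offsets in one transaction
- consumer with isolation.level=read_committed
This covers consume-process-produce pipelines inside Kafka; side effects in external systems still need idempotency.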
VI
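Dùng:
- producer idempotent (enable.idempotence=true)
- producer transactional (transactional.id): ghi output và commit offset trong cùng một transaction
- consumer đọc với isolation.level=read_committed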
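When the result goes to an external database rather than back to Kafka, a related but different technique is often used: store the consumed offset in the same database transaction as the result. The sketch below illustrates that alternative with SQLite. It does not use Kafka transactions, and the table names and the `process` function are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (msg_offset INTEGER PRIMARY KEY, value TEXT)")
db.execute("CREATE TABLE progress (part INTEGER PRIMARY KEY, next_offset INTEGER)")
db.execute("INSERT INTO progress VALUES (0, 0)")
db.commit()


def process(part, msg_offset, payload):
    """Apply one message; result and new offset commit together or not at all."""
    with db:  # one transaction: commit on success, rollback on exception
        (next_offset,) = db.execute(
            "SELECT next_offset FROM progress WHERE part = ?", (part,)
        ).fetchone()
        if msg_offset < next_offset:
            return False  # already applied: redelivery after a crash or rebalance
        db.execute(
            "INSERT INTO results VALUES (?, ?)", (msg_offset, payload.upper())
        )
        db.execute(
            "UPDATE progress SET next_offset = ? WHERE part = ?",
            (msg_offset + 1, part),
        )
        return True


# Offset 1 is delivered twice, as would happen after a consumer restart.
applied = [process(0, 0, "a"), process(0, 1, "b"), process(0, 1, "b")]
rows = db.execute(
    "SELECT msg_offset, value FROM results ORDER BY msg_offset"
).fetchall()
```

On restart, a consumer using this approach would seek to the stored `next_offset` instead of relying on the offset committed in Kafka.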
13) Handling Failures in Kafka Workers
EN
- retry topic
- DLQ topic
- backoff logic
- poison message protection
VI
Kafka xử lý lỗi bằng:
- retry topic
- DLQ topic
- backoff
- tránh poison message
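Because Kafka does not retry on the worker's behalf, the worker decides how long to wait and where a failed message goes next. A minimal sketch, assuming exponential backoff with full jitter and three retry tiers before the DLQ; the topic names `jobs.retry.N` and `jobs.dlq` are hypothetical.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return rng.uniform(0, min(cap, base * 2 ** attempt))


def next_topic(attempt, max_attempts=3):
    # attempt counts the failures so far, starting at 0 for the first failure.
    # Retry tiers come first, then the dead-letter topic.
    if attempt < max_attempts:
        return f"jobs.retry.{attempt + 1}"
    return "jobs.dlq"


rng = random.Random(7)  # seeded only to make the example repeatable
delays = [backoff_delay(attempt, rng=rng) for attempt in range(8)]
route = [next_topic(attempt) for attempt in range(4)]
```

The cap on attempts is what protects against poison messages: a message that can never succeed reaches the DLQ topic instead of looping through the retry tiers indefinitely.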
PART D — TASK SCHEDULING
14) Scheduling Jobs
EN
Techniques:
- cron scheduler
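- delayed queue (Redis sorted set scored by run-at timestamp)
- SQS DelaySeconds (up to 15 minutes)
- Kafka delay topic (no native delay; the consumer waits until the message is due)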
VI
Cách lên lịch job:
- cron
- Redis sorted-set
- SQS delay
- Kafka delay topic
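The sorted-set approach stores each task with its run-at timestamp as the score, and a poller repeatedly takes everything whose score is at or before the current time. An in-memory sketch of the same idea using a heap; `DelayedQueue` is a hypothetical class, and with Redis the equivalent operations would be ZADD to schedule and a score-range query to fetch due tasks.

```python
import heapq


class DelayedQueue:
    """In-memory analogue of a sorted-set delay queue: score = run-at time."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal timestamps keep insertion order

    def schedule(self, run_at, task):
        heapq.heappush(self._heap, (run_at, self._seq, task))
        self._seq += 1

    def pop_due(self, now):
        # Return every task whose run-at time has passed, earliest first.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due


dq = DelayedQueue()
dq.schedule(run_at=120, task="send reminder")
dq.schedule(run_at=60, task="retry payment")
dq.schedule(run_at=300, task="expire session")

due_at_100 = dq.pop_due(now=100)
due_at_200 = dq.pop_due(now=200)
```

With several pollers, fetching and removing a due task has to be atomic, otherwise two pollers can claim the same task.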
15) Idempotency — CRITICAL for Safety
EN
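With at-least-once delivery the same job can arrive more than once, so workers must make reprocessing harmless.
Techniques:
- idempotent operation
- job deduplication key
- database unique constraints
- Redis SET key value NX EX ttl (atomic SETNX with expiry)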
VI
Phải đảm bảo job không xử lý 2 lần:
- logic idempotent
- dedupe key
- DB unique
- Redis SETNX
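The deduplication key and the unique constraint work well together: the worker tries to record the key, and the database rejects the second attempt. A minimal sketch with SQLite; the table name, key format, and `handle_once` are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_jobs (dedupe_key TEXT PRIMARY KEY)")

emails_sent = []  # stand-in for the real side effect


def handle_once(dedupe_key, recipient):
    """Run the side effect only if this dedupe key has not been claimed before."""
    with db:
        cur = db.execute(
            "INSERT OR IGNORE INTO processed_jobs (dedupe_key) VALUES (?)",
            (dedupe_key,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: key already recorded
    emails_sent.append(recipient)
    return True


results = [
    handle_once("welcome-email:user-1", "user-1@example.com"),
    handle_once("welcome-email:user-1", "user-1@example.com"),  # redelivered
    handle_once("welcome-email:user-2", "user-2@example.com"),
]
```

This version claims the key before running the side effect, so a crash between the two steps leaves the job marked as done but not performed. Where that matters, the side effect and the key insert belong in one transaction, or the side effect itself has to be idempotent.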
PART E — WORKER SCALING
16) Worker Auto-Scaling
EN
Base scaling on:
- queue depth
- message age
- processing time
- CPU usage
VI
Scale worker dựa trên:
- số job trong queue
- job bị treo lâu
- thời gian xử lý
- CPU
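Queue depth and processing time combine into a simple sizing rule: the backlog in seconds of work, divided by how quickly it should drain. The sketch below is one possible formula, not a standard one; the function name, the bounds, and the 60-second drain target are assumptions.

```python
import math


def desired_workers(queue_depth, avg_seconds_per_job, target_drain_seconds,
                    per_worker_concurrency=1, min_workers=1, max_workers=200):
    """Workers needed to drain the current backlog within the target time."""
    work_seconds = queue_depth * avg_seconds_per_job
    needed = math.ceil(
        work_seconds / (target_drain_seconds * per_worker_concurrency)
    )
    # Clamp to the allowed range so a spike cannot scale without bound.
    return max(min_workers, min(max_workers, needed))


small = desired_workers(queue_depth=100, avg_seconds_per_job=0.5,
                        target_drain_seconds=60)
large = desired_workers(queue_depth=12_000, avg_seconds_per_job=0.5,
                        target_drain_seconds=60)
capped = desired_workers(queue_depth=1_000_000, avg_seconds_per_job=0.5,
                         target_drain_seconds=60)
```

Message age is a useful cross-check on this number: a short queue whose oldest message keeps getting older usually points to stuck workers rather than too few of them.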
17) Worker Concurrency Control
EN
Use:
- per-worker concurrency
- per-task concurrency limits
- rate limit
VI
Dùng:
- giới hạn concurrency worker
- limit theo loại job
- rate limit
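Per-worker and per-task limits can be layered: a thread pool bounds how much one worker process runs at once, and a semaphore per task type bounds each kind of job inside it. A small sketch under those assumptions; the limits and task names are made up for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

LIMITS = {"payment": 2, "email": 4}  # assumed per-task-type concurrency limits
semaphores = {name: threading.BoundedSemaphore(n) for name, n in LIMITS.items()}
lock = threading.Lock()
running = {name: 0 for name in LIMITS}
peak = {name: 0 for name in LIMITS}


def run_task(task_type):
    with semaphores[task_type]:  # blocks while this type is at its limit
        with lock:
            running[task_type] += 1
            peak[task_type] = max(peak[task_type], running[task_type])
        time.sleep(0.01)  # stand-in for real work
        with lock:
            running[task_type] -= 1


# Worker-level concurrency: at most 8 tasks run in this process at once.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(run_task, ["payment"] * 10 + ["email"] * 10))
```

`peak` records the highest number of tasks of each type that ran at the same moment, which stays at or below the configured limit. A rate limit is a separate control: it bounds tasks started per unit of time rather than tasks running at once.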
PART F — OBSERVABILITY
18) Monitoring Metrics
EN
Track:
- queue length
- job processing time
- success/failure rate
- retry count
- DLQ volume
- worker crash rate
VI
Theo dõi:
- chiều dài queue
- thời gian xử lý
- tỉ lệ lỗi
- số retry
- DLQ tăng
- worker crash
PART G — FAILURE SCENARIOS
19) Common Failure Modes
EN
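- job processed twice → make handlers idempotent
- worker crash → SQS visibility timeout expires and the message is redelivered
- poison message loops → cap retries and route to DLQ
- Redis down → queue unavailable; tasks not yet persisted can be lost
- Kafka rebalance → partitions are reassigned; messages after the last committed offset are reprocessed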
VI
Các lỗi:
- job chạy 2 lần → idempotent
- worker chết → SQS retry
- poison message → DLQ
- Redis down → queue lỗi
- Kafka rebalance → consumer restart
PART H — CHOOSING A TASK QUEUE
20) Task Queue Recommendations
EN
| Use Case | Recommended Queue |
|---|---|
| simple background jobs | Redis queue / Sidekiq / Celery |
| high reliability | SQS |
| long-running tasks | SQS + ECS (Lambda only for tasks under its 15-minute limit) |
| strict ordering | Kafka |
| real-time streaming | Kafka Streams |
| cron jobs | Celery beat / Kubernetes CronJobs |
VI
| Use Case | Queue đề xuất |
|---|---|
| job nền đơn giản | Redis queue / Sidekiq / Celery |
| độ tin cậy cao | SQS |
| job chạy lâu | SQS + ECS (Lambda nếu job dưới 15 phút) |
| yêu cầu order | Kafka |
| streaming realtime | Kafka Streams |
| Cron | Celery beat / K8S cronjob |
[…] Designing A Distributed Task Queue (Celery / Sidekiq / SQS Worker / Kafka Worker) […]