Skip to content

Designing A Distributed Task Queue (Celery / Sidekiq / SQS Worker / Kafka Worker)

1) Problem Clarification / Làm rõ bài toán

EN

A distributed task queue executes background jobs asynchronously:

  • send email
  • resize images
  • process payments
  • run ML workloads
  • schedule jobs
  • retry failed tasks

VI

Distributed task queue xử lý job nền không đồng bộ:

  • gửi email
  • resize ảnh
  • xử lý thanh toán
  • chạy ML
  • job theo lịch
  • retry job bị lỗi

2) General Architecture / Kiến trúc tổng quan

Producer → Queue/Broker → Workers → Result Store / Callback

VI

Producer → Queue → Worker → Lưu kết quả hoặc callback.

3) Push vs Pull Model

EN

Push-based (SQS, Redis queues)

Broker delivers messages to workers.

Pull-based (Kafka)

Workers poll for new messages.

VI

Push (SQS/Redis)

Queue đẩy job sang worker.

Pull (Kafka)

Worker chủ động kéo job.

4) Core Requirements

EN

✔ scalable workers
✔ retry logic
✔ idempotency
✔ dead-letter handling
✔ visibility timeout
✔ task timeouts
✔ ordering rules

VI

✔ scale worker
✔ retry
✔ idempotent
✔ dead-letter
✔ visibility timeout
✔ timeout job
✔ thứ tự xử lý

PART A — REDIS/ CELERY / SIDEKIQ STYLE QUEUES

5) Redis Queue Model

EN

Redis lists/zsets store tasks:

LPUSH queue task
BRPOP queue

Workers pop tasks and process them.

VI

Redis list/zset lưu job:

  • LPUSH → push
  • BRPOP → worker lấy job

6) Celery Architecture

EN

Components:

  • Broker: Redis / RabbitMQ
  • Worker
  • Beat: scheduler
  • Result backend: Redis / DB

Supports:

  • ETA, countdown
  • retry, exponential backoff
  • task chaining

VI

Celery gồm:

  • Broker
  • Worker
  • Beat (scheduler)
  • Result backend

Hỗ trợ:

  • ETA
  • retry/backoff
  • chain, group, chord

7) Sidekiq Model

EN

Ruby workers, Redis queue.

Features:

  • concurrency
  • retry set
  • scheduling via sorted sets

VI

Sidekiq dùng Redis:

  • đa luồng
  • retry thông minh
  • scheduled job dùng sorted set

PART B — AWS SQS STYLE QUEUES

8) SQS Key Concepts

EN

  • At-least-once delivery
  • Visibility timeout
  • Dead-letter queue
  • Long polling

VI

  • At-least-once delivery
  • Visibility timeout
  • Dead-letter queue (DLQ)
  • Long polling

9) Visibility Timeout

EN

When worker receives a message:

  • message becomes invisible for N seconds
  • if worker fails or crashes → message becomes visible again (retry)

VI

Worker nhận job → job bị ẩn tạm thời.
Nếu worker chết → job xuất hiện lại → retry.

10) Dead Letter Queue (DLQ)

EN

After X retries → move message to DLQ.

Allows investigation of failing tasks.

VI

Sau X lần retry → đẩy vào DLQ để điều tra thủ công.

PART C — KAFKA WORKER MODEL

11) Kafka Worker Characteristics

EN

✔ partition-level ordering
✔ consumer groups for parallelism
✔ at-least-once or exactly-once
✔ offset management
✔ no automatic retry → must implement logic

VI

✔ ordering theo partition
✔ consumer group để scale
✔ at-least-once hoặc exactly-once
✔ quản lý offset
✔ không retry tự động → phải tự code

12) Exactly-Once with Kafka

EN

Combine:

  • idempotent producer
  • transactional consumer
  • atomic offset commit

VI

Dùng:

  • producer idempotent
  • consumer transactional
  • commit offset nguyên tử

13) Handling Failures in Kafka Workers

EN

  • retry topic
  • DLQ topic
  • backoff logic
  • poison message protection

VI

Kafka xử lý lỗi bằng:

  • retry topic
  • DLQ topic
  • backoff
  • tránh poison message

PART D — TASK SCHEDULING

14) Scheduling Jobs

EN

Techniques:

  • cron scheduler
  • delayed queue (Redis sorted-set)
  • SQS delay seconds
  • Kafka delay topic

VI

Cách lên lịch job:

  • cron
  • Redis sorted-set
  • SQS delay
  • Kafka delay topic

15) Idempotency — CRITICAL for Safety

EN

Workers must avoid double-processing:

Techniques:

  • idempotent operation
  • job deduplication key
  • database unique constraints
  • Redis SETNX

VI

Phải đảm bảo job không xử lý 2 lần:

  • logic idempotent
  • dedupe key
  • DB unique
  • Redis SETNX

PART E — WORKER SCALING

16) Worker Auto-Scaling

EN

Base scaling on:

  • queue depth
  • message age
  • processing time
  • CPU usage

VI

Scale worker dựa trên:

  • số job trong queue
  • job bị treo lâu
  • thời gian xử lý
  • CPU

17) Worker Concurrency Control

EN

Use:

  • per-worker concurrency
  • per-task concurrency limits
  • rate limit

VI

Dùng:

  • giới hạn concurrency worker
  • limit theo loại job
  • rate limit

PART F — OBSERVABILITY

18) Monitoring Metrics

EN

Track:

  • queue length
  • job processing time
  • success/failure rate
  • retry count
  • DLQ volume
  • worker crash rate

VI

Theo dõi:

  • chiều dài queue
  • thời gian xử lý
  • tỉ lệ lỗi
  • số retry
  • DLQ tăng
  • worker crash

PART G — FAILURE SCENARIOS

19) Common Failure Modes

EN

  • job processed twice → idempotency fix
  • worker crash → SQS visibility timeout
  • poison message loops → DLQ
  • Redis down → queue unavailable
  • Kafka rebalance → consumer restart

VI

Các lỗi:

  • job chạy 2 lần → idempotent
  • worker chết → SQS retry
  • poison message → DLQ
  • Redis down → queue lỗi
  • Kafka rebalance → consumer restart

PART H — CHOOSING A TASK QUEUE

20) Task Queue Recommendations

EN

Use CaseRecommended Queue
simple background jobsRedis queue / Sidekiq / Celery
high reliabilitySQS
long-running tasksSQS + Lambda / ECS
strict orderingKafka
real-time streamingKafka Streams
cron jobsCelery beat / Kubernetes CronJobs

VI

Use CaseQueue đề xuất
job nền đơn giảnRedis queue
độ tin cậy caoSQS
job chạy lâuSQS
yêu cầu orderKafka
streaming realtimeKafka Streams
CronCelery beat / K8S cronjob
Published inAllSystem Design

One Comment

Leave a Reply

Your email address will not be published. Required fields are marked *