Skip to content

Designing A Stream Processing System (Kafka Streams / Flink / Spark Streaming)

1) Problem Clarification / Làm rõ bài toán

EN

A stream processing system handles continuous, unbounded data in real-time:

  • financial transactions
  • clickstreams
  • sensor/IoT
  • logs
  • social media events

VI

Hệ thống stream xử lý dữ liệu liên tục và không giới hạn:

  • giao dịch tài chính
  • clickstream
  • IoT
  • log
  • social event

2) High-Level Architecture / Kiến trúc tổng quan

Producers → Kafka → Stream Processor → State Store → Output Topics / DB / OLAP
                         ↓
                   Checkpoint / Savepoint

VI

Producers → Kafka → Stream Processor → State Store → Output
Kèm checkpoint và savepoint để phục hồi.

3) Key Concepts / Khái niệm cốt lõi

EN

✔ event-time vs processing-time
✔ windows (tumbling, sliding, session)
✔ watermarks
✔ stateful operations
✔ checkpointing
✔ exactly-once semantics
✔ backpressure

VI

✔ event-time và processing-time
✔ window (tumbling, sliding, session)
✔ watermark
✔ trạng thái (state)
✔ checkpoint
✔ exactly-once
✔ backpressure

4) Event-Time vs Processing-Time

EN

Processing-time:
Based on time system processes the event.

Event-time:
Based on timestamp embedded in the event.
Handles out-of-order events.

VI

Processing-time:
Theo thời gian máy xử lý event.

Event-time:
Theo timestamp bên trong event.
Xử lý được event đến trễ (late events).

5) Windowing

EN

Types of windows:

  • Tumbling window: fixed, non-overlapping
  • Sliding window: overlapping
  • Session window: ends with inactivity gap

VI

Loại window:

  • Tumbling: cố định, không overlap
  • Sliding: có overlap
  • Session: kết thúc khi không có event mới

6) Watermarks — Handling Late Events

EN

A watermark tells the system:

“I believe no earlier event will arrive.”

Late events:

  • accepted up to allowed lateness
  • dropped or sent to side-output

VI

Watermark báo cho hệ thống:

“Dữ liệu cũ hơn timestamp này sẽ không tới nữa.”

Late event:

  • chấp nhận nếu trong thời gian cho phép
  • nếu quá trễ → drop hoặc đẩy sang side-output

7) Stateful Stream Processing

EN

Examples of state:

  • counters per key
  • windows accumulation
  • join buffers
  • aggregations

State stored in:

  • local RocksDB (Flink)
  • in-memory + changelog (Kafka Streams)

VI

State gồm:

  • counter theo từng key
  • aggregation trong window
  • join buffer
  • bảng lookup

Lưu state bằng:

  • RocksDB (Flink)
  • in-memory + changelog topic (Kafka Streams)

8) Checkpointing & Recovery

EN

Checkpoint captures:

  • operator states
  • offsets
  • metadata

On failure → restart from last checkpoint.

VI

Checkpoint lưu:

  • state operator
  • offset Kafka
  • metadata

Lỗi → khôi phục từ checkpoint.

9) Exactly-Once Semantics (EOS)

EN

Achieved via:

  • idempotent producers
  • transactional writes
  • state checkpoint atomicity
  • Kafka Streams: EOS guarantees natively
  • Flink: two-phase commit sink

VI

Đảm bảo exactly-once bằng:

  • producer idempotent
  • ghi transactional
  • checkpoint state
  • Kafka Streams hỗ trợ EOS sẵn
  • Flink có 2-phase commit sink

10) Stateful Joins in Streams

EN

Types:

  • stream–stream join
  • stream–table join
  • table–table join

Must store history for join windows.

VI

Join gồm:

  • stream–stream
  • stream–table
  • table–table

Cần lưu state cho join window.

11) Handling Backpressure

EN

Backpressure controls load:

  • slow sink → reduce input consumption
  • buffer + flow control signals
  • Flink has automatic backpressure detection

VI

Backpressure điều tiết load:

  • sink chậm → giảm tốc xử lý
  • buffer + tín hiệu flow control
  • Flink tự động xử lý backpressure

12) Distributed Execution Model

Kafka Streams

EN

  • each instance = task
  • state local per shard
  • lightweight scaling

VI

Kafka Streams:

  • mỗi instance là task
  • state local
  • scale nhẹ

Apache Flink

EN

  • operators → operator chains
  • task slots
  • distributed DAG
  • powerful windowing + state

VI

Flink:

  • operator chain
  • task slot
  • DAG phân tán
  • mạnh về window và state

Spark Streaming (Structured Streaming)

EN

  • micro-batch
  • unified API
  • great for ETL + batch + stream
  • slightly higher latency than Flink

VI

Spark Streaming:

  • micro-batch
  • API thống nhất
  • hợp ETL + batch + stream
  • latency cao hơn Flink

13) Output Sinks

EN

Streams may output to:

  • Kafka topics
  • OLAP (ClickHouse, Pinot)
  • databases
  • storage lake (S3, HDFS)
  • materialized views

VI

Kết quả có thể ghi vào:

  • Kafka
  • OLAP
  • DB
  • S3 / HDFS
  • view vật lý

14) Monitoring & Metrics

EN

Track:

  • lag
  • throughput
  • checkpoint duration
  • watermark progress
  • out-of-order events
  • CPU/memory

VI

Theo dõi:

  • Kafka lag
  • throughput
  • checkpoint time
  • watermark
  • late event
  • CPU/RAM

15) Failure Scenarios

EN

  • corrupted checkpoint → rollback
  • slow operator → backpressure
  • skewed partition → hotspot
  • Kafka partition failure

VI

Lỗi:

  • checkpoint hỏng
  • operator chậm → backpressure
  • partition skew
  • Kafka partition lỗi

16) When to Choose Each Framework?

Kafka Streams

EN: lightweight, embedded, great for microservices.
VI: nhẹ, nhúng vào service, phù hợp event-driven microservice.

Apache Flink

EN: best for event-time, complex windows, exactly-once, long pipelines.
VI: mạnh nhất cho event-time, window phức tạp, EOS mạnh.

Spark Streaming

EN: unified batch/stream, strong for ETL + machine learning.
VI: phù hợp big data pipeline + ML.

Published inSystem Design

One Comment

Leave a Reply

Your email address will not be published. Required fields are marked *