1) Problem Clarification / Làm rõ bài toán
EN
A stream processing system handles continuous, unbounded data in real-time:
- financial transactions
- clickstreams
- sensor/IoT
- logs
- social media events
VI
Hệ thống stream xử lý dữ liệu liên tục và không giới hạn:
- giao dịch tài chính
- clickstream
- IoT
- log
- social event
2) High-Level Architecture / Kiến trúc tổng quan
Producers → Kafka → Stream Processor → State Store → Output Topics / DB / OLAP
↓
Checkpoint / Savepoint
VI
Producers → Kafka → Stream Processor → State Store → Output
Kèm checkpoint và savepoint để phục hồi.
3) Key Concepts / Khái niệm cốt lõi
EN
✔ event-time vs processing-time
✔ windows (tumbling, sliding, session)
✔ watermarks
✔ stateful operations
✔ checkpointing
✔ exactly-once semantics
✔ backpressure
VI
✔ event-time và processing-time
✔ window (tumbling, sliding, session)
✔ watermark
✔ trạng thái (state)
✔ checkpoint
✔ exactly-once
✔ backpressure
4) Event-Time vs Processing-Time
EN
Processing-time:
Based on time system processes the event.
Event-time:
Based on timestamp embedded in the event.
Handles out-of-order events.
VI
Processing-time:
Theo thời gian máy xử lý event.
Event-time:
Theo timestamp bên trong event.
Xử lý được event đến trễ (late events).
5) Windowing
EN
Types of windows:
- Tumbling window: fixed, non-overlapping
- Sliding window: overlapping
- Session window: ends with inactivity gap
VI
Loại window:
- Tumbling: cố định, không overlap
- Sliding: có overlap
- Session: kết thúc khi không có event mới
6) Watermarks — Handling Late Events
EN
A watermark tells the system:
“I believe no earlier event will arrive.”
Late events:
- accepted up to allowed lateness
- dropped or sent to side-output
VI
Watermark báo cho hệ thống:
“Dữ liệu cũ hơn timestamp này sẽ không tới nữa.”
Late event:
- chấp nhận nếu trong thời gian cho phép
- nếu quá trễ → drop hoặc đẩy sang side-output
7) Stateful Stream Processing
EN
Examples of state:
- counters per key
- windows accumulation
- join buffers
- aggregations
State stored in:
- local RocksDB (Flink)
- in-memory + changelog (Kafka Streams)
VI
State gồm:
- counter theo từng key
- aggregation trong window
- join buffer
- bảng lookup
Lưu state bằng:
- RocksDB (Flink)
- in-memory + changelog topic (Kafka Streams)
8) Checkpointing & Recovery
EN
Checkpoint captures:
- operator states
- offsets
- metadata
On failure → restart from last checkpoint.
VI
Checkpoint lưu:
- state operator
- offset Kafka
- metadata
Lỗi → khôi phục từ checkpoint.
9) Exactly-Once Semantics (EOS)
EN
Achieved via:
- idempotent producers
- transactional writes
- state checkpoint atomicity
- Kafka Streams: EOS guarantees natively
- Flink: two-phase commit sink
VI
Đảm bảo exactly-once bằng:
- producer idempotent
- ghi transactional
- checkpoint state
- Kafka Streams hỗ trợ EOS sẵn
- Flink có 2-phase commit sink
10) Stateful Joins in Streams
EN
Types:
- stream–stream join
- stream–table join
- table–table join
Must store history for join windows.
VI
Join gồm:
- stream–stream
- stream–table
- table–table
Cần lưu state cho join window.
11) Handling Backpressure
EN
Backpressure controls load:
- slow sink → reduce input consumption
- buffer + flow control signals
- Flink has automatic backpressure detection
VI
Backpressure điều tiết load:
- sink chậm → giảm tốc xử lý
- buffer + tín hiệu flow control
- Flink tự động xử lý backpressure
12) Distributed Execution Model
Kafka Streams
EN
- each instance = task
- state local per shard
- lightweight scaling
VI
Kafka Streams:
- mỗi instance là task
- state local
- scale nhẹ
Apache Flink
EN
- operators → operator chains
- task slots
- distributed DAG
- powerful windowing + state
VI
Flink:
- operator chain
- task slot
- DAG phân tán
- mạnh về window và state
Spark Streaming (Structured Streaming)
EN
- micro-batch
- unified API
- great for ETL + batch + stream
- slightly higher latency than Flink
VI
Spark Streaming:
- micro-batch
- API thống nhất
- hợp ETL + batch + stream
- latency cao hơn Flink
13) Output Sinks
EN
Streams may output to:
- Kafka topics
- OLAP (ClickHouse, Pinot)
- databases
- storage lake (S3, HDFS)
- materialized views
VI
Kết quả có thể ghi vào:
- Kafka
- OLAP
- DB
- S3 / HDFS
- view vật lý
14) Monitoring & Metrics
EN
Track:
- lag
- throughput
- checkpoint duration
- watermark progress
- out-of-order events
- CPU/memory
VI
Theo dõi:
- Kafka lag
- throughput
- checkpoint time
- watermark
- late event
- CPU/RAM
15) Failure Scenarios
EN
- corrupted checkpoint → rollback
- slow operator → backpressure
- skewed partition → hotspot
- Kafka partition failure
VI
Lỗi:
- checkpoint hỏng
- operator chậm → backpressure
- partition skew
- Kafka partition lỗi
16) When to Choose Each Framework?
Kafka Streams
EN: lightweight, embedded, great for microservices.
VI: nhẹ, nhúng vào service, phù hợp event-driven microservice.
Apache Flink
EN: best for event-time, complex windows, exactly-once, long pipelines.
VI: mạnh nhất cho event-time, window phức tạp, EOS mạnh.
Spark Streaming
EN: unified batch/stream, strong for ETL + machine learning.
VI: phù hợp big data pipeline + ML.
[…] Designing A Stream Processing System (Kafka Streams / Flink / Spark Streaming) […]