Designing A MapReduce System (Hadoop / Spark)

1) Problem Clarification / Làm rõ bài toán

EN

MapReduce is a programming model for batch-processing very large datasets (terabytes to petabytes) in parallel across a cluster of machines.
Typical uses:

  • log analytics
  • recommendation engines
  • indexing (search engines)
  • machine learning preprocessing
  • ETL pipelines

VI

MapReduce dùng để xử lý dữ liệu khổng lồ (TB–PB) trên nhiều máy.
Ứng dụng:

  • phân tích log
  • recommendation
  • tạo index search
  • ML preprocessing
  • ETL

2) High-Level Architecture / Kiến trúc tổng quan

HDFS / Storage → Job Scheduler → Map Workers → Shuffle → Reduce Workers → Output

VI

HDFS → Scheduler → Map → Shuffle → Reduce → Kết quả

3) Why MapReduce? / Vì sao cần MapReduce?

EN

✔ horizontal scale
✔ fault tolerance
✔ automatic data partition
✔ automatic job recovery
✔ parallel processing

VI

✔ scale ngang
✔ chịu lỗi tốt
✔ tự chia dữ liệu
✔ tự phục hồi job
✔ xử lý song song

4) Distributed File System (HDFS)

EN

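HDFS stores each file as:

  • fixed-size blocks (128 MB by default, configurable)
  • 3 replicas of every block by default, placed on different DataNodes
  • NameNode → metadata (file names, block lists, block locations)
  • DataNode → actual blocks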
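
To make the block and replication numbers concrete, here is a small arithmetic sketch. The function name `hdfs_footprint` is made up for illustration, and the constants are simply the figures quoted above, not values read from any real cluster.

```python
BLOCK_SIZE = 128 * 1024**2   # 128 MiB, the block size quoted above
REPLICATION = 3              # the replication factor quoted above

def hdfs_footprint(file_size_bytes):
    """Return (number of blocks, raw bytes stored across all replicas)."""
    # Integer ceiling division: a partial last block still counts as a block.
    blocks = (file_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
    # The last block is not padded, so raw usage is simply size * replicas.
    raw_bytes = file_size_bytes * REPLICATION
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1024**4)   # a 1 TiB file
print(blocks)            # 8192
print(raw / 1024**4)     # 3.0
```

So a 1 TiB file becomes 8192 blocks (and, with one map task per block, roughly that many map tasks) and occupies about 3 TiB of raw disk.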

VI

HDFS lưu dưới dạng:

  • block 128MB
  • replicate 3 lần
  • NameNode chứa metadata
  • DataNode chứa block thực

5) MapReduce Programming Model

EN

Map Phase

Input split → mapper emits:

(key, value)

Shuffle Phase

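The framework partitions, sorts, and groups the intermediate pairs by key, so that every value for a given key reaches the same reducer.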

Reduce Phase

The reducer receives all values for one key and produces the output for that key:

key → [v1, v2, v3] → output
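
The three phases can be sketched as a single-process Python function. This is only an in-memory illustration of the model, not how Hadoop implements it; `run_mapreduce`, `status_mapper`, `sum_reducer`, and the log line format are all hypothetical names chosen for the example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle, and reduce over an in-memory list of records."""
    # Map phase: each record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle phase: group all values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per key, keys visited in sorted order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}

def status_mapper(line):
    # Assumed log format: "<method> <path> <status>".
    yield (line.split()[-1], 1)

def sum_reducer(key, values):
    return sum(values)

logs = ["GET /a 200", "GET /b 404", "GET /c 200"]
print(run_mapreduce(logs, status_mapper, sum_reducer))  # {'200': 2, '404': 1}
```

In a real cluster the map calls run on many machines and the grouping step moves data over the network, but the contract seen by `mapper` and `reducer` is the same.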

VI

Map

Đọc data → emit (key, value)

Shuffle

Group theo key.

Reduce

Nhận tất cả value theo key → tạo output.

6) Example: Word Count

EN

Map (each word is lowercased before it is emitted):
“Hello world hello” →

(hello,1), (world,1), (hello,1)

Reduce:

hello → 2  
world → 1
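
The word count above can be reproduced with a few lines of plain Python. This is a sketch of the data flow only; the helper names `map_words`, `shuffle`, and `reduce_counts` are illustrative, and the sort-then-group step stands in for the framework's shuffle.

```python
from itertools import groupby
from operator import itemgetter

def map_words(line):
    """Map: emit (word, 1) for every word, lowercased."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: sort by key, then collect the values of each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return {key: [value for _, value in group]
            for key, group in groupby(ordered, key=itemgetter(0))}

def reduce_counts(grouped):
    """Reduce: sum the list of ones for each word."""
    return {word: sum(ones) for word, ones in grouped.items()}

pairs = map_words("Hello world hello")
counts = reduce_counts(shuffle(pairs))
print(pairs)    # [('hello', 1), ('world', 1), ('hello', 1)]
print(counts)   # {'hello': 2, 'world': 1}
```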

VI

Map:
“Hello world hello” →

(hello,1), (world,1), (hello,1)

Reduce:

hello: 2  
world: 1

7) Job Scheduling

EN

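The scheduler (the JobTracker in Hadoop 1, the YARN ResourceManager together with a per-job ApplicationMaster in Hadoop 2+) assigns:

  • map tasks to nodes that already hold the input block (data locality), so most input is read from local disk rather than over the network
  • reduce tasks once a configurable fraction of map tasks has completed; reducers can begin fetching map output early, but the reduce function itself runs only after every map has finished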
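
Data-local placement can be illustrated with a greedy sketch. It is a deliberate simplification: real schedulers also consider rack locality, queues, and delays, and the function `assign_map_tasks` and its inputs are hypothetical.

```python
def assign_map_tasks(block_locations, free_slots):
    """Place one map task per block, preferring a node that holds a replica.

    block_locations: dict block_id -> list of nodes storing that block
    free_slots:      dict node -> number of free task slots
    Returns dict block_id -> (node, "local" or "remote").
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for block, replicas in block_locations.items():
        local = next((n for n in replicas if slots.get(n, 0) > 0), None)
        if local is not None:
            node, kind = local, "local"
        else:
            # No replica holder is free: fall back to the emptiest node.
            node = max(slots, key=slots.get, default=None)
            if node is None or slots[node] == 0:
                raise RuntimeError("no free task slots in the cluster")
            kind = "remote"
        slots[node] -= 1
        assignment[block] = (node, kind)
    return assignment

blocks = {"b1": ["n1", "n2"], "b2": ["n1", "n3"], "b3": ["n1"]}
slots = {"n1": 1, "n2": 1, "n3": 1, "n4": 1}
print(assign_map_tasks(blocks, slots))
# {'b1': ('n1', 'local'), 'b2': ('n3', 'local'), 'b3': ('n2', 'remote')}
```

Block `b3` ends up remote because its only replica holder, `n1`, was already taken by `b1`; three replicas per block make this outcome much rarer in practice.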

VI

Scheduler (YARN) gán:

  • map gần nơi lưu data
  • reduce khi map đủ hoàn thành

8) Shuffle Phase — The Hardest Part

EN

The shuffle moves all intermediate map output across the network from mappers to reducers, which often makes it the dominant cost of a job.

Optimizations:

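  • sort & merge: map output is sorted by key and spilled files are merged, so reducers can stream values grouped by key
  • compression of intermediate map output to cut network and disk I/O
  • combiner: pre-aggregate mapper output locally before it is sent
  • partitioner: decide which reducer receives each key, to balance load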
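
The effect of a combiner and a hash partitioner can be sketched for the word-count case. This is an illustration under assumptions: `map_with_combiner` pre-aggregates inside the mapper (valid here because addition is associative and commutative), and `partition` uses CRC32 only because Python's built-in `hash()` for strings changes between processes.

```python
import zlib
from collections import Counter

def partition(key, num_reducers):
    """Pick a reducer for a key with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_with_combiner(lines):
    """Count words locally so one (word, count) pair is sent per distinct word."""
    combined = Counter()
    for line in lines:
        for word in line.lower().split():
            combined[word] += 1
    return list(combined.items())

lines = ["a b a", "a c a"]
records_without_combiner = sum(len(line.split()) for line in lines)
combined = map_with_combiner(lines)
print(records_without_combiner)  # 6
print(combined)                  # [('a', 4), ('b', 1), ('c', 1)]

# Route each combined record to one of two reducers.
buckets = {}
for word, count in combined:
    buckets.setdefault(partition(word, 2), []).append((word, count))
```

Six shuffle records shrink to three, and every occurrence of a word is guaranteed to land in the same bucket because the partition depends only on the key.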

VI

Shuffle là phần nặng nhất: di chuyển rất nhiều data.

Tối ưu:

  • sort/merge
  • compress
  • combiner giảm output map
  • partitioner cân bằng load

9) Fault Tolerance

Map Failures

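EN → rerun the map task on another node; map output already written to the failed node's local disk is lost, so those completed maps are rerun as well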
VI → chạy lại mapper trên node khác

Reduce Failures

EN → rerun the reduce task on another node, which fetches the map outputs again
VI → chạy lại reduce

NameNode Failure

EN → run the NameNode in high-availability mode (active/standby): the standby takes over when the active NameNode fails
VI → NameNode có active/standby
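
Task-level recovery boils down to "try again somewhere else", which works because map and reduce tasks are deterministic and have no side effects outside their own output. A minimal sketch, with hypothetical names (`run_with_retries`, `flaky_mapper`) and an assumed limit of 4 attempts:

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Run task(node) on successive nodes until one attempt succeeds.

    Returns (result, node, attempt_number). max_attempts is an assumed limit.
    """
    errors = []
    for attempt, node in enumerate(nodes[:max_attempts], start=1):
        try:
            return task(node), node, attempt
        except Exception as exc:  # any failure counts as a failed attempt
            errors.append((node, repr(exc)))
    raise RuntimeError(f"task failed on every attempted node: {errors}")

def flaky_mapper(node):
    # Simulated failure: node n1 has a bad disk.
    if node == "n1":
        raise IOError("disk failure on n1")
    return f"map output computed on {node}"

print(run_with_retries(flaky_mapper, ["n1", "n2", "n3"]))
# ('map output computed on n2', 'n2', 2)
```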

10) Spark vs Hadoop MapReduce

EN

Hadoop MapReduce

✔ batch processing
✔ high latency
✔ disk-based
✔ robust

Spark

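✔ in-memory: intermediate results can be cached in RAM instead of being written to disk between stages
✔ often cited as up to 10–100x faster than Hadoop MapReduce, mainly for iterative workloads that fit in memory
✔ supports batch, streaming, SQL, and ML
✔ lazy evaluation: transformations build a DAG that is executed only when an action runs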

VI

Hadoop

✔ cho batch
✔ chậm hơn
✔ xử lý qua disk
✔ ổn định

Spark

✔ in-memory
✔ nhanh gấp 10–100 lần
✔ hỗ trợ streaming + ML
✔ tối ưu bằng DAG

11) Spark Execution Model

EN

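  • RDDs / DataFrames: immutable, partitioned datasets
  • transformations (lazy): operations such as map, filter, and join only record what to compute
  • actions (trigger execution): operations such as collect, count, or writing output run the recorded plan
  • lineage graph for fault recovery: a lost partition is recomputed from its parent data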
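
Lazy transformations and lineage can be imitated in plain Python. The class below is a toy, not Spark's API: `LazyDataset` and its methods are hypothetical and only borrow the names `map`, `filter`, and `collect` to show the idea.

```python
class LazyDataset:
    """A toy dataset that records transformations and runs them on collect()."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original, immutable input
        self._lineage = tuple(lineage)   # recorded (operation, function) steps

    def map(self, fn):
        # Transformation: nothing is computed, the step is only recorded.
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._lineage + (("filter", predicate),))

    def collect(self):
        # Action: replay the whole lineage over the source data.
        data = iter(self._source)
        for op, fn in self._lineage:
            data = map(fn, data) if op == "map" else filter(fn, data)
        return list(data)

    def describe(self):
        return " -> ".join(["source"] + [op for op, _ in self._lineage])

calls = []

def square(x):
    calls.append(x)   # record that real work happened
    return x * x

ds = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
print(calls)           # []  (no work done yet)
print(ds.describe())   # source -> map -> filter
print(ds.collect())    # [4, 16]
```

Because the result is always derivable from the source plus the recorded steps, calling `collect()` again after losing the output simply recomputes it, which is the essence of lineage-based recovery.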

VI

  • RDD/DataFrame
  • transformation (lười)
  • action (chạy job)
  • lineage để phục hồi khi lỗi

12) Storage Formats Optimized for MapReduce

EN

Columnar storage:

  • Parquet
  • ORC

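They provide:

  • compression: values of one column are stored together and compress well
  • predicate pushdown: per-block min/max statistics let the reader skip data that cannot match a filter
  • vectorized execution: columns are processed in batches rather than row by row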
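
Predicate pushdown is easiest to see with a toy model of row groups that carry min/max statistics. The data layout below is invented for illustration and is not the Parquet or ORC file format; `scan_greater_than` is a hypothetical helper.

```python
# Each "row group" stores one column's values plus min/max statistics.
row_groups = [
    {"min": 1,   "max": 100, "values": [1, 50, 100]},
    {"min": 101, "max": 200, "values": [101, 150, 200]},
    {"min": 201, "max": 300, "values": [201, 250, 300]},
]

def scan_greater_than(groups, threshold):
    """Return (matching values, number of row groups actually read)."""
    matches = []
    groups_read = 0
    for group in groups:
        if group["max"] <= threshold:
            continue  # statistics prove no value here can match: skip it
        groups_read += 1
        matches.extend(v for v in group["values"] if v > threshold)
    return matches, groups_read

print(scan_greater_than(row_groups, 220))  # ([250, 300], 1)
```

Only one of the three groups is read for the filter `value > 220`; the other two are eliminated from their statistics alone.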

VI

Định dạng cột:

  • Parquet
  • ORC

Ưu điểm:

  • query nhanh
  • nén tốt
  • filter hiệu quả

13) Cluster Scaling

EN

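Anti-pattern: routing all data to a single reducer, which serializes the job on one machine.

Scaling strategies:

  • hash partition: spread keys across partitions by hash(key) mod number of partitions
  • range partition: assign contiguous key ranges to partitions, which keeps the output globally sorted
  • repartition or coalesce (Spark): repartition reshuffles data into more or fewer partitions; coalesce reduces the partition count without a full shuffle
  • memory-aware task sizing: size partitions so that each task's working set fits in memory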
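
Range partitioning can be sketched with a sorted list of boundary keys and a binary search. The boundaries here are picked by hand as an assumption; real systems usually derive them by sampling the input. `range_partition` is a hypothetical helper.

```python
import bisect

def range_partition(key, boundaries):
    """Return the partition index for key, given sorted boundary keys.

    With boundaries [b0, b1], partition 0 holds key < b0,
    partition 1 holds b0 <= key < b1, and partition 2 holds key >= b1.
    """
    return bisect.bisect_right(boundaries, key)

boundaries = ["h", "p"]   # hand-picked split points, giving 3 partitions
keys = ["apple", "hello", "mango", "zebra"]
print([range_partition(k, boundaries) for k in keys])  # [0, 1, 1, 2]
```

Unlike hash partitioning, concatenating the partitions in index order yields globally sorted output, at the price of needing boundaries that match the key distribution.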

VI

Sai lầm: gom tất cả vào 1 reduce.

Giải pháp:

  • hash partition
  • range partition
  • chia lại partition hợp lý

14) Observability / Monitoring

EN

Monitor:

  • job duration
  • shuffle time
  • failed tasks
  • node hotspots
  • GC overhead
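
As one example of turning these metrics into a signal, the sketch below flags tasks that run far longer than the median, a common symptom of node hotspots or skew. The threshold factor of 1.5 and the name `find_stragglers` are assumptions made for illustration.

```python
from statistics import median

def find_stragglers(task_seconds, factor=1.5):
    """Return task ids whose duration exceeds factor * median duration."""
    typical = median(task_seconds.values())
    return sorted(task for task, seconds in task_seconds.items()
                  if seconds > factor * typical)

durations = {"map-0": 40, "map-1": 42, "map-2": 39, "map-3": 130}
print(find_stragglers(durations))  # ['map-3']
```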

VI

Theo dõi:

  • thời gian job
  • shuffle
  • số task lỗi
  • node quá tải
  • GC

15) Fault Scenarios

EN

  • data skew → fix via custom partitioner
  • mapper hotspot
  • reducer memory overflow
  • slow disk / network
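
One common way to spread a hot key across reducers is key salting with a second aggregation stage. This is a swapped-in technique shown as an example of a partitioning-level fix for skew, not necessarily the custom partitioner meant above; `two_stage_count`, the round-robin salt, and the fan-out of 4 are all assumptions.

```python
from collections import defaultdict

def two_stage_count(keys, hot_keys, fanout=4):
    """Count keys in two stages, splitting each hot key over `fanout` sub-keys."""
    # Stage 1: hot keys get a round-robin salt, so their records are spread
    # over `fanout` different (key, salt) groups instead of one.
    next_salt = 0
    partial = defaultdict(int)
    for key in keys:
        if key in hot_keys:
            salt = next_salt
            next_salt = (next_salt + 1) % fanout
        else:
            salt = 0
        partial[(key, salt)] += 1

    # Stage 2: drop the salt and merge the partial counts per original key.
    final = defaultdict(int)
    for (key, _salt), count in partial.items():
        final[key] += count
    return dict(partial), dict(final)

keys = ["hot"] * 1000 + ["cold"] * 10
partial, final = two_stage_count(keys, hot_keys={"hot"})
print(max(partial.values()))  # 250
print(final)                  # {'hot': 1000, 'cold': 10}
```

The largest group handled by any single first-stage reducer drops from 1000 records to 250, while the final counts are unchanged. This only works for aggregations that can be merged from partial results, such as sums and counts.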

VI

Lỗi phổ biến:

  • data skew
  • mapper hotspot
  • reduce OOM
  • disk/network chậm

16) When To Use MapReduce vs Spark vs Flink

EN

MapReduce

  • stable batch jobs
  • predictable workloads
  • petabyte scale

Spark

  • faster batch
  • ML + SQL + streaming

Flink

  • low-latency streaming
  • real-time analytics

VI

MapReduce: batch lớn, ổn định
Spark: batch nhanh, ML
Flink: streaming realtime

Published in System Design
