1) Problem Clarification / Làm rõ bài toán
EN
MapReduce is a programming model for batch-processing very large datasets (TB–PB) in parallel across a cluster of machines.
Used in:
- log analytics
- recommendation engines
- indexing (search engines)
- machine learning preprocessing
- ETL pipelines
VI
MapReduce dùng để xử lý dữ liệu khổng lồ (TB–PB) trên nhiều máy.
Ứng dụng:
- phân tích log
- recommendation
- tạo index search
- ML preprocessing
- ETL
2) High-Level Architecture / Kiến trúc tổng quan
EN
HDFS / Storage → Job Scheduler → Map Workers → Shuffle → Reduce Workers → Output
VI
HDFS → Scheduler → Map → Shuffle → Reduce → Kết quả
3) Why MapReduce? / Vì sao cần MapReduce?
EN
✔ horizontal scaling
✔ fault tolerance
✔ automatic data partitioning
✔ automatic recovery of failed tasks
✔ parallel processing
VI
✔ scale ngang
✔ chịu lỗi tốt
✔ tự chia dữ liệu
✔ tự phục hồi job
✔ xử lý song song
4) Distributed File System (HDFS)
EN
HDFS stores data as:
- blocks (128 MB by default, configurable)
- 3 replicas per block by default (replication factor 3)
- NameNode → metadata (namespace and block locations)
- DataNode → actual blocks
VI
HDFS lưu dưới dạng:
- block 128MB
- replicate 3 lần
- NameNode chứa metadata
- DataNode chứa block thực
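The block and replication figures above translate into simple arithmetic. The sketch below assumes the 128 MB block size and replication factor 3 given in the notes; the function name and the 1000 MB file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 128   # block size stated in the notes
REPLICATION = 3       # replication factor stated in the notes


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB across all replicas)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The last block only occupies the bytes it actually holds, so raw usage
    # is file size times replication, not blocks * block size * replication.
    raw_mb = file_size_mb * REPLICATION
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1000)  # hypothetical 1000 MB file
print(blocks, raw_mb)  # 8 3000
```

The block count also sets the default number of input splits, and therefore map tasks, for a file.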
5) MapReduce Programming Model
EN
Map Phase
Each input split is read by a mapper, which emits intermediate pairs:
(key, value)
Shuffle Phase
Partition, sort, and group the intermediate pairs by key.
Reduce Phase
The reducer receives all values for a given key:
key → [v1, v2, v3] → output
VI
Map
Đọc data → emit (key, value)
Shuffle
Group theo key.
Reduce
Nhận tất cả value theo key → tạo output.
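The three phases can be sketched as a single-process toy runner. This is only an in-memory illustration of the model, not how Hadoop implements it; `run_mapreduce` and the sensor data are hypothetical names chosen for the example.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    # Map phase: each input record yields zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle phase: group values by key. A real cluster also partitions
    # and sorts this data across nodes.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one reducer call per key, with all values for that key.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


# Hypothetical example: maximum reading per sensor.
readings = [("s1", 20), ("s2", 35), ("s1", 27), ("s2", 31)]
result = run_mapreduce(
    readings,
    mapper=lambda rec: [(rec[0], rec[1])],
    reducer=lambda key, values: max(values),
)
print(result)  # {'s1': 27, 's2': 35}
```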
6) Example: Word Count
EN
Map (each word is lowercased before it is emitted):
“Hello world hello” →
(hello,1), (world,1), (hello,1)
Reduce:
hello → 2
world → 1
VI
Map:
“Hello world hello” →
(hello,1), (world,1), (hello,1)
Reduce:
hello: 2
world: 1
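The word count above can be written in the style of Hadoop Streaming, where mapper and reducer exchange tab-separated key/value lines and the framework sorts between them. The sketch below runs on in-memory lists instead of stdin/stdout so it is self-contained; the function names are illustrative.

```python
from itertools import groupby


def map_words(lines):
    """Mapper: emit one tab-separated 'word, 1' line per word, lowercased."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reduce_counts(sorted_lines):
    """Reducer: input is sorted by key, so equal words are adjacent."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)


mapped = list(map_words(["Hello world hello"]))
shuffled = sorted(mapped)  # stands in for the framework's sort step
counts = dict(reduce_counts(shuffled))
print(counts)  # {'hello': 2, 'world': 1}
```

Because the reducer only relies on sorted input, it can stream through arbitrarily many lines while holding a single running total in memory.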
7) Job Scheduling
EN
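The scheduler (JobTracker in Hadoop 1; ResourceManager plus a per-job ApplicationMaster under YARN) assigns:
- map tasks close to their data (data locality)
- reduce tasks once enough map tasks have completed; the reduce function itself runs only after all map output has been fetched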
VI
Scheduler (YARN) gán:
- map gần nơi lưu data
- reduce khi map đủ hoàn thành
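Data-local assignment can be sketched as a greedy loop: prefer a node that already holds a replica of the split, otherwise fall back to any node with a free slot. This is a simplified illustration, not YARN's actual algorithm (rack locality and delay scheduling are omitted); all names and the cluster layout are hypothetical.

```python
def assign_map_tasks(split_replicas, free_slots):
    """Assign each split to a node, preferring a node that holds a replica.

    split_replicas: {split_id: [nodes holding a replica]}
    free_slots:     {node: number of free task slots}
    Returns {split_id: (node, "local" or "remote")}.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment = {}
    for split_id, replicas in split_replicas.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node, kind = local[0], "local"
        else:
            candidates = [n for n, free in slots.items() if free > 0]
            if not candidates:
                break  # no capacity left; remaining splits wait
            # Fall back to the node with the most free slots.
            node, kind = max(candidates, key=lambda n: slots[n]), "remote"
        slots[node] -= 1
        assignment[split_id] = (node, kind)
    return assignment


splits = {"split-0": ["A", "B"], "split-1": ["A", "C"], "split-2": ["A", "C"]}
plan = assign_map_tasks(splits, {"A": 1, "B": 1, "C": 1})
print(plan)
# {'split-0': ('A', 'local'), 'split-1': ('C', 'local'), 'split-2': ('B', 'remote')}
```

The remote assignment for split-2 is the case to minimize: that mapper has to pull its block over the network.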
8) Shuffle Phase — The Hardest Part
EN
The shuffle moves large volumes of intermediate data from mappers to reducers across the network.
Optimizations:
- sort & merge
- compression
- combiner to shrink mapper output before it is sent
- partitioner to balance load across reducers
VI
Shuffle là phần nặng nhất: di chuyển rất nhiều data.
Tối ưu:
- sort/merge
- compress
- combiner giảm output map
- partitioner cân bằng load
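Two of the optimizations above, the combiner and the partitioner, can be sketched in a few lines. The example assumes a sum-style aggregation (a combiner is only safe when the reduce operation is associative and commutative) and uses CRC32 as a stand-in hash; the function names are illustrative.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Combiner: pre-aggregate (key, count) pairs on the mapper side."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Hash partitioner: the same key always maps to the same reducer."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


map_output = [("hello", 1), ("world", 1), ("hello", 1), ("hello", 1)]
combined = combine(map_output)
print(len(map_output), "->", len(combined))  # 4 -> 2

# Route each combined pair to its reducer's bucket.
buckets = {}
for key, value in combined:
    buckets.setdefault(partition(key, 2), []).append((key, value))
```

Only the combined pairs cross the network, so the saving grows with the number of repeated keys per mapper.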
9) Fault Tolerance
Map Failures
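EN → rerun the map task on another node; completed map tasks on a failed node are also rerun, because their output lives on that node's local disk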
VI → chạy lại mapper trên node khác
Reduce Failures
EN → rerun the reduce task
VI → chạy lại reduce
NameNode Failure
EN → HA NameNode (active/standby)
VI → NameNode có active/standby
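The rerun-on-another-node behaviour for task failures can be sketched as a retry loop. This is a minimal illustration under assumed names; `max_attempts=4` is an arbitrary example value, and the failing node is simulated.

```python
def run_with_retries(task, nodes, max_attempts=4):
    """Run task(node) on successive nodes until one attempt succeeds."""
    failures = []
    for node in nodes[:max_attempts]:
        try:
            return task(node), node, failures
        except RuntimeError as exc:
            failures.append((node, str(exc)))
    raise RuntimeError(f"task failed after {len(failures)} attempts: {failures}")


def flaky_map_task(node):
    # Hypothetical task: pretend node "A" has a bad disk.
    if node == "A":
        raise RuntimeError("disk error")
    return sum(range(10))


result, node, failures = run_with_retries(flaky_map_task, ["A", "B", "C"])
print(result, node, failures)  # 45 B [('A', 'disk error')]
```

Rerunning is safe because map and reduce tasks are deterministic functions of their input, so a second attempt produces the same output.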
10) Spark vs Hadoop MapReduce
EN
Hadoop MapReduce
✔ batch processing
✔ high latency
✔ disk-based
✔ robust
Spark
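✔ in-memory
✔ often cited as 10–100x faster, mainly for iterative workloads that fit in memory (actual speedup is workload-dependent)
✔ supports batch, streaming, ML
✔ lazy evaluation with a DAG execution plan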
VI
Hadoop
✔ cho batch
✔ chậm hơn
✔ xử lý qua disk
✔ ổn định
Spark
✔ in-memory
✔ nhanh gấp 10–100 lần
✔ hỗ trợ streaming + ML
✔ tối ưu bằng DAG
11) Spark Execution Model
EN
- RDDs / DataFrames
- transformations (lazy: they only extend the plan)
- actions (trigger execution of the plan)
- lineage graph for fault recovery (lost partitions are recomputed from it)
VI
- RDD/DataFrame
- transformation (lười)
- action (chạy job)
- lineage để phục hồi khi lỗi
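Lazy transformations, actions, and lineage can be illustrated with a toy class. This is deliberately not the Spark API: `LazyDataset` and its methods are hypothetical, and real Spark tracks lineage per partition rather than per dataset.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""

    def __init__(self, source, lineage=()):
        self._source = source            # callable that re-reads the base data
        self._lineage = tuple(lineage)   # ordered (kind, function) steps

    # Transformations: return a new dataset, run nothing.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    # Action: replay the lineage from the source and materialize the result.
    def collect(self):
        data = iter(self._source())
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)

    def describe(self):
        return " -> ".join(["source"] + [kind for kind, _ in self._lineage])


reads = []


def load():
    reads.append(1)  # count how often the source is actually read
    return range(1, 6)


ds = LazyDataset(load).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(len(reads))     # 0: nothing has run yet
print(ds.collect())   # [1, 9, 25]
print(ds.describe())  # source -> map -> filter
```

Calling `collect()` again replays the same lineage from the source, which is the idea behind lineage-based recovery: lost results are recomputed rather than restored from replicas.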
12) Storage Formats Optimized for MapReduce
EN
Columnar storage:
- Parquet
- ORC
These formats support:
- compression
- predicate pushdown
- vectorized execution
VI
Định dạng cột:
- Parquet
- ORC
Ưu điểm:
- query nhanh
- nén tốt
- filter hiệu quả
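Predicate pushdown on a columnar layout can be sketched with per-row-group min/max statistics, which is the general idea behind row-group skipping in Parquet and stripe skipping in ORC. The structures below are a simplified illustration, not either format's real on-disk layout; all names and data are hypothetical.

```python
def build_row_groups(rows, group_size):
    """Split rows into row groups stored column by column, with min/max stats."""
    groups = []
    for start in range(0, len(rows), group_size):
        chunk = rows[start:start + group_size]
        columns = {name: [row[name] for row in chunk] for name in chunk[0]}
        stats = {name: (min(vals), max(vals)) for name, vals in columns.items()}
        groups.append({"columns": columns, "stats": stats})
    return groups


def scan_greater_than(groups, filter_col, threshold, project_col):
    """Return project_col values where filter_col > threshold.

    Also returns how many row groups actually had to be read.
    """
    out, scanned = [], 0
    for group in groups:
        _, col_max = group["stats"][filter_col]
        if col_max <= threshold:
            continue  # pushdown: no row in this group can match, skip it
        scanned += 1
        cols = group["columns"]
        # Columnar read: only the two needed columns are touched.
        out.extend(p for f, p in zip(cols[filter_col], cols[project_col]) if f > threshold)
    return out, scanned


rows = [{"user": f"u{i}", "latency_ms": i * 10} for i in range(8)]
groups = build_row_groups(rows, group_size=4)
print(scan_greater_than(groups, "latency_ms", 50, "user"))  # (['u6', 'u7'], 1)
```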
13) Cluster Scaling
EN
Anti-pattern: sending all data to a single reducer.
Scaling strategies:
- hash partition
- range partition
- repartition or coalesce (Spark)
- memory-aware task sizing
VI
Sai lầm: gom tất cả vào 1 reduce.
Giải pháp:
- hash partition
- range partition
- chia lại partition hợp lý
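Range partitioning can be sketched by picking split points from a sample of keys and locating each key with a binary search. This is a minimal illustration under assumed names; real frameworks sample the input rather than sorting every key.

```python
import bisect
from collections import Counter


def range_boundaries(sample_keys, num_partitions):
    """Choose num_partitions - 1 split points from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def range_partition(key, boundaries):
    """Keys below boundaries[0] go to partition 0, the next range to 1, and so on."""
    return bisect.bisect_right(boundaries, key)


keys = list(range(100))
bounds = range_boundaries(keys, 4)
print(bounds)  # [25, 50, 75]

sizes = Counter(range_partition(k, bounds) for k in keys)
print(sorted(sizes.items()))  # [(0, 25), (1, 25), (2, 25), (3, 25)]
```

Unlike hash partitioning, this keeps partitions globally ordered, which helps total-order sorts. The trade-off is that balance depends on how well the sample reflects the real key distribution.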
14) Observability / Monitoring
EN
Monitor:
- job duration
- shuffle time
- failed tasks
- node hotspots
- GC overhead
VI
Theo dõi:
- thời gian job
- shuffle
- số task lỗi
- node quá tải
- GC
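One way to turn task-level metrics into a hotspot signal is to flag tasks that run much longer than the median. The sketch below is illustrative only: the 1.5x factor, the function name, and the durations are assumptions, not values from any particular monitoring system.

```python
from statistics import median


def find_stragglers(task_seconds, factor=1.5):
    """Return task ids whose duration exceeds factor * median duration."""
    cutoff = factor * median(task_seconds.values())
    return sorted(t for t, secs in task_seconds.items() if secs > cutoff)


# Hypothetical per-task durations in seconds.
durations = {"map-0": 40, "map-1": 42, "map-2": 38, "map-3": 120, "map-4": 41}
print(find_stragglers(durations))  # ['map-3']
```

Grouping flagged tasks by the node they ran on would then separate a slow node from a skewed input split.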
15) Fault Scenarios
EN
- data skew → fix via custom partitioner
- mapper hotspot
- reducer memory overflow
- slow disk / network
VI
Lỗi phổ biến:
- data skew
- mapper hotspot
- reduce OOM
- disk/network chậm
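For data skew, a technique often used alongside a custom partitioner is key salting: append a random salt to each key so that one hot key spreads over several reducers, then merge the partial results in a second stage. Salting is not named in the notes above; the sketch assumes a count aggregation, and all names are hypothetical.

```python
import random
from collections import Counter


def salted_count(keys, num_salts, seed=0):
    """Two-stage count: salt keys to spread a hot key, then merge the partials."""
    rng = random.Random(seed)
    # Stage 1: count per (key, salt); each pair could go to a different reducer.
    partial = Counter((key, rng.randrange(num_salts)) for key in keys)
    # Stage 2: strip the salt and sum the partial counts per original key.
    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final


keys = ["hot"] * 1000 + ["cold"] * 10
partial, final = salted_count(keys, num_salts=4)
print(dict(final))  # {'hot': 1000, 'cold': 10}
```

The largest single group in stage 1 is now roughly a quarter of the hot key's records, at the cost of one extra aggregation pass.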
16) When To Use MapReduce vs Spark vs Flink
EN
MapReduce
- stable batch jobs
- predictable workloads
- petabyte scale
Spark
- faster batch
- ML + SQL + streaming
Flink
- low-latency streaming
- real-time analytics
VI
MapReduce: batch lớn, ổn định
Spark: batch nhanh, ML
Flink: streaming realtime
[…] Designing A MapReduce System (Hadoop / Spark) […]