
Designing Distributed SQL Database (Google Spanner / Yugabyte / CockroachDB)

1) Problem Clarification


Distributed SQL databases combine:

  • ACID transactions
  • SQL query support
  • horizontal scalability
  • multi-region replication

They behave like PostgreSQL but scale like a NoSQL system.


2) High-Level Architecture

Client → Query Router → SQL Layer (Parser, Optimizer, Planner) → Distributed KV Store → Raft/Paxos Groups → Replicated Shards


3) Key Concepts


A) Sharding

Data is split into contiguous key ranges (tablets).

B) Replication

Each shard is replicated via Raft or Paxos.

C) Consensus

Ensures replica safety; writes are serialized through a leader.

D) Distributed Query Execution

A single SQL query can run across multiple shards.


4) Data Sharding Model


Spanner, Yugabyte, and CockroachDB shard data by key range:

[0–100), [100–200), ...

Hot shards are automatically split.
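The range lookup and hot-shard split can be sketched in a few lines (a toy model; `RangeSharder` and its split policy are illustrative, not any system's actual API):

```python
import bisect

class RangeSharder:
    """Toy range sharder: shard i owns the half-open range
    [boundaries[i], boundaries[i+1])."""

    def __init__(self, boundaries):
        self.boundaries = sorted(boundaries)

    def shard_for(self, key):
        # Binary search for the shard whose range contains `key`.
        return bisect.bisect_right(self.boundaries, key) - 1

    def split(self, shard_idx, split_key):
        # Auto-split a hot shard at `split_key` into two smaller ranges.
        lo = self.boundaries[shard_idx]
        assert lo < split_key, "split point must fall inside the range"
        bisect.insort(self.boundaries, split_key)

sharder = RangeSharder([0, 100, 200])
print(sharder.shard_for(150))  # 1 → key 150 falls in [100, 200)
sharder.split(1, 150)          # [100, 200) becomes [100, 150) and [150, 200)
print(sharder.shard_for(150))  # 2 → key 150 now falls in [150, 200)
```

Splitting only inserts a boundary; in a real system the new range would also be moved to another node to spread the load.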


5) Replication via Consensus


Each shard has:

  • 1 leader
  • 2–4 followers

Writes go to leader → consensus commit → followers updated.
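The write path above can be sketched as a quorum commit (hypothetical `Replica`/`leader_write` names; real Raft adds terms, log matching, and retries):

```python
class Replica:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.log = []

    def append(self, entry):
        if not self.alive:
            return False  # a crashed follower never acks
        self.log.append(entry)
        return True

def leader_write(leader, followers, entry):
    # Writes always go through the shard leader.
    leader.append(entry)
    acks = 1  # the leader counts toward the quorum
    for f in followers:
        if f.append(entry):
            acks += 1
    quorum = (1 + len(followers)) // 2 + 1
    return acks >= quorum  # committed only once a majority acked

leader = Replica("leader")
followers = [Replica("f1"), Replica("f2", alive=False)]
print(leader_write(leader, followers, "SET x = 1"))  # True: 2 of 3 acked
```

With one dead follower out of three replicas the write still commits; losing a majority would make it fail, which is exactly the availability/safety trade consensus buys.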


6) Google Spanner’s TrueTime (Magic Component)


Problem: clock skew across machines affects transaction ordering.

Spanner solves this with the TrueTime API:

TT.now() returns [earliest, latest]

Guarantees bounded clock uncertainty (ɛ).

This enables external consistency (strict serializability).
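A toy model of the TrueTime interval (here `EPSILON` is an assumed constant; real TrueTime derives ɛ from GPS and atomic-clock references):

```python
import time

EPSILON = 0.007  # assumed uncertainty bound ɛ, e.g. 7 ms

def tt_now():
    # Returns an interval guaranteed to contain the true time.
    t = time.time()
    return (t - EPSILON, t + EPSILON)

def tt_after(ts):
    # True only once the true time is certainly later than ts.
    return tt_now()[0] > ts

def tt_before(ts):
    # True only while the true time is certainly earlier than ts.
    return tt_now()[1] < ts

earliest, latest = tt_now()
print(latest - earliest)  # interval width = 2ɛ
```

The point of the interval shape: any ordering decision made only when `tt_after`/`tt_before` returns True is safe regardless of which machine's clock produced the timestamp.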


7) Timestamp-Based Transactions


Commit wait:

The database waits until:

TT.now().earliest > commit_timestamp

i.e. until the uncertainty window around the commit timestamp has certainly passed. This ensures no later transaction can be assigned an earlier timestamp.
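Commit wait can be sketched with a toy interval clock (assuming a fixed uncertainty bound `EPSILON`; a real implementation waits on TrueTime itself, not `time.sleep`):

```python
import time

EPSILON = 0.007  # assumed clock-uncertainty bound ɛ

def tt_now():
    t = time.time()
    return (t - EPSILON, t + EPSILON)  # [earliest, latest]

def commit_wait(commit_ts):
    # Block until even the earliest possible true time is past commit_ts,
    # so no later transaction can be assigned an earlier timestamp.
    while tt_now()[0] <= commit_ts:
        time.sleep(0.001)

start = time.time()
commit_ts = tt_now()[1]  # commit timestamp: latest possible current time
commit_wait(commit_ts)
print(time.time() - start)  # roughly 2ɛ elapsed
```

The wait is on the order of 2ɛ, which is why Spanner invests so heavily in keeping ɛ small: commit latency is directly tied to clock quality.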


8) Distributed SQL Query Execution


Queries run in parallel across shards:

  • local index lookups
  • remote shard scans
  • distributed joins
  • distributed aggregates

The query planner chooses the best strategy.
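A distributed aggregate can be sketched as scatter-gather over toy in-memory shards; the key point is the shape of the partial result: AVG must ship (sum, count) rather than per-shard averages:

```python
# Toy shards: each inner list stands in for the rows on one shard.
shards = [
    [3, 5, 7],   # shard 0
    [10, 20],    # shard 1
    [1],         # shard 2
]

def local_partial(rows):
    # Runs on each shard in parallel; returns a mergeable partial result.
    return (sum(rows), len(rows))

def merge_avg(partials):
    # Coordinator merges partials into the final AVG.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

partials = [local_partial(rows) for rows in shards]
print(merge_avg(partials))  # 46 / 6 ≈ 7.67
```

Averaging the per-shard averages would weight shard 2's single row as heavily as shard 0's three rows, which is why the partial must carry the count.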


9) Secondary Indexes in a Distributed SQL DB


Indexes are also sharded and replicated.

Two types:

  • Local index (stored on the same shard as the data)
  • Global index (sharded independently, by the indexed column)

Problems:

  • index write amplification
  • cross-shard index consistency
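Write amplification can be illustrated with a toy write planner (the row layout and the `writes_for_insert` helper are hypothetical):

```python
def writes_for_insert(row, local_indexes, global_indexes):
    # One logical INSERT fans out into a base-table write plus one
    # write per secondary index.
    writes = [("base", row["id"])]
    for col in local_indexes:
        # Local index: lives on the same shard as the row → no extra
        # coordination needed.
        writes.append(("local_idx:" + col, row[col]))
    for col in global_indexes:
        # Global index: sharded by the indexed column → possibly a
        # remote shard, turning the insert into a distributed txn.
        writes.append(("global_idx:" + col, row[col]))
    return writes

row = {"id": 42, "email": "a@b.c", "country": "VN"}
print(writes_for_insert(row, ["country"], ["email"]))
# one logical insert → three physical writes (3x amplification)
```

Each global index in the list adds both a write and a potential cross-shard participant, which is the consistency problem the bullet above refers to.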


10) Multi-Region Replication


Modes:

  • Read-local / Write-global
  • Multi-primary (geo-distributed)
  • Follower reads for low-latency queries
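A follower read can be sketched as a freshness check (assumed logical timestamps and staleness model):

```python
def serve_read(follower_applied_ts, read_ts, max_staleness):
    # A follower may answer locally if it has applied the replicated log
    # to within `max_staleness` of the requested read timestamp;
    # otherwise the read is forwarded to the (possibly remote) leader.
    return read_ts - follower_applied_ts <= max_staleness

print(serve_read(follower_applied_ts=95, read_ts=100, max_staleness=10))  # True: serve locally
print(serve_read(follower_applied_ts=80, read_ts=100, max_staleness=10))  # False: go to leader
```

This is the trade behind follower reads: the client accepts bounded staleness in exchange for skipping a cross-region round trip.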


11) Transactions in Distributed SQL


Transactions use:

  • 2PL (two-phase locking)
  • MVCC (multi-version concurrency control)
  • timestamp ordering

Most systems use MVCC to provide snapshot isolation.
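A minimal MVCC sketch (toy in-memory store): each write keeps its commit timestamp, and a reader at snapshot T sees the newest version at or before T, so writers never block readers:

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}  # key → sorted list of (commit_ts, value)

    def write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))
        self.versions[key].sort()

    def read(self, key, snapshot_ts):
        # Newest version visible at the snapshot, or None if the key
        # did not exist yet at that time.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None

db = MVCCStore()
db.write("x", "old", commit_ts=10)
db.write("x", "new", commit_ts=20)
print(db.read("x", snapshot_ts=15))  # old — snapshot at 15 predates ts=20
print(db.read("x", snapshot_ts=25))  # new
```

Real engines add garbage collection of old versions and conflict checks at commit time; the read path above is the core of snapshot isolation.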


12) ACID in Distributed System


✔ Atomicity → distributed commit (2PC) layered over consensus groups
✔ Consistency → constraint & schema enforcement
✔ Isolation → snapshot isolation / serializable (MVCC)
✔ Durability → replicated write-ahead log


13) Fault Tolerance


Leader failure → auto re-election in Raft.
Region failure → failover to nearest healthy region.
Node failure → shard rebalancing.
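Raft-style re-election can be sketched with randomized timeouts (simplified: real Raft also checks terms and log freshness before granting votes):

```python
import random

def elect_new_leader(nodes, failed_leader):
    # Surviving followers race on randomized election timeouts;
    # the first timer to fire campaigns and, in this sketch, wins.
    alive = [n for n in nodes if n != failed_leader]
    timeouts = {n: random.uniform(0.150, 0.300) for n in alive}
    return min(timeouts, key=timeouts.get)

print(elect_new_leader(["n1", "n2", "n3"], failed_leader="n1"))
```

The randomized window is what makes split votes unlikely: two followers rarely time out at the same instant.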


14) Schema Management


DDL must also be distributed:

  • schema change transactions
  • online schema migration
  • versioned metadata
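Versioned metadata can be sketched as a compatibility check between adjacent schema versions (a toy example; real systems stage changes through intermediate states so nodes on neighboring versions can coexist):

```python
# Hypothetical versioned schema catalog: version → table → columns.
schema_versions = {
    1: {"users": ["id", "name"]},
    2: {"users": ["id", "name", "email"]},  # additive change
}

def compatible(old_v, new_v):
    # An adjacent version pair is safe to roll out online when the new
    # version only adds columns, so readers on the old version still work.
    for table, cols in schema_versions[old_v].items():
        if not set(cols).issubset(schema_versions[new_v].get(table, [])):
            return False
    return True

print(compatible(1, 2))  # True: v2 only adds "email"
```

This is why "add column NOT NULL with default" is typically staged as nullable-add then backfill then constraint: each step stays compatible with its neighbor.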


15) Scaling Strategy


Scale by:

  • adding nodes → auto rebalancing
  • splitting hot ranges
  • scattering leaders across nodes
  • optimizing locality (geo-partitioning)
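Rebalancing after adding a node can be sketched with a naive placement policy (round-robin here; real systems also weigh load, leader placement, and locality):

```python
def place_shards(shards, nodes):
    # Naive round-robin placement: shard i goes to node i mod N.
    return {shard: nodes[i % len(nodes)] for i, shard in enumerate(shards)}

shards = ["s0", "s1", "s2", "s3", "s4", "s5"]
print(place_shards(shards, ["n1", "n2"]))        # 3 shards per node
print(place_shards(shards, ["n1", "n2", "n3"]))  # after adding n3: 2 per node
```

Adding a node shrinks each node's share automatically; production rebalancers additionally throttle the data movement this reassignment implies.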


16) Observability


Metrics:

  • Raft log lag
  • commit latency
  • hot shard detection
  • contention rate
  • read/write amplification
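Raft log lag, the first metric above, can be computed as the gap between the leader's last log index and each follower's applied index (a sketch with assumed index values):

```python
def raft_log_lag(leader_index, follower_indexes):
    # Lag per follower: how many log entries behind the leader it is.
    # Sustained lag flags a slow or partitioned replica before it
    # drops out of the quorum.
    return {f: leader_index - idx for f, idx in follower_indexes.items()}

print(raft_log_lag(100, {"f1": 100, "f2": 97, "f3": 42}))
# {'f1': 0, 'f2': 3, 'f3': 58} → f3 is badly lagging
```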


17) Real-World Examples

Google Spanner

  • TrueTime
  • externally consistent
  • global scale

Yugabyte

  • PostgreSQL-compatible
  • Raft-based
  • high performance

CockroachDB

  • auto-sharding
  • Raft-based replication
  • PostgreSQL wire protocol
  • strong consistency
