Designing a Search System

1) Problem Clarification / Làm rõ bài toán

EN

We want to build a scalable search engine that supports:

keyword search
phrase search
fuzzy matching
ranking
autocomplete
typo correction

VI

Thiết kế hệ thống search phải hỗ trợ:

tìm từ khóa
tìm cụm từ
fuzzy match
ranking
gợi ý từ (autocomplete)
sửa lỗi chính tả

2) High-Level Architecture / Kiến trúc tổng quan

Crawler / Producer → Indexer → Inverted Index Storage → Query Engine → Ranking Layer → Results
                                          ↓
                                      Auto-Suggest Trie

VI

Crawler/Producer → Indexer → Inverted Index → Query Engine → Ranking → Kết quả
Tách riêng mô-đun Auto-suggest (Trie / Prefix Index)

3) Inverted Index — Core Structure / Cấu trúc inverted index

EN

The key data structure:

word → list of documents containing that word

Each entry includes:

doc_id
term frequency
positions (token offsets within the document, needed for phrase search)
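
As a minimal sketch of the structure described above, the following builds an in-memory inverted index. The names `Posting` and `build_inverted_index` are illustrative assumptions, and whitespace splitting stands in for a real tokenizer.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Posting:
    """One index entry: a document plus how often and where the word occurs."""
    doc_id: int
    term_frequency: int = 0
    positions: list[int] = field(default_factory=list)


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[Posting]]:
    """Map each word to the list of postings for documents containing it."""
    by_word: dict[str, dict[int, Posting]] = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase + whitespace split is enough for this sketch.
        for position, word in enumerate(text.lower().split()):
            posting = by_word[word].setdefault(doc_id, Posting(doc_id))
            posting.term_frequency += 1
            posting.positions.append(position)
    # Keep postings sorted by doc_id so lists can later be merged or intersected.
    return {
        word: sorted(postings.values(), key=lambda p: p.doc_id)
        for word, postings in by_word.items()
    }


docs = {1: "red apple pie", 2: "apple and more apple"}
index = build_inverted_index(docs)
```

Here `index["apple"]` holds two postings: document 1 with one occurrence at position 1, and document 2 with two occurrences at positions 0 and 3.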

VI

Cấu trúc chính của search engine:

từ khóa → danh sách document chứa từ này

Mỗi record gồm:

doc_id
tần suất
vị trí xuất hiện

4) Indexing Pipeline / Pipeline tạo chỉ mục

EN

Steps:

Tokenization: split text into words.
Normalization: lowercase, remove stopwords, apply stemming or lemmatization.
Inverted index construction: build the posting lists from the normalized tokens.
Sharding and replication: shard and replicate the index for scale and availability.
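
The tokenization and normalization steps above can be sketched as follows. This is an illustration under stated assumptions: `STOPWORDS` is a tiny made-up list, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical, tiny stopword list; real systems use language-specific lists.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}


def tokenize(text: str) -> list[str]:
    """Split text into word tokens (runs of letters, digits, or underscores)."""
    return re.findall(r"\w+", text)


def crude_stem(token: str) -> str:
    """Naive suffix stripping; NOT a real stemming algorithm such as Porter."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(tokens: list[str]) -> list[str]:
    """Lowercase, drop stopwords, then stem what remains."""
    lowered = (token.lower() for token in tokens)
    return [crude_stem(token) for token in lowered if token not in STOPWORDS]


def analyze(text: str) -> list[str]:
    """Full analysis chain: tokenize, then normalize."""
    return normalize(tokenize(text))
```

For example, `analyze("The Phones and the Charging cables")` yields `["phone", "charg", "cable"]`. The same `analyze` function would have to run at query time so that query terms match indexed terms.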

VI

Các bước:

Token hoá
Chuẩn hoá (lowercase, stopwords, stemming)
Tạo inverted index
Shard + replicate để scale

5) Query Parsing / Xử lý câu truy vấn

EN

Support:

boolean queries (AND, OR, NOT)
phrase search (“iphone 14 pro”)
fuzzy matching
typo tolerance (Levenshtein)
synonyms
boosting fields
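
A rough sketch of how boolean operators and phrase search could be evaluated against a positional index. The index layout (`term -> {doc_id: positions}`) and all function names are assumptions for illustration; fuzzy matching, synonyms, and field boosting are not covered here.

```python
# term -> {doc_id: sorted token positions}; a hypothetical in-memory layout.
PositionalIndex = dict[str, dict[int, list[int]]]


def docs_with(index: PositionalIndex, term: str) -> set[int]:
    """All doc IDs that contain the term."""
    return set(index.get(term, {}))


def boolean_and(index: PositionalIndex, terms: list[str]) -> set[int]:
    """Docs containing every term."""
    sets = [docs_with(index, term) for term in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index: PositionalIndex, terms: list[str]) -> set[int]:
    """Docs containing at least one term."""
    return set().union(*(docs_with(index, term) for term in terms))


def boolean_not(index: PositionalIndex, candidates: set[int], term: str) -> set[int]:
    """Candidates that do NOT contain the term."""
    return candidates - docs_with(index, term)


def phrase_search(index: PositionalIndex, phrase_terms: list[str]) -> set[int]:
    """Docs where the terms appear at consecutive positions, in order."""
    matches = set()
    for doc_id in boolean_and(index, phrase_terms):
        for start in index[phrase_terms[0]][doc_id]:
            if all(start + offset in index[term][doc_id]
                   for offset, term in enumerate(phrase_terms)):
                matches.add(doc_id)
                break
    return matches


# doc 1: "iphone 14 pro", doc 2: "14 pro iphone", doc 3: "pro"
index = {
    "iphone": {1: [0], 2: [2]},
    "14": {1: [1], 2: [0]},
    "pro": {1: [2], 2: [1], 3: [0]},
}
```

With this sample index, the phrase query `["iphone", "14", "pro"]` matches only document 1, even though document 2 contains all three words.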

VI

Hỗ trợ:

toán tử boolean
tìm cụm từ
fuzzy
sửa lỗi chính tả
từ đồng nghĩa
tăng trọng số theo field

6) Ranking & Relevance / Ranking kết quả

EN

Ranking factors:

term frequency (TF)
inverse document frequency (IDF)
BM25 score (a widely used default; for example, it is the default similarity in Lucene and Elasticsearch)
recency boost
personalization
click-through data
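
To show how TF, IDF, and document length combine in BM25, here is a small sketch. It uses the common formulation with `k1` and `b` parameters and the smoothed IDF `ln(1 + (N - n + 0.5) / (n + 0.5))`; the default values 1.2 and 0.75 are conventional choices, and the function name and input layout are assumptions. Recency, personalization, and click signals would be layered on top and are not shown.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document against the query.

    docs: {doc_id: list of tokens}; assumed non-empty.
    Returns {doc_id: BM25 score}.
    """
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            n = doc_freq[term]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores[doc_id] = score
    return scores


docs = {
    1: ["apple", "pie"],
    2: ["apple", "apple", "tart"],
    3: ["banana", "bread"],
}
scores = bm25_scores(["apple"], docs)
```

In this sample, document 2 ranks above document 1 because the extra occurrence of "apple" outweighs its slightly longer length, and document 3 scores zero.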

VI

Ranking dựa vào:

TF
IDF
BM25 (tiêu chuẩn)
độ mới
tuỳ biến theo user
dữ liệu click thực tế

7) Distributed Search Architecture / Kiến trúc search phân tán

EN

Shard index by:

document ID
field type
language

Query → scatter → gather → aggregate → rank.
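
The scatter-gather flow can be sketched like this, assuming sharding by document ID with a simple modulo rule. `search_shard` is a hypothetical stand-in for a real per-shard search, and scoring by raw term count is only a placeholder for proper ranking.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def shard_for(doc_id: int, num_shards: int) -> int:
    """Route a document to a shard by its ID (assumed modulo placement)."""
    return doc_id % num_shards


def search_shard(shard: dict[int, str], term: str, k: int):
    """Hypothetical per-shard search: returns the local top-k (score, doc_id)."""
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def scatter_gather(shards: list[dict[int, str]], term: str, k: int = 3):
    """Scatter the query to all shards, gather local top-k lists, re-rank globally."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term, k), shards)
        merged = [hit for partial in partials for hit in partial]
    return heapq.nlargest(k, merged)


docs = {0: "apple apple", 1: "apple", 2: "pear", 3: "apple apple apple"}
shards = [dict() for _ in range(2)]
for doc_id, text in docs.items():
    shards[shard_for(doc_id, len(shards))][doc_id] = text

top = scatter_gather(shards, "apple", k=2)
```

Each shard only needs to return its own top k hits, since no document outside a shard's local top k can appear in the global top k.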

VI

Shard theo:

doc ID
kiểu dữ liệu
ngôn ngữ

Query sẽ: scatter → gather → merge → rank.

8) Auto-Suggest / Autocomplete System

EN

Two common structures:

A) Trie (prefix tree)

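Suggests completions by walking the tree along the typed prefix.

B) N-gram / Edge-gram index

Indexes every leading substring of a term so that a prefix becomes an exact term lookup. Elasticsearch supports this through its edge_ngram tokenizer and token filter.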
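
Both structures can be illustrated in a few lines. This is a minimal sketch: the `Trie` class and the `edge_ngrams` helper are illustrative names, suggestions come back in alphabetical order rather than by popularity, and `edge_ngrams` only mimics the idea of edge n-gram tokenization rather than any engine's actual implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if a full word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix: str, limit: int = 5) -> list[str]:
        """Return up to `limit` stored words starting with `prefix`, alphabetically."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, word = stack.pop()
            if current.is_word:
                results.append(word)
            # Push children in reverse order so the smallest character pops first.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], word + ch))
        return results


def edge_ngrams(word: str, min_len: int = 1, max_len: int = 10) -> list[str]:
    """All prefixes of `word` between min_len and max_len characters."""
    return [word[:n] for n in range(min_len, min(len(word), max_len) + 1)]


trie = Trie()
for term in ["iphone", "ipad", "ipod", "pixel"]:
    trie.insert(term)
```

Here `trie.suggest("ip")` returns `["ipad", "iphone", "ipod"]`, and `edge_ngrams("ipad")` returns `["i", "ip", "ipa", "ipad"]`. The trie keeps lookups proportional to prefix length, while the edge n-gram approach trades extra index size for reuse of the existing inverted index.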

VI

Hai kiểu chính:

A) Trie

Gợi ý theo prefix.

B) N-gram

Chỉ mục n-gram để gợi ý nhanh.

9) Typo Correction / Sửa lỗi chính tả

EN

Use:

Levenshtein distance
phonetic matching (Soundex)
popularity bias (prefer common terms)
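
A minimal sketch combining Levenshtein distance with popularity bias. The `correct` function and the `term_counts` dictionary (term to query or corpus frequency) are assumptions for illustration; phonetic matching is not included, and scanning the whole vocabulary is only viable for small term sets.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def correct(word: str, term_counts: dict[str, int], max_distance: int = 2) -> str:
    """Pick the closest known term; break distance ties by higher popularity."""
    if word in term_counts:
        return word
    candidates = [
        (levenshtein(word, term), -count, term)
        for term, count in term_counts.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_distance]
    # min() orders by distance first, then by -count (most popular first).
    return min(candidates)[2] if candidates else word
```

For example, with `term_counts = {"cat": 10, "car": 100}`, the input `"caz"` is one edit away from both terms, so the more popular `"car"` is chosen.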

VI

Sửa lỗi dùng:

Levenshtein
phonetic (Soundex)
ưu tiên từ phổ biến

10) Handling High Throughput / Chịu tải lớn

EN

Optimizations:

distributed indexing workers
segment merging
lazy refresh (make new writes searchable on a longer interval instead of after every update)
query caching
SSD-based index storage
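
Segment merging is the least self-explanatory item above, so here is a rough sketch of the idea. It assumes each segment is an immutable mapping from term to a sorted list of doc IDs and that deletions are tracked separately; `merge_segments` is a hypothetical name, not any engine's API.

```python
import heapq


def merge_segments(segments, deleted_doc_ids=frozenset()):
    """Merge several small segments into one, dropping deleted documents.

    segments: list of {term: sorted list of doc IDs}; doc IDs are assumed
    to be unique across segments.
    """
    merged = {}
    all_terms = set().union(*(segment.keys() for segment in segments))
    for term in all_terms:
        runs = [segment[term] for segment in segments if term in segment]
        # heapq.merge combines already-sorted runs without re-sorting.
        doc_ids = [d for d in heapq.merge(*runs) if d not in deleted_doc_ids]
        if doc_ids:
            merged[term] = doc_ids
    return merged


segment_a = {"apple": [1, 4], "pie": [4]}
segment_b = {"apple": [2, 7], "tart": [7]}
merged = merge_segments([segment_a, segment_b], deleted_doc_ids={4})
```

After the merge, a query touches one posting list per term instead of one per segment, and the space held by deleted document 4 is reclaimed.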

VI

Tối ưu:

index worker phân tán
merge segment
refresh lười
cache query
lưu index trên SSD

11) Search Latency Optimization

EN

pre-ranking
caching popular queries
vector search for semantic matching
query rewriting & expansion

VI

pre-ranking
cache truy vấn phổ biến
vector search cho semantic
rewrite + expand query

12) Vector Search (Semantic Search) — Modern Approach

EN

Use embeddings (for example from OpenAI models, BERT, or Sentence Transformers):

query vector → ANN search → cosine similarity

Technologies:

FAISS
Milvus
Pinecone
Elasticsearch kNN
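
To make the similarity step concrete, here is an exact brute-force nearest-neighbor search by cosine similarity. This is the baseline that ANN libraries approximate with far less work per query; it is not how FAISS, Milvus, or the other systems listed are implemented. The vectors are made-up two-dimensional examples, whereas real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2):
    """Exact k-nearest neighbors: score every document, keep the top k."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; in practice these come from an embedding model.
doc_vecs = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
neighbors = nearest([1.0, 0.0], doc_vecs, k=2)
```

The brute-force scan costs one similarity computation per document, which is why large collections rely on approximate indexes instead.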

VI

Dùng embedding để tìm giống nghĩa:

vector query → ANN search → similarity

Công nghệ:

FAISS
Milvus
Pinecone
ES kNN

13) Observability / Giám sát

EN

Metrics:

QPS
p99 latency
recall & precision
index freshness
suggestion accuracy
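
Two of these metrics are easy to get subtly wrong, so a small sketch may help. It assumes the nearest-rank definition of a percentile and set-based precision and recall against a hand-labeled relevant set; the function names are illustrative.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def precision_recall(retrieved, relevant) -> tuple[float, float]:
    """Precision: share of retrieved docs that are relevant.
    Recall: share of relevant docs that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


latencies_ms = list(range(1, 101))  # Made-up sample: 1 ms .. 100 ms.
p99 = percentile(latencies_ms, 99)
precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
```

With the sample data, `p99` is 99 ms, precision is 0.5 (two of four results are relevant), and recall is two thirds (two of three relevant documents were found).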

VI

Theo dõi:

QPS
latency p99
recall/precision
độ mới của index
độ chính xác gợi ý

14) Failure Handling

EN

shard fail → reroute to replicas
partial search result → degrade gracefully
index corruption → rebuild from source
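
The first two behaviors can be sketched together: try each replica of a shard in turn, and if every replica fails, return what the other shards found and flag the response as partial. The `ShardUnavailable` exception, the callable-based replica model, and the response shape are all assumptions made for illustration.

```python
class ShardUnavailable(Exception):
    """Raised by a replica that cannot serve the query (hypothetical)."""


def query_with_failover(shard_groups, query):
    """shard_groups: one list of replicas per shard.

    Each replica is modeled as a callable taking the query and returning hits.
    """
    hits, failed_shards = [], 0
    for replicas in shard_groups:
        for replica in replicas:
            try:
                hits.extend(replica(query))
                break  # This shard answered; move to the next shard.
            except ShardUnavailable:
                continue  # Reroute to the next replica.
        else:
            # No replica answered: degrade gracefully instead of failing.
            failed_shards += 1
    return {"hits": hits, "partial": failed_shards > 0, "failed_shards": failed_shards}


def healthy(hits):
    return lambda query: hits


def down(query):
    raise ShardUnavailable()


response = query_with_failover(
    [[down, healthy(["d1"])], [down, down], [healthy(["d2"])]],
    "apple",
)
```

In this run the first shard is served by its second replica, the second shard is entirely down, and the response carries `["d1", "d2"]` with `partial` set to `True` so the caller can tell the result set is incomplete.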

VI

shard down → dùng replica
không đủ kết quả → degrade
index hỏng → rebuild từ dữ liệu gốc

15) Use Case Examples

EN

E-commerce Search

boost sponsored products
category-specific ranking
personalization by user history

Social Search

semantics + trend awareness

Enterprise Search

ACL-based filtering
document embedding search
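
As one illustration of ACL-based filtering, the sketch below removes hits the user is not allowed to see after retrieval. The `doc_acl` mapping (document to allowed groups) and the deny-by-default rule for documents without an ACL entry are assumptions; production systems often push this filter into the query itself so that ranking and paging only ever see permitted documents.

```python
def filter_by_acl(hits, doc_acl, user_groups):
    """Keep only hits whose ACL shares at least one group with the user.

    hits: ranked list of doc IDs.
    doc_acl: {doc_id: set of groups allowed to read it}.
    Documents without an ACL entry are hidden (deny by default).
    """
    user_groups = set(user_groups)
    return [
        doc_id for doc_id in hits
        if doc_acl.get(doc_id, set()) & user_groups
    ]


doc_acl = {
    "handbook": {"all-staff"},
    "salaries": {"hr"},
    "roadmap": {"engineering", "product"},
}
visible = filter_by_acl(
    ["salaries", "roadmap", "handbook", "untagged"],
    doc_acl,
    user_groups=["all-staff", "engineering"],
)
```

The ranked order of the surviving hits is preserved: here the user sees `["roadmap", "handbook"]`.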

VI

Search thương mại điện tử:

ưu tiên sản phẩm quảng cáo
ranking theo danh mục
cá nhân hoá

Search mạng xã hội:

semantic + trend

Search doanh nghiệp:

ACL search
embedding văn bản