1) Problem Clarification
We need to build a centralized logging and tracing platform that lets engineers:
- Search logs
- Trace distributed requests across microservices
- Detect failures and performance bottlenecks
2) Requirements Definition
Functional Requirements
✔ Store application logs
✔ Search via UI and query/filter
✔ Distributed tracing with spans
✔ Retention lifecycle
✔ Alerting and dashboards
Non-functional Requirements
✔ Scale to terabytes per day
✔ Fast search (p95 < 1 s)
✔ High ingestion throughput
✔ Fault tolerance
✔ Query isolation
3) Scale Estimation
Assume:
- 10M events/min ingested
- 30-day retention
- ~50 TB/month of storage
- 500 concurrent queries
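As a quick sanity check, the assumptions above can be turned into per-second and per-event figures. This is only back-of-envelope arithmetic; treating 1 TB as 10**12 bytes and a month as 30 days are assumptions made here, not part of the original estimate.

```python
# Back-of-envelope sizing derived from the stated assumptions.
EVENTS_PER_MIN = 10_000_000
RETENTION_DAYS = 30
STORAGE_PER_MONTH_TB = 50  # assumed decimal terabytes (10**12 bytes)

events_per_sec = EVENTS_PER_MIN / 60
events_per_month = EVENTS_PER_MIN * 60 * 24 * RETENTION_DAYS
bytes_per_event = STORAGE_PER_MONTH_TB * 10**12 / events_per_month
tb_per_day = STORAGE_PER_MONTH_TB / RETENTION_DAYS

print(f"{events_per_sec:,.0f} events/s")            # 166,667 events/s
print(f"{events_per_month:,} events per 30 days")   # 432,000,000,000 events per 30 days
print(f"{bytes_per_event:.0f} bytes stored/event")  # 116 bytes stored/event
print(f"{tb_per_day:.2f} TB/day")                   # 1.67 TB/day
```

Roughly 116 stored bytes per event is small for a raw log line, so the 50 TB/month figure likely presumes compression or compact events.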
4) High Level Architecture
Two main pipelines:
- Logging pipeline
- Tracing pipeline
Flow:
App → Log Agent → Queue → Ingest Service → Storage → Index → Search API → UI
App → Trace SDK → Collector → Trace Store → Query Engine → UI
5) Log Ingestion Pipeline
Buffering is needed to protect the backend:
- Fluentd / Logstash agent on each node
- Kafka / Kinesis queue
- Ingest consumers that write in batches
Batching improves throughput and reduces I/O.
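The batching step can be sketched as a small buffer that flushes when it is full or when its oldest event is too old. `BatchWriter`, `sink`, and the thresholds are hypothetical names and values chosen for illustration; a real consumer would write to the storage backend and commit queue offsets after each flush.

```python
import time
from typing import Callable, List, Optional


class BatchWriter:
    """Buffers events and flushes when the batch is full or too old."""

    def __init__(self, sink: Callable[[List[dict]], None],
                 max_events: int = 500, max_age_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.sink = sink              # receives one list of events per flush
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.clock = clock
        self.buffer: List[dict] = []
        self.first_at: Optional[float] = None

    def add(self, event: dict) -> None:
        if not self.buffer:
            self.first_at = self.clock()  # age is measured from the first event
        self.buffer.append(event)
        too_old = self.clock() - self.first_at >= self.max_age_s
        if len(self.buffer) >= self.max_events or too_old:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []
            self.first_at = None


batches: List[List[dict]] = []
writer = BatchWriter(batches.append, max_events=3)
for i in range(7):
    writer.add({"msg": f"event {i}"})
writer.flush()  # flush the partial final batch
print([len(b) for b in batches])  # [3, 3, 1]
```

In this sketch the age check only runs when a new event arrives, so a production version would also flush from a timer.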
6) Storage Model
For logs:
- Elasticsearch / ClickHouse / OpenSearch index
- Hot → Warm → Cold storage tiers
- Partition by time + tenant
For traces:
- Jaeger / Tempo storage
- Columnar TSDB for spans
- Index on trace ID
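One common way to partition by time + tenant is to route each event to a per-tenant daily index. The naming scheme below (`logs-<tenant>-<YYYY.MM.DD>`) and the `index_name` helper are assumptions for illustration, not something a particular backend requires.

```python
from datetime import datetime, timezone


def index_name(tenant: str, ts: datetime, prefix: str = "logs") -> str:
    """Return a per-tenant daily index name (hypothetical naming scheme).

    `ts` is expected to be timezone-aware; it is normalized to UTC so that
    all writers agree on where the day boundary falls.
    """
    day = ts.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-{tenant.lower()}-{day}"


ts = datetime(2024, 5, 17, 23, 30, tzinfo=timezone.utc)
print(index_name("Acme", ts))  # logs-acme-2024.05.17
```

Daily indices make retention cheap, because expiring old data becomes dropping a whole index rather than deleting individual documents.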
7) Trace Propagation Model
A trace must propagate across services by carrying three identifiers:
trace-id, span-id, parent-id
They are injected into HTTP headers following the W3C Trace Context standard.
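In W3C Trace Context the identifiers travel in a `traceparent` header of the form `version-traceid-parentid-flags`, where the parent-id field holds the caller's span-id. The sketch below builds and parses that header for version `00` only; `SpanContext`, `new_root`, `to_header`, and `from_header` are hypothetical helper names, and `tracestate` is ignored.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version 00: 2 hex version, 32 hex trace-id, 16 hex parent-id, 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    sampled: bool


def new_root() -> SpanContext:
    """Start a new trace with a random 16-byte trace-id and 8-byte span-id."""
    return SpanContext(secrets.token_hex(16), secrets.token_hex(8), None, True)


def to_header(ctx: SpanContext) -> str:
    """Serialize the current span as the traceparent value sent downstream."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def from_header(value: str) -> Optional[SpanContext]:
    """Parse an incoming traceparent and open a child span, or return None."""
    match = TRACEPARENT_RE.match(value.strip())
    if match is None:
        return None
    trace_id, caller_span_id, flags = match.groups()
    if trace_id == "0" * 32 or caller_span_id == "0" * 16:
        return None  # all-zero ids are invalid per the spec
    sampled = bool(int(flags, 16) & 0x01)
    return SpanContext(trace_id, secrets.token_hex(8), caller_span_id, sampled)


root = new_root()
child = from_header(to_header(root))
print(child.trace_id == root.trace_id, child.parent_id == root.span_id)  # True True
```

A service that receives an unparseable header would typically start a new root trace instead of failing the request.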
8) Search Layer
- REST query API
- Observability UI (Kibana, Grafana, or a custom UI)
- Faceted search backed by index scans
- Tracing views: timeline + waterfall
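Faceting boils down to counting how many matching log entries fall under each value of a field, which the UI then shows as clickable filters. The sketch below does this in memory over a list of hits; `facet_counts` and the field names are hypothetical, and a real search backend would compute the counts inside the index rather than in application code.

```python
from collections import Counter
from typing import Dict, Iterable, List


def facet_counts(hits: Iterable[dict], fields: List[str]) -> Dict[str, Counter]:
    """Count occurrences of each value for the requested facet fields."""
    counts: Dict[str, Counter] = {field: Counter() for field in fields}
    for hit in hits:
        for field in fields:
            if field in hit:
                counts[field][hit[field]] += 1
    return counts


hits = [
    {"service": "checkout", "level": "ERROR"},
    {"service": "checkout", "level": "INFO"},
    {"service": "search", "level": "ERROR"},
]
facets = facet_counts(hits, ["service", "level"])
print(facets["service"].most_common())  # [('checkout', 2), ('search', 1)]
print(facets["level"]["ERROR"])         # 2
```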
9) Scaling Strategy
- Scale index nodes horizontally
- Shard by timestamp or tenant
- Replicate for HA
- Partition the queue for parallel ingestion
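Queue partitioning usually means hashing a key so that related records land on the same partition while load spreads across consumers. The helper below is a generic sketch; `partition_for` is a hypothetical name, and SHA-256 is used only because it is stable across processes (actual queue clients ship their own partitioners).

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key (for example a tenant or service name) to a partition."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# The same key always maps to the same partition, so per-key ordering holds.
for key in ["tenant-a", "tenant-b", "tenant-a"]:
    print(key, partition_for(key, 12))
```

Keying by tenant keeps one tenant's logs ordered, but a very large tenant can create a hot partition; keying by tenant plus service is one way to spread it.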
10) Retention & Tiering
Log lifecycle:
- Hot (1–3 days): SSD
- Warm (7–30 days): HDD
- Cold archive (S3 / Blob): compressed
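A lifecycle job can pick the tier for each index from its age. The cutoffs below take the upper bounds of the ranges above (3 days for hot, 30 days for warm) as an assumption; `tier_for` is a hypothetical helper, not a backend API.

```python
from datetime import timedelta

HOT_MAX_AGE = timedelta(days=3)    # assumed cutoff: upper bound of the hot range
WARM_MAX_AGE = timedelta(days=30)  # assumed cutoff: upper bound of the warm range


def tier_for(age: timedelta) -> str:
    """Return the storage tier an index of the given age should live on."""
    if age <= HOT_MAX_AGE:
        return "hot"    # SSD
    if age <= WARM_MAX_AGE:
        return "warm"   # HDD
    return "cold"       # compressed object storage


for days in (1, 10, 45):
    print(days, tier_for(timedelta(days=days)))
# 1 hot
# 10 warm
# 45 cold
```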
11) Reliability & Query Isolation
Large queries must not starve ingestion or the UI:
- Query worker pool
- Request timeouts
- Result pagination
- Admission control
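The worker pool, timeout, and admission control can be combined in one small gate in front of the query engine. This is a minimal sketch using only the standard library; `QueryGate` and `QueryRejected` are hypothetical names, and because Python threads cannot be killed, a timed-out query keeps its slot until it actually finishes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class QueryRejected(Exception):
    """Raised when the gate is full and the query is not admitted."""


class QueryGate:
    def __init__(self, max_in_flight: int = 4, timeout_s: float = 5.0) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._timeout_s = timeout_s

    def run(self, query_fn, *args):
        # Admission control: reject immediately instead of queueing.
        if not self._slots.acquire(blocking=False):
            raise QueryRejected("too many queries in flight")
        future = self._pool.submit(self._guarded, query_fn, *args)
        # Raises a timeout error if the query exceeds its budget.
        return future.result(timeout=self._timeout_s)

    def _guarded(self, query_fn, *args):
        try:
            return query_fn(*args)
        finally:
            self._slots.release()  # free the slot only when the work ends


gate = QueryGate(max_in_flight=2, timeout_s=1.0)
print(gate.run(sum, [1, 2, 3]))  # 6
```

Rejecting early keeps latency predictable for admitted queries; callers can retry with backoff or narrow the time range.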
12) Alerting & Monitoring
Trigger alerts on:
- Error spikes
- Service unavailability (downtime)
- Queue lag
- Rising query failure rate
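An error-spike rule can be as simple as comparing the current minute's error count with a rolling baseline. The detector below is a sketch under assumed thresholds (3x the average of recent minutes, with a minimum absolute count to avoid noise); `ErrorSpikeDetector` is a hypothetical name.

```python
from collections import deque


class ErrorSpikeDetector:
    """Flags a minute whose error count far exceeds the recent average."""

    def __init__(self, window: int = 5, factor: float = 3.0,
                 min_errors: int = 10) -> None:
        self.history = deque(maxlen=window)  # error counts of recent minutes
        self.factor = factor
        self.min_errors = min_errors

    def observe(self, errors_this_minute: int) -> bool:
        # The baseline excludes the current minute.
        baseline = (sum(self.history) / len(self.history)
                    if self.history else None)
        self.history.append(errors_this_minute)
        if baseline is None:
            return False  # not enough data yet
        return (errors_this_minute >= self.min_errors
                and errors_this_minute > self.factor * baseline)


detector = ErrorSpikeDetector(window=3)
print([detector.observe(n) for n in [5, 6, 4, 40]])  # [False, False, False, True]
```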
13) Failure Modes
- Index node failure → reroute to healthy nodes
- Kafka outage → buffer logs locally
- Storage full → eviction
- Query overload → rate limiting
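Local buffering during a queue outage can be sketched as a bounded buffer that is drained in order once sends succeed again. `BufferedSender` and `send` are hypothetical; the buffer here is in memory for brevity, whereas a real agent would typically spill to disk.

```python
from collections import deque
from typing import Callable


class BufferedSender:
    """Keeps records locally while the queue is unreachable."""

    def __init__(self, send: Callable[[str], None],
                 max_buffered: int = 10_000) -> None:
        self.send = send
        # Bounded: when full, the oldest record is dropped to cap memory use.
        self.pending = deque(maxlen=max_buffered)

    def emit(self, record: str) -> None:
        self.pending.append(record)
        self.drain()

    def drain(self) -> None:
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                return  # still down; keep records for the next attempt
            self.pending.popleft()  # remove only after a successful send


queue_up = False
delivered = []

def send(record: str) -> None:
    if not queue_up:
        raise ConnectionError("queue unreachable")
    delivered.append(record)

sender = BufferedSender(send)
sender.emit("a")
sender.emit("b")
queue_up = True
sender.emit("c")
print(delivered)  # ['a', 'b', 'c']
```

Because a record is removed only after `send` returns, a crash between send and removal can produce duplicates, so downstream consumers should tolerate at-least-once delivery.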
14) Security
- RBAC on log access
- Redaction / PII masking
- Data encryption at rest and in transit
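PII masking is often applied in the agent or ingest service before logs are stored. The patterns below are deliberately simple assumptions covering only e-mail addresses and card-like digit runs; they will miss other PII and can over-match long numeric IDs, so a real deployment needs a reviewed rule set.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# 13 to 16 digits, optionally separated by single spaces or dashes.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact(message: str) -> str:
    """Replace e-mail addresses and card-like numbers with placeholders."""
    message = EMAIL_RE.sub("[EMAIL]", message)
    return CARD_RE.sub("[CARD]", message)


print(redact("user bob@example.com paid with 4111 1111 1111 1111"))
# user [EMAIL] paid with [CARD]
```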
15) Future Enhancements
- ML-based anomaly detection
- Automatic root cause analysis (RCA)
- Trace-to-log correlation
- Trace sampling optimization
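As a baseline for later sampling work, one simple scheme is deterministic sampling keyed on the trace-id, so every service makes the same keep/drop decision for a given trace without coordination. `keep_trace` is a hypothetical helper and assumes a hex trace-id of at least 8 characters.

```python
def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Keep a trace if its id falls in the lowest `sample_rate` fraction."""
    # Map the last 8 hex characters to a number in [0, 1).
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < sample_rate


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(keep_trace(trace_id, 1.0))  # True
print(keep_trace(trace_id, 0.0))  # False
```

More advanced schemes, such as tail-based sampling that keeps slow or failed traces, would build on the same idea but decide after the trace completes.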