Designing a Logging & Distributed Tracing Platform (ELK / Jaeger / Tempo / Datadog Style)

1) Problem Clarification

We need to build a centralized logging and tracing platform that lets engineers:

Search logs across services
Trace distributed requests across microservices
Detect failures and performance bottlenecks

2) Requirements Definition

Functional requirements

✔ Store application logs
✔ Search through a UI and a query/filter API
✔ Collect and store distributed tracing spans
✔ Enforce a retention lifecycle
✔ Alerting and dashboards

Non-functional requirements

✔ Scales to terabytes per day
✔ Fast search (p95 < 1 s)
✔ High ingestion throughput
✔ Fault tolerance
✔ Query isolation

3) Scale Estimation

Assume:

10M events/min ingested
30-day retention
~50 TB/month of storage
500 concurrent queries
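
The assumptions above can be cross-checked with a little arithmetic. The sketch below assumes decimal terabytes and a 30-day month (neither is stated in the notes) and derives the per-second rate, the daily volume, and the storage budget per event.

```python
# Back-of-the-envelope check of the stated assumptions.
# Assumed for illustration: 1 TB = 10**12 bytes, 1 month = 30 days.
EVENTS_PER_MIN = 10_000_000
RETENTION_DAYS = 30
STORAGE_TB_PER_MONTH = 50

events_per_sec = EVENTS_PER_MIN / 60
events_per_day = EVENTS_PER_MIN * 60 * 24
events_per_month = events_per_day * RETENTION_DAYS
tb_per_day = STORAGE_TB_PER_MONTH / RETENTION_DAYS
bytes_per_event = STORAGE_TB_PER_MONTH * 10**12 / events_per_month

print(f"{events_per_sec:,.0f} events/s")
print(f"{events_per_day:,} events/day")
print(f"{tb_per_day:.2f} TB/day")
print(f"{bytes_per_event:.0f} bytes/event")
```

The result is roughly 116 bytes of storage per event. That is small for a raw log line, so the 50 TB/month figure is probably best read as a post-compression number or as an assumption of short events.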

4) High-Level Architecture

There are two main pipelines.

Logging pipeline:

App → Log Agent → Queue → Ingest Service → Storage + Index → Search API → UI

Tracing pipeline:

App → Trace SDK → Collector → Trace Store → Query Engine → UI

5) Log Ingestion Pipeline

We need buffering between producers and the backend so that traffic bursts do not overload storage:

Fluentd / Logstash agent on each node
Kafka / Kinesis queue
Ingest consumers that write in batches

Batching raises throughput and reduces I/O because each write to the backend carries many events instead of one.
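
A minimal sketch of the batching consumer described above. The function name consume_in_batches, the write_batch callback, and the default limits are hypothetical; a real consumer would read from Kafka or Kinesis and commit offsets after each successful write.

```python
import time
from typing import Callable, Iterable, List


def consume_in_batches(
    events: Iterable,
    write_batch: Callable[[List], None],
    max_batch: int = 500,
    max_wait_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Group events into batches and hand each batch to write_batch.

    A batch is flushed when it reaches max_batch events or when max_wait_s
    has elapsed since the batch was started. The time check only runs when
    an event arrives, which is a simplification for this sketch.
    Returns the number of batches written.
    """
    batch: List = []
    deadline = clock() + max_wait_s
    flushed = 0
    for event in events:
        batch.append(event)
        if len(batch) >= max_batch or clock() >= deadline:
            write_batch(batch)
            flushed += 1
            batch = []
            deadline = clock() + max_wait_s
    if batch:  # flush the final partial batch
        write_batch(batch)
        flushed += 1
    return flushed
```

Bounding the batch by both size and age keeps throughput high under load while still limiting how long an event waits during quiet periods.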

6) Storage Model

For logs:

Elasticsearch / OpenSearch / ClickHouse as the indexed store
Hot → Warm → Cold storage tiers
Partition by time and tenant

For traces:

Jaeger or Tempo as the trace backend
Columnar storage for spans (a span store rather than a metrics TSDB)
Index on trace ID
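
One way to picture partitioning by time and tenant is a per-tenant, per-day index name. The naming scheme, the logs prefix, and both function names below are assumptions for illustration, not a convention taken from any specific product.

```python
from datetime import datetime, timezone
from typing import List

SECONDS_PER_DAY = 86_400


def index_name(tenant: str, ts_epoch_s: float, prefix: str = "logs") -> str:
    """Return the (hypothetical) index that holds this tenant's events for that UTC day."""
    day = datetime.fromtimestamp(ts_epoch_s, tz=timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-{tenant}-{day}"


def indices_for_range(tenant: str, start_s: float, end_s: float) -> List[str]:
    """List only the daily indices a time-bounded query has to touch."""
    first_day = int(start_s) // SECONDS_PER_DAY
    last_day = int(end_s) // SECONDS_PER_DAY
    return [
        index_name(tenant, day * SECONDS_PER_DAY)
        for day in range(first_day, last_day + 1)
    ]
```

With this layout a query for one tenant and a short time window reads a handful of indices instead of the whole dataset, and retention becomes a matter of dropping whole indices. With many small tenants, one index per tenant per day may produce too many tiny indices, so grouping tenants is a common variation.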

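7) Trace Propagation Model

Every service must propagate trace context on outgoing calls:

trace-id: identifies the whole request and stays the same across all services
span-id: identifies one operation; each service generates a new one for its own work
parent-id: the span-id of the caller, which links a span to its parent

The context is injected into HTTP headers using the W3C Trace Context standard. Its traceparent header carries the trace-id and the caller's span-id (as parent-id).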
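
The propagation step can be sketched with the W3C traceparent header, whose version 00 format is version-traceid-parentid-flags in lowercase hex. The helper names below are hypothetical, the parser handles only the fixed-length form, and sampling every new trace is an assumption made to keep the example short.

```python
import re
import secrets
from typing import Optional, Tuple

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[dict]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid according to the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(incoming: Optional[str]) -> Tuple[str, str]:
    """Create this service's span id and the header to send downstream."""
    ctx = parse_traceparent(incoming) if incoming else None
    span_id = secrets.token_hex(8)  # new 8-byte span id for our own work
    if ctx is None:
        trace_id = secrets.token_hex(16)  # no valid context: start a new trace
        flags = "01"  # assumption: sample every new trace
    else:
        trace_id = ctx["trace_id"]  # keep the caller's trace id
        flags = "01" if ctx["sampled"] else "00"
    return span_id, f"00-{trace_id}-{span_id}-{flags}"
```

The outgoing header reuses the trace-id unchanged and puts the current service's span-id in the parent-id position, which is what lets the backend stitch spans into a tree.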

8) Search Layer

REST query API
Observability UI (Kibana, Grafana, or a custom UI)
Faceted search served from the index
Trace views: timeline and waterfall
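
A waterfall view is essentially the span tree flattened in start-time order with an indentation depth. The sketch below assumes spans arrive as dictionaries with span_id, parent_id, name, start_ms, and duration_ms keys; that schema is hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def waterfall(spans: List[dict]) -> List[Tuple[int, str, int, int]]:
    """Flatten spans into (depth, name, start_ms, duration_ms) rows.

    Rows are ordered parent first, then children by start time.
    Spans whose parent is missing from the input are not rendered.
    """
    children: Dict[Optional[str], List[dict]] = {}
    for span in spans:
        children.setdefault(span.get("parent_id"), []).append(span)

    rows: List[Tuple[int, str, int, int]] = []

    def visit(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s["start_ms"]):
            rows.append((depth, span["name"], span["start_ms"], span["duration_ms"]))
            visit(span["span_id"], depth + 1)

    visit(None, 0)  # root spans have no parent
    return rows
```

A UI can then draw each row as a bar offset by start_ms and indented by depth.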

9) Scaling Strategy

Scale index nodes horizontally
Shard by time or tenant
Replicate shards for high availability
Partition the queue so ingestion runs in parallel
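
Queue partitioning usually comes down to hashing a key to a partition number. The sketch below uses SHA-256 as a stand-in hash; Kafka's own default partitioner uses a different hash function, and the key format shown in the usage note is an assumption.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a stable partition in the range [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo the partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Keying by something like "tenant:service" keeps events from one source in order on a single partition while spreading different sources across consumers. A single very large tenant can still create a hot partition, which is the usual reason to add a finer-grained component to the key.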

10) Retention & Tiering

Log lifecycle:

Hot (first 1–3 days): SSD
Warm (after hot, kept until day 7–30): HDD
Cold archive (S3 / Blob storage): compressed
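
The lifecycle can be expressed as a small function from data age to tier. The thresholds below (3 days hot, 30 days warm) are taken from the upper ends of the ranges above; the function name is hypothetical.

```python
def tier_for_age(age_days: float, hot_days: float = 3, warm_days: float = 30) -> str:
    """Return the storage tier for data of the given age in days."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days < hot_days:
        return "hot"  # SSD, serves most interactive queries
    if age_days < warm_days:
        return "warm"  # HDD, cheaper and slower
    return "cold"  # compressed archive in object storage
```

In practice the storage engine's lifecycle policy performs this transition per index or per partition; the function only makes the boundaries explicit.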

11) Reliability & Query Isolation

We prevent heavy queries from starving ingestion or the UI:

Dedicated query worker pool
Request timeouts
Result pagination
Admission control
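
A minimal sketch of admission control combined with a page-size cap and a per-query deadline. The QueryGate class, the QueryRejected exception, and the query_fn(limit, timeout_s) callback signature are all hypothetical; the backend is assumed to enforce the timeout it is given.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class QueryRejected(RuntimeError):
    """Raised when the gate has no free slot for a new query."""


class QueryGate:
    def __init__(self, max_concurrent: int, max_page_size: int = 1000, timeout_s: float = 10.0):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self.max_page_size = max_page_size
        self.timeout_s = timeout_s

    def run(self, query_fn: Callable[[int, float], T], page_size: int) -> T:
        # Admission control: reject immediately instead of queueing without bound.
        if not self._slots.acquire(blocking=False):
            raise QueryRejected("too many concurrent queries")
        try:
            # Pagination: never ask the backend for more than max_page_size rows.
            limit = min(page_size, self.max_page_size)
            return query_fn(limit, self.timeout_s)
        finally:
            self._slots.release()
```

Rejecting early keeps the number of in-flight queries bounded, so a burst of expensive searches degrades into fast errors rather than taking ingestion down with it.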

12) Alerting & Monitoring

Trigger alerts on:

Error spikes
Service unavailability (downtime)
Queue lag
Rising query failure rate
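
An error-spike alert can be approximated by comparing the current minute's error count with a rolling baseline. The class name, window size, multiplier, and minimum count below are illustrative assumptions, not tuned values.

```python
from collections import deque


class ErrorSpikeDetector:
    def __init__(self, window: int = 5, factor: float = 3.0, min_errors: int = 50):
        self._history = deque(maxlen=window)  # recent per-minute error counts
        self.factor = factor
        self.min_errors = min_errors

    def observe(self, errors_this_minute: int) -> bool:
        """Record one minute of data; return True if it looks like a spike."""
        if self._history:
            baseline = sum(self._history) / len(self._history)
        else:
            baseline = None  # nothing to compare against yet
        self._history.append(errors_this_minute)
        if baseline is None:
            return False
        # Require an absolute floor so small services do not alert on noise.
        return (
            errors_this_minute >= self.min_errors
            and errors_this_minute > self.factor * baseline
        )
```

Note that a spike enters the history after it is evaluated, so the baseline rises for the following minutes; a production rule would likely exclude alerting samples or use a longer window.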

13) Failure Modes

Index node failure → reroute to replicas
Kafka outage → agents buffer logs locally
Storage full → evict the oldest data
Query overload → rate limiting
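
The local-buffering fallback for a queue outage can be sketched as a bounded buffer that is retried on every send. BufferingSender and its send callback are hypothetical, the buffer is in memory only, and dropping the oldest events when full is one possible eviction policy; real agents typically spill to disk.

```python
from collections import deque
from typing import Callable


class BufferingSender:
    def __init__(self, send: Callable[[object], None], max_buffered: int = 10_000):
        self._send = send
        # A deque with maxlen silently drops the oldest entry when full.
        self._buffer = deque(maxlen=max_buffered)

    def emit(self, event: object) -> None:
        """Queue the event, then try to drain everything that is pending."""
        self._buffer.append(event)
        self.flush()

    def flush(self) -> int:
        """Send buffered events in order; stop at the first connection failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # queue still down: keep the rest for the next attempt
            self._buffer.popleft()
            sent += 1
        return sent

    def pending(self) -> int:
        return len(self._buffer)
```

An event is removed from the buffer only after send returns, so a failure mid-drain cannot lose it, at the cost of possible duplicates if the send succeeded but the acknowledgement was lost.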

14) Security

RBAC on log access
Redaction / PII masking
Encryption of data at rest and in transit
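
Redaction is usually applied in the ingest path before events are indexed. The sketch below masks values of a few sensitive keys and email addresses inside string fields; the key list, the placeholder tokens, and the simplified email pattern are assumptions and would need to be extended for real PII coverage.

```python
import re

# Deliberately simple email pattern; it is not a full RFC 5322 matcher.
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical list of field names whose values are always masked.
SENSITIVE_KEYS = {"password", "token", "authorization"}


def redact_event(event: dict) -> dict:
    """Return a copy of a flat log event with sensitive values masked."""
    redacted = {}
    for key, value in event.items():
        if key.lower() in SENSITIVE_KEYS:
            redacted[key] = "[REDACTED]"
        elif isinstance(value, str):
            redacted[key] = _EMAIL.sub("[EMAIL]", value)
        else:
            redacted[key] = value
    return redacted
```

Masking before indexing matters because once a secret is searchable, RBAC and encryption only limit who can find it, not whether it was stored.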

15) Future Enhancements

ML-based anomaly detection
Automated root cause analysis (RCA)
Trace-to-log correlation
Trace sampling optimization