Designing a Logging & Distributed Tracing Platform (ELK / Jaeger / Tempo / Datadog Style)

1) Problem Clarification

We need to build a central logging and tracing platform that allows engineers to:

  • Search logs
  • Trace distributed requests across microservices
  • Detect failures and performance bottlenecks

2) Requirements Definition

Functional Requirements

✔ Store application logs
✔ Search through a UI and a query/filter API
✔ Capture distributed tracing spans
✔ Manage retention through a lifecycle policy
✔ Alerting and dashboards

Non-functional Requirements

✔ Scales to terabytes per day
✔ Fast search (p95 < 1s)
✔ High ingestion throughput
✔ Fault tolerance
✔ Query isolation

3) Scale Estimation

Assume:

  • Ingestion of 10M events/min
  • 30-day retention
  • About 50 TB of storage per month
  • 500 concurrent queries
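
To see what these assumptions imply per second and per event, here is a small back-of-envelope calculation. It treats a month as 30 days and 50 TB as decimal terabytes of on-disk data; both are assumptions of this sketch, not figures stated above.

```python
# Back-of-envelope sizing from the stated assumptions.
EVENTS_PER_MIN = 10_000_000
DAYS_PER_MONTH = 30                      # assumption: month == retention window
STORAGE_BYTES_PER_MONTH = 50 * 10**12    # assumption: 50 TB in decimal units

events_per_sec = EVENTS_PER_MIN / 60
events_per_month = EVENTS_PER_MIN * 60 * 24 * DAYS_PER_MONTH
bytes_per_event = STORAGE_BYTES_PER_MONTH / events_per_month
tb_per_day = STORAGE_BYTES_PER_MONTH / DAYS_PER_MONTH / 10**12

print(f"{events_per_sec:,.0f} events/s")
print(f"{events_per_month:,} events/month")
print(f"{bytes_per_event:.0f} bytes/event")
print(f"{tb_per_day:.2f} TB/day")
```

Under these assumptions the load is roughly 167k events/s and the storage budget is about 116 bytes per stored event, so the 50 TB figure only holds if events are small or well compressed.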

4) High Level Architecture

There are two main pipelines.

Logging pipeline:

App → Log Agent → Queue → Ingest Service → Storage → Index → Search API → UI

Tracing pipeline:

App → Trace SDK → Collector → Trace Store → Query Engine → UI

5) Log Ingestion Pipeline

We need buffering to protect the backend:

  • Fluentd / Logstash agent on each node
  • Kafka / Kinesis queue
  • Consumers that ingest with batched writes

Batching improves throughput and reduces I/O.
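
A minimal sketch of the batching consumer idea: events are buffered and written in one call once the batch is full or has waited long enough. `BatchWriter`, `sink`, `max_items` and `max_wait_s` are hypothetical names for illustration; `sink` stands in for whatever bulk-write call the storage backend offers.

```python
import time
from typing import Any, Callable, List, Optional


class BatchWriter:
    """Buffers events and hands them to `sink` in batches (illustrative only)."""

    def __init__(self, sink: Callable[[List[Any]], None], max_items: int = 500,
                 max_wait_s: float = 1.0, clock: Callable[[], float] = time.monotonic):
        self._sink = sink
        self._max_items = max_items
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._buf: List[Any] = []
        self._first_at: Optional[float] = None

    def add(self, event: Any) -> None:
        if not self._buf:
            self._first_at = self._clock()  # age of the batch starts at its first event
        self._buf.append(event)
        too_big = len(self._buf) >= self._max_items
        too_old = self._clock() - self._first_at >= self._max_wait_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        # One write per batch instead of one write per event.
        if self._buf:
            batch, self._buf = self._buf, []
            self._sink(batch)
```

Note that this sketch only checks the age of a batch when a new event arrives; a real consumer would probably also flush on a timer and on shutdown.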

6) Storage Model

For logs:

  • Elasticsearch / ClickHouse / OpenSearch index
  • Hot → Warm → Cold storage tiers
  • Partition on time + tenant

For traces:

  • Jaeger / Tempo storage
  • Columnar store for spans
  • Index on trace ID
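
One common way to realize "partition on time + tenant" is to encode both in the index (or table partition) name, so old data can be dropped or moved a whole partition at a time. The naming scheme and the `index_name` helper below are assumptions for illustration, not a convention required by any of the stores listed.

```python
from datetime import datetime, timezone


def index_name(tenant: str, ts: datetime, prefix: str = "logs") -> str:
    """Return a per-tenant, per-day partition name such as logs-acme-2024.05.17.

    `ts` should be timezone-aware; the day boundary is taken in UTC.
    """
    day = ts.astimezone(timezone.utc).strftime("%Y.%m.%d")
    return f"{prefix}-{tenant}-{day}"
```

With daily partitions, a query for one tenant over the last hour touches a single partition, and expiring a day of data is a cheap drop rather than a per-document delete.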

7) Trace Propagation Model

Trace context must propagate across services using:

  • trace-id
  • span-id
  • parent-id

These are injected into HTTP headers following the W3C Trace Context standard.

8) Search Layer

  • REST query API
  • Observability UI (Kibana, Grafana, or a custom UI)
  • Faceted search backed by indexes
  • Tracing views: timeline + waterfall
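
To make the waterfall view concrete, here is a sketch that turns a flat list of spans into indented rows ordered by start time. The span dictionary shape (`span_id`, `parent_id`, `name`, `start_ms`, `duration_ms`) is an assumption of this example, not a Jaeger or Tempo schema.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, str, int, int]  # (depth, name, start_ms, duration_ms)


def waterfall(spans: List[dict]) -> List[Row]:
    """Order spans parent-first, children sorted by start time."""
    children: Dict[Optional[str], List[dict]] = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    rows: List[Row] = []

    def visit(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s["start_ms"]):
            rows.append((depth, span["name"], span["start_ms"], span["duration_ms"]))
            visit(span["span_id"], depth + 1)

    visit(None, 0)  # root spans have no parent
    return rows
```

A UI would draw each row as a bar offset by `start_ms` and sized by `duration_ms`, indented by depth.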

9) Scaling Strategy

  • Scale index nodes horizontally
  • Shard by timestamp or tenant
  • Replicate shards for HA
  • Partition the queue for parallel ingestion
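
Queue partitioning for parallel ingestion usually comes down to mapping a key (for example the tenant or service name) to a partition with a stable hash, so each consumer owns a disjoint slice of the stream. The function below is a stand-in that uses SHA-256 for stability; it is not the partitioner that Kafka or Kinesis actually use.

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition in [0, num_partitions) with a stable hash."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Keying by tenant keeps one tenant's events ordered within a partition, at the cost of hot partitions when one tenant dominates traffic; keying by a finer-grained value spreads load more evenly.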

10) Retention & Tiering

Log lifecycle:

  • Hot (1–3 days): SSD
  • Warm (7–30 days): HDD
  • Cold archive (S3 / Blob): compressed
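
The tiering rule can be expressed as a function of data age. The cut-offs below (3 days for hot, 30 days for warm) are one reading of the ranges above; the list leaves days 3 to 7 unspecified, and this sketch assigns them to warm as an assumption.

```python
from datetime import timedelta

HOT_MAX_AGE = timedelta(days=3)    # assumption: upper end of the hot range
WARM_MAX_AGE = timedelta(days=30)  # assumption: upper end of the warm range


def tier_for(age: timedelta) -> str:
    """Pick the storage tier for data of the given age."""
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"
```

In practice this decision is often delegated to the storage engine's own lifecycle feature rather than implemented by hand.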

11) Reliability & Query Isolation

We prevent large queries from starving ingestion or the UI with:

  • A query worker pool
  • Request timeouts
  • Result pagination
  • Admission control
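
A minimal sketch of admission control: cap the number of in-flight queries and reject the rest immediately instead of letting them queue up. `AdmissionController` and `QueryRejected` are hypothetical names; timeouts and pagination would sit alongside this in a real query service.

```python
import threading
from typing import Any, Callable


class QueryRejected(Exception):
    """Raised when the concurrency limit is already reached."""


class AdmissionController:
    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, query_fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        # Fail fast rather than block: the caller can retry or back off.
        if not self._slots.acquire(blocking=False):
            raise QueryRejected("too many concurrent queries")
        try:
            return query_fn(*args, **kwargs)
        finally:
            self._slots.release()
```

Rejecting early keeps the load on index nodes bounded, so ingestion keeps its share of resources even when users submit expensive searches.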

12) Alerting & Monitoring

Trigger alerts on:

  • Error spikes
  • Service unavailability
  • Queue lag
  • Rising query failure rate
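
One simple way to detect an error spike is to compare the current interval's error count with the average of the last few intervals. The `SpikeDetector` class and its thresholds (`window`, `factor`, `min_count`) are illustrative assumptions, not values from this design.

```python
from collections import deque
from statistics import mean


class SpikeDetector:
    """Flags an interval whose error count is far above the recent baseline."""

    def __init__(self, window: int = 5, factor: float = 3.0, min_count: int = 10):
        self._history = deque(maxlen=window)  # error counts of recent intervals
        self._factor = factor
        self._min_count = min_count           # ignore spikes on tiny absolute numbers

    def observe(self, error_count: int) -> bool:
        baseline = mean(self._history) if self._history else None
        self._history.append(error_count)
        if baseline is None:
            return False  # nothing to compare against yet
        return error_count >= self._min_count and error_count > self._factor * baseline
```

Feeding per-minute counts of 4, 5, 6 and then 40 into a detector with `window=3` flags only the last value, since 40 is more than three times the baseline of 5.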

13) Failure Modes

  • Index node failure → reroute to replicas
  • Kafka outage → buffer logs locally on the agent
  • Storage full → evict old data
  • Query overload → rate limiting
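
The queue-outage and storage-full cases can be combined in one agent-side sketch: log lines go into a bounded local spool, are drained when the queue is reachable, and the oldest lines are evicted when the spool is full. `BufferedShipper` and `send` are hypothetical; `send` stands in for the real producer call and is assumed to raise `ConnectionError` when the queue is down.

```python
from collections import deque
from typing import Callable


class BufferedShipper:
    def __init__(self, send: Callable[[str], None], max_buffered: int = 10_000):
        self._send = send
        # A bounded deque drops the oldest entry when a new one arrives at capacity.
        self._spool = deque(maxlen=max_buffered)

    @property
    def pending(self) -> int:
        return len(self._spool)

    def ship(self, line: str) -> None:
        self._spool.append(line)
        self.drain()

    def drain(self) -> bool:
        """Send spooled lines in order; stop at the first failure."""
        while self._spool:
            try:
                self._send(self._spool[0])
            except ConnectionError:
                return False  # queue still down, keep lines for later
            self._spool.popleft()
        return True
```

This keeps ordering and gives at-least-once delivery from the agent's point of view, at the cost of losing the oldest lines during a long outage.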

14) Security

  • RBAC on log access
  • Redaction / PII masking
  • Data encryption at rest and in transit
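
PII masking is typically applied in the agent or ingest service before logs are indexed. The sketch below masks email addresses and 16-digit card-like numbers with regular expressions; the patterns and the `redact` helper are simplified assumptions and will miss many real-world PII formats.

```python
import re

_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
_CARD = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def redact(line: str) -> str:
    """Replace email addresses and 16-digit card-like numbers with placeholders."""
    line = _EMAIL.sub("[EMAIL]", line)
    line = _CARD.sub("[CARD]", line)
    return line
```

Masking before indexing matters because once a value is in the index it is searchable by anyone with read access to that data.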

15) Future Enhancements

  • ML-based anomaly detection
  • Automatic root cause analysis (RCA)
  • Trace-to-log correlation
  • Trace sampling optimization
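
As a starting point for sampling work, here is a sketch of deterministic head sampling: the keep/drop decision is derived from the trace ID itself, so every service makes the same choice for a given trace without coordination. The `keep_trace` function and the choice of the last 8 hex digits as the bucket are assumptions of this example.

```python
def keep_trace(trace_id_hex: str, sample_rate: float) -> bool:
    """Keep a trace if its ID falls in the lowest `sample_rate` fraction of buckets."""
    bucket = int(trace_id_hex[-8:], 16) / 2**32  # value in [0, 1)
    return bucket < sample_rate
```

More advanced schemes, such as tail sampling that keeps all traces containing errors, need the collector to see the whole trace before deciding.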