
Metrics Collection & Monitoring System — L7 Reference Design

Collect fleet-wide metrics, retain history, and trigger alerts quickly without drowning the system in high-cardinality series.

Tags: Streaming · Observability · Partitioning · Availability · Cost / efficiency

1 Problem Restatement & Clarifying Questions #

Restated: Build a fleet-wide telemetry plane that ingests numeric time series from ~100K hosts, stores them cheaply at tiered retention, evaluates alert rules in seconds, serves ad-hoc dashboard queries in <1s, and never lies to the oncall.

Clarifying Qs (I'd ask the interviewer; I'll lock answers and proceed):

| # | Question | Locked Assumption | Why it matters |
|---|----------|-------------------|----------------|
| Q1 | Fleet size? | 100K hosts, 1 region today, scale path to 3 regions × 500K hosts | Shard count, index partitioning |
| Q2 | Metrics/host? | ~1000 active series per host (host + container + app) | Core BOE input |
| Q3 | Scrape interval? | 10s default, 1s for a 5% hot subset | Sample rate |
| Q4 | Cardinality budget? | ~100M active series today, 1B ceiling in 2y | Index RAM, fanout cost |
| Q5 | Retention? | 30d raw (10s), 1y 5-min rollups, 5y 1-hr rollups | Tiered storage sizing |
| Q6 | Query concurrency? | ~1000 dashboard QPS, ~10K alert-rule evals/s | Query planner + caching |
| Q7 | Multi-tenant? | Yes — ~500 internal tenants (teams), per-tenant quota + label-based isolation | Admission control |
| Q8 | Alert SLO? | <30s end-to-end: scrape → eval → page | Pipeline budget |
| Q9 | Pillar scope? | Metrics only. Logs & traces are sibling pillars (correlated via exemplars) | Prevents scope creep into "design observability" |
| Q10 | Regulatory? | No PII in labels (hard rule, enforced at ingest). EU data residency for EU tenants | Admission control + geo-partition |
| Q11 | Push vs pull mandate? | Candidate's choice — justify | Core deep-dive |
| Q12 | Query language? | PromQL-compatible (industry-standard), MQL-like extensions for joins | API shape |

What I am NOT designing: log search (that's Loki/ELK), distributed tracing (Jaeger/Tempo), continuous profiling (Pyroscope/Parca), synthetic probes, APM (that's a product on top of these pillars), SLO/SLI computation engine (downstream consumer).


2 Functional Requirements #

In-scope (numbered):

  1. FR1 — Ingest numeric samples typed as counter, gauge, histogram, summary with a labelset {k=v, ...} and a timestamp. Support Prometheus remote_write (protobuf + snappy) and OTLP/metrics (gRPC).
  2. FR2 — Query via PromQL-compatible language over arbitrary time ranges, with range/instant queries, subqueries, and streaming results.
  3. FR3 — Alerting rules — declarative expressions evaluated on a schedule; fire on condition held for <duration>; route to notifiers (Pager, Slack, webhook).
  4. FR4 — Recording rules — pre-aggregate expensive expressions into new series (e.g., SLO burn rate).
  5. FR5 — Dashboards — serve to Grafana-style UIs; query frontend handles caching, splitting, and concurrency limits.
  6. FR6 — Host health — every agent emits a heartbeat series; missing heartbeat ⇒ synthetic host_down alert.
  7. FR7 — Service discovery — for pull-mode targets, maintain a registry that advertises scrape targets per tenant.
  8. FR8 — Multi-tenancy — per-tenant quotas (ingestion rate, active series, query RU/s), label-based isolation, auditable admission.
  9. FR9 — Silences & inhibitions — operator-facing API to mute alerts by label matcher; inhibition prevents child alerts when parent fires.
  10. FR10 — Downsampling & retention tiering — automatic rollups 10s → 5m → 1h; tiered evict to object store.

Out-of-scope (explicit): logs, traces, profiles, events, APM, RUM, synthetic monitoring, CMDB, deploy tracking. These are sibling observability pillars — metrics integrates with them via exemplars (trace_id attached to a sample) but does not subsume them. Conflating pillars is how you get "one giant unqueryable Elastic cluster."


3 NFRs + Capacity Estimate #

NFRs

| Dim | Target | Notes |
|-----|--------|-------|
| Availability — ingest | 99.95% | 4.3h/yr error budget. Ingest degrades gracefully (agent local buffer absorbs outages <1h). |
| Availability — alert eval | 99.99% | Harder SLO than ingest — alerts are the pager. 52 min/yr budget. Multi-region active-active for alert evaluators. |
| Availability — query | 99.9% | Dashboards can tolerate brief errors; retries are cheap. |
| Ingest latency | p99 < 5s scrape → durable | Batching + Kafka commit |
| Alert eval latency e2e | p99 < 30s | scrape (10s) + Kafka (2s) + eval window (15s) + notify (3s) = 30s budget |
| Query | p99 <1s for 24h range over ≤10K series; <5s for 7d range | Query frontend caches + shard fanout |
| Durability — raw hot | 30d, 3-way replicated, SSD | Loss tolerance: 1 AZ outage = 0 loss |
| Durability — warm | 1y rolled-up, 3-way object storage | S3-class 11 nines |
| Durability — cold | 5y 1h rollups, erasure-coded | Archive tier |
| Consistency | Monotonic per-series writes; eventually-consistent across replicas; stale read tolerated at query (+/- 15s) | Strong consistency not needed for metrics |
| Correctness | Never silently drop without counter-bump | Every drop increments metrics_dropped_total{reason=...} — you cannot debug what you cannot count |

Back-of-Envelope (show the math)

Inputs:

  • H = 100K hosts
  • S_h = 1000 active series / host
  • r = 1 sample per 10s = 0.1 Hz

Samples/sec (peak):

samples/s = H × S_h × r = 100,000 × 1,000 × 0.1 = 10,000,000 = 10M samples/s

Round to 10M samples/s steady-state, 20M peak (deploy storms, rescrape bursts).

Active series (global): 100K × 1000 = 100M series. Plus 2× blast factor for label churn (pod restart adds a new series because pod_id rotates) → plan for 200–500M series, ceiling at 1B.

Compressed bytes per sample: Gorilla (Facebook 2015) achieves 1.37 bytes/sample average — timestamp delta-of-delta (7 bits median) + value XOR (8–16 bits median for slowly-moving gauges). Reference: Pelkonen et al., VLDB 2015.

Ingress bandwidth (wire, pre-compression):

per-sample raw = 16B ts + 8B val + ~200B labelset-reference (not repeated; sent as series_id after first) → ~32B amortized
wire = 10M × 32B = 320 MB/s raw
snappy ≈ 3× → ~107 MB/s on-wire

Call it ~120 MB/s ingest, ~1 Gbps. Split across regional receivers.

Storage/day (compressed at rest):

10M samples/s × 86,400s × 1.37B = 1.18 TB/day raw-hot

30d hot = ~35 TB. Replicated 3× = ~105 TB SSD. That's ~20 hosts with 6TB NVMe each — easy.

Downsampling:

  • 10s → 5m rollup = 30× reduction
  • 5m → 1h rollup = 12× reduction
Warm (1y @ 5m) = 1.18TB/d × 365 / 30 = ~14.3 TB/y replicated 3× = ~43 TB object storage
Cold (5y @ 1h) = warm / 12 × 5 = ~6 TB total

Total long-term footprint: ~150 TB across tiers. At roughly $0.023/GB/mo for S3-class object storage, warm + cold (~50 TB) runs on the order of **$1K/mo**. Hot SSD is the expensive tier ($300/mo/TB on NVMe EC2 × ~105 TB replicated) → **$30K/mo**. Cardinality dominates cost, not bytes — index RAM >> chunk bytes.

Cardinality budget vs index RAM:

  • Per series: ~1–2 KB inverted-index overhead (labelset interning + postings lists).
  • 100M series → ~150 GB RAM across index shards. Shard the index 10× → 15 GB/shard. Fits on 32GB boxes with headroom.
  • 1B series → 1.5 TB RAM index. Now you're sharding 100× and postings-list merges hurt. This is the real scale cliff.

Query fanout:

  • A dashboard panel "p99 latency by service over 24h" touches ~1K series × 8640 points = ~8M points. Across N=10 shards, each shard serves ~800K points.
  • At 1000 dashboard QPS, total internal fanout = 10K shard-queries/s. Hence the query frontend MUST cache and split.
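
A throwaway sanity check of the arithmetic above, using only the locked assumptions from section 1 (hosts, series per host, scrape rate) and the Gorilla bytes-per-sample figure. Nothing here is measured; it just makes the estimate reproducible:

// boe.go: back-of-envelope capacity math for the locked assumptions above.
package main

import "fmt"

func main() {
    const (
        hosts          = 100_000
        seriesPerHost  = 1_000
        scrapeHz       = 0.1  // one sample per 10s
        bytesPerSample = 1.37 // Gorilla average, per the VLDB'15 paper
        replication    = 3
    )

    samplesPerSec := float64(hosts) * float64(seriesPerHost) * scrapeHz
    activeSeries := float64(hosts) * float64(seriesPerHost)
    bytesPerDay := samplesPerSec * 86_400 * bytesPerSample
    hot30d := bytesPerDay * 30 * replication

    fmt.Printf("samples/s:          %.0f\n", samplesPerSec)       // 10M
    fmt.Printf("active series:      %.0f\n", activeSeries)        // 100M
    fmt.Printf("stored/day:         %.2f TB\n", bytesPerDay/1e12) // ~1.18 TB
    fmt.Printf("hot tier (30d, x3): %.0f TB\n", hot30d/1e12)      // ~106 TB
}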

4 High-Level API #

4.1 Ingest

POST /api/v1/write — Prometheus remote_write compatible

Content-Type: application/x-protobuf
Content-Encoding: snappy
X-Tenant-Id: team-search-ads
X-Prometheus-Remote-Write-Version: 0.1.0
Body: WriteRequest { timeseries: []TimeSeries, metadata: []MetricMetadata }

TimeSeries { labels: []Label, samples: []Sample, exemplars: []Exemplar }
Sample { value: double, timestamp_ms: int64 }
Exemplar { labels: []Label /* trace_id */, value: double, timestamp_ms: int64 }

Response: 204 (ok) | 400 (bad proto) | 429 (quota) | 503 (back-pressure)
  • Idempotency: (tenant, series_hash, timestamp) is the natural key. Dup writes are silently deduped at the TSDB (same value) or rejected (different value within 1ms tolerance). Out-of-order writes beyond lookback_delta bump an err_out_of_order counter rather than returning 4xx — we don't want agents to retry old data.
  • Batching: agents batch 30s worth or 1MB, whichever first.
  • Auth: mTLS (SPIFFE ID). Tenant derived from SPIFFE SAN.

POST /v1/metrics — OTLP/metrics gRPC

service MetricsService {
  rpc Export(ExportMetricsServiceRequest) returns (ExportMetricsServiceResponse);
}
  • Supported alongside remote_write; internal adapter converts OTLP resource/instrumentation scope → Prometheus label conventions.

4.2 Query

GET /api/v1/query?query=<PromQL>&time=<unix> — instant query
GET /api/v1/query_range?query=<PromQL>&start=&end=&step= — range query
GET /api/v1/labels, /api/v1/label/<name>/values, /api/v1/series — metadata

  • Response: { status, data: { resultType: "matrix"|"vector"|"scalar", result: [...] } }
  • Streaming variant: GET /api/v1/query_stream returns gRPC server-streaming QueryResult chunks — for dashboards loading 1M points without buffering.
  • Pagination: label-value endpoints paginate via match[] filter and limit+next_token (opaque cursor).
  • Errors: 400 parse, 422 evaluation (e.g., many-to-many match), 429 concurrency, 504 query-timeout (bounded at 60s), 507 too-many-series (>50K series matched → force user to narrow).

4.3 Alerting control plane

PUT /api/v1/rules/<group> — upsert rule group (gRPC + HTTP)

groups:
  - name: search-frontend
    interval: 15s
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_errors_total[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.01
        for: 5m
        labels: { severity: page, team: search }
        annotations: { runbook: "https://..." }
  • Rules are stored in a separate control-plane etcd (not in the TSDB — config ≠ data).

POST /api/v1/silences — create silence { matchers, starts_at, ends_at, creator, comment }
POST /api/v1/inhibit — create inhibition pair { source_matchers, target_matchers, equal_labels }

4.4 Health

GET /healthz per component — fast, no dependencies. Returns 200 ok or 503 + reason.
GET /readyz — slower, checks downstream (Kafka reachable, shard index loaded).
Heartbeat metric emitted by agent: up{host=...} = 1. Absence ⇒ alert.


5 Data Schema #

Critical L7 insight: metrics schema is NOT one table. It's six distinct storage engines with very different access patterns.

5.1 Series (hot TSDB)

  • Logical: series_id = uint64 hash(tenant, metric_name, sorted_labels).
  • Physical: per-series chunk files, 2h blocks (Prometheus convention), chunks hold compressed (ts, val) pairs via Gorilla (XOR value + delta-of-delta ts).
  • Engine choice: custom columnar format (Prometheus TSDB mmap or M3DB similar). Not RocksDB — Gorilla compression beats LSM by 5–10× for this workload and we don't need point-update.
  • Partition key: (tenant, series_id) % shard_count, then within shard by 2h time block.
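
A minimal sketch of the series identity and shard routing described above. Illustrative only: production TSDBs use xxhash over interned label symbols rather than FNV over raw strings, but the shape (sort the labels, derive a stable 64-bit id, take it modulo the shard count) is the same.

// seriesid.go: sketch of series identity and shard routing (section 5.1).
package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

func seriesID(tenant, metric string, labels map[string]string) uint64 {
    keys := make([]string, 0, len(labels))
    for k := range labels {
        keys = append(keys, k)
    }
    sort.Strings(keys) // label order must not change the identity

    h := fnv.New64a()
    h.Write([]byte(tenant))
    h.Write([]byte{0})
    h.Write([]byte(metric))
    for _, k := range keys {
        h.Write([]byte{0})
        h.Write([]byte(k))
        h.Write([]byte{'='})
        h.Write([]byte(labels[k]))
    }
    return h.Sum64()
}

func shardFor(id uint64, shardCount uint64) uint64 { return id % shardCount }

func main() {
    id := seriesID("team-search-ads", "http_requests_total",
        map[string]string{"service": "checkout", "status": "500"})
    fmt.Printf("series_id=%d shard=%d\n", id, shardFor(id, 128))
}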

5.2 Inverted Index (label → series)

  • Logical: label_kv → sorted_postings_list[series_id] (e.g., service=checkout → [42, 108, 10234, ...]).
  • Query evaluation: intersect/union postings lists for each matcher in the selector {service="checkout", status="500"} — this is the classic search-engine algorithm.
  • Engine: RocksDB (LSM is good here — index is append-mostly with occasional compaction) plus in-memory hash of recent labels for the 2h active block.
  • Why separate from TSDB? TSDB layout is optimized for time-range scans of a known series; index is optimized for resolving "which series match this selector?" — different B-tree vs postings-list access pattern. Conflating them is what killed InfluxDB 1.x's tsm+tsi mashup.
  • Why RocksDB and not Lucene? We need fast range queries on timestamps inside postings, integer-heavy, and we want mmap friendliness. Lucene is too text-oriented and allocates too much. M3 and VictoriaMetrics both roll custom inverted indexes over similar LSM-like structures.
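
The postings-list intersection mentioned above, as a sketch. Each matcher in a selector resolves to a sorted list of series IDs; evaluating {service="checkout", status="500"} intersects them. The linear-merge cost is exactly why a single high-cardinality label value can turn a 30ms lookup into a multi-second scan.

// postings.go: the classic sorted-postings intersection used to resolve a
// selector (section 5.2). Cost is O(|a|+|b|) per merge, not a point lookup.
package main

import "fmt"

func intersect(a, b []uint64) []uint64 {
    out := make([]uint64, 0)
    i, j := 0, 0
    for i < len(a) && j < len(b) {
        switch {
        case a[i] == b[j]:
            out = append(out, a[i])
            i++
            j++
        case a[i] < b[j]:
            i++
        default:
            j++
        }
    }
    return out
}

func main() {
    service := []uint64{42, 108, 10234, 20001} // postings for service="checkout"
    status := []uint64{7, 108, 512, 10234}     // postings for status="500"
    fmt.Println(intersect(service, status))    // [108 10234]
}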

5.3 Alert Rule Definitions (control plane)

  • Schema: {rule_id, tenant_id, group_name, expr_promql, for_duration, labels, annotations, version, updated_by, updated_at}
  • Engine: etcd (strong consistency for config) or Spanner-like (if multi-region). ~10K rules/tenant × 500 tenants = 5M rules → fits.

5.4 Alert State (firing/pending/resolved)

  • Schema: {alert_key = hash(rule_id, group_by_label_values), state ∈ {inactive, pending, firing}, active_since, last_eval, last_value, annotations}
  • Engine: Replicated in-memory with periodic snapshot to object store. Cannot live in TSDB — state changes are not time-series; they're mutable records indexed by alert_key.
  • Why not Redis? We control failover semantics better with a custom Raft group per evaluator shard. Redis is fine if you trust Sentinel; Monarch-scale systems run their own.

5.5 Silences / Inhibitions

  • Schema: {silence_id, matchers: []LabelMatcher, starts_at, ends_at, creator, comment} — low QPS, <1M rows globally.
  • Engine: Spanner / etcd. Replicated to every alert evaluator on change.

5.6 Downsample Rollups (warm/cold)

  • Schema: same series_id, but samples are {ts, count, sum, min, max, last} quintuples (enables accurate rate/quantile reconstruction within rollup resolution).
  • Engine: object storage (S3/GCS) as immutable Parquet-like blocks, keyed tenant/<series_id_prefix>/<block_start>.chunk. Index (block → series_ids present) kept in a metadata service.
  • Why quintuple instead of single downsampled value? If you only store the mean, you destroy tail latency. Keeping count+sum+min+max+last (plus histogram_bucket_sum for native histograms) lets you reconstruct rate, avg, max, etc., within the rollup window. Prometheus native histograms and M3's approach both do this.
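
A sketch of how raw 10s samples fold into the rollup aggregate described above. Field names are illustrative, not a wire format; the point is that keeping count/sum/min/max/last preserves avg, max, and last-value semantics inside the rollup window instead of a single pre-chosen mean.

// rollup.go: folding raw samples into the {count,sum,min,max,last} aggregate
// from section 5.6.
package main

import (
    "fmt"
    "math"
)

type Rollup struct {
    Count      int64
    Sum        float64
    Min, Max   float64
    Last       float64
    LastTsUnix int64
}

func (r *Rollup) Add(tsUnix int64, v float64) {
    if r.Count == 0 {
        r.Min, r.Max = math.Inf(1), math.Inf(-1)
    }
    r.Count++
    r.Sum += v
    r.Min = math.Min(r.Min, v)
    r.Max = math.Max(r.Max, v)
    if tsUnix >= r.LastTsUnix {
        r.Last, r.LastTsUnix = v, tsUnix
    }
}

func main() {
    var r Rollup
    for i, v := range []float64{0.20, 0.25, 0.22, 0.90, 0.21} { // 10s gauge samples
        r.Add(int64(1700000000+10*i), v)
    }
    fmt.Printf("count=%d avg=%.3f max=%.2f last=%.2f\n",
        r.Count, r.Sum/float64(r.Count), r.Max, r.Last)
}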

6 System Diagram — CENTERPIECE #

6.1 End-to-end (single region, one AZ triangle shown)

  +-----------------------------------------------------------------------------+
  |                              100K HOSTS (per region)                         |
  |   +---------+  +---------+  +---------+  +---------+  +---------+           |
  |   | host_1  |  | host_2  |  | host_3  |  |  ...    |  | host_N  |           |
  |   |  app+   |  |  app+   |  |  app+   |  |         |  |  app+   |           |
  |   |  agent  |  |  agent  |  |  agent  |  |  agent  |  |  agent  |           |
  |   |  (100MB |  |         |  |         |  |         |  |         |           |
  |   |  local  |  |         |  |         |  |         |  |         |           |
  |   |  buffer)|  |         |  |         |  |         |  |         |           |
  |   +----+----+  +----+----+  +----+----+  +----+----+  +----+----+           |
  |        |            |            |            |            |                |
  +--------|------------|------------|------------|------------|----------------+
           |            |            |            |            |
           |  HTTPS remote_write (snappy+proto), mTLS (SPIFFE), batch=30s/1MB
           |  per-host: ~1.2 KB/s wire, ~2 req/min (30s batches)
           |  aggregate: ~120 MB/s, ~3-4K RPS across fleet
           v            v            v            v            v
   +---------------------------------------------------------------------+
   |            REGIONAL INGEST GATEWAY (stateless, fleet of ~30 pods)   |
   |                                                                     |
   |   [ Envoy L7 ] -> [ auth/SPIFFE ] -> [ rate-limit (token bucket,    |
   |        per-tenant) ] -> [ relabeler / schema guard ] ->             |
   |        [ cardinality admission (Bloom+count-min on label vals) ]    |
   |                                                                     |
   |   Drops incrementing: dropped_total{reason=quota|schema|cardinality}|
   |   Timeouts: 5s write, 429 on quota, back-pressure via 503+Retry-After|
   +-------------------------------+-------------------------------------+
                                   |
                                   | gRPC produce -> Kafka
                                   | ~10M samples/s = ~1.5M records/s
                                   | (batched: 10 samples/record)
                                   v
   +---------------------------------------------------------------------+
   |   KAFKA INGEST TOPIC   "metrics.raw"   (RF=3, 256 partitions)       |
   |   Partition key = hash(tenant, series_id) -> preserves per-series   |
   |      order, enables independent consumers                           |
   |   Retention: 24h (replay buffer for TSDB recovery)                  |
   |   Throughput: ~120 MB/s in, ~360 MB/s out (1 writer + 2 readers)    |
   +----+--------------------+-----------------------+-------------------+
        |                    |                       |
        | (consumer 1)       | (consumer 2)          | (consumer 3)
        | TSDB writers       | Stream processor      | Tee: exemplar
        |                    | (Flink)               | fanout to trace store
        v                    v                       v
   +----------------+   +----------------------+   +-------------------+
   | TSDB WRITER   |   | FLINK: downsampler   |   | EXEMPLAR FORWARDER|
   | FLEET         |   | + recording rules    |   | (links metrics to |
   | (per shard)   |   | + anomaly pre-agg    |   |  trace backend)   |
   |               |   | windows: 5m/1h       |   |                   |
   | ~128 shards   |   | outputs: new series  |   |                   |
   | write path:   |   | back to Kafka topic  |   |                   |
   |   wal -> mem  |   | "metrics.rollup_5m"  |   |                   |
   |   -> 2h block |   |                      |   |                   |
   |   -> mmap +   |   | checkpoints to S3    |   |                   |
   |   inv. index  |   |                      |   |                   |
   +-------+-------+   +----------+-----------+   +-------------------+
           |                      |
           | 2h blocks flush      | rollup writes (lower volume)
           v                      v
   +--------------------------------------------------------------+
   |            HOT TIER (local NVMe, 30d, 3x replicated)         |
   |   shard_0 .. shard_127   each: ~500 GB active, ~120K RPS     |
   |   memory: 32 GB/shard (index + head block)                   |
   +--------------------------+-----------------------------------+
                              | age-out compactor (every 2h)
                              v
   +--------------------------------------------------------------+
   |   WARM TIER (object store, 1y, 5-min rollups, EC 8+4)        |
   |   ~14 TB/yr, served by "store gateway" stateless pods        |
   +--------------------------+-----------------------------------+
                              | age-out (90d)
                              v
   +--------------------------------------------------------------+
   |   COLD TIER (S3 Glacier-IR / GCS Archive, 5y, 1h rollups)    |
   |   ~6 TB total, restore latency O(minutes) — ok for forensics |
   +--------------------------------------------------------------+

   QUERY PATH (separate plane) -------------------------------------

   +-----------+  PromQL/MQL  +---------------------------+
   | Grafana / |------------->|    QUERY FRONTEND         |
   | user CLI  |              |  - parse + validate       |
   +-----------+              |  - split by time window   |
                              |  - cache (Redis/memcache) |
                              |  - admission/QoS limiter  |
                              +-------------+-------------+
                                            |
                                     fanout | (shard map from SD)
                                            v
                              +-------------+-------------+
                              |       QUERIERS            |
                              |  - pull from hot shards   |
                              |  - pull from warm store   |
                              |    gateway for older data |
                              |  - merge + dedup overlap  |
                              |  - evaluate PromQL        |
                              +-------------+-------------+
                                            |
                                            v
                              (result streamed to frontend, cached)

   ALERT PATH (separate plane) -------------------------------------

   +--------------------+         +-------------------+
   | Rule config API    |-------->| Control plane     |
   | (user upserts YAML)|         | (etcd / Spanner)  |
   +--------------------+         +---------+---------+
                                            | watch (rules updated)
                                            v
                              +-----------------------------+
                              |   ALERT EVALUATORS          |
                              |   sharded by hash(rule_id)  |
                              |   every 15s:                |
                              |     1. query query-frontend |
                              |     2. evaluate expr        |
                              |     3. update state machine |
                              |        (inactive/pending/   |
                              |         firing)             |
                              |     4. check "for <dur>"    |
                              +--------------+--------------+
                                             |
                                             v
                              +-----------------------------+
                              |   NOTIFIER / ALERTMANAGER   |
                              |   - group by labels         |
                              |   - dedupe across replicas  |
                              |   - apply silences          |
                              |   - apply inhibitions       |
                              |   - route to pager/slack    |
                              +-----------------------------+

6.2 Push vs Pull — where it lands in this diagram

Above diagram is push (remote_write). A pull variant would replace the top row with:

   +-----------------+       +-------------------+
   | Service         |<------+ Scraper fleet     |
   | Discovery       | list  | (Prometheus-style)|
   | (Consul/EDS)    |------>| GET /metrics      |
   +-----------------+       +---------+---------+
                                       |
                                       v
                              (same remote_write path to Kafka)

The ingest gateway still exists; the scrapers become the "agents" from ingest's perspective. Push vs pull is mostly a discovery and network boundary question — downstream of the gateway, everything is identical.

6.3 Control plane vs data plane separation

| Plane | Members | SLO | Rate |
|-------|---------|-----|------|
| Data | gateway, Kafka, Flink, TSDB, queriers | 99.95% | 10M samples/s |
| Control | rule config, tenant quota, silences, schemas | 99.99% (config changes rare but blocking) | <100 QPS |
| Meta | service discovery, shard map, consensus | 99.99% | very low |

Do not let data-plane volume threaten control-plane availability. Config writes must succeed even when ingest is saturated — otherwise you can't push a silence during an alert storm.


7 Deep Dives (3 critical topics) #

Deep Dive 1 — TSDB storage engine internals: Gorilla compression, block layout, and why cardinality is the real bottleneck

Why critical: This is the hot loop. If you can't explain Gorilla + inverted index at whiteboard depth, you're L6. The answer to "how do you scale the TSDB?" is not "add shards" — it's bound cardinality and compress aggressively.

Alternatives considered:

| Option | Storage / sample | Query p99 (24h, 10K series) | Cardinality ceiling/node | Rejection reason |
|--------|------------------|-----------------------------|---------------------------|------------------|
| InfluxDB 1.x (TSM + TSI) | ~2.5 B/sample | ~1.2s | ~1M series/node before tsi struggles | TSI had known OOM patterns on bursty cardinality; v2 pivoted to IOx for a reason |
| Cassandra column families per metric | ~8–12 B/sample | ~3s | Arbitrary, but expensive | LSM is wrong shape; SSTable compactions burn IO; no built-in XOR compression |
| Parquet-on-S3 (analytics-first, e.g. Druid) | ~4 B/sample | ~5s for range, great for aggregation | Billions, but latency too high | Batch-oriented, can't do 15s alert eval without a separate hot path |
| Prometheus TSDB (Gorilla + mmap) | 1.37 B/sample | 0.3s | ~5M series/node | Chosen for hot tier |
| M3DB (Uber) | 1.5 B/sample | ~0.8s | 10M+ series/node (horizontal) | Chosen as reference for multi-tenant scale-out |
| Monarch (Google) | ~1 B/sample (rumored) | <1s globally | Billions globally | Chosen as aspirational target for v3 |

Chosen approach: Prometheus TSDB engine per shard (hot tier), M3-style horizontal sharding with object-store warm tier, Monarch-style global query as north star.

Gorilla math (the earned-secret bit — I can derive this on the whiteboard):

  • Timestamp encoding: delta-of-delta (DoD). First sample: full 64-bit ts. Subsequent: if DoD=0, emit 0 bit (1 bit!). If |DoD|<64, emit 10 + 7-bit signed. Further ranges: 110, 1110, 1111 + 32-bit escape. Empirically, DoD=0 in >95% of samples when scrape interval is stable → ~1 bit/timestamp amortized.
  • Value encoding: XOR current with previous. If XOR=0 (value unchanged), emit 0 bit. If XOR has same leading+trailing zero count as prior, emit 10 + meaningful bits. Else 11 + 5-bit leading + 6-bit length + meaningful bits. For slowly-moving gauges (CPU %, memory), XOR is small → ~8 bits/value amortized.
  • Total: ~9 bits/sample + per-series overhead amortized over the block → empirically 1.37 B/sample. Counters (monotonic) do even better.
  • Block boundary: every 2h, flush a complete immutable block with embedded index. Old chunks become read-only and mmap-friendly. The head block stays in RAM + WAL.
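
The bit-cost arithmetic behind the bytes-per-sample claim, as a sketch. This is not a real bit-packer (it skips the trick of reusing the previous leading/trailing-zero window for XOR'd values, so it slightly over-estimates), but it shows why a stable scrape interval plus a mostly-flat gauge lands in the low bytes-per-sample range.

// gorillacost.go: per-sample bit-cost estimate mirroring the Gorilla rules
// sketched above (delta-of-delta timestamps, XOR'd values).
package main

import (
    "fmt"
    "math"
    "math/bits"
)

func dodBits(dod int64) int {
    switch {
    case dod == 0:
        return 1 // '0'
    case dod >= -63 && dod <= 64:
        return 2 + 7 // '10' + 7-bit value
    case dod >= -255 && dod <= 256:
        return 3 + 9
    case dod >= -2047 && dod <= 2048:
        return 4 + 12
    default:
        return 4 + 32
    }
}

func valueBits(prev, cur float64) int {
    x := math.Float64bits(prev) ^ math.Float64bits(cur)
    if x == 0 {
        return 1 // unchanged value -> single '0' bit
    }
    meaningful := 64 - bits.LeadingZeros64(x) - bits.TrailingZeros64(x)
    return 2 + 5 + 6 + meaningful // control bits + leading-zero count + length + payload
}

func main() {
    totalBits, n := 0, 0
    prevTs, prevDelta := int64(0), int64(10)
    prevVal, cur := 42.0, 42.0
    for i := int64(1); i <= 360; i++ { // 1h of perfectly regular 10s scrapes
        ts := i * 10
        if i%5 == 0 {
            cur += 0.01 // gauge moves occasionally, like CPU% or memory
        }
        delta := ts - prevTs
        totalBits += dodBits(delta-prevDelta) + valueBits(prevVal, cur)
        prevTs, prevDelta, prevVal = ts, delta, cur
        n++
    }
    // Stable intervals cost ~1 bit/timestamp; a mostly-flat gauge keeps the
    // average in the low single-digit bytes per sample.
    fmt.Printf("avg bytes/sample = %.2f\n", float64(totalBits)/8/float64(n))
}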

Cardinality is the killer — the earned-secret story:

True story pattern: one team deploys a service where a label is session_id or uuid or user_id. Cardinality goes from 10K → 1M series in minutes. The inverted index postings lists explode. On a label-value query like {env="prod"} the intersection has to scan 1M-entry postings list per shard. RAM for the postings lists alone goes from 20 MB → 2 GB on one shard. A merge ({env="prod", service="checkout"} — both large postings lists) is O(|a| + |b|) — postings-list merge time, not lookup time. 30ms → 30s. The whole query frontend times out. Alert evaluation stalls (it queries through the same frontend), for 5m state timers expire incorrectly, spurious pages fire. Shard OOM-kills, replicas fail over, fail over again, and now you have a replication storm. Fleet-wide impact from one label.

Mitigations (layered defense):

  1. Admission-time cardinality budget per (tenant, metric_name). Gateway keeps a streaming count-min sketch of label-value diversity. Breach → reject with 429 cardinality_exceeded. Per-tenant default 100K series, per-metric default 10K series. Exception workflow via control plane (manually raise after schema review).
  2. Label schema governance — static config-as-code: user_id, uuid, session_id, trace_id BLACKLISTED as label keys. Trace_id belongs as an exemplar, not a label.
  3. Adaptive sampling fallback — when a metric blows budget, don't drop; sample 1:N transparently and annotate _sampled_ label. Preserves signal, breaks no dashboards, operator gets a warning alert. (Datadog and Honeycomb both do variants of this.)
  4. Bounded postings-list merge — set a query-time cap (max_series_matched=50000). Beyond cap, return 507 to user with "narrow your selector." Protects the cluster.
  5. Tombstone compaction — dead series (no writes in 2h) get evicted from the inverted index at block compaction, reclaiming postings list space. The churn case (pod_id label on ephemeral pods) is why this matters — without it, the index grows unboundedly even if active series is bounded.

Failure modes:

  • Shard OOM on index: symptom = write latency spike + RSS climbing. Detection: ingester_memory_series nearing ceiling. Blast radius: 1/N of writes (the shard's partition). Mitigation: emergency admission of that tenant (reject new series), manual shard split (double partition count for that tenant), add headroom. Recovery: 2h to roll new block and release memory.
  • Block corruption at flush: WAL replay rebuilds. WAL itself 3x replicated. Blast: 1 shard × 2h data in head → acceptable loss; can replay from Kafka if within 24h retention.

Real systems:

  • Prometheus TSDB (https://prometheus.io/docs/prometheus/latest/storage/)
  • Gorilla paper — Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database," VLDB 2015.
  • M3 — Uber, 2018, powers Uber's 10B+ series.
  • Monarch — Adams et al., "Monarch: Google's Planet-Scale In-Memory Time Series Database," VLDB 2020 — global push, RPC-based ingest, zone-local alerting, cross-zone aggregation. This is the north star.
  • VictoriaMetrics — single-binary ClickHouse-inspired alternative with aggressive compression.

Deep Dive 2 — Alerting at scale: rule sharding, the for state machine, dedup, and keeping e2e latency <30s

Why critical: The alert path is the pager. Ingestion at 10M/s is impressive, but if alerts fire 2 min late or flap, the SRE team revolts and the product is dead. Alerting is also where L6 candidates hand-wave ("Alertmanager handles it") and L7 candidates explain the state machine.

Alternatives considered:

| Option | Eval latency | Dedup quality | Scale ceiling | Rejection |
|--------|--------------|---------------|---------------|-----------|
| Single Alertmanager + single Prometheus (HA pair) | <30s | Good (gossip) | ~1K rules | Doesn't scale to 100K rules; single Prom can't hold global state |
| Thanos Ruler (decouple ruler from Prom) | ~45s | Via Alertmanager | ~10K rules | Ruler queries Thanos Querier → cross-region latency bad for sub-30s SLO |
| Cortex Ruler (multi-tenant, distributed) | ~30s | Strong, per-tenant | ~100K rules | Chosen approach for its sharding story; some consistency quirks at replica boundary |
| Borgmon-style pull + eval on aggregator | <15s | Good | Scales to Google | Not open-source; but architecture informs our design |
| Monarch global eval | <30s globally | Excellent | Planet-scale | Aspirational |

Chosen approach: Cortex-style sharded rulers, Alertmanager-style notifier clusters (gossip-dedup), per-tenant isolation.

Rule sharding: shard = hash(rule_group_id) % N_rulers. Each ruler owns a subset of rule groups and evaluates them on their interval. Rules within a group share state and are evaluated serially (so rules in the same group can depend on each other's recording rules in the previous iteration).

The for <duration> state machine (L7-depth detail):

 +-----------+    expr=true    +---------+    held for <for> duration    +---------+
 | INACTIVE  |---------------->| PENDING |------------------------------>| FIRING  |
 +-----------+                 +---------+                                +---------+
      ^                             |                                          |
      |        expr=false           |                                          |
      +-----------------------------+                                          |
      |                             expr=false                                 |
      +------------------------------------------------------------------------+
  • State key: (rule_id, group_by_label_values_hash). A single rule error_rate > 0.01 by (service) produces one state entry per service.
  • Persistence: state must survive ruler restarts (otherwise pending resets to 0 and a flapping rule never fires). Stored in a Raft group per ruler shard, snapshotted to S3 every 10 min. On ruler restart, state is reloaded. Losing state for up to 5s is acceptable — rules re-evaluate and re-enter pending.
  • Skew handling: if evaluation is late (e.g. gateway slow), the ruler uses the scheduled timestamp, not wall-clock, for for accounting. Otherwise late eval = incorrect fire.
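
The state machine above, sketched as code. The details it carries: state is keyed per (rule, group-by labelset), any false evaluation resets it, and the hold timer is measured against the scheduled evaluation time rather than wall clock, so a late evaluation cannot corrupt the for accounting.

// alertstate.go: the per-(rule, labelset) "for <duration>" state machine.
package main

import (
    "fmt"
    "time"
)

type State int

const (
    Inactive State = iota
    Pending
    Firing
)

type AlertState struct {
    State       State
    ActiveSince time.Time // first scheduled eval at which expr became true
}

// Step advances the state machine for one scheduled evaluation.
func (a *AlertState) Step(exprTrue bool, scheduled time.Time, holdFor time.Duration) {
    switch {
    case !exprTrue:
        a.State = Inactive // any false evaluation resets pending and firing
    case a.State == Inactive:
        a.State, a.ActiveSince = Pending, scheduled
    case a.State == Pending && scheduled.Sub(a.ActiveSince) >= holdFor:
        a.State = Firing
    }
}

func main() {
    var st AlertState
    start := time.Unix(1_700_000_000, 0)
    hold := 5 * time.Minute
    for tick := 0; tick < 25; tick++ { // 15s evaluation interval
        scheduled := start.Add(time.Duration(tick) * 15 * time.Second)
        st.Step(true /* error rate above threshold */, scheduled, hold)
        if st.State == Firing {
            fmt.Println("firing at", scheduled.Format(time.RFC3339))
            break
        }
    }
}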

Deduplication across ruler replicas (the +1 detail):

Every rule group runs on replica_factor=3 rulers (HA). Without dedup, every fire sends 3 pages. Alertmanager gossip dedups by (alert_fingerprint = hash(alertname + sorted_labels)); first message wins in a sliding window. Clock skew between replicas is handled by a jitter window of 2 × scrape_interval. At most one page per fingerprint per route per group_interval.
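
A sketch of that fingerprint-based dedup. The hash input (alert name plus sorted labels) and the sliding window mirror the description above; the exact hash function and window width here are illustrative.

// dedup.go: replica de-duplication by alert fingerprint. Three ruler replicas
// evaluate the same group; the notifier keeps "first notification wins inside
// a window" state keyed by hash(alertname + sorted labels).
package main

import (
    "fmt"
    "hash/fnv"
    "sort"
    "time"
)

func fingerprint(alertname string, labels map[string]string) uint64 {
    keys := make([]string, 0, len(labels))
    for k := range labels {
        keys = append(keys, k)
    }
    sort.Strings(keys)
    h := fnv.New64a()
    h.Write([]byte(alertname))
    for _, k := range keys {
        fmt.Fprintf(h, "\x00%s=%s", k, labels[k])
    }
    return h.Sum64()
}

type Deduper struct {
    window   time.Duration
    lastSent map[uint64]time.Time
}

// ShouldNotify returns true only for the first occurrence of a fingerprint
// inside the window; replicas arriving later are swallowed.
func (d *Deduper) ShouldNotify(fp uint64, now time.Time) bool {
    if t, ok := d.lastSent[fp]; ok && now.Sub(t) < d.window {
        return false
    }
    d.lastSent[fp] = now
    return true
}

func main() {
    d := &Deduper{window: 20 * time.Second, lastSent: map[uint64]time.Time{}}
    fp := fingerprint("HighErrorRate", map[string]string{"service": "checkout", "severity": "page"})
    now := time.Now()
    fmt.Println(d.ShouldNotify(fp, now))                    // true: replica 1
    fmt.Println(d.ShouldNotify(fp, now.Add(2*time.Second))) // false: replica 2, deduped
}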

Alert storm suppression:

  • Grouping — Alertmanager groups alerts by configurable labels (group_by: [cluster, alertname]). A 500-host outage → 500 HostDown alerts grouped into one notification with num_firing=500. Oncall gets one page, not 500.
  • Throttling — group_interval=5m caps notification rate.
  • Inhibition — if ClusterDown fires for cluster=us-east-1, inhibit all HostDown for hosts in that cluster. Defined declaratively.
  • Silence — operator API puts a matcher silence in effect (e.g., during planned maintenance).

Latency budget (proves <30s):

scrape interval            10 s (worst: just missed)
gateway -> Kafka           <1 s
Kafka -> TSDB writer       <2 s
TSDB block flush / index   <2 s  (head block queryable in RAM)
ruler next tick            <15 s (interval=15s)
query frontend eval        <1 s
state machine transition   <0.1 s
notifier dedupe + send     <3 s
                           -----
worst-case e2e             ~34 s  <-- just over budget
p99                        ~22 s  <-- within budget
p50                        ~12 s

The worst case slightly blows the budget; we fix it with a skip-the-queue shortcut for critical alerts: the ruler queries the TSDB head block directly (bypassing the query frontend and its cache), cutting ~2s. Monarch does this; its rulers run in the zone where the data lives.

Failure modes:

  • Ruler replica lag — detection: ruler_eval_lateness_seconds. Blast: alerts late by lag. Mitigation: oversubscribe rulers, shed non-critical rule groups first (priority label).
  • Notifier outage — detection: delivery failure counter. Mitigation: dual-write to 2 notifier clusters, page via secondary channel (SMS gateway) on double failure.
  • Alert fatigue loop — detection: pages/hour rising + page ACK rate falling. Mitigation: auto-silence with human escalation; per-rule SLO on false-positive rate.

Real systems: Borgmon (Google internal; described in SRE book ch. 10), Prometheus + Alertmanager, Cortex Ruler, Thanos Ruler, Grafana Mimir.


Deep Dive 3 — Push vs pull: trade-offs, and why we actually want both

Why critical: This is the canonical metrics-system interview bait. The L6 answer is "pull is Prometheus, push is StatsD." The L7 answer is: they solve different problems and a mature system supports both, with routing logic at the ingest gateway.

Pull (scrape) model:

| Dimension | Pull |
|-----------|------|
| Discovery | Central service discovery (Consul/Kubernetes SD/EDS) knows targets. Scraper connects in. |
| Auth | Scraper holds creds; easier to rotate centrally. |
| Rate limiting | Natural — scraper controls interval. Target can't DoS the system. |
| Back-pressure | If TSDB is slow, scraper slows. Target keeps a last-sample window. |
| Firewall traversal | Target must be reachable from scraper. Bad for NAT / customer premises. |
| Short-lived jobs | Terrible — the scrape may miss the job entirely. Need a push gateway adjunct. |
| Health | Scrape failure IS the health signal — no heartbeat needed. |
| Sample skew | Scraper drives, so samples align on scrape interval. Good for cross-host math. |

Push (remote_write) model:

| Dimension | Push |
|-----------|------|
| Discovery | Agent knows its destination; no global SD. Agent identity via SPIFFE. |
| Auth | Each agent holds creds; rotation is harder. Need a control plane. |
| Rate limiting | Server-side token bucket per tenant. Must handle thundering-herd. |
| Back-pressure | Agent local buffer absorbs; if buffer fills, agent drops (or spills to disk). Complex. |
| Firewall traversal | Excellent — agent dials out. Works across NAT, VPCs, customer environments. |
| Short-lived jobs | Natural — job pushes final sample on exit. |
| Health | Need an explicit heartbeat. Missing push ≠ target down (could be network). |
| Sample skew | Agents self-schedule — timestamps may cluster if they're all started together. |

Quantified comparison for our 100K-host fleet:

  • Pull steady state: scraper fleet of ~50 pods (each handling 2K targets at 10s interval = ~200 RPS each), with SD updates every 30s costing ~10 MB/update × 50 pods = 500 MB per cycle ≈ 1 GB/min control traffic. Target-side CPU: ~0.1% for 1000 metrics (cheap).
  • Push steady state: agent CPU ~0.2% (batching + snappy), ~70 KB/min network per host (the ~1.2 KB/s from the capacity estimate), gateway fleet of ~30 pods handling ~3–4K RPS total at 30s batches.
  • Pull fails: 5K ephemeral Lambda-like workloads that live <30s. We need a push gateway or a cron-scrape hybrid.
  • Push fails: third-party service we want to scrape but can't install an agent on. We'd need an agent sidecar or a central scraper.

Chosen approach — hybrid, with gateway as the switchboard:

  Hosts with agents -----------push remote_write----+
                                                    |
  Short jobs -------push to Pushgateway----+       |
                                           |       |
  Third-parties ---pulled by Scraper---+   |       |
                                       v   v       v
                               +---------------------+
                               |  INGEST GATEWAY     |
                               +---------------------+

Downstream is unchanged. All three paths converge at the gateway, which normalizes auth, labels, and admission. This is what Cortex/Mimir does.

Earned-secret:

In flaky-network environments (edge, mobile, customer VPC), push wins because agent-side buffer + outbound dial handles partitions gracefully. But you need to cap the agent buffer — unbounded buffer + prolonged outage = OOM the host you're trying to monitor. The right default: 100 MB disk-backed ring buffer (spills from RAM), 1h retention; beyond that, drop oldest and increment a counter (sketched below). Classic noisy-neighbor pattern: agents that can't send will consume all host disk if you let them.
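
The bounded agent buffer, sketched. The real agent is disk-backed with a time-based TTL as described above; this keeps only the two properties that matter for the argument: a hard byte cap, and drops that are counted rather than silent.

// agentbuffer.go: drop-oldest bounded buffer for the agent (sketch).
package main

import "fmt"

type Batch struct {
    Bytes int
    // ... serialized remote_write payload would live here
}

type RingBuffer struct {
    capBytes, usedBytes int
    batches             []Batch
    DroppedBatches      int // exported as something like agent_buffer_dropped_total
}

func (r *RingBuffer) Push(b Batch) {
    for r.usedBytes+b.Bytes > r.capBytes && len(r.batches) > 0 {
        oldest := r.batches[0]
        r.batches = r.batches[1:]
        r.usedBytes -= oldest.Bytes
        r.DroppedBatches++ // never drop silently
    }
    r.batches = append(r.batches, b)
    r.usedBytes += b.Bytes
}

func main() {
    rb := &RingBuffer{capBytes: 100 << 20} // 100 MB budget
    for i := 0; i < 5000; i++ {
        rb.Push(Batch{Bytes: 36 << 10}) // ~36 KB per 30s batch
    }
    fmt.Printf("buffered=%d dropped=%d used=%d MB\n",
        len(rb.batches), rb.DroppedBatches, rb.usedBytes>>20)
}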

Failure modes:

  • SD flaps (pull) → scrape-target churn → missing samples → false-positive down alerts. Mitigation: target list debouncing, scrape honor_labels + freshness tolerance.
  • Agent buffer fills (push) → host OOM or data loss. Mitigation: bounded disk buffer, early warning at 80% full, drop-policy by metric priority.
  • Both fail silently → SREs don't know they're blind. Mitigation: meta-monitoring: the monitoring system itself is monitored by a different (smaller, independent) monitoring system. Monarch calls this "Monarch-of-Monarch." Borgmon had the same pattern.

Real systems: Prometheus (pull-native + remote_write push), Datadog agent (push), Monarch (push, planet-scale), Nagios (pull), StatsD (push), OpenTelemetry Collector (both).


8 Failure Modes & Resilience #

| Component | Failure | Detection | Blast Radius | Mitigation | Recovery |
|-----------|---------|-----------|--------------|------------|----------|
| Agent | Crash loop | Missing heartbeat series up{} == 0 for 2 intervals | 1 host's metrics | Restart via systemd; health probe; stateless so fresh start is fine | Auto-resume from local buffer if disk intact; else lose <30s |
| Agent | Disk buffer fills (network partition) | Counter agent_buffer_pct > 80 | 1 host + its neighbors if host OOMs | Disk-backed bounded ring, drop-oldest policy, early warn | Drain buffer on reconnect |
| Ingest gateway | Pod crash | Pod restarts, gateway_5xx spike | Brief (seconds) for hosts hitting that pod | Stateless behind L7 LB; HPA scales | Agents retry with exponential backoff |
| Ingest gateway | Cardinality explosion attack from 1 tenant | Series-create rate spike; postings-list size growing fast | That tenant + noisy-neighbor degradation of shared shards | Per-tenant admission budget enforced with count-min sketch; 429 | Operator raises or lowers quota; emergency "freeze new series" knob |
| Kafka | Broker loss | ISR shrink, producer errors | Brief write unavailability for impacted partitions | RF=3, min.insync.replicas=2 ensures no data loss on single broker | Rebalance; agents retry |
| Kafka | Consumer lag (TSDB writer slow) | kafka_consumer_lag_ms > 60s | Delayed queries; alert staleness | Writer HPA; pause Flink rollup before TSDB (rollups are replayable) | Catch up from Kafka; no data loss since retention=24h |
| TSDB shard | OOM from cardinality | ingester_memory_series near limit, RSS climbing | 1/N of writes — that shard's partition | Hard cardinality caps; series eviction for inactive; emergency shard split | New shard drains active series; ~2h to stabilize |
| TSDB shard | Disk full | Disk metric | Shard ingest stops | Block compactor to warm tier; emergency block delete of coldest data | Replace disk, replay from Kafka |
| Query frontend | Query stampede (dashboard refresh avalanche) | Concurrent queries > threshold, p99 latency climb | Query unavailability | Admission: per-tenant concurrency cap, query cost estimate, 429 | Cache warms; tenant backs off |
| Querier | OOM on "select all series" | Heap metric | 1 querier pod | Query cost budget (max_series_matched, max_samples_per_query); 507 | Pod restart; user rewrites query |
| Ruler | Replica lag | ruler_eval_lateness_seconds > 15s | Alerts late | Oversubscribe, shed low-priority rule groups, circuit-break queries to TSDB | Catch up; may miss one eval but not page if for is long enough |
| Ruler | Silent drop (bug) | Meta-alert on ruler_evals_total rate drop | Blind spot | External synthetic alert that fires if ruler eval rate drops 50% | Code fix; roll back |
| Alert evaluator state | Raft quorum loss | Quorum health metric | Pending alerts reset; 1 eval cycle of delayed firing | RF=5 Raft, geographically spread | New quorum from surviving members; snapshot restore |
| Notifier (Alertmanager) | Delivery provider down (PagerDuty outage) | Delivery failure metric | Alerts queued, not delivered | Multi-provider (PagerDuty + OpsGenie + SMS fallback); dead-letter queue | Replay from DLQ when provider back |
| Alert storm | 10K alerts/min (power outage) | Page rate alert | Oncall overload | Grouping, inhibition (parent/child), rate limiting per route | Human declares incident, silences, triages |
| Clock skew | NTP broken on host | Sample ts outside acceptance window (5 min) → drop counter | That host's ingest | Agents drop samples with bad clock, emit clock_skew counter; central NTP health check | Fix NTP |
| Storage tier | Warm chunk corruption (CRC fail) | Read-time CRC mismatch counter | 1 block, 2h of 1 series | Cross-AZ replica; EC parity | Re-fetch from replica or reconstruct from parity |
| Silent drops anywhere | Bug or misconfig | Meta-principle: every drop increments a labeled counter drops_total{reason=} | Blind spot if not instrumented | Invariant: no unrecorded drop. Code review rule. | N/A (detection is the mitigation) |
| Monitoring-of-monitoring | Main system down, nobody knows | Secondary (smaller, independent) monitoring watches primary | Extended blindness, dangerous | Run a separate "meta-monitor" Prometheus pair that scrapes our system's /metrics endpoint, with its own independent pager | Primary recovers, meta alerts clear |

Three invariants I'd tattoo on the team wall:

  1. Every drop is counted. No silent failures. *_dropped_total{reason=...} on every boundary.
  2. The monitoring system must never take down the thing it monitors. Agent CPU/memory capped, buffered, non-blocking on application code.
  3. The monitoring system monitors itself with a separate system. Or an independent pager. Do not let the alerter for the alerter be the alerter.

9 Evolution Path #

v1 — "make it work" (100s of hosts, 1 team)

  • Stack: StatsD agents → Graphite (Carbon + Whisper) → Grafana.
  • Alerting: Nagios or bash scripts running PromQL-equivalent via check_graphite.
  • Storage: Whisper files on local disk; one server.
  • Cost: <$1K/mo.
  • Why this first: 1 engineer, 1 week, covers 90% of basic needs. Ship.
  • Pain points that force v2: no labels (hierarchical metric names only), Whisper writes are O(series) on every tick (doesn't scale past a few thousand series), no HA, no multi-tenancy.

v2 — "make it scale to a product" (1K–10K hosts, dozens of teams)

  • Stack: Prometheus (federated pairs) per region → Alertmanager → Grafana → Thanos/Cortex for long-term storage + global query.
  • Ingestion: remote_write to a central Cortex/Mimir cluster.
  • Alerting: per-team rule groups; Alertmanager HA trio.
  • Storage: Prometheus 2h blocks → S3 via Thanos Sidecar. Store Gateway for historical queries.
  • Multi-tenancy: Cortex per-tenant isolation (header-based).
  • Cost: $10–50K/mo.
  • What you get: labels, PromQL, HA, reasonable cardinality (~10M series), 1-year retention.
  • Pain points that force v3: cross-region query latency, federation fan-out math (federating 20 Prom pairs × 1M series each = 20M series queried from a single Thanos Querier → slow), tenant isolation at label-level not metric-level is hard, compaction jobs become a full-time ops burden.

v3 — "global, planet-scale, tenant-isolated" (100K+ hosts, 100s of teams, multi-region)

  • Stack: Monarch-like or M3/Mimir at scale. Push-based agents (remote_write), regional Kafka, regional TSDB shards, global query layer, sharded rulers, multi-tier storage.
  • Multi-region: active-active for alert eval (each region evaluates rules against local data, cross-region aggregation only when explicitly queried). Locality beats globality for <30s alerts.
  • Tenant isolation: per-tenant ingester shards (not just label filtering) — cardinality blast radius is contained to a tenant. Per-tenant quotas enforced at gateway.
  • Global query: hierarchical aggregation. Local query → regional aggregator → global. Most queries answered locally; only cross-region queries pay the WAN cost.
  • Storage: hot local NVMe (30d), warm object store (1y, 5m rollups), cold archive (5y, 1h rollups).
  • Cost: $1–5M/yr for 100K hosts at this scale.
  • What's next (v4): exemplar-based sampling unification with traces (OpenTelemetry), native histograms (sparse, high-resolution quantiles without bucket explosion), ML-based anomaly detection replacing static thresholds, differential-privacy-aware aggregation for externally-shared metrics.

The meta-lesson: every monitoring system evolution is driven by cardinality pressure and tenant-isolation failures, not by raw sample-rate growth. Sample rate you buy your way out of with more disk. Cardinality bankrupts your index design.


10 Out-of-1-Hour Notes (solo study enrichment) #

10.1 Cardinality quota enforcement in depth

  • Per-tenant budget (default 100K active series), per-metric budget (default 10K), per-label-value budget (e.g., pod_id capped at 50K unique values across a metric). Enforced at gateway.
  • Implementation: count-min sketch (CMS) per tenant + per metric. Sketch width/depth tuned so false-positive rate <0.1% at 100K items. Reset every 2h (aligns with TSDB block boundary).
  • Overflow behavior:
    • Soft limit breach → log + warn metric.
    • Hard limit → reject with 429 cardinality_exceeded, emit cardinality_rejected_total.
    • Operator whitelists with signed exception in control plane.
  • Why CMS and not exact set? Exact is O(|series|) RAM at the gateway — doesn't scale. CMS is O(1) with tunable error. Over-counting (false-positive rejection) is acceptable; under-counting isn't.
  • Edge case: when budget = exact cardinality, pod restarts cause label churn that false-positively hits limits. Solution: TTL-decayed sketch that forgets values unseen for 6h. Some teams run this as a Flink job because Flink owns the time-window semantics.
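
A sketch of the gateway admission path in the spirit of the above: a count-min sketch answers "has this series been seen this block?" (an estimate of zero means definitely new, since a count-min sketch never under-counts a key), and a per-(tenant, metric) counter of first-seen series is checked against the budget. Width, depth, and the reset/TTL policy are toy values here, not the tuned production numbers.

// cmsadmit.go: gateway-side cardinality admission (sketch).
package main

import (
    "fmt"
    "hash/fnv"
)

const (
    width = 1 << 16
    depth = 4
)

type CountMin struct{ rows [depth][width]uint32 }

func (c *CountMin) indexes(key string) [depth]uint32 {
    var idx [depth]uint32
    for d := 0; d < depth; d++ {
        h := fnv.New32a()
        h.Write([]byte{byte(d)}) // cheap per-row salt
        h.Write([]byte(key))
        idx[d] = h.Sum32() % width
    }
    return idx
}

// Add increments the key and returns the pre-increment estimate: 0 means this
// key was definitely not seen before.
func (c *CountMin) Add(key string) uint32 {
    idx := c.indexes(key)
    est := c.rows[0][idx[0]]
    for d := 0; d < depth; d++ {
        if c.rows[d][idx[d]] < est {
            est = c.rows[d][idx[d]]
        }
        c.rows[d][idx[d]]++
    }
    return est
}

type Admission struct {
    sketch       CountMin
    activeSeries map[string]int // key: tenant|metric
    budget       int
}

func (a *Admission) Admit(tenant, metric, seriesKey string) bool {
    if a.sketch.Add(tenant+"|"+metric+"|"+seriesKey) > 0 {
        return true // series (probably) already admitted; samples flow freely
    }
    k := tenant + "|" + metric
    if a.activeSeries[k] >= a.budget {
        return false // reject with 429 cardinality_exceeded, bump rejected counter
    }
    a.activeSeries[k]++
    return true
}

func main() {
    a := &Admission{activeSeries: map[string]int{}, budget: 3}
    for i := 0; i < 5; i++ {
        ok := a.Admit("team-search-ads", "http_requests_total", fmt.Sprintf("pod_id=%d", i))
        fmt.Printf("new series %d admitted=%v\n", i, ok)
    }
}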

10.2 Schema governance

  • Label naming conventions — lowercase snake_case, namespace prefixes (k8s_, app_), blacklist (user_id, email, session_id, anything high-cardinality or PII).
  • Metric naming — <namespace>_<subsystem>_<name>_<unit> (Prom convention): http_requests_total, process_cpu_seconds_total.
  • Enforcement: linter in CI on rule YAML + metric-name registry. Pre-registered metrics get fast-path; unknown metrics get delayed registration + cardinality audit.
  • Metric types: counter (monotonic), gauge (point-in-time), histogram (buckets pre-declared), summary (client-side quantile).

10.3 Histogram encoding deep dive

  • Classic Prom histogram: fixed buckets, e.g., le=0.005, 0.01, .... Pre-declared. Good for known distributions, bad for long tails or unknown-range metrics. Bucket count × series cardinality blows up fast (every bucket is a series!).
  • Native histograms (Prometheus 2.40+): exponential schema, sparse, logarithmically-spaced buckets at runtime. One series per metric+labels, internal bucket vector. ~5–10× cardinality reduction. Accurate quantiles without manual tuning. This is the way.
  • t-digest / DDSketch: client-side sketch for accurate quantiles, mergeable across replicas. Used in Datadog (DDSketch). Lower storage than bucketed histogram; slightly more CPU at emit.
  • Summary type (client quantile): client computes p50/p99 and emits as scalar. Cannot be aggregated across replicas (you can't average p99s). Useful only when per-instance quantile is the question; avoid otherwise.
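
A sketch of the exponential bucketing that makes native histograms sparse. With schema s the bucket boundaries sit at powers of base = 2^(2^-s), and only buckets that actually receive observations are stored. Index conventions differ slightly between Prometheus native histograms and OTel exponential histograms, so treat this as illustrative rather than a spec.

// nativehist.go: exponential (log-spaced) bucketing sketch.
package main

import (
    "fmt"
    "math"
)

// bucketIndex returns i such that base^(i-1) < v <= base^i, for v > 0.
func bucketIndex(v float64, schema int) int {
    return int(math.Ceil(math.Log2(v) * math.Exp2(float64(schema))))
}

func upperBound(index, schema int) float64 {
    base := math.Exp2(math.Exp2(float64(-schema)))
    return math.Pow(base, float64(index))
}

func main() {
    schema := 3 // base = 2^(1/8), roughly 9% relative bucket width
    sparse := map[int]uint64{}
    for _, latency := range []float64{0.004, 0.012, 0.013, 0.25, 1.9, 2.3} { // seconds
        sparse[bucketIndex(latency, schema)]++
    }
    for idx, count := range sparse {
        fmt.Printf("bucket le=%.4fs count=%d\n", upperBound(idx, schema), count)
    }
}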

10.4 Exemplars — linking metrics ↔ traces

  • Exemplar: a (trace_id, sample_value, timestamp) attached to a metric sample.
  • Use case: dashboard shows http_request_duration_seconds p99 = 2.3s — click it, get 10 trace exemplars for 2.3s requests. Jump into Jaeger/Tempo with the trace_id.
  • Storage: exemplars stored in a circular buffer per series (e.g., last 100 per series per hour). Not every sample has an exemplar — sampled at emit time.
  • Why this is the future: it's the bridge across observability pillars without making labels high-cardinality. The trace_id lives in the exemplar (low overhead), not in the label set (cardinality catastrophe).
  • OpenTelemetry standardizes the exemplar protocol.

10.5 Adaptive sampling under overload

  • Problem: metric emission is a best-effort pipeline. Under overload we want to degrade gracefully — not drop uniformly (biases signal), not drop entirely (go blind).
  • Strategy — priority-based shedding:
    1. Drop debug-level metrics first (priority=low).
    2. Sample info-level metrics at 1/N.
    3. Keep critical-level metrics (SLOs, RED dashboards, host health) at 100%.
  • Strategy — per-series sampling: for high-volume series, emit 1 in 10 samples but tag with _sampled_rate_=10 so PromQL rate() can compensate. Requires PromQL extension.
  • Meta-concern: you MUST NOT drop counters silently — they become unrecoverable. For counters, emit monotonic increments less often (sum preserved across sample drops).

10.6 Cross-pillar correlation: metrics ↔ logs ↔ traces

  • Shared IDs: every log line and trace span carries service, env, trace_id. Metrics tag exemplars with trace_id.
  • Convergence point: an incident page shows a metric anomaly → exemplar → trace → spans highlight slow DB call → log with that trace_id shows query plan.
  • Anti-pattern: adding trace_id as a metric label for "correlation." Kills cardinality. Exemplars exist specifically to avoid this.

10.7 Regulatory — PII in labels

  • Hard rule: no PII in metrics labels. Ever. Labels are indexed, searchable, cached, exported to dashboards, shared across teams. PII in a label is a GDPR/CCPA violation waiting to happen.
  • Enforcement:
    1. Ingest gateway runs a regex-based PII detector on label values (email, phone, SSN patterns). Matches → drop series + alert.
    2. Label keys have an allow-list (static CI config).
    3. Periodic audit scan: enumerate top-cardinality labels, verify against allow-list.
  • Edge case: "I want per-user metrics." Answer: per-user metrics are not metrics. They're analytics. Route them to a purpose-built analytics pipeline with privacy controls. Or aggregate to cohort granularity. Or use differential privacy with noise.
  • EU residency: EU-tagged tenant → ingest gateway routes to EU-region Kafka + TSDB only. Cross-region query explicitly denied for EU tenants unless legal-reviewed.
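
A sketch of the ingest-time label hygiene described above: an allow-list on label keys plus a regex pass for PII-shaped values. The key prefixes and patterns below are deliberately simple examples, not a complete allow-list or PII taxonomy.

// piiguard.go: gateway-side label hygiene (sketch).
package main

import (
    "fmt"
    "regexp"
)

var allowedKeyPrefixes = []string{"k8s_", "app_", "service", "env", "status", "le", "host"}

var piiPatterns = []*regexp.Regexp{
    regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`), // email-ish
    regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`),   // US SSN shape
    regexp.MustCompile(`\+?\d[\d\s().-]{8,}\d`),   // phone-ish
}

func keyAllowed(key string) bool {
    for _, p := range allowedKeyPrefixes {
        if key == p || (len(key) >= len(p) && key[:len(p)] == p) {
            return true
        }
    }
    return false
}

func violations(labels map[string]string) []string {
    var out []string
    for k, v := range labels {
        if !keyAllowed(k) {
            out = append(out, "key not in allow-list: "+k)
        }
        for _, re := range piiPatterns {
            if re.MatchString(v) {
                out = append(out, "PII-shaped value in "+k)
            }
        }
    }
    return out
}

func main() {
    bad := map[string]string{"service": "checkout", "user_email": "alice@example.com"}
    fmt.Println(violations(bad)) // drop the series, bump dropped_total{reason="pii"}, page the owning team
}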

10.8 Cost model — $/series

  • Breakdown per active series (annualized, at our scale):
    • Ingest CPU/network: ~$0.001/series/yr
    • Hot TSDB (30d SSD replicated 3×): ~$0.05/series/yr
    • Warm (1y, rolled up): ~$0.01/series/yr
    • Cold (5y): ~$0.002/series/yr
    • Index RAM (hot): ~$0.03/series/yr
    • Total: ~$0.10/series/yr.
  • 100M series = $10M/yr all-in. 1B series = $100M/yr.
  • Unit economics for chargeback: bill tenants by active series-months, not bytes. Series is the resource that fails first; byte-based billing rewards cardinality abuse.
  • Compare: Datadog's public price list is ~$0.10–$0.50 per custom metric per month = $1.20–$6/yr — so our internal cost is roughly 10–50× cheaper than buying Datadog, which is why big companies build in-house.

10.9 Observability of observability (meta-monitoring)

  • Principle: the monitoring system cannot be the sole monitor of itself. Circular dependency.
  • Implementation:
    • Meta-Prometheus pair (2 hosts, outside main stack) scrapes the main stack's /metrics endpoints.
    • Independent pager (different PagerDuty account? separate SMS gateway?) routes meta-alerts.
    • Synthetic canary: a cron job pushes a known metric value every minute; a meta-rule alerts if the value doesn't appear in the main system within 60s.
    • Dead-man's switch: a rule that fires if it stops firing. Detects ruler death.
  • Real-world lesson: Facebook built Gorilla (2013–2015) in part because its existing HBase-backed metrics store was too slow to query during the very incidents it was supposed to help debug; the monitoring system could not be relied on to diagnose problems fast enough. Hence an out-of-band in-memory fast-query tier and a separate meta-alerting path.

10.10 L7 discussion hooks for the interview (what I'd volunteer if asked)

  1. "How would you detect an SRE who lies on their runbook?" — Runbooks embed a runbook_executed_total{runbook_id=...} counter. Delta over incident time vs attestation. Audit log pairing.
  2. "What's the one thing you'd change if you were starting over?" — Start with native histograms and exemplars from day 1. Retrofitting them is expensive because existing dashboards use classic histogram buckets; rewriting dashboards across 500 teams is a multi-year migration.
  3. "How do you handle the cardinality of cloud-native (Kubernetes) labels?" — Kubernetes labels (pod_name, replicaset_id) are intrinsically high-cardinality because pods churn. Solutions: (1) drop pod_name at ingest, keep only deployment_name (stable). (2) sample — only keep pod_name for a fraction. (3) short TTL — expire pod_name series 1h after last write. All three, layered.
  4. "AI/ML angle given my Meta Agentic AI background": the same metrics system underpins RL training monitoring (agent step rewards, loss curves, KL divergence) and inference serving (token latency, TTFT, cache hit). Privacy infra angle: when metrics carry per-tenant signals in a multi-tenant AI platform, label-level tenant isolation (not just auth) is required — a "noisy tenant neighbor" in the TSDB is a cross-tenant information leak if query authz is by label match. Push is safer here because auth happens at the wire, before tenanting.
  5. "How do you validate correctness end-to-end?" — Canary series. Inject a known series at ingress, query at egress, assert value+timestamp match with <5s skew. Alert if canary fails. This catches silent corruption that byte-level replicas don't.

Appendix A — Sanity check of the math #

  • Hosts: 100K.
  • Series/host: 1000.
  • Samples/series/s: 0.1.
  • Samples/s total: 100K × 1000 × 0.1 = 10M ✓
  • Bytes/sample (Gorilla): 1.37.
  • Bytes/s stored: 10M × 1.37 = 13.7 MB/s ✓
  • Bytes/day: 13.7 × 86400 = 1.18 TB ✓
  • Bytes/30d hot: 35.4 TB ✓ replicated 3× = 106 TB ✓
  • Active series: 100M to 500M realistic; 1B ceiling. Index RAM at 1 KB/series = 100 GB – 1 TB. Sharded 10–100×, fits. ✓
  • Alert budget: 10s scrape + 1s gateway + 2s Kafka + 2s TSDB + 15s eval + 3s notify = 33s worst, 22s p99. ✓ (tight but feasible; shortcut via head-block direct-query in ruler for critical alerts to hit <30s hard SLO)
  • Query fanout: 1000 QPS × avg 10 shards touched = 10K shard-queries/s. Spread over 128 shards that is ~80 shard-queries/shard/s, against ~20K shard-QPS capacity per shard → easy. ✓

Appendix B — What I explicitly rejected and why #

  • InfluxDB 1.x — TSI index had OOM patterns under bursty cardinality; the rewrite to v2/IOx exists because the original design didn't separate index from storage correctly. Not production-grade at 100M+ series.
  • A single giant Prometheus — can't horizontally scale writes past ~1M samples/s on a single node. Federation is a dead-end for our scale.
  • Cassandra for time series — works, but 5–10× more storage than Gorilla, and no built-in delta-of-delta encoding. Good for Netflix's Atlas (which rolled their own time series atop Cassandra, with custom compression) — we'd reinvent what Prometheus/M3 already gave us.
  • Elasticsearch for metrics — Lucene is built for text, not floats. Storage overhead ~5× Gorilla. Query path not optimized for range queries. Was tried (Elastic TSDB features); still not competitive with purpose-built TSDBs.
  • Kafka as primary storage (log-only architecture) — tempting (everything is a log!), but you lose efficient range scans and you pay 5× storage. Kafka is the right buffer, the wrong store.
  • No Kafka at all (direct agent → TSDB) — couples agent uptime to TSDB uptime. TSDB restarts lose in-flight data. Kafka decouples the two control loops, at the cost of 2s added latency.

End of design. If I had another 30 min in the interview I'd go deep on (a) query frontend caching (split-and-merge, time-based cache keys) or (b) native histogram math (exponential bucketing schema, merge algebra). Both are L7-depth topics not fully expanded here.
