Q5 — Real-Time Analytics & Monitoring
Metrics Collection & Monitoring System — L7 Reference Design
Collect fleet-wide metrics, retain history, and trigger alerts quickly without drowning the system in high-cardinality series.
1 Problem Restatement & Clarifying Questions #
Restated: Build a fleet-wide telemetry plane that ingests numeric time series from ~100K hosts, stores them cheaply at tiered retention, evaluates alert rules in seconds, serves ad-hoc dashboard queries in <1s, and never lies to the oncall.
Clarifying Qs (I'd ask the interviewer; I'll lock answers and proceed):
| # | Question | Locked Assumption | Why it matters |
|---|---|---|---|
| Q1 | Fleet size? | 100K hosts, 1 region today, scale path to 3 regions × 500K hosts | Shard count, index partitioning |
| Q2 | Metrics/host? | ~1000 active series per host (host + container + app) | Core BOE input |
| Q3 | Scrape interval? | 10s default, 1s for a 5% hot subset | Sample rate |
| Q4 | Cardinality budget? | ~100M active series today, 1B ceiling in 2y | Index RAM, fanout cost |
| Q5 | Retention? | 30d raw (10s), 1y 5-min rollups, 5y 1-hr rollups | Tiered storage sizing |
| Q6 | Query concurrency? | ~1000 dashboard QPS, ~10K alert-rule evals/s | Query planner + caching |
| Q7 | Multi-tenant? | Yes — ~500 internal tenants (teams), per-tenant quota + label-based isolation | Admission control |
| Q8 | Alert SLO? | <30s end-to-end: scrape → eval → page | Pipeline budget |
| Q9 | Pillar scope? | Metrics only. Logs & traces are sibling pillars (correlated via exemplars) | Prevents scope creep into "design observability" |
| Q10 | Regulatory? | No PII in labels (hard rule, enforced at ingest). EU data residency for EU tenants | Admission control + geo-partition |
| Q11 | Push vs pull mandate? | Candidate's choice — justify | Core deep-dive |
| Q12 | Query language? | PromQL-compatible (industry-standard), MQL-like extensions for joins | API shape |
What I am NOT designing: log search (that's Loki/ELK), distributed tracing (Jaeger/Tempo), continuous profiling (Pyroscope/Parca), synthetic probes, APM (that's a product on top of these pillars), SLO/SLI computation engine (downstream consumer).
2 Functional Requirements #
In-scope (numbered):
- FR1 — Ingest numeric samples typed as counter, gauge, histogram, summary with a labelset `{k=v, ...}` and a timestamp. Support Prometheus `remote_write` (protobuf + snappy) and OTLP/metrics (gRPC).
- FR2 — Query via PromQL-compatible language over arbitrary time ranges, with range/instant queries, subqueries, and streaming results.
- FR3 — Alerting rules — declarative expressions evaluated on a schedule; fire on condition held `for <duration>`; route to notifiers (Pager, Slack, webhook).
- FR4 — Recording rules — pre-aggregate expensive expressions into new series (e.g., SLO burn rate).
- FR5 — Dashboards — serve to Grafana-style UIs; query frontend handles caching, splitting, and concurrency limits.
- FR6 — Host health — every agent emits a heartbeat series; missing heartbeat ⇒ synthetic `host_down` alert.
- FR7 — Service discovery — for pull-mode targets, maintain a registry that advertises scrape targets per tenant.
- FR8 — Multi-tenancy — per-tenant quotas (ingestion rate, active series, query RU/s), label-based isolation, auditable admission.
- FR9 — Silences & inhibitions — operator-facing API to mute alerts by label matcher; inhibition prevents child alerts when parent fires.
- FR10 — Downsampling & retention tiering — automatic rollups 10s → 5m → 1h; tiered evict to object store.
Out-of-scope (explicit): logs, traces, profiles, events, APM, RUM, synthetic monitoring, CMDB, deploy tracking. These are sibling observability pillars — metrics integrates with them via exemplars (trace_id attached to a sample) but does not subsume them. Conflating pillars is how you get "one giant unqueryable Elastic cluster."
3 NFRs + Capacity Estimate #
NFRs
| Dim | Target | Notes |
|---|---|---|
| Availability — ingest | 99.95% | 4.3h/yr error budget. Ingest degrades gracefully (agent local buffer absorbs outages <1h). |
| Availability — alert eval | 99.99% | Harder SLO than ingest — alerts are the pager. 52 min/yr budget. Multi-region active-active for alert evaluators. |
| Availability — query | 99.9% | Dashboards can tolerate brief errors; retries are cheap. |
| Ingest latency | p99 < 5s scrape → durable | Batching + Kafka commit |
| Alert eval latency e2e | p99 < 30s | scrape (10s) + Kafka (2s) + eval window (15s) + notify (3s) = 30s budget |
| Query p99 | <1s for 24h range over ≤10K series; <5s for 7d range | Query frontend caches + shard fanout |
| Durability — raw hot | 30d, 3-way replicated, SSD | Loss tolerance: 1 AZ outage = 0 loss |
| Durability — warm | 1y rolled-up, 3-way object storage | S3-class 11 nines |
| Durability — cold | 5y 1h rollups, erasure-coded | Archive tier |
| Consistency | Monotonic per-series writes; eventually-consistent across replicas; stale read tolerated at query (+/- 15s) | Strong consistency not needed for metrics |
| Correctness | Never silently drop without counter-bump | Every drop increments metrics_dropped_total{reason=...} — you cannot debug what you cannot count |
Back-of-Envelope (show the math)
Inputs:
- H = 100K hosts
- S_h = 1000 active series / host
- r = 1 sample per 10s = 0.1 Hz
Samples/sec (peak):
samples/s = H × S_h × r = 100,000 × 1,000 × 0.1 = 10,000,000 = 10M samples/s
Round to 10M samples/s steady-state, 20M peak (deploy storms, rescrape bursts).
Active series (global): 100K × 1000 = 100M series. Plus 2× blast factor for label churn (pod restart adds a new series because pod_id rotates) → plan for 200–500M series, ceiling at 1B.
Compressed bytes per sample: Gorilla (Facebook 2015) achieves 1.37 bytes/sample average — timestamp delta-of-delta (7 bits median) + value XOR (8–16 bits median for slowly-moving gauges). Reference: Pelkonen et al., VLDB 2015.
Ingress bandwidth (wire, pre-compression):
per-sample raw ≈ 8B ts + 8B val + protobuf framing + an amortized labelset reference (labels are not repeated; sent as a series_id after the first write) → ~32B amortized
wire = 10M × 32B = 320 MB/s raw
snappy ≈ 3× → ~107 MB/s on-wire
Call it ~120 MB/s ingest, ~1 Gbps. Split across regional receivers.
Storage/day (compressed at rest):
10M samples/s × 86,400s × 1.37B = 1.18 TB/day raw-hot
30d hot = ~35 TB. Replicated 3× = ~105 TB SSD. That's ~20 hosts with 6TB NVMe each — easy.
Downsampling:
- 10s → 5m rollup = 30× reduction
- 5m → 1h rollup = 12× reduction
Warm (1y @ 5m) = 1.18TB/d × 365 / 30 = ~14.3 TB/y replicated 3× = ~43 TB object storage
Cold (5y @ 1h) = warm / 12 × 5 = ~6 TB total
Total long-term footprint: ~150 TB. The warm + cold tiers (~50 TB) at ~$0.023/GB/mo S3 come to roughly **$1K/mo** — a rounding error. Hot SSD is the expensive tier (~$300/mo/TB on NVMe EC2) → **~$30K/mo**. Cardinality dominates cost, not bytes — index RAM >> chunk bytes.
Cardinality budget vs index RAM:
- Per series: ~1–2 KB inverted-index overhead (labelset interning + postings lists).
- 100M series → ~150 GB RAM across index shards. Shard the index 10× → 15 GB/shard. Fits on 32GB boxes with headroom.
- 1B series → 1.5 TB RAM index. Now you're sharding 100× and postings-list merges hurt. This is the real scale cliff.
Query fanout:
- A dashboard panel "p99 latency by service over 24h" touches ~1K series × 8640 points = ~8M points. Across N=10 shards, each shard serves ~800K points.
- At 1000 dashboard QPS, total internal fanout = 10K shard-queries/s. Hence the query frontend MUST cache and split.
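A minimal Python sketch that reproduces the capacity arithmetic above; the constants are simply the locked assumptions from section 1, not an implementation of anything.

```python
# Back-of-envelope check of the numbers above (assumptions: 100K hosts,
# 1000 series/host, 10s scrape, Gorilla ~1.37 B/sample, 3x hot replication).
HOSTS = 100_000
SERIES_PER_HOST = 1_000
SCRAPE_HZ = 0.1                 # 1 sample / 10 s
BYTES_PER_SAMPLE = 1.37         # Gorilla, compressed at rest
HOT_DAYS, REPLICATION = 30, 3

samples_per_sec = HOSTS * SERIES_PER_HOST * SCRAPE_HZ
bytes_per_day = samples_per_sec * 86_400 * BYTES_PER_SAMPLE
hot_tb = bytes_per_day * HOT_DAYS * REPLICATION / 1e12

print(f"samples/s          : {samples_per_sec:,.0f}")       # 10,000,000
print(f"compressed TB/day  : {bytes_per_day / 1e12:.2f}")    # ~1.18
print(f"hot tier (30d, 3x) : {hot_tb:.0f} TB SSD")           # ~106
```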
4 High-Level API #
4.1 Ingest
POST /api/v1/write — Prometheus remote_write compatible
Content-Type: application/x-protobuf
Content-Encoding: snappy
X-Tenant-Id: team-search-ads
X-Prometheus-Remote-Write-Version: 0.1.0
Body: WriteRequest { timeseries: []TimeSeries, metadata: []MetricMetadata }
TimeSeries { labels: []Label, samples: []Sample, exemplars: []Exemplar }
Sample { value: double, timestamp_ms: int64 }
Exemplar { labels: []Label /* trace_id */, value: double, timestamp_ms: int64 }
Response: 204 (ok) | 400 (bad proto) | 429 (quota) | 503 (back-pressure)
- Idempotency: `(tenant, series_hash, timestamp)` is the natural key. Duplicate writes are silently deduped at the TSDB (same value) or rejected (different value within 1ms tolerance); out-of-order writes beyond `lookback_delta` bump an `err_out_of_order` counter rather than returning 4xx — we don't want agents to retry old data.
- Batching: agents batch ~2s of samples or 1MB, whichever comes first (short batches keep scrape → durable p99 inside the 5s NFR).
- Auth: mTLS (SPIFFE ID). Tenant derived from SPIFFE SAN.
POST /v1/metrics — OTLP/metrics gRPC
service MetricsService {
rpc Export(ExportMetricsServiceRequest) returns (ExportMetricsServiceResponse);
}
- Supported alongside remote_write; internal adapter converts OTLP resource/instrumentation scope → Prometheus label conventions.
4.2 Query
GET /api/v1/query?query=<PromQL>&time=<unix> — instant query
GET /api/v1/query_range?query=<PromQL>&start=&end=&step= — range query
GET /api/v1/labels, /api/v1/label/<name>/values, /api/v1/series — metadata
- Response: `{ status, data: { resultType: "matrix"|"vector"|"scalar", result: [...] } }`
- Streaming variant: `GET /api/v1/query_stream` returns gRPC server-streaming `QueryResult` chunks — for dashboards loading 1M points without buffering.
- Pagination: label-value endpoints paginate via `match[]` filter and `limit` + `next_token` (opaque cursor).
- Errors: 400 parse, 422 evaluation (e.g., many-to-many match), 429 concurrency, 504 query-timeout (bounded at 60s), 507 too-many-series (>50K series matched → force user to narrow).
4.3 Alerting control plane
PUT /api/v1/rules/<group> — upsert rule group (gRPC + HTTP)
groups:
- name: search-frontend
interval: 15s
rules:
- alert: HighErrorRate
expr: sum(rate(http_errors_total[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service) > 0.01
for: 5m
labels: { severity: page, team: search }
annotations: { runbook: "https://..." }
- Rules are stored in a separate control-plane etcd (not in the TSDB — config ≠ data).
POST /api/v1/silences — create silence { matchers, starts_at, ends_at, creator, comment }
POST /api/v1/inhibit — create inhibition pair { source_matchers, target_matchers, equal_labels }
4.4 Health
GET /healthz per component — fast, no dependencies. Returns 200 ok or 503 + reason.
GET /readyz — slower, checks downstream (Kafka reachable, shard index loaded).
Heartbeat metric emitted by agent: up{host=...} = 1. Absence ⇒ alert.
5 Data Schema #
Critical L7 insight: the metrics schema is NOT one table. It's six distinct datasets (5.1–5.6) with very different access patterns, spread across five storage engines.
5.1 Series (hot TSDB)
- Logical: `series_id = uint64 hash(tenant, metric_name, sorted_labels)`.
- Physical: per-series chunk files, 2h blocks (Prometheus convention); chunks hold compressed (ts, val) pairs via Gorilla (XOR value + delta-of-delta ts).
- Engine choice: custom columnar format (similar to Prometheus TSDB `mmap` blocks or M3DB). Not RocksDB — Gorilla compression beats LSM by 5–10× for this workload and we don't need point updates.
- Partition key: `(tenant, series_id) % shard_count`, then within shard by 2h time block.
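A minimal sketch of the series identity and shard routing described above. The specific hash (truncated SHA-256) and the 128-shard count are illustrative assumptions; the design only requires a stable uint64 hash over (tenant, metric_name, sorted labels).

```python
import hashlib

def series_id(tenant: str, metric: str, labels: dict) -> int:
    """64-bit id over tenant + metric name + label set, sorted so that label
    order on the wire never changes the identity of a series."""
    canonical = tenant + "\x00" + metric + "\x00" + \
        "\x00".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:8], "big")

def shard_for(tenant: str, sid: int, shard_count: int = 128) -> int:
    """Partition key (tenant, series_id) % shard_count, as in 5.1."""
    tenant_h = int.from_bytes(hashlib.sha256(tenant.encode()).digest()[:8], "big")
    return (tenant_h ^ sid) % shard_count

sid = series_id("team-search-ads", "http_requests_total",
                {"service": "checkout", "status": "500"})
print(sid, shard_for("team-search-ads", sid))
```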
5.2 Inverted Index (label → series)
- Logical: `label_kv → sorted_postings_list[series_id]` (e.g., `service=checkout → [42, 108, 10234, ...]`).
- Query evaluation: intersect/union postings lists for each matcher in the selector `{service="checkout", status="500"}` — this is the classic search-engine algorithm.
- Engine: RocksDB (LSM is good here — index is append-mostly with occasional compaction) plus an in-memory hash of recent labels for the 2h active block.
- Why separate from TSDB? TSDB layout is optimized for time-range scans of a known series; index is optimized for resolving "which series match this selector?" — different B-tree vs postings-list access pattern. Conflating them is what killed InfluxDB 1.x's tsm+tsi mashup.
- Why RocksDB and not Lucene? We need fast range queries on timestamps inside postings, integer-heavy, and we want mmap friendliness. Lucene is too text-oriented and allocates too much. M3 and VictoriaMetrics both roll custom inverted indexes over similar LSM-like structures.
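A sketch of the postings-list resolution described above, restricted to equality matchers (no regex or negation). Intersecting the smallest list first is the standard optimization; the O(|a| + |b|) merge is exactly the cost that explodes in the cardinality story in Deep Dive 1.

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Merge-intersect two sorted postings lists in O(|a| + |b|)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def resolve(selector: dict, index: dict) -> list[int]:
    """Resolve {label: value} equality matchers to series_ids, smallest list first."""
    postings = sorted((index[(k, v)] for k, v in selector.items()), key=len)
    result = postings[0]
    for p in postings[1:]:
        result = intersect(result, p)
    return result

index = {("service", "checkout"): [42, 108, 10234],
         ("status", "500"): [7, 42, 10234, 99999]}
print(resolve({"service": "checkout", "status": "500"}, index))  # [42, 10234]
```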
5.3 Alert Rule Definitions (control plane)
- Schema: `{rule_id, tenant_id, group_name, expr_promql, for_duration, labels, annotations, version, updated_by, updated_at}`
- Engine: etcd (strong consistency for config) or Spanner-like (if multi-region). ~10K rules/tenant × 500 tenants = 5M rules → fits.
5.4 Alert State (firing/pending/resolved)
- Schema: `{alert_key = hash(rule_id, group_by_label_values), state ∈ {inactive, pending, firing}, active_since, last_eval, last_value, annotations}`
- Engine: replicated in-memory with periodic snapshot to object store. Cannot live in the TSDB — state changes are not time series; they're mutable records indexed by alert_key.
- Why not Redis? We control failover semantics better with a custom Raft group per evaluator shard. Redis is fine if you trust Sentinel; Monarch-scale systems run their own.
5.5 Silences / Inhibitions
- Schema: `{silence_id, matchers: []LabelMatcher, starts_at, ends_at, creator, comment}` — low QPS, <1M rows globally.
- Engine: Spanner / etcd. Replicated to every alert evaluator on change.
5.6 Downsample Rollups (warm/cold)
- Schema: same series_id, but samples are `{ts, count, sum, min, max, last}` quintuples (enables accurate rate/quantile reconstruction within rollup resolution).
- Engine: object storage (S3/GCS) as immutable Parquet-like blocks, keyed `tenant/<series_id_prefix>/<block_start>.chunk`. An index (block → series_ids present) is kept in a metadata service.
- Why a quintuple instead of a single downsampled value? If you only store the mean, you destroy tail latency. Keeping `count + sum + min + max + last` (plus `histogram_bucket_sum` for native histograms) lets you reconstruct `rate`, `avg`, `max`, etc., within the rollup window. Prometheus native histograms and M3's approach both do this.
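A sketch of the rollup quintuple under simple assumptions — raw samples are `(ts_ms, value)` pairs and one call covers one 5-minute window. The point it demonstrates: `max` and `avg` both survive the rollup, which a mean-only downsample would not.

```python
from dataclasses import dataclass

@dataclass
class Rollup:
    count: int = 0
    sum: float = 0.0
    min: float = float("inf")
    max: float = float("-inf")
    last: float = 0.0

def rollup_5m(samples: list[tuple[int, float]]) -> Rollup:
    """Collapse raw (ts_ms, value) samples in one 5-minute window into the
    {count, sum, min, max, last} quintuple from 5.6."""
    r = Rollup()
    for _, v in sorted(samples):          # sort by timestamp so `last` is truly last
        r.count += 1
        r.sum += v
        r.min, r.max, r.last = min(r.min, v), max(r.max, v), v
    return r

# Reconstruction inside the rollup window: avg and the tail (max) are both recoverable.
r = rollup_5m([(0, 0.2), (10_000, 0.9), (20_000, 0.4)])
print(r.sum / r.count, r.max)   # avg = 0.5, max = 0.9
```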
6 System Diagram — CENTERPIECE #
6.1 End-to-end (single region, one AZ triangle shown)
+-----------------------------------------------------------------------------+
| 100K HOSTS (per region) |
| +---------+ +---------+ +---------+ +---------+ +---------+ |
| | host_1 | | host_2 | | host_3 | | ... | | host_N | |
| | app+ | | app+ | | app+ | | | | app+ | |
| | agent | | agent | | agent | | agent | | agent | |
| | (100MB | | | | | | | | | |
| | local | | | | | | | | | |
| | buffer)| | | | | | | | | |
| +----+----+ +----+----+ +----+----+ +----+----+ +----+----+ |
| | | | | | |
+--------|------------|------------|------------|------------|----------------+
| | | | |
        | HTTPS remote_write (snappy+proto), mTLS (SPIFFE), batch=2s/1MB
        | per-host: ~1.2 KB/s wire, ~30 req/min (batched)
| aggregate: ~120 MB/s, ~50K RPS across fleet
v v v v v
+---------------------------------------------------------------------+
| REGIONAL INGEST GATEWAY (stateless, fleet of ~30 pods) |
| |
| [ Envoy L7 ] -> [ auth/SPIFFE ] -> [ rate-limit (token bucket, |
| per-tenant) ] -> [ relabeler / schema guard ] -> |
| [ cardinality admission (Bloom+count-min on label vals) ] |
| |
| Drops incrementing: dropped_total{reason=quota|schema|cardinality}|
| Timeouts: 5s write, 429 on quota, back-pressure via 503+Retry-After|
+-------------------------------+-------------------------------------+
|
| gRPC produce -> Kafka
| ~10M samples/s = ~1.5M records/s
| (batched: 10 samples/record)
v
+---------------------------------------------------------------------+
| KAFKA INGEST TOPIC "metrics.raw" (RF=3, 256 partitions) |
| Partition key = hash(tenant, series_id) -> preserves per-series |
| order, enables independent consumers |
| Retention: 24h (replay buffer for TSDB recovery) |
| Throughput: ~120 MB/s in, ~360 MB/s out (1 writer + 2 readers) |
+----+--------------------+-----------------------+-------------------+
| | |
| (consumer 1) | (consumer 2) | (consumer 3)
| TSDB writers | Stream processor | Tee: exemplar
| | (Flink) | fanout to trace store
v v v
+----------------+ +----------------------+ +-------------------+
| TSDB WRITER | | FLINK: downsampler | | EXEMPLAR FORWARDER|
| FLEET | | + recording rules | | (links metrics to |
| (per shard) | | + anomaly pre-agg | | trace backend) |
| | | windows: 5m/1h | | |
| ~128 shards | | outputs: new series | | |
| write path: | | back to Kafka topic | | |
| wal -> mem | | "metrics.rollup_5m" | | |
| -> 2h block | | | | |
| -> mmap + | | checkpoints to S3 | | |
| inv. index | | | | |
+-------+-------+ +----------+-----------+ +-------------------+
| |
| 2h blocks flush | rollup writes (lower volume)
v v
+--------------------------------------------------------------+
| HOT TIER (local NVMe, 30d, 3x replicated) |
| shard_0 .. shard_127 each: ~500 GB active, ~120K RPS |
| memory: 32 GB/shard (index + head block) |
+--------------------------+-----------------------------------+
| age-out compactor (every 2h)
v
+--------------------------------------------------------------+
| WARM TIER (object store, 1y, 5-min rollups, EC 8+4) |
| ~14 TB/yr, served by "store gateway" stateless pods |
+--------------------------+-----------------------------------+
| age-out (90d)
v
+--------------------------------------------------------------+
| COLD TIER (S3 Glacier-IR / GCS Archive, 5y, 1h rollups) |
| ~6 TB total, restore latency O(minutes) — ok for forensics |
+--------------------------------------------------------------+
QUERY PATH (separate plane) -------------------------------------
+-----------+ PromQL/MQL +---------------------------+
| Grafana / |------------->| QUERY FRONTEND |
| user CLI | | - parse + validate |
+-----------+ | - split by time window |
| - cache (Redis/memcache) |
| - admission/QoS limiter |
+-------------+-------------+
|
fanout | (shard map from SD)
v
+-------------+-------------+
| QUERIERS |
| - pull from hot shards |
| - pull from warm store |
| gateway for older data |
| - merge + dedup overlap |
| - evaluate PromQL |
+-------------+-------------+
|
v
(result streamed to frontend, cached)
ALERT PATH (separate plane) -------------------------------------
+--------------------+ +-------------------+
| Rule config API |-------->| Control plane |
| (user upserts YAML)| | (etcd / Spanner) |
+--------------------+ +---------+---------+
| watch (rules updated)
v
+-----------------------------+
| ALERT EVALUATORS |
| sharded by hash(rule_id) |
| every 15s: |
| 1. query query-frontend |
| 2. evaluate expr |
| 3. update state machine |
| (inactive/pending/ |
| firing) |
| 4. check "for <dur>" |
+--------------+--------------+
|
v
+-----------------------------+
| NOTIFIER / ALERTMANAGER |
| - group by labels |
| - dedupe across replicas |
| - apply silences |
| - apply inhibitions |
| - route to pager/slack |
+-----------------------------+
6.2 Push vs Pull — where it lands in this diagram
Above diagram is push (remote_write). A pull variant would replace the top row with:
+-----------------+ +-------------------+
| Service |<------+ Scraper fleet |
| Discovery | list | (Prometheus-style)|
| (Consul/EDS) |------>| GET /metrics |
+-----------------+ +---------+---------+
|
v
(same remote_write path to Kafka)
The ingest gateway still exists; the scrapers become the "agents" from ingest's perspective. Push vs pull is mostly a discovery and network boundary question — downstream of the gateway, everything is identical.
6.3 Control plane vs data plane separation
| Plane | Members | SLO | Rate |
|---|---|---|---|
| Data | gateway, Kafka, Flink, TSDB, queriers | 99.95% | 10M samples/s |
| Control | rule config, tenant quota, silences, schemas | 99.99% (config changes rare but blocking) | <100 QPS |
| Meta | service discovery, shard map, consensus | 99.99% | very low |
Do not let data-plane volume threaten control-plane availability. Config writes must succeed even when ingest is saturated — otherwise you can't push a silence during an alert storm.
7 Deep Dives (3 critical topics) #
Deep Dive 1 — TSDB storage engine internals: Gorilla compression, block layout, and why cardinality is the real bottleneck
Why critical: This is the hot loop. If you can't explain Gorilla + inverted index at whiteboard depth, you're L6. The answer to "how do you scale the TSDB?" is not "add shards" — it's bound cardinality and compress aggressively.
Alternatives considered:
| Option | Storage / sample | Query p99 (24h, 10K series) | Cardinality ceiling/node | Rejection reason |
|---|---|---|---|---|
| InfluxDB 1.x (TSM + TSI) | ~2.5 B/sample | ~1.2s | ~1M series/node before tsi struggles | TSI had known OOM patterns on bursty cardinality; v2 pivoted to IOx for a reason |
| Cassandra column families per metric | ~8–12 B/sample | ~3s | Arbitrary, but expensive | LSM is wrong shape; SSTable compactions burn IO; no built-in XOR compression |
| Parquet-on-S3 (analytics-first, e.g. Druid) | ~4 B/sample | ~5s for range, great for aggregation | Billions, but latency too high | Batch-oriented, can't do 15s alert eval without a separate hot path |
| Prometheus TSDB (Gorilla + mmap) | 1.37 B/sample | 0.3s | ~5M series/node | Chosen for hot tier |
| M3DB (Uber) | 1.5 B/sample | ~0.8s | 10M+ series/node (horizontal) | Chosen as reference for multi-tenant scale-out |
| Monarch (Google) | ~1 B/sample (rumored) | <1s globally | Billions globally | Chosen as aspirational target for v3 |
Chosen approach: Prometheus TSDB engine per shard (hot tier), M3-style horizontal sharding with object-store warm tier, Monarch-style global query as north star.
Gorilla math (the earned-secret bit — I can derive this on the whiteboard):
- Timestamp encoding: delta-of-delta (DoD). First sample: full 64-bit ts. Subsequent: if DoD=0, emit a `0` bit (1 bit!). If |DoD| < 64, emit `10` + a 7-bit signed value. Further ranges use `110`, `1110`, and `1111` + a 32-bit escape. Empirically, DoD=0 in >95% of samples when the scrape interval is stable → ~1 bit/timestamp amortized.
- Value encoding: XOR the current value with the previous one. If XOR=0 (value unchanged), emit a `0` bit. If the XOR has the same leading/trailing-zero window as the prior one, emit `10` + the meaningful bits. Else emit `11` + 5-bit leading-zero count + 6-bit length + meaningful bits. For slowly-moving gauges (CPU %, memory), the XOR is small → ~8 bits/value amortized.
- Total: ~9 bits/sample + per-series overhead amortized over the block → empirically 1.37 B/sample. Counters (monotonic) do even better.
- Block boundary: every 2h, flush a complete immutable block with embedded index. Old chunks become read-only and mmap-friendly. The head block stays in RAM + WAL.
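A sketch that counts the bits a delta-of-delta encoder would spend on a timestamp stream, using the range buckets quoted above (it tallies bit costs rather than packing real bits, and the 14-bit first delta follows the Gorilla paper's 2h-block convention). On a stable 10s scrape it lands near the ~1 bit/timestamp claim.

```python
def dod_bits(timestamps: list[int]) -> int:
    """Bit cost of Gorilla-style delta-of-delta encoding for a timestamp stream
    (simplified: header ts + first delta stored raw, then DoD range buckets)."""
    bits = 64 + 14                      # block header ts + first delta
    prev_delta = timestamps[1] - timestamps[0]
    for prev, cur in zip(timestamps[1:], timestamps[2:]):
        dod = (cur - prev) - prev_delta
        prev_delta = cur - prev
        if dod == 0:
            bits += 1                   # '0'
        elif -63 <= dod <= 64:
            bits += 2 + 7               # '10' + 7-bit value
        elif -255 <= dod <= 256:
            bits += 3 + 9               # '110' + 9-bit value
        elif -2047 <= dod <= 2048:
            bits += 4 + 12              # '1110' + 12-bit value
        else:
            bits += 4 + 32              # '1111' + 32-bit escape
    return bits

ts = [i * 10 for i in range(720)]       # stable 10s scrape over a 2h block
print(dod_bits(ts) / len(ts))           # ~1.1 bits/timestamp amortized
```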
Cardinality is the killer — the earned-secret story:
True story pattern: one team deploys a service where a label is `session_id` or `uuid` or `user_id`. Cardinality goes from 10K → 1M series in minutes. The inverted-index postings lists explode. On a label-value query like `{env="prod"}` the intersection has to scan a 1M-entry postings list per shard. RAM for the postings lists alone goes from 20 MB → 2 GB on one shard. A merge (`{env="prod", service="checkout"}` — both large postings lists) is O(|a| + |b|) — postings-list merge time, not lookup time. 30ms → 30s. The whole query frontend times out. Alert evaluation stalls (it queries through the same frontend), `for 5m` state timers expire incorrectly, spurious pages fire. The shard OOM-kills, replicas fail over, fail over again, and now you have a replication storm. Fleet-wide impact from one label.
Mitigations (layered defense):
- Admission-time cardinality budget per (tenant, metric_name). The gateway keeps a streaming count-min sketch of label-value diversity; a breach is rejected with `429 cardinality_exceeded`. Per-tenant default 100K series, per-metric default 10K series. Exception workflow via the control plane (manually raised after schema review). See the admission sketch after this list.
- Label schema governance — static config-as-code: `user_id`, `uuid`, `session_id`, `trace_id` are blacklisted as label keys. trace_id belongs in an exemplar, not a label.
- Adaptive sampling fallback — when a metric blows its budget, don't drop; sample 1:N transparently and annotate a `_sampled_` label. Preserves signal, breaks no dashboards, and the operator gets a warning alert. (Datadog and Honeycomb both do variants of this.)
- Bounded postings-list merge — set a query-time cap (`max_series_matched=50000`). Beyond the cap, return `507` with "narrow your selector." Protects the cluster.
- Tombstone compaction — dead series (no writes in 2h) get evicted from the inverted index at block compaction, reclaiming postings-list space. The churn case (a `pod_id` label on ephemeral pods) is why this matters — without it, the index grows unboundedly even if active series stays bounded.
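A minimal sketch of admission-time budgets as described in the first mitigation. It uses exact per-tenant sets for clarity; the gateway in section 6 would swap these for Bloom/count-min structures to bound memory. The limits and the 2h reset mirror the defaults quoted above; names like `CardinalityAdmission` are illustrative.

```python
from collections import defaultdict

class CardinalityAdmission:
    """Per-tenant / per-metric new-series budget enforced at the ingest gateway."""
    def __init__(self, per_tenant_limit=100_000, per_metric_limit=10_000):
        self.per_tenant_limit = per_tenant_limit
        self.per_metric_limit = per_metric_limit
        self.tenant_series = defaultdict(set)         # tenant -> {series_id}
        self.metric_series = defaultdict(set)         # (tenant, metric) -> {series_id}
        self.rejected = defaultdict(int)              # reason -> count (never drop silently)

    def admit(self, tenant: str, metric: str, sid: int) -> bool:
        t, m = self.tenant_series[tenant], self.metric_series[(tenant, metric)]
        if sid in t:
            return True                               # existing series: always accepted
        if len(t) >= self.per_tenant_limit:
            self.rejected["tenant_cardinality"] += 1  # surfaced as 429 cardinality_exceeded
            return False
        if len(m) >= self.per_metric_limit:
            self.rejected["metric_cardinality"] += 1
            return False
        t.add(sid); m.add(sid)
        return True

    def reset(self):                                  # call at each 2h block boundary
        self.tenant_series.clear(); self.metric_series.clear()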
Failure modes:
- Shard OOM on index: symptom = write latency spike + RSS climbing. Detection: `ingester_memory_series` nearing its ceiling. Blast radius: 1/N of writes (that shard's partition). Mitigation: emergency admission control for that tenant (reject new series), manual shard split (double the partition count for that tenant), add headroom. Recovery: ~2h to roll a new block and release memory.
- Block corruption at flush: WAL replay rebuilds. The WAL itself is 3× replicated. Blast radius: 1 shard × 2h of head data → acceptable loss; can replay from Kafka if within the 24h retention.
Real systems:
- Prometheus TSDB (https://prometheus.io/docs/prometheus/latest/storage/)
- Gorilla paper — Pelkonen et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database," VLDB 2015.
- M3 — Uber, 2018, powers Uber's 10B+ series.
- Monarch — Adams et al., "Monarch: Google's Planet-Scale In-Memory Time Series Database," VLDB 2020 — global push, RPC-based ingest, zone-local alerting, cross-zone aggregation. This is the north star.
- VictoriaMetrics — single-binary ClickHouse-inspired alternative with aggressive compression.
Deep Dive 2 — Alerting at scale: rule sharding, the for state machine, dedup, and keeping e2e latency <30s
Why critical: The alert path is the pager. Ingestion at 10M/s is impressive, but if alerts fire 2 min late or flap, the SRE team revolts and the product is dead. Alerting is also where L6 candidates hand-wave ("Alertmanager handles it") and L7 candidates explain the state machine.
Alternatives considered:
| Option | Eval latency | Dedup quality | Scale ceiling | Rejection |
|---|---|---|---|---|
| Single Alertmanager + single Prometheus (HA pair) | <30s | Good (gossip) | ~1K rules | Doesn't scale to 100K rules; single Prom can't hold global state |
| Thanos Ruler (decouple ruler from Prom) | ~45s | Via Alertmanager | ~10K rules | Ruler queries Thanos Querier → cross-region latency bad for sub-30s SLO |
| Cortex Ruler (multi-tenant, distributed) | ~30s | Strong, per-tenant | ~100K rules | Chosen approach for its sharding story; some consistency quirks at replica boundary |
| Borgmon-style pull + eval on aggregator | <15s | Good | Scales to Google | Not open-source; but architecture informs our design |
| Monarch global eval | <30s globally | Excellent | Planet-scale | Aspirational |
Chosen approach: Cortex-style sharded rulers, Alertmanager-style notifier clusters (gossip-dedup), per-tenant isolation.
Rule sharding: shard = hash(rule_group_id) % N_rulers. Each ruler owns a subset of rule groups and evaluates them on their interval. Rules within a group share state and are evaluated serially (so rules in the same group can depend on each other's recording rules in the previous iteration).
The for <duration> state machine (L7-depth detail):
+-----------+ expr=true +---------+ held for <for> duration +---------+
| INACTIVE |---------------->| PENDING |------------------------------>| FIRING |
+-----------+ +---------+ +---------+
^ | |
| expr=false | |
+-----------------------------+ |
| expr=false |
+------------------------------------------------------------------------+
- State key: `(rule_id, group_by_label_values_hash)`. A single rule `error_rate > 0.01 by (service)` produces one state entry per service.
- Persistence: state must survive ruler restarts (otherwise `pending` resets to 0 and a flapping rule never fires). Stored in a Raft group per ruler shard, snapshotted to S3 every 10 min. On ruler restart, state is reloaded. Losing state for up to 5s is acceptable — rules re-evaluate and re-enter pending.
- Skew handling: if evaluation is late (e.g., the gateway is slow), the ruler uses the scheduled timestamp, not wall-clock, for `for` accounting. Otherwise a late eval means an incorrect fire.
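A sketch of the `for` state machine above. The key detail it encodes is that pending → firing is decided on the scheduled evaluation timestamp, never wall-clock, so a late evaluation cannot fire early; Raft persistence is out of scope here.

```python
import enum, time

class State(enum.Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"

class AlertState:
    """One entry per (rule_id, group-by label values)."""
    def __init__(self, for_seconds: float):
        self.for_seconds = for_seconds
        self.state = State.INACTIVE
        self.active_since = None

    def evaluate(self, expr_true: bool, scheduled_ts: float) -> State:
        if not expr_true:
            self.state, self.active_since = State.INACTIVE, None
        elif self.state is State.INACTIVE:
            self.state, self.active_since = State.PENDING, scheduled_ts
        elif self.state is State.PENDING and scheduled_ts - self.active_since >= self.for_seconds:
            self.state = State.FIRING
        return self.state

# Rule with `for: 5m`, evaluated every 15s on the scheduled tick:
a = AlertState(for_seconds=300)
t0 = time.time()
for tick in range(0, 330, 15):
    s = a.evaluate(expr_true=True, scheduled_ts=t0 + tick)
print(s)    # State.FIRING once the condition has held for >= 300s
```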
Deduplication across ruler replicas (the +1 detail):
Every rule group runs on replica_factor=3 rulers (HA). Without dedup, every fire sends 3 pages. Alertmanager gossip dedups by (alert_fingerprint = hash(alertname + sorted_labels)); first message wins in a sliding window. Clock skew between replicas is handled by a jitter window of 2 × scrape_interval. At most one page per fingerprint per route per group_interval.
Alert storm suppression:
- Grouping — Alertmanager groups alerts by configurable labels (`group_by: [cluster, alertname]`). A 500-host outage → 500 `HostDown` alerts grouped into one notification with `num_firing=500`. Oncall gets one page, not 500 (see the dedup/grouping sketch after this list).
- Throttling — `group_interval=5m` caps the notification rate.
- Inhibition — if `ClusterDown` fires for `cluster=us-east-1`, inhibit all `HostDown` alerts for hosts in that cluster. Defined declaratively.
- Silence — the operator API puts a matcher silence in effect (e.g., during planned maintenance).
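A sketch of fingerprint dedup plus label grouping as described above. The fingerprint is a hash over alertname + sorted labels, so all three ruler replicas compute the same value; the window length and group-by keys are the examples from the text, not a fixed API.

```python
import hashlib, time
from collections import defaultdict

def fingerprint(alertname: str, labels: dict) -> str:
    """alert_fingerprint = hash(alertname + sorted labels) — identical across replicas."""
    key = alertname + "".join(f"{k}={v}" for k, v in sorted(labels.items()))
    return hashlib.sha256(key.encode()).hexdigest()

class Notifier:
    """Dedup by fingerprint inside a sliding window, then group by configured labels
    so a 500-host outage becomes one notification."""
    def __init__(self, group_by=("cluster", "alertname"), dedup_window_s=30.0):
        self.group_by, self.dedup_window_s = group_by, dedup_window_s
        self.last_seen = {}                        # fingerprint -> last accepted ts
        self.groups = defaultdict(list)            # group key -> firing alerts

    def receive(self, alertname: str, labels: dict, now=None) -> bool:
        now = now if now is not None else time.time()
        labels = {**labels, "alertname": alertname}
        fp = fingerprint(alertname, labels)
        if now - self.last_seen.get(fp, 0.0) < self.dedup_window_s:
            return False                           # duplicate from another replica: drop
        self.last_seen[fp] = now
        gkey = tuple(labels.get(k, "") for k in self.group_by)
        self.groups[gkey].append(labels)
        return True

n = Notifier()
for host in range(500):
    n.receive("HostDown", {"cluster": "us-east-1", "host": f"h{host}"})
print({k: len(v) for k, v in n.groups.items()})    # one group, num_firing=500
```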
Latency budget (proves <30s):
scrape interval 10 s (worst: just missed)
gateway -> Kafka <1 s
Kafka -> TSDB writer <2 s
TSDB block flush / index <2 s (head block queryable in RAM)
ruler next tick <15 s (interval=15s)
query frontend eval <1 s
state machine transition <0.1 s
notifier dedupe + send <3 s
-----
worst-case e2e ~34 s <-- just over budget
p99 ~22 s <-- within budget
p50 ~12 s
The worst case slightly blows the budget; we fix it with a skip-the-queue shortcut for critical alerts: the ruler queries the TSDB head block directly (bypassing the query-frontend cache), cutting ~2s. Monarch does this; its rulers run in the zone where the data lives.
Failure modes:
- Ruler replica lag — detection: `ruler_eval_lateness_seconds`. Blast radius: alerts late by the lag. Mitigation: oversubscribe rulers, shed non-critical rule groups first (`priority` label).
- Notifier outage — detection: delivery failure counter. Mitigation: dual-write to 2 notifier clusters, page via a secondary channel (SMS gateway) on double failure.
- Alert fatigue loop — detection: pages/hour rising + page ACK rate falling. Mitigation: auto-silence with human escalation; per-rule SLO on false-positive rate.
Real systems: Borgmon (Google internal; described in SRE book ch. 10), Prometheus + Alertmanager, Cortex Ruler, Thanos Ruler, Grafana Mimir.
Deep Dive 3 — Push vs pull: trade-offs, and why we actually want both
Why critical: This is the canonical metrics-system interview bait. The L6 answer is "pull is Prometheus, push is StatsD." The L7 answer is: they solve different problems and a mature system supports both, with routing logic at the ingest gateway.
Pull (scrape) model:
| Dimension | Pull |
|---|---|
| Discovery | Central service discovery (Consul/Kubernetes SD/EDS) knows targets. Scraper connects in. |
| Auth | Scraper holds creds; easier to rotate centrally. |
| Rate limiting | Natural — scraper controls interval. Target can't DoS the system. |
| Back-pressure | If TSDB is slow, scraper slows. Target keeps a last-sample window. |
| Firewall traversal | Target must be reachable from scraper. Bad for NAT / customer premises. |
| Short-lived jobs | Terrible — the scrape may miss the job entirely. Need a push gateway adjunct. |
| Health | Scrape failure IS the health signal — no heartbeat needed. |
| Sample skew | Scraper drives, so samples align on scrape interval. Good for cross-host math. |
Push (remote_write) model:
| Dimension | Push |
|---|---|
| Discovery | Agent knows its destination; no global SD. Agent identity via SPIFFE. |
| Auth | Each agent holds creds; rotation is harder. Need a control plane. |
| Rate limiting | Server-side token bucket per tenant. Must handle thundering-herd. |
| Back-pressure | Agent local buffer absorbs; if buffer fills, agent drops (or spills to disk). Complex. |
| Firewall traversal | Excellent — agent dials out. Works across NAT, VPCs, customer environments. |
| Short-lived jobs | Natural — job pushes final sample on exit. |
| Health | Need an explicit heartbeat. Missing push ≠ target down (could be network). |
| Sample skew | Agents self-schedule — timestamps may cluster if they're all started together. |
Quantified comparison for our 100K-host fleet:
- Pull steady state: scraper fleet of ~50 pods (each handling 2K targets at a 10s interval = ~200 scrapes/s each), with SD updates every 30s costing ~10 MB/update × 50 pods ≈ 1 GB/min of control traffic. Target-side CPU: ~0.1% for 1000 metrics (cheap).
- Push steady state: agent CPU ~0.2% (batching + snappy), ~30 KB/min network per host, gateway fleet of ~30 pods handling 50K RPS total.
- Pull fails: 5K ephemeral Lambda-like workloads that live <30s. We need a push gateway or a cron-scrape hybrid.
- Push fails: third-party service we want to scrape but can't install an agent on. We'd need an agent sidecar or a central scraper.
Chosen approach — hybrid, with gateway as the switchboard:
Hosts with agents -----------push remote_write----+
|
Short jobs -------push to Pushgateway----+ |
| |
Third-parties ---pulled by Scraper---+ | |
v v v
+---------------------+
| INGEST GATEWAY |
+---------------------+
Downstream is unchanged. All three paths converge at the gateway, which normalizes auth, labels, and admission. This is what Cortex/Mimir does.
Earned-secret:
In flaky-network environments (edge, mobile, customer VPC), push wins because the agent-side buffer + outbound dial handles partitions gracefully. But you need to cap the agent buffer — an unbounded buffer + a prolonged outage = OOM the host you're trying to monitor. The right default: a 100 MB disk-backed ring buffer (spills from RAM), 1h retention; beyond that, drop oldest and increment a counter. Classic noisy-neighbor pattern: agents that can't send will consume all host disk if you let them.
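A sketch of the bounded agent buffer with a drop-oldest policy. It is memory-only for clarity (the real agent spills to a ~100 MB disk-backed ring); the invariants it demonstrates are the hard byte cap, oldest-first eviction, and counting every drop.

```python
from collections import deque

class AgentBuffer:
    """Bounded drop-oldest send queue for the agent (simplified, in-memory)."""
    def __init__(self, max_bytes=100 * 1024 * 1024, warn_at=0.8):
        self.max_bytes, self.warn_at = max_bytes, warn_at
        self.buf, self.used = deque(), 0
        self.dropped_total = 0              # metrics_dropped_total{reason="buffer_full"}

    def append(self, batch: bytes):
        while self.used + len(batch) > self.max_bytes and self.buf:
            old = self.buf.popleft()        # drop oldest first
            self.used -= len(old)
            self.dropped_total += 1
        self.buf.append(batch)
        self.used += len(batch)
        if self.used > self.warn_at * self.max_bytes:
            pass                            # emit agent_buffer_pct warning metric here

    def drain(self):
        """Hand buffered batches to the remote_write sender on reconnect."""
        while self.buf:
            batch = self.buf.popleft()
            self.used -= len(batch)
            yield batch
```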
Failure modes:
- SD flaps (pull) → scrape-target churn → missing samples → false-positive down alerts. Mitigation: target-list debouncing, scrape `honor_labels` + freshness tolerance.
- Agent buffer fills (push) → host OOM or data loss. Mitigation: bounded disk buffer, early warning at 80% full, drop policy by metric priority.
- Both fail silently → SREs don't know they're blind. Mitigation: meta-monitoring: the monitoring system itself is monitored by a different (smaller, independent) monitoring system. Monarch calls this "Monarch-of-Monarch." Borgmon had the same pattern.
Real systems: Prometheus (pull-native + remote_write push), Datadog agent (push), Monarch (push, planet-scale), Nagios (pull), StatsD (push), OpenTelemetry Collector (both).
8 Failure Modes & Resilience #
| Component | Failure | Detection | Blast Radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Agent | Crash loop | Missing heartbeat series `up{} == 0` for 2 intervals | 1 host's metrics | Restart via systemd; health probe; stateless so fresh start is fine | Auto-resume from local buffer if disk intact; else lose <30s |
| Agent | Disk buffer fills (network partition) | Counter `agent_buffer_pct > 80` | 1 host + its neighbors if the host OOMs | Disk-backed bounded ring, drop-oldest policy, early warning | Drain buffer on reconnect |
| Ingest gateway | Pod crash | Pod restarts, `gateway_5xx` spike | Brief (seconds) for hosts hitting that pod | Stateless behind L7 LB; HPA scales | Agents retry with exponential backoff |
| Ingest gateway | Cardinality explosion attack from 1 tenant | Series-create rate spike; postings-list size growing fast | That tenant + noisy-neighbor degradation of shared shards | Per-tenant admission budget enforced with count-min sketch; 429 | Operator raises or lowers quota; emergency "freeze new series" knob |
| Kafka | Broker loss | ISR shrink, producer errors | Brief write unavailability for impacted partitions | RF=3, min.insync.replicas=2 ensures no data loss on a single-broker failure | Rebalance; agents retry |
| Kafka | Consumer lag (TSDB writer slow) | `kafka_consumer_lag_ms > 60s` | Delayed queries; alert staleness | Writer HPA; pause Flink rollups before TSDB (rollups are replayable) | Catch up from Kafka; no data loss since retention=24h |
| TSDB shard | OOM from cardinality | `ingester_memory_series` near limit, RSS climbing | 1/N of writes — that shard's partition | Hard cardinality caps; eviction of inactive series; emergency shard split | New shard drains active series; ~2h to stabilize |
| TSDB shard | Disk full | Disk metric | Shard ingest stops | Block compactor to warm tier; emergency block delete of coldest data | Replace disk, replay from Kafka |
| Query frontend | Query stampede (dashboard refresh avalanche) | Concurrent queries > threshold, p99 latency climb | Query unavailability | Admission: per-tenant concurrency cap, query cost estimate, 429 | Cache warms; tenant backs off |
| Querier | OOM on "select all series" | Heap metric | 1 querier pod | Query cost budget (`max_series_matched`, `max_samples_per_query`); 507 | Pod restart; user rewrites query |
| Ruler | Replica lag | `ruler_eval_lateness_seconds > 15s` | Alerts late | Oversubscribe, shed low-priority rule groups, circuit-break queries to TSDB | Catch up; may miss one eval but won't miss a page if `for` is long enough |
| Ruler | Silent drop (bug) | Meta-alert on `ruler_evals_total` rate drop | Blind spot | External synthetic alert that fires if ruler eval rate drops 50% | Code fix; roll back |
| Alert evaluator | Raft quorum loss (state loss) | Quorum health metric | `pending` alerts reset; 1 eval cycle of delayed firing | RF=5 Raft, geographically spread | New quorum from surviving members; snapshot restore |
| Notifier (Alertmanager) | Delivery provider down (PagerDuty outage) | Delivery failure metric | Alerts queued, not delivered | Multi-provider (PagerDuty + OpsGenie + SMS fallback); dead-letter queue | Replay from DLQ when provider is back |
| Alert storm | 10K alerts/min (power outage) | Page-rate alert | Oncall overload | Grouping, inhibition (parent/child), rate limiting per route | Human declares incident, silences, triages |
| Clock skew | NTP broken on host | Sample ts outside acceptance window (5 min) → drop counter | That host's ingest | Agents drop samples with a bad clock and emit a `clock_skew` counter; central NTP health check | Fix NTP |
| Storage tier | Warm chunk corruption (CRC fail) | Read-time CRC mismatch counter | 1 block, 2h of 1 series | Cross-AZ replica; EC parity | Re-fetch from replica or reconstruct from parity |
| Any boundary | Silent drops (bug or misconfig) | Meta-principle: every drop increments a labeled counter `drops_total{reason=}` | Blind spot if not instrumented | Invariant: no unrecorded drop. Code-review rule. | N/A (detection is the mitigation) |
| Monitoring-of-monitoring | Main system down, nobody knows | Secondary (smaller, independent) monitoring watches the primary | Extended blindness — dangerous | Run a separate "meta-monitor" Prometheus pair that scrapes our system's `/metrics` endpoints, with its own independent pager | Primary recovers, meta alerts clear |
Three invariants I'd tattoo on the team wall:
- Every drop is counted. No silent failures. `*_dropped_total{reason=...}` on every boundary.
- The monitoring system must never take down the thing it monitors. Agent CPU/memory capped, buffered, non-blocking on application code.
- The monitoring system monitors itself with a separate system. Or an independent pager. Do not let the alerter for the alerter be the alerter.
9 Evolution Path #
v1 — "make it work" (100s of hosts, 1 team)
- Stack: StatsD agents → Graphite (Carbon + Whisper) → Grafana.
- Alerting: Nagios or bash scripts running PromQL-equivalent checks via `check_graphite`.
- Storage: Whisper files on local disk; one server.
- Cost: <$1K/mo.
- Why this first: 1 engineer, 1 week, covers 90% of basic needs. Ship.
- Pain points that force v2: no labels (hierarchical metric names only), Whisper writes are O(series) on every tick (doesn't scale past a few thousand series), no HA, no multi-tenancy.
v2 — "make it scale to a product" (1K–10K hosts, dozens of teams)
- Stack: Prometheus (federated pairs) per region → Alertmanager → Grafana → Thanos/Cortex for long-term storage + global query.
- Ingestion: remote_write to a central Cortex/Mimir cluster.
- Alerting: per-team rule groups; Alertmanager HA trio.
- Storage: Prometheus 2h blocks → S3 via Thanos Sidecar. Store Gateway for historical queries.
- Multi-tenancy: Cortex per-tenant isolation (header-based).
- Cost: $10–50K/mo.
- What you get: labels, PromQL, HA, reasonable cardinality (~10M series), 1-year retention.
- Pain points that force v3: cross-region query latency, federation fan-out math (federating 20 Prom pairs × 1M series each = 20M series queried from a single Thanos Querier → slow), tenant isolation at label-level not metric-level is hard, compaction jobs become a full-time ops burden.
v3 — "global, planet-scale, tenant-isolated" (100K+ hosts, 100s of teams, multi-region)
- Stack: Monarch-like or M3/Mimir at scale. Push-based agents (remote_write), regional Kafka, regional TSDB shards, global query layer, sharded rulers, multi-tier storage.
- Multi-region: active-active for alert eval (each region evaluates rules against local data, cross-region aggregation only when explicitly queried). Locality beats globality for <30s alerts.
- Tenant isolation: per-tenant ingester shards (not just label filtering) — cardinality blast radius is contained to a tenant. Per-tenant quotas enforced at gateway.
- Global query: hierarchical aggregation. Local query → regional aggregator → global. Most queries answered locally; only cross-region queries pay the WAN cost.
- Storage: hot local NVMe (30d), warm object store (1y, 5m rollups), cold archive (5y, 1h rollups).
- Cost: $1–5M/yr for 100K hosts at this scale.
- What's next (v4): exemplar-based sampling unification with traces (OpenTelemetry), native histograms (sparse, high-resolution quantiles without bucket explosion), ML-based anomaly detection replacing static thresholds, differential-privacy-aware aggregation for externally-shared metrics.
The meta-lesson: every monitoring system evolution is driven by cardinality pressure and tenant-isolation failures, not by raw sample-rate growth. Sample rate you buy your way out of with more disk. Cardinality bankrupts your index design.
10 Out-of-1-Hour Notes (solo study enrichment) #
10.1 Cardinality quota enforcement in depth
- Per-tenant budget (default 100K active series), per-metric budget (default 10K), per-label-value budget (e.g., `pod_id` capped at 50K unique values across a metric). Enforced at the gateway.
- Implementation: count-min sketch (CMS) per tenant + per metric. Sketch width/depth tuned so the false-positive rate is <0.1% at 100K items. Reset every 2h (aligns with the TSDB block boundary).
- Overflow behavior:
  - Soft limit breach → log + warn metric.
  - Hard limit → reject with `429 cardinality_exceeded`, emit `cardinality_rejected_total`.
  - Operator whitelists with a signed exception in the control plane.
- Why CMS and not exact set? Exact is O(|series|) RAM at the gateway — doesn't scale. CMS is O(1) with tunable error. Over-counting (false-positive rejection) is acceptable; under-counting isn't.
- Edge case: when budget = exact cardinality, pod restarts cause label churn that false-positively hits limits. Solution: TTL-decayed sketch that forgets values unseen for 6h. Some teams run this as a Flink job because Flink owns the time-window semantics.
10.2 Schema governance
- Label naming conventions — lowercase snake_case, namespace prefixes (`k8s_`, `app_`), blacklist (`user_id`, `email`, `session_id`, anything high-cardinality or PII).
- Metric naming — `<namespace>_<subsystem>_<name>_<unit>` (Prom convention): `http_requests_total`, `process_cpu_seconds_total`.
- Enforcement: linter in CI on rule YAML + a metric-name registry. Pre-registered metrics get the fast path; unknown metrics get delayed registration + a cardinality audit.
- Metric types: counter (monotonic), gauge (point-in-time), histogram (buckets pre-declared), summary (client-side quantile).
10.3 Histogram encoding deep dive
- Classic Prom histogram: fixed buckets, e.g., `le=0.005, 0.01, ...`. Pre-declared. Good for known distributions, bad for long tails or unknown-range metrics. Bucket count × series cardinality blows up fast (every bucket is a series!).
- Native histograms (Prometheus 2.40+): exponential schema — sparse, logarithmically-spaced buckets chosen at runtime. One series per metric+labels, with an internal bucket vector. ~5–10× cardinality reduction. Accurate quantiles without manual tuning. This is the way.
- t-digest / DDSketch: client-side sketch for accurate quantiles, mergeable across replicas. Used in Datadog (DDSketch). Lower storage than bucketed histogram; slightly more CPU at emit.
- Summary type (client quantile): client computes p50/p99 and emits as scalar. Cannot be aggregated across replicas (you can't average p99s). Useful only when per-instance quantile is the question; avoid otherwise.
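A sketch of sparse exponential bucketing in the native-histogram style, using one common convention (base = 2^(2^-schema), upper-inclusive buckets); exact boundary handling differs slightly between implementations, so treat the indices as illustrative. The point is that the bucket count is independent of the value range and the relative error is bounded by the bucket width.

```python
import math

def bucket_index(value: float, schema: int = 3) -> int:
    """Sparse exponential bucketing: base = 2**(2**-schema); bucket i covers
    (base**(i-1), base**i]. Higher schema -> finer buckets, lower relative error."""
    assert value > 0
    return math.ceil(math.log2(value) * (2 ** schema))

def bucket_bounds(i: int, schema: int = 3) -> tuple[float, float]:
    base = 2 ** (2 ** -schema)
    return base ** (i - 1), base ** i

# A 2.3s request latency at schema=3 (relative bucket width ~9%):
i = bucket_index(2.3)
print(i, bucket_bounds(i))   # index 10, bounds roughly (2.18, 2.38]
```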
10.4 Exemplars — linking metrics ↔ traces
- Exemplar: a `(trace_id, sample_value, timestamp)` attached to a metric sample.
- Use case: a dashboard shows `http_request_duration_seconds` p99 = 2.3s — click it, get 10 trace exemplars for 2.3s requests, and jump into Jaeger/Tempo with the trace_id.
- Storage: exemplars are stored in a circular buffer per series (e.g., last 100 per series per hour). Not every sample has an exemplar — they are sampled at emit time.
- Why this is the future: it's the bridge across observability pillars without making labels high-cardinality. The trace_id lives in the exemplar (low overhead), not in the label set (cardinality catastrophe).
- OpenTelemetry standardizes the exemplar protocol.
10.5 Adaptive sampling under overload
- Problem: metric emission is a best-effort pipeline. Under overload we want to degrade gracefully — not drop uniformly (biases signal), not drop entirely (go blind).
- Strategy — priority-based shedding (see the sampler sketch after this list):
  - Drop debug-level metrics first (`priority=low`).
  - Sample info-level metrics at 1/N.
  - Keep critical-level metrics (SLOs, RED dashboards, host health) at 100%.
- Strategy — per-series sampling: for high-volume series, emit 1 in 10 samples but tag with `_sampled_rate_=10` so PromQL `rate()` can compensate. Requires a PromQL extension.
- Meta-concern: you MUST NOT drop counters silently — they become unrecoverable. For counters, emit monotonic increments less often (the `sum` is preserved across sample drops).
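A sketch of priority-based shedding with counter-safe emission, assuming counter values arrive as increments: gauges are shed or sampled 1-in-N under overload, while counter increments are accumulated and flushed less often so the monotonic sum is never lost.

```python
class OverloadSampler:
    """Degrade gracefully under overload without silently losing counter sums."""
    def __init__(self, info_sample_n: int = 10):
        self.info_sample_n = info_sample_n
        self.pending = {}      # counter series -> accumulated increment
        self.calls = {}        # series -> calls seen since last flush

    def emit(self, series, value, kind, priority, overloaded):
        if not overloaded or priority == "critical":
            return (series, value)                     # SLO/RED/host-health: always keep
        n = self.calls[series] = self.calls.get(series, 0) + 1
        if kind == "counter":
            # `value` is an increment: accumulate so the running sum survives shedding
            self.pending[series] = self.pending.get(series, 0.0) + value
            if n % self.info_sample_n:
                return None
            return (series, self.pending.pop(series))  # flush the preserved sum
        if priority == "low":
            return None                                 # debug gauges: shed entirely
        return (series, value) if n % self.info_sample_n == 0 else None
```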
10.6 Cross-pillar correlation: metrics ↔ logs ↔ traces
- Shared IDs: every log line and trace span carries `service`, `env`, `trace_id`. Metrics tag exemplars with `trace_id`.
- Convergence point: an incident page shows a metric anomaly → exemplar → trace → spans highlight a slow DB call → the log with that `trace_id` shows the query plan.
- Anti-pattern: adding `trace_id` as a metric label for "correlation." Kills cardinality. Exemplars exist specifically to avoid this.
10.7 Regulatory — PII in labels
- Hard rule: no PII in metrics labels. Ever. Labels are indexed, searchable, cached, exported to dashboards, shared across teams. PII in a label is a GDPR/CCPA violation waiting to happen.
- Enforcement:
- Ingest gateway runs a regex-based PII detector on label values (email, phone, SSN patterns). Matches → drop series + alert.
- Label keys have an allow-list (static CI config).
- Periodic audit scan: enumerate top-cardinality labels, verify against allow-list.
- Edge case: "I want per-user metrics." Answer: per-user metrics are not metrics. They're analytics. Route them to a purpose-built analytics pipeline with privacy controls. Or aggregate to cohort granularity. Or use differential privacy with noise.
- EU residency: EU-tagged tenant → ingest gateway routes to EU-region Kafka + TSDB only. Cross-region query explicitly denied for EU tenants unless legal-reviewed.
10.8 Cost model — $/series
- Breakdown per active series (annualized, at our scale):
- Ingest CPU/network: ~$0.001/series/yr
- Hot TSDB (30d SSD replicated 3×): ~$0.05/series/yr
- Warm (1y, rolled up): ~$0.01/series/yr
- Cold (5y): ~$0.002/series/yr
- Index RAM (hot): ~$0.03/series/yr
- Total: ~$0.10/series/yr.
- 100M series = $10M/yr all-in. 1B series = $100M/yr.
- Unit economics for chargeback: bill tenants by active series-months, not bytes. Series is the resource that fails first; byte-based billing rewards cardinality abuse.
- Compare: Datadog's public price list is ~$0.10–$0.50 per custom metric per month = $1.20–$6/yr — so our internal cost is roughly 10–60× cheaper than buying Datadog, which is why big companies build internal.
10.9 Observability of observability (meta-monitoring)
- Principle: the monitoring system cannot be the sole monitor of itself. Circular dependency.
- Implementation:
  - A meta-Prometheus pair (2 hosts, outside the main stack) scrapes the main stack's `/metrics` endpoints.
  - An independent pager (different PagerDuty account? separate SMS gateway?) routes meta-alerts.
  - Synthetic canary: a cron job pushes a known metric value every minute; a meta-rule alerts if the value doesn't appear in the main system within 60s.
  - Dead-man's switch: a rule that fires if it stops firing. Detects ruler death.
- Real-world lesson: a monitoring stack that only monitors itself goes blind exactly when it dies. Facebook's Gorilla (VLDB 2015) was motivated in part by needing a fast, out-of-band query path over monitoring data during incidents; the general fix is a separate meta-alerting path.
10.10 L7 discussion hooks for the interview (what I'd volunteer if asked)
- "How would you detect an SRE who lies on their runbook?" — Runbooks embed a
runbook_executed_total{runbook_id=...}counter. Delta over incident time vs attestation. Audit log pairing. - "What's the one thing you'd change if you were starting over?" — Start with native histograms and exemplars from day 1. Retrofitting them is expensive because existing dashboards use classic histogram buckets; rewriting dashboards across 500 teams is a multi-year migration.
- "How do you handle the cardinality of cloud-native (Kubernetes) labels?" — Kubernetes labels (pod_name, replicaset_id) are intrinsically high-cardinality because pods churn. Solutions: (1) drop pod_name at ingest, keep only deployment_name (stable). (2) sample — only keep pod_name for a fraction. (3) short TTL — expire pod_name series 1h after last write. All three, layered.
- "AI/ML angle given my Meta Agentic AI background": the same metrics system underpins RL training monitoring (agent step rewards, loss curves, KL divergence) and inference serving (token latency, TTFT, cache hit). Privacy infra angle: when metrics carry per-tenant signals in a multi-tenant AI platform, label-level tenant isolation (not just auth) is required — a "noisy tenant neighbor" in the TSDB is a cross-tenant information leak if query authz is by label match. Push is safer here because auth happens at the wire, before tenanting.
- "How do you validate correctness end-to-end?" — Canary series. Inject a known series at ingress, query at egress, assert value+timestamp match with <5s skew. Alert if canary fails. This catches silent corruption that byte-level replicas don't.
Appendix A — Sanity check of the math #
- Hosts: 100K.
- Series/host: 1000.
- Samples/series/s: 0.1.
- Samples/s total: 100K × 1000 × 0.1 = 10M ✓
- Bytes/sample (Gorilla): 1.37.
- Bytes/s stored: 10M × 1.37 = 13.7 MB/s ✓
- Bytes/day: 13.7 × 86400 = 1.18 TB ✓
- Bytes/30d hot: 35.4 TB ✓ replicated 3× = 106 TB ✓
- Active series: 100M to 500M realistic; 1B ceiling. Index RAM at 1 KB/series = 100 GB – 1 TB. Sharded 10–100×, fits. ✓
- Alert budget: 10s scrape + 1s gateway + 2s Kafka + 2s TSDB + 15s eval + 3s notify = 33s worst, 22s p99. ✓ (tight but feasible; shortcut via head-block direct-query in ruler for critical alerts to hit <30s hard SLO)
- Query fanout: 1000 QPS × avg 10 shards touched = 10K shard-queries/s. Spread over 128 hot shards, that's ~80 shard-queries/shard/s against ~20K QPS of per-shard capacity → easy. ✓
Appendix B — What I explicitly rejected and why #
- InfluxDB 1.x — TSI index had OOM patterns under bursty cardinality; the rewrite to v2/IOx exists because the original design didn't separate index from storage correctly. Not production-grade at 100M+ series.
- A single giant Prometheus — can't horizontally scale writes past ~1M samples/s on a single node. Federation is a dead-end for our scale.
- Cassandra for time series — works, but 5–10× more storage than Gorilla and no built-in delta-of-delta encoding. Shops that rolled their own time-series layer atop Cassandra ended up writing custom compression anyway — we'd be reinventing what Prometheus/M3 already give us.
- Elasticsearch for metrics — Lucene is built for text, not floats. Storage overhead ~5× Gorilla. Query path not optimized for range queries. Was tried (Elastic TSDB features); still not competitive with purpose-built TSDBs.
- Kafka as primary storage (log-only architecture) — tempting (everything is a log!), but you lose efficient range scans and you pay 5× storage. Kafka is the right buffer, the wrong store.
- No Kafka at all (direct agent → TSDB) — couples agent uptime to TSDB uptime. TSDB restarts lose in-flight data. Kafka decouples the two control loops, at the cost of 2s added latency.
End of design. If I had another 30 min in the interview I'd go deep on (a) query frontend caching (split-and-merge, time-based cache keys) or (b) native histogram math (exponential bucketing schema, merge algebra). Both are L7-depth topics not fully expanded here.