Q14 Product & Edge Systems
Design a Video Streaming Service
Support creator uploads, transcoding, adaptive-bitrate playback, CDN delivery, and watch-history-driven recommendations.
1 Problem Restatement & Clarifying Questions #
Restatement (say this first, 20 seconds): Build a global video streaming platform that ingests licensed mezzanine masters from studios, transcodes each into an adaptive bitrate ladder with DRM packaging, distributes via a tiered CDN (edge PoPs + ISP-embedded appliances) at multi-hundred-Tbps peak egress, and serves ABR playback with p99 startup <2s and rebuffer ratio <0.5%. Secondary surface: watch-history, personalized recommendations, per-region regulatory compliance. The cost center dominates — >70% of total infra spend is egress bandwidth — so every design choice is evaluated against its bandwidth bill first.
Clarifying questions I would ask, with the default I'd adopt if told "you decide":
| # | Question | Why it matters | Default if unspecified |
|---|---|---|---|
| Q1 | YouTube-style UGC vs Netflix-style licensed catalog? | UGC has a long tail of cold content (millions of titles, most watched <10 times) and ingest at human-generated scale (500 hrs/min). Licensed has 100k premium titles with predictable demand spikes and rights management. | Netflix-style licensed. Justification: the problem names "transcoding pipeline," "catalog," and "geo-restrictions" — all point at a curated-rights model. UGC is a separate design (§10). I'd call this out explicitly so the interviewer corrects me early if wrong. |
| Q2 | VOD only, or live + VOD? | Live ingest is a separate sub-system (RTMP/SRT → low-latency HLS/LL-DASH with 2–6s glass-to-glass). | VOD only. Live-streaming treated as adjunct (§10). |
| Q3 | MAU / peak concurrency? | Sets egress BW and CDN PoP count. | 500M MAU, 100M peak concurrent (major-release global drop). §3 BOE math pivots on this. |
| Q4 | DRM level? Studio contracts usually mandate HDCP 2.2 + L1 Widevine for 4K. | Determines whether decryption is hardware-bound (L1) or SW-allowed (L3). L1 = client-locked, no cloud-DRM shortcut. | Widevine L1 / FairPlay / PlayReady SL3000 for 4K; L3 fallback for SD. Assume all three DRM systems required (iOS/Android/Smart-TV fragmentation). |
| Q5 | 4K / HDR / Dolby Vision / Atmos? | Adds top rungs to the ladder, doubles encode cost, triples egress bytes per viewer. | Yes to all. Premium catalog; trade-off is real. |
| Q6 | Geographic scope; regulatory regimes? | Drives region count, geo-fencing, per-country rights fencing, data residency (EU DPA, India DPDPA). | Global minus embargoed countries. 18 regions, 3 CDN tiers (core PoPs, regional PoPs, ISP-embedded appliances). |
| Q7 | Offline downloads? | Needs persistent DRM license with bounded validity (e.g., 48h; 30d when connected). | Yes. Download path is a variant of playback, same DRM but a different license policy. |
| Q8 | Cost target per streaming hour? | Anchors whether we use 3rd-party CDN, build custom appliances, or hybrid. | <$0.015 per streaming hour all-in (bandwidth + compute + DRM). §3 shows custom-appliance ROI. |
| Q9 | SLO targets: startup time, rebuffer, availability? | The three numbers that define "good video." | p99 startup <2s, rebuffer ratio <0.5%, playback availability 99.99%. Catalog browse 99.95% (reco service allowed to degrade separately). |
| Q10 | Recommendation scope? Candidate-gen + ranker or one-shot? | Drives whether we separate online/offline embedding. | Two-stage: offline candidate gen (collaborative filtering, graph walks) + online ranker (deep neural, real-time features); serving detail in §5.5. |
I'd spend ~90 seconds on these, commit to the defaults, and say "I'll surface anywhere these bite."
2 Functional Requirements #
In scope (numbered):
- FR1 — Studio ingest. InitiateUpload(title_id, manifest) → chunked resumable upload of mezzanine master (typically ProRes or JPEG2000 IMF, 200–500 GB per 90-min feature) → object store → trigger transcode DAG.
- FR2 — Transcode to ABR ladder. Per-title ladder selection (dynamic optimizer), parallel segment encoding in 2–4s GOP chunks, multi-codec output (H.264 AVC, HEVC, AV1), multi-container (HLS fMP4, DASH CMAF), DRM packaging (Widevine/FairPlay/PlayReady via a single CENC CMAF package).
- FR3 — QC & conformance. Automated perceptual quality checks (VMAF per rung), bitstream conformance (ffmpeg/Bento4), audio loudness (EBU R 128), subtitle timing.
- FR4 — Catalog publish. Transactional flip from "draft" to "available in regions X" with geo-allowlist; embargoed release (countdown-to-live).
- FR5 — Manifest serve. GET /v1/manifest/{title}/{session} returns signed HLS/DASH manifest with per-session CDN-token-rewritten segment URLs and per-profile rung list.
- FR6 — Segment serve. Stateless CDN-cacheable GET /s/{title}/{rung}/{seg_no}.m4s; 99.9%+ hit rate at edge PoPs for head catalog.
- FR7 — DRM license exchange. POST /drm/license exchanging a CDM-provided challenge for a signed license (Widevine/FairPlay/PlayReady key wrap).
- FR8 — ABR playback. Client selects rung per segment via buffer + throughput signals. We serve, don't decide (clients own ABR).
- FR9 — Watch-history & resume. POST /heartbeat every 10s; GET /continue-watching/{user} returns last positions with user-consistent read.
- FR10 — Recommendations. GET /recs/{user}?surface=home|detail|postplay — candidate gen + rank, with diversification.
- FR11 — Analytics. QoE (rebuffer, startup, bitrate ladder switches) and engagement (play, pause, abandon) events; feeds recs + SLI dashboards + ladder tuning loop.
- FR12 — Download-for-offline. Scoped DRM license, bounded offline duration.
- FR13 — Geo-restriction / parental controls / kids profile. Enforced at manifest issuance (server-authoritative, cannot be bypassed by changing client state).
Out of scope (say out loud):
- Live streaming (RTMP/SRT ingest, low-latency HLS, chat overlay) → separate sub-system, sketched in §10.
- UGC ingest at human-generated rates (500 hrs/min YouTube). Our ingest is ~10–100 titles/day from a handful of studios.
- Comments, social, watch-parties — app-layer features, separate services.
- Payment / subscription billing — separate service; we consume entitlements.
- Content moderation / Trust & Safety — trivialized for licensed catalog (studios vet); for UGC it would be huge.
- Ad insertion / SSAI — noted in §10 as a clean extension (per-segment URL rewriting at manifest time).
3 Non-Functional Requirements + Capacity Estimate #
3.1 NFRs (with explicit SLO targets)
| NFR | Target | How we achieve it |
|---|---|---|
| Playback availability | 99.99% measured as played-without-fatal-error / started | Tiered CDN with 3 fallbacks; origin replicated 3× cross-region; client-side rung fallback |
| p99 startup time (click → first video frame) | <2s | Manifest edge-cached; first segment in client push; pre-fetched DRM license on manifest fetch |
| Rebuffer ratio (rebuffer-seconds / play-seconds) | <0.5% | ABR tuned (BOLA-E+MPC hybrid); segment prefetch = 30s buffer target; CDN hit >99.5% |
| Durability of master | 11 nines (1 − 10⁻¹¹ annual loss) | Object store w/ erasure coding 10+4 across AZs; tape archival for deep cold; checksum on ingest and every replication hop |
| Durability of encoded assets | 9 nines; re-transcodable if lost | Master survival is ground truth; encoded ladder is recomputable (cost: 1 day & ~$30k per missing title) |
| DRM license issuance latency | p99 <300ms | License servers co-located with edge regions; HSM-backed key wrap hot-path sub-10ms |
| Catalog metadata freshness (new release → searchable) | <60s globally | Spanner / pub-sub fan-out |
| Recommendation freshness (watched → re-ranked) | <5 min for online features; daily for batch | Online ranker with last-N session features |
| Analytics completeness (QoE ingest) | 99.9% events landed within 1 min | Lossy in-memory buffer → Kafka with 3× replication → batch to warehouse |
| Cost per streaming hour (all-in) | <$0.015 | Open Connect-style appliances + pull-CDN spillover; AV1 where devices support it (30% BW reduction) |
3.2 Back-of-envelope math — every number calculated
(a) Peak egress bandwidth
- 500M MAU × 2 hrs/day avg viewing = 1B streaming-hours/day = ~42M concurrent average.
- Peak/average ratio for global VOD ~2.5× (evening TZ spike); for major releases ~4× (global drop).
- Peak concurrent ≈ 100M viewers.
- Blended bitrate (4K HDR 15 Mbps, 1080p 5 Mbps, 720p 2.5 Mbps, mobile 1 Mbps; weighted 15%/45%/30%/10%) = ~5 Mbps average.
- Peak egress = 100M × 5 Mbps = 500 Tbps.
That number is the single biggest constraint. Build everything else backward from it.
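The arithmetic above is small enough to keep as an executable note. A minimal sketch (Python; the §3 assumptions hard-coded, tier shares from §3b — illustrative, not measured):

# Back-of-envelope egress model: 500M MAU, 2 hrs/day, peak/avg ~2.5x, blended ~5 Mbps.
MAU = 500e6
avg_concurrent = MAU * 2.0 / 24              # ~42M
peak_concurrent = avg_concurrent * 2.5       # ~100M (global drop can hit ~4x)

mix = {15e6: 0.15, 5e6: 0.45, 2.5e6: 0.30, 1e6: 0.10}   # bitrate (bps) -> viewer share
blended_bps = sum(rate * share for rate, share in mix.items())   # ~5.35 Mbps

peak_egress_tbps = peak_concurrent * blended_bps / 1e12
tiers = {"tier0_isp": 0.75, "tier1_pop": 0.20, "origin": 0.05}    # §3b split
for name, share in tiers.items():
    print(f"{name}: {peak_egress_tbps * share:.0f} Tbps")
print(f"total peak egress: {peak_egress_tbps:.0f} Tbps")  # ~500-550 Tbps; §3 rounds to 500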
(b) CDN PoP sizing
- Assume tiered CDN:
- Tier 0: ISP-embedded appliances (Netflix Open Connect-style). Target ~75% of egress (the head of the catalog; pre-warmed).
- Tier 1: ~500 core/regional PoPs globally (our own or partner CDNs). Target ~20% of egress (body of catalog + cache-miss fallback).
- Tier 2: ~12 origin regions. Target ~5% of egress (long tail, live re-transcode, new release pre-warm).
- Per-tier peak egress:
- Tier 0 = 0.75 × 500 Tbps = 375 Tbps spread across ~10k appliances in ~2k ISPs → 37.5 Gbps per appliance peak. Comfortable for an appliance with a 100G NIC and 2 × 100G uplinks.
- Tier 1 = 0.20 × 500 Tbps = 100 Tbps across 500 PoPs → 200 Gbps/PoP peak. Modest; today's major PoPs deliver multi-Tbps.
- Tier 2 = 0.05 × 500 Tbps = 25 Tbps across 12 regions → ~2 Tbps/region, fine.
Sanity check: Netflix publicly discloses Open Connect delivers >200 Tbps at peak at a scale of ~250M subs. We're roughly 2× that scale, so 500 Tbps is in line.
(c) Catalog storage
- 100,000 titles × avg 90 min = 9M min = 150k hrs of content.
- Per-title encoded bytes:
- Ladder: 15 rungs × avg ~4.5 Mbps (arithmetic mean; fat rungs dominate) × 5400 s = ~3 GB per rung average × 15 = ~45 GB per ladder per language.
- Codecs: AVC + HEVC + AV1 × 3 = ~135 GB per language.
- Languages/audio: 10 audio tracks × (same video, separate audio @ 192 kbps each) ≈ add 2 GB → video dominates.
- Subtitles: 30 languages × ~1 MB = 30 MB, rounding error.
- Per title: ~140 GB encoded (across all codecs/profiles).
- Plus master (mezzanine): ProRes 422 HQ at 220 Mbps × 5400 s = ~150 GB per title.
- Total encoded storage = 100k × 140 GB = 14 PB.
- Total master storage = 100k × 150 GB = 15 PB (kept forever, warm/cold tiered).
- With 3× replication + 10+4 erasure coding on masters (1.4× overhead vs 3× for hot): masters on EC = 21 PB raw; encoded hot on replication = 42 PB raw. Total raw ≈ 63 PB.
- At $20/TB/yr for hot object storage (internal tiered): ~$1.3M/yr storage. Dwarfed by egress.
(d) Transcode compute
- For each hour of source: encode a ladder (15 rungs × 3 codecs = 45 encodes, though shared analysis cuts overhead ~30%) + QC + package + DRM.
- AV1 is the expensive rung: ~30–60× realtime for quality preset on modern Xeon. HEVC ~10×. AVC ~3×.
- Blended per-hour-of-content: ~80 CPU-hours per hour of content (dominated by AV1 and top HEVC rungs).
- Volume: 10 titles/day × 1.5 hrs = 15 hrs/day new content, plus 3× re-encode on codec/DRM/ladder changes across back catalog = ~50 hrs content-encode-equivalent/day.
- = 4,000 CPU-hours/day → ~170 sustained cores at full utilization.
- With launch burst (Netflix-class drops): 10× peak = 1,700 cores burst. Spot/preemptible farm at ~$0.01/core-hour ≈ $17/hr burst, trivial.
- Key leverage: split each hour into 2–4 s GOP chunks, encode in parallel — turns a 90-min encode into 10-min wall-clock by using 400 chunks in parallel. Same total CPU, wildly faster wall-clock. Enables "ingest → publish" in <2 h for a 90-min feature.
(e) Transcode farm sizing for head-of-line titles
- A blockbuster master arrives 72 h before release.
- Need to emit full ladder + QC + package in <24 h so we have 48 h to propagate to edge appliances (§7a).
- Per title: 80 CPU-hrs/hr × 1.5 hrs content = 120 CPU-hrs per title; wall-clock target 4 h → need 30 cores in parallel for a single title. Trivial per title; the farm's value is handling many titles and re-encodes in parallel.
(f) DRM license server
- 100M concurrent viewers, each fetching ~1 license per playback session + periodic renewals (every 30–60 min for long sessions).
- Peak license issuance = new-session rate. Assume avg session = 45 min → 100M / 2700 s = ~37k licenses/s steady-state; during a major release start, 300k licenses/s burst.
- HSM-backed wrap is ~1ms per op on modern HSMs, but HSMs are throughput-limited. We put key wrapping in the HSM and session policy (geofence, device binding, entitlement) in the application layer.
- With regional DRM servers (24 regions × 3 pods × 5k ops/s/pod) = 360k ops/s capacity — handles 300k burst with 20% headroom.
- If we exceed HSM throughput: software key wrap with offline-mint HSM-signed chains (proxy re-encryption; §7c).
(g) Watch-history and recs
- Heartbeats every 10 s during playback × 42M average concurrent viewers → ~4.2M heartbeats/s.
- Each heartbeat: 200 B → ~800 MB/s into the write path, ~70 TB/day of raw events.
- This cannot hit Spanner directly (Spanner sustains on the order of tens of thousands of writes/s per node). It lands in Kafka, is downsampled to a "progress point every 15 s" per (user, title) in a Cassandra-style wide row, and is materialized daily into the warehouse.
(h) Cost anchor (why egress dominates)
- Egress at blended $0.01/GB (mixed tier-0/1/2 effective) × 500 Tbps × 86,400 = ~5.4 EB/day at peak-day rate, ~$54M/day if we paid retail. Reality: Tier-0 appliances at ISPs cost effectively $0.001/GB or less (ISP peers, no transit) → effective blended ~$0.003/GB → ~$16M/day peak, ~$8M/day avg, ~$2.5B/yr egress.
- Compute + storage combined: ~$50M/yr. DRM + control plane: ~$20M/yr.
- Total ~$2.6B/yr of which 95% is egress. This is the economic engine behind Open Connect.
4 High-Level API #
All client-facing APIs are HTTPS; manifests are served over HTTP/2, and segment fetches use HTTP/1.1 or HTTP/3 (QUIC) depending on client capability. Internal control-plane traffic is gRPC.
4.1 Ingest APIs (studio-facing, authn via mTLS + studio-key)
service Ingest {
// Step 1: get an upload session + part URLs
rpc InitiateUpload(InitRequest) returns (InitResponse);
// Response carries: upload_session_id, part_size_bytes, signed S3-style URLs
// for ~100MB parts up to 10k parts (1 TB max single upload).
rpc CompleteUpload(CompleteRequest) returns (CompleteResponse);
// Triggers transcode DAG. Returns job_id.
rpc GetTranscodeStatus(StatusRequest) returns (stream StatusResponse);
// Server-streaming RPC. Emits QC, ladder progress, DRM packaging done.
rpc PublishTitle(PublishRequest) returns (PublishResponse);
// Atomic catalog flip: draft → {geo_allowlist, valid_from}. Embargo-safe.
}
message InitRequest {
string studio_id = 1;
string title_id = 2;
int64 mezzanine_bytes = 3;
string mezzanine_sha256 = 4;
string container_hint = 5; // "IMF" | "ProRes_MOV" | "MXF"
Metadata meta = 6; // cast, runtime, content-rating, etc.
}
Chunked upload note. Uploads use a multipart pattern identical to S3 (InitiateMultipart → PutPart(PartNumber) → CompleteMultipart). 100 MB part size, max 10k parts. Resumable: part uploads are content-addressed, so a network blip retries only the failed part. End-to-end SHA-256 over reassembled bytes must match studio-supplied hash; mismatch = upload rejected, studio alerted.
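A minimal client-side sketch of that flow, assuming InitiateUpload returned signed part URLs that accept plain HTTP PUTs (the 100 MB part size is from the contract above; function names are illustrative, not a real SDK):

# Studio-side resumable multipart upload with end-to-end SHA-256 (requires `requests`).
import hashlib
import requests

PART_SIZE = 100 * 1024 * 1024  # 100 MB parts, per the ingest contract

def upload_master(path: str, part_urls: list[str]) -> str:
    """Uploads the mezzanine in parts, retrying only failed parts, and returns
    the SHA-256 the studio must also supply in InitiateUpload for verification."""
    whole = hashlib.sha256()
    with open(path, "rb") as f:
        for part_no, url in enumerate(part_urls):
            data = f.read(PART_SIZE)
            if not data:
                break
            whole.update(data)
            for _attempt in range(3):          # resumable: a network blip retries this part only
                resp = requests.put(url, data=data, timeout=300)
                if resp.ok:
                    break
            else:
                raise RuntimeError(f"part {part_no} failed after retries")
    return whole.hexdigest()                   # mismatch with studio hash => upload rejected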
4.2 Playback APIs (client-facing)
// Session ticket — opaque signed blob, 15-min lifetime, rotated every session
POST /v1/session/start
Body: { device_id, device_capabilities (codecs, HDCP, HDR), entitlements }
Response: { session_token, manifest_url_template, license_url, token_ttl }
GET /v1/manifest/{title}/{profile}?token={session_token}
Response: HLS (m3u8) or DASH (MPD) manifest. Segment URLs carry:
- cdn_prefix: per-session, signed with per-PoP key, TTL=12h
- rung identifiers
- DRM key IDs (kid) embedded
POST /v1/drm/license
Body: widevine_challenge | fairplay_spc | playready_challenge
Response: signed license blob (wrapped content key + policy)
POST /v1/heartbeat
Body: { session_token, title, position_ms, bitrate_ladder_pos, rebuffer_ms_since_last }
(10s interval; fire-and-forget; 204 No Content)
GET /v1/continue-watching/{user_id}
Response: list of recently-played titles with last-position
GET /v1/recs/{user_id}?surface=home&row=1&limit=40
Response: ranked title list with boost annotations
4.3 Segment serve (edge)
GET /s/{title_hash}/{profile}/{rung}/{seg_no}.m4s
?v={version} # content version; ladder re-encodes bump this
&cdn_sig={sig} # per-PoP-key-signed, checked at PoP
Cache-Control: public, max-age=31536000, immutable
Content-Type: video/mp4
Segment URLs are immutable (never rewritten for the life of a content version). This is critical: immutability enables max-age=1y, which is the reason our hit rate is >99% at edge. Any title update bumps the {version} component — a new URL space, no cache purge needed.
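A sketch of how a per-session, signed, immutable segment URL could be minted at manifest time (the HMAC construction and query layout here are assumptions for illustration, not the production scheme):

# Per-session segment URL signing; immutability comes from the {version} path component.
import hashlib, hmac, time

POP_KEY = b"per-pop-secret-rotated-12h"   # hypothetical per-PoP signing key

def sign_segment_url(title_hash: str, profile: str, rung: int,
                     seg_no: int, version: int, session_id: str) -> str:
    path = f"/s/{title_hash}/{profile}/{rung}/{seg_no}.m4s"
    expires = int(time.time()) + 12 * 3600           # TTL = 12h, per §4.2
    msg = f"{path}?v={version}&sid={session_id}&exp={expires}".encode()
    sig = hmac.new(POP_KEY, msg, hashlib.sha256).hexdigest()[:32]
    # A re-encode bumps {version}: a brand-new cache key, so max-age=1y + immutable
    # needs no purge.
    return f"{path}?v={version}&sid={session_id}&exp={expires}&cdn_sig={sig}"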
4.4 Failure semantics
- Manifest serve: if origin unhealthy, serve stale manifest from edge (TTL extended to 24h in degraded mode).
- Segment fetch: client-side fallback to next lower rung on 5xx or timeout >2s; CDN itself routes miss to origin shield, then origin.
- License fetch: if DRM license server is down, playback already-in-progress continues (license cached client-side for ≥1h); new sessions fail. Big red switch: "short-circuit license" for degraded mode (serves cached token-validated license for catalog already in flight — requires opt-in studio contract terms).
5 Data Schema + Engine Choice #
5.1 Catalog metadata (durable, strongly consistent)
Stored in Spanner (or CockroachDB / TiDB if open-source-preferred). Why: catalog is low-QPS writes (~100/day), extremely high-QPS reads (every home page load, every title click), must be globally consistent (a title becoming unavailable in region X needs to propagate fast), and transactional (embargo-release is a cross-column flip — availability + ladder URLs + DRM key refs flip atomically). Reads fronted by per-region cache (Redis) with pub-sub invalidation.
TABLE Title (
title_id STRING PRIMARY KEY, -- UUID
version INT64, -- content version; bumps = new ladder
display_name_i18n JSON, -- {"en":"Squid Game","ja":"イカゲーム",...}
runtime_ms INT64,
content_rating JSON, -- per-region: {"US":"TV-MA","DE":"16"}
geo_allowlist ARRAY<STRING>, -- ISO-3166-1 alpha-2
valid_from TIMESTAMP,
valid_until TIMESTAMP, -- for licensed content with expiry
hdr_flags BITMASK, -- HDR10, HDR10+, DolbyVision
subtitles ARRAY<STRING>, -- ["en","ja",...]
audio_langs ARRAY<STRING>,
mezzanine_master_uri STRING, -- s3://.../master.mxf
status ENUM(DRAFT, TRANSCODING, READY, RETIRED),
created_at TIMESTAMP, updated_at TIMESTAMP
)
TABLE Asset (
title_id STRING, version INT64, profile STRING, PRIMARY KEY(title_id,version,profile)
-- profile = e.g., "hevc_4k_hdr_10bit" | "avc_1080p" | "av1_1080p"
manifest_uri STRING, -- s3://... or /cdn/...
rungs JSON, -- [{rung:0,bitrate:450000,w:426,h:240,codec:"avc"},...]
audio_tracks JSON,
subtitle_tracks JSON,
drm_key_refs ARRAY<STRING>, -- key IDs for CENC; actual keys in KMS/HSM
vmaf_summary JSON, -- per-rung avg/p1 VMAF
packager_fingerprint STRING -- for idempotency / re-package detection
)
Read paths at page-load time issue 1–2 Spanner queries (fast; ~10ms) but are shielded by a per-region Redis that caches titles with 60s TTL + pub-sub invalidation on updates. Manifest-serving doesn't go back to Spanner at all — manifest is precomputed & stored in CDN/object store at publish time.
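A minimal sketch of that read-through cache with pub-sub invalidation (redis-py; `spanner_get_title` is a stub standing in for the authoritative catalog query):

# Per-region read-through cache in front of the catalog, invalidated via pub-sub.
import json
import redis

r = redis.Redis()
TTL_S = 60  # titles cached 60 s; pub-sub invalidation covers the hot path

def spanner_get_title(title_id: str) -> dict:
    # Placeholder for the authoritative Spanner query.
    return {"title_id": title_id, "status": "READY"}

def get_title(title_id: str) -> dict:
    cached = r.get(f"title:{title_id}")
    if cached:
        return json.loads(cached)
    row = spanner_get_title(title_id)
    r.setex(f"title:{title_id}", TTL_S, json.dumps(row))
    return row

def invalidation_listener():
    """Runs in each region; the catalog service publishes title_id on every update."""
    sub = r.pubsub()
    sub.subscribe("catalog-invalidate")
    for msg in sub.listen():
        if msg["type"] == "message":
            r.delete(f"title:{msg['data'].decode()}")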
5.2 User data (watch history, entitlements, profile)
Cassandra wide-row for watch history. Why: append-heavy, key-by-user, time-ordered, no cross-user joins in the hot path.
TABLE watch_history (
user_id TEXT, -- partition key
title_id TEXT, -- clustering key; one row per (user, title), upserted in place
last_position_ms BIGINT,
last_updated TIMESTAMP,
device_last TEXT,
PRIMARY KEY (user_id, title_id)
);
-- One bounded partition per user; "continue watching" reads the partition and orders
-- by last_updated at read time (CLUSTERING ORDER can't sort on a non-clustering column).
Writes arrive via Kafka → Flink aggregator (downsample 10 s heartbeats to a last-known position per (user, title), upserted every 15 s) → Cassandra. This decouples the raw firehose (~4M heartbeats/s) from the storage tier: after aggregation it is one single-partition upsert per active session per 15 s (~2.8M/s at average concurrency), which a wide Cassandra cluster absorbs comfortably. "Continue watching" is a 1-partition range scan, sub-5ms.
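The essence of that aggregation step, written as plain Python over a generic heartbeat iterator rather than real Flink API (the stream and the Cassandra upsert are injected placeholders):

# Collapse 10 s heartbeats to one last-known-position upsert per (user, title) per flush.
import time

FLUSH_INTERVAL_S = 15

def run_aggregator(heartbeat_stream, cassandra_upsert):
    latest = {}                        # (user_id, title_id) -> (position_ms, ts)
    last_flush = time.monotonic()
    for hb in heartbeat_stream:        # dicts: {"user_id", "title_id", "position_ms", "ts", ...}
        latest[(hb["user_id"], hb["title_id"])] = (hb["position_ms"], hb["ts"])
        if time.monotonic() - last_flush >= FLUSH_INTERVAL_S:
            for (user, title), (pos, ts) in latest.items():
                cassandra_upsert(user, title, pos, ts)   # idempotent single-partition upsert
            latest.clear()
            last_flush = time.monotonic()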
Entitlements (subscription tier, region, parental profile): Spanner — cross-consistent with billing.
5.3 Segments (object store + CDN)
S3-compatible object store (GCS / S3 / Azure Blob — we run multi-cloud origin for vendor leverage). Each segment is an immutable object:
Bucket: stream-origin-{region}
Key: {title_hash}/{version}/{profile}/{rung}/{seg_no}.m4s
Attributes:
content-type: video/mp4
x-amz-storage-class: STANDARD (hot) | INTELLIGENT_TIERING (body) | GLACIER (cold)
x-custom-vmaf: per-segment VMAF score (for QC)
x-custom-bitrate: actual encoded bitrate (for client-side VMAF/BW analytics)
x-cache-version: cdn cache-key component
Storage tiering via object-store lifecycle:
- Hot (first 30 days after title release or any title in top-10% watch share last 7 days): STANDARD, 3× replicated in-region, cross-region replication to 2 additional regions.
- Warm (30d–180d or middle-share): INTELLIGENT_TIERING — auto-demotes after 30d no access; single-region + 1 replica.
- Cold (old catalog, watched <10×/month/region): GLACIER / equivalent deep archive, retrieval-latency 3–5 min. Re-fetched to hot on demand into the origin shield, not directly to edge. Saves 80% on storage cost for tail.
- Master (mezzanine) cold: Glacier Deep Archive / tape. Checksummed, erasure-coded 10+4. Retrieved only for re-encode.
5.4 DRM state
- Content keys live in an HSM-backed KMS, keyed by kid. They never leave the HSM in plaintext; they are wrapped under a per-license-request key.
- License policy in a small, high-QPS DB (Redis + Spanner backing for audit): which entitlements can view which kid, what HDCP level is required, offline flag, max duration.
- Session table (24h TTL) in Redis cluster: session_id → (user_id, device_id, region, issued_at, max_concurrent).
5.5 Recommendation serving
- Offline candidate index (500M users × 500 candidates each): Faiss-style ANN index on user embedding × title embedding, updated every 6h. Sharded by user-id, replicated 3×.
- Online features (last-N watched, time-of-day, device, current session context): feature store (Redis), millisecond reads.
- Ranker model: deep neural net, served by TF-Serving / TorchServe, 50ms p99 per request (for ~500 candidates).
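A toy end-to-end pass over the two stages — brute-force dot-product retrieval standing in for the Faiss/ScaNN index, and a linear stub where the DNN ranker would sit (shapes and weights are illustrative):

# Two-stage recommendation pass: candidate generation, then ranking.
import numpy as np

def recommend(user_vec, title_matrix, title_ids, online_features, k=500, n=40):
    # Stage 1: candidate generation — top-k titles by embedding similarity.
    scores = title_matrix @ user_vec                  # (num_titles,)
    cand_idx = np.argpartition(-scores, k)[:k]
    # Stage 2: ranking — a linear stub; production uses a DNN over session,
    # device, and time-of-day features from the feature store.
    ranked = sorted(cand_idx,
                    key=lambda i: 0.7 * scores[i] + 0.3 * online_features(title_ids[i]),
                    reverse=True)
    return [title_ids[i] for i in ranked[:n]]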
5.6 Analytics (QoE)
- Kafka firehose (~4M events/s across QoE + heartbeat + engagement).
- Real-time path: Flink → per-title rebuffer-ratio gauge → SLI dashboards, alerting.
- Batch: hourly → warehouse (BigQuery/Snowflake) for ladder tuning, recommendation training.
- Never on the playback decision path.
5.7 Engine choice — why each pick
| Layer | Choice | Rejected | Why |
|---|---|---|---|
| Catalog | Spanner | DynamoDB, Postgres, MongoDB | Global consistency for embargoed drops; transactional multi-row writes on publish |
| Watch-history | Cassandra | Spanner, Redis persistent | Write volume; single-partition reads; eventual consistency OK ("continue watching" can be ~15s stale) |
| Segments | Object store + CDN | Self-hosted file servers | Bandwidth economics impossible without CDN; object store's PUT-once semantics align with immutable content model |
| Session cache | Redis cluster | Memcached | Atomic scripts for concurrent-stream policy check; pub/sub for invalidation |
| DRM keys | HSM-backed KMS | Pure SW key store | Compliance (studio contracts demand hardware) and theft-in-memory resistance |
| Analytics | Kafka → Flink + warehouse | Direct-to-DB | Volume + downstream branching (real-time + batch + ML) |
| Rec candidates | Faiss / ScaNN + feature store | Cassandra for embeddings | Vector search sublinear in candidate count |
6 System Diagram (Centerpiece — two planes) #
6.1 Top-level: Ingest + Delivery + Control planes
╔══════════════════════════════════════════════════════════════════════════════════════════════════╗
║ CONTROL PLANE (global) ║
║ ║
║ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ║
║ │ Catalog Svc │ │ Entitlement │ │ DRM License │ │ Reco Svc │ │ Analytics/ │ ║
║ │ (Spanner) │←→ │ Svc (Spanner) │←→ │ Svc (KMS/HSM)│ │ (Faiss+ANN │ │ QoE Pipeline │ ║
║ │ titles, rungs│ │ user sub+geo │ │ per-session │ │ +DNN ranker) │ │ (Kafka/Flink)│ ║
║ └───┬──────────┘ └───────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────▲───────┘ ║
║ │ pub-sub │ │ │ │ ║
║ │ invalidate │ per-session │ session+policy │ features+ │ QoE ║
║ │ │ checks │ │ candidates │ events ║
╚══════╪════════════════════════╪═════════════════╪══════════════════╪═══════════════════╪═════════╝
│ │ │ │ │
▼ ▼ ▼ ▼ │
┌────────────────────────────────────────────────────────────────────────────────────┐ │
│ DELIVERY PLANE (multi-region) │ │
│ │ │
│ User ──DNS/GSLB──▶ ISP-Embedded Appliance (Tier 0, 75% hit) │ │
│ │ │ miss │ │
│ │ ▼ │ │
│ │ Regional CDN PoP (Tier 1, 20% hit) │ │
│ │ │ miss │ │
│ │ ▼ │ │
│ │ Origin Shield (absorbs cache-miss herds; §7b) │ │
│ │ │ miss │ │
│ │ ▼ │ │
│ │ Regional Origin (5% traffic) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Object Store (S3/GCS) — authoritative source │ │
│ │ │ │
│ │ ┌─────────────── Manifest/Session/License/Heartbeat/Reco APIs ──────────┐ │ │
│ └─▶│ served by same PoP edge; manifest cached briefly; others dynamic │────┼──▶│
│ └────────────────────────────────────────────────────────────────────────┘ │ analytics
└─────────────────────────────────────────────────────────────────────────────────────┘
▲ authoritative segment upload (cross-region replication)
│
╔══════════════════════════════════════════════════════════════════════════════════════════╗
║ INGEST PLANE (regional) ║
║ ║
║ Studio ║
║ │ mTLS, resumable multipart (100MB parts, SHA256 end-to-end) ║
║ ▼ ║
║ ┌────────────┐ ┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌─────────────────┐ ║
║ │ Upload API │→│ Mezzanine │→│ Transcode │→│ QC │→│ DRM Packager │ ║
║ │ (chunked) │ │ Object Store │ │ Farm │ │ (VMAF+BSFormat│ │ (CENC CMAF; │ ║
║ │ │ │ (EC 10+4) │ │ (K8s+FFmpeg+ │ │ +audio+subs) │ │ Widevine/FairPlay│ ║
║ │ authn, │ │ Glacier DA │ │ AV1/HEVC/AVC; │ │ per rung+per │ │ /PlayReady) │ ║
║ │ resumable │ │ for masters │ │ GOP-chunked, │ │ segment VMAF)│ │ key refs to HSM │ ║
║ │ put_part │ │ │ │ DAG per title) │ │ │ │ │ ║
║ └────────────┘ └──────────────┘ └────────┬───────┘ └──────┬────────┘ └────────┬─────────┘ ║
║ │ │ │ ║
║ ▼ ▼ ▼ ║
║ Encoded Segment Object Store (authoritative) ║
║ │ ║
║ ▼ ║
║ Catalog Publish (transactional flip in Spanner) ║
║ │ ║
║ ▼ ║
║ Origin Replication: cross-region + Tier 0 PUSH ║
║ to ISP appliances per release schedule (§7b) ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝
6.2 ABR playback sub-diagram (zoom)
Client Edge PoP DRM Svc Control
│ │ │ │
│───session_start──────────────▶│ │ │
│ │──verify entitlement─────┼───────────────▶│
│ │◀─ok, session_token──────┼────────────────│
│◀─session_token─────────────────│ │ │
│ │ │ │
│───GET /manifest/{title}──────▶│ (edge cache, 60s) │ │
│◀─manifest (DASH/HLS, per- │ │ │
│ session signed URLs) │ │ │
│ │ │ │
│── POST /drm/license ─────────▶│──────────────────────────▶│ │
│ │ │──HSM wrap──────│
│◀─license (wrapped content key)│◀─license─────────────────│ │
│ │ │ │
│── GET /s/{title}/{profile}/ │ (Tier0→Tier1→Shield→Ori) │ │
│ /rung0/seg0.m4s ─────▶│ │ │
│◀─segment (4 MB, 2s content)───│ │ │
│ │ │ │
│──(ABR decision: throughput │ │ │
│ + buffer → choose rung) │ │ │
│ │ │ │
│── GET /s/.../rung4/seg1 ────▶│ │ │
│◀─segment──────────────────────│ │ │
│ ... │ │ │
│ │ │ │
│── POST /heartbeat (10s) ────▶│───async Kafka──────────────────────────────▶
│ │ │ │
6.3 Flash-event pre-warm sub-diagram (§7a, preview)
T−24h (before drop) Origin Regions ──── PUSH ────▶ Tier0 Appliances (ISP)
                      (10k appliances × ~140 GB per title (all codecs) ≈ 1.4 PB to distribute)
→ over managed-overnight peering windows
→ idempotent, signed, verified by content-hash
T-1h Flip catalog status in Spanner: READY
Pub-sub to edge PoPs: pre-load manifest
T=0 (global drop) Tier-0 already has the title: 75% traffic served
with zero cold-miss. Tier-1 warms in first ~2 min
organically.
Rebuffer ratio curve: smooth, no spike.
Every labelled arrow maps to §4 or §5:
| Arrow | API / Data | Reference |
|---|---|---|
| Client → Edge: session_start | §4.2 POST /v1/session/start | Validates entitlements against §5.2 |
| Client → Edge: /manifest | §4.2 GET /v1/manifest/... | Signed manifest from §5.3 |
| Client → DRM Svc: license | §4.2 POST /v1/drm/license | Key wrap from §5.4 HSM |
| Client → Edge: /s/... segment | §4.3 segment URL | Immutable object in §5.3 |
| Client → Edge: heartbeat | §4.2 POST /v1/heartbeat | Ingests to §5.6 Kafka |
| Studio → Ingest: upload | §4.1 multipart | Lands in §5.3 mezzanine bucket |
| Transcode → Segment store | internal | §5.3 encoded segment bucket |
| Catalog publish | §4.1 PublishTitle | Flips status → READY in §5.1 |
| Origin → Tier0 PUSH | internal sync loop | §7a |
| Heartbeat → Kafka | async | §5.6 analytics pipeline |
7 Deep-Dive: Four Critical Topics #
7a. CDN strategy: why we build (Open Connect-style) custom appliances, with rejected alternatives and bandwidth math
Why critical. Egress is 95% of our cost (§3h). Every 1% shift from commercial-CDN to ISP-peered appliance saves ~$25M/yr at our scale. The CDN decision is the defining architectural choice for a video streamer, far more than storage or transcode. Candidates who say "use a CDN" without decomposing lose the L7 point here.
The decision space:
| Option | Who owns | $/GB egress | Placement | Pre-warm cost | Operational burden | Unit economics at 500 Tbps |
|---|---|---|---|---|---|---|
| A. Pure 3rd-party CDN (Cloudfront/Akamai/Fastly) | Them | $0.02–0.08/GB retail, $0.005–0.015 on huge contracts | Their PoPs | They pay their origin | Minimal | ~$8–25M/day unmitigated; $2–4M/day w/ contracts. Still $700M–$1.4B/yr. |
| B. Pull-CDN + origin shield | Them + us | Same as A, less shield-level re-fetch | Their PoPs | Small (their cache fills on first miss) | Low | Saves 5–10% vs A by reducing shield-miss traffic. Still billions. |
| C. Multi-CDN (A+B+C with steering) | Them × N | Weighted; renegotiation leverage ~−20% | Their PoPs | Double | Medium — need a steering layer (e.g., Cedexis/NS1) | Saves another ~15% via competitive pricing. $600M–$1.2B/yr. |
| D. Custom appliances at ISP (Netflix Open Connect / YouTube Google Global Cache) | Us, colocated at ISPs | Effectively peering-cost + capex ≈ $0.001–0.003/GB amortized | Inside ISP networks, one hop from eyeball | Huge — must PUSH title 24h pre-drop | High — hardware fleet of 10k+ appliances; fleet operations | $200–400M/yr all-in (capex amortized + peering + staff). 60–75% savings vs 3rd-party. |
| E. Custom appliances at our own IXPs only (no ISP embedding) | Us | ~$0.005/GB | Regional | Medium | Medium | ~$500M/yr. Middle ground. |
Chosen: Hybrid — D (for head 75%) + B (Tier 1 pull-CDN commercial) + C fallback (multi-CDN) for spillover.
Why hybrid and not pure D: tail catalog (bottom 70% of titles by watch-share) is 25% of bytes; pushing every title to every appliance would require roughly 2× the appliance storage for bytes that are rarely watched — worse unit economics than renting pull-CDN capacity for the tail. We break even at roughly a Pareto threshold: if the top 10% of titles are ~75% of watch time (typical), push those + hot new releases; pull the rest.
The push schedule — why 24h pre-drop is non-obvious:
Naive argument: "CDN caches warm on demand, no pre-warm needed." That's wrong at our scale because:
- 100M viewers start within ~5 minutes of a global drop.
- Miss-rate at cold edge is ~100%. First-miss flows to shield → origin.
- Even if Tier-1 CDN has 500 PoPs, each miss on a new title triggers ~500 origin fetches (one per PoP) before the PoP's local cache fills.
- With 3-codec × 15-rung × 2,700 segments per title, a cold start is ~500 PoPs × 15 rungs × maybe 10 segments in-flight each = 75,000 origin-bound requests in the first minute, per title — and we may have 5 titles dropping a week.
- That load on origin is tolerable; the bigger pain is the cache-miss latency spike to viewers, breaking the p99 <2s startup SLO.
- We'd need origin bandwidth = ~5% of peak = 25 Tbps just to survive the first 5 min, built for usage we only hit at drops. Wasteful.
Pre-push insight (earned-secret depth): "Netflix pushes content to ISP-embedded Open Connect Appliances during the ISP's off-peak window (2am–6am local) in the 24h before a major release, over managed peering. This moves the bottleneck from CDN-bandwidth-under-flash-load to ingest-side replication scheduling. The second-order effect: our transcode deadline backs up by 24h, which forces us to demand mezzanine masters from studios 72h pre-release instead of 48h — a business-logistics change driven by cache architecture." Pure pull-CDN can approximate this by synthetic-traffic pre-warming (send fake fetches from every PoP) + origin shielding, but (a) you pay commercial CDN BW for the synthetic traffic (often >$100k for one title's pre-warm), (b) you still can't pre-warm ISP-level appliances you don't own, and (c) you measure the pre-warm effectiveness only by watching whether your rebuffer ratio spikes at T=0 — a lagging metric with a reputational cost.
Push architecture (the hard parts):
- Schedule: per title, compute placement plan: which appliances get which rungs. Not every rung to every appliance — Tier 0 appliances get only the top ~5 rungs (HEVC/AV1 at 1080p+4K), since ISP subs skew high-bandwidth; AVC low rungs remain at Tier 1.
- Deduplication: content is chunked at 4MB granularity with content-addressable hashes; re-pushing a title with updated metadata reuses unchanged chunks. ~90% reuse across re-encodes.
- Flow control: push rate-limited per-ISP to their capacity. Each appliance has a 1–4 TB SSD + spinning disk; we prioritize push by "predicted watch share in the next 7 days" as scored by a dedicated ML model, evicting lowest-share content first.
- Verification: appliance signs a per-chunk receipt back to origin; per-appliance readiness monitor fails a drop (rolls back to T+24h) if <95% of appliances are ready at T−1h.
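A sketch of the content-addressed chunking that makes the dedup and per-appliance delta push above work (4 MB granularity; names are illustrative):

# Content-addressed 4 MB chunks; an appliance only fetches hashes it lacks.
import hashlib

CHUNK = 4 * 1024 * 1024

def chunk_hashes(path: str) -> list[str]:
    out = []
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            out.append(hashlib.sha256(block).hexdigest())
    return out

def plan_push(title_chunks: list[str], appliance_has: set[str]) -> list[str]:
    """Return only the chunks this appliance is missing — re-pushing a re-encoded
    title reuses the ~90% of chunks that didn't change."""
    return [h for h in title_chunks if h not in appliance_has]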
Real systems named: Netflix Open Connect (OCA hardware, FreeBSD-based), YouTube Google Global Cache (GGC), Disney's BAMTech (pull-CDN heavy with aggressive shielding), Akamai's managed CDN for many mid-tier streamers, Fastly's edge compute for manifest customization.
Failure modes:
- OCA at ISP crashes / disk fails: DNS/GSLB steers viewers to the next-nearest (regional Tier 1). Detected in <30s via PoP-level health-check. Blast radius = that ISP until repair (replaced within 72h via logistics). Mitigation: over-provision each metro with ≥2 appliances so one failure = graceful degradation to Tier 1.
- Push window missed pre-drop: monitored at T−6h; if <90% readiness, page the release team. Options: (a) delay the drop in that region (contractually permitted for some titles, not others), (b) push from Tier-1 PoPs in-line (drops 75% of hits down to 20%, raises egress cost and risks rebuffer). Big red switch: "activate Tier 1 fallback posture" — doubles expected cost for 24h, absorbs the flash.
- Bad encode pushed: content version is included in the object key; roll forward by publishing a new version and invalidating the old (pull-CDN model), or (for pushed titles) push v2 → flip manifest → leave v1 for in-flight sessions (bounded TTL).
Bandwidth sanity check (closes the loop with §3b):
- 500 Tbps peak × 0.75 Tier 0 = 375 Tbps / 10,000 appliances = 37.5 Gbps/appliance peak. Each appliance has 2×100G NICs — comfortable headroom.
- 375 Tbps × 3600 s = ~168 PB/hour egress at peak, out of ISP-peered appliances at ~$0.001/GB. Peak hour ≈ $168k; day average ≈ $1.5–2M/day. Annualized: ~$700M for Tier 0 alone, consistent with the ~$2.5B/yr total egress number in §3h (the other tiers, DRM/control, origin, manifest, and recs make up the rest).
7b. Transcoding pipeline at scale: per-title encoding, GOP-parallel, per-rung VMAF targeting
Why critical. Transcode is where encoding quality × bandwidth cost × startup latency collide. A 20% bitrate reduction at equal VMAF saves $500M/yr in egress at our scale. It is worth burning 10× encode compute to find it.
The naive approach is a fixed bitrate ladder (e.g., {240p@400k, 360p@700k, 480p@1.2M, 720p@2.5M, 1080p@5M, 4K@15M}). Apply it to every title. Simple, fast (~1× realtime).
What's wrong with that:
- A talking-heads documentary needs 40% less bitrate than an action movie to hit the same perceptual quality.
- A dimly-lit horror film benefits from HDR + 10-bit much more than a cartoon, and its perceptually-optimal 1080p rung is at ~3 Mbps, not 5 Mbps.
- Serving the naive ladder wastes ~25% of bitrate on average.
Per-Title Encoding (PTE), chosen approach:
Step 1: Convex-hull analysis. Encode the title at ~30 (resolution, QP) points — a grid of candidates. For each point, compute VMAF and bitrate. Plot (bitrate, VMAF); the upper-left-outward frontier is the convex hull of efficient rungs.
Step 2: Select rungs from the hull. Target specific VMAF buckets (e.g., 65, 75, 85, 93, 97) and pick the minimum-bitrate point achieving each. Per-title ladder has 5–8 rungs, non-uniform per title.
Step 3: Encode the chosen rungs fully, with the quality preset tuned per codec.
Cost: ~5× more CPU-hours than naive (the pre-analysis dominates), buying ~20–25% bitrate savings on average. At 500 Tbps × $0.003/GB effective egress, 20% savings = $500M/yr. Compute cost to achieve it: ~5× analysis overhead at roughly $30 of compute per hour of content × 50 content-hrs/day × 365 ≈ $3M/yr. ROI is ~150×.
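A compact sketch of steps 1–2, assuming the ~30 trial encodes have already produced (bitrate, VMAF) pairs; this builds the efficient frontier and picks the minimum-bitrate rung for each VMAF bucket (illustrative, not the production optimizer):

# Per-title rung selection from trial-encode measurements.
def efficient_frontier(candidates):
    """candidates: list of (bitrate_bps, vmaf). Keep points no cheaper candidate
    beats on VMAF — the convex-hull-ish upper-left frontier."""
    frontier, best_vmaf = [], -1.0
    for br, vmaf in sorted(candidates):          # ascending bitrate
        if vmaf > best_vmaf:
            frontier.append((br, vmaf))
            best_vmaf = vmaf
    return frontier

def pick_ladder(candidates, targets=(65, 75, 85, 93, 97)):
    frontier = efficient_frontier(candidates)
    ladder = []
    for t in targets:
        hit = next(((br, v) for br, v in frontier if v >= t), None)
        if hit and hit not in ladder:            # minimum-bitrate point reaching the bucket
            ladder.append(hit)
    return ladder                                # 5–8 non-uniform rungs per title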
GOP-parallel encoding (implementation):
- Split each master into 2–4 s closed-GOP chunks (split at I-frames; the encoder is configured to force IDR at chunk boundary).
- Dispatch chunks as independent jobs to a K8s batch fleet (or AWS MediaConvert-like managed). Each chunk encoder is FFmpeg with preset tuned.
- Reassemble at the packager stage (CMAF fragments already have natural segment boundaries, often aligned with the GOP).
- Why: serial encode of a 90-min film at AV1 quality is ~60× realtime = 90 hours. GOP-parallel with 1000 chunks → <10 min wall-clock for the same film.
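A sketch of that chunked dispatch, shelling out to FFmpeg per chunk and fanning out across a local process pool (flags are representative, not a tuned production command; the real farm would dispatch to K8s jobs instead):

# GOP-parallel chunk encoding: carve the mezzanine into ~4 s pieces, encode concurrently.
import subprocess
from concurrent.futures import ProcessPoolExecutor

CHUNK_S = 4

def encode_chunk(src: str, start_s: int, idx: int, bitrate: str = "5M") -> str:
    out = f"/tmp/chunk_{idx:05d}.mp4"
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start_s), "-t", str(CHUNK_S), "-i", src,
        "-c:v", "libx265", "-b:v", bitrate,
        "-force_key_frames", "expr:eq(n,0)",     # closed GOP at the chunk boundary
        "-an", out,                              # audio is encoded separately
    ], check=True, capture_output=True)
    return out

def encode_title(src: str, duration_s: int, parallelism: int = 64) -> list[str]:
    starts = range(0, duration_s, CHUNK_S)
    with ProcessPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(encode_chunk, src, s, i) for i, s in enumerate(starts)]
        return [f.result() for f in futures]     # reassembled by the CMAF packager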
Per-segment VMAF as QC gate:
- Every encoded segment has its VMAF measured against the reference master-segment.
- If any segment's VMAF falls more than ~3 points below its rung's target, the encode fails QC and the job re-runs with a higher bitrate budget. Catches encoder corner cases (high-motion scene collapse, grain preservation).
- Measured VMAF is stored as object-store metadata (§5.3) for observability and client-analytics pairing.
DRM packaging (single-encode-many-DRM, CENC CMAF):
- Encode once per codec/rung; package into CMAF fragmented-MP4 with CENC (Common Encryption) — in practice the cbcs (AES-CBC) scheme, since FairPlay requires it and modern Widevine/PlayReady clients accept it — so the same encrypted bytes are decoded by all three DRM systems (Widevine via CDM, FairPlay via StreamingKeyDelivery, PlayReady via PRO header).
- Saves ~3× storage and serving cost vs. per-DRM encoding.
- Real systems: every modern streaming service uses this; bento4 + shaka-packager are the reference tools.
Codec choice math:
| Codec | BW saving vs AVC | Encode cost vs AVC | Device coverage (our MAU) | Verdict |
|---|---|---|---|---|
| H.264/AVC | baseline | baseline | ~100% | Must have |
| HEVC/H.265 | −30–40% BW | 5× CPU | ~90% (iOS, newer Android, most TVs) | Chosen |
| AV1 | −30% vs HEVC, −50% vs AVC | 20–40× CPU | ~40% (Chrome, newer Android, 2023+ TVs) | Chosen for top rungs where savings compound (e.g., 4K) |
| VP9 | ≈HEVC | 10× CPU | ~60% (no iOS) | Skipped — AV1 eclipses it going forward |
Chosen: triple-codec (AVC + HEVC + AV1) with client-side capability negotiation in manifest. Top rungs (4K, 1080p) get AV1 for devices that support it; AVC always included as baseline. Per the math above, AV1 pays back its encode cost in ~1 week of serving at our scale.
Real systems: Netflix's Dynamic Optimizer (PTE grandfather); YouTube's similar Videogen; FFmpeg/libaom/x265 as encoders; AWS MediaConvert / Elemental for managed alt; Bitmovin for 3rd-party.
Failure modes:
- Encoder crash mid-job: K8s restarts the chunk; idempotent chunk IDs ensure no duplicate writes. Wall-clock slips by minutes.
- Bad mezzanine (corrupt, wrong color space, silent audio): QC catches, encode marked FAILED, studio notified, release blocked. Pre-ingest validator detects most of these in the first ~10s.
- Ladder misconfig (bitrate too low): per-segment VMAF gate fails the encode. Regression test on previously-encoded titles catches per-title optimizer regressions.
7c. DRM license serving at 300k licenses/s under flash load
Why critical. A DRM outage == no playback. Unlike most services where failure degrades, a DRM fault is absolute: client cannot decrypt, black screen. It is also the component least visible to casual candidates — "add DRM" hand-waves over its hardest parts.
The load shape:
- Steady-state 37k licenses/s (§3f).
- Flash: major release globally, 100M viewers starting within 5 min. Peak license request rate ≈ 300k/s for 5 min.
- HSM ops are serial and ~1ms per op; one HSM = ~1000 ops/s. Need 300 HSMs to absorb burst — expensive, ~$20k each amortized.
The hot-path breakdown per license:
1. Parse client challenge (from the CDM); extract requested kids and device cert.
2. Validate entitlement (subscription, region, parental, concurrent-streams) — Spanner or cached in Redis.
3. Derive policy (HDCP level, playback duration, offline flag).
4. Wrap content key under device session key (HSM op).
5. Sign response.
Optimization: move steps 1–3 and 5 to app servers; only step 4 on HSMs. That's obvious. The earned-secret optimization:
Proxy re-encryption / session-key intermediate (chosen):
- HSMs pre-wrap content keys under regional session master keys at ingest time (not per-license). Regional session master key rotates daily.
- At license time, app server wraps from regional-session-master-key to device-session-key in software (not HSM). Standard AES-KW is ~100 ns in software.
- HSMs are used only for (a) rotating the regional-session-master-key daily, and (b) attesting the rotation via signed chain.
- Security argument: an app-server compromise leaks at most one day's worth of keys, and only for content the compromised server served. A full HSM extraction still requires HSM breach.
- Throughput: app servers do millions of ops/s; HSMs never saturate.
Trade-off: a small reduction in HSM-derived security guarantee (keys exit the HSM under regional-session wrap). For our threat model (professional pirate, not state actor) this is acceptable and it's what high-scale streamers actually do. It's also what Widevine L3 effectively does in its server-side extraction model.
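A minimal sketch of the two-level wrap under those assumptions, using RFC 3394 AES key wrap from the `cryptography` package (key sizes and names are illustrative; the HSM boundary is implied, not shown):

# Two-level wrap: HSM pre-wraps content keys under a daily regional session master
# key (RSMK); the app server re-wraps to the device session key in software.
import os
from cryptography.hazmat.primitives.keywrap import aes_key_wrap, aes_key_unwrap

def prewrap_content_key(rsmk: bytes, content_key: bytes) -> bytes:
    # Ingest time, behind the HSM: wrap each content key under today's RSMK.
    return aes_key_wrap(rsmk, content_key)

def issue_license_key(rsmk: bytes, prewrapped: bytes, device_session_key: bytes) -> bytes:
    # License time, app-server hot path: software unwrap + re-wrap, no HSM call.
    content_key = aes_key_unwrap(rsmk, prewrapped)          # RSMK cached for the day
    return aes_key_wrap(device_session_key, content_key)    # goes into the license blob

rsmk = os.urandom(32)   # rotated daily by the HSM, attested via a signed chain
ck = os.urandom(16)     # per-title content key (kid -> key)
wrapped_for_device = issue_license_key(rsmk, prewrap_content_key(rsmk, ck), os.urandom(32))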
Concurrent-stream enforcement:
- Every license issuance creates a row in a Redis "active sessions" set per user (TTL = 60 min, refreshed by heartbeat).
- Cardinality check before issuing: SCARD active_sessions:{user} < max_concurrent_streams. Atomic Lua to prevent a race.
- If over limit, kick the oldest session (server issues a "stop" to that device via a push channel).
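A sketch of that atomic check as a redis-py registered Lua script — admission refusal shown; kicking the oldest session would be a follow-on step (key naming and TTL mirror the bullets above):

# Atomic concurrent-stream check: count-and-add cannot race across license servers.
import redis

r = redis.Redis()

CHECK_AND_ADD = r.register_script("""
local n = redis.call('SCARD', KEYS[1])
if n >= tonumber(ARGV[1]) then return 0 end
redis.call('SADD', KEYS[1], ARGV[2])
redis.call('EXPIRE', KEYS[1], 3600)   -- refreshed by heartbeats
return 1
""")

def admit_session(user_id: str, session_id: str, max_streams: int = 4) -> bool:
    key = f"active_sessions:{user_id}"
    return bool(CHECK_AND_ADD(keys=[key], args=[max_streams, session_id]))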
Offline license:
- Different policy: offline=true, duration=48h, key_lifetime=30d; the policy is embedded in the license itself so client enforcement is self-contained.
- Stored in a separate table for revocation lineage; license servers can issue revocation on subscription cancel.
Failure modes:
- HSM outage: regional session master keys cached server-side for 24h (refreshed daily); software path keeps serving. Degraded mode: cannot rotate keys; 24h SLO on restoration.
- Cross-region license server outage: GSLB fails over to nearest healthy region. Add ~50ms latency. SLO met.
- License theft (stolen subscription): revocation list; license servers refuse to issue for revoked device. Existing licenses expire in 1–60 min depending on policy.
- Big red switch: "serve cached license for last 24h's titles to any active session" — degrades to no enforcement for ongoing sessions (studio contracts permit this for ≤1h as emergency).
Real systems named: Google Widevine, Apple FairPlay Streaming, Microsoft PlayReady, Amazon's BuyDRM integrations; HSMs from Thales/Luna, AWS CloudHSM, Google Cloud HSM; reference packagers: bento4, shaka-packager.
7d. ABR + rebuffer-ratio optimization: chosen rung-per-segment math
Why critical. Rebuffer is the #1 signal for viewer abandonment (every 1% rebuffer ratio ≈ 2% reduction in minutes viewed industry-wide). ABR is entirely client-side — our server role is to make the client's job easy: smart ladder, accurate manifests, low-variance segment sizes, CDN reliability. The interviewer probe specifically calls out p99 <2s startup and <0.5% rebuffer.
The main ABR algorithm families:
| Algorithm | Decision input | Rebuffer resilience | Bitrate efficiency | Complexity |
|---|---|---|---|---|
| Throughput-only (BBA, classic) | EMA of segment throughput | Poor on oscillating network | Ladder-greedy | Simple |
| Buffer-based (BOLA) | Buffer occupancy | Excellent (self-stabilizing on buffer) | Slightly under-utilizes BW when buffer deep | Medium |
| Hybrid MPC (BOLA-E, RobustMPC) | Buffer + throughput + forecast | Best empirically; ~20% less rebuffer than BBA | Best empirically | Higher |
| ML-driven (Pensieve, Puffer) | Multi-feature NN | Promising, uneven in field | Variable | Highest; model retraining pipeline |
Chosen: RobustMPC / BOLA-E hybrid, client-side. Server role: ensure the manifest advertises a ladder that lets the algorithm make good decisions.
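To make the division of labor concrete, a deliberately simplified rung-choice heuristic in the spirit of a buffer + throughput hybrid — not the published BOLA-E/RobustMPC algorithms, just the flavor of the decision the client makes per segment:

# Pick the highest rung whose next segment can download before the buffer
# (minus a reserve) would drain. Parameters are illustrative.
def choose_rung(ladder_bps, throughput_ewma_bps, buffer_s,
                segment_s=4.0, reserve_s=10.0, safety=0.8):
    best = ladder_bps[0]                              # always have a floor rung
    for rung in ladder_bps:                           # ladder in ascending bitrate order
        download_s = rung * segment_s / max(throughput_ewma_bps * safety, 1.0)
        if download_s <= max(buffer_s - reserve_s, 0) + segment_s:
            best = rung
    return best

# e.g. 18 s of buffer on an ~6 Mbps link picks the 5 Mbps rung, not 15 Mbps:
print(choose_rung([1e6, 2.5e6, 5e6, 15e6], throughput_ewma_bps=6e6, buffer_s=18))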
Server-side optimizations that impact rebuffer:
- Per-title ladder (§7b) — clients always have a good rung to step down to.
- Variance-controlled encode — we use capped-CRF (not strict CBR) so a segment's encoded size is within ±15% of its ladder-nominal bitrate. Predictable download times ⇒ predictable buffer state ⇒ fewer ABR oscillations.
- Segment size = 2–4 s — shorter = lower startup latency (first I-frame sooner) and faster rung switches; longer = better compression. Chosen: 4s for standard, 2s for live/LL-HLS. Per-segment VMAF gate ensures 2s segments don't collapse.
- Initial rung selection for startup:
- Client opens playback; first segment fetched must be small enough for p99 startup <2s.
- At p99, a weak connection is ~1.5 Mbps. A 4 s segment at 720p@2.5 Mbps is 1.25 MB → 6.7 s to download. Miss.
- Solution: the client fetches a short initialization-segment + the first video-segment at the lowest sensible rung (typically 360p/500k ≈ 250 KB, ~1.3 s on 1.5 Mbps), then ramps up within ~10 s.
- Manifest advertises #EXT-X-START (HLS) / Period@start (DASH) hints to steer the player's starting behavior.
- Prefetch the first segment on manifest fetch. Use HTTP/2 Server Push from the manifest endpoint: pushes the first init-segment before the client requests it. Cuts ~50–150 ms off startup. Real deployment: most CDNs (Akamai, Fastly) support this; CloudFront does not — we accept the skew.
- DRM license parallelism. Critical: the client must not serialize fetch-manifest → fetch-license → fetch-first-segment. Modern players (ExoPlayer, Shaka, AVPlayer) fetch manifest + init-segment + license concurrently once the manifest URL is known; the license is consumed when the first content segment arrives. This reduces time-to-first-frame from sum-of-three to max-of-three. We ensure our manifest carries the license URL in an early-parseable position.
Rebuffer budget math:
- Target rebuffer ratio <0.5%.
- Break down sources: (a) network oscillation unreached by ABR, (b) CDN miss storms, (c) DNS/TLS RTT spikes, (d) encoder stutters (pathological segment size).
- Allocate budget: (a) <0.2%, (b) <0.1%, (c) <0.1%, (d) <0.1%.
- For (b), this translates directly to a CDN miss-rate tolerance: at 99.5% hit rate and ~100ms miss penalty, rebuffer contribution = 0.005 × 100ms / 4000ms segment = 0.0125% of play time. Fine. We set a hit-rate alarm at 99% (2× our budget) as early warning.
The "first 2 seconds" optimization chain (earned-secret):
Startup p99 <2s is the hardest SLO. Breakdown of a cold play on LTE:
- DNS 50ms (DoH cached) → TCP+TLS 150ms (QUIC skips some of this) → session_start call 100ms → manifest fetch 80ms (edge-cached) → parse 20ms → license fetch (parallel with init-segment) 150ms → first video-segment fetch @ 500 kbps = 250ms → decoder init 100ms → first frame.
- Sum (serial): 900ms. With parallelism (license || init-segment), the critical path is ~800ms. With QUIC to shave TLS: ~700ms.
- p99 is worst-case: add 2× network variance, 500ms DNS fallback, TLS resumption failure → ~2000ms. We're dancing on the line. Every 50ms matters.
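A quick arithmetic cross-check of that breakdown (same numbers as above; the parallel path overlaps the license fetch with the first-segment fetch):

# Startup budget: serial sum vs. the parallel critical path.
steps_ms = {"dns": 50, "tcp_tls": 150, "session_start": 100, "manifest": 80,
            "parse": 20, "license": 150, "first_segment": 250, "decoder_init": 100}

serial = sum(steps_ms.values())                                          # 900 ms
parallel = serial - min(steps_ms["license"], steps_ms["first_segment"])  # license || segment
print(serial, parallel)  # 900 750 -> ~800 ms in practice with scheduling overhead; QUIC shaves more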
Per-second optimizations we enforce:
- TLS session resumption across requests within a play session.
- HTTP/3 (QUIC) for handset clients — saves one RTT for handshake.
- Manifest caching at edge with 60s TTL; invalidated on version bump.
- Fail-fast on segment errors: 5xx triggers an immediate retry to next-rung-down rather than exponential backoff, which blows the startup budget.
Failure modes:
- ABR algorithm bug ships to clients: rolling update gated by per-fleet QoE regression. If rebuffer ratio creeps up post-release in a cohort, auto-rollback.
- Encoder emits pathological segment sizes (e.g., high-motion scene at a low rung grossly over-sized): QC catches; per-segment size audit at ingest.
- CDN route flap: client falls back to next-best rung; sticky CDN selection per session to prevent flapping.
Real systems named: Netflix's Chunked Dash Optimizer, Twitch's low-latency HLS, MPEG-DASH with LL extension, Apple LL-HLS, BOLA/RobustMPC (CMU & MIT papers), Pensieve, Shaka Player, ExoPlayer.
8 Failure Modes & Resilience (pager-carryable) #
| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Tier-0 appliance at one ISP | HW fail, disk fail | appliance healthcheck 30s; PoP-level egress drop | ~5% of that ISP's viewers (if 2-appliance metro); 0% of others | DNS/GSLB auto-shifts that ISP's traffic to Tier 1 PoP; rebuffer ticks up <0.1% for duration | Replace hardware within 72h via logistics; re-sync via push |
| Whole Tier-1 PoP outage | Datacenter / carrier fault | Per-PoP RED metrics; external synthetic (Catchpoint) | 2–5% of regional egress | GSLB fails to nearest Tier 1 (adds ~10ms latency); Tier 0 continues serving | Restore PoP or reroute via peering |
| Origin region fails | Cloud region outage | Cross-region health; origin 5xx | 5% of traffic that was shield-missing to that region | Traffic shifts to next origin (cross-region replicated); shield-miss adds latency to first requests only | Failover complete in <5 min; full restoration per cloud SLA |
| Cache-miss storm (bad new release, push failed) | PoPs cold-miss to origin on new title | Origin bandwidth spike; rebuffer spike | Cold new release rebuffer >5% for ~5 min | Origin shield absorbs; emergency posture shifts to Tier 1 fallback with higher BW; traffic shaping if shield saturates | Complete pre-push retroactively; monitor |
| Transcoder job mid-crash | Node reboots mid-chunk | K8s pod lifecycle; chunk idempotency ID | One chunk slips 10 min | Auto-retry on another node; idempotent writes | Chunk re-encodes; title publish unaffected |
| DRM license server outage | Full regional DRM fault | License 5xx rate >1% in 30s | New sessions in region can't start; in-flight sessions run on cached licenses 15–60 min | GSLB to nearest healthy region; big red switch = "serve cached policy" fallback (opt-in studio-permitted titles only) | Restore region; queue of held sessions unblocks in <10 min |
| Recommendation service down | Ranker unavailable | Reco 5xx | Home page still loads with popularity-sorted fallback (from Spanner); personalized rows missing | Circuit-break reco; serve cached "yesterday's recs" per user from Redis | Ranker restore; recs resume; no permanent loss |
| Catalog DB (Spanner) unavailable | Spanner outage | Cache reads still serve; writes fail | New titles can't publish; no catalog updates; playback mostly unaffected (manifest cached) | Read-only mode: serve cached catalog; publish workflows pause | Restore Spanner; drain publish queue |
| Analytics pipeline lag | Kafka consumer backlog | Kafka lag alerts | Recs + QoE dashboards stale; playback unaffected | Pipeline is best-effort; drop events if lag >30 min to stop cascading | Scale consumers; lag recovers |
| Regulatory geo-block fails open (title served in embargoed country) | Misconfigured region list | Contract compliance monitor; studio complaint | Legal/contractual risk, not technical | Immediate catalog flip to remove region; audit; incident report to studio | Within minutes; legal follow-up |
| Studio-pushed bad master | Wrong color space, corrupted audio | Ingest validator + VMAF QC | Publish blocked; studio notified | Pre-ingest checks; explicit failure reasons | Re-ingest |
| CDN auth-token leak (pirate distribution) | Tokens circulating | Anomaly detection on session/token ratios; geographic heatmap anomalies | Piracy; revenue leak | Rotate signing key, invalidate all outstanding tokens; re-auth all users in region (brief pain) | Forensic trace; possibly escalate DRM policy |
| Client device bug causes login stampede after app update | Millions of simultaneous retries | Session_start RPS spike | Manifest API saturates; startup times degrade | Rate-limit at gateway + exponential backoff recommended in client; app-update rollback if severe | Coordinate with client team |
9 Evolution Path #
v1 (ship in 3 months — MVP regional launch):
- Single origin region (us-east) with one cloud provider's object store.
- Pull-CDN from one 3rd-party (CloudFront or equivalent) for delivery.
- H.264 only, 5-rung ladder, fixed per-title.
- HLS only (defer DASH); Widevine + FairPlay (PlayReady can wait).
- No per-title encoding; fixed ladder.
- Watch-history in Postgres (not yet Cassandra — scale not there).
- Popularity-sorted "recs" from SQL aggregation, no ML.
- No ISP appliances; no cross-region origin replication.
- Goal: prove functional correctness, ~1M subs in one region.
v2 (ship in 9 months — 10× scale, multi-region, multi-DRM):
- 3 origin regions; cross-region async replication of encoded assets.
- Multi-CDN steering between 2–3 commercial CDNs (pull-based, origin shielding).
- HEVC added; per-title encoding for top 1000 titles.
- DASH added; PlayReady for Xbox/UWP.
- Watch-history migrates to Cassandra; Spanner for catalog.
- Two-stage recs: offline collaborative filtering + online popularity booster.
- DRM: HSM-backed keys, proxy re-encryption added for scale.
- Full ABR ladder; BOLA client shipped.
- Analytics: Kafka + Flink + BigQuery.
- Goal: 100M subs across 3 continents.
v3 (ship in 24 months — global, custom appliances, PTE, ML recs):
- Open Connect-style ISP-embedded appliances at top 100 ISPs globally.
- AV1 codec for top rungs; triple-codec serving.
- Per-title encoding across entire back catalog; convex-hull optimizer in production.
- Dynamic optimizer ML model trained on observed QoE.
- ML-ranker for recs (deep neural, real-time features).
- Regional DRM with proxy re-encryption; concurrent-stream enforcement via Redis.
- Flash-event pre-warm pipeline; T-24h push with placement optimizer.
- 12 origin regions; erasure-coded masters; Glacier deep archive for tail.
- Goal: global footprint, 300–500M MAU.
v4 (research / future):
- AV2 / LCEVC / low-power codecs as device support solidifies.
- Edge compute for personalized manifest (per-user ad insertion, regional splice).
- ML-driven ABR (Pensieve-class) shipped to clients where it beats MPC.
- End-to-end encrypted watch-history (no Netflix-server knows what you watched) for privacy-stringent regions.
- Live-VOD hybrid (live premieres that transition to VOD) with shared ingest pipeline.
- Foundation-model-assisted content understanding for better recs and subtitle generation.
10 Out-of-1-Hour Notes #
Codec selection economics. AV1's 30% BW reduction vs HEVC × $2.5B/yr egress = $750M/yr if we could serve only AV1. Device penetration gates this: we ship AV1 to supporting devices (40% of MAU today, climbing to ~80% by 2027). HEVC serves the next ~50%, AVC baseline covers the rest. VP9 abandoned — AV1 eclipses it going forward and patent landscape is cleaner. LCEVC is a "scalability enhancement layer" (delta over a base layer) interesting for bandwidth-constrained markets but ecosystem tooling is thinner.
Subtitles & multi-audio tracks. CMAF allows subtitle and audio tracks in separate representations sharing the same CENC encryption. Storage cost modest; serving cost a rounding error. For accessibility: we require closed-captions on every title (US ADA, EU EAA compliance). Audio description tracks for visually impaired (separate audio representation). Lyrics / karaoke: subset.
Kids content compliance. COPPA in the US, GDPR-K in EU; kids profile must not send personalized recs based on adult viewing history. Implementation: kids profiles share a user_id with the account but have a profile_type=kids flag that routes to a restricted reco model and filters catalog to rated-for-kids. Geo-restrictions also apply (some content is kids-ok in country A, not country B).
Pre-release embargo. Contractual with studios; title exists in the system in DRAFT for days-to-weeks, with encoded assets pre-pushed to edge but manifest unavailable. PublishTitle at the drop moment flips a single bit in Spanner — because assets are already in place, the drop is effectively instantaneous. Crucial: audit log at publish time, signed by the release manager; on-call during the drop.
Live-streaming sub-system (separate). Architecturally distinct from VOD. Ingest via RTMP/SRT/WebRTC to regional ingest gateways; re-encode in real-time with 2–4s latency (LL-HLS) or sub-second (WebRTC for ultra-low-latency, e.g., sports bet interactions). Separate origin servers optimized for hot-cache-only; separate CDN profile. Shares the playback client, DRM, catalog metadata. Biggest operational difference: live has no re-try opportunity — the segment either reaches the viewer in time or it's stale forever. Chunked transfer encoding + LL-HLS push allows ~1-2s glass-to-glass.
Ad insertion (SSAI / CSAI). Server-Side Ad Insertion rewrites the manifest per-session, splicing pre-roll/mid-roll/post-roll ad segments. Integrates cleanly with our manifest-per-session model. Client-Side: simpler but ad-blockable; SSAI is preferred for mandated ads. Sub-components: ad decision service, ad-creative CDN, session-bound manifest generator. This is a ~6 engineer-year project on top of core platform.
Observability gold standards (playback SLOs).
- Per-session QoE events piped from client: startup time, rebuffer events (count + duration), bitrate switches, errors.
- SLI/SLO/error budget per playback SLO:
- Startup p99 <2s, budget 10⁻³ sessions/month exceed 5s (~500k sessions budget at 500M MAU).
- Rebuffer ratio <0.5%, budget: 5% of sessions exceed 2% rebuffer ratio.
- Playback availability 99.99%.
- Error budget consumption dashboards per region, per device-class (mobile, TV, desktop), per title.
- Per-title QoE surfaced to content ops: if a new release has anomalously high rebuffer in region X, investigate (bad edge push? bad encode on a specific rung?). Content quality is as operational as compute is.
- CDN hit-rate per PoP per title — leading indicator for cache-miss storms.
Piracy / DRM escape hatches. Determined attackers will screen-cap; HDCP 2.2 tries to stop HDMI capture. Camcording a screen is the unavoidable low-bar attack. We don't chase screen-cappers; we chase bulk subscription-sharing (concurrent-stream enforcement), token replay (session binding, rotating CDN signing keys), and key extraction (HSM + hardware DRM). Fingerprinting (steganographic watermark per-user) for post-leak attribution is an open research area — Netflix uses forensic watermarking on premium content; we could add as v3+.
Peering & anycast. Tier-1 and Tier-2 PoPs connect to ISPs via settlement-free peering wherever possible. Anycast for manifest/session/license APIs (HTTP control plane) uses BGP to send traffic to nearest PoP — cuts ~10–50ms per RPC. Segments are unicast via content-aware routing (nearest PoP with cache).
Observability of appliances (Tier 0). Each OCA-equivalent phones home every 60s: disk health, cache hit-rate, egress Gbps, CPU, temperature. A fleet of 10k appliances at 60s cadence is ~170/s of telemetry — trivial. Alerts on: egress drop (appliance failing silently), cache miss-rate spike (likely push failed), disk predicted to fail.
Privacy & data residency. Watch-history and entitlement records are PII in many jurisdictions. EU DPA mandates data localization — watch-history for EU users kept in EU-region Cassandra clusters, no replication to non-EU. Catalog metadata is not PII, replicates globally. Request-ID sampling for debugging must be GDPR-compliant (no bulk access outside justified investigations).
Testing specifics.
- Multi-region failover chaos drills monthly.
- Synthetic playback from 50+ geographic probes continuously; any p99 startup drift alerts.
- Pre-release push rehearsals: push a synthetic title to all appliances, measure push-time distribution, gate release on 95th-percentile-readiness meeting budget.
- Encoder regression: every 100th encoded title gets an automated diff against its prior version; VMAF regression >2 points blocks publish.
- DRM chaos: periodically kill a DRM region in staging; verify GSLB shifts and sessions recover.
Green-field: what would I change if starting today (2026)?
- QUIC/HTTP/3 by default for all API surfaces (control plane + data plane segment delivery). Already deployed at Google-scale and ~20% latency win.
- Rust for edge services (manifest, license wrap app-layer, session). Not for transcoders — FFmpeg ecosystem dominates C++ there.
- Confidential computing (AMD SEV-SNP, Intel TDX) for HSM-alternative — cheaper, comparably secure for our threat model, and enables per-tenant-region key isolation more flexibly than HSM quotas.
- eBPF-based observability in the PoP — TCP retransmits, QUIC ACK delays at wire speed for finer-grained rebuffer diagnosis.
- Foundation-model-based recommendation scoring — retrieve-augmented ranker with user-history + text synopsis embeddings; likely +10% engagement at unclear compute cost vs DNN ranker.
Verification Checklist (done before submission) #
- SRE pager-carryable? Yes — §8 is a runbook with detection, blast radius, mitigation, recovery per component, including the "big red switches" (serve-cached-license, Tier-1 fallback posture, catalog read-only mode).
- Every diagram arrow → §4 or §5? Yes — the table at end of §6 cross-references every labelled arrow to an API surface or data store.
- Deep-dives at L7 depth? Yes — §7a derives CDN Pareto + ISP-push scheduling + bandwidth sanity-checked against §3; §7b derives PTE ROI ($500M/yr savings for $3M/yr compute) and GOP-parallel wall-clock math; §7c derives proxy-re-encryption as HSM decoupling with a concrete threat-model trade-off; §7d walks the full startup critical-path millisecond-by-millisecond and shows the parallelism that keeps us under 2s p99.
- Capacity math closes? 500 Tbps = 100M concurrent × 5 Mbps. Breaks into 375 Tbps Tier 0 / 100 Tbps Tier 1 / 25 Tbps origin. Tier 0 → 10k appliances × 37.5 Gbps/each (100G NIC × 2 = comfortable). Storage 14 PB encoded + 15 PB masters = ~30 PB logical, ~63 PB raw with EC/replication. Transcode farm 4k CPU-hours/day base, 40k peak. DRM 37k licenses/s base, 300k burst, 24 regions × 3 × 5k = 360k capacity. Egress $2.5B/yr dominates ~$2.6B all-in. Numbers close.