
Q14 — Product & Edge Systems

Design a Video Streaming Service

Support creator uploads, transcoding, adaptive-bitrate playback, CDN delivery, and watch-history-driven recommendations.

Fanout / delivery · Streaming · Cost / efficiency · Availability · Partitioning

1 Problem Restatement & Clarifying Questions #

Restatement (say this first, 20 seconds): Build a global video streaming platform that ingests licensed mezzanine masters from studios, transcodes each into an adaptive bitrate ladder with DRM packaging, distributes via a tiered CDN (edge PoPs + ISP-embedded appliances) at multi-hundred-Tbps peak egress, and serves ABR playback with p99 startup <2s and rebuffer ratio <0.5%. Secondary surface: watch-history, personalized recommendations, per-region regulatory compliance. The cost center dominates — >70% of total infra spend is egress bandwidth — so every design choice is evaluated against its bandwidth bill first.

Clarifying questions I would ask, with the default I'd adopt if told "you decide":

# Question Why it matters Default if unspecified
Q1 YouTube-style UGC vs Netflix-style licensed catalog? UGC has a long tail of cold content (millions of titles, most watched <10 times) and ingest at human-generated scale (500 hrs/min). Licensed has 100k premium titles with predictable demand spikes and rights management. Netflix-style licensed. Justification: the problem names "transcoding pipeline," "catalog," and "geo-restrictions" — all point at a curated-rights model. UGC is a separate design (§10). I'd call this out explicitly so the interviewer corrects me early if wrong.
Q2 VOD only, or live + VOD? Live ingest is a separate sub-system (RTMP/SRT → low-latency HLS/LL-DASH with 2–6s glass-to-glass). VOD only. Live-streaming treated as adjunct (§10).
Q3 MAU / peak concurrency? Sets egress BW and CDN PoP count. 500M MAU, 100M peak concurrent (major-release global drop). §3 BOE math pivots on this.
Q4 DRM level? Studio contracts usually mandate HDCP 2.2 + L1 Widevine for 4K. Determines whether decryption is hardware-bound (L1) or SW-allowed (L3). L1 = client-locked, no cloud-DRM shortcut. Widevine L1 / FairPlay / PlayReady SL3000 for 4K; L3 fallback for SD. Assume all three DRM systems required (iOS/Android/Smart-TV fragmentation).
Q5 4K / HDR / Dolby Vision / Atmos? Adds top rungs to the ladder, doubles encode cost, triples egress bytes per viewer. Yes to all. Premium catalog; trade-off is real.
Q6 Geographic scope; regulatory regimes? Drives region count, geo-fencing, per-country rights fencing, data residency (EU DPA, India DPDPA). Global minus embargoed countries. 18 regions, 3 CDN tiers (core PoPs, regional PoPs, ISP-embedded appliances).
Q7 Offline downloads? Needs persistent DRM license with bounded validity (e.g., 48h; 30d when connected). Yes. Download path is a variant of playback, same DRM but a different license policy.
Q8 Cost target per streaming hour? Anchors whether we use 3rd-party CDN, build custom appliances, or hybrid. <$0.015 per streaming hour all-in (bandwidth + compute + DRM). §3 shows custom-appliance ROI.
Q9 SLO targets: startup time, rebuffer, availability? The three numbers that define "good video." p99 startup <2s, rebuffer ratio <0.5%, playback availability 99.99%. Catalog browse 99.95% (reco service allowed to degrade separately).
Q10 Recommendation scope? Candidate-gen + ranker or one-shot? Drives whether we separate online/offline embedding. Two-stage: offline candidate gen (collaborative filtering, graph walks) + online ranker (deep neural, real-time features). §7d depth.

I'd spend ~90 seconds on these, commit to the defaults, and say "I'll flag it wherever these assumptions bite."


2 Functional Requirements #

In scope (numbered):

  1. FR1 — Studio ingest. InitiateUpload(title_id, manifest) → chunked resumable upload of mezzanine master (typically ProRes or JPEG2000 IMF, 200–500 GB per 90-min feature) → object store → trigger transcode DAG.
  2. FR2 — Transcode to ABR ladder. Per-title ladder selection (dynamic optimizer), parallel segment encoding in 2–4s GOP chunks, multi-codec output (H.264 AVC, HEVC, AV1), multi-container (HLS fMP4, DASH CMAF), DRM packaging (Widevine/FairPlay/PlayReady via a single CENC CMAF package).
  3. FR3 — QC & conformance. Automated perceptual quality checks (VMAF per rung), bitstream conformance (ffmpeg/Bento4), audio loudness (EBU R 128), subtitle timing.
  4. FR4 — Catalog publish. Transactional flip from "draft" to "available in regions X" with geo-allowlist; embargoed release (countdown-to-live).
  5. FR5 — Manifest serve. GET /v1/manifest/{title}/{session} returns signed HLS/DASH manifest with per-session CDN-token-rewritten segment URLs and per-profile rung list.
  6. FR6 — Segment serve. Stateless CDN-cacheable GET /s/{title}/{rung}/{seg_no}.m4s; 99.9%+ hit rate at edge PoPs for head catalog.
  7. FR7 — DRM license exchange. POST /drm/license exchanging a CDM-provided challenge for a signed license (Widevine/FairPlay/PlayReady key wrap).
  8. FR8 — ABR playback. Client selects rung per segment via buffer + throughput signals. We serve, don't decide (clients own ABR).
  9. FR9 — Watch-history & resume. POST /heartbeat every 10s; GET /continue-watching/{user} returns last positions with user-consistent-read.
  10. FR10 — Recommendations. GET /recs/{user}?surface=home|detail|postplay — candidate gen + rank, with diversification.
  11. FR11 — Analytics. QoE (rebuffer, startup, bitrate ladder switches) and engagement (play, pause, abandon) events; feeds recs + SLI dashboards + ladder tuning loop.
  12. FR12 — Download-for-offline. Scoped DRM license, bounded offline duration.
  13. FR13 — Geo-restriction / parental controls / kids profile. Enforced at manifest issuance (server-authoritative, cannot be bypassed by changing client state).
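FR8 puts the ABR decision on the client. A minimal throughput-plus-buffer heuristic can be sketched as below — names, the 0.7 safety factor, and the 10 s low-buffer threshold are illustrative, not the BOLA-E/MPC hybrid a production player would actually run:

```python
def choose_rung(rungs_bps, est_throughput_bps, buffer_s,
                safety=0.7, low_buffer_s=10):
    """Pick the highest rung that fits estimated throughput with headroom;
    back off one rung when the buffer is close to draining."""
    affordable = [r for r in sorted(rungs_bps) if r <= est_throughput_bps * safety]
    if not affordable:
        return min(rungs_bps)          # can't afford anything: take the lowest rung
    if buffer_s < low_buffer_s and len(affordable) >= 2:
        return affordable[-2]          # buffer low: be one rung more conservative
    return affordable[-1]
```

A per-segment call like `choose_rung(ladder, measured_bps, buffer_s)` is what FR8's "buffer + throughput signals" boils down to; the server only serves what the client asks for.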

Out of scope (say out loud):

  • Live streaming (RTMP/SRT ingest, low-latency HLS, chat overlay) → separate sub-system, sketched in §10.
  • UGC ingest at human-generated rates (500 hrs/min YouTube). Our ingest is ~10–100 titles/day from a handful of studios.
  • Comments, social, watch-parties — app-layer features, separate services.
  • Payment / subscription billing — separate service; we consume entitlements.
  • Content moderation / Trust & Safety — trivialized for licensed catalog (studios vet); for UGC it would be huge.
  • Ad insertion / SSAI — noted in §10 as a clean extension (per-segment URL rewriting at manifest time).

3 Non-Functional Requirements + Capacity Estimate #

3.1 NFRs (with explicit SLO targets)

NFR Target How we achieve it
Playback availability 99.99% measured as played-without-fatal-error / started Tiered CDN with 3 fallbacks; origin replicated 3× cross-region; client-side rung fallback
p99 startup time (click → first video frame) <2s Manifest edge-cached; first segment prefetched alongside the manifest; DRM license pre-fetched on manifest fetch
Rebuffer ratio (rebuffer-seconds / play-seconds) <0.5% ABR tuned (BOLA-E+MPC hybrid); segment prefetch = 30s buffer target; CDN hit >99.5%
Durability of master 11 nines (1 - 10⁻¹¹ annual loss) Object store w/ erasure coding 10+4 across AZs; tape archival for deep cold; checksum on ingest and every replication hop
Durability of encoded assets 9 nines; re-transcodable if lost Master survival is ground truth; encoded ladder is recomputable (cost: 1 day & ~$30k per missing title)
DRM license issuance latency p99 <300ms License servers co-located with edge regions; HSM-backed key wrap hot-path sub-10ms
Catalog metadata freshness (new release → searchable) <60s globally Spanner / pub-sub fan-out
Recommendation freshness (watched → re-ranked) <5 min for online features; daily for batch Online ranker with last-N session features
Analytics completeness (QoE ingest) 99.9% events landed within 1 min Lossy in-memory buffer → Kafka with 3× replication → batch to warehouse
Cost per streaming hour (all-in) <$0.015 Open Connect-style appliances + pull-CDN spillover; AV1 where devices support it (30% BW reduction)

3.2 Back-of-envelope math — every number calculated

(a) Peak egress bandwidth

  • 500M MAU × 2 hrs/day avg viewing = 1B streaming-hours/day = ~42M concurrent average.
  • Peak/average ratio for global VOD ~2.5× (evening TZ spike); for major releases ~4× (global drop).
  • Peak concurrent ≈ 100M viewers.
  • Blended bitrate (4K HDR 15 Mbps, 1080p 5 Mbps, 720p 2.5 Mbps, mobile 1 Mbps; weighted 15%/45%/30%/10%) = ~5 Mbps average.
  • Peak egress = 100M × 5 Mbps = 500 Tbps.

That number is the single biggest constraint. Build everything else backward from it.
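The §3.2(a) arithmetic as a runnable sanity check (inputs are the stated assumptions; note the weighted mix actually comes out at ~5.35 Mbps, which the text rounds to 5):

```python
# Back-of-envelope peak egress, mirroring the bullets above.
mau = 500e6
viewing_hours_per_day = 2
avg_concurrent = mau * viewing_hours_per_day / 24     # streaming-hours spread over a day
peak_concurrent = 100e6                               # global-drop peak (~2.4x average)
# ladder mix: 15% 4K@15Mbps, 45% 1080p@5, 30% 720p@2.5, 10% mobile@1
blended_mbps = 0.15 * 15 + 0.45 * 5 + 0.30 * 2.5 + 0.10 * 1
peak_egress_tbps = peak_concurrent * blended_mbps / 1e6
```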

(b) CDN PoP sizing

  • Assume tiered CDN:
    • Tier 0: ISP-embedded appliances (Netflix Open Connect-style). Target ~75% of egress (the head of the catalog; pre-warmed).
    • Tier 1: ~500 core/regional PoPs globally (our own or partner CDNs). Target ~20% of egress (body of catalog + cache-miss fallback).
    • Tier 2: ~12 origin regions. Target ~5% of egress (long tail, live re-transcode, new release pre-warm).
  • Per-tier peak egress:
    • Tier 0 = 0.75 × 500 Tbps = 375 Tbps spread across 10k appliances in ~2k ISPs → **37.5 Gbps per appliance peak**. Consistent with 100G NIC + 2 × 100G uplinks per appliance.
    • Tier 1 = 0.20 × 500 Tbps = 100 Tbps across 500 PoPs → 200 Gbps/PoP peak. Modest; today's major PoPs deliver multi-Tbps.
    • Tier 2 = 0.05 × 500 Tbps = 25 Tbps across 12 regions → ~2 Tbps/region, fine.

Sanity check: Netflix publicly discloses Open Connect delivering >200 Tbps at peak, at ~250M subs. We assume roughly 2× that subscriber base, so 500 Tbps is in line.

(c) Catalog storage

  • 100,000 titles × avg 90 min = 9M min = 150k hrs of content.
  • Per-title encoded bytes:
    • Ladder: 15 rungs × ~4.5 Mbps average per rung (harmonic-weighted; fat top rungs dominate) × 5400 s ≈ 3 GB per rung × 15 = ~45 GB per ladder per language.
    • Codecs: AVC + HEVC + AV1 × 3 = ~135 GB per language.
    • Languages/audio: 10 audio tracks × (same video, separate audio @ 192 kbps each) ≈ add 2 GB → video dominates.
    • Subtitles: 30 languages × ~1 MB = 30 MB, rounding error.
  • Per title: ~140 GB encoded (across all codecs/profiles).
  • Plus master (mezzanine): ProRes 422 HQ at 220 Mbps × 5400 s = ~150 GB per title.
  • Total encoded storage = 100k × 140 GB = 14 PB.
  • Total master storage = 100k × 150 GB = 15 PB (kept forever, warm/cold tiered).
  • With 3× replication + 10+4 erasure coding on masters (1.4× overhead vs 3× for hot): masters on EC = 21 PB raw; encoded hot on replication = 42 PB raw. Total raw ≈ 63 PB.
  • At $20/TB/yr for hot object storage (internal tiered), **$1.3M/yr storage**. Dwarfed by egress.
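The storage bullets, restated as executable arithmetic (per-title figures are the assumptions above):

```python
# §3.2(c) storage math.
titles = 100_000
encoded_gb = 140          # ~45 GB/ladder x 3 codecs + ~2 GB audio, rounded
master_gb = 150           # ProRes 422 HQ: 220 Mbps x 5400 s ~= 148.5 GB
encoded_pb = titles * encoded_gb / 1e6
master_pb = titles * master_gb / 1e6
raw_pb = master_pb * 1.4 + encoded_pb * 3   # 10+4 EC masters; 3x-replicated hot encodes
storage_cost_yr = raw_pb * 1_000 * 20       # $20/TB/yr hot object storage
```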

(d) Transcode compute

  • For each hour of source: encode a ladder (15 rungs × 3 codecs = 45 encodes, though shared analysis cuts overhead ~30%) + QC + package + DRM.
  • AV1 is the expensive rung: ~30–60× slower than realtime per core at quality presets on a modern Xeon. HEVC ~10×. AVC ~3×.
  • Blended per-hour-of-content: ~80 CPU-hours per hour of content (dominated by AV1 and top HEVC rungs).
  • Volume: 10 titles/day × 1.5 hrs = 15 hrs/day new content, plus 3× re-encode on codec/DRM/ladder changes across back catalog = ~50 hrs content-encode-equivalent/day.
  • = 4,000 CPU-hours/day ≈ 170 sustained cores at full utilization.
  • With launch burst (Netflix-class drops): 10× peak = 1,700 cores burst. Spot/preemptible farm at ~$0.01/core-hour = ~$17/hr burst, trivial.
  • Key leverage: split each hour into 2–4 s GOP chunks, encode in parallel — turns a 90-min encode into 10-min wall-clock by using 400 chunks in parallel. Same total CPU, wildly faster wall-clock. Enables "ingest → publish" in <2 h for a 90-min feature.
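The wall-clock leverage of chunked encoding in two lines (the text's "10-min" figure is the same order of magnitude; exact wall clock depends on chunk granularity and straggler chunks):

```python
# Same total CPU, parallelized across GOP chunks.
cpu_hours_per_content_hour = 80
feature_hours = 1.5
total_cpu_hours = cpu_hours_per_content_hour * feature_hours   # per-title encode cost
parallel_chunks = 400                                          # one worker per chunk
wall_clock_min = total_cpu_hours / parallel_chunks * 60
```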

(e) Transcode farm sizing for head-of-line titles

  • A blockbuster master arrives 72 h before release.
  • Need to emit full ladder + QC + package in <24 h so we have 48 h to propagate to edge appliances (§7b).
  • Per title: 80 CPU-hrs × 1.5 hrs content = **120 CPU-hrs** per title, wall-clock target 4 h → need 30 cores in parallel for a single title. Trivial per title; the farm's value is handling many titles and re-encodes in parallel.

(f) DRM license server

  • 100M concurrent viewers, each fetching ~1 license per playback session + periodic renewals (every 30–60 min for long sessions).
  • Peak license issuance = new-session rate. Assume avg session = 45 min → 100M / 2700 s = ~37k licenses/s steady-state; during a major release start, 300k licenses/s burst.
  • HSM-backed wrap is ~1ms per op on modern HSMs, but HSMs are throughput-limited. We put key wrapping in the HSM and session policy (geofence, device binding, entitlement) in the application layer.
  • With regional DRM servers (24 regions × 3 pods × 5k ops/s/pod) = 360k ops/s capacity — handles 300k burst with 20% headroom.
  • If we exceed HSM throughput: software key wrap with offline-mint HSM-signed chains (proxy re-encryption; §7c).
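The §3.2(f) capacity check, runnable (pod counts as stated above):

```python
# License-issuance rate vs regional pod capacity.
peak_concurrent = 100e6
avg_session_s = 45 * 60
steady_licenses_per_s = peak_concurrent / avg_session_s   # new-session rate
capacity = 24 * 3 * 5_000        # regions x pods x ops/s per pod
burst = 300_000                  # major-release start
```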

(g) Watch-history and recs

  • Heartbeats every 10 s during playback × 42M avg concurrent → ~4.2M heartbeats/s.
  • Each heartbeat: 200 B → 800 MB/s into the write path, ~70 TB/day of raw events.
  • This cannot go into Spanner directly (a node sustains only tens of thousands of writes/s). Lands in Kafka, downsampled to a "progress point every 15 s" in a Cassandra-style wide row, materialized to the warehouse daily.

(h) Cost anchor (why egress dominates)

  • Egress: 500 Tbps sustained over 86,400 s ≈ 5.4 EB on a peak day. At a blended retail rate of $0.01/GB that would be ~$54M/day. Reality: Tier-0 appliances at ISPs deliver at effectively $0.001/GB or less (ISP peering, no transit) → effective blended ~$0.003/GB → ~$16M/day peak, ~$8M/day avg, ~$2.5B/yr egress.
  • Compute + storage combined: ~$50M/yr. DRM + control plane: ~$20M/yr.
  • Total ~$2.6B/yr of which 95% is egress. This is the economic engine behind Open Connect.
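The cost-anchor arithmetic, executable (price points are the assumptions above):

```python
# Why egress dominates: 500 Tbps sustained for a day at three price points.
peak_tbps = 500
eb_per_day = peak_tbps * 1e12 / 8 * 86_400 / 1e18     # terabits/s -> exabytes/day
retail_per_day = eb_per_day * 1e9 * 0.01              # $0.01/GB retail transit
blended_per_day = eb_per_day * 1e9 * 0.003            # effective with Tier-0 appliances
```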

4 High-Level API #

All client APIs are HTTPS: manifests over HTTP/2; segment fetches over HTTP/1.1 or HTTP/3 (QUIC) depending on client; gRPC between internal control-plane services.

4.1 Ingest APIs (studio-facing, authn via mTLS + studio-key)

service Ingest {
  // Step 1: get an upload session + part URLs
  rpc InitiateUpload(InitRequest) returns (InitResponse);
  // Response carries: upload_session_id, part_size_bytes, signed S3-style URLs
  // for ~100MB parts up to 10k parts (1 TB max single upload).

  rpc CompleteUpload(CompleteRequest) returns (CompleteResponse);
  // Triggers transcode DAG. Returns job_id.

  rpc GetTranscodeStatus(StatusRequest) returns (StatusResponse) {
    // Streaming RPC. Emits QC, ladder progress, DRM packaging done.
  }

  rpc PublishTitle(PublishRequest) returns (PublishResponse);
  // Atomic catalog flip: draft → {geo_allowlist, valid_from}. Embargo-safe.
}

message InitRequest {
  string studio_id = 1;
  string title_id = 2;
  int64  mezzanine_bytes = 3;
  string mezzanine_sha256 = 4;
  string container_hint = 5;  // "IMF" | "ProRes_MOV" | "MXF"
  Metadata meta = 6;          // cast, runtime, content-rating, etc.
}

Chunked upload note. Uploads use a multipart pattern identical to S3 (InitiateMultipart → PutPart(PartNumber) → CompleteMultipart). 100 MB part size, max 10k parts. Resumable: part uploads are content-addressed, so a network blip retries only the failed part. End-to-end SHA-256 over reassembled bytes must match studio-supplied hash; mismatch = upload rejected, studio alerted.
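The verification side of that multipart pattern can be sketched as follows — `split_parts` and `complete_upload` are hypothetical helper names, and the test uses tiny parts in place of 100 MB ones:

```python
import hashlib

def split_parts(blob: bytes, part_size: int):
    """Content-addressed parts: (part_no, sha256, bytes). On a network blip
    only the failed part retries; already-acked parts are skipped on resume."""
    return [(i, hashlib.sha256(blob[o:o + part_size]).hexdigest(), blob[o:o + part_size])
            for i, o in enumerate(range(0, len(blob), part_size))]

def complete_upload(parts, expected_sha256: str) -> bool:
    """CompleteUpload-side check: reassemble in part order and compare the
    end-to-end SHA-256 with the studio-supplied hash; mismatch = reject."""
    body = b"".join(p[2] for p in sorted(parts))
    return hashlib.sha256(body).hexdigest() == expected_sha256
```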

4.2 Playback APIs (client-facing)

// Session ticket — opaque signed blob, 15-min lifetime, rotated every session
POST /v1/session/start
  Body: { device_id, device_capabilities (codecs, HDCP, HDR), entitlements }
  Response: { session_token, manifest_url_template, license_url, token_ttl }

GET /v1/manifest/{title}/{profile}?token={session_token}
  Response: HLS (m3u8) or DASH (MPD) manifest. Segment URLs carry:
    - cdn_prefix: per-session, signed with per-PoP key, TTL=12h
    - rung identifiers
    - DRM key IDs (kid) embedded

POST /v1/drm/license
  Body: widevine_challenge | fairplay_spc | playready_challenge
  Response: signed license blob (wrapped content key + policy)

POST /v1/heartbeat
  Body: { session_token, title, position_ms, bitrate_ladder_pos, rebuffer_ms_since_last }
  (10s interval; fire-and-forget; 204 No Content)

GET /v1/continue-watching/{user_id}
  Response: list of recently-played titles with last-position

GET /v1/recs/{user_id}?surface=home&row=1&limit=40
  Response: ranked title list with boost annotations

4.3 Segment serve (edge)

GET /s/{title_hash}/{profile}/{rung}/{seg_no}.m4s
    ?v={version}       # content version; ladder re-encodes bump this
    &cdn_sig={sig}     # per-PoP-key-signed, checked at PoP
Cache-Control: public, max-age=31536000, immutable
Content-Type: video/mp4

Segment URLs are immutable (never rewritten for the life of a content version). This is critical: immutability enables max-age=1y, which is the reason our hit rate is >99% at edge. Any title update bumps the {version} component — a new URL space, no cache purge needed.
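A sketch of building and checking those URLs, assuming an HMAC-SHA256 per-PoP signature (the real signing scheme is whatever the CDN supports; function names are illustrative):

```python
import hashlib, hmac

def segment_url(title_hash, profile, rung, seg_no, version, pop_key: bytes):
    """Immutable segment path per §4.3; bumping `version` yields a brand-new
    URL space, so no cache purge is ever needed."""
    path = f"/s/{title_hash}/{profile}/{rung}/{seg_no}.m4s?v={version}"
    sig = hmac.new(pop_key, path.encode(), hashlib.sha256).hexdigest()[:32]
    return f"{path}&cdn_sig={sig}"

def verify_at_pop(url: str, pop_key: bytes) -> bool:
    """PoP-side check: recompute the signature over everything before cdn_sig."""
    path, _, sig = url.partition("&cdn_sig=")
    expect = hmac.new(pop_key, path.encode(), hashlib.sha256).hexdigest()[:32]
    return hmac.compare_digest(sig, expect)
```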

4.4 Failure semantics

  • Manifest serve: if origin unhealthy, serve stale manifest from edge (TTL extended to 24h in degraded mode).
  • Segment fetch: client-side fallback to next lower rung on 5xx or timeout >2s; CDN itself routes miss to origin shield, then origin.
  • License fetch: if DRM license server is down, playback already-in-progress continues (license cached client-side for ≥1h); new sessions fail. Big red switch: "short-circuit license" for degraded mode (serves cached token-validated license for catalog already in flight — requires opt-in studio contract terms).

5 Data Schema + Engine Choice #

5.1 Catalog metadata (durable, strongly consistent)

Stored in Spanner (or CockroachDB / TiDB if open-source-preferred). Why: catalog is low-QPS writes (~100/day), extremely high-QPS reads (every home page load, every title click), must be globally consistent (a title becoming unavailable in region X needs to propagate fast), and transactional (embargo-release is a cross-column flip — availability + ladder URLs + DRM key refs flip atomically). Reads fronted by per-region cache (Redis) with pub-sub invalidation.

TABLE Title (
  title_id          STRING PRIMARY KEY,         -- UUID
  version           INT64,                       -- content version; bumps = new ladder
  display_name_i18n JSON,                        -- {"en":"Squid Game","ja":"イカゲーム",...}
  runtime_ms        INT64,
  content_rating    JSON,                        -- per-region: {"US":"TV-MA","DE":"16"}
  geo_allowlist     ARRAY<STRING>,               -- ISO-3166-1 alpha-2
  valid_from        TIMESTAMP,
  valid_until       TIMESTAMP,                    -- for licensed content with expiry
  hdr_flags         BITMASK,                     -- HDR10, HDR10+, DolbyVision
  subtitles         ARRAY<STRING>,               -- ["en","ja",...]
  audio_langs       ARRAY<STRING>,
  mezzanine_master_uri   STRING,                 -- s3://.../master.mxf
  status            ENUM(DRAFT, TRANSCODING, READY, RETIRED),
  created_at        TIMESTAMP, updated_at TIMESTAMP
)

TABLE Asset (
  title_id          STRING,
  version           INT64,
  profile           STRING,                     -- e.g., "hevc_4k_hdr_10bit" | "avc_1080p" | "av1_1080p"
  manifest_uri      STRING,                      -- s3://... or /cdn/...
  rungs             JSON,                        -- [{rung:0,bitrate:450000,w:426,h:240,codec:"avc"},...]
  audio_tracks      JSON,
  subtitle_tracks   JSON,
  drm_key_refs      ARRAY<STRING>,               -- key IDs for CENC; actual keys in KMS/HSM
  vmaf_summary      JSON,                        -- per-rung avg/p1 VMAF
  packager_fingerprint STRING,                   -- for idempotency / re-package detection
  PRIMARY KEY (title_id, version, profile)
)

Read paths at page-load time issue 1–2 Spanner queries (fast; ~10ms) but are shielded by a per-region Redis that caches titles with 60s TTL + pub-sub invalidation on updates. Manifest-serving doesn't go back to Spanner at all — manifest is precomputed & stored in CDN/object store at publish time.

5.2 User data (watch history, entitlements, profile)

Cassandra wide-row for watch history. Why: append-heavy, key-by-user, time-ordered, no cross-user joins in the hot path.

TABLE watch_history (
  user_id      TEXT,      -- partition key
  title_id     TEXT,      -- clustering key
  last_position_ms BIGINT,
  last_updated TIMESTAMP,
  device_last  TEXT,
  PRIMARY KEY (user_id, title_id)
);
-- Cassandra can only ORDER BY clustering columns, and title_id must stay the
-- clustering key for per-title upserts, so "most recent first" is a read-time
-- sort: the partition holds at most tens of rows per user, so it's cheap.

Writes arrive via Kafka → Flink-aggregator (downsample 10s heartbeats to last-known-position per (user,title), upserted every 15s) → Cassandra. This decouples the raw firehose (~4M heartbeats/s) from the storage tier, which sees one batched upsert per active (user, title) per window. "Continue watching" is a 1-partition range scan, sub-5ms.
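The core of that aggregator is keep-latest-per-key, flush-per-window. A minimal in-memory sketch (`ProgressAggregator` is a hypothetical stand-in for the Flink job, not its API):

```python
class ProgressAggregator:
    """Downsampler sketch: retain only the newest position per (user, title);
    flush() emits one upsert per key per window (~15 s), then resets."""
    def __init__(self):
        self.latest = {}   # (user_id, title_id) -> (position_ms, ts)

    def on_heartbeat(self, user_id, title_id, position_ms, ts):
        key = (user_id, title_id)
        # keep the newest event; tolerate out-of-order heartbeats
        if key not in self.latest or ts >= self.latest[key][1]:
            self.latest[key] = (position_ms, ts)

    def flush(self):
        out = [(u, t, pos) for (u, t), (pos, _) in self.latest.items()]
        self.latest.clear()
        return out
```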

Entitlements (subscription tier, region, parental profile): Spanner — cross-consistent with billing.

5.3 Segments (object store + CDN)

S3-compatible object store (GCS / S3 / Azure Blob — we run multi-cloud origin for vendor leverage). Each segment is an immutable object:

Bucket: stream-origin-{region}
Key:    {title_hash}/{version}/{profile}/{rung}/{seg_no}.m4s

Attributes:
  content-type: video/mp4
  x-amz-storage-class: STANDARD (hot) | INTELLIGENT_TIERING (body) | GLACIER (cold)
  x-custom-vmaf:   per-segment VMAF score (for QC)
  x-custom-bitrate: actual encoded bitrate (for client-side VMAF/BW analytics)
  x-cache-version:  cdn cache-key component

Storage tiering via object-store lifecycle:

  • Hot (first 30 days after title release or any title in top-10% watch share last 7 days): STANDARD, 3× replicated in-region, cross-region replication to 2 additional regions.
  • Warm (30d–180d or middle-share): INTELLIGENT_TIERING — auto-demotes after 30d no access; single-region + 1 replica.
  • Cold (old catalog, watched <10×/month/region): GLACIER / equivalent deep archive, retrieval-latency 3–5 min. Re-fetched to hot on demand into the origin shield, not directly to edge. Saves 80% on storage cost for tail.
  • Master (mezzanine) cold: Glacier Deep Archive / tape. Checksummed, erasure-coded 10+4. Retrieved only for re-encode.
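The tiering rules above as a pure decision function — thresholds come from the bullets; expressing "top-10% watch share" as a ≥90th-percentile input is an assumption about how the signal would be fed in:

```python
def storage_tier(days_since_release, watch_share_pctile_7d,
                 monthly_region_views, is_master=False):
    """Map a title's age/popularity to an object-store class per the
    lifecycle bullets; a sketch, not a real lifecycle-policy document."""
    if is_master:
        return "DEEP_ARCHIVE"                                  # mezzanine: tape / Glacier DA
    if days_since_release <= 30 or watch_share_pctile_7d >= 90:
        return "STANDARD"                                      # hot: new or top-10% share
    if days_since_release <= 180:
        return "INTELLIGENT_TIERING"                           # warm: auto-demoting
    if monthly_region_views < 10:
        return "GLACIER"                                       # cold tail
    return "INTELLIGENT_TIERING"
```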

5.4 DRM state

  • Content keys live in an HSM-backed KMS, keyed by kid. Never leave the HSM in plaintext; wrapped under a per-license-request key.
  • License policy in a small, high-QPS DB (Redis + Spanner backing for audit): which entitlements can view which kid, what HDCP level required, offline flag, max duration.
  • Session table (24h TTL) in Redis cluster: session_id → (user_id, device_id, region, issued_at, max_concurrent).
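The session table's concurrent-stream policy check must be atomic (in production a Redis Lua script, so check-and-register can't race across API servers). A pure-Python model of the logic, with hypothetical names:

```python
class SessionRegistry:
    """In-memory model of the Redis session table plus the atomic
    max-concurrent-streams check; single-threaded here, so atomicity
    is implicit where Redis would need a Lua script."""
    def __init__(self, max_concurrent=4):
        self.max_concurrent = max_concurrent
        self.active = {}   # user_id -> set of session_ids

    def try_start(self, user_id, session_id) -> bool:
        sessions = self.active.setdefault(user_id, set())
        if len(sessions) >= self.max_concurrent:
            return False            # policy: reject the new stream
        sessions.add(session_id)
        return True

    def end(self, user_id, session_id):
        self.active.get(user_id, set()).discard(session_id)
```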

5.5 Recommendation serving

  • Offline candidate index (500M users × 500 candidates each): Faiss-style ANN index on user embedding × title embedding, updated every 6h. Sharded by user-id, replicated 3×.
  • Online features (last-N watched, time-of-day, device, current session context): feature store (Redis), millisecond reads.
  • Ranker model: deep neural net, served by TF-Serving / TorchServe, 50ms p99 per request (for ~500 candidates). §7d.

5.6 Analytics (QoE)

  • Kafka firehose (~4M events/s across QoE + heartbeat + engagement).
  • Real-time path: Flink → per-title rebuffer-ratio gauge → SLI dashboards, alerting.
  • Batch: hourly → warehouse (BigQuery/Snowflake) for ladder tuning, recommendation training.
  • Never on the playback decision path.

5.7 Engine choice — why each pick

Layer Choice Rejected Why
Catalog Spanner DynamoDB, Postgres, MongoDB Global consistency for embargoed drops; transactional multi-row writes on publish
Watch-history Cassandra Spanner, Redis persistent Write volume; single-partition reads; eventual consistency OK ("continue watching" can be ~15s stale)
Segments Object store + CDN Self-hosted file servers Bandwidth economics impossible without CDN; object store's PUT-once semantics align with immutable content model
Session cache Redis cluster Memcached Atomic scripts for concurrent-stream policy check; pub/sub for invalidation
DRM keys HSM-backed KMS Pure SW key store Compliance (studio contracts demand hardware) and theft-in-memory resistance
Analytics Kafka → Flink + warehouse Direct-to-DB Volume + downstream branching (real-time + batch + ML)
Rec candidates Faiss / ScaNN + feature store Cassandra for embeddings Vector search sublinear in candidate count

6 System Diagram (Centerpiece — two planes) #

6.1 Top-level: Ingest + Delivery + Control planes

╔══════════════════════════════════════════════════════════════════════════════════════════════════╗
║                                   CONTROL PLANE (global)                                          ║
║                                                                                                   ║
║  ┌──────────────┐   ┌───────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ║
║  │ Catalog Svc  │   │ Entitlement   │   │ DRM License  │   │ Reco Svc     │   │ Analytics/   │   ║
║  │ (Spanner)    │←→ │ Svc (Spanner) │←→ │ Svc (KMS/HSM)│   │ (Faiss+ANN   │   │ QoE Pipeline │   ║
║  │ titles, rungs│   │ user sub+geo  │   │ per-session  │   │ +DNN ranker) │   │ (Kafka/Flink)│   ║
║  └───┬──────────┘   └───────┬───────┘   └──────┬───────┘   └──────┬───────┘   └──────▲───────┘   ║
║      │ pub-sub                │                 │                  │                   │         ║
║      │ invalidate             │ per-session     │ session+policy   │ features+         │ QoE     ║
║      │                        │ checks          │                  │ candidates        │ events  ║
╚══════╪════════════════════════╪═════════════════╪══════════════════╪═══════════════════╪═════════╝
       │                        │                 │                  │                   │
       ▼                        ▼                 ▼                  ▼                   │
┌────────────────────────────────────────────────────────────────────────────────────┐   │
│   DELIVERY PLANE (multi-region)                                                     │   │
│                                                                                     │   │
│  User ──DNS/GSLB──▶ ISP-Embedded Appliance (Tier 0, 75% hit)                       │   │
│    │                    │ miss                                                      │   │
│    │                    ▼                                                           │   │
│    │                Regional CDN PoP (Tier 1, 20% hit)                              │   │
│    │                    │ miss                                                      │   │
│    │                    ▼                                                           │   │
│    │                Origin Shield (absorbs cache-miss herds; §7b)                   │   │
│    │                    │ miss                                                      │   │
│    │                    ▼                                                           │   │
│    │                Regional Origin (5% traffic)                                    │   │
│    │                    │                                                           │   │
│    │                    ▼                                                           │   │
│    │                Object Store (S3/GCS) — authoritative source                    │   │
│    │                                                                                 │   │
│    │  ┌─────────────── Manifest/Session/License/Heartbeat/Reco APIs ──────────┐    │   │
│    └─▶│    served by same PoP edge; manifest cached briefly; others dynamic    │────┼──▶│
│       └────────────────────────────────────────────────────────────────────────┘    │ analytics
└─────────────────────────────────────────────────────────────────────────────────────┘

        ▲  authoritative segment upload (cross-region replication)
        │
╔══════════════════════════════════════════════════════════════════════════════════════════╗
║                               INGEST PLANE (regional)                                     ║
║                                                                                           ║
║ Studio                                                                                    ║
║   │  mTLS, resumable multipart (100MB parts, SHA256 end-to-end)                           ║
║   ▼                                                                                       ║
║ ┌────────────┐ ┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌─────────────────┐  ║
║ │ Upload API │→│ Mezzanine    │→│ Transcode      │→│ QC           │→│ DRM Packager     │ ║
║ │ (chunked)  │ │ Object Store │ │ Farm           │ │ (VMAF+BSFormat│ │ (CENC CMAF;      │ ║
║ │            │ │ (EC 10+4)    │ │ (K8s+FFmpeg+   │ │  +audio+subs) │ │ Widevine/FairPlay│ ║
║ │ authn,     │ │ Glacier DA   │ │ AV1/HEVC/AVC;  │ │  per rung+per │ │ /PlayReady)      │ ║
║ │ resumable  │ │ for masters  │ │ GOP-chunked,   │ │  segment VMAF)│ │  key refs to HSM │ ║
║ │ put_part   │ │              │ │ DAG per title) │ │               │ │                  │ ║
║ └────────────┘ └──────────────┘ └────────┬───────┘ └──────┬────────┘ └────────┬─────────┘ ║
║                                          │                 │                   │           ║
║                                          ▼                 ▼                   ▼           ║
║                                    Encoded Segment Object Store (authoritative)             ║
║                                          │                                                  ║
║                                          ▼                                                  ║
║                                    Catalog Publish (transactional flip in Spanner)          ║
║                                          │                                                  ║
║                                          ▼                                                  ║
║                                    Origin Replication: cross-region + Tier 0 PUSH           ║
║                                    to ISP appliances per release schedule (§7b)             ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝

6.2 ABR playback sub-diagram (zoom)

Client                         Edge PoP                    DRM Svc         Control
  │                                │                          │                │
  │───session_start──────────────▶│                          │                │
  │                                │──verify entitlement─────┼───────────────▶│
  │                                │◀─ok, session_token──────┼────────────────│
  │◀─session_token─────────────────│                          │                │
  │                                │                          │                │
  │───GET /manifest/{title}──────▶│ (edge cache, 60s)         │                │
  │◀─manifest (DASH/HLS, per-     │                          │                │
  │    session signed URLs)       │                          │                │
  │                                │                          │                │
  │── POST /drm/license ─────────▶│──────────────────────────▶│                │
  │                                │                          │──HSM wrap──────│
  │◀─license (wrapped content key)│◀─license─────────────────│                │
  │                                │                          │                │
  │── GET /s/{title}/{profile}/   │ (Tier0→Tier1→Shield→Ori) │                │
  │        /rung0/seg0.m4s ─────▶│                          │                │
  │◀─segment (4 MB, 2s content)───│                          │                │
  │                                │                          │                │
  │──(ABR decision: throughput   │                          │                │
  │    + buffer → choose rung)    │                          │                │
  │                                │                          │                │
  │── GET /s/.../rung4/seg1 ────▶│                          │                │
  │◀─segment──────────────────────│                          │                │
  │          ...                  │                          │                │
  │                                │                          │                │
  │── POST /heartbeat (10s) ────▶│───async Kafka──────────────────────────────▶
  │                                │                          │                │

6.3 Flash-event pre-warm sub-diagram (§7b, preview)

T−24h (before drop)              Origin Regions ──── PUSH ────▶ Tier0 Appliances (ISP)
                                  (10k appliances × 140GB title × 3 codecs = ~4 PB)
                                  → over managed-overnight peering windows
                                  → idempotent, signed, verified by content-hash

T−1h                              Flip catalog status in Spanner: READY
                                  Pub-sub to edge PoPs: pre-load manifest

T=0 (global drop)                 Tier-0 already has the title: 75% traffic served
                                  with zero cold-miss. Tier-1 warms in first ~2 min
                                  organically.
                                  Rebuffer ratio curve: smooth, no spike.

Every labelled arrow maps to §4 or §5:

| Arrow | API / Data | Reference |
|---|---|---|
| Client → Edge: session_start | §4.2 POST /v1/session/start | Validates entitlements against §5.2 |
| Client → Edge: /manifest | §4.2 GET /v1/manifest/... | Signed manifest from §5.3 |
| Client → DRM Svc: license | §4.2 POST /v1/drm/license | Key wrap from §5.4 HSM |
| Client → Edge: /s/... segment | §4.3 segment URL | Immutable object in §5.3 |
| Client → Edge: heartbeat | §4.2 POST /v1/heartbeat | Ingests to §5.6 Kafka |
| Studio → Ingest: upload | §4.1 multipart | Lands in §5.3 mezzanine bucket |
| Transcode → Segment store | internal | §5.3 encoded segment bucket |
| Catalog publish | §4.1 PublishTitle | Flips status → READY in §5.1 |
| Origin → Tier0 PUSH | internal sync loop | §7b |
| Heartbeat → Kafka | async | §5.6 analytics pipeline |

7 Deep-Dive: Four Critical Topics #

7a. CDN strategy: why we build (Open Connect-style) custom appliances, with rejected alternatives and bandwidth math

Why critical. Egress dominates our infra spend (>70%, §3h). Every 1% shift from commercial-CDN to ISP-peered appliance saves ~$25M/yr at our scale. The CDN decision is the defining architectural choice for a video streamer, far more than storage or transcode. Candidates who say "use a CDN" without decomposing lose the L7 point here.

The decision space:

| Option | Who owns | $/GB egress | Placement | Pre-warm cost | Operational burden | Unit economics at 500 Tbps |
|---|---|---|---|---|---|---|
| A. Pure 3rd-party CDN (CloudFront/Akamai/Fastly) | Them | $0.02–0.08/GB retail, $0.005–0.015 on huge contracts | Their PoPs | They pay their origin | Minimal | ~$8–25M/day unmitigated; $2–4M/day w/ contracts. Still $700M–$1.4B/yr. |
| B. Pull-CDN + origin shield | Them + us | Same as A, less shield-level re-fetch | Their PoPs | Small (their cache fills on first miss) | Low | Saves 5–10% vs A by reducing shield-miss traffic. Still billions. |
| C. Multi-CDN (A+B+C with steering) | Them × N | Weighted; renegotiation leverage ~−20% | Their PoPs | Double | Medium — need a steering layer (e.g., Cedexis/NS1) | Saves another ~15% via competitive pricing. $600M–$1.2B/yr. |
| D. Custom appliances at ISP (Netflix Open Connect / YouTube Google Global Cache) | Us, colocated at ISPs | Effectively peering-cost + capex ≈ $0.001–0.003/GB amortized | Inside ISP networks, one hop from eyeball | Huge — must PUSH title 24h pre-drop | High — hardware fleet of 10k+ appliances | Fleet operations $200–400M/yr all-in (capex amortized + peering + staff). 60–75% savings vs 3rd-party. |
| E. Custom appliances at our own IXPs only (no ISP embedding) | Us | ~$0.005/GB | Regional | Medium | Medium | ~$500M/yr. Middle ground. |

Chosen: Hybrid — D (for head 75%) + B (Tier 1 pull-CDN commercial) + C fallback (multi-CDN) for spillover.

Why hybrid and not pure D: the tail catalog (roughly the bottom 90% of titles by watch-share) is only ~25% of bytes; pushing every title to every appliance would roughly double the appliance storage footprint to serve those few bytes — worse unit economics than renting pull-CDN capacity for the tail. The break-even is a Pareto threshold: if the top 10% of titles are ~75% of watch time (typical), push those + hot new releases; pull the rest.

The push schedule — why 24h pre-drop is non-obvious:

Naive argument: "CDN caches warm on demand, no pre-warm needed." That's wrong at our scale because:

  • 100M viewers start within ~5 minutes of a global drop.
  • Miss-rate at cold edge is ~100%. First-miss flows to shield → origin.
  • Even if Tier-1 CDN has 500 PoPs, each miss on a new title triggers ~500 origin fetches (one per PoP) before the PoP's local cache fills.
  • With 3-codec × 15-rung × 2,700 segments per title, a cold start is ~500 PoPs × 15 rungs × maybe 10 segments in-flight each = 75,000 origin-bound requests in the first minute, per title — and we may have 5 titles dropping a week.
  • That load on origin is tolerable; the bigger pain is the cache-miss latency spike to viewers, breaking the p99 <2s startup SLO.
  • We'd need origin bandwidth = ~5% of peak = 25 Tbps just to survive the first 5 min, built for usage we only hit at drops. Wasteful.
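The cold-start numbers in these bullets can be re-derived mechanically. A minimal sketch, using only the section's own assumptions (500 PoPs, 15 rungs, ~10 in-flight segments, 5% miss fraction of a 500 Tbps peak):

```python
# Back-of-envelope check for the cold-start bullets above.
# All inputs are this section's assumptions, not measured values.

def cold_start_origin_requests(pops: int, rungs: int, inflight_segments: int) -> int:
    """Origin-bound requests in the first minute of a cold global drop (per title)."""
    return pops * rungs * inflight_segments

def origin_bw_needed_tbps(peak_tbps: float, miss_fraction: float) -> float:
    """Origin bandwidth needed to absorb the cold-miss fraction of peak."""
    return peak_tbps * miss_fraction

if __name__ == "__main__":
    print(cold_start_origin_requests(500, 15, 10))  # 75,000 requests per title
    print(origin_bw_needed_tbps(500, 0.05))         # 25 Tbps of origin headroom
```

Both outputs match the figures quoted above: 75,000 origin-bound requests in the first minute, and 25 Tbps of origin capacity built for a load we'd only see at drops.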

Pre-push insight (earned-secret depth): "Netflix pushes content to ISP-embedded Open Connect Appliances during the ISP's off-peak window (2am–6am local) in the 24h before a major release, over managed peering. This moves the bottleneck from CDN-bandwidth-under-flash-load to ingest-side replication scheduling. The second-order effect: our transcode deadline backs up by 24h, which forces us to demand mezzanine masters from studios 72h pre-release instead of 48h — a business-logistics change driven by cache architecture." Pure pull-CDN can approximate this by synthetic-traffic pre-warming (send fake fetches from every PoP) + origin shielding, but (a) you pay commercial CDN BW for the synthetic traffic (often >$100k for one title's pre-warm), (b) you still can't pre-warm ISP-level appliances you don't own, and (c) you measure the pre-warm effectiveness only by watching whether your rebuffer ratio spikes at T=0 — a lagging metric with a reputational cost.

Push architecture (the hard parts):

  1. Schedule: per title, compute placement plan: which appliances get which rungs. Not every rung to every appliance — Tier 0 appliances get only the top ~5 rungs (HEVC/AV1 at 1080p+4K), since ISP subs skew high-bandwidth; AVC low rungs remain at Tier 1.
  2. Deduplication: content is chunked at 4MB granularity with content-addressable hashes; re-pushing a title with updated metadata reuses unchanged chunks. ~90% reuse across re-encodes.
  3. Flow control: push rate-limited per-ISP to their capacity. Each appliance has a 1–4 TB SSD cache tier in front of bulk spinning disk that holds the pushed catalog; we prioritize push by "predicted watch share in the next 7 days" as scored by a dedicated ML model, evicting lowest-share content first.
  4. Verification: appliance signs a per-chunk receipt back to origin; per-appliance readiness monitor fails a drop (rolls back to T+24h) if <95% of appliances are ready at T−1h.
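The content-addressable dedup in step 2 is simple to sketch. A minimal illustration (chunk size and hash choice are from the text; the function names and manifest shape are mine):

```python
# Sketch of step 2: content-addressable chunking, so a re-push of a
# re-encoded title only ships the chunks the appliance doesn't hold.
import hashlib

CHUNK = 4 * 1024 * 1024  # 4 MB chunk granularity, per the text

def chunk_hashes(blob: bytes) -> list[str]:
    """Split a title's bytes into 4 MB chunks and hash each one."""
    return [hashlib.sha256(blob[i:i + CHUNK]).hexdigest()
            for i in range(0, len(blob), CHUNK)]

def push_delta(new_blob: bytes, appliance_has: set[str]) -> list[str]:
    """Only chunks the appliance doesn't already hold get pushed."""
    return [h for h in chunk_hashes(new_blob) if h not in appliance_has]
```

With ~90% chunk reuse across re-encodes (the figure above), `push_delta` shrinks a metadata-only re-push to a small fraction of the original transfer.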

Real systems named: Netflix Open Connect (OCA hardware, FreeBSD-based), YouTube Google Global Cache (GGC), Disney's BAMTech (pull-CDN heavy with aggressive shielding), Akamai's managed CDN for many mid-tier streamers, Fastly's edge compute for manifest customization.

Failure modes:

  • OCA at ISP crashes / disk fails: DNS/GSLB steers viewers to the next-nearest (regional Tier 1). Detected in <30s via PoP-level health-check. Blast radius = that ISP until repair (replaced within 72h via logistics). Mitigation: over-provision each metro with ≥2 appliances so one failure = graceful degradation to Tier 1.
  • Push window missed pre-drop: monitored at T−6h; if <90% readiness, page the release team. Options: (a) delay the drop in that region (contractually permitted for some titles, not others), (b) push from Tier-1 PoPs in-line (drops 75% of hits down to 20%, raises egress cost and risks rebuffer). Big red switch: "activate Tier 1 fallback posture" — doubles expected cost for 24h, absorbs the flash.
  • Bad encode pushed: content version is included in the object key; roll forward by publishing a new version and invalidating the old (pull-CDN model), or (for pushed titles) push v2 → flip manifest → leave v1 for in-flight sessions (bounded TTL).

Bandwidth sanity check (closes the loop with §3b):

  • 500 Tbps peak × 0.75 Tier 0 = 375 Tbps / 10,000 appliances = 37.5 Gbps/appliance peak. Each appliance has 2×100G NICs — comfortable headroom.
  • 375 Tbps ÷ 8 × 3600 s ≈ 169 PB/hour egress at peak, out of ISP-peered appliances at ~$0.001/GB. At peak hour = ~$169k; day avg = $1.5–2M/day. Annualized: ~$700M for Tier 0 alone, consistent with the ~$2.5B/yr total egress number in §3h (other tiers, DRM/control, origin, manifest, recs round out the rest).
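The same sanity check as executable arithmetic, with the section's assumed inputs (500 Tbps peak, 75% Tier-0 share, 10k appliances, ~$0.001/GB):

```python
# Re-derivation of the Tier-0 sanity numbers. Inputs are this section's
# assumptions, not real-world figures.

def per_appliance_gbps(peak_tbps: float, tier0_share: float, appliances: int) -> float:
    """Peak egress each appliance must sustain, in Gbps."""
    return peak_tbps * tier0_share * 1000 / appliances

def peak_hour_cost_usd(peak_tbps: float, tier0_share: float, usd_per_gb: float) -> float:
    """Egress cost of one peak hour at the Tier-0 $/GB rate."""
    tb_per_hour = peak_tbps * tier0_share / 8 * 3600  # Tbps -> TB/hour
    return tb_per_hour * 1000 * usd_per_gb            # TB -> GB -> $

if __name__ == "__main__":
    print(per_appliance_gbps(500, 0.75, 10_000))   # 37.5 Gbps per appliance
    print(peak_hour_cost_usd(500, 0.75, 0.001))    # ~$169k per peak hour
```

37.5 Gbps against 2×100G NICs confirms the "comfortable headroom" claim.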

7b. Transcoding pipeline at scale: per-title encoding, GOP-parallel, per-rung VMAF targeting

Why critical. Transcode is where encoding quality × bandwidth cost × startup latency collide. A 20% bitrate reduction at equal VMAF saves $500M/yr in egress at our scale. It is worth burning 10× encode compute to find it.

The naive approach is a fixed bitrate ladder (e.g., {240p@400k, 360p@700k, 480p@1.2M, 720p@2.5M, 1080p@5M, 4K@15M}). Apply it to every title. Simple, fast (~1× realtime).

What's wrong with that:

  • A talking-heads documentary needs 40% less bitrate than an action movie to hit the same perceptual quality.
  • A dimly-lit horror film benefits from HDR + 10-bit much more than a cartoon, and its perceptually-optimal 1080p rung is at ~3 Mbps, not 5 Mbps.
  • Serving the naive ladder wastes ~25% of bitrate on average.

Per-Title Encoding (PTE), chosen approach:

Step 1: Convex-hull analysis. Encode the title at ~30 (resolution, QP) points — a grid of candidates. For each point, compute VMAF and bitrate. Plot (bitrate, VMAF); the upper-left frontier — the highest VMAF achievable at each bitrate — is the convex hull of efficient rungs.

Step 2: Select rungs from the hull. Target specific VMAF buckets (e.g., 65, 75, 85, 93, 97) and pick the minimum-bitrate point achieving each. Per-title ladder has 5–8 rungs, non-uniform per title.

Step 3: Encode the chosen rungs fully, with the quality preset tuned per codec.

Cost: ~5× more CPU-hours than naive (the pre-analysis dominates), buying ~20–25% bitrate savings on average. At 500 Tbps × $0.003/GB effective egress, 20% savings = $500M/yr. Compute cost to achieve it: 5× × $30/hr × 50 hrs of new content/day × 365 ≈ $3M/yr. ROI is ~150×.
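Steps 1–2 reduce to a Pareto-frontier prune plus a per-target minimum-bitrate pick. A minimal sketch (the candidate grid values below are illustrative, not real encoder output):

```python
# Sketch of PTE rung selection: prune the candidate (bitrate_kbps, vmaf)
# grid to its efficient frontier, then pick the cheapest point per VMAF
# bucket. Illustrative only -- a real optimizer works per-shot/per-scene.

def efficient_frontier(points):
    """Keep points not dominated by a cheaper-or-equal, better-or-equal candidate."""
    return sorted(
        p for p in points
        if not any(q != p and q[0] <= p[0] and q[1] >= p[1] for q in points)
    )

def pick_ladder(points, vmaf_targets):
    """For each VMAF target, the minimum-bitrate frontier point achieving it."""
    hull = efficient_frontier(points)
    ladder = []
    for target in vmaf_targets:
        hits = [p for p in hull if p[1] >= target]
        if hits:
            ladder.append(min(hits))  # tuples sort by bitrate first
    return ladder

if __name__ == "__main__":
    grid = [(400, 60), (700, 70), (1200, 78), (2500, 88),
            (2600, 88), (3000, 85), (5000, 93)]
    print(pick_ladder(grid, [65, 75, 85, 93]))
```

Note that `pick_ladder` naturally yields a non-uniform, per-title ladder: a talking-heads title's grid simply hits each VMAF bucket at lower bitrates.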

GOP-parallel encoding (implementation):

  • Split each master into 2–4 s closed-GOP chunks (split at I-frames; the encoder is configured to force IDR at chunk boundary).
  • Dispatch chunks as independent jobs to a K8s batch fleet (or AWS MediaConvert-like managed). Each chunk encoder is FFmpeg with preset tuned.
  • Reassemble at the packager stage (CMAF fragments already have natural segment boundaries, often aligned with the GOP).
  • Why: serial encode of a 90-min film at AV1 quality is ~60× realtime = 90 hours. GOP-parallel with 1000 chunks → <10 min wall-clock for the same film.
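The chunk dispatch above can be sketched as a job generator. This is an illustration only — paths, the idempotency-key format, and the exact FFmpeg invocation (`-ss`/`-t` trim, `-force_key_frames` to pin an IDR at the chunk boundary) are assumptions a real encoder wrapper would elaborate:

```python
# Sketch of GOP-parallel dispatch: one idempotent FFmpeg job per 4 s
# closed-GOP chunk. Commands are built, not executed, here.

def chunk_jobs(master: str, duration_s: float, chunk_s: float = 4.0):
    jobs = []
    n, t = 0, 0.0
    while t < duration_s:
        jobs.append({
            "chunk_id": f"{master}#{n:05d}",  # idempotency key for safe retries
            "cmd": [
                "ffmpeg", "-ss", str(t), "-t", str(chunk_s), "-i", master,
                "-force_key_frames", "0",      # IDR at the chunk boundary
                "-c:v", "libaom-av1", f"/out/{n:05d}.mp4",
            ],
        })
        n += 1
        t += chunk_s
    return jobs
```

Because `chunk_id` is deterministic, a crashed chunk re-runs on another node and overwrites the same object — the idempotent-writes property the failure-modes list below relies on.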

Per-segment VMAF as QC gate:

  • Every encoded segment has its VMAF measured against the reference master-segment.
  • If any segment's VMAF falls more than 3 points below the rung's target, the encode fails QC and the job re-runs with a higher bitrate budget. Catches encoder corner cases (high-motion scene collapse, grain preservation).
  • Measured VMAF is stored as object-store metadata (§5.3) for observability and client-analytics pairing.

DRM packaging (single-encode-many-DRM, CENC CMAF):

  • Encode once per codec/rung; package into CMAF fragmented-MP4 with CENC (Common Encryption) AES-128-CTR — the same encrypted bytes are decoded by all three DRM systems (Widevine via CDM, FairPlay via StreamingKeyDelivery, PlayReady via PRO header).
  • Saves ~3× storage and serving cost vs. per-DRM encoding.
  • Real systems: every modern streaming service uses this; bento4 + shaka-packager are the reference tools.

Codec choice math:

| Codec | BW saving vs AVC | Encode cost vs AVC | Device coverage (our MAU) | Verdict |
|---|---|---|---|---|
| H.264/AVC | baseline | baseline | ~100% | Must have |
| HEVC/H.265 | −30–40% BW | 5× CPU | ~90% (iOS, newer Android, most TVs) | Chosen |
| AV1 | −30% vs HEVC, −50% vs AVC | 20–40× CPU | ~40% (Chrome, newer Android, 2023+ TVs) | Chosen for top rungs where savings compound (e.g., 4K) |
| VP9 | ≈HEVC | 10× CPU | ~60% (no iOS) | Skipped — AV1 eclipses it going forward |

Chosen: triple-codec (AVC + HEVC + AV1) with client-side capability negotiation in manifest. Top rungs (4K, 1080p) get AV1 for devices that support it; AVC always included as baseline. Per the math above, AV1 pays back its encode cost in ~1 week of serving at our scale.

Real systems: Netflix's Dynamic Optimizer (PTE grandfather); YouTube's similar Videogen; FFmpeg/libaom/x265 as encoders; AWS MediaConvert / Elemental for managed alt; Bitmovin for 3rd-party.

Failure modes:

  • Encoder crash mid-job: K8s restarts the chunk; idempotent chunk IDs ensure no duplicate writes. Wall-clock slips by minutes.
  • Bad mezzanine (corrupt, wrong color space, silent audio): QC catches, encode marked FAILED, studio notified, release blocked. Pre-ingest validator detects most of these in the first ~10s.
  • Ladder misconfig (bitrate too low): per-segment VMAF gate fails the encode. Regression test on previously-encoded titles catches per-title optimizer regressions.

7c. DRM license serving at 300k licenses/s under flash load

Why critical. A DRM outage == no playback. Unlike most services where failure degrades, a DRM fault is absolute: client cannot decrypt, black screen. It is also the component least visible to casual candidates — "add DRM" hand-waves over its hardest parts.

The load shape:

  • Steady-state 37k licenses/s (§3f).
  • Flash: major release globally, 100M viewers starting within 5 min. Peak license request rate ≈ 300k/s for 5 min.
  • HSM ops are serial and ~1ms per op; one HSM = ~1000 ops/s. Need 300 HSMs to absorb burst — expensive, ~$20k each amortized.

The hot-path breakdown per license:

  1. Parse client challenge (from CDM). Extract requested kids, device cert.
  2. Validate entitlement (subscription, region, parental, concurrent-streams) — Spanner or cached in Redis.
  3. Derive policy (HDCP level, playback duration, offline flag).
  4. Wrap content key under device session key (HSM op).
  5. Sign response.

Optimization: move steps 1–3 and 5 to app servers; only step 4 on HSMs. That's obvious. The earned-secret optimization:

Proxy re-encryption / session-key intermediate (chosen):

  • HSMs pre-wrap content keys under regional session master keys at ingest time (not per-license). Regional session master key rotates daily.
  • At license time, app server wraps from regional-session-master-key to device-session-key in software (not HSM). Standard AES-KW is ~100 ns in software.
  • HSMs are used only for (a) rotating the regional-session-master-key daily, and (b) attesting the rotation via signed chain.
  • Security argument: an app-server compromise leaks at most one day's worth of keys, and only for content the compromised server served. A full HSM extraction still requires HSM breach.
  • Throughput: app servers do millions of ops/s; HSMs never saturate.

Trade-off: a small reduction in the HSM-derived security guarantee (content keys exit the HSM under the regional-session wrap). For our threat model (professional pirate, not state actor) this is acceptable, and it is what high-scale streamers actually do; Widevine L3's server-side model makes a comparable concession.
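The two-level key flow can be sketched structurally. Important hedge: the `wrap` below is a toy XOR-with-keystream placeholder standing in for AES-KW and the HSM operation — it shows only who wraps what under which key, not real cryptography:

```python
# Structural sketch of the two-level wrap (toy crypto, key FLOW only).
import hashlib, os

def _keystream(key: bytes, n: int) -> bytes:
    return hashlib.shake_256(key).digest(n)

def wrap(kek: bytes, key: bytes) -> bytes:   # stand-in for AES-KW / HSM op
    return bytes(a ^ b for a, b in zip(_keystream(kek, len(key)), key))

unwrap = wrap  # an XOR stream cipher is its own inverse

# Ingest time (HSM side): the content key leaves the HSM only under the
# daily regional session master key.
regional_master = os.urandom(32)
content_key = os.urandom(16)
stored_blob = wrap(regional_master, content_key)

# License time (app-server side, no HSM op): re-wrap from the regional
# master to the per-device session key, entirely in software.
device_session_key = os.urandom(32)
license_blob = wrap(device_session_key, unwrap(regional_master, stored_blob))

assert unwrap(device_session_key, license_blob) == content_key
```

The point of the structure: the per-license hot path touches only `regional_master` (cached on app servers for 24h), so HSM throughput never bounds license QPS.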

Concurrent-stream enforcement:

  • Every license issuance creates a row in a Redis "active sessions" set per user (TTL = 60 min, refreshed by heartbeat).
  • Cardinality check before issuing: SCARD active_sessions:{user} < max_concurrent_streams. Atomic Lua to prevent race.
  • If over limit, kick oldest session (server-issues a "stop" to that device via push channel).
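The atomic cardinality check above, sketched two ways: the Lua we would EVAL in Redis (shape is mine; `SCARD`/`SADD`/`EXPIRE` are standard commands) plus a pure-Python reference of its semantics:

```python
# Concurrent-stream admission: Lua for Redis-side atomicity, and a
# dict-based reference implementation of the same semantics.

CHECK_AND_ADD = """
local n = redis.call('SCARD', KEYS[1])
if n >= tonumber(ARGV[1]) then return 0 end
redis.call('SADD', KEYS[1], ARGV[2])
redis.call('EXPIRE', KEYS[1], 3600)
return 1
"""

def check_and_add(sessions: dict, user: str, device: str, limit: int) -> bool:
    """Reference semantics: admit the session only if under the cap."""
    active = sessions.setdefault(user, set())
    if len(active) >= limit:
        return False  # caller then kicks the oldest session via push channel
    active.add(device)
    return True
```

Because the whole script runs atomically inside Redis, two devices racing for the last slot cannot both be admitted — the race the text calls out.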

Offline license:

  • Different policy: offline=true, duration=48h, key_lifetime=30d, policy embedded in license itself so client enforcement is self-contained.
  • Stored in a separate table for revocation lineage; license servers can issue revocation on subscription cancel.

Failure modes:

  • HSM outage: regional session master keys cached server-side for 24h (refreshed daily); software path keeps serving. Degraded mode: cannot rotate keys; 24h SLO on restoration.
  • Cross-region license server outage: GSLB fails over to nearest healthy region. Add ~50ms latency. SLO met.
  • License theft (stolen subscription): revocation list; license servers refuse to issue for revoked device. Existing licenses expire in 1–60 min depending on policy.
  • Big red switch: "serve cached license for last 24h's titles to any active session" — degrades to no enforcement for ongoing sessions (studio contracts permit this for ≤1h as emergency).

Real systems named: Google Widevine, Apple FairPlay Streaming, Microsoft PlayReady, Amazon's BuyDRM integrations; HSMs from Thales/Luna, AWS CloudHSM, Google Cloud HSM; reference packagers: bento4, shaka-packager.

7d. ABR + rebuffer-ratio optimization: chosen rung-per-segment math

Why critical. Rebuffer is the #1 signal for viewer abandonment (every 1% rebuffer ratio ≈ 2% reduction in minutes viewed industry-wide). ABR is entirely client-side — our server role is to make the client's job easy: smart ladder, accurate manifests, low-variance segment sizes, CDN reliability. The interviewer probe specifically calls out p99 <2s startup and <0.5% rebuffer.

The three ABR algorithm families:

| Algorithm | Decision input | Rebuffer resilience | Bitrate efficiency | Complexity |
|---|---|---|---|---|
| Throughput-only (classic rate-based) | EMA of segment throughput | Poor on oscillating network | Ladder-greedy | Simple |
| Buffer-based (BBA, BOLA) | Buffer occupancy | Excellent (self-stabilizing on buffer) | Slightly under-utilizes BW when buffer deep | Medium |
| Hybrid MPC (BOLA-E, RobustMPC) | Buffer + throughput + forecast | Best empirically; ~20% less rebuffer than BBA | Best empirically | Higher |
| ML-driven (Pensieve, Puffer) | Multi-feature NN | Promising, uneven in field | Variable | Highest; model retraining pipeline |

Chosen: RobustMPC / BOLA-E hybrid, client-side. Server role: ensure the manifest advertises a ladder that lets the algorithm make good decisions.

Server-side optimizations that impact rebuffer:

  1. Per-title ladder (§7b) — clients always have a good rung to step down to.
  2. Variance-controlled encode — we use capped-CRF (not strict CBR) so a segment's encoded size is within ±15% of its ladder-nominal bitrate. Predictable download times ⇒ predictable buffer state ⇒ fewer ABR oscillations.
  3. Segment size = 2–4 s — shorter = lower startup latency (first I-frame sooner) and faster rung switches; longer = better compression. Chosen: 4s for standard, 2s for live/LL-HLS. Per-segment VMAF gate ensures 2s segments don't collapse.
  4. Initial rung selection for startup:
    • Client opens playback; first segment fetched must be small enough for p99 startup <2s.
    • At p99, a weak connection is ~1.5 Mbps. A 4 s segment at 720p@2.5 Mbps is 1.25 MB → 6.7 s to download. Miss.
    • Solution: the client fetches a short initialization-segment + the first video-segment at the lowest-intelligent-rung (typically 360p/500k = ~250 KB, ~1.3 s on 1.5 Mbps), then ramps up within ~10 s.
    • Manifest advertises #EXT-X-START (HLS) / Period@start (DASH) hints to steer.
  5. Prefetch the first segment on manifest fetch. Use HTTP/2 Server Push from the manifest endpoint: pushes the first init-segment before the client requests it. Cuts ~50–150 ms off startup. Real deployment: most CDNs (Akamai, Fastly) support this; CloudFront does not — we accept the skew.
  6. DRM license parallelism. Critical: client must not serialize fetch-manifest → fetch-license → fetch-first-segment. Modern players (ExoPlayer, Shaka, AVPlayer) fetch manifest + init-segment + license concurrently once manifest URL is known; the license is consumed when first content-segment arrives. This reduces TTFB from sum-of-three to max-of-three. We ensure our manifest contains license URL in an early-parseable position.
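A toy rung chooser in the spirit of the hybrid (buffer + throughput) decision these items describe — the safety factor and buffer threshold are illustrative, not the published BOLA-E/RobustMPC formulations:

```python
# Toy hybrid ABR rung choice: throughput sets the budget, a thin buffer
# scales it down, and the lowest rung is always available as a floor.

def choose_rung(ladder_kbps, est_tput_kbps, buffer_s,
                safety=0.8, low_buffer_s=5.0):
    budget = est_tput_kbps * safety
    if buffer_s < low_buffer_s:          # thin buffer: be conservative
        budget *= buffer_s / low_buffer_s
    fits = [r for r in ladder_kbps if r <= budget]
    return max(fits) if fits else min(ladder_kbps)  # floor: lowest rung

if __name__ == "__main__":
    ladder = [500, 1200, 2500, 5000]
    print(choose_rung(ladder, 4000, 20.0))  # deep buffer: highest fitting rung
    print(choose_rung(ladder, 4000, 2.5))   # thin buffer: steps down
```

This is exactly why items 1–2 matter server-side: a per-title ladder with low-variance segments makes `budget` a trustworthy predictor of download time.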

Rebuffer budget math:

  • Target rebuffer ratio <0.5%.
  • Break down sources: (a) network oscillation unreached by ABR, (b) CDN miss storms, (c) DNS/TLS RTT spikes, (d) encoder stutters (pathological segment size).
  • Allocate budget: (a) <0.2%, (b) <0.1%, (c) <0.1%, (d) <0.1%.
  • For (b), this translates directly to a CDN miss-rate tolerance: at 99.5% hit rate and ~100ms miss penalty, rebuffer contribution = 0.005 × 100ms / 4000ms segment = 0.0125% of play time. Fine. We set a hit-rate alarm at 99% (2× our budget) as early warning.
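The budget-(b) arithmetic is checkable in one line (inputs are the section's assumptions: 99.5% hit rate, ~100 ms miss penalty, 4 s segments):

```python
# Rebuffer contribution of CDN misses, per the (b) budget check above.

def miss_rebuffer_fraction(hit_rate: float, miss_penalty_ms: float,
                           segment_ms: float) -> float:
    """Fraction of play time stalled by cache-miss penalties."""
    return (1 - hit_rate) * miss_penalty_ms / segment_ms

if __name__ == "__main__":
    # 0.005 * 100 / 4000 = 0.000125 -> 0.0125% of play time
    print(miss_rebuffer_fraction(0.995, 100, 4000))
```

0.0125% sits well inside the 0.1% sub-budget, which is why the alarm threshold (99% hit rate, 2× the budgeted miss rate) is an early warning rather than an SLO breach.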

The "first 2 seconds" optimization chain (earned-secret):

Startup p99 <2s is the hardest SLO. Breakdown of a cold play on LTE:

  • DNS 50ms (DoH cached) → TCP+TLS 150ms (QUIC skips some of this) → session_start call 100ms → manifest fetch 80ms (edge-cached) → parse 20ms → license fetch (parallel with init-segment) 150ms → first video-segment fetch @ 500 kbps = 250ms → decoder init 100ms → first frame.
  • Sum (serial): 900ms. With parallelism (license || init-segment), the critical path is ~800ms. With QUIC to shave TLS: ~700ms.
  • p99 is worst-case: add 2× network variance, 500ms DNS fallback, TLS resumption failure → ~2000ms. We're dancing on the line. Every 50ms matters.
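The serial-vs-parallel comparison above can be made explicit. A sketch with the section's own LTE stage estimates; the overlap model (license fully concurrent with the first-segment fetch) is a simplification, so it lands near but not exactly on the ~800 ms figure:

```python
# Serial sum vs parallel critical path for the cold-play breakdown above.
# Stage times are the section's LTE estimates, in ms.

STAGES = {"dns": 50, "tcp_tls": 150, "session_start": 100,
          "manifest": 80, "parse": 20, "license": 150,
          "first_segment": 250, "decoder_init": 100}

def serial_ms(stages: dict) -> int:
    return sum(stages.values())

def parallel_ms(stages: dict) -> int:
    # License fetch overlaps the first-segment fetch once the manifest is
    # parsed, so only the longer of the two sits on the critical path.
    pre = (stages["dns"] + stages["tcp_tls"] + stages["session_start"]
           + stages["manifest"] + stages["parse"])
    return pre + max(stages["license"], stages["first_segment"]) + stages["decoder_init"]

if __name__ == "__main__":
    print(serial_ms(STAGES))    # 900 ms serial
    print(parallel_ms(STAGES))  # ~750 ms with license || first-segment
```

It also makes the QUIC argument concrete: shaving one RTT out of `tcp_tls` moves the whole critical path, since that stage is never parallelizable.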

Per-second optimizations we enforce:

  • TLS session resumption across requests within a play session.
  • HTTP/3 (QUIC) for handset clients — saves one RTT for handshake.
  • Manifest caching at edge with 60s TTL; invalidated on version bump.
  • Fail-fast on segment errors: 5xx triggers an immediate retry to next-rung-down rather than exponential backoff, which blows the startup budget.

Failure modes:

  • ABR algorithm bug ships to clients: rolling update gated by per-fleet QoE regression. If rebuffer ratio creeps up post-release in a cohort, auto-rollback.
  • Encoder emits pathological segment sizes (e.g., high-motion scene at a low rung grossly over-sized): QC catches; per-segment size audit at ingest.
  • CDN route flap: client falls back to next-best rung; sticky CDN selection per session to prevent flapping.

Real systems named: Netflix's Chunked Dash Optimizer, Twitch's low-latency HLS, MPEG-DASH with LL extension, Apple LL-HLS, BOLA/RobustMPC (CMU & MIT papers), Pensieve, Shaka Player, ExoPlayer.


8 Failure Modes & Resilience (pager-carryable) #

| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Tier-0 appliance at one ISP | HW fail, disk fail | Appliance healthcheck 30s; PoP-level egress drop | ~5% of that ISP's viewers (if 2-appliance metro); 0% of others | DNS/GSLB auto-shifts that ISP's traffic to Tier 1 PoP; rebuffer ticks up <0.1% for duration | Replace hardware within 72h via logistics; re-sync via push |
| Whole Tier-1 PoP | Outage (datacenter / carrier fault) | Per-PoP RED metrics; external synthetic (Catchpoint) | 2–5% of regional egress | GSLB fails to nearest Tier 1 (adds ~10ms latency); Tier 0 continues serving | Restore PoP or reroute via peering |
| Origin region | Cloud region outage | Cross-region health; origin 5xx | 5% of traffic that was shield-missing to that region | Traffic shifts to next origin (cross-region replicated); shield-miss adds latency to first requests only | Failover complete in <5 min; full restoration per cloud SLA |
| Cache-miss storm (bad new release, push failed) | PoPs cold-miss to origin on new title | Origin bandwidth spike; rebuffer spike | Cold new release rebuffer >5% for ~5 min | Origin shield absorbs; emergency posture shifts to Tier 1 fallback with higher BW; traffic shaping if shield saturates | Complete pre-push retroactively; monitor |
| Transcoder job | Node reboots mid-chunk | K8s pod lifecycle; chunk idempotency ID | One chunk slips 10 min | Auto-retry on another node; idempotent writes | Chunk re-encodes; title publish unaffected |
| DRM license server | Full regional DRM fault | License 5xx rate >1% in 30s | New sessions in region can't start; in-flight sessions run on cached licenses 15–60 min | GSLB to nearest healthy region; big red switch = "serve cached policy" fallback (opt-in studio-permitted titles only) | Restore region; queue of held sessions unblocks in <10 min |
| Recommendation service | Ranker unavailable | Reco 5xx | Home page still loads with popularity-sorted fallback (from Spanner); personalized rows missing | Circuit-break reco; serve cached "yesterday's recs" per user from Redis | Ranker restore; recs resume; no permanent loss |
| Catalog DB (Spanner) | Spanner outage | Cache reads still serve; writes fail | New titles can't publish; no catalog updates; playback mostly unaffected (manifest cached) | Read-only mode: serve cached catalog; publish workflows pause | Restore Spanner; drain publish queue |
| Analytics pipeline | Kafka consumer backlog | Kafka lag alerts | Recs + QoE dashboards stale; playback unaffected | Pipeline is best-effort; drop events if lag >30 min to stop cascading | Scale consumers; lag recovers |
| Regulatory geo-block | Fails open (title served in embargoed country) via misconfigured region list | Contract compliance monitor; studio complaint | Legal/contractual risk, not technical | Immediate catalog flip to remove region; audit; incident report to studio | Within minutes; legal follow-up |
| Studio mezzanine ingest | Bad master (wrong color space, corrupted audio) | Ingest validator + VMAF QC | Publish blocked; studio notified | Pre-ingest checks; explicit failure reasons | Re-ingest |
| CDN auth tokens | Token leak (pirate distribution); tokens circulating | Anomaly detection on session/token ratios; geographic heatmap anomalies | Piracy; revenue leak | Rotate signing key, invalidate all outstanding tokens; re-auth all users in region (brief pain) | Forensic trace; possibly escalate DRM policy |
| Client fleet | Device bug causes login stampede after app update; millions of simultaneous retries | Session_start RPS spike | Manifest API saturates; startup times degrade | Rate-limit at gateway + exponential backoff recommended in client; app-update rollback if severe | Coordinate with client team |

9 Evolution Path #

v1 (ship in 3 months — MVP regional launch):

  • Single origin region (us-east) with one cloud provider's object store.
  • Pull-CDN from one 3rd-party (CloudFront or equivalent) for delivery.
  • H.264 only, 5-rung ladder, fixed per-title.
  • HLS only (defer DASH); Widevine + FairPlay (PlayReady can wait).
  • No per-title encoding; fixed ladder.
  • Watch-history in Postgres (not yet Cassandra — scale not there).
  • Popularity-sorted "recs" from SQL aggregation, no ML.
  • No ISP appliances; no cross-region origin replication.
  • Goal: prove functional correctness, ~1M subs in one region.

v2 (ship in 9 months — 10× scale, multi-region, multi-DRM):

  • 3 origin regions; cross-region async replication of encoded assets.
  • Multi-CDN steering between 2–3 commercial CDNs (pull-based, origin shielding).
  • HEVC added; per-title encoding for top 1000 titles.
  • DASH added; PlayReady for Xbox/UWP.
  • Watch-history migrates to Cassandra; Spanner for catalog.
  • Two-stage recs: offline collaborative filtering + online popularity booster.
  • DRM: HSM-backed keys, proxy re-encryption added for scale.
  • Full ABR ladder; BOLA client shipped.
  • Analytics: Kafka + Flink + BigQuery.
  • Goal: 100M subs across 3 continents.

v3 (ship in 24 months — global, custom appliances, PTE, ML recs):

  • Open Connect-style ISP-embedded appliances at top 100 ISPs globally.
  • AV1 codec for top rungs; triple-codec serving.
  • Per-title encoding across entire back catalog; convex-hull optimizer in production.
  • Dynamic optimizer ML model trained on observed QoE.
  • ML-ranker for recs (deep neural, real-time features).
  • Regional DRM with proxy re-encryption; concurrent-stream enforcement via Redis.
  • Flash-event pre-warm pipeline; T-24h push with placement optimizer.
  • 12 origin regions; erasure-coded masters; Glacier deep archive for tail.
  • Goal: global footprint, 300–500M MAU.

v4 (research / future):

  • AV2 / LCEVC / low-power codecs as device support solidifies.
  • Edge compute for personalized manifest (per-user ad insertion, regional splice).
  • ML-driven ABR (Pensieve-class) shipped to clients where it beats MPC.
  • End-to-end encrypted watch-history (no Netflix-server knows what you watched) for privacy-stringent regions.
  • Live-VOD hybrid (live premieres that transition to VOD) with shared ingest pipeline.
  • Foundation-model-assisted content understanding for better recs and subtitle generation.

10 Out-of-1-Hour Notes #

Codec selection economics. AV1's 30% BW reduction vs HEVC × $2.5B/yr egress = $750M/yr if we could serve only AV1. Device penetration gates this: we ship AV1 to supporting devices (40% of MAU today, climbing to ~80% by 2027). HEVC serves the next ~50%, AVC baseline covers the rest. VP9 abandoned — AV1 eclipses it going forward and patent landscape is cleaner. LCEVC is a "scalability enhancement layer" (delta over a base layer) interesting for bandwidth-constrained markets but ecosystem tooling is thinner.

Subtitles & multi-audio tracks. CMAF allows subtitle and audio tracks in separate representations sharing the same CENC encryption. Storage cost modest; serving cost a rounding error. For accessibility: we require closed-captions on every title (US ADA, EU EAA compliance). Audio description tracks for visually impaired (separate audio representation). Lyrics / karaoke tracks are handled as a subset of the same subtitle-representation model.

Kids content compliance. COPPA in the US, GDPR-K in EU; kids profile must not send personalized recs based on adult viewing history. Implementation: kids profiles share a user_id with the account but have a profile_type=kids flag that routes to a restricted reco model and filters catalog to rated-for-kids. Geo-restrictions also apply (some content is kids-ok in country A, not country B).

Pre-release embargo. Contractual with studios; title exists in the system in DRAFT for days-to-weeks, with encoded assets pre-pushed to edge but manifest unavailable. PublishTitle at the drop moment flips a single bit in Spanner — because assets are already in place, the drop is effectively instantaneous. Crucial: audit log at publish time, signed by the release manager; on-call during the drop.

Live-streaming sub-system (separate). Architecturally distinct from VOD. Ingest via RTMP/SRT/WebRTC to regional ingest gateways; re-encode in real-time with 2–4s latency (LL-HLS) or sub-second (WebRTC for ultra-low-latency, e.g., sports bet interactions). Separate origin servers optimized for hot-cache-only; separate CDN profile. Shares the playback client, DRM, catalog metadata. Biggest operational difference: live has no re-try opportunity — the segment either reaches the viewer in time or it's stale forever. Chunked transfer encoding + LL-HLS push allows ~1-2s glass-to-glass.

Ad insertion (SSAI / CSAI). Server-Side Ad Insertion rewrites the manifest per-session, splicing pre-roll/mid-roll/post-roll ad segments. Integrates cleanly with our manifest-per-session model. Client-Side: simpler but ad-blockable; SSAI is preferred for mandated ads. Sub-components: ad decision service, ad-creative CDN, session-bound manifest generator. This is a ~6 engineer-year project on top of core platform.

Observability gold standards (playback SLOs).

  • Per-session QoE events piped from client: startup time, rebuffer events (count + duration), bitrate switches, errors.
  • SLI/SLO/error budget per playback SLO:
    • Startup p99 <2s, budget 10⁻³ sessions/month exceed 5s (~500k sessions budget at 500M MAU).
    • Rebuffer ratio <0.5%, budget: 5% of sessions exceed 2% rebuffer ratio.
    • Playback availability 99.99%.
  • Error budget consumption dashboards per region, per device-class (mobile, TV, desktop), per title.
  • Per-title QoE surfaced to content ops: if a new release has anomalously high rebuffer in region X, investigate (bad edge push? bad encode on a specific rung?). Content quality is as operational as compute is.
  • CDN hit-rate per PoP per title — leading indicator for cache-miss storms.

Piracy / DRM escape hatches. Determined attackers will screen-cap; HDCP 2.2 tries to stop HDMI capture. Camcording a screen is the unavoidable low-bar attack. We don't chase screen-cappers; we chase bulk subscription-sharing (concurrent-stream enforcement), token replay (session binding, rotating CDN signing keys), and key extraction (HSM + hardware DRM). Fingerprinting (steganographic watermark per-user) for post-leak attribution is an open research area — Netflix uses forensic watermarking on premium content; we could add as v3+.

Peering & anycast. Tier-1 and Tier-2 PoPs connect to ISPs via settlement-free peering wherever possible. The manifest/session/license APIs (the HTTP control plane) sit behind anycast, letting BGP route each request to the nearest PoP — cutting ~10–50ms per RPC. Segments are delivered unicast via content-aware routing (the nearest PoP that holds the content).

Observability of appliances (Tier 0). Each OCA-equivalent phones home every 60s: disk health, cache hit-rate, egress Gbps, CPU, temperature. A fleet of 10k appliances at a 60s cadence is ~170 telemetry messages/s — trivial. Alerts on: egress drop (appliance failing silently), cache miss-rate spike (likely a failed push), disks predicted to fail.

Privacy & data residency. Watch-history and entitlement records are PII in many jurisdictions. GDPR and national data-protection rules effectively require localization in several markets — watch-history for EU users is kept in EU-region Cassandra clusters, with no replication outside the EU. Catalog metadata is not PII and replicates globally. Request-ID sampling for debugging must stay GDPR-compliant (no bulk access outside justified investigations).

Testing specifics.

  • Multi-region failover chaos drills monthly.
  • Synthetic playback from 50+ geographic probes continuously; any p99 startup drift alerts.
  • Pre-release push rehearsals: push a synthetic title to all appliances, measure the push-time distribution, and gate the release on 95th-percentile readiness meeting the budget.
  • Encoder regression: every 100th encoded title gets an automated diff against its prior version; VMAF regression >2 points blocks publish.
  • DRM chaos: periodically kill a DRM region in staging; verify GSLB shifts and sessions recover.
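
The VMAF gate in the encoder-regression bullet might look like the following — per-rung score diffs against the prior encode, blocking publish on any regression beyond 2 points. Scores and rung names are illustrative:

```python
# Sketch of the encoder-regression gate: compare per-rung VMAF between
# the candidate encode and the prior one; any rung regressing by more
# than 2 points blocks publish. Scores here are made-up examples.

VMAF_REGRESSION_LIMIT = 2.0

def gate_publish(prior, candidate):
    """prior/candidate: {rung_name: vmaf_score}. Returns (ok, offending_rungs)."""
    bad = {r: prior[r] - candidate[r]
           for r in prior
           if prior[r] - candidate[r] > VMAF_REGRESSION_LIMIT}
    return (not bad, bad)

prior     = {"1080p": 95.1, "720p": 93.4, "480p": 89.0}
candidate = {"1080p": 94.8, "720p": 90.9, "480p": 89.2}
ok, bad = gate_publish(prior, candidate)
print(ok, sorted(bad))  # False ['720p'] — 720p regressed ~2.5 points, publish blocked
```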

Green-field greenhouse: what would I change if starting today (2026)?

  • HTTP/3 over QUIC by default for all API surfaces (control plane plus data-plane segment delivery). Already proven at Google scale, with latency wins on the order of 20%.
  • Rust for edge services (manifest, session, the app-layer license wrapper). Not for transcoders — the FFmpeg ecosystem keeps C/C++ dominant there.
  • Confidential computing (AMD SEV-SNP, Intel TDX) as an HSM alternative — cheaper, comparably secure for our threat model, and enables per-tenant, per-region key isolation more flexibly than HSM quotas.
  • eBPF-based observability in the PoP — TCP retransmits and QUIC ACK delays observed at wire speed, for finer-grained rebuffer diagnosis.
  • Foundation-model-based recommendation scoring — a retrieval-augmented ranker over user history plus text-synopsis embeddings; likely +10% engagement, at unclear compute cost vs. the DNN ranker.

Verification Checklist (done before submission) #

  1. SRE pager-carryable? Yes — §8 is a runbook with detection, blast radius, mitigation, recovery per component, including the "big red switches" (serve-cached-license, Tier-1 fallback posture, catalog read-only mode).
  2. Every diagram arrow → §4 or §5? Yes — the table at end of §6 cross-references every labelled arrow to an API surface or data store.
  3. Deep-dives at L7 depth? Yes — §7a derives CDN Pareto + ISP-push scheduling + bandwidth sanity-checked against §3; §7b derives PTE ROI ($500M/yr savings for $3M/yr compute) and GOP-parallel wall-clock math; §7c derives proxy-re-encryption as HSM decoupling with a concrete threat-model trade-off; §7d walks the full startup critical-path millisecond-by-millisecond and shows the parallelism that keeps us under 2s p99.
  4. Capacity math closes? 500 Tbps = 100M concurrent × 5 Mbps. Breaks into 375 Tbps Tier 0 / 100 Tbps Tier 1 / 25 Tbps origin. Tier 0 → 10k appliances × 37.5 Gbps/each (100G NIC × 2 = comfortable). Storage 14 PB encoded + 15 PB masters = ~30 PB logical, ~63 PB raw with EC/replication. Transcode farm 4k CPU-hours/day base, 40k peak. DRM 37k licenses/s base, 300k burst, 24 regions × 3 × 5k = 360k capacity. Egress $2.5B/yr dominates ~$2.6B all-in. Numbers close.
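
Item 4's arithmetic can be replayed as executable checks, using only the figures stated above:

```python
# Re-derivation of the capacity numbers in checklist item 4 as executable
# arithmetic. All inputs are the figures stated in the checklist; this
# only confirms they are mutually consistent.

concurrent = 100e6            # peak concurrent streams
bitrate_mbps = 5              # average delivered bitrate
total_tbps = concurrent * bitrate_mbps / 1e6   # Mbps -> Tbps
assert total_tbps == 500      # matches the 500 Tbps headline number

# Tier 0: 375 Tbps across 10k appliances
per_appliance_gbps = 375e12 / 10_000 / 1e9
print(f"per-appliance egress: {per_appliance_gbps:.1f} Gbps")  # 37.5 — fits 2x100G NICs

# DRM: 24 regions x 3 replicas x 5k licenses/s must cover the 300k burst
drm_capacity = 24 * 3 * 5_000
assert drm_capacity == 360_000
assert drm_capacity >= 300_000
```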