Q14 Product & Edge Systems
Design a Video Streaming Service
Support creator uploads, transcoding, adaptive-bitrate playback, CDN delivery, and watch-history-driven recommendations.
1 Problem Restatement & Clarifying Questions #
Restatement (say this first, 20 seconds): Build a global video streaming platform that ingests licensed mezzanine masters from studios, transcodes each into an adaptive bitrate ladder with DRM packaging, distributes via a tiered CDN (edge PoPs + ISP-embedded appliances) at multi-hundred-Tbps peak egress, and serves ABR playback with p99 startup <2s and rebuffer ratio <0.5%. Secondary surface: watch-history, personalized recommendations, per-region regulatory compliance. The cost center dominates — >70% of total infra spend is egress bandwidth — so every design choice is evaluated against its bandwidth bill first.
Clarifying questions I would ask, with the default I'd adopt if told "you decide":
| # | Question | Why it matters | Default if unspecified |
|---|---|---|---|
| Q1 | YouTube-style UGC vs Netflix-style licensed catalog? | UGC has a long tail of cold content (millions of titles, most watched <10 times) and ingest at human-generated scale (500 hrs/min). Licensed has 100k premium titles with predictable demand spikes and rights management. | Netflix-style licensed. Justification: the problem names "transcoding pipeline," "catalog," and "geo-restrictions" — all point at a curated-rights model. UGC is a separate design (§10). I'd call this out explicitly so the interviewer corrects me early if wrong. |
| Q2 | VOD only, or live + VOD? | Live ingest is a separate sub-system (RTMP/SRT → low-latency HLS/LL-DASH with 2–6s glass-to-glass). | VOD only. Live-streaming treated as adjunct (§10). |
| Q3 | MAU / peak concurrency? | Sets egress BW and CDN PoP count. | 500M MAU, 100M peak concurrent (major-release global drop). §3 BOE math pivots on this. |
| Q4 | DRM level? Studio contracts usually mandate HDCP 2.2 + L1 Widevine for 4K. | Determines whether decryption is hardware-bound (L1) or SW-allowed (L3). L1 = client-locked, no cloud-DRM shortcut. | Widevine L1 / FairPlay / PlayReady SL3000 for 4K; L3 fallback for SD. Assume all three DRM systems required (iOS/Android/Smart-TV fragmentation). |
| Q5 | 4K / HDR / Dolby Vision / Atmos? | Adds top rungs to the ladder, doubles encode cost, triples egress bytes per viewer. | Yes to all. Premium catalog; trade-off is real. |
| Q6 | Geographic scope; regulatory regimes? | Drives region count, geo-fencing, per-country rights fencing, data residency (EU DPA, India DPDPA). | Global minus embargoed countries. 18 regions, 3 CDN tiers (core PoPs, regional PoPs, ISP-embedded appliances). |
| Q7 | Offline downloads? | Needs persistent DRM license with bounded validity (e.g., 48h; 30d when connected). | Yes. Download path is a variant of playback, same DRM but a different license policy. |
| Q8 | Cost target per streaming hour? | Anchors whether we use 3rd-party CDN, build custom appliances, or hybrid. | <$0.015 per streaming hour all-in (bandwidth + compute + DRM). §3 shows custom-appliance ROI. |
| Q9 | SLO targets: startup time, rebuffer, availability? | The three numbers that define "good video." | p99 startup <2s, rebuffer ratio <0.5%, playback availability 99.99%. Catalog browse 99.95% (reco service allowed to degrade separately). |
| Q10 | Recommendation scope? Candidate-gen + ranker or one-shot? | Drives whether we separate online/offline embedding. | Two-stage: offline candidate gen (collaborative filtering, graph walks) + online ranker (deep neural, real-time features); serving detail in §5.5. |
I'd spend ~90 seconds on these, commit to the defaults, and say "I'll surface anywhere these bite."
2 Functional Requirements #
In scope (numbered):
- FR1 — Studio ingest. InitiateUpload(title_id, manifest) → chunked resumable upload of mezzanine master (typically ProRes or JPEG2000 IMF, 200–500 GB per 90-min feature) → object store → trigger transcode DAG.
- FR2 — Transcode to ABR ladder. Per-title ladder selection (dynamic optimizer), parallel segment encoding in 2–4s GOP chunks, multi-codec output (H.264 AVC, HEVC, AV1), multi-container (HLS fMP4, DASH CMAF), DRM packaging (Widevine/FairPlay/PlayReady via a single CENC CMAF package).
- FR3 — QC & conformance. Automated perceptual quality checks (VMAF per rung), bitstream conformance (ffmpeg/Bento4), audio loudness (EBU R 128), subtitle timing.
- FR4 — Catalog publish. Transactional flip from "draft" to "available in regions X" with geo-allowlist; embargoed release (countdown-to-live).
- FR5 — Manifest serve. GET /v1/manifest/{title}/{session} returns signed HLS/DASH manifest with per-session CDN-token-rewritten segment URLs and per-profile rung list.
- FR6 — Segment serve. Stateless CDN-cacheable GET /s/{title}/{rung}/{seg_no}.m4s; 99.9%+ hit rate at edge PoPs for head catalog.
- FR7 — DRM license exchange. POST /drm/license exchanging a CDM-provided challenge for a signed license (Widevine/FairPlay/PlayReady key wrap).
- FR8 — ABR playback. Client selects rung per segment via buffer + throughput signals. We serve, don't decide (clients own ABR).
- FR9 — Watch-history & resume. POST /heartbeat every 10s; GET /continue-watching/{user} returns last positions with user-consistent read.
- FR10 — Recommendations. GET /recs/{user}?surface=home|detail|postplay — candidate gen + rank, with diversification.
- FR11 — Analytics. QoE (rebuffer, startup, bitrate ladder switches) and engagement (play, pause, abandon) events; feeds recs + SLI dashboards + ladder tuning loop.
- FR12 — Download-for-offline. Scoped DRM license, bounded offline duration.
- FR13 — Geo-restriction / parental controls / kids profile. Enforced at manifest issuance (server-authoritative, cannot be bypassed by changing client state).
Out of scope (say out loud):
- Live streaming (RTMP/SRT ingest, low-latency HLS, chat overlay) → separate sub-system, sketched in §10.
- UGC ingest at human-generated rates (500 hrs/min YouTube). Our ingest is ~10–100 titles/day from a handful of studios.
- Comments, social, watch-parties — app-layer features, separate services.
- Payment / subscription billing — separate service; we consume entitlements.
- Content moderation / Trust & Safety — trivialized for licensed catalog (studios vet); for UGC it would be huge.
- Ad insertion / SSAI — noted in §10 as a clean extension (per-segment URL rewriting at manifest time).
3 Non-Functional Requirements + Capacity Estimate #
3.1 NFRs (with explicit SLO targets)
| NFR | Target | How we achieve it |
|---|---|---|
| Playback availability | 99.99% measured as played-without-fatal-error / started | Tiered CDN with 3 fallbacks; origin replicated 3× cross-region; client-side rung fallback |
| p99 startup time (click → first video frame) | <2s | Manifest edge-cached; first segment in client push; pre-fetched DRM license on manifest fetch |
| Rebuffer ratio (rebuffer-seconds / play-seconds) | <0.5% | ABR tuned (BOLA-E+MPC hybrid); segment prefetch = 30s buffer target; CDN hit >99.5% |
| Durability of master | 11 nines (1 − 10⁻¹¹ annual loss) | Object store w/ erasure coding 10+4 across AZs; tape archival for deep cold; checksum on ingest and every replication hop |
| Durability of encoded assets | 9 nines; re-transcodable if lost | Master survival is ground truth; encoded ladder is recomputable (cost: 1 day & ~$30k per missing title) |
| DRM license issuance latency | p99 <300ms | License servers co-located with edge regions; HSM-backed key wrap hot-path sub-10ms |
| Catalog metadata freshness (new release → searchable) | <60s globally | Spanner / pub-sub fan-out |
| Recommendation freshness (watched → re-ranked) | <5 min for online features; daily for batch | Online ranker with last-N session features |
| Analytics completeness (QoE ingest) | 99.9% events landed within 1 min | Lossy in-memory buffer → Kafka with 3× replication → batch to warehouse |
| Cost per streaming hour (all-in) | <$0.015 | Open Connect-style appliances + pull-CDN spillover; AV1 where devices support it (30% BW reduction) |
3.2 Back-of-envelope math — every number calculated
(a) Peak egress bandwidth
- 500M MAU × 2 hrs/day avg viewing = 1B streaming-hours/day = ~42M concurrent average.
- Peak/average ratio for global VOD ~2.5× (evening TZ spike); for major releases ~4× (global drop).
- Peak concurrent ≈ 100M viewers.
- Blended bitrate (4K HDR 15 Mbps, 1080p 5 Mbps, 720p 2.5 Mbps, mobile 1 Mbps; weighted 15%/45%/30%/10%) = ~5 Mbps average.
- Peak egress = 100M × 5 Mbps = 500 Tbps.
That number is the single biggest constraint. Build everything else backward from it.
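The arithmetic above is small enough to keep as an executable note. A minimal sketch (Python; the §3 assumptions hard-coded, tier shares from §3b — illustrative, not measured):

# Back-of-envelope egress model: 500M MAU, 2 hrs/day, peak/avg ~2.5x, blended ~5 Mbps.
MAU = 500e6
avg_concurrent = MAU * 2.0 / 24              # ~42M
peak_concurrent = avg_concurrent * 2.5       # ~100M (global drop can hit ~4x)

mix = {15e6: 0.15, 5e6: 0.45, 2.5e6: 0.30, 1e6: 0.10}   # bitrate (bps) -> viewer share
blended_bps = sum(rate * share for rate, share in mix.items())   # ~5.35 Mbps

peak_egress_tbps = peak_concurrent * blended_bps / 1e12
tiers = {"tier0_isp": 0.75, "tier1_pop": 0.20, "origin": 0.05}    # §3b split
for name, share in tiers.items():
    print(f"{name}: {peak_egress_tbps * share:.0f} Tbps")
print(f"total peak egress: {peak_egress_tbps:.0f} Tbps")  # ~500-550 Tbps; §3 rounds to 500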
(b) CDN PoP sizing
- Assume tiered CDN:
- Tier 0: ISP-embedded appliances (Netflix Open Connect-style). Target ~75% of egress (the head of the catalog; pre-warmed).
- Tier 1: ~500 core/regional PoPs globally (our own or partner CDNs). Target ~20% of egress (body of catalog + cache-miss fallback).
- Tier 2: ~12 origin regions. Target ~5% of egress (long tail, live re-transcode, new release pre-warm).
- Per-tier peak egress:
- Tier 0 = 0.75 × 500 Tbps = 375 Tbps spread across ~10k appliances in ~2k ISPs → 37.5 Gbps per appliance peak. Comfortable for an appliance with a 100G NIC and 2 × 100G uplinks.
- Tier 1 = 0.20 × 500 Tbps = 100 Tbps across 500 PoPs → 200 Gbps/PoP peak. Modest; today's major PoPs deliver multi-Tbps.
- Tier 2 = 0.05 × 500 Tbps = 25 Tbps across 12 regions → ~2 Tbps/region, fine.
Sanity check: Netflix publicly discloses Open Connect delivers >200 Tbps at peak at a scale of ~250M subs. We're roughly 2× that scale, so 500 Tbps is in line.
(c) Catalog storage
- 100,000 titles × avg 90 min = 9M min = 150k hrs of content.
- Per-title encoded bytes:
- Ladder: 15 rungs × avg ~4.5 Mbps (arithmetic mean; fat rungs dominate) × 5400 s = ~3 GB per rung average × 15 = ~45 GB per ladder per language.
- Codecs: AVC + HEVC + AV1 × 3 = ~135 GB per language.
- Languages/audio: 10 audio tracks × (same video, separate audio @ 192 kbps each) ≈ add 2 GB → video dominates.
- Subtitles: 30 languages × ~1 MB = 30 MB, rounding error.
- Per title: ~140 GB encoded (across all codecs/profiles).
- Plus master (mezzanine): ProRes 422 HQ at 220 Mbps × 5400 s = ~150 GB per title.
- Total encoded storage = 100k × 140 GB = 14 PB.
- Total master storage = 100k × 150 GB = 15 PB (kept forever, warm/cold tiered).
- With 3× replication + 10+4 erasure coding on masters (1.4× overhead vs 3× for hot): masters on EC = 21 PB raw; encoded hot on replication = 42 PB raw. Total raw ≈ 63 PB.
- At $20/TB/yr for hot object storage (internal tiered): ~$1.3M/yr storage. Dwarfed by egress.
(d) Transcode compute
- For each hour of source: encode a ladder (15 rungs × 3 codecs = 45 encodes, though shared analysis cuts overhead ~30%) + QC + package + DRM.
- AV1 is the expensive rung: ~30–60× realtime for quality preset on modern Xeon. HEVC ~10×. AVC ~3×.
- Blended per-hour-of-content: ~80 CPU-hours per hour of content (dominated by AV1 and top HEVC rungs).
- Volume: 10 titles/day × 1.5 hrs = 15 hrs/day new content, plus 3× re-encode on codec/DRM/ladder changes across back catalog = ~50 hrs content-encode-equivalent/day.
- = 4,000 CPU-hours/day → ~170 sustained cores at full utilization.
- With launch burst (Netflix-class drops): 10× peak = 1,700 cores burst. Spot/preemptible farm at ~$0.01/core-hour ≈ $17/hr burst, trivial.
- Key leverage: split each hour into 2–4 s GOP chunks, encode in parallel — turns a 90-min encode into 10-min wall-clock by using 400 chunks in parallel. Same total CPU, wildly faster wall-clock. Enables "ingest → publish" in <2 h for a 90-min feature.
(e) Transcode farm sizing for head-of-line titles
- A blockbuster master arrives 72 h before release.
- Need to emit full ladder + QC + package in <24 h so we have 48 h to propagate to edge appliances (§7a).
- Per title: 80 CPU-hrs/hr × 1.5 hrs content = 120 CPU-hrs per title; wall-clock target 4 h → need 30 cores in parallel for a single title. Trivial per title; the farm's value is handling many titles and re-encodes in parallel.
(f) DRM license server
- 100M concurrent viewers, each fetching ~1 license per playback session + periodic renewals (every 30–60 min for long sessions).
- Peak license issuance = new-session rate. Assume avg session = 45 min → 100M / 2700 s = ~37k licenses/s steady-state; during a major release start, 300k licenses/s burst.
- HSM-backed wrap is ~1ms per op on modern HSMs, but HSMs are throughput-limited. We put key wrapping in the HSM and session policy (geofence, device binding, entitlement) in the application layer.
- With regional DRM servers (24 regions × 3 pods × 5k ops/s/pod) = 360k ops/s capacity — handles 300k burst with 20% headroom.
- If we exceed HSM throughput: software key wrap with offline-mint HSM-signed chains (proxy re-encryption; §7c).
(g) Watch-history and recs
- Heartbeats every 10 s during playback × 42M average concurrent viewers → ~4.2M heartbeats/s.
- Each heartbeat: 200 B → ~800 MB/s into the write path, ~70 TB/day of raw events.
- This cannot hit Spanner directly (Spanner sustains on the order of tens of thousands of writes/s per node). It lands in Kafka, is downsampled to a "progress point every 15 s" per (user, title) in a Cassandra-style wide row, and is materialized daily into the warehouse.
(h) Cost anchor (why egress dominates)
- Egress at blended $0.01/GB (mixed tier-0/1/2 effective) × 500 Tbps × 86,400 = ~5.4 EB/day at peak-day rate, ~$54M/day if we paid retail. Reality: Tier-0 appliances at ISPs cost effectively $0.001/GB or less (ISP peers, no transit) → effective blended ~$0.003/GB → ~$16M/day peak, ~$8M/day avg, ~$2.5B/yr egress.
- Compute + storage combined: ~$50M/yr. DRM + control plane: ~$20M/yr.
- Total ~$2.6B/yr of which 95% is egress. This is the economic engine behind Open Connect.
4 High-Level API #
All client-facing APIs are HTTPS; manifests are served over HTTP/2, and segment fetches use HTTP/1.1 or HTTP/3 (QUIC) depending on client capability. Internal control-plane traffic is gRPC.
4.1 Ingest APIs (studio-facing, authn via mTLS + studio-key)
service Ingest {
// Step 1: get an upload session + part URLs
rpc InitiateUpload(InitRequest) returns (InitResponse);
// Response carries: upload_session_id, part_size_bytes, signed S3-style URLs
// for ~100MB parts up to 10k parts (1 TB max single upload).
rpc CompleteUpload(CompleteRequest) returns (CompleteResponse);
// Triggers transcode DAG. Returns job_id.
rpc GetTranscodeStatus(StatusRequest) returns (stream StatusResponse);
// Server-streaming RPC. Emits QC, ladder progress, DRM packaging done.
rpc PublishTitle(PublishRequest) returns (PublishResponse);
// Atomic catalog flip: draft → {geo_allowlist, valid_from}. Embargo-safe.
}
message InitRequest {
string studio_id = 1;
string title_id = 2;
int64 mezzanine_bytes = 3;
string mezzanine_sha256 = 4;
string container_hint = 5; // "IMF" | "ProRes_MOV" | "MXF"
Metadata meta = 6; // cast, runtime, content-rating, etc.
}
Chunked upload note. Uploads use a multipart pattern identical to S3 (InitiateMultipart → PutPart(PartNumber) → CompleteMultipart). 100 MB part size, max 10k parts. Resumable: part uploads are content-addressed, so a network blip retries only the failed part. End-to-end SHA-256 over reassembled bytes must match studio-supplied hash; mismatch = upload rejected, studio alerted.
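A minimal client-side sketch of that flow, assuming InitiateUpload returned signed part URLs that accept plain HTTP PUTs (the 100 MB part size is from the contract above; function names are illustrative, not a real SDK):

# Studio-side resumable multipart upload with end-to-end SHA-256 (requires `requests`).
import hashlib
import requests

PART_SIZE = 100 * 1024 * 1024  # 100 MB parts, per the ingest contract

def upload_master(path: str, part_urls: list[str]) -> str:
    """Uploads the mezzanine in parts, retrying only failed parts, and returns
    the SHA-256 the studio must also supply in InitiateUpload for verification."""
    whole = hashlib.sha256()
    with open(path, "rb") as f:
        for part_no, url in enumerate(part_urls):
            data = f.read(PART_SIZE)
            if not data:
                break
            whole.update(data)
            for _attempt in range(3):          # resumable: a network blip retries this part only
                resp = requests.put(url, data=data, timeout=300)
                if resp.ok:
                    break
            else:
                raise RuntimeError(f"part {part_no} failed after retries")
    return whole.hexdigest()                   # mismatch with studio hash => upload rejected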
4.2 Playback APIs (client-facing)
// Session ticket — opaque signed blob, 15-min lifetime, rotated every session
POST /v1/session/start
Body: { device_id, device_capabilities (codecs, HDCP, HDR), entitlements }
Response: { session_token, manifest_url_template, license_url, token_ttl }
GET /v1/manifest/{title}/{profile}?token={session_token}
Response: HLS (m3u8) or DASH (MPD) manifest. Segment URLs carry:
- cdn_prefix: per-session, signed with per-PoP key, TTL=12h
- rung identifiers
- DRM key IDs (kid) embedded
POST /v1/drm/license
Body: widevine_challenge | fairplay_spc | playready_challenge
Response: signed license blob (wrapped content key + policy)
POST /v1/heartbeat
Body: { session_token, title, position_ms, bitrate_ladder_pos, rebuffer_ms_since_last }
(10s interval; fire-and-forget; 204 No Content)
GET /v1/continue-watching/{user_id}
Response: list of recently-played titles with last-position
GET /v1/recs/{user_id}?surface=home&row=1&limit=40
Response: ranked title list with boost annotations
4.3 Segment serve (edge)
GET /s/{title_hash}/{profile}/{rung}/{seg_no}.m4s
?v={version} # content version; ladder re-encodes bump this
&cdn_sig={sig} # per-PoP-key-signed, checked at PoP
Cache-Control: public, max-age=31536000, immutable
Content-Type: video/mp4
Segment URLs are immutable (never rewritten for the life of a content version). This is critical: immutability enables max-age=1y, which is the reason our hit rate is >99% at edge. Any title update bumps the {version} component — a new URL space, no cache purge needed.
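A sketch of how a per-session, signed, immutable segment URL could be minted at manifest time (the HMAC construction and query layout here are assumptions for illustration, not the production scheme):

# Per-session segment URL signing; immutability comes from the {version} path component.
import hashlib, hmac, time

POP_KEY = b"per-pop-secret-rotated-12h"   # hypothetical per-PoP signing key

def sign_segment_url(title_hash: str, profile: str, rung: int,
                     seg_no: int, version: int, session_id: str) -> str:
    path = f"/s/{title_hash}/{profile}/{rung}/{seg_no}.m4s"
    expires = int(time.time()) + 12 * 3600           # TTL = 12h, per §4.2
    msg = f"{path}?v={version}&sid={session_id}&exp={expires}".encode()
    sig = hmac.new(POP_KEY, msg, hashlib.sha256).hexdigest()[:32]
    # A re-encode bumps {version}: a brand-new cache key, so max-age=1y + immutable
    # needs no purge.
    return f"{path}?v={version}&sid={session_id}&exp={expires}&cdn_sig={sig}"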
4.4 Failure semantics
- Manifest serve: if origin unhealthy, serve stale manifest from edge (TTL extended to 24h in degraded mode).
- Segment fetch: client-side fallback to next lower rung on 5xx or timeout >2s; CDN itself routes miss to origin shield, then origin.
- License fetch: if DRM license server is down, playback already-in-progress continues (license cached client-side for ≥1h); new sessions fail. Big red switch: "short-circuit license" for degraded mode (serves cached token-validated license for catalog already in flight — requires opt-in studio contract terms).
5 Data Schema + Engine Choice #
5.1 Catalog metadata (durable, strongly consistent)
Stored in Spanner (or CockroachDB / TiDB if open-source-preferred). Why: catalog is low-QPS writes (~100/day), extremely high-QPS reads (every home page load, every title click), must be globally consistent (a title becoming unavailable in region X needs to propagate fast), and transactional (embargo-release is a cross-column flip — availability + ladder URLs + DRM key refs flip atomically). Reads fronted by per-region cache (Redis) with pub-sub invalidation.
TABLE Title (
title_id STRING PRIMARY KEY, -- UUID
version INT64, -- content version; bumps = new ladder
display_name_i18n JSON, -- {"en":"Squid Game","ja":"イカゲーム",...}
runtime_ms INT64,
content_rating JSON, -- per-region: {"US":"TV-MA","DE":"16"}
geo_allowlist ARRAY<STRING>, -- ISO-3166-1 alpha-2
valid_from TIMESTAMP,
valid_until TIMESTAMP, -- for licensed content with expiry
hdr_flags BITMASK, -- HDR10, HDR10+, DolbyVision
subtitles ARRAY<STRING>, -- ["en","ja",...]
audio_langs ARRAY<STRING>,
mezzanine_master_uri STRING, -- s3://.../master.mxf
status ENUM(DRAFT, TRANSCODING, READY, RETIRED),
created_at TIMESTAMP, updated_at TIMESTAMP
)
TABLE Asset (
title_id STRING, version INT64, profile STRING, PRIMARY KEY(title_id,version,profile)
-- profile = e.g., "hevc_4k_hdr_10bit" | "avc_1080p" | "av1_1080p"
manifest_uri STRING, -- s3://... or /cdn/...
rungs JSON, -- [{rung:0,bitrate:450000,w:426,h:240,codec:"avc"},...]
audio_tracks JSON,
subtitle_tracks JSON,
drm_key_refs ARRAY<STRING>, -- key IDs for CENC; actual keys in KMS/HSM
vmaf_summary JSON, -- per-rung avg/p1 VMAF
packager_fingerprint STRING -- for idempotency / re-package detection
)
Read paths at page-load time issue 1–2 Spanner queries (fast; ~10ms) but are shielded by a per-region Redis that caches titles with 60s TTL + pub-sub invalidation on updates. Manifest-serving doesn't go back to Spanner at all — manifest is precomputed & stored in CDN/object store at publish time.
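A minimal sketch of that read-through cache with pub-sub invalidation (redis-py; `spanner_get_title` is a stub standing in for the authoritative catalog query):

# Per-region read-through cache in front of the catalog, invalidated via pub-sub.
import json
import redis

r = redis.Redis()
TTL_S = 60  # titles cached 60 s; pub-sub invalidation covers the hot path

def spanner_get_title(title_id: str) -> dict:
    # Placeholder for the authoritative Spanner query.
    return {"title_id": title_id, "status": "READY"}

def get_title(title_id: str) -> dict:
    cached = r.get(f"title:{title_id}")
    if cached:
        return json.loads(cached)
    row = spanner_get_title(title_id)
    r.setex(f"title:{title_id}", TTL_S, json.dumps(row))
    return row

def invalidation_listener():
    """Runs in each region; the catalog service publishes title_id on every update."""
    sub = r.pubsub()
    sub.subscribe("catalog-invalidate")
    for msg in sub.listen():
        if msg["type"] == "message":
            r.delete(f"title:{msg['data'].decode()}")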
5.2 User data (watch history, entitlements, profile)
Cassandra wide-row for watch history. Why: append-heavy, key-by-user, time-ordered, no cross-user joins in the hot path.
TABLE watch_history (
user_id TEXT, -- partition key
title_id TEXT, -- clustering key; one row per (user, title), upserted in place
last_position_ms BIGINT,
last_updated TIMESTAMP,
device_last TEXT,
PRIMARY KEY (user_id, title_id)
);
-- One bounded partition per user; "continue watching" reads the partition and orders
-- by last_updated at read time (CLUSTERING ORDER can't sort on a non-clustering column).
Writes arrive via Kafka → Flink aggregator (downsample 10 s heartbeats to a last-known position per (user, title), upserted every 15 s) → Cassandra. This decouples the raw firehose (~4M heartbeats/s) from the storage tier: after aggregation it is one single-partition upsert per active session per 15 s (~2.8M/s at average concurrency), which a wide Cassandra cluster absorbs comfortably. "Continue watching" is a 1-partition range scan, sub-5ms.
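The essence of that aggregation step, written as plain Python over a generic heartbeat iterator rather than real Flink API (the stream and the Cassandra upsert are injected placeholders):

# Collapse 10 s heartbeats to one last-known-position upsert per (user, title) per flush.
import time

FLUSH_INTERVAL_S = 15

def run_aggregator(heartbeat_stream, cassandra_upsert):
    latest = {}                        # (user_id, title_id) -> (position_ms, ts)
    last_flush = time.monotonic()
    for hb in heartbeat_stream:        # dicts: {"user_id", "title_id", "position_ms", "ts", ...}
        latest[(hb["user_id"], hb["title_id"])] = (hb["position_ms"], hb["ts"])
        if time.monotonic() - last_flush >= FLUSH_INTERVAL_S:
            for (user, title), (pos, ts) in latest.items():
                cassandra_upsert(user, title, pos, ts)   # idempotent single-partition upsert
            latest.clear()
            last_flush = time.monotonic()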
Entitlements (subscription tier, region, parental profile): Spanner — cross-consistent with billing.
5.3 Segments (object store + CDN)
S3-compatible object store (GCS / S3 / Azure Blob — we run multi-cloud origin for vendor leverage). Each segment is an immutable object:
Bucket: stream-origin-{region}
Key: {title_hash}/{version}/{profile}/{rung}/{seg_no}.m4s
Attributes:
content-type: video/mp4
x-amz-storage-class: STANDARD (hot) | INTELLIGENT_TIERING (body) | GLACIER (cold)
x-custom-vmaf: per-segment VMAF score (for QC)
x-custom-bitrate: actual encoded bitrate (for client-side VMAF/BW analytics)
x-cache-version: cdn cache-key component
Storage tiering via object-store lifecycle:
- Hot (first 30 days after title release or any title in top-10% watch share last 7 days): STANDARD, 3× replicated in-region, cross-region replication to 2 additional regions.
- Warm (30d–180d or middle-share): INTELLIGENT_TIERING — auto-demotes after 30d no access; single-region + 1 replica.
- Cold (old catalog, watched <10×/month/region): GLACIER / equivalent deep archive, retrieval-latency 3–5 min. Re-fetched to hot on demand into the origin shield, not directly to edge. Saves 80% on storage cost for tail.
- Master (mezzanine) cold: Glacier Deep Archive / tape. Checksummed, erasure-coded 10+4. Retrieved only for re-encode.
5.4 DRM state
- Content keys live in an HSM-backed KMS, keyed by kid. They never leave the HSM in plaintext; they are wrapped under a per-license-request key.
- License policy in a small, high-QPS DB (Redis + Spanner backing for audit): which entitlements can view which kid, what HDCP level is required, offline flag, max duration.
- Session table (24h TTL) in Redis cluster: session_id → (user_id, device_id, region, issued_at, max_concurrent).
5.5 Recommendation serving
- Offline candidate index (500M users × 500 candidates each): Faiss-style ANN index on user embedding × title embedding, updated every 6h. Sharded by user-id, replicated 3×.
- Online features (last-N watched, time-of-day, device, current session context): feature store (Redis), millisecond reads.
- Ranker model: deep neural net, served by TF-Serving / TorchServe, 50ms p99 per request (for ~500 candidates).
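A toy end-to-end pass over the two stages — brute-force dot-product retrieval standing in for the Faiss/ScaNN index, and a linear stub where the DNN ranker would sit (shapes and weights are illustrative):

# Two-stage recommendation pass: candidate generation, then ranking.
import numpy as np

def recommend(user_vec, title_matrix, title_ids, online_features, k=500, n=40):
    # Stage 1: candidate generation — top-k titles by embedding similarity.
    scores = title_matrix @ user_vec                  # (num_titles,)
    cand_idx = np.argpartition(-scores, k)[:k]
    # Stage 2: ranking — a linear stub; production uses a DNN over session,
    # device, and time-of-day features from the feature store.
    ranked = sorted(cand_idx,
                    key=lambda i: 0.7 * scores[i] + 0.3 * online_features(title_ids[i]),
                    reverse=True)
    return [title_ids[i] for i in ranked[:n]]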
5.6 Analytics (QoE)
- Kafka firehose (~4M events/s across QoE + heartbeat + engagement).
- Real-time path: Flink → per-title rebuffer-ratio gauge → SLI dashboards, alerting.
- Batch: hourly → warehouse (BigQuery/Snowflake) for ladder tuning, recommendation training.
- Never on the playback decision path.
5.7 Engine choice — why each pick
| Layer | Choice | Rejected | Why |
|---|---|---|---|
| Catalog | Spanner | DynamoDB, Postgres, MongoDB | Global consistency for embargoed drops; transactional multi-row writes on publish |
| Watch-history | Cassandra | Spanner, Redis persistent | Write volume; single-partition reads; eventual consistency OK ("continue watching" can be ~15s stale) |
| Segments | Object store + CDN | Self-hosted file servers | Bandwidth economics impossible without CDN; object store's PUT-once semantics align with immutable content model |
| Session cache | Redis cluster | Memcached | Atomic scripts for concurrent-stream policy check; pub/sub for invalidation |
| DRM keys | HSM-backed KMS | Pure SW key store | Compliance (studio contracts demand hardware) and theft-in-memory resistance |
| Analytics | Kafka → Flink + warehouse | Direct-to-DB | Volume + downstream branching (real-time + batch + ML) |
| Rec candidates | Faiss / ScaNN + feature store | Cassandra for embeddings | Vector search sublinear in candidate count |
6 System Diagram (Centerpiece — two planes) #
6.1 Top-level: Ingest + Delivery + Control planes
╔══════════════════════════════════════════════════════════════════════════════════════════════════╗
║ CONTROL PLANE (global) ║
║ ║
║ ┌──────────────┐ ┌───────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ║
║ │ Catalog Svc │ │ Entitlement │ │ DRM License │ │ Reco Svc │ │ Analytics/ │ ║
║ │ (Spanner) │←→ │ Svc (Spanner) │←→ │ Svc (KMS/HSM)│ │ (Faiss+ANN │ │ QoE Pipeline │ ║
║ │ titles, rungs│ │ user sub+geo │ │ per-session │ │ +DNN ranker) │ │ (Kafka/Flink)│ ║
║ └───┬──────────┘ └───────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────▲───────┘ ║
║ │ pub-sub │ │ │ │ ║
║ │ invalidate │ per-session │ session+policy │ features+ │ QoE ║
║ │ │ checks │ │ candidates │ events ║
╚══════╪════════════════════════╪═════════════════╪══════════════════╪═══════════════════╪═════════╝
│ │ │ │ │
▼ ▼ ▼ ▼ │
┌────────────────────────────────────────────────────────────────────────────────────┐ │
│ DELIVERY PLANE (multi-region) │ │
│ │ │
│ User ──DNS/GSLB──▶ ISP-Embedded Appliance (Tier 0, 75% hit) │ │
│ │ │ miss │ │
│ │ ▼ │ │
│ │ Regional CDN PoP (Tier 1, 20% hit) │ │
│ │ │ miss │ │
│ │ ▼ │ │
│ │ Origin Shield (absorbs cache-miss herds; §7b) │ │
│ │ │ miss │ │
│ │ ▼ │ │
│ │ Regional Origin (5% traffic) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ Object Store (S3/GCS) — authoritative source │ │
│ │ │ │
│ │ ┌─────────────── Manifest/Session/License/Heartbeat/Reco APIs ──────────┐ │ │
│ └─▶│ served by same PoP edge; manifest cached briefly; others dynamic │────┼──▶│
│ └────────────────────────────────────────────────────────────────────────┘ │ analytics
└─────────────────────────────────────────────────────────────────────────────────────┘
▲ authoritative segment upload (cross-region replication)
│
╔══════════════════════════════════════════════════════════════════════════════════════════╗
║ INGEST PLANE (regional) ║
║ ║
║ Studio ║
║ │ mTLS, resumable multipart (100MB parts, SHA256 end-to-end) ║
║ ▼ ║
║ ┌────────────┐ ┌──────────────┐ ┌────────────────┐ ┌──────────────┐ ┌─────────────────┐ ║
║ │ Upload API │→│ Mezzanine │→│ Transcode │→│ QC │→│ DRM Packager │ ║
║ │ (chunked) │ │ Object Store │ │ Farm │ │ (VMAF+BSFormat│ │ (CENC CMAF; │ ║
║ │ │ │ (EC 10+4) │ │ (K8s+FFmpeg+ │ │ +audio+subs) │ │ Widevine/FairPlay│ ║
║ │ authn, │ │ Glacier DA │ │ AV1/HEVC/AVC; │ │ per rung+per │ │ /PlayReady) │ ║
║ │ resumable │ │ for masters │ │ GOP-chunked, │ │ segment VMAF)│ │ key refs to HSM │ ║
║ │ put_part │ │ │ │ DAG per title) │ │ │ │ │ ║
║ └────────────┘ └──────────────┘ └────────┬───────┘ └──────┬────────┘ └────────┬─────────┘ ║
║ │ │ │ ║
║ ▼ ▼ ▼ ║
║ Encoded Segment Object Store (authoritative) ║
║ │ ║
║ ▼ ║
║ Catalog Publish (transactional flip in Spanner) ║
║ │ ║
║ ▼ ║
║ Origin Replication: cross-region + Tier 0 PUSH ║
║ to ISP appliances per release schedule (§7b) ║
╚══════════════════════════════════════════════════════════════════════════════════════════╝
6.2 ABR playback sub-diagram (zoom)
Client Edge PoP DRM Svc Control
│ │ │ │
│───session_start──────────────▶│ │ │
│ │──verify entitlement─────┼───────────────▶│
│ │◀─ok, session_token──────┼────────────────│
│◀─session_token─────────────────│ │ │
│ │ │ │
│───GET /manifest/{title}──────▶│ (edge cache, 60s) │ │
│◀─manifest (DASH/HLS, per- │ │ │
│ session signed URLs) │ │ │
│ │ │ │
│── POST /drm/license ─────────▶│──────────────────────────▶│ │
│ │ │──HSM wrap──────│
│◀─license (wrapped content key)│◀─license─────────────────│ │
│ │ │ │
│── GET /s/{title}/{profile}/ │ (Tier0→Tier1→Shield→Ori) │ │
│ /rung0/seg0.m4s ─────▶│ │ │
│◀─segment (4 MB, 2s content)───│ │ │
│ │ │ │
│──(ABR decision: throughput │ │ │
│ + buffer → choose rung) │ │ │
│ │ │ │
│── GET /s/.../rung4/seg1 ────▶│ │ │
│◀─segment──────────────────────│ │ │
│ ... │ │ │
│ │ │ │
│── POST /heartbeat (10s) ────▶│───async Kafka──────────────────────────────▶
│ │ │ │
6.3 Flash-event pre-warm sub-diagram (§7a, preview)
T−24h (before drop) Origin Regions ──── PUSH ────▶ Tier0 Appliances (ISP)
                      (10k appliances × ~140 GB per title (all codecs) ≈ 1.4 PB to distribute)
→ over managed-overnight peering windows
→ idempotent, signed, verified by content-hash
T-1h Flip catalog status in Spanner: READY
Pub-sub to edge PoPs: pre-load manifest
T=0 (global drop) Tier-0 already has the title: 75% traffic served
with zero cold-miss. Tier-1 warms in first ~2 min
organically.
Rebuffer ratio curve: smooth, no spike.
Every labelled arrow maps to §4 or §5:
| Arrow | API / Data | Reference |
|---|---|---|
| Client → Edge: session_start | §4.2 POST /v1/session/start | Validates entitlements against §5.2 |
| Client → Edge: /manifest | §4.2 GET /v1/manifest/... | Signed manifest from §5.3 |
| Client → DRM Svc: license | §4.2 POST /v1/drm/license | Key wrap from §5.4 HSM |
| Client → Edge: /s/... segment | §4.3 segment URL | Immutable object in §5.3 |
| Client → Edge: heartbeat | §4.2 POST /v1/heartbeat | Ingests to §5.6 Kafka |
| Studio → Ingest: upload | §4.1 multipart | Lands in §5.3 mezzanine bucket |
| Transcode → Segment store | internal | §5.3 encoded segment bucket |
| Catalog publish | §4.1 PublishTitle | Flips status → READY in §5.1 |
| Origin → Tier0 PUSH | internal sync loop | §7a |
| Heartbeat → Kafka | async | §5.6 analytics pipeline |
7 Deep-Dive: Four Critical Topics #
7a. CDN strategy: why we build (Open Connect-style) custom appliances, with rejected alternatives and bandwidth math
Why critical. Egress is 95% of our cost (§3h). Every 1% shift from commercial-CDN to ISP-peered appliance saves ~$25M/yr at our scale. The CDN decision is the defining architectural choice for a video streamer, far more than storage or transcode. Candidates who say "use a CDN" without decomposing lose the L7 point here.
The decision space:
| Option | Who owns | $/GB egress | Placement | Pre-warm cost | Operational burden | Unit economics at 500 Tbps |
|---|---|---|---|---|---|---|
| A. Pure 3rd-party CDN (Cloudfront/Akamai/Fastly) | Them | $0.02–0.08/GB retail, $0.005–0.015 on huge contracts | Their PoPs | They pay their origin | Minimal | ~$8–25M/day unmitigated; $2–4M/day w/ contracts. Still $700M–$1.4B/yr. |
| B. Pull-CDN + origin shield | Them + us | Same as A, less shield-level re-fetch | Their PoPs | Small (their cache fills on first miss) | Low | Saves 5–10% vs A by reducing shield-miss traffic. Still billions. |
| C. Multi-CDN (A+B+C with steering) | Them × N | Weighted; renegotiation leverage ~−20% | Their PoPs | Double | Medium — need a steering layer (e.g., Cedexis/NS1) | Saves another ~15% via competitive pricing. $600M–$1.2B/yr. |
| D. Custom appliances at ISP (Netflix Open Connect / YouTube Google Global Cache) | Us, colocated at ISPs | Effectively peering-cost + capex ≈ $0.001–0.003/GB amortized | Inside ISP networks, one hop from eyeball | Huge — must PUSH title 24h pre-drop | High — hardware fleet of 10k+ appliances; fleet operations | $200–400M/yr all-in (capex amortized + peering + staff). 60–75% savings vs 3rd-party. |
| E. Custom appliances at our own IXPs only (no ISP embedding) | Us | ~$0.005/GB | Regional | Medium | Medium | ~$500M/yr. Middle ground. |
Chosen: Hybrid — D (for head 75%) + B (Tier 1 pull-CDN commercial) + C fallback (multi-CDN) for spillover.
Why hybrid and not pure D: tail catalog (bottom 70% of titles by watch-share) is 25% of bytes; pushing every title to every appliance would require roughly 2× the appliance storage for bytes that are rarely watched — worse unit economics than renting pull-CDN capacity for the tail. We break even at roughly a Pareto threshold: if the top 10% of titles are ~75% of watch time (typical), push those + hot new releases; pull the rest.
The push schedule — why 24h pre-drop is non-obvious:
Naive argument: "CDN caches warm on demand, no pre-warm needed." That's wrong at our scale because:
- 100M viewers start within ~5 minutes of a global drop.
- Miss-rate at cold edge is ~100%. First-miss flows to shield → origin.
- Even if Tier-1 CDN has 500 PoPs, each miss on a new title triggers ~500 origin fetches (one per PoP) before the PoP's local cache fills.
- With 3-codec × 15-rung × 2,700 segments per title, a cold start is ~500 PoPs × 15 rungs × maybe 10 segments in-flight each = 75,000 origin-bound requests in the first minute, per title — and we may have 5 titles dropping a week.
- That load on origin is tolerable; the bigger pain is the cache-miss latency spike to viewers, breaking the p99 <2s startup SLO.
- We'd need origin bandwidth = ~5% of peak = 25 Tbps just to survive the first 5 min, built for usage we only hit at drops. Wasteful.
Pre-push insight (earned-secret depth): "Netflix pushes content to ISP-embedded Open Connect Appliances during the ISP's off-peak window (2am–6am local) in the 24h before a major release, over managed peering. This moves the bottleneck from CDN-bandwidth-under-flash-load to ingest-side replication scheduling. The second-order effect: our transcode deadline backs up by 24h, which forces us to demand mezzanine masters from studios 72h pre-release instead of 48h — a business-logistics change driven by cache architecture." Pure pull-CDN can approximate this by synthetic-traffic pre-warming (send fake fetches from every PoP) + origin shielding, but (a) you pay commercial CDN BW for the synthetic traffic (often >$100k for one title's pre-warm), (b) you still can't pre-warm ISP-level appliances you don't own, and (c) you measure the pre-warm effectiveness only by watching whether your rebuffer ratio spikes at T=0 — a lagging metric with a reputational cost.
Push architecture (the hard parts):
- Schedule: per title, compute placement plan: which appliances get which rungs. Not every rung to every appliance — Tier 0 appliances get only the top ~5 rungs (HEVC/AV1 at 1080p+4K), since ISP subs skew high-bandwidth; AVC low rungs remain at Tier 1.
- Deduplication: content is chunked at 4MB granularity with content-addressable hashes; re-pushing a title with updated metadata reuses unchanged chunks. ~90% reuse across re-encodes.
- Flow control: push rate-limited per-ISP to their capacity. Each appliance has a 1–4 TB SSD + spinning disk; we prioritize push by "predicted watch share in the next 7 days" as scored by a dedicated ML model, evicting lowest-share content first.
- Verification: appliance signs a per-chunk receipt back to origin; per-appliance readiness monitor fails a drop (rolls back to T+24h) if <95% of appliances are ready at T−1h.
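A sketch of the content-addressed chunking that makes the dedup and per-appliance delta push above work (4 MB granularity; names are illustrative):

# Content-addressed 4 MB chunks; an appliance only fetches hashes it lacks.
import hashlib

CHUNK = 4 * 1024 * 1024

def chunk_hashes(path: str) -> list[str]:
    out = []
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            out.append(hashlib.sha256(block).hexdigest())
    return out

def plan_push(title_chunks: list[str], appliance_has: set[str]) -> list[str]:
    """Return only the chunks this appliance is missing — re-pushing a re-encoded
    title reuses the ~90% of chunks that didn't change."""
    return [h for h in title_chunks if h not in appliance_has]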
Real systems named: Netflix Open Connect (OCA hardware, FreeBSD-based), YouTube Google Global Cache (GGC), Disney's BAMTech (pull-CDN heavy with aggressive shielding), Akamai's managed CDN for many mid-tier streamers, Fastly's edge compute for manifest customization.
Failure modes:
- OCA at ISP crashes / disk fails: DNS/GSLB steers viewers to the next-nearest (regional Tier 1). Detected in <30s via PoP-level health-check. Blast radius = that ISP until repair (replaced within 72h via logistics). Mitigation: over-provision each metro with ≥2 appliances so one failure = graceful degradation to Tier 1.
- Push window missed pre-drop: monitored at T−6h; if <90% readiness, page the release team. Options: (a) delay the drop in that region (contractually permitted for some titles, not others), (b) push from Tier-1 PoPs in-line (drops 75% of hits down to 20%, raises egress cost and risks rebuffer). Big red switch: "activate Tier 1 fallback posture" — doubles expected cost for 24h, absorbs the flash.
- Bad encode pushed: content version is included in the object key; roll forward by publishing a new version and invalidating the old (pull-CDN model), or (for pushed titles) push v2 → flip manifest → leave v1 for in-flight sessions (bounded TTL).
Bandwidth sanity check (closes the loop with §3b):
- 500 Tbps peak × 0.75 Tier 0 = 375 Tbps / 10,000 appliances = 37.5 Gbps/appliance peak. Each appliance has 2×100G NICs — comfortable headroom.
- 375 Tbps × 3600 s = ~168 PB/hour egress at peak, out of ISP-peered appliances at ~$0.001/GB. Peak hour ≈ $168k; day average ≈ $1.5–2M/day. Annualized: ~$700M for Tier 0 alone, consistent with the ~$2.5B/yr total egress number in §3h (the other tiers, DRM/control, origin, manifest, and recs make up the rest).
7b. Transcoding pipeline at scale: per-title encoding, GOP-parallel, per-rung VMAF targeting
Why critical. Transcode is where encoding quality × bandwidth cost × startup latency collide. A 20% bitrate reduction at equal VMAF saves $500M/yr in egress at our scale. It is worth burning 10× encode compute to find it.
The naive approach is a fixed bitrate ladder (e.g., {240p@400k, 360p@700k, 480p@1.2M, 720p@2.5M, 1080p@5M, 4K@15M}). Apply it to every title. Simple, fast (~1× realtime).
What's wrong with that:
- A talking-heads documentary needs 40% less bitrate than an action movie to hit the same perceptual quality.
- A dimly-lit horror film benefits from HDR + 10-bit much more than a cartoon, and its perceptually-optimal 1080p rung is at ~3 Mbps, not 5 Mbps.
- Serving the naive ladder wastes ~25% of bitrate on average.
Per-Title Encoding (PTE), chosen approach:
Step 1: Convex-hull analysis. Encode the title at ~30 (resolution, QP) points — a grid of candidates. For each point, compute VMAF and bitrate. Plot (bitrate, VMAF); the upper-left-outward frontier is the convex hull of efficient rungs.
Step 2: Select rungs from the hull. Target specific VMAF buckets (e.g., 65, 75, 85, 93, 97) and pick the minimum-bitrate point achieving each. Per-title ladder has 5–8 rungs, non-uniform per title.
Step 3: Encode the chosen rungs fully, with the quality preset tuned per codec.
Cost: ~5× more CPU-hours than naive (the pre-analysis dominates), buying ~20–25% bitrate savings on average. At 500 Tbps × $0.003/GB effective egress, 20% savings = $500M/yr. Compute cost to achieve it: ~5× analysis overhead at roughly $30 of compute per hour of content × 50 content-hrs/day × 365 ≈ $3M/yr. ROI is ~150×.
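A compact sketch of steps 1–2, assuming the ~30 trial encodes have already produced (bitrate, VMAF) pairs; this builds the efficient frontier and picks the minimum-bitrate rung for each VMAF bucket (illustrative, not the production optimizer):

# Per-title rung selection from trial-encode measurements.
def efficient_frontier(candidates):
    """candidates: list of (bitrate_bps, vmaf). Keep points no cheaper candidate
    beats on VMAF — the convex-hull-ish upper-left frontier."""
    frontier, best_vmaf = [], -1.0
    for br, vmaf in sorted(candidates):          # ascending bitrate
        if vmaf > best_vmaf:
            frontier.append((br, vmaf))
            best_vmaf = vmaf
    return frontier

def pick_ladder(candidates, targets=(65, 75, 85, 93, 97)):
    frontier = efficient_frontier(candidates)
    ladder = []
    for t in targets:
        hit = next(((br, v) for br, v in frontier if v >= t), None)
        if hit and hit not in ladder:            # minimum-bitrate point reaching the bucket
            ladder.append(hit)
    return ladder                                # 5–8 non-uniform rungs per title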
GOP-parallel encoding (implementation):
- Split each master into 2–4 s closed-GOP chunks (split at I-frames; the encoder is configured to force IDR at chunk boundary).
- Dispatch chunks as independent jobs to a K8s batch fleet (or AWS MediaConvert-like managed). Each chunk encoder is FFmpeg with preset tuned.
- Reassemble at the packager stage (CMAF fragments already have natural segment boundaries, often aligned with the GOP).
- Why: serial encode of a 90-min film at AV1 quality is ~60× realtime = 90 hours. GOP-parallel with 1000 chunks → <10 min wall-clock for the same film.
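A sketch of that chunked dispatch, shelling out to FFmpeg per chunk and fanning out across a local process pool (flags are representative, not a tuned production command; the real farm would dispatch to K8s jobs instead):

# GOP-parallel chunk encoding: carve the mezzanine into ~4 s pieces, encode concurrently.
import subprocess
from concurrent.futures import ProcessPoolExecutor

CHUNK_S = 4

def encode_chunk(src: str, start_s: int, idx: int, bitrate: str = "5M") -> str:
    out = f"/tmp/chunk_{idx:05d}.mp4"
    subprocess.run([
        "ffmpeg", "-y", "-ss", str(start_s), "-t", str(CHUNK_S), "-i", src,
        "-c:v", "libx265", "-b:v", bitrate,
        "-force_key_frames", "expr:eq(n,0)",     # closed GOP at the chunk boundary
        "-an", out,                              # audio is encoded separately
    ], check=True, capture_output=True)
    return out

def encode_title(src: str, duration_s: int, parallelism: int = 64) -> list[str]:
    starts = range(0, duration_s, CHUNK_S)
    with ProcessPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(encode_chunk, src, s, i) for i, s in enumerate(starts)]
        return [f.result() for f in futures]     # reassembled by the CMAF packager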
Per-segment VMAF as QC gate:
- Every encoded segment has its VMAF measured against the reference master-segment.
- If any segment's VMAF falls more than ~3 points below its rung's target, the encode fails QC and the job re-runs with a higher bitrate budget. Catches encoder corner cases (high-motion scene collapse, grain preservation).
- Measured VMAF is stored as object-store metadata (§5.3) for observability and client-analytics pairing.
DRM packaging (single-encode-many-DRM, CENC CMAF):
- Encode once per codec/rung; package into CMAF fragmented-MP4 with CENC (Common Encryption) — in practice the cbcs (AES-CBC) scheme, since FairPlay requires it and modern Widevine/PlayReady clients accept it — so the same encrypted bytes are decoded by all three DRM systems (Widevine via CDM, FairPlay via StreamingKeyDelivery, PlayReady via PRO header).
- Saves ~3× storage and serving cost vs. per-DRM encoding.
- Real systems: every modern streaming service uses this; bento4 + shaka-packager are the reference tools.
Codec choice math:
| Codec | BW saving vs AVC | Encode cost vs AVC | Device coverage (our MAU) | Verdict |
|---|---|---|---|---|
| H.264/AVC | baseline | baseline | ~100% | Must have |
| HEVC/H.265 | −30–40% BW | 5× CPU | ~90% (iOS, newer Android, most TVs) | Chosen |
| AV1 | −30% vs HEVC, −50% vs AVC | 20–40× CPU | ~40% (Chrome, newer Android, 2023+ TVs) | Chosen for top rungs where savings compound (e.g., 4K) |
| VP9 | ≈HEVC | 10× CPU | ~60% (no iOS) | Skipped — AV1 eclipses it going forward |
Chosen: triple-codec (AVC + HEVC + AV1) with client-side capability negotiation in manifest. Top rungs (4K, 1080p) get AV1 for devices that support it; AVC always included as baseline. Per the math above, AV1 pays back its encode cost in ~1 week of serving at our scale.
Real systems: Netflix's Dynamic Optimizer (PTE grandfather); YouTube's similar Videogen; FFmpeg/libaom/x265 as encoders; AWS MediaConvert / Elemental for managed alt; Bitmovin for 3rd-party.
Failure modes:
- Encoder crash mid-job: K8s restarts the chunk; idempotent chunk IDs ensure no duplicate writes. Wall-clock slips by minutes.
- Bad mezzanine (corrupt, wrong color space, silent audio): QC catches, encode marked FAILED, studio notified, release blocked. Pre-ingest validator detects most of these in the first ~10s.
- Ladder misconfig (bitrate too low): per-segment VMAF gate fails the encode. Regression test on previously-encoded titles catches per-title optimizer regressions.
7c. DRM license serving at 300k licenses/s under flash load
Why critical. A DRM outage == no playback. Unlike most services where failure degrades, a DRM fault is absolute: client cannot decrypt, black screen. It is also the component least visible to casual candidates — "add DRM" hand-waves over its hardest parts.
The load shape:
- Steady-state 37k licenses/s (§3f).
- Flash: major release globally, 100M viewers starting within 5 min. Peak license request rate ≈ 300k/s for 5 min.
- HSM ops are serial and ~1ms per op; one HSM = ~1000 ops/s. Need 300 HSMs to absorb burst — expensive, ~$20k each amortized.
The hot-path breakdown per license:
1. Parse client challenge (from the CDM); extract requested kids and device cert.
2. Validate entitlement (subscription, region, parental, concurrent-streams) — Spanner or cached in Redis.
3. Derive policy (HDCP level, playback duration, offline flag).
4. Wrap content key under device session key (HSM op).
5. Sign response.
Optimization: move steps 1–3 and 5 to app servers; only step 4 on HSMs. That's obvious. The earned-secret optimization:
Proxy re-encryption / session-key intermediate (chosen):
- HSMs pre-wrap content keys under regional session master keys at ingest time (not per-license). Regional session master key rotates daily.
- At license time, app server wraps from regional-session-master-key to device-session-key in software (not HSM). Standard AES-KW is ~100 ns in software.
- HSMs are used only for (a) rotating the regional-session-master-key daily, and (b) attesting the rotation via signed chain.
- Security argument: an app-server compromise leaks at most one day's worth of keys, and only for content the compromised server served. A full HSM extraction still requires HSM breach.
- Throughput: app servers do millions of ops/s; HSMs never saturate.
Trade-off: a small reduction in HSM-derived security guarantee (keys exit the HSM under regional-session wrap). For our threat model (professional pirate, not state actor) this is acceptable and it's what high-scale streamers actually do. It's also what Widevine L3 effectively does in its server-side extraction model.
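A minimal sketch of the two-level wrap under those assumptions, using RFC 3394 AES key wrap from the `cryptography` package (key sizes and names are illustrative; the HSM boundary is implied, not shown):

# Two-level wrap: HSM pre-wraps content keys under a daily regional session master
# key (RSMK); the app server re-wraps to the device session key in software.
import os
from cryptography.hazmat.primitives.keywrap import aes_key_wrap, aes_key_unwrap

def prewrap_content_key(rsmk: bytes, content_key: bytes) -> bytes:
    # Ingest time, behind the HSM: wrap each content key under today's RSMK.
    return aes_key_wrap(rsmk, content_key)

def issue_license_key(rsmk: bytes, prewrapped: bytes, device_session_key: bytes) -> bytes:
    # License time, app-server hot path: software unwrap + re-wrap, no HSM call.
    content_key = aes_key_unwrap(rsmk, prewrapped)          # RSMK cached for the day
    return aes_key_wrap(device_session_key, content_key)    # goes into the license blob

rsmk = os.urandom(32)   # rotated daily by the HSM, attested via a signed chain
ck = os.urandom(16)     # per-title content key (kid -> key)
wrapped_for_device = issue_license_key(rsmk, prewrap_content_key(rsmk, ck), os.urandom(32))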
Concurrent-stream enforcement:
- Every license issuance creates a row in a Redis "active sessions" set per user (TTL = 60 min, refreshed by heartbeat).
- Cardinality check before issuing: SCARD active_sessions:{user} < max_concurrent_streams. Atomic Lua to prevent a race.
- If over limit, kick the oldest session (server issues a "stop" to that device via a push channel).
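A sketch of that atomic check as a redis-py registered Lua script — admission refusal shown; kicking the oldest session would be a follow-on step (key naming and TTL mirror the bullets above):

# Atomic concurrent-stream check: count-and-add cannot race across license servers.
import redis

r = redis.Redis()

CHECK_AND_ADD = r.register_script("""
local n = redis.call('SCARD', KEYS[1])
if n >= tonumber(ARGV[1]) then return 0 end
redis.call('SADD', KEYS[1], ARGV[2])
redis.call('EXPIRE', KEYS[1], 3600)   -- refreshed by heartbeats
return 1
""")

def admit_session(user_id: str, session_id: str, max_streams: int = 4) -> bool:
    key = f"active_sessions:{user_id}"
    return bool(CHECK_AND_ADD(keys=[key], args=[max_streams, session_id]))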
Offline license:
- Different policy: offline=true, duration=48h, key_lifetime=30d; the policy is embedded in the license itself so client enforcement is self-contained.
- Stored in a separate table for revocation lineage; license servers can issue revocation on subscription cancel.
Failure modes:
- HSM outage: regional session master keys cached server-side for 24h (refreshed daily); software path keeps serving. Degraded mode: cannot rotate keys; 24h SLO on restoration.
- Cross-region license server outage: GSLB fails over to nearest healthy region. Add ~50ms latency. SLO met.
- License theft (stolen subscription): revocation list; license servers refuse to issue for revoked device. Existing licenses expire in 1–60 min depending on policy.
- Big red switch: "serve cached license for last 24h's titles to any active session" — degrades to no enforcement for ongoing sessions (studio contracts permit this for ≤1h as emergency).
Real systems named: Google Widevine, Apple FairPlay Streaming, Microsoft PlayReady, Amazon's BuyDRM integrations; HSMs from Thales/Luna, AWS CloudHSM, Google Cloud HSM; reference packagers: bento4, shaka-packager.
7d. ABR + rebuffer-ratio optimization: chosen rung-per-segment math
Why critical. Rebuffer is the #1 signal for viewer abandonment (every 1% rebuffer ratio ≈ 2% reduction in minutes viewed industry-wide). ABR is entirely client-side — our server role is to make the client's job easy: smart ladder, accurate manifests, low-variance segment sizes, CDN reliability. The interviewer probe specifically calls out p99 <2s startup and <0.5% rebuffer.
The main ABR algorithm families:
| Algorithm | Decision input | Rebuffer resilience | Bitrate efficiency | Complexity |
|---|---|---|---|---|
| Throughput-only (BBA, classic) | EMA of segment throughput | Poor on oscillating network | Ladder-greedy | Simple |
| Buffer-based (BOLA) | Buffer occupancy | Excellent (self-stabilizing on buffer) | Slightly under-utilizes BW when buffer deep | Medium |
| Hybrid MPC (BOLA-E, RobustMPC) | Buffer + throughput + forecast | Best empirically; ~20% less rebuffer than BBA | Best empirically | Higher |
| ML-driven (Pensieve, Puffer) | Multi-feature NN | Promising, uneven in field | Variable | Highest; model retraining pipeline |
Chosen: RobustMPC / BOLA-E hybrid, client-side. Server role: ensure the manifest advertises a ladder that lets the algorithm make good decisions.
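To make the division of labor concrete, a deliberately simplified rung-choice heuristic in the spirit of a buffer + throughput hybrid — not the published BOLA-E/RobustMPC algorithms, just the flavor of the decision the client makes per segment:

# Pick the highest rung whose next segment can download before the buffer
# (minus a reserve) would drain. Parameters are illustrative.
def choose_rung(ladder_bps, throughput_ewma_bps, buffer_s,
                segment_s=4.0, reserve_s=10.0, safety=0.8):
    best = ladder_bps[0]                              # always have a floor rung
    for rung in ladder_bps:                           # ladder in ascending bitrate order
        download_s = rung * segment_s / max(throughput_ewma_bps * safety, 1.0)
        if download_s <= max(buffer_s - reserve_s, 0) + segment_s:
            best = rung
    return best

# e.g. 18 s of buffer on an ~6 Mbps link picks the 5 Mbps rung, not 15 Mbps:
print(choose_rung([1e6, 2.5e6, 5e6, 15e6], throughput_ewma_bps=6e6, buffer_s=18))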
Server-side optimizations that impact rebuffer:
- Per-title ladder (§7b) — clients always have a good rung to step down to.
- Variance-controlled encode — we use capped-CRF (not strict CBR) so a segment's encoded size is within ±15% of its ladder-nominal bitrate. Predictable download times ⇒ predictable buffer state ⇒ fewer ABR oscillations.
- Segment size = 2–4 s — shorter = lower startup latency (first I-frame sooner) and faster rung switches; longer = better compression. Chosen: 4s for standard, 2s for live/LL-HLS. Per-segment VMAF gate ensures 2s segments don't collapse.
- Initial rung selection for startup:
- Client opens playback; first segment fetched must be small enough for p99 startup <2s.
- At p99, a weak connection is ~1.5 Mbps. A 4 s segment at 720p@2.5 Mbps is 1.25 MB → 6.7 s to download. Miss.
- Solution: the client fetches a short initialization-segment + the first video-segment at the lowest sensible rung (typically 360p/500k ≈ 250 KB, ~1.3 s on 1.5 Mbps), then ramps up within ~10 s.
- Manifest advertises #EXT-X-START (HLS) / Period@start (DASH) hints to steer the player's starting behavior.
- Prefetch the first segment on manifest fetch. Use HTTP/2 Server Push from the manifest endpoint: pushes the first init-segment before the client requests it. Cuts ~50–150 ms off startup. Real deployment: most CDNs (Akamai, Fastly) support this; CloudFront does not — we accept the skew.
- DRM license parallelism. Critical: the client must not serialize fetch-manifest → fetch-license → fetch-first-segment. Modern players (ExoPlayer, Shaka, AVPlayer) fetch manifest + init-segment + license concurrently once the manifest URL is known; the license is consumed when the first content segment arrives. This reduces time-to-first-frame from sum-of-three to max-of-three. We ensure our manifest carries the license URL in an early-parseable position.
Rebuffer budget math:
- Target rebuffer ratio <0.5%.
- Break down sources: (a) network oscillation unreached by ABR, (b) CDN miss storms, (c) DNS/TLS RTT spikes, (d) encoder stutters (pathological segment size).
- Allocate budget: (a) <0.2%, (b) <0.1%, (c) <0.1%, (d) <0.1%.
- For (b), this translates directly to a CDN miss-rate tolerance: at 99.5% hit rate and ~100ms miss penalty, rebuffer contribution = 0.005 × 100ms / 4000ms segment = 0.0125% of play time. Fine. We set a hit-rate alarm at 99% (2× our budget) as early warning.
The "first 2 seconds" optimization chain (earned-secret):
Startup p99 <2s is the hardest SLO. Breakdown of a cold play on LTE:
- DNS 50ms (DoH cached) → TCP+TLS 150ms (QUIC skips some of this) → session_start call 100ms → manifest fetch 80ms (edge-cached) → parse 20ms → license fetch (parallel with init-segment) 150ms → first video-segment fetch @ 500 kbps = 250ms → decoder init 100ms → first frame.
- Sum (serial): 900ms. With parallelism (license || init-segment), the critical path is ~800ms. With QUIC to shave TLS: ~700ms.
- p99 is worst-case: add 2× network variance, 500ms DNS fallback, TLS resumption failure → ~2000ms. We're dancing on the line. Every 50ms matters.
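A quick arithmetic cross-check of that breakdown (same numbers as above; the parallel path overlaps the license fetch with the first-segment fetch):

# Startup budget: serial sum vs. the parallel critical path.
steps_ms = {"dns": 50, "tcp_tls": 150, "session_start": 100, "manifest": 80,
            "parse": 20, "license": 150, "first_segment": 250, "decoder_init": 100}

serial = sum(steps_ms.values())                                          # 900 ms
parallel = serial - min(steps_ms["license"], steps_ms["first_segment"])  # license || segment
print(serial, parallel)  # 900 750 -> ~800 ms in practice with scheduling overhead; QUIC shaves more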
Per-second optimizations we enforce:
- TLS session resumption across requests within a play session.
- HTTP/3 (QUIC) for handset clients — saves one RTT for handshake.
- Manifest caching at edge with 60s TTL; invalidated on version bump.
- Fail-fast on segment errors: 5xx triggers an immediate retry to next-rung-down rather than exponential backoff, which blows the startup budget.
Failure modes:
- ABR algorithm bug ships to clients: rolling update gated by per-fleet QoE regression. If rebuffer ratio creeps up post-release in a cohort, auto-rollback.
- Encoder emits pathological segment sizes (e.g., high-motion scene at a low rung grossly over-sized): QC catches; per-segment size audit at ingest.
- CDN route flap: client falls back to next-best rung; sticky CDN selection per session to prevent flapping.
Real systems named: Netflix's Chunked Dash Optimizer, Twitch's low-latency HLS, MPEG-DASH with LL extension, Apple LL-HLS, BOLA/RobustMPC (CMU & MIT papers), Pensieve, Shaka Player, ExoPlayer.
8 Failure Modes & Resilience (pager-carryable) #
| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Tier-0 appliance at one ISP | HW fail, disk fail | appliance healthcheck 30s; PoP-level egress drop | ~5% of that ISP's viewers (if 2-appliance metro); 0% of others | DNS/GSLB auto-shifts that ISP's traffic to Tier 1 PoP; rebuffer ticks up <0.1% for duration | Replace hardware within 72h via logistics; re-sync via push |
| Whole Tier-1 PoP outage | Datacenter / carrier fault | Per-PoP RED metrics; external synthetic (Catchpoint) | 2–5% of regional egress | GSLB fails to nearest Tier 1 (adds ~10ms latency); Tier 0 continues serving | Restore PoP or reroute via peering |
| Origin region fails | Cloud region outage | Cross-region health; origin 5xx | 5% of traffic that was shield-missing to that region | Traffic shifts to next origin (cross-region replicated); shield-miss adds latency to first requests only | Failover complete in <5 min; full restoration per cloud SLA |
| Cache-miss storm (bad new release, push failed) | PoPs cold-miss to origin on new title | Origin bandwidth spike; rebuffer spike | Cold new release rebuffer >5% for ~5 min | Origin shield absorbs; emergency posture shifts to Tier 1 fallback with higher BW; traffic shaping if shield saturates | Complete pre-push retroactively; monitor |
| Transcoder job mid-crash | Node reboots mid-chunk | K8s pod lifecycle; chunk idempotency ID | One chunk slips 10 min | Auto-retry on another node; idempotent writes | Chunk re-encodes; title publish unaffected |
| DRM license server outage | Full regional DRM fault | License 5xx rate >1% in 30s | New sessions in region can't start; in-flight sessions run on cached licenses 15–60 min | GSLB to nearest healthy region; big red switch = "serve cached policy" fallback (opt-in studio-permitted titles only) | Restore region; queue of held sessions unblocks in <10 min |
| Recommendation service down | Ranker unavailable | Reco 5xx | Home page still loads with popularity-sorted fallback (from Spanner); personalized rows missing | Circuit-break reco; serve cached "yesterday's recs" per user from Redis | Ranker restore; recs resume; no permanent loss |
| Catalog DB (Spanner) unavailable | Spanner outage | Cache reads still serve; writes fail | New titles can't publish; no catalog updates; playback mostly unaffected (manifest cached) | Read-only mode: serve cached catalog; publish workflows pause | Restore Spanner; drain publish queue |
| Analytics pipeline lag | Kafka consumer backlog | Kafka lag alerts | Recs + QoE dashboards stale; playback unaffected | Pipeline is best-effort; drop events if lag >30 min to stop cascading | Scale consumers; lag recovers |
| Regulatory geo-block fails open (title served in embargoed country) | Misconfigured region list | Contract compliance monitor; studio complaint | Legal/contractual risk, not technical | Immediate catalog flip to remove region; audit; incident report to studio | Within minutes; legal follow-up |
| Studio-pushed bad master | Wrong color space, corrupted audio | Ingest validator + VMAF QC | Publish blocked; studio notified | Pre-ingest checks; explicit failure reasons | Re-ingest |
| CDN auth-token leak (pirate distribution) | Tokens circulating | Anomaly detection on session/token ratios; geographic heatmap anomalies | Piracy; revenue leak | Rotate signing key, invalidate all outstanding tokens; re-auth all users in region (brief pain) | Forensic trace; possibly escalate DRM policy |
| Client device bug causes login stampede after app update | Millions of simultaneous retries | Session_start RPS spike | Manifest API saturates; startup times degrade | Rate-limit at gateway + exponential backoff recommended in client; app-update rollback if severe | Coordinate with client team |
9 Evolution Path #
v1 (ship in 3 months — MVP regional launch):
- Single origin region (us-east) with one cloud provider's object store.
- Pull-CDN from one 3rd-party (CloudFront or equivalent) for delivery.
- H.264 only, 5-rung ladder, fixed per-title.
- HLS only (defer DASH); Widevine + FairPlay (PlayReady can wait).
- No per-title encoding; fixed ladder.
- Watch-history in Postgres (not yet Cassandra — scale not there).
- Popularity-sorted "recs" from SQL aggregation, no ML.
- No ISP appliances; no cross-region origin replication.
- Goal: prove functional correctness, ~1M subs in one region.
v2 (ship in 9 months — 10× scale, multi-region, multi-DRM):
- 3 origin regions; cross-region async replication of encoded assets.
- Multi-CDN steering between 2–3 commercial CDNs (pull-based, origin shielding).
- HEVC added; per-title encoding for top 1000 titles.
- DASH added; PlayReady for Xbox/UWP.
- Watch-history migrates to Cassandra; Spanner for catalog.
- Two-stage recs: offline collaborative filtering + online popularity booster.
- DRM: HSM-backed keys, proxy re-encryption added for scale.
- Full ABR ladder; BOLA client shipped.
- Analytics: Kafka + Flink + BigQuery.
- Goal: 100M subs across 3 continents.
v3 (ship in 24 months — global, custom appliances, PTE, ML recs):
- Open Connect-style ISP-embedded appliances at top 100 ISPs globally.
- AV1 codec for top rungs; triple-codec serving.
- Per-title encoding across entire back catalog; convex-hull optimizer in production.
- Dynamic optimizer ML model trained on observed QoE.
- ML-ranker for recs (deep neural, real-time features).
- Regional DRM with proxy re-encryption; concurrent-stream enforcement via Redis.
- Flash-event pre-warm pipeline; T-24h push with placement optimizer.
- 12 origin regions; erasure-coded masters; Glacier deep archive for tail.
- Goal: global footprint, 300–500M MAU.
v4 (research / future):
- AV2 / LCEVC / low-power codecs as device support solidifies.
- Edge compute for personalized manifest (per-user ad insertion, regional splice).
- ML-driven ABR (Pensieve-class) shipped to clients where it beats MPC.
- End-to-end encrypted watch-history (no Netflix-server knows what you watched) for privacy-stringent regions.
- Live-VOD hybrid (live premieres that transition to VOD) with shared ingest pipeline.
- Foundation-model-assisted content understanding for better recs and subtitle generation.
10 Out-of-1-Hour Notes #
Codec selection economics. AV1's 30% BW reduction vs HEVC × $2.5B/yr egress = $750M/yr if we could serve only AV1. Device penetration gates this: we ship AV1 to supporting devices (40% of MAU today, climbing to ~80% by 2027). HEVC serves the next ~50%, AVC baseline covers the rest. VP9 abandoned — AV1 eclipses it going forward and patent landscape is cleaner. LCEVC is a "scalability enhancement layer" (delta over a base layer) interesting for bandwidth-constrained markets but ecosystem tooling is thinner.
Subtitles & multi-audio tracks. CMAF allows subtitle and audio tracks in separate representations sharing the same CENC encryption. Storage cost modest; serving cost a rounding error. For accessibility: we require closed-captions on every title (US ADA, EU EAA compliance). Audio description tracks for visually impaired (separate audio representation). Lyrics / karaoke: subset.
Kids content compliance. COPPA in the US, GDPR-K in EU; kids profile must not send personalized recs based on adult viewing history. Implementation: kids profiles share a user_id with the account but have a profile_type=kids flag that routes to a restricted reco model and filters catalog to rated-for-kids. Geo-restrictions also apply (some content is kids-ok in country A, not country B).
Pre-release embargo. Contractual with studios; title exists in the system in DRAFT for days-to-weeks, with encoded assets pre-pushed to edge but manifest unavailable. PublishTitle at the drop moment flips a single bit in Spanner — because assets are already in place, the drop is effectively instantaneous. Crucial: audit log at publish time, signed by the release manager; on-call during the drop.
Live-streaming sub-system (separate). Architecturally distinct from VOD. Ingest via RTMP/SRT/WebRTC to regional ingest gateways; re-encode in real-time with 2–4s latency (LL-HLS) or sub-second (WebRTC for ultra-low-latency, e.g., sports bet interactions). Separate origin servers optimized for hot-cache-only; separate CDN profile. Shares the playback client, DRM, catalog metadata. Biggest operational difference: live has no re-try opportunity — the segment either reaches the viewer in time or it's stale forever. Chunked transfer encoding + LL-HLS push allows ~1-2s glass-to-glass.
Ad insertion (SSAI / CSAI). Server-Side Ad Insertion rewrites the manifest per-session, splicing pre-roll/mid-roll/post-roll ad segments. Integrates cleanly with our manifest-per-session model. Client-Side: simpler but ad-blockable; SSAI is preferred for mandated ads. Sub-components: ad decision service, ad-creative CDN, session-bound manifest generator. This is a ~6 engineer-year project on top of core platform.
Observability gold standards (playback SLOs).
- Per-session QoE events piped from client: startup time, rebuffer events (count + duration), bitrate switches, errors.
- SLI/SLO/error budget per playback SLO:
- Startup p99 <2s, budget 10⁻³ sessions/month exceed 5s (~500k sessions budget at 500M MAU).
- Rebuffer ratio <0.5%, budget: 5% of sessions exceed 2% rebuffer ratio.
- Playback availability 99.99%.
- Error budget consumption dashboards per region, per device-class (mobile, TV, desktop), per title.
- Per-title QoE surfaced to content ops: if a new release has anomalously high rebuffer in region X, investigate (bad edge push? bad encode on a specific rung?). Content quality is as operational as compute is.
- CDN hit-rate per PoP per title — leading indicator for cache-miss storms.
Piracy / DRM escape hatches. Determined attackers will screen-cap; HDCP 2.2 tries to stop HDMI capture. Camcording a screen is the unavoidable low-bar attack. We don't chase screen-cappers; we chase bulk subscription-sharing (concurrent-stream enforcement), token replay (session binding, rotating CDN signing keys), and key extraction (HSM + hardware DRM). Fingerprinting (steganographic watermark per-user) for post-leak attribution is an open research area — Netflix uses forensic watermarking on premium content; we could add as v3+.
Peering & anycast. Tier-1 and Tier-2 PoPs connect to ISPs via settlement-free peering wherever possible. Anycast for manifest/session/license APIs (HTTP control plane) uses BGP to send traffic to nearest PoP — cuts ~10–50ms per RPC. Segments are unicast via content-aware routing (nearest PoP with cache).
Observability of appliances (Tier 0). Each OCA-equivalent phones home every 60s: disk health, cache hit-rate, egress Gbps, CPU, temperature. A fleet of 10k appliances at 60s cadence is ~170/s of telemetry — trivial. Alerts on: egress drop (appliance failing silently), cache miss-rate spike (likely push failed), disk predicted to fail.
Privacy & data residency. Watch-history and entitlement records are PII in many jurisdictions. EU DPA mandates data localization — watch-history for EU users kept in EU-region Cassandra clusters, no replication to non-EU. Catalog metadata is not PII, replicates globally. Request-ID sampling for debugging must be GDPR-compliant (no bulk access outside justified investigations).
Testing specifics.
- Multi-region failover chaos drills monthly.
- Synthetic playback from 50+ geographic probes continuously; any p99 startup drift alerts.
- Pre-release push rehearsals: push a synthetic title to all appliances, measure push-time distribution, gate release on 95th-percentile-readiness meeting budget.
- Encoder regression: every 100th encoded title gets an automated diff against its prior version; VMAF regression >2 points blocks publish.
- DRM chaos: periodically kill a DRM region in staging; verify GSLB shifts and sessions recover.
Green-field: what would I change if starting today (2026)?
- QUIC/HTTP/3 by default for all API surfaces (control plane + data plane segment delivery). Already deployed at Google-scale and ~20% latency win.
- Rust for edge services (manifest, license wrap app-layer, session). Not for transcoders — FFmpeg ecosystem dominates C++ there.
- Confidential computing (AMD SEV-SNP, Intel TDX) for HSM-alternative — cheaper, comparably secure for our threat model, and enables per-tenant-region key isolation more flexibly than HSM quotas.
- eBPF-based observability in the PoP — TCP retransmits, QUIC ACK delays at wire speed for finer-grained rebuffer diagnosis.
- Foundation-model-based recommendation scoring — retrieve-augmented ranker with user-history + text synopsis embeddings; likely +10% engagement at unclear compute cost vs DNN ranker.
Verification Checklist (done before submission) #
- SRE pager-carryable? Yes — §8 is a runbook with detection, blast radius, mitigation, recovery per component, including the "big red switches" (serve-cached-license, Tier-1 fallback posture, catalog read-only mode).
- Every diagram arrow → §4 or §5? Yes — the table at end of §6 cross-references every labelled arrow to an API surface or data store.
- Deep-dives at L7 depth? Yes — §7a derives CDN Pareto + ISP-push scheduling + bandwidth sanity-checked against §3; §7b derives PTE ROI ($500M/yr savings for $3M/yr compute) and GOP-parallel wall-clock math; §7c derives proxy-re-encryption as HSM decoupling with a concrete threat-model trade-off; §7d walks the full startup critical-path millisecond-by-millisecond and shows the parallelism that keeps us under 2s p99.
- Capacity math closes? 500 Tbps = 100M concurrent × 5 Mbps. Breaks into 375 Tbps Tier 0 / 100 Tbps Tier 1 / 25 Tbps origin. Tier 0 → 10k appliances × 37.5 Gbps/each (100G NIC × 2 = comfortable). Storage 14 PB encoded + 15 PB masters = ~30 PB logical, ~63 PB raw with EC/replication. Transcode farm 4k CPU-hours/day base, 40k peak. DRM 37k licenses/s base, 300k burst, 24 regions × 3 × 5k = 360k capacity. Egress $2.5B/yr dominates ~$2.6B all-in. Numbers close.