
Design Google Street View Image Ingestion

Ingest imagery from vehicle fleets over unreliable networks and feed geo-indexed processing pipelines downstream.

Tags: Edge / offline · Streaming · Partitioning · Availability · Cost / efficiency

1 Problem Restatement & Clarifying Questions #

Restatement. Design the backend that ingests 360° panoramic imagery from a fleet of camera-equipped taxis, processes it (blur PII, stitch, tile, build 3D), and serves it to (a) end-user panorama viewers, (b) downstream ML pipelines (imagery understanding, map generation), and (c) internal map/ops teams. Vehicles are mobile, cellular-connected, frequently in flaky-network conditions (tunnels, dead zones, dense urban canyons). Imagery is regulated data — it contains faces and license plates and therefore carries privacy + takedown obligations.

Clarifying questions I would ask (and my assumed answers for this doc).

| # | Question | Assumed answer | Why it matters |
|---|----------|----------------|----------------|
| Q1 | Fleet size? | ~10,000 taxis globally, skewed toward dense metros (US/EU/JP/LatAm) | Drives ingest bandwidth + regional ingest endpoints |
| Q2 | Camera rig? | 6-camera ring, each 8MP, synced, plus GPS/IMU | Drives per-capture size + stitching work |
| Q3 | Panorama resolution and format? | Equirect 8192×4096 stitched; raw per-camera JPEGs ~2MB each = ~12MB raw per capture; stitched JPEG/WebP ~8–15MB; tile pyramid ~20MB total | Storage + CDN math |
| Q4 | Capture rate? | 1 capture per ~10m driven (GPS-triggered); ~24h local buffer on vehicle | Sets ingest velocity + buffering requirements |
| Q5 | Active drive hours? | 8 hrs/day avg per vehicle (fleet staggered) | BOE math |
| Q6 | Privacy constraints? | Faces + plates blurred before any external serving; EU/DE harder constraints; right-to-be-forgotten requests | Hard invariants on serving path |
| Q7 | Geographic coverage? | Planet-scale, dense-metro-first | Drives regional storage tiers + S2 hot cells |
| Q8 | Serving QPS? | 1M panorama views/sec global peak (Street View is a high-traffic Maps feature); tile fanout ~10× (lookahead prefetch) | Drives CDN + tile-store design |
| Q9 | Durability target? | 11 nines (legal evidence in some jurisdictions; expensive to recapture) | Dictates erasure coding + multi-region replication |
| Q10 | Latency budget — ingest vs serve? | Ingest: tolerant, minutes-to-hours staleness OK; serve: p99 < 300ms first tile | Lets us decouple pipelines |
| Q11 | Are "fresh" captures prioritized (construction zones, new roads)? | Yes, priority lanes for specific geographies | Drives lane-aware scheduling in v3 |
| Q12 | Is the raw blob ever exposed externally? | No — raw is internal-only; only blurred derivatives are servable | Core ACL invariant |

If I only had time for three of the twelve: Q1 (fleet size), Q6 (privacy), Q9 (durability) — these three redraw the architecture (and the billion-dollar storage bill) rather than merely resizing a component.


2 Functional Requirements #

In scope

  1. FR-1 Resumable chunked upload from vehicle over flaky cellular; client-side WAL; part-level retry; content-hash-based idempotency.
  2. FR-2 Geo-tagged capture registration: each capture carries GPS (lat/lon/alt), heading (compass + IMU), timestamp (GPS-synced), camera intrinsics, vehicle ID, rig firmware version.
  3. FR-3 Deduplication across taxis passing the same street within a short time window (pick best, demote rest to ML pool).
  4. FR-4 Async processing pipeline: dedupe → stitch (6-camera → equirect) → PII blur (faces + plates) → SLAM/3D depth → tile pyramid (multi-zoom) → publish.
  5. FR-5 Panorama serving to end users via CDN, level-of-detail tile pyramid, tile-server origin.
  6. FR-6 Downstream ML export — raw (access-controlled internal consumers) and blurred (broader access) feeds into BigQuery / Dataflow / feature store.
  7. FR-7 Takedown + right-to-be-forgotten — specific panos/bboxes can be invalidated; CDN purges propagate.
  8. FR-8 Fleet control plane — vehicle auth, firmware attestation, upload quotas, back-pressure, SLI visibility per vehicle.
  9. FR-9 Reprocessing on model upgrade — when blur model improves, re-blur cold imagery deterministically.

Out of scope

  • ML model training internals (face/plate detector, SLAM model weights) — those are trained offline on exported data. We treat them as versioned black-box services.
  • Map-matching / graph-building (road graph extraction from imagery) — consumes our feed, doesn't live here.
  • Consumer app (Maps Street View UI) — we provide tile URLs.
  • Billing for external API consumers — out of scope.
  • Vehicle routing/dispatch — separate fleet system.

3 NFRs + Capacity Estimate (full BOE math, reconciled) #

NFRs

| Category | Target | Justification |
|----------|--------|---------------|
| Availability — ingest | 99.9% (8.7 hr/yr downtime) | Vehicles buffer 24h locally; ingest is async-tolerant |
| Availability — serve | 99.99% (52 min/yr) | User-facing, Maps-critical |
| Durability — raw + processed | 11 nines (10⁻¹¹ annual loss) | Imagery is legal evidence in some jurisdictions; recapture means physically dispatching a vehicle — $100s/revisit |
| Latency — upload | No hard SLA; p95 commit within 6h of capture under normal conditions | Vehicles with a 24h local WAL tolerate hours of backhaul |
| Latency — serve (first tile) | p99 < 300ms globally | User panorama pan must feel responsive |
| Latency — dedupe pipeline | p95 within 30 min of commit | Required before blur |
| Latency — blur | p95 within 2h of commit; 99.99% of publicly servable blobs blur-complete before exposure | Hard invariant |
| Privacy — blur coverage | FN rate < 0.1% audited; auto-rollback on regression | Regulatory |
| Takedown SLA | 24h CDN purge + origin invalidate | GDPR / equivalent |
| Tamper-evidence | Per-capture signed hash manifest, attested at vehicle boot | Anti-spoof |

Capacity estimate — derived, not asserted

Ingest volume (per day).

  • 10,000 taxis × 8 hrs/day × 3600 s/hr = 288M vehicle-seconds/day.
  • At 30 km/h urban avg = 8.33 m/s, 1 capture / 10 m = 1 capture / 1.2 s.
  • Expected captures/day = 288M / 1.2 ≈ 240M. The prompt's 300M implies a slightly higher capture rate or fleet; I'll size conservatively with 300M/day.
  • Raw bytes per capture: 6 cameras × 2MB JPEG ≈ 12MB; with IMU/metadata overhead + lossless-archive margin ≈ 15MB raw wire bytes. The prompt's 50MB is a conservative upper bound (e.g., preserving camera-native DNG for future reprocessing). I'll carry two numbers: 15MB efficient (JPEG-in, JPEG-on-disk) and 50MB if raw is preserved.
| Scenario | Captures/day | Bytes/capture | Raw/day | Raw/yr |
|----------|--------------|---------------|---------|--------|
| JPEG-only (efficient) | 300M | 15 MB | 4.5 PB/day | 1.6 EB/yr |
| Preserve raw DNG (prompt) | 300M | 50 MB | 15 PB/day | 5.5 EB/yr |

I'll carry the 15 PB/day raw figure as baseline since the prompt set it.

Processed + tile pyramid bytes.

  • Stitched equirect 8192×4096 WebP ~8 MB.
  • Tile pyramid, 5 zoom levels, 512×512 tiles, overhead ~1.33×: ~11 MB processed/pano.
  • Processed = 300M × 11 MB = 3.3 PB/day incremental.

Ingest bandwidth — peak vs avg.

  • Avg = 15 PB / 86400 s = 174 GB/s sustained globally.
  • Peak surge (regional rush hour — rush hours never align globally, but within a region they do): the prompt's 1.4 TB/s figure corresponds to ~8× the daily average compressed into a peak hour (≈5 PB/hour). Size for it.
  • Distribute across ~10 regional ingest points (NA-East, NA-West, EU-Central, EU-West, JP, SG, AU, LatAm, IN, ME). Avg per region ≈ 17 GB/s sustained, ~140 GB/s peak → ~1.1 Tbps ingress per region at peak. Solvable with cloud-provider regional PoPs; GCS regional buckets handle this natively.

Storage tiering.

  • Hot tier (near-region, SSD-backed, CDN-frontable): 30 days of processed = 30 × 3.3 PB = ~100 PB (plus ~30-day raw window for re-stitch ops = +450 PB = ~550 PB hot total).
  • Warm tier (HDD, regional, slower egress): 90 days post-hot = 90 × 18.3 PB/day (raw + processed) = 1.65 EB warm.
  • Cold tier (multi-region erasure-coded archive, Colossus / GCS Archive): 10-year retention at 18.3 PB/day × 3650 days = 66 EB cold after 10 yrs. Realistically with dedup/lifecycle at 2.5× compression factor for cold (JPEG doesn't compress much, but dedup + delta-encode similar tiles helps) → **25 EB cold** effective.

Sanity check: 25 EB is enormous, but within the envelope of a planet-scale storage layer like Colossus — a noticeable, not dominant, fraction of total footprint. Plausible.

Serving QPS + CDN bandwidth.

  • 1M pano views/s peak global. Each view loads ~10 tiles initial + 20 on pan = ~30 tile GETs per session.
  • Tile QPS: 1M × 30 = 30M tile GET/s peak. Tiles are ~50KB each after CDN compression.
  • CDN egress peak: 30M × 50KB = 1.5 TB/s egress = 12 Tbps — within Google Cloud CDN / Google Global Cache scale. 95%+ hit rate assumed (tile data is hugely repetitive viewer-side).
  • Origin QPS: 5% × 30M = 1.5M QPS to tile origin. Single Bigtable cluster can do ~1M QPS; shard across 3–5 clusters per region.

Metadata row count.

  • 300M captures/day × 365 × 10 yrs = 1.1 trillion captures steady-state. Spanner-scale (Spanner handles trillions of rows with proper sharding key).

Reconciliation check. Raw alone: 15 PB/day × 365 ≈ 5.5 EB/yr. At cold-tier pricing ($0.004/GB/mo, GCS Archive): 5.5 EB × 10⁹ GB/EB × $0.004 × 12 ≈ $264M/yr for each year of retained imagery. By year 10, with the 2.5× cold-compression factor, that's ~$1.2B/yr in archive spend alone — the number that drives lifecycle aggressiveness.
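That chain of unit conversions is easy to fumble; a minimal script to reproduce the section's numbers (all constants are this section's assumptions, not measured values):

# boe.py — reproduce the capacity math above; tweak assumptions and re-run.
CAPTURES_PER_DAY = 300e6
RAW_MB, PROC_MB  = 50, 11        # preserve-DNG scenario; use 15 for JPEG-only
SECONDS_PER_DAY  = 86_400

raw_pb_day  = CAPTURES_PER_DAY * RAW_MB / 1e9            # MB -> PB
proc_pb_day = CAPTURES_PER_DAY * PROC_MB / 1e9
avg_gb_s    = raw_pb_day * 1e6 / SECONDS_PER_DAY         # PB/day -> GB/s

cold_eb_10y = (raw_pb_day + proc_pb_day) * 3650 / 1e3    # PB -> EB
cold_eb_eff = cold_eb_10y / 2.5                          # dedup + delta-encode
archive_usd = cold_eb_eff * 1e9 * 0.004 * 12             # $/GB/mo over 12 months

print(f"raw {raw_pb_day:.1f} PB/day, processed {proc_pb_day:.1f} PB/day")
print(f"avg ingest {avg_gb_s:.0f} GB/s")
print(f"cold @10y {cold_eb_10y:.0f} EB raw, {cold_eb_eff:.0f} EB effective")
print(f"archive ~${archive_usd/1e9:.1f}B/yr at year-10 steady state")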


4 High-Level API #

All APIs are gRPC internally, HTTP/2 over TLS with streaming at the edge for upload. Authentication via vehicle device certificate (mTLS) issued by a fleet CA; attested at vehicle boot.

Upload (vehicle → regional ingest)

// Begin a resumable upload session. Idempotent by (vehicle_id, client_session_id).
rpc InitiateUpload(InitiateUploadRequest) returns (InitiateUploadResponse);

message InitiateUploadRequest {
  string vehicle_id = 1;           // fleet-unique, attested
  string client_session_id = 2;    // client-generated UUID; dedup key
  CaptureManifest manifest = 3;    // list of parts, sizes, sha256 per part
  bytes manifest_signature = 4;    // vehicle TPM-signed
  GeoHint geo_hint = 5;            // approximate GPS at session start, for region routing
}

message InitiateUploadResponse {
  string upload_id = 1;            // server-assigned, stable across retries
  repeated PartUrl part_urls = 2;  // signed PUT URLs, one per part, TTL=1h
  int64  part_size_bytes = 3;      // server-chosen optimal (default 12MB)
  string ingest_region = 4;        // which region to continue uploading to
  int64  backpressure_retry_after_ms = 5; // 0 = OK, >0 = client should wait
}

// Upload one part. Client retries safe (part_hash match → 200 OK no-op).
rpc PutPart(PutPartRequest) returns (PutPartResponse);
message PutPartRequest {
  string upload_id = 1;
  int32  part_number = 2;
  bytes  part_bytes = 3;    // streamed
  string part_sha256 = 4;   // client-computed
}
message PutPartResponse {
  enum Status { OK = 0; HASH_MISMATCH = 1; BACKPRESSURE = 2; REJECTED = 3; }
  Status status = 1;
  int64  retry_after_ms = 2;
}

// Commit: atomically transitions the upload from staging to captured state.
// Content-hash-idempotent: a second Commit with same hash returns same capture_id.
rpc CommitUpload(CommitUploadRequest) returns (CommitUploadResponse);
message CommitUploadRequest {
  string upload_id = 1;
  string aggregate_sha256 = 2;     // covers all parts, vehicle-signed
}
message CommitUploadResponse {
  string capture_id = 1;           // globally unique, deterministic from hash
  string s2_cell_id_l14 = 2;       // server-derived, for debug
}

// Metadata registration (can be batched; usually called by pipeline after dedup).
rpc RegisterCapture(RegisterCaptureRequest) returns (RegisterCaptureResponse);
message RegisterCaptureRequest {
  string capture_id = 1;
  GpsFix gps = 2;                  // lat, lon, alt, hdop, timestamp
  Heading heading = 3;             // IMU+compass
  int64  capture_ts_ns = 4;
  CameraParams params = 5;         // intrinsics, firmware version
  string vehicle_id = 6;
  string raw_blob_ref = 7;         // gs:// path
  bytes  attestation = 8;
}

Pipeline events (Pub/Sub)

Topic: captures.ingested        // published on CommitUpload
  {capture_id, vehicle_id, s2_cell_l14, ts, raw_blob_ref, size, hashes...}
Topic: captures.deduped
Topic: captures.stitched
Topic: captures.blurred          // triggers publish
Topic: tiles.generated
Topic: captures.takedown         // PII takedown request

All topics retained 7 days for replay; ordering key = s2_cell_l10 so replays per-cell are serial.
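A minimal publisher sketch for that ordering-key contract, using the google-cloud-pubsub Python client (project ID and event shape are illustrative; ordering must also be enabled on the subscription):

import json
from google.cloud import pubsub_v1

# Ordering must be enabled explicitly or the client rejects ordering_key.
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic = publisher.topic_path("sv-ingest-prod", "captures.ingested")

def publish_ingested(capture_id: str, s2_cell_l10: int, raw_blob_ref: str) -> None:
    event = {"capture_id": capture_id, "s2_cell_l10": s2_cell_l10,
             "raw_blob_ref": raw_blob_ref}
    # Ordering key = coarse cell id: replays and processing stay serial per
    # cell, while unrelated cells fan out in parallel.
    publisher.publish(topic, json.dumps(event).encode(),
                      ordering_key=str(s2_cell_l10)).result()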

Serving (user → CDN → tile origin)

// Public: fronted by CDN, signed URLs for authenticated session if needed.
GET /v1/pano/{s2_cell_l14}/{zoom}/{tile_x}/{tile_y}.webp
  -> tile bytes (50KB typical, cached at CDN)

GET /v1/pano/lookup?lat={lat}&lng={lng}&radius_m={r}
  -> list of nearby pano_ids with capture_ts + coverage_score
  (geo-query served from S2-indexed metadata)

GET /v1/pano/{pano_id}/manifest
  -> {tile_urls, depth_map_url, neighbor_pano_ids, capture_ts, coverage_quality}

Internal (downstream ML, ops)

rpc ExportCaptures(ExportRequest) returns (stream ExportBatch);  // BigQuery extract
rpc InvalidatePano(InvalidateRequest) returns (InvalidateResponse); // takedown
rpc ReprocessCaptures(ReprocessRequest) returns (ReprocessResponse); // model-upgrade triggered re-blur

Idempotency invariants.

  • InitiateUpload idempotent on (vehicle_id, client_session_id).
  • PutPart idempotent on (upload_id, part_number, part_sha256).
  • CommitUpload: capture_id = hash(aggregate_sha256) → committing the same content twice returns the same capture_id, no duplicate row.
  • Pipeline events carry capture_id; each stage is keyed by it → exactly-once per stage via idempotent writes.
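A toy sketch of the commit invariant, with a dict standing in for the Spanner row (the exact capture_id derivation is illustrative):

import hashlib

def capture_id_from_content(aggregate_sha256: str) -> str:
    # capture_id is a pure function of content: a retried Commit, or a
    # duplicate session uploading identical bytes, converges on the same id.
    return "cap_" + hashlib.sha256(aggregate_sha256.encode()).hexdigest()[:32]

def commit_upload(captures: dict, upload_id: str, aggregate_sha256: str) -> str:
    cid = capture_id_from_content(aggregate_sha256)
    # Insert-if-absent: the second Commit is a read, not a second row.
    captures.setdefault(cid, {"upload_id": upload_id, "sha256": aggregate_sha256})
    return cid

assert commit_upload(db := {}, "u1", "abc") == commit_upload(db, "u2", "abc")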

5 Data Schema #

Engine choice matrix

| Data | Engine | Why chosen | Rejected |
|------|--------|------------|----------|
| Capture metadata (GPS, heading, refs) | Spanner (globally replicated) | Strong consistency for the capture_id → blob_ref lookup; SQL for ops; geo-distributed for low-latency serve-path reads; handles trillion-row scale | Bigtable: no native secondary indexes, eventually consistent cross-region replication; Cassandra: operational burden + cross-region consistency; MySQL/Postgres: doesn't shard to planet scale without third-party (Vitess/Citus) ops pain |
| Blob data (raw, stitched, tiles) | Colossus / GCS (Reed-Solomon erasure coded) | 11-nines durability with (10,4) or (9,4) EC; cost; native lifecycle tiering hot→warm→cold→archive | Block storage / PD: 10–100× cost; self-managed HDFS: ops burden + lower durability |
| Tile-server lookup cache (hot) | Bigtable (row key: s2_cell_l14#zoom#tile_xy#version) | Range scans along the S2 Hilbert curve → local reads; ~1M QPS/cluster; low-latency point reads | Spanner: overkill + costlier for read-only tile lookups; Redis: not durable enough, memory-bound |
| Upload staging state | Bigtable (row key: upload_id), TTL 7d | Write-heavy, short-lived, high-throughput; simple row-level ops; TTL cleans up old sessions | Spanner: global-replication write amplification is wasteful for short-lived rows |
| S2 geo-index | Bigtable secondary; row key: s2_cell_l10#capture_ts#capture_id | "All captures in this cell for this date range" is a single locality range scan; co-locates cell neighbors | Elasticsearch geo_point: ops burden + weak consistency; R-tree in Postgres: doesn't shard |
| Pipeline state machine | Spanner (row key: capture_id, columns per stage) | Atomic stage transitions + audit; strong consistency for the "has this passed blur?" invariant | Bigtable: stage transitions are read-modify-write → need transactions → Spanner is simpler |
| Analytics / ML export | BigQuery (scheduled export from Spanner CDC + blob manifests) | SQL at PB scale, columnar, federated to blob refs | Dataflow into Parquet on GCS: fine, but analysts want SQL; BQ wins |
| Dedup LSH index | Bigtable (row key: s2_cell_l14#time_bucket#phash_prefix) | LSH buckets map cleanly to row-key prefixes; dedup is a cell-local operation | In-memory HNSW: doesn't scale to 300M/day; cross-region replication is painful |

Schema — key tables

// Spanner: captures (primary metadata)
CREATE TABLE captures (
  capture_id          STRING(64) NOT NULL,          // derive from content hash
  vehicle_id          STRING(32) NOT NULL,
  session_id          STRING(64),
  capture_ts          TIMESTAMP NOT NULL,
  gps_lat             FLOAT64 NOT NULL,
  gps_lng             FLOAT64 NOT NULL,
  gps_alt             FLOAT64,
  gps_hdop            FLOAT64,
  heading_deg         FLOAT64,
  s2_cell_l10         INT64 NOT NULL,               // ~80 km² cell, shard key
  s2_cell_l14         INT64 NOT NULL,               // ~0.3 km² (≈550 m) cell, serve key
  raw_blob_ref        STRING(256) NOT NULL,         // gs://raw-bucket/{year}/{mo}/{s2_l10}/{capture_id}
  stitched_blob_ref   STRING(256),
  tile_manifest_ref   STRING(256),
  blur_status         STRING(16) NOT NULL,          // pending | running | passed | failed | retracted
  blur_model_ver      STRING(16),
  dedup_group_id      STRING(64),                   // LSH cluster id
  quality_score       FLOAT32,                      // for dedup "best pick"
  takedown_status     STRING(16) NOT NULL DEFAULT ('none'),  // none | pending | done
  firmware_ver        STRING(32),
  camera_intrinsics   BYTES(MAX),                   // protobuf
  attestation         BYTES(256),                   // vehicle TPM signature
  created_ts          TIMESTAMP NOT NULL OPTIONS(allow_commit_timestamp=true),
) PRIMARY KEY (s2_cell_l10, capture_ts DESC, capture_id);
// Key layout: reads/writes for a cell land on the same split;
// the cell is the natural locality unit.

CREATE INDEX idx_captures_vehicle ON captures (vehicle_id, capture_ts DESC);
CREATE INDEX idx_captures_dedup_group ON captures (dedup_group_id);
CREATE INDEX idx_captures_takedown ON captures (takedown_status, capture_ts DESC);
// Spanner has no partial (WHERE) indexes; takedown scans read the small
// 'pending' key range of this index instead.
// Bigtable: upload_staging (row-key: upload_id)
upload_id | column family "meta" -> vehicle_id, session_id, manifest_proto, created_ts, ttl_ts
          | column family "parts" -> part_1_status, part_1_sha256, part_1_size, ..., part_N_...
          | column family "commit" -> committed_ts, aggregate_sha256, capture_id
// TTL = 7 days; garbage-collected sessions.
// Bigtable: tiles (row key: {s2_cell_l14}#{zoom}#{tile_xy_interleaved}#{version})
// Locality group "blob" -> blob_ref (gs:// or inline for tiny)
// Locality group "meta" -> content_hash, generated_ts, pipeline_version
// Tombstone row on takedown; CDN purge keyed on row key.
// Bigtable: geo_index_l10 (row key: {s2_cell_l10}#{capture_ts_reversed}#{capture_id})
// Answers: "captures in cell X between T1 and T2" via single range scan.
// Reversed timestamp → newest first without DESC sort.
// Bigtable: dedup_lsh (row key: {s2_cell_l14}#{time_bucket_10min}#{phash_simhash})
// Value: list of capture_ids with same LSH bucket → dedup candidates.
// Spanner: pipeline_state (idempotent stage gating)
CREATE TABLE pipeline_state (
  capture_id STRING(64) NOT NULL,
  stage      STRING(16) NOT NULL,   // ingest, dedup, stitch, blur, slam, tile
  status     STRING(16) NOT NULL,   // pending, running, success, failed, skipped
  attempt    INT64 NOT NULL,
  worker_id  STRING(64),
  started_ts TIMESTAMP,
  finished_ts TIMESTAMP,
  output_ref STRING(256),
  error_msg  STRING(1024),
) PRIMARY KEY (capture_id, stage);
// Strict FSM: blur.success is a gate for tile generation; enforced by pipeline.
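The reversed-timestamp and interleaved-XY key tricks in the comments above are easy to get subtly wrong; a minimal, runnable sketch of the key builders (exact byte layout is illustrative):

import struct

MAX_TS = 2**63 - 1

def reverse_ts(ts_micros: int) -> bytes:
    # Big-endian (MAX - ts): lexicographic scan order = newest first,
    # with no DESC support needed from Bigtable.
    return struct.pack(">Q", MAX_TS - ts_micros)

def interleave_xy(x: int, y: int, bits: int = 16) -> int:
    # Morton-interleave tile x/y so spatially adjacent tiles share prefixes.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return z

def tile_row_key(s2_cell_l14: int, zoom: int, x: int, y: int, version: str) -> bytes:
    return b"#".join([struct.pack(">Q", s2_cell_l14), bytes([zoom]),
                      struct.pack(">I", interleave_xy(x, y)), version.encode()])

def geo_index_key(s2_cell_l10: int, ts_micros: int, capture_id: str) -> bytes:
    return b"#".join([struct.pack(">Q", s2_cell_l10), reverse_ts(ts_micros),
                      capture_id.encode()])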

Why Spanner for metadata (vs Bigtable everywhere): the blur_status = passed invariant is a strong-consistency read — when the serving tier answers "can I serve this tile?" it must NEVER see a stale blur_status = pending that was flipped 5s ago. Bigtable's eventual consistency between regions would allow a brief window where a European reader sees a US-region-committed passed but hasn't replicated yet — that's a compliance-ending race. Spanner's external consistency eliminates it. We pay the write latency cost (5-10ms cross-region commit vs Bigtable's 1ms local write) and it's worth it.


6 System Diagram (ASCII) — centerpiece #

6.1 End-to-end

                                                            ┌─────────── CONTROL PLANE ───────────┐
                                                            │ Fleet CA (mTLS cert issuance)        │
                                                            │ Attestation service (TPM quote)      │
                                                            │ Quota + backpressure controller      │
                                                            │ Config service (firmware, model ver) │
                                                            │ Observability (SLI/SLO, alerts)      │
                                                            └──────────────────────────────────────┘
                                                                          ▲
                                                                          │ policy + attestation
  ═══════════════ VEHICLE (edge) ═══════════════════    ┌─────────────────┴────────────────┐
  ┌────────────────────────────────────────────────┐    │     REGIONAL INGEST (N=~10)      │
  │ 6-cam rig (8MP × 6) + GPS/IMU                  │    │ ┌────────────────────────────┐   │
  │ Onboard stitch preview + JPEG encode           │    │ │ L7 LB (Envoy) — mTLS       │   │
  │ TPM signs manifest hash                        │    │ │ auth vehicle_id + cert     │   │
  │ Local WAL on NVMe (24 h, ~400 GB/taxi)         │───▶│ │ rate-limit + backpressure  │   │
  │ Upload agent: part-level retry, WAL drain,     │    │ └──────────────┬─────────────┘   │
  │ backpressure-aware scheduler                   │    │                ▼                 │
  │ Part size 12 MB; HTTPS/2 + QUIC fallback       │    │ ┌────────────────────────────┐   │
  └────────────────┬───────────────────────────────┘    │ │ Upload Service (stateful)  │   │
                   │                                    │ │ - Bigtable: staging state  │   │
                   │ mTLS, PUT /part (12 MB),           │ │ - Signed URL issuer        │   │
                   │ avg 30 Mbps cellular,              │ │ - Part-hash validator      │   │
                   │ retries with Retry-After headers   │ │ - Commit → deterministic   │   │
                   ▼                                    │ │   capture_id = h(content)  │   │
                                                        │ └──────────────┬─────────────┘   │
                                                        │                │ gs:// PUT        │
                                                        │                ▼                 │
                                                        │ ┌────────────────────────────┐   │
                                                        │ │ RAW BLOB STORE (regional)  │   │
                                                        │ │ GCS "raw-ingest-{region}"  │   │
                                                        │ │ ACL: write=taxi-SA,        │   │
                                                        │ │      read=pipeline-SA only │   │
                                                        │ │ (NO public read EVER)      │   │
                                                        │ │ Reed-Solomon (9,4) EC      │   │
                                                        │ │ 30-day hot, then tier →    │   │
                                                        │ └──────────────┬─────────────┘   │
                                                        │                │                 │
                                                        │                ▼ commit event    │
                                                        │ ┌────────────────────────────┐   │
                                                        │ │ Pub/Sub: captures.ingested │   │
                                                        │ │ ordered by s2_cell_l10     │◀──┐
                                                        │ └──────────────┬─────────────┘   │
                                                        └────────────────┼─────────────────┘
                                                                         │
                              ═══════════ PROCESSING DAG (Dataflow/Beam) ═══════════
                                                                         │
                                                                         ▼
                                                ┌──────────────────────────────────────────────┐
                                                │ Stage 1: DEDUPE                              │
                                                │  - Compute pHash+simhash from stitched prev  │
                                                │  - Lookup dedup_lsh by s2_cell_l14#10min     │
                                                │  - If LSH match: pick best (quality score),  │
                                                │    demote rest to ML-only pool               │
                                                │  - Emit to captures.deduped                  │
                                                └──────────────────────┬───────────────────────┘
                                                                       ▼
                                                ┌──────────────────────────────────────────────┐
                                                │ Stage 2: STITCH                              │
                                                │  - 6-cam → equirect 8192×4096                │
                                                │  - Output: stitched_blob_ref                 │
                                                │  - ACL: still internal-only                  │
                                                └──────────────────────┬───────────────────────┘
                                                                       ▼
                                                ┌──────────────────────────────────────────────┐
                                                │ Stage 3: PII BLUR (HARD GATE)                │
                                                │  - GPU fleet: T4 / L4, ~5 pano/s each        │
                                                │  - Face detector + plate detector (2 models) │
                                                │  - Deterministic blur (seeded gaussian)      │
                                                │  - blur_model_ver pinned per capture         │
                                                │  - Auto-rollback if FN-rate regresses        │
                                                │  - Emit captures.blurred with signed proof   │
                                                │  - Store blurred_blob_ref (NEW ACL: public)  │
                                                └──────────────────────┬───────────────────────┘
                                                                       ▼
                                         ┌─────────────────────────────┴─────────────────────────┐
                                         ▼                                                       ▼
                          ┌──────────────────────────────┐               ┌─────────────────────────────────┐
                          │ Stage 4a: SLAM / 3D          │               │ Stage 4b: TILE PYRAMID          │
                          │  - Depth estimation          │               │  - Zoom 0..5 (512×512 WebP)     │
                          │  - Neighbor stitching        │               │  - Version = blur_model_ver     │
                          │  - Pose refinement           │               │  - Write to tiles table (BT)    │
                          │  - depth_map blob            │               │  - Atomic activate → serving    │
                          └──────────────────────────────┘               └──────────────────┬──────────────┘
                                                                                            │
                                                                                            ▼
                                                                         ┌───────────────────────────────┐
                                                                         │ PROCESSED BLOB STORE          │
                                                                         │ GCS "pano-tiles-{region}"     │
                                                                         │ multi-regional bucket         │
                                                                         │ ACL: read=CDN + serve-SA      │
                                                                         │ Lifecycle: 30d hot → warm →   │
                                                                         │   cold (90d) → archive (10y)  │
                                                                         └──────────────┬────────────────┘
                                                                                        │
                              ═══════════════════ SERVING PATH ═══════════════════      │
                                                                                        │
  ┌─────────────────┐     ┌─────────────────┐    ┌──────────────────┐    ┌──────────────┴──────────────┐
  │ End user        │◀───▶│ CDN (Google     │◀──▶│ Tile Origin      │◀──▶│ tiles (Bigtable) + blob ref │
  │ Maps / SDK      │     │  Global Cache)  │    │ gRPC, p50 5ms    │    │ + processed blob store      │
  │ 1 M sess/s peak │     │ 95% hit rate    │    │ 1.5 M QPS peak   │    └─────────────────────────────┘
  └─────────────────┘     └─────────────────┘    └──────────────────┘
           │                                             │
           ▼                                             ▼
  ┌───────────────────┐                    ┌──────────────────────────────┐
  │ Metadata lookup   │                    │ Metadata serve (Spanner       │
  │ /v1/pano/lookup   │───────────────────▶│ read replica in region, point │
  │ (nearby panos)    │                    │ reads + s2 range scans)       │
  └───────────────────┘                    └──────────────────────────────┘

  ═══════════════ DOWNSTREAM CONSUMERS ═══════════════
  • ML training: BigQuery export of (capture_id, blob_refs, metadata) — internal-only feed
  • Map generation: Dataflow job consuming captures.blurred, producing road-graph deltas
  • Internal tools: takedown workflow, reprocess-on-model-upgrade controller

6.2 Upload path sub-diagram (L7 detail)

VEHICLE                           EDGE LB              UPLOAD SVC            RAW BUCKET
  │                                 │                      │                    │
  │ InitiateUpload(manifest, sig)   │                      │                    │
  ├───(mTLS, 2 KB)─────────────────▶│                      │                    │
  │                                 │──(region-affinity)──▶│                    │
  │                                 │                      │ validate sig       │
  │                                 │                      │ reserve upload_id  │
  │                                 │                      │ gen N signed URLs  │
  │                                 │                      │ stash state (BT)   │
  │◀────────(upload_id, part_urls, part_size=12MB)────────┤                    │
  │                                 │                      │                    │
  │ PutPart(1, 12MB, sha256)        │                      │                    │
  ├───(HTTPS/2, 12MB, ~3s @30Mbps)─▶│──────────────────────┼───────────────────▶│
  │                                 │                      │                    │ write chunk
  │                                 │                      │                    │ 
  │ [if response.retry_after_ms > 0]│                      │                    │
  │ sleep(jitter), retry            │                      │                    │
  │                                 │                      │                    │
  │ [parallel PutPart(2..N)]        │                      │                    │
  │ ... 4-way parallel typical      │                      │                    │
  │                                 │                      │                    │
  │ [network drops mid-part]        │                      │                    │
  │ WAL preserves part; on reconnect│                      │                    │
  │ re-PUT same part_number,        │                      │                    │
  │ server: hash match → 200 no-op  │                      │                    │
  │                                 │                      │                    │
  │ CommitUpload(agg_sha256)        │                      │                    │
  ├────────────────────────────────▶│─────────────────────▶│ verify all parts   │
  │                                 │                      │ finalize multipart │────(compose)──▶│
  │                                 │                      │ derive capture_id  │                │
  │                                 │                      │ = hash(content)    │                │
  │                                 │                      │ publish Pub/Sub    │                │
  │◀──────(capture_id, s2_cell)─────┤                      │                    │                │

6.3 Serving path sub-diagram

USER                   CDN              TILE ORIGIN              BIGTABLE (tiles)         GCS (processed)
  │                     │                    │                       │                       │
  │ GET /v1/pano/{s2}/{z}/{x}/{y}.webp       │                       │                       │
  ├────────────────────▶│                    │                       │                       │
  │                     │ [cache hit 95%]    │                       │                       │
  │◀─── 50KB tile ──────┤                    │                       │                       │
  │                     │                    │                       │                       │
  │                     │ [cache miss 5%]    │                       │                       │
  │                     ├───────────────────▶│                       │                       │
  │                     │                    │ lookup by row key     │                       │
  │                     │                    ├──────────────────────▶│                       │
  │                     │                    │◀────blob_ref──────────┤                       │
  │                     │                    │ fetch blob            │                       │
  │                     │                    ├───────────────────────┼──────────────────────▶│
  │                     │                    │◀──────50KB tile ──────┼───────────────────────┤
  │                     │                    │                       │                       │
  │                     │◀── tile + cache────┤                       │                       │
  │◀─── 50KB tile ──────┤                    │                       │                       │
  │                     │                    │                       │                       │
  │                     │ [TTL 7d, purge on takedown event]          │                       │

Arrow annotations.

  • Vehicle → Edge LB: mTLS + HTTP/2, ~30 Mbps avg (cellular), 12 MB/part, 4-way parallel parts → ~120 Mbps burst per taxi; fleet peak ingress is dominated by regional aggregation.
  • Pub/Sub → Dataflow: ordered by s2_cell_l10; ~3.5K msg/s global steady state (~350 msg/s per region avg), ~1.4K/s per region at peak.
  • Dataflow stages: Beam PTransforms, checkpointed; stage-local parallelism scaled by GPU pool size for blur.
  • CDN → Origin: 5% miss rate at 30M tile QPS → 1.5M QPS origin; tiles table is keyed for Hilbert locality, scans co-locate.

7 Deep-Dives (3 critical topics at L7 depth) #

7.1 Resumable chunked upload over flaky vehicular cellular — the earned-secret depth

Why critical. Each taxi WALs up to 24h of captures (~360 GB at ~15 MB × ~24K captures/day), so a regional ingest outage strands hundreds of TB per thousand vehicles — petabytes fleet-wide — all of which tries to drain on recovery. If the reconnect storm isn't shaped, the ingest tier melts and you lose days of imagery. The upload protocol IS the reliability story.

Alternatives considered.

| Option | Throughput | Retry cost | Complexity | Why rejected/chosen |
|--------|------------|------------|------------|---------------------|
| Single POST per capture | Good on wired, terrible on cellular | Full re-upload on drop: 15–50 MB × retry count | Trivial | Rejected: 10–20% of cellular captures hit a network blip mid-transfer; every blip wastes a full capture of bandwidth |
| Multipart resumable, 12 MB parts ← chosen | Excellent; parallel parts fill the BDP | Only failed parts retry: ~2 MB expected loss per capture | Medium | Chosen: sweet spot. Details below |
| Byte-range resumable (tus.io style) | Similar to multipart | Good, but no parallelism | Medium | Rejected: a single stream can't fill cellular BDP; TCP reordering after cell-tower handoff collapses throughput |
| gRPC bidi streaming | Excellent on a stable connection | Server-side buffering complex on drop | High | Rejected: server state complexity + harder retry semantics across load balancers |

Why 12 MB parts — the L7 insight.

  • Too small (<5 MB): HTTP/2 stream setup + a per-part metadata write in the staging table + per-request overhead dominate — many round trips per capture instead of one or two. (S3's 5 MB multipart minimum reads as billing policy more than physics: micro-part storms create metadata-cost asymmetry providers don't want to carry.)
  • Too large (>64 MB): on cellular with a 5–10% mid-part drop rate, every drop re-uploads the whole part. Expected wasted bandwidth scales linearly with part size × drop probability.
  • 12 MB specifically because:
    • It is a whole multiple of GCS's 256 KB resumable-chunk granularity and close to the client libraries' 8 MB default upload chunk, so parts map onto storage writes without a ragged tail.
    • On LTE at ~30 Mbps sustained and 100–300 ms RTT, TCP BDP ≈ 30 Mbps × 0.2 s = 750 KB; 4 parallel streams × 12 MB keep ~48 MB outstanding → saturates the link with headroom.
    • Per-part SHA256 costs ~40 ms on the vehicle's ARM SoC — negligible against a ~3 s part transfer.
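A toy model of the tradeoff — fixed per-part overhead vs. expected re-sent bytes after mid-part drops. All parameters are illustrative, not fleet measurements; the toy bottoms out in the 5–15 MB range, and the BDP-fill and chunk-alignment arguments above pick 12 within it:

import math

def overhead_mb(part_mb: float, capture_mb: float = 15.0,
                drop_per_sec: float = 0.02,   # link-loss probability per second
                mbps: float = 30.0,
                per_part_fixed_mb: float = 0.5) -> float:
    parts = math.ceil(capture_mb / part_mb)
    p_drop = min(0.99, drop_per_sec * part_mb * 8 / mbps)  # P(drop mid-part)
    resend = part_mb * p_drop / (1 - p_drop)               # geometric retries
    return parts * (per_part_fixed_mb + resend)

for p in (2, 5, 12, 32, 64):
    print(f"{p:>2} MB parts -> {overhead_mb(p):5.1f} MB expected overhead/capture")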

Client-side WAL (the production-earned detail).

Vehicle NVMe layout:
  /wal/pending/{capture_id}/part_{N}.bin    (raw part bytes, fsync'd on write)
  /wal/pending/{capture_id}/manifest.pb     (TPM-signed manifest)
  /wal/uploaded/{capture_id}/receipts/{N}.sig  (server signed receipt per part)
  /wal/committed/{capture_id}/                 (empty marker: capture safely landed)

Storage budget: ~400 GB NVMe — 24 h of captures at ~15 MB/pano × ~24K pano/day.
(Draining that over a full 24 h, parked time included, needs ~33 Mbps average,
which is what the ~30 Mbps cellular budget is sized against.)
GC policy: move to /committed after server Commit ACK; delete /committed after 7 days
(retention for audit + reprocess requests).

The WAL must be on NVMe, not eMMC — we learned the hard way that sustained WAL writes at this volume (hundreds of GB/day) burn out eMMC within 18 months. NVMe with DWPD ≥ 1 is non-negotiable for fleet hardware.
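A sketch of the reboot-time WAL drain against the layout above; put_part and commit stand in for the PutPart/CommitUpload RPCs, whose idempotency makes re-sending an already-acked part harmless:

from pathlib import Path

def drain_wal(root: Path, put_part, commit) -> None:
    for capture_dir in sorted((root / "pending").iterdir()):
        receipts = root / "uploaded" / capture_dir.name / "receipts"
        acked = {p.stem for p in receipts.glob("*.sig")} if receipts.exists() else set()
        for part in sorted(capture_dir.glob("part_*.bin")):
            part_no = part.stem.split("_", 1)[1]     # "part_3" -> "3"
            if part_no not in acked:
                put_part(capture_dir.name, part)     # server hash-match -> no-op
        commit(capture_dir.name)                     # idempotent on aggregate hash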

Back-pressure — control loop.

  • Server returns retry_after_ms in three places:
    • LB level (Envoy): based on global CPU + connection count in region.
    • Upload service level: based on Spanner commit QPS vs SLO.
    • Pub/Sub level: if ingestion topic backlog exceeds N minutes, signal "don't commit yet."
  • Client honors the MAX of these and adds jitter: sleep(retry_after_ms + uniform(0, retry_after_ms)).
  • The jitter is critical: without it, a coordinated surge (fleet-wide firmware update, tunnel exit of a convoy) causes all vehicles to retry at exactly retry_after_ms → second-order storm.
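A client-side sketch of that loop (response shape follows PutPartResponse above; backoff constants are illustrative):

import random
import time

def backoff_sleep(retry_after_ms: int) -> None:
    # Server hint + up to 100% jitter: a fleet-wide event (firmware push,
    # convoy leaving a tunnel) must not re-synchronize into a retry storm.
    time.sleep((retry_after_ms + random.uniform(0, retry_after_ms)) / 1000.0)

def put_part_with_backpressure(put_part, part, max_attempts: int = 8):
    for attempt in range(max_attempts):
        resp = put_part(part)
        if resp.retry_after_ms == 0:
            return resp
        # Grow the wait if the server keeps pushing back (capped exponent).
        backoff_sleep(resp.retry_after_ms * (1 << min(attempt, 4)))
    raise RuntimeError("backpressure never cleared; leave part in WAL")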

Failure modes.

  • Half-uploaded part, client crash: on reboot, scan WAL, resume with same upload_id; server treats identical part-hash PUT as idempotent no-op.
  • Corrupt bit flip during upload: per-part SHA256 catches it; server returns HASH_MISMATCH; client re-reads from WAL and retries. If WAL also has the corrupt byte (NVMe error), client escalates to re-capture (if still in range) or logs uncorrectable.
  • Clock skew: GPS-synced; if GPS lost, vehicle switches to monotonic + last-known-good; attestation service detects skew > 60s and flags capture for triage.
  • Upload service rolls mid-commit: staging state is in Bigtable, not RAM; the new instance resumes from the staging row. Commit is idempotent on aggregate_sha256.

Real systems named. tus.io (byte-range resumable spec), AWS S3 multipart, GCS resumable upload protocol, YouTube's Resumable Upload for mobile creators, Mapillary's upload API (similar problem, similar shape — they use 10 MB chunks with content hash dedup).


7.2 Geo-indexing at planet scale — S2 vs geohash vs R-tree, quantified

Why critical. Two workloads with conflicting requirements:

  1. Ingest-side sharding: 300M writes/day need locality so that "captures in lower Manhattan" land on a small number of Spanner splits.
  2. Serve-side range: "panos within 50m of user's tap" needs a <10ms range scan to answer /v1/pano/lookup.

A bad geo-key choice ruins both. Hot cells (Times Square, Shibuya, Piccadilly) will melt a naive scheme.

Alternatives quantified.

| Scheme | Locality | Neighbor queries | Hot-spot mitigation | Write scalability | Index size | Chosen? |
|--------|----------|------------------|---------------------|-------------------|------------|---------|
| S2 cells (Hilbert curve) | Excellent — Hilbert preserves 2D locality; neighbors usually share key prefixes | 8 neighbors via the cell-id API; descendants via a [range_min, range_max] scan | Sub-cell hashing at L14 for known hot cells | Excellent (sharded on cell_l10) | 64-bit uint | YES |
| Geohash (base32 z-order) | Good (z-order), but seam discontinuities at the equator/meridians | Neighbor heuristics need seam special-casing | Same sub-hash trick, but seams still hurt | Good | ~12-char string | Rejected (see below) |
| R-tree (Postgres PostGIS) | Arbitrary rect queries in O(log N); great for polygons | Native via GiST index | Rebalance on insert → hot-node write amplification | Poor at this scale (single node or manual sharding) | Tree depth ~7 at 1T rows | Rejected for write path |
| Z-order / Morton curve | Similar to geohash | Same seam issues | Same | Similar | 64-bit | Rejected: strict subset of S2's benefits |
| Uber H3 (hex grid) | Great — hex neighbors are uniform | 6 neighbors, uniform | Sub-hex hashing possible | Great | 64-bit | Close second — we'd pick H3 if we needed uniform cell area (e.g., ride dispatch); we don't |

Why S2 wins for us, concretely.

  1. Hilbert locality in Bigtable row keys. Bigtable is lex-ordered on row key. S2 cell IDs encoded in Hilbert order mean two geographically-close cells usually have keys close in byte space. A scan for "all captures in a 2km box" hits 1-3 Bigtable splits. Geohash has seam cases where adjacent cells differ in top byte → 2× more splits hit in worst case.

  2. Variable resolution. One cell ID encodes both cell and resolution in the same 64-bit int. I can store s2_cell_l10 as the shard key (~80 km² buckets, coarse enough to balance) and s2_cell_l14 as the serving key (~550 m cells for "nearby"). Geohash requires two different strings.

  3. Area uniformity is acceptable, not perfect. S2 cells vary ~2× in area across the globe (cube projection distortion). For our serving use case (nearby-pano lookup) 2× is fine — we just request a slightly larger radius. H3 would be uniform hex but lacks the 20-year battle-tested C++/Java/Go library ecosystem S2 has (Google's s2geometry), and at Google — using Google's library is table stakes.

  4. Hot-cell mitigation. For known hot cells (the Times Square L14 cell may carry 100K captures/month), we salt the row key (runnable sketch after this list):

    // Normal:     row_key = s2_cell_l14 | time_reverse | capture_id
    // Hot-cell:   row_key = s2_cell_l14 | hash(capture_id) % 16 | time_reverse | capture_id
    

    The hash % 16 splits the cell across 16 sub-shards → scans for that cell fan out, but total volume is manageable. Hot-cell list maintained by a daily job that analyzes per-cell QPS; promoted/demoted automatically.

  5. Neighbor queries. S2CellId::AppendAllNeighbors(level, &out) yields the 8-connected neighbors at the same level. Use it for "when stitching, give me captures in adjacent cells at level 14 within ±30s" — a single range scan per neighbor cell.
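A runnable sketch of the salting scheme and the reader-side fanout it implies (key layout and fanout constant are illustrative):

import hashlib
import struct

HOT_CELLS: set = set()   # maintained by the daily per-cell QPS job
FANOUT = 16

def salted_row_key(s2_cell_l14: int, ts_rev: bytes, capture_id: str) -> bytes:
    cell = struct.pack(">Q", s2_cell_l14)
    if s2_cell_l14 in HOT_CELLS:
        salt = int(hashlib.md5(capture_id.encode()).hexdigest(), 16) % FANOUT
        return b"#".join([cell, bytes([salt]), ts_rev, capture_id.encode()])
    return b"#".join([cell, ts_rev, capture_id.encode()])

def scan_prefixes(s2_cell_l14: int) -> list:
    # Readers fan out over every sub-shard of a hot cell; cold cells stay one scan.
    cell = struct.pack(">Q", s2_cell_l14)
    if s2_cell_l14 not in HOT_CELLS:
        return [cell + b"#"]
    return [cell + b"#" + bytes([s]) + b"#" for s in range(FANOUT)]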

Operational tuning learned.

  • Level choice for sharding: there are ~6M L10 cells worldwide (6 × 4¹⁰), most of them ocean or empty; we use L10 as the shard bucket and let Spanner auto-split within it. Even if the hottest metro L10 cell carried a few percent of the global ~3.5K writes/s, that's ~100 writes/s at peak — comfortable for a single Spanner split.
  • Level choice for serving: L14 (~550 m across) is the natural pano "snap" unit — two panos in the same L14 cell are effectively on the same block.
  • Don't index on raw lat/lng. We learned this elsewhere — a B-tree on (lat, lng) has terrible range-scan semantics for 2D regions because B-tree linearizes on lat first, so "within X km" degenerates to a full lng scan per matching lat slice.

Real systems named. Google S2 (used internally in Maps, YouTube, Adwords geo-targeting), Uber H3 (ride dispatch), OpenStreetMap Nominatim (geohash), PostGIS (R-tree GiST), Mapbox vector tiles (Mercator quad-key — basically geohash).


7.3 PII blurring pipeline — the hard privacy invariant at fleet scale

Why critical. One unblurred face served in the EU → class-action, fine up to 4% revenue, press catastrophe. Germany's 2010 Street View opt-out was a direct cost — parts of Germany still have no imagery because takedown-at-collection was too expensive to retrofit. The blur pipeline is the risk-carrying component. Everything else can be rebuilt from logs; an unblurred pano served is not recoverable.

The hard invariants.

  1. No blob in the pano-tiles-* bucket is public-readable until blur_status = passed in Spanner. Enforced by ACL at bucket level + signed-URL flow at serve tier.
  2. Blur is deterministic. Same input + same model_version → same output bytes. Lets us (a) audit (rehash + compare), (b) re-blur cold data if model improves, (c) reconcile across regions.
  3. Detection false-negative rate (FN — missed face/plate) < 0.1% audited. A regression that pushes FN to 1% must trigger auto-rollback within 30 min.

Alternatives considered.

| Approach | Throughput | FN rate | Cost/day (300M panos) | Audit | Chosen? |
|----------|------------|---------|------------------------|-------|---------|
| Two-stage: face detector (YOLO-variant) + plate detector (custom CNN) + Gaussian blur | 5 pano/s per T4 GPU | ~0.05% when tuned | ~$12K/day (T4 fleet) | Deterministic, replayable | YES |
| Panoptic segmentation covering faces + plates as classes | 2 pano/s per T4 | ~0.08% | ~$30K/day | Harder to retrain per class | Rejected: worse throughput/cost; joint retraining is slower |
| Human-in-loop review | ~1 pano/s per reviewer | ~0% | $0.10/pano × 300M = $30M/day (!!) | Perfect | Rejected for the primary path; retained for appeals + audit sampling |
| On-device blur at capture | Offloads cloud compute; saves upload bandwidth | ~1% — model constrained by the vehicle SoC | ~$0 cloud compute; ~$500/vehicle hardware | Hard to update or audit a fleet-resident model | Rejected as primary — no model upgrade without a fleet flash; kept feature-flagged as a Phase-2 option for low-priority regions |

Throughput math (GPU fleet sizing).

  • 300M pano/day ÷ 86400s = 3472 pano/sec steady-state.
  • 1 T4 GPU @ 5 pano/s → need 695 GPUs steady.
  • Peak 4× → need ~2800 GPUs burst. We run 2× steady (1400) reserved + autoscale burst from preemptible pool.
  • Cost: 1400 × $0.35/hr × 24h = $11,760/day on-demand; with committed-use discount ~$7K/day; with preemptible for burst, ~$9K/day blended.
  • Single pano latency: ~180ms (6 sub-images × 30ms); acceptable because pipeline stage is throughput-bound not latency-bound.
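The sizing arithmetic, reproducible (rates are this section's assumptions):

import math

PANO_PER_DAY, PANO_PER_GPU_S, USD_PER_GPU_HR = 300e6, 5.0, 0.35

steady   = PANO_PER_DAY / 86_400 / PANO_PER_GPU_S   # ~695 GPUs
peak     = 4 * steady                               # ~2,800 at 4x surge
reserved = math.ceil(2 * steady)                    # run 2x steady reserved
print(f"steady {steady:.0f}, peak {peak:.0f}, reserved {reserved}, "
      f"~${reserved * USD_PER_GPU_HR * 24:,.0f}/day on-demand")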

The hard-gate invariant — how it's enforced.

1. Pipeline writes stitched_blob_ref to internal-only bucket "pano-internal-{region}".
   ACL: read = pipeline-SA, write = pipeline-SA. No IAM binding to CDN, no public read.

2. Blur stage:
   - Reads stitched_blob from internal bucket.
   - Runs detectors → deterministic_blur → output bytes B.
   - Writes B to "pano-tiles-{region}" with generation metadata
     {model_ver, detector_ver, blur_proof_hash=h(B)}.
   - Updates Spanner: blur_status = passed, blur_model_ver = X, blur_proof_hash = h(B).

3. Tile serving:
   - Tile origin reads blob from "pano-tiles-*" and ALSO reads Spanner row.
   - REQUIRES blur_status = passed AND blur_proof_hash matches h(actual blob bytes)
     before returning to CDN. (Belt + suspenders: ACL should make this impossible
     already, but the Spanner check catches any bucket-ACL mistake.)

4. Takedown:
   - Sets blur_status = retracted and takedown_status = pending (→ done once purges complete).
   - CDN purge by row-key prefix.
   - Origin denies on retracted/pending even if the blob is still present.
   - Async job overwrites the blob with a takedown placeholder.
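A sketch of the origin's gate (step 3), with a dict standing in for the Spanner row; bucket ACLs should make the hash branch unreachable, which is exactly why it is cheap insurance:

import hashlib

class ServeDenied(Exception):
    pass

def gate_tile(row: dict, blob: bytes) -> bytes:
    if row["blur_status"] != "passed":
        raise ServeDenied(f"blur gate: {row['blur_status']}")
    if row["takedown_status"] != "none":
        raise ServeDenied("takedown in effect")
    if hashlib.sha256(blob).hexdigest() != row["blur_proof_hash"]:
        # ACLs should make this unreachable; catching it here converts a
        # compliance incident into a 5xx plus a page.
        raise ServeDenied("blur proof hash mismatch")
    return blob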

Deterministic blur — why it matters at L7.

We chose seeded Gaussian over random occlusion. Seed = h(capture_id || model_ver). Properties:

  • Re-blur produces identical bytes → hash check = integrity check.
  • Across regions: US blur and EU blur of same capture produce same bytes → no cross-region drift.
  • Audit: sample 0.1% of panos; re-run blur offline; compare hashes. A mismatch = pipeline bug or bit-rot → page.
  • The L7 insight: non-deterministic blur (what naive teams ship first) means you CAN'T re-blur on model upgrade without invalidating downstream derivatives. Deterministic blur means the re-blur of cold imagery on model v2 writes identical bytes for unchanged regions (where old model was already correct) — so tile cache invalidation is surgical, not whole-pano.
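A minimal sketch of content-seeded, byte-reproducible blurring. The real pipeline blurs detector boxes with a Gaussian; here a box blur plus seeded dither stands in, and every random element derives from h(capture_id || model_ver):

import hashlib
import numpy as np

def blur_seed(capture_id: str, model_ver: str) -> int:
    digest = hashlib.sha256(f"{capture_id}|{model_ver}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def blur_box(img: np.ndarray, x0: int, y0: int, x1: int, y1: int,
             capture_id: str, model_ver: str, k: int = 15) -> np.ndarray:
    rng = np.random.default_rng(blur_seed(capture_id, model_ver))
    region = img[y0:y1, x0:x1].astype(np.float32)
    pad = k // 2
    padded = np.pad(region, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(region)
    for dy in range(k):        # k×k mean; production would use a separable pass
        for dx in range(k):
            out += padded[dy:dy + region.shape[0], dx:dx + region.shape[1]]
    out /= k * k
    out += rng.uniform(-1, 1, out.shape)   # dither is seeded -> still deterministic
    img[y0:y1, x0:x1] = np.clip(out, 0, 255).astype(img.dtype)
    return img

# Same inputs + same model version -> byte-identical output, so a re-run's
# hash doubles as an integrity check.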

Re-blur on model upgrade — the workflow.

Every 3-6 months a new blur model ships with 20-30% lower FN rate. Requirement: re-process historic imagery so serve-time FN is always below threshold.

  • Scheduler enumerates captures where blur_model_ver < current in priority order (recent + high-traffic cells first).
  • Throttle to <10% of steady GPU capacity to avoid contending with live ingest.
  • Cost math: the 10-year archive holds ~1.1T captures. Re-blurring everything = 1.1×10¹² ÷ (5 pano/s × 1400 GPUs) ≈ 5 years of the whole fleet — untenable. So:
    • Prioritize by recency × view-count.
    • Re-blur on read (lazy): if blur_model_ver < threshold when tile requested → re-blur with new model, write tile, serve with +1s latency first-time. Subsequent requests cached.
    • Only eagerly re-blur top-10% by traffic; tail lazy.
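A sketch of the read-path decision (model versions as integers; thresholds illustrative):

CURRENT_MODEL = 7
MIN_SERVABLE  = 5   # FN-rate floor: tiles blurred by older models must not serve

def on_tile_request(meta: dict, enqueue_reblur, serve):
    ver = meta["blur_model_ver"]
    if ver >= MIN_SERVABLE:
        if ver < CURRENT_MODEL:
            # Servable but stale: upgrade opportunistically, serve now.
            enqueue_reblur(meta["capture_id"], priority="lazy")
        return serve(meta)
    # Below the floor: block, re-blur synchronously (~+1s on first hit only).
    enqueue_reblur(meta["capture_id"], priority="sync")
    return serve(meta, wait_for_model=CURRENT_MODEL)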

Failure modes of the blur stage.

| Failure | Detection | Mitigation | Recovery |
|---------|-----------|------------|----------|
| Model deploys with higher FN (regression) | Shadow-model online eval; sampled human audit; FN-rate SLO alert | Circuit breaker: roll model_ver back in Spanner config; hot-reload GPU workers within 10 min | Deterministically re-blur captures processed between bad deploy and rollback, using the rolled-back version |
| GPU worker OOM on a giant pano (e.g., malformed >100MB) | Worker crash → pod restart; retry count exceeds threshold → quarantine | Size cap at 80MB pre-blur; quarantine oversize to an offline queue for manual inspection | Manual review; likely a corrupted capture |
| Blur produces all-black output (crop bug) | Output-size anomaly detection (expected output ≈ input ± 5%) | Quarantine capture; alert | Fix code; re-run from pipeline_state |
| Detector misses novel content (masks, unusual headwear) | Human audit sample; user takedown reports | Escalate to detector team; add to training set | Lazy re-blur on the next model version; explicit takedown in the interim |
| Raw bucket ACL misconfig (public read) | Continuous IAM policy lint in CI/CD; access-log anomaly detection | Automatic IAM revert via org-policy guardrail | Purge any accessed blobs from CDN; audit access logs for exposure |

Real systems named. Google Street View's known blur pipeline (public talks by Google Maps team reference 2013-2015 architecture); Mapillary's detector pipeline (open-source variant); YOLOv8/RetinaFace for face detection; OpenALPR for plates; Apple's privacy-first mobile mapping.


8 Failure Modes & Resilience (system-wide) #

| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|-----------|---------|-----------|--------------|------------|----------|
| Vehicle cellular | Signal lost in a tunnel for 30 min | Upload-agent reconnect timeout | 1 vehicle, up to 24h of WAL'd data | 24h local NVMe WAL; QUIC for fast reconnect | Drain WAL on reconnect with part-level idempotency |
| Regional ingest | Entire GCP region outage | Health probes; LB failover | All vehicles homed to the region, up to 24h | Global anycast (Cloud Load Balancing) steers to the next-nearest region; failover pre-warmed | Vehicles retry with an alt-region upload_id; on restore, reconcile via aggregate_sha256 idempotency |
| Spanner | Zone failure | Built-in auto-failover | Metadata writes pause ~30s | Spanner multi-region; no custom action | Writes resume; no data loss |
| GCS raw bucket | Data corruption | Per-part SHA256; background integrity scan; EC parity check | The corrupted objects | Reed-Solomon (9,4) auto-repairs bit rot; multi-region redundancy | Recompute from parity; on total loss, alert + request recapture if recent |
| Pub/Sub backlog | Processing backs up > 30 min | Backlog-depth SLO metric | Freshness SLA breach; stale tiles | Dataflow autoscaling; replay via 7d topic retention | Drain with priority lanes (fresh before backlog) |
| Dedupe stage | Wrong "best" pick → visible quality regression | Sampling dashboards; user feedback | Cell-level quality | Re-run dedupe with an improved score fn; demote the old pick | Roll forward |
| Blur model | FN-rate regression | Shadow eval + human audit; FN-rate SLI alert | Panos processed between deploy & rollback | Auto-rollback model_ver; re-blur affected batch | Re-blur with the good model; purge CDN |
| Pipeline input | Corrupt image poisons pipeline (detector OOM/exception) | Worker crash-loop detection | Single capture quarantined; ~1s pipeline stall | Resource limits; quarantine oversize/malformed inputs | Quarantine + re-inspect |
| S2 hot cell | Times Square-class cell outgrows its split | Bigtable hot-tablet alert; p99 latency jump | Single cell (~550 m across) | Sub-cell hash-salted row key (16-way), promoted from the hot-cell list | Online; Bigtable auto-splits, no downtime |
| Reprocessing replay | Stampede spikes Spanner write QPS | Rate-limit saturation metrics | Pipeline slows; ingest unaffected (separate Spanner instance) | Hard rate-limit reprocessing to 10% of steady capacity; priority queues | Rate limiter restores control within minutes |
| Tile CDN | Purge storm (mass takedown) | CDN purge-rate spike; global CDN ops visibility | Temporary p99 degradation on cold tiles | Purge by prefix instead of per-key; stage large batches over 4h | Complete within the 24h takedown SLA |
| Attestation service | Outage — cert validation fails | Health checks | Vehicles can't commit new uploads, but keep buffering | 24h WAL absorbs the outage; in-progress sessions fail open with a time-bounded token | Restore service; cached certs validate on-vehicle |
| Vehicle identity | Cert leak / clone | Upload anomaly detection (same cert in two locations) | Up to one vehicle's worth of fake data | Auto-revoke cert; force re-attestation; reject post-revoke captures | Forensics on suspect captures; purge if needed |

Paging philosophy. SRE pager carries SLO-breach alerts:

  • P0 — blur SLO breach (any unblurred-served-to-public event; any ACL misconfig); serve-path availability < 99.99%.
  • P1 — ingest SLO breach (region-wide); pipeline freshness > 24h; FN-rate > threshold.
  • P2 — per-vehicle anomaly; hot-cell split delay.
  • Runbooks are keyed to each alert; most start with a rollback step (kubectl rollout undo or a config-service revert).

9 Evolution Path #

v1 — "Single region, batch, prove it works" (first 6 months)

  • One GCP region (us-central1). All vehicles upload here regardless of location (OK, fleet starts in SF).
  • Upload service on Cloud Run; GCS regional bucket; Spanner single-region.
  • Processing: Cloud Scheduler → Cloud Run batch jobs, runs every 15 min over new captures.
  • Blur: CPU inference (not GPU) because volume is ~100K/day total.
  • Serving: no CDN; direct origin at small scale. Pano viewer is internal-only demo.
  • Scope: 10 vehicles, one city.
  • Success criteria: e2e ingest → blur → viewable in <1h P95; durability check passes.

v2 — "Multi-region ingest, streaming, CDN serving" (months 6-18)

  • Regional ingest in NA-East/West, EU, JP. Spanner multi-region. Dual-region GCS for raw.
  • Dataflow streaming pipeline replaces batch: Pub/Sub → PTransforms → exactly-once writes.
  • Blur migrates to GPU fleet (T4) as volume crosses ~5M/day.
  • Tile origin + Google Cloud CDN; first public launch for Street View end-users in tier-1 metros.
  • Dedupe: LSH-based, Bigtable-backed.
  • SLI/SLO + auto-rollback for blur model.
  • Scope: 1000 vehicles, 20 cities.
  • Success criteria: 99.99% serve availability; FN-rate < 0.1%; ingest P95 < 6h.

v3 — "Planet-scale, priority lanes, continuous reprocessing" (months 18+)

  • 10 regional ingest endpoints; edge PoPs for upload termination where latency matters.
  • Fresh-imagery priority lane: construction zones, disaster response, new road openings have their own Pub/Sub subscription with dedicated pipeline capacity.
  • On-demand re-blur scheduler: new model deploys, top-10% traffic re-blurred eagerly, tail lazy.
  • Spanner hierarchical sharding by S2 L10; hot-cell sub-sharding automatic.
  • Tiered storage fully online: 30d hot → 90d warm → cold archive with object lifecycle.
  • BigQuery federated export to ML training feature store.
  • Device attestation upgraded to TPM 2.0 with per-capture signed manifest in protocol.
  • Takedown workflow fully automated with 24h CDN-purge SLA.
  • Cost optimization: preemptible GPU burst for blur; committed-use discounts for sustained.
  • Scope: 10K vehicles, global.
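
The tiered-storage step maps directly onto GCS object lifecycle rules; a sketch using google-cloud-storage's lifecycle helpers, where the 30/120-day thresholds encode our 30d-hot + 90d-warm policy:

```python
from google.cloud import storage

def apply_tiering(bucket_name: str) -> None:
    """Encode 30d hot -> 90d warm -> archive as bucket lifecycle rules;
    GCS then demotes objects server-side with no pipeline involvement."""
    bucket = storage.Client().get_bucket(bucket_name)
    bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
    bucket.add_lifecycle_set_storage_class_rule("ARCHIVE", age=120)  # 30 + 90
    bucket.patch()
```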

v4 candidates (speculative). On-device blur with verifiable attestation (cuts ingest bandwidth 3-5×); federated learning for detector improvement from declined-for-upload examples; differential privacy for statistical aggregate release.


10 Out-of-1-Hour Notes #

10.1 SLAM / photogrammetry

  • Full depth-map reconstruction per pano uses neighboring captures (temporal stereo) + IMU priors.
  • Output: per-pixel depth + per-pano pose refinement (GPS is only accurate to ~3m; SLAM refines to sub-meter relative pose between neighboring panos).
  • Used downstream for Map Generation (building facade extraction, road geometry).
  • Compute cost: ~10× blur cost; typically run at lower priority on cold data (weeks after capture) rather than real-time.
  • Real systems: COLMAP, ORB-SLAM3, Google's proprietary photogrammetry stack.

10.2 Privacy regulations by jurisdiction

  • EU (GDPR): blur mandatory before publish. Right to erasure: user can request pano removed from a specific address within 30 days.
  • Germany: the 2010 rollout set the precedent: residents may opt visible houses out (the full "Verpixelungsrecht"); parts of Germany remain uncovered. Any future German expansion requires a house-level opt-out UI honored before publish.
  • California (CCPA): similar to GDPR for personal information. Faces = personal info.
  • Japan: looser on plates, stricter on residential detail.
  • India (DPDP 2023): emerging; face blur + explicit consent for sensitive locations (temples, defense).
  • Architecture implication: per-region takedown_policy config; blur + takedown are pluggable per-country; separate worker queues for jurisdictions with different rules to guarantee policy isolation.
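
To make the architecture implication concrete, a sketch of a per-jurisdiction policy table; every field name and value here is an illustrative assumption, not legal guidance:

```python
# Illustrative per-jurisdiction policy table (values are assumptions).
TAKEDOWN_POLICY = {
    "DE":    dict(house_optout=True,  pre_publish_hold=True,  erasure_sla_days=30),
    "EU":    dict(house_optout=False, pre_publish_hold=False, erasure_sla_days=30),
    "US-CA": dict(house_optout=False, pre_publish_hold=False, erasure_sla_days=45),
    "JP":    dict(house_optout=False, pre_publish_hold=False, erasure_sla_days=30),
    "IN":    dict(house_optout=False, pre_publish_hold=True,  erasure_sla_days=30),
}

def takedown_queue(jurisdiction: str) -> str:
    """One worker queue per jurisdiction guarantees policy isolation:
    a DE rule change can never bleed into the JP pipeline."""
    return f"takedown-workers-{jurisdiction.lower()}"
```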

10.3 Takedown workflow (right to be forgotten on panos)

  • User-facing web form → captures (lat, lng, radius, reason).
  • Operator reviews; if approved:
    • S2 → list of affected captures.
    • Mark captures takedown_status = active.
    • Purge CDN by row-key prefix (tile rows).
    • Overwrite blurred blob with takedown-placeholder (keeping raw in cold for legal record but making it inaccessible via serve path).
    • Propagate to downstream ML exports — BigQuery has a takedown_status column; training jobs MUST filter.
  • Auditable: every takedown logged with ticket ID + operator ID + affected capture_ids; 7-year retention.
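
The "S2 → list of affected captures" step above is a region covering; a sketch with the open-source s2sphere library (the level bounds and cell budget are tuning assumptions):

```python
import math
import s2sphere  # open-source S2 port: pip install s2sphere

EARTH_RADIUS_M = 6371000.0

def affected_cells(lat: float, lng: float, radius_m: float):
    """Cover the takedown disc with S2 cells; each cell ID then becomes a
    range scan over the geo-index to find affected capture_ids."""
    axis = s2sphere.LatLng.from_degrees(lat, lng).to_point()
    angle = s2sphere.Angle.from_degrees(math.degrees(radius_m / EARTH_RADIUS_M))
    cap = s2sphere.Cap.from_axis_angle(axis, angle)
    coverer = s2sphere.RegionCoverer()
    coverer.min_level, coverer.max_level, coverer.max_cells = 12, 16, 64
    return coverer.get_covering(cap)
```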

10.4 Vehicle fleet security

  • Device attestation: TPM 2.0 generates boot quote → attestation service verifies measured boot + firmware signature. Fails → cert not issued → vehicle can't upload.
  • Per-capture signing: vehicle signs manifest hash with TPM-resident key; server verifies signature chain up to fleet CA. Detects: cert clone, man-in-the-middle, spoofed GPS.
  • Anti-tamper physical: tamper-evident seals on camera rig; any physical tamper → TPM attestation fails on next boot.
  • Rotation: certs rotate every 90 days; any anomaly forces immediate re-attestation.
  • Supply-chain: firmware signed by cross-signed Google + vendor keys; verified at boot.
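
A sketch of the server-side manifest check, assuming an ECDSA vehicle key and using the pyca/cryptography API; chain validation up to the fleet CA and revocation checks are elided:

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.x509 import load_pem_x509_certificate

def verify_capture(manifest: bytes, signature: bytes, cert_pem: bytes) -> bool:
    """Verify the vehicle's TPM-resident key signed this manifest hash.
    Raises cryptography.exceptions.InvalidSignature on any mismatch, which
    the caller treats as a rejected capture plus an anomaly event."""
    cert = load_pem_x509_certificate(cert_pem)
    cert.public_key().verify(signature, manifest, ec.ECDSA(hashes.SHA256()))
    return True
```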

10.5 Cost model

  • Hot storage: GCS Standard ~$0.020/GB/mo × 550 PB = $11M/mo.
  • Warm: GCS Nearline ~$0.010/GB/mo × 1.65 EB = $17M/mo.
  • Cold (10y archive): GCS Archive ~$0.004/GB/mo × 25 EB = $100M/mo at year-10 steady state.
  • Blur compute: ~$9K/day × 365 = $3.3M/yr.
  • Stitch + SLAM compute: ~$30K/day × 365 = $11M/yr.
  • CDN egress: 1.5 TB/s peak, de-rated to an average, × ~$0.08/GB with only ~30% of traffic billable after caching = O($100M/yr) egress. (This is the biggest variable cost; peering agreements help.)
  • Ingest bandwidth: 174 GB/s sustained × O($0.01/GB) ingress (often free on GCP within-region) = negligible vs egress.
  • Spanner: 10K nodes × ~$0.90/node-hr × 8,760 hr/yr ≈ $80M/yr (this is a concern; we'd push some workloads to Bigtable).
  • Rule of thumb: Street-View-scale imagery is an O($1B-2B/yr) infrastructure line item at 10K-vehicle fleet scale, before vehicle capex. This is what justifies aggressive dedupe + lifecycle tiering + deterministic re-blur (so model upgrades don't 10× the blur cost).
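
Annualizing the line items above confirms the rule of thumb; egress is taken at a mid-band $200M/yr, which is our assumption:

```python
# Back-of-envelope annualization, all figures in $M/yr.
storage = 12 * (11 + 17 + 100)   # hot + warm + cold(year-10) = 1,536
compute = 3.3 + 11               # blur + stitch/SLAM
egress  = 200                    # O($100M/yr) order; mid-band assumption
spanner = 80
total = storage + compute + egress + spanner
print(f"~${total:,.0f}M/yr")     # ~ $1,830M/yr, inside the $1-2B band
```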

10.6 Observability

SLIs (customer-facing):

  • serve.availability = 2xx / total requests — target 99.99% over 30d.
  • serve.latency_p99 per region — target <300ms.
  • blur.fn_rate (from audit sampling) — target <0.1%.
  • ingest.commit_p95 — target <6h.

SLIs (internal):

  • pipeline.freshness_per_stage_p95 — target <2h blur, <4h tile.
  • takedown.propagation_p95 — target <24h.
  • hot_cell.p99_latency — red-flag when a single cell's p99 spikes.
  • per_vehicle.upload_success_rate — per-vehicle SLI; catches bad camera rigs.
  • per_model.blur_fn_rate — per deployed model version, tracked pre-prod and in prod.

Alerting:

  • Multi-burn-rate error budget alerts (Google SRE style).
  • Blur FN rate has a fast-burn + auto-rollback integration: >3× baseline FN for 5 min → automated model rollback (human-in-loop notification but action is automatic because the privacy risk is too high to wait for paging).
  • Hot-cell alert: any L14 cell exceeding 100× median QPS → automatic sub-cell hash-salt promotion.
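
The burn-rate arithmetic behind the multi-burn-rate alerts, with the standard fast-burn threshold as a worked check:

```python
def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """Burn rate = observed error ratio / error-budget ratio. At a 99.99%
    SLO the budget ratio is 1e-4, so 0.144% errors over the window is a
    14.4x burn: the classic page-now threshold (budget gone in ~2 days)."""
    return observed_error_ratio / (1.0 - slo)

assert round(burn_rate(0.00144, 0.9999), 1) == 14.4
```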

Dashboards:

  • Global ingest map (volume by region, freshness by region).
  • Per-stage pipeline latency heatmap.
  • Cost-per-capture breakdown (storage/compute/egress).
  • Blur audit dashboard with sampled FN examples for human reviewer triage.

Self-verification (before I submit) #

  • SRE pager-carryable? Yes — every failure mode has a detection signal, a mitigation, and a recovery path. Runbooks are keyed to alert names. Blur FN-rate is the only fully-automated rollback; everything else is human-in-loop.
  • Every diagram arrow → API/data flow? Yes. Upload arrow → InitiateUpload/PutPart/Commit (section 4). Pub/Sub fanout → event schemas (section 4). Serving arrows → GET /v1/pano (section 4). Metadata lookup → Spanner row schemas (section 5).
  • L7 vs L6 depth on deep-dives?
    • Deep-dive 7.1 (upload): L7 — covers part-size billing asymmetry, DWPD on vehicle NVMe, jitter on Retry-After; L6 would stop at "use multipart."
    • Deep-dive 7.2 (geo-index): L7 — quantifies S2 vs H3 vs geohash with Hilbert-locality argument, hot-cell sub-shard mitigation; L6 would say "use S2."
    • Deep-dive 7.3 (blur): L7 — deterministic blur seed for re-blur idempotency, ACL belt-and-suspenders invariant, lazy vs eager re-blur economics; L6 would say "run a face detector."
  • Volume reconciliation (mechanically re-checked below):
    • Raw: 300M panos/day × 50 MB = 15 PB/day → 30d hot = 450 PB → 10 regional GCS buckets × ~45 PB each (feasible).
    • Processed: 3.3 PB/day → 30d hot = 100 PB.
    • CDN: 1M views/sec × 30 tiles/session × 50 KB = 1.5 TB/s peak → consistent with Google Global Cache scale.
    • Metadata: 300M rows/day × 365 × 10yr ≈ 1.1T rows → Spanner feasible with L10 sharding.
    • All numbers reconcile.
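
The same reconciliation as a mechanical check; only unit constants are introduced, no new numbers:

```python
# Mechanical re-check of the reconciliation bullets above.
panos_day = 300e6
raw_pb_day = panos_day * 50e6 / 1e15              # bytes -> PB
assert round(raw_pb_day) == 15                    # 15 PB/day raw
assert round(raw_pb_day * 30) == 450              # 30d hot = 450 PB
peak_tb_s = 1e6 * 30 * 50e3 / 1e12                # views/s x tiles x bytes
assert peak_tb_s == 1.5                           # 1.5 TB/s CDN peak
rows = panos_day * 365 * 10
assert round(rows / 1e12, 1) == 1.1               # ~1.1T metadata rows
```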
