Q11 Product & Edge Systems
Design a Menu Update System for a Global Restaurant Chain
Push daily menu updates from HQ to thousands of restaurant locations and keep every device in a store on the same version.
1 Problem Restatement & Clarifying Questions #
Restatement in candidate's words
HQ maintains a logical menu composed of (a) a global base, (b) a regional layer (e.g., APAC, EU), (c) a country layer (France, Japan), and (d) a store layer (airport store vs highway store). Three times a day, per meal, a new menu variant must go live. ~10K stores × ~10 display devices per store; each device shows the same menu state as its siblings at meal transitions (no "register 1 sells bagels, register 2 sells burgers" at 10:31 AM). Stores lose WAN connectivity for anything from seconds (last-mile glitch) to days (natural disaster). The design must guarantee: menus keep working offline, menus catch up on reconnect, bad menus don't take down the chain, and transitions happen within ~1 second across a store.
Clarifying questions I'd ask the interviewer in the first ~3 minutes
| # | Question | Why it matters to the design |
|---|---|---|
| Q1 | Device heterogeneity? Android tablet (ARM, 2-8 GB RAM, WiFi), thin POS terminal (Windows Embedded, 1 GB RAM), digital menu board (HDMI, RPi-class), kitchen display (KDS)? | Determines whether we can assume an on-device daemon, SQLite, mTLS — or must handle "dumb" HDMI boards that only render what a store controller serves them. Assumption: Android/Linux device + a per-store "controller" box (small form-factor Linux PC like an Intel NUC). |
| Q2 | Offline tolerance SLO? Must a store stay open for 4 hours offline? 24 hours? 7 days? | Drives local cache size (variants × days × 2MB each) and regulatory staleness limits. Assumption: 72 h offline with last-known-good; hard expiry at 7 days per food-safety + EU FIC allergen-accuracy rules. |
| Q3 | Menu blob size? Text JSON only, or JSON plus image assets? | Decides CDN vs. direct API, pre-fetch windows, per-store storage. Assumption: ~2 MB per variant (JSON + compressed thumbnails); video ads handled by a separate DSA system (out of scope). |
| Q4 | SKU count? 200 global items, 50 regional overrides, 20 local? | Sizes the merge pipeline and delta encoding. Assumption: ~500 SKUs per resolved menu, so merge cost is microseconds. |
| Q5 | Regulatory posture? EU Food Information for Consumers (FIC) allergen mandate, FDA calorie disclosure, Japan halal labeling, China licensing per-store? | Forces per-country publish gates + regulator sign-off as a publish step; regulatory-override later in the day must not be blocked by the 3× day cadence. |
| Q6 | Language / script? UTF-8 everywhere? RTL (Arabic) layout changes? Pre-baked vs on-device localization? | Assumption: server-side pre-localization — HQ publishes N×lang variants so devices are dumb. Cheaper to burn CDN bytes than to ship i18n runtime to every POS. |
| Q7 | Who owns "price"? Is this an advertising menu (display only) or the authoritative source that POS charges from? | Huge. If POS charges from this blob, we are on the critical path to revenue and need transaction-safe publishing (2PC with POS, rollback guardrails). Assumption per probes: display authority only; pricing authority is the POS system; this blob is what the customer sees — pricing consistency is handled elsewhere. Still high-stakes: wrong allergen label = lawsuit. |
| Q8 | Update frequency upper bound? Truly 3/day or could it be every 5 minutes during a promo? | Assumption: 3 scheduled per day + occasional emergency push (recall an ingredient). Emergency path must exist but need not be the 99% path. |
| Q9 | A/B testing granularity? Per-region? Per-store? Per-device? | Assumption: per-store (not per-device — a device-level A/B inside one store would violate intra-store consistency). |
| Q10 | Ownership model inside the store? Is there a store manager UI, or is this fully HQ-driven? | Assumption: HQ-driven by default; manager has local override for "sold out" items (stripped down, non-schema-changing) — out of scope for v1. |
Remainder of the doc assumes the above.
2 Functional Requirements #
In scope (numbered)
- FR1 — Publish menu variant. Author at HQ creates a versioned menu variant (version_id, base, overrides, assets, effective_from, effective_until, target_filter) and it becomes durable, signable, and auditable.
- FR2 — Schedule activation. A publisher schedules version V to become active at timestamp T, for filter F (e.g., all EU stores, breakfast 2026-05-05 06:00 local). Supports timezone-local activation (every store has its own 06:00).
- FR3 — Cross-device atomic flip per store. At activation time, all devices in a store switch to V within a ≤1 s window (SLO), and never show a mixed state visible to a customer (stronger than "close in time").
- FR4 — Offline store fallback. A store with no WAN connectivity must serve the last-known-good menu (LKG) and all scheduled flips that were already pre-fetched. Must handle "store went offline before today's lunch variant was pre-fetched" gracefully.
- FR5 — Catch-up on reconnect. On WAN restoration, store reconciles: pulls manifest, fetches missed blobs, and skips past any activations that already-should-have-happened (no "welcome back, here's yesterday's breakfast at 7 PM").
- FR6 — Audit log & rollback. Every publish / activation / override is logged with actor, reason, diff. Rollback = "activate version V-prev now" globally or filtered; completes in ≤60 s at the edge.
- FR7 — Regional A/B testing. A variant can be rolled out to a percentage of stores in a region (e.g., 5% of EU stores for a week). Splits must be deterministic per store_id, not per request (see the bucketing sketch after this list).
- FR8 — Emergency recall. A version can be marked POISONED, forcing all stores to roll forward/back to a named safe version within a tight SLO (≤5 min global). Think: recalled ingredient.
- FR9 — Regulatory gate. Per-country publishing requires sign-off (e.g., EU FIC reviewer must approve allergen fields); the publish pipeline is a multi-stage workflow.
- FR10 — Freshness visibility. HQ dashboard shows per-store freshness: current_version, target_version, last_heartbeat, stale_duration.
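A minimal sketch of the deterministic split FR7 requires — hash the (plan_id, store_id) pair so membership is stable across polls and restarts. The function name and salt scheme are illustrative, not part of the API defined later:

```python
import hashlib

def in_canary(store_id: str, plan_id: str, pct: float) -> bool:
    """Deterministic per-store bucketing: membership depends only on
    (plan_id, store_id), never on request timing, so every poll and every
    device in the store agrees. plan_id salts the hash so successive
    experiments draw independent splits."""
    digest = hashlib.sha256(f"{plan_id}:{store_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < pct / 100.0

# e.g., a 5% EU canary: in_canary("store_fr_0042", "plan_eu_lunch_01", 5.0)
```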
Out of scope (explicit)
- POS transactions / pricing authority / tax — separate system.
- Kitchen Display System (KDS) — separate system; may subscribe to menu for recipe mapping but that's a downstream consumer.
- Inventory / "86'd items" — in-store local overrides; a shim consumes our menu and flags items unavailable.
- Digital-signage video ads / promo content — separate DSA/CMS; we only handle the menu schema and its thumbnails.
- Employee-facing training content — LMS, not menu.
3 Non-Functional Requirements + Capacity (BOE math) #
NFRs with numbers
| NFR | Target | Why this exact number |
|---|---|---|
| Availability — menu display at leaf | 99.99% (≤52 min/year customer-visible outage per store) | A store with a blank menu is closed. Edge-reliant; WAN availability of ≤99.9% is insufficient, so design must not require WAN to render. |
| Availability — control plane (publish/schedule) | 99.95% | Authoring can tolerate minutes of outage. The control plane is not on the customer path. |
| Consistency — intra-store | Linearizable across devices at flip boundary (all devices observe V at the "same" moment) | Customer-visible violation = two registers disagreeing on the price of a croissant. PR incident. |
| Consistency — inter-store | Eventual within 60 s for scheduled flips, 5 min for emergency recalls | Convergence under network partition, monotonic reads per-store. |
| Freshness — scheduled activation | Intra-store flip window ≤1 s | Customer-scale: an empty terminal for >1 s is noticed. |
| Freshness — emergency recall | Global ≤5 min propagation (excluding fully-offline stores) | Food-safety pull is legally time-bound. |
| Durability — no lost update | 0 lost activations in published state (RPO=0 for metadata; blobs are addressed by content hash) | Author signed it; we must not "forget". |
| Latency — manifest read | p99 ≤100 ms from store to nearest edge | Cheap to hit often. |
| Latency — blob fetch (cold CDN) | p99 ≤5 s for 2 MB from regional edge | Acceptable during pre-fetch; activation never blocks on a fetch. |
| Security | Signed manifests + signed blobs, mTLS store↔edge, KMS-backed keys | Menu tampering at the last mile is a brand + legal risk. |
Capacity estimate (all derived, nothing asserted)
Fleet
- 10,000 stores × 10 devices = 100,000 endpoints.
- +1 "store controller" per store = 10,000 controllers.
- Total managed nodes ≈ 110,000.
Traffic — control plane (publish / manifest)
- Manifest poll: 1 per store every 30 s → 10,000 / 30 = ~333 QPS globally. (Peaky around meal transitions; double it → ~700 QPS.)
- Manifest payload: ~1 KB → **~1 MB/s** aggregate at peak. Laughably small.
- Publish operations: 3 meals × 1 publish/meal × some variants = ~20-50 publishes/day. Not a scale problem.
Traffic — data plane (blob distribution)
- Blob size: 2 MB per resolved variant.
- Pre-fetch fan-out per meal: 10,000 stores × 2 MB = 20 GB per meal cycle.
- Daily: 3 × 20 GB = 60 GB/day aggregate from CDN edges.
- Per-edge POP, assume 50 POPs: ~1.2 GB/POP/day → a single nginx can push that in <1 second. The CDN is cost, not capacity.
- Why it's still a CDN: we want low tail-latency cold fetches and resilience to a blob-store region outage, not raw throughput.
Storage
- HQ metadata DB (Spanner): 1 KB per activation_plan row × 300K flips/day × 7-year retention ≈ **800 GB** (trivial, incl. overhead). Audit log dominated by publish events + device heartbeats → ~5 TB/year with indices and compression. Spanner handles this in a single regional config.
- Blob store (GCS/S3): ~5 variants × ~50 regions × 2 MB + assets + history. Including a year of version history (for rollback): ~100K variants × 2 MB = 200 GB. Nothing.
- Per-store local cache: 5 variants per meal (global/region/country/store merged resolved views × language locales) × 3 meals × 3 days LKG buffer = 45 × 2 MB = **90 MB**. Budget 1 GB per store (10× headroom for promotional seasonal menus). SD card trivial.
- Redis for live state: 100K endpoints × 300 B of state ≈ **30 MB hot set**. A single shard.
Bandwidth peak
- Scheduled peak: all 10K stores pre-fetch tomorrow's lunch within a randomized 4-hour window before activation → 20 GB over 4 h = ~1.4 MB/s globally. Sub-second peaks are hidden by CDN cache hits. Design invariant: randomized pre-fetch window + jitter prevents thundering herd.
Conclusion The problem is reliability-bound, not scale-bound. 300K activations/day is 3.5 QPS sustained — a single machine can execute that. Budget must be spent on leaf resilience, not on elastic control-plane capacity.
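The headline numbers above, reproduced as checkable arithmetic (a sanity-check sketch, nothing more):

```python
stores, devices_per_store = 10_000, 10
endpoints = stores * devices_per_store                  # 100,000 endpoints
manifest_qps = stores / 30                              # ~333 QPS at a 30 s poll
per_meal_fanout_gb = stores * 2 / 1_000                 # 20 GB per meal cycle
daily_fanout_gb = 3 * per_meal_fanout_gb                # 60 GB/day from CDN edges
device_flips_per_day = stores * 3 * devices_per_store   # 300K activations/day
sustained_flip_qps = device_flips_per_day / 86_400      # ~3.5 QPS sustained
```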
4 High-Level API #
Author / control-plane side (HQ operators + CI)
POST /v1/variants # create new variant (idempotent by client_ref)
body: {
parent_version_id, # optional; for layered merges (base)
overrides: { ... layered patch tree ... },
assets: [{url, sha256, bytes}], # already-uploaded to blob store
target_filter: { regions, countries, store_ids, tags },
author, reason, regulatory_approvals: [{jurisdiction, reviewer, timestamp, signature}]
}
returns: {version_id, content_hash, signed_blob_url}
POST /v1/activation_plans # schedule activation
body: {
version_id,
activate_at: {
kind: "UTC" | "STORE_LOCAL", # STORE_LOCAL = per-tz resolution
t: "2026-05-05T06:00:00",
grace_window_ms: 60000
},
rollout: { strategy: "all" | "canary", canary_pct, canary_order },
target_filter,
rollback_on: { error_budget_pct, max_stale_stores }
}
returns: {plan_id}
POST /v1/rollback # global or filtered rollback
body: {version_target, filter, reason, urgency: "scheduled"|"immediate"}
returns: {plan_id}
POST /v1/poison # emergency recall, highest priority
body: {version_id, safe_version_id, reason, regulatory_ref}
GET /v1/fleet/freshness # dashboard query
returns: per-store {current, target, last_hb, staleness_s}
POST /v1/variants/{id}/simulate # dry-run merge for a store_id — critical for authors
returns: resolved menu as it would render
Idempotency: every mutating call takes a client_request_id; writes are deduped in Spanner via a unique index on (operator_id, client_request_id).
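A sketch of those dedupe semantics — sqlite3 stands in for Spanner here, and the table and column names are illustrative:

```python
import sqlite3

# A UNIQUE index turns "retry the same publish" into a no-op instead of a
# duplicate variant. (sqlite3 is a stand-in; the real table lives in Spanner.)
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE publish_requests (
    operator_id TEXT, client_request_id TEXT, version_id TEXT,
    UNIQUE(operator_id, client_request_id))""")

def publish(operator_id: str, client_request_id: str, version_id: str) -> str:
    try:
        db.execute("INSERT INTO publish_requests VALUES (?, ?, ?)",
                   (operator_id, client_request_id, version_id))
        return version_id  # first attempt wins
    except sqlite3.IntegrityError:
        # Retried call: return the original result, never double-publish.
        row = db.execute(
            "SELECT version_id FROM publish_requests "
            "WHERE operator_id = ? AND client_request_id = ?",
            (operator_id, client_request_id)).fetchone()
        return row[0]
```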
Leaf / store-controller side (pull-based)
GET /v1/stores/{store_id}/manifest # the hot loop; tiny payload
returns: {
current: {version_id, sha256, activate_at, signed_url, expires_at},
upcoming: [ {version_id, sha256, activate_at, signed_url, expires_at}, ... ],
poisoned: [version_id, ...], # versions to purge
server_time_utc,
next_poll_hint_s,
manifest_sig # signature over everything above
}
auth: mTLS cert pinned to store_id
POST /v1/stores/{store_id}/heartbeat # observability + flow control
body: {
controller_version, device_versions: {device_id: version_id},
last_activation_ts, cache_bytes, wan_rtt_ms, anomalies: [...]
}
returns: {ack, instructions?: [...]} # may carry a "pull manifest now" nudge
Why pull, not push:
- Push requires a long-lived connection from HQ to 10K stores on flaky WANs. NAT traversal alone is a nightmare.
- Pull is idempotent, survives a dropped connection, and the store chooses when to fetch (important when the store WiFi is saturated during a lunch rush — don't fight for bandwidth).
- Urgent recalls are still ≤5 min via a short poll interval (15-30 s) + heartbeat-response "pull now" hint. Near-push latency for recall; pull simplicity for the 99% path.
- Considered alternative: gRPC streaming / MQTT — rejected. See Deep Dive C.
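What that pull loop looks like on the controller, as a sketch. `fetch_manifest` and `apply_manifest` are assumed helpers (the mTLS GET and the stage-and-schedule logic respectively); the jitter factor and backoff cap are illustrative:

```python
import random
import time

def poll_loop(fetch_manifest, apply_manifest, base_interval_s: int = 30):
    """Controller hot loop: poll, apply, honor the server's cadence hint.
    Jitter spreads 10K stores across the interval so the Manifest API
    never sees a synchronized herd at meal transitions."""
    interval = base_interval_s
    while True:
        try:
            manifest = fetch_manifest()
            apply_manifest(manifest)
            # Server can stretch or shrink the fleet-wide cadence,
            # including the "pull now" nudge (a tiny next_poll_hint_s).
            interval = manifest.get("next_poll_hint_s", base_interval_s)
        except OSError:
            # WAN down: keep serving LKG; back off but never stop trying.
            interval = min(interval * 2, 300)
        time.sleep(interval * random.uniform(0.8, 1.2))
```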
Transport & crypto
- mTLS store↔HQ edge with store-unique cert issued at provisioning, rotated yearly.
- Manifest signature: Ed25519 over the manifest body with HQ's publisher key; rotated quarterly; public keys on device via firmware.
- Blob integrity: SHA-256 embedded in manifest; controller verifies after download before activating. Do not trust CDN caches.
- Signed URLs for blob fetch: short-TTL (1 h), scoped to store_id + version_id. Rationale: if a store controller is stolen, the compromise window is 1 h.
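A sketch of the manifest signing round-trip with the `cryptography` package. In production the private key never leaves KMS; the in-process keypair below is purely illustrative:

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def canonical(manifest: dict) -> bytes:
    """Canonical JSON — sorted keys, no whitespace — so signer and
    verifier hash byte-identical bodies."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()

signing_key = Ed25519PrivateKey.generate()  # illustrative; real key in KMS
manifest = {"current": {"version_id": "v_2026-05-04T06:00Z_eu_lunch_01"},
            "server_time_utc": "2026-05-04T05:58:02Z"}
manifest_sig = signing_key.sign(canonical(manifest))

# Device side: verify() raises InvalidSignature on any tampering.
signing_key.public_key().verify(manifest_sig, canonical(manifest))
```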
5 Data Schema #
Storage-engine choices (justified)
| Data | Store | Why |
|---|---|---|
| menu_variant (metadata), activation_plan, store_state (authoritative), audit_log, rbac | Spanner (single global DB, multi-region) | Strong consistency for publish + rollback; global reads for manifest generation; transactional audit. Alternative Postgres + cross-region replication rejected: manifest build must be consistent globally, and Spanner's TrueTime + strong global reads are the paved road at Google. |
| Menu blobs, assets (images, localized bundles) | GCS / S3 (blob store) + CDN (Fastly/Cloudflare/Akamai) | Content-addressed by hash; immutable per version; CDN caches by URL → cache-busting by version = zero purge events. Why not Spanner BLOBs: cost/byte and egress don't pencil out at 2 MB blobs × 100K endpoints. |
| Live store_state (heartbeats, last_hb_ts, current_version by device) | Redis Cluster (regional) | High write rate; dashboard queries; <1 ms reads. Backed by async flush to Spanner for durability. Alternative DynamoDB/Bigtable fine — Redis chosen for TTL semantics on heartbeats. |
| Hot manifest responses | Regional cache (Redis + CDN for public manifest) | Manifest is signed; safe to cache. TTL = 5-15 s. Reduces Spanner read pressure at meal transitions. |
Tables (abbreviated)
menu_variant
version_id TEXT PK -- monotonic: "v_2026-05-04T06:00Z_eu_lunch_01"
content_hash TEXT UNIQUE -- SHA-256 of canonical JSON
parent_version_id TEXT NULLABLE -- base for layered merge
overrides_json JSONB -- patch tree; merged at publish time
assets_json JSONB -- [{blob_url, sha256, bytes}]
target_filter JSONB -- {regions, countries, store_ids, tags}
published_by TEXT
published_at TIMESTAMPTZ
regulatory_sigs JSONB -- [{jurisdiction, reviewer_id, sig, ts}]
state ENUM(draft, published, poisoned, archived)
canonical_blob_url TEXT -- immutable; tied to content_hash
INDEX(state, published_at DESC)
INDEX(target_filter USING GIN)
activation_plan
plan_id TEXT PK
version_id TEXT FK
activate_at_kind ENUM(UTC, STORE_LOCAL)
activate_at_ts TIMESTAMPTZ -- or local-time string for STORE_LOCAL
target_filter JSONB
rollout JSONB -- {strategy, canary_pct, canary_order}
rollback_on JSONB -- guardrails
state ENUM(scheduled, rolling, complete, aborted, superseded)
created_by, created_at
INDEX(activate_at_ts) WHERE state='scheduled'
store_state -- authoritative row per store; updated via heartbeat drainer
store_id TEXT PK
region, country, tz, tags
current_version TEXT -- what devices are on right now
target_version TEXT -- what manifest says they should be on
last_heartbeat_ts TIMESTAMPTZ
last_activation_ts TIMESTAMPTZ
staleness_s INT GENERATED -- now() - last_heartbeat_ts
wan_status ENUM(online, flaky, offline)
INDEX(region, staleness_s DESC)
audit_log
audit_id BIGSERIAL PK
actor, action, subject (version_id | plan_id | store_id)
before, after JSONB
reason TEXT NOT NULL -- required on every mutating action
ts TIMESTAMPTZ
INDEX(subject, ts DESC)
device_heartbeat_events -- Kafka/Pub-Sub topic, sampled + rolled up into store_state
store_id, device_id, version_id, ts, meta
Layered merge semantics (FR-critical)
Resolved menu at a given store = ordered merge of:
1. global_base (root of hierarchy, explicit version pinned).
2. regional_layer[region] (override set).
3. country_layer[country] (override set).
4. store_layer[store_id] (override set — often empty).
5. Optional promotional_overlay (time-windowed).

Merge rule: JSON-Patch-style deep merge with explicit remove markers (not null-means-delete — null is a valid value). Collisions bias to the deeper layer. The resolved merge happens at publish time, not at the device — the device receives a flat, pre-resolved blob. Rationale: (a) devices are heterogeneous and low-powered, (b) a merge bug at HQ is one incident while a merge bug on 100K devices is a chainwide outage, (c) regulators sign off on the final menu, not the patch tree.
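A minimal sketch of that merge rule. The `$remove` tombstone spelling is hypothetical — the point is that deletion is an explicit marker, distinct from a legal null value:

```python
REMOVE = {"$remove": True}  # explicit tombstone; None stays a legal value

def deep_merge(base: dict, overlay: dict) -> dict:
    """One merge step; the deeper layer wins on collision. A '$remove'
    marker deletes the key — null cannot, because null is a valid menu
    value (e.g., 'calories': null while lab results are pending)."""
    out = dict(base)
    for key, val in overlay.items():
        if val == REMOVE:
            out.pop(key, None)
        elif isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

def resolve(global_base, regional, country, store, promo=None):
    """Applied in hierarchy order at publish time; devices get the flat result."""
    resolved = global_base
    for layer in [regional, country, store] + ([promo] if promo else []):
        resolved = deep_merge(resolved, layer)
    return resolved
```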
6 System Diagram (ASCII) — Centerpiece #
╔════════════════════════════════════════════════════════════════════════════════════════════╗
║ HQ CONTROL PLANE (multi-region) ║
║ ║
║ ┌──────────────┐ POST /variants ┌────────────────┐ write ┌──────────────────┐ ║
║ │ Authoring UI │ ──────────────────▶ │ Publisher │ ────────▶ │ Spanner (global) │ ║
║ │ (internal) │ POST /plans │ (stateless) │ │ variants, plans, │ ║
║ └──────────────┘ │ JSON schema │ ◀──────── │ store_state, │ ║
║ ▲ │ validator + │ read │ audit_log │ ║
║ │ RBAC + regulatory sign-off │ JSON merger + │ └──────────────────┘ ║
║ │ │ signer (KMS) │ │ ║
║ ┌──────┴───────┐ └────────┬───────┘ │ ║
║ │ Regulator │ │ upload blob + sig │ ║
║ │ Reviewer │ ▼ │ ║
║ │ Console │ ┌────────────────┐ │ ║
║ └──────────────┘ │ Blob Store │ │ ║
║ │ (GCS/S3) │ │ ║
║ │ immutable │ │ ║
║ │ content-addr │ │ ║
║ └────────┬───────┘ │ ║
║ │ origin pull │ ║
║ ▼ │ ║
║ ┌────────────────┐ │ ║
║ │ Global CDN │ ──── signed URL ──┐│ ║
║ │ (Fastly/CF/ │ cached at POP ││ ║
║ │ Akamai) │ ││ ║
║ └────────────────┘ ││ ║
║ ││ ║
║ ┌────────────────────────────────┐ ┌────────────────┐ ││ ║
║ │ Scheduler (leader-elected) │──▶│ Manifest API │◀──────────────────┘│ ║
║ │ - scans activation_plan │ │ (stateless, │ │ ║
║ │ - materializes per-store │ │ autoscaled) │ │ ║
║ │ manifests into Spanner │ │ signs with KMS │ │ ║
║ │ - emits poison events │ └────────┬───────┘ │ ║
║ └────────────────────────────────┘ │ │ ║
║ │ read/write │ ║
║ ┌────────────────────────────────┐ │ │ ║
║ │ Rollout Orchestrator │──────────────┤ │ ║
║ │ - canary → auto-rollback on │ │ │ ║
║ │ error-budget breach │ │ │ ║
║ │ - consumes store_state │ │ │ ║
║ └────────────────────────────────┘ │ │ ║
║ │ │ ║
║ ┌────────────────────────────────┐ │ │ ║
║ │ Observability (dashboards, │ │ │ ║
║ │ alert: stale_store_count, │ ◀── heartbeats + audit (Kafka/PubSub) │ ║
║ │ rollout_progress, poison_ack) │ │ ║
║ └────────────────────────────────┘ │ ║
║ │ ║
╚═════════════════════════════════════════════════════════════╧══════════════════╧═════════════╝
│ │
│ │
┌───────────────────────────────────────────┤ │
│ REGIONAL EDGE (per continent) │ │
│ │ │
│ ┌───────────────────────────┐ │ │
│ │ Regional Manifest Mirror │ ◀─ async ─┤ │
│ │ (read-only replica) │ stream │
│ │ serves manifest if HQ │ │
│ │ region unreachable │ │
│ └─────────────┬─────────────┘ │
│ │ │
│ │ mTLS │
│ │ GET /manifest │
│ │ every 30s │
│ ▼ │
└─────────────────┼─────────────────────────────────────────────┘
│
│ ┌──────────────┐
│ │ CDN POP │ ◀── pull blob
│ │ nearest to │
│ │ store │
│ └──────┬───────┘
│ │
╔════════════════════════▼════════════════════════════════▼═══════════╗
║ ONE STORE (×10,000) ║
║ ║
║ ┌─────────────────────────────────────────────────────┐ ║
║ │ Store Controller (NUC or equivalent) │ ║
║ │ - Store Agent daemon │ ║
║ │ - SQLite (state) + flat-file cache (blobs) │ ║
║ │ - Local leader: coordinator for intra-store flip │ ║
║ │ - Runs local HTTP (store-LAN) serving manifest │ ║
║ │ + blob to devices │ ║
║ └───────────┬─────────────────────────┬────────────────┘ ║
║ │ local LAN mDNS/static │ systemd + watchdog ║
║ │ http://store-ctl.lan/ │ ║
║ ┌───────────▼─┐ ┌───────────┐ ┌──────▼────────┐ ┌──────────────┐ ║
║ │ POS Register│ │ POS Reg. │ │ Menu Board │ │ Kiosk │ ║
║ │ device │ │ device │ │ (digital) │ │ (self-serve) │ ║
║ │ - polls │ │ │ │ - polls │ │ - polls │ ║
║ │ - has LKG │ │ │ │ - has LKG │ │ - has LKG │ ║
║ │ - verifies │ │ │ │ │ │ │ ║
║ │ sig │ │ │ │ │ │ │ ║
║ └─────────────┘ └───────────┘ └───────────────┘ └──────────────┘ ║
║ (×10 devices; each verifies signature independently) ║
╚═══════════════════════════════════════════════════════════════════════╝
Arrow semantics (every arrow ↔ API or data flow)
| Arrow | Protocol | Volume | Frequency |
|---|---|---|---|
| Authoring UI → Publisher | HTTPS JSON | <1 MB per publish | 10-50/day |
| Publisher → Spanner | Spanner RPC | <1 KB rows | per publish |
| Publisher → Blob store | S3 PUT | 2 MB | per variant |
| Blob store → CDN | origin pull (lazy) | 2 MB | per unique version × cold POP |
| Scheduler → Spanner | RPC | small | continuous scan |
| Rollout Orchestrator ← store_state | Pub/Sub | ~300 B | on heartbeat |
| Store Controller → Manifest API | HTTPS GET mTLS | 1 KB | every 30 s (jittered) |
| Store Controller → CDN POP | HTTPS GET signed URL | 2 MB | per new version (pre-fetch) |
| Store Controller → Devices (LAN) | HTTP + local signed manifest | 1 KB + 2 MB | same as upstream, served from store cache |
| Store Controller → Heartbeat | HTTPS POST mTLS | <1 KB | every 60 s |
| Regional Manifest Mirror ← HQ | gRPC streaming replication | small | continuous |
Sub-diagram A — Pre-fetch flow (scheduled lunch 12:00 local)
T-4h ──────────── T-3h ─────────── T-10min ────────── T=12:00 ──────
│ │ │ │
│ Scheduler writes manifest │ │
│ (upcoming: v_lunch, activate_at=12:00) │
│ │ │ │
│ Store controller polls manifest at next tick: │
│ sees "upcoming" with activate_at = T │
│ chooses a random time in [T-4h, T-30min] │
│ ── BLOB PRE-FETCH jittered ──▶ CDN │
│ verifies sha256 + manifest sig; writes to disk │
│ marks "staged" in SQLite │
│ │ │ │
│ │ │ T-10min: │
│ │ │ controller │
│ │ │ runs PRE-FLIP │
│ │ │ broadcast to │
│ │ │ all devices: │
│ │ │ "stage v_lunch"│
│ │ │ │
│ │ │ devices ack │
│ │ │ │
│ │ │ │ T=12:00:
│ │ │ │ COMMIT broadcast
│ │ │ │ (see Deep Dive A)
Sub-diagram B — Intra-store atomic flip (Deep Dive A in pictures)
Store Controller (local leader):
│
│── t = T − 10min: PREPARE(v_lunch) ───────────────▶ all 10 devices
│ │
│ each device:
│ - fetches blob from
│ local LAN cache
│ - verifies sha256 + sig
│ - stages as "ready"
│ - replies PREPARED(v_lunch)
│ OR ABORT(reason)
│ │
│◀─── PREPARED / ABORT from each device ─────────────┘
│
│── (gate) if any ABORT or timeout: ───▶ cancel, re-stage, alert, keep LKG
│
│── t = T: COMMIT(v_lunch, seq=N+1) ───────────────▶ all 10 devices
│ │
│ each device atomically:
│ - rename staged blob
│ → active path
│ - update in-memory
│ pointer
│ - (browser UI: soft-
│ reload via SSE)
│
│◀─── COMMITTED(v_lunch, seq=N+1) ───────────────────┘
│
└── log transition, heartbeat upstream
Sub-diagram C — Offline recovery / reconnect
Store disconnected at t0. WAN reconnects at t1. Scheduled flips missed in [t0, t1].
│
│── t1: controller detects WAN up (dns + probe)
│── immediate: GET /manifest
│ ◀── manifest with:
│ current: v_X (what HQ thinks we should be on NOW)
│ upcoming: [...]
│ poisoned: [v_bad, ...]
│
│── reconcile logic:
│ if current_local ∈ poisoned: FLIP to safe ASAP (see Deep Dive C)
│ if current_local != v_X:
│ if clock shows we are past the activate_at for v_X:
│ fetch v_X blob → intra-store flip now (skip intermediate flips)
│ else:
│ fetch v_X blob → schedule flip at activate_at
│ for each upcoming: pre-fetch + schedule (normal path)
│
│── 60 s later: heartbeat reports reconciled state
7 Deep Dives #
Deep Dive A — Intra-store atomic flip (the earned-secret bit)
Why critical. Customer-visible correctness lives or dies here. If register 1 shows "2026 Summer Menu" with new croissant price $4.50 and register 2 still on yesterday's menu at $4.00, we have (a) a confused cashier, (b) a refund event, (c) a viral TikTok "McDonalds randomly charges different prices." This is the L7 earned-secret question: how do you do distributed atomic commit across 10 nodes whose only shared infrastructure is a $40 store router?
Three candidate designs (compared quantitatively):
| # | Approach | Mechanism | Latency | Failure mode | Mitigation | Verdict |
|---|---|---|---|---|---|---|
| A1 | Naïve: activate_at timestamp, devices trust their own clock | Every device pulls the manifest with activate_at = T; at now() >= T it flips. | 0 RTT; simplest. | NTP skew. In a store, NTP syncs to upstream every 15-30 min; typical skew 50-500 ms, bad cases 1-5 s if WAN is down. ⇒ device A flips at T, device B flips at T+500 ms. Customer sees mixed state for up to 500 ms per flip, 3× per day. | Could tighten via local NTP at the store controller, but you still have jitter. Still possible: >1 s mixed state during clock-skew spikes. | Rejected as sole mechanism. Used only as a defensive fallback timestamp. |
| A2 | Pure 2PC over store-LAN with controller as coordinator | PREPARE to all devices, wait for all-PREPARED, then COMMIT. If any device doesn't respond within the timeout → ABORT. | 2 RTTs on LAN (~2 ms each) + longest-tail device = ~20-100 ms. | If the coordinator crashes between PREPARE and COMMIT → devices stuck in PREPARED limbo, blocking new flips. Classic 2PC pathology. | Add a persistent coordinator log + recovery protocol + device-side timeout that auto-aborts after a grace period (e.g., 60 s), reverting to current. | Good but not enough alone — blocking if the LAN partitions during the commit phase: some devices commit, others don't. |
| A3 | Pre-staged blob + activate_at + 2-phase broadcast with bounded-stall and grace window (CHOSEN) | T-10 min: controller broadcasts PREPARE → all devices stage blob + verify + reply PREPARED. T: controller broadcasts COMMIT over store-LAN redundantly (3 retries, 100 ms apart). Each device also independently trusts its local clock: if no COMMIT is received within grace_window_ms after activate_at, the device flips on its own (the blob is already there). | ~100 ms worst case intra-store. | (a) A device misses COMMIT and its clock is skewed → flips late. (b) A device crashes between PREPARE and COMMIT → stays on LKG, surfaces as ABORT. (c) Coordinator crashes → every device's fallback timer fires at ~T and they flip approximately together. | Device clocks synced via store-local NTP on the controller (controller itself NTP-synced over WAN, but within the store it serves time) → intra-store skew typically <50 ms. Grace window = 500 ms: either everyone commits via broadcast within 500 ms, or they fall back to their clock (bounded deviation). | CHOSEN. Best of both: coordinator-driven convergence when healthy; clock-driven recovery when the coordinator flakes. |
Chosen protocol in pseudo-code:
# On each device
on PREPARE(v, activate_at_ts):
pull_blob_from_store_lan(v)
if not sha256_ok or not sig_ok:
reply ABORT; return
stage(v) # atomic rename to .staged; not active yet
reply PREPARED
set timer T = activate_at_ts + grace_window_ms
state = PREPARED(v, T)
on COMMIT(v, seq):
if state == PREPARED(v, _):
atomic_rename(.staged → .active) # POSIX rename, inode swap
hot_reload() # UI pointer update; no process restart
reply COMMITTED(v, seq)
state = ACTIVE(v)
on timer_fire:
if state == PREPARED(v, T) and now() >= activate_at_ts:
# coordinator didn't tell us to commit; commit ourselves
atomic_rename(.staged → .active)
log("autonomous_commit") # → heartbeat → alert if too many autonomous
state = ACTIVE(v)
on ABORT(v, reason):
discard(.staged)
state = ACTIVE(previous)
log abort to heartbeat
Why the atomic rename matters. POSIX rename(2) on the same filesystem is atomic; no reader sees a torn file. A UI that reads menu.json via mmap or open() sees either the old inode (pre-rename) or the new one (post-rename), never a half-written file. This is the same primitive LaunchDarkly / Nginx config reload / Atomic-Deploy-Symlink patterns rely on. Cheap, battle-tested, no distributed-systems-PhD required.
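The stage/commit pair in miniature (paths are illustrative; both must live on the same filesystem for rename(2) to be atomic):

```python
import os

ACTIVE = "/var/menu/menu.json"          # path the UI reads
STAGED = "/var/menu/menu.json.staged"   # same filesystem as ACTIVE

def stage(blob: bytes) -> None:
    """PREPARE half: write + fsync. A power cut mid-write can only leave a
    torn .staged file, never a torn active menu."""
    with open(STAGED, "wb") as f:
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())

def commit() -> None:
    """COMMIT half: one atomic inode swap. A reader opening ACTIVE sees the
    old bytes or the new bytes, never a mixture."""
    os.rename(STAGED, ACTIVE)
    # For strict durability, also fsync the containing directory so the
    # rename itself survives power loss.
```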
Earned-secret bits hidden here:
- A device that "auto-commits" without COMMIT broadcast should emit an anomaly heartbeat; the dashboard threshold of "N% autonomous commits in the fleet" is an early warning that LAN-multicast or controller health is degrading in the wild.
- grace_window_ms = 500 ms is the customer-visible SLO for mixed-menu exposure. Any tighter and a single TCP retransmit can breach it. Any looser and an NTP-skewed device is visibly mismatched. 500 ms is short enough that no human will notice at the register but long enough to cover LAN misbehavior.
- The controller runs local NTP (it serves time to devices on the store LAN). This turns "WAN-clock-skew" into "controller-clock-skew," which is a single fault domain and far easier to alert on.
- Pre-flip PREPARE has the invaluable side-effect of verifying blob integrity on all devices before the flip deadline. A bad blob is detected 10 min before it would have gone live, giving time to abort and roll forward.
Comparison (intra-store mixed-state exposure per flip, quantified):
- A1 alone: 500 ms (NTP skew) per flip × 3 flips/day × 10K stores = **15 M ms/day** of mixed-state customer exposure.
- A3 (chosen): 50 ms typical → **1.5 M ms/day, an order-of-magnitude improvement**, with worst-case exposure hard-bounded at 500 ms per flip by the grace window.
Deep Dive B — Offline resilience & staleness semantics
Why critical. A store that can't serve a menu is closed. The probe said "what happens when a restaurant goes offline." The L6 answer is "LKG cache." The L7 answer is a tiered degradation policy with regulatory and product stops.
Three candidate designs:
| # | Approach | What it does | Gaps |
|---|---|---|---|
| B1 | LKG forever — serve last valid menu until reconnect. | Simple. | EU allergen labels could be stale for weeks → legal liability. Pricing drift if this were pricing-authoritative. |
| B2 | Hard expire after N hours — show "menu unavailable" splash. | Regulatory-safe. | Store is closed. Revenue loss. Not acceptable during typical WAN outages which are minutes, not days. |
| B3 | Tiered staleness regime (chosen). Three bands: Fresh / Stale / Expired; each has a product + compliance policy. | See below. | More complex; needs manager UI for override. |
Chosen tiered policy:
age = now() - menu.published_at
tier =
FRESH if age <= 24 h → normal operation
STALE if 24 h < age <= grace (e.g., 72 h)
→ serve normally, show a manager-only banner
→ emit alarm to HQ
→ feature-flag "disable new SKUs" if schema changed
EXPIRED if age > grace
→ refuse to show price-modified items
→ fall back to "safe menu" (a pre-baked subset with stable prices + mandated allergen info)
→ in EU: refuse to sell items with time-sensitive allergen labels
→ store manager can override for an extra N hours (audit-logged)
→ at 7 d, store may be required by policy to close that terminal
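The same bands as a classifier, sketched — the actions (banner, alarm, safe-menu fallback) hang off the returned tier:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESH_LIMIT = timedelta(hours=24)
GRACE_LIMIT = timedelta(hours=72)   # matches the assumed 72 h offline SLO (Q2)

def staleness_tier(published_at: datetime,
                   now: Optional[datetime] = None) -> str:
    """Map menu age onto the FRESH / STALE / EXPIRED policy bands."""
    age = (now or datetime.now(timezone.utc)) - published_at
    if age <= FRESH_LIMIT:
        return "FRESH"
    if age <= GRACE_LIMIT:
        return "STALE"
    return "EXPIRED"
```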
Why tiered wins:
- Covers the 99% case (WAN flaps for minutes) with zero customer impact.
- Covers the 0.9% case (WAN down for hours during snowstorm) with degradation, not closure.
- Covers the 0.1% case (prolonged disconnection) with a regulator-safe, audit-clean mode.
- Gives store manager agency (override) with accountability (audit log).
Fail-safe "rice and water" menu. Every device ships with a minimal, pre-baked "fail-safe" menu that includes the universally-available items at conservative prices, with legally-mandated allergen text. If everything else fails — bad blob, corrupt cache, missing keys — fail-safe menu is served. This is analogous to CDNs serving a static error page from their edge when origin is fully dead. It means a store is never completely menu-dark.
Catch-up on reconnect — subtle bits:
- If the store missed 5 flips while offline, we do NOT replay them; we jump to the current target version. Replaying "breakfast at 3 PM" is both customer-bad and wastes the store's LAN.
- Poisoned versions pre-empt everything. If current_local ∈ poisoned, we flip to the named safe version before even processing other manifest hints. Emergency recall is latency-sensitive.
- We do replay relevant audit events in heartbeat: "I was offline, I did N autonomous commits, here are my transitions during the outage." Observability completeness matters for post-incident analysis.
Staleness SLO metrics to page on:
- stale_store_count_1h > 0.5% — WAN or CDN regional problem.
- stale_store_count_72h > 0.01% — human escalation, may need field-ops.
- autonomous_commit_rate > 1% of flips — controller health degradation.
- expired_store_count > 0 — paging; revenue + regulatory.
Deep Dive C — Global rollout of a bad menu (and why cache-busting > purge)
Why critical. "Bad menu goes global in 60 s" is the most expensive failure mode. A typo showing "$0.01 Big Mac" across 10K stores is a trading-pit-level incident. We need defense in depth: prevent, detect, contain, rollback.
Layered defenses:
- Author-side schema validation — the publisher rejects out-of-bounds prices (e.g., any item priced at <20% or >500% of the sibling-country baseline is flagged for review). Data-quality gate.
- Regulator sign-off — per-country compliance review is a hard gate before an activation_plan can target that country.
- Canary rollout — the first 5% of target stores (geo-distributed, not a single region) get V at T. If the error budget breaches in 5 min → auto-abort. If clean for 10 min → expand to 20% → 50% → 100%. Inspired by LaunchDarkly / Statsig % rollouts, Google Chrome % rollouts, Envoy/Istio traffic-splitting patterns.
- Automatic rollback — guardrail tripwires in the rollback_on block: anomalous POS void spike, manager override spike, sentinel tests in "shadow stores." On trip → the rollout orchestrator issues /rollback with urgency=immediate.
- Emergency recall — /poison marks the version and the next manifest poll carries a poisoned: [v_bad] list. Devices detect and flip to safe_version within their next poll + flip cycle (~1 min). No CDN purge needed; see below.
The cache-invalidation question (CDN purge vs cache-busting by version):
Two options; we choose one for subtle but high-value reasons.
| Option | Mechanism | Latency | Failure mode | Verdict |
|---|---|---|---|---|
| CDN purge | Call CDN API to purge menu-v42.json at all POPs. | 30 s - 5 min per provider; 1-2% of POPs are stragglers. | Partial success: some POPs still cache v42. A store fetching from those POPs gets the bad blob. Cache invalidation is hard — see Phil Karlton's Law. | Reject. |
| Cache-busting by version in URL (chosen) | Each version has its own URL: /menus/{content_hash}.json. The old URL is never invalidated; it just stops being referenced. To "roll back," we publish a new manifest pointing to the older, known-good URL. | No purge latency; immediate on manifest refresh. | None — the manifest is the source of truth; stale CDN entries are irrelevant. | Chosen. |
Why cache-busting wins at every quantifiable axis:
- Correctness: impossible to serve the bad blob once the manifest points elsewhere.
- Latency: bounded by manifest-poll interval (30 s) + intra-store flip window (≤1 s).
- Rollback cost: zero CDN API calls; zero purge events (purges cost $, have quota, and scrape cold caches).
- Auditability: every byte served has an immutable content hash; forensics is "which version_id did this store have."
This is the same pattern used by WebPack-hashed asset bundles, iOS App Store versioned bundles, Debian snapshot archives, and GCS generation-numbered objects.
What remains on CDN: only cost control. The bad blob is still cached at 50 POPs consuming (maybe) 100 MB of CDN storage until TTL expires (e.g., 24 h). Negligible.
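Content addressing in two lines, plus the rollback-as-repoint move (the bucket host is illustrative):

```python
import hashlib

def blob_url(menu_bytes: bytes) -> str:
    """Immutable, content-addressed URL: identical bytes → identical path;
    no path is ever rewritten, so there is nothing to purge."""
    digest = hashlib.sha256(menu_bytes).hexdigest()
    return f"https://cdn.example.com/menus/{digest}.json"

manifest = {"current": blob_url(b'{"menu": "v42"}')}  # bad version live
manifest["current"] = blob_url(b'{"menu": "v41"}')    # rollback = repoint
# POPs still caching v42's URL are harmless: nothing references it anymore.
```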
Sentinel stores — an under-loved trick. Run ~100 "canary stores" (chosen for geo/traffic diversity) that always get new versions first. These stores also run synthetic transactions (robo-customer clicks through menu, simulated order, tests price rendering, tests allergen text). Breakage detected here before real customers see it. Pattern from Google's SRE playbook (canary analysis), Meta's LSV rollouts, Netflix region-by-region rollouts.
8 Failure Modes & Resilience (matrix) #
| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| HQ region | Full region outage (rare — AWS/GCS region failure) | Synthetic manifest probe from external monitors; K8s control-plane alerts | Publish blocked; manifest reads degraded | Regional manifest mirrors (read-only replicas per continent) serve cached last-known-good manifests; scheduled activations continue because stores already pre-fetched; no new publishes until HQ is restored | Failover authoring UI to DR region (hours RTO; acceptable — authoring is not on customer path) |
| CDN | POP-wide outage | Synthetic GET from monitoring, plus real store miss-rate spike | Stores in that POP's catchment fail to pre-fetch | Multi-CDN (active-active Fastly + Cloudflare; manifest URL has 2 URLs, prefer-list); CDN fallback is origin blob store directly (slower but works) | CDN provider failover; multi-CDN routing shifts traffic |
| Blob store origin | GCS/S3 outage | CDN origin-pull errors; synthetic probes | New version can't be pre-fetched; existing caches still serve | Write to two blob stores (primary + cross-cloud secondary); manifest can reference either URL; CDN origin-shield configured with fallback origin | Writes drain to secondary; no data loss (content-hash immutable); originals copied when primary recovers |
| Spanner | Regional brownout on primary | Spanner SLO alerts | Publishes fail; manifest generation slows | Spanner multi-region config (automatic); reads can go to replicas | Auto-failover; writes resume when quorum recovers |
| Manifest API | Pod crash / overload | Health-check 5xx rate | Stores get stale manifests (but have pre-fetched already, so flips still work) | Autoscale + rate limits + CDN in front of manifest (signed, short-TTL cacheable); store controller retries with exponential backoff | Self-heals as pods recover |
| Scheduler | Leader crash | Leader-lease timeout | New activations not materialized into manifests | Leader-election (Spanner-backed lease); standby takes over in <10 s | Automatic |
| Rollout Orchestrator | Crash mid-rollout | Heartbeat | Canary frozen at current percentage | Persistent state in Spanner; restart resumes from last completed stage; guardrail is that a frozen rollout is safe (doesn't expand) | Restart |
| Store WAN | Down for minutes | Controller heartbeat gap | 1 store shows LKG (which is correct) | Tier B (Deep Dive B); manifest cached | Automatic on reconnect |
| Store controller | Crash mid-flip (between PREPARE and COMMIT) | Devices auto-commit via grace-window timer (Deep Dive A) | Up to 1 flip might be slightly desynchronized across devices; bounded 500 ms | Controller systemd auto-restart; post-restart it reads SQLite state and resumes heartbeat | Self-heal; heartbeat surfaces an anomaly |
| Device crash at flip | One device crashes between PREPARE and COMMIT | Controller sees PREPARED response but no heartbeat | One register shows old menu (or crash page) | Coordinator aborts / proceeds based on quorum policy (default: proceed if ≥N-1 acked; single laggard catches up after reboot and runs catch-up protocol) | Manual reboot or watchdog |
| Clock skew between devices | Device NTP broken, clock drifts 10s | Heartbeat reports local clock vs controller clock | Device flips early or late; grace window contains it if <500 ms; if >500 ms, we page | Store controller is local NTP; watchdog kills device's NTP client and re-inits | Ops runbook |
| Half-flipped store | 6/10 devices on new menu, 4 on old (LAN partition) | Heartbeat diff: distinct(device_versions) > 1 for > 60 s | Customer-visible mixed state | Controller explicit rollback: broadcasts ABORT for the new version; all devices go back to LKG; store is in a consistent state until the partition resolves | Retries flip when partition heals |
| Bad menu (corrupts POS) | Bad blob accepted, all stores flip, POS breaks | POS error-rate guardrail | Chainwide | /poison + auto-rollback; fail-safe menu fallback; sentinel stores catch it before full rollout | Rollback to LKG; RCA |
| Regulator override late | EU regulator flags v42 after it's live | Regulatory system emits alert | EU stores serving v42 | Immediate /poison v42 + rollback to v41 in EU filter only | <5 min recovery for EU |
| Key compromise | Publisher key leaked | Security team | Attacker could publish signed menus | KMS-backed keys; monthly rotation; multi-party signing for high-risk publishes (e.g., prod + regulator co-sign); device pins publisher keyring with revocation support | Rotate keys; revoke old; next manifest carries new keyring |
| Store controller stolen | Physical theft | Cert pinning + short-TTL signed URLs limit window; mTLS cert revocation list | Single store, max 1 h of signed-URL validity | CA revokes cert; store re-provisioned from replacement hardware | Field ops |
9 Evolution Path #
v1 — "We have 50 stores, no automation"
- SFTP drop: HQ publishes a menu file to an SFTP server; stores cron-poll hourly and copy.
- Manual flip: store manager restarts the menu display at breakfast/lunch/dinner.
- No signing, no audit, no rollback (except "copy yesterday's file back").
- Works. Scales to maybe 200 stores and ~1 update/day.
- Pain: no atomicity, no rollback, no observability, no canary, no offline distinction from "forgot to restart."
v2 — "1,000 stores, we need a real distribution system"
- Pull-based manifest + blob separation (the architecture in §6 without the bells).
- Store agent daemon; LKG cache; signature verification.
- Single-region HQ; single CDN.
- Activations via activate_at timestamp only (no 2PC). Accept ~500 ms intra-store mixed state — users haven't noticed yet.
- Basic audit log. Manual rollback (author publishes a new version with old content).
- Dashboard shows stale stores.
- Pain: bad-menu incident reveals a 45-minute global outage because CDN purge took forever. NTP skew causes visible mismatches. Team feels the heat.
v3 — "10K stores, 80 countries, 3 meals × 5 languages × 10 regions"
- Everything in §6: multi-region HQ, regional manifest mirrors, multi-CDN, poison/rollback, 2PC + grace-window intra-store flip (Deep Dive A), tiered staleness policy (Deep Dive B), cache-busting versioned URLs (Deep Dive C), sentinel stores, canary rollouts with guardrails, regulatory signing, layered merge at HQ, fail-safe baked menu, freshness SLOs, per-region A/B.
- ML-assisted anomaly detection on POS behavior to catch bad menus that passed schema validation.
- Edge-deployed WASM menu rendering (optional) for kiosks with fancier layouts.
v4 — "We're on the platform; this is a product"
- Self-service for franchisees: regional managers can draft + submit; HQ approves.
- Delta-encoded blobs (compressed against the prior version) — saves ~80% of CDN bytes.
- Joint system with KDS, POS, loyalty — menu+promo cross-system atomic flips.
- On-device inference for personalized menus (privacy-compliant, on-device, no PII to HQ — natural with the edge architecture).
10 Out-of-1-Hour Notes (earned-secret depth) #
Signed manifests — deeper dive
- Signature is Ed25519 (deterministic, fast to verify on constrained POS hardware).
- Signature covers the canonical-JSON manifest body plus a sequence number per-store to defeat replay (otherwise an attacker could replay an old manifest to downgrade the menu — e.g., reintroduce a recalled item).
- Device stores the last seen sequence number; rejects any manifest with seq ≤ seen.
- Key rotation: quarterly publisher key roll. Devices accept any key in the keyring (zero-downtime rotation); keyring shipped via manifest + signed by a root key that is much longer-lived and kept in HSM. Root key rotation is a 2-year event with staged OTA.
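Device-side acceptance as a sketch: verify against any keyring member first (that is what makes rotation zero-downtime), then enforce sequence monotonicity on the now-trusted body. Field names are illustrative:

```python
import json
from typing import Optional
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def accept_manifest(body: bytes, sig: bytes,
                    keyring: list[Ed25519PublicKey],
                    last_seen_seq: int) -> Optional[dict]:
    """Return the parsed manifest if it is authentic and fresh, else None.
    seq lives inside the signed body, so it is checked only after the
    signature holds — a valid-but-old manifest is rejected (no downgrade)."""
    for key in keyring:                 # any current ring member may verify
        try:
            key.verify(sig, body)       # raises InvalidSignature on tamper
            break
        except InvalidSignature:
            continue
    else:
        return None                     # no key verified the signature
    manifest = json.loads(body)
    if manifest["seq"] <= last_seen_seq:
        return None                     # replayed manifest: reject
    return manifest
```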
Asset compression & payload engineering
- Text JSON: Brotli -11 offline at publish time (not at serve time) — CPU at author, bytes at edge.
- Images: WebP / AVIF at multiple resolutions, pre-generated (one per device class).
- Consider content-defined chunking (rsync-style) for delta blobs: if only prices changed, new blob is ~90% identical to previous; ship a 200 KB patch instead of 2 MB full blob.
- Reject deltas on bad-chain corner cases (missing base on the device → fall back to full blob).
Device class heterogeneity
- Reference device profile matrix:
- Class A: Android tablet (recent). Full runtime, WebView, mTLS, verify sig in-process.
- Class B: POS terminal (Windows Embedded 7, weak CPU). Uses the store controller as a proxy; just fetches /active/menu.json from the LAN.
- Class C: Digital menu board (RPi-class or HDMI dumb display). Store controller renders and pushes the frame; the board is truly dumb.
- Single-digest principle: regardless of device class, every device either verifies the signature itself or trusts the store controller, which has verified on its behalf. Class C never fetches directly from the CDN; traffic is always proxied via the controller for audit & cache locality.
Localization + allergen regulatory
- Pre-localize at publish: menu blob is emitted once per (country, language) pair; device fetches only the one it needs.
- Allergen fields are structured (not free text): {contains: ["gluten","dairy"], may_contain: ["nuts"]}; the UI renders per locale.
- EU FIC Article 21 requires allergen information to be accurate and up-to-date; hence the 72 h staleness grace in Deep Dive B has a regulatory rationale, not just an operational one.
- FDA menu-labeling (calorie disclosure) has similar provenance requirements.
- Japan / China: separate trust anchors per regulator; signing pipeline is country-sharded.
Power outage behavior
- Store controller has a UPS-backed battery (15-30 min). If power fails: controller raises "degraded" flag; WiFi may die too; devices fall back to LKG cache on local NVRAM.
- Blob cache is written with fsync + rename after every stage — no half-written files on power loss.
Intra-store gossip when WAN is down
- Controller-is-the-oracle model normally. But if the controller dies and the WAN is also down, devices have no shared timebase.
- Fallback: device-to-device gossip via mDNS/Zeroconf on store LAN. One device is ad-hoc elected (lowest MAC addr) as "flip leader" and broadcasts "activate_at" commits. Eventual convergence on 1-2 s bound.
- This is a rare corner but it keeps the store "eventually consistent" during degraded mode. The code path is rarely exercised — SRE runbooks should include chaos testing of "controller + WAN simultaneously down during meal transition."
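The degraded-mode election, sketched. It works because every device computes the same answer from the same mDNS peer list — no election messages are needed beyond discovery (names are illustrative):

```python
import uuid

def my_mac() -> str:
    """This device's MAC address as lowercase hex."""
    return f"{uuid.getnode():012x}"

def elect_flip_leader(discovered_macs: list[str]) -> str:
    """Lowest MAC wins. Deterministic given an agreed peer set; if LAN views
    briefly differ, the worst case is two leaders broadcasting the same
    activate_at commit — idempotent, so the store still converges."""
    return min(discovered_macs + [my_mac()])
```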
Cost per store per day
- Manifest polls: 2880/day × 1 KB = 3 MB egress → negligible ($0.0003).
- Blob pre-fetch: 3 variants × 2 MB = 6 MB/day → ~$0.0005.
- Heartbeats: 1440/day × 500 B = 720 KB → negligible.
- Per store per day: ~$0.001 distribution cost; ~$10/month including CDN pre-paid + a share of the control plane. Compared to the $XXX/store/day we'd lose if the menu went out during lunch, this is the cheapest insurance money can buy.
Observability SLOs (per-store freshness dashboard)
Key SLIs:
- p99(now - last_heartbeat_ts) per region — should be <5 min.
- stores not yet on target_version at activation + 1 min — should be <1% of the target fleet.
- count(device_version_disagreement_within_store) > 0 — should be 0 outside the <500 ms flip windows.
- autonomous_commit_rate — should be <0.1% (if higher, controller LAN health is degrading).
- rollback_time_to_95pct — for emergency recalls, should be <5 min.
Dashboards:
- Per-region heatmap — staleness by country.
- Rollout progress bar — for each active activation_plan, the percentage of target stores on target_version.
Adversarial / abuse concerns
- Rogue store employee: what if someone with physical access to the controller modifies the local cache? Device-side signature verification on every load means a modified blob fails verification and falls back. Manager override is audit-logged to HQ heartbeat — can't hide.
- Rogue publisher: what if an attacker (insider) publishes a bad menu? Dual-control (two-person approval) on publish + regulator sign-off + sentinel stores + canary = multiple gates.
- Manifest replay: prevented by per-store sequence numbers.
- DoS on manifest API: rate-limited per cert; CDN in front; scale out is cheap.
Privacy posture (aligning with candidate's Privacy Infra background)
- Menu data is not PII; however, heartbeat + store_state DB is a geo-tagged telemetry stream. If joined with customer purchase data, it becomes sensitive. Separation of duty: menu-distribution DB ≠ customer-purchase DB; no join keys.
- If v4 ships personalized menus, per-user preference computation should happen on-device (privacy-preserving, GDPR-clean, no PII to HQ) — echoing the Android private compute core pattern.
- Audit log retention: 7 years for regulatory, then hard delete. Signed at write time (tamper-evident).
What I didn't cover but would in a longer slot
- Kitchen Display System integration. Menu → recipe → KDS routing is a different beast; would need to coordinate "recipe pack deployed" with "menu active" to avoid the kitchen getting an order for an item its KDS doesn't know how to prepare. Pattern: two-phase coordinated activation across menu-system and KDS-system, with KDS version as a dependency in the activation plan.
- Promotional overlays (time-bounded). E.g., "$3 coffee for the next 2 hours." Modeled as a short-TTL overlay layer in the merge stack; different activation cadence; exit criterion is its own expiry.
- Dark launches. Publish a variant but don't activate; lets authors validate merge results against a store's context without customer exposure.
- Multi-brand. A single chain might own multiple brands (e.g., Inspire Brands = Arby's + Dunkin' + Buffalo Wild Wings). Add a brand_id to the filter and do the same partitioning one level up.
Final thesis, restated with earned evidence: The problem looks like a distribution problem but the customer-visible correctness guarantee is an in-store distributed-systems problem. Solving that elegantly — pre-staged blobs + 2-phase commit with bounded-stall clock-fallback, backed by cache-busting by version and a tiered offline policy — is what separates L7 from L6 on this question. A pager-carryable SRE can operate this because every component has a health signal, every failure mode has a finite blast radius, and every rollback is a pointer flip, not a purge.