Q11 Product & Edge Systems
Design a Menu Update System for a Global Restaurant Chain
Push daily menu updates from HQ to thousands of restaurant locations and keep every device in a store on the same version.
1 Problem Restatement & Clarifying Questions #
Restatement in candidate's words
HQ maintains a logical menu composed of (a) a global base, (b) a regional layer (e.g., APAC, EU), (c) a country layer (France, Japan), and (d) a store layer (airport store vs highway store). Three times a day, per meal, a new menu variant must go live. ~10K stores × ~10 display devices per store; each device shows the same menu state as its siblings at meal transitions (no "register 1 sells bagels, register 2 sells burgers" at 10:31 AM). Stores lose WAN connectivity for anything from seconds (last-mile glitch) to days (natural disaster). The design must guarantee: menus keep working offline, menus catch up on reconnect, bad menus don't take down the chain, and transitions happen within ~1 second across a store.
Clarifying questions I'd ask the interviewer in the first ~3 minutes
| # | Question | Why it matters to the design |
|---|---|---|
| Q1 | Device heterogeneity? Android tablet (ARM, 2-8 GB RAM, WiFi), thin POS terminal (Windows Embedded, 1 GB RAM), digital menu board (HDMI, RPi-class), kitchen display (KDS)? | Determines whether we can assume an on-device daemon, SQLite, mTLS — or must handle "dumb" HDMI boards that only render what a store controller serves them. Assumption: Android/Linux device + a per-store "controller" box (small form-factor Linux PC like an Intel NUC). |
| Q2 | Offline tolerance SLO? Must a store stay open for 4 hours offline? 24 hours? 7 days? | Drives local cache size (variants × days × 2MB each) and regulatory staleness limits. Assumption: 72 h offline with last-known-good; hard expiry at 7 days per food-safety + EU FIC allergen-accuracy rules. |
| Q3 | Menu blob size? Text JSON only, or JSON plus image assets? | Decides CDN vs. direct API, pre-fetch windows, per-store storage. Assumption: ~2 MB per variant (JSON + compressed thumbnails); video ads handled by a separate DSA system (out of scope). |
| Q4 | SKU count? 200 global items, 50 regional overrides, 20 local? | Sizes the merge pipeline and delta encoding. Assumption: ~500 SKUs per resolved menu, so merge cost is microseconds. |
| Q5 | Regulatory posture? EU Food Information for Consumers (FIC) allergen mandate, FDA calorie disclosure, Japan halal labeling, China licensing per-store? | Forces per-country publish gates + regulator sign-off as a publish step; regulatory-override later in the day must not be blocked by the 3× day cadence. |
| Q6 | Language / script? UTF-8 everywhere? RTL (Arabic) layout changes? Pre-baked vs on-device localization? | Assumption: server-side pre-localization — HQ publishes N×lang variants so devices are dumb. Cheaper to burn CDN bytes than to ship i18n runtime to every POS. |
| Q7 | Who owns "price"? Is this an advertising menu (display only) or the authoritative source that POS charges from? | Huge. If POS charges from this blob, we are on the critical path to revenue and need transaction-safe publishing (2PC with POS, rollback guardrails). Assumption per probes: display authority only; pricing authority is the POS system; this blob is what the customer sees — pricing consistency is handled elsewhere. Still high-stakes: wrong allergen label = lawsuit. |
| Q8 | Update frequency upper bound? Truly 3/day or could it be every 5 minutes during a promo? | Assumption: 3 scheduled per day + occasional emergency push (recall an ingredient). Emergency path must exist but need not be the 99% path. |
| Q9 | A/B testing granularity? Per-region? Per-store? Per-device? | Assumption: per-store (not per-device — a device-level A/B inside one store would violate intra-store consistency). |
| Q10 | Ownership model inside the store? Is there a store manager UI, or is this fully HQ-driven? | Assumption: HQ-driven by default; manager has local override for "sold out" items (stripped down, non-schema-changing) — out of scope for v1. |
Remainder of the doc assumes the above.
2 Functional Requirements #
In scope (numbered)
- FR1 — Publish menu variant. Author at HQ creates a versioned menu variant (version_id, base, overrides, assets, effective_from, effective_until, target_filter) and it becomes durable, signable, and auditable.
- FR2 — Schedule activation. A publisher schedules version V to become active at timestamp T, for filter F (e.g., all EU stores, breakfast 2026-05-05 06:00 local). Supports timezone-local activation (every store has its own 06:00).
- FR3 — Cross-device atomic flip per store. At activation time, all devices in a store switch to V within a ≤1 s window (SLO), and never show a mixed state visible to a customer (stronger than "close in time").
- FR4 — Offline store fallback. A store with no WAN connectivity must serve the last-known-good menu (LKG) and all scheduled flips that were already pre-fetched. Must handle "store went offline before today's lunch variant was pre-fetched" gracefully.
- FR5 — Catch-up on reconnect. On WAN restoration, store reconciles: pulls manifest, fetches missed blobs, and skips past any activations that already-should-have-happened (no "welcome back, here's yesterday's breakfast at 7 PM").
- FR6 — Audit log & rollback. Every publish / activation / override is logged with actor, reason, diff. Rollback = "activate version V-prev now" globally or filtered; completes in ≤60 s at the edge.
- FR7 — Regional A/B testing. A variant can be rolled out to a percentage of stores in a region (e.g., 5% of EU stores for a week). Splits must be deterministic per store_id, not per request (see the bucketing sketch after this list).
- FR8 — Emergency recall. A version can be marked POISONED, forcing all stores to roll forward/back to a named safe version within a tight SLO (≤5 min global). Think: recalled ingredient.
- FR9 — Regulatory gate. Per-country publishing requires sign-off (e.g., EU FIC reviewer must approve allergen fields); the publish pipeline is a multi-stage workflow.
- FR10 — Freshness visibility. HQ dashboard shows per-store freshness: current_version, target_version, last_heartbeat, stale_duration.
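A minimal sketch of the deterministic split FR7 requires — hash the (plan_id, store_id) pair so membership is stable across polls and restarts. The function name and salt scheme are illustrative, not part of the API defined later:

```python
import hashlib

def in_canary(store_id: str, plan_id: str, pct: float) -> bool:
    """Deterministic per-store bucketing: membership depends only on
    (plan_id, store_id), never on request timing, so every poll and every
    device in the store agrees. plan_id salts the hash so successive
    experiments draw independent splits."""
    digest = hashlib.sha256(f"{plan_id}:{store_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < pct / 100.0

# e.g., a 5% EU canary: in_canary("store_fr_0042", "plan_eu_lunch_01", 5.0)
```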
Out of scope (explicit)
- POS transactions / pricing authority / tax — separate system.
- Kitchen Display System (KDS) — separate system; may subscribe to menu for recipe mapping but that's a downstream consumer.
- Inventory / "86'd items" — in-store local overrides; a shim consumes our menu and flags items unavailable.
- Digital-signage video ads / promo content — separate DSA/CMS; we only handle the menu schema and its thumbnails.
- Employee-facing training content — LMS, not menu.
3 Non-Functional Requirements + Capacity (BOE math) #
NFRs with numbers
| NFR | Target | Why this exact number |
|---|---|---|
| Availability — menu display at leaf | 99.99% (≤52 min/year customer-visible outage per store) | A store with a blank menu is closed. Edge-reliant; WAN availability of ≤99.9% is insufficient, so design must not require WAN to render. |
| Availability — control plane (publish/schedule) | 99.95% | Authoring can tolerate minutes of outage. The control plane is not on the customer path. |
| Consistency — intra-store | Linearizable across devices at flip boundary (all devices observe V at the "same" moment) | Customer-visible violation = two registers disagreeing on the price of a croissant. PR incident. |
| Consistency — inter-store | Eventual within 60 s for scheduled flips, 5 min for emergency recalls | Convergence under network partition, monotonic reads per-store. |
| Freshness — scheduled activation | Intra-store flip window ≤1 s | Customer-scale: an empty terminal for >1 s is noticed. |
| Freshness — emergency recall | Global ≤5 min propagation (excluding fully-offline stores) | Food-safety pull is legally time-bound. |
| Durability — no lost update | 0 lost activations in published state (RPO=0 for metadata; blobs are addressed by content hash) | Author signed it; we must not "forget". |
| Latency — manifest read | p99 ≤100 ms from store to nearest edge | Cheap to hit often. |
| Latency — blob fetch (cold CDN) | p99 ≤5 s for 2 MB from regional edge | Acceptable during pre-fetch; activation never blocks on a fetch. |
| Security | Signed manifests + signed blobs, mTLS store↔edge, KMS-backed keys | Menu tampering at the last mile is a brand + legal risk. |
Capacity estimate (all derived, nothing asserted)
Fleet
- 10,000 stores × 10 devices = 100,000 endpoints.
- +1 "store controller" per store = 10,000 controllers.
- Total managed nodes ≈ 110,000.
Traffic — control plane (publish / manifest)
- Manifest poll: 1 per store every 30 s → 10,000 / 30 = ~333 QPS globally. (Peaky around meal transitions; double it → ~700 QPS.)
- Manifest payload: ~1 KB → **~1 MB/s** aggregate at peak. Laughably small.
- Publish operations: 3 meals × 1 publish/meal × some variants = ~20-50 publishes/day. Not a scale problem.
Traffic — data plane (blob distribution)
- Blob size: 2 MB per resolved variant.
- Pre-fetch fan-out per meal: 10,000 stores × 2 MB = 20 GB per meal cycle.
- Daily: 3 × 20 GB = 60 GB/day aggregate from CDN edges.
- Per-edge POP, assume 50 POPs: ~1.2 GB/POP/day → a single nginx can push that in <1 second. The CDN is cost, not capacity.
- Why it's still a CDN: we want low tail-latency cold fetches and resilience to a blob-store region outage, not raw throughput.
Storage
- HQ metadata DB (Spanner): 1 KB per activation_plan row × 300K flips/day × 7-year retention ≈ **800 GB** (trivial, incl. overhead). Audit log dominated by publish events + device heartbeats → ~5 TB/year with indices and compression. Spanner handles this in a single regional config.
- Blob store (GCS/S3): ~5 variants × ~50 regions × 2 MB + assets + history. Including a year of version history (for rollback): ~100K variants × 2 MB = 200 GB. Nothing.
- Per-store local cache: 5 variants per meal (global/region/country/store merged resolved views × language locales) × 3 meals × 3 days LKG buffer = 45 × 2 MB = **90 MB**. Budget 1 GB per store (10× headroom for promotional seasonal menus). SD card trivial.
- Redis for live state: 100K endpoints × 300 B of state ≈ **30 MB hot set**. A single shard.
Bandwidth peak
- Scheduled peak: all 10K stores pre-fetch tomorrow's lunch within a randomized 4-hour window before activation → 20 GB over 4 h = ~1.4 MB/s globally. Sub-second peaks are hidden by CDN cache hits. Design invariant: randomized pre-fetch window + jitter prevents thundering herd.
Conclusion The problem is reliability-bound, not scale-bound. 300K activations/day is 3.5 QPS sustained — a single machine can execute that. Budget must be spent on leaf resilience, not on elastic control-plane capacity.
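The headline numbers above, reproduced as checkable arithmetic (a sanity-check sketch, nothing more):

```python
stores, devices_per_store = 10_000, 10
endpoints = stores * devices_per_store                  # 100,000 endpoints
manifest_qps = stores / 30                              # ~333 QPS at a 30 s poll
per_meal_fanout_gb = stores * 2 / 1_000                 # 20 GB per meal cycle
daily_fanout_gb = 3 * per_meal_fanout_gb                # 60 GB/day from CDN edges
device_flips_per_day = stores * 3 * devices_per_store   # 300K activations/day
sustained_flip_qps = device_flips_per_day / 86_400      # ~3.5 QPS sustained
```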
4 High-Level API #
Author / control-plane side (HQ operators + CI)
POST /v1/variants # create new variant (idempotent by client_ref)
body: {
parent_version_id, # optional; for layered merges (base)
overrides: { ... layered patch tree ... },
assets: [{url, sha256, bytes}], # already-uploaded to blob store
target_filter: { regions, countries, store_ids, tags },
author, reason, regulatory_approvals: [{jurisdiction, reviewer, timestamp, signature}]
}
returns: {version_id, content_hash, signed_blob_url}
POST /v1/activation_plans # schedule activation
body: {
version_id,
activate_at: {
kind: "UTC" | "STORE_LOCAL", # STORE_LOCAL = per-tz resolution
t: "2026-05-05T06:00:00",
grace_window_ms: 60000
},
rollout: { strategy: "all" | "canary", canary_pct, canary_order },
target_filter,
rollback_on: { error_budget_pct, max_stale_stores }
}
returns: {plan_id}
POST /v1/rollback # global or filtered rollback
body: {version_target, filter, reason, urgency: "scheduled"|"immediate"}
returns: {plan_id}
POST /v1/poison # emergency recall, highest priority
body: {version_id, safe_version_id, reason, regulatory_ref}
GET /v1/fleet/freshness # dashboard query
returns: per-store {current, target, last_hb, staleness_s}
POST /v1/variants/{id}/simulate # dry-run merge for a store_id — critical for authors
returns: resolved menu as it would render
Idempotency: every mutating call takes a client_request_id; writes are deduped in Spanner via a unique index on (operator_id, client_request_id).
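A sketch of those dedupe semantics — sqlite3 stands in for Spanner here, and the table and column names are illustrative:

```python
import sqlite3

# A UNIQUE index turns "retry the same publish" into a no-op instead of a
# duplicate variant. (sqlite3 is a stand-in; the real table lives in Spanner.)
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE publish_requests (
    operator_id TEXT, client_request_id TEXT, version_id TEXT,
    UNIQUE(operator_id, client_request_id))""")

def publish(operator_id: str, client_request_id: str, version_id: str) -> str:
    try:
        db.execute("INSERT INTO publish_requests VALUES (?, ?, ?)",
                   (operator_id, client_request_id, version_id))
        return version_id  # first attempt wins
    except sqlite3.IntegrityError:
        # Retried call: return the original result, never double-publish.
        row = db.execute(
            "SELECT version_id FROM publish_requests "
            "WHERE operator_id = ? AND client_request_id = ?",
            (operator_id, client_request_id)).fetchone()
        return row[0]
```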
Leaf / store-controller side (pull-based)
GET /v1/stores/{store_id}/manifest # the hot loop; tiny payload
returns: {
current: {version_id, sha256, activate_at, signed_url, expires_at},
upcoming: [ {version_id, sha256, activate_at, signed_url, expires_at}, ... ],
poisoned: [version_id, ...], # versions to purge
server_time_utc,
next_poll_hint_s,
manifest_sig # signature over everything above
}
auth: mTLS cert pinned to store_id
POST /v1/stores/{store_id}/heartbeat # observability + flow control
body: {
controller_version, device_versions: {device_id: version_id},
last_activation_ts, cache_bytes, wan_rtt_ms, anomalies: [...]
}
returns: {ack, instructions?: [...]} # may carry a "pull manifest now" nudge
Why pull, not push:
- Push requires a long-lived connection from HQ to 10K stores on flaky WANs. NAT traversal alone is a nightmare.
- Pull is idempotent, survives a dropped connection, and the store chooses when to fetch (important when the store WiFi is saturated during a lunch rush — don't fight for bandwidth).
- Urgent recalls are still ≤5 min via a short poll interval (15-30 s) + heartbeat-response "pull now" hint. Near-push latency for recall; pull simplicity for the 99% path.
- Considered alternative: gRPC streaming / MQTT — rejected. See Deep Dive C.
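What that pull loop looks like on the controller, as a sketch. `fetch_manifest` and `apply_manifest` are assumed helpers (the mTLS GET and the stage-and-schedule logic respectively); the jitter factor and backoff cap are illustrative:

```python
import random
import time

def poll_loop(fetch_manifest, apply_manifest, base_interval_s: int = 30):
    """Controller hot loop: poll, apply, honor the server's cadence hint.
    Jitter spreads 10K stores across the interval so the Manifest API
    never sees a synchronized herd at meal transitions."""
    interval = base_interval_s
    while True:
        try:
            manifest = fetch_manifest()
            apply_manifest(manifest)
            # Server can stretch or shrink the fleet-wide cadence,
            # including the "pull now" nudge (a tiny next_poll_hint_s).
            interval = manifest.get("next_poll_hint_s", base_interval_s)
        except OSError:
            # WAN down: keep serving LKG; back off but never stop trying.
            interval = min(interval * 2, 300)
        time.sleep(interval * random.uniform(0.8, 1.2))
```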
Transport & crypto
- mTLS store↔HQ edge with store-unique cert issued at provisioning, rotated yearly.
- Manifest signature: Ed25519 over the manifest body with HQ's publisher key; rotated quarterly; public keys on device via firmware.
- Blob integrity: SHA-256 embedded in manifest; controller verifies after download before activating. Do not trust CDN caches.
- Signed URLs for blob fetch: short-TTL (1 h), scoped to store_id + version_id. Rationale: if a store controller is stolen, the compromise window is 1 h.
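A sketch of the manifest signing round-trip with the `cryptography` package. In production the private key never leaves KMS; the in-process keypair below is purely illustrative:

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def canonical(manifest: dict) -> bytes:
    """Canonical JSON — sorted keys, no whitespace — so signer and
    verifier hash byte-identical bodies."""
    return json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode()

signing_key = Ed25519PrivateKey.generate()  # illustrative; real key in KMS
manifest = {"current": {"version_id": "v_2026-05-04T06:00Z_eu_lunch_01"},
            "server_time_utc": "2026-05-04T05:58:02Z"}
manifest_sig = signing_key.sign(canonical(manifest))

# Device side: verify() raises InvalidSignature on any tampering.
signing_key.public_key().verify(manifest_sig, canonical(manifest))
```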
5 Data Schema #
Storage-engine choices (justified)
| Data | Store | Why |
|---|---|---|
| menu_variant (metadata), activation_plan, store_state (authoritative), audit_log, rbac | Spanner (single global DB, multi-region) | Strong consistency for publish + rollback; global reads for manifest generation; transactional audit. Alternative Postgres + cross-region replication rejected: manifest build must be consistent globally, and Spanner's TrueTime + strong global reads are the paved road at Google. |
| Menu blobs, assets (images, localized bundles) | GCS / S3 (blob store) + CDN (Fastly/Cloudflare/Akamai) | Content-addressed by hash; immutable per version; CDN caches by URL → cache-busting by version = zero purge events. Why not Spanner BLOBs: cost/byte and egress don't pencil out at 2 MB blobs × 100K endpoints. |
| Live store_state (heartbeats, last_hb_ts, current_version by device) | Redis Cluster (regional) | High write rate; dashboard queries; <1 ms reads. Backed by async flush to Spanner for durability. Alternative DynamoDB/Bigtable fine — Redis chosen for TTL semantics on heartbeats. |
| Hot manifest responses | Regional cache (Redis + CDN for public manifest) | Manifest is signed; safe to cache. TTL = 5-15 s. Reduces Spanner read pressure at meal transitions. |
Tables (abbreviated)
menu_variant
version_id TEXT PK -- monotonic: "v_2026-05-04T06:00Z_eu_lunch_01"
content_hash TEXT UNIQUE -- SHA-256 of canonical JSON
parent_version_id TEXT NULLABLE -- base for layered merge
overrides_json JSONB -- patch tree; merged at publish time
assets_json JSONB -- [{blob_url, sha256, bytes}]
target_filter JSONB -- {regions, countries, store_ids, tags}
published_by TEXT
published_at TIMESTAMPTZ
regulatory_sigs JSONB -- [{jurisdiction, reviewer_id, sig, ts}]
state ENUM(draft, published, poisoned, archived)
canonical_blob_url TEXT -- immutable; tied to content_hash
INDEX(state, published_at DESC)
INDEX(target_filter USING GIN)
activation_plan
plan_id TEXT PK
version_id TEXT FK
activate_at_kind ENUM(UTC, STORE_LOCAL)
activate_at_ts TIMESTAMPTZ -- or local-time string for STORE_LOCAL
target_filter JSONB
rollout JSONB -- {strategy, canary_pct, canary_order}
rollback_on JSONB -- guardrails
state ENUM(scheduled, rolling, complete, aborted, superseded)
created_by, created_at
INDEX(activate_at_ts) WHERE state='scheduled'
store_state -- authoritative row per store; updated via heartbeat drainer
store_id TEXT PK
region, country, tz, tags
current_version TEXT -- what devices are on right now
target_version TEXT -- what manifest says they should be on
last_heartbeat_ts TIMESTAMPTZ
last_activation_ts TIMESTAMPTZ
staleness_s INT GENERATED -- now() - last_heartbeat_ts
wan_status ENUM(online, flaky, offline)
INDEX(region, staleness_s DESC)
audit_log
audit_id BIGSERIAL PK
actor, action, subject (version_id | plan_id | store_id)
before, after JSONB
reason TEXT NOT NULL -- required on every mutating action
ts TIMESTAMPTZ
INDEX(subject, ts DESC)
device_heartbeat_events -- Kafka/Pub-Sub topic, sampled + rolled up into store_state
store_id, device_id, version_id, ts, meta
Layered merge semantics (FR-critical)
Resolved menu at a given store = ordered merge of:
1. global_base (root of hierarchy, explicit version pinned).
2. regional_layer[region] (override set).
3. country_layer[country] (override set).
4. store_layer[store_id] (override set — often empty).
5. Optional promotional_overlay (time-windowed).

Merge rule: JSON-Patch-style deep merge with explicit remove markers (not null-means-delete — null is a valid value). Collisions bias to the deeper layer. The resolved merge happens at publish time, not at the device — the device receives a flat, pre-resolved blob. Rationale: (a) devices are heterogeneous and low-powered, (b) a merge bug at HQ is one incident while a merge bug on 100K devices is a chainwide outage, (c) regulators sign off on the final menu, not the patch tree.
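A minimal sketch of that merge rule. The `$remove` tombstone spelling is hypothetical — the point is that deletion is an explicit marker, distinct from a legal null value:

```python
REMOVE = {"$remove": True}  # explicit tombstone; None stays a legal value

def deep_merge(base: dict, overlay: dict) -> dict:
    """One merge step; the deeper layer wins on collision. A '$remove'
    marker deletes the key — null cannot, because null is a valid menu
    value (e.g., 'calories': null while lab results are pending)."""
    out = dict(base)
    for key, val in overlay.items():
        if val == REMOVE:
            out.pop(key, None)
        elif isinstance(val, dict) and isinstance(out.get(key), dict):
            out[key] = deep_merge(out[key], val)
        else:
            out[key] = val
    return out

def resolve(global_base, regional, country, store, promo=None):
    """Applied in hierarchy order at publish time; devices get the flat result."""
    resolved = global_base
    for layer in [regional, country, store] + ([promo] if promo else []):
        resolved = deep_merge(resolved, layer)
    return resolved
```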
6 System Diagram (ASCII) — Centerpiece #
╔════════════════════════════════════════════════════════════════════════════════════════════╗
║ HQ CONTROL PLANE (multi-region) ║
║ ║
║ ┌──────────────┐ POST /variants ┌────────────────┐ write ┌──────────────────┐ ║
║ │ Authoring UI │ ──────────────────▶ │ Publisher │ ────────▶ │ Spanner (global) │ ║
║ │ (internal) │ POST /plans │ (stateless) │ │ variants, plans, │ ║
║ └──────────────┘ │ JSON schema │ ◀──────── │ store_state, │ ║
║ ▲ │ validator + │ read │ audit_log │ ║
║ │ RBAC + regulatory sign-off │ JSON merger + │ └──────────────────┘ ║
║ │ │ signer (KMS) │ │ ║
║ ┌──────┴───────┐ └────────┬───────┘ │ ║
║ │ Regulator │ │ upload blob + sig │ ║
║ │ Reviewer │ ▼ │ ║
║ │ Console │ ┌────────────────┐ │ ║
║ └──────────────┘ │ Blob Store │ │ ║
║ │ (GCS/S3) │ │ ║
║ │ immutable │ │ ║
║ │ content-addr │ │ ║
║ └────────┬───────┘ │ ║
║ │ origin pull │ ║
║ ▼ │ ║
║ ┌────────────────┐ │ ║
║ │ Global CDN │ ──── signed URL ──┐│ ║
║ │ (Fastly/CF/ │ cached at POP ││ ║
║ │ Akamai) │ ││ ║
║ └────────────────┘ ││ ║
║ ││ ║
║ ┌────────────────────────────────┐ ┌────────────────┐ ││ ║
║ │ Scheduler (leader-elected) │──▶│ Manifest API │◀──────────────────┘│ ║
║ │ - scans activation_plan │ │ (stateless, │ │ ║
║ │ - materializes per-store │ │ autoscaled) │ │ ║
║ │ manifests into Spanner │ │ signs with KMS │ │ ║
║ │ - emits poison events │ └────────┬───────┘ │ ║
║ └────────────────────────────────┘ │ │ ║
║ │ read/write │ ║
║ ┌────────────────────────────────┐ │ │ ║
║ │ Rollout Orchestrator │──────────────┤ │ ║
║ │ - canary → auto-rollback on │ │ │ ║
║ │ error-budget breach │ │ │ ║
║ │ - consumes store_state │ │ │ ║
║ └────────────────────────────────┘ │ │ ║
║ │ │ ║
║ ┌────────────────────────────────┐ │ │ ║
║ │ Observability (dashboards, │ │ │ ║
║ │ alert: stale_store_count, │ ◀── heartbeats + audit (Kafka/PubSub) │ ║
║ │ rollout_progress, poison_ack) │ │ ║
║ └────────────────────────────────┘ │ ║
║ │ ║
╚═════════════════════════════════════════════════════════════╧══════════════════╧═════════════╝
│ │
│ │
┌───────────────────────────────────────────┤ │
│ REGIONAL EDGE (per continent) │ │
│ │ │
│ ┌───────────────────────────┐ │ │
│ │ Regional Manifest Mirror │ ◀─ async ─┤ │
│ │ (read-only replica) │ stream │
│ │ serves manifest if HQ │ │
│ │ region unreachable │ │
│ └─────────────┬─────────────┘ │
│ │ │
│ │ mTLS │
│ │ GET /manifest │
│ │ every 30s │
│ ▼ │
└─────────────────┼─────────────────────────────────────────────┘
│
│ ┌──────────────┐
│ │ CDN POP │ ◀── pull blob
│ │ nearest to │
│ │ store │
│ └──────┬───────┘
│ │
╔════════════════════════▼════════════════════════════════▼═══════════╗
║ ONE STORE (×10,000) ║
║ ║
║ ┌─────────────────────────────────────────────────────┐ ║
║ │ Store Controller (NUC or equivalent) │ ║
║ │ - Store Agent daemon │ ║
║ │ - SQLite (state) + flat-file cache (blobs) │ ║
║ │ - Local leader: coordinator for intra-store flip │ ║
║ │ - Runs local HTTP (store-LAN) serving manifest │ ║
║ │ + blob to devices │ ║
║ └───────────┬─────────────────────────┬────────────────┘ ║
║ │ local LAN mDNS/static │ systemd + watchdog ║
║ │ http://store-ctl.lan/ │ ║
║ ┌───────────▼─┐ ┌───────────┐ ┌──────▼────────┐ ┌──────────────┐ ║
║ │ POS Register│ │ POS Reg. │ │ Menu Board │ │ Kiosk │ ║
║ │ device │ │ device │ │ (digital) │ │ (self-serve) │ ║
║ │ - polls │ │ │ │ - polls │ │ - polls │ ║
║ │ - has LKG │ │ │ │ - has LKG │ │ - has LKG │ ║
║ │ - verifies │ │ │ │ │ │ │ ║
║ │ sig │ │ │ │ │ │ │ ║
║ └─────────────┘ └───────────┘ └───────────────┘ └──────────────┘ ║
║ (×10 devices; each verifies signature independently) ║
╚═══════════════════════════════════════════════════════════════════════╝
Arrow semantics (every arrow ↔ API or data flow)
| Arrow | Protocol | Volume | Frequency |
|---|---|---|---|
| Authoring UI → Publisher | HTTPS JSON | <1 MB per publish | 10-50/day |
| Publisher → Spanner | Spanner RPC | <1 KB rows | per publish |
| Publisher → Blob store | S3 PUT | 2 MB | per variant |
| Blob store → CDN | origin pull (lazy) | 2 MB | per unique version × cold POP |
| Scheduler → Spanner | RPC | small | continuous scan |
| Rollout Orchestrator ← store_state | Pub/Sub | ~300 B | on heartbeat |
| Store Controller → Manifest API | HTTPS GET mTLS | 1 KB | every 30 s (jittered) |
| Store Controller → CDN POP | HTTPS GET signed URL | 2 MB | per new version (pre-fetch) |
| Store Controller → Devices (LAN) | HTTP + local signed manifest | 1 KB + 2 MB | same as upstream, served from store cache |
| Store Controller → Heartbeat | HTTPS POST mTLS | <1 KB | every 60 s |
| Regional Manifest Mirror ← HQ | gRPC streaming replication | small | continuous |
Sub-diagram A — Pre-fetch flow (scheduled lunch 12:00 local)
T-4h ──────────── T-3h ─────────── T-10min ────────── T=12:00 ──────
│ │ │ │
│ Scheduler writes manifest │ │
│ (upcoming: v_lunch, activate_at=12:00) │
│ │ │ │
│ Store controller polls manifest at next tick: │
│ sees "upcoming" with activate_at = T │
│ chooses a random time in [T-4h, T-30min] │
│ ── BLOB PRE-FETCH jittered ──▶ CDN │
│ verifies sha256 + manifest sig; writes to disk │
│ marks "staged" in SQLite │
│ │ │ │
│ │ │ T-10min: │
│ │ │ controller │
│ │ │ runs PRE-FLIP │
│ │ │ broadcast to │
│ │ │ all devices: │
│ │ │ "stage v_lunch"│
│ │ │ │
│ │ │ devices ack │
│ │ │ │
│ │ │ │ T=12:00:
│ │ │ │ COMMIT broadcast
│ │ │ │ (see Deep Dive A)
Sub-diagram B — Intra-store atomic flip (Deep Dive A in pictures)
Store Controller (local leader):
│
│── t = T − 10min: PREPARE(v_lunch) ───────────────▶ all 10 devices
│ │
│ each device:
│ - fetches blob from
│ local LAN cache
│ - verifies sha256 + sig
│ - stages as "ready"
│ - replies PREPARED(v_lunch)
│ OR ABORT(reason)
│ │
│◀─── PREPARED / ABORT from each device ─────────────┘
│
│── (gate) if any ABORT or timeout: ───▶ cancel, re-stage, alert, keep LKG
│
│── t = T: COMMIT(v_lunch, seq=N+1) ───────────────▶ all 10 devices
│ │
│ each device atomically:
│ - rename staged blob
│ → active path
│ - update in-memory
│ pointer
│ - (browser UI: soft-
│ reload via SSE)
│
│◀─── COMMITTED(v_lunch, seq=N+1) ───────────────────┘
│
└── log transition, heartbeat upstream
Sub-diagram C — Offline recovery / reconnect
Store disconnected at t0. WAN reconnects at t1. Scheduled flips missed in [t0, t1].
│
│── t1: controller detects WAN up (dns + probe)
│── immediate: GET /manifest
│ ◀── manifest with:
│ current: v_X (what HQ thinks we should be on NOW)
│ upcoming: [...]
│ poisoned: [v_bad, ...]
│
│── reconcile logic:
│ if current_local ∈ poisoned: FLIP to safe ASAP (see Deep Dive C)
│ if current_local != v_X:
│ if clock shows we are past the activate_at for v_X:
│ fetch v_X blob → intra-store flip now (skip intermediate flips)
│ else:
│ fetch v_X blob → schedule flip at activate_at
│ for each upcoming: pre-fetch + schedule (normal path)
│
│── 60 s later: heartbeat reports reconciled state
7 Deep Dives #
Deep Dive A — Intra-store atomic flip (the earned-secret bit)
Why critical. Customer-visible correctness lives or dies here. If register 1 shows "2026 Summer Menu" with new croissant price $4.50 and register 2 still on yesterday's menu at $4.00, we have (a) a confused cashier, (b) a refund event, (c) a viral TikTok "McDonalds randomly charges different prices." This is the L7 earned-secret question: how do you do distributed atomic commit across 10 nodes whose only shared infrastructure is a $40 store router?
Three candidate designs (compared quantitatively):
| # | Approach | Mechanism | Latency | Failure mode | Mitigation | Verdict |
|---|---|---|---|---|---|---|
| A1 | Naïve: activate_at timestamp, devices trust their own clock | Every device pulls the manifest with activate_at = T; at now() >= T it flips. | 0 RTT; simplest. | NTP skew. In a store, NTP syncs to upstream every 15-30 min; typical skew 50-500 ms, bad cases 1-5 s if WAN is down. ⇒ device A flips at T, device B flips at T+500 ms. Customer sees mixed state for up to 500 ms per flip, 3× per day. | Could tighten via local NTP at the store controller, but you still have jitter. Still possible: >1 s mixed state during clock-skew spikes. | Rejected as sole mechanism. Used only as a defensive fallback timestamp. |
| A2 | Pure 2PC over store-LAN with controller as coordinator | PREPARE to all devices, wait for all-PREPARED, then COMMIT. If any device doesn't respond within the timeout → ABORT. | 2 RTTs on LAN (~2 ms each) + longest-tail device = ~20-100 ms. | If the coordinator crashes between PREPARE and COMMIT → devices stuck in PREPARED limbo, blocking new flips. Classic 2PC pathology. | Add a persistent coordinator log + recovery protocol + device-side timeout that auto-aborts after a grace period (e.g., 60 s), reverting to current. | Good but not enough alone — blocking if the LAN partitions during the commit phase: some devices commit, others don't. |
| A3 | Pre-staged blob + activate_at + 2-phase broadcast with bounded-stall and grace window (CHOSEN) | T-10 min: controller broadcasts PREPARE → all devices stage blob + verify + reply PREPARED. T: controller broadcasts COMMIT over store-LAN redundantly (3 retries, 100 ms apart). Each device also independently trusts its local clock: if no COMMIT is received within grace_window_ms after activate_at, the device flips on its own (the blob is already there). | ~100 ms worst case intra-store. | (a) A device misses COMMIT and its clock is skewed → flips late. (b) A device crashes between PREPARE and COMMIT → stays on LKG, surfaces as ABORT. (c) Coordinator crashes → every device's fallback timer fires at ~T and they flip approximately together. | Device clocks synced via store-local NTP on the controller (controller itself NTP-synced over WAN, but within the store it serves time) → intra-store skew typically <50 ms. Grace window = 500 ms: either everyone commits via broadcast within 500 ms, or they fall back to their clock (bounded deviation). | CHOSEN. Best of both: coordinator-driven convergence when healthy; clock-driven recovery when the coordinator flakes. |
Chosen protocol in pseudo-code:
# On each device
on PREPARE(v, activate_at_ts):
pull_blob_from_store_lan(v)
if not sha256_ok or not sig_ok:
reply ABORT; return
stage(v) # atomic rename to .staged; not active yet
reply PREPARED
set timer T = activate_at_ts + grace_window_ms
state = PREPARED(v, T)
on COMMIT(v, seq):
if state == PREPARED(v, _):
atomic_rename(.staged → .active) # POSIX rename, inode swap
hot_reload() # UI pointer update; no process restart
reply COMMITTED(v, seq)
state = ACTIVE(v)
on timer_fire:
if state == PREPARED(v, T) and now() >= activate_at_ts:
# coordinator didn't tell us to commit; commit ourselves
atomic_rename(.staged → .active)
log("autonomous_commit") # → heartbeat → alert if too many autonomous
state = ACTIVE(v)
on ABORT(v, reason):
discard(.staged)
state = ACTIVE(previous)
log abort to heartbeat
Why the atomic rename matters. POSIX rename(2) on the same filesystem is atomic; no reader sees a torn file. A UI that reads menu.json via mmap or open() sees either the old inode (pre-rename) or the new one (post-rename), never a half-written file. This is the same primitive LaunchDarkly / Nginx config reload / Atomic-Deploy-Symlink patterns rely on. Cheap, battle-tested, no distributed-systems-PhD required.
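The stage/commit pair in miniature (paths are illustrative; both must live on the same filesystem for rename(2) to be atomic):

```python
import os

ACTIVE = "/var/menu/menu.json"          # path the UI reads
STAGED = "/var/menu/menu.json.staged"   # same filesystem as ACTIVE

def stage(blob: bytes) -> None:
    """PREPARE half: write + fsync. A power cut mid-write can only leave a
    torn .staged file, never a torn active menu."""
    with open(STAGED, "wb") as f:
        f.write(blob)
        f.flush()
        os.fsync(f.fileno())

def commit() -> None:
    """COMMIT half: one atomic inode swap. A reader opening ACTIVE sees the
    old bytes or the new bytes, never a mixture."""
    os.rename(STAGED, ACTIVE)
    # For strict durability, also fsync the containing directory so the
    # rename itself survives power loss.
```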
Earned-secret bits hidden here:
- A device that "auto-commits" without COMMIT broadcast should emit an anomaly heartbeat; the dashboard threshold of "N% autonomous commits in the fleet" is an early warning that LAN-multicast or controller health is degrading in the wild.
- grace_window_ms = 500 ms is the customer-visible SLO for mixed-menu exposure. Any tighter and a single TCP retransmit can breach it. Any looser and an NTP-skewed device is visibly mismatched. 500 ms is short enough that no human will notice at the register but long enough to cover LAN misbehavior.
- The controller runs local NTP (it serves time to devices on the store LAN). This turns "WAN-clock-skew" into "controller-clock-skew," which is a single fault domain and far easier to alert on.
- Pre-flip PREPARE has the invaluable side-effect of verifying blob integrity on all devices before the flip deadline. A bad blob is detected 10 min before it would have gone live, giving time to abort and roll forward.
Comparison (intra-store mixed-state exposure per flip, quantified):
- A1 alone: 500 ms (NTP skew) per flip × 3 flips/day × 10K stores = **15 M ms/day** of mixed-state customer exposure.
- A3 (chosen): 50 ms typical → **1.5 M ms/day, an order-of-magnitude improvement**, with worst-case exposure hard-bounded at 500 ms per flip by the grace window.
Deep Dive B — Offline resilience & staleness semantics
Why critical. A store that can't serve a menu is closed. The probe said "what happens when a restaurant goes offline." The L6 answer is "LKG cache." The L7 answer is a tiered degradation policy with regulatory and product stops.
Three candidate designs:
| # | Approach | What it does | Gaps |
|---|---|---|---|
| B1 | LKG forever — serve last valid menu until reconnect. | Simple. | EU allergen labels could be stale for weeks → legal liability. Pricing drift if this were pricing-authoritative. |
| B2 | Hard expire after N hours — show "menu unavailable" splash. | Regulatory-safe. | Store is closed. Revenue loss. Not acceptable during typical WAN outages which are minutes, not days. |
| B3 | Tiered staleness regime (chosen). Three bands: Fresh / Stale / Expired; each has a product + compliance policy. | See below. | More complex; needs manager UI for override. |
Chosen tiered policy:
age = now() - menu.published_at
tier =
FRESH if age <= 24 h → normal operation
STALE if 24 h < age <= grace (e.g., 72 h)
→ serve normally, show a manager-only banner
→ emit alarm to HQ
→ feature-flag "disable new SKUs" if schema changed
EXPIRED if age > grace
→ refuse to show price-modified items
→ fall back to "safe menu" (a pre-baked subset with stable prices + mandated allergen info)
→ in EU: refuse to sell items with time-sensitive allergen labels
→ store manager can override for an extra N hours (audit-logged)
→ at 7 d, store may be required by policy to close that terminal
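The same bands as a classifier, sketched — the actions (banner, alarm, safe-menu fallback) hang off the returned tier:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FRESH_LIMIT = timedelta(hours=24)
GRACE_LIMIT = timedelta(hours=72)   # matches the assumed 72 h offline SLO (Q2)

def staleness_tier(published_at: datetime,
                   now: Optional[datetime] = None) -> str:
    """Map menu age onto the FRESH / STALE / EXPIRED policy bands."""
    age = (now or datetime.now(timezone.utc)) - published_at
    if age <= FRESH_LIMIT:
        return "FRESH"
    if age <= GRACE_LIMIT:
        return "STALE"
    return "EXPIRED"
```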
Why tiered wins:
- Covers the 99% case (WAN flaps for minutes) with zero customer impact.
- Covers the 0.9% case (WAN down for hours during snowstorm) with degradation, not closure.
- Covers the 0.1% case (prolonged disconnection) with a regulator-safe, audit-clean mode.
- Gives store manager agency (override) with accountability (audit log).
Fail-safe "rice and water" menu. Every device ships with a minimal, pre-baked "fail-safe" menu that includes the universally-available items at conservative prices, with legally-mandated allergen text. If everything else fails — bad blob, corrupt cache, missing keys — fail-safe menu is served. This is analogous to CDNs serving a static error page from their edge when origin is fully dead. It means a store is never completely menu-dark.
Catch-up on reconnect — subtle bits:
- If the store missed 5 flips while offline, we do NOT replay them; we jump to the current target version. Replaying "breakfast at 3 PM" is both customer-bad and wastes the store's LAN.
- Poisoned versions pre-empt everything. If current_local ∈ poisoned, we flip to the named safe version before even processing other manifest hints. Emergency recall is latency-sensitive.
- We do replay relevant audit events in heartbeat: "I was offline, I did N autonomous commits, here are my transitions during the outage." Observability completeness matters for post-incident analysis.
Staleness SLO metrics to page on:
- stale_store_count_1h > 0.5% — WAN or CDN regional problem.
- stale_store_count_72h > 0.01% — human escalation, may need field-ops.
- autonomous_commit_rate > 1% of flips — controller health degradation.
- expired_store_count > 0 — paging; revenue + regulatory.
Deep Dive C — Global rollout of a bad menu (and why cache-busting > purge)
Why critical. "Bad menu goes global in 60 s" is the most expensive failure mode. A typo showing "$0.01 Big Mac" across 10K stores is a trading-pit-level incident. We need defense in depth: prevent, detect, contain, rollback.
Layered defenses:
- Author-side schema validation — the publisher rejects out-of-bounds prices (e.g., any item priced at <20% or >500% of the sibling-country baseline is flagged for review). Data-quality gate.
- Regulator sign-off — per-country compliance review is a hard gate before an activation_plan can target that country.
- Canary rollout — the first 5% of target stores (geo-distributed, not a single region) get V at T. If the error budget breaches in 5 min → auto-abort. If clean for 10 min → expand to 20% → 50% → 100%. Inspired by LaunchDarkly / Statsig % rollouts, Google Chrome % rollouts, Envoy/Istio traffic-splitting patterns.
- Automatic rollback — guardrail tripwires in the rollback_on block: anomalous POS void spike, manager override spike, sentinel tests in "shadow stores." On trip → the rollout orchestrator issues /rollback with urgency=immediate.
- Emergency recall — /poison marks the version and the next manifest poll carries a poisoned: [v_bad] list. Devices detect and flip to safe_version within their next poll + flip cycle (~1 min). No CDN purge needed; see below.
The cache-invalidation question (CDN purge vs cache-busting by version):
Two options; we choose one for subtle but high-value reasons.
| Option | Mechanism | Latency | Failure mode | Verdict |
|---|---|---|---|---|
| CDN purge | Call CDN API to purge menu-v42.json at all POPs. | 30 s - 5 min per provider; 1-2% of POPs are stragglers. | Partial success: some POPs still cache v42. A store fetching from those POPs gets the bad blob. Cache invalidation is hard — see Phil Karlton's Law. | Reject. |
| Cache-busting by version in URL (chosen) | Each version has its own URL: /menus/{content_hash}.json. The old URL is never invalidated; it just stops being referenced. To "roll back," we publish a new manifest pointing to the older, known-good URL. | No purge latency; immediate on manifest refresh. | None — the manifest is the source of truth; stale CDN entries are irrelevant. | Chosen. |
Why cache-busting wins at every quantifiable axis:
- Correctness: impossible to serve the bad blob once the manifest points elsewhere.
- Latency: bounded by manifest-poll interval (30 s) + intra-store flip window (≤1 s).
- Rollback cost: zero CDN API calls; zero purge events (purges cost $, have quota, and scrape cold caches).
- Auditability: every byte served has an immutable content hash; forensics is "which version_id did this store have."
This is the same pattern used by WebPack-hashed asset bundles, iOS App Store versioned bundles, Debian snapshot archives, and GCS generation-numbered objects.
What remains on CDN: only cost control. The bad blob is still cached at 50 POPs consuming (maybe) 100 MB of CDN storage until TTL expires (e.g., 24 h). Negligible.
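Content addressing in two lines, plus the rollback-as-repoint move (the bucket host is illustrative):

```python
import hashlib

def blob_url(menu_bytes: bytes) -> str:
    """Immutable, content-addressed URL: identical bytes → identical path;
    no path is ever rewritten, so there is nothing to purge."""
    digest = hashlib.sha256(menu_bytes).hexdigest()
    return f"https://cdn.example.com/menus/{digest}.json"

manifest = {"current": blob_url(b'{"menu": "v42"}')}  # bad version live
manifest["current"] = blob_url(b'{"menu": "v41"}')    # rollback = repoint
# POPs still caching v42's URL are harmless: nothing references it anymore.
```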
Sentinel stores — an under-loved trick. Run ~100 "canary stores" (chosen for geo/traffic diversity) that always get new versions first. These stores also run synthetic transactions (robo-customer clicks through menu, simulated order, tests price rendering, tests allergen text). Breakage detected here before real customers see it. Pattern from Google's SRE playbook (canary analysis), Meta's LSV rollouts, Netflix region-by-region rollouts.
8 Failure Modes & Resilience (matrix) #
| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| HQ region | Full region outage (rare — AWS/GCS region failure) | Synthetic manifest probe from external monitors; K8s control-plane alerts | Publish blocked; manifest reads degraded | Regional manifest mirrors (read-only replicas per continent) serve cached last-known-good manifests; scheduled activations continue because stores already pre-fetched; no new publishes until HQ is restored | Failover authoring UI to DR region (hours RTO; acceptable — authoring is not on customer path) |
| CDN | POP-wide outage | Synthetic GET from monitoring, plus real store miss-rate spike | Stores in that POP's catchment fail to pre-fetch | Multi-CDN (active-active Fastly + Cloudflare; manifest URL has 2 URLs, prefer-list); CDN fallback is origin blob store directly (slower but works) | CDN provider failover; multi-CDN routing shifts traffic |
| Blob store origin | GCS/S3 outage | CDN origin-pull errors; synthetic probes | New version can't be pre-fetched; existing caches still serve | Write to two blob stores (primary + cross-cloud secondary); manifest can reference either URL; CDN origin-shield configured with fallback origin | Writes drain to secondary; no data loss (content-hash immutable); originals copied when primary recovers |
| Spanner | Regional brownout on primary | Spanner SLO alerts | Publishes fail; manifest generation slows | Spanner multi-region config (automatic); reads can go to replicas | Auto-failover; writes resume when quorum recovers |
| Manifest API | Pod crash / overload | Health-check 5xx rate | Stores get stale manifests (but have pre-fetched already, so flips still work) | Autoscale + rate limits + CDN in front of manifest (signed, short-TTL cacheable); store controller retries with exponential backoff | Self-heals as pods recover |
| Scheduler | Leader crash | Leader-lease timeout | New activations not materialized into manifests | Leader-election (Spanner-backed lease); standby takes over in <10 s | Automatic |
| Rollout Orchestrator | Crash mid-rollout | Heartbeat | Canary frozen at current percentage | Persistent state in Spanner; restart resumes from last completed stage; guardrail is that a frozen rollout is safe (doesn't expand) | Restart |
| Store WAN | Down for minutes | Controller heartbeat gap | 1 store shows LKG (which is correct) | Tier B (Deep Dive B); manifest cached | Automatic on reconnect |
| Store controller | Crash mid-flip (between PREPARE and COMMIT) | Devices auto-commit via grace-window timer (Deep Dive A) | Up to 1 flip might be slightly desynchronized across devices; bounded 500 ms | Controller systemd auto-restart; post-restart it reads SQLite state and resumes heartbeat | Self-heal; heartbeat surfaces an anomaly |
| Device crash at flip | One device crashes between PREPARE and COMMIT | Controller sees PREPARED response but no heartbeat | One register shows old menu (or crash page) | Coordinator aborts / proceeds based on quorum policy (default: proceed if ≥N-1 acked; single laggard catches up after reboot and runs catch-up protocol) | Manual reboot or watchdog |
| Clock skew between devices | Device NTP broken, clock drifts 10s | Heartbeat reports local clock vs controller clock | Device flips early or late; grace window contains it if <500 ms; if >500 ms, we page | Store controller is local NTP; watchdog kills device's NTP client and re-inits | Ops runbook |
| Half-flipped store | 6/10 devices on new menu, 4 on old (LAN partition) | Heartbeat diff: distinct(device_versions) > 1 for > 60 s | Customer-visible mixed state | Controller explicit rollback: broadcasts ABORT for the new version; all devices go back to LKG; store is in a consistent state until the partition resolves | Retries flip when partition heals |
| Bad menu (corrupts POS) | Bad blob accepted, all stores flip, POS breaks | POS error-rate guardrail | Chainwide | /poison + auto-rollback; fail-safe menu fallback; sentinel stores catch it before full rollout | Rollback to LKG; RCA |
| Regulator override late | EU regulator flags v42 after it's live | Regulatory system emits alert | EU stores serving v42 | Immediate /poison v42 + rollback to v41 in EU filter only | <5 min recovery for EU |
| Key compromise | Publisher key leaked | Security team | Attacker could publish signed menus | KMS-backed keys; monthly rotation; multi-party signing for high-risk publishes (e.g., prod + regulator co-sign); device pins publisher keyring with revocation support | Rotate keys; revoke old; next manifest carries new keyring |
| Store controller stolen | Physical theft | Cert pinning + short-TTL signed URLs limit window; mTLS cert revocation list | Single store, max 1 h of signed-URL validity | CA revokes cert; store re-provisioned from replacement hardware | Field ops |
9 Evolution Path #
v1 — "We have 50 stores, no automation"
- SFTP drop: HQ publishes a menu file to an SFTP server; stores cron-poll hourly and copy.
- Manual flip: store manager restarts the menu display at breakfast/lunch/dinner.
- No signing, no audit, no rollback (except "copy yesterday's file back").
- Works. Scales to maybe 200 stores and ~1 update/day.
- Pain: no atomicity, no rollback, no observability, no canary, no offline distinction from "forgot to restart."
v2 — "1,000 stores, we need a real distribution system"
- Pull-based manifest + blob separation (the architecture in §6 without the bells).
- Store agent daemon; LKG cache; signature verification.
- Single-region HQ; single CDN.
- Activations via activate_at timestamp only (no 2PC). Accept ~500 ms intra-store mixed state — users haven't noticed yet.
- Basic audit log. Manual rollback (author publishes a new version with old content).
- Dashboard shows stale stores.
- Pain: bad-menu incident reveals a 45-minute global outage because CDN purge took forever. NTP skew causes visible mismatches. Team feels the heat.
v3 — "10K stores, 80 countries, 3 meals × 5 languages × 10 regions"
- Everything in §6: multi-region HQ, regional manifest mirrors, multi-CDN, poison/rollback, 2PC + grace-window intra-store flip (Deep Dive A), tiered staleness policy (Deep Dive B), cache-busting versioned URLs (Deep Dive C), sentinel stores, canary rollouts with guardrails, regulatory signing, layered merge at HQ, fail-safe baked menu, freshness SLOs, per-region A/B.
- ML-assisted anomaly detection on POS behavior to catch bad menus that passed schema validation.
- Edge-deployed WASM menu rendering (optional) for kiosks with fancier layouts.
v4 — "We're on the platform; this is a product"
- Self-service for franchisees: regional managers can draft + submit; HQ approves.
- Delta-encoded blobs (compressed against the prior version) — saves ~80% of CDN bytes.
- Joint system with KDS, POS, loyalty — menu+promo cross-system atomic flips.
- On-device inference for personalized menus (privacy-compliant, on-device, no PII to HQ — natural with the edge architecture).
10 Out-of-1-Hour Notes (earned-secret depth) #
Signed manifests — deeper dive
- Signature is Ed25519 (deterministic, fast to verify on constrained POS hardware).
- Signature covers the canonical-JSON manifest body plus a sequence number per-store to defeat replay (otherwise an attacker could replay an old manifest to downgrade the menu — e.g., reintroduce a recalled item).
- Device stores the last seen sequence number; rejects any manifest with seq ≤ seen.
- Key rotation: quarterly publisher key roll. Devices accept any key in the keyring (zero-downtime rotation); keyring shipped via manifest + signed by a root key that is much longer-lived and kept in HSM. Root key rotation is a 2-year event with staged OTA.
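Device-side acceptance as a sketch: verify against any keyring member first (that is what makes rotation zero-downtime), then enforce sequence monotonicity on the now-trusted body. Field names are illustrative:

```python
import json
from typing import Optional
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def accept_manifest(body: bytes, sig: bytes,
                    keyring: list[Ed25519PublicKey],
                    last_seen_seq: int) -> Optional[dict]:
    """Return the parsed manifest if it is authentic and fresh, else None.
    seq lives inside the signed body, so it is checked only after the
    signature holds — a valid-but-old manifest is rejected (no downgrade)."""
    for key in keyring:                 # any current ring member may verify
        try:
            key.verify(sig, body)       # raises InvalidSignature on tamper
            break
        except InvalidSignature:
            continue
    else:
        return None                     # no key verified the signature
    manifest = json.loads(body)
    if manifest["seq"] <= last_seen_seq:
        return None                     # replayed manifest: reject
    return manifest
```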
Asset compression & payload engineering
- Text JSON: Brotli -11 offline at publish time (not at serve time) — CPU at author, bytes at edge.
- Images: WebP / AVIF at multiple resolutions, pre-generated (one per device class).
- Consider content-defined chunking (rsync-style) for delta blobs: if only prices changed, new blob is ~90% identical to previous; ship a 200 KB patch instead of 2 MB full blob.
- Reject deltas on bad-chain corner cases (missing base on the device → fall back to full blob).
Device class heterogeneity
- Reference device profile matrix:
- Class A: Android tablet (recent). Full runtime, WebView, mTLS, verify sig in-process.
- Class B: POS terminal (Windows Embedded 7, weak CPU). Uses the store controller as a proxy; just fetches /active/menu.json from the LAN.
- Class C: Digital menu board (RPi-class or HDMI dumb display). Store controller renders and pushes the frame; the board is truly dumb.
- Single-digest principle: regardless of device class, every device either verifies the signature itself or trusts the store controller, which has verified on its behalf. Class C never fetches directly from the CDN; traffic is always proxied via the controller for audit & cache locality.
Localization + allergen regulatory
- Pre-localize at publish: menu blob is emitted once per (country, language) pair; device fetches only the one it needs.
- Allergen fields are structured (not free text): {contains: ["gluten","dairy"], may_contain: ["nuts"]}; the UI renders per locale.
- EU FIC Article 21 requires allergen information to be accurate and up-to-date; hence the 72 h staleness grace in Deep Dive B has a regulatory rationale, not just an operational one.
- FDA menu-labeling (calorie disclosure) has similar provenance requirements.
- Japan / China: separate trust anchors per regulator; signing pipeline is country-sharded.
Power outage behavior
- Store controller has a UPS-backed battery (15-30 min). If power fails: controller raises "degraded" flag; WiFi may die too; devices fall back to LKG cache on local NVRAM.
- Blob cache is written with fsync + rename after every stage — no half-written files on power loss.
Intra-store gossip when WAN is down
- Controller-is-the-oracle model normally. But if the controller dies and the WAN is also down, devices have no shared timebase.
- Fallback: device-to-device gossip via mDNS/Zeroconf on store LAN. One device is ad-hoc elected (lowest MAC addr) as "flip leader" and broadcasts "activate_at" commits. Eventual convergence on 1-2 s bound.
- This is a rare corner but it keeps the store "eventually consistent" during degraded mode. The code path is rarely exercised — SRE runbooks should include chaos testing of "controller + WAN simultaneously down during meal transition."
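The degraded-mode election, sketched. It works because every device computes the same answer from the same mDNS peer list — no election messages are needed beyond discovery (names are illustrative):

```python
import uuid

def my_mac() -> str:
    """This device's MAC address as lowercase hex."""
    return f"{uuid.getnode():012x}"

def elect_flip_leader(discovered_macs: list[str]) -> str:
    """Lowest MAC wins. Deterministic given an agreed peer set; if LAN views
    briefly differ, the worst case is two leaders broadcasting the same
    activate_at commit — idempotent, so the store still converges."""
    return min(discovered_macs + [my_mac()])
```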
Cost per store per day
- Manifest polls: 2880/day × 1 KB = 3 MB egress → negligible ($0.0003).
- Blob pre-fetch: 3 variants × 2 MB = 6 MB/day → ~$0.0005.
- Heartbeats: 1440/day × 500 B = 720 KB → negligible.
- Per store per day: ~$0.001 distribution cost; ~$10/month including CDN pre-paid + a share of the control plane. Compared to the $XXX/store/day we'd lose if the menu went out during lunch, this is the cheapest insurance money can buy.
Observability SLOs (per-store freshness dashboard)
Key SLIs:
- p99(now - last_heartbeat_ts) per region — should be <5 min.
- stores not yet on target_version at activation + 1 min — should be <1% of the target fleet.
- count(device_version_disagreement_within_store) > 0 — should be 0 outside the <500 ms flip windows.
- autonomous_commit_rate — should be <0.1% (if higher, controller LAN health is degrading).
- rollback_time_to_95pct — for emergency recalls, should be <5 min.
Dashboards:
- Per-region heatmap — staleness by country.
- Rollout progress bar — for each active activation_plan, the percentage of target stores on target_version.
Adversarial / abuse concerns
- Rogue store employee: what if someone with physical access to the controller modifies the local cache? Device-side signature verification on every load means a modified blob fails verification and falls back. Manager override is audit-logged to HQ heartbeat — can't hide.
- Rogue publisher: what if an attacker (insider) publishes a bad menu? Dual-control (two-person approval) on publish + regulator sign-off + sentinel stores + canary = multiple gates.
- Manifest replay: prevented by per-store sequence numbers.
- DoS on manifest API: rate-limited per cert; CDN in front; scale out is cheap.
Privacy posture (aligning with candidate's Privacy Infra background)
- Menu data is not PII; however, heartbeat + store_state DB is a geo-tagged telemetry stream. If joined with customer purchase data, it becomes sensitive. Separation of duty: menu-distribution DB ≠ customer-purchase DB; no join keys.
- If v4 ships personalized menus, per-user preference computation should happen on-device (privacy-preserving, GDPR-clean, no PII to HQ) — echoing the Android private compute core pattern.
- Audit log retention: 7 years for regulatory, then hard delete. Signed at write time (tamper-evident).
What I didn't cover but would in a longer slot
- Kitchen Display System integration. Menu → recipe → KDS routing is a different beast; would need to coordinate "recipe pack deployed" with "menu active" to avoid the kitchen getting an order for an item its KDS doesn't know how to prepare. Pattern: two-phase coordinated activation across menu-system and KDS-system, with KDS version as a dependency in the activation plan.
- Promotional overlays (time-bounded). E.g., "$3 coffee for the next 2 hours." Modeled as a short-TTL overlay layer in the merge stack; different activation cadence; exit criterion is its own expiry.
- Dark launches. Publish a variant but don't activate; lets authors validate merge results against a store's context without customer exposure.
- Multi-brand. A single chain might own multiple brands (e.g., Inspire Brands = Arby's + Dunkin' + Buffalo Wild Wings). Add a brand_id to the filter and do the same partitioning one level up.
Final thesis, restated with earned evidence: The problem looks like a distribution problem but the customer-visible correctness guarantee is an in-store distributed-systems problem. Solving that elegantly — pre-staged blobs + 2-phase commit with bounded-stall clock-fallback, backed by cache-busting by version and a tiered offline policy — is what separates L7 from L6 on this question. A pager-carryable SRE can operate this because every component has a health signal, every failure mode has a finite blast radius, and every rollback is a pointer flip, not a purge.