Q13 Product & Edge Systems
Mall Entrance Occupancy Management System
Keep a mall below legal occupancy limits in real time across many gates without ever blocking exits.
0 Pre-flight Reasoning (not spoken in interview, internal framing) #
This problem looks like a toy counter problem. It is not. The earned-secret insight is that a mall occupancy counter is a life-safety system with an asymmetric CAP choice baked into regulation, not engineering taste:
- Entry path is CP — over-admission is a regulatory & liability incident (fire code violation).
- Exit path is AP — under-counting exits is merely annoying (the mall admits fewer people than it legally could), but blocking an exit violates fire code and exposes the operator to criminal liability. Software must never be able to prevent egress.
Everything else — Redis-vs-Spanner, sharded counters, lease models — is engineering that serves this asymmetry. If the candidate only designs a "distributed counter," they land at L5. L6 adds sharding and event sourcing. L7 recognizes that the two sides of the turnstile run on different CAP regimes wired to different hardware fail-safe modes, and that the software design cannot be the root of trust for exits because the fire marshal says so.
1 Problem Restatement & Clarifying Questions #
Restatement. Build a system that, for each shopping mall, (a) enforces a strict maximum occupancy by admitting or blocking entries at any of N gates in real time, (b) never prevents egress, (c) exposes a live count, (d) retains a complete audit trail of every entry/exit event with gate attribution, and (e) remains safe-by-default under partition or central-service failure.
Clarifying questions (ask these up front — ~3 min):
| # | Question | Why it matters | Assumed answer for this doc |
|---|---|---|---|
| Q1 | Mall capacity? | Drives counter throughput & fan-out | 10,000 pax/mall, configurable |
| Q2 | Gate count? | Drives fan-in, contention, sharding | 30 gates (20 entry, 10 entry+exit combined) |
| Q3 | Peak entry rate? | Sizes admission QPS | Black-Friday 1000 admissions/min = ~16/s steady, burst 100/s |
| Q4 | Fire code? | Hard constraint: exits must never block; mandates hardware interlock | Yes — NFPA 101 / local equivalent |
| Q5 | Sensor modality? | Determines double-count risk, latency floor | Turnstile primary + optical overhead counter for reconciliation; no biometrics |
| Q6 | Multi-mall tenancy? | SaaS vs per-deployment; blast-radius boundary | Yes — operator runs fleet of ~500 malls |
| Q7 | Definition of "admission"? | Turnstile clears vs person enters vs commit at POS? | Turnstile arm releases — legally the admission event |
| Q8 | Child/stroller policy? | Fractional count? | 1 person = 1 turnstile rotation, strollers un-counted (regulatory decision, not ours) |
| Q9 | Is exact count required, or bounded error ok? | CP vs AP, cost of Spanner vs Redis | Exact for entry gate (CP); bounded ±1% for dashboard (AP) |
| Q10 | Emergency override actor? | Auth model for fire marshal unlock | Dedicated hardware key-switch at security desk + software path — software is belt to hardware's suspenders |
| Q11 | Latency SLO for turnstile decision? | People won't wait >200ms before pushing | p99 < 100ms end-to-end, p99.9 < 250ms |
| Q12 | Audit retention? | Liability, GDPR-like privacy laws | 1 year hot, 7 years cold (insurance/litigation window) |
Non-questions (stated, not asked): we assume the mall network has some backhaul (fiber or LTE), but it can partition. We assume each gate has a local microcontroller (gate controller) with ~MB of RAM and persistent storage, and that no gate controller has >1s of un-persisted state. We assume time is approximately synchronized via NTP within ±50ms; we do not rely on wall-clock ordering for correctness.
2 Functional Requirements #
In-scope (numbered, testable):
- FR1 — Atomic entry admission. `request_entry(gate_id)` must return `ADMIT` only if the global occupancy strictly remains ≤ capacity after the increment. No two concurrent requests anywhere in the mall may both see the same "last slot" and both admit.
- FR2 — Unrestricted exit. `record_exit(gate_id)` must always succeed logically; it may never be rate-limited, blocked, or gated on central-service reachability. If central is unreachable, exit is recorded locally and replayed.
- FR3 — Live occupancy query. `get_occupancy(mall_id)` returns the current count with bounded staleness (≤ 5s under normal ops, ≤ 60s under partition).
- FR4 — Per-gate metrics & dashboard. Operations dashboard shows occupancy, per-gate admit/reject/exit rate, queue-depth estimate, sensor health.
- FR5 — Audit log. Every entry attempt (admit or reject) and every exit is persisted as an immutable event with `event_id, mall_id, gate_id, type, ts, admission_decision, admission_reason, sensor_confidence`.
- FR6 — Configurable capacity. Operator can set capacity with an `effective_from` timestamp. Capacity can be reduced at runtime (e.g. HVAC failure, police order) — the system must honor the lower bound immediately without crashing.
- FR7 — Emergency override. `emergency_override(mode=EVACUATE | LOCKDOWN | FIRE)` — EVACUATE flips all gates to exit-only mode and blocks entries; FIRE triggers hardware fail-open (all turnstiles release); LOCKDOWN is for security events and still honors fire-code exits.
- FR8 — Capacity change audit. Capacity changes themselves are audit events (who, when, why, previous value).
Out-of-scope (explicitly stated to bound the interview):
- Identity, demographics, facial recognition, loyalty program tie-in.
- POS revenue correlation, footfall → sales forecasting.
- Occupancy prediction / ML crowd forecasting (v3 feature, §9).
- Per-zone occupancy (food court vs parking). Treat mall as a single bucket.
- Ticketing, reservations, paid entry.
3 NFRs + Capacity Estimate #
3.1 NFRs
| NFR | Target | Justification |
|---|---|---|
| Availability (decision path) | 99.99% normal; 99.95% effective during WAN partition (gate-local fallback still admits up to leased budget, then safely rejects) | Four nines allows ~52 min/yr of downtime. Partition tolerance via leased slots keeps decisions local → effectively five nines for local decisions. |
| Correctness | Zero over-admission under normal ops. Under gate-local fallback: bounded over-admission ≤ sum(unreturned_lease_tokens) across partitioned gates, and ≤ 0.5% of capacity (engineering SLO). | Regulatory — fire code is a hard ceiling. The 0.5% slack exists because turnstile sensors themselves have error (tailgating, dual-pass); software cannot be more accurate than its sensor. |
| Latency (entry decision) | p50 < 20ms, p99 < 100ms, p99.9 < 250ms end-to-end gate controller → admit/reject. | UX — queue rage. Empirically people start pushing the turnstile at ~300ms. |
| Latency (exit) | p99 < 30ms locally (hardware release); async replication to central within 5s. | Exit is hardware-first; software logs. |
| Durability (audit log) | 11 nines (Kafka RF=3 + tiered storage to object store RF=3 cross-AZ). | Liability window 7 yrs. |
| Freshness (dashboard) | < 5s under normal; < 60s under partition. | Operator expectation. |
| Scale (fleet) | 500 malls × 30 gates = 15K gate controllers; headroom to 50K. | Multi-tenant. |
3.2 Back-of-the-envelope (all math shown)
Single-mall admission load.
- Capacity C = 10,000.
- Typical turnover: 3× full capacity per day ⇒ 30,000 admissions/day.
- Peak hour concentration: 25% of daily traffic in busiest hour ⇒ 7,500 admits/hr = 2.08 admits/s steady peak.
- Black-Friday / mall-opening burst: 10× peak-steady ⇒ ~20 admits/s sustained, peaks of 100/s for 30–60s when doors open.
- Per-gate: a 100/s mall burst spread across 20 entry gates would be ~5/s/gate, but one turnstile can physically cycle at most ~1 person per 1.5s ≈ 0.67/s, so the real bound is human throughput. The system must survive every gate cycling at its full physical rate simultaneously: 20 × 0.67/s ≈ 13.4 admits/s sustained, with brief bursts to ~20/s absorbed by queueing. ⇒ Design for 100 admits/s per mall (5× safety factor).
Fleet load.
- 500 malls × 20/s steady avg = 10,000 admits/s across fleet.
- Fleet-wide peak (all malls at once, e.g. Saturday 11am across one time zone): 500 × 100/s = 50,000 admits/s peak.
Event log sizing.
- Event row: `event_id` (16B) + `mall_id` (8B) + `gate_id` (4B) + `type` (1B) + `ts` (8B) + `decision` (1B) + `reason` (8B) + `sensor_confidence` (4B) + serialization overhead ≈ 120B wire, 200B on-disk with indexing & compression offset.
- Per mall per day: (30,000 admits + 30,000 exits + ~1,000 rejects) × 200B ≈ 12.2 MB/day/mall.
- Fleet: 500 × 12.2 MB = 6.1 GB/day; ~2.2 TB/yr hot.
- 7-yr cold tier: ~15 TB — trivial for object storage, ~$350/mo at S3 Glacier Deep Archive rates.
Counter store sizing.
- Hot state: ~1 KB/mall (counter, version, lease table, config) × 500 malls = 500 KB active state. Fits in any single Redis/Spanner node trivially; the scale is in the write QPS, not the size.
Read load.
`get_occupancy` from ops dashboard: polled every 5s per mall view × ~20 operator sessions/mall × 500 malls = 10K queries per 5s = 2K QPS. Trivial.
Key takeaway of BoE: write contention, not volume, is the constraint. The hot partition is the single counter per mall. 100 QPS on one key is fine for Redis (100K+ INCR/s on one shard), fine for Spanner single-row (thousands of commits/s with conflict-free sequencer), but pathological for a naive RDBMS SELECT ... FOR UPDATE pattern (contention latency degrades super-linearly).
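To make the contention argument concrete, here is a minimal sketch of the capped-increment pattern that the Redis-with-Lua option relies on. Key naming and argument layout are assumptions for illustration; the point is that the script executes atomically on the shard owning the mall's hash slot, so no two concurrent admits can both take the last slot.

```python
# Sketch only — key name "occ:{mall_id}" and the ARGV layout are illustrative assumptions.
import redis

CAPPED_ADMIT_LUA = """
local count    = tonumber(redis.call('GET', KEYS[1]) or '0')
local capacity = tonumber(ARGV[1])
local leased   = tonumber(ARGV[2])
if count + 1 <= capacity - leased then
  return redis.call('INCR', KEYS[1])   -- ADMIT: returns the new occupancy
end
return -1                              -- REJECT_AT_CAPACITY
"""

def request_admit(r: redis.Redis, mall_id: str, capacity: int, leased_out: int) -> int:
    """Returns the new occupancy on admit, or -1 if the mall is at effective capacity."""
    script = r.register_script(CAPPED_ADMIT_LUA)
    # Hash-tagged key keeps every counter op for one mall on a single cluster slot.
    return int(script(keys=[f"occ:{{{mall_id}}}"], args=[capacity, leased_out]))
```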
4 High-Level API #
All APIs: gRPC primary (low-latency, streaming for sensor telemetry), HTTP/JSON gateway for dashboard and admin.
4.1 Gate-plane (gate controller ↔ entry coordinator)
service EntryCoordinator {
// Called on every turnstile presentation. Must be fast and idempotent.
rpc RequestEntry(EntryRequest) returns (EntryDecision);
// Fire-and-forget stream from gate controller; at-least-once.
rpc RecordExit(stream ExitEvent) returns (ExitAck);
// Gate pulls more lease tokens when its local budget is low.
rpc AcquireLease(LeaseRequest) returns (LeaseGrant);
// Gate returns unused tokens on shift end / graceful shutdown.
rpc ReleaseLease(LeaseRelease) returns (LeaseAck);
// Heartbeat + reconciliation (gate reports local counters; coord reconciles).
rpc Reconcile(stream ReconcileBatch) returns (ReconcileAck);
}
message EntryRequest {
string event_id = 1; // UUIDv7 from gate. Idempotency key.
string mall_id = 2;
string gate_id = 3;
int64 gate_local_seq = 4; // Monotonic per-gate. Gap-detection on replay.
int64 ts_gate_ms = 5; // Gate-local NTP time.
float sensor_confidence = 6; // 0..1. <0.8 ⇒ flag for reconciliation.
}
message EntryDecision {
enum Code { ADMIT = 0; REJECT_AT_CAPACITY = 1; REJECT_EMERGENCY = 2;
ADMIT_LOCAL_LEASE = 3; // decided from gate's local lease, not central
ERROR_RETRY = 4; }
Code code = 1;
int64 occupancy_after = 2; // best-effort; may be stale under fallback
int64 capacity = 3;
string decision_source = 4; // "central" | "gate_lease" | "cached_fallback"
int64 decision_latency_us = 5;
}
4.2 Control-plane (operator / admin)
service MallOps {
rpc GetOccupancy(GetOccupancyRequest) returns (Occupancy);
rpc StreamOccupancy(GetOccupancyRequest) returns (stream Occupancy); // push
rpc SetCapacity(SetCapacityRequest) returns (CapacityChange);
rpc EmergencyOverride(OverrideRequest) returns (OverrideAck);
rpc GetGateHealth(MallRef) returns (stream GateHealth);
rpc QueryAuditLog(AuditQuery) returns (stream AuditEvent);
}
4.3 Semantics
- Idempotency. `event_id` is a UUIDv7 minted by the gate controller. The coordinator maintains a bounded-time dedup set (last 5 min) to absorb retries. Events older than the dedup window are accepted (the gate was partitioned for longer than 5 min) but marked `late_arrival=true` and reconciled; double-admission risk is bounded to the `gate_local_lease` size (see §7.2).
- Idempotency key is per-gate and per-sequence, not just a UUID. This lets us detect gaps on replay: a gate that reconnects with seq 103 when we last saw 97 must have 5 events in flight that we need to reconcile. (Both checks are sketched after this list.)
- No cross-mall calls on hot path. Each mall is an independent bulkhead.
- `ADMIT_LOCAL_LEASE` vs `ADMIT` — deliberately a distinct return code. Dashboards and audit surface the source of every decision. This makes the "is the system partitioned right now?" question answerable from the data plane, not just from ops monitoring.
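A minimal coordinator-side sketch of the two checks above — event_id dedup within the 5-minute window and per-gate sequence gap detection. Data structures and names are illustrative; a production version would keep the dedup set in Redis with a TTL rather than in process memory.

```python
# Illustrative sketch, not the production dedup store.
import time
from dataclasses import dataclass, field

DEDUP_WINDOW_S = 300  # 5-minute dedup window from the semantics above

@dataclass
class GateState:
    last_seq: int = 0
    seen: dict = field(default_factory=dict)  # event_id -> (decision, ts_recorded)

def check_entry(state: GateState, event_id: str, gate_seq: int, now: float | None = None):
    """Returns (cached_decision | None, missing_seq_range | None)."""
    now = now if now is not None else time.time()
    # 1. Dedup: a retried event_id inside the window returns the cached decision.
    if event_id in state.seen and now - state.seen[event_id][1] < DEDUP_WINDOW_S:
        return state.seen[event_id][0], None
    # 2. Gap detection: a jump in the per-gate sequence means events from a
    #    partition are still in flight and must be reconciled before trusting the count.
    gap = (state.last_seq + 1, gate_seq - 1) if gate_seq > state.last_seq + 1 else None
    state.last_seq = max(state.last_seq, gate_seq)
    return None, gap

def record_decision(state: GateState, event_id: str, decision: str) -> None:
    state.seen[event_id] = (decision, time.time())
```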
5 Data Schema #
5.1 Primary stores (authoritative)
occupancy_counter — one row per mall, strongly consistent. This is the hot row.
| col | type | notes |
|---|---|---|
| mall_id | string PK | |
| count | int64 | current occupancy |
| capacity | int64 | current capacity |
| version | int64 | monotonic; bumped on every write (optimistic CAS) |
| leased_out | int64 | sum of live gate leases (phantom admits in flight) |
| effective_capacity | int64 | capacity - leased_out — the value coordinator compares against for central admits |
| last_reconcile_ts | int64 | for staleness alerts |
| mode | enum | NORMAL, EVACUATE, LOCKDOWN, FIRE |
Engine choice (see §7.1 for full analysis): Spanner single-row with read-write transactions for the authoritative counter; Redis cluster with Lua script as a tier-1 cache for sub-10ms reads + optimistic-path writes, with async write-through to Spanner. Rejected engines documented in §7.1.
gate_lease — one row per live lease. Mall-sharded.
| col | type | notes |
|---|---|---|
| lease_id | uuid PK | |
| mall_id | string | shard key |
| gate_id | string | |
| granted_slots | int32 | e.g. 20 |
| consumed_slots | int32 | updated on reconcile |
| granted_at | ts | |
| expires_at | ts | lease TTL, e.g. 60s |
| status | enum | ACTIVE, RETURNED, EXPIRED, RECONCILED |
events — append-only audit log. Kafka primary (RF=3, min.insync.replicas=2, acks=all). Non-compacted, retention-based topic (compaction would drop history) keyed by mall_id||gate_id: per-gate ordering is preserved, and a busy mall is spread across partitions so its burst cannot head-of-line-block another mall's audit stream.
| col | type |
|---|---|
| event_id | uuid |
| mall_id | string |
| gate_id | string |
| type | ADMIT, REJECT, EXIT, LEASE_GRANT, LEASE_RETURN, OVERRIDE, CAPACITY_CHANGE |
| ts_gate_ms | int64 |
| ts_server_ms | int64 |
| occupancy_after | int64 |
| decision_source | string |
| sensor_confidence | float |
| operator_id | string (nullable; for overrides) |
| raw_payload | bytes |
Archive: Kafka → tiered to object store (Parquet files, daily partitions) via Kafka Connect. Cold tier = Glacier-class. Hot query via Druid or ClickHouse (§5.3).
capacity_policy — small OLTP table (Postgres or Spanner).
| col | type |
|---|---|
| mall_id | string |
| capacity | int64 |
| effective_from | ts |
| set_by_operator | string |
| reason | string |
| prev_capacity | int64 |
5.2 Rationale for the effective_capacity = capacity - leased_out column
Naive designs keep one counter count. Under leased slots, the central count is the last reconciled count, which lags. If the coordinator compares count < capacity to admit, it will over-admit by leased_out. Precomputing effective_capacity means central's own admit path (for gates that round-trip) respects the outstanding leases. The leases are phantom admits from central's perspective.
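A tiny worked example of that over-admission risk (numbers are illustrative): even when the invariant count + leased_out ≤ capacity holds, the naive count < capacity check double-spends the leased slots, while the effective_capacity check does not.

```python
# Illustrative numbers only: why the central admit check must subtract leased_out.
capacity, count, leased_out = 10_000, 9_400, 600    # invariant holds: 9_400 + 600 <= 10_000

naive_admits_left = capacity - count                         # 600 more central admits...
worst_case_naive  = count + naive_admits_left + leased_out   # ...plus 600 local admits = 10_600

effective_capacity = capacity - leased_out                   # 9_400
safe_admits_left   = max(0, effective_capacity - count)      # 0 central admits left
worst_case_safe    = count + safe_admits_left + leased_out   # exactly 10_000
```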
5.3 Operational dashboard
- ClickHouse or Druid fed from the Kafka audit stream.
- Rollups: 1s, 10s, 1min, 1hr buckets per gate × decision.
- Why not just Grafana over Prometheus? Because we also need ad-hoc queries ("who entered gate 7 between 14:32 and 14:47 on Apr 4" for a lost-child or theft case), and that's an OLAP-shape workload.
5.4 A note on CRDTs (and why not)
A PN-Counter CRDT gives partition-tolerant counting with eventual consistency. It is wrong for this problem because it permits unbounded over-admission during partition (each partition independently increments without bound), which violates the hard fire-code ceiling. CRDTs are AP; we need CP on the entry path. The leased-slot model is a CP-shaped answer that looks CRDT-ish (local increments, eventual merge) but with bounded divergence via the lease size.
6 System Diagram (Centerpiece) #
6.1 Full topology
PHYSICAL PLANE ┃ NETWORK PLANE ┃ SERVICE PLANE (per mall) ┃ FLEET CONTROL PLANE
┃ ┃ ┃
┌────────────────────┐ mag-lock ┌──────────────────┐ ┃ ┃ ┃
│ Turnstile #1..20 │◄────HW────► │ Fire alarm bus │───┐ ┃ ┃ ┃
│ (IR + HW counter) │ fail-open │ (NFPA hardwired) │ │ override ┃ ┃ ┃
└─────────┬──────────┘ └──────────────────┘ ▼ ┃ ┃ ┃
│ RS-485 / ┌─────────────┐ ┃ ┃ ┃
│ GPIO (<5ms) │ Hardware │ ┃ ┃ ┃
▼ │ Key-switch │ ┃ ┃ ┃
┌───────────────────────────┐ │ @ sec. desk │ ┃ ┃ ┃
│ Gate Controller (per │ └──────┬──────┘ ┃ ┃ ┃
│ turnstile, embedded) │─┐ │ ┃ ┃ ┃
│ - 4KB dedup ring │ │ │ ┃ ┃ ┃
│ - lease cache (N tokens) │ │ │ ┃ ┃ ┃
│ - local SQLite audit WAL │ │ │ ┃ ┃ ┃
│ - fallback policy FSM │ │ │ ┃ ┃ ┃
└─────────────┬─────────────┘ │ │ ┃ ┃ ┃
│ gRPC/mTLS │ mTLS │ ┃ ┃ ┃
│ <5ms LAN │ telemetry │ ┃ ┃ ┃
▼ ▼ │ ┃ ┃ ┃
┌──────────────────────────────────────────┐ │ ┃ ┃ ┃
│ Edge Gateway (per mall, HA pair) │ │ ┃ ┃ ┃
│ - local circuit-breaker to WAN │ │ ┃ ┃ ┃
│ - aggregation, mTLS termination │ │ ┃ ┃ ┃
│ - local Redis (read-through cache) │ │ ┃ ┃ ┃
│ - fallback arbiter (see §7.2) │ │ ┃ ┃ ┃
│ - local Kafka producer buffer (24h) │ │ ┃ ┃ ┃
└──────────┬───────────────────────────────┘ │ ┃ ┃ ┃
│ WAN (SD-WAN, dual-homed fiber+LTE) │ ┃ ┃ ┃
│ p50 20ms, p99 80ms, may partition │ ┃ ┃ ┃
▼ │ ┃ ┃ ┃
══════════════════════════════════════════════════════════════════════════════┃═══════════════════════════════════════════┃═════════════════════════════════════════
│ │ ┃ REGION: MALL's home region ┃ REGION: cross-region control
│ ┌────────────────────────────────────────────┘ ┃ ┃
│ │ override (dual-path: HW+SW) ┃ ┃
▼ ▼ ┃ ┃
┌────────────────────────────────────────────────┐ (A) CAS/Lua ┌────────────────────────────┐ ┃
│ Entry Coordinator Service (per mall, 3x HA) │◄─────────────►│ Primary Counter Store │ ┃
│ ┌──────────────────────────────────────────┐ │ <5ms │ Spanner row (SR) │ ┃
│ │ Admission FSM: │ │ │ - occupancy_counter │ ┃
│ │ 1. Normalize request │ │ (B) read │ - gate_lease │ ┃
│ │ 2. Dedup on event_id │ │ through Redis │ │ ┃
│ │ 3. Check override mode │ │ Cluster (L1) │ + Redis Cluster (L1) │ ┃
│ │ 4. Choose path: │ │ ~1ms │ - Lua CAS script │ ┃
│ │ (a) central CAS (default) │ │ │ - TTL'd lease snapshot │ ┃
│ │ (b) grant lease to gate │ │ │ │ ┃
│ │ (c) fallback: reject if no budget │ │ └──────────────┬─────────────┘ ┃
│ │ 5. Emit audit event │ │ │ async replicate to global audit ┃
│ │ 6. Return decision │ │ ▼ ┃
│ └──────────────────────────────────────────┘ │ ┌─────────────────────────┐ ┃
└──────────┬───────────────────────────────────┬─┘ │ Kafka (audit bus) │ ┃
│ │ │ topic: mall-events │ ┃
│ audit write │ exit path │ RF=3, acks=all │ ┃
│ (at-least-once) │ (AP) │ partition: mall_id │ ┃
▼ ▼ └──────┬────────┬─────────┘ ┃
┌────────────────────┐ ┌────────────────────┐ │ │ ┃
│ Kafka producer │ │ Exit Sink │ │ └──► S3/GCS Parquet archive ┃
│ (local buffer │ │ (async, writes │ │ (7yr cold tier) ┃
│ on edge gateway) │ │ count DECR + │ ▼ ┃
└────────────────────┘ │ audit event; │ ┌──────────────────────┐ ┃
│ NEVER blocks exit)│ │ ClickHouse / Druid │ ┃
└──────────┬─────────┘ │ (ops dashboard DB) │ ┃
│ └──────────┬───────────┘ ┃
│ │ ┃
└────► Spanner counter ◄────┘ ┃
(DECR is always ┃
accepted; may go ┃
transiently negative ┃
under race, see §8) ┃
┃
┃ ┌───────────────────────────────┐
┃ │ Fleet Control Plane │
┃ │ - Multi-mall config svc │
┃ │ - Capacity policy DB │
┃ │ - Global ops dashboard │
┃ │ - PagerDuty / alert routing │
┃ │ - Tenant mgmt / RBAC │
┃ │ - Firmware rollout controller │
┃ └───────────────────────────────┘
Every arrow carries (protocol, budget, QPS):
| Arrow | Protocol | Latency budget | QPS (peak) | Notes |
|---|---|---|---|---|
| Turnstile → Gate Controller | RS-485 / GPIO | <5ms | 0.67/s/gate | hardware |
| Gate Controller → Edge Gateway | gRPC mTLS over LAN | <10ms | 100/s/mall | |
| Edge Gateway → Entry Coordinator | gRPC mTLS over WAN | p99 80ms | 100/s/mall | may partition |
| Entry Coordinator ↔ Spanner | Spanner RPC | p99 15ms | 100/s/mall hot row | |
| Entry Coordinator ↔ Redis L1 | RESP | p99 3ms | 500/s/mall (reads amplified) | |
| Entry Coordinator → Kafka | Kafka protocol | p99 20ms (async ack) | 200/s/mall (entries + exits) | |
| Kafka → ClickHouse | Kafka Connect | seconds | 200/s/mall | |
| Fire Alarm → All Turnstiles | hardwired NFPA loop | <500ms physical | — | hardware path, bypasses software entirely |
| Hardware Key-switch → Turnstiles | hardwired | <500ms | — | security desk override |
6.2 Sub-diagram: Entry Decision Flow (happy path + fallback)
EntryRequest(event_id, gate_id)
│
▼
┌─────────────────────────────────────┐
│ Gate Controller receives │
│ 1. dedup event_id in ring buffer? │──yes──► return cached decision
│ 2. emergency mode flag set? │──FIRE─► OPEN (HW override)
│ (read from local GPIO + SW) │──EVAC─► REJECT (entry blocked)
└──────────────┬──────────────────────┘
│
local lease │
tokens > 0 ? │
┌─────────┴──────────┐
yes no
│ │
┌────────────▼───────────┐ ▼
│ LOCAL ADMIT: │ ┌───────────────────────────┐
│ - decrement local lease │ │ Call coordinator │
│ - write local WAL │ │ RequestEntry │
│ - open turnstile │ │ timeout: 80ms │
│ - async push audit │ └──────┬──────────┬──────────┘
│ - if lease < low_water, │ │ │
│ schedule async refill │ success timeout/error
└─────────────────────────┘ │ │
▼ ▼
┌───────────────┐ ┌──────────────────────┐
│ Apply │ │ Fallback arbiter: │
│ decision from │ │ - lease > 0? admit │
│ coordinator │ │ - else: pessimistic │
│ (authoritative)│ │ REJECT & alert │
└───────────────┘ └──────────────────────┘
6.3 Sub-diagram: Gate-Local Fallback State Machine
┌──── central healthy (last_heartbeat < 5s) ────┐
│ │
▼ │
┌──────────┐ heartbeat timeout 5s ┌────────────┐ │
┌──────▶│ NORMAL │────────────────────────▶ │ DEGRADED │──┘ recovery
│ └────┬─────┘ └─────┬──────┘ (force reconcile)
│ │ │
│ lease refill ok lease depleted & no WAN
│ │ │
│ ▼ ▼
│ ┌──────────┐ ┌────────────┐
│ │ LEASED │──lease exhausted──────► │ PESSIMISTIC│──┐
│ │ OPERATION│ │ REJECT │ │
│ └──────────┘ └────────────┘ │
│ │ │
│ fire override at any state │ │
│ ┌───────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ fire cleared + mgmt ack │
│ │ FIRE_OPEN │────────────────────────────────────► NORMAL
│ │ (HW-enforced│
│ │ exit-only) │
│ └─────────────┘
└─────── (return after reconcile from any state) ──────
6.4 Asymmetry of exit path (critical L7 call-out)
The diagram deliberately shows exit as a separate data path:
- Exit sensor → Gate Controller — no admission check.
- Gate Controller writes exit to local WAL, opens turnstile.
- Exit events stream async to Kafka, which decrements the counter eventually.
- The exit path has no synchronous call to the coordinator. Even if the WAN is fully partitioned, exit is local and always works.
- If the gate controller itself dies, the turnstile mag-lock fails open (hardware default) — exit is still physically possible.
This is the asymmetric CAP decision made physical. Entry = CP. Exit = AP.
7 Deep Dives #
7.1 Deep Dive 1 — Strongly-consistent counter without hot-key collapse
Why critical. The whole system centers on one row per mall. Naïvely, 100 writes/s on one key is easy; under the CP + audit requirement it becomes a contention battle.
Options considered (quantified):
| Option | Mechanism | Throughput per mall | Latency p99 | Correctness | Operational cost | Verdict |
|---|---|---|---|---|---|---|
| A. Postgres SELECT ... FOR UPDATE | Pessimistic row lock | ~500/s before lock contention dominates; degrades super-linearly | 50–500ms under load | Strong | Low | Rejected — lock-acquire latency spikes kill p99 at peak; no natural multi-region story. |
| B. Redis single-node INCR + Lua CAS | Single-threaded event loop serializes ops | 100K ops/s | p99 2ms | Strong (single shard) | Low | Strong candidate, but no multi-AZ strong consistency natively (failover is async; we'd lose up to replica_lag admissions on failover — unacceptable). |
| C. Redis Cluster with Lua CAS on single slot | Hash-slot isolates mall_id | Still single-shard per mall (~100K/s) | p99 3ms | Strong per-slot; weaker on failover | Medium | Same AZ-failover concern as B. Good as cache, not as source-of-truth. |
| D. Spanner single-row RW transaction | Paxos-replicated row | Single-row = 1000s commits/s; contention handled by serialization queue | p99 15ms | Strong, cross-region, externally-consistent | Higher $ | Chosen for authoritative state. |
| E. FoundationDB transactions | Same semantics as Spanner; cheaper, more ops | 10K+ txns/s on hot key | p99 10ms | Strong | Higher ops burden (self-managed) | Strong candidate if self-hosted; chose Spanner for managed ops story. |
| F. Sharded counter + periodic roll-up | N sub-counters per gate, each incremented locally, coordinator sums periodically | Unlimited throughput | Coordinator read p99 unchanged | Weakly consistent — coordinator's sum lags; over-admission bounded by shard divergence | Complex | Rejected for base counter — cannot guarantee ceiling. Adopted in a modified form as leased-slot (§7.2). |
| G. CRDT PN-Counter | Per-replica increment, merge on sync | Unlimited | Read is local | Unbounded divergence under partition | Medium | Rejected — AP, cannot cap. |
| H. Token-based admission | Pre-mint C tokens, gates consume | Naturally serializes at token mint; consumption is local | Local p99 < 5ms | Strong bound on max admissions | Complex, reconciliation-heavy | Chosen as optimistic fast path (§7.2). |
Chosen approach — two-tier:
- Authoritative layer: Spanner single-row per mall for `occupancy_counter`. Single-row RW transactions on a hot row in Spanner are actually a good match: Spanner's participant leader serializes conflicting commits on a single Paxos group, and the row is small enough that commit latency is dominated by WAN round-trip (10–15ms p99). At 100 QPS/mall we are nowhere near its ceiling (2–5K commits/s on a hot row).
- Fast-path cache layer: Redis Cluster with Lua. The coordinator's hot path reads `effective_capacity` from Redis (refreshed every 100ms from Spanner) and serves most requests from the lease model without hitting Spanner for every admit. Writes (central-path admits) go to Spanner authoritatively and then invalidate/update Redis.
- Dedup buffer: Redis set with TTL=5min for `event_id`s, to absorb retries.
Why not just Redis? The business can tolerate a Redis failover dropping unreplicated admits (they would be re-issued lease tokens), but cannot tolerate a silent gap where the canonical count is wrong after failover. Spanner gives us externally-consistent source of truth; Redis is a performance cache in front of it. If we run Redis-only, we would need to implement our own Raft/Paxos replication of the counter to avoid post-failover ambiguity — that's rebuilding Spanner badly.
Production systems named:
- Ticketmaster "queue-it" and TicketSocket: use a similar two-tier model — a global token mint (authoritative) with local queue servers. See the Andrea Leopardi / Livewire talks on Ticketmaster's high-contention ticket mint — they moved off Postgres FOR UPDATE to pre-allocated token pools precisely because lock-queue latency was unbounded.
- Disney Parks FastPass / Genie+: pre-allocates ride-slot tokens per day and distributes them to kiosks/app clients; kiosks admit from local budget; central reconciles.
- Stripe idempotency keys: TTL'd dedup set, same pattern as our event_id dedup.
- Google Spanner single-row hot-key benchmarks (Jeff Dean, CIDR 2017): single row at thousands of commits/s is supported with minor caveats (participant-leader batching).
Failure modes + mitigations (of the chosen counter design):
- Spanner regional outage → failover is automatic within configured placement. During the ~30s failover window, coordinator falls into "leased-slot only" mode — gates admit from their existing leases, no new leases granted, no new central admits. When Spanner recovers, a reconciliation pass folds in buffered admits.
- Redis cache poisoning / staleness → bounded by 100ms refresh window; coordinator always falls back to Spanner for the authoritative admit.
- Hot mall (one mall at 500 QPS, 10× design) → single-row commit queue on Spanner starts building. Mitigation: increase lease size so more admits go local; lease grants are themselves admits from Spanner's perspective, so a gate taking a 50-token lease is 1 commit for 50 admits — this is the contention-amortization trick.
7.2 Deep Dive 2 — Leased-slot admission (earned-secret core)
Why critical. This is the model that makes the system both partition-tolerant and safe. It is the single most important technique in the design.
The model.
Each gate holds a local lease of L admission tokens (e.g. L=20). When a person presents:
- If `local_tokens > 0`: decrement, admit, write local WAL. No network call.
- If `local_tokens == 0`: synchronously call the coordinator's `AcquireLease(L')`. The coordinator does a Spanner CAS: `leased_out += L'`, `effective_capacity -= L'`. Returns a lease grant.
- When the gate goes idle or on graceful shutdown: `ReleaseLease` returns unconsumed tokens to central.
When `local_tokens` drops below a low-water mark (e.g. 5 of 20), the gate asynchronously requests a refill so the synchronous path rarely blocks on the network.
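A compact sketch of the gate-controller side of this model. Class names, thresholds, and the coordinator stub are assumptions; the real controller also persists to its SQLite WAL before opening the arm and makes one synchronous AcquireLease attempt (80ms timeout) before dropping into pessimistic reject, per §6.2.

```python
# Gate-side leased-slot admission — illustrative sketch, not production firmware.
import threading

LEASE_SIZE = 20   # L tokens per lease (illustrative)
LOW_WATER  = 5    # async refill threshold (illustrative)

class GateLease:
    def __init__(self, coordinator):
        self.coordinator = coordinator   # stub exposing acquire_lease(n) -> tokens granted
        self.tokens = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        """Local admit: no network call while tokens remain; False == pessimistic REJECT."""
        with self.lock:
            admitted = self.tokens > 0
            if admitted:
                self.tokens -= 1
            need_refill = self.tokens < LOW_WATER
        if need_refill:
            # Refill off the hot path so the next presentation rarely blocks on the WAN.
            threading.Thread(target=self.refill, daemon=True).start()
        return admitted

    def refill(self) -> None:
        try:
            granted = self.coordinator.acquire_lease(LEASE_SIZE)  # central CAS: leased_out += granted
        except ConnectionError:
            return   # partitioned: keep serving from whatever tokens remain, then reject
        with self.lock:
            self.tokens += granted
```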
CAP positioning. This is a bounded-divergence CP design. At any instant, actual_occupancy ≤ central_count + Σ(leased_out per gate) ≤ capacity. The invariant is maintained by the central CAS at lease grant time, not at admit time. The lease is the commit.
Under partition (pessimistic):
- The gate's last-known lease is `L_current`. The gate admits up to `L_current`, then rejects. Worst-case over-admission during partition: zero (by definition — the leased tokens were already committed against capacity centrally).
- When the partition heals, the gate sends `Reconcile` with its consumed count. Central folds the lease back in: `count += consumed`, `leased_out -= granted_slots`; unconsumed slots return to the free pool and the gate re-leases as needed. Accounting is precise.
Under partition (optimistic) — rejected for this problem but worth analyzing:
- The gate continues admitting past its lease, up to a safety cap `L_current + Δ`.
- Over-admission is at most `Σ Δ` across gates, which is bounded. E.g. 30 gates × Δ=10 = 300 over-admits ≤ 3% of capacity.
- Acceptable for some venues (a concert festival, where queue rage outweighs the safety margin). Not acceptable here because fire code is a hard ceiling enforced by inspection and liability.
Low-water mark math.
Peak per-gate admit rate ~0.67/s. Network p99 to central = 80ms; coordinator lease-grant p99 = 100ms total. Expected admits during one refill round-trip = 0.67 × 0.1 ≈ 0.07, so the synchronous path essentially never blocks. Set low-water = ⌈p99_admits_during_refill × safety_factor⌉, a small constant in the 3–5 token range. Configure lease L=20 so refills happen roughly once per 30s per gate.
Why not just "gate_budget = capacity / gates"? Because traffic is not uniform. At the mall's main entrance gate, 90% of admissions happen; at a side gate, 5%. If we statically partitioned capacity, the main gate would reject prematurely while side gates sat on unused budget. The lease model is a dynamic reallocation — a form of work-stealing counter — where hot gates refill more often and cold gates let their leases expire.
Lease expiry handling. A gate's lease has a TTL (e.g. 60s). If the gate dies without returning the lease, central reclaims it via:
- Heartbeats. Gate sends heartbeat with remaining lease every 5s.
- On heartbeat miss × 3 (15s), the coordinator marks the lease SUSPECT.
- At the 60s TTL, the coordinator reclaims: `leased_out -= granted_slots`, `count += granted_slots × fudge_factor`, with `fudge_factor = 1`: the safe assumption that ALL tokens were consumed. The system errs on the side of treating an unreachable gate as having admitted its full lease, so admissions that actually happened can never go missing from the count if the gate comes back online after reclamation. The cost is a transient overstatement of occupancy, which biases toward rejecting more entries, which is the safe direction.
- When the gate reconnects, it reports its actual consumption in `Reconcile`; central applies a compensating adjustment.
This is the L7 move: the asymmetry of assumption under uncertainty is deliberately biased toward the safety-critical direction. Over-rejection is recoverable; over-admission is not.
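A sketch of the reclamation and compensation rules, showing the safety-biased assumption in code. Field names mirror the gate_lease and occupancy_counter tables; the logic itself is illustrative.

```python
# Illustrative coordinator-side accounting for expired and reconciled leases.
from dataclasses import dataclass

@dataclass
class MallCounter:
    count: int        # reconciled occupancy in the authoritative row
    leased_out: int   # sum of live lease grants

def reclaim_expired_lease(c: MallCounter, granted_slots: int) -> None:
    """Gate unreachable past the TTL: assume every granted token was consumed.
    Biases toward overstating occupancy, i.e. toward rejecting entries (safe direction)."""
    c.leased_out -= granted_slots
    c.count      += granted_slots          # fudge_factor = 1: treat the full lease as admits

def reconcile_gate(c: MallCounter, granted_slots: int, consumed: int,
                   already_reclaimed: bool) -> None:
    """Gate reports its true consumption; apply a compensating adjustment."""
    if already_reclaimed:
        c.count -= (granted_slots - consumed)   # hand back the slots nobody actually used
    else:
        c.leased_out -= granted_slots
        c.count      += consumed
```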
Production analogs:
- AWS IAM / STS temporary credentials: pre-issued tokens with TTL, verified locally.
- Kubernetes `kube-scheduler` with resource reservations: nodes hold a reservation budget for pods; the scheduler reconciles.
- Airlines' seat-map reservation locks: a travel agent holds N seat locks for 15 min; returns unused.
- CockroachDB's "SQL sequence caching": each SQL node caches a block of sequence values to avoid a round-trip per INSERT. Same idea, same edge case (cached values on a dead node are lost = gaps in sequence; acceptable for sequences, not for our counter — so we reclaim).
- Riot Games / Fortnite queue-token model: login tokens pre-minted in pools, consumed by region servers.
7.3 Deep Dive 3 — Asymmetric exit path (fire safety as architecture)
Why critical. A naive system wraps both entry and exit in the same "update counter" API. This is not just inefficient — it is illegal. Most building codes (NFPA 101 in US; equivalent EN, ISO) require that egress cannot be prevented by mechanical or electronic means except under active fire-marshal control. If software can block an exit, the building is not code-compliant.
The design.
- Exit does not go through the admission FSM at all. Separate gate-controller code path.
- Exit is fire-and-forget with durable local WAL. The gate controller writes the exit event to local SQLite (fsync) before releasing the turnstile. Even if it reboots mid-cycle, the event is preserved.
- Exit batches are streamed async to Kafka via the edge gateway. Kafka's consumer group (the "Exit Sink") DECRs the Spanner counter. Since exits can never cause over-admission (they only reduce count), eventual replication is safe.
- Hardware-level fail-open. The turnstile mag-lock is wired so that loss of power or loss of signal from the gate controller releases the lock. Software cannot hold the lock closed.
- Fire alarm integration is a hardwired NFPA loop, not an API call. When the fire panel asserts alarm, all mag-locks drop simultaneously via dedicated copper. Our software also enters `FIRE` mode via a separate telemetry channel (so we record the event), but the software path is the belt; the hardware loop is the suspenders.
Race: exit before entry admit replicates. Scenario: person enters gate 1, coordinator admits, Kafka event in flight. Same person exits gate 2 before the entry event reaches Spanner. The counter would transiently go negative.
Mitigation: count is stored as int64 (room for negative), and the dashboard clamps display to max(0, count). Over a reconciliation window (5s), the entry event catches up and the counter recovers. The invariant we enforce is "never above capacity," not "never below zero." This is an explicit L7 decision: the ceiling is regulatory; the floor is cosmetic.
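A minimal sketch of the exit sink's apply step under this rule (Kafka consumer wiring elided; the point is that the decrement is unconditional and only the display is clamped):

```python
# Illustrative: the exit sink never rejects a decrement; only the dashboard clamps.
def apply_exit(counters: dict[str, int], mall_id: str) -> None:
    # May go transiently negative while the matching entry event is still in flight.
    counters[mall_id] = counters.get(mall_id, 0) - 1

def display_occupancy(counters: dict[str, int], mall_id: str) -> int:
    # The enforced invariant is only the ceiling; the floor is cosmetic.
    return max(0, counters.get(mall_id, 0))
```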
Race: double-exit. The sensor fires twice on a single person (tailgating variant in reverse). The gate controller dedups within a 500ms window at the sensor level (hardware debounce + software). Cross-gate dedup is unnecessary (a person is physically at one gate at a time), so this is a purely local concern.
Tailgating (the real correctness ceiling). Turnstiles allow tailgating at ~0.1–0.3% depending on design (tailgater slips through behind another person, sensor only counts one rotation). This means the sensor itself has error that software cannot fix. We address this by:
- Overhead optical counter (secondary sensor) at each gate, independently counted.
- Coordinator reconciles turnstile count vs optical count hourly; discrepancy > 0.5% triggers ops alert and a sensor calibration task.
- Reporting: "sensor-confident admissions" and "total foot traffic" are separate metrics. Fire-code compliance is measured against the higher of the two (worst case), so we effectively run below nominal capacity relative to the turnstile count alone.
Why not ML tailgating detection on camera? v3 feature. Introduces privacy surface (see §10) and drifts accuracy under crowd load. For v1/v2 we rely on dual-sensor mechanical reconciliation and ops-driven calibration.
Production analogs:
- Airport boarding: boarding pass scan is the "admit"; exit (deplaning) is physically unstoppable.
- Amusement park ride queues (Disney LightningLane): entry is ticketed; exit is free.
- Every data-center badge reader: entry requires auth; emergency egress ("panic bar") is hardware-guaranteed.
8 Failure Modes & Resilience #
| Component | Failure | Detection | Blast Radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Turnstile sensor | Miscount (tailgating, double-count) | Per-hour reconciliation against optical counter | Single gate, <0.5% count drift | Alert ops, schedule calibration, bias safety margin by 1% | Manual recount + adjustment via CapacityChange audit event |
| Gate controller | Hang / reboot | Heartbeat miss (5s) | Single gate; in-flight lease held until 60s TTL reclaim | Pair gate with adjacent gate for backup admit; mag-lock fails open for exit | Controller boots, reloads WAL, re-plays unsynced events |
| Gate controller | Split-brain (network partition from central but not from sensor) | Coordinator heartbeat timeout | Single gate; lease exhaustion within ~30s | Pessimistic reject when lease empty; human fallback (guard lets people through manually) | Reconcile on reconnect folds consumed lease |
| Edge gateway | Outage | Coordinator heartbeat | All gates in mall | HA active-active pair (anycast VIP); all gates reconnect in <5s | Recovery via HA partner; log buffer on gates survives ≤24h |
| WAN partition (mall ↔ central) | Full isolation | Coordinator probe | All mall's central-path admits | Gates operate on leases until exhausted; then pessimistic reject; exits unaffected | On reconnect, bulk reconcile; over-admission count (bounded by sum of unused lease) folded into count |
| Spanner regional outage | Region unavailable | Spanner client errors | All malls in region | Cross-region failover (automatic with multi-region instance); during failover, coordinator drops into "leased-slot only" — no new leases but existing ones honored | Automatic; reconcile on recovery |
| Redis cluster outage | Cache miss storm | Cache error rate | +10ms latency per admit (fall back to Spanner direct) | Spanner directly serves; degraded latency but correct | Rebuild cache async from Spanner |
| Kafka outage | Audit writes blocked | Producer metric | Audit lag | Local producer buffer on edge gateway (24h, spill to SQLite); entries still admitted (audit is not on critical path) | Drain buffer on recovery |
| ClickHouse/Druid outage | Dashboard stale | Read errors | Dashboard only | Fallback to direct Spanner count read; gate-level metrics degraded until recovery | Replay Kafka stream |
| Sensor double-fire (e.g. mechanical bounce) | Phantom admissions | Local sensor dedup (500ms debounce) + optical reconciliation | <0.1% error | Dedup in firmware; tolerate via bounded-error SLO | Calibration; Kafka has raw events for audit |
| Capacity misconfiguration (set to 100 instead of 10000) | Over-reject & angry customers | Large rejection spike alert (>5x baseline for >5min triggers PagerDuty) | Mall entries stalled | Four-eyes policy on SetCapacity for reductions >20%; operator override required | Operator sets correct value; audit records both |
| Audit log backlog (Kafka lag) | Dashboard lag | Lag metric | Delayed insights, not decisions | Scale Kafka consumer group horizontally | Natural drain |
| Fire alarm triggered | Hardware loop opens all mag-locks | Fire panel signal | Entire mall goes to evacuation mode | All entries blocked (FIRE mode in software); exits unrestricted (hardware + software); audit records the event | Manual reset after fire marshal clears; operator re-arms gates |
| Fire alarm false positive | Gates fail-open incorrectly | Reconciliation: sudden drop to "all exit" state without corresponding occupant count change | Mall closes until reset | Physical lockout: hardware key-switch at sec desk required to re-arm (code-compliance) | Security desk walkthrough + re-arm |
| Malicious operator (capacity set to 100K to fit extra customers) | Over-admission by management | Audit-log anomaly detection (capacity_change outside pre-approved bounds) | Potentially severe (liability) | Hard policy cap in control plane; multi-party approval for >20% capacity increase; fleet-level alert | Retroactive rollback; compliance team review |
| Gate-local WAL corruption | Lost audit events | SQLite checksum + startup integrity check | Single gate's audit; counter untouched (counter is in Spanner) | Regular WAL backup to edge gateway every 30s | Replay from edge gateway buffer |
| Clock skew > 50ms between gate & central | Event ordering confusion | NTP monitoring | Audit log ordering may be wrong | We order by ts_server_ms in audit; ts_gate_ms is informational only; dedup by event_id, not ts | NTP re-sync; flag events as clock_skew_suspected |
| Leased tokens outstanding when capacity reduced | Capacity change to lower value while leases held may cause ≤ leased_out over-admission | Reconcile on capacity change | Bounded to Σ live leases (~30 × 20 = 600 → 6% of capacity worst case) | On capacity reduction, coordinator revokes all outstanding leases (async message to gates); gates complete in-flight admit then stop admitting until refill | Gates re-lease at new effective_capacity |
9 Evolution Path #
v1 — "Make it correct, for one mall" (0–3 months)
- Single mall, single region.
- Centralized Redis with Lua CAS on the occupancy counter; Postgres for capacity_policy and audit.
- Gates are synchronous — every admit is a blocking coordinator call.
- No leases, no local fallback — if central is down, entries stop.
- Exits are async batched to Redis (still asymmetric from day 1 — fire code non-negotiable).
- Ops dashboard: Grafana over Postgres audit.
- Good-enough SLO: p99 100ms local mall network; central outage = entries stop (acceptable for single-tenant pilot).
v2 — "Make it partition-tolerant and fleet-ready" (3–9 months)
- Add leased-slot model (§7.2) — gates hold local budget, operate through short partitions.
- Migrate authoritative counter from Redis → Spanner (or FoundationDB); Redis becomes L1 cache.
- Add Kafka audit bus — move audit off the critical path; introduce ClickHouse for OLAP.
- Multi-mall (200–500): control plane for fleet, per-tenant isolation by region.
- Reconciliation job runs on lease expiry + on partition heal.
- SLO upgrade: 99.99% entry decision availability; exit is effectively 100% (hardware-backed).
v3 — "Make it smart, safe, and compliant at scale" (9–18 months)
- ML-assisted tailgating detection on overhead camera (privacy-bounded: counts only, no identity).
- Emergency system integration: two-way with building management (BMS/BACnet) — fire panel, HVAC, elevator recall.
- Predictive capacity — forecast crowd peaks to preload leases onto hot gates.
- Compliance automation: auto-generated monthly occupancy reports for landlord / insurer; SOC 2 + local fire-code audit evidence exports.
- Dynamic capacity based on HVAC (reduce capacity when CO2 sensors trigger), parking (reduce when full), public-safety events.
- Per-zone occupancy (food court, cinema, department stores) — sub-counter hierarchy with per-zone leases.
- Multi-region active-active per mall (true two-region dual-write with CRDT-bounded conflict resolution using the same bounded-lease technique — v3-level trick that only applies when regional WAN is itself flaky).
10 Out-of-1-Hour Notes (solo-study deep content) #
10.1 Hardware-level interlocks and why software cannot be the root of trust
Building fire codes (NFPA 101, IBC, EN 13637 for electric locks in escape routes) require that exits fail safe on:
- Loss of power
- Loss of signal from control panel
- Manual emergency release ("panic bar") at the door itself
- Fire alarm
No amount of clever software can satisfy the code — the inspector checks the physical wiring. Our software architecture must be designed as a cooperating peer to the hardware interlock, not as its manager. This frames many decisions (e.g., FIRE mode in software is for auditing and alerting, not for enforcement; enforcement is in copper).
10.2 Privacy of entry/exit counts
Even anonymous counting has surveillance implications:
- Aggregate hourly counts at an abortion clinic, a church, a synagogue, a political rally → identifies individual visit patterns when correlated with other data.
- Mitigations:
- Differential privacy noise on public-facing counts (dashboards shown to tenants or published externally).
- Coarse bucketing of audit logs (day-level for external, per-event only for operator-accessible and with retention limits).
- No biometrics, no face matching in counting pipeline.
- Minimize retention: raw per-event audit in hot tier 30 days, aggregated in cold tier; long-tail liability windows use aggregates unless incident triggers hot-tier retention.
- This is relevant to the candidate's current role (Meta Privacy Infra): the same techniques (anonymization, DP, k-anonymity on aggregate release) apply.
10.3 Legal record retention and e-discovery
- In US, statute of limitations for slip-and-fall personal injury in malls = 2–4 years depending on state; for structural/fire events = 7 years (varies). 7-year retention is the industry default.
- Audit log must be immutable for evidentiary value (append-only + hash-chained or WORM storage). Kafka with compaction disabled + object-storage retention lock satisfies this.
- Design must support legal hold — ability to tag specific events as "do not delete past retention" for ongoing litigation.
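A minimal sketch of the hash-chaining idea for evidentiary immutability (format and field choice are assumptions; WORM/object-lock storage is the alternative named above):

```python
# Illustrative append-only hash chain over audit events.
import hashlib
import json

def chain_hash(prev_hash: str, event: dict) -> str:
    """Each audit record commits to its predecessor, so any later tampering
    breaks every subsequent hash and is detectable on replay."""
    payload = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

# Usage: genesis = "0" * 64; h1 = chain_hash(genesis, event1); h2 = chain_hash(h1, event2); ...
```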
10.4 Integration with building management systems (BMS / BACnet / Modbus)
- BMS protocols are not modern REST — they are BACnet or Modbus over serial/IP with weak auth.
- Our integration is one-way pull for safety signals (HVAC, CO2, temperature) — never trust write paths from BMS to be authenticated, so the occupancy system reads signals and makes its own decisions.
- Fire panel integration is specifically via hardwired dry-contact relays, not protocol-level — this isolates software vulnerabilities from life-safety actuation.
10.5 Sensor calibration and tailgating measurement
- Quarterly calibration: walk-through test with known count, compare against turnstile + optical counter. Discrepancy logged; if >0.5%, sensor flagged for service.
- Tailgating rate is a slowly-varying mall characteristic — depends on turnstile geometry, staff attentiveness, crowd density. Some malls see 0.05%, some see 0.4%. Per-gate learned baseline; flag drift.
10.6 Probabilistic counting vs exact counting — two parallel pipelines
- Exact (CP pipeline) for capacity enforcement, as designed above.
- Probabilistic (AP pipeline) for marketing & foot-traffic analytics — WiFi-probe counting, Bluetooth beacons, camera-based crowd estimation. Much cheaper, less precise, explicitly non-regulatory.
- These pipelines must be kept organizationally separate to prevent the marketing team's approximate counter from being used for capacity decisions. (L7: data lineage matters for compliance.)
10.7 Observability as first-class SLO surface
Operator-facing SLOs (internal, not customer-facing):
| SLO | Target | Measurement |
|---|---|---|
| Entry decision latency | p99 < 100ms | Coordinator RPC histograms |
| Over-admission events | 0 under normal; bounded & reported under partition | Synthetic reconciliation: actual_occupancy_from_sensor_count - max_count_observed > 0 → alert |
| Exit reliability | 100% availability (hardware-backed) | Count of "exit rejected" events: must be 0 |
| Audit log completeness | 100% of admits + exits logged within 60s | count(admits at coord) - count(admit events in warehouse) = 0 at end-of-day |
| Lease reconciliation accuracy | Σ(consumed + returned + reclaimed tokens) = Σ(granted lease tokens) | Daily reconciler cron; alert on drift |
| Mean time to detect partition | < 10s | Heartbeat-based |
| Fire-drill restore time | < 2 min | Periodic fire drill exercise |
Over-admission counter as an SLO: this is the anomaly. Most system SLOs measure performance; an over-admission SLO measures correctness against a regulatory ceiling. It is measured by independent sensor reconciliation, not by the system's own self-report. This defends against the failure mode where the bug is in the code that counts.
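A sketch of how the over-admission SLO can be computed from the independent sensor stream rather than the system's self-reported count (interval granularity and field names are assumptions):

```python
# Illustrative: replay per-interval deltas from the *optical* sensors, not turnstile decisions.
def over_admission_intervals(sensor_entries: list[int], sensor_exits: list[int],
                             capacity: int) -> int:
    """Counts intervals in which the independently reconstructed occupancy exceeded capacity."""
    occupancy = 0
    violations = 0
    for entered, exited in zip(sensor_entries, sensor_exits):
        occupancy += entered - exited
        if occupancy > capacity:
            violations += 1
    return violations
```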
10.8 Why "just use Spanner" is tempting but incomplete
Spanner solves the authoritative-counter problem at the write layer. It does not solve:
- Local decision latency during partition (Spanner doesn't exist in the mall).
- Exit-never-blocked (that's a hardware decision).
- Bounded divergence on partition (that's the lease model).
- Audit lineage & data retention (that's Kafka + tiering).
- Sensor error (that's calibration + dual-sensor).
An L6 answer says "Spanner." An L7 answer names Spanner as the hot row but lays the lease model on top because the problem is really about a distributed semaphore with hardware fail-safes and regulated correctness, not a distributed counter.
10.9 Analog to distributed locking services
The lease model is isomorphic to how Chubby / Zookeeper / etcd hand out leases for leader election, but inverted: instead of leasing a single resource (the leader slot) to one client, we lease N abstract "admission tokens" to many clients. Each consumption is unilateral (no central round-trip). The lease is the contract. If the lessee (the gate) dies, the contract expires and the slots revert. This is one of the fundamental distributed patterns, and recognizing that our counter is a semaphore with lease-based distribution (not a shared variable) is the conceptual unlock.
Compare to Kubernetes: kubelet holds a resource reservation from kube-scheduler; pods scheduled locally are deducted from the local reservation. If the node partitions, the scheduler reclaims the reservation (pod eviction). Exactly our model with different nouns.
10.10 What I'd sketch on the whiteboard if given 5 more minutes
A reconciliation invariant:
For all time t, for each mall:
  actual_count(t) == spanner_count(t) + Σ over gates g of consumed_tokens(g) - Σ(exits still in Kafka lag)
When partitions heal:
  spanner_count += Σ(consumed_tokens from reconciled gates)
  consumed_tokens(g) -= reconciled amount
  effective_capacity is adjusted as leases are re-granted
Invariant throughout: spanner_count + leased_out ≤ capacity
This invariant is the formal proof that over-admission cannot occur under our assumptions. The assumptions that can break it:
- Byzantine gate reporting (malicious firmware): mitigated by signed firmware + dual-sensor reconciliation.
- Sensor miscounting (tailgating): bounded by SLO, accepted.
- Clock skew in TTL reclaim: mitigated by making reclamation strictly additive and using hybrid logical clocks for ordering if we ever need total-order reasoning (we don't for capacity).
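A runnable restatement of the invariant and the occupancy reconstruction as checker functions (names are illustrative; the daily reconciler in §10.7 would evaluate the same predicates against production state):

```python
# Illustrative checkers for the reconciliation invariant above.
def capacity_invariant_holds(spanner_count: int, leased_out: int, capacity: int) -> bool:
    """The ceiling that must hold at every instant: committed admits plus
    outstanding lease grants never exceed capacity."""
    return spanner_count + leased_out <= capacity

def reconstructed_occupancy(spanner_count: int, consumed_per_gate: dict[str, int],
                            exits_in_kafka_lag: int) -> int:
    """Best-effort actual occupancy: authoritative count, plus lease consumption not
    yet reconciled, minus exit events still in flight on the audit bus."""
    return spanner_count + sum(consumed_per_gate.values()) - exits_in_kafka_lag
```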
Self-Verification (§pre-submit checklist) #
- SRE pager-carryable? ✅ — Ops can diagnose via (a) per-gate decision_source metric, (b) fallback state machine diagram in runbook, (c) reconciliation invariant as a check. Primary pagers: entry_decision_p99, over_admission_counter, exit_reject_rate (any >0 is page), Kafka_audit_lag, spanner_commit_latency.
- Every diagram arrow → real API/data flow? ✅ — table in §6.1 maps each arrow to the API in §4 or the data plane in §5.
- Deep-dive L7 or L6? ✅ — §7.2 leased-slot includes the bias-toward-safety reclamation rule (the L7 move); §7.3 maps asymmetric CAP onto regulated hardware. §7.1 rejects Spanner-only, Redis-only, and FoundationDB with quantified trade-offs.
- Exits-never-blocked as asymmetric code path? ✅ — §6.1 shows exit bypassing admission FSM; §6.4 explicitly calls out the asymmetry; §7.3 is a whole deep dive on it.
End of solution.