Q13 Product & Edge Systems
Mall Entrance Occupancy Management System
Keep a mall below legal occupancy limits in real time across many gates without ever blocking exits.
0 Pre-flight Reasoning (not spoken in interview, internal framing) #
This problem looks like a toy counter problem. It is not. The earned-secret insight is that a mall occupancy counter is a life-safety system with an asymmetric CAP choice baked into regulation, not engineering taste:
- Entry path is CP — over-admission is a regulatory & liability incident (fire code violation).
- Exit path is AP — under-counting exits is merely annoying (the mall admits fewer people than it legally could), but blocking an exit violates fire code and exposes the operator to criminal liability. Software must never be able to prevent egress.
Everything else — Redis-vs-Spanner, sharded counters, lease models — is engineering that serves this asymmetry. If the candidate only designs a "distributed counter," they land at L5. L6 adds sharding and event sourcing. L7 recognizes that the two sides of the turnstile run on different CAP regimes wired to different hardware fail-safe modes, and that the software design cannot be the root of trust for exits because the fire marshal says so.
1 Problem Restatement & Clarifying Questions #
Restatement. Build a system that, for each shopping mall, (a) enforces a strict maximum occupancy by admitting or blocking entries at any of N gates in real time, (b) never prevents egress, (c) exposes a live count, (d) retains a complete audit trail of every entry/exit event with gate attribution, and (e) remains safe-by-default under partition or central-service failure.
Clarifying questions (ask these up front — ~3 min):
| # | Question | Why it matters | Assumed answer for this doc |
|---|---|---|---|
| Q1 | Mall capacity? | Drives counter throughput & fan-out | 10,000 pax/mall, configurable |
| Q2 | Gate count? | Drives fan-in, contention, sharding | 30 gates (20 entry, 10 entry+exit combined) |
| Q3 | Peak entry rate? | Sizes admission QPS | Black-Friday 1000 admissions/min = ~16/s steady, burst 100/s |
| Q4 | Fire code? | Hard constraint: exits must never block; mandates hardware interlock | Yes — NFPA 101 / local equivalent |
| Q5 | Sensor modality? | Determines double-count risk, latency floor | Turnstile primary + optical overhead counter for reconciliation; no biometrics |
| Q6 | Multi-mall tenancy? | SaaS vs per-deployment; blast-radius boundary | Yes — operator runs fleet of ~500 malls |
| Q7 | Definition of "admission"? | Turnstile clears vs person enters vs commit at POS? | Turnstile arm releases — legally the admission event |
| Q8 | Child/stroller policy? | Fractional count? | 1 person = 1 turnstile rotation, strollers un-counted (regulatory decision, not ours) |
| Q9 | Is exact count required, or bounded error ok? | CP vs AP, cost of Spanner vs Redis | Exact for entry gate (CP); bounded ±1% for dashboard (AP) |
| Q10 | Emergency override actor? | Auth model for fire marshal unlock | Dedicated hardware key-switch at security desk + software path — software is belt to hardware's suspenders |
| Q11 | Latency SLO for turnstile decision? | People won't wait >200ms before pushing | p99 < 100ms end-to-end, p99.9 < 250ms |
| Q12 | Audit retention? | Liability, GDPR-like privacy laws | 1 year hot, 7 years cold (insurance/litigation window) |
Non-questions (stated, not asked): we assume the mall network has some backhaul (fiber or LTE), but it can partition. We assume each gate has a local microcontroller (gate controller) with ~MB of RAM and persistent storage, and that no gate controller has >1s of un-persisted state. We assume time is approximately synchronized via NTP within ±50ms; we do not rely on wall-clock ordering for correctness.
2 Functional Requirements #
In-scope (numbered, testable):
- FR1 — Atomic entry admission. `request_entry(gate_id)` must return `ADMIT` only if the global occupancy strictly remains ≤ capacity after the increment. No two concurrent requests anywhere in the mall may both see the same "last slot" and both admit.
- FR2 — Unrestricted exit. `record_exit(gate_id)` must always succeed logically; it may never be rate-limited, blocked, or gated on central-service reachability. If central is unreachable, exit is recorded locally and replayed.
- FR3 — Live occupancy query. `get_occupancy(mall_id)` returns the current count with bounded staleness (≤ 5s under normal ops, ≤ 60s under partition).
- FR4 — Per-gate metrics & dashboard. Operations dashboard shows occupancy, per-gate admit/reject/exit rate, queue-depth estimate, sensor health.
- FR5 — Audit log. Every entry attempt (admit or reject) and every exit is persisted as an immutable event with `event_id, mall_id, gate_id, type, ts, admission_decision, admission_reason, sensor_confidence`.
- FR6 — Configurable capacity. Operator can set capacity with an `effective_from` timestamp. Capacity can be reduced at runtime (e.g. HVAC failure, police order) — the system must honor the lower bound immediately without crashing.
- FR7 — Emergency override. `emergency_override(mode=EVACUATE | LOCKDOWN | FIRE)` — EVACUATE flips all gates to exit-only mode and blocks entries; FIRE triggers hardware fail-open (all turnstiles release); LOCKDOWN is for security events and still honors fire-code exits.
- FR8 — Capacity change audit. Capacity changes themselves are audit events (who, when, why, previous value).
Out-of-scope (explicitly stated to bound the interview):
- Identity, demographics, facial recognition, loyalty program tie-in.
- POS revenue correlation, footfall → sales forecasting.
- Occupancy prediction / ML crowd forecasting (v3 feature, §9).
- Per-zone occupancy (food court vs parking). Treat mall as a single bucket.
- Ticketing, reservations, paid entry.
3 NFRs + Capacity Estimate #
3.1 NFRs
| NFR | Target | Justification |
|---|---|---|
| Availability (decision path) | 99.99% normal; 99.95% effective during WAN partition (gate-local fallback still admits up to leased budget, then safely rejects) | Four nines allows ~52 min/yr of downtime. Partition tolerance via leased slots keeps decisions local → effectively five nines for local decisions. |
| Correctness | Zero over-admission under normal ops. Under gate-local fallback: bounded over-admission ≤ sum(unreturned_lease_tokens) across partitioned gates, and ≤ 0.5% of capacity (engineering SLO). | Regulatory — fire code is a hard ceiling. The 0.5% slack exists because turnstile sensors themselves have error (tailgating, dual-pass); software cannot be more accurate than its sensor. |
| Latency (entry decision) | p50 < 20ms, p99 < 100ms, p99.9 < 250ms end-to-end gate controller → admit/reject. | UX — queue rage. Empirically people start pushing the turnstile at ~300ms. |
| Latency (exit) | p99 < 30ms locally (hardware release); async replication to central within 5s. | Exit is hardware-first; software logs. |
| Durability (audit log) | 11 nines (Kafka RF=3 + tiered storage to object store RF=3 cross-AZ). | Liability window 7 yrs. |
| Freshness (dashboard) | < 5s under normal; < 60s under partition. | Operator expectation. |
| Scale (fleet) | 500 malls × 30 gates = 15K gate controllers; headroom to 50K. | Multi-tenant. |
3.2 Back-of-the-envelope (all math shown)
Single-mall admission load.
- Capacity C = 10,000.
- Typical turnover: 3× full capacity per day ⇒ 30,000 admissions/day.
- Peak hour concentration: 25% of daily traffic in busiest hour ⇒ 7,500 admits/hr = 2.08 admits/s steady peak.
- Black-Friday / mall-opening burst: 10× peak-steady ⇒ ~20 admits/s sustained, peaks of 100/s for 30–60s when doors open.
- Per-gate: a 100/s mall burst spread across 20 entry gates would be ~5/s/gate, but one turnstile can physically cycle at most ~1 person per 1.5s ≈ 0.67/s, so the real bound is human throughput. The system must survive every gate cycling at its full physical rate simultaneously: 20 × 0.67/s ≈ 13.4 admits/s sustained, with brief bursts to ~20/s absorbed by queueing. ⇒ Design for 100 admits/s per mall (5× safety factor).
Fleet load.
- 500 malls × 20/s steady avg = 10,000 admits/s across fleet.
- Fleet-wide peak (all malls at once, e.g. Saturday 11am across one time zone): 500 × 100/s = 50,000 admits/s peak.
Event log sizing.
- Event row: `event_id` (16B) + `mall_id` (8B) + `gate_id` (4B) + `type` (1B) + `ts` (8B) + `decision` (1B) + `reason` (8B) + `sensor_confidence` (4B) + serialization overhead ≈ 120B wire, 200B on-disk with indexing & compression offset.
- Per mall per day: (30,000 admits + 30,000 exits + ~1,000 rejects) × 200B ≈ 12.2 MB/day/mall.
- Fleet: 500 × 12.2 MB = 6.1 GB/day; ~2.2 TB/yr hot.
- 7-yr cold tier: ~15 TB — trivial for object storage, ~$350/mo at S3 Glacier Deep Archive rates.
Counter store sizing.
- Hot state: ~1 KB/mall (counter, version, lease table, config) × 500 malls = 500 KB active state. Fits in any single Redis/Spanner node trivially; the scale is in the write QPS, not the size.
Read load.
`get_occupancy` from ops dashboard: polled every 5s per mall view × ~20 operator sessions/mall × 500 malls = 10K queries per 5s = 2K QPS. Trivial.
Key takeaway of BoE: write contention, not volume, is the constraint. The hot partition is the single counter per mall. 100 QPS on one key is fine for Redis (100K+ INCR/s on one shard), fine for Spanner single-row (thousands of commits/s with conflict-free sequencer), but pathological for a naive RDBMS SELECT ... FOR UPDATE pattern (contention latency degrades super-linearly).
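To make the contention argument concrete, here is a minimal sketch of the capped-increment pattern that the Redis-with-Lua option relies on. Key naming and argument layout are assumptions for illustration; the point is that the script executes atomically on the shard owning the mall's hash slot, so no two concurrent admits can both take the last slot.

```python
# Sketch only — key name "occ:{mall_id}" and the ARGV layout are illustrative assumptions.
import redis

CAPPED_ADMIT_LUA = """
local count    = tonumber(redis.call('GET', KEYS[1]) or '0')
local capacity = tonumber(ARGV[1])
local leased   = tonumber(ARGV[2])
if count + 1 <= capacity - leased then
  return redis.call('INCR', KEYS[1])   -- ADMIT: returns the new occupancy
end
return -1                              -- REJECT_AT_CAPACITY
"""

def request_admit(r: redis.Redis, mall_id: str, capacity: int, leased_out: int) -> int:
    """Returns the new occupancy on admit, or -1 if the mall is at effective capacity."""
    script = r.register_script(CAPPED_ADMIT_LUA)
    # Hash-tagged key keeps every counter op for one mall on a single cluster slot.
    return int(script(keys=[f"occ:{{{mall_id}}}"], args=[capacity, leased_out]))
```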
4 High-Level API #
All APIs: gRPC primary (low-latency, streaming for sensor telemetry), HTTP/JSON gateway for dashboard and admin.
4.1 Gate-plane (gate controller ↔ entry coordinator)
service EntryCoordinator {
// Called on every turnstile presentation. Must be fast and idempotent.
rpc RequestEntry(EntryRequest) returns (EntryDecision);
// Fire-and-forget stream from gate controller; at-least-once.
rpc RecordExit(stream ExitEvent) returns (ExitAck);
// Gate pulls more lease tokens when its local budget is low.
rpc AcquireLease(LeaseRequest) returns (LeaseGrant);
// Gate returns unused tokens on shift end / graceful shutdown.
rpc ReleaseLease(LeaseRelease) returns (LeaseAck);
// Heartbeat + reconciliation (gate reports local counters; coord reconciles).
rpc Reconcile(stream ReconcileBatch) returns (ReconcileAck);
}
message EntryRequest {
string event_id = 1; // UUIDv7 from gate. Idempotency key.
string mall_id = 2;
string gate_id = 3;
int64 gate_local_seq = 4; // Monotonic per-gate. Gap-detection on replay.
int64 ts_gate_ms = 5; // Gate-local NTP time.
float sensor_confidence = 6; // 0..1. <0.8 ⇒ flag for reconciliation.
}
message EntryDecision {
enum Code { ADMIT = 0; REJECT_AT_CAPACITY = 1; REJECT_EMERGENCY = 2;
ADMIT_LOCAL_LEASE = 3; // decided from gate's local lease, not central
ERROR_RETRY = 4; }
Code code = 1;
int64 occupancy_after = 2; // best-effort; may be stale under fallback
int64 capacity = 3;
string decision_source = 4; // "central" | "gate_lease" | "cached_fallback"
int64 decision_latency_us = 5;
}
4.2 Control-plane (operator / admin)
service MallOps {
rpc GetOccupancy(GetOccupancyRequest) returns (Occupancy);
rpc StreamOccupancy(GetOccupancyRequest) returns (stream Occupancy); // push
rpc SetCapacity(SetCapacityRequest) returns (CapacityChange);
rpc EmergencyOverride(OverrideRequest) returns (OverrideAck);
rpc GetGateHealth(MallRef) returns (stream GateHealth);
rpc QueryAuditLog(AuditQuery) returns (stream AuditEvent);
}
4.3 Semantics
- Idempotency. `event_id` is a UUIDv7 minted by the gate controller. The coordinator maintains a bounded-time dedup set (last 5 min) to absorb retries. Events older than the dedup window are accepted (the gate was partitioned for longer than 5 min) but marked `late_arrival=true` and reconciled; double-admission risk is bounded to the `gate_local_lease` size (see §7.2).
- Idempotency key is per-gate and per-sequence, not just a UUID. This lets us detect gaps on replay: a gate that reconnects with seq 103 when we last saw 97 must have 5 events in flight that we need to reconcile. (Both checks are sketched after this list.)
- No cross-mall calls on hot path. Each mall is an independent bulkhead.
- `ADMIT_LOCAL_LEASE` vs `ADMIT` — deliberately a distinct return code. Dashboards and audit surface the source of every decision. This makes the "is the system partitioned right now?" question answerable from the data plane, not just from ops monitoring.
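A minimal coordinator-side sketch of the two checks above — event_id dedup within the 5-minute window and per-gate sequence gap detection. Data structures and names are illustrative; a production version would keep the dedup set in Redis with a TTL rather than in process memory.

```python
# Illustrative sketch, not the production dedup store.
import time
from dataclasses import dataclass, field

DEDUP_WINDOW_S = 300  # 5-minute dedup window from the semantics above

@dataclass
class GateState:
    last_seq: int = 0
    seen: dict = field(default_factory=dict)  # event_id -> (decision, ts_recorded)

def check_entry(state: GateState, event_id: str, gate_seq: int, now: float | None = None):
    """Returns (cached_decision | None, missing_seq_range | None)."""
    now = now if now is not None else time.time()
    # 1. Dedup: a retried event_id inside the window returns the cached decision.
    if event_id in state.seen and now - state.seen[event_id][1] < DEDUP_WINDOW_S:
        return state.seen[event_id][0], None
    # 2. Gap detection: a jump in the per-gate sequence means events from a
    #    partition are still in flight and must be reconciled before trusting the count.
    gap = (state.last_seq + 1, gate_seq - 1) if gate_seq > state.last_seq + 1 else None
    state.last_seq = max(state.last_seq, gate_seq)
    return None, gap

def record_decision(state: GateState, event_id: str, decision: str) -> None:
    state.seen[event_id] = (decision, time.time())
```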
5 Data Schema #
5.1 Primary stores (authoritative)
occupancy_counter — one row per mall, strongly consistent. This is the hot row.
| col | type | notes |
|---|---|---|
| mall_id | string PK | |
| count | int64 | current occupancy |
| capacity | int64 | current capacity |
| version | int64 | monotonic; bumped on every write (optimistic CAS) |
| leased_out | int64 | sum of live gate leases (phantom admits in flight) |
| effective_capacity | int64 | capacity - leased_out — the value coordinator compares against for central admits |
| last_reconcile_ts | int64 | for staleness alerts |
| mode | enum | NORMAL, EVACUATE, LOCKDOWN, FIRE |
Engine choice (see §7.1 for full analysis): Spanner single-row with read-write transactions for the authoritative counter; Redis cluster with Lua script as a tier-1 cache for sub-10ms reads + optimistic-path writes, with async write-through to Spanner. Rejected engines documented in §7.1.
gate_lease — one row per live lease. Mall-sharded.
| col | type | notes |
|---|---|---|
| lease_id | uuid PK | |
| mall_id | string | shard key |
| gate_id | string | |
| granted_slots | int32 | e.g. 20 |
| consumed_slots | int32 | updated on reconcile |
| granted_at | ts | |
| expires_at | ts | lease TTL, e.g. 60s |
| status | enum | ACTIVE, RETURNED, EXPIRED, RECONCILED |
events — append-only audit log. Kafka primary (RF=3, min.insync.replicas=2, acks=all). Non-compacted, retention-based topic (compaction would drop history) keyed by mall_id||gate_id: per-gate ordering is preserved, and a busy mall is spread across partitions so its burst cannot head-of-line-block another mall's audit stream.
| col | type |
|---|---|
| event_id | uuid |
| mall_id | string |
| gate_id | string |
| type | ADMIT, REJECT, EXIT, LEASE_GRANT, LEASE_RETURN, OVERRIDE, CAPACITY_CHANGE |
| ts_gate_ms | int64 |
| ts_server_ms | int64 |
| occupancy_after | int64 |
| decision_source | string |
| sensor_confidence | float |
| operator_id | string (nullable; for overrides) |
| raw_payload | bytes |
Archive: Kafka → tiered to object store (Parquet files, daily partitions) via Kafka Connect. Cold tier = Glacier-class. Hot query via Druid or ClickHouse (§5.3).
capacity_policy — small OLTP table (Postgres or Spanner).
| col | type |
|---|---|
| mall_id | string |
| capacity | int64 |
| effective_from | ts |
| set_by_operator | string |
| reason | string |
| prev_capacity | int64 |
5.2 Rationale for the effective_capacity = capacity - leased_out column
Naive designs keep one counter count. Under leased slots, the central count is the last reconciled count, which lags. If the coordinator compares count < capacity to admit, it will over-admit by leased_out. Precomputing effective_capacity means central's own admit path (for gates that round-trip) respects the outstanding leases. The leases are phantom admits from central's perspective.
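A tiny worked example of that over-admission risk (numbers are illustrative): even when the invariant count + leased_out ≤ capacity holds, the naive count < capacity check double-spends the leased slots, while the effective_capacity check does not.

```python
# Illustrative numbers only: why the central admit check must subtract leased_out.
capacity, count, leased_out = 10_000, 9_400, 600    # invariant holds: 9_400 + 600 <= 10_000

naive_admits_left = capacity - count                         # 600 more central admits...
worst_case_naive  = count + naive_admits_left + leased_out   # ...plus 600 local admits = 10_600

effective_capacity = capacity - leased_out                   # 9_400
safe_admits_left   = max(0, effective_capacity - count)      # 0 central admits left
worst_case_safe    = count + safe_admits_left + leased_out   # exactly 10_000
```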
5.3 Operational dashboard
- ClickHouse or Druid fed from the Kafka audit stream.
- Rollups: 1s, 10s, 1min, 1hr buckets per gate × decision.
- Why not just Grafana over Prometheus? Because we also need ad-hoc queries ("who entered gate 7 between 14:32 and 14:47 on Apr 4" for a lost-child or theft case), and that's an OLAP-shape workload.
5.4 A note on CRDTs (and why not)
A PN-Counter CRDT gives partition-tolerant counting with eventual consistency. It is wrong for this problem because it permits unbounded over-admission during partition (each partition independently increments without bound), which violates the hard fire-code ceiling. CRDTs are AP; we need CP on the entry path. The leased-slot model is a CP-shaped answer that looks CRDT-ish (local increments, eventual merge) but with bounded divergence via the lease size.
6 System Diagram (Centerpiece) #
6.1 Full topology
PHYSICAL PLANE ┃ NETWORK PLANE ┃ SERVICE PLANE (per mall) ┃ FLEET CONTROL PLANE
┃ ┃ ┃
┌────────────────────┐ mag-lock ┌──────────────────┐ ┃ ┃ ┃
│ Turnstile #1..20 │◄────HW────► │ Fire alarm bus │───┐ ┃ ┃ ┃
│ (IR + HW counter) │ fail-open │ (NFPA hardwired) │ │ override ┃ ┃ ┃
└─────────┬──────────┘ └──────────────────┘ ▼ ┃ ┃ ┃
│ RS-485 / ┌─────────────┐ ┃ ┃ ┃
│ GPIO (<5ms) │ Hardware │ ┃ ┃ ┃
▼ │ Key-switch │ ┃ ┃ ┃
┌───────────────────────────┐ │ @ sec. desk │ ┃ ┃ ┃
│ Gate Controller (per │ └──────┬──────┘ ┃ ┃ ┃
│ turnstile, embedded) │─┐ │ ┃ ┃ ┃
│ - 4KB dedup ring │ │ │ ┃ ┃ ┃
│ - lease cache (N tokens) │ │ │ ┃ ┃ ┃
│ - local SQLite audit WAL │ │ │ ┃ ┃ ┃
│ - fallback policy FSM │ │ │ ┃ ┃ ┃
└─────────────┬─────────────┘ │ │ ┃ ┃ ┃
│ gRPC/mTLS │ mTLS │ ┃ ┃ ┃
│ <5ms LAN │ telemetry │ ┃ ┃ ┃
▼ ▼ │ ┃ ┃ ┃
┌──────────────────────────────────────────┐ │ ┃ ┃ ┃
│ Edge Gateway (per mall, HA pair) │ │ ┃ ┃ ┃
│ - local circuit-breaker to WAN │ │ ┃ ┃ ┃
│ - aggregation, mTLS termination │ │ ┃ ┃ ┃
│ - local Redis (read-through cache) │ │ ┃ ┃ ┃
│ - fallback arbiter (see §7.2) │ │ ┃ ┃ ┃
│ - local Kafka producer buffer (24h) │ │ ┃ ┃ ┃
└──────────┬───────────────────────────────┘ │ ┃ ┃ ┃
│ WAN (SD-WAN, dual-homed fiber+LTE) │ ┃ ┃ ┃
│ p50 20ms, p99 80ms, may partition │ ┃ ┃ ┃
▼ │ ┃ ┃ ┃
══════════════════════════════════════════════════════════════════════════════┃═══════════════════════════════════════════┃═════════════════════════════════════════
│ │ ┃ REGION: MALL's home region ┃ REGION: cross-region control
│ ┌────────────────────────────────────────────┘ ┃ ┃
│ │ override (dual-path: HW+SW) ┃ ┃
▼ ▼ ┃ ┃
┌────────────────────────────────────────────────┐ (A) CAS/Lua ┌────────────────────────────┐ ┃
│ Entry Coordinator Service (per mall, 3x HA) │◄─────────────►│ Primary Counter Store │ ┃
│ ┌──────────────────────────────────────────┐ │ <5ms │ Spanner row (SR) │ ┃
│ │ Admission FSM: │ │ │ - occupancy_counter │ ┃
│ │ 1. Normalize request │ │ (B) read │ - gate_lease │ ┃
│ │ 2. Dedup on event_id │ │ through Redis │ │ ┃
│ │ 3. Check override mode │ │ Cluster (L1) │ + Redis Cluster (L1) │ ┃
│ │ 4. Choose path: │ │ ~1ms │ - Lua CAS script │ ┃
│ │ (a) central CAS (default) │ │ │ - TTL'd lease snapshot │ ┃
│ │ (b) grant lease to gate │ │ │ │ ┃
│ │ (c) fallback: reject if no budget │ │ └──────────────┬─────────────┘ ┃
│ │ 5. Emit audit event │ │ │ async replicate to global audit ┃
│ │ 6. Return decision │ │ ▼ ┃
│ └──────────────────────────────────────────┘ │ ┌─────────────────────────┐ ┃
└──────────┬───────────────────────────────────┬─┘ │ Kafka (audit bus) │ ┃
│ │ │ topic: mall-events │ ┃
│ audit write │ exit path │ RF=3, acks=all │ ┃
│ (at-least-once) │ (AP) │ partition: mall_id │ ┃
▼ ▼ └──────┬────────┬─────────┘ ┃
┌────────────────────┐ ┌────────────────────┐ │ │ ┃
│ Kafka producer │ │ Exit Sink │ │ └──► S3/GCS Parquet archive ┃
│ (local buffer │ │ (async, writes │ │ (7yr cold tier) ┃
│ on edge gateway) │ │ count DECR + │ ▼ ┃
└────────────────────┘ │ audit event; │ ┌──────────────────────┐ ┃
│ NEVER blocks exit)│ │ ClickHouse / Druid │ ┃
└──────────┬─────────┘ │ (ops dashboard DB) │ ┃
│ └──────────┬───────────┘ ┃
│ │ ┃
└────► Spanner counter ◄────┘ ┃
(DECR is always ┃
accepted; may go ┃
transiently negative ┃
under race, see §8) ┃
┃
┃ ┌───────────────────────────────┐
┃ │ Fleet Control Plane │
┃ │ - Multi-mall config svc │
┃ │ - Capacity policy DB │
┃ │ - Global ops dashboard │
┃ │ - PagerDuty / alert routing │
┃ │ - Tenant mgmt / RBAC │
┃ │ - Firmware rollout controller │
┃ └───────────────────────────────┘
Every arrow carries (protocol, budget, QPS):
| Arrow | Protocol | Latency budget | QPS (peak) | Notes |
|---|---|---|---|---|
| Turnstile → Gate Controller | RS-485 / GPIO | <5ms | 0.67/s/gate | hardware |
| Gate Controller → Edge Gateway | gRPC mTLS over LAN | <10ms | 100/s/mall | |
| Edge Gateway → Entry Coordinator | gRPC mTLS over WAN | p99 80ms | 100/s/mall | may partition |
| Entry Coordinator ↔ Spanner | Spanner RPC | p99 15ms | 100/s/mall hot row | |
| Entry Coordinator ↔ Redis L1 | RESP | p99 3ms | 500/s/mall (reads amplified) | |
| Entry Coordinator → Kafka | Kafka protocol | p99 20ms (async ack) | 200/s/mall (entries + exits) | |
| Kafka → ClickHouse | Kafka Connect | seconds | 200/s/mall | |
| Fire Alarm → All Turnstiles | hardwired NFPA loop | <500ms physical | — | hardware path, bypasses software entirely |
| Hardware Key-switch → Turnstiles | hardwired | <500ms | — | security desk override |
6.2 Sub-diagram: Entry Decision Flow (happy path + fallback)
EntryRequest(event_id, gate_id)
│
▼
┌─────────────────────────────────────┐
│ Gate Controller receives │
│ 1. dedup event_id in ring buffer? │──yes──► return cached decision
│ 2. emergency mode flag set? │──FIRE─► OPEN (HW override)
│ (read from local GPIO + SW) │──EVAC─► REJECT (entry blocked)
└──────────────┬──────────────────────┘
│
local lease │
tokens > 0 ? │
┌─────────┴──────────┐
yes no
│ │
┌────────────▼───────────┐ ▼
│ LOCAL ADMIT: │ ┌───────────────────────────┐
│ - decrement local lease │ │ Call coordinator │
│ - write local WAL │ │ RequestEntry │
│ - open turnstile │ │ timeout: 80ms │
│ - async push audit │ └──────┬──────────┬──────────┘
│ - if lease < low_water, │ │ │
│ schedule async refill │ success timeout/error
└─────────────────────────┘ │ │
▼ ▼
┌───────────────┐ ┌──────────────────────┐
│ Apply │ │ Fallback arbiter: │
│ decision from │ │ - lease > 0? admit │
│ coordinator │ │ - else: pessimistic │
│ (authoritative)│ │ REJECT & alert │
└───────────────┘ └──────────────────────┘
6.3 Sub-diagram: Gate-Local Fallback State Machine
┌──── central healthy (last_heartbeat < 5s) ────┐
│ │
▼ │
┌──────────┐ heartbeat timeout 5s ┌────────────┐ │
┌──────▶│ NORMAL │────────────────────────▶ │ DEGRADED │──┘ recovery
│ └────┬─────┘ └─────┬──────┘ (force reconcile)
│ │ │
│ lease refill ok lease depleted & no WAN
│ │ │
│ ▼ ▼
│ ┌──────────┐ ┌────────────┐
│ │ LEASED │──lease exhausted──────► │ PESSIMISTIC│──┐
│ │ OPERATION│ │ REJECT │ │
│ └──────────┘ └────────────┘ │
│ │ │
│ fire override at any state │ │
│ ┌───────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ fire cleared + mgmt ack │
│ │ FIRE_OPEN │────────────────────────────────────► NORMAL
│ │ (HW-enforced│
│ │ exit-only) │
│ └─────────────┘
└─────── (return after reconcile from any state) ──────
6.4 Asymmetry of exit path (critical L7 call-out)
The diagram deliberately shows exit as a separate data path:
- Exit sensor → Gate Controller — no admission check.
- Gate Controller writes exit to local WAL, opens turnstile.
- Exit events stream async to Kafka, which decrements the counter eventually.
- The exit path has no synchronous call to the coordinator. Even if the WAN is fully partitioned, exit is local and always works.
- If the gate controller itself dies, the turnstile mag-lock fails open (hardware default) — exit is still physically possible.
This is the asymmetric CAP decision made physical. Entry = CP. Exit = AP.
7 Deep Dives #
7.1 Deep Dive 1 — Strongly-consistent counter without hot-key collapse
Why critical. The whole system centers on one row per mall. Naïvely, 100 writes/s on one key is easy; under the CP + audit requirement it becomes a contention battle.
Options considered (quantified):
| Option | Mechanism | Throughput per mall | Latency p99 | Correctness | Operational cost | Verdict |
|---|---|---|---|---|---|---|
| A. Postgres SELECT ... FOR UPDATE | Pessimistic row lock | ~500/s before lock contention dominates; degrades super-linearly | 50–500ms under load | Strong | Low | Rejected — lock-acquire latency spikes kill p99 at peak; no natural multi-region story. |
| B. Redis single-node INCR + Lua CAS | Single-threaded event loop serializes ops | 100K ops/s | p99 2ms | Strong (single shard) | Low | Strong candidate, but no multi-AZ strong consistency natively (failover is async; we'd lose up to replica_lag admissions on failover — unacceptable). |
| C. Redis Cluster with Lua CAS on single slot | Hash-slot isolates mall_id | Still single-shard per mall (~100K/s) | p99 3ms | Strong per-slot; weaker on failover | Medium | Same AZ-failover concern as B. Good as cache, not as source-of-truth. |
| D. Spanner single-row RW transaction | Paxos-replicated row | Single-row = 1000s commits/s; contention handled by serialization queue | p99 15ms | Strong, cross-region, externally-consistent | Higher $ | Chosen for authoritative state. |
| E. FoundationDB transactions | Same semantics as Spanner; cheaper, more ops | 10K+ txns/s on hot key | p99 10ms | Strong | Higher ops burden (self-managed) | Strong candidate if self-hosted; chose Spanner for managed ops story. |
| F. Sharded counter + periodic roll-up | N sub-counters per gate, each incremented locally, coordinator sums periodically | Unlimited throughput | Coordinator read p99 unchanged | Weakly consistent — coordinator's sum lags; over-admission bounded by shard divergence | Complex | Rejected for base counter — cannot guarantee ceiling. Adopted in a modified form as leased-slot (§7.2). |
| G. CRDT PN-Counter | Per-replica increment, merge on sync | Unlimited | Read is local | Unbounded divergence under partition | Medium | Rejected — AP, cannot cap. |
| H. Token-based admission | Pre-mint C tokens, gates consume | Naturally serializes at token mint; consumption is local | Local p99 < 5ms | Strong bound on max admissions | Complex, reconciliation-heavy | Chosen as optimistic fast path (§7.2). |
Chosen approach — two-tier:
- Authoritative layer: Spanner single-row per mall for `occupancy_counter`. Single-row RW transactions on a hot row in Spanner are actually a good match: Spanner's participant leader serializes conflicting commits on a single Paxos group, and the row is small enough that commit latency is dominated by WAN round-trip (10–15ms p99). At 100 QPS/mall we are nowhere near its ceiling (2–5K commits/s on a hot row).
- Fast-path cache layer: Redis Cluster with Lua. The coordinator's hot path reads `effective_capacity` from Redis (refreshed every 100ms from Spanner) and serves most requests from the lease model without hitting Spanner for every admit. Writes (central-path admits) go to Spanner authoritatively and then invalidate/update Redis.
- Dedup buffer: Redis set with TTL=5min for `event_id`s, to absorb retries.
Why not just Redis? The business can tolerate a Redis failover dropping unreplicated admits (they would be re-issued lease tokens), but cannot tolerate a silent gap where the canonical count is wrong after failover. Spanner gives us externally-consistent source of truth; Redis is a performance cache in front of it. If we run Redis-only, we would need to implement our own Raft/Paxos replication of the counter to avoid post-failover ambiguity — that's rebuilding Spanner badly.
Production systems named:
- Ticketmaster "queue-it" and TicketSocket: use a similar two-tier model — a global token mint (authoritative) with local queue servers. See the Andrea Leopardi / Livewire talks on Ticketmaster's high-contention ticket mint — they moved off Postgres FOR UPDATE to pre-allocated token pools precisely because lock-queue latency was unbounded.
- Disney Parks FastPass / Genie+: pre-allocates ride-slot tokens per day and distributes them to kiosks/app clients; kiosks admit from local budget; central reconciles.
- Stripe idempotency keys: TTL'd dedup set, same pattern as our event_id dedup.
- Google Spanner single-row hot-key benchmarks (Jeff Dean, CIDR 2017): single row at thousands of commits/s is supported with minor caveats (participant-leader batching).
Failure modes + mitigations (of the chosen counter design):
- Spanner regional outage → failover is automatic within configured placement. During the ~30s failover window, coordinator falls into "leased-slot only" mode — gates admit from their existing leases, no new leases granted, no new central admits. When Spanner recovers, a reconciliation pass folds in buffered admits.
- Redis cache poisoning / staleness → bounded by 100ms refresh window; coordinator always falls back to Spanner for the authoritative admit.
- Hot mall (one mall at 500 QPS, 10× design) → single-row commit queue on Spanner starts building. Mitigation: increase lease size so more admits go local; lease grants are themselves admits from Spanner's perspective, so a gate taking a 50-token lease is 1 commit for 50 admits — this is the contention-amortization trick.
7.2 Deep Dive 2 — Leased-slot admission (earned-secret core)
Why critical. This is the model that makes the system both partition-tolerant and safe. It is the single most important technique in the design.
The model.
Each gate holds a local lease of L admission tokens (e.g. L=20). When a person presents:
- If `local_tokens > 0`: decrement, admit, write local WAL. No network call.
- If `local_tokens == 0`: synchronously call the coordinator's `AcquireLease(L')`. The coordinator does a Spanner CAS: `leased_out += L'`, `effective_capacity -= L'`. Returns a lease grant.
- When the gate goes idle or on graceful shutdown: `ReleaseLease` returns unconsumed tokens to central.
When `local_tokens` drops below a low-water mark (e.g. 5 of 20), the gate asynchronously requests a refill so the synchronous path rarely blocks on the network.
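A compact sketch of the gate-controller side of this model. Class names, thresholds, and the coordinator stub are assumptions; the real controller also persists to its SQLite WAL before opening the arm and makes one synchronous AcquireLease attempt (80ms timeout) before dropping into pessimistic reject, per §6.2.

```python
# Gate-side leased-slot admission — illustrative sketch, not production firmware.
import threading

LEASE_SIZE = 20   # L tokens per lease (illustrative)
LOW_WATER  = 5    # async refill threshold (illustrative)

class GateLease:
    def __init__(self, coordinator):
        self.coordinator = coordinator   # stub exposing acquire_lease(n) -> tokens granted
        self.tokens = 0
        self.lock = threading.Lock()

    def try_admit(self) -> bool:
        """Local admit: no network call while tokens remain; False == pessimistic REJECT."""
        with self.lock:
            admitted = self.tokens > 0
            if admitted:
                self.tokens -= 1
            need_refill = self.tokens < LOW_WATER
        if need_refill:
            # Refill off the hot path so the next presentation rarely blocks on the WAN.
            threading.Thread(target=self.refill, daemon=True).start()
        return admitted

    def refill(self) -> None:
        try:
            granted = self.coordinator.acquire_lease(LEASE_SIZE)  # central CAS: leased_out += granted
        except ConnectionError:
            return   # partitioned: keep serving from whatever tokens remain, then reject
        with self.lock:
            self.tokens += granted
```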
CAP positioning. This is a bounded-divergence CP design. At any instant, actual_occupancy ≤ central_count + Σ(leased_out per gate) ≤ capacity. The invariant is maintained by the central CAS at lease grant time, not at admit time. The lease is the commit.
Under partition (pessimistic):
- The gate's last-known lease is `L_current`. The gate admits up to `L_current`, then rejects. Worst-case over-admission during partition: zero (by definition — the leased tokens were already committed against capacity centrally).
- When the partition heals, the gate sends `Reconcile` with its consumed count. Central folds the lease back in: `count += consumed`, `leased_out -= granted_slots`; unconsumed slots return to the free pool and the gate re-leases as needed. Accounting is precise.
Under partition (optimistic) — rejected for this problem but worth analyzing:
- The gate continues admitting past its lease, up to a safety cap `L_current + Δ`.
- Over-admission is at most `Σ Δ` across gates, which is bounded. E.g. 30 gates × Δ=10 = 300 over-admits ≤ 3% of capacity.
- Acceptable for some venues (a concert festival, where queue rage outweighs the safety margin). Not acceptable here because fire code is a hard ceiling enforced by inspection and liability.
Low-water mark math.
Peak per-gate admit rate ~0.67/s. Network p99 to central = 80ms; coordinator lease-grant p99 = 100ms total. Expected admits during one refill round-trip = 0.67 × 0.1 ≈ 0.07, so the synchronous path essentially never blocks. Set low-water = ⌈p99_admits_during_refill × safety_factor⌉, a small constant in the 3–5 token range. Configure lease L=20 so refills happen roughly once per 30s per gate.
Why not just "gate_budget = capacity / gates"? Because traffic is not uniform. At the mall's main entrance gate, 90% of admissions happen; at a side gate, 5%. If we statically partitioned capacity, the main gate would reject prematurely while side gates sat on unused budget. The lease model is a dynamic reallocation — a form of work-stealing counter — where hot gates refill more often and cold gates let their leases expire.
Lease expiry handling. A gate's lease has a TTL (e.g. 60s). If the gate dies without returning the lease, central reclaims it via:
- Heartbeats. Gate sends heartbeat with remaining lease every 5s.
- On heartbeat miss × 3 (15s), the coordinator marks the lease SUSPECT.
- At the 60s TTL, the coordinator reclaims: `leased_out -= granted_slots`, `count += granted_slots × fudge_factor`, with `fudge_factor = 1`: the safe assumption that ALL tokens were consumed. The system errs on the side of treating an unreachable gate as having admitted its full lease, so admissions that actually happened can never go missing from the count if the gate comes back online after reclamation. The cost is a transient overstatement of occupancy, which biases toward rejecting more entries, which is the safe direction.
- When the gate reconnects, it reports its actual consumption in `Reconcile`; central applies a compensating adjustment.
This is the L7 move: the asymmetry of assumption under uncertainty is deliberately biased toward the safety-critical direction. Over-rejection is recoverable; over-admission is not.
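A sketch of the reclamation and compensation rules, showing the safety-biased assumption in code. Field names mirror the gate_lease and occupancy_counter tables; the logic itself is illustrative.

```python
# Illustrative coordinator-side accounting for expired and reconciled leases.
from dataclasses import dataclass

@dataclass
class MallCounter:
    count: int        # reconciled occupancy in the authoritative row
    leased_out: int   # sum of live lease grants

def reclaim_expired_lease(c: MallCounter, granted_slots: int) -> None:
    """Gate unreachable past the TTL: assume every granted token was consumed.
    Biases toward overstating occupancy, i.e. toward rejecting entries (safe direction)."""
    c.leased_out -= granted_slots
    c.count      += granted_slots          # fudge_factor = 1: treat the full lease as admits

def reconcile_gate(c: MallCounter, granted_slots: int, consumed: int,
                   already_reclaimed: bool) -> None:
    """Gate reports its true consumption; apply a compensating adjustment."""
    if already_reclaimed:
        c.count -= (granted_slots - consumed)   # hand back the slots nobody actually used
    else:
        c.leased_out -= granted_slots
        c.count      += consumed
```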
Production analogs:
- AWS IAM / STS temporary credentials: pre-issued tokens with TTL, verified locally.
- Kubernetes `kube-scheduler` with resource reservations: nodes hold a reservation budget for pods; the scheduler reconciles.
- Airlines' seat-map reservation locks: a travel agent holds N seat locks for 15 min; returns unused.
- CockroachDB's "SQL sequence caching": each SQL node caches a block of sequence values to avoid a round-trip per INSERT. Same idea, same edge case (cached values on a dead node are lost = gaps in sequence; acceptable for sequences, not for our counter — so we reclaim).
- Riot Games / Fortnite queue-token model: login tokens pre-minted in pools, consumed by region servers.
7.3 Deep Dive 3 — Asymmetric exit path (fire safety as architecture)
Why critical. A naive system wraps both entry and exit in the same "update counter" API. This is not just inefficient — it is illegal. Most building codes (NFPA 101 in US; equivalent EN, ISO) require that egress cannot be prevented by mechanical or electronic means except under active fire-marshal control. If software can block an exit, the building is not code-compliant.
The design.
- Exit does not go through the admission FSM at all. Separate gate-controller code path.
- Exit is fire-and-forget with durable local WAL. The gate controller writes the exit event to local SQLite (fsync) before releasing the turnstile. Even if it reboots mid-cycle, the event is preserved.
- Exit batches are streamed async to Kafka via the edge gateway. Kafka's consumer group (the "Exit Sink") DECRs the Spanner counter. Since exits can never cause over-admission (they only reduce count), eventual replication is safe.
- Hardware-level fail-open. The turnstile mag-lock is wired so that loss of power or loss of signal from the gate controller releases the lock. Software cannot hold the lock closed.
- Fire alarm integration is a hardwired NFPA loop, not an API call. When the fire panel asserts alarm, all mag-locks drop simultaneously via dedicated copper. Our software also enters `FIRE` mode via a separate telemetry channel (so we record the event), but the software path is the belt; the hardware loop is the suspenders.
Race: exit before entry admit replicates. Scenario: person enters gate 1, coordinator admits, Kafka event in flight. Same person exits gate 2 before the entry event reaches Spanner. The counter would transiently go negative.
Mitigation: count is stored as int64 (room for negative), and the dashboard clamps display to max(0, count). Over a reconciliation window (5s), the entry event catches up and the counter recovers. The invariant we enforce is "never above capacity," not "never below zero." This is an explicit L7 decision: the ceiling is regulatory; the floor is cosmetic.
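A minimal sketch of the exit sink's apply step under this rule (Kafka consumer wiring elided; the point is that the decrement is unconditional and only the display is clamped):

```python
# Illustrative: the exit sink never rejects a decrement; only the dashboard clamps.
def apply_exit(counters: dict[str, int], mall_id: str) -> None:
    # May go transiently negative while the matching entry event is still in flight.
    counters[mall_id] = counters.get(mall_id, 0) - 1

def display_occupancy(counters: dict[str, int], mall_id: str) -> int:
    # The enforced invariant is only the ceiling; the floor is cosmetic.
    return max(0, counters.get(mall_id, 0))
```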
Race: double-exit. The sensor fires twice on a single person (tailgating variant in reverse). The gate controller dedups within a 500ms window at the sensor level (hardware debounce + software). Cross-gate dedup is unnecessary (a person is physically at one gate at a time), so this is a purely local concern.
Tailgating (the real correctness ceiling). Turnstiles allow tailgating at ~0.1–0.3% depending on design (tailgater slips through behind another person, sensor only counts one rotation). This means the sensor itself has error that software cannot fix. We address this by:
- Overhead optical counter (secondary sensor) at each gate, independently counted.
- Coordinator reconciles turnstile count vs optical count hourly; discrepancy > 0.5% triggers ops alert and a sensor calibration task.
- Reporting: "sensor-confident admissions" and "total foot traffic" are separate metrics. Fire-code compliance is measured against the higher of the two (worst case), so we effectively run below nominal capacity relative to the turnstile count alone.
Why not ML tailgating detection on camera? v3 feature. Introduces privacy surface (see §10) and drifts accuracy under crowd load. For v1/v2 we rely on dual-sensor mechanical reconciliation and ops-driven calibration.
Production analogs:
- Airport boarding: boarding pass scan is the "admit"; exit (deplaning) is physically unstoppable.
- Amusement park ride queues (Disney LightningLane): entry is ticketed; exit is free.
- Every data-center badge reader: entry requires auth; emergency egress ("panic bar") is hardware-guaranteed.
8 Failure Modes & Resilience #
| Component | Failure | Detection | Blast Radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Turnstile sensor | Miscount (tailgating, double-count) | Per-hour reconciliation against optical counter | Single gate, <0.5% count drift | Alert ops, schedule calibration, bias safety margin by 1% | Manual recount + adjustment via CapacityChange audit event |
| Gate controller | Hang / reboot | Heartbeat miss (5s) | Single gate; in-flight lease held until 60s TTL reclaim | Pair gate with adjacent gate for backup admit; mag-lock fails open for exit | Controller boots, reloads WAL, re-plays unsynced events |
| Gate controller | Split-brain (network partition from central but not from sensor) | Coordinator heartbeat timeout | Single gate; lease exhaustion within ~30s | Pessimistic reject when lease empty; human fallback (guard lets people through manually) | Reconcile on reconnect folds consumed lease |
| Edge gateway | Outage | Coordinator heartbeat | All gates in mall | HA active-active pair (anycast VIP); all gates reconnect in <5s | Recovery via HA partner; log buffer on gates survives ≤24h |
| WAN partition (mall ↔ central) | Full isolation | Coordinator probe | All mall's central-path admits | Gates operate on leases until exhausted; then pessimistic reject; exits unaffected | On reconnect, bulk reconcile; over-admission count (bounded by sum of unused lease) folded into count |
| Spanner regional outage | Region unavailable | Spanner client errors | All malls in region | Cross-region failover (automatic with multi-region instance); during failover, coordinator drops into "leased-slot only" — no new leases but existing ones honored | Automatic; reconcile on recovery |
| Redis cluster outage | Cache miss storm | Cache error rate | +10ms latency per admit (fall back to Spanner direct) | Spanner directly serves; degraded latency but correct | Rebuild cache async from Spanner |
| Kafka outage | Audit writes blocked | Producer metric | Audit lag | Local producer buffer on edge gateway (24h, spill to SQLite); entries still admitted (audit is not on critical path) | Drain buffer on recovery |
| ClickHouse/Druid outage | Dashboard stale | Read errors | Dashboard only | Fallback to direct Spanner count read; gate-level metrics degraded until recovery | Replay Kafka stream |
| Sensor double-fire (e.g. mechanical bounce) | Phantom admissions | Local sensor dedup (500ms debounce) + optical reconciliation | <0.1% error | Dedup in firmware; tolerate via bounded-error SLO | Calibration; Kafka has raw events for audit |
| Capacity misconfiguration (set to 100 instead of 10000) | Over-reject & angry customers | Large rejection spike alert (>5x baseline for >5min triggers PagerDuty) | Mall entries stalled | Four-eyes policy on SetCapacity for reductions >20%; operator override required | Operator sets correct value; audit records both |
| Audit log backlog (Kafka lag) | Dashboard lag | Lag metric | Delayed insights, not decisions | Scale Kafka consumer group horizontally | Natural drain |
| Fire alarm triggered | Hardware loop opens all mag-locks | Fire panel signal | Entire mall goes to evacuation mode | All entries blocked (FIRE mode in software); exits unrestricted (hardware + software); audit records the event | Manual reset after fire marshal clears; operator re-arms gates |
| Fire alarm false positive | Gates fail-open incorrectly | Reconciliation: sudden drop to "all exit" state without corresponding occupant count change | Mall closes until reset | Physical lockout: hardware key-switch at sec desk required to re-arm (code-compliance) | Security desk walkthrough + re-arm |
| Malicious operator (capacity set to 100K to fit extra customers) | Over-admission by management | Audit-log anomaly detection (capacity_change outside pre-approved bounds) | Potentially severe (liability) | Hard policy cap in control plane; multi-party approval for >20% capacity increase; fleet-level alert | Retroactive rollback; compliance team review |
| Gate-local WAL corruption | Lost audit events | SQLite checksum + startup integrity check | Single gate's audit; counter untouched (counter is in Spanner) | Regular WAL backup to edge gateway every 30s | Replay from edge gateway buffer |
| Clock skew > 50ms between gate & central | Event ordering confusion | NTP monitoring | Audit log ordering may be wrong | We order by ts_server_ms in audit; ts_gate_ms is informational only; dedup by event_id, not ts | NTP re-sync; flag events as clock_skew_suspected |
| Leased tokens outstanding when capacity reduced | Capacity change to lower value while leases held may cause ≤ leased_out over-admission | Reconcile on capacity change | Bounded to Σ live leases (~30 × 20 = 600 → 6% of capacity worst case) | On capacity reduction, coordinator revokes all outstanding leases (async message to gates); gates complete in-flight admit then stop admitting until refill | Gates re-lease at new effective_capacity |
9 Evolution Path #
v1 — "Make it correct, for one mall" (0–3 months)
- Single mall, single region.
- Centralized Redis with Lua CAS on the occupancy counter; Postgres for capacity_policy and audit.
- Gates are synchronous — every admit is a blocking coordinator call.
- No leases, no local fallback — if central is down, entries stop.
- Exits are async batched to Redis (still asymmetric from day 1 — fire code non-negotiable).
- Ops dashboard: Grafana over Postgres audit.
- Good-enough SLO: p99 100ms local mall network; central outage = entries stop (acceptable for single-tenant pilot).
v2 — "Make it partition-tolerant and fleet-ready" (3–9 months)
- Add leased-slot model (§7.2) — gates hold local budget, operate through short partitions.
- Migrate authoritative counter from Redis → Spanner (or FoundationDB); Redis becomes L1 cache.
- Add Kafka audit bus — move audit off the critical path; introduce ClickHouse for OLAP.
- Multi-mall (200–500): control plane for fleet, per-tenant isolation by region.
- Reconciliation job runs on lease expiry + on partition heal.
- SLO upgrade: 99.99% entry decision availability; exit is effectively 100% (hardware-backed).
v3 — "Make it smart, safe, and compliant at scale" (9–18 months)
- ML-assisted tailgating detection on overhead camera (privacy-bounded: counts only, no identity).
- Emergency system integration: two-way with building management (BMS/BACnet) — fire panel, HVAC, elevator recall.
- Predictive capacity — forecast crowd peaks to preload leases onto hot gates.
- Compliance automation: auto-generated monthly occupancy reports for landlord / insurer; SOC 2 + local fire-code audit evidence exports.
- Dynamic capacity based on HVAC (reduce capacity when CO2 sensors trigger), parking (reduce when full), public-safety events.
- Per-zone occupancy (food court, cinema, department stores) — sub-counter hierarchy with per-zone leases.
- Multi-region active-active per mall (true two-region dual-write with CRDT-bounded conflict resolution using the same bounded-lease technique — v3-level trick that only applies when regional WAN is itself flaky).
10 Out-of-1-Hour Notes (solo-study deep content) #
10.1 Hardware-level interlocks and why software cannot be the root of trust
Building fire codes (NFPA 101, IBC, EN 13637 for electric locks in escape routes) require that exits fail safe on:
- Loss of power
- Loss of signal from control panel
- Manual emergency release ("panic bar") at the door itself
- Fire alarm
No amount of clever software can satisfy the code — the inspector checks the physical wiring. Our software architecture must be designed as a cooperating peer to the hardware interlock, not as its manager. This frames many decisions (e.g., FIRE mode in software is for auditing and alerting, not for enforcement; enforcement is in copper).
10.2 Privacy of entry/exit counts
Even anonymous counting has surveillance implications:
- Aggregate hourly counts at an abortion clinic, a church, a synagogue, a political rally → identifies individual visit patterns when correlated with other data.
- Mitigations:
- Differential privacy noise on public-facing counts (dashboards shown to tenants or published externally).
- Coarse bucketing of audit logs (day-level for external, per-event only for operator-accessible and with retention limits).
- No biometrics, no face matching in counting pipeline.
- Minimize retention: raw per-event audit in hot tier 30 days, aggregated in cold tier; long-tail liability windows use aggregates unless incident triggers hot-tier retention.
- This is relevant to the candidate's current role (Meta Privacy Infra): the same techniques (anonymization, DP, k-anonymity on aggregate release) apply.
10.3 Legal record retention and e-discovery
- In US, statute of limitations for slip-and-fall personal injury in malls = 2–4 years depending on state; for structural/fire events = 7 years (varies). 7-year retention is the industry default.
- Audit log must be immutable for evidentiary value (append-only + hash-chained or WORM storage). Kafka with compaction disabled + object-storage retention lock satisfies this.
- Design must support legal hold — ability to tag specific events as "do not delete past retention" for ongoing litigation.
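A minimal sketch of the hash-chaining idea for evidentiary immutability (format and field choice are assumptions; WORM/object-lock storage is the alternative named above):

```python
# Illustrative append-only hash chain over audit events.
import hashlib
import json

def chain_hash(prev_hash: str, event: dict) -> str:
    """Each audit record commits to its predecessor, so any later tampering
    breaks every subsequent hash and is detectable on replay."""
    payload = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

# Usage: genesis = "0" * 64; h1 = chain_hash(genesis, event1); h2 = chain_hash(h1, event2); ...
```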
10.4 Integration with building management systems (BMS / BACnet / Modbus)
- BMS protocols are not modern REST — they are BACnet or Modbus over serial/IP with weak auth.
- Our integration is one-way pull for safety signals (HVAC, CO2, temperature) — never trust write paths from BMS to be authenticated, so the occupancy system reads signals and makes its own decisions.
- Fire panel integration is specifically via hardwired dry-contact relays, not protocol-level — this isolates software vulnerabilities from life-safety actuation.
10.5 Sensor calibration and tailgating measurement
- Quarterly calibration: walk-through test with known count, compare against turnstile + optical counter. Discrepancy logged; if >0.5%, sensor flagged for service.
- Tailgating rate is a slowly-varying mall characteristic — depends on turnstile geometry, staff attentiveness, crowd density. Some malls see 0.05%, some see 0.4%. Per-gate learned baseline; flag drift.
10.6 Probabilistic counting vs exact counting — two parallel pipelines
- Exact (CP pipeline) for capacity enforcement, as designed above.
- Probabilistic (AP pipeline) for marketing & foot-traffic analytics — WiFi-probe counting, Bluetooth beacons, camera-based crowd estimation. Much cheaper, less precise, explicitly non-regulatory.
- These pipelines must be kept organizationally separate to prevent the marketing team's approximate counter from being used for capacity decisions. (L7: data lineage matters for compliance.)
10.7 Observability as first-class SLO surface
Operator-facing SLOs (internal, not customer-facing):
| SLO | Target | Measurement |
|---|---|---|
| Entry decision latency | p99 < 100ms | Coordinator RPC histograms |
| Over-admission events | 0 under normal; bounded & reported under partition | Synthetic reconciliation: actual_occupancy_from_sensor_count - max_count_observed > 0 → alert |
| Exit reliability | 100% availability (hardware-backed) | Count of "exit rejected" events: must be 0 |
| Audit log completeness | 100% of admits + exits logged within 60s | count(admits at coord) - count(admit events in warehouse) = 0 at end-of-day |
| Lease reconciliation accuracy | Σ(consumed + returned + reclaimed tokens) = Σ(granted lease tokens) | Daily reconciler cron; alert on drift |
| Mean time to detect partition | < 10s | Heartbeat-based |
| Fire-drill restore time | < 2 min | Periodic fire drill exercise |
Over-admission counter as an SLO: this is the anomaly. Most system SLOs measure performance; an over-admission SLO measures correctness against a regulatory ceiling. It is measured by independent sensor reconciliation, not by the system's own self-report. This defends against the failure mode where the bug is in the code that counts.
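A sketch of how the over-admission SLO can be computed from the independent sensor stream rather than the system's self-reported count (interval granularity and field names are assumptions):

```python
# Illustrative: replay per-interval deltas from the *optical* sensors, not turnstile decisions.
def over_admission_intervals(sensor_entries: list[int], sensor_exits: list[int],
                             capacity: int) -> int:
    """Counts intervals in which the independently reconstructed occupancy exceeded capacity."""
    occupancy = 0
    violations = 0
    for entered, exited in zip(sensor_entries, sensor_exits):
        occupancy += entered - exited
        if occupancy > capacity:
            violations += 1
    return violations
```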
10.8 Why "just use Spanner" is tempting but incomplete
Spanner solves the authoritative-counter problem at the write layer. It does not solve:
- Local decision latency during partition (Spanner doesn't exist in the mall).
- Exit-never-blocked (that's a hardware decision).
- Bounded divergence on partition (that's the lease model).
- Audit lineage & data retention (that's Kafka + tiering).
- Sensor error (that's calibration + dual-sensor).
An L6 answer says "Spanner." An L7 answer names Spanner as the hot row but lays the lease model on top because the problem is really about a distributed semaphore with hardware fail-safes and regulated correctness, not a distributed counter.
10.9 Analog to distributed locking services
The lease model is isomorphic to how Chubby / Zookeeper / etcd hand out leases for leader election, but inverted: instead of leasing a single resource (the leader slot) to one client, we lease N abstract "admission tokens" to many clients. Each consumption is unilateral (no central round-trip). The lease is the contract. If the lessee (the gate) dies, the contract expires and the slots revert. This is one of the fundamental distributed patterns, and recognizing that our counter is a semaphore with lease-based distribution (not a shared variable) is the conceptual unlock.
Compare to Kubernetes: kubelet holds a resource reservation from kube-scheduler; pods scheduled locally are deducted from the local reservation. If the node partitions, the scheduler reclaims the reservation (pod eviction). Exactly our model with different nouns.
10.10 What I'd sketch on the whiteboard if given 5 more minutes
A reconciliation invariant:
For all time t, for each mall:
  actual_count(t) == spanner_count(t) + Σ over gates g of consumed_tokens(g) - Σ(exits still in Kafka lag)
When partitions heal:
  spanner_count += Σ(consumed_tokens from reconciled gates)
  consumed_tokens(g) -= reconciled amount
  effective_capacity is adjusted as leases are re-granted
Invariant throughout: spanner_count + leased_out ≤ capacity
This invariant is the formal proof that over-admission cannot occur under our assumptions. The assumptions that can break it:
- Byzantine gate reporting (malicious firmware): mitigated by signed firmware + dual-sensor reconciliation.
- Sensor miscounting (tailgating): bounded by SLO, accepted.
- Clock skew in TTL reclaim: mitigated by making reclamation strictly additive and using hybrid logical clocks for ordering if we ever need total-order reasoning (we don't for capacity).
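A runnable restatement of the invariant and the occupancy reconstruction as checker functions (names are illustrative; the daily reconciler in §10.7 would evaluate the same predicates against production state):

```python
# Illustrative checkers for the reconciliation invariant above.
def capacity_invariant_holds(spanner_count: int, leased_out: int, capacity: int) -> bool:
    """The ceiling that must hold at every instant: committed admits plus
    outstanding lease grants never exceed capacity."""
    return spanner_count + leased_out <= capacity

def reconstructed_occupancy(spanner_count: int, consumed_per_gate: dict[str, int],
                            exits_in_kafka_lag: int) -> int:
    """Best-effort actual occupancy: authoritative count, plus lease consumption not
    yet reconciled, minus exit events still in flight on the audit bus."""
    return spanner_count + sum(consumed_per_gate.values()) - exits_in_kafka_lag
```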
Self-Verification (§pre-submit checklist) #
- SRE pager-carryable? ✅ — Ops can diagnose via (a) per-gate decision_source metric, (b) fallback state machine diagram in runbook, (c) reconciliation invariant as a check. Primary pagers: entry_decision_p99, over_admission_counter, exit_reject_rate (any >0 is page), Kafka_audit_lag, spanner_commit_latency.
- Every diagram arrow → real API/data flow? ✅ — table in §6.1 maps each arrow to the API in §4 or the data plane in §5.
- Deep-dive L7 or L6? ✅ — §7.2 leased-slot includes the bias-toward-safety reclamation rule (the L7 move); §7.3 maps asymmetric CAP onto regulated hardware. §7.1 rejects Spanner-only, Redis-only, and FoundationDB with quantified trade-offs.
- Exits-never-blocked as asymmetric code path? ✅ — §6.1 shows exit bypassing admission FSM; §6.4 explicitly calls out the asymmetry; §7.3 is a whole deep dive on it.
End of solution.