
Design a Distributed Rate Limiter

Enforce per-user, per-tenant, and per-IP quotas at gateway scale without turning the limiter itself into the bottleneck.

Tags: Consistency · Stateful hot path · Partitioning · Availability · Safety / correctness

1 Problem Restatement & Clarifying Questions #

Restatement (say this first, in 20 seconds): Build an internal rate-limiting service that protects shared infra from abuse, enforces per-tenant SLA quotas, and is called in the request path of every API gateway at peak 1M RPS. The check must add ≤1–2ms p99 to user-facing latency, must be correct enough that whale tenants cannot DoS small tenants, and must degrade gracefully (fail-open) if the limiter itself has an incident — because a rate limiter outage must not become a full-site outage.

Clarifying questions I would ask the interviewer, with the default I'd adopt if they say "you decide":

| # | Question | Why it matters | Default if unspecified |
|---|----------|----------------|------------------------|
| Q1 | Internal-only (trusted callers, cooperative) vs. external (adversarial)? | Changes whether we design for correctness under attack or for SLA fairness. | Internal, cooperative. Anti-DDoS stays at the CDN/edge; this service enforces fairness and quotas, not malice. |
| Q2 | Target precision: per-second, per-minute, per-hour, or all? | Determines window size, counter count, and whether sliding-window-log (expensive) is viable. | Multi-window. Must support 1s, 1m, 1h simultaneously per policy. |
| Q3 | Strong or eventual consistency cross-region? | Strong cross-region pushes us to a single-writer or quorum design → 10× cost and latency. | Eventual cross-region, strong intra-region per shard. Whales get special handling (§7b). |
| Q4 | Multi-region active-active or single home region? | Affects whether we replicate counters or partition by region. | Multi-region active-active; each region owns a shard range; budgets are per-region with a global reconciler for whales. |
| Q5 | Fail-open (allow on limiter outage) or fail-closed? | Business decision: SLA protection vs. cost protection. | Fail-open by default, fail-closed for billing-critical and compute-expensive endpoints (e.g., /inference/*). Policy flag per endpoint. |
| Q6 | Who owns policy? Central team or tenant self-service? | Drives the admin API and RBAC surface area. | Self-service with tiered defaults; platform team owns shared limits (e.g., Redis, DB). |
| Q7 | Do we need burst handling, or only steady-state? | Token bucket vs. GCRA vs. pure sliding window. | Yes — burst + sustained. Prefer GCRA (token-bucket equivalent, O(1) state). |
| Q8 | What's the cost appetite — premium low-latency or best-effort? | Dictates replication, sharding, and whether we can co-locate limiter with gateway. | Tight latency budget but cost-sensitive. One counter check should cost <$0.000001. BOE in §3. |

I'd spend about 60 seconds on these, land on the defaults above, and say "I'll note these assumptions and surface anywhere they change the design."


2 Functional Requirements #

In scope (numbered):

  1. FR1: CheckAndIncrement(key, cost, policy_id) — atomic, returns (allowed, remaining, reset_ts, retry_after).
  2. FR2: Multi-dimensional keys — compose any subset of {user_id, tenant_id, client_ip, api_route, api_key, region}. A single request may fire 3–5 checks (user AND tenant AND IP).
  3. FR3: Hierarchical limits — a tenant's aggregate across users cannot exceed tenant cap; user cap is per-user within that tenant.
  4. FR4: Multi-window per policy — {1s burst: 100, 1m sustained: 3000, 1h: 100k} simultaneously.
  5. FR5: Cost-weighted checks — e.g., one call to /embed-large costs 10 tokens, /list costs 1.
  6. FR6: Policy admin API — ConfigurePolicy, ListPolicies, GetUsage(key), ResetKey(key) (ops escape hatch). Versioned and rolled out gradually.
  7. FR7: Observability — per-policy drop/allow counters, top-N keys by drop rate, hot-key detection.
  8. FR8: Quota "reservations" / leases — let clients request N tokens up front for a burst, reducing hot-path RTT.

Out of scope (say out loud):

  • DDoS/L3-L4 attack mitigation → CDN & edge firewall.
  • Billing itself (we feed usage events to billing; we are not the system of record for invoices).
  • Authn/authz of the caller → API gateway responsibility; we trust the signed identity it passes.
  • Arbitrary user-defined policy languages — we accept a fixed schema.
  • Circuit breaking / adaptive retry policy on downstreams → separate service (envoy outlier detection).
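To make FR2, FR3, and FR5 concrete, here is a minimal sketch of how a gateway SDK might compose per-dimension checks. The `limiter.check_and_increment(...)` call and the `Decision` fields are illustrative names, not a real client library; production folds co-located dimensions into one Lua script (§5.2) rather than looping.

from dataclasses import dataclass

@dataclass
class Decision:
    allowed: bool
    remaining: int
    retry_after_ms: int
    scope: str            # which dimension produced this decision

def check_request(limiter, policy_id: str, user: str, tenant: str, ip: str, cost: int = 1) -> Decision:
    """Fire one check per dimension (FR2); a deny on ANY dimension denies the request (FR3)."""
    dims = {"user": user, "tenant": tenant, "ip": ip}
    worst = Decision(True, 2**31 - 1, 0, "none")
    for kind, value in dims.items():
        # Hypothetical SDK call; in production, co-located dims (tenant+user) go through a
        # single Lua script so a deny never leaves partial increments behind (§5.2).
        d = limiter.check_and_increment(policy_id, {kind: value}, cost=cost)
        if not d.allowed:
            return Decision(False, 0, d.retry_after_ms, kind)   # most restrictive dimension wins
        if d.remaining < worst.remaining:
            worst = Decision(True, d.remaining, 0, kind)
    return worst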

3 Non-Functional Requirements + Capacity Estimate #

NFRs (with explicit SLO targets)

| NFR | Target | How we achieve it |
|-----|--------|-------------------|
| Availability of limiter check | 99.99% effective (fail-open converts most failures to "allow" so user-facing availability stays ≥ gateway's 99.99%) | Fail-open on timeout >2ms; 3× Redis replication; cell-based isolation per tenant tier. |
| p50 / p99 / p99.9 latency added to the gateway | 0.3ms / 1.5ms / 4ms | Co-locate limiter SDK in gateway process; 1 RTT (<0.5ms intra-AZ) to Redis shard; Lua script <50µs execution. |
| Consistency | Strong within region & key (Redis atomic Lua); eventual cross-region. | Shard a key to one primary region; global reconciler for whales. |
| Throughput | 1M RPS × ~3 avg dims = 3M counter ops/sec global sustained, 6M peak. | Sharded Redis; two-tier with local leases (§7b) reduces central ops ~5×. |
| Cost ceiling | <$0.50 per million checks (all-in, incl. Redis infra + limiter fleet). | Small cell footprint, heavy use of Lua to cut RTTs, token leases. |
| Durability of counter state | Not durable — loss on shard failure is acceptable (re-fills from zero; brief over-allow for ≤1 window). | In-memory Redis + replicas; no AOF fsync on hot path. |
| Durability of policy config | Strong; multi-region replicated; immutable versions. | Spanner/CockroachDB for policies; Redis is only an edge cache. |

Back-of-envelope math (everything calculated)

Request fan-out: 1 API request → 3 checks average (user, tenant, IP) → 3M counter ops/sec baseline, model peak at 2× = 6M ops/sec.

Per-key state size (Redis):

  • GCRA stores 1 float64 TAT (8 B) per (key, policy). Plus 80 B Redis overhead per key (dict entry + embedded strings + expiration metadata in Redis 7). → **90 B/key**.
  • If we use sliding-window-counter with 2 buckets: 2 × int64 + timestamp ≈ 24 B data + 80 B overhead = ~100 B/key.

Active unique keys at any moment:

  • 10M users × assume 30% active in last 10 min window = 3M user keys.
  • 100k tenants (est.) × 100% active = 100k tenant keys.
  • IPs: assume 5M unique in any 1-min window.
  • Policies: 10 windows × 3 dimensions = up to 30 keys/entity, but most entities touch <5 windows actively → multiplier ~5×.
  • Total active keys ≈ (3M + 0.1M + 5M) × 5 = 40.5M keys.

Total memory: 40.5M × 100 B = 4.05 GB. With 3× replication and 30% headroom: ~16 GB. Plus hash tables, slab waste, fragmentation — budget 32 GB. This fits 1 mid-size Redis node, so we're throughput-bound, not memory-bound. Good — that means we shard for ops/sec, not for RAM.

Redis shard throughput ceiling:

  • A single Redis instance sustains ~100k ops/sec for simple GET/SET, but our Lua scripts are more expensive (hash-field read + conditional write + TTL touch + return). Measured on modern hardware: ~30–40k Lua CheckAndIncrement ops/sec per shard at p99 <500µs without blocking.
  • Headroom rule: run at ≤50% → 15k ops/sec/shard target.

Shard count:

  • 6M ops/sec ÷ 15k/shard = 400 shards nominal.
  • With two-tier (§7b) dropping 80% of non-whale traffic to local leases: **100 shards**.
  • Each shard has 1 primary + 2 replicas = 300 nodes. Fit on ~50 hosts at ~6 nodes/host.
  • Each node: 8 vCPU, 16 GB RAM (Redis is single-threaded for the data path; extra cores do I/O multiplexing + RDB + monitoring).
  • Cost ballpark: 50 hosts × $400/mo = $20k/mo Redis infra + replication network.

Limiter service fleet (if we use a central service in front of Redis rather than pure SDK):

  • Per-instance: ~50k RPS at 1ms p99 on a 4-vCPU pod (thin client, no business logic).
  • 6M RPS ÷ 50k = 120 pods. 2× for HA = 240 pods. Roughly $5–8k/mo.
  • Decision: we go SDK-in-gateway, not central service. Eliminates 1 network hop, 1 serialization, 1 failure mode. The "service" surface exists only for admin APIs and rare control operations.

Network:

  • Each gateway ↔ Redis check: ~200 B request + 100 B response = 300 B. At 6M ops/sec: 1.8 GB/s = ~14 Gbps. Easily within a single AZ's cross-connect; spread across 10 AZs it's 1.4 Gbps per AZ, trivial.
  • Cross-region replication: only policy config (writes are rare, <10 QPS) + whale reconciliation (<100 QPS). Negligible.

Cost check: $20k Redis + $8k limiter pods + $5k admin/config/observability = ~$33k/mo. At 6M checks/sec × 86,400 × 30 ≈ 15.5 trillion checks/mo, that works out to ~$0.002 per million checks. Well under budget.
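The same arithmetic as a scratch calculation; every input is an assumption stated earlier in this section, so treat it as a back-of-envelope sketch rather than a sizing tool.

# Back-of-envelope reproduction of §3 (all inputs are the assumptions above).
peak_rps           = 1_000_000                 # gateway requests/sec at peak
checks_per_request = 3                         # user + tenant + IP
peak_factor        = 2
ops_per_sec        = peak_rps * checks_per_request * peak_factor       # 6M counter ops/s

active_keys        = (3e6 + 0.1e6 + 5e6) * 5                           # ≈ 40.5M keys
bytes_per_key      = 100
memory_gb          = active_keys * bytes_per_key / 1e9                 # ≈ 4.05 GB (one copy)

shard_target_ops   = 15_000                    # 50% of a ~30k Lua-ops/s ceiling
central_fraction   = 0.20                      # two-tier leases absorb ~80% (§7b)
shards             = ops_per_sec * central_fraction / shard_target_ops # ≈ 80 → round up to 100
nodes              = 100 * 3                   # primary + 2 replicas per shard

monthly_checks     = ops_per_sec * 86_400 * 30                         # ≈ 1.55e13
monthly_cost_usd   = 33_000                    # Redis + limiter pods + admin/observability
cost_per_million   = monthly_cost_usd / (monthly_checks / 1e6)         # ≈ $0.002

print(f"{ops_per_sec=:,} {memory_gb=:.2f} {shards=:.0f} {nodes=} {cost_per_million=:.4f}")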


4 High-Level API #

All APIs are gRPC over HTTP/2 with Protobuf. gRPC because: (a) bidirectional streaming for bulk admin, (b) deadline propagation, (c) tight codegen. JSON/HTTP exposed at admin-console edge only.

service RateLimiter {
  // Hot path — called from gateway SDK; inline-ed as a direct Redis Lua call 
  // in practice, but the SDK exposes this surface for testability.
  rpc CheckAndIncrement(CheckRequest) returns (CheckResponse) {
    option deadline_default_ms = 3; // fail-open if we exceed
  }

  // Reserve N tokens up front — used by token-lease pattern (§7b).
  rpc Lease(LeaseRequest) returns (LeaseResponse);

  // Admin plane
  rpc ConfigurePolicy(Policy) returns (PolicyAck);
  rpc GetUsage(UsageRequest) returns (UsageResponse);
  rpc ResetKey(ResetRequest) returns (ResetAck);   // ops escape hatch, audited
  rpc ListPolicies(ListReq) returns (stream Policy);
}

message CheckRequest {
  string policy_id = 1;                  // "api.read.v3"
  map<string, string> dimensions = 2;    // {"user":"u123","tenant":"t7","ip":"1.2.3.4"}
  int32 cost = 3;                        // token cost; default 1
  string request_id = 4;                 // idempotency key, 10-minute dedupe window
}

message CheckResponse {
  bool allowed = 1;
  int32 remaining = 2;                    // min across multi-window; may be stale
  int64 reset_unix_ms = 3;                // when *this* check's most constrained window resets
  int32 retry_after_ms = 4;               // server-suggested backoff; 0 when allowed
  string policy_version = 5;              // "api.read.v3#47" for debugging
  Quota matched_quota = 6;                // which of {user,tenant,ip} caused the drop
}

enum Code {
  OK = 0;
  DENY = 1;              // limit exceeded — HTTP 429
  POLICY_NOT_FOUND = 2;  // HTTP 400 (misconfig)
  INTERNAL = 13;         // gateway maps to fail-open-or-closed per endpoint flag
  DEADLINE_EXCEEDED = 4; // same as INTERNAL for fail-open logic
}

Idempotency semantics. request_id is a client-generated UUID with a 10-minute TTL. If the gateway retries due to a network blip, the limiter dedupes by request_id and returns the original decision without re-incrementing — critical because a retry storm otherwise double-counts every failing request. Stored in a small Redis hash adjacent to the counter, EXPIRE 600.
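A minimal sketch of that dedupe step using redis-py. The key name `rl:idem:<request_id>` and the plain string value are illustrative simplifications (the text above describes a hash adjacent to the counter); the 600s TTL matches the 10-minute window.

import json
import redis

r = redis.Redis()   # assumed connection to the shard that owns the counter

def check_with_dedupe(request_id: str, do_check) -> dict:
    """Return the cached decision for a retried request_id, else run the check and cache it."""
    idem_key = f"rl:idem:{request_id}"          # illustrative key layout
    cached = r.get(idem_key)
    if cached is not None:
        return json.loads(cached)               # retry: original decision, no re-increment
    decision = do_check()                       # runs the EVALSHA check from §5.3; returns a dict
    r.set(idem_key, json.dumps(decision), ex=600, nx=True)   # 10-minute dedupe window
    return decision

Two concurrent retries can race past the GET; the worst case is one extra increment, which is acceptable for this path.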

Response headers the gateway emits on 429:

HTTP/1.1 429 Too Many Requests
Retry-After: 2                    # integer seconds (RFC 7231)
X-RateLimit-Limit: 3000
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1734567890     # unix seconds
X-RateLimit-Policy: api.read.v3
X-RateLimit-Scope: tenant         # which dim was tripped

Retry-After carries jittered advice: base_ms + U[0, base_ms], so denied clients resume spread over [base, 2×base] instead of all at the same instant. Otherwise the next window starts with a synchronized thundering herd.
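A sketch of how the gateway might build those 429 headers with the jitter applied; header names follow the example above, everything else (function names, rounding to whole seconds) is illustrative.

import random

def jittered_retry_after_ms(base_ms: int) -> int:
    """Spread retries over [base, 2*base] so a denied cohort doesn't return in lockstep."""
    return base_ms + random.randint(0, base_ms)

def headers_for_deny(limit: int, reset_unix_s: int, policy: str, scope: str, retry_after_ms: int) -> dict:
    return {
        "Retry-After": str(max(1, round(jittered_retry_after_ms(retry_after_ms) / 1000))),  # whole seconds
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": "0",
        "X-RateLimit-Reset": str(reset_unix_s),
        "X-RateLimit-Policy": policy,
        "X-RateLimit-Scope": scope,
    }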


5 Data Schema + Engine Choice #

5.1 Policy config (durable)

Stored in Spanner (or CockroachDB if open-source-preferred). Why: policies are low-QPS writes (<10/s globally), high-QPS reads (every gateway pod caches them), and must be consistent cross-region — a tenant's just-lowered limit should propagate in seconds. TTL caches at each gateway (60s), with push-based invalidation via pub/sub for urgent changes.

Table Policy (
  policy_id    STRING PRIMARY KEY,   -- "api.read.v3"
  version      INT64,                -- monotonic; every edit = new version
  rule_type    ENUM(GCRA, SLIDING_WINDOW_COUNTER, FIXED_WINDOW),
  windows      ARRAY<Window>,        -- [{span_ms:1000,limit:100},{span_ms:60000,limit:3000}]
  dimensions   ARRAY<STRING>,        -- ["user","tenant","ip"]
  fail_policy  ENUM(OPEN, CLOSED),
  cost_map     MAP<ROUTE, INT>,      -- {"/embed":10, "/list":1}
  owner_team   STRING,
  created_at   TIMESTAMP,
  valid_from   TIMESTAMP,            -- staged rollout
  rollout_pct  FLOAT                 -- 0..1 for gradual apply
)

5.2 Counter state (volatile)

Stored in Redis. The key encoding is load-bearing:

Key format: rl:{<shard_tag>}:<policy_id>:<dim_kind>:<dim_value>
e.g.:       rl:{t7}:api.read.v3:tenant:t7
            rl:{t7}:api.read.v3:user:u123   <-- same hashtag → same shard → atomic pipelining

The {...} is Redis Cluster's hashtag — it pins all keys sharing a tag to the same shard. We tag by tenant_id so a single request's tenant+user checks land on one shard and we can MULTI/EXEC or one Lua script for both. IP checks use a separate keyspace ({ip-shard}), because IPs have no tenant relationship.
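A small helper showing the intended encoding; it is a sketch of the rule above (tenant hashtag for tenant-scoped dims, separate keyspace for IPs), and the crc32-based IP slot spreading is an assumption, not part of the design text.

import zlib
from typing import Optional

def counter_key(policy_id: str, dim_kind: str, dim_value: str, tenant_id: Optional[str] = None) -> str:
    """Build the Redis key. Tenant-scoped dims share the {tenant} hashtag; IPs get their own keyspace."""
    if dim_kind == "ip":
        tag = f"ip-{zlib.crc32(dim_value.encode()) % 64}"   # illustrative: spread IPs over their own slots
    else:
        tag = tenant_id or dim_value
    return f"rl:{{{tag}}}:{policy_id}:{dim_kind}:{dim_value}"

# counter_key("api.read.v3", "tenant", "t7", "t7")        -> "rl:{t7}:api.read.v3:tenant:t7"
# counter_key("api.read.v3", "user", "u123", "t7")        -> "rl:{t7}:api.read.v3:user:u123"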

Value depends on algorithm. For GCRA (our default, §7a):

Value: float64 TAT (theoretical arrival time, unix ms)
TTL:   max(window) × 2, refreshed on each write

For sliding-window-counter:

Value: hash { cur_window_start: int64, cur_count: int64, prev_count: int64 }
TTL:   2 × window_span

5.3 Lua scripts

We ship two scripts, SHA-preloaded at gateway startup (SCRIPT LOAD → every subsequent call is EVALSHA):

-- check_and_increment.lua (GCRA)
-- KEYS[1] = counter key
-- ARGV    = { cost, period_ms, limit, burst }   (time comes from the shard via TIME, not the client)
-- Returns { allowed, remaining, retry_after_ms }
local key   = KEYS[1]
-- Use the shard's own clock so client/pod clock skew never affects the decision (§10).
local t     = redis.call('TIME')                       -- {seconds, microseconds}
local now   = t[1] * 1000 + math.floor(t[2] / 1000)    -- server time in ms
local cost  = tonumber(ARGV[1])
local emit  = tonumber(ARGV[2]) / tonumber(ARGV[3])    -- emission interval per token (ms)
local burst = tonumber(ARGV[4])
local delay_tol = emit * burst

local tat = tonumber(redis.call('GET', key)) or now
local new_tat = math.max(tat, now) + emit * cost
local allow_at = new_tat - delay_tol

if now < allow_at then
  local retry = math.ceil(allow_at - now)
  return {0, 0, retry}
end

redis.call('SET', key, new_tat, 'PX', math.ceil(math.max(new_tat - now, delay_tol) * 2))  -- keep state until it can no longer affect a decision (§5.2)
local remaining = math.floor((delay_tol - (new_tat - now)) / emit)
return {1, remaining, 0}

This compiles down to three Redis commands (TIME, GET, SET) plus a little arithmetic — measured at <50µs on a modern shard. Critical threshold: Lua scripts longer than ~100µs stall the shard's single-threaded event loop, starving all other keys on that shard. We keep Lua dead-simple and push anything non-trivial to the SDK.
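On the SDK side, the usual pattern is EVALSHA with a one-time EVAL fallback if a shard lost its script cache (e.g., after a failover). A sketch with redis-py; the script body and ARGV order are the GCRA script above, the class name is illustrative.

import redis
from redis.exceptions import NoScriptError

GCRA_LUA = open("check_and_increment.lua").read()     # the script from §5.3, shipped with the SDK

class LimiterClient:
    def __init__(self, client: redis.Redis):
        self.r = client
        self.sha = client.script_load(GCRA_LUA)       # SCRIPT LOAD at startup

    def check(self, key: str, cost: int, period_ms: int, limit: int, burst: int):
        args = (cost, period_ms, limit, burst)
        try:
            allowed, remaining, retry_ms = self.r.evalsha(self.sha, 1, key, *args)
        except NoScriptError:
            # Shard restarted / failed over and lost the script cache: fall back once, re-cache.
            allowed, remaining, retry_ms = self.r.eval(GCRA_LUA, 1, key, *args)
            self.sha = self.r.script_load(GCRA_LUA)
        return bool(allowed), int(remaining), int(retry_ms)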

5.4 Engine choice — WHY Redis (and why not the alternatives)

| Option | Pros | Cons | Verdict |
|--------|------|------|---------|
| Redis (our pick) | Single-threaded atomicity → Lua is free; 40k Lua ops/sec/shard; mature ops; hashtags for co-location; pub/sub for invalidation. | Single-threaded = hot-key cap ~40k/s on one key; memory-only data model. | ✅ Chosen. |
| Memcached | Simpler, multi-threaded. | No scripting → can't make check+increment atomic; CAS is 2 RTTs in the abort path; no TTL precision we need. | ❌ Atomicity is the core requirement. |
| Aerospike | Cross-DC replication; very fast. | Overkill for transient counter state; licensing; ops team doesn't know it. | ❌ Cost-benefit is wrong for us. |
| DynamoDB w/ conditional update | Fully managed; regional replication. | Per-op billing at 1M RPS is eye-watering (~$250/hr on-demand); 3–5ms p99 (too slow); hot-partition throttling above 3k ops/s on one key. | ❌ |
| Cassandra/Scylla | High write throughput. | Eventually consistent; no atomic read-modify-write on a single partition without LWT (Paxos, 4 RTTs); p99 >10ms under load. | ❌ |
| In-memory + gossip (Lyft's early approach, old envoy-ratelimit) | No network hop; zero latency. | Each gateway has only its own view → can't enforce a global limit; only works with a central authority or if you accept ~N× over-allow where N = gateway count. | ❌ alone; ✅ as the local tier in the two-tier design (§7b). |
| Redis Cell module (CL.THROTTLE) | Ships GCRA natively; zero custom Lua. | Must compile & deploy a module; not available in managed Redis (ElastiCache / MemoryDB), locks us out of managed hosting. | ❌ for cross-cloud; ✅ if we ran self-managed. We steal its algorithm but implement it in Lua. |

Why Redis over self-sharded open-source alternatives: operational maturity. A rate limiter is on the critical path; we want to debug with tooling our SRE team has held a pager for.


6 System Diagram (Centerpiece) #

                                 ┌─────────────────────────────────────────┐
                                 │                 Users                   │
                                 └──────────────────┬──────────────────────┘
                                                    │ HTTPS
                                                    ▼
                            ┌──────────────────────────────────────────────┐
                            │          CDN / Edge (Cloudflare/Fastly)       │
                            │  L3/L4 DDoS scrub + per-IP edge rate-limit    │
                            │  (rough, 1000 RPS/IP, anti-abuse only)        │
                            └──────────────────────────┬───────────────────┘
                                                       │ HTTPS
                                                       ▼
┌──────────────────────────────────────────────────────────────────────────────────┐
│                       API GATEWAY FLEET (Envoy + custom filter)                   │
│                  ~2000 pods × 10 AZs × 3 regions | 1M peak RPS global             │
│                                                                                   │
│  ┌───────────────────────────────────────────────────────────────────────────┐    │
│  │  per-request filter chain:                                                │    │
│  │                                                                           │    │
│  │   ① authn/authz → ② RATE LIMIT FILTER ───────────────── ③ upstream proxy  │    │
│  │                        │                                                  │    │
│  │                        ├── LOCAL TIER (§7b):                              │    │
│  │                        │   in-proc token bucket, per (tenant,policy)      │    │
│  │                        │   refilled by lease RPC. Hit rate ≈ 80%.         │    │
│  │                        │   Latency: 0 network hops, <5µs.                 │    │
│  │                        │                                                  │    │
│  │                        └── CENTRAL TIER (miss path, 20% of reqs):         │    │
│  │                            SDK → Redis shard (EVALSHA). ≤1 RTT.           │    │
│  │                                                                           │    │
│  │  Policy cache (60s TTL) pulled from Spanner via sidecar.                  │    │
│  │  Fail-open flag per policy → on Redis timeout / error:                    │    │
│  │    OPEN  → allow request, emit "limiter_failed{policy=X}" metric          │    │
│  │    CLOSED→ 503 with Retry-After, emit same + page on sustained            │    │
│  └───────────────────────────────────────────────────────────────────────────┘    │
└─────────┬──────────────────────────┬────────────────────────────────┬─────────────┘
          │                          │                                │
          │ gRPC (admin)             │ Lua EVALSHA                    │ async usage events
          │ <10 QPS                  │ 6M peak ops/s                  │ 1M RPS (batched, 100/batch)
          │ p99 <50ms                │ p99 <1ms intra-AZ              │ fire-and-forget
          ▼                          ▼                                ▼
┌────────────────────┐  ┌──────────────────────────────────┐  ┌──────────────────────┐
│ Rate Limiter       │  │  REDIS CLUSTER                    │  │  Kafka "usage.events"│
│ Control Plane      │  │  ~100 shards × 3 replicas = 300   │  │  →  Flink/Beam       │
│ (stateless, 20 pods│  │  nodes, 8 vCPU / 16GB each        │  │  → warehouse (BQ)    │
│ behind ALB)        │  │                                   │  │  → policy tuning     │
│                    │  │  Hash-tag by {tenant_id}:         │  │  → billing lineage   │
│  - ConfigurePolicy │  │    - user+tenant co-located       │  │                      │
│  - GetUsage        │  │    - IP keyspace separate shards  │  │  (NOT the rate-limit │
│  - ResetKey (RBAC) │  │                                   │  │   decision path)     │
│  - Leasing for     │  │  Replication: async, 1 AZ-primary │  │                      │
│    whales (§7b)    │  │    + 2 cross-AZ replicas          │  │                      │
│                    │  │                                   │  │                      │
│                    │  │  Per shard: 30k Lua ops/sec peak, │  │                      │
│                    │  │    running at 15k (50% headroom)  │  │                      │
│                    │  │                                   │  │                      │
│                    │  │  Failover: Sentinel → redirect    │  │                      │
│                    │  │    <10s; gateway fails open on    │  │                      │
│                    │  │    MOVED storm if >1% err rate    │  │                      │
│                    │  └──────────────────────────────────┘  └──────────────────────┘
│        │           │                                                  │
│        │ reads/writes                                                 │
│        ▼                                                              ▼
│ ┌──────────────────┐                                         ┌───────────────────┐
│ │ Spanner          │                                         │ Observability     │
│ │  Policy table    │                                         │  (Prom/Grafana,   │
│ │  cross-region    │                                         │   OTel, Splunk)   │
│ │  RW ~10 QPS      │                                         │  - allow/deny rate│
│ │  strong consis   │                                         │  - fail-open rate │
│ │                  │                                         │  - hot-key top-N  │
│ └──────────────────┘                                         └───────────────────┘
└────────────────────┘

Global reconciler (whale tenants only):
  Every 1s, Rate Limiter Control Plane reads whale counters across regions
  via cross-region Spanner pub/sub → recomputes each region's "lease quota"
  → pushes back to local-tier caches. Bounded skew ≤1s × global_limit/N_regions.

Every labelled arrow maps to a call in §4 or state in §5:

  • Gateway → Redis: EVALSHA check_and_increment on rl:{tenant}:policy:dim:val (§5.2).
  • Gateway → Control Plane: ConfigurePolicy / GetUsage / Lease (§4).
  • Gateway → Kafka: async UsageEvent{key, cost, allowed, ts} for analytics, never blocks decision.
  • Control Plane ↔ Spanner: policy CRUD, versioned.
  • Redis primary → replicas: async replication.
  • Control Plane → gateways (push): policy invalidation via pub/sub when rollout_pct changes.

7 Deep-Dive: Three Critical Topics #

7a. Algorithm choice: why GCRA, with rigorous comparison

Why this is critical: the algorithm is where accuracy meets memory meets atomicity. A wrong pick either blows the memory budget (sliding-window-log) or silently allows 2× the burst (fixed-window boundary bug) or can't be made atomic in Lua.

Four candidates, compared on the axes that matter:

| Algorithm | Memory/key | Accuracy | Burst behavior | Lua cost | Clock sensitivity |
|-----------|------------|----------|----------------|----------|-------------------|
| Fixed window | 1 counter | ❌ Bad | At the window boundary a client can send the full limit in the last 1ms of window 1 + the full limit in the first 1ms of window 2 → 2× burst. | Trivial (INCR) | Sensitive: clock skew shifts the boundary. |
| Sliding-window log | O(N), N = max events in window; for a 1s/1000 policy = 1000 timestamps = 8 KB/key | ✅ Exact | Perfect | Expensive: ZADD + ZREMRANGEBYSCORE + ZCARD, ~5–10 Lua ops | Low (compares timestamps directly). |
| Sliding-window counter | 2 counters + 1 timestamp ≈ 24 B | ✅ Good (≈2% error on typical traffic; see below) | Approximates perfect via weighted sum | Cheap | Medium — relies on window alignment. |
| GCRA / token bucket | 1 float64 TAT = 8 B | ✅ Exact | Smooth; naturally supports burst via the delay-tolerance parameter | Cheap (~6 Lua ops) | Low. |

Sliding-window-counter math (worth stating out loud, shows earned depth): At time t inside the current window of span W, with previous-window count P and current-window count C:

effective_count(t) = P × (1 − (t − window_start) / W) + C

This linearly interpolates from the prior window, assuming traffic was uniform within it. Error: in a pathological worst case (all of the prior window's traffic in its final instant), the estimate can be off by nearly the whole of P; with real traffic the misestimate is typically a few percent.
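A quick worked example of the weighting, with illustrative numbers: prior-minute count P = 3000, current count C = 1200, 40% of the way through the current window.

def effective_count(prev_count: float, cur_count: float, elapsed_frac: float) -> float:
    """Sliding-window-counter estimate: weight the previous window by the un-elapsed fraction."""
    return prev_count * (1.0 - elapsed_frac) + cur_count

# 40% into a 60s window with P=3000, C=1200:
# 3000 * 0.6 + 1200 = 3000  -> exactly at a 3000/min limit, so the next request is denied.
print(effective_count(3000, 1200, 0.4))   # 3000.0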

GCRA (chosen):

GCRA (Generic Cell Rate Algorithm, from ATM networks) stores a single TAT (theoretical arrival time). On a request of cost c:

  1. emit_interval = period / limit (e.g., 1s / 100 = 10ms/token).
  2. tat' = max(tat, now) + emit_interval × c
  3. allow_at = tat' − (emit_interval × burst)
  4. If now < allow_at: reject, retry_after = allow_at − now. Else accept and write tat'.
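The same four steps as a minimal in-process GCRA, useful as a local-tier bucket or to unit-test the Lua script against a reference. A sketch under the definitions above, not the production path (which runs in Redis).

import time

class GCRA:
    def __init__(self, limit: int, period_s: float, burst: int):
        self.emit = period_s / limit           # seconds per token (emission interval)
        self.delay_tol = self.emit * burst     # how far ahead of schedule we tolerate
        self.tat = 0.0                         # theoretical arrival time

    def allow(self, cost: int = 1, now: float | None = None) -> tuple[bool, float]:
        now = time.monotonic() if now is None else now
        new_tat = max(self.tat, now) + self.emit * cost
        allow_at = new_tat - self.delay_tol
        if now < allow_at:
            return False, allow_at - now       # (denied, retry_after_seconds)
        self.tat = new_tat
        return True, 0.0

# 100 req/s with a burst of 10: the 11th back-to-back request is denied, retry in ~10 ms.
bucket = GCRA(limit=100, period_s=1.0, burst=10)
print([bucket.allow(now=0.0)[0] for _ in range(11)])   # ten Trues, then False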

GCRA is mathematically equivalent to token bucket:

  • Token bucket: bucket with capacity B, refill rate R = 1/emit. Take c tokens if available.
  • GCRA: same thing, reformulated so state is a single float instead of (bucket_level, last_refill_ts).

Why GCRA over token bucket in Redis Lua specifically:

  • Token bucket needs two fields (level + ts) → either a hash (HGETALL+HMSET, 2 ops) or a string encoding. GCRA is one float field, one GET, one SET. That's 2× fewer Lua ops → 2× more shard throughput at the margin.
  • No floating drift: token bucket's level drifts every refill; GCRA recomputes fresh.

Chosen: GCRA default, sliding-window-counter opt-in for teams that want the human-readable "remaining count" to be precise. Fixed window explicitly rejected — we document the boundary bug so no team asks "why not simpler?"

Real systems: Stripe, Cloudflare's edge limiter, and Shopify's API limiter all use token-bucket-family algorithms. Redis Cell implements GCRA as a native module.

7b. Hot-key mitigation: two-tier leasing for whale tenants

Why this is critical: one Redis shard's ceiling on one key is ~30–40k ops/sec. A whale tenant sending 100k RPS will either: (a) saturate their shard and starve other tenants sharing it, or (b) get artificially throttled below their SLA. Neither is acceptable. This is the hardest problem in distributed rate limiting and where L7 shows up.

Problem statement:

  • 99% of tenants are under 100 RPS → fine.
  • Top 0.1% (whales) are 10k–100k RPS sustained.
  • A single key (e.g., rl:{whale_t}:api:tenant:whale_t) cannot be atomically incremented at 100k/s because Redis is single-threaded per shard.

Three approaches, quantified:

| Approach | Max RPS/key | Accuracy | Added latency | Complexity |
|----------|-------------|----------|---------------|------------|
| (A) Single central counter | ~30k | Exact | 1 RTT | Low |
| (B) Sharded central counter (fan key into N sub-keys) | N × 30k | Exact, but client must sum | 1 RTT to one of N shards | Medium; lookups need all N |
| (C) Two-tier: local lease from central authority | ~unlimited | Bounded error ≤ lease_size × gateway_count | 0 RTT on fast path | Higher |

Chosen: C. Here's the mechanism:

  1. Each gateway pod maintains a local token bucket per (tenant, policy), starting with 0 tokens.
  2. On miss/empty, the pod calls Lease(tenant, policy, want=L) to the Rate Limiter service, which atomically decrements the central Redis counter by L and returns L tokens (or fewer) plus an expiry.
  3. The pod spends those tokens locally at line rate; when it runs low (e.g., <20% remaining) it pre-fetches another lease.
  4. On lease expiry, unused tokens are not returned (simpler, slight over-throttle); pods renew.

The key parameter — lease size L — is a 3-way trade-off:

  • Larger L → fewer central calls, more throughput, but larger error bound. If each of 2000 gateway pods holds 100 unused tokens at the instant a whale's limit drops, we over-allow by up to 200k before the next lease refresh.
  • Smaller L → tighter accuracy but central calls increase; at L=1 we're back to the single-key bottleneck.
  • Sweet spot formula: L = max(1, floor(gateway_local_RPS × refresh_interval)). For a gateway seeing 50 RPS of a whale: a 100ms refresh gives L = 5 but, across 2000 pods, ~20k central lease calls/sec for that one tenant, more than a shard's 15k target and self-defeating. A 1s refresh gives L = 50 and ~2k central calls/sec (≈13% of a shard's target). We default to the longer interval and let adaptation (next bullet) take it from there.
  • Adaptive: if central denies a lease, shrink L by half (AIMD-style); if hit-rate on leases is 100%, grow by 1. Same pattern as TCP congestion control.
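A sketch of the pod-local lease bucket with that AIMD adaptation. The `lease_rpc(want=...)` callable stands in for the Lease RPC from §4; class and parameter names are illustrative.

import time

class LocalLease:
    """Per-(tenant, policy) local bucket refilled by Lease() calls to the central tier (a sketch)."""
    def __init__(self, lease_rpc, min_lease: int = 1, max_lease: int = 1000):
        self.lease_rpc = lease_rpc             # hypothetical: lease_rpc(want) -> tokens granted
        self.tokens = 0
        self.L = min_lease                     # current lease size, adapted AIMD-style
        self.min_lease, self.max_lease = min_lease, max_lease

    def _refill(self):
        granted = self.lease_rpc(want=self.L)
        if granted < self.L:                   # central pushed back: multiplicative decrease
            self.L = max(self.min_lease, self.L // 2)
        else:                                  # fully served: additive increase
            self.L = min(self.max_lease, self.L + 1)
        self.tokens += granted

    def try_acquire(self, cost: int = 1) -> bool:
        if self.tokens < cost:
            self._refill()                     # production prefetches below a ~20% watermark instead
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False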

Accuracy bound (bookable): In the worst case, at any instant, unused tokens across the fleet ≤ #pods × L. With L=50, pods=2000: over-allow is bounded at 100k tokens, i.e. ~1s of traffic at a 100k RPS limit. Within SLA for a system that doesn't promise stronger than 1-second precision.

Failure modes unique to this design:

  • Central counter lost (Redis shard failover): pods exhaust their current lease, then call central → timeout → per-policy fail_policy: either allow-at-local-rate (OPEN) or drop (CLOSED). Crucially, in OPEN mode a shard outage means every pod enforces only its local limit, which with 2000 pods means no effective global limit. That is intentional on a degraded path, and it is clearly surfaced in metrics so SRE sees it within one scrape interval (15s).
  • Pod crash with unspent lease: tokens are lost (not returned). We deliberately don't implement return — it would require 2-phase commit on the hot path. Over-throttle by ~L/pod is acceptable.
  • Clock skew across pods: GCRA uses Redis server time (now = TIME in Lua) for central; local tier uses monotonic clock. Pod clocks don't matter for central decision. We bound pod clock skew elsewhere (NTP, §10).

Real production analogs: Google's Doorman (token leasing) and its successor in Google's internal quota system; AWS's Service Quotas does a similar hierarchical broadcast; Envoy's global-rate-limit + local-rate-limit composition is effectively a 2-tier.

7c. Fail-open vs fail-closed — chosen per endpoint, with math

Why this is critical: the most common rate-limiter incident is "limiter goes down, gateway 500s, full-site outage." The design decision must be explicit, policy-driven, and reviewed as rigorously as a firewall rule.

The question framed correctly: "what's the cost of wrongly allowing a request we should have denied, vs. the cost of wrongly denying a request we should have allowed, when the limiter itself is unhealthy?"

| Endpoint class | Cost of over-allow during outage | Cost of over-deny during outage | Decision |
|----------------|----------------------------------|---------------------------------|----------|
| Read APIs (/list, /get) | Low — stateless, bounded by downstream capacity. | High — user experience degraded; SLA impact. | Fail-open. |
| Billing-chargeable APIs (/inference/*large, /export) | High — real $ cost to company per over-allow × hours. | Medium — customer sees 503 → retries later. | Fail-closed. |
| Auth writes (/login, /token) | Security risk — brute-force window opens. | High — users can't log in. | Fail-closed-ish: fallback to an edge-limited, much stricter, still-available coarse limiter (IP-only). |
| Admin/internal (/admin/*) | Low; small volume. | Low; staff can retry. | Fail-closed. |
| Free-tier abuse-prone (/register) | High — floods create account pollution. | Low — new users can retry. | Fail-closed. |

This is encoded as the fail_policy field on the policy (§5.1). Review is part of the policy-change RFC. No endpoint's default is implicit.

Math for fail-open SLO: gateway target is 99.99% available. If limiter is independently 99.9% available and fail-open converts limiter failures to allows, the effective request-success rate is:

P(success) = P(gw up) × [P(limiter up) × P(not-rate-limited) + P(limiter down) × 1]
           = 0.9999 × [0.999 × 0.97 + 0.001 × 1]
           = 0.9999 × 0.97003
           ≈ 0.9699 (headline availability is unchanged by the limiter's failure mode)

Whereas fail-closed gives 0.9999 × [0.999 × 0.97 + 0.001 × 0] ≈ 0.9689, a ~1000 ppm availability regression. For read paths, fail-open is objectively correct unless the cost of serving an over-allowed request exceeds the value of staying available through a limiter outage.

Failure detection:

  • Gateway SDK treats Redis error OR deadline (3ms) as failure.
  • Error budget: if >1% of checks fail in a 30s window, trip a circuit breaker for that shard → fail-open (or -closed) without even trying Redis, saving latency budget.
  • Half-open probes every 5s to recover.
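A sketch of that per-shard breaker with the thresholds from the bullets above (1% error rate over 30s, 5s half-open probes); the class and method names are illustrative, not an existing library.

import time

class ShardBreaker:
    """Trip when >1% of checks to a shard fail within a 30s window; allow one probe every 5s while open."""
    def __init__(self, window_s: float = 30.0, err_threshold: float = 0.01, probe_interval_s: float = 5.0):
        self.window_s, self.err_threshold, self.probe_interval_s = window_s, err_threshold, probe_interval_s
        self.events: list[tuple[float, bool]] = []     # (timestamp, was_error)
        self.opened_at: float | None = None

    def record(self, ok: bool, now: float | None = None):
        now = time.monotonic() if now is None else now
        self.events.append((now, not ok))
        self.events = [(t, e) for t, e in self.events if now - t <= self.window_s]
        errs = sum(1 for _, e in self.events if e)
        if self.events and errs / len(self.events) > self.err_threshold:
            self.opened_at = now                       # trip: skip Redis, apply the policy's fail mode

    def allow_attempt(self, now: float | None = None) -> bool:
        """True if the SDK should try Redis: always when closed, one probe per interval when open."""
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        if now - self.opened_at >= self.probe_interval_s:
            self.opened_at = now                       # half-open: let one request probe the shard
            return True
        return False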

Operational rule: an SRE can flip the global flag to "force fail-open" in one command during an incident, regardless of per-policy flags. This is the "big red switch." Audited, time-limited (auto-revert in 1h).

Real systems: Envoy's failure_mode_deny config is exactly this flag. Cloudflare's rate limiter defaults to fail-open for the same reasoning above.


8 Failure Modes & Resilience (table, pager-carryable) #

| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|-----------|---------|-----------|--------------|------------|----------|
| One Redis shard primary | Crash / network partition | Sentinel heartbeat loss 3× (3s); gateway observes ECONNREFUSED or MOVED | 1/100th of keys; ≈60k of 6M ops/s | Sentinel promotes replica (<10s); gateway SDK retries once; if still failing, fail-open per policy | Promoted replica takes over; old primary rejoins as replica on recovery; 1-min reconciliation |
| Whole Redis cluster | AZ outage or ops mistake | Widespread 5xx from shard fleet; fail-open rate metric spikes | Global. Non-leased tenants lose enforcement; whales run on local leases only (bounded error) | Global circuit breaker → fail-open for OPEN policies; fail-closed for CLOSED (with Retry-After jitter to avoid herd) | Restore cluster, re-preload Lua scripts, watch drop-rate metric return to baseline |
| Policy store (Spanner) | Unavailable | Write failures (ConfigurePolicy 5xx); cached reads OK | Cannot change policies; existing policies keep enforcing from gateway's 60s cache + longer fallback cache | Gateways extend cache TTL to 1h when Spanner unhealthy; alert fires; block admin writes | Restore Spanner, drain alert, roll back TTL |
| Rate Limiter control plane pods | Crash / scaling lag | LB health-check; Lease RPC errors | Whale tenants lose fresh leases; fall back to "allow at last known rate" or fail-open per policy | Gateway SDK has a stale-lease budget: it extends the current lease's expiry by 1 min on control-plane unavailability | Control plane returns → normal lease refresh resumes |
| Gateway pod itself | Crash | Envoy/k8s health-check | Loses local tokens (unspent lease leaked) | Accept the small over-throttle; whale's other pods still hold leases | New pod starts with zero, fetches lease on first request |
| Hot key (whale tenant) | Whale RPS > 30k on one key | Shard CPU >80%; one key dominates the "hot-key top-N" metric | Shard starvation for all tenants on it | Auto-promote whale to two-tier mode (§7b); if already there, grow their lease size L to pull load off the central shard; if central stays saturated, alert | Re-shard the tenant to an isolated shard cell (admin op) |
| Clock skew on a Redis node | NTP drift > 1s | Heartbeat clock-delta metric vs. other replicas | Window boundaries wrong on that shard → over/under counting by up to skew × rate | Eject node; rely on replica | Fix NTP, re-add as replica after reconciliation |
| Config poisoning (bad policy pushed) | Global drop spike after a ConfigurePolicy | Drop-rate anomaly post-deploy | Could block legit traffic globally if the policy was bad | Every policy is staged: rollout_pct starts at 1%; auto-rollback if drop-rate regression >2σ | Revert to previous policy version (policies are immutable-versioned) |
| gRPC deadline too tight under tail latency | p99.9 spikes on high-fanout requests | p99.9 check latency > 3ms budget | Fail-open triggers cascade on OPEN paths (no real harm); fail-closed paths see 503 spikes | Per-policy deadline tuning; TCP_NODELAY; keepalive tuning | If the slow tail traces to Redis (keyspace scans, fragmentation), tune key eviction and reduce fragmentation |

9 Evolution Path #

v1 (ship in 1 week — proof of correctness, limited scale):

  • Gateway-local in-memory token bucket only, no cross-pod sync.
  • Each pod enforces limit / num_pods as a rough share.
  • Correct for small tenants (quickly averaged out); wildly over-allows whales.
  • Policy in a YAML file checked into the gateway repo, reloaded on SIGHUP.
  • Good for: internal dogfood, beta tenants, anti-accident protection.
  • Not good for: real SLA enforcement, billing, hostile traffic.

v2 (ship in ~6 weeks — central enforcement):

  • Redis cluster with sharding by tenant hashtag.
  • Lua GCRA scripts, SHA-loaded.
  • Spanner-backed policy store with staged rollout.
  • Per-request central check; no local tier yet.
  • Fail-open default; per-policy flag for fail-closed.
  • Good for: 80% of teams, <50k RPS tenants.
  • Hits the wall on: whales, per-request latency on deep fan-out (§3 shows ~1.5ms/check × 3 checks = 4.5ms gateway overhead).

v3 (ship in ~3 months — two-tier, multi-region):

  • Local tier with lease-based tokens from central (§7b).
  • Whale tenants auto-detected (high-RPS metric); promoted to dedicated shard cells.
  • Multi-region active-active: each region owns its own counters for most traffic; whale reconciler at 1 Hz for tenants with global limits.
  • Idempotency (§4) for retry safety.
  • Advanced observability: per-policy drop breakdown, hot-key top-N, fail-open rate SLO.
  • Good for: the 1M RPS target, whale tolerance, regional growth.

v4 (beyond — research):

  • Approximate counters (HLL for unique-users-per-tenant-per-minute policies) at 10× memory savings.
  • Predictive lease sizing: ML-driven adaptive L based on tenant traffic patterns.
  • Multi-region strong consistency for specific billing-critical policies via Spanner-backed counters (trade 5ms latency for correctness on a handful of policies).

10 Out-of-1-Hour Notes #

Clock skew / NTP drift. GCRA uses Redis-server time (via redis.call('TIME') in Lua) exclusively — we never trust client timestamps for decision-making. Cross-shard skew bounded by NTP (chrony) to <10ms within a DC. Sliding-window-counter, however, is more fragile: if two shards see different "window_start" boundaries, their weighted sums diverge. For policies with multi-shard aggregation (rare, §7b whale path), we round window_start to the nearest second — skew impact becomes "the whole fleet agrees on the boundary within 1s." This is why we hash-tag single-tenant keys to one shard: clock skew becomes a non-issue for 99% of checks.

Counter vs. log vs. sketch trade-offs.

  • Counter (GCRA, SWC): exact enforcement, 8–100 B/key, what we ship.
  • Log (window-log): byte-exact history but 8 B × events/window = explodes at scale. Only useful for audit/debugging.
  • Sketch (CountMin, HLL): approximate, sub-linear memory. Good for "count distinct IPs per tenant per hour" where we only care about cardinality. HLL is ~12 KB for any cardinality with <2% error. For our core limit-enforcement, sketches are wrong tool (false positives = wrong decisions on real traffic).

Redis Cluster vs Sentinel vs ElastiCache vs MemoryDB.

  • Sentinel: single-writer, easy, doesn't shard. OK for <100k ops/sec — not us.
  • Redis Cluster (self-managed): exactly what we need; hash-slot sharding, MOVED redirects, full control over Lua, custom modules if we ever want Redis Cell. Ops burden.
  • ElastiCache Cluster Mode: managed, cross-AZ, good latency. Can't load custom modules (Redis Cell out). Our needs are Lua-only — fits.
  • AWS MemoryDB: Redis-compatible, durable (Multi-AZ WAL). Unnecessary durability at 3× cost; counter state is intentionally ephemeral.
  • Chosen ladder: dev → Sentinel; prod → ElastiCache Cluster Mode if all-AWS, or self-managed Redis Cluster if multi-cloud. Same Lua scripts work on all three.

DDoS interplay. The CDN/edge does coarse per-IP limiting (~1k RPS/IP) — that's L3/L4 protection, not billing/fairness. Our service begins where identity is established (post-TLS, post-authn). Splitting concerns avoids the common mistake of asking one service to both protect the origin (huge RPS) and enforce fairness (requires identity) — those have different latency, precision, and failure tolerances.

Observability gold standards.

  • RED (rate/error/duration) on CheckAndIncrement per-policy, per-shard.
  • Per-policy drop rate as a labelled counter; rate(drops{policy="x"}[1m]) > threshold alerts the policy owner.
  • Fail-open rate is a first-class SLO: if >0.1% of decisions are fail-open in a rolling 5-min, page SRE — this is our signal that the limiter is degraded.
  • Hot-key top-N: periodic sampled histogram of key → ops-per-second; feeds whale-auto-promote.
  • Allow vs drop histograms with policy version labels so we can audit "did this policy bump drop rates?"

Regulatory / billing lineage. If usage events feed billing, the pipeline must be authoritative: Redis is not. We emit UsageEvent to Kafka after the decision, consumed by the billing pipeline with its own deduplication (via request_id). Drop-rate metrics may legitimately disagree with billing counts (e.g., we count "denied" events that billing doesn't). Always separate decision plane from ledger plane.

Privacy & PII. client_ip is PII in some jurisdictions (EU, CA). Hash at the gateway before it hits Redis: ip_key = HMAC(region_salt, ip) rotating salt quarterly. Hashed keys preserve enforcement; can't be reversed to the raw IP. Same pattern for user_id if tenants request PII-minimized logging.
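A sketch of that hashing step (HMAC-SHA256, truncated); the salt source and its quarterly rotation are assumptions consistent with the text above.

import hmac, hashlib

def ip_key(region_salt: bytes, ip: str) -> str:
    """One-way, salted IP pseudonym: stable within a salt rotation period, not reversible to the raw IP."""
    return hmac.new(region_salt, ip.encode(), hashlib.sha256).hexdigest()[:16]

# ip_key(current_quarter_salt, "203.0.113.7") -> 16 hex chars (64 bits), used as the dim_value in §5.2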

Testing specifics I'd put on the team's runbook:

  • Load-test with a real-traffic replay, not synthetic uniform — synthetic hides hot-key pathologies.
  • Chaos: randomly kill 10% of Redis pods; assert fail-open rate is visible within 1 scrape interval and never breaches 1% in aggregate.
  • Policy rollout is always staged (1% → 10% → 100%) with auto-rollback on drop-rate regression >2σ. Fully revertible because policies are immutable-versioned.
  • Game-day: "whale tenant doubles their RPS in 30s" — validate lease adapter converges within 5s without starvation.

Verification checklist (done before submission) #

  1. SRE pager-carryable? Yes — §8 is a runbook table with detection / mitigation / recovery.
  2. Every diagram arrow → real API/data flow? Yes — each arrow in §6 is cross-referenced to §4 or §5.
  3. Deep-dive L7? Yes — §7a provides the SWC-weighting formula and Lua-micro-timing constraint; §7b derives lease-size math with bounded-error proof and AIMD adaptation; §7c shows the per-endpoint decision with a success-probability calculation that flips a 1000-ppm availability outcome.
  4. Capacity math closes? 6M ops/s ÷ 15k/shard = 400 → two-tier drops 80% → 100 shards × 3 replicas = 300 nodes. Memory 40.5M × 100 B × 3 = 12 GB, fits. Cost ≈ $33k/mo = $0.002 per million checks. Yes.