Networking & Traffic Management
Design an L4 (Transport-Layer) Load Balancer
Route TCP and UDP flows to healthy backends while preserving per-flow stickiness and keeping the data path extremely fast.
1 Problem Restatement & Clarifying Questions #
Restatement (what I'm building). A software L4 load balancer that receives TCP/UDP packets on a set of virtual IPs (VIPs) and forwards them to a pool of backend servers, preserving the 5-tuple session affinity so that every packet of a given flow lands on the same backend for the flow's lifetime. The control plane discovers healthy backends via active health checks and pushes the membership set to the data plane. The service has to absorb 5K peak QPS against a 500-ish-node backend fleet where each backend does ~10 QPS.
Clarifying questions (with defaults I'll assume if the interviewer says "your call"):
| # | Question | Default I'll adopt | Why it matters |
|---|---|---|---|
| Q1 | Is "QPS" == new TCP connections/sec, or requests over long-lived conns? | new-conn/sec. 5K new-conn/s is a far more stressful model for an L4 LB than 5K reqs over a handful of kept-alives. If it's the easier case the design still holds. | Conntrack insertion rate and SYN-flood surface differ by 100× between the two. |
| Q2 | TCP-only or TCP+UDP? | TCP + stateless UDP (QUIC/DNS). | UDP has no SYN/FIN so flow reaping is purely TTL-based. |
| Q3 | Per-connection bandwidth / payload shape? | ~4 KB/req avg, ~50 KB p99, typical "web API" shape. | Drives bandwidth + LB NIC sizing. |
| Q4 | Client affinity required (sticky sessions)? | Yes — once a client lands on backend B, stay on B until flow closes or B dies. L4 means we cannot touch HTTP cookies; affinity is 5-tuple-derived. | Forces consistent hashing, not round-robin. |
| Q5 | TLS termination on the LB? | No — pass-through (DSR-friendly). TLS terminates on the backend. | Keeps us L4, lets us do Direct Server Return and removes a CPU tax from the LB. |
| Q6 | Single region or multi-region? | Single region for v1/v2, multi-region anycast in v3 (§9). | Scoping. |
| Q7 | Who owns DNS / VIP advertisement? | We own VIPs, announced via BGP/ECMP from the LB hosts. DNS returns the anycast VIP. | Lets us scale LB horizontally without DNS thrash. |
| Q8 | Budget for HW offload (SmartNIC, DPDK)? | Software-only (XDP/eBPF) for v1-v2, SmartNIC optional in v3. | Most of the interesting design is in software; HW is a tail-latency optimization. |
I'll explicitly confirm Q1 with the interviewer if they haven't pinned it — the entire sizing hinges on it.
2 Functional Requirements #
FR-1. Accept TCP SYNs and UDP datagrams on one or more VIPs.
FR-2. For every flow (5-tuple = {src_ip, src_port, dst_ip, dst_port, proto}), deterministically select exactly one healthy backend and forward all packets of that flow to it.
FR-3. Preserve flow affinity for the lifetime of the flow, even if the backend set changes (minimal disruption property).
FR-4. Detect unhealthy backends within a bounded time (target: 3-5 s, §3) and stop sending new flows to them; drain existing flows gracefully.
FR-5. Support dynamic backend registration/deregistration via a control-plane API.
FR-6. Survive the loss of any single LB data-plane node without flow-level disruption for flows on other LB nodes, and with bounded disruption for flows pinned to the lost node.
FR-7. Provide per-VIP and per-backend observability: conn/sec, bytes/sec, active flows, health state, ejection events.
Out of scope (with reasons):
- TLS termination, HTTP path routing, cookie stickiness — that's L7. If we needed it we'd put an L7 (Envoy/HAProxy) fleet behind the L4 tier (this is literally what GFE+Maglev or ELB-NLB→ELB-ALB does).
- WAF / bot protection — L7 concern; belongs on Envoy or a dedicated tier.
- Rate-limiting at the request level — L4 doesn't see "requests." We can do per-5-tuple SYN-rate limiting and per-source-IP SYN-cookie triggers, and that's it.
- Geographic routing / latency-based steering — anycast + BGP communities do a crude version; proper geosteering is a DNS-layer concern (GSLB).
- Backend-to-backend service mesh — separate problem.
3 NFRs + Capacity Estimate #
3.1 NFRs
| Dimension | Target | Justification |
|---|---|---|
| Availability (per VIP) | 99.99% (52 min/yr) | SPOF-grade. A single LB outage blackholes a VIP. |
| Data-plane added latency | p50 ≤ 50 µs, p99 ≤ 250 µs | Kernel-bypass (XDP/DPDK) gets p50 sub-50 µs; generic netfilter/IPVS is p50 ~200 µs. |
| Control-plane propagation | Backend-set change → all LB nodes ≤ 3 s | Must be faster than typical deploy "drain window" (5-10 s). |
| Flow-affinity disruption during backend add/remove | ≤ 1/N flows re-hashed for N backends (Maglev minimal-disruption property) | Modulo hashing re-hashes ~(N-1)/N; unacceptable. |
| Conntrack capacity per LB node | ≥ 2M active flows | See §3.3 math. |
| Packets-per-second per LB node | ≥ 1 Mpps on commodity HW with XDP | Leaves headroom even if avg packets/flow jumps. |
| Cost | ~$X/VIP/month = LB fleet cost ÷ VIP count. Targeting <5% of backend fleet cost. | L4 should be a rounding error vs. backends. |
3.2 QPS, PPS, bandwidth math (from first principles)
Given: peak 5K QPS; each backend = 10 QPS.
Backend fleet (raw) = 5,000 / 10 = 500 servers
+ 30% headroom (N+k, deploys, hot-spot absorption)
= 500 × 1.30 = 650 servers
+ AZ/rack diversity (lose 1 of 3 AZs → keep 5K QPS)
Each AZ must carry 5,000 / 2 = 2,500 QPS with one AZ down
So per-AZ capacity ≥ 2,500 / 10 = 250 servers × 3 AZs = 750 servers
Call it 750 backends. (Interviewer-facing number: "500 raw, 750 with headroom+AZ.")
PPS budget (TCP). Assume an average "request" = ~10 packets (3-way handshake + GET + response split into MSS=1460 segments for ~4 KB + ACKs + FIN). Small-payload API:
Peak PPS at LB ingress = 5,000 QPS × 10 pkts/req = 50,000 pps
With egress symmetry = 100,000 pps (if NOT DSR)
With DSR (LB only sees ingress) = 50,000 pps
50K pps is trivial — a single modern NIC handles 14 Mpps line-rate at 10 GbE. PPS is not our bottleneck; conntrack insertion rate and HA are.
Bandwidth.
Avg req+resp bytes = 4 KB req + 4 KB resp = 8 KB
Peak ingress+egress bw = 5,000 × 8 KB × 8 bit/byte = 320 Mbps
With DSR (LB only ingress half) = 160 Mbps
p99 burst (4 KB req + 50 KB resp) = 5,000 × 54 KB × 8 = ~2.2 Gbps
Also trivial on 10 GbE. Again, the thing that saves us or kills us is not bandwidth — it's conntrack table sizing and control-plane freshness.
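The arithmetic above is easy to sanity-check in a few lines (all constants are this section's assumptions; the burst case uses 4 KB request + 50 KB p99 response):

```python
# Back-of-envelope check of the §3.2 numbers (inputs are the section's assumptions).
PEAK_QPS = 5_000
PKTS_PER_REQ = 10            # handshake + GET + segmented response + ACKs + FIN
REQ_BYTES = 4 * 1024
RESP_BYTES = 4 * 1024
P99_RESP_BYTES = 50 * 1024

ingress_pps = PEAK_QPS * PKTS_PER_REQ                  # LB sees ingress only under DSR
avg_bw_bps = PEAK_QPS * (REQ_BYTES + RESP_BYTES) * 8   # both halves, non-DSR
p99_bw_bps = PEAK_QPS * (REQ_BYTES + P99_RESP_BYTES) * 8

print(f"ingress pps (DSR): {ingress_pps:,}")               # 50,000
print(f"avg bandwidth:     {avg_bw_bps / 1e6:.0f} Mbps")   # ~328 (section rounds to 320)
print(f"p99 burst:         {p99_bw_bps / 1e9:.2f} Gbps")   # ~2.2
```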
3.3 Conntrack memory math (THE important calculation)
Conntrack entry size (realistic, not toy):
struct flow_entry {
u32 src_ip; 4 B
u32 dst_ip; 4 B (VIP; redundant but keeps entry self-contained)
u16 src_port; 2 B
u16 dst_port; 2 B
u8 proto; 1 B
u8 pad; 1 B
u32 backend_id; 4 B (not IP — dense 32-bit ID, we resolve via table)
u64 last_seen_ns; 8 B (monotonic, for idle reaping)
u64 bytes_tx, bytes_rx; 16 B
u32 pkts_tx, pkts_rx; 8 B
u32 state_flags; 4 B (SYN_SEEN, ESTABLISHED, FIN_SEEN, ...)
u32 next_hash; 4 B (chained-hash overflow pointer)
}; = ~58 B raw → round to 64 B for cacheline alignment.
Peak active flows per LB node, assuming flows are sticky for ~60 s avg (keep-alive API):
New conns/sec at one LB (5K ÷ 4 LB nodes) = 1,250 cps
Avg flow lifetime = 60 s
Steady-state active flows per LB = 1,250 × 60 = 75,000 flows
Plus idle-not-yet-reaped overhang (TTL = 120 s for ESTABLISHED, 60 s for TIME_WAIT):
Worst-case active-ish = 75K × 2 = 150,000 flows
Conntrack memory per LB = 150K × 64 B = 9.6 MB.
That is tiny — and it's the point. People worry about conntrack blowing up; at our scale it doesn't. It blows up when (a) you're Google-scale (100M+ cps), or (b) you fail to reap, or (c) an attacker floods SYNs. So the sizing I actually defend to the interviewer is:
Provision for 10× headroom (SYN-flood absorption) = 1.5 M flows
= 96 MB per LB node.
Still trivial. But this changes if you do not do DSR and the LB has to keep conntrack for return traffic — double it. And it changes massively if the interviewer says "now it's 5M QPS not 5K QPS," which leads naturally into the Maglev-style stateless hashing discussion in §7.1.
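The same sizing as a script (constants from §3.3; `LB_NODES = 4` is the per-AZ worst case used above):

```python
# Conntrack sizing from §3.3 (all inputs are the section's assumptions).
PEAK_CPS = 5_000        # treating "QPS" as new connections/sec (Q1 default)
LB_NODES = 4            # per-AZ worst case used above
AVG_FLOW_LIFETIME_S = 60
OVERHANG_FACTOR = 2     # idle-not-yet-reaped entries still in the table
ENTRY_BYTES = 64        # cacheline-rounded flow_entry
SYN_FLOOD_HEADROOM = 10

cps_per_lb = PEAK_CPS // LB_NODES                   # 1,250 new conns/s
steady_flows = cps_per_lb * AVG_FLOW_LIFETIME_S     # Little's law: 75,000
active_ish = steady_flows * OVERHANG_FACTOR         # 150,000
provisioned = active_ish * SYN_FLOOD_HEADROOM       # 1,500,000
mem_mb = provisioned * ENTRY_BYTES / 1e6
print(f"provision {provisioned:,} flows = {mem_mb:.0f} MB per LB")  # 96 MB
```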
3.4 LB fleet sizing
Per-LB capacity (XDP, commodity, conservative):
New-conn/sec budget = 100 Kcps
Active flows = 2M
PPS = 1 Mpps
Peak load:
New-conn/sec = 5K
Active flows = 150K (calc above)
PPS = 50K
Fleet headroom argument:
Any single LB node failing must NOT cause VIP degradation.
Use N+2 active-active: min 3 nodes.
Add deploy/upgrade slack: 4 nodes.
Per AZ × 3 AZs = 12 LB nodes total.
Interview number: 3-4 LB nodes per AZ, 12 total across 3 AZs, ECMP'd behind a single anycast VIP. Each node runs at low single-digit % of its budget (≤1,250 cps vs 100 Kcps; ~12.5 Kpps vs 1 Mpps) — that's a feature, not waste: it means our tail latency stays low and a SYN-flood that 10×'s traffic doesn't tip us over.
3.5 Control-plane sizing
Backend events = 750 backends × ~1 deploy/day × 2 events (drain+register)
= ~1,500 events/day ≈ 1 event / 60 s avg.
Health checks = 750 backends × 1 check/s (per LB) × 12 LBs = 9K checks/s
→ we'll centralize this, see §7.3.
Config push = full snapshot = 750 × ~100 B/entry = 75 KB. Trivial.
The control plane is a tiny service — but it's on the critical path for correctness, so we treat it with Paxos-grade care (§5, §7.3).
4 High-Level API #
Three surfaces: (a) control plane — register/deregister/health, (b) data plane — packet processing contract, (c) admin/observability.
4.1 Control-plane gRPC (preferred; HTTP-JSON equivalent noted)
service LBControlPlane {
// Backend lifecycle — idempotent.
rpc RegisterBackend(RegisterBackendReq) returns (RegisterBackendResp);
rpc DeregisterBackend(DeregisterBackendReq) returns (Ack);
rpc DrainBackend(DrainBackendReq) returns (Ack); // stop NEW flows; keep existing
rpc UpdateBackend(UpdateBackendReq) returns (Ack); // weight, metadata
// Health reporting — push model from health-checker workers to registry.
rpc ReportHealth(stream HealthReport) returns (stream HealthAck);
// VIP / pool management.
rpc CreatePool(CreatePoolReq) returns (Pool);
rpc BindVIPToPool(BindVIPReq) returns (Ack);
// Data-plane subscription — LB nodes stream config from registry.
rpc WatchPool(WatchPoolReq) returns (stream PoolSnapshot);
}
message RegisterBackendReq {
string pool_id = 1;
string backend_id = 2; // stable; survives IP change (e.g. "host-42-nic0")
string ip = 3;
uint32 port = 4;
uint32 weight = 5; // default 100
map<string,string> labels = 6; // rack, az, version — used for %-deploy policies
string readiness_probe_url = 7; // used by health-checker
}
message PoolSnapshot {
string pool_id = 1;
uint64 version = 2; // monotonic, used by LB nodes for dedupe
repeated BackendEntry backends = 3; // FULL list, not diff — idempotent apply
MaglevTable table = 4; // pre-computed lookup table (see §7.1)
}
message HealthReport {
string backend_id = 1;
Health state = 2; // HEALTHY | UNHEALTHY | DRAINING | UNKNOWN
uint64 checked_at_ns = 3;
uint32 consecutive_pass = 4;
uint32 consecutive_fail = 5;
uint32 probe_latency_us = 6;
}
Error codes (gRPC status → semantics):
| Code | Meaning | Caller action |
|---|---|---|
| `OK` | Applied | — |
| `ALREADY_EXISTS` | Backend with same `backend_id` registered | Safe to ignore (idempotent); else call `UpdateBackend` |
| `FAILED_PRECONDITION` | Pool membership at capacity (Maglev table size) | Page SRE; or bump table size |
| `DEADLINE_EXCEEDED` | Registry couldn't reach quorum | Retry with jitter; don't fail closed |
| `UNAVAILABLE` | Leader election in progress | Retry ≤ 5 s with backoff |
| `RESOURCE_EXHAUSTED` | Health-report stream flow-control window full | Back off; reduce reporting cadence |
4.2 Data-plane packet contract (no API — packet semantics)
Per incoming packet to VIP:
on_packet(pkt):
k = hash_5tuple(pkt) # siphash-2-4 or crc32 + seed
if is_tcp(pkt) and (SYN_ONLY or RST or NEW_FLOW_HEURISTIC):
backend = maglev_lookup(pool.table, k) # O(1), stateless
conntrack_insert(k, backend, now) # best-effort; see §7.1 for why
else:
backend = conntrack_lookup(k)
if miss:
backend = maglev_lookup(pool.table, k) # graceful fallback
encapsulate_or_dsr(pkt, backend)
forward(pkt)
Two forwarding modes supported (concrete choice in §7.2):
- DSR (Direct Server Return) — MAC-rewrite or IP-in-IP encap; response goes client-direct, not via LB. Preferred.
- NAT mode (SNAT+DNAT) — LB rewrites src/dst; both directions traverse LB. Fallback for networks that can't do DSR (cross-AZ, certain cloud VPCs).
4.3 Admin / observability HTTP endpoints
GET /healthz → LB process liveness
GET /readyz → has pool snapshot + table loaded?
GET /metrics → Prometheus exposition
GET /admin/pools/{id} → JSON dump of current membership + Maglev table stats
POST /admin/pools/{id}/drain/{bid} → manual drain (ops escape hatch)
GET /admin/flows?vip=&backend= → sampled conntrack dump (for debug; rate-limited)
5 Data Schema #
5.1 Conntrack table (per LB node, in-memory only — ephemeral state)
| Field | Type | Notes |
|---|---|---|
| `flow_key` | u128 (packed 5-tuple) | Partition key; hashed with SipHash |
| `backend_id` | u32 | Index into backend table (NOT an IP) |
| `last_seen_ns` | u64 | Monotonic ns; drives TTL reaping |
| `state` | u8 | SYN_RECV / ESTABLISHED / FIN_WAIT / TIME_WAIT |
| `bytes/pkts` | 4× u32-u64 | Observability only |
- Storage engine: in-process open-addressed robin-hood hash table (cache-friendly), sharded per CPU (one shard per RX queue) to avoid lock contention. 1.5M entries × 64 B = 96 MB per node (§3.3).
- TTL: 120 s ESTABLISHED / 60 s TIME_WAIT / 30 s SYN_RECV (matching the §3.3 sizing). Lazy reaper walks 1/256 of the table per tick.
- Persistence: none. This is the key L4 insight — conntrack is rebuild-from-scratch-able because of Maglev's consistent-hashing fallback (§7.1). We deliberately do not replicate it; replication would be the #1 scalability bottleneck.
- Sync across LB nodes: none (see §7.1 deep-dive). Katran, Maglev, and GLB all get away with this.
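A toy single-shard sketch of this behavior, assuming a plain dict in place of the open-addressed table (TTL values are §3.3's; `Conntrack` and its methods are illustrative names):

```python
import time

# TTLs in seconds (§3.3's values); the real table is per-CPU, open-addressed.
TTL = {"ESTABLISHED": 120, "TIME_WAIT": 60, "SYN_RECV": 30}

class Conntrack:
    def __init__(self):
        self.table = {}  # flow_key -> (backend_id, state, last_seen)

    def insert(self, key, backend_id, state="SYN_RECV", now=None):
        now = now if now is not None else time.monotonic()
        self.table[key] = (backend_id, state, now)

    def lookup(self, key, now=None):
        now = now if now is not None else time.monotonic()
        entry = self.table.get(key)
        if entry is None:
            return None                       # miss -> caller falls back to Maglev
        backend_id, state, last_seen = entry
        if now - last_seen > TTL[state]:
            del self.table[key]               # lazy reap on access; no persistence
            return None
        self.table[key] = (backend_id, state, now)  # refresh on every packet
        return backend_id

ct = Conntrack()
key = ("10.0.0.1", 40000, "203.0.113.1", 443, 6)
ct.insert(key, backend_id=17, now=0.0)
assert ct.lookup(key, now=10.0) == 17        # hit within SYN_RECV TTL
assert ct.lookup(key, now=60.0) is None      # idle 50 s > 30 s -> reaped
```

Losing an entry is harmless by design: the caller's Maglev fallback re-derives the same backend.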
5.2 Backend / server registry (strongly consistent store)
| Field | Type | Index | Notes |
|---|---|---|---|
| `backend_id` | string | PK | Stable ID; survives IP change |
| `pool_id` | string | secondary | Every backend belongs to exactly one pool |
| `ip`, `port` | inet, u16 | — | Current forwarding target |
| `weight` | u16 | — | For weighted Maglev |
| `state` | enum | secondary | PENDING, HEALTHY, DRAINING, UNHEALTHY, EVICTED |
| `az`, `rack`, `version` | strings | — | Used for topology-aware placement, canary deploys |
| `registered_at`, `last_state_change_ns` | timestamps | — | Audit |
| `readiness_url` | string | — | Distinct from liveness (§7.3) |
- Storage engine: etcd or Spanner/ZooKeeper. We need linearizable reads + watch semantics + small total size (750 rows × ~500 B ≈ 375 KB — fits in RAM of any consensus store). I'd pick etcd for an AWS/GCP cloud deploy (simple, well-understood, fits watch-pattern) and Spanner at Google for the obvious reasons (existing infra, global scope).
- Partitioning: none needed at 750 rows; shard by `pool_id` if it grows past 10K backends.
- TTL: none on these rows; TTLs apply to the health records in §5.3.
5.3 Health status store (eventually consistent, high-churn)
| Field | Type | TTL | Notes |
|---|---|---|---|
| `backend_id` | string | — | FK to registry |
| `checker_id` | string | — | Which health-checker reported |
| `state` | enum | — | Same enum as registry |
| `last_probe_ns` | u64 | — | For staleness detection |
| `consecutive_fail` | u32 | — | Used by quorum logic |
| `record_ttl` | — | 15 s | If no checker reports → UNKNOWN |
- Storage engine: in-memory + periodic snapshot to etcd (or a Redis-class store). The authoritative state comes from quorum across checkers (§7.3), not from any one row. This is the liveness-vs-readiness split.
- Why not put this in the same store as §5.2? Write rate: even centralized, 6 checkers × 750 backends = 4.5K writes/s (9K w/s in the every-LB-checks model), against the registry's ~1 write/min steady-state churn — several orders of magnitude apart. Keeping them separate prevents health-check hot-writes from swamping registry consensus.
5.4 Maglev lookup table (derived, cached)
| Field | Type | Notes |
|---|---|---|
| `pool_id`, `version` | string, u64 | |
| `table` | u32[M], M prime (e.g., 65,537 or 65,521) | Each slot = `backend_id` |
| `weights_digest` | hash | For cheap "did this change?" checks |
- Storage engine: computed by the control plane (O(M × N) ≈ 65K × 750 = 50M ops, ~50 ms), pushed over the `WatchPool` stream. Stored in LB process memory as a flat `u32[65537]` array — 256 KB. O(1) lookup, zero allocations on the hot path.
6 System Diagram (ASCII) — the centerpiece #
6.1 Macro view
INTERNET
│
│ TCP SYN / UDP datagram
▼
┌───────────────────────┐
│ Edge DNS (GSLB) │ Control: ~1 query/client-session
│ → returns ANYCAST │ Data: N/A (out-of-band)
│ VIP 203.0.113.1 │
└───────────┬───────────┘
│ DNS A/AAAA answer
▼
(client dials VIP)
│
╔══════════════════╧══════════════════╗
║ BGP / ECMP ROUTERS (Tier-0) ║
║ Anycast VIP 203.0.113.1/32 ║
║ ECMP hash = 5-tuple, stateless ║
╚══════╦════════════╦════════════════╝
│ │ (router picks 1-of-N LBs
│ │ via consistent 5-tuple hash;
│ │ ~1 packet per flow decision)
┌─────────────┘ └─────────────┐
│ │
Protocol: IPv4/v6 packet Protocol: IPv4/v6 packet
Rate: ~12.5K pps per LB (peak) Rate: same
(50K pps peak ÷ 4 LB nodes per AZ)
│ │
▼ ▼
┌──────────────────────┐ ┌──────────────────────┐
│ LB Node az-a #1 │ │ LB Node az-a #2 │
│ ┌────────────────┐ │ │ ┌────────────────┐ │
│ │ XDP fast path │ │ │ │ XDP fast path │ │
│ │ (eBPF) — pkts/ │ │ │ │ (eBPF) │ │
│ │ NIC RX queue │ │ │ │ │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ │ │ │ │ │
│ ┌───────▼────────┐ │ │ ┌───────▼────────┐ │
│ │ Maglev lookup │ │ │ │ Maglev lookup │ │
│ │ u32[65537] │ │ │ │ u32[65537] │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ ┌───────▼────────┐ │ │ ┌───────▼────────┐ │
│ │ Conntrack hash │ │ │ │ Conntrack hash │ │
│ │ (per-CPU) │ │ │ │ (per-CPU) │ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
│ ┌───────▼────────┐ │ │ ┌───────▼────────┐ │
│ │ DSR encap: │ │ │ │ DSR encap: │ │
│ │ IPIP or L2-DNAT│ │ │ │ IPIP or L2-DNAT│ │
│ └───────┬────────┘ │ │ └───────┬────────┘ │
└──────────┼───────────┘ └──────────┼───────────┘
│ │
│ Encapsulated to backend (DSR) │
│ Protocol: IPIP (proto 4) or GRE │
│ Rate: ~12.5K pps per LB │
│ │
└─────────────┬───────────────────────┘
│
▼
╔═════════════════════════════════════════════════════════╗
║ BACKEND POOL (750 hosts across az-a / az-b / az-c) ║
║ Each handles 10 QPS; each owns VIP as loopback alias ║
║ Decap → deliver to socket listening on VIP:port ║
║ Response: src=VIP (spoofed), dst=client, direct ║
╚════════════════════════════╦════════════════════════════╝
│
│ Return traffic (DSR) ─ bypasses LB
▼
to INTERNET
6.2 Control plane (separate from data plane)
┌─────────────────────────────────┐
│ SRE / Ops (kubectl, UI, API) │
└───────────────┬─────────────────┘
│ gRPC RegisterBackend / DrainBackend
▼
┌─────────────────────────────────────┐
│ LB Controller (leader-elected) │
│ Replicas: 3 (Raft quorum) │
│ - Accepts admin CRUD │
│ - Runs Maglev table compute │
│ - Publishes PoolSnapshot stream │
└──────────────┬──────────────────────┘
│ writes registry & table
▼
┌─────────────────────────────────────┐
│ etcd / Spanner (strong consist.) │
│ Tables: backends, pools, vips, │
│ maglev_tables │
│ Write rate: ~1/s; Read: 12 LBs │
│ watching │
└──────────────┬──────────────────────┘
│ Watch stream, PoolSnapshot @ ≤3 s p99
┌────────────────────────────┼────────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ LB az-a#1 │ │ LB az-a#2 │ ... │ LB az-c#4 │
└───────────┘ └───────────┘ └───────────┘
AND, running in parallel:
┌─────────────────────────────────────┐
│ Health Checker Fleet │
│ (6 workers, 2 per AZ) │
│ - Active TCP+HTTP probes │
│ - 1 check/s per backend per checker│
│ - 6 × 750 = 4,500 probes/s total │
└──────────────┬──────────────────────┘
│ gRPC ReportHealth (streaming)
▼
Health Aggregator (in controller)
│ quorum: state = majority opinion
▼
etcd write
│
▼
triggers Maglev table recompute
(if state changed)
6.3 Failure-scenario view — single LB node death
T0: LB az-a #2 kernel panics.
BGP session to router drops within 1-3 s (BFD).
T1 (≤3 s):
Router removes LB#2 from ECMP group.
Consistent-hash over remaining {LB#1, LB#3, LB#4}.
~1/4 of flows that had been landing on LB#2 are now
steered to one of the other 3.
T1+ε:
Those redirected flows arrive at LB#1/3/4 with NO conntrack
entry → Maglev lookup still maps each flow to the SAME
backend as before (because Maglev is deterministic on
5-tuple + pool version, identical across all LBs).
→ flow continues uninterrupted with only a brief
packet-reorder jitter.
Net disruption: 0 flows broken, <100 ms added latency on
~25% of flows during BFD convergence.
(Cf. §7.2 why this works.)
Every arrow in §6.1 corresponds to: a concrete protocol (TCP/UDP/IPIP/gRPC), a quantified rate, and one of the schemas in §5 or endpoints in §4. If I haven't traced it, it doesn't belong in the diagram.
7 Deep-Dive on 3 Critical Topics #
7.1 Connection tracking + Maglev consistent hashing
Why critical. If you get this wrong, every scale event (deploy, autoscale, crash) silently breaks live TCP flows. Users see "connection reset" and you blame the network.
The three candidate algorithms, quantified:
| Algorithm | Flow disruption on backend change | Memory | Affinity |
|---|---|---|---|
| Round-robin | — (no affinity to disrupt) | O(1) | None. Every packet could land on a different backend. Breaks stateful TCP completely unless LB does conntrack. |
| Modulo hash `h(5-tuple) % N` | ~(N-1)/N on any N change. At N=750, adding 1 backend re-hashes 99.87% of flows. | O(1) | Perfect until N changes. |
| Least-connections | Variable; depends on conntrack freshness | O(N) per LB + requires LB-to-LB state sync (otherwise each LB sees only its own connections, and the algorithm lies). | Drift between LBs. |
| Consistent hashing (ring) | ~1/N per change (spread across the affected node's V vnodes). Good but has hot-spot issues at N=750. | O(N×V), V=100-200 virtual nodes | Good. |
| Maglev hashing | ~1/N on single add/remove. Additional bound: no slot changes by more than M/N² entries. | O(M), M=65,537 ≈ 256 KB flat array. | Excellent, and identical across all LBs for the same backend set → no LB-to-LB sync needed. |
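The modulo row's 99.87% figure can be checked empirically in a couple of lines (simulation only, uniform random "flows"):

```python
import random

# A flow keeps its backend under modulo hashing only if h % N == h % (N+1).
random.seed(1)
N = 750
flows = [random.getrandbits(64) for _ in range(100_000)]
moved = sum(1 for h in flows if h % N != h % (N + 1))
print(f"flows remapped by adding 1 backend: {moved / len(flows):.2%}")  # ~99.9%
```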
Chosen: Maglev. Rationale:
- Minimal disruption is mathematically the best of the practical options.
- Shared determinism across LBs — the killer property. Every LB computes the same table from the same pool version. That's how I get away with "no conntrack sync" (§5.1) and "graceful LB failover without breaking flows" (§6.3). This is also why Google, Cloudflare (Unimog), GitHub (GLB), and Meta (Katran) all converged on Maglev-style hashing.
- Flat array → cache-line friendly, 1-2 ns lookup in the hot path.
How Maglev builds the table (the 20-second version):
M = 65537 (prime; large enough that slots/backend ≈ M/N ≈ 87 at N=750)
For each backend i, compute permutation P_i over [0, M) from hash(backend_id).
Iterate: each backend takes turns claiming its next-preferred empty slot,
until all M slots are filled. At N backends and M ≈ 100×N, this terminates
in ~M log M ops; for us, ~50ms on the controller.
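A compact sketch of that population loop, assuming SHA-256-derived offset/skip permutations in place of the paper's hash pair (function and host names are illustrative):

```python
import hashlib

M = 65537  # prime table size (§5.4)

def _h(s: str) -> int:
    return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

def permutation(backend_id: str):
    # offset/skip define a full permutation of [0, M) because M is prime
    offset = _h(backend_id + "#offset") % M
    skip = _h(backend_id + "#skip") % (M - 1) + 1
    return offset, skip

def build_table(backends):
    perms = {b: permutation(b) for b in backends}
    next_idx = {b: 0 for b in backends}
    table, filled = [None] * M, 0
    while filled < M:
        for b in backends:                    # backends take turns claiming slots
            offset, skip = perms[b]
            while True:
                slot = (offset + next_idx[b] * skip) % M
                next_idx[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == M:
                break
    return table

old = build_table([f"host-{i}" for i in range(8)])
new = build_table([f"host-{i}" for i in range(9)])  # add one backend
moved = sum(1 for a, b in zip(old, new) if a != b)
print(f"slots remapped: {moved / M:.1%}")  # ~1/9 of slots, plus a little churn
```

Note the determinism: any LB that runs this on the same membership list gets a byte-identical table, which is exactly the property §6.3 leans on.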
The conntrack table then becomes a performance optimization, not a correctness requirement. Why have it at all?
- TCP state correctness. We need to know "this packet belongs to flow F, which was last on backend B" because the Maglev table for today's pool version may route flow F to a different backend if a backend was added/removed mid-flow. Conntrack pins the flow to B until FIN.
- Concretely: on SYN, we Maglev-select B and insert conntrack(5-tuple → B). On every subsequent packet, we check conntrack first; only on miss do we fall back to Maglev.
The earned-secret insight about Maglev + conntrack interaction:
At Google scale (Maglev paper §5.2), conntrack is per-packet-processor and not synchronized across boxes, but this only works because:
- The ECMP router uses a 5-tuple-consistent hash (so the same flow reaches the same LB box most of the time);
- When ECMP does reshuffle (LB node failure, deploy), Maglev's determinism means the new box computes the same backend for the flow — so the conntrack miss on the new box is harmless: both boxes independently arrive at backend B.
This only fails when a backend has been added or removed in the tiny window between "flow started" and "flow's packet is now being handled by a different LB with a newer pool version." The disruption probability there is
`(1/N) × (rate of pool changes) × (avg flow lifetime)` — at our scale, ~1 in 50,000 flows. Acceptable.

At Katran/Cloudflare scale they add one more trick: the backend itself handles a packet whose 5-tuple it doesn't match in its kernel socket table by sending a RST with a cookie or by using TCP-level connection pinning. The result: the LB tier can be fully stateless in the steady state.
Alternatives rejected and why:
- Ring hashing — works, but has well-known hot-spot issues without careful vnode sizing; Maglev proved mathematically better.
- Rendezvous (HRW) hashing — O(N) per lookup. At N=750 that's 750 hash ops/pkt; we'd be PPS-bound.
- LB-to-LB conntrack replication (a la IPVS sync) — doable but costs 2× memory + a replication stream + correctness headaches on partition. We do not need it with Maglev.
Failure modes & mitigations (for this subsystem specifically):
| Failure | Detection | Blast radius | Mitigation |
|---|---|---|---|
| Conntrack table full | Hash insert fails | This LB's new-flow success rate → 0 | Pre-size 10× peak (§3.3); LRU-evict oldest IDLE flows before new ones; SYN-cookies if a flood is detected |
| Two LBs have different pool versions for ~seconds | Version mismatch in snapshot | Tiny minority of flows routed to stale backend | Maglev table includes version; backends drop packets whose encap header version is N-2 or older; LB retries via new Maglev |
| Maglev table recompute latency spike | Controller metric | New registrations delayed | Double-buffer tables; recompute async; serve old table while new builds |
| Hash collision attacker | Anomalously-skewed backend traffic | One backend gets hammered | Use keyed SipHash with a per-pool secret; rotate on attack detection |
7.2 LB high availability: ECMP + anycast + DSR
Why critical. The LB is a SPOF by default. Every production L4 LB since ~2012 uses the same recipe: ECMP/anycast + stateless hashing + DSR. Interviewers want you to name the three tricks and explain why they compose.
The three tricks, one at a time:
(a) Anycast VIP + BGP. Every LB node announces the same VIP (203.0.113.1/32) to upstream routers. The routers see N equal-cost next-hops to that VIP and do ECMP.
- Why anycast over "DNS-returns-many-IPs"? Client DNS caches and TTLs are the devil. With anycast, failover is at BGP-convergence speed (1-3 s with BFD) rather than DNS-TTL speed (minutes). Also: clients behind egress NAT that re-use a single src-IP hammer a single returned DNS answer — you lose LB-side load balance. Anycast doesn't care.
- Why BGP and not static routes? So we can withdraw the route from a failed LB automatically.
(b) ECMP with consistent hash on 5-tuple. The router's ECMP implementation hashes {src_ip, src_port, dst_ip, dst_port, proto} and picks 1-of-N next-hops. If N changes (an LB dies), most flows still hash to the same LB they were on.
- Pitfall #1 — "old-school" ECMP uses `hash % N`, which re-hashes (N-1)/N flows on N change. Modern switches support resilient hashing (Broadcom, Cisco Nexus), which behaves like consistent hashing at the silicon level. I would explicitly ask the network team "is ECMP resilient?" — this is a classic on-call war-story question.
- Pitfall #2 — BFD isn't free. At 50 ms × 3 BFD misses = 150 ms detection, but BFD's own chatter at scale is non-trivial. Realistic target: 1-3 s convergence.
(c) DSR (Direct Server Return). The backend sends the response packet directly to the client — NOT back through the LB. The response bandwidth + PPS bypasses the LB entirely.
- Why it matters. Most L4 workloads are response-heavy (1 KB req, 50 KB resp). Without DSR, the LB handles 100% of the response bandwidth; with DSR it handles 0%. 10× reduction in LB hardware.
- How it works. The LB doesn't NAT — it either (i) rewrites the dst MAC and keeps IPs intact ("L2 DSR") so the backend decodes the VIP-bound packet as "mine" because the VIP is a loopback alias on every backend; or (ii) IPIP-encapsulates (the "IPIP tunnel mode" / Katran approach). The backend decap'd inner packet has the original client src and VIP dst — it answers with src=VIP, dst=client, bypassing the LB.
- Why IPIP > L2 DSR at our scale. L2 DSR requires every backend to be in the same L2 broadcast domain as the LB. That does not scale across AZs/racks. IPIP routes over L3 — any backend, any AZ, any rack.
- Backend-side caveats. Backends need (1) the VIP on loopback (with `arp_ignore=1`, `arp_announce=2` on Linux so they don't answer ARP for the VIP), (2) an IPIP tunnel interface, (3) `rp_filter=0` on that interface (because the src IP is a client IP that doesn't have a route through this interface).
Alternatives rejected and why:
- Active-passive LB pair with keepalived/VRRP. Works for one LB pair per VIP, tops out at single-box NIC capacity, and failover is 1-3 s with a brief flow-break window. Fine for v1 (§9), but doesn't scale to our 5K peak + headroom story, and certainly not to v3.
- L7 LB (Envoy/HAProxy) at the edge. More features, but 10× CPU/pkt, TLS tax, and breaks DSR. Use it behind the L4 tier, not instead of.
- Cloud-native (AWS NLB). Totally fine answer if the interviewer is in the cloud-native frame. It's literally Maglev-style hashing + ECMP + cross-AZ anycast inside AWS. I'd mention NLB as "we're reimplementing, for the interview, what NLB does internally."
Earned-secret note on ECMP rehash during rolling deploys:
When you rolling-deploy the LB fleet, each node restart withdraws and re-announces its BGP route. If ECMP is not resilient, every restart re-hashes ~(N-1)/N flows through the router. At N=4 LBs, that's a 75% flow-break per restart. Rolling through 4 LBs → almost every flow breaks during deploy. Mitigations: (1) demand resilient ECMP from the network team, (2) drain connections at the LB before BGP withdraw (GARP + `MED` shifting), (3) keep the withdraw paused for 2× max-flow-lifetime so in-flight flows bleed out. Katran's approach is even slicker — the outgoing LB hands off the conntrack shard to a sibling before BGP withdraw.
Failure modes & mitigations:
| Failure | Detection | Blast radius | Mitigation |
|---|---|---|---|
| Single LB node dies | BGP/BFD | ~1/N flows take 100-300 ms reroute | Maglev determinism keeps them on the same backend |
| BGP route flap (split brain) | Route monitor | Blackhole a VIP for ≤1 s | BFD + BGP min-adv-interval tuning |
| Router misconfigures ECMP (non-resilient) | Deploy breaks everyone | Up to 100% flows on deploy | Pre-deploy verify; abort deploy if conntrack-miss-rate > 5% |
| Backend loses VIP loopback config | Traffic black-holed on that backend | 1/N of flows | Readiness probe from LB side checks that backend responds on VIP, not just on pod-IP |
7.3 Health checking + graceful drain
Why critical. Mis-calibrated health checks either (a) keep serving traffic to a broken backend → user-visible errors, or (b) evict a healthy backend because of a transient blip → cascading overload. Both are classic pager events; calibration is where staff-level judgment shows.
Dimensions of health checking (each requires a decision):
Active vs. passive.
- Active = LB/checker sends probe. Passive = LB observes real-traffic outcomes (RST rate, retransmits, HTTP 5xx).
- Choice: active for liveness + passive for early-warning ejection. Active catches "backend idle but broken"; passive catches "backend fine on /health, dying on real traffic."
Liveness vs. readiness.
- Liveness = "process responds at all." Low-layer TCP connect probe. Used for ejection.
- Readiness = "process is ready to take load (e.g., caches warm)." Application-level HTTP `/readyz`. Used for admission of new pods into rotation and for drain announcement.
- Confusing them means you restart a warming pod ("it's not live!") when it's simply not-yet-ready.
Centralized checker fleet vs. every-LB-checks.
- Every-LB-checks: each of 12 LBs probes each of 750 backends → 9K checks/s. Fine at this scale, painful past 10K backends.
- Centralized: 6 dedicated checkers × 750 = 4.5K checks/s, aggregated to a registry.
- Choice: centralized + quorum. Each backend is probed by 3 checkers in different AZs; the registry marks it UNHEALTHY only when 2-of-3 agree. This defuses network-partition false positives.
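A minimal sketch of the quorum aggregation, assuming the registry receives per-checker pass/fail reports (function and field names are illustrative, not a real registry API):

```python
def quorum_verdict(reports, quorum=2):
    """reports: backend_id -> list of per-checker pass/fail booleans.
    A backend is marked UNHEALTHY only when `quorum` checkers agree it
    failed, which absorbs single-checker network-partition noise."""
    verdicts = {}
    for bid, checks in reports.items():
        fails = sum(1 for ok in checks if not ok)
        verdicts[bid] = "UNHEALTHY" if fails >= quorum else "HEALTHY"
    return verdicts
```

With 3 checkers in different AZs, one partitioned checker reporting failures cannot evict anything on its own; two must agree.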
Probe cadence + failure threshold.
- Probe every 1 s; mark unhealthy after 3 consecutive failures; mark healthy again after 2 consecutive passes.
- Detection latency: 3 s (required by FR-4). Recovery latency: 2 s.
- Why not 200 ms probes? They synchronize with backend GC pauses and create false-positive storms. 1 s is the sweet spot for typical JVM/Python backends.
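The 3-fail / 2-pass hysteresis above is a small state machine; a sketch with the thresholds from this section (class name is illustrative):

```python
class BackendHealth:
    """Hysteresis: UNHEALTHY after 3 consecutive probe failures, HEALTHY
    again after 2 consecutive passes. At a 1 s probe cadence this gives
    3 s detection latency and 2 s recovery latency."""
    def __init__(self, fail_threshold=3, pass_threshold=2):
        self.fail_threshold = fail_threshold
        self.pass_threshold = pass_threshold
        self.state = "HEALTHY"
        self.streak_fail = 0
        self.streak_pass = 0

    def observe(self, probe_ok):
        if probe_ok:
            self.streak_pass += 1
            self.streak_fail = 0      # any pass resets the fail streak
            if self.state == "UNHEALTHY" and self.streak_pass >= self.pass_threshold:
                self.state = "HEALTHY"
        else:
            self.streak_fail += 1
            self.streak_pass = 0      # any fail resets the pass streak
            if self.state == "HEALTHY" and self.streak_fail >= self.fail_threshold:
                self.state = "UNHEALTHY"
        return self.state
```

The streak resets are the point: a single pass in the middle of a failure run restarts the 3-count, which is what makes isolated probe losses harmless.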
Graceful drain sequence (the crown jewel of this section):
Operator issues DrainBackend(bid):
t=0: Controller writes state=DRAINING to registry.
t=+1: LBs receive PoolSnapshot update.
Maglev table is RECOMPUTED excluding the backend.
BUT — critical: EXISTING CONNTRACK ENTRIES POINTING AT THE BACKEND
ARE PRESERVED. Only NEW flows get reshuffled.
t=+1 to +max_flow_lifetime:
In-flight flows finish naturally on the draining backend.
Backend's readiness probe goes from READY → NOT_READY,
optionally returning `Connection: close` on HTTP responses
to encourage client-side flow turnover.
t=+60s: Controller writes state=EVICTED.
LBs purge conntrack entries for this backend.
Any straggler flow gets a TCP RST (acceptable; flow was
already 60 s overdue).
The earned-secret insight: drain != remove. The mistake I see most often at L6 is "delete from Maglev table == drain." That instantly breaks any in-flight flow. The correct pattern is two-phase: (1) remove from new-flow eligibility but keep in conntrack routing, then (2) after bleed-out window, purge. Katran implements this with a per-backend SKIP_FOR_NEW flag in its BPF map; Envoy calls it "graceful removal." If you describe this two-phase sequence, interviewers tick the box for "has actually operated this in production."
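The two-phase pattern in miniature (a Python sketch under assumptions: names are illustrative and a trivial modulo hash stands in for the real Maglev lookup):

```python
class LbDataPlane:
    """Sketch of drain != remove. Phase 1 removes the backend from
    new-flow eligibility only; conntrack keeps routing existing flows
    to it. Phase 2 purges conntrack after the bleed-out window."""
    def __init__(self, backends):
        self.eligible = set(backends)  # what the hash table is built from
        self.conntrack = {}            # flow-id -> backend

    def route(self, flow):
        if flow in self.conntrack:     # existing flow: conntrack wins
            return self.conntrack[flow]
        pool = sorted(self.eligible)
        backend = pool[hash(flow) % len(pool)]  # stand-in for Maglev lookup
        self.conntrack[flow] = backend
        return backend

    def drain(self, backend):          # phase 1: state=DRAINING
        self.eligible.discard(backend)

    def evict(self, backend):          # phase 2: state=EVICTED, purge stragglers
        self.conntrack = {f: b for f, b in self.conntrack.items() if b != backend}
```

The single-line difference between `drain` and a naive remove is exactly the SKIP_FOR_NEW semantics: the backend leaves the hash table but not the conntrack table.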
Alternatives rejected and why:
- TTL-based health (agent on backend heartbeats to registry). Works, but backend-reported "I'm healthy" is a weak signal — a zombie process can heartbeat. Active probes are ground truth.
- Kubernetes-style in-process liveness probe only. Fine for the backend's local orchestrator, but the LB needs its own opinion — k8s may think a pod is up while the backend's socket isn't accepting.
- Ejecting on first error (no quorum). Causes cascading eviction during network blips. Quorum is required.
Failure modes & mitigations:
| Failure | Detection | Blast radius | Mitigation |
|---|---|---|---|
| Entire checker fleet crashes | No health updates for 15 s → records TTL-expire to UNKNOWN | Registry freezes pool in last-known state (fail-safe) | Checker fleet is N+2 across AZs; alert if < quorum reachable |
| Probe false positive (GC pause) | Single checker sees fail; others don't | No eviction (quorum) | 2-of-3 quorum; 3× consecutive-fail threshold |
| Probe false negative (health endpoint lies) | Passive metrics: RST rate + 5xx rate on real traffic | Users see errors | Passive outlier detection (Envoy-style) ejects on real-traffic anomaly |
| Split-brain: 2 controllers both think they're leader | Dual writes; registry invariant violation | Maglev table thrashes | Controller uses Raft/lease; fence tokens on writes |
| Mass-eviction (regional DB outage trips readiness) | Suddenly 80% of backends UNHEALTHY | Fleet-wide brownout | Panic threshold (Envoy term): if > 50% of pool is UNHEALTHY, treat pool as "all HEALTHY" and let Maglev distribute. Prefer overload to outage. |
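The panic-threshold row in the table above deserves a sketch, since it is the least intuitive mitigation (function shape is illustrative; Envoy's default healthy-panic threshold is 50%):

```python
def routable_pool(health, panic_threshold=0.5):
    """health: backend_id -> bool. If the healthy fraction falls below
    `panic_threshold`, distrust the health signal (it's probably the
    checker fleet or a shared dependency, not the backends) and route
    to the full pool: prefer overload to outage."""
    healthy = [b for b, ok in health.items() if ok]
    if len(healthy) / len(health) < panic_threshold:
        return sorted(health)          # panic mode: everyone is routable
    return sorted(healthy)
```

Without this, a regional DB outage that trips every readiness probe converts a partial brownout into a total blackhole: the LB would evict 80% of the fleet and crush the remaining 20%.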
8 Failure Modes & Resilience — cross-component table #
| # | Failure | Component | Detection | Blast radius | Mitigation | MTTR target |
|---|---|---|---|---|---|---|
| 1 | LB node kernel panic | Data plane | BFD / BGP timeout (≤3 s) | ~1/N flows rerouted, most continue via Maglev | N+2 LBs per AZ, ECMP resilient hashing, IPIP DSR means backends don't need re-wire | ≤3 s |
| 2 | Single backend OOM | Backend | Active probe fails (3 s) + passive RST spike (1 s) | Flows to that backend RST | Quorum eject; Maglev reshuffle affects ~1/N flows | ≤5 s |
| 3 | Controller leader loss | Control plane | Raft lease timeout | No config changes for ≤5 s (data plane keeps running on last snapshot) | 3-replica Raft; LBs serve from cached table (fail-static) | ≤5 s |
| 4 | etcd cluster outage | Storage | Watch reconnect fails | LBs keep running; no reconfigures | Local snapshot cache, admin API surfaces "stale config" banner; alert SRE | Depends on etcd |
| 5 | ECMP rehash during deploy | Network | Metric: conntrack miss rate | Up to ~25% flows reroute to another LB but same backend (Maglev) → fine | Resilient ECMP; rolling deploy with drain between LBs | 0 actual breakage |
| 6 | Conntrack table overflow (SYN flood) | Data plane | Insert failures; CPU spike | New-flow rate drops | SYN cookies, per-src-IP rate limit, LRU eviction, scale up LB fleet | ≤1 min |
| 7 | Health-checker partition from backends | Health | Probe timeouts in one AZ | Would evict entire AZ | 3-AZ quorum prevents; panic threshold as last-resort | ≤15 s |
| 8 | Pool version skew across LBs | Data plane | Backend-side encap header version drop | Tiny minority of flows reroute via backend-RST | Bounded to 1-2 s via Watch latency | ≤3 s |
| 9 | Backend loses VIP loopback (misconfig) | Backend | LB-side probe to VIP fails (not to pod-IP) | That backend's share of flows RST | Probe the VIP specifically; quorum-eject | ≤5 s |
| 10 | Entire AZ loss | Physical | BGP routes withdrawn en-masse | ~1/3 of LBs + ~1/3 of backends lost | Headroom sized for AZ loss (§3.4); ECMP/Maglev auto-converge | ≤30 s |
| 11 | DNS cache poisoning sending to wrong VIP | External | Client reports + traffic dip | VIP level | Use DNSSEC; anycast means routes are geographically verifiable | Ext. |
| 12 | Maglev table recompute bug | Controller | Canary deploy of controller catches skew | Bad table pushed to all LBs | Stage controller rollout; LBs validate table hash vs. expected distribution; reject obvious bugs | ≤1 min to rollback |
9 Evolution Path #
v1 — "It works" (weeks 0-4)
- Single HAProxy box (or nginx-stream) per AZ, active-passive with keepalived/VRRP.
- Backends registered via flat YAML file; `reload` on change.
- Round-robin with TCP health checks.
- Good for ≤500 QPS, ≤50 backends, ≤1 AZ.
- Limits hit: single-box NIC cap (~1 Gbps), reload blips break flows, no cross-AZ story, no DSR.
v2 — "Production scale" (months 1-3) — the interview target
- ECMP'd stateless LB fleet (3-4 nodes per AZ × 3 AZs) behind anycast VIPs over BGP.
- Maglev hashing with conntrack table as perf-optimization, no cross-LB sync.
- IPIP DSR to any backend, any AZ.
- Centralized health checker fleet with 3-checker quorum.
- Raft-based controller + etcd for registry.
- Handles 5K QPS peak, 2M active flows per LB, 750 backends. This is what I described above.
- Limits hit: single region, software fast path bounded at ~1-2 Mpps per LB, no cross-region steering.
v3 — "Multi-region + earned secret" (quarter 2+)
- Multi-region anycast: each region advertises the same VIP; BGP community attributes steer by geography.
- XDP/eBPF fast path for 10 Mpps+ per LB node (Katran-style). Preserves the Maglev algorithm — just moves it to the NIC driver layer.
- SmartNIC / hardware offload (e.g., Intel IPU, BlueField DPU, Nitro) for cases where software PPS isn't enough. Maglev table fits in TCAM.
- SYN cookies in the hardware/XDP fast path (XDP_DROP on pattern match) as DDoS shield.
- Active-active cross-region connection state — only for specific long-lived flows (WebSocket, gRPC streaming). Done via a separate "sticky" tier with Redis-backed session mapping.
- Backend health ingestion via Prometheus/OTel metrics instead of synthetic probes — decouples LB from backend-internal state.
- Handles 100K+ QPS, multi-million active flows, multi-region failover in ≤10 s.
10 Out-of-1-hour Notes (the earned-secret reservoir) #
Topics I would not necessarily volunteer in a 60-minute window but should be ready for the probe on.
10.1 Kernel tuning checklist (Linux, not exhaustive)
# conntrack
net.netfilter.nf_conntrack_max = 4194304 # sized from §3.3 × 20
net.netfilter.nf_conntrack_buckets = 1048576 # nf_conntrack_max / 4
net.netfilter.nf_conntrack_tcp_timeout_established = 300 # default 432000 = 5 days. Absurd.
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 30
# TIME_WAIT reuse (client-side; LB-as-client to backends, if NAT mode)
net.ipv4.tcp_tw_reuse = 1 # safe; RFC 1323 timestamps
net.ipv4.tcp_tw_recycle = 0 # NEVER. broken with NAT clients.
net.ipv4.ip_local_port_range = "1024 65535" # 64K ephemeral ports
# Backpressure / bufferbloat
net.core.rmem_max = 268435456
net.core.wmem_max = 268435456
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_rmem = "4096 87380 268435456"
net.ipv4.tcp_wmem = "4096 65536 268435456"
# SYN / half-open
net.ipv4.tcp_max_syn_backlog = 65536
net.ipv4.tcp_syncookies = 1 # arm the cookie machinery
net.ipv4.tcp_synack_retries = 3 # default 5 = too slow
# GRO / TSO / LRO on the LB NIC
ethtool -K eth0 gro on tso on lro off # LRO off on a forwarder — LRO mangles packets
ethtool -N eth0 rx-flow-hash tcp4 sdfn # hash src-ip,dst-ip,src-port,dst-port
10.2 SYN cookies vs. SYN proxy
- SYN cookies (enabled above) — kernel responds to SYNs without allocating state by encoding the initial SEQ num as a cryptographic hash of the 4-tuple + MSS. Only on the client's ACK do we allocate state. Handles SYN floods gracefully at some cost to per-SYN latency and loss of TCP options.
- SYN proxy (iptables `SYNPROXY` target) — LB acts as SYN terminator; only passes to backend on successful handshake. Expensive per-flow (double handshakes) but protects backends from SYN storms. Use for SYN-flood-prone VIPs.
- Rule of thumb: cookies for general hardening, SYN proxy for known-hostile endpoints (login pages).
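To make the cookie mechanism concrete, a toy encode/verify pair (the HMAC construction, secret, and MSS table are illustrative assumptions; the real Linux encoding packs time, MSS index, and hash differently):

```python
import hashlib
import hmac

SECRET = b"rotate-me-periodically"   # hypothetical per-host secret
MSS_TABLE = [536, 1220, 1440, 1460]  # the 4 MSS values encodable in 2 bits

def make_cookie(src, sport, dst, dport, mss_idx, tbucket):
    """Encode the ISN as MAC(4-tuple, time bucket, MSS index) plus 2 low
    bits of MSS index. No per-connection state is allocated until the
    client's ACK echoes this value back."""
    msg = f"{src}:{sport}:{dst}:{dport}:{tbucket}:{mss_idx}".encode()
    mac = int.from_bytes(hmac.new(SECRET, msg, hashlib.sha256).digest()[:3], "big")
    return ((mac << 8) | (tbucket & 0x3F) << 2 | mss_idx) & 0xFFFFFFFF

def check_cookie(cookie, src, sport, dst, dport, tbucket):
    """On the ACK: recompute and compare. Returns the negotiated MSS on a
    valid cookie, None on a forgery or stale time bucket."""
    mss_idx = cookie & 0x3
    if make_cookie(src, sport, dst, dport, mss_idx, tbucket) == cookie:
        return MSS_TABLE[mss_idx]
    return None
```

The cost mentioned above is visible here: only 2 bits survive for TCP options (MSS index), which is why cookies lose window scaling and SACK negotiation unless timestamps carry them.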
10.3 DPDK / XDP / eBPF — when to reach for each
| Approach | Throughput | Development effort | When to use |
|---|---|---|---|
| Kernel netfilter / IPVS | ~1 Mpps single-core | Lowest; just a config | v1-v2, ≤100 Kcps |
| XDP + eBPF (Katran-style) | 10 Mpps / core | Medium; BPF verifier constraints | v2-v3, when NIC is bottleneck |
| DPDK user-space | 14 Mpps+ / core | High; no kernel at all | v3+, when you own the NIC |
| SmartNIC (BlueField, IPU) | 100 Mpps+ | Very high; vendor lock | Hyperscaler scale |
The big insight: XDP lets you run the Maglev lookup and conntrack insert before the packet hits the kernel stack. Katran's open-source XDP LB shows this end-to-end in ~500 lines of BPF.
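For reference, the Maglev table-population loop itself is short enough to sketch in full (hash choice here is an arbitrary stand-in; the point is the round-robin permutation walk and the prime table size):

```python
import hashlib

def _h(key, salt):
    """Stand-in hash; any two independent hash functions work."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def maglev_table(backends, size=65537):
    """Each backend walks its own permutation of slots (offset + j*skip mod
    size) and claims the next free slot, round-robin, until the table is
    full. `size` must be prime so every skip is coprime with it."""
    offset = {b: _h(b, "offset") % size for b in backends}
    skip = {b: _h(b, "skip") % (size - 1) + 1 for b in backends}
    nxt = {b: 0 for b in backends}
    table = [None] * size
    filled = 0
    while filled < size:
        for b in backends:
            while True:  # advance b's permutation to its next unclaimed slot
                slot = (offset[b] + nxt[b] * skip[b]) % size
                nxt[b] += 1
                if table[slot] is None:
                    table[slot] = b
                    filled += 1
                    break
            if filled == size:
                break
    return table

def lookup(table, flow_tuple):
    """Data-path lookup: hash the flow tuple into the table."""
    return table[_h(flow_tuple, "flow") % len(table)]
```

Because the fill is one slot per backend per pass, backend shares differ by at most one slot, and the table is a pure function of the pool, so every LB computes an identical table independently. This is the property XDP preserves when the lookup moves into BPF.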
10.4 BGP + ECMP integration details
- BFD tuned to 50 ms × 3 misses → 150 ms failure detection. Watch the router's CPU — at scale BFD alone can melt a top-of-rack switch.
- GoBGP / BIRD on each LB host to announce the VIP.
- MED / LOCAL_PREF to do graceful drain: raise MED to deprioritize this LB before BGP withdraw.
- Route flap damping off inside the DC — we want fast convergence.
10.5 TCP MD5 / TLS offload — scope boundary
- TCP MD5 (RFC 2385) — if we peer BGP with the upstream router, MD5 auth is standard. Burns some CPU.
- TLS offload — explicitly out of scope for this L4 design. If the interviewer insists, mention: (a) kTLS for kernel-space offload; (b) SmartNIC TLS; (c) moving TLS to a separate L7 tier. Doing TLS on the L4 LB breaks DSR.
10.6 Observability — what SREs actually page on
| Signal | Source | Alert threshold |
|---|---|---|
| Per-VIP new-conn-rate | XDP counter | > 2× moving-avg for 30 s → SYN flood |
| Conntrack utilization | /proc/sys/net/netfilter/nf_conntrack_count | > 70% of max |
| Maglev table miss rate | LB metric | > 5% sustained → config skew |
| Backend-eject rate | Controller | > 10%/min → cascading failure |
| BGP session state | BGP daemon | any down > 10 s |
| ECMP member count vs. expected | Router SNMP | < expected for > 30 s |
| Per-backend conn imbalance | LB metric | > 2× mean for > 2 min → Maglev skew or hash collision |
| p99 LB added latency | Tracer | > 500 µs → CPU/buffer issue |
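The first row's "2× moving-avg for 30 s" rule is easy to get wrong: a naive implementation feeds the spike back into the average and self-suppresses. A sketch that keeps breaching samples out of the baseline (class and parameter names are illustrative):

```python
from collections import deque

class NewConnRateAlert:
    """Fire when the per-VIP new-connection rate stays above `factor` x the
    moving average for `sustain` consecutive samples (1 sample/s assumed).
    Breaching samples are excluded from the baseline so a sustained flood
    cannot normalize itself away."""
    def __init__(self, window=300, factor=2.0, sustain=30):
        self.baseline = deque(maxlen=window)
        self.factor = factor
        self.sustain = sustain
        self.breach_run = 0

    def observe(self, rate):
        avg = sum(self.baseline) / len(self.baseline) if self.baseline else rate
        if rate > self.factor * avg:
            self.breach_run += 1           # breach: don't pollute the baseline
        else:
            self.breach_run = 0
            self.baseline.append(rate)
        return self.breach_run >= self.sustain
```

The same freeze-the-baseline shape applies to the conn-imbalance and eject-rate alerts; only the thresholds differ.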
10.7 Flow export
- sFlow — sampled packet export (1:1000 default). Cheap, lossy, fine for capacity planning.
- IPFIX / NetFlow v9 — per-flow records at flow end. Better for forensics. Most ECMP switches support it natively; do not implement in the LB unless you have to.
- eBPF perf-buffer export — custom structured events. Use this for the "why did this flow get RST?" debug workflow.
10.8 Things I would not do, and why
- Per-flow encryption at the LB — TLS tax + breaks DSR. Encrypt on backend.
- LB-to-LB conntrack replication — 2× memory + a gossip/repl channel + a new failure mode. Maglev determinism makes it unnecessary at our scale.
- Application-layer routing at L4 — if the interviewer pushes "route `/api/v2` to pool B," that's L7. I will stand firm on the scope boundary.
- "Smart" load balancing (EWMA of backend latency) at L4 — we can't see application latency without looking past TCP. Use passive RST/retransmit signals only; anything deeper belongs at L7.
- NAT mode as default — it doubles LB bandwidth and breaks source-IP preservation (backends see the LB as client, which breaks audit logs, rate-limits, and ACLs). DSR is the default; NAT is the fallback for networks that can't route IPIP.
Closing self-check (against the rubric)
- SRE pager-carryable? §8 table maps failure → detection → mitigation → MTTR for 12 scenarios. §10.6 gives concrete alert thresholds. ✓
- Every diagram arrow → real API/data flow? §6.1 arrows labeled with protocol + rate; control-plane arrows map to §4.1 gRPC calls; data-plane labels map to §5 tables. ✓
- Deep-dives at L7 depth? Maglev + conntrack interaction (§7.1), ECMP rehash during rolling deploys (§7.2), drain-vs-remove two-phase pattern (§7.3) — all named with real systems and quantified. ✓
- 5K peak / 10 per-server self-consistent? 500 raw, 750 with headroom+AZ; 4 LBs/AZ × 3 AZs = 12 LBs; 9K health checks/s; 96 MB conntrack / LB. Math in §3. ✓