Q9 Messaging & Notifications
Design an Online Chat / Messaging System
Support one-to-one and group chat with ordering, presence, offline sync, and read receipts.
1 Problem Restatement & Clarifying Questions #
Restatement. Build a globally distributed real-time chat platform supporting 1:1 DMs, medium groups, and one-to-many broadcast channels. Clients (iOS/Android/Web) maintain long-lived connections; the system must deliver every message at-least-once with client-side dedup, preserve per-conversation ordering, handle offline users via push and queued delivery, expose presence and read receipts, and survive regional failures without message loss.
Clarifying questions I'd ask a Google interviewer (and my assumed answers for this doc):
| # | Question | Assumed answer (drives design) |
|---|---|---|
| Q1 | Target scale? | 500M DAU, ~3M msgs/s peak write, ~750M concurrent WebSocket connections after multi-device expansion (derived in §3.2). |
| Q2 | Group size distribution? | 1:1 dominant (~70% of sends); groups ≤1000 members, avg ~20 (~28% of sends); broadcast channels up to 100K members (~2% of sends). |
| Q3 | Ordering guarantee? | Per-conversation total order visible to all members, causal across conversations for a single user's outbox. No global order across channels. |
| Q4 | Durability vs latency? | No message loss is non-negotiable (durable before ack). p99 send-to-deliver <200ms intra-region when both parties online. |
| Q5 | E2EE required? | Optional (per-channel flag). Server must work without plaintext; design for ciphertext storage with server-side fan-out over opaque blobs. Detailed crypto deferred to Section 10. |
| Q6 | Retention? | Default infinite (like iMessage/WhatsApp), with per-user/per-channel TTL knobs. Legal-hold out of scope for the 60-min interview. |
| Q7 | Federation (XMPP / Matrix style)? | No for v1/v2. Optional v3 inter-org federation (think Slack Connect). |
| Q8 | Multi-device? | Yes — user can have ≤10 active devices. Messages must sync to all, including those offline at send time. |
| Q9 | Media limits? | Up to 100MB per attachment; uploaded to blob store via pre-signed URL, message carries pointer + thumbnail. |
| Q10 | Read receipts? | Opt-in per-user (UX requirement, common in FB Messenger). |
| Q11 | Regulatory? | GDPR right-to-erasure, China/India data-locality. Drives per-user region affinity (Section 9 v3). |
Why these questions matter. Q2 and Q8 alone pivot the fan-out strategy — a 100K-member channel cannot use a naive per-recipient-inbox write, and multi-device turns "fan-out to 5 recipients" into "fan-out to ~15 device sessions" at the ~3-device average. Q3 decides whether we need total order per channel (yes) or global (no — vector clocks across users would be crushing). Q5 determines whether the server can do search/spam/moderation (no for E2EE channels — huge UX consequence).
2 Functional Requirements #
In scope (numbered):
1. Send 1:1 message with durable ack, idempotent via client_msg_id.
2. Group chat ≤1000 members, same ack semantics; fan-out transparent to sender.
3. Broadcast channels ≤100K members (one-way-dominant, reader fan-out, e.g. "announcements").
4. Typing indicators (ephemeral, best-effort, lossy).
5. Read receipts (per-user per-message last_read_msg_id advance).
6. Presence (online/away/offline/last_seen_ts), coarse aggregation.
7. Offline delivery: server queues until recipient reconnects OR delivers via APNs/FCM push.
8. Media attachments: images/video/files up to 100MB via pre-signed S3/GCS URL; server stores only metadata.
9. History sync / pagination: cursor-based back-scroll.
10. Multi-device sync: all devices see the same conversation state; messages sent from one device appear "sent" on the others.
11. Delete for me / delete for everyone (time-bounded).
12. Reactions (emoji annotations on messages).
Out of scope for this 60-min interview (covered in Section 10):
- Voice/video calling (entirely separate SFU architecture)
- ML-based moderation + report pipeline
- Search across encrypted content
- Bots / integrations (Slack apps, webhook)
- File preview generation
- Message translation
3 NFRs + Capacity Estimate #
3.1 Non-Functional Requirements
| NFR | Target | Notes |
|---|---|---|
| Availability | 99.99% (≈52 min/yr downtime) | per-region; global via isolation |
| p99 send-to-deliver (online) | <200ms intra-region, <350ms cross-region | ack-from-server + deliver-to-peer |
| p99 send-to-ack | <80ms (durable on primary replica) | what sender's spinner waits for |
| Durability | No loss after server-ack; RPO = 0 within region | replicated commit log before ack |
| Ordering | Per-conversation monotonic; causal per-sender | HLC-based |
| Delivery | At-least-once + client dedup | never at-most-once |
| Consistency | Read-your-writes on sender's devices; eventual across group members (bounded to seconds) | |
| Throughput headroom | 2× peak steady-state | holiday surges (NYE messages ~3× normal) |
| Cost | <$0.30 per MAU/yr on messaging infra | rough order; compare WhatsApp famously ~$0.50/MAU at $22B acq |
3.2 Capacity Back-of-Envelope (show the arithmetic)
Users and messages.
- 500M DAU × 50 msgs/day sent ≈ 25B msgs/day.
- 25B / 86,400s ≈ 290K msgs/s average write.
- Peak-to-average ratio ~10× for chat (lunch/evening spikes, plus NYE): 2.9M msgs/s peak write.
- Average payload (text + metadata): ~1KB. Media msgs only carry pointer, so avg stays ~1KB.
- Peak write bytes: 2.9M × 1KB ≈ 2.9 GB/s ingested.
Fan-out multiplier.
- Per-message recipients: 1:1 avg=1 recipient, group avg=20 (assume), broadcast avg=5K for the 2% channel-posts.
- Weighted: 0.70×1 + 0.28×20 + 0.02×5000 = 0.7 + 5.6 + 100 = 106.3 avg delivery events per send.
- Multi-device ×3 on average (phone + web + tablet) → ~320 delivery events / sent message.
- Peak delivery rate: 2.9M × 320 ≈ 930M delivery events/s (this is the fan-out pipeline QPS, not the storage QPS).
- This is why we separate a fan-out service from the message store: storage QPS is bounded by unique channel writes, fan-out QPS scales with recipients.
Storage.
- 25B msgs/day × 1KB = 25 TB/day raw. × 3 replicas = 75 TB/day.
- With compression (~3× on text): ~25 TB/day effective.
- Annual: ~9 PB/yr. Over 10-year retention, ~90 PB — comparable to WhatsApp's known archival.
- Hot-tier (last 7 days or last 100 msgs/channel, whichever smaller): 25TB×7 ≈ 175TB. Redis can't hold this raw — we cache only last 50 msgs/active channel. 500M DAU × avg 5 active channels × 50 msgs × 1KB = 125 TB hot cache across ~2000 Redis shards at 64GB usable each. In practice we tier: Redis for last-50 tail, local SSD cache on message service for last-1000, Cassandra beyond.
Connections / gateways.
- 500M DAU, avg 50% online concurrent during peak hours + 3 devices each → 750M concurrent WebSocket connections at peak.
- WebSocket memory: ~16KB per connection for userspace + ~32KB kernel (TCP + TLS state). Call it 48KB total.
- Earned-secret aside: TLS session state dominates app-level buffers. Without kernel TLS offload (kTLS), 750M conns × 32KB kernel ≈ 24TB of kernel memory. This is why WhatsApp's BEAM/Erlang gateways and Meta's Iris use kTLS — it amortizes crypto state and avoids the kernel→user copy.
- Per-node budget: assume 2M connections/gateway (matches WhatsApp's public figure of 2M+ per Erlang node in 2014, achievable on 256GB box).
- Gateway fleet: 750M / 2M = ~375 gateways global, round to ~500 for headroom + regional distribution.
- With a generous margin and N+2 per region across 20 regions: ~800 gateway nodes.
Fan-out pipeline.
- 930M deliveries/s, each ~500 bytes over internal bus = 465 GB/s internal bandwidth.
- A Kafka-style commit log at 500MB/s per broker → ~1000 brokers needed just for fan-out. In practice we use an in-memory pub/sub (NATS or custom) for the hot path + Kafka as the durable queue of offline deliveries. Most deliveries (~80%) go synchronously to online peers and never touch Kafka.
Offline queue.
- Of 930M deliveries/s, ~30% recipients offline → 280M/s queued.
- Each offline entry: channel_id, msg_id, user_id (~50 bytes) = 14 GB/s into queue store.
- Queue depth: assume avg offline duration 6h per user per day. 280M/s × 21,600s = 6T entries peak queue. With dedup per (user, channel): actual depth bounded by active channels per user × coalesce factor → ~100B entries, ~5TB.
Push notifications.
- Offline users receiving msgs → push via APNs/FCM. Expect 60% of offline-delivery-events trigger a push (rest are batched/coalesced).
- Peak push rate: 280M × 0.6 = 168M pushes/s — this exceeds APNs/FCM per-account quotas; we need push-provider sharded by user region + aggressive coalescing (one push per channel per 30s window).
Sanity check. Peak 2.9M send/s with 320 fan-out = 930M/s. At 1KB per message, network egress from fan-out tier = 930 GB/s. Divided across 500 gateways = ~2 GB/s per gateway egress. A single 25Gbps NIC does 3.1 GB/s, so we're at 60% NIC utilization at peak — feasible with headroom.
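The estimate above is easy to re-derive. A minimal, runnable sketch (Go) that reproduces the headline numbers from the assumptions already stated in this section — none of the constants are measurements:

```go
// Back-of-envelope check for §3.2. All inputs are the stated assumptions
// (500M DAU, 50 msgs/day, 10× peak, 70/28/2 send mix, ~3 devices, 2M conns/GW).
package main

import "fmt"

func main() {
	const (
		dau         = 500e6
		msgsPerDay  = 50.0
		peakToAvg   = 10.0
		devicesAvg  = 3.0
		onlineShare = 0.5
		connsPerGW  = 2e6
	)

	sendsPerDay := dau * msgsPerDay       // 25B sends/day
	avgSendQPS := sendsPerDay / 86400     // ~290K/s
	peakSendQPS := avgSendQPS * peakToAvg // ~2.9M/s

	// Weighted recipients per send: 70% 1:1, 28% groups (avg 20), 2% broadcast (avg 5K).
	recipients := 0.70*1 + 0.28*20 + 0.02*5000   // ~106.3
	deliveriesPerSend := recipients * devicesAvg // ~320
	peakDeliveries := peakSendQPS * deliveriesPerSend

	conns := dau * onlineShare * devicesAvg // ~750M concurrent
	gateways := conns / connsPerGW          // ~375, rounded up for headroom

	fmt.Printf("peak sends/s: %.1fM\n", peakSendQPS/1e6)
	fmt.Printf("deliveries per send: %.0f\n", deliveriesPerSend)
	fmt.Printf("peak deliveries/s: %.0fM\n", peakDeliveries/1e6)
	fmt.Printf("concurrent conns: %.0fM, gateways: %.0f\n", conns/1e6, gateways)
}
```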
4 High-Level API #
Three protocols for three concerns: WebSocket (client↔gateway, bidirectional chat frames), HTTPS REST (setup, uploads, history), gRPC (internal server↔server).
4.1 WebSocket frames (client ↔ gateway)
Wire: binary frames, MessagePack or protobuf (not JSON — saves ~40% bytes, matters at 2.9M/s). Each frame has {type, request_id, payload}. request_id echoed in ack for correlation.
C→S SEND {channel_id, client_msg_id(UUID), hlc_ts_hint, content, attachments[]}
S→C SEND_ACK {client_msg_id, server_msg_id, server_ts, hlc_ts, status}
S→C DELIVER {channel_id, server_msg_id, hlc_ts, sender, content, ...}
C→S DELIVER_ACK {server_msg_id} — client acks receipt (for read cursor & offline queue drain)
C→S TYPING {channel_id, is_typing}
S→C TYPING {channel_id, user_id, is_typing}
C→S READ {channel_id, up_to_msg_id}
S→C READ {channel_id, user_id, up_to_msg_id}
C→S PRESENCE_HEARTBEAT {status} — every 30s
S→C PRESENCE {user_id, status, last_seen} — only deltas pushed
C→S PULL_HISTORY {channel_id, before_msg_id, limit}
S→C HISTORY_PAGE {channel_id, msgs[], next_cursor}
C→S REACT {channel_id, msg_id, emoji, op:add|remove}
S→C ERROR {request_id, code, retryable, backoff_hint_ms}
S→C RESUME_HINT {session_token, preferred_gateway_dns} — sent on reconnect for stickiness
Idempotency. client_msg_id is the dedup key. Server stores (user_id, client_msg_id) → server_msg_id for 24h in Redis. A retry returns the original server_msg_id → client replaces its optimistic entry.
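A minimal sketch of that dedup path, with an in-memory DedupStore standing in for the Redis table keyed by (user_id, client_msg_id); the type and method names are illustrative, and a real implementation would use an atomic set-if-absent with a 24h TTL:

```go
// Send-side idempotency: a retry with the same client_msg_id must map back to
// the original server_msg_id instead of producing a second append.
package main

import (
	"fmt"
	"sync"
	"time"
)

type DedupStore struct {
	mu sync.Mutex
	m  map[string]string // (user_id, client_msg_id) -> server_msg_id
}

func NewDedupStore() *DedupStore { return &DedupStore{m: map[string]string{}} }

// PutIfAbsent records the mapping unless a prior attempt already stored one;
// either way it returns the server_msg_id to echo back in SEND_ACK.
func (d *DedupStore) PutIfAbsent(userID, clientMsgID, serverMsgID string, ttl time.Duration) string {
	d.mu.Lock()
	defer d.mu.Unlock()
	key := userID + ":" + clientMsgID
	if existing, ok := d.m[key]; ok {
		return existing // retry: return the original id, do not re-append
	}
	d.m[key] = serverMsgID // TTL eviction elided in this in-memory sketch
	return serverMsgID
}

func main() {
	store := NewDedupStore()
	first := store.PutIfAbsent("alice", "uuid-123", "srv-0001", 24*time.Hour)
	retry := store.PutIfAbsent("alice", "uuid-123", "srv-0002", 24*time.Hour)
	fmt.Println(first, retry) // srv-0001 srv-0001 — the retry collapses onto the original
}
```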
Flow control. Client maintains a per-connection sliding window (max 64 unacked frames). Backpressure via TCP naturally; on overload gateway sends SLOWDOWN frame with suggested delay.
4.2 HTTPS REST (setup and things that aren't message traffic)
POST /v1/auth/login → {access_token, refresh_token, session_id, gateway_hint}
GET /v1/channels → {channels[], unread_counts}
POST /v1/channels (create group)
POST /v1/channels/{id}/members
POST /v1/attachments/presign → {upload_url, download_url, expires_at, attachment_id}
GET /v1/channels/{id}/messages?before=&limit=
DELETE /v1/messages/{id}
POST /v1/devices (register device, receive FCM/APNs token)
DELETE /v1/devices/{device_id}
4.3 gRPC server↔server
MessageService
rpc Append(AppendReq) returns (AppendResp) // hot path write
rpc GetTail(GetTailReq) returns (stream Message)
rpc GetRange(GetRangeReq) returns (stream Message)
FanoutService
rpc Publish(PublishReq) returns (PublishResp) // push to online gateways via pubsub
rpc Enqueue(EnqueueReq) returns (EnqueueResp) // persist for offline
PresenceService
rpc Heartbeat(HeartbeatReq) returns (HeartbeatResp)
rpc Subscribe(SubReq) returns (stream PresenceDelta)
SessionService
rpc Authenticate(AuthReq) returns (AuthResp)
rpc ResolveUser(UserReq) returns (UserResp) // cacheable
4.4 Canonical message schema
message ChatMessage {
string channel_id = 1; // partition key
bytes server_msg_id = 2; // 128-bit: 64b HLC + 64b channel-local-seq
uint64 hlc_ts = 3; // hybrid logical clock, 48b physical + 16b logical
string client_msg_id = 4; // UUIDv4 from client
string sender_user_id = 5;
string sender_device_id = 6;
uint64 server_recv_ts = 7; // ms since epoch, server wall clock
oneof payload {
TextPayload text = 10;
AttachmentPayload attachment = 11;
SystemPayload system = 12; // e.g. "Alice added Bob"
EncryptedPayload encrypted = 13; // E2EE ciphertext + key ref
}
repeated Reaction reactions = 20;
EditHistory edits = 21;
MessageFlags flags = 22; // deleted, edited, ephemeral
}
5 Data Schema #
Three storage engines for three access patterns: wide-row append log (Cassandra), hot-tail cache (Redis), and metadata with transactions (Spanner/FDB).
5.1 messages — Cassandra (wide-row, partition = channel)
CREATE TABLE messages (
channel_id text,
hlc_ts bigint, // clustering key, DESC
server_msg_id blob,
sender_user_id text,
sender_device_id text,
payload blob, // protobuf-encoded
server_recv_ts timestamp,
flags int,
PRIMARY KEY ((channel_id), hlc_ts, server_msg_id)
) WITH CLUSTERING ORDER BY (hlc_ts DESC);
Append-optimized, channel-local tail read is a single partition sequential scan. TTL at table level for retention policies.
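How the PULL_HISTORY cursor maps onto this table: each page is the range "channel_id = ? AND hlc_ts < ? LIMIT ?", and the next cursor is the oldest hlc_ts in the page. A small sketch, assuming this access pattern — the partition is simulated in memory and the equivalent CQL is shown in the comment:

```go
// Cursor-based back-scroll over a channel partition, newest page first.
package main

import (
	"fmt"
	"sort"
)

// SELECT hlc_ts, server_msg_id, payload FROM messages
// WHERE channel_id = ? AND hlc_ts < ? ORDER BY hlc_ts DESC LIMIT ?;

type msg struct{ hlc int64 }

// pageBefore returns up to `limit` messages strictly older than cursor, plus
// the next cursor (0 once history is exhausted).
func pageBefore(partition []msg, cursor int64, limit int) (page []msg, next int64) {
	// partition is sorted ascending by hlc_ts.
	i := sort.Search(len(partition), func(i int) bool { return partition[i].hlc >= cursor })
	start := i - limit
	if start < 0 {
		start = 0
	}
	page = partition[start:i]
	if start == 0 {
		return page, 0
	}
	return page, page[0].hlc // oldest hlc_ts in this page becomes the next cursor
}

func main() {
	var part []msg
	for ts := int64(1); ts <= 7; ts++ {
		part = append(part, msg{hlc: ts})
	}
	cursor := int64(1) << 62 // "start from the newest"
	for cursor != 0 {
		var page []msg
		page, cursor = pageBefore(part, cursor, 3)
		fmt.Println(page, "next cursor:", cursor)
	}
}
```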
Why Cassandra, not X?
| Alternative | Verdict | Quantified reason |
|---|---|---|
| DynamoDB | Rejected | Per-partition write throughput cap 1000 WCU; a celebrity channel at 10K msg/s needs multi-partition hashing that breaks ordering. Cost at 25B msgs/day is ~$50M/yr vs. Cassandra self-managed ~$8M/yr hardware. |
| HBase | Rejected | HDFS/ZooKeeper op-burden too high for always-on; p99 write ≥50ms during region splits; compaction storms freeze regions. WhatsApp used early, migrated away. |
| Scylla | Strong contender | C++ rewrite of Cassandra, ~5× per-core throughput, shared-nothing per core. Discord migrated to ScyllaDB 2022 for exactly this workload. I'd pick Scylla if greenfield — but Cassandra has deeper ops maturity at $BIGCO. Document Scylla as v2 upgrade path. |
| Spanner | Rejected | Paxos commit latency 10-50ms intra-region, unnecessary global consistency. $100M/yr+ at this scale. Use Spanner for channel metadata only (Section 5.3). |
| FoundationDB | Rejected for messages | Transaction size limits (10MB, 5s). Works great for channel metadata. Snowflake + iCloud use it for metadata. |
| Kafka as primary | Rejected | Great as append log buffer, but poor random-read for history pagination. Use Kafka as the commit log in front, Cassandra as the queryable store (covered in Section 6). |
5.2 inbox_per_user — Redis + Spanner (user-sharded)
Tracks per-user per-channel read cursor and unread count. Small, hot, transactional.
inbox:{user_id}:{channel_id} → {last_read_hlc, unread_count, muted, last_delivered_hlc}
user_channel_list:{user_id} → sorted set of (channel_id, last_activity_hlc)
Authoritative in Spanner (for cross-device consistency — read receipts must not regress), cached in Redis for sub-ms lookup on gateway attach. Updates use a CAS on last_read_hlc so a stale READ from another device can never move the cursor backward.
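A sketch of that monotonic cursor advance; the in-memory Inbox type stands in for the Redis/Spanner compare-and-set and its names are illustrative:

```go
// Read-cursor advance: READ frames can arrive out of order across devices,
// so last_read_hlc only ever moves forward.
package main

import (
	"fmt"
	"sync"
)

type Inbox struct {
	mu          sync.Mutex
	lastReadHLC map[string]int64 // key = user_id:channel_id
}

func NewInbox() *Inbox { return &Inbox{lastReadHLC: map[string]int64{}} }

// AdvanceRead applies a READ {channel_id, up_to_msg_id} and returns the
// resulting cursor; stale updates (upTo <= current) are ignored.
func (in *Inbox) AdvanceRead(userID, channelID string, upTo int64) int64 {
	in.mu.Lock()
	defer in.mu.Unlock()
	key := userID + ":" + channelID
	if cur := in.lastReadHLC[key]; upTo <= cur {
		return cur
	}
	in.lastReadHLC[key] = upTo
	return upTo
}

func main() {
	in := NewInbox()
	fmt.Println(in.AdvanceRead("alice", "ch1", 100)) // 100
	fmt.Println(in.AdvanceRead("alice", "ch1", 90))  // still 100: stale READ from another device
	fmt.Println(in.AdvanceRead("alice", "ch1", 120)) // 120
}
```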
5.3 channels — Spanner
channels (channel_id, type, creator, created_ts, settings_json, is_e2ee, member_count)
channel_members (channel_id, user_id, role, joined_ts, left_ts, notifications)
Transactional because "add member" must atomically: insert row, bump member_count, notify, and update user's channel list. ~200K channels created/sec peak during viral events; Spanner handles that easily. 1 Spanner instance cluster per region.
5.4 presence — Redis (sharded by user_id)
presence:{user_id} → {status, last_heartbeat_ms, devices:[{id, gateway, last_seen}]}
TTL 90s (3× heartbeat). On expiry, user transitions offline via keyspace-notification → presence aggregator recomputes "was online, now offline" delta.
5.5 attachments — metadata in Spanner, blob in S3/GCS
attachments (attachment_id, owner_user_id, content_type, size_bytes,
sha256, blob_url, thumbnail_url, uploaded_ts, ttl)
5.6 offline_queue — Kafka (per-user topic-or-partition)
Not actually one topic per user — that would be far too many topics. Instead a fixed pool of ~10K partitions, with user_id % 10000 as the partition key. Each entry: (user_id, channel_id, msg_id, delivered_bit). The fan-out service is the consumer; it drains a user's entries on reconnect.
6 System Diagram (Centerpiece) #
6.1 Full end-to-end view
┌──────────────────────────┐
│ Control Plane │
│ ┌────────────────────┐ │
│ │ Config / Feature │ │
│ │ Flags (Consul) │ │
│ └────────────────────┘ │
│ ┌────────────────────┐ │
│ │ Capacity Autoscaler│ │
│ │ (HPA + custom) │ │
│ └────────────────────┘ │
│ ┌────────────────────┐ │
│ │ Rate Limiter │ │
│ │ (per-user/chan) │ │
│ └────────────────────┘ │
└──────────────────────────┘
▲ config / policy
│ push
┌─────────────┐ TLS 1.3 ┌───────────────────────┐ WSS (long-lived) │
│ iOS Client │ ───────────────▶ │ Anycast GeoDNS / │ ──────────────────────┐ │
│ Android │ │ L4 LB (Maglev-hash │ │ │
│ Web │ ◀────────────── │ by user_id) │ ◀──────────────────┐ │ │
└─────────────┘ WSS binary └───────────────────────┘ │ │ │
│ sticky │ │ ▼ │
│ reconnect │ L7 routes ┌─────────────────────────────────────┐
│ (user_id%N) ▼ │ Gateway Fleet (~500 nodes) │
│ ┌──────────────────────────────────┐ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
└──────────────────────▶│ Gateway Router / Session-Resolver│──────▶│ │GW-0 │ │GW-1 │ │GW-2 │ │ ... │ │
│ resolves user → home-gateway │ │ │2M │ │2M │ │2M │ │ │ │
│ (consistent-hash w/ bounded load)│ │ │conn │ │conn │ │conn │ │ │ │
└──────────────────────────────────┘ │ │kTLS │ │kTLS │ │kTLS │ │ │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └─────┘ │
│ │ │ │ │
│ │ (stateful: conn table, │
│ │ session cache, send buffer) │
│ ▼ ▼ ▼ │
└─────────────────────────────────────┘
│ gRPC (mTLS, H2)
┌───────────────────────┼──────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────────┐ ┌───────────────────────────┐ ┌──────────────────────────┐
│ Session Service │ │ Message Service (MSG-*) │ │ Presence Service │
│ - authN/Z │ │ - per-channel leader │ │ - HB aggregator │
│ - device registry │ │ via lease (single-writer) │ │ - online set (Redis) │
│ - token refresh │ │ - HLC assignment │ │ - delta publisher │
│ - multi-device keys │ │ - append to commit log │ │ - fanout-on-change only │
└──────────┬───────────┘ │ - sync-write Redis tail │ └──────────────┬───────────┘
│ │ - emits to fan-out bus │ │
▼ └──────┬──────────────┬─────┘ ▼
┌─────────────────┐ │ │ ┌──────────────────┐
│ Spanner: users, │ ▼ ▼ │ Redis Cluster │
│ devices, tokens │ ┌─────────────┐ ┌──────────────┐ │ presence:{uid} │
└─────────────────┘ │ Commit Log │ │ Redis HOT │ │ sharded by uid │
│ (Kafka-like)│ │ msgtail:{ch} │ └──────────────────┘
│ 1000 brkrs │ │ LRU 50 msgs │
│ sharded by │ │ per channel │
│ channel_id │ │ (~2000 shds) │
└──────┬──────┘ └──────────────┘
│
▼
┌───────────────────────┐
│ Cassandra/Scylla │
│ messages by channel │
│ RF=3, LOCAL_QUORUM │
│ ~5000 nodes global │
└───────────────────────┘
┌───────────────────────┐ ┌─────────────────────────┐
│ Fan-out Service │──────▶│ Internal Pub/Sub Bus │
│ - maps channel → │ │ (NATS JetStream / SPS) │
│ member list │ │ topic per user_shard │
│ - online? → bus │ └────────────┬────────────┘
│ - offline? → queue │ │
│ - celebrity? → pull │ ▼
└──────┬────────────────┘ │
│ │
│ offline │ online deliveries
▼ │ consumed by gateways
┌───────────────────────┐ │ subscribing to
│ Kafka: offline_queue │ │ their owned user_shards
│ 10K partitions │ │
└──────┬────────────────┘ │
│ │
▼ │
┌───────────────────────┐ │
│ Push Dispatcher │───▶ APNs / FCM │
│ - coalesce per 30s │ (per-region │
│ - per-device route │ push shards) │
└───────────────────────┘ │
│
┌───────────────────────────┘
│ (back to Gateway Fleet)
▼
Online delivery to peer client
Blob tier (media):
Client ──pre-signed PUT──▶ GCS/S3 ◀──pre-signed GET── Client (recipient)
Metadata only flows through message pipeline.
6.2 Sub-diagram — 1:1 send path (the hot path to optimize)
Alice (online) Bob (online, different gateway)
│ ▲
│ 1. WSS SEND {channel,text,client_msg_id} │ 9. WSS DELIVER
▼ │
GW-Alice ──2. gRPC Append(channel=c,msg)──▶ MSG-Leader(c) │
│ │ │
│ │ 3. HLC assign │
│ │ 4. CommitLog append (sync, RF=3 quorum) ─ durable
│ │ 5. Redis tail write (async)
│ │ 6. emit to Fan-out bus(channel=c)
│ ▼
│ Fan-out Svc
│ │
│ │ 7. resolve members: [Alice, Bob]
│ │ Alice online on GW-Alice ← already delivered via reflect
│ │ Bob online on GW-Bob
│ ▼
│ Pub/Sub topic user_shard(Bob)
│ │
│ 8. SEND_ACK (hlc,server_msg_id) ▼
◀───────────────────────────────────── GW-Bob (subscribed to Bob's shard) ─── 9. ▶ Bob
│
│
Alice receives SEND_ACK after step 4's durable commit.
Bob receives DELIVER after step 9.
Both may happen concurrently; SEND_ACK unblocks sender spinner.
Critical timing (intra-region, happy path):
- WSS frame parse: 0.5ms
- Gateway → MSG-Leader gRPC: 1ms
- HLC assign: <0.1ms
- Commit-log RF=3 quorum append: 8–12ms (p50)
- SEND_ACK back to Alice: ~15ms p50, ~40ms p99
- Fan-out emit + pub/sub hop: 5ms
- GW-Bob deliver: +3ms
- Total send→deliver p50: ~20ms, p99: ~60ms intra-region. Well under 200ms target.
6.3 Sub-diagram — Group fan-out (1:500 group)
MSG-Leader(ch=G)
│
│ emit {channel=G, hlc, payload}
▼
Fan-out Service (member list cached from Spanner: 500 user_ids)
│
│ bucket by user_shard (users hashed into 1000 shards)
│ result: 500 users map to ~300 unique shards (birthday paradox kicks in)
▼
Pub/Sub: publish to 300 topics in parallel
│
▼
Gateways subscribed to each shard pick up msg
│
│ for each online member: send DELIVER frame
│ for each offline member: enqueue in Kafka offline_queue
▼
Online: direct deliver. Offline: push via dispatcher after coalesce window.
6.4 Sub-diagram — Presence aggregation
Clients send HEARTBEAT every 30s via their gateway.
Gateway batches heartbeats (coalesce 1s window) → single gRPC to Presence Svc.
Presence Svc writes to Redis presence:{user_id} with 90s TTL.
A presence CHANGE (online→offline or vice-versa) triggers:
- Redis pub/sub on presence_delta topic
- Fan-out to only the users who are subscribed to this user's presence
(e.g. friends/contacts, or co-members of shared channels)
Subscription set cardinality capped at ~1000 per user; above that, presence for that user is "public" (pulled on demand, not pushed).
Cost: 500M users × (1 HB / 30s) = 16.6M HB/s. Coalesced 1s/GW → 500GWs × 1qps = 500 qps to Presence Svc for writes-in-bulk. Each bulk carries ~33K HBs. Presence Svc load becomes negligible; bottleneck is Redis TTL-expiry scan.
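A sketch of the gateway-side coalescing step described above: per-user heartbeats accumulate for a one-second window, then flush as a single bulk call to the Presence Service. The type and method names are illustrative and the flush is just printed here:

```go
// Heartbeat coalescing: many per-client heartbeats collapse into one bulk
// presence write per gateway per window.
package main

import (
	"fmt"
	"sync"
	"time"
)

type HeartbeatBatcher struct {
	mu      sync.Mutex
	pending map[string]time.Time // user_id -> latest heartbeat time
}

func NewHeartbeatBatcher() *HeartbeatBatcher {
	return &HeartbeatBatcher{pending: map[string]time.Time{}}
}

// Observe records one client heartbeat; duplicates within the window coalesce.
func (b *HeartbeatBatcher) Observe(userID string) {
	b.mu.Lock()
	b.pending[userID] = time.Now()
	b.mu.Unlock()
}

// Run flushes the pending set once per window as a single bulk write.
func (b *HeartbeatBatcher) Run(window time.Duration, flushes int) {
	for i := 0; i < flushes; i++ {
		time.Sleep(window)
		b.mu.Lock()
		batch := b.pending
		b.pending = map[string]time.Time{}
		b.mu.Unlock()
		fmt.Printf("bulk heartbeat write: %d users\n", len(batch)) // one gRPC per window
	}
}

func main() {
	b := NewHeartbeatBatcher()
	for i := 0; i < 1000; i++ {
		b.Observe(fmt.Sprintf("user-%d", i%300)) // 1000 HBs collapse to 300 users
	}
	b.Run(100*time.Millisecond, 1)
}
```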
6.5 Sub-diagram — Multi-device sync
Alice has 3 devices: phone(P), web(W), tablet(T). All open WSS to (possibly different) gateways.
On SEND from P:
- GW-P writes via MSG-Leader
- SEND_ACK returns to GW-P → Alice's phone
- Fan-out resolves Alice's channel membership → emits to Alice's user_shard
- GW-W and GW-T (subscribed to Alice's shard) each receive DELIVER
→ Web and Tablet show "sent" state (not a duplicate send indicator; server_msg_id tells them it's the same msg)
On reconnect (or first sync of a newly registered device) — see the sketch below:
- On reconnect, client sends RESUME {last_hlc_seen_per_channel[]}
- GW invokes Message Svc GetRange for gaps → catches up
- Client dedups via server_msg_id
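A sketch of the gap computation behind that RESUME flow; channelTail stands in for the Redis msgtail / Message Service tail read, and all names are illustrative:

```go
// RESUME gap check: compare the client's last_hlc_seen per channel against the
// channel tail and issue GetRange for whatever is missing.
package main

import "fmt"

type resumeReq struct {
	LastSeen map[string]int64 // channel_id -> last hlc_ts this device has
}

// gapsToFetch returns, per channel, the (from, to] HLC range the device is missing.
func gapsToFetch(req resumeReq, channelTail map[string]int64) map[string][2]int64 {
	gaps := map[string][2]int64{}
	for ch, seen := range req.LastSeen {
		if tail, ok := channelTail[ch]; ok && tail > seen {
			gaps[ch] = [2]int64{seen, tail} // GetRange(ch, seen, tail); dedup by server_msg_id
		}
	}
	return gaps
}

func main() {
	tails := map[string]int64{"ch-a": 250, "ch-b": 40}
	req := resumeReq{LastSeen: map[string]int64{"ch-a": 200, "ch-b": 40}}
	fmt.Println(gapsToFetch(req, tails)) // map[ch-a:[200 250]] — ch-b is already caught up
}
```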
7 Deep Dives (earned-secret depth) #
7.1 Deep Dive A — Ordering Guarantees: HLC + single-writer-per-channel vs RSM vs Vector Clocks
Why this is the hardest problem. Chat UX demands: "what I see in this conversation is what the next person sees, in the same order." Naive timestamp ordering fails — clocks skew, sends race. Vector clocks give partial order but UI can't render partial order. Paxos/Raft per channel gives perfect total order but 50ms consensus overhead per message. The right answer threads this needle.
Alternatives evaluated.
| Scheme | Ordering property | Latency cost | Operational cost | Verdict |
|---|---|---|---|---|
| Wall-clock timestamps + server arbitration | Best-effort; reordering on clock skew | ~0ms overhead | Zero | Rejected — iMessage reportedly hit this issue, occasional out-of-order renders |
| Vector clock per user | Causal partial order | ~0ms | Bloat: O(members) per message | Rejected — a 1000-member group has 1KB VC prefix; breaks down for E2EE where ciphertext can't include metadata cheaply; UI needs total order anyway |
| Lamport timestamps | Total order, no causal info | ~0ms | Zero | Rejected — doesn't respect wall-clock intuition ("I sent this at 3pm but it sorted before your 2:59pm msg") |
| HLC (Hybrid Logical Clock) | Total order + wall-clock monotonic + causal if servers see each other | ~0ms | 8-byte clock per msg | Chosen — combines wall-clock intuition (HLC ≥ wall time) with causality via logical counter |
| Raft per channel | Strong linearizable total order | 5-20ms extra per send | 5× storage (log + state machine), leader election storms on failure | Used for metadata (channel membership changes) but not every message |
| Single-writer-per-channel lease (our pick) | Total order via serialization at one node | ~1-2ms lease-check overhead (amortized) | Leadership transfer on failure | Chosen — MSG-Leader(channel) holds a 10s lease via etcd/Chubby; all writes for that channel serialize through it |
Chosen design — HLC + single-writer lease per channel.
- Each channel c has exactly one MSG-Leader at any time, elected via a lease in a channel_id → node Chubby/etcd table. Lease = 10s, renewed every 3s.
- Gateway routes the Append(c, msg) gRPC via consistent-hash on channel_id to the leader.
- Leader assigns hlc_ts = max(prev_channel_hlc + 1, wall_clock_now_ms << 16) (sketched below). This makes hlc_ts strictly increasing per channel AND roughly aligned with wall time.
- Leader writes to the commit log with RF=3 quorum. Only after quorum does it ack back.
- If the leader fails: the lease expires (10s) and a new node takes over. It reads the last hlc_ts from the commit log tail and continues. During the 10s window, that channel is write-unavailable — a conscious trade: 10s of unavailability for one channel vs global ordering inconsistency forever.
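A minimal sketch of the leader-side HLC assignment plus the lease fencing check described in the failure modes below. The 48b-physical/16b-logical packing follows this section; the type names and the elided storage hook are illustrative:

```go
// Per-channel HLC assignment and commit-log fencing on the single-writer path.
package main

import (
	"errors"
	"fmt"
	"time"
)

type ChannelLeader struct {
	prevHLC  uint64 // last hlc_ts issued for this channel
	leaseSeq uint64 // fencing token from the lease service
}

// nextHLC is strictly increasing per channel and tracks wall time:
// max(prev+1, now_ms << 16), leaving 16 bits for the logical counter.
func (l *ChannelLeader) nextHLC(nowMs uint64) uint64 {
	candidate := nowMs << 16
	if candidate <= l.prevHLC {
		candidate = l.prevHLC + 1 // clock stalled or jumped back: bump logical part
	}
	l.prevHLC = candidate
	return candidate
}

// append simulates the commit-log fencing check: writes carrying a stale
// lease_seq (an old leader waking up after a GC pause) are rejected.
func (l *ChannelLeader) append(storedLeaseSeq uint64, payload string) (uint64, error) {
	if l.leaseSeq < storedLeaseSeq {
		return 0, errors.New("fenced: stale lease, not retryable")
	}
	hlc := l.nextHLC(uint64(time.Now().UnixMilli()))
	_ = payload // RF=3 quorum append to the commit log would happen here
	return hlc, nil
}

func main() {
	leader := &ChannelLeader{leaseSeq: 7}
	hlc1, _ := leader.append(7, "hello")
	hlc2, _ := leader.append(7, "world")
	fmt.Println(hlc1 < hlc2) // true: per-channel total order
	_, err := (&ChannelLeader{leaseSeq: 6}).append(7, "stale") // old leader after failover
	fmt.Println(err)
}
```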
Earned-secret aside: Single-writer-per-channel is exactly what Slack's "Channel Server" (Gumbo) and Discord's Elixir/BEAM process-per-channel do. Each channel has a dedicated process holding the canonical state; messages serialize through it. WhatsApp's Erlang gateways use a similar model. The insight is that per-channel throughput is bounded (~1000 msg/s in the worst real case) — so serializing through one node is fine; what matters is that you have millions of channels distributed across thousands of nodes.
What HLC buys us over Lamport. A message with hlc_ts = 1705000000042:5 is readable: the first part is ms since epoch. When rendering, UI sorts by hlc_ts and the order matches wall-clock ±tiny logical counter. Lamport gives you only 42 which is unreadable and meaningless for UI. HLC is what CockroachDB and YugabyteDB use for exactly the same reason.
At-least-once + client dedup. Because we ack after the RF=3 quorum write, a retry reaches the leader with the same client_msg_id and gets back the already-assigned server_msg_id. Dedup is keyed by (user_id, client_msg_id) → server_msg_id in Redis, TTL 24h. This costs us 2.9M msgs/s × 40 bytes × 86,400s ≈ 10TB of Redis — feasible with ~200 shards (~50GB each).
Failure modes:
- Dual-leader split-brain during lease transfer. Mitigation: fencing tokens. Each append includes (lease_id, lease_seq); the commit log rejects any append whose seq is lower than the stored one. Same pattern as ZooKeeper zxid fencing.
- Clock-jump backward on the leader node. HLC guarantees monotonicity by taking max(wall, prev+1); a backward jump just means the logical counter increments more — still correct.
- Leader GC pause >10s. Worst case: a new leader is elected, then the old leader resumes and tries to ack stale writes. Fencing rejects them; the old leader sees the rejection and surfaces a retryable=false error. Alarming, but no data corruption.
7.2 Deep Dive B — Fan-out: Push vs Pull vs Hybrid (the celebrity problem, for groups)
Why critical. Whether to do write fan-out (copy message to each recipient's inbox at send time) or read fan-out (recipient pulls from channel's shared log at read time) is the pivot point of the whole architecture. Wrong choice breaks scale in either direction.
Alternatives.
| Strategy | How it works | Storage cost | Read latency | Works for |
|---|---|---|---|---|
| Pure push (per-recipient inbox write) | On send, write N copies (one per recipient) to N inboxes | N× storage | O(1) read — inbox is pre-materialized | Small groups, consistent fast reads |
| Pure pull (shared channel log) | Write once to channel log; readers query log on demand | 1× storage | O(log history) per read + network | Large groups, celebrity channels |
| Hybrid (tiered) | Push for small channels, pull for large | 1-N× (tunable) | Depends on tier | Realistic mixed workloads |
Math per strategy at our scale.
- Pure push: 2.9M sends/s × 106 avg recipients = 307M inbox writes/s. At 200 bytes per inbox entry = 61 GB/s into inbox storage. Cassandra can do this but wasteful — user rarely rereads offline msgs more than once.
- Pure pull: 2.9M sends/s, 1× channel-log write = 2.9 GB/s. BUT reads: 500M DAU × 100 reads/day × avg 5 msgs per read = 250B message-reads/day ≈ 2.9M msgs/s average (≈29M/s at the 10× peak). Read bandwidth on the channel-log storage becomes the bottleneck, and celebrity channels (Cristiano Ronaldo's broadcast channel with 500K members) see 500K concurrent readers pulling on every post.
Chosen design — Hybrid with three tiers:
- Tier 1 — Direct push (channel_size ≤ 1000): at send time, MSG-Leader emits to the fan-out bus; the fan-out service looks up the (cached) member list and publishes to each member's user_shard on pub/sub. Online gateways consume and deliver. Offline members → Kafka offline_queue with (user_id, channel_id, msg_id). No inbox materialization on the storage side — the Kafka queue IS the inbox for offline, and it auto-drains.
- Tier 2 — Sharded push (1000 < channel_size ≤ 10K): fan out through parallel worker pools, one per shard of the channel's membership. Each worker batches 1000 member-lookups and coalesces online vs offline. Introduces ~50ms p99 vs Tier 1's ~10ms, but distributes CPU. This is the WhatsApp approach for medium groups.
- Tier 3 — Pull (channel_size > 10K, i.e. broadcast channels): Do NOT push to recipients. Instead: write to channel log only. Recipient gateway subscribes to a channel-post notification pub/sub (tiny — just "channel X has new msg at HLC Y"). Recipient pulls message body on demand (when user opens channel or a push notification requires body). This is Twitter's approach for celebrity tweets, and Telegram's approach for broadcast channels.
The decision knob. Tier is picked from channel size at creation time, or dynamically (crossing a threshold triggers a migration). The 10K threshold for tier 3 follows from the §3.2 mix: broadcast posts are only ~2% of sends, but at ~5K recipients each they contribute ~100 of the ~106 average delivery events per send — roughly 94% of push fan-out volume. Pushing them would dominate fan-out cost; serving them by pull instead drops the push fan-out to ~6 delivery events per average send, better than an order-of-magnitude saving for the top 2% of sends.
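A sketch of the tier decision and the shard-bucketing step. The thresholds mirror the tiers above and the 1000-shard count mirrors §6.3; the actual bus publish and offline enqueue are elided, and all names are illustrative:

```go
// Tier selection plus grouping of members by user_shard, so the pub/sub
// publish is one message per shard rather than one per member.
package main

import "fmt"

type Tier int

const (
	TierDirectPush Tier = iota + 1
	TierShardedPush
	TierPull
)

func pickTier(memberCount int) Tier {
	switch {
	case memberCount <= 1000:
		return TierDirectPush
	case memberCount <= 10000:
		return TierShardedPush
	default:
		return TierPull
	}
}

func shardOf(userID string) int {
	h := 0
	for _, c := range userID {
		h = h*31 + int(c)
	}
	return (h%1000 + 1000) % 1000
}

func fanOut(members []string, tier Tier) {
	if tier == TierPull {
		fmt.Println("pull tier: publish only a 'channel has new msg at HLC' notification")
		return
	}
	shards := map[int][]string{}
	for _, m := range members {
		shards[shardOf(m)] = append(shards[shardOf(m)], m)
	}
	fmt.Printf("tier %d: %d members -> %d shard publishes\n", tier, len(members), len(shards))
}

func main() {
	members := make([]string, 500)
	for i := range members {
		members[i] = fmt.Sprintf("user-%d", i)
	}
	fanOut(members, pickTier(len(members))) // 500 members collapse to a few hundred shards
}
```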
Earned-secret aside: Twitter's home timeline uses exactly this hybrid. Fan-out-on-write for normal users, fan-out-on-read for celebrities (>1M followers). The threshold was tuned empirically; they ended up with a hybrid merge on the read path that combines pre-materialized timeline + pulled celebrity tweets. Same idea here: a user in 5 broadcast channels + 20 normal groups gets pushed msgs for groups and pulls for broadcasts, merged by client into the home view.
Celebrity problem variant — group with one very active member. One user sending 100 msg/s into a 10K group = 1M deliveries/s from one user. Solution: per-sender-per-channel rate limit (e.g. 10 msg/s from one user into one channel) + flood protection. This is a product policy, but enforced at Session Svc via sliding window.
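A sketch of that per-sender-per-channel sliding window; the real enforcement point is the Session Service, the 10 msg/s limit is the example figure above, and the in-memory bookkeeping here is illustrative:

```go
// Sliding-window flood protection keyed by (sender_id, channel_id).
package main

import (
	"fmt"
	"sync"
	"time"
)

type SlidingWindowLimiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	sends  map[string][]time.Time // key = sender_id:channel_id
}

func NewLimiter(limit int, window time.Duration) *SlidingWindowLimiter {
	return &SlidingWindowLimiter{limit: limit, window: window, sends: map[string][]time.Time{}}
}

// Allow reports whether this send fits inside the window, recording it if so.
func (l *SlidingWindowLimiter) Allow(senderID, channelID string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	key := senderID + ":" + channelID
	cutoff := now.Add(-l.window)
	kept := l.sends[key][:0]
	for _, t := range l.sends[key] {
		if t.After(cutoff) {
			kept = append(kept, t) // drop entries that fell out of the window
		}
	}
	if len(kept) >= l.limit {
		l.sends[key] = kept
		return false // flood: reject or defer the send
	}
	l.sends[key] = append(kept, now)
	return true
}

func main() {
	lim := NewLimiter(10, time.Second)
	now := time.Now()
	allowed := 0
	for i := 0; i < 15; i++ {
		if lim.Allow("celebrity", "group-10k", now) {
			allowed++
		}
	}
	fmt.Println("allowed:", allowed, "of 15") // allowed: 10 of 15
}
```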
Failure modes:
- Fan-out service OOMs on large-member lookup. Mitigation: stream member iteration from Spanner with page size 1000; fan-out worker is stateless on membership, uses flow control to the pub/sub bus.
- Push tier misdetects offline. Mitigation: gateway publishes "I hold user U's conn" heartbeat to a routing cache every 5s; fan-out checks cache; if stale (>15s), treat as offline. False offline is OK because client will reconcile via RESUME on reconnect.
- Pub/sub topic hot spot. Mitigation: if a single user_shard gets >50K msg/s (unlikely but possible), split shard dynamically — migrate half of users to a new shard, gateway learns via routing cache.
7.3 Deep Dive C — Gateway Stickiness, Failover, and Mass-Reconnect Storms
Why critical (and earned-secret). Gateways hold state: open WebSocket, TLS session, per-conn sequence numbers, small send buffer. When a gateway dies, millions of clients reconnect within seconds. Done wrong, this takes down the entire fleet (thundering herd). Done right, clients barely notice.
Alternatives.
- Stateless gateways, external session store. Every frame re-reads session from Redis. Trade: simple failover, but 2 Redis hops per frame × 930M deliveries/s = 1.86B Redis ops/s. Not affordable.
- Sticky, hash-on-user_id with simple modulo. Client reconnects to user_id % N_gateway. Problem: scaling from 500 to 501 gateways shuffles every client. Rejected.
- Sticky with consistent hashing + bounded load (chosen). Clients hash user_id onto a ring; pick the closest gateway with capacity <90% (sketched after the chosen design below). Adds/removes shuffle only ~1/N of clients.
Chosen design.
- L4 LB uses Maglev-style consistent hash on user_id (extracted from JWT or a cookie). This delivers reconnects to the same gateway pod with ≥95% probability across pod changes.
- Gateway holds the WSS conn + session state locally. Session state is also persisted to Redis (cold replica) for recovery.
- On gateway crash:
- K8s liveness detects in 5s, endpoints controller removes pod.
- L4 LB reroutes the slice of user_ids to next-in-ring gateway.
- Clients see the WSS drop and reconnect with exponential backoff jittered by user_id hash. This is the critical bit — plain exp-backoff synchronizes reconnects; jittering by hash(user_id) % 30s spreads them.
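A sketch of the "consistent hashing + bounded load" gateway pick referenced above: walk the ring clockwise from hash(user_id) and take the first gateway under its cap. A production ring would use many virtual nodes per gateway and the 90%-capacity rule; this toy keeps one point per gateway and a hard connection cap:

```go
// Bounded-load consistent hashing for gateway assignment.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // point -> gateway
	load   map[string]int    // current connections per gateway
	cap    int               // bounded-load ceiling per gateway
}

func hash(s string) uint32 { h := fnv.New32a(); h.Write([]byte(s)); return h.Sum32() }

func newRing(gateways []string, capPerGW int) *ring {
	r := &ring{owner: map[uint32]string{}, load: map[string]int{}, cap: capPerGW}
	for _, gw := range gateways {
		p := hash(gw)
		r.points = append(r.points, p)
		r.owner[p] = gw
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the first gateway at or after hash(userID) with spare capacity.
func (r *ring) pick(userID string) string {
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= hash(userID) })
	for i := 0; i < len(r.points); i++ {
		gw := r.owner[r.points[(start+i)%len(r.points)]]
		if r.load[gw] < r.cap {
			r.load[gw]++
			return gw
		}
	}
	return "" // fleet saturated
}

func main() {
	r := newRing([]string{"gw-0", "gw-1", "gw-2"}, 2)
	for i := 0; i < 6; i++ {
		fmt.Println("user", i, "->", r.pick(fmt.Sprintf("user-%d", i)))
	}
}
```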
The earned-secret bit — mass reconnect storm on deploy.
When we deploy a new gateway binary, we drain gateways gracefully. Without care, each drained node dumps its ~2M connections at once (and we repeat this 500 times across the fleet). If those 2M reconnects land within 1s, the receiving gateways must absorb 2M full TLS handshakes; at ~2-4ms of CPU each (crypto + session setup), that is roughly 4,000-8,000 CPU-seconds of work in a single second — on the order of 8K cores. No single node, nor a small slice of the fleet, has that to spare.
Mitigations in layers:
- Slow drain. When a gateway is marked for drain, it sends a GOAWAY frame with reconnect_after_ms = rand(0, 60000). Clients wait a random interval, then reconnect. This spreads 2M reconnects over 60s ≈ 33K/s across the fleet — manageable.
- TLS session resumption. Redis-backed TLS ticket store; a reconnect presents its ticket and skips the full handshake, cutting per-handshake cost from ~4ms to ~0.4ms CPU. Earned-secret note: this requires careful ticket-key rotation — rotate too fast and you lose resumption; too slow and you keep forward-secrecy risk. WhatsApp rotates every 12h.
- Reconnect hint from server. When a client connects, the gateway includes preferred_gateway_dns=gw-17.region-us-east.example.com in the handshake response. On subsequent reconnects the client dials that DNS name directly, bypassing the LB. This pins clients and avoids LB storms.
- Client-side exponential backoff with jitter. min(max_delay, base × 2^attempt) × rand(0.5, 1.5), base = 1s, max = 60s. Most clients reconnect within 30s; the tail extends to ~90s (sketched below).
- Kernel-level tuning. SO_REUSEPORT for socket-accept scaling; net.core.somaxconn=65535; kTLS offload so handshake crypto runs in the kernel without a user↔kernel copy. This is the WhatsApp/Meta secret sauce: "we hold 2M conns/node because memory is dominated by TLS state, not the message buffer, so kernel TLS offload matters more than app-heap tuning."
- Blast-radius limit via cell architecture. Divide the gateway fleet into 20 cells of 25 gateways each. A bad deploy rolls out cell-by-cell; a crash storm in one cell drives at most ~50M reconnects (25 nodes × 2M conns), not the full 750M.
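A sketch of the client-side reconnect delay combining the jittered exponential backoff and the per-user hash spread from the list above; the constants are as stated, the function name is illustrative:

```go
// Reconnect backoff: capped exponential, ×[0.5, 1.5) jitter, plus a
// deterministic hash(user_id) % 30s spread so a drain doesn't synchronize clients.
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"time"
)

const (
	baseDelay = 1 * time.Second
	maxDelay  = 60 * time.Second
)

// reconnectDelay computes the wait before reconnect attempt `attempt` (0-based).
func reconnectDelay(userID string, attempt int) time.Duration {
	d := baseDelay << attempt // 1s, 2s, 4s, ...
	if d > maxDelay || d <= 0 {
		d = maxDelay
	}
	jittered := time.Duration(float64(d) * (0.5 + rand.Float64())) // ×[0.5, 1.5)
	h := fnv.New32a()
	h.Write([]byte(userID))
	spread := time.Duration(h.Sum32()%30) * time.Second // per-user deterministic offset
	return jittered + spread
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println("attempt", attempt, "wait", reconnectDelay("alice", attempt).Round(time.Millisecond))
	}
}
```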
Real systems named.
- WhatsApp Erlang gateways: 2M conns/node (Rick Reed's 2014 talk), BEAM VM's lightweight processes are the model for our per-conn state. They solved the mass-reconnect by jittered backoff on the client.
- Slack's Flannel / Gumbo: routing proxy in front of channel servers, does exactly the "reconnect hint" pattern.
- Discord's Elixir + Gateway: same BEAM model, publicly known to handle 10M+ concurrent conns per cluster (2019 blog).
- Messenger / Iris: Meta's internal system. Public Meta engineering posts describe gateway sharding and fan-out separation aligned with this design.
Failure modes.
- Cold cache on new gateway. New node has empty session cache → every frame causes Redis lookup. Mitigation: pre-warm by replaying session table for the hash range the node will own.
- Split-brain during LB reconfig. The LB might briefly send the same user to two gateways. A gateway that detects a duplicate session_id sends a CONFLICT frame; the newer connection wins, the older one drops.
- Zombie connections (client device went offline without a TCP RST). TCP keepalive + an app-level heartbeat with a 90s deadline evicts zombies. At 750M conns and a small zombie fraction (assume ~0.003%), that is ~20K zombies in flight at any time — negligible.
8 Failure Modes & Resilience #
| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Gateway pod | Process crash | K8s liveness 5s | ~2M users (1 pod) | L4 LB reroute + client reconnect w/ jitter | <60s; users see 1 reconnect |
| Gateway rack | Rack-level power loss | Heartbeat timeout 10s | ~20M users | LB drains rack; cross-rack capacity headroom | <2min |
| Entire region | Data-center offline | Anycast BGP withdraw | All users in region (~50M) | Clients fall back to next region via GeoDNS; authoritative data replicated to 2+ regions | Region restored: ~hours; users work throughout |
| Redis hot-tail shard | Node loss | Sentinel + replica promote | ~250K channels hot-tail | Reads fall through to Cassandra (slower, +20ms); writes continue | Seconds (failover), fill cache over minutes |
| Redis presence shard | Node loss | As above | Presence for ~25M users | Degraded to "last_known_status" from Cassandra snapshot; presence inaccurate for ~1min | Rebuild from HBs over 2min |
| Cassandra hot shard (one channel melts) | Celebrity posts → one partition overload | Per-partition QPS alarm | One channel has degraded latency | Shadow-read from replica; write-through Redis buffers; emergency: split channel partition by hlc_ts range | Mins to split |
| Cassandra node | Disk failure | gossip + local liveness | RF=3 means one replica loss, still quorum | Auto-bootstrap replacement from peers; hinted handoff for missed writes | Hours to bootstrap TB |
| Commit log broker | Kafka/Bookkeeper broker loss | Heartbeat | 1/1000 brokers → no loss (RF=3) | Broker replacement; lag increases transiently | Minutes |
| Fan-out pub/sub topic | Hot topic (celebrity user) | Topic lag alarm | Deliveries to that shard delayed seconds | Dynamic shard split; consumer scale-out | <1min |
| Offline queue (Kafka) full | Push provider outage | Lag > threshold | Pushes delayed for offline users; messages still durable in Cassandra | Pause dispatcher; drain after recovery; deduplicate on resume (client uses server_msg_id) | Hours (catch-up), no data loss |
| APNs / FCM outage | Push provider 5xx | Error rate > 10% | Offline users don't get push notifs; they DO get msgs on reconnect | Exponential backoff, don't retry >1hr old; surface "silent delivery" | Hours typically |
| Presence flapping (bad network) | User oscillates online/offline every 30s | High presence-delta rate per user | Fan-out waste to subscribers | Debounce: require stable state ≥60s before broadcasting delta | Immediate (rule fires) |
| Mass reconnect storm | Deploy / fleet-wide restart | Handshake rate spike | Gateway fleet overload | Jittered backoff + slow drain + session resumption (Section 7.3) | <2min |
| Consistent-hash reshuffle | Adding/removing 10 gateways at once | LB reconfig events | Up to 10/500 = 2% of users reshuffle | Stagger gateway adds (1 per 30s); clients carry session across shuffle (session data in Redis) | Progressive |
| Spanner quorum loss (channels table) | Paxos majority lost | Write errors | Channel-create fails; messages in existing channels unaffected | Read-only mode on metadata; degrade gracefully | Minutes (usually self-healing) |
| Dedup Redis loss | Whole dedup cluster | Redis cluster heartbeat | 24h window of dedup data | Fall back to Cassandra-idempotent lookup (slower). Rebuild over 24h as new msgs flow | Hours |
| Clock skew between MSG-Leader nodes | NTP failure | Leader-to-leader skew alarm | HLC still correct (max-based), wall-clock on hlc_ts drifts | Alert SRE; NTP recovery; HLC preserves correctness | Mins |
Pager policy. SRE on-call pages on:
- Regional p99 > 2× target for 5 minutes
- Message loss > 0 in any 5-min window (any ack'd but not durable)
- Gateway fleet utilization > 85%
- Fan-out queue lag > 10s
- Cassandra write latency p99 > 100ms for 5 min
SLO error budget. 99.99% = 4.3min/mo. Each minor incident burns 0.1-0.5min. Budget exhaustion = freeze deploys.
9 Evolution Path #
v1 (MVP, single region, 1:1 only) — 3 months
- Single region (us-east-1), 20 gateways, 100 Cassandra nodes, Redis tail cache.
- 1:1 messaging only. No groups, no presence broadcast.
- Offline queue via simple per-user Kafka partition.
- Push via FCM/APNs direct.
- Multi-device NOT supported (one active device per user).
- Target: 5M DAU, p99 <300ms.
- Open questions at this stage: do we need Spanner for metadata or can we start with PostgreSQL per-shard? (Answer: PG for v1, migrate at v2.)
v2 (groups, multi-device, presence) — 9 months
- Add groups up to 1000, with tier-1 push fan-out.
- Add presence subsystem with coarse aggregation.
- Multi-device sync via user-shard pub/sub.
- Scale gateway fleet to 100 nodes; 3 regions (us-east, us-west, eu).
- Migrate metadata to Spanner for global transactional consistency.
- Introduce dedicated fan-out service (split from MSG service).
- Read receipts and typing indicators.
- Target: 100M DAU.
v3 (global scale, large groups, broadcast, E2EE, federation) — 18 months
- 20 regions; cross-region replication for messages (async, with read-your-writes for sender).
- Tier-2 and tier-3 fan-out (sharded push, pull for broadcast).
- Broadcast channels up to 100K members with pull-based delivery.
- E2EE opt-in per channel (Signal Double Ratchet for 1:1, Megolm for groups).
- Federation (inter-org like Slack Connect) — optional, gated behind enterprise tier.
- Cross-region HLC reconciliation (Section 10).
- Data locality compliance (India/China shards).
- Target: 500M DAU.
Build-vs-buy decisions at each phase.
- v1: Use AWS managed services (MSK for Kafka, ElastiCache for Redis, AWS Keyspaces for Cassandra-compat). Fast ship; pay for managed premium.
- v2: Migrate hot path to self-managed (Cassandra on EC2, Kafka on dedicated hosts). Managed cost starts exceeding engineering cost.
- v3: Bare-metal for gateways (for kTLS and NIC tuning); Cassandra on bare metal too. Custom fan-out (off-shelf pub/sub hits limits at 930M deliveries/s).
10 Out-of-1-Hour Notes #
10.1 End-to-End Encryption
Signal Protocol (Double Ratchet) for 1:1.
- Long-term identity keys + medium-term signed prekeys + one-time prekeys, exchanged via a key distribution server (the only plaintext-adjacent role the server plays).
- Each message encrypted with a fresh key derived from a ratchet (DH + symmetric chain).
- Forward secrecy (past messages safe if current key stolen) + future secrecy (compromise self-heals after next DH).
- Server sees ciphertext only — can store, fan-out, deliver, but cannot search, moderate content, or auto-translate.
Megolm / Sender Keys for groups.
- Naive N-to-N pairwise Double Ratchet explodes (1000-member group = 500K session states).
- Sender generates a "sender key" (symmetric chain) shared with each member via pairwise Double Ratchet — one-time cost at key-rotation.
- Each msg encrypted once with sender key; all members decrypt.
- Key rotation when member leaves (obviously — otherwise left member keeps decrypting).
- Matrix uses this model; Signal uses a variant.
Tradeoffs with server features.
- No server-side search — clients maintain local encrypted indices.
- No moderation on content — metadata-only moderation (report flow ships a cleartext excerpt via user opt-in).
- Push notification preview text: either (a) "New message" only, or (b) client shows preview using locally-cached key after wake — Signal does (b).
- Backups: either cloud-backup with user-derived key (optional, Signal PIN), or no backup.
10.2 Key Server
- Stores public keys, signed prekey bundles, one-time prekeys.
- On sender request "give me keys for user X's device D": returns bundle, burns one-time prekey.
- Prekey replenishment by client every N days or when count <20.
- Must resist key-server compromise: Trusted Identity via QR-code fingerprint verification (Signal does this).
10.3 Search (E2EE and non-E2EE)
- Non-E2EE: server-side Elasticsearch per-user index. Cost ~$0.05/MAU/yr additional.
- E2EE: client-side FTS over local cache. WhatsApp does this — search only works for msgs on the device.
- Encrypted cloud-side search (BlindSearch, Ciphertext Keyword Search) — research-grade, not prod-ready.
10.4 Moderation Pipeline
- Non-E2EE: ML classifier on every msg (toxicity, CSAM, spam). Pre-publish intercept for severe; post-hoc takedown for mild.
- PhotoDNA / CSAM hashing on uploaded images. Report to NCMEC required by law.
- Report flow: user taps "report", client uploads an explicit cleartext bundle (even from E2EE chat — this is Apple's model for iMessage) + sender metadata + msg context window.
- Apple's Communication Safety classifier runs client-side even for E2EE (controversial but real).
10.5 Retention and Legal Holds
- Default: infinite retention, user-deletable.
- Configurable per-user or per-channel TTL.
- Legal hold override: SRE-applied tag that bypasses TTL for specified users/channels (GDPR-regulated countries require judicial order).
- GDPR right-to-erasure: when user requests deletion, scrub PII and sent msgs. Problem: can't scrub received copies in others' inboxes — document in ToS that your identity remains in others' histories unless they also delete.
10.6 Spam and Abuse
- Rate limits per user globally (500 msgs/hr) and per channel (10 msgs/min).
- Bulk-send detection: user with 100 new 1:1s in 10min → captcha / SMS verify.
- Reputation score per account (age, verified phone, social signals).
- Honeypot accounts for spam detection.
10.7 Cross-Device Key Sync
- Multi-device E2EE: each device has its own identity key. Register device = sign new device key with existing device's key ("device auth").
- On registering a new device, the server fans out to the user's other devices: "pending new device, approve?". User confirms on an existing device.
- Message encryption to user U = encrypt once per active device (small N, usually ≤10).
10.8 QoS and Priority
- 1:1 msgs priority=high, <100ms SLA.
- Group <100 members: priority=medium.
- Broadcast channels: priority=low, can tolerate 5s delivery.
- Control frames (typing, presence): priority=best-effort, first-to-drop under pressure.
- Implement via pub/sub topic per priority + gateway consumer weights.
10.9 Cost per MAU
| Component | Peak demand | Annual cost at scale |
|---|---|---|
| Gateway fleet | 800 nodes × $8K/yr | $6.4M |
| Cassandra | 5000 nodes × $12K/yr | $60M |
| Redis | 500 nodes × $10K/yr | $5M |
| Kafka/pub-sub | 1500 nodes × $8K/yr | $12M |
| Network egress (inter-region replication + client delivery) | ~900 GB/s avg × $0.02/GB | $570M(!) |
| Storage | 9 PB × $0.02/GB-mo | $2.2M |
| Push provider | APNs/FCM free; SMS fallback | ~$20M |
| Total | | ~$676M/yr |
| Per MAU (assume MAU ≈600M) | | ~$1.13/MAU/yr |
Network egress dominates — a real optimization is reducing cross-region replication (only replicate recently-active channels) and using client-pull for broadcast (already in design).
10.10 Client Backgrounding and Battery
- iOS/Android kill background TCP connections after ~30s.
- Use push-to-wake: push carries msg metadata, app wakes and pulls body via HTTP (not WSS — WSS reconnect in background is costly).
- Android foreground service can hold WSS for active chats (WhatsApp model).
- iOS: background pull every 15min (APNs push handles immediacy).
- Battery-optimal: coalesce msgs into bundles; push N msgs as 1 bundle; reduce radio wakes from N to 1.
- Adaptive heartbeat: on wifi, 60s; on cell, 180s; on weak signal, 300s — trades presence-staleness for battery.
10.11 Federation (v3+)
- Inter-org federation like Matrix/XMPP or Slack Connect.
- Each org runs its own gateways + storage; federation via a signed message protocol between orgs.
- Directory discovery: DNS SRV records point to federation endpoint.
- Challenges: identity mapping, spam (federated identities can spoof), encryption (federated E2EE requires cross-org key exchange), retention (whose policy wins?).
- Design: introduce Federation Gateway as a distinct tier that translates internal pub/sub events to inter-org signed gRPC. Every cross-org msg is a signed capability.
10.12 Cross-region HLC reconciliation (when we go multi-master)
- v3 may allow writing to nearest region for the same channel (write-locality).
- Requires HLC plus a per-region component: hlc_ts + region_id.
- Conflict resolution: total order by (hlc_ts, region_id) — deterministic, but can reorder messages slightly vs causal expectation.
- Alternative: keep single-writer-per-channel, but move the writer to the region with the hottest activity (leader migration). Trades cross-region write latency for single-region read latency — akin to CockroachDB's leaseholder-placement / follower-reads pattern.
10.13 Observability
- Message tracing: every msg gets a trace_id (derived from client_msg_id). Sampled 1/1000 for tracing through every hop (gateway → MSG leader → commit log → fan-out → delivery gateway → peer). Jaeger/Tempo.
- Per-channel metrics: write QPS, p99 latency, fan-out factor, offline recipients %.
- Per-gateway metrics: connection count, CPU, kTLS ratio, p99 handshake time, zombie conn rate.
- SLO dashboard: weekly review.
- Synthetic probes: 1000 bot pairs globally sending heartbeat messages every minute, measuring end-to-end; alarms on regression.
Appendix A — Quick Reference: Chosen Tech Stack #
| Layer | Choice | Alternative considered | Why chosen |
|---|---|---|---|
| Client transport | WebSocket over TLS 1.3 + kTLS | HTTP/3 long-poll | Bi-directional, lower overhead, kernel-offloadable |
| Wire format | protobuf (binary) | JSON | 40% smaller, 3× faster parse |
| Load balancer | L4 Maglev-hash | L7 ingress | No TLS termination at LB (do it on gateway for kTLS); consistent hashing cheaper at L4 |
| Gateway | Go or Rust + epoll + kTLS | Erlang/BEAM | Rust/Go for ecosystem; BEAM is valid if team has Erlang expertise (Discord/WhatsApp) |
| Session store | Redis (sharded by user_id) | Memcached | Need pub/sub and data structures (sorted sets for history) |
| Message store | Cassandra (migrate to Scylla v2) | DynamoDB / Spanner / HBase | Wide-row write optimized, operational maturity, cost |
| Metadata store | Spanner | CockroachDB | Transactional global; Google-native |
| Commit log | Kafka w/ RF=3 | Pulsar, RabbitMQ | Throughput + log-replay semantics |
| Pub/sub (fan-out) | NATS JetStream or custom | Kafka | Lower latency for 930M msg/s fan-out; Kafka for durable offline queue |
| Object store | GCS / S3 | — | Media blobs |
| Push | APNs + FCM | — | No alternative |
| Config / service discovery | Consul or etcd | ZooKeeper | Modern, battle-tested |
Appendix B — Pre-Interview Verification Checklist (answers to the prompt's 4 verifications) #
- SRE pager-carryable? Yes — §8 enumerates every failure mode with detection, blast radius, mitigation, runbook. Clear SLO budget and pager triggers.
- Every diagram arrow → real API/data flow? Yes — every gRPC call is defined in §4.3; every WSS frame in §4.1; storage writes in §5 with chosen engine per arrow.
- Deep-dive L7 or L6? §7 goes to earned-secret depth: kTLS + TLS ticket rotation interaction, Meta/WhatsApp-specific operational patterns, single-writer lease fencing, Twitter celebrity-pull threshold tuning, consistent-hash bounded-load behavior during deploys.
- Conn count and msg-rate math consistent? Yes — §3.2 derives 750M conns → 500 gateways at 2M/node; 2.9M msg/s × 320 fan-out = 930M delivery/s; 930 GB/s egress divided across 500 gateways = 2GB/s per NIC at 60% utilization. Numbers all tie together.