Q9 Messaging & Notifications
Design an Online Chat / Messaging System
Support one-to-one and group chat with ordering, presence, offline sync, and read receipts.
1 Problem Restatement & Clarifying Questions #
Restatement. Build a globally distributed real-time chat platform supporting 1:1 DMs, medium groups, and one-to-many broadcast channels. Clients (iOS/Android/Web) maintain long-lived connections; the system must deliver every message at-least-once with client-side dedup, preserve per-conversation ordering, handle offline users via push and queued delivery, expose presence and read receipts, and survive regional failures without message loss.
Clarifying questions I'd ask a Google interviewer (and my assumed answers for this doc):
| # | Question | Assumed answer (drives design) |
|---|---|---|
| Q1 | Target scale? | 500M DAU, ~3M msgs/s peak write, ~750M concurrent WebSocket connections after multi-device expansion (derived in §3.2). |
| Q2 | Group size distribution? | 1:1 dominant (~70% of sends); groups ≤1000 members, avg ~20 (~28% of sends); broadcast channels up to 100K members (~2% of sends). |
| Q3 | Ordering guarantee? | Per-conversation total order visible to all members, causal across conversations for a single user's outbox. No global order across channels. |
| Q4 | Durability vs latency? | No message loss is non-negotiable (durable before ack). p99 send-to-deliver <200ms intra-region when both parties online. |
| Q5 | E2EE required? | Optional (per-channel flag). Server must work without plaintext; design for ciphertext storage with server-side fan-out over opaque blobs. Detailed crypto deferred to Section 10. |
| Q6 | Retention? | Default infinite (like iMessage/WhatsApp), with per-user/per-channel TTL knobs. Legal-hold out of scope for the 60-min interview. |
| Q7 | Federation (XMPP / Matrix style)? | No for v1/v2. Optional v3 inter-org federation (think Slack Connect). |
| Q8 | Multi-device? | Yes — user can have ≤10 active devices. Messages must sync to all, including those offline at send time. |
| Q9 | Media limits? | Up to 100MB per attachment; uploaded to blob store via pre-signed URL, message carries pointer + thumbnail. |
| Q10 | Read receipts? | Opt-in per-user (UX requirement, common in FB Messenger). |
| Q11 | Regulatory? | GDPR right-to-erasure, China/India data-locality. Drives per-user region affinity (Section 9 v3). |
Why these questions matter. Q2 and Q8 alone pivot the fan-out strategy — a 100K-member channel cannot use a naive per-recipient-inbox write, and multi-device turns "fan-out to 5 recipients" into "fan-out to ~15 device sessions" at the ~3-device average. Q3 decides whether we need total order per channel (yes) or global (no — vector clocks across users would be crushing). Q5 determines whether the server can do search/spam/moderation (no for E2EE channels — huge UX consequence).
2 Functional Requirements #
In scope (numbered):
1. Send 1:1 message with durable ack, idempotent via client_msg_id.
2. Group chat ≤1000 members, same ack semantics; fan-out transparent to sender.
3. Broadcast channels ≤100K members (one-way-dominant, reader fan-out, e.g. "announcements").
4. Typing indicators (ephemeral, best-effort, lossy).
5. Read receipts (per-user per-message last_read_msg_id advance).
6. Presence (online/away/offline/last_seen_ts), coarse aggregation.
7. Offline delivery: server queues until recipient reconnects OR delivers via APNs/FCM push.
8. Media attachments: images/video/files up to 100MB via pre-signed S3/GCS URL; server stores only metadata.
9. History sync / pagination: cursor-based back-scroll.
10. Multi-device sync: all devices see the same conversation state; messages sent from one device appear "sent" on the others.
11. Delete for me / delete for everyone (time-bounded).
12. Reactions (emoji annotations on messages).
Out of scope for this 60-min interview (covered in Section 10):
- Voice/video calling (entirely separate SFU architecture)
- ML-based moderation + report pipeline
- Search across encrypted content
- Bots / integrations (Slack apps, webhook)
- File preview generation
- Message translation
3 NFRs + Capacity Estimate #
3.1 Non-Functional Requirements
| NFR | Target | Notes |
|---|---|---|
| Availability | 99.99% (≈52 min/yr downtime) | per-region; global via isolation |
| p99 send-to-deliver (online) | <200ms intra-region, <350ms cross-region | ack-from-server + deliver-to-peer |
| p99 send-to-ack | <80ms (durable on primary replica) | what sender's spinner waits for |
| Durability | No loss after server-ack; RPO = 0 within region | replicated commit log before ack |
| Ordering | Per-conversation monotonic; causal per-sender | HLC-based |
| Delivery | At-least-once + client dedup | never at-most-once |
| Consistency | Read-your-writes on sender's devices; eventual across group members (bounded to seconds) | |
| Throughput headroom | 2× peak steady-state | holiday surges (NYE messages ~3× normal) |
| Cost | <$0.30 per MAU/yr on messaging infra | rough order; compare WhatsApp famously ~$0.50/MAU at $22B acq |
3.2 Capacity Back-of-Envelope (show the arithmetic)
Users and messages.
- 500M DAU × 50 msgs/day sent ≈ 25B msgs/day.
- 25B / 86,400s ≈ 290K msgs/s average write.
- Peak-to-average ratio ~10× for chat (lunch/evening spikes, plus NYE): 2.9M msgs/s peak write.
- Average payload (text + metadata): ~1KB. Media msgs only carry pointer, so avg stays ~1KB.
- Peak write bytes: 2.9M × 1KB ≈ 2.9 GB/s ingested.
Fan-out multiplier.
- Per-message recipients: 1:1 avg=1 recipient, group avg=20 (assume), broadcast avg=5K for the 2% channel-posts.
- Weighted: 0.70×1 + 0.28×20 + 0.02×5000 = 0.7 + 5.6 + 100 = 106.3 avg delivery events per send.
- Multi-device ×3 on average (phone + web + tablet) → ~320 delivery events / sent message.
- Peak delivery rate: 2.9M × 320 ≈ 930M delivery events/s (this is the fan-out pipeline QPS, not the storage QPS).
- This is why we separate a fan-out service from the message store: storage QPS is bounded by unique channel writes, fan-out QPS scales with recipients.
Storage.
- 25B msgs/day × 1KB = 25 TB/day raw. × 3 replicas = 75 TB/day.
- With compression (~3× on text): ~25 TB/day effective.
- Annual: ~9 PB/yr. Over 10-year retention, ~90 PB — comparable to WhatsApp's known archival.
- Hot-tier (last 7 days or last 100 msgs/channel, whichever smaller): 25TB×7 ≈ 175TB. Redis can't hold this raw — we cache only last 50 msgs/active channel. 500M DAU × avg 5 active channels × 50 msgs × 1KB = 125 TB hot cache across ~2000 Redis shards at 64GB usable each. In practice we tier: Redis for last-50 tail, local SSD cache on message service for last-1000, Cassandra beyond.
Connections / gateways.
- 500M DAU, avg 50% online concurrent during peak hours + 3 devices each → 750M concurrent WebSocket connections at peak.
- WebSocket memory: ~16KB per connection for userspace + ~32KB kernel (TCP + TLS state). Call it 48KB total.
- Earned-secret aside: TLS session state dominates app-level buffers. Without kernel TLS offload (kTLS), 750M conns × 32KB kernel ≈ 24TB of kernel memory. This is why WhatsApp's BEAM/Erlang gateways and Meta's Iris use kTLS — it amortizes crypto state and avoids the kernel→user copy.
- Per-node budget: assume 2M connections/gateway (matches WhatsApp's public figure of 2M+ per Erlang node in 2014, achievable on 256GB box).
- Gateway fleet: 750M / 2M = ~375 gateways global, round to ~500 for headroom + regional distribution.
- With a generous margin and N+2 per region across 20 regions: ~800 gateway nodes.
Fan-out pipeline.
- 930M deliveries/s, each ~500 bytes over internal bus = 465 GB/s internal bandwidth.
- A Kafka-style commit log at 500MB/s per broker → ~1000 brokers needed just for fan-out. In practice we use an in-memory pub/sub (NATS or custom) for the hot path + Kafka as the durable queue of offline deliveries. Most deliveries (~80%) go synchronously to online peers and never touch Kafka.
Offline queue.
- Of 930M deliveries/s, ~30% recipients offline → 280M/s queued.
- Each offline entry: channel_id, msg_id, user_id (~50 bytes) = 14 GB/s into queue store.
- Queue depth: assume avg offline duration 6h per user per day. 280M/s × 21,600s = 6T entries peak queue. With dedup per (user, channel): actual depth bounded by active channels per user × coalesce factor → ~100B entries, ~5TB.
Push notifications.
- Offline users receiving msgs → push via APNs/FCM. Expect 60% of offline-delivery-events trigger a push (rest are batched/coalesced).
- Peak push rate: 280M × 0.6 = 168M pushes/s — this exceeds APNs/FCM per-account quotas; we need push-provider sharded by user region + aggressive coalescing (one push per channel per 30s window).
Sanity check. Peak 2.9M send/s with 320 fan-out = 930M/s. At 1KB per message, network egress from fan-out tier = 930 GB/s. Divided across 500 gateways = ~2 GB/s per gateway egress. A single 25Gbps NIC does 3.1 GB/s, so we're at 60% NIC utilization at peak — feasible with headroom.
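The estimate above is easy to re-derive. A minimal, runnable sketch (Go) that reproduces the headline numbers from the assumptions already stated in this section — none of the constants are measurements:

```go
// Back-of-envelope check for §3.2. All inputs are the stated assumptions
// (500M DAU, 50 msgs/day, 10× peak, 70/28/2 send mix, ~3 devices, 2M conns/GW).
package main

import "fmt"

func main() {
	const (
		dau         = 500e6
		msgsPerDay  = 50.0
		peakToAvg   = 10.0
		devicesAvg  = 3.0
		onlineShare = 0.5
		connsPerGW  = 2e6
	)

	sendsPerDay := dau * msgsPerDay       // 25B sends/day
	avgSendQPS := sendsPerDay / 86400     // ~290K/s
	peakSendQPS := avgSendQPS * peakToAvg // ~2.9M/s

	// Weighted recipients per send: 70% 1:1, 28% groups (avg 20), 2% broadcast (avg 5K).
	recipients := 0.70*1 + 0.28*20 + 0.02*5000   // ~106.3
	deliveriesPerSend := recipients * devicesAvg // ~320
	peakDeliveries := peakSendQPS * deliveriesPerSend

	conns := dau * onlineShare * devicesAvg // ~750M concurrent
	gateways := conns / connsPerGW          // ~375, rounded up for headroom

	fmt.Printf("peak sends/s: %.1fM\n", peakSendQPS/1e6)
	fmt.Printf("deliveries per send: %.0f\n", deliveriesPerSend)
	fmt.Printf("peak deliveries/s: %.0fM\n", peakDeliveries/1e6)
	fmt.Printf("concurrent conns: %.0fM, gateways: %.0f\n", conns/1e6, gateways)
}
```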
4 High-Level API #
Three protocols for three concerns: WebSocket (client↔gateway, bidirectional chat frames), HTTPS REST (setup, uploads, history), gRPC (internal server↔server).
4.1 WebSocket frames (client ↔ gateway)
Wire: binary frames, MessagePack or protobuf (not JSON — saves ~40% bytes, matters at 2.9M/s). Each frame has {type, request_id, payload}. request_id echoed in ack for correlation.
C→S SEND {channel_id, client_msg_id(UUID), hlc_ts_hint, content, attachments[]}
S→C SEND_ACK {client_msg_id, server_msg_id, server_ts, hlc_ts, status}
S→C DELIVER {channel_id, server_msg_id, hlc_ts, sender, content, ...}
C→S DELIVER_ACK {server_msg_id} — client acks receipt (for read cursor & offline queue drain)
C→S TYPING {channel_id, is_typing}
S→C TYPING {channel_id, user_id, is_typing}
C→S READ {channel_id, up_to_msg_id}
S→C READ {channel_id, user_id, up_to_msg_id}
C→S PRESENCE_HEARTBEAT {status} — every 30s
S→C PRESENCE {user_id, status, last_seen} — only deltas pushed
C→S PULL_HISTORY {channel_id, before_msg_id, limit}
S→C HISTORY_PAGE {channel_id, msgs[], next_cursor}
C→S REACT {channel_id, msg_id, emoji, op:add|remove}
S→C ERROR {request_id, code, retryable, backoff_hint_ms}
S→C RESUME_HINT {session_token, preferred_gateway_dns} — sent on reconnect for stickiness
Idempotency. client_msg_id is the dedup key. Server stores (user_id, client_msg_id) → server_msg_id for 24h in Redis. A retry returns the original server_msg_id → client replaces its optimistic entry.
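A minimal sketch of that dedup path, with an in-memory DedupStore standing in for the Redis table keyed by (user_id, client_msg_id); the type and method names are illustrative, and a real implementation would use an atomic set-if-absent with a 24h TTL:

```go
// Send-side idempotency: a retry with the same client_msg_id must map back to
// the original server_msg_id instead of producing a second append.
package main

import (
	"fmt"
	"sync"
	"time"
)

type DedupStore struct {
	mu sync.Mutex
	m  map[string]string // (user_id, client_msg_id) -> server_msg_id
}

func NewDedupStore() *DedupStore { return &DedupStore{m: map[string]string{}} }

// PutIfAbsent records the mapping unless a prior attempt already stored one;
// either way it returns the server_msg_id to echo back in SEND_ACK.
func (d *DedupStore) PutIfAbsent(userID, clientMsgID, serverMsgID string, ttl time.Duration) string {
	d.mu.Lock()
	defer d.mu.Unlock()
	key := userID + ":" + clientMsgID
	if existing, ok := d.m[key]; ok {
		return existing // retry: return the original id, do not re-append
	}
	d.m[key] = serverMsgID // TTL eviction elided in this in-memory sketch
	return serverMsgID
}

func main() {
	store := NewDedupStore()
	first := store.PutIfAbsent("alice", "uuid-123", "srv-0001", 24*time.Hour)
	retry := store.PutIfAbsent("alice", "uuid-123", "srv-0002", 24*time.Hour)
	fmt.Println(first, retry) // srv-0001 srv-0001 — the retry collapses onto the original
}
```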
Flow control. Client maintains a per-connection sliding window (max 64 unacked frames). Backpressure via TCP naturally; on overload gateway sends SLOWDOWN frame with suggested delay.
4.2 HTTPS REST (setup and things that aren't message traffic)
POST /v1/auth/login → {access_token, refresh_token, session_id, gateway_hint}
GET /v1/channels → {channels[], unread_counts}
POST /v1/channels (create group)
POST /v1/channels/{id}/members
POST /v1/attachments/presign → {upload_url, download_url, expires_at, attachment_id}
GET /v1/channels/{id}/messages?before=&limit=
DELETE /v1/messages/{id}
POST /v1/devices (register device, receive FCM/APNs token)
DELETE /v1/devices/{device_id}
4.3 gRPC server↔server
MessageService
rpc Append(AppendReq) returns (AppendResp) // hot path write
rpc GetTail(GetTailReq) returns (stream Message)
rpc GetRange(GetRangeReq) returns (stream Message)
FanoutService
rpc Publish(PublishReq) returns (PublishResp) // push to online gateways via pubsub
rpc Enqueue(EnqueueReq) returns (EnqueueResp) // persist for offline
PresenceService
rpc Heartbeat(HeartbeatReq) returns (HeartbeatResp)
rpc Subscribe(SubReq) returns (stream PresenceDelta)
SessionService
rpc Authenticate(AuthReq) returns (AuthResp)
rpc ResolveUser(UserReq) returns (UserResp) // cacheable
4.4 Canonical message schema
message ChatMessage {
string channel_id = 1; // partition key
bytes server_msg_id = 2; // 128-bit: 64b HLC + 64b channel-local-seq
uint64 hlc_ts = 3; // hybrid logical clock, 48b physical + 16b logical
string client_msg_id = 4; // UUIDv4 from client
string sender_user_id = 5;
string sender_device_id = 6;
uint64 server_recv_ts = 7; // ms since epoch, server wall clock
oneof payload {
TextPayload text = 10;
AttachmentPayload attachment = 11;
SystemPayload system = 12; // e.g. "Alice added Bob"
EncryptedPayload encrypted = 13; // E2EE ciphertext + key ref
}
repeated Reaction reactions = 20;
EditHistory edits = 21;
MessageFlags flags = 22; // deleted, edited, ephemeral
}
5 Data Schema #
Three storage engines for three access patterns: wide-row append log (Cassandra), hot-tail cache (Redis), and metadata with transactions (Spanner/FDB).
5.1 messages — Cassandra (wide-row, partition = channel)
CREATE TABLE messages (
channel_id text,
hlc_ts bigint, // clustering key, DESC
server_msg_id blob,
sender_user_id text,
sender_device_id text,
payload blob, // protobuf-encoded
server_recv_ts timestamp,
flags int,
PRIMARY KEY ((channel_id), hlc_ts, server_msg_id)
) WITH CLUSTERING ORDER BY (hlc_ts DESC);
Append-optimized, channel-local tail read is a single partition sequential scan. TTL at table level for retention policies.
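How the PULL_HISTORY cursor maps onto this table: each page is the range "channel_id = ? AND hlc_ts < ? LIMIT ?", and the next cursor is the oldest hlc_ts in the page. A small sketch, assuming this access pattern — the partition is simulated in memory and the equivalent CQL is shown in the comment:

```go
// Cursor-based back-scroll over a channel partition, newest page first.
package main

import (
	"fmt"
	"sort"
)

// SELECT hlc_ts, server_msg_id, payload FROM messages
// WHERE channel_id = ? AND hlc_ts < ? ORDER BY hlc_ts DESC LIMIT ?;

type msg struct{ hlc int64 }

// pageBefore returns up to `limit` messages strictly older than cursor, plus
// the next cursor (0 once history is exhausted).
func pageBefore(partition []msg, cursor int64, limit int) (page []msg, next int64) {
	// partition is sorted ascending by hlc_ts.
	i := sort.Search(len(partition), func(i int) bool { return partition[i].hlc >= cursor })
	start := i - limit
	if start < 0 {
		start = 0
	}
	page = partition[start:i]
	if start == 0 {
		return page, 0
	}
	return page, page[0].hlc // oldest hlc_ts in this page becomes the next cursor
}

func main() {
	var part []msg
	for ts := int64(1); ts <= 7; ts++ {
		part = append(part, msg{hlc: ts})
	}
	cursor := int64(1) << 62 // "start from the newest"
	for cursor != 0 {
		var page []msg
		page, cursor = pageBefore(part, cursor, 3)
		fmt.Println(page, "next cursor:", cursor)
	}
}
```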
Why Cassandra, not X?
| Alternative | Verdict | Quantified reason |
|---|---|---|
| DynamoDB | Rejected | Per-partition write throughput cap 1000 WCU; a celebrity channel at 10K msg/s needs multi-partition hashing that breaks ordering. Cost at 25B msgs/day is ~$50M/yr vs. Cassandra self-managed ~$8M/yr hardware. |
| HBase | Rejected | HDFS/ZooKeeper op-burden too high for always-on; p99 write ≥50ms during region splits; compaction storms freeze regions. WhatsApp used early, migrated away. |
| Scylla | Strong contender | C++ rewrite of Cassandra, ~5× per-core throughput, shared-nothing per core. Discord migrated to ScyllaDB 2022 for exactly this workload. I'd pick Scylla if greenfield — but Cassandra has deeper ops maturity at $BIGCO. Document Scylla as v2 upgrade path. |
| Spanner | Rejected | Paxos commit latency 10-50ms intra-region, unnecessary global consistency. $100M/yr+ at this scale. Use Spanner for channel metadata only (Section 5.3). |
| FoundationDB | Rejected for messages | Transaction size limits (10MB, 5s). Works great for channel metadata. Snowflake + iCloud use it for metadata. |
| Kafka as primary | Rejected | Great as append log buffer, but poor random-read for history pagination. Use Kafka as the commit log in front, Cassandra as the queryable store (covered in Section 6). |
5.2 inbox_per_user — Redis + Spanner (user-sharded)
Tracks per-user per-channel read cursor and unread count. Small, hot, transactional.
inbox:{user_id}:{channel_id} → {last_read_hlc, unread_count, muted, last_delivered_hlc}
user_channel_list:{user_id} → sorted set of (channel_id, last_activity_hlc)
Authoritative in Spanner (for cross-device consistency — read receipts must not regress), cached in Redis for sub-ms lookup on gateway attach. Updates use a CAS on last_read_hlc so a stale READ from another device can never move the cursor backward.
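A sketch of that monotonic cursor advance; the in-memory Inbox type stands in for the Redis/Spanner compare-and-set and its names are illustrative:

```go
// Read-cursor advance: READ frames can arrive out of order across devices,
// so last_read_hlc only ever moves forward.
package main

import (
	"fmt"
	"sync"
)

type Inbox struct {
	mu          sync.Mutex
	lastReadHLC map[string]int64 // key = user_id:channel_id
}

func NewInbox() *Inbox { return &Inbox{lastReadHLC: map[string]int64{}} }

// AdvanceRead applies a READ {channel_id, up_to_msg_id} and returns the
// resulting cursor; stale updates (upTo <= current) are ignored.
func (in *Inbox) AdvanceRead(userID, channelID string, upTo int64) int64 {
	in.mu.Lock()
	defer in.mu.Unlock()
	key := userID + ":" + channelID
	if cur := in.lastReadHLC[key]; upTo <= cur {
		return cur
	}
	in.lastReadHLC[key] = upTo
	return upTo
}

func main() {
	in := NewInbox()
	fmt.Println(in.AdvanceRead("alice", "ch1", 100)) // 100
	fmt.Println(in.AdvanceRead("alice", "ch1", 90))  // still 100: stale READ from another device
	fmt.Println(in.AdvanceRead("alice", "ch1", 120)) // 120
}
```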
5.3 channels — Spanner
channels (channel_id, type, creator, created_ts, settings_json, is_e2ee, member_count)
channel_members (channel_id, user_id, role, joined_ts, left_ts, notifications)
Transactional because "add member" must atomically: insert row, bump member_count, notify, and update user's channel list. ~200K channels created/sec peak during viral events; Spanner handles that easily. 1 Spanner instance cluster per region.
5.4 presence — Redis (sharded by user_id)
presence:{user_id} → {status, last_heartbeat_ms, devices:[{id, gateway, last_seen}]}
TTL 90s (3× heartbeat). On expiry, user transitions offline via keyspace-notification → presence aggregator recomputes "was online, now offline" delta.
5.5 attachments — metadata in Spanner, blob in S3/GCS
attachments (attachment_id, owner_user_id, content_type, size_bytes,
sha256, blob_url, thumbnail_url, uploaded_ts, ttl)
5.6 offline_queue — Kafka (per-user topic-or-partition)
Not actually one topic per user — that would be far too many topics. Instead a fixed pool of ~10K partitions, with user_id % 10000 as the partition key. Each entry: (user_id, channel_id, msg_id, delivered_bit). The fan-out service is the consumer; it drains a user's entries on reconnect.
6 System Diagram (Centerpiece) #
6.1 Full end-to-end view
┌──────────────────────────┐
│ Control Plane │
│ ┌────────────────────┐ │
│ │ Config / Feature │ │
│ │ Flags (Consul) │ │
│ └────────────────────┘ │
│ ┌────────────────────┐ │
│ │ Capacity Autoscaler│ │
│ │ (HPA + custom) │ │
│ └────────────────────┘ │
│ ┌────────────────────┐ │
│ │ Rate Limiter │ │
│ │ (per-user/chan) │ │
│ └────────────────────┘ │
└──────────────────────────┘
▲ config / policy
│ push
┌─────────────┐ TLS 1.3 ┌───────────────────────┐ WSS (long-lived) │
│ iOS Client │ ───────────────▶ │ Anycast GeoDNS / │ ──────────────────────┐ │
│ Android │ │ L4 LB (Maglev-hash │ │ │
│ Web │ ◀────────────── │ by user_id) │ ◀──────────────────┐ │ │
└─────────────┘ WSS binary └───────────────────────┘ │ │ │
│ sticky │ │ ▼ │
│ reconnect │ L7 routes ┌─────────────────────────────────────┐
│ (user_id%N) ▼ │ Gateway Fleet (~500 nodes) │
│ ┌──────────────────────────────────┐ │ ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐ │
└──────────────────────▶│ Gateway Router / Session-Resolver│──────▶│ │GW-0 │ │GW-1 │ │GW-2 │ │ ... │ │
│ resolves user → home-gateway │ │ │2M │ │2M │ │2M │ │ │ │
│ (consistent-hash w/ bounded load)│ │ │conn │ │conn │ │conn │ │ │ │
└──────────────────────────────────┘ │ │kTLS │ │kTLS │ │kTLS │ │ │ │
│ └──┬──┘ └──┬──┘ └──┬──┘ └─────┘ │
│ │ │ │ │
│ │ (stateful: conn table, │
│ │ session cache, send buffer) │
│ ▼ ▼ ▼ │
└─────────────────────────────────────┘
│ gRPC (mTLS, H2)
┌───────────────────────┼──────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌──────────────────────┐ ┌───────────────────────────┐ ┌──────────────────────────┐
│ Session Service │ │ Message Service (MSG-*) │ │ Presence Service │
│ - authN/Z │ │ - per-channel leader │ │ - HB aggregator │
│ - device registry │ │ via lease (single-writer) │ │ - online set (Redis) │
│ - token refresh │ │ - HLC assignment │ │ - delta publisher │
│ - multi-device keys │ │ - append to commit log │ │ - fanout-on-change only │
└──────────┬───────────┘ │ - sync-write Redis tail │ └──────────────┬───────────┘
│ │ - emits to fan-out bus │ │
▼ └──────┬──────────────┬─────┘ ▼
┌─────────────────┐ │ │ ┌──────────────────┐
│ Spanner: users, │ ▼ ▼ │ Redis Cluster │
│ devices, tokens │ ┌─────────────┐ ┌──────────────┐ │ presence:{uid} │
└─────────────────┘ │ Commit Log │ │ Redis HOT │ │ sharded by uid │
│ (Kafka-like)│ │ msgtail:{ch} │ └──────────────────┘
│ 1000 brkrs │ │ LRU 50 msgs │
│ sharded by │ │ per channel │
│ channel_id │ │ (~2000 shds) │
└──────┬──────┘ └──────────────┘
│
▼
┌───────────────────────┐
│ Cassandra/Scylla │
│ messages by channel │
│ RF=3, LOCAL_QUORUM │
│ ~5000 nodes global │
└───────────────────────┘
┌───────────────────────┐ ┌─────────────────────────┐
│ Fan-out Service │──────▶│ Internal Pub/Sub Bus │
│ - maps channel → │ │ (NATS JetStream / SPS) │
│ member list │ │ topic per user_shard │
│ - online? → bus │ └────────────┬────────────┘
│ - offline? → queue │ │
│ - celebrity? → pull │ ▼
└──────┬────────────────┘ │
│ │
│ offline │ online deliveries
▼ │ consumed by gateways
┌───────────────────────┐ │ subscribing to
│ Kafka: offline_queue │ │ their owned user_shards
│ 10K partitions │ │
└──────┬────────────────┘ │
│ │
▼ │
┌───────────────────────┐ │
│ Push Dispatcher │───▶ APNs / FCM │
│ - coalesce per 30s │ (per-region │
│ - per-device route │ push shards) │
└───────────────────────┘ │
│
┌───────────────────────────┘
│ (back to Gateway Fleet)
▼
Online delivery to peer client
Blob tier (media):
Client ──pre-signed PUT──▶ GCS/S3 ◀──pre-signed GET── Client (recipient)
Metadata only flows through message pipeline.
6.2 Sub-diagram — 1:1 send path (the hot path to optimize)
Alice (online) Bob (online, different gateway)
│ ▲
│ 1. WSS SEND {channel,text,client_msg_id} │ 9. WSS DELIVER
▼ │
GW-Alice ──2. gRPC Append(channel=c,msg)──▶ MSG-Leader(c) │
│ │ │
│ │ 3. HLC assign │
│ │ 4. CommitLog append (sync, RF=3 quorum) ─ durable
│ │ 5. Redis tail write (async)
│ │ 6. emit to Fan-out bus(channel=c)
│ ▼
│ Fan-out Svc
│ │
│ │ 7. resolve members: [Alice, Bob]
│ │ Alice online on GW-Alice ← already delivered via reflect
│ │ Bob online on GW-Bob
│ ▼
│ Pub/Sub topic user_shard(Bob)
│ │
│ 8. SEND_ACK (hlc,server_msg_id) ▼
◀───────────────────────────────────── GW-Bob (subscribed to Bob's shard) ─── 9. ▶ Bob
│
│
Alice receives SEND_ACK after step 4's durable commit.
Bob receives DELIVER after step 9.
Both may happen concurrently; SEND_ACK unblocks sender spinner.
Critical timing (intra-region, happy path):
- WSS frame parse: 0.5ms
- Gateway → MSG-Leader gRPC: 1ms
- HLC assign: <0.1ms
- Commit-log RF=3 quorum append: 8–12ms (p50)
- SEND_ACK back to Alice: ~15ms p50, ~40ms p99
- Fan-out emit + pub/sub hop: 5ms
- GW-Bob deliver: +3ms
- Total send→deliver p50: ~20ms, p99: ~60ms intra-region. Well under 200ms target.
6.3 Sub-diagram — Group fan-out (1:500 group)
MSG-Leader(ch=G)
│
│ emit {channel=G, hlc, payload}
▼
Fan-out Service (member list cached from Spanner: 500 user_ids)
│
│ bucket by user_shard (users hashed into 1000 shards)
│ result: 500 users map to ~300 unique shards (birthday paradox kicks in)
▼
Pub/Sub: publish to 300 topics in parallel
│
▼
Gateways subscribed to each shard pick up msg
│
│ for each online member: send DELIVER frame
│ for each offline member: enqueue in Kafka offline_queue
▼
Online: direct deliver. Offline: push via dispatcher after coalesce window.
6.4 Sub-diagram — Presence aggregation
Clients send HEARTBEAT every 30s via their gateway.
Gateway batches heartbeats (coalesce 1s window) → single gRPC to Presence Svc.
Presence Svc writes to Redis presence:{user_id} with 90s TTL.
A presence CHANGE (online→offline or vice-versa) triggers:
- Redis pub/sub on presence_delta topic
- Fan-out to only the users who are subscribed to this user's presence
(e.g. friends/contacts, or co-members of shared channels)
Subscription set cardinality capped at ~1000 per user; above that, presence for that user is "public" (pulled on demand, not pushed).
Cost: 500M users × (1 HB / 30s) = 16.6M HB/s. Coalesced 1s/GW → 500GWs × 1qps = 500 qps to Presence Svc for writes-in-bulk. Each bulk carries ~33K HBs. Presence Svc load becomes negligible; bottleneck is Redis TTL-expiry scan.
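A sketch of the gateway-side coalescing step described above: per-user heartbeats accumulate for a one-second window, then flush as a single bulk call to the Presence Service. The type and method names are illustrative and the flush is just printed here:

```go
// Heartbeat coalescing: many per-client heartbeats collapse into one bulk
// presence write per gateway per window.
package main

import (
	"fmt"
	"sync"
	"time"
)

type HeartbeatBatcher struct {
	mu      sync.Mutex
	pending map[string]time.Time // user_id -> latest heartbeat time
}

func NewHeartbeatBatcher() *HeartbeatBatcher {
	return &HeartbeatBatcher{pending: map[string]time.Time{}}
}

// Observe records one client heartbeat; duplicates within the window coalesce.
func (b *HeartbeatBatcher) Observe(userID string) {
	b.mu.Lock()
	b.pending[userID] = time.Now()
	b.mu.Unlock()
}

// Run flushes the pending set once per window as a single bulk write.
func (b *HeartbeatBatcher) Run(window time.Duration, flushes int) {
	for i := 0; i < flushes; i++ {
		time.Sleep(window)
		b.mu.Lock()
		batch := b.pending
		b.pending = map[string]time.Time{}
		b.mu.Unlock()
		fmt.Printf("bulk heartbeat write: %d users\n", len(batch)) // one gRPC per window
	}
}

func main() {
	b := NewHeartbeatBatcher()
	for i := 0; i < 1000; i++ {
		b.Observe(fmt.Sprintf("user-%d", i%300)) // 1000 HBs collapse to 300 users
	}
	b.Run(100*time.Millisecond, 1)
}
```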
6.5 Sub-diagram — Multi-device sync
Alice has 3 devices: phone(P), web(W), tablet(T). All open WSS to (possibly different) gateways.
On SEND from P:
- GW-P writes via MSG-Leader
- SEND_ACK returns to GW-P → Alice's phone
- Fan-out resolves Alice's channel membership → emits to Alice's user_shard
- GW-W and GW-T (subscribed to Alice's shard) each receive DELIVER
→ Web and Tablet show "sent" state (not a duplicate send indicator; server_msg_id tells them it's the same msg)
On reconnect (or first sync of a newly registered device) — see the sketch below:
- On reconnect, client sends RESUME {last_hlc_seen_per_channel[]}
- GW invokes Message Svc GetRange for gaps → catches up
- Client dedups via server_msg_id
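A sketch of the gap computation behind that RESUME flow; channelTail stands in for the Redis msgtail / Message Service tail read, and all names are illustrative:

```go
// RESUME gap check: compare the client's last_hlc_seen per channel against the
// channel tail and issue GetRange for whatever is missing.
package main

import "fmt"

type resumeReq struct {
	LastSeen map[string]int64 // channel_id -> last hlc_ts this device has
}

// gapsToFetch returns, per channel, the (from, to] HLC range the device is missing.
func gapsToFetch(req resumeReq, channelTail map[string]int64) map[string][2]int64 {
	gaps := map[string][2]int64{}
	for ch, seen := range req.LastSeen {
		if tail, ok := channelTail[ch]; ok && tail > seen {
			gaps[ch] = [2]int64{seen, tail} // GetRange(ch, seen, tail); dedup by server_msg_id
		}
	}
	return gaps
}

func main() {
	tails := map[string]int64{"ch-a": 250, "ch-b": 40}
	req := resumeReq{LastSeen: map[string]int64{"ch-a": 200, "ch-b": 40}}
	fmt.Println(gapsToFetch(req, tails)) // map[ch-a:[200 250]] — ch-b is already caught up
}
```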
7 Deep Dives (earned-secret depth) #
7.1 Deep Dive A — Ordering Guarantees: HLC + single-writer-per-channel vs RSM vs Vector Clocks
Why this is the hardest problem. Chat UX demands: "what I see in this conversation is what the next person sees, in the same order." Naive timestamp ordering fails — clocks skew, sends race. Vector clocks give partial order but UI can't render partial order. Paxos/Raft per channel gives perfect total order but 50ms consensus overhead per message. The right answer threads this needle.
Alternatives evaluated.
| Scheme | Ordering property | Latency cost | Operational cost | Verdict |
|---|---|---|---|---|
| Wall-clock timestamps + server arbitration | Best-effort; reordering on clock skew | ~0ms overhead | Zero | Rejected — iMessage reportedly hit this issue, occasional out-of-order renders |
| Vector clock per user | Causal partial order | ~0ms | Bloat: O(members) per message | Rejected — a 1000-member group has 1KB VC prefix; breaks down for E2EE where ciphertext can't include metadata cheaply; UI needs total order anyway |
| Lamport timestamps | Total order, no causal info | ~0ms | Zero | Rejected — doesn't respect wall-clock intuition ("I sent this at 3pm but it sorted before your 2:59pm msg") |
| HLC (Hybrid Logical Clock) | Total order + wall-clock monotonic + causal if servers see each other | ~0ms | 8-byte clock per msg | Chosen — combines wall-clock intuition (HLC ≥ wall time) with causality via logical counter |
| Raft per channel | Strong linearizable total order | 5-20ms extra per send | 5× storage (log + state machine), leader election storms on failure | Used for metadata (channel membership changes) but not every message |
| Single-writer-per-channel lease (our pick) | Total order via serialization at one node | ~1-2ms lease-check overhead (amortized) | Leadership transfer on failure | Chosen — MSG-Leader(channel) holds a 10s lease via etcd/Chubby; all writes for that channel serialize through it |
Chosen design — HLC + single-writer lease per channel.
- Each channel c has exactly one MSG-Leader at any time, elected via a lease in a channel_id → node Chubby/etcd table. Lease = 10s, renewed every 3s.
- Gateway routes the Append(c, msg) gRPC via consistent-hash on channel_id to the leader.
- Leader assigns hlc_ts = max(prev_channel_hlc + 1, wall_clock_now_ms << 16) (sketched below). This makes hlc_ts strictly increasing per channel AND roughly aligned with wall time.
- Leader writes to the commit log with RF=3 quorum. Only after quorum does it ack back.
- If the leader fails: the lease expires (10s) and a new node takes over. It reads the last hlc_ts from the commit log tail and continues. During the 10s window, that channel is write-unavailable — a conscious trade: 10s of unavailability for one channel vs global ordering inconsistency forever.
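A minimal sketch of the leader-side HLC assignment plus the lease fencing check described in the failure modes below. The 48b-physical/16b-logical packing follows this section; the type names and the elided storage hook are illustrative:

```go
// Per-channel HLC assignment and commit-log fencing on the single-writer path.
package main

import (
	"errors"
	"fmt"
	"time"
)

type ChannelLeader struct {
	prevHLC  uint64 // last hlc_ts issued for this channel
	leaseSeq uint64 // fencing token from the lease service
}

// nextHLC is strictly increasing per channel and tracks wall time:
// max(prev+1, now_ms << 16), leaving 16 bits for the logical counter.
func (l *ChannelLeader) nextHLC(nowMs uint64) uint64 {
	candidate := nowMs << 16
	if candidate <= l.prevHLC {
		candidate = l.prevHLC + 1 // clock stalled or jumped back: bump logical part
	}
	l.prevHLC = candidate
	return candidate
}

// append simulates the commit-log fencing check: writes carrying a stale
// lease_seq (an old leader waking up after a GC pause) are rejected.
func (l *ChannelLeader) append(storedLeaseSeq uint64, payload string) (uint64, error) {
	if l.leaseSeq < storedLeaseSeq {
		return 0, errors.New("fenced: stale lease, not retryable")
	}
	hlc := l.nextHLC(uint64(time.Now().UnixMilli()))
	_ = payload // RF=3 quorum append to the commit log would happen here
	return hlc, nil
}

func main() {
	leader := &ChannelLeader{leaseSeq: 7}
	hlc1, _ := leader.append(7, "hello")
	hlc2, _ := leader.append(7, "world")
	fmt.Println(hlc1 < hlc2) // true: per-channel total order
	_, err := (&ChannelLeader{leaseSeq: 6}).append(7, "stale") // old leader after failover
	fmt.Println(err)
}
```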
Earned-secret aside: Single-writer-per-channel is exactly what Slack's "Channel Server" (Gumbo) and Discord's Elixir/BEAM process-per-channel do. Each channel has a dedicated process holding the canonical state; messages serialize through it. WhatsApp's Erlang gateways use a similar model. The insight is that per-channel throughput is bounded (~1000 msg/s in the worst real case) — so serializing through one node is fine; what matters is that you have millions of channels distributed across thousands of nodes.
What HLC buys us over Lamport. A message with hlc_ts = 1705000000042:5 is readable: the first part is ms since epoch. When rendering, UI sorts by hlc_ts and the order matches wall-clock ±tiny logical counter. Lamport gives you only 42 which is unreadable and meaningless for UI. HLC is what CockroachDB and YugabyteDB use for exactly the same reason.
At-least-once + client dedup. Because we ack after the RF=3 quorum write, a retry reaches the leader with the same client_msg_id and gets back the already-assigned server_msg_id. Dedup is keyed by (user_id, client_msg_id) → server_msg_id in Redis, TTL 24h. This costs us 2.9M msgs/s × 40 bytes × 86,400s ≈ 10TB of Redis — feasible with ~200 shards (~50GB each).
Failure modes:
- Dual-leader split-brain during lease transfer. Mitigation: fencing tokens. Each append includes (lease_id, lease_seq); the commit log rejects any append whose seq is lower than the stored one. Same pattern as ZooKeeper zxid fencing.
- Clock-jump backward on the leader node. HLC guarantees monotonicity by taking max(wall, prev+1); a backward jump just means the logical counter increments more — still correct.
- Leader GC pause >10s. Worst case: a new leader is elected, then the old leader resumes and tries to ack stale writes. Fencing rejects them; the old leader sees the rejection and surfaces a retryable=false error. Alarming, but no data corruption.
7.2 Deep Dive B — Fan-out: Push vs Pull vs Hybrid (the celebrity problem, for groups)
Why critical. Whether to do write fan-out (copy message to each recipient's inbox at send time) or read fan-out (recipient pulls from channel's shared log at read time) is the pivot point of the whole architecture. Wrong choice breaks scale in either direction.
Alternatives.
| Strategy | How it works | Storage cost | Read latency | Works for |
|---|---|---|---|---|
| Pure push (per-recipient inbox write) | On send, write N copies (one per recipient) to N inboxes | N× storage | O(1) read — inbox is pre-materialized | Small groups, consistent fast reads |
| Pure pull (shared channel log) | Write once to channel log; readers query log on demand | 1× storage | O(log history) per read + network | Large groups, celebrity channels |
| Hybrid (tiered) | Push for small channels, pull for large | 1-N× (tunable) | Depends on tier | Realistic mixed workloads |
Math per strategy at our scale.
- Pure push: 2.9M sends/s × 106 avg recipients = 307M inbox writes/s. At 200 bytes per inbox entry = 61 GB/s into inbox storage. Cassandra can do this but wasteful — user rarely rereads offline msgs more than once.
- Pure pull: 2.9M sends/s, 1× channel-log write = 2.9 GB/s. BUT reads: 500M DAU × 100 reads/day × avg 5 msgs per read = 250B message-reads/day ≈ 2.9M msgs/s average (≈29M/s at the 10× peak). Read bandwidth on the channel-log storage becomes the bottleneck, and celebrity channels (Cristiano Ronaldo's broadcast channel with 500K members) see 500K concurrent readers pulling on every post.
Chosen design — Hybrid with three tiers:
- Tier 1 — Direct push (channel_size ≤ 1000): at send time, MSG-Leader emits to the fan-out bus; the fan-out service looks up the (cached) member list and publishes to each member's user_shard on pub/sub. Online gateways consume and deliver. Offline members → Kafka offline_queue with (user_id, channel_id, msg_id). No inbox materialization on the storage side — the Kafka queue IS the inbox for offline, and it auto-drains.
- Tier 2 — Sharded push (1000 < channel_size ≤ 10K): fan out through parallel worker pools, one per shard of the channel's membership. Each worker batches 1000 member-lookups and coalesces online vs offline. Introduces ~50ms p99 vs Tier 1's ~10ms, but distributes CPU. This is the WhatsApp approach for medium groups.
- Tier 3 — Pull (channel_size > 10K, i.e. broadcast channels): Do NOT push to recipients. Instead: write to channel log only. Recipient gateway subscribes to a channel-post notification pub/sub (tiny — just "channel X has new msg at HLC Y"). Recipient pulls message body on demand (when user opens channel or a push notification requires body). This is Twitter's approach for celebrity tweets, and Telegram's approach for broadcast channels.
The decision knob. Tier is picked from channel size at creation time, or dynamically (crossing a threshold triggers a migration). The 10K threshold for tier 3 follows from the §3.2 mix: broadcast posts are only ~2% of sends, but at ~5K recipients each they contribute ~100 of the ~106 average delivery events per send — roughly 94% of push fan-out volume. Pushing them would dominate fan-out cost; serving them by pull instead drops the push fan-out to ~6 delivery events per average send, better than an order-of-magnitude saving for the top 2% of sends.
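A sketch of the tier decision and the shard-bucketing step. The thresholds mirror the tiers above and the 1000-shard count mirrors §6.3; the actual bus publish and offline enqueue are elided, and all names are illustrative:

```go
// Tier selection plus grouping of members by user_shard, so the pub/sub
// publish is one message per shard rather than one per member.
package main

import "fmt"

type Tier int

const (
	TierDirectPush Tier = iota + 1
	TierShardedPush
	TierPull
)

func pickTier(memberCount int) Tier {
	switch {
	case memberCount <= 1000:
		return TierDirectPush
	case memberCount <= 10000:
		return TierShardedPush
	default:
		return TierPull
	}
}

func shardOf(userID string) int {
	h := 0
	for _, c := range userID {
		h = h*31 + int(c)
	}
	return (h%1000 + 1000) % 1000
}

func fanOut(members []string, tier Tier) {
	if tier == TierPull {
		fmt.Println("pull tier: publish only a 'channel has new msg at HLC' notification")
		return
	}
	shards := map[int][]string{}
	for _, m := range members {
		shards[shardOf(m)] = append(shards[shardOf(m)], m)
	}
	fmt.Printf("tier %d: %d members -> %d shard publishes\n", tier, len(members), len(shards))
}

func main() {
	members := make([]string, 500)
	for i := range members {
		members[i] = fmt.Sprintf("user-%d", i)
	}
	fanOut(members, pickTier(len(members))) // 500 members collapse to a few hundred shards
}
```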
Earned-secret aside: Twitter's home timeline uses exactly this hybrid. Fan-out-on-write for normal users, fan-out-on-read for celebrities (>1M followers). The threshold was tuned empirically; they ended up with a hybrid merge on the read path that combines pre-materialized timeline + pulled celebrity tweets. Same idea here: a user in 5 broadcast channels + 20 normal groups gets pushed msgs for groups and pulls for broadcasts, merged by client into the home view.
Celebrity problem variant — group with one very active member. One user sending 100 msg/s into a 10K group = 1M deliveries/s from one user. Solution: per-sender-per-channel rate limit (e.g. 10 msg/s from one user into one channel) + flood protection. This is a product policy, but enforced at Session Svc via sliding window.
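A sketch of that per-sender-per-channel sliding window; the real enforcement point is the Session Service, the 10 msg/s limit is the example figure above, and the in-memory bookkeeping here is illustrative:

```go
// Sliding-window flood protection keyed by (sender_id, channel_id).
package main

import (
	"fmt"
	"sync"
	"time"
)

type SlidingWindowLimiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	sends  map[string][]time.Time // key = sender_id:channel_id
}

func NewLimiter(limit int, window time.Duration) *SlidingWindowLimiter {
	return &SlidingWindowLimiter{limit: limit, window: window, sends: map[string][]time.Time{}}
}

// Allow reports whether this send fits inside the window, recording it if so.
func (l *SlidingWindowLimiter) Allow(senderID, channelID string, now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	key := senderID + ":" + channelID
	cutoff := now.Add(-l.window)
	kept := l.sends[key][:0]
	for _, t := range l.sends[key] {
		if t.After(cutoff) {
			kept = append(kept, t) // drop entries that fell out of the window
		}
	}
	if len(kept) >= l.limit {
		l.sends[key] = kept
		return false // flood: reject or defer the send
	}
	l.sends[key] = append(kept, now)
	return true
}

func main() {
	lim := NewLimiter(10, time.Second)
	now := time.Now()
	allowed := 0
	for i := 0; i < 15; i++ {
		if lim.Allow("celebrity", "group-10k", now) {
			allowed++
		}
	}
	fmt.Println("allowed:", allowed, "of 15") // allowed: 10 of 15
}
```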
Failure modes:
- Fan-out service OOMs on large-member lookup. Mitigation: stream member iteration from Spanner with page size 1000; fan-out worker is stateless on membership, uses flow control to the pub/sub bus.
- Push tier misdetects offline. Mitigation: gateway publishes "I hold user U's conn" heartbeat to a routing cache every 5s; fan-out checks cache; if stale (>15s), treat as offline. False offline is OK because client will reconcile via RESUME on reconnect.
- Pub/sub topic hot spot. Mitigation: if a single user_shard gets >50K msg/s (unlikely but possible), split shard dynamically — migrate half of users to a new shard, gateway learns via routing cache.
7.3 Deep Dive C — Gateway Stickiness, Failover, and Mass-Reconnect Storms
Why critical (and earned-secret). Gateways hold state: open WebSocket, TLS session, per-conn sequence numbers, small send buffer. When a gateway dies, millions of clients reconnect within seconds. Done wrong, this takes down the entire fleet (thundering herd). Done right, clients barely notice.
Alternatives.
- Stateless gateways, external session store. Every frame re-reads session from Redis. Trade: simple failover, but 2 Redis hops per frame × 930M deliveries/s = 1.86B Redis ops/s. Not affordable.
- Sticky, hash-on-user_id with simple modulo. Client reconnects to user_id % N_gateway. Problem: scaling from 500 to 501 gateways shuffles every client. Rejected.
- Sticky with consistent hashing + bounded load (chosen). Clients hash user_id onto a ring; pick the closest gateway with capacity <90% (sketched after the chosen design below). Adds/removes shuffle only ~1/N of clients.
Chosen design.
- L4 LB uses Maglev-style consistent hash on user_id (extracted from JWT or a cookie). This delivers reconnects to the same gateway pod with ≥95% probability across pod changes.
- Gateway holds the WSS conn + session state locally. Session state is also persisted to Redis (cold replica) for recovery.
- On gateway crash:
- K8s liveness detects in 5s, endpoints controller removes pod.
- L4 LB reroutes the slice of user_ids to next-in-ring gateway.
- Clients see the WSS drop and reconnect with exponential backoff jittered by user_id hash. This is the critical bit — plain exp-backoff synchronizes reconnects; jittering by hash(user_id) % 30s spreads them.
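A sketch of the "consistent hashing + bounded load" gateway pick referenced above: walk the ring clockwise from hash(user_id) and take the first gateway under its cap. A production ring would use many virtual nodes per gateway and the 90%-capacity rule; this toy keeps one point per gateway and a hard connection cap:

```go
// Bounded-load consistent hashing for gateway assignment.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint32          // sorted hash points on the ring
	owner  map[uint32]string // point -> gateway
	load   map[string]int    // current connections per gateway
	cap    int               // bounded-load ceiling per gateway
}

func hash(s string) uint32 { h := fnv.New32a(); h.Write([]byte(s)); return h.Sum32() }

func newRing(gateways []string, capPerGW int) *ring {
	r := &ring{owner: map[uint32]string{}, load: map[string]int{}, cap: capPerGW}
	for _, gw := range gateways {
		p := hash(gw)
		r.points = append(r.points, p)
		r.owner[p] = gw
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the first gateway at or after hash(userID) with spare capacity.
func (r *ring) pick(userID string) string {
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= hash(userID) })
	for i := 0; i < len(r.points); i++ {
		gw := r.owner[r.points[(start+i)%len(r.points)]]
		if r.load[gw] < r.cap {
			r.load[gw]++
			return gw
		}
	}
	return "" // fleet saturated
}

func main() {
	r := newRing([]string{"gw-0", "gw-1", "gw-2"}, 2)
	for i := 0; i < 6; i++ {
		fmt.Println("user", i, "->", r.pick(fmt.Sprintf("user-%d", i)))
	}
}
```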
The earned-secret bit — mass reconnect storm on deploy.
When we deploy a new gateway binary, we drain gateways gracefully. Without care, each drained node dumps its ~2M connections at once (and we repeat this 500 times across the fleet). If those 2M reconnects land within 1s, the receiving gateways must absorb 2M full TLS handshakes; at ~2-4ms of CPU each (crypto + session setup), that is roughly 4,000-8,000 CPU-seconds of work in a single second — on the order of 8K cores. No single node, nor a small slice of the fleet, has that to spare.
Mitigations in layers:
- Slow drain. When a gateway is marked for drain, it sends a GOAWAY frame with reconnect_after_ms = rand(0, 60000). Clients wait a random interval, then reconnect. This spreads 2M reconnects over 60s ≈ 33K/s across the fleet — manageable.
- TLS session resumption. Redis-backed TLS ticket store; a reconnect presents its ticket and skips the full handshake, cutting per-handshake cost from ~4ms to ~0.4ms CPU. Earned-secret note: this requires careful ticket-key rotation — rotate too fast and you lose resumption; too slow and you keep forward-secrecy risk. WhatsApp rotates every 12h.
- Reconnect hint from server. When a client connects, the gateway includes preferred_gateway_dns=gw-17.region-us-east.example.com in the handshake response. On subsequent reconnects the client dials that DNS name directly, bypassing the LB. This pins clients and avoids LB storms.
- Client-side exponential backoff with jitter. min(max_delay, base × 2^attempt) × rand(0.5, 1.5), base = 1s, max = 60s. Most clients reconnect within 30s; the tail extends to ~90s (sketched below).
- Kernel-level tuning. SO_REUSEPORT for socket-accept scaling; net.core.somaxconn=65535; kTLS offload so handshake crypto runs in the kernel without a user↔kernel copy. This is the WhatsApp/Meta secret sauce: "we hold 2M conns/node because memory is dominated by TLS state, not the message buffer, so kernel TLS offload matters more than app-heap tuning."
- Blast-radius limit via cell architecture. Divide the gateway fleet into 20 cells of 25 gateways each. A bad deploy rolls out cell-by-cell; a crash storm in one cell drives at most ~50M reconnects (25 nodes × 2M conns), not the full 750M.
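A sketch of the client-side reconnect delay combining the jittered exponential backoff and the per-user hash spread from the list above; the constants are as stated, the function name is illustrative:

```go
// Reconnect backoff: capped exponential, ×[0.5, 1.5) jitter, plus a
// deterministic hash(user_id) % 30s spread so a drain doesn't synchronize clients.
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
	"time"
)

const (
	baseDelay = 1 * time.Second
	maxDelay  = 60 * time.Second
)

// reconnectDelay computes the wait before reconnect attempt `attempt` (0-based).
func reconnectDelay(userID string, attempt int) time.Duration {
	d := baseDelay << attempt // 1s, 2s, 4s, ...
	if d > maxDelay || d <= 0 {
		d = maxDelay
	}
	jittered := time.Duration(float64(d) * (0.5 + rand.Float64())) // ×[0.5, 1.5)
	h := fnv.New32a()
	h.Write([]byte(userID))
	spread := time.Duration(h.Sum32()%30) * time.Second // per-user deterministic offset
	return jittered + spread
}

func main() {
	for attempt := 0; attempt < 4; attempt++ {
		fmt.Println("attempt", attempt, "wait", reconnectDelay("alice", attempt).Round(time.Millisecond))
	}
}
```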
Real systems named.
- WhatsApp Erlang gateways: 2M conns/node (Rick Reed's 2014 talk), BEAM VM's lightweight processes are the model for our per-conn state. They solved the mass-reconnect by jittered backoff on the client.
- Slack's Flannel / Gumbo: routing proxy in front of channel servers, does exactly the "reconnect hint" pattern.
- Discord's Elixir + Gateway: same BEAM model, publicly known to handle 10M+ concurrent conns per cluster (2019 blog).
- Messenger / Iris: Meta's internal system. Public Meta engineering posts describe gateway sharding and fan-out separation aligned with this design.
Failure modes.
- Cold cache on new gateway. New node has empty session cache → every frame causes Redis lookup. Mitigation: pre-warm by replaying session table for the hash range the node will own.
- Split-brain during LB reconfig. The LB might briefly send the same user to two gateways. A gateway that detects a duplicate session_id sends a CONFLICT frame; the newer connection wins, the older one drops.
- Zombie connections (client device went offline without a TCP RST). TCP keepalive + an app-level heartbeat with a 90s deadline evicts zombies. At 750M conns and a small zombie fraction (assume ~0.003%), that is ~20K zombies in flight at any time — negligible.
8 Failure Modes & Resilience #
| Component | Failure | Detection | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| Gateway pod | Process crash | K8s liveness 5s | ~2M users (1 pod) | L4 LB reroute + client reconnect w/ jitter | <60s; users see 1 reconnect |
| Gateway rack | Rack-level power loss | Heartbeat timeout 10s | ~20M users | LB drains rack; cross-rack capacity headroom | <2min |
| Entire region | Data-center offline | Anycast BGP withdraw | All users in region (~50M) | Clients fall back to next region via GeoDNS; authoritative data replicated to 2+ regions | Region restored: ~hours; users work throughout |
| Redis hot-tail shard | Node loss | Sentinel + replica promote | ~250K channels hot-tail | Reads fall through to Cassandra (slower, +20ms); writes continue | Seconds (failover), fill cache over minutes |
| Redis presence shard | Node loss | As above | Presence for ~25M users | Degraded to "last_known_status" from Cassandra snapshot; presence inaccurate for ~1min | Rebuild from HBs over 2min |
| Cassandra hot shard (one channel melts) | Celebrity posts → one partition overload | Per-partition QPS alarm | One channel has degraded latency | Shadow-read from replica; write-through Redis buffers; emergency: split channel partition by hlc_ts range | Mins to split |
| Cassandra node | Disk failure | gossip + local liveness | RF=3 means one replica loss, still quorum | Auto-bootstrap replacement from peers; hinted handoff for missed writes | Hours to bootstrap TB |
| Commit log broker | Kafka/Bookkeeper broker loss | Heartbeat | 1/1000 brokers → no loss (RF=3) | Broker replacement; lag increases transiently | Minutes |
| Fan-out pub/sub topic | Hot topic (celebrity user) | Topic lag alarm | Deliveries to that shard delayed seconds | Dynamic shard split; consumer scale-out | <1min |
| Offline queue (Kafka) full | Push provider outage | Lag > threshold | Pushes delayed for offline users; messages still durable in Cassandra | Pause dispatcher; drain after recovery; deduplicate on resume (client uses server_msg_id) | Hours (catch-up), no data loss |
| APNs / FCM outage | Push provider 5xx | Error rate > 10% | Offline users don't get push notifs; they DO get msgs on reconnect | Exponential backoff, don't retry >1hr old; surface "silent delivery" | Hours typically |
| Presence flapping (bad network) | User oscillates online/offline every 30s | High presence-delta rate per user | Fan-out waste to subscribers | Debounce: require stable state ≥60s before broadcasting delta | Immediate (rule fires) |
| Mass reconnect storm | Deploy / fleet-wide restart | Handshake rate spike | Gateway fleet overload | Jittered backoff + slow drain + session resumption (Section 7.3) | <2min |
| Consistent-hash reshuffle | Adding/removing 10 gateways at once | LB reconfig events | Up to 10/500 = 2% of users reshuffle | Stagger gateway adds (1 per 30s); clients carry session across shuffle (session data in Redis) | Progressive |
| Spanner quorum loss (channels table) | Paxos majority lost | Write errors | Channel-create fails; messages in existing channels unaffected | Read-only mode on metadata; degrade gracefully | Minutes (usually self-healing) |
| Dedup Redis loss | Whole dedup cluster | Redis cluster heartbeat | 24h window of dedup data | Fall back to Cassandra-idempotent lookup (slower). Rebuild over 24h as new msgs flow | Hours |
| Clock skew between MSG-Leader nodes | NTP failure | Leader-to-leader skew alarm | HLC still correct (max-based), wall-clock on hlc_ts drifts | Alert SRE; NTP recovery; HLC preserves correctness | Mins |
Pager policy. SRE on-call pages on:
- Regional p99 > 2× target for 5 minutes
- Message loss > 0 in any 5-min window (any ack'd but not durable)
- Gateway fleet utilization > 85%
- Fan-out queue lag > 10s
- Cassandra write latency p99 > 100ms for 5 min
SLO error budget. 99.99% = 4.3min/mo. Each minor incident burns 0.1-0.5min. Budget exhaustion = freeze deploys.
9 Evolution Path #
v1 (MVP, single region, 1:1 only) — 3 months
- Single region (us-east-1), 20 gateways, 100 Cassandra nodes, Redis tail cache.
- 1:1 messaging only. No groups, no presence broadcast.
- Offline queue via simple per-user Kafka partition.
- Push via FCM/APNs direct.
- Multi-device NOT supported (one active device per user).
- Target: 5M DAU, p99 <300ms.
- Open questions at this stage: do we need Spanner for metadata or can we start with PostgreSQL per-shard? (Answer: PG for v1, migrate at v2.)
v2 (groups, multi-device, presence) — 9 months
- Add groups up to 1000, with tier-1 push fan-out.
- Add presence subsystem with coarse aggregation.
- Multi-device sync via user-shard pub/sub.
- Scale gateway fleet to 100 nodes; 3 regions (us-east, us-west, eu).
- Migrate metadata to Spanner for global transactional consistency.
- Introduce dedicated fan-out service (split from MSG service).
- Read receipts and typing indicators.
- Target: 100M DAU.
v3 (global scale, large groups, broadcast, E2EE, federation) — 18 months
- 20 regions; cross-region replication for messages (async, with read-your-writes for sender).
- Tier-2 and tier-3 fan-out (sharded push, pull for broadcast).
- Broadcast channels up to 100K members with pull-based delivery.
- E2EE opt-in per channel (Signal Double Ratchet for 1:1, Megolm for groups).
- Federation (inter-org like Slack Connect) — optional, gated behind enterprise tier.
- Cross-region HLC reconciliation (Section 10).
- Data locality compliance (India/China shards).
- Target: 500M DAU.
Build-vs-buy decisions at each phase.
- v1: Use AWS managed services (MSK for Kafka, ElastiCache for Redis, AWS Keyspaces for Cassandra-compat). Fast ship; pay for managed premium.
- v2: Migrate hot path to self-managed (Cassandra on EC2, Kafka on dedicated hosts). Managed cost starts exceeding engineering cost.
- v3: Bare-metal for gateways (for kTLS and NIC tuning); Cassandra on bare metal too. Custom fan-out (off-shelf pub/sub hits limits at 930M deliveries/s).
10 Out-of-1-Hour Notes #
10.1 End-to-End Encryption
Signal Protocol (Double Ratchet) for 1:1.
- Long-term identity keys + medium-term signed prekeys + one-time prekeys, exchanged via a key distribution server (the only plaintext-adjacent role the server plays).
- Each message encrypted with a fresh key derived from a ratchet (DH + symmetric chain).
- Forward secrecy (past messages safe if current key stolen) + future secrecy (compromise self-heals after next DH).
- Server sees ciphertext only — can store, fan-out, deliver, but cannot search, moderate content, or auto-translate.
Megolm / Sender Keys for groups.
- Naive N-to-N pairwise Double Ratchet explodes (1000-member group = 500K session states).
- Sender generates a "sender key" (symmetric chain) shared with each member via pairwise Double Ratchet — one-time cost at key-rotation.
- Each msg encrypted once with sender key; all members decrypt.
- Key rotation when member leaves (obviously — otherwise left member keeps decrypting).
- Matrix uses this model; Signal uses a variant.
Tradeoffs with server features.
- No server-side search — clients maintain local encrypted indices.
- No moderation on content — metadata-only moderation (report flow ships a cleartext excerpt via user opt-in).
- Push notification preview text: either (a) "New message" only, or (b) client shows preview using locally-cached key after wake — Signal does (b).
- Backups: either cloud-backup with user-derived key (optional, Signal PIN), or no backup.
10.2 Key Server
- Stores public keys, signed prekey bundles, one-time prekeys.
- On sender request "give me keys for user X's device D": returns bundle, burns one-time prekey.
- Prekey replenishment by client every N days or when count <20.
- Must resist key-server compromise: Trusted Identity via QR-code fingerprint verification (Signal does this).
10.3 Search (E2EE and non-E2EE)
- Non-E2EE: server-side Elasticsearch per-user index. Cost ~$0.05/MAU/yr additional.
- E2EE: client-side FTS over local cache. WhatsApp does this — search only works for msgs on the device.
- Encrypted cloud-side search (BlindSearch, Ciphertext Keyword Search) — research-grade, not prod-ready.
10.4 Moderation Pipeline
- Non-E2EE: ML classifier on every msg (toxicity, CSAM, spam). Pre-publish intercept for severe; post-hoc takedown for mild.
- PhotoDNA / CSAM hashing on uploaded images. Report to NCMEC required by law.
- Report flow: user taps "report", client uploads an explicit cleartext bundle (even from E2EE chat — this is Apple's model for iMessage) + sender metadata + msg context window.
- Apple's Communication Safety classifier runs client-side even for E2EE (controversial but real).
10.5 Retention and Legal Holds
- Default: infinite retention, user-deletable.
- Configurable per-user or per-channel TTL.
- Legal hold override: SRE-applied tag that bypasses TTL for specified users/channels (GDPR-regulated countries require judicial order).
- GDPR right-to-erasure: when user requests deletion, scrub PII and sent msgs. Problem: can't scrub received copies in others' inboxes — document in ToS that your identity remains in others' histories unless they also delete.
10.6 Spam and Abuse
- Rate limits per user globally (500 msgs/hr) and per channel (10 msgs/min).
- Bulk-send detection: user with 100 new 1:1s in 10min → captcha / SMS verify.
- Reputation score per account (age, verified phone, social signals).
- Honeypot accounts for spam detection.
10.7 Cross-Device Key Sync
- Multi-device E2EE: each device has its own identity key. Register device = sign new device key with existing device's key ("device auth").
- On registering a new device, the server fans out to the user's other devices: "pending new device, approve?". User confirms on an existing device.
- Message encryption to user U = encrypt once per active device (small N, usually ≤10).
10.8 QoS and Priority
- 1:1 msgs priority=high, <100ms SLA.
- Group <100 members: priority=medium.
- Broadcast channels: priority=low, can tolerate 5s delivery.
- Control frames (typing, presence): priority=best-effort, first-to-drop under pressure.
- Implement via pub/sub topic per priority + gateway consumer weights.
10.9 Cost per MAU
| Component | Peak demand | Annual cost at scale |
|---|---|---|
| Gateway fleet | 800 nodes × $8K/yr | $6.4M |
| Cassandra | 5000 nodes × $12K/yr | $60M |
| Redis | 500 nodes × $10K/yr | $5M |
| Kafka/pub-sub | 1500 nodes × $8K/yr | $12M |
| Network egress (inter-region replication + client delivery) | ~900 GB/s avg × $0.02/GB | $570M(!) |
| Storage | 9 PB × $0.02/GB-mo | $2.2M |
| Push provider | APNs/FCM free; SMS fallback | ~$20M |
| Total | | ~$676M/yr |
| Per MAU (assume MAU ≈600M) | | ~$1.13/MAU/yr |
Network egress dominates — a real optimization is reducing cross-region replication (only replicate recently-active channels) and using client-pull for broadcast (already in design).
10.10 Client Backgrounding and Battery
- iOS/Android kill background TCP connections after ~30s.
- Use push-to-wake: push carries msg metadata, app wakes and pulls body via HTTP (not WSS — WSS reconnect in background is costly).
- Android foreground service can hold WSS for active chats (WhatsApp model).
- iOS: background pull every 15min (APNs push handles immediacy).
- Battery-optimal: coalesce msgs into bundles; push N msgs as 1 bundle; reduce radio wakes from N to 1.
- Adaptive heartbeat: on wifi, 60s; on cell, 180s; on weak signal, 300s — trades presence-staleness for battery.
10.11 Federation (v3+)
- Inter-org federation like Matrix/XMPP or Slack Connect.
- Each org runs its own gateways + storage; federation via a signed message protocol between orgs.
- Directory discovery: DNS SRV records point to federation endpoint.
- Challenges: identity mapping, spam (federated identities can spoof), encryption (federated E2EE requires cross-org key exchange), retention (whose policy wins?).
- Design: introduce Federation Gateway as a distinct tier that translates internal pub/sub events to inter-org signed gRPC. Every cross-org msg is a signed capability.
10.12 Cross-region HLC reconciliation (when we go multi-master)
- v3 may allow writing to nearest region for the same channel (write-locality).
- Requires HLC plus a per-region component: hlc_ts + region_id.
- Conflict resolution: total order by (hlc_ts, region_id) — deterministic, but can reorder messages slightly vs causal expectation.
- Alternative: keep single-writer-per-channel, but move the writer to the region with the hottest activity (leader migration). Trades cross-region write latency for single-region read latency — akin to CockroachDB's leaseholder-placement / follower-reads pattern.
10.13 Observability
- Message tracing: every msg gets a trace_id (derived from client_msg_id). Sampled 1/1000 for tracing through every hop (gateway → MSG leader → commit log → fan-out → delivery gateway → peer). Jaeger/Tempo.
- Per-channel metrics: write QPS, p99 latency, fan-out factor, offline recipients %.
- Per-gateway metrics: connection count, CPU, kTLS ratio, p99 handshake time, zombie conn rate.
- SLO dashboard: weekly review.
- Synthetic probes: 1000 bot pairs globally sending heartbeat messages every minute, measuring end-to-end; alarms on regression.
Appendix A — Quick Reference: Chosen Tech Stack #
| Layer | Choice | Alternative considered | Why chosen |
|---|---|---|---|
| Client transport | WebSocket over TLS 1.3 + kTLS | HTTP/3 long-poll | Bi-directional, lower overhead, kernel-offloadable |
| Wire format | protobuf (binary) | JSON | 40% smaller, 3× faster parse |
| Load balancer | L4 Maglev-hash | L7 ingress | No TLS termination at LB (do it on gateway for kTLS); consistent hashing cheaper at L4 |
| Gateway | Go or Rust + epoll + kTLS | Erlang/BEAM | Rust/Go for ecosystem; BEAM is valid if team has Erlang expertise (Discord/WhatsApp) |
| Session store | Redis (sharded by user_id) | Memcached | Need pub/sub and data structures (sorted sets for history) |
| Message store | Cassandra (migrate to Scylla v2) | DynamoDB / Spanner / HBase | Wide-row write optimized, operational maturity, cost |
| Metadata store | Spanner | CockroachDB | Transactional global; Google-native |
| Commit log | Kafka w/ RF=3 | Pulsar, RabbitMQ | Throughput + log-replay semantics |
| Pub/sub (fan-out) | NATS JetStream or custom | Kafka | Lower latency for 930M msg/s fan-out; Kafka for durable offline queue |
| Object store | GCS / S3 | — | Media blobs |
| Push | APNs + FCM | — | No alternative |
| Config / service discovery | Consul or etcd | ZooKeeper | Modern, battle-tested |
Appendix B — Pre-Interview Verification Checklist (answers to the prompt's 4 verifications) #
- SRE pager-carryable? Yes — §8 enumerates every failure mode with detection, blast radius, mitigation, runbook. Clear SLO budget and pager triggers.
- Every diagram arrow → real API/data flow? Yes — every gRPC call is defined in §4.3; every WSS frame in §4.1; storage writes in §5 with chosen engine per arrow.
- Deep-dive L7 or L6? §7 goes to earned-secret depth: kTLS + TLS ticket rotation interaction, Meta/WhatsApp-specific operational patterns, single-writer lease fencing, Twitter celebrity-pull threshold tuning, consistent-hash bounded-load behavior during deploys.
- Conn count and msg-rate math consistent? Yes — §3.2 derives 750M conns → 500 gateways at 2M/node; 2.9M msg/s × 320 fan-out = 930M delivery/s; 930 GB/s egress divided across 500 gateways = 2GB/s per NIC at 60% utilization. Numbers all tie together.