Q4 Storage & Query Systems
Design a Distributed Storage System (Object Store)
Choose block, file, or object storage, then justify durability, placement, repair, and metadata consistency.
1 Problem Restatement & Clarifying Questions #
Restatement
Design a planet-scale distributed object store (S3 / GCS / Azure Blob / Ceph RADOSGW / f4-class). Durability, strong-enough consistency, sharded metadata, background repair. Pager-carryable for an SRE on-call.
Why object store (not block, not file)?
I pick object store and will justify by ruling out the alternatives:
| Target | Access pattern | Why wrong for this brief |
|---|---|---|
| Block (EBS, Ceph RBD, SAN) | 4KB random R/W, attach-to-one-VM, POSIX-ish via layers above | Tight latency (<1ms), requires IOPS-scheduled LUN-per-host, doesn't fit "EB-scale multi-tenant key addressing." Block stores top out at ~PB per cluster because you're selling a volume abstraction, not a namespace. |
| File (GFS, HDFS, Lustre, Isilon, ColossusFS) | POSIX-ish, directories, rename, hardlinks, mtime, append, chown | POSIX rename is a directory transaction. Directory-op scaling is where GFS single-master hit 50 PB / ~100 M files; Colossus and HDFS Federation spent years fixing it. Not the interesting bottleneck to design against in 60 min. |
| Object (S3, GCS, Azure Blob, Swift, Ceph RGW, Haystack/f4) | Immutable-by-default blobs, flat key namespace, PUT/GET/DELETE/LIST by prefix, eventually billions of tenants | Matches "EB-scale, immutable-write + big-read, cheap, durable" workload. Flat namespace kills the dir-tx bottleneck. Chosen. |
So: object store, S3-grade semantics, GCS/Colossus-grade internals.
Clarifying Questions (I would ask the interviewer; I'm answering them for self-study)
- Workload shape? Assume write-once-read-many-delete-rarely, skewed toward large objects (p50=512 KB, p95=64 MB, p99=1 GB, tail to 5 TB via multipart). This is the "real" distribution on S3/GCS — not uniform. Small-object tax is real (see §7.1).
- Scale? Target 1 EB logical stored, 10 B live objects, 10 M QPS GET, 1 M QPS PUT, 100 K QPS LIST. This is mid-tier regional (AWS has 400+ trillion objects; I'm sizing one region).
- Latency SLO? p50 GET <30 ms, p99 GET <200 ms for hot cache tier, <500 ms for cold. p50 PUT <80 ms, p99 <500 ms. Tail is where L7 lives.
- Durability SLO? 11 nines (99.999999999 % annual) — S3 / GCS / Azure standard. I'll derive it in §3 from AFR + EC k/n + MTTR.
- Availability SLO? 99.99% per region, 99.999% multi-region. One region is allowed to brown out; metadata is allowed to reject writes before corrupting.
- Consistency? S3 went strong-consistent for PUT-after-PUT in Dec 2020. I'll match: read-your-writes for new keys and overwrites; eventually-consistent LIST. Multi-region: async, last-writer-wins with vector metadata.
- Multi-region? Yes, but v3. v1 is single-region. Avoid designing for the hardest case first.
- Multi-tenant? Yes, billions of buckets, millions of tenants. Noisy-neighbor protection required (§10.5).
- Encryption? At-rest AES-256 per-object DEK wrapped by tenant KEK in KMS; in-transit TLS. Out-of-scope for core 60 min, detailed in §10.4.
Out-of-scope (explicitly deferred)
Cross-object transactions; server-side rich query (SQL-on-blob); streaming mutation of an existing object mid-read; filesystem semantics (rename/hardlink); strong cross-region consistency; durable queue / pub-sub on top.
2 Functional Requirements #
In-scope (numbered)
1. PUT `/b/{bucket}/{key}` — body up to 5 GB single-shot, 5 TB via multipart; returns ETag (content-MD5 or content-SHA256).
2. GET `/b/{bucket}/{key}` — full body, with `If-Match` / `If-None-Match` / `If-Modified-Since` and HTTP `Range:` for byte ranges.
3. DELETE `/b/{bucket}/{key}` — soft delete with versioning; tombstone propagation.
4. LIST `/b/{bucket}?prefix=&marker=&max-keys=&delimiter=/` — prefix walk, pagination. Eventually consistent by design (more on this in §7.2).
5. Multipart Upload: `initiate` → `uploadPart` × N (parallel) → `complete` (atomic manifest commit) → single object. Enables 100 GB+ uploads over flaky networks and parallel ingest up to ~10 Gbps per object.
6. Range Read: `Range: bytes=a-b` — served from the exact chunks needed, never a full-object read.
7. ACL / IAM: bucket-level policy + per-object ACL; pre-signed URL (HMAC over canonical request) for delegated access with TTL.
8. Versioning: per-bucket flag; DELETE creates a delete-marker, previous version retained; PUT never overwrites in place.
9. Lifecycle: per-bucket rules (transition to cold tier after N days, expire after M days).
10. CopyObject (server-side copy): metadata-only when source/dest share storage class, deep-copy otherwise.
Out-of-scope (explicitly)
- SELECT-over-object (S3 Select / BigQuery-external). Separate compute layer.
- Atomic PUT-if-not-exists conditional on another key. Single-key conditional PUT is in-scope (`If-Match` on ETag); cross-key transactions are out.
- Strong global consistency — out, see §9 v3.
- Filesystem rename / directory metadata (would require namespace service).
3 Non-Functional Requirements + Capacity Back-of-Envelope #
NFRs
| NFR | Target | Rationale |
|---|---|---|
| Availability | 99.99% per region (4.38 min/mo error budget) | AWS S3, GCS bucket-standard |
| Durability | 11 nines annual | Industry standard; derived below |
| Durability for archive tier | 12 nines | Deep-archive / Glacier class |
| Latency p50 GET | <30 ms (hot) | Within-DC round trip + 1 disk seek |
| Latency p99 GET | <200 ms (hot), <500 ms (cold) | Budget for 1 reconstruct if chunk slow/dead |
| Latency p50 PUT | <80 ms | Write + 2 replica ack (v1) or 10+4 EC acks (v2) |
| Latency p99 PUT | <500 ms | Tail-tolerant, includes metadata commit |
| Consistency | Read-your-writes for PUT; eventual for LIST | Matches S3-Dec-2020 model |
| Throughput | 10 M GET QPS, 1 M PUT QPS per region | Mid-tier region |
| Cost | <$0.015 / GB-month warm, <$0.001 / GB-month cold | Within S3 Standard / Glacier envelope |
Capacity BOE (show the math)
Logical data
- 1 EB = 10^18 B = 10^6 TB = 10^3 PB.
- 10 B objects; mean size = 1 EB / 10 B = 100 MB (skewed; p50 is 512 KB but the tail dominates bytes).
Physical footprint — replication vs erasure coding
- Pure 3× replication: 3 EB physical.
- Reed-Solomon RS(10,4) (10 data + 4 parity per stripe): 1.4× overhead → 1.4 EB physical. This is Facebook f4 / Azure LRS / GCS default for cold & warm.
- Savings of 3.0 − 1.4 = 1.6 EB. At ~$20 /TB HDD amortized (disk + chassis + 3 yr) that's $32 M saved per region. Non-trivial; see §7.1.
Disk fleet
- Assume 20 TB enterprise HDD (2026 vintage; Seagate Exos / WD Ultrastar are shipping 22–24 TB).
- 1.4 EB / 20 TB = 70 000 disks just for user data.
- Add overhead: 15 % free-space for compaction/rebalance, 5 % for metadata replicas, 10 % for transient multipart parts not yet committed, 10 % reserved for EC-reconstruction scratch = 40 % overhead → ~100 K HDDs.
- Hot tier on NVMe SSD (3.84 TB): assume 5 % of bytes are hot → 50 PB / 3.84 TB = ~13 K SSDs.
- Total: ~113 K storage drives in the region.
Rack / zone / region layout
- 60 disks per node (dense JBOD chassis with HBAs); ~30 nodes per rack × 60 disks = 1800 disks/rack.
- 100 K / 1800 ≈ 56 racks for HDD, say 64 racks total with headroom.
- 4 zones per region × 16 racks/zone.
- Placement invariants: no stripe has >1 shard in the same rack, and RS(10,4) is spread across ≥4 zones with ≤4 shards/zone, so a single-zone outage costs a stripe at most 4 shards, exactly at the 4-loss tolerance.
- Beware the tempting shortcuts: "spread across ≥2 zones" lets one zone take out up to 7 of 14 shards, and even 3 zones with ≤5 shards/zone fails (5 > 4). In a 3-zone region RS(10,4) cannot survive a zone loss at all; use RS(6,3) (3 shards/zone) or an LRC instead. This is the single most common bug in interview EC answers (see §7.1).
Metadata footprint
- 10 B objects × ~2 KB metadata/object (bucket_id, key, version, size, checksum, chunk manifest for ~5 chunks × 24 B, ACL id, timestamps, EC params) = 20 TB metadata.
- 3× replicated + Paxos/Raft overhead → ~60 TB across metadata servers. Fits comfortably on sharded SSD KV.
- Metadata op rate: every GET = 1 metadata read, every PUT = 1 metadata write, every LIST = range scan. 10 M GET + 1 M PUT + 100 K LIST = ~11 M read QPS, ~1 M write QPS against metadata.
- At 50 K read QPS per Spanner-class shard → 220 shards for reads; at 10 K write QPS per shard → 100 shards for writes. Take max ≈ 256 metadata shards for headroom.
Network bandwidth
- Read: 10 M QPS × 100 MB mean = 10^15 B/s = 1 PB/s egress. Obviously wrong to size at the mean; most GETs are near p50 = 512 KB.
- Recalibrate with weighted mean: log-normal around p50=512 KB with p99=1 GB tail. Weighted avg ~4 MB/GET. 10 M × 4 MB = 40 TB/s = 320 Tbps regional egress. Matches "big region scale" (AWS us-east-1 is rumored >1 Pbps aggregate).
- Write: 1 M × 4 MB × 1.4 (EC overhead) = 5.6 TB/s = 45 Tbps regional ingest to disks.
- Per-rack: 320 Tbps / 64 racks = 5 Tbps/rack; needs ~8× 800G or 16× 400G uplinks. Consistent with a hyperscale spine.
QPS per chunkserver
- 30 nodes/rack × 64 racks ≈ 1920 storage nodes. 10 M GET / 1920 = ~5 K GET QPS/node. At 60 disks × ~150 random IOPS = 9 K IOPS/node — disks are the bottleneck for small GETs, so hot data must live on SSDs or be cached.
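These numbers are easy to fat-finger live, so here is the same back-of-envelope as a throwaway script; every constant is an assumption stated in this section, not a measured value.

```python
# Back-of-envelope sizing for §3; inputs are the assumptions stated above.
LOGICAL_BYTES  = 1e18           # 1 EB logical
EC_OVERHEAD    = 1.4            # RS(10,4)
HDD_BYTES      = 20e12          # 20 TB drives
FLEET_OVERHEAD = 1.40           # free space + metadata + multipart + repair scratch
DISKS_PER_RACK = 1800
GET_QPS, PUT_QPS, LIST_QPS = 10e6, 1e6, 100e3
WEIGHTED_GET_BYTES = 4e6        # ~4 MB weighted mean per GET

physical_bytes = LOGICAL_BYTES * EC_OVERHEAD
data_disks     = physical_bytes / HDD_BYTES
fleet_disks    = data_disks * FLEET_OVERHEAD
racks          = fleet_disks / DISKS_PER_RACK

meta_read_qps  = GET_QPS + PUT_QPS + LIST_QPS
shards_reads   = meta_read_qps / 50e3           # 50 K read QPS per shard
shards_writes  = PUT_QPS / 10e3                 # 10 K write QPS per shard

egress_bps     = GET_QPS * WEIGHTED_GET_BYTES * 8
ingress_bps    = PUT_QPS * WEIGHTED_GET_BYTES * EC_OVERHEAD * 8

print(f"data disks      : {data_disks:,.0f}")                        # ~70,000
print(f"fleet disks     : {fleet_disks:,.0f}")                       # ~98,000
print(f"HDD racks       : {racks:,.0f}")                             # ~55, pad to 64
print(f"metadata shards : {max(shards_reads, shards_writes):,.0f}")  # ~220 -> 256
print(f"egress          : {egress_bps/1e12:,.0f} Tbps")              # ~320
print(f"ingress to disk : {ingress_bps/1e12:,.0f} Tbps")             # ~45
```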
4 High-Level API #
REST/HTTP with S3-compatible signatures (SigV4 HMAC). Internal plane is gRPC + Protobuf.
Client-facing (external, HTTPS)
PUT /b/{bucket}/{key} Content-Length, Content-MD5, x-amz-storage-class, x-amz-server-side-encryption
→ 200 { ETag, VersionId, x-amz-request-id, x-amz-id-2 }
Errors: 400 InvalidRequest | 403 AccessDenied | 409 BucketAlreadyExists
| 412 PreconditionFailed | 503 SlowDown (back-pressure, client retries with expo backoff + jitter)
GET /b/{bucket}/{key} [Range: bytes=a-b] [If-Match: etag] [If-None-Match: etag]
→ 200/206 body, ETag, Content-Length, Last-Modified, x-amz-version-id
Errors: 304 NotModified | 404 NoSuchKey | 416 RangeNotSatisfiable | 503 SlowDown
DELETE /b/{bucket}/{key}[?versionId=]
→ 204 (async GC; immediate tombstone in metadata)
LIST /b/{bucket}?prefix=&marker=&max-keys=1000&delimiter=/
→ 200 { Contents:[…], NextMarker, CommonPrefixes:[…], IsTruncated }
Explicitly: eventually consistent, reflects metadata state at read time ±seconds.
# Multipart
POST /b/{bucket}/{key}?uploads → { UploadId }
PUT /b/{bucket}/{key}?partNumber=N&uploadId= body → { ETag }
POST /b/{bucket}/{key}?uploadId= { Parts:[{N, ETag}] } → final ETag, atomic commit
DELETE /b/{bucket}/{key}?uploadId= abort + GC parts
# Pre-signed URL
GET /b/{bucket}/{key}?X-Amz-Signature=&X-Amz-Expires=900&X-Amz-SignedHeaders=…
Server verifies the HMAC with the bucket owner's access-key secret; the request then executes with the bucket owner's IAM identity.
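A minimal sketch of the pre-signed-URL idea, assuming a plain HMAC over method/path/expiry; real SigV4 signs a full canonical request with a date-scoped derived key, so treat this as the shape of the mechanism, not the actual algorithm.

```python
import hashlib, hmac, time
from urllib.parse import urlencode

def presign_get(secret_key: str, access_key_id: str, bucket: str, key: str,
                expires_s: int = 900) -> str:
    """Simplified pre-signed GET: HMAC over (method, path, expiry).

    Shows why the server can verify delegation statelessly; not real SigV4.
    """
    expiry = int(time.time()) + expires_s
    path = f"/b/{bucket}/{key}"
    string_to_sign = f"GET\n{path}\n{expiry}"
    sig = hmac.new(secret_key.encode(), string_to_sign.encode(),
                   hashlib.sha256).hexdigest()
    qs = urlencode({"X-Amz-AccessKeyId": access_key_id,
                    "X-Amz-Expires": expiry,
                    "X-Amz-Signature": sig})
    return f"https://example-store.invalid{path}?{qs}"

# The frontend recomputes the same HMAC with the bucket owner's stored secret
# and compares with hmac.compare_digest before honoring the request.
```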
Error semantics (important for back-pressure)
- 503 SlowDown on overload — always with a `Retry-After` hint and a per-caller token bucket. Client libraries back off exponentially (base 50 ms, factor 2, max 20 s, jittered); see the sketch after this list.
- 500 InternalError for unexpected failures; clients retry only idempotent ops (GET, DELETE with version, PUT with `Content-MD5`, which is idempotent-by-hash).
- 429-class rate-limit at the edge before requests touch the data plane, so a thundering herd doesn't fan out past the LB (see §8.7).
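A sketch of the retry policy described in the first bullet (exponential backoff with full jitter, base 50 ms, factor 2, cap 20 s); the `RETRYABLE` set and attempt cap are assumptions, not part of the spec above.

```python
import random, time

RETRYABLE = {500, 503}            # assumption: statuses the SDK treats as transient
BASE_MS, FACTOR, CAP_MS = 50, 2, 20_000

def call_with_retries(do_request, max_attempts: int = 8):
    """Exponential backoff with full jitter. do_request() returns (status, response)."""
    for attempt in range(max_attempts):
        status, response = do_request()
        if status not in RETRYABLE:
            return status, response
        backoff_ms = min(CAP_MS, BASE_MS * FACTOR ** attempt)
        time.sleep(random.uniform(0, backoff_ms) / 1000)   # full jitter
    return status, response
```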
Control-plane & admin
- CreateBucket / DeleteBucket / PutBucketPolicy / PutBucketLifecycle / PutBucketVersioning — low-QPS, strongly consistent, Paxos-ordered.
5 Data Schema #
Engines
| Data | Engine | Why |
|---|---|---|
| Object metadata (bucket, key → manifest) | Sharded transactional KV — Spanner-class (multi-Paxos groups per shard) or FoundationDB | Need strong consistency for read-your-writes PUT; need ordered range scans for LIST; billions of keys; multi-region Paxos later (v3). Rejected: Cassandra/Dynamo — no transactions, LIST is painful; MySQL/Postgres — can't shard this wide natively. |
| Chunk index (chunk_id → placement) | Same KV or a separate placement service (BigTable-style) | Separating means object metadata can be cached aggressively (immutable manifest post-commit); chunk-index changes during repair without invalidating object caches. |
| Bucket catalog | Same KV, rooted at system bucket `/sys/buckets` | Low QPS, strong consistency, ACL-attached |
| ACL / IAM policies | Same KV | Cross-referenced on every auth; cached aggressively |
| Data (chunks) | Chunkfiles on local disk (Colossus D / GFS chunkserver / RADOS OSD). One chunkfile = many chunks append-packed. | Avoids 1-chunk-1-file on XFS (inode pressure kills perf at billions-of-files scale). Classic f4/Haystack lesson: pack small objects into large container files. |
| Write-ahead log | Local SSD per metadata shard + shard's Paxos ring | Bounds commit latency |
Schemas (logical)
# Metadata: objects table (sharded by hash(bucket_id, key))
object_meta {
bucket_id uint64
key string # up to 1 KB
version_id uint128 # monotonic (ts << 64) | random
is_delete_marker bool
size uint64
content_md5 bytes(16) # ETag for single-part
sha256 bytes(32) # internal end-to-end integrity
storage_class enum {HOT, WARM, COLD, ARCHIVE}
ec_scheme enum {REPL_3, RS_10_4, LRC_12_2_2}
chunk_manifest [chunk_id] # ordered; covers object bytes [0, size)
chunk_sizes [uint64]
acl_id uint64
created_ts uint64 # hybrid logical clock
lifecycle_tag uint32 # for batch lifecycle scans
tenant_id uint64
encryption { kms_kek_id, wrapped_dek }
user_metadata map<string,string> # up to 2 KB
}
# Primary key: (bucket_id, key, version_id DESC)
# Range scans on (bucket_id, key) serve LIST.
# Chunk placement (sharded by chunk_id)
chunk_placement {
chunk_id uint128 # globally unique, content-addressable seed
stripe_id uint128 # many chunks in one EC stripe share this
stripe_pos uint8 # 0..13 for RS(10,4)
disk_id uint64 # physical disk
node_id uint64
rack_id uint32
zone_id uint16
chunkfile_id uint64
offset uint64
length uint32
crc32c uint32 # per-chunk; composes with object-level sha256
gen_number uint32 # lease generation; bumped on recovery
state enum {HEALTHY, REPAIRING, STALE, DEAD}
}
# Bucket
bucket {
bucket_id uint64 PK
name string UNIQUE GLOBAL
owner uint64
created_ts uint64
versioning enum {DISABLED, ENABLED, SUSPENDED}
lifecycle [rule]
replication { enabled, dest_regions[] } # for v3
policy_id uint64
}
# ACL
acl {
acl_id uint64 PK
grants [{grantee_type, grantee_id, permission}]
}
Why sharded KV (not a blob-store master like GFS)
- GFS master held all FS metadata in one process's RAM. Hit ~50 PB / ~100 M file ceiling circa 2009.
- Colossus (curators) shards metadata into BigTable, which is itself sharded; each curator serves a key-range.
- This design: `hash(bucket_id, key) mod N_shards` → shard; each shard is a 5-member Paxos/Raft group on SSD. Range LIST uses an ordered layer (Spanner directories) or a side-index.
Why content-addressable chunk IDs
- Enables dedup for identical chunks across objects (a blessing for image thumbnails, CI artifacts, git packs).
- Enables retry idempotency: if a PUT fails mid-way, a retry writing the same bytes produces the same chunk IDs; no phantom duplicates.
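A minimal sketch of content-addressable chunk IDs, assuming a truncated SHA-256 as the ID (the exact hashing recipe is an assumption); identical bytes yield identical IDs, which is what makes retries idempotent and cross-object dedup possible.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024          # 8 MB slices, matching the write path in §6

def chunk_ids(body: bytes):
    """Yield (chunk_id, chunk) pairs; same bytes -> same IDs on a retried PUT."""
    for off in range(0, len(body), CHUNK_SIZE):
        chunk = body[off:off + CHUNK_SIZE]
        # 128-bit truncation of SHA-256; collision odds are negligible at 10^10 objects.
        yield hashlib.sha256(chunk).digest()[:16], chunk

ids_a = [cid for cid, _ in chunk_ids(b"same payload")]
ids_b = [cid for cid, _ in chunk_ids(b"same payload")]
assert ids_a == ids_b                 # a retried PUT re-derives the same chunk IDs
```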
6 System Diagram (ASCII, centerpiece) #
Regional architecture
+----------------------+
| Client SDK / App |
| (S3 SigV4, retries) |
+----------+-----------+
| HTTPS, SigV4
v
+----------------------+
| Edge / CDN Cache | (optional for GET; bypass for PUT)
| (write-through? no)|
+----------+-----------+
| HTTPS
v
+--------- Anycast / GCLB ---------+
| Regional Load Balancer (L7) | ~100 k QPS/box, 200 boxes
| TLS term, WAF, rate-limit edge |
+------+----------------------+----+
| |
v v
+---------+-----------+ +-------+----------+
| Frontend / Gateway | | Frontend /Gateway| ... fleet ~2000 nodes
| (stateless) | | ... |
| - Auth (SigV4,IAM) | | |
| - req parse/validate | |
| - slice body->chunks | |
| - pick EC scheme | |
| - upload coordinator | |
+----+-------+-------+-+ |
| | | |
gRPC (KV RPC) | | | gRPC (bulk data) |
v v v v
+--------------------------+ +--------------------------------+
| METADATA SERVICE | | CHUNKSERVER FLEET |
| 256 shards | | ~2000 nodes × 60 HDD + SSD |
| Raft 5-way per shard | | (Colossus D-class) |
| SSD-backed | | - chunkfiles (packed) |
| p99 read <5 ms | | - per-disk IO scheduler |
| p99 write <15 ms | | - QoS classes (user/repair) |
| ------------- | | - crc32c on every 64KB page |
| objects | | - lease holder per chunk |
| chunk_placement | | |
| buckets, ACLs, multipart |
| upload parts in flight | +----^---------------^-----------+
+-----^-------^-----^------+ | |
| | | | |
(range | | (chunk replicate) (repair reads)
scan LIST) | (commit) | |
| | | | |
+-------+-----+ | |
| |
+---------------------+---------------+-------------------+
| |
v v
+--------------------------------+ +---------------------------------+
| CONTROL PLANE | | BACKGROUND / DATA PLANE |
| - Placement Driver (PD): | | - Repairer (scan->reconstruct)|
| picks (node, rack, zone) |<---- health ----->| - Anti-entropy scrubber |
| per stripe | | - Erasure Coder (3->RS shift) |
| - Topology Service | | - Garbage Collector |
| - Health / failure detector | | - Tiering Mover (hot->cold) |
| - Capacity planner | | - Rebalancer (disk skew fix) |
+--------------------------------+ +---------------------------------+
Side services:
- IAM / KMS (DEK wrap/unwrap, per-tenant KEKs)
- Bucket quota / billing meter
- Audit log (all control-plane ops, immutable WORM)
- Replication service (cross-region, v3)
Scale annotations:
- Frontend → Metadata : ~12 M QPS at 256 shards (46 K QPS/shard avg, 80 K peak)
- Frontend → Chunkserver : ~45 Tbps ingress, ~320 Tbps egress regional
- Chunkserver disk IOPS : 60 disks × 150 IOPS = 9 K per node × 2000 = 18 M IOPS
- PD health pings : every 5 s to every node → 400 QPS from PD, fanout to 2000
Write path (single-part PUT, ≤64 MB)
Client Frontend Metadata (shard S) Chunkservers (14 picks across 4 zones)
| | | |
|-- PUT key body ------->| | |
| |- auth/sigV4 | |
| |- shard(bucket,key)=S | |
| |- slice body into 8MB chunks (N=ceil(size/8MB)) |
| |- for each chunk: |
| | erasure-encode RS(10,4) -> 14 shards |
| | PD.pick_placement(14) -> [(node,disk), ...] |
| |- begin_write_txn to S | |
| |----------------------- >|-- reserve object version v ---|
| | | (lease / uncommitted row) |
| |< ---- leases ----------| | |
| | |
| |-------------- WRITE each shard in parallel ------------>|
| | |
| | <---- CRC + ack per shard -----|
| | (wait for ALL 14 shards acked; or 12-of-14 and repair) |
| | |
| |- compose manifest (chunk_ids, sizes) |
| |- commit_object_txn(meta, ec_scheme, manifest, sha256) |
| |------------------------>|-- Raft commit on shard S ----|
| |< --- commit OK ---------| |
|< -- 200 ETag ----------| | |
Latency budget (single 8 MB chunk, intra-DC):
auth + shard routing ~2 ms
slice + encode (in-memory) ~4 ms (RS(10,4) at 2 GB/s per core)
PD placement ~1 ms (cached)
metadata reserve ~5 ms (Raft prepare)
parallel 14-shard write ~25 ms (TCP + disk write cache + fsync-on-close)
metadata commit ~10 ms (Raft commit)
-----------------------------------
total ~47 ms -> p50 PUT target met
Read path (GET)
Client Frontend Metadata (shard S) Chunkservers
| | | |
|-- GET key ------------>| | |
| |- auth | |
| |- ask: obj_meta(b,k) -->|-- read row (cached) --+ |
| |<-- manifest, acl, size -| | |
| |- auth against ACL |
| |- for each needed chunk (range -> subset): |
| | send parallel GET_CHUNK(chunk_id) to placement[0..9] |
| | |
| | <---- chunk bytes + crc -------|
| |- verify crc, stream to client |
| |- (if any of 10 is slow/dead beyond hedge deadline: |
| | fetch 2 parity chunks + reconstruct) |
| | |
|<-- 200 body -----------| |
Latency budget (hot, cache-hit on metadata):
LB + auth ~3 ms
metadata read (cached) <1 ms (LRU hit) or 5 ms (SSD miss)
chunkserver dispatch ~1 ms
parallel 10 chunk reads ~15 ms (one disk seek + transfer)
hedge + stream ~5 ms
-------
total ~25-30 ms -> p50 hit
Recovery path (disk dies)
Health detector (PD) sees node/disk FAIL at T0
|
v
PD queries chunk_placement WHERE disk_id = dead_disk (range scan -> ~5M chunks)
|
v
Repairer schedules reconstructions (priority queue: high if stripe health ≤ floor+1)
|
v
For each stripe with a lost shard:
- pick 10 healthy survivors (for RS(10,4))
- choose new placement satisfying rack/zone constraints
- pull 10 shards over the "repair QoS class" (separate token bucket from user traffic)
- reconstruct missing shard, write to new disk
- Paxos-update chunk_placement atomically (CAS on gen_number)
|
v
Anti-entropy scrubber periodically re-reads every chunk, verifies crc32c,
marks SILENT_CORRUPTION if mismatch -> triggers same repair path.
MTTR target: single-disk (20 TB, 5M chunks) repaired within 4 hours.
Requires: 20 TB / 4 h = 1.4 GB/s reconstruct bandwidth sustained.
Allocated: 10 Gbps repair-class lane per rack, so 64 racks × 1.25 GB/s = 80 GB/s aggregate.
Per-stripe: 10 × shard_size read + 1 × shard_size write = 11× I/O amplification per repaired shard.
7 Deep Dives (3 earned-secret topics) #
7.1 Replication vs Erasure Coding — the real trade-off
Why critical. This choice moves ~40% of the cluster cost and defines 90% of the failure-mode surface. Getting it wrong burns either money (over-replicated) or durability (under-parity). Every "use EC" hand-wave at L7 gets probed.
Alternatives & numbers
| Scheme | Overhead | Max losses / stripe | Reconstruct reads | Small-object penalty | Typical production use |
|---|---|---|---|---|---|
| 3× replication | 3.00× | 2 | 1 (just copy a replica) | none | GFS v1, HDFS default, Ceph RBD |
| RS(6,3) | 1.50× | 3 | 6× | moderate | Azure LRS variants, small clusters |
| RS(10,4) | 1.40× | 4 | 10× | high | Colossus (default warm), f4, S3 (EC tier) |
| RS(12,4) | 1.33× | 4 | 12× | high | Azure, HDFS-EC |
| LRC(12,2,2) — Microsoft | 1.33× | mixed; local repairs cheaper | 6× for single loss | high | Azure Storage |
| Tape (archive) | 1.0× + 2-copy across sites | geo-scale | hours | N/A | AWS Glacier Deep Archive |
Key math — why 1.4× isn't the whole story
Repair cost scales with k. For RS(10,4), replacing one dead shard reads 10 shards to reconstruct. With 20 TB drives:
- Read amplification during repair = 10× (read 10 shards at shard_size to make 1 shard_size of data).
- On a 20 TB disk death, that's 200 TB of reads. At 1 GB/s per node, > 55 node-hours of repair IO competing with user reads.
- This is why you MUST have QoS lanes. Google's public papers describe separate Colossus D "repair tokens." Without them, a disk failure → tail-latency spike → angry pager.
The small-object penalty (earned-secret). Any object smaller than one shard (e.g., 512 KB against an 8 MB shard size) still has to be split into 14 shards, so you end up storing 14 mostly-padding shards across 14 disks: 14 disks touched for 512 KB → catastrophic IOPS amplification on GET. Real systems solve this by:
- Packing — small objects first packed into 64 MB "bundles," then bundles are erasure-coded. This is Facebook Haystack's insight: one bundle = 1000s of photos = amortize EC overhead. f4 extends this to warm tier with RS(10,4).
- Hot-tier 3× replication — hot small objects get 3× replication; EC applies only on cold/warm tier transition (via lifecycle). Matches S3 Standard → S3 IA transition model.
- LRC (Microsoft Locally Recoverable Codes) — adds local parity groups so single-shard repair reads only 6 shards, not 12. Pays 33% vs 40% overhead.
My chosen approach (and why)
- Hot tier (new writes, <7 days): 3× replication across 3 zones. PUT latency 40 ms (no encode). Reads can serve from nearest replica. Repair = 1× read.
- Warm tier (7–90 days): RS(10,4) across 4 zones via background coder. Fully supports 4-shard loss including zone failure.
- Cold tier (>90 days): RS(12,4) or LRC(12,2,2), shard size bumped to 64 MB, possibly stored on SMR drives with bundles.
- Archive tier: single RS(10,4) + geo-replicated to a second region on tape.
Failure modes & mitigations
- Correlated disk failure (bad batch, same firmware bug) — placement MUST NOT cluster shards of one stripe on disks with the same firmware version. PD records `firmware_id` and diversifies.
- Zone failure on a 3-zone region — RS(10,4) can't tolerate losing ≥5 shards; requires a 4-zone layout, or RS(6,3) with 3 zones × 3 shards/zone.
- Slow shard ("stuck" but not dead) — during GET, hedge by reading `k+1` (11 shards) after 20 ms; the first 10 to return win. Classic "tail at scale" technique (Dean & Barroso 2013); see the sketch after this list.
- Bit-rot — crc32c per 64 KB page + sha256 end-to-end. Scrubber re-verifies every chunk every 14 days.
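A sketch of the hedged read from the "slow shard" bullet, with simulated shard fetches and illustrative timings: start the 10 data-shard reads, and past a 20 ms hedge deadline also request the parity shards, then take the first 10 distinct shards to arrive.

```python
import asyncio, random

HEDGE_AFTER_S = 0.020            # 20 ms hedge deadline from the bullet above

async def fetch_shard(idx: int) -> tuple[int, bytes]:
    # Simulated chunkserver read; roughly one in ten is a straggler.
    await asyncio.sleep(random.choice([0.005] * 9 + [0.200]))
    return idx, f"shard-{idx}".encode()

async def hedged_read(k: int = 10, m: int = 4):
    tasks = [asyncio.create_task(fetch_shard(i)) for i in range(k)]
    done, _pending = await asyncio.wait(tasks, timeout=HEDGE_AFTER_S)
    if len(done) < k:                              # stragglers: hedge with parity shards
        tasks += [asyncio.create_task(fetch_shard(k + j)) for j in range(m)]
    results = {}
    for fut in asyncio.as_completed(tasks):
        idx, data = await fut
        results[idx] = data
        if len(results) >= k:                      # any k of 14 reconstructs the chunk
            break
    for t in tasks:
        t.cancel()                                 # stop whatever is still in flight
    return results

print(sorted(asyncio.run(hedged_read()).keys()))
```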
Quantified durability derivation (the 11-nines claim)
Assume:
- Disk AFR = 1% annual → P(disk death in 1 year) = 0.01.
- Stripe of 14 shards (RS(10,4)), each on independent disks.
- P(≥5 shard losses in 1 year before repair) ≤ stripe_loss_prob.
- MTTR = 4 hours (aggressive placement + repair).
Simplified model: during repair window (4 h = 1/2190 year), per-disk failure prob ≈ 0.01 / 2190 ≈ 4.6e-6. P(≥4 more fail in that window | 1 already lost) ≈ C(13,4) × (4.6e-6)^4 = 715 × 4.5e-22 ≈ 3.2e-19 per stripe per year.
Total stripes ≈ 1.4 EB / (14 × 8 MB) = 1.25e10 stripes. P(any stripe loses ≥5 shards) ≈ 1.25e10 × 3.2e-19 ≈ 4e-9 per year → ~9 nines.
To hit 11 nines we need a shorter MTTR or more parity (RS(12,4) still tolerates 4 losses, but spreads a stripe over 16 shards, so the repair-window math gets tighter). Real systems get the extra 2 nines via:
- Geo-redundancy for archive (adds effectively ~3 nines once independent-region failure is factored in).
- Shorter MTTR (sub-hour via aggressive repair) — in this model the stripe-loss probability scales as MTTR^4, so halving MTTR buys roughly 16× durability.
- Diversity of failure modes beyond independent AFR (firmware, power, cooling) — hurts, not helps; Backblaze data shows real-world correlated failures.
The honest L7 admission: 11 nines is a marketing number. Real per-object annual loss on S3 is empirically much smaller than the model predicts because of geo-replication and software redundancy layers; the model above gives 9 nines for a single-region RS(10,4) warm tier, which is why production systems layer cross-region async replication and journaled writes on top.
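The same simplified model in executable form (leading-term approximation, independent failures, union bound across stripes; the probability of the initial failure is not multiplied in, exactly as in the prose), so the sensitivity to AFR, MTTR, and parity is easy to poke at.

```python
from math import comb, log10

AFR         = 0.01                 # annual failure rate per disk
MTTR_H      = 4                    # repair window, hours
K, M        = 10, 4                # RS(10,4): data lost at >= M+1 concurrent losses
SHARD_BYTES = 8e6
PHYSICAL    = 1.4e18               # 1.4 EB physical

p_window   = AFR * MTTR_H / (365 * 24)        # per-disk failure prob during the window
# Given one shard already lost, probability that >= M more of the remaining
# K+M-1 disks fail inside the repair window (leading term only).
p_stripe   = comb(K + M - 1, M) * p_window ** M
stripes    = PHYSICAL / ((K + M) * SHARD_BYTES)
p_any_loss = stripes * p_stripe               # union bound, per year

print(f"p_window   = {p_window:.2e}")         # ~4.6e-6
print(f"p_stripe   = {p_stripe:.2e}")         # ~3.2e-19
print(f"stripes    = {stripes:.2e}")          # ~1.25e10
print(f"p_any_loss = {p_any_loss:.2e}")       # ~4e-9 per year
print(f"nines      = {-log10(p_any_loss):.1f}")   # ~8-9 nines, single region
```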
7.2 Metadata Scalability — why GFS's single master had to die
Why critical. Metadata is the serializer of the whole system. A slow metadata service makes the fastest chunkservers useless. GFS master hit a wall at 50 PB / ~100 M files / ~10 K QPS. Colossus's first job was to kill that master.
Alternatives (quantified)
| Design | Namespace scale | Metadata QPS | LIST perf | Ops complexity | Systems |
|---|---|---|---|---|---|
| Single master (RAM) | ≤50 PB / ~100 M files | ~10 K QPS | O(1) tree walk | Simple, single-pane | GFS v1, HDFS NN v1 |
| Federation (namespace slicing) | linear w/ slices | +N× | breaks cross-slice rename | medium | HDFS Federation |
| BigTable-backed (Colossus curators) | effectively unbounded | ~1 M+ QPS | O(log) + range scan | complex | Colossus, BigTable itself |
| Sharded transactional KV (Spanner dirs) | unbounded | ~10 M QPS | range scan with ordered shards | complex | F1, CockroachDB-style |
| DHT / consistent hashing (Dynamo) | unbounded | ~10 M QPS | no prefix scan — deal-breaker for S3 LIST | simple | Dynamo, Cassandra, Swift rings |
Why DHT-style metadata is WRONG for S3-like
- LIST `?prefix=foo/bar/` requires finding all keys in the range `[foo/bar/, foo/bar0)`. A hash-based placement scatters that range across every shard, so a LIST becomes a scatter-gather over every metadata node, which:
  - makes LIST latency scale with shard count;
  - makes LIST cost scale with shard count, not result size;
  - makes LIST fragile — one slow shard stalls the whole result.
- Dynamo/Swift sidestep this by not doing prefix LIST natively (Swift does flat container listings but has scaling issues there). This is fine for KV blobs; it's fatal for S3 API compatibility.
My chosen approach
- Ordered sharded KV for metadata (Spanner-like directory layer), `(bucket_id, key)` lex-ordered.
- 256 shards per region, each a 5-member Paxos/Raft group.
- Shard split when a shard grows past ~100 GB of metadata or 50 K QPS.
- Shard re-mapping handled by a directory service; frontend caches shard→range mapping with TTL + ETag.
Bucket-hotspot mitigation (earned-secret)
- A tenant uploads 1 M objects/hour to the same prefix (e.g., `/logs/YYYYMMDD/HHMM/`). The metadata shard owning that range gets hammered.
- Fix: auto-split on load, not just on size. Each shard tracks p99 write latency; if it exceeds the SLA for 5 min, propose a split at the range midpoint.
- Client-side hint: SDKs recommend adding a hash prefix to keys (`/logs/ab12_20260418...`) — the classic S3 partition-key trick (since 2018 S3 has removed the need, but underneath it's exactly this).
Metadata-vs-data consistency (the "metadata consistency" part of the brief): how do we keep metadata consistent with the data it points to?
The commit protocol:
1. Frontend slices the body and assigns chunk IDs (content-addressable).
2. Frontend writes chunks to chunkservers — chunks are uncommitted, linked only in a transient upload record.
3. Chunkservers fsync and return a placement receipt with a generation number.
4. Frontend calls metadata `commit_object` with the manifest + receipts.
5. Metadata Paxos commits the row. At this moment the object exists.
6. If step 4 or 5 fails, the chunks are orphaned. GC reaps them after a TTL (default 7 days). No lost data; just space reclaimed.
Critical property: a client never sees an object until metadata commit. If metadata commits but frontend crashes before returning 200, client retries; retry writes same content-addressed chunks (idempotent) and re-commits — metadata sees same version_id and returns OK (idempotent commit). This is why version_id is derived from a hash of the content + timestamp, not a server-generated serial that would differ on retry.
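A toy in-memory version of this commit protocol (all names hypothetical) makes both properties concrete: nothing is visible before the metadata commit, and a retry after a crash is idempotent because chunk IDs and version IDs derive from content.

```python
import hashlib

chunkstore: dict[bytes, bytes] = {}     # chunk_id -> bytes (orphaned until referenced)
metadata:   dict[tuple, dict]  = {}     # (bucket, key) -> committed row

def put(bucket: str, key: str, body: bytes, ts: int, crash_before_commit: bool = False):
    sha = hashlib.sha256(body).digest()
    manifest = []
    for off in range(0, len(body), 8):                 # tiny "chunks" for the demo
        chunk = body[off:off + 8]
        cid = hashlib.sha256(chunk).digest()[:16]      # content-addressable chunk ID
        chunkstore[cid] = chunk                        # step 2: data written first
        manifest.append(cid)
    if crash_before_commit:
        return None                                    # orphaned chunks; GC reaps later
    version = hashlib.sha256(sha + ts.to_bytes(8, "big")).digest()[:8]
    metadata[(bucket, key)] = {"version": version, "manifest": manifest, "sha256": sha}
    return version                                     # step 5: object now visible

def gc():
    """Reap chunks not referenced by any committed manifest (after a TTL in production)."""
    live = {cid for row in metadata.values() for cid in row["manifest"]}
    for cid in list(chunkstore):
        if cid not in live:
            del chunkstore[cid]

# Crash before commit: nothing visible, chunks orphaned, GC cleans up.
assert put("b", "k", b"hello object store", ts=1, crash_before_commit=True) is None
assert ("b", "k") not in metadata
gc(); assert not chunkstore

# Retry rewrites the same content-addressed chunks and re-commits the same version.
v1 = put("b", "k", b"hello object store", ts=1)
v2 = put("b", "k", b"hello object store", ts=1)        # idempotent re-commit
assert v1 == v2
```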
Handling torn writes / partial failures
- A chunk write fails mid-stripe: 12 of 14 shards acked, 2 didn't. Options:
- Abort: tell client 503, GC the 12 acked shards (cheap — they're orphaned).
- Commit with degraded flag, repair in background. Dangerous if further failure before repair; don't.
- Decision: always abort-and-retry. Simpler, and matches S3 behavior.
Dealing with the metadata service itself going down
- Write availability: 99.99% via Paxos across 5 replicas in 3 zones (tolerates 2 losses).
- Read availability: local-follower reads OK for immutable keys (post-commit metadata doesn't mutate except for lifecycle/delete).
- Metadata partitioning (network partition inside one region): minority-side rejects writes (503) but can still serve reads with "may be stale" hint. Matches CP side of CAP.
7.3 Durable writes & end-to-end integrity (how we actually hit 11 nines in practice)
Why critical. Silent data corruption is the #1 cause of real-world "impossible" data loss on large clusters. Backblaze reports ~1 bit error per 10^14–10^15 bits read on consumer disks — at EB-scale, you will see silent corruption daily. No amount of replication fixes corruption you didn't detect.
The stack of integrity checks
- Client-provided content-MD5 or SHA256 in PUT header. Server re-hashes on receipt; mismatch → 400 BadDigest. Detects in-flight corruption from client to edge.
- Per-shard CRC32C computed at encode time, stored alongside shard. Verified on every disk read.
- Per-page CRC32C at 64 KB granularity inside the chunkfile. Verified on every read; bad page → reconstruct from parity.
- Per-object SHA-256 in metadata, end-to-end. After read + reassemble + decrypt, SHA-256 is recomputed; mismatch → pull alternate parity, log corruption, fail fast.
- Scrub walker — background process re-reads every shard every N days (typically 14 days on warm, 30 days on cold), verifies all CRCs. On corruption: mark shard dead, trigger repair.
- End-of-disk-life RMA triggered by SMART + CRC error rate. Disk is "drained" before it hits critical.
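A sketch of the per-page checksum layer (items 2-3 above); production uses hardware CRC32C, so zlib's CRC-32 here is a stand-in, not the real polynomial.

```python
import zlib

PAGE = 64 * 1024   # 64 KB pages inside a chunkfile

def page_crcs(shard: bytes) -> list[int]:
    """Computed at write time and stored alongside the shard."""
    return [zlib.crc32(shard[o:o + PAGE]) for o in range(0, len(shard), PAGE)]

def verify(shard: bytes, stored: list[int]) -> list[int]:
    """Return page indexes that fail CRC; each bad page is re-read from a replica
    (hot tier) or reconstructed from parity (EC tiers)."""
    return [i for i, o in enumerate(range(0, len(shard), PAGE))
            if zlib.crc32(shard[o:o + PAGE]) != stored[i]]

data = bytes(range(256)) * 1024                          # 256 KB test shard -> 4 pages
crcs = page_crcs(data)
corrupted = bytearray(data); corrupted[70_000] ^= 0x01   # flip one bit in page 1
assert verify(bytes(corrupted), crcs) == [1]
```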
Why not just TCP checksum?
- The TCP checksum is only 16 bits (a ones'-complement sum, not a CRC): a corrupted segment has roughly a 1-in-65,536 chance of passing anyway, and some error patterns pass far more often; and once data is memcpy'd in RAM, TCP doesn't help. [Stone & Partridge 2000] "When the CRC and TCP Checksum Disagree."
- Disk firmware checksums miss writes that went to wrong sectors ("phantom writes"). Only end-to-end checksums catch torn writes / misdirected writes.
Quorum writes vs synchronous all (the interview sleight-of-hand)
You'll see answers say "quorum write: W=2, R=2, N=3, consistency guaranteed." That's Dynamo, and it's subtly wrong for object storage:
- With W=2 of 3, the 3rd replica is eventually consistent. Read-your-writes requires R+W > N → R=2, W=2, N=3 → R+W=4 > 3. OK.
- But: if W=2 and the 2 that acked are in the same rack, a rack failure loses the write. You must enforce zone-diversity at write time, not just placement policy.
- S3 went strong-consistent in Dec 2020 by writing to an odd number with majority and a consistent cache on metadata. Details not public, but almost certainly: write W=3 of 3 (all) synchronously for metadata, and for data, multi-AZ quorum with repair.
My model: metadata = synchronous Paxos commit (W=3 of 5); data chunks = all-acks (W=14 of 14 for RS(10,4)) at write time with hedging. Why all-acks:
- Only 14 shards per stripe; "quorum of 14" is silly.
- The latency cost of waiting for the last 2 is small if tail is well-managed.
- If 1–2 shards are persistently slow, PD reassigns before the write (health-aware placement), rather than committing with missing shards. A stripe committed with missing shards is a stripe you have to reconstruct on first read — bad latency UX.
Handling bit rot across the tiers
- Hot: 3× replicated, scrubbed weekly, CRC mismatch → copy from another replica.
- Warm: RS(10,4), scrubbed bi-weekly, CRC mismatch → reconstruct from parity.
- Cold: RS(12,4) or LRC, scrubbed monthly, shards on SMR HDDs (read-mostly), corruption rate higher → more frequent scrub budget needed.
- Archive (tape): scrubbed yearly, 2-copy geo.
Production naming Colossus D papers mention the "end-to-end CRC composition" pattern: block CRC composes to chunk CRC composes to object CRC, so you can verify a range read's integrity without re-reading the whole object. ZFS does this with Merkle trees (Bonwick). HDFS does it with per-block checksum files.
8 Failure Modes & Resilience #
Per-component, detection → blast radius → mitigation → recovery.
8.1 Single disk failure
- Detection: SMART + read error rate + missed heartbeat from the disk scrubber. Node marks the disk `DEAD` within 60 s.
- Blast: one shard per stripe, ~5 M stripes affected (20 TB / 4 MB avg shard = 5 M).
- Mitigation: stripes are still live (RS(10,4) tolerates 4 losses). No user-facing error.
- Recovery: repair queued immediately, SLA 4 h. During repair, read path uses reconstruction for affected stripes (adds ~50 ms). Alert only if >5 disks fail simultaneously (then rack/node event).
8.2 Node failure
- Detection: PD heartbeat miss for 3 intervals (15 s). Node marked `DOWN`.
- Blast: 60 disks × ~5 M chunks = 300 M chunks. Each stripe loses at most 1 shard (since placement MUST ensure ≤1 shard per node).
- Mitigation: same as 8.1 ×60. Large repair job (1.2 PB) — spread over 24–48 h if QoS allows.
- Recovery: if node comes back within grace period (1 h), re-attach and cancel repair for not-yet-started work. If not, repair to completion; drain the node when it returns.
- Correlated failure: if 5 nodes in one rack die (power outage), the invariant "≤1 shard/rack" means stripes lose ≤1 shard — still safe. But placement must actually enforce it.
8.3 Rack failure
- Detection: correlated heartbeat loss in one rack-id cluster.
- Blast: up to 1800 disks (~9 B chunks), but ≤1 shard lost per stripe under the invariant.
- Mitigation: no user-facing error if invariant holds. If placement ever violated invariant (bug), now you see correlated stripe damage.
- Recovery: power-on or draining. Repair-to-completion over 72 h via separate QoS lane (<10% of cluster bandwidth to avoid user-traffic SLO burn).
8.4 Zone failure
- Detection: regional health service (not the same zone's PD — chicken/egg) detects zone-wide packet loss.
- Blast: ~25% of fleet in a 4-zone region = 500 nodes, 30 K disks. Under RS(10,4) with ≤4 shards/zone, each stripe loses ≤4 shards → still readable but at bare minimum, no further loss tolerance.
- Mitigation: IMMEDIATELY enter "degraded" mode: halt repair traffic (it amplifies damage if we mis-place), halt placement churn, serve reads only via reconstruction for affected stripes, block writes to affected zone.
- Recovery: after zone returns, delta-repair (re-verify shards rather than re-write). If zone permanently lost, 30+ day repair job to re-hydrate to 14/14.
8.5 Metadata partition
- Detection: Raft leader lease fails to renew; followers detect.
- Blast: the affected metadata shard (one 256th of the keyspace) is unavailable for writes; reads can continue with stale data for up to `stale_read_bound`.
- Mitigation: client SDK sees 503 on that key-range and retries with backoff. Most buckets span multiple shards (hashed), so a single-shard outage only affects ~0.4% of keys.
- Recovery: new leader elected within 5 s (typical Raft). If minority permanently isolated, majority continues.
8.6 Silent corruption
- Detection: CRC mismatch on read OR during scrub.
- Blast: single shard per event.
- Mitigation: read path falls back to parity reconstruction, logs event, pulls from an alternate shard.
- Recovery: shard marked dead; repair replaces it. If SHA-256 mismatch at object level (meaning multiple shards corrupted in a correlated way), this is an incident — page the SRE, freeze the object's bucket, investigate (firmware bug? cosmic ray? malicious?). Correlated silent corruption has happened in production (Amazon S3 2008 hash-bug, Netflix 2019 memory bit-flip).
8.7 Thundering herd after outage
- Scenario: a zone comes back after 30-min outage; 25% of client traffic re-targets; cold caches; retry-amplified QPS can be 3–10×.
- Detection: QPS spike monitors at LB and at metadata.
- Mitigation:
- Global rate-limit at LB: token bucket per tenant, 2× steady-state allowed for 60 s then drop to SLA.
- Coalesce in-flight reads: multiple GETs for the same key within 50 ms are served from one upstream read (request collapsing / "singleflight"); no amplification. See the sketch after this list.
- Backoff with jitter enforced in official SDK. Clients without SDK: 503 + Retry-After tells them to wait.
- Graceful admit: LB sheds 10% of load rather than let the fleet tip into CPU saturation. Matches Netflix "adaptive concurrency limits" (Little's law in the LB).
- Recovery: rate-limit relaxes when QPS returns to steady-state for 5 min.
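A minimal singleflight sketch for the coalescing mitigation above, assuming an asyncio frontend; followers simply await the leader's in-flight future, so the 50 ms window falls out of the upstream read latency.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent fetches of the same key into one upstream call."""
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch):
        if key in self._inflight:                       # follower: piggyback
            return await asyncio.shield(self._inflight[key])
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fetch(key)                   # leader: the one upstream read
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)
            raise
        finally:
            del self._inflight[key]

async def demo():
    calls = 0
    async def fetch(key):
        nonlocal calls; calls += 1
        await asyncio.sleep(0.05)                       # simulated origin read
        return f"body-of-{key}"
    sf = SingleFlight()
    bodies = await asyncio.gather(*[sf.do("bucket/key", fetch) for _ in range(1000)])
    assert calls == 1 and len(set(bodies)) == 1         # 1000 GETs, one upstream read

asyncio.run(demo())
```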
8.8 Correlated cross-zone failure (the "bad deploy" scenario)
- Scenario: a rolling push of a new chunkserver binary has a bug that crashes on a specific byte pattern. Deploy hits 3 of 4 zones before detection.
- Detection: elevated crash rate across multiple zones within 2 min.
- Mitigation: deploy system auto-rollback on health-signal regression (Canary + Argos-style). Staggered deploy per zone with 24 h hold makes zone-wide correlated failures from code nearly impossible.
- Recovery: roll back binary, nodes self-recover as they restart.
9 Evolution Path #
v1 — Single-region, GFS-like, 3× replication
- Single metadata master (active + passive, log-shipped failover), 3× rep everywhere.
- Targets: 10 PB, 1 B objects, 100 K QPS.
- Timeline: 6 months to MVP. Known to break at ~50 PB / master exhaustion.
- Operationally: simple, single pane, 3-person SRE team can carry it.
v2 — Sharded metadata + EC for cold (Colossus-like)
- 256 metadata shards (Paxos), hot-tier 3× rep, warm/cold RS(10,4).
- Background tiering mover converts warm→cold.
- LRC(12,2,2) experimented for coldest tier.
- Add: lifecycle rules, versioning, multipart.
- Scale targets: 1 EB, 10 B objects, 10 M QPS. (This is the design spec above.)
- Key migration: metadata migrated via online dual-write + shadow-read; data tier converted stripe-by-stripe.
v3 — Multi-region, globally-eventually-consistent
- Per-region v2 stacks, each with its own metadata/data plane.
- Cross-region replication service: async log-shipped (change data capture from Raft log) with p95 <5 s end-to-end.
- Conflict resolution: last-writer-wins by Hybrid Logical Clock timestamp; for PUT-PUT same key, arbitrary resolution with notification to tenant.
- Global table per bucket: uses Spanner-like TrueTime or Calvin-style deterministic order for the tiny subset of "global-strong" buckets (at latency cost).
- Geo tiering: hot tier in the region of origin, cold tier shipped to a lower-cost region for ~40% cost reduction.
- Disaster recovery: every region has 2 partner regions; losing 1 region degrades read latency by ~50 ms, losing 2 regions simultaneously is pager-worthy but RPO=seconds.
v4 — (out of scope, FYI)
- Compute-near-storage (S3 Select, SELECT-over-object).
- Post-quantum-ready KMS.
- Storage-class-memory tier for ultra-hot (Optane-like; though Intel killed Optane, similar class tech persists).
10 Out-of-1-Hour Notes #
10.1 Consistent hashing vs centralized placement
- Consistent hashing (Dynamo): no placement driver, every node knows the ring, O(log N) rebalance on membership change. Simple, but:
- Rack/zone diversity is hard — the ring doesn't know topology. Dynamo uses "virtual nodes + preference lists," still suboptimal under correlated failure.
- Small objects pay full per-request hashing cost; for EC stripes you need to pick k+m locations with constraints, which Ring can't natively do.
- Result: Dynamo-style is great for equal-size, no-topology-constraint payloads (sessions, shopping carts). Not great for EC-first object storage.
- Centralized PD: can optimize for topology, capacity, heat, and firmware diversity. Cost is complexity and that PD can be a bottleneck (mitigate: cache in frontends, PD only authoritative for changes).
- Real-world: Colossus uses PD. Ceph uses CRUSH which is a clever middle ground — pseudo-random but topology-aware, so clients can compute placement from a topology map without a central oracle.
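A toy version of the PD's constrained pick versus a bare hash ring: greedily choose 14 disks while enforcing ≤1 shard per rack and ≤4 shards per zone. A real PD also weights by free space, heat, and firmware diversity, and CRUSH computes an equivalent choice pseudo-randomly from the topology map; this is only a sketch of the constraint satisfaction.

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Disk:
    disk_id: int
    rack_id: int
    zone_id: int

def pick_placement(disks: list[Disk], n_shards: int = 14,
                   max_per_rack: int = 1, max_per_zone: int = 4) -> list[Disk]:
    """Greedy constrained placement for one EC stripe (toy version)."""
    racks, zones, chosen = Counter(), Counter(), []
    for d in random.sample(disks, len(disks)):          # shuffled scan for spread
        if racks[d.rack_id] < max_per_rack and zones[d.zone_id] < max_per_zone:
            chosen.append(d)
            racks[d.rack_id] += 1
            zones[d.zone_id] += 1
            if len(chosen) == n_shards:
                return chosen
    raise RuntimeError("not enough failure-domain diversity for this stripe")

# 4 zones x 16 racks, as in the §3 layout (disk counts shrunk for the demo).
fleet = [Disk(disk_id=i, rack_id=i % 64, zone_id=(i % 64) // 16) for i in range(4096)]
stripe = pick_placement(fleet)
assert max(Counter(d.rack_id for d in stripe).values()) == 1
assert max(Counter(d.zone_id for d in stripe).values()) <= 4
```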
10.2 Disk formats
- Ext4 fails at billions of files (inode table bloat). Don't do 1 chunk = 1 file.
- XFS is better but same tail. Packing into chunkfiles is mandatory.
- Raw block + in-house format (what Colossus D does): directly manage LBAs, no filesystem overhead. Fastest, most complex.
- SMR (shingled) HDDs: 15–20% density improvement, but sequential-write-only. Perfect for cold tier append-only data. EC bundles rewritten rarely → SMR is a natural fit.
- NVMe SSDs: for hot tier only; $/GB still ~20× HDD in 2026.
10.3 Write-through CDN vs origin reads
- CDN at edge (CloudFront / Fastly / Google-CDN): caches public GETs; 10× reduction in origin QPS and ~50ms latency improvement globally. Invalidation on PUT via purge API or short TTL.
- Origin direct for private objects, pre-signed URLs, large streaming content that doesn't fit edge caches.
- Write-through CDN: almost never. Writes are rare enough that CDN-layer caching complicates consistency (CDN caches stale → reader sees old object) without bandwidth savings.
- Design choice: serve public bucket policy via CDN; everything else direct to frontend.
10.4 Encryption at rest
- Per-object DEK, randomly generated 256-bit AES key.
- DEK encrypts object bytes in AES-GCM (authenticated encryption catches tampering in addition to CRC).
- DEK is wrapped with per-tenant KEK stored in KMS.
- KEK itself wrapped by a root key in HSM, rotated yearly.
- Rotation: DEK rotation requires re-encrypt (expensive, usually lazy on next overwrite). KEK rotation is cheap — only re-wrap existing DEKs, re-key via KMS API.
- Key deletion = effective object deletion for compliance — destroy KEK, DEKs unwrappable, objects unrecoverable. GDPR right-to-delete friendly (but see §10.7 for EC + deletion headache).
- BYOK (customer-managed keys): tenant holds KEK in their own KMS; our system calls their KMS to unwrap DEK on every access. High-value enterprise feature.
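A sketch of the DEK/KEK envelope pattern above using AES-GCM from the `cryptography` package; the in-process "KMS" is a stand-in, and key storage, rotation, and the HSM root are elided.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Stand-in "KMS": one KEK per tenant, held server-side (really an HSM-rooted KMS).
tenant_kek = AESGCM.generate_key(bit_length=256)

def encrypt_object(plaintext: bytes, object_key: str) -> dict:
    dek = AESGCM.generate_key(bit_length=256)             # per-object DEK
    nonce = os.urandom(12)
    ct = AESGCM(dek).encrypt(nonce, plaintext, object_key.encode())   # AAD binds the key
    # Wrap the DEK under the tenant KEK; only the wrapped form lands in metadata.
    wrap_nonce = os.urandom(12)
    wrapped_dek = AESGCM(tenant_kek).encrypt(wrap_nonce, dek, b"dek-wrap")
    return {"ciphertext": ct, "nonce": nonce,
            "wrapped_dek": wrapped_dek, "wrap_nonce": wrap_nonce}

def decrypt_object(rec: dict, object_key: str) -> bytes:
    dek = AESGCM(tenant_kek).decrypt(rec["wrap_nonce"], rec["wrapped_dek"], b"dek-wrap")
    return AESGCM(dek).decrypt(rec["nonce"], rec["ciphertext"], object_key.encode())

rec = encrypt_object(b"object bytes", "b/bucket/key")
assert decrypt_object(rec, "b/bucket/key") == b"object bytes"
# Crypto-shredding: destroy the KEK and every wrapped DEK (hence every object)
# under it becomes unrecoverable -- see §10.7.
```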
10.5 Multi-tenant isolation & noisy neighbor
- Per-tenant token bucket at LB — capped QPS and bandwidth.
- Per-tenant shard affinity: huge tenants pinned to a subset of metadata shards so their LIST bursts don't harm others.
- Priority classes on chunkserver disk scheduler: P0 (paid), P1 (standard), P2 (best-effort batch). Colossus D has per-request priority with weighted fair queuing.
- CPU/mem isolation: containerized chunkserver and metadata workers; noisy tenant can't starve sibling containers.
- IOPS SLO per tenant — S3 historically under-exposed this; GCP's Anywhere Cache and Azure's Premium tiers now expose guaranteed IOPS.
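A sketch of the per-tenant token bucket from the first bullet (one bucket per tenant per dimension, e.g., QPS and bytes/s); the rate and burst values are illustrative.

```python
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst       # tokens/sec refill, bucket depth
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                              # caller returns 503 SlowDown + Retry-After

# One bucket per tenant; illustrative limits: 5k QPS sustained, 2x burst.
qps_limits = defaultdict(lambda: TokenBucket(rate=5_000, burst=10_000))

def admit(tenant_id: str, request_cost: float = 1.0) -> bool:
    return qps_limits[tenant_id].allow(request_cost)
```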
10.6 Cost per GB-month modeling
- Raw HDD: ~$15/TB (2026 enterprise spot price, amortized). At 1.4× EC overhead, effective $21/TB-physical-for-1TB-logical.
- Chassis + rack + power + cooling: ~1.5–2× disk cost over 5 years.
- Networking amortized: ~$2/TB/year.
- Software/SRE/amortized R&D: ~$3/TB/year.
- Total warm tier: ~$8/TB/year = $0.67/TB-month = $0.00067/GB-month cost.
- S3 Standard charges ~$0.023/GB-month — ~34× markup, funds availability, API, profit, and that the public price is a "retail" price with implied headroom.
- Tape tier (Glacier Deep Archive): LTO-9 is ~$4/TB, robotic library + retrievals ~$2/TB/year effective; ~$0.00017/GB-month cost, retail $0.00099.
- Erasure coding saves ~53% of disk cost vs 3× replication (1.4× vs 3.0× overhead); at 1 EB scale that's $150M+/region over 5 years. This is the reason EC exists.
10.7 Regulatory (GDPR right-to-delete meets EC)
- "Delete" on object-level = metadata tombstone + eventual GC of chunks.
- GC of EC-shards: all 14 shards of each affected stripe must be re-encoded (since a stripe may contain bytes from multiple objects if packed). For packed bundles, deletion = mark bytes dead; compaction rewrites bundle excluding dead bytes (like LSM-tree compaction). Until compaction runs, deleted bytes remain recoverable in theory.
- Compliance SLA: GDPR requires "without undue delay," typically ≤30 days. Lifecycle scheduler ensures compaction runs within SLA on delete-heavy buckets.
- Crypto-shredding as an alternative: delete the per-object DEK from KMS. Bytes remain but unreadable. Acceptable for most regulators; cryptographic one-way erasure in practice. Cheaper than re-compacting EC bundles.
- Data sovereignty: bucket pinned to a specific region/country; no cross-region replication without tenant opt-in. Audit log proves location.
10.8 Observability (tail latency and how you find disk rot)
- RED metrics: rate, error, duration per endpoint × tenant × region.
- Histogram quantiles — not averages. p50/p90/p99/p99.9 per GET/PUT path. Tail is where SRE lives.
- Disk-level wearout dashboard: SMART attributes, CRC error rates, reallocated sectors, disk-model groups (for correlated failure prediction).
- Repair queue depth & bandwidth consumed — first-class SLO. Alert when EC reconstruct fanout per user read exceeds threshold (= stripes are under-healthy).
- Metadata shard fan-out for LIST operations — detects hotspot buckets.
- Per-tenant top-N bandwidth / IOPS dashboards — blame game when a tenant noisy-neighbors.
- Chaos engineering: regular random disk/node/rack kills; enforces that MTTR SLO is real, not aspirational.
- Tracing: open-telemetry span from client → LB → FE → metadata → chunkserver for every sampled 1:10K request. Essential for debugging tail latency ("why did this particular PUT take 3s?" → span shows metadata Paxos leader election during that window).
10.9 Capacity planning out-of-band
- Growth forecast: 60%/year logical data growth per region (historically matched for hyperscalers).
- Disk procurement lead time: 26 weeks in 2026 post-AI-GPU-supply-crunch side-effects. Plan 9 months ahead.
- Power ceiling in the DC is the real bottleneck, not floor space: a rack of 1800 disks draws ~15 kW. A 10 MW DC caps at ~670 racks — aligns with our 64-rack regional spec giving 10x growth headroom before needing a new DC.
10.10 What I did not have time to build but would
- Cross-tenant dedup (cryptographic — convergent encryption) — saves 10–30% on CI / backup workloads. Privacy-sensitive; needs careful design.
- Event notifications (bucket → Lambda/PubSub) for PUT/DELETE. Reliable delivery requires a durable outbox pattern on metadata commit.
- Object Lock / WORM compliance mode — immutable retention for SEC/FINRA compliance.
- Bucket inventory reports — S3 exports CSV/Parquet inventory to a destination bucket daily. Enables fast LIST for customers.
- Intelligent tiering — ML-predicted access pattern per object, auto-moves between hot/warm.
Verification vs quality bar #
- SRE pager-carryable? Yes — §8 per-component failure-modes with detection/blast/mitigation/recovery. Added "staggered deploy" and thundering-herd mitigations (§8.7, 8.8).
- Every arrow maps to a real API (§4) or schema (§5)? Yes — arrows labeled with gRPC/HTTPS map to `commit_object`, `reserve_version`, `WRITE_SHARD`, `GET_CHUNK`, health-ping.
- Deep-dive L7 vs L6? L7 check: (a) quantified durability derivation from AFR+MTTR, (b) RS small-object penalty + bundle/Haystack rescue, (c) metadata DHT-vs-KV prefix-LIST trade-off, (d) named post-2020 S3 consistency change with why it's hard, (e) CRC vs SHA-256 vs TCP-checksum tower with failure-mode citation.
- Durability math derived from AFR + k/n + MTTR? §7.1 — single-region 9 nines derived; acknowledge 11 nines requires geo-redundancy, not a single-region physical guarantee. This is the honest L7 answer; L6 would assert 11 nines and move on.
- Rejected alternatives for every major choice with quantified trade-offs? Yes — object vs block vs file (§1), DHT vs sharded KV (§7.2), 3× vs RS vs LRC (§7.1), single master vs sharded (§7.2), CDN write-through vs not (§10.3).
- Real systems named? GFS, Colossus (incl. D, curators), HDFS, Ceph (RADOS, RGW, CRUSH), S3, Dynamo, Swift, Haystack, f4, Azure Blob (LRC), Spanner, FoundationDB, BigTable, Glacier, Backblaze (AFR source), Stone/Partridge TCP checksum paper, Dean/Barroso tail-at-scale.
Interview delivery order (60 min) #
- 0–5 min: §1 restatement, clarifying questions, pick object store.
- 5–10 min: §2 FR + §3 NFR and core BOE numbers (1 EB, 70K disks, 256 shards).
- 10–20 min: §4 API + §5 Schema.
- 20–30 min: §6 ASCII diagram (draw live), walk write + read paths.
- 30–50 min: §7 one or two deep-dives, interviewer-driven (usually §7.1 EC or §7.2 metadata).
- 50–60 min: §8 failure modes, §9 evolution path.
- Skip §10 in interview unless asked; use for follow-up.