Q4 Storage & Query Systems
Design a Distributed Storage System (Object Store)
Choose block, file, or object storage, then justify durability, placement, repair, and metadata consistency.
1 Problem Restatement & Clarifying Questions #
Restatement
Design a planet-scale distributed object store (S3 / GCS / Azure Blob / Ceph RADOSGW / f4-class). Durability, strong-enough consistency, sharded metadata, background repair. Pager-carryable for an SRE on-call.
Why object store (not block, not file)?
I pick object store and will justify by ruling out the alternatives:
| Target | Access pattern | Why wrong for this brief |
|---|---|---|
| Block (EBS, Ceph RBD, SAN) | 4KB random R/W, attach-to-one-VM, POSIX-ish via layers above | Tight latency (<1ms), requires IOPS-scheduled LUN-per-host, doesn't fit "EB-scale multi-tenant key addressing." Block stores top out at ~PB per cluster because you're selling a volume abstraction, not a namespace. |
| File (GFS, HDFS, Lustre, Isilon, ColossusFS) | POSIX-ish, directories, rename, hardlinks, mtime, append, chown | POSIX rename is a directory transaction. Directory-op scaling is where GFS single-master hit 50 PB / ~100 M files; Colossus and HDFS Federation spent years fixing it. Not the interesting bottleneck to design against in 60 min. |
| Object (S3, GCS, Azure Blob, Swift, Ceph RGW, Haystack/f4) | Immutable-by-default blobs, flat key namespace, PUT/GET/DELETE/LIST by prefix, eventually billions of tenants | Matches "EB-scale, immutable-write + big-read, cheap, durable" workload. Flat namespace kills the dir-tx bottleneck. Chosen. |
So: object store, S3-grade semantics, GCS/Colossus-grade internals.
Clarifying Questions (I would ask the interviewer; I'm answering them for self-study)
- Workload shape? Assume write-once-read-many-delete-rarely, skewed toward large objects (p50=512 KB, p95=64 MB, p99=1 GB, tail to 5 TB via multipart). This is the "real" distribution on S3/GCS — not uniform. Small-object tax is real (see §7.1).
- Scale? Target 1 EB logical stored, 10 B live objects, 10 M QPS GET, 1 M QPS PUT, 100 K QPS LIST. This is mid-tier regional (AWS has 400+ trillion objects; I'm sizing one region).
- Latency SLO? p50 GET <30 ms, p99 GET <200 ms for hot cache tier, <500 ms for cold. p50 PUT <80 ms, p99 <500 ms. Tail is where L7 lives.
- Durability SLO? 11 nines (99.999999999 % annual) — S3 / GCS / Azure standard. I'll derive it in §3 from AFR + EC k/n + MTTR.
- Availability SLO? 99.99% per region, 99.999% multi-region. One region is allowed to brown out; metadata is allowed to reject writes before corrupting.
- Consistency? S3 went strong-consistent for PUT-after-PUT in Dec 2020. I'll match: read-your-writes for new keys and overwrites; eventually-consistent LIST. Multi-region: async, last-writer-wins with vector metadata.
- Multi-region? Yes, but v3. v1 is single-region. Avoid designing for the hardest case first.
- Multi-tenant? Yes, billions of buckets, millions of tenants. Noisy-neighbor protection required (§10.5).
- Encryption? At-rest AES-256 per-object DEK wrapped by tenant KEK in KMS; in-transit TLS. Out-of-scope for core 60 min, detailed in §10.4.
Out-of-scope (explicitly deferred)
Cross-object transactions; server-side rich query (SQL-on-blob); streaming mutation of an existing object mid-read; filesystem semantics (rename/hardlink); strong cross-region consistency; durable queue / pub-sub on top.
2 Functional Requirements #
In-scope (numbered)
1. PUT `/b/{bucket}/{key}` — body up to 5 GB single-shot, 5 TB via multipart; returns ETag (content-MD5 or content-SHA256).
2. GET `/b/{bucket}/{key}` — full body, with `If-Match` / `If-None-Match` / `If-Modified-Since` and HTTP `Range:` for byte ranges.
3. DELETE `/b/{bucket}/{key}` — soft delete with versioning; tombstone propagation.
4. LIST `/b/{bucket}?prefix=&marker=&max-keys=&delimiter=/` — prefix walk, pagination. Eventually consistent by design (more on this in §7.2).
5. Multipart Upload: `initiate` → `uploadPart` × N (parallel) → `complete` (atomic manifest commit) → single object. Enables 100 GB+ uploads over flaky networks and parallel ingest up to ~10 Gbps per object.
6. Range Read: `Range: bytes=a-b` — served from the exact chunks needed, never a full-object read.
7. ACL / IAM: bucket-level policy + per-object ACL; pre-signed URL (HMAC over canonical request) for delegated access with TTL.
8. Versioning: per-bucket flag; DELETE creates a delete-marker, previous version retained; PUT never overwrites in place.
9. Lifecycle: per-bucket rules (transition to cold tier after N days, expire after M days).
10. CopyObject (server-side copy): metadata-only when source/dest share storage class, deep-copy otherwise.
Out-of-scope (explicitly)
- SELECT-over-object (S3 Select / BigQuery-external). Separate compute layer.
- Atomic PUT-if-not-exists conditional on another key. Single-key conditional PUT is in-scope (`If-Match` on ETag); cross-key transactions are out.
- Strong global consistency — out, see §9 v3.
- Filesystem rename / directory metadata (would require namespace service).
3 Non-Functional Requirements + Capacity Back-of-Envelope #
NFRs
| NFR | Target | Rationale |
|---|---|---|
| Availability | 99.99% per region (4.38 min/mo error budget) | AWS S3, GCS bucket-standard |
| Durability | 11 nines annual | Industry standard; derived below |
| Durability for archive tier | 12 nines | Deep-archive / Glacier class |
| Latency p50 GET | <30 ms (hot) | Within-DC round trip + 1 disk seek |
| Latency p99 GET | <200 ms (hot), <500 ms (cold) | Budget for 1 reconstruct if chunk slow/dead |
| Latency p50 PUT | <80 ms | Write + 2 replica ack (v1) or 10+4 EC acks (v2) |
| Latency p99 PUT | <500 ms | Tail-tolerant, includes metadata commit |
| Consistency | Read-your-writes for PUT; eventual for LIST | Matches S3-Dec-2020 model |
| Throughput | 10 M GET QPS, 1 M PUT QPS per region | Mid-tier region |
| Cost | <$0.015 / GB-month warm, <$0.001 / GB-month cold | Within S3 Standard / Glacier envelope |
Capacity BOE (show the math)
Logical data
- 1 EB = 10^18 B = 10^6 TB = 10^3 PB.
- 10 B objects; mean size = 1 EB / 10 B = 100 MB (skewed; p50 is 512 KB but the tail dominates bytes).
Physical footprint — replication vs erasure coding
- Pure 3× replication: 3 EB physical.
- Reed-Solomon RS(10,4) (10 data + 4 parity per stripe): 1.4× overhead → 1.4 EB physical. This is Facebook f4 / Azure LRS / GCS default for cold & warm.
- Savings of 3.0 − 1.4 = 1.6 EB. At ~$20 /TB HDD amortized (disk + chassis + 3 yr) that's $32 M saved per region. Non-trivial; see §7.1.
Disk fleet
- Assume 20 TB enterprise HDD (2026 vintage; Seagate Exos / WD Ultrastar are shipping 22–24 TB).
- 1.4 EB / 20 TB = 70 000 disks just for user data.
- Add overhead: 15 % free-space for compaction/rebalance, 5 % for metadata replicas, 10 % for transient multipart parts not yet committed, 10 % reserved for EC-reconstruction scratch = 40 % overhead → ~100 K HDDs.
- Hot tier on NVMe SSD (3.84 TB): assume 5 % of bytes are hot → 50 PB / 3.84 TB = ~13 K SSDs.
- Total: ~113 K storage drives in the region.
Rack / zone / region layout
- 60 disks per node (dense JBOD chassis with HBAs); ~30 nodes per rack × 60 disks = 1800 disks/rack.
- 100 K / 1800 ≈ 56 racks for HDD, say 64 racks total with headroom.
- 4 zones per region × 16 racks/zone.
- Placement invariants: no stripe has >1 shard in the same rack, and RS(10,4) is spread across ≥4 zones with ≤4 shards/zone, so a single-zone outage costs a stripe at most 4 shards, exactly at the 4-loss tolerance.
- Beware the tempting shortcuts: "spread across ≥2 zones" lets one zone take out up to 7 of 14 shards, and even 3 zones with ≤5 shards/zone fails (5 > 4). In a 3-zone region RS(10,4) cannot survive a zone loss at all; use RS(6,3) (3 shards/zone) or an LRC instead. This is the single most common bug in interview EC answers (see §7.1).
Metadata footprint
- 10 B objects × ~2 KB metadata/object (bucket_id, key, version, size, checksum, chunk manifest for ~5 chunks × 24 B, ACL id, timestamps, EC params) = 20 TB metadata.
- 3× replicated + Paxos/Raft overhead → ~60 TB across metadata servers. Fits comfortably on sharded SSD KV.
- Metadata op rate: every GET = 1 metadata read, every PUT = 1 metadata write, every LIST = range scan. 10 M GET + 1 M PUT + 100 K LIST = ~11 M read QPS, ~1 M write QPS against metadata.
- At 50 K read QPS per Spanner-class shard → 220 shards for reads; at 10 K write QPS per shard → 100 shards for writes. Take max ≈ 256 metadata shards for headroom.
Network bandwidth
- Read: 10 M QPS × 100 MB mean = 10^15 B/s = 1 PB/s egress. Obviously wrong to size at the mean; most GETs are near p50 = 512 KB.
- Recalibrate with weighted mean: log-normal around p50=512 KB with p99=1 GB tail. Weighted avg ~4 MB/GET. 10 M × 4 MB = 40 TB/s = 320 Tbps regional egress. Matches "big region scale" (AWS us-east-1 is rumored >1 Pbps aggregate).
- Write: 1 M × 4 MB × 1.4 (EC overhead) = 5.6 TB/s = 45 Tbps regional ingest to disks.
- Per-rack: 320 Tbps / 64 racks = 5 Tbps/rack; needs ~8× 800G or 16× 400G uplinks. Consistent with a hyperscale spine.
QPS per chunkserver
- 30 nodes/rack × 64 racks ≈ 1920 storage nodes. 10 M GET / 1920 = ~5 K GET QPS/node. At 60 disks × ~150 random IOPS = 9 K IOPS/node — disks are the bottleneck for small GETs, so hot data must live on SSDs or be cached.
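These numbers are easy to fat-finger live, so here is the same back-of-envelope as a throwaway script; every constant is an assumption stated in this section, not a measured value.

```python
# Back-of-envelope sizing for §3; inputs are the assumptions stated above.
LOGICAL_BYTES  = 1e18           # 1 EB logical
EC_OVERHEAD    = 1.4            # RS(10,4)
HDD_BYTES      = 20e12          # 20 TB drives
FLEET_OVERHEAD = 1.40           # free space + metadata + multipart + repair scratch
DISKS_PER_RACK = 1800
GET_QPS, PUT_QPS, LIST_QPS = 10e6, 1e6, 100e3
WEIGHTED_GET_BYTES = 4e6        # ~4 MB weighted mean per GET

physical_bytes = LOGICAL_BYTES * EC_OVERHEAD
data_disks     = physical_bytes / HDD_BYTES
fleet_disks    = data_disks * FLEET_OVERHEAD
racks          = fleet_disks / DISKS_PER_RACK

meta_read_qps  = GET_QPS + PUT_QPS + LIST_QPS
shards_reads   = meta_read_qps / 50e3           # 50 K read QPS per shard
shards_writes  = PUT_QPS / 10e3                 # 10 K write QPS per shard

egress_bps     = GET_QPS * WEIGHTED_GET_BYTES * 8
ingress_bps    = PUT_QPS * WEIGHTED_GET_BYTES * EC_OVERHEAD * 8

print(f"data disks      : {data_disks:,.0f}")                        # ~70,000
print(f"fleet disks     : {fleet_disks:,.0f}")                       # ~98,000
print(f"HDD racks       : {racks:,.0f}")                             # ~55, pad to 64
print(f"metadata shards : {max(shards_reads, shards_writes):,.0f}")  # ~220 -> 256
print(f"egress          : {egress_bps/1e12:,.0f} Tbps")              # ~320
print(f"ingress to disk : {ingress_bps/1e12:,.0f} Tbps")             # ~45
```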
4 High-Level API #
REST/HTTP with S3-compatible signatures (SigV4 HMAC). Internal plane is gRPC + Protobuf.
Client-facing (external, HTTPS)
PUT /b/{bucket}/{key} Content-Length, Content-MD5, x-amz-storage-class, x-amz-server-side-encryption
→ 200 { ETag, VersionId, x-amz-request-id, x-amz-id-2 }
Errors: 400 InvalidRequest | 403 AccessDenied | 409 BucketAlreadyExists
| 412 PreconditionFailed | 503 SlowDown (back-pressure, client retries with expo backoff + jitter)
GET /b/{bucket}/{key} [Range: bytes=a-b] [If-Match: etag] [If-None-Match: etag]
→ 200/206 body, ETag, Content-Length, Last-Modified, x-amz-version-id
Errors: 304 NotModified | 404 NoSuchKey | 416 RangeNotSatisfiable | 503 SlowDown
DELETE /b/{bucket}/{key}[?versionId=]
→ 204 (async GC; immediate tombstone in metadata)
LIST /b/{bucket}?prefix=&marker=&max-keys=1000&delimiter=/
→ 200 { Contents:[…], NextMarker, CommonPrefixes:[…], IsTruncated }
Explicitly: eventually consistent, reflects metadata state at read time ±seconds.
# Multipart
POST /b/{bucket}/{key}?uploads → { UploadId }
PUT /b/{bucket}/{key}?partNumber=N&uploadId= body → { ETag }
POST /b/{bucket}/{key}?uploadId= { Parts:[{N, ETag}] } → final ETag, atomic commit
DELETE /b/{bucket}/{key}?uploadId= abort + GC parts
# Pre-signed URL
GET /b/{bucket}/{key}?X-Amz-Signature=&X-Amz-Expires=900&X-Amz-SignedHeaders=…
Server verifies the HMAC with the bucket owner's access-key secret; the request then executes with the bucket owner's IAM identity.
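A minimal sketch of the pre-signed-URL idea, assuming a plain HMAC over method/path/expiry; real SigV4 signs a full canonical request with a date-scoped derived key, so treat this as the shape of the mechanism, not the actual algorithm.

```python
import hashlib, hmac, time
from urllib.parse import urlencode

def presign_get(secret_key: str, access_key_id: str, bucket: str, key: str,
                expires_s: int = 900) -> str:
    """Simplified pre-signed GET: HMAC over (method, path, expiry).

    Shows why the server can verify delegation statelessly; not real SigV4.
    """
    expiry = int(time.time()) + expires_s
    path = f"/b/{bucket}/{key}"
    string_to_sign = f"GET\n{path}\n{expiry}"
    sig = hmac.new(secret_key.encode(), string_to_sign.encode(),
                   hashlib.sha256).hexdigest()
    qs = urlencode({"X-Amz-AccessKeyId": access_key_id,
                    "X-Amz-Expires": expiry,
                    "X-Amz-Signature": sig})
    return f"https://example-store.invalid{path}?{qs}"

# The frontend recomputes the same HMAC with the bucket owner's stored secret
# and compares with hmac.compare_digest before honoring the request.
```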
Error semantics (important for back-pressure)
- 503 SlowDown on overload — always with a `Retry-After` hint and a per-caller token bucket. Client libraries back off exponentially (base 50 ms, factor 2, max 20 s, jittered); see the sketch after this list.
- 500 InternalError for unexpected failures; clients retry only idempotent ops (GET, DELETE with version, PUT with `Content-MD5`, which is idempotent-by-hash).
- 429-class rate-limit at the edge before requests touch the data plane, so a thundering herd doesn't fan out past the LB (see §8.7).
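A sketch of the retry policy described in the first bullet (exponential backoff with full jitter, base 50 ms, factor 2, cap 20 s); the `RETRYABLE` set and attempt cap are assumptions, not part of the spec above.

```python
import random, time

RETRYABLE = {500, 503}            # assumption: statuses the SDK treats as transient
BASE_MS, FACTOR, CAP_MS = 50, 2, 20_000

def call_with_retries(do_request, max_attempts: int = 8):
    """Exponential backoff with full jitter. do_request() returns (status, response)."""
    for attempt in range(max_attempts):
        status, response = do_request()
        if status not in RETRYABLE:
            return status, response
        backoff_ms = min(CAP_MS, BASE_MS * FACTOR ** attempt)
        time.sleep(random.uniform(0, backoff_ms) / 1000)   # full jitter
    return status, response
```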
Control-plane & admin
- CreateBucket / DeleteBucket / PutBucketPolicy / PutBucketLifecycle / PutBucketVersioning — low-QPS, strongly consistent, Paxos-ordered.
5 Data Schema #
Engines
| Data | Engine | Why |
|---|---|---|
| Object metadata (bucket, key → manifest) | Sharded transactional KV — Spanner-class (multi-Paxos groups per shard) or FoundationDB | Need strong consistency for read-your-writes PUT; need ordered range scans for LIST; billions of keys; multi-region Paxos later (v3). Rejected: Cassandra/Dynamo — no transactions, LIST is painful; MySQL/Postgres — can't shard this wide natively. |
| Chunk index (chunk_id → placement) | Same KV or a separate placement service (BigTable-style) | Separating means object metadata can be cached aggressively (immutable manifest post-commit); chunk-index changes during repair without invalidating object caches. |
| Bucket catalog | Same KV, rooted at system bucket `/sys/buckets` | Low QPS, strong consistency, ACL-attached |
| ACL / IAM policies | Same KV | Cross-referenced on every auth; cached aggressively |
| Data (chunks) | Chunkfiles on local disk (Colossus D / GFS chunkserver / RADOS OSD). One chunkfile = many chunks append-packed. | Avoids 1-chunk-1-file on XFS (inode pressure kills perf at billions-of-files scale). Classic f4/Haystack lesson: pack small objects into large container files. |
| Write-ahead log | Local SSD per metadata shard + shard's Paxos ring | Bounds commit latency |
Schemas (logical)
# Metadata: objects table (sharded by hash(bucket_id, key))
object_meta {
bucket_id uint64
key string # up to 1 KB
version_id uint128 # monotonic (ts << 64) | random
is_delete_marker bool
size uint64
content_md5 bytes(16) # ETag for single-part
sha256 bytes(32) # internal end-to-end integrity
storage_class enum {HOT, WARM, COLD, ARCHIVE}
ec_scheme enum {REPL_3, RS_10_4, LRC_12_2_2}
chunk_manifest [chunk_id] # ordered; covers object bytes [0, size)
chunk_sizes [uint64]
acl_id uint64
created_ts uint64 # hybrid logical clock
lifecycle_tag uint32 # for batch lifecycle scans
tenant_id uint64
encryption { kms_kek_id, wrapped_dek }
user_metadata map<string,string> # up to 2 KB
}
# Primary key: (bucket_id, key, version_id DESC)
# Range scans on (bucket_id, key) serve LIST.
# Chunk placement (sharded by chunk_id)
chunk_placement {
chunk_id uint128 # globally unique, content-addressable seed
stripe_id uint128 # many chunks in one EC stripe share this
stripe_pos uint8 # 0..13 for RS(10,4)
disk_id uint64 # physical disk
node_id uint64
rack_id uint32
zone_id uint16
chunkfile_id uint64
offset uint64
length uint32
crc32c uint32 # per-chunk; composes with object-level sha256
gen_number uint32 # lease generation; bumped on recovery
state enum {HEALTHY, REPAIRING, STALE, DEAD}
}
# Bucket
bucket {
bucket_id uint64 PK
name string UNIQUE GLOBAL
owner uint64
created_ts uint64
versioning enum {DISABLED, ENABLED, SUSPENDED}
lifecycle [rule]
replication { enabled, dest_regions[] } # for v3
policy_id uint64
}
# ACL
acl {
acl_id uint64 PK
grants [{grantee_type, grantee_id, permission}]
}
Why sharded KV (not a blob-store master like GFS)
- GFS master held all FS metadata in one process's RAM. Hit ~50 PB / ~100 M file ceiling circa 2009.
- Colossus (curators) shards metadata into BigTable, which is itself sharded; each curator serves a key-range.
- This design: `hash(bucket_id, key) mod N_shards` → shard; each shard is a 5-member Paxos/Raft group on SSD. Range LIST uses an ordered layer (Spanner directories) or a side-index.
Why content-addressable chunk IDs
- Enables dedup for identical chunks across objects (a blessing for image thumbnails, CI artifacts, git packs).
- Enables retry idempotency: if a PUT fails mid-way, a retry writing the same bytes produces the same chunk IDs; no phantom duplicates.
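A minimal sketch of content-addressable chunk IDs, assuming a truncated SHA-256 as the ID (the exact hashing recipe is an assumption); identical bytes yield identical IDs, which is what makes retries idempotent and cross-object dedup possible.

```python
import hashlib

CHUNK_SIZE = 8 * 1024 * 1024          # 8 MB slices, matching the write path in §6

def chunk_ids(body: bytes):
    """Yield (chunk_id, chunk) pairs; same bytes -> same IDs on a retried PUT."""
    for off in range(0, len(body), CHUNK_SIZE):
        chunk = body[off:off + CHUNK_SIZE]
        # 128-bit truncation of SHA-256; collision odds are negligible at 10^10 objects.
        yield hashlib.sha256(chunk).digest()[:16], chunk

ids_a = [cid for cid, _ in chunk_ids(b"same payload")]
ids_b = [cid for cid, _ in chunk_ids(b"same payload")]
assert ids_a == ids_b                 # a retried PUT re-derives the same chunk IDs
```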
6 System Diagram (ASCII, centerpiece) #
Regional architecture
+----------------------+
| Client SDK / App |
| (S3 SigV4, retries) |
+----------+-----------+
| HTTPS, SigV4
v
+----------------------+
| Edge / CDN Cache | (optional for GET; bypass for PUT)
| (write-through? no)|
+----------+-----------+
| HTTPS
v
+--------- Anycast / GCLB ---------+
| Regional Load Balancer (L7) | ~100 k QPS/box, 200 boxes
| TLS term, WAF, rate-limit edge |
+------+----------------------+----+
| |
v v
+---------+-----------+ +-------+----------+
| Frontend / Gateway | | Frontend /Gateway| ... fleet ~2000 nodes
| (stateless) | | ... |
| - Auth (SigV4,IAM) | | |
| - req parse/validate | |
| - slice body->chunks | |
| - pick EC scheme | |
| - upload coordinator | |
+----+-------+-------+-+ |
| | | |
gRPC (KV RPC) | | | gRPC (bulk data) |
v v v v
+--------------------------+ +--------------------------------+
| METADATA SERVICE | | CHUNKSERVER FLEET |
| 256 shards | | ~2000 nodes × 60 HDD + SSD |
| Raft 5-way per shard | | (Colossus D-class) |
| SSD-backed | | - chunkfiles (packed) |
| p99 read <5 ms | | - per-disk IO scheduler |
| p99 write <15 ms | | - QoS classes (user/repair) |
| ------------- | | - crc32c on every 64KB page |
| objects | | - lease holder per chunk |
| chunk_placement | | |
| buckets, ACLs, multipart |
| upload parts in flight | +----^---------------^-----------+
+-----^-------^-----^------+ | |
| | | | |
(range | | (chunk replicate) (repair reads)
scan LIST) | (commit) | |
| | | | |
+-------+-----+ | |
| |
+---------------------+---------------+-------------------+
| |
v v
+--------------------------------+ +---------------------------------+
| CONTROL PLANE | | BACKGROUND / DATA PLANE |
| - Placement Driver (PD): | | - Repairer (scan->reconstruct)|
| picks (node, rack, zone) |<---- health ----->| - Anti-entropy scrubber |
| per stripe | | - Erasure Coder (3->RS shift) |
| - Topology Service | | - Garbage Collector |
| - Health / failure detector | | - Tiering Mover (hot->cold) |
| - Capacity planner | | - Rebalancer (disk skew fix) |
+--------------------------------+ +---------------------------------+
Side services:
- IAM / KMS (DEK wrap/unwrap, per-tenant KEKs)
- Bucket quota / billing meter
- Audit log (all control-plane ops, immutable WORM)
- Replication service (cross-region, v3)
Scale annotations:
- Frontend → Metadata : ~12 M QPS at 256 shards (46 K QPS/shard avg, 80 K peak)
- Frontend → Chunkserver : ~45 Tbps ingress, ~320 Tbps egress regional
- Chunkserver disk IOPS : 60 disks × 150 IOPS = 9 K per node × 2000 = 18 M IOPS
- PD health pings : every 5 s to every node → 400 QPS from PD, fanout to 2000
Write path (single-part PUT, ≤64 MB)
Client Frontend Metadata (shard S) Chunkservers (14 picks across 4 zones)
| | | |
|-- PUT key body ------->| | |
| |- auth/sigV4 | |
| |- shard(bucket,key)=S | |
| |- slice body into 8MB chunks (N=ceil(size/8MB)) |
| |- for each chunk: |
| | erasure-encode RS(10,4) -> 14 shards |
| | PD.pick_placement(14) -> [(node,disk), ...] |
| |- begin_write_txn to S | |
| |----------------------- >|-- reserve object version v ---|
| | | (lease / uncommitted row) |
| |< ---- leases ----------| | |
| | |
| |-------------- WRITE each shard in parallel ------------>|
| | |
| | <---- CRC + ack per shard -----|
| | (wait for ALL 14 shards acked; or 12-of-14 and repair) |
| | |
| |- compose manifest (chunk_ids, sizes) |
| |- commit_object_txn(meta, ec_scheme, manifest, sha256) |
| |------------------------>|-- Raft commit on shard S ----|
| |< --- commit OK ---------| |
|< -- 200 ETag ----------| | |
Latency budget (single 8 MB chunk, intra-DC):
auth + shard routing ~2 ms
slice + encode (in-memory) ~4 ms (RS(10,4) at 2 GB/s per core)
PD placement ~1 ms (cached)
metadata reserve ~5 ms (Raft prepare)
parallel 14-shard write ~25 ms (TCP + disk write cache + fsync-on-close)
metadata commit ~10 ms (Raft commit)
-----------------------------------
total ~47 ms -> p50 PUT target met
Read path (GET)
Client Frontend Metadata (shard S) Chunkservers
| | | |
|-- GET key ------------>| | |
| |- auth | |
| |- ask: obj_meta(b,k) -->|-- read row (cached) --+ |
| |<-- manifest, acl, size -| | |
| |- auth against ACL |
| |- for each needed chunk (range -> subset): |
| | send parallel GET_CHUNK(chunk_id) to placement[0..9] |
| | |
| | <---- chunk bytes + crc -------|
| |- verify crc, stream to client |
| |- (if any of 10 is slow/dead beyond hedge deadline: |
| | fetch 2 parity chunks + reconstruct) |
| | |
|<-- 200 body -----------| |
Latency budget (hot, cache-hit on metadata):
LB + auth ~3 ms
metadata read (cached) <1 ms (LRU hit) or 5 ms (SSD miss)
chunkserver dispatch ~1 ms
parallel 10 chunk reads ~15 ms (one disk seek + transfer)
hedge + stream ~5 ms
-------
total ~25-30 ms -> p50 hit
Recovery path (disk dies)
Health detector (PD) sees node/disk FAIL at T0
|
v
PD queries chunk_placement WHERE disk_id = dead_disk (range scan -> ~5M chunks)
|
v
Repairer schedules reconstructions (priority queue: high if stripe health ≤ floor+1)
|
v
For each stripe with a lost shard:
- pick 10 healthy survivors (for RS(10,4))
- choose new placement satisfying rack/zone constraints
- pull 10 shards over the "repair QoS class" (separate token bucket from user traffic)
- reconstruct missing shard, write to new disk
- Paxos-update chunk_placement atomically (CAS on gen_number)
|
v
Anti-entropy scrubber periodically re-reads every chunk, verifies crc32c,
marks SILENT_CORRUPTION if mismatch -> triggers same repair path.
MTTR target: single-disk (20 TB, 5M chunks) repaired within 4 hours.
Requires: 20 TB / 4 h = 1.4 GB/s reconstruct bandwidth sustained.
Allocated: 10 Gbps repair-class lane per rack, so 64 racks × 1.25 GB/s = 80 GB/s aggregate.
Per-stripe: 10 × shard_size read + 1 × shard_size write = 11× I/O amplification per repaired shard.
7 Deep Dives (3 earned-secret topics) #
7.1 Replication vs Erasure Coding — the real trade-off
Why critical. This choice moves ~40% of the cluster cost and defines 90% of the failure-mode surface. Getting it wrong burns either money (over-replicated) or durability (under-parity). Every "use EC" hand-wave at L7 gets probed.
Alternatives & numbers
| Scheme | Overhead | Max losses / stripe | Reconstruct reads | Small-object penalty | Typical production use |
|---|---|---|---|---|---|
| 3× replication | 3.00× | 2 | 1 (just copy a replica) | none | GFS v1, HDFS default, Ceph RBD |
| RS(6,3) | 1.50× | 3 | 6× | moderate | Azure LRS variants, small clusters |
| RS(10,4) | 1.40× | 4 | 10× | high | Colossus (default warm), f4, S3 (EC tier) |
| RS(12,4) | 1.33× | 4 | 12× | high | Azure, HDFS-EC |
| LRC(12,2,2) — Microsoft | 1.33× | mixed; local repairs cheaper | 6× for single loss | high | Azure Storage |
| Tape (archive) | 1.0× + 2-copy across sites | geo-scale | hours | N/A | AWS Glacier Deep Archive |
Key math — why 1.4× isn't the whole story
Repair cost scales with k. For RS(10,4), replacing one dead shard reads 10 shards to reconstruct. With 20 TB drives:
- Read amplification during repair = 10× (read 10 shards at shard_size to make 1 shard_size of data).
- On a 20 TB disk death, that's 200 TB of reads. At 1 GB/s per node, > 55 node-hours of repair IO competing with user reads.
- This is why you MUST have QoS lanes. Google's public papers describe separate Colossus D "repair tokens." Without them, a disk failure → tail-latency spike → angry pager.
The small-object penalty (earned-secret). Any object smaller than one shard (e.g., 512 KB against an 8 MB shard size) still has to be split into 14 shards, so you end up storing 14 mostly-padding shards across 14 disks: 14 disks touched for 512 KB → catastrophic IOPS amplification on GET. Real systems solve this by:
- Packing — small objects first packed into 64 MB "bundles," then bundles are erasure-coded. This is Facebook Haystack's insight: one bundle = 1000s of photos = amortize EC overhead. f4 extends this to warm tier with RS(10,4).
- Hot-tier 3× replication — hot small objects get 3× replication; EC applies only on cold/warm tier transition (via lifecycle). Matches S3 Standard → S3 IA transition model.
- LRC (Microsoft Locally Recoverable Codes) — adds local parity groups so single-shard repair reads only 6 shards, not 12. Pays 33% vs 40% overhead.
My chosen approach (and why)
- Hot tier (new writes, <7 days): 3× replication across 3 zones. PUT latency 40 ms (no encode). Reads can serve from nearest replica. Repair = 1× read.
- Warm tier (7–90 days): RS(10,4) across 4 zones via background coder. Fully supports 4-shard loss including zone failure.
- Cold tier (>90 days): RS(12,4) or LRC(12,2,2), shard size bumped to 64 MB, possibly stored on SMR drives with bundles.
- Archive tier: single RS(10,4) + geo-replicated to a second region on tape.
Failure modes & mitigations
- Correlated disk failure (bad batch, same firmware bug) — placement MUST NOT cluster shards of one stripe on disks with the same firmware version. PD records `firmware_id` and diversifies.
- Zone failure on a 3-zone region — RS(10,4) can't tolerate losing ≥5 shards; requires a 4-zone layout, or RS(6,3) with 3 zones × 3 shards/zone.
- Slow shard ("stuck" but not dead) — during GET, hedge by reading `k+1` (11 shards) after 20 ms; the first 10 to return win. Classic "tail at scale" technique (Dean & Barroso 2013); see the sketch after this list.
- Bit-rot — crc32c per 64 KB page + sha256 end-to-end. Scrubber re-verifies every chunk every 14 days.
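A sketch of the hedged read from the "slow shard" bullet, with simulated shard fetches and illustrative timings: start the 10 data-shard reads, and past a 20 ms hedge deadline also request the parity shards, then take the first 10 distinct shards to arrive.

```python
import asyncio, random

HEDGE_AFTER_S = 0.020            # 20 ms hedge deadline from the bullet above

async def fetch_shard(idx: int) -> tuple[int, bytes]:
    # Simulated chunkserver read; roughly one in ten is a straggler.
    await asyncio.sleep(random.choice([0.005] * 9 + [0.200]))
    return idx, f"shard-{idx}".encode()

async def hedged_read(k: int = 10, m: int = 4):
    tasks = [asyncio.create_task(fetch_shard(i)) for i in range(k)]
    done, _pending = await asyncio.wait(tasks, timeout=HEDGE_AFTER_S)
    if len(done) < k:                              # stragglers: hedge with parity shards
        tasks += [asyncio.create_task(fetch_shard(k + j)) for j in range(m)]
    results = {}
    for fut in asyncio.as_completed(tasks):
        idx, data = await fut
        results[idx] = data
        if len(results) >= k:                      # any k of 14 reconstructs the chunk
            break
    for t in tasks:
        t.cancel()                                 # stop whatever is still in flight
    return results

print(sorted(asyncio.run(hedged_read()).keys()))
```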
Quantified durability derivation (the 11-nines claim)
Assume:
- Disk AFR = 1% annual → P(disk death in 1 year) = 0.01.
- Stripe of 14 shards (RS(10,4)), each on independent disks.
- P(≥5 shard losses in 1 year before repair) ≤ stripe_loss_prob.
- MTTR = 4 hours (aggressive placement + repair).
Simplified model: during repair window (4 h = 1/2190 year), per-disk failure prob ≈ 0.01 / 2190 ≈ 4.6e-6. P(≥4 more fail in that window | 1 already lost) ≈ C(13,4) × (4.6e-6)^4 = 715 × 4.5e-22 ≈ 3.2e-19 per stripe per year.
Total stripes ≈ 1.4 EB / (14 × 8 MB) = 1.25e10 stripes. P(any stripe loses ≥5 shards) ≈ 1.25e10 × 3.2e-19 ≈ 4e-9 per year → ~9 nines.
To hit 11 nines we need a shorter MTTR or more parity (RS(12,4) still tolerates 4 losses, but spreads a stripe over 16 shards, so the repair-window math gets tighter). Real systems get the extra 2 nines via:
- Geo-redundancy for archive (adds effectively ~3 nines once independent-region failure is factored in).
- Shorter MTTR (sub-hour via aggressive repair) — in this model the stripe-loss probability scales as MTTR^4, so halving MTTR buys roughly 16× durability.
- Diversity of failure modes beyond independent AFR (firmware, power, cooling) — hurts, not helps; Backblaze data shows real-world correlated failures.
The honest L7 admission: 11 nines is a marketing number. Real per-object annual loss on S3 is empirically much smaller than the model predicts because of geo-replication and software redundancy layers; the model above gives 9 nines for a single-region RS(10,4) warm tier, which is why production systems layer cross-region async replication and journaled writes on top.
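The same simplified model in executable form (leading-term approximation, independent failures, union bound across stripes; the probability of the initial failure is not multiplied in, exactly as in the prose), so the sensitivity to AFR, MTTR, and parity is easy to poke at.

```python
from math import comb, log10

AFR         = 0.01                 # annual failure rate per disk
MTTR_H      = 4                    # repair window, hours
K, M        = 10, 4                # RS(10,4): data lost at >= M+1 concurrent losses
SHARD_BYTES = 8e6
PHYSICAL    = 1.4e18               # 1.4 EB physical

p_window   = AFR * MTTR_H / (365 * 24)        # per-disk failure prob during the window
# Given one shard already lost, probability that >= M more of the remaining
# K+M-1 disks fail inside the repair window (leading term only).
p_stripe   = comb(K + M - 1, M) * p_window ** M
stripes    = PHYSICAL / ((K + M) * SHARD_BYTES)
p_any_loss = stripes * p_stripe               # union bound, per year

print(f"p_window   = {p_window:.2e}")         # ~4.6e-6
print(f"p_stripe   = {p_stripe:.2e}")         # ~3.2e-19
print(f"stripes    = {stripes:.2e}")          # ~1.25e10
print(f"p_any_loss = {p_any_loss:.2e}")       # ~4e-9 per year
print(f"nines      = {-log10(p_any_loss):.1f}")   # ~8-9 nines, single region
```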
7.2 Metadata Scalability — why GFS's single master had to die
Why critical. Metadata is the serializer of the whole system. A slow metadata service makes the fastest chunkservers useless. GFS master hit a wall at 50 PB / ~100 M files / ~10 K QPS. Colossus's first job was to kill that master.
Alternatives (quantified)
| Design | Namespace scale | Metadata QPS | LIST perf | Ops complexity | Systems |
|---|---|---|---|---|---|
| Single master (RAM) | ≤50 PB / ~100 M files | ~10 K QPS | O(1) tree walk | Simple, single-pane | GFS v1, HDFS NN v1 |
| Federation (namespace slicing) | linear w/ slices | +N× | breaks cross-slice rename | medium | HDFS Federation |
| BigTable-backed (Colossus curators) | effectively unbounded | ~1 M+ QPS | O(log) + range scan | complex | Colossus, BigTable itself |
| Sharded transactional KV (Spanner dirs) | unbounded | ~10 M QPS | range scan with ordered shards | complex | F1, CockroachDB-style |
| DHT / consistent hashing (Dynamo) | unbounded | ~10 M QPS | no prefix scan — deal-breaker for S3 LIST | simple | Dynamo, Cassandra, Swift rings |
Why DHT-style metadata is WRONG for S3-like
- LIST `?prefix=foo/bar/` requires finding all keys in the range `[foo/bar/, foo/bar0)`. A hash-based placement scatters that range across every shard, so a LIST becomes a scatter-gather over every metadata node, which:
  - makes LIST latency scale with shard count;
  - makes LIST cost scale with shard count, not result size;
  - makes LIST fragile — one slow shard stalls the whole result.
- Dynamo/Swift sidestep this by not doing prefix LIST natively (Swift does flat container listings but has scaling issues there). This is fine for KV blobs; it's fatal for S3 API compatibility.
My chosen approach
- Ordered sharded KV for metadata (Spanner-like directory layer), `(bucket_id, key)` lex-ordered.
- 256 shards per region, each a 5-member Paxos/Raft group.
- Shard split when a shard grows past ~100 GB of metadata or 50 K QPS.
- Shard re-mapping handled by a directory service; frontend caches shard→range mapping with TTL + ETag.
Bucket-hotspot mitigation (earned-secret)
- A tenant uploads 1 M objects/hour to the same prefix (e.g., `/logs/YYYYMMDD/HHMM/`). The metadata shard owning that range gets hammered.
- Fix: auto-split on load, not just on size. Each shard tracks p99 write latency; if it exceeds the SLA for 5 min, propose a split at the range midpoint.
- Client-side hint: SDKs recommend adding a hash prefix to keys (`/logs/ab12_20260418...`) — the classic S3 partition-key trick (since 2018 S3 has removed the need, but underneath it's exactly this).
Metadata-vs-data consistency (the "metadata consistency" part of the brief): how do we keep metadata consistent with the data it points to?
The commit protocol:
1. Frontend slices the body and assigns chunk IDs (content-addressable).
2. Frontend writes chunks to chunkservers — chunks are uncommitted, linked only in a transient upload record.
3. Chunkservers fsync and return a placement receipt with a generation number.
4. Frontend calls metadata `commit_object` with the manifest + receipts.
5. Metadata Paxos commits the row. At this moment the object exists.
6. If step 4 or 5 fails, the chunks are orphaned. GC reaps them after a TTL (default 7 days). No lost data; just space reclaimed.
Critical property: a client never sees an object until metadata commit. If metadata commits but frontend crashes before returning 200, client retries; retry writes same content-addressed chunks (idempotent) and re-commits — metadata sees same version_id and returns OK (idempotent commit). This is why version_id is derived from a hash of the content + timestamp, not a server-generated serial that would differ on retry.
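A toy in-memory version of this commit protocol (all names hypothetical) makes both properties concrete: nothing is visible before the metadata commit, and a retry after a crash is idempotent because chunk IDs and version IDs derive from content.

```python
import hashlib

chunkstore: dict[bytes, bytes] = {}     # chunk_id -> bytes (orphaned until referenced)
metadata:   dict[tuple, dict]  = {}     # (bucket, key) -> committed row

def put(bucket: str, key: str, body: bytes, ts: int, crash_before_commit: bool = False):
    sha = hashlib.sha256(body).digest()
    manifest = []
    for off in range(0, len(body), 8):                 # tiny "chunks" for the demo
        chunk = body[off:off + 8]
        cid = hashlib.sha256(chunk).digest()[:16]      # content-addressable chunk ID
        chunkstore[cid] = chunk                        # step 2: data written first
        manifest.append(cid)
    if crash_before_commit:
        return None                                    # orphaned chunks; GC reaps later
    version = hashlib.sha256(sha + ts.to_bytes(8, "big")).digest()[:8]
    metadata[(bucket, key)] = {"version": version, "manifest": manifest, "sha256": sha}
    return version                                     # step 5: object now visible

def gc():
    """Reap chunks not referenced by any committed manifest (after a TTL in production)."""
    live = {cid for row in metadata.values() for cid in row["manifest"]}
    for cid in list(chunkstore):
        if cid not in live:
            del chunkstore[cid]

# Crash before commit: nothing visible, chunks orphaned, GC cleans up.
assert put("b", "k", b"hello object store", ts=1, crash_before_commit=True) is None
assert ("b", "k") not in metadata
gc(); assert not chunkstore

# Retry rewrites the same content-addressed chunks and re-commits the same version.
v1 = put("b", "k", b"hello object store", ts=1)
v2 = put("b", "k", b"hello object store", ts=1)        # idempotent re-commit
assert v1 == v2
```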
Handling torn writes / partial failures
- A chunk write fails mid-stripe: 12 of 14 shards acked, 2 didn't. Options:
- Abort: tell client 503, GC the 12 acked shards (cheap — they're orphaned).
- Commit with degraded flag, repair in background. Dangerous if further failure before repair; don't.
- Decision: always abort-and-retry. Simpler, and matches S3 behavior.
Dealing with the metadata service itself going down
- Write availability: 99.99% via Paxos across 5 replicas in 3 zones (tolerates 2 losses).
- Read availability: local-follower reads OK for immutable keys (post-commit metadata doesn't mutate except for lifecycle/delete).
- Metadata partitioning (network partition inside one region): minority-side rejects writes (503) but can still serve reads with "may be stale" hint. Matches CP side of CAP.
7.3 Durable writes & end-to-end integrity (how we actually hit 11 nines in practice)
Why critical. Silent data corruption is the #1 cause of real-world "impossible" data loss on large clusters. Backblaze reports ~1 bit error per 10^14–10^15 bits read on consumer disks — at EB-scale, you will see silent corruption daily. No amount of replication fixes corruption you didn't detect.
The stack of integrity checks
- Client-provided content-MD5 or SHA256 in PUT header. Server re-hashes on receipt; mismatch → 400 BadDigest. Detects in-flight corruption from client to edge.
- Per-shard CRC32C computed at encode time, stored alongside shard. Verified on every disk read.
- Per-page CRC32C at 64 KB granularity inside the chunkfile. Verified on every read; bad page → reconstruct from parity.
- Per-object SHA-256 in metadata, end-to-end. After read + reassemble + decrypt, SHA-256 is recomputed; mismatch → pull alternate parity, log corruption, fail fast.
- Scrub walker — background process re-reads every shard every N days (typically 14 days on warm, 30 days on cold), verifies all CRCs. On corruption: mark shard dead, trigger repair.
- End-of-disk-life RMA triggered by SMART + CRC error rate. Disk is "drained" before it hits critical.
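A sketch of the per-page checksum layer (items 2-3 above); production uses hardware CRC32C, so zlib's CRC-32 here is a stand-in, not the real polynomial.

```python
import zlib

PAGE = 64 * 1024   # 64 KB pages inside a chunkfile

def page_crcs(shard: bytes) -> list[int]:
    """Computed at write time and stored alongside the shard."""
    return [zlib.crc32(shard[o:o + PAGE]) for o in range(0, len(shard), PAGE)]

def verify(shard: bytes, stored: list[int]) -> list[int]:
    """Return page indexes that fail CRC; each bad page is re-read from a replica
    (hot tier) or reconstructed from parity (EC tiers)."""
    return [i for i, o in enumerate(range(0, len(shard), PAGE))
            if zlib.crc32(shard[o:o + PAGE]) != stored[i]]

data = bytes(range(256)) * 1024                          # 256 KB test shard -> 4 pages
crcs = page_crcs(data)
corrupted = bytearray(data); corrupted[70_000] ^= 0x01   # flip one bit in page 1
assert verify(bytes(corrupted), crcs) == [1]
```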
Why not just TCP checksum?
- The TCP checksum is only 16 bits (a ones'-complement sum, not a CRC): a corrupted segment has roughly a 1-in-65,536 chance of passing anyway, and some error patterns pass far more often; and once data is memcpy'd in RAM, TCP doesn't help. [Stone & Partridge 2000] "When the CRC and TCP Checksum Disagree."
- Disk firmware checksums miss writes that went to wrong sectors ("phantom writes"). Only end-to-end checksums catch torn writes / misdirected writes.
Quorum writes vs synchronous all (the interview sleight-of-hand)
You'll see answers say "quorum write: W=2, R=2, N=3, consistency guaranteed." That's Dynamo, and it's subtly wrong for object storage:
- With W=2 of 3, the 3rd replica is eventually consistent. Read-your-writes requires R+W > N → R=2, W=2, N=3 → R+W=4 > 3. OK.
- But: if W=2 and the 2 that acked are in the same rack, a rack failure loses the write. You must enforce zone-diversity at write time, not just placement policy.
- S3 went strong-consistent in Dec 2020 by writing to an odd number with majority and a consistent cache on metadata. Details not public, but almost certainly: write W=3 of 3 (all) synchronously for metadata, and for data, multi-AZ quorum with repair.
My model: metadata = synchronous Paxos commit (W=3 of 5); data chunks = all-acks (W=14 of 14 for RS(10,4)) at write time with hedging. Why all-acks:
- Only 14 shards per stripe; "quorum of 14" is silly.
- The latency cost of waiting for the last 2 is small if tail is well-managed.
- If 1–2 shards are persistently slow, PD reassigns before the write (health-aware placement), rather than committing with missing shards. A stripe committed with missing shards is a stripe you have to reconstruct on first read — bad latency UX.
Handling bit rot across the tiers
- Hot: 3× replicated, scrubbed weekly, CRC mismatch → copy from another replica.
- Warm: RS(10,4), scrubbed bi-weekly, CRC mismatch → reconstruct from parity.
- Cold: RS(12,4) or LRC, scrubbed monthly, shards on SMR HDDs (read-mostly), corruption rate higher → more frequent scrub budget needed.
- Archive (tape): scrubbed yearly, 2-copy geo.
Production naming Colossus D papers mention the "end-to-end CRC composition" pattern: block CRC composes to chunk CRC composes to object CRC, so you can verify a range read's integrity without re-reading the whole object. ZFS does this with Merkle trees (Bonwick). HDFS does it with per-block checksum files.
8 Failure Modes & Resilience #
Per-component, detection → blast radius → mitigation → recovery.
8.1 Single disk failure
- Detection: SMART + read error rate + missed heartbeat from the disk scrubber. Node marks the disk `DEAD` within 60 s.
- Blast: one shard per stripe, ~5 M stripes affected (20 TB / 4 MB avg shard = 5 M).
- Mitigation: stripes are still live (RS(10,4) tolerates 4 losses). No user-facing error.
- Recovery: repair queued immediately, SLA 4 h. During repair, read path uses reconstruction for affected stripes (adds ~50 ms). Alert only if >5 disks fail simultaneously (then rack/node event).
8.2 Node failure
- Detection: PD heartbeat miss for 3 intervals (15 s). Node marked `DOWN`.
- Blast: 60 disks × ~5 M chunks = 300 M chunks. Each stripe loses at most 1 shard (since placement MUST ensure ≤1 shard per node).
- Mitigation: same as 8.1 ×60. Large repair job (1.2 PB) — spread over 24–48 h if QoS allows.
- Recovery: if node comes back within grace period (1 h), re-attach and cancel repair for not-yet-started work. If not, repair to completion; drain the node when it returns.
- Correlated failure: if 5 nodes in one rack die (power outage), the invariant "≤1 shard/rack" means stripes lose ≤1 shard — still safe. But placement must actually enforce it.
8.3 Rack failure
- Detection: correlated heartbeat loss in one rack-id cluster.
- Blast: up to 1800 disks (~9 B chunks), but ≤1 shard lost per stripe under the invariant.
- Mitigation: no user-facing error if invariant holds. If placement ever violated invariant (bug), now you see correlated stripe damage.
- Recovery: power-on or draining. Repair-to-completion over 72 h via separate QoS lane (<10% of cluster bandwidth to avoid user-traffic SLO burn).
8.4 Zone failure
- Detection: regional health service (not the same zone's PD — chicken/egg) detects zone-wide packet loss.
- Blast: ~25% of fleet in a 4-zone region = 500 nodes, 30 K disks. Under RS(10,4) with ≤4 shards/zone, each stripe loses ≤4 shards → still readable but at bare minimum, no further loss tolerance.
- Mitigation: IMMEDIATELY enter "degraded" mode: halt repair traffic (it amplifies damage if we mis-place), halt placement churn, serve reads only via reconstruction for affected stripes, block writes to affected zone.
- Recovery: after zone returns, delta-repair (re-verify shards rather than re-write). If zone permanently lost, 30+ day repair job to re-hydrate to 14/14.
8.5 Metadata partition
- Detection: Raft leader lease fails to renew; followers detect.
- Blast: the affected metadata shard (one 256th of the keyspace) is unavailable for writes; reads can continue with stale data for up to `stale_read_bound`.
- Mitigation: client SDK sees 503 on that key-range and retries with backoff. Most buckets span multiple shards (hashed), so a single-shard outage only affects ~0.4% of keys.
- Recovery: new leader elected within 5 s (typical Raft). If minority permanently isolated, majority continues.
8.6 Silent corruption
- Detection: CRC mismatch on read OR during scrub.
- Blast: single shard per event.
- Mitigation: read path falls back to parity reconstruction, logs event, pulls from an alternate shard.
- Recovery: shard marked dead; repair replaces it. If SHA-256 mismatch at object level (meaning multiple shards corrupted in a correlated way), this is an incident — page the SRE, freeze the object's bucket, investigate (firmware bug? cosmic ray? malicious?). Correlated silent corruption has happened in production (Amazon S3 2008 hash-bug, Netflix 2019 memory bit-flip).
8.7 Thundering herd after outage
- Scenario: a zone comes back after 30-min outage; 25% of client traffic re-targets; cold caches; retry-amplified QPS can be 3–10×.
- Detection: QPS spike monitors at LB and at metadata.
- Mitigation:
- Global rate-limit at LB: token bucket per tenant, 2× steady-state allowed for 60 s then drop to SLA.
- Coalesce in-flight reads: multiple GETs for the same key within 50 ms are served from one upstream read (request collapsing / "singleflight"); no amplification. See the sketch after this list.
- Backoff with jitter enforced in official SDK. Clients without SDK: 503 + Retry-After tells them to wait.
- Graceful admit: LB sheds 10% of load rather than let the fleet tip into CPU saturation. Matches Netflix "adaptive concurrency limits" (Little's law in the LB).
- Recovery: rate-limit relaxes when QPS returns to steady-state for 5 min.
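A minimal singleflight sketch for the coalescing mitigation above, assuming an asyncio frontend; followers simply await the leader's in-flight future, so the 50 ms window falls out of the upstream read latency.

```python
import asyncio

class SingleFlight:
    """Coalesce concurrent fetches of the same key into one upstream call."""
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch):
        if key in self._inflight:                       # follower: piggyback
            return await asyncio.shield(self._inflight[key])
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fetch(key)                   # leader: the one upstream read
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)
            raise
        finally:
            del self._inflight[key]

async def demo():
    calls = 0
    async def fetch(key):
        nonlocal calls; calls += 1
        await asyncio.sleep(0.05)                       # simulated origin read
        return f"body-of-{key}"
    sf = SingleFlight()
    bodies = await asyncio.gather(*[sf.do("bucket/key", fetch) for _ in range(1000)])
    assert calls == 1 and len(set(bodies)) == 1         # 1000 GETs, one upstream read

asyncio.run(demo())
```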
8.8 Correlated cross-zone failure (the "bad deploy" scenario)
- Scenario: a rolling push of a new chunkserver binary has a bug that crashes on a specific byte pattern. Deploy hits 3 of 4 zones before detection.
- Detection: elevated crash rate across multiple zones within 2 min.
- Mitigation: deploy system auto-rollback on health-signal regression (Canary + Argos-style). Staggered deploy per zone with 24 h hold makes zone-wide correlated failures from code nearly impossible.
- Recovery: roll back binary, nodes self-recover as they restart.
9 Evolution Path #
v1 — Single-region, GFS-like, 3× replication
- Single metadata master (active + passive, log-shipped failover), 3× rep everywhere.
- Targets: 10 PB, 1 B objects, 100 K QPS.
- Timeline: 6 months to MVP. Known to break at ~50 PB / master exhaustion.
- Operationally: simple, single pane, 3-person SRE team can carry it.
v2 — Sharded metadata + EC for cold (Colossus-like)
- 256 metadata shards (Paxos), hot-tier 3× rep, warm/cold RS(10,4).
- Background tiering mover converts warm→cold.
- LRC(12,2,2) experimented for coldest tier.
- Add: lifecycle rules, versioning, multipart.
- Scale targets: 1 EB, 10 B objects, 10 M QPS. (This is the design spec above.)
- Key migration: metadata migrated via online dual-write + shadow-read; data tier converted stripe-by-stripe.
v3 — Multi-region, globally-eventually-consistent
- Per-region v2 stacks, each with its own metadata/data plane.
- Cross-region replication service: async log-shipped (change data capture from Raft log) with p95 <5 s end-to-end.
- Conflict resolution: last-writer-wins by Hybrid Logical Clock timestamp; for PUT-PUT same key, arbitrary resolution with notification to tenant.
- Global table per bucket: uses Spanner-like TrueTime or Calvin-style deterministic order for the tiny subset of "global-strong" buckets (at latency cost).
- Geo tiering: hot tier in the region of origin, cold tier shipped to a lower-cost region for ~40% cost reduction.
- Disaster recovery: every region has 2 partner regions; losing 1 region degrades read latency by ~50 ms, losing 2 regions simultaneously is pager-worthy but RPO=seconds.
v4 — (out of scope, FYI)
- Compute-near-storage (S3 Select, SELECT-over-object).
- Post-quantum-ready KMS.
- Storage-class-memory tier for ultra-hot (Optane-like; though Intel killed Optane, similar class tech persists).
10 Out-of-1-Hour Notes #
10.1 Consistent hashing vs centralized placement
- Consistent hashing (Dynamo): no placement driver, every node knows the ring, O(log N) rebalance on membership change. Simple, but:
- Rack/zone diversity is hard — the ring doesn't know topology. Dynamo uses "virtual nodes + preference lists," still suboptimal under correlated failure.
- Small objects pay full per-request hashing cost; for EC stripes you need to pick k+m locations with constraints, which Ring can't natively do.
- Result: Dynamo-style is great for equal-size, no-topology-constraint payloads (sessions, shopping carts). Not great for EC-first object storage.
- Centralized PD: can optimize for topology, capacity, heat, and firmware diversity. Cost is complexity and that PD can be a bottleneck (mitigate: cache in frontends, PD only authoritative for changes).
- Real-world: Colossus uses PD. Ceph uses CRUSH which is a clever middle ground — pseudo-random but topology-aware, so clients can compute placement from a topology map without a central oracle.
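A toy version of the PD's constrained pick versus a bare hash ring: greedily choose 14 disks while enforcing ≤1 shard per rack and ≤4 shards per zone. A real PD also weights by free space, heat, and firmware diversity, and CRUSH computes an equivalent choice pseudo-randomly from the topology map; this is only a sketch of the constraint satisfaction.

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Disk:
    disk_id: int
    rack_id: int
    zone_id: int

def pick_placement(disks: list[Disk], n_shards: int = 14,
                   max_per_rack: int = 1, max_per_zone: int = 4) -> list[Disk]:
    """Greedy constrained placement for one EC stripe (toy version)."""
    racks, zones, chosen = Counter(), Counter(), []
    for d in random.sample(disks, len(disks)):          # shuffled scan for spread
        if racks[d.rack_id] < max_per_rack and zones[d.zone_id] < max_per_zone:
            chosen.append(d)
            racks[d.rack_id] += 1
            zones[d.zone_id] += 1
            if len(chosen) == n_shards:
                return chosen
    raise RuntimeError("not enough failure-domain diversity for this stripe")

# 4 zones x 16 racks, as in the §3 layout (disk counts shrunk for the demo).
fleet = [Disk(disk_id=i, rack_id=i % 64, zone_id=(i % 64) // 16) for i in range(4096)]
stripe = pick_placement(fleet)
assert max(Counter(d.rack_id for d in stripe).values()) == 1
assert max(Counter(d.zone_id for d in stripe).values()) <= 4
```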
10.2 Disk formats
- Ext4 fails at billions of files (inode table bloat). Don't do 1 chunk = 1 file.
- XFS is better but same tail. Packing into chunkfiles is mandatory.
- Raw block + in-house format (what Colossus D does): directly manage LBAs, no filesystem overhead. Fastest, most complex.
- SMR (shingled) HDDs: 15–20% density improvement, but sequential-write-only. Perfect for cold tier append-only data. EC bundles rewritten rarely → SMR is a natural fit.
- NVMe SSDs: for hot tier only; $/GB still ~20× HDD in 2026.
10.3 Write-through CDN vs origin reads
- CDN at edge (CloudFront / Fastly / Google-CDN): caches public GETs; 10× reduction in origin QPS and ~50ms latency improvement globally. Invalidation on PUT via purge API or short TTL.
- Origin direct for private objects, pre-signed URLs, large streaming content that doesn't fit edge caches.
- Write-through CDN: almost never. Writes are rare enough that CDN-layer caching complicates consistency (CDN caches stale → reader sees old object) without bandwidth savings.
- Design choice: serve public bucket policy via CDN; everything else direct to frontend.
10.4 Encryption at rest
- Per-object DEK, randomly generated 256-bit AES key.
- DEK encrypts object bytes in AES-GCM (authenticated encryption catches tampering in addition to CRC).
- DEK is wrapped with per-tenant KEK stored in KMS.
- KEK itself wrapped by a root key in HSM, rotated yearly.
- Rotation: DEK rotation requires re-encrypt (expensive, usually lazy on next overwrite). KEK rotation is cheap — only re-wrap existing DEKs, re-key via KMS API.
- Key deletion = effective object deletion for compliance — destroy KEK, DEKs unwrappable, objects unrecoverable. GDPR right-to-delete friendly (but see §10.7 for EC + deletion headache).
- BYOK (customer-managed keys): tenant holds KEK in their own KMS; our system calls their KMS to unwrap DEK on every access. High-value enterprise feature.
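A sketch of the DEK/KEK envelope pattern above using AES-GCM from the `cryptography` package; the in-process "KMS" is a stand-in, and key storage, rotation, and the HSM root are elided.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Stand-in "KMS": one KEK per tenant, held server-side (really an HSM-rooted KMS).
tenant_kek = AESGCM.generate_key(bit_length=256)

def encrypt_object(plaintext: bytes, object_key: str) -> dict:
    dek = AESGCM.generate_key(bit_length=256)             # per-object DEK
    nonce = os.urandom(12)
    ct = AESGCM(dek).encrypt(nonce, plaintext, object_key.encode())   # AAD binds the key
    # Wrap the DEK under the tenant KEK; only the wrapped form lands in metadata.
    wrap_nonce = os.urandom(12)
    wrapped_dek = AESGCM(tenant_kek).encrypt(wrap_nonce, dek, b"dek-wrap")
    return {"ciphertext": ct, "nonce": nonce,
            "wrapped_dek": wrapped_dek, "wrap_nonce": wrap_nonce}

def decrypt_object(rec: dict, object_key: str) -> bytes:
    dek = AESGCM(tenant_kek).decrypt(rec["wrap_nonce"], rec["wrapped_dek"], b"dek-wrap")
    return AESGCM(dek).decrypt(rec["nonce"], rec["ciphertext"], object_key.encode())

rec = encrypt_object(b"object bytes", "b/bucket/key")
assert decrypt_object(rec, "b/bucket/key") == b"object bytes"
# Crypto-shredding: destroy the KEK and every wrapped DEK (hence every object)
# under it becomes unrecoverable -- see §10.7.
```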
10.5 Multi-tenant isolation & noisy neighbor
- Per-tenant token bucket at LB — capped QPS and bandwidth.
- Per-tenant shard affinity: huge tenants pinned to a subset of metadata shards so their LIST bursts don't harm others.
- Priority classes on chunkserver disk scheduler: P0 (paid), P1 (standard), P2 (best-effort batch). Colossus D has per-request priority with weighted fair queuing.
- CPU/mem isolation: containerized chunkserver and metadata workers; noisy tenant can't starve sibling containers.
- IOPS SLO per tenant — S3 historically under-exposed this; GCP's Anywhere Cache and Azure's Premium tiers now expose guaranteed IOPS.
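A sketch of the per-tenant token bucket from the first bullet (one bucket per tenant per dimension, e.g., QPS and bytes/s); the rate and burst values are illustrative.

```python
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst       # tokens/sec refill, bucket depth
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                              # caller returns 503 SlowDown + Retry-After

# One bucket per tenant; illustrative limits: 5k QPS sustained, 2x burst.
qps_limits = defaultdict(lambda: TokenBucket(rate=5_000, burst=10_000))

def admit(tenant_id: str, request_cost: float = 1.0) -> bool:
    return qps_limits[tenant_id].allow(request_cost)
```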
10.6 Cost per GB-month modeling
- Raw HDD: ~$15/TB (2026 enterprise spot price, amortized). At 1.4× EC overhead, effective $21/TB-physical-for-1TB-logical.
- Chassis + rack + power + cooling: ~1.5–2× disk cost over 5 years.
- Networking amortized: ~$2/TB/year.
- Software/SRE/amortized R&D: ~$3/TB/year.
- Total warm tier: ~$8/TB/year = $0.67/TB-month = $0.00067/GB-month cost.
- S3 Standard charges ~$0.023/GB-month — ~34× markup, funds availability, API, profit, and that the public price is a "retail" price with implied headroom.
- Tape tier (Glacier Deep Archive): LTO-9 is ~$4/TB, robotic library + retrievals ~$2/TB/year effective; ~$0.00017/GB-month cost, retail $0.00099.
- Erasure coding saves ~53% of disk cost vs 3× replication (1.4× vs 3.0× overhead); at 1 EB scale that's $150M+/region over 5 years. This is the reason EC exists.
10.7 Regulatory (GDPR right-to-delete meets EC)
- "Delete" on object-level = metadata tombstone + eventual GC of chunks.
- GC of EC-shards: all 14 shards of each affected stripe must be re-encoded (since a stripe may contain bytes from multiple objects if packed). For packed bundles, deletion = mark bytes dead; compaction rewrites bundle excluding dead bytes (like LSM-tree compaction). Until compaction runs, deleted bytes remain recoverable in theory.
- Compliance SLA: GDPR requires "without undue delay," typically ≤30 days. Lifecycle scheduler ensures compaction runs within SLA on delete-heavy buckets.
- Crypto-shredding as an alternative: delete the per-object DEK from KMS. Bytes remain but unreadable. Acceptable for most regulators; cryptographic one-way erasure in practice. Cheaper than re-compacting EC bundles.
- Data sovereignty: bucket pinned to a specific region/country; no cross-region replication without tenant opt-in. Audit log proves location.
10.8 Observability (tail latency and how you find disk rot)
- RED metrics: rate, error, duration per endpoint × tenant × region.
- Histogram quantiles — not averages. p50/p90/p99/p99.9 per GET/PUT path. Tail is where SRE lives.
- Disk-level wearout dashboard: SMART attributes, CRC error rates, reallocated sectors, disk-model groups (for correlated failure prediction).
- Repair queue depth & bandwidth consumed — first-class SLO. Alert when EC reconstruct fanout per user read exceeds threshold (= stripes are under-healthy).
- Metadata shard fan-out for LIST operations — detects hotspot buckets.
- Per-tenant top-N bandwidth / IOPS dashboards — blame game when a tenant noisy-neighbors.
- Chaos engineering: regular random disk/node/rack kills; enforces that MTTR SLO is real, not aspirational.
- Tracing: open-telemetry span from client → LB → FE → metadata → chunkserver for every sampled 1:10K request. Essential for debugging tail latency ("why did this particular PUT take 3s?" → span shows metadata Paxos leader election during that window).
10.9 Capacity planning out-of-band
- Growth forecast: 60%/year logical data growth per region (historically matched for hyperscalers).
- Disk procurement lead time: 26 weeks in 2026 post-AI-GPU-supply-crunch side-effects. Plan 9 months ahead.
- Power ceiling in the DC is the real bottleneck, not floor space: a rack of 1800 disks draws ~15 kW. A 10 MW DC caps at ~670 racks — aligns with our 64-rack regional spec giving 10x growth headroom before needing a new DC.
10.10 What I did not have time to build but would
- Cross-tenant dedup (cryptographic — convergent encryption) — saves 10–30% on CI / backup workloads. Privacy-sensitive; needs careful design.
- Event notifications (bucket → Lambda/PubSub) for PUT/DELETE. Reliable delivery requires a durable outbox pattern on metadata commit.
- Object Lock / WORM compliance mode — immutable retention for SEC/FINRA compliance.
- Bucket inventory reports — S3 exports CSV/Parquet inventory to a destination bucket daily. Enables fast LIST for customers.
- Intelligent tiering — ML-predicted access pattern per object, auto-moves between hot/warm.
Verification vs quality bar #
- SRE pager-carryable? Yes — §8 per-component failure-modes with detection/blast/mitigation/recovery. Added "staggered deploy" and thundering-herd mitigations (§8.7, 8.8).
- Every arrow maps to a real API (§4) or schema (§5)? Yes — arrows labeled with gRPC/HTTPS map to `commit_object`, `reserve_version`, `WRITE_SHARD`, `GET_CHUNK`, health-ping.
- Deep-dive L7 vs L6? L7 check: (a) quantified durability derivation from AFR+MTTR, (b) RS small-object penalty + bundle/Haystack rescue, (c) metadata DHT-vs-KV prefix-LIST trade-off, (d) named post-2020 S3 consistency change with why it's hard, (e) CRC vs SHA-256 vs TCP-checksum tower with failure-mode citation.
- Durability math derived from AFR + k/n + MTTR? §7.1 — single-region 9 nines derived; acknowledge 11 nines requires geo-redundancy, not a single-region physical guarantee. This is the honest L7 answer; L6 would assert 11 nines and move on.
- Rejected alternatives for every major choice with quantified trade-offs? Yes — object vs block vs file (§1), DHT vs sharded KV (§7.2), 3× vs RS vs LRC (§7.1), single master vs sharded (§7.2), CDN write-through vs not (§10.3).
- Real systems named? GFS, Colossus (incl. D, curators), HDFS, Ceph (RADOS, RGW, CRUSH), S3, Dynamo, Swift, Haystack, f4, Azure Blob (LRC), Spanner, FoundationDB, BigTable, Glacier, Backblaze (AFR source), Stone/Partridge TCP checksum paper, Dean/Barroso tail-at-scale.
Interview delivery order (60 min) #
- 0–5 min: §1 restatement, clarifying questions, pick object store.
- 5–10 min: §2 FR + §3 NFR and core BOE numbers (1 EB, 70K disks, 256 shards).
- 10–20 min: §4 API + §5 Schema.
- 20–30 min: §6 ASCII diagram (draw live), walk write + read paths.
- 30–50 min: §7 one or two deep-dives, interviewer-driven (usually §7.1 EC or §7.2 metadata).
- 50–60 min: §8 failure modes, §9 evolution path.
- Skip §10 in interview unless asked; use for follow-up.