
Design a Distributed Storage System (Object Store)

Choose block, file, or object storage, then justify durability, placement, repair, and metadata consistency.

Consistency · Partitioning · Availability · Cost / efficiency · Observability

1 Problem Restatement & Clarifying Questions

Restatement

Design a planet-scale distributed object store (S3 / GCS / Azure Blob / Ceph RADOSGW / f4-class). Durability, strong-enough consistency, sharded metadata, background repair. Pager-carryable for an SRE on-call.

Why object store (not block, not file)?

I pick object store and will justify by ruling out the alternatives:

| Target | Access pattern | Why wrong for this brief |
| --- | --- | --- |
| Block (EBS, Ceph RBD, SAN) | 4 KB random R/W, attach-to-one-VM, POSIX-ish via layers above | Tight latency (<1 ms) and IOPS-scheduled LUN-per-host; doesn't fit "EB-scale multi-tenant key addressing." Block stores top out at ~PB per cluster because you're selling a volume abstraction, not a namespace. |
| File (GFS, HDFS, Lustre, Isilon, Colossus) | POSIX-ish: directories, rename, hardlinks, mtime, append, chown | POSIX rename is a directory transaction. Directory-op scaling is where the GFS single master hit 50 PB / ~100 M files; Colossus and HDFS Federation spent years fixing it. Not the interesting bottleneck to design against in 60 min. |
| Object (S3, GCS, Azure Blob, Swift, Ceph RGW, Haystack/f4) | Immutable-by-default blobs, flat key namespace, PUT/GET/DELETE/LIST by prefix, eventually billions of tenants | Matches the "EB-scale, immutable-write + big-read, cheap, durable" workload. Flat namespace kills the dir-tx bottleneck. Chosen. |

So: object store, S3-grade semantics, GCS/Colossus-grade internals.

Clarifying Questions (I would ask the interviewer; I'm answering them for self-study)

  1. Workload shape? Assume write-once-read-many-delete-rarely, skewed toward large objects (p50=512 KB, p95=64 MB, p99=1 GB, tail to 5 TB via multipart). This is the "real" distribution on S3/GCS — not uniform. Small-object tax is real (see §7.1).
  2. Scale? Target 1 EB logical stored, 10 B live objects, 10 M QPS GET, 1 M QPS PUT, 100 K QPS LIST. This is mid-tier regional (AWS has 400+ trillion objects; I'm sizing one region).
  3. Latency SLO? p50 GET <30 ms, p99 GET <200 ms for hot cache tier, <500 ms for cold. p50 PUT <80 ms, p99 <500 ms. Tail is where L7 lives.
  4. Durability SLO? 11 nines (99.999999999 % annual) — S3 / GCS / Azure standard. I'll derive it in §3 from AFR + EC k/n + MTTR.
  5. Availability SLO? 99.99% per region, 99.999% multi-region. One region is allowed to brown out; metadata is allowed to reject writes before corrupting.
  6. Consistency? S3 went strongly consistent for read-after-write in Dec 2020. I'll match the spirit: read-your-writes for new keys and overwrites; eventually-consistent LIST (a deliberate divergence — S3 made LIST strong too, but eventual LIST buys a simpler metadata layer). Multi-region: async, last-writer-wins with vector metadata.
  7. Multi-region? Yes, but v3. v1 is single-region. Avoid designing for the hardest case first.
  8. Multi-tenant? Yes, billions of buckets, millions of tenants. Noisy-neighbor protection required (§10.5).
  9. Encryption? At-rest AES-256 per-object DEK wrapped by tenant KEK in KMS; in-transit TLS. Out-of-scope for core 60 min, detailed in §10.4.

Out-of-scope (explicitly deferred)

Cross-object transactions; server-side rich query (SQL-on-blob); streaming mutation of an existing object mid-read; filesystem semantics (rename/hardlink); strong cross-region consistency; durable queue / pub-sub on top.


2 Functional Requirements

In-scope (numbered)

  1. PUT /b/{bucket}/{key} — body up to 5 GB single-shot, 5 TB via multipart; returns ETag (content-MD5 or content-SHA256).
  2. GET /b/{bucket}/{key} — full body, with If-Match/If-None-Match/If-Modified-Since and HTTP Range: for byte ranges.
  3. DELETE /b/{bucket}/{key} — soft delete with versioning; tombstone propagation.
  4. LIST /b/{bucket}?prefix=&marker=&max-keys=&delimiter=/ — prefix walk, pagination. Eventually consistent by design (more on this in §7.2).
  5. Multipart Upload: initiate → uploadPart × N (parallel) → complete (atomic manifest commit) → single object. Enables 100 GB+ uploads over flaky networks and parallel ingest up to ~10 Gbps per object.
  6. Range Read: Range: bytes=a-b — served from exact chunk, never read full object.
  7. ACL / IAM: bucket-level policy + per-object ACL; pre-signed URL (HMAC over canonical request) for delegated access with TTL.
  8. Versioning: per-bucket flag; DELETE creates delete-marker, previous version retained; PUT never overwrites in place.
  9. Lifecycle: per-bucket rules (transition to cold tier after N days, expire after M days).
  10. CopyObject (server-side copy): metadata-only when source/dest share storage class, deep-copy otherwise.

Out-of-scope (explicitly)

  • SELECT-over-object (S3 Select / BigQuery-external). Separate compute layer.
  • Atomic PUT-if-not-exists conditional on another key. Single-key conditional PUT is in-scope (If-Match on ETag); cross-key transactions are out.
  • Strong global consistency — out, see §9 v3.
  • Filesystem rename / directory metadata (would require namespace service).

3 Non-Functional Requirements + Capacity Back-of-Envelope

NFRs

| NFR | Target | Rationale |
| --- | --- | --- |
| Availability | 99.99% per region (4.38 min/mo error budget) | AWS S3 / GCS bucket-standard |
| Durability | 11 nines annual | Industry standard; derived below |
| Durability (archive tier) | 12 nines | Deep-archive / Glacier class |
| Latency, p50 GET | <30 ms (hot) | Within-DC round trip + 1 disk seek |
| Latency, p99 GET | <200 ms (hot), <500 ms (cold) | Budget for 1 reconstruct if a chunk is slow/dead |
| Latency, p50 PUT | <80 ms | Write + 2 replica acks (v1) or 10+4 EC acks (v2) |
| Latency, p99 PUT | <500 ms | Tail-tolerant, includes metadata commit |
| Consistency | Read-your-writes for PUT; eventual for LIST | Matches S3's Dec-2020 model |
| Throughput | 10 M GET QPS, 1 M PUT QPS per region | Mid-tier region |
| Cost | <$0.015/GB-month warm, <$0.001/GB-month cold | Within S3 Standard / Glacier envelope |

Capacity BOE (show the math)

Logical data

  • 1 EB = 10^18 B = 10^6 TB = 10^3 PB.
  • 10 B objects; mean size = 1 EB / 10 B = 100 MB (skewed; p50 is 512 KB but the tail dominates bytes).

Physical footprint — replication vs erasure coding

  • Pure 3× replication: 3 EB physical.
  • Reed-Solomon RS(10,4) (10 data + 4 parity per stripe): 1.4× overhead → 1.4 EB physical. This is the Facebook f4 scheme; Azure (via LRC) and GCS run similar-ratio codes for warm & cold.
  • Savings of 3.0 − 1.4 = 1.6 EB. At ~$20 /TB HDD amortized (disk + chassis + 3 yr) that's $32 M saved per region. Non-trivial; see §7.1.

Disk fleet

  • Assume 20 TB enterprise HDD (2026 vintage; Seagate Exos / WD Ultrastar are shipping 22–24 TB).
  • 1.4 EB / 20 TB = 70 000 disks just for user data.
  • Add overhead: 15 % free-space for compaction/rebalance, 5 % for metadata replicas, 10 % for transient multipart parts not yet committed, 10 % reserved for EC-reconstruction scratch = 40 % overhead → ~100 K HDDs.
  • Hot tier on NVMe SSD (3.84 TB): assume 5 % of bytes are hot → 50 PB / 3.84 TB = ~13 K SSDs.
  • Total: ~113 K storage drives in the region.

Rack / zone / region layout

  • 60 disks per node (JBOD chassis with HBAs); ~30 nodes per rack → 30 × 60 = 1800 disks/rack.
  • 100 K / 1800 ≈ 56 racks for HDD, say 64 racks total with headroom.
  • 4 zones per region × 16 racks/zone.
  • Placement invariant: no stripe has >1 shard in the same rack, and the zone spread must respect parity tolerance. The naive version — "spread the 14 shards across ≥2 zones" — is the classic bug: a single-zone outage can then take out up to 7 of 14 shards, and RS(10,4) only tolerates 4 losses per stripe.
    • Correct invariant: RS(10,4) needs ≥4 zones with ≤4 shards/zone (a 4-4-3-3 spread), so a one-zone loss costs ≤4 shards — exactly at tolerance, with zero margin. A 3-zone layout forces ≥5 shards into some zone (ceil(14/3) = 5 > 4), so RS(10,4) cannot survive a zone loss there; 3-zone regions must drop to RS(6,3) (3 shards/zone) or LRC. This is the single most common bug in interview EC answers — the checker sketch below makes it mechanical.
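
A minimal check for that invariant, assuming we know each shard's zone (function names are illustrative, not a real placement-driver API):

# Sketch: verify that losing any single zone leaves a stripe decodable,
# i.e. no zone holds more shards than the code's parity tolerance.
from collections import Counter
from math import ceil

def zone_loss_safe(shard_zones: list[str], parity: int) -> bool:
    """True iff the worst-case single-zone loss stays within parity."""
    return max(Counter(shard_zones).values()) <= parity

def min_zones_needed(total_shards: int, parity: int) -> int:
    """Fewest zones such that a balanced spread keeps <= parity shards/zone."""
    return ceil(total_shards / parity)

# RS(10,4): 14 shards, tolerates 4 losses.
print(min_zones_needed(14, 4))                                       # -> 4 zones
print(zone_loss_safe(["z1"]*4 + ["z2"]*4 + ["z3"]*3 + ["z4"]*3, 4))  # True
print(zone_loss_safe(["z1"]*5 + ["z2"]*5 + ["z3"]*4, 4))             # False (3 zones)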

Metadata footprint

  • 10 B objects × ~2 KB metadata/object (bucket_id, key, version, size, checksum, chunk manifest for ~5 chunks × 24 B, ACL id, timestamps, EC params) = 20 TB metadata.
  • 3× replicated + Paxos/Raft overhead → ~60 TB across metadata servers. Fits comfortably on sharded SSD KV.
  • Metadata op rate: every GET = 1 metadata read, every PUT = 1 metadata write, every LIST = range scan. 10 M GET + 1 M PUT + 100 K LIST = ~11 M read QPS, ~1 M write QPS against metadata.
    • At 50 K read QPS per Spanner-class shard → 220 shards for reads; at 10 K write QPS per shard → 100 shards for writes. Take max ≈ 256 metadata shards for headroom.

Network bandwidth

  • Read: 10 M QPS × 100 MB avg = 10^15 B/s = 1 PB/s egress. Obviously wrong at the mean; most GETs are p50=512 KB.
  • Recalibrate with weighted mean: log-normal around p50=512 KB with p99=1 GB tail. Weighted avg ~4 MB/GET. 10 M × 4 MB = 40 TB/s = 320 Tbps regional egress. Matches "big region scale" (AWS us-east-1 is rumored >1 Pbps aggregate).
  • Write: 1 M × 4 MB × 1.4 (EC overhead) = 5.6 TB/s = 45 Tbps regional ingest to disks.
  • Per-rack: 320 Tbps / 64 racks = 5 Tbps/rack — needs ~7× 800G (or ~13× 400G) uplinks. Consistent with a hyperscale spine.

QPS per chunkserver

  • 30 nodes/rack × 64 racks ≈ 1920 storage nodes. 10 M GET / 1920 = ~5 K GET QPS/node. At 60 disks × ~150 random IOPS = 9 K IOPS/node — disks are the bottleneck for small GETs, so hot data must live on SSDs or be cached.
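
The arithmetic above, collected into one runnable back-of-envelope script (every constant is an assumption already stated in this section):

# BOE for §3: fleet size, per-rack bandwidth, metadata shard count.
EB = 10**18
logical = 1 * EB                          # 1 EB logical bytes
physical = logical * 1.4                  # RS(10,4) overhead -> 1.4 EB

disks_user = physical / 20e12             # 20 TB HDDs -> 70_000 drives
disks_total = disks_user * 1.4            # +40% operational overhead -> ~98_000

egress = 10e6 * 4e6                       # 10 M GET/s x 4 MB weighted mean = 40 TB/s
ingress = 1e6 * 4e6 * 1.4                 # 1 M PUT/s x 4 MB x EC -> 5.6 TB/s
per_rack_tbps = egress * 8 / 1e12 / 64    # 320 Tbps over 64 racks = 5 Tbps/rack

shards = max((10e6 + 0.1e6) / 50e3,       # metadata reads at 50 K QPS/shard
             1e6 / 10e3)                  # metadata writes at 10 K QPS/shard
print(round(disks_user), round(disks_total), per_rack_tbps, round(shards))
# -> 70000 98000 5.0 202   (rounded up to 256 shards for headroom)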

4 High-Level API

REST/HTTP with S3-compatible signatures (SigV4 HMAC). Internal plane is gRPC + Protobuf.

Client-facing (external, HTTPS)

PUT  /b/{bucket}/{key}                          Content-Length, Content-MD5, x-amz-storage-class, x-amz-server-side-encryption
     → 200 { ETag, VersionId, x-amz-request-id, x-amz-id-2 }
     Errors: 400 InvalidRequest | 403 AccessDenied | 404 NoSuchBucket
           | 412 PreconditionFailed | 503 SlowDown (back-pressure, client retries with expo backoff + jitter)

GET  /b/{bucket}/{key}                          [Range: bytes=a-b] [If-Match: etag] [If-None-Match: etag]
     → 200/206 body, ETag, Content-Length, Last-Modified, x-amz-version-id
     Errors: 304 NotModified | 404 NoSuchKey | 416 RangeNotSatisfiable | 503 SlowDown

DELETE /b/{bucket}/{key}[?versionId=]
     → 204 (async GC; immediate tombstone in metadata)

LIST /b/{bucket}?prefix=&marker=&max-keys=1000&delimiter=/
     → 200 { Contents:[…], NextMarker, CommonPrefixes:[…], IsTruncated }
     Explicitly: eventually consistent, reflects metadata state at read time ±seconds.

# Multipart
POST   /b/{bucket}/{key}?uploads                 → { UploadId }
PUT    /b/{bucket}/{key}?partNumber=N&uploadId=  body → { ETag }
POST   /b/{bucket}/{key}?uploadId=               { Parts:[{N, ETag}] } → final ETag, atomic commit
DELETE /b/{bucket}/{key}?uploadId=               abort + GC parts

# Pre-signed URL
GET /b/{bucket}/{key}?X-Amz-Signature=&X-Amz-Expires=900&X-Amz-SignedHeaders=…
     Server verifies the HMAC with the bucket owner's access-key secret; the request executes with the bucket owner's IAM permissions.

Error semantics (important for back-pressure)

  • 503 SlowDown on overload — always with Retry-After hint and a token-bucket per caller. Client libraries back off exponentially (base 50 ms, factor 2, max 20 s, jittered).
  • 500 InternalError for unexpected; client retries only idempotent ops (GET, DELETE with version, PUT with Content-MD5 since it's idempotent-by-hash).
  • 429-class rate-limit at edge before requests touch the data plane, so a thundering herd doesn't fan out past LB (see §8.6).
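
A sketch of the SDK retry loop implied by the backoff parameters above (base 50 ms, factor 2, cap 20 s, full jitter, honoring Retry-After) — illustrative, not a shipped SDK:

# Client backoff policy: exponential, capped, full-jitter, Retry-After-aware.
import random, time

def backoff_delays(base=0.05, factor=2.0, cap=20.0, attempts=8):
    for n in range(attempts):
        yield random.uniform(0, min(cap, base * factor ** n))

def with_retries(call, attempts=8):
    """call() returns (status, retry_after_seconds_or_None, body)."""
    last = None
    for delay in backoff_delays(attempts=attempts):
        status, retry_after, body = call()
        if status < 500 and status != 429:
            return status, body            # success, or non-retryable client error
        last = (status, body)              # 5xx / 429 / 503 SlowDown: back off
        time.sleep(retry_after if retry_after else delay)
    return last                            # budget exhausted; surface the error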

Control-plane & admin

  • CreateBucket / DeleteBucket / PutBucketPolicy / PutBucketLifecycle / PutBucketVersioning — low-QPS, strongly consistent, Paxos-ordered.

5 Data Schema

Engines

| Data | Engine | Why |
| --- | --- | --- |
| Object metadata (bucket, key → manifest) | Sharded transactional KV — Spanner-class (multi-Paxos groups per shard) or FoundationDB | Need strong consistency for read-your-writes PUT; ordered range scans for LIST; billions of keys; multi-region Paxos later (v3). Rejected: Cassandra/Dynamo — no transactions, LIST is painful; MySQL/Postgres — can't shard this wide natively. |
| Chunk index (chunk_id → placement) | Same KV, or a separate placement service (BigTable-style) | Separation means object metadata can be cached aggressively (the manifest is immutable post-commit) while the chunk index changes during repair without invalidating object caches. |
| Bucket catalog | Same KV, rooted at system bucket /sys/buckets | Low QPS, strong consistency, ACL-attached |
| ACL / IAM policies | Same KV | Cross-referenced on every auth; cached aggressively |
| Data (chunks) | Chunkfiles on local disk (Colossus D / GFS chunkserver / RADOS OSD); one chunkfile = many chunks, append-packed | Avoids 1-chunk-1-file on XFS (inode pressure kills perf at billions-of-files scale). Classic f4/Haystack lesson: pack small objects into large container files. |
| Write-ahead log | Local SSD per metadata shard + the shard's Paxos ring | Bounds commit latency |

Schemas (logical)

# Metadata: objects table (range-sharded by (bucket_id, key); see §7.2)
object_meta {
  bucket_id      uint64
  key            string         # up to 1 KB
  version_id     uint128        # monotonic (ts << 64) | random
  is_delete_marker bool
  size           uint64
  content_md5    bytes(16)      # ETag for single-part
  sha256         bytes(32)      # internal end-to-end integrity
  storage_class  enum {HOT, WARM, COLD, ARCHIVE}
  ec_scheme      enum {REPL_3, RS_10_4, LRC_12_2_2}
  chunk_manifest [chunk_id]     # ordered; covers object bytes [0, size)
  chunk_sizes    [uint64]
  acl_id         uint64
  created_ts     uint64         # hybrid logical clock
  lifecycle_tag  uint32         # for batch lifecycle scans
  tenant_id      uint64
  encryption     { kms_kek_id, wrapped_dek }
  user_metadata  map<string,string>   # up to 2 KB
}
# Primary key: (bucket_id, key, version_id DESC)
# Range scans on (bucket_id, key) serve LIST.

# Chunk placement (sharded by chunk_id)
chunk_placement {
  chunk_id       uint128        # globally unique, content-addressable seed
  stripe_id      uint128        # many chunks in one EC stripe share this
  stripe_pos     uint8          # 0..13 for RS(10,4)
  disk_id        uint64         # physical disk
  node_id        uint64
  rack_id        uint32
  zone_id        uint16
  chunkfile_id   uint64
  offset         uint64
  length         uint32
  crc32c         uint32         # per-chunk; composes with object-level sha256
  gen_number     uint32         # lease generation; bumped on recovery
  state          enum {HEALTHY, REPAIRING, STALE, DEAD}
}

# Bucket
bucket {
  bucket_id      uint64 PK
  name           string UNIQUE GLOBAL
  owner          uint64
  created_ts     uint64
  versioning     enum {DISABLED, ENABLED, SUSPENDED}
  lifecycle      [rule]
  replication    { enabled, dest_regions[] }  # for v3
  policy_id      uint64
}

# ACL
acl {
  acl_id         uint64 PK
  grants         [{grantee_type, grantee_id, permission}]
}

Why sharded KV (not a blob-store master like GFS)

  • GFS master held all FS metadata in one process's RAM. Hit ~50 PB / ~100 M file ceiling circa 2009.
  • Colossus (curators) shards metadata into BigTable, which is itself sharded; each curator serves a key-range.
  • This design: (bucket_id, key) ranges assigned lex-ordered to N shards (Spanner-style directories), so LIST is a contiguous scan on one or a few shards; each shard is a 5-member Paxos/Raft group on SSD. A pure hash(bucket_id, key) mod N_shards layout would balance load but scatter every LIST — see §7.2.

Why content-addressable chunk IDs

  • Enables dedup for identical chunks across objects (a blessing for image thumbnails, CI artifacts, git packs).
  • Enables retry idempotency: if a PUT fails mid-way, a retry writing the same bytes produces the same chunk IDs; no phantom duplicates.
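
A sketch of that idempotency argument (the hashing layout is illustrative; the real ID is whatever §5's chunk_id encodes):

# Content-addressed chunk IDs: same bytes -> same IDs, so a retried PUT
# rewrites the same chunks instead of creating phantom duplicates.
import hashlib

CHUNK = 8 * 1024 * 1024   # 8 MB slicing, per the write path in §6

def chunk_ids(body: bytes, chunk_size: int = CHUNK) -> list[str]:
    return [hashlib.sha256(body[i:i + chunk_size]).hexdigest()[:32]
            for i in range(0, len(body), chunk_size)]

assert chunk_ids(b"x" * 20_000_000) == chunk_ids(b"x" * 20_000_000)  # retry-stable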

6 System Diagram (ASCII, centerpiece)

Regional architecture

                                   +----------------------+
                                   |   Client SDK / App   |
                                   |  (S3 SigV4, retries) |
                                   +----------+-----------+
                                              |  HTTPS, SigV4
                                              v
                                   +----------------------+
                                   |   Edge / CDN Cache   |   (optional for GET; bypass for PUT)
                                   |   (write-through? no)|
                                   +----------+-----------+
                                              |  HTTPS
                                              v
                         +--------- Anycast / GCLB ---------+
                         |   Regional Load Balancer (L7)     |   ~100 k QPS/box, 200 boxes
                         |   TLS term, WAF, rate-limit edge  |
                         +------+----------------------+----+
                                |                      |
                                v                      v
                      +---------+-----------+  +-------+----------+
                       |  Frontend / Gateway |  |  Frontend/Gateway |   ... fleet ~2000 nodes
                       |  (stateless)        |  |  ...              |
                       |  - Auth (SigV4,IAM) |  |                   |
                       |  - parse/validate   |  |                   |
                       |  - slice->chunks    |  |                   |
                       |  - pick EC scheme   |  |                   |
                       |  - upload coord     |  |                   |
                       +----+-------+-------+-+                      |
                           |       |       |                        |
          gRPC (KV RPC)    |       |       |  gRPC (bulk data)      |
                           v       v       v                        v
     +--------------------------+    +--------------------------------+
     |   METADATA SERVICE       |    |       CHUNKSERVER FLEET        |
     |   256 shards             |    |   ~2000 nodes × 60 HDD + SSD   |
     |   Raft 5-way per shard   |    |   (Colossus D-class)           |
     |   SSD-backed             |    |   - chunkfiles (packed)        |
     |   p99 read <5 ms         |    |   - per-disk IO scheduler      |
     |   p99 write <15 ms       |    |   - QoS classes (user/repair)  |
     |   -------------          |    |   - crc32c on every 64KB page  |
     |   objects                |    |   - lease holder per chunk     |
     |   chunk_placement        |    |                                |
      |   buckets, ACLs,         |    |                                |
      |   multipart in flight    |    +----^---------------^-----------+
     +-----^-------^-----^------+         |               |
           |       |     |                |               |
      (range       |     |     (chunk replicate)   (repair reads)
       scan LIST)  |  (commit)            |               |
           |       |     |                |               |
           +-------+-----+                |               |
                                          |               |
                    +---------------------+---------------+-------------------+
                    |                                                         |
                    v                                                         v
     +--------------------------------+                   +---------------------------------+
     |   CONTROL PLANE                |                   |   BACKGROUND / DATA PLANE       |
     |   - Placement Driver (PD):     |                   |   - Repairer (scan->reconstruct)|
     |     picks (node, rack, zone)   |<---- health ----->|   - Anti-entropy scrubber       |
     |     per stripe                 |                   |   - Erasure Coder (3->RS shift) |
     |   - Topology Service           |                   |   - Garbage Collector           |
     |   - Health / failure detector  |                   |   - Tiering Mover (hot->cold)   |
     |   - Capacity planner           |                   |   - Rebalancer (disk skew fix)  |
     +--------------------------------+                   +---------------------------------+

     Side services:
      - IAM / KMS (DEK wrap/unwrap, per-tenant KEKs)
      - Bucket quota / billing meter
      - Audit log (all control-plane ops, immutable WORM)
      - Replication service (cross-region, v3)

     Scale annotations:
       - Frontend → Metadata : ~12 M QPS at 256 shards (46 K QPS/shard avg, 80 K peak)
       - Frontend → Chunkserver : ~45 Tbps ingress, ~320 Tbps egress regional
       - Chunkserver disk IOPS : 60 disks × 150 IOPS = 9 K per node × 2000 = 18 M IOPS
       - PD health pings : every 5 s to every node → 400 QPS from PD, fanout to 2000

Write path (single-part PUT, ≤64 MB)

Client                  Frontend              Metadata (shard S)          Chunkservers (14 picks across 4 zones)
  |                        |                         |                               |
  |-- PUT key body ------->|                         |                               |
  |                        |- auth/sigV4             |                               |
  |                        |- shard(bucket,key)=S    |                               |
  |                        |- slice body into 8MB chunks (N=ceil(size/8MB))          |
  |                        |- for each chunk:                                        |
  |                        |    erasure-encode RS(10,4) -> 14 shards                 |
  |                        |    PD.pick_placement(14) -> [(node,disk), ...]          |
  |                        |- begin_write_txn to S  |                               |
  |                        |------------------------>|-- reserve object version v ---|
  |                        |                         |   (lease / uncommitted row)   |
  |                        |<---- leases ------------|                               |
  |                        |                                                         |
  |                        |-------------- WRITE each shard in parallel ------------>|
  |                        |                                                         |
  |                        |                          <---- CRC + ack per shard -----|
  |                        |  (wait for ALL 14 shard acks; on shortfall, abort + retry — §7.2) |
  |                        |                                                         |
  |                        |- compose manifest (chunk_ids, sizes)                    |
  |                        |- commit_object_txn(meta, ec_scheme, manifest, sha256)   |
  |                        |------------------------>|-- Raft commit on shard S ----|
  |                        |<--- commit OK ----------|                              |
  |<-- 200 ETag -----------|                         |                              |

Latency budget (single 8 MB chunk, intra-DC):
  auth + shard routing          ~2 ms
  slice + encode (in-memory)    ~4 ms     (RS(10,4) at 2 GB/s per core)
  PD placement                  ~1 ms     (cached)
  metadata reserve              ~5 ms     (Raft prepare)
  parallel 14-shard write       ~25 ms    (TCP + disk write cache + fsync-on-close)
  metadata commit               ~10 ms    (Raft commit)
  -----------------------------------
  total                         ~47 ms    -> p50 PUT target met

Read path (GET)

Client                  Frontend              Metadata (shard S)          Chunkservers
  |                        |                         |                               |
  |-- GET key ------------>|                         |                               |
  |                        |- auth                   |                               |
  |                        |- ask: obj_meta(b,k)  -->|-- read row (cached) --+       |
  |                        |<-- manifest, acl, size -|                       |       |
  |                        |- auth against ACL                                       |
  |                        |- for each needed chunk (range -> subset):               |
  |                        |    send parallel GET_CHUNK(chunk_id) to placement[0..9] |
  |                        |                                                         |
  |                        |                          <---- chunk bytes + crc -------|
  |                        |- verify crc, stream to client                           |
  |                        |- (if any of 10 is slow/dead beyond hedge deadline:      |
  |                        |    fetch 2 parity chunks + reconstruct)                 |
  |                        |                                                         |
  |<-- 200 body -----------|                                                         |

Latency budget (hot, cache-hit on metadata):
  LB + auth                     ~3 ms
  metadata read (cached)        <1 ms  (LRU hit) or 5 ms (SSD miss)
  chunkserver dispatch          ~1 ms
  parallel 10 chunk reads       ~15 ms (one disk seek + transfer)
  hedge + stream                ~5 ms
  -------
  total                         ~25-30 ms -> p50 hit

Recovery path (disk dies)

Health detector (PD) sees node/disk FAIL at T0
   |
   v
PD queries chunk_placement WHERE disk_id = dead_disk  (range scan -> ~5M chunks)
   |
   v
Repairer schedules reconstructions (priority queue: high if stripe health ≤ floor+1)
   |
   v
For each stripe with a lost shard:
   - pick 10 healthy survivors (for RS(10,4))
   - choose new placement satisfying rack/zone constraints
   - pull 10 shards over the "repair QoS class" (separate token bucket from user traffic)
   - reconstruct missing shard, write to new disk
   - Paxos-update chunk_placement atomically (CAS on gen_number)
   |
   v
Anti-entropy scrubber periodically re-reads every chunk, verifies crc32c,
   marks SILENT_CORRUPTION if mismatch -> triggers same repair path.

MTTR target: single-disk (20 TB, 5M chunks) repaired within 4 hours.
 Requires: 20 TB / 4 h = 1.4 GB/s reconstruct bandwidth sustained.
 Allocated: 10 Gbps repair-class lane per rack, so 64 racks × 1.25 GB/s = 80 GB/s aggregate.
 Per-stripe: 10 × shard_size read + 1 × shard_size write → ~11× IO amplification per repaired shard.

7 Deep Dives (3 earned-secret topics)

7.1 Replication vs Erasure Coding — the real trade-off

Why critical. This choice moves ~40% of the cluster cost and defines 90% of the failure-mode surface. Getting it wrong burns either money (over-replicated) or durability (under-parity). Every "use EC" hand-wave at L7 gets probed.

Alternatives & numbers

| Scheme | Overhead | Max losses / stripe | Reconstruct reads | Small-object penalty | Typical production use |
| --- | --- | --- | --- | --- | --- |
| 3× replication | 3.00× | 2 | 1 (just copy a replica) | none | GFS v1, HDFS default, Ceph RBD |
| RS(6,3) | 1.50× | 3 | 6× | moderate | Azure variants, small clusters |
| RS(10,4) | 1.40× | 4 | 10× | high | Colossus (default warm), f4, S3 (EC tier) |
| RS(12,4) | 1.33× | 4 | 12× | high | Azure, HDFS-EC |
| LRC(12,2,2) — Microsoft | 1.33× | mixed; local repairs cheaper | 6× for single loss | high | Azure Storage |
| Tape (archive) | 1.0× + 2-copy across sites | geo-scale | hours | N/A | AWS Glacier Deep Archive |

Key math — why 1.4× isn't the whole story

Repair cost scales with k. For RS(10,4), replacing one dead shard reads 10 shards to reconstruct. With 20 TB drives:

  • Read amplification during repair = 10× (read 10 shards at shard_size to make 1 shard_size of data).
  • On a 20 TB disk death, that's 200 TB of reads. At 1 GB/s per node, > 55 node-hours of repair IO competing with user reads.
  • This is why you MUST have QoS lanes. Google's public papers describe separate Colossus D "repair tokens." Without them, a disk failure → tail-latency spike → angry pager.

The small-object penalty (earned-secret). Any object smaller than one shard (e.g., 512 KB against an 8 MB shard size) still gets split into a full 14-shard stripe: you store mostly padding across 14 disks, and you've touched 14 disks for 512 KB — catastrophic IOPS amplification on GET. Real systems solve this by (sketch after the list):

  1. Packing — small objects first packed into 64 MB "bundles," then bundles are erasure-coded. This is Facebook Haystack's insight: one bundle = 1000s of photos = amortize EC overhead. f4 extends this to warm tier with RS(10,4).
  2. Hot-tier 3× replication — hot small objects get 3× replication; EC applies only on cold/warm tier transition (via lifecycle). Matches S3 Standard → S3 IA transition model.
  3. LRC (Microsoft Locally Recoverable Codes) — adds local parity groups so single-shard repair reads only 6 shards, not 12. Pays 33% vs 40% overhead.
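
A toy sketch of approach 1 (names hypothetical; real bundle formats carry headers, per-page CRCs, and compaction state):

# Pack small objects into a large append-only bundle; erasure-code the bundle,
# not the objects. One 64 MB bundle amortizes RS(10,4) across 1000s of objects.
BUNDLE_TARGET = 64 * 1024 * 1024

class Bundle:
    def __init__(self):
        self.buf = bytearray()
        self.index = {}                    # object_id -> (offset, length)

    def append(self, object_id: str, body: bytes) -> bool:
        if len(self.buf) + len(body) > BUNDLE_TARGET:
            return False                   # bundle full: seal it, EC it, start fresh
        self.index[object_id] = (len(self.buf), len(body))
        self.buf += body
        return True

    def read(self, object_id: str) -> bytes:
        off, ln = self.index[object_id]    # a GET maps to one (bundle, offset, len)
        return bytes(self.buf[off:off + ln])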

My chosen approach (and why)

  • Hot tier (new writes, <7 days): 3× replication across 3 zones. PUT latency 40 ms (no encode). Reads can serve from nearest replica. Repair = 1× read.
  • Warm tier (7–90 days): RS(10,4) across 4 zones via background coder. Fully supports 4-shard loss including zone failure.
  • Cold tier (>90 days): RS(12,4) or LRC(12,2,2), shard size bumped to 64 MB, possibly stored on SMR drives with bundles.
  • Archive tier: single RS(10,4) + geo-replicated to a second region on tape.

Failure modes & mitigations

  • Correlated disk failure (bad batch, same firmware bug) — placement MUST NOT cluster shards of one stripe on disks with same firmware version. PD records firmware_id and diversifies.
  • Zone failure on 3-zone region — RS(10,4) can't tolerate losing ≥5 shards; requires 4-zone layout or RS(6,3) with 3 zones × 3 shards/zone.
  • Slow shard ("stuck" but not dead) — during GET, hedge by reading k+1 (11 shards) after 20 ms; first 10 to return wins. Classic "tail at scale" technique (Dean & Barroso 2013).
  • Bit-rot — crc32c per 64 KB page + sha256 end-to-end. Scrubber re-verifies every chunk every 14 days.
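
The slow-shard hedge above, as a minimal asyncio sketch (fetch_shard is an assumed async callable; thresholds from the bullet):

# Hedged stripe read ("tail at scale"): issue the k data reads; if any is still
# outstanding at the hedge deadline, launch one parity read; the first k
# arrivals win and feed the RS decode.
import asyncio

async def read_stripe(fetch_shard, data_ids, parity_ids, k=10, hedge_s=0.020):
    tasks = [asyncio.create_task(fetch_shard(s)) for s in data_ids[:k]]
    done, pending = await asyncio.wait(tasks, timeout=hedge_s)
    if pending:                                   # slow/stuck shard: hedge k+1
        tasks.append(asyncio.create_task(fetch_shard(parity_ids[0])))
    shards = []
    for fut in asyncio.as_completed(tasks):
        shards.append(await fut)
        if len(shards) == k:                      # any k of these decode the stripe
            break
    return shards                                 # leftovers: cancel in production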

Quantified durability derivation (the 11-nines claim)

Assume:

  • Disk AFR = 1% annual → P(disk death in 1 year) = 0.01.
  • Stripe of 14 shards (RS(10,4)), each on independent disks.
  • Loss event: ≥5 shards of one stripe fail before repair completes (i.e., within the MTTR window).
  • MTTR = 4 hours (aggressive placement + repair).

Simplified model: during repair window (4 h = 1/2190 year), per-disk failure prob ≈ 0.01 / 2190 ≈ 4.6e-6. P(≥4 more fail in that window | 1 already lost) ≈ C(13,4) × (4.6e-6)^4 = 715 × 4.5e-22 ≈ 3.2e-19 per stripe per year.

Total stripes ≈ 1.4 EB / (14 × 8 MB) = 1.25e10 stripes. P(any stripe loses ≥5 shards) ≈ 1.25e10 × 3.2e-19 ≈ 4e-9 per year → ~9 nines.
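
The same arithmetic as a script (the simplified independence model above; AFR and MTTR are the stated assumptions):

# Simplified stripe-loss model from §7.1: given 1 shard already lost, what is
# the chance that >= 4 more of the remaining 13 die inside the repair window?
from math import comb

AFR, MTTR_H = 0.01, 4.0
p_win = AFR * MTTR_H / 8760            # per-disk failure prob in window ≈ 4.6e-6
p_stripe = comb(13, 4) * p_win**4      # ≈ 3e-19 per repair event

stripes = 1.4e18 / (14 * 8e6)          # ≈ 1.25e10 stripes in 1.4 EB
print(f"{p_stripe:.1e}  {stripes * p_stripe:.0e}")   # -> 3.1e-19  4e-09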

To hit 11 nines we need MTTR shortened, or additional parity (RS(12,4) still tolerates only 4 losses, but now across 16 shards → tighter odds). Real systems get the extra 2 nines via:

  • Geo-redundancy for archive (adds effectively ~3 nines once independent-region failure is factored in).
  • Shorter MTTR (sub-hour via aggressive repair) — the ≥4-extra-failures term scales as MTTR⁴ for RS(10,4), so every halving of MTTR buys roughly 16×.
  • Diversity of failure modes beyond independent AFR (firmware, power, cooling) — hurts, not helps; Backblaze data shows real-world correlated failures.

The honest L7 admission: 11 nines is a marketing number. Real per-object annual loss on S3 is empirically much smaller than the model predicts because of geo-replication and software redundancy layers; the model above gives 9 nines for a single-region RS(10,4) warm tier, which is why production systems layer cross-region async replication and journaled writes on top.

7.2 Metadata Scalability — why GFS's single master had to die

Why critical. Metadata is the serializer of the whole system. A slow metadata service makes the fastest chunkservers useless. GFS master hit a wall at 50 PB / ~100 M files / ~10 K QPS. Colossus's first job was to kill that master.

Alternatives (quantified)

| Design | Namespace scale | Metadata QPS | LIST perf | Ops complexity | Systems |
| --- | --- | --- | --- | --- | --- |
| Single master (RAM) | ≤50 PB / ~100 M files | ~10 K QPS | O(1) tree walk | simple, single-pane | GFS v1, HDFS NN v1 |
| Federation (namespace slicing) | linear w/ slices | +N× | breaks cross-slice rename | medium | HDFS Federation |
| BigTable-backed (Colossus curators) | effectively unbounded | ~1 M+ QPS | O(log) + range scan | complex | Colossus, BigTable itself |
| Sharded transactional KV (Spanner dirs) | unbounded | ~10 M QPS | range scan with ordered shards | complex | F1, CockroachDB-style |
| DHT / consistent hashing (Dynamo) | unbounded | ~10 M QPS | no prefix scan — deal-breaker for S3 LIST | simple | Dynamo, Cassandra, Swift rings |

Why DHT-style metadata is WRONG for S3-like

  • LIST ?prefix=foo/bar/ requires finding all keys in range [foo/bar/, foo/bar0). A hash-based placement scatters that range across every shard. A LIST becomes a scatter-gather over every metadata node, which:
    • makes LIST latency scale with shard count;
    • makes LIST cost scale with shard count, not result size;
    • makes LIST fragile — one slow shard stalls the whole result.
  • Dynamo/Swift sidestep this by not doing prefix LIST natively (Swift does flat container listings but has scaling issues there). This is fine for KV blobs; it's fatal for S3 API compatibility.
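
The prefix-to-range conversion that ordered shards make cheap (a sketch; real key encodings also carry bucket IDs and handle UTF-8 edge cases):

# LIST ?prefix=P on an ordered keyspace = one contiguous scan [P, successor(P)).
# successor("foo/bar/") = "foo/bar0" because '0' is the byte after '/'.
def prefix_range(prefix: str) -> tuple[str, str]:
    p = prefix.encode()
    for i in range(len(p) - 1, -1, -1):
        if p[i] != 0xFF:                       # bump the last non-0xFF byte
            return prefix, (p[:i] + bytes([p[i] + 1])).decode("latin-1")
    return prefix, prefix + "\uffff"           # degenerate all-0xFF case

print(prefix_range("foo/bar/"))                # ('foo/bar/', 'foo/bar0')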

My chosen approach

  • Ordered sharded KV for metadata (Spanner-like directory layer). (bucket_id, key) lex-ordered.
  • 256 shards per region, each a 5-member Paxos/Raft group.
  • Shard split when a shard grows past ~100 GB of metadata or 50 K QPS.
  • Shard re-mapping handled by a directory service; frontend caches shard→range mapping with TTL + ETag.

Bucket-hotspot mitigation (earned-secret)

  • A tenant uploads 1 M objects/hour to the same prefix (e.g., /logs/YYYYMMDD/HHMM/). The metadata shard owning that range gets hammered.
  • Fix: auto-split on load, not just on size. Each shard tracks p99 write latency; if >SLA for 5 min, propose split to range-mid.
  • Client-side hint: SDKs recommend adding a hash prefix to keys (/logs/ab12_20260418.../) — classic S3 partition-key trick (since 2018 S3 removed the need, but underneath it's exactly this).
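
The client-side salting hint in the last bullet, sketched (the 2-byte salt width is arbitrary):

# Spread a hot time-ordered prefix across metadata shards by salting the key.
import hashlib

def salted_key(key: str, buckets: int = 256) -> str:
    salt = int.from_bytes(hashlib.md5(key.encode()).digest()[:2], "big") % buckets
    return f"{salt:02x}/{key}"            # "ab/logs/20260418/1200/part-0001"

# The trade-off: LIST now fans out over 256 salted sub-prefixes -- which is
# why S3 eventually automated the split server-side (auto-split on load).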

Metadata-vs-data consistency (the crux of this problem's title): how do we keep metadata consistent with the data it points to?

The commit protocol:

  1. Frontend slices body, assigns chunk IDs (content-addressable).
  2. Frontend writes chunks to chunkservers — chunks become uncommitted, linked only in a transient upload record.
  3. Chunkservers fsync + return placement receipt with generation number.
  4. Frontend calls metadata commit_object with manifest + receipts.
  5. Metadata Paxos commits the row. At this moment the object exists.
  6. If step 4 or 5 fails, chunks are orphaned. GC reaps them after TTL (default 7 days). No lost data; just space reclaimed.

Critical property: a client never sees an object until metadata commit. If metadata commits but frontend crashes before returning 200, client retries; retry writes same content-addressed chunks (idempotent) and re-commits — metadata sees same version_id and returns OK (idempotent commit). This is why version_id is derived from a hash of the content + timestamp, not a server-generated serial that would differ on retry.
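A sketch of that deterministic version_id and idempotent commit (the layout is illustrative; the dict stands in for a Paxos-committed row):

# Deterministic version_id: (HLC timestamp << 64) | hash(content, ts). The HLC
# value is fixed at the reserve step and reused on retry, so a retried PUT of
# identical bytes re-derives the same version_id and the commit is a no-op.
import hashlib

def make_version_id(content_sha256: bytes, hlc_ts: int) -> int:
    low = int.from_bytes(hashlib.sha256(
        content_sha256 + hlc_ts.to_bytes(8, "big")).digest()[:8], "big")
    return (hlc_ts << 64) | low          # uint128, matches §5's version_id

def commit_object(meta_kv: dict, key: tuple, vid: int, manifest: list) -> str:
    if (key, vid) in meta_kv:            # retry hit: same version already durable
        return "OK (idempotent)"
    meta_kv[(key, vid)] = manifest       # in reality: Raft/Paxos commit on shard S
    return "OK"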

Handling torn writes / partial failures

  • A chunk write fails mid-stripe: 12 of 14 shards acked, 2 didn't. Options:
    1. Abort: tell client 503, GC the 12 acked shards (cheap — they're orphaned).
    2. Commit with degraded flag, repair in background. Dangerous if further failure before repair; don't.
  • Always abort-and-retry. Simpler, matches S3 behavior.

Dealing with the metadata service itself going down

  • Write availability: 99.99% via Paxos across 5 replicas in 3 zones (tolerates 2 losses).
  • Read availability: local-follower reads OK for immutable keys (post-commit metadata doesn't mutate except for lifecycle/delete).
  • Metadata partitioning (network partition inside one region): minority-side rejects writes (503) but can still serve reads with "may be stale" hint. Matches CP side of CAP.

7.3 Durable writes & end-to-end integrity (how we actually hit 11 nines in practice)

Why critical. Silent data corruption is the #1 cause of real-world "impossible" data loss on large clusters. Backblaze reports ~1 bit error per 10^14–10^15 bits read on consumer disks — at EB-scale, you will see silent corruption daily. No amount of replication fixes corruption you didn't detect.

The stack of integrity checks

  1. Client-provided content-MD5 or SHA256 in PUT header. Server re-hashes on receipt; mismatch → 400 BadDigest. Detects in-flight corruption from client to edge.
  2. Per-shard CRC32C computed at encode time, stored alongside shard. Verified on every disk read.
  3. Per-page CRC32C at 64 KB granularity inside the chunkfile. Verified on every read; bad page → reconstruct from parity.
  4. Per-object SHA-256 in metadata, end-to-end. After read + reassemble + decrypt, SHA-256 is recomputed; mismatch → pull alternate parity, log corruption, fail fast.
  5. Scrub walker — background process re-reads every shard every N days (typically 14 days on warm, 30 days on cold), verifies all CRCs. On corruption: mark shard dead, trigger repair.
  6. End-of-disk-life RMA triggered by SMART + CRC error rate. Disk is "drained" before it hits critical.
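
A sketch of layers 3–4 on the read path (stdlib zlib.crc32 stands in for crc32c, which would come from a library such as google-crc32c):

# Read-path verification: per-64KB page CRC first (cheap, localizes damage to a
# page -> reconstruct it from parity), then whole-object SHA-256 (end-to-end).
import hashlib, zlib

PAGE = 64 * 1024

def corrupted_pages(data: bytes, page_crcs: list[int]) -> list[int]:
    """Return indices of corrupted 64 KB pages (empty list = clean)."""
    return [i for i in range(len(page_crcs))
            if zlib.crc32(data[i * PAGE:(i + 1) * PAGE]) != page_crcs[i]]

def verify_object(data: bytes, expected_sha256_hex: str) -> bool:
    return hashlib.sha256(data).hexdigest() == expected_sha256_hex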

Why not just TCP checksum?

  • The TCP checksum is a 16-bit ones'-complement sum, not a CRC; it lets through roughly 1 in 65 K random corruptions (and certain offsetting bit errors deterministically) — and once data is memcpy'd in RAM, TCP doesn't help at all. [Stone & Partridge 2000] "When the CRC and TCP Checksum Disagree."
  • Disk firmware checksums miss writes that went to wrong sectors ("phantom writes"). Only end-to-end checksums catch torn writes / misdirected writes.

Quorum writes vs synchronous all (the interview sleight-of-hand)

You'll see answers say "quorum write: W=2, R=2, N=3, consistency guaranteed." That's Dynamo, and it's subtly wrong for object storage:

  • With W=2 of 3, the 3rd replica is eventually consistent. Read-your-writes requires R+W > N → R=2, W=2, N=3 → R+W=4 > 3. OK.
  • But: if W=2 and the 2 that acked are in the same rack, a rack failure loses the write. You must enforce zone-diversity at write time, not just placement policy.
  • S3 went strongly consistent in Dec 2020 by layering a consistent cache over majority-quorum metadata. Details are not public, but plausibly: metadata written synchronously to all replicas, and data written multi-AZ quorum with repair.

My model: metadata = synchronous Paxos commit (W=3 of 5); data chunks = all-acks (W=14 of 14 for RS(10,4)) at write time with hedging. Why all-acks:

  • Only 14 shards per stripe; "quorum of 14" is silly.
  • The latency cost of waiting for the last 2 is small if tail is well-managed.
  • If 1–2 shards are persistently slow, PD reassigns before the write (health-aware placement), rather than committing with missing shards. A stripe committed with missing shards is a stripe you have to reconstruct on first read — bad latency UX.

Handling bit rot across the tiers

  • Hot: 3× replicated, scrubbed weekly, CRC mismatch → copy from another replica.
  • Warm: RS(10,4), scrubbed bi-weekly, CRC mismatch → reconstruct from parity.
  • Cold: RS(12,4) or LRC, scrubbed monthly, shards on SMR HDDs (read-mostly), corruption rate higher → more frequent scrub budget needed.
  • Archive (tape): scrubbed yearly, 2-copy geo.

Production note: Colossus D write-ups describe the "end-to-end CRC composition" pattern: block CRCs compose into chunk CRCs, which compose into an object CRC, so you can verify a range read's integrity without re-reading the whole object. ZFS does this with Merkle trees (Bonwick); HDFS does it with per-block checksum files.


8 Failure Modes & Resilience

Per-component, detection → blast radius → mitigation → recovery.

8.1 Single disk failure

  • Detection: SMART + read error rate + missed heartbeat from disk scrubber. Node marks disk DEAD within 60 s.
  • Blast: one shard per stripe, 5 M stripes affected (20 TB / 4 MB avg shard = 5 M).
  • Mitigation: stripes are still live (RS(10,4) tolerates 4 losses). No user-facing error.
  • Recovery: repair queued immediately, SLA 4 h. During repair, read path uses reconstruction for affected stripes (adds ~50 ms). Alert only if >5 disks fail simultaneously (then rack/node event).

8.2 Node failure

  • Detection: PD heartbeat miss for 3 intervals (15 s). Node marked DOWN.
  • Blast: 60 disks × ~5 M chunks = 300 M chunks. Each stripe loses 1 shard (since placement MUST ensure one shard/node).
  • Mitigation: same as 8.1 ×60. Large repair job (1.2 PB) — spread over 24–48 h if QoS allows.
  • Recovery: if node comes back within grace period (1 h), re-attach and cancel repair for not-yet-started work. If not, repair to completion; drain the node when it returns.
  • Correlated failure: if 5 nodes in one rack die (power outage), the invariant "≤1 shard/rack" means stripes lose ≤1 shard — still safe. But placement must actually enforce it.

8.3 Rack failure

  • Detection: correlated heartbeat loss in one rack-id cluster.
  • Blast: up to 1800 disks, 108 M chunks, but ≤1 shard/stripe under invariant.
  • Mitigation: no user-facing error if invariant holds. If placement ever violated invariant (bug), now you see correlated stripe damage.
  • Recovery: power-on or draining. Repair-to-completion over 72 h via separate QoS lane (<10% of cluster bandwidth to avoid user-traffic SLO burn).

8.4 Zone failure

  • Detection: regional health service (not the same zone's PD — chicken/egg) detects zone-wide packet loss.
  • Blast: ~25% of fleet in a 4-zone region = 500 nodes, 30 K disks. Under RS(10,4) with ≤4 shards/zone, each stripe loses ≤4 shards → still readable but at bare minimum, no further loss tolerance.
  • Mitigation: IMMEDIATELY enter "degraded" mode: halt repair traffic (it amplifies damage if we mis-place), halt placement churn, serve reads only via reconstruction for affected stripes, block writes to affected zone.
  • Recovery: after zone returns, delta-repair (re-verify shards rather than re-write). If zone permanently lost, 30+ day repair job to re-hydrate to 14/14.

8.5 Metadata partition

  • Detection: Raft leader lease fails to renew; followers detect.
  • Blast: the affected metadata shard (one 256th of keyspace) is unavailable for writes; reads can continue with stale data for stale_read_bound.
  • Mitigation: client SDK sees 503 on that key-range and retries with backoff. Most buckets span multiple shards (hashed), so a single-shard outage only affects ~0.4% of keys.
  • Recovery: new leader elected within 5 s (typical Raft). If minority permanently isolated, majority continues.

8.6 Silent corruption

  • Detection: CRC mismatch on read OR during scrub.
  • Blast: single shard per event.
  • Mitigation: read path falls back to parity reconstruction, logs event, pulls from an alternate shard.
  • Recovery: shard marked dead; repair replaces it. If SHA-256 mismatch at object level (meaning multiple shards corrupted in a correlated way), this is an incident — page the SRE, freeze the object's bucket, investigate (firmware bug? cosmic ray? malicious?). Correlated silent corruption has happened in production (Amazon S3 2008 hash-bug, Netflix 2019 memory bit-flip).

8.7 Thundering herd after outage

  • Scenario: a zone comes back after 30-min outage; 25% of client traffic re-targets; cold caches; retry-amplified QPS can be 3–10×.
  • Detection: QPS spike monitors at LB and at metadata.
  • Mitigation:
    • Global rate-limit at LB: token bucket per tenant, 2× steady-state allowed for 60 s then drop to SLA.
    • Coalesce in-flight reads: multiple GETs for same key within 50 ms serve from one upstream read (request collapsing / "singleflight"); no amplification.
    • Backoff with jitter enforced in official SDK. Clients without SDK: 503 + Retry-After tells them to wait.
    • Graceful admit: LB sheds 10% of load rather than let the fleet tip into CPU saturation. Matches Netflix "adaptive concurrency limits" (Little's law in the LB).
  • Recovery: rate-limit relaxes when QPS returns to steady-state for 5 min.
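
The request-coalescing bullet above, sketched (Go's golang.org/x/sync/singleflight is the canonical version; this is a minimal asyncio analogue):

# Request collapsing ("singleflight"): concurrent GETs for the same key share
# one upstream read instead of amplifying the herd.
import asyncio

class SingleFlight:
    def __init__(self):
        self._inflight: dict[str, asyncio.Future] = {}

    async def do(self, key: str, fetch):
        if key in self._inflight:                  # someone is already fetching
            return await asyncio.shield(self._inflight[key])
        fut = asyncio.get_running_loop().create_future()
        self._inflight[key] = fut
        try:
            result = await fetch(key)              # the one real upstream read
            fut.set_result(result)
            return result
        except Exception as e:
            fut.set_exception(e)                   # waiters see the same failure
            raise
        finally:
            del self._inflight[key]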

8.8 Correlated cross-zone failure (the "bad deploy" scenario)

  • Scenario: a rolling push of a new chunkserver binary has a bug that crashes on a specific byte pattern. Deploy hits 3 of 4 zones before detection.
  • Detection: elevated crash rate across multiple zones within 2 min.
  • Mitigation: deploy system auto-rollback on health-signal regression (Canary + Argos-style). Staggered deploy per zone with 24 h hold makes zone-wide correlated failures from code nearly impossible.
  • Recovery: roll back binary, nodes self-recover as they restart.

9 Evolution Path

v1 — Single-region, GFS-like, 3× replication

  • Single metadata master (active + passive, log-shipped failover), 3× rep everywhere.
  • Targets: 10 PB, 1 B objects, 100 K QPS.
  • Timeline: 6 months to MVP. Known to break at ~50 PB / master exhaustion.
  • Operationally: simple, single pane, 3-person SRE team can carry it.

v2 — Sharded metadata + EC for cold (Colossus-like)

  • 256 metadata shards (Paxos), hot-tier 3× rep, warm/cold RS(10,4).
  • Background tiering mover converts warm→cold.
  • LRC(12,2,2) experimented for coldest tier.
  • Add: lifecycle rules, versioning, multipart.
  • Scale targets: 1 EB, 10 B objects, 10 M QPS. (This is the design spec above.)
  • Key migration: metadata migrated via online dual-write + shadow-read; data tier converted stripe-by-stripe.

v3 — Multi-region, globally-eventually-consistent

  • Per-region v2 stacks, each with its own metadata/data plane.
  • Cross-region replication service: async log-shipped (change data capture from Raft log) with p95 <5 s end-to-end.
  • Conflict resolution: last-writer-wins by Hybrid Logical Clock timestamp; for PUT-PUT same key, arbitrary resolution with notification to tenant.
  • Global table per bucket: uses Spanner-like TrueTime or Calvin-style deterministic order for the tiny subset of "global-strong" buckets (at latency cost).
  • Geo tiering: hot tier in region-of-origin, cold tier shipped to a cheaper partner region for ~40% cost reduction.
  • Disaster recovery: every region has 2 partner regions; losing 1 region degrades read latency by ~50 ms, losing 2 regions simultaneously is pager-worthy but RPO=seconds.

v4 — (out of scope, FYI)

  • Compute-near-storage (S3 Select, SELECT-over-object).
  • Post-quantum-ready KMS.
  • Storage-class-memory tier for ultra-hot (Optane-like; though Intel killed Optane, similar class tech persists).

10 Out-of-1-Hour Notes

10.1 Consistent hashing vs centralized placement

  • Consistent hashing (Dynamo): no placement driver, every node knows the ring, O(log N) rebalance on membership change. Simple, but:
    • Rack/zone diversity is hard — the ring doesn't know topology. Dynamo uses "virtual nodes + preference lists," still suboptimal under correlated failure.
    • Small objects pay the full per-request hashing cost; and for EC stripes you need to pick k+m locations under rack/zone constraints, which a plain ring can't natively express.
    • Result: Dynamo-style is great for equal-size, no-topology-constraint payloads (sessions, shopping carts). Not great for EC-first object storage.
  • Centralized PD: can optimize for topology, capacity, heat, and firmware diversity. Cost is complexity and that PD can be a bottleneck (mitigate: cache in frontends, PD only authoritative for changes).
  • Real-world: Colossus uses PD. Ceph uses CRUSH which is a clever middle ground — pseudo-random but topology-aware, so clients can compute placement from a topology map without a central oracle.

10.2 Disk formats

  • Ext4 fails at billions of files (inode table bloat). Don't do 1 chunk = 1 file.
  • XFS is better but same tail. Packing into chunkfiles is mandatory.
  • Raw block + in-house format (what Colossus D does): directly manage LBAs, no filesystem overhead. Fastest, most complex.
  • SMR (shingled) HDDs: 15–20% density improvement, but sequential-write-only. Perfect for cold tier append-only data. EC bundles rewritten rarely → SMR is a natural fit.
  • NVMe SSDs: for hot tier only; $/GB still ~20× HDD in 2026.

10.3 Write-through CDN vs origin reads

  • CDN at edge (CloudFront / Fastly / Google-CDN): caches public GETs; 10× reduction in origin QPS and ~50ms latency improvement globally. Invalidation on PUT via purge API or short TTL.
  • Origin direct for private objects, pre-signed URLs, large streaming content that doesn't fit edge caches.
  • Write-through CDN: almost never. Writes are rare enough that CDN-layer caching complicates consistency (CDN caches stale → reader sees old object) without bandwidth savings.
  • Design choice: serve public bucket policy via CDN; everything else direct to frontend.

10.4 Encryption at rest

  • Per-object DEK, randomly generated 256-bit AES key.
  • DEK encrypts object bytes in AES-GCM (authenticated encryption catches tampering in addition to CRC).
  • DEK is wrapped with per-tenant KEK stored in KMS.
  • KEK itself wrapped by a root key in HSM, rotated yearly.
  • Rotation: DEK rotation requires re-encrypt (expensive, usually lazy on next overwrite). KEK rotation is cheap — only re-wrap existing DEKs, re-key via KMS API.
  • Key deletion = effective object deletion for compliance — destroy KEK, DEKs unwrappable, objects unrecoverable. GDPR right-to-delete friendly (but see §10.7 for EC + deletion headache).
  • BYOK (customer-managed keys): tenant holds KEK in their own KMS; our system calls their KMS to unwrap DEK on every access. High-value enterprise feature.
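
A sketch of the envelope pattern above, using the third-party cryptography package's AESGCM (key sizes per the bullets; the KMS interaction is reduced to a local wrap/unwrap pair):

# Envelope encryption: a random per-object DEK encrypts the bytes (AES-GCM, so
# tampering is detected); the DEK is wrapped by the tenant KEK held in KMS.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def put_encrypt(body: bytes, kek: bytes):
    dek, nonce = AESGCM.generate_key(bit_length=256), os.urandom(12)
    ciphertext = AESGCM(dek).encrypt(nonce, body, None)
    wrap_nonce = os.urandom(12)                    # stand-in for a KMS wrap call
    wrapped_dek = AESGCM(kek).encrypt(wrap_nonce, dek, None)
    return ciphertext, nonce, wrapped_dek, wrap_nonce

def get_decrypt(ciphertext, nonce, wrapped_dek, wrap_nonce, kek: bytes) -> bytes:
    dek = AESGCM(kek).decrypt(wrap_nonce, wrapped_dek, None)   # KMS unwrap
    return AESGCM(dek).decrypt(nonce, ciphertext, None)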

10.5 Multi-tenant isolation & noisy neighbor

  • Per-tenant token bucket at LB — capped QPS and bandwidth.
  • Per-tenant shard affinity: huge tenants pinned to a subset of metadata shards so their LIST bursts don't harm others.
  • Priority classes on chunkserver disk scheduler: P0 (paid), P1 (standard), P2 (best-effort batch). Colossus D has per-request priority with weighted fair queuing.
  • CPU/mem isolation: containerized chunkserver and metadata workers; noisy tenant can't starve sibling containers.
  • IOPS SLO per tenant — S3 historically under-exposed this; GCP's Anywhere Cache and Azure's Premium tiers now expose guaranteed IOPS.

10.6 Cost per GB-month modeling

  • Raw HDD: ~$15/TB (2026 enterprise spot price, amortized). At 1.4× EC overhead, that's ~$21 of raw disk per logical TB.
  • Chassis + rack + power + cooling: ~1.5–2× disk cost over 5 years.
  • Networking amortized: ~$2/TB/year.
  • Software/SRE/amortized R&D: ~$3/TB/year.
  • Total warm tier: ~$8/TB/year = $0.67/TB-month = $0.00067/GB-month cost.
  • S3 Standard charges ~$0.023/GB-month — ~34× markup, funds availability, API, profit, and that the public price is a "retail" price with implied headroom.
  • Tape tier (Glacier Deep Archive): LTO-9 is ~$4/TB, robotic library + retrievals ~$2/TB/year effective; ~$0.00017/GB-month cost, retail $0.00099.
  • Erasure coding saves ~53% of disk cost vs 3× replication (1.4× vs 3.0× overhead) — at 1 EB scale that's $150M+/region/5-yr. This is the reason EC exists.
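
The unit conversions behind the warm and tape figures above, as a quick script:

# Convert the stated all-in costs ($/TB-year) to $/GB-month; compare to retail.
def gb_month(dollars_per_tb_year: float) -> float:
    return dollars_per_tb_year / 12 / 1000

warm, tape = gb_month(8.0), gb_month(2.0)          # stated all-in $/TB-year
print(f"warm ${warm:.5f}, tape ${tape:.5f}")       # -> $0.00067, $0.00017
print(f"S3 retail markup ~{0.023 / warm:.0f}x")    # -> ~34x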

10.7 Regulatory (GDPR right-to-delete meets EC)

  • "Delete" on object-level = metadata tombstone + eventual GC of chunks.
  • GC of EC-shards: all 14 shards of each affected stripe must be re-encoded (since a stripe may contain bytes from multiple objects if packed). For packed bundles, deletion = mark bytes dead; compaction rewrites bundle excluding dead bytes (like LSM-tree compaction). Until compaction runs, deleted bytes remain recoverable in theory.
  • Compliance SLA: GDPR requires "without undue delay," typically ≤30 days. Lifecycle scheduler ensures compaction runs within SLA on delete-heavy buckets.
  • Crypto-shredding as an alternative: delete the per-object DEK from KMS. Bytes remain but unreadable. Acceptable for most regulators; cryptographic one-way erasure in practice. Cheaper than re-compacting EC bundles.
  • Data sovereignty: bucket pinned to a specific region/country; no cross-region replication without tenant opt-in. Audit log proves location.

10.8 Observability (tail latency and how you find disk rot)

  • RED metrics: rate, error, duration per endpoint × tenant × region.
  • Histogram quantiles — not averages. p50/p90/p99/p99.9 per GET/PUT path. Tail is where SRE lives.
  • Disk-level wearout dashboard: SMART attributes, CRC error rates, reallocated sectors, disk-model groups (for correlated failure prediction).
  • Repair queue depth & bandwidth consumed — first-class SLO. Alert when EC reconstruct fanout per user read exceeds threshold (= stripes are under-healthy).
  • Metadata shard fan-out for LIST operations — detects hotspot buckets.
  • Per-tenant top-N bandwidth / IOPS dashboards — blame game when a tenant noisy-neighbors.
  • Chaos engineering: regular random disk/node/rack kills; enforces that MTTR SLO is real, not aspirational.
  • Tracing: OpenTelemetry spans from client → LB → FE → metadata → chunkserver on a 1-in-10 K sample of requests. Essential for debugging tail latency ("why did this particular PUT take 3 s?" → the span shows a metadata Paxos leader election during that window).

10.9 Capacity planning out-of-band

  • Growth forecast: 60%/year logical data growth per region (historically matched for hyperscalers).
  • Disk procurement lead time: ~26 weeks in 2026 (a knock-on effect of the AI-hardware supply crunch). Plan 9 months ahead.
  • Power ceiling in the DC is the real bottleneck, not floor space: a rack of 1800 disks draws ~15 kW. A 10 MW DC caps at ~670 racks — aligns with our 64-rack regional spec giving 10x growth headroom before needing a new DC.

10.10 What I did not have time to build but would

  • Cross-tenant dedup (cryptographic — convergent encryption) — saves 10–30% on CI / backup workloads. Privacy-sensitive; needs careful design.
  • Event notifications (bucket → Lambda/PubSub) for PUT/DELETE. Reliable delivery requires a durable outbox pattern on metadata commit.
  • Object Lock / WORM compliance mode — immutable retention for SEC/FINRA compliance.
  • Bucket inventory reports — S3 exports CSV/Parquet inventory to a destination bucket daily. Enables fast LIST for customers.
  • Intelligent tiering — ML-predicted access pattern per object, auto-moves between hot/warm.

Verification vs quality bar

  • SRE pager-carryable? Yes — §8 per-component failure-modes with detection/blast/mitigation/recovery. Added "staggered deploy" and thundering-herd mitigations (§8.7, 8.8).
  • Every arrow maps to real API (§4) or schema (§5)? Yes — arrows labeled with gRPC/HTTPS, map to commit_object, reserve_version, WRITE_SHARD, GET_CHUNK, health-ping.
  • Deep-dive L7 vs L6? L7 check: (a) quantified durability derivation from AFR+MTTR, (b) RS small-object penalty + bundle/Haystack rescue, (c) metadata DHT-vs-KV prefix-LIST trade-off, (d) named post-2020 S3 consistency change with why it's hard, (e) CRC vs SHA-256 vs TCP-checksum tower with failure-mode citation.
  • Durability math derived from AFR + k/n + MTTR? §7.1 — single-region 9 nines derived; acknowledge 11 nines requires geo-redundancy, not a single-region physical guarantee. This is the honest L7 answer; L6 would assert 11 nines and move on.
  • Rejected alternatives for every major choice with quantified trade-offs? Yes — object vs block vs file (§1), DHT vs sharded KV (§7.2), 3× vs RS vs LRC (§7.1), single master vs sharded (§7.2), CDN write-through vs not (§10.3).
  • Real systems named? GFS, Colossus (incl. D, curators), HDFS, Ceph (RADOS, RGW, CRUSH), S3, Dynamo, Swift, Haystack, f4, Azure Blob (LRC), Spanner, FoundationDB, BigTable, Glacier, Backblaze (AFR source), Stone/Partridge TCP checksum paper, Dean/Barroso tail-at-scale.

Interview delivery order (60 min)

  • 0–5 min: §1 restatement, clarifying questions, pick object store.
  • 5–10 min: §2 FR + §3 NFR and core BOE numbers (1 EB, 70K disks, 256 shards).
  • 10–20 min: §4 API + §5 Schema.
  • 20–30 min: §6 ASCII diagram (draw live), walk write + read paths.
  • 30–50 min: §7 one or two deep-dives, interviewer-driven (usually §7.1 EC or §7.2 metadata).
  • 50–60 min: §8 failure modes, §9 evolution path.
  • Skip §10 in interview unless asked; use for follow-up.