
Design a CI/CD Pipeline with Smart Test-Failure Handling

Run monorepo builds and tests efficiently while distinguishing genuine failures from flaky or infra-driven noise.

Scheduling · Observability · Availability · Cost / efficiency

Meta-strategy (not part of the answer — read once) #

This is the "unusual, requirements-driven" archetype. The interviewer will withhold a requirement and see whether I elicit it. My opening move is not a diagram — it is a 15-question clarifying volley, batched by theme, designed to reveal the real problem. The canonical trap: candidate assumes "large monorepo" means Google-scale 2B LOC; interviewer actually has in mind a 50M-LOC enterprise repo where the CI system itself is the bottleneck, or a multi-repo "federated monorepo" where the dependency graph crosses repo boundaries. Ask, then design.

Second trap: candidate jumps to "we'll use Bazel + remote cache." That is a tool, not a design. The L7 move is to derive the correctness invariants first (hermeticity, determinism, content-addressable inputs) and then defend Bazel against Buck2 / Pants / Gradle with numbers.

Third trap: candidate says "if a test passed on retry, it's flaky." This is statistically wrong and will mask real regressions. L7 answer uses Wilson score lower bound, a reproduction protocol, and cites Google TAP's CulpritFinder to distinguish flake from regression.
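Since the Wilson lower bound does the statistical heavy lifting throughout this design, a minimal sketch (function name and z = 1.96 for a 95% interval are illustrative):

```python
import math

def wilson_lower_bound(passes: int, runs: int, z: float = 1.96) -> float:
    """Lower bound of the 95% confidence interval on the true pass rate.

    Unlike the raw pass rate, this is conservative for small samples:
    1 pass out of 1 run scores ~0.21, not 1.0, so a test cannot be
    declared stable (or flaky) off a handful of runs.
    """
    if runs == 0:
        return 0.0
    p = passes / runs
    denom = 1 + z * z / runs
    center = p + z * z / (2 * runs)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * runs)) / runs)
    return (center - margin) / denom
```

With 95 passes in 100 attempt-1 runs this yields ~0.89, well below the raw pass rate of 0.95 — the gap is exactly the small-sample skepticism that a naive "passed on retry ⇒ flaky" rule lacks.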


1 Problem Restatement & Clarifying Questions #

Restatement (30 sec)

"Design a CI/CD pipeline for a large monorepo. Primary requirements: (1) detect test failures, (2) decide when a failure is flaky vs real, and (3) route notifications to the right owners without paging the whole team. Secondary: the system must hand off to a deploy orchestrator (canary / rollback). I'll clarify scope, then drive from there."

Clarifying questions — batched, with why I'm asking

A. Scale & shape (drives capacity math, sharding model)

  1. Repo size: LOC, file count, number of BUILD/test targets? (50M LOC + 1M targets is a very different problem from 500M LOC + 50M targets — the latter needs a cross-shard dependency graph service.)
  2. Commit velocity: CLs/day, peak CLs/sec? Distribution across the day (US hours vs global)? (Drives executor pool elasticity and scheduler queue depth.)
  3. Engineer count and team count? (Drives notification routing fan-out and CODEOWNERS cardinality.)
  4. Test population: unit / integration / e2e split? Are there long-running tests (>30 min)?
  5. Existing infra: do we own remote execution (a la RBE / BuildBuddy / EngFlow), or are we building on Jenkins / GitHub Actions / Tekton?

B. SLOs & cost envelope (drives HA, pool sizing, storage tier)

  6. Presubmit latency p50/p95? Is "10 min p50" aggressive or slack? (Google TAP internally runs p50 ~6 min.)
  7. CI system availability target — 99.9% implies ~8.7h downtime/yr; can devs merge without CI during an outage (break-glass)?
  8. Budget per CI-minute and per-commit cap? (Cost drives cache hit target; at $0.01/CPU-min × 1000 CPUs × 10 min = $100/run, unacceptable without caching.)

C. Correctness & policy (drives flake algorithm, quarantine policy)

  9. Pre-merge vs post-merge model: do CLs block on CI (Gerrit-style pre-submit gate), or merge-then-test (Git main-branch + post-submit)? Or both (Google TAP is hybrid)?
  10. What's the current false-fail rate? If devs already distrust CI ("I'll just rerun it"), that's a cultural problem masking a flake-detection deficit.
  11. Regulatory / audit requirements: SOX for deploy lineage? SOC2 for change management? PCI/HIPAA in scope? (Determines whether we need tamper-evident logs and signed artifact attestation.)

D. Ownership & notification (drives CODEOWNERS model, routing)

  12. Is there an existing CODEOWNERS system? Path-based only, or team-graph-based (e.g., Meta's fbpkg ownership)? Are there shared files with no clear owner?
  13. Notification channels: Slack/chat, email, pager (PagerDuty/OpsGenie)? Is paging reserved for post-submit main-branch breaks?
  14. On-call handoff: do teams have rotations that we can query, or is it "notify the author"?

E. Deployment hand-off

  15. What does the deploy orchestrator look like — blue/green, canary, regional rollout? Is CI signal the only gate, or does it combine with SLO burn-rate? (This determines whether CI produces a boolean "green/red" or a richer "build quality score.")

Assumed answers for this doc (stated explicitly — I'd verify in the interview)

Dimension | Value
Repo size | 200M LOC, 5M source files, 2M Bazel targets, 800K test targets
Engineers | 10K active developers across 1K teams
CL velocity | 100K CLs/day (avg 10/eng), peak 3 CL/sec
Presubmit SLO | p50 < 10 min, p95 < 25 min
Availability | 99.9% (dev productivity blocker)
Pre- vs post-submit | Both: presubmit smoke on affected targets, post-submit full at head
Cost target | ≤ $0.50/CL on average, ≤ $5/CL p99
Test history retention | 90d hot, 2y cold (audit)
Notifications | Slack primary, PagerDuty for post-submit main-break
Deploy | Canary + blue/green, CI produces multi-dimensional gate signal

2 Functional Requirements #

In scope (numbered, referenced later)

FR-1. CL submission & webhook ingestion. Accept webhook from VCS (GitHub Enterprise / Gerrit / internal). Dedupe by (repo, sha). Idempotent.

FR-2. Affected-target computation. Given a sha + base sha, compute the minimum set of build + test targets that could be affected. Must be conservative (no missed targets) and tight (no over-selection that explodes the run).
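A hedged sketch of the reverse-dependency walk FR-2 describes; `deps` and `file_owners` are hypothetical stand-ins for the Merkle-keyed dep-graph snapshot:

```python
from collections import defaultdict, deque

def affected_targets(deps: dict, file_owners: dict, changed_files: list) -> set:
    """Conservative affected-set: seed with the targets that directly own
    the changed files, then walk reverse-dependency edges to closure.

    `deps` maps target -> set of targets it depends on (the BUILD graph);
    `file_owners` maps source file -> owning target.
    """
    rdeps = defaultdict(set)              # target -> targets that depend on it
    for tgt, ds in deps.items():
        for d in ds:
            rdeps[d].add(tgt)
    seeds = {file_owners[f] for f in changed_files if f in file_owners}
    affected, queue = set(seeds), deque(seeds)
    while queue:                          # BFS to the reverse closure
        t = queue.popleft()
        for dependent in rdeps[t]:
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

The closure is what makes the set conservative (no missed targets); tightness comes from seeding only the targets that actually own a changed file rather than whole directories.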

FR-3. Pre-submit gate. Run affected tests before merge. Return a signal sufficient to block/allow merge. p50 < 10 min.

FR-4. Post-submit continuous integration. On every merge to main, run a broader suite (affected + downstream), feed flake classifier with high-quality signal.

FR-5. Flake detection & quarantine. Maintain a per-test flake score (statistically grounded). Auto-quarantine tests exceeding threshold. Auto-dequarantine when stable. Must not mask real regressions.

FR-6. Smart retries. Retry a failing test only if its prior statistics classify it as likely-flaky. Bounded retries with exponential backoff, deterministic seed variation.
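A sketch of the retry decision FR-6 describes; the constants (3 attempts max, 10 s base backoff) and the seed derivation are illustrative, not from the source:

```python
import hashlib

MAX_RETRIES = 3        # illustrative bound
BASE_BACKOFF_SEC = 10  # illustrative base

def retry_plan(test_id: str, run_id: str, classification: str, attempt: int):
    """Decide whether a failed attempt earns a retry, and with what
    backoff and seed. Returns (should_retry, backoff_sec, seed).

    Only tests whose prior statistics mark them SUSPECT or FLAKY are
    retried; a STABLE test that fails is treated as a real regression
    on the first attempt.
    """
    if classification not in ("SUSPECT", "FLAKY") or attempt >= MAX_RETRIES:
        return (False, 0, "")
    backoff = BASE_BACKOFF_SEC * (2 ** (attempt - 1))   # attempt 1 -> 10s, 2 -> 20s
    # Deterministic seed variation: the same (test, run, attempt) always
    # gets the same seed, so any retry is reproducible after the fact.
    seed = hashlib.sha256(f"{test_id}:{run_id}:{attempt}".encode()).hexdigest()[:8]
    return (True, backoff, seed)
```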

FR-7. Ownership resolution & notification routing. Resolve owner from CODEOWNERS + git-blame + path hierarchy. Batch notifications; escalate by severity tier. No spam on flake-only failures.

FR-8. Artifact production & attestation. Produce signed, hermetic build artifacts with provenance (SLSA level 3 target).

FR-9. Deploy hand-off. Emit a multi-dimensional quality signal (pass/fail, flake-adjusted confidence, coverage delta, perf delta) to the deploy orchestrator.

FR-10. Query & admin APIs. flaky_score(test_id), quarantine(test_id), rerun(cl_id, targets), get_status(cl_id).

Out of scope (explicit)

  • Lint / format enforcement (separate pre-commit hook fleet).
  • Full SAST/DAST pipeline details (we'll call into it as a gate, not design it).
  • Release comms / changelogs / JIRA integration.
  • Build-graph authoring tooling (BUILD-file IDE, genrule support).
  • Secrets management for deploys (assume a KMS exists).
  • Credentials / identity for the CI system itself (assume SPIFFE-style workload identity exists).

3 NFRs + Capacity Estimate #

NFRs

NFR | Target | Rationale
Availability (coordinator, scheduler, metadata svc) | 99.95% | Dev productivity blocker; one SRE-hour of downtime = 10K eng × 1h × $100/h ≈ $1M
Presubmit latency | p50 10 min / p95 25 min | Google TAP benchmark; >25 min devs context-switch
Flake false-fail rate | < 1% | Above this, trust erodes and "just rerun" culture starts
Flake false-quarantine rate | < 0.1% | Quarantining a real regression is catastrophic
Notification precision | > 95% (notif reaches an owner who can act) | Below 90% = learned-helplessness muting
Cost | $0.50/CL avg | $0.50 × 100K CL/day = $50K/day = $18M/yr, acceptable at 10K-eng scale
Durability of test history | 11 9's (S3/GCS class) | Audit + flake-model training data
Build reproducibility | 100% for hermetic targets | Invariant; if violated, cache is invalid

Back-of-envelope capacity

Test execution volume

Engineers:                          10,000
CLs/eng/day:                        10
CLs/day:                            100,000
Affected targets/CL (p50):          50
Affected targets/CL (p95):          500
Presubmit test runs/day (p50):      100K × 50  = 5M
Presubmit test runs/day (w/ p95):   ≈ 10M      [mix of p50 + p95]
Retry avg (flaky-classified only):  0.05 × extra runs ≈ 500K
Post-submit full runs/day:          ~10 batched head snapshots × 800K tests = 8M
                                    (per-merge runs collapse into sha-deduped head batches)
TOTAL test executions/day:          ~18M
Per second avg:                     ~210 tests/sec
Peak (3×):                          ~630 tests/sec

Executor pool sizing

Avg test wall time:                 30 sec (unit-heavy mix)
Wall-sec/day:                       18M × 30 = 540M sec = 6,250 core-days
Steady-state pool:                  ~6,250 cores running flat
With peak 3× overhead + headroom:   ~20,000 cores peak
Cost @ $0.01/core-min:              6,250 × 1440 × $0.01 = ~$90K/day raw compute
                                    → even a ~45% cache hit lands at $50K/day = $0.50/CL;
                                      we target higher for headroom

Cache hit ratio drives the cost curve non-linearly. At 50% hit we pay $45K/day; at 80% we pay $18K/day; at 95% we pay $4.5K/day. Our design target is 85% hit rate, justified below.
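The cost curve can be written down directly from the numbers above (helper names are illustrative):

```python
RAW_COMPUTE_PER_DAY = 90_000   # $/day with zero cache hits (6,250 cores flat)
CLS_PER_DAY = 100_000

def daily_cost(hit_rate: float) -> float:
    """Compute spend scales linearly with the miss rate: a cached action
    is never executed, so each hit removes its full execution cost."""
    return RAW_COMPUTE_PER_DAY * (1.0 - hit_rate)

def cost_per_cl(hit_rate: float) -> float:
    """What the $0.50/CL budget is actually written against."""
    return daily_cost(hit_rate) / CLS_PER_DAY
```

At the 85% design target this lands at ~$13.5K/day (~$0.14/CL), leaving room under the $0.50/CL budget for storage, network, and the coordination plane.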

Notification fan-out

Failures/day:                        ~2% of runs = 360K failure events
After flake filter (80% masked):     ~72K real failures
After batching (5-min windows/team): ~1K batched notifications/day
Pages (post-submit main-break only): ~50/day
                                     → well under humane pager budget
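A sketch of the batching/dedupe step that collapses ~72K real failures into ~1K notifications; the event shape is illustrative:

```python
from collections import defaultdict

def batch_notifications(events: list) -> dict:
    """Collapse one 5-minute window of real-failure events into per-team
    digests, deduped on (team, test_id, cl_day) — the dedupe key from the
    notif-router design. Each event: {"team", "test_id", "cl_day"}."""
    digests = defaultdict(set)
    for e in events:
        digests[e["team"]].add((e["team"], e["test_id"], e["cl_day"]))
    # One message per team per window, listing its deduped line items.
    return {team: sorted(items) for team, items in digests.items()}
```

Fifty failures of the same test in one window produce one line item, which is how the funnel stays under a humane pager budget.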

Storage

test_run rows:                      18M/day × 90d hot = 1.6B rows hot
                                    + 2y cold ≈ 13B rows cold
Row size (compressed columnar):     ~100 B with dictionary + RLE
Hot store (BigQuery/ClickHouse):    160 GB hot (easy)
Cold store (object storage parquet): 1.3 TB cold
test_metadata:                      800K tests × ~2 KB = 1.6 GB (Spanner-easy)
dep_graph snapshots:                2M targets × ~200 B × 1 snapshot/merge
                                    × 10K merges/day × 7d = ~28 TB/week
                                    → content-addressed dedupe + delta encoding cuts ~100×
                                    → ~280 GB/week hot

Query API load

`flaky_score(test_id)` calls:        Every test run needs it before retry decision
                                     → 18M/day = 210 QPS avg, 630 QPS peak
                                     → trivial if cached; backing store is Spanner
`get_status(cl_id)` UI polls:        10K eng × 10 polls/CL × 10 CLs = 1M polls/day
                                     → 12 QPS avg, bursty

4 High-Level API #

Design principles

  • gRPC primary (streaming for logs, bidi for coordinator ↔ executor). REST gateway for webhooks and UI.
  • Idempotency via client-provided request ID. All writes are upserts keyed on (cl_id, target_id, attempt).
  • Backward compat: proto3 with explicit field numbers; never reuse field numbers.

Service boundaries (each is a separately deployable binary)

ci-coordinator        — CL intake, lifecycle state machine
dep-analyzer          — BUILD graph traversal, affected-target compute
scheduler             — queueing, bin-packing onto executor pool
exec-controller       — executor fleet manager (health, drain, autoscale)
result-ingest         — writes to test_run; fans out to flake-pipeline
flake-classifier      — streaming stats + batch model refresh
quarantine-svc        — quarantine state machine + dequarantine scheduler
ownership-resolver    — CODEOWNERS + blame + hierarchy rules
notif-router          — batching, severity tiering, channel fanout
artifact-svc          — signed artifact store + provenance
deploy-gate           — emits quality signal to deploy orchestrator
metadata-svc          — test_metadata OLTP reads/writes (Spanner fronted)
history-svc           — test_run analytics reads (BigQuery/ClickHouse)
admin-api             — quarantine overrides, reruns, backfills

API surface (abridged proto)

// ====== ci-coordinator ======
service CICoordinator {
  // VCS webhook target (via REST gateway). Idempotent on (repo, sha, event_kind).
  rpc SubmitCL(SubmitCLRequest) returns (SubmitCLResponse);
  rpc GetStatus(GetStatusRequest) returns (stream CLStatus);  // streaming for UI
  rpc Rerun(RerunRequest) returns (RerunResponse);            // admin
  rpc Cancel(CancelRequest) returns (CancelResponse);
}

message SubmitCLRequest {
  string request_id = 1;       // idempotency key
  string repo = 2;
  string sha = 3;
  string base_sha = 4;
  string author = 5;
  repeated string modified_paths = 6;  // hint; dep-analyzer re-verifies
  SubmitMode mode = 7;         // PRESUBMIT | POSTSUBMIT | MANUAL
}

message CLStatus {
  string cl_id = 1;
  CLState state = 2;           // ANALYZING, QUEUED, RUNNING, PASSED, FAILED, FLAKY_RETRYING
  int64 targets_total = 3;
  int64 targets_passed = 4;
  int64 targets_failed = 5;
  int64 targets_flaky = 6;
  double quality_score = 7;    // 0.0–1.0; emitted to deploy-gate
}

// ====== flake-classifier ======
service FlakeClassifier {
  rpc FlakyScore(FlakyScoreRequest) returns (FlakyScoreResponse);
  rpc BatchFlakyScore(BatchFlakyScoreRequest) returns (BatchFlakyScoreResponse);
  rpc ShouldRetry(ShouldRetryRequest) returns (ShouldRetryResponse);
}

message FlakyScoreResponse {
  string test_id = 1;
  double wilson_lower_bound_pass = 2;   // 95% CI lower bound of true pass rate
  double ewma_30d_fail_rate = 3;
  int32 sample_size_30d = 4;
  FlakeClassification classification = 5;  // STABLE | SUSPECT | FLAKY | BROKEN | QUARANTINED
  string reason = 6;  // human-readable explanation
}
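One way the `classification` field could be derived from the other response fields — the thresholds here are illustrative and would be tuned against the false-fail (< 1%) and false-quarantine (< 0.1%) NFRs:

```python
def classify(wilson_lb_pass: float, ewma_fail: float,
             n_30d: int, quarantined: bool) -> str:
    """Map FlakyScoreResponse statistics onto a FlakeClassification value.
    Thresholds (20 samples, 0.50 / 0.95 / 0.99 bounds, 0.05 EWMA) are
    illustrative, not from the source."""
    if quarantined:
        return "QUARANTINED"
    if n_30d < 20:
        return "SUSPECT"          # too little data to call it either way
    if wilson_lb_pass < 0.50:
        return "BROKEN"           # failing more often than passing
    if wilson_lb_pass < 0.95 or ewma_fail > 0.05:
        return "FLAKY"
    if wilson_lb_pass < 0.99:
        return "SUSPECT"
    return "STABLE"
```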

// ====== ownership-resolver ======
service OwnershipResolver {
  rpc ResolveOwners(ResolveOwnersRequest) returns (ResolveOwnersResponse);
}

message ResolveOwnersResponse {
  repeated Owner owners = 1;
  ResolutionStrategy strategy_used = 2;  // CODEOWNERS | GIT_BLAME | PATH_HIERARCHY | FALLBACK
  double confidence = 3;
}

// ====== notif-router ======
// internal only — called by coordinator / post-submit watcher
service NotifRouter {
  rpc Notify(NotifRequest) returns (NotifResponse);
}

message NotifRequest {
  string event_id = 1;          // idempotency
  Severity severity = 2;        // INFO | WARNING | ERROR | PAGE
  Category category = 3;        // PRESUBMIT_FAIL | POSTSUBMIT_BREAK | QUARANTINE_FAIL_UNDER | ...
  repeated Owner owners = 4;
  string cl_id = 5;
  repeated string test_ids = 6;
  string summary_md = 7;
  map<string, string> labels = 8;  // team, service, env
}

// ====== quarantine-svc ======
service Quarantine {
  rpc Quarantine(QuarantineRequest) returns (QuarantineResponse);
  rpc Dequarantine(DequarantineRequest) returns (DequarantineResponse);
  rpc ListQuarantined(ListQuarantinedRequest) returns (ListQuarantinedResponse);
}

Webhook contract (VCS → coordinator)

POST /v1/vcs/webhook
Headers:
  X-Signature: HMAC-SHA256(secret, body)   ← verified before parse
  X-Event-Id: UUID                         ← dedupe key
  X-Delivery-Attempt: int                  ← VCS retry semantics
Body: {repo, ref, before_sha, after_sha, author, paths[]}

The coordinator is an at-least-once consumer. Dedup is via X-Event-Id in Redis with a 24h TTL. On replay, a previously submitted CL returns its existing cl_id (idempotent).
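A minimal sketch of the verify-then-dedupe path, with an in-memory TTL map standing in for Redis and a hard-coded secret standing in for the real KMS-held key:

```python
import hashlib
import hmac
import time

SECRET = b"shared-webhook-secret"   # illustrative; the real key lives in a KMS

def verify_signature(body: bytes, signature_hex: str) -> bool:
    """Check X-Signature = HMAC-SHA256(secret, body) in constant time,
    before the body is ever parsed."""
    expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)

class EventDeduper:
    """In-memory stand-in for the 24h-TTL Redis set keyed on X-Event-Id."""

    def __init__(self, ttl_sec: float = 24 * 3600):
        self.ttl = ttl_sec
        self.seen = {}              # event_id -> first-seen timestamp

    def first_delivery(self, event_id: str, now=None) -> bool:
        """True on the first delivery; False on a replay, in which case
        the caller responds with the existing cl_id."""
        now = time.time() if now is None else now
        self.seen = {k: t for k, t in self.seen.items() if now - t < self.ttl}
        if event_id in self.seen:
            return False
        self.seen[event_id] = now
        return True
```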


5 Data Schema #

Storage engines — per-table justification

Table | Engine | Why
test_run (fact) | BigQuery (or ClickHouse on-prem) | 10s of billions of rows; columnar scan for flake-model training; partition by day, cluster by test_id
test_run_hot (last 7d) | Bigtable / Cassandra | Point lookup by (test_id, ts) for flake scorer at < 10ms
test_metadata | Spanner (or CockroachDB / Vitess-MySQL) | OLTP writes on quarantine state; strong consistency required to avoid split-brain quarantine
dep_graph_snapshot | Content-addressed blob store (S3/GCS) + SQL index | Immutable, Merkle-keyed; content-dedupe across CLs
notification_policy | Spanner | Low-volume, consistent reads
ownership | Spanner + in-memory cache | Derived from CODEOWNERS; rebuilt on merge; cached per dep-analyzer pod
artifacts | Object store + index in Spanner | Immutable, content-addressed

Schema definitions (PostgreSQL/Spanner DDL style)

-- Facts: immutable append-only
CREATE TABLE test_run (
  run_id            BYTES(16) PRIMARY KEY,    -- UUIDv7 (time-ordered)
  cl_id             STRING(40) NOT NULL,
  repo              STRING(256) NOT NULL,
  sha               STRING(40) NOT NULL,
  test_id           STRING(512) NOT NULL,      -- bazel-label format: //foo/bar:test
  target_hash       BYTES(32) NOT NULL,        -- content-hash of inputs (cache key)
  start_ts          TIMESTAMP NOT NULL,
  end_ts            TIMESTAMP,
  result            STRING(16),                -- PASS | FAIL | TIMEOUT | ERROR | SKIPPED
  attempt           INT64 NOT NULL,            -- 1-indexed
  runner_id         STRING(64),                -- executor node ID
  sandbox_id        STRING(64),                -- for post-mortem
  exit_code         INT64,
  duration_ms       INT64,
  mem_peak_mb       INT64,
  retry_reason      STRING(64),                -- NULL | FLAKY | INFRA | USER_MANUAL
  cache_outcome     STRING(16),                -- MISS | HIT_LOCAL | HIT_REMOTE
  flake_seed_id     STRING(32),                -- seed variation across retries
  logs_uri          STRING(512),               -- pointer to object store
) PARTITION BY DATE(start_ts);

CREATE INDEX idx_test_run_by_test ON test_run(test_id, start_ts DESC);
CREATE INDEX idx_test_run_by_cl   ON test_run(cl_id, test_id);

-- Mutable per-test state
CREATE TABLE test_metadata (
  test_id                  STRING(512) PRIMARY KEY,
  owners                   ARRAY<STRING(128)>,  -- team IDs
  quarantine_state         STRING(16),          -- NONE | AUTO | MANUAL | RELEASING
  quarantine_since         TIMESTAMP,
  quarantine_reason        STRING(512),
  quarantine_ticket        STRING(128),         -- JIRA link
  flake_score              FLOAT64,             -- cached Wilson lower bound
  flake_class              STRING(16),          -- STABLE|SUSPECT|FLAKY|BROKEN
  flake_updated_at         TIMESTAMP,
  runs_7d                  INT64,
  fails_7d                 INT64,
  last_run_at              TIMESTAMP,
  last_pass_at             TIMESTAMP,
  last_fail_at             TIMESTAMP,
  consecutive_fails        INT64,               -- signal for QUARANTINED-now-real-failing
  tags                     ARRAY<STRING(64)>,   -- e.g., "flaky-historically", "network-dependent"
);

-- Dep graph snapshots (content-addressed)
CREATE TABLE dep_graph_snapshot (
  sha              STRING(40) PRIMARY KEY,
  merkle_root      BYTES(32) NOT NULL,          -- Merkle root of BUILD + sources
  blob_uri         STRING(512) NOT NULL,        -- pointer to compressed snapshot
  parent_sha       STRING(40),
  delta_uri        STRING(512),                 -- forward delta from parent
  created_at       TIMESTAMP,
);

-- Ownership (derived from CODEOWNERS at each main-branch sha)
CREATE TABLE ownership (
  path_glob        STRING(512) NOT NULL,        -- e.g., "/srv/payments/**"
  team             STRING(128) NOT NULL,
  priority         INT64 NOT NULL,              -- for tiebreak (nearest-ancestor wins)
  valid_from_sha   STRING(40),
  valid_to_sha     STRING(40),                  -- NULL = current
  PRIMARY KEY (path_glob, valid_from_sha)
);

-- Notification policy
CREATE TABLE notification_policy (
  team             STRING(128) PRIMARY KEY,
  presubmit_channel        STRING(128),          -- slack channel
  postsubmit_channel       STRING(128),
  page_escalation          STRING(128),          -- pagerduty service id
  batch_window_sec         INT64 DEFAULT 300,
  severity_floor           STRING(16),           -- min severity to notify
  quiet_hours              STRING(64),           -- cron-like
  allow_mention_author     BOOL DEFAULT TRUE,
);

-- Quarantine audit log (tamper-evident)
CREATE TABLE quarantine_event (
  event_id         BYTES(16) PRIMARY KEY,
  test_id          STRING(512),
  action           STRING(16),                   -- QUARANTINE | DEQUARANTINE | OVERRIDE
  actor            STRING(128),                  -- system | user-id
  ts               TIMESTAMP,
  reason           STRING(512),
  evidence_json    JSON,                         -- stats snapshot
  prev_hash        BYTES(32),                    -- hash-chain for tamper detection
  this_hash        BYTES(32),
);
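The prev_hash/this_hash chain in quarantine_event can be sketched as follows; the canonical-JSON serialization is an assumption (any stable serialization works):

```python
import hashlib
import json

GENESIS = b"\x00" * 32   # illustrative sentinel preceding the first row

def chain_hash(prev_hash: bytes, event: dict) -> bytes:
    """this_hash = SHA-256(prev_hash || canonical-JSON(event)).
    Sorted keys give a stable serialization so anyone can re-verify."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(prev_hash + payload).digest()

def verify_chain(events: list, hashes: list) -> bool:
    """Recompute the chain from genesis; editing or dropping any row
    invalidates every later this_hash."""
    prev = GENESIS
    for event, expected in zip(events, hashes):
        prev = chain_hash(prev, event)
        if prev != expected:
            return False
    return True
```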

Query API (examples)

-- "is test X flaky?" — O(1) from test_metadata
SELECT flake_class, flake_score, runs_7d, fails_7d
FROM test_metadata WHERE test_id = @test_id;

-- "show flakes I own" — drives dashboards & on-call handoff
SELECT test_id, flake_score, last_fail_at
FROM test_metadata
WHERE 'team-payments' IN UNNEST(owners)
  AND flake_class IN ('FLAKY', 'SUSPECT')
ORDER BY flake_score ASC;  -- most flaky first

-- "has this test regressed since quarantine?" — the critical anti-masking query
SELECT COUNT(*) AS postquar_fails
FROM test_run r
JOIN test_metadata m USING (test_id)
WHERE r.test_id = @test_id
  AND r.start_ts > m.quarantine_since
  AND r.result = 'FAIL'
  AND r.attempt = 1;   -- first attempts only; ignore retries

6 System Diagram — CENTERPIECE #

                               ┌──────────────────────────────────────────────┐
                               │                 DEVELOPERS                   │
                               │   (10K eng · 100K CL/day · 1K teams)         │
                               └────┬───────────────────────────┬─────────────┘
                                    │ git push                   │ UI polls
                                    ▼                             ▼
                        ┌─────────────────────┐      ┌────────────────────────┐
                        │   VCS (Gerrit /     │      │   CI UI (Next.js)      │
                        │   GitHub Enterprise)│      │   get_status (gRPC-Web)│
                        └─────┬───────────────┘      └─────────┬──────────────┘
                              │ webhook (REST, HMAC-signed)    │
                              │ ≤ 3 CL/s peak, 100K/day        │ 12 QPS avg
                              ▼                                 ▼
        ┌─────────────────────────────────────────────────────────────────────────┐
        │                        REST GATEWAY (Envoy + auth filter)               │
        │  - HMAC verify   - rate limit   - TLS termination   - OAuth for UI      │
        └──────────────────────────────┬──────────────────────────────────────────┘
                                       │ gRPC, X-Event-Id dedupe, mTLS (SPIFFE)
                                       ▼
        ┌──────────────────────────────────────────────────────────────────────────┐
        │                    ci-coordinator (5 replicas, stateless)                │
        │  State machine: NEW → ANALYZING → QUEUED → RUNNING → {PASSED|FAILED|     │
        │                                                       FLAKY_RETRYING}    │
        │  Persists CL lifecycle in Spanner (1 row/CL, indexed by (repo, sha))     │
        └────────┬──────────────────────────┬───────────────────┬──────────────────┘
                 │ AffectedTargets RPC       │ Schedule RPC      │ emit status
                 │ (avg 12ms, p99 180ms)     │                   │ (fan-out to UI)
                 ▼                           ▼                   ▼
  ┌──────────────────────┐    ┌─────────────────────┐   ┌─────────────────────┐
  │   dep-analyzer       │    │     scheduler       │   │  status-pubsub      │
  │  (5 replicas)        │    │   (3 replicas HA)   │   │  (Redis pub/sub     │
  │                      │    │                     │   │   or Kafka)         │
  │  1. Load parent      │    │  - Priority queue   │   └─────────────────────┘
  │     dep_graph        │    │  - Bin-pack onto    │
  │     (content-hash)   │    │    executor shards  │
  │  2. Compute delta    │    │  - Backpressure     │
  │     Merkle tree      │    │    from exec pool   │
  │  3. Walk reverse-dep │    │  - Weighted fair    │
  │     graph → affected │    │    (presubmit > post)│
  │  4. Return target    │    └──────┬──────────────┘
  │     list + Bazel     │           │ RunTask gRPC (streaming bidi)
  │     digests          │           │ up to ~20K cores peak
  └──────────────────────┘           ▼
                                  ┌────────────────────────────────────────────┐
                                  │         EXECUTOR POOL (autoscaled)         │
                                  │  2K–20K workers across 5 cells (zones)     │
                                  │  Each: gVisor/runc sandbox, cgroup-limited │
                                  │        hermetic FS overlay, no network by  │
                                  │        default, content-addressed inputs   │
                                  │                                            │
                                  │   ┌──────────┐  ┌──────────┐  ┌──────────┐ │
                                  │   │ worker-1 │  │ worker-2 │  │ worker-N │ │
                                  │   │ gVisor   │  │ gVisor   │  │ gVisor   │ │
                                  │   └────┬─────┘  └────┬─────┘  └────┬─────┘ │
                                  └────────┼─────────────┼──────────────┼──────┘
                                           │ action-cache lookup        │
                                           │ (Bazel RE API, CAS)        │
                                           ▼                             │
                                  ┌───────────────────────────┐          │
                                  │   REMOTE CACHE (CAS)      │          │
                                  │   - Content-addressed     │          │
                                  │   - Regional replicas     │◄─────────┘
                                  │   - ~85% hit rate target  │
                                  │   - 100 TB, LRU + pinning │
                                  └───────────┬───────────────┘
                                              │ on miss: execute + write back
                                              ▼
                                   ┌──────────────────────┐
                                   │  result-ingest       │
                                   │  (Kafka producer)    │
                                   │  writes: test_run    │
                                   └──────┬───────────────┘
                                          │ Kafka topic: test.run.results
                                          │ ~200 msg/s avg, 600 peak
                                          │ partitioned by test_id
                                          ▼
             ┌────────────────────────────┼──────────────────────────────────┐
             │                            │                                  │
             ▼                            ▼                                  ▼
  ┌─────────────────────┐   ┌─────────────────────────────┐   ┌──────────────────────┐
  │ history-svc         │   │      FLAKE CLASSIFICATION    │   │  post-submit        │
  │ writes to:          │   │         PIPELINE             │   │  bisector           │
  │  - Bigtable (hot)   │   │  (see sub-diagram below)    │   │  (on main-break)    │
  │  - BigQuery (cold)  │   └──────────────┬──────────────┘   └──────┬──────────────┘
  └─────────────────────┘                  │                          │
                                           ▼                          ▼
                                 ┌─────────────────────┐    ┌─────────────────────┐
                                 │  quarantine-svc     │    │  bisect-controller  │
                                 │  - state machine    │    │  - git-bisect auto  │
                                 │  - audit log        │    │  - re-runs at mid   │
                                 │  - dequar scheduler │    │    commits          │
                                 └─────────┬───────────┘    └─────────┬───────────┘
                                           │                          │
                                           └────────────┬─────────────┘
                                                        │ FAIL / QUARANTINE / REGRESSION event
                                                        ▼
                                        ┌──────────────────────────────────────┐
                                        │        notif-router                  │
                                        │  1. ownership-resolver RPC           │
                                        │     (CODEOWNERS + blame-walk)        │
                                        │  2. severity tiering                 │
                                        │  3. batching window (5 min default)  │
                                        │  4. dedupe by (team, test, cl_day)   │
                                        │  5. channel fanout                   │
                                        └────┬────────────┬────────────┬───────┘
                                             ▼            ▼            ▼
                                         ┌──────┐    ┌──────┐    ┌────────────┐
                                         │ Slack│    │Email │    │PagerDuty   │
                                         │ API  │    │ SMTP │    │(post-submit│
                                         └──────┘    └──────┘    │ only)      │
                                                                 └────────────┘

     ┌──────────────────────────────────────────────────────────────┐
     │                    ARTIFACT / DEPLOY PATH                     │
     │                                                               │
     │  executor pool ──► artifact-svc (sign, SLSA attestation)     │
     │                         │                                     │
     │                         ▼                                     │
     │               ┌─────────────────────┐                         │
     │               │  artifact registry  │                         │
     │               │  (OCI, signed, CAS) │                         │
     │               └────────┬────────────┘                         │
     │                        │                                     │
     │                        ▼                                     │
     │               ┌─────────────────────┐                         │
     │               │   deploy-gate       │                         │
     │               │  (quality score)    │                         │
     │               └────────┬────────────┘                         │
     │                        │                                     │
     │                        ▼                                     │
     │               ┌─────────────────────┐                         │
     │               │  deploy-orch        │                         │
     │               │  (Spinnaker/Argo)   │                         │
     │               │  - blue/green       │                         │
     │               │  - canary + SLO     │                         │
     │               │  - auto-rollback    │                         │
     │               └─────────────────────┘                         │
     └──────────────────────────────────────────────────────────────┘

Sub-diagram: flake classification pipeline

Kafka topic: test.run.results (partitioned by test_id, 64 partitions)
       │
       ├──────────────────────────────────────────┐
       ▼                                          ▼
┌───────────────────────┐              ┌────────────────────────────┐
│  streaming-scorer     │              │  batch-trainer             │
│  (Flink/Dataflow)     │              │  (BigQuery + Spark nightly)│
│                       │              │                            │
│  Per test_id window:  │              │  - Fits richer model:      │
│   - 30d rolling stats │              │     · features: test dur,  │
│   - Wilson lower      │              │       mem spike, time of   │
│     bound of pass-    │              │       day, shard id,       │
│     rate              │              │       runner class, host   │
│   - EWMA of fail-rate │              │       pressure, network    │
│   - attempt-1 vs      │              │       isolation state      │
│     retry delta       │              │     · gradient-boosted     │
│   - consecutive fails │              │       tree → P(flake|ctx)  │
│     since quarantine  │              │  - Outputs model blob to   │
│                       │              │    S3; streaming-scorer    │
│  Emits:               │              │    hot-reloads hourly      │
│   - update to         │              │  - Labels from human       │
│     test_metadata     │              │    JIRA triage + bisect    │
│     (Spanner upsert)  │              │    ground truth            │
│   - FLAKE_DETECTED    │              └────────────────────────────┘
│     event to Kafka    │
│   - REGRESSION_UNDER_ │
│     QUARANTINE event  │
└───────────────────────┘
       │
       ▼
  quarantine-svc    (decision policy engine)
  notif-router      (consumes events)

Diagram legend: every arrow has a contract

| # | Arrow | Protocol | QPS | Payload |
|---|-------|----------|-----|---------|
| 1 | VCS → Gateway | HTTPS webhook, HMAC | 3/s peak | ~2 KB JSON |
| 2 | Gateway → coordinator | gRPC, mTLS | 3/s peak | SubmitCLRequest |
| 3 | coordinator → dep-analyzer | gRPC | 3/s peak | (repo, sha, base_sha) |
| 4 | dep-analyzer → dep_graph store | object read | 3/s peak | ~10 MB compressed Merkle delta |
| 5 | coordinator → scheduler | gRPC | 3/s peak | target list (avg 50, p99 5K) |
| 6 | scheduler → executor | gRPC bidi stream | 630/s peak | RunTask (inputs digests, limits) |
| 7 | executor → CAS | REST, CAS protocol | high | content-addressed get/put |
| 8 | executor → result-ingest | gRPC | 210/s avg | TestRunResult proto, ~4 KB |
| 9 | result-ingest → Kafka | Kafka producer | 210/s avg | same, partitioned by test_id |
| 10 | Kafka → streaming-scorer | Kafka consumer | 210/s avg | |
| 11 | streaming-scorer → Spanner | gRPC upsert | ~50/s (after windowing) | test_metadata row |
| 12 | streaming-scorer → Kafka (events) | producer | ~0.5/s | FLAKE_DETECTED / REGRESSION |
| 13 | quarantine-svc → Spanner | gRPC | low | test_metadata update + audit row |
| 14 | notif-router → ownership-resolver | gRPC | 1/s (batched) | paths + sha |
| 15 | notif-router → Slack/PD | HTTPS | ~0.01/s | batched notification |
| 16 | executor → artifact-svc | gRPC | varies | signed build artifacts |
| 17 | deploy-gate → deploy-orch | webhook | 10K/day | quality score + attestation |

7 Deep-Dives (3 critical topics) #

7.1 Incremental build & test selection via target graph

Why critical. At 2M targets, running "everything" per CL is physically impossible — 800K test targets × 30s = 6,700 core-hours ≈ $3,400/CL at spot rates. The entire economic viability of monorepo CI hinges on computing an affected set that is sound (no missed targets) yet as tight as possible. Incorrect pruning either (a) misses a regression (catastrophic) or (b) over-selects and burns budget.
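That envelope can be sanity-checked in a couple of lines (the ~$0.50/core-hour spot rate is an assumption for illustration; actual prices vary by provider):

```python
# Back-of-envelope: cost of running the full test suite for one CL.
TEST_TARGETS = 800_000          # test targets in the repo
AVG_RUNTIME_S = 30              # average test runtime
SPOT_PRICE_PER_CORE_HOUR = 0.50 # assumed spot rate, provider-dependent

core_hours = TEST_TARGETS * AVG_RUNTIME_S / 3600          # ~6,700 core-hours
cost_per_cl = core_hours * SPOT_PRICE_PER_CORE_HOUR       # ~$3,300-3,400

print(f"{core_hours:,.0f} core-hours ≈ ${cost_per_cl:,.0f}/CL")
```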

Alternatives — quantified.

| Option | How it computes "affected" | False-negative risk | Scale limit | Adoption |
|---|---|---|---|---|
| Path-based (e.g., Jenkins Pipeline `when { changeset }`) | Regex on changed files → pipeline branches | HIGH. Misses transitive deps. | 1K targets | Low (legacy) |
| Static call-graph + IDE indexer (e.g., Kythe) | Compile-time symbol graph | Medium. Reflection / DI / codegen miss edges. | 100K targets | Research |
| BUILD-graph + content-hashed inputs (Bazel / Buck2 / Pants) | Declared deps in BUILD files; Merkle tree of inputs; action cache keyed on digest | LOW iff rules are hermetic | 10M+ targets (Google scale) | High (Google, Meta, Stripe, Dropbox, Spotify) |
| ML-predicted TIA (test impact analysis, Microsoft paper) | Feature model of changed files → predicted failing tests | HIGH. Probabilistic; used only as hint. | Unbounded | Supplementary (Microsoft, Facebook TestInsights) |
| Dynamic call graph (coverage-recorded) | Record per-test coverage; on change, re-run tests covering changed lines | Medium. Coverage data goes stale between recordings. Reflection-safe. | 1M targets | Niche (Facebook Predictive Test Selection uses as one input) |

Chosen: BUILD-graph + Merkle/CAS (Bazel-style), ML-TIA as hint only. Justification:

  1. Correctness invariant: if rules are hermetic (declared inputs are complete), the Merkle-tree digest of a target's inputs deterministically identifies its output. Cache hit ⇔ equivalent execution. This is the one class of system where "cache" is correctness, not optimization.
  2. Why Bazel over Buck2: Buck2 (Meta, 2023 OSS) has a theoretically superior Starlark evaluator (daemon-less, parallel) and better DX, but in 2026 Bazel's remote execution ecosystem (BuildBuddy, EngFlow, NativeLink) is mature and the cross-lang support (C++, Java, Python, Go, Rust, JS) is 3 years ahead. At 10K-eng scale, tooling maturity dominates. I'd reevaluate in 2–3 years.
  3. Why not Pants v2 or Gradle: Pants is excellent for Python-heavy orgs but weaker C++/Java. Gradle's configuration phase is not content-addressable out of the box, breaking cache correctness. Rejected.

Earned-secret depth: the cache-correctness trap.

The hermeticity invariant looks simple ("declare all inputs") but fails in practice in ~5 classes of bugs, each of which silently corrupts the cache:

  1. Nondeterminism in compilers and codegen — e.g., __DATE__ timestamps, hash-map iteration order in generated output (Go's map iteration is intentionally randomized). Mitigation: SOURCE_DATE_EPOCH, sorting map keys before emitting output, -frandom-seed=, plus reproducible-builds verification.
  2. Absolute path leakage — paths in debug info (DW_AT_comp_dir). Mitigation: --remap-path-prefix, strip-nondeterminism tool.
  3. File system ordering — glob results differ across platforms. Mitigation: sorted glob + .bazelrc canonicalization.
  4. Toolchain drift — developer-local vs remote compiler. Mitigation: toolchain containers, --incompatible_enable_cc_toolchain_resolution.
  5. Environment variables — mtime, hostname, locale. Mitigation: --incompatible_strict_action_env, sealed env allowlist.

At Google, a 2019 internal audit found ~0.3% of targets were non-hermetic and occasionally poisoned the cache. The mitigation was a cache-poisoning detector: re-execute 1% of cache hits and compare outputs; mismatch triggers a cache invalidation + rule-author notification. I'd build this from day 1 (cost: 1% extra compute; value: catastrophic-bug prevention).

Test selection correctness proof sketch.

Let T = set of test targets. Let reverse_deps(x) = transitive reverse deps of node x. For a CL touching file set F:

affected = ⋃_{f ∈ F} reverse_deps(target_of(f))  ∩  T

This is sound (no missed affected target) iff the BUILD graph is complete (hermeticity). It's tight modulo runtime-only deps (reflection, DI) — those are caught by the post-submit full suite and by probabilistic ML-TIA as a sanity check.
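The affected-set computation can be sketched as a reverse-BFS over a toy dependency graph (target names are illustrative; `deps` maps each target to its declared dependencies):

```python
from collections import defaultdict, deque

def affected_targets(deps: dict, changed: set) -> set:
    """Transitive reverse deps of the changed targets.
    Sound (no missed affected target) iff `deps` is complete, i.e. hermetic."""
    rdeps = defaultdict(set)
    for tgt, ds in deps.items():
        for d in ds:
            rdeps[d].add(tgt)
    seen, queue = set(changed), deque(changed)
    while queue:
        node = queue.popleft()
        for parent in rdeps[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

# Toy graph: test_a and test_b transitively depend on lib_a; test_c does not.
deps = {
    "lib_b":  {"lib_a"},
    "test_a": {"lib_a"},
    "test_b": {"lib_b"},
    "test_c": {"lib_c"},
}
tests = {"test_a", "test_b", "test_c"}
affected = affected_targets(deps, {"lib_a"}) & tests  # intersect with T
```

A CL touching only `lib_a` selects `test_a` and `test_b`, leaving `test_c` unscheduled.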

Failure modes & mitigations.

| Failure | Detection | Mitigation |
|---|---|---|
| Non-hermetic rule pollutes cache | Random 1% re-execution mismatch | Auto-invalidate + quarantine rule, notify author |
| BUILD file drift from source | Dep-lint on every merge | Block merge on missing dep |
| Merkle-tree computation bug | Reproducible-build nightly job | CI-of-CI: compute digest twice, compare |
| Cache size bloat | Hit-rate dashboards per-cell | LRU with target-age weighting; pin hot deps |
| RBE latency spike | p99 sched latency metric | Multi-region failover; sched-level degradation mode |

Real systems named. Bazel + BuildBuddy RBE, Buck2 (Meta), Google TAP + ForgeCluster, Meta's Sapling + CI, Pants v2, BuildKit (Docker), Nix + Hydra (reference for content-addressable builds done right).


7.2 Flake detection statistics

Why critical. Naive "passed on retry = flaky" logic masks real regressions. A test that is truly broken by a CL will pass on retry only if the failure was transient (e.g., a race condition that manifests 50% of the time). Misclassifying it as flaky leads to quarantine, then to merge-then-regression. The single most common CI failure mode at > 1K-eng scale is flaky tests masking real bugs (Google's internal incident data shows this is 30–40% of escape-to-production bugs when flake classifier is weak).

Alternatives — quantified.

| Option | Mechanism | False-flake rate | False-stable rate | Data needed |
|---|---|---|---|---|
| "Passed on retry → flaky" | Boolean | HIGH (~10%) | HIGH (~5%) | 1 run |
| Simple pass-rate threshold (e.g., < 98% in last 7d) | Point estimate, no CI | Medium | Medium | 10+ runs |
| Wilson score lower bound (95% CI) on pass-rate | Bayesian-adjacent | LOW (<1%) with n ≥ 50 | Low | 50+ runs |
| EWMA of fail-rate + Bayesian prior | Time-weighted, accommodates drift | Low | Low | 30+ runs over time |
| Reproduction protocol (run N times w/ different seeds) | Direct experimental evidence | Very low | Very low | N extra runs per suspect |
| ML classifier (gradient-boosted features) | Context-aware (host, time, load) | Low when trained | Medium cold start | Labeled training set |
| TAP CulpritFinder-style bisection | Re-run at bisected commits to isolate flake vs regression | Very low | Very low | 10–15 extra runs on post-submit |

Chosen: layered approach.

Layer 0 (streaming, per run): compute Wilson lower bound + EWMA over 30d window
Layer 1 (retry decision):     retry only if classification ∈ {FLAKY, SUSPECT}
Layer 2 (post-submit break):  CulpritFinder bisect to distinguish flake from regression
Layer 3 (auto-quarantine):    require n ≥ 50 samples AND Wilson_LB < 0.95 AND reproduced on seed-varied rerun
Layer 4 (batch trainer, nightly): ML model refines priors with features; feeds back to Layer 0 as a prior

Earned-secret depth: why Wilson, and why n≥50.

Wilson score interval is the correct CI for a binomial proportion when n is small. For k failures in n runs, pass-rate p̂ = (n-k)/n, the 95% CI lower bound is:

        p̂ + z²/(2n) − z · √(p̂(1−p̂)/n + z²/(4n²))
L =   ───────────────────────────────────────────
                     1 + z²/n

with z = 1.96.

Worked example (the insight the interviewer is probing for):

  • n=10, k=2 → p̂ = 0.80, Wilson LB ≈ 0.49. Conclusion: the true pass-rate could plausibly be anywhere from ~49% to ~94%; far too few samples to classify as flaky.
  • n=50, k=1 → p̂ = 0.98, Wilson LB ≈ 0.89. Still not confident the true rate clears the 95% threshold. Needs reproduction.
  • n=100, k=5 → p̂ = 0.95, Wilson LB ≈ 0.89. Now we can classify as SUSPECT.
  • n=500, k=30 → p̂ = 0.94, Wilson LB ≈ 0.92. Classify as FLAKY.

This is why "passed on retry" is insufficient: n=2 is meaningless. Google's TAP requires n ≥ 50 over time before auto-quarantine, and always runs a seed-varied reproduction (different shard, different host, different time-of-day) to prevent a real regression from being misclassified during a peak-load window.
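The formula above is a few lines of stdlib code (a direct implementation without continuity correction, so small-n values can differ by a few points from other calculators):

```python
import math

def wilson_lower_bound(n: int, k: int, z: float = 1.96) -> float:
    """95% Wilson lower bound on pass-rate, for k failures in n runs."""
    if n == 0:
        return 0.0
    p = (n - k) / n                      # observed pass-rate
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom

for n, k in [(10, 2), (50, 1), (100, 5), (500, 30)]:
    print(f"n={n:4d} k={k:2d}  LB={wilson_lower_bound(n, k):.3f}")
```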

The masking problem — the critical invariant.

A test can be flaky and regressed simultaneously. Classic trap: test has historical fail-rate 3% (flaky, quarantined), then a CL truly breaks it (post-CL fail-rate 80%). If we only look at overall pass-rate, the flaky label persists. Fix: track pre-CL vs post-CL pass-rate for every quarantined test as a separate bucket. Alert when post-CL fail-rate diverges significantly (χ² test p<0.01) from pre-CL.

-- chi_square_p_value is an assumed UDF (e.g., a BigQuery JS UDF), not a built-in.
SELECT
  test_id,
  SUM(CASE WHEN start_ts < q.quarantine_since THEN 1 ELSE 0 END) AS pre_n,
  SUM(CASE WHEN start_ts < q.quarantine_since AND result='FAIL' THEN 1 ELSE 0 END) AS pre_fail,
  SUM(CASE WHEN start_ts >= q.quarantine_since THEN 1 ELSE 0 END) AS post_n,
  SUM(CASE WHEN start_ts >= q.quarantine_since AND result='FAIL' THEN 1 ELSE 0 END) AS post_fail
FROM test_run r
JOIN test_metadata q USING (test_id)
WHERE q.quarantine_state != 'NONE'
GROUP BY test_id
HAVING chi_square_p_value(pre_n, pre_fail, post_n, post_fail) < 0.01;

Every row → paging event tagged REGRESSION_UNDER_QUARANTINE, which bypasses the normal batching and goes straight to PagerDuty.
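The divergence check itself is a 2×2 Pearson χ² with one degree of freedom; a pure-Python sketch of what that assumed UDF would compute:

```python
import math

def chi2_p_value(pre_n, pre_fail, post_n, post_fail) -> float:
    """Pearson chi-square p-value (1 dof) for pre- vs post-quarantine fail rates."""
    table = [[pre_n - pre_fail, pre_fail],
             [post_n - post_fail, post_fail]]
    total = pre_n + post_n
    col = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    row = [pre_n, post_n]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x/2))
    return math.erfc(math.sqrt(chi2 / 2))

# Quarantined test with 3% historical fail rate that a CL truly broke (80% post):
p = chi2_p_value(pre_n=1000, pre_fail=30, post_n=100, post_fail=80)
```

With these numbers `p` is far below the 0.01 alert threshold, while identical pre/post rates yield p ≈ 1.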

CulpritFinder (TAP's approach, worth replicating).

When a test breaks on post-submit at head, and flake classifier says "flaky", the naive response is "mute and move on." TAP instead re-runs the test at the parent commit:

  • If it passes at parent, it's a new regression (bisect to find the CL, page the author).
  • If it also fails at parent, walk back until it passes. The "culprit CL" is bracketed.
  • If it flakes at every re-run, it's confirmed flaky.

Cost: 10–15 extra runs on each suspected regression. Benefit: near-zero escape rate for regressions-disguised-as-flakes. Worth it at any scale where bug-to-prod cost > $1K.
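The walk-back logic reduces to a flake-aware bisect (a minimal sketch; `run_test` is a hypothetical callback returning True on a passing run, and each commit is rerun several times so a single flaky outcome doesn't mislead the search):

```python
def find_culprit(commits, run_test, reruns=5):
    """commits[0] is the last known green sha, commits[-1] the failing head.
    Returns None if the failure is confirmed flaky, else the culprit sha."""
    def is_broken(sha):
        # Broken = fails on every rerun; any pass means not deterministically broken.
        return not any(run_test(sha) for _ in range(reruns))

    if not is_broken(commits[-1]):
        return None                       # head passes on rerun: confirmed flaky
    lo, hi = 0, len(commits) - 1          # invariant: commits[lo] green, commits[hi] broken
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]                    # first broken commit brackets the culprit CL
```

At 17 CLs between greens this takes ~5 bisect steps (× reruns each), matching the ~10–15 extra runs quoted above.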

Failure modes.

| Failure | Detection | Mitigation |
|---|---|---|
| Cold start: new test has no history | n < 10 on first failure | Treat as "genuine fail" by default; require 2 green runs before trusting |
| Flake rate drifts due to infra change | Step-change in EWMA across all tests | Reset classifier, notify CI SRE |
| Adversarial test author (CL tests with random sleep) | Correlation between author and flake score | SRE dashboard of per-author flake contribution |
| Classifier cache stale after retrain | Hot-reload staleness metric | Version model blob; scorer checks version on each read |
| Quarantine storm (many tests quarantined at once) | Rate-limit on quarantine events | Cap at 20 auto-quarantines/hour; excess goes to human review |

Real systems. Google TAP (internal) + CulpritFinder, Meta's Flaky-Test-Detective (FBIT), Mozilla's Flaky Test dashboards, Jenkins flaky-test-handler plugin (weak), Datadog CI Visibility (commercial), BuildBuddy flaky-tests feature.


7.3 Notification routing

Why critical. At 10K eng, "CI broke, notify the team" produces 1K notifications/day. Humans develop learned-helplessness and mute the channel. Precision > 95% is a psychological threshold below which the system destroys its own signal value.

Alternatives.

| Strategy | Precision | Recall | Latency | Complexity |
|---|---|---|---|---|
| "Notify author of CL" | HIGH for presubmit. LOW for post-submit head-breaks (often a merge race). | Misses owners who need visibility | Low | Trivial |
| CODEOWNERS (GitHub-style, path-based) | Medium. Struggles with shared files. | Good | Low | Medium |
| CODEOWNERS + nearest-ancestor fallback | High | Good | Low | Medium |
| git-blame walk to find last modifier of the failing assertion | High when line-precise | Can page wrong person on refactor | Medium | Medium |
| Hybrid: CODEOWNERS primary + blame for the failing line | Very high | Very high | Medium | High |
| LLM-inferred owner (classifier over commit history + PR reviewers) | Medium in 2026 (improving) | Medium | High | High |

Chosen: Hybrid CODEOWNERS + blame + hierarchy fallback.

Algorithm:

def resolve_owners(test_id, failing_assertion_line, cl):
  # owners accumulates (team, priority) pairs
  # 1. CODEOWNERS of the test file itself
  owners = codeowners_lookup(test_path(test_id))

  # 2. Augment with blame of the assertion line (if available)
  if failing_assertion_line:
    blame_author = git_blame(failing_assertion_line).author
    blame_team = team_of(blame_author)
    if blame_team and blame_team not in {team for team, _ in owners}:
      owners.append((blame_team, 0.5))  # secondary priority

  # 3. On presubmit, always notify CL author as the actionable party
  if cl.is_presubmit:
    owners.append((cl.author_team, 1.0))

  # 4. Nearest-ancestor fallback if empty
  if not owners:
    owners = nearest_ancestor_codeowners(test_path(test_id))

  # 5. Last resort
  if not owners:
    owners = [("platform-sre", 0.1)]

  return dedupe(owners)

Severity tiering (the anti-spam primitive).

| Severity | Trigger | Channel | Batching | Rate limit |
|---|---|---|---|---|
| INFO | Pre-submit fail on a test classified FLAKY, no regression signal | Slack thread on CL | Coalesce into CL status | N/A |
| WARNING | Pre-submit fail on STABLE test | Slack DM to author + team channel | 5-min window | 10/hr/team |
| ERROR | Post-submit main-branch break (non-flake) | Slack team channel + @here | 1-min window | 5/hr/team |
| PAGE | REGRESSION_UNDER_QUARANTINE, or main-branch break + deploy-gate fail | PagerDuty | No batch | No limit |

Earned-secret depth: the batching window is a correctness mechanism, not a kindness.

Without batching, a rollback-worthy main-branch break can generate 500 test failures (a cascade of integration tests depending on the bad code). Each generates a notification → 500 pages → responders anchor on whichever page happened to open the incident and chase the wrong lead → root cause delayed.

Correct design:

  1. Event correlation window (1 min at post-submit, 5 min at presubmit) — group all failures sharing a (sha, failure_signature) prefix.
  2. Failure signature = hash of (normalized stack trace, failed assertion text). Tests failing with the same signature are likely one root cause.
  3. One notification per (team, signature, window) regardless of how many tests share the signature.
  4. Top-N summary in the notification ("147 tests failed with AssertionError: db connection refused; likely root cause: DB pool change in db/pool.go line 82").
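The signature computation can be sketched as normalize-then-hash (the normalization regexes here are illustrative; real stack-trace grammars need per-language rules):

```python
import hashlib
import re

def failure_signature(stack_trace: str, assertion_text: str) -> str:
    """Strip volatile details from the trace, then hash. Tests failing with
    the same signature in the same window likely share one root cause."""
    norm = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", stack_trace)      # heap/code addresses
    norm = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "IP", norm)    # host IPs
    norm = re.sub(r":\d+", ":LINE", norm)                        # line numbers, ports
    norm = re.sub(r"\b\d{4,}\b", "N", norm)                      # ids, timestamps
    return hashlib.sha256((norm + "\n" + assertion_text).encode()).hexdigest()[:16]

# Two failures differing only in addresses/ports/lines collapse to one signature:
t1 = "db/pool.go:82 dial tcp 10.0.3.7:5432: connect refused at 0x7f3a21"
t2 = "db/pool.go:94 dial tcp 10.0.3.9:5432: connect refused at 0x55d0ee"
sig1 = failure_signature(t1, "AssertionError: db connection refused")
sig2 = failure_signature(t2, "AssertionError: db connection refused")
```

notif-router then keys its coalescing map on (team, signature, window_start) and emits at most one message per key.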

Google's Critique system does this correlation; Meta's CI Observability does similar grouping.

Failure modes.

| Failure | Detection | Mitigation |
|---|---|---|
| Paging storm | Rate of notifications > 10× baseline | Auto-circuit-breaker: coalesce all into a single "CI incident" page; suspend flake-classifier retries; engage CI on-call |
| Wrong owner (CODEOWNERS out of date) | Feedback: "not me" button in Slack → ML signal | Weekly stale-CODEOWNERS audit; auto-PR to update |
| Author OOO | Calendar integration | Fallback to team rotation (PagerDuty on-call lookup) |
| Shared file with no clear owner | CODEOWNERS hit ratio per file | "Orphan file" dashboard; assign to platform team |
| Notification → action latency | Track time-to-ack | Alert on team-level p95 > 1h |

Real systems. Google Critique (code review + notification), Meta's CI Observability + Claim (acknowledge-a-failure UX), Datadog Events Explorer, PagerDuty with CODEOWNERS integration, GitHub's CODEOWNERS + Required Reviewers, Bazel's TestResult proto fields for structured failure info.


8 Failure Modes & Resilience #

Per-component table

| Component | Failure | Detection (time) | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| VCS webhook | Lost delivery | 5-min "no CL since" watchdog + VCS delivery dashboards | Missed test runs for that sha | (1) VCS retry on 5xx (7 attempts, exp backoff); (2) pull-based reconciler scans for un-CI'd SHAs every 5 min | Replay webhook from VCS admin UI; reconciler auto-catches |
| ci-coordinator | Pod crash | gRPC health probe; state in Spanner | CL stuck in ANALYZING | Stateless pods behind HAProxy; next pod picks up via Spanner lease | Spanner row lease expires in 30s; auto-recover |
| Spanner / metadata | Regional outage | Client-side latency spike + error rate | Flake lookups fail → devs can't merge | Multi-region Spanner; degradation mode: read-only from secondary, skip flake classification (run all tests, no quarantine) | Manual promote; resume quarantine decisions when primary back |
| dep-analyzer | Incorrect affected set | CI-of-CI: reproducible build mismatch | Missed regressions merge to main | Dual-compute on 1% of CLs; alert on divergence; mandatory full suite on post-submit catches misses | Invalidate cache; rerun affected CLs; rare |
| Scheduler | Backlog explosion | Queue depth > threshold (10K) | Latency SLO violation | Shed load: drop manual reruns; pause post-submit; alert on-call; autoscale | Drain backlog with autoscale; manual reruns return when queue drains |
| Executor node | Poisoned state (stale kernel, leaked processes) | Health check: per-node flake rate vs peer mean > 3σ | Falsely fails tests until quarantined | Auto-drain node on elevated flake rate; reimage from golden image | Reprovision; re-admit after clean run |
| Remote cache | Cache poisoning (non-hermetic result cached) | 1% re-execution audit | Bad output until cache entry expires | Content-addressed invalidation by target_hash prefix; quarantine rule author | Rebuild from source on invalidated keys |
| Kafka (result bus) | Broker partition loss | Consumer lag | Flake stats go stale | Replication factor 3 + min-ISR 2; consumer offset checkpoints in Redis | Kafka self-heals; consumers resume from checkpoint |
| Flake classifier | Classifier down | Staleness of test_metadata.flake_updated_at | Retry decision falls back to "no retry" (safe default) | Cache last-known classification in scheduler's local LRU | Restart classifier; recompute from hot store |
| Quarantine svc | False-positive quarantine of real regression | Pre/post quarantine divergence chi² test | Real bug ships | Block merge if any non-flake failing; regression-under-quar goes to PAGE severity | Manual dequarantine + bisect |
| notif-router | Notification storm | Notif rate > 10× baseline | Pager fatigue, real alerts missed | Auto-coalesce to incident page; suspend retries | Incident commander resolves; notif-router drains queue |
| Ownership resolver | Stale CODEOWNERS after refactor | Notification feedback ("not me") rate | Wrong team paged | Weekly auto-audit PR; UI "reassign" button feeds back to ML | Merge ownership fix; rebuild cache |
| Deploy gate | False green signal | Prod canary SLO burn | Bad code deploys | Multi-dim signal (test pass + perf delta + coverage); canary stage catches it | Auto-rollback on SLO burn; CI post-mortem |
| Artifact signer | Key compromise | Signing-key audit log anomaly | Unauthorized artifact release | KMS-backed ephemeral signing keys (per-run); SLSA provenance | Rotate KMS root; invalidate affected artifacts |
| Merge race | Two CLs pass presubmit independently; together break main | Post-submit break within 10 min of merge | Brief main-branch break | Rebase-on-merge + post-merge re-verify (TAP pattern); "merge queue" serialization for hot paths | Bisect; revert CL; flake classifier distinguishes from real flake |

Systemic resilience patterns

  • Bulkheads: presubmit and post-submit pools are physically separate — a post-submit overload never blocks devs.
  • Backpressure: scheduler rejects (with 429 retryable) when executor pool > 85% saturated; coordinator holds CL in QUEUED with user-visible queue position.
  • Graceful degradation modes, declared explicitly:
    • MODE_NORMAL: full pipeline
    • MODE_NO_FLAKE_CLASSIFIER: classifier down → run all retries once (no quarantine, no skips); throughput unchanged, cost ~10% up
    • MODE_NO_CACHE: remote cache down → run all targets from scratch; cost 7× up, SLO breached; SEV-2 auto-filed
    • MODE_BREAK_GLASS: entire CI offline → manual merge with author attestation, logged for post-hoc audit
  • Chaos drills monthly: simulate each degradation mode in pre-prod shadow fleet.
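The declared modes translate directly into an ordered policy function (a minimal sketch; the boolean health inputs are assumptions):

```python
from enum import Enum

class Mode(Enum):
    NORMAL = "full pipeline"
    NO_FLAKE_CLASSIFIER = "retry once, no quarantine/skips; cost ~10% up"
    NO_CACHE = "run all targets from scratch; cost 7x up, SEV-2 auto-filed"
    BREAK_GLASS = "manual merge with author attestation, audited"

def choose_mode(ci_up: bool, cache_up: bool, classifier_up: bool) -> Mode:
    # Ordered from most to least severe degradation
    if not ci_up:
        return Mode.BREAK_GLASS
    if not cache_up:
        return Mode.NO_CACHE
    if not classifier_up:
        return Mode.NO_FLAKE_CLASSIFIER
    return Mode.NORMAL
```

Making the mode an explicit, queryable value (rather than implicit behavior) is what lets the chaos drills assert "we are in MODE_NO_CACHE and the SEV-2 was filed."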

9 Evolution Path #

v1 — Minimum viable (0–6 months, handles 100 eng / 1K CL/day)

  • Single-queue Jenkins or GitHub Actions running full suite per CL.
  • Git-triggered; no caching; no flake detection.
  • All failures → email to author + team channel.
  • Manual quarantine via a YAML file in repo.

When this breaks: queue depth grows superlinearly past ~500 CL/day; p95 CI time exceeds 1 hour; developers start reaching for [skip ci] and learned helplessness sets in.

Cost at v1 scale: ~$50/CL (no caching). Acceptable only < 500 CL/day.

v2 — Sharded with incremental build (6–18 months, 1K–5K eng / 10K CL/day)

  • Adopt Bazel + BuildBuddy RBE.
  • Affected-target compute with a dep-analyzer service.
  • Remote CAS cache; target 60% hit rate initially → 80% at 12 months.
  • Flake detection: Wilson score, threshold-based auto-quarantine.
  • CODEOWNERS-based routing; batching via Slack.
  • Jenkins → Tekton or Buildkite for pipeline orchestration.
  • Storage: test_run in Postgres initially (adequate to 1B rows); split to BigQuery at 5B.

When this breaks: ~10K CL/day, Postgres rows exceed 1B/yr, dep-analyzer cold-start dominates p95.

v3 — Global TAP-scale CI (18+ months, 10K+ eng / 100K CL/day — this design)

  • Dedicated executor pool with hermetic sandboxes (gVisor + cgroups + net namespaces).
  • Bi-modal storage: BigQuery cold + Bigtable hot + Spanner OLTP.
  • Streaming flake classifier (Flink/Dataflow) + nightly ML retrain.
  • CulpritFinder auto-bisect on post-submit regressions.
  • Deploy-gate with multi-dimensional quality score integrated into canary.
  • Multi-region failover; continuous integration model (post-submit-driven; devs merge to a virtual queue like Google TAP).
  • SLSA L3 artifact attestation; SOX/SOC2 audit trail.

What's next (v4):

  • Speculative execution: based on author's historical review-to-merge rate, start CI before CL is opened.
  • LLM-assisted failure triage: GPT-4 or Claude-Opus classifies failure logs, proposes a root cause, tags likely-owner files. Already in pilot at multiple orgs in 2026.
  • Test generation from coverage gaps + mutation testing to improve flake-distinguishability.
  • Federated CI across multi-repo with cross-repo dep graph (for "logical monorepo, physical polyrepo" orgs).

10 Out-of-1-Hour Notes (L7 extras) #

10.1 Hermetic sandboxing — defense-in-depth

  • gVisor (userspace kernel, Google) for strong syscall isolation at modest perf cost (~10–15% CPU overhead vs runc); protects the host from a malicious test.
  • cgroups v2 per-run CPU / memory / IO quotas; a runaway test can't starve neighbors.
  • pid namespaces so a test can't see peer processes; unshare(CLONE_NEWPID).
  • Network namespaces: default no network; tests declare requires_network: true with a scoped firewall. This is the #1 source of flakes; enforcing isolation kills ~30% of flake population at the cost of exposing tests that depended on external services (which should have been stubbed anyway).
  • Filesystem overlays: inputs mounted read-only from CAS, outputs to a dedicated tmpfs; no leakage between runs.
  • Mount namespaces + seccomp-bpf to block dangerous syscalls (ptrace, mount, reboot).
  • UID mapping via user namespaces so tests run as root-in-sandbox but nobody outside.
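The per-run quota layer can be sketched with stdlib rlimits (Linux-oriented; this is only the rlimit slice of the stack, not gVisor or cgroups):

```python
import resource
import subprocess
import sys

MEM_LIMIT = 512 * 1024 * 1024  # 512 MB address-space cap per test process

def limit_resources():
    # Runs in the child between fork and exec (subprocess preexec_fn)
    resource.setrlimit(resource.RLIMIT_AS, (MEM_LIMIT, MEM_LIMIT))
    resource.setrlimit(resource.RLIMIT_CPU, (30, 30))  # 30 s CPU quota

# A "test" that tries to allocate 1 GB dies inside the sandbox...
hog = subprocess.run([sys.executable, "-c", "x = bytearray(1024**3)"],
                     preexec_fn=limit_resources)
# ...while a well-behaved test runs normally under the same limits.
ok = subprocess.run([sys.executable, "-c", "print('ok')"],
                    preexec_fn=limit_resources)
```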

10.2 Content-addressed cache correctness

  • Action cache key = SHA-256(serialize(command, args, env, input_digests_merkle_root, toolchain_digest)).
  • Output cache = blob store keyed by each output file's content hash.
  • Mandatory: reproducible-builds audit (random 1% re-execution, diff outputs bit-for-bit). Alert on any divergence.
  • Cache pinning: hot targets (platform libraries) pinned with TTL = ∞; cold targets expire LRU.
  • Multi-layer cache: worker-local SSD → zonal → regional → global CAS. Targets 85% hit rate with p99 < 80 ms.
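A sketch of the action-key derivation (the canonical-JSON serialization and helper names are illustrative; the real Remote Execution protocol uses protobuf digests):

```python
import hashlib
import json

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(input_digests: dict) -> str:
    """Deterministic root over {path: content_digest}, sorted by path."""
    leaves = "".join(f"{p}:{d}" for p, d in sorted(input_digests.items()))
    return digest(leaves.encode())

def action_key(command, args, env, input_digests, toolchain_digest) -> str:
    payload = json.dumps({
        "command": command,
        "args": args,
        "env": dict(sorted(env.items())),      # canonical env ordering
        "inputs_root": merkle_root(input_digests),
        "toolchain": toolchain_digest,
    }, sort_keys=True)
    return digest(payload.encode())

inputs = {"lib/a.cc": digest(b"int f(){return 1;}"), "lib/a.h": digest(b"int f();")}
k1 = action_key("g++", ["-O2", "-c", "lib/a.cc"], {"LANG": "C"}, inputs, "tc-v1")
# Any input-content change flows through the Merkle root into a new key:
inputs2 = dict(inputs, **{"lib/a.cc": digest(b"int f(){return 2;}")})
k2 = action_key("g++", ["-O2", "-c", "lib/a.cc"], {"LANG": "C"}, inputs2, "tc-v1")
```

Determinism of the key (same inputs → same key) is exactly the "cache hit ⇔ equivalent execution" invariant from §7.1.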

10.3 Bisect automation on post-submit failures

  • Triggered by POSTSUBMIT_FAIL event + flake classifier says "not flaky."
  • Binary search between last-green-sha and failing-sha.
  • For each midpoint: launch a bisect run of the failing test only (cheap).
  • Log2(N) runs to find culprit, where N = CLs between greens. At 100 CL/hr merge rate and 10-min breakage detection, N ≈ 17 → ~5 runs → culprit in ~1 hr.
  • Culprit author gets ERROR severity notification with "your CL broke main; revert ready to merge."

10.4 Test impact analysis vs static call graph

  • Static call-graph TIA (Kythe-style) misses reflection, DI, code generation.
  • Coverage-based TIA (Facebook PTS, Microsoft's paper) records per-test coverage, selects tests touching changed lines. Works well for integration tests; struggles with e2e (coverage too broad).
  • BUILD-graph TIA (chosen in §7.1) is conservative/correct but coarse; pairs well with ML-TIA as a tighter probabilistic hint.
  • Hybrid approach: BUILD graph for correctness (run this) + ML-TIA for priority (run these first for fail-fast signal). Google uses this combination.

10.5 Canary + rollback tie-in

  • CI's quality_score is one input to the deploy gate, not the only one. Other inputs:
    • SLO burn-rate at canary (primary)
    • Config drift check
    • Manual approval for SEV-1 deploys
  • Rollback triggers:
    • Canary SLO burn > 2%/hr → auto-rollback
    • Automated regression test failure in canary → hold + page deploy on-call
    • Manual kill-switch
  • Post-rollback CI forensics: artifacts + test_run rows from the rolled-back sha are tagged rollback_culprit=true, feeding into post-mortem tooling.

10.6 Regulatory audit trail (SOX compliance for change management)

  • Every merge = change event with immutable log: (cl_id, author, reviewers, CI pass evidence, deployer, deploy time, canary metrics).
  • Hash-chain the change-event log (like quarantine_event schema) for tamper evidence.
  • Quarterly SOX audit surfaces:
    • Any deploy without green CI (must be break-glass with exec approval)
    • Any quarantine override without ticket
    • Any artifact without signed SLSA provenance
  • Separation of duties: author of CL cannot deploy their own artifact to prod without a distinct approver (enforced in deploy-orch).

10.7 Cost per CI-minute

  • Target: $0.005/CI-minute at 90% cache hit.
  • Levers:
    • Spot/preemptible instances for executor pool (60–70% discount; interruptions tolerated because test runs are idempotent and short).
    • Multi-tenancy on executor hosts (gVisor makes this safe); pack tests by memory profile.
    • Autoscaler tuned to CL arrival rate (statistical; not reactive).
    • Regional CAS replicas co-located with executors (network cost dominates inter-region).
  • Anti-pattern: flat-rate GPU pool for "ML tests" sitting idle 80% of the time. Instead, burst to a shared AI/ML cluster with preemption.

10.8 Observability of the CI system itself ("signal-to-noise dashboards")

A CI system not instrumented becomes a black box. Required dashboards:

  • Precision panel: % of failures that led to a CL abandonment or fix-up commit. Target > 95% (false-fail rate < 5%).
  • Flake leaderboard: top-50 tests by flake score, with on-call team.
  • Cache hit rate per pool / per language / per team.
  • Latency SLO burn: p50/p95 presubmit latency vs target, with burn-rate.
  • Executor utilization — both CPU and memory, because test mixes skew.
  • Notification precision: "clicked-to-debug" vs "muted" ratio per team.
  • Regression-under-quarantine counter (the most critical single metric: any non-zero value is a potential escape).
  • Cost per CL with breakdown (compute, cache, storage, network).
  • CI-of-CI: reproducibility audit pass rate, bisect accuracy.

10.9 Agentic / LLM-era extensions (2026 relevance)

Given the candidate's Agentic AI background, frame these in the interview if asked:

  • LLM-assisted failure triage: feed failure logs + diff to an LLM, produce (a) likely root cause, (b) suggested owner, (c) auto-generated fix PR for trivial cases.
  • Agent sandbox isolation: same hermetic sandbox we use for test isolation generalizes to agent execution isolation — a useful cross-domain point.
  • Privacy-preserving CI: if tests process PII fixtures, the sandbox's net-isolation + ephemeral filesystem are the same primitives used for privacy infra. Candidate can pivot to this if interviewer shows Privacy interest.

10.10 What I would defer if interviewer insists

If out of time, skip §7.3 (notif routing) in the interview — it's the most discussable but least architecturally load-bearing. Keep §7.1 (build graph) and §7.2 (flake statistics) because they contain the hardest L7 insights. Offer §7.3 as follow-up.


Verification checklist (done before submission) #

  • SRE pager-carryable? Yes — on-call could carry this today: coordinator/Spanner/scheduler are well-understood building blocks; degradation modes defined; chaos drills specified.
  • Every diagram arrow → real API/data flow? Yes — each arrow mapped to the numbered table under the diagram.
  • Deep-dive L7 or L6? L7: Wilson score math with worked numbers, cache-correctness failure taxonomy, CulpritFinder reference, regression-under-quarantine detection as a distinct metric, failure-signature coalescing as correctness-not-kindness framing.
  • Flake detection statistically rigorous? Yes — explicit rejection of "passed on retry = flaky"; n ≥ 50 sample floor; reproduction protocol; χ² test for regression-under-quarantine; bisection for final classification.
  • Real systems named with rejection rationale? Yes — Bazel vs Buck2 vs Pants vs Gradle, Jenkins vs Tekton, Google TAP + CulpritFinder, Meta Sapling + FBIT, Datadog CI Vis, Spinnaker/Argo.
  • BOE numbers calculated not asserted? Yes — executor core-days, cache-hit cost curve, notification fan-out after filtering, storage row count math.
  • Cost envelope closed? Yes — $0.50/CL derives from 85% cache hit rate, which is justified by cache architecture (multi-tier + pinning).

End of solution.
