Design a CI/CD Pipeline with Smart Test-Failure Handling
Run monorepo builds and tests efficiently while distinguishing genuine failures from flaky or infra-driven noise.
Meta-strategy (not part of the answer — read once) #
This is the "unusual, requirements-driven" archetype. The interviewer will withhold a requirement and see whether I elicit it. My opening move is not a diagram — it is a 15-question clarifying volley, batched by theme, designed to reveal the real problem. The canonical trap: candidate assumes "large monorepo" means Google-scale 2B LOC; interviewer actually has in mind a 50M-LOC enterprise repo where the CI system itself is the bottleneck, or a multi-repo "federated monorepo" where the dependency graph crosses repo boundaries. Ask, then design.
Second trap: candidate jumps to "we'll use Bazel + remote cache." That is a tool, not a design. The L7 move is to derive the correctness invariants first (hermeticity, determinism, content-addressable inputs) and then defend Bazel against Buck2 / Pants / Gradle with numbers.
Third trap: candidate says "if a test passed on retry, it's flaky." This is statistically wrong and will mask real regressions. L7 answer uses Wilson score lower bound, a reproduction protocol, and cites Google TAP's CulpritFinder to distinguish flake from regression.
1 Problem Restatement & Clarifying Questions #
Restatement (30 sec)
"Design a CI/CD pipeline for a large monorepo. Primary requirements: (1) detect test failures, (2) decide when a failure is flaky vs real, and (3) route notifications to the right owners without paging the whole team. Secondary: the system must hand off to a deploy orchestrator (canary / rollback). I'll clarify scope, then drive from there."
Clarifying questions — batched, with why I'm asking
A. Scale & shape (drives capacity math, sharding model)
- Repo size: LOC, file count, number of BUILD/test targets? (50M LOC + 1M targets is a very different problem from 500M LOC + 50M targets — the latter needs a cross-shard dependency graph service.)
- Commit velocity: CLs/day, peak CLs/sec? Distribution across the day (US hours vs global)? (Drives executor pool elasticity and scheduler queue depth.)
- Engineer count and team count? (Drives notification routing fan-out and CODEOWNERS cardinality.)
- Test population: unit / integration / e2e split? Are there long-running tests (>30 min)?
- Existing infra: do we own remote execution (a la RBE / BuildBuddy / EngFlow), or are we building on Jenkins / GitHub Actions / Tekton?
B. SLOs & cost envelope (drives HA, pool sizing, storage tier)
6. Presubmit latency p50/p95? Is "10 min p50" aggressive or slack? (Google TAP internally runs p50 ~6 min.)
7. CI system availability target — 99.9% implies ~8.7h downtime/yr; can devs merge without CI during an outage (break-glass)?
8. Budget per CI-minute and per-commit cap? (Cost drives the cache-hit target; at $0.01/CPU-min × 1000 CPUs × 10 min = $100/run, unacceptable without caching.)
C. Correctness & policy (drives flake algorithm, quarantine policy)
9. Pre-merge vs post-merge model: do CLs block on CI (Gerrit-style pre-submit gate), or merge-then-test (Git main-branch + post-submit)? Or both (Google TAP is hybrid)?
10. What's the current false-fail rate? If devs already distrust CI ("I'll just rerun it"), that's a cultural problem masking a flake-detection deficit.
11. Regulatory / audit requirements: SOX for deploy lineage? SOC2 for change management? PCI/HIPAA in scope? (Determines whether we need tamper-evident logs and signed artifact attestation.)
D. Ownership & notification (drives CODEOWNERS model, routing)
12. Is there an existing CODEOWNERS system? Path-based only, or team-graph-based (e.g., Meta's fbpkg ownership)? Are there shared files with no clear owner?
13. Notification channels: Slack/chat, email, pager (PagerDuty/OpsGenie)? Is paging reserved for post-submit main-branch breaks?
14. On-call handoff: do teams have rotations that we can query, or is it "notify the author"?
E. Deployment hand-off
15. What does the deploy orchestrator look like — blue/green, canary, regional rollout? Is CI signal the only gate, or does it combine with SLO burn-rate? (This determines whether CI produces a boolean "green/red" or a richer "build quality score.")
Assumed answers for this doc (stated explicitly — I'd verify in the interview)
| Dimension | Value |
|---|---|
| Repo size | 200M LOC, 5M source files, 2M Bazel targets, 800K test targets |
| Engineers | 10K active developers across 1K teams |
| CL velocity | 100K CLs/day (avg 10/eng), peak 3 CL/sec |
| Presubmit SLO | p50 < 10 min, p95 < 25 min |
| Availability | 99.9% (dev productivity blocker) |
| Pre- vs post-submit | Both: presubmit smoke on affected targets, post-submit full at head |
| Cost target | ≤ $0.50/CL on average, ≤ $5/CL p99 |
| Test history retention | 90d hot, 2y cold (audit) |
| Notifications | Slack primary, PagerDuty for post-submit main-break |
| Deploy | Canary + blue/green, CI produces multi-dimensional gate signal |
2 Functional Requirements #
In scope (numbered, referenced later)
FR-1. CL submission & webhook ingestion. Accept webhook from VCS (GitHub Enterprise / Gerrit / internal). Dedupe by (repo, sha). Idempotent.
FR-2. Affected-target computation. Given a sha + base sha, compute the minimum set of build + test targets that could be affected. Must be conservative (no missed targets) and tight (no over-selection that explodes the run).
FR-3. Pre-submit gate. Run affected tests before merge. Return a signal sufficient to block/allow merge. p50 < 10 min.
FR-4. Post-submit continuous integration. On every merge to main, run a broader suite (affected + downstream), feed flake classifier with high-quality signal.
FR-5. Flake detection & quarantine. Maintain a per-test flake score (statistically grounded). Auto-quarantine tests exceeding threshold. Auto-dequarantine when stable. Must not mask real regressions.
FR-6. Smart retries. Retry a failing test only if its prior statistics classify it as likely-flaky. Bounded retries with exponential backoff, deterministic seed variation.
FR-7. Ownership resolution & notification routing. Resolve owner from CODEOWNERS + git-blame + path hierarchy. Batch notifications; escalate by severity tier. No spam on flake-only failures.
FR-8. Artifact production & attestation. Produce signed, hermetic build artifacts with provenance (SLSA level 3 target).
FR-9. Deploy hand-off. Emit a multi-dimensional quality signal (pass/fail, flake-adjusted confidence, coverage delta, perf delta) to the deploy orchestrator.
FR-10. Query & admin APIs. flaky_score(test_id), quarantine(test_id), rerun(cl_id, targets), get_status(cl_id).
Out of scope (explicit)
- Lint / format enforcement (separate pre-commit hook fleet).
- Full SAST/DAST pipeline details (we'll call into it as a gate, not design it).
- Release comms / changelogs / JIRA integration.
- Build-graph authoring tooling (BUILD-file IDE, genrule support).
- Secrets management for deploys (assume a KMS exists).
- Credentials / identity for the CI system itself (assume SPIFFE-style workload identity exists).
3 NFRs + Capacity Estimate #
NFRs
| NFR | Target | Rationale |
|---|---|---|
| Availability (coordinator, scheduler, metadata svc) | 99.95% | Dev productivity blocker; one hour of downtime ≈ 10K eng × 1h × $100/h = $1M |
| Presubmit latency | p50 10 min / p95 25 min | Google TAP benchmark; >25 min devs context-switch |
| Flake false-fail rate | < 1% | Above this, trust erodes and "just rerun" culture starts |
| Flake false-quarantine rate | < 0.1% | Quarantining a real regression is catastrophic |
| Notification precision | > 95% (notif reaches an owner who can act) | Below 90% = learned-helplessness muting |
| Cost | $0.50/CL avg | $0.50 × 100K CL/day = $50K/day = $18M/yr, acceptable at 10K-eng scale |
| Durability of test history | 11 9's (S3/GCS class) | Audit + flake-model training data |
| Build reproducibility | 100% for hermetic targets | Invariant; if violated, cache is invalid |
Back-of-envelope capacity
Test execution volume
Engineers: 10,000
CLs/eng/day: 10
CLs/day: 100,000
Affected targets/CL (p50): 50
Affected targets/CL (p95): 500
Presubmit test runs/day (p50): 100K × 50 = 5M
Presubmit test runs/day (w/ p95): ≈ 10M [mix of p50 + p95]
Retry avg (flaky-classified only): 0.05 × extra runs ≈ 500K
Post-submit full runs/day: 800K test targets × ~10 batched runs at head = 8M (merges are batched; unchanged targets dedupe to cache hits by (sha, target_hash))
TOTAL test executions/day: ~18M
Per second avg: ~210 tests/sec
Peak (3×): ~630 tests/sec
Executor pool sizing
Avg test wall time: 30 sec (unit-heavy mix)
Wall-sec/day: 18M × 30 = 540M sec = 6,250 core-days
Core-days/day steady-state: ~6,250 cores running flat
With peak 3× overhead + headroom: ~20,000 cores peak
Cost @ $0.01/core-min: 6,250 × 1440 × $0.01 = ~$90K/day raw compute
  → caching closes the gap: even ~45% hit meets the $50K/day = $0.50/CL compute budget; we target 85% for headroom
Cache hit ratio is the dominant cost lever — compute spend scales with the miss rate. At 50% hit we pay $45K/day; at 80%, $18K/day; at 95%, $4.5K/day. Our design target is 85% hit rate, justified below; the margin under budget absorbs retries, storage, and network.
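A quick sanity check of that curve, assuming the ~$90K/day raw-compute figure above and that only cache misses pay for execution:

```python
# Compute spend vs cache hit rate; $90K/day raw compute from the sizing above.
RAW_COMPUTE_USD_PER_DAY = 90_000

for hit in (0.50, 0.80, 0.85, 0.95):
    daily = RAW_COMPUTE_USD_PER_DAY * (1 - hit)  # misses are the only runs we pay for
    print(f"hit={hit:.0%}: ${daily:>8,.0f}/day  (${daily / 100_000:.2f}/CL at 100K CL/day)")
# hit=50%: $45,000/day ($0.45/CL) · 80%: $18,000 ($0.18) · 85%: $13,500 ($0.14) · 95%: $4,500 ($0.05)
```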
Notification fan-out
Failures/day: ~2% of runs = 360K failure events
After flake filter (80% masked): ~72K real failures
After batching (5-min windows/team): ~1K batched notifications/day
Pages (post-submit main-break only): ~50/day
→ well under humane pager budget
Storage
test_run rows: 18M/day × 90d hot = 1.6B rows hot
+ 2y cold ≈ 13B rows cold
Row size (compressed columnar): ~100 B with dictionary + RLE
Hot store (BigQuery/ClickHouse): 160 GB hot (easy)
Cold store (object storage parquet): 1.3 TB cold
test_metadata: 800K tests × ~2 KB = 1.6 GB (Spanner-easy)
dep_graph snapshots: 2M targets × ~200 B ≈ 400 MB/snapshot × 1 snapshot/merge
                     × 10K merges/day × 7d = ~28 TB/week raw
  → content-addressed dedupe + delta encoding cuts ~1000×
  → ~30 GB/week hot
Query API load
`flaky_score(test_id)` calls: Every test run needs it before retry decision
→ 18M/day = 210 QPS avg, 630 QPS peak
→ trivial if cached; backing store is Spanner
`get_status(cl_id)` UI polls: 10K eng × 10 polls/CL × 10 CLs = 1M polls/day
→ 12 QPS avg, bursty
4 High-Level API #
Design principles
- gRPC primary (streaming for logs, bidi for coordinator ↔ executor). REST gateway for webhooks and UI.
- Idempotency via client-provided request ID. All writes are upserts keyed on (cl_id, target_id, attempt).
- Backward compat: proto3 with explicit field numbers; never reuse field numbers.
Service boundaries (each is a separately deployable binary)
ci-coordinator — CL intake, lifecycle state machine
dep-analyzer — BUILD graph traversal, affected-target compute
scheduler — queueing, bin-packing onto executor pool
exec-controller — executor fleet manager (health, drain, autoscale)
result-ingest — writes to test_run; fans out to flake-pipeline
flake-classifier — streaming stats + batch model refresh
quarantine-svc — quarantine state machine + dequarantine scheduler
ownership-resolver — CODEOWNERS + blame + hierarchy rules
notif-router — batching, severity tiering, channel fanout
artifact-svc — signed artifact store + provenance
deploy-gate — emits quality signal to deploy orchestrator
metadata-svc — test_metadata OLTP reads/writes (Spanner fronted)
history-svc — test_run analytics reads (BigQuery/ClickHouse)
admin-api — quarantine overrides, reruns, backfills
API surface (abridged proto)
// ====== ci-coordinator ======
service CICoordinator {
// VCS webhook target (via REST gateway). Idempotent on (repo, sha, event_kind).
rpc SubmitCL(SubmitCLRequest) returns (SubmitCLResponse);
rpc GetStatus(GetStatusRequest) returns (stream CLStatus); // streaming for UI
rpc Rerun(RerunRequest) returns (RerunResponse); // admin
rpc Cancel(CancelRequest) returns (CancelResponse);
}
message SubmitCLRequest {
string request_id = 1; // idempotency key
string repo = 2;
string sha = 3;
string base_sha = 4;
string author = 5;
repeated string modified_paths = 6; // hint; dep-analyzer re-verifies
SubmitMode mode = 7; // PRESUBMIT | POSTSUBMIT | MANUAL
}
message CLStatus {
string cl_id = 1;
CLState state = 2; // ANALYZING, QUEUED, RUNNING, PASSED, FAILED, FLAKY_RETRYING
int64 targets_total = 3;
int64 targets_passed = 4;
int64 targets_failed = 5;
int64 targets_flaky = 6;
double quality_score = 7; // 0.0–1.0; emitted to deploy-gate
}
// ====== flake-classifier ======
service FlakeClassifier {
rpc FlakyScore(FlakyScoreRequest) returns (FlakyScoreResponse);
rpc BatchFlakyScore(BatchFlakyScoreRequest) returns (BatchFlakyScoreResponse);
rpc ShouldRetry(ShouldRetryRequest) returns (ShouldRetryResponse);
}
message FlakyScoreResponse {
string test_id = 1;
double wilson_lower_bound_pass = 2; // 95% CI lower bound of true pass rate
double ewma_30d_fail_rate = 3;
int32 sample_size_30d = 4;
FlakeClassification classification = 5; // STABLE | SUSPECT | FLAKY | BROKEN | QUARANTINED
string reason = 6; // human-readable explanation
}
// ====== ownership-resolver ======
service OwnershipResolver {
rpc ResolveOwners(ResolveOwnersRequest) returns (ResolveOwnersResponse);
}
message ResolveOwnersResponse {
repeated Owner owners = 1;
ResolutionStrategy strategy_used = 2; // CODEOWNERS | GIT_BLAME | PATH_HIERARCHY | FALLBACK
double confidence = 3;
}
// ====== notif-router ======
// internal only — called by coordinator / post-submit watcher
service NotifRouter {
rpc Notify(NotifRequest) returns (NotifResponse);
}
message NotifRequest {
string event_id = 1; // idempotency
Severity severity = 2; // INFO | WARNING | ERROR | PAGE
Category category = 3; // PRESUBMIT_FAIL | POSTSUBMIT_BREAK | QUARANTINE_FAIL_UNDER | ...
repeated Owner owners = 4;
string cl_id = 5;
repeated string test_ids = 6;
string summary_md = 7;
map<string, string> labels = 8; // team, service, env
}
// ====== quarantine-svc ======
service Quarantine {
rpc Quarantine(QuarantineRequest) returns (QuarantineResponse);
rpc Dequarantine(DequarantineRequest) returns (DequarantineResponse);
rpc ListQuarantined(ListQuarantinedRequest) returns (ListQuarantinedResponse);
}
Webhook contract (VCS → coordinator)
POST /v1/vcs/webhook
Headers:
X-Signature: HMAC-SHA256(secret, body) ← verified before parse
X-Event-Id: UUID ← dedupe key
X-Delivery-Attempt: int ← VCS retry semantics
Body: {repo, ref, before_sha, after_sha, author, paths[]}
The coordinator is an at-least-once consumer. Dedupe via X-Event-Id, held in Redis with a 24h TTL. On replay, an already-submitted CL returns the existing cl_id (idempotent).
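A minimal sketch of that intake path — HMAC check before parsing, then atomic X-Event-Id dedupe (Redis SET NX EX); `submit_cl` is a hypothetical stand-in for the coordinator's upsert:

```python
# Webhook intake sketch: verify HMAC before parsing, dedupe on X-Event-Id.
import hashlib
import hmac

DEDUPE_TTL_S = 24 * 3600  # 24h, matching the Redis TTL above

def submit_cl(body: bytes) -> str:
    # Placeholder: the real coordinator upserts a CL row keyed on (repo, sha).
    return "cl-" + hashlib.sha256(body).hexdigest()[:12]

def handle_webhook(headers: dict, body: bytes, redis, secret: bytes):
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, headers.get("X-Signature", "")):
        return 403, "bad signature"  # rejected before any parsing
    # SET NX EX is atomic: only the first delivery of an event id wins.
    if not redis.set(f"ci:webhook:{headers['X-Event-Id']}", 1,
                     nx=True, ex=DEDUPE_TTL_S):
        return 200, "duplicate delivery"  # idempotent replay
    return 202, submit_cl(body)
```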
5 Data Schema #
Storage engines — per-table justification
| Table | Engine | Why |
|---|---|---|
| `test_run` (fact) | BigQuery (or ClickHouse on-prem) | 10s of billions of rows; columnar scan for flake-model training; partition by day, cluster by test_id |
| `test_run_hot` (last 7d) | Bigtable / Cassandra | Point lookup by (test_id, ts) for the flake scorer at < 10 ms |
| `test_metadata` | Spanner (or CockroachDB / Vitess-MySQL) | OLTP writes on quarantine state; strong consistency required to avoid split-brain quarantine |
| `dep_graph_snapshot` | Content-addressed blob store (S3/GCS) + SQL index | Immutable, Merkle-keyed; content-dedupe across CLs |
| `notification_policy` | Spanner | Low-volume, consistent reads |
| `ownership` | Spanner + in-memory cache | Derived from CODEOWNERS; rebuilt on merge; cached per dep-analyzer pod |
| `artifacts` | Object store + index in Spanner | Immutable, content-addressed |
Schema definitions (PostgreSQL/Spanner DDL style)
-- Facts: immutable append-only
CREATE TABLE test_run (
run_id BYTES(16) PRIMARY KEY, -- UUIDv7 (time-ordered)
cl_id STRING(40) NOT NULL,
repo STRING(256) NOT NULL,
sha STRING(40) NOT NULL,
test_id STRING(512) NOT NULL, -- bazel-label format: //foo/bar:test
target_hash BYTES(32) NOT NULL, -- content-hash of inputs (cache key)
start_ts TIMESTAMP NOT NULL,
end_ts TIMESTAMP,
result STRING(16), -- PASS | FAIL | TIMEOUT | ERROR | SKIPPED
attempt INT64 NOT NULL, -- 1-indexed
runner_id STRING(64), -- executor node ID
sandbox_id STRING(64), -- for post-mortem
exit_code INT64,
duration_ms INT64,
mem_peak_mb INT64,
retry_reason STRING(64), -- NULL | FLAKY | INFRA | USER_MANUAL
cache_outcome STRING(16), -- MISS | HIT_LOCAL | HIT_REMOTE
flake_seed_id STRING(32), -- seed variation across retries
logs_uri STRING(512), -- pointer to object store
) PARTITION BY DATE(start_ts);
CREATE INDEX idx_test_run_by_test ON test_run(test_id, start_ts DESC);
CREATE INDEX idx_test_run_by_cl ON test_run(cl_id, test_id);
-- Mutable per-test state
CREATE TABLE test_metadata (
test_id STRING(512) PRIMARY KEY,
owners ARRAY<STRING(128)>, -- team IDs
quarantine_state STRING(16), -- NONE | AUTO | MANUAL | RELEASING
quarantine_since TIMESTAMP,
quarantine_reason STRING(512),
quarantine_ticket STRING(128), -- JIRA link
flake_score FLOAT64, -- cached Wilson lower bound
flake_class STRING(16), -- STABLE|SUSPECT|FLAKY|BROKEN
flake_updated_at TIMESTAMP,
runs_7d INT64,
fails_7d INT64,
last_run_at TIMESTAMP,
last_pass_at TIMESTAMP,
last_fail_at TIMESTAMP,
consecutive_fails INT64, -- signal for QUARANTINED-now-real-failing
tags ARRAY<STRING(64)>, -- e.g., "flaky-historically", "network-dependent"
);
-- Dep graph snapshots (content-addressed)
CREATE TABLE dep_graph_snapshot (
sha STRING(40) PRIMARY KEY,
merkle_root BYTES(32) NOT NULL, -- Merkle root of BUILD + sources
blob_uri STRING(512) NOT NULL, -- pointer to compressed snapshot
parent_sha STRING(40),
delta_uri STRING(512), -- forward delta from parent
created_at TIMESTAMP,
);
-- Ownership (derived from CODEOWNERS at each main-branch sha)
CREATE TABLE ownership (
path_glob STRING(512) NOT NULL, -- e.g., "/srv/payments/**"
team STRING(128) NOT NULL,
priority INT64 NOT NULL, -- for tiebreak (nearest-ancestor wins)
valid_from_sha STRING(40),
valid_to_sha STRING(40), -- NULL = current
PRIMARY KEY (path_glob, valid_from_sha)
);
-- Notification policy
CREATE TABLE notification_policy (
team STRING(128) PRIMARY KEY,
presubmit_channel STRING(128), -- slack channel
postsubmit_channel STRING(128),
page_escalation STRING(128), -- pagerduty service id
batch_window_sec INT64 DEFAULT 300,
severity_floor STRING(16), -- min severity to notify
quiet_hours STRING(64), -- cron-like
allow_mention_author BOOL DEFAULT TRUE,
);
-- Quarantine audit log (tamper-evident)
CREATE TABLE quarantine_event (
event_id BYTES(16) PRIMARY KEY,
test_id STRING(512),
action STRING(16), -- QUARANTINE | DEQUARANTINE | OVERRIDE
actor STRING(128), -- system | user-id
ts TIMESTAMP,
reason STRING(512),
evidence_json JSON, -- stats snapshot
prev_hash BYTES(32), -- hash-chain for tamper detection
this_hash BYTES(32),
);
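A sketch of how prev_hash / this_hash are maintained and verified; the canonical-JSON serialization is an assumption for illustration:

```python
# Tamper-evidence sketch for quarantine_event: each row commits to its
# predecessor's hash, so any edit, deletion, or reorder breaks verification.
import hashlib
import json

def chain_hash(prev_hash: bytes, event: dict) -> bytes:
    payload = json.dumps(event, sort_keys=True).encode()  # canonical form
    return hashlib.sha256(prev_hash + payload).digest()

def verify_chain(rows: list[tuple[dict, bytes, bytes]]) -> bool:
    """rows ordered by ts, as (event, prev_hash, this_hash) triples."""
    for event, prev_hash, this_hash in rows:
        if chain_hash(prev_hash, event) != this_hash:
            return False  # tampering detected
    return True
```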
Query API (examples)
-- "is test X flaky?" — O(1) from test_metadata
SELECT flake_class, flake_score, runs_7d, fails_7d
FROM test_metadata WHERE test_id = @test_id;
-- "show flakes I own" — drives dashboards & on-call handoff
SELECT test_id, flake_score, last_fail_at
FROM test_metadata
WHERE 'team-payments' IN UNNEST(owners)
AND flake_class IN ('FLAKY', 'SUSPECT')
ORDER BY flake_score ASC; -- most flaky first
-- "has this test regressed since quarantine?" — the critical anti-masking query
SELECT COUNT(*) AS postquar_fails
FROM test_run r
JOIN test_metadata m USING (test_id)
WHERE r.test_id = @test_id
AND r.start_ts > m.quarantine_since
AND r.result = 'FAIL'
AND r.attempt = 1; -- first attempts only; ignore retries
6 System Diagram — CENTERPIECE #
┌──────────────────────────────────────────────┐
│ DEVELOPERS │
│ (10K eng · 100K CL/day · 1K teams) │
└────┬───────────────────────────┬─────────────┘
│ git push │ UI polls
▼ ▼
┌─────────────────────┐ ┌────────────────────────┐
│ VCS (Gerrit / │ │ CI UI (Next.js) │
│ GitHub Enterprise)│ │ get_status (gRPC-Web)│
└─────┬───────────────┘ └─────────┬──────────────┘
│ webhook (REST, HMAC-signed) │
│ ≤ 3 CL/s peak, 100K/day │ 12 QPS avg
▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ REST GATEWAY (Envoy + auth filter) │
│ - HMAC verify - rate limit - TLS termination - OAuth for UI │
└──────────────────────────────┬──────────────────────────────────────────┘
│ gRPC, X-Event-Id dedupe, mTLS (SPIFFE)
▼
┌──────────────────────────────────────────────────────────────────────────┐
│ ci-coordinator (5 replicas, stateless) │
│ State machine: NEW → ANALYZING → QUEUED → RUNNING → {PASSED|FAILED| │
│ FLAKY_RETRYING} │
│ Persists CL lifecycle in Spanner (1 row/CL, indexed by (repo, sha)) │
└────────┬──────────────────────────┬───────────────────┬──────────────────┘
│ AffectedTargets RPC │ Schedule RPC │ emit status
│ (avg 12ms, p99 180ms) │ │ (fan-out to UI)
▼ ▼ ▼
┌──────────────────────┐ ┌─────────────────────┐ ┌─────────────────────┐
│ dep-analyzer │ │ scheduler │ │ status-pubsub │
│ (5 replicas) │ │ (3 replicas HA) │ │ (Redis pub/sub │
│ │ │ │ │ or Kafka) │
│ 1. Load parent │ │ - Priority queue │ └─────────────────────┘
│ dep_graph │ │ - Bin-pack onto │
│ (content-hash) │ │ executor shards │
│ 2. Compute delta │ │ - Backpressure │
│ Merkle tree │ │ from exec pool │
│ 3. Walk reverse-dep │ │ - Weighted fair │
│ graph → affected │ │ (presubmit > post)│
│ 4. Return target │ └──────┬──────────────┘
│ list + Bazel │ │ RunTask gRPC (streaming bidi)
│ digests │ │ up to ~20K cores peak
└──────────────────────┘ ▼
┌────────────────────────────────────────────┐
│ EXECUTOR POOL (autoscaled) │
│ 2K–20K workers across 5 cells (zones) │
│ Each: gVisor/runc sandbox, cgroup-limited │
│ hermetic FS overlay, no network by │
│ default, content-addressed inputs │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ worker-1 │ │ worker-2 │ │ worker-N │ │
│ │ gVisor │ │ gVisor │ │ gVisor │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
└────────┼─────────────┼──────────────┼──────┘
│ action-cache lookup │
│ (Bazel RE API, CAS) │
▼ │
┌───────────────────────────┐ │
│ REMOTE CACHE (CAS) │ │
│ - Content-addressed │ │
│ - Regional replicas │◄─────────┘
│ - ~85% hit rate target │
│ - 100 TB, LRU + pinning │
└───────────┬───────────────┘
│ on miss: execute + write back
▼
┌──────────────────────┐
│ result-ingest │
│ (Kafka producer) │
│ writes: test_run │
└──────┬───────────────┘
│ Kafka topic: test.run.results
│ ~200 msg/s avg, 600 peak
│ partitioned by test_id
▼
┌────────────────────────────┼──────────────────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────────┐ ┌─────────────────────────────┐ ┌──────────────────────┐
│ history-svc │ │ FLAKE CLASSIFICATION │ │ post-submit │
│ writes to: │ │ PIPELINE │ │ bisector │
│ - Bigtable (hot) │ │ (see sub-diagram below) │ │ (on main-break) │
│ - BigQuery (cold) │ └──────────────┬──────────────┘ └──────┬──────────────┘
└─────────────────────┘ │ │
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ quarantine-svc │ │ bisect-controller │
│ - state machine │ │ - git-bisect auto │
│ - audit log │ │ - re-runs at mid │
│ - dequar scheduler │ │ commits │
└─────────┬───────────┘ └─────────┬───────────┘
│ │
└────────────┬─────────────┘
│ FAIL / QUARANTINE / REGRESSION event
▼
┌──────────────────────────────────────┐
│ notif-router │
│ 1. ownership-resolver RPC │
│ (CODEOWNERS + blame-walk) │
│ 2. severity tiering │
│ 3. batching window (5 min default) │
│ 4. dedupe by (team, test, cl_day) │
│ 5. channel fanout │
└────┬────────────┬────────────┬───────┘
▼ ▼ ▼
┌──────┐ ┌──────┐ ┌────────────┐
│ Slack│ │Email │ │PagerDuty │
│ API │ │ SMTP │ │(post-submit│
└──────┘ └──────┘ │ only) │
└────────────┘
┌──────────────────────────────────────────────────────────────┐
│ ARTIFACT / DEPLOY PATH │
│ │
│ executor pool ──► artifact-svc (sign, SLSA attestation) │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ artifact registry │ │
│ │ (OCI, signed, CAS) │ │
│ └────────┬────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ deploy-gate │ │
│ │ (quality score) │ │
│ └────────┬────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────┐ │
│ │ deploy-orch │ │
│ │ (Spinnaker/Argo) │ │
│ │ - blue/green │ │
│ │ - canary + SLO │ │
│ │ - auto-rollback │ │
│ └─────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Sub-diagram: flake classification pipeline
Kafka topic: test.run.results (partitioned by test_id, 64 partitions)
│
├──────────────────────────────────────────┐
▼ ▼
┌───────────────────────┐ ┌────────────────────────────┐
│ streaming-scorer │ │ batch-trainer │
│ (Flink/Dataflow) │ │ (BigQuery + Spark nightly)│
│ │ │ │
│ Per test_id window: │ │ - Fits richer model: │
│ - 30d rolling stats │ │ · features: test dur, │
│ - Wilson lower │ │ mem spike, time of │
│ bound of pass- │ │ day, shard id, │
│ rate │ │ runner class, host │
│ - EWMA of fail-rate │ │ pressure, network │
│ - attempt-1 vs │ │ isolation state │
│ retry delta │ │ · gradient-boosted │
│ - consecutive fails │ │ tree → P(flake|ctx) │
│ since quarantine │ │ - Outputs model blob to │
│ │ │ S3; streaming-scorer │
│ Emits: │ │ hot-reloads hourly │
│ - update to │ │ - Labels from human │
│ test_metadata │ │ JIRA triage + bisect │
│ (Spanner upsert) │ │ ground truth │
│ - FLAKE_DETECTED │ └────────────────────────────┘
│ event to Kafka │
│ - REGRESSION_UNDER_ │
│ QUARANTINE event │
└───────────────────────┘
│
▼
quarantine-svc (decision policy engine)
notif-router (consumes events)
Diagram legend: every arrow has a contract
| # | Arrow | Protocol | QPS | Payload |
|---|---|---|---|---|
| 1 | VCS → Gateway | HTTPS webhook, HMAC | 3/s peak | ~2 KB JSON |
| 2 | Gateway → coordinator | gRPC, mTLS | 3/s peak | SubmitCLRequest |
| 3 | coordinator → dep-analyzer | gRPC | 3/s peak | (repo, sha, base_sha) |
| 4 | dep-analyzer → dep_graph store | object read | 3/s peak | ~10 MB compressed Merkle delta |
| 5 | coordinator → scheduler | gRPC | 3/s peak | target list (avg 50, p99 5K) |
| 6 | scheduler → executor | gRPC bidi stream | 630/s peak | RunTask (inputs digests, limits) |
| 7 | executor → CAS | REST, CAS protocol | high | content-addressed get/put |
| 8 | executor → result-ingest | gRPC | 210/s avg | TestRunResult proto, ~4 KB |
| 9 | result-ingest → Kafka | Kafka producer | 210/s avg | same, partitioned by test_id |
| 10 | Kafka → streaming-scorer | Kafka consumer | 210/s avg | |
| 11 | streaming-scorer → Spanner | gRPC upsert | ~50/s (after windowing) | test_metadata row |
| 12 | streaming-scorer → Kafka (events) | producer | ~0.5/s | FLAKE_DETECTED / REGRESSION |
| 13 | quarantine-svc → Spanner | gRPC | low | test_metadata update + audit row |
| 14 | notif-router → ownership-resolver | gRPC | 1/s (batched) | paths + sha |
| 15 | notif-router → Slack/PD | HTTPS | ~0.01/s | batched notification |
| 16 | executor → artifact-svc | gRPC | varies | signed build artifacts |
| 17 | deploy-gate → deploy-orch | webhook | 10K/day | quality score + attestation |
7 Deep-Dives (3 critical topics) #
7.1 Incremental build & test selection via target graph
Why critical. At 2M targets, running "everything" per CL is physically impossible — 800K test targets × 30s = 6,700 core-hours = $3,400/CL at spot rates. The entire economic viability of monorepo CI hinges on conservatively computing the minimum affected set. Incorrect pruning either (a) misses a regression (catastrophic) or (b) over-selects and burns budget.
Alternatives — quantified.
| Option | How it computes "affected" | False-negative risk | Scale limit | Adoption |
|---|---|---|---|---|
| Path-based (e.g., Jenkins Pipeline `when { changeset }`) | Regex on changed files → pipeline branches | HIGH. Misses transitive deps. | ~1K targets | Low (legacy) |
| Static call-graph + IDE indexer (e.g., Kythe) | Compile-time symbol graph | Medium. Reflection / DI / codegen miss edges. | 100K targets | Research |
| BUILD-graph + content-hashed inputs (Bazel / Buck2 / Pants) | Declared deps in BUILD files; Merkle tree of inputs; action cache keyed on digest | LOW iff rules are hermetic | 10M+ targets (Google scale) | High (Google, Meta, Stripe, Dropbox, Spotify use it) |
| ML-predicted TIA (test impact analysis, Microsoft paper) | Feature model of changed files → predicted failing tests | HIGH. Probabilistic; used only as hint. | Unbounded | Supplementary (Microsoft, Facebook TestInsights) |
| Dynamic call graph (coverage-recorded) | Record per-test coverage; on change, re-run tests covering changed lines | Medium. Initial state of coverage becomes stale. Reflection-safe. | 1M targets | Niche (Facebook Predictive Test Selection uses as one input) |
Chosen: BUILD-graph + Merkle/CAS (Bazel-style), ML-TIA as hint only. Justification:
- Correctness invariant: if rules are hermetic (declared inputs are complete), the Merkle-tree digest of a target's inputs deterministically identifies its output. Cache hit ⇔ equivalent execution. This is the one class of system where "cache" is correctness, not optimization.
- Why Bazel over Buck2: Buck2 (Meta, 2023 OSS) has a theoretically superior Starlark evaluator (daemon-less, parallel) and better DX, but in 2026 Bazel's remote execution ecosystem (BuildBuddy, EngFlow, NativeLink) is mature and the cross-lang support (C++, Java, Python, Go, Rust, JS) is 3 years ahead. At 10K-eng scale, tooling maturity dominates. I'd reevaluate in 2–3 years.
- Why not Pants v2 or Gradle: Pants is excellent for Python-heavy orgs but weaker C++/Java. Gradle's configuration phase is not content-addressable out of the box, breaking cache correctness. Rejected.
Earned-secret depth: the cache-correctness trap.
The hermeticity invariant looks simple ("declare all inputs") but fails in practice in ~5 classes of bugs, each of which silently corrupts the cache:
- Nondeterminism in compilers and runtimes — e.g., `__DATE__`, goroutine scheduling, Go's randomized map iteration order. Mitigation: `SOURCE_DATE_EPOCH`, `-frandom-seed=`, and reproducible-builds checks in CI.
- Absolute path leakage — paths in debug info (`DW_AT_comp_dir`). Mitigation: `--remap-path-prefix`, the `strip-nondeterminism` tool.
- File system ordering — `glob` order differs across platforms. Mitigation: sorted globs + `.bazelrc` canonicalization.
- Toolchain drift — developer-local vs remote compiler. Mitigation: toolchain containers, `--incompatible_enable_cc_toolchain_resolution`.
- Environment variables — mtime, hostname, locale. Mitigation: `--incompatible_strict_action_env`, sealed env allowlist.
At Google, a 2019 internal audit found ~0.3% of targets were non-hermetic and occasionally poisoned the cache. The mitigation was a cache-poisoning detector: re-execute 1% of cache hits and compare outputs; mismatch triggers a cache invalidation + rule-author notification. I'd build this from day 1 (cost: 1% extra compute; value: catastrophic-bug prevention).
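A sketch of that detector; `execute`, `invalidate`, and `notify` are hypothetical hooks, and the digest comparison is illustrative:

```python
# Cache-poisoning detector sketch: re-execute ~1% of cache hits and compare
# output digests bit-for-bit. A mismatch means a non-hermetic rule.
import hashlib
import random

def audit_cache_hit(action, cached_blobs: list[bytes],
                    execute, invalidate, notify,
                    sample_rate: float = 0.01) -> None:
    if random.random() >= sample_rate:
        return  # not sampled; trust the hit
    fresh_blobs = execute(action)  # hermetic re-execution, bypassing the cache
    digest = lambda blobs: hashlib.sha256(b"".join(sorted(blobs))).hexdigest()
    if digest(cached_blobs) != digest(fresh_blobs):
        invalidate(action.cache_key)  # evict the poisoned entry
        notify(action.rule_author,
               f"non-hermetic rule {action.label}: outputs differ on re-execution")
```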
Test selection correctness proof sketch.
Let T = set of test targets. Let reverse_deps(x) = transitive reverse deps of node x. For a CL touching file set F:
affected = ⋃_{f ∈ F} reverse_deps(target_of(f)) ∩ T
This is sound (no missed affected target) iff the BUILD graph is complete (hermeticity). It's tight modulo runtime-only deps (reflection, DI) — those are caught by the post-submit full suite and by probabilistic ML-TIA as a sanity check.
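The formula translated directly into a reverse-dependency BFS; all names here are hypothetical (real Bazel does this inside its query engine):

```python
# Affected-set sketch: BFS over reverse edges from every changed file's
# owning target, then intersect with the test-target set T.
from collections import deque

def affected_tests(changed_files: set[str],
                   target_of: dict[str, str],          # file -> owning target
                   reverse_deps: dict[str, set[str]],  # target -> direct rdeps
                   test_targets: set[str]) -> set[str]:
    # Files with no owning target (new files, BUILD edits) need conservative
    # handling, omitted here.
    seeds = {target_of[f] for f in changed_files if f in target_of}
    seen, queue = set(seeds), deque(seeds)
    while queue:
        node = queue.popleft()
        for rdep in reverse_deps.get(node, ()):
            if rdep not in seen:
                seen.add(rdep)
                queue.append(rdep)
    return seen & test_targets  # sound iff the declared graph is complete
```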
Failure modes & mitigations.
| Failure | Detection | Mitigation |
|---|---|---|
| Non-hermetic rule pollutes cache | Random 1% re-execution mismatch | Auto-invalidate + quarantine rule, notify author |
| BUILD file drift from source | Dep-lint on every merge | Block merge on missing dep |
| Merkle-tree computation bug | Reproducible-build nightly job | CI-of-CI: compute digest twice, compare |
| Cache size bloat | Hit-rate dashboards per-cell | LRU with target-age weighting; pin hot deps |
| RBE latency spike | p99 sched latency metric | Multi-region failover; sched-level degradation mode |
Real systems named. Bazel + BuildBuddy RBE, Buck2 (Meta), Google TAP + ForgeCluster, Meta's Sapling + CI, Pants v2, BuildKit (Docker), Nix + Hydra (reference for content-addressable builds done right).
7.2 Flake detection statistics
Why critical. Naive "passed on retry = flaky" logic masks real regressions. A test that is truly broken by a CL will pass on retry only if the failure was transient (e.g., a race condition that manifests 50% of the time). Misclassifying it as flaky leads to quarantine, then to merge-then-regression. The single most common CI failure mode at > 1K-eng scale is flaky tests masking real bugs (Google's internal incident data shows this is 30–40% of escape-to-production bugs when flake classifier is weak).
Alternatives — quantified.
| Option | Mechanism | False-flake rate | False-stable rate | Data needed |
|---|---|---|---|---|
| "Passed on retry → flaky" | Boolean | HIGH (~10%) | HIGH (~5%) | 1 run |
| Simple pass-rate threshold (e.g., < 98% in last 7d) | Point estimate, no CI | Medium | Medium | 10+ runs |
| Wilson score lower bound (95% CI) on pass-rate | Bayesian-adjacent | LOW (<1%) with n ≥ 50 | Low | 50+ runs |
| EWMA of fail-rate + Bayesian prior | Time-weighted, accommodates drift | Low | Low | 30+ runs over time |
| Reproduction protocol (run N times w/ different seeds) | Direct experimental evidence | Very low | Very low | N extra runs per suspect |
| ML classifier (gradient-boosted features) | Context-aware (host, time, load) | Low when trained | Medium cold start | Labeled training set |
| TAP CulpritFinder-style bisection | Re-run at bisected commits to isolate flake vs regression | Very low | Very low | 10–15 extra runs on post-submit |
Chosen: layered approach.
Layer 0 (streaming, per run): compute Wilson lower bound + EWMA over 30d window
Layer 1 (retry decision): retry only if classification ∈ {FLAKY, SUSPECT}
Layer 2 (post-submit break): CulpritFinder bisect to distinguish flake from regression
Layer 3 (auto-quarantine): require n ≥ 50 samples AND Wilson_LB < 0.95 AND reproduced on seed-varied rerun
Layer 4 (batch trainer, nightly): ML model refines priors with features; feeds back to Layer 0 as a prior
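A sketch of the Layer 1 decision in code, assuming the FlakeClassification enum from §4; the retry budget and cold-start floor are illustrative:

```python
# Layer-1 retry policy sketch: retry only when history says the test itself
# is unreliable. A STABLE test failing is evidence of a real regression.
from enum import Enum

class FlakeClass(Enum):
    STABLE = "STABLE"
    SUSPECT = "SUSPECT"
    FLAKY = "FLAKY"
    BROKEN = "BROKEN"
    QUARANTINED = "QUARANTINED"

MAX_ATTEMPTS = 3  # bounded retries per FR-6

def should_retry(classification: FlakeClass, attempt: int, n_samples_30d: int) -> bool:
    if attempt >= MAX_ATTEMPTS:
        return False
    if n_samples_30d < 10:
        return False  # cold start: treat the failure as genuine (see failure modes)
    return classification in (FlakeClass.SUSPECT, FlakeClass.FLAKY)
```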
Earned-secret depth: why Wilson, and why n≥50.
The Wilson score interval is a well-calibrated confidence interval for a binomial proportion even at small n. For k failures in n runs, with pass-rate p̂ = (n−k)/n, the 95% CI lower bound is:
p̂ + z²/(2n) − z · √(p̂(1−p̂)/n + z²/(4n²))
L = ───────────────────────────────────────────
1 + z²/n
with z = 1.96.
Worked example (the insight the interviewer is probing for):
- n=10, k=2 → p̂ = 0.80, Wilson LB ≈ 0.49. Conclusion: the true pass-rate could plausibly be anywhere from ~49% to ~94%; far too few samples to classify as flaky.
- n=50, k=1 → p̂ = 0.98, Wilson LB ≈ 0.89. Still not confident it's below the 0.95 threshold. Needs reproduction.
- n=100, k=5 → p̂ = 0.95, Wilson LB ≈ 0.89. Now we can classify as SUSPECT.
- n=500, k=30 → p̂ = 0.94, Wilson LB ≈ 0.92. Classify as FLAKY.
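A minimal sketch that reproduces these bounds (same formula, z = 1.96):

```python
# Wilson score lower bound for a binomial pass-rate; reproduces the worked
# examples above.
from math import sqrt

def wilson_lower_bound(passes: int, n: int, z: float = 1.96) -> float:
    if n == 0:
        return 0.0  # no data: no confidence in any pass-rate
    p_hat = passes / n
    center = p_hat + z * z / (2 * n)
    margin = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return (center - margin) / (1 + z * z / n)

for n, k in [(10, 2), (50, 1), (100, 5), (500, 30)]:
    print(f"n={n:3d}, k={k:2d} -> LB={wilson_lower_bound(n - k, n):.3f}")
# n= 10, k= 2 -> LB=0.490
# n= 50, k= 1 -> LB=0.895
# n=100, k= 5 -> LB=0.888
# n=500, k=30 -> LB=0.916
```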
This is why "passed on retry" is insufficient: n=2 is meaningless. Google's TAP requires n ≥ 50 over time before auto-quarantine, and always runs a seed-varied reproduction (different shard, different host, different time-of-day) to prevent a real regression from being misclassified during a peak-load window.
The masking problem — the critical invariant.
A test can be flaky and regressed simultaneously. Classic trap: test has historical fail-rate 3% (flaky, quarantined), then a CL truly breaks it (post-CL fail-rate 80%). If we only look at overall pass-rate, the flaky label persists. Fix: track pre-CL vs post-CL pass-rate for every quarantined test as a separate bucket. Alert when post-CL fail-rate diverges significantly (χ² test p<0.01) from pre-CL.
SELECT
test_id,
SUM(CASE WHEN start_ts < q.quarantine_since THEN 1 ELSE 0 END) AS pre_n,
SUM(CASE WHEN start_ts < q.quarantine_since AND result='FAIL' THEN 1 ELSE 0 END) AS pre_fail,
SUM(CASE WHEN start_ts >= q.quarantine_since THEN 1 ELSE 0 END) AS post_n,
SUM(CASE WHEN start_ts >= q.quarantine_since AND result='FAIL' THEN 1 ELSE 0 END) AS post_fail
FROM test_run r
JOIN test_metadata q USING (test_id)
WHERE q.quarantine_state != 'NONE'
GROUP BY test_id
HAVING chi_square_p_value(pre_n, pre_fail, post_n, post_fail) < 0.01;
Every row → paging event tagged REGRESSION_UNDER_QUARANTINE, which bypasses the normal batching and goes straight to PagerDuty.
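The chi_square_p_value above would be a UDF; a sketch of the same test using scipy, with an extra guard so we only page when the fail rate got worse, not merely different:

```python
# 2x2 contingency test behind the anti-masking query: pre- vs post-quarantine
# failure counts.
from scipy.stats import chi2_contingency

def regression_under_quarantine(pre_n: int, pre_fail: int,
                                post_n: int, post_fail: int,
                                alpha: float = 0.01) -> bool:
    if pre_n == 0 or post_n == 0:
        return False  # no baseline or no post-quarantine data yet
    _, p_value, _, _ = chi2_contingency([
        [pre_fail,  pre_n - pre_fail],    # pre:  fails, passes
        [post_fail, post_n - post_fail],  # post: fails, passes
    ])
    got_worse = post_fail / post_n > pre_fail / pre_n
    return p_value < alpha and got_worse

# 3% historical fail rate vs 80% after the suspect CL -> page:
assert regression_under_quarantine(pre_n=500, pre_fail=15, post_n=40, post_fail=32)
```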
CulpritFinder (TAP's approach, worth replicating).
When a test breaks on post-submit at head, and flake classifier says "flaky", the naive response is "mute and move on." TAP instead re-runs the test at the parent commit:
- If it passes at parent, it's a new regression (bisect to find the CL, page the author).
- If it also fails at parent, walk back until it passes. The "culprit CL" is bracketed.
- If it flakes at every re-run, it's confirmed flaky.
Cost: 10–15 extra runs on each suspected regression. Benefit: near-zero escape rate for regressions-disguised-as-flakes. Worth it at any scale where bug-to-prod cost > $1K.
Failure modes.
| Failure | Detection | Mitigation |
|---|---|---|
| Cold start: new test has no history | n < 10 on first failure | Treat as "genuine fail" by default; require 2 green runs before trusting |
| Flake rate drifts due to infra change | Step-change in EWMA across all tests | Reset classifier, notify CI SRE |
| Adversarial test author (CL tests with random sleep) | Correlation between author and flake score | SRE dashboard of per-author flake contribution |
| Classifier cache stale after retrain | Hot-reload staleness metric | Version model blob; scorer checks version on each read |
| Quarantine storm (many tests quarantined at once) | Rate-limit on quarantine events | Cap at 20 auto-quarantines/hour; excess goes to human review |
Real systems. Google TAP (internal) + CulpritFinder, Meta's Flaky-Test-Detective (FBIT), Mozilla's Flaky Test dashboards, Jenkins flaky-test-handler plugin (weak), Datadog CI Visibility (commercial), BuildBuddy flaky-tests feature.
7.3 Notification routing
Why critical. At 10K eng, "CI broke, notify the team" produces 1K notifications/day. Humans develop learned-helplessness and mute the channel. Precision > 95% is a psychological threshold below which the system destroys its own signal value.
Alternatives.
| Strategy | Precision | Recall | Latency | Complexity |
|---|---|---|---|---|
| "Notify author of CL" | HIGH for presubmit. LOW for post-submit head-breaks (often a merge race). | Misses owners who need visibility | Low | Trivial |
| CODEOWNERS (GitHub-style, path-based) | Medium. Struggles with shared files. | Good | Low | Medium |
| CODEOWNERS + nearest-ancestor fallback | High | Good | Low | Medium |
| git-blame walk to find last modifier of the failing assertion | High when line-precise | Can page wrong person on refactor | Medium | Medium |
| Hybrid: CODEOWNERS primary + blame for the failing line | Very high | Very high | Medium | High |
| LLM-inferred owner (classifier over commit history + PR reviewers) | Medium in 2026 (improving) | Medium | High | High |
Chosen: Hybrid CODEOWNERS + blame + hierarchy fallback.
Algorithm:
def resolve_owners(test_id, failing_assertion_line, cl):
    # owners is a list of (team, priority) pairs; higher priority = more actionable
    # 1. CODEOWNERS of the test file itself
    owners = codeowners_lookup(test_path(test_id))
    # 2. Augment with blame of the assertion line (if available)
    if failing_assertion_line:
        blame_author = git_blame(failing_assertion_line).author
        blame_team = team_of(blame_author)
        if blame_team and blame_team not in (team for team, _ in owners):
            owners.append((blame_team, 0.5))  # secondary priority
    # 3. On presubmit, always notify the CL author as the actionable party
    if cl.is_presubmit:
        owners.append((cl.author_team, 1.0))
    # 4. Nearest-ancestor fallback if empty
    if not owners:
        owners = nearest_ancestor_codeowners(test_path(test_id))
    # 5. Last resort
    if not owners:
        owners = [("platform-sre", 0.1)]
    return dedupe(owners)
Severity tiering (the anti-spam primitive).
| Severity | Trigger | Channel | Batching | Rate limit |
|---|---|---|---|---|
| INFO | Pre-submit fail on a test classified FLAKY, no regression signal | Slack thread on CL | Coalesce into CL status | N/A |
| WARNING | Pre-submit fail on STABLE test | Slack DM to author + team channel | 5-min window | 10/hr/team |
| ERROR | Post-submit main-branch break (non-flake) | Slack team channel + @here | 1-min window | 5/hr/team |
| PAGE | REGRESSION_UNDER_QUARANTINE, or main-branch break + deploy-gate fail | PagerDuty | No batch | No limit |
Earned-secret depth: the batching window is a correctness mechanism, not a kindness.
Without batching, a rollback-worthy main-branch break can generate 500 test failures (cascade of integration tests depending on the bad code). Each generates a notification → 500 pages → responder pages the wrong person because it's the first page that opened the incident → root-cause delayed.
Correct design:
- Event correlation window (1 min at post-submit, 5 min at presubmit) — group all failures sharing a `(sha, failure_signature)` prefix.
- Failure signature = hash of (normalized stack trace, failed assertion text). Tests failing with the same signature are likely one root cause.
- One notification per (team, signature, window), regardless of how many tests share the signature.
- Top-N summary in the notification ("147 tests failed with `AssertionError: db connection refused`; likely root cause: DB pool change in `db/pool.go` line 82").
Google's Critique system does this correlation; Meta's CI Observability does similar grouping.
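A sketch of the signature and coalescing key; the normalization rules are illustrative, not exhaustive:

```python
# Failure-signature sketch: normalize away run-specific noise, then hash.
# One notification per (team, signature, window) coalescing key.
import hashlib
import re

def failure_signature(stack_trace: str, assertion_text: str) -> str:
    norm = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", stack_trace)  # heap addresses
    norm = re.sub(r":\d+", ":LINE", norm)                    # line numbers drift
    norm = re.sub(r"/tmp/\S+", "/tmp/PATH", norm)            # sandbox temp paths
    return hashlib.sha256(f"{norm}\n{assertion_text}".encode()).hexdigest()[:16]

def coalescing_key(team: str, signature: str, now_s: int, window_s: int = 60) -> tuple:
    return (team, signature, now_s // window_s)  # 1-min window at post-submit
```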
Failure modes.
| Failure | Detection | Mitigation |
|---|---|---|
| Paging storm | Rate of notifications > 10× baseline | Auto-circuit-breaker: coalesce all into a single "CI incident" page; suspend flake-classifier retries; engage CI on-call |
| Wrong owner (CODEOWNERS out of date) | Feedback: "not me" button in Slack → ML signal | Weekly stale-CODEOWNERS audit; auto-PR to update |
| Author OOO | Calendar integration | Fallback to team rotation (PagerDuty on-call lookup) |
| Shared file with no clear owner | CODEOWNERS hit ratio per file | "Orphan file" dashboard; assign to platform team |
| Notification → action latency | Track time-to-ack | Alert on team-level p95 > 1h |
Real systems. Google Critique (code review + notification), Meta's CI Observability + Claim (acknowledge-a-failure UX), Datadog Events Explorer, PagerDuty with CODEOWNERS integration, GitHub's CODEOWNERS + Required Reviewers, Bazel's TestResult proto fields for structured failure info.
8 Failure Modes & Resilience #
Per-component table
| Component | Failure | Detection (time) | Blast radius | Mitigation | Recovery |
|---|---|---|---|---|---|
| VCS webhook | Lost delivery | 5-min "no CL since" watchdog + VCS delivery dashboards | Missed test runs for that sha | (1) VCS retry on 5xx (7 attempts, exp backoff); (2) pull-based reconciler scans for un-CI'd SHAs every 5 min | Replay webhook from VCS admin UI; reconciler auto-catches |
| ci-coordinator | Pod crash | gRPC health probe; state in Spanner | CL stuck in ANALYZING | Stateless pods behind HAProxy; next pod picks up via Spanner lease | Spanner row lease expires in 30s; auto-recover |
| Spanner / metadata | Regional outage | Client-side latency spike + error rate | Flake lookups fail → devs can't merge | Multi-region Spanner; degradation mode: read-only from secondary, skip flake classification (run all tests, no quarantine) | Manual promote; resume quarantine decisions when primary back |
| dep-analyzer | Incorrect affected set | CI-of-CI: reproducible build mismatch | Missed regressions merge to main | Dual-compute on 1% of CLs; alert on divergence; mandatory full suite on post-submit catches missed | Invalidate cache; rerun affected CLs; rare |
| Scheduler | Backlog explosion | Queue depth > threshold (10K) | Latency SLO violation | Shed load: drop manual reruns; pause post-submit; alert on-call; autoscale | Drain backlog with autoscale; manual reruns return when queue drains |
| Executor node | Poisoned state (stale kernel, leaked processes) | Health check: per-node flake rate vs peer mean > 3σ | Falsely fails tests until quarantined | Auto-drain node on elevated flake rate; reimage from golden image | Reprovision; re-admit after clean run |
| Remote cache | Cache poisoning (non-hermetic result cached) | 1% re-execution audit | Bad output until cache entry expires | Content-addressed invalidation by target_hash prefix; quarantine rule author | Rebuild from source on invalidated keys |
| Kafka (result bus) | Broker partition loss | Consumer lag | Flake stats go stale | Replication factor 3 + min-ISR 2; consumer offset checkpoints in Redis | Kafka self-heals; consumers resume from checkpoint |
| Flake classifier | Classifier down | Staleness of test_metadata.flake_updated_at | Retry decision falls back to "no retry" (safe default) | Cache last-known classification in scheduler's local LRU | Restart classifier; recompute from hot store |
| Quarantine svc | False-positive quarantine of real regression | Pre/post quarantine divergence chi² test | Real bug ships | Block merge if any non-flake failing; regression-under-quar goes to PAGE severity | Manual dequarantine + bisect |
| notif-router | Notification storm | Notif rate > 10× baseline | Pager fatigue, real alerts missed | Auto-coalesce to incident page; suspend retries | Incident commander resolves; notif-router drains queue |
| Ownership resolver | Stale CODEOWNERS after refactor | Notification feedback ("not me") rate | Wrong team paged | Weekly auto-audit PR; UI "reassign" button feeds back to ML | Merge ownership fix; rebuild cache |
| Deploy gate | False green signal | Prod canary SLO burn | Bad code deploys | Multi-dim signal (test pass + perf delta + coverage); canary stage catches it | Auto-rollback on SLO burn; CI post-mortem |
| Artifact signer | Key compromise | Signing-key audit log anomaly | Unauthorized artifact release | KMS-backed ephemeral signing keys (per-run); SLSA provenance | Rotate KMS root; invalidate affected artifacts |
| Merge race | Two CLs pass presubmit independently; together break main | Post-submit break within 10 min of merge | Brief main-branch break | Rebase-on-merge + post-merge re-verify (TAP pattern); "merge queue" serialization for hot paths | Bisect; revert CL; flake classifier distinguishes from real flake |
Systemic resilience patterns
- Bulkheads: presubmit and post-submit pools are physically separate — a post-submit overload never blocks devs.
- Backpressure: scheduler rejects (with retryable 429s) when the executor pool is > 85% saturated; coordinator holds the CL in `QUEUED` with a user-visible queue position.
- Graceful degradation modes, declared explicitly:
  - `MODE_NORMAL`: full pipeline
  - `MODE_NO_FLAKE_CLASSIFIER`: classifier down → run all retries once (no quarantine, no skips); throughput unchanged, cost ~10% up
  - `MODE_NO_CACHE`: remote cache down → run all targets from scratch; cost 7× up, SLO breached; SEV-2 auto-filed
  - `MODE_BREAK_GLASS`: entire CI offline → manual merge with author attestation, logged for post-hoc audit
- Chaos drills monthly: simulate each degradation mode in pre-prod shadow fleet.
9 Evolution Path #
v1 — Minimum viable (0–6 months, handles 100 eng / 1K CL/day)
- Single-queue Jenkins or GitHub Actions running full suite per CL.
- Git-triggered; no caching; no flake detection.
- All failures → email to author + team channel.
- Manual quarantine via a YAML file in repo.
When this breaks: queue depth grows superlinearly past ~500 CL/day; p95 CI time exceeds 1 hour; developers start reaching for [skip ci] and learned helplessness sets in.
Cost at v1 scale: ~$50/CL (no caching). Acceptable only < 500 CL/day.
v2 — Sharded with incremental build (6–18 months, 1K–5K eng / 10K CL/day)
- Adopt Bazel + BuildBuddy RBE.
- Affected-target compute with a dep-analyzer service.
- Remote CAS cache; target 60% hit rate initially → 80% at 12 months.
- Flake detection: Wilson score, threshold-based auto-quarantine.
- CODEOWNERS-based routing; batching via Slack.
- Jenkins → Tekton or Buildkite for pipeline orchestration.
- Storage: test_run in Postgres initially (adequate into the low billions of rows); split to BigQuery at ~5B.
When this breaks: ~10K CL/day, Postgres rows exceed 1B/yr, dep-analyzer cold-start dominates p95.
v3 — Global TAP-scale CI (18+ months, 10K+ eng / 100K CL/day — this design)
- Dedicated executor pool with hermetic sandboxes (gVisor + cgroups + net namespaces).
- Bi-modal storage: BigQuery cold + Bigtable hot + Spanner OLTP.
- Streaming flake classifier (Flink/Dataflow) + nightly ML retrain.
- CulpritFinder auto-bisect on post-submit regressions.
- Deploy-gate with multi-dimensional quality score integrated into canary.
- Multi-region failover; continuous integration model (post-submit-driven; devs merge to a virtual queue like Google TAP).
- SLSA L3 artifact attestation; SOX/SOC2 audit trail.
What's next (v4):
- Speculative execution: based on the author's historical review-to-merge rate, start CI speculatively before the CL is opened for review.
- LLM-assisted failure triage: GPT-4 or Claude-Opus classifies failure logs, proposes a root cause, tags likely-owner files. Already in pilot at multiple orgs in 2026.
- Test generation from coverage gaps + mutation testing to improve flake-distinguishability.
- Federated CI across multi-repo with cross-repo dep graph (for "logical monorepo, physical polyrepo" orgs).
10 Out-of-1-Hour Notes (L7 extras) #
10.1 Hermetic sandboxing — defense-in-depth
- gVisor (userspace kernel, Google) for strong syscall isolation at modest perf cost (~10–15% CPU overhead vs runc); protects the host from a malicious test.
- cgroups v2 per-run CPU / memory / IO quotas; a runaway test can't starve neighbors.
- pid namespaces so a test can't see peer processes (`unshare(CLONE_NEWPID)`).
- Network namespaces: no network by default; tests declare `requires_network: true` to get a scoped firewall. Network access is the #1 source of flakes; enforcing isolation kills ~30% of the flake population, at the cost of exposing tests that depended on external services (which should have been stubbed anyway).
- Filesystem overlays: inputs mounted read-only from CAS, outputs to a dedicated tmpfs; no leakage between runs.
- Mount namespaces + seccomp-bpf to block dangerous syscalls (ptrace, mount, reboot).
- UID mapping via user namespaces so tests run as root-in-sandbox but nobody outside.
10.2 Content-addressed cache correctness
- Action cache key = `SHA-256(serialize(command, args, env, input_digests_merkle_root, toolchain_digest))` (sketch after this list).
- Output cache = blob store keyed by each output file's content hash.
- Mandatory: reproducible-builds audit (random 1% re-execution, diff outputs bit-for-bit). Alert on any divergence.
- Cache pinning: hot targets (platform libraries) pinned with TTL = ∞; cold targets expire LRU.
- Multi-layer cache: worker-local SSD → zonal → regional → global CAS. Targets 85% hit rate with p99 < 80 ms.
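The first bullet as a runnable sketch; the JSON serialization is illustrative (Bazel's real key is a protobuf Action digest):

```python
# Action-cache key sketch: every input that can change the output is part
# of the key, in a canonical serialization.
import hashlib
import json

def action_cache_key(command: list[str], env: dict[str, str],
                     input_merkle_root: str, toolchain_digest: str) -> str:
    payload = json.dumps({
        "command": command,
        "env": sorted(env.items()),  # sealed allowlist, canonical order
        "inputs": input_merkle_root,
        "toolchain": toolchain_digest,
    }, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()
```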
10.3 Bisect automation on post-submit failures
- Triggered by a `POSTSUBMIT_FAIL` event when the flake classifier says "not flaky."
- Binary search between last-green-sha and failing-sha.
- For each midpoint: launch a bisect run of the failing test only (cheap).
- Log2(N) runs to find culprit, where N = CLs between greens. At 100 CL/hr merge rate and 10-min breakage detection, N ≈ 17 → ~5 runs → culprit in ~1 hr.
- Culprit author gets an `ERROR`-severity notification with "your CL broke main; revert ready to merge."
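A sketch of the controller's midpoint search; `commits_between` and `run_test_at` are hypothetical hooks into the VCS and scheduler:

```python
# Bisect-controller sketch: classic first-failure binary search between the
# last green sha and the failing sha, O(log N) test runs.
def find_culprit(last_green: str, failing: str, test_id: str,
                 commits_between, run_test_at) -> str:
    commits = commits_between(last_green, failing)  # oldest -> newest; last fails
    lo, hi = 0, len(commits) - 1                    # invariant: commits[hi] fails
    while lo < hi:
        mid = (lo + hi) // 2
        # Each probe should itself be flake-guarded (e.g., best-of-3 on FLAKY tests).
        if run_test_at(commits[mid], test_id):      # test passes at mid
            lo = mid + 1                            # culprit is strictly after mid
        else:
            hi = mid                                # fails at mid: culprit <= mid
    return commits[lo]
```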
10.4 Test impact analysis vs static call graph
- Static call-graph TIA (Kythe-style) misses reflection, DI, code generation.
- Coverage-based TIA (Facebook PTS, Microsoft's paper) records per-test coverage, selects tests touching changed lines. Works well for integration tests; struggles with e2e (coverage too broad).
- BUILD-graph TIA (chosen in §7.1) is conservative/correct but coarse; pairs well with ML-TIA as a tighter probabilistic hint.
- Hybrid approach: BUILD graph for correctness (run this) + ML-TIA for priority (run these first for fail-fast signal). Google uses this combination.
10.5 Canary + rollback tie-in
- CI's `quality_score` is one input to the deploy gate, not the only one. Other inputs:
  - SLO burn-rate at canary (primary)
- Config drift check
- Manual approval for SEV-1 deploys
- Rollback triggers:
- Canary SLO burn > 2%/hr → auto-rollback
- Automated regression test failure in canary → hold + page deploy on-call
- Manual kill-switch
- Post-rollback CI forensics: artifacts + test_run rows from the rolled-back sha are tagged `rollback_culprit=true`, feeding into post-mortem tooling.
10.6 Regulatory audit trail (SOX compliance for change management)
- Every merge = change event with immutable log: (cl_id, author, reviewers, CI pass evidence, deployer, deploy time, canary metrics).
- Hash-chain the change-event log (like the `quarantine_event` schema) for tamper evidence.
- Quarterly SOX audit surfaces:
- Any deploy without green CI (must be break-glass with exec approval)
- Any quarantine override without ticket
- Any artifact without signed SLSA provenance
- Separation of duties: author of CL cannot deploy their own artifact to prod without a distinct approver (enforced in deploy-orch).
10.7 Cost per CI-minute
- Target: $0.005/CI-minute at 90% cache hit.
- Levers:
- Spot/preemptible instances for executor pool (60–70% discount; interruptions tolerated because test runs are idempotent and short).
- Multi-tenancy on executor hosts (gVisor makes this safe); pack tests by memory profile.
- Autoscaler tuned to CL arrival rate (statistical; not reactive).
- Regional CAS replicas co-located with executors (network cost dominates inter-region).
- Anti-pattern: flat-rate GPU pool for "ML tests" sitting idle 80% of the time. Instead, burst to a shared AI/ML cluster with preemption.
10.8 Observability of the CI system itself ("signal-to-noise dashboards")
A CI system that is not itself instrumented becomes a black box. Required dashboards:
- Precision panel: % of failures that led to a CL abandonment or fix-up commit. Target > 95% (false-fail rate < 5%).
- Flake leaderboard: top-50 tests by flake score, with on-call team.
- Cache hit rate per pool / per language / per team.
- Latency SLO burn: p50/p95 presubmit latency vs target, with burn-rate.
- Executor utilization — both CPU and memory, because test mixes skew.
- Notification precision: "clicked-to-debug" vs "muted" ratio per team.
- Regression-under-quarantine counter (the most critical single metric: any non-zero value is a potential escape).
- Cost per CL with breakdown (compute, cache, storage, network).
- CI-of-CI: reproducibility audit pass rate, bisect accuracy.
10.9 Agentic / LLM-era extensions (2026 relevance)
Given the candidate's Agentic AI background, frame these in the interview if asked:
- LLM-assisted failure triage: feed failure logs + diff to an LLM, produce (a) likely root cause, (b) suggested owner, (c) auto-generated fix PR for trivial cases.
- Agent sandbox isolation: same hermetic sandbox we use for test isolation generalizes to agent execution isolation — a useful cross-domain point.
- Privacy-preserving CI: if tests process PII fixtures, the sandbox's net-isolation + ephemeral filesystem are the same primitives used for privacy infra. Candidate can pivot to this if interviewer shows Privacy interest.
10.10 What I would defer if interviewer insists
If out of time, skip §7.3 (notif routing) in the interview — it's the most discussable but least architecturally load-bearing. Keep §7.1 (build graph) and §7.2 (flake statistics) because they contain the hardest L7 insights. Offer §7.3 as follow-up.
Verification checklist (done before submission) #
- SRE pager-carryable? Yes — on-call could carry this today: coordinator/Spanner/scheduler are well-understood building blocks; degradation modes defined; chaos drills specified.
- Every diagram arrow → real API/data flow? Yes — each arrow mapped to the numbered table under the diagram.
- Deep-dive L7 or L6? L7: Wilson score math with worked numbers, cache-correctness failure taxonomy, CulpritFinder reference, regression-under-quarantine detection as a distinct metric, failure-signature coalescing as correctness-not-kindness framing.
- Flake detection statistically rigorous? Yes — explicit rejection of "passed on retry = flaky"; n ≥ 50 sample floor; reproduction protocol; χ² test for regression-under-quarantine; bisection for final classification.
- Real systems named with rejection rationale? Yes — Bazel vs Buck2 vs Pants vs Gradle, Jenkins vs Tekton, Google TAP + CulpritFinder, Meta Sapling + FBIT, Datadog CI Vis, Spinnaker/Argo.
- BOE numbers calculated not asserted? Yes — executor core-days, cache-hit cost curve, notification fan-out after filtering, storage row count math.
- Cost envelope closed? Yes — $0.50/CL derives from 85% cache hit rate, which is justified by cache architecture (multi-tier + pinning).
End of solution.