Ship/No-Ship Decisions
Context
Your AI feature is “done.” The engineering team says: “It works.” The design team says: “Looks good.” But as a PM, you know: with AI, “works” does not mean what it means for traditional software. The chatbot delivers brilliant answers sometimes and embarrassing ones other times, and nobody can predict which you will get.
Traditional software ships when it works correctly. AI features ship when they work well enough. This is a fundamentally different decision because AI output quality exists on a spectrum, not a binary. Different users experience different quality levels for the same feature. Edge cases are infinite. And quality may degrade over time as data distributions shift.
The PM must define “good enough” before building, not after. Without a predefined quality bar, teams either ship too early (harming users) or never ship (harming the business).
Concept
Quality Gates for AI Features
A quality gate is a predefined threshold that must be met before an AI feature progresses to the next deployment stage. Gates should be defined during planning, not discovered during launch review.
Recommended quality gates:
- Golden dataset performance: Model clears the eval suite, no regression on critical tasks
- Safety checks: No sensitive data in prompts or logs; red team findings addressed
- Latency budget: P50 and P95 latency within acceptable bounds, with fallbacks for tool errors
- Cost ceiling: Cost per successful interaction under target
- Fairness audit: Performance consistent across demographic groups (see Lesson 5)
- Fallback behavior: Graceful degradation when the model fails, times out, or returns low-confidence results
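The gates above can be expressed as a single pre-launch check. A minimal sketch, assuming you collect these metrics per release candidate; every threshold below is a hypothetical placeholder, not a recommendation:

```python
from dataclasses import dataclass

@dataclass
class LaunchMetrics:
    eval_score: float           # golden-dataset score, 0..1
    eval_regressions: int       # critical-task regressions vs. baseline
    p95_latency_ms: float
    cost_per_success_usd: float
    max_group_score_gap: float  # worst score gap across demographic groups
    has_tested_fallback: bool

def passes_quality_gates(m: LaunchMetrics) -> list[str]:
    """Return the list of failed gates; an empty list means all gates pass."""
    failures = []
    if m.eval_score < 0.85 or m.eval_regressions > 0:  # hypothetical bar
        failures.append("golden dataset")
    if m.p95_latency_ms > 1200:                        # hypothetical budget
        failures.append("latency")
    if m.cost_per_success_usd > 0.05:                  # hypothetical ceiling
        failures.append("cost")
    if m.max_group_score_gap > 0.05:                   # hypothetical fairness bound
        failures.append("fairness")
    if not m.has_tested_fallback:
        failures.append("fallback")
    return failures
```

Returning the full list of failures, rather than a single boolean, keeps the launch review concrete: the discussion is about specific blocked gates, not a vague “not ready.”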
Staged Rollout Strategies
AI features demand more cautious rollouts than deterministic software because failure modes are harder to predict.
Shadow Mode (Dark Launch): The AI feature runs in production processing real inputs, but outputs are not shown to users. Only the team sees and evaluates. Tests performance on real, current examples without user impact. Duration: 1-2 weeks minimum.
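One way to sketch a dark launch: keep serving the existing path to the user while the new model runs on the same real input and its output is only logged for offline evaluation. Function names here are illustrative, not from any particular framework:

```python
import logging

logger = logging.getLogger("shadow")

def handle_request(user_input: str, legacy_answer, shadow_model) -> str:
    """Serve the existing experience; record the shadow model's answer silently."""
    served = legacy_answer(user_input)           # what the user actually sees
    try:
        candidate = shadow_model(user_input)     # never shown to the user
        logger.info("shadow result", extra={"served": served,
                                            "candidate": candidate})
    except Exception:
        logger.exception("shadow model failed")  # shadow failures must not affect users
    return served
```

The `try/except` is the important part: a shadow-mode crash must degrade to the status quo, never to a broken user experience.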
Canary Release: Route 1-5% of production traffic to the new AI feature. Monitor key metrics: error rate, latency, user satisfaction, escalation rate. If metrics remain within predefined bounds, gradually ramp up. Each expansion phase has clear exit criteria.
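The ramp-up logic can be sketched as a bounds check against predefined exit criteria. All bounds and stage sizes below are hypothetical examples:

```python
CANARY_BOUNDS = {             # hypothetical exit criteria, checked per stage
    "error_rate":      0.02,  # max acceptable
    "p95_latency_ms":  1500,
    "escalation_rate": 0.10,
}
RAMP_STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]     # share of traffic

def next_traffic_share(current_share: float, observed: dict) -> float:
    """Ramp up only if every metric stays within bounds; otherwise roll back."""
    if any(observed[k] > bound for k, bound in CANARY_BOUNDS.items()):
        return 0.0                                # instant rollback
    later = [s for s in RAMP_STAGES if s > current_share]
    return later[0] if later else current_share   # hold at full rollout
```

The key property is that the decision at each stage is mechanical: the bounds were agreed on before launch, so nobody has to argue about whether a metric excursion “really matters” while the canary is live.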
A/B Testing: Split users between AI feature variants to measure causal impact. Critical for measuring whether the AI actually improves user outcomes. Run for a predetermined sample size or duration (typically 2-4 weeks); stopping the moment significance first appears inflates false-positive rates.
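For a success/failure metric such as task completion, a two-proportion z-test is a common way to check whether the observed lift is statistically significant. A stdlib-only sketch (a real experiment platform would also handle power analysis and sequential peeking):

```python
from math import erf, sqrt

def two_proportion_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants have the same success rate."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)
```

For example, 460/1000 successes in control versus 520/1000 in the AI variant yields a p-value below 0.05, while 500 versus 505 does not, despite also being “more.”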
Feature Flags: Combine canary + A/B. Control exposure by user segment, geography, or account tier. Enable instant rollback if issues emerge.
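At its core, the flag check is a small gating function. A minimal sketch with hypothetical segment names; production systems typically evaluate flags remotely so the kill switch takes effect without a deploy:

```python
ENABLED_SEGMENTS = {"internal", "beta"}   # hypothetical rollout config
KILL_SWITCH = False                       # flip to True for instant rollback

def feature_enabled(user_segment: str) -> bool:
    """Gate exposure by segment; the kill switch overrides everything."""
    return not KILL_SWITCH and user_segment in ENABLED_SEGMENTS
```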
Setting Thresholds
- Baseline first: Measure the current experience (human performance, existing rule-based system)
- Non-inferiority: The AI must be at least as good as the current process
- User-acceptable minimum: Through user research, determine the quality level below which users reject the feature
- Business-viable minimum: Below what quality level does the feature cost more than it saves?
- Set the bar at the highest of the three minimums above (non-inferiority, user-acceptable, business-viable)
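The combination rule is simply the maximum of the three minimums, since the strictest constraint wins. A one-line sketch with hypothetical scores:

```python
def quality_bar(baseline_score: float, user_min: float, business_min: float) -> float:
    """The ship threshold is the strictest of the three minimums."""
    return max(baseline_score, user_min, business_min)

# e.g. baseline 0.78, users tolerate 0.70, unit economics need 0.82 -> bar is 0.82
```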
Framework
Ship/No-Ship Checklist:
| Gate | Question | Ship if… | Block if… |
|---|---|---|---|
| Eval suite | Does it pass the golden dataset? | Scores meet or exceed baseline | Regression on any critical category |
| Safety | Did it pass red team review? | All critical findings addressed | Open critical vulnerabilities |
| Latency | Is it fast enough? | P95 within budget | P95 exceeds 2x target |
| Cost | Is it economically viable? | Cost per success under target | Cost per success exceeds value delivered |
| Fairness | Is performance equitable? | Variance across groups within bounds | Significant disparity on protected groups |
| Fallback | What happens when it fails? | Graceful degradation defined and tested | No fallback — failure = broken experience |
| Rollback | Can we undo this? | Instant rollback via feature flag | No rollback mechanism |
Scenario: GitHub Copilot — From Preview to GA
It is summer 2021. GitHub, in partnership with OpenAI, has built an AI feature that generates code suggestions directly in the editor: Copilot. The internal demos are impressive — the model completes functions, writes boilerplate, and even suggests tests. The engineering team is convinced: the product works.
But GitHub faces a difficult ship/no-ship decision. Copilot is not meant for a few hundred power users — it needs to reach millions of developers. Code suggestions that are wrong 70% of the time would not just annoy people; they would actively introduce bugs into production codebases. And every suggestion a developer accepts becomes part of software that other people rely on.
The facts:
- Suggestion acceptance rate started at roughly 30% — seven out of ten suggestions were dismissed
- Internal dogfooding showed: developers were more productive, but suggestion quality varied significantly across programming languages and contexts
- GitHub implemented feature flags per user and organization for instant rollback
- The SPACE framework (Satisfaction, Performance, Activity, Communication, Efficiency) was established to measure developer productivity impact
- Copilot ran in the background generating suggestions while GitHub measured which ones were accepted versus dismissed — a shadow-mode equivalent at the suggestion level
- A controlled study found that developers completed tasks 55% faster with Copilot
- In enabled repositories, roughly 46% of code came from Copilot suggestions
The question:
Three months into the Technical Preview, the metrics look promising. Acceptance rate is climbing, developer satisfaction is high, and controlled studies show clear productivity gains. Stakeholders are pushing: why not go GA now? The product is “good enough” — every month of waiting is lost market share. Would you go GA after 3 months, or wait?
Decide
How did GitHub decide — and what can we learn from it?
GitHub’s decision: A 12-month staged rollout — from June 2021 (Technical Preview) to June 2022 (General Availability). Despite promising early metrics, GitHub did not accelerate.
The rollout in stages:
- Internal dogfooding: GitHub engineers used Copilot in their daily work. First quality gates: does it work for our own engineers? Where does it break?
- Invited testers: Selected external developers with dedicated feedback channels. Quality gate: does the acceptance rate hold outside of GitHub’s own codebase patterns?
- Public Technical Preview: Open access with a waitlist. Quality gate: does the infrastructure scale? Do metrics remain stable under heterogeneous traffic?
- General Availability: Paid product for everyone. Quality gate: is the acceptance rate high enough that developers would pay for the service?
Why 12 months instead of 3:
Viewed through the ship/no-ship lens, GitHub systematically checked at each gate:
| Gate | GitHub’s approach |
|---|---|
| Eval suite | SPACE framework across five dimensions — not just “does the code compile” but “are developers more satisfied and productive?” |
| Safety | Security scans on generated code, checks for leaked secrets and license violations in suggestions |
| Latency | Suggestion latency optimized so Copilot would not interrupt developer flow |
| Cost | Cost per suggestion brought to a sustainable level for a subscription model |
| Fairness | Performance measured across programming languages and developer experience levels |
| Fallback | Dismissing a suggestion = no harm. Feature flag = instant rollback per user/org |
The acceptance rate was optimized from roughly 30% to roughly 34%. That sounds like a small jump — but at millions of suggestions per day, every percentage point of reduced noise is a significant quality improvement.
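To see why a few percentage points matter at scale, a back-of-the-envelope calculation; the daily volume here is a hypothetical figure for illustration, not a number GitHub has published:

```python
suggestions_per_day = 10_000_000   # hypothetical volume, for scale only
before, after = 0.30, 0.34         # acceptance rates from the rollout

# additional suggestions accepted (and equally many fewer dismissed) per day
extra_accepted = (after - before) * suggestions_per_day
print(f"{extra_accepted:,.0f} more accepted suggestions per day")
```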
Why “good enough” at 3 months was not good enough:
- The early-adopter cohort was not representative of the GA audience. Power users tolerate more friction than mainstream developers
- Security-relevant edge cases (generated code with vulnerabilities, copied license fragments) required time to identify and address
- The pricing model needed validation — developer satisfaction alone is insufficient if the unit economics do not work
- GitHub needed evidence beyond demos: the 55%-faster study was conducted during the extended preview, not before it
The outcome: Copilot became the fastest-adopted developer tool in GitHub’s history. The additional 9 months were not lost market share — they were an investment in a product that millions of developers trusted enough to pay for.
Reflect
- Define quality gates before building, not during launch review. GitHub did not invent the SPACE framework when Copilot was “done” — it was the measurement system from the start. Five dimensions instead of a single metric, because developer productivity cannot be reduced to acceptance rate alone.
- Staged rollouts are mandatory for AI. GitHub’s four-stage rollout (internal, invited, preview, GA) systematically addressed different risks: first functionality, then external validation, then scale, then monetization. Each stage had its own exit criteria.
- Shadow mode catches failures at zero user risk. Copilot’s suggest-and-measure approach was an elegant shadow-mode equivalent: the model generated, but the developer decided. Every dismissed suggestion was a data point with no user harm.
- “Good enough” for early adopters is not “good enough” for GA. The hardest lesson from Copilot’s rollout: three months of preview data from tech-savvy early adopters tells you little about how millions of mainstream users will behave. GitHub’s patience was not timidity — it was a deliberate ship/no-ship decision.
Sources: GitHub Blog — “Research: Quantifying GitHub Copilot’s Impact on Developer Productivity and Happiness” (2022), Ziegler et al. — “Productivity Assessment of Neural Code Completion” (2022), GitHub Blog — GitHub Copilot is Generally Available (2022), Forsgren et al. — “The SPACE of Developer Productivity” (2021)