Ship/No-Ship Decisions
Context
Your AI feature is “done.” The engineering team says: “It works.” The design team says: “Looks good.” But as a PM, you know: with AI, “works” does not mean what it means for traditional software. The chatbot delivers brilliant answers sometimes and embarrassing ones other times, and nobody can predict which you will get.
Traditional software ships when it works correctly. AI features ship when they work well enough. This is a fundamentally different decision because AI output quality exists on a spectrum, not a binary. Different users experience different quality levels for the same feature. Edge cases are infinite. And quality may degrade over time as data distributions shift.
The PM must define “good enough” before building, not after. Without a predefined quality bar, teams either ship too early (harming users) or never ship (harming the business).
Concept
Quality Gates for AI Features
A quality gate is a predefined threshold that must be met before an AI feature progresses to the next deployment stage. Gates should be defined during planning, not discovered during launch review.
Recommended quality gates:
- Golden dataset performance: Model clears the eval suite, no regression on critical tasks
- Safety checks: No sensitive data in prompts or logs; red team findings addressed
- Latency budget: P50 and P95 latency within acceptable bounds, with fallbacks for tool errors
- Cost ceiling: Cost per successful interaction under target
- Fairness audit: Performance consistent across demographic groups (see Lesson 5)
- Fallback behavior: Graceful degradation when the model fails, times out, or returns low-confidence results
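The gates above can be expressed as a single pre-launch check. A minimal sketch, assuming you collect these metrics per release candidate; every threshold below is a hypothetical placeholder, not a recommendation:

```python
from dataclasses import dataclass

@dataclass
class LaunchMetrics:
    eval_score: float           # golden-dataset score, 0..1
    eval_regressions: int       # critical-task regressions vs. baseline
    p95_latency_ms: float
    cost_per_success_usd: float
    max_group_score_gap: float  # worst score gap across demographic groups
    has_tested_fallback: bool

def passes_quality_gates(m: LaunchMetrics) -> list[str]:
    """Return the list of failed gates; an empty list means all gates pass."""
    failures = []
    if m.eval_score < 0.85 or m.eval_regressions > 0:  # hypothetical bar
        failures.append("golden dataset")
    if m.p95_latency_ms > 1200:                        # hypothetical budget
        failures.append("latency")
    if m.cost_per_success_usd > 0.05:                  # hypothetical ceiling
        failures.append("cost")
    if m.max_group_score_gap > 0.05:                   # hypothetical fairness bound
        failures.append("fairness")
    if not m.has_tested_fallback:
        failures.append("fallback")
    return failures
```

Returning the full list of failures, rather than a single boolean, keeps the launch review concrete: the discussion is about specific blocked gates, not a vague “not ready.”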
Staged Rollout Strategies
AI features demand more cautious rollouts than deterministic software because failure modes are harder to predict.
Shadow Mode (Dark Launch): The AI feature runs in production processing real inputs, but outputs are not shown to users. Only the team sees and evaluates. Tests performance on real, current examples without user impact. Duration: 1-2 weeks minimum.
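One way to sketch a dark launch: keep serving the existing path to the user while the new model runs on the same real input and its output is only logged for offline evaluation. Function names here are illustrative, not from any particular framework:

```python
import logging

logger = logging.getLogger("shadow")

def handle_request(user_input: str, legacy_answer, shadow_model) -> str:
    """Serve the existing experience; record the shadow model's answer silently."""
    served = legacy_answer(user_input)           # what the user actually sees
    try:
        candidate = shadow_model(user_input)     # never shown to the user
        logger.info("shadow result", extra={"served": served,
                                            "candidate": candidate})
    except Exception:
        logger.exception("shadow model failed")  # shadow failures must not affect users
    return served
```

The `try/except` is the important part: a shadow-mode crash must degrade to the status quo, never to a broken user experience.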
Canary Release: Route 1-5% of production traffic to the new AI feature. Monitor key metrics: error rate, latency, user satisfaction, escalation rate. If metrics remain within predefined bounds, gradually ramp up. Each expansion phase has clear exit criteria.
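The ramp-up logic can be sketched as a bounds check against predefined exit criteria. All bounds and stage sizes below are hypothetical examples:

```python
CANARY_BOUNDS = {             # hypothetical exit criteria, checked per stage
    "error_rate":      0.02,  # max acceptable
    "p95_latency_ms":  1500,
    "escalation_rate": 0.10,
}
RAMP_STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]     # share of traffic

def next_traffic_share(current_share: float, observed: dict) -> float:
    """Ramp up only if every metric stays within bounds; otherwise roll back."""
    if any(observed[k] > bound for k, bound in CANARY_BOUNDS.items()):
        return 0.0                                # instant rollback
    later = [s for s in RAMP_STAGES if s > current_share]
    return later[0] if later else current_share   # hold at full rollout
```

The key property is that the decision at each stage is mechanical: the bounds were agreed on before launch, so nobody has to argue about whether a metric excursion “really matters” while the canary is live.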
A/B Testing: Split users between AI feature variants to measure causal impact. Critical for measuring whether the AI actually improves user outcomes. Run for a predetermined sample size or duration (typically 2-4 weeks); stopping the moment significance first appears inflates false-positive rates.
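For a success/failure metric such as task completion, a two-proportion z-test is a common way to check whether the observed lift is statistically significant. A stdlib-only sketch (a real experiment platform would also handle power analysis and sequential peeking):

```python
from math import erf, sqrt

def two_proportion_p_value(succ_a: int, n_a: int, succ_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both variants have the same success rate."""
    p_a, p_b = succ_a / n_a, succ_b / n_b
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # 2 * P(Z > |z|)
```

For example, 460/1000 successes in control versus 520/1000 in the AI variant yields a p-value below 0.05, while 500 versus 505 does not, despite also being “more.”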
Feature Flags: Combine canary + A/B. Control exposure by user segment, geography, or account tier. Enable instant rollback if issues emerge.
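At its core, the flag check is a small gating function. A minimal sketch with hypothetical segment names; production systems typically evaluate flags remotely so the kill switch takes effect without a deploy:

```python
ENABLED_SEGMENTS = {"internal", "beta"}   # hypothetical rollout config
KILL_SWITCH = False                       # flip to True for instant rollback

def feature_enabled(user_segment: str) -> bool:
    """Gate exposure by segment; the kill switch overrides everything."""
    return not KILL_SWITCH and user_segment in ENABLED_SEGMENTS
```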
Setting Thresholds
- Baseline first: Measure the current experience (human performance, existing rule-based system)
- Non-inferiority: The AI must be at least as good as the current process
- User-acceptable minimum: Through user research, determine the quality level below which users reject the feature
- Business-viable minimum: Below what quality level does the feature cost more than it saves?
- Set the bar at the highest of the three minimums above (non-inferiority, user-acceptable, business-viable)
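The combination rule is simply the maximum of the three minimums, since the strictest constraint wins. A one-line sketch with hypothetical scores:

```python
def quality_bar(baseline_score: float, user_min: float, business_min: float) -> float:
    """The ship threshold is the strictest of the three minimums."""
    return max(baseline_score, user_min, business_min)

# e.g. baseline 0.78, users tolerate 0.70, unit economics need 0.82 -> bar is 0.82
```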
Framework
Ship/No-Ship Checklist:
| Gate | Question | Ship if… | Block if… |
|---|---|---|---|
| Eval suite | Does it pass the golden dataset? | Scores meet or exceed baseline | Regression on any critical category |
| Safety | Did it pass red team review? | All critical findings addressed | Open critical vulnerabilities |
| Latency | Is it fast enough? | P95 within budget | P95 exceeds 2x target |
| Cost | Is it economically viable? | Cost per success under target | Cost per success exceeds value delivered |
| Fairness | Is performance equitable? | Variance across groups within bounds | Significant disparity on protected groups |
| Fallback | What happens when it fails? | Graceful degradation defined and tested | No fallback — failure = broken experience |
| Rollback | Can we undo this? | Instant rollback via feature flag | No rollback mechanism |
Scenario: GitHub Copilot — From Preview to GA
It is summer 2021. GitHub, in partnership with OpenAI, has built an AI feature that generates code suggestions directly in the editor: Copilot. The internal demos are impressive — the model completes functions, writes boilerplate, and even suggests tests. The engineering team is convinced: the product works.
But GitHub faces a difficult ship/no-ship decision. Copilot is not meant for a few hundred power users — it needs to reach millions of developers. Code suggestions that are wrong 70% of the time would not just annoy people; they would actively introduce bugs into production codebases. And every suggestion a developer accepts becomes part of software that other people rely on.
The facts:
- Suggestion acceptance rate started at roughly 30% — seven out of ten suggestions were dismissed
- Internal dogfooding showed: developers were more productive, but suggestion quality varied significantly across programming languages and contexts
- GitHub implemented feature flags per user and organization for instant rollback
- The SPACE framework (Satisfaction, Performance, Activity, Communication, Efficiency) was established to measure developer productivity impact
- Copilot ran in the background generating suggestions while GitHub measured which ones were accepted versus dismissed — a shadow-mode equivalent at the suggestion level
- A controlled study found that developers completed tasks 55% faster with Copilot
- In enabled repositories, roughly 46% of code came from Copilot suggestions
The question:
Three months into the Technical Preview, the metrics look promising. Acceptance rate is climbing, developer satisfaction is high, and controlled studies show clear productivity gains. Stakeholders are pushing: why not go GA now? The product is “good enough” — every month of waiting is lost market share. Would you go GA after 3 months, or wait?
Decide
How did GitHub decide — and what can we learn from it?
GitHub’s decision: A 12-month staged rollout — from June 2021 (Technical Preview) to June 2022 (General Availability). Despite promising early metrics, GitHub did not accelerate.
The rollout in stages:
- Internal dogfooding: GitHub engineers used Copilot in their daily work. First quality gates: does it work for our own engineers? Where does it break?
- Invited testers: Selected external developers with dedicated feedback channels. Quality gate: does the acceptance rate hold outside of GitHub’s own codebase patterns?
- Public Technical Preview: Open access with a waitlist. Quality gate: does the infrastructure scale? Do metrics remain stable under heterogeneous traffic?
- General Availability: Paid product for everyone. Quality gate: is the acceptance rate high enough that developers would pay for the service?
Why 12 months instead of 3:
Viewed through the ship/no-ship lens, GitHub systematically checked at each gate:
| Gate | GitHub’s approach |
|---|---|
| Eval suite | SPACE framework across five dimensions — not just “does the code compile” but “are developers more satisfied and productive?” |
| Safety | Security scans on generated code, checks for leaked secrets and license violations in suggestions |
| Latency | Suggestion latency optimized so Copilot would not interrupt developer flow |
| Cost | Cost per suggestion brought to a sustainable level for a subscription model |
| Fairness | Performance measured across programming languages and developer experience levels |
| Fallback | Dismissing a suggestion = no harm. Feature flag = instant rollback per user/org |
The acceptance rate was optimized from roughly 30% to roughly 34%. That sounds like a small jump — but at millions of suggestions per day, every percentage point of reduced noise is a significant quality improvement.
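To see why a few percentage points matter at scale, a back-of-the-envelope calculation; the daily volume here is a hypothetical figure for illustration, not a number GitHub has published:

```python
suggestions_per_day = 10_000_000   # hypothetical volume, for scale only
before, after = 0.30, 0.34         # acceptance rates from the rollout

# additional suggestions accepted (and equally many fewer dismissed) per day
extra_accepted = (after - before) * suggestions_per_day
print(f"{extra_accepted:,.0f} more accepted suggestions per day")
```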
Why “good enough” at 3 months was not good enough:
- The early-adopter cohort was not representative of the GA audience. Power users tolerate more friction than mainstream developers
- Security-relevant edge cases (generated code with vulnerabilities, copied license fragments) required time to identify and address
- The pricing model needed validation — developer satisfaction alone is insufficient if the unit economics do not work
- GitHub needed evidence beyond demos: the 55%-faster study was conducted during the extended preview, not before it
The outcome: Copilot became the fastest-adopted developer tool in GitHub’s history. The additional 9 months were not lost market share — they were an investment in a product that millions of developers trusted enough to pay for.
Reflect
- Define quality gates before building, not during launch review. GitHub did not invent the SPACE framework when Copilot was “done” — it was the measurement system from the start. Five dimensions instead of a single metric, because developer productivity cannot be reduced to acceptance rate alone.
- Staged rollouts are mandatory for AI. GitHub’s four-stage rollout (internal, invited, preview, GA) systematically addressed different risks: first functionality, then external validation, then scale, then monetization. Each stage had its own exit criteria.
- Shadow mode catches failures at zero user risk. Copilot’s suggest-and-measure approach was an elegant shadow-mode equivalent: the model generated, but the developer decided. Every dismissed suggestion was a data point with no user harm.
- “Good enough” for early adopters is not “good enough” for GA. The hardest lesson from Copilot’s rollout: three months of preview data from tech-savvy early adopters tells you little about how millions of mainstream users will behave. GitHub’s patience was not timidity — it was a deliberate ship/no-ship decision.
Sources: GitHub Blog — “Research: Quantifying GitHub Copilot’s Impact on Developer Productivity and Happiness” (2022), Ziegler et al. — “Productivity Assessment of Neural Code Completion” (2022), GitHub Blog — GitHub Copilot is Generally Available (2022), Forsgren et al. — “The SPACE of Developer Productivity” (2021)