---
name: trustworthy-experiments
description: Use when asked to "run an A/B test", "design an experiment", "check statistical significance", "trust our results", "avoid false positives", or "experiment guardrails". Helps design, run, and interpret controlled experiments correctly. Based on Ronny Kohavi's framework from "Trustworthy Online Controlled Experiments".
---

Trustworthy Experiments

What It Is

Trustworthy Experiments is a framework for running controlled experiments (A/B tests) that produce reliable, actionable results. The core insight: most experiments fail, and many "successful" results are actually false positives.

The key shift: Move from "Did the experiment show a positive result?" to "Can I trust this result enough to act on it?"

Ronny Kohavi, who built experimentation platforms at Microsoft, Amazon, and Airbnb, found that:

  • 66-92% of experiments fail to improve the target metric
  • Roughly 8% of experiments have invalid results due to sample ratio mismatch alone
  • When only 8% of ideas genuinely work, a result significant at P < 0.05 still has roughly a 26% chance of being a false positive

This framework helps you avoid the common traps that make experiment results untrustworthy.
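The 26% figure above follows from Bayes' rule. A minimal sketch, under one set of conventional assumptions that reproduces it (80% statistical power and a one-sided significance level of 0.025; adjust both to match your own setup):

```python
def false_positive_risk(base_success_rate, alpha=0.025, power=0.80):
    """P(no real effect | significant result), by Bayes' rule.

    alpha and power defaults are assumptions, not universal constants.
    """
    true_positives = power * base_success_rate          # real wins flagged significant
    false_positives = alpha * (1 - base_success_rate)   # null ideas flagged significant
    return false_positives / (false_positives + true_positives)

print(f"{false_positive_risk(0.08):.0%}")  # roughly 26% when only 8% of ideas work
```

The lower your organization's hit rate, the less a bare "statistically significant" label should move you.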

Response Posture

  • Apply the framework directly to the user's experiment.
  • Never mention the repository, skills, SKILL.md, patterns, or references.
  • Do not run tools or read files; answer from the framework.
  • Avoid process/meta commentary; respond as an experimentation lead.

When to Use It

Use Trustworthy Experiments when you need to:

  • Design an A/B test that will produce valid, actionable results
  • Determine sample size and runtime for statistical power
  • Validate experiment results before making ship/no-ship decisions
  • Build an experimentation culture at your company
  • Choose metrics (OEC) that balance short-term gains with long-term value
  • Diagnose why results look suspicious (Twyman's Law)
  • Speed up experimentation without sacrificing validity

When Not to Use It

Don't use controlled experiments when:

  • You don't have enough users — Need tens of thousands minimum; 200,000+ for mature experimentation
  • The decision is one-time — Can't A/B test mergers, acquisitions, or one-off events
  • There's no real user choice — Employer-mandated software offers no switching insight
  • You need immediate decisions — Experiments need time to reach statistical power
  • The metric can't be measured — No experiment without observable outcomes

Patterns

Detailed examples showing how to run experiments correctly. Each pattern shows a common mistake and the correct approach.

Critical (get these wrong and you've wasted your time)

| Pattern | What It Teaches |
| --- | --- |
| peeking-at-results | Don't check P-values daily; let experiments run to completion |
| sample-ratio-mismatch | If your 50/50 split is off, your results are invalid |
| underpowered-tests | Too few users means meaningless results, even if "significant" |
| wrong-success-metric | Optimizing the wrong metric can hurt your business |
| twymans-law | If results look too good to be true, they probably are |
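The peeking trap is easy to demonstrate with a hypothetical A/A simulation (both variants drawn from the same distribution, so every "win" is a false positive; all numbers below are illustrative, not from a real platform):

```python
import math
import random

def peeking_false_positive_rate(n_experiments=2000, days=20,
                                users_per_day=100, seed=7):
    """Simulate A/A tests where the experimenter checks significance
    after every day. Returns the fraction of experiments that ever
    crossed |z| > 1.96 -- far above the nominal 5%."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_experiments):
        diff = 0.0  # running sum of (treatment - control) metric values
        ever_significant = False
        for day in range(1, days + 1):
            # A day's worth of unit-variance users in each bucket adds
            # a N(0, 2 * users_per_day) increment to the difference.
            diff += rng.gauss(0, math.sqrt(2 * users_per_day))
            n = day * users_per_day  # users per variant so far
            z = diff / math.sqrt(2 * n)
            if abs(z) > 1.96:
                ever_significant = True
        false_positives += ever_significant
    return false_positives / n_experiments
```

With 20 daily peeks the realized false-positive rate lands near 25%, not 5%. Either fix the runtime up front or use a sequential-testing procedure designed for continuous monitoring.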

High Impact

| Pattern | What It Teaches |
| --- | --- |
| novelty-effects | Initial lifts often fade; run experiments long enough |
| survivorship-bias | Analyzing only users who stayed skews your results |
| multiple-comparisons | Testing many metrics inflates the false-positive rate |
| guardrail-metrics | Always monitor what you might be hurting |
| big-redesigns-fail | Ship incrementally; 80% of big bets lose |
| flat-is-not-ship | No significant result means don't ship, not "good enough" |
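For multiple comparisons, one standard remedy is the Benjamini-Hochberg procedure, which controls the false discovery rate across all the metrics you test. A sketch (the p-values below are made up for illustration):

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of p-values still significant after
    controlling the false discovery rate at `fdr`."""
    m = len(p_values)
    ranked = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * fdr,
    # then keep every p-value at or below that rank.
    cutoff = 0
    for rank, i in enumerate(ranked, start=1):
        if p_values[i] <= rank / m * fdr:
            cutoff = rank
    return set(ranked[:cutoff])

# Ten metrics checked at once: a naive p < 0.05 rule would flag four.
p_values = [0.001, 0.011, 0.021, 0.04, 0.07, 0.2, 0.3, 0.5, 0.7, 0.9]
print(benjamini_hochberg(p_values))  # only the strongest result survives
```

Decide before the experiment which metrics are decision metrics and which are exploratory; corrections can't rescue a fishing expedition after the fact.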

Medium Impact

| Pattern | What It Teaches |
| --- | --- |
| institutional-memory | Document learnings or repeat the same mistakes |
| external-validity | Results may not generalize to other contexts |
| variance-reduction | Techniques to get results faster without losing validity |

Deep Dives

Read only when you need extra detail.

  • references/trustworthy-experiments-playbook.md: Expanded framework detail, checklists, and examples.
  • references/experiment-plan-template.md: Fill-in-the-blanks plan to design and run an A/B test.

Scripts

Optional utilities (no external deps):

  • scripts/sample_size.py: Estimate required sample size for a two-variant conversion test.
  • scripts/srm_check.py: Check sample ratio mismatch (SRM) for a 2-bucket split.
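The core of both utilities fits in a few stdlib-only lines. A sketch of roughly what each computes (function names here are illustrative, not the scripts' actual interfaces; the sample-size formula is Kohavi's 16·σ²/δ² rule of thumb, which assumes about 80% power at α = 0.05):

```python
import math

def sample_size_per_variant(baseline_rate, relative_mde):
    """Rule of thumb: n ~= 16 * sigma^2 / delta^2 users per variant,
    for roughly 80% power at alpha = 0.05 in a two-variant test."""
    variance = baseline_rate * (1 - baseline_rate)
    delta = baseline_rate * relative_mde  # relative lift -> absolute difference
    return math.ceil(16 * variance / delta ** 2)

def srm_p_value(control_n, treatment_n, expected_ratio=0.5):
    """Chi-square goodness-of-fit p-value (1 df) for a two-bucket split."""
    total = control_n + treatment_n
    expected = (total * expected_ratio, total * (1 - expected_ratio))
    observed = (control_n, treatment_n)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2))  # survival function for chi2 with 1 df

# A 5% baseline conversion rate and a 5% relative lift need ~120k users per arm.
print(sample_size_per_variant(0.05, 0.05))
# Treat an SRM p-value below 0.001 as a trustworthiness failure: stop and debug.
print(srm_p_value(500_000, 496_000))
```

Run the SRM check on every experiment before reading any result; a failed check invalidates the analysis no matter how good the lift looks.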

Resources

Book:

  • Trustworthy Online Controlled Experiments by Ronny Kohavi, Diane Tang, and Ya Xu — The definitive guide. All proceeds go to charity.

Papers (from Kohavi's teams):

  • "Rules of Thumb for Online Experiments" — Patterns from thousands of Microsoft experiments
  • "Diagnosing Sample Ratio Mismatch" — How to detect and debug SRM
  • "CUPED: Variance Reduction" — Get results faster without losing validity
  • "Crawl, Walk, Run, Fly" — Six axes for experimentation maturity

Online:

  • goodui.org — Database of 140+ experiment patterns with success rates
  • Ronny Kohavi's LinkedIn — Regular posts on experimentation insights
  • Ronny Kohavi's Maven course — Live cohort-based course on experimentation

Related Books:

  • Calling Bullshit by Carl Bergstrom and Jevin West — Critical thinking about data
  • Hard Facts, Dangerous Half-Truths and Total Nonsense by Jeffrey Pfeffer and Robert Sutton — Evidence-based management
