Estimating treatment effects and power analysis


  • Sampling distributions and standard errors
  • Confidence intervals for average treatment effects
  • Power analysis
    • Type I errors
    • Type II errors
    • Type M errors

If you have difficulty invoking load_ext, try running pip install rpy2 --upgrade at your command line and restarting the Jupyter Python kernel.

In [1]:
%load_ext rpy2.ipython
In [2]:
# First time users: install packages by uncommenting the following line.
#install.packages(c('dplyr', 'ggplot2', 'foreach', 'lmtest', 'sandwich', 'doMC'))
library(plyr)
library(dplyr)
library(ggplot2)
library(foreach)
library(lmtest)
library(sandwich)
library(Hmisc)



Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
Loading required package: zoo

Attaching package: ‘zoo’

The following objects are masked from ‘package:base’:

    as.Date, as.Date.numeric

Loading required package: grid
Loading required package: lattice
Loading required package: survival
Loading required package: splines
Loading required package: Formula

Attaching package: ‘Hmisc’

The following objects are masked from ‘package:dplyr’:

    combine, src, summarize

The following objects are masked from ‘package:plyr’:

    is.discrete, summarize

The following objects are masked from ‘package:base’:

    format.pval, round.POSIXt, trunc.POSIXt, units

Sampling distributions and standard errors

Sampling distribution of an average

  • Hypothetical experiment: what is the average number of heads for a biased coin with p = 0.2 after 100 coin flips?
  • We can observe this through simulation: how is the average distributed when you replicate the experiment 2000 times? This distribution is called the sampling distribution.
In [3]:
N <- 100
p <- 0.2
num.replications <- 2000
replications <- replicate(num.replications, mean(rbinom(N, 1, p)))

print(quantile(replications, c(0.025, 0.975)))  # 95% interval
hist(replications, xlim=c(0,0.4))               # histogram
 2.5% 97.5% 
 0.13  0.28 

Standard error

  • The standard error is the standard deviation of the sampling distribution
In [4]:
# Alternatively, we can use the standard deviation of the sampling distribution,
# which is approximately normal for sufficiently large N. <- sd(replications)
mean.est.p <- mean(replications)
c(mean.est.p - 1.96*, mean.est.p + 1.96*
[1] 0.123 0.277

Estimating standard error using math

  • The standard error of the mean of $N$ draws from a Bernoulli distribution is:

    $$\text{SE} = \sqrt{\frac{p(1-p)}{N}}$$

  • This is the standard error for our experiment!

  • The 95% confidence interval is the estimate $\pm z_{\alpha/2}\,\text{SE} = \pm 1.96\,\text{SE}$
In [5]:
single.trial <- rbinom(N, 1, p)
p.single.trial <- mean(single.trial)

# normal approximation of the SE of a binomial proportion
se <- sqrt(p.single.trial*(1-p.single.trial)/N)
se
[1] 0.0384
In [6]:
# norm approx 95% CI
c(p.single.trial - 1.96*se, p.single.trial + 1.96*se)
[1] 0.105 0.255
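Since this notebook runs R from a Python kernel via rpy2, readers following along in Python can reproduce the same check directly with NumPy. A minimal sketch (the variable names are our own) that compares the analytic SE to the standard deviation of the simulated sampling distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 0.2

# analytic (normal-approximation) standard error of a sample proportion
se_analytic = np.sqrt(p * (1 - p) / N)

# empirical SE: standard deviation of the sampling distribution of the mean
replications = rng.binomial(1, p, size=(2000, N)).mean(axis=1)
se_empirical = replications.std()

# 95% CI from a single simulated trial, using the analytic SE
p_hat = rng.binomial(1, p, size=N).mean()
ci = (p_hat - 1.96 * se_analytic, p_hat + 1.96 * se_analytic)
print(se_analytic, se_empirical, ci)
```

The two SE estimates agree closely, which is the point of the normal approximation: you can get the interval from a single sample without replicating the experiment.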

Relationship between N and CI width

In [7]:
# let's construct the confidence intervals for p for a few values of p, N
d <- expand.grid(p=c(0.05, 0.25, 0.5), n=seq(100, 2000, 100))
d <- d %>% mutate(se=sqrt(p*(1-p)/n))

qplot(n, p, data=d, color=factor(p),
  ylab='p (with 95% confidence interval)') +
  geom_errorbar(aes(ymin=p-1.96*se, ymax=p+1.96*se))

Relationship between N and CI width

In [8]:
# You can see that the standard error rapidly diminishes with N
# (proportionally to sqrt(1/n), in particular).
qplot(n, se, data=d, color=factor(p), geom='line', ylab='standard error')
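Because the standard error shrinks like sqrt(1/n), halving the CI width requires four times the sample. A small Python helper (our own, not part of the tutorial) makes this concrete by inverting the width formula for n:

```python
import math

def n_for_ci_halfwidth(p, w, z=1.96):
    """Sample size so the normal-approximation 95% CI for a proportion p
    has half-width w: solve w = z * sqrt(p * (1 - p) / n) for n."""
    return math.ceil(p * (1 - p) * (z / w) ** 2)

# a proportion near 0.25, targeting a CI of +/- 2 points vs +/- 1 point
print(n_for_ci_halfwidth(0.25, 0.02))  # 1801
print(n_for_ci_halfwidth(0.25, 0.01))  # halving the width quadruples n: 7203
```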

Sampling distribution for an average treatment effect

Simple experiment

  • Consider simulated experiment with binary treatment, $D$:

    $$y = \mu + \delta D + \epsilon$$

  • $\delta$ is the average treatment effect (ATE)

In [9]:
run.experiment <- function(n, ate) {
  data.frame(
    y0=rnorm(n, mean=10, sd=3)
  ) %>%
  mutate(
    y1=y0 + ate,
    D=rbinom(n, 1, 0.5),
    y=D*y1 + (1-D)*y0
  )
}

my.experiment <- run.experiment(2000, 2.0)
head(my.experiment)

     y0    y1 D     y
1  9.56 11.56 0  9.56
2  9.03 11.03 1 11.03
3 13.57 15.57 0 13.57
4 12.31 14.31 1 14.31
5  7.27  9.27 1  9.27
6  7.56  9.56 1  9.56

Distribution of outcomes for simulated subjects

In [10]:
qplot(y, geom='histogram', data=my.experiment, facets=D~.) + theme_bw()
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Sampling distribution of estimated ATE

In [11]:
est.ate <- function(exp) {
  mean(exp[exp$D==1,]$y) - mean(exp[exp$D==0,]$y)
}

num.replications <- 1000
ates <- replicate(num.replications,
  est.ate(run.experiment(n=1e3, 2.0)))

hist(ates, main='sampling distribution of ATE')

Estimating SEs: Using a t-test

In [12]:
# 95% of the estimated treated effects lie within this range
quantile(ates, c(0.025, 0.975))

# Large sample approximations let us make statements about the
# variance of the ATE over different (similar) populations (or
# possible randomizations).

with(my.experiment, t.test(y[D==1], y[D==0]))
	Welch Two Sample t-test

data:  y[D == 1] and y[D == 0]
t = 14.4, df = 1998, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.64 2.15
sample estimates:
mean of x mean of y 
     12.0      10.1 

Estimating SEs: Using a linear model

In [13]:
m <- lm(y ~ D, data=my.experiment)
est.ate <- coeftest(m)['D', 'Estimate'] <- coeftest(m)['D', 'Std. Error']

c(est.ate - 1.96*, est.ate + 1.96*
[1] 1.64 2.15

Estimating SEs: Using the bootstrap

In [14]:
weighted.est.ate <- function(exp) {
  m <- lm(y ~ D, data=exp, weights=.weights)
  coeftest(m)['D', 'Estimate']
}
replicates <- iid.bootstrap(my.experiment, weighted.est.ate)
print(quantile(replicates, c(0.025, 0.975)))
hist(replicates, main='bootstrap distribution of ATE for a single experiment')
 2.5% 97.5% 
 1.62  2.16 
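The iid.bootstrap call above relies on a helper from the tutorial's own utility code. The underlying idea is easy to write from scratch; here is a sketch in plain NumPy (function and variable names are ours) that resamples (D, y) rows with replacement and re-estimates the ATE on each resample:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulate one experiment with a true ATE of 2.0, as in the R example
n = 2000
y0 = rng.normal(10, 3, n)   # potential outcome under control
d = rng.binomial(1, 0.5, n) # random treatment assignment
y = y0 + d * 2.0            # observed outcome

def est_ate(d, y):
    # difference in mean outcomes between treated and control
    return y[d == 1].mean() - y[d == 0].mean()

# iid bootstrap: resample rows with replacement, re-estimate the ATE
boot = np.array([
    est_ate(d[idx], y[idx])
    for idx in (rng.integers(0, n, n) for _ in range(1000))
])
print(np.percentile(boot, [2.5, 97.5]))
```

The 2.5% and 97.5% percentiles of the bootstrap replicates play the same role as the quantile interval printed above.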
In [15]:

# Here is a function that returns confidence intervals for a dataframe
# with an outcome y and a binary treatment D. We will use this later. <- function(experiment) {
  t <- coeftest(lm(y ~ D, data=experiment))
  est.ate <- t['D', 'Estimate'] <- t['D', 'Std. Error']
  p.value <- t['D', 'Pr(>|t|)']
  return(list(est.ate=est.ate,, p.value=p.value))
}

$est.ate
[1] 1.89

$
[1] 0.132

$p.value
[1] 1.54e-44

Type I, II, M errors

  • Type I error: Finding a "significant" difference when there is no true difference.
  • Type II error: Failing to find a "significant" difference when a true difference exists.
  • Type M error: Measuring an ATE whose magnitude is too large.
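These definitions are easy to check by simulation. Under a true effect of zero, a test at alpha = 0.05 should reject about 5% of the time. A sketch in plain NumPy (we use a large-sample z-test here instead of the tutorial's t-test; for 500 subjects per arm the two are nearly identical):

```python
import numpy as np

rng = np.random.default_rng(0)

num_experiments = 2000
false_positives = 0
for _ in range(num_experiments):
    # treated and control drawn from the SAME distribution: no true effect
    y0 = rng.normal(10, 3, 500)
    y1 = rng.normal(10, 3, 500)
    # large-sample z-test for a difference in means
    se = np.sqrt(y1.var(ddof=1) / 500 + y0.var(ddof=1) / 500)
    z = (y1.mean() - y0.mean()) / se
    false_positives += abs(z) > 1.96
print(false_positives / num_experiments)  # close to 0.05
```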

Simulating variability when replicating an experiment

In [16]:
# This function runs an experiment multiple times (num.replications),
# and generates a dataframe containing estimates of the treatment effects and
# standard errors for each experiment, sorted by the size of the effect.
replicate.experiment <- function(sim.func, params, num.replications=100) {
  # Note: you can parallelize this process below by using %dopar% if
  # you have the 'doMC' package installed.
  exps <- foreach(replication=1:num.replications, .combine=rbind) %do% {
    data.frame(, params)))
  }
  exps <- exps[order(exps$est.ate),]
  exps <- mutate(exps,
    rank=row_number(est.ate),
    significant=(p.value < 0.05))
  exps
}

Type I errors

  • Type I error: detecting a "statistically significant effect" when there is no effect.
In [17]:
# run 300 trials of an experiment with no ATE
null.experiments <- replicate.experiment(run.experiment,
  list(n=1e3, ate=0.0),
  num.replications=300)

qplot(est.ate, data=null.experiments, main="sampling distribution of ATE")
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Type I errors

  • About 5% of the experiments will yield "statistically significant" results even though the true effect is zero
  • Visualize by rank-ordering the CIs from the hypothetical experiments by effect size
In [18]:
# plot the results of the experiments, rank-ordered by effect size, along
# with their confidence intervals. Experiments that are "significant" have
# confidence intervals that do not cross zero and are highlighted in blue.
qplot(rank, est.ate, data=null.experiments, color=significant,
  xlab='experiment number (ranked by effect size)', ylab='estimated ATE',
  main='demonstration of Type I errors') +
  geom_pointrange(aes(ymin=est.ate-1.96*, ymax=est.ate+1.96* +
  geom_hline(yintercept=0)

# fraction of experiments that are (falsely) significant
mean(null.experiments$significant)
[1] 0.07

Type II errors

  • Consider an experiment with a true effect of +0.5 (on an outcome with mean 10)
  • Experiments that are too small often fail to detect the treatment effect
In [19]:
sad.experiments <- replicate.experiment(run.experiment,
  list(n=250, ate=0.5),
  num.replications=100)

qplot(rank, est.ate, data=sad.experiments, color=significant,
  xlab='experiment number (ranked by effect size)', ylab='estimated ATE',
  main='demonstration of Type II errors') +
  geom_pointrange(aes(ymin=est.ate-1.96*, ymax=est.ate+1.96* +
  geom_point() + geom_hline(yintercept=0.5)

Type M errors

  • A Type M error is a magnitude error
  • When experiments are underpowered and we only report on "significant" experiments, we will tend to overestimate effect sizes
In [20]:
# Fraction of experiments that are statistically significant.
# This is called the statistical power of the experiment.
mean(sad.experiments$significant)

# When experiments are underpowered, only reporting on significant
# experiments can massively overstate effects.
with(subset(sad.experiments, significant==TRUE), mean(est.ate))
[1] 0.275
[1] 0.937

Power analysis

  • Use analytical expressions or simulations to determine how large your experiment should be
  • Depends on
    • Number of subjects in each condition
    • Effect size
    • Variance of outcome
In [21]:
run.experiment2 <- function(n, ate, prop.treated=0.5, {
  data.frame(
    y0=rnorm(n, mean=10,
  ) %>%
  mutate(
    y1=y0 + ate,
    D=rbinom(n, 1, prop.treated),
    y=D*y1 + (1-D)*y0
  )
}

Relationship between precision and proportion of subjects in experiment

Let's simulate a few replications with varying proportions of users assigned to treatment and control.

In [22]:
exps <- foreach(p=seq(0.04, 0.96, 0.04), .combine=rbind) %do% {
  varying.prop <- list(n=1e3, ate=1, prop.treated=p,
  replications <- replicate.experiment(run.experiment2, varying.prop)
  data.frame(
    prop.treated=p,
    sim.ate=mean(replications$est.ate),$est.ate),$, 1))
}

Relationship between precision and proportion of subjects in experiment

  • Using sampling distribution of simulations
In [23]:
qplot(prop.treated, 1.96*, data=exps,
      ylab='95% CI width', xlab='proportion of subjects in treatment',
      ylim=c(0, 2.0), geom=c('line', 'point'),
      main='Precision of estimated ATE (simulation)')

Relationship between precision and proportion of subjects in experiment

  • Using estimated standard error of a single experiment
In [24]:
qplot(prop.treated, 1.96*, data=exps,
       ylab='95% CI width', xlab='proportion of subjects in treatment',
       ylim=c(0,2.0), main='Precision of estimated ATE (approximation)')

How variance affects required sample size

In [25]:
# We will use a stripped-down version that doesn't use replications for
# demonstration purposes, with a single trial for each parameterization.

true.ate <- 0.1
exps <- foreach(n=seq(100, 1e3, 1e2), .combine=rbind) %do% {
  foreach(sd=c(1, 2, 3), .combine=rbind) %do% {
    my.experiment <- run.experiment2(n=n, ate=1, prop.treated=0.5,
    est <-
    data.frame(n=n, sd=sd,$
  }
}

qplot(n, 1.96*, data=exps, color=factor(sd),
  geom=c('line', 'point')) +
  geom_hline(aes(yintercept=true.ate), linetype='dashed')


  • How many samples do you need to detect a 10 percentage point change in a Bernoulli outcome with p = 0.33?
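One standard way to answer this question is the normal-approximation sample-size formula for comparing two proportions. The sketch below is our own (the tutorial does not give this code), and it assumes a two-sided test at alpha = 0.05 with 80% power:

```python
import math

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Normal-approximation sample size per arm to detect a change from
    p1 to p2 in a Bernoulli outcome (two-sided test).
    z_alpha = Phi^-1(1 - alpha/2), z_beta = Phi^-1(power)."""
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

# detect a 10 percentage point change from p = 0.33 with 80% power
print(n_per_group(0.33, 0.43))
```

With these defaults the answer comes out to about 369 subjects per arm; running replicate.experiment at that sample size is a good simulation-based cross-check.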