Estimating treatment effects and power analysis

Outline

  • Sampling distributions and standard errors
  • Confidence intervals for average treatment effects
  • Power analysis
    • Type I errors
    • Type II errors
    • Type M errors

If you have difficulty invoking %load_ext, try running pip install rpy2 --upgrade from your command line and restarting the Jupyter Python kernel.

In [1]:
%load_ext rpy2.ipython
In [2]:
%%R
# First-time users: install packages by uncommenting the following line.
#install.packages(c('plyr', 'dplyr', 'ggplot2', 'foreach', 'lmtest', 'sandwich', 'broom', 'doMC'))

library('plyr')
library('dplyr')
library('ggplot2')
library('foreach')
library('lmtest')
library('sandwich')
library('broom')

options(digits=3)

setwd('~/www-15-tutorial')
source('css_stats.R')

Sampling distributions and standard errors

Sampling distribution of an average

  • Hypothetical experiment: what is the average proportion of heads for a biased coin with p = 0.2 after 100 coin flips?
  • We can observe this through simulation: how is the average distributed when we replicate the experiment 2000 times? This distribution is called the sampling distribution
In [3]:
%%R
N <- 100
p <- 0.2
num.replications <- 2000
replications <- replicate(num.replications, mean(rbinom(N, 1, p)))

print(quantile(replications, c(0.025, 0.975)))  # 95% interval
hist(replications, xlim=c(0,0.4))               # histogram
 2.5% 97.5% 
 0.13  0.28 

Standard error

  • The standard error is the standard deviation of the sampling distribution
In [4]:
%%R
# Alternatively, we can use the standard deviation of the sampling distribution,
# which is approximately normal for sufficiently large N.
est.se <- sd(replications)
mean.est.p <- mean(replications)
mean.est.p
est.se
c(mean.est.p - 1.96*est.se, mean.est.p + 1.96*est.se)
[1] 0.123 0.277

Estimating standard error using math

  • The standard error of the mean of $N$ draws from a Bernoulli($p$) distribution is:

    $$\text{SE} = \sqrt{\frac{p(1-p)}{N}}$$

  • This is the standard error for our experiment! (A worked check against the simulation follows this list.)

  • The 95% confidence interval extends $z_{\alpha/2}\,\text{SE} = 1.96\,\text{SE}$ on either side of the estimate
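
Plugging the true values from the coin-flip simulation into this formula as a quick check:

$$\text{SE} = \sqrt{\frac{0.2 \times 0.8}{100}} = 0.04,$$

which agrees with the spread of the simulated sampling distribution above (a 95% interval of roughly $0.2 \pm 0.08$).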
In [5]:
%%R
single.trial <- rbinom(N, 1, p)
p.single.trial <- mean(single.trial)

# normal approximation of the SE of a binomial
se <- sqrt(p.single.trial*(1-p.single.trial)/N)
se
[1] 0.0384
In [6]:
%%R
# norm approx 95% CI
c(p.single.trial - 1.96*se, p.single.trial + 1.96*se)
[1] 0.105 0.255

Relationship between N and CI width

In [7]:
%%R
# let's construct the confidence intervals for p for a few values of p, N
d <- expand.grid(p=c(0.05, 0.25, 0.5), n=seq(100, 2000, 100))
d <- d %>% mutate(se=sqrt(p*(1-p)/n))

qplot(n, p, data=d, color=factor(p),
  ylab='p (with 95% confidence interval)') +
  geom_errorbar(aes(ymin=p-1.96*se, ymax=p+1.96*se))

Relationship between N and CI width

In [8]:
%%R
# The standard error shrinks in proportion to sqrt(1/n):
# it falls quickly at first, then more slowly as n grows.
qplot(n, se, data=d, color=factor(p), geom='line', ylab='standard error')

Sampling distribution of an average treatment effect

Simple experiment

  • Consider a simulated experiment with a binary treatment, $D$:

    $$y = \mu + \delta D + \epsilon$$

  • $\delta$ is the average treatment effect (ATE)

In [9]:
%%R
run.experiment <- function(n, ate) {
  data.frame(
    y0=rnorm(n, mean=10, sd=3)
  ) %>%
  mutate(
    y1=y0 + ate,
    D=rbinom(n, 1, 0.5),
    y=D*y1 + (1-D)*y0
  )
}

my.experiment <- run.experiment(2000, 2.0)

head(my.experiment)
     y0    y1 D     y
1  9.56 11.56 0  9.56
2  9.03 11.03 1 11.03
3 13.57 15.57 0 13.57
4 12.31 14.31 1 14.31
5  7.27  9.27 1  9.27
6  7.56  9.56 1  9.56

Distribution of outcomes for simulated subjects

In [10]:
%%R
qplot(y, geom='histogram', data=my.experiment, facets=D~.) + theme_bw()

Sampling distribution of estimated ATE

In [11]:
%%R
est.ate <- function(exp) {
  mean(exp[exp$D==1,]$y) - mean(exp[exp$D==0,]$y)
}

num.replications <- 1000
ates <- replicate(
  num.replications,
  est.ate(run.experiment(n=1e3, 2.0)))

hist(ates, main='sampling distribution of ATE')

Estimating SEs: Using a t-test

In [12]:
%%R
# 95% of the estimated treatment effects lie within this range
quantile(ates, c(0.025, 0.975))

# Large sample approximations let us make statements about the
# variance of the ATE over different (similar) populations (or
# possible randomizations).

est.ate(my.experiment)
with(my.experiment, t.test(y[D==1], y[D==0]))
	Welch Two Sample t-test

data:  y[D == 1] and y[D == 0]
t = 14.4, df = 1998, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.64 2.15
sample estimates:
mean of x mean of y 
     12.0      10.1 

Estimating SEs: Using a linear model

In [13]:
%%R
m <- lm(y ~ D, data=my.experiment)
est.ate <- coeftest(m)['D', 'Estimate']
est.se <- coeftest(m)['D', 'Std. Error']

c(est.ate-1.96*est.se, est.ate+1.96*est.se)
[1] 1.64 2.15

Estimating SEs: Using the bootstrap

In [14]:
%%R
weighted.est.ate <- function(exp) {
  m <- lm(y ~ D, data=exp, weights=.weights)
  coeftest(m)['D', 'Estimate']
}

replicates <- iid.bootstrap(my.experiment, weighted.est.ate)
print(quantile(replicates, c(0.025, 0.975)))
hist(replicates, main='bootstrap distribution of ATE for a single experiment')
 2.5% 97.5% 
 1.62  2.16 
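
If you don't have css_stats.R available, the same idea can be written with explicit row resampling. This is a minimal sketch of an iid bootstrap of the ATE (the actual iid.bootstrap implementation may differ; for instance, it appears to reweight observations via a .weights column rather than resample rows):

# Hedged sketch: bootstrap the ATE by resampling rows of the experiment with
# replacement and recomputing the difference in means for each resample.
bootstrap.ates <- replicate(500, {
  resampled <- my.experiment[sample(nrow(my.experiment), replace=TRUE), ]
  mean(resampled$y[resampled$D==1]) - mean(resampled$y[resampled$D==0])
})
quantile(bootstrap.ates, c(0.025, 0.975))  # should be close to the interval above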
In [15]:
%%R

# Here is a function that returns confidence intervals for a dataframe
# with an outcome y and a binary treatment D. We will use this later.

get.ci <- function(experiment) {
  t <- coeftest(lm(y ~ D, data=experiment))
  est.ate <- t['D', 'Estimate']
  est.se <- t['D', 'Std. Error']
  p.value <- t['D', 'Pr(>|t|)']
  return(list(est.ate=est.ate, est.se=est.se, p.value=p.value))
}

get.ci(my.experiment)
$est.ate
[1] 1.89

$est.se
[1] 0.132

$p.value
[1] 1.54e-44

Type I, II, M errors

  • Type I error: finding a "significant" difference when there is no true difference.
  • Type II error: failing to find a "significant" difference when there is a true difference.
  • Type M error: a magnitude error; the estimated ATE is much larger than the true effect.

Simulating variability when replicating an experiment

In [16]:
%%R
# This function runs an experiment multiple times (num.replications),
# and generates a dataframe containing estimates of the treatment effects and
# standard errors for each experiment, sorted by the size of the effect.
replicate.experiment <- function(sim.func, params, num.replications=100) {
  # Note: you can parallelize this process below by using %dopar% if
  # you have the 'doMC' package installed.
  exps <- foreach(replication=1:num.replications, .combine=rbind) %do% {
  	as.data.frame(get.ci(do.call(sim.func, params)))
  }
  exps <- exps[order(exps$est.ate),]
  exps <- mutate(exps,
    rank=1:nrow(exps),
    significant=sign(est.ate-1.96*est.se)==sign(est.ate+1.96*est.se)
  )
  exps
}

Type I errors

  • Type I error: detecting a "statistically significant effect" when there is no effect.
In [17]:
%%R
# run 200 trials of an experiment with no true ATE
null.experiments <- replicate.experiment(run.experiment,
  list(n=1e3, ate=0.0),
  num.replications=200)

qplot(est.ate, data=null.experiments, main="sampling distribution of ATE")

Type I errors

  • By chance, about 5% of the experiments will yield "statistically significant" results
  • Visualize this by rank-ordering the confidence intervals from 200 hypothetical experiments by effect size
In [18]:
%%R
# plot the results of the experiments, rank-ordered by effect size, along
# with their confidence intervals. Experiments that are "significant" have
# confidence intervals that do not cross zero and are highlighted in blue.
print(mean(null.experiments$significant))
qplot(rank, est.ate, data=null.experiments, color=significant,
  xlab='experiment number (ranked by effect size)', ylab='estimated ATE',
  main='demonstration of Type I errors') +
  geom_pointrange(aes(ymin=est.ate-1.96*est.se, ymax=est.ate+1.96*est.se)) +
  geom_point()
[1] 0.07

Type II errors

  • Consider an experiment with a true effect of +0.5 (against an average baseline outcome of 10)
  • Experiments that are too small often fail to detect the treatment effect
In [19]:
%%R
sad.experiments <- replicate.experiment(run.experiment,
  list(n=250, ate=0.5),
  num.replications=200)

qplot(rank, est.ate, data=sad.experiments, color=significant,
  xlab='experiment number (ranked by effect size)', ylab='estimated ATE',
  main='demonstration of Type II errors') +
  geom_pointrange(aes(ymin=est.ate-1.96*est.se, ymax=est.ate+1.96*est.se)) +
  geom_point() + geom_hline(yintercept=0.5)

Type M errors

  • A Type M error is a magnitude error
  • When experiments are underpowered and we report only on "significant" experiments, we should expect to overestimate effect sizes
In [20]:
%%R
# Fraction of experiments that are statistically significant.
# This is called the statistical power of the experiment.
print(mean(sad.experiments$significant))

### Type M errors

# When experiments are underpowered, reporting only on significant experiments
# can massively overstate effects: compare the average estimated ATE among
# "significant" experiments with the true ATE of 0.5.
with(subset(sad.experiments, significant==TRUE), mean(est.ate))
[1] 0.275
[1] 0.937
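
For comparison, here is a quick analytical cross-check (a sketch using base R's power.t.test, not part of the original notebook): with roughly 125 subjects per arm, a true effect of 0.5, and an outcome standard deviation of 3, the approximate power is about 0.26, in line with the simulated fraction of significant experiments above.

# Analytical cross-check (sketch): power of a two-sample t-test with ~125
# subjects per arm, a true difference in means of 0.5, and an outcome sd of 3.
power.t.test(n=250/2, delta=0.5, sd=3, sig.level=0.05)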

Power analysis

  • Use analytical expressions or simulations to determine how large your experiment should be (a simulation sketch follows this list)
  • Depends on
    • Number of subjects in each condition
    • Effect size
    • Variance of outcome
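
To make the simulation route concrete, here is a minimal sketch (not part of the original notebook; sim.power is a hypothetical helper name) that wraps run.experiment() and get.ci() from above into a power estimate:

# Hedged sketch: estimate statistical power by simulation for a given
# sample size and true ATE, reusing run.experiment() and get.ci().
sim.power <- function(n, ate, num.replications=200) {
  significant <- replicate(num.replications, {
    ci <- get.ci(run.experiment(n, ate))
    ci$p.value < 0.05        # significant at the 5% level?
  })
  mean(significant)          # fraction of replications rejecting the null
}
# e.g. sim.power(250, 0.5) should land near the ~0.27 power observed earlier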
In [21]:
%%R
run.experiment2 <- function(n, ate, prop.treated=0.5, y.sd=3) {
  data.frame(
    y0=rnorm(n, mean=10, sd=y.sd)
  ) %>%
  mutate(
    y1=y0 + ate,
    D=rbinom(n, 1, prop.treated),
    y=D*y1 + (1-D)*y0
  )
}

Relationship between precision and proportion of subjects in experiment

Let's simulate a few replications with varying proportions of users assigned to treatment and control.

In [22]:
%%R
exps <- foreach(p=seq(0.04, 0.96, 0.04), .combine=rbind) %do% {
  varying.prop = list(n=1e3, ate=1, prop.treated=p, y.sd=3)
  replications <- replicate.experiment(
    sim.func=run.experiment2,
    params=varying.prop,
    num.replications=250
  )
  data.frame(
    prop.treated=p,
    sim.se=sd(replications$est.ate),          # SE from the sampling distribution
    sim.ate=mean(replications$est.ate),
    norm.est.se=head(replications$est.se, 1)  # normal-approximation SE from one replication
  )
}

Relationship between precision and proportion of subjects in experiment

  • Using sampling distribution of simulations
In [23]:
%%R
qplot(prop.treated, 1.96*sim.se, data=exps,
      ylab='95% CI width', xlab='proportion of subjects in treatment',
      ylim=c(0, 2.0), geom=c('line','point'),main='Precision of estimated ATE (simulation)')

Relationship between precision and proportion of subjects in experiment

  • Using estimated standard error of a single experiment
In [24]:
%%R
qplot(prop.treated, 1.96*norm.est.se, data=exps,
       ylab='95% CI width', xlab='proportion of subjects in treatment',
       ylim=c(0,2.0), main='Precision of estimated ATE (approximation)')

How variance affects required sample size

In [25]:
%%R
# We will use a stripped-down version that doesn't use replications for
# demonstration purposes, calling get.ci() on a single trial for each
# parameterization.

true.ate <- 0.1   # reference line: smallest effect size we would like to detect
exps <- foreach(n=seq(100, 1e3, 1e2), .combine=rbind) %do% {
  foreach(my.sd=c(1, 2, 3), .combine=rbind) %do% {
    my.experiment <- run.experiment2(
      n=n, ate=1, prop.treated=0.5, y.sd=my.sd)
    my.ses <- get.ci(my.experiment)
    data.frame(n=n, sd=my.sd,
               est.ate=my.ses$est.ate,
               est.se=my.ses$est.se)
  }
}

# 95% CI width as a function of n, for each outcome sd
qplot(n, 1.96*est.se, data=exps, color=factor(sd), geom=c('line', 'point')) +
  geom_hline(aes(yintercept=true.ate), linetype='dashed')

Exercise

  • How many samples do you need to detect a 10 percentage point change for a Bernoulli outcome with a baseline of p = 0.33? (One possible check is sketched below.)
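
One possible way to check your answer (a sketch, not part of the original tutorial; it assumes a two-sided test at a 5% significance level with 80% power) is base R's power.prop.test:

# Hedged sketch: per-group sample size needed to detect a change from
# p = 0.33 to p = 0.43 with 80% power at alpha = 0.05.
power.prop.test(p1=0.33, p2=0.43, sig.level=0.05, power=0.8)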