Analysis of Variance (ANOVA)

How do we compare the means of more than two groups?

\(t\)-Test

  • Used when the sample size is small (\(n < 30\)) and/or the population standard deviation is unknown.

\[t = \frac{\bar x - \mu}{s / \sqrt{n}} \sim t(n-1)\]

Three types of \(t\)-tests:

  • One-sample \(t\)-test
  • Two-sample \(t\)-test
  • Paired \(t\)-test
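
The three variants map onto base R's t.test() roughly as follows; a minimal sketch with simulated data, where the means, sample sizes, and variable names are purely illustrative.

set.seed(1)
x      = rnorm(25, mean = 5.2, sd = 1)          # a single sample
y      = rnorm(25, mean = 5.8, sd = 1)          # an independent second sample
before = rnorm(20, mean = 100, sd = 10)         # measurements on the same subjects...
after  = before + rnorm(20, mean = 3, sd = 2)   # ...after some treatment

t.test(x, mu = 5)                      # one-sample t-test: H0: mu = 5
t.test(x, y)                           # two-sample t-test (Welch's, by default)
t.test(after, before, paired = TRUE)   # paired t-test: H0: mean difference = 0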

Question:

  • What if we want to compare the means of more than two groups?

Issues with Multiple \(t\)-Tests

  • With \(k\) groups, there are \({k \choose 2} = \frac{k(k-1)}{2}\) pairwise \(t\)-tests.

  • The probability of making at least one Type I error across the tests is: \[P(\text{at least one Type I error}) = 1 - (1 - \alpha)^m\] where \(m\) is the number of tests.
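
A quick sketch of how fast this error rate grows, assuming \(k = 4\) groups compared pairwise at \(\alpha = 0.05\) (the numbers are illustrative):

alpha = 0.05
k = 4                        # number of groups (hypothetical)
m = choose(k, 2)             # number of pairwise tests: 6
1 - (1 - alpha)^m            # family-wise error rate, roughly 0.26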


Ways to Fix the Issues

  • Bonferroni correction (\(\alpha_{new} = \frac{\alpha}{m}\) for \(m\) tests, but can be overly conservative; see the sketch below)
  • Analysis of Variance (ANOVA)
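
A hedged sketch of the Bonferroni correction mentioned above, using base R's p.adjust(); the number of tests and the raw p-values are made up for illustration.

alpha = 0.05
m = 6                                             # number of pairwise tests
alpha / m                                         # per-test threshold, about 0.0083

p_raw = c(0.003, 0.02, 0.04, 0.30, 0.45, 0.60)    # hypothetical raw p-values
p.adjust(p_raw, method = "bonferroni")            # multiplies each p-value by m, capped at 1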

Variance

  • Variance measures the dispersion of a set of data points around their mean.
  • Two kinds of variance in ANOVA:
    • Variance within groups: dispersion within each group
    • Variance between groups: dispersion between the group means
  • ANOVA essentially compares the variation within groups against the variation between groups.

Intuition

In which scenario are the means of the two groups significantly different?


Between-group Variability

  • The smaller the distance between sample means, the less likely population means will differ significantly. (Scenario 1 vs. Scenario 2)

Within-group Variability

  • The greater the variability within each sample (the more spread out the observations are around their group mean), the less likely population means will differ significantly. (Scenario 1 vs. Scenario 3)

Variance Between Groups

Sum of Squares Between

\[SS_{between} = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2\] where:

  • \(k\) is the number of groups
  • \(n_i\) is the number of observations in group \(i\)
  • \(\bar{X}_i\) is the mean of group \(i\)
  • \(\bar{X}\) is the overall mean
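
As a sanity check, \(SS_{between}\) can be computed directly from this definition; a minimal sketch with made-up toy data (three small groups):

g1 = c(4, 5, 6); g2 = c(7, 8, 9); g3 = c(5, 6, 7)   # hypothetical toy groups
groups = list(g1, g2, g3)

grand_mean = mean(unlist(groups))                    # overall mean
ss_between = sum(sapply(groups, function(g) length(g) * (mean(g) - grand_mean)^2))
ss_between                                           # 14 for these values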

Variance Within Groups

Sum of Squares Within

\[SS_{within} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2\]

where:

  • \(k\) is the number of groups
  • \(n_i\) is the number of observations in group \(i\)
  • \(\bar{X}_i\) is the mean of group \(i\)
  • \(X_{ij}\) is the \(j\)th observation in group \(i\)
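
The matching \(SS_{within}\) computation, with the same made-up toy data redefined so the sketch stands alone:

g1 = c(4, 5, 6); g2 = c(7, 8, 9); g3 = c(5, 6, 7)   # same hypothetical groups as above
groups = list(g1, g2, g3)

ss_within = sum(sapply(groups, function(g) sum((g - mean(g))^2)))
ss_within                                            # 2 + 2 + 2 = 6 here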

F-statistic

\[F = \frac{SS_{between} / (k - 1)}{SS_{within} / (N - k)} = \frac{MS_{between}}{MS_{within}} \sim F(k-1, N-k)\]

where \(N\) is the total number of observations and \(MS\) is the mean square.
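
Putting the two sums of squares together for the toy data yields the F-statistic and its p-value via pf(), which can be cross-checked against R's built-in aov(); the values are the same made-up ones as above.

dat = data.frame(y     = c(4, 5, 6, 7, 8, 9, 5, 6, 7),
                 group = rep(c("g1", "g2", "g3"), each = 3))

k = 3; N = nrow(dat)
grand_mean = mean(dat$y)
ss_between = sum(tapply(dat$y, dat$group, function(g) length(g) * (mean(g) - grand_mean)^2))
ss_within  = sum(tapply(dat$y, dat$group, function(g) sum((g - mean(g))^2)))

F_stat = (ss_between / (k - 1)) / (ss_within / (N - k))   # MS_between / MS_within
p_val  = pf(F_stat, k - 1, N - k, lower.tail = FALSE)     # upper tail of F(k-1, N-k)
c(F = F_stat, p = p_val)                                  # F = 7, p about 0.027

summary(aov(y ~ group, data = dat))                       # should report the same F and p-value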

F-Distribution

Assumptions of ANOVA

  • Independent and random samples
  • Approximately normal population distribution
  • Equal population variances (rule of thumb: \(\frac{\text{largest variance}}{\text{smallest variance}} < 2\))
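
A minimal sketch of checking the equal-variance rule of thumb on simulated data (the data, group labels, and parameter values are all made up); Bartlett's test is shown as one formal alternative.

set.seed(42)
dat = data.frame(y     = c(rnorm(30, 10, 1.0), rnorm(30, 11, 1.2), rnorm(30, 10.5, 0.9)),
                 group = rep(c("a", "b", "c"), each = 30))

group_vars = tapply(dat$y, dat$group, var)   # sample variance of each group
max(group_vars) / min(group_vars)            # rule of thumb: want this below 2

bartlett.test(y ~ group, data = dat)         # a formal test of equal variances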

Example

You’re testing the effect of three different call-to-action (CTA) button designs on your e-commerce website. The metric of interest is the click-through rate (CTR), which is the percentage of visitors who click the button. You randomly assign visitors to one of the three designs (A, B, or C) and record their CTRs.

  • Design A: \(n_1 = 50, \bar x = 6.2, s = 0.5\)
  • Design B: \(n_2 = 50, \bar x = 7.0, s = 0.5\)
  • Design C: \(n_3 = 50, \bar x = 6.5, s = 0.5\)

Example

  • \(H_0\): \(\mu_1 = \mu_2 = \mu_3\)
  • \(H_a\): At least one \(\mu_i\) is different
library(tidyverse)

set.seed(12345)
# Simulate 50 visitors per design, then reshape to long format for aov()
df = tibble(design_a = rnorm(50, 6.2, 0.5),
            design_b = rnorm(50, 7.0, 0.5),
            design_c = rnorm(50, 6.5, 0.5)) |>
  pivot_longer(cols = everything(), names_to = "design", values_to = "percentage")

# One-way ANOVA of CTR on button design
result = aov(percentage ~ design, data = df)
summary(result)
             Df Sum Sq Mean Sq F value   Pr(>F)    
design        2  20.48  10.241   32.62 1.89e-12 ***
Residuals   147  46.15   0.314                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Reject \(H_0\) at the \(\alpha = 0.05\) significance level.
  • We conclude that at least one design has a mean CTR that differs from the others.

Tukey’s HSD Test

  • It can be used to determine which groups are different from each other.
  • The test statistic is \(q = \frac{\bar x_i - \bar x_j}{\sqrt{MS_{within} / n}}\), where \(n\) is the common group size.
  • The critical value is \(q_{\alpha, k, N-k}\) from the studentized range distribution.
  • If \(|q| > q_{\alpha, k, N-k}\), then we reject \(H_0\) for that pair of groups.
TukeyHSD(result)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = percentage ~ design, data = df)

$design
                        diff         lwr        upr     p adj
design_b-design_a  0.8656309  0.60029322  1.1309687 0.0000000
design_c-design_a  0.2037145 -0.06162325  0.4690522 0.1673197
design_c-design_b -0.6619165 -0.92725419 -0.3965787 0.0000001
  • We reject \(H_0\) for the pairs (A, B) and (B, C).
  • We fail to reject \(H_0\) for the pair (A, C).
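
For intuition, the studentized range statistic and its critical value for one pair can also be computed by hand with qtukey(); this sketch assumes that df and result from the example above are still in the workspace and that group sizes are equal (\(n = 50\)).

group_means = tapply(df$percentage, df$design, mean)
ms_within   = sum(resid(result)^2) / df.residual(result)    # MS_within from the fitted model
n = 50                                                      # observations per group

q_ab   = (group_means["design_b"] - group_means["design_a"]) / sqrt(ms_within / n)
q_crit = qtukey(0.95, nmeans = 3, df = 150 - 3)             # critical value q_{alpha, k, N-k}

abs(q_ab) > q_crit                                          # TRUE: reject H0 for (A, B), matching TukeyHSD()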