Demystifying the Central Limit Theorem
Marbles to the Rescue: Making the Abstract Tangible
Premise
Imagine that you have a bag filled with red and blue marbles.
How would you estimate the proportion of red ones without counting every marble?
Simulating Marble Population
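The population itself is not shown on this slide, so here is a minimal sketch of how it might be simulated. The population size, the 60% share of red marbles, and the exact form of `is_red` are illustrative assumptions, not values taken from the original deck.

```r
# Hypothetical population: the size and the red share are assumptions for illustration.
set.seed(42)
population_size = 10000
red_share = 0.6
marbles = sample(c("red", "blue"), size = population_size, replace = TRUE,
                 prob = c(red_share, 1 - red_share))

# Assumed helper: TRUE for red marbles, so that mean() of the result is a proportion.
is_red = function(x) x == "red"

# The true proportion of red marbles, which we normally would not know.
true_proportion = mean(is_red(marbles))
true_proportion
```

Coding red as `TRUE` is what lets the later slides treat a proportion as an ordinary sample mean.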
Taking One Sample
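A single sample, and a helper that repeats the draw, could look roughly like the sketch below. The body of `get_samples` is a guess that only mirrors the call signature and column names (`sample_number`, `sample_value`) used on the following slides. The population is the same hypothetical one as before.

```r
# Hypothetical population and helper (assumptions, repeated so the sketch is self-contained).
set.seed(42)
marbles = sample(c("red", "blue"), size = 10000, replace = TRUE, prob = c(0.6, 0.4))
is_red = function(x) x == "red"

# One sample of 10 marbles, drawn without replacement, and its proportion of red.
one_sample = sample(marbles, size = 10)
one_proportion = mean(is_red(one_sample))

# Assumed implementation: draw `number_of_samples` samples and stack them in long format.
get_samples = function(population, sample_size, number_of_samples, func) {
  draws = lapply(seq_len(number_of_samples), function(i) {
    data.frame(sample_number = i,
               sample_value = func(sample(population, size = sample_size)))
  })
  do.call(rbind, draws)
}

samples = get_samples(marbles, sample_size = 10, number_of_samples = 3, func = is_red)
# Proportion of red marbles in each of the three samples
tapply(samples$sample_value, samples$sample_number, mean)
```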
Taking More Samples
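library(dplyr)       # group_by, summarise, add_row
library(knitr)       # kable
library(kableExtra)  # kable_styling

size = 10  # marbles per sample
n = 100    # number of samples
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
# Proportion of red marbles in each sample
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
# Assumed initialisation (not shown in the original): first row of the standard error table
standard_errors = data.frame(sample_size = size,
                             standard_error = sd(sample_means$proportion))
kable(head(sample_means, 5), "html") |>
  kable_styling(font_size = 20)

| sample_number | proportion |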
|---|---|
| 1 | 0.5 |
| 2 | 0.7 |
| 3 | 0.4 |
| 4 | 0.7 |
| 5 | 0.8 |

Increasing Sample Size to 20
size = 20
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
standard_errors = standard_errors |>
  add_row(sample_size = size,
          standard_error = sd(sample_means$proportion))
kable(standard_errors)

| sample_size | standard_error |
|---|---|
| 10 | 0.1742980 |
| 20 | 0.1035665 |

Increasing Sample Size to 50
size = 50
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
standard_errors = standard_errors |>
  add_row(sample_size = size,
          standard_error = sd(sample_means$proportion))
kable(standard_errors, "html")

| sample_size | standard_error |
|---|---|
| 10 | 0.1742980 |
| 20 | 0.1035665 |
| 50 | 0.0669916 |

Increasing Sample Size to 100
size = 100
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
standard_errors = standard_errors |>
  add_row(sample_size = size,
          standard_error = sd(sample_means$proportion))
kable(standard_errors, "html")

| sample_size | standard_error |
|---|---|
| 10 | 0.1742980 |
| 20 | 0.1035665 |
| 50 | 0.0669916 |
| 100 | 0.0494720 |
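
For a proportion, the theoretical standard error is sqrt(p(1 - p) / n), so multiplying each observed standard error by sqrt(n) should give roughly the same number at every sample size. The sketch below checks this using the four values copied from the table. The column name `scaled` is introduced here for illustration only.

```r
# Standard errors copied from the table above.
standard_errors = data.frame(
  sample_size = c(10, 20, 50, 100),
  standard_error = c(0.1742980, 0.1035665, 0.0669916, 0.0494720)
)

# If SE = sqrt(p(1 - p) / n), then SE * sqrt(n) estimates sqrt(p(1 - p)),
# which does not depend on the sample size.
standard_errors$scaled = standard_errors$standard_error * sqrt(standard_errors$sample_size)
standard_errors
```

The scaled values all fall between about 0.46 and 0.55. They are not identical because each standard error is itself estimated from only 100 samples.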

Central Limit Theorem
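As the sample size increases, the distribution of the sample means approaches a normal distribution, whatever the shape of the population, provided the population has a finite variance. A sample size above 30 is a common rule of thumb for when the approximation becomes usable, although strongly skewed populations may need larger samples.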
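Stated a little more formally, in standard textbook notation rather than anything specific to this deck: for independent draws $X_1, \dots, X_n$ from a population with mean $\mu$ and finite variance $\sigma^2$,

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\frac{\sqrt{n}\,\left(\bar{X}_n - \mu\right)}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0, 1)
\quad \text{as } n \to \infty .
```

For the marbles, each draw is 1 for red and 0 for blue, so $\mu = p$ and $\sigma^2 = p(1 - p)$. The sample proportion is therefore approximately normal with mean $p$ and standard error $\sqrt{p(1 - p)/n}$.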
Skewed Data
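library(ggplot2)
library(stringr)  # str_glue

set.seed(42)
n = 1000  # number of simulated salaries
# Log-normal salaries: right-skewed, with a median of 30000
mean_log = log(30000)
sd_log = 1
salaries = rlnorm(n, meanlog = mean_log, sdlog = sd_log)
ggplot(data.frame(salaries), aes(x = salaries)) +
  geom_histogram(bins = 50, fill = "blue", color = "black") +
  # Red line marks the mean, which the long right tail pulls above the median
  geom_vline(xintercept = mean(salaries), color = "red", linewidth = 1) +
  # annotate() draws the label once; geom_text() with aes() would redraw it for every row
  annotate("text", x = mean(salaries) * 2, y = 400,
           label = str_glue("{round(mean(salaries), 2)}"),
           color = "red", size = 3) +
  labs(title = "Salary Distribution", x = "Salary", y = "Count") +
  theme_minimal()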
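
To see the theorem at work on this skewed population, one option is to draw repeated samples from the salaries and track how the spread and skewness of the sample means change with sample size. The sketch below uses base R only. The helper names `sample_means_of` and `skewness`, the sample sizes, and the choice to sample with replacement are all assumptions made for illustration.

```r
# Same skewed population as in the salary simulation.
set.seed(42)
salaries = rlnorm(1000, meanlog = log(30000), sdlog = 1)

# Hypothetical helper: means of `reps` samples of a given size, drawn with replacement.
sample_means_of = function(x, size, reps = 1000) {
  replicate(reps, mean(sample(x, size = size, replace = TRUE)))
}

# Simple moment-based skewness: close to 0 for a symmetric, normal-like distribution.
skewness = function(v) mean((v - mean(v))^3) / sd(v)^3

sizes = c(5, 30, 200)
means_by_size = lapply(sizes, function(s) sample_means_of(salaries, s))

results = data.frame(
  sample_size = sizes,
  mean_of_means = sapply(means_by_size, mean),
  sd_of_means = sapply(means_by_size, sd),
  skewness = sapply(means_by_size, skewness)
)
results
```

As the sample size grows, the means should stay centred near `mean(salaries)` while their spread and skewness shrink, which is the pattern the theorem predicts.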

Key Insights

- Sample means cluster around the population mean
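- Larger sample sizes decrease the variability of the sample mean (the standard error shrinks roughly in proportion to 1/sqrt(n))
- Distribution of sample means becomes more normal as the sample size increases
- Holds for any population distribution with finite variance, including skewed ones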
Red Marble Proportion Estimation
Question
- How confident are we in our estimate?
- How can we quantify our level of certainty?