Marbles to the Rescue: Making the Abstract Tangible
Imagine that you have a bag filled with red and blue marbles.
How would you guess the proportion of red ones without going through every marble?
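library(dplyr)       # group_by(), summarise()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

# marbles, get_samples() and is_red() are assumed to be defined elsewhere in the post
size = 10  # marbles per sample
n = 100    # number of samples
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
# Proportion of red marbles in each sample
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
kable(head(sample_means, 5), "html") |>
  kable_styling(font_size = 20)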
| sample_number | proportion |
|---|---|
| 1 | 0.5 |
| 2 | 0.7 |
| 3 | 0.4 |
| 4 | 0.7 |
| 5 | 0.8 |
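The helpers used in the code above (`marbles`, `is_red`, `get_samples`) are not defined in this section. As a minimal sketch of what they might look like, assuming a bag of 1,000 marbles of which 60% are red (both numbers are made up for illustration) and a sampler that returns one row per marble drawn:

```r
library(dplyr)

# Hypothetical bag: 1,000 marbles, 600 red and 400 blue (assumed, not from the post)
marbles = tibble(color = rep(c("red", "blue"), times = c(600, 400)))

# Hypothetical predicate: 1 for a red marble, 0 otherwise
is_red = function(bag) as.integer(bag$color == "red")

# Hypothetical sampler: draws `number_of_samples` samples of `sample_size` marbles
# (without replacement within a sample) and returns one row per marble drawn
get_samples = function(population, sample_size, number_of_samples, func) {
  bind_rows(lapply(seq_len(number_of_samples), function(i) {
    drawn = slice_sample(population, n = sample_size)
    tibble(sample_number = i, sample_value = func(drawn))
  }))
}

set.seed(42)
samples = get_samples(marbles, sample_size = 10, number_of_samples = 100, func = is_red)
```

With this shape, `mean(sample_value)` within each `sample_number` is the proportion of red marbles in that sample, which is what the `group_by()` and `summarise()` steps compute.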
size = 20
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
# The standard error is estimated as the standard deviation of the sample proportions.
# standard_errors is assumed to already hold the row for sample_size = 10.
standard_errors = standard_errors |>
  add_row(sample_size = size,
          standard_error = sd(sample_means$proportion))
kable(standard_errors)
| sample_size | standard_error |
|---|---|
| 10 | 0.1742980 |
| 20 | 0.1035665 |
size = 50
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
standard_errors = standard_errors |>
  add_row(sample_size = size,
          standard_error = sd(sample_means$proportion))
kable(standard_errors, "html")
| sample_size | standard_error |
|---|---|
| 10 | 0.1742980 |
| 20 | 0.1035665 |
| 50 | 0.0669916 |
size = 100
set.seed(42)
samples = get_samples(marbles, sample_size = size, number_of_samples = n, func = is_red)
sample_means = samples |>
  group_by(sample_number) |>
  summarise(proportion = mean(sample_value))
standard_errors = standard_errors |>
  add_row(sample_size = size,
          standard_error = sd(sample_means$proportion))
kable(standard_errors, "html")
| sample_size | standard_error |
|---|---|
| 10 | 0.1742980 |
| 20 | 0.1035665 |
| 50 | 0.0669916 |
| 100 | 0.0494720 |
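The shrinking values in the table are in line with the textbook formula for the standard error of a sample proportion. As a sketch, assuming independent draws and writing $p$ for the true proportion of red marbles (not stated in this section) and $n$ for the sample size:

$$
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p\,(1 - p)}{n}}
$$

Because $n$ sits under a square root, quadrupling the sample size only halves the standard error. Going from 10 to 100 marbles per sample should shrink it by a factor of about $\sqrt{10} \approx 3.16$; the simulated ratio is $0.1743 / 0.0495 \approx 3.52$, which is close given that each estimate comes from only 100 samples.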
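As the sample size increases, the distribution of the sample means approaches a normal distribution, whatever the shape of the underlying population (provided its variance is finite). A sample size above 30 is a common rule of thumb for when the approximation becomes reasonable, though strongly skewed populations can need more.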
library(ggplot2)

set.seed(42)
n = 1000
# Log-normal salaries with a median of 30,000: right-skewed, so the mean sits above the median
mean_log = log(30000)
sd_log = 1
salaries = rlnorm(n, meanlog = mean_log, sdlog = sd_log)
ggplot(data.frame(salaries), aes(x = salaries)) +
  geom_histogram(bins = 50, fill = "blue", color = "black") +
  geom_vline(xintercept = mean(salaries), color = "red", linewidth = 1) +
  # annotate() draws the label once; geom_text() with aes() would redraw it for every row
  annotate("text", x = mean(salaries) * 2, y = 400,
           label = round(mean(salaries), 2), color = "red", size = 3) +
  labs(title = "Salary Distribution", x = "Salary", y = "Count") +
  theme_minimal()
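One way to see the central limit theorem at work on a skewed population like this one is to draw many samples from it and plot their means. The sketch below regenerates the same log-normal salaries; the sample size of 50 and the 1,000 repetitions are assumptions chosen for illustration, not values from the post.

```r
library(ggplot2)

set.seed(42)
# Same log-normal population as in the salary histogram
salaries = rlnorm(1000, meanlog = log(30000), sdlog = 1)

sample_size = 50          # assumed, for illustration
number_of_samples = 1000  # assumed, for illustration

# Mean of each sample, drawn with replacement from the population
sample_means = replicate(
  number_of_samples,
  mean(sample(salaries, sample_size, replace = TRUE))
)

# Histogram of the sample means, with the population mean marked in red
ggplot(data.frame(sample_means), aes(x = sample_means)) +
  geom_histogram(bins = 50, fill = "blue", color = "black") +
  geom_vline(xintercept = mean(salaries), color = "red", linewidth = 1) +
  labs(title = "Distribution of Sample Means", x = "Mean salary", y = "Count") +
  theme_minimal()
```

The sample means cluster around the population mean and are far less spread out than the individual salaries. With a population this skewed, the histogram may still show a right tail at a sample size of 50, which is why the "greater than 30" guideline is only a rule of thumb.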