Random stratified sample from each factor level


Performing Random Stratified Sampling in R for Factor Levels

Illustration of data points distributed across different strata, with a subset selected from each, representing stratified sampling.

Learn how to draw a random stratified sample from each factor level in R, so that every subgroup is represented in your analyses.

When working with data in R, especially in fields like social sciences, biology, or market research, you often encounter datasets with categorical variables, known as 'factors'. A common requirement is to draw a random sample from within every category, or 'level', of such a factor, either keeping each level's original share of the observations or taking a fixed number of observations from each. This technique is called stratified sampling, and it helps ensure that your sample represents the population across its subgroups.

Understanding Stratified Sampling

Stratified sampling is a method of sampling from a population that can be partitioned into subpopulations (strata). In R, these strata often correspond to the levels of a factor variable. The goal is to ensure that each stratum is adequately represented in the sample, so that no group is over- or under-represented by chance, as can happen with simple random sampling. This is particularly important when certain strata are small or when you expect substantial differences between strata.
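To make the contrast concrete, here is a minimal sketch comparing a simple random sample with a stratified one. The `population` data frame, its two strata, and the seed are hypothetical values chosen only for illustration.

```r
set.seed(42)  # arbitrary seed, used only for this illustration

# Hypothetical population: one large stratum and one very small one
population <- data.frame(
  id = 1:1000,
  stratum = factor(rep(c("common", "rare"), times = c(980, 20)))
)

# Simple random sample of 20 rows: the number of "rare" rows is left to chance
srs <- population[sample.int(nrow(population), 20), ]
table(srs$stratum)

# Stratified sample: 10 rows from every stratum, by construction
rows_by_stratum <- split(seq_len(nrow(population)), population$stratum)
picked <- unlist(
  lapply(rows_by_stratum, function(idx) idx[sample.int(length(idx), 10)]),
  use.names = FALSE
)
stratified <- population[picked, ]
table(stratified$stratum)
```

With only 2% of the population in the "rare" stratum, the simple random sample can easily contain none of it, while the stratified sample always contains exactly 10 rows from each stratum.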

flowchart TD
    A[Original Dataset] --> B{"Identify Stratification Variable (Factor)"};
    B --> C{"Define Strata (Factor Levels)"};
    C --> D{Determine Sample Size per Stratum};
    D --> E[Randomly Sample from Each Stratum];
    E --> F[Combine Stratum Samples];
    F --> G[Final Stratified Sample];

Workflow for Random Stratified Sampling

Implementing Stratified Sampling in R

R provides several ways to perform stratified sampling, either with base R functions or with add-on packages such as dplyr and sampling. The core idea is to group your data by the factor levels and then apply a random sampling function to each group. You can specify either a fixed number of observations per stratum or a proportion of each stratum's size.
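The two ways of choosing a per-stratum sample size can be sketched as follows. The factor `group`, the fixed size of 5, and the proportion of 0.25 are hypothetical values; rounding up with `ceiling()` is one possible choice, not the only one.

```r
# Hypothetical factor with unequal level counts
group <- factor(rep(c("A", "B", "C", "D"), times = c(40, 30, 20, 10)))

# Number of observations in each stratum, as a named vector
stratum_sizes <- setNames(tabulate(group), levels(group))

# Option 1: a fixed number per stratum, capped at the stratum's size
fixed_n <- pmin(stratum_sizes, 5)

# Option 2: a fixed proportion per stratum, rounded up so that no
# non-empty stratum ends up with zero rows
prop_n <- ceiling(0.25 * stratum_sizes)

fixed_n
prop_n
```

A fixed number gives small strata the same weight as large ones, whereas a fixed proportion keeps the strata in roughly their original ratio.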

Method 1: Using Base R with split and lapply

This method involves splitting your data frame by the factor variable, applying a sampling function to each resulting sub-data frame, and then combining them back. This offers fine-grained control and is excellent for understanding the underlying mechanics.

# Create a sample data frame
data <- data.frame(
  ID = 1:100,
  Group = factor(sample(c("A", "B", "C", "D"), 100, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))),
  Value = rnorm(100)
)

# Define sample size per group (e.g., 5 observations from each group)
sample_size_per_group <- 5

# Stratified sampling using split and lapply:
# split() yields one data frame per level of Group, and min() caps the
# draw at the group's size so that small groups do not cause an error
sampled_data_baseR <- do.call(rbind, lapply(split(data, data$Group), function(x) {
  x[sample(nrow(x), min(sample_size_per_group, nrow(x))), ]
}))

# Row names of the combined result are prefixed with the group label (e.g., "A.12")
print(table(sampled_data_baseR$Group))
print(sampled_data_baseR)

Stratified sampling using base R's split and lapply.
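The same split-and-apply idea also works for a proportion of each stratum. The sketch below is a variant that splits row indices rather than rows; the data frame `df`, the fraction of 0.25, and the seed are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed for this sketch

# Hypothetical data with the same columns as the example above
df <- data.frame(
  ID = 1:100,
  Group = factor(rep(c("A", "B", "C", "D"), times = c(40, 30, 20, 10))),
  Value = rnorm(100)
)

sample_fraction <- 0.25  # assumed fraction taken from every stratum

# Split the row indices (not the rows themselves) by factor level
rows_by_group <- split(seq_len(nrow(df)), df$Group)

picked <- unlist(lapply(rows_by_group, function(idx) {
  # Round up so that every non-empty stratum contributes at least one row
  n_take <- ceiling(sample_fraction * length(idx))
  idx[sample.int(length(idx), n_take)]
}), use.names = FALSE)

# Subset once, keeping the original row order and row names
sampled_prop <- df[sort(picked), ]
table(sampled_prop$Group)
```

Because the data frame is subset only once, the result keeps its original row names instead of the group-prefixed names that `rbind()` produces.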

Method 2: Using dplyr for a Tidy Approach

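The dplyr package, part of the tidyverse, offers a more readable way to perform stratified sampling using group_by() together with sample_n() or sample_frac(). This is often preferred for its clarity and integration with other data manipulation tasks. Note that since dplyr 1.0.0 both functions are superseded by slice_sample(); they still work, but slice_sample() is the recommended choice for new code.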

library(dplyr)

# Create a sample data frame (same structure as before; the values are a new random draw)
data <- data.frame(
  ID = 1:100,
  Group = factor(sample(c("A", "B", "C", "D"), 100, replace = TRUE, prob = c(0.4, 0.3, 0.2, 0.1))),
  Value = rnorm(100)
)

# Stratified sampling using dplyr (sample 10% from each group)
# Very small groups may contribute no rows at this fraction
sampled_data_dplyr_frac <- data %>%
  group_by(Group) %>%
  sample_frac(0.1)

print(table(sampled_data_dplyr_frac$Group))
print(sampled_data_dplyr_frac)

# Stratified sampling using dplyr (sample a fixed number, e.g., 5 from each group)
# With replace = FALSE, sample_n() stops with an error if a group has fewer than 5 rows
sampled_data_dplyr_n <- data %>%
  group_by(Group) %>%
  sample_n(size = 5, replace = FALSE)

print(table(sampled_data_dplyr_n$Group))
print(sampled_data_dplyr_n)

Stratified sampling using dplyr's group_by() and sample_frac()/sample_n().
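For newer code, the same two operations can be written with slice_sample(), which is available in dplyr 1.0.0 and later. This is a minimal sketch; the data frame `df`, its group sizes, and the seed are hypothetical, and group D is deliberately smaller than the requested sample size.

```r
library(dplyr)

set.seed(2024)  # arbitrary seed for this sketch

# Hypothetical data; group D has only 3 rows
df <- data.frame(
  ID = 1:100,
  Group = factor(rep(c("A", "B", "C", "D"), times = c(50, 30, 17, 3)))
)

# Fixed number per group: groups with fewer than 5 rows return all their rows
sampled_n <- df %>%
  group_by(Group) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Fixed proportion per group
sampled_prop <- df %>%
  group_by(Group) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()

table(sampled_n$Group)
table(sampled_prop$Group)
```

Unlike sample_n(), slice_sample() does not stop when a group is smaller than `n`; it returns the whole group instead. Calling ungroup() at the end keeps the grouping from affecting later steps.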

Ensuring Reproducibility

Random sampling, by its nature, produces different results each time it's run. To make your sampling process reproducible, set a random seed using set.seed() before the first random operation in your script. In the examples above that includes the sample() and rnorm() calls that create the data, not only the sampling step. This allows others (or yourself in the future) to obtain the exact same sample by running the code with the same seed.

set.seed(123) # Set a seed for reproducibility

# Now perform your sampling operation, e.g., with dplyr
sampled_data_reproducible <- data %>%
  group_by(Group) %>%
  sample_frac(0.1)

print(table(sampled_data_reproducible$Group))

Setting a random seed for reproducible stratified sampling.
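One way to check reproducibility is to draw the sample twice with the same seed and compare the results. The helper `draw_sample()`, the data frame `df`, and the seeds below are hypothetical names and values used only for this sketch.

```r
# Hypothetical helper: set the seed, then draw up to 5 rows per group.
# Note that set.seed() changes the global random number state.
draw_sample <- function(df, seed) {
  set.seed(seed)
  rows_by_group <- split(seq_len(nrow(df)), df$Group)
  picked <- unlist(lapply(rows_by_group, function(idx) {
    idx[sample.int(length(idx), min(5, length(idx)))]
  }), use.names = FALSE)
  df[picked, ]
}

df <- data.frame(
  ID = 1:60,
  Group = factor(rep(c("A", "B", "C"), each = 20))
)

first  <- draw_sample(df, seed = 123)
second <- draw_sample(df, seed = 123)
third  <- draw_sample(df, seed = 456)

identical(first, second)  # same seed, same rows
identical(first, third)   # a different seed will usually select different rows
```

The first comparison returns TRUE because both calls start from the same seed and perform the same sequence of random draws.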