Continuing from Part 1, we have collected the data, but now we face another issue: Power BI does not like processing such a large .csv, so we will need to take another approach. We will need a sample, and there are many sampling methods, each with its ups and downs. Now, we could talk all day about the different types, but instead I am going to tell you what I went with and why.
I settled on bootstrapping:
#Program setup to automatically and efficiently handle packages
cran_mirror <- "https://cran.csiro.au/"
install.packages("pacman", repos = cran_mirror)
pacman::p_load(ggplot2, tidyr, dplyr, tidyverse)
# Read the CSV file; change the path as needed
data <- read.csv("s_cross_csv.csv")
# Sample size calculation; adjust parameters as needed
# Define parameters
Z <- 2.576 # Z-score for 99% confidence level
p <- 0.5 # Estimated proportion for maximum variability
E <- 0.01 # Margin of error
# Calculate sample size
n <- (Z^2 * p * (1 - p)) / (E^2)
# Print the sample size
cat("Required sample size:", ceiling(n), "\n")
# Function for sampling random entries without replacement; mutate adds a new column holding the sample-set ID
sample_entries <- function(data, sample_size, set_number) {
  data %>% sample_n(sample_size) %>% mutate(set_id = set_number)
}
# Set seed for reproducibility
set.seed(123)
# Change the numbers below to match the desired number of sample sets and sample size
number_of_sets <- 30
sample_size <- ceiling(n) # sample_n() needs a whole number of rows
#List for storing samples
samples_list <- imap(1:number_of_sets, ~ sample_entries(data, sample_size, .x))
# Write each sample to its own CSV
iwalk(samples_list, ~ write_csv(.x, paste0("sample_", .y, "id_pixel_sets.csv")))
# Write all sample sets to a single CSV
combined_samples <- bind_rows(samples_list)
write_csv(combined_samples, "pixel_bootstrap_sample.csv")
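As a quick optional sanity check (a minimal sketch using the combined_samples object created above), you can confirm each sample set holds the expected number of rows:
# Count rows per sample set; each count should equal the calculated sample size
combined_samples %>% count(set_id)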
- Due to the size of each sample relative to the full dataset, sampling with or without replacement makes a negligible difference (a with-replacement variant is sketched after this list).
- The data is plentiful and clean, so why not increase the accuracy and opt for something a little more robust than a single simple random sample?
- It is an opportunity to further my own skills.
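For reference, true bootstrap resampling draws with replacement; the only change to the sampling function is one argument. Here is a minimal sketch of that variant, assuming the same data, sample_size, and pipeline as above:
# Bootstrap variant of the sampling function: draws rows with replacement
bootstrap_entries <- function(data, sample_size, set_number) {
  data %>% sample_n(sample_size, replace = TRUE) %>% mutate(set_id = set_number)
}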
Now, this is all just exploratory analysis, so this approach is overkill, especially as the weight and ramifications of whatever I find are negligible. Regardless, there is a certain joy in taking things a little further. In the next part, we will load all of this into Power BI and see what we can find.