Continuing from Part 1, we have collected the data, but now we face another issue: Power BI does not like processing such a large .csv, so we will need to take another approach. We will need a sample, and there are many sampling methods, each with its ups and downs. Now, we could talk all day about the different types, but instead I am going to tell you what I went with and why.
I settled on bootstrapping:
#Program setup to automatically and efficiently handle packages
cran_mirror <- "https://cran.csiro.au/"
install.packages("pacman", repos = cran_mirror)
pacman::p_load(ggplot2, tidyr, dplyr, tidyverse)
# Read the CSV file; change the path as needed
data <- read.csv("s_cross_csv.csv")
# Sample size calculation; adjust parameters as needed
# Define parameters
Z <- 2.576 # Z-score for 99% confidence level
p <- 0.5 # Estimated proportion for maximum variability
E <- 0.01 # Margin of error
# Calculate sample size
n <- (Z^2 * p * (1 - p)) / (E^2)
# Print the sample size
cat("Required sample size:", ceiling(n), "\n")
# Function for sampling random entries without replacement; mutate adds a new column holding the sample-set ID
sample_entries <- function(data, sample_size, set_number) {
  data %>% sample_n(sample_size) %>% mutate(set_id = set_number)
}
# Set seed for reproducibility
set.seed(123)
# Change the numbers below to match the desired number of sample sets and sample size
number_of_sets <- 30
sample_size <- ceiling(n) # sample_n() needs a whole number of rows
#List for storing samples
samples_list <- imap(1:number_of_sets, ~ sample_entries(data, sample_size, .x))
# Write each sample to its own CSV
iwalk(samples_list, ~ write_csv(.x, paste0("sample_", .y, "id_pixel_sets.csv")))
# Write all sample sets to a single CSV
combined_samples <- bind_rows(samples_list)
write_csv(combined_samples, "pixel_bootstrap_sample.csv")
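As a quick optional sanity check (a minimal sketch using the combined_samples object created above), you can confirm each sample set holds the expected number of rows:
# Count rows per sample set; each count should equal the calculated sample size
combined_samples %>% count(set_id)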
- Due to the size of each sample relative to the full dataset, sampling with or without replacement makes a negligible difference (a with-replacement variant is sketched after this list).
- The data is plentiful and clean, so why not increase the accuracy and opt for something a little more robust than a single simple random sample?
- It is an opportunity to further my own skills.
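For reference, true bootstrap resampling draws with replacement; the only change to the sampling function is one argument. Here is a minimal sketch of that variant, assuming the same data, sample_size, and pipeline as above:
# Bootstrap variant of the sampling function: draws rows with replacement
bootstrap_entries <- function(data, sample_size, set_number) {
  data %>% sample_n(sample_size, replace = TRUE) %>% mutate(set_id = set_number)
}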
Now, this is all just exploratory analysis, so this approach is overkill, especially as the weight and ramifications of whatever I find are negligible. Regardless, there is a certain joy in taking things a little further. In the next part, we will load all of this into Power BI and see what we can find.