PowerQuery Puzzle solved with R

[This article was first published on Numbers around us - Medium, and kindly contributed to R-bloggers].

#201–206

Puzzles

Author: ExcelBI

All files (xlsx with the puzzle and R with the solution) for each and every puzzle are available on my GitHub. Enjoy.

Puzzle #201

We need to find out which customers had the opportunity to buy a specific product (and maybe bought it). We receive two tables: one presenting the period of each customer's activity and one presenting item availability. An empty cell in the second table's start column means the item was available even before that, and an empty cell in the finish column means it is still in stock after the last customer ends their purchasing adventure. This task looks hard, but it really is not. We need to build date sequences for each person and product, then find the common dates and add some transformations to get the result table. Check it out.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_201.xlsx"
input1 = read_excel(path, range = "A2:C7")
input2 = read_excel(path, range = "A10:C16")
test = read_excel(path, range = "E1:K6")

Transformation

i1 = input1 %>%
  mutate(date = map2(`Buy Date From`, `Buy Date To`, seq, by = "day")) %>%
  unnest(date) %>%
  select(Buyer, date)

i2 = input2 %>%
  mutate(`Stock Start Date` = replace_na(`Stock Start Date`, min(`Stock Start Date`, na.rm = TRUE)),
         `Stock Finish Date` = replace_na(`Stock Finish Date`, max(i1$date, na.rm = TRUE))) %>%
  mutate(date = map2(`Stock Start Date`, `Stock Finish Date`, seq, by = "day"))  %>%
  unnest(date) %>%
  select(Items, date)

result = i1 %>%
  inner_join(i2, by = c("date")) %>%
  pivot_wider(names_from = Items, values_from = date, values_fn = length) %>%
  select(`Buyer / Items` = 1, sort(colnames(.), decreasing = FALSE)) %>%
  mutate(across(-c(1), ~ifelse(is.na(.), ., "X")))
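
If the date-sequence idea feels abstract, here is a minimal sketch on made-up data. The tables `buyers` and `stock` and all their column names are my own illustration, not the puzzle's, and `relationship = "many-to-many"` assumes dplyr 1.1.0 or newer.

```r
library(tidyverse)

# Hypothetical toy tables (not the puzzle data)
buyers <- tibble(Buyer = c("Ann", "Bob"),
                 From  = as.Date(c("2024-01-01", "2024-01-05")),
                 To    = as.Date(c("2024-01-03", "2024-01-06")))
stock  <- tibble(Item  = c("Pen", "Ink"),
                 Start = as.Date(c("2024-01-02", "2024-01-03")),
                 End   = as.Date(c("2024-01-02", "2024-01-10")))

# One row per buyer per active day
buyer_days <- buyers %>%
  mutate(date = map2(From, To, seq, by = "day")) %>%
  unnest(date) %>%
  select(Buyer, date)

# One row per item per day in stock
stock_days <- stock %>%
  mutate(date = map2(Start, End, seq, by = "day")) %>%
  unnest(date) %>%
  select(Item, date)

# A shared date means the buyer could have bought the item
overlap <- buyer_days %>%
  inner_join(stock_days, by = "date", relationship = "many-to-many") %>%
  distinct(Buyer, Item)

overlap
```

Each row of `overlap` is a buyer who was active on at least one day when the item was in stock, which is what the "X" marks in the result table stand for.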

Validation

all.equal(result, test)
# [1] TRUE

Puzzle #202

Somebody made a table that somehow represents an organizational hierarchy, but as always we are assigned to clean this mess up. We need to find the hierarchy level and the subordination (who reports to whom), and store it as Serial. That one was tricky to make, but let's try to walk through it together.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_202.xlsx"
input = read_excel(path, range = "A1:C18")
test  = read_excel(path, range = "E1:F18")

Transformation

result = input %>% 
  mutate(L1 = cumsum(!is.na(Name1))) %>%
  mutate(L2 = cumsum(!is.na(Name2)), .by = L1) %>%
  mutate(L3 = cumsum(!is.na(Name3)), .by = c(L1, L2)) %>%
  mutate(across(starts_with("L"), ~ ifelse(. == 0, NA, .))) %>%
  mutate(across(everything(), ~  as.character(.))) %>%
  rowwise() %>%
  mutate(Names = coalesce(Name3, Name2, Name1), 
         Serial = case_when(
           !is.na(L3) ~ paste(L1, L2, L3, sep = "."),
           !is.na(L2) ~ paste(L1,L2, sep = "."),
           !is.na(L1) ~ L1
         )) %>%
  ungroup() %>%
  select(Serial, Names)
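
The nested counters are the heart of this solution, so here is a small sketch of how they behave. The table `toy` and the names in it are invented for illustration, and I build the serial with `> 0` checks instead of converting zeros to NA, which is a simplification of the code above rather than the same code.

```r
library(dplyr)

# Hypothetical three-level outline: a name sits only in the column of its level
toy <- tibble(
  Name1 = c("CEO", NA, NA, NA, "CFO", NA),
  Name2 = c(NA, "VP A", NA, "VP B", NA, "VP C"),
  Name3 = c(NA, NA, "Lead", NA, NA, NA)
)

numbered <- toy %>%
  # Level 1 counter grows every time a top-level name appears
  mutate(L1 = cumsum(!is.na(Name1))) %>%
  # Level 2 counter restarts inside every level 1 block
  mutate(L2 = cumsum(!is.na(Name2)), .by = L1) %>%
  # Level 3 counter restarts inside every level 2 block
  mutate(L3 = cumsum(!is.na(Name3)), .by = c(L1, L2)) %>%
  mutate(Serial = case_when(
    L3 > 0 ~ paste(L1, L2, L3, sep = "."),
    L2 > 0 ~ paste(L1, L2, sep = "."),
    TRUE   ~ as.character(L1)
  ))

numbered$Serial
```

A counter that is still 0 means the row has not reached that level yet, so the serial simply stops one level earlier.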

Validation

identical(result, test)
# [1] TRUE

Puzzle #203

Messy spreadsheets, chaos in the making. How many of us have seen at least one, and fixed at least one of them? So what do we have today? The spreadsheet is based on 3 groups that we see in the first column, separated by empty rows. But there are some cells with weird strings and some numbers outside of the primarily chosen rows. So we need to summarise our groups of rows (to be specific, find the average of each group) and gather all the other cells with numbers into the category “Remaining”. We need some serious tools here.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_203.xlsx"
input = read_excel(path, range = "A1:C14")
test  = read_excel(path, range = "E1:F5")

Transformation

result = input %>%
  mutate(Text = as.numeric(Text),
         Group = consecutive_id(is.na(Amount1)) / 2 * !is.na(Amount1)) %>%
  mutate(Group = ifelse(is.na(Amount1), "Remaining", paste0("Group", Group))) %>%
  summarise(nmb = list(c(Amount1, Amount2, Text)), .by = Group) %>%
  mutate(nmb = map(nmb, ~.x[!is.na(.x)])) %>%
  mutate(avg = map_dbl(nmb, ~mean(.x, na.rm = TRUE)) %>% round()) %>%
  arrange(Group) %>%
  select(Group, `Avg Amount` = avg)

There is a pretty nice trick in one of the lines. We use consecutive_id on the column to distinguish groups, but empty rows shouldn’t belong to those groups, so we do some magic: we multiply the group assignment by 1 if there is a value in the first column and by 0 if not. That puts our empty rows in group 0, which we name “Remaining” at the end.
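
To see the trick in isolation, here is a minimal sketch on a made-up column. The values in `toy` are invented, and I assume the column starts with an empty row, so that the real groups get even run ids. If it started with a value instead, the real groups would get odd run ids and dividing by 2 would not give whole numbers.

```r
library(dplyr)

# Hypothetical column: NA marks the empty separator rows
toy <- tibble(Amount1 = c(NA, 10, 20, NA, 30, NA, 40, 50))

toy <- toy %>%
  mutate(run   = consecutive_id(is.na(Amount1)),   # new id every time NA / non-NA flips
         Group = run / 2 * !is.na(Amount1))        # FALSE counts as 0, TRUE as 1

toy
```

In this sketch the separator rows all land in group 0, while the three blocks of values become groups 1, 2 and 3.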

Validation

identical(result, test)
# [1] TRUE

Puzzle #204

We have a table with lists of fruits (I like to think about them as fruit salad bowls :D). We need to cross-check them to tell how similar they are to each other: how many fruits each pair of salads has in common (for example, the first salad has 2 fruits in common with the second, 1 with the third and 5 with the fourth). Intersection is a good concept and tool to use here.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_204.xlsx"
input = read_excel(path, range = "A1:D7")
test = read_excel(path, range = "F1:I4")

Transformation

count_intersections <- function(col_name, df) {
  col = df[[col_name]] %>% na.omit()
  other_cols = df %>% select(-all_of(col_name)) %>% map(na.omit)
  
  intersection_counts = other_cols %>%
    map_int(~ length(intersect(col, .x)))
  
  filtered_counts = intersection_counts[intersection_counts > 0]
  filtered_names = names(filtered_counts)
  
  map2_chr(filtered_names, filtered_counts, ~ paste(.x, "-", .y)) %>%
    paste(collapse = ", ")
}

result = map_chr(names(input), ~ count_intersections(.x, input))

result1 = tibble(
  Column = paste(names(input), "Match"),
  Intersections = result
) %>%
  separate_rows(Intersections, sep = ", ") %>%
  mutate(nr = row_number(), .by = Column) %>%
  pivot_wider(names_from = Column, values_from = Intersections) %>%
  select(-nr)
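
The core of count_intersections() is just length(intersect(...)). A minimal base R sketch with two invented salads (the vectors below are not the puzzle data) shows what happens for a single pair:

```r
# Hypothetical salads; NA stands for an empty cell in a shorter column
salad1 <- c("Apple", "Banana", "Cherry", NA)
salad2 <- c("Banana", "Cherry", "Date")

# Drop empty cells first, otherwise NA could count as a shared "fruit"
common <- intersect(salad1[!is.na(salad1)], salad2[!is.na(salad2)])

common
length(common)
```

The function above repeats this for one column against every other column and then pastes the column name and the count together.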

Validation

identical(result1, test)
# [1] TRUE

Puzzle #205

We again received data in two separate parts. The first table presents the number of people who gave a specific answer, while the second says what the answer was. We need to join them and arrange the result in a somewhat weird format that our boss asked for. Let’s do it.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_205.xlsx"
input1 = read_excel(path, range = "A2:B13")
input2 = read_excel(path, range = "D2:E13")
test = read_excel(path, range = "H2:L8")

Transformation

input = left_join(input1, input2, by = "Item") 

result = input %>%
  arrange(desc(YesNo), Item) %>%
  mutate(nr = row_number(), .by = YesNo) %>%
  mutate(nr_rem = nr %% 2,
         nr_int = ifelse(nr_rem == 1, nr %/% 2 + 1,  nr %/% 2)) %>%
  select(-nr) %>%
  pivot_wider(names_from = nr_rem, values_from = c(Item, Value), 
              values_fill = list(Value = 0)) %>%
  mutate(Sum = Value_0 + Value_1) %>%
  select(YesNo, Item1 = Item_1, Item2 = Item_0, Sum) %>%
  mutate(`%age` = Sum/sum(Sum), .by = YesNo) 
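
The pairing of rows rests on two small pieces of arithmetic, nr_rem and nr_int. Here is a minimal sketch on plain row numbers; the vector `nr` stands in for the row numbers within one YesNo group, and `pair_id` is my own shorthand, not part of the solution above.

```r
nr <- 1:5   # row numbers within one group

# 1 for odd rows (they go to Item1), 0 for even rows (they go to Item2)
nr_rem <- nr %% 2

# Index of the output row: rows 1 and 2 share row 1, rows 3 and 4 share row 2, ...
nr_int <- ifelse(nr_rem == 1, nr %/% 2 + 1, nr %/% 2)

data.frame(nr, nr_rem, nr_int)

# The same index written more compactly
pair_id <- (nr + 1) %/% 2
```

With an odd number of rows the last pair has no partner, which is why the solution needs values_fill for Value when pivoting wider.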

Validation

identical(result, test)
# [1] TRUE

Puzzle #206

And here we are in the world of fairytales, because I don’t know how to explain the sense of this transformation. It looks like the Big Bad Wolf came and blew our data away all along the spreadsheet, and we need to find out how that is even possible. We need to unite and separate again, pivot longer and back wider: many techniques are used to achieve it.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_206.xlsx"
input = read_excel(path, range = "A1:D13")
test  = read_excel(path, range = "F1:K19")

Transformation

r1 = input %>%
  mutate(group = cumsum(is.na(Group1)) + 1) %>%
  filter(!is.na(Group1)) %>%
  mutate(nr = row_number(), .by = group) %>%
  unite("Group", Group1:Group2, sep = "-") %>%
  unite("Value", Value1:Value2, sep = "-") %>%
  pivot_longer(-c(nr, group), names_to = "Variable", values_to = "Value") %>%
  select(-Variable)

rearrange_df <- function(df, part) {
  df %>%
    filter(group == part) %>%
    select(-group) %>%
    mutate(col = nr, row = row_number()) %>%
    pivot_wider(names_from = col, values_from = Value) %>%
    as.data.frame()
}

result = map_df(unique(r1$group), ~ rearrange_df(r1, .x)) %>%
  select(-c(1,2)) %>%
  separate_wider_delim(1:ncol(.), delim = "-", names_sep = "-") %>%
  mutate(across(everything(), ~ if_else(. == "NA", NA_character_, .)))

names(result) = names(test)
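
One detail that is easy to miss is why the last mutate() compares with the string "NA". A minimal sketch of the unite and separate round trip, on an invented two-row table, shows where that string comes from. It assumes tidyr 1.3.0 or newer for separate_wider_delim().

```r
library(dplyr)
library(tidyr)

# Hypothetical table: the second row is completely empty
toy <- tibble(Group1 = c("A", NA), Group2 = c("B", NA))

# unite() pastes values together, so a missing value becomes the text "NA"
united <- toy %>%
  unite("Group", Group1:Group2, sep = "-")

united$Group

# Splitting brings the columns back, but "NA" is still text and must be restored
back <- united %>%
  separate_wider_delim(Group, delim = "-", names = c("Group1", "Group2")) %>%
  mutate(across(everything(), ~ if_else(. == "NA", NA_character_, .)))

back
```

unite() also has an na.rm argument that drops missing values before pasting, but that would change the number of pieces in each string, so keeping the "NA" text and cleaning it afterwards is the simpler route here.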

Validation

all.equal(result, test)
# [1] TRUE

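Remember: if the structures you compare contain NAs and identical() gives you FALSE, do not stop there, but try all.equal() as well (note the name: all.equal, not all.equals). identical() demands an exact match, including types and attributes, while all.equal() checks for near equality and handles NAs in matching positions just fine.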
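
A small base R sketch of how the two checks differ. The vectors are invented for illustration, and all.equal() is wrapped in isTRUE() because it returns a description of the differences instead of FALSE when the objects do not match.

```r
x <- c(1, NA, 3)
y <- c(1, NA, 3 + 1e-10)   # tiny floating point difference
z <- c(1L, NA, 3L)         # same values stored as integers

identical(x, x)            # NAs in the same positions are not a problem by themselves
identical(x, y)            # exact comparison notices the tiny difference
isTRUE(all.equal(x, y))    # near equality ignores it

identical(x, z)            # double versus integer is a type mismatch
isTRUE(all.equal(x, z))    # values are compared, storage type is not
```

So when a validation fails on a result that looks right, it is worth checking whether the difference is in the values at all or only in the types and attributes.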

Feel free to comment, share and contact me with advice, questions and your ideas on how to improve anything. Contact me on LinkedIn as well if you wish.


PowerQuery Puzzle solved with R was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.
