PowerQuery Puzzle solved with R

[This article was first published on Numbers around us - Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Excel BI’s Weekend PQ Puzzles Solutions

Until this time I didn’t really look on PowerQuery Puzzles of ExcelBI, which he publishes every weekend. But after making some posts about Excel Puzzles, I realized that PQ ones are equally if not more interesting. Usually they need more steps to achieve a goal, and sometimes they are little bit harder. In this series I will not make another framework approaches (data.table or base), but rather explain code step by step.
Today as a part of consistent publishing schedule, I start second format “PQ Puzzles Solved with R”, that will be published on Tuesday, one day after Excel Puzzles from all previous week. At the very begining I’ll start with 4 puzzles (from last two weekends). I hope you will enjoy it.

Puzzles:

PQ_129: content file
PQ_130: content file
PQ_131: content file
PQ_132: content file

PQ_129

What do we have here. Two-column table with dated measurement of temperature is provided, and we have to make something that we can call periodical report. We need to find out specific measurements, aggregates and properties from this time series to get: the highest, the lowest and average temperature per period and days when those extremes happened.

But we do not have specific intervals like Jan 1st to Feb 18th or anything similar. We have to calculate it dynamically. So… Let dive in…

Load data and libraries

library(tidyverse)
library(readxl)
library(lubridate)

input = read_excel("PQ_Challenge_129.xlsx", range = "B2:C1002")

I do not read testing table, because it would be only working on day of publication.

Prepare functions and mid-steps

extract_data <- function(data, period, date_range) {
  # Convert the first element of date_range to a Date object
  date_range_start <- as.Date(date_range[1], format = "%Y-%m-%d")  
  
  # The data manipulation begins here
  data %>%
    # Filter rows where Date is within the specified range
    filter(Date <= today() & Date >= date_range_start) %>%
    # Summarise the data to find maximum, minimum, and average temperature
    summarise(
      MaxTemp = max(Temp),                  # Maximum temperature
      MaxtempDate = Date[which.max(Temp)],  # Date when maximum temperature occurred
      MinTemp = min(Temp),                  # Minimum temperature
      MintempDate = Date[which.min(Temp)],  # Date when minimum temperature occurred
      AvgTemp = mean(Temp)                  # Average temperature
    ) %>%
    # Ensure only one row of summary is returned
    slice(1) %>%
    # Add a column for the specified period
    mutate(Period = period) %>%
    # Reorder columns to have Period first
    select(Period, everything())
}

The function takes three arguments:

  1. data: a dataframe that contains temperature data.
  2. period: a descriptive value (like a string) indicating the period of the data.
  3. date_range: a vector of dates defining the range for data extraction.

But we need to define those periods and date_ranges so, lets prepare one more auxiliary structure:

periods <- data.frame(
  # Create a column 'Period' with descriptive names for time periods
  Period = c("Last 7 Days", "Last 30 Days", "Last 365 Days", "Month to Date", "Quarter to Date", "Year to Date"),
  
  # Create a column 'DateRangeStart' with calculated start dates for each period
  DateRangeStart = c(
    today() - days(6),                  # Start date for the last 7 days
    today() - days(29),                 # Start date for the last 30 days
    today() - days(364),                # Start date for the last 365 days
    floor_date(today(), "month"),       # Start date for the current month
    floor_date(today(), "quarter"),     # Start date for the current quarter
    floor_date(today(), "year")),       # Start date for the current year
  
  stringsAsFactors = FALSE             # Ensure that strings are not converted to factors
)

How is it looks? Of course dates are calculated dynamically and will change every day.

           Period DateRangeStart
1     Last 7 Days     2023-11-14
2    Last 30 Days     2023-10-22
3   Last 365 Days     2022-11-21
4   Month to Date     2023-11-01
5 Quarter to Date     2023-10-01
6    Year to Date     2023-01-01

We have to process calculation for each of given periods which means intervals between start dates and today. So lets use our function which takes our original timeseries source, and two values from periods dataframe to calculate this report. We are going to map function on both input and periods data frames with following script:

output <- map2(periods$Period, periods$DateRangeStart, ~extract_data(input, .x, .y)) %>%
 bind_rows() %>%
 mutate(AvgTemp = round(AvgTemp, 2))

And something weird happened. I have warnings that there is Infinity in results. Quick look and I know. We cannot calculate results for last 7 days, because I prepare this article over week after dates in source ended. But whatever… Here you have result as we get it, for Nov 20th.

 # A tibble: 5 × 6
  Period          MaxTemp MaxtempDate         MinTemp MintempDate         AvgTemp
  <chr>             <dbl> <dttm>                <dbl> <dttm>                <dbl>
1 Last 30 Days         34 2023-10-22 00:00:00     -15 2023-10-29 00:00:00   11.1 
2 Last 365 Days        39 2023-05-29 00:00:00     -26 2022-12-19 00:00:00    7.68
3 Month to Date        31 2023-11-02 00:00:00      -5 2023-11-04 00:00:00   16.6 
4 Quarter to Date      34 2023-10-22 00:00:00     -17 2023-10-20 00:00:00   10.0 
5 Year to Date         39 2023-05-29 00:00:00     -23 2023-07-05 00:00:00    7.7 

I think it looks good and we can proceed to next puzzle.

PQ_130

Here we’ve got two tables and we have to find out which keywords from first one are present in which column of the second one. Sounds weird? Maybe little bit, but what exactly we have to do?

Load data and libraries

library(tidytext)
library(tidyverse)
library(readxl)

T1  = read_excel("PQ_Challenge_130.xlsx", range = "A1:B12")
T2  = read_excel("PQ_Challenge_130.xlsx", range = "D1:G6")

Test = read_excel("PQ_Challenge_130.xlsx", range = "D11:F23")

In table T1 we have keywords with id, in T2 we have columns to check.

Transform data

Now we have to extract two columns we need to check (Text and City Country) and “slice&dice” them to single words. At the end we need to put it in one table which will tell us from which column it comes.

R1 = T2 %>%
  unnest_tokens(word, Text) %>%
  ungroup() %>%
  mutate(`Column Name` = "Text")

R2 = T2 %>%
  unnest_tokens(word, `City / Country`) %>%
  ungroup() %>%
  mutate(`Column Name` = "City / Country")

Result = R1 %>%
  bind_rows(R2)

Finally we are “slicing” first column for single words as well and join it with Result table. It will give us information about description of which person has which keyword id in which column. I promise it looks more clever than that sentence.

T1R = T1 %>%
  unnest_tokens(word, Value) %>%
  ungroup()

Res1 = Result %>%
  left_join(T1R, by = "word") %>%
  select(`Emp ID`, No, `Column Name`) %>%
  filter(!is.na(No)) %>%
  arrange(`Emp ID`, No) %>%
  unique()

How it looks at the end?

# A tibble: 12 × 3
   `Emp ID`    No `Column Name` 
      <dbl> <dbl> <chr>         
 1        1     6 City / Country
 2        1     9 Text          
 3        1    11 Text          
 4        2     4 Text          
 5        2    10 City / Country
 6        3     1 City / Country
 7        3     3 City / Country
 8        4     2 Text          
 9        4     4 City / Country
10        5     5 City / Country
11        5     7 Text          
12        5     8 City / Country

What else left? We can check if this solution is equal to solution/target table provided by puzzle author. And it is.

identical(Res1, Test)
# [1] TRUE

PQ_131

In this case it looks like somebody is doing accounting notes but has really narrow peace of paper and we need to expand it to shape that facts about one file are in the single row.

Load data and libraries

library(tidyverse)
library(readxl) 

input = read_excel(“PQ_Challenge_131.xlsx”, range = “A1:F10”, col_names = FALSE)
test = read_excel(“PQ_Challenge_131.xlsx”, range = “I1:R4”, col_names = TRUE)

Transform data

We need to split this weird table into two elements: labels and values. It will help us to change its shape. Every odd row should be taken to labels, and every even to values. Then both of this auxiliary table should be pivoted to long version. That causes that both tables have exactly the same dimension, and that means that every label is just on the same row like its value in second table.

labels = input %>% 
 filter(row_number() %% 2 == 1) %>%
 pivot_longer(cols = -c(…1), names_to = “row”, values_to = “label”)

values = input %>% 
 filter(row_number() %% 2 == 0) %>%
 pivot_longer(cols = -c(…1), names_to = “row”, values_to = “value”)

After this manipulation we can bind them column-wise. Then we need little cleaning and pivot it back to wide version. What does change here? Every file named Group has only one line with data about it.
Lets finally check if we are having the same values as in provided solution.

final = bind_cols(labels, values) %>% 
  select(Group = ...1, label , value ) %>%
  filter(!is.na(label)) %>%
  pivot_wider(names_from = label, values_from = value) %>%
  mutate(across(everything(), ~ifelse(is.na(.), NA_integer_, as.numeric(.))))

identical(test, final)
#> [1] TRUE

And it is identical.

PQ_132

I have some weird thought. In the office we meet in this puzzles, they have only narrow sheets of paper. Firstly accountant need to extract data from narrow sheet to wide version, now HR need to put wide table on only five columns. But we are here to help so lets go.

Loading data and libraries

library(tidyverse)
library(readxl)

input = read_excel("PQ_Challenge_132.xlsx", range = "A1:I6")
test = read_excel("PQ_Challenge_132.xlsx", range = "A10:E30")

Transform data

First thing we are asked for is to conactenate first and last name in one column.

result = input %>%
  unite("Full Name", c("First Name", "Last Name"), sep = " ") 

Then I need to create one auxiliary structure which is empty, but will be needed because we have odd number of columns to be wrapped, so we will need it to fill the hole.

empty = result %>% 
  select(1) %>%
  mutate(Attr = NA_character_, Value = NA_character_, `Emp ID` = as.character(`Emp ID`))

Then we do the main magic. This time first thing is to pivot it to longer version and nest rows. Then we are working on nested structure and its rows. After nesting we need to specify which row (which property of data record) is in even, and which in odd row, because it will determine if this property will finish in first or second column of unnested data.

rest = result %>% 
  mutate(across(everything(), as.character)) %>%
  pivot_longer(-1, names_to = "Attribute", values_to = "Value") %>%
  group_by(`Emp ID`) %>%
  mutate(row = row_number()) %>%
  nest(data = c(Attribute, Value)) %>%
  arrange(row) %>%
  mutate(row_even = if_else(row %% 2 == 0, "2","1" ))

Finally we are splitting it into two tables by even/odd differentiation. Each of those tables have to be unnested separately. And it appears that one of them is 4x5 and the other 3x5. Here we need our empty table. After those manipulations we can bind both column-wise, clean up little bit and take care of column names.

final_one = rest %>%
  filter(row_even == "1") %>%
  unnest(data) %>%
  select(-row_even, -row)

final_two = rest %>%
  filter(row_even == "2") %>%
  unnest(data) %>%
  select(-row_even, -row) %>%
  bind_rows(empty)

final = final_one %>%
  bind_cols(final_two) %>%
  select(1,2,3,5,6)

colnames(final) <- c("Emp ID", "Attribute1", "Value1", "Attribute2", "Value2")

Unfortunatelly unifying provided solution with our final result would need some additional adjustments, like type changing and so on. But if you would like to check it visually, you will see that it is exactly the same.

# A tibble: 20 × 5
   `Emp ID` Attribute1    Value1            Attribute2 Value2
   <chr>    <chr>         <chr>             <chr>      <chr> 
 1 813185   Full Name     Marisol Hunt      Gender     F     
 2 252591   Full Name     Emmett Cotton     Gender     M     
 3 781324   Full Name     Monica Adkins     Gender     F     
 4 598718   Full Name     Frederick Frazier Gender     M     
 5 435759   Full Name     Prince Hanson     Gender     M     
 6 813185   Date of Birth 1990-02-08        Weight     50    
 7 252591   Date of Birth 1968-10-24        Weight     71    
 8 781324   Date of Birth 1970-05-26        Weight     44    
 9 598718   Date of Birth 1989-03-07        Weight     65    
10 435759   Date of Birth 1984-05-09        Weight     60    
11 813185   Salary        154143            State      NJ    
12 252591   Salary        78407             State      TX    
13 781324   Salary        193067            State      NV    
14 598718   Salary        136736            State      NJ    
15 435759   Salary        72788             State      IN    
16 813185   Sales         127               NA         NA    
17 252591   Sales         3673              NA         NA    
18 781324   Sales         783               NA         NA    
19 598718   Sales         93783             NA         NA    
20 435759   Sales         7893              NA         NA    

It is, isn’t it?

Thank you for reading first part of PQ puzzles solved with R. This “episode” has 4 riddles, but next Tuesday will start normal mode which is “puzzles from last weekend only”.

I hope you like it. Look also for my general articles about R which will come on Thursdays and Excel puzzles that will always precede PQ by one day.


PowerQuery Puzzle solved with R was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Numbers around us - Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)