PowerQuery Puzzle solved with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Excel BI’s Weekend PQ Puzzles Solutions
Until this time I didn’t really look on PowerQuery Puzzles of ExcelBI, which he publishes every weekend. But after making some posts about Excel Puzzles, I realized that PQ ones are equally if not more interesting. Usually they need more steps to achieve a goal, and sometimes they are little bit harder. In this series I will not make another framework approaches (data.table or base), but rather explain code step by step.
Today as a part of consistent publishing schedule, I start second format “PQ Puzzles Solved with R”, that will be published on Tuesday, one day after Excel Puzzles from all previous week. At the very begining I’ll start with 4 puzzles (from last two weekends). I hope you will enjoy it.
Puzzles:
PQ_129: content file
PQ_130: content file
PQ_131: content file
PQ_132: content file
PQ_129
What do we have here. Two-column table with dated measurement of temperature is provided, and we have to make something that we can call periodical report. We need to find out specific measurements, aggregates and properties from this time series to get: the highest, the lowest and average temperature per period and days when those extremes happened.
But we do not have specific intervals like Jan 1st to Feb 18th or anything similar. We have to calculate it dynamically. So… Let dive in…
Load data and libraries
library(tidyverse) library(readxl) library(lubridate) input = read_excel("PQ_Challenge_129.xlsx", range = "B2:C1002")
I do not read testing table, because it would be only working on day of publication.
Prepare functions and mid-steps
extract_data <- function(data, period, date_range) { # Convert the first element of date_range to a Date object date_range_start <- as.Date(date_range[1], format = "%Y-%m-%d") # The data manipulation begins here data %>% # Filter rows where Date is within the specified range filter(Date <= today() & Date >= date_range_start) %>% # Summarise the data to find maximum, minimum, and average temperature summarise( MaxTemp = max(Temp), # Maximum temperature MaxtempDate = Date[which.max(Temp)], # Date when maximum temperature occurred MinTemp = min(Temp), # Minimum temperature MintempDate = Date[which.min(Temp)], # Date when minimum temperature occurred AvgTemp = mean(Temp) # Average temperature ) %>% # Ensure only one row of summary is returned slice(1) %>% # Add a column for the specified period mutate(Period = period) %>% # Reorder columns to have Period first select(Period, everything()) }
The function takes three arguments:
- data: a dataframe that contains temperature data.
- period: a descriptive value (like a string) indicating the period of the data.
- date_range: a vector of dates defining the range for data extraction.
But we need to define those periods and date_ranges so, lets prepare one more auxiliary structure:
periods <- data.frame( # Create a column 'Period' with descriptive names for time periods Period = c("Last 7 Days", "Last 30 Days", "Last 365 Days", "Month to Date", "Quarter to Date", "Year to Date"), # Create a column 'DateRangeStart' with calculated start dates for each period DateRangeStart = c( today() - days(6), # Start date for the last 7 days today() - days(29), # Start date for the last 30 days today() - days(364), # Start date for the last 365 days floor_date(today(), "month"), # Start date for the current month floor_date(today(), "quarter"), # Start date for the current quarter floor_date(today(), "year")), # Start date for the current year stringsAsFactors = FALSE # Ensure that strings are not converted to factors )
How is it looks? Of course dates are calculated dynamically and will change every day.
Period DateRangeStart 1 Last 7 Days 2023-11-14 2 Last 30 Days 2023-10-22 3 Last 365 Days 2022-11-21 4 Month to Date 2023-11-01 5 Quarter to Date 2023-10-01 6 Year to Date 2023-01-01
We have to process calculation for each of given periods which means intervals between start dates and today. So lets use our function which takes our original timeseries source, and two values from periods dataframe to calculate this report. We are going to map function on both input and periods data frames with following script:
output <- map2(periods$Period, periods$DateRangeStart, ~extract_data(input, .x, .y)) %>% bind_rows() %>% mutate(AvgTemp = round(AvgTemp, 2))
And something weird happened. I have warnings that there is Infinity in results. Quick look and I know. We cannot calculate results for last 7 days, because I prepare this article over week after dates in source ended. But whatever… Here you have result as we get it, for Nov 20th.
# A tibble: 5 × 6 Period MaxTemp MaxtempDate MinTemp MintempDate AvgTemp <chr> <dbl> <dttm> <dbl> <dttm> <dbl> 1 Last 30 Days 34 2023-10-22 00:00:00 -15 2023-10-29 00:00:00 11.1 2 Last 365 Days 39 2023-05-29 00:00:00 -26 2022-12-19 00:00:00 7.68 3 Month to Date 31 2023-11-02 00:00:00 -5 2023-11-04 00:00:00 16.6 4 Quarter to Date 34 2023-10-22 00:00:00 -17 2023-10-20 00:00:00 10.0 5 Year to Date 39 2023-05-29 00:00:00 -23 2023-07-05 00:00:00 7.7
I think it looks good and we can proceed to next puzzle.
PQ_130
Here we’ve got two tables and we have to find out which keywords from first one are present in which column of the second one. Sounds weird? Maybe little bit, but what exactly we have to do?
Load data and libraries
library(tidytext) library(tidyverse) library(readxl) T1 = read_excel("PQ_Challenge_130.xlsx", range = "A1:B12") T2 = read_excel("PQ_Challenge_130.xlsx", range = "D1:G6") Test = read_excel("PQ_Challenge_130.xlsx", range = "D11:F23")
In table T1 we have keywords with id, in T2 we have columns to check.
Transform data
Now we have to extract two columns we need to check (Text and City Country) and “slice&dice” them to single words. At the end we need to put it in one table which will tell us from which column it comes.
R1 = T2 %>% unnest_tokens(word, Text) %>% ungroup() %>% mutate(`Column Name` = "Text") R2 = T2 %>% unnest_tokens(word, `City / Country`) %>% ungroup() %>% mutate(`Column Name` = "City / Country") Result = R1 %>% bind_rows(R2)
Finally we are “slicing” first column for single words as well and join it with Result table. It will give us information about description of which person has which keyword id in which column. I promise it looks more clever than that sentence.
T1R = T1 %>% unnest_tokens(word, Value) %>% ungroup() Res1 = Result %>% left_join(T1R, by = "word") %>% select(`Emp ID`, No, `Column Name`) %>% filter(!is.na(No)) %>% arrange(`Emp ID`, No) %>% unique()
How it looks at the end?
# A tibble: 12 × 3 `Emp ID` No `Column Name` <dbl> <dbl> <chr> 1 1 6 City / Country 2 1 9 Text 3 1 11 Text 4 2 4 Text 5 2 10 City / Country 6 3 1 City / Country 7 3 3 City / Country 8 4 2 Text 9 4 4 City / Country 10 5 5 City / Country 11 5 7 Text 12 5 8 City / Country
What else left? We can check if this solution is equal to solution/target table provided by puzzle author. And it is.
identical(Res1, Test) # [1] TRUE
PQ_131
In this case it looks like somebody is doing accounting notes but has really narrow peace of paper and we need to expand it to shape that facts about one file are in the single row.
Load data and libraries
library(tidyverse) library(readxl) input = read_excel(“PQ_Challenge_131.xlsx”, range = “A1:F10”, col_names = FALSE) test = read_excel(“PQ_Challenge_131.xlsx”, range = “I1:R4”, col_names = TRUE)
Transform data
We need to split this weird table into two elements: labels and values. It will help us to change its shape. Every odd row should be taken to labels, and every even to values. Then both of this auxiliary table should be pivoted to long version. That causes that both tables have exactly the same dimension, and that means that every label is just on the same row like its value in second table.
labels = input %>% filter(row_number() %% 2 == 1) %>% pivot_longer(cols = -c(…1), names_to = “row”, values_to = “label”) values = input %>% filter(row_number() %% 2 == 0) %>% pivot_longer(cols = -c(…1), names_to = “row”, values_to = “value”)
After this manipulation we can bind them column-wise. Then we need little cleaning and pivot it back to wide version. What does change here? Every file named Group has only one line with data about it.
Lets finally check if we are having the same values as in provided solution.
final = bind_cols(labels, values) %>% select(Group = ...1, label , value ) %>% filter(!is.na(label)) %>% pivot_wider(names_from = label, values_from = value) %>% mutate(across(everything(), ~ifelse(is.na(.), NA_integer_, as.numeric(.)))) identical(test, final) #> [1] TRUE
And it is identical.
PQ_132
I have some weird thought. In the office we meet in this puzzles, they have only narrow sheets of paper. Firstly accountant need to extract data from narrow sheet to wide version, now HR need to put wide table on only five columns. But we are here to help so lets go.
Loading data and libraries
library(tidyverse) library(readxl) input = read_excel("PQ_Challenge_132.xlsx", range = "A1:I6") test = read_excel("PQ_Challenge_132.xlsx", range = "A10:E30")
Transform data
First thing we are asked for is to conactenate first and last name in one column.
result = input %>% unite("Full Name", c("First Name", "Last Name"), sep = " ")
Then I need to create one auxiliary structure which is empty, but will be needed because we have odd number of columns to be wrapped, so we will need it to fill the hole.
empty = result %>% select(1) %>% mutate(Attr = NA_character_, Value = NA_character_, `Emp ID` = as.character(`Emp ID`))
Then we do the main magic. This time first thing is to pivot it to longer version and nest rows. Then we are working on nested structure and its rows. After nesting we need to specify which row (which property of data record) is in even, and which in odd row, because it will determine if this property will finish in first or second column of unnested data.
rest = result %>% mutate(across(everything(), as.character)) %>% pivot_longer(-1, names_to = "Attribute", values_to = "Value") %>% group_by(`Emp ID`) %>% mutate(row = row_number()) %>% nest(data = c(Attribute, Value)) %>% arrange(row) %>% mutate(row_even = if_else(row %% 2 == 0, "2","1" ))
Finally we are splitting it into two tables by even/odd differentiation. Each of those tables have to be unnested separately. And it appears that one of them is 4x5 and the other 3x5. Here we need our empty table. After those manipulations we can bind both column-wise, clean up little bit and take care of column names.
final_one = rest %>% filter(row_even == "1") %>% unnest(data) %>% select(-row_even, -row) final_two = rest %>% filter(row_even == "2") %>% unnest(data) %>% select(-row_even, -row) %>% bind_rows(empty) final = final_one %>% bind_cols(final_two) %>% select(1,2,3,5,6) colnames(final) <- c("Emp ID", "Attribute1", "Value1", "Attribute2", "Value2")
Unfortunatelly unifying provided solution with our final result would need some additional adjustments, like type changing and so on. But if you would like to check it visually, you will see that it is exactly the same.
# A tibble: 20 × 5 `Emp ID` Attribute1 Value1 Attribute2 Value2 <chr> <chr> <chr> <chr> <chr> 1 813185 Full Name Marisol Hunt Gender F 2 252591 Full Name Emmett Cotton Gender M 3 781324 Full Name Monica Adkins Gender F 4 598718 Full Name Frederick Frazier Gender M 5 435759 Full Name Prince Hanson Gender M 6 813185 Date of Birth 1990-02-08 Weight 50 7 252591 Date of Birth 1968-10-24 Weight 71 8 781324 Date of Birth 1970-05-26 Weight 44 9 598718 Date of Birth 1989-03-07 Weight 65 10 435759 Date of Birth 1984-05-09 Weight 60 11 813185 Salary 154143 State NJ 12 252591 Salary 78407 State TX 13 781324 Salary 193067 State NV 14 598718 Salary 136736 State NJ 15 435759 Salary 72788 State IN 16 813185 Sales 127 NA NA 17 252591 Sales 3673 NA NA 18 781324 Sales 783 NA NA 19 598718 Sales 93783 NA NA 20 435759 Sales 7893 NA NA
It is, isn’t it?
Thank you for reading first part of PQ puzzles solved with R. This “episode” has 4 riddles, but next Tuesday will start normal mode which is “puzzles from last weekend only”.
I hope you like it. Look also for my general articles about R which will come on Thursdays and Excel puzzles that will always precede PQ by one day.
PowerQuery Puzzle solved with R was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.