[This article was first published on Numbers around us – Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


#201–206

Puzzles

Author: ExcelBI

All files (an xlsx with the puzzle and an R script with the solution) for every puzzle are available on my GitHub. Enjoy.

Puzzle #201

We need to find out which customers had the opportunity to buy a specific product (and maybe bought it). We receive two tables: one with each customer's activity window and one with item availability. In the second table, an empty cell in the start column means the item was available even earlier, and an empty cell in the finish column means it is still in stock after the last customer ends their purchasing adventure. This task looks hard, but it really is not. We need to build date sequences for each person and product, then find the common dates and add some transformations to get the result table. Check it out.
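To see the core idea in isolation, here is a minimal sketch on made-up data: expand every date range into one row per day, then join buyers and items on the day. The tables `buyers` and `items`, their column names, and the helper `expand_days()` are hypothetical and are not taken from the puzzle file.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical toy tables, not the puzzle data
buyers <- tibble(
  Buyer = c("A", "B"),
  from  = as.Date(c("2024-01-01", "2024-01-05")),
  to    = as.Date(c("2024-01-03", "2024-01-06"))
)
items <- tibble(
  Item = c("Pen", "Ink"),
  from = as.Date(c("2024-01-03", "2024-01-06")),
  to   = as.Date(c("2024-01-05", "2024-01-08"))
)

# Hypothetical helper: one row per day between `from` and `to`
expand_days <- function(df, id) {
  df %>%
    mutate(date = map2(from, to, seq, by = "day")) %>%
    unnest(date) %>%
    select(all_of(id), date)
}

# A buyer could buy an item if they share at least one day
overlap <- expand_days(buyers, "Buyer") %>%
  inner_join(expand_days(items, "Item"), by = "date") %>%
  distinct(Buyer, Item)

overlap
# A-Pen (Jan 3), B-Pen (Jan 5), B-Ink (Jan 6)
```

Expanding to days is wasteful for long ranges, but on puzzle-sized data it keeps the overlap logic down to a plain join.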

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_201.xlsx"
input1 = read_excel(path, range = "A2:C7")
input2 = read_excel(path, range = "A10:C16")
test = read_excel(path, range = "E1:K6")

Transformation

i1 = input1 %>%
  mutate(date = map2(`Buy Date From`, `Buy Date To`, seq, by = "day")) %>%
  unnest(date) %>%
  select(Buyer, date)

i2 = input2 %>%
  mutate(`Stock Start Date` = replace_na(`Stock Start Date`, min(`Stock Start Date`, na.rm = TRUE)),
         `Stock Finish Date` = replace_na(`Stock Finish Date`, max(i1$date, na.rm = TRUE))) %>%
  mutate(date = map2(`Stock Start Date`, `Stock Finish Date`, seq, by = "day"))  %>%
  unnest(date) %>%
  select(Items, date)

result = i1 %>%
  inner_join(i2, by = c("date")) %>%
  pivot_wider(names_from = Items, values_from = date, values_fn = length) %>%
  select(`Buyer / Items` = 1, sort(colnames(.), decreasing = FALSE)) %>%
  mutate(across(-c(1), ~ifelse(is.na(.), ., "X")))

Validation

all.equal(result, test)
# [1] TRUE

Puzzle #202

Somebody made a table that somehow represents an organizational hierarchy, and as always we are assigned to clean up the mess. We need to find each person's hierarchy level and subordination (who reports to whom) and store it as Serial. This one was tricky, but let's walk through it together.
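The numbering idea can be sketched on a two-level toy table: a running count of filled cells in the first column numbers the top level, and the same count restarted inside each top-level block numbers the second level. The names and the `toy` table below are invented for illustration.

```r
library(dplyr)

# Hypothetical staggered table: each row has exactly one name filled in
toy <- tibble(
  Name1 = c("Ann", NA, NA, "Bob", NA),
  Name2 = c(NA, "Cid", "Dee", NA, "Eve")
)

numbered <- toy %>%
  mutate(L1 = cumsum(!is.na(Name1))) %>%              # 1 1 1 2 2
  mutate(L2 = cumsum(!is.na(Name2)), .by = L1) %>%    # 0 1 2 0 1
  mutate(Serial = if_else(L2 == 0,
                          as.character(L1),
                          paste(L1, L2, sep = ".")))

numbered$Serial
# "1" "1.1" "1.2" "2" "2.1"
```

The `.by` argument needs a reasonably recent dplyr (1.1.0 or later); `group_by()` followed by `ungroup()` would do the same job on older versions.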

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_202.xlsx"
input = read_excel(path, range = "A1:C18")
test  = read_excel(path, range = "E1:F18")

Transformation

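result = input %>%
  # Running count of filled Name1 cells gives the top-level number
  mutate(L1 = cumsum(!is.na(Name1))) %>%
  # Within each parent block, count filled Name2 cells, then Name3 cells
  mutate(L2 = cumsum(!is.na(Name2)), .by = L1) %>%
  mutate(L3 = cumsum(!is.na(Name3)), .by = c(L1, L2)) %>%
  # A count of 0 means that level does not apply to the row
  mutate(across(starts_with("L"), ~ ifelse(. == 0, NA, .))) %>%
  mutate(across(everything(), ~  as.character(.))) %>%
  rowwise() %>%
  # Take the first non-missing of Name3, Name2, Name1 and build the dotted Serial
  mutate(Names = coalesce(Name3, Name2, Name1),
         Serial = case_when(
           !is.na(L3) ~ paste(L1, L2, L3, sep = "."),
           !is.na(L2) ~ paste(L1,L2, sep = "."),
           !is.na(L1) ~ L1
         )) %>%
  ungroup() %>%
  select(Serial, Names)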

Validation

identical(result, test)
# [1] TRUE

Puzzle #203

Messy spreadsheets, chaos in the making. How many of us have seen at least one, and fixed at least one of them? Here is what we have today. The spreadsheet was built around 3 groups, which we see in the first column separated by empty rows. But there are also some cells with weird strings and some numbers outside of the originally chosen rows. So we need to summarise our groups of rows (to be specific, find the average of each group) and gather all the other cells with numbers into a category called “Remaining”. We need some serious tools here.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_203.xlsx"
input = read_excel(path, range = "A1:C14")
test  = read_excel(path, range = "E1:F5")

Transformation

result = input %>%
  mutate(Text = as.numeric(Text),
         Group = consecutive_id(is.na(Amount1)) / 2 * !is.na(Amount1)) %>%
  mutate(Group = ifelse(is.na(Amount1), "Remaining", paste0("Group", Group))) %>%
  summarise(nmb = list(c(Amount1, Amount2, Text)), .by = Group) %>%
  mutate(nmb = map(nmb, ~.x[!is.na(.x)])) %>%
  mutate(avg = map_dbl(nmb, ~mean(.x, na.rm = TRUE)) %>% round()) %>%
  arrange(Group) %>%
  select(Group, `Avg Amount` = avg)

There is a pretty nice trick in one of the lines. We use consecutive_id() on the column to distinguish groups, but the empty rows should not belong to those groups, so we do some magic: we multiply the group assignment by 1 if there is a value in the first column and by 0 if there is not. This turns every empty row into group 0, which we name “Remaining” at the end.
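Here is a small sketch of that trick on invented numbers. It assumes, as the division by 2 in the solution suggests, that the data starts with an empty row, so every real group lands on an even run id. The `toy` table and the intermediate `run` column are hypothetical.

```r
library(dplyr)

# Hypothetical column: an empty row first, then two blocks split by an empty row
toy <- tibble(Amount1 = c(NA, 10, 20, NA, 30, 40))

toy_grouped <- toy %>%
  mutate(
    run   = consecutive_id(is.na(Amount1)),   # 1 2 2 3 4 4
    Group = run / 2 * !is.na(Amount1)         # 0 1 1 0 2 2
  )

toy_grouped$Group
# 0 1 1 0 2 2
```

The logical `!is.na(Amount1)` is coerced to 0 or 1 in the multiplication, which is what zeroes out the empty rows.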

Validation

identical(result, test)
# [1] TRUE

Puzzle #204

We have a table with lists of fruits (I like to think of them as fruit salad bowls :D). We need to cross-check them to tell how similar they are to each other, that is, how many fruits each pair of salads has in common (for example: the first salad has 2 fruits in common with the second, 1 with the third and 5 with the fourth). Intersection is a good concept and tool to use here.
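As a quick illustration of the concept, base R's `intersect()` already does the heavy lifting. The bowls below are made up, and the pairwise matrix is only one possible way to look at the counts, not the layout the puzzle asks for.

```r
# Hypothetical salad bowls
bowls <- list(
  Bowl1 = c("apple", "kiwi", "plum"),
  Bowl2 = c("kiwi", "plum", "pear"),
  Bowl3 = c("fig", "apple")
)

# Shared-fruit count for every pair of bowls;
# the diagonal is simply each bowl's own size
common <- sapply(bowls, function(a) {
  sapply(bowls, function(b) length(intersect(a, b)))
})

common["Bowl1", "Bowl2"]  # 2 (kiwi, plum)
common["Bowl1", "Bowl3"]  # 1 (apple)
common["Bowl2", "Bowl3"]  # 0
```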

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_204.xlsx"
input = read_excel(path, range = "A1:D7")
test = read_excel(path, range = "F1:I4")

Transformation

count_intersections <- function(col_name, df) {
  col = df[[col_name]] %>% na.omit()
  other_cols = df %>% select(-all_of(col_name)) %>% map(na.omit)

  intersection_counts = other_cols %>%
    map_int(~ length(intersect(col, .x)))

  filtered_counts = intersection_counts[intersection_counts > 0]
  filtered_names = names(filtered_counts)

  map2_chr(filtered_names, filtered_counts, ~ paste(.x, "-", .y)) %>%
    paste(collapse = ", ")
}

result = map_chr(names(input), ~ count_intersections(.x, input))

result1 = tibble(
  Column = paste(names(input), "Match"),
  Intersections = result
) %>%
  separate_rows(Intersections, sep = ", ") %>%
  mutate(nr = row_number(), .by = Column) %>%
  pivot_wider(names_from = Column, values_from = Intersections) %>%
  select(-nr)

Validation

identical(result1, test)
# [1] TRUE

Puzzle #205

Once again we received data in two separate parts. The first table gives the number of people with a specific answer, while the second says what the answer was. We need to join them and arrange the result in the rather weird format our boss asked for. Let’s do it.

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_205.xlsx"
input1 = read_excel(path, range = "A2:B13")
input2 = read_excel(path, range = "D2:E13")
test = read_excel(path, range = "H2:L8")

Transformation

input = left_join(input1, input2, by = "Item")

result = input %>%
  arrange(desc(YesNo), Item) %>%
  mutate(nr = row_number(), .by = YesNo) %>%
  mutate(nr_rem = nr %% 2,
         nr_int = ifelse(nr_rem == 1, nr %/% 2 + 1,  nr %/% 2)) %>%
  select(-nr) %>%
  pivot_wider(names_from = nr_rem, values_from = c(Item, Value),
              values_fill = list(Value = 0)) %>%
  mutate(Sum = Value_0 + Value_1) %>%
  select(YesNo, Item1 = Item_1, Item2 = Item_0, Sum) %>%
  mutate(`%age` = Sum/sum(Sum), .by = YesNo) 

Validation

identical(result, test)
# [1] TRUE
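The pairing step in the transformation above rests on two bits of integer arithmetic: the remainder decides whether an item goes to the first or second slot, and the integer division decides which output row it belongs to. A minimal sketch with five hypothetical row numbers:

```r
nr <- 1:5

# Odd rows (remainder 1) fill the first slot, even rows (remainder 0) the second
nr_rem <- nr %% 2                                        # 1 0 1 0 1

# Rows 1-2 share output row 1, rows 3-4 share row 2, row 5 starts row 3
nr_int <- ifelse(nr_rem == 1, nr %/% 2 + 1, nr %/% 2)    # 1 1 2 2 3

# The same row index in a single expression
nr_alt <- (nr + 1) %/% 2                                 # 1 1 2 2 3
```

With an odd number of rows the last pair has an empty second slot, which is presumably why the solution fills missing values with 0 before summing.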

Puzzle #206

And here we are in the world of fairy tales, because I don’t know how to explain the sense of this transformation. It looks as if the Big Bad Wolf came and blew our data all over the spreadsheet, and we need to work out how that is even possible. We need to unite and separate again, pivot longer and back to wider; many techniques are needed to achieve it.
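One detail of the unite-then-separate round trip is worth seeing on its own: by default `unite()` writes a missing value as the text "NA", so after splitting the columns again the literal string has to be turned back into a real NA. The `toy` table below is invented for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical pair of columns with one missing value
toy <- tibble(Group1 = c("A", "B"), Group2 = c("x", NA))

# Glue the pair into one cell; the NA becomes the text "NA"
glued <- toy %>%
  unite("Group", Group1:Group2, sep = "-")
glued$Group
# "A-x" "B-NA"

# Split it back and restore the real missing value
restored <- glued %>%
  separate_wider_delim(Group, delim = "-", names = c("Group1", "Group2")) %>%
  mutate(across(everything(), ~ na_if(.x, "NA")))
restored$Group2
# "x" NA
```

Gluing the pair together lets it travel through the pivots as a single cell, at the price of this small clean-up afterwards.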

Loading libraries and data

library(tidyverse)
library(readxl)

path = "Power Query/PQ_Challenge_206.xlsx"
input = read_excel(path, range = "A1:D13")
test  = read_excel(path, range = "F1:K19")

Transformation

r1 = input %>%
  mutate(group = cumsum(is.na(Group1)) + 1) %>%
  filter(!is.na(Group1)) %>%
  mutate(nr = row_number(), .by = group) %>%
  unite("Group", Group1:Group2, sep = "-") %>%
  unite("Value", Value1:Value2, sep = "-") %>%
  pivot_longer(-c(nr, group), names_to = "Variable", values_to = "Value") %>%
  select(-Variable)

rearrange_df <- function(df, part) {
  df %>%
    filter(group == part) %>%
    select(-group) %>%
    mutate(col = nr, row = row_number()) %>%
    pivot_wider(names_from = col, values_from = Value) %>%
    as.data.frame()
}

result = map_df(unique(r1$group), ~ rearrange_df(r1, .x)) %>%
  select(-c(1,2)) %>%
  separate_wider_delim(1:ncol(.), delim = "-", names_sep = "-") %>%
  mutate(across(everything(), ~ if_else(. == "NA", NA_character_, .)))

names(result) = names(test)

Validation

all.equal(result, test)
# [1] TRUE

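Remember: when the structures you compare contain NAs, reach for all.equal() rather than identical(). identical() demands an exact match, including types and attributes, while all.equal() tests for near equality and describes any differences it finds.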
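A small base R sketch of the difference between the two checks, using made-up vectors. This only illustrates general behaviour and does not reproduce the exact mismatch in the puzzle data.

```r
x <- c(1, NA, 3)     # double vector with a missing value
y <- c(1L, NA, 3L)   # integer vector with the same values

# Same values, different storage type
strict_types  <- identical(x, y)          # FALSE
lenient_types <- isTRUE(all.equal(x, y))  # TRUE

# Floating-point noise
strict_float  <- identical(0.1 + 0.2, 0.3)          # FALSE
lenient_float <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE
```

Wrapping the call in `isTRUE()` matters, because `all.equal()` returns a character description of the differences instead of FALSE when the objects do not match.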

Feel free to comment, share and contact me with advice, questions and your ideas on how to improve anything. Contact me on LinkedIn if you wish as well.


PowerQuery Puzzle solved with R was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.

Decoding The Power of Data Science with R: An Analysis of Puzzles #201-#206

The article published on Medium and shared on R-bloggers titled “PowerQuery Puzzle solved with R” presents six different puzzles which are solved using R, a popular programming language for data analysis. The problems are related to different data processing tasks such as joining and transforming data, matching patterns, summarizing group data, handling missing values, and restructuring data. In this piece, we delve deep into the elements of these puzzles, analyze their implications for the broader field of data science, and recommend future strategies for tackling similar problems.

Analyzing the Puzzles

The puzzles discussed in the article cover a range of data tasks that most data scientists encounter in real life. The puzzles relate to:

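  • Identifying which customers could have bought which products, based on overlapping activity and stock dates
  • Reconstructing an organizational hierarchy, with level numbering, from a staggered three-column table
  • Averaging groups of rows in a messy spreadsheet and collecting stray numbers into a “Remaining” category
  • Counting the items shared between pairs of lists, such as fruit salads
  • Joining two separate tables and arranging the result in a requested layout
  • Restructuring a table by uniting, separating, and pivoting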

For each puzzle, the author presents the R code used to solve it. The solutions rely on the tidyverse and readxl packages and cover date-sequence generation, joins, pivoting, grouped cumulative counts, and validation against the expected output.

Long-term Implications and Future Developments

The article shows how versatile R is across a range of data-wrangling tasks. With its broad package ecosystem and support for many kinds of data manipulation, R has become a widely used tool among data scientists. Puzzles like these build skills that carry over to more complex data processing tasks.

While R has proved itself to be effective in these scenarios, future advancements may see the development of more integrated solutions for data processing, potentially making languages such as R even more powerful and user-friendly. As the complexity and scale of data increase, the demand for tools that can perform efficient and accurate computations will rise. Technologies like machine learning and AI might play a vital role in facilitating this process.

Actionable Advice

For those seeking to develop their data science skills with R, these puzzles serve as an excellent practical guide. To enhance comprehension and coding skills, here are a few tips:

  • Practice: Try solving these puzzles independently and then compare solutions with the one provided. This can help understand different approaches to a problem.
  • Understand the libraries: The article uses the tidyverse and readxl packages. Understand the functions used from each and experiment with them.
  • Experiment with data: Don’t limit yourself to these puzzles. Try manipulating different datasets to understand real-world applicability.

To handle similar tasks in the future, it is advisable to keep up-to-date with advancements in R and related AI technologies. Participating in coding challenges and working on data projects can also help to remain competitive and proficient in the field.
