[This article was first published on Numbers around us – Medium, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)


Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

#191–192

Puzzles

Author: ExcelBI

All files (xlsx with puzzle and R with solution) for each and every puzzle are available on my Github. Enjoy.

Puzzle #191

Usually on Power Query Challenges we are getting some table transformations, but not this time. We have some pretty hardcore text manipulation here. What we have to do? Get from text only “words” that consists of:

  • digits mixed with special characters plus capital letters mixed with special characters or
  • capital letters mixed with special characters plus digits mixed with special characters.

In both cases special characters are optional. But can be mixed as densely as author would only imagine, even after every letter or digit.

But it is not the end, after finding proper strings we need to clear all special characters, and concatenate in order letters > digits, separated with underscore.

I can confess that hardest thing here was to compose regular expression to find those “words”. It took me about 90% of all time consumed for this task. Check this out.

Loading libraries and data

library(tidyverse)
library(readxl)
library(rebus)

path = "Power Query/PQ_Challenge_191.xlsx"
input = read_excel(path, range = "A1:A11")
test = read_excel(path, range = "A1:B11") 

Transformation

pattern1 = "b[A-Z]+(?:[!@#$%^&*_+=]*[A-Z]*)*[!@#$%^&*_+=]*[0-9]+(?:[!@#$%^&*_+=]*[0-9]*)*b"
pattern2 = "b[0-9]+(?:[!@#$%^&*_+=]*[0-9]*)*[!@#$%^&*_+=]*[A-Z]+(?:[!@#$%^&*_+=]*[A-Z]*)*b"

order_chars = function(text) {
  text = str_replace_all(text, "[^[:alnum:]]", "")
  letters = str_extract_all(text, "[A-Z]")[[1]] %>% paste0(collapse = "")
  numbers = str_extract_all(text, "[0-9]")[[1]] %>% paste0(collapse = "")
  result = paste0(letters, "_", numbers)
  return(result)
}


result = input %>%
  mutate(pat1 = str_extract_all(Text, pattern1),
         pat2 = str_extract_all(Text, pattern2)) %>%
  mutate(ext = map2(pat1, pat2, ~c(.x, .y))) %>%
  select(-c(pat1, pat2)) %>%
  unnest(ext, keep_empty = T) %>%
  mutate(result = map_chr(ext, order_chars)) %>%
  group_by(Text) %>%
  summarise(`Answer Expected` = paste0(result, collapse = ", ")) %>%
  mutate(`Answer Expected` = if_else(`Answer Expected` == "NA_NA", NA_character_, `Answer Expected`))

Validation — “by eye”

res = left_join(test, result, by = c("Text" = "Text"))

# A tibble: 10 × 3
   Text                                                   `Answer Expected.x` `Answer Expected.y`
   <chr>                                                  <chr>               <chr>
 1 Life is beautiful LA$340                               LA_340              LA_340
 2 Q#AR_8 Dream big, work hard                            QAR_8               QAR_8
 3 Actions 55 speak 83_LDR louder than words              LDR_83              LDR_83
 4 Every Q9#02 MOMENT 89abc matters                       Q_902               Q_902
 5 Kindness costs 45A6Q nothing 83 ABC                    NA                  NA
 6 Believe88 45 you 2_3*ABC can, and you're halfway there ABC_23              ABC_23
 7 Happiness Q@56, TY#787 is a choice                     Q_56, TY_787        Q_56, TY_787
 8 Time 99+RT heals all wounds GHOPQ*45                   RT_99, GHOPQ_45     GHOPQ_45, RT_99
 9 Knowledge is ABc_5726 power 23#PQR                     PQR_23              PQR_23
10 EMBRACE THE A$B$C$2$3$8$8$0 JOURNEY                    ABC_23880           ABC_23880   

Extra

I need to explain Regex for those who are not really familiar with it.

b[A-Z]+(?:[!@#$%^&*_+=]*[A-Z]*)*[!@#$%^&*_+=]*[0-9]+(?:[!@#$%^&*_+=]*[0-9]*)*b

- b: at the beginning and end are responsible for applying code
  to "word" which means fragment of string separated by whitespaces.
- [A-Z]+: fragment consisting of one or more capital letters
- (?:....): is called non-capturing group, I am using it to made group for
  purpose of checking its multi-occurence and that is why there is * after,
  which means zero or more occurences.
- [!@#$%^&*_+=]*[A-Z]*: inside non-capturing group means that we are looking
  for group in which there can be special sign or capital letter. I use it to
  check if after first segment of letters there are more mixed with special characters.
-[!@#$%^&*_+=]*: we have optional special characters in middle of expression as well
- [0-9]+: fragment consisting of one or more digit
- (?:[!@#$%^&*_+=]*[0-9]*)*: and again we have non-capturing group for mixing
  digits and special characters

I hope I don’t need to explain second one, because it is only reversed: digits first than capital letters.

Puzzle #192

And we have some table manipulation as well. Today we have project management issue to solve. We have pretty nice data about planned and actual performance of some projects. We have dates for each scenario, and we need to calculate if projects are late, on time or are done faster than planned, and of course if they consume planned time, or maybe more or less. It is not very hard task, but needs a lot of transformations and conditional expressions. Find out yourself.

Loading libraries and data

library(tidyverse)
library(readxl)

path = 'Power Query/PQ_Challenge_192.xlsx'
input = read_excel(path, range = "A1:E14")
test  = read_excel(path, range = "G1:J6")

Transformation

count_workdays <- function(from, to) {
  map2(from, to, seq, by = "days") %>%
    map(~ tibble(timeperiod = .x)) %>%
    map(~ mutate(.x, weekday = wday(timeperiod, week_start = 1)))  %>%
    map(~ filter(.x, weekday %in% 1:5)) %>%
    map_int(~ nrow(.x))
}

result = input %>%
  filter_all(any_vars(!is.na(.))) %>%
  fill(everything(), .direction = "down") %>%
  rename("scenario" = 3) %>%
  pivot_wider(names_from = scenario, values_from = c(4, 5)) %>%
  mutate(`Schedule Performance` = case_when(
    `To Date_Actual` > `To Date_Plan` ~ "Overrun",
    `To Date_Actual` < `To Date_Plan` ~ "Underrun",
    TRUE ~ "On Time"
  ),
  `Actual Dates` = map2_int(`From Date_Actual`, `To Date_Actual`, count_workdays) ,
  `Plan Dates` = map2_int(`From Date_Plan`, `To Date_Plan`, count_workdays),
  `Cost Performance` = case_when(
    `Actual Dates` > `Plan Dates` ~ "Overrun",
    `Actual Dates` < `Plan Dates` ~ "Underrun",
    TRUE ~ "At Cost"
  )) %>%
  mutate(nr = row_number(), .by = Project) %>%
  select(Project, Phase, nr, `Schedule Performance`, `Cost Performance`) %>%
  mutate(Project = if_else(nr == 1, Project, NA_character_)) %>%
  select(-nr)

Validation

identical(result, test)
# [1] TRUE

Feel free to comment, share and contact me with advices, questions and your ideas how to improve anything. Contact me on Linkedin if you wish as well.


PowerQuery Puzzle solved with R was originally published in Numbers around us on Medium, where people are continuing the conversation by highlighting and responding to this story.

To leave a comment for the author, please follow the link and comment on their blog: Numbers around us – Medium.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.


Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Continue reading: PowerQuery Puzzle solved with R

Implications and Future Developments of Text and Data Manipulation Techniques

The puzzle solutions above showcase significant, high-level text and data manipulation techniques implemented with R programming language, specifically pertaining to Excel data. The long-term implications of these techniques are considerable for data analytics, particularly for those handling volumes of text and numeric data that require critical transformation or manipulation.

Enhanced Text Processing

The first puzzle deals with intricate text processing operations, such as word search based on specific conditions and characters’ arrangements including special characters, uppercase letters, and digits. This solution paves the way for more complex text-processing requirements in the future, especially given the widespread of unstructured data across the internet. Understanding and manipulating such data effectively can help glean insights that were previously hard to derive.

Project Management Efficiency

The second puzzle’s solution demonstrates core project management principles in action: tracking schedule and cost performance. This enables us to predict the timely or delayed completion of projects, establishing a more efficient project management timeline. Future developments might involve incorporating more sophisticated forecasting measures and intervention strategies to ensure the smooth progress of projects.

Actionable Advice

These techniques’ potential and power can multiply for businesses that handle large quantities of project data and digital content. Here are a few pieces of advice to harness the potential fully:

  • Invest in upskilling: Encourage your data analytics team to explore and master regular expressions and table manipulation techniques in R. This will provide them with more sophisticated tools to extract and manipulate data.
  • Automate processes: Implement these solutions on a larger scale to automate text and data processing tasks that can save significant man-hours.
  • Incorporate into project management: Apply these methods to your project management practices to track project performance accurately including schedule and cost. This will assist in project planning and resourcing.
  • Explore other applications: These techniques aren’t limited to the text and project data discussed here. Consider other areas within your organization where such data transformations can bring about efficiency.

In conclusion, these solutions demonstrate the potential of R and Python programming languages for text and data manipulation, leading the way for more advanced data analysis solutions in the future.

Read the original article