[This article was first published on rstats on Irregularly Scheduled Programming, and kindly contributed to R-bloggers].

I’ve been using a lot of programming languages recently and they all have their
quirks, differentiating features, and unique qualities, but one thing most of
them share is that they handle strings as a collection of characters. R doesn’t:
it has a “character” type whose elements each hold 0 or more characters, and
that’s what we call a “string”. But what if it did have iterable strings?

For comparison, here’s some Python code

for i in "string":
    print(i)

s
t
r
i
n
g

and some Haskell

upperA = map (\c -> if c == 'a' then 'A' else c)
upperA "banana"
"bAnAnA"

and some Julia

[x+1 for x in "HAL"]
3-element Vector{Char}:
 'I': ASCII/Unicode U+0049 (category Lu: Letter, uppercase)
 'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
 'M': ASCII/Unicode U+004D (category Lu: Letter, uppercase)

In each of these cases the string is treated as a collection of individual
characters. Many languages make this distinction, going so far as using different
quotes to distinguish them; e.g. double quotes for strings "string" and single
quotes for individual characters 's'. This makes even more sense when the
language supports types, in that a string has a String type that is composed
of 0 or more Char types.

R is dynamically typed, so we don’t strictly enforce type signatures, and is an
array language, so it has natural support for arrays (vectors, lists, matrices).
So why are strings not collections of characters?

My guess is that for the majority of use-cases, it wasn’t necessary – a lot of
the time when we read in text data we want the entirety of the string and don’t
want to worry about dealing with a collection on top of the collection of strings
themselves. Plus, if you really need the individual characters you can split the
text up with strsplit(x, "").
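
As a small base R illustration of that last point (the variable names here are only for the example): strsplit() is vectorised over its input, so it returns a list with one character vector per input string, which is why the [[1]] shows up whenever a single string is split.

```r
# Two strings in, a list of two character vectors out
x <- c("string", "banana")
parts <- strsplit(x, "")

length(parts)   # 2
parts[[1]]      # "s" "t" "r" "i" "n" "g"
parts[[2]]      # "b" "a" "n" "a" "n" "a"
```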

But if you do want to work with individual characters, calling
strsplit(x, "")[[1]] throughout your code gets ugly. I solved the Exercism
problem ‘Anagram’ in R and really didn’t like how it looked

anagram <- function(subject, candidates) {
  # remove any same words and inconsistent lengths
  nonsames <- candidates[tolower(candidates) != tolower(subject) &
                           nchar(subject) == nchar(candidates)]
  if (!length(nonsames)) return(c()) # no remaining candidates
  s_letters <- sort(tolower(strsplit(subject, "")[[1]]))
  c_letters <- sapply(sapply(nonsames, \(x) strsplit(x, "")), sort, simplify = FALSE)
  # find all cases where the letters are all the same
  anagrams <- nonsames[sapply(c_letters, \(x) all(s_letters == tolower(x)))]
  # if none found, return NULL
  if(!length(anagrams)) NULL else anagrams
}

Two calls to strsplit, then needing to sapply over that collection to sort it…
not pretty at all. Here’s a Haskell solution
from someone very knowledgeable in our local functional programming Meetup group

import Data.List (sort)
import Data.Char (toLower)
anagramsFor :: String -> [String] -> [String]
anagramsFor xs = filter (isAnagram xs' . map toLower)
  where xs' = map toLower xs
isAnagram :: String -> String -> Bool
isAnagram a b
  | a == b = False
  | otherwise = sort a == sort b

which, excluding the type declarations and the fact that it needs to deal with
the edge case that it has to be a rearrangement, could nearly be a one-liner

import Data.List (sort)
import Data.Char (toLower)
isAnagram a b = sort (map toLower a) == sort (map toLower b)

Wouldn’t it be nice if we could do things like this in R?

The world if R had iterable strings

I don’t expect it would ever happen (maaaybe via some special string handling
like the raw strings r"(this doesn't need escaping)" but unlikely). I
couldn’t find a package that did this (by all means, let me know if there is
one) so I decided to build it myself and see how it could work.

Introducing {charcuterie} – named
partly because it looks like “cut” “char”, and partly because of charcuterie boards
involving lots of little bits of appetizers.

image by Google gemini

library(charcuterie)

At its core, this is just defining chars(x) as strsplit(x, "")[[1]] and
slapping a new class on the output, but big improvements don’t immediately come
from moonshots, they come from incremental improvements. Once I had this, I wanted
to do things with it like sort the individual characters. There is of course a
sort method for vectors (but not for individual strings) so

sort("string")
## [1] "string"
sort(c("s", "t", "r", "i", "n", "g"))
## [1] "g" "i" "n" "r" "s" "t"

One aspect of treating strings as collections of characters is that they should
always look like strings, so I needed to modify the sort method to return an
object of this new class, and make this class display collections of characters
as a string. That just involves pasting the characters back together for printing,
so now I can have this

s <- chars("string")
s
## [1] "string"
sort(s)
## [1] "ginrst"

It looks like a string, but it behaves like a collection of characters!
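
A rough sketch of how this kind of class could be wired up with S3 in base R is below. The names my_chars and its methods are hypothetical, chosen so they don't clash with the package, and the package's actual source may well differ.

```r
# Hypothetical re-implementation for illustration only; not the package's code.
# Split a single string into characters and tag the result with a class.
my_chars <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  structure(strsplit(x, "")[[1]], class = "my_chars")
}

# Display: paste the characters back together so it looks like a string
format.my_chars <- function(x, ...) paste(unclass(x), collapse = "")

print.my_chars <- function(x, ...) {
  print(format(x), ...)
  invisible(x)
}

# Sort the underlying characters, then restore the class on the result
sort.my_chars <- function(x, decreasing = FALSE, ...) {
  structure(sort(unclass(x), decreasing = decreasing, ...), class = "my_chars")
}

s <- my_chars("string")
format(s)        # "string"
format(sort(s))  # "ginrst"
unclass(s)       # "s" "t" "r" "i" "n" "g"
```

The key design choice in this sketch is that every method strips the class, works on the plain character vector, and puts the class back, so the printed form stays a single string while the data stays a collection.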

When you do things right, people won’t know you’ve done anything at all

I thought about what other operations I might want to do and now I have methods to

  • sort with sort
  • reverse with rev
  • index with [
  • concatenate with c
  • print with format and print
  • slice with head and tail
  • set operations with setdiff, union, intersect, and a new except
  • leverage existing vectorised operations like unique, toupper, and tolower
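
To illustrate why a separate except might be wanted alongside setdiff, here is a base R sketch. It assumes except drops matching characters while keeping order and duplicates, which is what the palindrome example further down relies on; drop_chars is a hypothetical stand-in, not the package's function.

```r
x <- strsplit("never odd or even", "")[[1]]

# setdiff() treats its inputs as sets, so repeated letters collapse
as_set <- paste(setdiff(x, " "), collapse = "")
as_set       # "nevrod"

# Hypothetical helper: filter out unwanted characters, keeping order and repeats
drop_chars <- function(x, drop) x[!x %in% drop]
filtered <- paste(drop_chars(x, " "), collapse = "")
filtered     # "neveroddoreven"
```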

I suspect the concatenation will be the one that raises the most eyebrows… I’ve
dealt with the way that other languages join together strings before
and I’m certainly open to what this version should do, but I think it makes
sense to add the collections as

c(chars("butter"), chars("fly"))
## [1] "butterfly"

If you need more than one chars at a time, you’re asking for a vector of vectors,
which R doesn’t support – it supports a list of them, though

x <- lapply(c("butter", "fly"), chars)
x
## [[1]]
## [1] "butter"
##
## [[2]]
## [1] "fly"
unclass(x[[2]])
## [1] "f" "l" "y"

This still sounds simple, and it is – the point is that it feels a lot more
ergonomic to use this inside a function compared to strsplit(x, "")[[1]] and
working with the collection manually.
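
For a sense of the difference, here is a sketch of the anagram check from earlier with the splitting hidden behind one small helper. It uses base R only; letters_of, is_anagram, and anagrams_for are hypothetical names for this example, with letters_of standing in for what chars() plus sort() would do.

```r
# Hypothetical helper: lower-case a word and return its sorted letters
letters_of <- function(word) sort(strsplit(tolower(word), "")[[1]])

# Anagram: same letters, but not the same word (ignoring case)
is_anagram <- function(a, b) {
  tolower(a) != tolower(b) && identical(letters_of(a), letters_of(b))
}

# Keep only the candidates that are anagrams of the subject
anagrams_for <- function(subject, candidates) {
  keep <- vapply(candidates, is_anagram, logical(1), a = subject,
                 USE.NAMES = FALSE)
  candidates[keep]
}

anagrams_for("listen", c("enlists", "google", "inlets", "banana", "Listen"))
# "inlets"
```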

I added an entire vignette of examples to the package, including identifying vowels

vowels <- function(word) {
  ch <- chars(word)
  setNames(ch %in% chars("aeiou"), ch)
}
vowels("string")
##     s     t     r     i     n     g
## FALSE FALSE FALSE  TRUE FALSE FALSE
vowels("banana")
##     b     a     n     a     n     a
## FALSE  TRUE FALSE  TRUE FALSE  TRUE

palindromes

palindrome <- function(a, ignore_spaces = FALSE) {
  a <- chars(a)
  if (ignore_spaces) a <- except(a, " ")
  all(rev(a) == a)
}
palindrome("palindrome")
## [1] FALSE
palindrome("racecar")
## [1] TRUE
palindrome("never odd or even", ignore_spaces = TRUE)
## [1] TRUE

and performing character-level substitutions

spongebob <- function(phrase) {
  x <- chars(phrase)
  odds <- seq(1, length(x), 2)
  x[odds] <- toupper(x[odds])
  string(x)
}
spongebob("you can't do anything useful with this package")
## [1] "YoU CaN'T Do aNyThInG UsEfUl wItH ThIs pAcKaGe"

YoU CaN’T Do aNyThInG UsEfUl wItH ThIs pAcKaGe

On top of all that, I felt it was worthwhile stretching my R package building
muscles, so I’ve added tests with 100% coverage, and ensured it fully passes
check().

I don’t expect this would be used on huge text sources, but it’s useful to me
for silly little projects. If you have any suggestions for functionality that
could extend this then by all means let me know either in
GitHub Issues, the comment section
below, or Mastodon.

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.3.3 (2024-02-29)
##  os       Pop!_OS 22.04 LTS
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_AU.UTF-8
##  ctype    en_AU.UTF-8
##  tz       Australia/Adelaide
##  date     2024-08-03
##  pandoc   3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package     * version    date (UTC) lib source
##  blogdown      1.18       2023-06-19 [1] CRAN (R 4.3.2)
##  bookdown      0.36       2023-10-16 [1] CRAN (R 4.3.2)
##  bslib         0.6.1      2023-11-28 [3] CRAN (R 4.3.2)
##  cachem        1.0.8      2023-05-01 [3] CRAN (R 4.3.0)
##  callr         3.7.3      2022-11-02 [3] CRAN (R 4.2.2)
##  charcuterie * 0.0.0.9000 2024-08-03 [1] local
##  cli           3.6.1      2023-03-23 [1] CRAN (R 4.3.3)
##  crayon        1.5.2      2022-09-29 [3] CRAN (R 4.2.1)
##  devtools      2.4.5      2022-10-11 [1] CRAN (R 4.3.2)
##  digest        0.6.34     2024-01-11 [3] CRAN (R 4.3.2)
##  ellipsis      0.3.2      2021-04-29 [3] CRAN (R 4.1.1)
##  evaluate      0.23       2023-11-01 [3] CRAN (R 4.3.2)
##  fastmap       1.1.1      2023-02-24 [3] CRAN (R 4.2.2)
##  fs            1.6.3      2023-07-20 [3] CRAN (R 4.3.1)
##  glue          1.7.0      2024-01-09 [1] CRAN (R 4.3.3)
##  htmltools     0.5.7      2023-11-03 [3] CRAN (R 4.3.2)
##  htmlwidgets   1.6.2      2023-03-17 [1] CRAN (R 4.3.2)
##  httpuv        1.6.12     2023-10-23 [1] CRAN (R 4.3.2)
##  icecream      0.2.1      2023-09-27 [1] CRAN (R 4.3.2)
##  jquerylib     0.1.4      2021-04-26 [3] CRAN (R 4.1.2)
##  jsonlite      1.8.8      2023-12-04 [3] CRAN (R 4.3.2)
##  knitr         1.45       2023-10-30 [3] CRAN (R 4.3.2)
##  later         1.3.1      2023-05-02 [1] CRAN (R 4.3.2)
##  lifecycle     1.0.4      2023-11-07 [1] CRAN (R 4.3.3)
##  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.3)
##  memoise       2.0.1      2021-11-26 [3] CRAN (R 4.2.0)
##  mime          0.12       2021-09-28 [3] CRAN (R 4.2.0)
##  miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 4.3.2)
##  pkgbuild      1.4.2      2023-06-26 [1] CRAN (R 4.3.2)
##  pkgload       1.3.3      2023-09-22 [1] CRAN (R 4.3.2)
##  prettyunits   1.2.0      2023-09-24 [3] CRAN (R 4.3.1)
##  processx      3.8.3      2023-12-10 [3] CRAN (R 4.3.2)
##  profvis       0.3.8      2023-05-02 [1] CRAN (R 4.3.2)
##  promises      1.2.1      2023-08-10 [1] CRAN (R 4.3.2)
##  ps            1.7.6      2024-01-18 [3] CRAN (R 4.3.2)
##  purrr         1.0.2      2023-08-10 [3] CRAN (R 4.3.1)
##  R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.3)
##  Rcpp          1.0.11     2023-07-06 [1] CRAN (R 4.3.2)
##  remotes       2.4.2.1    2023-07-18 [1] CRAN (R 4.3.2)
##  rlang         1.1.4      2024-06-04 [1] CRAN (R 4.3.3)
##  rmarkdown     2.25       2023-09-18 [3] CRAN (R 4.3.1)
##  rstudioapi    0.15.0     2023-07-07 [3] CRAN (R 4.3.1)
##  sass          0.4.8      2023-12-06 [3] CRAN (R 4.3.2)
##  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.2)
##  shiny         1.7.5.1    2023-10-14 [1] CRAN (R 4.3.2)
##  stringi       1.8.3      2023-12-11 [3] CRAN (R 4.3.2)
##  stringr       1.5.1      2023-11-14 [3] CRAN (R 4.3.2)
##  urlchecker    1.0.1      2021-11-30 [1] CRAN (R 4.3.2)
##  usethis       3.0.0      2024-07-29 [1] CRAN (R 4.3.3)
##  vctrs         0.6.5      2023-12-01 [1] CRAN (R 4.3.3)
##  xfun          0.41       2023-11-01 [3] CRAN (R 4.3.2)
##  xtable        1.8-4      2019-04-21 [1] CRAN (R 4.3.2)
##  yaml          2.3.8      2023-12-11 [3] CRAN (R 4.3.2)
##
##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3
##  [2] /usr/local/lib/R/site-library
##  [3] /usr/lib/R/site-library
##  [4] /usr/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────
