I’ve been using a lot of programming languages recently and they all have their
quirks, differentiating features, and unique qualities, but one thing most of
them share is that they handle strings as a collection of characters. R doesn’t:
it has a “character” type which holds 0 or more characters, and that’s what we call
a “string”. But what if it did have iterable strings?
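
To make the R side of that concrete, here is a small base-R check (no packages involved; the variable names are just for illustration). Looping over a string visits the whole string once, because a string is a single element of a character vector.

```r
# A length-one character vector is a single element, not six characters
word <- "string"

seen <- character(0)
for (i in word) {
  seen <- c(seen, i)  # the loop body runs exactly once
}

seen          # "string"
length(word)  # 1: one string
nchar(word)   # 6: six characters inside that one string
```
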
For comparison, here’s some Python code

    for i in "string":
        print(i)

    s
    t
    r
    i
    n
    g

and some Haskell

    upperA = map (\c -> if c == 'a' then 'A' else c)
    upperA "banana"

    "bAnAnA"

and some Julia

    [x+1 for x in "HAL"]

    3-element Vector{Char}:
     'I': ASCII/Unicode U+0049 (category Lu: Letter, uppercase)
     'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
     'M': ASCII/Unicode U+004D (category Lu: Letter, uppercase)

In each of these cases the string is treated as a collection of individual
characters. Many languages make this distinction, going so far as using different
quotes to distinguish them; e.g. double quotes for strings "string" and single
quotes for individual characters 's'. This makes even more sense when the
language supports types, in that a string has a String type that is composed
of 0 or more Char types.
R is dynamically typed, so we don’t strictly enforce type signatures, and is an
array language, so it has natural support for arrays (vectors, lists, matrices).
So why are strings not collections of characters?
My guess is that for the majority of use-cases, it wasn’t necessary – a lot of
the time when we read in text data we want the entirety of the string and don’t
want to worry about dealing with a collection on top of the collection of strings
themselves. Plus, if you really need the individual characters you can split the
text up with strsplit(x, "").
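
As a quick base-R illustration of what that split returns (variable names here are only for the example): strsplit() is vectorised over its input, so it gives back a list with one element per input string, and a single string still needs [[1]] to get at its characters.

```r
x <- "string"

pieces <- strsplit(x, "")
class(pieces)     # "list"
length(pieces)    # 1: one list element per input string

# Pull out the characters of the first (and only) string
letters_of_x <- pieces[[1]]
letters_of_x      # "s" "t" "r" "i" "n" "g"

# With several strings, each gets its own vector of characters
both <- strsplit(c("butter", "fly"), "")
lengths(both)     # 6 3
```
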
But if you do want to work with individual characters, calling
strsplit(x, "")[[1]] throughout your code gets ugly. I solved the Exercism
problem ‘Anagram’ in R and really didn’t like how it looked

    anagram <- function(subject, candidates) {
      # remove any same words and inconsistent lengths
      nonsames <- candidates[tolower(candidates) != tolower(subject) &
                               nchar(subject) == nchar(candidates)]
      if (!length(nonsames)) return(c()) # no remaining candidates
      s_letters <- sort(tolower(strsplit(subject, "")[[1]]))
      c_letters <- sapply(sapply(nonsames, \(x) strsplit(x, "")), sort, simplify = FALSE)
      # find all cases where the letters are all the same
      anagrams <- nonsames[sapply(c_letters, \(x) all(s_letters == tolower(x)))]
      # if none found, return NULL
      if (!length(anagrams)) NULL else anagrams
    }

Two calls to strsplit, then needing to sapply over that collection to sort it…
not pretty at all. Here’s a Haskell solution
from someone very knowledgeable in our local functional programming Meetup group

    import Data.List (sort)
    import Data.Char (toLower)

    anagramsFor :: String -> [String] -> [String]
    anagramsFor xs = filter (isAnagram xs' . map toLower)
      where xs' = map toLower xs

    isAnagram :: String -> String -> Bool
    isAnagram a b
      | a == b = False
      | otherwise = sort a == sort b

which, excluding the type declarations and the fact that it needs to deal with
the edge case that it has to be a rearrangement, could nearly be a one-liner

    import Data.List (sort)
    import Data.Char (toLower)

    isAnagram a b = sort (map toLower a) == sort (map toLower b)

Wouldn’t it be nice if we could do things like this in R?
I don’t expect it would ever happen (maaaybe via some special string handling
like the raw strings r"(this doesn't need escaping)", but unlikely). I
couldn’t find a package that did this (by all means, let me know if there is
one) so I decided to build it myself and see how it could work.
Introducing {charcuterie} – named
partly because it looks like “cut” “char”, and partly because of charcuterie boards
involving lots of little bits of appetizers.

    library(charcuterie)

At its core, this is just defining chars(x) as strsplit(x, "")[[1]] and
slapping a new class on the output, but big improvements don’t immediately come
from moonshots, they come from incremental improvements. Once I had this, I wanted
to do things with it like sort the individual characters. There is of course a
sort method for vectors (but not for individual strings) so

    sort("string")
    ## [1] "string"
    sort(c("s", "t", "r", "i", "n", "g"))
    ## [1] "g" "i" "n" "r" "s" "t"

One aspect of treating strings as collections of characters is that they should
always look like strings, so I needed to modify the sort method to return an
object of this new class, and make this class display collections of characters
as a string. That just involves pasting the characters back together for printing,
so now I can have this

    s <- chars("string")
    s
    ## [1] "string"
    sort(s)
    ## [1] "ginrst"

It looks like a string, but it behaves like a collection of characters!
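
For readers curious how little machinery that takes, here is a minimal sketch of the idea in base R. It is a hypothetical re-implementation written for this illustration, not the package's actual source: chars() splits and adds a class, a format method pastes the characters back together for display, and a sort method sorts the underlying characters and then restores the class.

```r
# Hypothetical minimal version; the real package may differ in every detail
chars <- function(x) {
  stopifnot(is.character(x), length(x) == 1)
  structure(strsplit(x, "")[[1]], class = "chars")
}

# Display the collection as a single string
format.chars <- function(x, ...) paste(unclass(x), collapse = "")
print.chars <- function(x, ...) {
  print(format(x))
  invisible(x)
}

# Sort the underlying characters, then put the class back on the result
sort.chars <- function(x, decreasing = FALSE, ...) {
  structure(sort(unclass(x), decreasing = decreasing, ...), class = "chars")
}

s <- chars("string")
sorted <- sort(s)
print(sorted)   # displays as one string
length(sorted)  # 6: still a collection of six characters underneath
```

The class is what lets sort() and print() be redirected; without it the result would be a plain character vector again.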
I thought about what other operations I might want to do and now I have methods to
- sort with sort
- reverse with rev
- index with [
- concatenate with c
- print with format and print
- slice with head and tail
- set operations with setdiff, union, intersect, and a new except
- leverage existing vectorised operations like unique, toupper, and tolower
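
The new except in that list deserves a quick illustration. The sketch below is an assumption about its behaviour, inferred from how it is used in the palindrome example later in this post (dropping spaces while keeping repeated letters); the chars(), format and except() definitions here are hypothetical stand-ins, not the package's code.

```r
# Hypothetical stand-ins for the package's chars() and format method
chars <- function(x) structure(strsplit(x, "")[[1]], class = "chars")
format.chars <- function(x, ...) paste(unclass(x), collapse = "")

# Assumed behaviour of except(): drop every occurrence of the characters
# in `y`, keeping the order and any repeats of everything else
except <- function(x, y) {
  keep <- !(unclass(x) %in% unclass(y))
  structure(unclass(x)[keep], class = "chars")
}

phrase <- chars("never odd or even")
no_spaces <- except(phrase, " ")
format(no_spaces)               # "neveroddoreven"

# Base setdiff() also removes duplicates, so it answers a different question
setdiff(unclass(phrase), " ")   # "n" "e" "v" "r" "o" "d"
```
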
I suspect the concatenation will be the one that raises the most eyebrows… I’ve
dealt with the way that other languages join together strings before
and I’m certainly open to what this version should do, but I think it makes
sense to add the collections as

    c(chars("butter"), chars("fly"))
    ## [1] "butterfly"

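
One way such a method could be written is sketched here, purely as an assumption about the approach rather than the package's implementation: strip the class from each argument, join the character vectors end to end, and put the class back on.

```r
# Hypothetical stand-ins for the package's chars() and format method
chars <- function(x) structure(strsplit(x, "")[[1]], class = "chars")
format.chars <- function(x, ...) paste(unclass(x), collapse = "")

# Hypothetical c() method: one longer collection, not a vector of two strings
c.chars <- function(...) {
  parts <- lapply(list(...), unclass)
  structure(unlist(parts), class = "chars")
}

joined <- c(chars("butter"), chars("fly"))
format(joined)   # "butterfly"
length(joined)   # 9
```
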
If you need more than one chars at a time, you’re asking for a vector of vectors,
which R doesn’t support – it supports a list of them, though

    x <- lapply(c("butter", "fly"), chars)
    x
    ## [[1]]
    ## [1] "butter"
    ##
    ## [[2]]
    ## [1] "fly"
    unclass(x[[2]])
    ## [1] "f" "l" "y"

This still sounds simple, and it is – the point is that it feels a lot more
ergonomic to use this inside a function compared to strsplit(x, "")[[1]] and
working with the collection manually.
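
As a rough sketch of that ergonomic gain, here is how the anagram exercise from earlier might look with a chars()-style helper. To keep the example self-contained, chars() is a plain stand-in that only splits the string (no custom class), and is_anagram() is a hypothetical helper name, so treat this as an illustration rather than the package in action.

```r
# Stand-in for the package's chars(): split only, no custom class (assumption)
chars <- function(x) strsplit(x, "")[[1]]

# Hypothetical helper: same letters, but not the same word
is_anagram <- function(a, b) {
  a <- tolower(a)
  b <- tolower(b)
  a != b && identical(sort(chars(a)), sort(chars(b)))
}

anagram <- function(subject, candidates) {
  found <- Filter(function(cand) is_anagram(subject, cand), candidates)
  if (!length(found)) NULL else found
}

matches <- anagram("listen", c("enlists", "google", "inlets", "banana", "Silent"))
matches   # "inlets" "Silent"
```

The comparison now reads much like the Haskell one-liner: lowercase both words, sort their characters, and compare.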
I added an entire vignette of examples to the package, including identifying vowels

    vowels <- function(word) {
      ch <- chars(word)
      setNames(ch %in% chars("aeiou"), ch)
    }
    vowels("string")
    ##     s     t     r     i     n     g
    ## FALSE FALSE FALSE  TRUE FALSE FALSE
    vowels("banana")
    ##     b     a     n     a     n     a
    ## FALSE  TRUE FALSE  TRUE FALSE  TRUE

palindromes

    palindrome <- function(a, ignore_spaces = FALSE) {
      a <- chars(a)
      if (ignore_spaces) a <- except(a, " ")
      all(rev(a) == a)
    }
    palindrome("palindrome")
    ## [1] FALSE
    palindrome("racecar")
    ## [1] TRUE
    palindrome("never odd or even", ignore_spaces = TRUE)
    ## [1] TRUE

and performing character-level substitutions

    spongebob <- function(phrase) {
      x <- chars(phrase)
      odds <- seq(1, length(x), 2)
      x[odds] <- toupper(x[odds])
      string(x)
    }
    spongebob("you can't do anything useful with this package")
    ## [1] "YoU CaN'T Do aNyThInG UsEfUl wItH ThIs pAcKaGe"

On top of all that, I felt it was worthwhile stretching my R package building
muscles, so I’ve added tests with 100% coverage, and ensured it fully passes
check().
I don’t expect this would be used on huge text sources, but it’s useful to me
for silly little projects. If you have any suggestions for functionality that
could extend this then by all means let me know either in
GitHub Issues, the comment section
below, or Mastodon.

    devtools::session_info()

    ## ─ Session info ───────────────────────────────────────────────────────────────
    ##  setting  value
    ##  version  R version 4.3.3 (2024-02-29)
    ##  os       Pop!_OS 22.04 LTS
    ##  system   x86_64, linux-gnu
    ##  ui       X11
    ##  language (EN)
    ##  collate  en_AU.UTF-8
    ##  ctype    en_AU.UTF-8
    ##  tz       Australia/Adelaide
    ##  date     2024-08-03
    ##  pandoc   3.2 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/x86_64/ (via rmarkdown)
    ##
    ## ─ Packages ───────────────────────────────────────────────────────────────────
    ##  package     * version    date (UTC) lib source
    ##  blogdown      1.18       2023-06-19 [1] CRAN (R 4.3.2)
    ##  bookdown      0.36       2023-10-16 [1] CRAN (R 4.3.2)
    ##  bslib         0.6.1      2023-11-28 [3] CRAN (R 4.3.2)
    ##  cachem        1.0.8      2023-05-01 [3] CRAN (R 4.3.0)
    ##  callr         3.7.3      2022-11-02 [3] CRAN (R 4.2.2)
    ##  charcuterie * 0.0.0.9000 2024-08-03 [1] local
    ##  cli           3.6.1      2023-03-23 [1] CRAN (R 4.3.3)
    ##  crayon        1.5.2      2022-09-29 [3] CRAN (R 4.2.1)
    ##  devtools      2.4.5      2022-10-11 [1] CRAN (R 4.3.2)
    ##  digest        0.6.34     2024-01-11 [3] CRAN (R 4.3.2)
    ##  ellipsis      0.3.2      2021-04-29 [3] CRAN (R 4.1.1)
    ##  evaluate      0.23       2023-11-01 [3] CRAN (R 4.3.2)
    ##  fastmap       1.1.1      2023-02-24 [3] CRAN (R 4.2.2)
    ##  fs            1.6.3      2023-07-20 [3] CRAN (R 4.3.1)
    ##  glue          1.7.0      2024-01-09 [1] CRAN (R 4.3.3)
    ##  htmltools     0.5.7      2023-11-03 [3] CRAN (R 4.3.2)
    ##  htmlwidgets   1.6.2      2023-03-17 [1] CRAN (R 4.3.2)
    ##  httpuv        1.6.12     2023-10-23 [1] CRAN (R 4.3.2)
    ##  icecream      0.2.1      2023-09-27 [1] CRAN (R 4.3.2)
    ##  jquerylib     0.1.4      2021-04-26 [3] CRAN (R 4.1.2)
    ##  jsonlite      1.8.8      2023-12-04 [3] CRAN (R 4.3.2)
    ##  knitr         1.45       2023-10-30 [3] CRAN (R 4.3.2)
    ##  later         1.3.1      2023-05-02 [1] CRAN (R 4.3.2)
    ##  lifecycle     1.0.4      2023-11-07 [1] CRAN (R 4.3.3)
    ##  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.3.3)
    ##  memoise       2.0.1      2021-11-26 [3] CRAN (R 4.2.0)
    ##  mime          0.12       2021-09-28 [3] CRAN (R 4.2.0)
    ##  miniUI        0.1.1.1    2018-05-18 [1] CRAN (R 4.3.2)
    ##  pkgbuild      1.4.2      2023-06-26 [1] CRAN (R 4.3.2)
    ##  pkgload       1.3.3      2023-09-22 [1] CRAN (R 4.3.2)
    ##  prettyunits   1.2.0      2023-09-24 [3] CRAN (R 4.3.1)
    ##  processx      3.8.3      2023-12-10 [3] CRAN (R 4.3.2)
    ##  profvis       0.3.8      2023-05-02 [1] CRAN (R 4.3.2)
    ##  promises      1.2.1      2023-08-10 [1] CRAN (R 4.3.2)
    ##  ps            1.7.6      2024-01-18 [3] CRAN (R 4.3.2)
    ##  purrr         1.0.2      2023-08-10 [3] CRAN (R 4.3.1)
    ##  R6            2.5.1      2021-08-19 [1] CRAN (R 4.3.3)
    ##  Rcpp          1.0.11     2023-07-06 [1] CRAN (R 4.3.2)
    ##  remotes       2.4.2.1    2023-07-18 [1] CRAN (R 4.3.2)
    ##  rlang         1.1.4      2024-06-04 [1] CRAN (R 4.3.3)
    ##  rmarkdown     2.25       2023-09-18 [3] CRAN (R 4.3.1)
    ##  rstudioapi    0.15.0     2023-07-07 [3] CRAN (R 4.3.1)
    ##  sass          0.4.8      2023-12-06 [3] CRAN (R 4.3.2)
    ##  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.3.2)
    ##  shiny         1.7.5.1    2023-10-14 [1] CRAN (R 4.3.2)
    ##  stringi       1.8.3      2023-12-11 [3] CRAN (R 4.3.2)
    ##  stringr       1.5.1      2023-11-14 [3] CRAN (R 4.3.2)
    ##  urlchecker    1.0.1      2021-11-30 [1] CRAN (R 4.3.2)
    ##  usethis       3.0.0      2024-07-29 [1] CRAN (R 4.3.3)
    ##  vctrs         0.6.5      2023-12-01 [1] CRAN (R 4.3.3)
    ##  xfun          0.41       2023-11-01 [3] CRAN (R 4.3.2)
    ##  xtable        1.8-4      2019-04-21 [1] CRAN (R 4.3.2)
    ##  yaml          2.3.8      2023-12-11 [3] CRAN (R 4.3.2)
    ##
    ##  [1] /home/jono/R/x86_64-pc-linux-gnu-library/4.3
    ##  [2] /usr/local/lib/R/site-library
    ##  [3] /usr/lib/R/site-library
    ##  [4] /usr/lib/R/library
    ##
    ## ──────────────────────────────────────────────────────────────────────────────

Continue reading: {charcuterie} – What if Strings Were Iterable in R?