This is part of a series of related posts on Apache Arrow. Other posts in the series are:
- Understanding the Parquet file format
- Reading and Writing Data with {arrow} (This post)
- Parquet vs the RDS Format (Coming soon)
What is (Apache) Arrow?
Apache Arrow is a cross-language
development platform for in-memory data. As it’s in-memory (as opposed
to data stored on disk), it provides additional speed boosts. It’s
designed for efficient analytic operations, and uses a standardised
language-independent columnar memory format for flat and hierarchical
data. The {arrow} R package provides an interface to the ‘Arrow C++’
library – an efficient package for
analytic operations on modern hardware.
There are many great tutorials on using
{arrow} (see the links at the bottom
of the post for example). The purpose of this blog post isn’t to simply
reproduce a few examples, but to understand some of what’s happening
behind the scenes. In this particular post, we’re interested in
understanding the reading/writing aspects of {arrow}.
Getting started
The R package is installed from CRAN in the usual way
install.packages("arrow")
Then loaded using
library("arrow")
This blog post uses the NYC Taxi data. It's pretty big – around 40GB in total. To download it locally,

data_nyc = "data/nyc-taxi"
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi") |>
  dplyr::filter(year %in% 2012:2021) |>
  write_dataset(data_nyc, partitioning = c("year", "month"))
Once this has completed, you can check everything has downloaded
correctly by running
nrow(open_dataset(data_nyc))
## [1] 1150352666
Loading in data
Unsurprisingly, the first command we come across is open_dataset(). This opens the data and (sort of) reads it in.
library("arrow") open_dataset(data_nyc) ## FileSystemDataset with 120 Parquet files ## vendor_name: string ## pickup_datetime: timestamp[ms] ## dropoff_datetime: timestamp[ms] ## passenger_count: int64 ## trip_distance: double ## ...
Reading is a lazy action. This allows us to manipulate much larger data
sets than R could typically deal with. The default print method lists
the columns in the data set, with their associated type. These data
types come directly from the C++ API, so they don't always have a corresponding R type. For example, the year column is an int32 (a 32-bit integer), whereas passenger_count is an int64 (a 64-bit integer). In R, both of these are simply integers.
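To see how these types map onto R, you can inspect the dataset's schema, or pull a few rows into R and check what comes back. A minimal sketch, using the data_nyc path from above (output omitted):

ds = open_dataset(data_nyc)
# the full schema: every column name with its Arrow type
ds$schema
# pull a handful of rows into R and look at the converted types
ds |>
  dplyr::select(year, passenger_count) |>
  head() |>
  dplyr::collect()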
As you might guess, there's a corresponding function write_dataset(). Looking at the (rather good) documentation, we come across a few concepts that are worth exploring further.
File formats
The main file formats associated with Arrow are:
- parquet: a format designed to minimise storage – see our recent blog post that delves into some of the details surrounding the format;
- arrow/feather: an in-memory format created to optimise vectorised computations;
- csv: the world runs on csv files (and Excel).
The common workflow is to store your data as parquet files; the Arrow library then loads and processes that data in the arrow format.
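As a rough sketch of that workflow (the particular summary is just an illustration; the column names come from the NYC taxi schema above): open the parquet files lazily, build up a dplyr pipeline, and only collect() the result into R at the end.

open_dataset(data_nyc) |>
  dplyr::filter(year == 2019) |>
  dplyr::group_by(month) |>
  dplyr::summarise(mean_distance = mean(trip_distance, na.rm = TRUE)) |>
  dplyr::collect()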
Storing data in the Arrow format
The obvious thought (to me at least) was, why not store the data as
arrow? Ignoring for the moment that Arrow doesn’t
promise long-term archival storage using
the arrow format, we can do a few tests.
Using the NYC-taxi data, we can create a quick subset
# Replace format = "arrow" with format = "parquet"
# to create the corresponding parquet equivalent
open_dataset(file.path(data_nyc, "year=2019")) |>
  write_dataset("data/nyc-taxi-arrow", partitioning = "month", format = "arrow")
A very quick, but not particularly thorough, test suggests that
- the arrow format requires ten times more storage space. So for the entire nyc-taxi data set, parquet takes around 38GB, but arrow would take around 380GB;
- storing as arrow makes some operations quicker. For the few examples I tried, there was around a 10% increase in speed.
The large storage penalty was enough to convince me of the merits of
storing data as parquet, but there may be some niche situations where
you might switch.
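If you want to run a rough version of that storage comparison yourself, here is a small sketch; the dir_size() helper is my own, and the paths are the ones written above (so this only compares the 2019 slice).

# total size of all files under a directory, in GB
dir_size = function(path) {
  files = list.files(path, recursive = TRUE, full.names = TRUE)
  sum(file.size(files)) / 1024^3
}
dir_size(file.path(data_nyc, "year=2019"))  # parquet copy of 2019
dir_size("data/nyc-taxi-arrow")             # arrow copy of 2019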
Hive partitioning
Both the open_dataset() and write_dataset() functions mention "Hive partitioning" – in fact we sneakily included a partitioning argument in the code above. The open_dataset() function guesses whether we are using Hive partitioning, whereas for write_dataset() we can specify the partitioning explicitly. But what actually is it?
Hive partitioning is a method used to split a table into multiple files based on partition keys. A partition key is a variable of interest in your data, for example, year or month. The files are then organised in folders. Within each folder, the value of the key is determined by the name of the folder. By partitioning the data in this way, we can make queries on slices of the data faster.
Suppose we wanted to partition the data by year, then the file structure
would be
taxi-data
  year=2018
    file1.parquet
    file2.parquet
  year=2019
    file4.parquet
    file5.parquet
Of course, we can partition by more than one variable, such as both year
and month
taxi-data
  year=2018
    month=01
      file01.parquet
    month=02
      file02.parquet
      file03.parquet
    ...
  year=2019
    month=01
      ...
See the excellent vignette on
datasets in the
{arrow} package.
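The pay-off from partitioning is that filters on the partition keys only have to touch the matching folders. A minimal sketch using the year/month partitioning created earlier (the choice of month is arbitrary):

open_dataset(data_nyc) |>
  dplyr::filter(year == 2019, month == 3) |>  # only the year=2019/month=3 folder needs reading
  dplyr::summarise(n = dplyr::n()) |>
  dplyr::collect()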
Example: Partitioning
Parquet files aren’t the only files we can partition. We can also use
the same concept with CSV files. For example,
tmp_dir = tempdir()
write_dataset(palmerpenguins::penguins, path = tmp_dir,
              partitioning = "species", format = "csv")
This looks like
list.files(tmp_dir, recursive = TRUE, pattern = ".csv$")
## [1] "species=Adelie/part-0.csv"    "species=Chinstrap/part-0.csv"
## [3] "species=Gentoo/part-0.csv"
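As a quick round-trip check (a sketch; format = "csv" is needed because open_dataset() defaults to parquet), the partition key is recovered from the folder names when the files are read back:

penguins_csv = open_dataset(tmp_dir, format = "csv")
penguins_csv
# species comes back as a column, reconstructed from the folder names
nrow(penguins_csv)  # should match nrow(palmerpenguins::penguins)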
You can also partition using the group_by() function from {dplyr}

palmerpenguins::penguins |>
  dplyr::group_by(species) |>
  write_dataset(path = tmp_dir, format = "csv")
In my opinion, while it makes conceptual sense to partition CSV files, in practice it's probably not worthwhile. If you're partitioning CSV files to get a speed benefit, you might as well use parquet instead.
Single files vs dataset APIs
When reading in data using Arrow, we can either use the single file functions (these start with read_) or use the dataset API (these start with open_).
For example, read_csv_arrow() reads the CSV file directly into memory. If the file is particularly large, then we'll run out of memory. One thing to note is the as_data_frame argument. By default this is set to TRUE, meaning that read_csv_arrow() will return a tibble. The upside of this is that we have a familiar object. The downside is that it takes up more room than Arrow's internal data representation (an Arrow Table).
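A short sketch of the difference, reading back one of the partitioned penguin CSV files from earlier:

# default (as_data_frame = TRUE): returns a tibble
adelie_df = read_csv_arrow(file.path(tmp_dir, "species=Adelie", "part-0.csv"))

# as_data_frame = FALSE: returns an Arrow Table, keeping the data in Arrow's own format
adelie_tbl = read_csv_arrow(
  file.path(tmp_dir, "species=Adelie", "part-0.csv"),
  as_data_frame = FALSE
)
class(adelie_tbl)
# convert to a tibble only when you actually need the data in R
dplyr::collect(adelie_tbl)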
This blog post by François
Michonneau goes
into far more detail, and discusses the R and Python implementations of
the different APIs.
Acknowledgements
This blog was motivated by the excellent Arrow
tutorial at Posit Conf 2023,
run by Steph Hazlitt and Nic Crane. The NYC dataset came from that
tutorial, and a number of the ideas that I explored were discussed with
the tutorial leaders. I also used a number of resources found on various
corners of the web. I’ve tried to provide links, but if I’ve missed any,
let me know.