Introducing schematic: Simplifying Data Validation in R

[This article was first published on data-in-flight, and kindly contributed to R-bloggers].

I’m thrilled to announce the release of schematic, an R package that helps you (the developer) communicate data validation problems to non-technical users. With schematic, you can leverage tidyselect selectors and other conveniences to compare incoming data against a schema, avoiding punishing issues caused by invalid or poor-quality data.

schematic can now be installed via CRAN:

install.packages("schematic")

Learn more about schematic by checking out the docs.

Motivation

Having built and deployed a number of Shiny apps and APIs that require users to upload data, I noticed a common pain point: how do I tell users, in simple terms, that their data has issues and, more importantly, what those issues are? I needed a way to present the user with error messages that satisfy two needs:

Simple and non-technical: allow developers to explain the problem rather than forcing users to understand the technical aspects of each test (you don’t want to have to explain to users what is.logical means).
Holistic checking: present all validation issues rather than stopping evaluation on the first failure.

There already exist a number of data validation packages for R, including (but not limited to) pointblank, data.validator, and validate, so why introduce a new player? schematic certainly shares similarities with many of these packages, but where I think it innovates over existing solutions is in its unique combination of the following:

Lightweight: Minimal dependencies with a clear focus on checking data without the bells and whistles of graphics, tables, and whatnot.
User-focused but developer-friendly: Developers (especially those approaching from a tidyverse mentality) will like the expressive syntax; users will appreciate the informative instructions on how to comprehensively fix data issues (no more whack-a-mole with fixing one problem only to learn there are many others).
Easy to integrate into applications (e.g., Shiny, Plumber): schematic returns error messages rather than reports or data.frames, meaning that you don’t need additional logic to trigger a runtime error; just pass along the error message in a notification or error code.
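As a minimal sketch of that integration pattern, the snippet below wraps check_schema() in tryCatch() and returns the error text so it can be forwarded to a notification or API response. The helper name validation_message, the score column, and the sample upload are hypothetical and exist only for illustration; schema() and check_schema() are called the same way as in the examples in this post.

```r
library(schematic)

# Hypothetical helper (not part of schematic): returns NULL when the data
# passes every rule, otherwise the text of the schema error.
validation_message <- function(data, schema) {
  tryCatch(
    {
      check_schema(data = data, schema = schema)
      NULL
    },
    error = function(e) conditionMessage(e)
  )
}

# Illustrative schema and upload; the column name is made up for this example.
upload_schema <- schema(
  "is a number" = score ~ is.numeric
)
bad_upload <- data.frame(score = c("a", "b"))

msg <- validation_message(bad_upload, upload_schema)
if (!is.null(msg)) {
  # In Shiny this text could be passed to showNotification(); in Plumber it
  # could become the body of a 4xx response.
  message(msg)
}
```

Returning the message as a plain string keeps the validation step separate from however the application chooses to display it.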

How it works

Warning

All R errors that appear in this post are intentional for the purpose of demonstrating schematic’s error messaging.

Schematic is extremely simple. You only need to do two things: create a schema and then check a data.frame against the schema.

A schema is a set of rules for columns in a data.frame. A rule consists of two parts:

Selector – the column(s) to which the rule applies
Predicate – a function that must return a single TRUE or FALSE indicating the pass or fail of the check
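To make the predicate half concrete, here is a minimal base R sketch of a custom predicate. The function name is_plausible_age and its bounds of 0 to 120 are assumptions chosen for illustration; any function that takes a column and returns a single TRUE or FALSE should fit the same slot.

```r
# Hypothetical predicate: takes a whole column (a vector) and returns one
# TRUE/FALSE, which is the shape a rule's predicate must have.
is_plausible_age <- function(x) {
  is.numeric(x) && all(x >= 0 & x <= 120, na.rm = TRUE)
}

is_plausible_age(c(19.2, 10, 22.5))   # TRUE
is_plausible_age(c(19, 250))          # FALSE
is_plausible_age(c("19", "10"))       # FALSE: not numeric
```

In a schema, a function like this would sit on the right-hand side of a rule, for example age ~ is_plausible_age.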

Let’s imagine a scenario where we have survey data and we want to ensure it matches our expectations. Here’s some sample survey data:

survey_data <- data.frame(
  id = c(1:3, NA, 5),
  name = c("Emmett", "Billy", "Sally", "Woolley", "Duchess"),
  age = c(19.2, 10, 22.5, 19, 19),
  sex = c("M", "M", "F", "M", NA),
  q_1 = c(TRUE, FALSE, FALSE, FALSE, TRUE),
  q_2 = c(FALSE, FALSE, TRUE, TRUE, TRUE),
  q_3 = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)

We declare a schema using schema() and provide it with rules following the format selector ~ predicate:

library(schematic)

my_schema <- schema(
  id ~ is_incrementing,
  id ~ is_all_distinct,
  c(name, sex) ~ is.character,
  c(id, age) ~ is_whole_number,
  education ~ is.factor,
  sex ~ function(x) all(x %in% c("M", "F")),
  starts_with("q_") ~ is.logical,
  final_score ~ is.numeric
)

Then we use check_schema to evaluate our data against the schema. Any and all errors will be captured in the error message:

check_schema(
  data = survey_data,
  schema = my_schema
)

Error in `check_schema()`:
! Schema Error:
- Columns `education` and `final_score` missing from data
- Column `id` failed check `is_incrementing`
- Column `age` failed check `is_whole_number`
- Column `sex` failed check `function(x) all(x %in% c("M", "F"))`

The error message will combine columns into a single statement if they share the same validation issue. schematic will also automatically report if any columns declared in the schema are missing from the data.

Customizing the message

By default the error message is helpful for developers, but if you need to communicate the schema mismatch to a non-technical person they’ll have trouble understanding some or all of the errors. You can customize the output of each rule by inputting the rule as a named argument.

Let’s fix up the previous example to make the messages more understandable.

my_helpful_schema <- schema(
  "values are increasing" = id ~ is_incrementing,
  "values are all distinct" = id ~ is_all_distinct,
  "is a string" = c(name, sex) ~ is.character,
  "is a string with specific levels" = education ~ is.factor,
  "is a whole number (no decimals)" = c(id, age) ~ is_whole_number,
  "has only entries 'F' or 'M'" = sex ~ function(x) all(x %in% c("M", "F")),
  "includes only TRUE or FALSE" = starts_with("q_") ~ is.logical,
  "is a number" = final_score ~ is.numeric
)

check_schema(
  data = survey_data,
  schema = my_helpful_schema
)

Error in `check_schema()`:
! Schema Error:
- Columns `education` and `final_score` missing from data
- Column `id` failed check `values are increasing`
- Column `age` failed check `is a whole number (no decimals)`
- Column `sex` failed check `has only entries 'F' or 'M'`

And that’s really all there is to it. schematic also comes with a few handy predicate functions, such as is_whole_number(), a more permissive version of is.integer() that accepts columns stored as numeric (double) but still requires values with no decimal part.
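The difference from is.integer() can be seen with base R alone. The helper looks_whole below is a hypothetical approximation written for this illustration, not schematic’s actual implementation of is_whole_number(); in particular, ignoring missing values is an assumption.

```r
# Numbers typed without the L suffix are stored as double, so is.integer()
# rejects them even though they have no decimal part.
ids <- c(1, 2, 3)
is.integer(ids)   # FALSE
is.integer(1:3)   # TRUE

# Hypothetical stand-in for a "whole number" check (NOT schematic's code):
# accept any numeric column whose non-missing values have no fractional part.
looks_whole <- function(x) {
  is.numeric(x) && all(x == floor(x), na.rm = TRUE)
}

looks_whole(ids)                 # TRUE
looks_whole(c(19.2, 10, 22.5))   # FALSE
```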

Moreover, schematic includes a handful of modifiers that allow you to change the behavior of some predicates, for instance, allowing NAs with mod_nullable():

# Before using `mod_nullable()` this rule triggered an error
my_schema <- schema(
  "all values are increasing (except empty values)" = id ~ mod_nullable(is_incrementing)
)

check_schema(
  data = survey_data,
  schema = my_schema
)
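Conceptually, a modifier of this kind takes a predicate and returns a new predicate. The base R sketch below illustrates that idea with two hypothetical helpers, strictly_increasing and allow_missing; it is not how schematic implements is_incrementing or mod_nullable().

```r
# Hypothetical predicate: TRUE when there are no missing values and every
# value is larger than the one before it.
strictly_increasing <- function(x) {
  !anyNA(x) && all(diff(x) > 0)
}

# Hypothetical modifier: wraps a predicate so that missing values are dropped
# before the predicate runs.
allow_missing <- function(predicate) {
  function(x) predicate(x[!is.na(x)])
}

id <- c(1:3, NA, 5)
strictly_increasing(id)                  # FALSE: the NA fails the check
allow_missing(strictly_increasing)(id)   # TRUE: 1, 2, 3, 5 is increasing
```

Because the modifier returns an ordinary function, the wrapped predicate can be used anywhere the original one could.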

Conclusion

In the end, my hope is to make schematic as simple as possible and help both developers and users. It’s a package I designed initially with the sole intention of saving myself from writing validation code that takes up 80% of the actual codebase.¹ I hope you find it useful too.

Notes

This post was created using R version 4.5.0 (2025-04-11) and schematic version 0.1.0.

Footnotes

¹ Not an exaggeration. I have a Plumber API that allows users to POST data to be processed. 80% of that Plumber code is to validate the incoming data.

Long-term Implications and Possible Future Developments of the Schematic Package

The schematic package, now available on CRAN, is built to communicate data validation problems to non-technical users. Its design points to several long-term implications and possible future developments.

Impact on Developers and Non-technical Users

With schematic, developers can use tidyselect selectors and other conveniences to compare incoming data against a schema. The payoff is avoiding significant issues traceable to invalid or poor-quality data. Simple, non-technical error messages are one of schematic’s standout features: they let users understand and fix problems without diving into technical details.

For example, customizing the message for each rule lets the developer describe validation issues in language that end users are comfortable with. This is a significant step toward closing the communication gap between coders and non-coders.

Future Directions for Schematic

The creator of schematic has stated that the goal is to keep the package as simple as possible. schematic is designed, first and foremost, to be an easy-to-use tool, free of unnecessary complexity. Future updates, then, might focus on usability rather than on adding new features.

The creator acknowledges that schematic shares many features with other data validation packages. However, its combination of a lightweight structure, a user-focused but developer-friendly design, and easy integration into applications makes it stand out. Future work might therefore focus on strengthening these unique selling points rather than reinventing the wheel.

The creator also mentioned the burden of writing validation code that can occupy a major share of a codebase. Since schematic was built to relieve that burden, the package may continue to evolve toward making developers’ lives easier, helping them focus on core logic rather than repetitive validation code.

Actionable Advice

Given its utility and the intention behind its creation, developers should consider exploring the schematic package. It is lightweight and easy to integrate, making it a good fit for Shiny apps or APIs that require users to upload data.

Using schematic can substantially reduce the time spent writing validation code and, more importantly, make communication with non-technical users more effective. By translating complex rules into simple messages, developers can reduce bottlenecks and confusion in their workflows.

Developers should also follow the updates on schematic. The creator is committed to its simplicity, and therefore it’s likely that future iterations of the package will be even more streamlined.

To Sum it Up

The arrival of schematic adds a useful option to the data validation landscape in R. For developers, it is a tool worth exploring. While its primary aim is simplification and saving time on validation, the underlying benefit it brings is bridging the communication gap between technical and non-technical users.

Going forward, we may see further efforts to make the tool even easier to use, providing an even greater resource for the R community and beyond.
