A walk in the tidyverse

Microsoft

Sep 02, 2022

If you have worked with R to explore a dataset and build a report from the analysis, you have probably heard of the tidyverse. If you have used R in a data science project to fit a predictive model that makes the most accurate predictions possible on new data, you have probably experimented with tidymodels. However, questions like “what is the tidyverse?” or “how does the tidymodels framework fit into it?” might still have no clear answer for you.

What is the tidyverse

The simplest way to describe the tidyverse is as a collection of R packages, sharing common ideas and norms, designed to make data science tasks easy and quick. But I don’t like this definition much because it is too simplistic. The tidyverse is more than that: it’s a philosophy, or a lifestyle, embodied in a collection of packages.

The tidyverse project, which dates back to 2016, has the ambitious mission of facilitating the conversation about data between humans and computers. This means that its structure and grammar aim to be as consistent and readable as possible, giving users a step-by-step learning path in which every step (or package) makes the next one easier to learn.

Tidymodels, like the tidyverse, builds on the core R language and applies tidyverse principles to statistical modeling and machine learning.

The tidyverse manifesto

The tidyverse philosophy rests on three main principles:

Design for Humans

Many R users are data analysts and data scientists rather than software developers. That’s one of the main reasons behind this principle, which implies not only providing clear documentation and training, but also making the software itself intuitive and self-explanatory.

To achieve this goal, packages and functions need a friendly user interface and descriptive names. The tidyverse approach relies on a grammar of verbs and nouns. For example, to keep only the students named “Jenny” from a dataset of students, you would use:

filter(df_students, Name == "Jenny")

And to remove the rows that contain missing (NA, “Not Available”) values from your dataset, you would write:

students <- df_students %>%
  drop_na()
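The two calls above assume that a df_students table already exists. As a minimal, self-contained sketch, the following builds a small made-up table (the rows and values are hypothetical; only the column names follow this post) and applies both verbs to it:

```r
library(dplyr)  # filter(), %>%, and the re-exported tibble()
library(tidyr)  # drop_na()

# Hypothetical sample data, invented for illustration only.
df_students <- tibble(
  Name       = c("Jenny", "Dan", "Rosie", "Ethan"),
  StudyHours = c(10, 11.5, NA, 6),
  Grade      = c(50, 50, 47, NA)
)

# Keep only the rows where Name equals "Jenny".
jenny <- filter(df_students, Name == "Jenny")

# Drop every row that contains at least one NA value.
students <- df_students %>%
  drop_na()

print(jenny)
print(students)
```

In this sample, Rosie and Ethan each have one missing value, so drop_na() keeps only the rows for Jenny and Dan.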

Reuse Existing Data Structures

Whenever possible, functions in the tidyverse reuse existing data structures rather than creating custom ones. In particular, the tibble is the preferred data structure for many tidyverse packages. A tibble is a modern reimagining of the data frame that is “lazy” (it doesn’t change variable names or types, and doesn’t do partial matching) and “surly” (it complains more, for example when you refer to a column that does not exist, and it avoids common R surprises such as silently dropping dimensions). When the data in a tibble is tidy, each variable is in a column and each observation is in a row.
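To make these differences concrete, here is a small sketch comparing a base data frame with a tibble. The column names and values are made up for illustration:

```r
library(tibble)

# data.frame() rewrites the non-syntactic column name; tibble() keeps it as written.
df  <- data.frame(`study hours` = c(10, 6), grade = c(50, 35))
tbl <- tibble(`study hours` = c(10, 6), grade = c(50, 35))

names(df)   # "study.hours" "grade"
names(tbl)  # "study hours" "grade"

# Selecting one column with [ : the data frame drops to a plain vector,
# while the tibble stays a tibble.
is.data.frame(df[, "grade"])  # FALSE
is_tibble(tbl[, "grade"])     # TRUE

# Partial matching with $ : the data frame silently matches "gr" to "grade",
# while the tibble returns NULL and warns about the unknown column.
df$gr
tbl$gr
```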

Some other packages work at a lower level, focusing on a single type of variable: for example, stringr for strings, lubridate for dates and times, and forcats for factors.
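As a rough illustration, one or two functions from each of these packages are shown below. The input values are invented for the example:

```r
library(stringr)
library(lubridate)
library(forcats)

# stringr: consistent str_* verbs for character vectors.
str_to_upper("jenny")                # "JENNY"
str_detect(c("Jenny", "Dan"), "en")  # TRUE FALSE

# lubridate: parse a date and extract its components.
exam_date <- ymd("2022-09-02")
year(exam_date)                      # 2022
month(exam_date)                     # 9

# forcats: reorder factor levels explicitly.
result <- factor(c("pass", "fail", "pass"))
levels(result)                       # "fail" "pass" (alphabetical by default)
levels(fct_relevel(result, "pass"))  # "pass" "fail"
```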

Illustration of a tibble from an artwork by allisonhorst

Design for the Pipe and Functional Programming

This principle relies on the magrittr pipe operator (%>%), which lets you chain together a set of R functions that are executed in a logical sequence.

For example, if you wish to compute the mean study time (in terms of hours) and the mean grade for the students of a class who passed or failed a certain course, you can use the following syntax:

students %>% 
  group_by(Pass) %>% 
  summarise(mean_study = mean(StudyHours), mean_grade = mean(Grade))

where the column “Pass” is a logical (Boolean) value indicating whether the student passed the course.
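To see what the pipe actually does, the sketch below runs the same summary twice on a small made-up table (the rows are hypothetical): once with %>% and once as nested function calls. The pipe passes the left-hand result as the first argument of the next function, so both forms should give the same output:

```r
library(dplyr)

# Hypothetical sample data, invented for illustration only.
students <- tibble(
  Name       = c("Jenny", "Dan", "Rosie", "Ethan"),
  StudyHours = c(4, 8, 12, 16),
  Grade      = c(30, 50, 70, 90),
  Pass       = c(FALSE, FALSE, TRUE, TRUE)
)

# Piped version: read top to bottom, one step per line.
piped <- students %>%
  group_by(Pass) %>%
  summarise(mean_study = mean(StudyHours), mean_grade = mean(Grade))

# Equivalent nested version: read inside out.
nested <- summarise(
  group_by(students, Pass),
  mean_study = mean(StudyHours),
  mean_grade = mean(Grade)
)

print(piped)
all.equal(piped, nested)
```

The piped form reads in the order the steps happen, which is the main argument for designing functions with the data as their first argument.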

When it comes to data visualization with ggplot2, a similar paradigm applies: you build an elegant graph by combining independent components of a graphic, step by step, with the + operator.

For example, if you wish to visualize a simple bar chart of the students’ grades you can use the following code:

ggplot(data = df_students) +
  geom_col(mapping = aes(x = Name, y = Grade)) +
  ggtitle("Student Grades")

where the ggplot() function initializes a graphic, geom_col() adds a layer of bars whose heights correspond to the variable mapped to y in the mapping argument, and ggtitle() adds a title to the chart.
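Because each + adds one component to a plot object, the same chart can also be built one step at a time. The sketch below does this with a small made-up df_students table (the data is hypothetical):

```r
library(ggplot2)

# Hypothetical sample data, invented for illustration only.
df_students <- data.frame(
  Name  = c("Jenny", "Dan", "Rosie"),
  Grade = c(50, 47, 97)
)

# Step 1: an empty plot bound to the data.
p <- ggplot(data = df_students)

# Step 2: add a layer of bars, one per student, with height given by Grade.
p <- p + geom_col(mapping = aes(x = Name, y = Grade))

# Step 3: add the title.
p <- p + ggtitle("Student Grades")

# Nothing is drawn until the plot object is printed.
print(p)
```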

The tidyverse approach also embraces the functional programming nature of R. This includes, for example, the use of immutable objects and copy-on-modify semantics, together with tools that abstract over for loops (e.g. the apply() or map() families of functions).
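A short sketch of both ideas, using made-up values: the first part computes the same means with a for loop, with base sapply(), and with purrr's map_dbl(); the second part shows that modifying an argument inside a function leaves the caller's object untouched. The add_bonus() helper is hypothetical and exists only for this example:

```r
library(purrr)

# Hypothetical scores per subject, invented for illustration only.
scores <- list(math = c(50, 70), art = c(90, 100), music = c(30, 40))

# Explicit for loop: allocate the result, then fill it element by element.
means_loop <- numeric(length(scores))
for (i in seq_along(scores)) {
  means_loop[i] <- mean(scores[[i]])
}

# The same computation with the loop abstracted away.
means_sapply <- sapply(scores, mean)   # base R apply family
means_map    <- map_dbl(scores, mean)  # purrr, always returns a double vector

# Copy-on-modify: the function works on its own copy of x.
add_bonus <- function(x) {
  x[1] <- 100
  x
}
original <- c(1, 2, 3)
modified <- add_bonus(original)

original  # still 1 2 3
modified  # 100 2 3
```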

Takeaways

The specific goal of the tidyverse project is to give R users a uniform interface, built on packages and functions that work together naturally, and to close the gap between humans and machines when working with data and statistical models.

To better understand this in practice, you could visualize the tidyverse and the tidymodels frameworks in a wider context, like the MLOps cycle below:

where tidyverse enhances the process of understanding and preparing the data for further processing, while tidymodels is an important tool to train and evaluate machine learning models.
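As a rough sketch of the train-and-evaluate part of that cycle, the example below fits a linear regression with tidymodels on simulated data. The data, the choice of model, and the split proportion are all assumptions made for illustration, not a recommended workflow:

```r
library(tidymodels)  # attaches rsample, parsnip, yardstick, dplyr, tibble, ...

set.seed(42)

# Simulated data, invented for illustration: grades grow with study hours plus noise.
students <- tibble(
  StudyHours = c(2, 4, 5, 6, 8, 9, 10, 12, 13, 14, 15, 16),
  Grade      = 10 + 5 * StudyHours + rnorm(12, sd = 2)
)

# Split the data into training and test sets.
split <- initial_split(students, prop = 0.75)
train <- training(split)
test  <- testing(split)

# Specify the model, choose the engine, and fit it on the training set.
model <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Grade ~ StudyHours, data = train)

# Predict on the test set and compute the root mean squared error.
predictions <- predict(model, new_data = test) %>%
  bind_cols(test)

rmse(predictions, truth = Grade, estimate = .pred)
```

The model specification, the fitting step, and the evaluation step are chained with the same pipe and return tibbles, which is how tidymodels applies the tidyverse principles to modeling.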

Last but not least, the tidyverse is also the community of data scientists and R developers who use this collection of packages.

Learn more

If you enjoyed this overview of the tidyverse and you are eager to learn more about the founding principles and the philosophy behind it, I strongly recommend reading Tidy Modeling with R by Max Kuhn and Julia Silge.

If you wish to know more about how to use tidymodels and tidyverse functions to perform exploratory data analysis and build machine learning models, and you wish to practice with hands-on exercises in a pre-configured environment, you should check out the Create machine learning models with R and tidymodels learning path on MS Learn.

Finally, if you want guidance from subject matter experts on your learning journey with R and the tidyverse, you can watch (live or on-demand) the TidyFridays Learn Live series.

Updated Sep 02, 2022

Version 1.0


Educator Developer Blog