R extensions and the tidyverse

The basic features of R are incredibly useful for someone coming from a non-R background, but experienced R users have sought faster and more effective data manipulation and improved plotting abilities, as well as multimedia options.

Hadley Wickham’s contribution to the R universe includes his ‘tidyverse’ concept and the associated set of packages.  These packages contain R functions designed around a common data philosophy and graphical grammar.  The tidyverse is a framework built on data concepts such as ‘tidy data’ and ‘tibbles’ (extensions of R’s native data.frame).  ‘Tidy data’, as defined by Hadley, is the same basic rectangular form that underlies R’s data frames:

“Tidy data is data where:

  1. Each variable is in a column.
  2. Each observation is a row.
  3. Each value is a cell.”

The tidyverse packages can all be installed collectively from within RStudio with:
install.packages("tidyverse")
or you can install each one separately.  Once the tidyverse is installed, you can use the R command:
library("tidyverse")
to attach the core tidyverse packages (such as ggplot2, dplyr, tidyr, readr and tibble) in one step.

Hadley explained these concepts in his books on R available online, including R for Data Science, R Packages and Advanced R.

It is useful to know the history and naming conventions of the tidyverse packages.  Their function names separate words with an underscore instead of the dot that Base R uses.  For this reason, the tidyverse has data_frame instead of data.frame (data_frame has since been deprecated in favour of tibble()), read_csv instead of read.csv, write_csv instead of write.csv, etc.  reshape2 reworked Hadley’s earlier reshape package, itself an alternative to Base R’s reshape.  ‘dplyr’ was the 2014 evolution of an earlier package called plyr, also by Hadley.
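The naming difference is easiest to see side by side. The following is a minimal sketch, assuming the readr package is installed; the data frame `scores` and the temporary file `csv_path` are made up for illustration.

```r
library(readr)

# A small example data frame (made-up values, for illustration only)
scores <- data.frame(name = c("a", "b", "c"), score = c(1.5, 2.5, 3.5))

# Hypothetical temporary file used only for this round trip
csv_path <- tempfile(fileext = ".csv")

# Base R: dot-separated function names
write.csv(scores, csv_path, row.names = FALSE)
base_result <- read.csv(csv_path)

# readr (tidyverse): underscore-separated function names
write_csv(scores, csv_path)
tidy_result <- suppressMessages(read_csv(csv_path))

class(base_result)               # a plain data.frame
inherits(tidy_result, "tbl_df")  # read_csv returns a tibble
```

Both calls read the same file; the visible difference is that read_csv hands back a tibble rather than a plain data.frame.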

1. dplyr

dplyr provides another abstraction layer for data, introducing new key verbs such as mutate, filter and summarise, and the key adverb “by group” (the group_by function).  See Hadley’s own take on this.

Grouping and aggregating with R’s native functions (such as by and aggregate) was a common task that took several steps to implement; dplyr provides brevity in that context.  The trade-off is that a dplyr pipeline often runs over more than one line.  There is some discussion online about the pros and cons of dplyr v data.table.
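As a rough illustration of the verbs mentioned above, here is a minimal sketch assuming dplyr is installed. The `sales` data frame and its column names are invented for the example; the base R line uses aggregate for comparison.

```r
library(dplyr)

# Made-up example data
sales <- data.frame(
  region = c("north", "north", "south", "south"),
  units  = c(10, 20, 5, 15)
)

# Base R: total units per region with aggregate()
base_totals <- aggregate(units ~ region, data = sales, FUN = sum)

# dplyr: group_by + summarise, then mutate to add each region's share
dplyr_totals <- sales %>%
  group_by(region) %>%
  summarise(total_units = sum(units)) %>%
  mutate(share = total_units / sum(total_units))

# filter keeps only the rows that satisfy a condition
big_regions <- filter(dplyr_totals, total_units > 25)
```

Each verb does one job, and the pipe (%>%) chains them, which is why a dplyr pipeline tends to spread over several short lines.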

2. tibble

tibble is the name of a package, but it is also the name of a data object/concept that is central to the tidyverse.  In scripts it appears as the class ‘tbl_df’ (tibble data frame).  There is online PDF documentation.

Tibbles are an alternative to the data.table.  Tibbles extend R’s native data frames, not data.tables (which come from a separate author).  Tibbles, in fact, form the central data type of the ‘tidyverse’.  Data frames, lists and vectors can be converted to tibbles with as_tibble(objectname).

Note that even though a tibble can be the output of the ‘tidyverse’ command ‘data_frame’, it is NOT the basic R data.frame.
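The distinction can be checked directly. This is a minimal sketch assuming the tibble package is installed; `base_df` is a made-up example object.

```r
library(tibble)

# A plain base R data frame (made-up values)
base_df <- data.frame(id = 1:3, label = c("x", "y", "z"))

# Convert it to a tibble
tb <- as_tibble(base_df)

class(base_df)  # "data.frame"
class(tb)       # "tbl_df" "tbl" "data.frame"

# One behavioural difference: selecting a single column with [ , ]
base_df[, "id"]  # base R drops to a plain vector
tb[, "id"]       # a tibble stays a tibble
```

Because a tibble still carries the data.frame class, most functions that accept a data frame will also accept a tibble, but subsetting and printing behave differently.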

The concept of a modified data frame in the form of a tibble has been explored in several places, including Hadley’s R book, at the CRAN repository and at the tibble module in the tidyverse website.

3. tidyr

For data tidying.  It includes functions like ‘gather’, which collapses multiple columns into a key-value pair according to specification.  The complement is spread, in which a key-value pair is spread across multiple columns.

Complementary pairs of functions:

  gather (unpivot) <-> spread (pivot)
  nest <-> unnest
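The gather/spread pair can be sketched as a round trip. This assumes tidyr is installed; the `wide` data frame and its column names are invented for the example. Recent tidyr versions recommend pivot_longer and pivot_wider for new code, but gather and spread remain available.

```r
library(tidyr)

# Made-up wide data: one column per year
wide <- data.frame(
  country = c("A", "B"),
  y2019   = c(1, 3),
  y2020   = c(2, 4)
)

# gather: collapse the year columns into a key-value pair
long <- gather(wide, key = "year", value = "cases", y2019, y2020)

# spread: the complement, spreading the key-value pair back across columns
wide_again <- spread(long, key = year, value = cases)
```

After gather, each row of `long` holds one country-year observation, which matches the tidy data definition quoted earlier; spread returns the data to its original wide shape.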

On-line documentation.

For an independent view, see Garrett Gman.

4. ggplot2

A ‘grammar of graphics’ inspired alternative to R’s native plot function.
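To give a flavour of the grammar, here is a minimal sketch assuming ggplot2 is installed. It uses the mtcars dataset that ships with R; the choice of columns and labels is only illustrative.

```r
library(ggplot2)

# Build the plot in layers: data + aesthetic mapping + geometry + labels
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

# Rough base R counterpart, for comparison:
# plot(mtcars$wt, mtcars$mpg)

# Printing the object draws the plot
print(p)
```

The plot is an ordinary R object, so further layers can be added to `p` with + before it is drawn.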
