R extensions and the tidyverse

The basic features of R are incredibly useful for someone coming from a non-R background, but experienced R users have sought faster and more effective data manipulation, improved plotting abilities and multimedia options.

Hadley Wickham’s contribution to the R universe includes his ‘tidyverse’ concept and the associated set of packages.  These packages contain R functions designed around a common data philosophy and graphical grammar.  The tidyverse is built on data concepts like ‘tidy data’ and ‘tibbles’ (extensions of R’s native data.frame).  The standard for ‘tidy data’, as defined by Hadley, is the basic rectangular form that underlies R:

“Tidy data is data where:

  1. Each variable is in a column.
  2. Each observation is a row.
  3. Each value is a cell.”
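
As a small illustration in base R (the subjects, years and weights are invented for the example), here are the same measurements in tidy and untidy form:

# tidy: each variable is a column, each observation is a row
tidy <- data.frame(subject = c("A", "A", "B", "B"),
                   year    = c(2017, 2018, 2017, 2018),
                   weight  = c(61.2, 63.5, 72.0, 70.4))

# untidy ("wide"): the variable 'year' hides in the column headers
untidy <- data.frame(subject = c("A", "B"),
                     `2017`  = c(61.2, 72.0),
                     `2018`  = c(63.5, 70.4),
                     check.names = FALSE)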

The tidyverse packages can all be installed collectively from within RStudio with:
install.packages("tidyverse")
or you can install each one separately.  Once installed, the single command:
library("tidyverse")
gives you the advantage of making all the core sub-components of that collection available.

Hadley has explained these concepts in his books on R, available online, including R for Data Science, R Packages and Advanced R.

It is useful to know the history and naming conventions of the tidyverse packages.  They use function names with words separated by an underscore instead of the dot that base R uses.  For this reason, the tidyverse has data_frame instead of data.frame, read_csv instead of read.csv, write_csv instead of write.csv, and so on.  reshape2 modified base R’s reshape.  dplyr was the 2014 evolution of an earlier package called plyr, also by Hadley.

1. dplyr

dplyr provides another abstraction layer for data, introducing new key verbs such as mutate, filter and summarise, and the key adverb “by group” (group_by).  See Hadley’s own take on this.

R’s native group-by and aggregate functions took several steps to implement; this package provides brevity in that context, at the cost of commands that run over more than one line.  There has been some discussion of the pros and cons of dplyr versus data.table.
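
A minimal sketch of those verbs chained together, using R’s built-in mtcars data set (the particular filter, grouping and summary are my own illustration):

library(dplyr)

mtcars %>%
  filter(hp > 100) %>%               # keep the higher-powered cars
  mutate(kml = mpg * 0.425144) %>%   # derive km per litre from mpg
  group_by(cyl) %>%                  # the "by group" adverb
  summarise(mean_kml = mean(kml))    # one summary row per group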

2. tibble

tibble is the name of a package, but it is also the name of a data object/concept that is central to the tidyverse.  The class you will see in scripts is ‘tbl_df’ (tibble data frame).  There is online PDF documentation.

Tibbles are an alternative to the data.table.  Tibbles extend R’s native data frames, not data.tables (which come from different authors).  Tibbles, in fact, form the central data type of the tidyverse.  data.frames, lists and vectors can be converted to tibbles with as_tibble(objectname).

Note that even though a tibble can be the output of the tidyverse command data_frame, the result is NOT the basic R data.frame.

The concept of a modified data frame in the form of a tibble has been explored in several places, including Hadley’s R book, at the CRAN repository and at the tibble module in the tidyverse website.
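
A brief sketch of that conversion (the object names are illustrative):

library(tibble)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)   # convert a native data.frame to a tibble

class(tb)             # "tbl_df" "tbl" "data.frame"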

3. tidyr

For data tidying.  It includes functions like gather, which collapses multiple columns into key-value pairs according to a specification.  The complement is spread, in which key-value pairs are spread across multiple columns.  The main complementary pairs are:

gather (unpivot) <-> spread (pivot)
nest <-> unnest
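
A short sketch of the first pair, using a small invented table of stock prices:

library(tidyr)

stocks <- data.frame(time = 1:3,
                     X = c(1.1, 2.0, 3.2),
                     Y = c(0.5, 0.7, 0.9))

# gather (unpivot): collapse the X and Y columns into key-value pairs
long <- gather(stocks, key = "stock", value = "price", X, Y)

# spread (pivot): the complement, back to one column per stock
spread(long, key = "stock", value = "price")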

Online documentation is available.

For an independent view, see Garrett Grolemund.

4. ggplot2

A ‘grammar of graphics’-inspired extension to R’s native plot function.
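
A minimal sketch of the layered grammar, using the built-in mtcars data (the particular mapping of variables to aesthetics is my own choice):

library(ggplot2)

# data + aesthetic mappings + a geometry layer, composed with "+"
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", colour = "Cylinders")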


R extensions and the data.table

One alternative to R’s native data frames is the data.table package, first released to CRAN in 2006.  data.table extends base R’s data.frame.  It was authored principally by Matt Dowle, later joined by Arun Srinivasan.

One of the central features of data.table is that it replaced the focus on data.frame rownames with the concept of a ‘key’, which can be a multi-column set of values used to sort the table and act as its index.  Further details are available in the package documentation.
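
A brief sketch of setting a multi-column key (the table and column names are illustrative):

library(data.table)

DT <- data.table(id = c("a", "a", "b"),
                 year = c(2017, 2018, 2017),
                 value = c(10, 20, 30))

setkey(DT, id, year)   # a two-column key; the table is sorted by it

DT[.("a")]             # fast keyed lookup on the leading key column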

In the R ecosystem, data.table represents a different branch of extensions to R’s basic data.frame object from that of the tidyverse, with its dplyr, readr and tibbles.  The pros and cons have been discussed by their respective authors.  Additionally, the tidyverse’s author suggests data.table is faster for importing data in some cases.  See also r-bloggers.


Prosecutor’s Fallacies and Relative Probabilities

In the United Kingdom some years ago, a mother of two deceased children was convicted, in part, on evidence that the probability of her two children dying through natural means or cot death (her defence) was very low: as low as 1 in 73 million.

The woman was Sally Clark, a solicitor at the time.  The disturbing decision was eventually overturned on appeal, but she never recovered from the trauma and died after her release from prison.

Well-meaning mathematicians have offered three broad opinions on the relevance of the statistical data to the case:

  1. The 1 in 73 million probability was over-stated; it was based on wrong assumptions about the independence of the two children’s deaths, and the true rate of cot death was closer to 1 in 100,000 (see the arithmetic sketched after this list).
  2. The claim that a person is guilty of the charges because the nature of the defence has such a low incidence rate, generally, is an example of the ‘prosecutor’s fallacy’; and
  3. To avoid the prosecutor’s fallacy in such a case, we should look at how relatively likely the prosecution case is compared to the defence’s case.
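
On the first point, the headline figure was reportedly produced by squaring the estimated rate of a single cot death in a family with the Clarks’ profile (about 1 in 8,543), a step that is only valid if the two deaths were independent events:

(1/8,543) x (1/8,543) = 1/72,982,849, or roughly 1 in 73 million

If the two deaths are not independent (for example, because of genetic or environmental factors common to both children), the squaring is invalid and the true probability is far higher.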

See Plus Magazine

The third point relates to proof in legal proceedings and is not something entirely within the domain of mathematics.  

The authors of the Plus Magazine article offered the explanation that Bayes’ Rule would allow the calculation of the relative likelihoods of the kinds of events forming the basis for the prosecution and defence cases.  These relative likelihoods are based on a comparison of how frequently the different kinds of events occur in the general population.
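
In its odds form (a standard statement of Bayes’ Rule, not a formula taken from the article itself), the proposed comparison looks like this, with E standing for the evidence of the two deaths:

P(guilt | E) / P(innocence | E) = [ P(E | guilt) / P(E | innocence) ] x [ P(guilt) / P(innocence) ]

On this view, the prosecutor’s fallacy consists of treating the single term P(E | innocence) as if it were P(innocence | E), while ignoring the other factors in the equation.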

I don’t accept that these relative probabilities are useful metrics, at least not in such a crude form.  The main objection is that in Court cases, we are not simply trying to guess whether we are randomly pulling a ‘criminal’ or an ‘innocent’ person from a general population.  We are not attempting a classification exercise, comparing which of two unknown events would be more likely before we know the outcome.  There’s nothing wrong with the advice to consider how likely it is that other events might have occurred, viewed at the same time and looking forward into the future.  However, I think that such an approach leads us to view the Court’s role in a slightly inaccurate way.

The problem with the third part of the article, namely the use of relative probabilities to answer the prosecutor’s fallacy, is that it continues to ask the wrong questions, and they are:

  1. What is the probability that something might happen, in the future, if we don’t have any further information?  
  2. What is the probability that some other event might happen, in the future, if we don’t have any further information?
  3. Which of these probabilities is the highest?

This merely replaces one line of fallacious reasoning with another.  In a Court case, we do not want to predict the prospect that a randomly chosen person, from the general population, is more likely to have been a criminal than an innocent person.  What if the general population statistics had been different?  That would not have changed the innocence of Ms Clark.   From her perspective, the only historical event that had occurred involved her innocence, with a probability of 100%.

The jury’s role is not to make gambling wagers on innocence based on sampled statistics that have nothing to do with the trajectory of historical events in a specific case.  The jury’s task is to weigh up its own subjective assessment of whether a historical event occurred, based on the evidence that relates to the specific facts and the specific accused.  General statistics for the same kind of offence should never be used as the basis for a conviction, at least not without asking how that helps avoid convicting an innocent person.

A trial is not trying to estimate the occurrence of a random event, but trying to unravel and resolve uncertainties about what occurred in a particular setting, at a particular time, with specific people.

A Court case is more analogous to a situation in which we are attempting to confirm a hypothesis that an event (and, implicitly, the events leading up to it) has already occurred.  We are not trying to predict whether a series of events will occur before they happen.

From time to time, people trying to prove historical facts suddenly become excited by the notion that anything that happens in the world, if you go back to the time when you did not know anything, appears extremely unlikely.  The more we try to predict the future in apparent ignorance of all other relevant information, the more unpredictable it seems.  The sobering insight is that the same is true of all the other possible events.

Python’s adoption of R’s ‘data frame’

Due to the influence of R on python, some of the terminology is similar and could become confusing.  This is avoided by understanding that many of the concepts originated with R and have since been incrementally picked up in the python universe or elsewhere.

Here are some examples.

A ‘data frame’ was introduced to python through the ‘pandas’ library.

Generally, the python libraries follow the lead of R, whose central data structures are known as ‘data frames’, with a two-dimensional structure.

Wes McKinney, a statistician, built a python library he called ‘pandas’ that uses a Series data structure (one-dimensional) and a DataFrame data structure (two-dimensional) for time series processing (originally in a financial context).

Pandas is built on top of the python library NumPy, so those wishing to improve performance, such as in the HPC context, may decide to work with NumPy directly.

McKinney has written a book “Python for Data Analysis” (2nd Ed, 2017).

ggplot (python)

ggplot2, for example, is a package that exists within the R universe and provides the command ggplot() in that context, but its later implementation within the python universe (as a library) is called ‘ggplot’.  ggplot (python) is unashamedly described as ‘a plotting system for Python based on R’s ggplot2 and the Grammar of Graphics’.  To achieve its goals, it also requires the python implementation of a data frame, using the pandas library.

datatable (python)

A python library based on Matt Dowle’s R library ‘data.table’.  The project started in 2017 at H2O.ai, which is not surprising since Matt Dowle works there as a “Hacker” (since 2015).  H2O.ai is a Silicon Valley machine-learning company in Mountain View, California.

Live Coding Overview

Live Coding is a topic raised when software developers talk about the tools they like to use, or might like to use in the future.  The main idea behind live coding is to allow a software developer to see the state of variables and other data in the code as the coding is carried out.  Some even extend the idea to suggest that coding should take place in a ‘playbox’ that allows immediate simulation of results.

The motivation for live coding seems to be to avoid the need to wait for code to compile, or even for a script to run, before obtaining feedback.  It is orientated around programming as a manual, humanistic art.  Those in favour of live coding want to be able to visualise and anticipate the consequences of their code at the point it is written: to obtain the benefits of computing at the stage when the design and its constraints are being set down.

Many people have contributed ideas for how to achieve live coding.  One of them is Bret Victor, an independent designer and former Apple employee/consultant, who has presented many influential talks outlining his principles as a software designer.

One of the questions we might ask about live coding is what current technology helps to illustrate or build the concept.  In some ways it is a bit like Russian dolls: in order to achieve abstraction for software developers, we have to write some lower-level code.  Bret Victor himself has often demonstrated his vision, but not always explained the code in which his utilities were written.  The code snippets in his videos suggest ActionScript (associated with Flash technology) or a scripting language similar to JavaScript.

A specific live coding environment (IDE) is the COLT software, the ‘Code Orchestra Live Coding’ tool.

A similar idea has been implemented in the LightTable code editing environment.  A talk by Chris Granger, one of the developers, in late 2012, suggested that the tool was implemented in ClojureScript and uses the design pattern of a Component Entity System.

R extensions and interactivity

Yihui Xie is a creative software engineer who works at RStudio.  His R packages are innovative and useful.

1. shiny

Shiny enables you to build interactive web applications straight from R.
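
A minimal sketch of a complete app (the slider and histogram are my own illustration):

library(shiny)

ui <- fluidPage(
  sliderInput("n", "Sample size:", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal sample")
  })
}

shinyApp(ui, server)   # launches the app in the browser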

2. knitr and animation

Yihui Xie created a literate programming package for R, called knitr.  knitr enables R code to be embedded in documents of different file types.  Both knitr and the animation package are the subject of a 2013 journal article by Xie.

The book ‘Dynamic Documents with R and knitr‘ is authored by Xie.
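
A minimal sketch of the workflow, assuming a hypothetical source file report.Rmd containing embedded R chunks:

library(knitr)

# knit() evaluates the R chunks embedded in the source document and
# writes the results into a markdown file (report.md)
knit("report.Rmd")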

Another author used knitr and some other packages to create a package for an ‘IPython’-style R notebook, using markdown to produce the HTML.

3. htmlwidgets

This package enables the use of JavaScript libraries in an HTML document generated from R Markdown or from Shiny.  Before this, it was necessary to work with R objects and embed them in script manually; now the htmlwidgets package includes everything with one line of R code.

4. Leaflet and DataTables

There is a discussion video at RStudio in which Xie illustrates the use of htmlwidgets in two further packages, Leaflet and DataTables.
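
A short sketch of each package in use (the map coordinates and the choice of data set are my own illustration):

library(leaflet)
library(DT)

# an interactive map with a single marker
leaflet() %>%
  addTiles() %>%
  addMarkers(lng = 115.86, lat = -31.95, popup = "Perth")

# an interactive, searchable HTML table from a built-in data set
datatable(iris)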

5. xaringan

The xaringan package (with its own self-explanatory xaringan “Presentation Ninja” slideshow) is another of Yihui Xie’s excellent outputs.  It is used to create browser-based slides, and relies on the remark.js library, which Xie recommends you become familiar with before proceeding to use xaringan.  Here’s a link to a recent slideshow by Dr Alison Hill explaining her use of xaringan.

xaringan leverages the remark.js library, which converts markdown into HTML files ready for display in a browser.
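
A sketch of one way to begin, assuming xaringan ships an R Markdown template named “xaringan” (the file name slides.Rmd is illustrative):

# copy the package's slide template into a new file, without opening an editor
rmarkdown::draft("slides.Rmd", template = "xaringan", package = "xaringan", edit = FALSE)

# serve the slides in the browser with live reload while you edit
xaringan::infinite_moon_reader("slides.Rmd")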

6. slidex

Daniel Anderson has recently written a neat package called slidex that will convert pptx (PowerPoint) slides into browser-ready HTML slides, by extracting the content and then using R’s xaringan/remark.js to complete the build of the slideshow.
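
A sketch of the conversion; I am assuming the package’s converter is a function called convert_pptx(), and the file name is illustrative:

library(slidex)

# extract the pptx content and write a xaringan-ready R Markdown file
convert_pptx("talk.pptx")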

R extensions: Grammar of Graphics, tidyverse and ggplot2

The Grammar of Graphics (the ‘gg’ in ‘ggplot2’)

Hadley Wickham is the author of several popular R packages and an employee of RStudio.  In developing his graphical plotting package for R, called ggplot2, he applied many of the ideas in Leland Wilkinson’s influential Grammar of Graphics concept and book.  His 2010 article “A Layered Grammar of Graphics” explains ggplot2 and how it uses the Grammar of Graphics idea for R.  Hadley has also written his own ggplot2 book.

ggplot2 is one of several packages found in the set of R extensions known as the tidyverse.


Data science literacy

The data life cycle and software

One of the requirements to work professionally with data is not only mathematical knowledge, but familiarity with data processing hardware and software.  Software packages can be useful through many stages of the life cycle, from data acquisition and cleaning, through to formatting and eventually, data visualisation.

Data-focussed scientists are often encouraged to be comfortable working with either R or python, though many will start with R for their statistical needs. Python is a scripted programming language that has built up many statistical and visualisation libraries, offering some alternatives to working with R exclusively. Some cautions about using python have been raised.

See my separate post on how python is moving toward R.

Data science training and assistance

UWA’s HackyHour sessions are all about the programming needs of scientists.

A community of experienced scientists and computer scientists is now involved in training other scientists to work with data effectively, and to develop basic programming skills quickly and effectively. Locally, training efforts are being made in HackyHours and at the Pawsey Centre in Perth, where resources for High Performance Computing (HPC) are available. There have also been Research Bazaars that combine Software Carpentry and other digital literacy sessions. The HackyHour sessions are in the spirit of the Software Carpentry training that was developed by Greg Wilson while working at Mozilla.

Books

There are many introductory texts available for R and python.

e.g. An introduction to Data Cleaning with R

Computation and Performance

1. In general

One of the considerations when data files reach a certain size will be the time it takes to process the information, and the performance of the computing hardware and software that is being used.   I’ll have more details, for a range of use cases, on this website in the future.

Some of the considerations are the choice of software data structures and the opportunity for data-orientated processing (e.g. parallel computation).  In the context of R or python, some scientists explicitly compare the performance of packages or libraries against others, for similar data processing activities, on their local hardware.

2.  High Performance Computing (HPC)

The availability of supercomputing facilities increases the need for scientists to become familiar with all aspects of the data processing life cycle, from the way in which they store their data, to the software manipulation of the data, to the parallel computing environment in which data can be processed on several cores simultaneously.

Locally, the Pawsey Centre is trying to address the deficit of knowledge in all these areas, as well as the workflow from the scientist through to the supercomputing centre.  At the programming level, it is encouraging the use of python libraries like numpy and scipy, with group training conducted using a JupyterHub and Jupyter Notebooks.

A further consideration for HPC will be universal data formats like HDF5, and in time, the provision of docker-based environments for providing the software to be used with the HPC hardware.


Data science software

This is a working list of software that is useful for data science and visualisation projects.  I’ll have a separate page for useful data libraries.

R

Some useful R packages from the Institute of Bioinformatics, Johannes Kepler Uni, Linz are available at http://www.bioinf.jku.at/software/.

RStudio

Jupyter Notebooks (a dynamic/interactive programming environment for python, R and other languages).   It works by providing that environment in notebooks, which are HTML pages empowered to handle scripts and an input-output flow for data.  It can also be used in a shared way, by setting up a JupyterHub to deliver the same notebook document to many users.   There is a gallery of interesting Jupyter notebooks, including one that is an introduction to python.

If you wish to work with Jupyter Notebooks and python in an integrated environment, one option is to use the Anaconda data science distribution, which features a recent version of python and most of the useful data science libraries for analytics (numpy, scipy and pandas), visualisation (matplotlib) and machine learning (theano, tensorflow).   There are some additional setup suggestions at Yale.

Static web pages in the form of HTML files can be created from python (including from IPython notebooks) using the Sphinx program, originally written for the python documentation.

BORIS (Behavioural Observation Research Interactive Software).

I discovered BORIS while talking to Erika at HackyHour in June 2018. Ethology and ethograms are very practical scientific concepts for a range of topics: something I’ve unknowingly been doing as part of my cricket scoring apps!


R software

R is a software environment for statistical computing and graphics (https://www.r-project.org/). It is well established, and scripting languages like python easily interface with its data structures (in fact, the python library pandas reproduces many of them), but R has its own universe of ideas and extensions as well.

One related application is RStudio (www.rstudio.com), which is intended to provide a graphical interface to R. R is in turn extended with packages like Shiny (https://shiny.rstudio.com/), which make it easier to build interactive web apps straight from RStudio.

R has its own scripting language, which is not python (clearly), but python can work with R data structures nonetheless.

The central data structure within R (the ‘data frame’) is a rectangular (row, column) structure that forms the basis for many data applications.