Data science literacy

The data life cycle and software

Working professionally with data requires not only mathematical knowledge but also familiarity with data-processing hardware and software. Software packages can be useful through many stages of the life cycle, from data acquisition and cleaning, through to formatting and, eventually, data visualisation.
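As a quick illustration of those stages, here is a minimal Python sketch, assuming the pandas and matplotlib libraries are installed; the file name and column names are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

# Acquisition: read raw data from a file (or a URL, database, instrument, ...)
raw = pd.read_csv("observations.csv")

# Cleaning: drop incomplete rows and normalise an awkward column name
clean = raw.dropna().rename(columns={"Temp (C)": "temp_c"})

# Formatting: derive a tidy column for analysis
clean["temp_k"] = clean["temp_c"] + 273.15

# Visualisation: a quick look at the distribution
clean["temp_k"].plot.hist(bins=30)
plt.xlabel("Temperature (K)")
plt.show()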

Data-focussed scientists are often encouraged to be comfortable working with either R or Python, though many will start with R for their statistical needs. Python is a scripting language that has built up many statistical and visualisation libraries, offering alternative ways of working with data to using R exclusively. Some cautions about using Python have also been raised.
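For a flavour of the Python side, here is a small sketch using the statsmodels library, whose formula interface deliberately mirrors R's lm(y ~ x); the data are simulated purely for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data, for illustration only
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=100)})
df["y"] = 2.0 * df["x"] + rng.normal(size=100)

# R-style formula interface: the Python analogue of lm(y ~ x)
model = smf.ols("y ~ x", data=df).fit()
print(model.summary())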

See my separate post on how Python is moving toward R.

Data science training and assistance

UWA’s HackyHour sessions are all about the programming needs of scientists.

A community of experienced scientists and computer scientists is now involved in training other scientists to work with data effectively and to develop basic programming skills quickly. Locally, training efforts are being made at HackyHours and at the Pawsey Centre in Perth, where resources for High Performance Computing (HPC) are available. There have also been Research Bazaars, which combine Software Carpentry with other digital literacy sessions. The HackyHour sessions are in the spirit of the Software Carpentry training that was developed by Greg Wilson whilst working at Mozilla.

Books

There are many introductory texts available for R and Python.

e.g. An introduction to Data Cleaning with R

Computation and Performance

1. In general

Once data files reach a certain size, one of the key considerations will be the time it takes to process the information, and the performance of the computing hardware and software being used. I'll have more details, for a range of use cases, on this website in the future.

Among the considerations are the choice of data structures in software and the opportunity for data-orientated processing (e.g. parallel computation). In the context of R or Python, some scientists explicitly compare the performance of packages or libraries against one another, for similar data-processing activities, on their local hardware.
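As a minimal sketch of that kind of local benchmarking, the standard-library timeit module can time a plain Python loop against NumPy's vectorised equivalent; the array size and repeat count here are arbitrary:

import timeit
import numpy as np

values = list(range(1_000_000))
array = np.arange(1_000_000)

# Pure-Python summation versus NumPy's vectorised, data-orientated sum
loop_time = timeit.timeit(lambda: sum(values), number=10)
numpy_time = timeit.timeit(lambda: array.sum(), number=10)

print(f"builtin sum: {loop_time:.3f} s  numpy sum: {numpy_time:.3f} s")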

2. High Performance Computing (HPC)

The availability of supercomputing facilities increases the need for scientists to become familiar with all aspects of the data-processing life cycle, from the way in which they store their data, to the software manipulation of that data, to the parallel computing environment in which data can be processed on several cores simultaneously.
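As a toy example of spreading work across cores, Python's standard multiprocessing module can map a function over several worker processes; the function and task sizes below are stand-ins for a real CPU-bound computation:

from multiprocessing import Pool

def expensive(n):
    # Stand-in for an independent, CPU-bound task
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    # Spread eight tasks across four worker processes (i.e. several cores)
    with Pool(processes=4) as pool:
        results = pool.map(expensive, [10_000_000] * 8)
    print(results[:2])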

Locally, the Pawsey Centre is working to address the deficit of knowledge in all these areas, as well as the workflow from the scientist through to the performance of the supercomputing centre. At the programming level, it is encouraging the use of Python libraries like NumPy and SciPy, with group training conducted using a Jupyter notebook hub.
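Here is a small example of the kind of NumPy/SciPy workflow such training covers, using simulated data (this is my own sketch, not Pawsey's material):

import numpy as np
from scipy import stats

# Two simulated samples with slightly different means
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=500)
b = rng.normal(loc=0.2, scale=1.0, size=500)

# Two-sample t-test from scipy.stats
t_stat, p_value = stats.ttest_ind(a, b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")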

A further consideration for HPC will be universal data formats like HDF5 and, in time, the provision of Docker-based environments for supplying the software to be used with the HPC hardware.
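As a sketch of working with HDF5 from Python, assuming the h5py library is installed (the file and dataset names here are hypothetical):

import h5py
import numpy as np

data = np.random.rand(1000, 3)

# Write a dataset, with an attribute, to a portable HDF5 file
with h5py.File("results.h5", "w") as f:
    dset = f.create_dataset("measurements", data=data, compression="gzip")
    dset.attrs["units"] = "metres"

# Read it back: the same file is readable from C, Fortran, R, Julia, ...
with h5py.File("results.h5", "r") as f:
    print(f["measurements"].shape, f["measurements"].attrs["units"])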

 
