Ode to the joy of R data packages

I have been meaning to get into writing my first R package for some time and was fortunate enough to be guided through the process at a Tidy Tools workshop run by Hadley Wickham (which I thoroughly recommend if anyone gets the chance). This got me (somewhat overly) excited about the workflow possibilities and I hunted out nails for my new-found hammer. It has revolutionised the way I read in and clean data in preparation for analysis, which I will now evangelise.

I will not go through the technical details of package development here; if you want to read about that, I found Hilary Parker’s post really helpful for getting started (supplemented by the fantastic usethis package), along with Erik Howard’s invaluable post and Dave Kleinschmidt’s for applying this to data storage.

Loading and cleaning data in the Dark Ages, before the R Package workflow

Previously, in my unchecked desire to get started with the ‘real’ analysis, I would get the data in as quickly and dirtily as possible: the first few lines of an R script would load the data, followed by several lines of code to clean it, piece by piece. Functions would grow to clear things up, and if I was feeling particularly structured I would create a separate R script for loading the data into the environment and source this from other scripts and markdown documents when needed, sometimes with another script for the cleaning functions. I thought this was pretty organised and efficient, albeit sometimes a laborious task.

I shudder at the naivety of my former self.

New and improved data loading – cleaning workflow, with R packages

Now, in my R packaged induced enlightened state (surely temporary, until I find a better way) I have the following workflow:

  1. “Sketch” the data; maybe some quick plots but usually using head(df) and plenty of unique(df$var)
  2. Identify the data cleaning tasks and write unit tests for these jobs.
  3. Write functions that will (hopefully / eventually) pass the unit tests.
  4. Build / test package; repeat steps 2 – 4.
  5. Run cleaning functions on data from data-raw folder, exporting using use_data(df) and documenting where desired.
  6. Load into analysis with library(rdatapackage)
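As a rough sketch of what steps 5 and 6 can look like in practice (the file path and the clean_raw_data() wrapper below are placeholders, not from a real project):

# data-raw/df.R -- prepare and export the cleaned data (step 5)
library(readr)
raw <- read_csv("data-raw/raw-data.csv")   # placeholder path
df  <- clean_raw_data(raw)                 # hypothetical wrapper around the package's cleaning functions
usethis::use_data(df, overwrite = TRUE)

# step 6, from any analysis script or Rmd:
library(rdatapackage)
head(df)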

Now this is certainly not foolproof – I still sometimes find myself part way through an analysis and needing to head back to do some cleaning – but my transition from wrangling to analysis has been much smoother since implementing this workflow.

But why use a package for data?

Indeed, why not just write some well named and structured cleaning functions as part of the analysis? I have found a few unexpected (to me, at least) gems in using an R package for this kind of task:

RStudio shortcuts + usethis

The keyboard shortcuts (such as Ctrl-Shift-T for testing a package) combined with the usethis package really speed up the process, and both seem purpose-built to optimise the package development workflow.
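For a brand new package, a minimal set of usethis calls covers most of the setup (the names passed in below are just examples):

usethis::create_package("rdatapackage")   # package skeleton
usethis::use_testthat()                   # set up the tests/ folder
usethis::use_r("clean_data")              # R/clean_data.R for the cleaning functions
usethis::use_test("clean_data")           # the matching test file
usethis::use_data_raw("df")               # data-raw/ script for preparing the raw data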

Unit tests

These help gamify the whole process – I find it really satisfying achieving each task in turn. They also ensure that, while solving some more complicated problem, you do not unknowingly break an earlier cleaning task you thought was solved.

I also find that writing the code for unit tests is fairly easy, and a natural place to start. For example, in a recent project I had the following entries in a variable:

resource/x-bb-lesson, resource/x-bb-document, ...

Clearly this would read nicer as ‘lesson’ and ‘document’, so writing the unit test was easy enough:

test_that("clean_data_var works", {
  expect_equal(clean_data_var("resource/x-bb-lesson"), "lesson")
  expect_equal(clean_data_var("resource/x-bb-document"), "document")
})

At this stage I have not yet written the function clean_data_var, but I have clearly defined what I need it to do. Now, regex is a powerful tool but I rarely get expressions right first time, and I have found having the testing built in and only a quick keyboard shortcut away immensely helpful.
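For what it is worth, a function that passes the test above could be as simple as the following (assuming every entry follows the resource/x-bb-&lt;type&gt; pattern):

# strip the "resource/x-bb-" prefix, leaving just the resource type
clean_data_var <- function(x) {
  sub("^resource/x-bb-", "", x)
}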

Anonymising data

I work a lot with student activity data so being able to move quickly from names / numbers to Student34 or similar is nice.
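A sketch of the kind of helper I mean (the Student prefix and the example names are purely illustrative):

# map each unique identifier to a stable "StudentN" label
anonymise <- function(ids) {
  paste0("Student", as.integer(factor(ids)))
}

anonymise(c("Alice", "Bob", "Alice"))
#> [1] "Student1" "Student2" "Student1"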

Documentation

This has been really helpful for keeping track of what actually was in that data frame, and whether there were any issues involved.
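Data objects can be documented with roxygen2 just like functions; a bare-bones entry in R/data.R (the object and field names here are made up) looks something like:

#' Cleaned student activity data
#'
#' @format A data frame with one row per logged event:
#' \describe{
#'   \item{student}{anonymised student identifier}
#'   \item{resource}{cleaned resource type, e.g. "lesson" or "document"}
#' }
"activity"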

Portability

Not just of the data, but of the functions used to clean the data. Being able to quickly grab an old function to solve a similar task using rdatapackage::functionname is super handy.

Proselytizing over

This might be old hat to some people out there, but for anyone struggling to get stuck into this key part of data science work, this changed approach has really helped me do a better, cleaner job – and it’s fun.

Array method for polynomial division

Recently I ran into an issue teaching polynomial long division in my year 11 class, as some students had not encountered the long division algorithm. During an attempt to explain how the whole process worked we stumbled upon another method for doing polynomial division, which some of the students have then adopted readily. I am sure this is not new to everyone but we found it interesting and I believe it is worth sharing.

Examples

Express x^3-2x^2+4x-4 in the form Q(x)(x-2) + R(x)

First we set up a multiplication array, with the divisor on one side. To determine the size of the array we need some knowledge about polynomials – namely that the quotient Q(x) will be of order 2, and hence have 3 terms.

We then go about filling out the multiplication array so that the result would give us the polynomial we are dividing. As long as we set up the powers of x in descending order then the like terms will form diagonals. This is easier to see in action in the video below:
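For anyone reading this without the video, here is roughly what the completed array looks like for the example above (my own sketch of the layout, with the quotient terms across the top and the divisor down the side):

$$
\begin{array}{c|ccc}
   & x^2    & 0x    & 4  \\ \hline
x  & x^3    & 0x^2  & 4x \\
-2 & -2x^2  & 0x    & -8
\end{array}
$$

Adding the entries along the diagonals gives x^3 - 2x^2 + 4x - 8, which is 4 short of the polynomial we want, so the remainder is 4 and x^3 - 2x^2 + 4x - 4 = (x^2 + 4)(x - 2) + 4.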

 

Thoughts

Unless they were already very comfortable with long division, students have tended to gravitate towards this array method, which also seems to be slightly quicker. I like that it lays out the problem in a way that more naturally represents reverse-engineering the multiplication – but I am quite biased towards anything involving graphical representations. The lesson where we, as a class, realised this could be a new way of solving these problems was a wonderful moment – actually sharing the feeling of discovering something new (to us at least!).

Authoring mathematics tests in R Markdown – how awkward can it be?

(Short story: quite, but I learned a lot on the way – including LaTeX, ggplot and reshaping data)

I recently decided to try writing a mathematics paper using R Markdown within RStudio, mainly because it required a statistical graph and I figured I might learn a few things along the way. It was a slow experience – taking far longer than it normally would – but a lot of this was hacking the formatting to behave more like Word in parts and picking up Markdown and LaTeX syntax. Definitely have an R Markdown cheatsheet and a LaTeX cheatsheet handy.

Here is the final result, and the rmd file.

Spacing in markdown

This was a pain to get right – but once you have a few frameworks in place it is not too bad. The following were handy:

 
\newpage
\text{        } # to add whitespace
\hfill # to right-align the next section of text

Working-out space (blank lines for writing solutions) also required a small hack, but using the following got OK results:

\huge
..................
\normalsize

Figures and images

The chunk options at the start of your code chunks were key to getting the images and figures right – and here working in RStudio really helped, being able to quickly knit the file to PDF to check what it looked like. This was where you really appreciate WYSIWYG editors such as Word. I would often be playing with the figure width and height of each piece to get them right. The jpeg / png and grid packages in R were also a great help – and although it was a lot more of a fiddle than the copy-paste of Word, I did get quicker at it by the end.

``` {r fig.width = 3, fig.height = 3, fig.align = 'right'}
library(jpeg)
library(grid)
img <- readJPEG("diagram.jpg")  # placeholder path to your image file
grid.raster(img)
```

Statistical charts and figures were where this whole exercise felt worthwhile. The following spits out a boxplot with specified quartiles and can be easily manipulated:
``` {r echo = FALSE, fig.width = 5, fig.height = 2.5, fig.align = 'center'}
data <- data.frame(Score = c(3,5,9,22,25))
boxplot(data, horizontal = TRUE)
```

The real power comes when you start using R’s built-in distribution functions. The following code can be run in R directly or in an R code chunk in Markdown.

# Creating the data set
df <- data.frame(x1 = rnorm(100, 5,3), x2 = rnorm(100,6,4))
boxplot(df)
# or using ggplot for scatter
library(ggplot2)
g1 <- ggplot(df, aes(x = x1, y = x2)) + geom_point()
g1

To fully utilise some of the ggplot features it is handy to get the data into long format. Instead of two columns named x1 and x2 with 100 rows each, you would have 200 rows with an ID column and a Score column: the first consisting of the identifiers (in this case ‘x1’ or ‘x2’) and the second containing the scores. The reshape2 package made life a lot, lot easier for this one.

library(reshape2)
df.long <- melt(df)
g2 <- ggplot(df.long, aes(value, color = variable)) + 
         geom_histogram() + 
         facet_grid(variable ~ .)
g2

Thoughts

There are some benefits to this approach: once the template is set up it runs well (I now use this for short homework pieces all the time), it is an easy way to blend LaTeX and some graphing capabilities, and the end results look clean and good. There are more restrictions in play than in a traditional program such as Word, but sometimes this is a good thing. Being able to quickly generate complex graphs – it is worth seeing some of the ways ggplot can be used – and include them for students to analyse has been really effective.

Opportunities for Data in Schools with R

Recently I have been getting way too much enjoyment out of learning to program in R and am now starting to look at how it might be usefully implemented in schools. I am very fortunate to work in a school that is open to trying new things and progressive in how it uses and analyses the vast amount of data it collects. So over the next few years I plan to experiment with a two-pronged attack on number crunching in schools, and share what I learn along the way with the world.

Prong 1 – Statistics in the classroom using R

After having a play with R Markdown and building some rudimentary Shiny apps, there seems to be huge scope for building case studies or learning experiences that let high school students get a lot more out of learning statistics than they currently do. A fully R-based statistics course is a long way off (pen-and-paper testing of statistics is not giving up without a fight in my part of the world), but I believe some clever scaffolding of Markdown / Shiny documents could enable students to explore key concepts using real, meaningfully sized data.

Prong 2 – School data analysis using R

Schools collect a huge amount of data, often in isolated pockets that do not really communicate with each other. A single student generates a huge amount of data: reports, grades, standardized testing, common assessments, homework, etc. With more work being submitted online there is also the opportunity to track a student’s written (well, typed) work over time. I am hoping to do a few things with this, some of which may be possible, some not:

  • Text analysis for writing quality, tracking student progress over their time at school
  • Linking multiple data sources to look for patterns
  • Building easily understandable Shiny Apps (or similar) that help tell a narrative of a student, or groups of students
  • Using machine learning to identify patterns that would not be spotted otherwise

Hopefully this is the first of many posts about the possibilities of making more of the vast amounts of numerical and textual data floating around the education system. I would love to hear from anyone out there who has trodden this road before.

The best professional learning I never did

Some time ago a colleague of mine travelled to Finland to see what all the fuss over their education system was about. There was the expected – clearly well-educated teachers and an environment where education was seen as a right and not a privilege – but there were other observations that have since become a key “well, why not?” part of my education philosophy.

One problem, many techniques.
Students were exposed to one particular problem, or an interesting shape, and then went on to explore a whole range of applications arising from it. This may not have been a pervasive national approach but it certainly seems effective. It is the opposite of examining a particular technique, say finding the gradient between two points, and then manufacturing numerous examples that require it. There is something inherently artificial about the latter process that I believe is a key element of students’ revulsion towards mathematics. If you are practising a skill then practise it – no need to dress it up in faux application; instead look for rich examples that highlight connections.

Be aware when you are the one disrupting learning.

I was not expecting insights into classroom management to come out of the juggernaut of PISA testing, but it may exemplify a teacher mindset that does help results. My colleague noticed that students were only immediately reprimanded in two cases: when they were directly disrupting the learning, or for overt rudeness. What was not addressed was even more telling. Late students – no need to make a scene; the teacher would carry on and talk to them later. I know I have been guilty of interrupting the whole class because someone is idling into the lesson late. Multitasking – a great example was a student asking questions after a physics class while applying makeup. I don’t know how many teachers in my home country would happily let a student appear distracted or unfocused while helping them – I certainly find it hard to judge. I guess the moral of the story is to give students every opportunity to engage with your subject. I now try to be acutely aware of when I am, as a teacher, getting in the way of student engagement – and then get out of the way.

Acting out Ebola

What does it mean to visualise data anyway? How do we understand the world around us?

These are the kinds of questions that have been the focus of my year 10 mathematics students lately, and one of the big issues we have looked at is the spread of the Ebola virus. There are lots of fantastic visualisations of the data out there – but sometimes seeing the data as a static picture is not quite enough to grasp what is going on. I am a big fan of running computer simulations to aid my own understanding, but that requires a fair bit of expertise and time. So how do you introduce a simulation into the classroom without having to program? Introducing: The Ebola Game.

We took 3 statistics: that Ebola in West Africa was killing roughly 1 out of every 2 people infected, that an infected person passed the virus on to another 1.7 people on average, and that the virus’ symptoms were relatively low-key in the first week and horrific in the second. I set up a turn-based game, with each turn representing a week in the life of our small village. The students then set up the game rules:

  • Each week they would “meet” 3 other people in the village, and that each meeting had a 50% chance of spreading the virus.
  • If you contracted the virus, your first week was spent living a normal life in the village (possibly infecting others you meet), but the second week you had to go to hospital
  • At the end of the second week you had a 50% chance of surviving. If you survived, you were then immune.
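For anyone who would rather let a computer do the handshaking, the rules above translate almost directly into a rough R simulation – a sketch only, with the parameters simply mirroring the rules; none of this code was used in the actual lesson:

# States: "S" susceptible, "I1" infected week one (still in the village),
#         "I2" infected week two (in hospital), "R" recovered/immune, "D" dead
set.seed(1)
n_students <- 25
weeks      <- 15
meetings   <- 3     # handshakes per infectious person per week
p_spread   <- 0.5   # chance each handshake passes on the virus
p_survive  <- 0.5   # chance of surviving at the end of the second week

state    <- rep("S", n_students)
state[1] <- "I1"    # patient zero

for (week in seq_len(weeks)) {
  in_village <- which(state %in% c("S", "I1", "R"))
  new_cases  <- integer(0)
  for (i in which(state == "I1")) {            # only week-one cases mingle
    others   <- setdiff(in_village, i)
    contacts <- others[sample.int(length(others), min(meetings, length(others)))]
    caught   <- contacts[state[contacts] == "S" & runif(length(contacts)) < p_spread]
    new_cases <- c(new_cases, caught)
  }
  hospital <- which(state == "I2")             # resolve week-two cases
  survives <- runif(length(hospital)) < p_survive
  state[hospital[survives]]  <- "R"
  state[hospital[!survives]] <- "D"
  state[state == "I1"] <- "I2"                 # week-one cases head to hospital
  state[unique(new_cases)] <- "I1"             # new infections start week one
  cat(sprintf("Week %2d: infected %d, immune %d, dead %d\n", week,
              sum(state %in% c("I1", "I2")), sum(state == "R"), sum(state == "D")))
}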

And so the game began, with yours truly secretly taking notes on who was currently infected without their knowledge. Students wandered around the class and had to shake hands with 3 different people, representing the three potential transmissions they were encountering each week. I acted as the media, reporting the new cases each week (along with the occasional suspicious flu just to keep them on their toes). To make it more interesting the students took on a few roles:

  • Mayor (elected). They had the power to enforce a curfew, effectively reducing the number of enforced meetings between students per week.
  • Doctor and Nurse. Both worked at the hospital, so were forced to interact with known cases; however, the students decided they would have a lower chance of catching the virus because they would take precautions.
  • Undertaker. Escorted the ‘dead’ from the village, again with a small chance of catching the infection.

Things got interesting quickly. Students became quite reluctant to shake hands (understandably!) and started petitioning the Mayor for stronger restrictions. If someone went to hospital, fear quickly spread amongst those who had had recent contact with the patient. No one went to visit anyone in hospital. Ever. The nurse got sick, but survived and was thus immune. The undertaker died, so another student had to take up his job (reluctantly!). As the simulation went on we graphed the number of infections and the number of dead on the board, both steadily rising. It was not until curfews were in place and enough immunity had spread through the group that the cases started to die out, and by that time a third of the class had perished.

It was a great trigger for conversation about what the statistics mean. I personally learned that a simulation can be run in a low-tech and much more entertaining way. The biggest gain, though, was really unpacking a statistic as simple as “1 in 2 people” and seeing what it really means.

A subject to be created

Central to my teaching philosophy is the idea that mathematics is a subject to be created, not a created subject. I heard the phrase at a meeting of like-minded educators, and it has resonated with me so much that I am looking for a Latin translation for the family crest. But why place this idea at the centre? And is it even true? I heard the phrase when talking about mathematics education, but believe we can extrapolate this idea to encompass mathematics itself.

Ask anyone to work out 3 divided by 0 and they will (mostly!) let you know it can’t be done. Ask what zero divided by zero is and perhaps a few less will assure you that you cannot divide by zero – and up until the time of Newton and Leibniz they would have had the consensus of the world’s Natural Philosophers behind them. But calculus, and much of western science, would never have emerged if not for the idea that trying to assign a sensible number to 0/0 might be useful.

The history of mathematics is littered with examples where people have simply thought “why not?” and journeyed past common sense and tradition to somewhere beautiful and mysterious. Complex analysis. Riemannian geometry. Analytic continuation. I saw a lovely proof recently that used this exact idea, taking the divergent sum 1 − 1 + 1 − 1 + 1 − … and assigning it a “reasonable” value, and it leads to one of the most bizarre and beautiful results around (check out the great Numberphile video here).
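The usual heuristic for assigning that “reasonable” value goes something like this – a formal manipulation rather than a convergent sum:

$$
S = 1 - 1 + 1 - 1 + \dots = 1 - (1 - 1 + 1 - \dots) = 1 - S
\quad\Longrightarrow\quad S = \tfrac{1}{2}
$$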

I guess the moral of the story is that if we do not allow for the “what if” play in mathematics lessons, then we are not really doing mathematics. The how of this may be a challenge, but I feel the why is a truth that we should hold to be self-evident.

Creating doubt, confusion and mathematics