EP624: Statistical Programming and Reproducible Research

Table of Contents

1 System Setup

1.1 Operating System

  • Install any latest distribution of GNU/Linux on your computer. I use Manjaro, but feel free to install any.
    • If you have a Mac, you can continue using OSX, and install the required applications on OSX.

1.2 Applications

GNU/Linux   update

Commands below are for setting a Manjaro/Arch Linux computer. For other distributions (for example, Debian/Ubuntu-based), use their package managers to install the same applications.

#+CAPTION: Setting up a Manjaro/Arch Linux computer
## These commands should be run one line at a time. Do not run it as a script.
## Modify to suit your requirements
sudo pacman -Ssy
sudo pacman -S emacs firefox texlive-most biber nmap openssh yay gimp phpmyadmin evince git git-lfs mpv vlc r htop neomutt isync pass mariadb postgresql aspell-en termite pass notmuch go tk afew ack keychain chromium msmtp unison python-pip python-imaging ghostscript hplip geos youtube-dl evince pandoc-citeproc
sudo pacman -R palemoon
sudo pacman -S rofi #Only if you use i3wm
sudo pacman -U https://www.archlinux.org/packages/community/x86_64/browserpass/download/
sudo pacman -S browserpass-firefox browserpass-chromium
systemctl enable sshd
yay -S dropbox
yay -S megasync
yay -S keybase-bin
yay -S skype
yay -S signal-desktop
yay -S mimio
yay -S libreoffice-extension-languagetool
install_pulse

OS-X

The built-in Emacs on OS-X is an older version, and it would be a good idea to install the latest version instead.

The best option is to install it via homebrew. I like the version available from railwaycat/emacsmacport tap (https://github.com/railwaycat/homebrew-emacsmacport).

First install Homebrew. Then, from the terminal, give following commands:

$ brew tap railwaycat/emacsmacport

$ brew install emacs-mac

Install MacTeX from http://www.tug.org/mactex/

Install R from http://www.r-project.org

Also install pandoc and pandoc-citeproc using homebrew:

$ brew install pandoc pandoc-citeproc

1.3 Create your ssh keys

ssh or Secure Shell is a protocol for securely transferring data over networks.

Use these commands to create your ssh keys.

# If you do not already have ssh keys, create them as below
mkdir ~/.ssh
ssh-keygen -t rsa #No need to use a password, when asked. Just press enter for all the questions (accept defaults).

If you have just accepted the defaults, these commands would have created two files in .ssh directory: ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub.

id_rsa is your private key which much be kept safe and is never shared with anyone.

id_rsa.pub is the matching public key which is shared with any machine with which you want to have secure data transfers.

1.4 Emacs configuration

In your github.com account, go to settings, click on “SSH and GPG keys”, and add a new ssh key. In the box that opens, in Title add your name, and the name of your machine, and in the box below, paste the contents of the public key that was created in the previous step (~/.ssh/id_rsa.pub).

Once that is done, following commands will configure Emacs for the requirements of this course.

git clone git@github.com:vikasrawal/prelude-ep624.git
#git clone https://github.com/vikasrawal/prelude-ep624.git # Use may use this if you do not have a github account with your public ssh key added to it.

mv .emacs.d .emacs.d.old # Renames if you already have a .emacs.d directory
mv prelude-ep624 .emacs.d
cd .emacs.d
git checkout ep624
cd

1.5 Access to your Github repositories   new

Git is a version control system. Created originally by Linus Torvaldes, it is the most widely used version control system for collaboration and tracking by software developers. Since most of our work – the org files, the R scripts, the LaTeX files – would be in plain text files, we can use git not only to track evolution of our programs but also of our research papers, dissertations and books.

Apart from locally hosting your repositories on your own machine, you can also keep your repositories on platforms like Github and Gitlab. If you want to keep your work-in-progress research on these platforms, you would need rights to create private repositories. While Gitlab provides it by default, you can enable it on Github by claiming educational discount.

I will be using git and Github to run this course. Which means that you will need to know basic git commands to submit your assignments and get my feedback. The idea is to use this to familiarise you with basic git commands.

I will create a separate repository for each student on Github, and invite you as a collaborator. You will clone it on your computer using the commands mentioned below, create a separate sub-directory for each assignment and do all your work in these sub-directories. Once you have completed some work and want to share it with me, you will commit it and push it to the github repository. You commit anything to the repository in two steps. You first add it to the staging area and then you commit it.

 1: ## Basic git workflow for this course
 2: ## The following command, after modifying the name of your repository,
 3: ## clones your repository. A directory -- 2020_yourname -- will be
 4: ## created in the directory from where the command is given
 5: ## (your home directory in this case). All future work for the course
 6: ## should be done in 2020_yourname directory.
 7: 
 8: cd ~
 9: git clone --recurse-submodules git@github.com:ep624/2020_yourname.git
10: 
11: cd ~/2020_yourname
12: rm -Rf resources
13: git clone git@github.com:ep624/resources.git
14: 
15: cd ~/2020_yourname/resources
16: git pull origin master
17: 
18: 
19: ## create a separate subdirectory in 2020_yourname for each of your assignments.
20: cd ~/2020_yourname
21: mkdir midterm
22: 
23: ## To add a file to the staging area (a new file, or a file that has been modified)
24: git add mainfile.org
25: git add figure1.png ## the repository accepts png files as git lfs objects.
26: 
27: ## To commit the file
28: git commit -m "A message describing this commit"
29: 
30: ## To push it to github repository
31: git push
32: 
33: ## To pull from github the work of other collaborators (my comments/edits, for example)
34: ## and merge it into your local repository
35: git pull
36: 

A separate team has been created on Github for your batch. You should accept the invite and join the team.

1.6 Learning how to set up your system

It would be useful for you to learn how to

  • install a Linux system.
  • use package manager of your Linux distribution to install software.
  • customise Emacs.

While we have given some instructions here, and provided some ready-made configuration files for you to deploy, you should learn what these are doing, how to modify them, and how to extend them.

2 Introduction

2.1 Introduction to the course

Why should you use programming for statistical analysis?

  • Programming makes your results reproducible and verifiable.
  • It is easier to revise your estimations, graphs, tables etc if you write and save your programs.
  • Updating your analysis, for example when new data become available, is easy, as the same/modified programs could be run on new data to get new results.

Why use R?

  • R is open-source and free of cost.
    • Will always remain available.
    • Development is fast; tools for using new, cutting-edge statistical techniques become available as add-on packages.

2.2 Introduction to shell programming

Introduction to *nix file system

  • *nix file system is structured as a tree with “/” as the root.
  • /home/yourusername is your $HOME. ~ is a shortcut to this directory.
    • On Mac/OS-X, $HOME is located at /Users/yourusername. ~ is mapped to this directory.

Important Bash Commands to Learn

Help
man
manual
File management
ls
list files
mkdir
make directory
cp
copy
mv
move
rm
remove/delete
ssh, scp, rsync
commands used for secure copy to remote locations
find
find a file
ack
find files containing a string
Navigation
cd
change directory
pwd
path of working directory
Seeing/modifying files on the commandline
cat, more, less, head, tail
commands to see contents of text files
awk, sed
tools to process text files line by line.
diff
to compare two text files
Redirection
\(>\)
redirect output to a file (if the output file already exists, it will be replaced)
\(>>\)
append output to a file (if output file does not exist, it will be created)
\(|\)
pipe output to another command
Others
grep
filter lines of file containing a string
“ip address show”
to find your ip address
sort
sort lines
uniq
to find unique adjacent lines
wc
count number of bytes, words and lines

3 Emacs

3.1 Introduction to Emacs

Emacs is an powerful cross-platform editor which can be used for all kinds of tasks that you do on your computer. Although Emacs is different from every other editor you are likely to have used, learning to use Emacs would be a huge asset for your computing abilities. A good way to learn Emacs, to ask, for each activity you do on your computer, can I do this using Emacs? You will be amazed at how many times the answer is: yes, and more efficiently than in your current application.

This section does not provide a detailed guide to the use of Emacs. In this guide, I will just provide a minimal set of basic commands in emacs to get you started. This is a minimal but sufficient set to be able to work. I expect that you would learn more commands as you start using emacs.

3.2 Notations

In emacs, a buffer is like a tab in an application. It is normal to have several buffers open at the same time. Each file opens in emacs as a buffer. Buffers could also have processes like R running in them. Emacs displays any messages for you in a separate buffer.

Most commands in emacs are given using the Control (ctrl) or the Meta (often mapped to alt) keys. Control key is usually referred to as C- and the Meta key as M-. So a command C-c means pressing Control and c together. Command M-x means pressing Meta and x together. Everything is case-sensitive. So M-X would mean, pressing Meta, Shift and x together. C-c M-x l would mean pressing C-c, release, then M-x, release, and then l.

3.3 Basic commands

Table 1 gives the commands that are the most important. This is a minimal set, commands that you should aim to learn as soon as possible. There are many more, which you will learn as you start using emacs.

All commands have a verbose version that can be used by pressing M-x and writing the command. For example, M-x find-file to open a file. All major commands are also mapped to a shortcut. For example, instead of typing M-x find-file to open a file, you can say C-x C-f. I remember shortcuts for commands that I use most frequently. For others, I use the verbose versions. Over time, one learns more shortcuts and starts using them instead of the verbose versions.

Table 1: Essential emacs commands
Description Verbose command Shortcut
  M-x followed by  
Opening files, saving and closing    
Open a file find-file C-x C-f
Save the buffer/file save-buffer C-x C-s
Save as: prompts for a new filename and saves the buffer into it write-named-file C-x C-w
Save all buffers and quit emacs save-buffers-kill-emacs C-x C-c
Copy, Cut and Delete Commands    
Delete the rest of the current line kill-line C-k
To select text, press this at the beginning of the region and then take the cursor to the end set-mark-command C-spacebar
Cut the selected region kill-region C-w
Copy the selected region copy-region-as-kill M-w
Paste or insert at current cursor location yank C-y
Search Commands    
prompts for text string and then searches from the current cursor position forwards in the buffer isearch-forward C-s
Find-and-replace: replaces one string with another, one by one, asking for each occurrence of search string query-replace M-%
Find-and-replace: replaces all occurrences of one string with another replace-string  
Other commands    
Divide a long sentence into multiple lines, each smaller than the maximum width specified fill-paragraph M-q
Window and Buffer Commands    
Switch to another buffer switch-to-buffer C-x b
List all buffers list-buffers C-x C-b
Split current window into two windows; each window can show same or different buffers double-window C-x 2
Remove the split zero-window C-x 0
When you have two or more windows, move the cursor to the next window other-window C-x o
Canceling and undoing    
Abort the command in progress keyboard-quit C-g
Undo undo C-_

3.4 Resources

  • A Tutorial Introduction to GNU Emacs
  • A. J Rossini and Richard M Heiberger and Rodney A Sparapani and Martin Mächler and Kurt Hornik (2004), “Emacs Speaks Statistics: A Multiplatform, Multipackage Development Environment for Statistical Analysis”, Journal of Computational and Graphical Statistics, 13(1), 247-261
  • Heiberger, Richard M (2001), “Emacs Speaks Statistcs: One Interface, Many Programs”, in K. Hornik and F. Leisch (eds.), Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, Vienna, Austria
  • Rossini, A.J., Heiberger, R.M., Hornik, K., Maechler, M., Sparapani, R.A., Eglen, S.J., Spinu, V., Henry, L. (2016), ESS: Emacs Speaks Statistics
  • Rossini, Anthony, Martin Maechler, Kurt Hornik, Richard M. Heiberger, (2001) “Emacs Speaks Statistics: A Universal Interface for Statistical Analysis”, University of Washington Biostatistics Available at: http://works.bepress.com/anthony_rossini/3/

4 R Programming

4.1 Basics

  • R is a case sensitive language. Do not mix upper case and lower case characters.
  • You should have a separate directory for each project/article you work on.
  • By default, the work you do is saved in this working directory in a file called .RData
  • All R commands are followed by (). Arguments that the command needs are given inside the parentheses. Empty parentheses have to be used even if a command needs no arguments. For example,
    • q will not work as a command
    • q() is a command
    • q(save="yes") is a command with an argument that asks the R environment to be saved before quitting.
  • -> or <- is used to create an object using output of a command.
    • c(1:5)->col1 creates an object col1 containing the output of the command c(1:5).
    • col1<-c(1:5) also does the same thing.
  • In this guide, all commands are formatted using a mono-spaced font to distinguish them. Such lines are numbered. But line numbers are not a part of the command. In the boxes, text starting with a # is used to write comments for the reader. These comments are not part of the commands. You do not need to type them into R as any line starting with a # is treated as a comment in R and not executed. When you write your own programs, use lines starting with # to write your own comments that may be necessary to make your program intelligible to your collaborators.
 1: #+CAPTION: Some basic commands in R
 2: # Text after # is a comment to explain what the command does
 3: q() # Quit R; prompts for saving the workspace.
 4: ?command # Open help for 'command'; q to quit help.
 5: ??keyword # List commands with 'keyword' in the title
 6: help.search("keyword") # List commands with 'keyword'
 7:                        #  anywhere in the docs
 8: setwd("c:/project1")  # On a Windows machine, sets the working directory to c:/project1
 9: setwd("~/proj2")  # On a Linux or OSX machine, sets the working directory to ~/proj2
10: getwd() # Reports the present working directory

4.2 Data structures in R

Vectors

  • A vector is a column of data. The most basic command for creating a vector (think of a column) is c().
    • c(1:50) creates a vector with numbers 1 to 50.
  • If vec1 is a vector, vec1[i] refers to ith element of the vector.
  • A vector can be a numeric vector, a character vector, or a factor (categorical) vector. – character variables are text strings, – numeric variables are numbers – factors are categorical variables, in which each observation belongs to a category. Category labels may be numeric or characters. Even when they are numeric, they are just labels for categories and should not be confused as numbers.

A numeric vector is a collection of numbers.

 1: 
 2: # Line 3 creates a numeric vector called col1 with
 3: # 13 elements: numbers 1 to 10, 11, 14.5 and 6.
 4: c(1:5,11,14.5,6)->col1
 5: col1
 6: # Line 6 creates a character vector called col2
 7: # with four elements, each being a string of
 8: # characters
 9: c("Sheela","Manoj","Karim","John")->col2
10: col2
11: # Line 9 creates a factor vector in which each
12: # element belongs to category "M" or "F".
13: as.factor(c("M","M","M","F","F","M","M"))->col3
14: col3
15: # Line 13 reads a factor vector and outputs a
16: # vector that is a list of categories. In this
17: # case, the output would be: [1] "F" "M"
18: levels(col3)
19: # Line 15 can be used to check what type of object
20: # is col3.
21: class(col3)

Results of running the code above

[1]  1.0  2.0  3.0  4.0  5.0 11.0 14.5  6.0
[1] "Sheela" "Manoj"  "Karim"  "John"
[1] M M M F F M M
Levels: F M
[1] "F" "M"
[1] "factor"

Data frames

  • A data frame is a collection of vectors. It is a table that stores variables in columns and observations in rows. You could think of data frames as equivalent to a spreadsheet with many columns. When you import a data file, it would normally be imported as a data frame.
  • However, there is one important difference. In an R data frame, each column is a particular type of vector (numeric, character or factor). You cannot mix the type of data in a given column. A spreadsheet, in contrast, can contain anything in any cell.
  • Just as for a vector col1, col1[i] refers to the ith element, in a data frame data[i,j] refers to the element in ith row and jth column. Note that the first argument (before the comma) specifies the rows and the second argument (after the comma) specifies the columns. This is one way to subset data frames.
    • data[1:9,c(2,5,6)] picks the first nine rows and column nos 2, 5 and 6.
    • data[data[,1]->10000,] selects all rows in which the first column is greater than 10000, and for these rows, shows all columns. Note that not specifying anything after comma is interpreted as all columns. Similarly, nothing before the comma would be interpreted as all rows.
  • data1$var1 refers to vector var1 inside data frame data1.
    • data1$var1^2->data1$var5 calculates square of var1 in data1 and inserts it as a new vector (var5) in data1.
    • data1$var1+2*data1$var2->data1$var2 takes var1 and var2 from data1, computes var1+2*var2, and replaces var2 in data1 with the computed variable.
  • R provides a large number of commands to manipulate data frames in various ways. You should master these. See Box 1 to learn commands for selecting subsets of data frames, match merging data frames, appending data frames, adding rows and columns to data frames, and reshaping data frames.

## Selecting rows using conditions. Conditions can be
## specified using operators such as:
## ==, !=, <, <=, >, >=, is.na, is.nan
subset(data1,var1>10000)->data2

## Selecting columns (may be combined with argument for
## selecting rows).

subset(data1,select=c("var1","var5","var10"))->data2

## merging data frames using variables var1 and var2,
## selecting only matching rows of data3 and data4.

merge(data3, data4, by=c("var1","var2"))->data5

## merging data frames using variables var1 and var2,
## selecting all rows of data3 and matching rows of
## data4. Rows with no matching information from data4
## will have NAs in columns coming from data4

merge(data3, data4, by=c("var1","var2"),
      all.x=TRUE)->data5

## merging data frames using variables var1 and var2,
## selecting all rows of data4 and matching rows of
## data3

merge(data3, data4, by=c("var1","var2"),
      all.y=TRUE)->data5

## merging data frames using variables var1 and var2,
## selecting all rows of data 3 and data4.

merge(data3, data4, by=c("var1","var2"),
      all=TRUE)->data5


4.3 Statistical Graphics with ggplot2

Showing statistical information in an appropriate graphical form can help reveal the underlying patterns and trends. This is the purpose of using statistical graphs. Graphs are not made to add colour to your report. Spreadsheet software generated chartjunk has increasingly been passed off as statistical graphs. tufte2001visual remarked, “graphical decoration, which prospers in technical publications as well as in commercial and media graphics, comes cheaper than the hard work required to produce intriguing numbers and secure evidence”.

The ggplot2 package provides an extremely powerful system for making statistical graphs. The ggplot2 package is based on Leland Wilkinson’s The Grammer of Graphics wilkinson2006grammar. It can be used for making a large variety of statistical graphs as well as maps. Although it is somewhat complex, mastering ggplot2 would give you a big advantage in making interesting statistical graphs.

This guide would give you a basic idea of the ggplot2 syntax. You should look at the ggplot2 website for more examples. You are also advised to read wickham2016ggplot2 and chang2012r. These are expected to be used as ready reckoners.

In ggplot2, graphs are made in steps. You start from a basic specification of the graph, and then add/modify the details.

Step 1: Specify the aesthetic mappings

In the first step, you map, through aesthetic mappings, selected variables in your data to different scales. ggplot2 provides various types of scales for mapping your variables. The most important of these are the horizontal x axis, the vertical y axis, colour, size, linetype and shape (of a point). Continuous variables or discrete variables(for example, factors) can be mapped to these. Only discrete variables can be mapped to shape and linetype.

1: ggplot(data,aes(x=year,y=income,colour=State))->p

You could also specify colour, size and shape independently of aesthetics if you want to specify a specific colour, shape or size but keep them constant for all the data points.

1: ggplot(data,aes(x=year,y=income,size=State),
2:        colour="darkred")->p

Step 2: Specify the type of graph you want

In the second step, you specify the type of graph you want to make. These are called geometric objects, or “geom”, in the ggplot2 lexicon. Where necessary, you may also specify a statistical transformation that should be done while plotting (using commands that typically start with a stat_). Table 2 provides a list of various types of graphs/geometric objects that can be made with ggplot2, and the associated commands.

2: # We start from the object 'p' created in Step 1
3: # and use '+' to add to it/modify it.
4: p+geom_point()->p #This makes a scatter plot
5: p+geom_line()->p #This adds a line to the plot
6: p # At any stage, doing this would actually
7:   # plot p on the screen.
Table 2: Various types of geometric objects (geoms), and the commands used to make them, in ggplot2
Command Type of graph
geom_abline, geom_hline, geom_vline, Reference lines: horizontal, vertical, and diagonal
geom_bar, geom_col, stat_count, Bars charts
geom_bin2d, stat_bin_2d, Heatmap of 2d bin counts
geom_blank, Draw nothing
geom_boxplot, stat_boxplot, A box and whiskers plot (in the style of Tukey)
geom_contour, stat_contour, 2d contours of a 3d surface
geom_count, stat_sum, Count overlapping points
geom_density_2d, stat_density_2d, Contours of a 2d density estimate
geom_density, stat_density, Smoothed density estimates
geom_dotplot, Dot plot
geom_errorbarh, Horizontal error bars
geom_hex, stat_bin_hex, Hexagonal heatmap of 2d bin counts
geom_freqpoly, geom_histogram, stat_bin, Histograms and frequency polygons
geom_jitter, Jittered points
geom_crossbar, geom_errorbar, geom_linerange, geom_pointrange, Vertical intervals: lines, crossbars & errorbars
geom_map, Polygons from a reference map
geom_path, geom_line, geom_step, Connect observations
geom_point, Points
geom_polygon, Polygons
geom_qq, stat_qq, A quantile-quantile plot
geom_quantile, stat_quantile, Quantile regression
geom_ribbon, geom_area, Ribbons and area plots
geom_rug, Rug plots in the margins
geom_segment, geom_curve, Line segments and curves
geom_smooth, stat_smooth, Smoothed conditional means
geom_spoke, Line segments parameterised by location, direction and distance
geom_label, geom_text, Text
geom_raster, geom_rect, geom_tile, Rectangles
geom_violin, stat_ydensity, Violin plot
stat_sf, geom_sf, coord_sf, Visualise sf objects

Step 3: Modify the scale

For each of these scales (x, y, colour, size and shape), you can define attributes such as the range to be displayed, the value (say of colour or shape) associated with each category in a discrete variable, the axis label, how and where to tick-mark the axis, and the type of legend to be drawn.

 6: p+scale_x_continuous("Financial Year",
 7:                      limits=c(1990,2012),
 8:                      breaks=c(1990,1995,
 9:                               2000,2010))->p
10: 

Step 4: Modifying the overall appearance of the graph

ggplot2 provides various themes, and different elements to change them, to tweak the overall look of the graph. Table 3 gives a list of different aspects of a theme that can be modified. Table 4 lists different attributes of text and theme elements that can be modified.

Table 3: Theme items that control appearance of text
Name Description Element type
text All text elements element_text()
rect All rectangular elements element_rect()
line All line elements element_line()
axis.line Lines along axes element_line()
axis.title, axis.title.x, axis.title.y Appearance of axis labels: both, x and y element_text()
axis.text, axis.text.x, axis.text.y Appearance of tick labels: both, x and y element_text()
legend.background Background of legend element_rect()
legend.text, legend.title Legend item and title appearance element_text()
legend.position Position of the legend “left”,“right”,“bottom”,“top”, or two-element numeric vector if you wish to place it inside the plot area
panel.background Background of plotting area element_rect()
panel.border Border around plotting area element_rect(linetype=“dashed”)
panel.grid.major, panel.grid.major.x, panel.grid.major.y Major grid lines: both, vertical and horizontal element_line()
panel.grid.minor, panel.grid.minor.x, panel.grid.minor.y Minor grid lines: both, vertical and horizontal element_line()
plot.background Background of the entire plot element_rect(fill = “white”, colour = NA)
plot.title Title text appearance element_text()
strip.background Background of facet labels element_rect()
strip.text, strip.text.x, strip.text.y Text appearance for facet labels: both, horizontal and vertical element_text()
Table 4: Properties of theme elements and text geoms
Theme elements Text geoms Description
family family Helvetica, Times, Courier
face fontface plain, bold, italic, bold.italic
colour colour Color (name or“#RRGGBB”)
size size Font size (in points for theme elements; in mm for geoms)
hjust hjust Horizontal alignment: 0=left, 0.5=center, 1=right
vjust vjust Vertical alignment: 0=bottom, 0.5=middle, 1=top
angle angle Angle in degrees
lineheight lineheight Line spacing multiplier
 9: # The line below adds a title to the graph
10: p<-p + ggtitle("Title of my graph")
11: # The line below changes the x axis title
12: p<-p+theme(axis.title.x=element_text(size=16,
13:                                      lineheight=.9,
14:                                      family="Times",
15:                                      face="bold.italic",
16:                                      colour="red"))

Step 5: If needed, use faceting to show multiple plots in a panel

ggplot2 also provides facetting tools, to be used if you want to show multiple graphs in a panel. gglot2 provides two functions to split the graph into different categories: facet_wrap and facet_grid.

14: # The line below creates a separate graph for
15: # each value of category, and arranges that
16: # in a matrix of three columns, and as many
17: # rows as required.
18: p+facet_wrap(~category,ncol=3)
15: # The line below creates a separate graph
16: # for each combination of category1
17: # and category2. Graphs are arranged in
18: # a grid with values of category1 making rows
19: # and values of category2 making columns.
20: p+facet_grid(category1~category2,ncol=3)

4.4 Making tables using the data.table package

Many commands in R, and in several add-on packages, can be used for working with and creating statistical tables. Among the most popular are the apply family of functions and plyr, dplyr and reshape2 packages. The data.tables package may be considered a swiss knife for working with and producing statistical tables. Of all the tools, this is one of the most flexible, versatile and efficient.

  • data.table does not work directly on data frames but needs you to create a data frame like object called “data.table”.
  • You can convert an existing data frame to a data.table using

    as.data.table(data1)->data1

  • Or you can use fread("delimited_file_name")->data1 to read a delimited file and directly create a data table.
    • fread is fast and efficient.
    • It also automatically detects separating character and variable names.
  • All commands that work on a data frame can also be used on a “data.table”. So nothing is lost by keeping your data as a data table rather than as a data frame.
  • The basic syntax of data.table is as follows: DT[i,j,by]. This subsets DT using i, calculates j, groups using “by”
    • i (subsetting rows)
      • a set of logical expressions to subset DT.
      • columns can be referred to directly by their names.
      • data2.dt[sector==1] subsets rows for which variable sector==1.
    • j (selecting columns, summarising)
      • [,col] selects a column and returns it as a vector.
      • [,.col] selects a column and returns the output as datatable.
      • [,.(col1, col2)] or [,list(col1, col2)] return two columns.
      • [,.(colname1=col1, colname2=col2)] simultaneously renames columns.
      • [,.(c1mn=mean(col1),c2med=med(col2))] calculates mean and median.
    • by (separately doing the calculations for groups)
      • Following code computes mean and median separately for each combination of col3 and col4.

        [,.(c1mn=mean(col1),c2med=med(col2)),
          .(col3,col4)]
        
  • In a data.table, you can define some variables as “keys”. These can be used for quickly sorting, subsetting and merging.
    • setkey(DT, sector) defines sector as a key variable in DT.
    • If sector in DT is defined as a key variable, then the following command will subset rows for which the key variable takes values “Industry” or “Agriculture”.

      DT[c("Industry","Agriculture")]
      
    • In the usual DT[i,j,by] type commands, you can replace by with keyby as shown below to simultaneously define keys and sort the output.

      [,.(c1mn=mean(col1),c2med=med(col2)),
       keyby=.(col3,col4)]
      
  • Output of a data.table is also a data.table. This means, that a sequence of commands can be chained as:

    # Three commands run in a sequence
    
    DT[i1,j1,by1]->DT2
    DT2[i2,j2,by2]->DT3
    DT3[i3,j2,by3]
    
    # Can be combined into one line
    # as follows
    
    DT[i1,j1,by1][i2,j2,by2][i3,j2,by3]
    
    
  • It is also possible to use DT[i,j] to do a transformation on DT itself rather than create a new data.table.
    • For example, to create a new column called newcol in DT, one can use j as

      newcol:=columnname1*columnname2.

      Note that := is used to say that the transformation is to be done on DT itself.

    • A limitation of this usage is that only one field can be transformed at a time. Transforming multiple fields would require chaining the data.table commands.
  • Sometimes you may wish to rotate a cross-table created using data table. That is, you may wish to create columns out of one or more disaggregating (keyby) variables that have been used. This can be achieved in the following way.
    • In the command below, the first data.table transformation computes mean by sector, sex and religion. The second data.table transformation creates all combinations of sector, sex and religion, and ensures that DT2 has rows for all combinations (including null rows). The third data.table transformation rotates the table, with religion and sex being identified as row disaggregators and the remaining keyby variable (sector) as the column disaggregator.

      DT[,.(col1mean=mean(columnname1)),
         keyby=.(sector,sex,religion)]->DT1
      
      DT1[CJ(unique(sector),
             unique(sex),
             unique(religion))]->DT2
      
      DT2[,as.list(col1mean)
         ,by=.(sex,religion)]
      
    • Like all data.table commands, these can be chained together into one line of code.
  • data.table provides a very convenient syntax for merging data.tables.
    • DT1[DT2] joins matching data from DT2 with DT1; the matches are done using their keys (presuming the keys are defined).
    • DT1[DT2,on=c(var1=“var2”)] joins matching data from DT2 with DT1; matches are done using var1 of DT1 and var2 of DT2.

4.5 Analysis of Survey Data in R

4.6 Econometric Analysis in R

4.7 Exercises

COVID-19

Inside the resources directory, please give the following commands to get the COVID-19 dataset.

git submodule init

git pull --recurse-submodules

The data will be downloaded in the “resources/COVID-19/csse_covid_19_data/csse_covid_19_time_series” directory. These data are updated daily. You should be able to git pull --recurse-submodules everyday to update the dataset.

What exploratory analysis can you do with this data-set using the commands that you have learnt so far?

NSS 75, Schedule 25.2

In the resources directory, sub-directory nss75252e has following files.

download.sh
Bash script that downloads the required data. Open and check what the file does before executing. Executing unknown shell-scripts given by others can be a potential security hazard. So be sure you know what you are executing. You may have to use chmod to make the file executable. The script also downloads a file called datalay75_252.XLS which gives layout of each data file.
readdata.R
This is an R script written by me for reading the data files. Code for reading the file R75252L03.TXT for level 3 is missing, and is for you to write. Use code for other levels as illustrations for you to learn how to read these files. Information for the variables contained in R75252L03.TXT can be found in datalay75_252.XLS

4.8 Resources

  • Chang, Winston (2013), R Graphics Cookbook, OReilly
  • Crawley, Michael J (2015), Statistics: An Introduction using R, Wiley
  • Fox, John and Weisberg, Sanford (2011), An R Companion to Applied Regression, Sage Publications, Thousand Oaks, CA.
  • Hatekar, Neeraj (2010), Principles of Econometrics: An Introduction (Using R), Sage
  • Hilbe, Joseph (2013), Methods of Statistical Model Estimation, Chapman & Hall/CRC Press.
  • Lumley, Thomas, Complex Surveys: A Guide to Analysis Using R, John Wiley & Sons
  • Maindonald

, John and Braun, John (2003), Data Analysis and Graphics Using R: An Example-based Approach, Cambridge University Press

  • Michael J. Crawley (2013), The R Book, Second Edition, John Wiley & Sons
  • Murrell, Paul (2011), R Graphics, Chapman & Hall/CRC Press.
  • R Core Team (2015), An Introduction to R, R Foundation for Statistical Computing,Vienna.
  • R Core Team (2015), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing,Vienna.
  • Spector, Phil (2008), Data Manipulation with R, Springer
  • Wikham, Hadley, ggplot2: Elegant Graphics for Data Analysis, UseR Series, Springer

5 Reproducible Research

5.1 Introduction to orgmode

Org-mode basics

Preamble

An Org file has a few special lines at the top that set up the environment. Following lines are an example of the minimal set of lines that we shall use.

#+TITLE: Using Emacs, Org-mode and R for Research Writing in Social Sciences
#+SUBTITLE: A Toolkit for Writing Reproducible Research Papers and Monographs
#+AUTHOR: Vikas Rawal
#+DATE: May 4, 2014
#+OPTIONS: toc:2 H:3 num:2

As you can see, each line starts with a keyword, and the values for this keyword are specified after the colon.

Table 5 gives details of a few major special lines that we shall use.

Table 5: Main special lines to be used at the top of an Org buffer
Keyword Purpose
#+TITLE To declare title of the paper
#+AUTHOR To declare author/s of the paper
#+DATE Sets the date. If blank, no date is used. If this keyword is omitted, current date is used.
#+OPTIONS There are many options you can give. These are what I find the most important Multiple options can be separated by a space and specified on the same line.
  toc:--- (Do not include a Table of contents), toc:n (Include n levels of sections and sub-sections in Table of contents)
  H:2 (Treat top two levels of headlines as section levels, and anything below that as item list. Modify the number as appropriate)
  num:2 (Number top two levels of headlines. Modify the number as appropriate.)

In addition to these, we can use LaTeX specific options for formatting the pdf output, odt specific options for formatting the odt/docx output, and R specific options for setting up the R environment. These would also be specified using special lines at the top of the file. I shall provide details of some of these in the Sections where these topics are discussed.

In research monographs prepared using org, you may need quite a few lines in the preamble to set everything. These would tend to make the top of the org file look daunting. It may be helpful to store these lines in a set of separate org files, and include them in your main document using line such as this:

#+INCLUDE: path-to-file/papersetup.org

Sections and headlines

After the special lines at the top comes the main body of the Org file.

The content in any Org file is organised in a hierarchy of headlines. Think of these headlines as sections of your paper.

A headline in Org starts with one or more stars (*) followed by a space. A single star denotes the main sections, double star denote the subsections, three stars denote sub-subsections, and so on. We shall use this to create the section structure of our document. You can create as many levels of sections as you need.

See the following example. Note that headlines are not numbered. We leave section numbering for org-mode to handle automatically.

#+TITLE: Reproducible Research Papers using Org-mode and R: A Guide
#+AUTHOR: Vikas Rawal
#+DATE: May 4, 2014
* Introduction
This is the first section. Add your content here.
* Literature review
** Is this an important issue
This is a sub-section under top-level section "Literature review" Now
indulgence dissimilar for his thoroughly has terminated. Agreement
offending commanded my an. Change wholly say why eldest period. Are
projection put celebrated particular unreserved joy unsatiable its. In
then dare good am rose bred or. On am in nearer square wanted.
** What are the major disputes in the literature
*** adulterated text
Instrument cultivated alteration any favourable expression law far
nor. Both new like tore but year. An from mean on with when sing pain.
Oh to as principles devonshire companions unsatiable an delightful.
The ourselves suffering the sincerity. Inhabit her manners adapted age
certain. Debating offended at branched striking be subjects.
*** Unadulterated prose
Announcing of invitation principles in. Cold in late or deal.
Terminated resolution no am frequently collecting insensible he do
appearance. Projection invitation affronting admiration if no on or.
It as instrument boisterous frequently apartments an in. Mr excellence
inquietude conviction is in unreserved particular. You fully seems
stand nay own point walls. Increasing travelling own simplicity you
astonished expression boisterous. Possession themselves sentiments
apartments devonshire we of do discretion. Enjoyment discourse ye
continued pronounce we necessary abilities.
* Methodology
This is the next top-level section. There are no sub-sections under this.
* Results
This is the third top-level section. There are sub-sections under this.
** Result 1
This is a sub-section under section Results.
** Result 2
This is another sub-section under section Results
* Conclusions
This is the next and final top-level section. There are  no sub-sections under it.

Org handles these headlines beautifully. With your cursor on the headline, pressing tab folds-in the contents of a headline. If you press tab on a folded headline, it opens to display the contents. If there are multiple levels of headlines, these open in stages as you repeat pressing the tab key.

When you are on a headline, pressing M-return creates a new headline at the same level (that is, with the same number of stars). Once you are on the new headline, a tab moves it to a lower level (that is, a star is added), and shift-tab moves it to a higher level (that is, a star is removed).

When I start writing a paper, I start with a tentative headline/section structure, and then start filling in the content under each headline, and modify the section structure, if needed, as the paper develops.

(Further reading, Headlines in Org manual)

Itemised lists

Following syntax produces unordered (bulleted) lists:

+ bullet
+ bullet
  - bullet2 1
  - bullet2 2
+ bullet
+ bullet

This is how this list shows up in the final document

  • bullet
  • bullet
    • bullet2 1
    • bullet2 2
  • bullet
  • bullet

Following syntax produces ordered/numbered lists:

1. Item 1
2. Item 2
  1) Item 2.1
  2) Item 2.2
     1) Item 2.2.1
3) Item 3

This is how the ordered list shows up in the final document.

  1. Item 1
  2. Item 2
    1. Item 2.1
    2. Item 2.2
      1. Item 2.2.1
  3. Item 3

Following syntax produces a description list:

+ item1 :: description of item1
+ item2 :: description of item2
  - bullet1 under item2
  - bullet2 under item2

This is how this list shows up in the final document

item1
description of item1
item2
description of item2
  • bullet1 under item2
  • bullet2 under item2

Note that:

  • In unordered/description lists, + and - signs are interchangeable.
  • Similarly, in ordered lists 1. and 1) are interchangeable.
  • Levels of bullets and numbering are determined by indentation.
  • Different types of lists can be mixed using numbers and bullets for different levels.
  • If the cursor is on a line that is part of an itemised list, M-return inserts a new line with a bullet/number below the present line with the same level of indentation.
Inserting footnotes
  • To insert a footnote at any point, use C-c C-x f
  • To reorder and renumber footnotes after inserting a footnote in a text that already has some footnotes after the point where a new footnote is being inserted, use C-u C-c C-x f S
Tables
  • Sample code

    We shall directly create only those tables in Org that present content not being produced through statistical analysis. For tables that are created through statistical analysis, we shall embed R programs rather than the tables themselves. This is discussed in Section Org-mode and R of this guide.

    The following sample code produces a reasonably formatted table, with a numbered title above the table and a name for cross-referencing the table from the text anywhere in the document.

    See Table [BROKEN LINK: table-yield] for an illustration of how this table shows up in the final document.

    #+NAME: table-yield
    #+CAPTION: Simple table created using LaTeX tabular environment
    #+attr_latex: :environment tabular :width \textwidth :align lrr
        | State          | Average yield | Average income |
        |----------------+---------------+----------------|
        | Madhya Pradesh |           672 |          13000 |
        | Haryana        |           300 |          25000 |
        | Punjab         |           260 |          35000 |
    
  • Table editor

    Org-mode has an in-built table editor, which is very simple to use.

    • Tables in Org have columns separated using |.
    • Once you create the first row by separating columns using |, pressing tabs takes you from the first column to the next. Org automatically aligns the columns.
    • At the end of the row, pressing tab again, creates a new blank row. You can also create a new blank row by pressing return anywhere in the last row.
    • For creating a horizontal line anywhere, type |- at the starting of the line, and press tab.
    • Contents of each cell are aligned automatically by Org.
    • To delete a row, use C-k.

    Org provides various commands for manipulating the design of tables. Table 6 provides the most important ones. Note that Table 6 is created using Org mode. It also gives you an idea of how the table would look eventually.

    Table 6: Commands to manipulate tables in Org
    Command Description
    M-<left> Move the column left
    M-<right> Move the column right
    M-S-<left> Delete the current column
    M-S-<right> Insert a new column to the left of the cursor position
    M-<up> Move row up
    M-<down> Move row down
    M-S-<up> Delete the current row or horizontal line
    M-S-<down> Insert a new row above the current row

    For more commands for manipulating tables, see this section of the Org manual. In particular, you may want to look at spreadsheet-like functions of the table editor.

    One limitation of Org is lack of support for merging of cells in a Table.

Images

You can insert images in documents as follows

[[file:path-to-file/a.jpg]]

You should do this for images that you already have, and you just want to insert them in the document. For graphs produced by R, we shall embed the code instead, so that the graph is generated and inserted automatically.

Captions and cross-references

We would like to give a title to our tables and images. And we would like to be able to refer to them from the text. These are achieved by adding two lines above every table and image.

  • A line starting with #+CAPTION: placed just above a table or a figure adds a title to it. All Tables and Figures titles are automatically numbered.
  • For referring to these Tables and Figures in the text, we shall name each table and figure in a line starting with #+NAME: as below.

To illustrate, for inserting an image, with a caption and a name, this is what we shall do.

#+NAME: literacy-rate
#+CAPTION: Percentage of literate men and women, by country (per cent)
file:a.jpg

Similarly, a table will be inserted as follows.

#+NAME: literacy-rate-table
#+CAPTION: Percentage of literate men and women, by country (per cent)
| Country    | Men | Women |
|------------+-----+-------|
| India      |  75 |    43 |
| Bangladesh |  83 |    63 |
| Rwanda     |  77 |    60 |

To refer to the Table above in the text, write Table [[literacy-rate-table]]. As an illustration, see the following sentence.

Tables literacy-rate-table and health-table, and Figure
literacy-figure, show the level of underdevelopment.

By default, all objects with captions are numbered, and names are used to anchor cross-references. When the formatted output is produced, all the references would be automatically converted to appropriate numbers. If new objects are inserted in the paper, numbering will be adjusted automatically when you create the formatted output.

Org-mode and R

Configuration

Following code in research-toolkit.org enables Org to run different types of code. If you have installed research-toolkit.org as specified in [BROKEN LINK: customemacs], these are already enabled.

I have included here the languages that I commonly use. See Org manual, if you would like to add any more.

(org-babel-do-load-languages
   'org-babel-load-languages
   '((R . t)
     (org . t)
     (ditaa . t)
     (latex . t)
     (dot . t)
     (emacs-lisp . t)
     (gnuplot . t)
     (screen . nil)
     (shell . t)
     (sql . nil)
     (sqlite . t)))
Special lines for R

Org allows you to run multiple R sessions simultaneously, if you are working on two documents side by side, and would like to keep statistical work for the two separately.

This is done by naming the R session which a particular Org file is linked to. All R code in this file would be run in the specified R session. You could have, at the same time, another R session, with a different name, being called by another Org buffer.

We can give a name to the R session (let us say, my-r-session) that our Org buffer should be linked to by adding the following line at the top (in the preamble, that is).

#+property: session my-r-session
Embedding R code in an Org document

Org uses ESS (emacs-speaks-statistics) to provide a fully functional, syntax-aware, development environment to write R code. R code is embedded into Org as a source block. The basic syntax is

#+NAME: name_of_code_block
#+BEGIN_SRC R <switches> <header-arguments>

  <Your R code goes here.>

#+END_SRC

This is how source blocks are created.

  • First write the lines starting with #+NAME, #+BEGIN_SRC and #+END_SRC.
  • Then with your cursor in between the BEGIN_SRC and the END_SRC lines, give the command C-c ’ (that is, press Ctrl-C, release, and press ’).
    • This would open a new buffer using ESS mode. If you type your code in this buffer, you will see that ESS is syntax-aware and nicely highlights R code.
    • ESS also allows you to run (evaluate) the code that you write, to test what your code is doing. Use C-j for evaluating a single line of code, C-b for evaluating the whole ess buffer, or C-r for a marked region within the ess buffer.
  • Once you have finished writing a code block and tested it, press C-c ’ again to come back to your Org buffer.
  • In your Org buffer, with your cursor in a source-block, press C-c C-c to evaluate the whole code block and have the results included in your document.
  • You can always edit your source code by opening a temporary ESS buffer using C-c’
Code blocks that read data and load functions for later use in the document without any immediate output

I normally have one or two code blocks that read the data I am going to use, call the libraries that I use, and define a few functions of my own that I plan to use. I want this code block to be evaluated, so that these data, libraries and functions become available in my R environment. But no output from such code blocks is expected to be included into the document.1

Code block 1 is an example of such a code block. Note :results value silent switch used in the #+begin_src line.

#+NAME: readdata-code
#+BEGIN_SRC R :results value silent

read.data("datafile1.csv",sep=",",header=T)->mydata1


#+END_SRC
Code blocks that produce results in the form of a table

Most of code blocks in my papers fall in this category. The code block may use data and functions made available by previous code blocks, read some new data and may load some new functions. The code block does some statistical processing. The last command of the code block produces an object (for example, a data.frame) that is included in the document as a Table.

For example, the code block 1 below uses mydata1 read in the previous code block, reads a new dataset, and processes them to create a table that shows average BMI by country.

#+NAME: bmi-table-code
#+BEGIN_SRC R :results value :colnames yes :hline yes
aggregate(height~Country,data=mydata1,mean)->a1
read.data("datafile2.csv",sep=",",header=T)->mydata2
aggregate(weight~Country,data=mydata2,mean)->a2
merge(a1,a2,by="Country")->a1
a1$weight/a1$height->a1$BMI
subset(a1,select=c("Country","BMI"))
#+END_SRC

You can evaluate this code using C-c C-c. When you do that, it produces the output, and places it immediately below the code block. The results display the output of the code under a line that looks like below

#+RESULTS: bmi-table-code

Note that the results are tied to the code block using the name of the code block. Every time you go to the source code block and press C-c C-c, the code will be evaluated again and the results will be updated.

On top of the line starting with #+RESULTS:, we shall add two more lines, to give the table a title and a name. Note that both the code block and the result of the code block have separate names.

#+NAME: bmi-table-output
#+CAPTION: Average BMI, by country
#+RESULTS: bmi-table-code

Like any Org table, you can cross-refer to this table using [[bmi-table-output]].

Code blocks that produce a graph to be included in the document

These code blocks can have a series of commands. The last command produces a graph that we would like to be included in the document.

The following code shows an example of a code block that produces a graph.

#+NAME: mygraph-code
#+BEGIN_SRC R :results output graphics :file bmi2.png :width 825 :height 1050 :fonts serif

#+END_SRC

As before, for creating your graph, you first write the #+NAME, BEGIN_SRC and the END_SRC lines, and then go into a temporary ESS buffer by using C-c ’.

  • Once in this temporary ESS buffer, you can write the R commands for making your graph.
  • As you write, you can evaluate the commands using C-j, C-r and C-b and see what your output looks like.
  • The output is displayed on your screen using the default graphic device used by R (X11, quartz or windows graphic device depending upon your operating system).
  • Once you have finalised your graph, you press C-c ’ and come back to the Org buffer.

Note that creation of the image file is left to appropriate switches in the #+BEGIN_SRC line. Org automatically chooses appropriate graphic device to produce the file. When you evaluate this code using C-c C-c, the results are displayed below the code block as follows.

#+RESULTS: mygraph-code
bmi2.png

Note that, taking the file name from our #+BEGIN_SRC line, a file called bmi2.png was automatically created and linked, so that the graph would be inserted in the document when you produce the formatted output.2 Every time you evaluate the code using C-c C-c, the underlying image file containing the graph is overwritten by a new file.

As with the tables, we shall add a caption and a name to it as follows

#+NAME: my-bmi-graph
#+CAPTION: Average BMI, by Country
#+RESULTS: mygraph-code
gini.png

You can now refer to this graph in the text using [[my-bmi-graph]].

5.2 Formatting tables and figures for print

5.3 Citation management using org-ref and biblatex

  • A sample biblatex database to get you started can be taken from here. For now, you may just download it, save it and add to it. Please note that the source file on github keeps getting updated. Eventually, you may want to manage your own biblatex database using something like git.
  • There is also another Emacs mode – ebib – that can be used for maintaining your biblatex database. To start ebib, use M-x ebib

5.4 Resources

  • Dominik, Carsten and others (2016), The Org Manual, Free Software Foundation
  • Gentleman, Robert and Duncan Temple Lang, “Statistical Analyses and Reproducible Research”, Journal of Computational and Graphical Statistics, 16 (1), pp. 1-23
  • Kitchin, John (2014), The org-ref Manual
  • Knuth, Donald E. (1992), Literate Programming, Center for the Study of Language and Information, Stanford.
  • Rawal, Vikas (2014), Using Emacs, Org-mode and R for Research Writing in Social Sciences: A Toolkit for Writing Reproducible Research Papers and Monographs, http://archive.indianstatistics.org/tools/orgpapers.pdf; a revised but incomplete version may be seen at https://github.com/vikasrawal/orgpaper/blob/master/orgpapers.org
  • Rossini, Anthony and Friedrich Leisch. (2003) “Literate Statistical Practice”, University of Washington Biostatistics Working Paper Series , Available at: http://works.bepress.com/anthony_rossini/6/
  • Schulte, Eric and Dan Davison (2011), “Active Documents with Org-Mode”, Computing in Science & Engineering.
  • Schulte, Eric, Dan Davison, Thomas Dye and Carsten Dominik (2016), “A Multi-Language Computing Environment for Literate Programming and Reproducible Research.” Journal of Statistical Software, 46 (3), 1 – 24.

Footnotes:

1

For libraries and functions that you need to call, it is even better to include them in a .Rprofile file in your working directory. These libraries and functions would then be called when R is started, and not each time you evaluate code blocks in your document.

2

Of various image formats, I find that png files are most versatile. png files support transparency, and are rendered well both on the web and in print. You can also specify jpeg or pdf files. pdf files for images work very well if you are only going to produce a pdf document.

Author: Vikas Rawal

Created: 2020-03-30 Mon 05:15

Validate