EP624: Statistical Programming and Reproducible Research
Table of Contents
- 1. System Setup
- 2. Introduction
- 3. Emacs
- 4. R Programming
- 5. Reproducible Research
1 System Setup
1.1 Operating System
- Install any latest distribution of GNU/Linux on your computer. I use Manjaro, but feel free to install any.
- If you have a Mac, you can continue using OSX, and install the required applications on OSX.
1.2 Applications
GNU/Linux update
Commands below are for setting a Manjaro/Arch Linux computer. For other distributions (for example, Debian/Ubuntu-based), use their package managers to install the same applications.
#+CAPTION: Setting up a Manjaro/Arch Linux computer ## These commands should be run one line at a time. Do not run it as a script. ## Modify to suit your requirements sudo pacman -Ssy sudo pacman -S emacs firefox texlive-most biber nmap openssh yay gimp phpmyadmin evince git git-lfs mpv vlc r htop neomutt isync pass mariadb postgresql aspell-en termite pass notmuch go tk afew ack keychain chromium msmtp unison python-pip python-imaging ghostscript hplip geos youtube-dl evince pandoc-citeproc sudo pacman -R palemoon sudo pacman -S rofi #Only if you use i3wm sudo pacman -U https://www.archlinux.org/packages/community/x86_64/browserpass/download/ sudo pacman -S browserpass-firefox browserpass-chromium systemctl enable sshd yay -S dropbox yay -S megasync yay -S keybase-bin yay -S skype yay -S signal-desktop yay -S mimio yay -S libreoffice-extension-languagetool install_pulse
OS-X
The built-in Emacs on OS-X is an older version, and it would be a good idea to install the latest version instead.
The best option is to install it via homebrew. I like the version available from railwaycat/emacsmacport tap (https://github.com/railwaycat/homebrew-emacsmacport).
First install Homebrew. Then, from the terminal, give following commands:
$ brew tap railwaycat/emacsmacport
$ brew install emacs-mac
Install MacTeX from http://www.tug.org/mactex/
Install R from http://www.r-project.org
Also install pandoc and pandoc-citeproc using homebrew:
$ brew install pandoc pandoc-citeproc
1.3 Create your ssh keys
ssh or Secure Shell is a protocol for securely transferring data over networks.
Use these commands to create your ssh keys.
# If you do not already have ssh keys, create them as below mkdir ~/.ssh ssh-keygen -t rsa #No need to use a password, when asked. Just press enter for all the questions (accept defaults).
If you have just accepted the defaults, these commands would have created two files in .ssh
directory: ~/.ssh/id_rsa
and ~/.ssh/id_rsa.pub
.
id_rsa is your private key which much be kept safe and is never shared with anyone.
id_rsa.pub is the matching public key which is shared with any machine with which you want to have secure data transfers.
1.4 Emacs configuration
In your github.com account, go to settings, click on “SSH and GPG keys”, and add a new ssh key. In the box that opens, in Title add your name, and the name of your machine, and in the box below, paste the contents of the public key that was created in the previous step (~/.ssh/id_rsa.pub
).
Once that is done, following commands will configure Emacs for the requirements of this course.
git clone git@github.com:vikasrawal/prelude-ep624.git #git clone https://github.com/vikasrawal/prelude-ep624.git # Use may use this if you do not have a github account with your public ssh key added to it. mv .emacs.d .emacs.d.old # Renames if you already have a .emacs.d directory mv prelude-ep624 .emacs.d cd .emacs.d git checkout ep624 cd
1.5 Access to your Github repositories new
Git is a version control system. Created originally by Linus Torvaldes, it is the most widely used version control system for collaboration and tracking by software developers. Since most of our work – the org files, the R scripts, the LaTeX files – would be in plain text files, we can use git not only to track evolution of our programs but also of our research papers, dissertations and books.
Apart from locally hosting your repositories on your own machine, you can also keep your repositories on platforms like Github and Gitlab. If you want to keep your work-in-progress research on these platforms, you would need rights to create private repositories. While Gitlab provides it by default, you can enable it on Github by claiming educational discount.
I will be using git and Github to run this course. Which means that you will need to know basic git commands to submit your assignments and get my feedback. The idea is to use this to familiarise you with basic git commands.
I will create a separate repository for each student on Github, and invite you as a collaborator. You will clone it on your computer using the commands mentioned below, create a separate sub-directory for each assignment and do all your work in these sub-directories. Once you have completed some work and want to share it with me, you will commit it and push it to the github repository. You commit anything to the repository in two steps. You first add it to the staging area and then you commit it.
1: ## Basic git workflow for this course 2: ## The following command, after modifying the name of your repository, 3: ## clones your repository. A directory -- 2020_yourname -- will be 4: ## created in the directory from where the command is given 5: ## (your home directory in this case). All future work for the course 6: ## should be done in 2020_yourname directory. 7: 8: cd ~ 9: git clone --recurse-submodules git@github.com:ep624/2020_yourname.git 10: 11: cd ~/2020_yourname 12: rm -Rf resources 13: git clone git@github.com:ep624/resources.git 14: 15: cd ~/2020_yourname/resources 16: git pull origin master 17: 18: 19: ## create a separate subdirectory in 2020_yourname for each of your assignments. 20: cd ~/2020_yourname 21: mkdir midterm 22: 23: ## To add a file to the staging area (a new file, or a file that has been modified) 24: git add mainfile.org 25: git add figure1.png ## the repository accepts png files as git lfs objects. 26: 27: ## To commit the file 28: git commit -m "A message describing this commit" 29: 30: ## To push it to github repository 31: git push 32: 33: ## To pull from github the work of other collaborators (my comments/edits, for example) 34: ## and merge it into your local repository 35: git pull 36:
A separate team has been created on Github for your batch. You should accept the invite and join the team.
1.6 Learning how to set up your system
It would be useful for you to learn how to
- install a Linux system.
- use package manager of your Linux distribution to install software.
- customise Emacs.
While we have given some instructions here, and provided some ready-made configuration files for you to deploy, you should learn what these are doing, how to modify them, and how to extend them.
2 Introduction
2.1 Introduction to the course
Why should you use programming for statistical analysis?
- Programming makes your results reproducible and verifiable.
- It is easier to revise your estimations, graphs, tables etc if you write and save your programs.
- Updating your analysis, for example when new data become available, is easy, as the same/modified programs could be run on new data to get new results.
Why use R?
- R is open-source and free of cost.
- Will always remain available.
- Development is fast; tools for using new, cutting-edge statistical techniques become available as add-on packages.
2.2 Introduction to shell programming
Introduction to *nix file system
- *nix file system is structured as a tree with “/” as the root.
/home/yourusername
is your $HOME. ~ is a shortcut to this directory.- On Mac/OS-X, $HOME is located at
/Users/yourusername
. ~ is mapped to this directory.
- On Mac/OS-X, $HOME is located at
Important Bash Commands to Learn
Help
- man
- manual
File management
- ls
- list files
- mkdir
- make directory
- cp
- copy
- mv
- move
- rm
- remove/delete
- ssh, scp, rsync
- commands used for secure copy to remote locations
- find
- find a file
- ack
- find files containing a string
Navigation
- cd
- change directory
- pwd
- path of working directory
Seeing/modifying files on the commandline
- cat, more, less, head, tail
- commands to see contents of text files
- awk, sed
- tools to process text files line by line.
- diff
- to compare two text files
Redirection
- \(>\)
- redirect output to a file (if the output file already exists, it will be replaced)
- \(>>\)
- append output to a file (if output file does not exist, it will be created)
- \(|\)
- pipe output to another command
Others
- grep
- filter lines of file containing a string
- “ip address show”
- to find your ip address
- sort
- sort lines
- uniq
- to find unique adjacent lines
- wc
- count number of bytes, words and lines
3 Emacs
3.1 Introduction to Emacs
Emacs is an powerful cross-platform editor which can be used for all kinds of tasks that you do on your computer. Although Emacs is different from every other editor you are likely to have used, learning to use Emacs would be a huge asset for your computing abilities. A good way to learn Emacs, to ask, for each activity you do on your computer, can I do this using Emacs? You will be amazed at how many times the answer is: yes, and more efficiently than in your current application.
This section does not provide a detailed guide to the use of Emacs. In this guide, I will just provide a minimal set of basic commands in emacs to get you started. This is a minimal but sufficient set to be able to work. I expect that you would learn more commands as you start using emacs.
3.2 Notations
In emacs, a buffer is like a tab in an application. It is normal to have several buffers open at the same time. Each file opens in emacs as a buffer. Buffers could also have processes like R running in them. Emacs displays any messages for you in a separate buffer.
Most commands in emacs are given using the Control (ctrl) or the Meta
(often mapped to alt) keys. Control key is usually referred to as C-
and the Meta key as M-
. So a command C-c
means pressing Control
and c together. Command M-x
means pressing Meta and x
together. Everything is case-sensitive. So M-X
would mean, pressing
Meta, Shift and x together. C-c M-x l
would mean pressing C-c,
release, then M-x, release, and then l.
3.3 Basic commands
Table 1 gives the commands that are the most important. This is a minimal set, commands that you should aim to learn as soon as possible. There are many more, which you will learn as you start using emacs.
All commands have a verbose version that can be used by pressing M-x
and writing the command. For example, M-x find-file
to open a file.
All major commands are also mapped to a shortcut. For example, instead
of typing M-x find-file
to open a file, you can say C-x C-f
. I
remember shortcuts for commands that I use most frequently. For
others, I use the verbose versions. Over time, one learns more
shortcuts and starts using them instead of the verbose versions.
Description | Verbose command | Shortcut |
---|---|---|
M-x followed by |
||
Opening files, saving and closing | ||
Open a file | find-file |
C-x C-f |
Save the buffer/file | save-buffer |
C-x C-s |
Save as: prompts for a new filename and saves the buffer into it | write-named-file |
C-x C-w |
Save all buffers and quit emacs | save-buffers-kill-emacs |
C-x C-c |
Copy, Cut and Delete Commands | ||
Delete the rest of the current line | kill-line |
C-k |
To select text, press this at the beginning of the region and then take the cursor to the end | set-mark-command |
C-spacebar |
Cut the selected region | kill-region |
C-w |
Copy the selected region | copy-region-as-kill |
M-w |
Paste or insert at current cursor location | yank |
C-y |
Search Commands | ||
prompts for text string and then searches from the current cursor position forwards in the buffer | isearch-forward |
C-s |
Find-and-replace: replaces one string with another, one by one, asking for each occurrence of search string | query-replace |
M-% |
Find-and-replace: replaces all occurrences of one string with another | replace-string |
|
Other commands | ||
Divide a long sentence into multiple lines, each smaller than the maximum width specified | fill-paragraph |
M-q |
Window and Buffer Commands | ||
Switch to another buffer | switch-to-buffer |
C-x b |
List all buffers | list-buffers |
C-x C-b |
Split current window into two windows; each window can show same or different buffers | double-window |
C-x 2 |
Remove the split | zero-window |
C-x 0 |
When you have two or more windows, move the cursor to the next window | other-window |
C-x o |
Canceling and undoing | ||
Abort the command in progress | keyboard-quit |
C-g |
Undo | undo |
C-_ |
3.4 Resources
- A Tutorial Introduction to GNU Emacs
- A. J Rossini and Richard M Heiberger and Rodney A Sparapani and Martin Mächler and Kurt Hornik (2004), “Emacs Speaks Statistics: A Multiplatform, Multipackage Development Environment for Statistical Analysis”, Journal of Computational and Graphical Statistics, 13(1), 247-261
- Heiberger, Richard M (2001), “Emacs Speaks Statistcs: One Interface, Many Programs”, in K. Hornik and F. Leisch (eds.), Proceedings of the 2nd International Workshop on Distributed Statistical Computing, March 15-17, Vienna, Austria
- Rossini, A.J., Heiberger, R.M., Hornik, K., Maechler, M., Sparapani, R.A., Eglen, S.J., Spinu, V., Henry, L. (2016), ESS: Emacs Speaks Statistics
- Rossini, Anthony, Martin Maechler, Kurt Hornik, Richard M. Heiberger, (2001) “Emacs Speaks Statistics: A Universal Interface for Statistical Analysis”, University of Washington Biostatistics Available at: http://works.bepress.com/anthony_rossini/3/
4 R Programming
4.1 Basics
- R is a case sensitive language. Do not mix upper case and lower case characters.
- You should have a separate directory for each project/article you work on.
- By default, the work you do is saved in this working directory in a file called .RData
- All R commands are followed by
()
. Arguments that the command needs are given inside the parentheses. Empty parentheses have to be used even if a command needs no arguments. For example,q
will not work as a commandq()
is a commandq(save="yes")
is a command with an argument that asks the R environment to be saved before quitting.
->
or<-
is used to create an object using output of a command.c(1:5)->col1
creates an object col1 containing the output of the commandc(1:5)
.col1<-c(1:5)
also does the same thing.
- In this guide, all commands are formatted using a mono-spaced font to distinguish them. Such lines are numbered. But line numbers are not a part of the command. In the boxes, text starting with a
#
is used to write comments for the reader. These comments are not part of the commands. You do not need to type them into R as any line starting with a#
is treated as a comment in R and not executed. When you write your own programs, use lines starting with#
to write your own comments that may be necessary to make your program intelligible to your collaborators.
1: #+CAPTION: Some basic commands in R 2: # Text after # is a comment to explain what the command does 3: q() # Quit R; prompts for saving the workspace. 4: ?command # Open help for 'command'; q to quit help. 5: ??keyword # List commands with 'keyword' in the title 6: help.search("keyword") # List commands with 'keyword' 7: # anywhere in the docs 8: setwd("c:/project1") # On a Windows machine, sets the working directory to c:/project1 9: setwd("~/proj2") # On a Linux or OSX machine, sets the working directory to ~/proj2 10: getwd() # Reports the present working directory
4.2 Data structures in R
Vectors
- A vector is a column of data. The most basic command for creating a vector (think of a column) is
c()
.c(1:50)
creates a vector with numbers 1 to 50.
- If vec1 is a vector, vec1[i] refers to ith element of the vector.
- A vector can be a numeric vector, a character vector, or a factor (categorical) vector. – character variables are text strings, – numeric variables are numbers – factors are categorical variables, in which each observation belongs to a category. Category labels may be numeric or characters. Even when they are numeric, they are just labels for categories and should not be confused as numbers.
A numeric vector is a collection of numbers.
1: 2: # Line 3 creates a numeric vector called col1 with 3: # 13 elements: numbers 1 to 10, 11, 14.5 and 6. 4: c(1:5,11,14.5,6)->col1 5: col1 6: # Line 6 creates a character vector called col2 7: # with four elements, each being a string of 8: # characters 9: c("Sheela","Manoj","Karim","John")->col2 10: col2 11: # Line 9 creates a factor vector in which each 12: # element belongs to category "M" or "F". 13: as.factor(c("M","M","M","F","F","M","M"))->col3 14: col3 15: # Line 13 reads a factor vector and outputs a 16: # vector that is a list of categories. In this 17: # case, the output would be: [1] "F" "M" 18: levels(col3) 19: # Line 15 can be used to check what type of object 20: # is col3. 21: class(col3)
Results of running the code above
[1] 1.0 2.0 3.0 4.0 5.0 11.0 14.5 6.0 [1] "Sheela" "Manoj" "Karim" "John" [1] M M M F F M M Levels: F M [1] "F" "M" [1] "factor"
Data frames
- A data frame is a collection of vectors. It is a table that stores variables in columns and observations in rows. You could think of data frames as equivalent to a spreadsheet with many columns. When you import a data file, it would normally be imported as a data frame.
- However, there is one important difference. In an R data frame, each column is a particular type of vector (numeric, character or factor). You cannot mix the type of data in a given column. A spreadsheet, in contrast, can contain anything in any cell.
- Just as for a vector col1, col1[i] refers to the ith element, in a data frame data[i,j] refers to the element in ith row and jth column. Note that the first argument (before the comma) specifies the rows and the second argument (after the comma) specifies the columns. This is one way to subset data frames.
data[1:9,c(2,5,6)]
picks the first nine rows and column nos 2, 5 and 6.data[data[,1]->10000,]
selects all rows in which the first column is greater than 10000, and for these rows, shows all columns. Note that not specifying anything after comma is interpreted as all columns. Similarly, nothing before the comma would be interpreted as all rows.
data1$var1
refers to vector var1 inside data frame data1.data1$var1^2->data1$var5
calculates square of var1 in data1 and inserts it as a new vector (var5) in data1.data1$var1+2*data1$var2->data1$var2
takes var1 and var2 from data1, computes var1+2*var2, and replaces var2 in data1 with the computed variable.
- R provides a large number of commands to manipulate data frames in various ways. You should master these. See Box 1 to learn commands for selecting subsets of data frames, match merging data frames, appending data frames, adding rows and columns to data frames, and reshaping data frames.
## Selecting rows using conditions. Conditions can be ## specified using operators such as: ## ==, !=, <, <=, >, >=, is.na, is.nan subset(data1,var1>10000)->data2 ## Selecting columns (may be combined with argument for ## selecting rows). subset(data1,select=c("var1","var5","var10"))->data2 ## merging data frames using variables var1 and var2, ## selecting only matching rows of data3 and data4. merge(data3, data4, by=c("var1","var2"))->data5 ## merging data frames using variables var1 and var2, ## selecting all rows of data3 and matching rows of ## data4. Rows with no matching information from data4 ## will have NAs in columns coming from data4 merge(data3, data4, by=c("var1","var2"), all.x=TRUE)->data5 ## merging data frames using variables var1 and var2, ## selecting all rows of data4 and matching rows of ## data3 merge(data3, data4, by=c("var1","var2"), all.y=TRUE)->data5 ## merging data frames using variables var1 and var2, ## selecting all rows of data 3 and data4. merge(data3, data4, by=c("var1","var2"), all=TRUE)->data5
4.3 Statistical Graphics with ggplot2
Showing statistical information in an appropriate graphical form can help reveal the underlying patterns and trends. This is the purpose of using statistical graphs. Graphs are not made to add colour to your report. Spreadsheet software generated chartjunk has increasingly been passed off as statistical graphs. tufte2001visual remarked, “graphical decoration, which prospers in technical publications as well as in commercial and media graphics, comes cheaper than the hard work required to produce intriguing numbers and secure evidence”.
The ggplot2 package provides an extremely powerful system for making statistical graphs. The ggplot2 package is based on Leland Wilkinson’s The Grammer of Graphics wilkinson2006grammar. It can be used for making a large variety of statistical graphs as well as maps. Although it is somewhat complex, mastering ggplot2 would give you a big advantage in making interesting statistical graphs.
This guide would give you a basic idea of the ggplot2 syntax. You should look at the ggplot2 website for more examples. You are also advised to read wickham2016ggplot2 and chang2012r. These are expected to be used as ready reckoners.
In ggplot2, graphs are made in steps. You start from a basic specification of the graph, and then add/modify the details.
Step 1: Specify the aesthetic mappings
In the first step, you map, through aesthetic mappings, selected variables in your data to different scales. ggplot2 provides various types of scales for mapping your variables. The most important of these are the horizontal x axis, the vertical y axis, colour, size, linetype and shape (of a point). Continuous variables or discrete variables(for example, factors) can be mapped to these. Only discrete variables can be mapped to shape and linetype.
1: ggplot(data,aes(x=year,y=income,colour=State))->p
You could also specify colour, size and shape independently of aesthetics if you want to specify a specific colour, shape or size but keep them constant for all the data points.
1: ggplot(data,aes(x=year,y=income,size=State), 2: colour="darkred")->p
Step 2: Specify the type of graph you want
In the second step, you specify the type of graph you want to make. These are called geometric objects, or “geom”, in the ggplot2 lexicon. Where necessary, you may also specify a statistical transformation that should be done while plotting (using commands that typically start with a stat_). Table 2 provides a list of various types of graphs/geometric objects that can be made with ggplot2, and the associated commands.
2: # We start from the object 'p' created in Step 1 3: # and use '+' to add to it/modify it. 4: p+geom_point()->p #This makes a scatter plot 5: p+geom_line()->p #This adds a line to the plot 6: p # At any stage, doing this would actually 7: # plot p on the screen.
Command | Type of graph |
---|---|
geom_abline, geom_hline, geom_vline, | Reference lines: horizontal, vertical, and diagonal |
geom_bar, geom_col, stat_count, | Bars charts |
geom_bin2d, stat_bin_2d, | Heatmap of 2d bin counts |
geom_blank, | Draw nothing |
geom_boxplot, stat_boxplot, | A box and whiskers plot (in the style of Tukey) |
geom_contour, stat_contour, | 2d contours of a 3d surface |
geom_count, stat_sum, | Count overlapping points |
geom_density_2d, stat_density_2d, | Contours of a 2d density estimate |
geom_density, stat_density, | Smoothed density estimates |
geom_dotplot, | Dot plot |
geom_errorbarh, | Horizontal error bars |
geom_hex, stat_bin_hex, | Hexagonal heatmap of 2d bin counts |
geom_freqpoly, geom_histogram, stat_bin, | Histograms and frequency polygons |
geom_jitter, | Jittered points |
geom_crossbar, geom_errorbar, geom_linerange, geom_pointrange, | Vertical intervals: lines, crossbars & errorbars |
geom_map, | Polygons from a reference map |
geom_path, geom_line, geom_step, | Connect observations |
geom_point, | Points |
geom_polygon, | Polygons |
geom_qq, stat_qq, | A quantile-quantile plot |
geom_quantile, stat_quantile, | Quantile regression |
geom_ribbon, geom_area, | Ribbons and area plots |
geom_rug, | Rug plots in the margins |
geom_segment, geom_curve, | Line segments and curves |
geom_smooth, stat_smooth, | Smoothed conditional means |
geom_spoke, | Line segments parameterised by location, direction and distance |
geom_label, geom_text, | Text |
geom_raster, geom_rect, geom_tile, | Rectangles |
geom_violin, stat_ydensity, | Violin plot |
stat_sf, geom_sf, coord_sf, | Visualise sf objects |
Step 3: Modify the scale
For each of these scales (x, y, colour, size and shape), you can define attributes such as the range to be displayed, the value (say of colour or shape) associated with each category in a discrete variable, the axis label, how and where to tick-mark the axis, and the type of legend to be drawn.
6: p+scale_x_continuous("Financial Year", 7: limits=c(1990,2012), 8: breaks=c(1990,1995, 9: 2000,2010))->p 10:
Step 4: Modifying the overall appearance of the graph
ggplot2 provides various themes, and different elements to change them, to tweak the overall look of the graph. Table 3 gives a list of different aspects of a theme that can be modified. Table 4 lists different attributes of text and theme elements that can be modified.
Name | Description | Element type |
---|---|---|
text | All text elements | element_text() |
rect | All rectangular elements | element_rect() |
line | All line elements | element_line() |
axis.line | Lines along axes | element_line() |
axis.title, axis.title.x, axis.title.y | Appearance of axis labels: both, x and y | element_text() |
axis.text, axis.text.x, axis.text.y | Appearance of tick labels: both, x and y | element_text() |
legend.background | Background of legend | element_rect() |
legend.text, legend.title | Legend item and title appearance | element_text() |
legend.position | Position of the legend | “left”,“right”,“bottom”,“top”, or two-element numeric vector if you wish to place it inside the plot area |
panel.background | Background of plotting area | element_rect() |
panel.border | Border around plotting area | element_rect(linetype=“dashed”) |
panel.grid.major, panel.grid.major.x, panel.grid.major.y | Major grid lines: both, vertical and horizontal | element_line() |
panel.grid.minor, panel.grid.minor.x, panel.grid.minor.y | Minor grid lines: both, vertical and horizontal | element_line() |
plot.background | Background of the entire plot | element_rect(fill = “white”, colour = NA) |
plot.title | Title text appearance | element_text() |
strip.background | Background of facet labels | element_rect() |
strip.text, strip.text.x, strip.text.y | Text appearance for facet labels: both, horizontal and vertical | element_text() |
Theme elements | Text geoms | Description |
---|---|---|
family | family | Helvetica, Times, Courier |
face | fontface | plain, bold, italic, bold.italic |
colour | colour | Color (name or“#RRGGBB”) |
size | size | Font size (in points for theme elements; in mm for geoms) |
hjust | hjust | Horizontal alignment: 0=left, 0.5=center, 1=right |
vjust | vjust | Vertical alignment: 0=bottom, 0.5=middle, 1=top |
angle | angle | Angle in degrees |
lineheight | lineheight | Line spacing multiplier |
9: # The line below adds a title to the graph 10: p<-p + ggtitle("Title of my graph") 11: # The line below changes the x axis title 12: p<-p+theme(axis.title.x=element_text(size=16, 13: lineheight=.9, 14: family="Times", 15: face="bold.italic", 16: colour="red"))
Step 5: If needed, use faceting to show multiple plots in a panel
ggplot2 also provides facetting tools, to be used if you want to show multiple graphs in a panel. gglot2 provides two functions to split the graph into different categories: facet_wrap and facet_grid.
14: # The line below creates a separate graph for 15: # each value of category, and arranges that 16: # in a matrix of three columns, and as many 17: # rows as required. 18: p+facet_wrap(~category,ncol=3)
15: # The line below creates a separate graph 16: # for each combination of category1 17: # and category2. Graphs are arranged in 18: # a grid with values of category1 making rows 19: # and values of category2 making columns. 20: p+facet_grid(category1~category2,ncol=3)
4.4 Making tables using the data.table package
Many commands in R, and in several add-on packages, can be used for working with and creating statistical tables. Among the most popular are the apply
family of functions and plyr, dplyr and reshape2 packages. The data.tables package may be considered a swiss knife for working with and producing statistical tables. Of all the tools, this is one of the most flexible, versatile and efficient.
- data.table does not work directly on data frames but needs you to create a data frame like object called “data.table”.
You can convert an existing data frame to a data.table using
as.data.table(data1)->data1
- Or you can use
fread("delimited_file_name")->data1
to read a delimited file and directly create a data table.fread
is fast and efficient.- It also automatically detects separating character and variable names.
- All commands that work on a data frame can also be used on a “data.table”. So nothing is lost by keeping your data as a data table rather than as a data frame.
- The basic syntax of data.table is as follows:
DT[i,j,by]
. This subsets DT using i, calculates j, groups using “by”- i (subsetting rows)
- a set of logical expressions to subset DT.
- columns can be referred to directly by their names.
data2.dt[sector==1]
subsets rows for which variable sector==1.
- j (selecting columns, summarising)
[,col]
selects a column and returns it as a vector.[,.col]
selects a column and returns the output as datatable.[,.(col1, col2)] or [,list(col1, col2)]
return two columns.[,.(colname1=col1, colname2=col2)]
simultaneously renames columns.[,.(c1mn=mean(col1),c2med=med(col2))]
calculates mean and median.
- by (separately doing the calculations for groups)
Following code computes mean and median separately for each combination of col3 and col4.
[,.(c1mn=mean(col1),c2med=med(col2)), .(col3,col4)]
- i (subsetting rows)
- In a data.table, you can define some variables as “keys”. These can be used for quickly sorting, subsetting and merging.
setkey(DT, sector)
definessector
as a key variable in DT.If
sector
inDT
is defined as a key variable, then the following command will subset rows for which the key variable takes values “Industry” or “Agriculture”.DT[c("Industry","Agriculture")]
In the usual
DT[i,j,by]
type commands, you can replaceby
withkeyby
as shown below to simultaneously define keys and sort the output.[,.(c1mn=mean(col1),c2med=med(col2)), keyby=.(col3,col4)]
Output of a data.table is also a data.table. This means, that a sequence of commands can be chained as:
# Three commands run in a sequence DT[i1,j1,by1]->DT2 DT2[i2,j2,by2]->DT3 DT3[i3,j2,by3] # Can be combined into one line # as follows DT[i1,j1,by1][i2,j2,by2][i3,j2,by3]
- It is also possible to use
DT[i,j]
to do a transformation on DT itself rather than create a new data.table.For example, to create a new column called newcol in DT, one can use j as
newcol:=columnname1*columnname2
.Note that
:=
is used to say that the transformation is to be done on DT itself.- A limitation of this usage is that only one field can be transformed at a time. Transforming multiple fields would require chaining the data.table commands.
- Sometimes you may wish to rotate a cross-table created using data table. That is, you may wish to create columns out of one or more disaggregating (keyby) variables that have been used. This can be achieved in the following way.
In the command below, the first data.table transformation computes mean by sector, sex and religion. The second data.table transformation creates all combinations of sector, sex and religion, and ensures that DT2 has rows for all combinations (including null rows). The third data.table transformation rotates the table, with religion and sex being identified as row disaggregators and the remaining keyby variable (sector) as the column disaggregator.
DT[,.(col1mean=mean(columnname1)), keyby=.(sector,sex,religion)]->DT1 DT1[CJ(unique(sector), unique(sex), unique(religion))]->DT2 DT2[,as.list(col1mean) ,by=.(sex,religion)]
- Like all data.table commands, these can be chained together into one line of code.
- data.table provides a very convenient syntax for merging data.tables.
- DT1[DT2] joins matching data from DT2 with DT1; the matches are done using their keys (presuming the keys are defined).
- DT1[DT2,on=c(var1=“var2”)] joins matching data from DT2 with DT1; matches are done using var1 of DT1 and var2 of DT2.
4.5 Analysis of Survey Data in R
4.6 Econometric Analysis in R
4.7 Exercises
COVID-19
Inside the resources directory, please give the following commands to get the COVID-19 dataset.
git submodule init
git pull --recurse-submodules
The data will be downloaded in the “resources/COVID-19/csse_covid_19_data/csse_covid_19_time_series” directory. These data are updated daily. You should be able to git pull --recurse-submodules
everyday to update the dataset.
What exploratory analysis can you do with this data-set using the commands that you have learnt so far?
NSS 75, Schedule 25.2
In the resources directory, sub-directory nss75252e has following files.
- download.sh
- Bash script that downloads the required data. Open and check what the file does before executing. Executing unknown shell-scripts given by others can be a potential security hazard. So be sure you know what you are executing. You may have to use chmod to make the file executable. The script also downloads a file called datalay75_252.XLS which gives layout of each data file.
- readdata.R
- This is an R script written by me for reading the data files. Code for reading the file R75252L03.TXT for level 3 is missing, and is for you to write. Use code for other levels as illustrations for you to learn how to read these files. Information for the variables contained in R75252L03.TXT can be found in datalay75_252.XLS
4.8 Resources
- Chang, Winston (2013), R Graphics Cookbook, OReilly
- Crawley, Michael J (2015), Statistics: An Introduction using R, Wiley
- Fox, John and Weisberg, Sanford (2011), An R Companion to Applied Regression, Sage Publications, Thousand Oaks, CA.
- Hatekar, Neeraj (2010), Principles of Econometrics: An Introduction (Using R), Sage
- Hilbe, Joseph (2013), Methods of Statistical Model Estimation, Chapman & Hall/CRC Press.
- Lumley, Thomas, Complex Surveys: A Guide to Analysis Using R, John Wiley & Sons
- Maindonald
, John and Braun, John (2003), Data Analysis and Graphics Using R: An Example-based Approach, Cambridge University Press
- Michael J. Crawley (2013), The R Book, Second Edition, John Wiley & Sons
- Murrell, Paul (2011), R Graphics, Chapman & Hall/CRC Press.
- R Core Team (2015), An Introduction to R, R Foundation for Statistical Computing,Vienna.
- R Core Team (2015), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing,Vienna.
- Spector, Phil (2008), Data Manipulation with R, Springer
- Wikham, Hadley, ggplot2: Elegant Graphics for Data Analysis, UseR Series, Springer
5 Reproducible Research
5.1 Introduction to orgmode
Org-mode basics
Preamble
An Org file has a few special lines at the top that set up the environment. Following lines are an example of the minimal set of lines that we shall use.
#+TITLE: Using Emacs, Org-mode and R for Research Writing in Social Sciences #+SUBTITLE: A Toolkit for Writing Reproducible Research Papers and Monographs #+AUTHOR: Vikas Rawal #+DATE: May 4, 2014 #+OPTIONS: toc:2 H:3 num:2
As you can see, each line starts with a keyword, and the values for this keyword are specified after the colon.
Table 5 gives details of a few major special lines that we shall use.
Keyword | Purpose |
---|---|
#+TITLE |
To declare title of the paper |
#+AUTHOR |
To declare author/s of the paper |
#+DATE |
Sets the date. If blank, no date is used. If this keyword is omitted, current date is used. |
#+OPTIONS |
There are many options you can give. These are what I find the most important Multiple options can be separated by a space and specified on the same line. |
toc:--- (Do not include a Table of contents), toc:n (Include n levels of sections and sub-sections in Table of contents) | |
H:2 (Treat top two levels of headlines as section levels, and anything below that as item list. Modify the number as appropriate) | |
num:2 (Number top two levels of headlines. Modify the number as appropriate.) |
In addition to these, we can use LaTeX specific options for formatting the pdf output, odt specific options for formatting the odt/docx output, and R specific options for setting up the R environment. These would also be specified using special lines at the top of the file. I shall provide details of some of these in the Sections where these topics are discussed.
In research monographs prepared using org, you may need quite a few lines in the preamble to set everything. These would tend to make the top of the org file look daunting. It may be helpful to store these lines in a set of separate org files, and include them in your main document using line such as this:
#+INCLUDE: path-to-file/papersetup.org
Sections and headlines
After the special lines at the top comes the main body of the Org file.
The content in any Org file is organised in a hierarchy of headlines. Think of these headlines as sections of your paper.
A headline in Org starts with one or more stars (*) followed by a space. A single star denotes the main sections, double star denote the subsections, three stars denote sub-subsections, and so on. We shall use this to create the section structure of our document. You can create as many levels of sections as you need.
See the following example. Note that headlines are not numbered. We leave section numbering for org-mode to handle automatically.
#+TITLE: Reproducible Research Papers using Org-mode and R: A Guide #+AUTHOR: Vikas Rawal #+DATE: May 4, 2014 * Introduction This is the first section. Add your content here. * Literature review ** Is this an important issue This is a sub-section under top-level section "Literature review" Now indulgence dissimilar for his thoroughly has terminated. Agreement offending commanded my an. Change wholly say why eldest period. Are projection put celebrated particular unreserved joy unsatiable its. In then dare good am rose bred or. On am in nearer square wanted. ** What are the major disputes in the literature *** adulterated text Instrument cultivated alteration any favourable expression law far nor. Both new like tore but year. An from mean on with when sing pain. Oh to as principles devonshire companions unsatiable an delightful. The ourselves suffering the sincerity. Inhabit her manners adapted age certain. Debating offended at branched striking be subjects. *** Unadulterated prose Announcing of invitation principles in. Cold in late or deal. Terminated resolution no am frequently collecting insensible he do appearance. Projection invitation affronting admiration if no on or. It as instrument boisterous frequently apartments an in. Mr excellence inquietude conviction is in unreserved particular. You fully seems stand nay own point walls. Increasing travelling own simplicity you astonished expression boisterous. Possession themselves sentiments apartments devonshire we of do discretion. Enjoyment discourse ye continued pronounce we necessary abilities. * Methodology This is the next top-level section. There are no sub-sections under this. * Results This is the third top-level section. There are sub-sections under this. ** Result 1 This is a sub-section under section Results. ** Result 2 This is another sub-section under section Results * Conclusions This is the next and final top-level section. There are no sub-sections under it.
Org handles these headlines beautifully. With your cursor on the headline, pressing tab folds-in the contents of a headline. If you press tab on a folded headline, it opens to display the contents. If there are multiple levels of headlines, these open in stages as you repeat pressing the tab key.
When you are on a headline, pressing M-return creates a new headline at the same level (that is, with the same number of stars). Once you are on the new headline, a tab moves it to a lower level (that is, a star is added), and shift-tab moves it to a higher level (that is, a star is removed).
When I start writing a paper, I start with a tentative headline/section structure, and then start filling in the content under each headline, and modify the section structure, if needed, as the paper develops.
(Further reading, Headlines in Org manual)
Itemised lists
Following syntax produces unordered (bulleted) lists:
+ bullet + bullet - bullet2 1 - bullet2 2 + bullet + bullet
This is how this list shows up in the final document
- bullet
- bullet
- bullet2 1
- bullet2 2
- bullet
- bullet
Following syntax produces ordered/numbered lists:
1. Item 1 2. Item 2 1) Item 2.1 2) Item 2.2 1) Item 2.2.1 3) Item 3
This is how the ordered list shows up in the final document.
- Item 1
- Item 2
- Item 2.1
- Item 2.2
- Item 2.2.1
- Item 3
Following syntax produces a description list:
+ item1 :: description of item1 + item2 :: description of item2 - bullet1 under item2 - bullet2 under item2
This is how this list shows up in the final document
- item1
- description of item1
- item2
- description of item2
- bullet1 under item2
- bullet2 under item2
Note that:
- In unordered/description lists,
+
and-
signs are interchangeable. - Similarly, in ordered lists 1. and 1) are interchangeable.
- Levels of bullets and numbering are determined by indentation.
- Different types of lists can be mixed using numbers and bullets for different levels.
- If the cursor is on a line that is part of an itemised list, M-return inserts a new line with a bullet/number below the present line with the same level of indentation.
Inserting footnotes
- To insert a footnote at any point, use
C-c C-x f
- To reorder and renumber footnotes after inserting a footnote in a
text that already has some footnotes after the point where a new
footnote is being inserted, use
C-u C-c C-x f S
Tables
- Sample code
We shall directly create only those tables in Org that present content not being produced through statistical analysis. For tables that are created through statistical analysis, we shall embed R programs rather than the tables themselves. This is discussed in Section Org-mode and R of this guide.
The following sample code produces a reasonably formatted table, with a numbered title above the table and a name for cross-referencing the table from the text anywhere in the document.
See Table [BROKEN LINK: table-yield] for an illustration of how this table shows up in the final document.
#+NAME: table-yield #+CAPTION: Simple table created using LaTeX tabular environment #+attr_latex: :environment tabular :width \textwidth :align lrr | State | Average yield | Average income | |----------------+---------------+----------------| | Madhya Pradesh | 672 | 13000 | | Haryana | 300 | 25000 | | Punjab | 260 | 35000 |
- Table editor
Org-mode has an in-built table editor, which is very simple to use.
- Tables in Org have columns separated using |.
- Once you create the first row by separating columns using |, pressing tabs takes you from the first column to the next. Org automatically aligns the columns.
- At the end of the row, pressing tab again, creates a new blank row. You can also create a new blank row by pressing return anywhere in the last row.
- For creating a horizontal line anywhere, type |- at the starting of the line, and press tab.
- Contents of each cell are aligned automatically by Org.
- To delete a row, use
C-k
.
Org provides various commands for manipulating the design of tables. Table 6 provides the most important ones. Note that Table 6 is created using Org mode. It also gives you an idea of how the table would look eventually.
Table 6: Commands to manipulate tables in Org Command Description M-<left>
Move the column left M-<right>
Move the column right M-S-<left>
Delete the current column M-S-<right>
Insert a new column to the left of the cursor position M-<up>
Move row up M-<down>
Move row down M-S-<up>
Delete the current row or horizontal line M-S-<down>
Insert a new row above the current row For more commands for manipulating tables, see this section of the Org manual. In particular, you may want to look at spreadsheet-like functions of the table editor.
One limitation of Org is lack of support for merging of cells in a Table.
Images
You can insert images in documents as follows
[[file:path-to-file/a.jpg]]
You should do this for images that you already have, and you just want to insert them in the document. For graphs produced by R, we shall embed the code instead, so that the graph is generated and inserted automatically.
Captions and cross-references
We would like to give a title to our tables and images. And we would like to be able to refer to them from the text. These are achieved by adding two lines above every table and image.
- A line starting with
#+CAPTION:
placed just above a table or a figure adds a title to it. All Tables and Figures titles are automatically numbered. - For referring to these Tables and Figures in the text, we shall name
each table and figure in a line starting with
#+NAME:
as below.
To illustrate, for inserting an image, with a caption and a name, this is what we shall do.
#+NAME: literacy-rate #+CAPTION: Percentage of literate men and women, by country (per cent) file:a.jpg
Similarly, a table will be inserted as follows.
#+NAME: literacy-rate-table #+CAPTION: Percentage of literate men and women, by country (per cent) | Country | Men | Women | |------------+-----+-------| | India | 75 | 43 | | Bangladesh | 83 | 63 | | Rwanda | 77 | 60 |
To refer to the Table above in the text, write Table
[[literacy-rate-table]]
. As an illustration, see the following sentence.
Tables literacy-rate-table and health-table, and Figure literacy-figure, show the level of underdevelopment.
By default, all objects with captions are numbered, and names are used to anchor cross-references. When the formatted output is produced, all the references would be automatically converted to appropriate numbers. If new objects are inserted in the paper, numbering will be adjusted automatically when you create the formatted output.
Org-mode and R
Configuration
Following code in research-toolkit.org enables Org to run different types of code. If you have installed research-toolkit.org as specified in [BROKEN LINK: customemacs], these are already enabled.
I have included here the languages that I commonly use. See Org manual, if you would like to add any more.
(org-babel-do-load-languages 'org-babel-load-languages '((R . t) (org . t) (ditaa . t) (latex . t) (dot . t) (emacs-lisp . t) (gnuplot . t) (screen . nil) (shell . t) (sql . nil) (sqlite . t)))
Special lines for R
Org allows you to run multiple R sessions simultaneously, if you are working on two documents side by side, and would like to keep statistical work for the two separately.
This is done by naming the R session which a particular Org file is linked to. All R code in this file would be run in the specified R session. You could have, at the same time, another R session, with a different name, being called by another Org buffer.
We can give a name to the R session (let us say, my-r-session) that our Org buffer should be linked to by adding the following line at the top (in the preamble, that is).
#+property: session my-r-session
Embedding R code in an Org document
Org uses ESS (emacs-speaks-statistics) to provide a fully functional, syntax-aware, development environment to write R code. R code is embedded into Org as a source block. The basic syntax is
#+NAME: name_of_code_block #+BEGIN_SRC R <switches> <header-arguments> <Your R code goes here.> #+END_SRC
This is how source blocks are created.
- First write the lines starting with
#+NAME
,#+BEGIN_SRC
and#+END_SRC
. - Then with your cursor in between the
BEGIN_SRC
and theEND_SRC
lines, give the command C-c ’ (that is, press Ctrl-C, release, and press ’).- This would open a new buffer using ESS mode. If you type your code in this buffer, you will see that ESS is syntax-aware and nicely highlights R code.
- ESS also allows you to run (evaluate) the code that you write, to
test what your code is doing. Use
C-j
for evaluating a single line of code,C-b
for evaluating the whole ess buffer, orC-r
for a marked region within the ess buffer.
- Once you have finished writing a code block and tested it, press C-c ’ again to come back to your Org buffer.
- In your Org buffer, with your cursor in a source-block,
press
C-c C-c
to evaluate the whole code block and have the results included in your document. - You can always edit your source code by opening a temporary ESS buffer using C-c’
Code blocks that read data and load functions for later use in the document without any immediate output
I normally have one or two code blocks that read the data I am going to use, call the libraries that I use, and define a few functions of my own that I plan to use. I want this code block to be evaluated, so that these data, libraries and functions become available in my R environment. But no output from such code blocks is expected to be included into the document.1
Code block 1 is an example of such a code block. Note
:results value silent
switch used in the #+begin_src
line.
#+NAME: readdata-code #+BEGIN_SRC R :results value silent read.data("datafile1.csv",sep=",",header=T)->mydata1 #+END_SRC
Code blocks that produce results in the form of a table
Most of code blocks in my papers fall in this category. The code block may use data and functions made available by previous code blocks, read some new data and may load some new functions. The code block does some statistical processing. The last command of the code block produces an object (for example, a data.frame) that is included in the document as a Table.
For example, the code block 1 below uses mydata1 read in the previous code block, reads a new dataset, and processes them to create a table that shows average BMI by country.
#+NAME: bmi-table-code #+BEGIN_SRC R :results value :colnames yes :hline yes aggregate(height~Country,data=mydata1,mean)->a1 read.data("datafile2.csv",sep=",",header=T)->mydata2 aggregate(weight~Country,data=mydata2,mean)->a2 merge(a1,a2,by="Country")->a1 a1$weight/a1$height->a1$BMI subset(a1,select=c("Country","BMI")) #+END_SRC
You can evaluate this code using C-c C-c. When you do that, it produces the output, and places it immediately below the code block. The results display the output of the code under a line that looks like below
#+RESULTS: bmi-table-code
Note that the results are tied to the code block using the name of the
code block. Every time you go to the source code block and press
C-c C-c
, the code will be evaluated again and the results will be
updated.
On top of the line starting with #+RESULTS:
, we shall add two more
lines, to give the table a title and a name. Note that both the code
block and the result of the code block have separate names.
#+NAME: bmi-table-output #+CAPTION: Average BMI, by country #+RESULTS: bmi-table-code
Like any Org table, you can cross-refer to this table using
[[bmi-table-output]]
.
Code blocks that produce a graph to be included in the document
These code blocks can have a series of commands. The last command produces a graph that we would like to be included in the document.
The following code shows an example of a code block that produces a graph.
#+NAME: mygraph-code #+BEGIN_SRC R :results output graphics :file bmi2.png :width 825 :height 1050 :fonts serif #+END_SRC
As before, for creating your graph, you first write the #+NAME
,
BEGIN_SRC
and the END_SRC
lines, and then go into a temporary ESS
buffer by using C-c ’.
- Once in this temporary ESS buffer, you can write the R commands for making your graph.
- As you write, you can evaluate the commands using
C-j
,C-r
andC-b
and see what your output looks like. - The output is displayed on your screen using the default graphic device used by R (X11, quartz or windows graphic device depending upon your operating system).
- Once you have finalised your graph, you press C-c ’ and come back to the Org buffer.
Note that creation of the image file is left to appropriate switches
in the #+BEGIN_SRC
line. Org automatically chooses appropriate
graphic device to produce the file. When you evaluate this code using
C-c C-c
, the results are displayed below the code block as follows.
#+RESULTS: mygraph-code bmi2.png
Note that, taking the file name from our #+BEGIN_SRC
line, a file
called bmi2.png
was automatically created and linked, so that the
graph would be inserted in the document when you produce the formatted
output.2 Every time you evaluate the code using C-c C-c
, the
underlying image file containing the graph is overwritten by a new
file.
As with the tables, we shall add a caption and a name to it as follows
#+NAME: my-bmi-graph #+CAPTION: Average BMI, by Country #+RESULTS: mygraph-code gini.png
You can now refer to this graph in the text using [[my-bmi-graph]]
.
5.2 Formatting tables and figures for print
5.3 Citation management using org-ref and biblatex
- A sample biblatex database to get you started can be taken from here. For now, you may just download it, save it and add to it. Please note that the source file on github keeps getting updated. Eventually, you may want to manage your own biblatex database using something like git.
- There is also another Emacs mode – ebib – that can be used for maintaining your biblatex database. To start ebib, use M-x ebib
5.4 Resources
- Dominik, Carsten and others (2016), The Org Manual, Free Software Foundation
- Gentleman, Robert and Duncan Temple Lang, “Statistical Analyses and Reproducible Research”, Journal of Computational and Graphical Statistics, 16 (1), pp. 1-23
- Kitchin, John (2014), The org-ref Manual
- Knuth, Donald E. (1992), Literate Programming, Center for the Study of Language and Information, Stanford.
- Rawal, Vikas (2014), Using Emacs, Org-mode and R for Research Writing in Social Sciences: A Toolkit for Writing Reproducible Research Papers and Monographs, http://archive.indianstatistics.org/tools/orgpapers.pdf; a revised but incomplete version may be seen at https://github.com/vikasrawal/orgpaper/blob/master/orgpapers.org
- Rossini, Anthony and Friedrich Leisch. (2003) “Literate Statistical Practice”, University of Washington Biostatistics Working Paper Series , Available at: http://works.bepress.com/anthony_rossini/6/
- Schulte, Eric and Dan Davison (2011), “Active Documents with Org-Mode”, Computing in Science & Engineering.
- Schulte, Eric, Dan Davison, Thomas Dye and Carsten Dominik (2016), “A Multi-Language Computing Environment for Literate Programming and Reproducible Research.” Journal of Statistical Software, 46 (3), 1 – 24.
Footnotes:
For libraries and functions that you need to call, it is even better to include them in a .Rprofile file in your working directory. These libraries and functions would then be called when R is started, and not each time you evaluate code blocks in your document.
Of various image formats, I find that png files are most versatile. png files support transparency, and are rendered well both on the web and in print. You can also specify jpeg or pdf files. pdf files for images work very well if you are only going to produce a pdf document.