Getting started in R

Last updated on June 13, 2017

I am currently doing the Johns Hopkins course on R that is offered through Coursera. There is likely to be a gap between taking the course and using R, and these are my notes on how to get started.

Software and setup

Search for R (the official download site is CRAN, the Comprehensive R Archive Network), download the version that matches your operating system, and install it on your machine using the defaults. Set up a data folder in My Documents.

Data

Store your data in your folder in a .csv file.
Use the first row to label your columns.
Use NA to mark missing data.

Read your datafile

Open R with the shortcut
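Read your datafile using this command, substituting your own filename for datafile.csv: data <- read.csv("datafile.csv")
R looks for the file in its working directory, so either set the working directory to your data folder or give the full path to the file.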
List your data to screen with this command: data
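Note that you can use almost any name you like instead of “data”. Names must start with a letter (or a dot not followed by a digit), may contain only letters, digits, dots, and underscores, and cannot be reserved words such as TRUE, NULL, or if.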
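The reading steps above can be sketched as follows. This is a minimal sketch with made-up values: the file contents and the use of a temporary folder are my assumptions, chosen so the example runs anywhere; in practice you would point read.csv at the file in your own data folder.

```r
# Made-up example data, written to a temporary folder so the sketch is self-contained
folder <- tempdir()
datafile <- file.path(folder, "datafile.csv")
writeLines(c("VAR1,VAR2,VAR3",
             "35,a,95",
             "28,b,NA",
             "NA,c,91",
             "40,d,99"), datafile)

# Show the working directory, which is where R looks for files given without a path
getwd()

# Read the file: the first row supplies the column names, and NA marks missing values
data <- read.csv(datafile)

# Typing the name lists the data to screen
data
```

If your file sits in your data folder, setwd() with the path to that folder changes the working directory so that read.csv("datafile.csv") finds it.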

Find out the number of rows/cases and variables/columns

To find out the number of columns, where data is the name of your data as above: ncol(data)
To find out the number of rows, where data is the name of your data as above: nrow(data)
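As a quick illustration, here are both commands on a small made-up dataframe (the values are invented for the example). The dim() function, which is not mentioned above, returns both counts at once.

```r
# Made-up dataframe standing in for data read from a file
data <- data.frame(VAR1 = c(35, 28, NA, 40),
                   VAR2 = c("a", "b", "c", "d"),
                   VAR3 = c(95, NA, 91, 99))

ncol(data)  # number of columns: 3
nrow(data)  # number of rows: 4
dim(data)   # rows, then columns: 4 3
```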

Print out the first line to inspect the names of your variables/columns

Use this command, where data is the name of your data as above: data[1, ]
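A short sketch of this step on made-up data. Besides data[1, ], it shows names() and head(), which are standard R functions not covered in the notes above but which serve the same purpose.

```r
# Made-up dataframe standing in for data read from a file
data <- data.frame(VAR1 = c(35, 28, NA, 40),
                   VAR2 = c("a", "b", "c", "d"),
                   VAR3 = c(95, NA, 91, 99))

# First row, printed under the column names
data[1, ]

# Just the column names, as a character vector
names(data)

# The first few rows (here the first two), handy for large files
head(data, 2)
```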

Take a subset of your data

For sake of the example, let the name of your first variable be VAR1 and your third variable be VAR3
Make a new dataframe containing all rows where the values of VAR1 and VAR3 are as shown: newdata <- subset(data, VAR1 > 31 & VAR3 > 90)
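Here is the same subset command run on a small made-up dataframe. One thing worth noticing in this sketch: a row where the comparison cannot be evaluated because of a missing value is left out of the result.

```r
# Made-up dataframe; VAR1 and VAR3 each contain one missing value
data <- data.frame(VAR1 = c(35, 28, NA, 40),
                   VAR2 = c("a", "b", "c", "d"),
                   VAR3 = c(95, NA, 91, 99))

# Keep rows where both conditions are TRUE
newdata <- subset(data, VAR1 > 31 & VAR3 > 90)

# Rows 1 and 4 remain; row 2 fails VAR1 > 31, and row 3 is dropped
# because VAR1 is missing and the condition evaluates to NA
newdata
```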

Take a subset of one variable excluding missing data

Set up a new variable/vector containing a LOGICAL variable which is TRUE when a value is missing: VAR1BAD <- is.na(data[,1])
Set up a new variable/vector that copies the values from the original vector, providing they are not “bad”: VAR1GOOD <- data[,1][!VAR1BAD]
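The two steps above, sketched on made-up data. The values are invented; the variable names follow the ones used in these notes.

```r
# Made-up dataframe whose first column has one missing value
data <- data.frame(VAR1 = c(35, 28, NA, 40),
                   VAR3 = c(95, NA, 91, 99))

# TRUE wherever the first column is missing
VAR1BAD <- is.na(data[, 1])
VAR1BAD   # FALSE FALSE TRUE FALSE

# Keep only the values that are not missing (! means "not")
VAR1GOOD <- data[, 1][!VAR1BAD]
VAR1GOOD  # 35 28 40
```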

Do some basic statistics on your newly created variable/vector

mean(VAR1GOOD)
max(VAR1GOOD)
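A minimal sketch of these statistics on made-up values. Note that R is case-sensitive and functions are called with round brackets. The na.rm = TRUE argument shown at the end is a standard option of mean() and max() that skips missing values directly; it is an alternative to building the cleaned vector first, not something covered in the notes above.

```r
# Made-up vector with one missing value
VAR1 <- c(35, 28, NA, 40)

# Cleaned vector, built as in the previous section
VAR1GOOD <- VAR1[!is.na(VAR1)]

mean(VAR1GOOD)  # (35 + 28 + 40) / 3, about 34.33
max(VAR1GOOD)   # 40

# Alternative: let the function drop missing values itself
mean(VAR1, na.rm = TRUE)
max(VAR1, na.rm = TRUE)
```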

Issues covered during the first week not listed above

Vectors must contain data of the same type i.e., numeric, character, or logical
A list can contain a mix of types
When a vector, as opposed to a list, is given mixed types, R “coerces” the values to the most general type present: logical mixed with numeric becomes numeric (TRUE is 1, FALSE is 0), and numeric or logical mixed with character becomes character
R uses factors, which in essence are labels such as “male” and “female” where other statistics programmes use numeric codes. Note that underneath, R stores a factor as integer codes that point to the labels.
Data is read in as a dataframe rather than a matrix, i.e. as a table that can contain columns of different types. Dataframes can be converted to matrices.
There are various tricks for using massive data sets not covered in this post.
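The points on types, factors, and dataframes can be tried out with a short sketch like the one below. All values and variable names here are made up for illustration.

```r
# Mixing logical and numeric in a vector gives numeric: 1 0 3
v1 <- c(TRUE, FALSE, 3)
class(v1)

# Adding a character value turns everything into character: "1" "TRUE" "a"
v2 <- c(1, TRUE, "a")
class(v2)

# A list keeps each element's own type
l <- list(1, "a", TRUE)
sapply(l, class)

# A factor shows labels but stores integer codes underneath
sex <- factor(c("male", "female", "male"))
levels(sex)      # "female" "male" (sorted alphabetically by default)
as.integer(sex)  # 2 1 2

# A dataframe can hold columns of different types;
# converting it to a matrix forces a single type (here character)
df <- data.frame(x = c(1, 2), y = c("a", "b"))
m <- as.matrix(df)
is.matrix(m)
typeof(m)
```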