Last updated on June 13, 2017

I am currently doing the Johns Hopkins course on R that is offered through Coursera. There is likely to be a gap between taking the course and using R, so these are my notes on how to get started.

## Software and setup

1. Google R, download the version to match your operating system, and install it on your machine using the defaults.  Set up a data folder in My Documents.

## Data

2. Use the first row to label your columns.
3. Use NA to mark missing data.

1. Open R with the shortcut
3. List your data to screen with this command: data
4. Note that you can use any name you like instead of “data”, as long as it is a valid R name: it cannot start with a digit or an underscore, and it cannot be a reserved word such as TRUE, NULL, or if.
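
The steps above can be sketched as follows. This is only a sketch under assumptions: the notes do not say how the data is read in, so it assumes a comma-separated file read with read.csv. The file, the column names VAR1 to VAR3, and the values are all made up, and the file is written to a temporary location so that the example runs on its own.

```r
# Hypothetical example file: the first row holds the column names,
# and NA marks missing data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("VAR1,VAR2,VAR3",
             "35,a,95",
             "NA,b,88",
             "40,c,NA"),
           csv_path)

# Assumption: the data is a CSV file. header = TRUE uses the first row
# as column names; read.csv treats the string "NA" as missing by default.
data <- read.csv(csv_path, header = TRUE)

# List the data to screen (typing just `data` at the console does the same).
print(data)
```

With your own file you would replace csv_path with the path to a file in your data folder.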

## Find out the number of rows/cases and variables/columns

1. To find out the number of columns, where data is the name of your data as above: ncol(data)
2. To find out the number of rows, where data is the name of your data as above: nrow(data)
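
As a quick illustration, here is a sketch using a small made-up dataframe in place of your own data (the column names and values are assumptions for the example only). dim() is a base R function that is not mentioned above; it returns both counts at once.

```r
# Made-up dataframe standing in for "data".
data <- data.frame(VAR1 = c(35, NA, 40, 20),
                   VAR2 = c("a", "b", "c", "d"),
                   VAR3 = c(95, 88, NA, 99))

ncol(data)  # number of columns/variables: 3
nrow(data)  # number of rows/cases: 4
dim(data)   # rows first, then columns: 4 3
```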

## Print out the first line to inspect the names of your variables/columns

1. Use this command, where data is the name of your data as above: data[1, ]
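
A small sketch of this on a made-up dataframe (names and values are assumptions for the example). It also shows two base R alternatives that are not in the notes above: head() for the first few rows, and names() for the column names on their own.

```r
# Made-up dataframe standing in for "data".
data <- data.frame(VAR1 = c(35, NA, 40),
                   VAR2 = c("a", "b", "c"),
                   VAR3 = c(95, 88, NA))

# First row, printed underneath the column names.
first_row <- data[1, ]
print(first_row)

# Alternative: head() returns the first n rows.
print(head(data, 1))

# Alternative: just the column names, without any data.
print(names(data))
```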

## Take a subset of your data

1. For sake of the example, let the name of your first variable be VAR1 and your third variable be VAR3
2. Make a new dataframe containing all rows where VAR1 is greater than 31 and VAR3 is greater than 90: newdata <- subset(data, VAR1 > 31 & VAR3 > 90)
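
To see what the subset command keeps, here is a sketch on a made-up dataframe (the values are assumptions chosen to include missing data). Note that subset() treats a row whose condition comes out as NA as not matching, so rows with a missing VAR1 or VAR3 are dropped here.

```r
# Made-up dataframe standing in for "data".
data <- data.frame(VAR1 = c(35, NA, 40, 20, 50),
                   VAR2 = c("a", "b", "c", "d", "e"),
                   VAR3 = c(95, 88, NA, 99, 91))

# Keep only the rows where both conditions are TRUE.
newdata <- subset(data, VAR1 > 31 & VAR3 > 90)

# Rows 1 and 5 remain: row 2 and row 3 have missing values,
# and row 4 fails VAR1 > 31.
print(newdata)
```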

## Take a subset of one variable excluding missing data

1. Set up a new LOGICAL vector which is TRUE wherever a value is missing: VAR1BAD <- is.na(data[,1])
2. Set up a new variable/vector that copies the values from the original vector, provided they are not “bad”: VAR1GOOD <- data[,1][!VAR1BAD]
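
The two steps can be tried out on a made-up dataframe like this (the values are assumptions for the example):

```r
# Made-up dataframe standing in for "data"; the first column has one missing value.
data <- data.frame(VAR1 = c(35, NA, 40, 20),
                   VAR3 = c(95, 88, NA, 99))

# TRUE where the first column is missing, FALSE elsewhere.
VAR1BAD <- is.na(data[, 1])
print(VAR1BAD)   # FALSE TRUE FALSE FALSE

# Keep only the values that are not missing (! means "not").
VAR1GOOD <- data[, 1][!VAR1BAD]
print(VAR1GOOD)  # 35 40 20
```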

## Do some basic statistics on your newly created variable/vector

1. mean(VAR1GOOD)
2. max(VAR1GOOD)
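
A short sketch with made-up values shows why the missing data has to be dealt with first. The na.rm = TRUE argument is a base R option not mentioned above; it skips missing values without the need for a separate "good" vector.

```r
# Made-up column with one missing value.
VAR1 <- c(35, NA, 40, 21)
VAR1GOOD <- VAR1[!is.na(VAR1)]

mean(VAR1GOOD)           # 32
max(VAR1GOOD)            # 40

# On the original vector the result is NA because of the missing value...
mean(VAR1)               # NA

# ...unless missing values are removed inside the call.
mean(VAR1, na.rm = TRUE) # 32
```

Function names in R are case-sensitive and take round brackets, so mean(x) works but Mean[x] does not.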

## Issues covered during the first week not listed above

1. Vectors must contain data of the same type i.e., numeric, character, or logical
2. A list can contain a mix of types
3. When a vector, as opposed to a list, has mixed types, the values are “coerced” to the lowest common denominator, so to speak – logical is converted to numeric (TRUE becomes 1, FALSE becomes 0), and numeric and logical are both converted to character when mixed with character
4. R uses factors – which in essence are labels such as “male” and “female” where other statistics programmes use numerals. Note that underneath the labels the values are stored as integer codes.
5. Data is read in as a dataframe rather than a matrix, i.e. as a table that can contain columns of different types. Dataframes can be converted to matrices.
6. There are various tricks for using massive data sets not covered in this post.
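
Several of the points above can be checked directly at the console. The following is a minimal sketch with made-up values; the variable names are only for illustration.

```r
# Mixing types in a vector coerces everything to one common type.
num_vec <- c(TRUE, 2)        # TRUE becomes 1
chr_vec <- c(TRUE, 2, "a")   # everything becomes character
print(class(num_vec))        # "numeric"
print(class(chr_vec))        # "character"

# A list keeps each element's own type.
mixed <- list(TRUE, 2, "a")
print(sapply(mixed, class))  # "logical" "numeric" "character"

# A factor stores labels on top of integer codes.
sex <- factor(c("male", "female", "male"))
print(levels(sex))           # "female" "male" (sorted alphabetically by default)
print(as.integer(sex))       # 2 1 2

# A dataframe can hold columns of different types. Converting it to a
# matrix forces one type for all cells (character in this case).
df <- data.frame(id = c(1, 2), name = c("x", "y"))
m <- as.matrix(df)
print(is.character(m))       # TRUE
```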

Published in BIG DATA, MACHINE LEARNING & ANALYTICS R