I am currently doing the Johns Hopkins course on R that is offered through Coursera. There is likely to be a gap between taking the course and using R, so these are my notes on how to get started.
Software and setup
- Google R, download the version that matches your operating system from CRAN (the official download site), and install it on your machine using the defaults. Set up a data folder in My Documents.
- Store your data in your folder in a .csv file.
- Use the first row to label your columns.
- Use NA to mark missing data.
Read your datafile
- Open R with the shortcut
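- Read your datafile using this command, substituting your own filename for datafile.csv: data <- read.csv("datafile.csv"). Use plain straight quotes, not curly ones. R looks for the file in its working directory, so either give the full path to the file or first point R at your data folder with setwd().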
- List your data to screen with this command: data
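- Note that you can use any name you like instead of “data”, within R's naming rules: a name must start with a letter (or a dot that is not followed by a digit), may contain only letters, digits, dots, and underscores, and cannot be a reserved word such as TRUE, NULL, or if. Names are case-sensitive.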
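As a minimal sketch of the steps above, the following writes a tiny made-up CSV to a temporary file and reads it back. The column names VAR1 to VAR3 and all the values are invented for illustration; with a real file you would pass its own path to read.csv instead of csv_path.

```r
# Create a small example file so the snippet runs anywhere.
# The columns and values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "VAR1,VAR2,VAR3",
  "35,a,95",
  "28,b,NA",
  "NA,c,91",
  "40,d,88"
), csv_path)

# The first row supplies the column names; NA marks missing values.
data <- read.csv(csv_path)

# Typing the name on its own prints the whole data frame to the screen.
data
```

The later examples in these notes rebuild the same small table with data.frame() so that each one can be run on its own.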
Find out the number of rows/cases and variables/columns
- To find out the number of columns, where data is the name of your data as above: ncol(data)
- To find out the number of rows, where data is the name of your data as above: nrow(data)
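A quick way to try the two commands is on a small made-up data frame (the values below are invented and stand in for a file read with read.csv). dim() is another base R function that reports both numbers at once.

```r
# A small made-up data frame standing in for the one read from the CSV.
data <- data.frame(
  VAR1 = c(35, 28, NA, 40),
  VAR2 = c("a", "b", "c", "d"),
  VAR3 = c(95, NA, 91, 88)
)

ncol(data)  # number of columns (variables)
nrow(data)  # number of rows (cases)
dim(data)   # both at once: rows first, then columns
```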
Print out the first line to inspect the names of your variables/columns
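- Use this command, where data is the name of your data as above: data[1, ]. This prints the first row of values with the column names shown above it. To list the names alone, use names(data).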
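For comparison, here is a short sketch of a few base R ways to inspect the column names, again on an invented data frame. head() and str() are not mentioned in my course notes above; they are standard functions that I am adding as alternatives.

```r
# A small made-up data frame standing in for the one read from the CSV.
data <- data.frame(
  VAR1 = c(35, 28, NA, 40),
  VAR2 = c("a", "b", "c", "d"),
  VAR3 = c(95, NA, 91, 88)
)

data[1, ]      # first row of values, with the column names printed above it
names(data)    # just the column names
head(data, 3)  # the first three rows
str(data)      # the type of each column plus a preview of its values
```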
Take a subset of your data
- For sake of the example, let the name of your first variable be VAR1 and your third variable be VAR3
- Make a new dataframe containing all rows where the values of VAR1 and VAR3 are as shown: newdata <- subset(data, VAR1 > 31 & VAR3 > 90)
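The subset command can be tried on an invented data frame as follows. The values are made up so that only one row passes both conditions. Note that subset() leaves out rows where the condition cannot be evaluated because of a missing value.

```r
# A small made-up data frame standing in for the one read from the CSV.
data <- data.frame(
  VAR1 = c(35, 28, NA, 40),
  VAR2 = c("a", "b", "c", "d"),
  VAR3 = c(95, NA, 91, 88)
)

# Keep only rows where both conditions are TRUE.
# Rows where the test comes out as NA are dropped as well.
newdata <- subset(data, VAR1 > 31 & VAR3 > 90)
newdata

# The same selection with bracket indexing; which() returns only the
# positions where the test is TRUE, so NA results are skipped.
newdata2 <- data[which(data$VAR1 > 31 & data$VAR3 > 90), ]
newdata2
```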
Take a subset of one variable excluding missing data
- Set up a new variable/vector containing a LOGICAL variable which is TRUE when a value is missing: VAR1BAD <- is.na(data[,1])
- Set up a new variable/vector that copies the values from the original vector, provided they are not “bad”: VAR1GOOD <- data[,1][!VAR1BAD]
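The two steps can be sketched on an invented data frame in which the third value of the first column is missing.

```r
# A small made-up data frame standing in for the one read from the CSV.
data <- data.frame(
  VAR1 = c(35, 28, NA, 40),
  VAR2 = c("a", "b", "c", "d"),
  VAR3 = c(95, NA, 91, 88)
)

# Logical vector: TRUE wherever the first column has a missing value.
VAR1BAD <- is.na(data[, 1])
VAR1BAD

# Keep only the values that are not missing (! means "not").
VAR1GOOD <- data[, 1][!VAR1BAD]
VAR1GOOD

# TRUE counts as 1, so summing the logical vector counts the missing values.
sum(VAR1BAD)
```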
Do some basic statistics on your newly created variable/vector
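- Find the maximum with this command: max(VAR1GOOD). R is case-sensitive and calls functions with round brackets, so Max [VAR1GOOD] will not work.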
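Here is a small sketch of some basic statistics on a cleaned vector. The values are made up, and min(), mean(), median(), and summary() are further base R functions that I am adding alongside max(). The last lines show the na.rm argument, which is an alternative to removing the missing values by hand.

```r
# Made-up values standing in for the cleaned vector from the previous step.
VAR1GOOD <- c(35, 28, 40)

max(VAR1GOOD)      # largest value
min(VAR1GOOD)      # smallest value
mean(VAR1GOOD)     # arithmetic mean
median(VAR1GOOD)   # middle value
summary(VAR1GOOD)  # several of the above in one go

# Alternative: leave the missing value in and ask the function to skip it.
VAR1 <- c(35, 28, NA, 40)
max(VAR1)                # NA, because one value is unknown
max(VAR1, na.rm = TRUE)  # ignores the NA
```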
Issues covered during the first week not listed above
- Vectors must contain data of a single type, e.g. numeric, character, or logical
- A list can contain a mix of types
- When a vector, as opposed to a list, is given mixed types, the values are “coerced” to the lowest common denominator, so to speak: logical is reduced to numeric (TRUE becomes 1, FALSE becomes 0), and numeric and logical are both reduced to character
- R uses factors, which in essence are labels such as “male” and “female” where other statistics programmes use numerals. Note that underneath, a factor is stored as integer codes, one for each label.
- Data is read in as a dataframe rather than a matrix, i.e. as a table that can contain columns of different types. Dataframes can be converted to matrices with as.matrix(), although columns of mixed types are then coerced to a single type.
- There are various tricks for using massive data sets not covered in this post.
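The points about vectors, lists, coercion, factors, and dataframes can be tried out with a few lines at the R prompt. This is only a sketch with invented values, and the variable names are my own.

```r
# Mixing types in a vector triggers coercion to a single type.
v1 <- c(TRUE, 2)    # logical + numeric: TRUE becomes 1
v2 <- c(1, "a")     # numeric + character: 1 becomes "1"
v3 <- c(TRUE, "a")  # logical + character: TRUE becomes "TRUE"
class(v1)
class(v2)
class(v3)

# A list keeps each element's own type.
mixed <- list(1, "a", TRUE)
sapply(mixed, class)

# A factor shows labels but stores integer codes underneath.
sex <- factor(c("male", "female", "male"))
levels(sex)      # the labels, sorted alphabetically by default
as.integer(sex)  # the underlying codes: 2 1 2

# A data frame can hold columns of different types.
df <- data.frame(x = c(1, 2), y = c("a", "b"))
str(df)

# Converting to a matrix forces every cell to one type (character here).
m <- as.matrix(df)
m
```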