R – flowingmotion

Make what you will of this. It was written during a long day trying to make sense of the JHU course on R with liberal helpings of the simpler course from Princeton. I haven’t checked the text for accuracy or typos and have put it here just so I can find it when I need it. If it helps you to make sense of R, good.

Get your data and save it in a .csv file

Your first task when using R to do statistical analysis is to collect the data.

Lay out your data

Normally, you lay out your data in a table. Observations, cases, instances or people are in the rows. Variables or things you observed are in the columns.

Usually one column holds a unique but anonymous identifier for each case/row.

You will also have columns for ‘factor’ information such as gender (male/female/other), age (0-125), and so on.

You can put the column names in Row 1 – or you can leave out the column names and put the first case in Row 1.

Save your data

It is quite normal to load data into an Excel table and then to save it in .csv (comma separated value) format.

.xls format can be read (with an add-on package) but .csv is more common and should be your standard practice.

You can also capture your data in Excel and save it as a .txt file. I will show you how to read both the .csv and .txt.

Read your data into R

Your second task is to read a data file into R so that you can use it and analyse it.

You have three primary ways to read in your file. By the time you have worked through these methods, you will have mastered several basic R commands.

First method: Read the file from an online source

1. Online urls tend to be long and cumbersome. So put the url into a variable called fileurl.

2. We use <- to assign the url to the variable

3. We surround the url with straight double quotes ""

4. fileurltxt <- "http://spark-public.s3.amazonaws.com/stats1/datafiles/Stats1.13.HW.02.txt" [name taken from one of the Coursera courses]

5. For avoidance of confusion: notice this is a text file and even though we have made a variable with this name, the file we read in remains a .txt file.

6. Alternatively: fileurlcsv <- "http://spark-public.s3.amazonaws.com/stats1/datafiles/Stats1.13.HW.02.csv"

7. Now you can see why my variable was named – file-url-csv or file-url-txt – I don’t want to forget what I have done.

8. To read a .txt file, simply:

mydata <- read.table(fileurltxt, header = TRUE) or mydata <- read.table(fileurltxt, header = T)

9. To read a .csv file, simply:

mydata <- read.csv(fileurlcsv, header = TRUE) or as above, abbreviate TRUE to T

10. With read.table, if you leave off header = TRUE, the first line of the file is normally treated as data (read.csv, by contrast, assumes header = TRUE unless told otherwise). Let’s spell this out: if you leave off the information about the header and the first line is a set of column headings, these will be erroneously treated as the first case. If you do not have headers, then of course, replace TRUE with FALSE. I recommend always stating header = TRUE or header = FALSE explicitly so that when you return to your program after many months, you are immediately clear about what you did.
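The effect of the header setting can be checked without an internet connection. The sketch below is only an illustration: it writes a tiny made-up table to a temporary file (the column names id and score are invented) and reads it back both ways.

```r
# Write a small, made-up table to a temporary .csv file
tmpcsv <- tempfile(fileext = ".csv")
writeLines(c("id,score", "1,31", "2,45"), tmpcsv)

# header = TRUE: the first line supplies the column names
withheader <- read.csv(tmpcsv, header = TRUE)

# header = FALSE: the first line is treated as a row of data
noheader <- read.csv(tmpcsv, header = FALSE)

nrow(withheader)   # 2 rows of data
nrow(noheader)     # 3 rows: the headings have become the first case
names(noheader)    # R makes up the names V1 and V2
```

The same comparison works with read.table if you add sep = "," so that it splits the lines on commas.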

What you have learned to do so far

· To set up a variable to store a long url

· The <- command to “put data into a variable or matrix or table or dataframe”

· To put "" around strings (including urls)

· To check whether your data is in a .txt or .csv file

· A different command to read .txt and .csv files (read.table and read.csv, respectively)

· To add header=TRUE or header=FALSE to say whether the top line of the file contains column headings or data.

· To read your data into a data frame called mydata. You can call mydata what you like (though you might discover some reserved names). Use a name that is short, descriptive and memorable.

At this point, you can easily read data from a location on the web and you have data in mydata ready to use. If you want to see your data, simply type mydata followed by enter and your data will be listed. If you have a very long file, don’t list all of it. Type head(mydata) instead to get the top few lines and tail(mydata) to get the bottom few lines.

So you have learned two more things:

11. To display a data frame, simply type its name

12. To display the first or last lines of a large dataframe, type head(dataframename) or tail(dataframename).
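As a small sketch of head and tail, here is a made-up data frame of 100 rows standing in for a long file (the names and values are invented for the example):

```r
# A made-up data frame with 100 rows
mydata <- data.frame(id = 1:100, score = (1:100) * 2)

head(mydata)      # the first 6 rows by default
tail(mydata)      # the last 6 rows by default
head(mydata, 3)   # a second argument asks for a different number of rows
```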

Read the file from a directory or folder on your laptop

If you have downloaded or saved a file onto your laptop, then you are going to follow exactly the same procedure as above. But you will have to replace the url within "" with a path and filename.

Begin by learning how to set your working directory.

1. To find out your current working directory, use: getwd()

2. To set a working directory, use: setwd("C:/users/yourusername/documents/R")

3. Check that with getwd()

4. Of course, if the folder R does not exist, make the directory and put your datafile there

5. The reason I used R is that when you load packages and libraries, R automatically makes this folder

6. Note you can also put your data in a sub-directory "C:/users/yourusername/documents/R/datafiles"

7. Of course, datafiles can be anything you choose to name it

8. Also note that you must use / (or a doubled \\) in the path, not a single \

9. Assuming your data file is in /R/datafiles and that your working directory is R, then your path is “datafiles”

10. Here are your new commands for reading a .txt file

filenametxt <- "your file name including .txt"

path <- "datafiles"

mydata <- read.table(file = file.path(path, filenametxt), header = TRUE)

11. And for .csv

filenamecsv <- "your filename including .csv"

path <- "datafiles"

mydata <- read.csv(file = file.path(path, filenamecsv), header = TRUE)
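To see what file.path does, here is a minimal sketch that can be run anywhere. It assumes nothing about your own folders: the datafiles directory is created under R's temporary directory and example.csv is a made-up file.

```r
# Hypothetical sub-directory and file, created under tempdir() for the example
path <- file.path(tempdir(), "datafiles")
dir.create(path, showWarnings = FALSE)
filenamecsv <- "example.csv"
writeLines(c("id,score", "1,31", "2,45"), file.path(path, filenamecsv))

# file.path() joins its pieces with "/"
file.path("datafiles", "example.csv")

# Build the full name once, check it exists, then read it
fullname <- file.path(path, filenamecsv)
file.exists(fullname)
mydata <- read.csv(file = fullname, header = TRUE)
```

Checking file.exists before reading is optional, but it gives a clearer answer than the error from read.csv when the path is wrong.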

Read any one data file from any one of many directories

Now let us imagine you are working on a large project and you have several data sub-directories (part1, part2, etc.). And each directory contains many datafiles. The datafiles might be named with numbers (001.csv to 999.csv). Everything in this section applies to .txt files too, but you must use read.table not read.csv and substitute .txt for .csv

1. To be able to tell R to find the file you want, we set the file name as an argument in a function or script.

2. Set up the barebones of the function:

getdatafile <-function(id, path){

}

3. Type getdatafile to see the script

4. Edit the getdatafile function by using fix(getdatafile) . Note that you will edit in a little popup and when you save, you should see the corrected function on the original console. I have had endless trouble with this, so work carefully to eliminate muddle.

getdatafile <- function(id, path) {

read.csv(file = file.path(path, id), header = TRUE)

}

Now run getdatafile("filename.csv", "datafiles"), replacing the file name and directory with your own.

The full datafile should come up on the console.

5. Explore that further by editing getdatafile again and reading the file to mydata:

mydata <- read.csv(file = file.path(path, id), header = TRUE)

6. Now the data will not print out. Moreover when you type mydata, you will get NULL, or a message that it does not exist, or values of mydata that you put there earlier. So rerun this command but clear mydata first with mydata <- 0.

7. To have access to the data, you have to make the mydata inside the function the value of the whole function. To do this, type return(mydata) just before the }

8. Now when you run getdatafile("put the file name here", "put the path name here"), the data prints to the console again.

9. To store this data for later use, you have to type mydata <- getdatafile("filename", "pathname"). The reason for this is that mydata only existed within the function and the ‘logic’ of a function is that you return a value of a function, not a value of things inside the function. To tease this out further, you can have mydata = 3 set before you call the function. Unless you call the function “onto” mydata, mydata will continue to be 3, despite having read a whole file into another mydata within the function.

So what has been learned here?

· A function has a specific form with

a. The name of the function

b. <-

c. function

d. () containing arguments

e. {}

f. Code within the {}

· The function exists to arrive at a value, that might be one number or a large table. This is what is returned.

· The value of the function is vomited up to the console unless we direct it to a variable such as valueofgetdatafile <- getdatafile("filename", "pathname")

· If we redirect the output into a variable within the function, such as mydata <- read.csv(filepathcsv), then we must remember to make the output the value of the function again by typing return(mydata) just before the last }

· Once again, if we don’t want mydata vomited up over the console we must direct it into valueofgetdatafile

All this seems unnecessarily convoluted but, as I understand it, it follows from the modularity of functions: variables made inside a function stay inside it, and only the returned value comes out. Don’t fight it, just master it.
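The point about mydata staying at 3 can be tried out directly. This is only a sketch: readinside is a made-up function that builds a small data frame instead of reading a file, so it runs without any data on disk.

```r
mydata <- 3   # a value set before the function is called

readinside <- function() {
  # this mydata exists only while the function runs
  mydata <- data.frame(id = 1:2, score = c(31, 45))
  return(mydata)
}

readinside()             # the returned value prints on the console
unchanged <- mydata      # the outer mydata is still 3

mydata <- readinside()   # only now is the result stored in mydata
```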

As a final frill, we are going to make it easier to type in the file name. At present, we have to type in “24.csv” or “345.txt” etc.

Can we make it easier and let people simply type getdatafile(24, "path name here")?

To do this, we will use a command called sprintf and two other commands called paste and as.numeric.

sprintf allows us to add leading zeros to a number, so that 1 becomes 001, 10 becomes 010 and 100 stays 100:

filename <- sprintf("%03d", id)

Notice that %03d pads to at least three digits, so a number such as 1000 is left as 1000 and is not shortened.

To add on the .csv, use the paste command. The paste command joins strings together and puts a space between them by default. sep = "" removes the spaces.

filename <- paste(sprintf("%03d", id), ".csv", sep = "")

Notice that you might have thought the items to be concatenated would be in their own (), but they are not.

Also notice, if you try this, that it will not work when the id arrives as text or character, because %03d expects a number. To turn the id back into a number, we will use as.numeric

filename <- paste(sprintf("%03d", as.numeric(id)), ".csv", sep = "")

Replace id with filename in the read.csv (or read.table) commands and the user is at liberty to put only the number of the file, without the extension, into getdatafile(id, path).

Note also that this little routine does not help if some of the files are .txt and some are .csv. They must all be of the same type, the code must reflect the type, and the read.csv or read.table must match.
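Each step of building the file name can be run on its own to see what it produces. A small sketch, with 24 and "7" as made-up ids:

```r
id <- 24

padded <- sprintf("%03d", id)                          # "024"
spaced <- paste(padded, ".csv")                        # "024 .csv", paste adds a space by default
joined <- paste(padded, ".csv", sep = "")              # "024.csv"

# An id that arrives as text is converted to a number before padding
fromtext <- paste(sprintf("%03d", as.numeric("7")), ".csv", sep = "")   # "007.csv"
```

paste0(padded, ".csv") is a base R shortcut that gives the same result as paste with sep = "".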

You should now have a script that reads

getdatafile <- function(id, path) {

filename <- paste(sprintf("%03d", as.numeric(id)), ".csv", sep = "")

mydata <- read.csv(file = file.path(path, filename), header = TRUE)

return(mydata)

}

And you use the script like this:

getthecontentsofdatafile <- getdatafile(23, "datafiles")

Where you put the number of the file that you want and datafiles is the directory /R/datafiles where the file can be found.

To retrieve the file, you type

getthecontentsofdatafile

Summary

This post covered how to read files into R from three sources – an external url, a folder on your machine, and from one or more directories containing many files.

It is important to check whether you are using .txt or .csv files and to change the scripts to match. Also change the command – read.table is for .txt and read.csv for .csv.

It is also necessary to have the labelling of files in some coherent pattern. This script expects numbered file names padded to three digits (001.csv, 024.csv and so on) and does not handle files with names like abc.txt.

We also covered the basics of a function – and the very confusing return function. It is best to play around with this until it becomes more intuitive.

Finally, we used three more commands. sprintf formats text and has a useful feature for adding leading zeros, so it can make 1 into 001.

as.numeric turns an id that arrives as text back into a number, so that sprintf can pad it to three digits.

And the paste command allows you to add .csv on the end and its feature sep = "" allows us to remove a space so 001 .csv becomes 001.csv.

I hope you find this helpful.

Getting started in R

I am currently doing the Johns Hopkins course on R that is offered through Coursera. There is likely to be a gap between taking the course and using R, and these are my notes on how to get started.

Software and setup

Google R, download the version to match your operating system, and install it on your machine using the defaults. Set up a data folder in My Documents.

Data

Store your data in your folder in a .csv file.
Use the first row to label your columns.
Use NA to mark missing data.

Read your datafile

Open R with the shortcut
Read your datafile using this command substituting the filename for datafile: data <- read.csv("datafile.csv")
List your data to screen with this command: data
Note that you can use any name you like instead of “data” [though I imagine there are some unallowable names]

Find out the number of rows/cases and variables/columns

To find out the number of columns where data is the name of your data as above : ncol(data)
To find out the number of rows where data is the name of your data as above : nrow(data)

Print out the first line to inspect the names of your variables/columns

Use this command where data is the name of your data as above : data[1, ]
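A quick sketch of these inspection commands on a made-up data frame (VAR1, VAR2 and VAR3 and their values are invented). names(data) is an extra base R command, not mentioned above, that lists only the column names.

```r
# A made-up data frame standing in for the file you read
data <- data.frame(VAR1 = c(10, 35, 40),
                   VAR2 = c("a", "b", "c"),
                   VAR3 = c(95, 80, 99))

ncol(data)    # number of columns/variables
nrow(data)    # number of rows/cases
data[1, ]     # the first row, printed under the column names
names(data)   # just the column names
```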

Take a subset of your data

For sake of the example, let the name of your first variable be VAR1 and your third variable be VAR3
Make a new dataframe containing all rows where the values of VAR1 and VAR3 are as shown: newdata <- subset(data, VAR1 > 31 & VAR3 > 90)
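Here is a minimal sketch of subset with made-up values, including one missing value to show that subset leaves out rows where the condition cannot be evaluated.

```r
# Made-up data: only the third row has VAR1 > 31 and VAR3 > 90
data <- data.frame(VAR1 = c(10, 35, 40, NA),
                   VAR3 = c(95, 80, 99, 91))

newdata <- subset(data, VAR1 > 31 & VAR3 > 90)
newdata   # the row with VAR1 = 40 and VAR3 = 99

# The row with the missing VAR1 is dropped because its condition is NA
nrow(newdata)
```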

Take a subset of one variable excluding missing data

Set up a new variable/vector containing a LOGICAL variable which is TRUE when a value is missing: VAR1BAD <- is.na(data[,1])
Set up a new variable/vector that copies the values from the original vector, providing they are not “bad”: VAR1GOOD <- data[,1][!VAR1BAD]

Do some basic statistics on your newly created variable/vector

mean(VAR1GOOD)
max(VAR1GOOD)
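The missing-data steps can be tried end to end on made-up values. Note that R's function names are lowercase and take round brackets. The last line uses na.rm = TRUE, a standard argument of mean that was not covered above, as a shortcut that skips the missing values in one step.

```r
# Made-up column with one missing value
data <- data.frame(VAR1 = c(10, NA, 40, 40))

VAR1BAD <- is.na(data[, 1])        # FALSE TRUE FALSE FALSE
VAR1GOOD <- data[, 1][!VAR1BAD]    # 10 40 40

mean(VAR1GOOD)                     # 30
max(VAR1GOOD)                      # 40

mean(data[, 1], na.rm = TRUE)      # also 30, without building VAR1GOOD
```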

Issues covered during the first week not listed above

Vectors must contain data of the same type i.e., numeric, character, or logical
A list can contain a mix of types
When a vector, as opposed to a list, has mixed types, the type is “coerced” to the LCD, so to speak: logical is reduced to numeric (1,0), and numeric and logical are reduced to character
R uses factors, which in essence are labels such as “male” and “female” where other statistics programmes used numerals. Note that the underlying value is stored as an integer code.
Data is read in as a dataframe rather than a matrix i.e. as a table that can contain columns of different types. Dataframes can be converted to matrices.
There are various tricks for using massive data sets not covered in this post.
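Several of these points can be seen in a few lines at the console. This is only a sketch with made-up values:

```r
# Mixed types in a vector are coerced
lognum <- c(TRUE, 2)          # logical becomes numeric: 1 2
allchar <- c(1, "a", TRUE)    # everything becomes character: "1" "a" "TRUE"

# A list keeps each item's own type
mixed <- list(1, "a", TRUE)

# A factor shows labels but stores integer codes
sex <- factor(c("male", "female", "male"))
levels(sex)                   # "female" "male"
as.integer(sex)               # 2 1 2

# A data frame with mixed columns becomes a character matrix
df <- data.frame(n = 1:2, s = c("a", "b"))
m <- as.matrix(df)
```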
