Pages

Showing posts with label graph. Show all posts
Showing posts with label graph. Show all posts

Thursday, November 8, 2012

Data types part 2: Using classes to your advantage


Last week I talked about objects including scalars, vectors, matrices, dataframes, and lists.  This post will show you how to use the objects (and their corresponding classes) you create in R to your advantage.

First off, it's important to remember that columns of dataframes are vectors.  That is, if I have a dataframe called mydata, the columns mydata$Height and mydata$Weight are vectors. Numeric vectors can be multiplied or added together, squared, added or multiplied by a constant, etc. Operations on vectors are done element by element, meaning here row by row.

First, I read in a file of data, called mydata, using the read.csv() function. I get the dataframe below:


I check the classes of my objects using class(), or all at the same time with ls.str().

class(mydata$Weight)
class(mydata$Height)

or










So I see that mydata is a dataframe and all my columns are numeric (num).  Now, if I want to create a new column in my dataset which calculates BMI, I can do some vector operations:

mydata$BMI<-mydata$Weight/(mydata$Height)^2 * 703


Which is the formula for BMI from weight in pounds and height in inches. Notice how if any component of the calculation is a missing (NA) value, R calculates the BMI as NA as well.

Now I can do summary statistics on my data and store those as a matrix. For example, I start with summary statistics on my Age vector:

summary(mydata$Age)






If I want to extract an element of this summary table, say the minimum, I can do

summary(mydata$Age)[1]

which extracts the first element (of 6) of the summary table.

But what I really want is a summary matrix of a bunch of variables: Age, Sex, and BMI.  To do this I can rowbind the summary statistics of those three variables together using the rbind() function, but only take the 1st, 4th, and 6th elements of the summary table, which as you can see correspond to the Min, Mean, and Max. This creates a matrix, which I call summary.matrix:

summary.matrix<-rbind(summary(mydata$Age)[c(1,4,6)], summary(mydata$BMI)[c(1,4,6)], summary(mydata$Sex)[c(1,4,6)])

Rowbinding is basically stacking rows on top of each other.  I add rownames and then print the class of my summary matrix and the results.

rownames(summary.matrix)<-c("Age", "BMI", "Sex")
class(summary.matrix)
summary.matrix










There is also a much more efficient way of doing this using the apply() function.  Previously I had another post on the apply function, but I find that it takes a lot of examples to get comfortable with so here is another application.

Apply() is a great example of classes because it takes in a dataframe as the first argument (mydata, all rows, but I choose only columns 2, 3, and 7).  I then apply it to the numeric vector columns (MARGIN=2) of this subsetted dataframe, and then for each of those columns I perform the mean and standard deviation, removing the NA's from consideration.  I save this in a matrix I call summary.matrix2.

summary.matrix2<-apply(mydata[,c(2,3,7)], MARGIN=2, FUN=function(x) c(mean(x,na.rm=TRUE), sd(x, na.rm=TRUE)))

I then rename the rows of the this matrix and print the results, rounded to two decimal places.  Notice how the format of the final matrix is different here. Above the rows were the variables and the columns the summary statistics, while here it is reversed.  I could have column binded (cbind() instead of the rbind()) in the first case and I would have gotten the matrix transposed to be like this one.

rownames(summary.matrix2)<-c("Mean", "Stdev")
round(summary.matrix2, 2)







Finally, I want to demonstrate how you can take advantage of scalars and vectors when graphing. Creating scalar and vectors objects is really helpful when you are doing the same task multiple times.  I give the example of creating a bunch of scatterplots.

I want to make a scatterplot for each of three variables (Height, Weight, and BMI) against age.  Since all three scatterplots are going to be very similar, I want to standardize all of my plotting arguments including the range of ages, the plot symbols and the plot colors.  I want to include a vertical line for the mean age and a title for each plot.  The code is below:


##Assign numeric vector for the range of x-axis
agelimit<-c(20,80)

##Assign numeric single scalar to plotsymbols and meanage
plotsymbols<-2
meanage<-mean(mydata$Age)

##Assign single character words to plottype and plotcolor 
plottype<-"p"
plotcolor<-"darkgreen"

##Assign a vector of characters to titletext
titletext<-c("Scatterplot", "vs Age")

Ok, so now that I have all those assigned, I can plot the three plots all together using the following code.  Notice how all the highlighted code is the same in each plot (except for the main title) and I'm using the assigned objects I just created.  The great part about this is that if I decide I actually want to plot color to be red, I can change it in just one place.  You can think about how this would be useful in other situations (data cleaning, regressions, etc) when you do the same thing multiple times and then decide to change one little parameter. If you're not sure about the code below, I posted on the basics of plotting here.

##Plot area is 1 row, 3 columns
par(mfrow=c(1,3))

##Plot all three plots using the assigned objects
plot(mydata$Age, mydata$Height, xlab="Age", ylab="Height", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Height", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$Weight, xlab="Age", ylab="Weight", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Weight", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$BMI, xlab="Age", ylab="BMI", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "BMI", titletext[2]))
abline(v=meanage)


Notice how I do the main title with the paste statement.  Paste() is useful for combining words and elements of another variable together into one phrase.  The output looks like this, below.  Pretty nice!










Monday, October 15, 2012

What a nice looking scatterplot!


This week, we look at plotting data using scatterplots. I'll definitely have a post on other ways of plotting data, like boxplots or histograms.

Our data from last week remains the same:


First, a quick way to look at all of your continuous variables at once is just to do a plot command of your data.  Here, I will subset the data to just take three columns and plot those against each other:

plot(mydata[,c(2,4,5)])

This takes all rows, and the columns 2, 4, and 5 from the dataset and plots them all against each other, like this:



Next, I want to make a nice scatterplot of Weight on Height.  The basic format is

plot(xvariable, yvariable)

So it looks like this:

plot(mydata$Weight, mydata$Height)


But this is a little ugly.  Fortunately, there are a million options that I can take advantage of.  In this first post on plotting, I will:

  • add labels for the x and y axes (xlab and ylab, respectively) 
  • change the dimensions of the plot so it's not quite so condensed (xlim and ylim)
  • add a title (main) and change the font size of the title (cex.main)
  • get rid of the frame around the plot (frame.plot=FALSE)
  • change the type of plotting symbol from little circles to little trianges (pch=2) and make those little triangles blue (col="blue").


plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE , col="blue")



Now, let's get a little more complicated.  I want the color of the plot symbol to be indicative of whether the observation is male or female, and to put a legend in there too. This is super easy right inside the plot function call using the ifelse() statement.  To review, the ifelse() statement is similar to the cond() statement in Stata.  It looks like this:

ifelse(condition, result if condition is true, result if condition is false)

So here I change my parameter col=blue to col=ifelse(mydata$Sex==1, "red", "blue"). This is saying that if the sex is a 1, make the color of the triangle red, else make it blue:

plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE, col=ifelse(mydata$Sex==1, "red", "blue"))

Then I add in the legend. The first two parameters of the legend function are the x and y points where the legend should begin (here at the point (80,75)).  Then I indicate that I want two triangle symbols (pch=c(2,2)).  The first 2 is for the number of symbols and the second 2 is to indicate that pch=2 (a triangle) as it was in the previous example.  Next I say I want the first triangle to be red and the second one blue.  I label the two symbols with the labels "Male" and "Female".  Next, I indicate I want a box around the legend (bty="o") and that I want the box to be darkgreen.  Finally, I indicate that the font size of the whole legend text should be .8. (cex=.8).

legend(80, 75, pch=c(2,2), col=c("red", "blue"), c("Male", "Female"), bty="o",  box.col="darkgreen", cex=.8)



Incidentally, I could get the same plot by identifying "topleft" in my legend() call, as below.  But sometimes it's nice to put the legend exactly where you want it and the legend options only allow for “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right”, “center”.

legend("topleft", pch=c(2,2), col=c("red", "blue"), c("Male", "Female"), bty="o", cex=.8, box.col="darkgreen")

Finally, one of the really nice aspects of R is being able to manipulate the plot region and make it do exactly what you want.  For starters, we can have two plots side by side, by indicating:

par(mfrow=c(1,2))

meaning one row and two columns. If I wanted a 2 by 2 plot area, I would do mfrow=c(2,2).

Now I'll show the full code for the plot below.  The first part is just the first plot we already did, but I add in a vertical line at the average weight and add in text.  The second plot is Height on Age, and I add in the linear regression line.  To do this is quite easy.  I start by running the regression of Height on Age and save it as "reg".  Then I use abline() to add the line to the plot.  Finally, I use the text() function to add text to the plot anywhere I want.  I walk you through the code below:


##set up the plot area with 1 row and 2 columns of plots
par(mfrow=c(1,2))

##first plot height on weight
plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE, col="blue")

##add in the vertical line at the mean of the weight, using na.rm=TRUE to remove the NAs from consideration
abline(v=mean(mydata$Weight, na.rm=TRUE), col="orange")

##add in text at the point (140, 73), with font size .8. The position is 4, meaning that the text moves to the right from the starting point. The "\n" is a carriage return (moves the text to the next line)
text(140,73, cex=.8, pos=4, "Orange line is\n sample average\n weight")

##add in the second plot of height on age
plot(mydata$Age, mydata$Height, xlab="Age (years)", ylab="Height (inches)", xlim=c(0,80), ylim=c(55,75), main="Height vs Age", pch=3, cex.main=1.5, frame.plot=FALSE, col="darkred")


##run a linear regression of Height on Age - if this is confusing, I'll do a post on linear regressions very soon
reg<-lm(Height~Age, data=mydata)

##add the regression line to the plot
abline(reg)


##add text to the plot. Start at the point 0,70.  Position the text to the right of this point (pos=4), make the font smaller (cex=.8), and add in the text using the paste function since I'm pasting in both text and the contents of some variables. For the text, I extract the intercept by taking the first coefficient from the reg object with the code reg$coef[1]; and the coefficient on Age by taking the second coefficient, reg$coef[2].  I round both of those to the second decimal point using round(x,2). There's a lot going on here but hopefully I've unpacked it for everyone.
text(0,72, paste("Height ~ ", round(reg$coef[1],2), "+", round(reg$coef[2],2), "*Age"), pos=4, cex=.8)


Plots are really fun to do in R.  This post was just a basic introduction and more will come on the many other interesting plotting features one can take advantage of in R.  If you want to see more options in R plotting, you can always look at R documentation, or other R blogs and help pages.  Here are a few: