Pages

Showing posts with label plot. Show all posts
Showing posts with label plot. Show all posts

Monday, February 25, 2013

Normal distribution functions

Ah, the Central Limit Theorem.  The basis of much of statistical inference and how we get those 95% confidence intervals.  It's just so beautiful!  Lately, I have found myself looking up the normal distribution functions in R.  They can be difficult to keep straight, so this post will give a succinct overview and show you how they can be useful in your data analysis.

To start, here is a table with all four normal distribution functions and their purpose, syntax, and an example:

PurposeSyntaxExample
rnormGenerates random numbers 
from normal distribution
rnorm(n, mean, sd)rnorm(1000, 3, .25)
Generates 1000 numbers
from a normal with mean 3
and sd=.25
dnormProbability Density Function
(PDF)
dnorm(x, mean, sd)dnorm(0, 0, .5)
Gives the density (height of the
PDF) of the normal
with mean=0 and sd=.5. 
pnormCumulative Distribution Function
(CDF)
pnorm(q, mean, sd)pnorm(1.96, 0, 1)
Gives the area under the
standard normal curve to
the left of 1.96,
i.e. ~0.975
qnormQuantile Function - inverse of
pnorm
qnorm(p, mean, sd)qnorm(0.975, 0, 1)
Gives the value at which the
CDF of the standard normal
is .975, i.e. ~1.96

Note that for all functions, leaving out the mean and standard deviation would result in default values of mean=0 and sd=1, a standard normal distribution.

Another important note for the pnorn() function is the ability to get the right hand probability using the lower.tail=FALSE option.  For example,







In the first line, we are calculating the area to the left of 1.96, while in the second line we are calculating the area to the right of 1.96.

With these functions, I can do some fun plotting. I create a sequence of values from -4 to 4, and then calculate both the standard normal PDF and the CDF of each of those values.  I also generate 1000 random draws from the standard normal distribution. I then plot these next to each other. Whenever you use probability functions, you should, as a habit, remember to set the seed. Setting the seed means locking in the sequence of "random" (they are pseudorandom) numbers that R gives you, so you can reproduce your work later on.

set.seed(3000)
xseq<-seq(-4,4,.01)
densities<-dnorm(xseq, 0,1)
cumulative<-pnorm(xseq, 0, 1)
randomdeviates<-rnorm(1000,0,1)
 
par(mfrow=c(1,3), mar=c(3,4,4,2))

plot(xseq, densities, col="darkgreen",xlab="", ylab="Density", type="l",lwd=2, cex=2, main="PDF of Standard Normal", cex.axis=.8)

plot(xseq, cumulative, col="darkorange", xlab="", ylab="Cumulative Probability",type="l",lwd=2, cex=2, main="CDF of Standard Normal", cex.axis=.8)

hist(randomdeviates, main="Random draws from Std Normal", cex.axis=.8, xlim=c(-4,4))

The par() parameters set up a plotting area of 1 row and 3 columns (mfrow), and move the three plots closer to each other (mar). Here is a good explanation of the plotting area.  The output is below:

















Now, when we have our actual data, we can do a visual check of the normality of our outcome variable, which, if we assume a linear relationship with normally distributed errors, should also be normal. Let's make up some data, where I add noise by using rnorm() - here I'm generating the same amount of random numbers as is the length of the xseq vector, with a mean of 0 and a standard deviation of 5.5.

xseq<-seq(-4,4,.01)
y<-2*xseq + rnorm(length(xseq),0,5.5)

And now I can plot a histogram of y (check out my post on histograms if you want more detail) and add a curve() function to the plot using the mean and standard deviation of y as the parameters:

hist(y, prob=TRUE, ylim=c(0,.06), breaks=20)
curve(dnorm(x, mean(y), sd(y)), add=TRUE, col="darkblue", lwd=2)

Here, the curve() function takes as its first parameter a function itself (or an expression) that must be written as some function of x.  Our function here is dnorm(). The x in the dnorm() function is not an object we have created; rather, it's indicating that there's a variable that is being evaluated, and the evaluation is the normal density at the mean of y and standard deviation of y. Make sure to include add=TRUE so that the curve is plotted on the same plot as the histogram.  Here is what we get:


Here are some other good sources on the topic of probability distribution functions:

Saturday, December 22, 2012

Basics of Histograms


Histograms are used very often in public health to show the distributions of your independent and dependent variables.  Although the basic command for histograms (hist()) in R is simple, getting your histogram to look exactly like you want takes getting to know a few options of the plot.  Here I present ways to customize your histogram for your needs.

First, I want to point out that ggplot2 is a package in R that does some amazing graphics, including histograms.  I will do a post on ggplot2 in the coming year.   However, the hist() function in base R is really easy and fast, and does the job for most of your histogram-ing needs. However, if you want to do complicated histograms, I would recommend reading up on ggplot2.

Okay so for our purposes today, instead of importing data, I'll create some normally distributed data myself. In R, you can generate normal data this way using the rnorm() function:

BMI<-rnorm(n=1000, m=24.2, sd=2.2) 

So now we have some BMI data, and the basic histogram plot that comes out of R looks like this:

hist(BMI)


Which is actually pretty nice.  There are a number of things that R does by default in creating this histogram, and I think it's useful to print out that information to understand the parameters of this histogram.  You can do this by saving the histogram as an object and then printing it like this:

histinfo<-hist(BMI)
histinfo

And you get the output below:



This is helpful because you can see how R has decided to break up your data by default. It shows the breaks, which are the cutoff points for the bins. It shows the counts, intensity/density for each bin (same thing but two different names for R version compatibility), the midpoints of each bin, and then the name of the variable, whether the bins are equidistant, and the class of the object. You can of course take any one of these outputs by itself, i.e.  histinfo$counts would give you just the vector of counts.

Now we can manipulate this information to customize our plot.

1. Number of bins

R chooses how to bin your data for you by default using an algorithm, but if you want coarser or finer groups, there are a number of ways to do this. We can see that right now from the output above that the breaks go from 17 to 32 by 1.  You can use the breaks() option to change this in a number of ways.  An easy way is just to give it one number that gives the number of cells for the histogram:


hist(BMI, breaks=20, main="Breaks=20")
hist(BMI, breaks=5, main="Breaks=5")



The bins don't correspond to exactly the number you put in, because of the way R runs its algorithm to break up the data but it gives you generally what you want. If you want more control over exactly the breakpoints between bins, you can be more precise with the breaks() option and give it a vector of breakpoints, like this:

hist(BMI, breaks=c(17,20,23,26,29,32), main="Breaks is vector of breakpoints")


This dictates exactly the start and end point of each bin.  Of course, you could give the breaks vector as a sequence like this to cut down on the messiness of the code:

hist(BMI, breaks=seq(17,32,by=3), main="Breaks is vector of breakpoints")

Note that when giving breakpoints, the default for R is that the histogram cells are right-closed (left open) intervals of the form (a,b]. You can change this with the right=FALSE option, which would change the intervals to be of the form [a,b).  This is important if you have a lot of points exactly at the breakpoint.


2. Frequency vs Density 

Often, we are more interested in density than frequency, since frequency is relative to your sample size. Instead of counting the number of datapoints per bin, R can give the probability densities using the freq=FALSE option:


hist(BMI, freq=FALSE, main="Density plot")


Notice the y-axis now.  If the breaks are equidistant, with difference between breaks=1, then the height of each rectangle is proportional to the number of points falling into the cell, and thus the sum of the probability densities adds up to 1.  Here I specify plot=FALSE because I just want the histogram output, not the plot, and show how the sum of all of the densities is 1:


However, if you choose to make bins that are not all separated by 1 (like breaks=c(17,25,26, 32) or something like that), then the plot still has an area of 1, but the area of the rectangles is the fraction of data points falling into the cells. The densities are calculated like this as counts/(n*diff(breaks).  Thus, this adds up to 1 if add up the areas of the rectangles, i.e. you multiply each density by the difference in the breaks like this:











3. Plot aesthetics 

Finally, we can make the histogram better looking by adjusting the x-axis, y-axis, axis labels, title, and color like this:


hist(BMI, freq=FALSE, xlab="Body Mass Index", main="Distribution of Body Mass Index", col="lightgreen", xlim=c(15,35),  ylim=c(0, .20))

Here along with our frequency option, I changed the x-axis label, changed the main title, made the color light green, and provided limits for both the x-axis and y-axis.  Note that defining the look of you axis using xlim() will not have any impact on the bins - this option is only for the aesthetics of the plot itself.

Finally, I can add a nice normal distribution curve to this plot using the curve() function, in which I specify a normal density function with mean and standard deviation that is equal to the mean and standard deviation of my data, and I add this to my previous plot with a dark blue color and a line width of 2. You can play around with these options to get the kind of line you want:

curve(dnorm(x, mean=mean(BMI), sd=sd(BMI)), add=TRUE, col="darkblue", lwd=2) 


And we get this!


Thursday, November 8, 2012

Data types part 2: Using classes to your advantage


Last week I talked about objects including scalars, vectors, matrices, dataframes, and lists.  This post will show you how to use the objects (and their corresponding classes) you create in R to your advantage.

First off, it's important to remember that columns of dataframes are vectors.  That is, if I have a dataframe called mydata, the columns mydata$Height and mydata$Weight are vectors. Numeric vectors can be multiplied or added together, squared, added or multiplied by a constant, etc. Operations on vectors are done element by element, meaning here row by row.

First, I read in a file of data, called mydata, using the read.csv() function. I get the dataframe below:


I check the classes of my objects using class(), or all at the same time with ls.str().

class(mydata$Weight)
class(mydata$Height)

or










So I see that mydata is a dataframe and all my columns are numeric (num).  Now, if I want to create a new column in my dataset which calculates BMI, I can do some vector operations:

mydata$BMI<-mydata$Weight/(mydata$Height)^2 * 703


Which is the formula for BMI from weight in pounds and height in inches. Notice how if any component of the calculation is a missing (NA) value, R calculates the BMI as NA as well.

Now I can do summary statistics on my data and store those as a matrix. For example, I start with summary statistics on my Age vector:

summary(mydata$Age)






If I want to extract an element of this summary table, say the minimum, I can do

summary(mydata$Age)[1]

which extracts the first element (of 6) of the summary table.

But what I really want is a summary matrix of a bunch of variables: Age, Sex, and BMI.  To do this I can rowbind the summary statistics of those three variables together using the rbind() function, but only take the 1st, 4th, and 6th elements of the summary table, which as you can see correspond to the Min, Mean, and Max. This creates a matrix, which I call summary.matrix:

summary.matrix<-rbind(summary(mydata$Age)[c(1,4,6)], summary(mydata$BMI)[c(1,4,6)], summary(mydata$Sex)[c(1,4,6)])

Rowbinding is basically stacking rows on top of each other.  I add rownames and then print the class of my summary matrix and the results.

rownames(summary.matrix)<-c("Age", "BMI", "Sex")
class(summary.matrix)
summary.matrix










There is also a much more efficient way of doing this using the apply() function.  Previously I had another post on the apply function, but I find that it takes a lot of examples to get comfortable with so here is another application.

Apply() is a great example of classes because it takes in a dataframe as the first argument (mydata, all rows, but I choose only columns 2, 3, and 7).  I then apply it to the numeric vector columns (MARGIN=2) of this subsetted dataframe, and then for each of those columns I perform the mean and standard deviation, removing the NA's from consideration.  I save this in a matrix I call summary.matrix2.

summary.matrix2<-apply(mydata[,c(2,3,7)], MARGIN=2, FUN=function(x) c(mean(x,na.rm=TRUE), sd(x, na.rm=TRUE)))

I then rename the rows of the this matrix and print the results, rounded to two decimal places.  Notice how the format of the final matrix is different here. Above the rows were the variables and the columns the summary statistics, while here it is reversed.  I could have column binded (cbind() instead of the rbind()) in the first case and I would have gotten the matrix transposed to be like this one.

rownames(summary.matrix2)<-c("Mean", "Stdev")
round(summary.matrix2, 2)







Finally, I want to demonstrate how you can take advantage of scalars and vectors when graphing. Creating scalar and vectors objects is really helpful when you are doing the same task multiple times.  I give the example of creating a bunch of scatterplots.

I want to make a scatterplot for each of three variables (Height, Weight, and BMI) against age.  Since all three scatterplots are going to be very similar, I want to standardize all of my plotting arguments including the range of ages, the plot symbols and the plot colors.  I want to include a vertical line for the mean age and a title for each plot.  The code is below:


##Assign numeric vector for the range of x-axis
agelimit<-c(20,80)

##Assign numeric single scalar to plotsymbols and meanage
plotsymbols<-2
meanage<-mean(mydata$Age)

##Assign single character words to plottype and plotcolor 
plottype<-"p"
plotcolor<-"darkgreen"

##Assign a vector of characters to titletext
titletext<-c("Scatterplot", "vs Age")

Ok, so now that I have all those assigned, I can plot the three plots all together using the following code.  Notice how all the highlighted code is the same in each plot (except for the main title) and I'm using the assigned objects I just created.  The great part about this is that if I decide I actually want to plot color to be red, I can change it in just one place.  You can think about how this would be useful in other situations (data cleaning, regressions, etc) when you do the same thing multiple times and then decide to change one little parameter. If you're not sure about the code below, I posted on the basics of plotting here.

##Plot area is 1 row, 3 columns
par(mfrow=c(1,3))

##Plot all three plots using the assigned objects
plot(mydata$Age, mydata$Height, xlab="Age", ylab="Height", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Height", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$Weight, xlab="Age", ylab="Weight", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Weight", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$BMI, xlab="Age", ylab="BMI", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "BMI", titletext[2]))
abline(v=meanage)


Notice how I do the main title with the paste statement.  Paste() is useful for combining words and elements of another variable together into one phrase.  The output looks like this, below.  Pretty nice!










Monday, October 15, 2012

What a nice looking scatterplot!


This week, we look at plotting data using scatterplots. I'll definitely have a post on other ways of plotting data, like boxplots or histograms.

Our data from last week remains the same:


First, a quick way to look at all of your continuous variables at once is just to do a plot command of your data.  Here, I will subset the data to just take three columns and plot those against each other:

plot(mydata[,c(2,4,5)])

This takes all rows, and the columns 2, 4, and 5 from the dataset and plots them all against each other, like this:



Next, I want to make a nice scatterplot of Weight on Height.  The basic format is

plot(xvariable, yvariable)

So it looks like this:

plot(mydata$Weight, mydata$Height)


But this is a little ugly.  Fortunately, there are a million options that I can take advantage of.  In this first post on plotting, I will:

  • add labels for the x and y axes (xlab and ylab, respectively) 
  • change the dimensions of the plot so it's not quite so condensed (xlim and ylim)
  • add a title (main) and change the font size of the title (cex.main)
  • get rid of the frame around the plot (frame.plot=FALSE)
  • change the type of plotting symbol from little circles to little trianges (pch=2) and make those little triangles blue (col="blue").


plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE , col="blue")



Now, let's get a little more complicated.  I want the color of the plot symbol to be indicative of whether the observation is male or female, and to put a legend in there too. This is super easy right inside the plot function call using the ifelse() statement.  To review, the ifelse() statement is similar to the cond() statement in Stata.  It looks like this:

ifelse(condition, result if condition is true, result if condition is false)

So here I change my parameter col=blue to col=ifelse(mydata$Sex==1, "red", "blue"). This is saying that if the sex is a 1, make the color of the triangle red, else make it blue:

plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE, col=ifelse(mydata$Sex==1, "red", "blue"))

Then I add in the legend. The first two parameters of the legend function are the x and y points where the legend should begin (here at the point (80,75)).  Then I indicate that I want two triangle symbols (pch=c(2,2)).  The first 2 is for the number of symbols and the second 2 is to indicate that pch=2 (a triangle) as it was in the previous example.  Next I say I want the first triangle to be red and the second one blue.  I label the two symbols with the labels "Male" and "Female".  Next, I indicate I want a box around the legend (bty="o") and that I want the box to be darkgreen.  Finally, I indicate that the font size of the whole legend text should be .8. (cex=.8).

legend(80, 75, pch=c(2,2), col=c("red", "blue"), c("Male", "Female"), bty="o",  box.col="darkgreen", cex=.8)



Incidentally, I could get the same plot by identifying "topleft" in my legend() call, as below.  But sometimes it's nice to put the legend exactly where you want it and the legend options only allow for “bottomright”, “bottom”, “bottomleft”, “left”, “topleft”, “top”, “topright”, “right”, “center”.

legend("topleft", pch=c(2,2), col=c("red", "blue"), c("Male", "Female"), bty="o", cex=.8, box.col="darkgreen")

Finally, one of the really nice aspects of R is being able to manipulate the plot region and make it do exactly what you want.  For starters, we can have two plots side by side, by indicating:

par(mfrow=c(1,2))

meaning one row and two columns. If I wanted a 2 by 2 plot area, I would do mfrow=c(2,2).

Now I'll show the full code for the plot below.  The first part is just the first plot we already did, but I add in a vertical line at the average weight and add in text.  The second plot is Height on Age, and I add in the linear regression line.  To do this is quite easy.  I start by running the regression of Height on Age and save it as "reg".  Then I use abline() to add the line to the plot.  Finally, I use the text() function to add text to the plot anywhere I want.  I walk you through the code below:


##set up the plot area with 1 row and 2 columns of plots
par(mfrow=c(1,2))

##first plot height on weight
plot(mydata$Weight, mydata$Height, xlab="Weight (lbs)", ylab="Height (inches)", xlim=c(80,200), ylim=c(55,75), main="Height vs Weight", pch=2, cex.main=1.5, frame.plot=FALSE, col="blue")

##add in the vertical line at the mean of the weight, using na.rm=TRUE to remove the NAs from consideration
abline(v=mean(mydata$Weight, na.rm=TRUE), col="orange")

##add in text at the point (140, 73), with font size .8. The position is 4, meaning that the text moves to the right from the starting point. The "\n" is a carriage return (moves the text to the next line)
text(140,73, cex=.8, pos=4, "Orange line is\n sample average\n weight")

##add in the second plot of height on age
plot(mydata$Age, mydata$Height, xlab="Age (years)", ylab="Height (inches)", xlim=c(0,80), ylim=c(55,75), main="Height vs Age", pch=3, cex.main=1.5, frame.plot=FALSE, col="darkred")


##run a linear regression of Height on Age - if this is confusing, I'll do a post on linear regressions very soon
reg<-lm(Height~Age, data=mydata)

##add the regression line to the plot
abline(reg)


##add text to the plot. Start at the point 0,70.  Position the text to the right of this point (pos=4), make the font smaller (cex=.8), and add in the text using the paste function since I'm pasting in both text and the contents of some variables. For the text, I extract the intercept by taking the first coefficient from the reg object with the code reg$coef[1]; and the coefficient on Age by taking the second coefficient, reg$coef[2].  I round both of those to the second decimal point using round(x,2). There's a lot going on here but hopefully I've unpacked it for everyone.
text(0,72, paste("Height ~ ", round(reg$coef[1],2), "+", round(reg$coef[2],2), "*Age"), pos=4, cex=.8)


Plots are really fun to do in R.  This post was just a basic introduction and more will come on the many other interesting plotting features one can take advantage of in R.  If you want to see more options in R plotting, you can always look at R documentation, or other R blogs and help pages.  Here are a few: