Public Health: functions

Showing posts with label functions. Show all posts

Friday, October 4, 2013

Loops revisited: How to rethink macros when using R

Thursday, July 11, 2013

Let's Make a Date: Date and Time classes in R

Tuesday, June 25, 2013

Sample size calculations equivalent to Stata functions

Sunday, March 17, 2013

Extracting Information From Objects Using Names()

One of the big differences between a language like Stata compared to R is the ability in R to handle many different types of objects at once, and combine them together or pull them apart. I had a post about objects last year, but I thought I'd show in this post how to extract information from objects you create in R.

For this example, I'll go back to a dataset I've used in the past called mydata.Rdata and it's in the Code and Data Download site.

One function that is extremely useful to know is names(). The names() function will show you everything that is stored in R under that object name. So, for example, if you do

where mydata is a dataframe object, you will get the names of the columns, which are the vectors that comprise the dataframe. Note that names(mydata) is an object itself (because everything is an object in R) - it is a character vector of length 7. You can save this vector and print out the class to verify this.

But names() can be useful for much more than just column names, as we'll see in a moment.

But before we go on, let's take a moment to remember how subsetting works. In subsetting, you use square brackets to pull out exactly the element of an object that you want. So if I want to subset a dataframe, I can say

mydata.subset<-mydata[,c(1:2)]

which is saving into the new object mydata.subset, all the rows and only the first two columns of the mydata dataframe.

Now, let's combine the concept of using the names() function with the concept of subsetting to change one of the column names of our dataset:

names(mydata)[4]<-"Weight_lbs"

Here we are saying, of the names(mydata) object, take the fourth component and make it "Weight_lbs". Now, if you run the names() on our dataframe, we find the change has been made:

Ok, so now we'll see how the names() function can be used in other applications.

1. Summary objects

There are two ways to extract information from objects in R, using subsetting and using the "$" operator.

Below, we summarize the Age vector and store the results in sum.vec. We print out the sum.vec object and the print out the corresponding names. Now we can extract the 1st element of the summary vector of Age in the following way using the [ ] operator.

This gives us the first element, which is the minimum. We could also do:

sum.vec[c(2,3,5)]

for the 25th, 50th, and 75th percentiles.

The other way to extract is by using "$". For example, the summary() function on a table object gives you a Chi squared test:

Here, you can extract any of the pieces of information that came out in the test, including the number of cases, the number of variables, the test statistic, etc. We can extract the pvalue of the test statistic by using the "$" operator, like this:

Let's see how this can be useful in the next example.

2. Regressions and statistical tests

The standard linear regression that we run in R is using lm(). It looks like this:

But there's a lot more that R has calculated that is not shown here. We can see this by saving this linear regression as an object and running names() on it:

So we see that saved under the reg.object are the coefficients, the residuals, fitted values, degrees of freedom, and a lot more. To find out everything that names() provides for a given object, look it up by doing ?lm. Now, to extract any of these components, like the residuals, use the "$" operator like this:

reg.object$residuals

You can make use of this extraction by taking the mean of the residuals

or plotting their distribution:

hist(reg.object$residuals, main="Distribution of Residuals" ,xlab="Residuals")

Don't forget that you can summarize regression objects using summary(), and get the names() of that summary too, like this:

summary(reg.object)
names(summary(reg.object))

which will give you more objects you can extract from your regression. You can use the names() function on any statistical model or function such as aov(), t.test(), chisq.test(), etc.

3. Histograms and boxplots

Finally, let's go back to that histogram and save that into an object. There are objects under names() of the histogram object now:

I showed how you can manipulate those in my post on histograms.

Similarly, for boxplot:

Here I've extracted the stats object which gives you the lower whisker, the lower hinge, the median, the upper hinge, and the upper whisker for each group, which you can see below.

Monday, February 25, 2013

Normal distribution functions

Ah, the Central Limit Theorem. The basis of much of statistical inference and how we get those 95% confidence intervals. It's just so beautiful! Lately, I have found myself looking up the normal distribution functions in R. They can be difficult to keep straight, so this post will give a succinct overview and show you how they can be useful in your data analysis.

To start, here is a table with all four normal distribution functions and their purpose, syntax, and an example:

	Purpose	Syntax	Example
rnorm	Generates random numbers from normal distribution	rnorm(n, mean, sd)	rnorm(1000, 3, .25) Generates 1000 numbers from a normal with mean 3 and sd=.25
dnorm	Probability Density Function (PDF)	dnorm(x, mean, sd)	dnorm(0, 0, .5) Gives the density (height of the PDF) of the normal with mean=0 and sd=.5.
pnorm	Cumulative Distribution Function (CDF)	pnorm(q, mean, sd)	pnorm(1.96, 0, 1) Gives the area under the standard normal curve to the left of 1.96, i.e. ~0.975
qnorm	Quantile Function - inverse of pnorm	qnorm(p, mean, sd)	qnorm(0.975, 0, 1) Gives the value at which the CDF of the standard normal is .975, i.e. ~1.96

Note that for all functions, leaving out the mean and standard deviation would result in default values of mean=0 and sd=1, a standard normal distribution.

Another important note for the pnorn() function is the ability to get the right hand probability using the lower.tail=FALSE option. For example,

In the first line, we are calculating the area to the left of 1.96, while in the second line we are calculating the area to the right of 1.96.

With these functions, I can do some fun plotting. I create a sequence of values from -4 to 4, and then calculate both the standard normal PDF and the CDF of each of those values. I also generate 1000 random draws from the standard normal distribution. I then plot these next to each other. Whenever you use probability functions, you should, as a habit, remember to set the seed. Setting the seed means locking in the sequence of "random" (they are pseudorandom) numbers that R gives you, so you can reproduce your work later on.

set.seed(3000)
xseq<-seq(-4,4,.01)
densities<-dnorm(xseq, 0,1)
cumulative<-pnorm(xseq, 0, 1)
randomdeviates<-rnorm(1000,0,1)
 
par(mfrow=c(1,3), mar=c(3,4,4,2))

plot(xseq, densities, col="darkgreen",xlab="", ylab="Density", type="l",lwd=2, cex=2, main="PDF of Standard Normal", cex.axis=.8)

plot(xseq, cumulative, col="darkorange", xlab="", ylab="Cumulative Probability",type="l",lwd=2, cex=2, main="CDF of Standard Normal", cex.axis=.8)

hist(randomdeviates, main="Random draws from Std Normal", cex.axis=.8, xlim=c(-4,4))

The par() parameters set up a plotting area of 1 row and 3 columns (mfrow), and move the three plots closer to each other (mar). Here is a good explanation of the plotting area. The output is below:

Now, when we have our actual data, we can do a visual check of the normality of our outcome variable, which, if we assume a linear relationship with normally distributed errors, should also be normal. Let's make up some data, where I add noise by using rnorm() - here I'm generating the same amount of random numbers as is the length of the xseq vector, with a mean of 0 and a standard deviation of 5.5.

xseq<-seq(-4,4,.01)

y<-2*xseq + rnorm(length(xseq),0,5.5)

And now I can plot a histogram of y (check out my post on histograms if you want more detail) and add a curve() function to the plot using the mean and standard deviation of y as the parameters:

hist(y, prob=TRUE, ylim=c(0,.06), breaks=20)
curve(dnorm(x, mean(y), sd(y)), add=TRUE, col="darkblue", lwd=2)

Here, the curve() function takes as its first parameter a function itself (or an expression) that must be written as some function of x. Our function here is dnorm(). The x in the dnorm() function is not an object we have created; rather, it's indicating that there's a variable that is being evaluated, and the evaluation is the normal density at the mean of y and standard deviation of y. Make sure to include add=TRUE so that the curve is plotted on the same plot as the histogram. Here is what we get:

Here are some other good sources on the topic of probability distribution functions:

Monday, January 14, 2013

For loops (and how to avoid them)

My experience when starting out in R was trying to clean and recode data using for() loops, usually with a few if() statements in the loop as well, and finding the whole thing complicated and frustrating.

In this post, I'll go over how you can avoid for() loops for both improving the quality and speed of your programming, as well as your sanity.

So here we have our classic dataset called mydata.Rdata (you can download this if you want, link at the right):

And if I were in Stata and wanted to create an age group variable, I could just do:

gen Agegroup=1
replace Agegroup=2 if Age>10 & Age<20
replace Agegroup=3 if Age>=20

But when I try this in R, it fails:

Why does it fail? It fails because Age is a vector so the condition if(mydata$Age<10) is asking "is the vector Age less than 10", which is not what we want to know. We want to ask, row by row is each element of Age<10, so we need to specify the element of the vector we're referring to. We don't specify the element and thus we get the warning (really, error), "only the first element will be used." So when this fails, the first way people try to solve this problem is with a crazy for() loop like this:

###########Unnecessarily long and ugly code below#######
mydata$Agegroup1<-0

for (i in 1:10){
if(mydata$Age[i]>10 & mydata$Age[i]<20){
mydata$Agegroup1[i]<-1
}
if(mydata$Age[i]>=20){
mydata$Agegroup1[i]<-2
}
}

Here we tell R to go down the rows from i=1 to i=10, and for each of those rows indexed by i, check to see what value of Age it is, and then assign Agegroup a value of 1 or 2. This works, but at a high cost - you can easily make a mistake with all those indexed vectors, and also for() loops take a lot of computing time, which would be a big deal if this dataset were 10000 observations instead of 10.

So how can we avoid doing this?

One of the most useful functions I have found is one that I have referred to a number of times in my blog so far - the ifelse() function. The ifelse() function evaluates a condition, and then assigns a value if it's true and a value if it's false. The great part about it is that it can read in a vector and check each element of the vector one by one so you don't need indices or a loop. You don't even need to initialize some new variable before you run the statement. Like this:

mydata$newvariable<-ifelse(Condition of some variable,
Value of new variable if condition is true,
Value of new variable if condition is false)

so for example:

mydata$Old<-ifelse(mydata$Age>40,1,0)

This says, check to see if the elements of the vector mydata$Age are greater than 40: if an element is greater than 40, it assigns the value of 1 to mydata$Old, and if it's not greater than 40, it assigns the value of 0 to mydata$Old.

But we wanted to assign values 0, 1, and 2 to an Agegroup variable. To do this, we can use nested ifelse() statements:

mydata$Agegroup2<-ifelse(mydata$Age>10 & mydata$Age<20,1,
ifelse(mydata$Age>20, 2,0))

Now this says, first check whether each element of the Age vector is >10 and <20. If it is, assign 1 to Agegroup2. If it's not, then evaluate the next ifelse() statement, whether Age>20. If it is, assign Agegroup2 a value of 2. If it's not any of those, then assign it 0. We can see that both the loop and the ifelse() statements give us the same result:

You can nest ifelse() statement as much as you like. Just be careful about your final category - it assigns the last value to whatever values are left over that didn't meet any condition (including if a value is NA!) so make sure you want that to happen.

Other examples of ways to use the ifelse() function:

If you want to add a column with the mean of Weight by sex for each individual, you can do this with ifelse() like this:

mydata$meanweight.bysex<-ifelse(mydata$Sex==0,

mean(mydata$Weight[mydata$Sex==0], na.rm=TRUE),

mean(mydata$Weight[mydata$Sex==1], na.rm=TRUE))

If you want to recode missing values:

mydata$Height.recode<-ifelse(is.na(mydata$Height),
9999,
mydata$Height)

If you want to combine two variables together into a new one, such as to create a new ID variable based on year (which I added to this dataframe) and ID:

mydata$ID.long<-ifelse(mydata$ID<10,

paste(mydata$year, "-0",mydata$ID,sep=""),

paste(mydata$year, "-", mydata$ID, sep=""))

Other ways to avoid the for loop:

The apply functions: If you think you have to use a loop because you have to apply some sort of function to each observation in your data, think again! Use the apply() functions instead. For example:

If you have a lot of missing values and want to recode them all at once, or want to sum up the number of times you see a certain value in a row, check out my post on the apply function here.

You can also use other functions such as cut() to do the age grouping above. Here's the post on how this function works, so I won't go over it again, except to say if you convert from a factor to a numeric, *always* convert to a character before converting it to numeric:

mydata$Agegroup3<-as.numeric(as.character(cut(mydata$Age, c(0,10,20,100),labels=0:2)))

Basically, any time you think you have to do a loop, think about how you can do it with another function. It will save you a lot of time and mistakes in your code.

Pages