Showing posts with label apply. Show all posts

Monday, January 14, 2013

For loops (and how to avoid them)

My experience when starting out in R was trying to clean and recode data using for() loops, usually with a few if() statements in the loop as well, and finding the whole thing complicated and frustrating.

In this post, I'll go over how you can avoid for() loops to improve both the quality and the speed of your code, and to preserve your sanity.

So here we have our classic dataset called mydata.Rdata (you can download this if you want, link at the right):



And if I were in Stata and wanted to create an age group variable, I could just do:

gen Agegroup=1
replace Agegroup=2 if Age>10 & Age<20
replace Agegroup=3 if Age>=20

But when I try this in R, it fails:
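The original output isn't reproduced here, so what follows is only a minimal sketch of what such an attempt might look like. The three-row mydata is a made-up stand-in for the real dataset, and tryCatch() is there just so the snippet runs to the end whether your version of R signals a warning or an error.

```r
# Made-up stand-in for mydata (the real file is not reproduced here)
mydata <- data.frame(Age = c(5, 15, 25))

# Comparing a vector to a number gives one TRUE/FALSE per element
mydata$Age < 10

# if() expects a single TRUE/FALSE, so handing it the whole vector goes wrong.
# tryCatch() records whether this version of R signals a warning or an error.
outcome <- tryCatch({
  if (mydata$Age < 10) mydata$Agegroup <- 0
  "ran without complaint"
}, warning = function(w) "warning", error = function(e) "error")
outcome
```

Either way, the Stata-style "one condition for the whole column" idea does not carry over to if() directly.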







Why does it fail? It fails because Age is a vector, so the condition if(mydata$Age<10) asks "is the vector Age less than 10?", which is not what we want to know.  We want to ask, row by row, whether each element of Age is less than 10, so we need to specify which element of the vector we're referring to. Because we don't, we get the warning (really, an error) "only the first element will be used"; in R 4.2.0 and later this is an outright error rather than a warning.  So when this fails, the first way people try to solve this problem is with a crazy for() loop like this:

###########Unnecessarily long and ugly code below#######
mydata$Agegroup1<-0

for (i in  1:10){
  if(mydata$Age[i]>10 & mydata$Age[i]<20){
    mydata$Agegroup1[i]<-1
  }
  if(mydata$Age[i]>=20){
    mydata$Agegroup1[i]<-2
  }
}

Here we tell R to go down the rows from i=1 to i=10, and for each row indexed by i, check the value of Age and then assign Agegroup1 a value of 1 or 2.  This works, but at a high cost: you can easily make a mistake with all those indexed vectors, and a for() loop written this way is typically much slower than vectorized code, which would be a big deal if this dataset had 10000 observations instead of 10.

So how can we avoid doing this?

One of the most useful functions I have found is one that I have referred to a number of times in my blog so far - the ifelse() function.  The ifelse() function evaluates a condition, and then assigns a value if it's true and a value if it's false.  The great part about it is that it can read in a vector and check each element of the vector one by one so you don't need indices or a loop. You don't even need to initialize some new variable before you run the statement.  Like this:

mydata$newvariable<-ifelse(Condition of some variable,
                    Value of new variable if condition is true,
                    Value of new variable if condition is false)

so for example:

mydata$Old<-ifelse(mydata$Age>40,1,0)

This says, check to see if the elements of the vector mydata$Age are greater than 40: if an element is greater than 40, it assigns the value of 1 to mydata$Old, and if it's not greater than 40, it assigns the value of 0 to mydata$Old.

But we wanted to assign values 0, 1, and 2 to an Agegroup variable.  To do this, we can use nested ifelse() statements:

mydata$Agegroup2<-ifelse(mydata$Age>10 & mydata$Age<20,1,     
                  ifelse(mydata$Age>=20, 2,0))

Now this says, first check whether each element of the Age vector is >10 and <20.  If it is, assign 1 to Agegroup2.  If it's not, then evaluate the next ifelse() statement, whether Age>=20 (the same cutoff the loop used).  If it is, assign Agegroup2 a value of 2.  If it's not any of those, then assign it 0.  We can see that both the loop and the ifelse() statements give us the same result:
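The comparison isn't reproduced here, so this is a small self-contained check you can run yourself. The five ages are made up (they include the boundary values 10 and 20), and both versions use >=20 for the top group so that an age of exactly 20 is treated the same way.

```r
# Made-up ages, including the boundary values 10 and 20
mydata <- data.frame(Age = c(5, 10, 15, 20, 35))

# Loop version
mydata$Agegroup1 <- 0
for (i in seq_len(nrow(mydata))) {
  if (mydata$Age[i] > 10 && mydata$Age[i] < 20) {
    mydata$Agegroup1[i] <- 1
  }
  if (mydata$Age[i] >= 20) {
    mydata$Agegroup1[i] <- 2
  }
}

# Nested ifelse() version: no index, no initialization
mydata$Agegroup2 <- ifelse(mydata$Age > 10 & mydata$Age < 20, 1,
                    ifelse(mydata$Age >= 20, 2, 0))

# TRUE when the two columns agree element by element
same <- identical(mydata$Agegroup1, mydata$Agegroup2)
same
```

Using seq_len(nrow(mydata)) instead of a hard-coded 1:10 is just so the loop keeps working if the number of rows changes.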


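You can nest ifelse() statements as much as you like. Just be careful about your final category: it assigns the last value to whatever values are left over that didn't meet any condition, so make sure you want that to happen. The one exception is missing values: if a condition evaluates to NA (for example because Age is NA), ifelse() returns NA for that element rather than the final value.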


Other examples of ways to use the ifelse() function:
  • If you want to add a column with the mean of Weight by sex for each individual, you can do this with ifelse() like this:
mydata$meanweight.bysex<-ifelse(mydata$Sex==0,  
               mean(mydata$Weight[mydata$Sex==0], na.rm=TRUE),         
               mean(mydata$Weight[mydata$Sex==1], na.rm=TRUE))



  • If you want to recode missing values:
mydata$Height.recode<-ifelse(is.na(mydata$Height),
                      9999, 
                      mydata$Height)

  • If you want to combine two variables together into a new one, such as to create a new ID variable based on year (which I added to this dataframe) and ID:
mydata$ID.long<-ifelse(mydata$ID<10, 
                paste(mydata$year, "-0",mydata$ID,sep=""), 
                paste(mydata$year, "-", mydata$ID, sep=""))
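The mean-by-sex example above works because Sex has only two values. As a hedged alternative, here is a sketch using ave() from base R, which fills in a group-level statistic for every row and doesn't care how many groups there are. The data frame below is made up for illustration.

```r
# Made-up data: two sexes, one missing weight
mydata <- data.frame(Sex    = c(0, 1, 0, 1, 1),
                     Weight = c(150, 120, NA, 130, 140))

# ave() splits Weight by Sex, applies FUN within each group,
# and returns a vector the same length as Weight
mydata$meanweight.bysex <- ave(mydata$Weight, mydata$Sex,
                               FUN = function(w) mean(w, na.rm = TRUE))
mydata
```

With three or more groups, the ifelse() version would need another level of nesting for each group, while this line stays the same.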



Other ways to avoid the for loop:

  • The apply functions:  If you think you have to use a loop because you have to apply some sort of function to each observation in your data, think again! Use the apply() functions instead.  For example:
  • You can also use other functions such as cut() to do the age grouping above (by default cut() closes its intervals on the right, so an age of exactly 20 lands in group 1 here rather than group 2). Here's the post on how this function works, so I won't go over it again, except to say that if you convert from a factor to a numeric, *always* convert to a character before converting it to numeric:
mydata$Agegroup3<-as.numeric(as.character(cut(mydata$Age, c(0,10,20,100),labels=0:2)))
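The example for the apply functions bullet isn't reproduced here, so the following is only a sketch of the general idea, with a made-up two-column data frame: a loop over columns next to the sapply() call that replaces it.

```r
# Made-up data with one missing height
mydata <- data.frame(Height = c(60, 65, NA),
                     Weight = c(120, 150, 180))

# Loop version: set up a result vector, then fill it column by column
col_means_loop <- numeric(ncol(mydata))
for (j in seq_len(ncol(mydata))) {
  col_means_loop[j] <- mean(mydata[[j]], na.rm = TRUE)
}

# sapply() version: apply mean() to each column, passing na.rm along
col_means <- sapply(mydata, mean, na.rm = TRUE)
col_means
```

The sapply() result also keeps the column names, which the loop version throws away.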


Basically, any time you think you have to do a loop, think about how you can do it with another function. It will save you a lot of time and mistakes in your code.


Thursday, November 8, 2012

Data types part 2: Using classes to your advantage


Last week I talked about objects including scalars, vectors, matrices, dataframes, and lists.  This post will show you how to use the objects (and their corresponding classes) you create in R to your advantage.

First off, it's important to remember that columns of dataframes are vectors.  That is, if I have a dataframe called mydata, the columns mydata$Height and mydata$Weight are vectors. Numeric vectors can be multiplied or added together, squared, added or multiplied by a constant, etc. Operations on vectors are done element by element, meaning here row by row.

First, I read in a file of data, called mydata, using the read.csv() function. I get the dataframe below:


I check the classes of my objects using class(), or all at the same time with ls.str().

class(mydata$Weight)
class(mydata$Height)

or

ls.str()









So I see that mydata is a dataframe and all my columns are numeric (num).  Now, if I want to create a new column in my dataset which calculates BMI, I can do some vector operations:

mydata$BMI<-mydata$Weight/(mydata$Height)^2 * 703


Which is the formula for BMI from weight in pounds and height in inches. Notice how if any component of the calculation is a missing (NA) value, R calculates the BMI as NA as well.
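To see the NA behavior on its own, here is a tiny made-up example (three people, with one missing weight and one missing height) run through the same formula.

```r
# Made-up data: person 2 has no weight, person 3 has no height
mydata <- data.frame(Weight = c(150, NA, 180),
                     Height = c(65, 70, NA))

# Element-by-element vector arithmetic, one BMI per row
mydata$BMI <- mydata$Weight / (mydata$Height)^2 * 703

# Any row with a missing input gets a missing BMI
is.na(mydata$BMI)
```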

Now I can do summary statistics on my data and store those as a matrix. For example, I start with summary statistics on my Age vector:

summary(mydata$Age)






If I want to extract an element of this summary table, say the minimum, I can do

summary(mydata$Age)[1]

which extracts the first element (of 6) of the summary table.

But what I really want is a summary matrix of a bunch of variables: Age, Sex, and BMI.  To do this I can rowbind the summary statistics of those three variables together using the rbind() function, but only take the 1st, 4th, and 6th elements of the summary table, which as you can see correspond to the Min, Mean, and Max. This creates a matrix, which I call summary.matrix:

summary.matrix<-rbind(summary(mydata$Age)[c(1,4,6)], summary(mydata$BMI)[c(1,4,6)], summary(mydata$Sex)[c(1,4,6)])

Rowbinding is basically stacking rows on top of each other.  I add rownames and then print the class of my summary matrix and the results.

rownames(summary.matrix)<-c("Age", "BMI", "Sex")
class(summary.matrix)
summary.matrix
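Here is a self-contained sketch of the same steps on made-up data. One variation, which is my own choice rather than anything the steps above require: the elements are picked by name ("Min.", "Mean", "Max.") instead of by position, which reads a little more clearly.

```r
# Made-up data
mydata <- data.frame(Age = c(20, 30, 40, 50),
                     Sex = c(0, 1, 1, 0),
                     BMI = c(22, 25, 27, 31))

# Names of the summary() elements to keep
pick <- c("Min.", "Mean", "Max.")

# Stack one row of summary statistics per variable
summary.matrix <- rbind(summary(mydata$Age)[pick],
                        summary(mydata$BMI)[pick],
                        summary(mydata$Sex)[pick])
rownames(summary.matrix) <- c("Age", "BMI", "Sex")
summary.matrix
```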










There is also a much more efficient way of doing this using the apply() function.  Previously I had another post on the apply function, but I find that it takes a lot of examples to get comfortable with so here is another application.

apply() is a great example of putting classes to work because it takes a dataframe (or matrix) as its first argument; here that is mydata with all rows, but only columns 2, 3, and 7.  I apply the function over the columns (MARGIN=2) of this subsetted dataframe, each of which is a numeric vector, and for each of those columns I compute the mean and standard deviation, removing the NA's from consideration.  I save the result in a matrix I call summary.matrix2.

summary.matrix2<-apply(mydata[,c(2,3,7)], MARGIN=2, FUN=function(x) c(mean(x,na.rm=TRUE), sd(x, na.rm=TRUE)))

I then rename the rows of this matrix and print the results, rounded to two decimal places.  Notice how the format of the final matrix is different here. Above, the rows were the variables and the columns the summary statistics, while here it is reversed.  I could have column-bound the summaries (cbind() instead of rbind()) in the first case and I would have gotten the matrix transposed to be like this one.

rownames(summary.matrix2)<-c("Mean", "Stdev")
round(summary.matrix2, 2)
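If you would rather have the variables as rows in this second matrix too, t() transposes it. A minimal sketch on made-up data, with columns selected by name rather than by number (an assumption made only so the example is self-contained):

```r
# Made-up data with one missing BMI
mydata <- data.frame(ID  = 1:4,
                     Age = c(20, 30, 40, 50),
                     Sex = c(0, 1, 1, 0),
                     BMI = c(22, NA, 27, 31))

# One column of results per variable: mean on top, sd underneath
summary.matrix2 <- apply(mydata[, c("Age", "Sex", "BMI")], MARGIN = 2,
                         FUN = function(x) c(mean(x, na.rm = TRUE),
                                             sd(x, na.rm = TRUE)))
rownames(summary.matrix2) <- c("Mean", "Stdev")

# Transpose so that variables are rows and statistics are columns
summary.byrow <- t(summary.matrix2)
round(summary.byrow, 2)
```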







Finally, I want to demonstrate how you can take advantage of scalars and vectors when graphing. Creating scalar and vector objects is really helpful when you are doing the same task multiple times.  I give the example of creating a bunch of scatterplots.

I want to make a scatterplot for each of three variables (Height, Weight, and BMI) against age.  Since all three scatterplots are going to be very similar, I want to standardize all of my plotting arguments including the range of ages, the plot symbols and the plot colors.  I want to include a vertical line for the mean age and a title for each plot.  The code is below:


##Assign numeric vector for the range of x-axis
agelimit<-c(20,80)

##Assign numeric single scalar to plotsymbols and meanage
plotsymbols<-2
meanage<-mean(mydata$Age)

##Assign single character words to plottype and plotcolor 
plottype<-"p"
plotcolor<-"darkgreen"

##Assign a vector of characters to titletext
titletext<-c("Scatterplot", "vs Age")

Ok, so now that I have all those assigned, I can plot the three plots all together using the following code.  Notice how all the highlighted code is the same in each plot (except for the main title) and I'm using the assigned objects I just created.  The great part about this is that if I decide I actually want the plot color to be red, I can change it in just one place.  You can think about how this would be useful in other situations (data cleaning, regressions, etc) when you do the same thing multiple times and then decide to change one little parameter. If you're not sure about the code below, I posted on the basics of plotting here.

##Plot area is 1 row, 3 columns
par(mfrow=c(1,3))

##Plot all three plots using the assigned objects
plot(mydata$Age, mydata$Height, xlab="Age", ylab="Height", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Height", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$Weight, xlab="Age", ylab="Weight", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Weight", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$BMI, xlab="Age", ylab="BMI", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "BMI", titletext[2]))
abline(v=meanage)


Notice how I do the main title with the paste statement.  Paste() is useful for combining words and elements of another variable together into one phrase.  The output looks like this, below.  Pretty nice!
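To see exactly what paste() builds for the titles, you can run it on its own. The vector yvars below is a name I made up for this sketch; it just shows that paste() is vectorized, so all three titles can be built in one call.

```r
titletext <- c("Scatterplot", "vs Age")

# One title: the pieces are joined with a space by default
onetitle <- paste(titletext[1], "Height", titletext[2])
onetitle

# paste() is vectorized, so a vector of variable names gives a vector of titles
yvars <- c("Height", "Weight", "BMI")
titles <- paste(titletext[1], yvars, titletext[2])
titles
```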










Sunday, September 23, 2012

The infamous apply function

For R beginners, the apply() function seems like a secret doorway into programming bliss. It seems so powerful, and yet, beyond reach. For those just starting out, examples of how to use apply() can really help with the intuition of how to harness its power. Here are some great ways to use apply() that can really help make R programming enjoyable and useful.  

First, the general structure of apply() is like so:

apply(x, MARGIN, FUN)

  1. The first argument, "x", is whatever dataset or columns of a dataset you want to do something to.
  2. The second argument, "MARGIN", is how you want to apply the function.  The choices are either over the rows (MARGIN=1) or the columns (MARGIN=2).
  3. The third argument (FUN) is the function you apply.  

So for an easy example, if you want to just sum the entries of all the columns in your dataset called "mydata", you can do it this way:

apply(mydata, 2, sum)
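A quick sketch of how the two MARGIN choices differ, using a tiny made-up data frame so the sums are easy to check by hand:

```r
# Made-up data: two numeric columns, three rows
mydata <- data.frame(a = c(1, 2, 3),
                     b = c(10, 20, 30))

# MARGIN=2: one sum per column
colsums <- apply(mydata, 2, sum)
colsums

# MARGIN=1: one sum per row
rowsums <- apply(mydata, 1, sum)
rowsums
```

For plain sums, base R also has colSums() and rowSums(), which give the same numbers; apply() earns its keep once the function is something you wrote yourself.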

But this is not always very useful.  We have other columns in our datasets, and we probably don't want to just sum all the time.  What else can we do? Here are two nice ways to use apply():


1. Counting how many columns meet a certain condition 

I have 13 child outcomes in a dataset named "births" and I want to count up how many live births there were. My "births" data looks like this: 



How can I add up the live births, especially with those pesky NA's in there? Here's a one line way to do it: 

 births$childcount<-apply(births[,1:5], MARGIN=1, function(x) {sum(x=="live birth", na.rm=TRUE)}) 

This code is saying: take the first 5 columns of my dataset births and, for each row (MARGIN=1), apply the following function. The function takes x as its input (x is one row of births[,1:5] at a time) and counts, across the columns of that row, the number of times it sees "live birth". The na.rm option removes any NA's from consideration.  If you had other conditions, you could say function(x) {sum(x>2010, na.rm=TRUE)} for example, if you wanted to count up how many years were after 2010. 
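Since the births data isn't reproduced here, this is a sketch on a made-up three-column version (the column names and outcome labels other than "live birth" are invented for illustration). It also shows rowSums() on the logical comparison as a possible alternative that gives the same counts.

```r
# Made-up stand-in for the births data: one row per mother
births <- data.frame(child1 = c("live birth", "stillbirth", "live birth"),
                     child2 = c("live birth", NA, "miscarriage"),
                     child3 = c(NA, NA, "live birth"),
                     stringsAsFactors = FALSE)

# For each row, count the entries equal to "live birth", ignoring NA's
births$childcount <- apply(births[, 1:3], MARGIN = 1,
                           function(x) sum(x == "live birth", na.rm = TRUE))

# Alternative: compare the whole block at once, then sum across each row
childcount2 <- rowSums(births[, 1:3] == "live birth", na.rm = TRUE)

births
```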


 2. Changing coded missing values to NA for multiple columns at a time 

 Often datasets code their missing values as 99 or -99 instead of just leaving them blank. We might want to change these to actual missing values so we can work with the data better.  For one variable at a time, I can do it with an ifelse() statement:

originaldata$variable1<-ifelse(originaldata$variable1==99 | originaldata$variable1==-99, NA, originaldata$variable1)

This is equivalent to the cond() function in Stata, where the first argument evaluates the condition, the second argument is what is done if the condition is true, and the third argument is what is done if the condition is false.  

But what if I have 3 or 30 columns that I want to do this to? I don't have to write ifelse() statements for them all individually.  Instead, I use apply.

Here we have a dataset called "originaldata" and we have 4 variables that we want to change from the original missing values to NA values. These variables are in column numbers 2, 4, 5, and 6, as below:




I take the columns of the original dataset, and for each of those columns, I use an ifelse() statement to check the value of each entry: if it's 99 or -99 I change it to NA, and if it's not then I leave it the way it is. This creates a new object called "newdata" (a matrix, since that is what apply() returns here) with just the columns that I chose.

newdata<-apply(originaldata[,c(2,4:6)], MARGIN=2, function(x) {ifelse(x==99 | x==-99, NA,x)})

We print out newdata:


Now if we want the original dataset together with the changed variables, we can just cbind (column bind) them together like so:

alldata<-cbind(originaldata[, c(-2,-4:-6)], newdata)





If you want to be extra fancy, you can just combine the cbind() statement with the apply() in one statement, like this:


newdata<-cbind(originaldata[,c(-2,-4:-6)], apply(originaldata[,c(2,4:6)], MARGIN=2, function(x) {ifelse(x==99 | x==-99, NA,x)}))
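One side effect of the cbind() approach is that the recoded columns end up at the end of the dataset rather than in their original positions. As a possible alternative (a sketch, not a replacement for the method above), lapply() can recode the chosen columns in place. The data frame and the name cols below are made up for illustration.

```r
# Made-up stand-in for originaldata; columns 2, 4, 5, and 6 use 99/-99 for missing
originaldata <- data.frame(id   = 1:3,
                           v1   = c(1, 99, 3),
                           name = c("a", "b", "c"),
                           v2   = c(-99, 2, 3),
                           v3   = c(4, 5, 99),
                           v4   = c(99, -99, 6),
                           stringsAsFactors = FALSE)

cols <- c(2, 4:6)

# Copy the data, then overwrite only the chosen columns with their recoded versions
newdata <- originaldata
newdata[cols] <- lapply(originaldata[cols],
                        function(x) ifelse(x == 99 | x == -99, NA, x))
newdata
```

Here newdata stays a dataframe with the columns in their original order, and the columns that were not recoded (id and name) keep their own types.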