Public Health: subset

Showing posts with label subset. Show all posts

Friday, October 4, 2013

Loops revisited: How to rethink macros when using R

Sunday, April 7, 2013

Mastering Matrices

R has many ways to store information. Most of the time, our data comes in the form of a dataset, which we bring into R as a data.frame object. However, there are times when we want to use matrices as well. This post will show you how matrices can be useful and how to manipulate them easily.

First of all, the big difference between matrices and dataframes is that all of the rows and columns of a matrix must have the same class (numeric, character, etc). In a dataframe, you can have some of each. See my initial post about objects, here.

You can convert from one to the other using as.data.frame() or as.matrix(). Be careful though, that if you convert a dataframe with different classes of columns, then your matrix will just be all character:

In order to have a numeric matrix, I'm going to just take the first 6 columns of the mydata dataframe. I can delete columns of a matrix or dataframe in two ways:

mydata.mat<-as.matrix(mydata[,1:6])
mydata.mat<-as.matrix(mydata[,-7])

These two lines are doing the exact same thing. In the first, I am subsetting the dataframe mydata by taking all rows and the first 6 columns of the dataframe, then I'm converting that subset to a matrix. In the second, I'm taking all rows and all columns except the 7th column. Note that if I wanted to drop even more columns, I would just use the c() function like this:

mydata.mat<-as.matrix(mydata[,c(-3,-7)])

Note now that since I have taken out the one character column in my dataframe before I convert it to a matrix, I will get a numeric matrix instead of a character matrix:

This kind of operation for deleting columns works the same way in both matrices and dataframes. However, to add a column to a dataframe or matrix is different. In a dataframe, you can use the $ operator to identify columns, like mydata$Married is the vector corresponding to the Married column. However, you can't use the $ operator on matrices. You will get the following error that the "$ operator is invalid for atomic vectors", which I see all the time when I'm converting back and forth from dataframes to matrices and make a mistake:

All this message means is that the object you're using is a matrix (mydata.mat) and you can't use the $ operator on a matrix. If you get this message, you can either use as.data.frame() to convert your matrix to a dataframe, or you can adjust what you are doing to accomodate the rules of matrices. For adding columns to a matrix, you use cbind(), and likewise for rows, rbind().

So let's say I want to add an age squared column. In the dataframe, I do:

mydata$agesq<-mydata$Age^2

which instantly names the new column "agesq". Now for a matrix, there are two ways to do this, via indexing by number or by name of the original column:

mydata.mat<-cbind(mydata.mat, mydata.mat[,2]^2)
mydata.mat<-cbind(mydata.mat, mydata.mat[,"Age"]^2)

In the first line, I'm taking all rows and the second column of the mydata.mat matrix and squaring it, then I'm column binding it to the original matrix. In the second line, I'm doing the exact same thing, except that instead of indexing with a number, I can use the name of the column "Age". I get the following after running both statements:

Notice that the last two columns of this matrix do not have names, which can be rectified, by using the colnames() function:

colnames(mydata.mat)[7:8]<-c("AgeSq", "AgeSqAgain")

I don't want to rename everything, so I take the 7th and 8th columns and name those appropriately.

Finally, what can matrices do for us? One important aspect of matrices is of course matrix multiplication, which is how we do any multivariable regression analysis. I'll do a post soon on regression analysis by hand in R. But another reason is that matrices are great way to store values that you return during the course of running a loop.

For example, say I want to show how great the central limit theorem is. I'll generate deviates from some other distribution, say the Poisson, and take the mean of the draws each time. I'll do this 1000 times and then show what the histogram looks like. In a problem like this, I'll use a loop. I'll also use a matrix to store the mean each time.

Ok, we start out by initializing a matrix. We'll create a matrix of all NAs with 1000 rows and 3 columns using the matrix() function:

mat1<-matrix(NA, nrow=1000, ncol=3)

Next, we'll set up the for() loop. Let's look at it first and then go through the logic:

for(i in 1:nrow(mat1)){

  vec1<-rpois(1,1)
  vec2<-rpois(10,1)
  vec3<-rpois(100,1)
  mat1[i,]<-c(mean(vec1), mean(vec2), mean(vec3))
 
}

So in the first line, we're saying for each value of i going from i=1 to i=nrow(mat1), do the stuff in the loop. We could have written 1:1000, but it's nice to leave it as nrow(mat1) since we may want to change the size of mat1 later and this way the loop will still be fine.

Next, we draw from a Poisson distribution three times, each time a larger number of draws (first 1 draw, then 10, then 100), and each time with a lambda of 1.

Finally, and this is where the matrix comes in, we'll take the mean of each one of those vectors and store it. We will store the three values in the ith row of the mat1 matrix, filling in all three columns. In a longer way, I could have done:

mat1[i,1]<-mean(vec1)
mat1[i,2]<-mean(vec2)
mat1[i,3]<-mean(vec3)

and it would have come out the same, but the first way is nicer since it's more compact. Remember that matrices are just columns and rows of vectors, so you can always assign a vector to a row, as long as it's the same length. When you concatenate numbers (using the c() function), you make a vector, which is why it works.

Now, let's see how the old CLT is working by plotting some histograms:

par(mfrow=c(1,3))
hist(mat1[,1], main="n=1")
hist(mat1[,2], main="n=10")
hist(mat1[,3], main="n=100")

Again, with the histograms, I can plot each column at a time by subsetting the mat1 matrix:

Pretty nice! Other very helpful places to better understand matrices:

Sunday, March 17, 2013

Extracting Information From Objects Using Names()

One of the big differences between a language like Stata compared to R is the ability in R to handle many different types of objects at once, and combine them together or pull them apart. I had a post about objects last year, but I thought I'd show in this post how to extract information from objects you create in R.

For this example, I'll go back to a dataset I've used in the past called mydata.Rdata and it's in the Code and Data Download site.

One function that is extremely useful to know is names(). The names() function will show you everything that is stored in R under that object name. So, for example, if you do

where mydata is a dataframe object, you will get the names of the columns, which are the vectors that comprise the dataframe. Note that names(mydata) is an object itself (because everything is an object in R) - it is a character vector of length 7. You can save this vector and print out the class to verify this.

But names() can be useful for much more than just column names, as we'll see in a moment.

But before we go on, let's take a moment to remember how subsetting works. In subsetting, you use square brackets to pull out exactly the element of an object that you want. So if I want to subset a dataframe, I can say

mydata.subset<-mydata[,c(1:2)]

which is saving into the new object mydata.subset, all the rows and only the first two columns of the mydata dataframe.

Now, let's combine the concept of using the names() function with the concept of subsetting to change one of the column names of our dataset:

names(mydata)[4]<-"Weight_lbs"

Here we are saying, of the names(mydata) object, take the fourth component and make it "Weight_lbs". Now, if you run the names() on our dataframe, we find the change has been made:

Ok, so now we'll see how the names() function can be used in other applications.

1. Summary objects

There are two ways to extract information from objects in R, using subsetting and using the "$" operator.

Below, we summarize the Age vector and store the results in sum.vec. We print out the sum.vec object and the print out the corresponding names. Now we can extract the 1st element of the summary vector of Age in the following way using the [ ] operator.

This gives us the first element, which is the minimum. We could also do:

sum.vec[c(2,3,5)]

for the 25th, 50th, and 75th percentiles.

The other way to extract is by using "$". For example, the summary() function on a table object gives you a Chi squared test:

Here, you can extract any of the pieces of information that came out in the test, including the number of cases, the number of variables, the test statistic, etc. We can extract the pvalue of the test statistic by using the "$" operator, like this:

Let's see how this can be useful in the next example.

2. Regressions and statistical tests

The standard linear regression that we run in R is using lm(). It looks like this:

But there's a lot more that R has calculated that is not shown here. We can see this by saving this linear regression as an object and running names() on it:

So we see that saved under the reg.object are the coefficients, the residuals, fitted values, degrees of freedom, and a lot more. To find out everything that names() provides for a given object, look it up by doing ?lm. Now, to extract any of these components, like the residuals, use the "$" operator like this:

reg.object$residuals

You can make use of this extraction by taking the mean of the residuals

or plotting their distribution:

hist(reg.object$residuals, main="Distribution of Residuals" ,xlab="Residuals")

Don't forget that you can summarize regression objects using summary(), and get the names() of that summary too, like this:

summary(reg.object)
names(summary(reg.object))

which will give you more objects you can extract from your regression. You can use the names() function on any statistical model or function such as aov(), t.test(), chisq.test(), etc.

3. Histograms and boxplots

Finally, let's go back to that histogram and save that into an object. There are objects under names() of the histogram object now:

I showed how you can manipulate those in my post on histograms.

Similarly, for boxplot:

Here I've extracted the stats object which gives you the lower whisker, the lower hinge, the median, the upper hinge, and the upper whisker for each group, which you can see below.

Monday, October 1, 2012

Quick and Easy Subsetting

Public health datasets can be enormous and difficult to look at. Often it is great to be able to only look at specific parts of the dataset, or to only run analysis on a specific part of a dataset. There are two ways that you can subset a dataset in R:

Using the subset() function
Using matrix indexing

The first way may sound easier, but the second one is very useful. Let me show you why.

Let's start with the subset() function. The basic structure is like this:

subset(x, condition, select=c(var1, var2))

where x is the original dataset you want to subset, condition is a condition on your rows, and select is the columns you want to keep. So for example, I have a dataset called mydata, shown below. I want to look at only women that are over 50 years old, and I only want to look at their ID, age, and weight. Therefore I do this:

sub.data<-subset(mydata, Age>50 & Sex==0, select=c(ID, Age, Weight))

So here I am saying, take the dataset called "mydata", take only rows where the Age >50 and the Sex ==0 (when you write a condition, you always use two equal signs), and then take only the columns ID, Age, and Weight.

Here are the original (left) and the subsetted (right) datasets:

You can also do a statement like this:

sub.data2<-subset(mydata, Age>50 & Sex==0, select=c(ID:Sex))

which takes all of the columns between (and including) ID and Sex. This is nice so you don't have to list out all the variables one by one.

However, now let's take a more real example. I have a dataset with 200 columns and 3000 rows. I want to keep say 50 columns, but they're not all sequential. Some are at the beginning of my dataset, some are at the end. Do I have to go in and write out each name of the variable? Nope. Instead we can use R's matrix indexing capability. It's super easy and it goes like this in a general form:

newdata<-originaldata[rows I want, columns I want]

The comma is crucial here since the dataset is organized by rows and columns. If you don't put anything in the "rows I want" place, R keeps everything. So, for example, I can do this:

sub.data3<-mydata[,c(1:3)]

So this is saying, take the dataset "mydata", and take all the rows (since there is nothing before the comma), and take columns 1-3. Call this subsetted dataset "sub.data3".

I can also make the equivalent statement from sub.data2 using indexing like this:

sub.data4<-mydata[mydata$Age>50 & mydata$Sex==0, c(1:3)]

Or, if I just want to look at the first 50 observations and all columns, I can do this:

sub.data5<-mydata[c(1:50), ]

Finally, I can do what I said before, in that I can choose the variables I want from a super long list. I first can find out the index numbers of my columns using the names() command like this:

names(mydata)

Then I can say I want columns 1-5, 8, 12-15, 100-120, 197, and 199. I can easily do this by saying

sub.data6<-mydata[,c(1:5, 8, 12:15, 100:120, 197, 199)]

Finally, the last thing I want to point out is that if you want to keep most columns but just take away, say, the last 10, you can do this:

sub.data6<-mydata[,-c(190:200)]

The negative sign is like a 'drop' command. Keep everything except the columns I list out.

It's a very nice feature and very fast!

Pages