Public Health: matrix

Showing posts with label matrix. Show all posts

Sunday, April 7, 2013

Mastering Matrices

R has many ways to store information. Most of the time, our data comes in the form of a dataset, which we bring into R as a data.frame object. However, there are times when we want to use matrices as well. This post will show you how matrices can be useful and how to manipulate them easily.

First of all, the big difference between matrices and dataframes is that all of the rows and columns of a matrix must have the same class (numeric, character, etc). In a dataframe, you can have some of each. See my initial post about objects, here.

You can convert from one to the other using as.data.frame() or as.matrix(). Be careful though, that if you convert a dataframe with different classes of columns, then your matrix will just be all character:

In order to have a numeric matrix, I'm going to just take the first 6 columns of the mydata dataframe. I can delete columns of a matrix or dataframe in two ways:

mydata.mat<-as.matrix(mydata[,1:6])
mydata.mat<-as.matrix(mydata[,-7])

These two lines are doing the exact same thing. In the first, I am subsetting the dataframe mydata by taking all rows and the first 6 columns of the dataframe, then I'm converting that subset to a matrix. In the second, I'm taking all rows and all columns except the 7th column. Note that if I wanted to drop even more columns, I would just use the c() function like this:

mydata.mat<-as.matrix(mydata[,c(-3,-7)])

Note now that since I have taken out the one character column in my dataframe before I convert it to a matrix, I will get a numeric matrix instead of a character matrix:

This kind of operation for deleting columns works the same way in both matrices and dataframes. However, to add a column to a dataframe or matrix is different. In a dataframe, you can use the $ operator to identify columns, like mydata$Married is the vector corresponding to the Married column. However, you can't use the $ operator on matrices. You will get the following error that the "$ operator is invalid for atomic vectors", which I see all the time when I'm converting back and forth from dataframes to matrices and make a mistake:

All this message means is that the object you're using is a matrix (mydata.mat) and you can't use the $ operator on a matrix. If you get this message, you can either use as.data.frame() to convert your matrix to a dataframe, or you can adjust what you are doing to accomodate the rules of matrices. For adding columns to a matrix, you use cbind(), and likewise for rows, rbind().

So let's say I want to add an age squared column. In the dataframe, I do:

mydata$agesq<-mydata$Age^2

which instantly names the new column "agesq". Now for a matrix, there are two ways to do this, via indexing by number or by name of the original column:

mydata.mat<-cbind(mydata.mat, mydata.mat[,2]^2)
mydata.mat<-cbind(mydata.mat, mydata.mat[,"Age"]^2)

In the first line, I'm taking all rows and the second column of the mydata.mat matrix and squaring it, then I'm column binding it to the original matrix. In the second line, I'm doing the exact same thing, except that instead of indexing with a number, I can use the name of the column "Age". I get the following after running both statements:

Notice that the last two columns of this matrix do not have names, which can be rectified, by using the colnames() function:

colnames(mydata.mat)[7:8]<-c("AgeSq", "AgeSqAgain")

I don't want to rename everything, so I take the 7th and 8th columns and name those appropriately.

Finally, what can matrices do for us? One important aspect of matrices is of course matrix multiplication, which is how we do any multivariable regression analysis. I'll do a post soon on regression analysis by hand in R. But another reason is that matrices are great way to store values that you return during the course of running a loop.

For example, say I want to show how great the central limit theorem is. I'll generate deviates from some other distribution, say the Poisson, and take the mean of the draws each time. I'll do this 1000 times and then show what the histogram looks like. In a problem like this, I'll use a loop. I'll also use a matrix to store the mean each time.

Ok, we start out by initializing a matrix. We'll create a matrix of all NAs with 1000 rows and 3 columns using the matrix() function:

mat1<-matrix(NA, nrow=1000, ncol=3)

Next, we'll set up the for() loop. Let's look at it first and then go through the logic:

for(i in 1:nrow(mat1)){

  vec1<-rpois(1,1)
  vec2<-rpois(10,1)
  vec3<-rpois(100,1)
  mat1[i,]<-c(mean(vec1), mean(vec2), mean(vec3))
 
}

So in the first line, we're saying for each value of i going from i=1 to i=nrow(mat1), do the stuff in the loop. We could have written 1:1000, but it's nice to leave it as nrow(mat1) since we may want to change the size of mat1 later and this way the loop will still be fine.

Next, we draw from a Poisson distribution three times, each time a larger number of draws (first 1 draw, then 10, then 100), and each time with a lambda of 1.

Finally, and this is where the matrix comes in, we'll take the mean of each one of those vectors and store it. We will store the three values in the ith row of the mat1 matrix, filling in all three columns. In a longer way, I could have done:

mat1[i,1]<-mean(vec1)
mat1[i,2]<-mean(vec2)
mat1[i,3]<-mean(vec3)

and it would have come out the same, but the first way is nicer since it's more compact. Remember that matrices are just columns and rows of vectors, so you can always assign a vector to a row, as long as it's the same length. When you concatenate numbers (using the c() function), you make a vector, which is why it works.

Now, let's see how the old CLT is working by plotting some histograms:

par(mfrow=c(1,3))
hist(mat1[,1], main="n=1")
hist(mat1[,2], main="n=10")
hist(mat1[,3], main="n=100")

Again, with the histograms, I can plot each column at a time by subsetting the mat1 matrix:

Pretty nice! Other very helpful places to better understand matrices:

Thursday, November 8, 2012

Data types part 2: Using classes to your advantage

Last week I talked about objects including scalars, vectors, matrices, dataframes, and lists. This post will show you how to use the objects (and their corresponding classes) you create in R to your advantage.

First off, it's important to remember that columns of dataframes are vectors. That is, if I have a dataframe called mydata, the columns mydata$Height and mydata$Weight are vectors. Numeric vectors can be multiplied or added together, squared, added or multiplied by a constant, etc. Operations on vectors are done element by element, meaning here row by row.

First, I read in a file of data, called mydata, using the read.csv() function. I get the dataframe below:

I check the classes of my objects using class(), or all at the same time with ls.str().

class(mydata$Weight)
class(mydata$Height)

or

So I see that mydata is a dataframe and all my columns are numeric (num). Now, if I want to create a new column in my dataset which calculates BMI, I can do some vector operations:

mydata$BMI<-mydata$Weight/(mydata$Height)^2 * 703

Which is the formula for BMI from weight in pounds and height in inches. Notice how if any component of the calculation is a missing (NA) value, R calculates the BMI as NA as well.

Now I can do summary statistics on my data and store those as a matrix. For example, I start with summary statistics on my Age vector:

summary(mydata$Age)

If I want to extract an element of this summary table, say the minimum, I can do

summary(mydata$Age)[1]

which extracts the first element (of 6) of the summary table.

But what I really want is a summary matrix of a bunch of variables: Age, Sex, and BMI. To do this I can rowbind the summary statistics of those three variables together using the rbind() function, but only take the 1st, 4th, and 6th elements of the summary table, which as you can see correspond to the Min, Mean, and Max. This creates a matrix, which I call summary.matrix:

summary.matrix<-rbind(summary(mydata$Age)[c(1,4,6)], summary(mydata$BMI)[c(1,4,6)], summary(mydata$Sex)[c(1,4,6)])

Rowbinding is basically stacking rows on top of each other. I add rownames and then print the class of my summary matrix and the results.

rownames(summary.matrix)<-c("Age", "BMI", "Sex")
class(summary.matrix)
summary.matrix

There is also a much more efficient way of doing this using the apply() function. Previously I had another post on the apply function, but I find that it takes a lot of examples to get comfortable with so here is another application.

Apply() is a great example of classes because it takes in a dataframe as the first argument (mydata, all rows, but I choose only columns 2, 3, and 7). I then apply it to the numeric vector columns (MARGIN=2) of this subsetted dataframe, and then for each of those columns I perform the mean and standard deviation, removing the NA's from consideration. I save this in a matrix I call summary.matrix2.

summary.matrix2<-apply(mydata[,c(2,3,7)], MARGIN=2, FUN=function(x) c(mean(x,na.rm=TRUE), sd(x, na.rm=TRUE)))

I then rename the rows of the this matrix and print the results, rounded to two decimal places. Notice how the format of the final matrix is different here. Above the rows were the variables and the columns the summary statistics, while here it is reversed. I could have column binded (cbind() instead of the rbind()) in the first case and I would have gotten the matrix transposed to be like this one.

rownames(summary.matrix2)<-c("Mean", "Stdev")
round(summary.matrix2, 2)

Finally, I want to demonstrate how you can take advantage of scalars and vectors when graphing. Creating scalar and vectors objects is really helpful when you are doing the same task multiple times. I give the example of creating a bunch of scatterplots.

I want to make a scatterplot for each of three variables (Height, Weight, and BMI) against age. Since all three scatterplots are going to be very similar, I want to standardize all of my plotting arguments including the range of ages, the plot symbols and the plot colors. I want to include a vertical line for the mean age and a title for each plot. The code is below:

##Assign numeric vector for the range of x-axis
agelimit<-c(20,80)

##Assign numeric single scalar to plotsymbols and meanage
plotsymbols<-2
meanage<-mean(mydata$Age)

##Assign single character words to plottype and plotcolor
plottype<-"p"
plotcolor<-"darkgreen"

##Assign a vector of characters to titletext
titletext<-c("Scatterplot", "vs Age")

Ok, so now that I have all those assigned, I can plot the three plots all together using the following code. Notice how all the highlighted code is the same in each plot (except for the main title) and I'm using the assigned objects I just created. The great part about this is that if I decide I actually want to plot color to be red, I can change it in just one place. You can think about how this would be useful in other situations (data cleaning, regressions, etc) when you do the same thing multiple times and then decide to change one little parameter. If you're not sure about the code below, I posted on the basics of plotting here.

##Plot area is 1 row, 3 columns
par(mfrow=c(1,3))

##Plot all three plots using the assigned objects
plot(mydata$Age, mydata$Height, xlab="Age", ylab="Height", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Height", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$Weight, xlab="Age", ylab="Weight", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "Weight", titletext[2]))
abline(v=meanage)

plot(mydata$Age, mydata$BMI, xlab="Age", ylab="BMI", xlim=agelimit,pch=plotsymbols, type=plottype, col=plotcolor, main=paste(titletext[1], "BMI", titletext[2]))
abline(v=meanage)

Notice how I do the main title with the paste statement. Paste() is useful for combining words and elements of another variable together into one phrase. The output looks like this, below. Pretty nice!

Thursday, November 1, 2012

Data types, part 1: Ways to store variables

I've been alluding to different R data types, or classes, in various posts, so I want to go over them in more detail. This is part 1 of a 3 part series on data types. In this post, I'll describe and give a general overview of useful data types. In parts 2 and 3, I'll show you in more detailed examples how you can use these data types to your advantage when you're programming.

When you program in R, you must always refer to various objects that you have created. This is in contrast to say, Stata, where you open up a dataset and any variables you refer to are columns of that dataset (with the exception of local macro variables and so on). So for example, if I have a dataset like the one below:

I can just say in Stata

keep if Age>25

and Stata knows that I am talking about the column Age of this dataset.

But in R, I can't do that because I get this error:

As the error indicates, 'Age' is not an object that I have created. This is because 'Age' is part of the dataframe that is called "mydata". A dataframe, as we will see below, is an object (and in this case also a class) with certain properties. How do I know it's a dataframe? I can check with the class() statement:

What does it mean for "mydata" to be a dataframe? Well, there are many different ways to store variables in R (i.e. objects), which have corresponding classes. I enumerate the most common and useful subset of these objects below along with their description and class:

Object	Description	Class
Single Number or letter/word	Just a single number or character/word/phrase in quotes	Either numeric or character
Vector	A vector of either all numbers or all characters strung together	Either all numeric or all character
Matrix	Has columns and rows - all entries are of the same class	Either all numeric or all character
Dataframe	Like a matrix but columns can be different classes	data.frame
List	A bunch of different objects all grouped together under one name	list

There are other classes including factors, which are so useful that they will be a separate post in this blog, so for now I'll leave those aside. You can also make your own classes, but that's definitely beyond the scope of this introduction to objects and classes.

Ok, so here are some examples of different ways of assigning names to these objects and printing the contents on the screen. I chose to name my variables descriptively of what they are (like numeric.var or matrix.var), but of course you can name them anything you want with any mix of periods and underscores, lowercase and uppercase letters, i.e. id_number, Height.cm, BIRTH.YEAR.MONTH, firstname_lastname_middlename, etc. I would only guard against naming variables by calling them things like mean or median, since those are established functions in R and might lead to some weird things happening.

1. Single number or character/word/phrase in quotation marks: just assign one number or one thing in quotes to the variable name

numeric.var<-10
character.var<-"Hello!"

2. Vector: use the c() operator or a function like seq() or rep() to combine several numbers into one vector.

vector.numeric<-c(1,2,3,10)
vector.char<-rep("abc",3)

3. Matrix: use the matrix() function to specify the entries, then the number of rows, and the number of columns in the matrix. Matrices can only be indexed using matrix notation, like [1,2] for row 1, column 2. More about indexing in my previous post on subsetting.

matrix.numeric<-matrix(data=c(1:6),nrow=3,ncol=2)
matrix.character<-matrix(data=c("a","b","c","d"), nrow=2, ncol=2)

4. Dataframe: use the data.frame() function to combine variables together. Here you must use the cbind() function to "column bind" the variables. Notice how I can mix numeric columns with character columns, which is also not possible in matrices. If I want to refer to a specific column, I use the $ operator, like dataframe.var$ID for the second column.

dataframe.var<-data.frame(cbind(School=1, ID=1:5, Test=c("math","read","math","geo","hist")))

Alternatively, any dataset you pull into R using the read.csv(), read.dta(), or read.xport() functions (see my blog post about this here), will automatically be a dataframe.

What's important to note about dataframes is that the variables in your dataframe also have classes. So for example, the class of the whole dataframe is "data.frame", but the class of the ID column is a "factor."

Again, I'll go into factors in another post and how to change back and forth between factors and numeric or character classes.

5. List: use the list() function and list all of the objects you want to include. The list combines all the objects together and has a specific indexing convention, the double square bracket like so: [[1]]. I will go into lists in another post.

list.var<-list(numeric.var, vector.char, matrix.numeric, dataframe.var)

To know what kinds of objects you have created and thus what is in your local memory, use the ls() function like so:

To remove an object, you do:

rm(character.var)

and to remove all objects, you can do:

rm(list=ls())

So that was a brief introduction to objects and classes. Next week, I'll go into how these are useful for easier and more efficient programming.

Monday, October 1, 2012

Quick and Easy Subsetting

Public health datasets can be enormous and difficult to look at. Often it is great to be able to only look at specific parts of the dataset, or to only run analysis on a specific part of a dataset. There are two ways that you can subset a dataset in R:

Using the subset() function
Using matrix indexing

The first way may sound easier, but the second one is very useful. Let me show you why.

Let's start with the subset() function. The basic structure is like this:

subset(x, condition, select=c(var1, var2))

where x is the original dataset you want to subset, condition is a condition on your rows, and select is the columns you want to keep. So for example, I have a dataset called mydata, shown below. I want to look at only women that are over 50 years old, and I only want to look at their ID, age, and weight. Therefore I do this:

sub.data<-subset(mydata, Age>50 & Sex==0, select=c(ID, Age, Weight))

So here I am saying, take the dataset called "mydata", take only rows where the Age >50 and the Sex ==0 (when you write a condition, you always use two equal signs), and then take only the columns ID, Age, and Weight.

Here are the original (left) and the subsetted (right) datasets:

You can also do a statement like this:

sub.data2<-subset(mydata, Age>50 & Sex==0, select=c(ID:Sex))

which takes all of the columns between (and including) ID and Sex. This is nice so you don't have to list out all the variables one by one.

However, now let's take a more real example. I have a dataset with 200 columns and 3000 rows. I want to keep say 50 columns, but they're not all sequential. Some are at the beginning of my dataset, some are at the end. Do I have to go in and write out each name of the variable? Nope. Instead we can use R's matrix indexing capability. It's super easy and it goes like this in a general form:

newdata<-originaldata[rows I want, columns I want]

The comma is crucial here since the dataset is organized by rows and columns. If you don't put anything in the "rows I want" place, R keeps everything. So, for example, I can do this:

sub.data3<-mydata[,c(1:3)]

So this is saying, take the dataset "mydata", and take all the rows (since there is nothing before the comma), and take columns 1-3. Call this subsetted dataset "sub.data3".

I can also make the equivalent statement from sub.data2 using indexing like this:

sub.data4<-mydata[mydata$Age>50 & mydata$Sex==0, c(1:3)]

Or, if I just want to look at the first 50 observations and all columns, I can do this:

sub.data5<-mydata[c(1:50), ]

Finally, I can do what I said before, in that I can choose the variables I want from a super long list. I first can find out the index numbers of my columns using the names() command like this:

names(mydata)

Then I can say I want columns 1-5, 8, 12-15, 100-120, 197, and 199. I can easily do this by saying

sub.data6<-mydata[,c(1:5, 8, 12:15, 100:120, 197, 199)]

Finally, the last thing I want to point out is that if you want to keep most columns but just take away, say, the last 10, you can do this:

sub.data6<-mydata[,-c(190:200)]

The negative sign is like a 'drop' command. Keep everything except the columns I list out.

It's a very nice feature and very fast!

Pages