Sunday, January 20, 2013

Translating Weird R Errors


I love R. I think it's intuitive and clever and overall a great language. But I do get really annoyed sometimes at the completely ridiculous, cryptic error messages it often gives me.  This post will go over some of those seemingly nonsensical errors so you don't have to go crazy trying to find the bug in your code.

1. all arguments must have the same length

To start with, I just make up some quick data:

prob1<-as.data.frame(cbind(c(1,2,3),c(5,4,3)))
colnames(prob1)<-c("Education","Ethnicity")

And now I just want to do a simple table, but R gives me the error in the heading above:

table(prob1$Eduction, prob1$Ethnicity)

What the heck. I look back at my dataset and make sure that both of those variables are the same length, which they are. The problem here is that I misspelled "Education". There's a missing "a" in there, and instead of telling me that I referenced a variable that doesn't exist, R bizarrely tells me to check the length of my variables. This happens because the $ operator returns NULL for a column that doesn't exist rather than throwing an error, so table() receives one argument of length 0 and one of length 3. Remember: anytime you get an error, check to make sure you've spelled everything right.

If I do this, everything works out great:
table(prob1$Education, prob1$Ethnicity)
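
Here is a minimal sketch of what is going on behind that message, assuming the misspelled name was Eduction (the original screenshot isn't preserved). The last part is a habit I'm suggesting rather than anything from the example above: indexing the column by name with [ , ] fails immediately with a message that actually points at the missing column.

```r
prob1 <- as.data.frame(cbind(c(1,2,3), c(5,4,3)))
colnames(prob1) <- c("Education", "Ethnicity")

# A misspelled column name gives NULL, not an error
is.null(prob1$Eduction)

# table() then sees arguments of length 0 and 3 and complains about length
msg1 <- tryCatch(table(prob1$Eduction, prob1$Ethnicity),
                 error = function(e) conditionMessage(e))
msg1

# Indexing by name with [ , ] fails right away with a clearer message
msg2 <- tryCatch(prob1[, "Eduction"],
                 error = function(e) conditionMessage(e))
msg2
```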


2. replacement has 0 rows, data has 3

A very similar problem, with a very different error message. Let's say I forgot which columns were in my prob1 data and thought I had a Sex indicator in there. When I try to recode it into a new column, I get the error above.

This error message is also pretty unhelpful. The syntax is totally correct; the problem is that I just don't have a variable named Sex in my dataset, so the right-hand side of the assignment is empty (0 rows) while the data frame has 3 rows. If I instead recode Education, a variable that exists, everything is fine:

prob1$Educ_recode<-as.numeric(prob1$Education==2)
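
Here is a sketch of how the failing recode probably looked. The original line isn't preserved, so the exact expression and the name Sex_recode are my assumptions. The point is only that prob1$Sex is NULL, which makes the right-hand side a vector of length zero.

```r
prob1 <- as.data.frame(cbind(c(1,2,3), c(5,4,3)))
colnames(prob1) <- c("Education", "Ethnicity")

# prob1$Sex doesn't exist, so the comparison has length 0
length(prob1$Sex == 2)

# Assigning a length-0 vector to a new column triggers the error
msg <- tryCatch({
  prob1$Sex_recode <- as.numeric(prob1$Sex == 2)   # hypothetical recode
  "no error"
}, error = function(e) conditionMessage(e))
msg

# Checking the column names before recoding avoids the confusion
"Sex" %in% names(prob1)
```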


3. undefined columns selected

Ironically, the error we so badly wanted before comes up but for a completely different reason. See if you can find the problem here.  I'll take that same little dataset and I just want to know how many rows there are in which Education is not equal to 1.

So, if I want to know the number of rows of the dataframe prob1, I do:

nrow(prob1)

and if I want to know how many have a value of Education not equal to 1, I do the following (incorrectly) and get an error:

nrow(prob1[prob1$Education!=1])

Now I check my variable name, and I've definitely spelled Education right this time. The problem is not that I have referenced a column that doesn't exist, and it isn't really nrow() either; I've messed up the subsetting inside the nrow() call. When I do

prob1[prob1$Education!=1]

there is only one index in the brackets and no comma, so R treats it as a selection of columns rather than rows. The condition is a logical vector with three elements (one per row), but prob1 has only two columns, so the third element points at a column that doesn't exist. Hence "undefined columns selected". To subset rows, I need the comma that separates the row index from the column index. See my post on subsetting for more details on this.

If I do it the following way, all is good, since I'm telling R to subset prob1 to only the rows with Education != 1 and to keep all columns:

nrow(prob1[prob1$Education!=1,])

So this error message does make sense in a way, but it's still a bit cryptic in my opinion.
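
To see the difference the comma makes, here is a small sketch using the same prob1 data. The sum() line at the end is an alternative I'm adding, not part of the approach above; it counts the TRUE values directly and skips the subsetting altogether.

```r
prob1 <- as.data.frame(cbind(c(1,2,3), c(5,4,3)))
colnames(prob1) <- c("Education", "Ethnicity")

keep <- prob1$Education != 1   # FALSE TRUE TRUE: three values, one per row

# No comma: R treats `keep` as a column selector, but there are only two columns
msg <- tryCatch(nrow(prob1[keep]), error = function(e) conditionMessage(e))
msg

# With the comma, `keep` selects rows and all columns are kept
n_rows <- nrow(prob1[keep, ])
n_rows

# Alternative: count the TRUEs directly
sum(keep)
```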


Monday, October 8, 2012

Summarizing Data

In this post, I'll go over four functions that you can use to nicely summarize your data. Before any regression analysis, a descriptive analysis is key to understanding your variables and the relationships between them. Next week, I'll have a post on plotting, so this post is limited to the summary(), table(), prop.table(), and aggregate() functions.

Here is my dataset for this example:




The first thing I want to do is look at my data overall - get the range of values for each variable, and see what missing values I have.  I can do this simply by doing:

summary(mydata)

This produces the output below, and shows me that both Weight and Height have missing values. The Migrantstatus variable is a factor (categorical), so it lists the number in each category.

If I want to summarize just one variable, I can do summary(mydata$Weight), for example. And remember from last week that if I want to summarize only some portion of my data, I can subset using indexing like so: summary(mydata[,c(2:5)])
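
If you want to try these calls yourself, the sketch below builds a small stand-in dataset with the column names used in this post. The ID column, the column order, the values, and the category labels are all made up for illustration; they are not my actual data.

```r
# Made-up stand-in data; only the column names follow the post
mydata <- data.frame(
  ID            = 1:6,
  Sex           = factor(c("F", "M", "F", "M", "F", "M")),
  Married       = factor(c("Yes", "No", "No", "Yes", "Yes", "No")),
  Weight        = c(60, 82, NA, 90, 55, 78),
  Height        = c(162, 180, 158, NA, 165, 175),
  Migrantstatus = factor(c("Migrant", "Non-migrant", "Migrant",
                           "Non-migrant", "Non-migrant", "Migrant"))
)

summary(mydata)            # every column; NA counts appear for Weight and Height
summary(mydata$Weight)     # a single variable
summary(mydata[, 2:5])     # columns 2 through 5 only
```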

Next, I want to tabulate my data. I can do univariate and bivariate tables (I can even do more dimensions than that!) by using the table() function. table() gives me the counts in each group. If I want proportions instead of counts, I can wrap prop.table() around my table. The code looks like this with the output below:

table1<-table(mydata$Sex)

table1
prop.table(table1)
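
One thing worth knowing here is that table() silently drops missing values unless you ask for them. The sketch below uses a made-up vector (not my dataset) to show the counts, the proportions, and the useNA argument.

```r
# Made-up values for illustration
sex <- c("F", "M", "F", "M", "F", NA)

counts <- table(sex)          # NA is dropped by default
counts
props <- prop.table(counts)   # proportions of the non-missing total
props

# useNA = "ifany" adds a count for missing values when there are any
table(sex, useNA = "ifany")
```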



Next, I do the bivariate tables. The first variable is the row and the second is the column. I can do proportions here as well, but I must be careful about the margin. With margin=1, R calculates the proportions within each row (so each row sums to 1), while margin=2 calculates them within each column (so each column sums to 1). If I leave margin out, I get proportions of the grand total. I show a table of Sex vs Marital status below with two types of proportion tables.

table2<-table(mydata$Sex, mydata$Married)

table2
prop.table(table2, margin=1)
prop.table(table2, margin=2)
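
A quick way to convince yourself which margin is which is to check the row and column sums. This is a sketch with made-up vectors standing in for the Sex and Married columns:

```r
# Made-up values for illustration
sex     <- c("F", "M", "F", "M", "F", "M")
married <- c("Yes", "No", "No", "Yes", "Yes", "No")

tab <- table(sex, married)

by_row <- prop.table(tab, margin = 1)   # each row sums to 1
by_col <- prop.table(tab, margin = 2)   # each column sums to 1

rowSums(by_row)
colSums(by_col)
```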



And if I want to do three dimensions, I put all three variables in my table() function.  R will give me the 2x2 table of sex and marital status, stratified by the third variable, migrant status.


table3<-table(mydata$Sex, mydata$Married, mydata$Migrantstatus)

table3

The great part about R is that I can take any component of this table that I want.  For example, if I just want the table for migrants, I can do:

table3[,,1]

which tells R to give me all rows and columns, but only for the first category of the third variable.  I get the following output, which you can see is the same as the first part of table 3 from above.
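
Slicing by position works, but I have to remember which category comes first. The sketch below, again with made-up vectors and labels, shows how to check the order with dimnames() and how to slice by label instead. The ftable() call at the end is an extra I'm adding; it prints all three dimensions in one flat table.

```r
# Made-up values for illustration
sex     <- c("F", "M", "F", "M", "F", "M", "F", "M")
married <- c("Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes")
migrant <- c("Migrant", "Migrant", "Non-migrant", "Non-migrant",
             "Migrant", "Non-migrant", "Migrant", "Non-migrant")

tab3 <- table(sex, married, migrant)

dimnames(tab3)$migrant        # the order of the third dimension's categories

tab3[, , 1]                   # slice by position
tab3[, , "Migrant"]           # the same slice, by label

ftable(tab3)                  # a flatter printout of all three dimensions
```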

Finally, what if I want to calculate the mean of one variable by another variable? One way to do this is to use the aggregate() function. aggregate() does exactly that: it takes one variable (the first argument) and applies a function to it (the FUN= argument), separately for each group of another variable (the by=list() argument). So here I am going to do the mean weight for each sex. Here the syntax is a little funny because R wants a list for the by variable. I will go over lists in another post in the future, or you can look it up on another R site.

aggtable<-aggregate(mydata$Weight, by=list(mydata$Sex), FUN=mean)
aggtable




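However, something is wrong. The NA in the Weight column is messing up my mean calculation, because mean() returns NA whenever any value in the group is missing. To get around this, I add na.rm=TRUE, which aggregate() passes along to mean() so that NAs are removed before the mean is calculated.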

aggtable.narm<-aggregate(mydata$Weight, by=list(mydata$Sex), FUN=mean, na.rm=TRUE)
aggtable.narm
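
Here is the same before-and-after in a self-contained sketch. The data are made up, with one missing weight, so the effect of na.rm=TRUE is easy to see.

```r
# Made-up values for illustration
mydata <- data.frame(
  Sex    = c("F", "M", "F", "M", "F", "M"),
  Weight = c(60, 82, NA, 90, 55, 78)
)

# mean() returns NA as soon as one value in the group is missing
agg <- aggregate(mydata$Weight, by = list(mydata$Sex), FUN = mean)
agg

# Extra arguments such as na.rm = TRUE are passed along to FUN
agg_narm <- aggregate(mydata$Weight, by = list(mydata$Sex),
                      FUN = mean, na.rm = TRUE)
agg_narm
```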

Victory! If I want to name my columns of this table, I can do:

names(aggtable.narm)<-c("Sex","Meanweight")

And of course, if you want to do mean tables by more than one variable, you can put them all in the list argument, like so: by=list(mydata$Sex, mydata$Married). The code and output would look like this:


aggtable.3<-aggregate(mydata$Weight, by=list(mydata$Sex, mydata$Married), FUN=mean, na.rm=TRUE)

names(aggtable.3)<-c("Sex","Married","Meanweight")

aggtable.3
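
As an aside, aggregate() also has a formula interface, which I haven't used above. It is sketched here on made-up data as an alternative: the columns get named automatically, and rows with missing values are dropped by default (na.action = na.omit), so no na.rm is needed.

```r
# Made-up values for illustration
mydata <- data.frame(
  Sex     = c("F", "M", "F", "M", "F", "M"),
  Married = c("Yes", "No", "No", "Yes", "Yes", "No"),
  Weight  = c(60, 82, NA, 90, 55, 78)
)

# Formula interface: mean Weight for each combination of Sex and Married
agg_formula <- aggregate(Weight ~ Sex + Married, data = mydata, FUN = mean)
agg_formula
```

One difference to keep in mind: a group whose weights are all missing disappears from this result, whereas the by=list() version with na.rm=TRUE keeps the group and reports its mean as NaN.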