US CDC has issued an MMWR describing the growing threat of multidrug-resistant gonorrhea and what public health agencies and partners can do to delay the emergence of cephalosporin-resistant strains, mitigate the public health consequences of expanded resistance, and prevent a return to the era of untreatable gonorrhea. For these strategies and more information, see the MMWR at: http://go.usa.gov/4JY3
A total of 40 confirmed gonorrhea cases have been reported in Maine through Feb. 16, compared to 54 reported cases in 2012 and 19 reported cases in 2011 during the same period. Some cases are reported more than once and may represent reinfection, recurrent infection, or persistent infection.
The majority of cases have been reported in Androscoggin County, followed by Cumberland County. Maine CDC is issuing biweekly gonorrhea surveillance updates. The most recent report is available at: http://go.usa.gov/4JrF
Current treatment guidelines and reporting requirements are included in this Health Alert from Dec. 4: http://go.usa.gov/4JY9
Tuesday, February 26, 2013
Monday, February 25, 2013
Normal distribution functions
Ah, the Central Limit Theorem. The basis of much of statistical inference and how we get those 95% confidence intervals. It's just so beautiful! Lately, I have found myself looking up the normal distribution functions in R. They can be difficult to keep straight, so this post will give a succinct overview and show you how they can be useful in your data analysis.
To start, here is a table with all four normal distribution functions and their purpose, syntax, and an example:
Note that for all functions, leaving out the mean and standard deviation would result in default values of mean=0 and sd=1, a standard normal distribution.
Another important note for the pnorn() function is the ability to get the right hand probability using the lower.tail=FALSE option. For example,
In the first line, we are calculating the area to the left of 1.96, while in the second line we are calculating the area to the right of 1.96.
With these functions, I can do some fun plotting. I create a sequence of values from -4 to 4, and then calculate both the standard normal PDF and the CDF of each of those values. I also generate 1000 random draws from the standard normal distribution. I then plot these next to each other. Whenever you use probability functions, you should, as a habit, remember to set the seed. Setting the seed means locking in the sequence of "random" (they are pseudorandom) numbers that R gives you, so you can reproduce your work later on.
The par() parameters set up a plotting area of 1 row and 3 columns (mfrow), and move the three plots closer to each other (mar). Here is a good explanation of the plotting area. The output is below:
Now, when we have our actual data, we can do a visual check of the normality of our outcome variable, which, if we assume a linear relationship with normally distributed errors, should also be normal. Let's make up some data, where I add noise by using rnorm() - here I'm generating the same amount of random numbers as is the length of the xseq vector, with a mean of 0 and a standard deviation of 5.5.
And now I can plot a histogram of y (check out my post on histograms if you want more detail) and add a curve() function to the plot using the mean and standard deviation of y as the parameters:
Here, the curve() function takes as its first parameter a function itself (or an expression) that must be written as some function of x. Our function here is dnorm(). The x in the dnorm() function is not an object we have created; rather, it's indicating that there's a variable that is being evaluated, and the evaluation is the normal density at the mean of y and standard deviation of y. Make sure to include add=TRUE so that the curve is plotted on the same plot as the histogram. Here is what we get:
Here are some other good sources on the topic of probability distribution functions:
To start, here is a table with all four normal distribution functions and their purpose, syntax, and an example:
| Purpose | Syntax | Example | |
|---|---|---|---|
| rnorm | Generates random numbers from normal distribution | rnorm(n, mean, sd) | rnorm(1000, 3, .25) Generates 1000 numbers from a normal with mean 3 and sd=.25 |
| dnorm | Probability Density Function (PDF) | dnorm(x, mean, sd) | dnorm(0, 0, .5) Gives the density (height of the PDF) of the normal with mean=0 and sd=.5. |
| pnorm | Cumulative Distribution Function (CDF) | pnorm(q, mean, sd) | pnorm(1.96, 0, 1) Gives the area under the standard normal curve to the left of 1.96, i.e. ~0.975 |
| qnorm | Quantile Function - inverse of pnorm | qnorm(p, mean, sd) | qnorm(0.975, 0, 1) Gives the value at which the CDF of the standard normal is .975, i.e. ~1.96 |
Note that for all functions, leaving out the mean and standard deviation would result in default values of mean=0 and sd=1, a standard normal distribution.
Another important note for the pnorn() function is the ability to get the right hand probability using the lower.tail=FALSE option. For example,
In the first line, we are calculating the area to the left of 1.96, while in the second line we are calculating the area to the right of 1.96.
With these functions, I can do some fun plotting. I create a sequence of values from -4 to 4, and then calculate both the standard normal PDF and the CDF of each of those values. I also generate 1000 random draws from the standard normal distribution. I then plot these next to each other. Whenever you use probability functions, you should, as a habit, remember to set the seed. Setting the seed means locking in the sequence of "random" (they are pseudorandom) numbers that R gives you, so you can reproduce your work later on.
set.seed(3000)
xseq<-seq(-4,4,.01)
densities<-dnorm(xseq, 0,1)
cumulative<-pnorm(xseq, 0, 1)
randomdeviates<-rnorm(1000,0,1)
par(mfrow=c(1,3), mar=c(3,4,4,2))
plot(xseq, densities, col="darkgreen",xlab="", ylab="Density", type="l",lwd=2, cex=2, main="PDF of Standard Normal", cex.axis=.8)
plot(xseq, cumulative, col="darkorange", xlab="", ylab="Cumulative Probability",type="l",lwd=2, cex=2, main="CDF of Standard Normal", cex.axis=.8)
hist(randomdeviates, main="Random draws from Std Normal", cex.axis=.8, xlim=c(-4,4))
The par() parameters set up a plotting area of 1 row and 3 columns (mfrow), and move the three plots closer to each other (mar). Here is a good explanation of the plotting area. The output is below:
Now, when we have our actual data, we can do a visual check of the normality of our outcome variable, which, if we assume a linear relationship with normally distributed errors, should also be normal. Let's make up some data, where I add noise by using rnorm() - here I'm generating the same amount of random numbers as is the length of the xseq vector, with a mean of 0 and a standard deviation of 5.5.
xseq<-seq(-4,4,.01)
y<-2*xseq + rnorm(length(xseq),0,5.5)
And now I can plot a histogram of y (check out my post on histograms if you want more detail) and add a curve() function to the plot using the mean and standard deviation of y as the parameters:
Here, the curve() function takes as its first parameter a function itself (or an expression) that must be written as some function of x. Our function here is dnorm(). The x in the dnorm() function is not an object we have created; rather, it's indicating that there's a variable that is being evaluated, and the evaluation is the normal density at the mean of y and standard deviation of y. Make sure to include add=TRUE so that the curve is plotted on the same plot as the histogram. Here is what we get:
Here are some other good sources on the topic of probability distribution functions:
Monday, February 11, 2013
Important public health data now online
Maine CDC is pleased to release the Maine Behavioral Risk Factor Surveillance System (BRFSS) 2000-2010 Interactive web-query that should make most popular BRFSS public health data easily available. BRFSS non-institutional adult data is available at the State, District, and County levels in a variety of public health categories including General Health, Chronic Disease, Risk Factors, Prevention Activities, and Environmental Health. Since 1987, Maine BRFSS data has been an important surveillance data source for Maine CDC programs, other DHHS programs, and public health entities. A new version with additional data and a new weighting methodology will be available soon. Subsequent yearly updates will be available.
The BRFSS web-query should be intuitive in its use. It has both basic and advanced query functions and there is a "Quick Guide" reference for more information. The web-query, a link to the National BRFSS homepage with additional BRFSS data, and related BRFSS information are available at http://go.usa.gov/4yhm
Any questions regarding this Maine web-query system can be directed to Kip Neale, Maine BRFSS Coordinator at Kip.Neale@Maine.gov or 287-1420.
The BRFSS web-query should be intuitive in its use. It has both basic and advanced query functions and there is a "Quick Guide" reference for more information. The web-query, a link to the National BRFSS homepage with additional BRFSS data, and related BRFSS information are available at http://go.usa.gov/4yhm
Any questions regarding this Maine web-query system can be directed to Kip Neale, Maine BRFSS Coordinator at Kip.Neale@Maine.gov or 287-1420.
Thursday, February 7, 2013
Immunization recommendations
Each year, the Advisory Committee on Immunization Practices (ACIP) develops recommendations for routine use of vaccines in children, adolescents, and adults in the US. This year, for the first time, recommended immunization schedules for children and adolescents up to age 18 and adults age 19 and older are being published together. Updated schedules are available at http://www.cdc.gov/vaccines/schedules/
The MMWR for children and adolescent immunization schedules is available at http://go.usa.gov/4yJk
The MMWR for adult immunization schedules is available at http://go.usa.gov/4yJP
Adult vaccination is critical to public health. Far too few adults are fully vaccinated, leaving themselves and others at risk of vaccine preventable diseases, including pneumococcal disease, whooping cough, and shingles.
US CDC has issued an MMWR describing the status of vaccination of US adults in 2011. For more information, read the MMWR at http://go.usa.gov/4yzB
The MMWR for children and adolescent immunization schedules is available at http://go.usa.gov/4yJk
The MMWR for adult immunization schedules is available at http://go.usa.gov/4yJP
Adult vaccination is critical to public health. Far too few adults are fully vaccinated, leaving themselves and others at risk of vaccine preventable diseases, including pneumococcal disease, whooping cough, and shingles.
US CDC has issued an MMWR describing the status of vaccination of US adults in 2011. For more information, read the MMWR at http://go.usa.gov/4yzB
Friday, February 1, 2013
American Heart Month
February is American Heart Month. Most of us know someone who has had a heart attack or stroke. In Maine, more than 1 of every 4 deaths is from heart disease or stroke.
You can help protect yourself and your loved ones from heart disease and stroke by understanding the risks and taking these steps
Heart Month is a great time to learn about the Million Hearts Campaign.
Million Hearts is a national campaign with the goal of preventing 1 million heart attacks and strokes over five years.
For more about heart disease and stroke visit http://www.mainehearthealth.org/
- Know your ABCS
- Ask your doctor if you should take an Aspirin every day
- Find out if you have high Blood pressure or Cholesterol, if you do, work with your doctor to treat it
- If you Smoke, get help to quit ~ Maine Tobacco Helpline 1-800-207-1230
- Be physically active at least 30 minutes on most days
- Eat more fruits, vegetables and other foods low in sodium and transfat
- Take medicine as prescribed by your doctor
What else can you do?
For more about heart disease and stroke visit http://www.mainehearthealth.org/
Converting a dataset from wide to long
I recently had to convert a dataset that I was working with from a wide format to a long format for my analysis. I struggled with this a bit, but finally found the right sources and the right package to do it, so I thought I'd share my practical example of reshaping data in R. This post is specifically helpful for those using Demographic and Health Survey (DHS) data.
The DHS dataset includes one observation for each woman. For each observation, there are 20 columns for each birth she could have had for 16 different characteristics. If no birth happened then the cell is left missing. The characteristics include birth month, birth year, sex of the child, death month, death year, who the child lives with, current age of child, and so on.
For clarity, I've shortened the data to just 7 observations and two characteristics of each birth (b2 and b4) for 3 possible births:
Here v012 is the mother's age, all the b2 variables are year of births, and the b4 variables are the sex of the child.
So the first subject, aged 30, has had two births - one in 2000 and one in 2005, both boys. Since she did not have a third birth yet, her values for b2_03 and b4_03 are missing.
I would like to convert this dataset to one where I have one observation per child born - i.e. from the wide format to the long format. Specifically, I would like the columns to just be the caseid of the mother, age of the mother, year of birth of the child, and sex of the child.
There are a number of ways of doing this in R, including melt() and cast(), plyr(), aggregate(), and others. However, after a good deal of struggle and looking things up, I found that the reshape() function is the most intuitive and user-friendly for the needs of this problem. You don't need to use melt and cast at all, which are difficult to manipulate in my opinion.
The reshape() function takes in a number of important parameters that will be necessary for our transformation (there are more parameters than this, but I've boiled it down to the ones that are crucial):
reshape(data, varying = NULL, timevar = "time", idvar = "id", direction, sep = "")
data = dataframe you want to convert
varying = columns in the wide format that correspond to a single column in the long format
timevar = name of new variable that differentiates multiple observations from the same individual
idvar = variable in your dataset that identifies multiple records from the same individual
direction = "wide" if you're going from long to wide and "long" if you're going from wide to long
sep = the symbol that separates the name of a varying column from its number
Ok, so the most important thing about the reshape function is that you have to give it variable names that it can understand. Specifically, the columns that I want to turn into one column should all follow the same structure in the naming convention. It can be anything with a pattern like this:
Birthyear_1, Birthyear_2, Birthyear_3
b1, b2, b3
year.1, year.2, year.3
etc. As long as they follow the same text and number pattern. If the dataset is not in this format, you will have a lot of problems and I suggest changing your variable names before trying to convert. For the first example, you would use sep="_", the second would be sep="", and the third is sep="." and all of these are valid.
Ok, so let's try it out. All variables that we want to convert into one column we can put into the varying parameter and R will sort them out based on the naming patterns. We specify a long direction, that our id variable is caseid, and, importantly, that the separating symbol between the name of the variable and its order number is a "_".
births.long1<-reshape(births.wide, varying=c("b2_01","b2_02","b2_03", "b4_01", "b4_02", "b4_03"), direction="long", idvar="caseid", sep="_")
This gives us the following result (truncated picture):
Which is what we wanted! Each observation is now a birth, with corresponding mother caseid and age, and the year (b2) and sex of the child (b4) for each observation correctly lined up. R has added in the time variable which is just the number after the "_" for the varying variables. Notice that R automatically orders the dataset by this time variable, so you will have to re-order if you want it by caseid.
If you wanted to convert all of the 16 characteristics of each birth, you do not need to write out each individual variable. You can easily do it like this, indicating the number of the columns that you want:
births.long2<-reshape(births.wide, varying=c(3:8), direction="long", idvar="caseid", sep="_", timevar="order")
births.long2<-births.long2[order(births.long2$caseid),]
names(births.long2)<-c("subject","age","order", "birthyear", "childsex")
births.long2<-na.omit(births.long2)
Here, I've specified that all variables from column 3 to column 8 are varying variables, and I've also indicated that the new variable that R creates should be called "order" instead of the default "time". Then I reorder my dataset by caseid, name the new columns, and take out all of the missing observations, which are just non-existent births. I get the following result:
Notice how subject 6 is completely gone because she did not have any children and this is now a dataset of children ever born.
I've made this example of reshape very specific to DHS data, but there are also many great sources on how to reshape data in R. Here are a couple that I found especially helpful:
Subscribe to:
Comments (Atom)





