A use of gsub, reshape2 and sqldf with healthcare data
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Building off other industry-specific posts, I want to use healthcare data to demonstrate the use of R packages. The data can be downloaded here. To read the .CSV file in R you might read the post how to import data in R. Packages in R are stored in libraries and often are pre-installed, but reaching the next level of skill requires being able to know when to use new packages and what they contain. With that let’s get to our example.
gsub
When working with vectors and strings, especially in cleaning up data, gsub
makes cleaning data much simpler. In my healthcare data, I wanted to convert dollar values to integers (ie. $21,000 to 21000), and I used gsub
as seen below.
Reading the data in R from CSV file. I am naming the dataset “hosp”.
hosp <- read.csv("Payment_and_value_of_care_-_Hospital.csv")
In the code below I will remove hospitals without estimates
hospay<-hosp[hosp$Payment.category !="Not Available" & hosp$Payment.category !="Number of Cases Too Small",]
Now its time to remove the dollar signs and commas in estimate values
hospay$Payment <- as.numeric(gsub("[$,]","",hospay$Payment)) hospay$Lower.estimate <- as.numeric(gsub("[$,]", "", hospay$Lower.estimate)) hospay$Higher.estimate <- as.numeric(gsub("[$,]", "", hospay$Lower.estimate)) head(hospay$Payment) [1] 13469 12863 12308 12222 21376 14740
reshape2
In looking at the data, I wanted to focus on the Payment estimate. So I used the melt()
function that is part of reshape2
. Melt allows pivot-table style capabilities to restructure data without losing values.
library(reshape2) hosp_mel<-melt(data=hospay,id=c(2,5,9,11), measure=as.numeric(c(13)), value.name='Estimate') names(hosp_melt) [1] "Hospital.name" "State" "Payment.measure.name" "Payment.category" "variable" "Estimate"
sqldf
With my data melted, I wanted to get the average estimate for heart attack patients by state. This is a classic SQL query, so bringing in sqldf
allows for that.
library(sqldf) names(hosp_melt) [3] <- "paymentmeasurename" hosp_est <- sqldf("select State, avg(Estimate) as Estimate from hosp_melt where paymentmeasurename = 'Payment for heart attack patients' group by State") head(hosp_est) State Estimate 1 AK 20987.60 2 AL 21850.32 3 AR 21758.00 4 AZ 22690.62 5 CA 22707.45 6 CO 21795.30
If you have any question feel free to leave a comment below.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.