Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I will assume you have read through part 1 and have the csv file loaded. While we covered some basic graphing in the last post i hope to get into a little more of the data crunching. Specifically I am interested in the names which where driven by a specific cultural phenomena. There are clearly decade long trends where “Beth” is fading in popularity, but you can not point to a specific event or figure in the public consciousness that is driving this. Enough talk let me show you what I mean. Here is the chart for the female name Trinity. Hat Tip to the google dev python people for pointing this out.
nm<-"Trinity" p<-ggplot(names,aes(x=Year,y=Rank)) p<- p + ylim(max(names$Rank),min(names$Rank)) p<- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1) p<- p + opts(title = "People Liked the Matrix Way Too Much") matrix.label<-data.frame(Year = 1985 , Rank = 220, Text = "Matrix Released - 1999") # create the custom on graphic text label p <- p + geom_rect(aes(xmin = 1998 , xmax = 2000 , ymin = 1000 , ymax = 1 ),fill = "Green", alpha = .002) p <- p + geom_text(data = matrix.label, aes(label = Text)) p
You can see here that some people loved the Matrix enough to brand their child for life. As you can see this rapid rise in popularity in a one or two year period can more easily be attributed to a particular social phenomena. Let take a first stab at this. What are the female names that have the greatest change in our data.
#This would be more elequently done with Hadley Wickam's ddply. The lapply/do.call("rbind") combo is brillinatly useful and for simple things I use name.min.max<-function(nm){ data.frame( name = nm, min = min(names[which(names$Female == nm),2]), max = max(names[which(names$Female == nm),2]), dif = max(names[which(names$Female == nm),2]) - min(names[which(names$Female == nm),2]) ) } name.list<-unique(names$Female) out<-lapply(X = as.list(name.list), FUN = name.min.max) out <- do.call("rbind", out) female.dif<-out[order(-out$dif),] #Top Female Names with greatest change overtime # Need to seperate the winners and losser just plots largest difference nm< -as.character(female.dif[1:10,1])
What we have done here is built a simple name.min.max function that returns a data frame of for a female name passed in. the lapply() function iterates over a list of distinct Female Names in name.list and returns a list of data frames(one fr each name). do.call(“rbind”,out) stitches them all back together. Then we order by the greatest difference a grab the top ten.
> female.dif[1:10,] name min max dif 469 Ava 4 999 995 1232 Samantha 3 998 995 915 Tammy 8 994 986 59 Debra 2 982 980 29 Cheryl 13 992 979 1257 Kayla 11 988 977 38 Kathy 14 990 976 22 Janice 21 996 975 738 Lily 18 993 975 1844 Taylor 6 978 972
Here is the result lets graph it with a facet for each name.
p<-ggplot(names,aes(x=Year,y=Rank)) p <- p + ylim(max(names$Rank),min(names$Rank)) p <- p + geom_line(data = names[which(names$Female %in% nm),], aes(group=Female, colour = Female), alpha = 1, size = 1) p <- p + opts(title = "Top Movers") p <- p + facet_wrap(~Female) p
We definitely found names that changed over time but we have names that are rising over time and falling. Also our new name.min.max function doesn’t give us any insight about rate of change. Is it a slow change like Lily or a dramatic rise like Kayla. As we said before names with a dramatic rise over a short period are more likely associated with something in the pop culture of the time.
What we are most interested is the names that jumped the most over a one year period. I tried a couple different techniques to get this and im going to walk through what i think to be the best. I would love to see other people’s approach. To calculated this we are going to use the function Delt() from the package quantmod. Quantmod is a package for testing financial models. If you actually want to make god money with R, stop reading about baby names and go build a trading platform. Delt is used to calculate change in stock prices over time. i.e. 2% up since yesterday. So load up quantmod and see what happens.
library(quantmod) female.names<-names[,c(1,4,2)] female.names<-female.names[order(female.names$Female, female.names$Year),] female.names$delta <- Delt(female.names$Rank) female.names[1540:1580,]
Year Female Rank Delt.1.arithmetic 44018 1994 Alexis 18 -0.333333333 45014 1995 Alexis 14 -0.222222222 46008 1996 Alexis 8 -0.428571429 47008 1997 Alexis 8 0.000000000 48006 1998 Alexis 6 -0.250000000 49003 1999 Alexis 3 -0.500000000 50006 2000 Alexis 6 1.000000000 51005 2001 Alexis 5 -0.166666667 52005 2002 Alexis 5 0.000000000 53007 2003 Alexis 7 0.400000000 54011 2004 Alexis 11 0.571428571 55013 2005 Alexis 13 0.181818182 56014 2006 Alexis 14 0.076923077 57019 2007 Alexis 19 0.357142857 58015 2008 Alexis 15 -0.210526316 59013 2009 Alexis 13 -0.133333333 41952 1991 Alexus 952 72.230769231 42616 1992 Alexus 616 -0.352941176 43354 1993 Alexus 354 -0.425324675 44250 1994 Alexus 250 -0.293785311 45234 1995 Alexus 234 -0.064000000 46178 1996 Alexus 178 -0.239316239 47182 1997 Alexus 182 0.022471910 48184 1998 Alexus 184 0.010989011 49228 1999 Alexus 228 0.239130435 50252 2000 Alexus 252 0.105263158 51281 2001 Alexus 281 0.115079365 52317 2002 Alexus 317 0.128113879 53370 2003 Alexus 370 0.167192429 54425 2004 Alexus 425 0.148648649 55529 2005 Alexus 529 0.244705882 56547 2006 Alexus 547 0.034026465 57627 2007 Alexus 627 0.146252285 58719 2008 Alexus 719 0.146730463 59895 2009 Alexus 895 0.244784423 48992 1998 Alexys 992 0.108379888 49821 1999 Alexys 821 -0.172379032 50842 2000 Alexys 842 0.025578563 51837 2001 Alexys 837 -0.005938242 52909 2002 Alexys 909 0.086021505 53958 2003 Alexys 958 0.053905391
We subsetted the female names into their own object, ordered by name and year(this is important when calculating a rolling change). We opened up a slice of the new object and there are some problems. First is Delt() function is returning arithmetic differences(that is the difference in the two observation divided by the first). This would be a great help if the numbers it was calculating were truly continuous. Remember, these are ordinal rankings.
1998 Alexis 6 -0.250000000 1999 Alexis 3 -0.500000000
This change in Alexis ranking between 1998 and 1999 is being a calculated as a 50% increase. This isn’t strictly meaningful because what we want to calculate is the absolute change. This brings up my second favorite part of R. If you have a problem with how function is behaving, just go change the function.
we can download the source file for quantmod:
wget http://cran.r-project.org/src/contrib/quantmod_0.3-17.tar.gz
extract the tar file and grep the function you care about:
tar -zxvf quantmod_0.3-17.tar.gz
grep Delt ./quantmod/R/*
We get a R source file named ./quantmod/R/OHLC.transformations.R that contains the function we care about. Here is the original definition:
`Delt` <- function(x1,x2=NULL,k=0,type=c('arithmetic','log')) { x1 <- try.xts(x1, error=FALSE) type <- match.arg(type[1],c('log','arithmetic')) if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length'); if(is.null(x2)){ x2 <- x1 #copy for same symbol deltas if(length(k)
There is a default type parameter that if not set will apply to the row.
unclass(x2)/Lag(x1,K.)-1
This is how the change is normalize. what we want is:
unclass(x2) - Lag(x1,K.)
What we can do is add another type option(absolute) and add an else if to handle it. We can run this to define a new function and apply it to the data:
`Delt.Absolute` <- function(x1,x2=NULL,k=0,type=c('arithmetic','log')) { x1 <- try.xts(x1, error=FALSE) type <- match.arg(type[1],c('log','arithmetic', 'absolute')) if(length(x2)!=length(x1) && !is.null(x2)) stop('x1 and x2 must be of same length'); if(is.null(x2)){ x2 <- x1 #copy for same symbol deltas if(length(k)
Year Female Rank Delt.1.absolute 28913 1978 Aaron 913 -33 29990 1979 Aaron 990 77 30883 1980 Aaron 883 -107 31956 1981 Aaron 956 73 32969 1982 Aaron 969 13 33881 1983 Aaron 881 -88 49978 1999 Abagail 978 97 50954 2000 Abagail 954 -24 51920 2001 Abagail 920 -34 52882 2002 Abagail 882 -38 53868 2003 Abagail 868 -14
now the function returns an absolute change. We still have an issue. look at the difference calculated between Aaron and Abagail. It is calculating between names, not good. To solve im going to use a trick I read on SO.
female.names[c(TRUE, female.names$Female[-1] != female.names$Female[-length(female.names$Female)]), 4] <- NA female.names<-female.names[order( female.names[4]),]
This elegant line of code take the first instance of all the names in our data frame and replaces the Delt col with a NA(R’s missing data handler). There is some tweaking if you want to use it on numeric data but it is a clever piece of code. After we reorder and extract the top ten names with greatest one year positive change(which is actually a negative number from our Delt,Absolute function, think a change in Rank 10 -> 2 is actually -8 change). Lets plot these top ten.
female.names.winners<-female.names[order(female.names[4]),] top.female.names<-unique(female.names.winners[1:12,2]) p<-ggplot(names,aes(x=Year,y=Rank)) + ylim(max(names$Rank),min(names$Rank)) + geom_line(data = names[which(names$Female %in% top.female.names),], aes(group=Female, colour = Female), alpha = 1, size = 1)+ opts(title = "Female Baby Name Popularity Since 1950")+ opts(axis.text.x=theme_text(angle=-70),hjust=0) + facet_wrap(~ Female) p
As you can see there are some names that flash up but do not stick around. However Some names shoot up and stayed popular. Take Desiree for example. Lets take a closer look.
p<-ggplot(names,aes(x=Year,y=Rank)) p <- p + ylim(max(names$Rank),min(names$Rank)) # Flip the Y-Axis nm <- "Desiree" # enter a female name p <- p + geom_line(data = names[which(names$Female == nm),], aes(group=Female, colour = Female), alpha = 1, size = 2) + opts(title = nm) p
Now I have never seen this movie but clearly Brando has an effect. Desiree (1954)
I hope you found some of this useful. There are some fun association that can be found. Have fun. Next Up fun with Baby Names Part 3: Biggest Losser’s
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.