The FDIC regularly publishes a Failed Bank List and related statistics. This post uses data from the original XLS on the FDIC web site (which is formatted for human consumption) to produce the charts below using R. Note that the 2010 data below is incomplete.
The chart above serves as a theme for the analysis. It is interesting to note that there were a large number of “Problem” banks in 1990-1992, comparable in percentage terms to the last few years. However, the total number of banks has dropped every year since 1990 (indicated by the color above). This is true of both Commercial Banks and Savings Institutions (Value indicates the number of institutions).
The creation of New Charters has been decreasing and has not offset this trend.
Consolidation has been one reason for the decline in the total number of institutions (especially during the 1990s), but consolidations have decreased during the last several years.
The number of Problem Institutions in the early 1990s (and their assets) was even higher than in the last few years.
Dollar amounts are expressed in billions (on the following and subsequent charts).
Combined Deposit Insurance Fund
The Combined Deposit Insurance Fund is used to guarantee the safety of deposits in member banks. The amount of Insured Deposits has been growing.
However, the Fund Balance itself (along with the Reserve Ratio) has been negative since 2009. The balance was also negative in 1991.
The Number of Failed Institutions (described in detail on the Failed Bank List) is actually lower than in the early 1990s, but the total assets involved are much higher.
The amount of assistance provided has also been greater during the last few years. There was actually assistance provided in 1990-1992, but the relative amount is so small that it does not appear in the charts below.
Definitions do not appear in the spreadsheet, but according to other parts of the FDIC site, the “estimated loss is the difference between the amount disbursed from the Deposit Insurance Fund (DIF) to cover obligations to insured depositors and the amount estimated to be ultimately recovered from the liquidation of the receivership estate”. Simply put, it is the amount paid out by the government that could not be recovered in some way from the failed bank.
The “Resolution Receivables” are also not defined in the spreadsheet, but elsewhere on the FDIC site they appear to represent “payments made by the DIF to cover obligations to insured depositors, advances to receiverships and conservatorships for working capital, and administrative expenses paid on behalf of receiverships and conservatorships”.
Number of Employees
The number of FDIC employees has been adjusted in some way; years prior to 2008 have been revised (according to a note in the spreadsheet) “according to a new full-time equivalent methodology”.
Technical Notes
The original XLS from the FDIC web site was saved as a csv using Excel. This csv and the R Source used to create the plots above are available on github.
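For reference, loading the csv looks something like the following (the filename here is an assumption; the actual file is in the github repository). Note that read.csv converts string columns to factors by default, which explains some of the casting later on.
# Filename assumed for illustration - see the github repository for the actual file
df = read.csv('fdic_statistics.csv')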
Data Preparation
Many texts on data analysis or data mining comment on the amount of time spent preparing data. It needs to be “cleansed” or “reshaped” or otherwise transformed into a form that can be manipulated in the intended manner. Because the spreadsheet from the FDIC is directed at a human audience (rather than a computer program), it has a number of irregularities: text in the header and footer, subheadings over related sections, columns added for spacing, and spaces added to cells. In addition, naming the rows and columns and reshaping the data make it easier to plot. There are a number of mechanisms available in native R and in contributed packages. I ended up using several of them, but because of overlapping functionality there are many ways this work could be accomplished. A few examples are below; the rest can be seen in the source file.
To remove empty columns, reference the data frame column indices as a vector of negative numbers.
df=df[c(-2,-5,-7,-10,-12,-14,-16,-18,-20,-22,-24,-26,-28,-30,-32,-34,-36,-38,-40,-42)]
Rows can be removed in a similar fashion.
df=df[c(-1,-5,-9,-10,-13,-14,-18,-21,-24, -27),]
To remove trailing rows, use a negative sequence as the row index of the data frame.
df=df[-(29:36),]
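As a more generic alternative (a sketch of mine, not from the original source), empty columns could be detected programmatically instead of hard-coding the indices:
# Drop any column that contains only NA or empty strings
empty = sapply(df, function(col) all(is.na(col) | col == ''))
df = df[, !empty]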
To change the column names, substitute out characters
colnames(df)=gsub('X','',colnames(df))
or explicitly set column names.
colnames(df)[1]='Statistic'
I am kind of fuzzy about the specific rules, but columns must be cast as factors in certain contexts and as characters in others to get the desired result. If there is a set rule about usage, please let me know. At this point, I am assuming that these types are not handled in a consistent way and you just have to know how a particular function behaves.
df$Statistic=as.character(df$Statistic)
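A small illustration (an example of mine, not from the original post) of why this cast matters before the cell assignments further below: writing a new string into a factor column silently produces NA.
x = factor(c('a','b'))
x[1] = 'c'   # warning: invalid factor level, NA generated
x            # [1] <NA> b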
To trim out leading spaces in all cells of a given column, use an appropriate regular expression.
df$Statistic=gsub('^ *','',df$Statistic)
Individual cells can be renamed explicitly,
df$Statistic[11]='Problem Institutions Number'
or in a somewhat generic fashion where the contents of one field are appended to another.
df$Statistic[3]=paste(df$Statistic[2],df$Statistic[3])
To pivot the data, you can sometimes simply do a transpose, but the result is a matrix.
t(df)
A better result is obtained using the reshape package.
library(reshape)
df.melted=melt(df, id=”Statistic”)
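The melt call names the column holding the old column names “variable”; the steps below assume it has been renamed to Year, presumably in the full source on github. A minimal sketch:
# Assumed renaming step - see the source file on github for the actual code
names(df.melted)[names(df.melted)=='variable'] = 'Year'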
The data needs to be treated as numeric. To strip out all commas and cast the values to numeric:
df.melted$value=as.numeric(gsub(',','',as.character(df.melted$value)))
Again, casting a factor directly to numeric produces the underlying level codes rather than the original values, so the cast has to go through as.character.
df.melted$Year=as.numeric(as.character(df.melted$Year))
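To illustrate (an example of mine, not from the original post):
f = factor(c('10','2','30'))
as.numeric(f)                # 1 2 3 - the level codes
as.numeric(as.character(f))  # 10 2 30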
Subsets of each series contained in the melted data can be obtained by matching on the Statistic in view.
fdic_employees = df.melted[df.melted$Statistic==
  'Number of FDIC Employees***', c('Year','value')]
The chart at the top was obtained by combining results from several of these subsets. The sqldf package is used to combine the data frames and do some calculations.
pin=df.melted[df.melted$Statistic==”Problem Institutions Number”,]
cb= df.melted[df.melted$Statistic==”Commercial Banks”,]
si=df.melted[df.melted$Statistic==”Savings Institutions”,]
library('sqldf')
pct_prob=sqldf('SELECT
  p.Year Year,
  p.value prob, c.value + s.value "Total Institutions",
  p.value / (c.value + s.value) * 100 "Percent Problem"
  FROM pin p
  JOIN cb c ON c.Year = p.Year
  JOIN si s ON s.Year = p.Year')
Plotting the Data
The plotting could be done in individual chunks of code like the following.
library(ggplot2)
p = qplot(data=pct_prob, Year, `Percent Problem`, color=`Total Institutions`, group=1)
p = p + geom_line()
p + opts(axis.text.x=theme_text(angle=-90, hjust=0), title='Total FDIC Problem Institutions by Year')
ggsave('totalFDICProblemInstitutionsByYear.png')
However, many of the charts had exactly the same format and simply applied to a different series. So I created a function that would produce the desired chart for any supplied Statistic.
fdicPlot=function(statistic){
  f=df.melted[df.melted$Statistic==statistic,]
  p=ggplot(data=f, aes(x=Year, y=value)) + geom_line() + geom_point() + opts(title=statistic)
  # Build the file name from the statistic (strip spaces, parentheses and asterisks)
  img=paste(gsub(' |\\(|\\)|\\*','',statistic),'.png',sep='')
  ggsave(filename=img, plot=p)
}
The function could then be called for each available Statistic.
for (i in df$Statistic){fdicPlot(i)}
All in all, a useful exercise for reviewing FDIC data a bit more closely and seeing how R can be used to munge data into a format conducive to producing charts with ggplot2.