Learning to use a data analysis tool well takes significant effort, so people tend to continue using the tool they learned in college for much of their careers. As a result, the software used by professors and their students is likely to predict what the next generation of analysts will use for years to come. I track this trend, and many others, in my article The Popularity of Data Analysis Software. In the latest update (4/13/2012) I forecast that, if current trends continued, the use of the R software would exceed that of SAS for scholarly applications in 2015. That was based on the data shown in Figure 7a, which I repeat here:
Here is the data from Google Scholar:
Year      R    SAS    SPSS
1995      8   8620    6450
1996      2   8670    7600
1997      6  10100    9930
1998     13  10900   14300
1999     26  12500   24300
2000     51  16800   42300
2001    133  22700   68400
2002    286  28100   88400
2003    627  40300   78600
2004   1180  51400  137000
2005   2180  58500  147000
2006   3430  64400  142000
2007   5060  62700  131000
2008   6960  59800  116000
2009   9220  52800   61400
2010  11300  43000   44500
2011  14600  32100   32000
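To work with these counts in R, one option is to enter them as a data frame and pull out one numeric vector per package. This is only a sketch: the post does not show how the R, SAS and SPSS objects were created, so the data frame name `usage` and the choice of plain vectors are assumptions. Plain, unindexed vectors would at least be consistent with forecast rows being labeled 18 through 22 rather than by calendar year.

```r
# Google Scholar article counts by year, copied from the table above.
# `usage` is a hypothetical name chosen for this illustration.
usage <- data.frame(
  year = 1995:2011,
  R    = c(8, 2, 6, 13, 26, 51, 133, 286, 627, 1180, 2180,
           3430, 5060, 6960, 9220, 11300, 14600),
  SAS  = c(8620, 8670, 10100, 10900, 12500, 16800, 22700, 28100,
           40300, 51400, 58500, 64400, 62700, 59800, 52800, 43000, 32100),
  SPSS = c(6450, 7600, 9930, 14300, 24300, 42300, 68400, 88400,
           78600, 137000, 147000, 142000, 131000, 116000, 61400, 44500, 32000)
)

# One numeric vector per package (assumed form of the series used later).
R    <- usage$R
SAS  <- usage$SAS
SPSS <- usage$SPSS
```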
ARIMA Forecasting
We can forecast the use of R using Rob Hyndman’s handy auto.arima function to forecast five years into the future:
> library("forecast")
> R_fit <- auto.arima(R)
> R_forecast <- forecast(R_fit, h=5)
> R_forecast
   Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
18          18258 17840 18676 17618 18898
19          22259 21245 23273 20709 23809
20          26589 24768 28409 23805 29373
21          31233 28393 34074 26889 35578
22          36180 32102 40258 29943 42417
We see that even if the use of SAS and SPSS were to remain at their current levels, R use would surpass both in 2016 (see the Point Forecast column, where rows 18-22 represent the years 2012-2016).
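The crossover year can be checked with a few lines of base R. This sketch hard-codes the point forecasts for R and the 2011 counts for SAS and SPSS from the figures reported in this post; the variable names are made up for the illustration.

```r
# R point forecasts for 2012-2016, as reported in this post.
r_forecast <- c(`2012` = 18258, `2013` = 22259, `2014` = 26589,
                `2015` = 31233, `2016` = 36180)

# 2011 counts for SAS and SPSS, assumed to stay flat.
sas_2011  <- 32100
spss_2011 <- 32000
threshold <- max(sas_2011, spss_2011)

# First forecast year in which R exceeds both.
first_idx      <- which(r_forecast > threshold)[1]
crossover_year <- as.integer(names(r_forecast)[first_idx])
crossover_year  # 2016
```

The 2015 forecast of 31,233 falls just short of both 2011 figures, which is why the crossover lands in 2016 rather than 2015.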
If we follow the same steps for SAS we get:
> SAS_fit <- auto.arima(SAS)
> SAS_forecast <- forecast(SAS_fit, h=5)
> SAS_forecast
   Point Forecast     Lo 80   Hi 80    Lo 95 Hi 95
18          21200  16975.53 25424.5  14739.2 27661
19          10300    853.79 19746.2  -4146.7 24747
20           -600 -16406.54 15206.5 -24774.0 23574
21         -11500 -34638.40 11638.4 -46887.1 23887
22         -22400 -53729.54  8929.5 -70314.4 25514
It appears that if the use of SAS continues to decline at its precipitous rate, all scholarly use of it will stop in 2014 (the number of articles published can’t be less than zero, so view the negatives as zero). I would bet Mitt Romney $10,000 that that is not going to happen!
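Treating the negatives as zero is a one-liner with `pmax`. The sketch below uses the SAS point forecasts reported in this post; the variable names are invented for the example. It also shows that the point forecasts fall by a constant 10,900 per year, which equals the observed drop from 2010 (43,000) to 2011 (32,100), so the forecast is in effect extending that last step in a straight line.

```r
# SAS point forecasts for 2012-2016, as reported in this post.
sas_point <- c(`2012` = 21200, `2013` = 10300, `2014` = -600,
               `2015` = -11500, `2016` = -22400)

# Article counts cannot be negative, so floor the forecasts at zero.
sas_clamped <- pmax(sas_point, 0)

# First year in which the floored forecast reaches zero.
first_zero_year <- names(sas_clamped)[which(sas_clamped == 0)[1]]

# Year-over-year steps, starting from the observed 2011 count of 32100.
sas_steps <- diff(c(32100, unname(sas_point)))
```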
I find the SPSS prediction the most interesting:
> SPSS_fit <- auto.arima(SPSS)
> SPSS_forecast <- forecast(SPSS_fit, h=5)
> SPSS_forecast
   Point Forecast   Lo 80 Hi 80   Lo 95  Hi 95
18        13653.2  -16301 43607  -32157  59463
19        -4693.6  -57399 48011  -85299  75912
20       -23040.4 -100510 54429 -141520  95439
21       -41387.2 -145925 63151 -201264 118490
22       -59734.0 -193590 74122 -264449 144981
The forecast has taken a logical approach of focusing on the steeper decline from 2005 through 2010 and predicting that this year (2012) is the last time SPSS will see use in scholarly publications. However, the part of the graph that I find most interesting is the shift from 2010 to 2011, which shows SPSS use still declining but at a much slower rate.
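The slowdown is easy to see by differencing the SPSS counts from the Google Scholar data. This is a small illustrative sketch, with variable names chosen only for this example.

```r
# SPSS article counts for 2005-2011, from the Google Scholar data.
spss <- c(`2005` = 147000, `2006` = 142000, `2007` = 131000,
          `2008` = 116000, `2009` = 61400, `2010` = 44500, `2011` = 32000)

# Change from each year to the next, labeled by the later year.
yearly_change <- diff(unname(spss))
names(yearly_change) <- names(spss)[-1]
yearly_change
```

The largest single drop is 54,600 articles between 2008 and 2009. The drop into 2010 is 16,900, and the drop into 2011 is 12,500, so the decline continues but is easing.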
Any forecasting book will warn you of the dangers of looking too far beyond the data, and I think these forecasts do just that. The 2015 figure in the Popularity paper and in the title of this blog post came from an exponential smoothing approach that did not match the rate of acceleration as well as the ARIMA approach does.
Colbert Forecasting
While ARIMA forecasting has an impressive mathematical foundation it’s always fun to follow Stephen Colbert’s approach: go from the gut. So now I’ll present the future of analytics software that must be true, because it feels so right to me personally. This analysis has Colbert’s most important attribute: truthiness.
The growth in R’s use in scholarly work will continue for two more years, at which point it will level off at around 25,000 articles in 2014. This growth will be driven by:
- The continued rapid growth in add-on packages (Figure 10)
- The attraction of R’s powerful language
- The near monopoly R has on the latest analytic methods
- Its free price
- The freedom to teach with real-world examples from outside organizations, which is forbidden to academics by SAS and SPSS licenses (it benefits those organizations, so the vendors say they should have their own software license).
What will slow R’s growth is its lack of a graphical user interface that:
- Is powerful
- Is easy to use
- Provides journal style output in word processor format
- Is standard, i.e. widely accepted as The One to Use
- Is open source
While programming has important advantages over GUI use, many people will not take the time needed to learn to program. Therefore they rarely come to fully understand those advantages. Conversely, programmers seldom take the time to fully master a GUI and so often underestimate its capabilities. Regardless of which is best, GUI users far outnumber programmers and, until this is resolved, it will limit R’s long-term growth. There are GUIs for R, but there are so many to choose from that none has become the clear leader (Deducer, R Commander, Rattle, Red-R, at least two from commercial companies, and still more here). If a clear leader were to emerge from this “GUI chaos”, then R could continue its rapid growth and end up as the most used package.
The use of SAS for scholarly work will continue to decline until it matches R at the 25,000 level. This is caused by competition from R and other packages (notably Stata) but also by SAS Institute’s self-inflicted GUI chaos. For years they have offered too many GUIs, such as SAS/Assist, SAS/Insight, IML/Studio, the Analyst application, Enterprise Guide, Enterprise Miner and even JMP (which runs SAS nicely in recent versions). Professors looking to meet student demand for greater ease of use could not decide what to teach, so they continued teaching SAS as a programming language. Even now that Enterprise Guide has evolved into a good GUI, many SAS users do not know what it is. If SAS Institute were to completely replace their default Display Manager System with Enterprise Guide, they could bend the curve and end up at a higher level of perhaps 27,000.
The use of SPSS for scholarly work will decline only slightly this year and will level off in 2013 because:
- The people who needed advanced methods and were not happy calling R functions from within SPSS have already switched to R or Stata
- The people who like to program and want a more flexible language than SPSS offers have already switched to R or Stata
- The people who needed a more advanced GUI have already switched to JMP
The GUI users will stick with SPSS until a GUI as good (or close to as good) comes to R and becomes widely accepted. At The University of Tennessee where I work, that’s the great majority of SPSS users.
Stata’s growth will level off in 2013 at a level that will leave it in fourth place. The other packages shown in Figure 7b will also level off around the same time, roughly maintaining their current place in the rankings. A possible exception is JMP, whose interface is radically superior to the others for exploratory analysis. Its use could continue to grow, perhaps even replacing Stata for fourth place.
The futures of Enterprise Miner and SPSS Modeler are tied to the success of each company’s more mainstream products, SAS and SPSS Statistics respectively. Use of those products is generally limited to one university class in data mining, while the other software discussed here is widely used in many classes.
So there you have it: the future of analytics revealed. No doubt each reader has found a wide range of things to disagree with, so I encourage you to follow the detailed blog at Librestats to collect your own data from Google Scholar and do your own set of forecasts. Or simply go from the gut!