How to smartly choose your Data Science toolkit – R vs Python
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I often get questions from readers who are caught in the tool conundrum:
Should I choose R or Python to start learning data science?
If you are new to the world of data science and have not tried either of these languages, it is easy to land on this question. In this post we shall carefully examine both with the needs of data science in mind.
R
R is built by data scientists for data scientists, so data analysis, model building, and communicating results are its core strengths.
The major power of R is its user community, which offers extensive support and has built up CRAN, its package repository.
A few great packages for you to start exploring in R would be:
- ggplot2/ggvis – Data Visualization
- dplyr – Data Munging and Wrangling
- data.table – Data Wrangling
- caret – Machine Learning Workbench
- reshape2 – Data Shaping
R has a steep learning curve and is generally built for stand-alone systems, although there are several packages to speed up the process.
If you are a beginner, I would strongly recommend downloading RStudio, which is the de facto IDE for R.
Python
Python is a great programming language and is very easy to start with. You can easily perform most data science tasks, such as data wrangling, munging, and visualization, and of course it has a great machine learning library: scikit-learn. If you are already familiar with Java/C++, it is straightforward to get started with Python.
According to the data science survey conducted by O’Reilly, almost 40% of data scientists use Python to solve their problems. Python also has a great community of open source packages.
Below is a list of packages which are great for data science applications:
- Seaborn – Data Visualization
- Pandas – Data Munging and Wrangling
- NumPy/SciPy – Data Wrangling/Representation
- Scikit-learn – Machine Learning library
Clash of the Titans: Python vs R
It is indeed a clash of the titans of the data science world. Here are a few guidelines which you could use to choose a language.
Popularity
Python is one of the top programming languages. Let us get down to the numbers in the data science community. I got the data from here. The graph below shows that the popularity of Python is increasing in the data science community. (The plot was done in R.)
Personal Choice
Coming from an engineering background, I chose Python as it felt more natural to me. Later I explored R to understand its strengths and support. The best way is to start with one and then learn the other to make use of its strengths.
Learning Curve
R has a steeper learning curve than Python, but deliberate practice can help you climb the ladder faster. To learn R, I deliberately chose to use it for my projects, thereby gaining knowledge and experience with it.
Type of Problem
Often the type of problem you are solving has a bearing on the choice of language. If the problem at hand calls for thorough data analysis, I choose R; if I need to write quick scripts to get things done or scrape the web, it is simpler to use Python.
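To give a feel for the "quick script" side of Python, here is a minimal sketch of pulling links out of a web page using only the standard library. The names LinkCollector, extract_links and SAMPLE_PAGE are made up for this illustration, and the page content is a hard-coded stand-in; a real script would download the page first.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content (an assumption for this sketch). A real script
# would fetch it first, for example with urllib.request.urlopen.
SAMPLE_PAGE = """
<html><body>
  <a href="https://example.com/r">R</a>
  <a href="https://example.com/python">Python</a>
  <a name="no-link">anchor without href</a>
</body></html>
"""

links = extract_links(SAMPLE_PAGE)
print(links)
```

For anything beyond a toy page, a dedicated scraping library is usually more comfortable, but a few lines like these are often enough to get things done.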
Communication
An often overlooked but important data science activity is communicating results and exchanging ideas. IPython notebooks are a beauty in themselves, providing the best interface to communicate, closely followed by R Markdown.
Verdict
As a data scientist it is always best to be open to learning more tools. Preferring one over the other may be good to start with, but it is always best to know both and use each tool to its strengths.