R is for Research, Python is for Production
Both R and Python are great. In this article, we’ll highlight some of the strengths of each language by looking at where the major development efforts are within each ecosystem.
R is for Research
If I had to describe R in one word, it would be: tidyverse. It has made research tasks – wrangling data, visualizing outcomes, iterating from idea to code – painless. In fact, it’s a joy. I’ll explain why R is for Research using the Ultimate R Cheat Sheet, a one-stop shop for the R ecosystem.
When starting with R, the tidyverse is an ideal place to begin your journey. It is a formalized set of packages and tools that share a consistently structured programming interface, as opposed to base R, which is notably more complex and less user-friendly.
We see many smaller packages that tackle specific problems. The following are the most important packages:
dplyr & ggplot2
Two packages you’ll rely on daily in R are dplyr, for data manipulation, and ggplot2, for visualization. These are the two most important skills a data scientist or data analyst can have.
Rmarkdown
One of the most exceptional aspects of R is without a doubt Rmarkdown, which is a framework for creating reproducible reports, presentations, blogs, journals and more! Imagine having a report that runs itself and creates an HTML page or PDF that is easy to share with your team. Definitely a more streamlined approach than hundreds of clicks in Excel every Monday morning.
Shiny
Shiny is another framework within R that is used to create interactive web applications. One of the best features of Shiny is that it gives the non-data-focused members of your team the data science tools they need for decision making through an easy-to-use GUI (graphical user interface). Imagine your team getting together for a Monday afternoon planning session, having already reviewed the previous week’s report created in Rmarkdown, and running simulations using your collaborative Shiny web application to determine where the data is guiding you next.
Where R is Growing
Next, if we scroll through to the “Special Topics Page”, we can see the R ecosystem is growing. This is a key feature that distinguishes the R Ecosystem from the Python Ecosystem.
We can see that R has expanded into:
- Time Series and Forecasting: Modeltime and Timetk
- Financial Analysis (and other domains): Tidyquant, Quantmod
- Network Analysis and Visualization: Tidygraph and ggraph
- Text Analysis: Tidytext and TextRecipes
- Geospatial Analysis and Visualization: Thematic Maps
- Machine Learning: H2O, Tidymodels, and MLR3
What is R missing?
There is a noticeable gap in production. R has Shiny (Apps) and Plumber (APIs, not shown), but Automation Tools like Airflow and Cloud Software Development Kits (SDKs) are primarily available in Python.
R Overall
R is really something special when doing research because of the tidyverse, which streamlines data wrangling and visualization. Honestly, you’ll be 3-5X more productive doing data wrangling in R once you become proficient with the tidyverse.
Why is Python Great?
Python is amazing too, but for different reasons. Let’s take a Python package like OpenCV, for Computer Vision.
This is a real strength for the Python language because we can do crazy cool things like Object Detection with OpenCV.
But how much does this apply to my daily life? Around zero. Why? Because I’m a business analyst and data scientist who works with SQL databases. I’m more interested in how Python will help me better mine for information and productionize the results.
Let’s check out the Python Ecosystem using the Ultimate Python Cheat Sheet (note that this is different from the R cheat sheet shown earlier).
We see that there’s Pandas for essentially everything related to import, tidying and data wrangling. So what is Pandas? Pandas is an object-oriented tool for data wrangling in Python.
Pandas vs Tidyverse
While programmers love pandas, business analysts may initially struggle with the object-oriented (pythonic) way of having Data Frames with methods.
customer_counts_df = df.groupby('customer_id').size()
Everything in Python is an object, and we call these methods (e.g. groupby and size) on the object. This call doesn’t seem too bad. But we are normally trying to do many more wrangling operations. As the chain grows, the code becomes harder to write, less readable, and more complex.
Conversely, in R using the tidyverse we use a different syntax with a pipe (%>%). This reads much like SQL and follows the order in which a user thinks through the data wrangling steps.
customer_counts_tbl <- df %>% group_by(customer_id) %>% summarize(count = n())
This tidyverse data wrangling workflow often makes it much easier for analysts to expand the set of operations into 10 or more data wrangling commands. Remember, the challenge isn’t typing code, it’s turning your thoughts into code. This is where the tidyverse is really powerful.
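For comparison, here is a minimal sketch of how a multi-step wrangling chain could look in pandas using method chaining. The toy data, the column names (customer_id, amount), and the result names (n_orders, total) are assumptions made for illustration; they do not come from the cheat sheet.

```python
import pandas as pd

# Hypothetical toy order data; the column names are assumptions.
orders_df = pd.DataFrame(
    {
        "customer_id": [1, 1, 2, 2, 2, 3],
        "amount": [20.0, 35.0, 10.0, 15.0, 5.0, 50.0],
    }
)

# Each method returns a new object, so the steps can be chained
# top to bottom, roughly mirroring a dplyr pipeline.
customer_summary_df = (
    orders_df
    .query("amount > 5")                     # keep rows, like filter()
    .groupby("customer_id", as_index=False)  # like group_by()
    .agg(
        n_orders=("amount", "count"),        # like summarize(n_orders = n())
        total=("amount", "sum"),             # like summarize(total = sum(amount))
    )
    .sort_values("total", ascending=False)   # like arrange(desc(total))
    .reset_index(drop=True)
)

print(customer_summary_df)
```

Wrapping the chain in parentheses lets each step sit on its own line, which is one common way to keep longer pandas chains readable.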
Key Strengths of Python lie in Production ML
OK, so why is Python great for business? It turns out that its strengths lie in Machine Learning and Production!
We can see that Python has well-developed Production ML-oriented tools:
- Automation – Airflow, Luigi
- Cloud – AWS, Google Cloud, and Azure software development kits
- Machine Learning – ScikitLearn
- Deep Learning and Computer Vision – PyTorch, TensorFlow, MXNet, OpenCV
- NLP – spaCy, NLTK
These production-oriented tools make it easier to work with colleagues who handle cloud and operations as part of a larger IT team, because their tooling is already in Python. There is no need to add R and its extra dependencies to a production system.
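As a rough illustration of what handing a model to such a team can involve, the sketch below fits a small scikit-learn pipeline and serializes it so another Python process could load it. The synthetic data and the choice of logistic regression are assumptions for the example, not a recommendation from the cheat sheet, and a real deployment would add versioning, validation, and monitoring.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real business data set (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Bundle preprocessing and the model so they travel together as one object.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X, y)

# Serialize the fitted pipeline; a production service would load these bytes.
model_bytes = pickle.dumps(model)
restored_model = pickle.loads(model_bytes)

# The restored pipeline scores new rows without access to the training code.
predictions = restored_model.predict(X[:5])
print(predictions)
```

Because the scaler and the classifier are saved as a single pipeline, the service that loads it does not need to repeat the preprocessing by hand.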
Python Overall
If you can get over the Pandas learning curve, then Python becomes a great tool. Most IT teams know Python, so your code will fit right into their workflow. Just realize that you may be 3X to 5X less productive at Research than your R counterparts due to the tidyverse boost.
Which Language Should You Learn?
The decision can be challenging because both Python and R have clear strengths.
- R is exceptional for Research – Making visualizations, telling the story, producing reports, and making MVP apps with Shiny. From concept (idea) to execution (code), R users tend to be able to accomplish these tasks 3X to 5X faster than Python users, making them very productive for research.
- Python is exceptional for Production ML – Integrating machine learning models into production systems where your IT infrastructure relies on automation tools like Airflow or Luigi.
Why Not Learn Both Python and R?
One thing I haven’t mentioned is that I’m building a course that teaches Python from an R user’s perspective. The core idea is that Python can be a tremendous asset, and being able to use tools like R’s reticulate to communicate between R and Python can make you a real asset to a data science team. Join the R/Python Teams course waitlist.
This waitlist is for:
- People that want to learn the benefits of collaborative R/Python Teams
- R users that want to learn Python
- Python users that want to learn about tools that help R users work with them