Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The two major data science languages, Python and R, have historically taken separate paths when it comes to where data scientists write their code. The R language has the RStudio IDE, a great IDE for data science because of its feature-rich setup for efficiently developing analyses. The Python language has the Jupyter Notebook (and more recently JupyterLab), which provides a web-based notebook. Most data scientists write their code in separate places: Python is written in Jupyter Notebooks, and R is written in the RStudio IDE. Until now. RStudio is making the case for a powerful multi-language IDE designed for data science.
The RStudio Version 1.2 Preview Edition comes with support for Python and several other data science languages including SQL and Stan. With this release, RStudio is making a case for a powerful, all-in-one R + Python Data Science IDE.
Let’s take a look at how the Python integration works.
Summary of RStudio Python Review
With the rollout of the Python integration, a major new feature in RStudio, we did a product review of the RStudio IDE from the perspective of a data scientist using Python. We created a script file (.py file) and worked interactively with the RStudio IDE’s console, help documentation, and plotting panel, performing the basic operations that a data scientist does frequently.
Here’s what we liked about the new RStudio IDE Python integration:
- Sending code to the Console is fast. Scripting can now be done efficiently with CTRL + Enter sending code to the Console.
- Tabbed autocompletion works well. Directory paths, function completion, even function arguments are supported.
- Help documentation shows up in the Help Window. This is super useful so I don’t have to scroll away from my code to see the help documentation and function examples.
- Plots show up in the Plots window. This is actually a seaborn plot in the RStudio IDE lower right quadrant!
Summary of RStudio IDE Python Integration
Contents
- Get the Data – We used the MovieLens 1M Data Set
- YouTube Video Walkthrough – 6 Minute Python Tutorial in the RStudio IDE
- Python Integration Review – MovieLens 1M Data Set – In-depth walkthrough using pandas, numpy, matplotlib, and seaborn
Product Review
Get the Data
The data that we’ll be using to test out the Python functionality comes from the book “Python for Data Analysis” by Wes McKinney (creator of pandas). Here’s the GitHub Repo where you can download the pydata-book materials.
Video Python + RStudio IDE Review
Here’s a quick video review using Python in the RStudio IDE.
Python Integration Review – MovieLens 1M Data Set
With the new Preview Version 1.2 of the RStudio IDE, we can work with Python seamlessly. We’ll take a test spin using the MovieLens 1M Data Set. Let’s go.
Importing Libraries
For this walkthrough, we’ll import 4 libraries:
- pandas – Data manipulation library for Python
- numpy – High-performance numerical computing library
- matplotlib – Visualization library
- seaborn – Augments matplotlib
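The imports might look like the following. This is a minimal sketch that assumes the conventional aliases (pd, np, plt, sns); the exact import lines from the original script are not reproduced here.

```python
# Conventional aliases, assumed here rather than taken from the original script
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```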
Bonus No. 1
This brings us to our first bonus: the Python script supports code completion with the TAB key, just like an R script.
Tabbed Code Completion of Available Libraries
Importing Data
Next, we can read the “MovieLens” data set, which consists of 3 tables:
- movies.dat: Movie information
- ratings.dat: Ratings information for each combination of movie and user
- users.dat: User information such as gender, age, occupation, and zipcode
The users.dat file is read using the pd.read_table() function.
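Here is a rough sketch of that read. The MovieLens 1M files are "::"-delimited with no header row, so column names have to be supplied; the names below are our assumption, following the field order in the MovieLens README. To keep the example runnable without the download, two made-up rows stand in for the real file.

```python
import io
import pandas as pd

# Column names are an assumption, following the MovieLens README field order
unames = ["user_id", "gender", "age", "occupation", "zip"]

# Two made-up rows in the "::"-delimited layout, standing in for users.dat
sample = io.StringIO("1::F::1::10::48067\n2::M::56::16::70072\n")

# A multi-character separator requires the Python parsing engine.
# For the real data, pass the path to users.dat instead of `sample`.
users = pd.read_table(sample, sep="::", header=None, names=unames, engine="python")

print(users.head())
```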
We can import the remaining tables, ratings.dat and movies.dat, in the same way.
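A sketch of those two reads is below. As before, the column names are assumptions based on the MovieLens README, and a few made-up rows stand in for the real files.

```python
import io
import pandas as pd

# Column names are assumptions based on the MovieLens README
rnames = ["user_id", "movie_id", "rating", "timestamp"]
mnames = ["movie_id", "title", "genres"]

# Made-up rows standing in for ratings.dat and movies.dat
ratings_src = io.StringIO("1::1::5::978300760\n2::1::4::978298413\n")
movies_src = io.StringIO("1::Toy Story (1995)::Animation|Children's|Comedy\n")

# For the real data, pass the file paths instead of the StringIO objects
ratings = pd.read_table(ratings_src, sep="::", header=None, names=rnames, engine="python")
movies = pd.read_table(movies_src, sep="::", header=None, names=mnames, engine="python")

print(ratings.head())
print(movies.head())
```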
Bonus No. 2
This brings us to our second bonus: the Python script supports code completion for functions and their arguments. Again, just like good ole RStudio!
Tabbed Code Completion of Available Functions and Arguments
Merging Data
Next, we’ll merge the data using pd.merge(). We’ll inspect the first 5 lines and the shape of the data.
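The merge step can be sketched as follows. The three tiny tables here are made up for illustration, and the column names (user_id, movie_id) are assumptions; with no key given, pd.merge() joins on the columns the two tables share.

```python
import pandas as pd

# Tiny made-up tables standing in for the three MovieLens files
users = pd.DataFrame({"user_id": [1, 2], "gender": ["F", "M"]})
ratings = pd.DataFrame({"user_id": [1, 2, 2], "movie_id": [10, 10, 20], "rating": [5, 4, 3]})
movies = pd.DataFrame({"movie_id": [10, 20], "title": ["Movie A", "Movie B"]})

# With no `on` argument, pd.merge() joins on the shared column names:
# user_id for ratings + users, then movie_id for the result + movies
data = pd.merge(pd.merge(ratings, users), movies)

print(data.head())  # first 5 rows
print(data.shape)   # (number of rows, number of columns)
```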
Pivot Table
With the data merged, we can begin to do some analysis. We’ll check out the pd.pivot_table() function.
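One way to use it, assuming we want the mean rating for each title split by gender (which is what the gender comparison later in the walkthrough relies on), is sketched here with made-up data:

```python
import pandas as pd

# Made-up merged data; the column names are assumptions
data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B", "Movie B"],
    "gender": ["F", "M", "M", "F", "M"],
    "rating": [5, 4, 2, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating
mean_ratings = pd.pivot_table(
    data, values="rating", index="title", columns="gender", aggfunc="mean"
)

print(mean_ratings)
```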
Bonus No. 3
Our third bonus is the help documentation, which shows up right where it should – in the Help Window.
Help Documentation in the Help Panel
Measuring Rating Disagreement
We can do a bit of data manipulation to measure ratings disagreement for the most active titles. First, we need to assess which titles are the most active. We can group by “title” and use the size() function to count how many appearances each title makes in the data.
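A minimal sketch of the counting step, using a made-up stand-in for the merged data:

```python
import pandas as pd

# Made-up merged data: one row per rating
data = pd.DataFrame({
    "title": ["Movie A", "Movie A", "Movie A", "Movie B"],
    "rating": [5, 4, 2, 3],
})

# size() returns a Series of row counts indexed by title
ratings_by_title = data.groupby("title").size()

print(ratings_by_title)
```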
Next, we get the index of all titles that have more than 250 ratings. Using that index, we can filter down to just the titles with more than 250 ratings.
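These two steps might look like the sketch below. The counts and mean ratings are made up, and the threshold is lowered from 250 to 2 so that the toy data has something to keep; the variable names are ours.

```python
import pandas as pd

# Made-up rating counts and mean ratings per title
ratings_by_title = pd.Series({"Movie A": 3, "Movie B": 1, "Movie C": 4})
mean_ratings = pd.DataFrame(
    {"F": [5.0, 3.0, 4.0], "M": [3.0, 1.0, 4.5]},
    index=["Movie A", "Movie B", "Movie C"],
)

min_ratings = 2  # the walkthrough uses 250 on the full data set

# Index of titles whose rating count exceeds the threshold
active_titles = ratings_by_title.index[ratings_by_title > min_ratings]

# Keep only the rows for those titles
mean_ratings_active = mean_ratings.loc[active_titles]

print(mean_ratings_active)
```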
We can get the difference between the genders and assess the greatest differences. We’ll also add the absolute difference using np.abs().
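As a sketch, assuming a table of mean ratings with one column per gender (the values and the column names diff and abs_diff are made up for illustration):

```python
import numpy as np
import pandas as pd

# Made-up mean ratings per title and gender
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 4.0], "M": [3.0, 3.5, 4.0]},
    index=["Movie A", "Movie B", "Movie C"],
)

# Signed difference: positive means rated higher by M, negative by F
mean_ratings["diff"] = mean_ratings["M"] - mean_ratings["F"]

# Magnitude of the disagreement, regardless of direction
mean_ratings["abs_diff"] = np.abs(mean_ratings["diff"])

print(mean_ratings)
```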
Visualizing Rating Disagreement
We can plot a scatter plot emphasizing the magnitude of the disagreement.
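The exact plotting call from the original script is not reproduced here. One option, offered as an assumption, is seaborn's scatterplot() with both point size and color mapped to the absolute difference, shown below on made-up data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up mean ratings per title and gender
mean_ratings = pd.DataFrame(
    {"F": [4.5, 3.0, 4.0], "M": [3.0, 3.5, 4.0]},
    index=["Movie A", "Movie B", "Movie C"],
)
mean_ratings["abs_diff"] = (mean_ratings["M"] - mean_ratings["F"]).abs()

# Point size and color both scale with the absolute difference,
# so the titles with the most disagreement stand out
ax = sns.scatterplot(data=mean_ratings, x="F", y="M", size="abs_diff", hue="abs_diff")
ax.set(xlabel="Mean rating (F)", ylabel="Mean rating (M)")

# plt.show()  # uncomment when running interactively to render the plot
```

When the script is run interactively, calling plt.show() afterwards renders the figure.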
And finally, we can inspect the top and bottom movies to see which have the highest disagreement.
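A sketch of that inspection, assuming a diff column holding the M minus F difference in mean rating (values made up):

```python
import pandas as pd

# Made-up signed differences in mean rating (M minus F) per title
mean_ratings = pd.DataFrame(
    {"diff": [-1.5, 0.5, 0.0, 1.2]},
    index=["Movie A", "Movie B", "Movie C", "Movie D"],
)

# Ascending sort: most F-preferred titles first, most M-preferred last
sorted_by_diff = mean_ratings.sort_values(by="diff")

print(sorted_by_diff.head(2))  # titles rated higher by F
print(sorted_by_diff.tail(2))  # titles rated higher by M
```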
Bonus No. 4
Our final bonus is that the seaborn plot shows up exactly where it should: in the Plots Panel.
Plots show up in the Plots Panel
Conclusion
We are stoked about the prospect of the RStudio IDE supporting Python. In its current state, RStudio has an amazing level of integrated support for Python. Knowing RStudio, the features will only continue to improve. We are looking forward to seeing the RStudio IDE develop into a premier data science IDE supporting all relevant languages in the data science ecosystem.
Additional Information
We ran Python version 3.6 for this code tutorial. Note that the YouTube Video uses Python version 2.7.