Shockingly-fast data manipulation in R with polars

[This article was first published on business-science.io and kindly contributed to R-bloggers.]

Hey guys, welcome back to my R-tips newsletter. Polars is NOW available in R! Yes, the shockingly-fast data manipulation library built on top of Rust is now in R. Today, I’m excited to show off some of polars’ capabilities for fast financial and time series analysis. Let’s go!

Table of Contents

Here’s what you’re learning today:

  • What is polars? You’ll discover what polars is and how it accomplishes shockingly-fast data manipulation.
  • Benefits of using polars: Which types of data analysis benefit from polars the most.
  • How to use polars inside of R: I have prepared a full R code tutorial (get the code here).

Polars in R

Get the Code (In the R-Tip 082 Folder)


R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?

This Tutorial is Available in Video (9-minutes)

I have a 9-minute video that walks you through setting up polars in R and running your first financial time series data analysis. 👇

What is Polars?

According to the polars documentation:

The polars package for R gives users access to a lightning fast Data Frame library written in Rust. Polars’ embarrassingly parallel execution, cache efficient algorithms and expressive API makes it perfect for efficient data wrangling, data pipelines, snappy APIs, and much more besides. Polars also supports “streaming mode” for out-of-memory operations. This allows users to analyze datasets many times larger than RAM.

Lightning-Fast Data Frame Library Written in Rust

The key here is that, under the hood, both the R and Python implementations of polars use the hyper-scalable and blazingly fast Rust library. Key aspects of Rust include:

  1. Memory Safety: Rust ensures memory safety without needing a garbage collector. This is achieved through its ownership system, which enforces strict rules on how memory is managed.

  2. Concurrency: Rust is designed to make it easy to write concurrent programs. The language’s ownership system helps prevent data races, which are a common problem in concurrent programming.

  3. Zero-cost Abstractions: Rust aims to provide high-level abstractions without the cost typically associated with them in terms of performance. This allows developers to write efficient code without sacrificing readability.

  4. Performance: Rust’s performance is comparable to C and C++ due to its focus on low-level control over system resources.

  5. Tooling: Rust comes with a powerful set of tools, including cargo (the package manager and build system), rustc (the Rust compiler), and rustfmt (a code formatting tool).

Rust in a Nutshell

Rust is fast. Its design is focused on parallel processing. Because of that, polars is fast, parallel, lazy (in a good way), and really good for most data operations.
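To make "lazy" concrete: in lazy mode, polars records the steps of a query and runs nothing until you ask for the result. Here is a minimal sketch, assuming the polars package is installed and that `$lazy()`, `$filter()`, and `$collect()` behave as in recent releases. The symbols and prices are made up for illustration.

```r
library(polars)

# Hypothetical toy data; symbols and prices are invented for illustration
df <- pl$DataFrame(
  symbol = c("AAA", "AAA", "BBB", "BBB"),
  price  = c(10, 12, 20, 18)
)

# Build a lazy query: this only describes the work, nothing is computed yet
query <- df$lazy()$filter(pl$col("price") > 11)

# collect() runs the query and returns a polars DataFrame,
# which we convert to a base R data.frame for inspection
result <- as.data.frame(query$collect())
print(result)
```

On a four-row table the difference is invisible, but the same pattern lets polars plan an entire pipeline before it touches a large dataset.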

Which Data Manipulations is Polars Good For?

I’ve been testing out polars for quite a while in both Python and R.

For background, about a year ago I began work on pytimetk, which replicates many of the R timetk package’s time series analysis features in Python. For that project, our team has used a polars engine internally for many time series operations that are known to be resource-intensive.

Polars vs Pandas: Speed Comparison and Performance Test Results:

We’ve published our performance results here.

Polars Beats Pandas

  1. Rolling Operations: Polars can be 10X to 3500X faster than Pandas

  2. Expanding Operations: 3X to 500X Faster

  3. Aggregations (Summarizations): 13X Faster

The bottom line is that Polars is fast compared with Pandas. It’s especially good for grouped time series operations, including rolling, expanding, and aggregating operations.
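If the three operation types are new to you, here is a small base R sketch of what each one computes. It uses no polars at all. The helper `rolling_mean()` and the toy prices are hypothetical and only show the arithmetic that the benchmarks are timing.

```r
# Hypothetical toy prices for two made-up symbols
prices  <- c(10, 12, 11, 13, 15)
symbols <- c("AAA", "AAA", "AAA", "BBB", "BBB")

# Rolling mean: average of the last n values (NA until n values are available)
rolling_mean <- function(x, n) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- mean(x[(i - n + 1):i])
  }
  out
}
roll_3 <- rolling_mean(prices, 3)

# Expanding mean: average of everything seen so far
expanding <- cumsum(prices) / seq_along(prices)

# Aggregation (summarization): one value per group
group_means <- tapply(prices, symbols, mean)

print(roll_3)
print(expanding)
print(group_means)
```

The rolling version slides a fixed-size window along the series, the expanding version grows the window from the first observation, and the aggregation collapses each group to a single number.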

I expect Polars in R to be faster than dplyr. However, I have not run similar tests (yet).

Tutorial: How to use Polars inside of R

It takes about 30 seconds to get polars set up so you can start using shockingly-fast data manipulation inside of R. All the tutorial code shown is available in the R-Tips Newsletter folder for R-Tip 082.

Step 1 – Install polars:

The first step is to set up polars. Polars is not on CRAN as of the writing of this article, but it’s simple to install from the R-multiverse community repository (r-multiverse.org).

Run this line of code:

install.packages("polars", repos = "https://community.r-multiverse.org")

Step 2 – Load the Libraries and Data

Once polars is installed, load the libraries and data with this code.

Libraries and Data
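A minimal sketch of this step, assuming the polars package is installed. Because stock_data.csv ships with the R-Tip 082 folder, the sketch writes a tiny stand-in file first. The tickers "AAA" and "BBB" and their prices are hypothetical, and the real file has 25 stock columns.

```r
library(polars)

# Write a tiny stand-in for stock_data.csv (hypothetical tickers and prices)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "Date,AAA,BBB",
    "2024-01-02,10.0,20.0",
    "2024-01-03,10.5,19.5",
    "2024-01-04,11.0,19.0"
  ),
  csv_path
)

# Read the CSV into a polars DataFrame
stock_data <- pl$read_csv(csv_path)

# Printing shows the shape and each column's data type
print(stock_data)
```

With the real data you would point `pl$read_csv()` at the path of stock_data.csv instead of the temporary file.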

Here’s the stock_data.csv once it’s read with pl$read_csv(). A few key points about the Polars Data Frame Structure:

  • Shape of the data is shown at the top.
  • Some columns and rows will not be shown when printed to the screen (identified with …)
  • The “Date” column is a str data type
  • The stocks (25 total) are f64 data type (float 64)

Stock Data

Step 3 – Pivot to Long Format for Grouped Data Analysis

The next step is to get the data into a format so we can begin to do grouped analysis. Use the unpivot() function to go from wide-to-long format:

Wide to Long
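A minimal sketch of the wide-to-long step, assuming a polars release where `unpivot()` takes `index`, `variable_name`, and `value_name` arguments (older releases called this operation `melt()`). The two tickers and the output column names "symbol" and "price" are hypothetical choices for illustration.

```r
library(polars)

# Hypothetical wide data: one row per date, one column per stock
stock_wide <- pl$DataFrame(
  Date = c("2024-01-02", "2024-01-03"),
  AAA  = c(10.0, 10.5),
  BBB  = c(20.0, 19.5)
)

# Keep Date as the identifier and stack every other column into
# a (symbol, price) pair, giving one row per date per stock
stock_long <- stock_wide$unpivot(
  index         = "Date",
  variable_name = "symbol",
  value_name    = "price"
)

print(stock_long)
```

Every column not listed in `index` is stacked, so the same call should handle all 25 stock columns without naming them.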

The transformation was done shockingly-fast. This is what the long format looks like:

Long Format Stock Data

To visualize the data, run this code:

Visualize Stock Data

Step 4 – Moving Averages with Polars’ Rolling Mean

The last step we’ll cover is how to perform moving averages using polars’ rolling mean functionality. This is one of the biggest benefits of using Polars.

Run this code to perform a 10-day and 50-day moving average over each of the 25 stocks:

Rolling Mean
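A minimal sketch of grouped moving averages, assuming `rolling_mean()` accepts a fixed `window_size` and that `over()` restricts the calculation to each symbol, as in recent polars releases. The synthetic prices and the column names "symbol", "price", "ma_10", and "ma_50" are hypothetical stand-ins for the long-format stock data.

```r
library(polars)

# Synthetic long-format data: 60 days for each of two made-up stocks
n_days <- 60
stock_long <- pl$DataFrame(
  date   = rep(seq(as.Date("2024-01-01"), by = "day", length.out = n_days), times = 2),
  symbol = rep(c("AAA", "BBB"), each = n_days),
  price  = as.numeric(c(1:n_days, n_days:1))
)

# Add 10-day and 50-day moving averages, computed separately per symbol
stock_ma <- stock_long$with_columns(
  pl$col("price")$rolling_mean(window_size = 10)$over("symbol")$alias("ma_10"),
  pl$col("price")$rolling_mean(window_size = 50)$over("symbol")$alias("ma_50")
)

print(stock_ma)
```

The `over("symbol")` part is what keeps one stock's window from running into the next stock's prices. The first 9 rows of each symbol have no 10-day average because the window is not yet full.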

Again, the performance is undeniable. In milliseconds, the rolling calculations are complete.

Run this code to visualize the result:

Visualize Moving Averages

We can quickly see which stocks have momentum from the 10-day and 50-day moving averages (those with red lines above the green lines).

Reminder: The code is available free inside R-tips

All of the code you saw today is available in the R-Tips Newsletter folder for R-Tip 082.

Conclusions:

Polars is one of those libraries that is quickly becoming a standard in the Python ecosystem. I’m glad to see that R is getting the same treatment. It’s simply the fastest data manipulation library I’ve come across. And I’ve tried them all.
