How to Scrape PDF Text and Summarize It with OpenAI LLMs (in R)

Posted on March 31, 2024 by Business Science in R bloggers

[This article was first published on business-science.io, and kindly contributed to R-bloggers].

Hey guys, welcome back to my R-tips newsletter. Businesses are sitting on a mountain of unstructured data, and the biggest culprit is PDF documents. Today, I’m going to share how to scrape text from PDFs and use OpenAI’s Large Language Models (LLMs) to summarize it in R.

Table of Contents

Here’s what you’re learning today:

How to scrape PDF documents: I’ll explain how to scrape the text from your business’s PDF documents using pdftools.
How I summarize PDFs using OpenAI’s LLMs in R: This will blow your mind.

Get the Code (In the R-Tip 078 Folder)

SPECIAL ANNOUNCEMENT: ChatGPT for Data Scientists Workshop on April 24th

Inside the workshop I’ll share how I built a Machine Learning Powered Production Shiny App with ChatGPT (extends this data analysis to an insane production app):

What: ChatGPT for Data Scientists

When: Wednesday April 24th, 2pm EST

How It Will Help You: Whether you are new to data science or are an expert, ChatGPT is changing the game. There’s a ton of hype. But how can ChatGPT actually help you become a better data scientist and help you stand out in your career? I’ll show you inside my free ChatGPT for Data Scientists workshop.

Price: Does Free sound good?

How To Join: 👉 Register Here

R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?

Here are the links to get set up. 👇

Sign up for our R-Tips Newsletter and get the code.

Businesses are Sitting on Millions of Dollars of Unstructured Data (and they don’t know how to use it)

Fact: 90% of businesses are not using their unstructured data. It’s true. Many companies have no clue how to extract it. And once they extract it, they have no clue how to use it.

We’re going to solve both problems in this R-Tip.

The most common form of unstructured data is text located in PDF documents.

Businesses have 100,000s of PDF documents that contain valuable information.

OpenAI Document Summarization

One of the best use cases of LLMs is document summarization. But how do we get PDF data to OpenAI?

One easy way is in R!

R Tutorial: Scrape PDF Documents and Summarize with OpenAI

This is a simple 2-step process we’ll cover today:

Extract PDF Text: We’ll use pdftools to extract text
Summarize Text with OpenAI’s LLMs: We’ll use httr to connect to OpenAI’s API and summarize our PDF document

Business Objective:

I have set up a PDF document of Meta’s 2024 10-K Financial Statement. We’ll use this document to analyze the risks that Meta reported in their filing (without even reading the document).

This is a massive speedup, and I can ask even more questions beyond just the risks to really understand Meta’s business.

Good questions to ask for this financial case study:

What are the top 3 risks to Meta’s business?
Where does Meta gain most of its revenue?
In which business line is Meta’s revenue growing the most?

Get the PDF and Code

You can get the PDF and Code by joining the R-Tips Newsletter here.

Load the Libraries

Next, load the libraries. Here’s what we’re using today:
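The original code lives in the R-Tip 078 folder. As a minimal sketch, assuming only the two packages named in this post (pdftools and httr) are needed, the setup could look like this:

```r
# Install once if needed:
# install.packages(c("pdftools", "httr"))

library(pdftools)  # extracts text from PDF files
library(httr)      # sends HTTP requests (used for the OpenAI API call)
```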

Step 1: Extract PDF Text

With our project set up and libraries loaded, next I’m extracting the PDF text. It’s very easy to do in 1 line of code with pdftools::pdf_text().

This returns a character vector with one element per page, 147 pages in all for Meta’s 10-K Financial Statement. You can see the text on each page by cycling through text[1], text[2], and so on.
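Here is a rough sketch of the extraction step. The file path and the helper name extract_pdf_text() are hypothetical (they are not from the original code); the only real library call is pdftools::pdf_text().

```r
library(pdftools)

# Thin wrapper around pdftools::pdf_text(); returns one string per page.
# (extract_pdf_text is a hypothetical helper name used for illustration.)
extract_pdf_text <- function(path) {
  stopifnot(file.exists(path))
  pdf_text(path)
}

# Hypothetical path: replace with the location of your own PDF.
pdf_path <- "pdf/meta_10k.pdf"

if (file.exists(pdf_path)) {
  text <- extract_pdf_text(pdf_path)
  print(length(text))            # number of pages in the document
  cat(substr(text[1], 1, 500))   # preview the start of page 1
}
```

The wrapper only adds a file check; calling pdf_text() directly on the path works the same way.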

Step 2: Summarize the PDF Document with OpenAI LLMs

A common task: I want to know what risks Meta has identified in their 10-K Financial Statement. Disclosing these risks is required by the SEC. But, I don’t want to have to dig through the document.

The solution is to use OpenAI to summarize the document.

We will summarize only the first 30,000 characters of the document. There are more advanced approaches, such as building a vector store, but I’ll save that for a follow-up post.
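One simple way to keep just the first 30,000 characters is to collapse the pages into a single string and cut it with substr(). This is a sketch under that assumption; truncate_text() is a hypothetical helper name, and the toy input stands in for the output of pdf_text().

```r
# Collapse the per-page character vector into one string, then keep
# only the first `max_chars` characters.
truncate_text <- function(pages, max_chars = 30000) {
  full_text <- paste(pages, collapse = "\n")
  substr(full_text, 1, max_chars)
}

# Toy input standing in for the output of pdftools::pdf_text()
pages <- c("Page one text.", "Page two text.")
short_text <- truncate_text(pages, max_chars = 20)
nchar(short_text)  # 20
```

Cutting by character count is crude (it can stop mid-sentence), but it is enough for a first pass at summarization.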

Run this code to set up OpenAI and our prompt:

Note that I have my OpenAI API key set up. I’m not going to dive into all of that. OpenAI has great documentation to set it up.
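As a sketch of this setup step, the snippet below assumes the API key is stored in an environment variable named OPENAI_API_KEY (for example in ~/.Renviron) and uses a hypothetical helper, build_prompt(), to combine the question with the document text. The prompt wording is illustrative, not the original.

```r
# Read the key from an environment variable (assumed name: OPENAI_API_KEY)
# rather than hard-coding it in the script.
api_key <- Sys.getenv("OPENAI_API_KEY")

# Combine a question with the document text into a single prompt string.
# (build_prompt is a hypothetical helper; the wording is illustrative.)
build_prompt <- function(question, document_text) {
  paste0(
    question,
    "\n\nAnswer using only the document text below.\n\n",
    document_text
  )
}

prompt <- build_prompt(
  question      = "What are the top 3 risks to Meta's business?",
  document_text = "<first 30,000 characters of the 10-K go here>"
)
```

Swapping in a different question (revenue sources, fastest-growing business line) only requires changing the question argument.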

Run this code to send the text and get OpenAI’s response

I’m using httr to send a POST request to OpenAI’s API. Then OpenAI provides a response with the answer to my question in the context of the text I provided it.
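A request along these lines could be sketched as follows. The endpoint is OpenAI's Chat Completions URL; the model name, the helper names build_chat_body() and ask_openai(), and the OPENAI_API_KEY environment variable are assumptions for illustration, not the original code.

```r
library(httr)

# Build the request body for the Chat Completions endpoint.
# The default model name is an assumption; use whichever model you have access to.
build_chat_body <- function(prompt, model = "gpt-4o-mini") {
  list(
    model = model,
    messages = list(
      list(role = "user", content = prompt)
    )
  )
}

# Send the prompt to OpenAI and return the raw httr response object.
ask_openai <- function(prompt,
                       api_key = Sys.getenv("OPENAI_API_KEY"),
                       model = "gpt-4o-mini") {
  stopifnot(nzchar(api_key))  # fail early if no key is configured
  response <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    body = build_chat_body(prompt, model),
    encode = "json"           # httr serializes the list to JSON
  )
  stop_for_status(response)   # turn HTTP errors into R errors
  response
}

# Usage (needs a valid key and network access):
# response <- ask_openai("Summarize the top 3 risks in the text: ...")
```

Keeping the body construction in its own function makes it easy to inspect what will be sent before spending any API credits.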

Run this Code to Parse the OpenAI Response

In just a couple seconds, I have a response from OpenAI’s API. Run this code to parse the response.
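For parsing, a Chat Completions response keeps the model's reply under choices, then message, then content. The sketch below uses a mock list with illustrative values (not real model output) so it runs without an API call; extract_answer() is a hypothetical helper name.

```r
# Pull the assistant's reply out of a parsed Chat Completions response.
extract_answer <- function(parsed) {
  parsed$choices[[1]]$message$content
}

# Mock of the parsed response structure (illustrative values, not real output)
mock_parsed <- list(
  choices = list(
    list(message = list(role = "assistant", content = "1. Example risk A ..."))
  )
)
cat(extract_answer(mock_parsed))

# With a real httr response object:
# parsed <- httr::content(response, as = "parsed", type = "application/json")
# cat(extract_answer(parsed))
```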

Review the Response

Last, we can review the response from OpenAI’s Chat API. We can see that the top 3 risks are:

Regulatory Compliance
User Privacy and Trust Issues
Competition and Innovation Risks

Conclusions:

You’ve learned my secret 2-step process for scraping PDF documents and using LLMs like OpenAI’s Chat API to summarize text data in R. But there’s a lot more to becoming an elite data scientist.

If you are struggling to become a Data Scientist for Business, then please read on…

Struggling to become a data scientist?

You know the feeling. Being unhappy with your current job.

Promotions aren’t happening. You’re stuck. Feeling Hopeless. Confused…

And you’re praying that the next job interview will go better than the last 12…

… But you know it won’t. Not unless you take control of your career.

The good news is…

I Can Help You Speed It Up.

I’ve helped 6,107+ students learn data science for business from an elite business consultant’s perspective.

I’ve worked with Fortune 500 companies like S&P Global, Apple, MRM McCann, and more.

And I built a training program that gets my students life-changing data science careers (don’t believe me? see my testimonials here):

6-Figure Data Science Job at CVS Health ($125K)
Senior VP Of Analytics At JP Morgan ($200K)
50%+ Raises & Promotions ($150K)
Lead Data Scientist at Northwestern Mutual ($175K)
2X-ed Salary (From $60K to $120K)
2 Competing ML Job Offers ($150K)
Promotion to Lead Data Scientist ($175K)
Data Scientist Job at Verizon ($125K+)
Data Scientist Job at CitiBank ($100K + Bonus)

Whenever you are ready, here’s the system they are taking:

Here’s the system that has gotten aspiring data scientists, career transitioners, and life long learners data science jobs and promotions…

Join My 5-Course R-Track Program Now!
(And Become The Data Scientist You Were Meant To Be…)

P.S. – Samantha landed her NEW Data Science R Developer job at CVS Health (Fortune 500). This could be you.
