[This article was first published on business-science.io, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Hey guys, welcome back to my R-tips newsletter. Businesses are sitting on a mountain of unstructured data. The biggest culprit is PDF Documents. Today, I’m going to share how to PDF Scrape text and use OpenAI’s Large Language Models (LLMs) to summarize it in R.

Table of Contents

Here’s what you’re learning today:

  • How to scrape PDF Documents I’ll explain how to scrape the text from your business’s PDF Documents using pdftools.
  • How I summarize PDF’s using the OpenAI LLMs in R. This will blow your mind.

XGBoost R Code

Get the Code (In the R-Tip 078 Folder)

SPECIAL ANNOUNCEMENT: ChatGPT for Data Scientists Workshop on April 24th

Inside the workshop I’ll share how I built a Machine Learning Powered Production Shiny App with ChatGPT (extends this data analysis to an insane production app):

ChatGPT for Data Scientists

What: ChatGPT for Data Scientists

When: Wednesday April 24th, 2pm EST

How It Will Help You: Whether you are new to data science or are an expert, ChatGPT is changing the game. There’s a ton of hype. But how can ChatGPT actually help you become a better data scientist and help you stand out in your career? I’ll show you inside my free chatgpt for data scientists workshop.

Price: Does Free sound good?

How To Join: 👉 Register Here

R-Tips Weekly

This article is part of R-Tips Weekly, a weekly video tutorial that shows you step-by-step how to do common R coding tasks. Pretty cool, right?

Here are the links to get set up. 👇

Businesses are Sitting on $1,000,000 of Dollars of Unstructured Data (and they don’t know how to use it)

Fact: 90% of businesses are not using their unstructured data. It’s true. Many companies have no clue how to extract it. And once they extract it, they have no clue how to use it.

We’re going to solve both problems in this R-Tip.

The most common form is text located in PDF documents.

Businesses have 100,000s of PDF documents that contain valuable information.

PDF Data

OpenAI Document Summarization

One of the best use cases of LLMs is document summarization. But how do we get PDF data to OpenAI?

One easy way is in R!

R Tutorial: Scrape PDF Documents and Summarize with OpenAI

This is a simple 2 step process we’ll cover today:

  1. Extract PDF Text: We’ll use pdftools to extract text
  2. Summarize Text with OpenAI’s LLMs: We’ll use httr to connect to OpenAI’s API and summarize our PDF document

Business Objective:

I have set up a PDF document of Meta’s 2024 10K Financial Statement. We’ll use this document to analyze the risks that Meta reported in their filing (without even reading the document).

This is a massive speed up – and I can ask even more questions too beyond just the risks to really understand Meta’s business.

Good questions to ask for this financial case study:

  1. What are the top 3 risks to Meta’s business
  2. Where does Meta gain most of it’s revenue?
  3. In which business line is Meta’s revenue growing the most?

PDF Data

Get the PDF and Code

You can get the PDF and Code by joining the R-Tips Newsletter here.

T-Tip 078 Folder

Get the PDF and Code (In the R-Tip 078 Folder)

Load the Libraries

Next, load the libraries. Here’s what we’re using today:

Load Libraries

Get the PDF and Code (In the R-Tip 078 Folder)

Step 1: Extract PDF Text

With our project set up and libraries loaded, next I’m extracting the PDF text. It’s very easy to do in 1 line of code with pdftools::pdf_text().

Extract PDF Text

Get the PDF and Code (In the R-Tip 078 Folder)

This returns a list of text for 147 pages in Meta’s 10K Financial Statement. You can see the text on each page by cycling through text[1], text[2] and so on.

Step 2: Summarize the PDF Document with OpenAI LLMs

A common task: I want to know what risks Meta has identified in their 10K Financial Statement. This is required by the SEC. But, I don’t want to have to dig through the document.

The solution is to use OpenAI to summarize the document.

We will just summarize the first 30,000 characters in the document. There are more advanced ways to create a vector storage, but I’ll save that for a follow up post.

Run this code to set up OpenAI and our prompt:

Note that I have my OpenAI API key set up. I’m not going to dive into all of that. OpenAI has great documentation to set it up.

OpenAI Prompt Set Up

Get the PDF and Code (In the R-Tip 078 Folder)

Run this code to send the text and get OpenAI’s response

I’m using httr to send a POST request to OpenAI’s API. Then OpenAI provides a response with the answer to my question in the context of the text I provided it.

Connect to OpenAI API

Get the PDF and Code (In the R-Tip 078 Folder)

Run this Code to Parse the OpenAI Response

In just a couple seconds, I have a response from OpenAI’s API. Run this code to parse the response.

Parse OpenAI API Resposne

Get the PDF and Code (In the R-Tip 078 Folder)

Review the Response

Last, we can review the response from OpenAI’s Chat API. We can see that the top 3 risks are:

  1. Regulatory Compliance
  2. User Privacy and Trust Issues
  3. Competition and Innovation Risks

OpenAI Chat API Response


You’ve learned my secret 2 step process for PDF Scraping documents and using LLM’s like OpenAI’s Chat API to summarize text data in R. But there’s a lot more to becoming an elite data scientist.

If you are struggling to become a Data Scientist for Business, then please read on…

Struggling to become a data scientist?

You know the feeling. Being unhappy with your current job.

Promotions aren’t happening. You’re stuck. Feeling Hopeless. Confused…

And you’re praying that the next job interview will go better than the last 12…

… But you know it won’t. Not unless you take control of your career.

The good news is…

I Can Help You Speed It Up.

I’ve helped 6,107+ students learn data science for business from an elite business consultant’s perspective.

I’ve worked with Fortune 500 companies like S&P Global, Apple, MRM McCann, and more.

And I built a training program that gets my students life-changing data science careers (don’t believe me? see my testimonials here):

6-Figure Data Science Job at CVS Health ($125K)

Senior VP Of Analytics At JP Morgan ($200K)

50%+ Raises & Promotions ($150K)

Lead Data Scientist at Northwestern Mutual ($175K)

2X-ed Salary (From $60K to $120K)

2 Competing ML Job Offers ($150K)

Promotion to Lead Data Scientist ($175K)

Data Scientist Job at Verizon ($125K+)

Data Scientist Job at CitiBank ($100K + Bonus)

Whenever you are ready, here’s the system they are taking:

Here’s the system that has gotten aspiring data scientists, career transitioners, and life long learners data science jobs and promotions…

What They're Doing - 5 Course R-Track

Join My 5-Course R-Track Program Now!
(And Become The Data Scientist You Were Meant To Be…)

P.S. – Samantha landed her NEW Data Science R Developer job at CVS Health (Fortune 500). This could be you.

Success Samantha Got The Job

To leave a comment for the author, please follow the link and comment on their blog: business-science.io.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you’re looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don’t.

Continue reading: How to Scrape PDF Text and Summarize It with OpenAI LLMs (in R)

Impact of Unstructured Data Extraction and Summarization Techniques

Businesses today are sitting on a gold mine of unstructured data, primarily in the form of PDF documents. However, a large majority struggle in extracting and making meaningful use of this data. Techniques such as OpenAI’s Large Language Models (LLMs) for summarizing PDF data in R have opened new avenues to counter this challenge. Going forward, the value of this wealth of unstructured data can be unleashed with better applications of these techniques.

Future Developments

The current trend points towards a future where businesses will rely more on automated data extraction and summarization tools. Potentially, these techniques can revolutionize how businesses handle large volumes of unstructured information. It can lead to faster decision-making processes and improved understanding of critical business aspects such as risk management.

Automated Risk Analysis

For instance, businesses can implement LLMs to conduct automated financial risk analysis. By analyzing the risks identified by companies in their 10K Financial Statements, these models can provide summaries of top risks, revenue sources, and fastest-growing business lines, thereby enhancing strategic decision-making. As more businesses incorporate this technology, newer applications will surface creating a ripple effect in the industry.

Actionable Advice

Considering these long-term implications and future developments, it is advisable for businesses to invest in technologies and skills relating to data extraction and summarization using techniques like pdftools and OpenAI’s LLMs. This will not only reveal the hidden value in their unstructured data but also enhance their competitiveness in the market.

For Businesses

  1. Invest in Training: Organizations should consider training their teams in data extraction and summarization techniques. This will help to unlock the potential in their unstructured PDF data.
  2. Adopt Automation: With advancements in data extraction and summarization tools, it is important to integrate these into the workflow for efficient data management.

For Individuals

  1. Learn R: As the tutorial suggests, learning R, and in particular the application of OpenAI’s LLMs and pdftools in R, can be a valuable asset for anybody dealing with unstructured data.
  2. Adopt a Data Scientist Mindset: It is crucial to approach these tools from the perspective of a data scientist. By asking the right questions, you can make the most out of the unstructured data at your disposal.

Read the original article