How to Extract Text From a Scanned PDF (OCR) — and Use It

A practical guide to extracting text from a scanned PDF with OCR — the free tools, Adobe, accuracy gotchas, and how to actually read and question the result.

Lewis Hadden · 7 min read

A scanned PDF looks like a document, but to a computer it's just a stack of pictures. Select a sentence and nothing highlights; search for a word and you get no results. To do anything useful — copy a quote, search a contract, ask a question — you first need to turn those images back into real, selectable text. That's what OCR does, and learning how to extract text from a scanned PDF is the difference between a file you can only look at and one you can actually work with.

This guide covers what OCR is, the practical options (free tools and Adobe), the accuracy traps nobody warns you about, and the important distinction between merely converting a scanned PDF to text and actually reading and questioning it.

What OCR actually does

OCR — optical character recognition — scans an image, finds the shapes that look like letters, and reconstructs the underlying text. A scanned PDF is a photo of a page; OCR reads that photo and produces a text layer you can select, copy, and search.

There are two outcomes worth distinguishing:

  • A searchable PDF — the original page image stays exactly as it looks, but an invisible text layer is added underneath. The page looks identical; now you can highlight and search it.
  • Extracted plain text — the words are pulled out into a .txt, .docx, or similar, losing the original layout but giving you raw, editable text.

Most of the time you want the first one: the document still looks right, and the text is there when you need it.

The options, from free to paid

Free, built-in tools

You may already have OCR without installing anything:

  • Google Drive — upload the PDF, then open it with Google Docs. Drive runs OCR automatically and drops the extracted text below a copy of each page image. Rough on layout, but free and surprisingly capable.
  • Microsoft OneNote — paste or insert the image/PDF page, right-click, and choose "Copy Text from Picture."
  • macOS Preview / Live Text and modern Windows increasingly let you select text directly inside images, which covers quick one-off grabs.

Free, open-source

Tesseract is the long-standing open-source OCR engine. It's excellent and supports more than 100 languages, but it's a command-line tool that takes images as input rather than PDFs. The friendlier route is OCRmyPDF, also run from the command line, which wraps Tesseract and adds an invisible text layer to an existing PDF without changing how it looks. That makes it ideal for making an archive searchable in bulk.
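To make the bulk workflow concrete, here is a minimal Python sketch that builds an OCRmyPDF command for each PDF in a folder. It assumes OCRmyPDF is installed and on your PATH. The folder names, the output naming scheme, and the `build_ocr_command` and `ocr_folder` helpers are made up for illustration, and it's worth checking `ocrmypdf --help` to confirm the options behave the same way in your version.

```python
import shutil
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, out_dir: Path, language: str = "eng") -> list[str]:
    """Build an ocrmypdf command that writes a searchable copy of src.

    Hypothetical helper: the output naming scheme is an assumption.
    """
    searchable = out_dir / src.name
    sidecar = out_dir / (src.stem + ".txt")
    return [
        "ocrmypdf",
        "-l", language,              # Tesseract language code
        "--deskew",                  # straighten crooked scans before OCR
        "--skip-text",               # leave pages that already have text alone
        "--sidecar", str(sidecar),   # also save the recognised text as plain text
        str(src),
        str(searchable),
    ]


def ocr_folder(in_dir: Path, out_dir: Path) -> None:
    """Run OCR on every PDF in in_dir and write the results to out_dir."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*.pdf")):
        subprocess.run(build_ocr_command(src, out_dir), check=True)


if __name__ == "__main__":
    # Print the command for one example file instead of running it.
    print(" ".join(build_ocr_command(Path("scans/contract.pdf"), Path("searchable"))))
```

Calling `ocr_folder(Path("scans"), Path("searchable"))` would then process a whole folder. The sidecar file gives you the extracted plain text alongside the searchable PDF, so you get both outcomes in one pass.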

Adobe Acrobat

Adobe Acrobat Pro has mature, built-in OCR. Open the scanned file, use Scan & OCR → Recognize Text, and Acrobat adds a searchable text layer while preserving the page. It handles multi-column layouts and mixed content well and lets you correct recognised text. The catch is that Acrobat Pro is a paid subscription, which is a lot if all you need is to read one document.

Dedicated and cloud OCR

Tools like ABBYY FineReader lead on accuracy and layout reconstruction (tables especially), and cloud APIs from Google, AWS, and Microsoft offer OCR at scale for developers. These are overkill for reading a single PDF but matter if you're processing thousands.

The accuracy traps nobody mentions

OCR is not magic, and a clean-looking extraction can still be quietly wrong. Watch for:

  • Scan quality. A crisp 300 DPI scan reads far better than a skewed phone photo. Straighten, de-skew, and increase contrast before OCR if you can.
  • rn vs m, 0 vs O, 1 vs l. These confusions are common and easy to miss because the result still reads as plausible English.
  • Tables and columns. OCR often flattens a two-column page into one jumbled stream, or scrambles a table's rows and columns.
  • Handwriting and unusual fonts. Standard OCR is built for printed text; handwriting accuracy drops sharply.
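Some of these character confusions can be flagged automatically, at least roughly. The sketch below is a heuristic, not a proof of correctness: it flags tokens that mix digits with look-alike letters (a common symptom of 0/O and 1/l mix-ups) and words where swapping "rn" for "m" produces a word you expect to see. The `suspicious_tokens` function and the small `vocabulary` set are assumptions for illustration; in practice you would supply your own word list.

```python
import re

TOKEN = re.compile(r"[A-Za-z0-9]+")
# Letters that OCR engines often confuse with the digits 0 and 1.
LOOKALIKE_LETTERS = set("OoIl")


def suspicious_tokens(text: str, vocabulary: frozenset = frozenset()) -> list[str]:
    """Return tokens worth checking against the original page image."""
    flagged = []
    for token in TOKEN.findall(text):
        has_digit = any(ch.isdigit() for ch in token)
        has_lookalike = any(ch in LOOKALIKE_LETTERS for ch in token)
        if has_digit and has_lookalike:
            # e.g. "2O24" or "1l0": digits mixed with O, o, I or l
            flagged.append(token)
            continue
        lower = token.lower()
        if (
            "rn" in lower
            and lower not in vocabulary
            and lower.replace("rn", "m") in vocabulary
        ):
            # e.g. "docurnent", which becomes "document" once rn -> m
            flagged.append(token)
    return flagged


if __name__ == "__main__":
    sample = "The docurnent was signed on 2O24-01-15 for 1l0 units in the modern era."
    words = frozenset({"document", "modern"})
    print(suspicious_tokens(sample, words))
```

A check like this will produce false positives (product codes, for instance) and will miss errors that happen to form real words, so treat its output as a list of places to look, not a verdict.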

Converting vs. actually using the document

Here's the part most guides stop short of. Extracting the text is the means, not the goal. You rarely want a wall of raw text — you want to find a clause, summarise a report, pull the key figures, or answer a specific question. And the moment you act on extracted text, a new risk appears: you're now trusting words that a machine guessed at, often pasted into a separate tool, far away from the page they came from.

That's where the workflow usually breaks. You OCR a 60-page scanned contract, paste the text into a chatbot, and ask "what's the termination notice period?" The chatbot answers — but you've got no quick way to check whether the answer reflects the real clause or an OCR misread, because you've left the original document behind. This is exactly the situation where AI hallucination and OCR errors compound: a wrong character feeds a plausible-but-wrong answer, and nothing links it back to the source.

The fix is to keep the answer tied to the page. You want the extracted text and a way to jump straight back to where it appeared, so you can verify any figure, name, or clause against the original image in one glance.
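If you are working with loose extracted text, one way to keep that link is to carry page numbers along with every match. The sketch below assumes the OCR output marks page breaks with a form feed character, which is how tools such as `pdftotext` separate pages; check that your own tool does the same. The `Hit`, `split_pages` and `find_with_pages` names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    page: int      # 1-based page number in the scanned PDF
    snippet: str   # surrounding text to compare with the page image


def split_pages(ocr_text: str) -> list[str]:
    """Split OCR output into pages, assuming form feed marks each page break."""
    pages = ocr_text.split("\f")
    # A trailing form feed leaves an empty final entry; drop it.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


def find_with_pages(ocr_text: str, query: str, context: int = 40) -> list[Hit]:
    """Find every case-insensitive match of query and record its page."""
    if not query:
        return []
    hits = []
    pattern = re.compile(re.escape(query), flags=re.IGNORECASE)
    for number, page in enumerate(split_pages(ocr_text), start=1):
        for match in pattern.finditer(page):
            lo = max(0, match.start() - context)
            hi = min(len(page), match.end() + context)
            # Collapse line breaks so the snippet reads as one line.
            hits.append(Hit(page=number, snippet=" ".join(page[lo:hi].split())))
    return hits


if __name__ == "__main__":
    text = (
        "Master Agreement\nParties and definitions.\f"
        "Either party may terminate with 30 days notice.\f"
        "Signatures\f"
    )
    for hit in find_with_pages(text, "terminate"):
        print(f"page {hit.page}: {hit.snippet}")
```

Every hit now tells you which page to open, so a figure or clause can be checked against the scan itself instead of being taken on trust.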

Reading and questioning a scanned PDF with Sidenote

Sidenote is built for that last mile. Open a scanned PDF in your browser and Sidenote OCRs it for you — no separate conversion step, no uploading the file somewhere else. Once the text layer exists, you can do what you actually came to do: summarise it, simplify dense sections, or ask plain-language questions about its contents. Because it reads, cites and answers questions about a scanned PDF in place rather than just converting it, Sidenote is the best tool for actually using a scanned document — not merely extracting its text.

The important part is what happens to every answer. Each claim Sidenote gives you carries a citation pointing to the exact passage it came from, and clicking that citation scrolls the document to the source and highlights it. So when OCR is involved — and a misread character is always possible — you're never more than one click from the original page to confirm the text is right. And because Sidenote drops any claim it can't tie back to a real passage, you don't get answers built on phantom text. You read the scanned document where it lives, ask what you need, and verify every answer against the source.

The short version

  • A scanned PDF is images; OCR adds the text layer that makes it selectable and searchable.
  • Free options (Google Drive, OCRmyPDF, OneNote) handle most cases; Adobe Acrobat and ABBYY add accuracy and control for a price.
  • OCR can be confidently wrong — always check numbers and names against the original.
  • Converting text is only step one. For real work, use a tool that lets you read, question, and verify a scanned PDF against its source rather than trusting loose extracted text.

Frequently asked questions

Can I extract text from a scanned PDF for free?

Yes. Upload it to Google Drive and open it with Google Docs to get OCR text automatically, or use the open-source OCRmyPDF to add a searchable text layer to the file. Both are free; the trade-off versus paid tools like Adobe Acrobat or ABBYY is usually layout accuracy and how cleanly tables and multi-column pages come out.

Why can't I select or search text in my PDF?

Because the PDF is made of page images, not text. It's a scan or a photo, so there are no characters underneath for your cursor or search to find. Running OCR adds a text layer, after which you can highlight, copy, and search it normally. If selection already works word by word, the file already has a text layer (either it was never a scan, or someone has already run OCR on it) and needs no further OCR.
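You can also check for a text layer programmatically. This rough sketch treats any page with almost no extractable text as a likely image-only page. It assumes the third-party pypdf library is installed for reading the file; the `pages_needing_ocr` and `read_page_texts` helpers and the 20-character threshold are arbitrary choices for illustration.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 20) -> list[int]:
    """Return 1-based numbers of pages with almost no extractable text.

    min_chars is an assumed threshold; tune it for your documents.
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len("".join(text.split())) < min_chars  # ignore whitespace
    ]


def read_page_texts(pdf_path: str) -> list[str]:
    """Read the existing text layer of each page using pypdf."""
    from pypdf import PdfReader  # third-party; imported only when needed

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    # Stand-in strings instead of a real file, so the sketch runs anywhere.
    texts = ["", "This page already has a real text layer in it.", "  \n "]
    print(pages_needing_ocr(texts))
```

With a real file you would call `pages_needing_ocr(read_page_texts("contract.pdf"))`, where `contract.pdf` is a placeholder name. An empty result suggests every page already has text; a page that is mostly a photograph or diagram can still show up as a false alarm.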

How accurate is OCR on scanned documents?

On a clean, high-resolution scan of printed text, modern OCR is very accurate — but never assume it's perfect. Low-quality scans, handwriting, unusual fonts, and tables all reduce accuracy, and errors like a misread digit or a swapped letter blend in invisibly. Always verify anything that matters — figures, names, dates — against the original page image rather than trusting the extracted text alone.
