Install OCRmyPDF and Convert Scanned PDFs into Searchable PDFs

by Eli

Sometimes you have a scanned PDF and want to copy, cut, or paste portions of it, or search its text, to speed up your work. Unfortunately, a scanned PDF is only a set of page images, so none of these functions work. The good news is that you can OCR the document with OCRmyPDF, an open-source tool that adds a searchable text layer to scanned PDFs.

This means you can convert scans or images of documents into searchable PDF files whose text can be selected and copied, and even adjust the quality of the resulting files.

OCRmyPDF runs well on Linux-based operating systems. Here we use Ubuntu Linux 18.04 to show how to install the latest version of OCRmyPDF and how to use it to turn scanned or grayscale PDFs into more functional files.

Installing the latest version of OCRmyPDF on Ubuntu 18.04 LTS

First update the package lists, then install the system dependencies:

Code:

sudo apt-get -y update
sudo apt-get -y install \
    ghostscript \
    icc-profiles-free \
    liblept5 \
    libxml2 \
    pngquant \
    python3-cffi \
    python3-distutils \
    python3-pkg-resources \
    python3-reportlab \
    qpdf \
    tesseract-ocr \
    zlib1g
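
After the packages are installed, it can help to confirm that the command-line tools they provide are actually on your PATH. The sketch below is only an illustration: check_tools is a hypothetical helper name, and the command names in the example (gs from ghostscript, tesseract from tesseract-ocr) are the executables these packages normally provide.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "missing: NAME" line per absent command and
# returns non-zero if at least one was missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (uncomment to run):
# check_tools gs qpdf tesseract pngquant
```

If the function prints nothing, every listed tool was found. The optional JBIG2 encoder described further down installs a command named jbig2, which can be checked the same way.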

Two optional dependencies, the JBIG2 encoder and pngquant, may still be missing after this step. You can install them separately; OCRmyPDF will still work without them.

Install JBIG2 encoder:

JBIG2 encoding is recommended for OCRmyPDF and is used to losslessly create smaller PDFs. If JBIG2 encoding is not available, lower quality encodings will be used.

You need to build the JBIG2 encoder from source. The build requires a C++ compiler, autotools (automake, libtool) and the Leptonica development headers (the libleptonica-dev package on Ubuntu):

Code:

git clone https://github.com/agl/jbig2enc
cd jbig2enc
./autogen.sh
./configure && make
sudo make install

Install pngquant:

Code:

sudo apt-get update -y
sudo apt-get install -y pngquant

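We need a newer version of pip than the one packaged for Ubuntu 18.04 at the time of installation:

Code:

wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py

If the script reports that your Python version is no longer supported, a version-specific script such as https://bootstrap.pypa.io/pip/3.6/get-pip.py (for the Python 3.6 that ships with Ubuntu 18.04) can be used instead.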

Lastly, install the most recent OCRmyPDF for the local user and set the user’s PATH to check for the user’s Python packages:

Code:

export PATH=$HOME/.local/bin:$PATH
python3 -m pip install --user ocrmypdf
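
Note that an export typed at the prompt only lasts for the current shell session. One way to make the PATH change permanent is to append the line to a startup file such as ~/.profile. The following is a minimal sketch under that assumption; persist_path_dir is a hypothetical helper name, and it only adds the line if the exact same line is not already present.

```shell
# Append a PATH export for a directory to a profile file,
# unless that exact line is already in the file.
# $1: profile file (for example "$HOME/.profile")
# $2: directory to put in front of PATH
persist_path_dir() {
    profile=$1
    dir=$2
    # \$PATH stays literal so it is expanded at login, not now.
    line="export PATH=\"$dir:\$PATH\""
    if ! grep -qxF -- "$line" "$profile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$profile"
    fi
}

# Example call (uncomment to run):
# persist_path_dir "$HOME/.profile" "$HOME/.local/bin"
```

After opening a new login shell, running ocrmypdf --version should print the installed version if the PATH is set correctly.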

Usage:

You can then OCR your PDF (input.pdf) and get a searchable PDF (output.pdf) as follows:

Code:

ocrmypdf input.pdf output.pdf
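
If you have a whole folder of scans, the same command can be run in a loop. The sketch below is an illustration only: ocr_directory and the OCR_CMD variable are hypothetical names introduced here, not part of OCRmyPDF. By default it calls ocrmypdf with --skip-text, an OCRmyPDF option that leaves pages that already contain text untouched.

```shell
# OCR every PDF in one directory into another directory.
# $1: input directory, $2: output directory.
# OCR_CMD (hypothetical variable) overrides the command that is run.
ocr_directory() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir" || return 1
    for src in "$in_dir"/*.pdf; do
        [ -e "$src" ] || continue      # the glob matched no PDFs
        dest="$out_dir/$(basename "$src")"
        # Left unquoted on purpose so OCR_CMD may carry options.
        ${OCR_CMD:-ocrmypdf --skip-text} "$src" "$dest" \
            || echo "failed: $src" >&2
    done
}

# Example call (uncomment to run):
# ocr_directory ./scans ./searchable
```

Other OCRmyPDF options can be passed through the variable, for example OCR_CMD="ocrmypdf -l eng --deskew --rotate-pages" to set the OCR language, straighten skewed pages and fix page orientation. Languages other than English need the matching Tesseract language package to be installed.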

If you cannot do this on your local machine, you can convert your PDF with the onlineocr tool instead.

