Install OCRmyPDF and Convert Scanned PDFs into Searchable PDFs

Eli

Sometimes you have a scanned PDF and want to search its text, or copy and paste portions of it, to speed up your work. Unfortunately, a scanned PDF is just a set of page images, so none of these functions work. The good news is that you can OCR the document with OCRmyPDF, an open-source tool that adds a searchable text layer to scanned PDFs.

This means you can convert scans or images of documents into searchable PDF files with selectable, copyable text, and even adjust the quality of the resulting files.

If you use a Linux-based operating system, you get the most out of OCRmyPDF. Here we show, using Ubuntu Linux 18.04, how to install the latest version of OCRmyPDF and how to use it to turn scanned or grayscale PDFs into more functional files.

Installing the latest version of OCRmyPDF on Ubuntu 18.04 LTS

First update the package lists, then install the system dependencies:

  $ sudo apt-get -y update
  $ sudo apt-get -y install \
      ghostscript \
      icc-profiles-free \
      liblept5 \
      libxml2 \
      pngquant \
      python3-cffi \
      python3-distutils \
      python3-pkg-resources \
      python3-reportlab \
      qpdf \
      tesseract-ocr \
      zlib1g
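
To confirm that the packages above actually put their programs on your PATH, a small check along these lines may help. This is only a sketch: `missing_tools` is a name made up for this example, and the command names you pass to it (for instance `gs` for Ghostscript, `tesseract`, `qpdf`, `pngquant`) are the executables those packages normally provide.

```shell
# Print every command from the argument list that is NOT found on PATH.
# missing_tools is a hypothetical helper name, not part of OCRmyPDF.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}
```

Running `missing_tools gs tesseract qpdf pngquant` should print nothing when everything is installed; any name it prints still needs installing.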


Some optional dependencies, such as the JBIG2 encoder and pngquant, may still be missing, so you can install them separately. OCRmyPDF will still work without them.

Install JBIG2 encoder:

JBIG2 encoding is recommended for OCRmyPDF and is used to losslessly create smaller PDFs. If JBIG2 encoding is not available, lower quality encodings will be used.

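You need to build the JBIG2 encoder from source (the build needs git, the autotools, libtool, and the Leptonica development headers). Use sudo for the last step when installing system-wide:

  $ git clone https://github.com/agl/jbig2enc
  $ cd jbig2enc
  $ ./autogen.sh
  $ ./configure && make
  $ sudo make install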


Install pngquant:

  $ sudo apt-get update -y
  $ sudo apt-get install -y pngquant


We need a newer version of pip than the one packaged for Ubuntu 18.04:

  $ wget https://bootstrap.pypa.io/get-pip.py && python3 get-pip.py


Lastly, add the user's Python package directory to the PATH and install the most recent OCRmyPDF for the local user:

  $ export PATH=$HOME/.local/bin:$PATH
  $ python3 -m pip install --user ocrmypdf
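
Note that an export typed at the prompt only lasts for the current terminal session. One way to make it stick is to add the line to a startup file, and only once. The helper below is a sketch under that assumption; `append_once` is a name invented here, and which startup file your shell reads (for example `~/.profile`) depends on your setup.

```shell
# Append a line to a file only if that exact line is not already there,
# so running it repeatedly does not pile up duplicates.
# append_once is a hypothetical helper name used for this sketch.
append_once() {
    line=$1
    file=$2
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

For example, `append_once 'export PATH=$HOME/.local/bin:$PATH' "$HOME/.profile"` adds the export once. After opening a new terminal, `ocrmypdf --version` should then print the installed version.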


Usage:

You can then OCR your PDF (input.pdf) and get a searchable PDF (output.pdf) as follows:

  $ ocrmypdf input.pdf output.pdf
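
If you have a whole folder of scans, the single command above can be wrapped in a loop. This is a minimal sketch, not the only way to do it: `ocr_dir` and `OCR_CMD` are names invented for this example, and `--skip-text` is the OCRmyPDF option that leaves pages which already contain text untouched.

```shell
# Command used for OCR; overridable so the loop can be tried with a stub.
OCR_CMD=${OCR_CMD:-ocrmypdf}

# OCR every *.pdf in directory $1 into directory $2, keeping file names.
# ocr_dir is a hypothetical helper name, not an OCRmyPDF command.
ocr_dir() {
    src_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for src in "$src_dir"/*.pdf; do
        # With no matching files the pattern stays literal, so skip it.
        [ -e "$src" ] || continue
        "$OCR_CMD" --skip-text "$src" "$out_dir/$(basename "$src")"
    done
}
```

Calling `ocr_dir scans searchable` would then write an OCRed copy of each PDF in `scans` into `searchable`. Single files accept the same options, for example `ocrmypdf --deskew input.pdf output.pdf` to straighten crooked pages first.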


If you cannot do this on your local machine, you can OCR your PDF with the onlineocr tool instead.