Exploring Python Packages — Pytesseract

Shrinand Kadekodi
Jul 12, 2020


Python is one of the most widely used and loved programming languages. The ease with which we can get things done without writing tons of code is enticing for a lazy programmer like me 😅. Python has a large number of packages at its disposal for a variety of uses. As I wanted to explore these packages, the best approach, I felt, was to do some small tasks in Python. While working on the task below, I was able to explore a large area that was unknown to me. So without dilly-dallying, let's get started!

Non Searchable PDF to Searchable PDF:

This must have been a problem faced by many people, especially in their college days (you know why 😛 ). You have a PDF file whose pages are images, so you cannot search for words in it. There are free online converters available for this and even Adobe does it … but I wanted to see whether this was possible in Python. This kind of task requires Optical Character Recognition (OCR). Searching on the internet, I came to know of an OCR engine called Tesseract, which was originally developed at HP and whose development was later sponsored by Google. It is an open source project, and from version 4 onwards its recognition engine is based on an LSTM neural network.

To use it from Python, follow the steps below:

  • Install pytesseract by typing ‘pip install pytesseract’ in the Anaconda or Python console. This installs pytesseract, which is only a Python wrapper around Tesseract, along with its Python dependencies (pip installing in Anaconda was a pleasant surprise as I had no idea I could do this! I used to search in conda forge and then install it… but now yay! I can install any package!).
  • Install the tesseract-ocr executable on your computer. This is the actual OCR engine. On Windows it's a simple installer available at https://github.com/UB-Mannheim/tesseract/wiki. Download the latest stable version available on this site. While installing, make sure to check the languages you want to use. Their language data is downloaded separately during installation, and OCR will only work for the languages you have downloaded.

Now that you have installed the prerequisites for Tesseract, let's see which other packages are required to complete our task:

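  • Install pdf2image with ‘pip install pdf2image’ command in Anaconda or Python console. (I will explain the reason below when we go through the code 😅). Note that pdf2image is itself a wrapper around the Poppler utilities, so Poppler also has to be installed and on your PATH.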
  • Install PyPDF2 with ‘pip install PyPDF2’ command in Anaconda or Python console.
  • Install natsort with ‘pip install natsort’ command in Anaconda or Python console.
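Before running the main script, it can help to confirm that everything above is actually visible to Python. The following is a minimal sketch using only the standard library. The function name check_prerequisites is my own, and it assumes the Tesseract and Poppler executables are named tesseract and pdftoppm on your system.

```python
import importlib.util
import shutil


def check_prerequisites(modules=('pytesseract', 'pdf2image', 'PyPDF2', 'natsort'),
                        executables=('tesseract', 'pdftoppm')):
    """Report which Python packages are importable and which executables are on PATH."""
    report = {}
    for name in modules:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    for name in executables:
        # shutil.which returns the full path, or None if it is not on PATH
        report[name] = shutil.which(name)
    return report


for name, status in check_prerequisites().items():
    print(f'{name:12} {status if status else "NOT FOUND"}')
```

If tesseract shows up as NOT FOUND here even though it is installed, it is simply not on PATH, and its location has to be given to pytesseract explicitly.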

We have readied ourselves with the required tools! Obviously there are different ways in which this task can be done. The way I have done it is a crude one, bordering on being ‘jugaadu’, but you can always build on it and make it better.

I have pasted the code with comments below. I hope the comments are understandable 😅. A disclaimer: I was able to get the below code working only after a lot of Google searching (predominantly other blogs and Stack Overflow). But first let's go through the flow of the code.

  • The first step is to import all the packages we have installed, plus glob and os. These two are part of the Python standard library, so they come with any Python installation and do not need pip.
  • The second step is to read the pages of the PDF file as images. This is done by the pdf2image package, which returns one image per page. The reason for this is that Tesseract cannot read a PDF! It reads images, hence you need to convert your PDF to images (even though it's a PDF of images 😅).
  • Next, pytesseract needs to know where the tesseract executable is so that it can call it for OCR. If the folder containing the executable is in your PATH environment variable, it is found automatically; otherwise you have to set the path in the code.
  • For each page image, call the image_to_pdf_or_hocr() method to get a searchable single-page PDF. Save each of these in the temp folder.
  • Finally, loop through all the single-page PDFs in the temp folder and merge them.

# import all the required packages
import glob
import os

import natsort
import pdf2image
import pytesseract as pt
# note: PyPDF2 3.0 and later removed PdfFileMerger in favour of PdfMerger
from PyPDF2 import PdfFileMerger

# convert all pages to images at 200 dpi
pages = pdf2image.convert_from_path(pdf_path='Trialpdf.pdf', dpi=200)
# tell pytesseract where the tesseract executable is
# not required if tesseract is on the PATH environment variable
# I have not added it to PATH, hence I had to set it explicitly
pt.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# creation of a temporary folder (no error if it already exists)
os.makedirs('tempProj', exist_ok=True)
# initialising the object for merging of pages afterwards
merger = PdfFileMerger()
# looping through each page image and converting it to a searchable pdf
# lang='eng+jpn' needs both language packs to be installed with Tesseract
for pageNumber, page in enumerate(pages, start=1):
    content = pt.image_to_pdf_or_hocr(page, nice=0, extension='pdf', lang='eng+jpn')
    # creating an individual pdf page and saving it in the temp folder
    # 'wb' overwrites a file left by an earlier run instead of appending to it
    with open('tempProj/temp' + str(pageNumber) + '.pdf', 'wb') as f:
        f.write(content)
    print('Page Done: ' + str(pageNumber))

# sorting the pdf files in page number order
listofPdfFiles = natsort.natsorted(glob.glob('tempProj/*.pdf'))
# merging the files into one pdf
for pdfFile in listofPdfFiles:
    merger.append(pdfFile)
merger.write("result.pdf")
merger.close()
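
The reason for natsort in the listing above is that a plain alphabetical sort puts temp10.pdf before temp2.pdf, which would scramble the page order for any PDF with ten or more pages. The sketch below illustrates the difference with a small standard-library stand-in for natural sorting. The natural_key helper is my own illustration and not part of natsort.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 'temp10' sorts after 'temp2'."""
    return [int(chunk) if chunk.isdigit() else chunk.lower()
            for chunk in re.split(r'(\d+)', name)]


files = ['tempProj/temp10.pdf', 'tempProj/temp2.pdf', 'tempProj/temp1.pdf']

# plain string sort compares character by character, so '10' lands before '2'
plain_order = sorted(files)
# natural sort compares the numeric chunks as integers
natural_order = sorted(files, key=natural_key)

print(plain_order)
print(natural_order)
```

Another way to sidestep the problem would be to zero-pad the page number in the file name (for example temp001.pdf), in which case a plain sort already gives the right order.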

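A few of you may have wondered about the extra step of writing each page to a separate PDF and then merging them into a single PDF. This is because if I simply write all the pages into a single file, only the last page is shown and all the other pages are hidden under it. The most likely reason is that each call to image_to_pdf_or_hocr() returns a complete, self-contained PDF document, so appending them to one file just stacks several PDFs end to end, and a viewer reads only the last one. I was not able to get this working, hence this solution 😅.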
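
One possible way to keep the page-by-page merge but skip the temp folder is to hand each page's PDF bytes to the merger through an in-memory buffer. This is only a sketch and I have not timed it against the original. The function merge_ocr_pages and its parameters are hypothetical names, and it assumes the merger object accepts file-like objects in append(), which PyPDF2's merger does.

```python
import io


def merge_ocr_pages(pages, ocr_page, merger, out_path):
    """OCR each page image and merge the resulting single-page PDFs in memory.

    pages    -- iterable of page images (e.g. the list returned by pdf2image)
    ocr_page -- callable that takes one image and returns PDF bytes
    merger   -- object with append(fileobj) and write(path) methods
    out_path -- where the merged PDF is written
    """
    count = 0
    for page in pages:
        pdf_bytes = ocr_page(page)
        # wrap the bytes in a file-like object instead of saving them to disk
        merger.append(io.BytesIO(pdf_bytes))
        count += 1
        print('Page Done:', count)
    merger.write(out_path)
    return count
```

With the packages from this post, ocr_page could be something like `lambda img: pt.image_to_pdf_or_hocr(img, extension='pdf', lang='eng')` and merger could be a `PdfFileMerger()` instance. Since the pages are appended in order, the sorting step is no longer needed.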

There are some disadvantages. Since it is OCR, it may recognize some text differently from the original. Also, the output file was larger than my original file, and the conversion took around 3 minutes for a 16-page PDF. And since my file did not have any full-page pictures, I am not sure how well this would work on pages that are mostly pictures rather than text.
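
Part of the time and size cost comes from the dpi chosen when converting pages to images. As a rough back-of-the-envelope sketch, assuming A4 pages (8.27 × 11.69 inches) and uncompressed 3-byte RGB pixels, the amount of pixel data per page grows with the square of the dpi. The helper names here are my own, and the final PDF size will differ because the images are compressed when they are written out.

```python
def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def raw_rgb_megabytes(width_in, height_in, dpi):
    """Approximate uncompressed RGB size of one rendered page, in megabytes."""
    width_px, height_px = page_pixels(width_in, height_in, dpi)
    return width_px * height_px * 3 / 1_000_000


# assumed A4 page size in inches
A4_WIDTH_IN, A4_HEIGHT_IN = 8.27, 11.69

for dpi in (100, 200, 300):
    width_px, height_px = page_pixels(A4_WIDTH_IN, A4_HEIGHT_IN, dpi)
    size_mb = raw_rgb_megabytes(A4_WIDTH_IN, A4_HEIGHT_IN, dpi)
    print(f'{dpi} dpi: {width_px} x {height_px} px, about {size_mb:.1f} MB raw')
```

Lowering the dpi should make the conversion faster and the output smaller, at the price of less detail for the OCR to work with, so it is a trade-off worth experimenting with.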

But for all its disadvantages, I was able to learn about Tesseract, PyPDF2 and pdf2image. And also get my work done to some extent 😉. Let me know in the comments if this was helpful!

Originally published at http://evrythngunder3d.wordpress.com
