Text Summarizer using Python
I am sure most of us would love an automated way to summarize long notes. Though it is not foolproof, this can be achieved using Python (big surprise 🙄). With the help of a few libraries (obviously 😅) we can get some semblance of a summary. So let's get started (rolling up the sleeves).
Package Requirements:
For this we will be using the NLTK package. NLTK is one of the most widely used packages for Natural Language Processing. It has a lot of methods for manipulating text. For reading PDFs we are going to use PyMuPDF. The final summary is produced per page: the PDF file is read page by page and each page is summarized separately.
How this works:
There are different methods for getting a summary from a text. An abstractive approach would be to understand the semantics and then write a summary of the text. Though this might be more human-like, it requires a good grasp of advanced NLP methods. An easier way is to use extractive techniques, where we build the summary by selecting important words and weighing the sentences based on those words. There are a lot of methods for this, such as cosine similarity, TF-IDF etc. We will use weighted word frequencies to extract the summary, as this is fast and easy (🤭).
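To make one of the alternatives mentioned above a bit more concrete, here is a minimal sketch of cosine similarity between two sentences treated as word-count vectors. The function name `cosine_similarity` and the whitespace tokenization are my own simplifications for illustration; this is not the method used in the rest of this post.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    counts_a = Counter(sentence_a.lower().split())
    counts_b = Counter(sentence_b.lower().split())
    # Dot product over the words the two sentences share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


same = cosine_similarity("tigers hunt deer", "tigers hunt deer")
partial = cosine_similarity("tigers hunt deer", "tigers hunt boar")
unrelated = cosine_similarity("tigers hunt deer", "cats sleep often")
```

Identical sentences score close to 1, sentences with no shared words score 0, and everything else falls in between. A summarizer built on this idea would typically compare sentences with each other and favour the ones most similar to the rest of the text.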
Code:
The basic logic is simple, as shown in the flow above. Take each page of text from the PDF and get its words, sans the stop words (words like me, my etc.). Find the highest word frequency on the page and divide every word's frequency by this number to get the weighted frequencies. For each sentence, add up the weighted frequencies of its words. Take the top 3 sentences with the highest scores. This is the basic roadmap used to extract the summary. Disclaimer: I created this code after searching through the internet.
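Before the full program, the scoring idea can be tried on a toy text using only the standard library. This is a rough sketch under my own assumptions: the tiny `STOP_WORDS` set, the regex-based tokenization and the helper names `weighted_frequencies` and `score_sentences` are made up for illustration, while the actual code relies on NLTK for stop words and tokenization.

```python
import heapq
import re
from collections import Counter

# Hypothetical mini stop-word list, only for this illustration;
# the full program uses NLTK's English stop words instead.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "in", "it", "are", "to"}


def weighted_frequencies(text):
    """Map each non-stop word to its count divided by the highest count."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    if not counts:
        return {}
    highest = max(counts.values())
    return {word: count / highest for word, count in counts.items()}


def score_sentences(text, frequencies):
    """Score each sentence as the sum of the weighted frequencies of its words."""
    # Naive split after ., ! or ?; NLTK's sent_tokenize is more robust.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return {
        s: sum(frequencies.get(w, 0.0) for w in re.findall(r"[a-z']+", s.lower()))
        for s in sentences
    }


text = (
    "Tigers are large cats. Tigers hunt deer and boar. "
    "The weather was pleasant."
)
frequencies = weighted_frequencies(text)
scores = score_sentences(text, frequencies)
best = heapq.nlargest(1, scores, key=scores.get)
```

Here "tigers" is the most frequent word, so it gets a weight of 1.0 and every other word gets 0.5. The second sentence has the most weighted words and therefore comes out on top.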
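import fitz  # PyMuPDF
from nltk.corpus import stopwords
from nltk import sent_tokenize
from nltk import word_tokenize
import heapq

# one-time setup: nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead)

# read the pdf file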
def readPdf(pdfPath):
    pdfFileObj = fitz.open(pdfPath)
    allPageDict = dict()
    # page_count, load_page and get_text are the current PyMuPDF names
    # for the older pageCount, loadPage and getText
    pageNum = pdfFileObj.page_count
    # extract the text from the pages into dictionary
    # with keys as page number
    for iCount in range(0, pageNum):
        pageObj = pdfFileObj.load_page(iCount)
        allPageDict[iCount] = pageObj.get_text()
    pdfFileObj.close()
    return allPageDict

# extract summary for each page
def summEachPage(pageWiseList):
    pageWiseFreq = {}
    for iCount in range(0, len(pageWiseList)):
        # extract the word frequency for each page
        wordFrequencies = findWordFreq(pageWiseList[iCount])
        # get the sentence score for each page
        sentScore = findSentScore(wordFrequencies, pageWiseList[iCount])
        # get top 3 sentences
        summary_sentences = heapq.nlargest(3, sentScore, key=sentScore.get)
        summary = ' '.join(summary_sentences)
        # insert into a dictionary
        pageWiseFreq[iCount] = summary
    return pageWiseFreq

# find the word frequencies
def findWordFreq(pageWiseList):
    # check for stopwords in english
    stopwordEng = stopwords.words('english')
    word_frequencies = {}
    # get the word frequency; lowercase the text so words match both the
    # stop word list and the lowercased lookup in findSentScore
    for word in word_tokenize(pageWiseList.lower()):
        if word not in stopwordEng:
            if word not in word_frequencies.keys():
                word_frequencies[word] = 1
            else:
                word_frequencies[word] += 1
    # a page without any text has no frequencies to weight
    if not word_frequencies:
        return word_frequencies
    # get the word with maximum number
    maximum_frequency = max(word_frequencies.values())
    # get the weighted word frequency for each word
    for word in word_frequencies.keys():
        word_frequencies[word] = (word_frequencies[word] / maximum_frequency)
    return word_frequencies

# find sentence score
def findSentScore(wordFrequencies, wholePage):
    sentence_scores = {}
    # sentence tokenize the page
    sentence_list = sent_tokenize(wholePage)
    # for each sentence
    for sent in sentence_list:
        # get all the words in the single sentence
        for word in word_tokenize(sent.lower()):
            # check if word is in the wordFreq dictionary
            if word in wordFrequencies.keys():
                # add the word freq in each sent
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = wordFrequencies[word]
                else:
                    sentence_scores[sent] += wordFrequencies[word]
    return sentence_scores

# path of pdf
pdfPath = 'Tiger.pdf'
pageDict = readPdf(pdfPath)
pageWiseSummary = summEachPage(pageDict)

# write the data into text if needed
with open("Summary.txt", "a+", encoding='utf-8') as file1:
    for key, value in pageWiseSummary.items():
        file1.write("Page " + str(key + 1) + " \n")
        file1.write(value)
        file1.write("\n\n")
Voila! You now have a basic text summarizer 😀. As an example, below is a page from the Wikipedia article on the tiger. The summarized text is given below it.
Page One:
The tiger (Panthera tigris) is the largest extant cat species and a member of
the genus Panthera. It is most recognisable for its dark vertical stripes on
orange-brown fur with a lighter underside. It is an apex predator, primarily
preying on ungulates such as deer and wild boar. It is territorial and
generally a solitary but social predator, requiring large contiguous areas of
habitat, which support its requirements for prey and rearing of its offspring.
Tiger cubs stay with their mother for about two years, before they become
independent and leave their mother's home range to establish their own.
The tiger once ranged widely from the Eastern Anatolia Region in the west
to the Amur River basin, and in the south from the foothills of the
Himalayas to Bali in the Sunda islands. Since the early 20th century, tiger
populations have lost at least 93% of their historic range and have been
extirpated in Western and Central Asia, from the islands of Java and Bali,
and in large areas of Southeast and South Asia and China. Today's tiger
range is fragmented, stretching from Siberian temperate forests to
subtropical and tropical forests on the Indian subcontinent and Sumatra.
The tiger is listed as endangered on the IUCN Red List. As of 2015, the
global wild tiger population was estimated to number between 3,062 and
3,948 mature individuals, with most of the populations living in small
pockets isolated from each other. India currently hosts the largest tiger
population. Major reasons for population decline are habitat destruction,
habitat fragmentation and poaching. Tigers are also victims of
human–wildlife conflict, in particular in range countries with a high human
population density.
Summarized Text:
Since the early 20th century, tiger
populations have lost at least 93% of their historic range and have been
extirpated in Western and Central Asia, from the islands of Java and Bali,
and in large areas of Southeast and South Asia and China. As of 2015, the
global wild tiger population was estimated to number between 3,062 and
3,948 mature individuals, with most of the populations living in small
pockets isolated from each other. It is territorial and
generally a solitary but social predator, requiring large contiguous areas of
habitat, which support its requirements for prey and rearing of its offspring.
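Notice that the summary lists sentences in score order, so the sentence about territorial behaviour from the first paragraph ends up last. If reading order matters to you, one possible tweak (my own assumption, not part of the code above) is to re-sort the selected sentences by their position on the page. The helper name `top_sentences_in_order` is hypothetical; it expects the score dictionary returned by `findSentScore` and the sentence list from `sent_tokenize`.

```python
import heapq


def top_sentences_in_order(sentence_scores, sentence_list, n=3):
    """Pick the n best-scoring sentences, then return them in page order."""
    best = set(heapq.nlargest(n, sentence_scores, key=sentence_scores.get))
    # Walk the page's sentences in their original order and keep the chosen ones.
    ordered = []
    for sentence in sentence_list:
        if sentence in best and sentence not in ordered:
            ordered.append(sentence)
    return ordered


# Toy scores standing in for the output of findSentScore.
sentences = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
scores = {
    "First sentence.": 1.5,
    "Second sentence.": 0.4,
    "Third sentence.": 2.0,
    "Fourth sentence.": 1.8,
}
summary = " ".join(top_sentences_in_order(scores, sentences, n=3))
```

With these toy scores the lowest-scoring sentence is dropped and the remaining three come back in their original order.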
I hope this topic helps you understand a little more of NLP (and also makes your note-making process simpler 😝). Let me know your thoughts in the comments!
References:
- https://towardsdatascience.com/understand-text-summarization-and-create-your-own-summarizer-in-python-b26a9f09fc70
- https://stackabuse.com/text-summarization-with-nltk-in-python/#:~:text=Text%20summarization%20is%20a%20subdomain,and%20deep%20learning%2Dbased%20techniques.
- https://en.wikipedia.org/wiki/Tiger
Originally published at http://evrythngunder3d.wordpress.com on September 24, 2020.