Exploring Python Packages: WordCloud

Shrinand Kadekodi
3 min read · Jun 13, 2021


We often see cloud-like pictures made of words in different sizes in blogs and posts. These are called word clouds or tag clouds: the font size, color and weight of each word depend on its importance. A word cloud is one way to visualize and highlight the significant words in a large text. In this post, let us see how to create a word cloud using Python.

Installation:

First we need to install wordcloud in our Anaconda environment. This is fairly simple: type conda install -c conda-forge wordcloud in the Anaconda prompt and wordcloud will be installed.
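
Before going further, it can help to confirm that the packages used later in this post are visible to Python. The snippet below is a minimal sketch using only the standard library; the helper name is_installed is my own and is not part of wordcloud.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the import system can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # The packages imported by the code later in this post.
    for name in ("wordcloud", "nltk", "spacy", "matplotlib"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

Any package reported as missing needs to be installed in the same environment before running the main code.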

Data:

Word clouds are created by counting how often each word occurs and sizing the words accordingly, so more frequent words appear larger and more prominent than less used ones. Honestly, you can dump any text into wordcloud and it will generate an image from it. But this won’t be of much use, as there will be no insights 😅.
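
To make the idea of sizing words by frequency concrete, here is a small sketch that uses only the standard library. It is not how wordcloud computes its layout internally; the function font_sizes and its linear scaling between min_size and max_size are assumptions made purely for illustration.

```python
from collections import Counter


def font_sizes(text, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its count."""
    counts = Counter(text.lower().split())
    if not counts:
        return {}
    top = max(counts.values())
    low = min(counts.values())
    span = top - low
    sizes = {}
    for word, n in counts.items():
        # If every word is equally frequent, fall back to the smallest size.
        scale = (n - low) / span if span else 0.0
        sizes[word] = round(min_size + scale * (max_size - min_size))
    return sizes


print(font_sizes("good good good movie movie plot"))
# → {'good': 60, 'movie': 35, 'plot': 10}
```
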
For this task I chose the movie review corpus from NLTK. I randomly selected 100 text files each from the positive and negative review folders. To find where this data is saved, head to C:\Users\YourUserID\AppData\Roaming\nltk_data\corpora\movie_reviews
You can take any 100 text files of each kind and create one folder for the negative reviews and one for the positive ones.
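
The random selection can also be scripted instead of done by hand. The following is a rough sketch, not the exact procedure I followed: the function name sample_reviews, the .txt filter and the seed parameter are assumptions for illustration, and the folder paths are placeholders you need to replace.

```python
import random
import shutil
from pathlib import Path


def sample_reviews(source_dir, target_dir, count=100, seed=None):
    """Copy `count` randomly chosen .txt files from source_dir into target_dir."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # Sort first so that a fixed seed always yields the same selection.
    files = sorted(p for p in source.iterdir() if p.suffix == ".txt")
    rng = random.Random(seed)
    chosen = rng.sample(files, min(count, len(files)))
    for path in chosen:
        shutil.copy(path, target / path.name)
    return [p.name for p in chosen]
```

For example, sample_reviews('path/to/positive/reviews', 'positive_sample', seed=42) would copy 100 reviews into a folder named positive_sample, or all of them if fewer than 100 exist.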

Code:

The code below reads all 100 text files in a folder and builds a list of all the words in them. To improve the result, I extracted the adjectives using spaCy, since reviews use adjectives to describe the movie and these should make for a more telling cloud. Only the words with a frequency greater than 3 are kept. This is by no means the best way to extract relevant words, so improvements can be made 😅.

import os
import wordcloud as wd
from nltk.tokenize import word_tokenize
import matplotlib.pyplot as plt
from nltk.probability import FreqDist
import spacy
# extract the data from each text file and save it in a list
def extract_file(pathname):
    fileList = [os.path.join(pathname, f) for f in os.listdir(pathname)]
    tempList = []
    for file in fileList:
        with open(file) as reader:
            # read the whole review; readlines()[0] would keep only its first line
            tempList.append(reader.read())
    return tempList
# removing leftover markup tokens (<, >, br, /) and extracting adjectives
def removeStopwords(movieReviews):
    unneeded_words = ['<', '>', 'br', '/']
    endResult = []
    for movieReview in movieReviews:
        tokenizeSent = word_tokenize(movieReview)
        temperRead = [w for w in tokenizeSent if w not in unneeded_words]
        endResult.extend(temperRead)

    # load the small English spaCy model for part-of-speech tagging
    nlp = spacy.load('en_core_web_sm')

    adjectiveWords = []
    for movieReview in movieReviews:
        doc = nlp(movieReview)
        for word in doc:
            if word.pos_ == 'ADJ':
                adjectiveWords.append(word.text)

    return (endResult, adjectiveWords)
positiveRev = extract_file('Path where the positive review folder is kept')
negativeRev = extract_file('Path where the negative review folder is kept')
posRevWoStpWrds,posWordsAdj = removeStopwords(positiveRev)
negRevWoStpWrds,negWordsAdj = removeStopwords(negativeRev)
# filtering: keep words with frequency greater than 3 that are not wordcloud stopwords
stopwords = set(wd.STOPWORDS)
dataPosAnalysis = FreqDist(posWordsAdj)
filterPosWords = {m: n for m, n in dataPosAnalysis.items() if n > 3 and m.lower() not in stopwords}
dataNegAnalysis = FreqDist(negWordsAdj)
filterNegWords = {m: n for m, n in dataNegAnalysis.items() if n > 3 and m.lower() not in stopwords}
# setting the wordcloud attributes; the word counts are passed in directly, because
# joining the dictionary keys into one string would make every word occur exactly once
wordcloudPos = wd.WordCloud(width = 400, height = 400,
                background_color ='white',
                min_font_size = 10).generate_from_frequencies(filterPosWords)
# setting the wordcloud attributes
wordcloudNeg = wd.WordCloud(width = 400, height = 400,
                background_color ='white',
                min_font_size = 10).generate_from_frequencies(filterNegWords)
# plotting the cloud for the positive reviews
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloudPos)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
# plotting the cloud for the negative reviews
plt.figure(figsize = (8, 8), facecolor = None)
plt.imshow(wordcloudNeg)
plt.axis("off")
plt.tight_layout(pad = 0)
plt.show()
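
The frequency filter used in the code above can be tried in isolation. NLTK's FreqDist is a subclass of collections.Counter, so this sketch uses Counter directly and needs nothing beyond the standard library; the function name frequent_words and the sample adjectives are made up for illustration.

```python
from collections import Counter


def frequent_words(words, min_count=3):
    """Keep only the words that occur more than `min_count` times."""
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n > min_count}


# Made-up sample: "great" x5, "funny" x4, "slow" x3, "odd" x1.
adjectives = ["great"] * 5 + ["funny"] * 4 + ["slow"] * 3 + ["odd"]
print(frequent_words(adjectives))
# → {'great': 5, 'funny': 4}
```

Note that "slow" is dropped even though it occurs 3 times, because the condition is strictly greater than 3.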

The output:

[Images: the word clouds generated for the positive and the negative reviews]

For a word cloud to be relevant, it is important that the preprocessing is done properly.
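
As a rough idea of what such preprocessing might involve, here is a minimal standard-library sketch. It is not the pipeline used in this post (which relies on NLTK and spaCy); the function preprocess and the tiny SMALL_STOPWORDS set are assumptions for illustration, and a real stopword list would be much longer.

```python
import re

# A tiny hand-picked stopword set, for illustration only.
SMALL_STOPWORDS = {"a", "an", "and", "the", "is", "was", "it", "this", "of", "to", "but"}


def preprocess(text, stopwords=SMALL_STOPWORDS):
    """Lowercase, strip HTML-like tags and punctuation, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)            # remove tags such as <br />
    tokens = re.findall(r"[a-z']+", text.lower())   # keep runs of letters and apostrophes
    return [t for t in tokens if t not in stopwords]


print(preprocess("This movie was GREAT!<br />The plot is thin, but still fun."))
# → ['movie', 'great', 'plot', 'thin', 'still', 'fun']
```
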
I hope this simple example shows that highlighting and displaying the important words in a text with wordcloud can be simple and fun!

References:
- A lot of googling; the major sources were machinelearningplus.com, medium.com and geeksforgeeks.org

Originally published at http://evrythngunder3d.wordpress.com on June 13, 2021.
