Dog Breed Classification: Part 1 (Image Scraping)

Shrinand Kadekodi
Jul 10, 2022

This is a simple Deep Learning application which tries to classify four dog breeds: Boston Terrier, French Bulldog, Malamute and Husky. The main focus of this application is to follow the standard Machine Learning steps below:
1. Data Acquisition (using Python and Selenium)
2. Model Training (using Colab and Tensorflow)
3. Application Development (using Streamlit)
4. Deployment (Streamlit Share and Azure Apps)
This is the first part in which image data is scraped for training the Deep Learning model. Let’s start!

Installation:

The editor I have used is PyCharm Community Edition. All the Python modules can easily be installed from within PyCharm. For this part the required modules are:
selenium, requests — for image scraping
os, random, pickle, io — for various image and file operations
pillow — for reading images from bytes

To use Selenium you will need to download the ChromeDriver executable compatible with your version of Google Chrome. I have saved the executable on the C drive, but it can be saved in any location. This path will be used to start the driver.
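
Depending on the Selenium version you install, the executable_path argument used in the script further below may show a deprecation warning or no longer be available; newer Selenium 4 releases expect the driver path to be passed through a Service object instead. A minimal sketch, assuming the executable is saved as C:\chromedriver.exe (adjust the path to wherever you kept it):

# sketch for newer Selenium 4 releases: pass the ChromeDriver path via a Service object
# (the path below is an assumption; use the location where you saved chromedriver.exe)
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service("C:\\chromedriver.exe")
driver = webdriver.Chrome(service=service)
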
The folder structure is created in advance (of course this could also be done programmatically; a small sketch of that follows the image below). It has the folders train_data, test_data and extra_data, inside each of which a folder for each breed has been created. The image below shows the structure.

Test, Train and Extra Folder
Dog breed sub folders
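
If you would rather create the folder tree programmatically, a minimal sketch along these lines should work; the base path is a placeholder, and the folder names match the structure shown above:

# create the train/test/extra folders with one sub folder per breed
import os

base_path = "your_folder_path"  # placeholder: parent folder for the dataset
for split in ['train_data', 'test_data', 'extra_data']:
    for breed_folder in ['terrier', 'bulldog', 'malamute', 'husky']:
        os.makedirs(os.path.join(base_path, split, breed_folder), exist_ok=True)
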

Once all the modules are installed and the folder structure is created, we can proceed to image scraping.

Code Understanding:

Searching the internet led me to a very good post which showed image scraping using Selenium. You can find the link in the references below.
It did 95% of the job, and I modified the code to suit my needs. It runs perfectly well, and I was able to download the images and segregate them into different folders according to the breed name. I have explained each script in brief, with code, below.

AllImportMods.py — This file contains all the modules to be imported. Totally not necessary as you can directly import all these modules when required in the other scripts 😅.

# Selenium for browser automation
from selenium import webdriver
from selenium.webdriver.common.by import By
# requests for downloading images; the rest are standard library modules
import time, requests, os, pickle, io, random
# Pillow for reading images from bytes
from PIL import Image
# used in the later parts of the series (model training and the Streamlit app)
import tensorflow, cv2, streamlit
import numpy

imageScraping.py — This is the main file which calls all the functions for scraping the images. The snippet below shows the module imports, the initializations and the retrieval of the image URLs. You can see that the image URLs are saved in a pickle file, just so that the URL extraction process does not have to run every time. This mostly helped me while testing the code that distributes the images across the folders. Another thing I had noticed earlier is that Google sometimes shows a captcha when automated Selenium scripts run, so saving the URLs also helps avoid running into the captcha verification 😅.

# import the required functions and modules
from AllImportMods import webdriver, pickle, os
from ImageScrapeMod import fetchImageUrls, persistImage

# give the absolute path for chromedriver and start the driver
DRIVER_PATH = 'your_chromedriver_path'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)

# dog breeds used for classification
dogBreedList = ['Boston Terrier', 'French Bulldog', 'Malamute', 'Husky']
# needed for naming files
dogBreedSmallList = ['BT', 'FB', 'MD', 'HD']
# needed for naming folders
dogFolderList = ['terrier', 'bulldog', 'malamute', 'husky']

# search and keep the urls in a pickle file so that the whole process of
# fetching the image urls does not have to run every time
if not os.path.exists('urllist.pickle'):
    urlList = {}
    for breed in dogBreedList:
        urlList[breed] = fetchImageUrls(breed, 250, driver, 1)
    driver.quit()
    # writing the data to the pickle file
    with open('urllist.pickle', 'wb') as handle:
        pickle.dump(urlList, handle)
else:
    driver.quit()
    # reading data from the pickle file
    with open('urllist.pickle', 'rb') as handle:
        urlList = pickle.load(handle)
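
Not part of the original flow, but a quick sanity check worth adding after the pickle is loaded is to print how many URLs were actually collected per breed, since the scrolling can stop short of the 250 target:

# optional sanity check: how many unique urls were collected per breed
for breed, urls in urlList.items():
    print(breed, len(urls))
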

The chunk below shows the preparation of dictionaries mapping each breed to its folder name and file-name prefix. The final function call (persistImage) copies the files into the dog breed folders. The file name starts with the short code for the breed, e.g. a French Bulldog image will be saved as FB11.jpg.

# the data in this section is required for parsing the image urls and storing them
# in the respective folders
# create a dictionary with breed and its small name
dogBreedSmallList = {dogBreedList[i]:dogBreedSmallList[i] for i in range(0,len(dogBreedSmallList))}
# create a dictionary with breed and its folder name
dogFolderList = {dogBreedList[i]:dogFolderList[i] for i in range(0,len(dogFolderList))}
# folder where the subfolders are already created
folder_path = "your_folder_path"
# function to read all the urls and save the images
persistImage(folder_path,urlList,dogFolderList,dogBreedSmallList)

ImageScrapeMod.py — This script contains the functions (shown in the snippet below) for searching the dog breeds on Google Images and retrieving their URLs. The code flow is:
1. Build a Google Images query with the dog breed name and search for the images.
2. Scroll to the end of the page and collect all the image URLs.
3. Repeat the process until the end of the results is reached or the required image count is reached (in this case 250 images per breed).
Most of the code has been taken as-is from the post linked above, so some details like the find_elements CSS selectors and other Selenium-specific code have been kept unchanged.

from AllImportMods import time, By, Image, random, requests, io, os

# function to scroll till the end of the page to get all the images in that page
def scrollToEnd(wd, sleepBetweenInteractions):
    wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(sleepBetweenInteractions)

# function to search a breed on Google Images and collect the image urls
def fetchImageUrls(query, maxLinksToFetch, wd, sleepBetweenInteractions: int = 1):
    # build the google query
    searchUrl = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"
    # load the page
    wd.get(searchUrl.format(q=query))
    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < maxLinksToFetch:
        # scroll till the end so that all images on the page are loaded
        scrollToEnd(wd, sleepBetweenInteractions)
        # get all image thumbnail results
        thumbnail_results = wd.find_elements(by=By.CSS_SELECTOR, value="img.Q4LuWd")
        number_results = len(thumbnail_results)
        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")
        # loop over every image thumbnail found
        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail so that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleepBetweenInteractions)
            except Exception:
                continue
            # extract image urls
            actual_images = wd.find_elements(by=By.CSS_SELECTOR, value='img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))
        image_count = len(image_urls)
        # check if the required number of image links has been collected
        if len(image_urls) > maxLinksToFetch:
            print(f"Found: {len(image_urls)} image links, done!")
            break
        # otherwise look for more results and continue after pausing for 5 sec;
        # if the end of the results is reached then break out of the loop
        else:
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(5)
            # find_element_by_css_selector is removed in Selenium 4, so use find_elements with By here as well
            load_more_button = wd.find_elements(by=By.CSS_SELECTOR, value=".mye4qd")
            if load_more_button:
                wd.execute_script("document.querySelector('.mye4qd').click();")
            if results_start == number_results:
                break
        # move the result startpoint further down
        results_start = len(thumbnail_results)
    return image_urls

The second function reads the image URLs, downloads the images and saves them in the respective folders. The program flow is:
1. For each breed, keep 150 images for training, 50 for testing and the rest as extra data.
2. Try to download each URL and save the image in the folder for that dog breed.
A count is used because some links are not downloadable, which would otherwise leave the training and testing folders short of images. The image indices are also shuffled randomly before being split into the train and test folders.

# put the images into their proper path in the train, test and backup folders
def persistImage(folder_path, urlList, dogFolder, dogBreedSmallList):
    # 250 urls are fetched per breed; shuffle the indices so the split is random
    tempList = list(range(0, 250, 1))
    random.seed(42)
    random.shuffle(tempList)
    # depending on count the file is moved into train, test or backup
    for keyDict in urlList.keys():
        count = 0
        folderPath = ''
        urlData = list(urlList[keyDict])
        for cIndex in tempList:
            # skip indices beyond the number of urls actually collected
            if cIndex >= len(urlData):
                continue
            if count < 150:
                folderPath = 'train_data'
            elif count < 200:
                folderPath = 'test_data'
            else:
                folderPath = 'extra_data'
            try:
                image_content = requests.get(urlData[cIndex], timeout=10).content
            except Exception as e:
                print(f"ERROR - Could not download {urlData[cIndex]} - {e}")
                # skip to the next url so stale content is not reused
                continue
            try:
                image_file = io.BytesIO(image_content)
                image = Image.open(image_file).convert('RGB')
                file_path = os.path.join(folder_path, folderPath, dogFolder[keyDict], dogBreedSmallList[keyDict] + str(cIndex) + '.jpg')
                with open(file_path, 'wb') as f:
                    image.save(f, "JPEG", quality=85)
                count += 1
                print(count)
                # print(f"SUCCESS - saved {urlData[cIndex]} - as {file_path}")
            except Exception as e:
                print(f"ERROR - Could not save {urlData[cIndex]} - {e}")

After running the scripts you can see the images inside each dog breed folder.
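
One quick way to verify that the roughly 150/50 split landed where it should is to count the saved files per folder; a small sketch, assuming the same base path and folder names used earlier:

# count the saved images per split and breed folder
import os

base_path = "your_folder_path"  # same base folder passed to persistImage
for split in ['train_data', 'test_data', 'extra_data']:
    for breed_folder in ['terrier', 'bulldog', 'malamute', 'husky']:
        folder = os.path.join(base_path, split, breed_folder)
        print(split, breed_folder, len(os.listdir(folder)))
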

Phew! This has been a code-heavy post 😅. With this we come to the end of the first part. In the next part we will see the training and testing of the model.

References -
https://towardsdatascience.com/image-scraping-with-python-a96feda8af2d
Lots of googling, Stack Overflow and Medium

Originally published at http://evrythngunder3d.wordpress.com on July 10, 2022.
