Web Scraping using Selenium in Python
Web scraping is a technique for extracting data from websites. The content of the pages is parsed, and the data is copied into a spreadsheet or loaded into a database, where it can serve as a data source for data analytics and machine learning. Disclaimer: Scraping websites may not be lawful and could potentially be illegal, so check the laws of your country. The goal of this post is purely educational, and web scraping is not encouraged, especially when a site's terms and conditions forbid it.
Web scraping in Python:
Python has a host of libraries for web scraping, such as BeautifulSoup, Scrapy and Selenium. In this post we will scrape data using Selenium. It is easy to use and makes automated scraping straightforward, though it is a bit slow and resource hungry. To use Selenium, some additional software has to be installed.
Software Requirements:
First we need chromedriver, which can be downloaded from https://chromedriver.chromium.org/downloads. Chromedriver is a web driver for Chrome that enables automated testing of web apps through the browser. Using it, we will open the website, parse the different HTML tags and extract their data. Save the exe in a folder; we will use its path in Python to access the driver.
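As a quick sanity check, you can point Selenium at the saved driver and open a page. This is a minimal sketch; the folder path below is just an example, so adjust it to wherever you saved the exe:

from selenium import webdriver

# example path, change this to the folder where you saved chromedriver
DRIVER_PATH = r'C:\Selenium\ChromeDriver\chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://www.scrapethissite.com')
driver.quit()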
The Selenium library also has to be installed. For Anaconda the command is —
conda install -c conda-forge selenium
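If you are not using Anaconda, the same library can be installed with pip instead:

pip install selenium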
Once all of this is set up, fire up Python and we will start scraping!
Web Site:
There are a lot of websites that allow scraping practice. I just selected one among the top 5 results from Google 😅. The site used is https://www.scrapethissite.com. From it I have chosen a test page that contains hockey team statistics spread across multiple pages. We will scrape the data from this site and create our own dataset.
Basics:
A very rudimentary knowledge of HTML is enough for scraping websites. We will look for the specific data available and populate some columns of the table data in a DataFrame. To check the structure of the HTML code and locate the data, press F12 on the page in Google Chrome. On the right-hand side you will see the HTML code with the tags, ids, classes etc. used in the website. The image below demonstrates this:
You can see that the site has a number of pages. By changing the page_num parameter in the URL we can iterate through the pages and collect the table data, since every page uses the same HTML structure as above.
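The page number is the only thing that changes between pages, so the pagination logic is just string concatenation. A small sketch of the URL pattern:

# building the per-page URLs for the hockey table
base = 'https://www.scrapethissite.com/pages/forms/?page_num='
for page in range(1, 3):
    print(base + str(page))  # ...?page_num=1, ...?page_num=2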
Finding the data:
Now we will check how the data is laid out in the table. The approach is to find the relevant HTML tags and extract the data from them. As you can see below, the table headings are stored in <th> tags. We extract these first, since they will become the headings of our DataFrame.
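In code that looks roughly like this, assuming driver is the Chrome driver opened earlier; find_elements_by_xpath returns every element matching the XPath:

# collect the table headings, e.g. Team Name, Year, Wins, Losses
columnNames = []
for heading in driver.find_elements_by_xpath("//th"):
    columnNames.append(heading.text)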
To extract the column data, we look at the <td> tags and see that each column carries its own class attribute. Using these classes we can extract the name, year, wins and losses columns.
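For instance, the team names can be pulled out as sketched below, using the same driver object; the other columns work identically with the year, wins and losses classes:

# each column has its own class on the <td> tag
nameCells = driver.find_elements_by_xpath("//td[@class='name']")
teamNames = [cell.text for cell in nameCells]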
With this data we can create a DataFrame to use as a database. The complete code is below:
# import libraries
from selenium import webdriver
import pandas as pd

# function to extract text from table data
def getData(allData):
    tempData = []
    for data in allData:
        tempData.append(data.text)
    return tempData

# setting driver path (raw string so the backslashes are kept as-is)
DRIVER_PATH = r'C:\Selenium\ChromeDriver\chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)

columnNames = []
dataToInsert = []

# iterating through pages 1 and 2
for page in range(1, 3):
    # giving the url page wise
    basePath = 'https://www.scrapethissite.com/pages/forms/?page_num=' + str(page)
    driver.get(basePath)
    # extracting the headings once
    if page == 1:
        headingData = driver.find_elements_by_xpath("//th")
        for heading in headingData:
            columnNames.append(heading.text)
    # extracting required data
    nameData = getData(driver.find_elements_by_xpath("//td[@class='name']"))
    yearData = getData(driver.find_elements_by_xpath("//td[@class='year']"))
    winData = getData(driver.find_elements_by_xpath("//td[@class='wins']"))
    lossData = getData(driver.find_elements_by_xpath("//td[@class='losses']"))
    # appending the data for final dataframe creation
    dataToInsert.extend(list(zip(nameData, yearData, winData, lossData)))

# closing the opened chrome browser
driver.quit()

# creating dataframe
df = pd.DataFrame(data=dataToInsert, columns=columnNames[:4])
find_elements_by_xpath is a built-in Selenium method that returns every element matching the given XPath expression. When the script runs, a Chrome window pops up, the driver extracts the data and the browser closes once two pages' worth of data have been collected. You can see the dataframe as below:
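If you want to inspect or persist the result yourself, standard pandas calls do the job; the CSV filename here is just an example:

# peek at the first few rows and save the dataset to disk
print(df.head())
df.to_csv('hockey_teams.csv', index=False)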
The above example was a simple exercise in how websites can be scraped and a small dataset created for machine learning. In practice, websites will have more complex HTML structure along with dynamic JavaScript, and there are many other libraries you can use for the task. But the basic idea remains the same: scrape the data from the different tags and build structured data out of it.
I hope this blog gave you a taste of how web scraping is done. To reiterate what I wrote at the start: scraping websites may not be lawful and could potentially be illegal, so check the laws of your country. The goal of this post is purely educational, and web scraping is not encouraged, especially when a site's terms and conditions forbid it.
Happy Learning 😄!
References:
- A lot of googling, amongst which the major sources were stackoverflow.com, medium.com and youtube.com
Originally published at http://evrythngunder3d.wordpress.com on November 6, 2021.