Scraping Data using YouTube API

Shrinand Kadekodi
9 min read · Feb 26, 2022

One of the major steps involved in Machine Learning or Deep Learning is data collection. After collecting the data, we analyze it and then move on to the subsequent steps. In this post we will look at how to extract data from YouTube videos using the API provided by Google. After extracting the data, we will also try to do some analysis using Matplotlib.
Disclaimer: I have extracted the data using the YouTube API only for study, with no intention to market the data or use it for monetary purposes!

API Key Generation

For accessing YouTube data using the API we need some setup done beforehand. The first step is to create a Google Cloud Platform (GCP) account. From there we create a project and generate the API key which will act as our identity when accessing the data. I found the below YouTube video to give a good explanation of the process of creating an API key:
https://www.youtube.com/watch?v=th5_9woFJmk
There are other videos and posts as well which explain this in detail. For me the above single video was sufficient to get the API key generated 😅.

Python Libraries

Since we are using Python, everything is pretty much a library away! I installed the google-api-python-client library for this task. It can be installed in Anaconda with the following command:
conda install -c conda-forge google-api-python-client
But for this exercise I am using the PyCharm IDE from JetBrains. I was facing frequent Spyder IDE crashes when trying to debug my code 😥 and hence made the switch. PyCharm is a bit different from Anaconda and may take some time to get adjusted to, but with some Googling I was able to get the basic coding and debugging working 😅.
Once you have the API key and have installed the library, we are good to go!
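If you are not using conda (PyCharm projects typically sit on pip-based virtual environments), the pip equivalent works just as well:

pip install google-api-python-client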
Some useful links to refer to:
To see the different API methods and parameters — https://developers.google.com/youtube/v3/docs/search/list
To look up the video category IDs (this will be clear once we start coding) — https://gist.github.com/dgp/1b24bf2961521bd75d6c

Using the API

When using the API we need to know the format in which we pass our request and receive the data. This can be seen in the API docs linked above. Basically, we query for different results as per our requirement.
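Before running any query we also need the client object itself, built from the API key generated earlier. Here is a minimal sketch (the key string is a placeholder; the variable name youtubeService is the one used in all the snippets that follow):

# build the YouTube Data API v3 client using the generated API key
from googleapiclient.discovery import build

api_key = 'YOUR_API_KEY'  # placeholder: paste the key generated on GCP here
youtubeService = build('youtube', 'v3', developerKey=api_key)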
The exercise which I am doing in this blog is to find the most popular videos in my region. The code snippet for the same is as below:

request = youtubeService.videos().list(part='snippet,contentDetails,statistics',
                                       chart='mostPopular', regionCode='IN',
                                       videoCategoryId='28', maxResults=50)

For now just concentrate on the list method. You can see that I have passed a few parameters to this method to control the data we receive. If you go to the YouTube API docs you can see what the part parameter accepts.

It is a required parameter which can return a host of data like contentDetails, statistics etc. Depending on your use case you can select any of those; for me the three values 'snippet, contentDetails, statistics' were sufficient. The details of the other parameters are also listed in the same docs.

The YouTube API docs are not fully detailed, though. For example, for finding the video category IDs I had to refer to other resources! I mean, why make it so hard when you could just list the video categories on that page itself! Video category 28 corresponds to Science & Technology (you can check this in the second link under the useful links above).
The regionCode is the geographical region code, where IN stands for India.
So if we try to make sense of the code, it says: I want to get the 50 most popular Science & Technology videos in India, along with their statistics, content details and snippet.
And you would expect it to work exactly like that. Except it does not!
It gave me a list of videos from all sorts of video categories! I was perplexed and tried to find out what went wrong. It seems that if you query for the most popular videos, the video category doesn't matter 😥. It will just return the most popular videos in that region.
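You can verify this yourself with a quick check on the response (this assumes the request above has already been executed into a response dict, as shown in the next snippet):

# quick sanity check: which category ids actually appear in the "most popular" result?
returnedCategories = {item['snippet']['categoryId'] for item in response['items']}
print(returnedCategories)  # in my run this showed several ids, not just '28'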

Anyway, moving forward: the above code will give at most 50 videos' worth of data. To get more, we can use a simple for or while loop to page through the results. At most you can get around 500 videos' worth of data this way. For this exercise I have taken just 100 videos. The code snippet is below:

# initializing some basic data
getTopVideos = []
totalPages = 2
pageCounter = 0

# gathering the video data
while pageCounter != totalPages:
    response = request.execute()
    for resp in response['items']:
        getTopVideos.append(resp)
    # move to the next page of results
    request = youtubeService.videos().list_next(request, response)
    pageCounter = pageCounter + 1
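One small caveat: from what I have seen, list_next() returns None once there are no further pages, and calling execute() on None would raise an error. A slightly safer version of the same loop (same variable names) would be:

# safer pagination: stop early when there are no more pages to fetch
getTopVideos = []
totalPages = 2
pageCounter = 0
while pageCounter != totalPages and request is not None:
    response = request.execute()
    getTopVideos.extend(response['items'])
    # list_next returns None when the last page has been reached
    request = youtubeService.videos().list_next(request, response)
    pageCounter = pageCounter + 1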

After getting the video details, I also fetched the corresponding channel details. The code snippet is below:

# get channel data for the videos
getTopChannelData = []
for topTech in getTopVideos:
    getResponse = youtubeService.channels().list(part='snippet,contentDetails,statistics',
                                                 id=topTech['snippet']['channelId']).execute()
    getTopChannelData.append(getResponse['items'][0])

youtubeService.close()

# some responses do not include subscriberCount (e.g. hidden counts); default it to 0
keyToCheck = 'subscriberCount'
for iCount in range(0, len(getTopChannelData)):
    tempData = getTopChannelData[iCount]['statistics']
    if keyToCheck not in tempData:
        getTopChannelData[iCount]['statistics']['subscriberCount'] = 0
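A small aside on quota: calling channels().list() once per video means one API call per video. As far as I understand, the id parameter also accepts a comma-separated list of channel IDs (up to roughly 50 per call), so a batched sketch could look like the one below. Note that the response order is not guaranteed to match the list order, so the channels are indexed by id instead of by position:

# batched variant: fetch up to 50 channels per API call and index them by channel id
channelIds = list({video['snippet']['channelId'] for video in getTopVideos})
channelById = {}
for start in range(0, len(channelIds), 50):
    batch = ','.join(channelIds[start:start + 50])
    batchResponse = youtubeService.channels().list(part='snippet,contentDetails,statistics',
                                                   id=batch).execute()
    for item in batchResponse['items']:
        channelById[item['id']] = item
# lookup example: channelById[getTopVideos[0]['snippet']['channelId']]['statistics']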

After getting all the data, I created a final dataframe which contains the data for all 100 videos. You can see that each column name corresponds to a key of the dictionary. The code snippet is below:

finalList = []
# category of video (index = categoryId, built from the gist linked earlier)
cats = ['', 'Film & Animation', 'Autos & Vehicles', '', '', '', '', '', '', '',
        'Music', '', '', '', '', 'Pets & Animals', '', 'Sports', 'Short Movies',
        'Travel & Events', 'Gaming', 'Videoblogging', 'People & Blogs',
        'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style',
        'Education', 'Science & Technology', 'Nonprofits & Activism',
        'Movies', 'Anime/Animation', 'Action/Adventure', 'Classics',
        'Comedy', 'Documentary', 'Drama', 'Family', 'Foreign',
        'Horror', 'Sci-Fi/Fantasy', 'Thriller', 'Shorts',
        'Shows', 'Trailers']

# creating the dataframe
for iCount in range(0, 100):
    tempDict = {}
    tempDict['Channel_Title'] = getTopChannelData[iCount]['snippet']['title']
    tempDict['Video_Title'] = getTopVideos[iCount]['snippet']['title']
    tempDict['Video_type'] = int(getTopVideos[iCount]['snippet']['categoryId'])
    tempDict['Video_type_name'] = cats[int(getTopVideos[iCount]['snippet']['categoryId'])]
    tempDict['Subscribers'] = int(getTopChannelData[iCount]['statistics']['subscriberCount'])
    tempDict['Video_Views'] = int(getTopVideos[iCount]['statistics']['viewCount'])
    tempDict['Like_Count'] = int(getTopVideos[iCount]['statistics']['likeCount'])
    # some videos do not have a comment count; default to 0 to prevent a KeyError
    try:
        tempDict['Comment_Count'] = int(getTopVideos[iCount]['statistics']['commentCount'])
    except KeyError:
        tempDict['Comment_Count'] = 0
    tempDict['Video_Count'] = int(getTopChannelData[iCount]['statistics']['videoCount'])
    finalList.append(tempDict)

createDataframe = pd.DataFrame(finalList)
createDataframe.to_excel('YouTube_data.xlsx')
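As a side note, the category names in the cats list (taken from the gist linked earlier) can also be fetched from the API itself through the videoCategories endpoint. A small sketch, assuming it runs before youtubeService.close() is called:

# build an id -> category-name mapping straight from the API
categoryResponse = youtubeService.videoCategories().list(part='snippet', regionCode='IN').execute()
categoryNames = {item['id']: item['snippet']['title'] for item in categoryResponse['items']}
# usage: categoryNames.get(getTopVideos[0]['snippet']['categoryId'], 'Unknown')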

The final result is exported as an Excel report. You can also export it to any other format like CSV or to a database.
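For example, swapping the last export line is all it takes. The sketch below writes a CSV file and a SQLite table (the file and table names are just placeholders):

# export alternatives: a CSV file or a SQLite database table
import sqlite3

createDataframe.to_csv('YouTube_data.csv', index=False)
with sqlite3.connect('youtube_data.db') as conn:
    createDataframe.to_sql('top_videos', conn, if_exists='replace', index=False)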

Analyzing Data

Now that we have the YouTube video data with us, let’s see if it is possible to get some insights from it.
First, I just wanted to know how many videos belong to each category. This could give me a broad idea of which categories dominate. A simple pie chart gave me the below result:

It seems that the 3 categories (People & Blogs, Entertainment, Science & Technology) are highly popular with users: out of 100 videos, nearly 84 fall into these 3 categories alone!

Let us dig a bit deeper into this. We have the following data: video view count, comment count and like count. First let us see which category has the highest view count, i.e. which category has the highest total number of video views. It is a simple bar chart as shown below:

Well now this is interesting… though the Comedy category has only 4 videos, it has a view count of more than 60 million! This could be due to one very big influencer video!
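If you want to check whether a single huge video is behind the Comedy number, a quick sketch like this (on the same createDataframe built above) pulls out the most viewed video in each category:

# the single most viewed video in each category, to spot outliers
topPerCategory = createDataframe.loc[createDataframe.groupby('Video_type_name')['Video_Views'].idxmax(),
                                     ['Video_type_name', 'Video_Title', 'Channel_Title', 'Video_Views']]
print(topPerCategory.sort_values('Video_Views', ascending=False))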
Let us now look at the like count, again summed per video category, and see if anything changes. The graph is a simple bar graph as shown below:

Well, the proportions seem to be more or less the same. Video categories having high views have high likes as well. We have one more metric: the comment count. The same bar chart is used and can be seen below:

Again the data throws up some interesting insights! The number of comments is highest in the Science & Technology category! Even the Education category sees a huge jump in comment count compared to its view and like counts. This could be because in these two categories there are lots of comment replies to queries and discussions 😅. It seems user engagement is high in these categories!
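To put a rough number on this, one could compute comments per 1,000 views per category on the same dataframe. A small sketch (the metric itself is just my own rough proxy for engagement):

# rough engagement proxy: comments per 1,000 views, summed per category
engagement = createDataframe.groupby('Video_type_name')[['Comment_Count', 'Video_Views']].sum()
engagement['Comments_per_1000_views'] = engagement['Comment_Count'] / engagement['Video_Views'] * 1000
print(engagement['Comments_per_1000_views'].sort_values(ascending=False))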
Since I also have the channel data along with its subscriber count, I thought of getting the top 10 channels amongst these videos. The dataframe screenshot is below:

70% of them are Science & Technology! And the subscriber counts are in the millions. That means Technical Guruji has 22 million subscribers 🤯 mind blowing!

The full code for this exercise is below for reference:

# importing required libraries
import pandas as pd
from googleapiclient.discovery import build
import matplotlib.pyplot as plt

# API key
api_key = 'your_api'

# construct a Resource object for interacting with the YouTube Data API
youtubeService = build('youtube', 'v3', developerKey=api_key)

# requesting the data
request = youtubeService.videos().list(part='snippet,contentDetails,statistics',
                                       chart='mostPopular', regionCode='IN',
                                       videoCategoryId='28', maxResults=50)

# initializing some basic data
getTopVideos = []
totalPages = 2
pageCounter = 0

# gathering the video data
while pageCounter != totalPages:
    response = request.execute()
    for resp in response['items']:
        getTopVideos.append(resp)
    # move to the next page of results
    request = youtubeService.videos().list_next(request, response)
    pageCounter = pageCounter + 1

# get channel data for the videos
getTopChannelData = []
for topTech in getTopVideos:
    getResponse = youtubeService.channels().list(part='snippet,contentDetails,statistics',
                                                 id=topTech['snippet']['channelId']).execute()
    getTopChannelData.append(getResponse['items'][0])

youtubeService.close()

# some responses do not include subscriberCount (e.g. hidden counts); default it to 0
keyToCheck = 'subscriberCount'
for iCount in range(0, len(getTopChannelData)):
    tempData = getTopChannelData[iCount]['statistics']
    if keyToCheck not in tempData:
        getTopChannelData[iCount]['statistics']['subscriberCount'] = 0

finalList = []
# category of video (index = categoryId)
cats = ['', 'Film & Animation', 'Autos & Vehicles', '', '', '', '', '', '', '',
        'Music', '', '', '', '', 'Pets & Animals', '', 'Sports', 'Short Movies',
        'Travel & Events', 'Gaming', 'Videoblogging', 'People & Blogs',
        'Comedy', 'Entertainment', 'News & Politics', 'Howto & Style',
        'Education', 'Science & Technology', 'Nonprofits & Activism',
        'Movies', 'Anime/Animation', 'Action/Adventure', 'Classics',
        'Comedy', 'Documentary', 'Drama', 'Family', 'Foreign',
        'Horror', 'Sci-Fi/Fantasy', 'Thriller', 'Shorts',
        'Shows', 'Trailers']

# creating the dataframe
for iCount in range(0, 100):
    tempDict = {}
    tempDict['Channel_Title'] = getTopChannelData[iCount]['snippet']['title']
    tempDict['Video_Title'] = getTopVideos[iCount]['snippet']['title']
    tempDict['Video_type'] = int(getTopVideos[iCount]['snippet']['categoryId'])
    tempDict['Video_type_name'] = cats[int(getTopVideos[iCount]['snippet']['categoryId'])]
    tempDict['Subscribers'] = int(getTopChannelData[iCount]['statistics']['subscriberCount'])
    tempDict['Video_Views'] = int(getTopVideos[iCount]['statistics']['viewCount'])
    tempDict['Like_Count'] = int(getTopVideos[iCount]['statistics']['likeCount'])
    # some videos do not have a comment count; default to 0 to prevent a KeyError
    try:
        tempDict['Comment_Count'] = int(getTopVideos[iCount]['statistics']['commentCount'])
    except KeyError:
        tempDict['Comment_Count'] = 0
    tempDict['Video_Count'] = int(getTopChannelData[iCount]['statistics']['videoCount'])
    finalList.append(tempDict)

createDataframe = pd.DataFrame(finalList)
createDataframe.to_excel('YouTube_data.xlsx')

# pie chart of the number of videos per category
videoCount = createDataframe['Video_type_name'].value_counts()
fig = plt.figure(figsize=(10, 8))
plt.pie(videoCount, labels=videoCount.index, autopct='%1.1f%%')

# plot the total video view count for each category
videoViewCount = createDataframe.groupby('Video_type_name')['Video_Views'].sum()
fig = plt.figure(figsize=(10, 8))
plt.bar(videoViewCount.index, height=videoViewCount)
plt.xlabel('Video Type/Genre')
plt.ylabel('Video View Count')
plt.title('Video Type/Genre vs Video View Count')

# plot the total like count for each category
likeCount = createDataframe.groupby('Video_type_name')['Like_Count'].sum()
fig = plt.figure(figsize=(10, 8))
plt.bar(likeCount.index, height=likeCount)
plt.xlabel('Video Type/Genre')
plt.ylabel('Like Count')
plt.title('Video Type/Genre vs Like Count')

# plot the total comment count for each category
commentCount = createDataframe.groupby('Video_type_name')['Comment_Count'].sum()
fig = plt.figure(figsize=(10, 8))
plt.bar(commentCount.index, height=commentCount)
plt.xlabel('Video Type/Genre')
plt.ylabel('Comment Count')
plt.title('Video Type/Genre vs Comment Count')

# top 10 channels having the highest number of subscribers (in millions)
subCount = createDataframe.groupby(['Channel_Title', 'Video_type_name']).apply(lambda x: x['Subscribers'] / 1000000).nlargest(10).reset_index()
subCount = subCount.drop(columns='level_2')

Well, we have come to the end of this exercise. There are many other ways to analyze this data and get interesting insights; the above exercise just scratched the surface.
A thing to note is that the video data also contained Shorts. Also, the way in which YouTube decides which videos are most popular is not clear. There is definitely room for improvement in the YouTube API implementation, because when Googling issues I found many Stack Overflow answers citing limitations in the API. These stats could also change if you run the same code after a week or a month.

I hope this exercise was able to give you a start with the YouTube API. Let me know in the comments whether this was helpful!

References:
- A lot of Googling, with the major sources being stackoverflow.com and youtube.com

Originally published at http://evrythngunder3d.wordpress.com on February 26, 2022.
