Understanding Statsmodels Summary Feature
Python has a lot of packages available for statistics and machine learning. One of them is statsmodels, which provides classes and functions for estimating many different statistical models, as well as for conducting statistical tests and exploring statistical data. More specifically, when you fit an ordinary least squares (OLS) model (OLS is one of the ways to do linear regression), the fitted result has a summary method that prints detailed statistical information. In this post we will try to understand some of the statistical terms in that output, which will help in understanding the analysis better. Let's start with a sample code and from there discuss the different statistical measures.
Code:
This is a sample code which uses the statsmodels package to fit ordinary least squares models.
# note: load_boston was removed in scikit-learn 1.2, so this snippet needs an older scikit-learn version
from sklearn.datasets import load_boston
import pandas as pd
import statsmodels.api as sm

X, y = load_boston(return_X_y=True)
allData = load_boston()  # full dataset object (not used below)

# column 12 - Lower Status of Population (LSTAT), column 10 - pupil-teacher ratio (PTRATIO)
indVarOne = pd.DataFrame(X[:, 12], columns=['Lower Status of Population'])
indVarTwo = pd.DataFrame(X[:, 10], columns=['Pupil Teacher Ratio'])
depVar = pd.DataFrame(y, columns=['Median Value of Home'])

# independent var one - Lower Status of Population
indVarOne = sm.add_constant(indVarOne)
model = sm.OLS(depVar, indVarOne).fit()
predictions = model.predict(indVarOne)
print_model = model.summary()
print(print_model)

# independent var two - pupil-teacher ratio
indVarTwo = sm.add_constant(indVarTwo)
model = sm.OLS(depVar, indVarTwo).fit()
predictions = model.predict(indVarTwo)
print_model = model.summary()
print(print_model)
On running the above code we get the below summary:
The summary contains lots of statistical measures; we will go through some of the important ones.
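If you only need one or two of these numbers rather than the whole table, the fitted results object exposes them directly as attributes. Below is a minimal sketch, assuming the model variable from the code above:
# standard attributes of a fitted statsmodels OLS results object
print(model.rsquared)      # R-squared
print(model.rsquared_adj)  # Adj. R-squared
print(model.fvalue)        # F-statistic
print(model.f_pvalue)      # Prob (F-statistic)
print(model.llf)           # Log-Likelihood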
R Squared:
R Squared (also called the coefficient of determination) is a measure of how much of the variation in the dependent variable is explained by the independent variable. It basically shows, as a percentage, the variation explained by the relationship between the two variables. This is a bit difficult to grasp at first, so let's go step by step. First, the formula for R Squared:
R² = 1 − (Unexplained Variation / Total Variation)
or, more simply,
R² = 1 − (Variation of line / Variation of mean)
To understand the formula, let's take the example of a Cost of house vs Area of house graph as shown below:
You can see that on the left we have a regression line drawn which fits the points best, and on the right we have the mean (average) line. Variation of line is simply the sum of the squared differences between the data points and the regression line. The formula is the same for variation of mean, except we take the squared differences between the data points and the mean line.
The better a line fits, the smaller its variation will be. So for this example let's say the variation of the line is 5 whereas the variation of the mean is 30. Calculating the R Squared value we get
R² = 1 − 5/30, which comes out to be 83.33%
This means that around 83.33% of the variation in the data can be explained by the Price/Area relationship.
In the output of the code above we can likewise see that 'Lower Status of Population' has a higher R squared value than 'Pupil Teacher Ratio'. The higher the R squared value, the better the independent variable is at explaining the variation in the dependent variable.
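To make the formula concrete, here is a small sketch that computes R squared by hand from the two variations described above (the numbers are made up purely for illustration):
import numpy as np

y = np.array([10.0, 14.0, 15.0, 20.0, 26.0])       # actual values
y_pred = np.array([11.0, 13.0, 16.0, 21.0, 24.0])  # values predicted by the regression line

variation_of_line = np.sum((y - y_pred) ** 2)    # unexplained variation
variation_of_mean = np.sum((y - y.mean()) ** 2)  # total variation

r_squared = 1 - variation_of_line / variation_of_mean
print(r_squared)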
Adjusted R Squared:
Though R squared gives a good idea of which independent variables to use, it does not give a correct picture when multiple independent variables are used to predict a dependent variable. The R squared value keeps increasing even if you add a meaningless independent variable to the regression.
To overcome this issue the Adjusted R squared value is used. This value accounts for the number of independent variables in the model: adding independent variables that are not useful will reduce the Adjusted R squared value, which helps in deciding which independent variables to keep.
The formula for calculating Adjusted R squared is given below:
Adjusted R Squared = 1 − ((1 − R Squared) * (N − 1)) / (N − p − 1)
where N is the sample size, p is the number of predictors (independent variables), and R Squared is the value described above.
So as the number of predictors increases, the Adjusted R Squared value will decrease if the added independent variables do not actually help explain the dependent variable.
Important note: you must have noticed that this formula depends on the sample size. As the sample size increases, the difference between R squared and Adjusted R squared approaches zero. Basically, if you have a large sample size and very few independent variables, you can be fairly confident in the R squared value as it won't be noticeably inflated.
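As a quick sketch, the formula above translates directly into a small helper (the function name and the example numbers here are mine, for illustration only):
def adjusted_r_squared(r_squared, n, p):
    # n = sample size, p = number of predictors (independent variables)
    return 1 - ((1 - r_squared) * (n - 1)) / (n - p - 1)

# e.g. an R squared of 0.54 from 506 samples and a single predictor
print(adjusted_r_squared(0.54, 506, 1))
In the statsmodels summary this value is reported as Adj. R-squared (the rsquared_adj attribute shown earlier).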
F-statistics:
In statistics, the F-statistic is a value obtained by running an ANOVA test or a regression analysis to find out whether the means of two populations are significantly different. In the case of linear regression, we compare the fitted linear model against a model in which the effect of the variables is set to 0. Basically, if the equation of the line is y = mx + c, we set m to 0 and then check whether the original line fits significantly better. For interpreting the F-statistic we also need an alpha value and an F-table. The alpha value is also called the level of significance; a level of significance of 0.05 indicates a 5% risk of concluding that a difference exists when there is no actual difference.
Prob (F-statistic) is the p-value associated with the F-statistic: it tells you how likely it is that you would see a fit this good if the true effect of your variables were 0. A small value is evidence against that null hypothesis.
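In statsmodels both numbers are available directly on the fitted results, so the check against the level of significance can be written as below (assuming the model variable from the earlier code and an alpha of 0.05):
alpha = 0.05  # level of significance

print(model.fvalue)    # F-statistic
print(model.f_pvalue)  # Prob (F-statistic)

if model.f_pvalue < alpha:
    print('Reject the null hypothesis: the variables have a significant effect')
else:
    print('Fail to reject the null hypothesis')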
Log-likelihood:
Likelihood is a measure of how likely it is that we would observe a dataset like the one we have, given the regression equation and the distributional assumptions about the data (its mean, standard deviation, variance and so on). The higher the likelihood, the better the fit of the model. The log of the likelihood is the Log-Likelihood; the log is taken because it makes the mathematics easier and helps prevent numerical underflow. Log-Likelihood can lie anywhere between -Inf and +Inf, so it is only meaningful when comparing multiple models fit to the same data.
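Since the Log-Likelihood is only meaningful as a comparison, a quick sketch is to fit both models from the earlier code and compare their llf attributes (variable names taken from that code):
model_one = sm.OLS(depVar, indVarOne).fit()  # Lower Status of Population
model_two = sm.OLS(depVar, indVarTwo).fit()  # Pupil Teacher Ratio

# the higher (less negative) Log-Likelihood indicates the better fit on the same data
print(model_one.llf, model_two.llf)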
Skew and Kurtosis:
The most common distribution is the normal distribution, where the mean, median and mode all lie at the center. If the distribution is pulled to one side so that the mean, median and mode no longer coincide, the distribution is skewed. The skew can be to the left or to the right. The formula for skewness is as below:
Skewness = 3 * (Mean − Median) / Standard Deviation
If the value is between -0.5 and 0.5 the data has low skew, between -1 and -0.5 or between 0.5 and 1 it is moderately skewed, whereas anything beyond -1 or 1 is highly skewed data.
Kurtosis is a measure of how peaked the distribution is and how heavy its tails are. The normal distribution has a kurtosis of 3, and kurtosis can range from 1 to Inf. A kurtosis greater than 3 means a sharper peak with heavier tails, whereas a kurtosis less than 3 means a flatter peak.
Together, these two give an idea of the shape of the data distribution.
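A small sketch of both measures, using the median-based skewness formula above and scipy for kurtosis (scipy's kurtosis returns excess kurtosis by default, so fisher=False is passed to get the scale where a normal distribution is 3; the sample data is made up):
import numpy as np
from scipy.stats import kurtosis

data = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 5.0, 9.0])

skewness = 3 * (data.mean() - np.median(data)) / data.std()
print(skewness)

print(kurtosis(data, fisher=False))  # roughly 3 for normally distributed data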
P-value:
This is used for accepting or rejecting a null hypothesis. Hypothesis testing is used to test the validity of a claim, called the null hypothesis, that is made about a population using sample data. The alternative hypothesis is the one you would believe if the null hypothesis is concluded to be untrue.
The lower the p-value, the stronger the evidence for rejecting the null hypothesis and accepting the alternative hypothesis. The threshold used is the level of significance (alpha), which sets how strong the evidence must be to call a result statistically significant. The null hypothesis is rejected when the p-value is less than the alpha value (see the F-statistic section above).
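The same p-value vs alpha comparison can be made for each coefficient in the regression. A minimal sketch, assuming the model from the earlier code (these are the P>|t| values in the summary table):
alpha = 0.05

print(model.pvalues)  # p-value of each coefficient

for name, p in model.pvalues.items():
    if p < alpha:
        print(name, 'is statistically significant')
    else:
        print(name, 'is not statistically significant')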
I hope this post helped in understanding the different measures commonly seen in the summary method of statsmodels, and that it helps you analyse your models better!
References:
- A lot of googling amongst which the major sources were machinelearningplus.com, medium.com, geeksforgeeks.org, towardsdatascience.com, analyticsvidhya.com
Originally published at http://evrythngunder3d.wordpress.com on May 15, 2021.