# Different Ways to Conduct Linear Regression in Python

**INTRODUCTION OF LINEAR REGRESSION**

**Background**

Nowadays, statistical modeling can be done with only a few mouse clicks in statistical software such as SAS, Stata, and SPSS. While these powerful packages successfully serve their primary goal of making statistical modeling more accessible, end-users generally have no way to delve into the inner workings of how the resulting numbers are actually computed. This disconnect between obtaining statistical results and understanding them theoretically impedes a deeper grasp of the subject matter. In contrast, Python, being a general-purpose programming language, allows both quick execution of statistical modeling and manual coding of statistical procedures "by hand". In this post, I will demonstrate how linear regression can be carried out in Python both the easy way (with equations and calculations hidden behind the scenes) and the hard way (with the actual equations and calculations coded by ourselves).

Linear regression is one of the simplest yet most powerful and intuitive statistical models in existence. It models an outcome (or dependent) variable that is continuous, and it can have one or more predictor (or explanatory, or independent) variables. When a linear regression model has only one predictor, it is called simple (or univariate) linear regression; when there is more than one predictor, it is called multiple linear regression. We will explore a simple linear regression model here.

**Systematic component**

E[y_{i}] = β_{0} + β_{1}X_{i}

**Random component**

y_{i} ~ i.i.d. N(E[y_{i}], σ^{2})

The systematic component of linear regression assumes a linear relationship between X and y. The parameters are the y-intercept (denoted β_{0}) and the slope (denoted β_{1}). The index i represents a particular subject in the data: y_{i} is the value of y for the ith subject, X_{i} is the value of X for the ith subject, and E[y_{i}] denotes the expected y-value for the ith subject. The random component indicates that each y_{i} is drawn independently (i.i.d. = independent and identically distributed) from a normal distribution (N) centered at E[y_{i}] with a constant variance σ^{2}.
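
To make the two components concrete, here is a minimal sketch (the parameter values and variable names are illustrative, not from the post) that simulates data from this model with numpy: each y_{i} is drawn independently from a normal distribution centered at β_{0} + β_{1}X_{i} with constant variance σ^{2}.

```python
import numpy as np

# Illustrative sketch: simulate the linear regression model above.
# beta_0, beta_1, and sigma are assumed "true" values chosen for the demo.
rng = np.random.default_rng(0)

beta_0, beta_1, sigma = 2.0, 3.0, 1.5
x = rng.uniform(0, 10, size=1000)        # predictor values

# Systematic component: the expected y for each subject
expected_y = beta_0 + beta_1 * x
# Random component: i.i.d. normal errors with constant variance sigma^2
y = expected_y + rng.normal(0, sigma, size=x.shape)

print(y.shape)
```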

**Assumptions**

- Linearity – linear relationship between predictor(s) and outcome,
- Constant variance (homoscedasticity) – different outcome variable values have the same variance in their errors, regardless of the values of the predictor(s),
- Independence of errors – errors of outcome values are uncorrelated with each other (i.e. people closer to one another geographically will not behave more similarly than people separated further apart geographically),
- Lack of multicollinearity – no two or more predictors are perfectly correlated with one another.
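
These assumptions can be probed numerically. Below is a rough sketch (not from the original post; the simulated data and split point are illustrative) that checks two of them: residuals from a fitted line should center on zero if the relationship is linear, and their spread should be similar across the predictor's range if the variance is constant.

```python
import numpy as np

# Simulate data that satisfies the assumptions, then fit a line
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 2.0 * x + rng.normal(0, 3.0, size=200)

slope, intercept = np.polyfit(x, y, 1)       # least-squares fit
residuals = y - (intercept + slope * x)

# Linearity: residuals from a least-squares fit average to (numerically) zero
print(residuals.mean())

# Homoscedasticity: residual spread is similar in the lower and
# upper halves of the predictor range
low, high = residuals[x < 5], residuals[x >= 5]
print(low.std(), high.std())
```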

**PYTHON FOR LINEAR REGRESSION – THE EASY WAY**

**Scikit-learn**

There are a number of open-source, comprehensive, and powerful data mining libraries available in Python. Scikit-learn (Sklearn) is among the most popular of them. Paired with a few other libraries, we will explore how to use Python to conduct a simple linear regression on sample data.

**Generating sample data and scatter plot**

First, let’s make our own sample data using Sklearn’s *make_regression* function, in which we specify the number of samples to be 30, the number of features to be 1, the noise to be 35 (the higher the noise, the greater the variance), and *random_state* to be 2 (fixing the random seed so we can reuse the same data set in later applications). Printing RegData shows that the data created is a tuple containing two arrays. The first array contains the predictor (X) and is assigned to RegData_X, while the second contains the outcome (y) and is assigned to RegData_y. We then use the pyplot module from another library called matplotlib to create a scatter plot.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Generate the sample data
RegData = datasets.make_regression(n_samples=30, n_features=1,
                                   n_informative=100, n_targets=1,
                                   bias=0.1, noise=35, random_state=2)
print(RegData)

# Assign X and y
RegData_X = RegData[0]
RegData_y = RegData[1]

# Create scatter plot
plt.scatter(RegData_X, RegData_y)
plt.show()
```

**Computing coefficients**

The goal of linear regression is to estimate the beta coefficients (β_{0} and β_{1}) from the available data. β_{0} is the y-intercept, while β_{1} is the slope, which indicates the magnitude of change in y per unit change in X. We can obtain these values with the following code. Note that the data set is regenerated with 70 samples this time. We first split the data, taking the last 20 samples as the test set and the remainder as the training set. The training data can then be fitted to the linear regression model and the beta coefficients estimated. Finally, the regression line is drawn by the *plt.plot* command.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Generate the sample data
RegData = datasets.make_regression(n_samples=70, n_features=1,
                                   n_informative=100, n_targets=1,
                                   bias=0.1, noise=35, random_state=2)

# Assign X and y
RegData_X = RegData[0]
RegData_y = RegData[1]

# Split into training and test sets (last 20 samples are the test set)
RegData_X_train = RegData_X[:-20]
RegData_X_test = RegData_X[-20:]
RegData_y_train = RegData_y[:-20]
RegData_y_test = RegData_y[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(RegData_X_train, RegData_y_train)

# The coefficients
print('Slope:', regr.coef_[0])
print('Intercept:', regr.intercept_)

# The mean squared error
print('Residual sum of squares: %.2f'
      % np.mean((regr.predict(RegData_X_test) - RegData_y_test) ** 2))

# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(RegData_X_test, RegData_y_test))

# Plot outputs
plt.scatter(RegData_X_test, RegData_y_test, color='black')
plt.plot(RegData_X_test, regr.predict(RegData_X_test),
         color='blue', linewidth=2, alpha=0.6)
plt.xticks(())
plt.yticks(())
plt.show()
```

The output is as follows:

```
Slope: 42.693251551
Intercept: -0.611401706515
Residual sum of squares: 1666.05
Variance score: -0.28
```

**PYTHON FOR LINEAR REGRESSION – THE HARD WAY**

**1) Hand calculation approach**

Sklearn allows a fairly easy implementation of machine learning algorithms. The general steps are:

- 1) read data,
- 2) split data and format data appropriately,
- 3) fit formatted data into a machine learning algorithm,
- 4) compute numerical outcomes, and
- 5) evaluate the performance of the learning algorithm.

Sklearn essentially does the math for the end-users in steps 4 and 5. In the following, we avoid Sklearn and instead "manually" code the calculation of the beta coefficients, namely the intercept (β_{0}) and slope (β_{1}). These values can be obtained from the equations below:

β_{1} = r_{Xy} · (s_{y} / s_{X}) = (mean(X)·mean(y) − mean(X·y)) / (mean(X)^{2} − mean(X^{2}))

β_{0} = mean(y) − β_{1}·mean(X)

That is, the slope equals the correlation between X and y multiplied by the standard deviation of y and divided by the standard deviation of X, which can be rewritten as an expression based primarily on the means of X, y, and X·y; the intercept then follows from the means of X and y. The Python code is as below:

```python
from statistics import mean
import numpy as np
from sklearn import datasets

# Generate the same sample data as before
RegData = datasets.make_regression(n_samples=70, n_features=1,
                                   n_informative=100, n_targets=1,
                                   bias=0.1, noise=35, random_state=2)

# Assign X and y
RegData_X = RegData[0]
RegData_y = RegData[1]

# Split into training and test sets (last 20 samples are the test set)
RegData_X_train = RegData_X[:-20]
RegData_X_test = RegData_X[-20:]
RegData_y_train = RegData_y[:-20]
RegData_y_test = RegData_y[-20:]

# Flatten into plain lists of floats (each X sample is a one-element row)
RegData_X_train = [float(i[0]) for i in RegData_X_train]
RegData_X_test = [float(i[0]) for i in RegData_X_test]
RegData_y_train = [float(i) for i in RegData_y_train]
RegData_y_test = [float(i) for i in RegData_y_test]

# Convert into numpy arrays (so we can better handle the
# element-wise operations in the functions below)
RegData_X_train = np.array(RegData_X_train, dtype=np.float64)
RegData_X_test = np.array(RegData_X_test, dtype=np.float64)
RegData_y_train = np.array(RegData_y_train, dtype=np.float64)
RegData_y_test = np.array(RegData_y_test, dtype=np.float64)

def best_fit_slope_and_intercept(xs, ys):
    slope = (((mean(xs) * mean(ys)) - mean(xs * ys)) /
             ((mean(xs)) ** 2 - mean(xs ** 2)))
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept

def squared_error(ys_orig, ys_line):
    return sum((ys_line - ys_orig) ** 2)

def coefficient_of_determination(ys_orig, ys_line):
    y_mean_line = [mean(ys_orig) for y in ys_orig]
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr / squared_error_y_mean)

# Fitting the training data into the functions
RegSlope, RegIntercept = best_fit_slope_and_intercept(RegData_X_train,
                                                      RegData_y_train)
print('Slope:', RegSlope)
print('Intercept:', RegIntercept)
```

We obtain exactly the same slope and intercept as with the Sklearn approach:

```
Slope: 42.693251551
Intercept: -0.611401706515
```

**2) Machine learning approach**

As the number of sample subjects and the number of predictors grow, the equations used to obtain the exact values of the slope and intercept (as in the "manual" calculation approach above) can become computationally intensive. Alternatively, a machine learning approach may be a more efficient way to estimate the slope and intercept for a large data set.

h_{θ}(X) = θ_{0} + θ_{1}X_{1} + θ_{2}X_{2} + … + θ_{n}X_{n}

The equation above is the same as the one depicting the systematic component of the linear regression model earlier: θ_{0} is the intercept and θ_{1} is the slope. h_{θ}(X) represents the expected y and, in machine learning, is considered the hypothesis. In other words, we hypothesize that y is a linear function of X. To simplify the notation, we let X_{0} = 1, such that:

h_{θ}(X) = Σ_{j} θ_{j}X_{j} = θ^{T}X

Thus, the hypothesis is the summation of each parameter-predictor pair, and it can be condensed as the dot product of the transpose of the vector θ (parameters) and the vector X (predictors).
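
For example (a tiny sketch with made-up numbers, not from the post), with the constant X_{0} = 1 prepended, the hypothesis reduces to a single dot product:

```python
import numpy as np

# theta = [theta_0 (intercept), theta_1 (slope)]; x = [X_0 = 1, X_1]
theta = np.array([2.0, 3.0])
x = np.array([1.0, 4.0])

# h_theta(x) = theta^T x = theta_0 * 1 + theta_1 * X_1
h = np.dot(theta, x)
print(h)  # 2 + 3*4 = 14.0
```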

**Cost function**

We want to estimate the parameters θ. One intuitive approach is to make h_{θ}(X) as close to the observed y values as possible. We measure this "closeness" between the predicted y values (h_{θ}(X)) and the actual observed y values with an equation called the cost function J(θ):

J(θ) = (1 / 2m) · Σ_{i=1}^{m} (h_{θ}(x^{(i)}) − y^{(i)})^{2}

The cost function captures how much deviance (or residual, or error) there is between the observed y values and their corresponding predicted values. The square operator ensures that overestimation and underestimation in the prediction are treated equally and do not cancel each other out when summed. This is called the least-squares cost function, because we want to minimize J(θ) so that the deviance between predicted and observed y is kept to a minimum.
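
As a sketch (the helper name and example values are made up for illustration), the cost function takes only a few lines of numpy, mirroring the cost computation used in the gradient-descent code later in the post:

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/(2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    residual = X.dot(theta) - y      # prediction error for every subject
    return residual.dot(residual) / (2 * m)

# First column of X is the constant X_0 = 1 (for the intercept)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])

print(cost(np.array([0.0, 2.0]), X, y))  # perfect fit -> cost 0.0
print(cost(np.array([0.0, 0.0]), X, y))  # poor fit -> larger cost
```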

**Gradient descent**

Now the objective is more defined: choose the parameters θ that minimize J(θ). This can be searched for computationally using an algorithm called gradient descent, which performs the following update rule for each parameter θ_{j}:

θ_{j} := θ_{j} − α · ∂J(θ)/∂θ_{j}

Substituting J(θ) with the right-hand side of the cost function and taking the partial derivative, we further derive the gradient descent update rule as:

Repeat until convergence {

θ_{j} := θ_{j} + α · (1/m) · Σ_{i=1}^{m} (y^{(i)} − h_{θ}(x^{(i)})) · x_{j}^{(i)}

}

∂J(θ)/∂θ_{j} is the partial derivative of J(θ) with respect to θ_{j}; it gives the "direction" of steepest descent along which the cost function decreases. (y^{(i)} − h_{θ}(x^{(i)})) is the error term. α is the learning rate, set by the user: the larger the learning rate, the faster the learning, but too large a value risks failing to converge.

**Using Python**

The aforementioned machine learning approach to linear regression can be coded in Python:

```python
import numpy as np
from sklearn import datasets

# Generate the same sample data as before
RegData = datasets.make_regression(n_samples=70, n_features=1,
                                   n_informative=100, n_targets=1,
                                   bias=0.1, noise=35, random_state=2)

# Assign X and y
RegData_X = RegData[0]
RegData_y = RegData[1]

# Split into training and test sets (last 20 samples are the test set)
RegData_X_train = RegData_X[:-20]
RegData_X_test = RegData_X[-20:]
RegData_y_train = RegData_y[:-20]
RegData_y_test = RegData_y[-20:]

# Prepend the constant X_0 = 1 to each sample so theta_0 acts as the intercept
RegData_X_train = [[1, float(i[0])] for i in RegData_X_train]
RegData_X_test = [[1, float(i[0])] for i in RegData_X_test]
RegData_y_train = [float(i) for i in RegData_y_train]
RegData_y_test = [float(i) for i in RegData_y_test]

# Convert into numpy arrays (so we can better handle
# the matrix operations in the function below)
RegData_X_train = np.array(RegData_X_train, dtype=np.float64)
RegData_X_test = np.array(RegData_X_test, dtype=np.float64)
RegData_y_train = np.array(RegData_y_train, dtype=np.float64)
RegData_y_test = np.array(RegData_y_test, dtype=np.float64)

def GradientDescent(x, y, theta, alpha, m, numIterations):
    xTrans = x.transpose()
    for i in range(0, numIterations):
        hypothesis = np.dot(x, theta)
        loss = hypothesis - y
        cost = np.sum(loss ** 2) / (2 * m)
        print("Iteration %d | Cost: %f" % (i, cost))
        # Average gradient per example
        gradient = np.dot(xTrans, loss) / m
        # Update the parameters
        theta = theta - alpha * gradient
    return theta

# Run the gradient descent
m, n = np.shape(RegData_X_train)

# Set some initial values arbitrarily
numIterations = 50000
alpha = 0.01
theta = np.ones(n)  # arbitrary initial theta

theta = GradientDescent(RegData_X_train, RegData_y_train,
                        theta, alpha, m, numIterations)
print(theta)
```

As expected, θ_{0} is -0.611 and θ_{1} is 42.693. The number of iterations, the learning rate 𝛼, and the initial θ are parameters set by us and can be changed to demonstrate how quickly the search converges to the final beta coefficients.
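
To illustrate the learning-rate trade-off, here is a toy sketch (entirely separate from the regression example; the one-dimensional cost and the values are made up) that counts how many iterations gradient descent needs with two different values of α:

```python
def steps_to_converge(alpha, tol=1e-6, max_iter=100_000):
    """Gradient descent on J(theta) = (theta - 5)^2, starting at theta = 0."""
    theta = 0.0
    for i in range(max_iter):
        gradient = 2 * (theta - 5)    # dJ/dtheta
        theta -= alpha * gradient
        if abs(gradient) < tol:       # gradient ~ 0 -> at the minimum
            return i + 1
    return max_iter

print(steps_to_converge(0.01))  # small learning rate: many iterations
print(steps_to_converge(0.1))   # larger learning rate: far fewer
```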
