Nowadays, statistical modeling can be done with a few mouse clicks in statistical software such as SAS, Stata, and SPSS. While these powerful packages successfully serve their primary goal of making statistical modeling more accessible, end-users generally have no way to delve into the inner workings of how the resulting numbers are actually computed. This disconnect between obtaining statistical results and understanding them theoretically impedes a deeper grasp of the subject matter. In contrast, Python, being a general-purpose programming language, allows both quick execution of statistical modeling and manual coding of statistical procedures "by hand". In this post, I will demonstrate how linear regression can be carried out in Python both the easy way (with the equations and calculations hidden behind the scenes) and the hard way (with the actual equations and calculations coded by ourselves).

Linear regression is one of the simplest yet most powerful and intuitive statistical models in existence. It is used to model a continuous outcome (or dependent) variable and can have one or more predictor (or explanatory, or independent) variables. When a linear regression model has only one predictor, it is called simple (or univariate) linear regression; when there is more than one predictor, it is called multivariate linear regression. We will explore a simple linear regression model here.

Systematic component

E[yi] = β0 + β1Xi

Random component

yi ~ i.i.d. N(E[yi], σ2)

The systematic component of linear regression assumes a linear relationship between X and y. The parameters include the y-intercept (denoted β0) and the slope (denoted β1). i indexes a particular subject in the data: yi is the value of y for the ith subject, Xi is the value of X for the ith subject, and E[yi] denotes the expected y-value for the ith subject. The random component indicates that each yi is normally distributed (N), independently of the others (i.i.d. = independent and identically distributed), with constant variance σ2.


Linear regression relies on four key assumptions:

  • Linearity – a linear relationship between the predictor(s) and the outcome,
  • Constant variance (homoscedasticity) – the errors have the same variance regardless of the values of the predictor(s),
  • Independence of errors – the errors of the outcome values are uncorrelated with each other (e.g. people who live closer to one another geographically will not behave more similarly than people who live further apart),
  • Lack of multicollinearity – no two or more predictors are perfectly correlated with one another.



There are a number of open-source, comprehensive, and powerful data-mining libraries available in Python, and Scikit-learn (Sklearn) is among the most popular. Paired with a few other libraries, we will use it to conduct a simple linear regression on sample data.

Generating sample data and scatter plot

First, let's make our own sample data using Sklearn's make_regression function, specifying the number of samples as 30, the number of features as 1, the noise as 35 (the higher the noise, the greater the variance), and random_state as 2 (fixing the random seed so we can reuse the same data set later). Printing RegData shows that the data created is a tuple containing two arrays. The first array contains the predictor (X) and is assigned to RegData_X, while the second contains the outcome (y) and is assigned to RegData_y. We then use the pyplot module from another library called matplotlib to create a scatter plot.
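Since the original code listing is not reproduced here, the following is a minimal sketch of this step under the settings described above (the RegData_X/RegData_y names follow the text):

```python
from sklearn.datasets import make_regression
import matplotlib.pyplot as plt

# Generate 30 samples with 1 feature; noise=35 adds variance, and
# random_state=2 fixes the random seed so the data set is reproducible.
RegData = make_regression(n_samples=30, n_features=1, noise=35, random_state=2)
print(RegData)  # a tuple of two arrays: (X, y)

RegData_X = RegData[0]  # predictor array, shape (30, 1)
RegData_y = RegData[1]  # outcome array, shape (30,)

# Scatter plot of the raw data.
plt.scatter(RegData_X, RegData_y)
plt.show()
```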


Computing coefficients

The goal of linear regression is to estimate the beta coefficients (β0 and β1) from the available data. β0 is the y-intercept, while β1 is the slope, which indicates the magnitude of change in y for a unit change in X. We can obtain these values from the following code. We first split the data, taking the last 20 samples as the test set and the rest as the training set. The training data is then fitted to the linear regression model, and the beta coefficients are estimated. Finally, the regression line is drawn with the plt.plot command.
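A sketch of this step, assuming the sample data generated earlier (the variable names regr and the *_train/*_test suffixes are illustrative choices, not the post's exact listing):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

RegData_X, RegData_y = make_regression(n_samples=30, n_features=1,
                                       noise=35, random_state=2)

# Last 20 samples as the test set, the remaining 10 as the training set.
RegData_X_train, RegData_X_test = RegData_X[:-20], RegData_X[-20:]
RegData_y_train, RegData_y_test = RegData_y[:-20], RegData_y[-20:]

# Fit the training data to the linear regression model.
regr = LinearRegression()
regr.fit(RegData_X_train, RegData_y_train)

print('Intercept (beta0):', regr.intercept_)
print('Slope (beta1):', regr.coef_[0])

# Draw the regression line over the test-set scatter plot.
plt.scatter(RegData_X_test, RegData_y_test)
plt.plot(RegData_X_test, regr.predict(RegData_X_test), linewidth=2)
plt.show()
```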

The output is as follows:



1) Hand calculation approach

Sklearn allows a fairly easy implementation of machine learning algorithms. The general steps are:

  1. read data,
  2. split and format the data appropriately,
  3. fit the formatted data into a machine learning algorithm,
  4. compute numerical outcomes, and
  5. evaluate the performance of the learning algorithm.

Sklearn essentially does the math for the end-user in steps 4 and 5. In the following, instead of using Sklearn, we "manually" code the calculation of the beta coefficients, namely the intercept (β0) and the slope (β1). These values can be obtained from the equations below:

β1 = r × (sy / sx) = [mean(X*y) − mean(X) × mean(y)] / [mean(X^2) − mean(X)^2]

β0 = mean(y) − β1 × mean(X)
The slope is equal to the correlation (between X and y) multiplied by the standard deviation of y and divided by the standard deviation of X. This can be rewritten as an expression based on the means of X, y, X*y, and X^2. The Python code is below:
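A sketch of that hand calculation on the same training data as the Sklearn approach (variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import make_regression

RegData_X, RegData_y = make_regression(n_samples=30, n_features=1,
                                       noise=35, random_state=2)

# Same training set as before: everything except the last 20 samples.
x = RegData_X[:-20].ravel()
y = RegData_y[:-20]

# Slope from the means of x, y, x*y, and x^2; intercept from the means and slope.
slope = (np.mean(x * y) - np.mean(x) * np.mean(y)) / (np.mean(x ** 2) - np.mean(x) ** 2)
intercept = np.mean(y) - slope * np.mean(x)

print('Slope (beta1):', slope)
print('Intercept (beta0):', intercept)
```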

We obtained the exact same slope and intercept as in the Sklearn approach:

2) Machine learning approach

As the number of samples and the number of predictors increase, the equations used in the "manual" calculation approach above to obtain exact values of the slope and intercept can become computationally intensive. Alternatively, a machine learning approach may be a more efficient way to estimate the slope and intercept on a large data set.

hθ(X) = θ0 + θ1X1 + θ2X2 + … + θnXn

The equation above is the general form of the systematic component of the linear regression function above. θ0 is the intercept and θ1 is the slope. hθ(X) represents the expected y and, in machine learning, is considered the hypothesis. In other words, we hypothesize that y is a linear function of X. To simplify the notation, we let X0 = 1, such that:

hθ(X) = θ0X0 + θ1X1 + … + θnXn = θT X
Thus, the hypothesis is the summation of each parameter-predictor pair, and it can be condensed as the dot product of the transpose of the parameter vector θ and the predictor vector X.
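As a quick illustration of the dot-product form (the numbers are made up):

```python
import numpy as np

theta = np.array([2.0, 3.0])  # [theta0, theta1]
x = np.array([1.0, 4.0])      # [X0 = 1, X1]

# h_theta(X) = theta^T . X = 2*1 + 3*4 = 14
h = np.dot(theta, x)
print(h)  # 14.0
```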

Cost function

We want to estimate parameters θ. One intuitive approach is to make hθ(X) as close to the y values as possible. We introduce the measurement of this “closeness” between the predicted y value (as in hθ(X)) and the actual observed y values in an equation called the cost function J(θ):

J(θ) = (1/2m) Σi=1..m (hθ(x(i)) − y(i))2
The cost function captures how much deviance (residual, or error) there is between the observed y values and their corresponding predicted values. The square operator ensures that overestimation and underestimation in the prediction are treated equally and do not cancel each other out when summed. This is called the least-squares cost function because we want to minimize J(θ) so that the deviance between the predicted and observed y is kept to a minimum.
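A minimal implementation of this least-squares cost function (using a common 1/(2m) scaling; the function name and toy data are illustrative), assuming X already includes the X0 = 1 column:

```python
import numpy as np

def cost(theta, X, y):
    """Least-squares cost J(theta) = (1/2m) * sum((h_theta(x) - y)^2)."""
    m = len(y)
    residuals = X.dot(theta) - y
    return residuals.dot(residuals) / (2 * m)

# Toy example: points lying exactly on y = 1 + 2x give zero cost.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])
print(cost(np.array([1.0, 2.0]), X, y))  # 0.0
```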

Gradient descent

Now the objective is better defined: to choose the parameters θ that minimize J(θ). This objective can be searched for computationally using an algorithm called gradient descent. The gradient descent algorithm performs the following update rule:

θj := θj − α ∂J(θ)/∂θj   (simultaneously for all j)
Substituting J(θ) with the right hand side of the cost function, we further derive the gradient descent update rule as:

Repeat until convergence {

    θj := θj + α (1/m) Σi=1..m (y(i) − hθ(x(i))) xj(i)   (for every j)

}
The summation term comes from the partial derivative of J(θ); it gives the "direction" of steepest descent for decreasing the cost function. (y(i) − hθ(x(i))) is the error term, and α is the learning rate set by the user: the larger the learning rate, the faster the learning, but too large a value risks failing to converge.

Using Python

The aforementioned machine learning approach to linear regression can be coded in Python:
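A minimal sketch of batch gradient descent for this problem (the iteration count, alpha, and initial theta below are illustrative choices, not the post's exact settings):

```python
import numpy as np
from sklearn.datasets import make_regression

RegData_X, RegData_y = make_regression(n_samples=30, n_features=1,
                                       noise=35, random_state=2)

# Same training set as before, with a column of ones so that X0 = 1
# folds the intercept theta0 into the dot product.
x = RegData_X[:-20]
y = RegData_y[:-20]
m = len(y)
X = np.hstack([np.ones((m, 1)), x])  # shape (m, 2)

alpha = 0.1           # learning rate
iterations = 10000    # number of update steps
theta = np.zeros(2)   # initial theta: [theta0, theta1]

for _ in range(iterations):
    error = y - X.dot(theta)                   # (y(i) - h_theta(x(i))) per sample
    theta = theta + alpha * X.T.dot(error) / m  # simultaneous update of all theta_j

print('theta0 (intercept):', theta[0])
print('theta1 (slope):', theta[1])
```

With enough iterations, the search converges to the same coefficients as the closed-form calculation.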

As expected, θ0 is -0.611 and θ1 is 42.693. The number of iterations, alpha (α), and initial theta (θ) are parameters set by us and can be changed to different values to demonstrate how fast the search converges to the final beta coefficients.
