Bayesian Linear Regression using PyMC3
Introduction
In this post we will talk about the motivation behind using Bayesian methods for machine learning models and we will look at a simple Bayesian linear regression model developed using the PyMC3 python library.
How big a baby?
Say you are working as a statistician specializing in public health. As part of your job, every month, you collect data on height and weight of babies who are less than 2 years old. You collect this data to study childrens' health, malnourishment etc., and to drive public health policies. You now have a trove of past data.
Now someone asks you how tall a 6 month old baby girl will be, you put on your data scientist hat and set off to solve the problem.
Data
First, let’s look at the data. The data you’ve collected so far…
(By ‘you’, not you ‘you’, but a sample from the actual data that the World Health Organization (WHO) had collected. WHO and other health institutions around the world collect data for newborns and toddlers and design growth charts standards. This example is taken from PyMC3 examples documentation which in turn was taken from Osvaldo Martin’s book: Bayesian Analysis with Python: Introduction to statistical modeling and probabilistic programming using PyMC3 and ArviZ, 2nd Edition.)


Month Height
0 0 48.5
1 0 50.5
2 0 50.5
3 0 52.0
4 0 47.5


As we can see, there is a linear relationship between month and height. But obviously, not all children are born the same and No one size fit (t)all. There are distinct values of heights for each age group.
Let’s look at the probability distribution of the babies' heights


The data appears skewed. Now that we have explored the data, let us fit a model to predcit the height of a 6 month old baby girl.
NOTE: YOU ARE A FREQUENTIST.
Frequentist Linear Regression
The frequentist, or classical, approach to multiple linear regression assumes a model of the form:
y = α + β1.X1 + β2.X2 + .... + βn.Xn + ε
Where, α is the intercept, β1…βn are the weights and ε is the residual error. Our model is linear in the predictors, with some associated measurement error.
If we have a set of training data then the goal is to estimate the coefficients, which provide the best linear fit to the data. “Best” in this case means minimising some form of error function. The most popular method to do this is via ordinary least squares (OLS). To make a subsequent prediction, given some new data, we simply multiply the components of by the associated coefficients and obtain. The important point here is that the intercepts (b0) and the weigths (b1…bn) are point estimate. This means that it is a single value. In the Bayesian formulation we will see that the interpretation differs substantially. Now, let’s fit a linear regression model.
Regression Fit using Frequentist Approach with sklearn
Define the independent and dependent parameters


Fit the data as a Linear Regression using the sklearn library. For the purpose of this post, we will not divide the data into test/train. We will use all the input data to fit the model


0.9052295855793887
Even with a basic linear regression model, we are getting a R2 Score of 0.905. We will see what are the values of the intercept and the weights of our linear model.
print(f"Intercept α = {mdl.intercept_}")
print(f"Weight β = {mdl.coef_}")
Intercept α = 55.350410877255214
Weight β = [1.42791809]
Let’s provide the same input values of X and predict and visualize the results.


Visually the predicted values looks fine and the R2 score is 0.9. The model seems to have done a good job in fitting the data. (Or did it ? We will answer this question soon. Spoiler Alert, it did not.)
You are now ready to answer the question posed to you.


[63.91791939]
There it is. The answer to how tall a 6 month old baby girl will be = 63.91 cms tall.
The person who asked you the question seems unconvinced. He has some followup questions. How confident are you in the result ? The value is just the mean estimate. Can you estimate the distribution of height for 6 month old babies ?
Being a frequentist that you are, you are convinced that the model has done a good job. Even though the data suggests that babies come in different heights, the answer you had provided is based on the Maximum Likelihood Estimate (MLE) from the model fitted on the same data. The mean height of 6 month old babies will be around 63.91 cms, but you don’t know the spread ie., the variance. And you are not sure how confident you are in your results. There is no way to measure the confidence when the output is just the MLE.
You are now in want of a better approach.
Enter Bayesian Statistics
Bayesian Statistics
Bayesian statistics is a particular approach to applying probability to statistical problems. It provides us with mathematical tools to update our beliefs about random events in light of seeing new data or evidence about those events. We may have a prior belief about an event, but our beliefs are likely to change when new evidence is brought to light. Bayesian statistics gives us a solid mathematical means of incorporating our prior beliefs, and evidence, to produce new posterior beliefs.
Bayes rule:
P(HD) = P(DH).P(H) / P(D)
P(HD) : Probability that the hypothesis is true given the data.
P(DH) : Probability of the data arising given the hypothesis.
P(H) : Probability that the hypothesis is true, globally.
P(D) : Probability of the data arising, globally.
Bayesian statistics provides us with mathematical tools to rationally update our subjective beliefs in light of new data or evidence. This is in contrast to another form of statistical inference, known as classical or frequentist statistics, which assumes that probabilities are the frequency of particular random events occuring in a long run of repeated trials.
Bayesian Linear Regression
In a Bayesian framework, linear regression is stated in a probabilistic manner. That is, we reformulate the linear regression model to use probability distributions. The syntax for a linear regression in a Bayesian framework looks like this:
y ~ N(μ, σ)
μ = α + β.X
In words, our output datapoints are sampled from a multivariate normal distribution that has a mean equal to sum of the intercept plus the product of the β coefficients and the predictors X, and a variance of standard deviation of σ.
This is a very different formulation to the frequentist approach. In the frequentist setting there is no mention of probability distributions. In the Bayesian formulation the entire problem is recast such that the values are samples from a normal distribution.
A common question at this stage is “What is the benefit of doing this?”. What do we get out of this reformulation?
We saw that in the frequentist approach, the MLE value for our regression coefficients were only a single point estimate. In the Bayesian formulation we receive an entire probability distribution that characterises our uncertainty on the different coefficients. The immediate benefit of this is that after taking into account any data we can quantify our uncertainty in the parameters via the variance of this posterior distribution. A larger variance indicates more uncertainty.
Regression Fit using Bayesian Approach with PyMC3
We will now fit a Bayesian linear regression model. We’ll import the necessary libraries first.




Let us now fit the model using PyMC3 library. As explained earlier, we will fit the data as probability distributions. The linear regression intercept α and the weight are also assumed to be dervied from a Normal distribution with their corresponding mean and standard deviation. The dependent varialbe y is derived from a Normal distribution whose mean, μ = α + β.X. Since the standard deviation σ can only take positive values, it is derived from a HalfNormal or HalfCauchy distribution. This can be visualized as follows:


Sequential sampling (2 chains in 1 job)
NUTS: [σ, β, α]
Sampling chain 0, 0 divergences: 100%██████████ 2500/2500 [00:05<00:00, 467.50it/s]
Sampling chain 1, 0 divergences: 100%██████████ 2500/2500 [00:04<00:00, 542.02it/s]
There is a lot to unpack in the above few lines of code.
In line 1 we are defining a PyMC3 model using a ‘with’ statement. In lines 2 to 4, we are defining the intercept, weight and standard deviation as probability distributions as explained in the paragraph above. In line 8, we are setting the mean of the data y as per the equation μ = α + β.X. In line 10 we have defined y (height) as a Normal distribution with its mean and standard deviation and observed values.
In line 12, we are defining the step as pm.NUTS. This line is redundant and by default PyMC3 uses NUTS sampler. But to introduce the concept it has been defined explicitly here. In line 13, we are sampling using NUTS sampler with a iteration count of 2000.
We’ll take a slight detour to discuss what is happening in these last two lines.
Bayeisan Inference using MCMC
Traditionally Bayesian problems were difficult to solve because of the complex integrals involving hundreds of unknown parameters. But with the advent of computers and the ever increasing computing power available, the Bayesian models can be fit using numerical methods instead of complex integrals. PyMC3 is one such probabilistic programming package for Python that allows users to fit Bayesian models using a variety of numerical methods, most notably Markov chain Monte Carlo (MCMC) and variational inference (VI). Its flexibility and extensibility make it applicable to a large suite of problems.
MCMC
Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution. MCMC methods are used to approximate the posterior distribution of a parameter of interest by random sampling in a probabilistic space.
Markov Chain and MonteCarlo simulation are complex topics which deserves a separate post of their own. For now, it is suffice to say that in MCMC method, samples are drawn from the probability distribution by constructing a Markov Chain, where the next sample that is drawn from the probability distribution is dependent upon the last sample that was drawn. The idea is that the chain will settle on (find equilibrium) on the desired quantity we are inferring.
There are different class of algorithms that exist to construct a Markov Chain during MCMC sampling such as Metropolis, Hamiltonian Monte Carlo (HMC), the No Uturn Sampler (NUTS), and Stein Variational Gradient Descent. NUTS is the most common method and suits for a varied dataset. In this example, we are using NUTS algorithm to draw samples that form the Markov chain to fit the probability distribution.
Regression Fit using Bayesian Approach with PyMC3 Contd…
Let’s continue with our Bayesian Linear regression code. We saw earlier that we had defined the probability distributions and had fit our model. Let us see what the results are for the model fit.


As explained earlier, unlike the output of traditional frequentist model, the values of intercept and weights are not a single point estimate but rather a distribution. The mean of these values are given by:


Intercept α = 55.32311984412687
Weight β = 1.4298496803475405
The mean values are closer to the MLE values estimated using the sklearn LinearRegression() model.
Prediction using Bayesian Model
In Bayesian inference, prediction are termed as posterior prediction. Similar to sklearn model we will use the same input X (Month) values to predict the height of the babies.


100%██████████ 1000/1000 [00:02<00:00, 487.80it/s]


Again the predicted values are not a single point estimate for each input values, but rather a distribution. The middle blue line represents the mean of the posterior prediction which is similar to the sklearn prediction output. The dark orange band represents the 60% HDI and the light orange bands represents the 94% HDI.
We’ll now answer how tall a 6 month baby will be using the Bayesian model.
with model_babies:
pm.set_data({"month": [6]})
ht_6_ppc = pm.sample_posterior_predictive(trace_babies, 2000)
ht_6 = ht_6_ppc["height"][:, 0]
print(f"\n\nMean Height = {ht_6.mean()}")
print(f"Standard Deviation = {ht_6.std()}")
100%██████████ 2000/2000 [00:04<00:00, 417.19it/s]
Mean Height = 63.858550813397954
Standard Deviation = 3.510693761007536
The Mean height of 6 month old babies is 63.85 cms which is close to the value of 63.91 cms that we got from sklearn model. In this case, since we’ve got a distribution as output, we can find the variance as well, which is 3.5. Let’s plot the output distribution.
az.plot_posterior(ht_6, hdi_prob=0.95)
 The Mean height of 6 month old babies is ~64 cms with a varianve of 3.5 cm.
 The height can vary between 57 and 71 cms for a 95% HPD.
 95% highest posterior density (HPD) or ‘Credible Interval’ encompasses the region of practical equivalence (ROPE).
 In frequentist approach we have Condifence Interval, but in Bayesian approach the term ‘Credible Interval’ is used. Read this stackexchange post for detailed explanation.
Conclusion
 Using a frequentist approach, we get a single point estimate of the regression parameters which in turn will give a single point output for given input predictors.
 In Bayesian approach, you need to define the data in terms of a probability distribution.
 The output of a Bayesian Linear Regression is not just a single point estimate but a distribution along with the credible intervals. This allows you to infer more about the data and the problem you are trying to solve compared to the frequentist approach.
This post is a simple introduction to Bayesian Linear Regression. The world of Bayesian modeling is interesting with a lot more to explore like hierarchial modeling, multilevel modeling etc., and there are even Bayesian Neural Networks. We will cover some of these topics in future posts.
References

https://www.quantstart.com/articles/BayesianLinearRegressionModelswithPyMC3/

http://jakevdp.github.io/blog/2014/03/11/frequentismandbayesianismapracticalintro/

http://camdavidsonpilon.github.io/ProbabilisticProgrammingandBayesianMethodsforHackers/

https://towardsdatascience.com/mcmcintuitionforeveryone5ae79fff22b1