I was bothered that there are not many correct implementations of Bayesian linear regression out there, and few that support multidimensional input, so I implemented one as an easy-to-use class. The notation basically follows PRML.
That said, it is not much of an implementation, only about 50 lines or so, so I will post the whole class here. Its features are as follows. It is very simple.
- Given a pair of $\phi$ and $t$, compute the posterior distribution
- Randomly sample parameters from the posterior mean and covariance of the Bayesian linear regression
- Given $\phi$, output the mean and variance of the predictive distribution
Beyes_LR.py

```python
import numpy as np
from scipy.stats import multivariate_normal


class BeyesLinearRegression:
    def __init__(self, mu, S, beta):
        # Prior mean, prior covariance, and noise precision beta
        self.mu = mu
        self.S = S
        self.beta = beta

    def calc_posterior(self, phi, t):
        # Update the posterior mean and covariance from features phi and targets t
        S_inv = np.linalg.inv(self.S)
        if len(phi.shape) == 1:
            phi = phi.reshape(1, -1)
            t = t.reshape(1, 1)
        self.S = np.linalg.inv(S_inv + self.beta * phi.T @ phi)
        self.mu = self.S @ (S_inv @ self.mu + np.squeeze(self.beta * phi.T @ t))

    def sampling_params(self, n=1, random_state=0):
        # Draw n parameter vectors from the current posterior
        np.random.seed(random_state)
        return np.random.multivariate_normal(self.mu, self.S, n)

    def probability(self, x):
        # Log density of a parameter vector under the current posterior
        dist = multivariate_normal(mean=self.mu, cov=self.S)
        return dist.logpdf(x)

    def predict(self, phi):
        # Predictive mean and variance for each row of phi
        if len(phi.shape) == 1:
            phi = phi.reshape(1, -1)
        pred = np.array([self.mu.T @ _phi for _phi in phi])
        S_pred = np.array([(1 / self.beta) + _phi.T @ self.S @ _phi for _phi in phi])
        # The above is a straightforward implementation.
        # The following may be better if you want speed:
        # pred = self.mu @ phi.T
        # S_pred = (1 / self.beta) + np.diag(phi @ self.S @ phi.T)
        return pred, S_pred
```
All the code is also on git (though, again, it is only about 50 lines).
The following articles cover the derivation of the formulas in detail, so I will not repeat it here.
- Implementing Bayesian linear regression in python
- Regression using Bayesian inference
The important point is how the distribution is updated. Put plainly, $\phi$ is the explanatory variable (the features) and $t$ is the target; the mean and covariance matrix are updated from them as follows.
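The original equation image is not reproduced here, but the update is the standard PRML form, which is exactly what `calc_posterior` computes: for prior mean $\mathbf{m}_0$ and covariance $\mathbf{S}_0$, design matrix $\boldsymbol{\Phi}$ and targets $\mathbf{t}$,

$$
\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^\top \boldsymbol{\Phi},
\qquad
\mathbf{m}_N = \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta \boldsymbol{\Phi}^\top \mathbf{t}\right)
$$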
The predictive distribution is shown next. I will omit the details here as well, but the point is that for a new point $x$ the prediction is a Gaussian whose mean and variance are given below.
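Again in the standard form, matching what `predict` returns, for a new point $x$ with features $\boldsymbol{\phi}(x)$:

$$
p(t \mid x) = \mathcal{N}\left(t \mid \mathbf{m}_N^\top \boldsymbol{\phi}(x),\ \sigma_N^2(x)\right),
\qquad
\sigma_N^2(x) = \frac{1}{\beta} + \boldsymbol{\phi}(x)^\top \mathbf{S}_N \boldsymbol{\phi}(x)
$$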
From here, let's actually try Bayesian linear regression with the class implemented above. The class expects to be given $\phi$, the features computed from $x$. In many implementations the part that generates $\phi$ (for example, a polynomial basis) is also built into the class, which makes it hard to tell where the Bayesian linear regression ends and the polynomial feature design begins; here the two are kept separate.
So, given the original data $x$, $y$, all you basically need to do is implement the function that creates $\phi$.
Let's try with toy data first.
The input data are observations made by adding noise to a true sine wave. The feature function is designed as a composite of trigonometric functions at several frequencies: `x_to_phi` vectorizes the 10 waves, and `_phi` represents the composite wave, whose amplitudes are the parameters found by Bayesian linear regression. Mathematically it looks like the formula below. (On reflection, the first term is always zero, so it is not really needed...)
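The original image is not reproduced here, but the model it described, matching the code below, is

$$
y(x) = \sum_{m=0}^{9} w_m \sin(2\pi m x),
\qquad
\boldsymbol{\phi}(x) = \bigl(\sin(2\pi \cdot 0 \cdot x),\ \sin(2\pi \cdot 1 \cdot x),\ \dots,\ \sin(2\pi \cdot 9 \cdot x)\bigr)
$$

where the $w_m$ are the amplitudes to be estimated (and the $m = 0$ term is indeed always zero).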
```python
def x_to_phi(x):
    # Map x to a 10-dimensional feature vector of sine waves with frequencies 0..9
    if len(x.shape) == 1:
        x = x.reshape(-1, 1)
    return np.concatenate([np.sin(2 * np.pi * x * m) for m in range(0, 10)], axis=1)


def _phi(x, params):
    # Composite wave: sum of the sine waves weighted by the given amplitudes
    return np.array([p * np.sin(2 * np.pi * x * i) for i, p in enumerate(params)]).sum(axis=0)
```
Here is the actual data generation part.
```python
x = np.arange(0, 1, 0.01)
phi = x_to_phi(x)

e = np.random.randn(len(x)) / 10
y_true = np.sin(2 * np.pi * x)
y = y_true + e
```
First, consider the case where only a single point, the 50th data point, is observed.
```python
import matplotlib.pyplot as plt

train_idx = 50
x_train = x[train_idx]
phi_train = phi[train_idx]
y_train = y[train_idx]

plt.scatter(x_train, y_train, c='crimson', marker='o', label='observation')
plt.plot(x, y_true, label='true')
```
Let's calculate the posterior distribution for this data right away. If you just want to train, it is essentially one line:
```python
# Initial values for the Bayesian linear regression
input_dim = phi.shape[1]
mu = np.zeros(input_dim)
S = np.identity(input_dim)
sigma = 0.1
beta = 1.0 / (sigma ** 2)

# Model definition and training
beyes_linear_model = BeyesLR.BeyesLinearRegression(mu, S, beta)
beyes_linear_model.calc_posterior(phi_train, y_train)
```
The waveforms randomly sampled from the posterior distribution after training are drawn as green dotted lines, and the predictive distribution is drawn in light blue. Since only a single point has been observed, most of the plot is light blue.
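The plotting code is not shown in the post; here is a minimal sketch of how the sampled waveforms and the predictive band could be drawn with the class above (the variable names are mine):

```python
# Draw a few composite waves from parameter vectors sampled from the posterior
for params in beyes_linear_model.sampling_params(n=5):
    plt.plot(x, _phi(x, params), 'g--', alpha=0.5)

# Predictive mean and +/- 1 standard deviation band over the whole input range
pred_mean, pred_var = beyes_linear_model.predict(phi)
pred_std = np.sqrt(pred_var)
plt.fill_between(x, pred_mean - pred_std, pred_mean + pred_std,
                 color='lightblue', alpha=0.5, label='predictive')
plt.legend()
plt.show()
```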
Let's do exactly the same thing with 50 observation points. The only difference in the code is that 50 values of train_idx are chosen at random. The resulting predictive distribution is shown in the figure below.
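For example, one possible way to pick the indices (this snippet is not in the original post):

```python
# Pick 50 observation indices at random instead of a single point
train_idx = np.random.choice(len(x), size=50, replace=False)
x_train = x[train_idx]
phi_train = phi[train_idx]
y_train = y[train_idx]
```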
Next is a more realistic problem, which also shows the class handling multidimensional input. If feature extraction were done on top of multiple dimensions the dimensionality would grow too large to follow, so here $\phi$ is a linear function.
We use the Advertising dataset familiar from ISLR. This time, TV and Radio advertising spend are the inputs and Sales is the response.
Since $\phi$ is linear this time, we only add an intercept term. The regression equation has the following form.
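The original equation image is not reproduced here; with the intercept it is simply

$$
\text{Sales} = w_0 + w_1 \cdot \text{TV} + w_2 \cdot \text{Radio}.
$$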
```python
import pandas as pd
from sklearn.model_selection import train_test_split


def x_to_phi(x, typ='linear', degree=3):
    # Only the linear case is used here: prepend an intercept column of ones
    if len(x.shape) == 1:
        x = x.reshape(-1, 1)
    return np.concatenate([np.ones(x.shape[0]).reshape(-1, 1), x], axis=1)


df = pd.read_csv(ADVERTISING_DATASET)
x = df[['TV', 'Radio']].values
y = df['Sales'].values
phi = x_to_phi(x)

x_train, x_test, phi_train, phi_test, y_train, y_test = \
    train_test_split(x, phi, y, train_size=0.05, random_state=0)
```
All you have to do is train it just as in the previous example.
```python
input_dim = phi.shape[1]
mu = np.zeros(input_dim)
S = np.identity(input_dim)
sigma = 10
beta = 1.0 / (sigma ** 2)

beyes_linear_model = BeyesLR.BeyesLinearRegression(mu, S, beta)
beyes_linear_model.calc_posterior(phi_train, y_train)
```
In the code, train_size is set to 0.05, and the figure below shows the regression plane as this value is varied. After training by Bayesian linear regression, 5 parameter sets are drawn by random sampling and the corresponding planes are plotted (a sketch of how this can be computed follows). When the number of training samples is small the planes come out almost at random, but they converge as the number of data points increases.
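The plotting code is not included in the post; a minimal sketch of how such sampled planes could be computed (the grid and variable names are my own):

```python
# Evaluate 5 sampled regression planes on a grid of (TV, Radio) values
tv_grid, radio_grid = np.meshgrid(
    np.linspace(x[:, 0].min(), x[:, 0].max(), 20),
    np.linspace(x[:, 1].min(), x[:, 1].max(), 20),
)
grid = np.c_[tv_grid.ravel(), radio_grid.ravel()]
phi_grid = x_to_phi(grid)

for w in beyes_linear_model.sampling_params(n=5):
    sales_plane = (phi_grid @ w).reshape(tv_grid.shape)  # one sampled plane
```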
Finally, a little promotion of Bayesian linear regression. The design of the feature extraction function $\phi$ is important in Bayesian linear regression, although there are still parts of this I do not fully understand. My understanding is that Gaussian process regression avoids writing $\phi$ down explicitly by treating the covariance matrix as a design matrix built from kernel functions. In my experience, however, if you have prior knowledge that lets you extract good features, Bayesian linear regression is sufficient in terms of explanatory power. In addition, because of how it is computed, Bayesian linear regression supports sequential learning as it is: you simply use the computed posterior distribution as the prior for the next update. There is no need to recompute the Gram matrix as in Gaussian process regression. (There may be an online version of that, but...)
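A minimal sketch of that sequential update with the class above (the batching here is my own illustration): since `calc_posterior` overwrites `mu` and `S`, each call simply uses the previous posterior as the new prior.

```python
# Sequential (online) learning: feed the data in small batches;
# every call to calc_posterior starts from the current posterior as its prior
model = BeyesLR.BeyesLinearRegression(np.zeros(input_dim), np.identity(input_dim), beta)
for start in range(0, len(phi_train), 10):
    model.calc_posterior(phi_train[start:start + 10], y_train[start:start + 10])
```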
By the way, have a comfortable Bayesian linear regression life!