Gaussian Process

Basics

For a given finite set of data points, there are potentially infinitely many functions that fit the data. We need to make assumptions about the underlying function; otherwise the predicted functions (the colored dashed lines in the typical illustration) could be arbitrarily wild. A Gaussian process (GP) is a powerful way to tackle this fitting problem: it assigns a probability to each of these candidate functions.

A GP predicts not only the value at an unseen input but also a confidence (uncertainty) that tells us how much to trust that value. This uncertainty can guide where to add new data to the training set, which in turn should improve the prediction.

Functional view

We have a collection of inputs $\textbf{X} = [ \textbf{x} _ 1, \textbf{x} _ 2, \dots, \textbf{x} _ n ]^T \in \R^{n \times d}$, where $n$ is the number of training data points and $d$ is the dimension of the features. Each input $\textbf{x} _ i$ lives in a continuous (hence infinite) space but has a fixed number of components ($\textbf{x} _ i = [x(1), x(2), \dots, x(d)]$). The corresponding outputs are $\textbf{y} = [y _ 1, y _ 2, \dots, y _ n]^T \in \R^{n \times 1}$.
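As a concrete (and purely illustrative) setup, here is what the training data might look like in NumPy for a 1-D toy problem where the unknown $f$ is assumed to be a sine function; the shapes mirror the notation above:

```python
import numpy as np

n, d = 5, 1                                # number of training points, feature dimension
X = np.linspace(-3, 3, n).reshape(n, d)    # training inputs X, shape (n, d)
y = np.sin(X).ravel()                      # training outputs y, shape (n,); toy f(x) = sin(x)
```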

The goal is to find the mapping $f$ from $\textbf{x}$ to $y$, i.e. $y = f(\textbf{x})$ in the noise-free case. We suppose $f$ is a Gaussian process, which is fully defined by its mean function $m(\textbf{x})$ and covariance function $k(\textbf{x}, \textbf{x}^\prime)$ $$ \begin{aligned} m(\textbf{x}) &= \mathbb{E}[f(\textbf{x})] \\ k(\textbf{x}, \textbf{x}^\prime) &= \mathbb{E}[(f(\textbf{x}) - m(\textbf{x}))(f(\textbf{x}^\prime) - m(\textbf{x}^\prime))] \end{aligned} $$ If we want to predict outputs at $m$ test locations ($\textbf{x} _ {\ast 1}, \textbf{x} _ {\ast 2}, \dots, \textbf{x} _ {\ast m}$), then assuming a zero mean function, the training and test function values are jointly Gaussian: $$ \underbrace{f(\textbf{x} _ 1), \dots, f(\textbf{x} _ n)} _ {\text{training points}}, \underbrace{f(\textbf{x} _ {\ast 1}), \dots, f(\textbf{x} _ {\ast m})} _ {\text{test points}} \sim N(0, \Sigma) $$

where

$$ \begin{aligned} \Sigma &= \left[ \begin{array}{ccc|ccc} k(\textbf{x} _ 1, \textbf{x} _ 1) & \dots & k(\textbf{x} _ 1, \textbf{x} _ n) & k(\textbf{x} _ 1, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ 1, \textbf{x} _ {\ast m}) \\ \vdots & & \vdots & \vdots & & \vdots \\ k(\textbf{x} _ n, \textbf{x} _ 1) & \dots & k(\textbf{x} _ n, \textbf{x} _ n) & k(\textbf{x} _ n, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ n, \textbf{x} _ {\ast m}) \\ \hline k(\textbf{x} _ {\ast 1}, \textbf{x} _ 1) & \dots & k(\textbf{x} _ {\ast 1}, \textbf{x} _ n) & k(\textbf{x} _ {\ast 1}, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ {\ast 1}, \textbf{x} _ {\ast m}) \\ \vdots & & \vdots & \vdots & & \vdots \\ k(\textbf{x} _ {\ast m}, \textbf{x} _ 1) & \dots & k(\textbf{x} _ {\ast m}, \textbf{x} _ n) & k(\textbf{x} _ {\ast m}, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ {\ast m}, \textbf{x} _ {\ast m}) \end{array} \right] \\ &= \left[ \begin{array}{c|c} K(X, X) & K(X, X _ \ast) \\ \hline K(X _ \ast, X) & K(X _ \ast, X _ \ast) \end{array} \right] \end{aligned} $$
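To make the block structure concrete, here is a minimal sketch that builds $K(X, X)$, $K(X, X _ \ast)$, and $K(X _ \ast, X _ \ast)$ and then conditions on the training values to get the posterior at the test points. The squared-exponential kernel, the toy data, the jitter term, and the standard Gaussian conditioning formula are assumptions for illustration, not something derived in the text so far:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 * l^2))."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return variance * np.exp(-0.5 * sq_dists / length_scale**2)

# Toy data (an assumption for illustration, reusing the sine example above).
X = np.linspace(-3, 3, 5).reshape(-1, 1)          # n = 5 training inputs
y = np.sin(X).ravel()                             # noise-free observations f(X)
X_star = np.linspace(-3, 3, 50).reshape(-1, 1)    # m = 50 test locations

# The blocks of the joint covariance matrix Sigma.
K    = rbf_kernel(X, X)            # K(X, X),     n x n
K_s  = rbf_kernel(X, X_star)       # K(X, X_*),   n x m
K_ss = rbf_kernel(X_star, X_star)  # K(X_*, X_*), m x m

# Condition the joint Gaussian on the observed training values (noise-free case).
# A small "jitter" is added to K for numerical stability before solving.
K_jit = K + 1e-10 * np.eye(len(X))
alpha = np.linalg.solve(K_jit, y)                 # K^{-1} y
V     = np.linalg.solve(K_jit, K_s)               # K^{-1} K(X, X_*)

mu_star  = K_s.T @ alpha                          # posterior mean at the test points
cov_star = K_ss - K_s.T @ V                       # posterior covariance
std_star = np.sqrt(np.maximum(np.diag(cov_star), 0.0))  # pointwise uncertainty
```

Here `mu_star` is the GP prediction at the test locations and `std_star` is the pointwise uncertainty band mentioned in the Basics section.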

Nice blog on Gaussian Process

Why can we work with subsets of the data points and their marginalized probability distributions? Because a multivariate Gaussian can be partitioned by dimension into subsets, and each subset is itself Gaussian.
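This is the marginalization property of the multivariate Gaussian: partition the dimensions into two subsets $A$ and $B$, and the marginal over $A$ is obtained simply by reading off the corresponding sub-blocks of the mean and covariance (a standard fact, stated here without derivation):

$$ \left[ \begin{array}{c} \textbf{f} _ A \\ \textbf{f} _ B \end{array} \right] \sim N \left( \left[ \begin{array}{c} \boldsymbol{\mu} _ A \\ \boldsymbol{\mu} _ B \end{array} \right], \left[ \begin{array}{cc} \Sigma _ {AA} & \Sigma _ {AB} \\ \Sigma _ {BA} & \Sigma _ {BB} \end{array} \right] \right) \quad \Longrightarrow \quad \textbf{f} _ A \sim N(\boldsymbol{\mu} _ A, \Sigma _ {AA}) $$

This is why the joint distribution over training and test points above can be written down for any finite set of locations without worrying about the (infinitely many) other inputs.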