Gaussian Process

For a given set of finite data points, there are potentially infinitely many functions that fit the data.
We need to make assumptions about the underlying functions, otherwise the predicted function (colored dashed lines) could be wild.
`Gaussian processes`

is a powerful method to solve this fitting problem by assigning a probability to each of colored dashed line.

GP can predict the value of an unknown data and the confidence (uncertainty) about how much we can trust this value. The uncertainty directs us to add new data to the training set and presumably improves the prediction.

`conditioning`

: the probability of one variable depending on another variable`marginalization`

and`conditioning`

yield a modified Gaussian distribution- correlation intuition: imagine a earthquake is attacking many cities including ABC. The distance between City A and City B is closer than the distance between City A and City C ($d(A, B) < d(B, C)$). The intensity each city feels is defined as $I$. The $I _ A$ and $I _ B$ has higher correlation, and $I _ A$ and $I _ C$ has lower correlation. If we can only detect $I _ A$ at this moment and want to predict $I _ B$ and $I _ C$, we will be more certain about the prediction on City B than City C.

We have a collection of input $\textbf{X} = [ \textbf{x} _ 1, \textbf{x} _ 2, \dots, \textbf{x} _ n ]^T \in \R^{n \times d}$ where $n$ is the number of training data and $d$ is the dimension of the feature. The input space $\textbf{x}_i$ is infinite and continuous in space with a fixed number of components ($\textbf{x} _ i = [x(1), x(2), \dots, x(d)]$). The corresponding output is $\textbf{y} = [y _ 1, y _ 2, \dots, y _ n]^T \in \R^{n \times 1}$.

The goal is to find the mapping $f$ from $\textbf{x}$ to $y$ which corresponds to $y = f(\textbf{x})$ in a noise-free case. We suppose $f$ is a gaussian process which is defined by its mean function $m(\textbf{x})$ and covariance function $k(\textbf{x}, \textbf{x}^\prime)$ $$ \begin{aligned} m(\textbf{x}) &= \mathbb{E}(\textbf{x}) \\ k(\textbf{x}, \textbf{x}^\prime) &= \mathbb{E}[(f(\textbf{x}) - m(\textbf{x}))(f(\textbf{x}^\prime) - m(\textbf{x}^\prime))] \end{aligned} $$ If we want to predict outputs at $m$ locations ($\textbf{x} _ {\ast 1}, \textbf{x} _ {\ast 2}, \dots, \textbf{x} _ {\ast m}$) $$ \underbrace{f(\textbf{x} _ 1), \dots, f(\textbf{x} _ n)} _ {\text{training points}}, \underbrace{f(\textbf{x} _ {\ast 1}), \dots, f(\textbf{x} _ {\ast m})} _ {\text{test points}} \sim N(0, \sum) $$

where

$$ \begin{aligned} \sum &= \left[ \begin{array}{ccc|c} k(\textbf{x} _ 1, \textbf{x} _ 1) & \dots & k(\textbf{x} _ 1, \textbf{x} _ n) & k(\textbf{x} _ 1, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ 1, \textbf{x} _ {\ast m}) \\ \vdots & & \vdots & \vdots & & \vdots \\ k(\textbf{x} _ n, \textbf{x} _ 1) & \dots & k(\textbf{x} _ n, \textbf{x} _ n) & k(\textbf{x} _ n, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ n, \textbf{x} _ {\ast m}) \\ \hline k(\textbf{x} _ {\ast 1}, \textbf{x} _ 1) & \dots & k(\textbf{x} _ {\ast 1}, \textbf{x} _ n) & k(\textbf{x} _ {\ast 1}, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ {\ast 1}, \textbf{x} _ {\ast m}) \\ \vdots & & \vdots & \vdots & & \vdots \\ k(\textbf{x} _ {\ast m}, \textbf{x} _ 1) & \dots & k(\textbf{x} _ {\ast m}, \textbf{x} _ n) & k(\textbf{x} _ {\ast m}, \textbf{x} _ {\ast 1}) & \dots & k(\textbf{x} _ {\ast m}, \textbf{x} _ {\ast m}) \end{array} \right] \\ &= \left[ \begin{array}{c|c} K(X, X) & K(X, X_\ast) \\ \hline K(X_\ast, X) & K (X _ \ast X _ \ast) \end{array} \right] \end{aligned} $$

- https://distill.pub/2019/visual-exploration-gaussian-processes

Why do we need subsets of data and determine their marginalized probability distribution? - Partitioning dimensions into subsets