In binary classification, a response function (or link function) is used to turn the output of a regression model into a class probability: the real-valued output of the regression model is squashed into the range [0, 1]. A common choice is the logistic function (another is the probit function, which will not be discussed here). For a data point \((x_i, c_i)\),

\[ p(c_i=+1|x_i, w) = \frac{1}{1+\exp(-x_i'w)} = \sigma(x_i'w). \]

Encoding the class \(c_i\) as \(+1\) or \(-1\), the equation can be written for either class as

\[ p(c_i|x_i, w) = p(c_i|f_i)=\sigma(c_if_i), \mbox{ where } f_i = x_i'w. \]
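As a quick illustration (a minimal NumPy sketch, not tied to the repo mentioned below; the values of \(x_i\) and \(w\) are made up), the \(\pm 1\) encoding lets both class probabilities share one expression:

```python
import numpy as np

def sigmoid(z):
    """Logistic response function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def likelihood(c, f):
    """p(c_i | f_i) = sigma(c_i * f_i) for labels c_i in {-1, +1}."""
    return sigmoid(c * f)

# Toy example with a single data point: f_i = x_i' w
x_i = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
f_i = x_i @ w
print(likelihood(+1, f_i), likelihood(-1, f_i))  # sums to 1, since sigma(-z) = 1 - sigma(z)
```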

Here, \(f\) is a latent function \(f(x)\) that we do not observe, and it is treated as stochastic, so we place Gaussian priors on \(f\) and on the weights \(w\): \(f\sim MVN(0, \Sigma_f)\) and \(w \sim MVN(0, \Sigma_w)\). Given a dataset \(D=\left\{(x_i, c_i), i=1, \dots, n\right\}\) and a new data point \((x^*, c^*)\), the posterior conditional distribution for predicting the latent value \(f^*\) at \(x^*\) is

\[ p(f^*|D, x^*) = \int p(f^*|x, x^*, f)p(f|D)df, \mbox{ where } p(f|D)\propto p(c|f)p(f|x). \]

Since \(p(f|D)\) is not Gaussian, we approximate it (the Laplace approximation) by \[q(f|D) = MVN(\hat{f}, (K^{-1}+W)^{-1}), \mbox{ where }(K^{-1}+W) \mbox{ is the Hessian matrix of }-\log p(f|D) \mbox{ at } \hat{f}.\]

Here \(K\) is the prior covariance (kernel) matrix of \(f\), \(W = -\nabla\nabla \log p(c|f)\) is a diagonal matrix, and \(\hat{f}\) is the posterior mode that maximizes \(p(f|D)\); it is computed iteratively using Newton’s method.
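As a rough illustration, here is a naive NumPy sketch of that Newton iteration (the numerically stable formulation in [1], Algorithm 3.1, factors \(B = I + W^{1/2} K W^{1/2}\) instead; the kernel matrix \(K\) is assumed to be given):

```python
import numpy as np

def laplace_mode(K, c, n_iter=50, tol=1e-8):
    """Newton's method for the mode f_hat of p(f | D) under a logistic likelihood.

    K : (n, n) prior covariance (kernel) matrix of the latent f.
    c : (n,) labels encoded as -1 / +1.
    """
    n = len(c)
    f = np.zeros(n)
    t = (c + 1) / 2.0                          # labels mapped to {0, 1}
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))          # sigma(f)
        grad = t - pi                          # gradient of log p(c | f)
        W = np.diag(pi * (1.0 - pi))           # W = -Hessian of log p(c | f)
        # Newton step: f_new = (K^{-1} + W)^{-1} (W f + grad) = K (I + W K)^{-1} (W f + grad)
        f_new = K @ np.linalg.solve(np.eye(n) + W @ K, W @ f + grad)
        if np.max(np.abs(f_new - f)) < tol:
            return f_new
        f = f_new
    return f
```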

Replacing \(p(f|D)\) with \(q(f|D)\), the predictive distribution \(p(f^*|D, x^*)\) above becomes a Gaussian \(q(f^*|D, x^*)\), with mean and variance given in [1], p. 44.
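Concretely, those are \(\mathbb{E}[f^*|D,x^*] = k_*'\nabla \log p(c|\hat{f})\) and \(\mathrm{Var}[f^*|D,x^*] = k(x^*,x^*) - k_*'(K+W^{-1})^{-1}k_*\) ([1], p. 44). A small sketch continuing the one above (\(k_*\) and \(k(x^*,x^*)\) are assumed to come from the same kernel as \(K\)):

```python
import numpy as np

def latent_predictive(K, k_star, k_star_star, c, f_hat):
    """Mean and variance of the Gaussian approximation q(f* | D, x*).

    k_star      : (n,) vector of prior covariances k(x*, x_i).
    k_star_star : scalar prior variance k(x*, x*).
    f_hat       : (n,) Laplace mode from the Newton sketch above.
    """
    pi = 1.0 / (1.0 + np.exp(-f_hat))          # sigma(f_hat)
    t = (c + 1) / 2.0
    mean = k_star @ (t - pi)                   # k*' grad log p(c | f_hat)
    W = np.diag(pi * (1.0 - pi))
    # var = k(x*, x*) - k*' (K + W^{-1})^{-1} k*
    var = k_star_star - k_star @ np.linalg.solve(K + np.linalg.inv(W), k_star)
    return mean, var
```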

Given the dataset \(D\) and a new data point \((x^*, c^*)\), the posterior predictive probability is

\[ p(c^*=+1|D, x^*) = \int \sigma(f^*)q(f^*|D, x^*) df^*. \] This one-dimensional integral can be approximated analytically using the Gaussian integral of an error function, i.e., by approximating the logistic function with a probit.
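For instance, a minimal sketch using MacKay's probit-style approximation \(\int \sigma(f)\,N(f|\mu, s^2)\,df \approx \sigma\big(\mu / \sqrt{1 + \pi s^2/8}\big)\) (one common way to carry out that error-function trick; numerical quadrature would also work):

```python
import numpy as np

def predictive_prob(mean, var):
    """Approximate p(c* = +1 | D, x*) = ∫ sigma(f*) q(f* | D, x*) df*."""
    # Approximate the logistic by a scaled probit, so the Gaussian
    # integral reduces to an error function with a closed form.
    kappa = 1.0 / np.sqrt(1.0 + np.pi * var / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mean))
```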

For a TensorFlow implementation, check out my repo GPC-tensorflow. Also check out TensorFlow Probability and the Tensorflow-probability repo.

[1] Rasmussen, Carl Edward, and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
