MAP Estimates with Laplace Priors

The maximum a posteriori (MAP) estimate of the parameters is defined to be the parameter setting that minimizes the error function. The error function for parameters $\beta$ under logistic regression with Laplace priors consists of a likelihood component and a prior component. The likelihood component is the negative log likelihood of the training labels given the training data and parameters. The prior component is the negative log density of the parameters under the prior.

We used Laplace priors with zero means and variances $\sigma^2$ that varied by task. The intercept is assumed to be the first feature (position 0) and is always given a noninformative prior (equivalently, a Laplace prior with infinite variance, $\sigma^2=\infty$). All other features share the same prior variance.

The Laplace prior (equivalently regularization or shrinkage with the $L_1$ norm, also known as the lasso) enforces a preference for parameters that are zero, but otherwise is more dispersed than a Gaussian prior (equivalently regularization or shrinkage with the $L_2$ norm, also known as ridge regression). A number of experiments in natural language classification have shown the Laplace prior to be much more robust and accurate under cross-validation than either maximum likelihood estimation or the more common Gaussian prior [Genkin et al., Goodman 03].

Given a sequence of $n$ data points $D = \langle x_j, c_j \rangle_{j < n}$, with $x_j \in \mathbb{R}^d$ and $c_j \in \{ 0, \ldots, k-1 \}$, the log likelihood of the data in a model with parameter matrix $\beta$ is:

\begin{displaymath}
\log p(D \mid \beta)
= \log \prod_{j < n} p(c_j \mid x_j, \beta)
= \sum_{j < n} \log p(c_j \mid x_j, \beta)
\end{displaymath} (4)
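
As a concrete illustration of equation (4), the following is a minimal sketch assuming NumPy and a parameter matrix $\beta$ of shape $(k-1) \times d$, in which class $k-1$ serves as a reference class with a fixed zero parameter vector; the helper name and array layout are illustrative conventions rather than part of the model.

\begin{verbatim}
import numpy as np
from scipy.special import logsumexp

def log_likelihood(beta, X, c):
    """Sum over j < n of log p(c_j | x_j, beta) for multinomial logistic
    regression; beta has shape (k-1, d), X has shape (n, d), and c holds
    the class labels in {0, ..., k-1}, with class k-1 as the reference."""
    n = X.shape[0]
    # Linear scores for the k-1 free classes, plus a zero column for the
    # reference class.
    scores = np.hstack([X @ beta.T, np.zeros((n, 1))])
    # log p(c | x_j, beta) for every class c, via the softmax normalizer.
    log_probs = scores - logsumexp(scores, axis=1, keepdims=True)
    return log_probs[np.arange(n), c].sum()
\end{verbatim}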

A Laplace, or double-exponential, prior with zero means and diagonal covariance is specified by a variance parameter $\sigma_i^2$ for each dimension $i < d$. The prior density for the full parameter matrix is:

\begin{displaymath}
p(\beta \mid \sigma^2) = \prod_{c < k-1} \ \prod_{i < d} \ \textsf{Laplace}(0,\sigma_i^2)(\beta_{c,i})
\end{displaymath} (5)

where the Laplace density function with mean $0$ and variance $\sigma_i^2$ is:
\begin{displaymath}
\textsf{Laplace}(0,\sigma_i^2)(\beta_{c,i})
=
\frac{\sqrt{2}}{2 \sigma_i} \exp\left( - \sqrt{2} \ \frac{\mid \beta_{c,i} \mid}{\sigma_i} \right)
\end{displaymath} (6)
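
For concreteness, here is a minimal sketch of the log of the prior in equations (5) and (6), assuming NumPy; the helper name, the array layout, and the option to skip dimension 0 (the noninformative intercept prior described above) are illustrative conventions rather than part of the model.

\begin{verbatim}
import numpy as np

def log_prior(beta, sigma2, skip_intercept=True):
    """log p(beta | sigma^2): independent Laplace(0, sigma_i^2) factors over
    classes c < k-1 and dimensions i < d, as in equation (5); beta has
    shape (k-1, d) and sigma2 has shape (d,)."""
    sigma = np.sqrt(sigma2)
    # Per-coefficient log density from equation (6):
    #   log( sqrt(2) / (2 sigma_i) ) - sqrt(2) |beta_{c,i}| / sigma_i
    log_dens = np.log(np.sqrt(2.0) / (2.0 * sigma)) - np.sqrt(2.0) * np.abs(beta) / sigma
    start = 1 if skip_intercept else 0       # dimension 0 is the intercept
    return log_dens[:, start:].sum()
\end{verbatim}

Each per-coefficient term agrees with scipy.stats.laplace.logpdf evaluated with scale $\sigma_i/\sqrt{2}$, since a Laplace distribution with scale $b$ has variance $2b^2$.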

The Laplace prior has fatter tails than the Gaussian, and is also more concentrated around zero. Thus it is more likely to shrink a coefficient to zero than the Gaussian, while also being more lenient in allowing larger coefficients.
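
A quick numeric illustration of this comparison, assuming unit variance for both densities: the Laplace density is higher than the Gaussian at zero, lower at moderate values, and higher again far out in the tails.

\begin{verbatim}
import numpy as np
from scipy.stats import laplace, norm

# Both densities have variance 1: Laplace with scale 1/sqrt(2), standard normal.
for x in [0.0, 1.0, 4.0]:
    print(x, laplace.pdf(x, scale=1.0 / np.sqrt(2.0)), norm.pdf(x))
\end{verbatim}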

The full error function is the sum of the negative log likelihood and the negative log prior, given parameters $\beta$, training data $D$ and prior variance $\sigma^2$:

\begin{eqnarray*}
\textrm{Err}(\beta,D,\sigma^2)
& = & - \log p(D \mid \beta) - \log p(\beta \mid \sigma^2) \\
& = & \sum_{j < n} - \log p(c_j \mid x_j, \beta)
      \ + \ \sum_{c < k-1} \ \sum_{0 < i < d} - \log \textsf{Laplace}(0,\sigma_i^2)(\beta_{c,i})
\end{eqnarray*}
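
Combining the two components gives a short sketch of the error function, reusing the log_likelihood and log_prior helpers sketched above (both names are illustrative); the prior term omits the intercept column, matching the noninformative prior on position 0.

\begin{verbatim}
def error(beta, X, c, sigma2):
    """Err(beta, D, sigma^2) = -log p(D | beta) - log p(beta | sigma^2)."""
    return -log_likelihood(beta, X, c) - log_prior(beta, sigma2)
\end{verbatim}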

This error function is convex and thus has a unique minimum, which is the MAP estimate.
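
As a toy illustration of finding that minimum, one can run a derivative-free optimizer on the error sketch above over a small synthetic data set; this is only a sketch under the conventions assumed earlier (the $\mid \beta \mid$ term is not differentiable at zero, so specialized solvers are normally used in practice).

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, k = 200, 3, 2                     # binary problem: beta has shape (k-1, d)
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])   # column 0 = intercept
c = (X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)        # labels in {0, 1}
sigma2 = np.full(d, 1.0)                # shared prior variance for all features

res = minimize(lambda b: error(b.reshape(k - 1, d), X, c, sigma2),
               x0=np.zeros((k - 1) * d), method="Powell")
beta_map = res.x.reshape(k - 1, d)      # the MAP estimate
\end{verbatim}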
