Basic examples of Maximum Likelihood Estimation
In this post, we see how to use maximum likelihood to estimate the best parameters for a few of the most common distributions.
Estimation of the best parameter for iid Exponential Distributions
Let \( X_1, X_2, \dotsc, X_m \) be a random sample from the exponential distribution with probability density functions of the form \( f_\theta(x) = \tfrac{1}{\theta}e^{-x/\theta} \) for \( x >0 \) and any parameter \( \theta >0. \) The likelihood function is then given as the product \( \mathcal{L}(\theta; x_1, \dotsc, x_m) = \prod_{k=1}^m f_\theta(x_k) = \theta^{-m} \exp\big( -\tfrac{1}{\theta} \sum_{k=1}^m x_k \big). \)
We look for the parameter value \( \theta>0 \) that yields an absolute maximum of \( \mathcal{L}. \) Notice that, since the logarithm is a one-to-one increasing function, \( \mathcal{L} \) and \( \log\mathcal{L} \) attain their maxima at the same value of \( \theta. \) The latter expression is easier to handle than the former, so we use it to look for the extrema in the usual way:
Set \( g(\theta) = \log \mathcal{L}(\theta; x_1, \dotsc, x_m) = -m \log(\theta) - \tfrac{1}{\theta} \sum_{k=1}^m x_k; \) it is then \( g'(\theta) = -\tfrac{m}{\theta} + \tfrac{1}{\theta^2}\sum_{k=1}^m x_k. \) Note that \( g'(\theta) = 0 \) if and only if \( \theta = \tfrac{1}{m} \sum_{k=1}^m x_k, \) which happens to be positive and is indeed a maximum of \( \mathcal{L}(\theta;x_1,\dotsc,x_m). \)
Note that the parameter \( \theta \) found is nothing but the arithmetic mean \( \bar{x} \) of \( \{x_1, \dotsc, x_m\}. \)
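As a quick numerical sanity check, here is a minimal sketch (assuming NumPy and SciPy are available; the seed, sample size, and true parameter below are arbitrary choices for illustration) comparing the closed-form estimate \( \bar{x} \) with a direct numerical maximization of the log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=0)
m, true_theta = 1000, 2.5          # arbitrary sample size and true parameter
x = rng.exponential(scale=true_theta, size=m)

# Negative of g(theta) = -m*log(theta) - (1/theta)*sum(x_k)
def neg_log_likelihood(theta):
    return m * np.log(theta) + x.sum() / theta

theta_closed_form = x.mean()       # closed-form MLE: the arithmetic mean

# Numerical maximization of the log-likelihood (minimization of its negative)
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded")

print(theta_closed_form, result.x)  # the two values should agree closely
```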
Estimation of the best parameter for iid Geometric Distributions
In this case, the random sample \( X_1, \dotsc, X_m \) for the Geometric distribution has probability mass functions of the form \( f_p(n) = p (1-p)^{n-1} \) for any \( n \in \mathbb{N} \) and parameter \( p \in [0,1]. \) We proceed as in the previous example, by looking for extrema of the log-likelihood function:
- Set \( \mathcal{L}(p;n_1,\dotsc,n_m) = p^m (1-p)^{-m+n_1+\dotsb+n_m} \) for \( 0 \leq p \leq 1. \)
- Consider \( g(p) = \log \mathcal{L}(p;n_1, \dotsc, n_m) = m\log(p) + \bigg( -m + \displaystyle{\sum_{k=1}^m} n_k \bigg) \log(1-p), \) but only for \( 0<p<1. \)
- It is then \( g'(p) = \dfrac{m}{p} - \dfrac{1}{1-p}\bigg( -m + \displaystyle{\sum_{k=1}^m} n_k\bigg). \)
- \( g'(p) = 0 \) if and only if \( p =\dfrac{m}{\sum_{k=1}^m n_k}. \)
This time, the solution \( p \) coincides with the inverse of the arithmetic mean \( \bar{n} \) of the samples \( \{n_1, \dotsc, n_m\} \) (which is trivially positive and at most one). It is not hard to prove that this critical point is a maximum, and therefore it is the parameter that we are looking for.
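The same kind of numerical check works here (again a minimal sketch with arbitrary seed, sample size, and true parameter; NumPy's geometric sampler returns values in \( \{1, 2, 3, \dotsc\}, \) matching the mass function above):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=0)
m, true_p = 1000, 0.3              # arbitrary sample size and true parameter
n = rng.geometric(true_p, size=m)  # samples in {1, 2, 3, ...}

# Negative of g(p) = m*log(p) + (sum(n_k) - m)*log(1 - p)
def neg_log_likelihood(p):
    return -(m * np.log(p) + (n.sum() - m) * np.log(1 - p))

p_closed_form = 1 / n.mean()       # closed-form MLE: inverse of the sample mean

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(p_closed_form, result.x)     # the two values should agree closely
```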
Estimation of the best parameter for iid Poisson Distributions
The random variables in this case have probability mass functions given by \( f_\lambda(n) = \dfrac{\lambda^n e^{-\lambda}}{n!} \) for any \( n \in \mathbb{N}, \) and parameter \( \lambda>0. \)
- Set \( \mathcal{L}(\lambda;n_1,\dotsc,n_m) = e^{-m\lambda} \dfrac{\lambda^{n_1+\dotsb+n_m}}{n_1! \dotsb n_m!}. \)
- Set \( g(\lambda) = \log \mathcal{L}(\lambda; n_1, \dotsc, n_m) = -m\lambda -\log (n_1! \dotsb n_m!)+(\log \lambda) \displaystyle{\sum_{k=1}^m} n_k . \)
- Its derivative is given by \( g'(\lambda) = -m + \dfrac{1}{\lambda} \displaystyle{\sum_{k=1}^m} n_k. \)
- Note that \( g'(\lambda) = 0 \) if and only if \( \lambda = \dfrac{1}{m} \displaystyle{\sum_{k=1}^m} n_k, \) which is trivially a maximum for \( \mathcal{L}. \)
As in the case of exponential distributions, the computed parameter \( \lambda \) is the arithmetic mean \( \bar{n} \) of \( \{n_1, \dotsc, n_m\}. \)
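Once more, a minimal numerical sketch (arbitrary seed, sample size, and true rate; the constant term \( \log(n_1! \dotsb n_m!) \) is dropped since it does not depend on \( \lambda \)) confirms the closed-form answer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(seed=0)
m, true_lam = 1000, 4.0            # arbitrary sample size and true parameter
n = rng.poisson(true_lam, size=m)

# Negative log-likelihood, dropping the constant log(n_1! ... n_m!) term
def neg_log_likelihood(lam):
    return m * lam - n.sum() * np.log(lam)

lam_closed_form = n.mean()         # closed-form MLE: the sample mean

result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100), method="bounded")

print(lam_closed_form, result.x)   # the two values should agree closely
```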
Estimation of the best parameter for iid Normal Distributions
This case is a bit different, since we are dealing with two parameters instead of one: Assume \( X_1, X_2, \dotsc, X_m \) is a random sample from the normal distribution with probability density functions of the form \( f_{\mu,\sigma}(t) = (2\pi\sigma^2)^{-1/2} \exp \big( - \tfrac{(t-\mu)^2}{2\sigma^2} \big) \) for any \( t \in \mathbb{R}, \) and parameters \( \mu \in \mathbb{R}, \) \( \sigma > 0. \) For ease of computation below, and since the parameter \( \sigma \) always appears squared in the expression of \( f, \) we prefer to work instead with \( f_{\mu,\theta}(t) = (2\pi\theta)^{-1/2} \exp \big( - \tfrac{(t-\mu)^2}{2\theta} \big), \) and require the parameter \( \theta \) to be positive. Note the abuse of notation, and that this does not affect the final result. We proceed to compute the likelihood function and its logarithm as before:
- \( \mathcal{L}(\mu,\theta;t_1,\dotsc,t_m) = \bigg( \dfrac{1}{\sqrt{2\pi\theta}} \bigg)^m \exp \bigg( -\dfrac{1}{2\theta} \displaystyle{\sum_{k=1}^m} (t_k-\mu)^2 \bigg) \)
- \( g(\mu,\theta) = \log\mathcal{L}(\mu,\theta;t_1,\dotsc,t_m) = -\dfrac{m}{2}\log(2\pi\theta) - \dfrac{1}{2\theta} \displaystyle{\sum_{k=1}^m} (t_k-\mu)^2. \)
- The partial derivatives of \( g \) are given by
\begin{equation} \dfrac{\partial g}{\partial \mu}(\mu,\theta) = \dfrac{1}{\theta} \displaystyle{\sum_{k=1}^m} (t_k - \mu),\qquad \dfrac{\partial g}{\partial \theta}(\mu,\theta) = -\dfrac{m}{2\theta} + \dfrac{1}{2\theta^2} \displaystyle{\sum_{k=1}^m} (t_k-\mu)^2 \end{equation}
- Note that \( \dfrac{\partial g}{\partial \mu}(\mu,\theta) = 0 \) if and only if \( \mu = \dfrac{1}{m} \displaystyle{\sum_{k=1}^m} t_k. \) Let us denote it by \( \bar{t}, \) since it represents the mean of the values \( \{ t_1, \dotsc, t_m \}. \)
- Also, by virtue of the previous statement, the unique solution of \( \dfrac{\partial g}{\partial \theta}(\bar{t},\theta) = 0 \) is given by \( \theta = \dfrac{1}{m} \displaystyle{\sum_{k=1}^m} \big(t_k - \bar{t}\big)^2 \). Note that this value, which is positive and hence satisfies the constraint, coincides with the variance \( s^2 \) of the set \( \{ t_1, \dotsc, t_m \}; \) it is thus a valid candidate for \( \theta. \)
- It is not hard to see that the computed critical point \( (\mu,\theta) = (\bar{t}, s^2) \) indeed yields an absolute maximum of \( \log\mathcal{L}(\mu,\theta;t_1,\dotsc,t_m). \) To see this, note that the Hessian of \( g \) is given by:
\begin{align} H(g)(\mu,\theta) &= \begin{pmatrix} \tfrac{\partial^2 g}{\partial \mu^2} & \tfrac{\partial^2 g}{\partial \mu \partial \theta} \\ \tfrac{\partial^2 g}{\partial \theta \partial \mu} & \tfrac{\partial^2 g}{\partial \theta^2}\end{pmatrix} \bigg\rvert_{(\mu,\theta)=(\bar{t},s^2)} \\ &= \begin{pmatrix} -m/\theta & -\sum_{k=1}^m (t_k-\mu)/\theta^2 \\ -\sum_{k=1}^m (t_k-\mu)/\theta^2 & m/(2\theta^2) - \sum_{k=1}^m (t_k-\mu)^2/\theta^3 \end{pmatrix} \bigg\rvert_{(\mu,\theta)=(\bar{t},s^2)} \\ &= \begin{pmatrix} -m/s^2 & 0 \\ 0 & -m/(2s^4) \end{pmatrix}. \end{align}
Its determinant at \( (\mu,\theta) = (\bar{t},s^2) \) is always positive: \( \det H(g)(\bar{t},s^2) = \dfrac{m^2}{2s^6}, \) and since \( \dfrac{\partial^2 g}{\partial \mu^2}(\bar{t},s^2) = -\dfrac{m}{s^2} \) is always negative, a maximum is attained.
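Since there are now two parameters, the numerical check uses a multivariate optimizer; below is a minimal sketch (arbitrary seed, sample size, and true parameters, with `Nelder-Mead` as one reasonable derivative-free choice) comparing \( (\bar{t}, s^2) \) against a direct numerical maximization:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=0)
m, true_mu, true_sigma = 1000, 1.0, 2.0   # arbitrary sample size and parameters
t = rng.normal(true_mu, true_sigma, size=m)

# Negative log-likelihood in (mu, theta), with theta = sigma^2
def neg_log_likelihood(params):
    mu, theta = params
    if theta <= 0:                         # keep the search inside theta > 0
        return np.inf
    return 0.5 * m * np.log(2 * np.pi * theta) + ((t - mu) ** 2).sum() / (2 * theta)

# Closed-form MLE: sample mean and (biased) sample variance
mu_hat, theta_hat = t.mean(), t.var()      # np.var defaults to the 1/m normalization

result = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")

print((mu_hat, theta_hat), result.x)       # the pairs should agree closely
```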