3 The exponential family

3.5 Properties of Y ∼ EF(θ, q)

There are two ways of extracting information about the moments of the EF random variable Y. One is to calculate the cumulant generating function; the other is to investigate expectation properties of the score function. The score is the derivative of the log density function with respect to the parameter. The latter is an interesting preliminary to evaluating maximum likelihood estimates.

3.5.1 EF cumulant generating functions

First, the moment generating function is

𝔼[exp{sY}] = ∫ exp{sy} exp{θy - κ(θ)} q(y) dy
= exp{-κ(θ)} ∫ exp{(s+θ)y} q(y) dy
= exp{-κ(θ)} 𝔼_q[exp{(s+θ)Y}]
= exp{-κ(θ)} M_q(s+θ)
= exp{-κ(θ)} exp{log M_q(s+θ)}
= exp{-κ(θ)} exp{κ(s+θ)}.

Taking logs gives the cgf

K(s) = log 𝔼[exp{sY}] = κ(s+θ) - κ(θ),

where s is such that s+θ lies in Θ. When we need to make it clear that K is the cgf of Y we write K_Y(s).
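
As a numerical sanity check, the identity can be verified by simulation. Below is a minimal Python sketch under an assumed example (not from the text): the base density q is standard normal, so κ(θ) = θ²/2 and the tilted density f(y|θ) is N(θ, 1). It compares a Monte Carlo estimate of log 𝔼[exp{sY}] with κ(s+θ) - κ(θ):

    # Monte Carlo check of K_Y(s) = kappa(s + theta) - kappa(theta).
    # Assumed example: base density q = N(0, 1), for which
    # kappa(t) = t**2 / 2 and the tilted density f(y|theta) is N(theta, 1).
    import numpy as np

    theta, s = 0.7, 0.3
    kappa = lambda t: t**2 / 2.0

    rng = np.random.default_rng(0)
    y = rng.normal(loc=theta, scale=1.0, size=1_000_000)  # draws from f(.|theta)

    empirical = np.log(np.mean(np.exp(s * y)))            # log E[exp{sY}]
    theoretical = kappa(s + theta) - kappa(theta)         # = 0.255 here
    print(empirical, theoretical)                         # agree to ~3 decimals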

A corollary of this result is that, apart from the first, the cumulants of Y ∼ EF(θ, q) are given by the same function as those of Y ∼ q. The mean and variance of Y under f are obtained by evaluating the derivatives of the cgf at s = 0, namely K′(0) and K″(0). Now

K′(s) = (d/ds) K(s) = (d/ds) κ(s+θ)
= [dκ(s+θ)/d(s+θ)] × [d(s+θ)/ds]   [chain rule]
= κ′(s+θ) × 1,

so

K′(0) = κ′(θ) = κ_θ(θ).

Furthermore

K″(s) = κ″(s+θ),

so

K″(0) = κ″(θ) = κ_θθ(θ).

Using the property of a cgf that K′(0) and K″(0) give the mean and variance, we have

𝔼[Y] = κ_θ(θ)   and   var(Y) = κ_θθ(θ).

The subscript denotes differentiation with respect to θ. We prefer this notation to the more usual κ′(θ) because we shall change variables and need to keep track of the argument.
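
These two evaluations can also be checked numerically: central finite differences of κ at θ reproduce the sample mean and variance. A minimal sketch, again under the assumed N(θ, 1) example:

    # Finite-difference check that kappa'(theta) and kappa''(theta) give the
    # mean and variance of Y. Assumed example as above: kappa(t) = t**2 / 2.
    import numpy as np

    theta, h = 0.7, 1e-4
    kappa = lambda t: t**2 / 2.0

    d1 = (kappa(theta + h) - kappa(theta - h)) / (2 * h)                  # ~ E[Y]
    d2 = (kappa(theta + h) - 2 * kappa(theta) + kappa(theta - h)) / h**2  # ~ var(Y)

    rng = np.random.default_rng(1)
    y = rng.normal(theta, 1.0, size=1_000_000)
    print(d1, y.mean())  # both ~ 0.7
    print(d2, y.var())   # both ~ 1.0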

 
Exercise 3.26
Find the cgf of the Poisson distribution in terms of the canonical parameter, and hence its mean and variance.

 

3.5.2 EF cgf under variable transformation

Recall that if Y is from the exponential family, then the random variable X defined by the one-to-one transformation Y = T(X) is also a member of the exponential family. If the transformation is linear, i.e. Y = aX + b, then we can derive the expectation and variance of X from the results above:

𝔼[X] = (𝔼[Y] - b)/a = (κ_θ(θ) - b)/a   and   var(X) = var(Y)/a² = κ_θθ(θ)/a².

For non-linear transformations, however, this is not possible, because expectation does not commute with non-linear functions: in general 𝔼[T⁻¹(Y)] ≠ T⁻¹(𝔼[Y]).

Nevertheless, using the relationship between the random variables X and Y we can show that their moment generating functions are related by:

𝔼_Y[exp{sY}] = ∫ exp{sy} f_Y(y|θ) dy
= ∫ exp{sT(x)} f_Y(T(x)|θ) (dy/dx) dx
= ∫ exp{sT(x)} f_X(x|θ) dx
= 𝔼_X[exp{sT(X)}].

We can therefore obtain the moment generating function of the sufficient statistic T(X) from our earlier result:

M_{T(X)}(s) = 𝔼_X[exp{sT(X)}] = 𝔼_Y[exp{sY}] = exp{κ(s+θ) - κ(θ)},

and hence the cgf of the sufficient statistic:

K_{T(X)}(s) = κ(s+θ) - κ(θ).

Following the same argument as before, the expectation and variance of T(X) are:

𝔼[T(X)] = κ_θ(θ)   and   var(T(X)) = κ_θθ(θ).
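
To illustrate with an assumed example (not from the text): take Y ∼ N(θ, 1) as before and X = exp(Y), so X is log-normal with sufficient statistic T(x) = log x. The moments of T(X) come straight from κ, even though 𝔼 and log do not commute:

    # Moments of the sufficient statistic T(X) under a non-linear transform.
    # Assumed example: Y ~ N(theta, 1), X = exp(Y) (log-normal), T(x) = log(x).
    import numpy as np

    theta = 0.7
    rng = np.random.default_rng(2)
    x = np.exp(rng.normal(theta, 1.0, size=1_000_000))  # X = exp(Y)

    t = np.log(x)                # sufficient statistic T(X)
    print(t.mean(), t.var())     # ~ (0.7, 1.0) = (kappa_theta, kappa_thetatheta)
    print(np.log(x.mean()))      # ~ theta + 1/2, not theta: E and log do not commute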

 
Exercise 3.27
For the random variable X ∼ Gamma(α, 1), find expressions for 𝔼[log X] and var(log X).

 

3.5.3 EF score and information functions

First, define the log-likelihood function for the canonical parameter θ, based on a single observation y, by:

ℓ(θ) = log f(y|θ).

Take care: the log-likelihood can also be written as a function of the mean parameter μ. We are about to show that the relationship between μ and θ is invertible, so that the two log-likelihoods are consistent. When we need to make y explicit we write ℓ(θ; y).

Denote the derivatives of the log-likelihood with respect to θ by:

ℓ_θ = dℓ(θ)/dθ   and   ℓ_θθ = d²ℓ(θ)/dθ².
Definition 3.5.1.

The score function is ℓ_θ(θ) and the curvature function is ℓ_θθ(θ).

Definition 3.5.2.

The observed information function is -ℓ_θθ(θ).

Mean and variance of the score

In principle, both the score function and the observed information function are functions of the canonical parameter θ and the observation y. Taking the expected value of the score function over Y results in:

𝔼_Y[ℓ_θ(θ; Y)] = 0.

PROOF:

𝔼_Y[ℓ_θ(θ; Y)] = ∫ ℓ_θ(θ; y) f(y|θ) dy
= ∫ (d/dθ){log f(y|θ)} f(y|θ) dy
= ∫ [(d/dθ)f(y|θ) / f(y|θ)] f(y|θ) dy
= ∫ (d/dθ)f(y|θ) dy
= (d/dθ) ∫ f(y|θ) dy   [interchanging differentiation and integration]
= (d/dθ){1} = 0.

The variance of the score function is:

var_Y(ℓ_θ(θ; Y)) = -𝔼_Y[ℓ_θθ(θ; Y)].

PROOF:

var_Y(ℓ_θ(θ; Y)) + 𝔼_Y[ℓ_θθ(θ; Y)]
   = 𝔼_Y[ℓ_θ(θ; Y)²] - 𝔼_Y[ℓ_θ(θ; Y)]² + 𝔼_Y[ℓ_θθ(θ; Y)]
   = 𝔼_Y[ℓ_θ(θ; Y)² + ℓ_θθ(θ; Y)]   [since 𝔼_Y[ℓ_θ(θ; Y)] = 0]
   = ∫ [{(d/dθ) log f(y|θ)}² + (d²/dθ²) log f(y|θ)] f(y|θ) dy
   = ∫ [{(d/dθ) log f(y|θ)} (d/dθ)f(y|θ)/f(y|θ) + (d²/dθ²) log f(y|θ)] f(y|θ) dy
   = ∫ [(d/dθ) log f(y|θ) × (d/dθ)f(y|θ) + (d²/dθ²) log f(y|θ) × f(y|θ)] dy
   = ∫ (d/dθ)[(d/dθ) log f(y|θ) × f(y|θ)] dy   [by the product rule]
   = ∫ (d/dθ)[(d/dθ)f(y|θ)/f(y|θ) × f(y|θ)] dy
   = ∫ (d²/dθ²) f(y|θ) dy
   = (d²/dθ²) ∫ f(y|θ) dy = (d²/dθ²)[1] = 0.

This is the big news! We now specialize these results to EF distributions. The log-likelihood function, the score function and the curvature function, for a single observation, are

ℓ(θ) = θy - κ(θ) + log q(y)
ℓ_θ(θ) = y - κ_θ(θ)
ℓ_θθ(θ) = -κ_θθ(θ).

The score is linear in y and the observed information is constant with respect to y. Hence the latter is identical to the Fisher or expected information.

This provides us with an alternative way to find the first two moments of Y. From above, the expectation of Y is:

𝔼_Y[Y] = 𝔼_Y[ℓ_θ(θ; Y) + κ_θ(θ)] = 𝔼_Y[ℓ_θ(θ; Y)] + κ_θ(θ) = κ_θ(θ)

and the variance of Y is:

var_Y(Y) = var_Y(ℓ_θ(θ; Y) + κ_θ(θ)) = var_Y(ℓ_θ(θ; Y)) = -𝔼_Y[ℓ_θθ(θ; Y)] = κ_θθ(θ).

A sufficient condition for a function to be strictly convex is that its second derivative be strictly positive. Hence the function κ(θ) is strictly convex on Θ, because its second derivative is κ_θθ(θ) = var(Y), which is strictly positive for any non-degenerate Y.
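
Both the score identities and the resulting moment formulas are easy to check by simulation. A minimal sketch under the assumed N(θ, 1) example, where κ_θ(θ) = θ and κ_θθ(θ) = 1:

    # Check that the EF score y - kappa_theta(theta) has mean 0 and
    # variance kappa_thetatheta(theta). Assumed example: Y ~ N(theta, 1),
    # so kappa_theta(theta) = theta and kappa_thetatheta(theta) = 1.
    import numpy as np

    theta = 0.7
    rng = np.random.default_rng(3)
    y = rng.normal(theta, 1.0, size=1_000_000)

    score = y - theta        # l_theta(theta; y) = y - kappa_theta(theta)
    print(score.mean())      # ~ 0: E[score] = 0
    print(score.var())       # ~ 1: var(score) = kappa_thetatheta(theta)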

3.5.4 Canonical parameter MLE

Consider a set of independent and identically distributed realisations y_1, …, y_n from a random variable belonging to the exponential family with pmf/pdf:

f(y_i|θ) = q(y_i) exp{y_iθ - κ(θ)}   for i = 1, …, n.

The log-likelihood of the canonical parameter given all realisations is the sum of the log-likelihood contributions from each y_i:

ℓ(θ) = Σ_{i=1}^n ℓ(θ; y_i)
= Σ_{i=1}^n {θy_i - κ(θ) + log q(y_i)}
= θ Σ_{i=1}^n y_i - nκ(θ) + Σ_{i=1}^n log q(y_i).

Differentiating gives the score and curvature functions:

ℓ_θ(θ) = Σ_{i=1}^n y_i - nκ_θ(θ)
ℓ_θθ(θ) = -nκ_θθ(θ),

where κ_θ(θ) and κ_θθ(θ) respectively denote the first and second derivatives of the function κ(θ).

The maximum likelihood estimate (MLE) of the canonical parameter, θ̂, is found by locating the root of the score function:

0 = Σ_{i=1}^n y_i - nκ_θ(θ̂)   ⟹   κ_θ(θ̂) = (1/n) Σ_{i=1}^n y_i.

An analytical expression for the MLE is found by inverting the derivative function κ_θ, giving θ̂ = κ_θ⁻¹(ȳ), where ȳ is the sample mean. If an analytical inverse does not exist, the MLE can be determined using a numerical algorithm such as Newton-Raphson, as sketched below.
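
Here is a minimal Newton-Raphson sketch in Python. The family is an assumed example (not from the text), chosen so that κ_θ genuinely has no closed-form inverse: the EF generated by the density of Z + W with Z ∼ N(0, 1) and W ∼ Poisson(1), whose cgf is κ(θ) = θ²/2 + e^θ - 1, so κ_θ(θ) = θ + e^θ. The helper name newton_raphson_mle is ours, not from any library:

    # Newton-Raphson for the canonical-parameter MLE: repeat
    #   theta <- theta + score / (n * kappa_thetatheta(theta))
    # until the step is negligible. Assumed family: kappa(t) = t**2/2 + exp(t) - 1,
    # so kappa_theta(t) = t + exp(t) has no closed-form inverse.
    import numpy as np

    kappa_theta = lambda t: t + np.exp(t)
    kappa_thetatheta = lambda t: 1.0 + np.exp(t)

    def newton_raphson_mle(y, theta0=0.0, tol=1e-10, max_iter=100):
        """Solve 0 = sum(y) - n * kappa_theta(theta) for theta."""
        n, total = len(y), float(np.sum(y))
        theta = theta0
        for _ in range(max_iter):
            score = total - n * kappa_theta(theta)    # l_theta(theta)
            info = n * kappa_thetatheta(theta)        # observed information
            step = score / info
            theta += step
            if abs(step) < tol:
                break
        return theta

    y = np.array([1.2, 3.0, 0.4, 2.2, 1.7])           # y_bar = 1.7
    theta_hat = newton_raphson_mle(y)
    print(theta_hat, kappa_theta(theta_hat))          # kappa_theta(theta_hat) ~ 1.7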

The observed information for the canonical parameter evaluated at some value θ* is

I_O(θ*) = -ℓ_θθ(θ*) = nκ_θθ(θ*).

Note that the observed information for the canonical parameter of a pmf/pdf belonging to the exponential family does not depend on the values of the realisations, but only on how many realisations there are. It follows that the expected information at θ* is:

I_E(θ*) = 𝔼_Y[I_O(θ*)] = 𝔼_Y[nκ_θθ(θ*)] = nκ_θθ(θ*).

In many instances, the observed and expected information are evaluated at the MLE, θ* = θ̂.

 
Exercise 3.28
Recall that the Poisson(λ) distribution belongs to the exponential family with canonical parameter θ = log(λ) and:

κ(θ) = exp{θ} - 1.

Find the MLE of the canonical parameter, θ̂, for independent and identically distributed realisations y_1, …, y_n from the Poisson distribution. Also derive an expression for the expected information at the MLE.

 

3.5.5 The mean function and moment parameter

A central role in EF theory is played by the function that computes the moment parameter μ from the value of the canonical parameter θ.

Definition 3.5.3.

The mean function m is the mapping from Θ to Ω given by

μ = 𝔼[Y] = m(θ) = κ_θ(θ).

The mean function is sometimes known as the mean value function. The moment parameter space Ω is the range space of the mapping from θ to μ.

The mean function plays an important part in the theory. For example, maximum likelihood estimates of θ turn out to be solutions to the equation m(θ)=y.

 
Exercise 3.29
Find the mean function for the Poisson distribution, in terms of the canonical parameter, and its inverse.

 

 
Exercise 3.30
Find the cgf of Binom(k, 1/2).

Find the mean function for the binomial distribution Binom(k, π) with 0 < π < 1, generated by exponentially tilting Binom(k, 1/2).

 

3.5.6 The inverse mean function or canonical link function

From the discussion of κ it follows that m, too, is continuous and differentiable on the interior of Θ. The first derivative of m(θ) is

m_θ = κ_θθ = var(Y),

so that m_θ(θ) > 0 for all θ. Hence m(θ) is a strictly increasing function of θ, as portrayed in Figure 3.1.

[Figure 3.1]

Consequently the mapping from θ to μ is invertible. Hence there is a function m⁻¹ from Ω to Θ such that

θ = m⁻¹(μ).

By the inverse function rule of differentiation,

dθ/dμ = (dμ/dθ)⁻¹ = 1/m_θ = 1/var(Y) > 0.

The inverse mean function m-1 plays the role of the link function g when the canonical parameter is chosen to be the linear predictor.
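
In practice m⁻¹ rarely needs a closed form: because m is strictly increasing, m⁻¹(μ) can be computed by one-dimensional root finding. A sketch using the same assumed family as above (m(θ) = θ + e^θ) and SciPy's brentq root finder:

    # Numerical inverse mean function via root finding. Assumed family as
    # above: m(theta) = theta + exp(theta), strictly increasing in theta.
    import numpy as np
    from scipy.optimize import brentq

    m = lambda t: t + np.exp(t)

    def m_inverse(mu, lo=-50.0, hi=50.0):
        """Solve m(theta) = mu; the root is unique since m is increasing."""
        return brentq(lambda t: m(t) - mu, lo, hi)

    mu = 1.7
    theta = m_inverse(mu)
    print(theta, m(theta))    # m(m_inverse(mu)) recovers mu = 1.7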

ML from a single observation

Recall that the score function based on a single observation, and obtained by differentiating the log-likelihood as a function of the canonical parameter θ, is

ℓ_θ(θ) = y - κ_θ(θ) = y - m(θ),

using the definition of the mean function m. Hence if θ is a free parameter then its ML estimate satisfies

0 = y - m(θ̂)   ⟹   θ̂ = m⁻¹(y),

as m is invertible. Finally, since μ = m(θ) we have μ̂ = m(θ̂), so

μ̂ = y.

Equating observed with theoretical moments is a general feature of the likelihood equations in GLMs.

 
Exercise 3.31
Derive the MLE for the canonical parameter for a single measurement, y, of a Poisson random variable.

 

 
Exercise 3.32
Find an expression for the MLE of the canonical parameter, θ̂, given a single observation, x, from X ∼ Gamma(α, 1).

[Figure 3.2: Relationship between the observation x and the MLE θ̂.]

 

3.5.7 The variance function

The relationship of the variance of Y to the mean of Y characterizes linear EF distributions. We want to compute var(Y) in terms of the moment parameter μ=𝔼[Y].

Definition 3.5.4.

With μ=𝔼[Y], the variance function, v, from Ω to the positive real line is

v(μ) = var(Y),

expressed as a function of μ.

At first sight this definition expresses the variance in terms of the canonical parameter θ, since

v(μ) = κ_θθ(θ) = m_θ(θ).

Substituting for the canonical parameter θ in terms of μ, via θ = m⁻¹(μ), gives

v(μ) = var(Y) = m_θ(θ) = m_θ[m⁻¹(μ)].
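
The same root-finding idea gives the variance function numerically when m⁻¹ has no closed form. A sketch for the assumed family above; as a simulation check it uses the fact that tilting the N(0, 1) plus Poisson(1) convolution by θ yields Y distributed as N(θ, 1) plus an independent Poisson(e^θ):

    # Variance function v(mu) = m_theta(m^{-1}(mu)), computed numerically for
    # the assumed family kappa(t) = t**2/2 + exp(t) - 1, and checked by
    # simulating Y ~ N(theta, 1) + Poisson(exp(theta)) (the tilted density).
    import numpy as np
    from scipy.optimize import brentq

    m = lambda t: t + np.exp(t)          # mean function
    m_theta = lambda t: 1.0 + np.exp(t)  # its derivative = var(Y)

    def v(mu):
        theta = brentq(lambda t: m(t) - mu, -50.0, 50.0)  # theta = m^{-1}(mu)
        return m_theta(theta)

    theta = 0.5
    mu = m(theta)                                         # ~ 2.149
    rng = np.random.default_rng(4)
    y = rng.normal(theta, 1.0, 10**6) + rng.poisson(np.exp(theta), 10**6)
    print(v(mu), y.var())                                 # both ~ 2.649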

 
Exercise 3.33
Find the variance function for the Poisson distribution.

 

 
Exercise 3.34
Find the variance function for the Exponential distribution.