5 IRWLS Algorithm

5.2 Algorithm

The parameter vector $\boldsymbol{\beta}$ of the GLM is estimated by solving the likelihood equations obtained from the log-likelihood function, as follows. In the canonical form, for independent observations $y_1,\dots,y_n$, the log-likelihood is given by:

\[
\ell(\beta_1,\dots,\beta_p) = \sum_{i=1}^{n}\{y_i\theta_i - \kappa(\theta_i)\} + \sum_{i=1}^{n}\log q(y_i).
\]
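For concreteness (this worked instance is added for illustration and is not part of the original text), the Poisson model with mean $\mu_i$ is of this canonical form:

\[
f(y_i;\theta_i) = \frac{e^{-\mu_i}\mu_i^{y_i}}{y_i!} = \exp\{y_i\theta_i - e^{\theta_i}\}\,\frac{1}{y_i!},
\qquad \theta_i = \log\mu_i, \quad \kappa(\theta_i) = e^{\theta_i}, \quad q(y_i) = \frac{1}{y_i!},
\]

so that $\kappa'(\theta_i) = e^{\theta_i} = \mu_i$ and $\kappa''(\theta_i) = \mu_i = \operatorname{var}(Y_i)$.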

Next we take the first derivative of $\ell$ with respect to $\beta_r$, $1\le r\le p$. Note here that $\theta_i$ is a one-to-one function of $\mu_i$ with $d\mu_i/d\theta_i = \kappa''(\theta_i)$, that $\eta_i$ is a one-to-one function of $\mu_i$ through the link function, and finally that $\eta_i = \mathbf{x}_i'\boldsymbol{\beta}$. Hence, by the chain rule:

\[
\frac{d\ell}{d\beta_r} = \sum_{i=1}^{n}\frac{d\ell_i}{d\theta_i}\,\frac{d\theta_i}{d\mu_i}\,\frac{d\mu_i}{d\eta_i}\,\frac{d\eta_i}{d\beta_r},
\]

where

\[
\frac{d\ell_i}{d\theta_i} = y_i - \kappa'(\theta_i) = y_i - \mu_i, \qquad
\frac{d\mu_i}{d\theta_i} = \kappa''(\theta_i) = \operatorname{var}(Y_i), \qquad
\frac{d\eta_i}{d\beta_r} = x_{i,r}.
\]

This leads to the likelihood equations:

\[
\frac{d\ell}{d\beta_r} = \sum_{i=1}^{n}\frac{(y_i-\mu_i)\,x_{i,r}}{\operatorname{var}(Y_i)}\,\frac{d\mu_i}{d\eta_i}, \qquad \text{for } 1\le r\le p.
\]
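As an aside (a standard simplification added here for illustration), under the canonical link $\eta_i = \theta_i$, so that $d\mu_i/d\eta_i = \kappa''(\theta_i) = \operatorname{var}(Y_i)$ and the likelihood equations reduce to

\[
\frac{d\ell}{d\beta_r} = \sum_{i=1}^{n}(y_i - \mu_i)\,x_{i,r} = 0, \qquad 1\le r\le p.
\]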

We write the above likelihood equations in vector form as:

\[
\mathbf{u} = \sum_{i=1}^{n}(y_i-\mu_i)\,\mathbf{x}_i\,\frac{1}{\operatorname{var}(Y_i)}\left(\frac{d\mu_i}{d\eta_i}\right)^{2}\frac{d\eta_i}{d\mu_i}
= \sum_{i=1}^{n}W_i\,(y_i-\mu_i)\,\frac{d\eta_i}{d\mu_i}\,\mathbf{x}_i,
\]

where

\[
W_i = \frac{1}{\operatorname{var}(Y_i)}\left(\frac{d\mu_i}{d\eta_i}\right)^{2} = \frac{1}{\operatorname{var}(Y_i)\,\{g'(\mu_i)\}^{2}}. \tag{5.2}
\]
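Two familiar special cases (added for illustration) make the weights concrete: for the Poisson model with log link, $g'(\mu) = 1/\mu$ and $\operatorname{var}(Y_i) = \mu_i$, while for the Bernoulli model with logit link, $g'(\mu) = 1/\{\mu(1-\mu)\}$ and $\operatorname{var}(Y_i) = \mu_i(1-\mu_i)$, so that

\[
W_i^{\text{Poisson}} = \frac{1}{\mu_i\,(1/\mu_i)^{2}} = \mu_i,
\qquad
W_i^{\text{logistic}} = \frac{1}{\mu_i(1-\mu_i)\,\{1/(\mu_i(1-\mu_i))\}^{2}} = \mu_i(1-\mu_i).
\]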

In the sequel, all sums are over $i$ from 1 to $n$ unless stated otherwise, and the subscript $i$ is omitted from the summands. The $r$-th component, $1\le r\le p$, of $\mathbf{u}$ is:

\[
u_r = \sum W\,(y-\mu)\,\frac{d\eta}{d\mu}\,x_r \tag{5.3}
\]

and let the expectation of the negative Hessian matrix be:

\[
\mathbf{I} = \mathbb{E}\!\left[-\frac{d^{2}\ell}{d\beta_r\,d\beta_s}\right],
\]

where both $\mathbf{u}$ and $\mathbf{I}$ are evaluated at the current estimate $\mathbf{b}$ of $\boldsymbol{\beta}$. Then from (5.3), for $1\le r,s\le p$,

\begin{align*}
-\frac{du_r}{d\beta_s}
&= -\sum\left[(y-\mu)\,\frac{d}{d\beta_s}\!\left\{W\frac{d\eta}{d\mu}\right\}x_r + W\frac{d\eta}{d\mu}\,x_r\,\frac{d}{d\beta_s}(y-\mu)\right]\\
&= -\sum\left[(y-\mu)\,\frac{d}{d\beta_s}\!\left\{W\frac{d\eta}{d\mu}\right\}x_r - W x_r\,\frac{d\eta}{d\mu}\,\frac{d\mu}{d\beta_s}\right]\\
&= -\sum(y-\mu)\,\frac{d}{d\beta_s}\!\left\{W\frac{d\eta}{d\mu}\right\}x_r + \sum W x_r\,\frac{d\eta}{d\beta_s}\\
&= -\sum(y-\mu)\,\frac{d}{d\beta_s}\!\left\{W\frac{d\eta}{d\mu}\right\}x_r + \sum W x_r x_s.
\end{align*}

Hence:

\[
I_{r,s} = -\mathbb{E}\!\left[\frac{du_r}{d\beta_s}\right]
= -\sum\mathbb{E}[Y-\mu]\,\frac{d}{d\beta_s}\!\left\{W\frac{d\eta}{d\mu}\right\}x_r + \sum W x_r x_s
= \sum W x_r x_s. \tag{5.4}
\]
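Equivalently, anticipating the matrix notation used below (this restatement adds nothing beyond (5.4)), the expected information is

\[
\mathbf{I} = \mathbf{X}'\mathbf{W}\mathbf{X},
\]

where $\mathbf{X}$ is the $n\times p$ design matrix with rows $\mathbf{x}_i'$ and $\mathbf{W} = \operatorname{diag}(W_1,\dots,W_n)$.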

To apply Fisher’s scoring method, note that the $r$-th component of $\mathbf{I}\mathbf{b}$ is:

\[
\sum_{s}I_{r,s}\,b_s = \sum_{s}\sum_{i=1}^{n}W_i\,x_{i,r}\,x_{i,s}\,b_s
= \sum_{i=1}^{n}W_i\,x_{i,r}\sum_{s}x_{i,s}\,b_s
= \sum_{i=1}^{n}W_i\,x_{i,r}\,\eta_i, \tag{5.5}
\]

where $\eta_i = \mathbf{x}_i'\mathbf{b}$ is the $i$-th linear predictor evaluated at the current estimate $\mathbf{b}$. Hence from (5.3) and (5.5):

\[
\mathbf{I}\mathbf{b} + \mathbf{u} = \sum W\left\{\eta + (y-\mu)\frac{d\eta}{d\mu}\right\}\mathbf{x} = \sum W z\,\mathbf{x}, \tag{5.6}
\]

where

\[
z_i = z_i(\mathbf{b}) = \eta_i + (y_i-\mu_i)\,\frac{d\eta_i}{d\mu_i} = \mathbf{x}_i'\mathbf{b} + (y_i-\mu_i)\,g'(\mu_i), \tag{5.7}
\]

with all quantities ($\mu_i$ and $\eta_i$) evaluated at the current estimate $\mathbf{b}$. Consequently from (5.1), (5.5) and (5.6):

\[
\mathbf{b}^{*} = \mathbf{I}^{-1}(\mathbf{I}\mathbf{b} + \mathbf{u})
= \left(\sum_{i=1}^{n}W_i\,\mathbf{x}_i\mathbf{x}_i'\right)^{-1}\left(\sum_{i=1}^{n}W_i\,\mathbf{x}_i z_i\right)
= (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z},
\]

where $\mathbf{W}$ is the diagonal matrix with $i$-th diagonal entry $W_i$ of (5.2) and $\mathbf{z} = (z_1,\dots,z_n)'$.
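This update is straightforward to express in code. The following is a minimal sketch (not from the text; the function name and arguments are illustrative) of a single Fisher-scoring/IRWLS step for a Poisson model with log link, where $W_i = \mu_i$ and $z_i = \eta_i + (y_i-\mu_i)/\mu_i$:

\begin{verbatim}
import numpy as np

def irwls_step_poisson(X, y, b):
    """One IRWLS update b* = (X'WX)^{-1} X'Wz for Poisson regression
    with the canonical log link (illustrative sketch)."""
    eta = X @ b              # current linear predictor eta_i = x_i'b
    mu = np.exp(eta)         # inverse link: mu_i = g^{-1}(eta_i)
    W = mu                   # weights (5.2): W_i = mu_i for Poisson/log
    z = eta + (y - mu) / mu  # working response (5.7)
    XtW = X.T * W            # X'W, with the diagonal of W stored as a vector
    return np.linalg.solve(XtW @ X, XtW @ z)
\end{verbatim}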

Remark 1 on implementation: For implementing IRWLS, start with an initial $\mathbf{b}$ and first compute the linear predictors $\eta_i = \mathbf{x}_i'\mathbf{b}$. Then calculate $\mu_i = g^{-1}(\eta_i)$; however, often the initial $\mu_i$'s are taken to be the $y_i$'s and one evaluates $\eta_i = g(\mu_i)$ (there are obvious problems with such a choice, for example when some $y_i$'s are zero and one has to take the logarithm, as in the Poisson case). Finally, $z_i(\mathbf{b})$ of (5.7) is evaluated and the iteration continues; a sketch of the full loop is given below.
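The following sketch (added for illustration, not from the text) puts the pieces together for the Poisson case, starting from $\mu_i = y_i$ with a small offset so that $\log\mu_i$ is defined for zero counts; the offset value 0.5 and the stopping rule are illustrative choices:

\begin{verbatim}
import numpy as np

def irwls_poisson(X, y, tol=1e-8, max_iter=25):
    """Full IRWLS loop for Poisson regression with log link (sketch)."""
    mu = y + 0.5                 # initial mu_i's taken (essentially) as y_i's
    eta = np.log(mu)             # eta_i = g(mu_i)
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        W = mu                   # W_i = mu_i for the Poisson/log case
        z = eta + (y - mu) / mu  # working response (5.7)
        XtW = X.T * W
        b_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(b_new - b)) < tol:
            return b_new
        b = b_new
        eta = X @ b              # recompute eta_i = x_i'b ...
        mu = np.exp(eta)         # ... and mu_i = g^{-1}(eta_i)
    return b
\end{verbatim}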

Remark 2: Consider a multiple linear regression model with observed $z_i$'s of (5.7) defined as:

\begin{align*}
z_i &= \mathbf{x}_i'\boldsymbol{\beta} + (y_i-\mu_i)\,\frac{d\eta_i}{d\mu_i}
     = \mathbf{x}_i'\boldsymbol{\beta} + \frac{y_i-\mu_i}{\{\operatorname{var}(Y_i)\}^{1/2}}\left[\{\operatorname{var}(Y_i)\}^{1/2}\,\frac{d\eta_i}{d\mu_i}\right]\\
    &= \mathbf{x}_i'\boldsymbol{\beta} + \frac{y_i-\mu_i}{\{\operatorname{var}(Y_i)\}^{1/2}}\,W_i^{-1/2}. \tag{5.8}
\end{align*}

Since $\mathbb{E}\bigl[(Y_i-\mu_i)/\{\operatorname{var}(Y_i)\}^{1/2}\bigr] = 0$ and this standardized quantity has variance one, the error term in (5.8) has mean zero and variance $W_i^{-1}$; hence the updated estimate $\mathbf{b}^{*}$ is nothing but the weighted least squares estimate of $\boldsymbol{\beta}$ with weights given by the $W_i$'s, where these weights and the $z_i$'s are calculated using the current value of $\mathbf{b}$, since that is the best approximation at the current stage. The hypothetical model (5.8) can be motivated from a one-step Taylor approximation:

\[
g(y_i) \approx g(\mu_i) + (y_i-\mu_i)\,g'(\mu_i) = \mathbf{x}_i'\boldsymbol{\beta} + (y_i-\mu_i)\,\frac{d\eta_i}{d\mu_i}.
\]
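The weighted least squares view also suggests how to compute the update stably: instead of forming the normal equations $(\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}\mathbf{z}$ explicitly, scale the rows of $\mathbf{X}$ and the entries of $\mathbf{z}$ by $\sqrt{W_i}$ and solve an ordinary least squares problem. The following check (added for illustration; the names and simulated data are assumptions, not from the text) verifies numerically that both forms give the same update:

\begin{verbatim}
import numpy as np

def wls_step(X, z, W):
    """Weighted least squares via row scaling: minimizes sum_i W_i (z_i - x_i'b)^2,
    equivalent to (X'WX)^{-1} X'Wz but solved by a least-squares routine."""
    sw = np.sqrt(W)
    b, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
    return b

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.poisson(np.exp(0.3 + 0.5 * X[:, 1]))
mu = y + 0.5                                           # crude starting values
eta, W = np.log(mu), mu
z = eta + (y - mu) / mu                                # working response (5.7)
b_wls = wls_step(X, z, W)
b_ne = np.linalg.solve((X.T * W) @ X, (X.T * W) @ z)   # normal-equations form
print(np.allclose(b_wls, b_ne))                        # True: the two updates agree
\end{verbatim}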

Remark 3: From (5.4), $-\mathbb{E}[du_r/d\beta_s] = -du_r/d\beta_s$ if $W_i\,d\eta_i/d\mu_i$ is a constant function of $\boldsymbol{\beta}$. This happens under the canonical link function, and consequently Fisher’s scoring method and the Newton-Raphson method for finding $\hat{\boldsymbol{\beta}}$ coincide, resulting in fast convergence. This is because:

\[
W_i\,\frac{d\eta_i}{d\mu_i} = \frac{1}{\operatorname{var}(Y_i)\,g'(\mu_i)} = \frac{1}{v(\mu_i)\,g'(\mu_i)},
\]

and for this to be free of $\boldsymbol{\beta}$, $g'(\mu)\,v(\mu)$ must be a constant, or:

\[
g(\mu) = c\int\frac{1}{v(\mu)}\,d\mu.
\]

In particular:

  • Simple linear regression: $v(\mu) = 1$, $g(\mu) = \mu$

  • Poisson regression: $v(\mu) = \mu$, $g(\mu) = \log(\mu)$

  • Logistic regression: $v(\mu) = \mu(1-\mu)$, $g(\mu) = \log\{\mu/(1-\mu)\}$
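As a quick check of the last bullet (a verification added for illustration), the integral formula indeed recovers the logit link:

\[
c\int\frac{d\mu}{\mu(1-\mu)} = c\int\left(\frac{1}{\mu} + \frac{1}{1-\mu}\right)d\mu = c\,\log\frac{\mu}{1-\mu} + \text{const},
\]

which, with $c = 1$ and the constant of integration set to zero, is the logit link.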