The parameter vector $\boldsymbol\beta$ of the GLM is estimated as a solution to the likelihood equations derived from the log-likelihood function, as follows. In the canonical form, for independent observations $y_1,\dots,y_n$, the log-likelihood is given by:
\[ \ell(\boldsymbol\beta) = \sum_{i=1}^{n}\left\{\frac{y_i\,\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi)\right\} = \sum_{i=1}^{n}\ell_i . \]
Next we take the first derivative of $\ell_i$ with respect to $\beta_j$, $\partial\ell_i/\partial\beta_j$. Note here that $\theta_i$ is a one-to-one function of $\mu_i$ with $\mu_i = b'(\theta_i)$, $\mu_i$ is a one-to-one function of $\eta_i$ through the link function $g(\mu_i) = \eta_i$, and finally $\eta_i = \mathbf{x}_i^\top\boldsymbol\beta$. Hence, by the chain rule:
\[ \frac{\partial\ell_i}{\partial\beta_j} = \frac{\partial\ell_i}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial\mu_i}\,\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_j} = \frac{(y_i-\mu_i)\,x_{ij}}{\mathrm{Var}(y_i)}\,\frac{\partial\mu_i}{\partial\eta_i}, \]
where
\[ \frac{\partial\ell_i}{\partial\theta_i} = \frac{y_i-\mu_i}{a(\phi)}, \qquad \frac{\partial\theta_i}{\partial\mu_i} = \frac{1}{b''(\theta_i)} = \frac{a(\phi)}{\mathrm{Var}(y_i)}, \qquad \frac{\partial\eta_i}{\partial\beta_j} = x_{ij}. \]
This leads to the likelihood equations:
\[ \sum_{i=1}^{n} \frac{(y_i-\mu_i)\,x_{ij}}{\mathrm{Var}(y_i)}\,\frac{\partial\mu_i}{\partial\eta_i} = 0, \qquad j = 1,\dots,p. \]
We denote the above likelihood equations in vector form by:
\[ \mathbf{u}(\boldsymbol\beta) = (u_1,\dots,u_p)^\top = \mathbf{0}, \]
where we define the weights:
\[ w_i = \frac{\left(\partial\mu_i/\partial\eta_i\right)^2}{\mathrm{Var}(y_i)}. \tag{5.2} \]
In the sequel, all sums are over $i$ from 1 to $n$, unless otherwise stated, and the subscript $i$ is omitted from the summands. The $j$-th component, $u_j$, of $\mathbf{u}(\boldsymbol\beta)$ is:
\[ u_j = \sum \frac{(y-\mu)\,x_j}{\mathrm{Var}(y)}\,\frac{\partial\mu}{\partial\eta} = \sum w\,x_j\,(y-\mu)\,\frac{\partial\eta}{\partial\mu}. \tag{5.3} \]
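As a concrete illustration (this specialization is ours, consistent with the Poisson case listed at the end of the section), under the canonical log link of the Poisson model the component (5.3) simplifies considerably:
\[ \mu = e^{\eta}, \qquad \mathrm{Var}(y) = \mu, \qquad \frac{\partial\mu}{\partial\eta} = \mu \quad\Longrightarrow\quad u_j = \sum \frac{(y-\mu)\,x_j}{\mu}\,\mu = \sum (y-\mu)\,x_j . \]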
and let the expectation of the negative Hessian matrix be:
\[ \mathcal{J}(\boldsymbol\beta) = E\left(-\frac{\partial^2\ell}{\partial\boldsymbol\beta\,\partial\boldsymbol\beta^\top}\right), \]
where both $\mathbf{u}$ and $\mathcal{J}$ are evaluated at the current estimate of $\boldsymbol\beta$. Then from (5.3), for $j,k = 1,\dots,p$,
\[ \mathcal{J}_{jk} = E\left(-\frac{\partial u_j}{\partial\beta_k}\right) = \sum \frac{x_j\,x_k}{\mathrm{Var}(y)}\left(\frac{\partial\mu}{\partial\eta}\right)^2 = \sum w\,x_j\,x_k. \]
Hence:
\[ \mathcal{J} = X^\top W X, \qquad W = \mathrm{diag}\{w_1,\dots,w_n\}. \tag{5.4} \]
To apply Fisher’s scoring method, note that the $j$-th component of $\mathcal{J}\boldsymbol\beta$ is:
\[ (\mathcal{J}\boldsymbol\beta)_j = \sum w\,x_j\,\eta, \tag{5.5} \]
where $\eta = \mathbf{x}^\top\boldsymbol\beta$ is the $i$-th linear predictor evaluated at the current estimate. Hence from (5.3) and (5.5):
\[ (\mathcal{J}\boldsymbol\beta + \mathbf{u})_j = \sum w\,x_j\left\{\eta + (y-\mu)\,\frac{\partial\eta}{\partial\mu}\right\} = (X^\top W\mathbf{z})_j, \tag{5.6} \]
where
\[ z = \eta + (y-\mu)\,\frac{\partial\eta}{\partial\mu}, \tag{5.7} \]
with all quantities ($\mu$, $\eta$, $w$ and $\partial\eta/\partial\mu$) evaluated at the current estimate $\boldsymbol\beta^{(t)}$. Consequently, from (5.1), (5.4) and (5.6):
\[ \boldsymbol\beta^{(t+1)} = \boldsymbol\beta^{(t)} + \mathcal{J}^{-1}\mathbf{u} = \mathcal{J}^{-1}\left(\mathcal{J}\boldsymbol\beta^{(t)} + \mathbf{u}\right) = (X^\top W X)^{-1} X^\top W\,\mathbf{z}, \]
where $W$ is the diagonal matrix with $i$-th diagonal entry $w_i$ of (5.2).
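The update $(X^\top W X)^{-1} X^\top W \mathbf{z}$ can be sketched numerically. Below is a minimal illustration for a Poisson GLM with canonical log link; the function name and the simulated data are ours, not from the text:

```python
import numpy as np

def irwls_poisson(X, y, n_iter=25, tol=1e-10):
    """IRWLS for a Poisson GLM with canonical log link.

    For the log link, d(mu)/d(eta) = mu and Var(y) = mu, so the
    weight (5.2) is w = mu and the working response (5.7) is
    z = eta + (y - mu)/mu.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu                        # (5.2): (dmu/deta)^2 / Var(y) = mu
        z = eta + (y - mu) / mu       # (5.7): working response
        # Update: solve (X^T W X) beta = X^T W z
        beta_new = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

# Illustrative simulated data (not from the text)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = rng.poisson(np.exp(X @ np.array([0.5, 0.3])))
beta_hat = irwls_poisson(X, y)
# For the canonical log link, the score is X^T (y - mu),
# which should vanish at the converged estimate.
score = X.T @ (y - np.exp(X @ beta_hat))
```

Because the log link is canonical, this iteration is also a Newton–Raphson iteration (see Remark 3 below) and typically converges in a handful of steps.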
Remark 1 on implementation: For implementing IRWLS, start with an initial $\boldsymbol\beta^{(0)}$ and first compute the linear predictor $\eta_i = \mathbf{x}_i^\top\boldsymbol\beta^{(0)}$. Then calculate $\mu_i = g^{-1}(\eta_i)$; however, often the initial $\mu_i$'s are taken as the $y_i$'s and one evaluates $\eta_i = g(y_i)$. (There are obvious problems with such a choice, for example, when some $y_i$'s are zero and one has to take the logarithm, as in the Poisson case.) Finally, $\mathbf{z}$ of (5.7) is evaluated and the iteration continues.
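The start-up step of Remark 1 might look as follows for the Poisson case; the flooring of zero counts at 0.5 is an illustrative choice of ours, not prescribed by the text:

```python
import numpy as np

# Initialization sketch for Poisson IRWLS, starting from mu_i^(0) = y_i.
y = np.array([0, 3, 1, 7, 0, 2], dtype=float)
mu0 = np.maximum(y, 0.5)         # guard against log(0) for zero counts
eta0 = np.log(mu0)               # eta^(0) = g(mu^(0)) with the log link
z0 = eta0 + (y - mu0) / mu0      # working response (5.7) at start-up
w0 = mu0                         # weights (5.2) at start-up
```

With `z0` and `w0` in hand, the first weighted least squares update can be computed without choosing $\boldsymbol\beta^{(0)}$ explicitly.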
Remark 2: Consider a multiple linear regression model with observed responses the $z_i$'s of (5.7), defined as:
\[ z_i = \mathbf{x}_i^\top\boldsymbol\beta + \varepsilon_i, \qquad \mathrm{Var}(\varepsilon_i) = 1/w_i. \tag{5.8} \]
Since $\mathrm{Var}(z_i) \approx \left(\partial\eta_i/\partial\mu_i\right)^2 \mathrm{Var}(y_i) = 1/w_i$, the updated estimate $(X^\top WX)^{-1}X^\top W\mathbf{z}$ is nothing but the weighted least squares estimate of $\boldsymbol\beta$ in model (5.8), with weights given by the $w_i$'s, where these weights and the $z_i$'s are calculated using the current value $\boldsymbol\beta^{(t)}$, since that is the best approximation at the current stage. The hypothetical model (5.8) can be motivated from a one-step Taylor approximation of $g(y_i)$ about $\mu_i$:
\[ z_i = g(y_i) \approx g(\mu_i) + (y_i-\mu_i)\,g'(\mu_i) = \eta_i + (y_i-\mu_i)\,\frac{\partial\eta_i}{\partial\mu_i}. \]
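The equivalence in Remark 2 can be checked directly: one scoring update equals the weighted least squares fit of model (5.8), computed here as ordinary least squares on $\sqrt{w}$-scaled data. The simulated data and current estimate are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = rng.poisson(np.exp(0.2 + 0.4 * X[:, 1])).astype(float)

beta = np.array([0.1, 0.1])      # current estimate beta^(t)
eta = X @ beta
mu = np.exp(eta)
w = mu                            # Poisson/log-link weights (5.2)
z = eta + (y - mu) / mu           # working response (5.7)

# Scoring update: (X^T W X)^{-1} X^T W z
beta_irwls = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))

# Weighted least squares fit of model (5.8): ordinary least squares
# on rows scaled by sqrt(w)
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], z * sw, rcond=None)
```

The two estimates agree up to floating-point error, since both solve the same normal equations.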
Remark 3: From (5.4), $\mathcal{J}$ coincides with the negative Hessian itself (not merely its expectation) if $(\partial\mu/\partial\eta)/\mathrm{Var}(y)$ is a constant function of $\mu$. This happens under the canonical link function, and consequently Fisher's scoring method and the Newton–Raphson method for finding $\hat{\boldsymbol\beta}$ coincide, resulting in fast convergence. This is because, writing $u_j = \sum h(\mu)\,(y-\mu)\,x_j$ with $h(\mu) = (\partial\mu/\partial\eta)/\mathrm{Var}(y)$:
\[ \frac{\partial u_j}{\partial\beta_k} = \sum x_j\,x_k\,\frac{\partial\mu}{\partial\eta}\left\{h'(\mu)\,(y-\mu) - h(\mu)\right\}, \]
and for this to be free from $y$, $h(\mu)$ must be a constant, or:
\[ \frac{\partial\mu}{\partial\eta} = c\,\mathrm{Var}(y) \quad\text{for some constant } c. \]
In particular:
Simple linear regression – identity link, $\eta = \mu$: $\partial\mu/\partial\eta = 1$, $\mathrm{Var}(y) = \sigma^2$, so $c = 1/\sigma^2$;
Poisson regression – log link, $\eta = \log\mu$: $\partial\mu/\partial\eta = \mu = \mathrm{Var}(y)$, so $c = 1$;
Logistic regression – logit link, $\eta = \log\{\mu/(1-\mu)\}$: $\partial\mu/\partial\eta = \mu(1-\mu) = \mathrm{Var}(y)$, so $c = 1$.
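The identity $\partial\mu/\partial\eta = c\,\mathrm{Var}(y)$ for the canonical links above can be verified numerically; the central-difference check below is our illustration, not from the text:

```python
import numpy as np

# Check that under the canonical link, dmu/deta equals the variance
# function (c = 1), so the weight w is free of the data y.
eta = np.linspace(-2.0, 2.0, 9)
h = 1e-6  # central-difference step

# Poisson, log link: mu = exp(eta), variance function V(mu) = mu
mu_pois = np.exp(eta)
dmu_pois = (np.exp(eta + h) - np.exp(eta - h)) / (2 * h)

# Bernoulli, logit link: mu = e^eta/(1+e^eta), V(mu) = mu(1-mu)
mu_logit = 1.0 / (1.0 + np.exp(-eta))
dmu_logit = (1.0 / (1.0 + np.exp(-(eta + h)))
             - 1.0 / (1.0 + np.exp(-(eta - h)))) / (2 * h)
```

For the identity link the derivative is trivially the constant 1, which is $\mathrm{Var}(y)$ up to the factor $1/\sigma^2$.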