5 Week 5 Bayesian statistics: Decisions

5.3 Decision theory for point estimation

Bayes rule for squared error loss

If the loss is squared error, the Bayes decision $d$ is found by minimizing $\rho(d, \pi(\theta|X))$ or, simplifying the notation, $\rho(d)$:

\[
\rho(d,\pi) = \mathbb{E}_{\theta|X}\left[(d-\theta)^2\right]
            = d^2 - 2\,\mathbb{E}_{\theta|X}(\theta)\,d + \mathbb{E}_{\theta|X}(\theta^2)
\tag{5.3}
\]

Differentiating with respect to $d$ to find the minimum loss:

\[
\rho'(d,\pi) = 2d - 2\,\mathbb{E}_{\theta|X}(\theta) = 0
\]
\[
d = \mathbb{E}_{\theta|X}(\theta) = \int_\theta \theta\,\pi(\theta|X)\,d\theta
\]


Since $\rho''(d) = 2 > 0$, the posterior mean $d = \mathbb{E}_{\theta|X}(\theta)$ is the Bayes decision or rule.
The Bayes risk can be found by substituting $d = \mathbb{E}_{\theta|X}(\theta)$ into (5.3) to get

\[
\rho(\pi) = \mathbb{E}_{\theta|X}(\theta^2) - \left[\mathbb{E}_{\theta|X}(\theta)\right]^2 = \operatorname{var}(\theta \mid X)
\]
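As a numerical illustration (not from the notes), here is a minimal sketch that approximates a posterior on a grid, using an assumed Beta(3, 5) shape, and recovers the Bayes decision and Bayes risk under squared error loss as the posterior mean and variance.

```python
import numpy as np

# Illustrative grid approximation to a posterior pi(theta | X);
# the Beta(3, 5) shape is an assumption for demonstration only.
theta = np.linspace(0.001, 0.999, 1000)
post = theta**2 * (1 - theta)**4       # unnormalized Beta(3, 5) density
post /= np.trapz(post, theta)          # normalize to integrate to 1

# Bayes decision under squared error loss: the posterior mean.
d = np.trapz(theta * post, theta)

# Bayes risk: the posterior variance.
risk = np.trapz(theta**2 * post, theta) - d**2

print(f"Bayes decision (posterior mean): {d:.4f}")     # exact value 3/8
print(f"Bayes risk (posterior variance): {risk:.5f}")  # exact value 15/576
```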

Characteristics of squared error loss

  • It is symmetrical

  • It is easy to interpret. This means that, with such a loss, we can summarize a posterior with a mean (Bayes rule) and variance (Bayes risk).

  • Other losses also have the posterior mean as their Bayes rule (Lindley 1985).

  • Squared error loss is often criticized for penalising large errors too heavily.

Weighted squared error loss

For weighted squared error loss, $L(d,\theta) = w(\theta)(d-\theta)^2$, the Bayes decision $d$ is found by minimizing $\rho(d)$:

\[
\rho(d) = \mathbb{E}_{\theta|X}\left[w(\theta)(d-\theta)^2\right]
        = d^2\,\mathbb{E}_{\theta|X}[w(\theta)] - 2d\,\mathbb{E}_{\theta|X}[w(\theta)\theta] + \mathbb{E}_{\theta|X}[w(\theta)\theta^2]
\]


Differentiating with respect to $d$ to find the minimum loss:

\[
\rho'(d) = 2d\,\mathbb{E}_{\theta|X}[w(\theta)] - 2\,\mathbb{E}_{\theta|X}[w(\theta)\theta] = 0
\]
\[
d = \frac{\mathbb{E}_{\theta|X}[w(\theta)\theta]}{\mathbb{E}_{\theta|X}[w(\theta)]}
  = \frac{\int_\theta w(\theta)\,\theta\,\pi(\theta|X)\,d\theta}{\int_\theta w(\theta)\,\pi(\theta|X)\,d\theta}
\]

The Bayes decision or rule is the posterior mean of $w(\theta)\theta$ divided by the posterior mean of the weights $w(\theta)$; equivalently, the mean of the posterior reweighted by $w(\theta)$.
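A minimal sketch of this rule, reusing the illustrative Beta(3, 5) grid posterior and an assumed weight function w(θ) = 1/θ that makes errors at small θ more costly:

```python
import numpy as np

# Same illustrative grid posterior as above (Beta(3, 5) shape).
theta = np.linspace(0.001, 0.999, 1000)
post = theta**2 * (1 - theta)**4
post /= np.trapz(post, theta)

w = 1.0 / theta  # assumed weight: errors matter more for small theta

# Bayes decision for weighted squared error loss:
# E[w(theta) * theta] / E[w(theta)], both expectations under the posterior.
d = np.trapz(w * theta * post, theta) / np.trapz(w * post, theta)
print(f"Weighted Bayes decision: {d:.4f}")
```

With this weight the decision falls below the unweighted posterior mean of 0.375, since the loss is dominated by errors at small θ.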

Asymmetrical loss functions

Recall that this loss function is given by

\[
L(d,\theta) =
\begin{cases}
K_1(\theta-d) & d < \theta \\
K_2(d-\theta) & d \ge \theta
\end{cases}
\]

Bayes decision for asymmetrical loss

The Bayes decision can be found by minimising

\[
\begin{aligned}
\rho(d) &= \mathbb{E}_{\theta|X}\left[L(d,\theta)\right] \\
        &= \int_\theta \pi(\theta|X)\,L(d,\theta)\,d\theta \\
        &= \int_{d<\theta} \pi(\theta|X)\,L(d,\theta)\,d\theta + \int_{d\ge\theta} \pi(\theta|X)\,L(d,\theta)\,d\theta \\
        &= K_1 \int_{d<\theta} \pi(\theta|X)\,(\theta-d)\,d\theta + K_2 \int_{d\ge\theta} \pi(\theta|X)\,(d-\theta)\,d\theta
\end{aligned}
\]

Differentiating with respect to $d$ and setting the derivative to zero:

\[
\begin{aligned}
\rho'(d) &= -K_1 \int_{d<\theta} \pi(\theta|X)\,d\theta + K_2 \int_{d\ge\theta} \pi(\theta|X)\,d\theta = 0 \\
0 &= -K_1\,P_{\theta|X}(d<\theta) + K_2\left(1 - P_{\theta|X}(d<\theta)\right) \\
P_{\theta|X}(d<\theta) &= \frac{K_2}{K_1+K_2}
\end{aligned}
\]


Equivalently, $P_{\theta|X}(\theta \le d) = K_1/(K_1+K_2)$: the Bayes decision $d$ is the $K_1/(K_1+K_2)$ fractile (quantile) of the posterior.

In particular, for absolute error loss ($K_1 = K_2$) the Bayes decision is the posterior median.
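A minimal sketch on the same illustrative grid posterior, with assumed costs K1 = 1 and K2 = 3 (overestimation three times as costly as underestimation), so the Bayes decision is the 1/4 quantile of the posterior:

```python
import numpy as np

# Same illustrative grid posterior as above (Beta(3, 5) shape).
theta = np.linspace(0.001, 0.999, 1000)
post = theta**2 * (1 - theta)**4
post /= np.trapz(post, theta)

K1, K2 = 1.0, 3.0    # assumed costs: overestimation is 3x as costly
p = K1 / (K1 + K2)   # posterior quantile giving the Bayes decision

# Posterior CDF on the grid, then the p-th quantile.
cdf = np.cumsum(post) * (theta[1] - theta[0])
d = theta[np.searchsorted(cdf, p)]
print(f"Bayes decision ({p:.2f} posterior quantile): {d:.4f}")

# With K1 = K2 (absolute error loss), p = 0.5 and d is the posterior median.
```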

Examples of asymmetrical loss

It is common for a given positive error to be more serious than a negative error of the same magnitude, or vice versa. Examples include the following:

  1. The plug is pulled on the ventilator of a very sick hospital patient when the probability that the patient is dead exceeds a threshold, say $P(D=1) > \lambda$.

  2. A nuclear power plant is shut down if the probability of a meltdown exceeds a threshold.

  3. The concentration of CO2 in the atmosphere is thought to be unsafe when it exceeds a threshold, $[\mathrm{CO_2}] > \epsilon$. When this level is exceeded, the risk of a runaway greenhouse effect is thought too high and expensive corrective procedures are carried out.

Binary loss

An interval of length $2\epsilon$, say $(b-\epsilon,\, b+\epsilon)$, is said to be a modal interval of length $2\epsilon$ for the distribution of a random variable $X$ if
$P(b-\epsilon < X < b+\epsilon)$ takes its maximum value over all such intervals.

For the loss function

\[
L(d,\theta) =
\begin{cases}
0 & |d-\theta| < \epsilon \\
1 & |d-\theta| \ge \epsilon
\end{cases}
\]

the expected loss is minimized when $P_{\theta|X}(d-\epsilon < \theta < d+\epsilon)$ is maximized, which occurs when $d$ is chosen to be the midpoint of the modal interval of length $2\epsilon$ of the posterior.
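A minimal sketch of this idea: scan candidate decisions d over the illustrative grid posterior and keep the one maximizing P(d − ε < θ < d + ε); the value ε = 0.05 is an arbitrary assumption.

```python
import numpy as np

# Same illustrative grid posterior as above (Beta(3, 5) shape).
theta = np.linspace(0.001, 0.999, 1000)
post = theta**2 * (1 - theta)**4
post /= np.trapz(post, theta)

eps = 0.05  # assumed half-width of the interval

def interval_prob(d):
    """Posterior probability that theta lies in (d - eps, d + eps)."""
    mask = (theta > d - eps) & (theta < d + eps)
    return np.trapz(post[mask], theta[mask])

# The Bayes decision maximizes the interval probability, i.e. it is the
# midpoint of the modal interval of length 2 * eps.
probs = np.array([interval_prob(d) for d in theta])
d = theta[np.argmax(probs)]
print(f"Bayes decision (modal interval midpoint): {d:.4f}")
```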


The limiting case as $\epsilon \to 0$ is the hit-or-miss loss:

\[
L(d,\theta) = 1 - \delta(d,\theta),
\]

where $\delta$ is the Kronecker delta, equal to 1 when $d = \theta$ and 0 otherwise. If the posterior distribution is unimodal, the Bayes decision is

\[
d = \operatorname*{arg\,max}_{\theta}\ \pi(\theta|X),
\]

the mode of the posterior (the maximum a posteriori, or MAP, estimate).
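A minimal sketch: on the illustrative grid posterior, the Bayes decision under hit-or-miss loss is simply the grid point of highest posterior density (here close to the exact Beta(3, 5) mode of 1/3).

```python
import numpy as np

# Same illustrative grid posterior as above (Beta(3, 5) shape).
theta = np.linspace(0.001, 0.999, 1000)
post = theta**2 * (1 - theta)**4
post /= np.trapz(post, theta)

# Bayes decision under hit-or-miss (0-1) loss: the posterior mode (MAP).
d_map = theta[np.argmax(post)]
print(f"Bayes decision (posterior mode / MAP): {d_map:.4f}")
```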

Characteristics of 0-1 loss

  • This loss function can be thought of as depicting the truth of a model: when a model is simply either right or wrong, it is the appropriate loss function.

  • It is mainly used in the classical formulation of hypothesis testing as formalized by Neyman and Pearson.

  • It does not take into account shades of usefulness.