8 More than one random variable

8.2 Independence

Independence is the simplest form of joint behaviour of two (or more) random variables. Informally, two random variables X and Y are independent if knowing the value of one of them gives no information about the value of the other.

The outcomes of, say, rolls of two separate dice are independent in exactly this sense: knowing that the red die showed a 4 does not give us any information about the score of the blue die, and, conversely, knowing that the score of the blue die was 3 does not give any information about the red die.

Two random variables X and Y are independent if the events {X ∈ A} and {Y ∈ B} are independent for all sets A and B, i.e. P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B) for all sets A, B.

Theorem 8.4.

Two discrete random variables X and Y are independent if and only if

p_{X,Y}(x,y) = p_X(x) p_Y(y)

for all x and y.

Proof.

Let X and Y be independent, and let A={x} and B={y}. Then

p_{X,Y}(x,y) = P(X ∈ A, Y ∈ B)
             = P(X ∈ A) P(Y ∈ B)
             = p_X(x) p_Y(y).

Conversely, if the joint pmf factorises we get for arbitrary sets A and B

P(X ∈ A, Y ∈ B) = Σ_{x∈A} Σ_{y∈B} p_{X,Y}(x,y)
                = Σ_{x∈A} Σ_{y∈B} p_X(x) p_Y(y)
                = Σ_{x∈A} p_X(x) Σ_{y∈B} p_Y(y)
                = P(X ∈ A) P(Y ∈ B). ∎
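
To make Theorem 8.4 concrete, here is a small optional Python check (the joint pmf table is invented for illustration and is not from the notes): it computes the marginals of a joint pmf stored as an array and tests whether the joint pmf equals their product.

```python
import numpy as np

# Joint pmf of (X, Y) as a table: rows indexed by x, columns by y.
# This particular table factorises, so X and Y are independent.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.15, 0.30, 0.15]])

p_x = p_xy.sum(axis=1)   # marginal pmf of X (row sums)
p_y = p_xy.sum(axis=0)   # marginal pmf of Y (column sums)

# Theorem 8.4: independence  <=>  p_{X,Y}(x,y) = p_X(x) p_Y(y) for all x, y
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # True for this table
```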

If X and Y are discrete random variables, the conditional pmfs are

p_{X|Y}(x|y) = p_{X,Y}(x,y) / p_Y(y),
p_{Y|X}(y|x) = p_{X,Y}(x,y) / p_X(x),

provided the denominators are positive. Thus p_{X|Y}(x|y) = P(X = x, Y = y) / P(Y = y) = P(X = x | Y = y).

Exercise 8.5.

Show that if the discrete variables (X,Y) are independent then for all x,y:

p_{X|Y}(x|y) = p_X(x).

Solution.

p_{X|Y}(x|y) = p_{X,Y}(x,y) / p_Y(y)
             = p_X(x) p_Y(y) / p_Y(y)
             = p_X(x).

These results conform with intuition as, when X and Y are independent, knowing the value of X should tell us nothing about Y.

The converse is also true: if the conditional distribution of X given Y=y is independent of y or, equivalently, the conditional distribution of Y given X=x is independent of x, then X and Y are independent.

Example 8.6.

A fair coin is tossed. If it shows H, a fair die is thrown; if it shows T, a biased die is thrown. The bias makes even numbers twice as probable as odd numbers. Find the joint pmf of X, the outcome of the coin toss, and Y, the score on the die.

Solution.

Code T and H as 0 and 1 so that the coin toss X is a random variable. Marginal: p_X(x) = 1/2 for x = 0, 1.

Conditional:
x = 1: p_{Y|X}(y|1) = 1/6 for y = 1, 2, …, 6.
x = 0: p_{Y|X}(y|0) = c for y = 1, 3, 5 and p_{Y|X}(y|0) = 2c for y = 2, 4, 6. Since 3c + 3(2c) = 1, we get c = 1/9.

Using p_{X,Y}(x,y) = p_{Y|X}(y|x) p_X(x) delivers the following table (rows x, columns y):

        y = 1   y = 2   y = 3   y = 4   y = 5   y = 6
x = 0    1/18    2/18    1/18    2/18    1/18    2/18
x = 1    1/12    1/12    1/12    1/12    1/12    1/12
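
As an optional check of this table, the following Python sketch (using exact fractions; the variable names are just illustrative) rebuilds the joint pmf from p_{Y|X} and p_X and confirms the entries sum to 1.

```python
from fractions import Fraction as F

p_x = {0: F(1, 2), 1: F(1, 2)}    # marginal pmf of the coin (0 = T, 1 = H)

# conditional pmf of the die score given the coin
p_y_given_x = {
    0: {y: (F(2, 9) if y % 2 == 0 else F(1, 9)) for y in range(1, 7)},  # biased die
    1: {y: F(1, 6) for y in range(1, 7)},                               # fair die
}

# joint pmf p_{X,Y}(x,y) = p_{Y|X}(y|x) p_X(x)
p_xy = {(x, y): p_y_given_x[x][y] * p_x[x] for x in p_x for y in range(1, 7)}

print(p_xy[(0, 2)], p_xy[(1, 5)])   # 1/9 (= 2/18) and 1/12, matching the table
print(sum(p_xy.values()))           # 1
```
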
Example 8.7.

For the joint pmf in Example 8.3 obtain the conditional pmf of X given Y=2.

Only the Y = 2 column of the table in Example 8.3 is needed, together with the marginals:

          Y = 2    p_X(x)
X = 1      2/60     16/60
X = 2      3/60     24/60
X = 3      6/60     20/60
p_Y(2)    11/60

So

p_{X|Y}(x|2) = p_{X,Y}(x,2) / p_Y(2).

Thus x=1 w.p. 2/11, x=2 w.p. 3/11 and x=3 w.p. 6/11.
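
The same calculation in a short Python sketch (exact fractions again; only the Y = 2 column is needed):

```python
from fractions import Fraction as F

# joint probabilities p_{X,Y}(x, 2) for x = 1, 2, 3, read from the table above
joint_col = {1: F(2, 60), 2: F(3, 60), 3: F(6, 60)}

p_y2 = sum(joint_col.values())                      # p_Y(2) = 11/60
cond = {x: p / p_y2 for x, p in joint_col.items()}  # p_{X|Y}(x|2)

print(p_y2)   # 11/60
print(cond)   # {1: Fraction(2, 11), 2: Fraction(3, 11), 3: Fraction(6, 11)}
```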

We have seen that when X and Y are both discrete, they are independent if and only if their joint pmf can be factorised as a product of the marginal pmfs.

p_{X,Y}(x,y) = p_X(x) p_Y(y).

Our definition of independence also holds for continuous random variables, but there is no joint pmf for continuous random variables. It is beyond the scope of this module, but there can exist a joint probability density function f_{X,Y}(x,y). As with univariate random variables, results that hold in the discrete case with probability mass functions often hold in the continuous case with joint density functions.

Theorem 8.8.

Two continuous random variables X and Y are independent if and only if

f_{X,Y}(x,y) = f_X(x) f_Y(y)

for all x and y.

Proof.

Not given here. ∎

The result is needed for constructing likelihood-based estimates in statistics: often it is assumed that repeated experiments result in n independent observations of a random variable, and the joint density function of the observations is the product of the marginal densities.
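
As an illustration of this use (an assumed iid normal model, chosen purely for the sketch and not taken from the notes), independence lets us write the joint density of n observations as the product of the marginal densities, so the log-likelihood is a sum of log densities:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=100)   # n = 100 iid N(2, 1.5²) observations

def normal_log_likelihood(mu, sigma, x):
    # sum of log marginal densities -- valid because the observations are independent
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))

print(normal_log_likelihood(2.0, 1.5, data))
```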

8.3 Weak law of large numbers

Recall from Exercise 5.13 that if an experiment is repeated n times then, as n gets large, the proportion of times an event A occurs converges to P(A). We will now prove a similar result concerning the average of several realisations of a random variable converging to the expected value. We start with a lemma which is proved in MATH230.

Lemma 8.9.

Let X_1, X_2, …, X_n be jointly distributed random variables with finite expectation and variance. Then

  • E(X_1 + X_2 + ⋯ + X_n) = E(X_1) + E(X_2) + ⋯ + E(X_n), and

  • if X_1, X_2, …, X_n are independent then

    Var(X_1 + X_2 + ⋯ + X_n) = Var(X_1) + Var(X_2) + ⋯ + Var(X_n).
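
As a quick numerical sanity check of the lemma (an illustrative simulation with two independent random variables; not a proof), the empirical mean and variance of a sum match the sums of the individual means and variances:

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.exponential(scale=2.0, size=1_000_000)   # mean 2, variance 4
x2 = rng.uniform(0.0, 6.0, size=1_000_000)        # mean 3, variance 3; independent of x1

s = x1 + x2
print(s.mean(), x1.mean() + x2.mean())   # expectations add (always)
print(s.var(), x1.var() + x2.var())      # variances add (uses independence)
```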

Now suppose that X_1, X_2, …, X_n are independent copies of a random variable X. For example, suppose we repeated an experiment n times, and X_i is the measured outcome on the ith experiment. This setup means that for each i we have

E[X_i] = E[X],
Var(X_i) = Var(X).

If we want to report a value, scientists will usually measure it n times and report the average measured value. Let X_i be the measured value on the ith experiment. The average measured value is

X¯ = (1/n)(X_1 + X_2 + ⋯ + X_n).

Why do we do this?

Let’s consider the properties of X¯. For simplicity, write μ for E[X] and σ² for Var(X).

E[X¯] = E[(1/n)(X_1 + X_2 + ⋯ + X_n)]
      = (1/n) E[X_1 + X_2 + ⋯ + X_n]            by linearity of E
      = (1/n) {E[X_1] + E[X_2] + ⋯ + E[X_n]}    by Lemma 8.9
      = (1/n) {E[X] + E[X] + ⋯ + E[X]}          since E[X_i] = E[X]
      = (1/n) (nμ)
      = μ.

So the expectation of X¯ is exactly the quantity we wish to report, the true expected value of X. Of course, simply reporting the first measurement X_1 would also have this expected value.

Consider now the variance of X¯:

Var(X¯) = Var((1/n)(X_1 + X_2 + ⋯ + X_n))
        = (1/n²) Var(X_1 + X_2 + ⋯ + X_n)                by the calculation on p4.5
        = (1/n²) {Var(X_1) + Var(X_2) + ⋯ + Var(X_n)}    by Lemma 8.9
        = (1/n²) (nσ²)
        = σ²/n.

The variance of our reported quantity, X¯, decreases as the number of measurements n increases.
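
The following short simulation sketch (the distribution, σ² and the sample sizes are illustrative choices) shows the empirical variance of X¯ tracking σ²/n:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                     # Var(X) for illustrative N(0, 4) measurements
reps = 20_000                    # number of simulated experiments for each n

for n in [5, 20, 100]:
    xbar = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n)).mean(axis=1)
    print(n, xbar.var(), sigma2 / n)   # empirical Var(X¯) vs theoretical σ²/n
```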

We can use Chebychev’s inequality (Section 4.6) to be more precise about this. Recall that for any random variable R with expected value μ and standard deviation s

P(|R - μ| > cs) ≤ 1/c²,

for any c>0.

[I am using s for the standard deviation here, instead of σ, to avoid confusion with the σ² already used for the variance of X.]

Hence for the random variable X¯, with expected value μ, variance σ²/n and hence standard deviation σ/√n, we have

P(|X¯ - μ| > cσ/√n) ≤ 1/c².

By taking k = c/√n, we can rearrange this expression to

P(|X¯ - μ| > kσ) ≤ 1/(k²n).

We see that as n gets large, the probability that the sample average X¯ is more than distance kσ away from the expected value of the original random quantity X decreases to 0.

Since k is arbitrary, in some sense we can say that X¯ converges to μ. This is called the weak law of large numbers. You will see various other forms of convergence of random variables in later courses.

One final thing to note: the standard deviation σ is exactly the right quantity for determining the appropriate scale for measuring distance here: the events are of the type “random variable is more than k standard deviations away from the mean”.
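
Finally, a small simulation sketch of the bound above (the distribution, the value of k and the sample sizes are illustrative choices): the empirical frequency of |X¯ - μ| > kσ sits well below 1/(k²n) and shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, k, reps = 1.0, 2.0, 0.5, 50_000    # X ~ N(1, 4), illustrative values

for n in [10, 100, 1000]:
    xbar = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    freq = np.mean(np.abs(xbar - mu) > k * sigma)
    print(n, freq, 1 / (k**2 * n))   # empirical probability vs the Chebychev bound
```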