
Chapter 0
The Art of Measuring Sets

ΠSets and the OR Rule

A very simple notion is that of a set or collection of objects. Being so simple one can
expect it to crop up in great generality in many places, and indeed it does. It has become a
useful building block in mathematics.

There are many ways to describe a set, but the two most important from our point of view
are: listing the objects in the set, or specifying some property that the objects in
the set satisfy. Naturally in order for this description to be satisfactory, the writer has to
accomplish communication with the reader, just like in the case of sequences in the
previous chapter. For example, suppose I were to write {1,2,3,4,…}. Here I have neither
listed all the objects in the set (an impossible task), nor specified a property that the
elements of the set satisfy. Rather I have appealed to your intuition, and expect you to
know that I am describing the set of natural or counting numbers. Note that, by and
large, you wouldn't have too much trouble deciding if a certain object is an element of
this set or not.

On the other hand, if I were to write {2,3,5,17,…} it might be totally unclear to the reader
what set I am talking about. Probably one could speculate I had in mind the set
{x | x = 2^(2^n) + 1, n ≥ 1}, but it would be much better to have clarified. But it could also be
argued that I am just asking for the primes in this collection. Then it would not
necessarily be so easy to decide whether a given object is in the set or not.

To understand the basic notation: whenever one sees { , the set bracket symbol, one reads
it as: the set of all, and the symbol | reads as such that, and of course, as with any other
parentheses, we are required to use } to close the phrase. Thus, {x | x is a prime} reads as
the set of all x’s such that x is a prime, or simply the set of primes. Similarly,
{1,3,5,7,9,...} should probably be read as the set of odd positive integers.

One of the definite ingredients in the course will be to count sets, that is, to decide on the
number of distinct or different elements a set has. Thus {2,3,2,4} has 3 elements since
we are not counting occurrences of 2, we are just counting elements. A very common
error when counting sets is that of double counting, another name for counting an
element of a set more than once (it is like claiming that the set above has four elements).
We will use capital letters for sets, such as A, B, C, X, Y and we will use |A|, |B|, etc. to
denote the cardinality, or number of distinct elements of the sets A, B, etc. It is not
uncommon to refer to a set with n different elements as an n-set. Thus the set {2,3,4} is a
3-set.

It is universal to use x ∈ A to denote the fact that x is an element of A and x ∉ A to
express the negation of that fact. Also universally, ∅ denotes the empty set, the set with
no elements, so |∅| = 0, and it is the only set that has this cardinality. Note that we are
implying that there is only one empty set. The reason for this is that equality of sets is
totally determined by their elements: two sets are considered equal if they have
identical collections of elements. Thus, {1,2,3,2} = {3,1,2} since they have the same
elements. From the linguistic point of view, sets A and B are equal if and only if
∀x (x ∈ A ↔ x ∈ B).

A related notion is that of a subset or subcollection. If A and B are sets, one says A is a
subset of B if every element of A is an element of B (observe it is a conditional
statement), and one writes A ⊆ B or B ⊇ A to express that fact. Thus A = B if and only
if A ⊆ B and B ⊆ A. Note that naturally since every element of A is an element of A,
A ⊆ A; every set is a subset of itself. Some authors use B ⊂ A to indicate that B ⊆ A and
B ≠ A.

Since every element of the empty set is in any set (there are none to check), ∅ ⊆ A for any
set A. But it is not necessarily true that ∅ ∈ A.

An obvious remark is that if A ⊆ B, then |A| ≤ |B|. For example, the set of students in
this class is a subset of the set of students at CSULB, which in turn is a subset of the set of
students in California, so in particular there are no more students in this class than at CSULB
and in turn no more than in California (kind-of-crude inequalities!).

Building new sets from old sets is easy. For example, if A and B are sets, let C consist of
all things in A or in B. If A={1,2,3,4} and B={3,4,5,6,7}, then C={1,2,3,4,5,6,7}. This
C is called the union of A and B, and is denoted by C = A ∪ B. Formally defined,
A ∪ B = {x | x ∈ A or x ∈ B}.
The picture in one's mind of a union is very simple: just put the two sets together. A
common way to represent sets graphically is via Venn diagrams (also affectionately
called bubbles): if one uses a bubble to represent a set, then the union is the two bubbles
together.

[Figure: Venn diagrams of A, B and A ∪ B]

Also easy is the intersection of two sets. This new set consists of only those objects that
A and B have in common. Letting D denote this intersection, D={3,4} in the example
above. In symbols, D = A ∩ B and formally
A ∩ B = {x | x ∈ A and x ∈ B}.

[Figure: Venn diagram of A ∩ B]

One common error when counting sets is to presume |A ∪ B| = |A| + |B|, which is patently
not true as our example above shows: |A|=4, |B|=5, but |A ∪ B| = 7. What's gone wrong?
4 + 5 ≠ 7. Indeed, when we added 4+5 we counted some things twice, namely the elements
3 and 4 (or, in general, anything in the intersection of A and B), so in order to rectify the
count we need to subtract 2 from the sum. The correct formula is
|A ∪ B| = |A| + |B| − |A ∩ B|.
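
The double counting can be checked directly on the running example. The following is a minimal Python sketch (the variable names are ours, not part of the text) that compares the naive sum with the corrected formula:

```python
# Sets from the running example in the text.
A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7}

union = A | B           # A ∪ B
intersection = A & B    # A ∩ B

# The naive sum counts every element of the intersection twice.
naive = len(A) + len(B)
# Subtracting the size of the intersection repairs the count.
corrected = len(A) + len(B) - len(intersection)

print(len(union), naive, corrected)
```

Python's built-in sets discard repeated elements, so `len` counts distinct elements just as cardinality does.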

Of course if A and B were disjoint, which means they have no element in common
(or, equivalently, their intersection is empty, A ∩ B = ∅), then the size of their union is
the sum of the sizes. As modest a claim as this is, it is still useful and of course it can
be generalized to an arbitrary number of sets as long as they are pairwise disjoint, that is,
no two of them have anything in common.

More formally, let A_1, A_2, …, A_t be subsets of C. Then we say A_1, A_2, …, A_t partition C
if every element of C is in exactly one of the pieces; in other words, A_1 ∩ A_2 = ∅,
A_1 ∩ A_3 = ∅, …, A_i ∩ A_j = ∅, …, A_{t−1} ∩ A_t = ∅ and A_1 ∪ A_2 ∪ ⋯ ∪ A_t = C. Then if this is
the case we have
|C| = |A_1| + |A_2| + ⋯ + |A_t|.

This rule is often called the first counting principle (we will see the second one later)
or the rule of sum. The subsets involved in a partition are often called the classes of the
partition. Thus, we could count the undergraduates in the university by counting the
freshmen, the sophomores, the juniors and the seniors and adding up the results. A little
bit more interesting is the following

Example 1. We are to toss a coin 3 times and record the results (as to heads or tails). Let
C be the set of possible outcomes of this experiment. Let A_0 be the subset of outcomes with
0 heads, let A_1 be the subset of outcomes with 1 head, A_2 the outcomes with 2 heads,
and A_3 the one with 3. Then A_0, A_1, A_2, A_3 partition C and so |A_0| + |A_1| + |A_2| + |A_3| = |C|.
Indeed, C={HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}, while A_0 = {TTT},
A_1 = {HTT, THT, TTH}, A_2 = {HHT, THH, HTH} and A_3 = {HHH}. Note 8=1+3+3+1.
Similarly if D were the set of outcomes with at least 2 heads, then A_2 and A_3 partition D,
so |D| = 4.
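
The partition in the coin example can be generated mechanically. This is a small Python sketch, assuming outcomes are written as strings such as "HTH"; the names `classes` and `sizes` are ours:

```python
from itertools import product

# All outcomes of tossing a coin 3 times, written as strings like "HTH".
C = ["".join(t) for t in product("HT", repeat=3)]

# classes[k] plays the role of A_k: the outcomes with exactly k heads.
classes = {k: [w for w in C if w.count("H") == k] for k in range(4)}

# Sizes of the classes; by the rule of sum they add up to |C|.
sizes = [len(classes[k]) for k in range(4)]

# D: outcomes with at least 2 heads, partitioned by A_2 and A_3.
D = classes[2] + classes[3]

print(sizes, sum(sizes), len(D))
```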

What happens if the pieces are not disjoint is more complicated, and in order to
understand it, it is best to visualize sets a bit. A way of accomplishing this is again by using
Venn diagrams to represent sets. To represent A, B, C, subsets of a universe U, one has then
the picture

[Figure: Venn diagram of three overlapping sets A, B, C inside a universe U]

How many pieces are there in this picture? To best represent them we need to introduce the
complement of a set. If some universe is clearly understood and A is some subset of that
universe, then the complement of A, written A^c, is the collection of things in the universe
that are not in A. Thus complementation corresponds to negation:
A^c = {x | x ∉ A}.
But do not forget there is a universe in the background. Clearly, by the first counting
principle, the size of A + the size of A^c = the size of the universe:
|A| + |A^c| = |U|.

Then going back to our picture. There are eight disjoint pieces, and each of them is
described as a triple intersection of our sets or their complements (written with a
superscript c):
1 = A ∩ B ∩ C          2 = A ∩ B ∩ C^c
3 = A^c ∩ B ∩ C        4 = A ∩ B^c ∩ C
5 = A ∩ B^c ∩ C^c      6 = A^c ∩ B ∩ C^c
7 = A^c ∩ B^c ∩ C      8 = A^c ∩ B^c ∩ C^c

[Figure: the three-set Venn diagram with its pieces labeled 1 through 8]

Of course if we know the cardinalities of these pieces, we know the size of every subset in
the picture. For example, since A = 1 ∪ 2 ∪ 4 ∪ 5, and since the pieces are disjoint, we get that
|A| = |1| + |2| + |4| + |5|. Similarly, |A ∩ B| = |1| + |2|.

But data is rarely as accommodating as to come in the desired form. Instead, what's
usually available is the size of A, B, C, A ∩ B, A ∩ C, B ∩ C and A ∩ B ∩ C. What is the
size then of A ∪ B ∪ C?

Let's review our picture. Suppose we simply add |A| + |B| + |C|. We have definitely
counted some things twice, namely A ∩ B, A ∩ C and B ∩ C. So in order to balance the
books we need to subtract these. But now what has happened with A ∩ B ∩ C? Originally
it had been counted thrice, but now it has been subtracted thrice! So it has not been counted
at all, and we need to add it back in, so the right expression becomes:

|A ∪ B ∪ C| = |A| + |B| + |C| − |A ∩ B| − |A ∩ C| − |B ∩ C| + |A ∩ B ∩ C|.

The following table may make the theorem more obvious:

Piece   A   B   C   A∩B   A∩C   B∩C   A∩B∩C
  1     1   1   1    1     1     1      1
  2     1   1   0    1     0     0      0
  3     0   1   1    0     0     1      0
  4     1   0   1    0     1     0      0
  5     1   0   0    0     0     0      0
  6     0   1   0    0     0     0      0
  7     0   0   1    0     0     0      0
  8     0   0   0    0     0     0      0

where a 1 is entered if the piece in that row is counted when the set heading the column is
counted. Then one can readily verify that if we add the first three columns
and the last one, and from that sum subtract the remaining three columns, each of the first
seven rows ends up with exactly 1, and the eighth row with 0.
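
The bookkeeping in the table can be automated. The sketch below (our own naming; the pieces are generated in a different order than the rows of the table) computes, for each of the 8 pieces, how many times the three-set formula ends up counting it:

```python
from itertools import product

# The seven columns of the table, each named by the sets it intersects.
columns = [("A",), ("B",), ("C",),
           ("A", "B"), ("A", "C"), ("B", "C"),
           ("A", "B", "C")]

def net_count(membership):
    """Net number of times the formula counts a piece.

    membership maps "A", "B", "C" to True/False for that piece.
    """
    total = 0
    for col in columns:
        inside = all(membership[s] for s in col)   # the table entry, 1 or 0
        sign = 1 if len(col) % 2 == 1 else -1      # single sets and the triple are added
        total += sign * inside
    return total

# One piece for each in/out choice for A, B and C; the last one lies outside all three.
pieces = [dict(zip("ABC", bits)) for bits in product([True, False], repeat=3)]
counts = [net_count(p) for p in pieces]
print(counts)
```

Every piece inside at least one set is counted exactly once, and the piece outside all three is not counted.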

This is a specific case of a general theorem called the principle of inclusion-exclusion.

But the best way to go about it when a Venn diagram can be drawn is to compute the size
of each of the 8 pieces in the diagram: a specific example should help.

Example 2. Of the cars sold during the month of August, 90 had air
conditioning, 100 had automatic transmission, and 75 had power
steering. Five cars had all of these extras. Twenty cars had none of these
extras. Twenty cars had only air conditioning and 60 cars had only
automatic transmission. Ten cars had both automatic transmission and
power steering. How many cars were sold in August?

[Figure: Venn diagram of AC, AT and PS with the given data filled in: only AC 20, only AT 60, all three 5, AT and PS only 5, none 20]

Filling in the Venn diagram we have some of the 8 disjoint pieces readily
available to us, and in the picture above we have filled in all the available information.
From 10 cars had both AT & PS, we can infer that 5 had only AT & PS since 5 had all three,
and then we can successively fill in the rest of the diagram. Note that one has to read the
sentences as always inclusive unless otherwise stated: for example, 100 cars had AT means
the size of the set of cars with automatic transmission was 100.

[Figure: completed Venn diagram: only AC 20, AC and AT only 30, only AT 60, all three 5, AC and PS only 35, AT and PS only 5, only PS 30, none 20]

Now we can answer the question, and indeed we could answer
any question about any combination of sets of cars: the number
of sold cars is the sum of all the eight pieces, which equals
205.
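
The successive filling in of the diagram is just a chain of subtractions. A short Python sketch of that chain, with variable names of our own choosing:

```python
# Given data from Example 2 (totals are inclusive).
AC, AT, PS = 90, 100, 75
all_three, none = 5, 20
only_AC, only_AT = 20, 60
AT_and_PS = 10

# Fill in the remaining pieces one at a time.
AT_PS_only = AT_and_PS - all_three                   # AT and PS but not AC
AC_AT_only = AT - only_AT - AT_PS_only - all_three   # AC and AT but not PS
AC_PS_only = AC - only_AC - AC_AT_only - all_three   # AC and PS but not AT
only_PS = PS - AC_PS_only - AT_PS_only - all_three

# The eight disjoint pieces partition the set of cars sold.
pieces = [only_AC, only_AT, only_PS,
          AC_AT_only, AC_PS_only, AT_PS_only,
          all_three, none]
total = sum(pieces)
print(pieces, total)
```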

Example 3. In a survey of 75 consumers, 12 indicated they were going to buy a new car,
18 said they were going to buy a new refrigerator, and 24 said they were going to buy a
new oven. Of these, 6 were going to buy both a car and a refrigerator, 4 were to buy a car
and an oven, and 10 were going to buy a refrigerator and an oven.
Two were to purchase all three items. Once the information has
been processed in the picture, all questions can be answered.
For example we know that 39 people do not intend to buy
anything.

[Figure: Venn diagram of C, R and O with region sizes: only C 4, C and R only 4, only R 4, all three 2, C and O only 2, R and O only 8, only O 12, none 39]
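
The figure of 39 can also be reached without the picture, straight from the three-set formula. A minimal sketch (variable names are ours):

```python
surveyed = 75
C, R, O = 12, 18, 24      # car, refrigerator, oven
CR, CO, RO = 6, 4, 10     # sizes of the pairwise intersections
CRO = 2                   # all three items

# Inclusion-exclusion for three sets.
buying_something = C + R + O - CR - CO - RO + CRO
buying_nothing = surveyed - buying_something

# A single piece of the diagram: people buying only a car.
only_car = C - CR - CO + CRO
print(buying_something, buying_nothing, only_car)
```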
Example 4. In an eccentric way to organize his research
company, the owner decides to assign himself as #1, and
then his 99 employees will be assigned a number from 2 through 100. An employee will
then be subservient to all other employees whose numbers are factors of his/her
number; thus, employee #6 would be subservient to the boss (#1) and to both employees
#2 and #3. Of course, everybody is subservient to the boss.

The boss would like to know how many employees are going to be responsible just to him.
Easily, an employee has no boss but the boss exactly when their number is a prime. Apart
from 2, 3, 5 and 7 themselves, that will occur exactly when it is not a multiple of 2, 3, 5 or 7.
First we will count all the numbers that are not multiples of 2, 3 or 5. Let E stand for the
set of multiples of two between 1 and 100, let T stand for the multiples of three, and let F
stand for the multiples of five. What we are after is the number of elements in
region 8, E^c ∩ T^c ∩ F^c (the superscript c denotes the complement).

[Figure: Venn diagram of E, T and F with the sizes of the regions filled in]

From the picture, or by simple logic, this is equivalent to counting
E ∪ T ∪ F, and then subtracting that total from 100. We use inclusion-exclusion to count
E ∪ T ∪ F. Certainly E has 50 elements since 50 = ⌊100/2⌋, while T has 33 = ⌊100/3⌋ and F,
20 = ⌊100/5⌋ elements, where we let ⌊x⌋ denote the largest whole number not exceeding x. How
does a number get to be in E ∩ T? By being a multiple of 2 and a multiple of 3, in other
words a multiple of their least common multiple, 6, and thus E ∩ T has 16 = ⌊100/6⌋
elements. Similar reasoning applies to E ∩ F, the multiples of 10 with 10 = ⌊100/10⌋
elements, and to T ∩ F, the multiples of 15 with 6 = ⌊100/15⌋. And thus we subtract these. Finally,
the elements of E ∩ T ∩ F are the multiples of 30, so it has 3 elements, and so since
|E ∪ T ∪ F| = |E| + |T| + |F| − |E ∩ T| − |E ∩ F| − |T ∩ F| + |E ∩ T ∩ F|,
we get
|E ∪ T ∪ F| = 50 + 33 + 20 − 16 − 10 − 6 + 3 = 74.
And so |E^c ∩ T^c ∩ F^c| = 100 − 74 = 26, which means there are 26 numbers that are not
multiples of 2, 3 or 5 among the first 100 numbers. The set is actually
{1,7,11,13,17,19,23,29,31,37,41,43,47,49,53,59,61,67,71,73,77,79,83,89,91,97}.
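
As a check on the count of 74, the following Python sketch builds E, T and F explicitly and compares the brute-force answer with the floor-based formula. Integer division `//` plays the role of the floor here; the names are ours:

```python
N = 100
universe = set(range(1, N + 1))

# Multiples of 2, 3 and 5 between 1 and N.
E = {n for n in universe if n % 2 == 0}
T = {n for n in universe if n % 3 == 0}
F = {n for n in universe if n % 5 == 0}

# Inclusion-exclusion, with each size computed as a floor.
by_formula = (N // 2 + N // 3 + N // 5
              - N // 6 - N // 10 - N // 15
              + N // 30)

# Region 8: numbers that are multiples of none of 2, 3, 5.
region8 = sorted(universe - (E | T | F))
print(by_formula, len(region8))
print(region8)
```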

Before we finish counting the primes below 100, we need to include another set: S, the
multiples of seven, and thus we need to understand inclusion-exclusion for four sets.
Unfortunately, a Venn diagram cannot easily be drawn for four or more sets, but a table similar to
the one built above can always be made. However, we have yet another approach to it,
and that is the use of recursion.

Suppose we had now four sets: A, B, C, and D. We use what we know: let C′ = C ∪ D.
Then we have
|A ∪ B ∪ C ∪ D| = |A ∪ B ∪ C′| = |A| + |B| + |C′| − |A ∩ B| − |A ∩ C′| − |B ∩ C′| + |A ∩ B ∩ C′|.
From before we know |C′| = |C| + |D| − |C ∩ D|. What about A ∩ C′? Consider the Venn
diagram picture for A and C′ = C ∪ D:

[Figure: Venn diagram of A together with C and D, and the same diagram with A ∩ C′ marked]

And we clearly see that

A ∩ C′ = A ∩ (C ∪ D) = (A ∩ C) ∪ (A ∩ D).

This is a distributive law and if you say it in words (ands and ors) you will convince
yourself that it is true. Similarly,
B ∩ C′ = B ∩ (C ∪ D) = (B ∩ C) ∪ (B ∩ D)
and
A ∩ B ∩ C′ = A ∩ B ∩ (C ∪ D) = (A ∩ B ∩ C) ∪ (A ∩ B ∩ D).
But then, by the formula for the union of two sets,
|A ∩ C′| = |A ∩ C| + |A ∩ D| − |A ∩ C ∩ D|
since (A ∩ C) ∩ (A ∩ D) = A ∩ C ∩ D (because A ∩ A = A), and similarly,
|B ∩ C′| = |B ∩ C| + |B ∩ D| − |B ∩ C ∩ D|,
|A ∩ B ∩ C′| = |A ∩ B ∩ C| + |A ∩ B ∩ D| − |A ∩ B ∩ C ∩ D|.
Substituting, we then have
|A ∪ B ∪ C ∪ D| = |A| + |B| + |C| + |D|
− |A ∩ B| − |A ∩ C| − |A ∩ D| − |B ∩ C| − |B ∩ D| − |C ∩ D|
+ |A ∩ B ∩ C| + |A ∩ B ∩ D| + |A ∩ C ∩ D| + |B ∩ C ∩ D| − |A ∩ B ∩ C ∩ D|.
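
The alternating pattern of signs lends itself to a short program. The sketch below is one way to write it for any finite list of sets; it is not a method from the text, the function name `union_size` is ours, and the sets C and D used to try it out are made up for illustration:

```python
from itertools import combinations

def union_size(sets):
    """Size of the union, computed by adding and subtracting intersections."""
    total = 0
    for r in range(1, len(sets) + 1):
        sign = 1 if r % 2 == 1 else -1          # add odd-size groups, subtract even
        for group in combinations(sets, r):
            total += sign * len(set.intersection(*group))
    return total

A = {1, 2, 3, 4}
B = {3, 4, 5, 6, 7}
C = {1, 5, 9}       # hypothetical extra set
D = {4, 9, 10}      # hypothetical extra set

print(union_size([A, B, C, D]), len(A | B | C | D))
```

For four sets the loops visit exactly the 15 terms of the formula above: 4 single sets, 6 pairs, 4 triples and 1 quadruple.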

Note that there is a very nice symmetry in the formula, which makes it quite easy to
remember. We will not bother generalizing it to n sets, but everybody should know that,
as we mentioned above, the general rule is called the inclusion-exclusion
principle, and we have just seen a couple of instances of it.

We can now finish Example 4:

|E ∪ T ∪ F ∪ S| = |E| + |T| + |F| + |S| − |E ∩ T| − |E ∩ F| − |E ∩ S| − |T ∩ F| − |T ∩ S| − |F ∩ S|
+ |E ∩ T ∩ F| + |E ∩ T ∩ S| + |E ∩ F ∩ S| + |T ∩ F ∩ S| − |E ∩ T ∩ F ∩ S|.
And we get: |E ∪ T ∪ F ∪ S| = 50 + 33 + 20 + 14 − 16 − 10 − 7 − 6 − 4 − 2 + 3 + 2 + 1 + 0 − 0 = 78.

The complement then has 22 elements, but we have included 1 and excluded 2, 3, 5 and 7,
and so the true answer is that there are 22 − 1 + 4 = 25 primes below 100. The list of
employees with only the owner as their boss is
{2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97}.
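
The final adjustment (drop 1, put back 2, 3, 5 and 7) is easy to get wrong, so a direct check is worthwhile. A minimal Python sketch with our own variable names:

```python
N = 100
universe = set(range(1, N + 1))

# E, T, F and S: the multiples of 2, 3, 5 and 7 up to N.
multiples = {p: {n for n in universe if n % p == 0} for p in (2, 3, 5, 7)}
covered = set().union(*multiples.values())

# The complement still contains 1 and is missing 2, 3, 5 and 7.
leftover = universe - covered
primes = sorted((leftover - {1}) | {2, 3, 5, 7})

print(len(covered), len(leftover), len(primes))
```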

So far in this chapter we have learned how to count some sets based on information from
other sets. The construction was basically that of unions, or in linguistic terms, they
were ors. Now we will address the issue of how to count ands.

2. The Second Counting Principle: The AND Rule

We start with the set construction that will aid us: the product of two sets. Let A and B
be sets. Then a new set A cross B, A × B, consists of all the ordered pairs whose first
coordinate comes out of A and whose second coordinate comes out of B.

For example, if, as before, A={1,2,3,4} and B={3,4,5,6,7}, then A× B consists of the 20
pairs:
(1,3) (1,4) (1,5) (1,6) (1,7)
(2,3) (2,4) (2,5) (2,6) (2,7)
(3,3) (3,4) (3,5) (3,6) (3,7)
(4,3) (4,4) (4,5) (4,6) (4,7)
Similarly, we could build the cross product of three sets A × B × C which consists of the
ordered triples out of A, B and C respectively which means the first coordinate is from A,
the second from B and the third is from C.
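
In Python the product of sets is available as `itertools.product`. A brief sketch, using lists so that the pairs come out in the same order as in the table above (the third list C is made up for illustration):

```python
from itertools import product

A = [1, 2, 3, 4]
B = [3, 4, 5, 6, 7]

# Ordered pairs: first coordinate from A, second from B.
pairs = list(product(A, B))
print(len(pairs), pairs[:5])

# Ordered triples work the same way; C is a hypothetical third set.
C = ["x", "y"]
triples = list(product(A, B, C))
print(len(triples))
```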

A parallel way to visualize the product of sets is that of a tree diagram, although at times it
can be laborious. The basic ingredient is that of stages. Namely, the set we are building
has stages for its development. An example will help clarify.

Example 1. Suppose that two players are going to play five games of a specific
diversion or sport. We are interested in recording the winner at each game. How
many ways are there of doing this? Let’s call the players A and B. It is clear here that each
individual element of the set to be counted consists of five games and we will adopt
these as the stages of our development. We could visualize it as ordered five-tuples or as
a tree. In order to be efficient, we will write, for example, ABBAA for ( A, B, B, A, A) .

Listing all the possible outcomes:


AAAAA AAAAB AAABA AAABB AABAA AABAB AABBA AABBB
ABAAA ABAAB ABABA ABABB ABBAA ABBAB ABBBA ABBBB
BAAAA BAAAB BAABA BAABB BABAA BABAB BABBA BABBB
BBAAA BBAAB BBABA BBABB BBBAA BBBAB BBBBA BBBBB

Starting at the beginning, the first game can be won by either A or B, so that is our first
branching. At each node of our tree there are going to be two new branches so that we
can easily keep track of how many nodes there are at each level and since at the end each
terminal node is going to be at level 5 (all branches will have 5 individual branches), all
we will need to know is how many terminal nodes there are. Well, we know there are two
nodes at level 1, and each will give rise to two new nodes, so there will be four nodes at
level 2 (2 × 2), each of which in turn will give rise to two other ones at level 3. So
at this level there will be 8 nodes (2 × 2 × 2, or better yet 2^3). Continuing in this
fashion we get 16 nodes at level 4 and finally we have a total of 32 branches (or nodes at
level 5).

[Figure: the corresponding tree diagram, with two branches at every node and five levels]

The point again of why trees are useful is that
there are as many full branches as there are terminal nodes.

The whole key here, to emphasize again, was that the number of branches at each stage
was independent of where in the tree you were. Although that property is nice,
what is truly essential is that at each level every node has the same
number of branches coming out of it. (It did not matter that all levels
had two branches coming out of each node; again we repeat, what was
crucial was that at each level all nodes have the same number of
branches coming out.) That way we can keep multiplying.

Note also that we could easily count the number if there were 7 games, or 9 games, or 15
games. Namely if there are n games, the answer is 2^n.
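
The listing of the 32 outcomes, and the count 2^n in general, can be reproduced with a few lines of Python. This is only a sketch; the function name `outcomes` is ours:

```python
from itertools import product

def outcomes(n):
    """All ways to record the winner (A or B) of each of n games."""
    return ["".join(w) for w in product("AB", repeat=n)]

five = outcomes(5)
print(len(five), five[:4])

# The same count for other numbers of games.
for n in (7, 9):
    print(n, len(outcomes(n)), 2 ** n)
```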

Often, one of the hardest things to do in mathematics is to realize what you have solved
when you have worked out a problem. Suppose I were to ask you: how many subsets
does {1,2,3,4,5} have? Can you work it out? We already have. In order to build a subset,
you have five decisions to make: whether to put 1 in the subset or not, whether to put 2 in
the subset or not, whether to put 3 in or not, and the same with 4, and the same with 5.
Let's say A corresponds to not put in, B to put it in. Then look at any of the 32 ways to
play five games, for each of them we can find a corresponding subset: for example,
AAAAA corresponds to the empty set, ∅ , BBBBB corresponds to the whole set
{1,2,3,4,5} , BAABA corresponds to {1,4} while ABBAB corresponds to {2,3,5}. And the
correspondence goes in reverse too: the subset {1,3,4} corresponds to BABBA. Is it then
clear that there are exactly 32 subsets of {1,2,3,4,5}? And just as we generalized the
games idea, we have that
if X has n elements, then X has 2^n subsets,
of course, including the empty subset and X itself.
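
The correspondence between game records and subsets can be written down as two small functions, one in each direction. A sketch with hypothetical names `subset_from_word` and `word_from_subset`:

```python
ITEMS = (1, 2, 3, 4, 5)

def subset_from_word(word, items=ITEMS):
    """Letter B in position i means: put the i-th item in the subset."""
    return {x for x, letter in zip(items, word) if letter == "B"}

def word_from_subset(subset, items=ITEMS):
    """The reverse direction: B if the item is in the subset, A if not."""
    return "".join("B" if x in subset else "A" for x in items)

print(subset_from_word("BAABA"))     # {1, 4}
print(word_from_subset({1, 3, 4}))   # BABBA
```

Since each function undoes the other, there are exactly as many subsets as there are words, namely 32.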

It turns out that many counting problems can be worked out by suitable trees where the
number of branches coming out of a node depends only on the level of the node not on
which node nor the past history of the tree. Before we get more abstract, let's look at
another example.

Example 2. Out of the five candidates: Mr. Alberts, Ms. Brett, Mrs. Chan, Mr. Diaz and
Mr. Ewing, a president, a secretary and a treasurer are to be chosen. How many ways can
this be done? The tree is shown in the figure, but it illustrates the point that from now on
we should build trees in our heads. The first decision is to choose a president, so that is the
first level in the tree. This will give rise to 5 nodes at level 1. If we are at one of those
nodes, that person cannot be chosen for the next decision, that of secretary, but nevertheless
we have exactly four branches coming out of each of those nodes at level one, so we will
have exactly 20 nodes at level 2. Suppose now that you are at level 2, so a president and a
secretary have been chosen. Regardless of who has been chosen so far, at each node at
level 2, you will have 3 new branches for level 3, so we will have a total of 60 nodes at
level three, or equivalently 60 terminal branches in our tree, so there are 60 ways of doing
the choosing.

[Figure: tree diagram with levels President, Secretary and Treasurer, branching from start to Alberts, Brett, Chan, Diaz and Ewing]

There are several relevant observations to the last example. First, it did not matter at all
what our first decision was. We could have just as easily decided that first we were going
to choose a treasurer, and then a secretary and then a president, or first a secretary, then a
president and then a treasurer, whatever. What was
essential was that the three positions were differentiable, so that our levels in the
tree were clearly spelled out. (We will return to this important issue in a later example.)

Second, although we had different choices depending on which branch of our tree we
were on, the number of choices was independent of whom we had chosen. This is a crucial
point. For suppose we had a constraint such as if Mr. Alberts is elected president, then
Mrs. Chan will not serve as treasurer; then our whole approach is down the drain and we
are into counting branches by just drawing them all. (Actually this is not quite true, only
for the time being).

Third, to visualize the set as a collection of triples out of A, B, C, D and E we must require
that the triples have no repeated entries, and thus this description is harder than the tree.
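
The 60 ways of Example 2 are exactly the ordered triples without repeated entries, which Python generates with `itertools.permutations`. The sketch below also counts by brute force under the constraint mentioned in the second observation; the variable names are ours:

```python
from itertools import permutations

candidates = ["Alberts", "Brett", "Chan", "Diaz", "Ewing"]

# Each slate is a triple (president, secretary, treasurer) with no repeats.
slates = list(permutations(candidates, 3))
print(len(slates))

# With the constraint, the multiplication breaks down, but listing still works:
# if Alberts is president, Chan does not serve as treasurer.
allowed = [s for s in slates if not (s[0] == "Alberts" and s[2] == "Chan")]
print(len(allowed))
```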

Before we lose track of it later on, let's enunciate the very important counting principle
we have been using. It is usually referred to as the second counting principle or the
rule of product:
if when building the elements of a set one has t clearly differentiated
decisions (or stages, or levels), and the number of options at each
stage depends only on what stage of the process we are in and not on
the previous choices for the previous stages, then the total number
of elements in our set is the product n_1 × n_2 × ⋯ × n_t where n_1 is the
number of choices at the first stage, n_2 the number of choices at the
second stage, etcetera.

The following is classically typical:

Example 3. In how many ways can we arrange n people in line (to pose for a picture)?
We have n choices for our first position, or first stage. For the next stage we only have
n − 1 choices, and for the third stage we have exactly n − 2 , and so on until at the last
stage, the nth stage we have only one option (whoever is left goes last on the line). Can
you visualize the tree? So the number of ways is n × (n − 1) × (n − 2) × ⋯ × 1, which is
called n factorial, and is denoted by n! . This is one of the functions that will be most
important throughout the course.

n 1 2 3 4 5 6 7 8 9 10
n! 1 2 6 24 120 720 5040 40320 362880 3628800

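What is 0! then? If we see how we get the next column from the previous one, we see
that in the table the entry under n is n times the entry under n − 1. Taking the columns
0 and 1, and writing X for 0!, we need 1 × X = 1! = 1, and so X = 0! = 1.

But note that we could not succeed in defining (−1)!: taking the columns −1 and 0, and
writing X for (−1)!, we would need 0 × X = 0! = 1, and nothing
multiplied by 0 will produce 1.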
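
The column-to-column rule translates directly into a loop. A minimal sketch of a factorial function (Python's standard library also offers `math.factorial`); the empty product gives 0! = 1, and negative arguments are rejected:

```python
from itertools import permutations

def factorial(n):
    """n! as the product 1 * 2 * ... * n; the empty product gives 0! = 1."""
    if n < 0:
        raise ValueError("factorial is not defined for negative integers")
    result = 1
    for k in range(1, n + 1):
        result *= k        # next column = n times the previous column
    return result

print([factorial(n) for n in range(0, 11)])

# Cross-check against an explicit listing of all line-ups of 4 people.
print(len(list(permutations("ABCD"))), factorial(4))
```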

There are two cautions associated with the second counting principle: first, you must
have the same number of branches coming out of each node at any given level, the
second is that you have to be able to differentiate between your levels. First, we give
an example addressing the first caution.

Example 4. Two players are going to play a game until one of them has
won 3 hands or sets. How many different ways are there of playing the match? It is clear
here that each individual element of the set to be counted consists of several 'hands'
(possibly as many as five) and we will adopt these different hands as the
stages of our development. Let's call the players A and B. Starting at the
beginning, the first hand can be won by either A or B, so that is our first
branching:

[Figure: tree diagram of the match, with terminal nodes shaded]

One sees then that there are 20 distinct full branches, or equivalently, 20
terminal nodes (which are shaded) in the tree, and we can partition them in
the following manner:

                        Matches which A wins        Matches which B wins
3 games in the match    AAA                         BBB
4 games in the match    AABA, ABAA, BAAA            ABBB, BABB, BBAB
5 games in the match    AABBA, ABABA, ABBAA,        AABBB, ABABB, ABBAB,
                        BAABA, BABAA, BBAAA         BAABB, BABAB, BBAAB

You will get no argument from anybody on the clumsiness of the procedure. How would
you like to do 4 games (i.e., best out of seven)? Even if we become smarter by just
drawing one half of the tree (notice the symmetry), it is still a painful, and not particularly
elucidating experience. Of course, a machine loves to do this kind of work, and the
program to get it to do the work is not difficult to come up with. But if all we are
interested in is the number of possible ways (more motivation on that later), there are
smarter ways to do it. These are not yet available at this stage of the course.

A large part of the clumsiness in the last example came from the fact that at each stage
the number of branches coming out of a node depended on the past history of the tree,
thus some of the elements in our set (the twenty ways listed above) had 3 individual
branches while others had 4 or 5.
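
As remarked above, a machine is happy to do this kind of work. Here is one possible program, a sketch rather than the only way to do it: it grows the tree recursively and stops a branch as soon as either player has the required number of wins. The function name `matches` is ours:

```python
def matches(wins_needed, history=""):
    """All ways a match can go when the first to wins_needed hands wins."""
    a, b = history.count("A"), history.count("B")
    if a == wins_needed or b == wins_needed:
        return [history]                       # terminal node of the tree
    # Otherwise branch on who wins the next hand.
    return (matches(wins_needed, history + "A")
            + matches(wins_needed, history + "B"))

best_of_five = matches(3)
print(len(best_of_five))

# Partition by the length of the match, as in the table above.
for length in (3, 4, 5):
    print(length, [m for m in best_of_five if len(m) == length])

# The best-out-of-seven case, by the same brute force.
print(len(matches(4)))
```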

Some more examples:


Example 5. How many divisors does a number have? Take for example 64 = 2^6. Easily,
the divisors are 1, 2, 4, 8, 16, 32 and 64; or in other words, 2^i where i = 0, …, 6. This will
hold for any power of a prime number. If p is a prime, then p^e has e + 1 divisors:
p^0 = 1, p^1 = p, p^2, …, p^e. How about a number that is not a power of a prime? Take
36 = 2^2 × 3^2. The divisors of 36 are 1, 2, 3, 4, 6, 9, 12, 18, and 36, since by the theorem
they are all of the form 2^i × 3^j where i = 0,1,2 and j = 0,1,2. So here we have two
decisions: what exponent to choose for 2 (3 choices) and what exponent to choose for 3
(3 choices again), for a total of 9 options all together.

How about 72? 72 = 2^3 × 3^2. So it has 12 = (3 + 1)(2 + 1)
divisors. We can easily visualize these divisors with the help
of a tree.

[Figure: tree whose first level is the exponent of 3 (1, 3, 9) and whose second level is the exponent of 2, with leaves 1, 2, 4, 8; 3, 6, 12, 24; 9, 18, 36, 72]
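
The count of divisors can be checked both ways: by listing divisors directly, and by multiplying (exponent + 1) over the prime factorization. A short sketch, with function names of our own:

```python
def divisors(n):
    """All positive divisors of n, by trial division."""
    return [d for d in range(1, n + 1) if n % d == 0]

def count_from_exponents(exponents):
    """Rule of product: one independent choice of exponent per prime."""
    count = 1
    for e in exponents:
        count *= e + 1      # exponents 0, 1, ..., e are all allowed
    return count

print(divisors(36), count_from_exponents([2, 2]))   # 36 = 2^2 * 3^2
print(divisors(72), count_from_exponents([3, 2]))   # 72 = 2^3 * 3^2
```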
Example 6. In how many ways can we arrange our favorite 5
people Mr. A, Ms. B, Mrs. C, Mr. D and Mr. E for a picture if
Mr. D and Mr. E refuse to stand next to each other? Before we
proceed you should try to visualize the tree that you would be
building if you tackled this problem head on. Probably you
would say the first level of the tree is who goes first, then who
goes second, etcetera. It also becomes clear that your number
of choices at the next stage depends on which node you are in.
Thus, if you chose Mr. D for the first person, you only have
three choices for the second while if you chose Ms. B for the
first person, you would have four. So in order to count the
branches in the tree, we would build the tree or think some
more. It pays to think. Move laterally, don't tackle it head on.
From Example 3 we know there are 120 ways to arrange our five people for a picture.
This universe can be partitioned into two pieces: those with D and E standing together
and those with them apart. We want the size of the latter, but by the first counting
principle, if we find the size of the former and subtract it from 120 we will have our
answer. Is this any progress? So we are reduced to counting the number of ways of
arranging A, B, C, D and E for a picture so that D and E stand together. This is straight
second counting principle: first decision is making D and E stand together, we have two
options for it: DE and ED. After that we have to arrange 4 elements: A, B, C, and DE. By
Example 3, we have 4! = 24 ways of doing that, so in total we have 2 × 24 = 48 ways. So
there are 120 − 48 = 72 ways to arrange A, B, C, D and E for a picture so that D and E do
not stand together.
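
The lateral argument can be confirmed by listing all 120 line-ups and sorting them into the two classes of the partition. A minimal sketch, writing each line-up as a five-letter string:

```python
from itertools import permutations

# All line-ups of the five people, e.g. "ADBEC".
lineups = ["".join(p) for p in permutations("ABCDE")]

# D and E stand together exactly when "DE" or "ED" appears in the string.
together = [w for w in lineups if "DE" in w or "ED" in w]
apart = [w for w in lineups if w not in together]

print(len(lineups), len(together), len(apart))
```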

Now to the second caution in using the second counting principle: you have to be able to
differentiate between your stages. Let's change Example 2 slightly:

Example 7. Let's instead require that out of the five candidates, a committee of three is
to be chosen. Before we start discussing it in the abstract let's write down the answer.
There are 10 ways of choosing such a committee. Here are the 10 of them (using initials
to denote the people): {A,B,C}, {A,B,D}, {A,B,E}, {A,C,D}, {A,C,E}, {A,D,E}, {B,C,D},
{B,C,E}, {B,D,E}, {C,D,E}. One is tempted to use a tree to reason this out, but without
some extra reasoning one may be in trouble. Namely, what are you going to choose for
the first level of your tree? What is your first stage? What decision are you
making? Your answer would probably be: I am choosing the first member of
my committee. But is that clearly spelled out? Suppose your committee was
{A,B,C}. Who did you choose first? The difference between this example and
Example 2 is that before the three persons to be chosen were specifically
differentiated: there was a president, a secretary and a treasurer; now there are
only three people in a committee, non-differentiated. So how do we get the 10
without listing all the possibilities?

Going in reverse, from each of the 10 committees, there are 3! = 6 ways of choosing a
president, a secretary and a treasurer from it since there are three options for president,
and then two for secretary and then only one for treasurer. Of course, that way we obtain
a total of 60 = 5 × 4 × 3 ways of choosing a president, a secretary and a treasurer, so
the number of committees must be 60 / 6 = 10.
Remember that what is crucial is that the three members of a committee are
indistinguishable from each other while the board members are all differentiable;
pay attention to this distinction, it is terribly important.
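
The distinction between differentiated and non-differentiated choices corresponds to the difference between `permutations` and `combinations` in Python's `itertools`. A short sketch using initials for the candidates:

```python
from itertools import combinations, permutations

people = "ABCDE"

# Committees: unordered, so {A,B,C} is listed once.
committees = list(combinations(people, 3))

# Boards: ordered triples (president, secretary, treasurer).
boards = list(permutations(people, 3))

# Each committee gives rise to 3! = 6 boards.
print(len(committees), len(boards), len(boards) // len(committees))
```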

3. Choosings

The last section ended with how to count the number of committees one could make out
of 5 candidates if the committee were to consist of 3 of those candidates. In general, let n
and k be nonnegative integers. Then the number of subcollections (or committees) of size
k (or k-subsets) from a collection of n objects or candidates (an n-set) is called n choose
k and is denoted by $\binom{n}{k}$. The notation is from the 19th century. Other notations abound,
including nCk and $C^n_k$. Why this name? What you are doing is choosing out of n friends
that you have, k to come to a party, and you are counting the number of ways of doing
that. Or from n different balls, you are choosing k to put in a bucket.

From the previous section we saw $\binom{5}{3} = 10$ (it reads 5 choose 3) since there were 10
subsets of a 5-set that had 3 elements. We will revisit this below.

As mentioned above, the expression $\binom{n}{k}$ is the number of ways of making a committee of
k people out of n eligible candidates. The numbers $\binom{n}{k}$ are also called binomial
coefficients (the reason for this name will be clarified later), and many of us find them
among the most charming of numbers.

We do the example with n = 5 and k = 2 in the discussion below. We are starting with
five balls of different colors, and we are going to choose two to go into a bucket (for
whatever reason).

If we label the balls 1, 2, 3, 4 and 5, then we easily come up with exactly 10 pairs:
{1,2}, {1,3}, {1,4}, {1,5}, {2,3}, {2,4}, {2,5}, {3,4}, {3,5} and {4,5}, and so we
would state that $\binom{5}{2} = 10$.

A similar computation yields $\binom{5}{3} = 10$. Is this a coincidence? No. Think about it this way.
Anytime you choose 2 balls to put in the bucket, you are automatically choosing 3 not to
be placed so. Hence, the number of ways of choosing 3 out of 5 is the same as the number of
ways of choosing 2 out of 5 (compare pictures).

By similar reasoning, $\binom{5}{1} = \binom{5}{4} = 5$, and finally $\binom{5}{0} = \binom{5}{5} = 1$.

Of course, if k > 5, $\binom{5}{k} = 0$ since there is no way to select more than 5 balls from the
collection. Finally, since every collection of balls is accounted for, we must have:
$$\binom{5}{0} + \binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5} = 2^5 = 32,$$
since that is the total number of subcollections. In fact, 1 + 5 + 10 + 10 + 5 + 1 = 32.

The same observations from the example can be generalized to arbitrary numbers:

1. $\binom{n}{k} = 0$ whenever k > n since there is no way to choose more objects than
what is available.
2. $\binom{n}{k} = \binom{n}{n-k}$ whenever 0 ≤ k ≤ n since to choose k friends to come to the
party is tantamount to choosing n − k not to come.
3. $\binom{n}{0} = \binom{n}{n} = 1$ since there is only one way to give a party where either nobody
comes, or everybody comes.
4. $\binom{n}{0} + \binom{n}{1} + \binom{n}{2} + \cdots + \binom{n}{n} = 2^n$ since we are partitioning the subsets of an n-set by
their size.
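
The four observations can be spot-checked numerically. This is only a sketch for one arbitrary value of n, using Python's built-in `math.comb` rather than anything from the text; it illustrates the statements and does not prove them.

```python
from math import comb

n = 7  # arbitrary choice; any nonnegative integer could be used

# 1. Choosing more objects than are available is impossible.
assert comb(n, n + 1) == 0
# 2. Choosing k to come is the same as choosing n - k not to come.
assert all(comb(n, k) == comb(n, n - k) for k in range(n + 1))
# 3. Exactly one empty party and exactly one full party.
assert comb(n, 0) == comb(n, n) == 1
# 4. Subsets grouped by size account for all 2**n subsets.
row_sum = sum(comb(n, k) for k in range(n + 1))
assert row_sum == 2 ** n
```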

More importantly, there is a very nice recursion that they satisfy. This recursion is due to
Pascal, the very bright yet never fully developed French mathematician of the 17th
century.

Theorem (Pascal's Recursion). Let 1 ≤ k ≤ n, then
$$\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1}.$$
Proof. It is very simple indeed. Among your n + 1 friends, from which you have to
choose k for the party, there is a special one: Otto, the Brute. Partition your parties into
two types, the ones with Otto and the ones without Otto. How many parties with k
people do you have if Otto is coming? Since Otto is one of the guests you have yet to
choose k − 1 friends out of your remaining n friends, hence the answer is $\binom{n}{k-1}$. On the
other hand, if Otto is not coming, you have to choose all k guests out of the remaining n
friends, hence the answer is $\binom{n}{k}$. By simple counting then, since we partitioned the set of
parties into two pieces and we have counted each of the pieces, we are done. ∎

With this recursion, together with the conditions before the theorem, we can now build
the table of binomial coefficients. Although this table has been known to many people
and many cultures from way before the time of Pascal, it is known, in the Western world
at least, as Pascal's triangle. We are going to let n index the rows of our array while k
will index the columns. We will start with n = k = 0, and grow from there. By the
conditions before the theorem we know our array looks like

n\k    0    1    2    3    4    5    6    7    8    9   10
  0    1
  1    1    1
  2    1         1
  3    1              1
  4    1                   1
  5    1                        1
  6    1                             1
  7    1                                  1
  8    1                                       1
  9    1                                            1
 10    1                                                 1

with zeros above the main diagonal.

But now with the recursion we can fill in the rest of the array. Namely to fill a new
row, one adds the position just above it to the one above and to the left:

n\k    0    1    2    3    4    5    6    7    8    9   10
  0    1
  1    1    1
  2    1    2    1
  3    1    3    3    1
  4    1    4    6    4    1
  5    1    5   10   10    5    1
  6    1    6   15   20   15    6    1
  7    1    7   21   35   35   21    7    1
  8    1    8   28   56   70   56   28    8    1
  9    1    9   36   84  126  126   84   36    9    1
 10    1   10   45  120  210  252  210  120   45   10    1

Pascal’s Triangle
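
The filling procedure just described can be sketched in a few lines of Python. The function name `pascal_rows` is an illustrative choice, and the code assumes only the edge conditions and Pascal's recursion stated above.

```python
def pascal_rows(n_max):
    """Return rows 0..n_max of Pascal's triangle using only additions."""
    rows = [[1]]  # row n = 0
    for n in range(1, n_max + 1):
        prev = rows[-1]
        # Interior entries: the entry just above plus the one above and to the left.
        middle = [prev[k] + prev[k - 1] for k in range(1, n)]
        rows.append([1] + middle + [1])
    return rows

triangle = pascal_rows(10)
for row in triangle:
    print(row)
```

No multiplication or factorial is needed: every interior entry comes from two entries of the previous row.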
Observe the example of the recursion in the table. Besides the observations we made
before the theorem, there are many nice features in Pascal’s triangle. One of the
important ones is that the coefficients increase in each row up to the middle, and then,
because of the symmetry, they decrease. For example, the last row in our table went 1-10-
45-120-210-252 (which is the exact middle).

As we saw before, the rows add up to powers of 2. But what about the alternating row
sums (that is, take alternating signs)? If one experiments a bit, it is not too hard to believe
that the alternating sums are always 0 for every row after the first (for example, the last
row in our table gives 1 − 10 + 45 − 120 + 210 − 252 + 210 − 120 + 45 − 10 + 1 = 0).

But as with every recursion, sometimes a closed expression is preferable. This formula
when used wisely is computationally superior to the recursion, but the key word is
wisely. We have seen the idea behind the closed expression in the previous section when
we looked at the number of committees consisting of 3 people out of a pool of 5
candidates (or equivalently, 3-subsets of a 5-set). We compared that number with the
number of executive boards with President, Secretary and Treasurer out of the same 5
candidates. We are going to try to contrast the committees with the executive boards.
Remember that what is crucial is that the three members of a committee are
indistinguishable from each other while the board members are all differentiable—
pay attention to this distinction, it is terribly important, and later on, it will be very
important. We have counted the number of executive boards: 60 = 5 × 4 × 3 where the 5
is the number of choices for our President, 4, the number of choices for our Secretary and
3, the ones for the Treasurer.

Suppose we don’t know the number of committees, yet we think of these committees as
indexing the rows of a matrix, while the boards index the columns of the same matrix, so
the matrix is m × 60 where m is the number we are looking for (we know it is actually
$\binom{5}{3} = 10$, but suppose for a second that we didn't know this). We are going to fill the
matrix with 0’s and 1’s. We put a 1 in a position if the committee corresponding to that
row has the same elements as the board that corresponds to that column, otherwise we put
a 0 in that position. For example, suppose that some row corresponds to the committee
{A,C,D}. Then in the column corresponding to the board DAC (D is President, A is
Secretary and C is Treasurer) we would put a 1, while in the column corresponding to the
board BAC we would not, and instead we would put a 0. One fundamental (and obvious)
fact of life is that if one has a (0,1) matrix, then the number of 1's in it is independent of
whether we counted them by rows or by columns. Let’s first count the ones in our matrix
by columns: clearly every column has only one 1 in it, the one in the row corresponding
to the committee made up of the members of the board corresponding to that column.
Since there are 60 columns in this matrix, there is a total of 60 1's in it. Now let's count
them by rows. Take a row, say the one corresponding to {A, C, D}. How many 1’s are
there in its row? How many boards can be made from the members of this committee?
How many ways can we order the set? 3!=6. Hence every row has 6 1’s and we can
conclude that there are 10 rows since the number of rows times 6 is 60. Isn’t this neat?
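
The double count can be carried out literally. The sketch below builds the (0,1) matrix, assuming the candidates are labeled A through E, and counts its 1's both ways; all names are illustrative.

```python
from itertools import combinations, permutations

people = "ABCDE"                              # assumed labels
committees = list(combinations(people, 3))    # one row per committee
boards = list(permutations(people, 3))        # one column per board

# Entry is 1 when the board uses exactly the members of the committee.
matrix = [[1 if set(b) == set(c) else 0 for b in boards] for c in committees]

# Count the 1's column by column, then row by row.
ones_by_column = sum(sum(row[j] for row in matrix) for j in range(len(boards)))
ones_by_row = sum(sum(row) for row in matrix)
ones_per_row = {sum(row) for row in matrix}   # set of distinct row totals

print(ones_by_column, ones_by_row, ones_per_row)
```

Here the number of rows is computed directly, whereas in the argument above it is the unknown m recovered from 60 = 6m.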

Extend that now. Suppose we have a pool of n candidates for a race in which we are
going to keep track of the first k places. Then how many outcomes are there to the race? By the
second counting principle, since we have n choices for our first stage or decision, n − 1
choices for the second, and so on, we have n × (n − 1) × (n − 2) × ⋯ × (n − k + 1) total
choices. Where did that last factor come from? It should be clear that we are going to
have k factors in our result since there are k stages to our tree development. The first
factor is n − 0, so the last factor should be n − (k − 1) = n − k + 1. This number is easy to
remember if we rewrite it in the form $\frac{n!}{(n-k)!}$. Suppose we take an arbitrary k-subset of an
n-set, that particular subset gives rise to how many race finishings? How many ways can
we order the k-subset? We know it is k! ways. (Think of the matrix.) Thus each k-subset
of our n-set, and we know there are, by choice, $\binom{n}{k}$ of them, gives rise to k! of the race
finishings, and since there are $\frac{n!}{(n-k)!}$ race finishings, $\binom{n}{k} \times k! = \frac{n!}{(n-k)!}$, or equivalently

Theorem (Newton's Expression). Let 0 ≤ k ≤ n, then
$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

In order for this expression to be correct, it is necessary to define 0! to be 1, and we have
argued that before. Thus, for example,
$$\binom{100}{5} = \frac{100!}{5!\,95!} = \frac{100 \cdot 99 \cdot 98 \cdot 97 \cdot 96}{120} = 75{,}287{,}520.$$
This example points out the folly of pretending to use the formula with total abandon.
First, there is no doubt the binomial coefficients get large: for example, $\binom{100}{50}$ has 30
digits; second, that in order to use the formula wisely, one has to be careful with the
computations.

One approach is to use a different recursion than Pascal’s. Specifically, we attempt to
build a row all by itself. Let us say we wanted $\binom{100}{94}$. We are going to try to build the
100th row of Pascal’s Triangle. By the symmetry, we can exchange it for the computation
of $\binom{100}{6}$; in other words we can always assume k ≤ n/2. We know the row starts with 1. If
we look at the ratio of two consecutive terms in that row, $\binom{100}{k+1}$ divided by $\binom{100}{k}$, we get
$$\frac{\dfrac{100!}{(k+1)!\,(100-k-1)!}}{\dfrac{100!}{k!\,(100-k)!}} \quad\text{which equals}\quad \frac{100-k}{k+1}.$$
This means that as we move on the row from k to k + 1, we are multiplying by 100 − k and
dividing by k + 1, which makes moving on a row much easier:

1, 100, 4,950, 161,700, 3,921,225, 75,287,520, 1,192,052,400

and we know then that $\binom{100}{6}$ equals 1,192,052,400.

Example 1. How many ways can we toss a coin 5 times so that exactly three heads
appear? From the 32 ways of tossing the coin 5 times: 2 × 2 × 2 × 2 × 2, three spots have to
be designated for the heads and the remaining 2 for tails, so there are $\binom{5}{3} = 10$ ways. For
example, the subset {2,3,5} corresponds to THHTH.
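
As a check, one can list all the toss sequences and keep those with exactly three heads. This is a brute-force sketch with illustrative variable names.

```python
from itertools import product

# All sequences of 5 tosses, written as strings of H and T.
tosses = ["".join(t) for t in product("HT", repeat=5)]
three_heads = [t for t in tosses if t.count("H") == 3]

print(len(tosses), len(three_heads))
```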

By now we have developed our counting tools, we have ors (or unions), and we have
ands (or stages) and then we have choosings. The power comes from knowing which
to use when. We will come back later to this subject.

We finish the chapter with examples that may combine techniques.

Example 2. A committee of 7 people, 4 males and 3 females, is to be chosen from 10
male candidates and 8 female candidates. In how many ways can this be done? The
answer is simple: $\binom{10}{4}\binom{8}{3}$ because we must choose the males in the committee and we
must choose the females in the committee. And so our total answer is 210 × 56 = 11,760.
What would the answer be if instead all that is required of the committee (of still 7
people) is that at most 3 females serve? That means that either 3, or 2, or 1 or no females
serve in the committee, and so we have the answer to be:
$$\binom{10}{4}\binom{8}{3} + \binom{10}{5}\binom{8}{2} + \binom{10}{6}\binom{8}{1} + \binom{10}{7}\binom{8}{0} = 210 \times 56 + 252 \times 28 + 210 \times 8 + 120 \times 1 = 20{,}616.$$
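
The arithmetic of this example can be reproduced with `math.comb`. The sketch below simply mirrors the two cases: an AND of two choosings, then an OR over the number of females.

```python
from math import comb

# Exactly 4 males AND 3 females: multiply the two choosings.
exactly_three_females = comb(10, 4) * comb(8, 3)

# At most 3 females: f = 0, 1, 2 or 3 females, with 7 - f males (OR rule: add).
at_most_three_females = sum(comb(10, 7 - f) * comb(8, f) for f in range(4))

print(exactly_three_females, at_most_three_females)
```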

Example 3. As we all (should) know, the U.S. Congress consists of two chambers: the
Senate with 100 members (2 for each state), and the House of Representatives with 435
members. In how many ways can Congress select a Committee to meet with the President if:
1. the Committee is to have only one member from Congress: 535 = 100 + 435.
2. the Committee is to have one Senator and one Representative: 43,500 = 100 × 435.
3. the Committee is to have three members, at least one from each chamber:
$$\binom{435}{2}\binom{100}{1} + \binom{435}{1}\binom{100}{2} = 94{,}395 \times 100 + 435 \times 4{,}950 = 11{,}592{,}750.$$
A common error in working this example is as follows: choose a Representative, 435
ways of doing that, choose a Senator, 100 ways of doing that, and then choose one of the
remaining members of Congress, 533 ways of doing that. But there is double counting
since we do not know which of the two Senators or Representatives was chosen first, so
the correct answer is $\frac{435 \times 100 \times 533}{2}$.
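
A short sketch confirms that the two routes to the third answer agree: the direct count by cases, and the naive count corrected for double counting. Variable names are illustrative.

```python
from math import comb

senators, representatives = 100, 435

one_member = senators + representatives        # OR rule
one_each = senators * representatives          # AND rule

# Three members, at least one from each chamber: 2 + 1 or 1 + 2.
three_mixed = (comb(representatives, 2) * comb(senators, 1)
               + comb(representatives, 1) * comb(senators, 2))

# Naive count (Representative, Senator, then anyone left), halved for double counting.
corrected_naive = representatives * senators * (senators + representatives - 2) // 2

print(one_member, one_each, three_mixed, corrected_naive)
```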

Example 4. An anagram of a word is a rearrangement of its letters. Thus, my name
MENA has 24 anagrams since there are 4! arrangements of 4 objects. Not all are
meaningful words in English, but that is irrelevant. BOB on the other hand does not have
3! = 6 anagrams, only 3: OBB, BOB and BBO. The reason being that when we compute the
6 ways we are counting the B's as different but they are not. Similarly, my first name
ROBERT does not have 6! anagrams because of double counting: there are two R's that
are being counted as different in the 6!, but do not give different words. Let's do it
another way.

Think of the six blanks that we are going to fill in with the letters {R, O, B, E, R, T}:
_ _ _ _ _ _. Of those 6 blanks, 2 have to go for R's. Choose those 2, for which we have
$\binom{6}{2} = 15$ ways of doing. Then choose the spot for the O: any of 4 ways. Then we have 3
spots left to put the B, 2 for the E and 1 for the T. Thus the number of anagrams is:
15 × 4 × 3 × 2 × 1 = 360.

You can think of obtaining the result in terms of activities: you have six blanks: 2 of them
are going for R, 1 for O, 1 for B, 1 for E and 1 for T.

By the same reckoning, MISSISSIPPI has $\binom{11}{4}\binom{7}{4}\binom{3}{2}\binom{1}{1}$ anagrams since we have to
choose 4 places to put the I’s, and then 4 to place the S’s, and 2 for the P’s and the
remaining place goes to the M. But something interesting occurs in the computation of the
number $\binom{11}{4}\binom{7}{4}\binom{3}{2}\binom{1}{1}$. By Newton’s expression it equals:
$$\binom{11}{4}\binom{7}{4}\binom{3}{2}\binom{1}{1} = \frac{11!}{4!\,7!} \times \frac{7!}{4!\,3!} \times \frac{3!}{2!\,1!} \times \frac{1!}{1!} = \frac{11!}{4!\,4!\,2!\,1!}$$
And this number is thus expressed as $\binom{11}{4,4,2,1} = 34{,}650$, which represents the number of
anagrams Ol’ Man River has. The number is known as a multinomial coefficient. See
last example in this section for further elucidation.
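
The cancellation above suggests a direct recipe: divide n! by the factorial of each letter's multiplicity. Here is a minimal sketch; the function name `count_anagrams` is hypothetical.

```python
from collections import Counter
from math import factorial

def count_anagrams(word):
    """Distinct rearrangements: n! divided by the factorial of each letter count."""
    total = factorial(len(word))
    for repeats in Counter(word).values():
        total //= factorial(repeats)   # exact at every step
    return total

for word in ("MENA", "BOB", "ROBERT", "MISSISSIPPI"):
    print(word, count_anagrams(word))
```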

Of course, one can have variations on the questions. How many anagrams of ROBERT are
there where the vowels are together? Put the vowels together: there are 2 ways of doing
that, and then think of them as one letter: V and we have anagrams of RBRTV, of which
there are 60, so our answer is 120.

Another variation: how many anagrams are there of ROBERT where the vowels come in
alphabetical order (not necessarily together)? There are 360 anagrams, and they come in
pairs where the vowels occur in the same locations; only one of those two has the vowels
in order, so the answer is 360/2 = 180.
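
Both variations are small enough to verify by listing every anagram of ROBERT. This brute-force sketch uses a set to discard the duplicates caused by the two R's.

```python
from itertools import permutations

# A set removes the duplicates that arise from swapping the two R's.
anagrams = {"".join(p) for p in permutations("ROBERT")}

# Vowels together: O and E are adjacent, in either order.
vowels_together = [w for w in anagrams if "OE" in w or "EO" in w]
# Vowels in alphabetical order: E appears somewhere before O.
vowels_in_order = [w for w in anagrams if w.index("E") < w.index("O")]

print(len(anagrams), len(vowels_together), len(vowels_in_order))
```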

We end the section with an explanation of the binomial coefficient appellation.

Example 5. The Binomial Theorem. Suppose you are asked to expand $(x + y)^6$.
Everybody knows what this stands for. Namely,
(x + y)(x + y)(x + y)(x + y)(x + y)(x + y),
a total of 6 times.

If we proceed without any thought, we can just multiply this out. Not only boring, but
very inefficient. Instead let’s think about the process of the multiplication and the
powerful distributive law. What we are doing then is taking one of the two
summands from each one of the factors in order to get one of our terms. So each of
our terms is of the form $x^i y^j$ where i and j are nonnegative integers and i + j = 6. How
many terms do we have? In order to build a term, we have six stages or decisions (one for
each of the factors) and two options for each of the decisions, so we will have $2^6 = 64$
terms. One term is, clearly, $x^6$ which comes up by taking an x from each of the factors,
and that is the only way to obtain $x^6$. But another term is xyxyxx (we are indicating by
the position what each factor contributed) which equals $x^4 y^2$. But when we collect terms
to simplify, $x^4 y^2$ occurred several times; we have just seen one of those occurrences.
Another is yxxxxy. In total, how many times does $x^4 y^2$ occur? We have to decide which
of the six factors will contribute y's, the others will contribute x's. So we have to choose 2
out of 6, so there are $\binom{6}{2} = 15$ terms that equal $x^4 y^2$, so its coefficient is 15. This is why
these numbers are called binomial coefficients. If we just extend our reasoning to all the
terms, we get
$$(x + y)^6 = x^6 + 6x^5y + 15x^4y^2 + 20x^3y^3 + 15x^2y^4 + 6xy^5 + y^6.$$

Of course, since this is an algebraic identity, it is valid for arbitrary x and y. Thus, if what
we wanted was $(2a - b)^6$, then all we would have to do is substitute x by 2a and y by
−b (watch that minus sign), in order to obtain
$$(2a - b)^6 = 64a^6 - 192a^5b + 240a^4b^2 - 160a^3b^3 + 60a^2b^4 - 12ab^5 + b^6.$$

Or we could have let x = y = 1 to get 64 = 1 + 6 + 15 + 20 + 15 + 6 + 1, an identity we had seen
before since we are adding a row of Pascal's triangle; the number 64 is the same as the
number of terms, of course. The reasoning above clearly generalizes to any exponent (as
long as it is a positive integer), so we can state

The Binomial Theorem.
$$(x + y)^n = \sum_{i=0}^{n} \binom{n}{i} x^{n-i} y^i$$

The sigma notation for sums can be a bit intimidating, but by just writing the terms
one-by-one all fears are conquered:
$$(x + y)^n = x^n + \binom{n}{1} x^{n-1} y + \binom{n}{2} x^{n-2} y^2 + \binom{n}{3} x^{n-3} y^3 + \cdots.$$
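
The theorem translates directly into a coefficient generator. The sketch below, with the hypothetical name `binomial_expansion`, returns the coefficients of $(ax + by)^n$, which covers both the plain expansion and the substitution of 2a for x and −b for y.

```python
from math import comb

def binomial_expansion(a, b, n):
    """Coefficients of (a*x + b*y)**n, listed from the x**n term down to y**n."""
    return [comb(n, i) * a ** (n - i) * b ** i for i in range(n + 1)]

print(binomial_expansion(1, 1, 6))    # coefficients of (x + y)^6
print(binomial_expansion(2, -1, 6))   # coefficients of (2a - b)^6
```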

Suppose we now vary our original question, and ask you to compute $(x + y + z)^6$ instead.
So we would be looking at
(x + y + z)(x + y + z)(x + y + z)(x + y + z)(x + y + z)(x + y + z).
Here the expansion would be real tedious. But the reasoning we did in our previous
considerations is still valid. Namely as we expand this product, in order to build a term
we get one summand out of each of the factors, thus our terms are of the form $x^i y^j z^k$
where i, j and k are nonnegative integers and i + j + k = 6. For example, $x^2 y^3 z$ and $x^4 z^2$
are both terms. How many terms will we have? 6 stages, 3 options for each decision give
us a total of $3^6 = 729$ terms!

Let’s reason it in general: take $x^i y^j z^k$ where i + j + k = 6. Then i factors have to
contribute x's and we have $\binom{6}{i}$ ways of doing that choosing. Then from the remaining
factors we have to decide for the y's: $\binom{6-i}{j}$ ways, and finally, for the z's: $\binom{6-i-j}{k} = 1$
since i + j + k = 6. Hence the coefficient of $x^i y^j z^k$ is
$$\binom{6}{i} \times \binom{6-i}{j} \times \binom{6-i-j}{k} = \frac{6!}{i!\,(6-i)!} \times \frac{(6-i)!}{j!\,(6-i-j)!} \times \frac{(6-i-j)!}{k!\,0!} = \frac{6!}{i!\,j!\,k!}$$
by canceling. This is a very satisfying expression:
$$\binom{6}{i,j,k} = \frac{6!}{i!\,j!\,k!}.$$

These coefficients are then called multinomial coefficients, as we saw before. What
we are doing in the multinomial coefficients is choosing specific numbers of
friends for each of several, different activities.

We finish computing $(x + y + z)^6$:
$$\begin{aligned}
(x + y + z)^6 ={}& x^6 + y^6 + z^6 + 6x^5y + 6x^5z + 6xy^5 + 6xz^5 + 6y^5z + 6yz^5 \\
&+ 15x^4y^2 + 15x^4z^2 + 15x^2y^4 + 15x^2z^4 + 15y^4z^2 + 15y^2z^4 \\
&+ 30x^4yz + 30xy^4z + 30xyz^4 + 20x^3y^3 + 20x^3z^3 + 20y^3z^3 \\
&+ 60x^3y^2z + 60x^3yz^2 + 60x^2y^3z + 60x^2yz^3 + 60xy^3z^2 + 60xy^2z^3 \\
&+ 90x^2y^2z^2.
\end{aligned}$$

With exactly 3 + 36 + 90 + 90 + 60 + 360 + 90 = 729 terms as expected.
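
The whole expansion can be regenerated by doing exactly what the distributive law does: pick one summand from each of the six factors and collect like terms. This is a brute-force sketch with illustrative names.

```python
from collections import Counter
from itertools import product

# Each of the 6 factors contributes one of x, y, z: 3**6 raw terms.
raw_terms = list(product("xyz", repeat=6))

# Collect like terms, keyed by the exponents (i, j, k) of x, y, z.
coefficients = Counter((t.count("x"), t.count("y"), t.count("z")) for t in raw_terms)

for exponents, coefficient in sorted(coefficients.items(), reverse=True):
    print(exponents, coefficient)
```

Each count equals 6!/(i! j! k!), so the table printed by the loop should match the displayed expansion term by term.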

We will do many more applications of our counting ability in the applied area of
probability in the next chapter.

• Continuing into the Continuum

As we finished the last section, the idea of counting was to be used in the essential notion
of probability. Indeed, the first stated realization that mathematics, specifically
counting, had a role to play in games of chance concerned the rolling of one
die. There we have six equal ways to roll a die, so the probability one rolls a
specific value is 1/6. Not much depth in this analysis, yet it was not stated until
the 16th century, despite people gambling for at least 10,000 years.

Although the largest regular polyhedron one could make has only 20 sides
(the icosahedron), one can always make a dreidel-like object with
arbitrarily many sides to simulate the random choosing of a number from
1 to the number of sides. If that is the case, then the probability of
choosing a specific number will of course be 1/n where n is the number of
sides. Thus, we could conceive of a dreidel with 1,000 sides, and so when
we roll (or spin) it, the probability of landing on any one side is 1/1000. And
then we could ask what is the probability we spin a number which is at
most 100, and the obvious answer is 100/1000 = 1/10.

From there it is a small jump to the idea of a random number, a randomizer as it is
known. In the next section we introduce the notion of random variable. A randomizer X
is a specific example of a random variable. It is a function that chooses a number between
0 and 1 (inclusive) equally, without any prejudice. We can see that when we had n equal
numbers to choose from, the probability of choosing one of them was 1/n, so now that we
have an indefinite number of choices to choose from, the probability of choosing any
single one of them should be 0. Nevertheless, we can still make sense of other questions
such as What is the probability of choosing a number that is at most 1/3? The answer
should be 1/3. Or more interestingly, What is the probability of picking a number
between .4 and .6? Here the answer should be 1/5. The way to arrive at these is by using
the length of the interval that we want to land in compared with the length of the total
interval that we are sure to land on.
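
The length comparison can be illustrated with a simulated randomizer. The sketch below assumes Python's `random.Random.random()` as a stand-in for the randomizer X (it returns numbers in [0, 1), which differs from the closed interval only on a set of probability 0); the function names and the fixed seed are illustrative choices.

```python
import random

def interval_probability(a, b):
    """Exact probability that a uniform number on [0, 1] lands in [a, b]."""
    return (b - a) / 1.0   # length of the target over length of the whole interval

def estimate(a, b, trials=100_000, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = sum(a <= rng.random() <= b for _ in range(trials))
    return hits / trials

print(interval_probability(0.4, 0.6), estimate(0.4, 0.6))
print(interval_probability(0.0, 1 / 3), estimate(0.0, 1 / 3))
```

The estimates should hover near 1/5 and 1/3, but being random they only approximate the exact ratios of lengths.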

Hence the geometric measure of entities becomes relevant. As another
example, suppose an archer shoots very thin arrows at a round target. She
is sure to hit the target, but what is the probability to hit within half the
radius from the center? Since the area of the whole
circle is $\pi r^2$ while that of the region she desires to
land on is 1/4 of that, her probability is 1/4.

Similarly, we could ask for the likelihood of picking a point in the
top half of a cone (including the interior) given that a point from it
is randomly chosen. The answer being 1/8 since the volume of the
large cone is $\frac{1}{3}\pi r^2 h$ while the volume of the top half is only $\frac{1}{3}\pi \left(\frac{r}{2}\right)^2 \frac{h}{2}$.
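
Both geometric probabilities are ratios of measures, which a few lines of Python can confirm. The radius and height below are arbitrary assumed values; the ratios do not depend on them.

```python
from math import pi

def disk_area(r):
    return pi * r ** 2

def cone_volume(r, h):
    return pi * r ** 2 * h / 3

r, h = 3.0, 5.0  # arbitrary dimensions, assumed for illustration

# Archer: disk of half the radius compared with the whole target.
archer = disk_area(r / 2) / disk_area(r)
# Cone: the top half is a similar cone with half the radius and half the height.
cone_top = cone_volume(r / 2, h / 2) / cone_volume(r, h)

print(archer, cone_top)
```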

The previous paragraphs are being used to motivate the need to review some basic
geometric concepts regarding measurement, such as length, angle, distance, area and
volume.

Of course, length is very much associated with counting. Once we have a unit of linear
measure, we count how many times it fits around the room, or whatever we are
attempting to take the length of, and we have arrived at an estimate. Of course, fractions
and eventually real numbers occur naturally in this context.

One can only use common sense to speculate what area is the
oldest in mankind’s memory—but the winner, one can
conjecture, must be the rectangle: the base × height formula
for the area could easily be deduced via multiplication from
brick laying or tiling examples. Again, note the intimate connection to counting.

The next area, after the rectangle, to be computed was, probably, that of
a parallelogram, which is also
base × height.
That this was done early follows from the easy rearrangement of any
parallelogram into a rectangle.

And then the triangle could not be far behind since two of them make a
parallelogram:
Area = 1/2 × base × height

Above we discussed the area of a circle—but that computation is certainly much younger
than the others, and it probably stems from the fact that the area of a circle should be half
the circumference times the radius, as the pictures exemplify.

Similarly, the first volume to be achieved was, most probably, that of a
rectangular parallelepiped (a rectangular box) with the well-known
length × width × height expression for its volume.

Much more recent than all the previous and much more relevant to our
course are the notions from calculus for the calculation of areas and volumes developed by
the great Leibniz and Newton. Namely the crucial idea of integration.

Integration, which is the continuous version of summation, uses Riemann sums to
execute control from the discrete to the continuous, from sums to integrals. Eventually,
one can simply state that:

• The integral of length is area.


• The integral of area is volume.

We will be using these extensively throughout the course.
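
The two slogans can be tried out numerically with a Riemann sum. The sketch below is one possible illustration, not a method taken from the text: it uses a midpoint sum (the function name `riemann_sum` and the number of slices are assumptions) to add up the chord lengths of a disk and the cross-sectional areas of a ball.

```python
from math import pi, sqrt

def riemann_sum(f, a, b, n=20_000):
    """Midpoint Riemann sum of f over [a, b] with n equal slices."""
    width = (b - a) / n
    return sum(f(a + (i + 0.5) * width) for i in range(n)) * width

r = 1.0
# Integral of length is area: the chord of the disk at position x has length 2*sqrt(r^2 - x^2).
area = riemann_sum(lambda x: 2 * sqrt(r * r - x * x), -r, r)
# Integral of area is volume: the cross-section of the ball at x is a disk of area pi*(r^2 - x^2).
volume = riemann_sum(lambda x: pi * (r * r - x * x), -r, r)

print(area, pi * r ** 2)
print(volume, 4 * pi * r ** 3 / 3)
```

The sums only approximate the integrals; taking more slices brings them closer to the familiar formulas for the area of a circle and the volume of a sphere.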



Chapter 1
Odds on Favorites
Œ Probability—the Basic Rules

One of the original motivations behind counting was the beginning of the taming of uncertainty
that occurred in the 16th and early part of the 17th century. Why it took so long to even develop
to the extent it did in those early centuries is indeed interesting, but not for us to speculate
(gambling is very old indeed.) What is relevant to us is that by 1600 it was reasonably clear
in many people's minds what some aspects of probability were about. But let's progress by
example, a historical one. Galileo himself was posed this question, and as usual he analyzed it
correctly.

Example 1. Suppose we are going to play the following game (those early years were mostly
concerned with gambling questions; as mentioned above, gambling is old, definitely
thousands of years old):
we roll 3 dice, if a 9 shows up, I pay you $1, if a 10 shows up, you
pay me $1, if anything else shows up, we roll again.
Naturally you are mistrustful since I am proposing the game, but how do you know I am not the
idiot by offering it to you, or better, that it may be a fair game and you are just missing the
opportunity to have fun. Of course, if you are just going to play one hand, no calculation is really
necessary and you are just going to make your decision based on your mood, who makes the
offer, etc. But suppose you intend to do this for three hours every Saturday for the next three
years (we live in the age of individual preference). At first thought it seems like a
reasonable game: one can obtain a 9 by rolling:

1-2-6, 1-3-5, 1-4-4, 2-2-5, 2-3-4, 3-3-3

while a 10 can be obtained by rolling:

1-3-6, 1-4-5, 2-2-6, 2-3-5, 2-4-4, 3-3-4

So the game seems fair as well as reasonable: both numbers can be rolled in six
different ways, as the two lists of possibilities indicate. But what is the logic behind this
attempt? It has something to do with the number of ways of doing something and if
one thing has more ways of occurring than another, then it is more likely to occur.
After all nobody would play the previous game if the competition were between rolling a 3 and
rolling a 10 since intuitively one feels that a 3 is much rarer than a 10.

Although there is common sense behind this, it is not quite correct. It needs to be improved
upon. The first basic principle that we are going to use for our probability calculations is:
suppose an activity or experiment is to be performed, and we have
equally feasible outcomes, then the probability for a given event to
occur is the number of outcomes that give the desired event divided
by the total number of outcomes.1

But extra emphasis needs to be made on the premise of the principle: one must first reduce the
outcomes of the activity to equally feasible outcomes2. Then you can start looking at the
probability of the event that you are interested in. Going back to the game in question. What is
the activity in this example? Rolling 3 dice. What are the outcomes? It seems acceptable to say
that the outcomes are, in addition to the two lists above, the corresponding lists of rolls
for each of the other sums: 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17 and 18,
and thus there would be a total of 56 outcomes, so the probability of a 9 would be 6/56, and a
10 would have the same probability, so the game is seemingly fair.

1
This is the first enunciated principle in the theory of probability, and the simplest one. As simple as it is, it
was not stated clearly until the 16th century by the inimitable Cardano, great scholar and scoundrel.
2
What are equally feasible outcomes can be in itself a polemic. How do you know a coin is fair? But we will
be naive about the subtleties of statistical analysis, and only insist that, from what we know, we can
honestly claim that the outcomes we are taking are equally feasible.

However, if we apply the same reasoning, then the probability of a 3 is 1/56, and a 4 has the
same probability. So if we keep rolling the three dice for a long time, the number of 3's
occurring should roughly be the same as the number of 4's. It does not take much
experimentation to perhaps start doubting our premise, and maybe we should question why
did we label those outcomes as equally feasible? So let's rethink a bit. Is a 1-1-1
as equally feasible as a 1-1-2?

Suppose we had a yellow die, a white die and a blue die, and write each roll as
(yellow, white, blue). Then, to roll a 3, we would have to have (1,1,1), we need to show
a 1 in each die, but to get a 4, we can do it by: (2,1,1), or (1,2,1), or (1,1,2),
there are three ways since any of the three dice could show a 2 and the other two need
to show a 1. It seems like we have some more choices in the latter situation. Three
times as many, actually.

A way out of the quagmire is to take for our outcomes the 216 different ways there are to roll
three dice if one of them is yellow, other one white and the third one blue. We get the 216 from:
6 × 6 × 6. Nobody can argue on the equal feasibility of these 216 outcomes. So we start now
from there. How many ways can we roll a 3? As before, only one way, so the probability is
1/216, not 1/56. But how many ways can we roll a 4? Three ways as we saw above, so the
probability of a 4 is 3/216. So 4's should occur about three times more often than 3's.

Let's go back to the 9 and the 10 of our game. Of the 216 ways, how many ways can we roll
a 1-2-6? Easily, we have three decisions, which die shows the 1, which the 2
and which the 6: 3 × 2 × 1 = 6 ways:

(1,2,6), (1,6,2), (2,1,6), (2,6,1), (6,1,2), (6,2,1)

By identical reasoning, there are 6 ways to roll a 1-3-5. But what about a
1-4-4? Here our tree of options has only two stages, since once we have decided
which die shows the 1, the other two dice must show a 4 and we have nothing left to
decide (or equivalently, we only have one option, once we have placed the 1), so there are
only 3 ways to roll a 1-4-4:

(1,4,4), (4,1,4), (4,4,1)

Identically, there are 3 ways to roll a 2-2-5, and 6 ways to roll a 2-3-4.
Finally, there is only 1 way to roll a 3-3-3 (all 3 dice have to show
3's). So how many ways can we roll a 9? Totaling our options we
obtain: 6 + 6 + 3 + 3 + 6 + 1 = 25 ways to roll a 9. Putting it in a table, together with the
similar calculations for 10:

Roll of 9   # of Ways     Roll of 10   # of Ways
1-2-6           6         1-3-6            6
1-3-5           6         2-2-6            3
1-4-4           3         1-4-5            6
2-2-5           3         2-3-5            6
2-3-4           6         2-4-4            3
3-3-3           1         3-3-4            3
Total          25         Total           27

Hence the probability of a 9 is 25/216 while the probability of a 10 is 27/216. So on the average,
after 216 rolls of the dice, you would have won 25 times, I would have won 27 times, and the
rest would have been draws. So your net outcome would be a loss of $2. Now whether you play
or not is your decision: after all you
could consider $2 cheap entertainment (millions of people go to Vegas).
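
The counts in this example can be confirmed by listing all rolls of three distinguishable dice. The sketch below also counts the unordered rolls, to show where the misleading total of 56 comes from; the variable names are illustrative.

```python
from itertools import combinations_with_replacement, product

# Equally feasible outcomes: ordered rolls (yellow, white, blue).
rolls = list(product(range(1, 7), repeat=3))
ways = {total: sum(1 for r in rolls if sum(r) == total) for total in range(3, 19)}

# The tempting but not equally feasible outcomes: unordered rolls.
unordered = list(combinations_with_replacement(range(1, 7), 3))
unordered_nines = [u for u in unordered if sum(u) == 9]
unordered_tens = [u for u in unordered if sum(u) == 10]

print(len(rolls), ways[9], ways[10])
print(len(unordered), len(unordered_nines), len(unordered_tens))
```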

One of the common errors made in the past by mathematicians (including some of the best like
Leibniz, D'Alembert and others) is the one of presuming equally feasible outcomes to an
experiment without further analysis. In this course we will not have the chance to get too subtle
into this subject, but remember to always be careful to set up the outcomes to the experiment
before you start asking about the event that you are interested in, and try to analyze your
outcomes so that they seem, as best as you can tell, equally feasible.

We can enhance the understanding of the previous example by introducing the fundamental
concept of a random variable.

When performing an activity that can lead to different numerical outputs one is
in the midst of a random variable X.

In the previous example the activity is rolling three dice, and what we are interested in is the sum of
the roll. Therefore the outcomes of our activity consist of the numbers 3 through 18, and these
are the values that our random variable can take. The (probability) distribution of the variable is
the set of probabilities associated with each of the possible outcomes, and one way to represent
the variable is by a simple table such as:

X    3      4      5      6      7      8      9      10     11     12     13     14     15     16     17     18
P    1/216  3/216  6/216  10/216 15/216 21/216 25/216 27/216 27/216 25/216 21/216 15/216 10/216 6/216  3/216  1/216
≈    .0046  .0138  .0277  .0463  .0694  .0972  .1157  .125   .125   .1157  .0972  .0694  .0463  .0277  .0138  .0046

Note that quite a bit of computation went into the table. The accompanying figure is a graphical
representation of the distribution.

[Figure: bar chart of the distribution of the sum of three dice; horizontal axis 3 through 18, vertical axis probability from 0.000 to 0.140]

We make two key observations about the distribution of a random variable:

① The probability of an event is always a number between 0 and 1, inclusive.

② The sum of all the possible probabilities is 1 since we are certain one of them will occur.

Note that having the distribution of the random variable allows us to answer a variety of
questions. For example, we could ask what is the probability of rolling at least a 15, which is
given by
P(X ≥ 15) = P(X = 15) + P(X = 16) + P(X = 17) + P(X = 18) = 20/216 ≈ 9.26%.
As with counting, the reason we are allowed to add the probabilities is that they are disjoint
events: at no time can two of them occur simultaneously.
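
As a small illustration of adding disjoint probabilities, here is a sketch that rebuilds the distribution of the three-dice sum with exact fractions and adds up the tail. The helper name `prob_at_least` is hypothetical, introduced only for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Distribution of the sum of three fair dice, as exact fractions.
counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))
dist = {x: Fraction(n, 216) for x, n in counts.items()}

def prob_at_least(k):
    """Add the probabilities of the disjoint events X = k, X = k + 1, ..."""
    return sum(p for x, p in dist.items() if x >= k)

tail = prob_at_least(15)
print(tail, float(tail))   # the fraction 20/216 in lowest terms, and its decimal value
```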

③ If A and B are disjoint events, then P(A or B) = P(A) + P(B).

Observe that if A, B and C are events any two of which are disjoint, then
P(A ∪ B ∪ C) = P(A ∪ B) + P(C) = P(A) + P(B) + P(C).

Example 2. Similarly to the roll of three dice, only easier, we can state that if X is the roll of
two dice, then its distribution is given by

X    2     3     4     5     6     7     8     9     10    11    12
P    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

[Figure: bar chart of the distribution of the sum of two dice; horizontal axis 2 through 12, vertical axis probability from 0.000 to 0.200]

In the game of CRAPS, one wins if one rolls a 7 or 11 to start with, so the probability of
winning on the first roll is 8/36 ≈ 22.22%. In the same game one loses to start with if one rolls
a 2, 3 or 12, which has probability 4/36 ≈ 11.11% of occurring, and that is the likelihood of
losing on the first roll.
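
The two first-roll probabilities can be checked by listing the 36 equally feasible rolls. This is only a sketch; the function name `prob` and the variable names are illustrative assumptions, and only the first roll of the game is modeled.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered rolls of two fair dice, reduced to their sums.
two_dice = [a + b for a, b in product(range(1, 7), repeat=2)]

def prob(event):
    """Probability that the sum of two dice lands in the given set of values."""
    hits = sum(1 for s in two_dice if s in event)
    return Fraction(hits, len(two_dice))

win_first = prob({7, 11})       # win on the first roll
lose_first = prob({2, 3, 12})   # lose on the first roll
print(win_first, lose_first)
```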

Example 3. Suppose that a happily married couple has two children. How likely is it that they
will have one of each sex? D'Alembert incorrectly analyzed this by saying there were three
outcomes to the experiment: Two Boys, Two Girls and One-Of-Each. So the
probability of One-Of-Each would be 1/3. Actually, if we assume that a boy being born is as likely as
a girl being born in any given birth³, then there are four equally feasible outcomes: BB, BG,
GB and GG. Of those, 2 give us children of both sexes, so the probability is 2/4 = 1/2. Again this
estimate conforms to reality much better. Here we could have considered Y to be the random
variable which is the number of boys among the children, and the distribution of Y is given by

Y    0     1     2
P    1/4   1/2   1/4
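
The four equally feasible outcomes can be listed mechanically. The sketch below assumes, as the example does, that a boy and a girl are equally likely at each birth; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# The four equally feasible outcomes: BB, BG, GB, GG.
families = list(product("BG", repeat=2))

one_of_each = [f for f in families if set(f) == {"B", "G"}]
p_one_of_each = Fraction(len(one_of_each), len(families))

# Distribution of Y, the number of boys among the two children.
boys = Counter(f.count("B") for f in families)
y_dist = {y: Fraction(n, len(families)) for y, n in sorted(boys.items())}

print(p_one_of_each)
print(y_dist)
```

D'Alembert's three outcomes correspond to the three values of Y, which are not equally feasible.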

Now we revisit the first day of class. Suppose you walk into a room with
people. How many people do there need to be in the room before you bet that
there are two people with the same birthday? Of course if there were a
thousand people, you would bet. It is a sure thing. But probably nobody would
bet against you. How about if there are only 100 people? 50?

³ As it happens, this is not quite correct. One realization came very early, as soon as statistical tables of
births were gathered in the 1660's: more boys are born than girls, approximately in an 18-to-17 ratio (girls are
more likely to survive, so not so many need to be produced). The other complication is that a given couple,
because of the chemistry, has a certain small tendency to repeat the sex of previous children.

Before we find the surprising answer, let's look at a simple principle from probability. The
size of a set plus the size of its complement is always the size of the universe (first counting
principle). Equivalently, in symbols, if A and B are disjoint sets with their union being all
possibilities, then |A| + |B| = |U|, where U is the universe of possibilities (all of them), and so
|A|/|U| = 1 − |B|/|U|, thus
④ The probability for a given event to occur equals 1 − probability that the event will not occur.

Example 4. Let A be the set of people in the room. We want to compute the probability that
at least two of them have the same birthday. What is the activity? We go around the room
asking for people's birthdays. If |A| = m, then there are 365^m outcomes to our experiment (we
have m decisions or stages and 365 choices for each one of them). It seems tricky to compute
the size of the set of outcomes that give at least two with the same birthday, but the complement
of this set is that no two of them have the same birthday. How many ways can that occur? For
the first person we ask we have 365 choices, but for the second one we have only 364, and for
the third one 363, etc., and this kind of reasoning should sound familiar.

So we can build a table of probabilities. Let's say p_n denotes the probability of no two persons
having the same birthday when there are n people in the room. We can write down an answer
by simple counting, and we obtain
$$p_n = \frac{365!}{(365-n)!\,365^{n}}.$$
Unfortunately, this may not be easy to compute since 365! has more than 500 digits. A much
better way to do the computation is recursively. Namely, at stage 1 we have 365/365, for the next
one we have our previous result times 364/365, and for the next, the previous result times
363/365, etc. In short,
$$p_{n+1} = \frac{365-n}{365}\,p_n.$$
The table gives the values recursively computed. Thus, amazingly, you should be ready to bet
when there are only 23 people in the room!

  #    All Diff.   NOT All Diff.
  1    1.00000     0.00000
  2    0.99726     0.00274
  3    0.99180     0.00820
  4    0.98364     0.01636
  5    0.97286     0.02714
  6    0.95954     0.04046
  7    0.94376     0.05624
  8    0.92566     0.07434
  9    0.90538     0.09462
 10    0.88305     0.11695
 11    0.85886     0.14114
 12    0.83298     0.16702
 13    0.80559     0.19441
 14    0.77690     0.22310
 15    0.74710     0.25290
 16    0.71640     0.28360
 17    0.68499     0.31501
 18    0.65309     0.34691
 19    0.62088     0.37912
 20    0.58856     0.41144
 21    0.55631     0.44369
 22    0.52430     0.47570
 23    0.49270     0.50730
 24    0.46166     0.53834
 25    0.43130     0.56870
 26    0.40176     0.59824
 27    0.37314     0.62686
 28    0.34554     0.65446
 29    0.31903     0.68097
 30    0.29368     0.70632
 31    0.26955     0.73045
 32    0.24665     0.75335
 33    0.22503     0.77497
 34    0.20468     0.79532
 35    0.18562     0.81438
 40    0.10877     0.89123
 50    0.02963     0.97037
 60    0.00588     0.99412
 70    0.00084     0.99916
 80    0.00009     0.99991
 90    0.00001     0.99999

In a typical classroom of 35 students, the odds that at least two have the same birthday are
better than 4 to 1; and at 40 people they are better than 8 to 1. For example, if we look at the
43 Presidents of the United States, we should have a high probability of two of them having the
same birthday, and indeed, Polk and Harding were both born on November 2nd. With 50 people
the odds are better than 32 to 1, and if there are as many as 100 people in the room your odds
are astronomical, better than 3,000,000 to 1. In our own Mathematics Department of 45 faculty
members, the odds were better than 10-to-1 that 2 of us have the same birthday. As it turns
out there are 3 of us with the same birthday! And that is not very likely.
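
The recursive computation is easy to hand to a machine. The following is a minimal sketch, assuming 365 equally likely birthdays (no leap years); the function name `all_different` is an illustrative choice, not part of the text.

```python
def all_different(n, days=365):
    """Probability that n people all have different birthdays."""
    p = 1.0
    for k in range(n):
        # The (k + 1)-th person must avoid the k birthdays already taken.
        p *= (days - k) / days
    return p

# Smallest room size at which a shared birthday is more likely than not.
n = 1
while all_different(n) > 0.5:
    n += 1

print(n)
for m in (23, 35, 40, 50):
    print(m, round(all_different(m), 5), round(1 - all_different(m), 5))
```

Multiplying one factor at a time avoids ever forming 365! and keeps every intermediate value between 0 and 1.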

There is a more generic way to view the last rule. An event A is a subevent of event B if
whenever A occurs, B must also occur. Of course, every event is a subevent of the
universe. Another example: rolling a 6 on a die is a subevent of rolling an even number.

Suppose now that A is a subevent of B. Then B but not A is an event on its own, and since it is
disjoint from A, we get
⑤ The probability of an event, but not a given subevent, is the
probability of the event minus the probability of the subevent.

Thus the probability of an odd roll with two dice but not a seven is 18/36 − 6/36 = 12/36.

Since the union of two events A and B can be seen as the disjoint union of three events:
A but not (A and B),
B but not (A and B),
A and B,
we get that
P(A or B) = P(A) − P(A and B) + P(B) − P(A and B) + P(A and B),
or equivalently we have
⑥ For any events A and B, P(A or B) = P(A) + P(B) − P(A and B).

Example 5. Suppose that in a lot of bolts, it is determined that 10% are too long and 15% are
too wide, and 7% are both too long and too wide. In selecting a bolt at random, what is the
probability that it will be acceptable in both length and width? Easy,
P(too long or too wide) = 0.10 + 0.15 − 0.07 = 0.18.
So the probability of acceptance is 1 − 0.18 = 82%.

Example 6. It is estimated that in 70% of all fatal automobile accidents between two cars, at
least one of the drivers was driving under the influence, and in 25% of them both of the drivers
were DUI. In what percentage of those accidents was exactly one of the drivers not intoxicated?
Easily, 70% − 25% = 45%.

We summarize all the above observations about the distribution of a random variable:
① The probability of an event is always a number between 0
and 1, inclusive.
② The sum of all the possible probabilities is 1 since we are
certain one of them will occur.
③ If A and B are disjoint events, then P(A or B) = P(A) + P(B).
④ The probability for a given event to occur equals
1 − probability that the event will not occur.
⑤ The probability of an event, but not a given subevent, is the
probability of the event minus the probability of the
subevent.
⑥ For any events A and B, P(A or B) = P(A) + P(B) − P(A and B).

Although probability was born in the middle of gambling and gaming questions, it was soon realized
that it applies to social and other considerations as well. The following example illustrates the
idea of point of view in a numerical sense.

Example 7. Grandson’s Dilemma. Suppose a grandfather is to distribute some money
among his favorite relatives: Alphonse, Bertrand and Constance, his three grandchildren.
He has 5 crisp, new $100 bills. But the grandfather has a peculiar spirit so he has decided he
will give these bills at random. How many ways does he have of doing this? First we try
brute force. Here what matters is how many bills each of Al, Bert and Connie receives. We are
going to let (x, y, z) stand for a distribution where x is the number of bills Al received, y the
number Bert got and z how many Connie received. So x, y and z are nonnegative whole
numbers that add up to 5. The possibilities are then

(5,0,0) (4,1,0) (4,0,1) (3,2,0) (3,0,2) (3,1,1) (2,2,1)
(0,5,0) (0,4,1) (1,4,0) (0,3,2) (2,3,0) (1,3,1) (1,2,2)
(0,0,5) (1,0,4) (0,1,4) (2,0,3) (0,2,3) (1,1,3) (2,1,2)

So there are 21 ways to distribute the money. If we just try to build a tree we will soon notice
that the number of branches at a stage does depend on previous choices.

In this scheme, the probability that at least one grandchild will not get money is given by
15/21 = 5/7 ≈ 71.43%. So the grandfather will not be surprised if one or more of the grandchildren
is disappointed.
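
The brute-force list can be generated rather than written by hand. This sketch assumes, as the example does, that the 21 triples are treated as equally feasible; the variable names are illustrative.

```python
from fractions import Fraction

bills = 5

# Every (x, y, z) of nonnegative whole numbers with x + y + z = 5.
splits = [(x, y, bills - x - y)
          for x in range(bills + 1)
          for y in range(bills + 1 - x)]

# Distributions in which at least one grandchild receives nothing.
someone_empty = [s for s in splits if 0 in s]

print(len(splits), len(someone_empty))
print(Fraction(len(someone_empty), len(splits)))
```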

Let us consider the problem from Alphonse’s point of view. He perhaps sees himself as equally
likely to get $500, $400, $300, $200, $100 or no dollars. So he considers the likelihood that
he gets nothing as being 1/6 ≈ 16.67%. Thus, he would be considerably surprised to end up
empty-handed.

Example 8. Grandson’s Dilemma II. Suppose the grandfather is to distribute, rather than 5
crisp, new $100 bills, five different bills: a $5 bill, a $10, a $20, a $50 and a $100 bill. As
before, he decides that any given bill is as likely to go to any one of the three grandchildren:
Alphonse, Bertrand or Constance. What is the probability that at least one of the grandchildren
will end up being unhappy by not receiving any money?

How many ways can he distribute the money? $3^5 = 243$, since the $5 bill can end up in
three different hands, and the $10 also has three choices, et cetera. Next we need to count the
number of ways to distribute the money so that everybody gets some money. Let A be the set
of distributions where Alphonse does not get anything, and likewise B is the set of
distributions where Bertrand ends up empty-handed, and similarly for C.

Then we are interested in A ∪ B ∪ C. Our universe has 243 objects as we saw above, and
clearly A has 32 elements, since if Alphonse is to receive nothing, then each bill has 2 choices,
so we get $2^5 = 32$. On the other hand A ∩ B has only one element since Connie then has to
receive all the money, and finally A ∩ B ∩ C = ∅ since the money will end up in the
grandchildren’s hands. Then easily one can finish filling in the following diagram.

[Venn diagram of A, B and C: 30 in each of the regions A only, B only and C only; 1 in each of the three pairwise intersections; 0 in the triple intersection]

So A ∪ B ∪ C has 30 + 30 + 30 + 1 + 1 + 1 = 93 elements, and the number of distributions in
which every child receives some money is 243 − 93 = 150. Thus, the probability that every child
will receive some money is 150/243 ≈ 61.73%, and consequently the probability of bringing
unhappiness to at least one grandchild is 38.27%. Of course, Alphonse only has a 32/243 ≈ 13.17%
chance of getting nothing, so again he might be surprised if this happens.
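
The counts in this example can be confirmed by listing all 243 assignments of bills to grandchildren. This is a sketch with illustrative names; the letters A, B, C stand for Alphonse, Bertrand and Constance.

```python
from fractions import Fraction
from itertools import product

# One recipient for each of the five distinct bills: 3**5 assignments.
assignments = list(product("ABC", repeat=5))

# Somebody is left out exactly when fewer than 3 distinct names appear.
left_out = [a for a in assignments if len(set(a)) < 3]
alphonse_empty = [a for a in assignments if "A" not in a]

total = len(assignments)
print(total, len(left_out), total - len(left_out))
print(float(Fraction(len(left_out), total)))        # at least one unhappy grandchild
print(float(Fraction(len(alphonse_empty), total)))  # Alphonse gets nothing
```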

We end the section with a popular game: poker.

Example 9. A typical deck of cards consists of 52 cards in 4 suits (spades, hearts, diamonds
and clubs) and 13 denominations (Ace, 2, 3, 4, 5, 6, 7, 8, 9, 10, Jack, Queen, King).

A poker hand consists of 5 cards out of the 52. There are special hands, combinations of either
denominations or suits or both, that are ranked higher than others. The rankings from best to
worst are as follows:

Straight Flush
5 cards in sequence in the same suit.
What is the probability of a Straight Flush? There
are at least two ways to view our experiment: one way is I am dealt 5 cards out of a deck of 52
(as in 5-card draw), another is I am dealt one card at a time until I have 5 (as in 5-card stud).
Should the probabilities be different? Of course not. But what may happen is that it is easier to
do a problem one way than the other. Already with the Straight Flush we see that one is
superior to the other. With the first approach, there are $\binom{52}{5} = 2{,}598{,}960$ equally feasible
outcomes. To compute the numerator, how many decisions do we have to make? The suit of
the flush (4 choices for this decision) and the type of straight (10 choices for this decision: just
count them on your fingers by deciding which denomination is the lowest), so in total we have 40
options, so the probability is 40/2,598,960 ≈ 0.00001539 (you should not hold your breath
until you get one of these). What about the second approach? The first card can be
anything, but what about the second card? We are in trouble. The number of options for the
second level of the tree depends on which branch of the first level you are in (for example, if
the first one is a king, then the second one can only be a 9, 10, J, Q, A: 5 options, while if the first
one is an 8, then the second one can be a 4, 5, 6, 7, 9, 10, J, Q: 8 options). We certainly don't want
to start drawing trees. As it turns out this tree has 4,320 terminal nodes! (We might let a machine
do this, but certainly not by hand.) This is an important lesson. If you are in trouble counting
the outcomes for an event, then by moving laterally and changing the setup
maybe the trouble can be avoided. In reality, the second approach only worsens as we go
down the list of hands. So we will stay with the first approach.

4-of-a-Kind
All four cards of the same denomination.
For 4-of-a-Kind: we have to decide the
denomination (13 ways), and then the odd card (48 ways), so we have 13 × 48 = 624 options
all together, giving us a probability of 0.00024.

Full House
3 cards in the same denomination, 2 in another.
For a Full House: in order to build a full house we
need to decide which 3-of-a-kind we are going to have (13 options), which suits those 3 cards
are going to have, $\binom{4}{3} = 4$, which pair (12 options) and the suits for the pair, $\binom{4}{2} = 6$ options,
so the total is 13 × 4 × 12 × 6 = 3,744, so the probability is 0.001440576.

Flush
5 cards in the same suit, but not in sequence.
For a Flush: we have to decide the suit (4 options)
and which 5 cards out of the 13 in the suit, which
gives $\binom{13}{5} = 1{,}287$ options, so we have 5,148 ways, but these include the 40 hands that are
straight flushes, hence the number is 5,108, and the probability is 0.0019654. Note that if we had
missed the subtlety of the 40 hands we had counted before, the answer wouldn't be that much
different: 0.0019807, with a difference of 0.000015.

Straight
5 cards in sequence, but not in the same suit.
For a Straight: we have to decide the type of
straight (10 options), and then decide the suit for
each of the cards (4 options for each). As before, we don't worry about the straight flushes, we
will just subtract them. So in total we have 10 × 4 × 4 × 4 × 4 × 4 = 10,240 ways from which we
subtract the 40 straight flushes, to give 10,200, so the probability is 0.0039246.

3-of-a-Kind
3 cards in the same denomination, other 2 different.
For 3-of-a-Kind: choose the denomination (13
ways), the suits for these 3 cards ($\binom{4}{3} = 4$ ways), the two other denominations ($\binom{12}{2} = 66$ ways),
and the arbitrary suits for the two new cards: $4^2 = 16$. Total = 13 × 4 × 66 × 16 = 54,912. The
probability is, thus, 0.021128 (things are getting better).

Two Pairs
2 cards in one denomination, 2 cards in another denomination, fifth card in yet another.
This one is subtle: Two-Pair. One trap that is commonly fallen into is as follows: choose the
denomination for the first pair (we have 13 ways of doing this), then choose the suits for this
pair: $\binom{4}{2} = 6$ ways. Then we have 12 ways of choosing the second pair, and again 6 ways of
choosing its suits. Now all we have to do is choose the remaining card out of a possible 48,
giving us a grand total of 13 × 6 × 12 × 6 × 48 = 269,568. There are two most foul errors in this
discussion. The latter is the easiest one to catch. The 48 is wrong. You are not controlling full
houses! Hence it should be one card out of the remaining 44, giving an adjusted count of
247,104. But one error remains that is most subtle and very common (and tempting: can you
detect it?). Remember the distinction between a committee and an executive board. Can we tell
the difference between the first pair and the second pair? We certainly have counted them as if
we could, and that is definitely wrong. Instead we have counted every hand twice and the real
count should be 123,552. Just to make you totally comfortable with this let's count them another
way. Let's start by choosing the two denominations for the two pairs: $\binom{13}{2} = 78$ ways. Then we
have 6 ways of choosing the suits for one of the pairs and another 6 of choosing them for the
other one, and then we have as before 44 ways of choosing the extra card:
78 × 6 × 6 × 44 = 123,552.

Pair
2 cards in one denomination, nothing else.
For a simple Pair, we have first to decide the
denomination (13 ways), then the suits for the pair,
$\binom{4}{2} = 6$. Then we must have three other denominations, $\binom{12}{3} = 220$, and then the arbitrary suits
for those denominations, 4 × 4 × 4 = 64, which gives us the total of
13 × 6 × 220 × 64 = 1,098,240.

Bust
None of the above.
For a Bust, we must have 5 denominations,
$\binom{13}{5} = 1{,}287$, and a suit for each, $4^5 = 1{,}024$. But we have included the Straight Flushes, the
Flushes and the Straights, so we have to subtract
1,287 × 1,024 − 4 − 36 − 5,108 − 10,200 = 1,302,540.
But wait, you say, I was stupid to have done this calculation since we could have found the
number of bust hands by subtracting all the previous hands from the total number of hands.
However, it is important to have redundancy, especially in a sophisticated calculation like this
one. The fact that, as the table shows, the numbers add up to the correct total assures us that
we have not made just one error.

Name of hand      Number of ways   Probability
Straight Flush                40      0.000015
4-of-a-Kind                  624      0.000240
Full House                 3,744      0.001441
Flush                      5,108      0.001965
Straight                  10,200      0.003925
3-of-a-Kind               54,912      0.021128
Two Pair                 123,552      0.047539
Pair                   1,098,240      0.422569
Bust                   1,302,540      0.501177
TOTAL                  2,598,960      1.000000
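
The whole table can be recomputed from the counting arguments above. The following sketch uses Python's standard `math.comb` for the binomial coefficients; the dictionary `hands` and the other variable names are illustrative only.

```python
from math import comb

total = comb(52, 5)                      # all 5-card hands

straight_flush = 4 * 10                  # suit, lowest denomination
four_kind = 13 * 48                      # denomination, odd card
full_house = 13 * comb(4, 3) * 12 * comb(4, 2)
flush = 4 * comb(13, 5) - straight_flush
straight = 10 * 4**5 - straight_flush
three_kind = 13 * comb(4, 3) * comb(12, 2) * 4**2
two_pair = comb(13, 2) * comb(4, 2)**2 * 44
pair = 13 * comb(4, 2) * comb(12, 3) * 4**3
bust = comb(13, 5) * 4**5 - straight_flush - flush - straight

hands = {
    "Straight Flush": straight_flush, "4-of-a-Kind": four_kind,
    "Full House": full_house, "Flush": flush, "Straight": straight,
    "3-of-a-Kind": three_kind, "Two Pair": two_pair, "Pair": pair,
    "Bust": bust,
}

for name, n in hands.items():
    print(f"{name:15s}{n:>12,d}{n / total:12.6f}")

# Redundancy check: the nine categories must account for every hand.
print(sum(hands.values()) == total)
```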

• Random Variables

In the last section we encountered the fundamental concept of a random variable. One can
safely say that this is the most fundamental concept in the course. So what is a random variable?
In the last section we looked at a couple of them. The roll of two dice which had for its
distribution

X    2      3      4      5      6      7      8      9      10     11     12
P    1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36
≈    .0277  .0556  .0833  .1111  .1388  .1667  .1388  .1111  .0833  .0556  .0277

And the roll of three dice:

X    3      4      5      6      7      8      9      10     11     12     13     14     15     16     17     18
P    1/216  3/216  6/216  10/216 15/216 21/216 25/216 27/216 27/216 25/216 21/216 15/216 10/216 6/216  3/216  1/216
≈    .0046  .0138  .0277  .0463  .0694  .0972  .1157  .125   .125   .1157  .0972  .0694  .0463  .0277  .0138  .0046

Both of these are examples of discrete random variables since the only values the random
variable takes are isolated real numbers. These values that the random variable takes are called
the support (or range) of the random variable. And for a discrete random variable, each of
the numbers in this collection of (isolated, possibly infinitely many) points (the support) gets
assigned a nonnegative number, which stands for the probability of obtaining that number as
an output of our random variable. Critically, the sum of all these probabilities has to be 1,
of course. This collection of probabilities is called the distribution of the (discrete) random
variable. Although this is commonly so designated, it is an unfortunate choice since as we will
see below the word distribution will acquire a slightly different meaning when we discuss other
types of random variables.

The rest of the computations of probabilities is basically dependent on the fact that these
outcomes are disjoint events so one simply adds the probabilities.

Example 1. Suppose you come to take a test totally unprepared. The test consists of 10
True-False questions, each of which you will answer at random (but you will answer them
all, since there is no penalty for guessing.) How likely is it that you will achieve a passing score
of 70% or better? Let X be the random variable that counts the number of correct answers, so
what is desired is P ( X ≥ 7 ) .

First, what is the experiment? Answering the exam. How many ways can you do this? By the
second counting principle, $2^{10} = 1024$ (10 decisions, 2 choices for each). What is the event we
are pursuing? In how many ways can you answer the exam so that you have exactly 10 correct?
1 way. How about 9 correct? Build your tree: the first stage is to decide which question you are
going to answer incorrectly, and for this stage you have 10 choices. After you have done that, there
are no options left since the question you are to answer incorrectly has to be answered that way
while all the others have to be answered correctly. So there is a total of $\binom{10}{1} = 10$ ways of
getting 9 correct. What about 8? There we have the decision of which two questions out of
the 10 we are to answer incorrectly, after that we have no options left, so the answer
is $\binom{10}{2} = 45$. Finally, by similar reasoning, we get $\binom{10}{3} = 120$ ways of getting exactly 7 correct.

Continuing this way, we get the probability distribution for the random variable that counts the
number of correct answers:
X    0       1        2        3         4         5         6         7         8        9        10
P    1/1024  10/1024  45/1024  120/1024  210/1024  252/1024  210/1024  120/1024  45/1024  10/1024  1/1024
%    .097    .97      4.4      11.7      20.5      24.6      20.5      11.7      4.4      .97      .097

So P(X ≥ 7) = P(X = 7) + P(X = 8) + P(X = 9) + P(X = 10), which totals 176/1024,
approximately 17%. Not bad for total ignorance!
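
The row of counts and the passing probability can be reproduced in a few lines. This is a sketch for the ten-question True-False exam only; `counts` and `passing` are illustrative names.

```python
from fractions import Fraction
from math import comb

questions = 10

# Number of answer sheets with exactly k correct answers, k = 0, ..., 10.
counts = [comb(questions, k) for k in range(questions + 1)]

# Answer sheets scoring 70% or better: 7, 8, 9 or 10 correct.
passing = sum(counts[7:])
p_pass = Fraction(passing, 2**questions)

print(counts)
print(passing, float(p_pass))
```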

Example 2. This is a slight, but important variation of Example 1. Suppose you come to take
a test (totally unprepared as usual), but that the test is Multiple Choice, with ten questions
and each question having three choices, only one of them correct. What is the probability that
you score at least a 70% on the test? Again let X be the random variable that counts the
number of correct answers.

How many ways can we answer the exam? Easy, $3^{10} = 59049$. How many ways can we get
all correct? 1 way. Nine correct? $\binom{10}{9} \times 1^9 \times 2^1 = 20$. The surprising ingredient here might be
the 2, which is coming from the 2 ways we can answer a question incorrectly. Reviewing the
three factors we get: $\binom{10}{9}$ as the number of ways of choosing the questions that we are going to
answer correctly, $1^9$ as the number of ways of answering those questions correctly, $2^1$ as the
number of ways of answering the remaining questions incorrectly. How many ways can we get
an 80%? $\binom{10}{8} \times 1^8 \times 2^2 = 180$. Continuing we get

X    0      1      2      3      4      5      6      7      8      9      10
P    1024   5120   11520  15360  13440  8064   3360   960    180    20     1      (each over 59049)
%    1.73   8.67   19.51  26.01  22.76  13.65  5.69   1.63   .30    .03    .001

So P(X ≥ 7) is (1 + 20 + 180 + 960)/59049 = 1161/59049 ≈ 1.96%, much smaller than in the
True-False exam.
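
Both exams follow the same counting pattern, so one function can cover them. The sketch below is an illustration under the examples' assumptions (every question answered at random, exactly one correct choice per question); the function name `prob_at_least` and its parameters are hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_at_least(k, questions=10, choices=2):
    """P(at least k correct) when every question is answered at random.

    For j correct answers: choose which j questions are right, answer
    each of those the 1 correct way, and each of the others in one of
    the (choices - 1) wrong ways.
    """
    favorable = sum(comb(questions, j) * (choices - 1) ** (questions - j)
                    for j in range(k, questions + 1))
    return Fraction(favorable, choices ** questions)

true_false = prob_at_least(7, choices=2)
multiple_choice = prob_at_least(7, choices=3)
print(float(true_false), float(multiple_choice))
```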

The graph represents the contrast between the two types of exams.

[Figure: the two distributions of the number of correct answers (0 through 10), one for p = .3333 and one for p = .5; vertical axis from 0 to 0.3]

We next see two other examples of simple random variables.

Example 3. The Simplest Random Variable. The title belongs to a constant!
Namely, for example, the random variable X = 5 has distribution:

X    5
P    1

Example 4. The Bernoulli, the Next Simplest Random Variable. The next simplest
random variable is named after one of the founders of the subject, Jacob Bernoulli who wrote
a very important book named Ars Conjectandi.

The purpose of the Bernoulli random variable is simply success vs. failure. It does have one
parameter, the probability of success, denoted by p, so B_p, which is how
we will refer to this random variable from now on, has the simple distribution:

B_p    0        1
P      1 − p    p

So usually 1 is success while 0 is failure. Also often we refer to 1 − p as q (and we will do so
throughout the course). Of course what success is depends on the user. For example, success
could be rolling a 7 with a pair of dice, or flipping heads with a coin.

But there is another kind altogether of random variables. Let us start with an example we
encountered briefly before.

Example 5. Uniform. Let us consider the random variable X that chooses a number between
0 and 1 at random. This is a randomizer or a random number generator. It will be known as U
for the remainder of the course.
Think of a die, but instead of the numbers 1 through 6 on the faces, we have the numbers 1/6
through 6/6 on them. Then the distribution of a roll looks like a flat line.

[Figure: bar chart with six bars of equal height at 1/6, 1/3, 1/2, 2/3, 5/6 and 1; vertical axis from 0 to 0.2]

Similarly, we can think of an icosahedron with 20 sides labeled 1/20 to 20/20.

[Figure: bar chart with 20 bars of equal height 0.05; vertical axis from 0 to 0.06]

Or a dreidel with 40 sides.

[Figure: bar chart with 40 bars of equal height 0.025; vertical axis from 0 to 0.03]

As we proceed to increase the number of sides, we see that the probability of getting any
given value is decreasing toward 0, from 1/6 = .1667 to 1/20 = .05 to 1/40 = .025 in our three
examples. So in fact when we talk of a random variable that chooses an arbitrary real number in
the interval [0,1], we have lost the ability to talk of the probability of a single number as being
anything but 0.

First we need to understand the support of the variable, that is, what values our variable can
take. In the example we are discussing, the support is the interval [0,1]. If we take an arbitrary
u in this interval, we could ask for the following quantity
$$\frac{P(u \le U \le u+h)}{h}.$$
And then we could take the limit as h → 0; this would represent the tendency of probability for
u, a potential. In our specific example, that limit is
$$\lim_{h \to 0} \frac{P(u \le U \le u+h)}{h} = \lim_{h \to 0} \frac{(u+h)-u}{h} = 1$$
as long as 0 ≤ u ≤ 1, and that agrees with our intuition of what that specific random variable is
doing: namely if one is a number between 0 and 1, one has the same chance as any other
number in that interval of being picked, while numbers outside that interval have no chance of
being selected.

Again we could use the same idea of a table (as we did in the discrete case) to represent this
information, except now that instead of being a discrete table, it is a continuous table:

U       …   −1   …   0   …   u   …   1   …   2   …
f(u)    0    0   0   1   1   1   1   1   0   0   0

And instead of a probability, it is a potential for probability; instead of mass, it is density. And
it is usually denoted by f(u). Note that we use little u to denote an arbitrary number in the
support of the random variable U.

Again, for a continuous random variable, individual members of the support lose all
importance, so the probability of any one of them occurring is simply 0, in other words
P(U = a) = 0 for any number a,
but that does not mean that some elements cannot have a higher potential of occurring than others. In
the previous example, the points in the interval between 0 and 1 had equal potential, while points
outside that interval had no potential (so they were not in the support of the random variable).
In other words, instead of talking of the probability of getting a specific number, we talk about
the probability of being close to that number. But even more dramatically, for any number a,
P(U < a) = P(U ≤ a), since, again, individual points do not matter.

We know for sure in our example that P(0 ≤ U ≤ 1) = 1 since we have certainty this will
happen. This is closely related to the above mentioned principle:
② The sum of all the possible probabilities is 1 since we are
certain one of them will occur.
But it is no longer the sum since we are acting on a continuum now, so the correct statement is
that
②′ The integral of the density over all real numbers equals 1 since
we are certain this will occur.

And in fact $\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{1} f(x)\,dx$ since there is no density outside that interval, and in that
interval we have f(u) = 1, so $\int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{1} 1\,dx = x\big|_{0}^{1} = 1$ as it should be.

Rather than the table as we did above, the density of a random variable is given as a function. In
our example
$$f(u) = \begin{cases} 1 & 0 \le u \le 1 \\ 0 & \text{otherwise} \end{cases}$$
and the graph is given by the picture.

[Figure: graph of the density f(u); horizontal axis from 0 to 1.5, vertical axis from 0 to 1.2]

One could also simply say that the support is the interval [0,1] and in that interval the density is 1.

What kind of event should we be discussing? Certainly, we can compute P(.2 ≤ U ≤ .45). It
can be viewed two ways:

❶ $\int_{.2}^{.45} 1\,dt = .25$. This is nothing but the sum (integral since it is continuous) of all relevant
possibilities.

❷ Area of a rectangle with base of size .25 and height 1, which is .25. Similar to the discrete
arguments: a share of .25 out of a total of 1 is .25.

Thus in general the density of a continuous random variable has to satisfy:

It is nonnegative: $f(x) \ge 0$ for all x's
(this corresponds to principle ①), and as we saw before:

It has unit area underneath: $\int_{-\infty}^{\infty} f(t)\,dt = 1$
to be in agreement with principle ②.

An event is a subset of the line, and to compute its probability one simply integrates the function
over that subset, e.g., $P(X \le a) = \int_{-\infty}^{a} f(t)\,dt$, or $P(a \le X \le b) = \int_{a}^{b} f(t)\,dt$. By simple
properties of the integral, we get

③ If A and B are disjoint events, then P(A or B) = P(A) + P(B).

Note that $P(X \le a) = \int_{-\infty}^{a} f(t)\,dt$ is indeed an honest probability, and it is such a useful
function that it is closely associated with the random variable. It is unfortunately called the
distribution (vs. density) of the random variable, and so one defines
$$F(a) = P(X \le a) = \int_{-\infty}^{a} f(t)\,dt.$$
Often, in order to clarify, the distribution is also referred to as the cumulative distribution.
Its characteristic properties include:

❶ It is always between 0 and 1, $0 \le F(a) \le 1$ for all a,
since $f(x) \ge 0$, and total area is 1;

❷ It is increasing, $F(a) \le F(b)$ if $a \le b$,
since the event X ≤ a is a subevent of the event X ≤ b;

❸ It must approach 1, $\lim_{a \to \infty} F(a) = 1$.

Of course, if we know the distribution \( F(a) \) of a continuous random variable \(X\), then its density is simply given as its derivative, \( f(a) = \left.\frac{dF(x)}{dx}\right|_{x=a} \).

For example if we use \( X = U \) to be the uniform random variable on the interval \([0,1]\), then its density is given by
\[ f(u) = \begin{cases} 1 & 0 \le u \le 1 \\ 0 & \text{otherwise} \end{cases} \]
while its distribution is
\[ F(u) = \begin{cases} 0 & u \le 0 \\ u & 0 \le u \le 1 \\ 1 & u \ge 1 \end{cases} . \]
Or simply one could say that \( F(x) \) is \(x\) in the support of \(U\), the rest being obvious. The graphs of the two are

[Figure: graphs of the Density and the Distribution of U]

One of the advantages of having the distribution is that no more integration is then required. For example, to compute \( P(.2 \le U \le .45) \), one simply calculates \( F(.45) - F(.2) \), since
\[ P(U \le .2) + P(.2 \le U \le .45) = P(U \le .45) \]
because the events on the left are disjoint. Actually we should have written
\[ P(U < .2) + P(.2 \le U \le .45) = P(U \le .45) \]
instead, but the difference (one point) is insignificant.
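The two routes to \( P(.2 \le U \le .45) \) can be compared numerically. The following is a minimal sketch, not part of the original notes; the helper names `uniform_cdf`, `uniform_density` and `midpoint_integral` are illustrative choices, and the midpoint rule is just one simple way to approximate the integral.

```python
def uniform_cdf(x: float) -> float:
    """Cumulative distribution F of the standard uniform U on [0, 1]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    return x


def uniform_density(x: float) -> float:
    """Density f of the standard uniform U: 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def midpoint_integral(f, a: float, b: float, steps: int = 10_000) -> float:
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width


# Route 1: difference of the distribution, no integration needed.
by_distribution = uniform_cdf(0.45) - uniform_cdf(0.2)
# Route 2: integrate the density over the event.
by_integration = midpoint_integral(uniform_density, 0.2, 0.45)
print(by_distribution, by_integration)
```

Both numbers should agree with the value .25 found in the text, up to floating-point rounding.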

We look at another example of a continuous random variable.

Example 6. Lifetime. A certain type of electronic device is guaranteed to last at least 10 hours. After that, the density of the lifetime of the device is inversely proportional to the fourth power of the time (measured in hours) it has lasted. How likely is it that the device will last at least 20 hours? How about 30 hours? Let \(T\) be the random variable that measures the time the device will last. Then we know that the support of \(T\) is all real numbers greater than or equal to 10, and that the density on that set should be of the form \( f(t) = \frac{c}{t^4} \) where \(c\) is some unknown constant. We would like to know \( P(T \ge 20) \).

The first computation has to be to find \(c\), and to do that we know that \( \int_{10}^{\infty} \frac{c}{x^4}\,dx \) has to equal 1.

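This readily leads to \( c = 3{,}000 \), since \( \int_{10}^{\infty} \frac{c}{x^4}\,dx = \frac{c}{3 \cdot 10^3} \). So we know the density equals
\[ f(t) = \begin{cases} \dfrac{3{,}000}{t^4} & t \ge 10 \\ 0 & \text{otherwise} \end{cases} \]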


To find \( P(T > 20) \), all we need to do is compute
\[ \int_{20}^{\infty} \frac{3000}{x^4}\,dx = -1000x^{-3}\Big|_{20}^{\infty} = \frac18 , \]
or 12.5%. Although not required, we will compute the cumulative distribution: for \( t \ge 10 \),
\[ F_T(t) = P(T \le t) = \int_{10}^{t} \frac{3000}{x^4}\,dx = 1 - \frac{1000}{t^3} . \]
Of course, once we know the distribution, we can compute most probabilities without the need to integrate. Thus, for example, \( P(T > 30) \) is simply \( 1 - F_T(30) = \tfrac{1}{27} \approx 3.70\% \).
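The arithmetic of the lifetime example can be replayed with exact fractions. This is only a sketch under the assumptions stated in the text (support \( t \ge 10 \), density \( c/t^4 \)); the function name `lifetime_cdf` is a hypothetical label for the distribution \( F_T \).

```python
from fractions import Fraction

# The integral of c / x^4 from 10 to infinity is c / (3 * 10^3);
# setting it equal to 1 gives the normalizing constant.
c = Fraction(3 * 10**3)


def lifetime_cdf(t) -> Fraction:
    """F_T(t) = 1 - 1000 / t^3 for t >= 10, and 0 before the guarantee ends."""
    t = Fraction(t)
    if t < 10:
        return Fraction(0)
    return 1 - (c / 3) / t**3


p_over_20 = 1 - lifetime_cdf(20)  # chance of lasting past 20 hours
p_over_30 = 1 - lifetime_cdf(30)  # chance of lasting past 30 hours
print(c, p_over_20, p_over_30)
```

Working with `Fraction` keeps the answers as exact ratios, so they can be compared directly with the fractions in the text.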

We end the section with yet another type of random variable, which unfortunately, will not be
encountered often in the course.

Example 7. A Mixed Problem. A car is to be driven by a mechanic to a parking place where


it will be left until the chauffeur picks it up. The car will arrive at the parking place with a uniform
distribution between noon and 1 p.m. The chauffeur will arrive independently¹ with the same
distribution. If the car is there when the chauffeur arrives, he drives off immediately; if the car is
not there, he waits until it arrives and then takes off. Let W be the waiting time for the chauffeur.
What is the density of W?

If we let \(M\) denote the arrival time of the mechanic and \(C\) that of the chauffeur, then we have \( P(M \le C) = \tfrac12 \) since \( P(M \le C) + P(M \ge C) = 1 \) and \( P(M \le C) = P(M \ge C) \). But this is equivalent to \( P(W = 0) = \tfrac12 \).

Clearly, \( P(W \le 1) = 1 \). What is \( P(W \le a) \) for \( 0 \le a \le 1 \)? For this to happen the car must arrive no later than \(a\) after the chauffeur, that is, \( M \le C + a \). We do not at present have the tools to compute this probability, but below we will see that the answer is
\[ 1 - \frac{(1-a)^2}{2} = \frac{1 + 2a - a^2}{2} , \]
and so the density is the derivative, \( 1 - a \).

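Thus the density of \(W\) is given by
\[ f_W(w) = \begin{cases} \tfrac12 & w = 0 \\ 1 - w & 0 < w \le 1 \\ 0 & \text{otherwise} \end{cases} \]
with the graph

[Figure: graph of the density of W, with the value 1/2 marked at w = 0]

Note that here one point, namely 0, is very significant.

¹ This word, independently, will be further clarified in a future section.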
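Since the tools to compute \( P(W \le a) \) come later, a simulation offers a quick plausibility check of the quoted formula. The sketch below assumes, as in the example, that both arrival times are independent standard uniforms (hours after noon) and that the chauffeur waits \( \max(M - C, 0) \); the seed and the number of trials are arbitrary choices.

```python
import random


def simulate_waiting_times(trials: int, seed: int = 0) -> list[float]:
    """Draw waiting times W = max(M - C, 0) for independent uniform arrivals."""
    rng = random.Random(seed)
    waits = []
    for _ in range(trials):
        mechanic = rng.random()   # arrival time M of the car
        chauffeur = rng.random()  # arrival time C of the chauffeur
        waits.append(max(mechanic - chauffeur, 0.0))
    return waits


def claimed_cdf(a: float) -> float:
    """The distribution quoted in the text: 1 - (1 - a)^2 / 2 for 0 <= a <= 1."""
    return 1.0 - (1.0 - a) ** 2 / 2.0


waits = simulate_waiting_times(200_000)
share_zero = sum(w == 0.0 for w in waits) / len(waits)
share_half = sum(w <= 0.5 for w in waits) / len(waits)
print(share_zero, share_half, claimed_cdf(0.5))
```

The share of zero waits should come out near one half, and the share of waits of at most half an hour near the value of the quoted formula at \( a = .5 \).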

Ž Attributes of a Random Variable

In this section we introduce some of the most important features of a random variable—
in all we will look at 5 items, some measure the center of the random variable while
others measure the spread. We have already looked at one of the measurements of the
spread—the SUPPORT or RANGE of a random variable, which consists of the values a
random variable can take.

Next we look at perhaps the most important individual bit of information about a random
variable: its mean, or average or expected value. The notion was introduced by the
great Dutch mathematician Huygens (also of the 17th century).

Example 1. Let's look at the following game (once more, not atypical of the 17th
century):
Since one has a chance in six of rolling a 1 with a die, one has an even
chance to roll at least one 1 when one rolls 3 dice. Hence I propose the
following game to you. You will roll 3 dice. If three 1's show up you win a
wonderful $5, if only two 1's, you win $2, while if only one 1 is rolled, you
will still win $1. If, unfortunately, on the other hand, no 1's show up you pay
me only $1 .

Naturally, you are suspicious of my proposition, but it is much better to pinpoint the
reasons for your suspicions. What we need is to compute your expected value when you
play this game; this is equivalent to what your average performance is going to be. Of
course, if you are going to play this game just once, then it does not matter what you opt
to do, but as a long-range strategist, you need to compute.

The computation is just common sense. You are basically asking: suppose I played the
game so many times, what would happen? What is the appropriate random
variable?

In any one roll, you can win either 5, 2 or 1 dollars, or you can lose 1. We know that if
we roll three dice, there are 216 different rolls. Among those rolls, three 1's would show
up once, while two 1's would show up 15 times (\( 15 = 3 \times 5 \); the 3 is the number of
options for which two dice are going to show the two 1's, while the 5 is what the other die
is going to show). How many times does one 1 show up? Choose which die shows the 1
(3 options), and then choose what the other two dice show (\( 5 \times 5 \)) for a total of 75 times.
Finally, no 1's will show 125 times (which is \( 5 \times 5 \times 5 = 216 - 1 - 15 - 75 \)). So the random variable is

X   5       2        1        -1
P   1/216   15/216   75/216   125/216

So if you play the proverbial 216 rolls, you will win $5 once, $2 exactly 15 times, and $1 on 75
occasions, but 125 times you will lose $1, so your winnings are
\[ 5 \times 1 + 2 \times 15 + 1 \times 75 - 1 \times 125 = -15 \]
or equivalently, your expectation is
\[ 5\left(\tfrac{1}{216}\right) + 2\left(\tfrac{15}{216}\right) + 1\left(\tfrac{75}{216}\right) - 1\left(\tfrac{125}{216}\right) = \frac{-15}{216} \approx -6.9\text{¢}. \]
So on the average you will lose $15 in 216 rolls, or approximately 7¢ a roll. Naturally,
you would rather not play a game when your expectation is negative, unless you will
have so much fun you are willing to pay the fee.
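The counting argument can be confirmed by listing all 216 rolls. The following sketch is illustrative only; the dictionary name `PAYOFF` is an assumption of this example, with the dollar amounts taken from the rules of the game.

```python
from fractions import Fraction
from itertools import product

# Dollars won for each possible number of 1's among the three dice.
PAYOFF = {3: 5, 2: 2, 1: 1, 0: -1}

# Count how many of the 216 equally likely rolls show each number of 1's.
counts = {k: 0 for k in PAYOFF}
for roll in product(range(1, 7), repeat=3):
    counts[roll.count(1)] += 1

total_rolls = sum(counts.values())
# Expectation as the dot product of payoffs and probabilities.
expectation = sum(Fraction(PAYOFF[k] * counts[k], total_rolls) for k in counts)
print(counts, expectation, float(expectation))
```

The printed counts should reproduce the tallies 1, 15, 75 and 125, and the expectation the loss of 15 dollars per 216 rolls.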

Note that the computation is simply the dot product of the two rows, \(X\) and \(P\). And that is
exactly the definition of the expectation of a discrete random variable,
\[ E(X) = X \cdot P = \sum_i x_i p_i . \]

Note that this is a sum, so for a continuous random variable we should be seeing an integral,
\[ E(X) = \int_{-\infty}^{\infty} x f(x)\,dx . \]

Thus for \(U\), the uniform random variable on \([0,1]\), we get \( E(U) = \int_0^1 x\,dx = \tfrac12 \). In this
particular instance, the average did not carry a lot of information: it is just in the middle
of the density, the safest guess.

Often, the expectation is also referred to as \( \mu \), the Greek m, for mean.

Example 2. Carnival Game. A game is based on the following setup. A bowl contains 4
blue balls marked $100, $50, $20, and $10, respectively, 4 red balls marked $100, $50, $20, and
$10, respectively, 3 black balls marked $100, $50, and $20, 2 yellow balls marked $100 and $50,
and 1 green ball marked $100.

Blue     $100   $50   $20   $10
Red      $100   $50   $20   $10
Black    $100   $50   $20
Yellow   $100   $50
Green    $100

Then 3 balls are drawn simultaneously from the bowl at random. The contestant wins the difference between the
largest value drawn and the smallest value drawn. Thus, for example, if we draw 50, 50 and
10, the contestant wins 40.

Let \(X\) be the random variable that represents the earnings of the contestant. We start by
finding its distribution:

X   90       80        50       40       30       10      0
P   95/364   105/364   70/364   40/364   30/364   9/364   15/364

and then its expectation is
\[ E(X) = \frac{23040}{364} = \frac{5760}{91} \approx \$63.30 . \]
So if we had to pay $65 to play the game, it would not be profitable. On the other hand, (I
believe) most people would not pay $60 to play, and yet that would be a profitable
scheme (with enough capital behind us).

But we can take the opportunity to introduce other measurements for this random
variable. The simplest notion is that of MODE: the outcome with highest probability. In
this case it is $80.

In the continuous case, the mode is represented by the point(s) where the density achieves
its maximum value. Thus, for the standard uniform, U, every point between 0 and 1
represents the mode. On the other hand, for the sum of two independent standard
uniforms, the mode was 1.

Then the third measure of central tendency is the MEDIAN: This is the outcome that is in
the middle—50% chance above and 50% chance below. One way to compute it in the
example is to think of the 364 possible outcomes. If these are listed in order, then the
middle will be between the 182nd position and the 183rd position. Since both of these are
occupied by 80, that is our median. If the middle positions had been occupied by different
numbers, the median would have been the average of the two numbers.

The median of the standard uniform is also \( \tfrac12 \). One fact worth mentioning is that
the mean, median and mode typically fall in alphabetical order, either
increasing or decreasing.
Thus in the example, the mean was approximately $63 while the median and the mode were
both $80.

Often the RANGE is simply given as the interval in which the variable has probability
bigger than 0. Thus, in the gaming example, the range is 0 to 90. Another way to
describe the range is by giving the MIN and MAX values of the random variable (of
course only if these values are finite).

Before we discuss the most recent of these ideas, and the most fundamental measure of
spread, we need to expand further on random variables. Given a discrete random variable
\(X\), the random variable \(X^2\) simply takes the squares of the values of \(X\) with the same
probability (or the same density). Namely, in table form for the
discrete case, we have

X     ...   x     ...
X^2   ...   x^2   ...
P     ...   p     ...

Thus,
\[ E(X^2) = \sum_x x^2 p_x \]
in the discrete case.

In the next section, we will see how to extend the concept of \(X^2\) to the continuous case,
but what we require now is \( E(X^2) \), and that concept is easily extended to
\[ E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx . \]

Now we are ready for a fundamental concept, VARIANCE: The variance is the
expectation of the square minus the square of the expectation. Thus,
\[ V(X) = E(X^2) - E(X)^2 . \]

STANDARD DEVIATION: The most crucial of all measurements of dispersion of the
outcomes, it is simply the square root of the variance. It is commonly denoted by \( \sigma \),
or \( \sigma_X \) when necessary.

Note that since the standard deviation is the square root of the variance, it is necessary
that the variance is a nonnegative number. We will indeed prove that is the case
below.

Thus in our gaming example, the random variable \(X^2\) has distribution:

X^2   8100     6400      2500     1600     900      100     0
P     95/364   105/364   70/364   40/364   30/364   9/364   15/364

so its expectation is \( \frac{1708400}{364} \approx 4693.41 \).

So the variance is \( V(X) = 4693.41 - \left(\tfrac{5760}{91}\right)^2 \approx 686.93 \), and the standard deviation is then
\( \sigma = \sqrt{686.93} \approx 26.21 \). It is measured in the same units as the mean, in this case, dollars.
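The whole carnival example can be recomputed by listing every possible draw of 3 balls out of 14. The sketch below is an illustration rather than the author's method; the list `BALLS` encodes the ball values from the setup, and the draws are indexed by position so that balls with equal values remain distinct.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import sqrt

# Ball values by color, as listed in the setup of the game.
BALLS = (
    [100, 50, 20, 10]      # blue
    + [100, 50, 20, 10]    # red
    + [100, 50, 20]        # black
    + [100, 50]            # yellow
    + [100]                # green
)

# Every unordered draw of 3 distinct balls, by position in BALLS.
draws = list(combinations(range(len(BALLS)), 3))
winnings = Counter(
    max(BALLS[i] for i in draw) - min(BALLS[i] for i in draw) for draw in draws
)

total = len(draws)
mean = sum(Fraction(x * n, total) for x, n in winnings.items())
mean_square = sum(Fraction(x * x * n, total) for x, n in winnings.items())
variance = mean_square - mean**2
mode = max(winnings, key=winnings.get)
ordered = sorted(winnings.elements())  # all outcomes listed in order
median = (ordered[total // 2 - 1] + ordered[total // 2]) / 2

print(sorted(winnings.items(), reverse=True))
print(float(mean), float(mean_square), float(variance), sqrt(variance))
print(mode, median)
```

The printed tallies, mean, mode and median can be compared with the figures quoted in the example, and the last two numbers of the second line give the variance and the standard deviation.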

Example 3. Standard Uniform. Let \(U\) be the standard uniform, the randomizer. We
already have \( E(U) = \tfrac12 \), and the median equals the mean, while the mode is represented by
every point in the interval \([0,1]\). Obviously the support or range is that same interval. So
we need to compute the variance and the standard deviation. To compute the variance we
need to compute \( E(U^2) \). But
\[ E(U^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx = \int_0^1 x^2\,dx = \frac13 . \]
Thus
\[ V(U) = E(U^2) - E(U)^2 = \frac13 - \frac14 = \frac1{12}, \quad\text{and so}\quad \sigma_U = \frac{1}{2\sqrt3} . \]

In summary, the parameters introduced are given in the following table:

Name                 Definition                    Importance
Expectation          E(X) = X · P                  High
Median               50-50                         Moderate
Mode                 Most Popular                  Moderate-Low
Support or Range     Max-Min                       Moderate
Variance             V(X) = E(X^2) - E(X)^2        High
Standard Deviation   σ = sqrt(V(X))                High

Example 4. Being hired by a small firm whose average salary is $100,000 sounds very
exciting. But pursuing it further, we learn that there are 13 employees in the firm: the
boss makes $1,000,000, his two vice presidents (his two daughters) make
$100,000 each, and the remaining 10 employees make $10,000 per
person. So the distribution of salaries is

S   10^6   10^5   10^4
P   1/13   2/13   10/13

Indeed, \( E(S) = 100{,}000 \), but the median and the mode are both $10,000.

More importantly, the standard deviation is approximately $261,798, a shockingly large
number.

Example 5. The Roll of the Dice. Consider the roll of two dice:

X   2       3       4       5       6       7       8       9       10      11      12
P   1/36    2/36    3/36    4/36    5/36    6/36    5/36    4/36    3/36    2/36    1/36
≈   .0278   .0556   .0833   .1111   .1389   .1667   .1389   .1111   .0833   .0556   .0278

then
\[ E(X) = \frac{2\cdot1 + 3\cdot2 + 4\cdot3 + 5\cdot4 + 6\cdot5 + 7\cdot6 + 8\cdot5 + 9\cdot4 + 10\cdot3 + 11\cdot2 + 12\cdot1}{36} = 7 . \]
The mode is 7 also, and since the distribution is symmetric about 7, the median has to be 7
as well. To compute the variance, we need as usual \( E(X^2) \), which equals
\[ E(X^2) = \frac{4\cdot1 + 9\cdot2 + 16\cdot3 + 25\cdot4 + 36\cdot5 + 49\cdot6 + 64\cdot5 + 81\cdot4 + 100\cdot3 + 121\cdot2 + 144\cdot1}{36} = \frac{1974}{36} \approx 54.8333 . \]
Hence \( V(X) = 5.8333 \) and \( \sigma = 2.4152 \).

Let us next revisit a couple of examples from the last section.

Example 6. Lifetime. A certain type of electronic device has a lifetime \(T\) with density
\[ f(t) = \begin{cases} \dfrac{3{,}000}{t^4} & t \ge 10 \\ 0 & \text{otherwise} \end{cases} . \]
Since the density is a decreasing function, the mode is easily seen to be 10. Now
\[ E(T) = \int_{10}^{\infty} t\,\frac{3000}{t^4}\,dt = -\frac{1500}{t^2}\Big|_{10}^{\infty} = 15 , \]
so we expect the device to live 15 hours.

[Figure: graph of the density f(t)]

The computation of the median is interesting: we need to find \(a\) so that
\[ \int_{10}^{a} \frac{3000}{t^4}\,dt = \frac12 . \]
Or equivalently, \( F(a) = \tfrac12 \), and since we computed \( F(a) = 1 - \frac{1000}{a^3} \), we get that the
median is \( \sqrt[3]{2000} \approx 12.60 \), so 50% of the devices will not last past 12.60 hours, in
between the mode and the mean (as expected).

To compute \( E(T^2) \) we proceed as usual:
\[ E(T^2) = \int_{10}^{\infty} t^2\,\frac{3000}{t^4}\,dt = -\frac{3000}{t}\Big|_{10}^{\infty} = 300 , \]
and so \( V(T) = 300 - 15^2 = 75 \), and so \( \sigma_T \approx 8.66 \).
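The attributes of the lifetime \(T\) can also be checked numerically. The sketch below is one possible approach and not the method of the text: the helper `tail_integral` is a hypothetical name, and it handles the infinite range by the substitution \( t = a/u \), which maps \( [a, \infty) \) onto \( (0, 1] \) before applying the midpoint rule.

```python
def tail_integral(g, a: float, steps: int = 100_000) -> float:
    """Approximate the integral of g over [a, infinity).

    With t = a / u the integral becomes that of g(a / u) * a / u^2
    over (0, 1], which is evaluated with the midpoint rule.
    """
    total = 0.0
    for k in range(steps):
        u = (k + 0.5) / steps
        t = a / u
        total += g(t) * a / (u * u)
    return total / steps


def density(t: float) -> float:
    """Lifetime density f(t) = 3000 / t^4 for t >= 10, and 0 otherwise."""
    return 3000.0 / t**4 if t >= 10.0 else 0.0


mean = tail_integral(lambda t: t * density(t), 10.0)
mean_square = tail_integral(lambda t: t * t * density(t), 10.0)
variance = mean_square - mean**2
median = 2000.0 ** (1.0 / 3.0)  # solves 1 - 1000 / a^3 = 1/2
print(mean, median, variance, variance**0.5)
```

The four printed values correspond to the mean, the median, the variance and the standard deviation of \(T\).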

Example 7. A Mixed Problem. A car is to be driven by a mechanic to a parking place
where it will be left until the chauffeur picks it up. Let \(W\) be the waiting time for the
chauffeur. The density of \(W\) is given by
\[ f_W(w) = \begin{cases} \tfrac12 & w = 0 \\ 1 - w & 0 < w \le 1 \\ 0 & \text{otherwise} \end{cases} . \]

The median is clearly 0, and so is the mode. For the expectation, we need to compute
\[ E(W) = 0 \cdot \tfrac12 + \int_0^1 t(1-t)\,dt = \frac16 . \]

Note \( P(W \le \tfrac16) = \tfrac{47}{72} \approx 65.28\% \). The support is, of course, 0 to 1. Now
\[ E(W^2) = 0 \cdot \tfrac12 + \int_0^1 t^2(1-t)\,dt = \frac1{12} , \]
so \( V(W) = \frac1{12} - \frac1{36} = \frac1{18} \).

Decisions based on expectations are made in all kinds of contexts.

Example 8. Insurance. Accident records collected by an automobile insurance company
state that for a given group of drivers, the probability of getting into an accident is 15%.
If a driver is in an accident, 80% of the time the car receives damage worth 20% of its value, while 12%
of the time the car's repairs amount to 60% of the value of the car. On the
remaining occasions, the damage is total. A customer seeks to insure a $20,000 car. What
is the expected loss to the insurance company? Again, let us observe our random variable.

X   0     4,000   12,000   20,000
P   .85   .12     .018     .012

and so \( E(X) = 4000 \times 0.12 + 12000 \times 0.018 + 20000 \times 0.012 = \$936 \). Thus the smallest fee
the insurance company should charge is $936.

But the notion of expectation can be used in even more interesting ways:

Example 9. Salvage. A company provides a salvage service. If a piece of equipment is


lost under water, the company will send divers to recover it. In the past, the divers have a
70% recovery rate per day. That is, if a team of divers is sent, they will recover the
equipment in one day 70% of the time. It costs the company $50,000 to send a team of
divers for the day. When multiple teams are used, since they do not interact with each
other, the company has always treated them as independent random variables.

The salvage company is contacted by an oil conglomerate—they are to recover urgently


(within one day) a piece of equipment worth $900,000.

The first question has little to do with expectation. How many teams should the salvage
company send in order to have at least a 99% chance of recovering the equipment? If \(n\)
teams are sent, the probability that the equipment will not be recovered is \( (.3)^n \), so if we
send three teams, we get a \( 1 - (.3)^3 = 1 - .027 = .973 \) chance of recovery, not enough. But if
we send 4 teams, we get \( 1 - (.3)^4 = 1 - .0081 = .9919 \), so 4 teams will suffice. Of course,
the more teams we send the more likely it will be that the piece of equipment will be
recovered.

A different question is how to maximize the expectation for the oil
conglomerate. Suppose \(n\) teams of divers are sent. What can the oil
company expect to happen? Its cost is clear, $50000n. On the other
hand, they can expect $900,000 with probability \( 1 - .3^n \) and 0 otherwise.
So if we let \(X\) be the random variable expressing the oil conglomerate's
return, we get that
\[ E(X) = 900000\,(1 - .3^n) - 50000n . \]
The table exemplifies \( E(X) \) for various values of \(n\).

n    E(X)
1    580,000
2    719,000
3    725,700
4    692,710
5    647,813
6    599,344
7    549,803
8    499,941
9    449,982
10   399,995

Thus, the maximum expectation is $725,700 when we send three teams of divers. Sending
more teams does not increase the expectation.
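Both salvage questions reduce to evaluating two short formulas for each \(n\), which the following sketch does. The constant and function names are illustrative assumptions; the numbers come from the statement of the example.

```python
FAILURE_RATE = 0.3         # chance that one team fails to recover the item in a day
EQUIPMENT_VALUE = 900_000  # dollars
TEAM_COST = 50_000         # dollars per team per day


def recovery_probability(teams: int) -> float:
    """Chance that at least one of the independent teams succeeds."""
    return 1.0 - FAILURE_RATE**teams


def expected_return(teams: int) -> float:
    """E(X) = 900000 (1 - .3^n) - 50000 n."""
    return EQUIPMENT_VALUE * recovery_probability(teams) - TEAM_COST * teams


# Smallest number of teams giving at least a 99% chance of recovery.
teams_for_99 = next(n for n in range(1, 100) if recovery_probability(n) >= 0.99)
# Number of teams (among 1..10) with the largest expected return.
best = max(range(1, 11), key=expected_return)

for n in range(1, 11):
    print(n, round(expected_return(n)))
print(teams_for_99, best)
```

The loop prints the same kind of table as in the text, and the last line gives the answers to the two questions in order.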

One of the reasons that expectations are more important than modes or medians is that
there are several important theorems that are true about expectation, as we will see
in a future section.

Example 10. The Simplest Random Variable. If \( X = a \), a constant, then immediately
the mode, mean and median of \(X\) are all that constant. But also \( E(X^2) = a^2 \), so \( V(X) = 0 \).
We will see in the next section that the only random variables with zero variance are
constants.

Example 11. The Bernoulli \(B_p\), which has distribution

B_p   0       1
P     1 - p   p

Then \(B_p\) has expectation \(p\), and since \( B_p^2 = B_p \), \( E(B_p^2) = p \) and so
\( V(B_p) = p - p^2 = pq \), where \( q = 1 - p \). Its mode and its median depend on whether \( p > .5 \) or not.

• New Variables from Old

In a previous section, random variables were introduced. As mentioned then, it is a


fundamental concept. In the last section we introduced the square of a random variable,
and in this section we introduce arbitrary functions of a random variable.

The easiest new random variable is one created from an old random variable via
multiplication by a scalar.

Example 1. The Roll of the Dice. Let \(X\) denote the roll of 2 dice. Suppose we play a
game where we get paid triple our roll. Let \(Y\) denote the random variable that represents our
winnings; then certainly \( Y = 3X \), and their distributions will be closely
associated:

X   2      3      4      5      6      7      8      9      10     11     12
Y   6      9      12     15     18     21     24     27     30     33     36
P   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

As exemplified, the basic relation between X and Y = aX is that for any number n,
P ( X = n ) = P (Y = an) in the discrete case. Clearly,
E (Y ) = E (3X ) = 3X ⋅P = 3( X ⋅ P ) = 3E ( X )
and this claim is generic, E ( aX ) = aE ( X ) . This is at least obvious in the discrete case—
it will become so also in the continuous case soon. Also a little thought will show that the
median of Y is 3 times the median of X , and the same is true for the mode.

Of course the range of \(Y\) is the collection of triple values of the range of \(X\), as is easily
observed from the table. Now trivially \( Y^2 = 9X^2 \), so \( E(Y^2) = 9E(X^2) \), and finally, we
get
\[ V(Y) = E(Y^2) - E(Y)^2 = 9E(X^2) - (3E(X))^2 = 9\left(E(X^2) - E(X)^2\right) = 9V(X) , \]
so \( \sigma_Y = 3\sigma_X \).

Suppose now that \(X\) is continuous with density \( f(x) \). What do we know about the
random variable \( Y = 3X \)? Certainly we know that for any number \(y\),
\[ P(Y \le y) = P(3X \le y) = P\left(X \le \tfrac{y}{3}\right) , \]
and so \( F_Y(y) = F_X\left(\tfrac{y}{3}\right) \), and so \( f_Y(y) = \tfrac13 f_X\left(\tfrac{y}{3}\right) \), by the chain rule. Let us consider a
specific example.

Example 2. Lifetime. A certain type of electronic device has had so far a lifetime \(T\) with
density
\[ f(t) = \begin{cases} \dfrac{3{,}000}{t^4} & t \ge 10 \\ 0 & \text{otherwise} \end{cases} . \]
The manufacturers now claim their new product has tripled the life of the old devices, so
the new lifetime is \( N = 3T \). What do we know about \(N\)? We know its support is all
numbers \( \ge 30 \), and if \( n \ge 30 \), then
\[ F_N(n) = P(N \le n) = P\left(T \le \tfrac{n}{3}\right) = F_T\left(\tfrac{n}{3}\right) = 1 - \frac{1000}{\left(\frac{n}{3}\right)^3} = 1 - \frac{27000}{n^3} . \]
Hence \( f_N(n) = \frac{81000}{n^4} = \tfrac13 f_T\left(\tfrac{n}{3}\right) \). Of course the mode of \(N\) is 30, three times the mode
of \(T\), and
\[ E(N) = \int_{-\infty}^{\infty} n f_N(n)\,dn = \int_{30}^{\infty} n\,\frac{81000}{n^4}\,dn = -\frac{40500}{n^2}\Big|_{30}^{\infty} = 45 . \]
So \( E(N) = 3E(T) \). But the astute reader may easily recognize a much better way to do
this computation by substituting \( n = 3t \) in the integral:
\[ E(N) = \int_{30}^{\infty} n\,\frac{81000}{n^4}\,dn = \int_{10}^{\infty} 3t\,\frac{81000}{(3t)^4}\,3\,dt = 3\int_{10}^{\infty} t\,\frac{3000}{t^4}\,dt = 3E(T) . \]
It is clear why we work with the distribution: it refers to actual probabilities, whereas the density
only refers to potentials.

Similarly, for the median, the median \(a\) of \(N\) satisfies \( \int_{30}^{a} \frac{81000}{n^4}\,dn = \frac12 \), and with the same
substitution we get
\[ \int_{10}^{a/3} \frac{81000}{81t^4}\,3\,dt = \int_{10}^{a/3} \frac{3000}{t^4}\,dt = \frac12 , \]
so \( \tfrac{a}{3} \) is the median of \(T\), and so again
we have that the median of \(N\) is three times the median of \(T\).

Finally, since \( N^2 = 9T^2 \), we should get \( E(N^2) = 9E(T^2) \), and indeed
\[ E(N^2) = \int_{30}^{\infty} n^2\,\frac{81000}{n^4}\,dn = \int_{30}^{\infty} \frac{81000}{n^2}\,dn = -\frac{81000}{n}\Big|_{30}^{\infty} = 2700 = 9E(T^2) \]
and so as before, \( V(N) = 9V(T) \).

And we have basically argued


Theorem. Scalar Multiples. Let \(X\) be a random variable and let \(a\) be a
constant. Then the random variable \( Y = aX \) satisfies \( E(Y) = aE(X) \), the
median of \(Y\) is the median of \(X\) multiplied by \(a\), and the same is true of
the mode. Moreover, \( V(Y) = a^2 V(X) \), so \( \sigma_Y = |a|\,\sigma_X \).
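The claims about scalar multiples can be checked on the roll of two dice with exact arithmetic. The sketch below is illustrative; the helpers `expectation`, `variance` and `scale` are assumed names, and `scale` simply multiplies each value by the constant while keeping its probability, which is how \( aX \) is described for the discrete case.

```python
from fractions import Fraction
from itertools import product

# Distribution of X, the sum of two dice: value -> probability.
dist_x = {}
for first, second in product(range(1, 7), repeat=2):
    total = first + second
    dist_x[total] = dist_x.get(total, Fraction(0)) + Fraction(1, 36)


def expectation(dist):
    """E(X) as the dot product of values and probabilities."""
    return sum(value * prob for value, prob in dist.items())


def variance(dist):
    """V(X) = E(X^2) - E(X)^2."""
    mean = expectation(dist)
    return sum(value**2 * prob for value, prob in dist.items()) - mean**2


def scale(dist, a):
    """Distribution of aX for a nonzero constant a: same probabilities, scaled values."""
    return {a * value: prob for value, prob in dist.items()}


dist_y = scale(dist_x, 3)
print(expectation(dist_x), expectation(dist_y))
print(variance(dist_x), variance(dist_y))
```

The second expectation should be three times the first, and the second variance nine times the first.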

Observe that we also showed that if \(X\) has density \( f_X(x) \) and distribution \( F_X(x) \), then, for \( a > 0 \),
\( Y = aX \) has density and distribution given respectively by
\[ f_Y(y) = \tfrac1a f_X\left(\tfrac{y}{a}\right) \quad\text{and}\quad F_Y(y) = F_X\left(\tfrac{y}{a}\right) . \]
Actually, note that it was much easier to find the distribution first (from which we got the
density by differentiating), rather than the other way around.

The next level for building new random variables is that of adding a constant to a random
variable: \( Y = X + b \). This is often known as a translation (in geometry), and clearly the
support of the random variable is being so transformed. For example, in the roll of two
dice, suppose somebody claimed they would give you two dollars over the roll. Thus if
\(X\) is the roll of the dice, then the winnings would be represented by \( Y = X + 2 \). And the
distribution is given by

X   2      3      4      5      6      7      8      9      10     11     12
Y   4      5      6      7      8      9      10     11     12     13     14
P   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

Clearly the mode of \(Y\) is the same as the mode of \(X\) plus 2, and so is the median. For
the expectation, the calculation is interesting:
\[ E(Y) = \sum_i y_i p_i = \sum_i (x_i + 2) p_i = \sum_i x_i p_i + \sum_i 2 p_i = E(X) + 2 \]
since \( \sum_i p_i = 1 \). What about \( E(Y^2) \)? Then
\[ E(Y^2) = \sum_i y_i^2 p_i = \sum_i (x_i + 2)^2 p_i = \sum_i x_i^2 p_i + \sum_i 4 x_i p_i + \sum_i 4 p_i = E(X^2) + 4E(X) + 4 . \]
So
\[ V(Y) = E(Y^2) - E(Y)^2 = E(X^2) + 4E(X) + 4 - (E(X) + 2)^2 = E(X^2) - E(X)^2 = V(X) , \]
namely the variance does not change! Below we will see an alternate definition of
variance that will point out clearly the reason for this last claim.

Let us move to the continuous example. Suppose the manufacturer of our devices
now promises that the devices will last 2 hours longer than before. Again, let \(N\) be the
lifetime of our new devices, so now \( N = T + 2 \). So our range (or support) will now
be all numbers greater than or equal to 12. Again, as before, the next easiest information
to gather about \(N\) is its cumulative distribution: if \( n \ge 12 \), then
\[ F_N(n) = P(N \le n) = P(T + 2 \le n) = F_T(n - 2) = 1 - \frac{1000}{(n-2)^3} , \]
so the density satisfies
\[ f_N(n) = f_T(n - 2) = \frac{3000}{(n-2)^4} . \]
Thus clearly, the mode of \(N\) is the mode of \(T\) plus 2. For the expectation we
get
\[ E(N) = \int_{12}^{\infty} n\,\frac{3000}{(n-2)^4}\,dn = \int_{10}^{\infty} (t+2)\,\frac{3000}{t^4}\,dt = \int_{10}^{\infty} t\,\frac{3000}{t^4}\,dt + 2\int_{10}^{\infty} \frac{3000}{t^4}\,dt = E(T) + 2 \]
by making the substitution \( n = t + 2 \), and also using the fact that \( \int_{10}^{\infty} \frac{3000}{t^4}\,dt = 1 \).

By a similar substitution, we get that the median of \(N\) is the median of \(T\) increased by 2
(hours).

In a similar fashion to the discrete case, one can show that the variance of \(N\) is the same
as the variance of \(T\), and by extending our reasoning to an arbitrary constant, we get a
nice theorem, which receives its name from the fact that transformations of the form
\( x \mapsto ax + b \) are called affine:

Theorem. Affine Transformations. Let \(X\) be a random variable and let \(a\)
and \(b\) be constants. Consider the random variable \( Y = aX + b \). Then
\( E(Y) = aE(X) + b \), and the same relationship holds between the mode of
\(Y\) and the mode of \(X\), and between the median of \(Y\) and the median of
\(X\). Moreover, \( V(Y) = a^2 V(X) \) and so \( \sigma_Y = |a|\,\sigma_X \). Finally, if \(X\) is
continuous with density \( f_X(x) \) and distribution \( F_X(x) \), and \( a > 0 \), then \(Y\) has
density and distribution given respectively by \( f_Y(y) = \tfrac1a f_X\left(\tfrac{y-b}{a}\right) \) and
\( F_Y(y) = F_X\left(\tfrac{y-b}{a}\right) \).
Proof. Only the last remark is worth arguing. Introduce \( Z = aX \), so \( Y = Z + b \). Then from
above we know \( F_Z(z) = F_X\left(\tfrac{z}{a}\right) \), and since \( F_Y(y) = F_Z(y - b) \), \( F_Y(y) = F_X\left(\tfrac{y-b}{a}\right) \), and
so \( f_Y(y) = \tfrac1a f_X\left(\tfrac{y-b}{a}\right) \). ∎

Example 3. Uniforms. From now on, we will let \( U[p,q] \) denote the uniform random
variable on the interval \( [p, q] \), so its density is given by
\[ f_{U[p,q]}(x) = \begin{cases} \dfrac{1}{q-p} & p \le x \le q \\ 0 & \text{otherwise} \end{cases} . \]
As mentioned above, in the special case of the unit interval, \( [p, q] = [0,1] \), we will use
simply \(U\).

Consider the random variable \( Y = (q - p)U + p \). Then the range of \(Y\) is given by the set
of numbers of the form \( (q - p)x + p \) where \(x\) is any number between 0 and 1, but then
this set is nothing but the interval \( [p, q] \). And in that set the density is given by
\[ f_Y(y) = \frac{1}{q-p}\, f_U\left(\frac{y-p}{q-p}\right) = \frac{1}{q-p} , \]
since \( f_U\left(\frac{y-p}{q-p}\right) = 1 \) as \( 0 \le \frac{y-p}{q-p} \le 1 \),
and so \( Y = U[p,q] \). Hence we get that \( E\left(U[p,q]\right) = (q - p)\tfrac12 + p = \frac{q+p}{2} \), the midpoint of
the interval. Also the median is located there. For the variance, we have
\[ V\left(U[p,q]\right) = V\left((q-p)U + p\right) = (q-p)^2\, V(U) = \frac{(q-p)^2}{12} . \]

The fact that \( U[p,q] = (q - p)U + p \) is very useful since most numerical programs come
equipped with a randomizer which is most often \(U\). Thus, if we wanted to model some
other uniform, say \( U[-1,2] \), we could easily build a table of random
values by using \(U\). For example, here are 10 random values for \(U\)
and the corresponding 10 values for \( U[-1,2] = 3U - 1 \).

U        U[-1,2]
0.8167    1.4502
0.0714   -0.7859
0.2610   -0.2171
0.0167   -0.9499
0.3007   -0.0980
0.9018    1.7055
0.2681   -0.1956
0.7096    1.1289
0.3979    0.1936
0.6657    0.9970

It has not been stated explicitly, but it is clear that for any random
variable whose range is finite, in other words, one that is bounded
above and below, its expectation will lie in the interval
between the highest and lowest values. Of course, the same is
true for the mode and the median. In particular, if a random
variable takes only nonnegative values, its expectation can only be nonnegative.
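The recipe \( U[p,q] = (q - p)U + p \) is easy to try with any randomizer. The sketch below uses Python's standard `random` module as the source of \(U\); the function name `sample_uniform`, the seed and the sample size are arbitrary choices made for this illustration.

```python
import random


def sample_uniform(p: float, q: float, rng: random.Random) -> float:
    """Turn one draw of the standard uniform U into a draw of U[p, q]."""
    u = rng.random()  # the randomizer: a value of U in [0, 1)
    return (q - p) * u + p


rng = random.Random(2024)  # fixed seed so that the run is repeatable
draws = [sample_uniform(-1.0, 2.0, rng) for _ in range(100_000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((x - sample_mean) ** 2 for x in draws) / len(draws)
print(min(draws), max(draws), sample_mean, sample_var)
```

All draws should fall between -1 and 2, with the sample mean near the midpoint \( \tfrac12 \) and the sample variance near \( \tfrac{(q-p)^2}{12} = .75 \), in line with the formulas of the uniform example.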

In particular, the square of any random variable will have a nonnegative
expectation. And it is to squaring (and exponentiation in general) that we go next. Most
interesting among all powers of a random variable is its square, and it already has played
a considerable role in the definition of the variance. But in general all positive integer
powers are of some consequence.

The discrete case is easiest to describe. For example, let \(X\) be the roll of the dice, and
let \( Y = X^2 \) and \( Z = X^3 \); then their distributions are trivially given by

X   2      3      4      5      6      7      8      9      10     11     12
Y   4      9      16     25     36     49     64     81     100    121    144
Z   8      27     64     125    216    343    512    729    1000   1331   1728
P   1/36   2/36   3/36   4/36   5/36   6/36   5/36   4/36   3/36   2/36   1/36

And it is clear that the mode of \(Y\) is the square of the mode of \(X\), while the mode of \(Z\)
is the cube of the mode of \(X\) in this case. Similarly for the median: the median of the
square is the square of the median (and the same is true for the cube). But the same is not true
of the expectation: \( E(X) = 7 \), while \( E(Y) = \frac{1974}{36} \approx 54.83 \) and \( E(Z) = \frac{16758}{36} = 465.5 \).

Of course we know that if \( E(X^2) = E(X)^2 \), then \( V(X) = 0 \), and we have mentioned
before that this only happens if \( X = a \), a constant.

On the other hand, if \(X\) is given by

X   -1   0    1
P   .3   .4   .3

then \(X^2\) is given by

X^2   0    1
P     .4   .6

So the mode and the median do not satisfy the above stated rules in all cases.

What happens in the continuous case? At the density level, it is similar to the discrete case:
namely, we see the density of \(X\) under the value of \(X\),
while \( Y = X^2 \) has the square value:

X         ...   x        ...
Y = X^2   ...   x^2      ...
f_X       ...   f_X(x)   ...

Except this does not give us the density of \( Y = X^2 \). It does
however allow us to compute, as we have done above, the
expectation of \( Y = X^2 \) by simply
\[ E(X^2) = \int_{-\infty}^{\infty} x^2 f(x)\,dx . \]

Similarly, \( E(X^3) \) is simply given by \( E(X^3) = \int_{-\infty}^{\infty} x^3 f(x)\,dx \).

We now extend our thinking to an arbitrary function of a random variable. So if \(g\) is an
arbitrary function, what do we mean by \( g(X) \) if \(X\) is a random variable? Naturally, what
is required is that if an interval of the line has probability (under \(X\)) greater than 0 of
occurring, then that interval has to be part of the domain of the function \(g\). For example,
if we let \( g(x) = \sqrt{x} \), then certainly we could not discuss \( g\left(U[-1,1]\right) \) because we could not
input into the function most points in the interval \( [-1, 0] \). On the other hand, we could
discuss \( \frac{1}{U} \) where \(U\) is the standard uniform, since only one point is problematic, and in
continuous random variables, singleton points do not count.

For a discrete random variable, the distribution of \( Y = g(X) \) is easy to define. Namely,
\[ P(Y = b) = \sum_{g(a) = b} P(X = a) . \]
For example, let \( P(X = i) = \frac{1}{10} \) for \( i = 1, 2, \ldots, 10 \), and 0
everywhere else. Let \( g(x) = \left| x^3 - 15x^2 + 74x - 120 \right| \), and
\( Y = g(X) \). Then the distribution for \(Y\) is given by

Y   0      6      24     60     120
P   3/10   2/10   2/10   2/10   1/10

since \( g(4) = g(5) = g(6) = 0 \), \( g(3) = g(7) = 6 \), \( g(2) = g(8) = 24 \), \( g(1) = g(9) = 60 \)
and \( g(10) = 120 \).
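The rule \( P(Y = b) = \sum_{g(a) = b} P(X = a) \) amounts to grouping the probabilities of \(X\) by the value of \(g\). The sketch below illustrates this on the example; it assumes that \(g\) is the absolute value of the cubic, which is what makes the listed values such as \( g(3) = g(7) = 6 \) come out.

```python
from collections import defaultdict
from fractions import Fraction


def g(x: int) -> int:
    # Absolute value is an assumption of this sketch, chosen to match
    # the listed values g(3) = g(7) = 6, g(2) = g(8) = 24, and so on.
    return abs(x**3 - 15 * x**2 + 74 * x - 120)


# X is equally likely to be any of 1, 2, ..., 10.
dist_x = {i: Fraction(1, 10) for i in range(1, 11)}

# P(Y = b) collects P(X = a) over all a with g(a) = b.
dist_y = defaultdict(Fraction)
for a, prob in dist_x.items():
    dist_y[g(a)] += prob

for b in sorted(dist_y):
    print(b, dist_y[b])
```

Each printed line gives one value of \(Y\) together with its probability as a reduced fraction.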

For a continuous random variable, we get the same kind of idea about the density as we
did above. Explicitly we see the value of \(X\) above the
corresponding value of \( g(X) \), with the density of \(X\),
\( f_X \), below them:

X          ...   x        ...
Y = g(X)   ...   g(x)     ...
f_X        ...   f_X(x)   ...

The cumulative distribution of the variable \( Y = g(X) \)
is perhaps more easily described. As usual, if \( F_Y(b) = P(Y \le b) \) and \( F_X(a) = P(X \le a) \),
we simply have, in a similar fashion to the discrete case,
\[ F_Y(b) = \int_{g(t) \le b} f_X(t)\,dt . \]

But one of the most useful conclusions is

Theorem. Expectation of a Function. Let \(X\) be a random variable. Let
\( Y = g(X) \) where \( g(x) \) is an arbitrary function. Then if \(X\) is discrete,
\( E(Y) = \sum_i g(x_i) p_i \), while if \(X\) is continuous with density \( f(x) \),
then \( E(Y) = \int_{-\infty}^{\infty} g(x) f(x)\,dx \).

Let us observe the discrete case is immediate, while the continuous case follows in a
similar manner.

When the variable is a mixed type, we have to use both ideas (as in an example we have
seen previously):

Example 4. A Mixed Problem. Returning to the problem of the mechanic and the
chauffeur, we have the density of the waiting time W being
$$f_W(w) = \begin{cases} \frac{1}{2} & w = 0 \\ 1 - w & 0 < w \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
Suppose the chauffeur requires as payment for the pickup: $P = 100e^{W}$ (this indicates a
high degree of sophistication on the driver) measured in dollars of course. Then what do
we expect to pay the chauffeur? We get
$$E(P) = 100\left(\tfrac{1}{2}\right) + \int_0^1 100 e^{w} (1 - w)\, dw = 50 + 100e - 200 = \$121.83.$$
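The mixed computation can be checked numerically. This is only a sketch: the atom at w = 0 is handled as a separate term and the continuous part is approximated with a simple midpoint rule (the helper `midpoint_integral` and the step count are our choices, not the text's).

```python
import math

def midpoint_integral(func, a, b, n=100_000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def payment(w):
    return 100 * math.exp(w)

point_mass_part = payment(0.0) * 0.5   # atom: P(W = 0) = 1/2
continuous_part = midpoint_integral(lambda w: payment(w) * (1 - w), 0.0, 1.0)
expected_payment = point_mass_part + continuous_part
print(round(expected_payment, 2))
```

The two pieces mirror the two terms of the formula above: a sum over the discrete part and an integral over the continuous part.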

Let us return one more time to the lifetime example.



Example 5. Lifetime. A certain type of electronic device has had so far a lifetime T with
density
$$f(t) = \begin{cases} \dfrac{3{,}000}{t^4} & t \ge 10 \\ 0 & \text{otherwise.} \end{cases}$$
The cost (in dollars) of running such a device (through its lifetime) is given by
$C = 2T^2 + 5T + 8$. What is the expected cost of having a machine until it runs out? We get
immediately:
$$E(C) = \int_{-\infty}^{\infty} (2t^2 + 5t + 8) f(t)\, dt = \int_{10}^{\infty} (2t^2 + 5t + 8) \frac{3000}{t^4}\, dt = \$683.$$
In a similar fashion we could have computed the expectation of any function g ( T ) .

Note however that although we know the expectation of C, the cost, we do not know its
distribution, nor its density, so for example if we wanted to know the probability that
the costs will exceed $825, at present we could not readily answer. We could proceed to
answer directly:
$$P(C \ge 825) = P(2T^2 + 5T + 8 \ge 825) = P(2T^2 + 5T - 817 \ge 0) = P\big((T - 19)(2T + 43) \ge 0\big)$$
$$= P(T \ge 19) = \int_{19}^{\infty} \frac{3000}{t^4}\, dt = .1457,$$
and therefore through astonishing algebraic fortunes we would have arrived at the
answer. Providentially, there is a much better process called the transformation
method, which is a useful one-variable technique.

Theorem. Transformation Method. Let X be a continuous random
variable with density $f_X$. Let $g : \mathbb{R} \to \mathbb{R}$ be a function for which its
derivative (at least in the support of X) is never 0. Consider $Y = g(X)$.
Then the range of Y is the image under g of the range of X, and if
$y = g(x)$, then
$$f_Y(y) = \frac{f_X(x)}{|g'(x)|}.$$
Proof. Let $y = g(x)$, and suppose first that g is increasing. Then readily
$P(Y \le y) = P(X \le x)$, and so $F_Y(y) = F_X(x)$, so $F_X(x) = F_Y(g(x))$, and so taking
derivatives with respect to x, by the chain rule, $f_X(x) = f_Y(g(x))\, g'(x)$, from which our
result follows. If g is decreasing, then $P(Y \le y) = P(X \ge x)$, the same computation gives
$f_X(x) = -f_Y(g(x))\, g'(x)$, and the absolute value accounts for the sign. ∎

Let us revisit the lifetime example. The transformation is the function $t \mapsto 2t^2 + 5t + 8$,
which has derivative $4t + 5$, which is never 0 for $t \ge 10$. Thus we can apply the theorem,
and if we let $c = 2t^2 + 5t + 8$, we obtain $f_C(c) = \dfrac{f_T(t)}{g'(t)}$. But now what is necessary is to
get rid of t. First we observe that the support of C is all numbers $c \ge 258$. In that range,
since $c = 2t^2 + 5t + 8$, we get $t = \dfrac{-5 \pm \sqrt{8c - 39}}{4}$, and since $t \ge 10$, we get
$$t = \frac{-5 + \sqrt{8c - 39}}{4}.$$
And so for any $c \ge 258$,
$$f_C(c) = \frac{3000 \times 256}{\sqrt{8c - 39}\,\left(\sqrt{8c - 39} - 5\right)^4}.$$
Thus to answer the question $P(C \ge 825) = ?$, we would simply integrate
$\int_{825}^{\infty} f_C(c)\, dc = .1457$, but we have the added ability to compute quickly $P(C \ge 1000)$ as
being $\int_{1000}^{\infty} f_C(c)\, dc = .1071$, and so forth. Indeed we now have as much control over C as
we do over T.
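As a sanity check on the density of C, one can integrate it numerically and compare with the answer obtained through T. The sketch below assumes the density $3000 \times 256 / \big(\sqrt{8c-39}\,(\sqrt{8c-39}-5)^4\big)$ on $c \ge 258$; the midpoint rule and the function names are illustrative choices of ours.

```python
import math

def f_C(c):
    """Density of C = 2T^2 + 5T + 8 from the transformation method (c >= 258)."""
    root = math.sqrt(8 * c - 39)
    return 3000 * 256 / (root * (root - 5) ** 4)

def midpoint_integral(func, a, b, n=200_000):
    """Approximate the integral of func over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(func(a + (i + 0.5) * h) for i in range(n))

def tail_direct(c):
    """P(C >= c) computed through T: solve for t, then use P(T >= t) = 1000 / t^3."""
    t = (-5 + math.sqrt(8 * c - 39)) / 4
    return 1000 / t**3

# P(C >= c) = 1 - (integral of f_C from 258 to c), which avoids an infinite range.
tail_825 = 1 - midpoint_integral(f_C, 258, 825)
tail_1000 = 1 - midpoint_integral(f_C, 258, 1000)
print(tail_825, tail_direct(825))
print(tail_1000, tail_direct(1000))
```

Both routes should agree to many decimal places, which is a quick way to catch an algebra slip when eliminating t.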

Example 6. Let $f_X(x) = \frac{2}{9}x$ in the range $0 \le x \le 3$. Consider the random variables
$Y = 5X + 6$ and $Z = -4X + 7$, both of which satisfy the hypothesis of the theorem. First,
the range of Y is 6 through 21 while that of Z is −5 through 7. Let $y = 5x + 6$ be in that
range, then
$$f_Y(y) = \frac{f_X(x)}{5} = \frac{\frac{2}{9}\left(\frac{y - 6}{5}\right)}{5} = \frac{2y - 12}{225}.$$
Similarly for the random variable Z, if $z = -4x + 7$, then
$$f_Z(z) = \frac{\frac{2}{9}\left(\frac{z - 7}{-4}\right)}{4} = \frac{14 - 2z}{144}.$$

Example 7. Powers. Let X have nonnegative range. Then for any positive real number r,
$Y = X^r$ is a random variable, and the theorem applies. Let $y = x^r = g(x)$, then
$g'(x) = r x^{r-1}$, so
$$f_Y(y) = \frac{f_X(x)}{r x^{r-1}} = \frac{f_X\left(y^{1/r}\right)}{r\, y^{1 - 1/r}},$$
since we would like our expression in terms of y alone.

Thus, for example, if we let X be as in the previous example, and let $r = 2$, then the range
of Y is $0 \le y \le 9$, and if $y = x^2$, then
$$f_Y(y) = \frac{\frac{2}{9}\, y^{1/2}}{2\, y^{1/2}} = \frac{1}{9},$$
and we obtain a uniform. On the other hand, if we let $Z = X^{1/2}$, then we obtain
$$f_Z(z) = \frac{\frac{2}{9}\, z^2}{\frac{1}{2}\, z^{-1}} = \frac{4z^3}{9}$$
in the range $0 \le z \le \sqrt{3}$.

Example 8. Uniforms Again. If U is the basic uniform, then what is the density or
distribution of $V = U^2$? We actually can do this computation two different ways. Recall
that on the range $0 \le u \le 1$, $F_U(u) = u$. Observe that V has the same support, so in that
range
$$F_V(v) = P(V \le v) = P(U^2 \le v) = P(U \le \sqrt{v}) = \sqrt{v}.$$
And so in that range (of course without 0, which only reaffirms the idea that singleton
points are insignificant and irrelevant to continuous random variables):
$$f_V(v) = \frac{1}{2\sqrt{v}}.$$
We know for example that $P(U \le \frac{1}{2}) = \frac{1}{2}$, but on the other hand,
$P(V \le \frac{1}{2}) = \frac{\sqrt{2}}{2} \approx 0.7071$. There is a bigger density at the lower numbers because as we
take squares in the interval $[0, 1]$ the numbers get smaller.

But, of course, we could also have used the transformation method since the derivative is
only 0 at the edge of the interval¹, so we get, a bit more directly,
$$f_V(v) = \frac{f_U(u)}{2u} = \frac{1}{2 v^{1/2}}.$$

Similarly, if we let $Z = \sqrt{U} = U^{1/2}$, by the transformation method we get
$$f_Z(z) = \frac{f_U(u)}{\frac{1}{2} u^{-1/2}} = \frac{1}{\frac{1}{2} z^{-1}} = 2z.$$

The graphs of the three densities are given by

[Figure: densities of U, V and Z on the interval from 0 to 1]

¹ Actually what is essential for the transformation method is that the function be increasing or decreasing
on the range.

Example 9. Let us again consider $g(x) = x^2$ and apply this function to the random variable
$X = U_{[-1,2]}$. Then if we let $Y = X^2$, then the range or support of Y is naturally the interval
$[0, 4]$. Let y be in this interval. Then as we saw above
$$F_Y(y) = \int_{t^2 \le y} f_X(t)\, dt.$$
There are two cases. If $y \le 1$, then
$$F_Y(y) = \int_{-\sqrt{y}}^{\sqrt{y}} f_X(t)\, dt = \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{3}\, dt = \frac{2\sqrt{y}}{3},$$
so for such a y, $f_Y(y) = \dfrac{1}{3\sqrt{y}}$. Equivalently, we could have reasoned
$$P\left(U_{[-1,2]}^2 \le y\right) = P\left(-\sqrt{y} \le U_{[-1,2]} \le \sqrt{y}\right) = \frac{2\sqrt{y}}{3}.$$
On the other hand, let $1 \le y \le 4$, then
$$P\left(U_{[-1,2]}^2 \le y\right) = P\left(-1 \le U_{[-1,2]} \le \sqrt{y}\right) = \frac{\sqrt{y}}{3} + \frac{1}{3}.$$
So we get
$$f_Y(y) = \begin{cases} \dfrac{1}{3\sqrt{y}} & 0 < y \le 1 \\[2mm] \dfrac{1}{6\sqrt{y}} & 1 \le y \le 4 \\[2mm] 0 & \text{otherwise.} \end{cases}$$

Could we have done this example via the transformation method? No, since the
derivative is 0 inside the interval.
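The two-case cumulative distribution of this example can be compared against a simulation. The following is a sketch only: the seed and the sample size are arbitrary choices of ours, and `F_Y` encodes the two cases $2\sqrt{y}/3$ for $y \le 1$ and $(\sqrt{y} + 1)/3$ for $1 \le y \le 4$.

```python
import math
import random

def F_Y(y):
    """CDF of Y = X^2 when X is uniform on [-1, 2]."""
    if y <= 0:
        return 0.0
    if y <= 1:
        return 2 * math.sqrt(y) / 3
    if y <= 4:
        return (math.sqrt(y) + 1) / 3
    return 1.0

random.seed(0)  # arbitrary seed, only for reproducibility
samples = [random.uniform(-1, 2) ** 2 for _ in range(200_000)]
for y in (0.25, 1.0, 2.25):
    empirical = sum(s <= y for s in samples) / len(samples)
    print(y, F_Y(y), empirical)
```

The empirical proportions should sit close to the formula on both sides of y = 1, where the two cases meet.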

We end this important section with a fundamental fact, but first a particular instance of a
very general theorem below:

Lemma. Expectation of a Sum. Let X be a random variable. Let $g(x)$
and $h(x)$ be functions. Consider the random variables $Y = g(X)$ and
$Z = h(X)$. Then $E(Y + Z) = E(Y) + E(Z)$.

In the continuous case, the proof is immediate since
$$E(Y + Z) = \int_{-\infty}^{\infty} \big(g(x) + h(x)\big) f_X(x)\, dx = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx + \int_{-\infty}^{\infty} h(x) f_X(x)\, dx = E(Y) + E(Z)$$
and the discrete case follows trivially by replacing integrals by sums.

Theorem. Alternate Definition of Variance. Let X be a random variable
with $E(X) = \mu$. Then $V(X) = E\big((X - \mu)^2\big)$. In particular, $V(X) \ge 0$
with equality if and only if X is a constant.

Proof. We have
$$E\big((X - \mu)^2\big) = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + E(\mu^2)$$
$$= V(X) + E(X)^2 - 2\mu E(X) + \mu^2 = V(X) + \mu^2 - 2\mu^2 + \mu^2 = V(X)$$
and we are done with the first claim. But since $(X - \mu)^2$ only takes nonnegative values,
its expectation has to be nonnegative, and finally it can only be 0 if $(X - \mu)^2$ is 0 itself,
that is, $X = \mu$. ∎

In particular, this explains why we can define the standard deviation as the square root of
the variance.

This theorem also illustrates part of the reason for the importance of the variance: it is
measuring the (average) squared distance (in the Euclidean sense) between the random variable
and the constant random variable represented by the mean.
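The equality of the two expressions for the variance can be seen on a small case. This sketch uses a fair die purely as an illustration, and assumes, as the proof does, that $V(X)$ was defined as $E(X^2) - E(X)^2$.

```python
from fractions import Fraction

# A fair die, chosen only as an illustration.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(func, pmf):
    """E(func(X)) for a discrete distribution given as {value: probability}."""
    return sum(func(x) * p for x, p in pmf.items())

mu = expectation(lambda x: x, pmf)
var_shortcut = expectation(lambda x: x * x, pmf) - mu**2   # E(X^2) - E(X)^2
var_centered = expectation(lambda x: (x - mu) ** 2, pmf)   # E((X - mu)^2)
print(mu, var_shortcut, var_centered)
```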

❺ The World of WHAT IFs

Random variables (and the accompanying probability distributions) are always representations
of the level of information we possess on an activity or phenomenon. In this section we look at
what happens to a random variable when we gain information. Before we can do that we need to
develop the foundational concept of conditional probability.

Example 1. Let us start with a very simple situation. You are to visit a potential customer who
is known to have two children. You are speculating on the random variable X that counts the
number of boys among his children. Knowing nothing else, the distribution of X (assuming
once again boys and girls are equally feasible) is given by

    X    0     1     2
    P    1/4   1/2   1/4

You arrive at the house and you see a boy playing in the backyard. You
ask the customer who the boy is and the customer replies
He is my oldest child.
Let B be the event that the oldest child is a boy. Then what would we conclude about the
values of the random variable X if we assume B as a given occurrence? Clearly they have
changed to

    X|B    0    1     2
    P      0    1/2   1/2

Thus for example if we let A be the event that both children are boys, then if B is given,
the likelihood that A will occur is the same as the likelihood that the second child is a boy,
which is simply 1/2.

Suppose the customer has instead replied
He is one of my children.
The subtleties in measuring information are reflected in the difference between the two
statements. One would hardly think there is a measurable difference between them. But let us
see what we can conclude now. Let C be the event that one of the children is a boy.

But given C, then out of the four possibilities for two children, BB, BG, GB and GG, only the
last one is ruled out, so our denominator is 3, and again the distribution for X has changed:

    X|C    0    1     2
    P      0    2/3   1/3

In summary we get the surprising fact that $P(A) = \frac{1}{4}$, $P(A \mid B) = \frac{1}{2}$, but $P(A \mid C) = \frac{1}{3}$.
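The three probabilities can be recovered by listing the four equally likely families and counting. A minimal sketch, with names (`prob`, `both_boys`, and so on) that are ours:

```python
from fractions import Fraction
from itertools import product

# Each outcome lists (older child, younger child); all four are equally likely.
outcomes = list(product("BG", repeat=2))

def prob(event, given=lambda o: True):
    """P(event | given) by counting equally likely outcomes."""
    pool = [o for o in outcomes if given(o)]
    return Fraction(sum(1 for o in pool if event(o)), len(pool))

both_boys = lambda o: o == ("B", "B")   # event A
oldest_boy = lambda o: o[0] == "B"      # event B
some_boy = lambda o: "B" in o           # event C

print(prob(both_boys), prob(both_boys, oldest_boy), prob(both_boys, some_boy))
```

The only thing that changes between the three calls is the pool of outcomes that remain possible, which is exactly the changing denominator.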

This example deals with the fundamental notion of conditional probability. If A and B are
two events, one defines the probability of A given B, in symbols $P(A \mid B)$, by
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A \cap B)}{P(B)}.$$
Namely, we restrict our set of possibilities to those in which B has occurred, which gives our
denominator, while in the numerator we put the situations when both events occur. In Venn
diagrams:

[Figure: Venn diagram of the events A and B with their intersection $A \cap B$]

The basic idea behind conditional probability is the idea of the denominator becoming what is
given. Namely we only consider possibilities among what has already occurred. The following
example will illustrate further.

Example 2. The accompanying table lists totals of accidental deaths by age and also certain
specific types for the United States in 1976.

                               Type of Accident
    Age            All Types   Motor Vehicle   Falls     Drowning
    Under 5          4,692        1,532           201       720
    5 to 14          6,308        3,175           121     1,050
    15 to 24        24,316       16,650           463     2,090
    25 to 34        13,868        7,888           426     1,060
    35 to 44         8,531        4,224           534       520
    45 to 54         9,434        4,118           931       500
    55 to 64         9,566        3,652         1,340       420
    65 to 74         8,823        3,082         1,997       270
    75 and over     15,223        2,717         8,123       197
    Total          100,761       47,038        14,136     6,827
A randomly selected person from the United States was known to have an accidental death in
1976. We can address a variety of questions:
① The probability that he/she was 15 or older. This one is very straightforward:
there were 100,761 deaths, of which 11,000 were of people under 15, so the
probability we desire is 89761/100761 = 89.08%.
② The probability that the cause of death was a motor vehicle accident. Again,
readily 47038/100761 = 46.68%.
These were not conditional statements. The next one is.
③ The probability that the cause of death was a motor vehicle accident given
that the person was between 15 and 24 years old. Here the pool of people to
be considered has become those who are between 15 and 24, so the denominator is
24,316 and so the probability is 16650/24316 = 68.47%. Observe that this probability is
considerably higher than the previous one, so being between 15 and 24 has a
considerable effect on the probability of dying from a motor vehicle accident.
④ Find the probability that the cause of death was a drowning accident given
that it was not a motor vehicle accident and the person was 24 or under.
Here we have the denominator of 13,959, which is 35,316 (who were 24 or younger)
minus 21,357 (who died in motor accidents), so we have 3860/13959 = 27.65%,
considerably higher than the unconditional probability of drowning: 6.77%.
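The four answers can be reproduced from the table with a few sums. The sketch below copies the table rows into a dictionary; the column indices and the helper `total` are our own illustrative names.

```python
# Rows: age group -> (all types, motor vehicle, falls, drowning), copied from the table.
deaths = {
    "Under 5": (4692, 1532, 201, 720),
    "5 to 14": (6308, 3175, 121, 1050),
    "15 to 24": (24316, 16650, 463, 2090),
    "25 to 34": (13868, 7888, 426, 1060),
    "35 to 44": (8531, 4224, 534, 520),
    "45 to 54": (9434, 4118, 931, 500),
    "55 to 64": (9566, 3652, 1340, 420),
    "65 to 74": (8823, 3082, 1997, 270),
    "75 and over": (15223, 2717, 8123, 197),
}
ALL, MOTOR, FALLS, DROWN = range(4)

def total(col, ages=None):
    """Sum one column over the chosen age groups (all groups by default)."""
    return sum(row[col] for age, row in deaths.items() if ages is None or age in ages)

young = {"Under 5", "5 to 14", "15 to 24"}
p_15_plus = 1 - total(ALL, {"Under 5", "5 to 14"}) / total(ALL)
p_motor = total(MOTOR) / total(ALL)
p_motor_given_15_24 = total(MOTOR, {"15 to 24"}) / total(ALL, {"15 to 24"})
p_drown_given_young_not_motor = total(DROWN, young) / (total(ALL, young) - total(MOTOR, young))
print(p_15_plus, p_motor, p_motor_given_15_24, p_drown_given_young_not_motor)
```

In each conditional question the only real work is choosing the right pool for the denominator; the numerator is then counted inside that pool.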

The previous two examples illustrate the effect one event can have on the probability of some
other event occurring. The relation given above
$$P(A \mid B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(A \cap B)}{P(B)}$$
has two other equivalent formulations, which are equally useful:
$$P(A \text{ and } B) = P(B)\, P(A \mid B)$$
and
$$P(A \text{ and } B) = P(A)\, P(B \mid A).$$

These latter statements are very useful in situations such as the ones in the following example:

Example 3. We are told that 60% of the population of Harmony believes the mayor
should quit, and we are also told that among those people, 60% of them believe the mayor
should be Phyllis. So if we let Q stand for a person believing the mayor should quit, and we let
P stand for a person believing the mayor should be Phyllis, then when a person is chosen at
random, what we know is that $P(Q) = 60\%$. We are also given that $P(P \mid Q) = 60\%$, so we
know that $P(P \text{ and } Q) = 36\%$. Note we do not know $P(P)$ since we do not know the
support for Phyllis among the people that do not believe the mayor should quit. If we were told
that 40% of them favored Phyllis, then we would have that $P(P \mid \neg Q) = 40\%$, and so
$P(P \text{ and } \neg Q) = 16\%$, and finally, $P(P) = 52\%$. We are letting $\neg Q$ stand for the case
that Q is not true.

One way to structure information such as that given in the previous example is to use a tree.
If a person from Harmony is chosen at random, then there are two options:

[Figure: tree diagram. First branch: Q (.6) or ¬Q (.4). From Q: P (.6) or not P (.4). From ¬Q: P (.4) or not P (.6).]

Often we will find that one event has no effect on another. If, for example, we are saying
that event B has no effect on event A, then we should have that $P(A \mid B) = P(A)$. But as the
following shows, much more than that occurs:

Fact. If one of these occurs, then all of them occur:
① $P(A \mid B) = P(A)$;
② $P(B \mid A) = P(B)$;
③ $P(A \text{ and } B) = P(A)\, P(B)$.
The proof is very simple: we have $P(A \mid B) = P(A)$ if and only if $P(A) = \dfrac{P(A \text{ and } B)}{P(B)}$,
which is tantamount to ③. But since $P(A \text{ and } B)$ and $P(B \text{ and } A)$ have the same meaning, we
have the symmetry of ② automatically.

Two events that satisfy the conditions in this fact are called independent events. In many
situations we will assume events are independent. Among these are consecutive flips of a coin,
or tosses of a die, or draw of a card (as long as we shuffle the deck in between draws).

On the other hand, from the example above, we can see that dying from a motor vehicle
accident and being between 15 and 24 years of age are not independent events. However,
being blue eyed and being male are probably independent events, so whenever we get a table
containing data about blue eyed males and females and non-blue eyed males and females, it
should approximately be true that P ( B | M ) = P ( B ) .

Example 4. Consider the random variable that is the toss of two dice:

    X    2     3     4     5     6     7     8     9     10    11    12
    P    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Suppose someone claims that the red die shows a 4 (call this event F). How is the random
variable affected by this occurrence?

    X|F  2     3     4     5     6     7     8     9     10    11    12
    P    0     0     0     1/6   1/6   1/6   1/6   1/6   1/6   0     0

Thus $P(X = 7 \mid F) = \frac{1}{6} = P(X = 7)$, so $X = 7$, the event that we roll a 7, and F are
independent. But $P(X = 6 \mid F) = \frac{1}{6} \ne \frac{5}{36} = P(X = 6)$, so rolling a 6 is not independent of the
roll of the red die.

If we instead are given that one of the dice shows a 4 (event G), then the distribution becomes:

    X|G  2     3     4     5     6     7     8     9     10    11    12
    P    0     0     0     2/11  2/11  2/11  1/11  2/11  2/11  0     0

and we see none of these rolls and G are independent.
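Both conditional tables can be generated by enumerating the 36 rolls. In this sketch the first coordinate of each pair plays the role of the red die; the helper names are ours.

```python
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # (red die, other die): 36 equally likely pairs

def prob(event, given=lambda r: True):
    """P(event | given) by counting equally likely rolls."""
    pool = [r for r in rolls if given(r)]
    return Fraction(sum(1 for r in pool if event(r)), len(pool))

red_is_4 = lambda r: r[0] == 4    # event F
some_4 = lambda r: 4 in r         # event G

for s in range(2, 13):
    total_is_s = lambda r, s=s: sum(r) == s
    print(s, prob(total_is_s), prob(total_is_s, red_is_4), prob(total_is_s, some_4))
```

Comparing the second and third printed columns with the first shows at a glance which sums are independent of F or of G.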

Example 5. Lifetime Once More. A certain type of electronic device has had so far a lifetime
T with density
$$f(t) = \begin{cases} \dfrac{3{,}000}{t^4} & t \ge 10 \\ 0 & \text{otherwise.} \end{cases}$$
Before, we computed the probability of the device lasting at least 20 hours to be $\frac{1}{8}$. Suppose now
we are given that the machine has lasted at least 20 hours. What is the probability that it will last
at most 30 hours? In other words, we need to compute $P(T < 30 \mid T > 20)$. We have the
cumulative distribution from before, $F(t) = 1 - \dfrac{1000}{t^3}$, so we can readily compute
$$P(T < 30 \mid T > 20) = \frac{P(T < 30 \text{ and } T > 20)}{P(T > 20)} = \frac{F(30) - F(20)}{1 - F(20)}$$
$$= \frac{\left(1 - \frac{1000}{30^3}\right) - \left(1 - \frac{1000}{20^3}\right)}{1 - \left(1 - \frac{1000}{20^3}\right)} = 1 - \left(\frac{2}{3}\right)^3 = \frac{19}{27} = 70.37\%.$$
Note that then $P(T \ge 30 \mid T > 20) = 1 - P(T < 30 \mid T > 20) = \frac{8}{27} = 29.63\%$.
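The conditional probability above is a ratio of two differences of the cumulative distribution, which is easy to evaluate exactly. A small sketch, assuming the cumulative distribution $1 - 1000/t^3$ for $t \ge 10$ and using integer hours so that exact fractions apply:

```python
from fractions import Fraction

def F(t):
    """Cumulative distribution of the lifetime: 1 - 1000 / t^3 for t >= 10."""
    return 1 - Fraction(1000, t**3) if t >= 10 else Fraction(0)

def conditional_between(lo, hi):
    """P(lifetime < hi | lifetime > lo) = (F(hi) - F(lo)) / (1 - F(lo))."""
    return (F(hi) - F(lo)) / (1 - F(lo))

p = conditional_between(20, 30)
print(p, float(p), 1 - p)
```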

In a sense conditional probability has been around the subject from its inception. Two of the first
contributors on the subject were the great French mathematicians of the 17th century: Fermat
and Pascal.

Example 6. The following problem was proposed to Pascal:


Two parties, of equal ability, will play a game until one of them has
won six hands. Each of them has placed 32 coins in the pot to be
collected by the winner. For some unexpected reason they have to
stop when one of them has won 5 games and the other 3. How
should the 64 coins be divided?

You may think of this problem as that of flipping a coin until you get a total of 6 heads or 6 tails.
Let us say then that 5 tosses had been heads and 3 had been tails. The problem had been
around for a long time, and several proposed answers had been given, including 2:1 and 5:3.
Pascal corresponded with Fermat on it, and they both solved it correctly, but in very different
ways. We will solve it in a modern, aggressive way, probably intellectually closer to Fermat’s
way. Let X be the random variable that counts the number of tosses until the game would have
been over. Then its distribution is

    X    1     2     3
    P    1/2   1/4   1/4

The reasons are: X = 1 must mean the next toss was a head, and there is a 1/2
chance of that; X = 2 only if we first get a tail and then a head, and we will be done in 3 tosses
if we either get TTH or TTT. The only way the tail player wins is when X = 3, and then only in
one half of those occurrences, so the probability that T wins is 1/8. So the stakes should be
divided 7-to-1, or 56 coins for H and 8 for T.
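The division of the stakes can be checked by brute force. The sketch below plays out a fixed number of further tosses (enough to settle the game in every case) and counts the sequences in which the leading player reaches the target; this fixed-length enumeration is our choice of method, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import product

def share_for_leader(need_leader, need_trailer):
    """Probability the leader wins when each remaining fair toss is equally likely."""
    tosses = need_leader + need_trailer - 1   # enough tosses to settle the game for sure
    wins = 0
    for seq in product("HT", repeat=tosses):
        # Heads go to the leader. Within this many tosses the leader gets there first
        # exactly when the sequence holds at least need_leader heads.
        if seq.count("H") >= need_leader:
            wins += 1
    return Fraction(wins, 2**tosses)

p_heads = share_for_leader(1, 3)   # H needs 1 more win, T needs 3 more
print(p_heads, 64 * p_heads, 64 * (1 - p_heads))
```

The same function can be called with other scores to see how the fair split moves as the game progresses.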

A word one encounters often in the outside world is odds. In general, the odds for an event
are the probability of the event occurring divided by the probability of it not
occurring. Equivalently, it is how the stakes should be divided. So in the previous situation, the
odds for H to win are 7-to-1. The odds for T to win are 1-to-7. If you want to bet on T, for
every $1 you contribute to the pot, your opponent should contribute $7.

We next review another famous problem, but this one more contemporary. It illustrates the
subtleties inherent in probability—but just because it is subtle you shouldn't mistrust it, instead
you should take the opportunity to fine tune your brain. It is a famous problem that has amused,
confused and bedazzled many people.

Example 7. We are going to play a game where I am going to give you a choice of three
doors. Behind one is an extra point for this class, behind the other two, nothing. You pick a
door, and all-knowing I, before showing you what is behind your chosen door, open another
door which has nothing behind it. I then give you a choice of either retaining your door or
switching to the only other unopened door. On the average, suppose you were doing this every
day there is class. What should be your standard operating procedure?

Before we discuss the problem as stated, let us solve a simpler problem. Suppose I had just
given you a door to choose out of 3 doors. Then nobody would argue that you have a 1/3
chance of guessing the correct door. Is that agreed on?

Now, let's convolute the problem by my showing you a door without a prize. One easily
arrived at, yet wrong, conclusion is that it really does not matter if you have a standard procedure.
This wrongful reasoning goes as follows. Originally each door had a 1/3 chance of having the one
point behind it. One of them has been eliminated, so now each door has a 1/2 chance of having the
point behind it, hence it is really the same if you switch as if you don't switch. Isn't this absolutely
reasonable?

I will try to convince you that it is not. But first let me get a little philosophical and point out that
what probability tries to accomplish is to measure uncertainty, and that the only uncertainty in
this problem is strictly from your point of view. (After all, I know everything.) Hence you have to
try to use every bit of information available to you (sort of squeeze blood out of rocks). What piece
of information have you not weighed in the argument in the previous paragraph? The fact that
although under no circumstances would I show you what is behind your door, I had perhaps a
choice of which door to show you and that I indeed chose the door I chose. More directly put,
it is correct to say that at the beginning every door has a 1/3 chance of being rich. But what is
crucial is that the new information could not have affected the probability behind your door, but
instead has affected the probabilities of the other two, one going to 0 and the other to 2/3. And
so indeed it behooves you to switch doors every time!
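The argument for switching can be confirmed by listing every combination of prize location and first pick. This is a sketch under the rules as stated (the host always opens an empty door other than yours); the variable names are ours.

```python
from fractions import Fraction
from itertools import product

doors = (0, 1, 2)
stay_wins = Fraction(0)
switch_wins = Fraction(0)

# Prize location and first pick are independent and uniform: 9 equally likely pairs.
for prize, pick in product(doors, repeat=2):
    weight = Fraction(1, 9)
    # The host opens an empty door other than the pick. When two such doors exist,
    # which one is opened does not change whether switching wins, so any will do.
    opened = next(d for d in doors if d != pick and d != prize)
    other = next(d for d in doors if d != pick and d != opened)
    stay_wins += weight * (pick == prize)
    switch_wins += weight * (other == prize)

print(stay_wins, switch_wins)
```

Staying wins only when the first pick was right, so switching wins in every other case.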

We next review a classical result: Bayes’ Theorem. In 1761, the Reverend Thomas Bayes
passed away unaware that among his unpublished manuscripts lay a paper that would eventually
make his name known to every student of probability. In his own words,
Given the number of times in which an unknown event has
happened and failed:
Required the chance that its probability of happening in a single trial
lies somewhere between any two degrees of probability that can be
named.

The way Bayes’ Theorem is used today is exemplified by:

Example 8. Treasure Hunting. There are three chests, each with two drawers. In
each drawer there is a coin. In chest ❶ there is one gold coin and one silver coin. In
chest ❷ there are two silver coins and in chest ❸ there are 2 gold coins. A chest is
selected by rolling a fair die. If the die comes up even, chest ❶ is selected. If it
comes 1 or 3, chest ❷ is selected while chest ❸ is selected only if a 5 is rolled.
Once the chest is selected, one of the drawers is selected at random and the coin in that drawer
is observed.

The best way to sort out this information is to use a tree diagram as follows:

[Figure: tree diagram. Start to ❶ (1/2), ❷ (1/3) or ❸ (1/6). From ❶: gold (1/2) or silver (1/2). From ❷: silver (1). From ❸: gold (1).]

In the picture the probabilities are the probability of traveling the arrow to which the
probability is assigned. Thus the arrow from the start to ❶ is 1/2, and similarly the arrow
from ❷ to the silver is 1 because we have to draw silver if we are in chest ❷.

Note that the last statement was a conditional statement, the probability of silver given that
we are in chest ❷, $P(S \mid ❷)$. But we know from conditional probability that then
$P(S \text{ and } ❷)$ is the product $P(❷)\, P(S \mid ❷)$, which equals 1/3.

This exhibits a nice way to compute the probability of going from the starting point to a terminal
node: simply multiply the probabilities of the arrows in the path.
Now we have the ability to answer a variety of questions:
① What is the probability that the coin observed is gold?
Since we can observe gold by either going to ❶ or ❸ first, then we have
$$P(G) = P(❶)\, P(G \mid ❶) + P(❸)\, P(G \mid ❸) = \tfrac{1}{2} \times \tfrac{1}{2} + \tfrac{1}{6} \times 1 = \tfrac{1}{4} + \tfrac{1}{6} = \tfrac{5}{12}.$$
② What is the probability that the coin observed is silver?
Similarly to the previous question,
$$P(S) = P(❶)\, P(S \mid ❶) + P(❷)\, P(S \mid ❷) = \tfrac{1}{2} \times \tfrac{1}{2} + \tfrac{1}{3} \times 1 = \tfrac{1}{4} + \tfrac{1}{3} = \tfrac{7}{12}.$$
Of course, with a bit of reflection we could have deduced this fact with no computation since it
is the nonoccurrence of gold that produces silver.

But so far, we have not used Bayes’ ideas. The next question does require our going back on
the tree, and that is essential to his ideas:
③ Given that the coin observed is gold, what is the probability that chest ❶ was selected?
Here we have
$$P(❶ \mid G) = \frac{P(❶ \text{ and } G)}{P(G)} = \frac{1/4}{5/12} = \frac{3}{5}.$$
Similar to the previous:
④ Given that the coin observed is gold, what is the probability that chest ❷ was selected?
This is of course 0 since $P(❷ \text{ and } G) = 0$.
⑤ Given that the coin observed is gold, what is the probability that chest ❸ was selected?
$$P(❸ \mid G) = \frac{P(❸ \text{ and } G)}{P(G)} = \frac{1/6}{5/12} = \frac{2}{5}.$$
Again, we could have predicted the outcome, since given G, we must have had ❶, ❷ or ❸,
and these are disjoint events.

Observe that the last three questions asked probabilities of prior events, and that was essential
to Bayes’ ideas. The idea of reversing the process is essential to modern applications to
hypothesis testing.
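The questions of the treasure hunting example reduce to two steps: add up the paths that end in gold, then divide each path by that total. A minimal sketch, with the chests labeled 1, 2, 3 and dictionary names of our own choosing:

```python
from fractions import Fraction

prior = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}     # P(chest)
gold_given = {1: Fraction(1, 2), 2: Fraction(0), 3: Fraction(1)}      # P(G | chest)

# Total probability: multiply along each path to gold and add the paths.
p_gold = sum(prior[c] * gold_given[c] for c in prior)

# Going back on the tree: P(chest | G) = P(chest and G) / P(G).
posterior = {c: prior[c] * gold_given[c] / p_gold for c in prior}
print(p_gold, posterior)
```

The posterior values add up to 1, which is the observation made at the end of the example.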

Example 9. Medical Diagnosis. Suppose that during a visit to the doctor, a


patient is given a test for a certain disease, and the test turns out positive. Of course,
the patient is alarmed, and so during a conversation with the doctor, the doctor
informs the patient that the test is very reliable. The doctor accurately claims: If one
has the disease, the test will detect it 99.99% of the time. A fairly
impressive statement indeed. But this is hardly all the information one needs. The
patient then timidly asks the doctor what percentage of the population has the
disease, and then we are told that 1% of the population has this disease.

Do we know enough to evaluate the patient’s chances of having the disease? No, not really. We
are missing a key piece of information: the patient needs to ask the doctor the following simple
question: What is the probability of testing positive if I do not have the disease? The
question addresses the existence of false positives, and these are rarely openly discussed but
they are essential for the computation. Suppose we find that there is only a 5% chance of a
false positive, namely we have a 5% chance of testing positive if we are perfectly healthy.
Now we can indeed compute. Setting up the tree:

[Figure: tree diagram. H (0.99): + (0.05) or − (0.95). S (0.01): + (0.9999) or − (0.0001).]

where we let H stand for the event that the patient is healthy and S that the patient has the
disease, while + stands for testing positive. Thus, what we have is that the patient tested
positive. Given that, what is the probability that the patient is sick?

Before we can compute $P(S \mid +)$, we need to compute $P(+)$. Since one can test positive by
either being sick or healthy, we have that
$$P(+) = P(+ \text{ and } H) + P(+ \text{ and } S) = 0.99 \times 0.05 + 0.01 \times 0.9999 = 0.059499.$$
And so
$$P(S \mid +) = \frac{P(S \text{ and } +)}{P(+)} = \frac{.009999}{.059499} = 16.80\%.$$
Thus the odds that the patient indeed is sick are less than 1 to 4. Surprising!!
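The diagnosis computation is a two-branch instance of Bayes' rule and fits in one small function. This is a sketch: the parameter names `prevalence`, `sensitivity` and `false_positive_rate` are ours, standing for the 1%, 99.99% and 5% of the example.

```python
def posterior_sick(prevalence, sensitivity, false_positive_rate):
    """P(sick | positive test) via Bayes' theorem."""
    p_pos_and_sick = prevalence * sensitivity
    p_pos_and_healthy = (1 - prevalence) * false_positive_rate
    return p_pos_and_sick / (p_pos_and_sick + p_pos_and_healthy)

p = posterior_sick(prevalence=0.01, sensitivity=0.9999, false_positive_rate=0.05)
odds = p / (1 - p)   # probability of being sick divided by probability of being healthy
print(p, odds)
```

Trying a smaller false positive rate in the same call shows how strongly the answer depends on that rarely quoted number.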

Example 10. Quality Control. Suppose as part of admission to graduate school a test is
given. From past occurrences, it is known that if a candidate is qualified to enter graduate
school, he will succeed on the test 95% of the time, while an ineligible candidate will pass the
test only 25% of the time. The school has given the test to applicant Lewis, and he has passed.
What is the probability he was qualified to attend graduate school?
As it is we do not know enough to answer the question.

The school clarifies that in the past 80% of the people that applied were indeed qualified.
Now we do know enough. The tree is similar to the others:

[Figure: tree diagram. Q (0.8): P (0.95) or F (0.05). U (0.2): P (0.25) or F (0.75).]

So now we can compute $P(Q \mid P)$ for Lewis. We need as usual $P(Q \text{ and } P) = 0.76$. But we
also need
$$P(P) = P(P \mid Q)\, P(Q) + P(P \mid U)\, P(U) = 0.76 + 0.05 = 0.81.$$
And so
$$P(Q \mid P) = \frac{0.76}{0.81} = 93.82\%.$$
Thus we are fairly confident of Lewis’ ability to succeed in graduate school.

A separate yet interesting question is what $P(Q \mid F)$ equals. This represents the probability of
being left behind (since the test was failed) even though one was still qualified. In fact,
$$P(Q \mid F) = \frac{P(Q \text{ and } F)}{P(F)} = \frac{.04}{.19} = 21.05\%.$$

Above we saw how the occurrence of an event creates a new random variable from an old one
such as in the examples about the parent with the two children, or the roll of the dice. In a
similar fashion, we can compute the probability of an event by conditioning it on a random
variable. The following example should be interesting.

Example 11. Craps. In the American game of CRAPS, a shooter rolls two dice. The shooter
wins if she rolls a 7 or 11 to start with, and loses to start with if she rolls a 2, 3 or 12. If the first
roll is anything else, namely, 4, 5, 6, 8, 9 or 10, then that roll becomes the shooter’s point, and
she keeps rolling until either she rolls a 7 (she loses) or her point (she wins). We are interested
in the probability the shooter wins.

Let X be the first roll of the two dice, and let B be a Bernoulli random variable, which is 1 if the
player wins.

What is the probability the shooter wins if the first roll was a 4, $P(B = 1 \mid X = 4)$? Clearly
B = 1 exactly when, the first time the shooter rolls either a 4 or a 7, she gets a 4. Therefore let
Y be another roll of two dice, then $P(B = 1 \mid X = 4) = P(Y = 4 \mid Y = 4 \text{ or } Y = 7)$. But the
latter is easily computed,
$$P(Y = 4 \mid Y = 4 \text{ or } Y = 7) = \frac{P(Y = 4 \text{ and } (Y = 4 \text{ or } Y = 7))}{P(Y = 4 \text{ or } Y = 7)} = \frac{P(Y = 4)}{P(Y = 4 \text{ or } Y = 7)} = \frac{3}{9}.$$

But what we are really interested in is $P(B = 1)$. The following should be clear, since these are
disjoint events:
$$P(B = 1) = \sum_{i=2}^{12} P(B = 1 \text{ and } X = i).$$

But $P(B = 1 \text{ and } X = i) = P(B = 1 \mid X = i)\, P(X = i)$, and since we saw how to compute one
of these above, in a similar fashion, we get,

    X              2     3     4     5     6     7     8     9     10    11    12
    P              1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
    P(B = 1 | X)   0     0     3/9   4/10  5/11  1     5/11  4/10  3/9   1     0

So the probability $P(B = 1)$ is then given by
$$\tfrac{3}{36} \times \tfrac{3}{9} + \tfrac{4}{36} \times \tfrac{4}{10} + \tfrac{5}{36} \times \tfrac{5}{11} + \tfrac{6}{36} \times 1 + \tfrac{5}{36} \times \tfrac{5}{11} + \tfrac{4}{36} \times \tfrac{4}{10} + \tfrac{3}{36} \times \tfrac{3}{9} + \tfrac{2}{36} \times 1 = \tfrac{244}{495} \approx 49.29\%,$$
not bad at all!!

[Figure: tree diagram of the first roll (probabilities 3/36, 4/36, 5/36, ...) followed by win or lose (3/9, 4/10, 5/11, ...)]

The perspicacious reader may realize the computation just made as
similar to an expectation, and indeed that is the case, and we shall
return to this view at the end of the course.
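The conditioning on the first roll can be carried out exactly in a few lines. The sketch below builds the distribution of the sum of two dice, encodes the rules of the game as stated, and adds $P(B = 1 \mid X = i)\, P(X = i)$ over all first rolls; the function name is ours.

```python
from fractions import Fraction
from itertools import product

# Distribution of the sum of two dice.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, Fraction(0)) + Fraction(1, 36)

def win_given_first(x):
    """P(B = 1 | X = x) for the first roll x."""
    if x in (7, 11):
        return Fraction(1)
    if x in (2, 3, 12):
        return Fraction(0)
    # x is the point: only rolls of x or 7 matter, and the shooter needs x first.
    return pmf[x] / (pmf[x] + pmf[7])

p_win = sum(win_given_first(x) * pmf[x] for x in pmf)
print(p_win, float(p_win))
```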

❻ Independence

In this last section of this chapter we look at a crucial concept: independent random
variables.

We have defined two events A and B to be independent if equivalently any one of the
following occurs:
$$P(A \mid B) = P(A),$$
or $P(B \mid A) = P(B)$,
or most symmetrically
$$P(A \text{ and } B) = P(A)\, P(B).$$

Let X and Y be random variables. We say they are independent if and only if for all
numbers a and b,
$$P(X \le a \text{ and } Y \le b) = P(X \le a)\, P(Y \le b).$$
In other words, X and Y are independent if the events $X \le a$ and $Y \le b$ are always
independent events.

For discrete random variables, we have that

Fact. Let X and Y be discrete random variables. Then they are independent
if and only if for all numbers a and b,
$$P(X = a \text{ and } Y = b) = P(X = a)\, P(Y = b).$$

Now that we have the definition of independent random variables, we can discuss
combinations of more than one random variable, at least in this case of independence.¹

Example 1. The Roll of the Dice Again. Let X be the roll of 1 die, so we know its simple
distribution. Let Y be the roll of 1 die, so Y is identically distributed to X.

    X or Y   1     2     3     4     5     6
    P        1/6   1/6   1/6   1/6   1/6   1/6

What random variable is X + Y? As is, we do not have enough information to answer the
question. But usually, one treats the roll of dice as independent variables, in other words
$P(X = 3 \text{ and } Y = 5) = P(X = 3)\, P(Y = 5) = \frac{1}{36}$, and the 3 and the 5 were arbitrary. Thus,
if we choose a number at random, 5 say, then since there are four positions on the table
that give 5, $P(X + Y = 5) = \frac{4}{36}$.

    +   1   2   3   4   5   6
    1   2   3   4   5   6   7
    2   3   4   5   6   7   8
    3   4   5   6   7   8   9
    4   5   6   7   8   9   10
    5   6   7   8   9   10  11
    6   7   8   9   10  11  12

Again it was critical that the two random variables were independent. Note there is a great
deal of difference between X + Y and 2X. Although X and Y are identically
distributed, they are not the same random variable.

¹ We will have combinations of arbitrary random variables in Chapter 3.

One of the most interesting random variables is the sum of n independent Bernoullis
with the same p. As usual, let us start with a specific example.

Example 2. The Sum of Bernoullis. Let us consider $X = Y_1 + Y_2 + Y_3 + Y_4$ where each $Y_i$
has the distribution of $B_p$, and they are independent. Then certainly X can take the
values 0, 1, 2, 3 and 4. What then is the probability that X = 2? If we let a 4-tuple represent
the respective outcomes of each of the variables, then X = 2 when one of the following
disjoint events occurs:
(1 1 0 0), (1 0 1 0), (1 0 0 1),
(0 1 1 0), (0 1 0 1) or (0 0 1 1).
But because of the independence of the variables, each of these has probability $p^2 q^2$ to
occur, and so since there are 6 of them we have that
$P(X = 2) = 6 p^2 q^2$.
The only ingredient missing is why 6? But that is simple: in 4 occasions we want 2
successes, so that the number is $\binom{4}{2} = 6$. Thus if in n tries we wanted k successes, then
$P(X = k) = \binom{n}{k} p^k q^{n-k}$.
The random variable which is the sum of n independent $B_p$’s is a binomial random
variable and of course it has two parameters, n and p, and it will be denoted by $B_{n,p}$.
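
The counting argument can be checked by brute force. The sketch below is illustrative only: the function name bernoulli_sum_pmf and the value p = 3/10 are our assumptions, not part of the example. It enumerates every 0-1 tuple, weights it by independence, and compares the result with the binomial formula.

```python
from fractions import Fraction
from itertools import product
from math import comb

def bernoulli_sum_pmf(n, p):
    """Distribution of Y1 + ... + Yn for independent Bernoulli(p) variables,
    obtained by enumerating all 2**n outcome tuples."""
    q = 1 - p
    pmf = [0] * (n + 1)
    for outcome in product((0, 1), repeat=n):
        k = sum(outcome)               # number of successes in this tuple
        pmf[k] += p**k * q**(n - k)    # independence: multiply the factors
    return pmf

p = Fraction(3, 10)  # an arbitrary illustrative value of p
pmf = bernoulli_sum_pmf(4, p)
formula = [comb(4, k) * p**k * (1 - p)**(4 - k) for k in range(5)]
print(pmf == formula)
```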

It is time to bring out the nicest theorem about expectations, and one of the nicest
theorems in the course:

Theorem (Expectation of a Sum). Let X and Y be random variables,
then E(X + Y) = E(X) + E(Y). Also if a is a scalar, E(aX) = aE(X).

Thus, the expectation of a sum is the sum of the expectations. Note that no independence
of the variables is required; this is true at any time. The complete proof will be
postponed. Nevertheless it is worth pointing out that in the independent discrete case, the
proof is straightforward:
$$
\begin{aligned}
E(X + Y) &= \sum_i i\,P(X + Y = i) = \sum_i \sum_{j+k=i} (j + k)\,P(X = j \text{ and } Y = k) \\
&= \sum_i \sum_{j+k=i} (j + k)\,P(X = j)\,P(Y = k) = \sum_j \sum_k \bigl[\, j\,P(X = j)\,P(Y = k) + k\,P(X = j)\,P(Y = k) \,\bigr] \\
&= \sum_{j,k} j\,P(X = j)\,P(Y = k) + \sum_{j,k} k\,P(X = j)\,P(Y = k) \\
&= \sum_j j\,P(X = j) \sum_k P(Y = k) + \sum_j P(X = j) \sum_k k\,P(Y = k) \\
&= \sum_j j\,P(X = j) + \sum_k k\,P(Y = k) = E(X) + E(Y). \qquad \blacksquare
\end{aligned}
$$

Note that the random variable X of the previous example thus has expectation
$E(X) = 4p$, since the expectation of $B_p$ is p.

Example 3. A father makes a deal with his two daughters. Cicely is to roll a die and
Maude is to draw a card, and the father will give his daughters the sum of the roll and
the draw (where king is 13, queen is 12, et cetera), in dollars. It is clear if we let C be
Cicely’s roll and M be Maude’s draw, that these are independent random variables. We
are interested in the random variable X = C + M.

Similarly to the example above, we get

+    A   2   3   4   5   6   7   8   9  10   J   Q   K
1    2   3   4   5   6   7   8   9  10  11  12  13  14
2    3   4   5   6   7   8   9  10  11  12  13  14  15
3    4   5   6   7   8   9  10  11  12  13  14  15  16
4    5   6   7   8   9  10  11  12  13  14  15  16  17
5    6   7   8   9  10  11  12  13  14  15  16  17  18
6    7   8   9  10  11  12  13  14  15  16  17  18  19

Since each particular cell in the table has probability
$\frac{1}{6} \times \frac{1}{13} = \frac{1}{78}$, the distribution of X = C + M is simply

X        2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19
78 × P   1  2  3  4  5  6  6  6   6   6   6   6   6   5   4   3   2   1

To compute E(X), we simply use the fact that E(X) = E(C) + E(M), and since
E(C) = 3.5, the roll, while E(M) = 7, the draw, we get
E(C + M) = E(C) + E(M) = 10.5.
If he were to give them the choice of not rolling and drawing and instead accepting a $10
bill, the girls may have a tough decision in front of them because they expect a little more
than the $10 offer—but bird in hand and all that.

The following elegant and subtle example illustrates the power of the theorem.

Example 4. The Case of the Absent Minded Professor. A professor has written n
letters to his friends and has addressed the corresponding n envelopes. But when the time
came to put the letters in the envelopes, the professor got distracted with some wonderful
mathematical thoughts, and so the professor proceeds to put the letters in the envelopes at
random—keeping of course, one letter to an envelope (the professor is NOT that absent
minded!!). How many letters do we expect to go in the correct envelope?

Let the letters be denoted by 1, 2, …, n, and when we write a permutation of
these numbers, the first number listed goes in envelope 1, the second number listed
goes in envelope 2, and so on. Thus for example 4 2 1 5 3 has only one
letter, #2, going to the correct person.

Let us start with some small cases. Let n = 2, so there are two possibilities, both
equally likely: in one both letters get in the correct envelope while in the other none
does, so the average is 1.

Let n = 3. Of the six possibilities, the number of correct letters is given in parentheses: 1 2 3 (3),
1 3 2 (1), 2 1 3 (1), 2 3 1 (0), 3 1 2 (0) and 3 2 1 (1).
Or equivalently, if we let X be the random variable that counts the
number of correct letters, then

X    0     1     2     3
P   1/3   1/2    0    1/6

So the expectation is again 1, E(X) = 1.

If we consider n = 4, then the random variable that counts the
number of correct letters has distribution:

X     0      1      2     3     4
P   9/24   8/24   6/24    0   1/24

since there is one permutation that has 4 correct letters, 6 that have two (pick which two
and then swap the other two), 8 that have one correct (pick the correct one (4 choices),
and then there are two permutations of three letters that leave none correct), and the
remaining 9 permutations leave none correct, so we get, once again,
$E(X) = \frac{1}{24}(1 \times 4 + 6 \times 2 + 8 \times 1) = 1$.

Let us now consider the general case. Certainly the distribution is not necessarily easy to
compute, but the expectation is! For each 1 ≤ i ≤ n, define a Bernoulli random variable
$B_i$ as follows:
$$B_i = \begin{cases} 1 & \text{letter } i \text{ is correct} \\ 0 & \text{otherwise} \end{cases}$$
Note that the $B_i$’s are not independent, and in fact we know at present little about how
they interact. However, the random variable we are interested in is $X = B_1 + B_2 + \cdots + B_n$,
and the distribution of that random variable is not easy, as mentioned above. But
$E(B_i) = \frac{1}{n}$ since we could have started with the i-th letter, or equivalently, there are
(n − 1)! (out of n!) permutations with the i-th letter in the correct spot. But then
$$E(X) = E(B_1 + \cdots + B_n) = E(B_1) + \cdots + E(B_n) = \frac{1}{n} + \cdots + \frac{1}{n} = 1,$$
and so surprisingly, regardless of the number of letters, we should expect exactly one
letter to be in the correct envelope.
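
For small n the claim can be confirmed by listing every permutation. This is a minimal sketch under the example's assumption that all n! arrangements are equally likely; the function names are ours, and letters and envelopes are numbered from 0 for convenience.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations
from math import factorial

def correct_letter_counts(n):
    """Map k -> number of permutations of n letters with exactly k letters
    in their own envelope."""
    counts = Counter()
    for perm in permutations(range(n)):
        fixed = sum(1 for envelope, letter in enumerate(perm) if envelope == letter)
        counts[fixed] += 1
    return counts

def expected_correct(n):
    """E(X) computed directly from the distribution, as an exact fraction."""
    counts = correct_letter_counts(n)
    return Fraction(sum(k * c for k, c in counts.items()), factorial(n))

print(sorted(correct_letter_counts(4).items()))  # [(0, 9), (1, 8), (2, 6), (4, 1)]
print(all(expected_correct(n) == 1 for n in range(1, 8)))
```

The first line reproduces the n = 4 table (no permutation has exactly 3 correct letters), while the second checks the expectation for n up to 7, where enumeration is still cheap.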

The following theorem is proven via the distributive properties of ands over ors:

Theorem (Independence & Functions). Let X and Y be independent
random variables, and let g(x) be any function. Then g(X) and Y are
also independent random variables.
Proof. We will prove it in the discrete case; later on we will have better tools to prove it
in the continuous case, and the proof will be very similar. We need to show that for any
a and b, P(g(X) = a and Y = b) = P(g(X) = a) P(Y = b). But the event g(X) = a is
the disjoint union of all the events $X = x_i$, where $g(x_i) = a$. Similarly the event
g(X) = a and Y = b is the disjoint union of the events $X = x_i$ and Y = b where $g(x_i) = a$.
Thus, with each sum taken over those i for which $g(x_i) = a$,
$$
\begin{aligned}
P(g(X) = a \text{ and } Y = b) &= \sum_i P(X = x_i \text{ and } Y = b) = \sum_i P(X = x_i)\,P(Y = b) \\
&= \Bigl( \sum_i P(X = x_i) \Bigr) P(Y = b) = P(g(X) = a)\,P(Y = b). \qquad \blacksquare
\end{aligned}
$$

We saw before that the expectation of a sum is the sum of the expectations regardless of
whether the random variables were independent or not. But in order to state that the
expectation of a product is the product of the expectations, one does indeed need
independence.

Theorem (Expectation of a Product). Let X and Y be independent
random variables, then E(XY) = E(X) E(Y).
Proof. We again give the proof in the discrete case (until we have more tools to deal with
the continuous case). Now
$$
\begin{aligned}
E(XY) &= \sum_i i\,P(XY = i) = \sum_i \sum_{j \times k = i} jk\,P(X = j \text{ and } Y = k) \\
&= \sum_i \sum_{j \times k = i} jk\,P(X = j)\,P(Y = k) = \sum_j j\,P(X = j) \sum_k k\,P(Y = k) = E(X)\,E(Y). \qquad \blacksquare
\end{aligned}
$$

As a very important consequence we get


Corollary. Variance & Independence. Let X and Y be independent
random variables. Then V(X + Y) = V(X) + V(Y), the variance of their
sum is the sum of their variances.
Proof. Now
$$
\begin{aligned}
V(X + Y) &= E\bigl((X + Y)^2\bigr) - \bigl(E(X + Y)\bigr)^2 = E(X^2 + 2XY + Y^2) - \bigl(E(X) + E(Y)\bigr)^2 \\
&= E(X^2) + 2E(XY) + E(Y^2) - E(X)^2 - 2E(X)E(Y) - E(Y)^2 = V(X) + V(Y),
\end{aligned}
$$
where the last step uses E(XY) = E(X)E(Y), which holds by independence,
so we are done. $\blacksquare$

We will use this corollary extensively throughout the course, but remember that independence
was required for the variance of a sum to be the sum of the variances.
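
To see why independence matters, compare X + Y for two independent dice with 2X = X + X, where the two summands are as dependent as can be. The sketch below assumes fair dice; the helper variance is our own illustrative name.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)

def variance(dist):
    """V(X) = E(X^2) - E(X)^2 for a distribution given as {value: probability}."""
    mean = sum(x * px for x, px in dist.items())
    return sum(x * x * px for x, px in dist.items()) - mean**2

die = {x: p for x in faces}

# X + Y for two independent dice: each of the 36 cells has probability 1/36.
two_dice = {}
for x, y in product(faces, repeat=2):
    two_dice[x + y] = two_dice.get(x + y, 0) + p * p

# 2X: the same die counted twice.
doubled = {2 * x: p for x in faces}

print(variance(die), variance(two_dice), variance(doubled))  # 35/12 35/6 35/3
```

The variance of X + Y is twice V(X), as the corollary states, while the variance of 2X is four times V(X).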

Example 5. The Unit Circle. Let W be the random variable that selects a point of the unit
disk at random. The unit disk consists of the unit circle and its inside. We do
have an intuitive sense of what the random variable W is doing. For example,
if we let Z = |W| denote the distance from a point of the unit disk to the origin,
we could ask $P(Z \le \frac{1}{3})$. The question is equivalent to asking what is the
probability the point chosen is within distance $\frac{1}{3}$ of the center. The answer is
$\frac{\pi (\frac{1}{3})^2}{\pi 1^2} = \frac{1}{9}$. In fact, in a very similar fashion, we can compute for every 0 ≤ r ≤ 1,
$P(Z \le r) = r^2$, so $F_Z(r) = r^2$, and so its density is $f_Z(r) = 2r$ in the range 0 ≤ r ≤ 1.
Now we can compute E(Z) as $E(Z) = \int_0^1 r \cdot 2r \, dr = \frac{2}{3}$.

But suppose we try to obtain W, and all we have at our disposal are standard uniforms.
Can we build W from them? One idea is to use the polar form of a point in the disk,
namely (x, y) = (r cos θ, r sin θ) where r (the distance to the origin) is arbitrarily between
0 and 1 and 0 ≤ θ ≤ 2π. So perhaps if we let U and V be independent standard uniforms,
then let
W = (U cos(2πV), U sin(2πV)).
However, we have an immediate problem with this, since then Z = U would be uniform,
and we know that Z is not uniform.
In fact if we use a randomizer to plot 100 points produced by this
expression for W we get the following picture:

[Figure: scatter plot of 100 points produced by W = (U cos(2πV), U sin(2πV)); both axes run from −1.5 to 1.5.]

And the points seem prejudiced toward the center.

Instead, we know that Z has the same distribution as $\sqrt{U}$ (from
above, since $P(\sqrt{U} \le r) = P(U \le r^2) = r^2$), so the way to build the correct W
should rather be
$W = \bigl(\sqrt{U} \cos(2\pi V), \sqrt{U} \sin(2\pi V)\bigr)$.
When we do that we get a more balanced picture:

[Figure: scatter plot of 100 points produced by the corrected W; both axes run from −1.5 to 1.5.]
Consider now the x-coordinate, $X = \sqrt{U} \cos(2\pi V)$, and its
expectation $E(X) = E\bigl(\sqrt{U} \cos(2\pi V)\bigr)$. But since U and V are
independent, so are $\sqrt{U}$ and $\cos(2\pi V)$, and so we get that
$E(X) = E(\sqrt{U})\, E(\cos(2\pi V))$.
Now $E(\sqrt{U}) = \int_0^1 \sqrt{u}\, du = \frac{2}{3} = E(Z)$, of course, and $E(\cos(2\pi V)) = \int_0^1 \cos(2\pi v)\, dv = 0$,
and so E(X) = 0. This is not surprising since there is an equal distribution to the right of
the y-axis as to the left.
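
The two constructions are easy to compare by simulation. The sketch below is illustrative only: it uses Python's random module as the source of standard uniforms, a fixed seed chosen arbitrarily, and 100,000 points rather than 100 so that the sample fractions settle down.

```python
import math
import random

def naive_point(rng):
    """(U cos 2πV, U sin 2πV): the radius is uniform, so points crowd the center."""
    u, v = rng.random(), rng.random()
    return u * math.cos(2 * math.pi * v), u * math.sin(2 * math.pi * v)

def uniform_disk_point(rng):
    """(√U cos 2πV, √U sin 2πV): the radius has distribution function r^2."""
    u, v = rng.random(), rng.random()
    r = math.sqrt(u)
    return r * math.cos(2 * math.pi * v), r * math.sin(2 * math.pi * v)

def fraction_within(points, radius):
    """Fraction of the points whose distance to the origin is at most radius."""
    return sum(math.hypot(x, y) <= radius for x, y in points) / len(points)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 100_000
naive = [naive_point(rng) for _ in range(n)]
correct = [uniform_disk_point(rng) for _ in range(n)]

mean_distance = sum(math.hypot(x, y) for x, y in correct) / n
mean_x = sum(x for x, _ in correct) / n

print(fraction_within(naive, 1 / 3))    # should be near 1/3
print(fraction_within(correct, 1 / 3))  # should be near 1/9
print(mean_distance, mean_x)            # should be near 2/3 and 0
```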

Another concept closely related to the expectation of a product is that of the covariance
of two random variables. Let X and Y be random variables. Then their covariance is
simply defined by
cov(X, Y) = E(XY) − E(X) E(Y),
the expectation of their product minus the product of their expectations.

Note that
$\operatorname{cov}(X, X) = E(X^2) - E(X)^2 = V(X)$.

Two variables that have covariance 0 are said to be uncorrelated. The reason for the
name uncorrelated is that there is a concept called the coefficient of correlation²
defined by
$\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$.
It is a fact that −1 ≤ ρ ≤ 1.
If X and Y are independent, then cov ( X , Y ) = 0 .

As an immediate consequence of the last theorem we get


Corollary. Correlation & Independence. Let X and Y be independent
random variables. Then cov ( X , Y ) = 0 .

Thus, two independent variables are uncorrelated. But the following example shows
that two variables may be uncorrelated and yet not be independent.

Example 6. Walking. By a step in the plane we mean adding one of the
following 4 vectors (at random) to a point in the plane:
$\begin{pmatrix} 1 \\ 0 \end{pmatrix}$, $\begin{pmatrix} -1 \\ 0 \end{pmatrix}$, $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ or $\begin{pmatrix} 0 \\ -1 \end{pmatrix}$.
If we start at the origin and take one step, then we can end at one of
these four points with equal probability. Let X be the random variable that records the
x-coordinate of the point where we end after one step, and similarly Y records the y-coordinate.
Now XY is a constant random variable with value 0, so E(XY) = 0.
But X and Y have identical distributions,

X or Y   −1     0     1
P        1/4   1/2   1/4

and clearly E(X) = E(Y) = 0, so indeed they are uncorrelated. However,
$P(X \le -1) = \frac{1}{4} = P(Y \le -1)$,
but
P(X ≤ −1 and Y ≤ −1) = 0,
so they are not independent.
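
Since there are only four outcomes, the whole example fits in a few lines. A minimal sketch, with helper names (expect, prob) that are ours:

```python
from fractions import Fraction

# The four equally likely end points after one step from the origin.
steps = [(1, 0), (-1, 0), (0, 1), (0, -1)]
p = Fraction(1, 4)

def expect(f):
    """Expectation of f(X, Y) over the four outcomes."""
    return sum(p * f(x, y) for x, y in steps)

def prob(event):
    """Probability of the set of outcomes on which event(x, y) is true."""
    return sum(p for x, y in steps if event(x, y))

cov = expect(lambda x, y: x * y) - expect(lambda x, y: x) * expect(lambda x, y: y)
p_x = prob(lambda x, y: x <= -1)
p_y = prob(lambda x, y: y <= -1)
p_both = prob(lambda x, y: x <= -1 and y <= -1)

print(cov)                # 0
print(p_both, p_x * p_y)  # 0 1/16
```

The covariance vanishes, yet the joint probability 0 differs from the product 1/16, which is exactly the failure of independence noted above.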

² We will not have much use in this course for the coefficient of correlation, but it is an important concept.

Chapter 2
A Visit to the Zoo
ΠSampling

In this chapter we will look at some of the most important types of random variables,
both discrete and continuous. In this section we look at two special types of random
variables that occur often in applications. Let us start with a generic example dealing with
the testing of refrigerators as to whether they are defective or not.

Example 1. Lot. From a lot of 10 refrigerators, 3 are known to be
defective. You are to choose 4 of the refrigerators for a store. Let X be
the number of defective refrigerators in the sample of 4 that you chose.
Then the distribution of X is given by

X      0         1         2        3
P    35/210   105/210   63/210   7/210
     .1667      .5        .3      .0333

[Figure: bar chart of the distribution of X over 0, 1, 2, 3.]

since
$$P(X = 0) = \frac{\binom{3}{0}\binom{7}{4}}{\binom{10}{4}}, \quad P(X = 1) = \frac{\binom{3}{1}\binom{7}{3}}{\binom{10}{4}}, \quad P(X = 2) = \frac{\binom{3}{2}\binom{7}{2}}{\binom{10}{4}}, \quad P(X = 3) = \frac{\binom{3}{3}\binom{7}{1}}{\binom{10}{4}}.$$

Its expectation is $E(X) = \frac{105 + 126 + 21}{210} = \frac{252}{210} = 1.2$, which is not surprising at all: since
30% of the population is defective, 30% of the sample is expected to be defective. The
mode is 1, and so is the median.

The variance is $\frac{105 + 4 \times 63 + 9 \times 7}{210} - 1.2^2 = \frac{420}{210} - 1.44 = 0.56$ and so the standard deviation
is 0.7483.

This is a typical example of a hypergeometric distribution.
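
The numbers of this example can be reproduced with exact fractions. A minimal sketch, in which the function name hypergeometric_pmf and its argument order (population, defectives, sample size) are our own convention:

```python
from fractions import Fraction
from math import comb, sqrt

def hypergeometric_pmf(N, m, k):
    """P(X = j) for j = 0..k when a sample of k is drawn from N objects,
    m of which are defective. Impossible values get probability 0."""
    return [Fraction(comb(m, j) * comb(N - m, k - j), comb(N, k)) for j in range(k + 1)]

pmf = hypergeometric_pmf(10, 3, 4)
mean = sum(j * p for j, p in enumerate(pmf))
var = sum(j * j * p for j, p in enumerate(pmf)) - mean**2

print(pmf[:4] == [Fraction(35, 210), Fraction(105, 210), Fraction(63, 210), Fraction(7, 210)])
print(mean, var, round(sqrt(var), 4))  # 6/5 14/25 0.7483
```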

Example 2. Production Line. But there is another way to have a sampling mechanism.
Suppose now we are pulling refrigerators from a production line, and we are going to pull
4 of them at random. It is known that 30% of refrigerators from that plant are defective (it
is not recommended we purchase refrigerators from this company). Again we are
interested in the distribution of the number of defectives in the sample.

Let Y denote this random variable. Then what is the probability that Y = 0? If that is to
be the case, then the first refrigerator has to be good, and so does the second, et cetera, and
since we are pulling them from the production line at random, we treat these events as
independent events, and so we get that
P(Y = 0) = .7 × .7 × .7 × .7 = .2401.
What is the probability that Y = 1? In that case, one of the refrigerators has to be
defective, which one? The first one? The second one? Et cetera. If the first, we have
probability .3 × .7 × .7 × .7, while if the second one is defective, we have .7 × .3 × .7 × .7.
Similarly for the remaining two possibilities (the third one and fourth), we get the
respective probabilities .7 × .7 × .3 × .7 and .7 × .7 × .7 × .3. So each of them has the same
probability, $.3^1 \times .7^3$, and so if we just count the occurrences we will readily have our answer.
Obviously we have four choices, $\binom{4}{1} = 4$, so our answer is simply
$P(Y = 1) = \binom{4}{1}\, .3^1\, .7^3 = .4116$.

In a similar fashion, $P(Y = 2) = \binom{4}{2}\, .3^2\, .7^2 = .2646$, $P(Y = 3) = \binom{4}{3}\, .3^3\, .7^1 = .0756$ and
$P(Y = 4) = .3^4 = .0081$, so the distribution is

Y          0                       1                       2                       3                       4
P   $\binom{4}{0} .3^0 .7^4$   $\binom{4}{1} .3^1 .7^3$   $\binom{4}{2} .3^2 .7^2$   $\binom{4}{3} .3^3 .7^1$   $\binom{4}{4} .3^4 .7^0$
         .2401                   .4116                   .2646                   .0756                   .0081

Not surprisingly, the expectation of Y is the same as the expectation of X in the previous
example, E(Y) = 1.2. However, the reasoning is slightly
different: the way to approach this expectation is simply that of
having a 30% chance of doing something (such as getting a hit
when batting), and trying it four times, so you expect 4 × .3
successes (hits at bat). The mode is also 1 and so is the median. We
could alternatively look at Y as the sum of four independent
Bernoullis $B_{.3}$, so $E(Y) = 4E(B_{.3}) = 1.2$.

[Figure: bar chart of the distribution of Y over 0, 1, 2, 3, 4.]

The variance is V(Y) = 4 × .3 × .7 = .84 and so its standard
deviation is σ = .9165.

This is an example of a binomial random variable. It basically counts the number of
successes among n independent trials, at each of which we have probability p of success.
Of course, what counts as a success is very much of our choosing.

Both of the previous examples deal with choosing a sample. In Example 1, the sample
was picked from a fixed population with two kinds of objects (defective vs. good).
Thus, in a hypergeometric random variable, there are 3 parameters: N, the number of
individuals in the total population; m, the number of individuals in the select
group within that population (the defectives in Example 1); and k, the size of the sample.
Such a variable is referred to as $H_{N,m,k}$. Thus, Example 1 involved $H_{10,3,4}$.

But one can also think of a binomial as a process of sampling. Rather than selecting
from a fixed limited population, one can think of a growing population (or unlimited
population such as a production line), like in Example 2. The binomial random variable
has two parameters, n the number of trials and p the probability of success at a trial,
so it will be referred to as $B_{n,p}$. Thus, Example 2 concerned $B_{4,.3}$. Although there are
only two parameters, one often associates a third letter with a binomial, and that is
q = 1 − p, the probability of failure, which in the example above was q = .7.

One could consider a beer factory, where 10% of the bottles are defective (not filled
enough, or overfilled, or broken glass, or not sealed properly). If one selects 20 bottles at
random from the population of bottles, and is interested in the number of defectives, then
one is doing a sampling process, but it is not hypergeometric, since we do not have a
fixed population to choose from, so we are using $B_{20,.10}$.

The key difference is that each choice is independent of the next which does not happen
when we are doing a hypergeometric—the choice of a defective refrigerator in the first
example has an effect on the probability of repeating that selection afterwards. Thus,
sometimes one refers to sampling without replacement as a hypergeometric while one
can think of sampling with replacement as a binomial.

In the examples above we contrasted $H_{10,3,4}$ versus $B_{4,.3}$. However, if there had been a
large collection of refrigerators rather than 10, we would see a much better fit since now
choosing the next refrigerator is almost independent from what we have done before. For
example, the table below reflects the values for $B_{4,.3}$ and $H_{100,30,4}$:

                  0       1       2       3       4
$B_{4,.3}$       .2401   .4116   .2646   .0756   .0081
$H_{100,30,4}$   .2338   .4188   .2679   .0725   .0070
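
The table can be regenerated, and the lot of 10 added for contrast, with a few lines of Python. This is a sketch in floating point; the function names are ours, and the hypergeometric arguments follow the order N, m, k used in the text.

```python
from math import comb

def binomial_pmf(n, p):
    """Probabilities P(B = j) for j = 0..n."""
    return [comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)]

def hypergeometric_pmf(N, m, k):
    """Probabilities P(H = j) for j = 0..k."""
    return [comb(m, j) * comb(N - m, k - j) / comb(N, k) for j in range(k + 1)]

b = binomial_pmf(4, 0.3)
h10 = hypergeometric_pmf(10, 3, 4)
h100 = hypergeometric_pmf(100, 30, 4)

for name, row in (("B 4,.3", b), ("H 10,3,4", h10), ("H 100,30,4", h100)):
    print(name, [round(x, 4) for x in row])

# Largest discrepancy from the binomial, for the small lot and the large one.
gap10 = max(abs(x - y) for x, y in zip(b, h10))
gap100 = max(abs(x - y) for x, y in zip(b, h100))
print(gap100 < gap10)
```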

Example 3. An academic department at a university consists of 10
female professors and 12 male professors. A committee of 6 in
charge of retention and promotion is to be appointed by the dean,
and all faculty members are equally eligible. Let X count the
number of males in the committee. We recognize this random
variable as $X = H_{22,12,6}$, and its distribution is given by:

X        0           1            2             3             4            5           6
P    210/74613   3024/74613   13860/74613   26400/74613   22275/74613   7920/74613   924/74613
#      .0028       .0405        .1858         .3538         .2985        .1061       .0124

[Figure: bar chart of the distribution of X over 0, 1, …, 6.]

We have of course $E(X) = \frac{72}{22} = \frac{36}{11} \approx 3.27$. The mode and median are 3, while
$E(X^2) = \frac{883728}{74613} \approx 11.8441$, so $V(X) = E(X^2) - E(X)^2 \approx 1.1334$ and σ ≈ 1.0646.

How shocked would we be if the committee were of one gender? We expect 3-plus
males, so to go to either 0 or 6 males takes a hefty jump of approximately 3 standard
deviations, so it should not be something that one expects to happen: a bit shocking. In
this particular case, we actually have the unlikely probabilities, 1.24% (for all male) and
0.28% (for all female). Thus, conceivably one could go to court on either, but more safely in the
latter case of an all-female committee.

Note that if we had wanted the number of females in the committee, we would be dealing
with $H_{22,10,6}$. Easily,
$H_{22,10,6} = 6 - H_{22,12,6}$,
so once we understand one of them, the other one follows.

Example 4. Suppose you come to take a test totally unprepared. The test consists of 10
True-False questions, each of which you will answer at random (but you will answer
them all, since there is no penalty for guessing). How likely is it that you will achieve a
passing score of 70% or better? Although this is not quite a sampling problem, if we
consider the random variable X that counts the number of correct answers you obtain,
that random variable is a binomial random variable.

Think of answering each question as an independent event (after all we are totally
ignorant), so the probability of answering any question correctly is $\frac{1}{2}$ (just like the flip of
a coin). Once we assume the independence of consecutive answers (or flips of the coin),
then to get 3 correct answers, we have to choose which three questions to answer
correctly (and we have $\binom{10}{3}$ ways to choose them). After we have done that, let us say questions
1, 5 and 7, then we must have CWWWCWCWWW as our answers, and since each answer has
probability $\frac{1}{2}$ of occurring, because of the independence, we end up with $\binom{10}{3} \left(\frac{1}{2}\right)^{10}$ as our
probability of exactly 3 correct answers, which is the same as the probability that $B_{10,.5}$
takes the value 3.

Proceeding in this fashion we can compute the distribution of the random variable
$X = B_{10,.5}$ that counts the number of correct answers:

X     0       1        2        3         4         5         6         7         8        9       10
P   1/1024  10/1024  45/1024  120/1024  210/1024  252/1024  210/1024  120/1024  45/1024  10/1024  1/1024
%    .097    .97      4.4      11.7      20.5      24.6      20.5      11.7      4.4      .97     .097

Thus $P(X \ge 7) = \frac{176}{1024}$, and hence the probability of getting a passing score in the exam is
more than 17%. Not bad for total ignorance!

Not surprisingly, E ( X ) = 5 , we expect to get 5 correct answers.

The next example will show a small yet important variation of this example.

Example 5. Suppose you come to take a test (totally unprepared as usual), but that the
test is Multiple Choice, with ten questions and each question having three choices,
only one of them correct. Again we compute the distribution for the random variable that
counts the number of correct answers. Let us repeat the computation of exactly three
correct answers. Again we have 120 ways of choosing which 3 questions will be
answered correctly, and after that, if for example again we are to have
CWWWCWCWWW, the probability of this is
$\frac{1}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} \cdot \frac{2}{3} \cdot \frac{1}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{2}{3} = \frac{2^7}{3^{10}}$,
and so we arrive, since $3^{10} = 59049$, at

X      0           1            2             3             4            5            6           7          8          9         10
P   1024/59049  5120/59049  11520/59049  15360/59049  13440/59049  8064/59049  3360/59049  960/59049  180/59049  20/59049  1/59049
%     1.73        8.67        19.51         26.01         22.76        13.65        5.69        1.63        .30        .03       .001

This table represents thus the distribution of $X = B_{10,\frac{1}{3}}$.

Note that now the probability that you score at least a 70% on the test is only
$\frac{1 + 20 + 180 + 960}{59049} = \frac{1161}{59049} \approx 1.96\%$, much
smaller than in the True-False exam. Of course, here the expectation is $3\frac{1}{3}$.

[Figure: the distributions of $B_{10,p}$ for p = .3333 and p = .5, plotted over 0, 1, …, 10.]
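
Both passing probabilities can be computed exactly. The sketch below assumes, as in the two examples, 10 independent questions and a passing mark of 7 correct answers; the function name prob_at_least is ours.

```python
from fractions import Fraction
from math import comb

def prob_at_least(n, p, threshold):
    """P(B >= threshold) for a binomial with parameters n and p, computed exactly."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(threshold, n + 1))

true_false = prob_at_least(10, Fraction(1, 2), 7)
multiple_choice = prob_at_least(10, Fraction(1, 3), 7)

print(true_false == Fraction(176, 1024), float(true_false))
print(multiple_choice == Fraction(1161, 59049), float(multiple_choice))
```

Using Fraction for p keeps every term exact, so the results can be compared directly with the fractions in the tables.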

Now we proceed to compute the expectation
and variance of the hypergeometric and the binomial. We start with the latter.

A binomial random variable counts the number of successes among n independent trials,
at each of which we have probability p of success. Of course, what counts as a success is very
much of our choosing. Equivalently, $B_{n,p}$ is the sum of n independent Bernoulli random
variables with probability p: $B_{n,p} = B_1 + \cdots + B_n$ where each $B_i$ is a $B_p$-random variable.
Hence, since $E(B_p) = p$ and $V(B_p) = pq$, we obtain by the expectation of a sum and
the variance of a sum of independents,
$E(B_{n,p}) = np$ and $V(B_{n,p}) = npq$.

The hypergeometric is more complicated. In order to compute the expectation of the
hypergeometric we use similar thinking to that used in Pascal’s recursion. We intend to
show that if $X = H_{N,m,k}$, then
$E(X) = \frac{mk}{N}$,
namely the expected proportion of members of the select group in the sample is the same as
the proportion in the whole population. This is clearly true when k = 1, since then we
really have a Bernoulli variable with $p = \frac{m}{N}$. We now proceed by induction, and in order
to help our thinking, we visualize what we are doing. Suppose we have a population of N
neighbors of the United States, where m of them are Mexican and the rest Canadians. A
group of k of them will be selected to relocate to our state of California. Let X = H N ,m ,k
denote the random variable that counts the number of Southern neighbors in that
relocation delegation.

Let us concentrate our sights on Ms. J, one of the Southern neighbors. What is the
probability that she gets chosen to reside in the golden state? Let J denote that event,
then
$$P(J) = \frac{\binom{N-1}{k-1}}{\binom{N}{k}} = \frac{(N-1)!}{(k-1)!\,(N-k)!} \times \frac{k!\,(N-k)!}{N!} = \frac{k}{N},$$
so $P(\text{not } J) = \frac{N-k}{N}$. Let
$Y = (X \mid J) - 1$ and let $Z = X \mid \neg J$. Then we readily observe that $Y = H_{N-1,m-1,k-1}$ while
$Z = H_{N-1,m-1,k}$, and so by induction we know that $E(Y) = \frac{(m-1)(k-1)}{N-1}$ and
$E(Z) = \frac{(m-1)k}{N-1}$. Then clearly, since
$P(X = i) = P(X = i \mid J)\,P(J) + P(X = i \mid \neg J)\,P(\neg J)$,
we get
$P(X = i) = \frac{k}{N} P(Y = i - 1) + \frac{N-k}{N} P(Z = i)$.
Then
$$
\begin{aligned}
E(X) &= \sum_i i\,P(X = i) = \sum_i i \left[ \frac{k}{N} P(Y = i - 1) + \frac{N-k}{N} P(Z = i) \right] \\
&= \frac{k}{N} \sum_i i\,P(Y = i - 1) + \frac{N-k}{N} \sum_i i\,P(Z = i) = \frac{k}{N} E(Y + 1) + \frac{N-k}{N} E(Z) \\
&= \frac{k}{N} \left[ \frac{(m-1)(k-1)}{N-1} + 1 \right] + \frac{N-k}{N} \left[ \frac{(m-1)k}{N-1} \right] \\
&= \frac{k}{N} \left[ \frac{mk - k - m + 1 + N - 1}{N-1} \right] + \frac{N-k}{N} \left[ \frac{mk - k}{N-1} \right] \\
&= \frac{mk^2 - k^2 - mk + Nk}{N(N-1)} + \frac{Nmk - Nk - mk^2 + k^2}{N(N-1)} \\
&= \frac{-mk}{N(N-1)} + \frac{Nmk}{N(N-1)} = \frac{mk}{N}.
\end{aligned}
$$

Later on in the course, the previous argument will be greatly simplified.
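
The conditioning step in the middle of the argument, the one that splits on whether Ms. J is chosen, can be tested numerically. This is a sketch with a few arbitrary parameter choices; h_pmf and recursion_holds are our own illustrative names.

```python
from fractions import Fraction
from math import comb

def h_pmf(N, m, k, j):
    """P(H = j) for the hypergeometric with parameters N, m, k;
    zero whenever j is outside the range of the variable."""
    if j < 0 or j > k or j > m or k - j > N - m:
        return Fraction(0)
    return Fraction(comb(m, j) * comb(N - m, k - j), comb(N, k))

def recursion_holds(N, m, k):
    """Check P(X = i) = (k/N) P(Y = i - 1) + ((N - k)/N) P(Z = i) for all i,
    where Y has parameters (N-1, m-1, k-1) and Z has parameters (N-1, m-1, k)."""
    for i in range(k + 1):
        lhs = h_pmf(N, m, k, i)
        rhs = (Fraction(k, N) * h_pmf(N - 1, m - 1, k - 1, i - 1)
               + Fraction(N - k, N) * h_pmf(N - 1, m - 1, k, i))
        if lhs != rhs:
            return False
    return True

print(all(recursion_holds(N, m, k) for N, m, k in [(10, 3, 4), (22, 12, 6), (15, 7, 5)]))
```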

In a similar fashion, one can also show that if $X = H_{N,m,k}$,
$V(X) = k \left( \frac{m}{N} \right) \left( \frac{N-m}{N} \right) \left( \frac{N-k}{N-1} \right)$.

Since both random variables are associated with sampling, we should not be surprised if
there is some connection between the parameters of the hypergeometric $X = H_{N,m,k}$ and
the binomial $Y = B_{n,p}$. Clearly the role of n in the binomial is played by k in the
hypergeometric since that is the size of the sample being taken. Also naturally, the
probability of success, p, is nothing but $\frac{m}{N}$, and so the probability of failure is $q = \frac{N-m}{N}$.
Having translated, now we see that
$E(X) = \frac{km}{N} = np = E(Y)$
while
$V(X) = k \left( \frac{m}{N} \right) \left( \frac{N-m}{N} \right) \left( \frac{N-k}{N-1} \right) = npq \left( \frac{N-k}{N-1} \right) = V(Y) \left( \frac{N-k}{N-1} \right)$
and the odd term $\frac{N-k}{N-1}$ tends to 1 if N is very large and k remains small.

Example 6. The Roll of the Dice. Suppose we roll five dice simultaneously. We can
view that as a binomial since the rolls are assumed to be independent. Let us say the roll of
one particular face is a success, so $p = \frac{1}{6}$ and n = 5. Then the distribution is given by

X      0       1       2       3      4      5
P %   40.19   40.19   16.08   3.22   .32    .01

[Figure: bar chart of the distribution of X over 0, 1, …, 5.]

We mentioned above that standard deviations
are the yardstick by which one measures
surprise. In particular, one would never be
surprised at being one standard deviation away
from the mean. In Example 6, what does it
mean to be within one standard deviation
of the mean? The mean is $\frac{5}{6}$ and so is the
standard deviation, so the possible values that are within one σ of μ (this is a common
way of referring to the mean) are 0 and 1, and from the distribution we see that in more than
80% of the throws we will end up in that interval. Not shocking at all.

The ranges of both of these variables are easily discerned. We should say a few words
about the mode and the median of both. Certainly the expectation of either is not
necessarily in the range, so we don’t expect the three measurements, mean, median and
mode, to agree exactly. But as the table of examples indicates, they are always close.

        Hypergeometric                          Binomial
 N    m    k   Mean   Median   Mode        n     p    Mean   Median   Mode
21   11    5   2.6      3       3         12   .85   10.2     10      11
22   11    5   2.50     2     2 or 3      12   .75    9        9       9
23   11    5   2.3      2       2         12   .65    7.8      8       8
23   12    7   3.6      4       4         12   .9    10.8     11      11
33   12    7   2.54     3       2         24   .8    19.2     19      19
43   12    7   1.9      2       2         20   .4     8        8       8
43   22    7   3.58     4       4         25   .25    6.3      6       6

Summarizing, the following table presents the highlights of the two types of
variables.

Name               Hypergeometric                                                              Binomial
Notation           $H_{N,m,k}$                                                                 $B_{n,p}$
Range              0, 1, …, min{m, k}                                                          0, 1, …, n
Probability        $P(H_{N,m,k} = j) = \frac{\binom{m}{j}\binom{N-m}{k-j}}{\binom{N}{k}}$      $P(B_{n,p} = j) = \binom{n}{j} p^j q^{n-j}$
Expectation        $E(H_{N,m,k}) = \frac{km}{N}$                                               $E(B_{n,p}) = np$
Variance           $V(H_{N,m,k}) = k \left(\frac{m}{N}\right)\left(\frac{N-m}{N}\right)\left(\frac{N-k}{N-1}\right)$    $V(B_{n,p}) = npq$
Median (or Mode)   $\frac{km}{N}$, nearest integer                                             np, nearest integer

Certainly the variance of the hypergeometric is an easily forgettable expression.
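
Forgettable or not, the expectation and variance rows can be checked against the distributions themselves. The sketch below uses one illustrative choice of parameters for each variable (the committee of Example 3, and a binomial with n = 12 and p = .85); all function names are ours.

```python
from fractions import Fraction
from math import comb

def moments(pmf):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(j * p for j, p in pmf.items())
    return mean, sum(j * j * p for j, p in pmf.items()) - mean**2

def hypergeometric(N, m, k):
    return {j: Fraction(comb(m, j) * comb(N - m, k - j), comb(N, k)) for j in range(k + 1)}

def binomial(n, p):
    return {j: comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n + 1)}

N, m, k = 22, 12, 6
h_formula = (Fraction(k * m, N),
             k * Fraction(m, N) * Fraction(N - m, N) * Fraction(N - k, N - 1))

n, p = 12, Fraction(17, 20)  # p = .85 written as an exact fraction
b_formula = (n * p, n * p * (1 - p))

print(moments(hypergeometric(N, m, k)) == h_formula)
print(moments(binomial(n, p)) == b_formula)
```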

In the following example, we will use the binomial distribution to help us make
decisions.

Example 7. Truthful vs. Liar. A university claims that 85% of its students graduate. You
are to test their veracity by setting up a test of their claim. You pick 12 students at
random, and see how many of them graduate. You decide to accept the school’s claim if
at least 8 of the 12 students graduated.

What is the probability that you come to the wrong conclusion if indeed the university’s
claim is true? Let Y denote the number of students among the 12 that graduated, so Y is a
binomial random variable with n = 12 and p = .85. We will be wrong if we
encounter Y ≤ 7, so we desire to compute P(Y ≤ 7) = 1 − P(Y ≥ 8), and the relevant
values are:

j            8          9          10         11         12
P(Y = j)   0.068284   0.171976   0.292358   0.301218   0.142242

with a sum of 0.976078, and so we will be wrong 2.39% of the time, a truly negligible
possibility. This type of error is called a Type I error.

But there is another kind of error we could make, namely accepting the university’s claim
when it is not true. This is called a Type II error. Now assume that the university’s rate of
graduation is actually only 60%. What then is the probability that you will accept their
claim as true even though it is not? We will be wrong with probability P(Y ≥ 8)
where Y is a binomial random variable with n = 12 and p = .6. The values are then

j            8          9          10         11         12
P(Y = j)   0.212841   0.141894   0.063852   0.017414   0.002177

with sum 0.438178, so we will be wrong more than 43% of the time.

Thus, in order to reduce the probability of a Type II error, we tighten the test a bit, and we
will call the university truthful if at least 9 students graduate. So we now want P(Y ≥ 9),
where Y is a binomial random variable with n = 12 and p = .6. Then we have the total
sum of 22.53%, and we have reduced our Type II error by nearly half. Yet in order to
be thorough we should look again at the Type I error. That one has increased from 2.39%
to 9.22%. Overall, the second test, using 9 students rather than 8 to help
us decide, decreased the Type II error but increased the Type I error.
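
The trade-off between the two kinds of error is easy to tabulate for any cutoff. A minimal sketch, assuming the sample of 12 students and the two graduation rates (.85 claimed, .60 actual) used in the example; the names binomial_tail and errors are ours.

```python
from math import comb

def binomial_tail(n, p, threshold):
    """P(Y >= threshold) for a binomial Y with parameters n and p."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(threshold, n + 1))

n = 12
errors = {}
for threshold in (8, 9):
    type_1 = 1 - binomial_tail(n, 0.85, threshold)  # rejecting a true claim (p = .85)
    type_2 = binomial_tail(n, 0.60, threshold)      # accepting a false claim (p = .60)
    errors[threshold] = (type_1, type_2)
    print(threshold, round(type_1, 4), round(type_2, 4))
```

Raising the cutoff from 8 to 9 moves probability from one kind of error to the other; no cutoff removes both.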

Finally we give an example of great historical importance since it was one of the first
occasions pure data was used to make a scientific conclusion.

Example 8. Boys vs. Girls. Laplace, the great French mathematician of the 18th century,
was one of the first to assert that the probability of giving birth to a baby boy is bigger than the
probability of giving birth to a baby girl. His conclusion was based on the
fact that from 1745 to 1770, there were 251,527 boys born in Paris while only 241,945 girls.

The key method was to assume that boys and girls were equally
likely, and then to compute the probability of what actually occurred.
When that probability is too small, we conclude that our
assumption is wrong. This is called hypothesis testing, and it is done
very commonly today.

Laplace’s analysis was as follows. Suppose m male births and f female
births occur. Let
$$p = \lim_{m+f \to \infty} \frac{m}{m+f}$$
be the eventual ratio of male births to total births. He
decidedly assumed such a limit existed, and that p represented the probability of being
born a boy. He then used calculus to compute some probabilities by evaluating some
complicated integrals. He finished by showing that the probability that $p \le \frac{1}{2}$, given the
actual count of 251,527 boys and 241,945 girls, was approximately $10^{-42}$, which
made it morally certain that $p > \frac{1}{2}$.

He went even further, and by comparing the births of London and Paris he concluded that
boys are even more likely in London than in Paris!

Today we would treat the random variable Y (the number of boy births) as a binomial
with n = 493,472 and $p = \frac{1}{2}$. Then our expectation would be E(Y) = 246,736 boys.
Now we need to compute $P(Y \ge 251527)$, and this is the kind of computation in which
Laplace was involved.

But we have in our possession a much more powerful weapon that Laplace did not have:
the standard deviation.

We saw before that for a binomial, the standard deviation is $\sigma = \sqrt{npq}$, so in our case,
$\sigma = \sqrt{123368} \approx 351.24$. Then the fact that what occurred is more than 13 standard deviations
away,
$$\frac{251527 - 246736}{\sigma} = \frac{4791}{\sigma} \approx 13.6,$$
tells us that what occurred is virtually impossible
under our assumptions, so it is these that need to be questioned.
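The arithmetic above is short enough to check by machine. This is only a sketch under the stated binomial assumption (p = 1/2); the helper name `standard_deviations_away` is ours.

```python
from math import sqrt

def standard_deviations_away(observed, n, p):
    """How many standard deviations `observed` lies from the binomial mean n*p."""
    mean = n * p
    sigma = sqrt(n * p * (1 - p))
    return (observed - mean) / sigma

boys, girls = 251_527, 241_945
print(round(standard_deviations_away(boys, boys + girls, 0.5), 2))
```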

In the next section we will see other variations of the binomial. It is this distribution
that is overwhelmingly important, much more than the hypergeometric.

• Variations on the Binomial

In this section we will look at several random variables closely associated with the
binomial (and necessarily the accompanying Bernoullis). We start with the discrete
waiting variable: we have a Bernoulli $B_p$, and we are waiting for the first trial
in which success occurs. We do assume consecutive trials are independent. Let $G_p$ be
the number of trials we have to perform until the first success occurs (counting the successful trial). Such a
random variable is called a geometric random variable (with parameter p). Thus $G_p$
can take the values 1, 2, 3, …, and its distribution is given by
$G_p$   1    2     3        4        …   n            …
P       p    qp    $q^2 p$  $q^3 p$  …   $q^{n-1} p$  …

Below is a table with the probabilities of the first 12 values of $G_p$ for p = 0.1, 0.2, 0.3,
0.4 and 0.5. Also below is the graphical representation of the table.

p \ n  1      2      3      4      5      6      7      8      9      10     11     12
0.1    0.1000 0.0900 0.0810 0.0729 0.0656 0.0590 0.0531 0.0478 0.0430 0.0387 0.0349 0.0314
0.2    0.2000 0.1600 0.1280 0.1024 0.0819 0.0655 0.0524 0.0419 0.0336 0.0268 0.0215 0.0172
0.3    0.3000 0.2100 0.1470 0.1029 0.0720 0.0504 0.0353 0.0247 0.0173 0.0121 0.0085 0.0059
0.4    0.4000 0.2400 0.1440 0.0864 0.0518 0.0311 0.0187 0.0112 0.0067 0.0040 0.0024 0.0015
0.5    0.5000 0.2500 0.1250 0.0625 0.0313 0.0156 0.0078 0.0039 0.0020 0.0010 0.0005 0.0002

[Figure: the probabilities $P(G_p = n)$ for n = 1, …, 12, one curve for each of p = 0.1, 0.2, 0.3, 0.4, 0.5.]

The logic for these probabilities is simple: in order for $G_p$ to take
the value n, we must have had n − 1 failures to start with, and
then we must have a success, and since the trials are independent,
this leads to the probability
$$\underbrace{q \cdots q}_{n-1}\, p = q^{n-1} p.$$

Example 1. It is known that 60% of the people of the town like Mildred as a candidate
for Mayor. Her only opponent, Paul, is thus preferred by 40% of the population. What is
the probability that the fifth random person interviewed is the first person to like Paul?
Obviously this is $P(G_{0.4} = 5) = (0.6)^4 (0.4) = 0.05184$.

Easily the mode of a geometric variable is 1. The median is only slightly more
interesting. Clearly if p ≥ 0.5, then the median is 1. Otherwise we need the smallest n with
$p \left( 1 + q + q^2 + \cdots + q^{n-1} \right) \ge \frac{1}{2}$, and so $p \left( \frac{1 - q^n}{1 - q} \right) \ge \frac{1}{2}$. Since $1 - q = p$, this gives $\frac{1}{2} \ge q^n$, so $-\ln 2 \ge n \ln q$, and,
dividing by the negative number $\ln q$,
$$n \ge -\frac{\ln 2}{\ln q}.$$
So, e.g., if p = 0.1, the median is 7, but the median is 4 when p = 0.2,
and 2 for p = 0.3, 0.4.
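The medians quoted above can be confirmed by accumulating probabilities until they reach one half. The sketch below assumes only the distribution $q^{n-1} p$ given earlier; the function names are ours.

```python
def geometric_pmf(p, n):
    """P(G_p = n): n - 1 failures followed by one success."""
    return (1 - p) ** (n - 1) * p

def geometric_median(p):
    """Smallest n with P(G_p <= n) = 1 - q**n >= 1/2."""
    q = 1 - p
    n = 1
    while 1 - q**n < 0.5:
        n += 1
    return n

print([geometric_median(p) for p in (0.1, 0.2, 0.3, 0.4, 0.5)])
```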

In order to compute the expectation and variance of a geometric, it is best to think
recursively once again. Consider the random variable Y that counts the number of further trials
necessary to obtain the first success given that the first trial is a failure. Then
either $G_p$ takes the value 1, with probability p, or, with probability q, it takes the value
Y + 1. So
$$E(G_p) = p + q \sum_{i=1}^{\infty} (i+1) P(Y = i) = p + q \sum_{i=1}^{\infty} i P(Y = i) + q \sum_{i=1}^{\infty} P(Y = i) = p + qE(Y) + q.$$
But since Y is also waiting on a Bernoulli, the distribution of Y is also geometric
with parameter p, and so $E(Y) = E(G_p)$, so we have that
$$E(G_p) = p + qE(G_p) + q = 1 + qE(G_p),$$
and we quickly get
$$E(G_p) = \frac{1}{p}.$$
So on the average one has to roll a die 6 times in order to achieve a given face, or flip a coin
twice to achieve a head.

To compute the variance of $G_p$, we first compute $E(G_p^2)$. But similarly to the argument
for the expectation, we get $E(G_p^2) = p + qE\left( (Y+1)^2 \right)$, so
$$E(G_p^2) = p + qE(Y^2 + 2Y + 1) = p + qE(G_p^2) + 2q\,\frac{1}{p} + q.$$
So $pE(G_p^2) = 2q\,\frac{1}{p} + 1$, and so
$$V(G_p) = E(G_p^2) - E(G_p)^2 = \frac{2q}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{2q + p - 1}{p^2} = \frac{q}{p^2}.$$
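As a sanity check on the formulas $E(G_p) = 1/p$ and $V(G_p) = q/p^2$, one can add up the defining series numerically. This is a rough sketch: the series is truncated at a hypothetical cutoff `terms`, which is harmless here because the terms decay geometrically.

```python
def geometric_moments(p, terms=10_000):
    """Approximate (mean, variance) of G_p by truncating the defining series."""
    q = 1 - p
    mean = sum(n * q ** (n - 1) * p for n in range(1, terms + 1))
    second = sum(n * n * q ** (n - 1) * p for n in range(1, terms + 1))
    return mean, second - mean**2

# Waiting for a given face of a die: p = 1/6, so we expect mean 6 and variance 30.
print(geometric_moments(1 / 6))
```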

Closely associated with the geometric is the extended geometric (also known as the
negative binomial), $G_{k,p}$, which is waiting for the k-th success, so in fact $G_{1,p} = G_p$.

Clearly, $G_{k,p}$ can take the values k, k + 1, k + 2, …. Let n ≥ k; what is the probability that
$G_{k,p} = n$? Easily, the last attempt must have been a success, and among the previous
n − 1 tries, we must have had exactly k − 1 successes (we can see here the relation to the
binomial), so
$$P(G_{k,p} = n) = \binom{n-1}{k-1} q^{n-k} p^k.$$

In the table below is a list of probabilities for the values of $G_{k,p}$ up to n = 15, for p = 0.35 and
k = 2, 3, 4 (an entry is blank when n < k).
k \ n  2      3      4      5      6      7      8      9      10     11     12     13     14     15
2      0.1225 0.1593 0.1553 0.1346 0.1093 0.0853 0.0647 0.0480 0.0351 0.0254 0.0181 0.0129 0.0091 0.0063
3             0.0429 0.0836 0.1087 0.1177 0.1148 0.1045 0.0905 0.0757 0.0615 0.0488 0.0381 0.0293 0.0222
4                    0.0150 0.0390 0.0634 0.0824 0.0938 0.0975 0.0951 0.0883 0.0789 0.0684 0.0578 0.0478
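The rows of this table follow directly from the formula for $P(G_{k,p} = n)$. A minimal sketch, with the function name `extended_geometric_pmf` chosen by us for illustration:

```python
from math import comb

def extended_geometric_pmf(k, p, n):
    """P(G_{k,p} = n): the k-th success arrives exactly on trial n."""
    if n < k:
        return 0.0
    return comb(n - 1, k - 1) * (1 - p) ** (n - k) * p**k

for k in (2, 3, 4):
    print(k, [round(extended_geometric_pmf(k, 0.35, n), 4) for n in range(k, 16)])
```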

To compute the expectation and variance of the extended geometric, we simply observe
that the extended geometric is nothing but the sum of k independent geometric random
variables, $G_{k,p} = \underbrace{G_p + \cdots + G_p}_{k}$, and so
$$E(G_{k,p}) = \frac{k}{p}.$$
And since they are independent, $V(G_{k,p}) = kV(G_p) = \frac{kq}{p^2}$.

Example 2. Jim is a professional diver and has found a huge bed of oysters in which, on the
average, 1 in 4 yields a pearl. He intends to give his betrothed a necklace with 12 pearls. If
he wants to have a margin of two standard deviations that he will have enough oysters to
make the necklace, how many oysters should he bring up? The relevant random variable
is of course $G_{12,0.25}$. Its expectation is 48, and its variance is $\frac{12 \left( \frac{3}{4} \right)}{\left( \frac{1}{4} \right)^2} = 144$, so σ = 12, and
he should bring up 48 + 2(12) = 72 oysters to have that margin of safety.

We now enter an interesting variation of the binomial Bn, p —one in which we think of n
as very large and p as very small. Actually, it is used in situations such as the number of
car accidents in a given location in a given amount of time, or the number of customers
entering a given store at a mall in a given hour.

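What we have then is an average, λ; for example, three customers enter the store in a
given hour, and this average is known to be λ = np = 3. Then, since $p = \frac{\lambda}{n}$, what we have is
$$P(B_{n,p} = k) = \binom{n}{k} p^k q^{n-k} = \frac{n!}{k!\,(n-k)!}\, \frac{\lambda^k}{n^k} \left( 1 - \frac{\lambda}{n} \right)^{n-k}$$
$$= \frac{\lambda^k}{k!}\, \frac{n (n-1) \cdots (n-k+1)}{n^k} \left( 1 - \frac{\lambda}{n} \right)^{n} \left( 1 - \frac{\lambda}{n} \right)^{-k}$$
$$= \frac{\lambda^k}{k!}\, (1) \left( 1 - \frac{1}{n} \right) \left( 1 - \frac{2}{n} \right) \cdots \left( 1 - \frac{k-1}{n} \right) \left( 1 - \frac{\lambda}{n} \right)^{n} \left( 1 - \frac{\lambda}{n} \right)^{-k}.$$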

If we now let n → ∞, and recall the fundamental fact that $\lim_{n \to \infty} \left( 1 - \frac{\lambda}{n} \right)^n = e^{-\lambda}$, we get
that
$$P(P_\lambda = k) = \lim_{n \to \infty} \frac{\lambda^k}{k!}\, (1) \left( 1 - \frac{1}{n} \right) \left( 1 - \frac{2}{n} \right) \cdots \left( 1 - \frac{k-1}{n} \right) \left( 1 - \frac{\lambda}{n} \right)^{n} \left( 1 - \frac{\lambda}{n} \right)^{-k} = \frac{\lambda^k}{k!} e^{-\lambda}.$$

And thus one defines a random variable $P_\lambda$ to be a Poisson random variable with
parameter λ if it can take the values 0, 1, … with respective probabilities $\frac{\lambda^k}{k!} e^{-\lambda}$.
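To see the limit at work numerically, one can hold λ = np fixed and let n grow. The sketch below uses λ = 3 and k = 2 purely as an illustration; the function names are ours.

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    """P(B_{n,p} = k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """P(P_lam = k) = lam^k / k! * e^(-lam)."""
    return lam**k / factorial(k) * exp(-lam)

lam, k = 3, 2
for n in (10, 100, 1_000, 100_000):
    print(n, round(binom_pmf(n, lam / n, k), 6))
print("Poisson", round(poisson_pmf(lam, k), 6))
```

The binomial values settle toward the Poisson value as n increases, which is the content of the limit above.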

Below is a table of values for k ≤ 10 and λ = 1, 2, 3, 4, together with a graphical
representation of those values.

λ \ k  0        1        2        3        4        5        6        7        8        9        10
1      0.367879 0.367879 0.183940 0.061313 0.015328 0.003066 0.000511 0.000073 0.000009 0.000001 0.000000
2      0.135335 0.270671 0.270671 0.180447 0.090224 0.036089 0.012030 0.003437 0.000859 0.000191 0.000038
3      0.049787 0.149361 0.224042 0.224042 0.168031 0.100819 0.050409 0.021604 0.008102 0.002701 0.000810
4      0.018316 0.073263 0.146525 0.195367 0.195367 0.156293 0.104196 0.059540 0.029770 0.013231 0.005292

[Figure: the Poisson probabilities above, one curve for each of λ = 1, 2, 3, 4.]

The data above suggests that the mode of a Poisson is a tie between λ − 1 and λ, but
actually that occurs only when λ is a (positive) integer. In the case when the
average is not a whole number, the mode is $\lfloor \lambda \rfloor$, the largest integer less than
λ. While the median may give some trouble (it is always either $\lfloor \lambda \rfloor$ or $\lfloor \lambda \rfloor + 1$), we should not be surprised
if the expected value of $P_\lambda$ ends up being λ. After all, the whole construction was
based on us knowing only the average.

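Indeed,
$$E(P_\lambda) = \sum_{k=0}^{\infty} k\, \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=1}^{\infty} k\, \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!} e^{-\lambda} = \lambda e^{-\lambda} \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = \lambda e^{-\lambda} e^{\lambda} = \lambda.$$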

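Similarly,
$$E(P_\lambda^2) = \sum_{k=0}^{\infty} k^2\, \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=1}^{\infty} k^2\, \frac{\lambda^k}{k!} e^{-\lambda} = \sum_{k=1}^{\infty} \frac{k \lambda^k}{(k-1)!} e^{-\lambda}$$
$$= \sum_{k=0}^{\infty} \frac{(k+1) \lambda^{k+1}}{k!} e^{-\lambda} = \sum_{k=0}^{\infty} \frac{k \lambda^{k+1}}{k!} e^{-\lambda} + \sum_{k=0}^{\infty} \frac{\lambda^{k+1}}{k!} e^{-\lambda}$$
$$= \sum_{k=1}^{\infty} \frac{\lambda^{k+1}}{(k-1)!} e^{-\lambda} + \lambda = \sum_{k=0}^{\infty} \frac{\lambda^{k+2}}{k!} e^{-\lambda} + \lambda = \lambda^2 + \lambda,$$
and so
$$V(P_\lambda) = \lambda^2 + \lambda - \lambda^2 = \lambda.$$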

Example 3. Insect Breeding. Suppose the number of eggs the Brown Recluse Spider
lays is given by a Poisson random variable with λ = 10. Moreover the probability that an
egg will develop is $\frac{1}{2}$. What is the probability that our spider will give rise to exactly 5
new spiders? In order to help us understand, let us start with P(Y = 0), where Y is the
number of baby spiders we will obtain. Clearly
$$P(Y = 0) = e^{-10} \left( 1 + \frac{10}{1!} \frac{1}{2} + \frac{10^2}{2!} \frac{1}{4} + \frac{10^3}{3!} \frac{1}{8} + \cdots \right) = e^{-10} \left( e^{5} \right) = e^{-5},$$
since we will have no eggs, or 1 egg and it will not develop, or two eggs neither of
which will develop, etcetera. Now, for another specific example, let us compute the
probability that Y = 2. But then we have that $P_{10} = 2$ and both eggs developed, or $P_{10} = 3$
and exactly two of the eggs developed, or $P_{10} = 4$ and exactly two of the eggs developed,
and so on. Thus we have


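$$P(Y = 2) = e^{-10} \left[ \frac{10^2}{2!} \frac{1}{2^2} + \frac{10^3}{3!} \binom{3}{2} \frac{1}{2^3} + \frac{10^4}{4!} \binom{4}{2} \frac{1}{2^4} + \cdots \right]$$
$$= e^{-10} \left[ \frac{5^2}{2!} + \frac{5^3}{3!} \binom{3}{2} + \frac{5^4}{4!} \binom{4}{2} + \cdots \right] = e^{-10} \left( \frac{5^2}{2!} + \frac{5^3}{3!} \frac{3!}{2!\,1!} + \frac{5^4}{4!} \frac{4!}{2!\,2!} + \frac{5^5}{5!} \frac{5!}{2!\,3!} + \cdots \right)$$
$$= e^{-10}\, \frac{5^2}{2!} \left( 1 + \frac{5}{1!} + \frac{5^2}{2!} + \frac{5^3}{3!} + \cdots \right) = e^{-10}\, \frac{5^2}{2!}\, e^{5} = e^{-5}\, \frac{5^2}{2!}.$$
And so we begin to suspect that we get a Poisson distribution also for the number of
spiders. Indeed, by similar reasoning, we get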


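$$P(Y = k) = e^{-10} \left[ \frac{10^k}{k!} \frac{1}{2^k} + \frac{10^{k+1}}{(k+1)!} \binom{k+1}{k} \frac{1}{2^{k+1}} + \frac{10^{k+2}}{(k+2)!} \binom{k+2}{k} \frac{1}{2^{k+2}} + \frac{10^{k+3}}{(k+3)!} \binom{k+3}{k} \frac{1}{2^{k+3}} + \cdots \right]$$
$$= e^{-10} \left[ \frac{5^k}{k!} + \frac{5^{k+1}}{(k+1)!} \binom{k+1}{k} + \frac{5^{k+2}}{(k+2)!} \binom{k+2}{k} + \frac{5^{k+3}}{(k+3)!} \binom{k+3}{k} + \cdots \right]$$
$$= e^{-10} \left( \frac{5^k}{k!} + \frac{5^{k+1}}{(k+1)!} \frac{(k+1)!}{k!\,1!} + \frac{5^{k+2}}{(k+2)!} \frac{(k+2)!}{k!\,2!} + \frac{5^{k+3}}{(k+3)!} \frac{(k+3)!}{k!\,3!} + \cdots \right)$$
$$= e^{-10}\, \frac{5^k}{k!} \left( 1 + \frac{5}{1!} + \frac{5^2}{2!} + \frac{5^3}{3!} + \cdots \right) = e^{-10}\, \frac{5^k}{k!}\, e^{5} = e^{-5}\, \frac{5^k}{k!}.$$

Thus, $P(Y = 5) = e^{-5}\, \frac{5^5}{5!} \approx 0.17547$.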
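The claim that the number of spiders is again Poisson, now with parameter 5, can be checked by carrying out the sum over the number of eggs directly. This is a sketch only: the infinite sum is cut off at a hypothetical `max_eggs = 100`, beyond which the Poisson(10) probabilities are negligible.

```python
from math import comb, exp, factorial

def poisson_pmf(lam, k):
    """P(P_lam = k)."""
    return lam**k / factorial(k) * exp(-lam)

def hatched_pmf(k, lam=10.0, p=0.5, max_eggs=100):
    """P(Y = k): sum over egg counts n >= k of P(n eggs) * P(exactly k of the n develop)."""
    return sum(
        poisson_pmf(lam, n) * comb(n, k) * p**k * (1 - p) ** (n - k)
        for n in range(k, max_eggs + 1)
    )

print(round(hatched_pmf(5), 5), round(poisson_pmf(5.0, 5), 5))
```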

Example 4. Officers and Horses. In 19th century Prussia, 14 cavalry
regiments were observed for 20 years (resulting in 280 observations),
and the number of deaths from horse kicks was recorded in the
following table:

Number of Deaths   Number of Occurrences
0                  144
1                  91
2                  32
3                  11
4                  2
≥5                 0
Total              280

If you were the captain of a regiment, and if X was the random variable
that counted the number of deaths in your regiment in a given year, then you could
perhaps think of the 196 deaths which occurred as random trials, with a
success being that the death occurred in the given regiment (in the given year), a chance of 1 in
280. Thus $X = B_{196, 1/280}$, and so we could speculate that
$P(X = 0) = \left( \frac{279}{280} \right)^{196} \approx 0.4959$ and $P(X = 1) = 196 \left( \frac{1}{280} \right) \left( \frac{279}{280} \right)^{195} \approx 0.34841$.

But perhaps we should switch to the Poisson, and when we divide the total number of
deaths by 280 to compute the average number of deaths, we get $\lambda = \frac{196}{280} = 0.7$. And then
we would get $P(P_\lambda = 0) = e^{-0.7} \approx 0.49658$ and $P(P_\lambda = 1) = e^{-0.7}(0.7) \approx 0.34760$, reaffirming the
approximation from which we obtained the Poisson.

But also interestingly, we could use the Poisson to predict the number of occurrences of
each of the situations above by multiplying each probability by 280 (and rounding to the
nearest integer):
Number of Deaths            0        1        2        3        4        ≥5
Poisson Probability         0.49658  0.34760  0.12166  0.02838  0.00496  0.00082
Expected # of occurrences   139      97       34       8        1        0
Actuality                   144      91       32       11       2        0

In a future section, we will see whether this is an acceptable prediction or not.
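The expected counts in the table can be regenerated from the raw data alone. A small sketch, where the dictionary `observed` simply transcribes the table of occurrences and the function names are ours:

```python
from math import exp, factorial

# deaths in a regiment-year -> number of regiment-years with that many deaths
observed = {0: 144, 1: 91, 2: 32, 3: 11, 4: 2}

def poisson_pmf(lam, k):
    """P(P_lam = k)."""
    return lam**k / factorial(k) * exp(-lam)

def expected_counts(observed):
    """Estimate lambda as the average number of deaths, then scale the Poisson probabilities."""
    total_obs = sum(observed.values())                          # 280 observations
    lam = sum(k * c for k, c in observed.items()) / total_obs   # 196 / 280
    return lam, {k: round(total_obs * poisson_pmf(lam, k)) for k in observed}

print(expected_counts(observed))
```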

Example 5. The Sum of Independent Poissons. Consider the following problem. We
have a major department store located on two main avenues of a large city downtown.
One uses $P_{10}$ to model the number of customers coming to the store via the door located
on Avenue A and one uses $P_{13}$ to model the number of customers coming through the
door on Avenue B. How many customers are coming into the store via both? We will
assume the two avenues behave independently. Let $X = P_{10} + P_{13}$. Then
$$P(X = 0) = P(P_{10} = 0 \text{ and } P_{13} = 0) = P(P_{10} = 0)\, P(P_{13} = 0) = e^{-10} e^{-13} = e^{-23}.$$
In a similar fashion,
$$P(X = 1) = P(P_{10} = 1) P(P_{13} = 0) + P(P_{10} = 0) P(P_{13} = 1) = e^{-10} \frac{10}{1!} e^{-13} + e^{-10} e^{-13} \frac{13}{1!} = e^{-23}\, \frac{23}{1!}.$$

And for further understanding,
$$P(X = 2) = P(P_{10} = 2) P(P_{13} = 0) + P(P_{10} = 1) P(P_{13} = 1) + P(P_{10} = 0) P(P_{13} = 2)$$
$$= e^{-10} \frac{10^2}{2!} e^{-13} + e^{-10} \frac{10}{1!} e^{-13} \frac{13}{1!} + e^{-10} e^{-13} \frac{13^2}{2!} = e^{-23}\, \frac{1}{2!} \left( 10^2 + 2(10)(13) + 13^2 \right) = e^{-23}\, \frac{23^2}{2!}.$$

Once more,
$$P(X = 3) = P(P_{10} = 3) P(P_{13} = 0) + P(P_{10} = 2) P(P_{13} = 1) + P(P_{10} = 1) P(P_{13} = 2) + P(P_{10} = 0) P(P_{13} = 3)$$
$$= e^{-10} \frac{10^3}{3!} e^{-13} + e^{-10} \frac{10^2}{2!} e^{-13} \frac{13}{1!} + e^{-10} \frac{10}{1!} e^{-13} \frac{13^2}{2!} + e^{-10} e^{-13} \frac{13^3}{3!}$$
$$= e^{-23}\, \frac{1}{3!} \left( 10^3 + 3(10)^2(13) + 3(10)(13)^2 + 13^3 \right) = e^{-23}\, \frac{23^3}{3!}.$$

In general, we get
$$P(X = n) = \sum_{k=0}^{n} P(P_{10} = k)\, P(P_{13} = n-k) = \sum_{k=0}^{n} e^{-10} \frac{10^k}{k!}\, e^{-13} \frac{13^{n-k}}{(n-k)!}$$
$$= e^{-23} \sum_{k=0}^{n} \frac{10^k}{k!} \frac{13^{n-k}}{(n-k)!} = \frac{e^{-23}}{n!} \sum_{k=0}^{n} \frac{n!}{k!\,(n-k)!}\, 10^k\, 13^{n-k} = \frac{e^{-23}\, 23^n}{n!} = P(P_{23} = n),$$
where the last sum is the binomial expansion of $(10 + 13)^n$.

Thus, $P_{10} + P_{13} = P_{23}$ in the presence of independence. In a similar fashion to the
example, one can show the surprising result that for any λ, µ (if the random
variables are independent):
$$P_\lambda + P_\mu = P_{\lambda+\mu}.$$
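The general identity can be spot-checked by carrying out the sum over k numerically for many values of n. A minimal sketch, with `sum_pmf` a name of our own choosing:

```python
from math import exp, factorial

def poisson_pmf(lam, k):
    """P(P_lam = k)."""
    return lam**k / factorial(k) * exp(-lam)

def sum_pmf(lam, mu, n):
    """P(P_lam + P_mu = n) for independent Poissons, summing over the split k + (n - k)."""
    return sum(poisson_pmf(lam, k) * poisson_pmf(mu, n - k) for k in range(n + 1))

for n in (0, 1, 2, 3, 23):
    print(n, sum_pmf(10, 13, n), poisson_pmf(23, n))
```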

p Three Important Theorems

Before we go and discuss the most important of all random variables in the next section,
we take a detour to visit three important theorems. We start with a classical result: The
Law of Averages. We have already encountered James, the eldest of the Bernoullis (the
most prestigious family in mathematical history), and the first theorem we discuss is due
to him.

An old Arab saying proclaims: Indeed he knows not how to know who knows not
also how to un-know. James Bernoulli made fundamental progress in how to test our
knowledge, and in doing so became one of the first statisticians in history. In 1713, eight
years after his death, his nephew Nicholas (son of his brother Nicholas) published,
posthumously, James's masterpiece: Ars Conjectandi, an important book, the heir to
Huygens' (who gave us expectation), and the predecessor to Laplace’s (the next theorem).
The following discussion is an interpretation of some of the ideas Bernoulli discussed.

Suppose you have an urn in which you can hear some balls inside.
Unfortunately the balls are too large to get them out, and certainly the urn is to
be preserved, so you don't want to destroy it. However through the opening you
can see one of the balls (one at-a-time) inside, and if you rattle the urn you can
perhaps change the ball you are seeing. Indeed, after a little while, through the
top you see balls of two colors: black and white. You spend some time observing the
different colors of the balls you see from the top of the urn, and you
are ready to guess that perhaps there are five balls inside the urn: 3
black and 2 white. But you are an honest person with integrity and you
would like some moral certainty concerning your guess. How would
you go about it? How would we test our knowledge? In Bernoulli's
own words:
...how often a white and how often a black pebble is observed. The question is,
can you do this so often that it becomes ten times, one hundred times, one
thousand times, etc. more probable (that is, it be morally certain) that the
numbers of whites and blacks observed (chosen) are in the same 3:2 ratio as
the pebbles in the urn, rather than in any other ratio?

At the same time, Bernoulli realized that we could only expect an approximation to our
ratio, not an exact ratio. Namely, the more times we did the experiment, the less
likely would it be that we get the exact ratio. Indeed, let's do some computations.
Suppose we observe the urn 500 times, and suppose that our hypothesis of 3 black
and 2 white is correct. Then the probability that we observe exactly 300 black and 200
white is, since we are dealing with the random variable $X = B_{500,0.6}$:
$$P(X = 300) = \binom{500}{300} \left( \frac{3}{5} \right)^{300} \left( \frac{2}{5} \right)^{200} = \frac{500!\, 3^{300}\, 2^{200}}{300!\, 200!\, 5^{500}} \approx 3.6\%.$$

In general, Bernoulli knew that if his hypothesis was correct, and he did the experiment
n times, the probability that he would get exactly k black showings was:
$$P(X = k) = \binom{n}{k} \left( \frac{3}{5} \right)^{k} \left( \frac{2}{5} \right)^{n-k} = \frac{n!\, 3^{k}\, 2^{n-k}}{k!\, (n-k)!\, 5^{n}}. \qquad (*)$$

In other words, he knew the previously studied binomial distribution. He understood
enough about these numbers to appreciate that the larger n got, the smaller this
number got, regardless of what k was. Indeed he said
To avoid misunderstanding, we must note that the ratio between the number
of cases, which we are trying to determine by experiment, should not be taken
as precise and indivisible (for then just the contrary would happen, and it
would become less probable that the true ratio would be found the more
numerous were the observations). Rather, it is the ratio taken with some
latitude, that is, included within two limits, which can be made as narrow as
one might wish.

And if we compute, we see that as the number of experiments increases, the probability
that we get exactly the ratio 3:2 decreases:

Experiments   50      100     150     200     250     300     350     400     450     500
Probability   0.11456 0.08122 0.06637 0.05751 0.05145 0.04697 0.04350 0.04069 0.03837 0.03640

as we go from 50 experiments to 500 by increments of 50 each. Thus
for the case of 500, we get, as we saw above, around a 3.6% chance of
coming up with a 3:2 ratio. The mode does in fact occur at the
expected ratio 3:2. But as the number of trials increases, this
probability decreases, as we pointed out above, since now the
tallest member has a lot of close competitors. The picture makes any
more words unnecessary.

Bernoulli then made a wonderful observation. If we give up on the exactness of the
ratio but are content with staying within a fixed, prescribed fraction of that ratio, then we
can indeed expect our probabilities to increase as the number of
experiments increases.

Let us say, as he did, that we want to stay within $\frac{1}{50}$ (2%) of our fixed ratio. Thus if we do
the experiment 50 times, then we expect to have either 29, 30 or 31 successes (black
balls); if we do it 100 times, then we look at the probability of 58, 59, 60, 61 or 62
successes, etcetera.

The table below gives both the number of occurrences of black balls and the probability
of each; the probabilities are added up at the bottom of the table. The top row indicates the
number of experiments.

50 100 150 200 250 300 350 400 450 500


290 0.0239
261 0.0262 291 0.0258
232 0.0290 262 0.0284 292 0.0277
203 0.0323 233 0.0314 263 0.0304 293 0.0295
174 0.0364 204 0.0349 234 0.0336 264 0.0323 294 0.0312
145 0.0415 175 0.0393 205 0.0373 235 0.0356 265 0.0340 295 0.0327
116 0.0484 146 0.0448 176 0.0418 206 0.0394 236 0.0373 266 0.0355 296 0.0339
87 0.0582 117 0.0521 147 0.0475 177 0.0440 207 0.0411 237 0.0387 267 0.0367 297 0.0350
58 0.0742 88 0.0625 118 0.0549 148 0.0496 178 0.0456 208 0.0424 238 0.0398 268 0.0376 298 0.0357
29 0.1091 59 0.0792 89 0.0653 119 0.0568 149 0.0509 179 0.0466 209 0.0432 239 0.0404 269 0.0382 299 0.0362
30 0.1146 60 0.0812 90 0.0664 120 0.0575 150 0.0514 180 0.0470 210 0.0435 240 0.0407 270 0.0384 300 0.0364
31 0.1109 61 0.0799 91 0.0656 121 0.0570 151 0.0511 181 0.0467 211 0.0433 241 0.0405 271 0.0382 301 0.0363
62 0.0754 92 0.0631 122 0.0554 152 0.0499 182 0.0458 212 0.0426 242 0.0399 272 0.0377 302 0.0359
93 0.0591 123 0.0527 153 0.0480 183 0.0443 213 0.0414 243 0.0389 273 0.0369 303 0.0351
124 0.0491 154 0.0453 184 0.0423 214 0.0397 244 0.0376 274 0.0358 304 0.0342
155 0.0421 185 0.0398 215 0.0377 245 0.0359 275 0.0343 305 0.0329
186 0.0369 216 0.0353 246 0.0339 276 0.0326 306 0.0315
217 0.0327 247 0.0317 277 0.0308 307 0.0298
248 0.0294 278 0.0287 308 0.0280
279 0.0266 309 0.0261
310 0.0242
0.3345 0.3899 0.4402 0.4839 0.5223 0.5563 0.5868 0.6143 0.6393 0.6622
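Both behaviours, the shrinking probability of the exact ratio and the growing probability of the 2% window, can be reproduced from the binomial formula (*). A sketch under the hypothesis of 3 black and 2 white balls; the function names are ours, and n is assumed to be a multiple of 50.

```python
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X = B_{n,p}."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_ratio_probability(n, p=0.6):
    """Chance that exactly 3/5 of the n observations are black."""
    return binom_pmf(n, p, 3 * n // 5)

def within_window_probability(n, p=0.6):
    """Chance that the number of blacks lies between 29M and 31M, where n = 50M."""
    m = n // 50
    return sum(binom_pmf(n, p, k) for k in range(29 * m, 31 * m + 1))

for n in range(50, 501, 50):
    print(n, round(exact_ratio_probability(n), 5), round(within_window_probability(n), 4))
```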

And we see the probabilities climb as the number of experiments increases.

This fact, namely, that

the probability of staying within an arbitrary, yet prescribed, fraction of the exact ratio
increases all the way to 1 as the number of experiments increases without bound,

is known as the Law of Large Numbers, or as the Law of Averages, or sometimes as
Bernoulli's Theorem.

Bernoulli proved the theorem in his Ars Conjectandi, and among the tools he used was
this pretty inequality:
Let a, b, c, d be positive numbers. If $\frac{a}{b} < \frac{c}{d}$, then $\frac{a+c}{b+d}$ is in between the
other two: $\frac{a}{b} < \frac{a+c}{b+d} < \frac{c}{d}$.

We cannot help observing that this is an average that does occur in baseball. If a player
goes $\frac{3}{5}$ (it is customarily read: 3 for 5) one day and $\frac{4}{6}$ another, then her/his combined
total for the two days is $\frac{7}{11}$, which is necessarily in between: not as good as her/his best
day and not as bad as her/his worst day. Observe that although $\frac{4}{6}$ is the same as $\frac{2}{3}$ in
most contexts, it is not so from this point of view, since $\frac{5}{8}$ (which is what we get when
we average $\frac{3}{5}$ and $\frac{2}{3}$) is not the same as $\frac{7}{11}$.
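The baseball average is easy to experiment with. In the sketch below the records are kept as (hits, at-bats) pairs rather than reduced fractions, precisely because 4 for 6 and 2 for 3 combine differently; the function names are ours.

```python
from fractions import Fraction

def combine(first, second):
    """Add numerators and denominators of two (hits, at_bats) records."""
    (a, b), (c, d) = first, second
    return (a + c, b + d)

def is_between(first, second):
    """Check that the combined record lies strictly between the two daily ratios."""
    lo, hi = sorted((Fraction(*first), Fraction(*second)))
    return lo < Fraction(*combine(first, second)) < hi

print(combine((3, 5), (4, 6)), combine((3, 5), (2, 3)))
```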

Suppose now we let N be the total number of experiments, and because we chose the
fraction $\frac{1}{50}$, we are going to let N = 50M. Then the ratio 3:2 occurs when we have 30M
observations of black balls. Hence to be within $\frac{1}{50}$ of that ratio means we must have
between 29M and 31M occurrences of black balls (endpoints included). If we let
ϑ = the probability that the number of occurrences of black balls falls
within these two limits,
then what Bernoulli proved was that as N grows, so does ϑ, and that $\frac{\vartheta}{1-\vartheta}$ becomes
arbitrarily large (moral certainty); that is, it grows without bound as M → ∞ (or
equivalently, N → ∞).

But Bernoulli wanted actual estimates for the number of tries necessary, an estimate
for N, and here perhaps he felt he had failed. For example, he wanted $\frac{\vartheta}{1-\vartheta} \ge 1000$, and
then his estimate for N was 25,550 observations, which to him was a gigantic number: there
were fewer than 3,000 known stars in the skies in his lifetime.

De Moivre’s approximation to the binomial is an assault on improving N, and that topic


will be discussed next: The Approximation of the Normal to the Binomial as a preamble to
the Central Limit Theorem.

In the early history of probability, the situation most often encountered was that of
performing repeated trials of an experiment, and counting the number of successes
among the trials, however success was defined. In other words, the most common random
variable was the Binomial.

Suppose, for the sake of explicitness, we consider
flipping a fair coin 10,000 times, and counting the
number of heads. That is, we are discussing
$Y = B_{10000,0.5}$. We have seen before that the probability
of k heads among 10,000 tosses of a fair coin is
$\binom{10000}{k} \frac{1}{2^{10000}}$, and the probability of any one
specific number of successes is very small. To wit, the most likely event, that of
exactly 5,000 heads, has probability less than 0.008, but this computation is not at all
trivial even today, and was so much less so in the eighteenth century. Hence, events have to
be collected in groups so as to make quantities meaningful (remember Bernoulli). So we
may ask instead, what is the probability of staying within 50 heads of this most likely
event, in other words, what is P(4950 ≤ Y ≤ 5050)? We can readily write an answer, but
what numerical estimate we can attach to it is a different story. The answer is
$$\sum_{k=4950}^{5050} \binom{10000}{k} \frac{1}{2^{10000}}.$$
But how to arrive at an estimate for such a quantity? And then, if
we vary the numbers a little, how sensitive are our estimates to these variations?
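With exact integer arithmetic, a modern computer can evaluate this sum directly, something unavailable in the eighteenth century. The following is a sketch, with `heads_between` a name of our own; it adds the binomial coefficients as exact integers and divides only once at the end.

```python
from math import comb

def heads_between(n, low, high):
    """Exact P(low <= Y <= high) for Y = number of heads in n fair tosses."""
    favorable = sum(comb(n, k) for k in range(low, high + 1))
    return favorable / 2**n  # exact big-integer count, one division at the end

print(heads_between(10_000, 5_000, 5_000))   # the single most likely outcome
print(heads_between(10_000, 4_950, 5_050))   # within 50 heads of it
```

Changing `low` and `high` a little shows how sensitive the answer is to the choice of window.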

It is here that De Moivre introduced what is one of the most important densities into
probability, the function $y = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$, which nowadays is so prevalent. It has many
names: the normal distribution, the Gaussian density function, the astronomer’s
error law, or simply the bell-shaped curve (the same shape as any row of Pascal’s
triangle).

Actually De Moivre introduced the shape $y = e^{-x^2/2}$, and he knew the integral had to be
found in order to produce a density. It was left to the great Laplace¹ to compute exactly
the area under this curve.

Theorem. $\int_{-\infty}^{\infty} e^{-x^2/2}\, dx = \sqrt{2\pi}$.
Proof. It was a clever trick that Laplace used to integrate this function. He worked with
the square instead:
$$\left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right)^2 = \left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right) \left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right) = \left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right) \left( \int_{-\infty}^{\infty} e^{-y^2/2}\, dy \right) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\, dx\, dy.$$
It was here he used polar coordinates, where x = r cos θ, y = r sin θ, and thus,

¹ Laplace (1749-1827) is a major name in both mathematics and mechanics. His two masterpieces,
Mécanique Céleste and Théorie Analytique des Probabilités, are both major books in their subjects. The
former is, as its name indicates, the full elucidation of our Solar system following Newton's Laws of
Planetary Motion. The latter is the distillation of all probabilistic knowledge up to the latter half of the
eighteenth century, and remained the major book in the subject for 50 years. Ironically, Laplace's work
represents both the crowning achievement of mechanism, the philosophy that considers the universe to run
as clockwork, and yet by his emphasis on the importance of probability and statistics, he represents the
beginning of the end of such a philosophy.

$x^2 + y^2 = r^2$.

By using infinitesimal areas, an early fact from the conversion of areas from one set of
coordinates to the other, we get our integral to be
$$\left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right)^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\, dx\, dy = \int_{0}^{2\pi} \int_{0}^{\infty} r e^{-r^2/2}\, dr\, d\theta$$
$$= \left( \int_{0}^{2\pi} d\theta \right) \left( \int_{0}^{\infty} r e^{-r^2/2}\, dr \right) = 2\pi \int_{0}^{\infty} r e^{-r^2/2}\, dr.$$

But the last integral is much easier since it has the needed r factor, and so we use the
substitution $u = e^{-r^2/2}$; then $du = -r e^{-r^2/2}\, dr$, and when r = 0, u = 1, and when r = ∞,
u = 0, thus
$$\left( \int_{-\infty}^{\infty} e^{-x^2/2}\, dx \right)^2 = 2\pi \int_{1}^{0} -du = 2\pi,$$
and taking square roots gives the theorem. ∎
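The value $\sqrt{2\pi}$ can also be confirmed numerically. The sketch below is not Laplace's method; it swaps in a plain trapezoid rule on a hypothetical finite interval [−12, 12], outside of which the integrand is negligibly small.

```python
from math import exp, pi, sqrt

def gaussian_area(half_width=12.0, steps=200_000):
    """Trapezoid-rule estimate of the integral of exp(-x^2 / 2) over [-half_width, half_width]."""
    h = 2 * half_width / steps
    f = lambda x: exp(-x * x / 2)
    total = 0.5 * (f(-half_width) + f(half_width))
    for i in range(1, steps):
        total += f(-half_width + i * h)
    return total * h

print(gaussian_area(), sqrt(2 * pi))
```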

A bit later we will see another tremendous contribution of Laplace connected with the
normal.

And without getting too technical, De Moivre correctly claimed that this unique bell-
shaped curve can be used to approximate any of the binomial problems (once
appropriately calibrated), and this approximation can be used to give an answer to our
query as an integral instead of a sum. In our case, the estimate is: the probability of
between 4,950 and 5,050 heads is 0.6826 (as we will see in the next section).

Unfortunately for De Moivre, he was not able to see how far reaching his curve was. It
was left to Laplace (and Gauss) to cement the importance of the normal distribution: The
Central Limit Theorem.

Ever since the first encounters with probability theory and Pascal's triangle, the bell shape
of the distribution of the numbers was very apparent. As we saw before, De Moivre had
proven that the normal curve
$$y = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$$
was the limit situation when a large number of experiments was performed in which two
outcomes (success and failure) were possible (binomial distributions). As we saw, it was
Laplace who found the constant $\frac{1}{\sqrt{2\pi}}$.

More importantly, Laplace is the discoverer of a vast generalization of De Moivre’s
discovery. Indeed, it does not matter what distribution one starts with: if we keep repeating
the experiment and averaging the results, we will
eventually get a distribution approaching the normal distribution.

This fact is known as the Central Limit Theorem, and it was used in the 19th century, as
it is still used today, to apply statistical methods to the social sciences. We will give some
further applications in the next section. But first we clarify what the theorem means.

Consider the following gaming situation:

We have 3 dice, one with the faces marked 1,1,1,2,2,2 (thus the probability of rolling
a 1 is $\frac{1}{2}$), another one with the faces marked 2,2,2,2,3,3, and finally the third one
with 3,3,3,3,3 and 4. You will roll the 3 dice simultaneously and record the sum.
Hence you can roll either a 6, or a 7, or an 8, or a 9. With a little computation we
derive the probabilities of each:

Roll   Probability   # of Ways
6      0.277778      60
7      0.472222      102
8      0.222222      48
9      0.027778      6

[Figure: histogram of these probabilities.]
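The "little computation" can be delegated to a short program that lists all 216 outcomes, and the same program produces the distribution of the total of several independent rolls. This is a sketch; the names `DICE`, `ways_per_roll` and `total_distribution` are ours.

```python
from collections import Counter
from itertools import product

# Faces of the three dice described above.
DICE = [(1, 1, 1, 2, 2, 2), (2, 2, 2, 2, 3, 3), (3, 3, 3, 3, 3, 4)]

def ways_per_roll():
    """Number of the 216 equally likely outcomes giving each sum of the three dice."""
    return Counter(sum(faces) for faces in product(*DICE))

def total_distribution(rolls):
    """Distribution of the total of `rolls` independent rolls of the three dice."""
    single = {s: w / 216 for s, w in ways_per_roll().items()}
    dist = {0: 1.0}
    for _ in range(rolls):
        nxt = Counter()
        for total, prob in dist.items():
            for s, q in single.items():
                nxt[total + s] += prob * q
        dist = dict(nxt)
    return dist

print(sorted(ways_per_roll().items()))
print({t / 2: round(p, 6) for t, p in sorted(total_distribution(2).items())})
```

Dividing each total by the number of rolls gives the averages, so calling `total_distribution` with larger arguments shows the bell shape emerging.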

But suppose that instead of rolling once, you rolled twice, and recorded the average of
your two rolls:

Total   Average   Probability
12      6.0       0.077160
13      6.5       0.262346
14      7.0       0.346451
15      7.5       0.225309
16      8.0       0.075617
17      8.5       0.012346
18      9.0       0.000772

[Figure: histogram of the averages of two rolls.]

And we can begin to see the reason for the name Central
Limit Theorem. Do it one more time, keeping track of the
average of the three rolls.

Total   Average   Probability
18      6.00      0.021433
19      6.33      0.109311
20      6.67      0.237269
21      7.00      0.286630
22      7.33      0.211677
23      7.67      0.098830
24      8.00      0.029107
25      8.33      0.005208
26      8.67      0.000514
27      9.00      0.000021

[Figure: histogram of the averages of three rolls.]

And we perceive more and more the approximation that the theorem claims. We give
three more histograms, the ones with 4 rolls, 5 rolls and 6 rolls, and no more words of
explanation.

Four rolls:
Total   Average   Probability
24      6.00      0.005954
25      6.25      0.040485
26      6.50      0.122290
27      6.75      0.216549
28      7.00      0.249915
29      7.25      0.197698
30      7.50      0.109756
31      7.75      0.043034
32      8.00      0.011816
33      8.25      0.002215
34      8.50      0.000269
35      8.75      0.000019
36      9.00      0.000001

Five rolls:
Total   Average   Probability
30      6.0       0.001654
31      6.2       0.014057
32      6.4       0.054411
33      6.6       0.127063
34      6.8       0.199980
35      7.0       0.224450
36      7.2       0.185397
37      7.4       0.114658
38      7.6       0.053485
39      7.8       0.018807
40      8.0       0.004942
41      8.2       0.000953
42      8.4       0.000130
43      8.6       0.000012
44      8.8       0.000001
45      9.0       0.000000

Six rolls:
Total   Average   Probability
36      6.0       0.000459
37      6.2       0.004686
38      6.3       0.022120
39      6.5       0.064159
40      6.7       0.128034
41      6.8       0.186530
42      7.0       0.205459
43      7.2       0.174831
44      7.3       0.116435
45      7.5       0.061111
46      7.7       0.025324
47      7.8       0.008263
48      8.0       0.002107
49      8.2       0.000414
50      8.3       0.000061
51      8.5       0.000007
52      8.7       0.000000
53      8.8       0.000000
54      9.0       0.000000

[Figures: histograms of the averages for four, five and six rolls.]

But the whole impact of the theorem is that what we started with is not relevant
at all: the shape could be very different from the one in the example above, and yet we
would have the same tendency toward the normal curve with its bell shape. To exemplify,
we just give six more histograms just like in the example above, but starting with a very
different distribution from the one above.

[Figures: six histograms for a different starting distribution.]

THE
NORMAL
LAW OF ERROR
STANDS OUT IN THE
EXPERIENCE OF MANKIND
AS ONE OF THE BROADEST
GENERALIZATIONS OF NATURAL
PHILOSOPHY * IT SERVES AS THE
GUIDING INSTRUMENT IN RESEARCHES
IN THE PHYSICAL AND SOCIAL SCIENCES AND
IN MEDICINE AGRICULTURE AND ENGINEERING
IT IS AN INDISPENSABLE TOOL FOR THE ANALYSIS AND THE
INTERPRETATION OF BASIC DATA OBTAINED BY OBSERVATION AND EXPERIMENT

And we will turn to some of those applications in the next section.



We end the section with a couple of Russian results from the 19th century.

Theorem. Markov’s Inequality. Let $X$ be a random variable with nonnegative support. Then for any $a > 0$,
\[ P(X \ge a) \le \frac{E(X)}{a}. \]
Proof. Let $f(x)$ denote the density of $X$. Then
\[ a\,P(X \ge a) = \int_a^\infty a f(x)\,dx \le \int_a^\infty x f(x)\,dx \le \int_0^\infty x f(x)\,dx = E(X). \]

Example 1. If the mean number of accidents on a given highway is 30 in a week, then how likely is it that it will exceed 50 in a given week? We can readily claim that this probability is at most $\frac{30}{50} = \frac{3}{5}$.

The next fundamental inequality firmly establishes the standard deviation as the yardstick
for shock.

Theorem. Chebyshev’s Inequality. Let $X$ be a random variable with mean $E(X) = \mu$ and standard deviation $\sigma$. Then for any $k > 0$,
\[ P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}. \]
Proof. Let $Y = (X - \mu)^2$, so $E(Y) = \sigma^2$, and since $Y \ge 0$, we can apply Markov’s inequality to $Y$. Letting $a = k^2\sigma^2$, we get
\[ P(Y \ge \sigma^2 k^2) \le \frac{E(Y)}{\sigma^2 k^2} = \frac{1}{k^2}. \]
And since $P(Y \ge \sigma^2 k^2) = P(|X - \mu| \ge \sigma k)$, we are done. ∎

Note the equivalent statement obtained by replacing $k$ by $\frac{k}{\sigma}$:
\[ P(|X - \mu| \ge k) \le \frac{\sigma^2}{k^2}. \]

Example 2. If the mean number of accidents on a given highway is 30 in a week, and the standard deviation is 10, then how likely is it that it will exceed 50 in a given week? Since we are two standard deviations above the mean, we can readily assert that the probability is at most $\frac{1}{2^2} = 25\%$.
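As a quick check of the two highway examples, here is a minimal sketch of both bounds. The function names `markov_bound` and `chebyshev_bound` are hypothetical helpers introduced only for illustration.

```python
def markov_bound(mean, a):
    """Upper bound on P(X >= a) for a nonnegative X with the given mean."""
    if a <= 0:
        raise ValueError("a must be positive")
    return min(1.0, mean / a)


def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k * sigma)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / k**2)


mean, sd, threshold = 30, 10, 50
print(markov_bound(mean, threshold))             # uses only the mean
print(chebyshev_bound((threshold - mean) / sd))  # uses the standard deviation as well
```

Knowing the standard deviation in addition to the mean is what tightens the bound from the first example to the second.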

With the help of Chebyshev’s Inequality, we get Bernoulli’s Theorem immediately.

First suppose that $X_1, X_2, \ldots, X_n$ are independent random variables with the same distribution, and hence the same mean and variance, $\mu$ and $\sigma^2$ respectively. Then we can define a new variable, the average, $Y_n = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$. We can easily observe that
\[ E(Y_n) = \mu \]

❹ The Ubiquitous Normal Distribution

In this section, we discuss the most important of all random variables, the normal, denoted by $Z$ and with density
\[ f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]
Its shape is the well-known bell-shaped curve. Its importance mainly stems from simple applications of the central limit theorem, as we will see below. We start with its main parameters.

Its range is the whole real line, but most of the area is concentrated between −3 and 3. In fact, $P(-1 \le Z \le 1) = 68.25\%$, $P(-2 \le Z \le 2) = 95.41\%$ and $P(-3 \le Z \le 3) = 99.69\%$. These were computed using the table: the cumulative distribution of $Z$ does not have a closed form, although it has a name, $\Phi$. The mode and median are both clearly 0, and indeed so is the expectation,
\[ E(Z) = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 0 \]
since the integrand is an odd function. As to $E(Z^2) = V(Z)$, we have
\[ V(Z) = E(Z^2) = \int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = 2\int_0^{\infty} x^2\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx, \]
which we integrate by parts. If we let $u = x$ and $dv = x\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx$, we then get
\[ V(Z) = 2\int_0^{\infty} x^2\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = -\frac{2x}{\sqrt{2\pi}}\, e^{-x^2/2}\, \Big|_0^{\infty} + \frac{2}{\sqrt{2\pi}} \int_0^{\infty} e^{-x^2/2}\, dx = 0 + \frac{2}{\sqrt{2\pi}} \cdot \frac{\sqrt{2\pi}}{2} = 1. \]

So $\sigma = 1$ also.

As usual, probabilities are given by areas under the curve, and the
areas have been previously computed in a table. The table (see
appendix) gives the area of the tail as indicated in the picture.

Example 1. The Standard Normal. We will ask several questions about this crucial distribution.
① $P(Z \ge 1)$. Directly from the table: 15.87%.
② $P(Z \ge 2.23)$. Also directly from the table: 1.30%.
③ $P(Z \le -1)$. By the symmetry of the curve, and hence equality of areas, we get: 15.87%. Similarly $P(Z \le -2.23) = 1.30\%$.
④ $P(Z \le 1)$. By complementation: 100% − 15.87% = 84.13%.
⑤ $P(1 \le Z \le 2.23)$. Our picture looks a little different now, and we have to subtract: $.1587 - .0130 = .1457$.
⑥ $P(-2.23 \le Z \le 1)$. With yet another picture, we now have the union of two disjoint events: $-2.23 \le Z \le 0$, which has probability 50% − 1.30% = 48.70%, and $0 \le Z \le 1$, with probability 50% − 15.87% = 34.13%. Thus, our total probability is given by the sum $.4870 + .3413 = .8283$.

For a different question altogether:
⑦ For which value $a$ do we have $P(Z \ge a) = 1\%$? Now we have to look inside the table for a probability. We come close to 1% at 2.33 standard deviations away from the mean.

Similarly, 5% is given by 1.645 standard deviations, and 10% by 1.28.

The best procedure in order to compute from the table is to always have a picture of the
desired area in mind.
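When a table is not at hand, the same areas can be obtained numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper names `upper_tail`, `between` and `cutoff_for_tail` are hypothetical, and the results may differ from the table-based values above in the last digit or two.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, standard deviation 1


def upper_tail(a):
    """P(Z >= a): the tail area, as tabulated."""
    return 1.0 - Z.cdf(a)


def between(a, b):
    """P(a <= Z <= b)."""
    return Z.cdf(b) - Z.cdf(a)


def cutoff_for_tail(p):
    """The value a for which P(Z >= a) = p (reading the table 'inside out')."""
    return Z.inv_cdf(1.0 - p)


print(round(upper_tail(1), 4))          # question 1
print(round(between(1, 2.23), 4))       # question 5
print(round(between(-2.23, 1), 4))      # question 6
print(round(cutoff_for_tail(0.01), 3))  # question 7
```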

The time has come to consider other normals. Start by considering a random variable of the form $X = aZ + b$. Then $E(X) = aE(Z) + b = b$, and $V(X) = a^2 V(Z) = a^2$, so if we let $N_{\mu,\sigma} = \sigma Z + \mu$, then $E(N_{\mu,\sigma}) = \mu$, and its standard deviation is $\sigma$. More explicitly, its distribution is given by
\[ F_{N_{\mu,\sigma}}(y) = P(N_{\mu,\sigma} \le y) = P(\sigma Z + \mu \le y) = P\left(Z \le \frac{y-\mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\frac{y-\mu}{\sigma}} e^{-x^2/2}\, dx. \]

So by the fundamental theorem of calculus, its density is given by the derivative, which is
\[ f(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\mu)^2}{2\sigma^2}}. \]
In fact all of these graphs are bell-shaped. The graphs
on the right represent a collection of normals with the
same means, but with different standard deviations—
the wider the graph the bigger the standard deviation.
The standard normal is the one in the middle.

On the other hand, the graphs below represent several normals with different means and the same standard deviation: the only effect is a simple translation.

But more important than the density is the fact that in order to compute a probability, all we need is to measure the difference from the mean in terms of the standard deviation: the expression $\frac{y-\mu}{\sigma}$ gives us all the control. In other words, if we are using a normal other than the standard normal, all we have to do is translate to the number of standard deviations away from the mean. The following example should be illustrative.

Example 2. Verbal SAT. Scores on the SAT verbal ability follow a normal distribution with $\mu = 430$ and $\sigma = 100$, so our random variable is $N(430, 100)$. Suppose 10,000 students take the test. Consider the following.
① How many students scored 530 or higher? How far from the mean are we? We are 100 points away, which is one standard deviation, so the answer is the same as ① in the previous example: 15.87%, or 1,587 students.
② How many students scored 653 or higher? Since $653 - 430 = 223$, we are 2.23 standard deviations above the mean, so again the question is the same as part ② of the previous example, and the answer is 130 students.
Proceeding in the same fashion, but phrasing the question in terms of students and SAT scores, rather than standard deviations away from the mean, we get
③ How many students scored 330 or lower? 1,587 of them. Similarly 130 students scored 207 or lower.
④ How many students scored 530 or lower? By complementation, we get 8,413 students.
⑤ How many students scored between 530 and 653? $1587 - 130 = 1457$ of them.
⑥ How many students scored between 207 and 530? As before we get the union of two disjoint events: one with 4,870 students and the other one with 3,413, for a total of 8,283 students.
⑦ Finally, Lesley’s mother was told that Lesley’s score was in the top 1%. Thus Lesley’s score was at least $430 + 2.33 \cdot 100 = 663$ points.

The following example reiterates the notion that what is crucial is to use the standard
deviation as the yardstick, that σ is the unit of measurement.

Example 3. Career Choices. Thomas recently took two national exams for admission to graduate school. His score on the biology test was 89, while on the math test it was 78. Since the tests are given to thousands of students, one can safely assume that the scores on each of them follow a normal distribution.

Thomas found out that on the biology test the average was 82 and the standard deviation was 4, while on the math test the average was 76 and the standard deviation was 1. For which subject is Thomas better suited in graduate school: biology or mathematics?

It is clear that Thomas scored above average on both exams, and the differences in points were 7 and 2, so it would seem that Thomas is more suited for biology. But points on the test are not the correct measuring stick. In terms of standard deviations, in biology Thomas was 1.75 standard deviations above the mean, or equivalently in the top 4.02% of the test takers. But in math he was 2 standard deviations above, or in the top 2.29%, so Thomas scored higher on the mathematics test, and his talent in that subject is more precious than his biological skills.

Example 4. A coffee dispensing machine can be regulated so that it discharges an average of $\mu$ ounces per cup. The ounces of fill are normally distributed with $\sigma = 0.3$ ounces. How should we choose $\mu$ so the machine overflows an 8-ounce cup at most 1% of the time? That is, we want $P(X \ge 8) \le .01$, where $X$ is the number of ounces of fill. We want 8 ounces to be at least 2.33 standard deviations above the mean in order to warrant the 1% margin of error. So $\frac{8-\mu}{.3} \ge 2.33$, that is, $\mu \le 8 - 2.33(.3) = 7.301$, and so $\mu = 7.301$ is the largest acceptable mean.

We return to the original question that motivated De Moivre to discover the normal curve: the approximation to the binomial.

Example 5. What is the probability of having between 4,950 and 5,050 heads when we toss a coin 10,000 times? As before, we can readily write an answer, but what numerical estimate we can attach to it is a different story. The answer is
\[ \sum_{k=4{,}950}^{5{,}050} \binom{10{,}000}{k} \frac{1}{2^{10{,}000}}. \]
Now we can use the normal to approximate the binomial. But what is $\mu$, the mean, and what is $\sigma$, the standard deviation? We know that $\mu = 5000$ and $\sigma = \sqrt{10000 \cdot \frac12 \cdot \frac12} = \sqrt{2500} = 50$, and our question simply becomes: what is the probability of being within one standard deviation (either side) of the mean? So it is approximately 68.25%.
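With a computer, the exact binomial sum is no longer out of reach, so the approximation can be compared against it. The sketch below is an illustration rather than part of the original argument; it relies on Python's arbitrary-precision integers, and the variable names are my own. The two numbers need not agree to all digits, since the normal estimate ignores the discreteness of the count.

```python
from math import comb, sqrt
from statistics import NormalDist

n, lo, hi = 10_000, 4_950, 5_050

# Exact binomial probability: sum the binomial coefficients as exact integers,
# then divide once by 2**n (Python handles the huge integers correctly).
favorable = sum(comb(n, k) for k in range(lo, hi + 1))
exact = favorable / 2**n

# Normal approximation: within one standard deviation of the mean.
mu, sigma = n / 2, sqrt(n * 0.5 * 0.5)
Z = NormalDist()
approx = Z.cdf((hi - mu) / sigma) - Z.cdf((lo - mu) / sigma)

print(f"exact binomial: {exact:.4f}")
print(f"normal approx : {approx:.4f}")
```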

Example 6. Quality Control. A company manufactures perfume sprayers. They consider that 5% of their production is defective. A random sample of 600 sprayers is tested. In that sample we then expect 30 defective ones. We also have that $\sigma = \sqrt{600 \cdot .05 \cdot .95} \approx 5.34$ sprayers. What is the probability then of each of the following?
① At least 35 defectives? Since we are $\frac{5}{\sigma} = .936$ standard deviations above the mean, the probability is 17.62%.
② Between 25 and 35 defectives? Easily, we have 64.76%.
③ Suppose we got 50 defectives in our sample: should we worry that perhaps our defectives amount to more than 5%? Since the probability of being 3.7463 standard deviations above the mean is only .02%, we can consider our production to be more defective than we claimed.


Example 7. University Admissions. Suppose you are in charge of admissions at a small
college. From past history, you know that 30% of the students admitted decide to attend
your college. You have room for 200 students this Fall. There are 800 applications. If you
want to admit as many students as possible, but have less than a 5% chance of
overflowing, how many students should you admit? Let X be the number of students that
come to the university. We need P ( X ≥ 201) ≤ 0.05 . The probability is computed using
the normal, and so as long as the number of standard deviations away from the mean is at
least 1.645, we will have at most a 5% chance of exceeding our enrollment. Below is a
table for different values of admitted students:

n = Admitted   E(X)    201 − E(X)   σ = √(n · .3 · .7)   # of standard deviations
Students                                                  away from the mean
600            180     21           11.22497              1.87082
601            180.3   20.7         11.23432              1.84256
602            180.6   20.4         11.24366              1.81435
603            180.9   20.1         11.25300              1.78619
604            181.2   19.8         11.26233              1.75807
605            181.5   19.5         11.27165              1.73000
606            181.8   19.2         11.28096              1.70198
607            182.1   18.9         11.29026              1.67400
608            182.4   18.6         11.29956              1.64608
609            182.7   18.3         11.30885              1.61820

And so we conclude we should admit 608 students.
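The table lookup can also be phrased as a small search. This is a minimal sketch under the same normal approximation used above; the function names `z_margin` and `max_admits` are hypothetical.

```python
from math import sqrt


def z_margin(n, capacity=200, p=0.3):
    """Standard deviations between the expected enrollment and capacity + 1."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return (capacity + 1 - mean) / sd


def max_admits(applications=800, z_required=1.645):
    """Largest number of admits keeping the overflow chance at most 5%."""
    best = 0
    for n in range(1, applications + 1):
        if z_margin(n) >= z_required:
            best = n
    return best


print(max_admits())
print(round(z_margin(608), 5), round(z_margin(609), 5))
```

The last line prints the margins for 608 and 609 admits, the two rows of the table that straddle the 1.645 threshold.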

Example 8. Sampling. Suppose that 40% of the adult population of Menaville attends religious services in any given week. What is the probability that, when we interview 1785 adults in the city of Menaville and ask them whether they attended religious services that week or not, between 37% and 43% of them will say yes (a 3% margin of error)? Here $\sigma = \sqrt{1785 \cdot .4 \cdot .6} \approx 20.7$. We expect 714 people, and we want to know the probability $P(660 \le X \le 768)$, which readily translates to $P(-2.58 \le Z \le 2.58) = 98.97\%$, and so we have a high degree of confidence.

Example 9. Sampling. Suppose that 40% of the adult population of Menaville attends religious services in any given week. What is the probability that, when we interview 1000 adults in the city of Menaville and ask them whether they attended religious services that week or not, between 37% and 43% of them will say yes (a 3% margin of error)? Here $\sigma = \sqrt{1000 \cdot .4 \cdot .6} \approx 15.49$. Let $X = B_{1000,.4}$ be the random variable that counts the number of people attending services. We know $E(X) = 400$, and we want to know the probability $P(370 \le X \le 430)$. This readily translates to $P(-1.93 \le Z \le 1.93) = 94.6\%$, and so we have a high degree of confidence.

How many people would we have to interview if we wanted a 99% degree of certainty? Since $P(-2.59 \le Z \le 2.59) \ge 99\%$, we need $.03n \ge 2.59\sigma = 2.59\sqrt{.24n}$. This simplifies to
\[ \sqrt{n} \ge \frac{2.59}{.03}\sqrt{.24} \approx 42.29, \]
so any $n \ge 1{,}789$ would be sufficient.
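The sample-size computation generalizes directly to other proportions, margins and confidence levels. A minimal sketch, with the hypothetical helper name `required_sample_size` and the value 2.59 taken from the text as the default:

```python
from math import ceil, sqrt


def required_sample_size(p=0.4, margin=0.03, z=2.59):
    """Smallest n with margin * n >= z * sqrt(n * p * (1 - p))."""
    root_n = z * sqrt(p * (1 - p)) / margin
    return ceil(root_n**2)


print(required_sample_size())
```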

❺ The Utilitarian Exponential

The previous two sections were dedicated to the normal distribution which is symmetric
about its mean. However, many distributions can take values that are only on one side of
the axis, positive for example—certainly Z 2 , or any X 2 for that matter, would be such a
random variable. In this section we look at a popular special case of a major family we
will discuss in a later chapter. The exponential random variable is a special case of the
gamma random variable, but it is an important and useful special case of that variable, so
we will isolate its discussion.

Throughout this section $\beta$ will denote a positive real number. Then the random variable with density
\[ f(y) = \begin{cases} \frac{1}{\beta}\, e^{-y/\beta} & y > 0 \\ 0 & \text{otherwise} \end{cases} \]
is known as an exponential random variable with parameter $\beta$, and will be denoted by $X_\beta$.

[Figure: graphs of three such densities, with $\beta = 1$, $2$ and $4$, for $0 \le y \le 8$]

Before we discuss examples, we compute the basic properties of the exponential. Its distribution, to start with, has a particularly nice form:
\[ F_{X_\beta}(y) = \int_0^y \frac{1}{\beta}\, e^{-t/\beta}\, dt = -e^{-t/\beta}\, \Big|_0^y = 1 - e^{-y/\beta}. \]

Thus $P(X_\beta > y) = e^{-y/\beta}$, and this is perhaps the simplest of all descriptions of the exponential random variable.

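Clearly, the mode of $X_\beta$ is 0 (the density only decays from there), and the median can be found by solving $F_{X_\beta}(y) = \frac12$, so the median is given by $(\ln 2)\beta$. A simple integration by parts provides
\[ E(X_\beta) = \int_0^\infty y\, \frac{1}{\beta}\, e^{-y/\beta}\, dy = -y e^{-y/\beta}\, \Big|_0^\infty + \int_0^\infty e^{-y/\beta}\, dy = 0 - \beta e^{-y/\beta}\, \Big|_0^\infty = \beta. \]

Similarly,
\[ E(X_\beta^2) = \int_0^\infty y^2\, \frac{1}{\beta}\, e^{-y/\beta}\, dy = -y^2 e^{-y/\beta}\, \Big|_0^\infty + \int_0^\infty 2y e^{-y/\beta}\, dy = 0 + 2\beta E(X_\beta) = 2\beta^2, \]
so $V(X_\beta) = 2\beta^2 - \beta^2 = \beta^2$, and so $\sigma = \beta$.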

The following example is typical of the exponential random variable.

Example 1. A certain machine uses electronic components each of which has a lifetime in hours given by $X_{50}$. Thus each component has an expected lifetime of 50 hours (with a standard deviation of 50 hours!). On the other hand, the median lifetime of a component is only $50\ln 2 \approx 34.66$ hours. We also have that the probability a component will last less than 50 hours is $P(X_{50} \le 50) = 1 - e^{-1} \approx .6321$, so $P(X_{50} \ge 50) = e^{-1} \approx .3679$. If we ask what the probability is that a component will last 100 hours if it has already lasted 50, we surprisingly get
\[ P(X_{50} \ge 100 \mid X_{50} \ge 50) = \frac{P(X_{50} \ge 100)}{P(X_{50} \ge 50)} = \frac{e^{-2}}{e^{-1}} = e^{-1}. \]
We will see below this is a typical property of the exponential.

Suppose the machine has 5 of those components acting independently, and that it needs 3 of them to work in order for it to operate. What is the probability the machine will successfully operate for 100 consecutive hours? What we need now is to switch our thinking to a Bernoulli where success is a component lasting at least 100 hours. But that is easy: as we saw before, $P(X_{50} \ge 100) = e^{-2}$, so we are considering $B_{e^{-2}}$ as our Bernoulli, and we have 5 of them, so we are now in a binomial, $B_{5,e^{-2}}$, and we want at least 3 successes. Easily then we get this probability to be very close to 2%.
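The numbers in this example are easy to reproduce. The sketch below is illustrative only; `survival` is a hypothetical helper implementing $P(X_\beta \ge y) = e^{-y/\beta}$.

```python
from math import comb, exp, log

beta = 50.0  # mean lifetime in hours


def survival(y, beta=beta):
    """P(X_beta >= y) = exp(-y / beta)."""
    return exp(-y / beta)


median = beta * log(2)
conditional = survival(100) / survival(50)  # P(X >= 100 | X >= 50)

# The machine works if at least 3 of its 5 independent components last 100 hours.
p = survival(100)
machine_ok = sum(comb(5, k) * p**k * (1 - p) ** (5 - k) for k in range(3, 6))

print(f"median lifetime       : {median:.2f} hours")
print(f"P(X>=100 | X>=50)     : {conditional:.4f}")
print(f"P(machine lasts 100 h): {machine_ok:.4f}")
```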

Example 2. Waiting on a Poisson. Suppose that the number of customers entering a store in a given day is a Poisson random variable with parameter 12, $P_{12}$ (this is a very expensive jewelry store so few people come in). Assume that there is no such thing as Christmas seasons, or sales, so any two periods of time behave independently. We saw before that the sum of two independent Poissons is a Poisson also; thus if we look at the number of customers in two days, we get $P_{24}$, and in three days, $P_{36}$, while in half a day we have $P_6$, and $P_4$ for a third of a day. Thus, in $t$ days, we get $P_{12t}$ as the random variable representing the number of customers going into the store.

If we let $T$ denote the waiting time until the first customer comes to the store, where time is measured in days, then $T$ is a continuous random variable. But $T > t$ means that no customer arrived during the first $t$ days, so $P(T > t) = P(P_{12t} = 0) = e^{-12t}$, and
\[ P(T \le t) = 1 - e^{-12t} = F_{X_{1/12}}(t), \]
and so we get that $T = X_{1/12}$.

The exponential has a unique and distinguishing characteristic, no memory. Once it has
lasted a certain amount of time, the probability it will last a specific time longer is the
same as the probability it would have lasted that specific time to start with!

Theorem. Memorylessness. Let $X_\beta$ be an exponential random variable. Then for any positive $t$ and $s$, $P(X_\beta \ge s + t \mid X_\beta \ge t) = P(X_\beta \ge s)$.

Proof. It is immediate:
\[ P(X_\beta \ge s + t \mid X_\beta \ge t) = \frac{P(X_\beta \ge s + t)}{P(X_\beta \ge t)} = \frac{e^{-(s+t)/\beta}}{e^{-t/\beta}} = e^{-s/\beta} = P(X_\beta \ge s). \]
∎

One can eventually prove that the exponential is the only random variable that satisfies this memorylessness property.

Example 3. You walk into a UPS shipping company office just before it closes. The
office has 2 windows with employees serving customers. However, there is a line of 3
people waiting to be served in addition to the two already at the windows. Assume that
the time a customer stays at a window is given by X 10 , measured in minutes. What is the
probability you will be the last customer to leave the office? By the time you get to a
window, the other window will also be occupied with probability 1. Since the
exponential has no memory, you and the person at the other window have equal chances
of finishing first, so your chances are 50%. We will revisit this example in the next
chapter.

We end the section with a brief discussion of failure rates. Let $X$ be a random variable that represents the lifetime of some machine. Thus we assume $X$ has positive range, with density $f$ and distribution $F$. Then the failure rate of $X$ is defined by
\[ \kappa(t) = \frac{f(t)}{1 - F(t)}. \]
The reason for the name is that it measures the proportion of failures in a small interval after time $t$, given that the machine has lasted up to that time. The following explains further:
\[ \lim_{h \to 0} \frac{P(t < X < t + h \mid X > t)}{h} = \lim_{h \to 0} \frac{P(t < X < t + h)}{h\,P(X > t)} = \frac{1}{1 - F(t)} \lim_{h \to 0} \frac{F(t + h) - F(t)}{h} = \frac{f(t)}{1 - F(t)}. \]

In fact, for the exponential, we have
\[ \kappa(t) = \frac{f(t)}{1 - F(t)} = \frac{\frac{1}{\beta}\, e^{-t/\beta}}{e^{-t/\beta}} = \frac{1}{\beta}, \]
a constant. This is another indication of the memorylessness of the exponential.

Knowing the failure rate one can derive the distribution:
\[ \kappa(t) = \frac{f(t)}{1 - F(t)} = \frac{\frac{dF}{dt}}{1 - F(t)}, \]
so integrating both sides, we get $\ln(1 - F(t)) = -\int_0^t \kappa(s)\,ds + c$, and since $F(0) = 0$, we get $c = 0$. Then by exponentiating both sides, we get $1 - F(t) = e^{-\int_0^t \kappa(s)\,ds}$, and so
\[ F(t) = 1 - e^{-\int_0^t \kappa(s)\,ds}, \]
so
\[ f(t) = \kappa(t)\, e^{-\int_0^t \kappa(s)\,ds}. \]
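The passage from a failure rate to a survival probability can be sketched numerically. The code below is only an illustration: the function name `survival_from_failure_rate` is hypothetical, and the trapezoid rule is one simple choice of numerical integration among many.

```python
from math import exp


def survival_from_failure_rate(kappa, t, steps=10_000):
    """Approximate 1 - F(t) = exp(-integral_0^t kappa(s) ds) by the trapezoid rule."""
    h = t / steps
    integral = 0.5 * (kappa(0.0) + kappa(t))
    integral += sum(kappa(i * h) for i in range(1, steps))
    integral *= h
    return exp(-integral)


beta = 50.0


def constant_rate(s):
    """Failure rate of the exponential X_beta: the constant 1 / beta."""
    return 1.0 / beta


# A constant failure rate should give back the exponential survival function.
print(survival_from_failure_rate(constant_rate, 100.0))
print(exp(-100.0 / beta))
```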

Example 4. A supplier claims its new version of a product has half the failure rate of the old product. That is, the claim is $\kappa_n(t) = .5\kappa_o(t)$. From experience one knows that $\frac{8}{9}$ of the old machines failed to reach 100 days once they had lasted 80. How would the new machines act under the same assumption of having lasted 80 days already?

We know that
\[ \frac{1}{9} = P(X_o \ge 100 \mid X_o \ge 80) = \frac{1 - F_o(100)}{1 - F_o(80)} = \frac{e^{-\int_0^{100} \kappa_o(s)\,ds}}{e^{-\int_0^{80} \kappa_o(s)\,ds}} = e^{-\int_{80}^{100} \kappa_o(s)\,ds}. \]
By the same reasoning, $P(X_n \ge 100 \mid X_n \ge 80) = e^{-\int_{80}^{100} \kappa_n(s)\,ds}$. But $\kappa_n(t) = .5\kappa_o(t)$, so
\[ e^{-\int_{80}^{100} \kappa_n(s)\,ds} = e^{-\int_{80}^{100} .5\kappa_o(s)\,ds} = \sqrt{e^{-\int_{80}^{100} \kappa_o(s)\,ds}} = \sqrt{\tfrac{1}{9}} = \tfrac{1}{3}. \]

So a full third of the new machines will last to 100 days. Thus the net result of halving the failure rate was to take the square root of the probability of survival!

Chapter 3
All for One

❶ Joint Distributions

In this chapter we finally start looking at arbitrary multivariable situations. We now allow
ordered tuples of variables, and we start with some simple examples to clarify the
meaning of joint distributions.

Example 1. Suppose a bowl contains 10 red balls, 8 blue balls and 6 white balls. Four balls are chosen simultaneously from the bowl. We let $R$ denote the number of red balls among the four and we let $B$ denote the number of blue balls among the four. We consider now the ordered pair $(R, B)$ of random variables, and ask what values this ordered pair can take and with what probabilities. The easiest way to describe them is via a table (or matrix), where the common denominator of $\binom{24}{4} = 10{,}626$ has purposefully been left out:

R\B          0     1     2     3     4     Row Sum
0            15    160   420   336   70    1001
1            200   1200  1680  560   0     3640
2            675   2160  1260  0     0     4095
3            720   960   0     0     0     1680
4            210   0     0     0     0     210
Column Sum   1820  4480  3360  896   70

And this table is then known as the joint distribution of the pair $(R, B)$. To explain the entries in the table, we choose the entry corresponding to $R = 1$ and $B = 2$. The number of ways of choosing the four balls, since we have to choose 1 red, 2 blue and necessarily 1 white ball respectively, is $\binom{10}{1}\binom{8}{2}\binom{6}{1} = 10 \times 28 \times 6 = 1680$.

As an immediate benefit of having all this information, we get what is known as the marginal distributions. Namely these are the individual distributions of the single variables $R$ and $B$ respectively, and we obtain these by looking at the row sums and column sums of our matrix. Thus the distribution for $B$ is given by

B   0     1     2     3    4
P   1820  4480  3360  896  70

with of course the same denominator of 10626. Observe this is simply the random variable $H_{24,8,4}$.

Naturally, once we have the joint distribution we can ask almost any question involving the random variables such as the following 3 queries:

① $P(R + B \ge 3)$ is $\frac{7956}{10626} \approx 74.87\%$, since the only positions on the table that are relevant are the ones given below, and they add to 7956.

R\B   0     1     2     3     4
0     ·     ·     ·     336   70
1     ·     ·     1680  560   ·
2     ·     2160  1260  ·     ·
3     720   960   ·     ·     ·
4     210   ·     ·     ·     ·

Similarly, if our quest is
② $P(R \ge B)$, then the positions are given by

R\B   0     1     2     3     4
0     15    ·     ·     ·     ·
1     200   1200  ·     ·     ·
2     675   2160  1260  ·     ·
3     720   960   ·     ·     ·
4     210   ·     ·     ·     ·

with a total of 7400, and a consequent probability of $\frac{7400}{10626} \approx 69.64\%$.

Finally, to compute
③ $P(R \ge 2 \mid B \le 1)$,
we need to compute $P(R \ge 2 \text{ and } B \le 1)$ and $P(B \le 1)$. Note that for the latter we can simply use the distribution of $B$ to get $P(B \le 1) = \frac{6300}{10626}$, and for the former we get $\frac{4725}{10626}$, so our answer is $\frac{4725}{6300} = 75\%$.

Of course, these two random variables are not independent since the number of red balls
certainly has an effect on the number of blue balls—but even more directly from the table
one can see that there are zeroes on the table, yet there are no zeroes on the marginals,
and hence it is not possible for the entry inside the table to be the product of the
respective entries on the marginals, which is what independence means.

In fact, we expect these two random variables not only to not be independent, but to be negatively correlated: in other words, the larger one of them is, the smaller the other one tends to be. To recall, the covariance of two random variables $X$ and $Y$ is defined by
\[ \operatorname{cov}(X, Y) = E(XY) - E(X)E(Y). \]
Then their correlation (also called index of correlation) is simply
\[ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}. \]
It is a fact that this number is always between −1 and 1. If positive, then the variables are said to be positively correlated, while if negative, they are negatively correlated. Note that the covariance alone determines the sign of the correlation.

We will compute the distribution of the random variable $RB$, which has range 0, 1, 2, 3 and 4 with corresponding probabilities:

RB   0       1       2       3       4
P    0.2640  0.1129  0.3613  0.1430  0.1185

so then easily $E(RB) = 1.7391$. On the other hand, $E(R) = E(H(24, 10, 4)) = \frac{5}{3} \approx 1.667$ while $E(B) = E(H(24, 8, 4)) = \frac{4}{3} \approx 1.333$, so $E(R)E(B) = \frac{20}{9} \approx 2.222$, which is bigger than $E(RB)$. Hence the covariance is negative, as expected.
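The whole joint table, its marginals and the covariance can be generated by counting. The following is a minimal sketch with exact fractions; the names `joint` and `expect` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction
from math import comb

RED, BLUE, WHITE, DRAWN = 10, 8, 6, 4
TOTAL_WAYS = comb(RED + BLUE + WHITE, DRAWN)  # 24 choose 4

# joint[(r, b)] = P(R = r, B = b); the remaining balls drawn are white.
joint = {}
for r in range(DRAWN + 1):
    for b in range(DRAWN + 1 - r):
        ways = comb(RED, r) * comb(BLUE, b) * comb(WHITE, DRAWN - r - b)
        joint[(r, b)] = Fraction(ways, TOTAL_WAYS)


def expect(g):
    """Expected value of g(R, B) under the joint distribution."""
    return sum(g(r, b) * p for (r, b), p in joint.items())


e_r = expect(lambda r, b: r)
e_b = expect(lambda r, b: b)
e_rb = expect(lambda r, b: r * b)
cov = e_rb - e_r * e_b
p_sum_ge_3 = sum(p for (r, b), p in joint.items() if r + b >= 3)

print(float(e_r), float(e_b), float(e_rb))
print("covariance:", float(cov))
print("P(R + B >= 3):", float(p_sum_ge_3))
```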

In a similar fashion to the discrete example above, we can consider a continuous situation
where instead of the table we have a region of the plane consisting of the range of
ordered pairs of the two variables X and Y, for example, and then their joint density
would consist of a function defined above the region in such a way that the total volume
involved would naturally have to be 1.

Example 2. Consider two random variables $X$ and $Y$ whose range is the closed first quadrant, $x \ge 0$ and $y \ge 0$, and whose joint density is given by the function
\[ f(x, y) = c\,e^{-x} e^{-2y} \]
(on that range and 0 otherwise). First of all, in order to be a joint density, the total volume under this surface must be 1, so we must have $\int_0^\infty \int_0^\infty c\,e^{-x} e^{-2y}\,dx\,dy = 1$, and since easily $\int_0^\infty \int_0^\infty e^{-x} e^{-2y}\,dx\,dy = \frac12$, we must have that $c = 2$.

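Now again, we can ask all sorts of queries:

① $P(X > 1, Y < 2)$.
This is computed by taking the integral over the relevant region (to the right of the line $x = 1$ and below the line $y = 2$), so it is given by
\[ \int_0^2 \int_1^\infty 2e^{-x} e^{-2y}\,dx\,dy = \int_0^2 2e^{-2y} \left( -e^{-x}\, \Big|_1^\infty \right) dy = e^{-1} \int_0^2 2e^{-2y}\,dy = e^{-1} \left( -e^{-2y}\, \Big|_0^2 \right) = e^{-1}\left(1 - e^{-4}\right) \approx 0.3611. \]

② $P(X < Y)$.
Again the key is to integrate over the relevant region. In this case, it is the part of the quadrant above the line $y = x$. Thus the probability is given by
\[ \int_0^\infty \int_0^y 2e^{-x} e^{-2y}\,dx\,dy = \int_0^\infty 2e^{-2y} \left( -e^{-x}\, \Big|_0^y \right) dy = \int_0^\infty 2\left(1 - e^{-y}\right) e^{-2y}\,dy = \int_0^\infty \left( 2e^{-2y} - 2e^{-3y} \right) dy = -e^{-2y}\, \Big|_0^\infty + \tfrac{2}{3} e^{-3y}\, \Big|_0^\infty = \tfrac{1}{3}. \]

③ $P(X + Y \le 2)$.
The region now lies below the line $x + y = 2$. We now have
\[ \int_0^2 \int_0^{2-y} 2e^{-x} e^{-2y}\,dx\,dy = \int_0^2 2e^{-2y} \left( -e^{-x}\, \Big|_0^{2-y} \right) dy = \int_0^2 \left( 2e^{-2y} - 2e^{-y-2} \right) dy = 1 - 2e^{-2} + e^{-4} \approx .7476. \]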

But again, having the joint density allows us to compute the marginal densities, which are the respective densities of the individual random variables $X$ and $Y$:
\[ f_X(x) = \int_0^\infty 2e^{-x} e^{-2y}\,dy = e^{-x} \]
and so $X = X_1$, an exponential random variable with parameter $\beta = 1$. Similarly,
\[ f_Y(y) = \int_0^\infty 2e^{-x} e^{-2y}\,dx = 2e^{-2y}, \]
which exposes $Y$ as an exponential too, $X_{1/2}$.

And we readily observe that the joint density is the product of the two marginals,
\[ f(x, y) = f_X(x) f_Y(y). \]

We previously defined that $X$ and $Y$ are independent if for all numbers $a$ and $b$, $P(X \le a \text{ and } Y \le b) = P(X \le a)P(Y \le b)$. In other words, $X$ and $Y$ are independent if the events $X \le a$ and $Y \le b$ are always independent events. We use this example to show that in the case of continuous random variables, independence is equivalent to the joint density being the product of the marginals. For one direction,
\begin{align*}
P(X \le a \text{ and } Y \le b) &= \int_{-\infty}^{b} \int_{-\infty}^{a} f(x, y)\,dx\,dy = \int_{-\infty}^{b} \int_{-\infty}^{a} f_X(x) f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^{b} f_Y(y)\,dy \int_{-\infty}^{a} f_X(x)\,dx = P(X \le a)P(Y \le b).
\end{align*}
The equality between the two rows is achieved by invoking the continuous distributivity of multiplication over addition. The other direction follows from similar considerations.
From this we can argue that

Thus we can assert that the random variables of Example 2 are independent.
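Since $X$ and $Y$ turn out to be independent exponentials, the three probabilities computed in Example 2 can be sanity-checked by simulation. This is a rough Monte Carlo sketch, not part of the original argument; it uses the standard-library `random.Random.expovariate`, whose argument is the rate (1 for $X$, 2 for $Y$), and the sample size and seed are arbitrary choices.

```python
import random
from math import exp


def simulate(n_samples=200_000, seed=0):
    """Estimate the three probabilities of Example 2 by sampling (X, Y)."""
    rng = random.Random(seed)
    hits = {"X>1,Y<2": 0, "X<Y": 0, "X+Y<=2": 0}
    for _ in range(n_samples):
        x = rng.expovariate(1.0)  # density e^{-x}
        y = rng.expovariate(2.0)  # density 2e^{-2y}
        hits["X>1,Y<2"] += (x > 1 and y < 2)
        hits["X<Y"] += (x < y)
        hits["X+Y<=2"] += (x + y <= 2)
    return {name: count / n_samples for name, count in hits.items()}


# Closed-form answers from the integrals in Example 2.
exact = {
    "X>1,Y<2": exp(-1) * (1 - exp(-4)),
    "X<Y": 1 / 3,
    "X+Y<=2": 1 - 2 * exp(-2) + exp(-4),
}

estimates = simulate()
for name in exact:
    print(name, round(estimates[name], 4), round(exact[name], 4))
```

The estimates fluctuate with the seed, but with this many samples they should agree with the integrals to about two decimal places.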

Having this characterization of independent random variables, we can then easily show that independent random variables are uncorrelated; in other words, their covariance is 0. But we should remind ourselves that variables can be uncorrelated without necessarily being independent. The brief argument:
\begin{align*}
E(XY) &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\,f(x, y)\,dx\,dy = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy\,f_X(x) f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x f_X(x)\,dx \int_{-\infty}^{\infty} y f_Y(y)\,dy = E(X)E(Y).
\end{align*}

Thus, $\operatorname{cov}(X, Y) = E(XY) - E(X)E(Y) = 0$.

In a future section we will see how to compute the density of functions of random variables such as $X + Y$ or $XY$. But we can already discuss

Example 3. The Sum of Uniforms. We actually consider the sum of two standard uniforms. So let $U_1$ and $U_2$ be two independent standard uniforms, and $X = U_1 + U_2$. Then their joint density is the product of the two marginals, so $f(x_1, x_2) = 1$ on the unit square and 0 elsewhere.

What then is the probability that $X \le a$? Certainly $0 \le X \le 2$. But if we wanted to find $P(X \le \frac12)$, how would we go about it? Now we can simply ask which points of the unit square satisfy $X \le \frac12$. To graph an inequality one graphs the equation, and sees which side of the line satisfies the inequality. So to graph $X \le \frac12$, we graph the equation $x_1 + x_2 = \frac12$, and then shade the side containing the origin. We obtain the shaded region, which has area $\frac18$, so now we can answer $P(X \le \frac12) = \frac18$.

[Figure: the unit square with the triangle below the line from $(\frac12, 0)$ to $(0, \frac12)$ shaded]

To do this in general, we need to compute $P(X \le a)$ for any $0 \le a \le 2$. The pictures speak for themselves:

[Figure: two unit squares. For $0 \le a \le 1$ the shaded triangle has a vertex at $(a, 0)$; for $1 \le a \le 2$ the shaded region is bounded by the line through $(1, a - 1)$]

and so we have that
\[ P(X \le a) = \begin{cases} \dfrac{a^2}{2} & 0 \le a \le 1 \\[2mm] 1 - \dfrac{(2 - a)^2}{2} & 1 \le a \le 2. \end{cases} \]
So for the density, we obtain
\[ f_X(a) = \begin{cases} a & 0 \le a \le 1 \\ 2 - a & 1 \le a \le 2 \\ 0 & \text{otherwise} \end{cases} \]
whose graph is

[Figure: graph of the triangular density $f_X$, rising from 0 at $a = 0$ to 1 at $a = 1$ and falling back to 0 at $a = 2$]

And so again when we add random variables, interesting things happen, and in fact, a bulge in the middle occurred in the sum of two uniforms, just as in the sum of two rolls of the dice.
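The piecewise formula for $P(X \le a)$ can be written down directly and compared with a simulation. A minimal sketch follows; the function names are hypothetical, and the sample size and seed are arbitrary.

```python
import random


def cdf_sum_of_two_uniforms(a):
    """P(U1 + U2 <= a) for independent standard uniforms U1 and U2."""
    if a <= 0:
        return 0.0
    if a <= 1:
        return a * a / 2            # area of the triangle below x1 + x2 = a
    if a <= 2:
        return 1 - (2 - a) ** 2 / 2  # unit square minus the upper triangle
    return 1.0


def empirical_cdf(a, n_samples=200_000, seed=1):
    """Fraction of simulated sums U1 + U2 that are at most a."""
    rng = random.Random(seed)
    hits = sum(rng.random() + rng.random() <= a for _ in range(n_samples))
    return hits / n_samples


for a in (0.5, 1.0, 1.5):
    print(a, cdf_sum_of_two_uniforms(a), round(empirical_cdf(a), 4))
```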

Closely associated with the sum of two random variables is their average. Specifically, rather than $X = U_1 + U_2$, we would consider $Y = \frac12(U_1 + U_2)$. Since $P(Y \le y) = P(X \le 2y)$, we have $f_Y(y) = 2f_X(2y)$, that is,
\[ f_Y(y) = \begin{cases} 4y & 0 \le y \le \frac12 \\ 4(1 - y) & \frac12 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
The following table of values should give a sense of the distribution. One can observe,
for example, how much more often Y is around the middle rather than the extremes:

U1 U2 Y U1 U2 Y U1 U2 Y
0.926023 0.687290 0.806657 0.869625 0.801620 0.835623 0.360271 0.433231 0.396751
0.920365 0.112898 0.516631 0.810914 0.199639 0.505277 0.639698 0.688845 0.664272
0.980040 0.188716 0.584378 0.342260 0.545219 0.443740 0.594755 0.296755 0.445755
0.591154 0.930574 0.760864 0.179582 0.882815 0.531199 0.651303 0.661886 0.656595
0.380800 0.046967 0.213883 0.170059 0.358107 0.264083 0.923143 0.402448 0.662796
0.469614 0.288062 0.378838 0.971122 0.598000 0.784561 0.114371 0.444630 0.279500
0.268003 0.265989 0.266996 0.538538 0.189569 0.364053 0.142091 0.689431 0.415761
0.490173 0.196052 0.343112 0.535925 0.530224 0.533074 0.989111 0.270230 0.629671
0.641882 0.138761 0.390322 0.331605 0.908189 0.619897 0.597512 0.371834 0.484673
0.987128 0.796948 0.892038 0.050085 0.091245 0.070665 0.523877 0.203685 0.363781
0.036852 0.138046 0.087449 0.717419 0.981859 0.849639 0.178458 0.851269 0.514864
0.961909 0.121547 0.541728 0.972951 0.560992 0.766972 0.415218 0.064349 0.239783

In a later section, we will see how to compute the sum of three or more independent
standard uniforms.
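The table above can be extended by simulation. The following is a minimal sketch, assuming only Python's standard `random` module; the seed, the sample size and the helper names are arbitrary choices made for illustration. It counts how often $Y$ lands in each quarter of $[0, 1]$; for the average of two independent standard uniforms the four quarters should hold about 1/8, 3/8, 3/8 and 1/8 of the draws.

```python
import random

def average_of_two_uniforms(rng):
    """Return one draw of Y = (U1 + U2) / 2 for independent standard uniforms."""
    return 0.5 * (rng.random() + rng.random())

def bin_counts(samples, bins=4):
    """Count how many samples fall in each of `bins` equal subintervals of [0, 1]."""
    counts = [0] * bins
    for y in samples:
        counts[min(int(y * bins), bins - 1)] += 1
    return counts

rng = random.Random(2024)  # fixed seed (arbitrary) so the run is repeatable
samples = [average_of_two_uniforms(rng) for _ in range(100_000)]
fractions = [c / len(samples) for c in bin_counts(samples)]

# The two middle quarters should each hold roughly 3/8 of the draws,
# the two outer quarters roughly 1/8 each.
print(fractions)
```

The bulge in the middle shows up as the two inner fractions being about three times the outer ones.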

We now discuss another discrete example involving three variables:

Example 4. The Multinomial Distribution. It is known that, nationally, 73% of all residential fires are in family homes, 20% are in apartments, and 7% are in other types of dwellings. Suppose 4 fires are recorded. Consider the random variables H, A and D that record the number of family homes, apartments and dwellings, respectively, among the four fires. We also assume the four fires occurred independently. If we use a triple to indicate the values of H, A and D respectively, then we will have the following 15 situations with nonzero probability:

H  A  D  P
4  0  0  0.28398
3  1  0  0.31121
3  0  1  0.10892
2  2  0  0.12790
2  1  1  0.08953
2  0  2  0.01567
1  3  0  0.02336
1  2  1  0.02453
1  1  2  0.00858
1  0  3  0.00100
0  4  0  0.00160
0  3  1  0.00224
0  2  2  0.00118
0  1  3  0.00027
0  0  4  0.00002

Of course, the computation of the probability is simple, for example
$$P(H = 2, A = 1, D = 1) = \binom{4}{2,1,1} (.73)^2 (.20)(.07) \approx .08953,$$

since we have to choose two fires to be at the homes, and 1 fire at an apartment, and the remaining fire for a dwelling, thus the term $\binom{4}{2,1,1}$, which is known as a multinomial coefficient and equals $\dfrac{4!}{2!\,1!\,1!}$. The rest of the computation is just as in the binomial (Bernoulli) situation.
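The whole table of 15 probabilities can be rebuilt from this formula. The sketch below assumes nothing beyond the standard library; the function name `multinomial_pmf` is a hypothetical helper introduced for illustration.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """P(counts) for independent trials whose categories have probabilities `probs`."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)          # multinomial coefficient n! / (c1! c2! ...)
    p = float(coef)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

probs = (0.73, 0.20, 0.07)   # homes, apartments, other dwellings
n = 4                        # number of fires

# Enumerate every triple (h, a, d) with h + a + d = 4.
table = {
    (h, a, n - h - a): multinomial_pmf((h, a, n - h - a), probs)
    for h in range(n, -1, -1)
    for a in range(n - h, -1, -1)
}

for triple, p in table.items():
    print(triple, round(p, 5))
```

Summing the entries with $h = 1$ gives the marginal probability $P(H = 1)$, which can be compared with the binomial value $4(.73)(.27)^3$.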

The three marginals in this case are nothing but binomials. For example, to compute $P(H = 1)$, we would need to add all the situations when $H = 1$, and there are 4 such possibilities: $(1,3,0)$, $(1,2,1)$, $(1,1,2)$ and $(1,0,3)$, giving us a grand total of 0.0575, which also equals $P(B_{4,.73} = 1)$. The reason why these marginals are binomials is simple: in order to consider only the random variable H, all other fires have become non-H, so now each fire is simply H or not H, success vs. failure. Thus, we know $E(H) = 4 \times .73 = 2.92$ and $V(H) = 4 \times .73 \times .27 = .7884$. Perhaps the only interesting new information is the covariance of the individual variables, for example $\operatorname{cov}(H, A)$. To compute this all we need is $E(HA)$ since we already have the individual expectations, $E(H) = 4 \times .73$ and $E(A) = 4 \times .20$, respectively. To compute $E(HA)$ we return to the root of all binomials (and multinomials), the Bernoulli.

We have $H = H_1 + H_2 + H_3 + H_4$ where each of these summands is a Bernoulli, $B_{.73}$, and similarly $A = A_1 + A_2 + A_3 + A_4$, where each $A_i = B_{.20}$. But then
$$HA = \sum_{i,j=1}^{4} H_i A_j.$$

Now what random variable is $H_i A_i$? Simply 0, since it cannot be that the same fire is both a home and an apartment. What random variable is $H_i A_j$ if $i \ne j$? Since we are assuming the fires are independent, the distribution of $H_i A_j$ is very simple:

H_i A_j   0               1
P         1 − .73 × .20   .73 × .20

And so $E(H_i A_j) = .73 \times .20$, and thus $E(HA) = (4^2 - 4) \times .73 \times .20$. But $E(H) = 4 \times .73$ and $E(A) = 4 \times .20$, so $E(H)E(A) = (4 \times .73)(4 \times .20) = 4^2 \times .73 \times .20$, and so
$$\operatorname{cov}(H, A) = -4 \times .73 \times .20.$$
Similarly, $\operatorname{cov}(H, D) = -4 \times .73 \times .07$ and $\operatorname{cov}(A, D) = -4 \times .20 \times .07$.

Note that $\operatorname{cov}(H, H) = E(H^2) - E(H)E(H) = V(H)$, and so the time has come to introduce the covariance matrix of a set of random variables, in our case, three random variables, H, A and D. So we think of a matrix whose rows and columns are indexed by these three variables, and in the respective entry we place the respective covariance.

Thus, on the main diagonal, we place the variances, and this is clearly a symmetric matrix since $\operatorname{cov}(X, Y)$ always equals $\operatorname{cov}(Y, X)$. In our specific example
$$M = \begin{pmatrix} .7884 & -.5840 & -.2044 \\ -.5840 & .6400 & -.0560 \\ -.2044 & -.0560 & .2604 \end{pmatrix}.$$

Naturally, more than the number of fires, one is interested in the costs of the fires. Suppose that we expect the cost of a house fire to be $25 thousand, while an apartment fire is $15 thousand, and a fire in another dwelling is $5 thousand. Thus,
C = 25H + 15 A + 5D ,
and of course we could easily build the distribution of C from the table above. For
example, if (1,3, 0 ) occurs (with probability 2.33%), the cost will be $70 thousand.
Easily, the expected cost of the fires is given by
E ( C ) = 25E ( H ) + 15E ( A) + 5E ( D ) = $86.4 thousand.

But what is the variance of C, $V(C)$? Since the variables are not independent, we cannot just simply add the variances. Rather we have to do a computation, and we will do it a little more generically. Let $X = aH + bA + cD$ where a, b and c are numbers. Then we know $E(X) = aE(H) + bE(A) + cE(D)$, so
$$E(X)^2 = \bigl(aE(H) + bE(A) + cE(D)\bigr)^2 = a^2E(H)^2 + b^2E(A)^2 + c^2E(D)^2 + 2abE(H)E(A) + 2acE(H)E(D) + 2bcE(A)E(D).$$

Now we need $E(X^2)$. But
$$X^2 = (aH + bA + cD)^2 = a^2H^2 + b^2A^2 + c^2D^2 + 2abHA + 2acHD + 2bcAD.$$
So,
$$E(X^2) = E\bigl((aH + bA + cD)^2\bigr) = a^2E(H^2) + b^2E(A^2) + c^2E(D^2) + 2abE(HA) + 2acE(HD) + 2bcE(AD).$$
Thus,
$$\begin{aligned} V(X) &= a^2E(H^2) + b^2E(A^2) + c^2E(D^2) + 2abE(HA) + 2acE(HD) + 2bcE(AD) \\ &\quad - \bigl[a^2E(H)^2 + b^2E(A)^2 + c^2E(D)^2 + 2abE(H)E(A) + 2acE(H)E(D) + 2bcE(A)E(D)\bigr] \\ &= a^2V(H) + b^2V(A) + c^2V(D) + 2ab\operatorname{cov}(H, A) + 2ac\operatorname{cov}(H, D) + 2bc\operatorname{cov}(A, D). \end{aligned}$$

This expression would be impossible to remember if it were not for the wonderful tool of matrix multiplication: this is nothing but
$$\begin{pmatrix} a & b & c \end{pmatrix} M \begin{pmatrix} a \\ b \\ c \end{pmatrix}$$
where M is the covariance matrix described above. Thus, in our particular situation,

$$V(C) = \begin{pmatrix} 25 & 15 & 5 \end{pmatrix} \begin{pmatrix} .7884 & -.5840 & -.2044 \\ -.5840 & .6400 & -.0560 \\ -.2044 & -.0560 & .2604 \end{pmatrix} \begin{pmatrix} 25 \\ 15 \\ 5 \end{pmatrix} = 145.76.$$

And so the standard deviation of C is $12,073.
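The quadratic form is easy to evaluate mechanically. Here is a minimal sketch in plain Python (no matrix library is assumed); the helper names `cov` and `bilinear` are hypothetical and introduced only for this illustration. It rebuilds the covariance matrix from the multinomial parameters and then computes $V(C)$ in thousands of dollars squared.

```python
from math import sqrt

p = {"H": 0.73, "A": 0.20, "D": 0.07}   # category probabilities
n = 4                                    # number of fires
names = ["H", "A", "D"]

def cov(u, v):
    """Covariance of two multinomial counts: n p (1 - p) on the diagonal, -n p_u p_v off it."""
    return n * p[u] * (1 - p[u]) if u == v else -n * p[u] * p[v]

M = [[cov(u, v) for v in names] for u in names]

def bilinear(x, M, y):
    """Compute the product x^T M y for lists x, y and a nested-list matrix M."""
    return sum(x[i] * M[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

costs = [25, 15, 5]                      # cost per fire, in thousands of dollars
var_C = bilinear(costs, M, costs)
sd_C = sqrt(var_C)
print(round(var_C, 2), round(sd_C, 3))
```

The same `bilinear` helper with two different coefficient vectors gives the covariance of two linear combinations.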

The method actually extends to compute the covariance of two linear combinations. Thus for example if $B = 3H + 4A + 5D$, then
$$\operatorname{cov}(C, B) = \begin{pmatrix} 25 & 15 & 5 \end{pmatrix} M \begin{pmatrix} 3 \\ 4 \\ 5 \end{pmatrix}.$$

We finish the section with a continuous model.

Example 5. Independent? Suppose X and Y are random variables whose joint distribution is given by
$$f(x, y) = \begin{cases} cxy & 0 \le x \le y \le 1 \\ 0 & \text{otherwise} \end{cases}$$
As usual, to find c, we need to take the double integral
$$\int_0^1 \int_0^y xy \, dx \, dy = \int_0^1 y \left( \tfrac12 x^2 \Big|_0^y \right) dy = \int_0^1 \tfrac12 y^3 \, dy = \tfrac18,$$
and so $c = 8$.

We may think that X and Y are independent since the joint density looks like a product of two marginals, but the triangular shape of the range should quickly dissuade us from assuming that. Note that this shape indicates a relationship between X and Y, especially since the former is always less than or equal to the latter!

[Figure: the triangular range bounded by the lines $y = 1$ and $y = x$.]
In fact, computing the marginals, we get
$$f_X(x) = \int_x^1 8xy \, dy = 4xy^2 \Big|_x^1 = 4x - 4x^3 \quad \text{for } 0 \le x \le 1,$$
and
$$f_Y(y) = \int_0^y 8xy \, dx = 4x^2 y \Big|_0^y = 4y^3 \quad \text{for } 0 \le y \le 1,$$
and we readily give up on any claims of independence. In fact, if we consider the joint
distribution of the variables, this will become even clearer. As in the case of one variable,
the cumulative distribution of the pair X and Y is given by
$F(a, b) = P(X \le a \text{ and } Y \le b)$. If $b \le a$, then easily $F(a, b) = F(b, b)$, so without loss we can assume $0 \le a \le b \le 1$. In this case, the picture aids us in computing.

[Figure: the triangle bounded by $y = 1$ and $y = x$, with the point $(a, b)$ marked.]

$F(a, b)$ is more easily computed if we integrate with respect to x last since we can do it in one integral. If we chose to do it with respect to y last, then we would have to break the interval into the two shapes, triangular and rectangular. Thus, we get
$$F(a, b) = \int_0^a \int_x^b 8xy \, dy \, dx = \int_0^a 4x \left( y^2 \Big|_x^b \right) dx = \int_0^a \left( 4xb^2 - 4x^3 \right) dx = a^2 \left( 2b^2 - a^2 \right),$$
and this is their joint distribution. Naturally, easily confirmed is the fact that $f = \dfrac{\partial^2 F}{\partial x \, \partial y}$.

As in the case of one variable, one can use the distribution to do a variety of
computations:

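① $P(X \le \tfrac12, Y \le \tfrac12) = F(\tfrac12, \tfrac12) = \tfrac{1}{16}$.
② $P(X \le \tfrac34, Y \le \tfrac12) = F(\tfrac12, \tfrac12) = \tfrac{1}{16}$.
③ $P(X \le \tfrac12, Y \ge \tfrac12) = F(\tfrac12, 1) - F(\tfrac12, \tfrac12) = \tfrac{6}{16}$.
④ $P(X \ge \tfrac12, Y \ge \tfrac12) = 1 - F(\tfrac12, 1) = \tfrac{9}{16}$.
⑤ $P(X \le \tfrac12, Y \le \tfrac34) = F(\tfrac12, \tfrac34) = \tfrac{7}{32}$.
⑥ $P(X \ge \tfrac12, Y \le \tfrac34) = F(\tfrac34, \tfrac34) - F(\tfrac12, \tfrac34) = \tfrac{81}{256} - \tfrac{7}{32} = \tfrac{25}{256}$.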

However, some other computations will still require that we do some integration: for example $P(X + Y \le 1)$. In that case we still have to draw the region, and the probability will be gotten by the appropriate integral:

[Figure: the region $x + y \le 1$ inside the triangle bounded by $y = 1$ and $y = x$.]

$$P(X + Y \le 1) = \int_0^{1/2} \int_x^{1-x} 8xy \, dy \, dx = \int_0^{1/2} 4x \left( y^2 \Big|_x^{1-x} \right) dx = \int_0^{1/2} \left( 4x - 8x^2 \right) dx = \tfrac16.$$

We can compute $E(X)$ by two methods since we already have the marginals: we can compute $\int x f_X$ or we can compute $\iint x f$. Without having computed the marginals, the second would be perhaps less work, but since we have $f_X(x) = 4x - 4x^3$ in the range $0 \le x \le 1$, we readily get $E(X) = \int_0^1 \left( 4x^2 - 4x^4 \right) dx = \tfrac{8}{15}$. Similarly, $E(Y) = \int_0^1 4y^4 \, dy = \tfrac45$. Not surprisingly we get $E(X) \le E(Y)$, since we always have $X \le Y$.



We could also readily get the variances of X and Y: $E(X^2) = \int_0^1 \left( 4x^3 - 4x^5 \right) dx = \tfrac13$, so $V(X) = \tfrac{11}{225}$, and $\sigma_X \approx .2211$, and $E(Y^2) = \int_0^1 4y^5 \, dy = \tfrac23$, so $V(Y) = \tfrac{2}{75}$, and $\sigma_Y \approx .1633$.

However, we are not yet ready to compute the covariance matrix. We first need $E(XY)$:
$$E(XY) = \int_0^1 \int_0^y (xy) \, 8xy \, dx \, dy = \int_0^1 \tfrac83 y^2 \left( x^3 \Big|_0^y \right) dy = \int_0^1 \tfrac83 y^5 \, dy = \tfrac49.$$

And so
$$\operatorname{cov}(X, Y) = E(XY) - E(X)E(Y) = \tfrac49 - \tfrac{8}{15} \cdot \tfrac45 = \tfrac{4}{225}.$$

And not surprisingly, these two variables are positively correlated.
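These moments can be cross-checked numerically. The following is a rough sketch, assuming a simple nested midpoint rule (a stand-in for the exact integrals, chosen only for illustration); the helper names `midpoint` and `expectation` are hypothetical.

```python
def midpoint(f, lo, hi, n=200):
    """Approximate the integral of f over [lo, hi] with the midpoint rule."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

def density(x, y):
    """Joint density 8xy, valid on the triangle 0 <= x <= y <= 1."""
    return 8.0 * x * y

def expectation(g):
    """E(g(X, Y)): integrate over y from x to 1 first, then over x from 0 to 1."""
    return midpoint(
        lambda x: midpoint(lambda y: g(x, y) * density(x, y), x, 1.0),
        0.0,
        1.0,
    )

total = expectation(lambda x, y: 1.0)   # should be close to 1
EX = expectation(lambda x, y: x)        # compare with 8/15
EY = expectation(lambda x, y: y)        # compare with 4/5
EXY = expectation(lambda x, y: x * y)   # compare with 4/9
cov_XY = EXY - EX * EY                  # compare with 4/225
print(total, EX, EY, EXY, cov_XY)
```

The agreement is only approximate, since the midpoint rule carries a small discretization error.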

Our covariance matrix is then $M = \dfrac{1}{225} \begin{pmatrix} 11 & 4 \\ 4 & 6 \end{pmatrix}$.

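Having all this information allows us to quickly calculate the main attributes of the random variable $Z = Y - X$. Without having to compute the density, we know that
$$E(Z) = \tfrac45 - \tfrac{8}{15} = \tfrac{4}{15} \quad \text{and} \quad V(Z) = \begin{pmatrix} -1 & 1 \end{pmatrix} \frac{1}{225} \begin{pmatrix} 11 & 4 \\ 4 & 6 \end{pmatrix} \begin{pmatrix} -1 \\ 1 \end{pmatrix} = \tfrac{9}{225}.$$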

In this section we have learned several crucial ideas and their uses: joint densities and
joint distributions, both continuous and discrete, marginals, independence, correlation
and covariance, and the covariance matrix.

o Transformations

From the onset of the course, we not only considered random variables, but also combinations of them such as sums and products, as well as new variables obtained from old ones by taking a function of that variable such as $X^2 + 3$. In a previous section, we looked at the transformation method for one variable. In this section we will look at the extension of this method to the multivariable case.

We start by reviewing the one variable technique:

Theorem. Transformation Method. Let X be a continuous random variable with density $f_X$. Let $g : \mathbb{R} \to \mathbb{R}$ be a function whose derivative (at least in the range of X) is never 0. Consider $Y = g(X)$. Then the range of Y is the image under g of the range of X, and if $y = g(x)$, then
$$f_Y(y) = \frac{f_X(x)}{|g'(x)|}.$$

The multivariable version of the theorem is not that much harder to prove, except it takes solid understanding of multivariable calculus. First one has to understand what the derivative of a function of several variables is, and it is (of course) a matrix. This matrix is made out of all possible partial derivatives. An example should suffice. Consider the function $k : \mathbb{R} \times \mathbb{R} \to \mathbb{R} \times \mathbb{R}$ given by $k(x, y) = (x^2 + y^2, 2xy)$. Readily we can see this function as made up of two other simpler functions, $g(x, y) = x^2 + y^2$ and $h(x, y) = 2xy$, so that now $k(x, y) = (g(x, y), h(x, y))$. Then what one means by the derivative of k is the matrix of partials,
$$\begin{pmatrix} \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial y} \\[1.5ex] \dfrac{\partial h}{\partial x} & \dfrac{\partial h}{\partial y} \end{pmatrix}.$$
In our specific case, it is given by the matrix $\begin{pmatrix} 2x & 2y \\ 2y & 2x \end{pmatrix}$.

Theorem. Multivariable Transformation Method. Let X and Y be continuous random variables with joint density $f_{X,Y}$. Let Z and W be random variables given by functions $g, h : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, $Z = g(X, Y)$ and $W = h(X, Y)$. Suppose moreover that for any x and y, the matrix¹
$$J = \begin{pmatrix} \dfrac{\partial g}{\partial x} & \dfrac{\partial g}{\partial y} \\[1.5ex] \dfrac{\partial h}{\partial x} & \dfrac{\partial h}{\partial y} \end{pmatrix}$$
always (at least in the range of X and Y) has nonzero determinant. Then we have for $(z, w) = (g(x, y), h(x, y))$,
$$f_{Z,W}(z, w) = \frac{f_{X,Y}(x, y)}{|\det J|}.$$

¹ The J is for Jacobian, named after Carl Gustav Jacobi, nineteenth century mathematician.

The proof will be omitted (it is a consequence of the multivariable chain rule), but we
should readily observe the similitude between the two theorems.

Naturally, there is a multivariable version of the theorem for any number of variables.

Example 1. Sums and Differences. Let X and Y have joint density $f_{X,Y}$. Consider $Z = X + Y$ and $W = X - Y$. In that case $J = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$, whose determinant, $-2$, is never zero, so if $z = x + y$ and $w = x - y$, then since $2x = z + w$ and $2y = z - w$, we get immediately
$$f_{Z,W}(z, w) = \frac{f_{X,Y}(x, y)}{2} = \frac12 f_{X,Y}\!\left( \frac{z + w}{2}, \frac{z - w}{2} \right).$$

Example 2. A Revisit to a Chapter 1 Example. Let $X = Y = U$, and assume they are independent. Thus $f_{X,Y}(x, y) = 1$ in the range $0 \le x, y \le 1$. Clearly, the range of $Z = X + Y$ is $0 \le z \le 2$, and the range of $W = X - Y$ is $-1 \le w \le 1$. Their joint density is given by
$$f_{Z,W}(z, w) = \frac{f_{X,Y}(x, y)}{2} = \frac12 f_{X,Y}\!\left( \frac{z + w}{2}, \frac{z - w}{2} \right),$$
but note $f_{X,Y}\!\left( \frac{z + w}{2}, \frac{z - w}{2} \right)$ is 0 unless $0 \le z + w \le 2$ and $0 \le z - w \le 2$, so the joint density has for its range the square with boundary given by the lines $w = z$, $w = -z$, $w = 2 - z$ and $w = z - 2$. In that range its density is simply $f_{Z,W}(z, w) = \tfrac12$.

[Figure: the square in the $(Z, W)$ plane with vertices $(0, 0)$, $(1, 1)$, $(2, 0)$ and $(1, -1)$.]

Thus if we wanted to compute the individual densities of Z and W, all we would have to do is compute the marginals, and since the shape is a prism, all we need to do is compute lengths:
$$f_Z(z) = \begin{cases} z & 0 \le z \le 1 \\ 2 - z & 1 \le z \le 2 \end{cases} \quad \text{and} \quad f_W(w) = \begin{cases} 1 - w & 0 \le w \le 1 \\ 1 + w & -1 \le w \le 0 \end{cases}$$

The next example is a small variation of the previous:


Example 3. Sum of Exponentials. Instead of letting X and Y be independent standard uniforms, let them be exponentials with parameter 1, $X = Y = X_1$, and thus their range is the first quadrant and their joint density there is $e^{-x-y}$. As before, we let $Z = X + Y$ and $W = X - Y$. Then the range of Z is simply $z \ge 0$, but the range of W is $-\infty < w < \infty$. As before we get
$$f_{Z,W}(z, w) = \frac12 f_{X,Y}\!\left( \frac{z + w}{2}, \frac{z - w}{2} \right),$$
thus we need both $z + w \ge 0$ and $z - w \ge 0$, so the region is an infinite triangle bounded by the lines $z = w$ and $z = -w$. In that region the density is given by $f_{Z,W}(z, w) = \tfrac12 e^{-z}$.

[Figure: the infinite triangular region in the $(Z, W)$ plane.]

Thus to compute the marginal on Z, we need
$$f_Z(z) = \int_{-z}^{z} \tfrac12 e^{-z} \, dw = \tfrac12 e^{-z} w \Big|_{-z}^{z} = z e^{-z}$$
for $z \ge 0$. In a future section we will identify this random variable as a gamma.

Now to compute the other marginal: for $w \ge 0$, we get $f_W(w) = \int_w^\infty \tfrac12 e^{-z} \, dz = \tfrac12 e^{-w}$, while for $w \le 0$, $f_W(w) = \int_{-w}^\infty \tfrac12 e^{-z} \, dz = \tfrac12 e^{w}$, so succinctly we can state:
$$f_W(w) = \tfrac12 e^{-|w|}.$$

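Example 4. Consider the following two random variables, X and Y, with joint density:
$$f(x, y) = \begin{cases} 2(1 - y) & 0 \le x \le 3y, \ 0 \le y \le 1 \\ 0 & \text{elsewhere} \end{cases}$$
Consider the random variables $U = X + Y$ and $V = 2X - 4Y$. We find the joint density of U and V via the transformation method. Easily, since
$$\begin{pmatrix} u \\ v \end{pmatrix} = A \begin{pmatrix} x \\ y \end{pmatrix} \quad \text{where} \quad A = \begin{pmatrix} 1 & 1 \\ 2 & -4 \end{pmatrix},$$
we have that $\begin{pmatrix} x \\ y \end{pmatrix} = A^{-1} \begin{pmatrix} u \\ v \end{pmatrix}$, and since $A^{-1} = \dfrac16 \begin{pmatrix} 4 & 1 \\ 2 & -1 \end{pmatrix}$, we have that $x = \dfrac{4u + v}{6}$ and $y = \dfrac{2u - v}{6}$. Thus, since $f_{U,V}(u, v) = \dfrac{f_{X,Y}(x, y)}{|\det J|}$ and $J = A$, with $|\det A| = 6$, we have
$$f_{U,V}(u, v) = \frac{2(1 - y)}{6} = \frac{1 - \dfrac{2u - v}{6}}{3} = \frac{6 - 2u + v}{18}.$$
However, what is not at all clear is the joint support: what is the range of U and V? Since the transformation is linear, the image of the triangle is another triangle, the one whose vertices are the images of the respective vertices of the original triangle. The original triangle has vertices $(0, 0)$, $(0, 1)$ and $(3, 1)$, and their images under $A$ are
$$\begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad \begin{pmatrix} 1 \\ -4 \end{pmatrix} \quad \text{and} \quad \begin{pmatrix} 4 \\ 2 \end{pmatrix}.$$

[Figure: diagram of the triangular support of $(U, V)$.]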

Example 5. Suppose the joint density of two random variables is given by $f_{X,Y}(x, y) = 2(1 - x)$ on the open unit square, $0 < x, y < 1$. Suppose we are interested in the random variable $Z = XY$. Can we use the multivariable transformation method? All we have to do is come up with another variable in such a way that the Jacobian is nonzero. For example, we can let $W = X$. Then $J = \begin{pmatrix} y & x \\ 1 & 0 \end{pmatrix}$ with $\det J = -x \ne 0$.

Thus we readily have
$$f_{Z,W}(z, w) = \frac{f_{X,Y}(x, y)}{|\det J|} = \frac{f_{X,Y}\!\left( w, \dfrac{z}{w} \right)}{w} = \frac{2(1 - w)}{w}$$
where we need then $0 < w < 1$ and $0 < z < w$. Thus to compute the marginal of Z, we get
$$f_Z(z) = \int_z^1 2 \left( \frac1w - 1 \right) dw = \left( 2 \ln w - 2w \right) \Big|_z^1 = -2 - 2 \ln z + 2z = 2(z - \ln z - 1)$$
for $0 < z < 1$.

[Figure: the region $0 < z < w < 1$ in the $(Z, W)$ plane.]
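As a sanity check on this marginal, one can verify numerically that $2(z - \ln z - 1)$ integrates to 1 on $(0, 1)$ and that its mean equals $E(X)E(Y) = \tfrac13 \cdot \tfrac12 = \tfrac16$ (here X and Y happen to be independent, since the joint density factors). The sketch below assumes a plain midpoint rule, which never evaluates the integrand at the logarithmic singularity $z = 0$; the helper name `midpoint` is hypothetical.

```python
from math import log

def f_Z(z):
    """Density of Z = XY derived above, valid for 0 < z < 1."""
    return 2.0 * (z - log(z) - 1.0)

def midpoint(f, lo, hi, n=200_000):
    """Midpoint rule; it samples only interior points, so z = 0 is never touched."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

total = midpoint(f_Z, 0.0, 1.0)                      # should be close to 1
mean_Z = midpoint(lambda z: z * f_Z(z), 0.0, 1.0)    # should be close to 1/6
print(total, mean_Z)
```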

Moment-Generating Functions

More than a transformation, the last technique we look at in this section is a transform method, the Laplace transform to be more exact. Ways of encoding sequences into functions (and vice versa) have been a useful device in mathematics for a considerable time, and that is exactly what we do now.

Starting with a random variable X, we associate with it its sequence of moments,
$$1 = E(X^0), \quad \mu = E(X), \quad E(X^2), \quad E(X^3), \ldots$$
Then we encode that sequence by transforming it into a function:
$$m_X(t) = 1 + E(X)\,t + \frac{E(X^2)\,t^2}{2!} + \frac{E(X^3)\,t^3}{3!} + \cdots.$$

This function is called the moment-generating function of X. The name is appropriate since the construction should remind the reader of the MacLaurin series expansion of a function, and so easily we can get the coefficients by repeated differentiation evaluated at 0 at each stage. In other words,
$$E(X^i) = m_X^{(i)}(0)$$
where $m_X^{(i)}(0)$ stands for the ith derivative evaluated at 0. But more is true: since the expectation of a sum is the sum of the expectations, it seems that we could pull the E operator out of the right hand side, i.e.,
$$m_X(t) = E\!\left( 1 + Xt + \frac{X^2}{2!} t^2 + \frac{X^3}{3!} t^3 + \cdots \right) = E\!\left( 1 + tX + \frac{(tX)^2}{2!} + \frac{(tX)^3}{3!} + \cdots \right).$$
And we can easily recognize the right hand side expression as $E(e^{tX})$, and so we have another expression for the moment-generating function of X,
$$m_X(t) = E(e^{tX}).$$

One of the wonderful facts about moment-generating functions is that if two random
variables have the same moment-generating function, then they have the same
distribution—or equivalently they are the same random variable.

Although a bit intimidating, the last expression above is very useful as the following
examples will illustrate.

Example 6. The Bernoulli. Of course we should start with the constants: if X is constant, say a, then $E(X^i) = a^i$, and so $m_X(t) = e^{at}$ when $X = a$. Just slightly more complicated is the Bernoulli: since $B_p^i = B_p$, $E(B_p^i) = p$, and so
$$m_{B_p}(t) = 1 + pt + p\frac{t^2}{2!} + p\frac{t^3}{3!} + \cdots = q + p + pt + p\frac{t^2}{2!} + p\frac{t^3}{3!} + \cdots = q + pe^t.$$

Example 7. The Uniform. The expression $m_X(t) = E(e^{tX})$ is particularly useful in the continuous case. Let U be the standard uniform, then
$$m_U(t) = E(e^{tU}) = \int_0^1 e^{tu} \, du = \frac{e^{tu}}{t} \Big|_0^1 = \frac{e^t - 1}{t}.$$

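Example 8. The Poisson. Consider now $P_\lambda$, a Poisson. Let $X = e^{tP_\lambda}$; then $P(X = e^{tk}) = \dfrac{\lambda^k}{k!} e^{-\lambda}$, and so
$$m_{P_\lambda}(t) = \sum_{k=0}^{\infty} e^{tk} \frac{\lambda^k}{k!} e^{-\lambda} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(\lambda e^t)^k}{k!} = e^{-\lambda} e^{\lambda e^t} = e^{\lambda (e^t - 1)}.$$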

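Example 9. The Normal. Let Z be the standard normal. Then
$$\begin{aligned} m_Z(t) = E(e^{tZ}) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx} e^{-\frac{x^2}{2}} \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{x^2}{2} + tx} \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac12 (x^2 - 2tx)} \, dx \\ &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac12 (x^2 - 2tx + t^2) + \frac{t^2}{2}} \, dx = e^{\frac{t^2}{2}} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac12 (x - t)^2} \, dx = e^{\frac{t^2}{2}}. \end{aligned}$$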

Certainly one would have to agree that some of the expressions are not that memorable—
but the following theorem will point out their usefulness.

Theorem. The Algebra of Moment-Generating Functions. Let X and Y be independent random variables with respective moment-generating functions $m_X(t)$ and $m_Y(t)$. Let a be a scalar. Then
① $m_{aX}(t) = m_X(at)$;
② $m_{X+Y}(t) = m_X(t) \, m_Y(t)$.

Proof. It is rather simple: $m_{aX}(t) = E\!\left( e^{t(aX)} \right) = E\!\left( e^{(at)X} \right) = m_X(at)$, so ① is done. Now
$$m_{X+Y}(t) = E\!\left( e^{t(X+Y)} \right) = E\!\left( e^{tX + tY} \right) = E\!\left( e^{tX} e^{tY} \right) = E\!\left( e^{tX} \right) E\!\left( e^{tY} \right) = m_X(t) \, m_Y(t)$$
because it is a fact that the variables $e^{tX}$ and $e^{tY}$ are also independent, so the expectation of their product is the product of their expectations. ∎

This theorem has many applications.

Example 10. The Binomial. Since $B_{n,p} = B_p + \cdots + B_p$ (a sum of n independent Bernoullis), we immediately get that
$$m_{B_{n,p}}(t) = (q + pe^t)^n.$$
Thus, if we so wanted, we could easily compute the first three moments of the binomial: $E(B_{n,p}) = m'_{B_{n,p}}(0) = np$, $E(B_{n,p}^2) = m^{(2)}_{B_{n,p}}(0) = npq + n^2p^2$ (so that the variance is $npq$), and $E(B_{n,p}^3) = m^{(3)}_{B_{n,p}}(0) = np\left( n^2p^2 + q^2 + 3npq - pq \right)$.
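Moments read off from derivatives of $(q + pe^t)^n$ can be cross-checked by summing directly against the binomial probabilities. A minimal sketch follows; the values $n = 6$ and $p = 0.3$ are arbitrary illustrative choices and `binomial_moment` is a hypothetical helper name.

```python
from math import comb

def binomial_moment(n, p, k):
    """E(B^k) for a binomial B with parameters n and p, by direct summation."""
    q = 1 - p
    return sum(j ** k * comb(n, j) * p ** j * q ** (n - j) for j in range(n + 1))

n, p = 6, 0.3          # arbitrary illustrative parameters
q = 1 - p

first = binomial_moment(n, p, 1)
second = binomial_moment(n, p, 2)
third = binomial_moment(n, p, 3)

# Closed forms obtained by differentiating (q + p e^t)^n at t = 0.
print(first, n * p)
print(second, n * p * q + (n * p) ** 2)
print(third, n * p * (n**2 * p**2 + q**2 + 3 * n * p * q - p * q))
```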

Example 11. The Uniforms. Since $U_{[a,b]} = (b - a)U + a$, we get
$$m_{U_{[a,b]}}(t) = m_U\bigl( (b - a)t \bigr) \, m_a(t) = \frac{e^{(b-a)t} - 1}{(b - a)t} \, e^{at} = \frac{e^{bt} - e^{at}}{(b - a)t}.$$

Example 12. The Normals. Since $N_{\mu,\sigma} = \sigma Z + \mu$, similar to the previous example we get
$$m_{N_{\mu,\sigma}}(t) = m_Z(\sigma t) \, m_\mu(t) = e^{\frac{(\sigma t)^2}{2}} e^{\mu t} = e^{\frac{\sigma^2 t^2}{2} + \mu t}.$$

We end the section with a fundamental fact about normals.



Theorem. The Sum of Normals. Let $X = N_{\mu_1,\sigma_1}$ and $Y = N_{\mu_2,\sigma_2}$ be independent normals, and let $W = X + Y$. Then $W = N_{\mu,\sigma}$ where $\mu = \mu_1 + \mu_2$ and $\sigma = \sqrt{\sigma_1^2 + \sigma_2^2}$.

Proof. We have that $m_X(t) = e^{\frac{\sigma_1^2 t^2}{2} + \mu_1 t}$ and $m_Y(t) = e^{\frac{\sigma_2^2 t^2}{2} + \mu_2 t}$, so
$$m_W(t) = m_X(t) \, m_Y(t) = e^{\frac{\sigma_1^2 t^2}{2} + \mu_1 t} \, e^{\frac{\sigma_2^2 t^2}{2} + \mu_2 t} = e^{\frac{(\sigma_1^2 + \sigma_2^2) t^2}{2} + (\mu_1 + \mu_2) t},$$
which is the moment-generating function of $N_{\mu,\sigma}$ as above. ∎



p The Gamma and its Many Relatives

In this section we introduce the last family of random variables of the course. Earlier we looked at the normal distribution, which is symmetric about its mean. However, many distributions can take values that are only on one side of the axis, positive for example. Certainly $Z^2$ would be such a random variable, or the exponential. The gamma is one of these with nonnegative range, but before we can discuss it we need to do some integration.

Throughout this section α will denote a positive real number. Consider the following definition (which is acceptable since as $y \to \infty$, $e^{-y}$ converges to 0 much faster than $y^{\alpha-1}$ can grow):
$$\Gamma(\alpha) = \int_0^\infty y^{\alpha-1} e^{-y} \, dy.$$

This is known as the gamma function, and it was created by the great Euler in order to satisfy a recursion similar to the factorial. Indeed, easily, by integration by parts, one can show:
$$\Gamma(\alpha + 1) = \alpha \Gamma(\alpha).$$
And together with the easily computed $\Gamma(1) = 1$, we readily obtain
$$\Gamma(n + 1) = n!.$$

[Figure: graph of the gamma function.]

It was Euler that computed $\Gamma(\tfrac12) = \sqrt{\pi}$. Note then that $\Gamma(\tfrac32) = \tfrac12 \sqrt{\pi}$.


Next we extend this integral a bit. First let $\beta > 0$ also. Then $\int_0^\infty y^{\alpha-1} e^{-y/\beta} \, dy$ can be computed via a substitution, $y = \beta x$, to obtain
$$\int_0^\infty y^{\alpha-1} e^{-y/\beta} \, dy = \beta^\alpha \Gamma(\alpha).$$
So now we are ready to introduce a new family of random variables determined by two parameters, α and β, both positive. The random variable with density
$$f(y) = \begin{cases} \dfrac{y^{\alpha-1} e^{-y/\beta}}{\beta^\alpha \Gamma(\alpha)} & y > 0 \\[1ex] 0 & \text{otherwise} \end{cases}$$
is known as a gamma random variable with parameters α and β, and will be denoted by $G_{\alpha,\beta}$.

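On the right are the graphs of three such densities with $\beta = 1$, and respective α's.

[Figure: three gamma densities with $\beta = 1$; legend entries 1, 2, 4.]

These graphs represent the densities of Gamma random variables with $\alpha = 2$ and respective β's.

[Figure: three gamma densities with $\alpha = 2$; legend entries 1, 2, 4.]

To compute the two key parameters of $G_{\alpha,\beta}$ is rather easy:
$$E(G_{\alpha,\beta}) = \frac{1}{\beta^\alpha \Gamma(\alpha)} \int_0^\infty y \, y^{\alpha-1} e^{-y/\beta} \, dy = \frac{1}{\beta^\alpha \Gamma(\alpha)} \int_0^\infty y^{\alpha} e^{-y/\beta} \, dy = \frac{\beta^{\alpha+1} \Gamma(\alpha + 1)}{\beta^\alpha \Gamma(\alpha)} = \alpha\beta.$$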

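And
$$E(G_{\alpha,\beta}^2) = \frac{1}{\beta^\alpha \Gamma(\alpha)} \int_0^\infty y^2 \, y^{\alpha-1} e^{-y/\beta} \, dy = \frac{1}{\beta^\alpha \Gamma(\alpha)} \int_0^\infty y^{\alpha+1} e^{-y/\beta} \, dy = \frac{\beta^{\alpha+2} \Gamma(\alpha + 2)}{\beta^\alpha \Gamma(\alpha)} = \beta^2 \alpha (\alpha + 1).$$

Thus the variance is given by
$$V(G_{\alpha,\beta}) = \alpha^2\beta^2 + \alpha\beta^2 - \alpha^2\beta^2 = \alpha\beta^2.$$
Naturally, the range of the gamma is the set of positive reals.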

As a remark worth making, observe that a gamma random variable with $\alpha = 1$ is an exponential, that is, $G_{1,\beta} = X_\beta$.

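The moment generating function of $G_{\alpha,\beta}$ is not very difficult. For $t < 1/\beta$,
$$\begin{aligned} m_{G_{\alpha,\beta}}(t) = E\!\left( e^{tG_{\alpha,\beta}} \right) &= \frac{1}{\beta^\alpha \Gamma(\alpha)} \int_0^\infty e^{ty} y^{\alpha-1} e^{-y/\beta} \, dy = \frac{1}{\beta^\alpha \Gamma(\alpha)} \int_0^\infty y^{\alpha-1} e^{-\frac{y - \beta t y}{\beta}} \, dy \\ &= \frac{1}{\beta^\alpha \Gamma(\alpha)} \int_0^\infty y^{\alpha-1} e^{-\frac{y(1 - \beta t)}{\beta}} \, dy = \frac{1}{\beta^\alpha \Gamma(\alpha)} \left( \frac{\beta}{1 - \beta t} \right)^{\alpha} \Gamma(\alpha) = \left( \frac{1}{1 - \beta t} \right)^{\alpha}. \end{aligned}$$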

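As a trivial consequence, we get

Theorem. Sum of Gammas. If $G_{\alpha,\beta}$ and $G_{\lambda,\beta}$ are independent, then
$$G_{\alpha,\beta} + G_{\lambda,\beta} = G_{\alpha+\lambda,\beta}.$$

Proof. It is immediate: $m_{G_{\alpha,\beta}}(t) = \left( \dfrac{1}{1 - \beta t} \right)^{\alpha}$ and $m_{G_{\lambda,\beta}}(t) = \left( \dfrac{1}{1 - \beta t} \right)^{\lambda}$, so
$$m_{G_{\alpha,\beta} + G_{\lambda,\beta}}(t) = m_{G_{\alpha,\beta}}(t) \, m_{G_{\lambda,\beta}}(t) = \left( \frac{1}{1 - \beta t} \right)^{\alpha} \left( \frac{1}{1 - \beta t} \right)^{\lambda} = \left( \frac{1}{1 - \beta t} \right)^{\alpha+\lambda} = m_{G_{\alpha+\lambda,\beta}}(t). \ ∎$$

Thus, we have that the sum of n (independent) exponentials with the same mean is a gamma,
$$\underbrace{X_\beta + X_\beta + \cdots + X_\beta}_{n} = G_{n,\beta}.$$
This extends a previous observation in a former section.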
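A quick simulation makes this plausible. The sketch below assumes Python's standard `random` and `statistics` modules; the seed, the sample size and the parameters $n = 5$, $\beta = 2$ are arbitrary illustrative choices. It sums n independent exponentials with mean β and compares the sample mean and variance with the gamma values $n\beta$ and $n\beta^2$.

```python
import random
from statistics import fmean, pvariance

def sum_of_exponentials(rng, n, beta):
    """One draw of the sum of n independent exponentials, each with mean beta."""
    # random.expovariate expects the rate, which is 1 / mean.
    return sum(rng.expovariate(1.0 / beta) for _ in range(n))

rng = random.Random(7)      # fixed seed (arbitrary) for repeatability
n, beta = 5, 2.0            # illustrative parameters
draws = [sum_of_exponentials(rng, n, beta) for _ in range(50_000)]

sample_mean = fmean(draws)
sample_var = pvariance(draws)
print(sample_mean, n * beta)        # gamma mean: alpha * beta with alpha = n
print(sample_var, n * beta ** 2)    # gamma variance: alpha * beta^2
```

Matching the first two moments does not by itself prove the distributions agree; that is what the moment-generating function argument supplies.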

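Example 1. The Beta Distribution. Let $X = G_{\alpha,\beta}$ and $Y = G_{\lambda,\beta}$ be independent. We know that their joint density is given by
$$f_{X,Y}(x, y) = \frac{x^{\alpha-1} e^{-x/\beta} \, y^{\lambda-1} e^{-y/\beta}}{\beta^\alpha \Gamma(\alpha) \, \beta^\lambda \Gamma(\lambda)}.$$
Consider $U = X + Y$ and $V = \dfrac{X}{X + Y}$. The range of U is all nonnegative numbers, and we already know that $U = G_{\alpha+\lambda,\beta}$. The range of V is the unit interval $0 \le v \le 1$. What is the joint density of U and V, and what is the marginal of V? Using the transformation method, we get the matrix
$$J = \begin{pmatrix} 1 & 1 \\[1ex] \dfrac{y}{(x + y)^2} & \dfrac{-x}{(x + y)^2} \end{pmatrix}, \quad \text{and so} \quad \det J = \frac{-1}{x + y}.$$
If $u = x + y$ and $v = \dfrac{x}{x + y}$, then $x = uv$ and $y = u - uv$, so
$$\begin{aligned} f_{U,V}(u, v) = \frac{f_{X,Y}(x, y)}{\dfrac{1}{x + y}} &= \frac{(x + y) \, x^{\alpha-1} e^{-x/\beta} \, y^{\lambda-1} e^{-y/\beta}}{\beta^\alpha \Gamma(\alpha) \, \beta^\lambda \Gamma(\lambda)} = \frac{u \, (uv)^{\alpha-1} e^{-uv/\beta} \, (u - uv)^{\lambda-1} e^{-(u - uv)/\beta}}{\beta^\alpha \Gamma(\alpha) \, \beta^\lambda \Gamma(\lambda)} \\ &= \frac{u^{1+\alpha-1+\lambda-1} \, v^{\alpha-1} e^{-u/\beta} (1 - v)^{\lambda-1}}{\beta^\alpha \Gamma(\alpha) \, \beta^\lambda \Gamma(\lambda)} = \frac{e^{-u/\beta} u^{\alpha+\lambda-1}}{\beta^{\alpha+\lambda}} \cdot \frac{v^{\alpha-1} (1 - v)^{\lambda-1}}{\Gamma(\alpha) \Gamma(\lambda)} \\ &= \frac{e^{-u/\beta} u^{\alpha+\lambda-1}}{\beta^{\alpha+\lambda} \Gamma(\alpha + \lambda)} \cdot \frac{v^{\alpha-1} (1 - v)^{\lambda-1} \Gamma(\alpha + \lambda)}{\Gamma(\alpha) \Gamma(\lambda)} = f_U(u) \, \frac{v^{\alpha-1} (1 - v)^{\lambda-1} \Gamma(\alpha + \lambda)}{\Gamma(\alpha) \Gamma(\lambda)}. \end{aligned}$$

From which we can conclude that U and V are independent, and that
$$f_V(v) = \frac{v^{\alpha-1} (1 - v)^{\lambda-1} \Gamma(\alpha + \lambda)}{\Gamma(\alpha) \Gamma(\lambda)}.$$
Such a random variable is known as a beta random variable with parameters α and λ, $T_{\alpha,\lambda}$. Note that as an immediate consequence, since this is a density, we get
$$\int_0^1 v^{\alpha-1} (1 - v)^{\lambda-1} \, dv = \frac{\Gamma(\alpha) \Gamma(\lambda)}{\Gamma(\alpha + \lambda)}.$$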

The following theorem will capture the main properties of the beta random variables.

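Theorem. Beta Properties. Let $T_{\alpha,\lambda}$ be a beta random variable with parameters α and λ. Thus, the density of $T_{\alpha,\lambda}$ is given by
$$f(x) = \frac{x^{\alpha-1} (1 - x)^{\lambda-1} \Gamma(\alpha + \lambda)}{\Gamma(\alpha) \Gamma(\lambda)}$$
over its range $[0, 1]$. Then
$$E(T_{\alpha,\lambda}) = \frac{\alpha}{\alpha + \lambda} \quad \text{and} \quad V(T_{\alpha,\lambda}) = \frac{\alpha\lambda}{(\alpha + \lambda)^2 (\alpha + \lambda + 1)}.$$

Proof. Now we know that
$$\int_0^1 x \, x^{\alpha-1} (1 - x)^{\lambda-1} \, dx = \int_0^1 x^{\alpha} (1 - x)^{\lambda-1} \, dx = \frac{\Gamma(\alpha + 1) \Gamma(\lambda)}{\Gamma(\alpha + 1 + \lambda)}.$$
Thus
$$E(T_{\alpha,\lambda}) = \frac{\Gamma(\alpha + 1) \Gamma(\lambda)}{\Gamma(\alpha + 1 + \lambda)} \cdot \frac{\Gamma(\alpha + \lambda)}{\Gamma(\alpha) \Gamma(\lambda)} = \frac{\alpha}{\alpha + \lambda},$$
by the fundamental recursion of the gamma function. Now for the second moment, $\int_0^1 x^2 \, x^{\alpha-1} (1 - x)^{\lambda-1} \, dx = \dfrac{\Gamma(\alpha + 2) \Gamma(\lambda)}{\Gamma(\alpha + 2 + \lambda)}$ as before, so
$$E(T_{\alpha,\lambda}^2) = \frac{\Gamma(\alpha + 2) \Gamma(\lambda)}{\Gamma(\alpha + 2 + \lambda)} \cdot \frac{\Gamma(\alpha + \lambda)}{\Gamma(\alpha) \Gamma(\lambda)} = \frac{\alpha (\alpha + 1)}{(\alpha + \lambda)(\alpha + \lambda + 1)},$$
and so the variance is given by
$$V(T_{\alpha,\lambda}) = \frac{\alpha (\alpha + 1)}{(\alpha + \lambda)(\alpha + \lambda + 1)} - \left( \frac{\alpha}{\alpha + \lambda} \right)^2 = \frac{\alpha\lambda}{(\alpha + \lambda)^2 (\alpha + \lambda + 1)}. \ ∎$$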

Example 2. Job Assignment. Around a factory there are 13 independent jobs to be executed. The time to do any one of the jobs is an exponential random variable with a mean of one hour. Fred and Barney are to get the jobs assigned to them, so suppose Fred gets 8 jobs and Barney gets 5, since he has more seniority. Let X be the random variable that represents the time Fred is going to be working, and similarly Y represents the time Barney will be on the job. Then X and Y are independent gammas with parameters 8 and 1, and 5 and 1 respectively. Then $U = X + Y$ represents the total amount of time they will be working altogether, while $V = \dfrac{X}{X + Y}$ represents the proportion of time that Fred will be working out of the total time. By the previous example, U and V are (surprisingly) independent, and V will have a beta distribution with parameters 8 and 5. Thus, the probability that Fred gets away with working less than his 8 thirteenths share of work is given by
$$3960 \int_0^{8/13} x^7 (1 - x)^4 \, dx \approx 48.23\%,$$
while the probability he spends 50% of the total time or less is only 19.38%.
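Because both parameters are integers here, the integral is a polynomial integral and can be evaluated exactly. The following sketch assumes integer parameters and uses rational arithmetic from the standard library; the function name `beta_cdf_int` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction
from math import comb, factorial

def beta_cdf_int(a, b, x):
    """P(T <= x) for a beta with integer parameters a, b, via expanding (1 - t)^(b-1)."""
    x = Fraction(x)
    # Normalizing constant Gamma(a+b) / (Gamma(a) Gamma(b)) for integer a, b.
    const = Fraction(factorial(a + b - 1), factorial(a - 1) * factorial(b - 1))
    total = Fraction(0)
    for j in range(b):
        # Term-by-term integral of C(b-1, j) (-1)^j t^(a-1+j) from 0 to x.
        total += Fraction(comb(b - 1, j) * (-1) ** j, a + j) * x ** (a + j)
    return const * total

p_share = beta_cdf_int(8, 5, Fraction(8, 13))   # Fred works less than 8/13 of the time
p_half = beta_cdf_int(8, 5, Fraction(1, 2))     # Fred works at most half of the time
print(float(p_share), float(p_half))
```

For non-integer parameters a numerical integrator or a statistics library would be needed instead.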



We do one more type of random variable that is a derivative of the gamma. Again, let us use Z to denote $N(0, 1)$, the standard normal. Then we consider the random variable $Z^2$. Certainly its range is the set of nonnegative reals, and for any such number y,
$$F_{Z^2}(y) = P(Z^2 \le y) = P\!\left( -\sqrt{y} \le Z \le \sqrt{y} \right) = \int_{-\sqrt{y}}^{\sqrt{y}} f_Z(t) \, dt = 2 \int_0^{\sqrt{y}} f_Z(t) \, dt = 2 \int_{-\infty}^{\sqrt{y}} f_Z(t) \, dt - 1 = 2 F_Z(\sqrt{y}) - 1,$$
so
$$f_{Z^2}(y) = \frac{e^{-y/2}}{\sqrt{2\pi y}} = \frac{y^{\frac12 - 1} e^{-y/2}}{2^{1/2} \, \Gamma(\tfrac12)},$$
and so $Z^2 = G_{\frac12, 2}$, a gamma with expectation 1 and variance 2.

Thus a sum of n independent $Z^2$'s is a gamma variable of type $G_{\frac{n}{2}, 2}$, which has mean n and variance 2n. This type of random variable is so important it acquires a special name: it is known as a $\chi^2$ random variable with n degrees of freedom. We will illustrate some of its uses in the examples below. The probabilities of such a random variable are available in tables similar to the one for the normal.

The $\chi^2$ (chi-square) random variable was created by Karl Pearson to develop a goodness-of-fit test. It works as follows:

Example 3. Officers and Horses Again. An example from the past examined the number of cavalry officers killed by horses. The data was as follows:

Number of Deaths   0     1    2    3    4   5
Actuality          144   91   32   11   2   0

The idea, as before, is to model this occurrence with a Poisson random variable. We obtain $\lambda = .7$, and so we have the following table:

Number of Deaths            0         1         2         3         4         5
Poisson Probability         0.49658   0.34761   0.12166   0.02838   0.00496   0.00078
Expected # of occurrences   139.04    97.33     34.07     7.95      1.39      0.22
Actuality                   144       91        32        11        2         0

The idea then is to compare the last two rows of the table. The $\chi^2$ test adds the squares of the differences between what is expected and what occurred, each divided by what is expected: so in our case
$$\frac{(144 - 139.04)^2}{139.04} + \frac{(91 - 97.33)^2}{97.33} + \frac{(32 - 34.07)^2}{34.07} + \frac{(11 - 7.95)^2}{7.95} + \frac{(2 - 1.39)^2}{1.39} + \frac{(0 - .22)^2}{.22}$$
which gives a total of 2.3716. This is the key statistic, which is then checked in a $\chi^2$ table (with 5 degrees of freedom) and found to have a reasonable probability of occurring but not as high as 90%, so one is reasonably satisfied with the model, but not totally certain that the fit is perfect. The reason that 5 degrees of freedom was used is that 6 pieces of data are being compared, but since their sum is the same, we only have 5 degrees.
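The statistic can be recomputed from the raw counts. This is a minimal sketch using only the standard library; it assumes (an inference from the table, where the last probability 0.00078 matches the Poisson tail rather than the single value 5) that the final column stands for "5 or more" deaths.

```python
from math import exp, factorial

observed = [144, 91, 32, 11, 2, 0]    # counts for 0, 1, 2, 3, 4 and (assumed) 5-or-more deaths
total = sum(observed)

# Estimate lambda as the average number of deaths per observation.
lam = sum(k * o for k, o in enumerate(observed)) / total

# Poisson probabilities for 0..4, with the remaining tail mass in the last cell.
probs = [lam ** k * exp(-lam) / factorial(k) for k in range(5)]
probs.append(1.0 - sum(probs))

expected = [total * p for p in probs]
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

print(round(lam, 4))
print([round(e, 2) for e in expected])
print(round(chi_square, 4))
```

Using unrounded expected counts, as here, can shift the last digits slightly compared with a hand computation from the rounded table.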

We end the section with another slightly different application of the χ 2 − test.

Example 4. Two-Way Tables. The effectiveness of a new flu vaccine was being tested
in a small city. The vaccine was provided free of charge in a two-shot sequence over a
period of two weeks to anybody who wanted it. Later a survey of 1000 town people
provided the following information:
Status No Vaccine One Shot Two Shots Total
Flu 24 9 13 46
No Flu 289 100 565 954
Total 313 109 578 1000

We attempt to measure if there is an effect from the vaccine on whether a person got or
did not get the flu. If there had been no effect, then we could say that the rows and the
columns are independent of each other so what we should be getting in each cell of the
table is the share of the totals for that row and that column, thus in the 1,1− position we
46 × 313
would be getting = 14.40 . If we compute each of
1000
these we get the table 14.40 5.01 26.59
298.60 103.99 551.41
Now we are ready to compute the χ 2 − statistic as we did before, and we obtain the
following table of values 6.40 3.17 6.94
0.31 0.15 0.33
which add up to 17.31.

The only mystery remaining is to decide how many degrees of freedom we have. We
have six cells to start with, but we lose one because all the cells add up to the same
number, and we also lose one because the row sum of the first row in both tables is the
same. We do not lose one for the second row, since that had been accounted for with the
total of 1000. But we will lose 2 more for the first two columns (again the third column
is already accounted for with the total), so we have 2 degrees of freedom remaining. Now
the probability of getting as high a value as 17.31 with two degrees of freedom is less
than .005. Thus we can reasonably conclude that there is some effect, since what occurred
is highly unlikely to occur if there had been none.
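
The computation in this example can be retraced with a short script. What follows is only a sketch: the variable names are ours, and the tail probability uses the closed form $e^{-x/2}$, which holds for exactly 2 degrees of freedom and would not apply to other tables.

```python
import math

# Observed counts from the survey table (rows: Flu, No Flu;
# columns: No Vaccine, One Shot, Two Shots).
observed = [[24, 9, 13],
            [289, 100, 565]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom for a two-way table: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# Assumption: dof == 2, where the upper tail of the chi-square
# distribution has the closed form exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print(round(chi_square, 2), dof, p_value)
```

The degrees of freedom come out as (rows - 1) times (columns - 1), which agrees with the count made cell by cell above.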

Conditioning Further

In this last section, we further explore the conditioning of random variables, and end
with a brief discussion of order statistics. Let us review via an example what a conditional
distribution is.

Example 1. Let $X = P_\lambda$ and $Y = P_\delta$ be independent Poisson random variables. Suppose
we are given that $X + Y = n$. What is the distribution of $X$ then? As before, we are dealing
with a new random variable now, $X \mid X + Y = n$, and since $Y$ only takes nonnegative values, we
have that $X$ can only take the values 0 through $n$. Thus if we let $0 \le k \le n$, then
\[
P(X = k \mid X + Y = n) = \frac{P(X = k \text{ and } Y = n - k)}{P(X + Y = n)}
= \frac{P(X = k)\, P(Y = n - k)}{P(X + Y = n)}
\]
because $X$ and $Y$ are independent. But because of that we also know that $X + Y = P_{\lambda+\delta}$ is a
Poisson too, so we have then that
\[
P\bigl((X \mid X + Y = n) = k\bigr)
= \frac{\lambda^k e^{-\lambda}}{k!} \cdot \frac{\delta^{n-k} e^{-\delta}}{(n-k)!} \cdot \frac{n!}{(\lambda+\delta)^n e^{-\lambda-\delta}}
= \binom{n}{k} \left(\frac{\lambda}{\lambda+\delta}\right)^{k} \left(\frac{\delta}{\lambda+\delta}\right)^{n-k}
= P\Bigl(B_{n,\frac{\lambda}{\lambda+\delta}} = k\Bigr).
\]
And so we obtain that $X \mid X + Y = n$ is nothing but a binomial.
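
As a numerical sanity check of this identity, one can compare the conditional probabilities with the binomial ones directly. The values of `lam`, `delta` and `n` below are arbitrary illustrative choices, not taken from the text.

```python
import math

def poisson_pmf(k, rate):
    """P(Poisson(rate) = k)."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(Binomial(n, p) = k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

# Illustrative values only; any positive rates and any n would do.
lam, delta, n = 2.0, 3.0, 6

for k in range(n + 1):
    # Conditional probability computed directly from the definition.
    conditional = (poisson_pmf(k, lam) * poisson_pmf(n - k, delta)
                   / poisson_pmf(n, lam + delta))
    binomial = binomial_pmf(k, n, lam / (lam + delta))
    print(k, round(conditional, 6), round(binomial, 6))
```

The two printed columns should agree for every k up to floating-point rounding.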

In a similar manner one can handle the continuous case.

Example 2. Let the amount of time a student takes to finish an exam be given by the
random variable $X$ with density $f_X(x) = \frac{x^2}{144000} + \frac{x}{3600}$ in the range 0 to 60 (we are
measuring time in minutes). This random variable has mean 42.5 minutes and a standard
deviation of approximately 13.18 minutes. It is known that no one has finished the exam
in less than 15 minutes, so if we condition using this fact, we will get a different random
variable:
\[
F_{X \mid X \ge 15}(x) = P(X \le x \mid X \ge 15) = \frac{P(15 \le X \le x)}{P(X \ge 15)} = \frac{x^3 + 60x^2 - 16875}{415125}
\]
in the range $15 \le x \le 60$. Thus, $f_{X \mid X \ge 15}(x) = \frac{x^2 + 40x}{138375}$, and so its mean is slightly higher
at 43.81 minutes, but on the other hand, its standard deviation has reduced a little to about 11.68
minutes. Perhaps the latter random variable is more realistic; only the data will tell.
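
The means and standard deviations quoted in this example can be rechecked by integrating the polynomial density exactly. This is a minimal sketch; the helper names `integrate_poly` and `moments` are ours, and conditioning on $X \ge 15$ is carried out by restricting the integral to 15 to 60 and dividing by the mass of that range.

```python
from fractions import Fraction
import math

def integrate_poly(coeffs, a, b):
    """Exact integral over [a, b] of sum(coeffs[i] * x**i)."""
    a, b = Fraction(a), Fraction(b)
    return sum(c * (b ** (i + 1) - a ** (i + 1)) / (i + 1)
               for i, c in enumerate(coeffs))

def moments(coeffs, a, b):
    """Return (mass, mean, standard deviation) of the density restricted to [a, b]."""
    mass = integrate_poly(coeffs, a, b)
    mean = integrate_poly([0] + coeffs, a, b) / mass       # integral of x * f(x)
    second = integrate_poly([0, 0] + coeffs, a, b) / mass  # integral of x^2 * f(x)
    return mass, mean, math.sqrt(second - mean ** 2)

# f_X(x) = x^2/144000 + x/3600, stored as coefficients of 1, x, x^2.
f_x = [Fraction(0), Fraction(1, 3600), Fraction(1, 144000)]

mass0, mean0, sd0 = moments(f_x, 0, 60)    # unconditioned exam time
mass1, mean1, sd1 = moments(f_x, 15, 60)   # conditioned on X >= 15
print(mass0, float(mean0), round(sd0, 2))
print(mass1, round(float(mean1), 2), round(sd1, 2))
```

Here `mass1` is $P(X \ge 15)$, the denominator used in the conditional distribution.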

But can we condition a continuous random variable on an event similar to the one in the
first example, namely that of a random variable taking a specific value? Suppose we have two
continuous random variables $X$ and $Y$ with joint density $f_{X,Y}(x, y)$. Similarly to a former
computation, let us consider
\[
\frac{P(x \le X \le x + h \mid y \le Y \le y + k)}{h}
= \frac{\dfrac{P(x \le X \le x + h \text{ and } y \le Y \le y + k)}{hk}}{\dfrac{P(y \le Y \le y + k)}{k}}
= \frac{\dfrac{1}{hk} \displaystyle\int_y^{y+k}\!\!\int_x^{x+h} f_{X,Y}(t, s)\, dt\, ds}{\dfrac{1}{k} \displaystyle\int_y^{y+k} f_Y(s)\, ds}.
\]
If we now take the limit as $h$ and $k$ vanish, we get
\[
f_{X \mid Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}
\]
as long as the denominator of this expression is not 0. And this is exactly how the
conditional distribution of $X$ given $Y$ is defined.

Example 3. Suppose that the range of the random variables $X$ and $Y$ is the (open) unit
square, and in that range their joint density is $f_{X,Y}(x, y) = \frac{12}{5} x (2 - x - y)$. (The constant
$\frac{12}{5}$ is what makes the density integrate to 1 over the square.) What is the
expectation of $X$?

To do that we could compute the marginal of $X$, and then from it derive the
expectation, or we could simply do the double integral $\int_0^1\!\int_0^1 x \cdot \frac{12}{5} x (2 - x - y)\, dy\, dx$. We opt
for the latter, so
\[
E(X) = \int_0^1\!\!\int_0^1 x \cdot \tfrac{12}{5} x (2 - x - y)\, dy\, dx = \int_0^1 \tfrac{12}{5} x^2 \left(\tfrac{3}{2} - x\right) dx = \frac{3}{5} = 0.6 .
\]

But suppose we are given that $Y = y$, what then is the expectation of $X$? We are really
asking for $E(X \mid Y)$. So first we need to compute the density of this random variable.
Readily, the range of $X \mid Y$ is the open interval 0 to 1, and there
\[
f_{X \mid Y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{x (2 - x - y)}{\int_0^1 x (2 - x - y)\, dx} = \frac{x (2 - x - y)}{\frac{2}{3} - \frac{y}{2}} = \frac{6x (2 - x - y)}{4 - 3y} .
\]
Observe that for any $y$ in the interval 0 to 1, this is a density, namely this has nonnegative
values and
\[
\int_0^1 \frac{6x (2 - x - y)}{4 - 3y}\, dx = \frac{1}{4 - 3y} \int_0^1 \left(12x - 6x^2 - 6xy\right) dx = \frac{1}{4 - 3y} \Bigl[\, 6x^2 - 2x^3 - 3x^2 y \,\Bigr]_0^1 = 1 .
\]

Now we can compute
\[
E(X \mid Y) = \int_0^1 \frac{6x^2 (2 - x - y)}{4 - 3y}\, dx = \frac{1}{4 - 3y} \int_0^1 \left(12x^2 - 6x^3 - 6x^2 y\right) dx
= \frac{1}{4 - 3y} \Bigl[\, 4x^3 - 1.5x^4 - 2x^3 y \,\Bigr]_0^1 = \frac{5 - 4y}{8 - 6y} .
\]

Thus if $y = 0.5$, we get the expectation of $X$ to be 0.60, while for $y = 0.1$ it is
$4.6/7.4 \approx 0.62$ and for $y = 0.9$ it is $1.4/2.6 \approx 0.54$; certainly then the value of $Y$ has
an effect on the expectation of the other random variable. But something remarkable
happens: let us now consider the random variable $Z = E(X \mid Y) = \frac{5 - 4Y}{8 - 6Y}$, and let us
compute $E(Z) = E(E(X \mid Y))$. That is simple,
\[
E(Z) = \int_0^1\!\!\int_0^1 \frac{5 - 4y}{8 - 6y} \cdot \tfrac{12}{5} x (2 - x - y)\, dx\, dy
= \int_0^1 \left[ \frac{x^2 (5 - 4y)(6 - 2x - 3y)}{5 (4 - 3y)} \right]_{x=0}^{x=1} dy
= \int_0^1 \frac{5 - 4y}{5}\, dy = \frac{1}{5} \Bigl[\, 5y - 2y^2 \,\Bigr]_0^1 = \frac{3}{5} .
\]

But we have seen that number before: it was the expectation of $X$. Coincidence? NO
WAY, it is a theorem.
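
The conditional density found in Example 3 can be spot-checked numerically. The sketch below uses a plain midpoint rule (our choice of method, with hypothetical helper names) to confirm that $\frac{6x(2 - x - y)}{4 - 3y}$ integrates to 1 in $x$ and that its mean matches $\frac{5 - 4y}{8 - 6y}$ for a few sample values of $y$.

```python
def conditional_density(x, y):
    """f_{X|Y}(x | y) = 6x(2 - x - y) / (4 - 3y) on 0 < x < 1."""
    return 6 * x * (2 - x - y) / (4 - 3 * y)

def midpoint_integral(g, steps=20000):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    h = 1.0 / steps
    return sum(g((i + 0.5) * h) for i in range(steps)) * h

checks = {}
for y in (0.1, 0.5, 0.9):
    total = midpoint_integral(lambda x: conditional_density(x, y))
    mean = midpoint_integral(lambda x: x * conditional_density(x, y))
    closed_form = (5 - 4 * y) / (8 - 6 * y)
    checks[y] = (total, mean, closed_form)
    print(y, round(total, 6), round(mean, 6), round(closed_form, 6))
```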

Theorem. The Expectation of the Expectation. Let $X$ and $Y$ be random
variables. Consider the random variable $Z = E(X \mid Y)$, then
\[
E(Z) = E(E(X \mid Y)) = E(X) .
\]

Example 4. 'Mazing Rats. In order to determine whether rats can distinguish colors, or
remember them in any case, a rat is put into a maze with three swinging doors colored red,
white and blue. Behind the red door there is a path to a piece of cheese. The path will
take about 3 minutes for the rat to travel. Behind the white door there is a maze that
returns the rat to the starting point after roughly 5 minutes, while behind the blue door
there is a similar but longer maze that returns the rat to the starting point after 7 minutes.

Assuming that the rat is color blind and memoryless, so it will take any door at random at
any time, on the average how long will it take it to reach the cheese? Let $X$ be the variable
that measures the time until the rat reaches the cheese, and let $Y$ be the door that the rat
chooses the first time. Then we assume $Y$ = red, white or blue with equal probability, $\frac{1}{3}$.
Now if the first occurs, then $X = 3$, so $E(X \mid Y = \text{red}) = 3$. On the other hand, easily
$E(X \mid Y = \text{white}) = 5 + E(X)$ and $E(X \mid Y = \text{blue}) = 7 + E(X)$, so by the theorem
\[
E(X) = \tfrac{1}{3} \bigl( 3 + 5 + E(X) + 7 + E(X) \bigr) = 5 + \tfrac{2}{3} E(X)
\]
and hence we conclude the rat will take 15 minutes on the average to reach the cheese.
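
The 15-minute answer can be checked both by solving the equation exactly and by simulating a memoryless rat. This is only an illustrative sketch: the function name, the seed and the number of trials are our own choices.

```python
import random
from fractions import Fraction

# Exact answer: E(X) = 5 + (2/3) E(X), so (1 - 2/3) E(X) = 5.
exact = Fraction(5) / (1 - Fraction(2, 3))

def time_to_cheese(rng):
    """Simulate one memoryless rat; return minutes until it reaches the cheese."""
    minutes = 0
    while True:
        door = rng.choice(("red", "white", "blue"))
        if door == "red":
            return minutes + 3  # path to the cheese
        minutes += 5 if door == "white" else 7  # back at the starting point

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 200_000
average = sum(time_to_cheese(rng) for _ in range(trials)) / trials
print(exact, round(average, 2))
```

The simulated average should land close to the exact value, with the gap shrinking as the number of trials grows.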

Rather than give a formal proof of this theorem, we will use a long example to illustrate
why it is true.

Example 5. Consider the following two random variables. One rolls two dice, and $Y$
records the sum of the two dice while $X$ records the highest value of either die. Their joint
distribution is given by the following table.

X/Y    2     3     4     5     6     7     8     9     10    11    12
1      1/36  0     0     0     0     0     0     0     0     0     0
2      0     2/36  1/36  0     0     0     0     0     0     0     0
3      0     0     2/36  2/36  1/36  0     0     0     0     0     0
4      0     0     0     2/36  2/36  2/36  1/36  0     0     0     0
5      0     0     0     0     2/36  2/36  2/36  2/36  1/36  0     0
6      0     0     0     0     0     2/36  2/36  2/36  2/36  2/36  1/36

Let us use the table to compute the marginals of both $X$ and $Y$. So now we obtain an
extended table:

X/Y    2     3     4     5     6     7     8     9     10    11    12    P
1      1/36  0     0     0     0     0     0     0     0     0     0     1/36
2      0     2/36  1/36  0     0     0     0     0     0     0     0     3/36
3      0     0     2/36  2/36  1/36  0     0     0     0     0     0     5/36
4      0     0     0     2/36  2/36  2/36  1/36  0     0     0     0     7/36
5      0     0     0     0     2/36  2/36  2/36  2/36  1/36  0     0     9/36
6      0     0     0     0     0     2/36  2/36  2/36  2/36  2/36  1/36  11/36
P      1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

Now we can readily compute $E(X) = \frac{1}{36} + \frac{6}{36} + \frac{15}{36} + \frac{28}{36} + \frac{45}{36} + \frac{66}{36} = \frac{161}{36}$. Now we are going to
compute the random variable $E(X \mid Y)$. We can actually compute this random variable
from the table above:
X/Y      2     3     4     5     6     7     8     9     10    11    12    P
1        1     0     0     0     0     0     0     0     0     0     0     1/36
2        0     1     1/3   0     0     0     0     0     0     0     0     3/36
3        0     0     2/3   1/2   1/5   0     0     0     0     0     0     5/36
4        0     0     0     1/2   2/5   1/3   1/5   0     0     0     0     7/36
5        0     0     0     0     2/5   1/3   2/5   1/2   1/3   0     0     9/36
6        0     0     0     0     0     1/3   2/5   1/2   2/3   1     1     11/36
P        1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
E(X|Y)   1     2     8/3   7/2   21/5  5     26/5  11/2  17/3  6     6

and we need to understand the nature of the terms in this table. The 1 in the first column
comes from the fact that if $Y = 2$, then $X$ has to be 1. Let us proceed to the third column
(the second is similar to the first). In the third column we see $\frac{1}{3}$ and $\frac{2}{3}$, which stem from
the $\frac{1}{36}$ and $\frac{2}{36}$ of the original table, since we are apportioning 1 in each column
according to the probabilities in that column. Each column is obtained that way. Now we
need to quickly observe that the entries in the $E(X \mid Y)$ row are obtained by taking the
expectation of each column.

Finally if we compute $E(E(X \mid Y))$ by using the last row of the table, we will obtain
$\frac{161}{36} = E(X)$. Why did that happen? What happened to a typical entry inside the table?
First, in order to become a probability for the conditional, it got divided by the column
sum; then it got multiplied by the value of $X$ in its row so it would become a summand in
the computation for the value $E(X \mid Y)$ in that column. But then that value got multiplied
by the marginal of $Y$ in that column, which is the column sum! So all that remained then was
the sum of all the entries times the respective values of $X$, and that is exactly the
expectation of $X$.

Arguing a bit abstractly, let $p_{ij}$ denote the probability in the $(i, j)$-position of the table, and let
$c_j$ be the column sum of the $j$th column. As observed above, this is the marginal of $Y$.
Let $x_i$ be the value of $X$ in the $i$th row (in this case $x_i = i$, but this is irrelevant). Then we
start with $\frac{p_{ij}}{c_j}$, multiply it by $x_i$, and add all of these over a given column, so we
end up with $\sum_i \frac{p_{ij}}{c_j} x_i$, and that becomes the value of the random variable $E(X \mid Y)$ in
column $j$. But now to compute $E(E(X \mid Y))$, we take the sum of all of these values over
all columns after being multiplied by the marginal of $Y$:
\[
\sum_j c_j \sum_i \frac{p_{ij}}{c_j} x_i = \sum_j \sum_i p_{ij} x_i = E(X) .
\]
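
The tables of Example 5 and the cancellation of the column sums can be reproduced by enumerating the 36 rolls with exact fractions. This is a sketch with our own variable names: `joint` plays the role of the $p_{ij}$ and `marginal_y` that of the $c_j$.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of X = larger die and Y = sum for two fair dice.
joint = {}
for a, b in product(range(1, 7), repeat=2):
    key = (max(a, b), a + b)
    joint[key] = joint.get(key, 0) + Fraction(1, 36)

# Marginal of Y: the column sums c_j.
marginal_y = {}
for (x, y), p in joint.items():
    marginal_y[y] = marginal_y.get(y, 0) + p

# E(X | Y = y): divide each column by its sum, weight by x, add up.
cond_mean = {
    y: sum(x * p / c for (x, yy), p in joint.items() if yy == y)
    for y, c in marginal_y.items()
}

expectation_x = sum(x * p for (x, _), p in joint.items())
tower = sum(marginal_y[y] * cond_mean[y] for y in marginal_y)

print(expectation_x, tower)
print({y: str(m) for y, m in sorted(cond_mean.items())})
```

The dictionary printed last reproduces the $E(X \mid Y)$ row, and the first line shows the two expectations coinciding.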

One should observe that in a similar fashion one proves the expectation of a sum is the
sum of the expectations:
\[
E(X + Y) = \sum_j \sum_i (x_i + y_j)\, p_{ij} = \sum_j \sum_i \left( x_i p_{ij} + y_j p_{ij} \right)
= \sum_j \sum_i x_i p_{ij} + \sum_j \sum_i y_j p_{ij} = E(X) + E(Y) .
\]

We end the section (and the course) with a brief discussion of order statistics.

Maxima and Minima

In the last example we looked at the maximum of the roll of two dice. Likewise we can
consider the maximum of the roll of three dice. More formally, let $D_1$, $D_2$ and $D_3$ be the
rolls of three dice (independent of course), and consider $D_{\max} = \max\{D_1, D_2, D_3\}$. Then
the distribution of $D_{\max}$ is given by

Dmax   1       2       3       4       5       6
P      1/216   7/216   19/216  37/216  61/216  91/216

We could have also discussed $D_{\min}$, whose distribution is instead

Dmin   1       2       3       4       5       6
P      91/216  61/216  37/216  19/216  7/216   1/216

Note that $E(D_{\max}) = \frac{1071}{216}$ while $E(D_{\min}) = \frac{441}{216}$, which together average to $3.5 = E(D_i)$.
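
Both distributions can be recovered by brute force over the 216 equally likely rolls. The following is a small sketch with names of our choosing.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=3))  # all 216 equally likely outcomes

max_counts = Counter(max(r) for r in rolls)
min_counts = Counter(min(r) for r in rolls)

expected_max = Fraction(sum(v * n for v, n in max_counts.items()), len(rolls))
expected_min = Fraction(sum(v * n for v, n in min_counts.items()), len(rolls))

# Counts out of 216 for the values 1 through 6.
print([max_counts[v] for v in range(1, 7)])
print([min_counts[v] for v in range(1, 7)])
print(expected_max, expected_min, (expected_max + expected_min) / 2)
```

Note that `Fraction` prints in lowest terms, so the expectations appear reduced rather than over 216.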

In the continuous case, if $X_1, X_2, \ldots, X_t$ are independent and identically distributed, and
we let $X_{\max} = \max\{X_1, \ldots, X_t\}$, then $X_{\max}$ has the same range as any of the $X_i$'s.

Easily $X_{\max} \le a$ if and only if $X_i \le a$ for all $i$, and since these are independent events, we
obtain $P(X_{\max} \le a) = \prod_i P(X_i \le a)$. So if we let $f(x)$ and $F(x)$ denote the common
density and distribution of the $X_i$'s, then the distribution of $X_{\max}$, $F_{\max}$, is simply
\[
F_{\max}(a) = (F(a))^{t} ,
\]
so its density is also simply
\[
f_{\max}(a) = t\, (F(a))^{t-1} f(a) .
\]

Example 6. Max of Uniforms. Let $X_1, X_2, \ldots, X_t$ be independent standard uniforms,
$U$'s. Then their common range is of course the unit interval, and there $f(x) = 1$ and
$F(x) = x$, so if we take $U_{\max} = \max\{X_1, X_2, \ldots, X_t\}$, then its distribution is $F_{\max}(a) = a^t$
and its density is $f_{\max}(a) = t a^{t-1}$. Note that as with the dice, we naturally should see the
expectation of $U_{\max}$ inch up as the number of variables increases. Indeed,
\[
E(U_{\max}) = \int_0^1 x \, t x^{t-1}\, dx = \frac{t}{t+1} .
\]
The limit of this value is 1 as $t \to \infty$. To compute the variance of $U_{\max}$, we first find the second
moment
\[
E(U_{\max}^2) = \int_0^1 x^2 \, t x^{t-1}\, dx = \frac{t}{t+2} ,
\]
so
\[
V(U_{\max}) = \frac{t}{t+2} - \left( \frac{t}{t+1} \right)^{2} = \frac{t}{(t+1)^2 (t+2)} ,
\]
which goes to 0 as $t$ increases.
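
A quick simulation gives a feel for these formulas. The sketch below assumes $t = 5$ and a fixed seed, both arbitrary choices of ours, and compares the sample mean and variance of the maximum with $\frac{t}{t+1}$ and $\frac{t}{(t+1)^2(t+2)}$.

```python
import random
import statistics

t = 5              # number of uniforms per sample (illustrative choice)
trials = 100_000
rng = random.Random(1)  # fixed seed so the run is repeatable

# Each sample is the maximum of t independent standard uniforms.
samples = [max(rng.random() for _ in range(t)) for _ in range(trials)]

sample_mean = statistics.fmean(samples)
sample_var = statistics.pvariance(samples)

formula_mean = t / (t + 1)
formula_var = t / ((t + 1) ** 2 * (t + 2))

print(round(sample_mean, 4), round(formula_mean, 4))
print(round(sample_var, 5), round(formula_var, 5))
```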

The distribution of the minimum is just as simple as that of the maximum: $X_{\min} \ge a$ if
and only if $X_i \ge a$ for all $i$, and again since these are independent events, we obtain
$P(X_{\min} \ge a) = \prod_i P(X_i \ge a)$. So if $f(x)$ and $F(x)$ are as before, then we have
\[
1 - F_{\min}(a) = (1 - F(a))^{t} ,
\]
so its density is also simply
\[
f_{\min}(a) = t\, (1 - F(a))^{t-1} f(a) .
\]

Example 7. Min of Uniforms. Let $X_1, X_2, \ldots, X_t$ as before be independent standard
uniforms, $U$'s. Let $U_{\min} = \min\{X_1, X_2, \ldots, X_t\}$; then its distribution satisfies
$1 - F_{\min}(a) = (1 - a)^{t}$ and its density is $f_{\min}(a) = t (1 - a)^{t-1}$. Note that as with the dice, we
should see the expectation of $U_{\min}$ inch down to 0. Indeed,
\[
E(U_{\min}) = \int_0^1 x \, t (1 - x)^{t-1}\, dx = \frac{1}{t+1} .
\]

Note that as before the average of $E(U_{\max})$ and $E(U_{\min})$ is $\frac{1}{2} = E(U)$.

Example 8. Min of Exponentials. Let $X_1, X_2, \ldots, X_t$ be independent exponentials of
parameter $\beta$. Then we know $1 - F(x) = e^{-x/\beta}$, so $1 - F_{\min}(x) = \left(e^{-x/\beta}\right)^{t} = e^{-tx/\beta}$, and so we
readily obtain that $X_{\min} = X_{\beta/t}$, an exponential where the average has been divided by $t$.
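
This, too, is easy to watch in a simulation. The sketch below assumes $\beta = 6$ and $t = 3$ (illustrative values of ours) and treats $\beta$ as the mean of each exponential, in line with $1 - F(x) = e^{-x/\beta}$; Python's `random.expovariate` expects the rate, which is $1/\beta$.

```python
import random
import statistics

beta = 6.0   # mean of each exponential (illustrative value)
t = 3        # number of independent exponentials
trials = 100_000
rng = random.Random(2)  # fixed seed so the run is repeatable

# random.expovariate takes the rate, i.e. 1 / mean.
minima = [min(rng.expovariate(1 / beta) for _ in range(t))
          for _ in range(trials)]

sample_mean = statistics.fmean(minima)
# For an exponential the standard deviation equals the mean,
# so both numbers should be near beta / t.
sample_sd = statistics.pstdev(minima)

print(round(sample_mean, 3), round(sample_sd, 3), beta / t)
```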
109

since the expectation of a sum is the sum of the expectations. Moreover, since the
variance of an independent sum is the sum of the variances, and the variance of a
constant times a variable is the constant squared times the variance of the
variable, we obtain
\[
V(Y_n) = \frac{1}{n^2} \left( n \sigma^2 \right) = \frac{\sigma^2}{n} ,
\]
so by Chebyshev's Inequality, for any positive integer $k$,
\[
P\left( |Y_n - \mu| \ge \frac{1}{k} \right) \le \frac{V(Y_n)}{(1/k)^2} = \frac{k^2 \sigma^2}{n} ,
\]
and we have proven

Theorem (Law of Large Numbers). Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence
of independent random variables with the same distribution, with mean
and variance $\mu$ and $\sigma^2$ respectively. For every positive integer $n \ge 1$, let
$Y_n = \frac{1}{n} \left( X_1 + X_2 + \cdots + X_n \right)$. Then for every positive integer $k$,
\[
P\left( |Y_n - \mu| \ge \frac{1}{k} \right) \to 0 \quad \text{as } n \to \infty .
\]
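
To see the theorem at work, one can estimate $P(|Y_n - \mu| \ge \frac{1}{k})$ by simulation for increasing $n$. The sketch below uses fair dice ($\mu = 3.5$, $\sigma^2 = \frac{35}{12}$) with $k = 2$; these choices, the seed and the number of trials are all ours. It also prints the Chebyshev bound $V(Y_n)/(1/k)^2$ for comparison.

```python
import random

mu = 3.5               # mean of one fair die
sigma2 = 35 / 12       # variance of one fair die
k = 2                  # tolerance 1/k = 0.5 (illustrative choice)
trials = 2000
rng = random.Random(3)  # fixed seed so the run is repeatable

def tail_frequency(n):
    """Fraction of trials in which the mean of n dice is at least 1/k from mu."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.randint(1, 6) for _ in range(n)) / n
        if abs(mean - mu) >= 1 / k:
            hits += 1
    return hits / trials

results = {n: tail_frequency(n) for n in (10, 100, 400)}
for n, freq in results.items():
    # Chebyshev bound V(Y_n) / (1/k)^2, capped at 1 since it bounds a probability.
    bound = min(1.0, k ** 2 * sigma2 / n)
    print(n, freq, round(bound, 4))
```

The observed frequencies typically fall well below the Chebyshev bound, which is a crude but fully general estimate.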

Note that this theorem establishes the crucial role that the expectation plays as opposed to
the mode and the median—there is no comparable theorem about the other measurements
of central tendency.

We end the section with a precise statement of the Central Limit Theorem. First we let
$Z$ denote the standard normal, which has density $f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$, and as we will see has
expectation 0 and standard deviation 1.

Theorem (Central Limit Theorem). Let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence
of independent random variables with the same distribution, with mean
and variance $\mu$ and $\sigma^2$ respectively. Define $W_n = \dfrac{X_1 + \cdots + X_n - n\mu}{\sigma \sqrt{n}}$.
Then for any number $a$, $P(W_n \le a) \to P(Z \le a)$ as $n \to \infty$.
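
A simulation makes the statement concrete. The sketch below takes the $X_i$ to be standard uniforms with $n = 30$ (our illustrative choices), estimates $P(W_n \le a)$ for a few values of $a$, and compares with $P(Z \le a)$ computed from the error function in Python's `math` module.

```python
import math
import random

n = 30                 # number of summands (illustrative choice)
trials = 20_000
mu = 0.5               # mean of a standard uniform
sigma = math.sqrt(1 / 12)  # standard deviation of a standard uniform
rng = random.Random(4)  # fixed seed so the run is repeatable

def standard_normal_cdf(a):
    """P(Z <= a), written with the error function from the math module."""
    return 0.5 * (1 + math.erf(a / math.sqrt(2)))

def w_n():
    """One draw of W_n = (X_1 + ... + X_n - n*mu) / (sigma * sqrt(n))."""
    total = sum(rng.random() for _ in range(n))
    return (total - n * mu) / (sigma * math.sqrt(n))

draws = [w_n() for _ in range(trials)]
empirical = {a: sum(w <= a for w in draws) / trials for a in (-1.0, 0.0, 1.0)}
for a, freq in empirical.items():
    print(a, round(freq, 4), round(standard_normal_cdf(a), 4))
```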

The next section illustrates multiple applications of this fundamental theorem.
