Stochastic Processes Subject Notes
Probability Review
Axiomatic Probability Sample Space:
The set of all possible outcomes of some random process
Sigma algebra:
A set of subsets of our sample space, generally thought of as the events we can measure or observe
Probability measure:
A set function P from the set of events to a number between 0 and 1 such that P(Ω) = 1 and
P(A_1 ∪ A_2 ∪ …) = P(A_1) + P(A_2) + … for disjoint events A_1, A_2, ….
Random variable:
A function from the sample space to a real number
Probabilities of Random Variables When A is a subset of the real numbers, we can say that (using the inverse image):
P(X ∈ A) = P(X^-1(A)).
Strictly speaking the first notation is incorrect, as probability is a set function taking events as
arguments, and x is a number, not a set in Ω. More correctly we write:
P(X = x) = P({ω ∈ Ω : X(ω) = x}).
Basically, this is saying that if we say something like "the probability that X = 4", we really are talking
about "the probability of the set of all outcomes in the sample space that yield X = 4".
Distribution Function The distribution function of the random variable X can be written: F_X(x) = P(X ≤ x).
These probabilities along the real line are enough to specify the probability of any event. So two
variables with the same distribution function have the same probability for any subset A of the real line (or
technically the same probability for all the outcomes that produce values in the subset A).
Random Vectors Multiple random variables placed on the same probability space.
Independent Random Variables Random variables are said to be independent if their joint distribution or density function is equal to the
product of the individual density or distribution functions for each variable.
Conditional probability
Independence Events A and B are said to be independent if: P(A ∩ B) = P(A)P(B).
Random variables X and Y are independent if the sets {X ∈ A} and {Y ∈ B} are independent for all Borel sets A and B.
Expectation of a Random Variable
Moments
Conditional Probability Density For continuous variables x and y: f(x | y) = f(x, y)/f(y).
If we now want the conditional distribution function of two continuous variables:
F(x | y) = P(X ≤ x | Y = y).
To find the conditional density, differentiate: f(x | y) = ∂F(x | y)/∂x.
Poisson Limit Theorem The law of rare events or Poisson limit theorem gives a Poisson approximation to the binomial
distribution, under certain conditions.
The Poisson Limit Theorem states that if X_1, …, X_n are independent Bernoulli random variables with
P(Xi = 1) = 1 − P(Xi = 0) = pi, then S_n = X_1 + ⋯ + X_n is well-approximated by a Poisson random variable
with parameter λ = p_1 + ⋯ + p_n, provided each p_i is small.
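As a quick numerical check (a sketch, not part of the notes; the values n = 1000 and p = 0.003 are arbitrary illustrative choices), the binomial and Poisson pmfs can be compared directly:

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return exp(-lam) * lam ** k / factorial(k)

n, p = 1000, 0.003          # many trials, each with small success probability
lam = n * p
for k in range(6):
    print(k, round(binom_pmf(k, n, p), 5), round(poisson_pmf(k, lam), 5))
```

The two columns agree to three or four decimal places, as the theorem predicts.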
Discrete-Time Markov Chains
Stochastic Processes
The Markov Property A stochastic process has the Markov property if the conditional probability distribution of future states
of the process (conditional on both past and present states) depends only upon the present state, not
on the sequence of events that preceded it.
Transition Probabilities A random sequence (X_n) with a countable state space forms a DTMC if:
P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, …, X_0 = i_0) = P(X_{n+1} = j | X_n = i).
This enables us to write the one-step transition probabilities as a matrix P with entries:
p_ij = P(X_{n+1} = j | X_n = i).
The key point here being that transition probabilities do not depend on the time index. Note that each
row sums to 1.
Deriving Joint Distribution
Consider states i_0, i_1, …, i_n, such that the chain visits them in order.
We can then write the joint distribution as:
P(X_0 = i_0, X_1 = i_1, …, X_n = i_n) = P(X_0 = i_0) p_{i_0 i_1} p_{i_1 i_2} ⋯ p_{i_{n-1} i_n}.
Which can be written just as a product of the one-step transition probabilities.
N-Step Transitions
The Chapman-Kolmogorov equations show how we can calculate the n-step transition probabilities
from the single-step transition probabilities p_ij. The equation states that for any m, n ≥ 0:
p_ij^(m+n) = Σ_k p_ik^(m) p_kj^(n).
Which is the sum of the probabilities of going from state i to k in m steps, and from k to j in n steps,
summed over all the states k.
We can derive this equation using the law of total probability and simple properties of conditional
probabilities:
Where the last step follows from the Markov property.
The upshot of this is all the information we need to specify all finite dimensional distributions is the
starting distribution and the one-step transition matrix.
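Since the n-step transition matrix is just the nth power of the one-step matrix, the Chapman-Kolmogorov identity can be checked numerically (a sketch using a hypothetical two-state matrix, not one from the notes):

```python
import numpy as np

# A hypothetical two-state chain (assumed example): 0 = sunny, 1 = rainy
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# n-step transition probabilities are matrix powers of P
P2 = np.linalg.matrix_power(P, 2)
P3 = np.linalg.matrix_power(P, 3)
P5 = np.linalg.matrix_power(P, 5)

# Chapman-Kolmogorov: P^(2+3) = P^2 P^3
print(np.allclose(P5, P2 @ P3))   # True
print(P5)
```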
Accessibility Relationships Accessibility: a state k is accessible from j, denoted j → k, if for some n ≥ 0 we have p_jk^(n) > 0
Communicability: states j and k communicate with each other if j → k and k → j, denoted j ↔ k
Non-essentiality: a state j is non-essential if there is a state k such that j → k but not k → j; that is,
eventually we will leave state j and never be able to return
Essentiality: a state j is essential if j → k implies that k → j
Absorbing: a state j is an absorbing state if p_jj = 1. Absorbing states are essential
Ephemeral: a state j is ephemeral if p_kj = 0 for all k, meaning that once we leave the state we can never
return to it. Ephemeral states usually don’t add anything to a DTMC model and we are going to
assume that there are no such states
Properties of Communicability Relation If there are no ephemeral states, then the following properties hold for all states:
Reflexivity: j ↔ j
Symmetry: j ↔ k if and only if k ↔ j
Transitivity: if j ↔ k and k ↔ l then j ↔ l
Transitivity Suppose we know that j ↔ k and k ↔ l; we want to show that j ↔ l. We know that p_jk^(n) > 0
and p_kl^(m) > 0
for some n and m. Using the Chapman-Kolmogorov equation we then have:
p_jl^(n+m) = Σ_i p_ji^(n) p_il^(m) ≥ p_jk^(n) p_kl^(m) > 0.
The same argument with j and l swapped gives l → j. Hence we have demonstrated transitivity.
Communicating Classes Note that equivalence relations produce a partitioning of the state space. Consider a set S whose
elements can be related to each other via any equivalence relation ⇔. Then S can be partitioned into a
collection of disjoint subsets S_1, S_2, …, S_M (where M might be infinite) such that i, j ∈ S_m implies that
i ⇔ j.
An essential state cannot be in the same communicating class as a non-essential state. This means we
can further divide the communicating class partition into a set of non-essential communicating classes
and a set of essential communicating classes.
If a DTMC starts in a state from a non-essential communicating class then once it leaves, it never
returns. If the DTMC starts in a state from an essential communicating class then it can never leave.
Irreducible Markov Chain If a DTMC has only one communicating class (i.e., all states communicate) then it is called an irreducible
DTMC.
Random Walk Behaviour Consider a random walk where X_{n+1} = X_n + Z_{n+1} with Z_i iid with P(Z_i = 1) = p and
P(Z_i = -1) = q = 1 - p. This DTMC is irreducible and so all states are essential.
However, if p > q, then E[X_n] = X_0 + n(p - q) → ∞, so X_n will ‘drift to infinity’, at least in expectation.
As such, for each fixed state j, with probability one, the DTMC will visit j only finitely many times.
We infer from this that even if a state is essential, we might still leave it and never return. This always
happens with non-essential states, but even for some essential states it still happens. Thus we need a
further classification of states.
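A short simulation illustrates this (a sketch; the drift parameter p = 0.7 and the run length are arbitrary choices): with upward drift the walk revisits 0 only a handful of times in a long run.

```python
import random

random.seed(1)

def visits_to_zero(p, steps=100_000):
    """Count how often a simple random walk started at 0 revisits 0."""
    x, visits = 0, 0
    for _ in range(steps):
        x += 1 if random.random() < p else -1
        visits += (x == 0)
    return visits

# With upward drift (p > 1/2) the walk escapes to infinity, so even
# though state 0 is essential it is visited only finitely often.
v = visits_to_zero(0.7)
print(v)
```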
Recurrence This further classification of states that we need relies on calculating the probability that the DTMC
returns to a state once it has left. Define f_j, the probability of returning to state j eventually:
f_j = P(X_n = j for some n ≥ 1 | X_0 = j).
State j is said to be recurrent if f_j = 1, and transient if f_j < 1. Transient states we might return to a
bunch of times, but eventually we leave them - there is a positive probability 1 - f_j that there is a last time that
we return to state j.
If the DTMC starts in a recurrent state j then, with probability one, it will eventually re-enter j. At this
point, the process will start anew (by the Markov property) and it will re-enter j again with probability
one. So the DTMC will (with probability one) visit j infinitely-many times.
If the DTMC starts in a transient state j then there is a probability 1 - f_j > 0 that it will never return. So,
letting N_j be the number of returns to state j after starting there, we see that N_j has a geometric
distribution: P(N_j = n) = f_j^n (1 - f_j).
Thus we have: E[N_j] = f_j/(1 - f_j) < ∞.
Thus the expected number of return times to a transient state is finite - eventually we expect that we
will never return there.
Consider the indicator variable 1{X_n = j}. We can define N_j as: N_j = Σ_{n≥1} 1{X_n = j}.
We can therefore calculate E[N_j] as:
E[N_j | X_0 = j] = Σ_{n≥1} P(X_n = j | X_0 = j) = Σ_{n≥1} p_jj^(n).
It follows that state j is recurrent if and only if Σ_{n≥1} p_jj^(n) = ∞.
Recurrence is a Class Property Recurrence is a class property, meaning that if one state in a communicating class is recurrent, then so
are all states in that class. To show this, assume that state j is recurrent and j ↔ k. Since the chain can
go from k to j and back again, there must exist m and n such that p_kj^(m) > 0 and
p_jk^(n) > 0. Now consider:
p_kk^(m+r+n) ≥ p_kj^(m) p_jj^(r) p_jk^(n).
This inequality holds since the latter event is a subset of the former event, requiring a specific route
through j that the first event does not. Summing over r we can now write:
Σ_r p_kk^(m+r+n) ≥ p_kj^(m) p_jk^(n) Σ_r p_jj^(r).
The terms p_kj^(m) and p_jk^(n) are independent of r, so can be taken out as constants when summing.
We know that
Σ_r p_jj^(r) = ∞
since j is recurrent (the finite m and n shifts don't matter for infinite
sums), so therefore
Σ_s p_kk^(s) = ∞,
and thus state k is also recurrent.
Note that if the Markov chain is irreducible then all states are either recurrent or transient and so it’s
appropriate to refer to the chain as either recurrent or transient.
Random Walk Recurrence We can compute the m-step transition probabilities from state j to itself by observing that these
probabilities are zero if m is odd and, if m = 2n is even, equal to:
p_jj^(2n) = C(2n, n) p^n q^n ≈ (4pq)^n / sqrt(πn), by Stirling's approximation.
The sum Σ_n p_jj^(2n) diverges if p = q = 1/2 (since then 4pq = 1), so the DTMC is recurrent. Otherwise 4pq < 1, the sum converges, and it is transient.
Periodicity
State j is periodic with period d > 1 if the set {n ≥ 1 : p_jj^(n) > 0} is non-empty and has greatest common
divisor d. If a state has period 1 we say it is aperiodic.
Periodicity is a class property, meaning that all states in the same communicating class have the same
period.
Steps for Analysing a DTMC Draw a transition diagram
Divide the state space into essential and non-essential states
Define the communicating classes, and divide them into recurrent and transient
Decide whether the classes are periodic
Recurrence in Finite State MC At least one state of a finite-state DTMC must be recurrent. This is fairly intuitive - if we have an infinite list of a finite number of things, one of those things must
appear an infinite number of times. Recall that a state k is transient if: Σ_n p_kk^(n) < ∞.
This means that the DTMC visits k only finitely-many times. Now define f_jk to be the probability that the
DTMC starts in state j and ever visits state k. We can break this up into the sum over all the possible
times when it first visits state k: f_jk = Σ_{n≥1} f_jk^(n), where f_jk^(n) is the probability that the first visit to k occurs at time n.
Now consider the probability that we go from j to k in n steps:
p_jk^(n) = Σ_{m=1}^{n} f_jk^(m) p_kk^(n-m).
That is, the sum over all the possible first visits to k at time m, followed by returning to k in the remaining
n - m steps.
We can now consider the probability of going from j to k in any number of steps. Substituting the above formula we have:
Σ_n p_jk^(n) = Σ_n Σ_{m=1}^{n} f_jk^(m) p_kk^(n-m) = f_jk Σ_{r≥0} p_kk^(r).
Here we have written the probability that we start from j and ever get to k, as the sum over all the
numbers of steps that could take (n), and also a sum over all numbers of steps (m) which could have
been our first time getting to state k. Note that
Σ_n p_jk^(n) ≥ f_jk
since we may go from j to k more than
once.
Now, if all states were transient, we would have: Σ_n p_jk^(n) = f_jk Σ_r p_kk^(r) < ∞ for every k.
Since the state space is finite, summing over k gives Σ_k Σ_n p_jk^(n) < ∞. But each row of the transition
matrix sums to one, so Σ_n Σ_k p_jk^(n) = Σ_n 1 = ∞.
This is a contradiction. Hence all states cannot be transient in a finite-state Markov chain; Σ_n p_kk^(n)
must be infinite for some k. This argument does not hold for an infinite Markov chain, which
obviously it shouldn't since a simple random walk with p ≠ 1/2 has all states transient.
Recurrence in Infinite State MC - First Step Analysis In order to be able to tell whether a class is recurrent, we need to be able to calculate the probability of
return for at least one state, which we will take to be state 0. Denote by α_j the probability that the chain ever
reaches state 0, given it starts at j. Ultimately therefore we want to determine:
f_0 = p_00 + Σ_{j≠0} p_0j α_j.
To do this, we need to introduce a more general equation:
α_j = p_j0 + Σ_{k≠0} p_jk α_k.
Which simply says that the probability we ever go from j to 0 is the probability of going straight there,
plus the sum over all other states k, of going to that state and then ever reaching state 0.
To see how this method works, consider a simple random walk that 'bounces off' zero. Thus:
p_j,j+1 = p and p_j,j-1 = q = 1 - p for j ≥ 1, with p_01 = p and p_00 = q.
Thus using our equation from above (i.e. the probability that we ever go from j to 0 is the probability
that we first go up and then ever reach zero, plus the probability that we first go down and then ever reach 0):
α_j = p α_{j+1} + q α_{j-1}, j ≥ 1.
This is a second-order linear difference equation with constant coefficients.
The characteristic equation p x^2 - x + q = 0 has roots: x = 1 and x = q/p.
If these roots are distinct (i.e. if p ≠ 1/2), the general solution is thus a linear combination of
these roots: α_j = A + B (q/p)^j.
Otherwise the general solution uses the generalised eigenvector: α_j = A + B j.
Where the values of the constants A and B need to be determined by boundary equations, or other
information that we have.
In both cases, when p ≤ 1/2 it should be clear that B = 0 (since (q/p)^j ≥ 1, or j itself, grows without bound), otherwise these won't be proper probabilities for large j. To
solve for A, substitute in j = 0: α_0 = A = 1.
Thus α_j = 1 for every j, which makes sense because p ≤ 1/2 and so we have a neutral or downward drift.
If we now consider the case where p > 1/2, we need to introduce the idea of continuity of
probability measure: as we take an increasing sequence of events, their probabilities converge to the
probability of the union of all these events.
This reasoning allows us to determine that (α_j) is the minimal nonnegative solution to the difference
equation we are trying to solve, because the probability that we ever get to zero is the increasing limit
of the probabilities of getting there by time 1, 2, 3, etc.
The upshot of this is that in our general solution of the form: α_j = A + B (q/p)^j.
We know that we are looking for the minimal solution, and now q/p < 1 so we cannot use the B = 0
argument we used earlier. We can still use the boundary condition: α_0 = A + B = 1.
Thus we have: α_j = A + (1 - A)(q/p)^j, which tends to A as j → ∞, so minimality forces A = 0.
Hence we find: α_j = (q/p)^j.
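A Monte Carlo sketch agrees with the formula α_1 = q/p (the choice p = 0.7 and the step cap are assumptions for illustration; walks that drift away are cut off, which introduces a negligible bias):

```python
import random

random.seed(0)

def ever_hits_zero(p, start=1, max_steps=1_000):
    """Simulate the walk from `start`; True if it reaches 0 within max_steps."""
    x = start
    for _ in range(max_steps):
        x += 1 if random.random() < p else -1
        if x == 0:
            return True
    return False

p, trials = 0.7, 5_000
est = sum(ever_hits_zero(p) for _ in range(trials)) / trials
print(est, (1 - p) / p)   # estimate vs the theoretical alpha_1 = q/p
```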
The Gambler’s Ruin Problem A gambler with fortune j bets one unit at a time, winning with probability p and losing with probability
q = 1 - p, stopping at ruin (state 0) or at a target fortune N. Let α_j be the probability of ruin starting from j.
The recurrence equation gives us: α_j = p α_{j+1} + q α_{j-1}, 0 < j < N.
Which for p ≠ 1/2 has the same general solution as before: α_j = A + B (q/p)^j.
Since we can never come down from N, the upper boundary condition gives us: α_N = 0.
Likewise the lower boundary condition is: α_0 = 1.
Which together yield: α_j = ((q/p)^j - (q/p)^N) / (1 - (q/p)^N).
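The ruin formula can be checked by simulation (a sketch; p = 0.45, j = 5, N = 10 are arbitrary illustrative values):

```python
import random

random.seed(2)

def ruin_prob(j, N, p, trials=20_000):
    """Monte Carlo estimate of ruin probability from fortune j (target N)."""
    ruined = 0
    for _ in range(trials):
        x = j
        while 0 < x < N:
            x += 1 if random.random() < p else -1
        ruined += (x == 0)
    return ruined / trials

p, q, j, N = 0.45, 0.55, 5, 10
r = q / p
exact = (r ** j - r ** N) / (1 - r ** N)
print(round(ruin_prob(j, N, p), 3), round(exact, 3))
```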
Null and Positive Recurrence We want to know the proportion of time a DTMC spends in each state over the long run. This should be
the same as the limiting probabilities π_j = lim_{n→∞} P(X_n = j). Note that this will be zero for transient and non-
essential states.
Define T_j(i) to be the time between the i-th and (i+1)-st return to state j.
Let us define the mean return time: m_j = E[T_j]. State j is positive recurrent if m_j < ∞, and null
recurrent if it is recurrent but m_j = ∞.
In the aperiodic, positive recurrent case, the long-run proportion of time spent in state j will be: π_j = 1/m_j.
In the null recurrent or transient case, m_j = ∞, so the limit becomes zero.
For this analysis to work, the chain must be aperiodic.
Ergodicity and Stationarity We call the DTMC ergodic if for all j the limit
π_j = lim_{n→∞} p_ij^(n)
exists and does not depend on the starting state i.
Note that stationary distributions are not necessarily the same thing as limiting distributions. They are
equivalent when the chain is irreducible, aperiodic and positive recurrent.
We often test whether an irreducible, aperiodic DTMC is ergodic by attempting to solve the equations
π = πP with Σ_j π_j = 1, and seeing if there is a unique solution.
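In practice, solving π = πP under the normalisation constraint is a small linear-algebra problem (a sketch with a hypothetical two-state matrix, not one from the notes):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])   # hypothetical two-state chain

# Solve pi = pi P together with sum(pi) = 1: stack (P^T - I) with a
# row of ones and least-squares solve the overdetermined system.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)                                  # stationary distribution
print(np.linalg.matrix_power(P, 50)[0])    # limiting row agrees
```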
Double-Stochastic Matrix If an aperiodic DTMC on N states has a doubly-stochastic transition matrix (columns as well as rows sum
to one), then we can easily verify that: π_j = 1/N is stationary.
This also provides a good example for the difference between limiting distributions and stationary
distributions. The 2 by 2 doubly stochastic matrix that swaps the two states with probability one is
periodic, so it has no limiting distribution, but it does have the stationary distribution (1/2, 1/2).
Random walk with One Barrier Consider again the random walk with a barrier on the lower boundary (i.e. positive states only). We know it is
irreducible (as every state is accessible from every other), aperiodic (as there is a loop at zero), and
recurrent if p (the upward probability) is less than or equal to 1/2. We now ask the question: is this chain
ergodic (i.e. is it positive or null recurrent)?
We can do this by using the above theorem, and determine whether there is a single probability solution
to the equation: π = πP.
Using the downwards boundary condition, we have: π_0 = q π_0 + q π_1, so π_1 = (p/q) π_0.
The rest of the chain comes from the transition probabilities: π_j = p π_{j-1} + q π_{j+1}, j ≥ 1.
Thus we arrive at the equations: π_j = (p/q)^j π_0.
For this to be a probability solution, we need all the π_j's to sum to 1: π_0 Σ_j (p/q)^j = 1, which is only possible if p < q, i.e. p < 1/2.
Thus we must have
π_j = (1 - p/q)(p/q)^j
in the stationary distribution.
Interpreting the Distribution For an irreducible, aperiodic and positive-recurrent DTMC, the distribution defined by π has a number of
interpretations
Limiting: π_j is the probability in the limit that the chain is in state j
Stationary: starting in distribution π the chain will remain in distribution π forever
Ergodic: the proportion of time the chain spends in state j converges to π_j with probability one
Markov Reward Processes Consider a situation where a cost or reward r_j is incurred/earned by the DTMC whenever it visits
state j for one time unit. In the stationary regime, the expected cost/reward per time unit is: r = Σ_j π_j r_j.
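As a sketch (with a hypothetical two-state chain and made-up rewards), the stationary reward rate Σ_j π_j r_j can be computed from the left eigenvector of the transition matrix for eigenvalue 1:

```python
import numpy as np

# Hypothetical two-state chain with a reward in state 0, a cost in state 1
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, -3.0])

# Stationary distribution: left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmax(np.real(w))])
pi = pi / pi.sum()

print(pi @ r)   # expected reward per time unit in the stationary regime
```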
In some situations we have a means of controlling such a process by making decisions that affect the
transition probabilities and we wish to make decisions that maximise our reward. The study of how to
do this is known as Markov decision theory.
Consider the situation where we have a discrete time stochastic process and a finite set of
actions. Transition probabilities depend on state and action, and can be written as p_ij(a). The
objective is to choose a set of actions (a policy) so as to maximise the expected value of rewards over
the time horizon with respect to our actions.
Poisson Processes
Key Distributions The Poisson distribution arises as the limit of the binomial distribution: Bin(n, λ/n) → Poisson(λ) as n → ∞.
The exponential distribution arises as the limit of a geometric distribution: if G_n ~ Geometric(λ/n), then G_n/n → Exp(λ) as n → ∞.
Introduction A nonnegative integer-valued process N_t is a Poisson process with a rate λ if:
it has independent increments on disjoint intervals, so the following are independent:
N_{t_1}, N_{t_2} - N_{t_1}, …, N_{t_n} - N_{t_{n-1}} for t_1 < t_2 < … < t_n
the value at any time depends on the time by: N_t ~ Poisson(λt)
for the value at two distinct times s < t: N_t - N_s ~ Poisson(λ(t - s))
This third result can be proved using the moment-generating function.
Note the distinction between N_t - N_s and N_{t-s}, with s < t: they have the same distribution but are
not the same random variable.
We can think of the Poisson process as being the limit of a discrete version: divide time into intervals of
length 1/n, and in each interval let there be an arrival independently with probability λ/n.
In the discrete version, the number of successes is binomially distributed, while the number of time
units needed for a success is geometrically distributed. As such,
N_t ~ Bin(nt, λ/n), which as we know
converges in distribution to Poisson(λt) as n → ∞. Likewise the waiting time (1/n)Geometric(λ/n)
converges with large n
to Exp(λ).
Joint Distribution Using the property of independent increments, we find that the joint distribution is given by:
P(N_{t_1} = k_1, …, N_{t_n} = k_n) = Π_i P(N_{t_i} - N_{t_{i-1}} = k_i - k_{i-1}).
That is, the joint distribution factorises into independent increment distributions.
Waiting Times are Exponential N_t is a Poisson process with rate λ if and only if the inter-arrival times τ_1, τ_2, … are independent Exp(λ) random variables.
To prove this, note that the waiting time until the jth jump, T_j, is less than t if and only if there are
j or more events in time [0, t], i.e. {T_j ≤ t} = {N_t ≥ j}. Thus we have:
P(T_1 ≤ t) = P(N_t ≥ 1) = 1 - e^{-λt}.
Thus we see that waiting times follow an exponential distribution. More generally: T_j = τ_1 + ⋯ + τ_j.
Which is a sum of j independent exponentially distributed random variables, forming a gamma
distribution.
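Simulating the process this way (exponential gaps summed until they exceed the horizon; the rate and horizon below are arbitrary choices) reproduces the Poisson mean λt:

```python
import random

random.seed(3)

def poisson_count(rate, t):
    """Arrivals in [0, t] generated from Exp(rate) inter-arrival gaps."""
    time, count = 0.0, 0
    while True:
        time += random.expovariate(rate)
        if time > t:
            return count
        count += 1

lam, t, trials = 2.0, 5.0, 20_000
mean = sum(poisson_count(lam, t) for _ in range(trials)) / trials
print(mean)   # close to lam * t = 10
```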
Order Statistics The kth order statistic of random variables X_1, …, X_n refers to the kth smallest value of these
variables, and is denoted X_(k).
In general if X_1, …, X_n are independent random variables with distribution function F and density f,
the distribution function of X_(k) is given by:
P(X_(k) ≤ x) = Σ_{i=k}^{n} C(n, i) F(x)^i (1 - F(x))^{n-i}.
The density is given by differentiating. So adding combinatorial coefficients:
f_(k)(x) = n!/((k-1)!(n-k)!) F(x)^{k-1} (1 - F(x))^{n-k} f(x).
The joint density for the order statistics of n variables is given by:
f(x_1, …, x_n) = n! f(x_1) ⋯ f(x_n) for x_1 < x_2 < … < x_n.
Conditional Distributions It turns out that the conditional distribution of arrival times T_1, …, T_n, given that the total number of occurrences
N_t is equal to n, is equal to the distribution of order statistics from n iid uniform variables.
Where U_1, …, U_n are independent Uniform on [0, t]. We can also state this as: (T_1, …, T_n) given N_t = n has the same distribution as (U_(1), …, U_(n)).
We can show this by direct computation, by rewriting the T's in terms of the inter-arrival times τ's.
By independence of non-overlapping intervals we obtain the joint density of the arrival times.
We can simplify the result by normalising each statistic by t.
Superposition of Poisson Processes The sum of two independent Poisson processes is itself a Poisson process with a rate equal to the sum of
the rates of the two initial processes.
To show that this is the case, just check the two axioms of Poisson processes:
The sum of Poissons is a Poisson, so N_t + M_t ~ Poisson((λ + μ)t)
For disjoint intervals, the increments of N + M are sums of increments of N and of M,
which are both independent given that N and M are Poisson processes
Thinning of a Poisson Process Suppose in a Poisson process each customer is marked (set aside) independently with probability p,
and denote by M_t the number of ‘marked’ customers up to time t. It turns out that the marked process M_t
and the ‘non-marked’ process N_t - M_t are independent Poisson processes with rates λp and
λ(1 - p).
This arises because when we specify a point being in the marked process, all that tells us is that we cannot have a point in
the exact same spot in the non-marked process. But the probability of having a point at any exact spot is zero anyway, so our
probability distribution is unchanged by conditioning on this information.
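A quick simulation sketch (the rate, horizon and marking probability are arbitrary choices) shows the marked stream has mean λpt:

```python
import random

random.seed(4)

def thinned_counts(rate, t, p):
    """One Poisson stream split into marked/unmarked arrivals."""
    time, marked, unmarked = 0.0, 0, 0
    while True:
        time += random.expovariate(rate)
        if time > t:
            return marked, unmarked
        if random.random() < p:
            marked += 1
        else:
            unmarked += 1

lam, t, p, trials = 3.0, 10.0, 0.25, 5_000
m_mean = 0.0
for _ in range(trials):
    m, _u = thinned_counts(lam, t, p)
    m_mean += m
m_mean /= trials
print(m_mean)   # close to lam * p * t = 7.5
```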
Poisson Arrivals See Time Averages One question we may ask is: when does a new arrival (say a new customer arriving at a queue in a store)
observe a Markov chain in its stationary state, namely seeing state j with probability π_j?
The “PASTA Theorem” states that Poisson arrivals see time averages, so from the customer’s
perspective the chain is always in its steady state. This is the case because the only additional
information that the arriving customer has is that an arrival occurs at that instant, so we write:
P(X_t = j | arrival at t) = π_j.
Thus, conditioning a Poisson process on a single arrival at a single point does not change the distribution.
Compound Poisson Process A compound Poisson process is a regular Poisson process where the increments are themselves i.i.d.
random variables (so that the jump sizes, and not only the inter-arrival times, are random). Such a process can be defined as:
X_t = Σ_{i=1}^{N_t} Y_i, where N_t is a Poisson process and the Y_i are iid.
Continuous-Time Markov Chains
Introduction A non-negative integer valued stochastic process X_t in continuous time is said to be a
Continuous-Time Markov Chain if, for all times t_1 < t_2 < … < t_{n+1} and states i_1, …, i_{n+1}:
P(X_{t_{n+1}} = i_{n+1} | X_{t_n} = i_n, …, X_{t_1} = i_1) = P(X_{t_{n+1}} = i_{n+1} | X_{t_n} = i_n).
Generally we deal with homogeneous CTMCs, whose transition probabilities don’t change over time, only
with the state.
To have the memoryless property, Markov chains must have exponential holding times, so that the rate
of jumping does not change depending on how long one has been at the state.
If T is the first time the chain leaves state j then, using the Markov property at step 2 and
homogeneity at step 3: P(T > s + t | T > s) = P(T > t), so T must be exponentially distributed.
The Chapman-Kolmogorov Equations For the continuous case we write these equations as:
p_jk(s + t) = Σ_i p_ji(s) p_ik(t).
In matrix form we can express this as: P(s + t) = P(s)P(t).
Unfortunately since time is now continuous, we cannot write P(t) as powers of a single matrix, since we
need to be able to use non-integer time values as well. We need a new object. Consider therefore:
(P(h) - I)/h.
We are interested in what happens when h becomes very small, so define: A = lim_{h→0} (P(h) - I)/h.
The continuity assumption on P(t) implies the existence of the matrix A, called the Q-matrix or
infinitesimal generator of the CTMC. Element-wise this is: a_jk = lim_{h→0} (p_jk(h) - δ_jk)/h.
Note that this value is always finite, and Σ_k a_jk = 0 (rows sum to zero).
Transition Probability Generator From the analysis above, we would hope to show that both of the following hold:
P'(t) = AP(t) (the backward equations) and P'(t) = P(t)A (the forward equations).
To do so, we need to verify that certain limits converge so that we can interchange limits and sums. This
analysis will be omitted here.
For non-explosive CTMCs, the matrix A determines the transition probabilities completely by solving the
backward or forward equations to get: P(t) = e^{At}.
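For a finite state space, P(t) = e^{At} can be evaluated from the truncated matrix-exponential series (a sketch with a hypothetical two-state generator; for serious work a dedicated matrix-exponential routine would be preferable):

```python
import numpy as np

def transition_matrix(A, t, terms=60):
    """P(t) = exp(A t) via the truncated matrix-exponential series."""
    M = A * t
    P = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ M / k     # accumulates M^k / k!
        P = P + term
    return P

# Hypothetical two-state generator: leave state 0 at rate 2, state 1 at rate 1
A = np.array([[-2.0, 2.0],
              [1.0, -1.0]])
P = transition_matrix(A, 0.5)
print(P)
print(P.sum(axis=1))   # rows sum to one
```

The result satisfies P(s)P(t) = P(s + t), as the Chapman-Kolmogorov equations require.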
Poisson Process Example The Poisson process is a variety of CTMC, where we have: jumps from j to j + 1 at rate λ.
So we have the very simple generator: a_jj = -λ, a_j,j+1 = λ, and all other entries zero.
Now solving for the transition probabilities element-wise:
p'_jk(t) = -λ p_jk(t) + λ p_j,k-1(t).
By induction we find the general rule to be: p_jk(t) = e^{-λt} (λt)^{k-j}/(k-j)! for k ≥ j.
Interpretation of the Generator Beginning with the equation used to define the generator:
p_jk(h) = δ_jk + a_jk h + o(h) for small h.
For a small value of h, in the case where the chain moves from j to k ≠ j (note the approximation,
because it is possible to leave state j, come back, and leave again, but for small h this is unlikely):
p_jk(h) ≈ a_jk h.
So we can think of a_jk as the rate of transition from j to k, with a_jk ≥ 0 for k ≠ j.
Since each row sums to zero, we have: a_jj = -Σ_{k≠j} a_jk ≤ 0.
Now consider the same probability from the point of view of leaving time. Since we know holding times
are exponential with some rate q_j, we have for the case of staying at state j:
p_jj(h) ≈ P(T > h) = e^{-q_j h}.
Using an expansion approximation for small h: e^{-q_j h} ≈ 1 - q_j h.
Comparing this with the result obtained above we have for k = j: q_j = -a_jj.
To find out where the CTMC moves upon leaving state j, we calculate:
P(X_h = k | X_0 = j, X_h ≠ j) = p_jk(h)/(1 - p_jj(h)).
In the limit of small h this tends to a_jk/q_j = a_jk/(-a_jj).
Ergodicity For an ergodic CTMC, the stationary distribution π satisfies: π_k = lim_{t→∞} p_jk(t).
This occurs if and only if π satisfies: πA = 0, with Σ_j π_j = 1.
Discrete vs Continuous
Birth and Death Processes Let X_t be the number of ‘people’ in a system at time t. Whenever there are j ‘people’ in the system,
new arrivals enter (by birth or immigration) the system at an exponential rate λ_j and ‘people’ leave (or
die from) the system at an exponential rate μ_j, with arrivals and departures occurring independently of
one another.
The generator of a birth and death process has the tridiagonal form: a_j,j+1 = λ_j, a_j,j-1 = μ_j,
a_jj = -(λ_j + μ_j), and zero elsewhere.
The CTMC evolves by remaining in state j for an exponentially-distributed time with rate λ_j + μ_j, then
it moves to state j + 1 with probability λ_j/(λ_j + μ_j)
and state j - 1 with probability μ_j/(λ_j + μ_j),
and so on.
The Poisson process is an example of a pure birth process with constant birth rates.
For a birth and death process with a given initial distribution, one can find p_j(t) = P(X_t = j) by solving
the system of differential equations (for a row vector p(t)): p'(t) = p(t)A.
Expanded by columns as: p'_j(t) = λ_{j-1} p_{j-1}(t) - (λ_j + μ_j) p_j(t) + μ_{j+1} p_{j+1}(t).
This system of equations governs the ‘redistribution’ of ‘probability mass’ as time passes. For finite state
space birth and death process, it can be solved numerically.
The stationary distribution can be found by solving πA = 0, which implies:
λ_{j-1} π_{j-1} + μ_{j+1} π_{j+1} = (λ_j + μ_j) π_j.
Which is equivalent to the detailed-balance condition: λ_j π_j = μ_{j+1} π_{j+1}.
Which has the solution: π_j = π_0 (λ_0 λ_1 ⋯ λ_{j-1})/(μ_1 μ_2 ⋯ μ_j).
Using the same theorem as for discrete processes: the chain is ergodic exactly when these terms can be
normalised to sum to one.
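The product-form solution is easy to evaluate numerically (a sketch; the M/M/1-style constant rates and the truncation at 200 states are assumptions for illustration):

```python
import numpy as np

def bd_stationary(lams, mus, n):
    """Stationary distribution of a birth-death chain on {0, ..., n}.

    lams[j] and mus[j] are the birth and death rates in state j;
    pi_j = pi_0 * prod(lams[i] / mus[i + 1] for i < j), then normalise.
    """
    pi = np.ones(n + 1)
    for j in range(1, n + 1):
        pi[j] = pi[j - 1] * lams[j - 1] / mus[j]
    return pi / pi.sum()

# M/M/1-style constant rates, truncated at n = 200 states (assumed example)
lam, mu, n = 1.0, 2.0, 200
pi = bd_stationary([lam] * n, [0.0] + [mu] * n, n)
print(pi[:4])   # geometric: (1 - rho) rho^j with rho = 1/2
```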
Queuing Theory
Introduction Queueing theory is the mathematical study of the operation of stochastic systems describing processing
of flows of jobs.
Kendall’s Notation This notation takes the form A/B/n/m, where:
A describes the arrival process
o A = M (Markov) inter-arrival times are independent and exponentially-distributed
o A = G inter-arrival times are independent with an arbitrary distribution
o A = D inter-arrival times are deterministic
B describes the service process
o B = M service times are independent and exponentially-distributed
o B = G service times are independent with an arbitrary distribution
o B = D service times are deterministic
n gives the number of servers
m gives the capacity of the system. When this is infinite it is usually omitted
Describing the M/M/1 Queue Arrival stream: Poisson process with intensity λ
Service: n = 1 server, service time Exp(μ)
Infinite space for waiting: m = ∞.
The state X_t gives the number of customers at time t: If X_t = 0 the server is idle, while if X_t = j ≥ 1
one customer is being served and j - 1 customers are waiting in the queue.
The queue length of an M/M/1 queue evolves as a birth and death process with birth rates λ_j = λ and
death rates μ_j = μ. Thus we use our result from birth-death processes to determine the stationary
distribution (with
ρ = λ/μ < 1): π_j = (1 - ρ)ρ^j.
Average number of people in the system is calculated by: L = Σ_j j π_j = ρ/(1 - ρ).
Average queue length is: L_q = Σ_{j≥1} (j - 1) π_j = ρ^2/(1 - ρ).
Waiting Times Waiting time depends on N, the number of people already in the system when a new arrival appears.
For service times S_1, …, S_N, the new arrival will have to wait: W_q = S_1 + ⋯ + S_N.
This leads to the expectation: E[W_q] = E[N]/μ = ρ/(μ(1 - ρ)) = λ/(μ(μ - λ)).
The expected total time in the system is simply the sum of expected waiting time and expected time
being served: E[W] = E[W_q] + 1/μ = 1/(μ - λ).
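These formulas can be bundled into a small helper (a sketch; the rates λ = 2, μ = 3 are arbitrary illustrative values):

```python
def mm1_metrics(lam, mu):
    """Standard M/M/1 quantities (requires rho = lam/mu < 1)."""
    rho = lam / mu
    assert rho < 1, "queue is unstable"
    L = rho / (1 - rho)          # mean number in system
    Lq = rho ** 2 / (1 - rho)    # mean queue length
    W = 1 / (mu - lam)           # mean total time in system
    Wq = rho / (mu - lam)        # mean waiting time
    return L, Lq, W, Wq

L, Lq, W, Wq = mm1_metrics(lam=2.0, mu=3.0)
print(L, Lq, W, Wq)
```

Note that the output satisfies L = λE[W], which is exactly Little's Law.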
Little’s Law This expresses a relationship between queue length and expected waiting time: L = λE[W].
Describing the M/M/a Queue Arrival stream: Poisson process with intensity λ
Service: n = a servers, service time Exp(μ)
Infinite space for waiting: m = ∞.
A stationary distribution exists if: λ < aμ.
The stationary distribution is given by: π_j = π_0 (λ/μ)^j/j! for j ≤ a, and π_j = π_0 (λ/μ)^j/(a! a^{j-a}) for j > a.
Average queue length is: L_q = C ρ/(1 - ρ), with ρ = λ/(aμ).
Where C is the proportion of time that all servers are busy.
By Little’s Law, the expected waiting time is: E[W_q] = L_q/λ = C/(aμ - λ).
And so the expected delay is simply E[W_q] plus the expected service time: E[W] = E[W_q] + 1/μ.
Single vs Multiple Servers Which is better: a single server with service rate aμ, or a servers with service rate μ each? A heuristic
argument tells us that if all a servers are busy, both systems work with the same rate, but if only k < a
servers are busy the rate for
the a-server queue is kμ, which is less than the rate aμ for the single server. So we might conclude that
the single server is better.
We can show this explicitly by comparing the expected time for the M/M/1 and M/M/a queues:
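A sketch of the comparison (the Erlang C expression for the probability of waiting is standard, but the rates chosen here are arbitrary):

```python
from math import factorial

def erlang_c(a, offered):
    """Probability an arrival must wait in M/M/a (offered = lam/mu < a)."""
    rho = offered / a
    idle = sum(offered ** k / factorial(k) for k in range(a))
    busy = offered ** a / (factorial(a) * (1 - rho))
    return busy / (idle + busy)

def mean_time(lam, mu, a):
    """Expected total time in system for an M/M/a queue."""
    C = erlang_c(a, lam / mu)
    return C / (a * mu - lam) + 1 / mu

lam = 1.8
# one fast server (rate 2) versus two slow servers (rate 1 each)
print(mean_time(lam, 2.0, 1), mean_time(lam, 1.0, 2))
```

With these numbers the single fast server gives the smaller expected time in system, as the heuristic suggests.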
Renewal Theory
Introduction A renewal process is a counting process for which the times between successive
events, called renewals, are independent and identically-distributed random variables with an arbitrary
common distribution function F. A Poisson process is a renewal process, but a renewal process may not
be Poisson.
Important properties:
Explosions Is it possible to have an infinite number of jumps in a finite time? To see that this is not possible:
since T_n = X_1 + ⋯ + X_n → ∞ with probability one (the X_i are iid with mean μ > 0), N_t is finite for every t.
Distribution of Nt We know that the process tends to infinity over time: N_t → ∞ as t → ∞.
To get an expression for the distribution function write: P(N_t ≥ n) = P(T_n ≤ t).
But how can we find the rate at which N_t grows?
Since we have T_{N_t} ≤ t < T_{N_t + 1}, it follows: T_{N_t}/N_t ≤ t/N_t < T_{N_t + 1}/N_t.
By the strong law of large numbers, we know also that: T_n/n → μ = E[X_1] with probability one.
And hence by the sandwich theorem we find that: N_t/t → 1/μ.
And we see that, for large t, N_t grows like t/μ.
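A simulation sketch (Uniform(0, 2) gaps, so μ = 1, is an arbitrary choice) shows N_t/t settling at 1/μ:

```python
import random

random.seed(5)

def renewal_count(t, sample_gap):
    """Number of renewals in [0, t] with iid gaps drawn by sample_gap()."""
    time, n = 0.0, 0
    while True:
        time += sample_gap()
        if time > t:
            return n
        n += 1

# Uniform(0, 2) gaps, so mu = 1 and N_t / t should approach 1
t = 50_000.0
n = renewal_count(t, lambda: random.uniform(0.0, 2.0))
print(n / t)
```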
The M/G/1/1 Queue We consider a single-server Erlang Loss System. There is no queue: when an arriving customer finds the
server busy, he/she does not enter. Service times are independent and identically-distributed with
distribution function G with mean 1/μ.
Let N_t be the number of customers who have been admitted by time t. Then the times between successive
entries of customers are made up of a service time, and then a waiting time from the end of service until
the next arrival.
The mean time between renewals is the sum of mean service and waiting times: 1/μ + 1/λ.
Thus the entry rate is: 1/(1/μ + 1/λ) = λμ/(λ + μ).
The proportion of customers who enter the queue is: (entry rate)/λ = μ/(λ + μ).
The Renewal Central Limit Theorem The theorem states that for E[X] = μ and Var(X) = σ^2:
(N_t - t/μ)/sqrt(tσ^2/μ^3) → N(0, 1) in distribution.
To show this: choose n ≈ t/μ + x sqrt(tσ^2/μ^3), so that P(N_t ≥ n) = P(T_n ≤ t), and apply the ordinary
central limit theorem to T_n.
Note that this applies regardless of the distribution of X.
Residual Lifetime The residual lifetime at time t is the amount of time until the next arrival: R_t = T_{N_t + 1} - t.
When the distribution of X is non-lattice (i.e. not concentrated on a regular grid of points), then for all x:
lim_{t→∞} P(R_t ≤ x) = (1/μ) ∫_0^x (1 - F(y)) dy.
To show this, note that the proportion of time up to the nth arrival where the residual lifetime is longer
than x is: (1/T_n) Σ_{i=1}^{n} max(X_i - x, 0).
By the law of large numbers, and since long-term limiting proportions are the same as limiting
probabilities, as n approaches infinity this tends to: E[max(X - x, 0)]/μ = (1/μ) ∫_x^∞ (1 - F(y)) dy.
Hence we have the limiting distribution above.
Where F is the distribution of the renewal times X_i.
Age Distribution The age at time t is the time since the last renewal: A_t = t - T_{N_t}.
Note that: A_t > x and R_t > y exactly when there is no renewal in the interval (t - x, t + y].
If we set y = 0 we find: the age has the same limiting distribution as the residual lifetime.
To find the joint residual/age density, twice differentiate the joint distribution function.
Brownian Motion
Defining Brownian Motion The normal distribution arises as the limit of random walks. We know from the central limit theorem
that for Z_i iid with mean 0 and variance 1: (Z_1 + ⋯ + Z_n)/sqrt(n) → N(0, 1) in distribution.
A continuous time stochastic process B_t is standard Brownian motion if:
It has continuous sample paths
It has independent increments on disjoint intervals
For each t > s ≥ 0, B_t - B_s ~ N(0, t - s)
Properties of Brownian Motion Brownian motion can be considered as a limit of a random walk, packing an infinite number of steps into
every finite time interval. On the basis of this we can derive the following scaling property: for any c > 0,
B_{ct}/sqrt(c) is again standard Brownian motion.
Furthermore a Brownian motion process restarted at any moment is still a Brownian motion process:
B_{s+t} - B_s, t ≥ 0, is standard Brownian motion, independent of the path up to time s.
Brownian motion with parameter σ^2 is defined to have the same distribution as σB_t.
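Brownian motion can be simulated by summing independent N(0, dt) increments (a sketch; the step count, horizon and sample size are arbitrary), and the variance at time t comes out close to t:

```python
import random

random.seed(6)

def brownian_endpoint(t, steps):
    """B_t approximated by summing independent N(0, dt) increments."""
    dt = t / steps
    b = 0.0
    for _ in range(steps):
        b += random.gauss(0.0, dt ** 0.5)
    return b

samples = [brownian_endpoint(2.0, 100) for _ in range(10_000)]
var = sum(x * x for x in samples) / len(samples)
print(var)   # Var(B_2) = 2, up to Monte Carlo error
```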
Multivariate Normal Distribution The multivariate normal distribution describes the joint distribution of sums of independent normal
variables. We say that X has the multivariate normal distribution if its density is:
f(x) = (2π)^{-k/2} |Σ|^{-1/2} exp(-(x - μ)ᵀ Σ^{-1} (x - μ)/2).
Where Σ is a positive definite matrix (meaning zᵀΣz > 0 for all z ≠ 0, and so the matrix has a ‘square root’) called the
covariance matrix, and μ is a k-vector mean.
So an individual entry is given by: Σ_ij = Cov(X_i, X_j).
An easier way of writing a multivariate normal is by using a lower triangular matrix L such that LLᵀ = Σ.
We can also write: X = μ + LZ, where Z is a vector of independent standard normals.
Since we know the density of Z, we can find the density of X by simply making a change of variables.
For any invertible matrix M, we have that MX is a multivariate normal with mean Mμ and covariance MΣMᵀ.
Joint Brownian Distribution The joint distributions of Brownian motion observed at a collection of times t_1 < t_2 < … < t_n are linear functions of
independent normal variables, which correspond to the increments. Thus we can write:
(B_{t_1}, …, B_{t_n}) as a linear transformation of (B_{t_1}, B_{t_2} - B_{t_1}, …, B_{t_n} - B_{t_{n-1}}).
Therefore we see that we can write the joint distribution as a multivariate normal distribution.
The means are zero and so the distribution is entirely determined by the pairwise covariances, which
we can compute as: Cov(B_s, B_t) = Cov(B_s, B_s) + Cov(B_s, B_t - B_s) = s for s ≤ t.
Thus we have the covariance matrix for the joint distribution of (B_{t_1}, …, B_{t_n}): Σ_ij = min(t_i, t_j).
Einstein Derivation of Brownian Motion Brownian motion arises as the limit of random walk, and so inherits the definition/properties from the
random walk. This result is called Donsker’s Theorem or the invariance principle.
Define a simple random walk with time steps of length h and space steps of length sqrt(h), and let p(x, t)
be the density of its position.
By the law of total probability: p(x, t + h) = (1/2)p(x - sqrt(h), t) + (1/2)p(x + sqrt(h), t).
Taking h → 0 and doing some rearranging (a Taylor expansion in x and t) we find:
∂p/∂t = (1/2) ∂²p/∂x².
This is the heat equation which under appropriate boundary conditions has the solution:
p(x, t) = (1/sqrt(2πt)) e^{-x²/(2t)}.
Which is the normal density.
Hitting Times Define the first time when a Brownian motion process hits the level b > 0 to be T_b. Since Brownian motion is
continuous, if B_t > b then T_b < t. Furthermore, since random walk is recurrent, T_b is finite with probability one.
We can derive the distribution of T_b by the equation:
P(B_t > b) = P(B_t > b | T_b ≤ t) P(T_b ≤ t).
So all that remains is to find the conditional probability. However since we know that restarting
a Brownian motion at the moment it hits b leaves the result still a Brownian motion process (started at b), we have
the reflection principle: P(B_t > b | T_b ≤ t) = 1/2.
Thus we find that hitting times are distributed as: P(T_b ≤ t) = 2P(B_t > b).
The density of the hitting time is thus given by what is called Levy’s distribution:
f_{T_b}(t) = (b/sqrt(2πt³)) e^{-b²/(2t)}.
Maximum of Brownian Motion The distribution of the maximum M_t = max_{s≤t} B_s of B up to time t is derived as follows:
P(M_t ≥ b) = P(T_b ≤ t) = 2P(B_t > b) = P(|B_t| > b).
So we find that the maximum of Brownian motion is distributed as the absolute value of a normal
distribution.
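A discretised simulation sketch supports this (the 200-step grid slightly under-samples the true maximum, so the agreement with 2P(B_t > b) is only approximate; the level b = 1 and horizon t = 1 are arbitrary):

```python
import random
from math import erf, sqrt

random.seed(7)

def max_of_path(t, steps):
    """Running maximum of a discretised standard Brownian path on [0, t]."""
    dt = t / steps
    b, m = 0.0, 0.0
    for _ in range(steps):
        b += random.gauss(0.0, dt ** 0.5)
        m = max(m, b)
    return m

t, b, trials = 1.0, 1.0, 10_000
hit = sum(max_of_path(t, 200) >= b for _ in range(trials)) / trials
exact = 2 * (1 - 0.5 * (1 + erf(b / sqrt(2 * t))))   # 2 P(B_t > b) = P(|B_t| > b)
print(hit, exact)
```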
Relative Hitting Times By analogy with the gambler’s ruin problem (with p = q = 1/2), for a < 0 < b:
P(T_b < T_a) = -a/(b - a).