Probability - The Science of Uncertainty and Data (2021)
Use the course to re-build my statistics knowledge.
1 Sample Space and Probability
1.1 Sample space - A set of outcomes
- discrete/finite example
- continuous example
1.2 Probability Axioms
- Nonnegativity: \(P(A) \geq 0\)
- Normalization: \(P(\Omega) = 1\), where \(\Omega\) is the entire sample space
- (Finite) additivity: if A and B are disjoint, then \(P(A \cup B) = P(A) + P(B)\) (to be strengthened later)
1.2.1 Simple consequences of the axioms
- For a sample space consisting of a finite number of disjoint single-element events, \[ P(\{s_1, s_2, \dots, s_n\}) = P(s_1) + P(s_2) + \dots + P(s_n) \]
- If \(A \subset B\), then \(P(A) \leq P(B)\)
- \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
- \(P(A \cup B) \leq P(A) + P(B)\)
1.3 Probability calculations
1.3.1 Uniform Probability Law
- Discrete example: if the sample space consists of n equally likely outcomes (i.e., all single-element events have the same probability), then \[ P(A) = \frac{\text{number of elements of } A}{n} \]
- Continuous example: probability = area
1.3.2 Discrete but infinite sample space
- Sample space: {1, 2, 3, ...}, with \(P(n) = \frac{1}{2^n}\), n = 1, 2, 3, .... Checking that \(P(\Omega) = 1\): \(\frac{1}{2} + \frac{1}{4} + \dots = \sum\limits_{n=1}^\infty \frac{1}{2^n} = \frac{1}{2}\sum\limits_{n=0}^\infty \frac{1}{2^n} = \frac{1}{2}\cdot\frac{1}{1-1/2} = 1\)
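A quick numerical check of this normalization (a sketch in Python; the cutoff values N are arbitrary):

```python
# Partial sums of P(n) = 1/2^n over n = 1..N should approach P(Omega) = 1.
partial_sums = [sum(0.5 ** n for n in range(1, N + 1)) for N in (5, 10, 50)]
print(partial_sums)  # 0.96875, then values ever closer to 1
```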
1.3.3 Countable additivity axiom
Additivity holds only for "countable" sequences of events.
If \(A_1, A_2, A_3, \dots\) is an \(\underline{\text{infinite sequence of disjoint events}}\), then
\[ P(A_1 \cup A_2 \cup \dots) = P(A_1) + P(A_2) + \dots \]
1.4 Mathematical background
1.4.1 Sets - A collection of distinct elements
- finite: e.g. {a, b, c, d}
- infinite: e.g. the reals (R)
- \( \Omega \) - the universal set
- Ø - the empty set
What are the reals?
The reals include the rational numbers (terminating decimals and non-terminating recurring decimals) and the irrational numbers (non-terminating non-recurring decimals).
1.4.2 Unions and intersection
1.4.3 De Morgan's Laws
- \( (S \cap T)^c = S^c \cup T^c \) and \( (S \cup T)^c = S^c \cap T^c \)
- \( (S^c \cap T^c)^c = S \cup T \)
1.4.4 Other important mathematical background
- Sequences and their limits. Sequence: an enumerated collection of objects
- When does a sequence converge?
  - if \(a_i \leq a_{i+1}\) for all i, the sequence either "converges to \(\infty\)" or converges to some real number a
  - if \(|a_i - a| \leq b_i\) for some sequence \(b_i \to 0\), then \(a_i \to a\)
- Infinite series (infinite sums) vs. summation (finite sums): \(\sum\limits_{i=1}^\infty a_i = \lim\limits_{n\to\infty}\sum\limits_{i=1}^n a_i\)
  - if \(a_i \geq 0\): the limit exists (possibly infinite)
  - if the terms \(a_i\) do not all have the same sign: a. the limit may not exist; b. the limit may exist but differ if we sum in a different order; c. Fact: the limit exists and is independent of the order of summation if \(\sum\limits_{i=1}^\infty |a_i| < \infty\)
- Geometric series: \(\sum\limits_{i=0}^\infty a^i = 1 + a + a^2 + \dots = \frac{1}{1-a}\), for \(|a| < 1\)
1.5 Sets
1.5.1 Countable and uncountable infinite sets
- Countable
  - the integers, pairs of positive integers, etc.
  - the rational numbers q, with 0 < q < 1
- Uncountable ("continuous" sets)
  - the interval [0, 1]
  - the reals, the plane, etc.
- How to prove the reals are uncountable: Cantor's diagonalization argument
Unit 2 Conditioning and independence
Refer to Sections 1.3-1.5 in the textbook
2.1 Conditional probability and Bayes' rule
2.1.1 The definition of conditional probability
P(A|B) = "probability of A, given that B occurred"
\[ P(A|B) = \frac{P(A \cap B )}{P(B)} \]
defined only when P(B) > 0
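As a worked example (hypothetical, not from the course): roll two fair dice and condition on the sum being 9; the definition then reduces to counting outcomes.

```python
from fractions import Fraction

# A = "first die shows 6", B = "the sum is 9", over 36 equally likely outcomes.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
B = [(i, j) for (i, j) in outcomes if i + j == 9]
A_and_B = [(i, j) for (i, j) in B if i == 6]

P_B = Fraction(len(B), len(outcomes))               # 4/36 = 1/9
P_A_and_B = Fraction(len(A_and_B), len(outcomes))   # 1/36
P_A_given_B = P_A_and_B / P_B                       # definition of conditional probability
print(P_A_given_B)  # 1/4
```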
2.1.2 Conditional probabilities share properties of ordinary probabilities
- \(P(A|B) \geq 0\)
- \(P(\Omega|B) = 1\)
- \(P(B|B) = 1\)
- If \(A \cap C = Ø\), then \(P(A \cup C|B) = P(A|B) + P(C|B)\); additivity also extends to countable sequences of disjoint events (countable additivity axiom)
2.1.3 Models based on conditional probabilities

1. The multiplication rule
\(P(A \cap B) = P(B)P(A|B) = P(A)P(B|A)\)
\(P(A^c \cap B \cap C^c) = P(A^c \cap B) P(C^c|A^c \cap B) = P(A^c) P(B|A^c) P(C^c|A^c \cap B)\)
\(P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) \prod\limits_{i=2}^n P(A_i|A_1 \cap A_2 \cap \dots \cap A_{i-1})\)

2. Total probability theorem

3. Bayes' rule

2.2 Independence
2.2.1 Conditional independence
Independence of two events
- Intuitive "definition": P(B|A) = P(B) - the occurrence of A provides no new information about B
- Definition of independence: \(P(A \cap B) = P(A) \times P(B)\)
Whether two events are disjoint or not tells us nothing about their independence.
Independence of complements
If A and B are independent, then A and \(B^c\) are independent.

Conditioning may affect independence

2.2.2 Independence of a collection of events
- Intuitive "definition": information about some of the events does not change probabilities related to the remaining events
- Definition: events \(A_1, A_2, \dots, A_n\) are called independent if \(P(A_i \cap A_j \cap \dots \cap A_m) = P(A_i)P(A_j)\cdots P(A_m)\) for any distinct indices \(i, j, \dots, m\)
Pairwise independence
n = 3:
\(P(A_1 \cap A_2) = P(A_1)P(A_2)\)
\(P(A_1 \cap A_3) = P(A_1)P(A_3)\)
\(P(A_2 \cap A_3) = P(A_2)P(A_3)\)
vs. 3-way independence
\(P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3)\)
Independence vs. pairwise independence
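The standard illustration that pairwise independence does not imply mutual independence can be checked by brute force (a sketch; the events come from the usual two-coin example):

```python
from fractions import Fraction
from itertools import product

# Two independent fair coin tosses:
# A1 = "first toss is H", A2 = "second toss is H", A3 = "the two tosses agree".
outcomes = list(product("HT", repeat=2))  # 4 equally likely outcomes

def prob(event):
    return Fraction(sum(event(w) for w in outcomes), len(outcomes))

A1 = lambda w: w[0] == "H"
A2 = lambda w: w[1] == "H"
A3 = lambda w: w[0] == w[1]

# Every pair is independent ...
pairwise = all(
    prob(lambda w: E(w) and F(w)) == prob(E) * prob(F)
    for E, F in [(A1, A2), (A1, A3), (A2, A3)]
)
# ... but the three events are not mutually independent:
triple = prob(lambda w: A1(w) and A2(w) and A3(w))  # 1/4
product_of_three = prob(A1) * prob(A2) * prob(A3)   # 1/8
print(pairwise, triple, product_of_three)
```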

2.2.3 Reliability

Unit 3 Counting
3.1 Basic counting principle
r stages, with \(n_i\) choices at stage i, give a total of \( n_1 \cdot n_2 \cdots n_r \) possible choices
3.2 Permutation
- Permutation: the number of ways of ordering n elements (no repetition allowed)
\[n * (n-1) * (n-2) * ... * 2 * 1 = n!\]
- Number of subsets of {1, 2, ...n} = \(2^n\)
3.3 Combinations
- Combinations \(\binom{n}{k}\): the number of k-element subsets of a given n-element set
How is the combination formula derived? There are two ways of constructing an ordered sequence of k distinct items:
- choose the k items one at a time: \[ n(n-1)\cdots(n-k+1) = \frac{n!}{(n-k)!} \]
- choose k items, then order them: \[ \binom{n}{k}k! \]
Equating the two counts, \[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]
3.4 Binomial coefficient
- Binomial coefficient \(\binom{n}{k}\) and binomial probabilities: toss a coin n times, with independent tosses and P(Heads) = p. \[ P(\text{k heads}) = \binom{n}{k}p^k (1-p)^{n-k} \] The probability of any one particular k-head sequence is \[ p^k (1-p)^{n-k} \] so \(\binom{n}{k}\) counts the number of k-head sequences.
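A small sanity check of the binomial formula (hypothetical n and p):

```python
from math import comb

# Binomial probabilities for n = 4 tosses with P(Heads) = 0.3.
n, p = 4, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(pmf)
# The probabilities over k = 0, ..., n must sum to 1:
print(sum(pmf))
```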
3.5 Partitions
- Multinomial coefficient (number of partitions of n items into groups of sizes \(n_1, \dots, n_r\)): \[ \frac{n!}{n_1! n_2! \cdots n_r!} \]
If r = 2, then \(n_1 = k\) and \(n_2 = n - k\), which gives \(\frac{n!}{k!(n-k)!} = \binom{n}{k}\)
- A simple example


4 Discrete random variables
4.1 Probability mass function (PMF)
Random variable (r.v.): a function from the sample space to the real numbers, denoted X.
PMF: the probability distribution of X
\[ p_X(x) = P(X = x) = P(\{\omega \in \Omega \text{ s.t. } X(\omega) = x\}) \]
4.2 Discrete Random variable examples
4.2.1 Bernoulli random variables
with parameter \(p \in [0,1]\)
\[ p_X(x) = \begin{cases} p, & x = 1 \\ 1 - p, & x = 0 \end{cases} \]
- Models a trial that results in success/failure, Heads/Tails, etc.
- Indicator random variable of an event A: \(I_A = 1\) iff A occurs
4.2.2 Uniform random variables
with parameters a, b
- Experiment: pick one of a, a+1, ..., b at random; all equally likely
- Sample space: {a, a+1, ..., b}
- Random variable X: \(X(\omega) = \omega\)
4.2.3 Binomial random variables
with parameters: positive integer n; \(p \in [0,1]\)
- Experiment: n independent tosses of a coin with P(Heads) = p
- Sample space: set of sequences of H and T, of length n
- Random variable X: number of Heads observed
- Models: the number of successes in a given number of independent trials
\[ p_X(k) = \left(\begin{array}{c} n \\ k \end{array} \right)p^k(1-p)^{n-k}, k = 0, 1 ..., n \]
4.2.4 Geometric random variables
with parameter p: 0 < p ≤ 1
- Experiment: infinitely many independent tosses of a coin; P(Heads) = p
- Random variable X: number of tosses until the first Heads
- Models: waiting times; the number of trials until a success
\[
p_X(k) = P(X = k) = P(T...TH) =(1-p)^{k-1}p, k = 1,2,3...
\]
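A numerical check that this PMF sums to 1 and has mean 1/p (truncating the infinite sum; p = 0.25 is an arbitrary choice):

```python
# Geometric PMF p_X(k) = (1-p)^(k-1) * p, truncated at a large k.
p = 0.25
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 400)]
total = sum(pmf)                                          # should be ~1
mean = sum(k * pk for k, pk in enumerate(pmf, start=1))   # should be ~1/p = 4
print(total, mean)
```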
4.3 Expectation/mean of a random variable
- Definition: \[ E[X] = \sum\limits_{x} xp_X(x) \]
- Interpretation: average in a large number of independent repetitions of the experiment
- Elementary properties
  - If X ≥ 0, then E[X] ≥ 0
  - If a ≤ X ≤ b, then a ≤ E[X] ≤ b
  - If c is a constant, E[c] = c
- The expected value rule: for \(Y = g(X)\), \[ E[Y] = \sum\limits_y yp_Y(y) = E[g(X)] = \sum\limits_x g(x)p_X(x) \]
- Linearity of expectation: \(E[aX+b] = aE[X] + b\)
4.4 Variance - a measure of the spread of a PMF
4.4.1 Definition of variance:
\[ var(X) = E[(X - \mu)^2] = \sum\limits_x (x - \mu)^2 p_X(x) \]
standard deviation: \(\sigma_X = \sqrt{var(X)}\)
4.4.2 Properties of the variance
- Notation: \(\mu = E[X]\)
- \(var(aX + b) = a^2var(X)\)
- A useful formula: \[ var(X) = E[X^2] - (E[X])^2 \]
Summary of Expectation and Variance of Discrete Random Variables
| Random Variable | PMF | E[X] | var(X) |
|---|---|---|---|
| Bernoulli (p) | \(p_X(1) = p,\ p_X(0) = 1-p\) | \(p\) | \(p(1-p)\) |
| Uniform (a, b) | \(p_X(x) = \frac{1}{b-a+1},\ x = a, a+1, \dots, b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)(b-a+2)}{12}\) |
| Binomial (n, p) | \(p_X(k) = \binom{n}{k}p^k(1-p)^{n-k},\ k = 0, 1, \dots, n\) | \(np\) | \(np(1-p)\) |
| Geometric (p) | \(p_X(k) = (1-p)^{k-1}p,\ k = 1, 2, 3, \dots\) | \(\frac{1}{p}\) | \(\frac{1-p}{p^2}\) |
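The table entries can be spot-checked by computing E[X] and var(X) directly from the PMFs (a sketch; the parameter values are arbitrary):

```python
from math import comb

def mean_var(support_probs):
    """E[X] and var(X) computed directly from (value, probability) pairs."""
    m = sum(x * p for x, p in support_probs)
    v = sum((x - m) ** 2 * p for x, p in support_probs)
    return m, v

# Binomial(n=5, p=0.2): table predicts E = np = 1.0, var = np(1-p) = 0.8
n, p = 5, 0.2
binom = [(k, comb(n, k) * p**k * (1 - p)**(n - k)) for k in range(n + 1)]
print(mean_var(binom))

# Discrete uniform on {2, ..., 6}: table predicts E = (a+b)/2 = 4
# and var = (b-a)(b-a+2)/12 = 4*6/12 = 2
a, b = 2, 6
unif = [(x, 1 / (b - a + 1)) for x in range(a, b + 1)]
print(mean_var(unif))
```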
4.5 Conditional PMF and expectation, given an event
4.5.1 Conditional PMFs
\(p_{X|A}(x) = P(X = x|A)\); in particular, given A = {Y = y}:
\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}
\]
4.5.2 Conditional PMFs involving more than two random variables
- \(p_{X|Y,Z}(x|y,z) = P(X = x|Y = y, Z = z) = \frac{P(X=x,Y=y,Z=z)}{P(Y=y, Z=z)} = \frac{p_{X,Y,Z}(x,y,z)}{p_{Y,Z}(y,z)}\)
- Multiplication rule: \(p_{X,Y,Z}(x,y,z) = p_X(x)p_{Y|X}(y|x)p_{Z|X,Y}(z|x,y)\)
- Total probability and expectation theorems: \(p_X(x) = P(A_1)p_{X|A_1}(x) + \dots + P(A_n)p_{X|A_n}(x) \implies p_X(x) = \sum\limits_y p_Y(y)p_{X|Y}(x|y)\), and \(E[X] = P(A_1)E[X|A_1] + \dots + P(A_n)E[X|A_n] \implies E[X] = \sum\limits_y p_Y(y) E[X|Y = y]\)
4.6 Multiple random variables and joint PMFs
4.6.1 Joint PMF
\[ p_{X,Y}(x,y) = P(X = x, Y =y) \]
- \(\sum\limits_x \sum\limits_y p_{X,Y}(x,y) = 1\)
- Marginal PMFs: \(p_X(x) = \sum\limits_y p_{X,Y}(x,y)\) and \(p_Y(y) = \sum\limits_x p_{X,Y}(x,y)\)
4.6.2 Functions of multiple random variables
\(Z = g(X,Y)\)
- PMF: \(p_Z(z) = P(Z=z) = P(g(X,Y) = z)\)
- Expected value rule: \(E[g(X,Y)] = \sum\limits_x \sum\limits_y g(x,y) p_{X,Y}(x,y)\)
- Linearity of expectation
  - \(E[aX + b] = aE[X] + b\)
  - \(E[X + Y] = E[X] + E[Y]\)
4.6.3 Independence of multiple random variables
- \(P(X = x \text{ and } Y = y) = P(X = x) \times P(Y = y)\), for all x, y
- Equivalently, \(p_{X|Y}(x|y) = p_X(x)\) and \(p_{Y|X}(y|x) = p_Y(y)\)
- Independence and expectations
  - In general, \(E[g(X,Y)] \neq g(E[X], E[Y])\)
  - If X, Y are independent: \(E[XY] = E[X]E[Y]\); moreover, g(X) and h(Y) are also independent, so \(E[g(X)h(Y)] = E[g(X)]E[h(Y)]\)
- Independence and variances
  - Always true: \(var(aX) = a^2var(X)\) and \(var(X+a) = var(X)\)
  - In general: \(var(X+Y) \neq var(X) + var(Y)\)
  - If X, Y are independent: \(var(X+Y) = var(X) + var(Y)\)
5 Continuous random variables
5.1 Probability density functions (PDFs)
5.1.1 Definition
\[ P(a \leq X \leq b) = \int_a^b f_X(x)dx \]
PDFs are not probabilities; their units are probability per unit length.
Continuous random variable: a random variable is continuous if it can be described by a PDF.
- \(P(X = a) = 0\)
- \(f_X(x) \geq 0\)
- \(\int_{-\infty}^{+\infty}f_X(x)dx = 1\)
Expectation/Mean
Expectation/mean of a continuous random variable: average in a large number of independent repetitions of the experiment
\[ E[X] = \int_{-\infty}^{+\infty}xf_X(x)dx \]
Properties of expectations
- if X ≥ 0, then \(E[X] ≥ 0\)
- if a ≤ X ≤ b, then \(a ≤ E[X] ≤ b\)
- Expected value rule: \(E[g(X)] = \int_{-\infty}^{+\infty} g(x) f_X(x) dx\)
- Linearity: \(E[aX + b] = aE[X] + b\)
Variance
According to the definition of variance, \(var(X) = E[(X - \mu)^2]\), so
\[ var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 f_X(x) dx \]
- Standard deviation: \(\sigma_X = \sqrt{var(X)}\)
- \(var(aX + b) = a^2 var(X)\)
- \(var(X) = E[X^2] - (E[X])^2\)
Summary of Expectation and Variance of continuous random variables
| Random Variables | Formula | E(X) | var(X) | 
|---|---|---|---|
| Uniform | \(f(x) = \frac{1}{b-a}, a ≤ x ≤ b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) | 
| Exponential \( \lambda > 0 \) | \(f(x) = \begin{cases} \lambda e^{-\lambda x}, x ≥ 0 \\ 0, x < 0 \end{cases}\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) | 
5.1.2 Cumulative distribution functions (CDFs)
CDF definition: \(F_X(x) = P(X ≤ x)\)
- Non-decreasing
- \(F_X(x)\) tends to 1 as \(x \to \infty\)
- \(F_X(x)\) tends to 0 as \(x \to -\infty\)
5.1.3 Normal (Gaussian) random variables
- Standard normal \(N(0,1)\): \(f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\)
  - \(E[X] = 0\)
  - \(var(X) = 1\)
- General normal \(N(\mu,\sigma^2)\): \(f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}\), \(\sigma > 0\)
  - \(E[X] = \mu\)
  - \(var(X) = \sigma^2\)
  - As \(\sigma^2\) becomes smaller, the normal distribution becomes narrower.
- Linear functions of a normal random variable
  - Let \(Y = aX + b\), with \(X \sim N(\mu, \sigma^2)\). Then \(E[Y] = a\mu + b\) and \(var(Y) = a^2\sigma^2\)
  - Fact: \(Y \sim N(a\mu + b, a^2\sigma^2)\)
  - Special case a = 0: then Y = b, a degenerate "normal" \(N(b, 0)\)
5.1.4 Calculation of normal probabilities
- Standard normal tables give \(\Phi(y) = F_Y(y) = P(Y \leq y)\) for \(Y \sim N(0,1)\) and y ≥ 0
- Standardizing a random variable: if \(X \sim N(\mu, \sigma^2)\) with \(\sigma^2 > 0\), then \(Y = \frac{X - \mu}{\sigma} \sim N(0,1)\)
5.2 Conditioning on an event: multiple continuous r.v.'s
\[ P( X \in B|A) = \int_B f_{X|A}(x)dx \]
5.2.1 Conditional PDF of X, given that \(X \in A\)
\[ f_{X|X \in A}(x) = \begin{cases} 0, & \text{if } x \notin A \\ \frac{f_X(x)}{P(A)}, & \text{if } x \in A \end{cases} \]
5.2.2 Conditional expectation of X, given an event

5.2.3 Memorylessness of the exponential PDF

5.2.4 Total probability and expectation theorems
- Total probability theorem:
\[ P(B) = P(A_1)P(B|A_1) + \dotsb + P(A_n)P(B|A_n) \]
- For the discrete random variable:
\[ p_X(x) = P(A_1)p_{X|A_1}(x) + \dotsb + P(A_n)p_{X|A_n}(x) \]
- For CDF:
\[ F_X(x) = P(X \leq x) = P(A_1)P(X \leq x | A_1) + \dotsb + P(A_n)P(X \leq x | A_n) \\= P(A_1)F_{X|A_1}(x) + \dotsb + P(A_n)F_{X|A_n}(x) \]
- For the PDF, differentiate the CDF:
\[ f_X(x) = P(A_1)f_{X|A_1}(x) + \dotsb + P(A_n)f_{X|A_n}(x) \]
- Multiplying by x and integrating, we obtain the expectation theorem:
\[ \int xf_X(x)dx = P(A_1) \int xf_{X|A_1}(x)dx + \dotsb + P(A_n) \int xf_{X|A_n}(x)dx \]
\[
E[X] = P(A_1)E[X|A_1] + \dotsb + P(A_n)E[X|A_n]
\]
5.3 Mixed random variables
5.3.1 Mixed distributions
\[
X = \begin{cases} Y, \text{with probability } p \text{ (Y discrete)}\\ Z, \text{with probability } 1-p \text{ (Z continuous)} \end{cases}
\]
- Mixed r.v.'s do not have a PDF or a PMF, but can be described by a CDF and have a well-defined expectation: \[ F_X(x) = p P(Y \leq x) + (1-p) P(Z \leq x) = pF_Y(x) + (1-p)F_Z(x) \] \[ E[X] = p E[Y] + (1-p) E[Z] \]
5.3.2 Joint PDFs

- Joint PDFs are denoted \(f_{X,Y}(x,y)\): probability per unit area. If X = Y, all the probability lies on a line (zero area), so X and Y do not have a joint PDF.
5.3.3 From the joint to the marginal

5.3.4 Joint CDF
\[ F_{X,Y}(x,y) = P(X \leq x, Y \leq y) = \int\limits_{-\infty}^{y} \int\limits_{-\infty}^{x} f_{X,Y}(s,t)dsdt \]
5.4 Conditioning on a random variable and the Bayes rule
5.4.1 Conditional PDFs, given another r.v.
- \(f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}\), defined when \(f_Y(y) > 0\)
  - \(f_{X|Y}(x|y) \geq 0\)
  - Think of the value of Y as fixed at some y; the shape of \(f_{X|Y}(\cdot|y)\) is a slice of the joint PDF
  - Multiplication rule: \[ f_{X,Y}(x,y) = f_Y(y) \cdot f_{X|Y}(x|y) \]
- \(P(X \in A | Y = y) = \int_A f_{X|Y}(x|y)dx\)
5.4.2 Total probability and expectation theorems
- Analogous to the discrete total probability theorem \(p_X(x) = \sum\limits_y p_Y(y)p_{X|Y}(x|y)\), for continuous r.v.'s \[ f_X(x) = \int_{-\infty}^{\infty} f_Y(y)f_{X|Y}(x|y)dy \]
- Analogous to the discrete conditional expectation \(E[X|Y=y] = \sum\limits_x x p_{X|Y}(x|y)\), for continuous r.v.'s \[ E[X|Y=y] = \int_{-\infty}^{\infty} xf_{X|Y}(x|y)dx \]
- Analogous to the discrete total expectation theorem \(E[X] = \sum\limits_y p_Y(y) E[X|Y=y]\), for continuous r.v.'s \[ E[X] = \int_{-\infty}^{\infty} f_Y(y)E[X|Y=y]dy = \int_{-\infty}^{\infty} xf_X(x)dx \]
- Expected value rule: \[ E[g(X)|Y=y] = \int_{-\infty}^{\infty} g(x)f_{X|Y}(x|y)dx \]
5.4.3 Independence
\[
f_{X,Y}(x,y) = f_X(x)f_Y(y), \text{ for all } x \text{ and } y
\]
- Equivalently, \(f_{X|Y}(x|y) = f_X(x)\), for all y with \(f_Y(y) > 0\) and all x
- If X, Y are independent: \[ E[XY] = E[X]E[Y], \quad var(X + Y) = var(X) + var(Y) \] and g(X), h(Y) are then also independent: \(E[g(X)h(Y)] = E[g(X)] \cdot E[h(Y)]\)
5.4.4 The Bayes rule - a theme with variations
- For discrete r.v.'s:
  - \(p_{X|Y}(x|y) = \frac{p_X(x) p_{Y|X}(y|x)}{p_Y(y)}\)
  - \(p_Y(y) = \sum\limits_{x'} p_X(x')p_{Y|X}(y|x')\)
- For continuous r.v.'s:
  - \(f_{X|Y}(x|y) = \frac{f_X(x) f_{Y|X}(y|x)}{f_Y(y)}\)
  - \(f_Y(y) = \int f_X(x')f_{Y|X}(y|x')dx'\)
- One discrete and one continuous r.v.
Unit 6 Further topics on random variables
6.1 Derived distributions
6.1.1 A linear function \(Y = aX + b\)
- Discrete r.v.: \( p_Y(y) = p_X(\frac{y-b}{a}) \)
- Continuous r.v.: \( f_Y(y) = \frac{1}{|a|}f_X(\frac{y-b}{a}) \)
  - A linear function of a normal r.v. is normal: if \(X \sim N(\mu, \sigma^2)\), then \(aX + b \sim N(a\mu + b, a^2\sigma^2)\)
6.1.2 A general function \(g(X)\) of a continuous r.v.
Two-step procedure:
- Find the CDF of Y: \(F_Y(y) = P(Y \leq y) = P(g(X) \leq y)\), and the valid range of y
- Differentiate: \(f_Y(y) = \frac{dF_Y(y)}{dy}\)
- A general formula for the PDF of \(Y = g(X)\) when g is monotonic: \[ f_Y(y) = f_X(h(y))\left|\frac{dh(y)}{dy}\right| \] where \(x = h(y)\) is the inverse function of \(y = g(x)\)
- A nonmonotonic example, \(Y = X^2\):
  - the discrete case: \(p_Y(y) = p_X(\sqrt{y}) + p_X(-\sqrt{y})\)
  - the continuous case: \(f_Y(y) = f_X(\sqrt{y})\frac{1}{2\sqrt{y}} + f_X(-\sqrt{y})\frac{1}{2\sqrt{y}}\)
- A function of multiple r.v.'s: \(Z = g(X,Y)\)
6.2 Sums of independent random variables
6.2.1 The distribution of \(X + Y\): the discrete case
Z = X + Y; X, Y independent, discrete, with known PMFs
\[
p_Z(z) = \sum\limits_x p_X(x)p_Y(z-x)
\]
Discrete convolution mechanics
- Flip the PMF of Y and put it underneath the PMF of X
- Shift the flipped PMF by z
- Cross-multiply and add
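These mechanics amount to the convolution sum above; a minimal sketch (the four-sided die is a hypothetical example):

```python
from fractions import Fraction

def convolve_pmfs(pX, pY):
    """PMF of Z = X + Y for independent X, Y given as {value: probability} dicts."""
    pZ = {}
    for x, px in pX.items():
        for y, py in pY.items():
            pZ[x + y] = pZ.get(x + y, 0) + px * py
    return pZ

# Example: sum of two independent fair four-sided dice.
die = {k: Fraction(1, 4) for k in range(1, 5)}
pZ = convolve_pmfs(die, die)
print(dict(sorted(pZ.items())))
```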
6.2.2 The distribution of \(X + Y\): the continuous case
Z = X + Y; X, Y independent, continuous, with known PDFs
\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(x)f_Y(z-x)dx
\]
- Conditional on \(X = x\): \[ f_{Z|X}(z|x) = f_Y(z-x) \] which can then be used to calculate the joint PDF of Z and X, and the marginal PDF of Z
- Same mechanics as in the discrete case
6.2.3 The sum of independent normal r.v.'s
- \(X \sim N(\mu_x, \sigma_x^2)\) and \(Y \sim N(\mu_y, \sigma_y^2)\), independent. Then \(Z = X + Y \sim N(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2)\). The sum of finitely many independent normals is normal.
6.3 Covariance
6.3.1 Definition
\[
cov(X,Y) = E[(X - E[X]) \cdot (Y - E[Y])]
\]
- If \(X,Y\) are independent: \(cov(X,Y) = 0\)
The converse is not true!
6.3.2 Covariance properties
- \(cov(X,X) = var(X) = E[X^2] - (E[X])^2\)
- \(cov(aX+b,Y) = a \cdot cov(X,Y)\)
- \(cov(X,Y+Z) = cov(X,Y) + cov(X,Z)\)
Practical covariance formula:
\[ cov(X,Y) = E[XY] - E[X]E[Y] \]
6.3.3 The variance of a sum of random variables
- Two r.v.'s: \[ var(X_1 + X_2) = var(X_1) + var(X_2) + 2cov(X_1,X_2) \] If \(X_1, X_2\) are independent, then \(var(X_1 + X_2) = var(X_1) + var(X_2)\)
- Multiple r.v.'s: \[ var(X_1 + \dots + X_n) = \sum\limits_{i=1}^n var(X_i) + \sum\limits_{(i,j):i \neq j} cov(X_i,X_j) \] where the sum \(\sum\limits_{(i,j):i \neq j}\) contains \(n^2 - n\) terms
6.4 The correlation coefficient
\[ \rho(X,Y) = E\left[\frac{(X - E[X])}{\sigma_X} \cdot \frac{(Y - E[Y])}{\sigma_Y}\right] = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \]
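A small sketch computing cov(X, Y) and ρ(X, Y) directly from a joint PMF (the joint table is hypothetical):

```python
from math import sqrt

# Hypothetical joint PMF on a small grid, as {(x, y): probability}.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
cov = EXY - EX * EY  # practical formula cov(X,Y) = E[XY] - E[X]E[Y]

varX = sum((x - EX) ** 2 * p for (x, y), p in joint.items())
varY = sum((y - EY) ** 2 * p for (x, y), p in joint.items())
rho = cov / sqrt(varX * varY)  # dimensionless, always in [-1, 1]
print(cov, rho)
```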
6.4.1 Interpretation of the correlation coefficient
- Dimensionless version of covariance
- Measure of the degree of "association" between X and Y
- Association does not imply causation or influence
- Correlation often reflects an underlying, common, hidden factor
6.4.2 Key properties of the correlation coefficient
- \(-1 \leq \rho \leq 1\)
- Independence \(\implies \rho = 0\) ("uncorrelated"; the converse is not true)
- \(|\rho| = 1 \Leftrightarrow\) linearly related
- \(cov(aX+b, Y) = a \cdot cov(X,Y) \implies \rho(aX+b,Y) = \mathrm{sign}(a)\,\rho(X,Y)\)
6.5 Conditional expectation and variance as a random variable
6.5.1 Conditional expectation
- Definition: \(g(Y)\) is the random variable that takes the value \(E[X|Y=y]\), if \(Y\) happens to take the value \(y\).
\[
E[X|Y] = g(Y)
\]
- Law of iterated expectations
\[
E[E[X|Y]] = E[g(Y)] = E[X]
\]
6.5.2 Conditional variance
- Variance fundamentals: \[ var(X) = E[(X - E[X])^2] \\ var(X|Y=y) = E[(X - E[X|Y=y])^2|Y=y] \]
  var(X|Y) is the r.v. that takes the value var(X|Y=y) when Y=y
- Law of total variance: \[ var(X) = E[var(X|Y)] + var(E[X|Y]) \] var(X) = (average variability within sections) + (variability between sections)
6.6 Sum of a random number of independent r.v.'s
Shopping example
- N: number of stores visited (N is a nonnegative integer r.v.)
- \(X_i\): money spent in store i
  - the \(X_i\) are independent and identically distributed
  - and independent of N
- Let \(Y = X_1 + \dots + X_N\)
6.6.1 Expectation of the sum
Based on the law of iterated expectations:
\[
E[Y] = E[E[Y|N]] = E[N \cdot E[X]] = E[X]E[N]
\]
6.6.2 Variance of the sum
Based on the Law of total variance: \(var(Y) = E[var(Y|N)] + var(E[Y|N])\):
\[ var(Y) = E[N]var(X) + (E[X])^2var(N) \]
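Both formulas can be checked by simulation (a sketch; the distributions of N and the X_i are arbitrary choices):

```python
import random

random.seed(0)

# N uniform on {1, ..., 4}, X_i uniform on [0, 10], all independent.
# Then E[N] = 2.5, var(N) = 1.25, E[X] = 5, var(X) = 100/12, so the formulas
# predict E[Y] = E[N]E[X] = 12.5 and
# var(Y) = E[N]var(X) + (E[X])^2 var(N) = 2.5*100/12 + 25*1.25 = 52.08...
def one_trip():
    n = random.randint(1, 4)
    return sum(random.uniform(0, 10) for _ in range(n))

samples = [one_trip() for _ in range(200_000)]
mean_Y = sum(samples) / len(samples)
var_Y = sum((y - mean_Y) ** 2 for y in samples) / len(samples)
print(mean_Y, var_Y)
```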
Unit 7 Bayesian inference
7.1 Introduction to Bayesian inference
7.1.1 Basic concepts
- Model building versus inferring unobserved variables: \[X = aS + W\] S: signal; W: noise; a: medium (imagine a black box that S goes through, producing output X corrupted by noise W)
  - Model building: known signal S, observe X -> infer a
  - Variable estimation: known a, observe X -> infer S
- Hypothesis testing vs. estimation
  - Hypothesis testing
    - the unknown takes one of a few possible values
    - aim at a small probability of an incorrect decision
  - Estimation
    - numerical unknown(s)
    - aim at an estimate that is "close" to the true but unknown value
7.1.2 The Bayesian inference framework
- Unknown \(\Theta\) - treated as a random variable; prior distribution \(p_{\Theta}\) or \(f_{\Theta}\)
- Observation \(X\) - observation model \(p_{X|\Theta}\) or \(f_{X|\Theta}\)
- Use the appropriate version of the Bayes rule to find the posterior \(p_{\Theta|X}(\cdot | X = x)\) or \(f_{\Theta|X}(\cdot | X = x)\)
- The output of Bayesian inference is the posterior distribution; from it we can report
  - Maximum a posteriori probability (MAP): \(p_{\Theta|X}(\theta^*|x) = \max\limits_{\theta} p_{\Theta|X}(\theta|x)\) or \(f_{\Theta|X}(\theta^*|x) = \max\limits_{\theta} f_{\Theta|X}(\theta|x)\)
  - Conditional expectation \(E[\Theta|X = x]\): the Least Mean Squares (LMS) estimate
  - estimate: \(\hat{\theta} = g(x)\) (a number)
  - estimator: \(\hat{\Theta} = g(X)\) (a random variable)
7.1.3 Four cases
- Discrete \(\Theta\), discrete X - values of \(\Theta\): alternative hypotheses
  \[ p_{\Theta|X}(\theta|x) = \frac{p_{\Theta}(\theta)p_{X|\Theta}(x|\theta)}{p_X(x)} \]
  \[ p_X(x) = \sum\limits_{\theta'}p_{\Theta}(\theta')p_{X|\Theta}(x|\theta') \]
  - conditional probability of error (smallest under the MAP rule): \[ P(\hat{\theta} \neq \Theta|X = x) \]
  - overall probability of error: \[ P(\hat{\Theta} \neq \Theta) = \sum\limits_{x} P(\hat{\Theta} \neq \Theta|X = x)p_X(x) = \sum\limits_{\theta}P(\hat{\Theta} \neq \Theta|\Theta = \theta)p_{\Theta}(\theta) \]
- Discrete \(\Theta\), continuous X \[ p_{\Theta|X}(\theta|x) = \frac{p_{\Theta}(\theta)f_{X|\Theta}(x|\theta)}{f_X(x)} \] \[ f_X(x) = \sum\limits_{\theta'}p_{\Theta}(\theta')f_{X|\Theta}(x|\theta') \]
  - the same expression for the conditional probability of error
  - overall probability of error: \[ P(\hat{\Theta} \neq \Theta) = \int P(\hat{\Theta} \neq \Theta|X = x)f_X(x)dx = \sum\limits_{\theta}P(\hat{\Theta} \neq \Theta|\Theta = \theta)p_{\Theta}(\theta) \]
- Continuous \(\Theta\), discrete X \[ f_{\Theta|X}(\theta|x) = \frac{f_{\Theta}(\theta)p_{X|\Theta}(x|\theta)}{p_X(x)} \] \[ p_X(x) = \int f_{\Theta}(\theta')p_{X|\Theta}(x|\theta')d\theta' \]
  - Example: inferring the unknown bias of a coin, and the Beta distribution
- Continuous \(\Theta\), continuous X \[ f_{\Theta|X}(\theta|x) = \frac{f_{\Theta}(\theta)f_{X|\Theta}(x|\theta)}{f_X(x)} \] \[ f_X(x) = \int f_{\Theta}(\theta')f_{X|\Theta}(x|\theta')d\theta' \]
  - Linear normal models: estimation of a noisy signal
  - Estimating the parameter of a uniform: \(X\) uniform on \([0, \Theta]\), \(\Theta\) uniform on \([0, 1]\)
  - Performance evaluation of an estimator \(\hat{\Theta}\): \(E[(\hat{\Theta} - \Theta)^2|X = x]\) and \(E[(\hat{\Theta} - \Theta)^2]\)
Useful equation:
\[
\int_0^1 \theta^\alpha(1-\theta)^\beta d\theta = \frac{\alpha!\beta!}{(\alpha + \beta + 1)!}
\]
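The identity can be verified numerically for particular α, β (a sketch using a midpoint-rule integral; α = 3, β = 2 are arbitrary):

```python
from math import factorial

# Midpoint-rule approximation of the integral of theta^3 (1-theta)^2 on [0, 1].
alpha, beta = 3, 2
n = 200_000
integral = sum(
    ((k + 0.5) / n) ** alpha * (1 - (k + 0.5) / n) ** beta for k in range(n)
) / n
closed_form = factorial(alpha) * factorial(beta) / factorial(alpha + beta + 1)
print(integral, closed_form)  # both ~ 12/720 = 1/60
```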
7.2 Linear models with normal noise
7.2.1 Recognizing normal PDFs
- Normal distribution \(X \sim N(\mu, \sigma^2)\): \(f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}\)
- \(f_X(x) = c e^{-(\alpha x^2 + \beta x + \gamma)}\) with \(\alpha > 0\) is normal, with mean \(-\beta/2\alpha\) and variance \(1/2\alpha\)
7.2.2 Estimating a normal random variable in the presence of additive normal noise
\(X = \Theta + W\), with \(\Theta, W \sim N(0,1)\), independent
- \(\hat{\theta}_{MAP} = \hat{\theta}_{LMS} = E[\Theta|X = x] = x/2\)
- Even with general means and variances:
  - the posterior is normal
  - the LMS and MAP estimators coincide
  - these estimators are "linear", of the form \(\hat{\Theta} = aX + b\)
7.2.3 The case of multiple observations
\(X_i = \Theta + W_i\), with \(\Theta \sim N(x_0, \sigma_0^2)\), \(W_i \sim N(0, \sigma_i^2)\), and \(\Theta, W_1, \dots, W_n\) independent
- \(\hat{\theta}_{MAP} = \hat{\theta}_{LMS} = E[\Theta|X = x] = \frac{\sum\limits_{i=0}^n\frac{x_i}{\sigma_i^2}}{\sum\limits_{i=0}^n\frac{1}{\sigma_i^2}}\)
- Key conclusions
  - the posterior is normal
  - the LMS and MAP estimates coincide
  - these estimates are "linear", of the form \(\hat{\theta} = a_0 + a_1x_1 + \dots + a_nx_n\)
- Interpretations
  - the estimate \(\hat{\theta}\) is a weighted average of \(x_0\) (the prior mean) and the observations \(x_i\)
  - the weights are determined by the variances
7.2.4 The mean square error
- Performance measures
  - \(E[(\Theta - \hat{\Theta})^2|X = x] = E[(\Theta - \hat{\theta})^2|X = x] = var(\Theta|X = x) = \frac{1}{\sum\limits_{i=0}^n \frac{1}{\sigma_i^2}}\)
  - \(E[(\Theta - \hat{\Theta})^2] = \int E[(\Theta - \hat{\Theta})^2|X = x] f_X(x) dx = \frac{1}{\sum\limits_{i=0}^n \frac{1}{\sigma_i^2}}\)
7.3 Least mean squares (LMS) estimation
7.3.1 In the absence of observations
- Least mean squares formulation: minimize the mean square error (MSE) \(E[(\Theta - \hat{\theta})^2]\); the minimizer is \(\hat{\theta} = E[\Theta]\)
- The resulting MSE is \(E[(\Theta - E[\Theta])^2] = var(\Theta)\)
7.3.2 LMS estimation of \(\Theta\) based on X
- Minimize conditional mean square error: \(E[(\Theta - \hat{\theta})^2|X = x]: \hat{\theta} = E[\Theta|X = x]\)
7.3.3 LMS performance evaluation
- LMS estimate: \(\hat{\theta} = E[\Theta|X=x]\)
- Estimator: \(\hat{\Theta} = E[\Theta|X]\)
- Expected performance, once we have a measurement - conditional mean square error: \(MSE = E[(\Theta - E[\Theta|X=x])^2|X=x] = var(\Theta|X=x)\)
- Expected performance of the design: \(MSE = E[(\Theta - E[\Theta|X])^2] = E[var(\Theta|X)] = \int var(\Theta|X=x) \cdot f_X(x) dx\), the average of the conditional variance
- A good example
7.3.4 Properties of the estimation error in LMS estimation
Given the estimator \(\hat{\Theta} = E[\Theta|X]\) and the error \(\tilde{\Theta} = \hat{\Theta} - \Theta\):
- \(E[\tilde{\Theta}|X=x] = 0\)
- \(cov(\tilde{\Theta},\hat{\Theta}) = 0\)
- \(var(\Theta) = var(\hat{\Theta}) + var(\tilde{\Theta})\)
7.4 Linear least mean squares (LLMS) estimation
Motivation: the conditional expectation \(E[\Theta|X]\) may be hard to compute/implement
7.4.1 LLMS formulation
Consider estimators of \(\Theta\) of the form \(\hat{\Theta} = aX + b\), minimize \(E[(\hat{\Theta} - \Theta)^2] \implies E[(\hat{\Theta} - aX - b)^2] \)
7.4.2 LLMS solution

Minimize \(E[(\hat{\Theta} - \Theta)^2]\), that is \(E[(\Theta - aX - b)^2]\)
\[
\hat{\Theta}_L  = E[\Theta] + \frac{Cov(\Theta,X)}{var(X)}(X - E[X]) = E[\Theta] + \rho \frac{\sigma _\Theta}{\sigma_X}(X - E[X])
\]
\(\rho\): the correlation coefficient
Remarks on the solution and on the error variance
- Only means, variances, and covariances matter (we do not need to know the entire distribution). The resulting error variance is \(E[(\hat{\Theta}_L - \Theta)^2] = (1 - \rho^2)var(\Theta)\)
7.4.3 LLMS with multiple observations
- Consider estimators of the form \(\hat{\Theta} = a_1X_1 + \dots + a_nX_n + b\)
- Minimize \(E[(a_1X_1 + \dots + a_nX_n + b - \Theta)^2]\)
- Solve a linear system in \(b\) and the \(a_i\)
- If \(E[\Theta|X]\) is linear in X, then \(\hat{\Theta}_{LMS} = \hat{\Theta}_{LLMS}\)
- Example: in the linear-normal model, \(\hat{\theta}_{MAP} = \hat{\theta}_{LMS} = E[\Theta|X = x] = \frac{\sum\limits_{i=0}^n\frac{x_i}{\sigma_i^2}}{\sum\limits_{i=0}^n\frac{1}{\sigma_i^2}}\), which is linear in the observations, so \(\hat{\Theta}_{LMS} = E[\Theta|X] = \frac{\frac{x_0}{\sigma_0^2} + \sum\limits_{i=1}^n\frac{X_i}{\sigma_i^2}}{\sum\limits_{i=0}^n\frac{1}{\sigma_i^2}} = \hat{\Theta}_{LLMS}\). For general distributions with the same means and variances, the LLMS estimator is the same.
7.5 Bayesian inference summary

Unit 8 Limit theorems and classical statistics
8.1 Inequalities, convergence, and the Weak Law of Large Numbers
8.1.1 Markov and Chebyshev inequalities
- Markov inequality - "if \(X \geq 0\) and \(E[X]\) is small, then X is unlikely to be very large": \[ P(X \geq a) \leq \frac{E[X]}{a} \text{, for all } a > 0 \text{ and } X \geq 0 \]
- Chebyshev inequality - "if the variance is small, then X is unlikely to be too far from the mean": \[ P(|X - \mu| \geq c) \leq \frac{\sigma^2}{c^2} \text{, for all } c > 0\text{, where } X \text{ has mean } \mu \text{ and variance } \sigma^2 \]
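Both bounds can be illustrated on a concrete PMF (a sketch; the uniform distribution and the thresholds a, c are arbitrary choices):

```python
# X uniform on {0, 1, ..., 9}, so mu = 4.5 and sigma^2 = 8.25.
values = list(range(10))
p = 1 / len(values)
mu = sum(x * p for x in values)
sigma2 = sum((x - mu) ** 2 * p for x in values)

# Markov: P(X >= a) <= E[X]/a, for a > 0 (X >= 0 here)
a = 7
markov_lhs = sum(p for x in values if x >= a)           # ~0.3
markov_rhs = mu / a                                     # ~0.643

# Chebyshev: P(|X - mu| >= c) <= sigma^2/c^2
c = 4
cheb_lhs = sum(p for x in values if abs(x - mu) >= c)   # ~0.2 (x = 0 and x = 9)
cheb_rhs = sigma2 / c**2                                # ~0.516
print(markov_lhs <= markov_rhs, cheb_lhs <= cheb_rhs)   # True True
```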
8.1.2 The Weak Law of Large Numbers (WLLN)
\(X_1, X_2, \dots\) i.i.d., with finite mean \(\mu\) and variance \(\sigma^2\)
\[ \text{Sample mean: } M_n = \frac{X_1 + \dots + X_n}{n} \]
- \(E[M_n] = \mu\)
- \(var(M_n) = \frac{\sigma^2}{n}\)
- WLLN: for every \(\varepsilon > 0\), \[ P(|M_n - \mu| \geq \varepsilon) = P \left( \left| \frac{X_1 + \dots + X_n}{n} - \mu\right| \geq \varepsilon \right) \to 0 \text{, as } n \to \infty \]
- Interpreting the WLLN
  - the sample mean \(M_n\) is unlikely to be far off from the true mean \(\mu\)
  - if the \(X_i\) are indicators of an event \(A\), then \(M_n\) is the empirical frequency of \(A\) and converges (in probability) to \(p = P(A)\)
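The WLLN can be illustrated by simulation (a sketch; the uniform distribution, ε, and the trial counts are arbitrary choices):

```python
import random

random.seed(1)

# Fraction of sample means far from mu shrinks as n grows.
# X_i uniform on [0, 1], so mu = 0.5; epsilon = 0.05.
def deviation_freq(n, eps=0.05, trials=2000):
    far = 0
    for _ in range(trials):
        m = sum(random.random() for _ in range(n)) / n
        if abs(m - 0.5) >= eps:
            far += 1
    return far / trials

freqs = [deviation_freq(n) for n in (10, 100, 1000)]
print(freqs)  # decreasing toward 0
```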
8.1.3 Convergence in probability
Sequence of random variables \(Y_n\), not necessarily independent.
Definition: a sequence \(Y_n\) converges in probability to a number a if, for every \(\varepsilon > 0\),
\[
\lim\limits_{n \to \infty} P(|Y_n - a| \geq \varepsilon) = 0
\]
Almost all of the PMF/PDF of \(Y_n\) eventually gets concentrated arbitrarily close to a.
- Some properties - suppose that \(X_n \to a\) and \(Y_n \to b\) in probability
  - if g is continuous, then \(g(X_n) \to g(a)\)
  - \(X_n + Y_n \to a + b\)
  - \(E[X_n]\) need not converge to a
8.2 The Central Limit Theorem (CLT)

8.2.2 What exactly does the CLT say?
- 
Theory \(Z_n = \frac{S_n - n\mu}{\sqrt{n}\sigma}\), where \(S_n = X_1 + \dots + X_n\), and \(Z \sim N(0,1)\) - 
CDF of \(Z_n\) converges to the standard normal CDF 
- 
results for convergence of PDFs or PMFs (with more assumptions) 
- 
results without assuming that Xi are identically distributed 
- 
results under "weak dependence" 
 In short, suitably modified versions of the CLT apply even when the \(X_i\) are not i.i.d. 
- 
- 
Practice - 
The practice of normal approximations: - 
treat Zn as if it were normal 
- 
treat Sn as if normal: \(N(n\mu, n\sigma^2)\) as \(S_n = \sqrt{n}\sigma Z_n + n\mu\) 
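As a sketch of this recipe (my numbers: \(X_i \sim\) Bernoulli(0.5), \(n = 100\)), treat \(S_n\) as \(N(50, 25)\) and compare against the exact binomial CDF; the added 1/2 is a continuity correction:

```python
import math

def phi(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))        # 50 and 5
approx = phi((55 + 0.5 - mu) / sigma)                # P(S_n <= 55), approximate
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(56))
assert abs(approx - exact) < 0.005                   # both ~ 0.864
```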
 
- 
- 
Can we use the CLT when n is "moderate"? - 
usually, yes 
- 
symmetry and unimodality help 
 
- 
 
- 
8.3 An introduction to classical statistics
8.3.1 Overview
- 
Inference using the Bayes rule: unknown \(\Theta\) and observation \(X\) are both random variables: Find \(p_{\Theta|X}\) 
- 
Classical statistics: unknown constant \(\theta\) - 
Problem types in classical statistics - 
Hypothesis testing: \(H_0: \theta = 1/2 \text{ vs. } H_1: \theta = 3/4\) 
- 
Composite hypotheses: \(H_0: \theta = 1/2 \text{ vs. } H_1: \theta \neq 1/2\) 
- 
Estimation: design an estimator \(\hat{\Theta}\), to keep estimation error \((\hat{\Theta} - \theta)\) small. 
 
- 
 
- 
8.3.2 The sample mean and some terminology
- 
Estimating a mean - 
\(X_1, \dots, X_n\): i.i.d, mean \(\theta\), variance \(\sigma^2\) 
- 
Sample mean \(= \hat{\Theta}_n = M_n = \frac{X_1 + \dots + X_n}{n}\) 
 
- 
- 
Properties and terminology - 
\(E[\hat{\Theta}_n] = \theta\) (unbiased) for all \(\theta\) 
- 
WLLN: \(\hat{\Theta}_n \to \theta\) in probability (consistency), for all \(\theta\) 
- 
Mean square error (MSE): \(E[(\hat{\Theta}_n - \theta)^2] = var(\hat{\Theta}_n) = \frac{\sigma^2}{n}\) 
 
- 
8.3.3 On the mean square error of an estimator
\[
E[(\hat{\Theta} - \theta)^2] = var(\hat{\Theta} - \theta) + (E[\hat{\Theta} - \theta])^2 = var(\hat{\Theta}) + (bias)^2
\]
- 
Sample mean estimator (\(\hat{\Theta}_n = M_n\)): \(MSE = \frac{\sigma^2}{n} + 0\) 
- 
Zero estimator (\(\hat{\Theta} = 0\)): \(MSE = 0 + \theta^2\) 
- 
\(\sqrt{var(\hat{\Theta})}\) is the standard error. The standard error describes the sampling distribution of the estimator, whereas the standard deviation describes the distribution of the underlying sample. 
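A quick Monte Carlo check of the decomposition (my illustrative numbers: normal samples with \(\theta = 2\), \(\sigma = 1\), \(n = 25\)):

```python
import random

# The sample mean is unbiased, so its MSE is pure variance: sigma^2 / n.
# The zero estimator has zero variance, so its MSE is pure bias: theta^2.
random.seed(2)
theta, sigma, n = 2.0, 1.0, 25
sq_errs = []
for _ in range(20_000):
    est = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    sq_errs.append((est - theta) ** 2)
mse_mean = sum(sq_errs) / len(sq_errs)
assert abs(mse_mean - sigma**2 / n) < 0.005   # ~ 0.04
mse_zero = (0.0 - theta) ** 2                  # = 4.0, independent of n
assert mse_zero == 4.0
```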
8.3.4 Confidence intervals (CIs)
A \(1 - \alpha\) confidence interval is an interval \([\hat{\Theta}^-, \hat{\Theta}^+]\) such that, for all \(\theta\),
\[
P(\hat{\Theta}^- \leq \theta \leq \hat{\Theta}^+) \geq 1 - \alpha
\]
- 
CI for the estimation of the mean - 
\(X_1, \dots, X_n\): i.i.d, mean \(\theta\), variance \(\sigma^2\) 
- 
Sample mean \(= \hat{\Theta}_n = M_n = \frac{X_1 + \dots + X_n}{n}\) 
- 
95% CI: \(\Phi(1.96) = 0.975 = 1 - 0.025\) \[ P \left( \frac{|\hat{\Theta}_n - \theta|}{\sigma/\sqrt{n}} \leq 1.96 \right) \approx 0.95 \text{ (CLT) } \implies P \left(\hat{\Theta}_n - \frac{1.96\sigma}{\sqrt{n}} \leq \theta \leq \hat{\Theta}_n + \frac{1.96\sigma}{\sqrt{n}}\right) \approx 0.95 \] 
 
- 
- 
CI for the mean when \(\sigma\) is unknown - 
use upper bound on \(\sigma\) - for \(X_i\) Bernoulli: \(\sigma \leq 1/2\)
 
- 
use ad hoc estimate of \(\sigma\) - for \(X_i\) Bernoulli: \(\sigma = \sqrt{\hat{\Theta}_n(1 - \hat{\Theta}_n)}\)
 
- 
use sample mean estimate of the variance \(\sigma^2 = E[(X_i - \theta)^2] \implies \frac{1}{n} \sum\limits_{i = 1}^n (X_i - \hat{\Theta}_n)^2 \to \sigma^2\) 
 
- 
- 
Two approximations involved here: - 
CLT: approximately normal 
- 
using estimate of \(\sigma\) 
 
- 
- 
correction for second approximation (t-tables) used when n is small. 
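Putting the pieces together, a sketch of the plug-in 95% CI on simulated Bernoulli data (the sample size, seed, and true \(p = 0.3\) are my choices):

```python
import math
import random

# 95% CI for the mean of Bernoulli data, estimating sigma with the
# plug-in estimate sqrt(theta_hat * (1 - theta_hat)).
random.seed(3)
n, p = 1000, 0.3
x = [1 if random.random() < p else 0 for _ in range(n)]
theta_hat = sum(x) / n
sigma_hat = math.sqrt(theta_hat * (1 - theta_hat))
half = 1.96 * sigma_hat / math.sqrt(n)
lo, hi = theta_hat - half, theta_hat + half
# Interval width ~ 2 * 1.96 * 0.46 / sqrt(1000), under 0.06
assert hi - lo < 0.06
```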
8.3.5 Other natural estimators

8.3.6 Maximum Likelihood (ML) estimation
- 
Pick \(\theta\) that "makes the data most likely" \[ \hat{\theta}_{ML} = \arg \max\limits_{\theta} p_X(x;\theta) 
 \]Compare to the maximum a posteriori probability (MAP) estimate of Bayesian inference: \(p_{\Theta|X}(\theta^*|x) = \max\limits_{\theta}p_{\Theta|X}(\theta|x)\) 
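For a concrete instance (my numbers: \(k = 7\) successes in \(n = 20\) Bernoulli trials), the likelihood \(\theta^k(1-\theta)^{n-k}\) is maximized at \(\hat{\theta}_{ML} = k/n\); a coarse grid search agrees:

```python
import math

n, k = 20, 7

def log_lik(theta):
    # Log-likelihood of k successes in n Bernoulli(theta) trials
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Grid search over theta in (0, 1); the log-likelihood is concave,
# so the grid maximizer is the grid point closest to k/n
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=log_lik)
assert abs(theta_ml - k / n) < 1e-9    # argmax = 7/20 = 0.35
```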
Unit 9 The Bernoulli and Poisson processes
9.1 The Bernoulli process
9.1.1 Definition
- 
A sequence of independent Bernoulli trials, \(X_i\) 
- 
At each trial, i: \(P(X_i = 1) = P(\text{success at the ith trial}) = p\) \(P(X_i = 0) = P(\text{failure at the ith trial}) = 1 - p\) 
- 
Properties - 
\(E[X_i] = p\) 
- 
\(var(X_i) = p(1-p)\) 
 
- 
- 
Key assumption - 
Independence 
- 
Time-homogeneity 
 
- 
9.1.2 Stochastic processes
- 
A sequence of random variables \(X_1, X_2, \dots\) 
- 
Sample space: \(\Omega = \text{the set of infinite sequences of 0's and 1's}\) 
9.1.3. Number of successes/arrivals S in n time slots (Binomial distribution)
- 
\(S = X_1 + X_2 + \dots + X_n\) 
- 
\(P(S=k) = \binom{n}{k}p^k(1-p)^{n-k}\), k = 0, 1, 2 .... 
- 
\(E[S] = np\) 
- 
\(var(S) = np(1-p)\) 
9.1.4 Time until the first success/arrival (Geometric distribution)
- 
\(T_1 = \min \{i: X_i=1 \}\) 
- 
\(P(T_1 = k) = (1-p)^{k-1}p\), k = 1, 2, ... 
- 
\(E[T_1] = \frac{1}{p}\) 
- 
\(var(T_1) = \frac{1-p}{p^2}\) 
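A short simulation of the first-arrival time (p = 0.25 is an arbitrary choice) matches \(E[T_1] = 1/p\):

```python
import random

random.seed(4)
p = 0.25

def first_success():
    """Number of Bernoulli(p) trials up to and including the first success."""
    t = 1
    while random.random() >= p:
        t += 1
    return t

mean_t = sum(first_success() for _ in range(50_000)) / 50_000
assert abs(mean_t - 1 / p) < 0.1    # E[T_1] = 4
```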
9.1.5 Independence, memorylessness, and fresh-start properties
- 
Fresh-start after time n (slots), after time T1 
- 
Fresh-start after a random time N - 
N = time of 3rd success 
- 
N = first time that 3 successes in a row have been observed 
 
- 
- 
The process \(X_{N+1}, X_{N+2}\), ... is - 
A Bernoulli process 
- 
independent of N, \(X_1, X_2, \dots, X_N\) 
 as long as N is determined "causally" (without looking ahead at the process) 
- 
9.1.6 Time of the kth success/arrival
- 
\(Y_k\) = time of kth arrival 
- 
\(T_k\) = kth inter-arrival time = \(Y_k - Y_{k-1} \text{, } k \geq 2 \) 
- 
\(Y_k = T_1 + \dots + T_k\) - 
The process starts fresh after time T1 
- 
T2 is independent of T1: Geometric(p) 
- 
\(E[Y_k] = \frac{k}{p}\) 
- 
\(var(Y_k) = \frac{k(1-p)}{p^2}\) 
- 
PMF: \(p_{Y_k}(t) = \binom{t-1}{k-1}p^k(1-p)^{t-k} \text{, } t = k, k+1, \dots\) 
 
- 
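The Pascal PMF and the mean \(k/p\) can both be checked by simulating \(Y_k\) directly (my parameters: \(p = 0.5\), \(k = 3\)):

```python
import math
import random

random.seed(5)
p, k = 0.5, 3

def kth_arrival_time():
    """Number of Bernoulli(p) trials until the kth success."""
    t = successes = 0
    while successes < k:
        t += 1
        successes += random.random() < p
    return t

draws = [kth_arrival_time() for _ in range(100_000)]
mean_y = sum(draws) / len(draws)
assert abs(mean_y - k / p) < 0.05                # E[Y_3] = 6
# Pascal PMF at t = 4: C(3, 2) p^3 (1-p)^1 = 3/16
pascal_at_4 = math.comb(3, 2) * p**3 * (1 - p)
assert abs(draws.count(4) / len(draws) - pascal_at_4) < 0.01
```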
9.1.7 Merging of independent Bernoulli processes
- 
\(X_i\): Bernoulli(p) 
- 
\(Y_i\): Bernoulli(q) 
- 
Merged process: \(Z_i = \max(X_i, Y_i)\), Bernoulli(p + q - pq) 
9.1.8 Splitting of a Bernoulli process

9.1.9 Poisson approximation to binomial
- 
Interesting regime: large n, small p, moderate λ = np 
- 
Number of arrivals S in n slots: \(p_S(k) \to \frac{\lambda^k}{k!}e^{-\lambda}\) (For fixed k = 0, 1...) 
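A numeric check of the approximation in this regime (my numbers: \(n = 1000\), \(p = 0.003\), so \(\lambda = 3\)):

```python
import math

n, p = 1000, 0.003
lam = n * p
for k in range(6):
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k / math.factorial(k) * math.exp(-lam)
    # Each PMF value agrees to roughly three decimal places
    assert abs(binom - poisson) < 0.002
```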
9.2 The Poisson process
9.2.1 Definition
The Poisson process is the continuous-time analog of the Bernoulli process.
- Numbers of arrivals in disjoint time intervals are independent
\(P(k, \tau)\) = Prob. of *k* arrivals in interval of duration \(\tau\)
- 
Small interval probabilities - For VERY small \(\delta\): \[ P(k, \delta) = \begin{cases} 1-\lambda\delta + O(\delta^2) & \quad \text{if } k = 0 \\ \lambda\delta + O(\delta^2) & \quad \text{if } k=1 \\ 0 + O(\delta^2) & \quad \text{if } k>1 \end{cases} \] \[ P(k, \delta) \approx \begin{cases} 1-\lambda\delta & \quad \text{if } k = 0 \\ \lambda\delta & \quad \text{if } k=1 \\ 0 & \quad \text{if } k>1 \end{cases} \] 
9.2.2 The Poisson PMF for the number of arrivals
- 
\(N_{\tau}:\text{ arrivals in }[0, \tau]\) 
- 
\(N_\tau \approx Binomial(n,p)\), \(n = \frac{\tau}{\delta}\), \(p = \lambda\delta + O(\delta^2)\) 
- 
\[ P(k, \tau) = P(N_\tau =k) = \frac{(\lambda\tau)^ke^{-\lambda\tau}}{k!}, \text{k = 0, 1, 2,...} \] 
- 
\(E[N_\tau] \approx np \approx \lambda\tau\) 
- 
\(var(N_\tau) \approx np(1-p) \approx \lambda\tau\) 
9.2.3 The time \(T_1\) until the first arrival
Find the CDF: \(P(T_1 \leq t) = 1 - P(T_1 > t) = 1 - P(0,t) = 1 - e^{-\lambda t}\)
\[
f_{T_1}(t) = \lambda e^{-\lambda t} \text{, for } t \geq 0
\]
9.2.4 The time \(Y_k\) of the kth arrival
Two ways to derive:
- 
Through CDF: \(P(Y_k \leq y) = \sum\limits_{n=k}^{\infty}P(n, y)\) 
- 
More intuitive argument \[ f_{Y_k}(y)\delta \approx P(y \leq Y_k \leq y + \delta) \approx P(k-1, y)\lambda\delta \] 
\[
f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y} }{(k-1)!} \text{, } y \geq 0
\]
9.2.5 Memorylessness and the fresh-start property
- 
If we start watching at time t, we see Poisson process, independent of the history until time t. Then, time until next arrival follows exp(λ) 
- 
Time between first and second arrival, \(T_2 = Y_2 - Y_1\) follows exp(λ) 
- 
Similar for all \(T_k = Y_k - Y_{k-1} \text{, } k \geq 2\) 
- 
\(Y_k = T_1 + \dots + T_k\) is sum of i.i.d. exponentials - 
\(E[Y_k] = \frac{k}{\lambda}\) 
- 
\(var(Y_k) = \frac{k}{\lambda^2}\) 
 
- 
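The Erlang mean and variance fall out of the i.i.d.-exponential representation; a simulation sketch (λ = 2 and k = 5 are my choices):

```python
import random

random.seed(6)
lam, k = 2.0, 5
# Y_k as a sum of k i.i.d. Exponential(lam) interarrival times
draws = [sum(random.expovariate(lam) for _ in range(k))
         for _ in range(100_000)]
mean_y = sum(draws) / len(draws)
var_y = sum((y - mean_y) ** 2 for y in draws) / len(draws)
assert abs(mean_y - k / lam) < 0.02      # E[Y_5] = 2.5
assert abs(var_y - k / lam**2) < 0.05    # var(Y_5) = 1.25
```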
9.2.6 Bernoulli/Poisson relation

|  | Poisson | Bernoulli |
|---|---|---|
| Times of Arrival | Continuous | Discrete | 
| Arrival Rate | λ per unit time | p per trial | 
| PMF of # of arrivals | \[P(k,\tau) = \frac{(\lambda\tau)^ke^{-\lambda\tau}}{k!} \\E[N_\tau] \approx \lambda\tau \\ var(N_\tau) \approx \lambda\tau\] | \[P_S(k) = \binom{n}{k}p^k(1-p)^{(n-k)} \\ \to \frac{\lambda^k}{k!}e^{-\lambda} \\ E[S] = np \\ var(S) = np(1-p) \] | 
| Interarrival Time Distr. | \[f_{T_1}(t) = \lambda e^{-\lambda t}\] Exponential \[E[T_1] = 1/\lambda \\ var(T_1) = 1/\lambda^2\] | \[p_{T_1}(t) = (1-p)^{t-1}p\] Geometric \[E[T_1] = 1/p \\ var(T_1) = \frac{1-p}{p^2}\] | 
| Time to k-th arrival | \[f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}\] Erlang \[E[Y_k] = k/\lambda \\ var(Y_k) = k/\lambda^2\] | \[p_{Y_k}(t) = \binom{t-1}{k-1}p^k(1-p)^{t-k}\] Pascal | 
9.3 More on the Poisson process
9.3.1 The sum of independent Poisson random variables
\[ P(k, \tau) = \frac{(\lambda\tau)^k e^{-\lambda\tau}}{k!} \]
We call it a Poisson random variable with parameter \(\lambda\tau\)
9.3.2 Merging independent Poisson processes
| Process 2 \ Process 1 | 0 \(1 - \lambda_1\delta\) | 1 \(\lambda_1\delta\) | ≥ 2 \(O(\delta^2)\) |
|---|---|---|---|
| 0 \(1 - \lambda_2\delta\) | \((1-\lambda_1\delta)(1-\lambda_2\delta)\) | \(\lambda_1\delta(1-\lambda_2\delta)\) | - | 
| 1 \(\lambda_2\delta\) | \(\lambda_2\delta(1-\lambda_1\delta)\) | \(\lambda_1\lambda_2\delta^2\) | - | 
| ≥ 2 \(O(\delta^2)\) | - | - | - | 
- 
0 Arrivals \(\approx 1 - (\lambda_1 + \lambda_2)\delta\) 
- 
1 Arrivals \(\approx (\lambda_1 + \lambda_2)\delta\) 
- 
≥ 2 Arrivals \(O(\delta^2)\) 
Merging independent Poisson(λ1) and Poisson(λ2) processes results in a Poisson(λ1 + λ2) process
9.3.3 The time the first(last) light bulb burns out - min{X,Y,Z} and max{X,Y,Z} problem
Three lightbulbs have independent lifetimes X, Y, Z exponential(λ)
- 
The expected time until first lightbulb burnout: - 
X, Y, Z: first arrivals in independent Poisson processes 
- 
Merged process: Poisson(3λ) 
- 
min{X, Y, Z}: 1st arrival in merged process \(\to E[\min] = \frac{1}{3\lambda}\) 
 
- 
- 
The expected time until the last lightbulb burns out: - by memorylessness, the merged process runs at rate 3λ until the first burnout, then 2λ until the second, then λ until the last
 \[ E[max] = \frac{1}{3\lambda} + \frac{1}{2\lambda} + \frac{1}{\lambda} \] 
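A direct simulation of \(\max\{X, Y, Z\}\) (with λ = 1, an arbitrary choice) agrees with the interval-by-interval argument:

```python
import random

random.seed(7)
lam = 1.0
# Memorylessness argument: 1/(3 lam) + 1/(2 lam) + 1/lam = 11/6
expected = 1 / (3 * lam) + 1 / (2 * lam) + 1 / lam
sims = [max(random.expovariate(lam) for _ in range(3))
        for _ in range(200_000)]
assert abs(sum(sims) / len(sims) - expected) < 0.02
```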
9.3.4 Splitting of a Poisson process
Split arrivals into two streams using independent coin flips of a coin with bias q
Assume that coin flips are independent from the original Poisson process
- 
Resulting streams are Poisson with rate \(\lambda q, \lambda (1-q)\) 
- 
The split Poisson processes are independent! 
9.3.5 'Random incidence' in the Poisson process
- 
Analysis  
- 
Random incidence "Paradox" is not special to the Poisson process - 
Example: interarrival times, i.i.d., equally likely to be 5 or 10 mins. Then expected value of k-th interarrival time = 7.5 
- 
Show up at a "random time" - 
P(arrival during a 5-minute interarrival interval) = 1/3 
- 
Expected length of the interarrival interval during which you arrive = \(\frac{1}{3} \cdot 5 + \frac{2}{3} \cdot 10 = 25/3 \approx 8.33\) 
 
- 
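The 5-or-10-minute example reduces to length-biased sampling arithmetic:

```python
# Interarrival times are 5 or 10 minutes, each with probability 1/2.
mean_by_index = 0.5 * 5 + 0.5 * 10          # the "kth interval" view: 7.5
# A random arrival lands in a gap with probability proportional to the
# gap's length: 5 / (5 + 10) = 1/3 for a 5-minute gap.
p5 = 5 / (5 + 10)
mean_by_incidence = p5 * 5 + (1 - p5) * 10  # = 25/3 ~ 8.33, longer on average
assert mean_by_index == 7.5
assert abs(mean_by_incidence - 25 / 3) < 1e-12
```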
- 
Sampling method matters - Different sampling methods can give different results - 
Average family size? (3 families with one person, 1 family with 6 persons) - 
look at a random family: \(\frac{3}{4} \cdot 1 + \frac{1}{4} \cdot 6 = 2.25\) 
- 
look at a random person's family: \(\frac{3}{9} \cdot 1 + \frac{6}{9} \cdot 6 \approx 4.33\) 
 
- 
- 
Average bus occupancy? 
- 
Average class size? 
 
- 
 
- 
9.4 Additional theoretical background
9.4.1 Poisson versus normal approximation to the binomial
We have seen that a binomial random variable with parameters n and p can be approximated by a normal random variable (central limit theorem) but also by a Poisson random variable. Are these two facts contradictory? Fortunately not; the two approximations apply to different regimes:
- 
if we fix p and let \(n \to \infty\), we are in the setting where the central limit theorem applies. 
- 
If we let \(n \to \infty\) and \(p \to 0\), while keeping the product np fixed, the Poisson approximation applies. 
- 
If p is very small but np is very large, the two approximations agree: a Poisson random variable with a large mean is itself approximately normal. 
9.4.2 Sums of a binomial and a Poisson-distributed number of Bernoulli r.v.'s
Let \(X_1,X_2,...\) be independent Bernoulli random variables with parameter p, and N be a random variable that takes integer values and is independent of \(X_i, i = 1,2, \dots\) Let \(Y=X_1+X_2+ \dots +X_N\) for positive values of N, and let \(Y =0\) when \(N=0\).
- 
If N is binomial with parameters m and q, then Y is binomial with parameters m and pq.  
- 
If N is Poisson with parameter \(\lambda\), then Y is Poisson with parameter \(\lambda p\).  
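The Poisson case is the "thinning" property; a simulation sketch (λ = 4 and p = 0.3 are my choices) checks that the mean and variance of Y both equal λp, as a Poisson(λp) distribution requires:

```python
import random

random.seed(8)
lam, p = 4.0, 0.3

def poisson_sample(rate):
    """Count rate-1 exponential arrivals in [0, rate] -> Poisson(rate)."""
    t, count = 0.0, 0
    while True:
        t += random.expovariate(1.0)
        if t > rate:
            return count
        count += 1

ys = []
for _ in range(50_000):
    n = poisson_sample(lam)
    ys.append(sum(random.random() < p for _ in range(n)))
mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
assert abs(mean_y - lam * p) < 0.05    # = 1.2
assert abs(var_y - lam * p) < 0.08     # Poisson: variance equals mean
```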
9.4.3 Sums of a geometrically-distributed number of geometric and exponential r.v.'s
Let N be a geometric random variable with parameter q, and let \(X_1, X_2, \dots\) be random variables that are independent and independent of N. Let \(Y=X_1+\dots+X_N\).
- 
If \(X_i\) is geometric with parameter p, then Y is geometric with parameter pq  
- 
If \(X_i\) is exponential with parameter \(\lambda\), then Y is exponential with parameter \(\lambda q\) 