Probability - The Science of Uncertainty and Data (2021)
Use the course to re-build my statistics knowledge.
1 Sample Space and Probability
1.1 Sample space - A set of outcomes
- discrete/finite example
- continuous example
1.2 Probability Axioms
- Nonnegativity: \(P(A) \geq 0\)
- Normalization: \(P(\Omega) = 1\), where \(\Omega\) is the entire sample space
- (Finite) additivity: if A and B are disjoint, then \(P(A \cup B) = P(A) + P(B)\) (to be strengthened later)
1.2.1 Simple consequences of the axioms
- For a sample space consisting of a finite number of disjoint single-element events, \[ P(\{s_1, s_2, \dots, s_n\}) = P(s_1) + P(s_2) + \dots + P(s_n) \]
- If \(A \subset B\), then \(P(A) \leq P(B)\)
- \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
- \(P(A \cup B) \leq P(A) + P(B)\)
1.3 Probability calculations
1.3.1 Uniform Probability Law
- Discrete example: if the sample space consists of n equally likely outcomes (i.e., all single-element events have the same probability), then \[ P(A) = \frac{\text{number of elements of } A}{n} \]
- Continuous example: probability = area
1.3.2 Discrete but infinite sample space
- Sample space: {1, 2, 3, ...}, with \(P(n) = \frac{1}{2^n}\), n = 1, 2, 3, .... Checking that \(P(\Omega) = 1\): \(\frac{1}{2} + \frac{1}{4} + \dots = \sum\limits_{n=1}^\infty \frac{1}{2^n} = \frac{1}{2}\sum\limits_{n=0}^\infty \frac{1}{2^n} = \frac{1}{2}\cdot\frac{1}{1-1/2} = 1\)
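A quick numerical check of this normalization (a sketch in Python; the cutoff values N are arbitrary):

```python
# Partial sums of P(n) = 1/2^n over n = 1..N should approach P(Omega) = 1.
partial_sums = [sum(0.5 ** n for n in range(1, N + 1)) for N in (5, 10, 50)]
print(partial_sums)  # 0.96875, then values ever closer to 1
```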
1.3.3 Countable additivity axiom
Additivity holds only for "countable" sequences of events.
If \(A_1, A_2, A_3, \dots\) is an \(\underline{\text{infinite sequence of disjoint events}}\), then
\[ P(A_1 \cup A_2 \cup \dots) = P(A_1) + P(A_2) + \dots \]
1.4 Mathematical background
1.4.1 Sets - A collection of distinct elements
- finite: e.g. {a, b, c, d}
- infinite: e.g. the reals (R)
- \( \Omega \) - the universal set
- Ø - the empty set
What are the reals?
The reals include the rational numbers (terminating decimals and non-terminating recurring decimals) and the irrational numbers (non-terminating non-recurring decimals).
1.4.2 Unions and intersection
1.4.3 De Morgan's Laws
- \( (S \cap T)^c = S^c \cup T^c \) and \( (S \cup T)^c = S^c \cap T^c \)
- \( (S^c \cap T^c)^c = S \cup T \)
1.4.4 Other important mathematical background
- Sequences and their limits. Sequence: an enumerated collection of objects
- When does a sequence converge?
  - if \(a_i \leq a_{i+1}\) for all i, the sequence either "converges to \(\infty\)" or converges to some real number a
  - if \(|a_i - a| \leq b_i\) for some sequence \(b_i \to 0\), then \(a_i \to a\)
- Infinite series (infinite sums) vs. summation (finite sums): \(\sum\limits_{i=1}^\infty a_i = \lim\limits_{n\to\infty}\sum\limits_{i=1}^n a_i\)
  - if \(a_i \geq 0\): the limit exists (possibly infinite)
  - if the terms \(a_i\) do not all have the same sign: a. the limit may not exist; b. the limit may exist but differ if we sum in a different order; c. Fact: the limit exists and is independent of the order of summation if \(\sum\limits_{i=1}^\infty |a_i| < \infty\)
- Geometric series: \(\sum\limits_{i=0}^\infty a^i = 1 + a + a^2 + \dots = \frac{1}{1-a}\), for \(|a| < 1\)
1.5 Sets
1.5.1 Countable and uncountable infinite sets
- Countable
  - the integers, pairs of positive integers, etc.
  - the rational numbers q, with 0 < q < 1
- Uncountable ("continuous" sets)
  - the interval [0, 1]
  - the reals, the plane, etc.
- How to prove the reals are uncountable: Cantor's diagonalization argument
Unit 2 Conditioning and independence
Refer to Sections 1.3-1.5 in the textbook
2.1 Conditional probability and Bayes' rule
2.1.1 The definition of conditional probability
P(A|B) = "probability of A, given that B occurred"
\[ P(A|B) = \frac{P(A \cap B )}{P(B)} \]
defined only when P(B) > 0
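As a worked example (hypothetical, not from the course): roll two fair dice and condition on the sum being 9; the definition then reduces to counting outcomes.

```python
from fractions import Fraction

# A = "first die shows 6", B = "the sum is 9", over 36 equally likely outcomes.
outcomes = [(i, j) for i in range(1, 7) for j in range(1, 7)]
B = [(i, j) for (i, j) in outcomes if i + j == 9]
A_and_B = [(i, j) for (i, j) in B if i == 6]

P_B = Fraction(len(B), len(outcomes))               # 4/36 = 1/9
P_A_and_B = Fraction(len(A_and_B), len(outcomes))   # 1/36
P_A_given_B = P_A_and_B / P_B                       # definition of conditional probability
print(P_A_given_B)  # 1/4
```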
2.1.2 Conditional probabilities share properties of ordinary probabilities
- \(P(A|B) \geq 0\)
- \(P(\Omega|B) = 1\)
- \(P(B|B) = 1\)
- If \(A \cap C = Ø\), then \(P(A \cup C|B) = P(A|B) + P(C|B)\); additivity also extends to countable sequences of disjoint events (countable additivity axiom)
2.1.3 Models based on conditional probabilities

1. The multiplication rule
\(P(A \cap B) = P(B)P(A|B) = P(A)P(B|A)\)
\(P(A^c \cap B \cap C^c) = P(A^c \cap B) P(C^c|A^c \cap B) = P(A^c) P(B|A^c) P(C^c|A^c \cap B)\)
\(P(A_1 \cap A_2 \cap \dots \cap A_n) = P(A_1) \prod\limits_{i=2}^n P(A_i|A_1 \cap A_2 \cap \dots \cap A_{i-1})\)

2. Total probability theorem

3. Bayes' rule

2.2 Independence
2.2.1 Conditional independence
Independence of two events
- Intuitive "definition": P(B|A) = P(B) - the occurrence of A provides no new information about B
- Definition of independence: \(P(A \cap B) = P(A) \times P(B)\)
Whether two events are disjoint or not tells us nothing about their independence.
Independence of complements
If A and B are independent, then A and \(B^c\) are independent.

Conditioning may affect independence

2.2.2 Independence of a collection of events
- Intuitive "definition": information about some of the events does not change probabilities related to the remaining events
- Definition: events \(A_1, A_2, \dots, A_n\) are called independent if \(P(A_i \cap A_j \cap \dots \cap A_m) = P(A_i)P(A_j)\cdots P(A_m)\) for any distinct indices \(i, j, \dots, m\)
Pairwise independence
n = 3:
\(P(A_1 \cap A_2) = P(A_1)P(A_2)\)
\(P(A_1 \cap A_3) = P(A_1)P(A_3)\)
\(P(A_2 \cap A_3) = P(A_2)P(A_3)\)
vs. 3-way independence
\(P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3)\)
Independence vs. pairwise independence
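The standard illustration that pairwise independence does not imply mutual independence can be checked by brute force (a sketch; the events come from the usual two-coin example):

```python
from fractions import Fraction
from itertools import product

# Two independent fair coin tosses:
# A1 = "first toss is H", A2 = "second toss is H", A3 = "the two tosses agree".
outcomes = list(product("HT", repeat=2))  # 4 equally likely outcomes

def prob(event):
    return Fraction(sum(event(w) for w in outcomes), len(outcomes))

A1 = lambda w: w[0] == "H"
A2 = lambda w: w[1] == "H"
A3 = lambda w: w[0] == w[1]

# Every pair is independent ...
pairwise = all(
    prob(lambda w: E(w) and F(w)) == prob(E) * prob(F)
    for E, F in [(A1, A2), (A1, A3), (A2, A3)]
)
# ... but the three events are not mutually independent:
triple = prob(lambda w: A1(w) and A2(w) and A3(w))  # 1/4
product_of_three = prob(A1) * prob(A2) * prob(A3)   # 1/8
print(pairwise, triple, product_of_three)
```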

2.2.3 Reliability

Unit 3 Counting
3.1 Basic counting principle
r stages, with \(n_i\) choices at stage i, give a total of \( n_1 \cdot n_2 \cdots n_r \) possible choices
3.2 Permutation
- Permutation: the number of ways of ordering n elements (no repetition allowed)
\[n * (n-1) * (n-2) * ... * 2 * 1 = n!\]
- Number of subsets of {1, 2, ...n} = \(2^n\)
3.3 Combinations
- Combinations \(\binom{n}{k}\): the number of k-element subsets of a given n-element set
How is the combination formula derived? There are two ways of constructing an ordered sequence of k distinct items:
- choose the k items one at a time: \[ n(n-1)\cdots(n-k+1) = \frac{n!}{(n-k)!} \]
- choose k items, then order them: \[ \binom{n}{k}k! \]
Equating the two counts, \[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]
3.4 Binomial coefficient
- Binomial coefficient \(\binom{n}{k}\) and binomial probabilities: toss a coin n times, with independent tosses and P(Heads) = p. \[ P(\text{k heads}) = \binom{n}{k}p^k (1-p)^{n-k} \] The probability of any one particular k-head sequence is \[ p^k (1-p)^{n-k} \] so \(\binom{n}{k}\) counts the number of k-head sequences.
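A small sanity check of the binomial formula (hypothetical n and p):

```python
from math import comb

# Binomial probabilities for n = 4 tosses with P(Heads) = 0.3.
n, p = 4, 0.3
pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
print(pmf)
# The probabilities over k = 0, ..., n must sum to 1:
print(sum(pmf))
```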
3.5 Partitions
- Multinomial coefficient (number of partitions of n items into groups of sizes \(n_1, \dots, n_r\)): \[ \frac{n!}{n_1! n_2! \cdots n_r!} \]
If r = 2, then \(n_1 = k\) and \(n_2 = n - k\), which gives \(\frac{n!}{k!(n-k)!} = \binom{n}{k}\)
- A simple example


4 Discrete random variables
4.1 Probability mass function (PMF)
Random variable (r.v.): a function from the sample space to the real numbers, denoted X.
PMF: the probability distribution of X
\[ p_X(x) = P(X = x) = P(\{\omega \in \Omega \text{ s.t. } X(\omega) = x\}) \]
4.2 Discrete Random variable examples
4.2.1 Bernoulli random variables
with parameter \(p \in [0,1]\)
\[ p_X(x) = \begin{cases} p, & x = 1 \\ 1 - p, & x = 0 \end{cases} \]
- Models a trial that results in success/failure, Heads/Tails, etc.
- Indicator random variable of an event A: \(I_A = 1\) iff A occurs
4.2.2 Uniform random variables
with parameters a, b
- Experiment: pick one of a, a+1, ..., b at random; all equally likely
- Sample space: {a, a+1, ..., b}
- Random variable X: \(X(\omega) = \omega\)
4.2.3 Binomial random variables
with parameters: positive integer n; \(p \in [0,1]\)
- Experiment: n independent tosses of a coin with P(Heads) = p
- Sample space: set of sequences of H and T, of length n
- Random variable X: number of Heads observed
- Models: the number of successes in a given number of independent trials
\[ p_X(k) = \left(\begin{array}{c} n \\ k \end{array} \right)p^k(1-p)^{n-k}, k = 0, 1 ..., n \]
4.2.4 Geometric random variables
with parameter p: 0 < p ≤ 1
- Experiment: infinitely many independent tosses of a coin; P(Heads) = p
- Random variable X: number of tosses until the first Heads
- Models: waiting times; the number of trials until a success
\[
p_X(k) = P(X = k) = P(T...TH) =(1-p)^{k-1}p, k = 1,2,3...
\]
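A numerical check that this PMF sums to 1 and has mean 1/p (truncating the infinite sum; p = 0.25 is an arbitrary choice):

```python
# Geometric PMF p_X(k) = (1-p)^(k-1) * p, truncated at a large k.
p = 0.25
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 400)]
total = sum(pmf)                                          # should be ~1
mean = sum(k * pk for k, pk in enumerate(pmf, start=1))   # should be ~1/p = 4
print(total, mean)
```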
4.3 Expectation/mean of a random variable
- Definition: \[ E[X] = \sum\limits_{x} xp_X(x) \]
- Interpretation: average in a large number of independent repetitions of the experiment
- Elementary properties
  - If X ≥ 0, then E[X] ≥ 0
  - If a ≤ X ≤ b, then a ≤ E[X] ≤ b
  - If c is a constant, E[c] = c
- The expected value rule: for \(Y = g(X)\), \[ E[Y] = \sum\limits_y yp_Y(y) = E[g(X)] = \sum\limits_x g(x)p_X(x) \]
- Linearity of expectation: \(E[aX+b] = aE[X] + b\)
4.4 Variance - a measure of the spread of a PMF
4.4.1 Definition of variance:
\[ var(X) = E[(X - \mu)^2] = \sum\limits_x (x - \mu)^2 p_X(x) \]
standard deviation: \(\sigma_X = \sqrt{var(X)}\)
4.4.2 Properties of the variance
- Notation: \(\mu = E[X]\)
- \(var(aX + b) = a^2var(X)\)
- A useful formula: \[ var(X) = E[X^2] - (E[X])^2 \]
Summary of Expectation and Variance of Discrete Random Variables
| Random Variable | PMF | E[X] | var(X) |
|---|---|---|---|
| Bernoulli (p) | \(p_X(1) = p,\ p_X(0) = 1-p\) | \(p\) | \(p(1-p)\) |
| Uniform (a, b) | \(p_X(x) = \frac{1}{b-a+1},\ x = a, a+1, \dots, b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)(b-a+2)}{12}\) |
| Binomial (n, p) | \(p_X(k) = \binom{n}{k}p^k(1-p)^{n-k},\ k = 0, 1, \dots, n\) | \(np\) | \(np(1-p)\) |
| Geometric (p) | \(p_X(k) = (1-p)^{k-1}p,\ k = 1, 2, 3, \dots\) | \(\frac{1}{p}\) | \(\frac{1-p}{p^2}\) |
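The table entries can be spot-checked by computing E[X] and var(X) directly from the PMFs (a sketch; the parameter values are arbitrary):

```python
from math import comb

def mean_var(support_probs):
    """E[X] and var(X) computed directly from (value, probability) pairs."""
    m = sum(x * p for x, p in support_probs)
    v = sum((x - m) ** 2 * p for x, p in support_probs)
    return m, v

# Binomial(n=5, p=0.2): table predicts E = np = 1.0, var = np(1-p) = 0.8
n, p = 5, 0.2
binom = [(k, comb(n, k) * p**k * (1 - p)**(n - k)) for k in range(n + 1)]
print(mean_var(binom))

# Discrete uniform on {2, ..., 6}: table predicts E = (a+b)/2 = 4
# and var = (b-a)(b-a+2)/12 = 4*6/12 = 2
a, b = 2, 6
unif = [(x, 1 / (b - a + 1)) for x in range(a, b + 1)]
print(mean_var(unif))
```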
4.5 Conditional PMF and expectation, given an event
4.5.1 Conditional PMFs
\(p_{X|A}(x) = P(X = x|A)\); in particular, given A = {Y = y}:
\[
p_{X|Y}(x|y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}
\]
4.5.2 Conditional PMFs involving more than two random variables
- \(p_{X|Y,Z}(x|y,z) = P(X = x|Y = y, Z = z) = \frac{P(X=x,Y=y,Z=z)}{P(Y=y, Z=z)} = \frac{p_{X,Y,Z}(x,y,z)}{p_{Y,Z}(y,z)}\)
- Multiplication rule: \(p_{X,Y,Z}(x,y,z) = p_X(x)p_{Y|X}(y|x)p_{Z|X,Y}(z|x,y)\)
- Total probability and expectation theorems: \(p_X(x) = P(A_1)p_{X|A_1}(x) + \dots + P(A_n)p_{X|A_n}(x) \implies p_X(x) = \sum\limits_y p_Y(y)p_{X|Y}(x|y)\), and \(E[X] = P(A_1)E[X|A_1] + \dots + P(A_n)E[X|A_n] \implies E[X] = \sum\limits_y p_Y(y) E[X|Y = y]\)
4.6 Multiple random variables and joint PMFs
4.6.1 Joint PMF
\[ p_{X,Y}(x,y) = P(X = x, Y =y) \]
- \(\sum\limits_x \sum\limits_y p_{X,Y}(x,y) = 1\)
- Marginal PMFs: \(p_X(x) = \sum\limits_y p_{X,Y}(x,y)\) and \(p_Y(y) = \sum\limits_x p_{X,Y}(x,y)\)
4.6.2 Functions of multiple random variables
\(Z = g(X,Y)\)
- PMF: \(p_Z(z) = P(Z=z) = P(g(X,Y) = z)\)
- Expected value rule: \(E[g(X,Y)] = \sum\limits_x \sum\limits_y g(x,y) p_{X,Y}(x,y)\)
- Linearity of expectation
  - \(E[aX + b] = aE[X] + b\)
  - \(E[X + Y] = E[X] + E[Y]\)
4.6.3 Independence of multiple random variables
- \(P(X = x \text{ and } Y = y) = P(X = x) \times P(Y = y)\), for all x, y
- Equivalently, \(p_{X|Y}(x|y) = p_X(x)\) and \(p_{Y|X}(y|x) = p_Y(y)\)
- Independence and expectations
  - In general, \(E[g(X,Y)] \neq g(E[X], E[Y])\)
  - If X, Y are independent: \(E[XY] = E[X]E[Y]\); moreover, g(X) and h(Y) are also independent, so \(E[g(X)h(Y)] = E[g(X)]E[h(Y)]\)
- Independence and variances
  - Always true: \(var(aX) = a^2var(X)\) and \(var(X+a) = var(X)\)
  - In general: \(var(X+Y) \neq var(X) + var(Y)\)
  - If X, Y are independent: \(var(X+Y) = var(X) + var(Y)\)
5 Continuous random variables
5.1 Probability density functions (PDFs)
5.1.1 Definition
\[ P(a \leq X \leq b) = \int_a^b f_X(x)dx \]
PDFs are not probabilities; their units are probability per unit length.
Continuous random variable: a random variable is continuous if it can be described by a PDF.
- \(P(X = a) = 0\)
- \(f_X(x) \geq 0\)
- \(\int_{-\infty}^{+\infty}f_X(x)dx = 1\)
Expectation/Mean
Expectation/mean of a continuous random variable: average in a large number of independent repetitions of the experiment
\[ E[X] = \int_{-\infty}^{+\infty}xf_X(x)dx \]
Properties of expectations
- if X ≥ 0, then \(E[X] ≥ 0\)
- if a ≤ X ≤ b, then \(a ≤ E[X] ≤ b\)
- Expected value rule: \(E[g(X)] = \int_{-\infty}^{+\infty} g(x) f_X(x) dx\)
- Linearity: \(E[aX + b] = aE[X] + b\)
Variance
According to the definition of variance, \(var(X) = E[(X - \mu)^2]\), so
\[ var(X) = \int_{-\infty}^{+\infty} (x - \mu)^2 f_X(x) dx \]
- Standard deviation: \(\sigma_X = \sqrt{var(X)}\)
- \(var(aX + b) = a^2 var(X)\)
- \(var(X) = E[X^2] - (E[X])^2\)
Summary of Expectation and Variance of continuous random variables
| Random Variables | Formula | E(X) | var(X) | 
|---|---|---|---|
| Uniform | \(f(x) = \frac{1}{b-a}, a ≤ x ≤ b\) | \(\frac{a+b}{2}\) | \(\frac{(b-a)^2}{12}\) | 
| Exponential \( \lambda > 0 \) | \(f(x) = \begin{cases} \lambda e^{-\lambda x}, x ≥ 0 \\ 0, x < 0 \end{cases}\) | \(\frac{1}{\lambda}\) | \(\frac{1}{\lambda^2}\) | 
5.1.2 Cumulative distribution functions (CDFs)
CDF definition: \(F_X(x) = P(X ≤ x)\)
- Non-decreasing
- \(F_X(x)\) tends to 1 as \(x \to \infty\)
- \(F_X(x)\) tends to 0 as \(x \to -\infty\)
5.1.3 Normal (Gaussian) random variables
- Standard normal \(N(0,1)\): \(f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\)
  - \(E[X] = 0\)
  - \(var(X) = 1\)
- General normal \(N(\mu,\sigma^2)\): \(f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}\), \(\sigma > 0\)
  - \(E[X] = \mu\)
  - \(var(X) = \sigma^2\)
  - As \(\sigma^2\) becomes smaller, the normal distribution becomes narrower.
- Linear functions of a normal random variable
  - Let \(Y = aX + b\), with \(X \sim N(\mu, \sigma^2)\). Then \(E[Y] = a\mu + b\) and \(var(Y) = a^2\sigma^2\)
  - Fact: \(Y \sim N(a\mu + b, a^2\sigma^2)\)
  - Special case a = 0: then Y = b, a degenerate "normal" \(N(b, 0)\)
5.1.4 Calculation of normal probabilities
- Standard normal tables give \(\Phi(y) = F_Y(y) = P(Y \leq y)\) for \(Y \sim N(0,1)\) and y ≥ 0
- Standardizing a random variable: if \(X \sim N(\mu, \sigma^2)\) with \(\sigma^2 > 0\), then \(Y = \frac{X - \mu}{\sigma} \sim N(0,1)\)
5.2 Conditioning on an event: multiple continuous r.v.'s
\[ P( X \in B|A) = \int_B f_{X|A}(x)dx \]
5.2.1 Conditional PDF of X, given that \(X \in A\)
\[ f_{X|X \in A}(x) = \begin{cases} 0, & \text{if } x \notin A \\ \frac{f_X(x)}{P(A)}, & \text{if } x \in A \end{cases} \]
5.2.2 Conditional expectation of X, given an event

5.2.3 Memorylessness of the exponential PDF

5.2.4 Total probability and expectation theorems
- Total probability theorem:
\[ P(B) = P(A_1)P(B|A_1) + \dotsb + P(A_n)P(B|A_n) \]
- For the discrete random variable:
\[ p_X(x) = P(A_1)p_{X|A_1}(x) + \dotsb + P(A_n)p_{X|A_n}(x) \]
- For CDF:
\[ F_X(x) = P(X \leq x) = P(A_1)P(X \leq x | A_1) + \dotsb + P(A_n)P(X \leq x | A_n) \\= P(A_1)F_{X|A_1}(x) + \dotsb + P(A_n)F_{X|A_n}(x) \]
- For the PDF, differentiate the CDF:
\[ f_X(x) = P(A_1)f_{X|A_1}(x) + \dotsb + P(A_n)f_{X|A_n}(x) \]
- Multiplying by x and integrating, we obtain the expectation theorem:
\[ \int xf_X(x)dx = P(A_1) \int xf_{X|A_1}(x)dx + \dotsb + P(A_n) \int xf_{X|A_n}(x)dx \]
\[
E[X] = P(A_1)E[X|A_1] + \dotsb + P(A_n)E[X|A_n]
\]
5.3 Mixed random variables
5.3.1 Mixed distributions
\[
X = \begin{cases} Y, \text{with probability } p \text{ (Y discrete)}\\ Z, \text{with probability } 1-p \text{ (Z continuous)} \end{cases}
\]
- Mixed r.v.'s do not have a PDF or a PMF, but can be described by a CDF and have a well-defined expectation: \[ F_X(x) = p P(Y \leq x) + (1-p) P(Z \leq x) = pF_Y(x) + (1-p)F_Z(x) \] \[ E[X] = p E[Y] + (1-p) E[Z] \]
5.3.2 Joint PDFs

- Joint PDFs are denoted \(f_{X,Y}(x,y)\): probability per unit area. If X = Y, all the probability lies on a line (zero area), so X and Y do not have a joint PDF.
5.3.3 From the joint to the marginal

5.3.4 Joint CDF
\[ F_{X,Y}(x,y) = P(X \leq x, Y \leq y) = \int\limits_{-\infty}^{y} \int\limits_{-\infty}^{x} f_{X,Y}(s,t)dsdt \]
5.4 Conditioning on a random variable and the Bayes rule
5.4.1 Conditional PDFs, given another r.v.
- \(f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}\), defined when \(f_Y(y) > 0\)
  - \(f_{X|Y}(x|y) \geq 0\)
  - Think of the value of Y as fixed at some y; the shape of \(f_{X|Y}(\cdot|y)\) is a slice of the joint PDF
  - Multiplication rule: \[ f_{X,Y}(x,y) = f_Y(y) \cdot f_{X|Y}(x|y) \]
- \(P(X \in A | Y = y) = \int_A f_{X|Y}(x|y)dx\)
5.4.2 Total probability and expectation theorems
- Analogous to the discrete total probability theorem \(p_X(x) = \sum\limits_y p_Y(y)p_{X|Y}(x|y)\), for continuous r.v.'s \[ f_X(x) = \int_{-\infty}^{\infty} f_Y(y)f_{X|Y}(x|y)dy \]
- Analogous to the discrete conditional expectation \(E[X|Y=y] = \sum\limits_x x p_{X|Y}(x|y)\), for continuous r.v.'s \[ E[X|Y=y] = \int_{-\infty}^{\infty} xf_{X|Y}(x|y)dx \]
- Analogous to the discrete total expectation theorem \(E[X] = \sum\limits_y p_Y(y) E[X|Y=y]\), for continuous r.v.'s \[ E[X] = \int_{-\infty}^{\infty} f_Y(y)E[X|Y=y]dy = \int_{-\infty}^{\infty} xf_X(x)dx \]
- Expected value rule: \[ E[g(X)|Y=y] = \int_{-\infty}^{\infty} g(x)f_{X|Y}(x|y)dx \]
5.4.3 Independence
\[
f_{X,Y}(x,y) = f_X(x)f_Y(y), \text{ for all } x \text{ and } y
\]
- Equivalently, \(f_{X|Y}(x|y) = f_X(x)\), for all y with \(f_Y(y) > 0\) and all x
- If X, Y are independent: \[ E[XY] = E[X]E[Y], \quad var(X + Y) = var(X) + var(Y) \] and g(X), h(Y) are then also independent: \(E[g(X)h(Y)] = E[g(X)] \cdot E[h(Y)]\)
5.4.4 The Bayes rule - a theme with variations
- For discrete r.v.'s:
  - \(p_{X|Y}(x|y) = \frac{p_X(x) p_{Y|X}(y|x)}{p_Y(y)}\)
  - \(p_Y(y) = \sum\limits_{x'} p_X(x')p_{Y|X}(y|x')\)
- For continuous r.v.'s:
  - \(f_{X|Y}(x|y) = \frac{f_X(x) f_{Y|X}(y|x)}{f_Y(y)}\)
  - \(f_Y(y) = \int f_X(x')f_{Y|X}(y|x')dx'\)
- One discrete and one continuous r.v.
Unit 6 Further topics on random variables
6.1 Derived distributions
6.1.1 A linear function \(Y = aX + b\)
- Discrete r.v.: \( p_Y(y) = p_X(\frac{y-b}{a}) \)
- Continuous r.v.: \( f_Y(y) = \frac{1}{|a|}f_X(\frac{y-b}{a}) \)
  - A linear function of a normal r.v. is normal: if \(X \sim N(\mu, \sigma^2)\), then \(aX + b \sim N(a\mu + b, a^2\sigma^2)\)
6.1.2 A general function \(g(X)\) of a continuous r.v.
Two-step procedure:
- Find the CDF of Y: \(F_Y(y) = P(Y \leq y) = P(g(X) \leq y)\), and the valid range of y
- Differentiate: \(f_Y(y) = \frac{dF_Y(y)}{dy}\)
- A general formula for the PDF of \(Y = g(X)\) when g is monotonic: \[ f_Y(y) = f_X(h(y))\left|\frac{dh(y)}{dy}\right| \] where \(x = h(y)\) is the inverse function of \(y = g(x)\)
- A nonmonotonic example, \(Y = X^2\):
  - the discrete case: \(p_Y(y) = p_X(\sqrt{y}) + p_X(-\sqrt{y})\)
  - the continuous case: \(f_Y(y) = f_X(\sqrt{y})\frac{1}{2\sqrt{y}} + f_X(-\sqrt{y})\frac{1}{2\sqrt{y}}\)
- A function of multiple r.v.'s: \(Z = g(X,Y)\)
6.2 Sums of independent random variables
6.2.1 The distribution of \(X + Y\): the discrete case
Z = X + Y; X, Y independent, discrete, with known PMFs
\[
p_Z(z) = \sum\limits_x p_X(x)p_Y(z-x)
\]
Discrete convolution mechanics
- Flip the PMF of Y and put it underneath the PMF of X
- Shift the flipped PMF by z
- Cross-multiply and add
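These mechanics amount to the convolution sum above; a minimal sketch (the four-sided die is a hypothetical example):

```python
from fractions import Fraction

def convolve_pmfs(pX, pY):
    """PMF of Z = X + Y for independent X, Y given as {value: probability} dicts."""
    pZ = {}
    for x, px in pX.items():
        for y, py in pY.items():
            pZ[x + y] = pZ.get(x + y, 0) + px * py
    return pZ

# Example: sum of two independent fair four-sided dice.
die = {k: Fraction(1, 4) for k in range(1, 5)}
pZ = convolve_pmfs(die, die)
print(dict(sorted(pZ.items())))
```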
6.2.2 The distribution of \(X + Y\): the continuous case
Z = X + Y; X, Y independent, continuous, with known PDFs
\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(x)f_Y(z-x)dx
\]
- Conditional on \(X = x\): \[ f_{Z|X}(z|x) = f_Y(z-x) \] which can then be used to calculate the joint PDF of Z and X, and the marginal PDF of Z
- Same mechanics as in the discrete case
6.2.3 The sum of independent normal r.v.'s
- \(X \sim N(\mu_x, \sigma_x^2)\) and \(Y \sim N(\mu_y, \sigma_y^2)\), independent. Then \(Z = X + Y \sim N(\mu_x + \mu_y, \sigma_x^2 + \sigma_y^2)\). The sum of finitely many independent normals is normal.
6.3 Covariance
6.3.1 Definition
\[
cov(X,Y) = E[(X - E[X]) \cdot (Y - E[Y])]
\]
- If \(X,Y\) are independent: \(cov(X,Y) = 0\)
The converse is not true!
6.3.2 Covariance properties
- \(cov(X,X) = var(X) = E[X^2] - (E[X])^2\)
- \(cov(aX+b,Y) = a \cdot cov(X,Y)\)
- \(cov(X,Y+Z) = cov(X,Y) + cov(X,Z)\)
Practical covariance formula:
\[ cov(X,Y) = E[XY] - E[X]E[Y] \]
6.3.3 The variance of a sum of random variables
- Two r.v.'s: \[ var(X_1 + X_2) = var(X_1) + var(X_2) + 2cov(X_1,X_2) \] If \(X_1, X_2\) are independent, then \(var(X_1 + X_2) = var(X_1) + var(X_2)\)
- Multiple r.v.'s: \[ var(X_1 + \dots + X_n) = \sum\limits_{i=1}^n var(X_i) + \sum\limits_{(i,j):i \neq j} cov(X_i,X_j) \] where the sum \(\sum\limits_{(i,j):i \neq j}\) contains \(n^2 - n\) terms
6.4 The correlation coefficient
\[ \rho(X,Y) = E\left[\frac{(X - E[X])}{\sigma_X} \cdot \frac{(Y - E[Y])}{\sigma_Y}\right] = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \]
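A small sketch computing cov(X, Y) and ρ(X, Y) directly from a joint PMF (the joint table is hypothetical):

```python
from math import sqrt

# Hypothetical joint PMF on a small grid, as {(x, y): probability}.
joint = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.5}

EX = sum(x * p for (x, y), p in joint.items())
EY = sum(y * p for (x, y), p in joint.items())
EXY = sum(x * y * p for (x, y), p in joint.items())
cov = EXY - EX * EY  # practical formula cov(X,Y) = E[XY] - E[X]E[Y]

varX = sum((x - EX) ** 2 * p for (x, y), p in joint.items())
varY = sum((y - EY) ** 2 * p for (x, y), p in joint.items())
rho = cov / sqrt(varX * varY)  # dimensionless, always in [-1, 1]
print(cov, rho)
```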
6.4.1 Interpretation of the correlation coefficient
- Dimensionless version of covariance
- Measure of the degree of "association" between X and Y
- Association does not imply causation or influence
- Correlation often reflects an underlying, common, hidden factor
6.4.2 Key properties of the correlation coefficient
- \(-1 \leq \rho \leq 1\)
- Independence \(\implies \rho = 0\) ("uncorrelated"; the converse is not true)
- \(|\rho| = 1 \Leftrightarrow\) linearly related
- \(cov(aX+b, Y) = a \cdot cov(X,Y) \implies \rho(aX+b,Y) = \mathrm{sign}(a)\,\rho(X,Y)\)
6.5 Conditional expectation and variance as a random variable
6.5.1 Conditional expectation
- Definition: \(g(Y)\) is the random variable that takes the value \(E[X|Y=y]\), if \(Y\) happens to take the value \(y\).
\[
E[X|Y] = g(Y)
\]
- Law of iterated expectations
\[
E[E[X|Y]] = E[g(Y)] = E[X]
\]
6.5.2 Conditional variance
- Variance fundamentals: \[ var(X) = E[(X - E[X])^2] \\ var(X|Y=y) = E[(X - E[X|Y=y])^2|Y=y] \]
  var(X|Y) is the r.v. that takes the value var(X|Y=y) when Y=y
- Law of total variance: \[ var(X) = E[var(X|Y)] + var(E[X|Y]) \] var(X) = (average variability within sections) + (variability between sections)
6.6 Sum of a random number of independent r.v.'s
Shopping example
- N: number of stores visited (N is a nonnegative integer r.v.)
- \(X_i\): money spent in store i
  - the \(X_i\) are independent and identically distributed
  - and independent of N
- Let \(Y = X_1 + \dots + X_N\)
6.6.1 Expectation of the sum
Based on the law of iterated expectations:
\[
E[Y] = E[E[Y|N]] = E[N \cdot E[X]] = E[X]E[N]
\]
6.6.2 Variance of the sum
Based on the Law of total variance: \(var(Y) = E[var(Y|N)] + var(E[Y|N])\):
\[ var(Y) = E[N]var(X) + (E[X])^2var(N) \]
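Both formulas can be checked by simulation (a sketch; the distributions of N and the X_i are arbitrary choices):

```python
import random

random.seed(0)

# N uniform on {1, ..., 4}, X_i uniform on [0, 10], all independent.
# Then E[N] = 2.5, var(N) = 1.25, E[X] = 5, var(X) = 100/12, so the formulas
# predict E[Y] = E[N]E[X] = 12.5 and
# var(Y) = E[N]var(X) + (E[X])^2 var(N) = 2.5*100/12 + 25*1.25 = 52.08...
def one_trip():
    n = random.randint(1, 4)
    return sum(random.uniform(0, 10) for _ in range(n))

samples = [one_trip() for _ in range(200_000)]
mean_Y = sum(samples) / len(samples)
var_Y = sum((y - mean_Y) ** 2 for y in samples) / len(samples)
print(mean_Y, var_Y)
```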
Unit 7 Bayesian inference
7.1 Introduction to Bayesian inference
7.1.1 Basic concepts
- Model building versus inferring unobserved variables: \[X = aS + W\] S: signal; W: noise; a: medium (imagine a black box that S goes through, producing output X corrupted by noise W)
  - Model building: known signal S, observe X -> infer a
  - Variable estimation: known a, observe X -> infer S
- Hypothesis testing vs. estimation
  - Hypothesis testing
    - the unknown takes one of a few possible values
    - aim at a small probability of an incorrect decision
  - Estimation
    - numerical unknown(s)
    - aim at an estimate that is "close" to the true but unknown value
7.1.2 The Bayesian inference framework
- Unknown \(\Theta\) - treated as a random variable; prior distribution \(p_{\Theta}\) or \(f_{\Theta}\)
- Observation \(X\) - observation model \(p_{X|\Theta}\) or \(f_{X|\Theta}\)
- Use the appropriate version of the Bayes rule to find the posterior \(p_{\Theta|X}(\cdot | X = x)\) or \(f_{\Theta|X}(\cdot | X = x)\)
- The output of Bayesian inference is the posterior distribution; from it we can report
  - Maximum a posteriori probability (MAP): \(p_{\Theta|X}(\theta^*|x) = \max\limits_{\theta} p_{\Theta|X}(\theta|x)\) or \(f_{\Theta|X}(\theta^*|x) = \max\limits_{\theta} f_{\Theta|X}(\theta|x)\)
  - Conditional expectation \(E[\Theta|X = x]\): the Least Mean Squares (LMS) estimate
  - estimate: \(\hat{\theta} = g(x)\) (a number)
  - estimator: \(\hat{\Theta} = g(X)\) (a random variable)
7.1.3 Four cases
- Discrete \(\Theta\), discrete X - values of \(\Theta\): alternative hypotheses
  \[ p_{\Theta|X}(\theta|x) = \frac{p_{\Theta}(\theta)p_{X|\Theta}(x|\theta)}{p_X(x)} \]
  \[ p_X(x) = \sum\limits_{\theta'}p_{\Theta}(\theta')p_{X|\Theta}(x|\theta') \]
  - conditional probability of error (smallest under the MAP rule): \[ P(\hat{\theta} \neq \Theta|X = x) \]
  - overall probability of error: \[ P(\hat{\Theta} \neq \Theta) = \sum\limits_{x} P(\hat{\Theta} \neq \Theta|X = x)p_X(x) = \sum\limits_{\theta}P(\hat{\Theta} \neq \Theta|\Theta = \theta)p_{\Theta}(\theta) \]
- Discrete \(\Theta\), continuous X \[ p_{\Theta|X}(\theta|x) = \frac{p_{\Theta}(\theta)f_{X|\Theta}(x|\theta)}{f_X(x)} \] \[ f_X(x) = \sum\limits_{\theta'}p_{\Theta}(\theta')f_{X|\Theta}(x|\theta') \]
  - the same expression for the conditional probability of error
  - overall probability of error: \[ P(\hat{\Theta} \neq \Theta) = \int P(\hat{\Theta} \neq \Theta|X = x)f_X(x)dx = \sum\limits_{\theta}P(\hat{\Theta} \neq \Theta|\Theta = \theta)p_{\Theta}(\theta) \]
- Continuous \(\Theta\), discrete X \[ f_{\Theta|X}(\theta|x) = \frac{f_{\Theta}(\theta)p_{X|\Theta}(x|\theta)}{p_X(x)} \] \[ p_X(x) = \int f_{\Theta}(\theta')p_{X|\Theta}(x|\theta')d\theta' \]
  - Example: inferring the unknown bias of a coin, and the Beta distribution
- Continuous \(\Theta\), continuous X \[ f_{\Theta|X}(\theta|x) = \frac{f_{\Theta}(\theta)f_{X|\Theta}(x|\theta)}{f_X(x)} \] \[ f_X(x) = \int f_{\Theta}(\theta')f_{X|\Theta}(x|\theta')d\theta' \]
  - Linear normal models: estimation of a noisy signal
  - Estimating the parameter of a uniform: \(X\) uniform on \([0, \Theta]\), \(\Theta\) uniform on \([0, 1]\)
  - Performance evaluation of an estimator \(\hat{\Theta}\): \(E[(\hat{\Theta} - \Theta)^2|X = x]\) and \(E[(\hat{\Theta} - \Theta)^2]\)
Useful equation:
\[
\int_0^1 \theta^\alpha(1-\theta)^\beta d\theta = \frac{\alpha!\beta!}{(\alpha + \beta + 1)!}
\]
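The identity can be verified numerically for particular α, β (a sketch using a midpoint-rule integral; α = 3, β = 2 are arbitrary):

```python
from math import factorial

# Midpoint-rule approximation of the integral of theta^3 (1-theta)^2 on [0, 1].
alpha, beta = 3, 2
n = 200_000
integral = sum(
    ((k + 0.5) / n) ** alpha * (1 - (k + 0.5) / n) ** beta for k in range(n)
) / n
closed_form = factorial(alpha) * factorial(beta) / factorial(alpha + beta + 1)
print(integral, closed_form)  # both ~ 12/720 = 1/60
```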
7.2 Linear models with normal noise
7.2.1 Recognizing normal PDFs
- Normal distribution \(X \sim N(\mu, \sigma^2)\): \(f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2/2\sigma^2}\)
- \(f_X(x) = c e^{-(\alpha x^2 + \beta x + \gamma)}\) with \(\alpha > 0\) is normal, with mean \(-\beta/2\alpha\) and variance \(1/2\alpha\)
7.2.2 Estimating a normal random variable in the presence of additive normal noise
\(X = \Theta + W\), with \(\Theta, W \sim N(0,1)\), independent
- \(\hat{\theta}_{MAP} = \hat{\theta}_{LMS} = E[\Theta|X = x] = x/2\)
- Even with general means and variances:
  - the posterior is normal
  - the LMS and MAP estimators coincide
  - these estimators are "linear", of the form \(\hat{\Theta} = aX + b\)
7.2.3 The case of multiple observations
\(X_i = \Theta + W_i\), with \(\Theta \sim N(x_0, \sigma_0^2)\), \(W_i \sim N(0, \sigma_i^2)\), and \(\Theta, W_1, \dots, W_n\) independent
- \(\hat{\theta}_{MAP} = \hat{\theta}_{LMS} = E[\Theta|X = x] = \frac{\sum\limits_{i=0}^n\frac{x_i}{\sigma_i^2}}{\sum\limits_{i=0}^n\frac{1}{\sigma_i^2}}\)
- Key conclusions
  - the posterior is normal
  - the LMS and MAP estimates coincide
  - these estimates are "linear", of the form \(\hat{\theta} = a_0 + a_1x_1 + \dots + a_nx_n\)
- Interpretations
  - the estimate \(\hat{\theta}\) is a weighted average of \(x_0\) (the prior mean) and the observations \(x_i\)
  - the weights are determined by the variances
7.2.4 The mean square error
- Performance measures
  - \(E[(\Theta - \hat{\Theta})^2|X = x] = E[(\Theta - \hat{\theta})^2|X = x] = var(\Theta|X = x) = \frac{1}{\sum\limits_{i=0}^n \frac{1}{\sigma_i^2}}\)
  - \(E[(\Theta - \hat{\Theta})^2] = \int E[(\Theta - \hat{\Theta})^2|X = x] f_X(x) dx = \frac{1}{\sum\limits_{i=0}^n \frac{1}{\sigma_i^2}}\)
7.3 Least mean squares (LMS) estimation
7.3.1 In the absence of observations
- Least mean squares formulation: minimize the mean square error (MSE) \(E[(\Theta - \hat{\theta})^2]\); the minimizer is \(\hat{\theta} = E[\Theta]\)
- The resulting MSE is \(E[(\Theta - E[\Theta])^2] = var(\Theta)\)
7.3.2 LMS estimation of \(\Theta\) based on X
- Minimize conditional mean square error: \(E[(\Theta - \hat{\theta})^2|X = x]: \hat{\theta} = E[\Theta|X = x]\)
7.3.3 LMS performance evaluation
- LMS estimate: \(\hat{\theta} = E[\Theta|X=x]\)
- Estimator: \(\hat{\Theta} = E[\Theta|X]\)
- Expected performance, once we have a measurement - conditional mean square error: \(MSE = E[(\Theta - E[\Theta|X=x])^2|X=x] = var(\Theta|X=x)\)
- Expected performance of the design: \(MSE = E[(\Theta - E[\Theta|X])^2] = E[var(\Theta|X)] = \int var(\Theta|X=x) \cdot f_X(x) dx\), the average of the conditional variance
- A good example
7.3.4 Properties of the estimation error in LMS estimation
Given the estimator \(\hat{\Theta} = E[\Theta|X]\) and the error \(\tilde{\Theta} = \hat{\Theta} - \Theta\):
- \(E[\tilde{\Theta}|X=x] = 0\)
- \(cov(\tilde{\Theta},\hat{\Theta}) = 0\)
- \(var(\Theta) = var(\hat{\Theta}) + var(\tilde{\Theta})\)
7.4 Linear least mean squares (LLMS) estimation
Motivation: the conditional expectation \(E[\Theta|X]\) may be hard to compute/implement
7.4.1 LLMS formulation
Consider estimators of \(\Theta\) of the form \(\hat{\Theta} = aX + b\), minimize \(E[(\hat{\Theta} - \Theta)^2] \implies E[(\hat{\Theta} - aX - b)^2] \)
7.4.2 LLMS solution

Minimize \(E[(\hat{\Theta} - \Theta)^2]\), that is \(E[(\Theta - aX - b)^2]\)
\[
\hat{\Theta}_L  = E[\Theta] + \frac{Cov(\Theta,X)}{var(X)}(X - E[X]) = E[\Theta] + \rho \frac{\sigma _\Theta}{\sigma_X}(X - E[X])
\]
\(\rho\): the correlation coefficient
Remarks on the solution and on the error variance
- Only means, variances, and covariances matter (we do not need to know the entire distribution). The resulting error variance is \(E[(\hat{\Theta}_L - \Theta)^2] = (1 - \rho^2)var(\Theta)\)
7.4.3 LLMS with multiple observations
- Consider estimators of the form \(\hat{\Theta} = a_1X_1 + \dots + a_nX_n + b\)
- Minimize \(E[(a_1X_1 + \dots + a_nX_n + b - \Theta)^2]\)
- Solve a linear system in \(b\) and the \(a_i\)
- If \(E[\Theta|X]\) is linear in X, then \(\hat{\Theta}_{LMS} = \hat{\Theta}_{LLMS}\)
- Example: in the linear-normal model, \(\hat{\theta}_{MAP} = \hat{\theta}_{LMS} = E[\Theta|X = x] = \frac{\sum\limits_{i=0}^n\frac{x_i}{\sigma_i^2}}{\sum\limits_{i=0}^n\frac{1}{\sigma_i^2}}\), which is linear in the observations, so \(\hat{\Theta}_{LMS} = E[\Theta|X] = \frac{\frac{x_0}{\sigma_0^2} + \sum\limits_{i=1}^n\frac{X_i}{\sigma_i^2}}{\sum\limits_{i=0}^n\frac{1}{\sigma_i^2}} = \hat{\Theta}_{LLMS}\). For general distributions with the same means and variances, the LLMS estimator is the same.
7.5 Bayesian inference summary

Unit 8 Limit theorems and classical statistics
8.1 Inequalities, convergence, and the Weak Law of Large Numbers
8.1.1 Markov and Chebyshev inequalities
- Markov inequality - "if \(X \geq 0\) and \(E[X]\) is small, then X is unlikely to be very large": \[ P(X \geq a) \leq \frac{E[X]}{a} \text{, for all } a > 0 \text{ and } X \geq 0 \]
- Chebyshev inequality - "if the variance is small, then X is unlikely to be too far from the mean": \[ P(|X - \mu| \geq c) \leq \frac{\sigma^2}{c^2} \text{, for all } c > 0\text{, where } X \text{ has mean } \mu \text{ and variance } \sigma^2 \]
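Both bounds can be illustrated on a concrete PMF (a sketch; the uniform distribution and the thresholds a, c are arbitrary choices):

```python
# X uniform on {0, 1, ..., 9}, so mu = 4.5 and sigma^2 = 8.25.
values = list(range(10))
p = 1 / len(values)
mu = sum(x * p for x in values)
sigma2 = sum((x - mu) ** 2 * p for x in values)

# Markov: P(X >= a) <= E[X]/a, for a > 0 (X >= 0 here)
a = 7
markov_lhs = sum(p for x in values if x >= a)           # ~0.3
markov_rhs = mu / a                                     # ~0.643

# Chebyshev: P(|X - mu| >= c) <= sigma^2/c^2
c = 4
cheb_lhs = sum(p for x in values if abs(x - mu) >= c)   # ~0.2 (x = 0 and x = 9)
cheb_rhs = sigma2 / c**2                                # ~0.516
print(markov_lhs <= markov_rhs, cheb_lhs <= cheb_rhs)   # True True
```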
8.1.2 The Weak Law of Large Numbers (WLLN)
\(X_1, X_2, \dots\) i.i.d., with finite mean \(\mu\) and variance \(\sigma^2\)
\[ \text{Sample mean: } M_n = \frac{X_1 + \dots + X_n}{n} \]
- \(E[M_n] = \mu\)
- \(var(M_n) = \frac{\sigma^2}{n}\)
- WLLN: for every \(\varepsilon > 0\), \[ P(|M_n - \mu| \geq \varepsilon) = P \left( \left| \frac{X_1 + \dots + X_n}{n} - \mu\right| \geq \varepsilon \right) \to 0 \text{, as } n \to \infty \]
- Interpreting the WLLN
  - the sample mean \(M_n\) is unlikely to be far off from the true mean \(\mu\)
  - if the \(X_i\) are indicators of an event \(A\), then \(M_n\) is the empirical frequency of \(A\) and converges (in probability) to \(p = P(A)\)
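The WLLN can be illustrated by simulation (a sketch; the uniform distribution, ε, and the trial counts are arbitrary choices):

```python
import random

random.seed(1)

# Fraction of sample means far from mu shrinks as n grows.
# X_i uniform on [0, 1], so mu = 0.5; epsilon = 0.05.
def deviation_freq(n, eps=0.05, trials=2000):
    far = 0
    for _ in range(trials):
        m = sum(random.random() for _ in range(n)) / n
        if abs(m - 0.5) >= eps:
            far += 1
    return far / trials

freqs = [deviation_freq(n) for n in (10, 100, 1000)]
print(freqs)  # decreasing toward 0
```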
8.1.3 Convergence in probability
Sequence of random variables \(Y_n\), not necessarily independent.
Definition: a sequence \(Y_n\) converges in probability to a number a if, for every \(\varepsilon > 0\),
\[
\lim\limits_{n \to \infty} P(|Y_n - a| \geq \varepsilon) = 0
\]
Almost all of the PMF/PDF of \(Y_n\) eventually gets concentrated arbitrarily close to a.
- Some properties - suppose that \(X_n \to a\) and \(Y_n \to b\) in probability
  - if g is continuous, then \(g(X_n) \to g(a)\)
  - \(X_n + Y_n \to a + b\)
  - \(E[X_n]\) need not converge to a
8.2 The Central Limit Theorem (CLT)

8.2.2 What exactly does the CLT say?
- 
Theory \(Z_n = \frac{S_n - n\mu}{\sqrt{n}\sigma}\), where \(S_n = X_1 + \dots + X_n\), and \(Z \sim N(0,1)\) - 
CDF of \(Z_n\) converges to the standard normal CDF 
- 
results for convergence of PDFs or PMFs (with more assumptions) 
- 
results without assuming that Xi are identically distributed 
- 
results under "weak dependence" 
 In short, suitably modified versions of the CLT apply even when the \(X_i\) are not i.i.d. 
- 
- 
Practice - 
The practice of normal approximations: - 
treat Zn as if it were normal 
- 
treat Sn as if normal: \(N(n\mu, n\sigma^2)\) as \(S_n = \sqrt{n}\sigma Z_n + n\mu\) 
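As a sketch of this recipe (my numbers: \(X_i \sim\) Bernoulli(0.5), \(n = 100\)), treat \(S_n\) as \(N(50, 25)\) and compare against the exact binomial CDF; the added 1/2 is a continuity correction:

```python
import math

def phi(x):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))        # 50 and 5
approx = phi((55 + 0.5 - mu) / sigma)                # P(S_n <= 55), approximate
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(56))
assert abs(approx - exact) < 0.005                   # both ~ 0.864
```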
 
- 
- 
Can we use the CLT when n is "moderate"? - 
usually, yes 
- 
symmetry and unimodality help 
 
- 
 
- 
8.3 An introduction to classical statistics
8.3.1 Overview
- 
Inference using the Bayes rule: unknown \(\Theta\) and observation \(X\) are both random variables: Find \(p_{\Theta|X}\) 
- 
Classical statistics: unknown constant \(\theta\) - 
Problem types in classical statistics - 
Hypothesis testing: \(H_0: \theta = 1/2 \text{ vs. } H_1: \theta = 3/4\) 
- 
Composite hypotheses: \(H_0: \theta = 1/2 \text{ vs. } H_1: \theta \neq 1/2\) 
- 
Estimation: design an estimator \(\hat{\Theta}\), to keep estimation error \((\hat{\Theta} - \theta)\) small. 
 
- 
 
- 
8.3.2 The sample mean and some terminology
- 
Estimating a mean - 
\(X_1, \dots, X_n\): i.i.d, mean \(\theta\), variance \(\sigma^2\) 
- 
Sample mean \(= \hat{\Theta}_n = M_n = \frac{X_1 + \dots + X_n}{n}\) 
 
- 
- 
Properties and terminology - 
\(E[\hat{\Theta}_n] = \theta\) (unbiased) for all \(\theta\) 
- 
WLLN: \(\hat{\Theta}_n \to \theta\) in probability (consistency), for all \(\theta\) 
- 
Mean square error (MSE): \(E[(\hat{\Theta}_n - \theta)^2] = var(\hat{\Theta}_n) = \frac{\sigma^2}{n}\) 
 
- 
8.3.3 On the mean square error of an estimator
\[
E[(\hat{\Theta} - \theta)^2] = var(\hat{\Theta} - \theta) + (E[\hat{\Theta} - \theta])^2 = var(\hat{\Theta}) + (bias)^2
\]
- 
Sample mean estimator (\(\hat{\Theta}_n = M_n\)): \(MSE = \frac{\sigma^2}{n} + 0\) 
- 
Zero estimator (\(\hat{\Theta} = 0\)): \(MSE = 0 + \theta^2\) 
- 
\(\sqrt{var(\hat{\Theta})}\) is the standard error. The standard error describes the sampling distribution of the estimator, whereas the standard deviation describes the distribution of the underlying sample. 
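A quick Monte Carlo check of the decomposition (my illustrative numbers: normal samples with \(\theta = 2\), \(\sigma = 1\), \(n = 25\)):

```python
import random

# The sample mean is unbiased, so its MSE is pure variance: sigma^2 / n.
# The zero estimator has zero variance, so its MSE is pure bias: theta^2.
random.seed(2)
theta, sigma, n = 2.0, 1.0, 25
sq_errs = []
for _ in range(20_000):
    est = sum(random.gauss(theta, sigma) for _ in range(n)) / n
    sq_errs.append((est - theta) ** 2)
mse_mean = sum(sq_errs) / len(sq_errs)
assert abs(mse_mean - sigma**2 / n) < 0.005   # ~ 0.04
mse_zero = (0.0 - theta) ** 2                  # = 4.0, independent of n
assert mse_zero == 4.0
```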
8.3.4 Confidence intervals (CIs)
A \(1 - \alpha\) confidence interval is an interval \([\hat{\Theta}^-, \hat{\Theta}^+]\) such that, for all \(\theta\),
\[
P(\hat{\Theta}^- \leq \theta \leq \hat{\Theta}^+) \geq 1 - \alpha
\]
- 
CI for the estimation of the mean - 
\(X_1, \dots, X_n\): i.i.d, mean \(\theta\), variance \(\sigma^2\) 
- 
Sample mean \(= \hat{\Theta}_n = M_n = \frac{X_1 + \dots + X_n}{n}\) 
- 
95% CI: \(\Phi(1.96) = 0.975 = 1 - 0.025\) \[ P \left( \frac{|\hat{\Theta}_n - \theta|}{\sigma/\sqrt{n}} \leq 1.96 \right) \approx 0.95 \text{ (CLT) } \implies P \left(\hat{\Theta}_n - \frac{1.96\sigma}{\sqrt{n}} \leq \theta \leq \hat{\Theta}_n + \frac{1.96\sigma}{\sqrt{n}}\right) \approx 0.95 \] 
 
- 
- 
CI for the mean when \(\sigma\) is unknown - 
use upper bound on \(\sigma\) - for \(X_i\) Bernoulli: \(\sigma \leq 1/2\)
 
- 
use ad hoc estimate of \(\sigma\) - for \(X_i\) Bernoulli: \(\sigma = \sqrt{\hat{\Theta}_n(1 - \hat{\Theta}_n)}\)
 
- 
use sample mean estimate of the variance \(\sigma^2 = E[(X_i - \theta)^2] \implies \frac{1}{n} \sum\limits_{i = 1}^n (X_i - \hat{\Theta}_n)^2 \to \sigma^2\) 
 
- 
- 
Two approximations involved here: - 
CLT: approximately normal 
- 
using estimate of \(\sigma\) 
 
- 
- 
correction for second approximation (t-tables) used when n is small. 
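Putting the pieces together, a sketch of the plug-in 95% CI on simulated Bernoulli data (the sample size, seed, and true \(p = 0.3\) are my choices):

```python
import math
import random

# 95% CI for the mean of Bernoulli data, estimating sigma with the
# plug-in estimate sqrt(theta_hat * (1 - theta_hat)).
random.seed(3)
n, p = 1000, 0.3
x = [1 if random.random() < p else 0 for _ in range(n)]
theta_hat = sum(x) / n
sigma_hat = math.sqrt(theta_hat * (1 - theta_hat))
half = 1.96 * sigma_hat / math.sqrt(n)
lo, hi = theta_hat - half, theta_hat + half
# Interval width ~ 2 * 1.96 * 0.46 / sqrt(1000), under 0.06
assert hi - lo < 0.06
```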
8.3.5 Other natural estimators

8.3.6 Maximum Likelihood (ML) estimation
- 
Pick \(\theta\) that "makes the data most likely" \[ \hat{\theta}_{ML} = \arg \max\limits_{\theta} p_X(x;\theta) 
 \]Compare to the maximum a posteriori probability (MAP) estimate of Bayesian inference: \(p_{\Theta|X}(\theta^*|x) = \max\limits_{\theta}p_{\Theta|X}(\theta|x)\) 
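For a concrete instance (my numbers: \(k = 7\) successes in \(n = 20\) Bernoulli trials), the likelihood \(\theta^k(1-\theta)^{n-k}\) is maximized at \(\hat{\theta}_{ML} = k/n\); a coarse grid search agrees:

```python
import math

n, k = 20, 7

def log_lik(theta):
    # Log-likelihood of k successes in n Bernoulli(theta) trials
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

# Grid search over theta in (0, 1); the log-likelihood is concave,
# so the grid maximizer is the grid point closest to k/n
grid = [i / 1000 for i in range(1, 1000)]
theta_ml = max(grid, key=log_lik)
assert abs(theta_ml - k / n) < 1e-9    # argmax = 7/20 = 0.35
```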
Unit 9 The Bernoulli and Poisson processes
9.1 The Bernoulli process
9.1.1 Definition
- 
A sequence of independent Bernoulli trials, \(X_i\) 
- 
At each trial, i: \(P(X_i = 1) = P(\text{success at the ith trial}) = p\) \(P(X_i = 0) = P(\text{failure at the ith trial}) = 1 - p\) 
- 
Properties - 
\(E[X_i] = p\) 
- 
\(var(X_i) = p(1-p)\) 
 
- 
- 
Key assumption - 
Independence 
- 
Time-homogeneity 
 
- 
9.1.2 Stochastic processes
- 
A sequence of random variables \(X_1, X_2, \dots\) 
- 
Sample space: \(\Omega = \text{the set of infinite sequences of 0's and 1's}\) 
9.1.3. Number of successes/arrivals S in n time slots (Binomial distribution)
- 
\(S = X_1 + X_2 + \dots + X_n\) 
- 
\(P(S=k) = \binom{n}{k}p^k(1-p)^{n-k}\), k = 0, 1, 2 .... 
- 
\(E[S] = np\) 
- 
\(var(S) = np(1-p)\) 
9.1.4 Time until the first success/arrival (Geometric distribution)
- 
\(T_1 = \min \{i: X_i=1 \}\) 
- 
\(P(T_1 = k) = (1-p)^{k-1}p\), k = 1, 2, ... 
- 
\(E[T_1] = \frac{1}{p}\) 
- 
\(var(T_1) = \frac{1-p}{p^2}\) 
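A short simulation of the first-arrival time (p = 0.25 is an arbitrary choice) matches \(E[T_1] = 1/p\):

```python
import random

random.seed(4)
p = 0.25

def first_success():
    """Number of Bernoulli(p) trials up to and including the first success."""
    t = 1
    while random.random() >= p:
        t += 1
    return t

mean_t = sum(first_success() for _ in range(50_000)) / 50_000
assert abs(mean_t - 1 / p) < 0.1    # E[T_1] = 4
```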
9.1.5 Independence, memorylessness, and fresh-start properties
- 
Fresh-start after time n (slots), after time T1 
- 
Fresh-start after a random time N - 
N = time of 3rd success 
- 
N = first time that 3 successes in a row have been observed 
 
- 
- 
The process \(X_{N+1}, X_{N+2}\), ... is - 
A Bernoulli process 
- 
independent of N, \(X_1, X_2, \dots, X_N\) 
 as long as N is determined "causally" (without looking ahead at the process) 
- 
9.1.6 Time of the kth success/arrival
- 
\(Y_k\) = time of kth arrival 
- 
\(T_k\) = kth inter-arrival time = \(Y_k - Y_{k-1} \text{, } k \geq 2 \) 
- 
\(Y_k = T_1 + \dots + T_k\) - 
The process starts fresh after time T1 
- 
T2 is independent of T1: Geometric(p) 
- 
\(E[Y_k] = \frac{k}{p}\) 
- 
\(var(Y_k) = \frac{k(1-p)}{p^2}\) 
- 
PMF: \(p_{Y_k}(t) = \binom{t-1}{k-1}p^k(1-p)^{t-k} \text{, } t = k, k+1, \dots\) 
 
- 
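The Pascal PMF and the mean \(k/p\) can both be checked by simulating \(Y_k\) directly (my parameters: \(p = 0.5\), \(k = 3\)):

```python
import math
import random

random.seed(5)
p, k = 0.5, 3

def kth_arrival_time():
    """Number of Bernoulli(p) trials until the kth success."""
    t = successes = 0
    while successes < k:
        t += 1
        successes += random.random() < p
    return t

draws = [kth_arrival_time() for _ in range(100_000)]
mean_y = sum(draws) / len(draws)
assert abs(mean_y - k / p) < 0.05                # E[Y_3] = 6
# Pascal PMF at t = 4: C(3, 2) p^3 (1-p)^1 = 3/16
pascal_at_4 = math.comb(3, 2) * p**3 * (1 - p)
assert abs(draws.count(4) / len(draws) - pascal_at_4) < 0.01
```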
9.1.7 Merging of independent Bernoulli processes
- 
\(X_i\): Bernoulli(p) 
- 
\(Y_i\): Bernoulli(q) 
- 
Merged process: \(Z_i = \max(X_i, Y_i)\), Bernoulli(p + q - pq) 
9.1.8 Splitting of a Bernoulli process

9.1.9 Poisson approximation to binomial
- 
Interesting regime: large n, small p, moderate λ = np 
- 
Number of arrivals S in n slots: \(p_S(k) \to \frac{\lambda^k}{k!}e^{-\lambda}\) (For fixed k = 0, 1...) 
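A numeric check of the approximation in this regime (my numbers: \(n = 1000\), \(p = 0.003\), so \(\lambda = 3\)):

```python
import math

n, p = 1000, 0.003
lam = n * p
for k in range(6):
    binom = math.comb(n, k) * p**k * (1 - p)**(n - k)
    poisson = lam**k / math.factorial(k) * math.exp(-lam)
    # Each PMF value agrees to roughly three decimal places
    assert abs(binom - poisson) < 0.002
```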
9.2 The Poisson process
9.2.1 Definition
The Poisson process is the continuous-time analog of the Bernoulli process.
- Numbers of arrivals in disjoint time intervals are independent
\(P(k, \tau)\) = Prob. of *k* arrivals in interval of duration \(\tau\)
- 
Small interval probabilities - For VERY small \(\delta\): \[ P(k, \delta) = \begin{cases} 1-\lambda\delta + O(\delta^2) & \quad \text{if } k = 0 \\ \lambda\delta + O(\delta^2) & \quad \text{if } k=1 \\ 0 + O(\delta^2) & \quad \text{if } k>1 \end{cases} \] \[ P(k, \delta) \approx \begin{cases} 1-\lambda\delta & \quad \text{if } k = 0 \\ \lambda\delta & \quad \text{if } k=1 \\ 0 & \quad \text{if } k>1 \end{cases} \] 
9.2.2 The Poisson PMF for the number of arrivals
- 
\(N_{\tau}:\text{ arrivals in }[0, \tau]\) 
- 
\(N_\tau \approx Binomial(n,p)\), \(n = \frac{\tau}{\delta}\), \(p = \lambda\delta + O(\delta^2)\) 
- 
\[ P(k, \tau) = P(N_\tau =k) = \frac{(\lambda\tau)^ke^{-\lambda\tau}}{k!}, \text{k = 0, 1, 2,...} \] 
- 
\(E[N_\tau] \approx np \approx \lambda\tau\) 
- 
\(var(N_\tau) \approx np(1-p) \approx \lambda\tau\) 
9.2.3 The time \(T_1\) until the first arrival
Find the CDF: \(P(T_1 \leq t) = 1 - P(T_1 > t) = 1 - P(0,t) = 1 - e^{-\lambda t}\)
\[
f_{T_1}(t) = \lambda e^{-\lambda t} \text{, for } t \geq 0
\]
9.2.4 The time \(Y_k\) of the kth arrival
Two ways to derive:
- 
Through CDF: \(P(Y_k \leq y) = \sum\limits_{n=k}^{\infty}P(n, y)\) 
- 
More intuitive argument \[ f_{Y_k}(y)\delta \approx P(y \leq Y_k \leq y + \delta) \approx P(k-1, y)\lambda\delta \] 
\[
f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y} }{(k-1)!} \text{, } y \geq 0
\]
9.2.5 Memorylessness and the fresh-start property
- 
If we start watching at time t, we see Poisson process, independent of the history until time t. Then, time until next arrival follows exp(λ) 
- 
Time between first and second arrival, \(T_2 = Y_2 - Y_1\) follows exp(λ) 
- 
Similar for all \(T_k = Y_k - Y_{k-1} \text{, } k \geq 2\) 
- 
\(Y_k = T_1 + \dots + T_k\) is sum of i.i.d. exponentials - 
\(E[Y_k] = \frac{k}{\lambda}\) 
- 
\(var(Y_k) = \frac{k}{\lambda^2}\) 
 
- 
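The Erlang mean and variance fall out of the i.i.d.-exponential representation; a simulation sketch (λ = 2 and k = 5 are my choices):

```python
import random

random.seed(6)
lam, k = 2.0, 5
# Y_k as a sum of k i.i.d. Exponential(lam) interarrival times
draws = [sum(random.expovariate(lam) for _ in range(k))
         for _ in range(100_000)]
mean_y = sum(draws) / len(draws)
var_y = sum((y - mean_y) ** 2 for y in draws) / len(draws)
assert abs(mean_y - k / lam) < 0.02      # E[Y_5] = 2.5
assert abs(var_y - k / lam**2) < 0.05    # var(Y_5) = 1.25
```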
9.2.6 Bernoulli/Poisson relation

|  | Poisson | Bernoulli |
|---|---|---|
| Times of Arrival | Continuous | Discrete | 
| Arrival Rate | λ per unit time | p per trial | 
| PMF of # of arrivals | \[P(k,\tau) = \frac{(\lambda\tau)^ke^{-\lambda\tau}}{k!} \\E[N_\tau] \approx \lambda\tau \\ var(N_\tau) \approx \lambda\tau\] | \[P_S(k) = \binom{n}{k}p^k(1-p)^{(n-k)} \\ \to \frac{\lambda^k}{k!}e^{-\lambda} \\ E[S] = np \\ var(S) = np(1-p) \] | 
| Interarrival Time Distr. | \[f_{T_1}(t) = \lambda e^{-\lambda t}\] Exponential \[E[T_1] = 1/\lambda \\ var(T_1) = 1/\lambda^2\] | \[p_{T_1}(t) = (1-p)^{t-1}p\] Geometric \[E[T_1] = 1/p \\ var(T_1) = \frac{1-p}{p^2}\] | 
| Time to k-th arrival | \[f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}\] Erlang \[E[Y_k] = k/\lambda \\ var(Y_k) = k/\lambda^2\] | \[p_{Y_k}(t) = \binom{t-1}{k-1}p^k(1-p)^{t-k}\] Pascal | 
9.3 More on the Poisson process
9.3.1 The sum of independent Poisson random variables
\[ P(k, \tau) = \frac{(\lambda\tau)^k e^{-\lambda\tau}}{k!} \]
We call it a Poisson random variable with parameter \(\lambda\tau\)
9.3.2 Merging independent Poisson processes
| Process 2 \ Process 1 | 0 \(1 - \lambda_1\delta\) | 1 \(\lambda_1\delta\) | ≥ 2 \(O(\delta^2)\) |
|---|---|---|---|
| 0 \(1 - \lambda_2\delta\) | \((1-\lambda_1\delta)(1-\lambda_2\delta)\) | \(\lambda_1\delta(1-\lambda_2\delta)\) | - | 
| 1 \(\lambda_2\delta\) | \(\lambda_2\delta(1-\lambda_1\delta)\) | \(\lambda_1\lambda_2\delta^2\) | - | 
| ≥ 2 \(O(\delta^2)\) | - | - | - | 
- 
0 Arrivals \(\approx 1 - (\lambda_1 + \lambda_2)\delta\) 
- 
1 Arrivals \(\approx (\lambda_1 + \lambda_2)\delta\) 
- 
≥ 2 Arrivals \(O(\delta^2)\) 
Merging independent Poisson(λ1) and Poisson(λ2) processes results in a Poisson(λ1 + λ2) process
9.3.3 The time the first(last) light bulb burns out - min{X,Y,Z} and max{X,Y,Z} problem
Three lightbulbs have independent lifetimes X, Y, Z exponential(λ)
- 
The expected time until first lightbulb burnout: - 
X, Y, Z: first arrivals in independent Poisson processes 
- 
Merged process: Poisson(3λ) 
- 
min{X, Y, Z}: 1st arrival in merged process \(\to E[\min] = \frac{1}{3\lambda}\) 
 
- 
- 
The expected time until the last lightbulb burns out: - by memorylessness, the merged process runs at rate 3λ until the first burnout, then 2λ until the second, then λ until the last
 \[ E[max] = \frac{1}{3\lambda} + \frac{1}{2\lambda} + \frac{1}{\lambda} \] 
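A direct simulation of \(\max\{X, Y, Z\}\) (with λ = 1, an arbitrary choice) agrees with the interval-by-interval argument:

```python
import random

random.seed(7)
lam = 1.0
# Memorylessness argument: 1/(3 lam) + 1/(2 lam) + 1/lam = 11/6
expected = 1 / (3 * lam) + 1 / (2 * lam) + 1 / lam
sims = [max(random.expovariate(lam) for _ in range(3))
        for _ in range(200_000)]
assert abs(sum(sims) / len(sims) - expected) < 0.02
```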
9.3.4 Splitting of a Poisson process
Split arrivals into two streams using independent coin flips of a coin with bias q
Assume that coin flips are independent from the original Poisson process
- 
Resulting streams are Poisson with rate \(\lambda q, \lambda (1-q)\) 
- 
The split Poisson processes are independent! 
9.3.5 'Random incidence' in the Poisson process
- 
Analysis  
- 
Random incidence "Paradox" is not special to the Poisson process - 
Example: interarrival times, i.i.d., equally likely to be 5 or 10 mins. Then expected value of k-th interarrival time = 7.5 
- 
Show up at a "random time" - 
P(arrival during a 5-minute interarrival interval) = 1/3 
- 
Expected length of the interarrival interval during which you arrive = \(\frac{1}{3} \cdot 5 + \frac{2}{3} \cdot 10 = 25/3 \approx 8.33\) 
 
- 
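The 5-or-10-minute example reduces to length-biased sampling arithmetic:

```python
# Interarrival times are 5 or 10 minutes, each with probability 1/2.
mean_by_index = 0.5 * 5 + 0.5 * 10          # the "kth interval" view: 7.5
# A random arrival lands in a gap with probability proportional to the
# gap's length: 5 / (5 + 10) = 1/3 for a 5-minute gap.
p5 = 5 / (5 + 10)
mean_by_incidence = p5 * 5 + (1 - p5) * 10  # = 25/3 ~ 8.33, longer on average
assert mean_by_index == 7.5
assert abs(mean_by_incidence - 25 / 3) < 1e-12
```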
- 
Sampling method matters - Different sampling methods can give different results - 
Average family size? (3 families with one person, 1 family with 6 persons) - 
look at a random family: \(\frac{3}{4} \cdot 1 + \frac{1}{4} \cdot 6 = 2.25\) 
- 
look at a random person's family: \(\frac{3}{9} \cdot 1 + \frac{6}{9} \cdot 6 \approx 4.33\) 
 
- 
- 
Average bus occupancy? 
- 
Average class size? 
 
- 
 
- 
9.4 Additional theoretical background
9.4.1 Poisson versus normal approximation to the binomial
We have seen that a binomial random variable with parameters n and p can be approximated by a normal random variable (central limit theorem) but also by a Poisson random variable. Are these two facts contradictory? Fortunately not; the two approximations apply to different regimes:
- 
if we fix p and let \(n \to \infty\), we are in the setting where the central limit theorem applies. 
- 
If we let \(n \to \infty\) and \(p \to 0\), while keeping the product np fixed, the Poisson approximation applies. 
- 
If p is very small but np is very large, the two approximations agree: a Poisson random variable with a large mean is itself approximately normal. 
9.4.2 Sums of a binomial and a Poisson-distributed number of Bernoulli r.v.'s
Let \(X_1,X_2,...\) be independent Bernoulli random variables with parameter p, and N be a random variable that takes integer values and is independent of \(X_i, i = 1,2, \dots\) Let \(Y=X_1+X_2+ \dots +X_N\) for positive values of N, and let \(Y =0\) when \(N=0\).
- 
If N is binomial with parameters m and q, then Y is binomial with parameters m and pq.  
- 
If N is Poisson with parameter \(\lambda\), then Y is Poisson with parameter \(\lambda p\).  
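The Poisson case is the "thinning" property; a simulation sketch (λ = 4 and p = 0.3 are my choices) checks that the mean and variance of Y both equal λp, as a Poisson(λp) distribution requires:

```python
import random

random.seed(8)
lam, p = 4.0, 0.3

def poisson_sample(rate):
    """Count rate-1 exponential arrivals in [0, rate] -> Poisson(rate)."""
    t, count = 0.0, 0
    while True:
        t += random.expovariate(1.0)
        if t > rate:
            return count
        count += 1

ys = []
for _ in range(50_000):
    n = poisson_sample(lam)
    ys.append(sum(random.random() < p for _ in range(n)))
mean_y = sum(ys) / len(ys)
var_y = sum((y - mean_y) ** 2 for y in ys) / len(ys)
assert abs(mean_y - lam * p) < 0.05    # = 1.2
assert abs(var_y - lam * p) < 0.08     # Poisson: variance equals mean
```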
9.4.3 Sums of a geometrically-distributed number of geometric and exponential r.v.'s
Let N be a geometric random variable with parameter q, and let \(X_1, X_2, \dots\) be random variables that are independent and independent of N. Let \(Y=X_1+\dots+X_N\).
- 
If \(X_i\) is geometric with parameter p, then Y is geometric with parameter pq  
- 
If \(X_i\) is exponential with parameter \(\lambda\), then Y is exponential with parameter \(\lambda q\) 