Statistics with Python
1 Introduction to Probability and Statistics
1.1 Probability Theory
Use Python to simulate coin tossing problem,
# Generate the sum of k coin flips, repeat that n times
def generate_counts(k=1000, n=100):
X=2*(np.random.rand(k,n)>0.5)-1 # generate a kXn matrix of +-rando
S=np.sum(X, axis=0)
return S
coins_flip = generate_counts()
# plot a histogram
plt.style.use('ggplot')
plt.hist(coins_flip, 10, range = [-400, 400])
plt.show
np.random.rand(k,n)>0.5
generate a True/False (k,n) matrix, but it transforms to an integer matrix with 2*
-
In most cases, we can approximate probabilities using simulations (Monte-Carlo simulations)
-
However, calculating the probabilities is better because it provides a precise answer and is much faster than Monte-Carlo simulations.
1.2 What is statistics?
Statistics is about analyzing real-world data and drawing conclusions.
The logical of Statistic inference
To answer the question "whether the coin is biased given 570 heads after tossing 1000 times",
-
Suppose that the coin is fair
-
Use probability theory to compute the probability of getting at least 570 (or 507) heads
-
If this probability is very small, then we can reject with confidence the hypothesis that the coin is fair.
Given \(x_i = -1\) for tails and \(x_i = +1\), we looked at the sum \(S_k = \sum_{i=1}^{k} x_i\)
If number of heads = 570, then \(S_{1000} = 140\)
It is known that it is unlikely that \(|S_k| > 4 \sqrt{k}\), that is \(|S_{1000}| > 4 sqrt{1000} \approx 126.5\)
from math import sqrt 4*sqrt(1000)
Therefore, it is very unlikely that the coin is unbiased. -> the coin is probably biased.
1.3 Three card puzzle
Three cards in a hat
Suppose we have three cards in a hat:
-
'R''B' - one card is painted blue on one side and red on the other
-
'R''R' - one card is painted blue on both sides
-
'B''B' - one card is painted red on both sides
I pick one of the three cards at random, flip it to a random side, and place it on the table. If the other side of the card has a different color I pay you $1; if not you pay me $1.
Monte Carlo simulation
red_bck="\x1b[41m%s\x1b[0m"
blue_bck="\x1b[44m%s\x1b[0m"
red=red_bck%'R'
blue=blue_bck%'B'
Cards=[(red,blue),(red,red),(blue,blue)]
counts={'same':0,'different':0}
for j in range(50):
i=int(np.random.rand()*3.) # Generate a random integer in an array [0,1,2] indicating three cards
side=int(np.random.rand()*2.) # Generate either 0 or 1 indicating the color
C=Cards[i]
if(side==1): # select which side to be "up" ('red' is "up")
C=(C[0],C[1]) # two sides of the selected cards
same = 'same' if C[0]==C[1] else 'different' # count the number of times the two sides are the same or different.
counts[same]+=1
print(''.join(C)+' %-9s'%same, end='')
if (j+1)%5==0:
print()
print()
print(counts)
2 Elements, sets and membership
2.1 Basic concepts
Common sets
-
Intergers {..., -2, -1, 0, 1, 2, ...} \(Z\)
-
Naturals {..., 0, 1, 2, 3, ...} \(N\)
-
Positives {1, 2, 3, ...} \(P\)
-
Rationals {interger ratios m/n, \(n \neq 0\)} \(Q\)
-
Reals {...Google...} \(R\)
-
The order and repetition do no matter:
-
{0,1} = {1,0}
-
{0,1,1,1} = {0,1}
-
Special sets
-
Empty set: \(x \notin \varnothing\)
-
Universal set: \(\forall x \in \Omega\)
-
Define a set in python
-
Define a set:
set1={1,2}
orset2=set({2,3})
-
Define an empty set:
set()
orset({})
-
-
Membership -
in
andnot in
-
Test empty -
not
S = set()
not S
#Output: True
- Set size -
len()
2.2 Basic sets
2.2.1 Sets within Sets
{\(x \in A | .... \)} = {elements in A such that}
-
Integer Intervals
\(N = \{ x \in Z | x \geq 0 \}\), \(P = \{ x \in Z | x > 0 \} \)
-
Real intervals
\([a,b] = \{ x \in R | a \leq x \leq b \}\)
\((a,b) = \{ x \in R | a < x < b \}\)
-
Divisibility
Sets of Multiples
\(m \in Z\), \(_m Z = \{ i \in Z : m| i \}\)
Even numbers: \(_2 Z = \{ ..., -4, -2, 0, 2, 4, ... \} = E\)
Python syntax
-
{0,...,n-1}:
range(n)
-
{m,...,n-1}:
range(m,n)
-
{m, m+d, m+2d, ...} < n-1:
range(m, n, d)
print(set(range(3)))
#Output: {0, 1, 2}
print(set(range(2,5)))
#Output: {2, 3, 4}
print(set(range(2,12,3)))
#Output: {2, 5, 8, 11}
#Return type range, but conver to set if print
2.2.2 Visualization - Venn Diagram
import matplotlib.pyplot as plt
import matplotlib_venn as venn
S = {1, 2, 3}
T = {0, 2, -1, 5}
venn.venn2([S, T], set_labels=('S','T'))
plt.show()
#for 3 sets: venn.venn3([S,T,U], set_labels=(’S’,’T’,'U'))
2.3 Relations
2.3.1 Number relations
-
Equality - = or ≠
-
Intersection - two sets share at least one common element
Disjoint - no shared elements
-
Subsets - \(A \subseteq B\)
superset - \(B \supseteq A\)
\[ P \subseteq N \subseteq Z \subseteq Q \subseteq R \]
strict subset - if \(A \subseteq B\) and \(A \neq B\), A is a strict subset of B, denote \(A \subset B\); conversely, A is a strict superset of B, \(B \supset A\)
2.3.2 Belongs to (\( \in \)) vs. Subsets of (\(\subseteq \))
-
\( x \in A \): element x belongsto set A
\( 0 \in \{0,1\} \)
-
(\( A \subseteq B \)): A is a subset of B
\( \{ 0 \} \subseteq \{0,1\} \)
2.3.3 Python syntaxt
-
Check equality and disjoint
==
,!=
,.isjointed()
S1={0,1}; S2=set({0,1}); S3={1,0,1}; T={0,2} # Equality S1 == T #Output: False S1 == S2 S1 == S3 #Output: True # Inequality S1 != S2 # Disjoint S1.isdisjoint(T) S1.isdisjoint({2})
-
Check subsets and supersets
<=
orissubset
for \(\subseteq \) and<
for \(\subset \)>=
orissuperset
for \(\supseteq \)zero = {0}; zplus = {0,1}; zminus = {0, -1} print(zminus <= zplus) #Output: False print(zminus >= zplus) #Output: False zero.issubset(zminus) #Output: True
2.4 Operations
2.4.1 Intersection and complement
-
Commutative: \(A \cap B = B \cap A\), \(A \cup B = B \cap A\)
-
Associative: \((A \cap B) \cap C = A \cap (B \cap C)\), \((A \cup B) \cup C = A \cup (B \cup C)\)
-
Distributive: \(A \cap (B \cup C) = (A \cap B) \cup (A \cap C)\), \(A \cup (B \cap C) = (A \cup B) \cap (A \cup C)\)
-
De Morgan \((A \cap B)^c = A^c \cup B^c\), \((A \cup B)^c = A^c \cap B^c\)
2.4.2 Set Difference A-B
- \(A-B = \{ x: x \in A \wedge x \notin B \} = A \cap B^c\)
-
Symmetric Difference
The symmetric differene of two sets is the set of elements in exactly one set.
\(A bigtriangleup B = \{x: x \in A \wedge x \notin B \vee x \in B \wedge x \notin A \} \)
2.4.3 Python Syntax
Union and Intersection
- Union \(\cup\):
|
orunion
A = {1,2}
B = {2,3}
print(A|B)
C = A.union(B)
print(C)
- Intersection \(\cap\):
&
orintersection
print(A&B)
C = A.intersection(B)
print(C)
Set- and Symmetric-Difference
-
Set difference:
-
ordifference
A = {1,2} B = {2,3} A - B C = B.difference(A) print(C)
-
Symmetric difference:
^
orsymmetric_difference
A^B C = B.symmetric_difference(A) print(C)
2.4.4 Caetesian products
-
Set: Order and repetition do not matter {a,b,c} = {b,a,c}
-
Tuple: Both order and reperition matter (a,b,c) ≠ (b,a,c) and (a,a,a) ≠ (a)
-
n-tuple: Tuple with n elements
-
2-tuple: Ordered pair (a,b)
-
Cartesian products
The cartesian product of A and B is the set AxB of ordered pairs (a,b) where a \(\in A\) and b \(\in B\)
\[
A \times B = \{(a,b): a \in A, b \in B \}
\]
-
\(A \times A\) denotes \(A^2\)
-
\(R^2 = \{(x,y): x,y \in R\} \) - Cartesian Plane
-
\(A, B \subseteq R \) then \(A \times B \subseteq R^2 \) - Rectangle
-
\(A \times B = \{(x,y): x \in [0,2], y \in [1,4] \}\), where A = [0,2] and B = [1,4]
- Discrete sets
- Tables
Tables are Cartesian products
- Cartesian product of 3 sets
A x B - 2D
A x B x C - 3D
- Sequence
Sequence is tuples just without '()' and some times without ','
Cartesian products with Python
from itertools import product
Faces = set({'J', 'Q', 'K'})
Suits = {'♢','♡'}
for i in product(Faces, Suits):
print(i)
2.4.5 Russell's Paradox
3. Counting
3.1 Set Size
3.1.1 Basic concepts
The number of elements in a set S is called its size, or cardinality (基数), denoted |B| or # S.
in Python
-
Size:
len
, i.e.,len({-1, 1})
-
Sum:
sum
, i.e.,sum({-1, 1})
-
minimum:
min
, i.e.,min({-1, 1})
-
maximum:
max
, i.e.,max({-1, 1})
3.1.2 Disjoint
-
Additional rule (for disjoint):
\(A \cap B = \varnothing\): \(|A| + |B| = |A \cup B\)|
-
Subtraction rule (for complement):
\(A \subseteq B \implies B = A \cup (B - A) \implies |B| = |A| + |B - A|\)
3.1.3 General Unions
Principle of Inclusion-Exclusion (PIE)
- Two sets
\[ |A \cup B| = |A| + |B| - |A \cap B | \]
-
Three sets \[ |A \cup B \cup C| = |A| + |B| + |C| - |A \cap B |- |A \cap C| - |B \cap C| + |A \cap B \cap C| \]
-
n sets
3.1.4 Cartesian Products
Product Rule - the size of a Cartesian Product is the product of the set sizes. (multiplication)
\[ |A \times B| = |A| \times |B| \]
3.1.5 Cartesian Powers
Applications:
-
Binary strings: \(\{0,1\}^n = |\{0,1\}|^n = 2^n\)
-
Subsets
The power set of S, \(P(S)\), is the collection of all subsets of S.
\[ P(\{ a, b \}) = \{ \{ \}, \{ a \}, \{ b \}, \{a, b \} \}
\]The size of the power set is the power of the set size.
\[ |P(S)| = |\{0,1\}|^{|S|} = 2^{|S|} \]
\(P(P(S))\) - set of subsets of P(S)
\[ |P(P(S))| = 2^{|P(S)|} = 2^{2^{|S|}} \]
-
Functions
Functions from A to B: \(B^A\), # = \(|B|^{|A|}\)
-
Binary functions
Binary functions of n binary variables: Functions from \(\{0 ,1 \}^n \) to \( \{0 ,1 \} \). That is \( \{0,1 \}^{{ \{0,1\} }^{n}} \)
#= \(2^{2^n}\) Double exponntial
Exponential Growth
-
\(A^k\):
itertools.product(A, repeat = k)
-
\(n^k\):
n**k
import itertools set(itertools.product({1,2, 3}, repeat = 2)) #Exponent print(3**2)
-
3.2 Variations
Variable length
Take an example of PIN: #3-5 digit PINs
3.3 Counting trees
- Cartesian products as Trees
- Trees are more general products
For example, in a university, there are 3 departments, and each department has 2 different courses. Therefore there are 6 courses in total.
Path from Sources to Destination
4 Permutations and combinations
4.1 Permutations
4.1.1 Basic concept and application
-
n factorial = n!
-
0! = 1
-
Stirling's approximation
\[ n! \sim \sqrt{2 \pi n} \left(\frac{n}{e}\right)^n \]
4.1.2 Partial Permutations
-
permutations of k out of n objects: k-permutaitons of n
\(n \cdot (n-1) \cdot (n-2) \cdot \dotsb \cdot(n-k+1) = \frac{n!}{(n-k)!} \newcommand*{\defeq}{\stackrel{\text{def}}{=}} (n)^{\underline{k}}\)
kth falling power of n, also denoted \(P(n,k)\)
4.2 Combinations
-
Sequences with k 1's
\(\binom{[n]}{k} \) - collection of k-subsets of [n] = {1,2,...,n}
corresponds to n-bit sequences with k 1's
two interpretations
- Number of n-bit sequences with k 1's: \(\binom{n}{k}\)
4.2.1 Binomial coefficients
\[ \binom{n}{k} = \frac{n^{\underline{k}}}{k!} = \frac{n!}{k!(n-k)!} \]
-
\(\binom{n}{k} = \binom{n}{n-k}\)
-
recursive: \(\binom{n}{k} = \frac{n}{k} \cdot \binom{n-1}{k-1}\)
\[ \binom{n}{k} \cdot k = n \cdot \binom{n-1}{k-1}
\] -
\(\sum\limits_{i=0}^{n} \binom{n}{i} = 2^n\)
4.2.2 Binomial Theorem
- Pascal's identity
\[
\binom{n+1}{k} = \binom{n}{k} + \binom{n}{k-1}
\]
-
Pascal's triangle
-
Binomial Theorem
\[ (a+b)^n = \sum\limits_{i=0}{n} \binom{n}{i}a^{n-i}b^i \]
For example, \((a+b)^4 = a^4 + 4a^3b + 6a^2b^2 + 4ab^3 + b^4\)
Think of select # b from n set of {a,b}:
\[ (a+b)^n = \binom{n}{0}a^n + \binom{n}{1}a^{n-1}b + \dots + \binom{n}{n}b^n = \sum\limits_{i=0}^{n}\binom{n}{i}a^{n-i}b^i
\]-
Polynomial coefficient
\[ (1+x)^n = \sum\limits_{i=0}^{n}\binom{n}{i}x^i
\] -
Taylor expansion
\[ e^x = \sum\limits_{i=0}^{\infty} \frac{x^i}{i!} \]
derived from \((1 + \frac{x}{n})^n = \sum\limits_{i=0}{n} \binom{n}{i} \left(\frac{x}{n}\right)^i\)
-
Binomial distribution
\[ \sum\limits_{i=0}^{n} \binom{n}{i} p^{n-1}(1-p)^i = (p + (1 - p))^n = 1^n = 1
\]
4.2.3 Multinomial coefficients
\[ \frac{n!}{k_1! \cdot k_2! \cdot k_3!} \triangleq \binom{n}{k_1, k_2, k_3}, (k_1 + k_2 + k_3 = n) \]
- Multinomial theorem
\[ (a_1 + a_2 + \dots + a_m)^n = \sum\limits_{k_1 + k_2 + \dots + k_m = n \\ k_1, k_2, \dots, k_m \geq 0} \binom{n}{k_1,k_2,\dots, k_m} \prod\limits_{t=1}^{m} a_t^{k_t} \]
- Sum of Multinomialas
\[ m^n = (1 + 1 + \dots + 1)^n = \sum\limits_{k_1 + k_2 + \dots + k_m = n \\ k_1, k_2, \dots, k_m \geq 0} \binom{n}{k_1,k_2,\dots, k_m} \]
-
4.3 Stars and bars
4.3.1 Basic applications
-
k terms adding to n
#ways to write n as a sum of k positive integers, when order matters: \(\binom{n-1}{k-1}\)
-
Any Sum to n
#ways to write n as a sum of (any # of) positive integers: \(2^{n-1} = \sum\limits_{i=0}^{n-1}\binom{n-1}{i}\)
-
Nonnegative terms
#ways to write n as a sum of k nonnegative integers: \(\binom{n+k-1}{k-1}\)
-
Simple example
4-letter words (order doesn't matter): #a + #b + ... + #z = 4 \(\implies \binom{4+26-1}{26-1} = \binom{29}{25} = \binom{29}{4}\)
4.3.2 More applications
-
#k positive adding to n = #k nonnegative adding to n-k
\[ \binom{n-1}{k-1} = \binom{n-k+(k-1)}{k-1}
\] -
#k nonnegative adding to ≤ n = #k+1 nonnegative adding to n
\[ \binom{n+k}{k} = \binom{n+(k+1)-1}{(k+1)-1} \]
need to use Pascal's triangle?
4.4 Python Notebook
-
Permutation:
itertools.permutations(A)
-
Partial permutation:
itertools.permutations(A, k)
-
Factorial:
factorial(len(A))
using the factorial function in mathfrom math import factorial
-
Combinations:
itertools.combinations(A,k)
-
Week exercise
Use Python to generate a k-composition of an integer n, i.e., a k-tuple of positive integers that sum to n
-
The simpler way:
int(binom(n-1,k-1))
-
To obtain all the tuples in the composition by define a function:
import sys import numpy as np # not clear what the following packages used for import scipy as sp from scipy.special import * def compositions(k, n): if k == 1: return {(n,)} # (n,) means a tuple containg a single value comp = set() # comp = [] will generate a list instead of a set. for i in range(1, n): # 1,2,....,n for t in compositions(k - 1, n - i): #recursively comp.add((i,) + t) return comp
-
5 Topic 5 Probability Introduction
5.1 Basic concept
Random value of outcome, denoted by X.
Probability of random outcome x denoted by P(x) or P(X=x)
Probability distribution function (PDF)
-
uniform probability space
Toss an unbiased coin or die...
-
non-uniform probability space
5.2 Three Axioms
-
Non-negativity \(P(A) \geq 0\)
-
Unitarity \(P(\Omega) = 1\)
-
Addition rule: A,B disjoint \(P(A \cup B)= P(A) + P(B)\)