Inferential Statistics with Python

Notes from the 2nd course in "Statistics with Python Specialization" on Coursera

Commonly used Python library for inferential statistics

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

1 Confidence interval

Example using NHANES dataset.

\[ \text{confidence interval} = \text{best estimate} \pm \text{mutiplier}*\text{standard error} \]

Cross table

One looking at two proportions from two groups, crosstable might be useful pd.crosstable(dx[col1], dx[col2]).

NB: the column names of the cross table are not a list, and thus needs to be renamed by dx.columns = ['col1','col2'] in some cases

Proportion calculation

    dz = dx.groupby(['RIAGENDRx']).agg({'SMQ020x': [lambda x: np.mean(x=="yes"), np.size]})
    dz.columns = ['Proportion', 'Total_n']

Then, calculate p, n, se, respectively

sm.stats.proportion_confint(prop*n, n, alpha = 0.05)

\[ \text{confidence interval} = \text{best estimate} \pm \text{mutiplier}*\text{se_diff}
\]

\(\text{SE}_1 = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}}\)
\(\text{SE}_2 = \sqrt{\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
\(\text{se_diff} = \sqrt{SE_1^2 + SE_2^2} = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\)
\(\text{Best estimate} = p_1 - p_2\)

Calculate lower confidence interval and upper confidence interval respectively. No sm method so far.

\(\text{Best estimate} = \bar{x}\)
\(\text{Standard error} = \frac{s}{\sqrt{n}}\)
- s is the sample standard deviation np.std(data, ddof=1)
- population standard deviation np.std(data, ddof=0)
multiplier depends on the significant level and distribution (z or t). If t distribution, the shape depends on the degree of freedom.
z distribution: sm.stats.DescrStatsW(bmi_female).zconfint_mean()

\(\text{Best estimate} = \bar{x}_1 - \bar{x}_2\)
\(\text{se_diff} = \sqrt{SE_1^2 + SE_2^2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\)
\(df = min(n_1 - 1, n_2 - 1)\) which is a very conservative way or using Welch's approximation

\(\text{Best estimate} = \bar{x}_1 - \bar{x}_2\)
\(\text{se_diff} = \sqrt{SE_1^2 + SE_2^2} = \sqrt{\frac{(n_1-1)*s_1^2 + (n_2-1)*s_2^2}{n_1+n_2 - 2}}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}\)
\(df = n_1 + n_2 - 2\)

General steps:

set null hypothesis

\(H_0: p_1 - p_2 = 0\)

\(H_A: p_1 - p_2 \neq 0\)
Check conditions: \(n_1p_1 \geq 10, n_1(1-p_1) \geq 10, n_2p_2 \geq 10, n_2(1-p_2) \geq 10\)
calculate test statistics

\[ z = \frac{\hat{p}_1-\hat{p_2}_1 - 0}{\sqrt{\hat{p}(1-\hat{p})(\frac{1}{n_1}+\frac{1}{n_2})}}
\]

se is based on combined the proportion \(p = \frac{(n_1p_1 + n_2p_2)}{(n_1+n_2)}\)
- traditional method for p value: p_val = 2*dist.norm.cdf(-np.abs(test_stat))
- t test: sm.stats.ttest_ind(population1,population2)
- z score: sm.stats.ztest(population1,population2)
Alternative approaches
1. Chi-square text: different hypothesis and two-side hypothesis
2. Fisher's Exact test
  - allow one-side hypothesis
  - typically for small sample size

set null hypothesis

\(H_0: \mu = ?\)

\(H_A: \mu \neq ?, \mu > ?, \mu < ?\) depending on the research questions
Exam results, check assumptions, summarize data (boxplot, QQplot, Histogram)
calculate test statistics

\[ t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}}
\]
- s: sample standard deviation np.std(x, ddof=1)
- sm.stats.ztest()
What if normality doesn't hold
- non-parametric test:
  
  e.g. Wilcoxon signed Rank test (use median to do test statistics)

set null hypothesis

\(H_0: \mu_d = 0 \text{ or } \mu_1 = \mu_2\)

\(H_A: \mu_d \neq 0 \text{ or } \mu_1 \neq \mu_2\)
Exam results, check assumptions, summarize data (boxplot, QQplot, Histogram)
calculate test statistics

\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}
\]
- pooled approach (\(\sigma_1^2 \approx \sigma_2^2\)) variance
  
  \[ se = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2 - 2}}\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}
  \]
  
  \(df = n_1 + n_2 - 2\)
- unpooled approach (\(\sigma_1^2 \approx \sigma_2^2\) is not needed)
  
  \[ se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]
  
  \(df = min(n_1-1,n_2-1)\)
- sm.stats.ztest() or sm.stats.ttest_ind()
- sm.stats.CompareMeans(bmi_female, bmi_male).ztest_ind(usevar='pooled')
  
  The argument bmi_female should be the output of sm.stats.DescrStatsW(data)