Lecture 1 (What Is Statistics?)
Purpose & Scope of Statistics
- Statistics is the science of making inferences about a population from sample data.
- This course focuses on probability and core statistical methods.
- Methods under Gaussian assumptions are the baseline, with brief pointers to nonparametric approaches when the underlying distribution is unknown.
Descriptive vs Inferential vs Statistical Computing
- Descriptive statistics: Summarize and communicate structure in data using means, standard deviations, ranges, and clear graphs or tables.
- Inferential statistics: Draw population-level conclusions from samples via hypothesis tests, confidence intervals, and related procedures.
- Statistical computing: Apply computation grounded in probability and statistics, e.g., Monte Carlo simulation and image reconstruction, to solve practical problems.
Case Study 1: Comparing Blood Pressure Devices (Rosner Ch.1)
- Question: Are pharmacy-style automated BP machines comparable to technician-operated manual cuffs?
- Design considerations: measurement order effects, subject characteristics (sex, age, weight, hypertension history), de-identification, and outlier checks.
- Finding at one site (Location C): mean machine–manual difference ≈ 14 mmHg.
- Null hypothesis: the population mean difference is 0.
- Evaluate how unlikely the observed difference is under an appropriate probability model (e.g., Gaussian) to assess evidence against the null.
Case Study 2: Descriptive Visualization Examples (Rosner Ch.2)
- Vitamin A intake vs cancer status: Binned histograms/bar charts comparing distributions between cases and controls, with high-intake bins rarer among cancer cases.
- CO exposure over time: Scatter or time-series plots for nonsmokers vs passive smokers showing similar early-morning levels, higher midday exposure for passive smokers, and convergence again in the evening.
Practical Takeaways
- Before modeling, use descriptive statistics and visualization to inspect structure, trends, and outliers.
- In study design, predefine measurement order, simultaneous-measurement feasibility, subject metadata, and data management/privacy.
- In hypothesis testing, interpret results as the probability of observing data as extreme as yours if the null hypothesis were true.
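The hypothesis-testing logic from Case Study 1 can be sketched as a small Monte Carlo simulation. The sample size (n = 20) and the SD of the differences (12 mmHg) below are illustrative assumptions, not figures from the study:

```python
# Hedged sketch of the Case Study 1 logic: how unlikely is a mean
# machine-manual difference of ~14 mmHg if the true mean difference is 0?
# The sample size (n = 20) and SD (12 mmHg) are illustrative assumptions,
# not values from the study.
import numpy as np

rng = np.random.default_rng(0)
n, sd, observed = 20, 12.0, 14.0

# Simulate many studies under the null: zero mean difference, Gaussian noise.
null_means = rng.normal(loc=0.0, scale=sd, size=(100_000, n)).mean(axis=1)

# Two-sided Monte Carlo p-value: fraction of null studies at least as extreme.
p_value = np.mean(np.abs(null_means) >= observed)
print(f"Monte Carlo p-value = {p_value:.5f}")
```

Under these assumed numbers the observed difference sits far in the tail, so the simulated p-value is essentially zero, i.e., strong evidence against the null.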
Lecture 2 (Practical Tips for Descriptive Statistics)
Make self-contained graphics
- Write captions with minimal context: what/why, dataset, key takeaway.
- Label axes with units, define symbols/variables, and include a clear legend.
Show only what’s needed
- Plot just enough to reveal trends; avoid clutter and decorative styling.
Don’t trust software defaults
- Adjust axis ranges/ratios/line styles to match your goal.
- ROC example: both x (FPR) and y (TPR) are in $[0,1]$ → use equal axis lengths for faithful interpretation.
File formats
- Prefer vector (EPS/PS/PDF/SVG) for publication—scales cleanly.
- Raster (JPEG/PNG) pixelates when enlarged.
- Converting a screenshot (raster) to a vector format won’t recover quality already lost to rasterization.
Notation & Measures of Location
- Sample: the observed values are denoted $x_1, x_2, \dots, x_n$.
- Mean (can it really represent the group?): \(\displaystyle \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\)
  - Linear/affine transform: if \(y_i=ax_i+b\), then \(\bar{y}=a\bar{x}+b\).
  - Sensitive to outliers.
- Median: middle value after sorting (average the two middle values if $n$ is even).
  - Transform: \(y_i=ax_i+b \Rightarrow \operatorname{median}(y)=a\,\operatorname{median}(x)+b\)
  - Robust to outliers (requires sorting).
- Mode: most frequent value (useful for categorical data; limited as a location measure when values are widely dispersed).
- Geometric mean: \(\operatorname{GM}=\exp\!\Big(\tfrac{1}{n}\sum_{i=1}^n \log x_i\Big)=\Big(\prod_{i=1}^n x_i\Big)^{1/n}\)
  - Good for wide-scale data (power-law/exponential behavior, concentrations/exposures) and log-axis plots.
  - Consider as an alternative to the mean for highly skewed data.
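A minimal sketch of these location measures and the affine-transform property, using the standard library on a made-up right-skewed toy sample:

```python
# Location measures on a small made-up sample; 10 acts like an outlier.
import statistics
import math

x = [2, 3, 3, 10]

mean = sum(x) / len(x)        # arithmetic mean: 4.5 (pulled up by the outlier)
median = statistics.median(x) # average of the two middle values: 3.0
mode = statistics.mode(x)     # most frequent value: 3
gm = math.exp(sum(math.log(v) for v in x) / len(x))  # geometric mean

# Affine-transform property: y = a*x + b shifts/scales mean and median alike.
a, b = 2, 1
y = [a * v + b for v in x]
assert sum(y) / len(y) == a * mean + b
assert statistics.median(y) == a * median + b
print(mean, median, mode, round(gm, 3))
```

Note how the mean (4.5) sits well above the median (3.0) for this right-skewed sample, previewing the shape rules below.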
Mean vs Median (Assignment Focus)
- ==Median is preferred== when data are skewed or contain outliers (e.g., concentrations, income).
- ==Arithmetic Mean==: $\bar{x}=\frac{1}{n}\sum x_i$ — sensitive to outliers.
- ==Median==: robust to outliers; good for ordinal data or censored measurements.
Arithmetic vs Geometric Mean (Kidney Study)
- Arithmetic Mean: $\bar{x}=\frac{1}{n}\sum x_i$
- Geometric Mean: $\mathrm{GM}=\exp\!\big(\frac{1}{n}\sum \log x_i\big)$ ==(use when data are log-normal / multiplicative)==
- Practice: zeros → replace by LOD/2 or a small $\epsilon$, and report your rule.
Conclusion: For right-skewed concentration data, ==Geometric Mean is the more appropriate location measure==.
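A sketch of the zero-handling rule above; the concentration values and the limit of detection (LOD) here are made up for illustration:

```python
# Geometric mean with the LOD/2 rule for zeros. All numbers are hypothetical.
import math

lod = 0.5                          # hypothetical limit of detection
conc = [0.0, 1.2, 3.5, 0.0, 8.9]  # made-up right-skewed concentration data

# Replace zeros by LOD/2 so logs are defined -- and report this rule.
cleaned = [v if v > 0 else lod / 2 for v in conc]

gm = math.exp(sum(math.log(v) for v in cleaned) / len(cleaned))
am = sum(cleaned) / len(cleaned)
print(f"AM = {am:.3f}, GM = {gm:.3f}")  # GM sits well below AM for skewed data
```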
Distribution Shape & Mean–Median
- Symmetric (e.g., bell-shaped): $\bar{x}\approx \text{median}$.
- Right-skewed (positive tail): $\bar{x}>\text{median}$.
- Left-skewed (negative tail): $\bar{x}<\text{median}$.
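These shape rules can be checked quickly by simulation, using a log-normal sample as a stand-in for a right-skewed distribution:

```python
# Mean vs. median under different distribution shapes.
import numpy as np

rng = np.random.default_rng(42)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive tail
symmetric = rng.normal(loc=0.0, scale=1.0, size=100_000)         # bell-shaped

# Right-skewed: the long positive tail pulls the mean above the median.
print(right_skewed.mean() > np.median(right_skewed))   # True

# Symmetric: mean and median nearly coincide.
print(abs(symmetric.mean() - np.median(symmetric)) < 0.05)  # True
```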
Measures of Spread (Variation)
- Range: $\max-\min$. Very sensitive to sample size and outliers; hard to compare across different $n$.
- Percentiles/Quantiles: the $p$-th percentile $v_p$ satisfies: $p\%$ of the data are $\le v_p$.
  - Compute (after sorting): $k=n\cdot p/100$.
    - If $k$ is an integer → average the $k$-th and $(k{+}1)$-th values.
    - Otherwise → round $k$ up to $k'$ and take the $k'$-th value.
  - Use the IQR ($Q_3-Q_1$) for a robust spread summary.
- Sample variance/SD: \(s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2,\qquad s=\sqrt{s^2}\)
  - The $n{-}1$ term is the degrees-of-freedom correction (unbiased estimator).
  - Transform: $y_i=ax_i+b \Rightarrow s_y=|a|\,s_x$ (the shift $b$ doesn’t affect spread).
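The percentile rule above can be implemented directly. Note this is one of several common conventions; NumPy’s default `np.percentile`, for instance, interpolates instead:

```python
# Direct implementation of the sort-then-index percentile rule from the notes.
import math

def percentile(data, p):
    """p-th percentile: k = n*p/100; average positions k and k+1 if k is an
    integer, otherwise round k up and take that position (1-based)."""
    xs = sorted(data)
    n = len(xs)
    k = n * p / 100
    if k == int(k):                    # k is an integer
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2 # average the k-th and (k+1)-th values
    return xs[math.ceil(k) - 1]        # otherwise round up and take that value

x = [15, 1, 9, 3, 7, 11, 5, 13]        # n = 8; sorted: 1,3,5,7,9,11,13,15
q1, q3 = percentile(x, 25), percentile(x, 75)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {q3 - q1}")
```

For this sample, $k=2$ for the 25th percentile (average the 2nd and 3rd values, giving 4) and $k=6$ for the 75th (average the 6th and 7th, giving 12), so the IQR is 8.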
Why are there two formulas for standard deviation? (n vs. n−1)
TL;DR
- Population standard deviation (entire population known): \(\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}\). Use denominator $N$.
- Estimating the population variance/SD from a sample: \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar x)^2}\). Use denominator $n-1$ (Bessel’s correction), because $\tfrac{1}{n}\sum (x_i-\bar x)^2$ systematically underestimates the population variance.
Why two formulas? (intuition)
- Sample mean is unbiased: although $\bar x$ can fall above or below $\mu$ in any one sample, on average $\mathbb E[\bar x]=\mu$.
- Sample variance (with $n$) is biased low: because we center at the sample mean $\bar x$ (spending one degree of freedom), $\tfrac{1}{n}\sum (x_i-\bar x)^2$ is, on average, too small → it needs a correction.
- Bessel’s correction: \(\mathbb E\!\left[\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2\right]=\sigma^2\). Using $n-1$ makes the estimator unbiased for the population variance.
Tiny example (feel the bias)
Population $\{1,2,3\}$ has variance $\sigma^2=2/3$ (SD $\approx 0.8165$). Consider all size-2 i.i.d. samples, i.e., drawn with replacement: the nine ordered pairs $(1,1),(1,2),\dots,(3,3)$.
- Using $n$ in the variance, these nine values average to $1/3$ → systematically too small.
- Using $n-1$ inflates each sample variance just enough: the nine values average to exactly $2/3=\sigma^2$.
Takeaway: The smaller the sample, the stronger the underestimation → the $n-1$ correction matters more.
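The bias can be verified exhaustively in a few lines, enumerating every with-replacement size-2 sample from the population:

```python
# Average the n- and (n-1)-denominator variances over all i.i.d. size-2
# samples (drawn with replacement) from the population {1, 2, 3}.
from itertools import product
from statistics import pvariance  # population variance (denominator N)

pop = [1, 2, 3]
sigma2 = pvariance(pop)  # 2/3

def var(sample, ddof):
    """Variance of a sample with denominator len(sample) - ddof."""
    m = sum(sample) / len(sample)
    return sum((v - m) ** 2 for v in sample) / (len(sample) - ddof)

samples = list(product(pop, repeat=2))  # all 9 ordered pairs
avg_n = sum(var(s, ddof=0) for s in samples) / len(samples)
avg_n1 = sum(var(s, ddof=1) for s in samples) / len(samples)
print(avg_n, avg_n1, sigma2)  # ddof=0 averages low; ddof=1 matches sigma^2
```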
When to use $n$ vs $n-1$
| Situation | Goal | Denominator | Functions |
|---|---|---|---|
| You have the entire population | Describe that population’s variance/SD | $N$ | Excel `STDEV.P`; NumPy `np.std(x, ddof=0)` |
| You have a sample and want to estimate the population variance/SD | Remove bias (unbiased estimator) | $n-1$ | Excel `STDEV.S`; NumPy `np.std(x, ddof=1)` |
Practical tip: In reports, state what you used (denominator, function name, `ddof`).
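In NumPy the choice is the `ddof` (“delta degrees of freedom”) argument, which is subtracted from $n$ in the denominator:

```python
# The two SD denominators side by side in NumPy.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
sd_pop = np.std(x, ddof=0)   # denominator n:   describes x itself
sd_samp = np.std(x, ddof=1)  # denominator n-1: estimates the population SD
print(sd_pop, sd_samp)       # sqrt(2/3) ≈ 0.816 vs sqrt(1) = 1.0
```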
Sketch of the algebra
Definition:
\[s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2.\]
Expand:
\[\sum_{i=1}^n (x_i-\bar x)^2=\sum_{i=1}^n x_i^2-n\bar x^2=\sum_{i=1}^n x_i^2-\frac{1}{n}\Big(\sum_{i=1}^n x_i\Big)^2.\]
Take expectations (under i.i.d. sampling):
\[\mathbb E\!\left[\sum_{i=1}^n (x_i-\bar x)^2\right]=(n-1)\sigma^2 \;\Rightarrow\; \mathbb E[s^2]=\sigma^2.\]
So $n-1$ is exactly the correction that yields an unbiased estimate.
- Mean absolute deviation (MAD about the mean): \(\displaystyle \frac{1}{n}\sum_{i=1}^n |x_i-\bar{x}|\)
  - Useful in sparse/robust settings (e.g., compressed sensing).
- SD is tightly linked to the Normal model ($\mu,\sigma$ fully specify it); percentile–SD mappings under Normality come next chapter.