Basics of Statistics


Lecture 1 (What Is Statistics?)

Purpose & Scope of Statistics

  • Statistics is the science of making inferences about a population from sample data.
  • This course focuses on probability and core statistical methods.
  • Methods under Gaussian assumptions are the baseline, with brief pointers to nonparametric approaches when the underlying distribution is unknown.

Descriptive vs Inferential vs Statistical Computing

  • Descriptive statistics: Summarize and communicate structure in data using means, standard deviations, ranges, and clear graphs or tables.
  • Inferential statistics: Draw population-level conclusions from samples via hypothesis tests, confidence intervals, and related procedures.
  • Statistical computing: Apply computation grounded in probability and statistics, e.g., Monte Carlo simulation and image reconstruction, to solve practical problems.

Case Study 1: Comparing Blood Pressure Devices (Rosner Ch.1)

  • Question: Are pharmacy-style automated BP machines comparable to technician-operated manual cuffs?
  • Design considerations: measurement order effects, subject characteristics (sex, age, weight, hypertension history), de-identification, and outlier checks.
  • Finding at one site (Location C): mean machine–manual difference ≈ 14 mmHg.
  • Null hypothesis: the population mean difference is 0.
  • Evaluate how unlikely the observed difference is under an appropriate probability model (e.g., Gaussian) to assess evidence against the null.
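The last step can be sketched as a two-sided z-test under a Gaussian model. The ≈14 mmHg mean difference comes from the lecture, but the SD and sample size below are hypothetical, chosen only to illustrate the calculation:

```python
import math

# Hypothetical numbers for illustration: the lecture reports a mean
# machine-manual difference of about 14 mmHg at Location C, but the
# SD and n below are made up for this sketch.
mean_diff = 14.0   # observed mean difference (mmHg)
sd_diff = 12.0     # hypothetical SD of the paired differences
n = 20             # hypothetical number of subjects

# Standardize the observed mean difference under H0: population mean diff = 0
z = (mean_diff - 0.0) / (sd_diff / math.sqrt(n))

# Two-sided p-value under a Gaussian model, via the standard normal CDF
phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
p_value = 2.0 * (1.0 - phi(abs(z)))

print(f"z = {z:.2f}, two-sided p = {p_value:.2e}")
```

A small p-value would be strong evidence against the null hypothesis that the two devices agree on average.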

Case Study 2: Descriptive Visualization Examples (Rosner Ch.2)

  • Vitamin A intake vs cancer status: Binned histograms/bar charts comparing distributions between cases and controls, with high-intake bins rarer among cancer cases.
  • CO exposure over time: Scatter or time-series plots for nonsmokers vs passive smokers showing similar early-morning levels, higher midday exposure for passive smokers, and convergence again in the evening.

Practical Takeaways

  • Before modeling, use descriptive statistics and visualization to inspect structure, trends, and outliers.
  • In study design, predefine measurement order, simultaneous-measurement feasibility, subject metadata, and data management/privacy.
  • In hypothesis testing, interpret results as the probability of observing data as extreme as yours if the null hypothesis were true.

Lecture 2 (Practical Tips for Descriptive Statistics)

Make self-contained graphics

  • Write captions that supply the minimal necessary context: what/why, dataset, key takeaway.
  • Label axes with units, define symbols/variables, and include a clear legend.

Show only what’s needed

  • Plot just enough to reveal trends; avoid clutter and decorative styling.

Don’t trust software defaults

  • Adjust axis ranges/ratios/line styles to match your goal.
  • ROC example: both x (FPR) and y (TPR) are in $[0,1]$ → use equal axis lengths for faithful interpretation.

File formats

  • Prefer vector (EPS/PS/PDF/SVG) for publication—scales cleanly.
  • Raster (JPEG/PNG) pixelates when enlarged.
  • Converting a screenshot to a vector format won’t recover the lost quality.

Notation & Measures of Location

  • Sample: \(x_1,\ldots,x_n\) (vector \(\mathbf{x}\)).
  • Mean (can it truly represent the group?): \(\displaystyle \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\)

    • Linear/affine transform: if \(y_i=ax_i+b\), then \(\bar{y}=a\bar{x}+b\).

    • Sensitive to outliers.

  • Median: middle value after sorting (average the two middles if $n$ is even).

    • Transform: \(y_i=ax_i+b \Rightarrow \operatorname{median}(y)=a\,\operatorname{median}(x)+b\)
    • Robust to outliers (requires sorting).

  • Mode: most frequent value (useful for categorical data; limited as a location measure when values are widely dispersed).

  • Geometric mean: \(\operatorname{GM}=\exp\!\Big(\tfrac{1}{n}\sum_{i=1}^n \log x_i\Big) \;=\;\Big(\prod_{i=1}^n x_i\Big)^{1/n}\)

    • Good for wide-scale data (power-law/exponential, concentrations/exposures), and log-axis plots.
    • Consider as an alternative to the mean for highly skewed data.
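The location measures above can be compared on a small skewed sample; this sketch (with illustrative values) also checks the affine-transform rule for the mean:

```python
import math
import statistics

# Small right-skewed sample with one large outlier (illustrative values)
x = [1, 2, 2, 3, 4, 100]

mean = statistics.mean(x)            # pulled up strongly by the outlier
median = statistics.median(x)        # robust: middle of the sorted data
gm = statistics.geometric_mean(x)    # exp of the mean of the logs

print(mean, median, gm)

# Affine-transform rule: y = a*x + b shifts/scales the mean the same way
a, b = 2, 10
y = [a * xi + b for xi in x]
assert math.isclose(statistics.mean(y), a * mean + b)
```

Note how the single outlier drags the mean far above the median, while the geometric mean lands between them.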

Mean vs Median (Assignment Focus)

  • ==Median is preferred== when data are skewed or contain outliers (e.g., concentrations, income).
  • ==Arithmetic Mean==: \(\bar{x}=\frac{1}{n}\sum x_i\) — sensitive to outliers.
  • ==Median==: robust to outliers; good for ordinal data or censored measurements.

Arithmetic vs Geometric Mean (Kidney Study)

  • Arithmetic Mean: \(\bar{x}=\frac{1}{n}\sum x_i\)
  • Geometric Mean: \(\mathrm{GM}=\exp\!\big(\frac{1}{n}\sum \log x_i\big)\) ==(use when data are log-normal / multiplicative)==
  • Practice: zeros → replace by LOD/2 or a small \(\epsilon\) and report your rule.
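The zero-handling rule can be sketched as follows; the concentration values and the LOD (limit of detection) here are made up for illustration:

```python
import math

# Hypothetical concentration readings; zeros are below the detection limit.
# The LOD value is made up for this sketch.
x = [0.0, 0.5, 1.2, 3.4, 0.0, 8.9]
lod = 0.2

# Replace zeros by LOD/2 before taking logs, and report that rule.
cleaned = [xi if xi > 0 else lod / 2 for xi in x]
gm = math.exp(sum(math.log(c) for c in cleaned) / len(cleaned))
print(f"GM (zeros -> LOD/2 = {lod / 2}): {gm:.3f}")
```

Whatever substitution you choose (LOD/2, a small ε, …), state it explicitly in your report, since it affects the geometric mean.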

Conclusion: For right-skewed concentration data, ==Geometric Mean is the more appropriate location measure==.


Distribution Shape & Mean–Median

  • Symmetric (e.g., bell-shaped): $\bar{x}\approx \text{median}$.
  • Right-skewed (positive tail): $\bar{x}>\text{median}$.
  • Left-skewed (negative tail): $\bar{x}<\text{median}$.

Measures of Spread (Variation)

  • Range: $\max-\min$. Very sensitive to sample size and outliers; hard to compare across different $n$.

  • Percentiles/Quantiles: $p$-th percentile $v_p$ satisfies $p\%$ of data $\le v_p$.

    • Compute (after sorting, 1-indexed): $k=n\cdot p/100$.
      • If $k$ is an integer → average the $k$-th and $(k{+}1)$-th values.
      • Otherwise → round $k$ up to $k'$ and take the $k'$-th value.
    • Use IQR (Q3–Q1) for a robust spread summary.
  • Sample variance/SD: \(s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2,\qquad s=\sqrt{s^2}\)

    • The $n{-}1$ term is the degrees-of-freedom correction (unbiased estimate).
    • Transform: $y_i=ax_i+b \Rightarrow s_y=|a|\,s_x$ (shift $b$ doesn’t affect spread).
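The percentile rule above can be implemented directly; a minimal sketch (the function name and sample values are my own, for illustration):

```python
import math

def percentile(data, p):
    """p-th percentile using the rule from the notes:
    k = n*p/100; if k is an integer, average the k-th and (k+1)-th
    sorted values; otherwise take the ceil(k)-th value (1-indexed)."""
    xs = sorted(data)
    n = len(xs)
    k = n * p / 100
    if k == int(k):
        k = int(k)
        return (xs[k - 1] + xs[k]) / 2  # average the two neighbors
    return xs[math.ceil(k) - 1]        # round up, then index (1-based)

x = [15, 3, 8, 20, 1, 12, 7, 10]
q1, q3 = percentile(x, 25), percentile(x, 75)
print(q1, q3, q3 - q1)  # IQR = Q3 - Q1, a robust spread summary
```

(As written, the rule assumes 0 < p < 100; the p = 100 edge case would need separate handling.)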

Why are there two formulas for standard deviation? (n vs. n−1)

TL;DR

  • Population standard deviation (known population): \(\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}\) Use denominator $N$.

  • Estimating population variance/SD from a sample: \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar x)^2}\) Use denominator $n-1$ (Bessel’s correction). Reason: $\tfrac{1}{n}\sum (x_i-\bar x)^2$ systematically underestimates the population variance.


Why two formulas? (intuition)

  • Sample mean is unbiased: Even though $\bar x$ can be above or below $\mu$ in any sample, on average $\mathbb E[\bar x]=\mu$.

  • Sample variance (with $n$) is biased low: Because we center at the sample mean $\bar x$ (spending one degree of freedom), $\tfrac{1}{n}\sum (x_i-\bar x)^2$ is, on average, too small → needs a correction.

  • Bessel’s correction: \(\mathbb E\!\left[\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2\right]=\sigma^2\) Using $n-1$ makes the estimator unbiased for the population variance.


Tiny example (feel the bias)

Population $\{1,2,3\}$ has variance $\sigma^2=2/3$ (SD $\approx 0.8165$). Consider all ordered size-2 samples drawn with replacement (the i.i.d. setting): there are nine.

  • Using $n$ in the variance gives $0$, $0.25$, or $1$ per sample; averaged over all nine samples this is $1/3$ — systematically too small.
  • Using $n-1$ gives $0$, $0.5$, or $2$; the average is exactly $2/3=\sigma^2$, i.e., unbiased.

Takeaway: The smaller the sample, the stronger the underestimation → $n-1$ matters more.
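The tiny example can be verified exhaustively by enumerating every ordered size-2 sample drawn with replacement (the i.i.d. setting where Bessel's correction is exactly unbiased):

```python
from itertools import product
from statistics import pvariance, variance

pop = [1, 2, 3]
sigma2 = pvariance(pop)  # population variance = 2/3

# All ordered size-2 samples drawn with replacement (i.i.d. sampling)
samples = list(product(pop, repeat=2))

avg_n = sum(pvariance(s) for s in samples) / len(samples)    # denominator n
avg_nm1 = sum(variance(s) for s in samples) / len(samples)   # denominator n-1

print(sigma2, avg_n, avg_nm1)  # n-version averages low; n-1 version hits sigma^2
```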


When to use $n$ vs $n-1$

| Situation | Goal | Denominator | Functions |
| --- | --- | --- | --- |
| You have the entire population | Describe that population’s variance/SD | $N$ | Excel `STDEV.P`; NumPy `np.std(x, ddof=0)` |
| You have a sample and want to estimate the population variance/SD | Remove bias (unbiased estimator) | $n-1$ | Excel `STDEV.S`; NumPy `np.std(x, ddof=1)` |

Practical tip: In reports, state what you used (denominator, function name, ddof).
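The two denominators correspond to two different standard-library calls; this sketch uses Python's `statistics` module (the data values are illustrative), with the Excel/NumPy equivalents noted in comments:

```python
from statistics import pstdev, stdev

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Denominator N: describe THIS data as a whole population.
# Equivalents: Excel STDEV.P, NumPy np.std(x, ddof=0)
sd_pop = pstdev(x)

# Denominator n-1: estimate the population SD from a sample.
# Equivalents: Excel STDEV.S, NumPy np.std(x, ddof=1)
sd_sample = stdev(x)

print(sd_pop, sd_sample)  # the n-1 version is always the larger of the two
```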


Sketch of the algebra

Definition: \(s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar x)^2.\)

Expand: \(\sum (x_i-\bar x)^2=\sum x_i^2-n\bar x^2 =\sum x_i^2-\frac{1}{n}\Big(\sum x_i\Big)^2.\)

Take expectations (under i.i.d. sampling): \(\mathbb E\!\left[\sum (x_i-\bar x)^2\right]=(n-1)\sigma^2 \;\Rightarrow\; \mathbb E[s^2]=\sigma^2.\)

So $n-1$ is the correction that yields an unbiased estimate.

  • Mean Absolute Deviation (MAD about the mean): $\displaystyle \frac{1}{n}\sum_{i=1}^n \lvert x_i-\bar{x}\rvert$.
    • Useful in sparse/robust settings (e.g., compressed sensing).
  • SD is tightly linked to the Normal model ($\mu,\sigma$ fully specify it); percentile–SD mappings under Normality come next chapter.
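The MAD formula is a one-liner; a minimal sketch on illustrative values (the same kind of outlier-laden sample used earlier):

```python
from statistics import mean

# Illustrative sample with one large outlier
x = [1, 2, 2, 3, 4, 100]
xbar = mean(x)

# Mean absolute deviation about the mean: average distance from x-bar
mad = sum(abs(xi - xbar) for xi in x) / len(x)
print(mad)
```

Unlike the SD, the MAD does not square the deviations, so a single outlier contributes linearly rather than quadratically.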