Chapter 24: Statistics Fundamentals

📖 A Short Story

Probability theory asks: "if the coin is fair, what will happen?" Statistics asks the reverse: "given the data, is the coin fair?" In AI we start with data and have to infer the underlying model from it, which is why statistics is so central.

Descriptive Statistics

Central tendency

Mean: μ = (1/n)Σxᵢ. Sensitive to outliers.
Median: the middle value of the sorted data. Robust to outliers.
Mode: the most frequent value.
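
To make the three measures concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are hypothetical and chosen only so that a single outlier is present.

```python
import statistics

# Hypothetical salaries (in thousands); the last value is a deliberate outlier.
salaries = [30, 32, 32, 35, 38, 40, 400]

mean_val = statistics.mean(salaries)      # sum of values divided by n
median_val = statistics.median(salaries)  # middle of the sorted values
mode_val = statistics.mode(salaries)      # most frequent value

print(f"mean   = {mean_val:.2f}")
print(f"median = {median_val}")
print(f"mode   = {mode_val}")
```

The single value of 400 pulls the mean far above every other salary, while the median and mode stay inside the bulk of the data.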

Spread

Range = max − min
Variance σ² = (1/n)Σ(xᵢ − μ)² (population form, divides by n)
Std deviation σ = √σ²
IQR = Q3 − Q1, a spread measure that is robust to outliers
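
The four spread measures can be checked by hand on a small array. This is a sketch with made-up values. It uses NumPy's default `ddof=0`, which matches the 1/n formula above, and NumPy's default linear interpolation for percentiles.

```python
import numpy as np

# Hypothetical values with mean 5, so the arithmetic is easy to verify by hand.
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

value_range = values.max() - values.min()   # max - min
pop_var = values.var()                      # ddof=0 by default: divides by n
pop_std = values.std()                      # square root of the variance above
q1, q3 = np.percentile(values, [25, 75])    # first and third quartiles
iqr = q3 - q1

print(f"Range    = {value_range}")
print(f"Variance = {pop_var}")
print(f"Std      = {pop_std}")
print(f"IQR      = {iqr}")
```

Other percentile interpolation methods can give slightly different quartiles on small samples, so the IQR depends on that choice.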

Population vs Sample

Population = all of the data. Sample = a subset of the population (what we actually observe).

Sample variance: s² = (1 / (n−1)) · Σ(xᵢ − x̄)²

💡 Insight

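Bessel's correction: dividing by n−1 gives an unbiased estimate (Bias = E[s²] − σ² = 0). Dividing by n underestimates the variance on average, because deviations are measured from the sample mean x̄, which sits closer to the sample points than the true mean μ does.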
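
The bias can be seen empirically with a small simulation. This is an illustrative sketch: the population variance (4.0), sample size (5), and trial count are arbitrary choices, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0      # population variance (sigma = 2), chosen for illustration
n = 5               # small samples make the bias easy to see
trials = 100_000

# Each row is one independent sample of size n from the same population.
samples = rng.normal(loc=0.0, scale=np.sqrt(true_var), size=(trials, n))

biased = samples.var(axis=1, ddof=0).mean()    # average of the divide-by-n estimates
unbiased = samples.var(axis=1, ddof=1).mean()  # average of the divide-by-(n-1) estimates

print(f"true variance        = {true_var}")
print(f"mean of n estimates  = {biased:.3f}")
print(f"mean of n-1 estimates = {unbiased:.3f}")
```

With these settings the divide-by-n average should land near (n−1)/n · σ² = 3.2, while the n−1 version should land near 4.0.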

Covariance & Correlation

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

ρ(X, Y) = Cov(X, Y) / (σ_X · σ_Y) ∈ [−1, 1]

ρ = +1: perfect positive linear relationship
ρ = 0: no linear relationship (could still be nonlinear!)
ρ = −1: perfect negative linear relationship
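
The ρ = 0 case deserves a concrete check. In the sketch below, y is completely determined by x, yet the Pearson correlation is essentially zero because the relationship is symmetric rather than linear. The grid of x values is an arbitrary choice.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)   # symmetric around zero
y = x ** 2                        # y depends entirely on x, but not linearly

rho = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation = {rho:.4f}")
```

A near-zero ρ here says only that no straight line fits the data, not that the variables are unrelated.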

⚠️ Warning

Correlation ≠ Causation. Ice cream sales and drownings are correlated because both rise in summer, not because one causes the other.
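
A simulated version of this example shows how a shared cause creates correlation. Everything below is synthetic: the temperature distribution, the coefficients, and the helper `residual` are hypothetical choices made for illustration, and neither simulated variable influences the other.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 2000

# Hypothetical daily data: temperature is the hidden common cause.
temperature = rng.normal(loc=25.0, scale=5.0, size=n_days)
ice_cream = 10.0 * temperature + rng.normal(scale=20.0, size=n_days)
drownings = 0.2 * temperature + rng.normal(scale=0.5, size=n_days)

raw_corr = np.corrcoef(ice_cream, drownings)[0, 1]

def residual(target, confounder):
    """Remove the part of target explained by a straight-line fit on confounder."""
    slope, intercept = np.polyfit(confounder, target, deg=1)
    return target - (slope * confounder + intercept)

# Correlation that remains once temperature is accounted for in both series.
partial_corr = np.corrcoef(
    residual(ice_cream, temperature), residual(drownings, temperature)
)[0, 1]

print(f"raw correlation            = {raw_corr:.3f}")
print(f"after removing temperature = {partial_corr:.3f}")
```

The raw correlation is strongly positive, and it should shrink to roughly zero once the effect of temperature is removed from both series.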

Python Implementation

Python · NumPy

import numpy as np

np.random.seed(0)
data = np.random.normal(loc=100, scale=15, size=1000)
data = np.append(data, [500, 600])   # add outliers

print(f"Mean      = {data.mean():.2f}")     # pulled by outliers
print(f"Median    = {np.median(data):.2f}") # robust
print(f"Std       = {data.std(ddof=1):.2f}") # sample std
print(f"IQR       = {np.percentile(data, 75) - np.percentile(data, 25):.2f}")

# Covariance & correlation
x = np.random.normal(0, 1, 1000)
y = 2 * x + np.random.normal(0, 0.5, 1000)   # linear with noise
cov = np.cov(x, y, ddof=1)[0, 1]
corr = np.corrcoef(x, y)[0, 1]
print(f"\nCov(x, y)  = {cov:.4f}")
print(f"Corr(x, y) = {corr:.4f}  (should be ~0.97)")

# Standardization (z-score) — used everywhere in ML
z = (data - data.mean()) / data.std(ddof=1)
print(f"\nStandardized: mean={z.mean():.4f}, std={z.std(ddof=1):.4f}")

AI/ML Connections

Feature scaling: z = (x − μ) / σ. Puts features on a comparable scale, which usually helps gradient descent converge.
BatchNorm: normalizes activations using per-batch mean and variance.
Covariance matrix: the heart of PCA.
Pearson correlation: detects multicollinearity during feature selection.
Robust loss (Huber): behaves like the median rather than the mean on large residuals.
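
As one illustration of the list above, the link between the covariance matrix and PCA can be sketched in a few lines. This is a bare-bones eigen-decomposition on synthetic two-feature data, not a full PCA implementation. The data-generating coefficients are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the second feature is mostly a scaled copy of the first.
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

centered = data - data.mean(axis=0)       # PCA works on mean-centered data
cov = np.cov(centered, rowvar=False)      # 2x2 sample covariance matrix

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)
explained = eigvals[::-1] / eigvals.sum() # variance share, largest component first
first_component = eigvecs[:, -1]          # direction of greatest variance

print("explained variance ratio:", np.round(explained, 3))
print("first principal direction:", np.round(first_component, 3))
```

Because the two features are strongly correlated, almost all of the variance falls on the first component.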

Common Mistakes

Scaling the test set with its own statistics instead of applying the training set's mean/std (data leakage).
Inferring causation from correlation.
Summarizing a skewed distribution with the mean alone, which is misleading.
Using n instead of n−1 when computing the sample variance.
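
The first mistake is the easiest to avoid in code: compute the statistics once on the training split and reuse them everywhere. A minimal sketch with synthetic features and an arbitrary 80/20 split (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(loc=50.0, scale=10.0, size=(100, 3))  # synthetic data
train, test = features[:80], features[80:]                  # arbitrary 80/20 split

# Fit the scaling parameters on the training split only.
mu = train.mean(axis=0)
sigma = train.std(axis=0, ddof=1)

# Apply the same training statistics to both splits.
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma

print("train mean per feature:", np.round(train_z.mean(axis=0), 4))
print("test mean per feature: ", np.round(test_z.mean(axis=0), 4))
```

The standardized test set will generally not have a mean of exactly 0 or a std of exactly 1. That is expected, since its parameters came from the training data.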

Practice Tasks

Compare the mean, median, and mode on a skewed dataset.
Calculate the covariance and correlation of two features. What happens when the relationship is non-linear?
Compute the correlation matrix of the 4 features in the Iris dataset.

Assignment

Take a real dataset (Titanic, Iris, Boston). Calculate the mean, median, std, and IQR of every numeric feature. Build a correlation heatmap. Which features are highly correlated? Which ones would you drop for feature selection?

Interview Questions

Why does the variance use an n−1 divisor?
Mean vs Median: when would you use which?
Does Correlation = 0 imply independence? Why not?
Z-score normalization vs Min-Max scaling: when would you use which?

Mini Project

"EDA Dashboard": the user uploads a CSV, and the tool automatically shows descriptive statistics, a correlation heatmap, distribution histograms, and outlier detection.

Summary

Mean/Median/Mode: three faces of central tendency.
Sample variance with n−1 is an unbiased estimator.
Covariance = direction, Correlation = direction + strength.
Feature standardization is a first step in the ML pipeline.

✨ Next Steps

Chapter 25 covers Hypothesis Testing: "is this difference real, or just random noise?"
