Chapter 25: Hypothesis Testing

📖 A Short Story

The new model scores 0.5% higher accuracy than the old one. Is that a genuine improvement, or just random fluctuation? Hypothesis testing gives a statistical answer to this question, and it is the foundation of everything from A/B testing to paper benchmarks.

The Hypothesis Testing Setup

Null hypothesis (H₀): the default belief (no effect, no difference).
Alternative hypothesis (H₁): the claim we are looking for evidence of.
Test statistic: a single number computed from the data (z, t, χ²).
p-value: the probability, assuming H₀ is true, of observing a test statistic at least as extreme as the one obtained.
Significance level (α): usually 0.05; reject H₀ when p < α.
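The definitions above can be traced through one small, fully worked case. The sketch below uses a made-up example (20 coin flips, 15 heads; these numbers are illustrative assumptions, not data from this chapter) and computes an exact two-sided p-value from the binomial distribution under H₀: the coin is fair.

```python
from scipy import stats

# Hypothetical data, chosen only for illustration
n_flips, n_heads, alpha = 20, 15, 0.05

# H0: P(heads) = 0.5. The test statistic is simply the head count.
# binom.sf(k, n, p) returns P(X > k), so sf(14) gives P(X >= 15).
upper_tail = stats.binom.sf(n_heads - 1, n_flips, 0.5)

# The null distribution is symmetric around 10, so the two-sided
# p-value is twice the one-sided tail.
p_value = 2 * upper_tail
reject = p_value < alpha

print(f"p-value = {p_value:.4f}")
print(f"Reject H0 at alpha={alpha}? {reject}")
```

Doubling one tail is only valid here because the null distribution is symmetric; for an asymmetric null, a different two-sided convention would be needed.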

Type I & Type II Errors

Type I error (probability α): H₀ is true, but we reject it (false positive).
Type II error (probability β): H₀ is false, but we fail to reject it (false negative).
Power = 1 − β = the probability of detecting an effect that really exists.

💡 Insight

Increasing the sample size increases power. A small dataset means a high β, and therefore a higher risk of false negatives.
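To make the link between sample size and power concrete, here is a minimal sketch for the simplest case: a two-sided one-sample z-test with known σ. The function name `z_test_power` and the effect and σ values are assumptions introduced for illustration.

```python
import numpy as np
from scipy import stats

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test with known sigma.

    effect: true difference between the real mean and the H0 mean
    """
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Under H1 the z statistic is normal with this mean and unit variance
    shift = effect * np.sqrt(n) / sigma
    # Probability of landing in either rejection region
    return stats.norm.cdf(-z_crit + shift) + stats.norm.cdf(-z_crit - shift)

# Hypothetical setting: true gain of 0.005 accuracy, run-to-run sigma of 0.02
for n in (10, 50, 200, 1000):
    power = z_test_power(effect=0.005, sigma=0.02, n=n)
    print(f"n = {n:4d}  power = {power:.3f}  beta = {1 - power:.3f}")
```

When the true effect is zero the function returns α, which is a quick sanity check: with no real effect, every rejection is a Type I error.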

Common Tests

z-test (known σ, large n)

z = (x̄ − μ₀) / (σ / √n)
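As a quick numeric check of the formula, the sketch below plugs in hypothetical values (sample mean 0.86, reference mean 0.85, known σ = 0.02, n = 36; all assumed for illustration) and converts the resulting z into a two-sided p-value.

```python
import numpy as np
from scipy import stats

# Hypothetical inputs, assumed for illustration
x_bar, mu_0, sigma, n = 0.86, 0.85, 0.02, 36

# z = (x_bar - mu_0) / (sigma / sqrt(n))
z = (x_bar - mu_0) / (sigma / np.sqrt(n))

# Two-sided p-value: probability of |Z| >= |z| under the standard normal
p_two_sided = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.2f}, p = {p_two_sided:.4f}")
```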

t-test (unknown σ, small n)

t = (x̄ − μ₀) / (s / √n)

One-sample t-test: tests whether the mean of one sample equals a reference value.
Two-sample t-test: tests whether two samples have the same mean.
Paired t-test: compares before/after measurements on the same subjects.
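The three variants map onto three SciPy functions. The sketch below runs all of them on the same small made-up arrays (the accuracy values are hypothetical). The two-sample call passes `equal_var=False`, which selects Welch's version; that is a choice made here, not something the text above prescribes.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies of 8 models before and after a change
before = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.80])
after = np.array([0.83, 0.80, 0.85, 0.83, 0.84, 0.80, 0.83, 0.82])

# One-sample: is the mean of `before` equal to the reference 0.80?
t1, p1 = stats.ttest_1samp(before, popmean=0.80)

# Two-sample (Welch): treats the two arrays as independent groups
t2, p2 = stats.ttest_ind(before, after, equal_var=False)

# Paired: uses the per-subject differences before[i] - after[i]
t3, p3 = stats.ttest_rel(before, after)

print(f"one-sample  t = {t1:.3f}, p = {p1:.4f}")
print(f"two-sample  t = {t2:.3f}, p = {p2:.4f}")
print(f"paired      t = {t3:.3f}, p = {p3:.4f}")
```

On this data the paired test gives a smaller p-value than the two-sample test, because pairing removes the subject-to-subject variation that the independent-groups test has to treat as noise.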

Chi-square (χ²) test

Tests independence or goodness of fit for categorical data.
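For the independence case, a minimal sketch using `scipy.stats.chi2_contingency` on a hypothetical 2×3 contingency table (two user groups by three preferred options; the counts are invented for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical counts: rows = user group, columns = preferred option
table = np.array([[30, 45, 25],
                  [20, 35, 45]])

# H0: row and column variables are independent
chi2, p, dof, expected = stats.chi2_contingency(table)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
print("expected counts under H0:")
print(expected)
```

The degrees of freedom are (rows − 1) × (columns − 1), so 2 for this table.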

ANOVA

Tests whether the means of 3 or more groups are all equal.
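A one-way ANOVA can be sketched with `scipy.stats.f_oneway`. The three groups below are hypothetical accuracy measurements for three model variants, invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical accuracies from 5 runs of each of three model variants
group_a = np.array([0.80, 0.81, 0.79, 0.82, 0.80])
group_b = np.array([0.85, 0.86, 0.84, 0.85, 0.87])
group_c = np.array([0.80, 0.79, 0.81, 0.80, 0.82])

# H0: all three group means are equal
f_stat, p_val = stats.f_oneway(group_a, group_b, group_c)

print(f"F = {f_stat:.3f}, p = {p_val:.6f}")
```

A significant result only says that at least one mean differs; it does not say which one, so a follow-up comparison between pairs of groups is usually needed.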

Confidence Interval

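x̄ ± z_{α/2} · (σ / √n)   (95% CI: x̄ ± 1.96 · SE, where SE = σ / √n)

"95% confidence" means that if the same procedure is repeated many times, about 95% of the resulting intervals will contain the true parameter. If two CIs do not overlap, the difference is significant; the converse does not hold, since overlapping CIs can still go with a significant difference.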

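The repeated-sampling reading of "95% confidence" can be checked by simulation. The sketch below draws many samples from a known normal distribution (true mean, σ, sample size, and repetition count are all assumed values), builds a t-based 95% interval from each, and counts how often the interval contains the true mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical ground truth, assumed for the simulation
true_mean, sigma, n, reps = 0.85, 0.02, 30, 2000

# Each row is one repetition of the experiment
samples = rng.normal(true_mean, sigma, size=(reps, n))
means = samples.mean(axis=1)
ses = samples.std(axis=1, ddof=1) / np.sqrt(n)

# t critical value, because sigma is estimated from each sample
t_crit = stats.t.ppf(0.975, df=n - 1)
lower = means - t_crit * ses
upper = means + t_crit * ses

# Fraction of intervals that contain the true mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(f"Empirical coverage over {reps} repetitions: {coverage:.3f}")
```

The printed coverage should land near 0.95, with some simulation noise. Any single interval either contains the true mean or does not; the 95% describes the procedure, not one interval.
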
Python Implementation

Python · NumPy

import numpy as np
from scipy import stats

# A/B test: model A accuracy vs model B accuracy on different test runs
np.random.seed(0)
model_a = np.random.normal(0.85, 0.02, 30)   # 30 runs
model_b = np.random.normal(0.87, 0.02, 30)

# Two-sample t-test
t_stat, p_val = stats.ttest_ind(model_a, model_b)
print(f"t-statistic = {t_stat:.4f}")
print(f"p-value     = {p_val:.4f}")
print(f"Significant at α=0.05? {p_val < 0.05}")

# Confidence interval for mean accuracy of A
mean_a = model_a.mean()
se_a = model_a.std(ddof=1) / np.sqrt(len(model_a))
ci_low, ci_high = stats.t.interval(0.95, df=len(model_a)-1, loc=mean_a, scale=se_a)
print(f"\n95% CI for model A: [{ci_low:.4f}, {ci_high:.4f}]")

# Chi-square: is a die fair?
observed = np.array([16, 18, 16, 14, 12, 24])   # 100 rolls
expected = np.array([100/6] * 6)
chi2, p = stats.chisquare(observed, expected)
print(f"\nDie test: chi2={chi2:.4f}, p={p:.4f}")

AI/ML Connections

A/B testing: check statistical significance before deploying a new model.
Benchmark comparison: GPT-4 vs Llama, using a paired t-test or McNemar's test.
Feature importance: a permutation test shuffles a feature and tests whether its effect is significant.
Multiple comparison correction: Bonferroni; when many tests are run, α has to be adjusted.
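McNemar's test, mentioned above for benchmark comparison, looks only at the test items on which the two models disagree. The sketch below is one way to compute the exact (binomial) version with `scipy.stats.binom`; the helper name `mcnemar_exact` and the counts are hypothetical, and this is an illustration rather than a reference implementation.

```python
from scipy import stats

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant counts.

    b: items model A got right and model B got wrong
    c: items model A got wrong and model B got right
    """
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence
    # Under H0 each disagreement is equally likely to favor either model,
    # so min(b, c) follows Binomial(n, 0.5)
    tail = stats.binom.cdf(min(b, c), n, 0.5)
    return float(min(1.0, 2.0 * tail))

# Hypothetical benchmark: the models disagree on 42 items
b, c = 12, 30
p_val = mcnemar_exact(b, c)
print(f"b = {b}, c = {c}, p = {p_val:.4f}")
```

Items that both models get right, or both get wrong, carry no information about which model is better, which is why they do not appear in the calculation.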

Common Mistakes

p-hacking: switching tests or adding samples when the result is not significant.
Treating the p-value as an effect size: p can be small even when the effect is tiny.
Not correcting α for multiple tests.
Reading "failed to reject H₀" as "H₀ is true": this is wrong; absence of evidence ≠ evidence of absence.
Choosing the wrong one of a one-tailed and a two-tailed test.
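The gap between p-value and effect size is easy to reproduce. In the sketch below, two simulated groups differ by only 0.02 standard deviations (an assumed, deliberately tiny effect), yet with a large enough sample the test is highly significant. Cohen's d is used as the effect-size measure; that choice is an assumption of this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical groups: true difference of 0.02 standard deviations
n = 200_000
control = rng.normal(0.00, 1.0, n)
treatment = rng.normal(0.02, 1.0, n)

t_stat, p_val = stats.ttest_ind(treatment, control, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

print(f"p-value   = {p_val:.2e}")
print(f"Cohen's d = {cohens_d:.4f}")
```

A d this small is far below the conventional "small effect" threshold of 0.2, so the result is statistically significant but may have no practical value.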

Practice Tasks

A coin lands heads 60 times in 100 tosses. Test whether the coin is fair.
Run a paired t-test on the accuracies of two models (using data from training runs).
Use the Bonferroni correction to calculate α for 10 simultaneous tests.

Assignment

Simulate a realistic A/B test: the control group converts at 5% and the treatment group at 5.5%. Run a z-test with 1,000, 10,000, and 100,000 samples per group. Show how increasing the sample size raises power and tends to lower the p-value.

Interview Questions

What is the correct interpretation of a p-value?
Type I vs Type II error: when is each one more costly?
Statistical significance vs practical significance: what is the difference?
How would you design an A/B test before deploying an ML model?

Mini Project

"A/B Test Calculator": the user supplies data for two groups, and the tool automatically chooses an appropriate test (t-test, z-test, chi-square) and reports the p-value, CI, and effect size.

Summary

Hypothesis testing is the statistical framework for making decisions from data.
Reject H₀ when p-value < α, but remember that effect size matters separately.
There is a tradeoff between Type I (false positive) and Type II (false negative) errors.
A/B testing is the gold standard for evaluating AI deployments.

✨ Next Steps

Chapter 26 covers Maximum Likelihood Estimation, the core principle behind ML model training.
