Chapter 32: Regularization Mathematics

📖 A Short Story

A student memorizes every past paper before an exam, then fails as soon as an unfamiliar question appears. An ML model behaves the same way: if it memorizes the training data, it fails on the test set. Regularization means "memorize less, generalize more"; it is the mathematical remedy for overfitting.

Overfitting & Capacity

When model capacity is high, the model can memorize even the noise in the training data → low training loss, high test loss. The bias-variance tradeoff:

E[(y − ŷ)²] = Bias² + Variance + Irreducible noise

Regularization = improving generalization by reducing variance (at the cost of a small increase in bias).
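The tradeoff above can be estimated numerically. The following is a minimal sketch, assuming a synthetic linear problem with fixed inputs and fresh label noise on every trial; the helper `ridge_fit`, the sizes, and the λ grid are illustrative choices, not part of the chapter. It estimates Bias² and Variance of ridge predictions for several values of λ.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, sigma, trials = 30, 10, 1.0, 200
true_w = rng.normal(size=D)
X = rng.normal(size=(N, D))          # fixed training inputs
X_test = rng.normal(size=(200, D))   # fixed test inputs
f_test = X_test @ true_w             # noise-free test targets

def ridge_fit(X, y, lam):
    # Closed-form minimizer of mean squared error + lam * ||w||^2.
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

lambdas = [0.0, 0.1, 1.0, 10.0]
bias2, variance = [], []
for lam in lambdas:
    preds = np.empty((trials, len(f_test)))
    for t in range(trials):
        y = X @ true_w + sigma * rng.normal(size=N)  # fresh noise each trial
        preds[t] = X_test @ ridge_fit(X, y, lam)
    mean_pred = preds.mean(axis=0)
    bias2.append(np.mean((mean_pred - f_test) ** 2))   # squared bias, averaged over test points
    variance.append(np.mean(preds.var(axis=0)))        # variance across trials, averaged

for lam, b, v in zip(lambdas, bias2, variance):
    print(f"lambda={lam:<5} bias^2={b:.4f} variance={v:.4f}")
```

In this setup the unregularized fit should show the largest variance and the smallest bias, while the largest λ shows the opposite pattern.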

L2 (Ridge / Weight Decay)

L_reg = L_data + λ · ||w||₂²

Gradient: ∇L_reg = ∇L_data + 2λw. The extra term pulls every weight toward 0 and discourages large weights. L2 regularization is the MAP estimate under a Gaussian prior on w.
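The gradient formula can be checked against finite differences. This is a small sketch under assumed conditions: L_data is taken to be mean squared error on random data, and the names `loss_reg` and `grad_reg` are hypothetical helpers introduced only for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.normal(size=50)
w = rng.normal(size=4)
lam = 0.1

def loss_reg(w):
    # L_reg = L_data + lam * ||w||_2^2, with L_data assumed to be MSE
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

def grad_reg(w):
    grad_data = 2 * X.T @ (X @ w - y) / len(y)
    return grad_data + 2 * lam * w   # the 2*lam*w term comes from the penalty

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (loss_reg(w + eps * e) - loss_reg(w - eps * e)) / (2 * eps)
    for e in np.eye(len(w))
])
print("max abs difference:", np.max(np.abs(numeric - grad_reg(w))))
```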

💡 Insight

In AdamW, weight decay is not added to the loss; it is applied directly in the update (w ← w − η·g − η·λw, where g is Adam's adaptively scaled step), so the decay term bypasses Adam's adaptive scaling.
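The difference can be seen in a single update step. The sketch below is a hand-written NumPy version of one Adam step, not any library's API; the function `adam_step` and its parameters `l2` and `decoupled_wd` are hypothetical names. The data gradient is set to zero so that only the decay term acts.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              l2=0.0, decoupled_wd=0.0):
    # L2 in the loss: the penalty gradient 2*l2*w passes through Adam's statistics.
    g = grad + 2 * l2 * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    adaptive = m_hat / (np.sqrt(v_hat) + eps)
    # Decoupled decay (AdamW style): added outside the adaptive scaling.
    w_new = w - lr * (adaptive + decoupled_wd * w)
    return w_new, m, v

w0 = np.array([10.0, 0.1])       # one large and one small weight
zeros = np.zeros_like(w0)
no_grad = np.zeros_like(w0)      # zero data gradient isolates the decay

w_l2, _, _ = adam_step(w0, no_grad, zeros, zeros, t=1, l2=0.05)
w_adamw, _, _ = adam_step(w0, no_grad, zeros, zeros, t=1, decoupled_wd=0.1)
print("Adam + L2 shrink:", w0 - w_l2)
print("AdamW shrink    :", w0 - w_adamw)
```

With L2 inside the loss, Adam's normalization makes both weights shrink by roughly the same amount (about lr), regardless of their size. With decoupled decay the shrink is proportional to each weight, which is the behavior weight decay is meant to have.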

L1 (Lasso)

L_reg = L_data + λ · ||w||₁

The penalty is non-differentiable at the origin, so a subgradient is used. L1 can drive weights to exactly 0 → sparsity → feature selection. It is the MAP estimate under a Laplace prior.
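Plain subgradient steps tend to hover around zero rather than land on it. One common way to obtain exact zeros is proximal gradient descent (ISTA) with soft-thresholding, sketched below. This swaps in a different optimizer from the subgradient method mentioned above; `soft_threshold` and `fit_lasso_ista` are hypothetical names, and the data is synthetic.

```python
import numpy as np

def soft_threshold(w, tau):
    # Proximal operator of tau * ||w||_1: shrink toward 0, clip small values to 0.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def fit_lasso_ista(X, y, lam=0.1, lr=0.05, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)          # gradient of the MSE term only
        w = soft_threshold(w - lr * grad, lr * lam)    # then the L1 proximal step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:5] = [1, -2, 3, -1, 0.5]                       # only 5 relevant features
y = X @ true_w + 0.5 * rng.normal(size=100)

w = fit_lasso_ista(X, y)
print("weights:", np.round(w, 2))
print("exact zeros:", int(np.sum(w == 0.0)))
```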

Elastic Net

L_reg = L_data + λ₁·||w||₁ + λ₂·||w||₂²

Combines L1 and L2; works well when features come in correlated groups.
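For proximal-gradient training, the combined penalty has a closed-form proximal step: soft-threshold for the L1 part, then a uniform shrink for the L2 part. The sketch below assumes that setup; `elastic_net_prox` is a hypothetical helper, and the grid search is only a brute-force sanity check of the formula on one scalar.

```python
import numpy as np

def elastic_net_prox(v, step, lam1, lam2):
    # argmin_w 0.5*||w - v||^2 + step*(lam1*||w||_1 + lam2*||w||_2^2)
    shrunk = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)  # L1 part
    return shrunk / (1.0 + 2.0 * step * lam2)                       # L2 part

v = np.array([1.0, -0.05, 0.4])
out = elastic_net_prox(v, step=0.5, lam1=0.2, lam2=0.5)
print("prox output:", out)

# Brute-force check for the scalar v = 1.0 on a fine grid
grid = np.linspace(-2, 2, 400001)
obj = 0.5 * (grid - 1.0) ** 2 + 0.5 * (0.2 * np.abs(grid) + 0.5 * grid ** 2)
best = grid[np.argmin(obj)]
print("grid minimizer:", best)
```

The small entry is set to exactly 0 (the L1 effect) while the surviving entries are scaled down further (the L2 effect).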

Dropout

During training, each neuron is dropped with probability p. At inference all neurons are active, but the output is scaled by (1−p). With inverted dropout the scaling is done during training instead, as in the formula below.

h̃ = (mask ⊙ h) / (1 − p), mask_i ~ Bernoulli(1−p)

Effect: an implicit ensemble; an exponential number of sub-networks (sharing weights) are trained.
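The 1/(1−p) factor in inverted dropout is what keeps the expected activation unchanged, so no rescaling is needed at inference. A short derivation, assuming independent masks with mask_i ~ Bernoulli(1−p) as above:

```latex
\begin{aligned}
\mathbb{E}[\tilde{h}_i]
  &= \frac{\mathbb{E}[\mathrm{mask}_i]\, h_i}{1 - p}
   = \frac{(1 - p)\, h_i}{1 - p}
   = h_i, \\
\operatorname{Var}[\tilde{h}_i]
  &= \frac{\operatorname{Var}[\mathrm{mask}_i]\, h_i^2}{(1 - p)^2}
   = \frac{p\,(1 - p)\, h_i^2}{(1 - p)^2}
   = \frac{p}{1 - p}\, h_i^2 .
\end{aligned}
```

The mean is preserved for every p, while the injected noise grows with p; this noise is the source of the regularizing effect.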

Batch / Layer Normalization

x̂ = (x − μ_batch) / √(σ²_batch + ε), y = γ·x̂ + β

Originally motivated as reducing internal covariate shift; speeds up training.
Side effect: regularization (batch statistics inject noise).
LayerNorm: independent of the batch; essential for Transformers.
RMSNorm: simpler, used in LLaMA.
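The three normalizers differ mainly in which axis the statistics are computed over. Below is a minimal NumPy sketch of the training-mode forward passes, assuming 2-D inputs of shape (batch, features); the function names are illustrative, and running statistics for BatchNorm inference are deliberately left out.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Statistics per feature, computed across the batch axis.
    mu = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def layer_norm(x, gamma, beta, eps=1e-5):
    # Statistics per sample, computed across the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    # No mean subtraction and no shift: divide by the root mean square only.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 16))
gamma, beta = np.ones(16), np.zeros(16)

bn = batch_norm(x, gamma, beta)
ln = layer_norm(x, gamma, beta)
rn = rms_norm(x, gamma)
print("BatchNorm per-feature mean:", np.round(bn.mean(axis=0)[:3], 6))
print("LayerNorm per-sample mean :", np.round(ln.mean(axis=-1)[:3], 6))
print("RMSNorm mean square       :", np.round(np.mean(rn ** 2, axis=-1)[:3], 4))
```

Because `layer_norm` and `rms_norm` only use per-sample statistics, normalizing one row alone gives the same result as normalizing it inside a batch, which is not true for `batch_norm`.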

Other Techniques

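Early stopping: stop when validation loss starts to rise; an implicit regularizer.
Data augmentation: image flip/crop, text mixup; broadens the training distribution.
Label smoothing: instead of a one-hot target, the true class gets (1−ε) + ε/K and every other class gets ε/K; reduces overconfidence.
Stochastic depth: randomly skip ResNet blocks during training.
Mixup / CutMix: Mixup blends inputs and labels linearly; CutMix pastes a patch from another image and mixes the labels in proportion to patch area.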
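Two of the techniques in this list are short enough to sketch directly. The code below is an illustrative NumPy version of label smoothing and Mixup; `smooth_labels` and `mixup` are hypothetical helpers, and drawing one Beta(α, α) coefficient per batch is an assumed (though common) choice.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    # True class gets (1 - eps) + eps/K, every other class gets eps/K.
    one_hot = np.eye(num_classes)[labels]
    return (1 - eps) * one_hot + eps / num_classes

def mixup(x, y_soft, alpha=0.2, rng=None):
    # Blend each example with a randomly chosen partner from the same batch.
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)          # one mixing coefficient per batch
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_mix = lam * y_soft + (1 - lam) * y_soft[perm]
    return x_mix, y_mix

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5))
labels = np.array([0, 1, 2, 3, 0, 1])

y_soft = smooth_labels(labels, num_classes=4, eps=0.1)
x_mix, y_mix = mixup(x, y_soft, alpha=0.2, rng=rng)
print("smoothed target for class 0:", y_soft[0])
print("mixed target row sums:", y_mix.sum(axis=1))
```

Both outputs remain valid probability distributions, so they can be used with a cross-entropy loss that accepts soft targets.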

Python Implementation

Python · NumPy

import numpy as np

# Compare L1 vs L2 effect on linear regression
np.random.seed(0)
N, D = 100, 20
X = np.random.randn(N, D)
true_w = np.zeros(D); true_w[:5] = [1, -2, 3, -1, 0.5]   # only 5 relevant
y = X @ true_w + 0.5*np.random.randn(N)

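def fit_regularized(X, y, reg='l2', lam=0.1, lr=0.05, steps=2000):
    """Gradient descent on MSE with an optional L1 or L2 penalty.

    Any other value of `reg` (e.g. 'none') applies no penalty.
    """
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = 2*X.T @ (X @ w - y) / len(y)   # gradient of the MSE data term
        if reg == 'l2':
            g += 2*lam*w                   # gradient of lam * ||w||_2^2
        elif reg == 'l1':
            # Subgradient of lam * ||w||_1 (np.sign(0) == 0 is a valid choice).
            # Subgradient steps hover near 0 instead of hitting it exactly,
            # which is why the sparsity count below uses a small threshold.
            g += lam*np.sign(w)
        w -= lr*g
    return w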

w_none = fit_regularized(X, y, reg='none', lam=0)
w_l2 = fit_regularized(X, y, reg='l2', lam=0.1)
w_l1 = fit_regularized(X, y, reg='l1', lam=0.1)

print("True:", np.round(true_w, 2))
print("None:", np.round(w_none, 2))
print("L2  :", np.round(w_l2, 2))
print("L1  :", np.round(w_l1, 2))
print(f"\nL1 non-zero count: {np.sum(np.abs(w_l1) > 0.05)}  (sparsity!)")

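# Inverted dropout: drop with probability p, rescale survivors by 1/(1-p)
def dropout(x, p=0.5, train=True):
    if not train:
        return x                      # inference: identity, no rescaling needed
    mask = (np.random.rand(*x.shape) > p).astype(x.dtype)   # keep with prob 1-p
    return x * mask / (1 - p)

# Use enough elements that the sample mean is a stable estimate of the expectation
h = np.ones((1000, 8))
print(f"\nDropout mean (should ≈ 1): {dropout(h, 0.5).mean():.4f}")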

AI/ML Connections

Linear/Logistic: L1/L2 are standard.
CNN: BatchNorm + weight decay + augmentation.
Transformer: LayerNorm + dropout + AdamW weight decay + label smoothing.
LLM: weight decay and dropout are often kept low; scale itself is the biggest regularizer.
GAN: spectral normalization, gradient penalty.

Common Mistakes

High dropout combined with BatchNorm: the two interact badly.
Tuning λ without a validation set.
Leaving dropout on at test time, which gives inconsistent predictions.
Adam + L2 (in the loss) → incorrect weight decay behavior; use AdamW.
Augmentation so aggressive that the label changes.

Practice Tasks

On the same MLP, try λ = 0, 0.001, 0.01, 0.1, 1.0 and plot train/val accuracy.
Use L1 for feature selection and count how many weights become 0.
Try dropout p = 0.1, 0.3, 0.5, 0.7 and observe the effect on a small dataset.

Assignment

Train a 3-layer MLP on a 1000-sample subset of MNIST. Apply five regularizers (none, L2, L1, Dropout, BatchNorm) one at a time and compare the train/val curves. Which one generalizes best? Also try adding Mixup.

Interview Questions

L1 vs L2: when would you use which?
Why is dropout an implicit ensemble?
Why can BatchNorm be called a regularizer?
How does AdamW differ from Adam with L2?
How does label smoothing work?

Mini Project

"Overfitting Lab": an over-parameterized MLP is given a tiny dataset (50 samples). The user toggles each regularizer on or off and controls λ and the dropout rate with sliders; train vs. val loss animates in real time, making overfitting, underfitting, and the sweet spot visible.

Summary

Regularization = improving generalization by reducing variance.
L2 = small weights, L1 = sparse weights.
Dropout = exponential implicit ensemble.
BatchNorm/LayerNorm = speed up training and also regularize.
Modern recipe = AdamW + LayerNorm + Dropout + Augmentation.

✨ Next Steps

Phase 4 complete! Chapter 33 begins Phase 5 (Deep Learning Mathematics) with the mathematics of neural networks.
