অধ্যায় 18 — Backpropagation Mathematics

📖 একটি ছোট গল্প

একটি ১০০-tala building-এর ছাদে দাঁড়িয়ে আছেন। আপনাকে নিচের প্রতিটি তলার brick placement-এর দোষ বের করতে হবে — কোন brick-এর জন্য ছাদটা slightly off-balance।

সরাসরি উপায়: প্রতিটি brick একটু সরিয়ে দেখতে হবে ছাদ কতটা বদলায় — অসম্ভব slow।

চালাক উপায়: ছাদ থেকে শুরু করে একটু একটু করে নিচের দিকে message pass করুন — প্রতিটি floor-এ কতটুকু দোষ উপরের floor-এর দোষ থেকে আসছে। এটিই Backpropagation — chain rule-এর applied form।

Forward Pass + Backward Pass

Neural network computation = দুটি phase:

Forward pass: input → layer by layer → prediction ŷ → loss L
Backward pass: loss → layer by layer backward → gradient of each parameter

Forward-এর সময় intermediate values cache করতে হয় (activation, pre-activation) — backward-এ gradient compute করতে দরকার।

💡 ইনসাইট

Forward pass-এর computational cost ≈ backward pass-এর cost (roughly 2× forward) — কারণ backward-এ প্রতিটি operation-এর gradient compute করতে হয়।

একটি Simple Example — Step-by-Step

2-layer network: input x, hidden h, output ŷ

z₁ = W₁x + b₁, h = σ(z₁)

z₂ = W₂h + b₂, ŷ = σ(z₂)

L = ½(ŷ − y)² (MSE loss)

Backward (chain rule step-by-step)

1. Output layer error:

δ₂ = ∂L/∂z₂ = (ŷ − y) ⊙ σ'(z₂)

2. Output layer gradients:

∂L/∂W₂ = δ₂ · h^T

∂L/∂b₂ = δ₂

3. Hidden layer error (backpropagated):

δ₁ = (W₂^T · δ₂) ⊙ σ'(z₁)

4. Hidden layer gradients:

∂L/∂W₁ = δ₁ · x^T

∂L/∂b₁ = δ₁

✨ টিপ

প্যাটার্ন দেখুন: δₗ = (Wₗ₊₁^T · δₗ₊₁) ⊙ σ'(zₗ) — error backward propagate করে, weight transpose দিয়ে, activation derivative দিয়ে scale করে।

Computation Graph = Automatic Differentiation

PyTorch/TensorFlow-এর autograd = computation graph build + reverse-mode autodiff:

Forward pass-এ each operation graph-এ একটি node হয়ে যায় (add, mul, matmul, activation)।
Each node-এর .grad_fn = how to compute output's gradient w.r.t. inputs।
loss.backward() = reverse traversal, applying chain rule at each node।

Reverse-mode autodiff = efficient for scalar → vector gradients (our case: loss is scalar, params are many)।

Cost = O(# operations) — independent of parameter count!

💡 ইনসাইট

Forward-mode autodiff (Jacobian-vector product) = efficient for vector → vector (e.g. sensitivity analysis)। Reverse-mode = deep learning-এর default because loss is scalar।

Vanishing & Exploding Gradients

Deep network-এ δₗ = W^T · δₗ₊₁ ⊙ σ' — প্রতিটি layer-এ গুণ হয়:

δ₁ = (W₁^T · diag(σ'(z₁)) · W₂^T · diag(σ'(z₂)) · … · δ_L)

Vanishing: |W| < 1 and |σ'| < 1 → product → 0। Sigmoid-এ σ' ≤ 0.25 — ১০ layer-েই (0.25)¹⁰ ≈ 10⁻⁶!
Exploding: |W| > 1 → product → ∞। RNN-এ common — gradient clipping দরকার।

সমাধান:

ReLU: σ' = 1 (x > 0) — no vanishing for positive activations।
Xavier/Glorot init: W ~ N(0, 1/nᵢₙ) — weight scale control।
He init: W ~ N(0, 2/nᵢₙ) — ReLU-এর জন্য optimized।
ResNet skip: identity shortcut — gradient directly flow করে, product bypass।
LayerNorm/BatchNorm: per-layer activation scale normalize — gradient magnitude control।
Gradient clipping: ‖∇‖ > c হলে normalize — explosion prevent।

Python Implementation

pythonPython · NumPy

import numpy as np

# Manual backpropagation for a 2-layer MLP
# Architecture: 2 -> 3 -> 1
np.random.seed(42)

W1 = np.random.randn(3, 2) * 0.5
b1 = np.zeros((3, 1))
W2 = np.random.randn(1, 3) * 0.5
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# Forward pass
x = np.array([[1.0], [2.0]])        # input (2,1)
y = np.array([[0.5]])               # target

z1 = W1 @ x + b1                    # (3,1)
h = sigmoid(z1)                     # (3,1)
z2 = W2 @ h + b2                    # (1,1)
y_hat = sigmoid(z2)                 # (1,1)
loss = 0.5 * (y_hat - y)**2

print(f"y_hat = {y_hat[0,0]:.4f}, loss = {loss[0,0]:.4f}")

# Backward pass
dL_dyhat = (y_hat - y)              # (1,1)
dyhat_dz2 = sigmoid_prime(z2)       # (1,1)
delta2 = dL_dyhat * dyhat_dz2       # (1,1) = ∂L/∂z2

dL_dW2 = delta2 @ h.T               # (1,3)
dL_db2 = delta2                     # (1,1)

dL_dh = W2.T @ delta2               # (3,1)
dh_dz1 = sigmoid_prime(z1)          # (3,1)
delta1 = dL_dh * dh_dz1             # (3,1) = ∂L/∂z1

dL_dW1 = delta1 @ x.T               # (3,2)
dL_db1 = delta1                     # (3,1)

print("\nGradients from manual backprop:")
print(f"dL/dW2 sum = {dL_dW2.sum():.4f}")
print(f"dL/dW1 sum = {dL_dW1.sum():.4f}")

# Update weights
lr = 0.1
W2 -= lr * dL_dW2
b2 -= lr * dL_db2
W1 -= lr * dL_dW1
b1 -= lr * dL_db1

# ===== PyTorch autograd (the real way) =====
import torch

W1t = torch.tensor(W1, dtype=torch.float32, requires_grad=True)
b1t = torch.tensor(b1, dtype=torch.float32, requires_grad=True)
W2t = torch.tensor(W2, dtype=torch.float32, requires_grad=True)
b2t = torch.tensor(b2, dtype=torch.float32, requires_grad=True)

xt = torch.tensor(x, dtype=torch.float32)
yt = torch.tensor(y, dtype=torch.float32)

z1t = W1t @ xt + b1t
ht = torch.sigmoid(z1t)
z2t = W2t @ ht + b2t
y_hatt = torch.sigmoid(z2t)
loss_t = 0.5 * (y_hatt - yt)**2

loss_t.backward()
print(f"\nPyTorch dL/dW2 = {W2t.grad.sum().item():.4f}")
print(f"PyTorch dL/dW1 = {W1t.grad.sum().item():.4f}")

# Gradient checking: manual vs torch should match closely!
print(f"\nMatch W2: {np.isclose(dL_dW2.sum(), W2t.grad.sum().item())}")
print(f"Match W1: {np.isclose(dL_dW1.sum(), W1t.grad.sum().item())}")

AI/ML সংযোগ

All modern frameworks (PyTorch, TensorFlow, JAX) backpropagation automatic করে — কিন্তু ভেতরে এই math-ই চলে।
Custom layer বানালে forward() ও backward() manually লিখতে হয় — autograd use না করে।
Gradient checkpointing: intermediate activation recompute করে memory save — TPU/GPU memory limit-এর জন্য।
Mixed precision: fp16 backward pass — numeric stability concern, loss scaling দরকার।
Activation function choice: ReLU, GELU, Swish — সব backprop-এর জন্য smooth, non-vanishing gradient-এর জন্য design করা।
Attention mechanism: Transformer-এ backprop through attention scores — softmax(QK^T/√d)-এর gradient flow = key to learning।

Common Mistakes

Backward pass-এ chain rule-এর একটি link miss করা → wrong gradient → wrong update।
Sigmoid/Tanh deep net-এ ব্যবহার করে vanishing gradient-এ অবাক হওয়া — expected behavior।
Gradient check না করা custom layer-এ — bug ধরা কঠিন, early-এ debug করুন।
.backward() দুবার call করলে gradient accumulate হয় — zero_grad() ভুলে গেলে surprise।
In-place operation (যেমন x += 1) autograd graph break করে — avoid in training loop।

Practice Tasks

উপরের 2-layer MLP-তে input x = [−1, 1], target y = 0.8 দিয়ে forward+backward করুন।
Hidden layer 3 → 5 করুন। weight init random রেখে backward pass manually লিখুন।
Gradient check করুন: numerical derivative (central difference) দিয়ে verify করুন আপনার manual backprop।
ReLU দিয়ে same network-এ backward pass — z < 0 হলে gradient কী হয়?

Assignment

একটি 3-layer MLP (input: 4, hidden1: 5, hidden2: 3, output: 2) এর complete backpropagation from scratch implement করুন — শুধু NumPy ব্যবহার করে, PyTorch নয়। Architecture:z₁=W₁x+b₁, h₁=ReLU(z₁), z₂=W₂h₁+b₂, h₂=tanh(z₂), z₃=W₃h₂+b₃, ŷ=softmax(z₃)। Cross-entropy loss-এর gradient derive করে implement করুন। তারপর PyTorch autograd দিয়ে verify করুন প্রতিটি weight gradient match করে কিনা।

Interview Questions

Backpropagation কেন efficient? Reverse-mode autodiff-এর complexity কত?
Vanishing gradient-এর কারণ — sigmoid vs ReLU-এর তুলনা।
Exploding gradient — RNN-এ কেন বেশি? সমাধান কী?
Gradient checkpointing কীভাবে কাজ করে — memory vs compute tradeoff?
Custom PyTorch layer-এ backward() method-এ কী return করতে হয়?

Mini Project

"Backprop Visualizer" — একটি interactive tool বানান যেখানে user একটি network topology দেয় (layer sizes) এবং একটি input-target pair দেয়। Tool-টি forward pass-এর প্রতিটি intermediate value এবং backward pass-এর প্রতিটি gradient value table আকারে দেখায়। সাথে একটি "gradient flow" visualization দেখান — প্রতিটি layer-এ gradient magnitude কত (color intensity দিয়ে) — vanishing/exploding instantly visible।

Summary · সারসংক্ষেপ

Backpropagation = chain rule applied layer-by-layer backward through the network।
Forward: cache intermediate values; Backward: compute gradients using cached values।
Error δ propagates backward: δₗ = (Wₗ₊₁^Tδₗ₊₁) ⊙ σ'(zₗ)।
Vanishing/exploding = deep product of weights and derivatives — solved by ReLU, good init, skip connections, normalization।
PyTorch autograd = reverse-mode automatic differentiation — efficient, correct, but understanding the math is essential for debugging custom architectures।

✨ পরবর্তী পদক্ষেপ

Phase 2 (Calculus) complete! Chapter 19-এ Probability & Statistics phase শুরু — AI-এর অনিশ্চয়তা নিয়ে কথা।