📖 A Short Story
In one paper I saw \mathcal{L}_{\text{adv}} = \mathbb{E}_{x \sim p_{data}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log(1 - D(G(z)))], and the author says "it follows from Jensen's inequality". But how? Understanding a paper's equations = learning the notation + recognizing common patterns + filling in the proof gaps yourself.
Common Notation Guide
The "alphabet soup" of AI papers: the same letter can carry different meanings.
- L: loss function (or Lagrangian, or number of layers).
- \mathcal{L}: calligraphic L, usually a loss or objective (e.g., ELBO, adversarial loss).
- \ell: lowercase l, the per-sample loss (averaging \ell over samples gives L).
- θ (theta): model parameters (weights).
- φ (phi): variational parameters, encoder parameters, or hyperparameters.
- ψ (psi): critic/discriminator parameters (GAN, RL).
- ℙ / 𝔼: probability measure / expectation (over a distribution).
- \sim: "distributed as" (e.g., z \sim \mathcal{N}(0, I)).
- \propto: "proportional to" (the normalization constant is omitted).
- \mathcal{O}(n): big-O, an asymptotic upper bound; \Theta(n): a tight bound (both upper and lower).
Common Equation Patterns
Pattern 1: Expectation of Loss
\mathbb{E}_{x \sim p} [f(x)] = \int f(x) p(x) dx \approx \frac{1}{N} \sum_{i=1}^N f(x_i)
Here the x_i are samples drawn from p. This Monte Carlo estimate is the basis of empirical risk minimization.
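To make Pattern 1 concrete, here is a minimal sketch of the sample-average approximation. The helper name `monte_carlo_expectation`, the choice p = N(0, 1), and f(x) = x^2 are illustrative assumptions, picked because the exact expectation (1) is known.

```python
import numpy as np

def monte_carlo_expectation(f, sampler, n, seed=0):
    """Estimate E_{x~p}[f(x)] as the mean of f over n samples drawn by sampler."""
    rng = np.random.default_rng(seed)
    x = sampler(rng, n)            # n i.i.d. samples from p
    return float(np.mean(f(x)))    # (1/N) * sum_i f(x_i)

# Illustrative choice: p = N(0, 1) and f(x) = x^2, whose exact expectation is 1.
estimate = monte_carlo_expectation(
    f=lambda x: x ** 2,
    sampler=lambda rng, n: rng.standard_normal(n),
    n=200_000,
)
print("Monte Carlo estimate:", estimate, "(exact value: 1.0)")
```

The error of such an estimate typically shrinks like 1/sqrt(N), which is why a mini-batch loss is a noisy but usable stand-in for the full expectation.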
Pattern 2: KL Divergence
KL(q \| p) = \mathbb{E}_q [\log q(x) - \log p(x)] = \mathbb{E}_q [\log \frac{q(x)}{p(x)}]
It shows up everywhere: ELBO, VAE, variational inference. Note that KL(q \| p) \ge 0 and that it is not symmetric in q and p.
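A small numerical sketch of Pattern 2 for discrete distributions. The function name `kl_divergence` and the two example distributions are assumptions made for illustration; the code also assumes p(x) > 0 wherever q(x) > 0.

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) = sum_x q(x) * (log q(x) - log p(x)) for discrete distributions."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    mask = q > 0                   # terms with q(x) = 0 contribute 0 by convention
    return float(np.sum(q[mask] * (np.log(q[mask]) - np.log(p[mask]))))

# Hypothetical distributions over three outcomes.
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl_qp = kl_divergence(q, p)
kl_pq = kl_divergence(p, q)
kl_qq = kl_divergence(q, q)
print("KL(q||p) =", kl_qp)
print("KL(p||q) =", kl_pq)   # generally different from KL(q||p)
print("KL(q||q) =", kl_qq)   # zero when both distributions coincide
```

Swapping the arguments changes the value, which is worth remembering when a paper writes KL(q \| p) in one place and KL(p \| q) in another.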
Pattern 3: Chain Rule / Total Derivative
\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h} \cdot \frac{\partial h}{\partial a} \cdot \frac{\partial a}{\partial W_1}
Backpropagation = repeated application of the chain rule. For vectors and matrices each factor is a Jacobian, so tracking shapes matters.
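The chain rule in Pattern 3 can be traced factor by factor on a tiny example. This is a sketch under assumed definitions that are not from the text: a = W_1 x, h = tanh(a), and L = 0.5 * ||h - y||^2. The result is compared against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(3)          # input, shape (3,)
y = rng.standard_normal(2)          # target, shape (2,)
W1 = rng.standard_normal((2, 3))    # weights, shape (2, 3)

def loss(W):
    a = W @ x                       # pre-activation, shape (2,)
    h = np.tanh(a)                  # hidden activation, shape (2,)
    return 0.5 * np.sum((h - y) ** 2)

# Chain rule, one factor at a time.
a = W1 @ x
h = np.tanh(a)
dL_dh = h - y                       # dL/dh, shape (2,)
dh_da = 1.0 - h ** 2                # diagonal of the tanh Jacobian, shape (2,)
dL_da = dL_dh * dh_da               # dL/da, shape (2,)
dL_dW1 = np.outer(dL_da, x)         # a_i = sum_j W1[i, j] * x[j], so shape (2, 3)

# Central finite differences on every entry of W1 as an independent check.
eps = 1e-6
numeric = np.zeros_like(W1)
for idx in np.ndindex(W1.shape):
    E = np.zeros_like(W1)
    E[idx] = eps
    numeric[idx] = (loss(W1 + E) - loss(W1 - E)) / (2 * eps)

max_diff = np.abs(dL_dW1 - numeric).max()
print("gradient shape:", dL_dW1.shape)
print("max |analytic - numeric| =", max_diff)
```

Writing the shape next to each factor is the same bookkeeping that helps when a paper compresses several of these steps into one line.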
Pattern 4: Matrix Inner Product / Trace
\langle A, B \rangle = \text{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}
This is the Frobenius inner product, often written compactly as A:B.
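The identity in Pattern 4 is easy to confirm numerically. A brief sketch with random matrices (the shapes 3 x 4 are an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

via_trace = np.trace(A.T @ B)       # tr(A^T B); A^T B has shape (4, 4)
via_sum = np.sum(A * B)             # sum_{i,j} A_ij * B_ij
via_vdot = np.vdot(A, B)            # flattens both arrays, then takes the dot product

# <A, A> equals the squared Frobenius norm of A.
frob_norm_sq = np.linalg.norm(A, "fro") ** 2

print(via_trace, via_sum, via_vdot)
print(np.sum(A * A), frob_norm_sq)
```

The elementwise form avoids building the full product A^T B, so it is usually the cheaper one to compute.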
Filling in Proof Gaps
Proofs in papers are often left as an "exercise for the reader". How to handle them:
- Identify the claim: what exactly does the theorem say? Does the converse also hold?
- Work backwards: start from the desired result and ask which assumptions it needs.
- Test with numbers: verify with concrete values (e.g., d=2, n=3).
- Find the missing lemma: it is often a standard result (e.g., "by Cauchy-Schwarz").
- Check the appendix: the full proof is often in the appendix, with only a sketch in the main text.
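As an example of the "test with numbers" step, the sketch below checks Jensen's inequality for the concave function log, i.e. E[log X] <= log E[X] for X > 0, on many small random cases. The helper `jensen_gap` and the sampling range are assumptions for illustration. A numerical check like this cannot prove a claim; it can only fail to find a counterexample.

```python
import numpy as np

rng = np.random.default_rng(0)

def jensen_gap(x):
    """Return log(mean(x)) - mean(log(x)); Jensen's inequality says this is >= 0 for x > 0."""
    x = np.asarray(x, dtype=float)
    return float(np.log(x.mean()) - np.log(x).mean())

# Many small, concrete cases: n = 3 positive numbers each time.
gaps = [jensen_gap(rng.uniform(0.1, 5.0, size=3)) for _ in range(1000)]
min_gap = min(gaps)
print("smallest gap over 1000 trials:", min_gap)

# Equality case: a constant input gives a gap of (numerically) zero.
const_gap = jensen_gap([2.0, 2.0, 2.0])
print("gap for constant input:", const_gap)
```

If the smallest gap ever came out clearly negative, that would signal either a misread claim or a missing assumption (here, positivity of X).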
💡 Insight
When a paper says "trivially" or "obviously", the step may be obvious to the author but not to you. Do not skip it without deriving it yourself.
Core Derivation Skills
- Gradient check: verify the analytic gradient with finite differences: \frac{\partial f}{\partial x_i} \approx \frac{f(x + \epsilon e_i) - f(x - \epsilon e_i)}{2\epsilon}.
- Taylor expansion: approximation near a point: f(x + h) \approx f(x) + f'(x)h + \frac{1}{2}f''(x)h^2.
- Log trick: product → sum: \log \prod_i p_i = \sum_i \log p_i.
- Exponential (log-sum-exp) trick: sum → max: \log \sum_i e^{x_i} \approx \max_i x_i (approximate; exactly, \max_i x_i \le \log \sum_{i=1}^n e^{x_i} \le \max_i x_i + \log n).
- Completing the square: used to simplify Gaussian integrals and quadratic forms.
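Several of the skills above can be sanity-checked in a few lines. The sketch below is illustrative only: the `logsumexp` helper, the test vectors, and the use of exp as the function to expand are all assumptions chosen for convenience.

```python
import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))): subtract the max before exponentiating."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return float(m + np.log(np.sum(np.exp(x - m))))

# Log-sum-exp sits between max(x) and max(x) + log(n).
x = np.array([1.0, 2.0, 10.0])
lse = logsumexp(x)
lower, upper = x.max(), x.max() + np.log(x.size)
print(f"{lower} <= {lse} <= {upper}")

# Log trick: the log of a product equals the sum of the logs.
p = np.array([0.9, 0.5, 0.2])
log_prod = np.log(np.prod(p))
sum_log = np.sum(np.log(p))
print(log_prod, sum_log)

# Second-order Taylor expansion of exp around x0; the error should shrink like h^3.
x0 = 0.5

def taylor2(h):
    return np.exp(x0) * (1 + h + 0.5 * h ** 2)   # f = f' = f'' = exp for this choice of f

err_big = abs(np.exp(x0 + 0.1) - taylor2(0.1))
err_small = abs(np.exp(x0 + 0.01) - taylor2(0.01))
print("error ratio for h = 0.1 vs h = 0.01:", err_big / err_small)
```

Shrinking h by a factor of 10 should cut the Taylor error by roughly 10^3, which is a quick way to confirm the order of an approximation quoted in a paper.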
Reproducing Paper Results
From math to code, the key checkpoints:
- Write down the hyperparameter table: learning rate, batch size, weight decay, schedule.
- Check the initialization: Xavier, Kaiming, or custom?
- Look at the shape of the loss curve: does it show the expected behavior (monotonically decreasing? spikes?).
- Track the gradient norm: are gradients exploding? (What is the norm-clipping threshold?)
- Re-run the ablations: do the numbers match the paper's table exactly?
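For the gradient-norm checkpoint, a framework-agnostic sketch in NumPy is shown below. The function names, the example gradients, and the threshold `max_norm=1.0` are hypothetical; when a paper states its clipping threshold, that value should be used instead.

```python
import numpy as np

def global_grad_norm(grads):
    """L2 norm of all gradient arrays treated as one long vector."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients by one common factor so the global norm is at most max_norm."""
    norm = global_grad_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-12))   # small constant guards against division by zero
    return [g * scale for g in grads], norm

# Hypothetical gradients for two parameter tensors.
grads = [np.array([[3.0, 0.0], [0.0, 4.0]]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = global_grad_norm(clipped)

print("norm before clipping:", norm_before)
print("norm after clipping: ", norm_after)
```

Logging the pre-clipping norm at every step makes spikes visible even when clipping hides them from the loss curve.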
Python: Numerical Gradient Check
Python · NumPy
import numpy as np

def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference estimate of the gradient of a scalar function f at x."""
    x = np.array(x, dtype=float)  # float copy, so the caller's array is not modified
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + eps
        f_plus = f(x)
        x[idx] = old - eps
        f_minus = f(x)
        grad[idx] = (f_plus - f_minus) / (2 * eps)
        x[idx] = old  # restore the entry before moving to the next coordinate
    return grad

# Example: check the gradient of f(x) = x^T A x
A = np.array([[2.0, 1.0], [1.0, 3.0]])
x = np.array([1.0, 2.0])

def f(x):
    return x.T @ A @ x

# Analytic gradient: (A + A^T) x, which equals 2Ax here because A is symmetric
analytic = (A + A.T) @ x
numeric = numerical_gradient(f, x)
print("Analytic:", analytic)
print("Numeric: ", numeric)
print("Max abs diff:", np.abs(analytic - numeric).max())
Practice Tasks
- Derive the attention mechanism's equation (CH 39) yourself on paper, tracking the matrix shapes.
- Re-derive the ELBO (CH 50), starting from KL(q \| p(z|x)).
- Take the "proof sketch" from any paper and write out the full proof on paper.
- Prove the O(n²d) complexity of attention in Attention Is All You Need.
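As a companion to the attention tasks, the sketch below tracks shapes through single-head scaled dot-product attention, softmax(Q K^T / sqrt(d)) V, and counts the multiply-adds of the two matrix products. It is only a shape-tracking aid under simplifying assumptions (one head, no masking, queries, keys, and values all of shape (n, d)); it does not replace the paper derivation.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                            # (n, d) @ (d, n) -> (n, n)
    scores = scores - scores.max(axis=1, keepdims=True)      # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)   # each row sums to 1
    return weights @ V, weights                              # (n, n) @ (n, d) -> (n, d)

n, d = 5, 8   # illustrative sequence length and feature dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = attention(Q, K, V)

# Multiply-add counts for the two matrix products.
macs_scores = n * n * d    # Q K^T: n*n entries, each a length-d dot product
macs_output = n * n * d    # weights @ V: n*d entries, each a length-n dot product

print("output shape:", out.shape, "weights shape:", weights.shape)
print("multiply-adds:", macs_scores + macs_output)
```

Both products cost n * n * d multiply-adds, which is where the quadratic dependence on sequence length comes from.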
Interview Questions
- You cannot understand an equation in a paper. What approach do you take?
- Why is a gradient check important? When can it fail?
- A paper claims "universal approximation". How would you verify it intuitively?
- What should you understand when you see "WLOG" in a proof?
Summary
- The notation guide is your first line of defense when reading AI papers: you need to know L, ℒ, θ, φ, ∼, ∝.
- Common patterns: expectation, KL, chain rule, trace. You should recognize them on sight.
- Proof gap = backwards reasoning + concrete numbers + appendix search + standard lemmas.
- Gradient check = the safety net for your confidence in the math; always verify with finite differences.
- Reproduction = hyperparameters × initialization × loss shape × gradient norm × exact ablation.