অধ্যায় 49 — Advanced Optimization

📖 একটি ছোট গল্প

Saddle point-এ stuck হয়ে AdamW ১০০০ epoch train করলেও loss কমছে না। Newton's method দিয়ে Hessian invert করলে negative eigenvalue — update উল্টো দিকে চলে গেল। Advanced optimization শুধু faster নয়, smarter — curvature, constraints, second-order information সব use করে।

Second-Order Methods

Newton's method — Hessian H = ∇²L use করে:

\theta_{t+1} = \theta_t - H^{-1} \nabla L

Advantage: quadratic convergence (super fast near optimum)।

Problem: O(d³) to invert Hessian, O(d²) memory — deep learning-এ impossible।

Quasi-Newton: L-BFGS

Hessian invert না করে approximate করে — previous gradients থেকে:

s_t = \theta_{t+1} - \theta_t, \quad y_t = \nabla L_{t+1} - \nabla L_t

L-BFGS (Limited-memory BFGS) শুধু last m pairs রাখে — O(md) memory।

💡 ইনসাইট

L-BFGS small models (vision before deep learning), full-batch convex problems-এ excellent। Large deep nets-এ সতর্ক — noisy gradients-এ performance degrade হয়।

Natural Gradient

Standard gradient Euclidean geometry-এ optimizes। কিন্তু parameter space-এর intrinsic geometry = Fisher information metric:

\theta_{t+1} = \theta_t - \alpha F^{-1} \nabla L

F = \mathbb{E}[\nabla \log p(x|\theta) \nabla \log p(x|\theta)^T] = Fisher Information Matrix।

Natural gradient = gradient যা parameter space-এর true geometry মানে।

Approximation: K-FAC (Kronecker-Factored Approximate Curvature) —每层 আলাদা Kronecker product দিয়ে Fisher approximate করে, practical for deep nets।

Hessian-Free Optimization

Hessian store না করে Hv (Hessian-vector product) compute করে:

Hv = \nabla_\theta ((\nabla L)^T v) \quad \text{(Pearlmutter trick)}

Conjugate Gradient দিয়ে H⁻¹∇L solve — O(d) memory, O(κ) iterations।

Used in: early deep learning (Hinton et al.), some RL training pipelines।

Constrained Optimization

Constraints সহ optimize — Lagrangian:

\mathcal{L}(\theta, \lambda) = f(\theta) + \lambda^T g(\theta)

Example: neural architecture search-এ resource constraint, fair ML-এ demographic parity constraint।

Lagrange multiplier — equality constraints-এর জন্য।
KKT conditions — inequality constraints (e.g., g(θ) ≤ 0) এর জন্য।
Primal-Dual methods — alternating update θ ও λ।

Modern Advanced Optimizers

Shampoo — matrix preconditioning, each parameter matrix-এর জন্য separate preconditioner।
SOAP — Shampoo + Adam hybrid, state-of-the-art for some LLM training।
Muon — orthogonal gradient updates, recent optimizer with strong theoretical guarantees।
Schedule-Free — learning rate schedule-ই লাগে না, momentum-এর মাধ্যমে automatic adaptation।

Python: Hessian-Vector Product

pythonPython · NumPy

import torch

def hvp(loss, params, v):
    """Hessian-vector product using autograd twice."""
    grad = torch.autograd.grad(loss, params, create_graph=True)
    grad_flat = torch.cat([g.flatten() for g in grad])
    
    # Second derivative in direction v
    grad_v = torch.sum(grad_flat * v)
    hvp_result = torch.autograd.grad(grad_v, params, retain_graph=True)
    return torch.cat([g.flatten() for g in hvp_result])

# Example usage
x = torch.randn(10, requires_grad=True)
loss = torch.sum(x ** 3)          # L = sum(x^3)
v = torch.randn(10)
hv = hvp(loss, [x], v)
print("Hessian-vector product:", hv)

Practice Tasks

f(x) = x⁴ হলে Newton's method-এর update formula লিখুন।
Fisher Information Matrix-এর rank কত? কেন?
Hessian-free method-এ কেন H⁻¹ store করি না?
Constrained problem min x² s.t. x ≥ 1 — KKT conditions লিখুন।

Interview Questions

Newton vs Quasi-Newton — trade-offs কী?
Natural gradient কেন first-order method-এর চেয়ে better convergence দেয়?
Large model-এ second-order method কেন ব্যবহার করা হয় না?
KKT conditions কোনো constrained ML problem-এ কীভাবে apply করবেন?

Summary · সারসংক্ষেপ

Second-order methods (Newton) fast কিন্তু O(d³) — deep learning-এ impractical।
Quasi-Newton (L-BFGS) = Hessian approximate, memory efficient for moderate d।
Natural gradient = Fisher metric-এ optimize, K-FAC দিয়ে practical approximation।
Hessian-free = Hv product + CG, O(d) memory।
Constrained optimization = Lagrange multipliers, KKT — fairness, NAS, robustness-এ।

পূর্ববর্তী · CH 48

Matrix Calculus

পরবর্তী · CH 50

Variational Inference