Saddle point-এ stuck হয়ে AdamW ১০০০ epoch train করলেও loss কমছে না। Newton's method দিয়ে Hessian invert করলে negative eigenvalue — update উল্টো দিকে চলে গেল। Advanced optimization শুধু faster নয়, smarter — curvature, constraints, second-order information সব use করে।
Second-Order Methods
Newton's method — Hessian H = ∇²L use করে:
Advantage: quadratic convergence (super fast near optimum)।
Problem: O(d³) to invert Hessian, O(d²) memory — deep learning-এ impossible।
Quasi-Newton: L-BFGS
Hessian invert না করে approximate করে — previous gradients থেকে:
L-BFGS (Limited-memory BFGS) শুধু last m pairs রাখে — O(md) memory।
Natural Gradient
Standard gradient Euclidean geometry-এ optimizes। কিন্তু parameter space-এর intrinsic geometry = Fisher information metric:
F = \mathbb{E}[\nabla \log p(x|\theta) \nabla \log p(x|\theta)^T] = Fisher Information Matrix।
Natural gradient = gradient যা parameter space-এর true geometry মানে।
Approximation: K-FAC (Kronecker-Factored Approximate Curvature) —每层 আলাদা Kronecker product দিয়ে Fisher approximate করে, practical for deep nets।
Hessian-Free Optimization
Hessian store না করে Hv (Hessian-vector product) compute করে:
Conjugate Gradient দিয়ে H⁻¹∇L solve — O(d) memory, O(κ) iterations।
Used in: early deep learning (Hinton et al.), some RL training pipelines।
Constrained Optimization
Constraints সহ optimize — Lagrangian:
Example: neural architecture search-এ resource constraint, fair ML-এ demographic parity constraint।
- Lagrange multiplier — equality constraints-এর জন্য।
- KKT conditions — inequality constraints (e.g., g(θ) ≤ 0) এর জন্য।
- Primal-Dual methods — alternating update θ ও λ।
Modern Advanced Optimizers
- Shampoo — matrix preconditioning, each parameter matrix-এর জন্য separate preconditioner।
- SOAP — Shampoo + Adam hybrid, state-of-the-art for some LLM training।
- Muon — orthogonal gradient updates, recent optimizer with strong theoretical guarantees।
- Schedule-Free — learning rate schedule-ই লাগে না, momentum-এর মাধ্যমে automatic adaptation।
Python: Hessian-Vector Product
import torch
def hvp(loss, params, v):
"""Hessian-vector product using autograd twice."""
grad = torch.autograd.grad(loss, params, create_graph=True)
grad_flat = torch.cat([g.flatten() for g in grad])
# Second derivative in direction v
grad_v = torch.sum(grad_flat * v)
hvp_result = torch.autograd.grad(grad_v, params, retain_graph=True)
return torch.cat([g.flatten() for g in hvp_result])
# Example usage
x = torch.randn(10, requires_grad=True)
loss = torch.sum(x ** 3) # L = sum(x^3)
v = torch.randn(10)
hv = hvp(loss, [x], v)
print("Hessian-vector product:", hv)Practice Tasks
- f(x) = x⁴ হলে Newton's method-এর update formula লিখুন।
- Fisher Information Matrix-এর rank কত? কেন?
- Hessian-free method-এ কেন H⁻¹ store করি না?
- Constrained problem min x² s.t. x ≥ 1 — KKT conditions লিখুন।
Interview Questions
- Newton vs Quasi-Newton — trade-offs কী?
- Natural gradient কেন first-order method-এর চেয়ে better convergence দেয়?
- Large model-এ second-order method কেন ব্যবহার করা হয় না?
- KKT conditions কোনো constrained ML problem-এ কীভাবে apply করবেন?
Summary · সারসংক্ষেপ
- Second-order methods (Newton) fast কিন্তু O(d³) — deep learning-এ impractical।
- Quasi-Newton (L-BFGS) = Hessian approximate, memory efficient for moderate d।
- Natural gradient = Fisher metric-এ optimize, K-FAC দিয়ে practical approximation।
- Hessian-free = Hv product + CG, O(d) memory।
- Constrained optimization = Lagrange multipliers, KKT — fairness, NAS, robustness-এ।