Chapter 48: Matrix Calculus

📖 A Short Story

Backpropagation requires ∂L/∂W, but W is a matrix. What shape should a matrix derivative take: that of ∂vec(W) or that of ∂W? Matrix calculus resolves this confusion, and it is the key to reading the equations in AI papers.

Scalar by Scalar

The simplest case:

\frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}

Chain rule: \frac{\partial f}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x}
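The limit definition and the chain rule can be checked numerically. The sketch below is illustrative only: the functions `g(x) = x**2` and `f(u) = sin(u)` are assumptions chosen for the example, and it uses a central difference (a common, more accurate variant of the forward-difference limit shown above).

```python
import math

def numerical_derivative(func, x, h=1e-6):
    """Central-difference approximation of d(func)/dx at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

# Hypothetical composition for illustration: f(g(x)) with g(x) = x^2, f(u) = sin(u)
def g(x):
    return x ** 2

def f(u):
    return math.sin(u)

def composed(x):
    return f(g(x))

x0 = 1.5
# Chain rule: df/dx = (df/dg) * (dg/dx) = cos(x^2) * 2x
analytic = math.cos(g(x0)) * 2 * x0
numeric = numerical_derivative(composed, x0)

print("analytic:", analytic)
print("numeric: ", numeric)
```

The two printed values should agree to several decimal places; a large gap usually points to a mistake in the hand-derived formula.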

Vector by Scalar

f(x), where x ∈ ℝ and f ∈ ℝⁿ:

\frac{\partial f}{\partial x} = \left[ \frac{\partial f_1}{\partial x}, \frac{\partial f_2}{\partial x}, \dots, \frac{\partial f_n}{\partial x} \right]^T

Result: a column vector (the single column of the n×1 Jacobian).
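As a small illustration of the vector-by-scalar case, the sketch below differentiates a hypothetical curve f(x) = [x, x², sin x]ᵀ (an assumption made up for this example) and confirms that the result is an n×1 column.

```python
import numpy as np

def f(x):
    """Hypothetical curve f: R -> R^3, returned as a (3, 1) column."""
    return np.array([[x], [x ** 2], [np.sin(x)]])

def df_dx(x):
    """Hand-derived derivative of each component with respect to x."""
    return np.array([[1.0], [2 * x], [np.cos(x)]])

x0 = 0.7
h = 1e-6
numeric = (f(x0 + h) - f(x0 - h)) / (2 * h)   # central difference, component-wise

print("shape:", df_dx(x0).shape)               # one entry per output component
print("match:", np.allclose(df_dx(x0), numeric, atol=1e-6))
```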

Scalar by Vector (Gradient)

A loss L(θ), where θ ∈ ℝᵈ:

\nabla_\theta L = \left[ \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \dots, \frac{\partial L}{\partial \theta_d} \right]^T

Result: a column vector (the gradient vector). It points in the direction in which the loss increases fastest.
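A gradient derived by hand can be sanity-checked against finite differences. The loss below, L(θ) = ½‖θ‖² + sin(θ₁), is a made-up example and not taken from the chapter; `numerical_grad` is a hypothetical helper name.

```python
import numpy as np

def loss(theta):
    """Illustrative loss: 0.5 * ||theta||^2 + sin(theta_1)."""
    return 0.5 * float(np.sum(theta ** 2)) + float(np.sin(theta[0, 0]))

def analytic_grad(theta):
    """Hand-derived gradient: theta, plus cos(theta_1) in the first entry."""
    grad = theta.copy()
    grad[0, 0] += np.cos(theta[0, 0])
    return grad

def numerical_grad(f, theta, h=1e-6):
    """Central-difference gradient, one coordinate at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.shape[0]):
        e = np.zeros_like(theta)
        e[i, 0] = h
        grad[i, 0] = (f(theta + e) - f(theta - e)) / (2 * h)
    return grad

theta = np.array([[0.3], [-1.2], [2.0]])
g_analytic = analytic_grad(theta)
g_numeric = numerical_grad(loss, theta)

print("gradient shape:", g_analytic.shape)     # same shape as theta
print("match:", np.allclose(g_analytic, g_numeric, atol=1e-6))
# A small step along the gradient should increase the loss
print("loss increases:", loss(theta + 1e-3 * g_analytic) > loss(theta))
```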

Vector by Vector (Jacobian)

f: ℝⁿ → ℝᵐ:

J = \frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix} \in \mathbb{R}^{m \times n}

In backprop, each layer's Jacobian chains the gradient backward: \delta^{(\ell)} = J^T \delta^{(\ell+1)}, where J is the Jacobian of the map from layer ℓ to layer ℓ+1.
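The sketch below builds the m×n Jacobian of a hypothetical map f: ℝ³ → ℝ² (chosen only for illustration), checks it against finite differences, and then applies the backward rule δ = Jᵀ δ_next. The upstream vector `delta_next` is an arbitrary assumed value.

```python
import numpy as np

def f(x):
    """Hypothetical layer f: R^3 -> R^2."""
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0]])

def jacobian_analytic(x):
    """Row i holds the partials of f_i with respect to x_1..x_n."""
    return np.array([
        [x[1], x[0], 0.0],
        [1.0, 0.0, np.cos(x[2])],
    ])

def jacobian_numeric(func, x, h=1e-6):
    """Central-difference Jacobian, built one column (one input) at a time."""
    m, n = func(x).shape[0], x.shape[0]
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (func(x + e) - func(x - e)) / (2 * h)
    return J

x = np.array([0.5, -1.0, 2.0])
J = jacobian_analytic(x)
delta_next = np.array([1.0, -2.0])     # assumed upstream gradient dL/df
delta = J.T @ delta_next               # backward rule: (n, m) @ (m,) -> (n,)

print("J shape:", J.shape)
print("Jacobian match:", np.allclose(J, jacobian_numeric(f, x), atol=1e-6))
print("delta shape:", delta.shape)
```

Note that the product Jᵀ δ_next never needs J as a stored matrix in practice; frameworks compute it directly, which is the idea behind vector-Jacobian products.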

Matrix Derivatives

The Most Common Case: ∂L/∂W

z = Wx + b with W ∈ ℝ^(m×d), x ∈ ℝᵈ, b ∈ ℝᵐ, and L = L(z); write δ = ∂L/∂z:

\frac{\partial L}{\partial W} = \frac{\partial L}{\partial z} \cdot x^T = \delta \cdot x^T

Shape check: δ ∈ ℝ^(m×1), xᵀ ∈ ℝ^(1×d) → the outer product is m×d, the same shape as W.

\frac{\partial L}{\partial x} = W^T \frac{\partial L}{\partial z} = W^T \delta

Shape check: Wᵀ ∈ ℝ^(d×m), δ ∈ ℝᵐ → a d-dimensional vector (matches x).
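Both formulas can be verified with a non-square W so that a transposition mistake would show up as a shape error. This is a minimal sketch under assumptions: the loss L(z) = ½‖z‖² (so that δ = z) and the sizes m = 3, d = 2 are chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 3, 2                              # assumed sizes, deliberately different
W = rng.normal(size=(m, d))
x = rng.normal(size=(d, 1))
b = rng.normal(size=(m, 1))

def loss(W_, x_, b_):
    """Illustrative loss L(z) = 0.5 * ||z||^2 with z = W x + b."""
    z_ = W_ @ x_ + b_
    return 0.5 * float(np.sum(z_ ** 2))

def finite_diff(f, P, h=1e-6):
    """Central-difference gradient of scalar f with respect to array P."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P)
        E[idx] = h
        G[idx] = (f(P + E) - f(P - E)) / (2 * h)
    return G

z = W @ x + b
delta = z                                # dL/dz for this particular loss
dL_dW = delta @ x.T                      # (m, 1) @ (1, d) -> (m, d)
dL_dx = W.T @ delta                      # (d, m) @ (m, 1) -> (d, 1)

num_dW = finite_diff(lambda P: loss(P, x, b), W)
num_dx = finite_diff(lambda P: loss(W, P, b), x)

print("dL/dW shape:", dL_dW.shape, "match:", np.allclose(dL_dW, num_dW, atol=1e-6))
print("dL/dx shape:", dL_dx.shape, "match:", np.allclose(dL_dx, num_dx, atol=1e-6))
```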

Chain Rule for Matrices

The chain rule for matrices, written element-wise so that no layout convention is needed:

\frac{\partial L}{\partial X_{ij}} = \sum_{k,l} \frac{\partial L}{\partial Y_{kl}} \frac{\partial Y_{kl}}{\partial X_{ij}}

Key insight: the trace trick always produces a scalar, so the shape ambiguity disappears. Once dL is written in the form tr(Gᵀ dX), the matrix G is ∂L/∂X.

dL = \text{tr}\left(\left(\frac{\partial L}{\partial Y}\right)^T dY\right)
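As a worked example, the trace form recovers the outer-product rule for z = Wx + b. The derivation below treats x and b as constants and uses δ = ∂L/∂z as defined earlier; it is one way to organize the steps, not the only one.

```latex
\begin{aligned}
z &= Wx + b \;\Rightarrow\; dz = dW\,x \\
dL &= \text{tr}\left(\delta^T dz\right) = \text{tr}\left(\delta^T\, dW\, x\right) \\
   &= \text{tr}\left(x\,\delta^T\, dW\right) \qquad \text{(cyclic property of the trace)} \\
   &= \text{tr}\left(\left(\delta x^T\right)^T dW\right) \\
\Rightarrow\; \frac{\partial L}{\partial W} &= \delta x^T
\end{aligned}
```

The last step reads the gradient directly off the matrix that multiplies dW inside the trace.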

Python: Manual Backprop with Matrix Calculus

Python · NumPy

import numpy as np

# Forward: z = W @ x + b, y = relu(z), L = ||y - t||^2
W = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2x2
x = np.array([[1.0], [2.0]])            # 2x1
b = np.array([[0.5], [0.5]])            # 2x1
t = np.array([[3.0], [7.0]])            # target

# Forward
z = W @ x + b
y = np.maximum(0, z)                    # ReLU
L = np.sum((y - t)**2)

# Backward using matrix calculus rules
dL_dy = 2 * (y - t)                     # gradient of scalar L w.r.t. vector y
dy_dz = np.diag((z > 0).ravel().astype(float))  # ReLU Jacobian (diagonal, 0/1 entries)
dL_dz = dy_dz.T @ dL_dy                 # chain rule: J^T @ upstream gradient
dL_dW = dL_dz @ x.T                     # outer product
dL_db = dL_dz                           # same shape as b
dL_dx = W.T @ dL_dz                     # chain rule for input

print("dL/dW shape:", dL_dW.shape)       # (2, 2) = W shape
print("dL/db shape:", dL_db.shape)       # (2, 1) = b shape
print("dL/dx shape:", dL_dx.shape)       # (2, 1) = x shape
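A gradient check is a common way to confirm hand-written backprop. The sketch below repeats the same forward pass and compares the analytic gradients with central differences; `loss_fn` and `finite_diff` are hypothetical helper names introduced here, and the element-wise mask is used in place of the explicit diagonal Jacobian (the two are equivalent for ReLU).

```python
import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([[1.0], [2.0]])
b = np.array([[0.5], [0.5]])
t = np.array([[3.0], [7.0]])

def loss_fn(W_, x_, b_):
    """Forward pass only: L = ||relu(W x + b) - t||^2."""
    y_ = np.maximum(0, W_ @ x_ + b_)
    return float(np.sum((y_ - t) ** 2))

def finite_diff(f, P, h=1e-6):
    """Central-difference gradient of scalar f with respect to array P."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P)
        E[idx] = h
        G[idx] = (f(P + E) - f(P - E)) / (2 * h)
    return G

# Analytic gradients, following the same rules as the manual backprop
z = W @ x + b
dL_dz = (z > 0) * 2 * (np.maximum(0, z) - t)   # mask == diagonal ReLU Jacobian
dL_dW = dL_dz @ x.T
dL_db = dL_dz
dL_dx = W.T @ dL_dz

# Numerical gradients for comparison
num_dW = finite_diff(lambda P: loss_fn(P, x, b), W)
num_db = finite_diff(lambda P: loss_fn(W, x, P), b)
num_dx = finite_diff(lambda P: loss_fn(W, P, b), x)

print("dL/dW ok:", np.allclose(dL_dW, num_dW, atol=1e-5))
print("dL/db ok:", np.allclose(dL_db, num_db, atol=1e-5))
print("dL/dx ok:", np.allclose(dL_dx, num_dx, atol=1e-5))
```

The check is only meaningful away from the ReLU kink at z = 0; here both entries of z are well above zero.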

Common Identities

\partial(Ax)/\partial x = A (linear).
\partial(x^T A x)/\partial x = (A + A^T)x (quadratic).
\partial \text{tr}(AB)/\partial A = B^T (trace).
\partial \det(A)/\partial A = \det(A) \cdot (A^{-1})^T (determinant).
\partial \log \det(A)/\partial A = A^{-T} (log-determinant; used in covariance matrix optimization).
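Rather than memorizing these identities, they can be spot-checked numerically. The sketch below is one possible check on random 3×3 matrices; `numeric_grad` is a hypothetical helper, and the symmetric positive definite matrix `S` is an assumption introduced so that log det is well defined.

```python
import numpy as np

def numeric_grad(f, P, h=1e-6):
    """Central-difference gradient of scalar f with respect to array P."""
    G = np.zeros_like(P)
    for idx in np.ndindex(*P.shape):
        E = np.zeros_like(P)
        E[idx] = h
        G[idx] = (f(P + E) - f(P - E)) / (2 * h)
    return G

rng = np.random.default_rng(1)
n = 3
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
x = rng.normal(size=(n, 1))
S = A @ A.T + n * np.eye(n)      # symmetric positive definite, so det(S) > 0

checks = {
    # d(x^T A x)/dx = (A + A^T) x
    "quadratic": np.allclose(
        numeric_grad(lambda v: (v.T @ A @ v).item(), x), (A + A.T) @ x, atol=1e-5),
    # d tr(AB)/dA = B^T
    "trace": np.allclose(
        numeric_grad(lambda M: float(np.trace(M @ B)), A), B.T, atol=1e-5),
    # d det(S)/dS = det(S) * (S^{-1})^T
    "det": np.allclose(
        numeric_grad(lambda M: float(np.linalg.det(M)), S),
        np.linalg.det(S) * np.linalg.inv(S).T, atol=1e-5),
    # d log det(S)/dS = S^{-T}
    "logdet": np.allclose(
        numeric_grad(lambda M: float(np.linalg.slogdet(M)[1]), S),
        np.linalg.inv(S).T, atol=1e-5),
}

for name, ok in checks.items():
    print(name, "ok" if ok else "MISMATCH")
```

Each entry of the matrix is perturbed independently, which matches the convention under which these identities are stated.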

Practice Tasks

Find the gradient of f(x) = xᵀAx + bᵀx + c.
For L = ||Wx − t||², derive ∂L/∂W.
What is the shape of the Jacobian of a sigmoid layer σ(Wx)?
What shape does ∂L/∂W have in batch matrix multiplication?

Interview Questions

Why is ∂L/∂W an outer product in backprop?
What is the trace trick, and why is it useful?
What is the shape of the Hessian matrix? What do its eigenvalues tell you?
What is a vector-Jacobian product (VJP), and why is it efficient in automatic differentiation?

Summary

Matrix calculus = derivative rules for every shape: scalar, vector, and matrix.
Gradient ∇_θ L = column vector, Jacobian = matrix of partials.
∂L/∂W = δxᵀ (outer product): the foundation of backprop.
The trace trick removes shape ambiguity, which helps when reading research papers.
Do not memorize the common identities (linear, quadratic, trace, determinant); learn the derivation pattern instead.
