Chapter 37: Attention Mathematics

📖 A Short Story

In 1998, Yann LeCun built a network for recognizing handwritten digits that slides the same filter across the whole image, much as your eye looks for the same pattern in different parts of a picture. That is the Convolutional Neural Network, the foundation of computer vision.

Convolution Math

Discrete 2D convolution (technically cross-correlation in DL):

(I * K)[i, j] = Σₘ Σₙ I[i+m, j+n] · K[m, n]
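
The double sum above can be written out directly as two nested loops. The following is a minimal sketch, assuming no padding and stride 1 ("valid" mode); the function name `cross_correlate2d` and the toy inputs are illustrative and not part of any library.

```python
import numpy as np

def cross_correlate2d(image, kernel):
    """Naive 'valid' cross-correlation: no padding, stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=float)
    for i in range(out_h):
        for j in range(out_w):
            # Sum over m, n of I[i+m, j+n] * K[m, n]
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy input: I[i, j] = 4*i + j, so I[i, j] - I[i+1, j+1] is always -5
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
result = cross_correlate2d(image, kernel)
print(result)  # 3×3 array filled with -5.0
```

Because the kernel is not flipped, this is cross-correlation, which is what deep learning frameworks compute under the name "convolution".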

Output size:

O = ⌊(W − F + 2P) / S⌋ + 1

where W is the input size, F the filter size, P the padding, and S the stride.
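
The formula is easy to check by hand on a few common settings. Below is a small sketch; the helper name `conv_output_size` is hypothetical, and it assumes square inputs and filters with the same padding on both sides.

```python
def conv_output_size(w, f, p=0, s=1):
    """O = floor((W - F + 2P) / S) + 1 for one spatial dimension."""
    return (w - f + 2 * p) // s + 1

same_size = conv_output_size(32, 3, p=1, s=1)    # 3×3 conv with padding 1 keeps 32
valid_size = conv_output_size(5, 3)              # no padding shrinks 5 to 3
strided_size = conv_output_size(224, 7, p=3, s=2)  # floor(223 / 2) + 1 = 112

print(same_size, valid_size, strided_size)  # → 32 3 112
```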

Why CNN > MLP (on images)

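Parameter sharing: the same filter is applied across the whole image, so far fewer parameters are needed.
Translation equivariance: a pattern is detected wherever it appears in the image; pooling then adds a degree of translation invariance.
Local connectivity: each neuron connects only to a small neighbourhood, because nearby pixels are the most relevant to each other.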
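
To put a number on the parameter-sharing point, the sketch below compares a 3×3 convolution (3 → 16 channels) with a fully connected layer that maps the same 3×32×32 input to the same 16×32×32 output. The layer sizes are chosen for illustration, and the helper names are hypothetical.

```python
def conv2d_params(c_in, c_out, k):
    """Weights (c_in * c_out * k * k) plus one bias per output channel."""
    return c_in * c_out * k * k + c_out

def dense_params(n_in, n_out):
    """Weights (n_in * n_out) plus one bias per output unit."""
    return n_in * n_out + n_out

conv_count = conv2d_params(3, 16, 3)
dense_count = dense_params(3 * 32 * 32, 16 * 32 * 32)

print("conv :", conv_count)    # → 448
print("dense:", dense_count)   # → 50348032
```

The convolution needs 448 parameters regardless of image size, while the dense layer grows with the product of input and output sizes.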

Pooling

Max/average pooling reduces the spatial dimensions, increases robustness to small translations, and cuts computation. Modern architectures (ResNet-50 and later) often replace pooling with strided convolutions.
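
As a concrete illustration, here is a minimal NumPy sketch of non-overlapping max pooling on a single 2D feature map. It assumes the window size equals the stride and drops any rows or columns that do not fill a complete window; `max_pool2d` is an illustrative name, not a library function.

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max pooling (window = stride = size) on a 2D array."""
    h_out, w_out = x.shape[0] // size, x.shape[1] // size
    x = x[:h_out * size, :w_out * size]           # drop incomplete windows
    blocks = x.reshape(h_out, size, w_out, size)  # split into size×size tiles
    return blocks.max(axis=(1, 3))                # max within each tile

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 0, 5, 6],
                        [1, 2, 7, 8]], dtype=float)
pooled = max_pool2d(feature_map)
print(pooled)  # 4×4 input becomes 2×2: [[4, 2], [2, 8]]
```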

Receptive Field

The receptive field is how large a region of the input each neuron in a deep layer "sees". Deeper layers capture global patterns; shallow layers capture edges and textures. Stacking 3×3 convolutions gives a large receptive field with fewer parameters (the VGG insight).
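
The receptive field of a stack of layers can be computed with a simple recurrence: each layer adds (k − 1) times the product of all earlier strides. The sketch below assumes plain convolutions without dilation; `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), ordered from input to output."""
    rf, jump = 1, 1          # jump = distance in input pixels between adjacent outputs
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

rf_two = receptive_field([(3, 1), (3, 1)])       # same field as one 5×5
rf_three = receptive_field([(3, 1)] * 3)         # same field as one 7×7
rf_strided = receptive_field([(3, 2), (3, 1)])   # stride widens later layers

# Weights per input/output channel pair, ignoring biases
stacked_weights = 3 * (3 * 3)   # three 3×3 layers
single_weights = 7 * 7          # one 7×7 layer

print(rf_two, rf_three, rf_strided)     # → 5 7 7
print(stacked_weights, single_weights)  # → 27 49
```

Three stacked 3×3 layers therefore cover a 7×7 region with 27 weights per channel pair instead of 49, and add two extra nonlinearities along the way.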

Python Implementation

Python · NumPy

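import numpy as np
from scipy.signal import correlate2d

# 5×5 image: a bright 3×3 square on a dark background
img = np.array([
    [0,0,0,0,0],
    [0,1,1,1,0],
    [0,1,1,1,0],
    [0,1,1,1,0],
    [0,0,0,0,0],
], dtype=float)

# 3×3 kernel whose weights sum to zero: flat regions give 0, edges give a response
edge = np.array([[-1,-1,-1],
                 [-1, 8,-1],
                 [-1,-1,-1]], dtype=float)

# 'valid' mode: no padding, so the 5×5 image yields a 3×3 response map
out = correlate2d(img, edge, mode='valid')
print("Edge response:\n", out)   # corners 5, edge midpoints 3, centre 0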

# PyTorch version
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)             # NCHW
conv = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
print("output:", conv(x).shape)           # → [1, 16, 32, 32]
print("params:", sum(p.numel() for p in conv.parameters()))  # → 448 (3·16·3·3 weights + 16 biases)

Famous Architectures

LeNet-5 (1998): first practical CNN
AlexNet (2012): ReLU + GPU, the ImageNet revolution
VGG (2014): stacked 3×3
ResNet (2015): skip connection, y = F(x) + x
EfficientNet, ConvNeXt: modern scaling
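
The skip connection y = F(x) + x from the ResNet entry can be sketched in a few lines. For brevity this example assumes F is a pair of small dense layers rather than the convolutions a real ResNet block uses, and all names and shapes here are illustrative.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, w1, w2):
    """y = F(x) + x, with F(x) = relu(x @ w1) @ w2 (dense stand-in for conv layers)."""
    f_x = relu(x @ w1) @ w2
    return f_x + x           # the skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))            # batch of 4 feature vectors
w1 = rng.normal(size=(8, 8)) * 0.1
w2 = rng.normal(size=(8, 8)) * 0.1

y = residual_block(x, w1, w2)

# If the last layer of F is all zeros, the block reduces to the identity
y_identity = residual_block(x, w1, np.zeros((8, 8)))
print(np.allclose(y_identity, x))  # → True
```

The identity case shows why the design helps depth: a block can fall back to passing its input through unchanged, so adding blocks need not make the network worse.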

Summary

CNN = local + shared filters, an ideal inductive bias for vision.
Memorize the output-size formula and the idea of the receptive field.
Without skip connections like ResNet's, networks with 100+ layers are very hard to train.
