Chapter 31: Learning Rate Scheduling

📖 A short story

An athlete in training starts with a warmup, moves on to full intensity, and finishes with a cooldown. A neural network is similar: changing the learning rate carefully over time usually gives much better convergence. A constant LR is like constant intensity, and it is usually suboptimal.

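Why scheduling is needed

A large LR at the start moves the model quickly into the region of interest.
A small LR at the end lets the optimizer settle precisely into a minimum instead of bouncing around it.
Warmup avoids unstable gradients during the first steps.
Restarts help the optimizer escape from a local minimum.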

Common Schedules

Step decay

η_t = η₀ · γ^⌊t/k⌋ (the LR is multiplied by γ every k epochs)

Exponential decay

η_t = η₀ · e^(−λt)
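
The exponential formula above can be sketched in a few lines of Python. The function name and the default values of lr0 and lam are illustrative assumptions, not values from this chapter. The half-life ln(2)/λ is a handy way to choose λ: it is the number of steps after which the LR has halved.

```python
import math

def exponential_decay(t, lr0=0.1, lam=1e-3):
    """eta_t = lr0 * exp(-lam * t); lr0 and lam are illustrative defaults."""
    return lr0 * math.exp(-lam * t)

# Number of steps after which the LR has dropped to half of lr0.
half_life = math.log(2) / 1e-3

for t in [0, 500, 1000, 5000]:
    print(f"step {t:5d}  lr = {exponential_decay(t):.6f}")
print(f"half-life = {half_life:.1f} steps")
```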

Cosine annealing

η_t = η_min + ½(η_max − η_min)(1 + cos(πt/T))

A smooth decrease; standard for Transformers and CV models.

Linear warmup + decay

The LR rises linearly from 0 to η_max over the first w steps, then decays. Used in BERT and GPT.
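
A minimal sketch of this schedule follows. The text only says "then decay", so the linear decay to zero used here is an assumption, and the function name and default lr_max are hypothetical.

```python
def linear_warmup_decay(t, warmup, total, lr_max=1e-4):
    """Linear warmup 0 -> lr_max, then linear decay lr_max -> 0 (assumed shape)."""
    if t < warmup:
        return lr_max * t / warmup
    # Decay reaches 0 at t = total; requires total > warmup.
    return lr_max * max(0.0, (total - t) / (total - warmup))

for t in [0, 50, 100, 550, 1000]:
    print(f"step {t:4d}  lr = {linear_warmup_decay(t, warmup=100, total=1000):.6f}")
```

The max(0.0, ...) guard keeps the LR at zero if training runs past the planned total.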

Cyclic LR (CLR) / One-cycle

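The LR oscillates between a lower and an upper bound in a triangular pattern (CLR, proposed by Leslie Smith). One-cycle uses a single rise and fall over the whole run and was popularized by fast.ai.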
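
The triangular pattern can be sketched as below. This is the basic triangular policy with a fixed amplitude; the function name and the base_lr, max_lr, and step_size values are illustrative assumptions. step_size is the length of a half cycle in steps.

```python
import math

def triangular_clr(t, step_size, base_lr=1e-4, max_lr=1e-2):
    """Triangular cyclic LR: base_lr -> max_lr -> base_lr every 2 * step_size steps."""
    cycle = math.floor(1 + t / (2 * step_size))
    # x is 1 at the ends of a cycle and 0 at its midpoint.
    x = abs(t / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for t in [0, 500, 1000, 1500, 2000, 3000]:
    print(f"step {t:4d}  lr = {triangular_clr(t, step_size=1000):.6f}")
```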

Cosine with warm restarts (SGDR)

At the end of each cycle the LR is reset to its maximum, which gives several "exploration" phases.
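
A minimal sketch of cosine annealing with warm restarts, reusing the cosine formula from above inside each cycle. It assumes every cycle has the same length (SGDR also allows cycles that grow over time); the function name and defaults are hypothetical.

```python
import math

def sgdr_lr(t, cycle_len, lr_max=1e-3, lr_min=1e-5):
    """Cosine annealing that restarts at lr_max every cycle_len steps."""
    t_cur = t % cycle_len  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t_cur / cycle_len))

for t in [0, 50, 99, 100, 150, 199, 200]:
    print(f"step {t:3d}  lr = {sgdr_lr(t, cycle_len=100):.6f}")
```

The jump from a small LR at step 99 back to lr_max at step 100 is the restart.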

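Warmup: why it matters

In Adam, the second-moment estimate v is built from only a handful of gradients in the early steps, so it is unreliable and can be too small. That makes the effective LR very large and causes instability. Transformers trained without warmup often diverge.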
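
One way to see how little Adam knows in its earliest steps is to compute the very first bias-corrected update by hand. The sketch below uses the standard Adam update rule with its usual default hyperparameters; the function name is hypothetical. On step 1 the moment estimates come from a single gradient, so every parameter moves by roughly the full LR regardless of whether its gradient is tiny or huge. Warmup keeps that step small until the estimates have seen more data.

```python
import math

def adam_first_update(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first step for a scalar gradient g (m and v start at 0)."""
    m = (1 - beta1) * g
    v = (1 - beta2) * g * g
    m_hat = m / (1 - beta1)   # bias correction at step 1
    v_hat = v / (1 - beta2)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

for g in [1e-4, 1e-2, 1.0, 100.0]:
    print(f"grad = {g:8.4f}  first update = {adam_first_update(g):.6f}")
```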

💡 Insight

Rule of thumb: the larger the batch or the model, the more warmup is needed. GPT-3 warmed up over its first 375M tokens.

Python Implementation


import math

def cosine_lr(t, T, lr_max=1e-3, lr_min=1e-5):
    # Cosine annealing: lr_max at t = 0, lr_min at t = T.
    return lr_min + 0.5*(lr_max - lr_min)*(1 + math.cos(math.pi*t/T))

def warmup_cosine(t, warmup, T, lr_max=1e-3):
    # Linear warmup from 0 to lr_max over the first `warmup` steps,
    # then cosine decay over the remaining T - warmup steps.
    if t < warmup:
        return lr_max * t / warmup
    return cosine_lr(t - warmup, T - warmup, lr_max)

for t in [0, 100, 500, 1000, 5000, 10000]:
    print(f"step {t:5d}  lr = {warmup_cosine(t, warmup=1000, T=10000):.6f}")

# Step decay
def step_decay(epoch, lr0=0.1, gamma=0.1, k=30):
    return lr0 * gamma**(epoch // k)

print()
for e in [0, 10, 30, 60, 90]:
    print(f"epoch {e:3d}  lr = {step_decay(e):.6f}")

# PyTorch scheduler
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# A single dummy parameter stands in for a real model.
model_params = [torch.zeros(1, requires_grad=True)]
opt = torch.optim.Adam(model_params, lr=1e-3)
sched = CosineAnnealingLR(opt, T_max=100, eta_min=1e-5)
for _ in range(5):
    opt.step()    # optimizer first ...
    sched.step()  # ... then the scheduler
print(f"\nAfter 5 steps: lr = {sched.get_last_lr()[0]:.6f}")

LR Finder

Leslie Smith's technique: train for a few batches while increasing the LR exponentially, then plot loss against LR. The region where the loss falls fastest is the sweet spot. It costs little runtime before a production training run, and the payoff is large.
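
The procedure can be sketched on a toy problem. Here a deterministic two-dimensional quadratic stands in for mini-batch training, and all names (loss_and_grad, lr_range_test, steepest_lr), the curvature values, and the "stop when loss exceeds 4x the best loss" rule are assumptions made for illustration. In practice loss_and_grad would be one forward and backward pass on a batch.

```python
import math

def loss_and_grad(w, curvatures):
    """Toy quadratic loss 0.5 * sum(h * w^2) and its gradient."""
    loss = 0.5 * sum(h * x * x for h, x in zip(curvatures, w))
    grad = [h * x for h, x in zip(curvatures, w)]
    return loss, grad

def lr_range_test(lr_start=1e-4, lr_end=10.0, num_steps=100):
    """Take one SGD step per LR while the LR grows exponentially."""
    curvatures = [1.0, 10.0]
    w = [1.0, 1.0]
    factor = (lr_end / lr_start) ** (1 / (num_steps - 1))
    lr, best, history = lr_start, float("inf"), []
    for _ in range(num_steps):
        loss, grad = loss_and_grad(w, curvatures)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > 4 * best:  # loss is blowing up, stop the sweep
            break
        w = [x - lr * g for x, g in zip(w, grad)]
        lr *= factor
    return history

def steepest_lr(history):
    """LR at which loss falls fastest per unit of log(LR)."""
    best_lr, best_slope = None, 0.0
    for (lr0, l0), (lr1, l1) in zip(history, history[1:]):
        slope = (l1 - l0) / (math.log(lr1) - math.log(lr0))
        if slope < best_slope:
            best_lr, best_slope = lr0, slope
    return best_lr

history = lr_range_test()
print(f"suggested lr ~ {steepest_lr(history):.4g}")
```

On real, noisy training curves the recorded losses are usually smoothed before the slope is computed.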

AI/ML connections

Transformer pretraining: linear warmup followed by cosine or inverse-sqrt decay.
ResNet: step decay (e.g. ÷10 at epochs 30, 60, and 90).
Fine-tuning: a very low constant LR (5e-5) or cosine.
RL: often a constant LR plus an entropy schedule.
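
The inverse-sqrt decay mentioned for Transformer pretraining is not covered by the formulas earlier in the chapter, so here is a minimal sketch. It uses one common parameterization (linear warmup to a peak, then decay proportional to 1/sqrt(step)); the function name and the default warmup and lr_max values are illustrative assumptions.

```python
def inverse_sqrt_lr(step, warmup=4000, lr_max=1e-3):
    """Linear warmup to lr_max, then lr_max * sqrt(warmup / step)."""
    step = max(step, 1)  # avoid division by zero at step 0
    if step < warmup:
        return lr_max * step / warmup
    return lr_max * (warmup / step) ** 0.5

for step in [1, 1000, 4000, 16000, 64000]:
    print(f"step {step:6d}  lr = {inverse_sqrt_lr(step):.6f}")
```

Unlike cosine decay, this schedule needs no total step count, which is convenient when the training length is not fixed in advance.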

Common Mistakes

Training a Transformer with Adam and no warmup.
Not tuning "k" in step decay, so the LR drops too early or too late.
Calling the scheduler's step before the optimizer's step.
Reusing the high pretraining LR for fine-tuning, which leads to catastrophic forgetting.

Practice Tasks

Compare constant LR, cosine, and step decay on CIFAR-10.
Run an LR finder and determine a good LR.
Try warmup steps = 0, 100, 1000. How does the final accuracy change?

Assignment

Implement five schedules (constant, step, exponential, cosine, warmup+cosine) in NumPy. Plot the LR curve of each one for T=10000 steps. Then compare them in actual training on a small MLP and analyze the learning curves.

Interview Questions

Why is warmup critical for Transformers?
Why does cosine annealing work better than step decay in many cases?
What is an LR finder and how does it work?
Why, intuitively, does the "restart" in SGDR work?

Mini Project

"LR Schedule Designer": the user enters a start LR, warmup length, total steps, and schedule type; the tool plots the LR curve and suggests recommended hyperparameters.

Summary

A constant LR is rarely optimal; a schedule is often a nearly free performance gain.
Warmup → peak → decay is the modern training recipe.
Cosine annealing is a common default in NLP and CV.
An LR finder is a cheap discovery step before production training.

✨ Next step

Chapter 32 covers Regularization Mathematics: the math of preventing overfitting.
