Logit soft-capping
Logit soft-capping (introduced in Gemma 2) is a technique that smoothly squashes logits into the range $(-soft\_cap, +soft\_cap)$:
$$ \text{logits} \leftarrow soft\_cap \cdot \tanh\!\left(\frac{\text{logits}}{soft\_cap}\right) $$
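A minimal sketch of the operation in PyTorch (the function name `soft_cap` is illustrative; the Gemma 2 report applies this with a cap of 50.0 for attention logits and 30.0 for final logits):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Smoothly squash logits into (-cap, +cap) via tanh."""
    return cap * torch.tanh(logits / cap)

# Example: extreme values are squashed toward +/- cap,
# while values well inside the cap pass through nearly unchanged.
x = torch.tensor([-100.0, -10.0, 0.0, 10.0, 100.0])
print(soft_cap(x, cap=30.0))  # outputs lie strictly within (-30, 30)
```

Because $\tanh$ is bounded and differentiable, this keeps logits finite without the hard gradient cutoff that plain clipping would introduce.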
From: Methods of Improving LLM Training Stability