Logit soft-capping

Logit

From Google's Machine Learning Glossary:

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.

Logit soft-capping

Logit soft-capping (introduced in Gemma 2) is a technique that bounds logit values between $-\text{soft\_cap}$ and $+\text{soft\_cap}$:

$$ \text{logits} \leftarrow \text{soft\_cap} \cdot \tanh\left(\frac{\text{logits}}{\text{soft\_cap}}\right) $$

Gemma 2 uses a soft cap of 30 for the final logits.

From: Methods of Improving LLM Training Stability