Activation Functions
Activation functions are the non-linear transformations applied element-wise (or gate-wise) inside neural networks. Without them, any composition of linear layers collapses to a single linear map, and the network can represent only affine functions. This page provides a rigorous treatment of the activations used in modern transformer models, together with their ZigLlama implementations.
1. Why Non-Linearity?
1.1 The Linear Collapse Problem
Linear Collapse
Let \( f_i(x) = W_i x + b_i \) for \( i = 1, \dots, L \). The composition \( f_L \circ \cdots \circ f_1 \) is itself an affine map:
\[ (f_L \circ \cdots \circ f_1)(x) = W' x + b' \]
where \( W' = W_L W_{L-1} \cdots W_1 \) and \( b' = \sum_{i=1}^{L} (W_L \cdots W_{i+1})\, b_i \) is the accumulated bias (for \( i = L \) the empty product is the identity).
Proof sketch. Induction on \( L \). Base case (\(L=1\)) is trivial. If \( g = f_{L-1} \circ \cdots \circ f_1 = W''x + b'' \), then \( f_L(g(x)) = W_L(W''x + b'') + b_L = (W_L W'')x + (W_L b'' + b_L) \), which is affine. \(\square\)
This means a 100-layer "deep" network of purely linear transformations has no more representational capacity than a single affine layer (one matrix multiply plus a bias). Stacking layers buys us nothing without non-linearity.
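The collapse can also be checked numerically. The following Zig sketch is not part of ZigLlama: the `Vec2` and `Mat2` types, the `affine` and `matmul` helpers, and the sample values are all illustrative assumptions. It evaluates two affine layers one after the other and compares the result with the single layer \( W' = W_2 W_1 \), \( b' = W_2 b_1 + b_2 \).

```zig
const std = @import("std");

// Illustrative 2-D types; ZigLlama's Tensor type is deliberately not used here.
const Vec2 = [2]f32;
const Mat2 = [2][2]f32;

/// y = W x + b for a 2x2 matrix and 2-vectors.
fn affine(w: Mat2, b: Vec2, x: Vec2) Vec2 {
    return .{
        w[0][0] * x[0] + w[0][1] * x[1] + b[0],
        w[1][0] * x[0] + w[1][1] * x[1] + b[1],
    };
}

/// Matrix product a * b.
fn matmul(a: Mat2, b: Mat2) Mat2 {
    var out: Mat2 = undefined;
    for (0..2) |i| {
        for (0..2) |j| {
            out[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j];
        }
    }
    return out;
}

test "two affine layers collapse into one" {
    const w1 = Mat2{ .{ 1.0, 2.0 }, .{ 0.0, -1.0 } };
    const b1 = Vec2{ 0.5, -0.5 };
    const w2 = Mat2{ .{ 3.0, 0.0 }, .{ 1.0, 1.0 } };
    const b2 = Vec2{ 1.0, 2.0 };
    const x = Vec2{ 0.25, -4.0 };

    // Layer-by-layer evaluation: f2(f1(x)).
    const stacked = affine(w2, b2, affine(w1, b1, x));

    // Collapsed layer: W' = W2 W1 and b' = W2 b1 + b2.
    const w_prime = matmul(w2, w1);
    const b_prime = affine(w2, b2, b1);
    const collapsed = affine(w_prime, b_prime, x);

    try std.testing.expectApproxEqAbs(stacked[0], collapsed[0], 1e-5);
    try std.testing.expectApproxEqAbs(stacked[1], collapsed[1], 1e-5);
}
```

Saved to a file, the sketch can be run with `zig test`.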
1.2 Universal Approximation
Universal Approximation Theorem (Cybenko 1989, Hornik 1991)
A feed-forward network with a single hidden layer and a non-polynomial activation function can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to arbitrary precision, given sufficiently many hidden units. [1]
The theorem guarantees existence of a good approximation but says nothing about how efficiently the network can be trained. In practice, deeper networks with well-chosen activations learn more efficiently than wide shallow ones.
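2. Classical Activations
2.1 Rectified Linear Unit (ReLU)
ReLU
\[ \operatorname{ReLU}(x) = \max(0, x) \]
Derivative:
\[ \operatorname{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases} \]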
The derivative is the Heaviside step function (undefined at exactly zero, where a sub-gradient of 0 is conventionally used).
Properties:
| Property | Value |
|---|---|
| Range | \([0, +\infty)\) |
| Smoothness | Piecewise linear, not differentiable at 0 |
| Gradient | Constant 1 for \(x>0\); exactly 0 for \(x<0\) |
| Computational cost | One comparison |
Dying ReLU Problem
If a neuron's pre-activation is always negative (e.g., due to a large negative bias learned during training), its gradient is permanently zero and the neuron can never recover. In wide networks, a significant fraction of neurons can "die" this way, wasting capacity. Leaky ReLU and modern smooth activations were designed partly to address this.
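To illustrate the contrast with Leaky ReLU mentioned above, here is a minimal scalar sketch. The function names and the slope `alpha = 0.01` are assumptions chosen for illustration; none of this is taken from ZigLlama.

```zig
const std = @import("std");

/// Leaky ReLU: x for x > 0, alpha * x otherwise.
fn leakyRelu(x: f32, alpha: f32) f32 {
    if (x > 0.0) return x;
    return alpha * x;
}

/// Derivative of Leaky ReLU; the value at x == 0 is set to alpha by convention.
fn leakyReluGrad(x: f32, alpha: f32) f32 {
    if (x > 0.0) return 1.0;
    return alpha;
}

/// Derivative of plain ReLU, with the sub-gradient 0 at x == 0.
fn reluGrad(x: f32) f32 {
    if (x > 0.0) return 1.0;
    return 0.0;
}

test "leaky ReLU keeps a gradient where ReLU has none" {
    const alpha: f32 = 0.01;
    // A negative pre-activation gives ReLU no gradient at all...
    try std.testing.expectEqual(@as(f32, 0.0), reluGrad(-3.0));
    // ...while Leaky ReLU still passes a small one back.
    try std.testing.expectEqual(alpha, leakyReluGrad(-3.0, alpha));
    try std.testing.expectApproxEqAbs(@as(f32, -0.03), leakyRelu(-3.0, alpha), 1e-6);
}
```

Because the negative-side slope is `alpha` rather than 0, a neuron whose pre-activation is always negative still receives updates and can move back into the active regime.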
2.2 Sigmoid
Sigmoid (Logistic)
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
Derivative: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\).
| Property | Value |
|---|---|
| Range | \((0, 1)\) |
| Smoothness | Infinitely differentiable |
| Gradient at saturation | Approaches 0 for \(\lvert x \rvert \gg 0\) |
Vanishing Gradient
For inputs with large magnitude, the sigmoid saturates and its derivative is near zero; even at its peak, \(\sigma'(0) = 1/4\). In deep networks, the chain rule multiplies these small factors layer by layer, so gradients can shrink exponentially with depth -- the vanishing gradient problem -- making early layers very slow to train.
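A worked bound makes the exponential shrinkage concrete. It deliberately ignores the weight matrices in the backward pass (a simplifying assumption for illustration) and uses only \(\sigma'(x) = \sigma(x)(1 - \sigma(x)) \le 1/4\). The symbol \(z_l\) is introduced here for the pre-activation at layer \(l\):

\[
\prod_{l=1}^{L} \sigma'(z_l) \;\le\; \left(\tfrac{1}{4}\right)^{L},
\qquad \text{e.g. } L = 10:\quad 4^{-10} = \tfrac{1}{1\,048\,576} \approx 9.5 \times 10^{-7}.
\]

So even in the best case, where every unit sits at its steepest point, ten stacked sigmoids scale the activation-derivative factor of the gradient by less than one millionth.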
2.3 Hyperbolic Tangent
Tanh
\[ \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \]
Equivalently, \(\tanh(x) = 2\sigma(2x) - 1\).
| Property | Value |
|---|---|
| Range | \((-1, 1)\) |
| Smoothness | Infinitely differentiable |
| Zero-centered | Yes (unlike sigmoid) |
| Gradient at saturation | Approaches 0 |
Tanh was the default activation in early neural networks. Its zero-centered output is advantageous for gradient dynamics, but it still suffers from vanishing gradients in the saturated regime.
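The identity \(\tanh(x) = 2\sigma(2x) - 1\) stated earlier in this section can be checked numerically. This is a small sketch with hypothetical helper names (`sigmoid`, `tanhViaSigmoid`), compared against `std.math.tanh` from the Zig standard library.

```zig
const std = @import("std");

/// Logistic sigmoid, 1 / (1 + e^{-x}).
fn sigmoid(x: f32) f32 {
    return 1.0 / (1.0 + @exp(-x));
}

/// tanh expressed through the sigmoid: 2 * sigmoid(2x) - 1.
fn tanhViaSigmoid(x: f32) f32 {
    return 2.0 * sigmoid(2.0 * x) - 1.0;
}

test "tanh(x) equals 2*sigmoid(2x) - 1" {
    const xs = [_]f32{ -3.0, -0.5, 0.0, 0.5, 3.0 };
    for (xs) |x| {
        // The two forms agree up to f32 rounding.
        try std.testing.expectApproxEqAbs(std.math.tanh(x), tanhViaSigmoid(x), 1e-5);
    }
}
```

The identity also shows why tanh inherits the sigmoid's saturation: it is the same curve, rescaled to \((-1, 1)\) and compressed horizontally by a factor of two.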
3. Modern Activations
3.1 GELU -- Gaussian Error Linear Unit
GELU (Hendrycks & Gimpel 2016)
\[ \operatorname{GELU}(x) = x\,\Phi(x) \]
where \(\Phi(x) = \frac{1}{2}\bigl[1 + \operatorname{erf}(x/\sqrt{2})\bigr]\) is the CDF of the standard normal distribution. [2]
Intuition. GELU weights the input \(x\) by the probability that a standard Gaussian random variable is less than \(x\). Large positive inputs pass through nearly unchanged (\(\Phi(x) \approx 1\)); large negative inputs are suppressed (\(\Phi(x) \approx 0\)); and intermediate values are smoothly interpolated.
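As a worked instance of this weighting, using standard normal CDF values rounded to four decimals (\(\Phi(2) \approx 0.9772\), \(\Phi(-2) \approx 0.0228\)):

\[
\begin{aligned}
\operatorname{GELU}(2) &= 2\,\Phi(2) \approx 2 \times 0.9772 \approx 1.954, \\
\operatorname{GELU}(0) &= 0 \cdot \Phi(0) = 0, \\
\operatorname{GELU}(-2) &= -2\,\Phi(-2) \approx -2 \times 0.0228 \approx -0.046.
\end{aligned}
\]

The positive input passes through almost unchanged, while the negative input is reduced to a small negative residue rather than clipped to exactly zero.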
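Fast approximation (tanh-based):
\[ \operatorname{GELU}(x) \approx \tfrac{1}{2}\, x \left[ 1 + \tanh\!\left( \sqrt{2/\pi}\, \bigl( x + 0.044715\, x^{3} \bigr) \right) \right] \]
This approximation avoids computing the error function and is used in many implementations, including ZigLlama.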
| Property | Value |
|---|---|
| Range | \(\approx [-0.17, +\infty)\) |
| Smoothness | Infinitely differentiable |
| Dead neurons | None (gradient is exactly zero only at the minimum, \(x \approx -0.75\)) |
| Used in | BERT, GPT-2, GPT-3 |
3.2 SiLU / Swish
SiLU (Ramachandran et al. 2017)
\[ \operatorname{SiLU}(x) = x\,\sigma(x) = \frac{x}{1 + e^{-x}} \]
Also known as Swish with a fixed \(\beta = 1\). [3]
SiLU is self-gated: the input acts as both the signal and the gate. It shares many properties with GELU -- smooth, non-monotonic near zero, and asymptotically linear for large positive inputs -- but uses the simpler sigmoid rather than the Gaussian CDF.
| Property | Value |
|---|---|
| Range | \(\approx [-0.28, +\infty)\) |
| Smoothness | Infinitely differentiable |
| Dead neurons | None |
| Used in | LLaMA, PaLM, Switch Transformer |
3.3 SwiGLU -- SiLU-Gated Linear Unit
SwiGLU (Shazeer 2020)
\[ \operatorname{SwiGLU}(x) = \operatorname{SiLU}(x W_1) \odot (x W_2) \]
where \(\odot\) is element-wise multiplication and \(W_1, W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}\). [4]
SwiGLU is a gated linear unit variant that uses SiLU as its gating activation. The key idea is to split the computation into two parallel streams -- one processed through SiLU (the gate), the other left linear (the content) -- and then combine them via element-wise multiplication.
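This architecture requires three weight matrices per feed-forward layer (gate, up, down) rather than two. Shazeer (2020) compensates by shrinking \(d_{ff}\) by a factor of \(2/3\), which keeps the parameter count and compute roughly constant, and reports better model quality under that matched budget.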
| Property | Value |
|---|---|
| Parameters | \(3 \cdot d_{\text{model}} \cdot d_{ff}\) |
| Gating | Learned, via SiLU |
| Used in | LLaMA, LLaMA 2, PaLM |
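The three-matrix structure can be sketched for a single token as follows. This is an illustrative, self-contained example and not ZigLlama's feed-forward code: the tiny fixed sizes, the names `swigluFfn`, `w_gate`, `w_up`, `w_down`, and the use of plain arrays instead of `Tensor(T)` are all assumptions.

```zig
const std = @import("std");

// Illustrative sizes only; real models use values in the thousands.
const d_model = 2;
const d_ff = 3;

fn silu(x: f32) f32 {
    return x / (1.0 + @exp(-x));
}

/// SwiGLU feed-forward block for one token (row vector x):
/// out = (SiLU(x * W_gate) .* (x * W_up)) * W_down
fn swigluFfn(
    x: [d_model]f32,
    w_gate: [d_model][d_ff]f32,
    w_up: [d_model][d_ff]f32,
    w_down: [d_ff][d_model]f32,
) [d_model]f32 {
    // Hidden units: gate stream through SiLU, content stream left linear.
    var hidden: [d_ff]f32 = undefined;
    for (0..d_ff) |j| {
        var gate: f32 = 0.0;
        var up: f32 = 0.0;
        for (0..d_model) |i| {
            gate += x[i] * w_gate[i][j];
            up += x[i] * w_up[i][j];
        }
        hidden[j] = silu(gate) * up;
    }
    // Down projection back to d_model.
    var out = [_]f32{0.0} ** d_model;
    for (0..d_model) |k| {
        for (0..d_ff) |j| {
            out[k] += hidden[j] * w_down[j][k];
        }
    }
    return out;
}

test "SwiGLU block on a tiny example" {
    const x = [d_model]f32{ 1.0, 0.0 };
    const w_gate = [d_model][d_ff]f32{ .{ 1.0, 1.0, 1.0 }, .{ 1.0, 1.0, 1.0 } };
    const w_up = [d_model][d_ff]f32{ .{ 2.0, 2.0, 2.0 }, .{ 2.0, 2.0, 2.0 } };
    const w_down = [d_ff][d_model]f32{ .{ 1.0, 1.0 }, .{ 1.0, 1.0 }, .{ 1.0, 1.0 } };
    const out = swigluFfn(x, w_gate, w_up, w_down);
    // Each of the 3 hidden units equals SiLU(1) * 2; the down projection sums them.
    try std.testing.expectApproxEqAbs(@as(f32, 6.0) * silu(1.0), out[0], 1e-5);
    try std.testing.expectApproxEqAbs(out[0], out[1], 1e-6);
}
```

The gate and up projections share the same input and shape, which is why implementations often fuse them into one matrix multiply and split the result afterwards.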
3.4 GeGLU -- GELU-Gated Linear Unit
GeGLU (Shazeer 2020)
\[ \operatorname{GeGLU}(x) = \operatorname{GELU}(x W_1) \odot (x W_2) \]
GeGLU replaces the SiLU gate in SwiGLU with GELU. The two variants perform comparably in practice; the choice often follows from consistency with the rest of the architecture (e.g., if the model already uses GELU elsewhere).
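That the two variants differ only in the gate activation can be shown with one generic helper. The sketch below works on already-projected scalars; the helper name `gated` and the test values are illustrative assumptions, and `gelu` uses the tanh-based approximation.

```zig
const std = @import("std");

fn silu(x: f32) f32 {
    return x / (1.0 + @exp(-x));
}

/// Tanh-based GELU approximation; 0.7978845608 is sqrt(2/pi).
fn gelu(x: f32) f32 {
    const inner = 0.7978845608 * (x + 0.044715 * x * x * x);
    return 0.5 * x * (1.0 + std.math.tanh(inner));
}

/// Generic gated unit on already-projected scalars: act(gate) * content.
fn gated(comptime act: fn (f32) f32, gate: f32, content: f32) f32 {
    return act(gate) * content;
}

test "SwiGLU and GeGLU differ only in the gate activation" {
    const gate: f32 = 1.0;
    const content: f32 = 2.0;
    const swiglu_out = gated(silu, gate, content);
    const geglu_out = gated(gelu, gate, content);
    // SiLU(1) is about 0.731 and GELU(1) about 0.841, so the outputs are close but not equal.
    try std.testing.expectApproxEqAbs(@as(f32, 1.4621), swiglu_out, 1e-3);
    try std.testing.expectApproxEqAbs(@as(f32, 1.6824), geglu_out, 1e-3);
}
```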
4. Comparison Table
| Activation | Range | Smooth | Dead Neurons | FLOPs/elem | Notable Models |
|---|---|---|---|---|---|
| ReLU | \([0, \infty)\) | No | Yes | 1 | Early CNNs, some MLPs |
| Sigmoid | \((0, 1)\) | Yes | No (but vanishing grad) | ~10 | Gates in LSTMs, GLU |
| Tanh | \((-1, 1)\) | Yes | No (but vanishing grad) | ~10 | LSTMs, early RNNs |
| GELU | \(\approx[-0.17, \infty)\) | Yes | No | ~15 | BERT, GPT-2, GPT-3 |
| SiLU | \(\approx[-0.28, \infty)\) | Yes | No | ~10 | LLaMA, PaLM |
| SwiGLU | \(\mathbb{R}\) | Yes | No | ~20 (+ extra matmul) | LLaMA, LLaMA 2 |
| GeGLU | \(\mathbb{R}\) | Yes | No | ~25 (+ extra matmul) | Some T5 variants |
Choosing an Activation
For modern decoder-only LLMs, SwiGLU is the dominant choice (LLaMA, Mistral, PaLM). For encoder models and older architectures, GELU remains standard (BERT, GPT-2). ReLU is rarely used in transformers today but remains pedagogically important.
5. Activation Function Landscape
graph LR
    subgraph "Classical"
        ReLU
        Sigmoid
        Tanh
    end
    subgraph "Modern Smooth"
        GELU
        SiLU["SiLU / Swish"]
    end
    subgraph "Gated Variants"
        GLU["GLU (sigmoid gate)"]
        SwiGLU["SwiGLU (SiLU gate)"]
        GeGLU["GeGLU (GELU gate)"]
    end
    ReLU --> GELU
    Sigmoid --> SiLU
    Sigmoid --> GLU
    SiLU --> SwiGLU
    GELU --> GeGLU
    GLU --> SwiGLU
    GLU --> GeGLU
6. Implementation in ZigLlama
6.1 Type Enum and Dispatcher
ZigLlama defines all supported activations in a single enum and provides a dispatcher that is generic over the element type at compile time and selects the activation with a run-time switch:
/// Activation function types supported in transformers
pub const ActivationType = enum {
    ReLU,
    GELU,
    SiLU, // Also known as Swish
    GLU, // Gated Linear Unit
    GeGLU, // GELU-based Gated Linear Unit
    SwiGLU, // SiLU-based Gated Linear Unit
    Tanh,
    Sigmoid,
};

/// Generic activation function dispatcher
pub fn applyActivation(
    comptime T: type,
    activation_type: ActivationType,
    input: Tensor(T),
    allocator: Allocator,
) TensorError!Tensor(T) {
    return switch (activation_type) {
        .ReLU => relu(T, input),
        .GELU => gelu(T, input, allocator),
        .SiLU => silu(T, input, allocator),
        .GLU => glu(T, input, allocator),
        .GeGLU => geglu(T, input, allocator),
        .SwiGLU => swiglu(T, input, allocator),
        .Tanh => tanh_activation(T, input, allocator),
        .Sigmoid => sigmoid(T, input, allocator),
    };
}
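The dispatch pattern can be tried in isolation with a self-contained scalar analogue. The `ScalarActivation` enum and `applyScalar` function below are hypothetical stand-ins written for this example; they operate in place on a plain slice and do not use ZigLlama's `Tensor(T)` or `TensorError`.

```zig
const std = @import("std");

/// Hypothetical scalar-only subset of the activation enum, for illustration.
const ScalarActivation = enum { relu, silu, sigmoid };

/// Applies the selected activation in place; the switch is evaluated at run time.
fn applyScalar(kind: ScalarActivation, data: []f32) void {
    for (data) |*v| {
        const x = v.*;
        v.* = switch (kind) {
            .relu => @max(0.0, x),
            .silu => x / (1.0 + @exp(-x)),
            .sigmoid => 1.0 / (1.0 + @exp(-x)),
        };
    }
}

test "dispatch on an enum value" {
    var data = [_]f32{ -1.0, 0.0, 2.0 };
    applyScalar(.relu, &data);
    try std.testing.expectEqualSlices(f32, &[_]f32{ 0.0, 0.0, 2.0 }, &data);
}
```

Because the `switch` covers every enum member, adding a new activation to the enum without adding a prong is a compile error, which keeps the enum and the dispatcher in sync.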
6.2 Scalar Activation Functions
Each scalar activation operates element-wise over a Tensor(T).
/// ReLU: max(0, x)
/// Note: allocates the result with the input tensor's own allocator.
pub fn relu(comptime T: type, input: Tensor(T)) TensorError!Tensor(T) {
    var result = try Tensor(T).init(input.allocator, input.shape);
    for (0..input.size) |i| {
        result.data[i] = @max(0.0, input.data[i]);
    }
    return result;
}

/// GELU: tanh-based approximation
pub fn gelu(comptime T: type, input: Tensor(T), allocator: Allocator) TensorError!Tensor(T) {
    var result = try Tensor(T).init(allocator, input.shape);
    for (0..input.size) |i| {
        const x = input.data[i];
        // 0.7978845608 is sqrt(2 / pi)
        const inner = 0.7978845608 * (x + 0.044715 * x * x * x);
        result.data[i] = 0.5 * x * (1.0 + std.math.tanh(inner));
    }
    return result;
}

/// SiLU / Swish: x * sigmoid(x)
pub fn silu(comptime T: type, input: Tensor(T), allocator: Allocator) TensorError!Tensor(T) {
    var result = try Tensor(T).init(allocator, input.shape);
    for (0..input.size) |i| {
        const x = input.data[i];
        result.data[i] = x * (1.0 / (1.0 + @exp(-x)));
    }
    return result;
}
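6.3 Gated Activations
Gated variants split the input tensor along the last dimension. The first half is the content stream; the second half passes through the gating activation. Note that the excerpt below indexes the flat buffer in two contiguous halves, which coincides with a last-dimension split only when the input is effectively a single row, assuming row-major storage.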
/// SwiGLU: content * SiLU(gate)
pub fn swiglu(comptime T: type, input: Tensor(T), allocator: Allocator) TensorError!Tensor(T) {
    const last_dim = input.shape[input.shape.len - 1];
    const half_size = input.size / 2;
    // ... shape validation omitted for brevity ...
    var result = try Tensor(T).init(allocator, output_shape);
    for (0..half_size) |i| {
        const a = input.data[i]; // content
        const b = input.data[i + half_size]; // gate
        const sigmoid_b = 1.0 / (1.0 + @exp(-b));
        result.data[i] = a * (b * sigmoid_b); // a * SiLU(b)
    }
    return result;
}
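For inputs with more than one row, splitting along the last dimension means splitting each row separately. The following sketch shows that indexing on a plain slice. It assumes a contiguous row-major buffer of shape `[rows, 2 * half]`; the function `swigluRows` and its parameters are hypothetical and not part of ZigLlama.

```zig
const std = @import("std");

/// SwiGLU over a row-major buffer of shape [rows, 2 * half].
/// Within each row, the first `half` values are content and the last `half` are gate.
fn swigluRows(input: []const f32, rows: usize, half: usize, out: []f32) void {
    std.debug.assert(input.len == rows * 2 * half);
    std.debug.assert(out.len == rows * half);
    for (0..rows) |r| {
        const row = input[r * 2 * half ..][0 .. 2 * half];
        for (0..half) |j| {
            const a = row[j]; // content
            const b = row[half + j]; // gate
            out[r * half + j] = a * (b / (1.0 + @exp(-b))); // a * SiLU(b)
        }
    }
}

test "each row is split on its own" {
    // Two rows laid out as [content0, content1, gate0, gate1].
    const input = [_]f32{ 1.0, 2.0, 0.0, 0.0, 3.0, 4.0, 0.0, 0.0 };
    var out: [4]f32 = undefined;
    swigluRows(&input, 2, 2, &out);
    // Every gate is 0 and SiLU(0) = 0, so all outputs vanish.
    for (out) |v| {
        try std.testing.expectEqual(@as(f32, 0.0), v);
    }
}
```

With the same data, splitting the flat buffer into two contiguous halves would instead treat the whole second row, including its content values 3 and 4, as gates and produce non-zero outputs, so the two indexing schemes agree only for a single row.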
Source File
Full implementation: src/neural_primitives/activations.zig (approximately 380 lines including tests).
7. Numerical Considerations
7.1 GELU Approximation Accuracy
The tanh-based GELU approximation has a maximum absolute error of approximately \(4 \times 10^{-4}\) relative to the exact \(x\Phi(x)\) form. For inference in f32, this is well within acceptable precision.
7.2 Sigmoid Overflow
For very negative inputs (below roughly \(-88.7\)), \(e^{-x}\) overflows f32. ZigLlama relies on IEEE 754 semantics: the overflow produces \(+\infty\), and \(1/(1+\infty) = 0\) is the correct saturated value. No explicit clamping is required.
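The saturation behaviour can be observed directly. This is a minimal sketch with a hypothetical `sigmoidNaive` helper that mirrors the expression used in the text. The input 100 is chosen because \(e^{100}\) exceeds the largest finite f32.

```zig
const std = @import("std");

/// Naive sigmoid, relying on exp overflowing to +inf for very negative x.
fn sigmoidNaive(x: f32) f32 {
    return 1.0 / (1.0 + @exp(-x));
}

test "naive sigmoid saturates cleanly in f32" {
    var big: f32 = 100.0;
    _ = &big; // keep the value runtime-known so exp is evaluated at run time

    // exp(100) is not representable in f32 and becomes +inf.
    try std.testing.expect(std.math.isInf(@exp(big)));
    // 1 / (1 + inf) = 0: the correct limit for very negative inputs.
    try std.testing.expectEqual(@as(f32, 0.0), sigmoidNaive(-big));
    // For very positive inputs exp(-x) underflows toward 0, giving exactly 1.
    try std.testing.expectEqual(@as(f32, 1.0), sigmoidNaive(big));
}
```

One edge case this does not cover: for an input of exactly \(-\infty\), the product \(x \cdot \sigma(x)\) in SiLU becomes \(-\infty \cdot 0\), which is NaN under IEEE 754.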
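7.3 SiLU Near Zero
\(\operatorname{SiLU}(0) = 0\) exactly, and the derivative at zero is 0.5. Unlike ReLU, whose gradient is exactly zero for every negative input, SiLU's gradient is non-zero at the origin and is exactly zero only at its minimum near \(x \approx -1.28\), so neurons do not die outright.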
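These values can be checked with a few lines of Zig. The sketch uses the closed-form derivative \(\sigma(x)(1 + x(1 - \sigma(x)))\) and compares it with a central finite difference. The helper names and the step size `h` are illustrative assumptions, and the numerical agreement is a sanity check, not a proof.

```zig
const std = @import("std");

fn sigmoid(x: f32) f32 {
    return 1.0 / (1.0 + @exp(-x));
}

fn silu(x: f32) f32 {
    return x * sigmoid(x);
}

/// Closed-form derivative: sigmoid(x) * (1 + x * (1 - sigmoid(x))).
fn siluGrad(x: f32) f32 {
    const s = sigmoid(x);
    return s * (1.0 + x * (1.0 - s));
}

test "SiLU value and slope at the origin" {
    try std.testing.expectEqual(@as(f32, 0.0), silu(0.0));
    try std.testing.expectEqual(@as(f32, 0.5), siluGrad(0.0));
}

test "closed-form slope agrees with a central difference" {
    const h: f32 = 1e-2;
    const xs = [_]f32{ -2.0, -0.5, 1.0 };
    for (xs) |x| {
        const numeric = (silu(x + h) - silu(x - h)) / (2.0 * h);
        try std.testing.expectApproxEqAbs(siluGrad(x), numeric, 1e-3);
    }
}
```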
8. Exercises
- Prove that \(\operatorname{SiLU}'(x) = \sigma(x)(1 + x(1 - \sigma(x)))\).
- Implement a "leaky" GELU variant where negative outputs are scaled by a small constant \(\alpha\) instead of suppressed.
- Benchmark the per-element cost of GELU vs. SiLU on a 4096-element tensor using std.time.Timer. Which is faster and why?
- Explain why SwiGLU requires three weight matrices while standard FFN requires only two, and show that the parameter counts match when \(d_{ff}^{\text{SwiGLU}} = \frac{2}{3} d_{ff}^{\text{standard}}\).
References
[1] Cybenko, G. "Approximation by Superpositions of a Sigmoidal Function." Mathematics of Control, Signals, and Systems, 2(4):303--314, 1989.
[2] Hendrycks, D. & Gimpel, K. "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415, 2016.
[3] Ramachandran, P., Zoph, B. & Le, Q. V. "Searching for Activation Functions." arXiv:1710.05941, 2017.
[4] Shazeer, N. "GLU Variants Improve Transformer." arXiv:2002.05202, 2020.
[5] Nair, V. & Hinton, G. E. "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML, 2010.