Activation Functions

Activation functions are the non-linear transformations applied element-wise (or gate-wise) inside neural networks. Without them, any composition of linear layers collapses to a single linear map, and the network can represent only affine functions. This page provides a rigorous treatment of the activations used in modern transformer models, together with their ZigLlama implementations.

1. Why Non-Linearity?

1.1 The Linear Collapse Problem

Linear Collapse

Let \( f_i(x) = W_i x + b_i \) for \( i = 1, \dots, L \). The composition \( f_L \circ \cdots \circ f_1 \) is itself an affine map:

\[ f_L(\cdots f_1(x)) = W' x + b' \]

where \( W' = W_L W_{L-1} \cdots W_1 \) and \( b' \) is a corresponding accumulated bias.

Proof sketch. Induction on \( L \). Base case (\(L=1\)) is trivial. If \( g = f_{L-1} \circ \cdots \circ f_1 = W''x + b'' \), then \( f_L(g(x)) = W_L(W''x + b'') + b_L = (W_L W'')x + (W_L b'' + b_L) \), which is affine. \(\square\)
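
Unrolling the induction gives the accumulated bias in closed form. As a sketch, with the convention that the empty matrix product (the \(i = L\) term) is the identity:

\[ b' = \sum_{i=1}^{L} \bigl( W_L W_{L-1} \cdots W_{i+1} \bigr)\, b_i \]

For \(L = 2\) this reduces to \(b' = W_2 b_1 + b_2\), matching the inductive step above.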

This means a 100-layer "deep" network of purely linear transformations has no more representational capacity than a single affine map (one matrix multiply plus a bias). Stacking layers buys us nothing without non-linearity.

1.2 Universal Approximation

Universal Approximation Theorem (Cybenko 1989, Hornik 1991)

A feed-forward network with a single hidden layer and a continuous, non-polynomial activation function can approximate any continuous function on a compact subset of \(\mathbb{R}^n\) to arbitrary precision, given sufficiently many hidden units.¹ Cybenko proved this for sigmoidal activations and Hornik for bounded, non-constant ones; the non-polynomial condition is the later generalization by Leshno et al. (1993).

The theorem guarantees that a good approximation exists but says nothing about how many hidden units it needs or how efficiently the network can be trained. In practice, deeper networks with well-chosen activations learn more efficiently than wide shallow ones.

2. Classical Activations

2.1 Rectified Linear Unit (ReLU)

ReLU

\[ \operatorname{ReLU}(x) = \max(0,\, x) \]

Derivative:

\[ \operatorname{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases} \]

The derivative is the Heaviside step function (undefined at exactly zero, where a sub-gradient of 0 is conventionally used).

Properties:

Property	Value
Range	\([0, +\infty)\)
Smoothness	Piecewise linear, not differentiable at 0
Gradient	Constant 1 for \(x>0\); exactly 0 for \(x<0\)
Computational cost	One comparison

Dying ReLU Problem

If a neuron's pre-activation is always negative (e.g., due to a large negative bias learned during training), its gradient is permanently zero and the neuron can never recover. In wide networks, a significant fraction of neurons can "die" this way, wasting capacity. Leaky ReLU and modern smooth activations were designed partly to address this.
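
Leaky ReLU, mentioned above, can be sketched in a few lines. This is a scalar illustration only: the function name leakyRelu and the slope parameter alpha (0.01 in the test) are assumptions made here, not part of the ZigLlama API shown on this page.

```zig
const std = @import("std");

/// Leaky ReLU on a single scalar: x for x > 0, alpha * x otherwise.
/// `alpha` is a hypothetical slope parameter chosen for illustration.
pub fn leakyRelu(comptime T: type, x: T, alpha: T) T {
    return if (x > 0) x else alpha * x;
}

test "leaky ReLU keeps a non-zero slope for negative inputs" {
    // Positive inputs pass through unchanged.
    try std.testing.expectEqual(@as(f32, 2.0), leakyRelu(f32, 2.0, 0.01));
    // Negative inputs are scaled instead of zeroed, so the gradient is alpha, not 0.
    try std.testing.expectApproxEqAbs(@as(f32, -0.03), leakyRelu(f32, -3.0, 0.01), 1e-7);
}
```

Because the negative branch has slope alpha rather than 0, a unit whose pre-activation is always negative still receives a gradient and can recover.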

2.2 Sigmoid

Sigmoid (Logistic)

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Derivative: \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\).

Property	Value
Range	\((0, 1)\)
Smoothness	Infinitely differentiable
Gradient at saturation	Approaches 0 for \(\lvert x \rvert \gg 0\)

Vanishing Gradient

For inputs with large magnitude, the sigmoid saturates and its derivative is near zero. In deep networks, this causes gradients to shrink exponentially through layers -- the vanishing gradient problem -- making early layers nearly impossible to train.
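
The exponential shrinkage can be made concrete with a short bound. This is a simplified sketch that tracks only the activation derivatives and ignores the weight matrices, which also scale the gradient in a real network; \(x_\ell\) denotes the pre-activation at layer \(\ell\), a symbol introduced here for illustration:

\[ \sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr) \le \tfrac{1}{4}, \qquad \prod_{\ell=1}^{L} \sigma'(x_\ell) \le \left(\tfrac{1}{4}\right)^{L} \]

For \(L = 10\) layers the activation derivatives alone contribute a factor of at most \(4^{-10} \approx 9.5 \times 10^{-7}\).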

2.3 Hyperbolic Tangent

Tanh

\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]

Equivalently, \(\tanh(x) = 2\sigma(2x) - 1\).
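
The identity follows by direct substitution of the sigmoid definition:

\[ 2\sigma(2x) - 1 = \frac{2}{1 + e^{-2x}} - 1 = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \tanh(x) \]

The third equality multiplies numerator and denominator by \(e^{x}\).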

Property	Value
Range	\((-1, 1)\)
Smoothness	Infinitely differentiable
Zero-centered	Yes (unlike sigmoid)
Gradient at saturation	Approaches 0

Tanh was the default activation in early neural networks. Its zero-centered output is advantageous for gradient dynamics, but it still suffers from vanishing gradients in the saturated regime.

3. Modern Activations

3.1 GELU -- Gaussian Error Linear Unit

GELU (Hendrycks & Gimpel 2016)

\[ \operatorname{GELU}(x) = x \cdot \Phi(x) \]

where \(\Phi(x) = \frac{1}{2}\bigl[1 + \operatorname{erf}(x/\sqrt{2})\bigr]\) is the CDF of the standard normal distribution.²

Intuition. GELU weights the input \(x\) by the probability that a standard Gaussian random variable is less than \(x\). Large positive inputs pass through nearly unchanged (\(\Phi(x) \approx 1\)); large negative inputs are suppressed (\(\Phi(x) \approx 0\)); and intermediate values are smoothly interpolated.
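
The same picture carries over to the gradient. As a worked step (with \(\varphi\) introduced here for the standard normal density, a symbol not used elsewhere on this page), the product rule gives:

\[ \operatorname{GELU}'(x) = \Phi(x) + x\,\varphi(x), \qquad \varphi(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2} \]

At \(x = 0\) this evaluates to \(\Phi(0) = 0.5\); for large positive \(x\) it approaches 1, and for large negative \(x\) it approaches 0 without being identically zero on any interval.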

Fast approximation (tanh-based):

\[ \operatorname{GELU}(x) \approx 0.5\, x \left(1 + \tanh\!\left[\sqrt{\frac{2}{\pi}}\left(x + 0.044715\, x^3\right)\right]\right) \]

This approximation avoids computing the error function and is used in many implementations, including the original GPT-2 code and ZigLlama.

Property	Value
Range	\([\approx -0.17, +\infty)\), minimum near \(x \approx -0.75\)
Smoothness	Infinitely differentiable
Dead neurons	None (gradient is non-zero for negative inputs, except at the isolated minimum)
Used in	BERT, GPT-2, GPT-3

3.2 SiLU / Swish

SiLU (Ramachandran et al. 2017)

\[ \operatorname{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}} \]

Also known as Swish with a fixed \(\beta = 1\).³

SiLU is self-gated: the input acts as both the signal and the gate. It shares many properties with GELU -- smooth, non-monotonic near zero, and asymptotically linear for large positive inputs -- but uses the simpler sigmoid rather than the Gaussian CDF.

Property	Value
Range	\([\approx -0.28, +\infty)\), minimum near \(x \approx -1.28\)
Smoothness	Infinitely differentiable
Dead neurons	None
Used in	LLaMA, PaLM, Switch Transformer

3.3 SwiGLU -- SiLU-Gated Linear Unit

SwiGLU (Shazeer 2020)

\[ \operatorname{SwiGLU}(x,\, W_1,\, W_2) = \bigl(\operatorname{SiLU}(x W_1)\bigr) \odot (x W_2) \]

where \(\odot\) is element-wise multiplication and \(W_1, W_2 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}}\) are the gate and content projections, respectively.⁴

SwiGLU is a gated linear unit variant that uses SiLU as its gating activation. The key idea is to split the computation into two parallel streams -- one processed through SiLU (the gate), the other left linear (the content) -- and then combine them via element-wise multiplication.

This architecture requires three weight matrices per feed-forward layer (gate, up, down) rather than two, but empirically yields better model quality at the same compute budget.
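
A minimal sketch of the full feed-forward block may help to see where the three matrices enter. It assumes a single input row vector, tiny fixed dimensions, and the names swigluFfn, w_gate, w_up and w_down, all chosen here for illustration; it is not the ZigLlama implementation.

```zig
const std = @import("std");

// Tiny illustrative sizes (assumptions, not real model dimensions).
const d_model = 2;
const d_ff = 3;

fn silu(x: f32) f32 {
    return x / (1.0 + @exp(-x));
}

/// y = (SiLU(x W_gate) ⊙ (x W_up)) W_down for one row vector x.
pub fn swigluFfn(
    x: [d_model]f32,
    w_gate: [d_model][d_ff]f32,
    w_up: [d_model][d_ff]f32,
    w_down: [d_ff][d_model]f32,
) [d_model]f32 {
    var y = [_]f32{0} ** d_model;
    for (0..d_ff) |j| {
        // Two parallel projections of the same input.
        var gate: f32 = 0;
        var up: f32 = 0;
        for (0..d_model) |i| {
            gate += x[i] * w_gate[i][j];
            up += x[i] * w_up[i][j];
        }
        const h = silu(gate) * up; // gated hidden unit j
        // Project the hidden unit back to the model dimension.
        for (0..d_model) |k| {
            y[k] += h * w_down[j][k];
        }
    }
    return y;
}

test "closed gates suppress their hidden units" {
    const x = [d_model]f32{ 1, 0 };
    const w_gate = [d_model][d_ff]f32{ .{ 100, 0, 0 }, .{ 0, 0, 0 } };
    const w_up = [d_model][d_ff]f32{ .{ 2, 5, 7 }, .{ 0, 0, 0 } };
    const w_down = [d_ff][d_model]f32{ .{ 1, 0 }, .{ 0, 1 }, .{ 0, 1 } };
    const y = swigluFfn(x, w_gate, w_up, w_down);
    // Only hidden unit 0 has an open gate: SiLU(100) * 2 = 200.
    try std.testing.expectApproxEqAbs(@as(f32, 200), y[0], 1e-3);
    try std.testing.expectApproxEqAbs(@as(f32, 0), y[1], 1e-6);
}
```

With these sizes the block holds \(3 \cdot 2 \cdot 3 = 18\) weights, one \(d_{\text{model}} \times d_{ff}\) matrix each for gate, up and down.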

Property	Value
Parameters	\(3 \cdot d_{\text{model}} \cdot d_{ff}\)
Gating	Learned, via SiLU
Used in	LLaMA, LLaMA 2, PaLM

3.4 GeGLU -- GELU-Gated Linear Unit

GeGLU (Shazeer 2020)

\[ \operatorname{GeGLU}(x,\, W_1,\, W_2) = \bigl(\operatorname{GELU}(x W_1)\bigr) \odot (x W_2) \]

GeGLU replaces the SiLU gate in SwiGLU with GELU. The two variants perform comparably in practice; the choice often follows from consistency with the rest of the architecture (e.g., if the model already uses GELU elsewhere).

4. Comparison Table

Activation	Range	Smooth	Dead Neurons	FLOPs/elem	Notable Models
ReLU	\([0, \infty)\)	No	Yes	1	Early CNNs, some MLPs
Sigmoid	\((0, 1)\)	Yes	No (but vanishing grad)	~10	Gates in LSTMs, GLU
Tanh	\((-1, 1)\)	Yes	No (but vanishing grad)	~10	LSTMs, early RNNs
GELU	\([\approx -0.17, \infty)\)	Yes	No	~15	BERT, GPT-2/3
SiLU	\([\approx -0.28, \infty)\)	Yes	No	~10	LLaMA, PaLM
SwiGLU	\(\mathbb{R}\)	Yes	No	~20 (+ extra matmul)	LLaMA, LLaMA 2
GeGLU	\(\mathbb{R}\)	Yes	No	~25 (+ extra matmul)	Some T5 variants

Choosing an Activation

For modern decoder-only LLMs, SwiGLU is the dominant choice (LLaMA, Mistral, PaLM). For encoder models and older architectures, GELU remains standard (BERT, GPT-2). ReLU is rarely used in transformers today but remains pedagogically important.

5. Activation Function Landscape

graph LR
    subgraph "Classical"
        ReLU
        Sigmoid
        Tanh
    end

    subgraph "Modern Smooth"
        GELU
        SiLU["SiLU / Swish"]
    end

    subgraph "Gated Variants"
        GLU["GLU (sigmoid gate)"]
        SwiGLU["SwiGLU (SiLU gate)"]
        GeGLU["GeGLU (GELU gate)"]
    end

    ReLU --> GELU
    Sigmoid --> SiLU
    Sigmoid --> GLU
    SiLU --> SwiGLU
    GELU --> GeGLU
    GLU --> SwiGLU
    GLU --> GeGLU

6. Implementation in ZigLlama

6.1 Type Enum and Dispatcher

ZigLlama defines all supported activations in a single enum and provides a dispatcher that is generic over the element type T at compile time and switches on the activation type at run time:

/// Activation function types supported in transformers
pub const ActivationType = enum {
    ReLU,
    GELU,
    SiLU,    // Also known as Swish
    GLU,     // Gated Linear Unit
    GeGLU,   // GELU-based Gated Linear Unit
    SwiGLU,  // SiLU-based Gated Linear Unit
    Tanh,
    Sigmoid,
};

/// Generic activation function dispatcher
pub fn applyActivation(
    comptime T: type,
    activation_type: ActivationType,
    input: Tensor(T),
    allocator: Allocator,
) TensorError!Tensor(T) {
    return switch (activation_type) {
        .ReLU   => relu(T, input),
        .GELU   => gelu(T, input, allocator),
        .SiLU   => silu(T, input, allocator),
        .GLU    => glu(T, input, allocator),
        .GeGLU  => geglu(T, input, allocator),
        .SwiGLU => swiglu(T, input, allocator),
        .Tanh   => tanh_activation(T, input, allocator),
        .Sigmoid => sigmoid(T, input, allocator),
    };
}
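
When the activation is known at compile time, the switch itself can be resolved by the compiler. The following scalar sketch shows the idea; the reduced enum ScalarActivation and the helper applyScalar are hypothetical names introduced here and are not part of ZigLlama.

```zig
const std = @import("std");

/// Reduced, hypothetical enum for illustration only.
const ScalarActivation = enum { relu, sigmoid, silu };

/// Because `act` is a comptime parameter, each call site is specialised
/// to a single branch of the switch; no run-time dispatch remains.
pub fn applyScalar(comptime T: type, comptime act: ScalarActivation, x: T) T {
    return switch (act) {
        .relu => @max(@as(T, 0), x),
        .sigmoid => 1.0 / (1.0 + @exp(-x)),
        .silu => x / (1.0 + @exp(-x)),
    };
}

test "comptime-selected scalar activations" {
    try std.testing.expectEqual(@as(f32, 0.0), applyScalar(f32, .relu, -1.5));
    try std.testing.expectEqual(@as(f32, 0.5), applyScalar(f32, .sigmoid, 0.0));
    try std.testing.expectEqual(@as(f32, 0.0), applyScalar(f32, .silu, 0.0));
}
```

The trade-off is flexibility: a comptime parameter cannot be chosen from a model file at load time, which is one reason a run-time switch is a reasonable default for a dispatcher.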

6.2 Scalar Activation Functions

Each scalar activation operates element-wise over a Tensor(T).

/// ReLU: max(0, x)
pub fn relu(comptime T: type, input: Tensor(T)) TensorError!Tensor(T) {
    var result = try Tensor(T).init(input.allocator, input.shape);
    for (0..input.size) |i| {
        result.data[i] = @max(0.0, input.data[i]);
    }
    return result;
}

/// GELU: tanh-based approximation
pub fn gelu(comptime T: type, input: Tensor(T), allocator: Allocator) TensorError!Tensor(T) {
    var result = try Tensor(T).init(allocator, input.shape);
    for (0..input.size) |i| {
        const x = input.data[i];
        // 0.7978845608 = sqrt(2 / pi)
        const inner = 0.7978845608 * (x + 0.044715 * x * x * x);
        result.data[i] = 0.5 * x * (1.0 + std.math.tanh(inner));
    }
    return result;
}

/// SiLU / Swish: x * sigmoid(x)
pub fn silu(comptime T: type, input: Tensor(T), allocator: Allocator) TensorError!Tensor(T) {
    var result = try Tensor(T).init(allocator, input.shape);
    for (0..input.size) |i| {
        const x = input.data[i];
        result.data[i] = x * (1.0 / (1.0 + @exp(-x)));
    }
    return result;
}

6.3 Gated Activations

Gated variants split the input tensor along the last dimension. The first half is the content stream; the second half passes through the gating activation.

/// SwiGLU: content * SiLU(gate), split along the last dimension.
/// Assumes a contiguous row-major layout.
pub fn swiglu(comptime T: type, input: Tensor(T), allocator: Allocator) TensorError!Tensor(T) {
    const last_dim = input.shape[input.shape.len - 1];
    const half = last_dim / 2;          // width of each stream
    const rows = input.size / last_dim; // product of all leading dimensions
    // ... shape validation omitted for brevity ...
    var result = try Tensor(T).init(allocator, output_shape);
    for (0..rows) |r| {
        for (0..half) |j| {
            const a = input.data[r * last_dim + j];          // content: first half of the row
            const b = input.data[r * last_dim + half + j];   // gate: second half of the row
            const sigmoid_b = 1.0 / (1.0 + @exp(-b));
            result.data[r * half + j] = a * (b * sigmoid_b); // a * SiLU(b)
        }
    }
    return result;
}
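
To make the last-dimension split concrete, here is a minimal sketch on plain slices with a worked 2 x 4 input. The helper name swigluRows and the explicit cols parameter are assumptions for illustration; the code assumes a contiguous row-major buffer and is not the ZigLlama implementation.

```zig
const std = @import("std");

/// Splits each row of a row-major [rows x cols] buffer into a content half
/// and a gate half, and writes content * SiLU(gate) into `out` ([rows x cols/2]).
/// Hypothetical helper operating on plain slices rather than Tensor(T).
pub fn swigluRows(input: []const f32, cols: usize, out: []f32) void {
    const half = cols / 2;
    const rows = input.len / cols;
    std.debug.assert(cols % 2 == 0 and out.len == rows * half);
    for (0..rows) |r| {
        for (0..half) |j| {
            const a = input[r * cols + j]; // content
            const b = input[r * cols + half + j]; // gate
            out[r * half + j] = a * (b / (1.0 + @exp(-b)));
        }
    }
}

test "each row is split into content and gate halves" {
    // Row 0: content {1, 2}, gate {0, 0}; row 1: content {3, 4}, gate {100, 0}.
    const input = [_]f32{ 1, 2, 0, 0, 3, 4, 100, 0 };
    var out: [4]f32 = undefined;
    swigluRows(&input, 4, &out);
    try std.testing.expectEqual(@as(f32, 0), out[0]); // 1 * SiLU(0)
    try std.testing.expectEqual(@as(f32, 0), out[1]); // 2 * SiLU(0)
    try std.testing.expectApproxEqAbs(@as(f32, 300), out[2], 1e-3); // 3 * SiLU(100)
    try std.testing.expectEqual(@as(f32, 0), out[3]); // 4 * SiLU(0)
}
```

Note that splitting the flat buffer at its midpoint would pair elements from different rows whenever the tensor has more than one row, which is why the index arithmetic is done per row.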

Source File

Full implementation: src/neural_primitives/activations.zig (approximately 380 lines including tests).

7. Numerical Considerations

7.1 GELU Approximation Accuracy

The tanh-based GELU approximation has a maximum absolute error of approximately \(4 \times 10^{-4}\) relative to the exact \(x\Phi(x)\) form. For inference in f32, this is well within acceptable precision.

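7.2 Sigmoid Overflow

For very large negative inputs (roughly \(x < -88.7\) in f32), \(e^{-x}\) overflows to \(+\infty\). ZigLlama relies on IEEE 754 semantics: \(\exp(-x) = +\infty\) yields \(1/(1+\infty) = 0\), which is the correct saturated value, so no explicit clamping is required for finite inputs. The one exception is an input of exactly \(-\infty\) in SiLU, where \(x \cdot \sigma(x) = -\infty \cdot 0\) evaluates to NaN.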
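
The saturation behaviour can be checked directly. The sketch below uses standalone scalar helpers (the names sigmoid and silu are local to this example) and exercises the unclamped formulas in f32.

```zig
const std = @import("std");

fn sigmoid(x: f32) f32 {
    return 1.0 / (1.0 + @exp(-x));
}

fn silu(x: f32) f32 {
    return x * sigmoid(x);
}

test "unclamped sigmoid saturates to exact 0 and 1" {
    // exp(100) overflows to +inf in f32, and 1 / (1 + inf) == 0.
    try std.testing.expectEqual(@as(f32, 0.0), sigmoid(-100.0));
    // exp(-100) is far below f32 epsilon, so 1 + exp(-100) rounds to 1.
    try std.testing.expectEqual(@as(f32, 1.0), sigmoid(100.0));
    // An infinite input to SiLU gives -inf * 0, which is NaN under IEEE 754.
    try std.testing.expect(std.math.isNan(silu(-std.math.inf(f32))));
}
```

If infinite inputs can occur upstream, a guard on the input is one option; whether that is worth the extra branch depends on the model and is left open here.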

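7.3 SiLU Near Zero

\(\operatorname{SiLU}(0) = 0\) exactly, and the derivative at zero is 0.5. More importantly, the derivative stays non-zero for negative inputs (it vanishes only at the isolated minimum near \(x \approx -1.28\) and decays towards zero as \(x \to -\infty\)), so a unit with a negative pre-activation still receives a gradient. ReLU, by contrast, has exactly zero gradient on the whole negative half-line, which is what lets neurons die.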
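
As a quick numerical check, using the derivative formula \(\operatorname{SiLU}'(x) = \sigma(x)\bigl(1 + x(1 - \sigma(x))\bigr)\) stated in the exercises:

\[ \operatorname{SiLU}'(0) = \sigma(0)\,(1 + 0) = 0.5, \qquad \operatorname{SiLU}'(-5) = \sigma(-5)\bigl(1 - 5\,(1 - \sigma(-5))\bigr) \approx 0.0067 \times (-3.967) \approx -0.027 \]

A pre-activation of \(-5\) therefore still yields a small but non-zero gradient, whereas ReLU would give exactly 0.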

8. Exercises

1. Prove that \(\operatorname{SiLU}'(x) = \sigma(x)(1 + x(1 - \sigma(x)))\).
2. Implement a "leaky" GELU variant where negative outputs are scaled by a small constant \(\alpha\) instead of suppressed.
3. Benchmark the per-element cost of GELU vs. SiLU on a 4096-element tensor using std.time.Timer. Which is faster and why?
4. Explain why SwiGLU requires three weight matrices while standard FFN requires only two, and show that the parameter counts match when \(d_{ff}^{\text{SwiGLU}} = \frac{2}{3} d_{ff}^{\text{standard}}\).

References

1. Cybenko, G. "Approximation by Superpositions of a Sigmoidal Function." Mathematics of Control, Signals, and Systems, 2(4):303--314, 1989.
2. Hendrycks, D. & Gimpel, K. "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415, 2016.
3. Ramachandran, P., Zoph, B. & Le, Q. V. "Searching for Activation Functions." arXiv:1710.05941, 2017.
4. Shazeer, N. "GLU Variants Improve Transformer." arXiv:2002.05202, 2020.
5. Nair, V. & Hinton, G. E. "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML, 2010.