Academic Papers
This page collects the primary research papers behind the algorithms, model architectures, and inference techniques implemented in ZigLlama. Each entry provides a full citation, an arXiv link where available, and a one-sentence annotation explaining the paper's relevance to the project.
Papers are grouped by topic rather than chronology so that readers studying a particular subsystem -- positional encodings, quantisation, sampling -- can find the relevant literature in one place.
1. Foundational Transformers
These papers introduced the core architecture that every model in ZigLlama builds upon.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 1 | Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS). | arXiv:1706.03762 | Defines the original transformer architecture -- scaled dot-product attention, multi-head attention, and the encoder-decoder structure -- which forms the blueprint for every Layer 4 module in ZigLlama. |
| 2 | Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Technical Report (GPT-2). | OpenAI blog | Demonstrates that a decoder-only transformer trained on large-scale web text can perform diverse NLP tasks without task-specific fine-tuning; ZigLlama's GPT-2 architecture support reproduces this model family. |
| 3 | Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT. | arXiv:1810.04805 | Introduces bidirectional pre-training with masked language modelling; ZigLlama implements BERT's encoder architecture and segment embeddings as one of its 18 supported model families. |
2. LLaMA Family
The model family from which ZigLlama takes its name and its primary reference architecture.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 4 | Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint. | arXiv:2302.13971 | The primary reference architecture for ZigLlama -- RMSNorm, SwiGLU, RoPE, and pre-norm transformer blocks are implemented exactly as described in this paper. |
| 5 | Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv preprint. | arXiv:2307.09288 | Extends the LLaMA architecture with a longer context window, grouped-query attention in the larger (34B and 70B) variants, and RLHF-tuned chat variants; ZigLlama supports the GQA mechanism and Llama 2 model weights. |
3. Positional Encodings
Mechanisms for injecting sequence-position information into transformer representations.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 6 | Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv preprint. | arXiv:2104.09864 | Introduces RoPE, the rotary position embedding used in LLaMA and most ZigLlama-supported architectures; the rope.zig module implements the rotation matrices and frequency schedules described here. |
| 7 | Press, O., Smith, N. A., & Lewis, M. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization." ICLR 2022. | arXiv:2108.12409 | Proposes ALiBi, a position encoding that adds a linear bias to attention scores instead of modifying embeddings; ZigLlama implements ALiBi for architectures (e.g., BLOOM) that use it. |
| 8 | Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." arXiv preprint. | arXiv:2309.00071 | Describes a method for extending RoPE-based context windows beyond the training length by combining NTK-aware ("NTK-by-parts") frequency interpolation with attention-temperature scaling; relevant to ZigLlama's context-extension support in the RoPE module. |
4. Activation Functions
Non-linear functions applied within feed-forward sub-layers.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 9 | Hendrycks, D. & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv preprint. | arXiv:1606.08415 | Defines GELU, used as the activation function in BERT, GPT-2, and several other ZigLlama-supported models; implemented in activation_functions.zig. |
| 10 | Ramachandran, P., Zoph, B., & Le, Q. V. (2017). "Searching for Activation Functions." arXiv preprint. | arXiv:1710.05941 | Discovers the Swish activation \( x \cdot \sigma(\beta x) \) through automated search; with \( \beta = 1 \) Swish coincides with SiLU, which ZigLlama implements as the gating function in SwiGLU. |
| 11 | Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv preprint. | arXiv:2002.05202 | Shows that gated linear units (especially SwiGLU) outperform standard FFN layers in transformers; ZigLlama's feed-forward module implements SwiGLU as the default for LLaMA-family models. |
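
The relationship between Swish, SiLU, and SwiGLU noted in the entries above can be sketched in a few lines. This is a minimal illustration in plain Python, not ZigLlama's implementation: the function names (`silu`, `swiglu_ffn`) and the list-of-lists weight layout are assumptions made for the example, and the feed-forward form follows the bias-free SwiGLU variant from Shazeer (2020).

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def silu(x: float) -> float:
    """SiLU, i.e. Swish with beta = 1: x * sigmoid(x)."""
    return x * sigmoid(x)

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free SwiGLU feed-forward: W_down (SiLU(W_gate x) * (W_up x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]  # gating branch
    up = matvec(w_up, x)                         # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # element-wise product
    return matvec(w_down, hidden)
```

With identity matrices for `w_gate` and `w_up`, each hidden element reduces to `silu(x_i) * x_i`, which makes the gating effect easy to inspect by hand.
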
5. Normalization
Layer-wise normalization techniques that stabilise training and inference.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 12 | Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). "Layer Normalization." arXiv preprint. | arXiv:1607.06450 | Introduces LayerNorm, the normalization method used in the original transformer, GPT-2, and BERT; ZigLlama implements it in normalization.zig. |
| 13 | Zhang, B. & Sennrich, R. (2019). "Root Mean Square Layer Normalization." Advances in Neural Information Processing Systems 32. | arXiv:1910.07467 | Proposes RMSNorm, which drops the mean-centering step of LayerNorm for faster computation; RMSNorm is the normalization used in LLaMA and the majority of ZigLlama's model architectures. |
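
To make the difference between the two entries above concrete, the following sketch places LayerNorm next to RMSNorm. It is an illustrative plain-Python version, not the code in normalization.zig; the function names, the `eps` default, and the per-element `gain` and `bias` lists are assumptions for the example.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

def rms_norm(x, gain, eps=1e-5):
    """RMSNorm: no mean subtraction and no bias, only a root-mean-square rescale."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]
```

RMSNorm needs one reduction over the vector (the mean of squares) where LayerNorm needs two (mean, then variance). For an input that already has zero mean, the two functions return the same values when the bias is zero.
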
6. Efficient Attention
Variants that reduce the computational or memory cost of attention.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 14 | Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., & Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv preprint. | arXiv:2305.13245 | Introduces grouped-query attention, which reduces KV cache size by sharing key-value heads across query groups; ZigLlama implements GQA in multi_head_attention.zig for Llama 2 and Mistral. |
| 15 | Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv preprint. | arXiv:1911.02150 | Proposes multi-query attention (MQA), the limiting case of GQA with a single KV head; ZigLlama supports MQA as a configuration option for models such as Falcon. |
| 16 | Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Advances in Neural Information Processing Systems 35. | arXiv:2205.14135 | Demonstrates that tiling attention computation to fit in SRAM yields significant wall-clock speedups; ZigLlama's attention implementation uses cache-aware blocking strategies inspired by this work. |
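
The GQA and MQA entries above describe one mechanism with different head counts: each query head reads from a shared key-value head. A minimal sketch of that mapping follows, assuming contiguous grouping of query heads; the function name `kv_head_for_query` is hypothetical and is not taken from multi_head_attention.zig.

```python
def kv_head_for_query(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given query head attends with.

    Assumes query heads are grouped contiguously, so heads 0..g-1 share
    KV head 0, heads g..2g-1 share KV head 1, and so on.
    """
    if n_kv_heads <= 0 or n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a positive multiple of n_kv_heads")
    if not 0 <= q_head < n_q_heads:
        raise ValueError("q_head out of range")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

Setting `n_kv_heads == n_q_heads` gives standard multi-head attention (identity mapping), `n_kv_heads == 1` gives MQA, and anything in between is GQA. The KV cache then scales with `n_kv_heads` rather than `n_q_heads`.
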
7. Quantization
Techniques for reducing model weight precision while preserving accuracy.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 17 | Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint. | arXiv:2210.17323 | Introduces GPTQ, a one-shot weight quantisation method based on approximate second-order information; provides theoretical context for the quantisation formats (Q4_0, Q4_1) that ZigLlama implements in Layer 2. |
| 18 | Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint. | arXiv:2306.00978 | Identifies salient weight channels from activation magnitudes and protects them with per-channel scaling before quantisation, avoiding mixed-precision storage; informs ZigLlama's importance-based quantisation strategies in the IQ format family. |
| 19 | Chee, J., Cai, Y., Kuleshov, V., & De Sa, C. (2023). "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." arXiv preprint. | arXiv:2307.13304 | Achieves 2-bit quantisation with theoretical error guarantees using incoherence processing; provides foundational theory for ZigLlama's ultra-low-bit IQ1_S and IQ2 quantisation formats. |
8. Sampling and Decoding
Strategies for converting logit distributions into text tokens.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 20 | Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020. | arXiv:1904.09751 | Introduces nucleus (top-p) sampling, showing that truncating the probability distribution to a cumulative threshold produces more coherent text than pure top-k; ZigLlama implements nucleus sampling in sampling.zig. |
| 21 | Basu, S., Ramachandran, G. S., Keskar, N. S., & Varshney, L. R. (2021). "Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity." ICLR 2021. | arXiv:2007.14966 | Proposes Mirostat, an adaptive sampling algorithm that targets a desired perplexity level; ZigLlama implements both Mirostat v1 and v2 in the sampling module. |
| 22 | Meister, C., Pimentel, T., Wiher, G., & Cotterell, R. (2023). "Locally Typical Sampling." Transactions of the Association for Computational Linguistics 11 (first circulated on arXiv as "Typical Decoding for Natural Language Generation"). | arXiv:2202.00666 | Proposes locally typical sampling, which selects tokens whose information content is close to the expected information; ZigLlama implements typical sampling as one of its eight decoding strategies. |
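
The nucleus (top-p) truncation described in the Holtzman et al. entry above can be sketched as follows. This is an illustrative plain-Python version under simple assumptions (logits as a list of floats, a `random.Random` instance for sampling); the function names are hypothetical and do not reflect the interface of sampling.zig.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_filter(probs, top_p):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p.

    Returns a dict mapping kept token ids to renormalised probabilities.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_top_p(logits, top_p, rng):
    """Draw one token id from the renormalised nucleus."""
    nucleus = nucleus_filter(softmax(logits), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]
```

For example, `sample_top_p([2.0, 1.0, 0.0, -1.0], 0.7, random.Random(0))` can only return token 0 or 1, because those two tokens already cover more than 70% of the probability mass. Unlike top-k, the number of candidates adapts to how peaked the distribution is.
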
9. State-Space Models
Alternatives to the attention mechanism based on structured state-space representations.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 23 | Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv preprint. | arXiv:2312.00752 | Introduces the Mamba architecture with selective scan, achieving linear-time sequence processing; ZigLlama implements the Mamba model as one of its 18 supported architectures, including the selective-scan mechanism. |
10. Mixture of Experts
Sparse architectures that activate a subset of model parameters per token.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 24 | Fedus, W., Zoph, B., & Shazeer, N. (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv preprint. | arXiv:2101.03961 | Proposes the Switch Transformer, a simplified MoE design with a single-expert routing strategy; ZigLlama's Mixture-of-Experts module implements top-k expert routing as described in this lineage. |
11. Multi-Modal Models
Architectures that process both visual and textual inputs.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 25 | Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021. | arXiv:2103.00020 | Introduces CLIP, which aligns image and text representations through contrastive learning; provides the vision-encoder foundation used in ZigLlama's multi-modal architecture support. |
| 26 | Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning." NeurIPS 2023. | arXiv:2304.08485 | Proposes LLaVA, a vision-language model that connects a CLIP vision encoder to a LLaMA language model via a projection layer; ZigLlama's multi-modal module implements this vision-language architecture. |
12. Model-Specific Papers
Publications for specific model architectures supported by ZigLlama beyond the LLaMA family.
| # | Citation | Link | Relevance to ZigLlama |
|---|---|---|---|
| 27 | Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). "Mistral 7B." arXiv preprint. | arXiv:2310.06825 | Introduces Mistral 7B with sliding window attention and grouped-query attention; ZigLlama implements the Mistral architecture including its windowed attention variant. |
| 28 | Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., & Launay, J. (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only." NeurIPS 2023 Datasets and Benchmarks Track. | arXiv:2306.01116 | Describes RefinedWeb, the filtered and deduplicated web-only corpus used to train the Falcon models; ZigLlama implements the Falcon architecture with multi-query attention as one of its supported model families. |
| 29 | Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., et al. (2023). "Textbooks Are All You Need." arXiv preprint. | arXiv:2306.11644 | Introduces the Phi model family trained on high-quality "textbook" data, achieving strong performance at small scale; ZigLlama implements the Phi architecture with its partial RoPE and dense attention configuration. |
| 30 | Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). "StarCoder: May the Source Be with You!" arXiv preprint. | arXiv:2305.06161 | Describes StarCoder, a code-generation model trained on permissively licensed source code; ZigLlama supports the StarCoder architecture with its multi-query attention and fill-in-the-middle capabilities. |
| 31 | Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilic, S., Hesslow, D., Castagne, R., Luccioni, A. S., Yvon, F., Galle, M., et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv preprint. | arXiv:2211.05100 | Introduces BLOOM, a multilingual model using ALiBi positional encodings and LayerNorm; ZigLlama implements the BLOOM architecture with ALiBi as one of its supported model families. |
| 32 | Gemma Team, Google DeepMind. (2024). "Gemma: Open Models Based on Gemini Research and Technology." arXiv preprint. | arXiv:2403.08295 | Describes the Gemma family of lightweight open models derived from Gemini research; ZigLlama implements the Gemma architecture with its GeGLU activation and RoPE configuration. |
Citation Index
The following table is a quick reference, sorted by first-author surname, for locating a specific paper above.
| First Author | Year | Short Title | Section |
|---|---|---|---|
| Ainslie | 2023 | GQA | 6. Efficient Attention |
| Ba | 2016 | Layer Normalization | 5. Normalization |
| Basu | 2021 | Mirostat | 8. Sampling and Decoding |
| Chee | 2023 | QuIP | 7. Quantization |
| Dao | 2022 | FlashAttention | 6. Efficient Attention |
| Devlin | 2019 | BERT | 1. Foundational Transformers |
| Fedus | 2021 | Switch Transformers | 10. Mixture of Experts |
| Frantar | 2022 | GPTQ | 7. Quantization |
| Gemma Team | 2024 | Gemma | 12. Model-Specific Papers |
| Gu | 2023 | Mamba | 9. State-Space Models |
| Gunasekar | 2023 | Phi | 12. Model-Specific Papers |
| Hendrycks | 2016 | GELU | 4. Activation Functions |
| Holtzman | 2020 | Nucleus Sampling | 8. Sampling and Decoding |
| Jiang | 2023 | Mistral 7B | 12. Model-Specific Papers |
| Li | 2023 | StarCoder | 12. Model-Specific Papers |
| Lin | 2023 | AWQ | 7. Quantization |
| Liu | 2023 | LLaVA | 11. Multi-Modal Models |
| Meister | 2023 | Typical Decoding | 8. Sampling and Decoding |
| Penedo | 2023 | Falcon | 12. Model-Specific Papers |
| Peng | 2023 | YaRN | 3. Positional Encodings |
| Press | 2021 | ALiBi | 3. Positional Encodings |
| Radford (2019) | 2019 | GPT-2 | 1. Foundational Transformers |
| Radford (2021) | 2021 | CLIP | 11. Multi-Modal Models |
| Ramachandran | 2017 | Swish | 4. Activation Functions |
| Scao | 2022 | BLOOM | 12. Model-Specific Papers |
| Shazeer (2019) | 2019 | MQA | 6. Efficient Attention |
| Shazeer (2020) | 2020 | GLU Variants | 4. Activation Functions |
| Su | 2021 | RoPE / RoFormer | 3. Positional Encodings |
| Touvron (Feb 2023) | 2023 | LLaMA | 2. LLaMA Family |
| Touvron (Jul 2023) | 2023 | Llama 2 | 2. LLaMA Family |
| Vaswani | 2017 | Attention Is All You Need | 1. Foundational Transformers |
| Zhang | 2019 | RMSNorm | 5. Normalization |