Academic Papers

This page collects the primary research papers behind the algorithms, model architectures, and inference techniques implemented in ZigLlama. Each entry provides a full citation, an arXiv link where available, and a one-sentence annotation explaining the paper's relevance to the project.

Papers are grouped by topic rather than chronology so that readers studying a particular subsystem -- positional encodings, quantisation, sampling -- can find the relevant literature in one place.

1. Foundational Transformers

These papers introduced the core architecture that every model in ZigLlama builds upon.

#	Citation	Link	Relevance to ZigLlama
1	Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems 30 (NeurIPS).	arXiv:1706.03762	Defines the original transformer architecture -- scaled dot-product attention, multi-head attention, and the encoder-decoder structure -- which forms the blueprint for every Layer 4 module in ZigLlama.
2	Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners." OpenAI Technical Report (GPT-2).	OpenAI blog	Demonstrates that a decoder-only transformer trained on large-scale web text can perform diverse NLP tasks without task-specific fine-tuning; ZigLlama's GPT-2 architecture support reproduces this model family.
3	Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Proceedings of NAACL-HLT.	arXiv:1810.04805	Introduces bidirectional pre-training with masked language modelling; ZigLlama implements BERT's encoder architecture and segment embeddings as one of its 18 supported model families.

2. LLaMA Family

The model family from which ZigLlama takes its name and its primary reference architecture.

#	Citation	Link	Relevance to ZigLlama
4	Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). "LLaMA: Open and Efficient Foundation Language Models." arXiv preprint.	arXiv:2302.13971	The primary reference architecture for ZigLlama -- RMSNorm, SwiGLU, RoPE, and pre-norm transformer blocks are implemented exactly as described in this paper.
5	Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv preprint.	arXiv:2307.09288	Extends the LLaMA architecture with grouped-query attention (used in the larger model sizes) and RLHF-tuned chat variants; ZigLlama supports the GQA mechanism and Llama 2 model weights.

3. Positional Encodings

Mechanisms for injecting sequence-position information into transformer representations.

#	Citation	Link	Relevance to ZigLlama
6	Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv preprint.	arXiv:2104.09864	Introduces RoPE, the rotary position embedding used in LLaMA and most ZigLlama-supported architectures; the `rope.zig` module implements the rotation matrices and frequency schedules described here.
7	Press, O., Smith, N. A., & Lewis, M. (2021). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Generalization." ICLR 2022.	arXiv:2108.12409	Proposes ALiBi, a position encoding that adds a linear bias to attention scores instead of modifying embeddings; ZigLlama implements ALiBi for architectures (e.g., BLOOM) that use it.
8	Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." arXiv preprint.	arXiv:2309.00071	Describes a method for extending RoPE-based context windows beyond the training length by interpolating RoPE frequencies, building on NTK-aware interpolation; relevant to ZigLlama's context-extension support in the RoPE module.
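
The two position-encoding ideas cited above can be illustrated with a minimal Python sketch. It is not ZigLlama's `rope.zig` code: the function names are hypothetical, the RoPE layout assumed here rotates adjacent pairs (some implementations instead pair element i with element i + d/2), and the ALiBi slope formula assumes a power-of-two head count.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate adjacent pairs (vec[2i], vec[2i+1]) by position * base**(-2i/d).

    Assumption: adjacent-pair layout; `base` defaults to the commonly used 10000.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def alibi_slopes(n_heads):
    """Per-head slopes 2**(-8(h+1)/n), assuming n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(slope, q_pos, k_pos):
    """Linear penalty added to the attention score of a key at or before the query."""
    if k_pos > q_pos:
        raise ValueError("causal attention only: k_pos must not exceed q_pos")
    return -slope * (q_pos - k_pos)

if __name__ == "__main__":
    q = [0.3, -1.2, 0.5, 2.0]
    k = [1.0, 0.4, -0.7, 0.9]
    # The rotated dot product depends only on the offset between positions.
    print(dot(rope_rotate(q, 3), rope_rotate(k, 1)))
    print(dot(rope_rotate(q, 7), rope_rotate(k, 5)))
```

The two printed values agree up to floating-point error, which is the relative-position property that makes RoPE attractive: the score between a query and a key depends on their distance, not on their absolute positions.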

4. Activation Functions

Non-linear functions applied within feed-forward sub-layers.

#	Citation	Link	Relevance to ZigLlama
9	Hendrycks, D. & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv preprint.	arXiv:1606.08415	Defines GELU, used as the activation function in BERT, GPT-2, and several other ZigLlama-supported models; implemented in `activation_functions.zig`.
10	Ramachandran, P., Zoph, B., & Le, Q. V. (2017). "Searching for Activation Functions." arXiv preprint.	arXiv:1710.05941	Discovers the Swish activation \( x \cdot \sigma(\beta x) \) through automated search; with \( \beta = 1 \) Swish coincides with SiLU, \( x \cdot \sigma(x) \), which ZigLlama implements as the gating function in SwiGLU.
11	Shazeer, N. (2020). "GLU Variants Improve Transformer." arXiv preprint.	arXiv:2002.05202	Shows that gated linear units (especially SwiGLU) outperform standard FFN layers in transformers; ZigLlama's feed-forward module implements SwiGLU as the default for LLaMA-family models.
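
As a rough illustration of how the three activations in this table relate, the following Python sketch defines exact GELU, SiLU, and a SwiGLU feed-forward step. It is not taken from `activation_functions.zig`; the function names and the row-major weight layout are assumptions made for the example, and biases are omitted.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    """SiLU (Swish with beta = 1): x * sigmoid(x), written to avoid overflow."""
    if x >= 0.0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x)).

    Assumption: each weight is a list of rows, so w_gate and w_up have shape
    (hidden, d_model) and w_down has shape (d_model, hidden).
    """
    gate = [silu(dot(row, x)) for row in w_gate]
    up = [dot(row, x) for row in w_up]
    hidden = [g * u for g, u in zip(gate, up)]
    return [dot(row, hidden) for row in w_down]

if __name__ == "__main__":
    print(gelu(1.0), silu(1.0))
    print(swiglu_ffn([1.0, 2.0], [[1.0, 0.0]], [[0.0, 1.0]], [[1.0]]))
```

The gate branch passes through SiLU while the up branch stays linear; their element-wise product is what distinguishes a gated unit from a plain two-matrix FFN.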

5. Normalization

Layer-wise normalization techniques that stabilise training and inference.

#	Citation	Link	Relevance to ZigLlama
12	Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). "Layer Normalization." arXiv preprint.	arXiv:1607.06450	Introduces LayerNorm, the normalization method used in the original transformer, GPT-2, and BERT; ZigLlama implements it in `normalization.zig`.
13	Zhang, B. & Sennrich, R. (2019). "Root Mean Square Layer Normalization." Advances in Neural Information Processing Systems 32.	arXiv:1910.07467	Proposes RMSNorm, which drops the mean-centering step of LayerNorm for faster computation; RMSNorm is the normalization used in LLaMA and the majority of ZigLlama's model architectures.
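
The difference between the two normalisers is easiest to see side by side. The Python sketch below is illustrative only and is not the code in `normalization.zig`; the function names and the default `eps` are assumptions.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * g + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm: divide by the root mean square only; no mean-centering and no shift."""
    n = len(x)
    mean_sq = sum(v * v for v in x) / n
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv * g for v, g in zip(x, gamma)]

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 6.0]
    ones, zeros = [1.0] * 4, [0.0] * 4
    print(layer_norm(x, ones, zeros))
    print(rms_norm(x, ones))
```

RMSNorm skips one reduction (the mean) and one subtraction per element, which is where its speed advantage comes from. For inputs that already have zero mean, the two functions return the same values when `beta` is zero.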

6. Efficient Attention

Variants that reduce the computational or memory cost of attention.

#	Citation	Link	Relevance to ZigLlama
14	Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebron, F., & Sanghai, S. (2023). "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." arXiv preprint.	arXiv:2305.13245	Introduces grouped-query attention, which reduces KV cache size by sharing key-value heads across query groups; ZigLlama implements GQA in `multi_head_attention.zig` for Llama 2 and Mistral.
15	Shazeer, N. (2019). "Fast Transformer Decoding: One Write-Head is All You Need." arXiv preprint.	arXiv:1911.02150	Proposes multi-query attention (MQA), the limiting case of GQA with a single KV head; ZigLlama supports MQA as a configuration option for models such as Falcon.
16	Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." Advances in Neural Information Processing Systems 35.	arXiv:2205.14135	Demonstrates that tiling attention computation to fit in SRAM yields significant wall-clock speedups; ZigLlama's attention implementation uses cache-aware blocking strategies inspired by this work.
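
To make the KV-cache saving from GQA and MQA concrete, here is a small Python sketch. The helper names are hypothetical and do not come from `multi_head_attention.zig`. It assumes consecutive query heads share a KV head, which is a common convention but not the only possible grouping, and it assumes a 16-bit cache by default.

```python
def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """Map a query head index to the KV head it reads from.

    n_kv_heads == n_q_heads gives standard multi-head attention,
    n_kv_heads == 1 gives multi-query attention, anything between is GQA.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values (hence the factor 2) for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

if __name__ == "__main__":
    print([kv_head_for_query_head(h, 8, 2) for h in range(8)])
    mha = kv_cache_bytes(32, 32, 128, 4096)
    gqa = kv_cache_bytes(32, 8, 128, 4096)
    print(mha // 2**20, "MiB vs", gqa // 2**20, "MiB")
```

With these illustrative dimensions, moving from 32 KV heads to 8 cuts the cache by a factor of four. The cache scales with the number of KV heads, not with the number of query heads.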

7. Quantization

Techniques for reducing model weight precision while preserving accuracy.

#	Citation	Link	Relevance to ZigLlama
17	Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2022). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." arXiv preprint.	arXiv:2210.17323	Introduces GPTQ, a one-shot weight quantisation method based on approximate second-order information; provides theoretical context for the quantisation formats (Q4_0, Q4_1) that ZigLlama implements in Layer 2.
18	Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., & Han, S. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv preprint.	arXiv:2306.00978	Proposes protecting salient weight channels, identified by activation magnitudes, through per-channel scaling before quantisation; informs ZigLlama's importance-based quantisation strategies in the IQ format family.
19	Chee, J., Cai, Y., Kuleshov, V., & De Sa, C. (2023). "QuIP: 2-Bit Quantization of Large Language Models With Guarantees." arXiv preprint.	arXiv:2307.13304	Achieves 2-bit quantisation with theoretical error guarantees using incoherence processing; provides foundational theory for ZigLlama's ultra-low-bit IQ1_S and IQ2 quantisation formats.
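
For readers new to the topic, the baseline these papers improve on is plain round-to-nearest block quantisation. The Python sketch below shows that baseline for one block of weights. It is deliberately generic: it is not the Q4_0 or Q4_1 storage layout, not GPTQ, AWQ, or QuIP, and the function names are hypothetical.

```python
def quantize_block_4bit(values):
    """Symmetric round-to-nearest: one float scale plus integers in [-8, 7].

    Assumption: the scale maps the largest magnitude in the block to 7.
    """
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax > 0.0 else 1.0
    quants = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, quants

def dequantize_block(scale, quants):
    """Recover approximate weights from the stored scale and integers."""
    return [scale * q for q in quants]

if __name__ == "__main__":
    block = [0.02, -0.71, 0.33, 0.12, -0.05, 0.48, -0.29, 0.64]
    scale, quants = quantize_block_4bit(block)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(quants, "max abs error:", worst, "bound:", scale / 2)
```

Each restored weight is within half a quantisation step of the original. The methods cited above aim to do better than this uniform bound on the weights that matter most to the model's output.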

8. Sampling and Decoding

Strategies for converting logit distributions into text tokens.

#	Citation	Link	Relevance to ZigLlama
20	Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). "The Curious Case of Neural Text Degeneration." ICLR 2020.	arXiv:1904.09751	Introduces nucleus (top-p) sampling, showing that truncating the probability distribution to a cumulative threshold produces more coherent text than pure top-k; ZigLlama implements nucleus sampling in `sampling.zig`.
21	Basu, S., Ramachandran, G. S., Keskar, N. S., & Varshney, L. R. (2021). "Mirostat: A Neural Text Decoding Algorithm that Directly Controls Perplexity." ICLR 2021.	arXiv:2007.14966	Proposes Mirostat, an adaptive sampling algorithm that targets a desired perplexity level; ZigLlama implements both Mirostat v1 and v2 in the sampling module.
22	Meister, C., Pimentel, T., Wiher, G., & Cotterell, R. (2023). "Typical Decoding for Natural Language Generation." arXiv preprint; published as "Locally Typical Sampling" in Transactions of the Association for Computational Linguistics, 11.	arXiv:2202.00666	Proposes locally typical sampling, which selects tokens whose information content is close to the expected information; ZigLlama implements typical sampling as one of its eight decoding strategies.
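
The truncation rules behind nucleus and locally typical sampling can be sketched in a few lines of Python. This is an illustration under assumptions, not the contents of `sampling.zig`: the function names are hypothetical, and ties and zero-probability tokens are handled in the simplest way available.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def _take_until(probs, order, mass):
    """Keep tokens in the given order until their cumulative probability reaches mass."""
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= mass:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def top_p_filter(probs, p):
    """Nucleus: the most probable tokens whose cumulative mass reaches p, renormalised."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _take_until(probs, order, p)

def typical_filter(probs, mass):
    """Locally typical: tokens whose surprisal -log p is closest to the entropy come first."""
    entropy = -sum(q * math.log(q) for q in probs if q > 0.0)
    candidates = [i for i, q in enumerate(probs) if q > 0.0]
    order = sorted(candidates, key=lambda i: abs(-math.log(probs[i]) - entropy))
    return _take_until(probs, order, mass)

def sample(filtered, rng):
    """Draw one token index from a {index: probability} mapping."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

if __name__ == "__main__":
    probs = [0.5, 0.3, 0.15, 0.05]
    print(top_p_filter(probs, 0.75))
    print(typical_filter(probs, 0.2))
    print(sample(top_p_filter(probs, 0.75), random.Random(0)))
```

On this toy distribution the nucleus filter keeps the two most probable tokens. The typical filter with a small mass keeps only the second token, because its surprisal lies closest to the distribution's entropy. The two rules rank candidates differently, which is the main point of the typical-sampling paper.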

9. State-Space Models

Alternatives to the attention mechanism based on structured state-space representations.

#	Citation	Link	Relevance to ZigLlama
23	Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv preprint.	arXiv:2312.00752	Introduces the Mamba architecture with selective scan, achieving linear-time sequence processing; ZigLlama implements the Mamba model as one of its 18 supported architectures, including the selective-scan mechanism.

10. Mixture of Experts

Sparse architectures that activate a subset of model parameters per token.

#	Citation	Link	Relevance to ZigLlama
24	Fedus, W., Zoph, B., & Shazeer, N. (2021). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." arXiv preprint.	arXiv:2101.03961	Proposes the Switch Transformer, a simplified MoE design with a single-expert routing strategy; ZigLlama's Mixture-of-Experts module implements top-k expert routing as described in this lineage.
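
A minimal Python sketch of top-k expert routing follows. It is not ZigLlama's Mixture-of-Experts module. The function name is hypothetical, and the gate weights here are a softmax over the selected logits only, which is one common convention. The Switch Transformer itself routes to a single expert and scales its output by the router probability computed over all experts.

```python
import math

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and softmax-normalise their logits.

    Returns a list of (expert_index, gate_weight) pairs whose weights sum to 1.
    Ties are broken in favour of the lower expert index.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in order)
    exps = [math.exp(router_logits[i] - m) for i in order]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(order, exps)]

if __name__ == "__main__":
    print(route_top_k([0.1, 2.0, -1.0, 2.0], 2))
    print(route_top_k([1.0, 3.0, 2.0], 1))
```

The token's output is then the gate-weighted sum of the selected experts' outputs, so only k of the expert feed-forward networks run for that token.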

11. Multi-Modal Models

Architectures that process both visual and textual inputs.

#	Citation	Link	Relevance to ZigLlama
25	Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). "Learning Transferable Visual Models From Natural Language Supervision." ICML 2021.	arXiv:2103.00020	Introduces CLIP, which aligns image and text representations through contrastive learning; provides the vision-encoder foundation used in ZigLlama's multi-modal architecture support.
26	Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). "Visual Instruction Tuning." NeurIPS 2023.	arXiv:2304.08485	Proposes LLaVA, a vision-language model that connects a CLIP vision encoder to a LLaMA language model via a projection layer; ZigLlama's multi-modal module implements this vision-language architecture.

12. Model-Specific Papers

Publications for specific model architectures supported by ZigLlama beyond the LLaMA family.

#	Citation	Link	Relevance to ZigLlama
27	Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Le Scao, T., Lavril, T., Wang, T., Lacroix, T., & El Sayed, W. (2023). "Mistral 7B." arXiv preprint.	arXiv:2310.06825	Introduces Mistral 7B with sliding window attention and grouped-query attention; ZigLlama implements the Mistral architecture including its windowed attention variant.
28	Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., & Launay, J. (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only." NeurIPS 2023 Datasets and Benchmarks Track.	arXiv:2306.01116	Describes RefinedWeb, the web-only dataset used to train the Falcon models; ZigLlama implements the Falcon architecture with multi-query attention as one of its supported model families.
29	Gunasekar, S., Zhang, Y., Aneja, J., Mendes, C. C. T., Del Giorno, A., Gopi, S., Javaheripi, M., Kauffmann, P., de Rosa, G., Saarikivi, O., et al. (2023). "Textbooks Are All You Need." arXiv preprint.	arXiv:2306.11644	Introduces the Phi model family trained on high-quality "textbook" data, achieving strong performance at small scale; ZigLlama implements the Phi architecture with its partial RoPE and dense attention configuration.
30	Li, R., Allal, L. B., Zi, Y., Muennighoff, N., Kocetkov, D., Mou, C., Marone, M., Akiki, C., Li, J., Chim, J., et al. (2023). "StarCoder: May the Source Be with You!" arXiv preprint.	arXiv:2305.06161	Describes StarCoder, a code-generation model trained on permissively licensed source code; ZigLlama supports the StarCoder architecture with its multi-query attention and fill-in-the-middle capabilities.
31	Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilic, S., Hesslow, D., Castagne, R., Luccioni, A. S., Yvon, F., Galle, M., et al. (2022). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model." arXiv preprint.	arXiv:2211.05100	Introduces BLOOM, a multilingual model using ALiBi positional encodings and LayerNorm; ZigLlama implements the BLOOM architecture with ALiBi as one of its supported model families.
32	Gemma Team, Google DeepMind. (2024). "Gemma: Open Models Based on Gemini Research and Technology." arXiv preprint.	arXiv:2403.08295	Describes the Gemma family of lightweight open models derived from Gemini research; ZigLlama implements the Gemma architecture with its GeGLU activation and RoPE configuration.

Citation Index

The following table is a quick reference, sorted by first-author surname, for locating a specific paper above.

First Author	Year	Short Title	Section
Ainslie	2023	GQA	6. Efficient Attention
Ba	2016	Layer Normalization	5. Normalization
Basu	2021	Mirostat	8. Sampling and Decoding
Chee	2023	QuIP	7. Quantization
Dao	2022	FlashAttention	6. Efficient Attention
Devlin	2019	BERT	1. Foundational Transformers
Fedus	2021	Switch Transformers	10. Mixture of Experts
Frantar	2022	GPTQ	7. Quantization
Gemma Team	2024	Gemma	12. Model-Specific Papers
Gu	2023	Mamba	9. State-Space Models
Gunasekar	2023	Phi	12. Model-Specific Papers
Hendrycks	2016	GELU	4. Activation Functions
Holtzman	2020	Nucleus Sampling	8. Sampling and Decoding
Jiang	2023	Mistral 7B	12. Model-Specific Papers
Li	2023	StarCoder	12. Model-Specific Papers
Lin	2023	AWQ	7. Quantization
Liu	2023	LLaVA	11. Multi-Modal Models
Meister	2023	Typical Decoding	8. Sampling and Decoding
Penedo	2023	Falcon	12. Model-Specific Papers
Peng	2023	YaRN	3. Positional Encodings
Press	2021	ALiBi	3. Positional Encodings
Radford (2019)	2019	GPT-2	1. Foundational Transformers
Radford (2021)	2021	CLIP	11. Multi-Modal Models
Ramachandran	2017	Swish	4. Activation Functions
Scao	2022	BLOOM	12. Model-Specific Papers
Shazeer (2019)	2019	MQA	6. Efficient Attention
Shazeer (2020)	2020	GLU Variants	4. Activation Functions
Su	2021	RoPE / RoFormer	3. Positional Encodings
Touvron (Feb 2023)	2023	LLaMA	2. LLaMA Family
Touvron (Jul 2023)	2023	Llama 2	2. LLaMA Family
Vaswani	2017	Attention Is All You Need	1. Foundational Transformers
Zhang	2019	RMSNorm	5. Normalization