Modelfile Format¶

Mullama supports creating custom model configurations using two file formats:

Modelfile -- Compatible with Ollama's Modelfile format
Mullamafile -- Extended format with Mullama-specific directives

Both formats use a Dockerfile-like syntax with directives that configure model behavior, parameters, and metadata.

Quick Example¶

FROM llama3.2:1b

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER top_p 0.9

SYSTEM """
You are a helpful coding assistant specialized in Rust programming.
Always provide code examples and explain your reasoning.
"""

Create and use this model:

# Save the above as ./Modelfile, then:
mullama create rust-assistant -f ./Modelfile

# Use it
mullama run rust-assistant "How do I read a file in Rust?"
mullama serve --model rust-assistant

Directives Reference¶

FROM (Required)¶

Specifies the base model. This can be an alias, a local path, or a HuggingFace spec.

# Using a model alias
FROM llama3.2:1b

# Using a local GGUF file
FROM ./models/my-model.gguf

# Using an absolute path
FROM /opt/models/custom-model.gguf

# Using a HuggingFace spec
FROM hf:Qwen/Qwen2.5-7B-Instruct-GGUF

# Pin to specific commit for reproducibility
FROM hf:Qwen/Qwen2.5-7B-Instruct-GGUF@a1b2c3d

# With specific quantization at a commit
FROM hf:meta-llama/Llama-3.2-1B-Instruct-GGUF:Q4_K_M@abc123def

Base Model Resolution

When using an alias, Mullama resolves it through the model registry. When using hf: prefix, it downloads directly from HuggingFace. Local paths must point to valid GGUF files.

PARAMETER¶

Set model parameters as key-value pairs. Multiple PARAMETER directives can be specified.

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192
PARAMETER num_predict 512
PARAMETER repeat_penalty 1.1
PARAMETER seed 42

Supported Parameters:

Parameter	Type	Range	Default	Description
`temperature`	float	0.0-2.0	0.7	Sampling temperature. Lower = more deterministic
`top_p`	float	0.0-1.0	0.9	Nucleus sampling threshold
`top_k`	int	1+	40	Top-k sampling candidates
`num_ctx`	int	128+	4096	Context window size in tokens
`num_predict`	int	-1+	512	Maximum tokens to generate (-1 = unlimited)
`repeat_penalty`	float	0.0+	1.1	Repetition penalty factor
`presence_penalty`	float	-2.0-2.0	0.0	Presence penalty
`frequency_penalty`	float	-2.0-2.0	0.0	Frequency penalty
`seed`	int	--	random	Random seed for reproducibility
`stop`	string	--	--	Stop sequence (can specify multiple)
`num_batch`	int	1+	512	Batch size for prompt processing
`num_thread`	int	1+	auto	Number of CPU threads

SYSTEM¶

Define the system prompt. Use triple quotes for multi-line content.

SYSTEM """
You are a helpful assistant. You provide clear, concise answers
and always ask for clarification when a question is ambiguous.
"""

Or single-line:

SYSTEM "You are a helpful coding assistant."

System Prompt Best Practices

Keep system prompts focused and specific
Define the role, tone, and any constraints
Include examples of desired output format if relevant
System prompts consume context tokens, so balance detail with token budget

TEMPLATE¶

Define the chat template format used for prompt construction. Template variables are wrapped in {{ }}.

TEMPLATE """
{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"""

Template Variables:

Variable	Description
`{{ .System }}`	The system prompt
`{{ .Prompt }}`	The user's message
`{{ .Response }}`	The assistant's response (for multi-turn)
`{{ if .System }}...{{ end }}`	Conditional rendering

Template Auto-Detection

Most models include a chat template in their GGUF metadata. You only need to specify TEMPLATE when overriding the built-in template or using a model without one.

MESSAGE¶

Pre-define conversation messages for few-shot examples.

MESSAGE user "What is 2+2?"
MESSAGE assistant "2+2 equals 4."
MESSAGE user "And 3+3?"
MESSAGE assistant "3+3 equals 6."

Few-shot messages are prepended to every conversation, giving the model examples of desired behavior.

Multiple Stop Sequences¶

Models can have multiple stop sequences defined:

PARAMETER stop "<|im_start|>"
PARAMETER stop "<|im_end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|end|>"

Mullamafile Extensions¶

The Mullamafile format supports all standard Modelfile directives plus Mullama-specific extensions for GPU configuration, adapters, vision, and more.

GPU_LAYERS¶

Number of layers to offload to GPU for faster inference.

GPU_LAYERS 35

Set to 99 to offload all layers (will cap at the model's actual layer count):

GPU_LAYERS 99

Finding the Right GPU Layers

Start with a high number and reduce if you get out-of-memory errors. A 7B model typically has 32-35 layers. Use mullama ps to see current GPU layer allocation.

FLASH_ATTENTION¶

Enable flash attention for faster inference and lower memory usage.

FLASH_ATTENTION true

Flash attention is particularly beneficial for longer context lengths and requires GPU offloading.

ADAPTER¶

Path to a LoRA adapter file for fine-tuned behavior without modifying the base model.

ADAPTER ./my-lora-adapter.safetensors

Multiple adapters can be stacked:

ADAPTER ./style-adapter.safetensors
ADAPTER ./domain-adapter.safetensors

VISION_PROJECTOR¶

Path to the multimodal projector for vision models. Required for models that accept image input.

VISION_PROJECTOR ./mmproj-model-f16.gguf

DIGEST¶

SHA256 hash for content-addressed verification of the base model file.

DIGEST sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

When loading, Mullama verifies the model file matches this hash and raises an error on mismatch. This ensures reproducible deployments.

THINKING¶

Configure thinking/reasoning token detection and separation. Used for models like DeepSeek-R1 that output chain-of-thought reasoning.

THINKING start "<think>"
THINKING end "</think>"
THINKING enabled true

When enabled, streaming responses include separate thinking content in the delta, allowing UIs to display reasoning in a collapsible block.

TOOLFORMAT¶

Configure tool/function calling format for models that support structured tool use.

TOOLFORMAT style qwen
TOOLFORMAT call_start "<tool_call>"
TOOLFORMAT call_end "</tool_call>"
TOOLFORMAT result_start "<tool_response>"
TOOLFORMAT result_end "</tool_response>"

Supported Styles:

Style	Models	Description
`qwen`	Qwen 2.5 series	Qwen tool calling format
`llama`	Llama 3.1+	Llama function calling format
`mistral`	Mistral models	Mistral tool use format
`generic`	Any model	Generic XML-based format

CAPABILITY¶

Declare model capabilities. These flags are exposed via the /v1/models endpoint for client introspection.

CAPABILITY json true
CAPABILITY tools true
CAPABILITY thinking true
CAPABILITY vision false
CAPABILITY code true

LICENSE¶

License metadata for the model configuration.

LICENSE MIT

Or multi-line:

LICENSE """
Apache License 2.0
Copyright (c) 2025 Your Organization
"""

AUTHOR¶

Author metadata.

AUTHOR "Your Name <your@email.com>"

Complete Directive Reference¶

Directive	Format	Modelfile	Mullamafile	Description
`FROM`	`FROM <model>`	Yes	Yes	Base model (required)
`PARAMETER`	`PARAMETER <key> <value>`	Yes	Yes	Model parameter
`SYSTEM`	`SYSTEM """..."""`	Yes	Yes	System prompt
`TEMPLATE`	`TEMPLATE """..."""`	Yes	Yes	Chat template
`MESSAGE`	`MESSAGE <role> "<text>"`	Yes	Yes	Pre-defined message
`GPU_LAYERS`	`GPU_LAYERS <N>`	--	Yes	GPU layer offloading
`FLASH_ATTENTION`	`FLASH_ATTENTION <bool>`	--	Yes	Flash attention
`ADAPTER`	`ADAPTER <path>`	--	Yes	LoRA adapter path
`VISION_PROJECTOR`	`VISION_PROJECTOR <path>`	--	Yes	Vision projector
`DIGEST`	`DIGEST sha256:<hash>`	--	Yes	SHA256 verification
`THINKING`	`THINKING <key> <value>`	--	Yes	Thinking tokens
`TOOLFORMAT`	`TOOLFORMAT <key> <value>`	--	Yes	Tool calling format
`CAPABILITY`	`CAPABILITY <name> <bool>`	--	Yes	Capability flags
`LICENSE`	`LICENSE <text>`	--	Yes	License metadata
`AUTHOR`	`AUTHOR "<name>"`	--	Yes	Author metadata

Example Configurations¶

General Assistant¶

FROM llama3.2:3b

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
PARAMETER num_predict 1024

SYSTEM """
You are a helpful, harmless, and honest assistant. You provide
thoughtful and accurate responses to user questions. When uncertain,
you say so rather than guessing.
"""

Code Assistant¶

FROM qwen2.5-coder:7b

PARAMETER temperature 0.3
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
PARAMETER num_predict 2048
PARAMETER stop "<|endoftext|>"

SYSTEM """
You are an expert programming assistant. You write clean, efficient,
and well-documented code. When providing solutions:
1. Explain your approach briefly
2. Write the complete code
3. Add inline comments for complex logic
4. Suggest potential improvements or edge cases
"""

GPU_LAYERS 35
FLASH_ATTENTION true
CAPABILITY tools true
CAPABILITY code true

Reasoning Model¶

FROM deepseek-r1:7b

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER num_ctx 8192
PARAMETER num_predict 4096

THINKING start "<think>"
THINKING end "</think>"
THINKING enabled true

CAPABILITY thinking true
CAPABILITY json true

SYSTEM """
You are a reasoning assistant. Think through problems step by step
before providing your final answer. Show your work clearly.
"""

GPU_LAYERS 35

Vision Model¶

FROM llava:7b

PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER num_predict 1024

VISION_PROJECTOR ./mmproj-model-f16.gguf

CAPABILITY vision true

SYSTEM """
You are a vision-language assistant. Describe images accurately
and answer questions about visual content with detail and precision.
"""

GPU_LAYERS 35

Tool-Calling Model¶

FROM qwen2.5:7b

PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER stop "<|im_end|>"

TOOLFORMAT style qwen
TOOLFORMAT call_start "<tool_call>"
TOOLFORMAT call_end "</tool_call>"
TOOLFORMAT result_start "<tool_response>"
TOOLFORMAT result_end "</tool_response>"

CAPABILITY tools true
CAPABILITY json true

SYSTEM """
You are a helpful assistant with access to tools. When you need
information that a tool can provide, use the appropriate tool.
Format your tool calls as structured JSON.
"""

GPU_LAYERS 35

Creative Writer¶

FROM llama3.1:8b

PARAMETER temperature 1.2
PARAMETER top_p 0.95
PARAMETER top_k 100
PARAMETER repeat_penalty 1.2
PARAMETER num_ctx 8192
PARAMETER num_predict 4096

SYSTEM """
You are a creative writing assistant. You write vivid, engaging prose
with rich descriptions and compelling narratives. You adapt your style
to match the genre and tone requested by the user.
"""

Minimal JSON API¶

FROM qwen2.5:7b

PARAMETER temperature 0.1
PARAMETER num_ctx 4096
PARAMETER num_predict 2048
PARAMETER stop "```"

SYSTEM """
You are a JSON API. You respond only with valid JSON. No explanations,
no markdown, no code fences. Just pure JSON output matching the
requested schema.
"""

CAPABILITY json true

Revision Pinning¶

Pin models to specific HuggingFace commits for reproducible deployments:

# Pin to a specific commit hash
FROM hf:Qwen/Qwen2.5-7B-Instruct-GGUF@a1b2c3d

# Pin with specific quantization
FROM hf:meta-llama/Llama-3.2-1B-Instruct-GGUF:Q4_K_M@abc123def

This ensures that the exact same model weights are used regardless of upstream updates. Essential for production deployments.

Content-Addressed Verification¶

Verify model integrity using SHA256 digests:

FROM hf:Qwen/Qwen2.5-7B-Instruct-GGUF
DIGEST sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

Verification happens automatically when loading. If the digest does not match the file on disk, an error is raised:

Error: Model digest mismatch
  Expected: sha256:e3b0c44298fc1c149...
  Got:      sha256:7f83b1657ff1fc53b...
  File:     ~/.cache/mullama/models/Qwen/...

CLI Commands¶

# Create model from Modelfile in current directory
mullama create my-assistant

# Create from specific file
mullama create my-assistant -f ./Modelfile

# Create from Mullamafile
mullama create my-assistant -f ./Mullamafile

# Show model details
mullama show my-assistant

# Show the Modelfile content
mullama show my-assistant --modelfile

# Show parameters only
mullama show my-assistant --parameters

# Copy/rename a model
mullama cp my-assistant my-assistant-v2

# Remove a model
mullama rm my-assistant

File Discovery¶

When no -f flag is provided to mullama create, the tool looks for files in this order:

./Mullamafile
./Modelfile

If neither exists, an error is returned.

Ollama Compatibility¶

Mullama's Modelfile format is fully compatible with Ollama's Modelfile specification. You can use existing Ollama Modelfiles without modification:

# Use an Ollama Modelfile directly
mullama create my-model -f ./Modelfile

Key compatibility notes:

All standard Ollama directives (FROM, PARAMETER, SYSTEM, TEMPLATE, MESSAGE) work identically
The FROM directive accepts both Ollama-style model references and Mullama aliases
Stop sequences use the same syntax: PARAMETER stop "<token>"
Template variables use the same Go-template syntax

Mullama extensions (GPU_LAYERS, FLASH_ATTENTION, ADAPTER, VISION_PROJECTOR, THINKING, TOOLFORMAT, CAPABILITY, DIGEST, AUTHOR) are ignored by Ollama if the file is used there.

Comments¶

Lines starting with # are treated as comments:

# This is my custom model configuration
# Optimized for code generation tasks

FROM qwen2.5-coder:7b

# Use lower temperature for more deterministic code output
PARAMETER temperature 0.3

# Extended context for large code files
PARAMETER num_ctx 16384

SYSTEM """
You are a code assistant.
"""

# GPU acceleration for faster inference
GPU_LAYERS 35