Migration from Ollama

This guide provides a complete walkthrough for migrating from Ollama to Mullama. The transition is designed to be straightforward -- Mullama intentionally provides CLI and API compatibility with Ollama while extending functionality significantly.

Key Migration Facts

  • Both use the GGUF model format -- your models work without conversion
  • CLI commands use the same syntax and semantics
  • Modelfile format is fully backward-compatible
  • Migration can be done incrementally -- run both side by side

CLI Command Mapping

Mullama provides equivalent commands for all Ollama operations, with the same syntax and behavior:

Core Commands

| Ollama Command | Mullama Command | Behavior |
|---|---|---|
| ollama run <model> "prompt" | mullama run <model> "prompt" | Identical syntax, auto-spawns daemon |
| ollama serve | mullama serve | Extended options (port, model, GPU layers) |
| ollama pull <model> | mullama pull <model> | Same alias resolution, downloads from registry |
| ollama list | mullama list | Same output format |
| ollama show <model> | mullama show <model> | Additional flags available |
| ollama show --modelfile | mullama show --modelfile | Compatible format output |
| ollama create -f Modelfile | mullama create -f Modelfile | Extended Modelfile support |
| ollama rm <model> | mullama rm <model> | Identical syntax |
| ollama cp <src> <dst> | mullama cp <src> <dst> | Identical syntax |
| ollama ps | mullama ps | Shows running models |
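
For scripts that already shell out to ollama, the mapping above suggests the rewrite is mostly a change of binary name. The following is a minimal sketch under that assumption; `translate_command` and `SHARED_SUBCOMMANDS` are hypothetical names used for illustration and are not part of either tool.

```python
import shlex

# Subcommands listed in the mapping table above as having a Mullama equivalent.
SHARED_SUBCOMMANDS = {"run", "serve", "pull", "list", "show", "create", "rm", "cp", "ps"}


def translate_command(command: str) -> str:
    """Rewrite an `ollama ...` command line into its `mullama ...` form.

    Raises ValueError for anything that is not an ollama command with a
    subcommand from the table, so unknown cases get reviewed by hand.
    """
    parts = shlex.split(command)
    if len(parts) < 2 or parts[0] != "ollama":
        raise ValueError(f"not an ollama command: {command!r}")
    if parts[1] not in SHARED_SUBCOMMANDS:
        raise ValueError(f"no mapping listed for subcommand {parts[1]!r}")
    # shlex.join re-quotes arguments that contain spaces or shell metacharacters.
    return shlex.join(["mullama"] + parts[1:])


print(translate_command("ollama pull llama3.2:1b"))
```

Arguments are passed through unchanged, so flags that only one tool understands still need a manual check.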

Additional Mullama Commands

Commands not available in Ollama:

# Daemon lifecycle management
mullama daemon start              # Start daemon in background
mullama daemon stop               # Stop running daemon
mullama daemon restart            # Restart daemon
mullama daemon status             # Show daemon health, loaded models, memory
mullama daemon logs               # View daemon logs
mullama daemon logs -f            # Follow logs in real-time

# Interactive TUI chat
mullama chat                      # Full TUI chat interface
mullama chat --model llama3.2:1b  # Specify model for TUI

# Extended serve options
mullama serve --model llama3.2:1b --http-port 9090 --gpu-layers 35
mullama serve --flash-attention auto --context-size 8192

Command Examples Side by Side

# Pull a model
ollama pull llama3.2:1b          # Ollama
mullama pull llama3.2:1b         # Mullama (identical)

# List models
ollama list                      # Ollama
mullama list                     # Mullama (identical)

# Show model info
ollama show llama3.2:1b          # Ollama
mullama show llama3.2:1b         # Mullama (identical)

# Remove a model
ollama rm llama3.2:1b            # Ollama
mullama rm llama3.2:1b           # Mullama (identical)

# Copy/alias a model
ollama cp llama3.2:1b my-model   # Ollama
mullama cp llama3.2:1b my-model  # Mullama (identical)

# One-shot generation
ollama run llama3.2:1b "Hello!"          # Ollama
mullama run llama3.2:1b "Hello!"         # Mullama (identical)

# Start server
ollama serve                              # Ollama
mullama serve --model llama3.2:1b         # Mullama (model specified)

# Create from Modelfile
ollama create my-model -f Modelfile       # Ollama
mullama create my-model -f Modelfile      # Mullama (identical)

# Show running models
ollama ps                         # Ollama
mullama ps                        # Mullama (identical)

# Background daemon (Mullama-only)
mullama daemon start
mullama daemon status
mullama daemon stop

API Endpoint Mapping

Ollama API to Mullama API

Mullama provides both Ollama-compatible and OpenAI-compatible endpoints:

| Ollama Endpoint | Mullama Equivalent | Format | Notes |
|---|---|---|---|
| POST /api/generate | POST /v1/completions | OpenAI | Text completions |
| POST /api/chat | POST /v1/chat/completions | OpenAI | Chat completions |
| POST /api/embeddings | POST /v1/embeddings | OpenAI | Embedding generation |
| GET /api/tags | GET /v1/models | OpenAI | List models |
| POST /api/pull | POST /api/models/pull | Mullama | Download model |
| POST /api/show | GET /api/models/<name> | Mullama | Model details |
| DELETE /api/delete | DELETE /api/models/<name> | Mullama | Remove model |
| -- | POST /v1/messages | Anthropic | Anthropic-compatible |
| -- | GET /metrics | Prometheus | Monitoring metrics |
| -- | WS /ws/chat | WebSocket | Bidirectional streaming |
| -- | GET /api/system/status | Mullama | System health |
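
When an existing client hard-codes Ollama routes, the mapping above can be kept in one lookup table. This is a sketch only: `ENDPOINT_MAP` and `map_endpoint` are hypothetical names, the base URL is the default port used elsewhere in this guide, and percent-encoding the model name in the path is an assumption rather than documented Mullama behavior.

```python
from urllib.parse import quote

# Ollama (method, path) -> Mullama (method, path template), copied from the table above.
ENDPOINT_MAP = {
    ("POST", "/api/generate"): ("POST", "/v1/completions"),
    ("POST", "/api/chat"): ("POST", "/v1/chat/completions"),
    ("POST", "/api/embeddings"): ("POST", "/v1/embeddings"),
    ("GET", "/api/tags"): ("GET", "/v1/models"),
    ("POST", "/api/pull"): ("POST", "/api/models/pull"),
    ("POST", "/api/show"): ("GET", "/api/models/{name}"),
    ("DELETE", "/api/delete"): ("DELETE", "/api/models/{name}"),
}


def map_endpoint(method, path, model_name=None, base_url="http://localhost:8080"):
    """Return the (method, url) to call on Mullama for a given Ollama route."""
    key = (method.upper(), path)
    if key not in ENDPOINT_MAP:
        raise KeyError(f"no Mullama equivalent listed for {method} {path}")
    new_method, template = ENDPOINT_MAP[key]
    if "{name}" in template:
        if not model_name:
            raise ValueError("this route needs the model name in the URL path")
        # Assumption: ':' is kept literal, everything else unsafe is percent-encoded.
        template = template.replace("{name}", quote(model_name, safe=":"))
    return new_method, base_url.rstrip("/") + template


print(map_endpoint("POST", "/api/chat"))
```

Note that two routes change shape as well as path: show and delete move the model name from the request body into the URL, and show changes from POST to GET.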

Mullama Management API

Additional management endpoints:

| Endpoint | Method | Description |
|---|---|---|
| /api/models | GET | List all available models |
| /api/models/pull | POST | Download a model by alias |
| /api/models/<name> | GET | Get model details and metadata |
| /api/models/<name> | DELETE | Remove a model from cache |
| /api/system/status | GET | System status, memory, GPU info |
| /metrics | GET | Prometheus-format metrics |

Request Format Comparison

# Ollama format
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
  ]
}'

# Mullama format (OpenAI-compatible)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2:1b",
  "messages": [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 256,
  "temperature": 0.7
}'

# Ollama format
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Once upon a time",
  "stream": false
}'

# Mullama format (OpenAI-compatible)
curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2:1b",
  "prompt": "Once upon a time",
  "max_tokens": 256,
  "stream": false
}'

# Ollama format
curl http://localhost:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Hello world"
}'

# Mullama format (OpenAI-compatible)
curl http://localhost:8080/v1/embeddings -H "Content-Type: application/json" -d '{
  "model": "nomic-embed-text",
  "input": "Hello world"
}'
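
The request bodies above differ in a few field names, so existing payloads can be converted mechanically. The sketch below assumes Ollama's `options.temperature` and `options.num_predict` correspond to `temperature` and `max_tokens`; the function names are hypothetical, and model names are passed through unchanged even though the examples use different aliases.

```python
def _common(body: dict) -> dict:
    """Fields shared by the chat and completion conversions."""
    options = body.get("options", {})
    out = {"model": body["model"]}
    # Ollama streams by default when `stream` is omitted; OpenAI-style APIs
    # do not, so the default is made explicit to keep behavior the same.
    out["stream"] = body.get("stream", True)
    if "temperature" in options:
        out["temperature"] = options["temperature"]
    if "num_predict" in options:
        out["max_tokens"] = options["num_predict"]
    return out


def ollama_chat_to_openai(body: dict) -> dict:
    """Convert an /api/chat body to a /v1/chat/completions body."""
    out = _common(body)
    out["messages"] = body["messages"]
    return out


def ollama_generate_to_openai(body: dict) -> dict:
    """Convert an /api/generate body to a /v1/completions body."""
    out = _common(body)
    out["prompt"] = body["prompt"]
    return out


def ollama_embeddings_to_openai(body: dict) -> dict:
    """Convert an /api/embeddings body: `prompt` becomes `input`."""
    return {"model": body["model"], "input": body["prompt"]}


print(ollama_embeddings_to_openai({"model": "nomic-embed-text", "prompt": "Hello world"}))
```

Other Ollama options (for example `top_k` or `num_ctx`) are dropped by this sketch; check which of them the target endpoint accepts before forwarding them.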

# Ollama format (newline-delimited JSON)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true
}'
# Response: {"message":{"content":"Hi"},"done":false}\n

# Mullama format (Server-Sent Events)
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.2:1b",
  "messages": [{"role": "user", "content": "Hello"}],
  "stream": true
}'
# Response: data: {"choices":[{"delta":{"content":"Hi"}}]}\n\n
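
The two streaming formats need different line parsers on the client side. The sketch below handles the two response shapes shown above; the function names are hypothetical, and the `data: [DONE]` terminator is the OpenAI convention, assumed here rather than confirmed for Mullama.

```python
import json


def parse_ollama_line(line: str):
    """Parse one newline-delimited JSON line. Returns (text, done)."""
    event = json.loads(line)
    text = event.get("message", {}).get("content", "")
    return text, bool(event.get("done", False))


def parse_sse_line(line: str):
    """Parse one Server-Sent Events line. Returns (text, done)."""
    line = line.strip()
    if not line.startswith("data:"):
        # Blank separator lines and SSE comments carry no content.
        return "", False
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":  # assumed end-of-stream marker
        return "", True
    event = json.loads(payload)
    choices = event.get("choices") or [{}]
    delta = choices[0].get("delta", {})
    return delta.get("content") or "", False


print(parse_ollama_line('{"message":{"content":"Hi"},"done":false}'))
print(parse_sse_line('data: {"choices":[{"delta":{"content":"Hi"}}]}'))
```

Both parsers return the same `(text, done)` pair, so the calling loop can stay identical while the transport is switched.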

# Mullama Anthropic-compatible endpoint
curl http://localhost:8080/v1/messages -H "Content-Type: application/json" -d '{
  "model": "llama3.2:1b",
  "max_tokens": 1024,
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'
# Response follows Anthropic message format

Modelfile Compatibility

Full Backward Compatibility

Mullama reads the Ollama Modelfile format without modification, so your existing Modelfiles work as-is. Mullama also supports an extended format with additional directives.

Standard Modelfile (Works in Both)

FROM llama3.2:1b

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192
PARAMETER repeat_penalty 1.1

SYSTEM """
You are a helpful coding assistant. You provide clear,
concise explanations and working code examples.
"""

TEMPLATE """{{ .System }}

{{ .Prompt }}"""

LICENSE """MIT License..."""

Mullama Extended Modelfile (Mullamafile)

Mullama extends the format with additional directives. Ollama does not recognize them, so a file that must load in both tools should stick to the standard directives:

FROM llama3.2:1b

# === Standard Directives (Ollama-compatible) ===
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 8192
SYSTEM """You are a helpful assistant."""
TEMPLATE """{{ .System }}\n{{ .Prompt }}"""

# === Mullama Extensions ===

# GPU configuration
GPU_LAYERS 35
FLASH_ATTENTION auto

# Model adaptation
ADAPTER ./my-lora-adapter.gguf

# Revision pinning for reproducibility
# FROM hf:meta-llama/Llama-3.2-1B-Instruct-GGUF@abc123def456

# Content verification
DIGEST sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# Thinking token configuration (for reasoning models)
THINKING start "<think>"
THINKING end "</think>"
THINKING enabled true

# Tool calling format
TOOLFORMAT style qwen
TOOLFORMAT call_start "<tool_call>"
TOOLFORMAT call_end "</tool_call>"
TOOLFORMAT result_start "<tool_response>"
TOOLFORMAT result_end "</tool_response>"

# Capability declarations
CAPABILITY json true
CAPABILITY tools true
CAPABILITY thinking true
CAPABILITY vision false

# Vision projector (multimodal models)
VISION_PROJECTOR ./mmproj-model.gguf

# Metadata
AUTHOR "Your Name"
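
The DIGEST directive pins a model file to a SHA-256 hash. How Mullama performs this check internally is not shown in this guide; the sketch below only illustrates the idea of comparing a file against a `sha256:<hex>` string, and `matches_digest` is a hypothetical helper.

```python
import hashlib


def matches_digest(path: str, digest: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` hashes to `digest` ("sha256:<hex>")."""
    algo, _, expected = digest.partition(":")
    if algo != "sha256" or not expected:
        raise ValueError(f"unsupported digest format: {digest!r}")
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte model files never sit in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == expected.lower()
```

The digest in the example above is the SHA-256 of empty input, so treat it as a placeholder and substitute the hash of your actual model file.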

Extended Directives Reference

| Directive | Purpose | Ollama Equivalent |
|---|---|---|
| GPU_LAYERS <n> | Number of layers to offload to GPU | Not available |
| FLASH_ATTENTION <auto/true/false> | Flash attention mode | Not available |
| ADAPTER <path> | LoRA adapter file path | ADAPTER (limited) |
| VISION_PROJECTOR <path> | Multimodal vision projector path | Not available |
| DIGEST <sha256:hash> | Content integrity verification | Not available |
| THINKING start/end/enabled | Reasoning token boundaries | Not available |
| TOOLFORMAT style/call_start/... | Tool calling format specification | Not available |
| CAPABILITY <name> <bool> | Model capability flags | Not available |
| AUTHOR <name> | Author metadata | Not available |
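
If one source file has to serve a tool that only understands the standard directives, the extended lines can be filtered out before use. This is a simplified sketch, not a full Modelfile parser: `strip_extensions` is a hypothetical helper, the directive set is taken from the table above, and triple-quoted values are tracked only well enough to avoid touching their contents.

```python
# Directives marked "Not available" in the reference table above.
EXTENDED_DIRECTIVES = {
    "GPU_LAYERS", "FLASH_ATTENTION", "VISION_PROJECTOR", "DIGEST",
    "THINKING", "TOOLFORMAT", "CAPABILITY", "AUTHOR",
}


def strip_extensions(text: str) -> str:
    """Return the Modelfile text with Mullama-only directives removed."""
    kept = []
    in_block = False    # currently inside a multi-line triple-quoted value
    keep_block = True   # whether the directive owning that value is kept
    for line in text.splitlines():
        if in_block:
            if keep_block:
                kept.append(line)
            if '"""' in line:
                in_block = False
            continue
        stripped = line.strip()
        directive = ""
        if stripped and not stripped.startswith("#"):
            directive = stripped.split(None, 1)[0].upper()
        keep = directive not in EXTENDED_DIRECTIVES
        if keep:
            kept.append(line)
        # An odd number of triple quotes means the value continues on later lines.
        if stripped.count('"""') % 2 == 1:
            in_block = True
            keep_block = keep
    return "\n".join(kept) + "\n"
```

Comments and blank lines are preserved, and ADAPTER is left in place because the table lists it for both tools.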

Model File Reuse

Same Format, Same Files

Both tools use GGUF (GPT-Generated Unified Format) model files. Any GGUF file works with both:

# Use an Ollama-downloaded model with Mullama
# Find where Ollama stores models:
ls ~/.ollama/models/blobs/

# Point Mullama at the file:
mullama run --model-path ~/.ollama/models/blobs/sha256-<hash>

# Or set cache directory:
export MULLAMA_CACHE_DIR=~/.ollama/models
mullama run llama3.2:1b "Hello"
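
The blobs directory may also hold files that are not model weights, and the hashed file names give no hint about their contents. GGUF files begin with the four magic bytes `GGUF`, so candidates for `--model-path` can be picked out by reading the header. A minimal sketch; `is_gguf` and `find_gguf_blobs` are hypothetical helper names.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # first four bytes of every GGUF file


def is_gguf(path) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC


def find_gguf_blobs(blob_dir):
    """List the regular files in `blob_dir` that look like GGUF models."""
    return sorted(p for p in Path(blob_dir).iterdir() if p.is_file() and is_gguf(p))
```

For example, `find_gguf_blobs(Path.home() / ".ollama/models/blobs")` returns the paths worth trying; the check only inspects the header and does not validate the rest of the file.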

Model Storage Locations

| Tool | Platform | Default Path |
|---|---|---|
| Ollama | Linux | ~/.ollama/models/ |
| Ollama | macOS | ~/.ollama/models/ |
| Ollama | Windows | %USERPROFILE%\.ollama\models\ |
| Mullama | Linux | ~/.cache/mullama/models/ |
| Mullama | macOS | ~/Library/Caches/mullama/models/ |
| Mullama | Windows | %LOCALAPPDATA%\mullama\models\ |
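
Migration scripts often need these locations programmatically. The sketch below encodes the table as a function; `default_model_dir` is a hypothetical helper, and it returns only the documented defaults without consulting any override variables.

```python
import os
import sys


def default_model_dir(tool: str, platform: str = sys.platform, env=os.environ) -> str:
    """Return the default model directory for `tool` ("ollama" or "mullama")."""
    home = env.get("HOME", "~")
    is_windows = platform.startswith("win")  # "win32"; note "darwin" does not match
    if tool == "ollama":
        if is_windows:
            return env.get("USERPROFILE", "%USERPROFILE%") + "\\.ollama\\models"
        return home + "/.ollama/models"
    if tool == "mullama":
        if is_windows:
            return env.get("LOCALAPPDATA", "%LOCALAPPDATA%") + "\\mullama\\models"
        if platform == "darwin":
            return home + "/Library/Caches/mullama/models"
        return home + "/.cache/mullama/models"
    raise ValueError(f"unknown tool: {tool!r}")


print(default_model_dir("mullama"))
```

Passing `platform` and `env` explicitly makes the function easy to test for an operating system other than the one it runs on.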

Same Download Sources

Both tools download quantized GGUF models, but from different registries: Ollama pulls from ollama.com, while Mullama resolves the same aliases to HuggingFace repositories:

# Ollama pulls from ollama.com registry
ollama pull llama3.2:1b

# Mullama pulls from HuggingFace (same model, same quantization)
mullama pull llama3.2:1b
# Resolves to: hf:meta-llama/Llama-3.2-1B-Instruct-GGUF

Code Migration Examples

Python: REST Client to Native Binding

# Using ollama-python (REST wrapper)
import ollama

response = ollama.chat(model='llama3.2', messages=[
    {'role': 'user', 'content': 'What is Python?'}
])
print(response['message']['content'])

# Streaming
stream = ollama.chat(model='llama3.2', messages=[
    {'role': 'user', 'content': 'Explain recursion'}
], stream=True)
for chunk in stream:
    print(chunk['message']['content'], end='')

# Using mullama (native PyO3 binding - no HTTP)
import mullama

model = mullama.Model("llama3.2-1b.gguf")
context = model.create_context(n_ctx=4096)

result = context.generate("What is Python?", max_tokens=256)
print(result)

# Streaming
for token in context.generate_stream("Explain recursion", max_tokens=256):
    print(token, end='', flush=True)

# Using standard OpenAI SDK pointed at Mullama daemon
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused"  # No auth required for local
)

response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "What is Python?"}],
    max_tokens=256
)
print(response.choices[0].message.content)

# Streaming
stream = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Explain recursion"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end='')

Node.js: REST Client to Native Binding

// Using ollama-js (REST wrapper)
import { Ollama } from 'ollama';

const ollama = new Ollama();
const response = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Hello!' }]
});
console.log(response.message.content);

// Streaming
const stream = await ollama.chat({
  model: 'llama3.2',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true
});
for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}

// Using mullama (native NAPI-RS binding - no HTTP)
const { Model } = require('mullama');

// CommonJS has no top-level await, so the calls run inside an async function.
async function main() {
  const model = new Model('llama3.2-1b.gguf');
  const context = model.createContext({ nCtx: 4096 });

  const result = await context.generate('Hello!', { maxTokens: 256 });
  console.log(result);

  // Streaming
  const stream = context.generateStream('Tell me a story', { maxTokens: 256 });
  stream.on('token', (token) => process.stdout.write(token));
  await stream.done();
}

main().catch(console.error);

// Using standard OpenAI SDK pointed at Mullama daemon
import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'unused'
});

const response = await client.chat.completions.create({
  model: 'llama3.2:1b',
  messages: [{ role: 'user', content: 'Hello!' }],
  max_tokens: 256
});
console.log(response.choices[0].message.content);

// Streaming
const stream = await client.chat.completions.create({
  model: 'llama3.2:1b',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true
});
for await (const chunk of stream) {
  process.stdout.write(chunk.choices[0]?.delta?.content || '');
}

Go: REST Client to Native Binding

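// Using ollama-go (REST wrapper)
package main

import (
    "context"
    "fmt"
    "log"

    "github.com/ollama/ollama/api"
)

func main() {
    client, err := api.ClientFromEnvironment()
    if err != nil {
        log.Fatal(err)
    }

    stream := false // ask for one complete response instead of chunks
    req := &api.ChatRequest{
        Model:    "llama3.2",
        Messages: []api.Message{{Role: "user", Content: "Hello!"}},
        Stream:   &stream,
    }

    // Chat delivers responses through a callback rather than a return value.
    err = client.Chat(context.Background(), req, func(resp api.ChatResponse) error {
        fmt.Println(resp.Message.Content)
        return nil
    })
    if err != nil {
        log.Fatal(err)
    }
}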

// Using mullama-go (native cgo binding - no HTTP)
package main

import (
    "fmt"
    "github.com/cognisoc/mullama"
)

func main() {
    model, _ := mullama.LoadModel("llama3.2-1b.gguf")
    defer model.Close()

    ctx, _ := model.CreateContext(mullama.ContextParams{NCtx: 4096})
    defer ctx.Close()

    result, _ := ctx.Generate("Hello!", mullama.GenerateParams{MaxTokens: 256})
    fmt.Println(result)
}

Environment Variable Differences

| Purpose | Ollama | Mullama | Notes |
|---|---|---|---|
| Server address | OLLAMA_HOST | --http-addr / --http-port | CLI flags preferred |
| Model storage | OLLAMA_MODELS | MULLAMA_CACHE_DIR | Cache directory |
| GPU layers | OLLAMA_NUM_GPU | --gpu-layers / GPU_LAYERS | CLI flag or Modelfile |
| HuggingFace auth | -- | HF_TOKEN | For gated model downloads |
| Binary path | -- | MULLAMA_BIN | For auto-spawn feature |
| Context size | OLLAMA_NUM_CTX | --context-size / num_ctx | CLI flag or param |
| Debug logging | OLLAMA_DEBUG | RUST_LOG=debug | Standard Rust logging |
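
Because several Ollama settings move from environment variables to CLI flags, an existing environment can be translated in one pass. The sketch below follows the table row by row; `translate_env` is a hypothetical helper, the flags are assumed to be passed to mullama serve, and bare IPv6 addresses in OLLAMA_HOST are not handled.

```python
def translate_env(env: dict):
    """Map Ollama environment variables to (mullama serve flags, new env vars)."""
    flags = []
    new_env = {}

    host = env.get("OLLAMA_HOST", "")
    if host:
        host = host.split("://", 1)[-1]  # drop an optional scheme such as http://
        addr, sep, port = host.rpartition(":")
        if sep and port.isdigit():
            if addr:
                flags += ["--http-addr", addr]
            flags += ["--http-port", port]
        else:
            flags += ["--http-addr", host]

    if env.get("OLLAMA_NUM_GPU"):
        flags += ["--gpu-layers", env["OLLAMA_NUM_GPU"]]
    if env.get("OLLAMA_NUM_CTX"):
        flags += ["--context-size", env["OLLAMA_NUM_CTX"]]
    if env.get("OLLAMA_MODELS"):
        new_env["MULLAMA_CACHE_DIR"] = env["OLLAMA_MODELS"]
    # Treat empty, "0", and "false" as debug logging switched off.
    if env.get("OLLAMA_DEBUG", "").lower() not in ("", "0", "false"):
        new_env["RUST_LOG"] = "debug"
    return flags, new_env


print(translate_env({"OLLAMA_HOST": "127.0.0.1:11434", "OLLAMA_DEBUG": "1"}))
```

If both tools will run side by side, drop the translated port so Mullama keeps its own default instead of competing for Ollama's.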

Default Port Difference

| Setting | Ollama | Mullama | Notes |
|---|---|---|---|
| HTTP port | 11434 | 8080 | No conflict when running both |
| Bind address | 127.0.0.1 | 0.0.0.0 | Mullama accessible from network by default |

Network Binding

Mullama binds to 0.0.0.0 by default, which makes it reachable from other machines on the network. For local-only access, pass --http-addr 127.0.0.1 or set the address in your daemon config file.
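
A deployment script can check a configured bind address before starting the daemon. This sketch uses Python's standard `ipaddress` module to classify an address; `bind_scope` and its three labels are hypothetical names chosen for illustration.

```python
import ipaddress


def bind_scope(addr: str) -> str:
    """Classify a bind address by how widely it exposes the server."""
    ip = ipaddress.ip_address(addr)
    if ip.is_loopback:
        return "local-only"        # e.g. 127.0.0.1 or ::1
    if ip.is_unspecified:
        return "all-interfaces"    # e.g. 0.0.0.0 or ::
    return "single-interface"      # one specific address on this host


for candidate in ("127.0.0.1", "0.0.0.0"):
    print(candidate, "->", bind_scope(candidate))
```

A wrapper could refuse to start, or require an explicit opt-in, whenever the result is anything other than "local-only".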

What You Gain by Migrating

After migrating from Ollama to Mullama, you gain access to:

Native Library Integration

  • In-process inference with no HTTP overhead from Rust, Node.js, Python, Go, PHP, and C/C++
  • No daemon required for library usage -- embed inference directly in your application
  • Single-binary deployment with inference capabilities baked in

Advanced API Features

  • Anthropic-compatible API (/v1/messages) for Claude SDK compatibility
  • WebSocket streaming for bidirectional real-time communication
  • Prometheus metrics for production monitoring at /metrics
  • Embedded Web UI for model management and chat

Extended Inference Capabilities

  • ColBERT / late interaction with MaxSim scoring for advanced retrieval
  • Streaming audio with voice activity detection and noise reduction
  • Parallel batch processing with Rayon work-stealing scheduler
  • Composable sampler chains for advanced generation control
  • Multiple simultaneous LoRA adapters with dynamic weight adjustment

Operational Improvements

  • Auto-spawn daemon -- starts automatically on first command
  • TUI chat interface -- rich terminal UI for interactive sessions
  • Extended Modelfile -- thinking tokens, tool format, capabilities, digest verification
  • Model aliases -- 40+ pre-configured aliases for popular models

Step-by-Step Migration

Step 1: Install Mullama

# Clone and build
git clone https://github.com/cognisoc/mullama.git
cd mullama
git submodule update --init --recursive

# Install system dependencies (Linux)
sudo apt install -y libasound2-dev libpulse-dev libflac-dev \
    libvorbis-dev libopus-dev libpng-dev libjpeg-dev \
    libtiff-dev libwebp-dev

# Build with full daemon support
cargo build --release --features daemon

# Optionally build with Web UI
cd ui && npm install && npm run build && cd ..
cargo build --release --features daemon,embedded-ui

# Add to PATH
export PATH="$PWD/target/release:$PATH"

Step 2: Verify Installation

mullama --version
mullama daemon status

Step 3: Pull Your Models

# Same model aliases work
mullama pull llama3.2:1b
mullama pull qwen2.5:7b-instruct
mullama list

Step 4: Test Basic Operations

# One-shot generation
mullama run llama3.2:1b "What is the capital of France?"

# Interactive chat
mullama chat --model llama3.2:1b

Step 5: Migrate Modelfiles

# Your existing Modelfiles work without changes
mullama create my-model -f ./Modelfile
mullama show my-model --modelfile

Step 6: Update Application Code

Choose your integration approach:

  • Native binding: Rewrite to use mullama package (maximum performance)
  • OpenAI SDK: Point existing OpenAI client at http://localhost:8080/v1 (minimal changes)
  • Anthropic SDK: Point existing Anthropic client at http://localhost:8080 (minimal changes)

Step 7: Explore Advanced Features

# Web UI (requires a build with the embedded-ui feature)
open http://localhost:8080/ui/    # macOS; use xdg-open on Linux

# Prometheus metrics
curl http://localhost:8080/metrics

# Anthropic-compatible API
curl http://localhost:8080/v1/messages -H "Content-Type: application/json" -d '{
  "model": "llama3.2:1b",
  "max_tokens": 1024,
  "messages": [{"role": "user", "content": "Hello from Anthropic SDK!"}]
}'

Frequently Asked Questions

Can I run Ollama and Mullama side by side?

Yes. They use different default ports (Ollama: 11434, Mullama: 8080) and different storage directories. Both can run simultaneously without conflicts.

Do I need to re-download my models?

Not necessarily. If you have GGUF files from Ollama, you can point Mullama directly at them using --model-path or by setting MULLAMA_CACHE_DIR. Alternatively, mullama pull downloads models from HuggingFace into its own cache.

Is the Modelfile format exactly the same?

Mullama supports all standard Ollama Modelfile directives. It also adds extensions (GPU_LAYERS, FLASH_ATTENTION, THINKING, TOOLFORMAT, CAPABILITY, etc.) that are Mullama-specific. An Ollama Modelfile works in Mullama without changes.

Can I use OpenAI client libraries with Mullama?

Yes. Mullama's daemon provides a fully OpenAI-compatible API. Any OpenAI SDK (Python, Node.js, Go, etc.) works by setting base_url to http://localhost:8080/v1 and api_key to any string.

Can I use Anthropic client libraries with Mullama?

Yes. Mullama provides an Anthropic-compatible endpoint at /v1/messages. Configure the Anthropic SDK's base URL to point to your Mullama instance.

What about model aliases like 'llama3.2:1b'?

Mullama ships with 40+ pre-configured model aliases that resolve to HuggingFace repositories. Custom aliases can be defined in configs/models.toml.

Is there a performance difference in daemon mode?

When both are running as daemons serving HTTP APIs, performance is comparable -- both use llama.cpp for inference. Mullama's advantage emerges when using native bindings (eliminating HTTP overhead entirely) or when leveraging features like parallel batch processing.

Can I use the same GPU settings?

Yes. Mullama supports CUDA, Metal, and ROCm. GPU offload is configured with the --gpu-layers CLI flag or the GPU_LAYERS Modelfile directive, which play the same role as Ollama's OLLAMA_NUM_GPU.

Does Mullama support chat templates?

Yes. Modelfile TEMPLATE directives work identically. Mullama also auto-detects chat templates from GGUF metadata.

What if a feature I need is missing?

Check the Feature Matrix for current status. File an issue on GitHub for feature requests -- community feedback directly influences the roadmap.