CLI Reference¶

The mullama binary serves as both the daemon server and the client CLI. All commands follow the pattern:

mullama <command> [options] [arguments]

Commands Overview¶

By Category¶

Running

Command	Description
`run`	One-shot text generation (auto-spawns daemon)
`chat`	Interactive TUI chat client

Serving

Command	Description
`serve`	Start the daemon server (foreground)

Model Management

Command	Description
`pull`	Download a model from HuggingFace
`list`	List all local models
`show`	Show model details
`create`	Create a custom model from a Modelfile
`cp`	Copy/rename a model
`rm`	Remove a model from disk

Daemon Lifecycle

Command	Description
`daemon start`	Start daemon as background process
`daemon stop`	Stop the running daemon
`daemon status`	Show daemon status
`daemon restart`	Restart the daemon
`daemon logs`	View daemon log output

Monitoring

Command	Description
`ps`	Show running (loaded) models
`status`	Show daemon status
`cache`	Manage the model cache
`ping`	Ping the daemon

Utility

Command	Description
`load`	Load a model into the daemon
`unload`	Unload a model from memory
`search`	Search HuggingFace for models
`info`	Show HuggingFace repository details
`tokenize`	Tokenize text using a model
`embed`	Generate embeddings for text
`stop`	Shut down the daemon

Server Commands¶

`mullama serve`¶

Start the daemon server in the foreground. The server provides both IPC and HTTP interfaces.

mullama serve [OPTIONS]

Options:

Flag	Default	Description
`-m, --model <SPEC>`	--	Model(s) to load on startup (repeatable)
`--mmproj <PATH>`	--	Multimodal projector path (for vision models)
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket address
`-p, --http-port <PORT>`	`8080`	HTTP port (0 to disable HTTP)
`--http-addr <ADDR>`	`0.0.0.0`	HTTP bind address
`-g, --gpu-layers <N>`	`0`	Default GPU layers to offload
`-c, --context-size <N>`	`4096`	Default context size (tokens)
`-t, --threads <N>`	`num_cpus / 2`	CPU threads per model
`-v, --verbose`	--	Verbose output

Aliases: start

Examples:

# Start with a single model
mullama serve --model llama3.2:1b

# Start with GPU acceleration and custom port
mullama serve --model llama3.2:1b -g 35 -p 9090

# Start with multiple models
mullama serve --model llama3.2:1b --model qwen2.5:7b --model deepseek-r1:7b

# Start a vision model with projector
mullama serve --model llava:7b --mmproj ./mmproj-model-f16.gguf

# Start with a local GGUF file (alias:path format)
mullama serve --model my-model:./path/to/model.gguf

# Start with HuggingFace model
mullama serve --model hf:bartowski/Llama-3.2-1B-Instruct-GGUF

# Localhost-only binding (no external access)
mullama serve --model llama3.2:1b --http-addr 127.0.0.1

# IPC-only mode (no HTTP server)
mullama serve --model llama3.2:1b --http-port 0

# Large context with many threads
mullama serve --model qwen2.5:7b --context-size 32768 --threads 16

Foreground vs Background

mullama serve runs in the foreground and logs to stderr. Use mullama daemon start for background operation with log file output.

Generation Commands¶

`mullama run`¶

One-shot text generation. Auto-spawns the daemon if it is not running.

mullama run [MODEL] <PROMPT> [OPTIONS]

Arguments:

Argument	Required	Description
`MODEL`	No	Model alias or path (uses daemon default if omitted)
`PROMPT`	Yes	The prompt text to send

Options:

Flag	Default	Description
`-m, --model <MODEL>`	daemon default	Model to use (alternative to positional)
`-n, --max-tokens <N>`	`256`	Maximum tokens to generate
`-t, --temperature <F>`	`0.7`	Sampling temperature (0.0-2.0)
`--top-p <F>`	--	Nucleus sampling threshold (0.0-1.0)
`--top-k <N>`	--	Top-k sampling candidates
`--repeat-penalty <F>`	--	Repetition penalty factor
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket address
`-i, --image <PATH>`	--	Image file for vision models
`--http-port <PORT>`	`8080`	HTTP port for vision requests
`--system <PROMPT>`	--	System prompt
`--stats`	--	Show generation statistics
`--no-stream`	--	Disable streaming (wait for full response)
`--json`	--	Output response as JSON

Examples:

# Basic generation with model specified
mullama run llama3.2:1b "What is the capital of France?"

# Using default model
mullama run "Explain photosynthesis"

# With specific model and parameters
mullama run --model qwen2.5:7b -n 512 -t 0.9 "Write a poem about Rust"

# With system prompt
mullama run llama3.2:1b --system "You are a pirate" "Tell me about the ocean"

# Vision model with image
mullama run --model llava:7b --image photo.jpg "Describe this image in detail"

# Low temperature for deterministic output
mullama run --model deepseek-r1:7b -t 0.1 "What is 15 * 23?"

# Show performance stats after generation
mullama run --stats llama3.2:1b "Hello, world!"

# JSON output for scripting
mullama run --json llama3.2:1b "List 3 colors" | jq .content

Auto-Spawn

If the daemon is not running, mullama run will automatically start it in the background before sending the request.
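
The auto-spawn behavior can be approximated by hand in scripts that need the daemon up before doing anything else. The sketch below is not how mullama implements auto-spawn; `ensure_daemon` is a hypothetical helper, and it assumes `mullama ping` exits non-zero when the daemon is unreachable (consistent with exit code 2 in the Exit Codes table).

```shell
# ensure_daemon: start the daemon in the background only if a ping fails.
# Assumption: `mullama ping` returns a non-zero status when no daemon answers.
ensure_daemon() {
    if mullama ping >/dev/null 2>&1; then
        return 0                      # already running, nothing to do
    fi
    mullama daemon start >/dev/null 2>&1 || return 1
    mullama ping >/dev/null 2>&1      # confirm the new daemon responds
}
```

A script would call `ensure_daemon || exit 1` once, then issue `mullama run` requests as usual.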

`mullama chat`¶

Launch the interactive TUI chat client. See the TUI Chat page for full details.

mullama chat [OPTIONS]

Options:

Flag	Default	Description
`-m, --model <MODEL>`	daemon default	Initial model to use
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket address
`-t, --timeout <SECS>`	`10`	Connection timeout in seconds
`--system <PROMPT>`	--	Initial system prompt

Aliases: tui

Examples:

# Launch with default model
mullama chat

# Launch with specific model
mullama chat --model deepseek-r1:7b

# With custom system prompt
mullama chat --model qwen2.5:7b --system "You are a helpful coding assistant"

# Connect to non-default daemon
mullama chat --socket ipc:///var/run/mullama.sock

Model Management Commands¶

`mullama pull`¶

Download a model from HuggingFace.

mullama pull <SPEC> [OPTIONS]

Arguments:

Argument	Description
`SPEC`	Model alias, HuggingFace spec (`hf:org/repo`), or `hf:org/repo:file.gguf`

Options:

Flag	Default	Description
`-q, --quiet`	--	Suppress progress bar
`--force`	--	Re-download even if cached

Aliases: download

Examples:

# Download using alias
mullama pull llama3.2:1b
mullama pull qwen2.5:7b
mullama pull deepseek-r1:7b

# Download from HuggingFace (auto-detect best GGUF)
mullama pull hf:bartowski/Llama-3.2-1B-Instruct-GGUF

# Download specific file from HuggingFace
mullama pull hf:TheBloke/Llama-2-7B-GGUF:llama-2-7b.Q4_K_M.gguf

# Download specific quantization
mullama pull hf:bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-Q5_K_M.gguf

# Quiet mode (for scripts)
mullama pull -q llama3.2:1b

# Force re-download
mullama pull --force llama3.2:1b
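
The two HuggingFace SPEC forms above differ only in the optional `:file.gguf` suffix. As an illustration of that structure, here is a small sketch that splits a spec into repository and file name. `parse_hf_spec` is a hypothetical helper, not part of mullama, and mullama's own parser may handle edge cases differently.

```shell
# parse_hf_spec: split "hf:org/repo[:file.gguf]" into its parts.
# Prints "repo=<org/repo> file=<file or empty>"; returns 1 for non-hf specs.
parse_hf_spec() {
    spec=$1
    case "$spec" in
        hf:*) rest=${spec#hf:} ;;     # drop the "hf:" prefix
        *)    return 1 ;;             # plain alias, not a HuggingFace spec
    esac
    case "$rest" in
        *:*) repo=${rest%%:*}; file=${rest#*:} ;;
        *)   repo=$rest; file= ;;
    esac
    printf 'repo=%s file=%s\n' "$repo" "$file"
}
```

An empty `file` corresponds to the auto-detect case, where mullama picks the GGUF file itself.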

`mullama list`¶

List all local models (cached HuggingFace downloads and custom models).

mullama list [OPTIONS]

Options:

Flag	Default	Description
`-v, --verbose`	--	Show detailed information (path, repository)
`--json`	--	Output as JSON

Aliases: ls

Example output:

NAME                                      SIZE       QUANTIZATION   MODIFIED
Llama-3.2-1B-Instruct-GGUF:Q4_K_M        0.8 GB    Q4_K_M         2 days ago
qwen2.5-7b-instruct-q4_k_m               4.7 GB    Q4_K_M         5 hours ago
deepseek-r1-7b                            4.9 GB    Q4_K_M         1 hour ago
my-assistant                              0.8 GB    Q4_K_M         30 minutes ago

4 model(s), 11.2 GB total

Verbose output:

NAME                                      SIZE       PATH
Llama-3.2-1B-Instruct-GGUF:Q4_K_M        0.8 GB    ~/.cache/mullama/models/bartowski/...
  Repo: bartowski/Llama-3.2-1B-Instruct-GGUF
  Quantization: Q4_K_M
  Modified: 2025-01-20 14:30:00

qwen2.5-7b-instruct-q4_k_m               4.7 GB    ~/.cache/mullama/models/Qwen/...
  Repo: Qwen/Qwen2.5-7B-Instruct-GGUF
  Quantization: Q4_K_M
  Modified: 2025-01-23 09:15:00

`mullama ps`¶

Show models currently loaded in the daemon.

mullama ps [OPTIONS]

Options:

Flag	Default	Description
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket
`--json`	--	Output as JSON

Example output:

NAME                 SIZE       GPU          CONTEXT    ACTIVE
*llama3.2:1b         1236M      35 layers    4096       -
 qwen2.5:7b          7615M      CPU          4096       1 req
 deepseek-r1:7b      4900M      20 layers    8192       -

* = default model
3 model(s) loaded, 13.7 GB total
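
For quick checks, the table output can be summarized with awk. The sketch below assumes the text layout matches the example above (size in the second column, suffixed with `M`); `sum_loaded_mb` is a hypothetical helper. For anything long-lived, `--json` is likely the more robust input, but its schema is not documented on this page.

```shell
# sum_loaded_mb: read `mullama ps` text output on stdin and print the
# total size of loaded models in megabytes.
# Assumption: column 2 holds the size as "<digits>M", as in the example.
sum_loaded_mb() {
    awk '$2 ~ /^[0-9]+M$/ { sub(/M$/, "", $2); total += $2 }
         END { print total + 0 }'
}

# Usage: mullama ps | sum_loaded_mb
```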

`mullama show`¶

Show detailed information about a model.

mullama show <NAME> [OPTIONS]

Options:

Flag	Default	Description
`--modelfile`	--	Show the Modelfile/Mullamafile content
`--parameters`	--	Show model parameters only
`--license`	--	Show license information
`--json`	--	Output as JSON

Examples:

# Show model details
mullama show llama3.2:1b

# Show the Modelfile
mullama show my-assistant --modelfile

# Show parameters
mullama show my-assistant --parameters

# JSON output
mullama show llama3.2:1b --json

Example output:

Model: llama3.2:1b
  Family:       llama
  Parameters:   1.24B
  Quantization: Q4_K_M
  Size:         0.8 GB
  Context:      4096
  Repository:   bartowski/Llama-3.2-1B-Instruct-GGUF
  Path:         ~/.cache/mullama/models/bartowski/Llama-3.2-1B-Instruct-GGUF/...
  Modified:     2025-01-20 14:30:00

`mullama create`¶

Create a custom model configuration from a Modelfile or Mullamafile.

mullama create <NAME> [OPTIONS]

Options:

Flag	Default	Description
`-f, --file <PATH>`	`./Mullamafile` or `./Modelfile`	Path to Modelfile
`--download`	`true`	Download base model if not cached
`-q, --quiet`	--	Suppress progress output

Examples:

# Create from Modelfile in current directory
mullama create my-assistant

# Create from specific file
mullama create my-coder -f ./coding-assistant.modelfile

# Create without auto-downloading base model
mullama create my-model -f Modelfile --download=false

`mullama cp`¶

Copy or rename a custom model.

mullama cp <SOURCE> <DESTINATION>

Aliases: copy

Examples:

# Copy a model
mullama cp my-assistant my-assistant-v2

# Rename by copying and removing
mullama cp old-name new-name && mullama rm old-name

`mullama rm`¶

Remove a model from disk.

mullama rm <NAME> [OPTIONS]

Options:

Flag	Default	Description
`-f, --force`	--	Skip confirmation prompt

Aliases: delete, remove

Examples:

# Remove with confirmation
mullama rm my-old-model

# Force remove (no confirmation)
mullama rm -f my-old-model

`mullama load`¶

Load a model into the running daemon.

mullama load <SPEC> [OPTIONS]

Options:

Flag	Default	Description
`-g, --gpu-layers <N>`	`0`	GPU layers to offload
`-c, --context-size <N>`	`4096`	Context size
`--mmproj <PATH>`	--	Vision projector path
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket
`--default`	--	Set as default model

Examples:

# Load by alias:path format
mullama load llama:./models/llama.gguf

# Load with GPU acceleration
mullama load llama:./models/llama.gguf -g 35

# Load with custom context
mullama load qwen:./models/qwen.gguf -c 8192

# Load and set as default
mullama load llama3.2:1b --default

# Load a vision model
mullama load llava:./llava.gguf --mmproj ./mmproj.gguf -g 35

`mullama unload`¶

Unload a model from the daemon, freeing memory.

mullama unload <ALIAS> [OPTIONS]

Options:

Flag	Default	Description
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket

Examples:

# Unload a model
mullama unload qwen2.5:7b

# Unload from custom socket
mullama unload llama3.2:1b -s ipc:///var/run/mullama.sock

Discovery Commands¶

`mullama search`¶

Search HuggingFace for GGUF models.

mullama search <QUERY> [OPTIONS]

Options:

Flag	Default	Description
`-n, --limit <N>`	`10`	Maximum results to show
`--all`	--	Show all models (not just GGUF)
`-f, --files`	--	Show available GGUF files per repository

Aliases: find

Examples:

# Search for Llama models
mullama search "llama 7b"

# Show available quantizations for matches
mullama search "mistral instruct" --files

# Include non-GGUF results
mullama search "phi" --all

# Limit results
mullama search "deepseek" -n 5

Example output:

REPOSITORY                                    DOWNLOADS    UPDATED
bartowski/Llama-3.2-1B-Instruct-GGUF          125,432      3 days ago
bartowski/Llama-3.2-3B-Instruct-GGUF           89,201      3 days ago
bartowski/Meta-Llama-3.1-8B-Instruct-GGUF     312,445      2 weeks ago

3 result(s)

`mullama info`¶

Show details about a HuggingFace repository.

mullama info <REPO>

Examples:

mullama info bartowski/Llama-3.2-1B-Instruct-GGUF
mullama info TheBloke/Mistral-7B-Instruct-v0.2-GGUF

Example output:

Repository: bartowski/Llama-3.2-1B-Instruct-GGUF
Description: GGUF quantizations of Meta Llama 3.2 1B Instruct
Downloads: 125,432
Last Updated: 2025-01-20

Available Files:
  Llama-3.2-1B-Instruct-Q2_K.gguf         (0.5 GB)
  Llama-3.2-1B-Instruct-Q3_K_M.gguf       (0.6 GB)
  Llama-3.2-1B-Instruct-Q4_K_M.gguf       (0.8 GB)  [recommended]
  Llama-3.2-1B-Instruct-Q5_K_M.gguf       (0.9 GB)
  Llama-3.2-1B-Instruct-Q6_K.gguf         (1.0 GB)
  Llama-3.2-1B-Instruct-Q8_0.gguf         (1.3 GB)
  Llama-3.2-1B-Instruct-f16.gguf          (2.5 GB)

Daemon Management¶

`mullama daemon start`¶

Start the daemon as a background process.

mullama daemon start [OPTIONS]

Options:

Flag	Default	Description
`-m, --model <SPEC>`	--	Model(s) to load (repeatable)
`-p, --http-port <PORT>`	`8080`	HTTP port
`--http-addr <ADDR>`	`0.0.0.0`	HTTP bind address
`-g, --gpu-layers <N>`	`0`	Default GPU layers
`-c, --context-size <N>`	`4096`	Default context size
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket

Examples:

# Start with defaults
mullama daemon start

# Start with a model pre-loaded
mullama daemon start --model llama3.2:1b

# Start with GPU and custom port
mullama daemon start --model llama3.2:1b -g 35 -p 9090

# Start with multiple models
mullama daemon start --model llama3.2:1b --model qwen2.5:7b

`mullama daemon stop`¶

Stop the running daemon gracefully.

mullama daemon stop [OPTIONS]

Options:

Flag	Default	Description
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket
`-f, --force`	--	Force stop (SIGKILL instead of graceful shutdown)
`--timeout <SECS>`	`5`	Graceful shutdown timeout

Examples:

# Graceful stop
mullama daemon stop

# Force stop
mullama daemon stop --force

# Stop non-default daemon
mullama daemon stop -s ipc:///var/run/mullama.sock
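
A common pattern is to try a graceful stop first and fall back to `--force` only if that fails. The sketch below assumes `mullama daemon stop` exits non-zero when the graceful shutdown does not complete within `--timeout`; that behavior is not stated on this page, so verify it before relying on it. `stop_daemon` is a hypothetical helper.

```shell
# stop_daemon [TIMEOUT_SECS]: graceful stop, then force stop as a fallback.
# Assumption: a failed or timed-out graceful stop yields a non-zero status.
stop_daemon() {
    timeout_secs=${1:-5}              # 5 matches the documented default
    if mullama daemon stop --timeout "$timeout_secs"; then
        return 0
    fi
    mullama daemon stop --force
}
```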

`mullama daemon restart`¶

Restart the daemon (stop then start with the same configuration).

mullama daemon restart [OPTIONS]

Options: Same as daemon start.

`mullama daemon status`¶

Show daemon status information.

mullama daemon status [OPTIONS]

Options:

Flag	Default	Description
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket
`--json`	--	Output as JSON

Example output:

Mullama Daemon Status
=====================
Running:     Yes
Version:     0.3.0
Uptime:      1h 23m 45s
Models:      2 loaded
Socket:      ipc:///tmp/mullama.sock
HTTP:        http://0.0.0.0:8080
Logs:        /tmp/mullamad.log
GPU:         CUDA available (35 layers offloaded)
Memory:      4.2 GB / 16.0 GB (26%)
Requests:    1,523 total
Tokens:      456,789 generated

`mullama daemon logs`¶

View daemon log output.

mullama daemon logs [OPTIONS]

Options:

Flag	Default	Description
`-n, --lines <N>`	`50`	Number of lines to show
`-f, --follow`	--	Follow log output (like `tail -f`)

Examples:

# Show last 50 lines
mullama daemon logs

# Show last 200 lines
mullama daemon logs -n 200

# Follow log output in real-time
mullama daemon logs -f

Monitoring Commands¶

`mullama status`¶

Show daemon status (shorthand for daemon status).

mullama status [OPTIONS]

Options:

Flag	Default	Description
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket
`--json`	--	Output as JSON

`mullama ping`¶

Ping the daemon to check connectivity.

mullama ping [OPTIONS]

Options:

Flag	Default	Description
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket

Example output:

Daemon is running (response time: 0.3ms)
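
Because `ping` is cheap, it works well as a readiness probe after `mullama daemon start`. The following sketch polls until the daemon answers or the retry budget runs out. `wait_for_daemon` is a hypothetical helper, and it assumes `mullama ping` exits non-zero while the daemon is unreachable.

```shell
# wait_for_daemon [TRIES] [DELAY_SECS]: poll `mullama ping` until it succeeds.
# Returns 0 once the daemon responds, 1 if every attempt fails.
wait_for_daemon() {
    tries=${1:-10}
    delay=${2:-1}
    i=0
    while [ "$i" -lt "$tries" ]; do
        if mullama ping >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```

For example, `mullama daemon start --model llama3.2:1b && wait_for_daemon 30 1` would wait up to roughly 30 seconds before giving up.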

`mullama cache`¶

Manage the local model cache.

mullama cache <ACTION> [OPTIONS]

Actions:

Action	Description
`list [--verbose]`	List cached models with sizes
`show`	Show cache details (alias for `list --verbose`)
`path`	Show cache directory path
`size`	Show total cache size
`remove <REPO_ID> [--filename FILE]`	Remove specific cached model(s)
`clear [--force]`	Clear entire cache

Examples:

# List cached models
mullama cache list

# Show detailed cache info
mullama cache show

# Show cache directory
mullama cache path

# Show total size
mullama cache size

# Remove specific model
mullama cache remove bartowski/Llama-3.2-1B-Instruct-GGUF

# Remove specific file
mullama cache remove bartowski/Llama-3.2-1B-Instruct-GGUF --filename Q4_K_M.gguf

# Clear all (with confirmation)
mullama cache clear

# Clear all (no confirmation)
mullama cache clear --force

Utility Commands¶

`mullama tokenize`¶

Tokenize text using a loaded model's tokenizer.

mullama tokenize <TEXT> [OPTIONS]

Options:

Flag	Default	Description
`-m, --model <MODEL>`	daemon default	Model to use
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket
`--json`	--	Output as JSON

Example:

$ mullama tokenize "Hello, world!"
Tokens: [15496, 11, 1917, 0]
Count: 4

`mullama embed`¶

Generate embeddings for one or more texts.

mullama embed <TEXT>... [OPTIONS]

Options:

Flag	Default	Description
`-m, --model <MODEL>`	daemon default	Model to use
`-s, --socket <ADDR>`	`ipc:///tmp/mullama.sock`	IPC socket
`--json`	--	Output as JSON
`--dimensions`	--	Show only the embedding dimensions

Examples:

# Generate embeddings
mullama embed "Hello, world!"

# Multiple texts
mullama embed "First text" "Second text"

# JSON output
mullama embed --json "Hello" | jq '.embedding[:5]'

# Show dimensions only
mullama embed --dimensions "Hello"

`mullama stop`¶

Shut down the daemon (alias for daemon stop).

mullama stop [OPTIONS]

Environment Variables¶

Variable	Description	Default
`MULLAMA_HOST`	Default HTTP bind address	`0.0.0.0`
`MULLAMA_PORT`	Default HTTP port	`8080`
`MULLAMA_MODELS_DIR`	Custom model storage directory	`~/.mullama/models`
`MULLAMA_CACHE_DIR`	Override model cache directory	Platform-specific
`MULLAMA_GPU_LAYERS`	Default GPU layers	`0`
`MULLAMA_CONTEXT_SIZE`	Default context size	`4096`
`MULLAMA_SOCKET`	Default IPC socket path	`ipc:///tmp/mullama.sock`
`MULLAMA_BIN`	Path to mullama binary (for auto-spawn)	Auto-detected
`MULLAMA_CONFIG`	Path to configuration file	`~/.mullama/config.yaml`
`MULLAMA_LOG_LEVEL`	Logging level (trace, debug, info, warn, error)	`info`
`HF_TOKEN`	HuggingFace API token for gated models	--

Examples:

# Set persistent environment
export MULLAMA_PORT=9090
export MULLAMA_GPU_LAYERS=35
export MULLAMA_MODELS_DIR=/mnt/models
export HF_TOKEN="hf_your_token_here"

# Then use normally
mullama serve --model llama3.2:1b
# Equivalent to: mullama serve --model llama3.2:1b --http-port 9090 --gpu-layers 35

Precedence

CLI flags take precedence over environment variables, which take precedence over config file values.
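
The precedence rule amounts to "first non-empty value wins". The sketch below illustrates that rule only; it is not mullama's implementation, `resolve_setting` is a hypothetical helper, and treating the built-in default as a fourth, lowest-priority source is an assumption.

```shell
# resolve_setting FLAG ENV CONFIG DEFAULT
# Prints the first non-empty argument, mirroring: flag > env > config > default.
resolve_setting() {
    for candidate in "$1" "$2" "$3" "$4"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1                          # nothing was set at any level
}

# Usage (flag_port and config_port are placeholders for values parsed elsewhere):
#   port=$(resolve_setting "$flag_port" "${MULLAMA_PORT:-}" "$config_port" 8080)
```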

Exit Codes¶

Code	Meaning
`0`	Success
`1`	General error
`2`	Connection failed (daemon not reachable)
`3`	Model not found
`4`	Invalid arguments
`5`	Download failed
`6`	Timeout
`7`	Permission denied
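
Scripts can branch on these codes to decide whether to retry, pull a model, or abort. As a minimal sketch, the hypothetical helper below maps a status to the meaning listed in the table.

```shell
# describe_exit CODE: translate a mullama exit status into its documented meaning.
describe_exit() {
    case "$1" in
        0) echo "success" ;;
        1) echo "general error" ;;
        2) echo "connection failed (daemon not reachable)" ;;
        3) echo "model not found" ;;
        4) echo "invalid arguments" ;;
        5) echo "download failed" ;;
        6) echo "timeout" ;;
        7) echo "permission denied" ;;
        *) echo "unknown exit code $1" ;;
    esac
}

# Usage:
#   mullama pull -q llama3.2:1b || echo "pull failed: $(describe_exit $?)"
```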

Global Behavior¶

Auto-spawn: Commands that require a daemon connection (run, chat, ps, load, unload, tokenize, embed) will automatically start the daemon if it is not running.
JSON output: Most commands support --json for machine-readable output suitable for scripting.
IPC socket: Client commands that connect to the daemon accept -s, --socket to target a non-default daemon instance.
Quiet mode: Commands with progress output support -q, --quiet for script-friendly operation.
Tab completion: Install shell completions with mullama completions bash|zsh|fish.

Shell Completions¶

Generate shell completion scripts:

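# Bash (system-wide install; writing to /etc requires root)
mullama completions bash | sudo tee /etc/bash_completion.d/mullama > /dev/null

# Zsh (~/.zfunc must exist and be on your fpath)
mullama completions zsh > ~/.zfunc/_mullama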

# Fish
mullama completions fish > ~/.config/fish/completions/mullama.fish

Usage Patterns¶

Scripting¶

#!/bin/bash
# Generate summaries for multiple files
for file in docs/*.md; do
    content=$(cat "$file")
    summary=$(mullama run --json --no-stream llama3.2:1b "Summarize: $content" | jq -r .content)
    echo "$file: $summary"
done

Pipeline Integration¶

# Pipe input to mullama
echo "Translate to French: Hello, how are you?" | mullama run llama3.2:1b -

# Use with jq for structured output
mullama run --json qwen2.5:7b "List 5 programming languages as JSON array" | jq .content

Multi-Instance¶

# Run two daemon instances on different ports
mullama daemon start --http-port 8080 --socket ipc:///tmp/mullama-1.sock --model llama3.2:1b
mullama daemon start --http-port 8081 --socket ipc:///tmp/mullama-2.sock --model qwen2.5:7b

# Send requests to specific instances
mullama run -s ipc:///tmp/mullama-1.sock "Hello from instance 1"
mullama run -s ipc:///tmp/mullama-2.sock "Hello from instance 2"
