OpenAI API Compatibility¶

The Mullama daemon provides a drop-in replacement for the OpenAI API. Applications built for OpenAI can connect to Mullama by simply changing the base URL, enabling local LLM inference with no code changes.

Overview¶

Base URL: http://localhost:8080/v1

Endpoint	Method	Description
`/v1/chat/completions`	POST	Chat completion with streaming
`/v1/completions`	POST	Text completion
`/v1/embeddings`	POST	Text embeddings
`/v1/models`	GET	List available models

Using with OpenAI SDKs¶

PythonNode.jscurl

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="unused"  # Required by SDK but not validated
)

# Chat completion
response = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"}
    ],
    temperature=0.7,
    max_tokens=256
)
print(response.choices[0].message.content)

import OpenAI from 'openai';

const client = new OpenAI({
  baseURL: 'http://localhost:8080/v1',
  apiKey: 'unused'
});

const response = await client.chat.completions.create({
  model: 'llama3.2:1b',
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'What is the capital of France?' }
  ],
  temperature: 0.7,
  max_tokens: 256
});
console.log(response.choices[0].message.content);

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Chat Completions¶

POST /v1/chat/completions¶

Generate a chat completion from a conversation history.

Request Body:

Field	Type	Required	Default	Description
`model`	string	No	daemon default	Model identifier
`messages`	array	Yes	--	Conversation messages
`temperature`	float	No	0.7	Sampling temperature (0.0-2.0)
`top_p`	float	No	0.9	Nucleus sampling threshold
`max_tokens`	integer	No	512	Maximum tokens to generate
`stream`	boolean	No	false	Enable SSE streaming
`stop`	string/array	No	--	Stop sequence(s)
`n`	integer	No	1	Number of completions
`frequency_penalty`	float	No	0.0	Frequency penalty (-2.0 to 2.0)
`presence_penalty`	float	No	0.0	Presence penalty (-2.0 to 2.0)
`seed`	integer	No	--	Random seed for reproducibility
`user`	string	No	--	End-user identifier (logged, not used)

Message Format¶

{
  "role": "system" | "user" | "assistant",
  "content": "Message text"
}

Non-Streaming Example¶

Request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1706000000,
  "model": "llama3.2:1b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to perform computations that would be impractical for classical computers."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 28,
    "total_tokens": 53
  }
}

Streaming Example¶

Request:

curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'

Response (SSE):

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706000000,"model":"llama3.2:1b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706000000,"model":"llama3.2:1b","choices":[{"index":0,"delta":{"content":"Once"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706000000,"model":"llama3.2:1b","choices":[{"index":0,"delta":{"content":" upon"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706000000,"model":"llama3.2:1b","choices":[{"index":0,"delta":{"content":" a"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706000000,"model":"llama3.2:1b","choices":[{"index":0,"delta":{"content":" time"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1706000000,"model":"llama3.2:1b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Streaming with SDKs¶

PythonNode.js

stream = client.chat.completions.create(
    model="llama3.2:1b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

const stream = await client.chat.completions.create({
  model: 'llama3.2:1b',
  messages: [{ role: 'user', content: 'Tell me a story' }],
  stream: true
});
for await (const chunk of stream) {
  const content = chunk.choices[0]?.delta?.content || '';
  process.stdout.write(content);
}

Multi-Turn Conversation¶

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "messages": [
      {"role": "system", "content": "You are a math tutor."},
      {"role": "user", "content": "What is 15 * 23?"},
      {"role": "assistant", "content": "15 * 23 = 345"},
      {"role": "user", "content": "Now divide that by 5"}
    ]
  }'

Text Completions¶

POST /v1/completions¶

Generate a text completion from a prompt (legacy completions API).

Request Body:

Field	Type	Required	Default	Description
`model`	string	No	daemon default	Model identifier
`prompt`	string	Yes	--	Input text
`max_tokens`	integer	No	512	Maximum tokens
`temperature`	float	No	0.7	Sampling temperature
`top_p`	float	No	0.9	Nucleus sampling
`stream`	boolean	No	false	Enable SSE streaming
`stop`	string/array	No	--	Stop sequence(s)
`echo`	boolean	No	false	Include prompt in response
`suffix`	string	No	--	Text after completion (fill-in-middle)

Example:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:1b",
    "prompt": "The quick brown fox",
    "max_tokens": 50,
    "temperature": 0.7
  }'

Response:

{
  "id": "cmpl-abc123",
  "object": "text_completion",
  "created": 1706000000,
  "model": "llama3.2:1b",
  "choices": [
    {
      "text": " jumped over the lazy dog. This classic pangram contains every letter of the English alphabet.",
      "index": 0,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 18,
    "total_tokens": 23
  }
}

Embeddings¶

POST /v1/embeddings¶

Generate vector embeddings for input text.

Request Body:

Field	Type	Required	Description
`model`	string	No	Model to use (default: daemon default)
`input`	string/array	Yes	Text(s) to embed
`encoding_format`	string	No	`float` or `base64` (default: float)

Example (single text):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": "Hello, world!"
  }'

Example (batch):

curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": [
      "The cat sat on the mat",
      "A dog played in the park",
      "Machine learning is fascinating"
    ]
  }'

Response:

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.0023, -0.0091, 0.0152, 0.0284, -0.0037],
      "index": 0
    }
  ],
  "model": "nomic-embed",
  "usage": {
    "prompt_tokens": 4,
    "total_tokens": 4
  }
}

Embedding SDK Usage¶

PythonNode.js

response = client.embeddings.create(
    model="nomic-embed",
    input=["Hello, world!", "How are you?"]
)
for item in response.data:
    print(f"Index {item.index}: {len(item.embedding)} dimensions")

const response = await client.embeddings.create({
  model: 'nomic-embed',
  input: ['Hello, world!', 'How are you?']
});
response.data.forEach(item => {
  console.log(`Index ${item.index}: ${item.embedding.length} dimensions`);
});

Models¶

GET /v1/models¶

List all models available for inference.

Example:

curl http://localhost:8080/v1/models

Response:

{
  "object": "list",
  "data": [
    {
      "id": "llama3.2:1b",
      "object": "model",
      "created": 1706000000,
      "owned_by": "local"
    },
    {
      "id": "qwen2.5:7b",
      "object": "model",
      "created": 1706000000,
      "owned_by": "local"
    }
  ]
}

Streaming Format¶

When stream: true is set, the response uses Server-Sent Events (SSE):

Each event begins with data: followed by a JSON object
Events are separated by double newlines (\n\n)
The stream terminates with data: [DONE]
Content-Type: text/event-stream

Parsing example (Python):

import requests
import json

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "llama3.2:1b",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        line = line.decode("utf-8")
        if line.startswith("data: "):
            data = line[6:]
            if data == "[DONE]":
                break
            chunk = json.loads(data)
            content = chunk["choices"][0]["delta"].get("content", "")
            print(content, end="")

Supported Parameters¶

Full Parameter Support¶

Parameter	Supported	Notes
`model`	Yes	Uses local model aliases
`messages`	Yes	system, user, assistant roles
`temperature`	Yes	0.0 - 2.0
`top_p`	Yes	0.0 - 1.0
`max_tokens`	Yes	Model-dependent maximum
`stream`	Yes	SSE format
`stop`	Yes	String or array
`n`	Yes	Number of completions
`frequency_penalty`	Yes	-2.0 to 2.0
`presence_penalty`	Yes	-2.0 to 2.0
`seed`	Yes	For reproducibility
`user`	Accepted	Logged but not used for billing

Unsupported Parameters¶

Parameter	Status	Notes
`tools` / `functions`	Partial	Depends on model and Modelfile config
`tool_choice`	Partial	Basic support
`response_format`	Partial	`json_object` mode via system prompt
`logprobs`	No	Not implemented
`top_logprobs`	No	Not implemented
`logit_bias`	No	Not implemented

Function/Tool Calling¶

Tool calling is supported for models configured with TOOLFORMAT in their Modelfile.

Request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5:7b",
    "messages": [
      {"role": "user", "content": "What is the weather in Paris?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"]
          }
        }
      }
    ]
  }'

Response (when model decides to call a tool):

{
  "id": "chatcmpl-abc123",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\": \"Paris\"}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ]
}

Tool Calling Requirements

Tool calling requires a model with tool calling capabilities configured in its Modelfile. See Modelfile Format for configuration details.

Vision (Multimodal)¶

For models with vision capabilities, images can be included in messages:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llava:7b",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "What is in this image?"},
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,/9j/4AAQ..."
            }
          }
        ]
      }
    ],
    "max_tokens": 256
  }'

Image Format

Images must be base64-encoded and included as data URLs. The supported formats are JPEG, PNG, and WebP.

Differences from OpenAI API¶

Feature	OpenAI	Mullama
Authentication	API key required	Optional (configurable)
Model names	`gpt-4`, `gpt-3.5-turbo`	Local aliases (`llama3.2:1b`, etc.)
Rate limiting	Per-key quotas	Optional per-IP limiting
Billing	Pay per token	Free (local inference)
logprobs	Supported	Not implemented
logit_bias	Supported	Not implemented
response_format	Full JSON mode	Partial (via system prompt)
Tool calling	Full support	Model-dependent
Vision	GPT-4V	Requires vision model (LLaVA, etc.)
Embeddings	Multiple models	Single loaded model
Moderation	Available	Not implemented
Fine-tuning	API-based	Via Modelfile/LoRA
Batch API	Available	Not implemented

Error Responses¶

Errors follow the OpenAI error format:

{
  "error": {
    "message": "Model 'unknown-model' not found",
    "type": "invalid_request_error",
    "code": "model_not_found"
  }
}

HTTP Code	Error Type	Common Causes
400	`invalid_request_error`	Malformed request, missing fields
401	`authentication_error`	Invalid API key (when auth enabled)
404	`not_found_error`	Model not loaded
429	`rate_limit_error`	Rate limit exceeded
500	`internal_error`	Model inference failure
503	`overloaded_error`	Server overloaded