HTTP Server Design
Overview
The unikernel includes a minimal HTTP server for serving LLM inference requests. Because standard networking libraries are not available in a unikernel environment, the server is built on a custom network stack that performs PCI enumeration and drives an Intel e1000 NIC.
Implementation Approach
Network Stack
- Ethernet Frame Handling -- Raw packet processing via e1000 driver
- IP Protocol -- Basic IPv4 implementation
- TCP Protocol -- Connection-oriented transport
- HTTP/1.1 Subset -- Only the methods needed for our API
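
As an illustration of what the "basic IPv4" layer has to do for every packet, the sketch below validates an IPv4 header using the standard Internet checksum (RFC 1071). The function names `inet_checksum` and `ipv4_header_ok` are hypothetical and are not taken from the project's `network.c`; the real driver may structure this differently.

```c
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: one's-complement of the one's-complement
 * sum of the data, read as 16-bit big-endian words. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (len & 1)
        sum += (uint32_t)(data[len - 1] << 8);  /* pad a trailing odd byte */

    while (sum >> 16)                           /* fold carries back in */
        sum = (sum & 0xFFFF) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Hypothetical helper: returns 1 if pkt starts with a plausible IPv4 header.
 * A header whose checksum field is correct sums to zero. */
int ipv4_header_ok(const uint8_t *pkt, size_t len)
{
    size_t ihl_bytes;

    if (len < 20 || (pkt[0] >> 4) != 4)
        return 0;                               /* too short or not IPv4 */
    ihl_bytes = (size_t)(pkt[0] & 0x0F) * 4;
    if (ihl_bytes < 20 || ihl_bytes > len)
        return 0;                               /* invalid header length */
    return inet_checksum(pkt, ihl_bytes) == 0;
}
```

The same `inet_checksum` routine can be reused when building outgoing headers: zero the checksum field, compute the checksum over the header, and store the result in network byte order.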
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | Health check |
| `/models/load` | POST | Load a model |
| `/models/unload` | POST | Unload current model |
| `/completion` | POST | Generate text completion |
| `/embedding` | POST | Generate embeddings |
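
The endpoint table maps naturally onto a static route table. The following is a minimal sketch of that idea, not the project's actual `api.c`: the `route_id` enum, `struct route`, and `route_lookup` are names invented for illustration. The only data taken from this document are the method and path pairs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical identifiers for the five documented endpoints, plus two
 * failure cases a dispatcher would map to 404 and 405 responses. */
typedef enum {
    ROUTE_HEALTH,
    ROUTE_MODEL_LOAD,
    ROUTE_MODEL_UNLOAD,
    ROUTE_COMPLETION,
    ROUTE_EMBEDDING,
    ROUTE_NOT_FOUND,    /* unknown path */
    ROUTE_BAD_METHOD    /* known path, wrong method */
} route_id;

struct route {
    const char *method;
    const char *path;
    route_id    id;
};

/* Mirrors the endpoint table above; fixed at build time. */
static const struct route routes[] = {
    { "GET",  "/health",        ROUTE_HEALTH       },
    { "POST", "/models/load",   ROUTE_MODEL_LOAD   },
    { "POST", "/models/unload", ROUTE_MODEL_UNLOAD },
    { "POST", "/completion",    ROUTE_COMPLETION   },
    { "POST", "/embedding",     ROUTE_EMBEDDING    },
};

/* Linear scan is adequate for five entries. */
route_id route_lookup(const char *method, const char *path)
{
    int path_seen = 0;
    size_t i;

    for (i = 0; i < sizeof routes / sizeof routes[0]; i++) {
        if (strcmp(path, routes[i].path) != 0)
            continue;
        if (strcmp(method, routes[i].method) == 0)
            return routes[i].id;
        path_seen = 1;
    }
    return path_seen ? ROUTE_BAD_METHOD : ROUTE_NOT_FOUND;
}
```

Distinguishing "unknown path" from "wrong method" lets the caller answer with the appropriate status code instead of a generic error.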
Key Components
- Network Interface (`network.c`) -- PCI device discovery, e1000 driver, packet I/O
- HTTP Parser (`http.c`) -- Parse incoming HTTP/1.1 requests
- Router (`api.c`) -- Route requests to appropriate handlers
- JSON Handler (`json.c`) -- Parse and generate JSON request/response bodies
- Connection Manager -- Handle connections via the network loop
Simplifications for Unikernel
- Single Threaded -- No complex threading; one request at a time
- Limited Connections -- Sequential request processing
- No File System -- Serve responses directly from memory
- Static Configuration -- All configuration baked in at build time
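
One consequence of these simplifications is that a single statically allocated buffer can hold the response, since only one request is in flight at a time and nothing is read from disk. The sketch below illustrates that approach under stated assumptions: the names `http_build_response` and `http_response_data`, the 4096-byte size, and the `Connection: close` header are all hypothetical. It also uses `snprintf` from a hosted C library, so a freestanding unikernel build would need to supply its own formatter.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define RESPONSE_BUF_SIZE 4096   /* assumed size, fixed at build time */

/* One buffer is enough because requests are handled sequentially. */
static char response_buf[RESPONSE_BUF_SIZE];

/* Formats a complete HTTP/1.1 response with a JSON body into the static
 * buffer. Returns the number of bytes to send, or 0 if it does not fit. */
size_t http_build_response(int status, const char *reason,
                           const char *json_body)
{
    size_t body_len = strlen(json_body);
    int n = snprintf(response_buf, sizeof response_buf,
                     "HTTP/1.1 %d %s\r\n"
                     "Content-Type: application/json\r\n"
                     "Content-Length: %zu\r\n"
                     "Connection: close\r\n"
                     "\r\n"
                     "%s",
                     status, reason, body_len, json_body);

    if (n < 0 || (size_t)n >= sizeof response_buf)
        return 0;                /* formatting error or truncated output */
    return (size_t)n;
}

/* Read-only access to the bytes produced by the last call. */
const char *http_response_data(void)
{
    return response_buf;
}
```

A health-check handler could, for example, call `http_build_response(200, "OK", "{\"status\":\"ok\"}")` and hand the resulting bytes to the TCP layer. The JSON body shown is an illustrative placeholder, not the documented response format.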
Implementation Order
- Basic network interface -- get packets in and out
- HTTP request parser -- parse incoming requests
- Simple response generator -- generate HTTP responses
- API endpoint handlers -- implement each endpoint
- Integration with llama.cpp -- connect to LLM engine