API Documentation

Complete endpoint reference for the local AI server.

HTTP API

Overview

ai.local exposes two HTTP API surfaces:

  • OpenAI-compatible endpoints under /v1/*
  • Ollama-compatible endpoints under /api/*

Examples in this document use http://127.0.0.1:8080. Replace the host and port with the address used by your ai.local server.

Conventions

  • Request bodies are JSON unless noted otherwise.
  • Successful JSON endpoints return Content-Type: application/json.
  • POST /v1/audio/transcriptions requires multipart/form-data.
  • POST /v1/audio/speech returns binary audio, not JSON.
  • For OpenAI routes, use model IDs returned by GET /v1/models, for example llama-3.2-1b-instruct-4bit.
  • OpenAI text-generation routes support locally installed models. apple-foundation-model is handled separately when Apple Foundation Models are available on the device.
  • Ollama routes accept model names with or without the mlx-community/ prefix.
  • Usage counts are approximate character-based estimates, not tokenizer-exact values.
  • The server reserves /, /status, /v1/*, and /api/* for built-in routes.

Health and device control

HEAD /

Health probe. Returns 200 OK with no body.
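
For example, with curl (the -I flag issues a HEAD request and prints only the response headers):

curl -I http://127.0.0.1:8080/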

GET /status

Returns basic server status.
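
Example:

curl http://127.0.0.1:8080/status

Response: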

{
  "status": "Running",
  "message": "Server is currently running."
}

POST /api/services/screen/brightness

Sets the device screen brightness. brightness is clamped to the range 0...1.

{
  "brightness": 0.5
}
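
For example, with curl:

curl -X POST http://127.0.0.1:8080/api/services/screen/brightness \
  -H "Content-Type: application/json" \
  -d '{"brightness":0.5}'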

Response:

{
  "requestedBrightness": 0.5,
  "appliedBrightness": 0.5
}

Notes:

  • Returns the requested and applied brightness values when brightness control is available on the current platform.
  • Returns 501 Not Implemented when brightness control is unavailable.

GET /api/services/screen/brightness

Returns the current screen brightness.

{
  "brightness": 0.5
}

Notes:

  • Returns the current brightness value when brightness queries are available on the current platform.
  • Returns 501 Not Implemented when brightness queries are unavailable.

OpenAI-compatible API

GET /v1/models

Lists installed models plus apple-foundation-model when the device reports Foundation Models availability.
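
Example:

curl http://127.0.0.1:8080/v1/models

Representative response: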

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.2-1b-instruct-4bit",
      "object": "model",
      "created": 0,
      "owned_by": "local-ai-server"
    }
  ]
}

Notes:

  • Returned model IDs are lowercased and stripped of the mlx-community/ prefix.
  • The response uses the standard OpenAI list envelope and standard model fields only.
  • owned_by is apple for apple-foundation-model and local-ai-server for installed local models.
  • When apple-foundation-model is requested on OpenAI text routes, the server uses Apple Foundation Models instead of MLX.
  • apple-foundation-model currently supports plain-text generation only on /v1/*. Image input, tool calling, and tool-call history are rejected for that model.

POST /v1/completions

Single-prompt text completion.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "prompt": "Write one short sentence about local inference.",
  "temperature": 0.7,
  "n": 1
}

Response shape:

{
  "id": "UUID",
  "object": "text_completion",
  "created": 1741700000,
  "model": "llama-3.2-1b-instruct-4bit",
  "choices": [
    {
      "text": "Local inference keeps the model on your device.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 12,
    "total_tokens": 23
  }
}

Behavior:

  • Used: model, prompt, temperature, n
  • Accepted but not enforced: max_tokens, top_p
  • Supported: stream=true returns text/event-stream with OpenAI-style text_completion chunks and a final [DONE] (see the streaming example after this list)
  • If generation fails after the SSE stream has started, the server emits event: error with the standard OpenAI error envelope before [DONE]
  • n duplicates the same generated text across choices instead of sampling independent completions
  • model=apple-foundation-model runs through Apple Foundation Models when they are available on the device
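
A minimal streaming request with curl, assuming llama-3.2-1b-instruct-4bit is installed (-N disables output buffering so SSE chunks print as they arrive):

curl -N -X POST http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","prompt":"Say hello.","stream":true}'

The stream carries text_completion chunks and terminates with data: [DONE].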

POST /v1/chat/completions

OpenAI Chat Completions endpoint.

Basic request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "system", "content": "Reply in one sentence." },
    { "role": "user", "content": "What is edge inference?" }
  ],
  "temperature": 0.7
}
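
The same request, condensed, as a curl invocation:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","messages":[{"role":"user","content":"What is edge inference?"}]}'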

Multimodal request shape also accepted:

{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/cat.jpg"
          }
        }
      ]
    }
  ]
}

Behavior:

  • Used: model, messages, temperature, tools, tool_choice
  • Accepted but not enforced: top_p, max_tokens, max_completion_tokens, n
  • Supported: stream=true returns text/event-stream with chat.completion.chunk payloads and a final [DONE]
  • If generation fails after the SSE stream has started, the server emits event: error with the standard OpenAI error envelope before [DONE]
  • Rejected: legacy functions, legacy function_call, and non-standard chat request fields on /v1
  • Response always contains a single choice
  • Only POST /v1/chat/completions is registered. /v1/chat is not available.
  • Only standard OpenAI chat content parts are documented on this route. For image input, send an image_url object with a url string.
  • When functions are provided, they must already be registered on the server and also be listed in the request
  • Only tools of type function are supported
  • Function calls are executed locally on the server. Plain-text replies still stream incrementally when tools are present, but tool phases themselves may be buffered before the final assistant text is emitted.
  • model=apple-foundation-model only accepts plain-text system, user, and assistant messages. Image input, tool calls, tool-call history, and conversations whose final non-system message is not a user turn are rejected.

Function calling request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "user", "content": "Add 2 and 3." }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "add_numbers",
        "description": "Add two integers.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": { "type": "integer" },
            "b": { "type": "integer" }
          },
          "required": ["a", "b"]
        }
      }
    }
  ],
  "tool_choice": "required"
}

POST /v1/responses

OpenAI Responses-style wrapper over the same generation engine.

Simple request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "input": "Summarize on-device AI in one sentence.",
  "instructions": "Be concise.",
  "temperature": 0.7
}
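
Sent with curl:

curl -X POST http://127.0.0.1:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","input":"Summarize on-device AI in one sentence.","instructions":"Be concise."}'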

Structured input with image:

{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe this image." },
        { "type": "input_image", "image_url": "https://example.com/cat.jpg" }
      ]
    }
  ]
}

Accepted input shapes:

  • plain string
  • array of OpenAI Responses input items

Behavior:

  • Used for generation: model, input, instructions, temperature, tools, tool_choice, parallel_tool_calls
  • Echoed in the response metadata: top_p, max_output_tokens, previous_response_id, store, metadata, truncation
  • Supported: stream=true returns text/event-stream with OpenAI Responses events such as response.created, response.in_progress, response.output_item.added, response.content_part.added, response.output_text.delta, and response.completed
  • Rejected: single-object input and legacy non-OpenAI aliases such as tool_call_id
  • Supported input-item types: message, input_text, input_image, function_call, function_call_output
  • Not implemented: input_file, file_id-based file resolution, reasoning summaries
  • When functions execute, output includes function_call and function_call_output items before the final assistant message
  • model=apple-foundation-model only accepts plain-text requests on this route. input_image, function_call, and function_call_output items are rejected for that model because image input and tool execution are not implemented on the Apple Foundation backend.

Function calling request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "input": "Add 2 and 3.",
  "tools": [
    {
      "type": "function",
      "name": "add_numbers",
      "description": "Add two integers.",
      "parameters": {
        "type": "object",
        "properties": {
          "a": { "type": "integer" },
          "b": { "type": "integer" }
        },
        "required": ["a", "b"]
      },
      "strict": true
    }
  ],
  "tool_choice": {
    "type": "function",
    "name": "add_numbers"
  }
}

Representative response shape:

{
  "id": "resp_123",
  "object": "response",
  "created_at": 1741700000,
  "completed_at": 1741700000,
  "background": false,
  "status": "completed",
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "max_output_tokens": null,
  "model": "llama-3.2-1b-instruct-4bit",
  "output": [
    {
      "id": "fc_123",
      "type": "function_call",
      "status": "completed",
      "call_id": "call_123",
      "name": "add_numbers",
      "arguments": "{\"a\":2,\"b\":3}"
    },
    {
      "id": "fco_123",
      "type": "function_call_output",
      "status": "completed",
      "call_id": "call_123",
      "output": "{\"sum\":5}"
    },
    {
      "id": "msg_123",
      "type": "message",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The sum is 5.",
          "annotations": [],
          "logprobs": []
        }
      ]
    }
  ],
  "parallel_tool_calls": true,
  "previous_response_id": null,
  "reasoning": {
    "effort": null,
    "summary": null
  },
  "store": true,
  "temperature": 1,
  "text": {
    "format": {
      "type": "text"
    }
  },
  "tool_choice": {
    "type": "function",
    "name": "add_numbers"
  },
  "tools": [
    {
      "type": "function",
      "name": "add_numbers",
      "description": "Add two integers.",
      "parameters": {
        "type": "object"
      },
      "strict": true
    }
  ],
  "top_p": 1,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 10,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 6,
    "output_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 16
  },
  "user": null,
  "metadata": {}
}

Notes:

  • Functions must already be registered on the server and be included in the request payload.
  • Only tools of type function are supported.
  • /v1/responses uses the OpenAI Responses tool schema with top-level name, description, and parameters. The chat-style nested function wrapper is rejected on this route.
  • Unknown function names return an OpenAI-style validation error.
  • Parameter schemas must be JSON objects. Full JSON Schema enforcement is not implemented.

POST /v1/audio/transcriptions

Multipart transcription endpoint.

Example:

curl -X POST http://127.0.0.1:8080/v1/audio/transcriptions \
  -F "file=@sample.wav" \
  -F "model=apple" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=segment" \
  -F "timestamp_granularities[]=word"

Validation rules:

  • required: file, model
  • built-in transcription engine label: apple
  • current server behavior: model must be non-empty, but it is not otherwise validated or used to select a transcription backend
  • max file size: 25 MB
  • allowed extensions: flac, m4a, mp3, mp4, mpeg, mpga, ogg, wav, webm
  • temperature must be between 0 and 1
  • timestamp_granularities only works with response_format=verbose_json

Supported response_format values:

  • json
  • text
  • srt
  • verbose_json
  • vtt

Response types:

  • json and verbose_json: JSON body
  • text: plain text
  • srt: SubRip text
  • vtt: WebVTT text

POST /v1/audio/speech

Synthesizes speech and returns audio data.

Request:

{
  "model": "apple",
  "input": "Hello from ai.local.",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}

Example:

curl -X POST http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"apple","input":"Hello from ai.local.","voice":"alloy","response_format":"mp3"}' \
  --output speech.mp3

Validation rules:

  • required: model, input, voice
  • built-in speech engine label: apple
  • current server behavior: model must be non-empty, but it is not otherwise validated or used to select a speech backend
  • max input length: 4096 characters
  • supported voices: alloy, echo, fable, onyx, nova, shimmer
  • supported response_format: mp3, wav
  • speed must be between 0.25 and 4.0

Ollama-compatible API

POST /api/chat

Chat completion in Ollama-style request format.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "user", "content": "Say hello in one sentence." }
  ],
  "options": {
    "temperature": 0.7
  }
}
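
Equivalent curl invocation:

curl -X POST http://127.0.0.1:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","messages":[{"role":"user","content":"Say hello in one sentence."}],"options":{"temperature":0.7}}'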

Behavior:

  • If model is omitted, the server selects a default installed model
  • Used: model, messages, options.temperature
  • Accepted but ignored: stream, format, keep_alive, think, tools, options.top_k, options.top_p
  • Response uses the standard final Ollama chat object keys: model, created_at, message, done, done_reason, total_duration, load_duration, prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration
  • prompt_eval_count and eval_count are approximate character-based token estimates
  • Timing fields are synthesized as 0 because the local evaluator does not expose native Ollama nanosecond counters

POST /api/generate

Single-prompt generation endpoint.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "prompt": "Explain local-first AI in one sentence."
}

Optional images may be provided as an array of URL strings, local file paths, data: URLs, or base64 strings.
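
For example, a request using a data: URL (the base64 payload is shortened to a placeholder here, and the vision model is assumed to be installed):

{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "prompt": "Describe this image.",
  "images": ["data:image/png;base64,iVBORw0KGgo..."]
}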

Behavior:

  • Used: model, prompt, images
  • Accepted but ignored: stream, format, system, raw, think, keep_alive, options, logprobs, top_logprobs
  • Response uses the standard final Ollama generate object keys: model, created_at, response, done, done_reason, total_duration, load_duration, prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration
  • thinking is not emitted because this server does not expose a separate reasoning channel on /api/generate
  • prompt_eval_count and eval_count are approximate character-based token estimates
  • Timing fields are synthesized as 0 because the local evaluator does not expose native Ollama nanosecond counters

POST /api/show

Returns inferred model metadata for an installed model.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "verbose": true
}
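
Example:

curl -X POST http://127.0.0.1:8080/api/show \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","verbose":true}'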

Notes:

  • Returns 404 with { "error": "model '<name>' not found" } if the model is unavailable
  • Response keys follow the Ollama snake_case shape, including modified_at, details.parent_model, and model_info
  • license, family, parameter_size, quantization_level, template, and model_info are inferred from cached files and model names
  • Verbose output is synthesized for compatibility; it is not backed by a native Ollama runtime or Modelfile

GET /api/tags

Lists available models in an Ollama-like response format.
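
Example:

curl http://127.0.0.1:8080/api/tags

Representative response: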

{
  "models": [
    {
      "name": "llama-3.2-1b-instruct-4bit",
      "model": "llama-3.2-1b-instruct-4bit",
      "modified_at": "2026-03-11T14:00:00Z",
      "size": 1234567890,
      "digest": "c180fa9df3a6b73d6f4bf2af9eaef7ca51e9167e8d1812c1fc8de12d5d00d992",
      "details": {
        "format": "gguf",
        "family": "llama",
        "families": null,
        "parameter_size": "13B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}

The details payload is compatibility metadata assembled by the server, not a direct export from Ollama. size is the cached model-directory size in bytes when the directory can be resolved; otherwise it falls back to 0. digest is a stable SHA-256 hash of the normalized model name because the local cache does not expose native Ollama digests.

Errors

OpenAI-style error envelope

/v1/completions, /v1/chat/completions, /v1/responses, and /v1/audio/* convert validation and runtime failures into:

{
  "error": {
    "message": "Unsupported voice 'robot'. Supported voices: alloy, echo, fable, onyx, nova, shimmer.",
    "type": "invalid_request_error",
    "param": "voice",
    "code": "invalid_request_error"
  }
}
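
The envelope above, for example, is what a speech request with an unsupported voice returns:

curl -X POST http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"apple","input":"Hello","voice":"robot"}'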

For stream=true on /v1/completions and /v1/chat/completions, preflight validation failures still return the same JSON error envelope as a normal 4xx or 5xx response. If a failure happens after the SSE stream has started, the server emits event: error with that same error object and then terminates the stream with [DONE].

Current compatibility gaps

  • top_p, max_tokens, and max_completion_tokens may be accepted on text routes but are not enforced by the local generation engine.
  • Completion and chat usage counts are approximate.
  • Function calling is prompt-driven over the local model. It supports request-listed registered functions, but it does not implement token-level streaming tool calls or full JSON Schema validation.
  • apple-foundation-model on /v1/* currently supports plain-text generation only; image input, tool calling, and tool-call history are rejected on that backend.
  • Legacy OpenAI compatibility aliases such as /v1/chat, functions, and function_call are intentionally not supported.
  • Ollama /api/chat and /api/generate currently return only the final non-streaming JSON object. Standard Ollama usage counters are synthesized for compatibility, with approximate token counts and zero-valued duration fields.
  • Ollama /api/show and /api/tags return synthesized metadata derived from the local model cache, not from a native Ollama runtime.