# Local AI Server API

This document describes the HTTP API exposed by the app server (`WebMServer`).
## Base URL

- Default port: `11434`
- Base URL format: `http://<iphone-ip-address>:<port>`
- The server binds to `0.0.0.0` (reachable on the local network).
- The Bonjour name is exposed in the UI and can also be used on compatible networks.
## Authentication
- No authentication is enforced by the server.
- If a client requires an API key (for OpenAI-compatible flows), any non-empty value can be used.
## Content Type

- Send `Content-Type: application/json` for most `POST` requests.
- `POST /v1/audio/transcriptions` requires `multipart/form-data`. It has a route body limit of 30 MB, but validates uploaded audio files up to 25 MB.
## Endpoint Summary

| Method | Path | Purpose |
|---|---|---|
| HEAD | `/` | Lightweight health check |
| GET | `/status` | Server status |
| GET | `/api/services/screen/brightness` | Read device brightness |
| POST | `/api/services/screen/brightness` | Set device brightness |
| GET | `/v1/models` | List installed models plus Apple Foundation Model when available (OpenAI-like) |
| POST | `/v1/chat/completions` | Chat completion (OpenAI-compatible) |
| POST | `/v1/chat` | Chat completion alias (legacy compatibility) |
| POST | `/v1/completions` | Text completion (OpenAI-like) |
| POST | `/v1/audio/transcriptions` | Speech-to-text transcription (OpenAI-compatible) |
| POST | `/v1/audio/speech` | Text-to-speech synthesis (OpenAI-compatible) |
| GET | `/api/tags` | List installed models plus Apple Foundation Model when available (Ollama-like) |
| POST | `/api/chat` | Chat completion (Ollama-like) |
| POST | `/api/generate` | Prompt completion (Ollama-like) |
| POST | `/api/show` | Show model details (Ollama-like) |
## Health and Status

### HEAD /

- Returns HTTP `200 OK` with no response body.

### GET /status

Returns:

```json
{
  "status": "Running",
  "message": "Server is currently running."
}
```

`status` values are `Stopped`, `Running`, or `Failed`.
## Device Control

### GET /api/services/screen/brightness

Returns the current brightness:

```json
{
  "brightness": 0.62
}
```

### POST /api/services/screen/brightness

Request body:

```json
{
  "brightness": 0.8
}
```

Behavior:

- `brightness` must be a finite number.
- The value is clamped to `[0.0, 1.0]`.
- Non-finite values return `400 Bad Request`.

Response:

```json
{
  "requestedBrightness": 0.8,
  "appliedBrightness": 0.8
}
```
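The validation and clamping rules above can be sketched as follows. This is a hypothetical client-side mirror of the documented behavior, not the server's actual implementation; the function name `apply_brightness` is invented for illustration.

```python
import math

def apply_brightness(requested: float) -> dict:
    """Mirror the documented behavior: reject non-finite values
    with a 400, clamp everything else to [0.0, 1.0]."""
    if not math.isfinite(requested):
        raise ValueError("400 Bad Request: brightness must be a finite number")
    applied = min(max(requested, 0.0), 1.0)
    return {"requestedBrightness": requested, "appliedBrightness": applied}
```

So a request with `"brightness": 1.7` succeeds but reports `appliedBrightness` of `1.0`, while `NaN` or `Infinity` is rejected outright.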
## OpenAI-like API (/v1)

### Model Naming

For `/v1` endpoints, send model names without the `mlx-community/` prefix (for example `Llama-3.2-1B-Instruct-4bit`).
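A hypothetical helper that applies this convention before calling a `/v1` endpoint (the helper itself is not part of the API):

```python
def v1_model_name(name: str) -> str:
    """Strip the Hugging Face org prefix so the name matches what
    /v1 endpoints expect (e.g. 'Llama-3.2-1B-Instruct-4bit')."""
    prefix = "mlx-community/"
    return name[len(prefix):] if name.startswith(prefix) else name
```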
### GET /v1/models

Returns the installed models discovered by the app. When Apple Foundation Models are available on the device, `apple-foundation-model` is included. Response keys are `snake_case`.

Example:

```json
{
  "object": "list",
  "data": [
    {
      "id": "llama-3.2-1b-instruct-4bit",
      "object": "model",
      "created": 0,
      "owned_by": "Unknown",
      "capabilities": {
        "completion_chat": true,
        "completion_fim": false,
        "function_calling": false,
        "vision": false,
        "fine_tuning": false
      },
      "name": "llama-3.2-1b-instruct-4bit",
      "description": "llama-3.2-1b-instruct-4bit",
      "aliases": [],
      "deprecation": null,
      "type": "base"
    }
  ]
}
```
### POST /v1/chat/completions

Request body:

```json
{
  "model": "Llama-3.2-1B-Instruct-4bit",
  "messages": [
    { "role": "user", "content": "Hello" }
  ]
}
```

Response shape:

```json
{
  "id": "uuid",
  "object": "chat_completion",
  "created": 1739577600,
  "model": "Llama-3.2-1B-Instruct-4bit",
  "choices": [
    { "message": { "role": "assistant", "content": "..." }, "index": 0, "finish_reason": "stop" }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 24,
    "total_tokens": 36
  }
}
```

Notes:

- `/v1/chat` is also available as a legacy alias and returns the same response shape.
- `temperature` is accepted and applied.
- `top_p`, `max_tokens`, `n`, and `stream` are accepted for compatibility but currently ignored.
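A minimal sketch of extracting the assistant reply from a response in this shape (the sample body and `assistant_text` helper are illustrative, not part of the API):

```python
import json

# Example body in the documented response shape.
raw = """
{
  "id": "uuid",
  "object": "chat_completion",
  "created": 1739577600,
  "model": "Llama-3.2-1B-Instruct-4bit",
  "choices": [
    { "message": { "role": "assistant", "content": "Hi there!" }, "index": 0, "finish_reason": "stop" }
  ],
  "usage": { "prompt_tokens": 12, "completion_tokens": 24, "total_tokens": 36 }
}
"""

def assistant_text(body: str) -> str:
    """Pull the assistant's reply out of a chat completion response."""
    data = json.loads(body)
    return data["choices"][0]["message"]["content"]
```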
### POST /v1/completions

Request body:

```json
{
  "model": "Llama-3.2-1B-Instruct-4bit",
  "prompt": "Explain transformers simply.",
  "max_tokens": 128,
  "temperature": 0.7,
  "top_p": 1.0,
  "n": 1
}
```

Response:

```json
{
  "id": "uuid",
  "object": "text_completion",
  "created": 1739577600,
  "model": "Llama-3.2-1B-Instruct-4bit",
  "choices": [
    {
      "text": "...",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 30,
    "completion_tokens": 20,
    "total_tokens": 50
  }
}
```

Notes:

- `n` defaults to `1` when omitted.
- `temperature` is used.
- `max_tokens` and `top_p` are accepted for compatibility but currently not applied.
- `n > 1` returns repeated choices containing the same generated text (not independent samples).
- `completion_tokens` is an estimate.
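The `n > 1` behavior can be illustrated with a small sketch (hypothetical, mirroring the documented behavior rather than the server's code): the same generated text is duplicated across choices.

```python
def build_choices(generated_text: str, n: int = 1) -> list:
    """Documented n > 1 behavior: the single generated text is
    repeated, not sampled independently per choice."""
    return [
        {"text": generated_text, "index": i, "logprobs": None, "finish_reason": "stop"}
        for i in range(n)
    ]
```

Clients that rely on `n` for diverse samples should instead issue separate requests.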
### POST /v1/audio/transcriptions

OpenAI-compatible speech-to-text endpoint.

Request (`multipart/form-data`) fields:

- `file` (required): audio file.
- `model` (required): model identifier (for compatibility).
- `language` (optional): locale/BCP-47 language hint (for example `en-US`).
- `prompt` (optional): context hint.
- `temperature` (optional): must be in `[0, 1]`.
- `response_format` (optional): `json` (default), `text`, `verbose_json`, `srt`, or `vtt`.
- `timestamp_granularities` or `timestamp_granularities[]` (optional, only with `verbose_json`): `segment`, `word`.

Supported upload extensions: `flac`, `m4a`, `mp3`, `mp4`, `mpeg`, `mpga`, `ogg`, `wav`, `webm`.

Maximum upload size: 25 MB.

Notes:

- `model` is required for OpenAI compatibility, but transcription engine selection is currently independent of this value.
- When `response_format=verbose_json`, omitting both `timestamp_granularities` and `timestamp_granularities[]` defaults to `segment`.
- Multipart clients can send timestamp granularity as either repeated fields or a single scalar field.
`json` response example:

```json
{
  "text": "Hello from local speech recognition."
}
```

`verbose_json` response example:

```json
{
  "task": "transcribe",
  "language": "en-US",
  "duration": 1.42,
  "text": "Hello from local speech recognition.",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0,
      "end": 1.42,
      "text": "Hello from local speech recognition.",
      "tokens": [],
      "temperature": 0,
      "avg_logprob": 0,
      "compression_ratio": 0,
      "no_speech_prob": 0
    }
  ],
  "words": [
    { "word": "Hello", "start": 0, "end": 0.2 }
  ]
}
```

`text`, `srt`, and `vtt` return plain-text bodies.
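To see how `verbose_json` segments relate to the `srt` output, here is a hypothetical converter. The server renders `srt` itself; this sketch only illustrates the mapping from segment `start`/`end` seconds to SRT cue timestamps.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list) -> str:
    """Render verbose_json-style segments as numbered SRT cues."""
    cues = []
    for seg in segments:
        cues.append(
            f"{seg['id'] + 1}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```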
### POST /v1/audio/speech

OpenAI-compatible text-to-speech endpoint.

Request body:

```json
{
  "model": "tts-1",
  "input": "Hello from local text to speech.",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
```

Behavior:

- `model` is required for OpenAI compatibility, but speech engine selection is currently independent of this value.
- Supported `voice` values: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`.
- Supported `response_format` values: `mp3` (default), `wav`.
- `speed` must be between `0.25` and `4`.
- `input` maximum length is 4096 characters.

Response:

- Binary audio (`audio/mpeg` for `mp3`, `audio/wav` for `wav`).
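A client can pre-check a request body against the documented constraints before uploading. This validator is a hypothetical sketch of those rules, not the server's code:

```python
VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}
FORMATS = {"mp3", "wav"}

def validate_speech_request(req: dict) -> None:
    """Check a /v1/audio/speech body against the documented limits,
    raising ValueError on the first violation."""
    if len(req["input"]) > 4096:
        raise ValueError("input exceeds 4096 characters")
    if req.get("voice", "alloy") not in VOICES:
        raise ValueError("unsupported voice")
    if req.get("response_format", "mp3") not in FORMATS:
        raise ValueError("unsupported response_format")
    if not 0.25 <= req.get("speed", 1.0) <= 4.0:
        raise ValueError("speed must be between 0.25 and 4")
```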
## Ollama-like API (/api)

### Model Naming

- `/api/chat`: `model` is optional. If omitted, the server tries, in order: the currently selected model in app settings, the first installed model, then the default model.
- `/api/generate`: `model` is required.
- `/api/show`: `model` is required.
- Names may be provided with or without the `mlx-community/` prefix.
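The `/api/chat` fallback order can be sketched as follows (a hypothetical mirror of the documented order; the function and parameter names are invented):

```python
def resolve_api_chat_model(requested, selected, installed, default):
    """Documented /api/chat fallback when model is omitted:
    request value, then app-selected model, then first installed
    model, then the default model."""
    if requested:
        return requested
    if selected:
        return selected
    if installed:
        return installed[0]
    return default
```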
### GET /api/tags

Returns the installed models discovered by the app. When Apple Foundation Models are available on the device, `apple-foundation-model` is included.

Response:

```json
{
  "models": [
    {
      "name": "llama-3.2-1b-instruct-4bit",
      "modified_at": "2026-02-15T12:00:00Z",
      "size": 0,
      "digest": "uuid",
      "details": {
        "format": "gguf",
        "family": "llama",
        "families": null,
        "parameter_size": "13B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}
```
### POST /api/chat

Request body:

```json
{
  "model": "Llama-3.2-1B-Instruct-4bit",
  "messages": [
    { "role": "user", "content": "Hello" }
  ],
  "stream": false,
  "options": {
    "topK": 40,
    "topP": 0.9,
    "temperature": 0.7
  }
}
```

Response:

```json
{
  "model": "Llama-3.2-1B-Instruct-4bit",
  "created_at": "2026-02-15T12:00:00Z",
  "message": { "role": "assistant", "content": "..." },
  "done": true
}
```

Notes:

- `stream` is accepted for compatibility but currently ignored; the response is non-streaming.
- In `options`, only `temperature` is currently used; `topK` and `topP` are currently ignored.
- For `options`, use camelCase keys (`topK`, `topP`, `temperature`).
### POST /api/generate

Request body:

```json
{
  "model": "Llama-3.2-1B-Instruct-4bit",
  "prompt": "Write a haiku about Swift.",
  "images": []
}
```
`images` can include URL strings, file paths, `data:` URLs, or base64 strings.
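The four accepted image forms can be told apart roughly like this. The heuristics below are illustrative only, not the server's actual detection logic (note in particular that raw base64 has no reliable marker):

```python
def classify_image_input(value: str) -> str:
    """Rough classification of an images[] entry into the four
    documented forms: data: URL, remote URL, file path, or base64."""
    if value.startswith("data:"):
        return "data-url"
    if value.startswith(("http://", "https://")):
        return "url"
    if value.startswith(("/", "file://")):
        return "path"
    return "base64"
```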
Response:

```json
{
  "response": "...",
  "model": "Llama-3.2-1B-Instruct-4bit",
  "created_at": "2026-02-15T12:00:00Z"
}
```
### POST /api/show

Request body:

```json
{
  "model": "Llama-3.2-1B-Instruct-4bit",
  "verbose": false
}
```

Response (`verbose: false`):

```json
{
  "license": "Unknown",
  "parameters": "temperature 0.7\nquantization Q4_0",
  "details": {
    "format": "safetensors",
    "family": "llama",
    "families": ["llama"],
    "parameter_size": "1B",
    "quantization_level": "Q4_0"
  },
  "capabilities": ["completion"],
  "modified_at": "2026-02-15T12:00:00Z"
}
```

Response (`verbose: true`) also includes `modelfile`, `template`, `model_info`, and `messages`.

If the model is not installed, returns `404`:

```json
{
  "error": "model 'your-model-name' not found"
}
```
## Message Format Details

`messages[].content` accepts:

- a plain string, or
- an object/array with mixed text and image parts (e.g. `{"type":"text","text":"..."}` and `{"type":"image_url","image_url":"..."}`).

Optional images can also be sent through `messages[].images`.

Supported roles include `user`, `assistant`, `system`, and `tool`.
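A sketch of building a mixed text-and-image message in the accepted content-part shape (the helper is hypothetical; the image URL is a placeholder):

```python
def mixed_message(text: str, image_url: str) -> dict:
    """Build a user message mixing a text part and an image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            {"type": "image_url", "image_url": image_url},
        ],
    }
```

The resulting dict can be placed directly into the `messages` array of a chat request body.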
## cURL Examples

OpenAI-like chat:

```sh
curl http://<iphone-ip-address>:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer local-ai" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role":"user","content":"Hello"}]
  }'
```

Ollama-like chat:

```sh
curl http://<iphone-ip-address>:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role":"user","content":"Hello"}],
    "stream": false
  }'
```