# HTTP API

## Overview

ai.local exposes two HTTP API surfaces:

- OpenAI-compatible endpoints under `/v1/*`
- Ollama-compatible endpoints under `/api/*`
Examples in this document use `http://127.0.0.1:8080`. Replace the host and port with the address used by your ai.local server.
## Conventions
- Request bodies are JSON unless noted otherwise.
- Successful JSON endpoints return `Content-Type: application/json`. `POST /v1/audio/transcriptions` requires `multipart/form-data`. `POST /v1/audio/speech` returns binary audio, not JSON.
- For OpenAI routes, use model IDs returned by `GET /v1/models`, for example `llama-3.2-1b-instruct-4bit`.
- OpenAI text-generation routes support locally installed models. `apple-foundation-model` is handled separately when Apple Foundation Models are available on the device.
- Ollama routes accept model names with or without the `mlx-community/` prefix.
- Usage counts are approximate character-based estimates, not tokenizer-exact values.
- The server reserves `/`, `/status`, `/v1/*`, and `/api/*` for built-in routes.
## Health and device control

### HEAD /

Health probe. Returns `200 OK` with no body.
### GET /status

Returns basic server status.

```json
{
  "status": "Running",
  "message": "Server is currently running."
}
```
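Both probes can be exercised from the command line; a minimal sketch, assuming the example address from the Overview:

```bash
# HEAD / — curl -I issues a HEAD request; expect "HTTP/1.1 200 OK" and no body.
curl -I http://127.0.0.1:8080/

# GET /status — returns the JSON status object shown above.
curl -s http://127.0.0.1:8080/status
```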
### POST /api/services/screen/brightness

Sets the device brightness. `brightness` is clamped to the range `0...1`.

```json
{
  "brightness": 0.5
}
```

Response:

```json
{
  "requestedBrightness": 0.5,
  "appliedBrightness": 0.5
}
```
Notes:

- Returns the requested and applied brightness values when brightness control is available on the current platform.
- Returns `501 Not Implemented` when brightness control is unavailable.
### GET /api/services/screen/brightness

Returns the current screen brightness.

```json
{
  "brightness": 0.5
}
```
Notes:

- Returns the current brightness value when brightness queries are available on the current platform.
- Returns `501 Not Implemented` when brightness queries are unavailable.
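A round-trip sketch using the two brightness endpoints, assuming the example address and a platform where brightness control is available:

```bash
# Set brightness to 50%; out-of-range values are clamped to 0...1.
curl -X POST http://127.0.0.1:8080/api/services/screen/brightness \
  -H "Content-Type: application/json" \
  -d '{"brightness": 0.5}'

# Read the current brightness back.
curl -s http://127.0.0.1:8080/api/services/screen/brightness
```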
## OpenAI-compatible API

### GET /v1/models

Lists installed models, plus `apple-foundation-model` when the device reports Foundation Models availability.
```json
{
  "object": "list",
  "data": [
    {
      "id": "llama-3.2-1b-instruct-4bit",
      "object": "model",
      "created": 0,
      "owned_by": "local-ai-server"
    }
  ]
}
```
Notes:

- Returned model IDs are lowercased and stripped of the `mlx-community/` prefix.
- The response uses the standard OpenAI list envelope and standard model fields only.
- `owned_by` is `apple` for `apple-foundation-model` and `local-ai-server` for installed local models.
- When `apple-foundation-model` is requested on OpenAI text routes, the server uses Apple Foundation Models instead of MLX.
- `apple-foundation-model` currently supports plain-text generation only on `/v1/*`. Image input, tool calling, and tool-call history are rejected for that model.
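To see which model IDs the server will accept, list them directly; a minimal sketch, assuming the example address:

```bash
# Returns the OpenAI-style list envelope shown above; use the "id"
# values verbatim as the "model" field in /v1/* requests.
curl -s http://127.0.0.1:8080/v1/models
```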
### POST /v1/completions

Single-prompt text completion.

Request:

```json
{
  "model": "llama-3.2-1b-instruct-4bit",
  "prompt": "Write one short sentence about local inference.",
  "temperature": 0.7,
  "n": 1
}
```
Response shape:

```json
{
  "id": "UUID",
  "object": "text_completion",
  "created": 1741700000,
  "model": "llama-3.2-1b-instruct-4bit",
  "choices": [
    {
      "text": "Local inference keeps the model on your device.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 12,
    "total_tokens": 23
  }
}
```
Behavior:

- Used: `model`, `prompt`, `temperature`, `n`
- Accepted but not enforced: `max_tokens`, `top_p`
- Supported: `stream=true` returns `text/event-stream` with OpenAI-style `text_completion` chunks and a final `[DONE]` (see the sketch after this list)
- If generation fails after the SSE stream has started, the server emits `event: error` with the standard OpenAI `error` envelope before `[DONE]`
- `n` duplicates the same generated text across choices instead of sampling independent completions
- `model=apple-foundation-model` runs through Apple Foundation Models when they are available on the device
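A streaming sketch for this endpoint, assuming the example address; `curl -N` disables output buffering so SSE chunks print as they arrive:

```bash
# Streams text_completion chunks as SSE, terminated by "data: [DONE]".
curl -N -X POST http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b-instruct-4bit",
    "prompt": "Write one short sentence about local inference.",
    "stream": true
  }'
```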
### POST /v1/chat/completions

OpenAI Chat Completions endpoint.

Basic request:

```json
{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "system", "content": "Reply in one sentence." },
    { "role": "user", "content": "What is edge inference?" }
  ],
  "temperature": 0.7
}
```
A multimodal request shape is also accepted:

```json
{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/cat.jpg"
          }
        }
      ]
    }
  ]
}
```
Behavior:

- Used: `model`, `messages`, `temperature`, `tools`, `tool_choice`
- Accepted but not enforced: `top_p`, `max_tokens`, `max_completion_tokens`, `n`
- Supported: `stream=true` returns `text/event-stream` with `chat.completion.chunk` payloads and a final `[DONE]` (see the sketch after this list)
- If generation fails after the SSE stream has started, the server emits `event: error` with the standard OpenAI `error` envelope before `[DONE]`
- Rejected: legacy `functions`, legacy `function_call`, and non-standard chat request fields on `/v1`
- The response always contains a single choice
- Only `POST /v1/chat/completions` is registered; `/v1/chat` is not available
- Only standard OpenAI chat content parts are documented on this route. For image input, send an `image_url` object with a `url` string
- When functions are provided, they must already be registered on the server and also be listed in the request
- Only tools of type `function` are supported
- Function calls are executed locally on the server. Plain-text replies still stream incrementally when tools are present, but tool phases themselves may be buffered before the final assistant text is emitted
- `model=apple-foundation-model` only accepts plain-text `system`, `user`, and `assistant` messages. Image input, tool calls, tool-call history, and conversations whose final non-system message is not a `user` turn are rejected
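A minimal streaming sketch for this route, assuming the example address:

```bash
# Streams chat.completion.chunk payloads as SSE, ending with "data: [DONE]".
curl -N -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b-instruct-4bit",
    "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
    "stream": true
  }'
```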
Function calling request:
```json
{
"model": "llama-3.2-1b-instruct-4bit",
"messages": [
{ "role": "user", "content": "Add 2 and 3." }
],
"tools": [
{
"type": "function",
"function": {
"name": "add_numbers",
"description": "Add two integers.",
"parameters": {
"type": "object",
"properties": {
"a": { "type": "integer" },
"b": { "type": "integer" }
},
"required": ["a", "b"]
}
}
}
],
"tool_choice": "required"
}
```
### POST /v1/responses

OpenAI Responses-style wrapper over the same generation engine.

Simple request:

```json
{
  "model": "llama-3.2-1b-instruct-4bit",
  "input": "Summarize on-device AI in one sentence.",
  "instructions": "Be concise.",
  "temperature": 0.7
}
```
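A command-line sketch of the simple request above, assuming the example address:

```bash
# Returns a Responses object whose "output" array holds the assistant message.
curl -s -X POST http://127.0.0.1:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b-instruct-4bit",
    "input": "Summarize on-device AI in one sentence.",
    "instructions": "Be concise."
  }'
```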
Structured input with an image:

```json
{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe this image." },
        { "type": "input_image", "image_url": "https://example.com/cat.jpg" }
      ]
    }
  ]
}
```
Accepted input shapes:
- plain string
- array of OpenAI Responses input items
Behavior:

- Used for generation: `model`, `input`, `instructions`, `temperature`, `tools`, `tool_choice`, `parallel_tool_calls`
- Echoed in the response metadata: `top_p`, `max_output_tokens`, `previous_response_id`, `store`, `metadata`, `truncation`
- Supported: `stream=true` returns `text/event-stream` with OpenAI Responses events such as `response.created`, `response.in_progress`, `response.output_item.added`, `response.content_part.added`, `response.output_text.delta`, and `response.completed`
- Rejected: single-object `input`, and legacy non-OpenAI aliases such as `tool_call_id`
- Supported input-item types: `message`, `input_text`, `input_image`, `function_call`, `function_call_output`
- Not implemented: `input_file`, `file_id`-based file resolution, reasoning summaries
- When functions execute, `output` includes `function_call` and `function_call_output` items before the final assistant message
- `model=apple-foundation-model` only accepts plain-text requests on this route. `input_image`, `function_call`, and `function_call_output` items are rejected for that model because image input and tool execution are not implemented on the Apple Foundation backend
Function calling request:
```json
{
"model": "llama-3.2-1b-instruct-4bit",
"input": "Add 2 and 3.",
"tools": [
{
"type": "function",
"name": "add_numbers",
"description": "Add two integers.",
"parameters": {
"type": "object",
"properties": {
"a": { "type": "integer" },
"b": { "type": "integer" }
},
"required": ["a", "b"]
},
"strict": true
}
],
"tool_choice": {
"type": "function",
"name": "add_numbers"
}
}
```
Representative response shape:
```json
{
"id": "resp_123",
"object": "response",
"created_at": 1741700000,
"completed_at": 1741700000,
"background": false,
"status": "completed",
"error": null,
"incomplete_details": null,
"instructions": null,
"max_output_tokens": null,
"model": "llama-3.2-1b-instruct-4bit",
"output": [
{
"id": "fc_123",
"type": "function_call",
"status": "completed",
"call_id": "call_123",
"name": "add_numbers",
"arguments": "{\"a\":2,\"b\":3}"
},
{
"id": "fco_123",
"type": "function_call_output",
"status": "completed",
"call_id": "call_123",
"output": "{\"sum\":5}"
},
{
"id": "msg_123",
"type": "message",
"status": "completed",
"role": "assistant",
"content": [
{
"type": "output_text",
"text": "The sum is 5.",
"annotations": [],
"logprobs": []
}
]
}
],
"parallel_tool_calls": true,
"previous_response_id": null,
"reasoning": {
"effort": null,
"summary": null
},
"store": true,
"temperature": 1,
"text": {
"format": {
"type": "text"
}
},
"tool_choice": {
"type": "function",
"name": "add_numbers"
},
"tools": [
{
"type": "function",
"name": "add_numbers",
"description": "Add two integers.",
"parameters": {
"type": "object"
},
"strict": true
}
],
"top_p": 1,
"truncation": "disabled",
"usage": {
"input_tokens": 10,
"input_tokens_details": {
"cached_tokens": 0
},
"output_tokens": 6,
"output_tokens_details": {
"reasoning_tokens": 0
},
"total_tokens": 16
},
"user": null,
"metadata": {}
}
```
Notes:

- Functions must already be registered on the server and be included in the request payload.
- Only tools of type `function` are supported.
- `/v1/responses` uses the OpenAI Responses tool schema with top-level `name`, `description`, and `parameters`. The chat-style nested `function` wrapper is rejected on this route.
- Unknown function names return an OpenAI-style validation error.
- Parameter schemas must be JSON objects. Full JSON Schema enforcement is not implemented.
### POST /v1/audio/transcriptions

Multipart transcription endpoint.

Example:

```bash
curl -X POST http://127.0.0.1:8080/v1/audio/transcriptions \
  -F "file=@sample.wav" \
  -F "model=apple" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=segment" \
  -F "timestamp_granularities[]=word"
```
Validation rules:

- required: `file`, `model`
- built-in transcription engine label: `apple`
- current server behavior: `model` must be non-empty, but it is not otherwise validated or used to select a transcription backend
- max file size: 25 MB
- allowed extensions: `flac`, `m4a`, `mp3`, `mp4`, `mpeg`, `mpga`, `ogg`, `wav`, `webm`
- `temperature` must be between `0` and `1`
- `timestamp_granularities` only works with `response_format=verbose_json`

Supported `response_format` values: `json`, `text`, `srt`, `verbose_json`, `vtt`

Response types:

- `json` and `verbose_json`: JSON body
- `text`: plain text
- `srt`: SubRip text
- `vtt`: WebVTT text
### POST /v1/audio/speech

Synthesizes speech and returns audio data.

Request:

```json
{
  "model": "apple",
  "input": "Hello from ai.local.",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
```

Example:

```bash
curl -X POST http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"apple","input":"Hello from ai.local.","voice":"alloy","response_format":"mp3"}' \
  --output speech.mp3
```
Validation rules:

- required: `model`, `input`, `voice`
- built-in speech engine label: `apple`
- current server behavior: `model` must be non-empty, but it is not otherwise validated or used to select a speech backend
- max input length: 4096 characters
- supported voices: `alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`
- supported `response_format`: `mp3`, `wav`
- `speed` must be between `0.25` and `4.0`
## Ollama-compatible API

### POST /api/chat

Chat completion in Ollama-style request format.

Request:

```json
{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "user", "content": "Say hello in one sentence." }
  ],
  "options": {
    "temperature": 0.7
  }
}
```
Behavior:

- If `model` is omitted, the server selects a default installed model
- Used: `model`, `messages`, `options.temperature`
- Accepted but ignored: `stream`, `format`, `keep_alive`, `think`, `tools`, `options.top_k`, `options.top_p`
- Response uses the standard final Ollama chat object keys: `model`, `created_at`, `message`, `done`, `done_reason`, `total_duration`, `load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration`
- `prompt_eval_count` and `eval_count` are approximate character-based token estimates
- Timing fields are synthesized as `0` because the local evaluator does not expose native Ollama nanosecond counters
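The request above can be sent as a one-liner; a minimal sketch, assuming the example address:

```bash
# Returns one final Ollama-style chat object; "stream" would be ignored here.
curl -s -X POST http://127.0.0.1:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b-instruct-4bit",
    "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
    "options": { "temperature": 0.7 }
  }'
```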
### POST /api/generate

Single-prompt generation endpoint.

Request:

```json
{
  "model": "llama-3.2-1b-instruct-4bit",
  "prompt": "Explain local-first AI in one sentence."
}
```

Optional images may be provided in an `images` array of URL strings, local file paths, `data:` URLs, or base64 strings.
Behavior:

- Used: `model`, `prompt`, `images`
- Accepted but ignored: `stream`, `format`, `system`, `raw`, `think`, `keep_alive`, `options`, `logprobs`, `top_logprobs`
- Response uses the standard final Ollama generate object keys: `model`, `created_at`, `response`, `done`, `done_reason`, `total_duration`, `load_duration`, `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration`
- `thinking` is not emitted because this server does not expose a separate reasoning channel on `/api/generate`
- `prompt_eval_count` and `eval_count` are approximate character-based token estimates
- Timing fields are synthesized as `0` because the local evaluator does not expose native Ollama nanosecond counters
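A command-line sketch of the request above, assuming the example address:

```bash
# Returns one final Ollama-style generate object; "stream" would be ignored here.
curl -s -X POST http://127.0.0.1:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-1b-instruct-4bit",
    "prompt": "Explain local-first AI in one sentence."
  }'
```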
### POST /api/show

Returns inferred model metadata for an installed model.

Request:

```json
{
  "model": "llama-3.2-1b-instruct-4bit",
  "verbose": true
}
```
Notes:

- Returns `404` with `{ "error": "model '<name>' not found" }` if the model is unavailable
- Response keys follow the Ollama snake_case shape, including `modified_at`, `details.parent_model`, and `model_info`
- `license`, `family`, `parameter_size`, `quantization_level`, `template`, and `model_info` are inferred from cached files and model names
- Verbose output is synthesized for compatibility; it is not backed by a native Ollama runtime or `Modelfile`
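A quick sketch, assuming the example address; an unknown model name should produce the `404` error shape noted above:

```bash
# Inspect synthesized metadata for an installed model.
curl -s -X POST http://127.0.0.1:8080/api/show \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-1b-instruct-4bit", "verbose": true}'
```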
### GET /api/tags

Lists available models in an Ollama-like response format.
```json
{
"models": [
{
"name": "llama-3.2-1b-instruct-4bit",
"model": "llama-3.2-1b-instruct-4bit",
"modified_at": "2026-03-11T14:00:00Z",
"size": 1234567890,
"digest": "c180fa9df3a6b73d6f4bf2af9eaef7ca51e9167e8d1812c1fc8de12d5d00d992",
"details": {
"format": "gguf",
"family": "llama",
"families": null,
"parameter_size": "13B",
"quantization_level": "Q4_0"
}
}
]
}
```
The `details` payload is compatibility metadata assembled by the server, not a direct export from Ollama.

`size` is the cached model-directory size in bytes when the directory can be resolved; otherwise it falls back to `0`.

`digest` is a stable SHA-256 hash of the normalized model name because the local cache does not expose native Ollama digests.
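Because the digest is derived from the model name rather than from file contents, it can in principle be reproduced offline. A sketch, assuming "normalized" means the lowercased name without the `mlx-community/` prefix — an assumption based on the model-ID rules above; the exact normalization lives in the server source:

```bash
# Hypothetical digest check: hash the normalized model name with SHA-256
# and compare against the "digest" field from GET /api/tags.
printf 'llama-3.2-1b-instruct-4bit' | shasum -a 256
```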
## Errors

### OpenAI-style error envelope

`/v1/completions`, `/v1/chat/completions`, `/v1/responses`, and `/v1/audio/*` convert validation and runtime failures into:
```json
{
  "error": {
    "message": "Unsupported voice 'robot'. Supported voices: alloy, echo, fable, onyx, nova, shimmer.",
    "type": "invalid_request_error",
    "param": "voice",
    "code": "invalid_request_error"
  }
}
```
For `stream=true` on `/v1/completions` and `/v1/chat/completions`, preflight validation failures still return the same JSON error envelope as a normal 4xx or 5xx response. If a failure happens after the SSE stream has started, the server emits `event: error` with that same error object and then terminates the stream with `[DONE]`.
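As a rough illustration, the tail of a failed stream would look something like the following; the field values here are placeholders, and the actual message, type, and code depend on the failure:

```
event: error
data: {"error":{"message":"<failure description>","type":"server_error","param":null,"code":"server_error"}}

data: [DONE]
```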
## Current compatibility gaps

- `top_p`, `max_tokens`, and `max_completion_tokens` may be accepted on text routes but are not enforced by the local generation engine.
- Completion and chat usage counts are approximate.
- Function calling is prompt-driven over the local model. It supports request-listed registered functions, but it does not implement token-level streaming tool calls or full JSON Schema validation.
- `apple-foundation-model` on `/v1/*` currently supports plain-text generation only; image input, tool calling, and tool-call history are rejected on that backend.
- Legacy OpenAI compatibility aliases such as `/v1/chat`, `functions`, and `function_call` are intentionally not supported.
- Ollama `/api/chat` and `/api/generate` currently return only the final non-streaming JSON object. Standard Ollama usage counters are synthesized for compatibility, with approximate token counts and zero-valued duration fields.
- Ollama `/api/show` and `/api/tags` return synthesized metadata derived from the local model cache, not from a native Ollama runtime.