API Documentation

Complete endpoint reference for the local AI server.

HTTP API

Overview

ai.local exposes two HTTP API surfaces:

  • OpenAI-compatible endpoints under /v1/*
  • Ollama-compatible endpoints under /api/*

Examples in this document use http://127.0.0.1:8080. Replace the host and port with the address used by your ai.local server.

Conventions

  • Request bodies are JSON unless noted otherwise.
  • Successful JSON endpoints return Content-Type: application/json.
  • POST /v1/audio/transcriptions requires multipart/form-data.
  • POST /v1/audio/speech returns binary audio, not JSON.
  • For OpenAI routes, use model IDs returned by GET /v1/models, for example llama-3.2-1b-instruct-4bit.
  • OpenAI text-generation routes support locally installed models. apple-foundation-model is handled separately when Apple Foundation Models are available on the device.
  • Ollama routes accept model names with or without the mlx-community/ prefix.
  • Usage counts are approximate character-based estimates, not tokenizer-exact values.
  • The server reserves /, /status, /v1/*, and /api/* for built-in routes.

Health and device control

HEAD /

Health probe. Returns 200 OK with no body.
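
For example, with curl (the -I flag issues a HEAD request and prints only the response headers):

curl -I http://127.0.0.1:8080/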

GET /status

Returns basic server status.
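
Example:

curl http://127.0.0.1:8080/status

Response: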

{
  "status": "Running",
  "message": "Server is currently running."
}

POST /api/services/screen/brightness

Sets the device screen brightness. brightness is clamped to the range 0...1.

{
  "brightness": 0.5
}
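
For example, with curl:

curl -X POST http://127.0.0.1:8080/api/services/screen/brightness \
  -H "Content-Type: application/json" \
  -d '{"brightness":0.5}'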

Response:

{
  "requestedBrightness": 0.5,
  "appliedBrightness": 0.5
}

Notes:

  • Returns the requested and applied brightness values when brightness control is available on the current platform.
  • Returns 501 Not Implemented when brightness control is unavailable.

GET /api/services/screen/brightness

Returns the current screen brightness.

{
  "brightness": 0.5
}

Notes:

  • Returns the current brightness value when brightness queries are available on the current platform.
  • Returns 501 Not Implemented when brightness queries are unavailable.

OpenAI-compatible API

GET /v1/models

Lists installed models plus apple-foundation-model when the device reports Foundation Models availability.
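
Example:

curl http://127.0.0.1:8080/v1/models

Representative response: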

{
  "object": "list",
  "data": [
    {
      "id": "llama-3.2-1b-instruct-4bit",
      "object": "model",
      "created": 0,
      "owned_by": "local-ai-server"
    }
  ]
}

Notes:

  • Returned model IDs are lowercased and stripped of the mlx-community/ prefix.
  • The response uses the standard OpenAI list envelope and standard model fields only.
  • owned_by is apple for apple-foundation-model and local-ai-server for installed local models.
  • When apple-foundation-model is requested on OpenAI text routes, the server uses Apple Foundation Models instead of MLX.
  • apple-foundation-model currently supports plain-text generation only on /v1/*. Image input, tool calling, and tool-call history are rejected for that model.

POST /v1/completions

Single-prompt text completion.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "prompt": "Write one short sentence about local inference.",
  "temperature": 0.7,
  "n": 1
}

Response shape:

{
  "id": "UUID",
  "object": "text_completion",
  "created": 1741700000,
  "model": "llama-3.2-1b-instruct-4bit",
  "choices": [
    {
      "text": "Local inference keeps the model on your device.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 11,
    "completion_tokens": 12,
    "total_tokens": 23
  }
}

Behavior:

  • Used: model, prompt, temperature, n
  • Accepted but not enforced: max_tokens, top_p
  • Supported: stream=true returns text/event-stream with OpenAI-style text_completion chunks and a final [DONE] (see the streaming example after this list)
  • If generation fails after the SSE stream has started, the server emits event: error with the standard OpenAI error envelope before [DONE]
  • n duplicates the same generated text across choices instead of sampling independent completions
  • model=apple-foundation-model runs through Apple Foundation Models when they are available on the device
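
A minimal streaming request with curl, assuming llama-3.2-1b-instruct-4bit is installed (-N disables output buffering so SSE chunks print as they arrive):

curl -N -X POST http://127.0.0.1:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","prompt":"Say hello.","stream":true}'

The stream carries text_completion chunks and terminates with data: [DONE].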

POST /v1/chat/completions

OpenAI Chat Completions endpoint.

Basic request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "system", "content": "Reply in one sentence." },
    { "role": "user", "content": "What is edge inference?" }
  ],
  "temperature": 0.7
}
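
The same request, condensed, as a curl invocation:

curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","messages":[{"role":"user","content":"What is edge inference?"}]}'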

Multimodal request shape also accepted:

{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image." },
        {
          "type": "image_url",
          "image_url": {
            "url": "https://example.com/cat.jpg"
          }
        }
      ]
    }
  ]
}

Behavior:

  • Used: model, messages, temperature, tools, tool_choice
  • Accepted but not enforced: top_p, max_tokens, max_completion_tokens, n
  • Supported: stream=true returns text/event-stream with chat.completion.chunk payloads and a final [DONE]
  • If generation fails after the SSE stream has started, the server emits event: error with the standard OpenAI error envelope before [DONE]
  • Rejected: legacy functions, legacy function_call, and non-standard chat request fields on /v1
  • Response always contains a single choice
  • Only POST /v1/chat/completions is registered. /v1/chat is not available.
  • Only standard OpenAI chat content parts are documented on this route. For image input, send an image_url object with a url string.
  • When functions are provided, they must already be registered on the server and also be listed in the request
  • Only tools of type function are supported
  • Function calls are executed locally on the server. Plain-text replies still stream incrementally when tools are present, but tool phases themselves may be buffered before the final assistant text is emitted.
  • model=apple-foundation-model only accepts plain-text system, user, and assistant messages. Image input, tool calls, tool-call history, and conversations whose final non-system message is not a user turn are rejected.

Function calling request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "user", "content": "Add 2 and 3." }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "add_numbers",
        "description": "Add two integers.",
        "parameters": {
          "type": "object",
          "properties": {
            "a": { "type": "integer" },
            "b": { "type": "integer" }
          },
          "required": ["a", "b"]
        }
      }
    }
  ],
  "tool_choice": "required"
}

POST /v1/responses

OpenAI Responses-style wrapper over the same generation engine.

Simple request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "input": "Summarize on-device AI in one sentence.",
  "instructions": "Be concise.",
  "temperature": 0.7
}
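
Sent with curl:

curl -X POST http://127.0.0.1:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","input":"Summarize on-device AI in one sentence.","instructions":"Be concise."}'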

Structured input with image:

{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "input": [
    {
      "type": "message",
      "role": "user",
      "content": [
        { "type": "input_text", "text": "Describe this image." },
        { "type": "input_image", "image_url": "https://example.com/cat.jpg" }
      ]
    }
  ]
}

Accepted input shapes:

  • plain string
  • array of OpenAI Responses input items

Behavior:

  • Used for generation: model, input, instructions, temperature, tools, tool_choice, parallel_tool_calls
  • Echoed in the response metadata: top_p, max_output_tokens, previous_response_id, store, metadata, truncation
  • Supported: stream=true returns text/event-stream with OpenAI Responses events such as response.created, response.in_progress, response.output_item.added, response.content_part.added, response.output_text.delta, and response.completed
  • Rejected: single-object input and legacy non-OpenAI aliases such as tool_call_id
  • Supported input-item types: message, input_text, input_image, function_call, function_call_output
  • Not implemented: input_file, file_id-based file resolution, reasoning summaries
  • When functions execute, output includes function_call and function_call_output items before the final assistant message
  • model=apple-foundation-model only accepts plain-text requests on this route. input_image, function_call, and function_call_output items are rejected for that model because image input and tool execution are not implemented on the Apple Foundation backend.

Function calling request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "input": "Add 2 and 3.",
  "tools": [
    {
      "type": "function",
      "name": "add_numbers",
      "description": "Add two integers.",
      "parameters": {
        "type": "object",
        "properties": {
          "a": { "type": "integer" },
          "b": { "type": "integer" }
        },
        "required": ["a", "b"]
      },
      "strict": true
    }
  ],
  "tool_choice": {
    "type": "function",
    "name": "add_numbers"
  }
}

Representative response shape:

{
  "id": "resp_123",
  "object": "response",
  "created_at": 1741700000,
  "completed_at": 1741700000,
  "background": false,
  "status": "completed",
  "error": null,
  "incomplete_details": null,
  "instructions": null,
  "max_output_tokens": null,
  "model": "llama-3.2-1b-instruct-4bit",
  "output": [
    {
      "id": "fc_123",
      "type": "function_call",
      "status": "completed",
      "call_id": "call_123",
      "name": "add_numbers",
      "arguments": "{\"a\":2,\"b\":3}"
    },
    {
      "id": "fco_123",
      "type": "function_call_output",
      "status": "completed",
      "call_id": "call_123",
      "output": "{\"sum\":5}"
    },
    {
      "id": "msg_123",
      "type": "message",
      "status": "completed",
      "role": "assistant",
      "content": [
        {
          "type": "output_text",
          "text": "The sum is 5.",
          "annotations": [],
          "logprobs": []
        }
      ]
    }
  ],
  "parallel_tool_calls": true,
  "previous_response_id": null,
  "reasoning": {
    "effort": null,
    "summary": null
  },
  "store": true,
  "temperature": 1,
  "text": {
    "format": {
      "type": "text"
    }
  },
  "tool_choice": {
    "type": "function",
    "name": "add_numbers"
  },
  "tools": [
    {
      "type": "function",
      "name": "add_numbers",
      "description": "Add two integers.",
      "parameters": {
        "type": "object"
      },
      "strict": true
    }
  ],
  "top_p": 1,
  "truncation": "disabled",
  "usage": {
    "input_tokens": 10,
    "input_tokens_details": {
      "cached_tokens": 0
    },
    "output_tokens": 6,
    "output_tokens_details": {
      "reasoning_tokens": 0
    },
    "total_tokens": 16
  },
  "user": null,
  "metadata": {}
}

Notes:

  • Functions must already be registered on the server and be included in the request payload.
  • Only tools of type function are supported.
  • /v1/responses uses the OpenAI Responses tool schema with top-level name, description, and parameters. The chat-style nested function wrapper is rejected on this route.
  • Unknown function names return an OpenAI-style validation error.
  • Parameter schemas must be JSON objects. Full JSON Schema enforcement is not implemented.

POST /v1/audio/transcriptions

Multipart transcription endpoint.

Example:

curl -X POST http://127.0.0.1:8080/v1/audio/transcriptions \
  -F "file=@sample.wav" \
  -F "model=apple" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=segment" \
  -F "timestamp_granularities[]=word"

Validation rules:

  • required: file, model
  • built-in transcription engine label: apple
  • current server behavior: model must be non-empty, but it is not otherwise validated or used to select a transcription backend
  • max file size: 25 MB
  • allowed extensions: flac, m4a, mp3, mp4, mpeg, mpga, ogg, wav, webm
  • temperature must be between 0 and 1
  • timestamp_granularities only works with response_format=verbose_json

Supported response_format values:

  • json
  • text
  • srt
  • verbose_json
  • vtt

Response types:

  • json and verbose_json: JSON body
  • text: plain text
  • srt: SubRip text
  • vtt: WebVTT text

POST /v1/audio/speech

Synthesizes speech and returns audio data.

Request:

{
  "model": "apple",
  "input": "Hello from ai.local.",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}

Example:

curl -X POST http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"apple","input":"Hello from ai.local.","voice":"alloy","response_format":"mp3"}' \
  --output speech.mp3

Validation rules:

  • required: model, input, voice
  • built-in speech engine label: apple
  • current server behavior: model must be non-empty, but it is not otherwise validated or used to select a speech backend
  • max input length: 4096 characters
  • supported voices: alloy, echo, fable, onyx, nova, shimmer
  • supported response_format: mp3, wav
  • speed must be between 0.25 and 4.0

Ollama-compatible API

POST /api/chat

Chat completion in Ollama-style request format.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "messages": [
    { "role": "user", "content": "Say hello in one sentence." }
  ],
  "options": {
    "temperature": 0.7
  }
}
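
Equivalent curl invocation:

curl -X POST http://127.0.0.1:8080/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","messages":[{"role":"user","content":"Say hello in one sentence."}],"options":{"temperature":0.7}}'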

Behavior:

  • If model is omitted, the server selects a default installed model
  • Used: model, messages, options.temperature
  • Accepted but ignored: stream, format, keep_alive, think, tools, options.top_k, options.top_p
  • Response uses the standard final Ollama chat object keys: model, created_at, message, done, done_reason, total_duration, load_duration, prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration
  • prompt_eval_count and eval_count are approximate character-based token estimates
  • Timing fields are synthesized as 0 because the local evaluator does not expose native Ollama nanosecond counters

POST /api/generate

Single-prompt generation endpoint.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "prompt": "Explain local-first AI in one sentence."
}

Optional images may be provided as an array of URL strings, local file paths, data: URLs, or base64 strings.
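
For example, a request using a data: URL (the base64 payload is shortened to a placeholder here, and the vision model is assumed to be installed):

{
  "model": "llama-3.2-11b-vision-instruct-4bit",
  "prompt": "Describe this image.",
  "images": ["data:image/png;base64,iVBORw0KGgo..."]
}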

Behavior:

  • Used: model, prompt, images
  • Accepted but ignored: stream, format, system, raw, think, keep_alive, options, logprobs, top_logprobs
  • Response uses the standard final Ollama generate object keys: model, created_at, response, done, done_reason, total_duration, load_duration, prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration
  • thinking is not emitted because this server does not expose a separate reasoning channel on /api/generate
  • prompt_eval_count and eval_count are approximate character-based token estimates
  • Timing fields are synthesized as 0 because the local evaluator does not expose native Ollama nanosecond counters

POST /api/show

Returns inferred model metadata for an installed model.

Request:

{
  "model": "llama-3.2-1b-instruct-4bit",
  "verbose": true
}
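
Example:

curl -X POST http://127.0.0.1:8080/api/show \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3.2-1b-instruct-4bit","verbose":true}'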

Notes:

  • Returns 404 with { "error": "model '<name>' not found" } if the model is unavailable
  • Response keys follow the Ollama snake_case shape, including modified_at, details.parent_model, and model_info
  • license, family, parameter_size, quantization_level, template, and model_info are inferred from cached files and model names
  • Verbose output is synthesized for compatibility; it is not backed by a native Ollama runtime or Modelfile

GET /api/tags

Lists available models in an Ollama-like response format.
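
Example:

curl http://127.0.0.1:8080/api/tags

Representative response: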

{
  "models": [
    {
      "name": "llama-3.2-1b-instruct-4bit",
      "model": "llama-3.2-1b-instruct-4bit",
      "modified_at": "2026-03-11T14:00:00Z",
      "size": 1234567890,
      "digest": "c180fa9df3a6b73d6f4bf2af9eaef7ca51e9167e8d1812c1fc8de12d5d00d992",
      "details": {
        "format": "gguf",
        "family": "llama",
        "families": null,
        "parameter_size": "13B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}

The details payload is compatibility metadata assembled by the server, not a direct export from Ollama. size is the cached model-directory size in bytes when the directory can be resolved; otherwise it falls back to 0. digest is a stable SHA-256 hash of the normalized model name because the local cache does not expose native Ollama digests.

Errors

OpenAI-style error envelope

/v1/completions, /v1/chat/completions, /v1/responses, and /v1/audio/* convert validation and runtime failures into:

{
  "error": {
    "message": "Unsupported voice 'robot'. Supported voices: alloy, echo, fable, onyx, nova, shimmer.",
    "type": "invalid_request_error",
    "param": "voice",
    "code": "invalid_request_error"
  }
}
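
The envelope above, for example, is what a speech request with an unsupported voice returns:

curl -X POST http://127.0.0.1:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model":"apple","input":"Hello","voice":"robot"}'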

For stream=true on /v1/completions and /v1/chat/completions, preflight validation failures still return the same JSON error envelope as a normal 4xx or 5xx response. If a failure happens after the SSE stream has started, the server emits event: error with that same error object and then terminates the stream with [DONE].

Current compatibility gaps

  • top_p, max_tokens, and max_completion_tokens may be accepted on text routes but are not enforced by the local generation engine.
  • Completion and chat usage counts are approximate.
  • Function calling is prompt-driven over the local model. It supports request-listed registered functions, but it does not implement token-level streaming tool calls or full JSON Schema validation.
  • apple-foundation-model on /v1/* currently supports plain-text generation only; image input, tool calling, and tool-call history are rejected on that backend.
  • Legacy OpenAI compatibility aliases such as /v1/chat, functions, and function_call are intentionally not supported.
  • Ollama /api/chat and /api/generate currently return only the final non-streaming JSON object. Standard Ollama usage counters are synthesized for compatibility, with approximate token counts and zero-valued duration fields.
  • Ollama /api/show and /api/tags return synthesized metadata derived from the local model cache, not from a native Ollama runtime.