Foundry inference API reference

Applies to: Foundry Local on Azure Local

This article is the platform reference for Foundry inference APIs in Foundry Local on Azure Local. It covers API surfaces by service, control plane API contracts for models and deployments, and common API patterns such as pagination and error responses.

For data-plane endpoint payloads and request examples, see Inference API endpoints and payload reference for Foundry Local on Azure Local.

For authentication architecture and authorization flow details, see Authentication and authorization in Foundry Local enabled by Azure Arc.

Important

Foundry Local is available in preview. Preview releases provide early access to features that are in active development.
Features, approaches, and processes can change or have limited capabilities before general availability (GA).

Platform overview

The Foundry Inference platform gives you a Kubernetes-native system for deploying and managing AI model inference workloads across multiple API surfaces. Each service has a specific role in the inference lifecycle.

All APIs use REST/HTTP. The platform doesn't include any gRPC (remote procedure call) endpoints. All services enforce authentication via Azure role-based access control (Azure RBAC) or API keys.

Service	Framework	Port	Purpose
inference_api	Python / FastAPI	8080	Control plane — create, read, update, and delete (CRUD) operations for Models, Deployments, API keys
predictive-server	Python / FastAPI	8000	Open Neural Network Exchange (ONNX) inference for predictive (non-generative) models
Chat Server	C# / ASP.NET Core	5000	OpenAI-compatible chat completions and audio transcription

Base URLs

Control Plane:     http://<host>:8080/api/v1
Predictive Server: http://<host>:8000
Chat Server:       http://<host>:5000

Control plane API

The control plane API is a FastAPI service that listens on port 8080 and provides management operations for Kubernetes custom resources. It's the primary interface for creating and managing model deployments. An auto-generated OpenAPI specification is available at /openapi.json, with an interactive Swagger UI at /docs.

All non-health endpoints require Azure RBAC authentication. The required action depends on the HTTP method: GET and HEAD requests require deployments/read; POST, PUT, and PATCH requests require deployments/write; DELETE requests require deployments/delete.
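
The method-to-action rule can be written as a small lookup. The following Python sketch is illustrative only: the names ACTION_BY_METHOD, HEALTH_PATHS, and required_action are hypothetical and aren't part of the platform, and the action strings are copied as written in this article.

```python
from typing import Optional

# Hypothetical lookup table built from the rule stated in this article.
ACTION_BY_METHOD = {
    "GET": "deployments/read",
    "HEAD": "deployments/read",
    "POST": "deployments/write",
    "PUT": "deployments/write",
    "PATCH": "deployments/write",
    "DELETE": "deployments/delete",
}

# Health endpoints are the only control plane paths that skip Azure RBAC.
HEALTH_PATHS = {"/healthz", "/readyz"}


def required_action(method: str, path: str) -> Optional[str]:
    """Return the action a request needs, or None for health endpoints."""
    if path in HEALTH_PATHS:
        return None
    try:
        return ACTION_BY_METHOD[method.upper()]
    except KeyError:
        raise ValueError(f"unsupported HTTP method: {method}") from None
```

A lookup like this can help when you decide which role assignment a client needs before it calls the API.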

Health endpoints

Use these endpoints to check service liveness and readiness.

Method	Path	Description
GET	/healthz	Liveness probe - always returns 200 if the process is alive
GET	/readyz	Readiness probe - verifies Kubernetes (K8s) API connectivity (503 if disconnected)

Response: GET /healthz

{ "status": "healthy" }

Response: GET /readyz

// Success (200):
{ "status": "ready", "kubernetes": "connected" }

// Failure (503):
{ "status": "not ready", "kubernetes": "disconnected", "error": "<reason>" }

Models (unified catalog and bring-your-own (BYO))

The Models API provides a unified view of all available models from multiple sources: the Foundry Local ONNX catalog, the Microsoft Foundry vLLM catalog, and user-registered (BYO/custom) models. The formerly separate /catalog endpoints are now part of this unified API.

Method	Path	Description
GET	/api/v1/models	List all models (unified catalog + custom)
GET	/api/v1/models/foundry-local/{name}	Get a Foundry Local catalog model by alias or ID
GET	/api/v1/models/foundry/{name}	Get a Microsoft Foundry catalog model by alias
GET	/api/v1/models/custom/{name}	Get a BYO custom model by Kubernetes (K8s) resource name
POST	/api/v1/models	Register a new custom (BYO) model
PUT	/api/v1/models/custom/{name}	Update a custom model (full replace)
DELETE	/api/v1/models/custom/{name}	Delete a custom model
POST	/api/v1/models/sync	Trigger a catalog sync

BYO model operations (POST, PUT, DELETE) are scoped to the foundry-local-operator namespace. No namespace path parameter is required.

List models — query parameters

Use these query parameters to filter, sort, and page model list results.

Parameter	Type	Req.	Description
name	string	No	Partial, case-insensitive match on model ID, alias, or displayName
compute	enum	No	Filter by compute type: cpu, gpu, npu
task	string	No	Exact, case-insensitive match (e.g., chat-completion)
publisher	string	No	Partial, case-insensitive match on publisher name
source	string	No	Filter by source: foundry-local, foundry, custom
limit	integer	No	Max results per page (1–100, server-clamped)
offset	integer	No	Number of items to skip for pagination (≥ 0)

List models — response fields

This response includes pagination metadata and the list of returned models.

Field	Type	Description
models	CatalogModelSummary[]	Paginated list of model summaries
total	integer	Total count after filtering (before pagination)
count	integer	Number of models returned in this response
hasMore	boolean	Whether more results exist beyond this page
limit	integer or null	The limit parameter used
offset	integer or null	The offset parameter used
unfilteredTotal	integer	Total models before any filtering applied
version	string or null	Catalog version / timestamp
lastSync	string or null	Last catalog sync timestamp (ISO 8601)

Model summary fields

Each model in the list includes the following summary fields.

Field	Type	Description
alias	string or null	Short model alias
publisher	string or null	Publisher / author
description	string or null	Model description
license	string or null	License identifier
task	string or null	Task type (e.g., chat-completion)
source	string or null	Source: foundry-local, foundry, huggingface, or custom
framework	string or null	Model framework (e.g., ONNX, Custom/PyTorch)
modelVersion	string or null	Model version string
supportedCompute	enum[] or null	List of CPU, GPU, NPU

Create BYO model — request body

You can create only custom (BYO) models through the API. The source.type value must be "custom". The catalog sync process manages catalog models.

POST /api/v1/models
Content-Type: application/json

{
  "name": "my-custom-model",
  "displayName": "My Custom Model",
  "description": "A custom ONNX model for image classification",
  "source": {
    "type": "custom",
    "custom": {
      "registry": "myacr.azurecr.io",
      "repository": "models/my-model",
      "tag": "v1.0",
      "credentials": {
        "secretRef": {
          "name": "my-registry-secret",
          "usernameKey": "username",
          "passwordKey": "password"
        }
      }
    }
  },
  "capabilities": {
    "task": "chat-completion",
    "contextLength": 4096,
    "streaming": true
  }
}

The registry field is validated for server-side request forgery (SSRF) protection. The validation rejects private, internal, and bare IP addresses with a 400 error.
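
To avoid a 400 response, a client can pre-check the registry value before it sends the request. The following Python sketch approximates the rule described above under stated assumptions: the function name registry_rejection_reason and the INTERNAL_SUFFIXES list are hypothetical, and the server's actual checks might be stricter or differ in detail.

```python
import ipaddress
from typing import Optional

# Assumed examples of internal host suffixes. The real list isn't documented.
INTERNAL_SUFFIXES = (".local", ".internal")


def registry_rejection_reason(registry: str) -> Optional[str]:
    """Return why a registry host would likely be rejected, or None if it looks fine."""
    host = registry.strip().lower()
    # Drop an optional port, for example "myacr.azurecr.io:443".
    if host.startswith("["):            # bracketed IPv6 literal such as "[::1]:5000"
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:
        host = host.rsplit(":", 1)[0]
    try:
        ipaddress.ip_address(host)
        return "bare IP addresses are not allowed"
    except ValueError:
        pass                            # not an IP literal, so keep checking
    # Assumption: single-label names and well-known internal suffixes count as internal.
    if host == "localhost" or "." not in host or host.endswith(INTERNAL_SUFFIXES):
        return "internal host names are not allowed"
    return None
```

This check doesn't resolve DNS, so a public-looking name that resolves to a private address still depends on the server-side validation.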

Trigger catalog sync

Use this endpoint to start a manual catalog synchronization cycle.

POST /api/v1/models/sync

// Response (200):
{
  "status": "triggered",
  "message": "Catalog sync requested",
  "syncedAt": "2024-01-15T10:30:00Z"
}

Deployments

The Deployments API manages ModelDeployment resources, which are defined by a custom resource definition (CRD) and represent running inference workloads. Each deployment creates a Kubernetes Deployment, a Service, and optionally an Ingress. The API injects an nginx Transport Layer Security (TLS) sidecar for secure communication and enforces authentication at the application layer.

Method	Path	Description
GET	/api/v1/deployments	List all deployments across all namespaces
GET	/api/v1/namespaces/{ns}/deployments	List deployments in a specific namespace
GET	/api/v1/namespaces/{ns}/deployments/{name}	Get a specific deployment with full spec and status
POST	/api/v1/namespaces/{ns}/deployments	Create a new deployment
PUT	/api/v1/namespaces/{ns}/deployments/{name}	Full-replace update of a deployment spec
PATCH	/api/v1/namespaces/{ns}/deployments/{name}	Partial update (replicas, env, resources, endpoint)
DELETE	/api/v1/namespaces/{ns}/deployments/{name}	Delete a deployment and its child K8s resources

Create deployment — request body

Use the following fields to define a new deployment request.

Field	Type	Req.	Description
name	string	Yes	Unique name (1–63 chars, DNS label format)
spec.model	ModelRef	Yes	Model reference (one of: ref, catalog, or custom)
spec.compute	enum	Yes	Compute type: "cpu" or "gpu"
spec.workloadType	enum	No	Workload type: "generative" (default) or "predictive"
spec.replicas	integer	No	Pod replica count, 1–100 (default: 1)
spec.port	integer	No	Container port, 1024–65535 (default: 5000)
spec.displayName	string	No	Human-readable name (max 256 chars)
spec.env	EnvVar[]	No	Extra environment variables [{name, value}]
spec.resources	object	No	CPU, memory, and GPU requests and limits
spec.nodeSelector	object	No	K8s node selector key-value pairs
spec.tolerations	Toleration[]	No	Pod scheduling tolerations
spec.endpoint	EndpointConfig	No	Ingress configuration (host, path, TLS)
spec.authentication	AuthConfig	No	API key authentication configuration
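
The name constraint can be checked on the client before you send a request. This Python sketch assumes that "DNS label format" means the RFC 1123 label rule that Kubernetes applies to resource names (lowercase letters, digits, and hyphens, starting and ending with a letter or digit). The function name is hypothetical.

```python
import re

# RFC 1123 label: lowercase alphanumerics and "-", alphanumeric at both ends.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_deployment_name(name: str) -> bool:
    """Check the 1-63 character DNS label rule for a deployment name."""
    return 1 <= len(name) <= 63 and _DNS_LABEL.fullmatch(name) is not None
```

For example, is_valid_deployment_name("phi4-mini-deploy") returns True, while names with uppercase letters or a leading hyphen return False.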

Model reference types

The spec.model field accepts exactly one of the following reference types:

// Reference an existing Model CRD in the same namespace
{ "ref": "my-model-name" }

// Inline catalog model reference
{ "catalog": { "name": "phi-4-mini", "version": "latest" } }

// Inline custom (BYO) model reference
{ "custom": {
    "registry": "myacr.azurecr.io",
    "repository": "models/my-model",
    "tag": "v1.0",
    "credentials": { "secretRef": { "name": "secret-name" } }
  }
}
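
The "exactly one" rule is easy to verify before you submit a deployment. The following Python sketch uses a hypothetical helper name, model_ref_kind, and treats a key set to null as absent, which is an assumption about how the API interprets it.

```python
MODEL_REF_KEYS = ("ref", "catalog", "custom")


def model_ref_kind(model: dict) -> str:
    """Return which reference type a spec.model object uses.

    Raises ValueError unless exactly one of ref, catalog, or custom is set.
    """
    present = [key for key in MODEL_REF_KEYS if model.get(key) is not None]
    if len(present) != 1:
        found = ", ".join(present) or "none"
        raise ValueError(
            f"spec.model must set exactly one of {MODEL_REF_KEYS}; found: {found}"
        )
    return present[0]
```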

Resource requirements

Use this structure to set CPU, memory, and GPU requests and limits.

"resources": {
  "requests": { "cpu": "100m", "memory": "256Mi" },
  "limits": { "cpu": "1000m", "memory": "1Gi", "gpu": 1 }
}

Note

When compute is "gpu" and skipGpuResource is false, resources.limits.gpu is required (1–8).
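
The following Python sketch expresses this rule as a client-side check. It's a sketch under assumptions: this article doesn't state where skipGpuResource is set, so the sketch takes it as a plain Boolean argument, and the function name check_gpu_limit is hypothetical.

```python
from typing import Optional


def check_gpu_limit(
    compute: str,
    resources: Optional[dict],
    skip_gpu_resource: bool = False,
) -> Optional[str]:
    """Return an error message if the GPU limit rule is violated, else None."""
    if compute != "gpu" or skip_gpu_resource:
        return None  # the rule applies only to GPU deployments that request a GPU
    gpu = ((resources or {}).get("limits") or {}).get("gpu")
    if gpu is None:
        return 'resources.limits.gpu is required when compute is "gpu"'
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(gpu, bool) or not isinstance(gpu, int) or not 1 <= gpu <= 8:
        return "resources.limits.gpu must be an integer from 1 to 8"
    return None
```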

Create deployment — example request

This example shows a complete deployment request payload.

POST /api/v1/namespaces/default/deployments
Content-Type: application/json

{
  "name": "phi4-mini-deploy",
  "spec": {
    "model": { "catalog": { "name": "phi-4-mini", "version": "latest" } },
    "compute": "cpu",
    "workloadType": "generative",
    "replicas": 2,
    "resources": {
      "requests": { "cpu": "2000m", "memory": "4Gi" },
      "limits": { "cpu": "4000m", "memory": "8Gi" }
    },
    "authentication": { "enabled": true }
  }
}

Deployment status fields

These fields describe deployment state, readiness, and resolved endpoints.

Field	Type	Description
state	enum or null	Pending, Creating, Running, Updating, Error, Terminating
message	string or null	Human-readable status message
readyReplicas	integer or null	Number of pods in ready state
deploymentReady	boolean or null	Whether all requested replicas are ready
serviceReady	boolean or null	Whether the K8s Service is created
internalEndpoint	string or null	Internal cluster URL for the deployment
externalEndpoint	string or null	External URL (when Ingress is configured)
resolvedModel	object or null	Resolved model info: {name, variant, image}
authentication	object or null	Auth status: {keysSecretName, key rotation timestamps}
conditions	Condition[]	K8s-style conditions array with type/status/reason/message

Partial update (PATCH)

The PATCH endpoint accepts a subset of fields for quick updates without replacing the entire spec. Only replicas, env, resources, and endpoint are patchable. Authentication and model aren't patchable.

PATCH /api/v1/namespaces/default/deployments/phi4-mini-deploy
Content-Type: application/json

{
  "replicas": 3,
  "resources": {
    "limits": { "cpu": "8000m", "memory": "16Gi" }
  }
}
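
A client can guard against sending fields that aren't patchable. The following Python sketch is one way to do that. The helper name build_patch_body is hypothetical, and this article doesn't say whether the server rejects or ignores other fields, so the sketch fails early on the client side.

```python
# The four fields this article lists as patchable.
PATCHABLE_FIELDS = frozenset({"replicas", "env", "resources", "endpoint"})


def build_patch_body(changes: dict) -> dict:
    """Return a PATCH body, refusing fields that need a full PUT update."""
    rejected = sorted(set(changes) - PATCHABLE_FIELDS)
    if rejected:
        raise ValueError(f"not patchable, use PUT instead: {', '.join(rejected)}")
    return dict(changes)
```

Fields such as authentication and model cause a ValueError here, which signals that the change requires the full-replace PUT endpoint.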

API keys

Each deployment with authentication enabled has a primary and secondary API key, stored as a Kubernetes Secret. The system auto-generates keys when the deployment becomes Ready.

Method	Path	Description
GET	/api/v1/namespaces/{ns}/deployments/{name}/keys	Get primary and secondary API keys
POST	/api/v1/namespaces/{ns}/deployments/{name}/keys/{key_type}/rotate	Rotate a key (key_type: primary or secondary)

Get keys — response

This response returns the active primary and secondary API keys for a deployment.

{
  "deploymentName": "phi4-mini-deploy",
  "namespace": "default",
  "primaryKey": {
    "value": "fndry-pk-a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "createdAt": "2024-01-15T10:00:00Z"
  },
  "secondaryKey": {
    "value": "fndry-sk-f1e2d3c4-b5a6-0987-dcba-1234567890ef",
    "createdAt": "2024-01-15T10:00:00Z"
  }
}

Key format

Generated API keys follow these formats.

Primary keys:   fndry-pk-{uuid4} (generated by operator on initial deployment)
Secondary keys: fndry-sk-{uuid4} (generated by operator on initial deployment)

Keys rotated through the API rotate endpoint use secrets.token_hex(32), which produces a 64-character hex string without the fndry-pk- or fndry-sk- prefix.
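
Because the two formats differ, client code shouldn't assume that every key carries a prefix. The following Python sketch reproduces both shapes and classifies a key by its format. The function names are hypothetical, and the sketch illustrates the formats only, not the operator's actual code.

```python
import re
import secrets
import uuid

_UUID = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
_PREFIXES = {"primary": "fndry-pk-", "secondary": "fndry-sk-"}


def initial_key(key_type: str) -> str:
    """Build a key shaped like the ones generated on initial deployment."""
    return _PREFIXES[key_type] + str(uuid.uuid4())


def rotated_key() -> str:
    """Build a key shaped like the ones returned by the rotate endpoint."""
    return secrets.token_hex(32)  # 32 random bytes, rendered as 64 hex characters


def classify_key(value: str) -> str:
    """Classify a key by its format alone."""
    for key_type, prefix in _PREFIXES.items():
        if re.fullmatch(re.escape(prefix) + _UUID, value):
            return f"initial-{key_type}"
    if re.fullmatch(r"[0-9a-f]{64}", value):
        return "rotated"
    return "unknown"
```

After a rotation, the key value alone no longer tells you whether it's the primary or the secondary key, so track the key type separately.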

Rotate key — example

This example rotates one key and returns the new key value and timestamp.

POST /api/v1/namespaces/default/deployments/phi4-mini-deploy/keys/primary/rotate

// Response:
{
  "deploymentName": "phi4-mini-deploy",
  "namespace": "default",
  "keyType": "primary",
  "key": {
    "value": "a7f3e2b1c9d8......",
    "createdAt": "2024-01-20T14:30:00Z"
  }
}

The deployment must have authentication enabled. If you request keys for a deployment with authentication disabled, the API returns 400.

InferenceServices (legacy)

InferenceServices is the older CRD design. The recommended approach is to use Models and ModelDeployments instead. Both approaches remain active in the codebase.

Method	Path	Description
GET	/api/v1/inferenceservices	List all InferenceServices (all namespaces)
GET	/api/v1/namespaces/{ns}/inferenceservices	List InferenceServices in a namespace
GET	/api/v1/namespaces/{ns}/inferenceservices/{name}	Get a specific InferenceService
POST	/api/v1/namespaces/{ns}/inferenceservices	Create an InferenceService
PUT	/api/v1/namespaces/{ns}/inferenceservices/{name}	Full-replace update
PATCH	/api/v1/namespaces/{ns}/inferenceservices/{name}	Partial update
DELETE	/api/v1/namespaces/{ns}/inferenceservices/{name}	Delete

Key differences from deployments

This table shows how InferenceService fields map to ModelDeployment fields.

Field	InferenceService	ModelDeployment
Workload type field	inferenceType	spec.workloadType
Compute field	hardware	spec.compute
Model source field	modelSource.foundry / modelSource.byo	spec.model.ref / catalog / custom
Ingress field	ingress	spec.endpoint

Data-plane API surfaces

This section lists the data-plane endpoints by service surface. For request and response schema details, payload examples, and client samples, see Inference API endpoints and payload reference for Foundry Local on Azure Local.

Predictive server (port 8000)

These endpoints support predictive inference workloads and model status checks.

Method	Path	Description
GET	/health	Liveness probe
GET	/ready	Readiness probe
GET	/v1/model	Model metadata endpoint
POST	/v1/predict	Predictive inference endpoint

Chat server (port 5000)

These endpoints support chat completions, transcription, and model listing.

Method	Path	Description
POST	/v1/chat/completions	OpenAI-compatible generative inference
POST	/v1/audio/transcriptions	OpenAI-compatible transcription
GET	/v1/models	OpenAI-compatible model listing

Authentication and authorization summary

The application layer enforces authentication, and the nginx sidecar provides TLS termination. Data-plane requests support API key and Microsoft Entra ID JSON Web Token (JWT) credential modes based on deployment configuration.

For detailed architecture, validation flow, and authorization behavior, see Authentication and authorization in Foundry Local enabled by Azure Arc.

Common patterns

Use these patterns for consistent pagination, error handling, and API discovery.

Pagination

These pagination patterns apply to list endpoints across the API surface.

Cursor pagination (deployments, InferenceServices)

These endpoints use Kubernetes-native cursor pagination. Pass the continueToken from the response as the continue query parameter in the next request.

GET /api/v1/deployments?limit=10
// Response includes: "continueToken": "eyJjb250aW51ZS..."

GET /api/v1/deployments?limit=10&continue=eyJjb250aW51ZS...
// Next page; continueToken: null when no more pages
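
The following Python sketch shows one way to walk every page. It leaves the HTTP call to a caller-supplied function so that the loop stays independent of any client library. The names deployments_url and iter_deployments are hypothetical, and the "items" key for the list payload is an assumption because this article doesn't show the full list response.

```python
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode


def deployments_url(base_url: str, limit: int, token: Optional[str] = None) -> str:
    """Build the list URL, adding the continue parameter only when a token exists."""
    params = {"limit": limit}
    if token:
        params["continue"] = token
    return f"{base_url}/deployments?{urlencode(params)}"


def iter_deployments(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield deployments from every page.

    fetch_page receives the current continue token (None for the first page)
    and returns the parsed JSON response as a dict.
    """
    token = None
    while True:
        page = fetch_page(token)
        yield from page.get("items", [])   # "items" is an assumed field name
        token = page.get("continueToken")
        if not token:                      # null means there are no more pages
            return
```

In practice, fetch_page would send an authenticated GET request to the URL that deployments_url returns and decode the JSON body.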

Offset pagination (models)

The unified models list uses offset-based pagination with limit (1–100) and offset parameters.

GET /api/v1/models?limit=20&offset=0
// Response: { "total": 45, "count": 20, "hasMore": true, ... }

GET /api/v1/models?limit=20&offset=20
// Response: { "total": 45, "count": 20, "hasMore": true, ... }

GET /api/v1/models?limit=20&offset=40
// Response: { "total": 45, "count": 5, "hasMore": false, ... }
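
The count and hasMore values in the preceding responses follow from simple arithmetic on total, limit, and offset. The following Python sketch reproduces that arithmetic. The function name page_metadata is hypothetical, and the sketch mirrors the documented parameter rules rather than the server's actual code.

```python
def page_metadata(total: int, limit: int, offset: int) -> dict:
    """Compute the pagination fields for one page of a filtered result set."""
    if offset < 0:
        raise ValueError("offset must be >= 0")
    limit = max(1, min(100, limit))               # limit is clamped to 1-100
    count = max(0, min(limit, total - offset))    # items actually on this page
    return {
        "total": total,
        "count": count,
        "hasMore": offset + count < total,
        "limit": limit,
        "offset": offset,
    }
```

With 45 filtered models and a limit of 20, offsets 0 and 20 both return 20 items with hasMore set to true, and offset 40 returns the last 5 items with hasMore set to false.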

Error responses

The following sections describe standard error payloads by API surface.

Control plane API format

Control plane errors return a structured envelope with field-level validation details.

{
  "error": "ValidationError",
  "message": "Request validation failed",
  "details": {
    "errors": [
      { "field": "spec.compute", "message": "value is not a valid enumeration member" }
    ]
  }
}

Error types

Use these error types and status codes to diagnose failed control plane requests.

Error Type	HTTP	Description
NotFound	404	Requested K8s resource doesn't exist
Conflict	409	Resource with the same name already exists
ValidationError	400	Request body validation failed (details.errors has field-level messages)
AuthenticationDisabled	400	API keys requested for a deployment with auth disabled
InternalError	500	Unexpected server error or K8s API failure
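
When you log or display a control plane error, it helps to flatten the envelope into readable text. The following Python sketch does that for the envelope shape shown earlier. The helper name summarize_error is hypothetical, and the sketch tolerates a missing details object because not every error type is documented as including one.

```python
def summarize_error(status: int, body: dict) -> str:
    """Flatten a control plane error envelope into a multi-line message."""
    kind = body.get("error", "UnknownError")
    message = body.get("message", "")
    lines = [f"{status} {kind}: {message}"]
    # Field-level messages appear under details.errors for validation failures.
    for item in (body.get("details") or {}).get("errors", []):
        lines.append(f"  {item.get('field')}: {item.get('message')}")
    return "\n".join(lines)
```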

Chat server error format (OpenAI-compatible)

Chat server errors follow the OpenAI-compatible error shape.

{
  "error": {
    "message": "Request failed.",
    "type": "server_error",
    "code": "internal_error"
  }
}

Predictive server error format

Predictive server errors return either a standard detail message or a queue-capacity payload.

// Standard errors:
{ "detail": "Model not loaded yet" }

// Queue full (includes Retry-After header):
{ "error": "Queue full: ...", "queue_depth": 100, "retry_after": 5 }
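
A client can use the queue-full payload to decide how long to wait before it retries. The following Python sketch picks the delay from the Retry-After header first and from the retry_after body field second. The function name retry_delay_seconds and the one-second default are assumptions, and the sketch handles only numeric values.

```python
def retry_delay_seconds(headers: dict, body: dict, default: float = 1.0) -> float:
    """Choose a wait time in seconds before retrying a predictive request."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    header_value = {k.lower(): v for k, v in headers.items()}.get("retry-after")
    for candidate in (header_value, body.get("retry_after")):
        try:
            delay = float(candidate)
        except (TypeError, ValueError):
            continue  # missing or non-numeric, so try the next source
        if delay >= 0:
            return delay
    return default
```

Standard errors that carry only a detail message fall through to the default delay, so the caller still has to decide whether a retry makes sense for that error.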

OpenAPI and Swagger

Use these endpoints to inspect API schemas and test endpoints interactively.

Service	Swagger UI	OpenAPI JSON
Control Plane API	http://<host>:8080/docs	http://<host>:8080/openapi.json
Predictive Server	http://<host>:8000/docs	http://<host>:8000/openapi.json
Chat Server	Not available	Not available

Last updated on 2026-05-04