Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
This article is the platform reference for Foundry inference APIs in Foundry Local on Azure Local. It covers API surfaces by service, control plane API contracts for models and deployments, and common API patterns such as pagination and error responses.
For data-plane endpoint payloads and request examples, see Inference API endpoints and payload reference for Foundry Local on Azure Local.
For authentication architecture and authorization flow details, see Authentication and authorization in Foundry Local enabled by Azure Arc.
Important
- Foundry Local is available in preview. Preview releases provide early access to features that are in active deployment.
- Features, approaches, and processes can change or have limited capabilities before general availability (GA).
Platform overview
The Foundry Inference platform gives you a Kubernetes-native system for deploying and managing AI model inference workloads across multiple API surfaces. Each service has a specific role in the inference lifecycle.
All APIs use REST/HTTP. The platform doesn't include any gRPC (remote procedure call) endpoints. All services enforce authentication via Azure role-based access control (Azure RBAC) or API keys.
| Service | Framework | Port | Purpose |
|---|---|---|---|
| inference_api | Python / FastAPI | 8080 | Control plane — create, read, update, and delete (CRUD) operations for Models, Deployments, API keys |
| predictive-server | Python / FastAPI | 8000 | Open Neural Network Exchange (ONNX) inference for predictive (non-generative) models |
| Chat Server | C# / ASP.NET Core | 5000 | OpenAI-compatible chat completions and audio transcription |
Base URLs
Control Plane: http://<host>:8080/api/v1
Predictive Server: http://<host>:8000
Chat Server: http://<host>:5000
Control plane API
The control plane API runs on port 8080 by using FastAPI and provides management operations for Kubernetes custom resources. It serves as the primary interface for creating and managing model deployments. An auto-generated OpenAPI specification is available at /openapi.json with an interactive Swagger UI at /docs.
All non-health endpoints require Azure RBAC authentication. GET and HEAD requests require the deployments/read action, POST, PUT, and PATCH requests require deployments/write, and DELETE requests require deployments/delete.
Health endpoints
Use these endpoints to check service liveness and readiness.
| Method | Path | Description |
|---|---|---|
| GET | /healthz | Liveness probe - always returns 200 if the process is alive |
| GET | /readyz | Readiness probe - verifies Kubernetes (K8s) API connectivity (503 if disconnected) |
Response: GET /healthz
{ "status": "healthy" }
Response: GET /readyz
// Success (200):
{ "status": "ready", "kubernetes": "connected" }
// Failure (503):
{ "status": "not ready", "kubernetes": "disconnected", "error": "<reason>" }
Models (unified catalog and bring-your-own (BYO))
The Models API provides a unified view of all available models from multiple sources: the Foundry Local Open Neural Network Exchange (ONNX) catalog, the Microsoft Foundry vLLM catalog, and user-registered (BYO/custom) models. The old separate /catalog endpoints are now part of this unified API.
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/models | List all models (unified catalog + custom) |
| GET | /api/v1/models/foundry-local/{name} | Get a Foundry Local catalog model by alias or ID |
| GET | /api/v1/models/foundry/{name} | Get a Microsoft Foundry catalog model by alias |
| GET | /api/v1/models/custom/{name} | Get a BYO custom model by Kubernetes (K8s) resource name |
| POST | /api/v1/models | Register a new custom (BYO) model |
| PUT | /api/v1/models/custom/{name} | Update a custom model (full replace) |
| DELETE | /api/v1/models/custom/{name} | Delete a custom model |
| POST | /api/v1/models/sync | Trigger a catalog sync |
BYO model operations (POST, PUT, DELETE) are scoped to the foundry-local-operator namespace. No namespace path parameter is required.
List models — query parameters
Use these query parameters to filter, sort, and page model list results.
| Parameter | Type | Req. | Description |
|---|---|---|---|
| name | string | No | Partial, case-insensitive match on model ID, alias, or displayName |
| compute | enum | No | Filter by compute type: cpu, gpu, npu |
| task | string | No | Exact, case-insensitive match (e.g., chat-completion) |
| publisher | string | No | Partial, case-insensitive match on publisher name |
| source | string | No | Filter by source: foundry-local, foundry, custom |
| limit | integer | No | Max results per page (1–100, server-clamped) |
| offset | integer | No | Number of items to skip for pagination (≥ 0) |
List models — response fields
This response includes pagination metadata and the list of returned models.
| Field | Type | Description |
|---|---|---|
| models | CatalogModelSummary[] | Paginated list of model summaries |
| total | integer | Total count after filtering (before pagination) |
| count | integer | Number of models returned in this response |
| hasMore | boolean | Whether more results exist beyond this page |
| limit | integer or null | The limit parameter used |
| offset | integer or null | The offset parameter used |
| unfilteredTotal | integer | Total models before any filtering applied |
| version | string or null | Catalog version / timestamp |
| lastSync | string or null | Last catalog sync timestamp (ISO 8601) |
Model summary fields
Each model in the list includes the following summary fields.
| Field | Type | Description |
|---|---|---|
| alias | string or null | Short model alias |
| publisher | string or null | Publisher / author |
| description | string or null | Model description |
| license | string or null | License identifier |
| task | string or null | Task type (e.g., chat-completion) |
| source | string or null | Source: foundry-local, foundry, huggingface, or custom |
| framework | string or null | Model framework (e.g., ONNX, Custom/PyTorch) |
| modelVersion | string or null | Model version string |
| supportedCompute | enum[] or null | List of CPU, GPU, NPU |
Create BYO model — request body
You can create only custom (BYO) models through the API. The source.type value must be "custom". The catalog sync process manages catalog models.
POST /api/v1/models
Content-Type: application/json
{
"name": "my-custom-model",
"displayName": "My Custom Model",
"description": "A custom ONNX model for image classification",
"source": {
"type": "custom",
"custom": {
"registry": "myacr.azurecr.io",
"repository": "models/my-model",
"tag": "v1.0",
"credentials": {
"secretRef": {
"name": "my-registry-secret",
"usernameKey": "username",
"passwordKey": "password"
}
}
}
},
"capabilities": {
"task": "chat-completion",
"contextLength": 4096,
"streaming": true
}
}
The registry field is validated for server-side request forgery (SSRF) protection. The validation rejects private, internal, and bare IP addresses with a 400 error.
Trigger catalog sync
Use this endpoint to start a manual catalog synchronization cycle.
POST /api/v1/models/sync
// Response (200):
{
"status": "triggered",
"message": "Catalog sync requested",
"syncedAt": "2024-01-15T10:30:00Z"
}
Deployments
The Deployments API manages ModelDeployment custom resource definitions (CRDs), which represent running inference workloads. Each deployment creates a Kubernetes Deployment, Service, and optionally an Ingress. The API injects an nginx transport layer security (TLS) sidecar for secure communication, and it enforces authentication at the application layer.
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/deployments | List all deployments across all namespaces |
| GET | /api/v1/namespaces/{ns}/deployments | List deployments in a specific namespace |
| GET | /api/v1/namespaces/{ns}/deployments/{name} | Get a specific deployment with full spec and status |
| POST | /api/v1/namespaces/{ns}/deployments | Create a new deployment |
| PUT | /api/v1/namespaces/{ns}/deployments/{name} | Full-replace update of a deployment spec |
| PATCH | /api/v1/namespaces/{ns}/deployments/{name} | Partial update (replicas, env, resources, endpoint) |
| DELETE | /api/v1/namespaces/{ns}/deployments/{name} | Delete a deployment and its child K8s resources |
Create deployment — request body
Use the following fields to define a new deployment request.
| Field | Type | Req. | Description |
|---|---|---|---|
| name | string | Yes | Unique name (1–63 chars, DNS label format) |
| spec.model | ModelRef | Yes | Model reference (one of: ref, catalog, or custom) |
| spec.compute | enum | Yes | Compute type: "cpu" or "gpu" |
| spec.workloadType | enum | No | Workload type: "generative" (default) or "predictive" |
| spec.replicas | integer | No | Pod replica count, 1–100 (default: 1) |
| spec.port | integer | No | Container port, 1024–65535 (default: 5000) |
| spec.displayName | string | No | Human-readable name (max 256 chars) |
| spec.env | EnvVar[] | No | Extra environment variables [{name, value}] |
| spec.resources | object | No | CPU, memory, and GPU requests and limits |
| spec.nodeSelector | object | No | K8s node selector key-value pairs |
| spec.tolerations | Toleration[] | No | Pod scheduling tolerations |
| spec.endpoint | EndpointConfig | No | Ingress configuration (host, path, TLS) |
| spec.authentication | AuthConfig | No | API key authentication configuration |
Model reference types
The spec.model field accepts exactly one of the following reference types:
// Reference an existing Model CRD in the same namespace
{ "ref": "my-model-name" }
// Inline catalog model reference
{ "catalog": { "name": "phi-4-mini", "version": "latest" } }
// Inline custom (BYO) model reference
{ "custom": {
"registry": "myacr.azurecr.io",
"repository": "models/my-model",
"tag": "v1.0",
"credentials": { "secretRef": { "name": "secret-name" } }
}
}
Resource requirements
Use this structure to set CPU, memory, and GPU requests and limits.
"resources": {
"requests": { "cpu": "100m", "memory": "256Mi" },
"limits": { "cpu": "1000m", "memory": "1Gi", "gpu": 1 }
}
Note
When compute is "gpu" and skipGpuResource is false, resources.limits.gpu is required (1–8).
Create deployment — example request
This example shows a complete deployment request payload.
POST /api/v1/namespaces/default/deployments
Content-Type: application/json
{
"name": "phi4-mini-deploy",
"spec": {
"model": { "catalog": { "name": "phi-4-mini", "version": "latest" } },
"compute": "cpu",
"workloadType": "generative",
"replicas": 2,
"resources": {
"requests": { "cpu": "2000m", "memory": "4Gi" },
"limits": { "cpu": "4000m", "memory": "8Gi" }
},
"authentication": { "enabled": true }
}
}
Deployment status fields
These fields describe deployment state, readiness, and resolved endpoints.
| Field | Type | Description |
|---|---|---|
| state | enum or null | Pending, Creating, Running, Updating, Error, Terminating |
| message | string or null | Human-readable status message |
| readyReplicas | integer or null | Number of pods in ready state |
| deploymentReady | boolean or null | Whether all requested replicas are ready |
| serviceReady | boolean or null | Whether the K8s Service is created |
| internalEndpoint | string or null | Internal cluster URL for the deployment |
| externalEndpoint | string or null | External URL (when Ingress is configured) |
| resolvedModel | object or null | Resolved model info: {name, variant, image} |
| authentication | object or null | Auth status: {keysSecretName, key rotation timestamps} |
| conditions | Condition[] | K8s-style conditions array with type/status/reason/message |
Partial update (PATCH)
The PATCH endpoint accepts a subset of fields for quick updates without replacing the entire spec. Only replicas, env, resources, and endpoint are patchable. Authentication and model aren't patchable.
PATCH /api/v1/namespaces/default/deployments/phi4-mini-deploy
Content-Type: application/json
{
"replicas": 3,
"resources": {
"limits": { "cpu": "8000m", "memory": "16Gi" }
}
}
API keys
Each deployment with authentication enabled has a primary and secondary API key, stored as a Kubernetes Secret. The system auto-generates keys when the deployment becomes Ready.
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/namespaces/{ns}/deployments/{name}/keys | Get primary and secondary API keys |
| POST | .../{name}/keys/{key_type}/rotate | Rotate a key (key_type: primary or secondary) |
Get keys — response
This response returns the active primary and secondary API keys for a deployment.
{
"deploymentName": "phi4-mini-deploy",
"namespace": "default",
"primaryKey": {
"value": "fndry-pk-a1b2c3d4-e5f6-7890-abcd-ef1234567890",
"createdAt": "2024-01-15T10:00:00Z"
},
"secondaryKey": {
"value": "fndry-sk-f1e2d3c4-b5a6-0987-dcba-1234567890ef",
"createdAt": "2024-01-15T10:00:00Z"
}
}
Key format
Generated API keys follow these formats.
Primary keys: fndry-pk-{uuid4} (generated by operator on initial deployment)
Secondary keys: fndry-sk-{uuid4} (generated by operator on initial deployment)
Keys rotated through the API rotate endpoint use secrets.token_hex(32), which produces a 64-character hex string without the fndry-pk- or fndry-sk- prefix.
Rotate key — example
This example rotates one key and returns the new key value and timestamp.
POST /api/v1/namespaces/default/deployments/phi4-mini-deploy/keys/primary/rotate
// Response:
{
"deploymentName": "phi4-mini-deploy",
"namespace": "default",
"keyType": "primary",
"key": {
"value": "a7f3e2b1c9d8......",
"createdAt": "2024-01-20T14:30:00Z"
}
}
The deployment must have authentication enabled. If you request keys for a deployment with authentication disabled, the API returns 400.
InferenceServices (legacy)
InferenceServices is the older CRD design. The recommended approach is to use Models + ModelDeployments. Both approaches remain active in the codebase.
| Method | Path | Description |
|---|---|---|
| GET | /api/v1/inferenceservices | List all InferenceServices (all namespaces) |
| GET | /api/v1/namespaces/{ns}/inferenceservices | List InferenceServices in a namespace |
| GET | /api/v1/namespaces/{ns}/inferenceservices/{name} | Get a specific InferenceService |
| POST | /api/v1/namespaces/{ns}/inferenceservices | Create an InferenceService |
| PUT | /api/v1/namespaces/{ns}/inferenceservices/{name} | Full-replace update |
| PATCH | /api/v1/namespaces/{ns}/inferenceservices/{name} | Partial update |
| DELETE | /api/v1/namespaces/{ns}/inferenceservices/{name} | Delete |
Key differences from deployments
This table shows how InferenceServices fields map to ModelDeployment fields.
| Field | InferenceService | ModelDeployment |
|---|---|---|
| Workload type field | inferenceType | spec.workloadType |
| Compute field | hardware | spec.compute |
| Model source field | modelSource.foundry / modelSource.byo | spec.model.ref / catalog / custom |
| Ingress field | ingress | spec.endpoint |
Data-plane API surfaces
This section lists the data-plane endpoints by service surface. For request and response schema details, payload examples, and client samples, see Inference API endpoints and payload reference for Foundry Local on Azure Local.
Predictive server (port 8000)
These endpoints support predictive inference workloads and model status checks.
| Method | Path | Description |
|---|---|---|
| GET | /health | Liveness probe |
| GET | /ready | Readiness probe |
| GET | /v1/model | Model metadata endpoint |
| POST | /v1/predict | Predictive inference endpoint |
Chat server (port 5000)
These endpoints support chat completions, transcription, and model listing.
| Method | Path | Description |
|---|---|---|
| POST | /v1/chat/completions | OpenAI-compatible generative inference |
| POST | /v1/audio/transcriptions | OpenAI-compatible transcription |
| GET | /v1/models | OpenAI-compatible model listing |
Authentication and authorization summary
The application layer enforces authentication, and the nginx sidecar provides TLS termination. Data-plane requests support API key and Microsoft Entra ID JSON Web Token (JWT) credential modes based on deployment configuration.
For detailed architecture, validation flow, and authorization behavior, see Authentication and authorization in Foundry Local enabled by Azure Arc.
Common patterns
Use these patterns for consistent pagination, error handling, and API discovery.
Pagination
These pagination patterns apply to list endpoints across the API surface.
Cursor pagination (deployments, InferenceServices)
These endpoints use Kubernetes-native cursor pagination. Pass the continueToken from the response as the continue query parameter in the next request.
GET /api/v1/deployments?limit=10
// Response includes: "continueToken": "eyJjb250aW51ZS..."
GET /api/v1/deployments?limit=10&continue=eyJjb250aW51ZS...
// Next page; continueToken: null when no more pages
Offset pagination (models)
The unified models list uses offset-based pagination with limit (1–100) and offset parameters.
GET /api/v1/models?limit=20&offset=0
// Response: { "total": 45, "count": 20, "hasMore": true, ... }
GET /api/v1/models?limit=20&offset=20
// Response: { "total": 45, "count": 20, "hasMore": true, ... }
GET /api/v1/models?limit=20&offset=40
// Response: { "total": 45, "count": 5, "hasMore": false, ... }
Error responses
The following sections describe standard error payloads by API surface.
Control plane API format
Control plane errors return a structured envelope with field-level validation details.
{
"error": "ValidationError",
"message": "Request validation failed",
"details": {
"errors": [
{ "field": "spec.compute", "message": "value is not a valid enumeration member" }
]
}
}
Error types
Use these error types and status codes to diagnose failed control plane requests.
| Error Type | HTTP | Description |
|---|---|---|
| NotFound | 404 | Requested K8s resource doesn't exist |
| Conflict | 409 | Resource with the same name already exists |
| ValidationError | 400 | Request body validation failed (details.errors has field-level messages) |
| AuthenticationDisabled | 400 | API keys requested for a deployment with auth disabled |
| InternalError | 500 | Unexpected server error or K8s API failure |
Chat server error format (OpenAI-compatible)
Chat server errors follow the OpenAI-compatible error shape.
{
"error": {
"message": "Request failed.",
"type": "server_error",
"code": "internal_error"
}
}
Predictive server error format
Predictive server errors return either a standard detail message or a queue-capacity payload.
// Standard errors:
{ "detail": "Model not loaded yet" }
// Queue full (includes Retry-After header):
{ "error": "Queue full: ...", "queue_depth": 100, "retry_after": 5 }
OpenAPI and Swagger
Use these endpoints to inspect API schemas and test endpoints interactively.
| Service | Swagger UI | OpenAPI JSON |
|---|---|---|
| Control Plane API | http://<host>:8080/docs | http://<host>:8080/openapi.json |
| Predictive Server | http://<host>:8000/docs | http://<host>:8000/openapi.json |
| Chat Server | Not available | Not available |