Managed Inference

CosmicAC Managed Inference lets you run open-source language models without deploying or managing serving infrastructure. You create an API key, configure the CLI, and send requests.


What Managed Means

Running a model for inference involves more than the model itself. You need infrastructure to handle requests, authenticate callers, balance traffic, and scale as demand grows.

CosmicAC Managed Inference provides all of this as a platform service. The vLLM processes that serve the models, the servers they run on, and the service discovery mechanism that connects requests to available workers all run independently of your code. You interact only through the CLI.


The Proxy Layer

Every inference request goes through the inference proxy before reaching the model. The proxy handles authentication, service discovery, and load balancing.

Authentication. The proxy verifies your API key before forwarding the request. It rejects requests without a valid key before they reach the model.

Service discovery. When a new inference worker starts up, it registers itself with the distributed hash table (DHT). The proxy queries the DHT to discover which workers are reachable and routes requests accordingly.

Load balancing. When multiple inference workers are available, the proxy distributes requests across them.
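The proxy's three responsibilities can be sketched as a small routing function. This is an illustrative model, not the actual CosmicAC implementation: the names (VALID_KEYS, register_worker, route), the round-robin strategy, and the use of a plain dict to stand in for the DHT are all assumptions.

```python
# A minimal sketch of the proxy layer: authenticate, discover, load-balance.
# Everything here is hypothetical; the real proxy's internals are not public.
VALID_KEYS = {"sk-example"}   # keys the proxy accepts (placeholder)
dht = {}                      # worker id -> address; stands in for the DHT
_next = 0                     # round-robin cursor

def register_worker(worker_id, address):
    """Service discovery: a new inference worker announces itself."""
    dht[worker_id] = address

def route(api_key):
    """Authenticate the caller, then pick a reachable worker."""
    global _next
    if api_key not in VALID_KEYS:
        # Rejected at the proxy, before the request reaches any model.
        raise PermissionError("invalid API key")
    workers = sorted(dht.values())
    if not workers:
        raise RuntimeError("no workers registered")
    # Simple round-robin across currently registered workers.
    address = workers[_next % len(workers)]
    _next += 1
    return address
```

With two workers registered, successive calls to route alternate between their addresses, while a request with an unknown key never reaches either one.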


The Inference Worker

Each inference worker runs vLLM inside a KubeVirt virtual machine, the same VM-based compute environment used for GPU containers. CosmicAC deploys and manages the vLLM process, so you do not configure it directly.

When you create a Managed Inference job, CosmicAC provisions the VM and registers the worker to the DHT. The proxy then routes requests to it.


The API

The Managed Inference API is OpenAI-compatible: it implements the /v1/chat/completions endpoint. You interact with it through the CosmicAC CLI using the inference chat command.
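Because the endpoint follows the OpenAI convention, the request the CLI sends under the hood looks like a standard chat-completions POST. The sketch below builds such a request without sending it; the base URL, API key, and model name are placeholders, not real CosmicAC values, and in practice you would use the CLI rather than calling the endpoint directly.

```python
import json
import urllib.request

# Placeholders (assumptions) -- substitute your real endpoint and key.
BASE_URL = "https://example.invalid"
API_KEY = "sk-example"

payload = json.dumps({
    "model": "example-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello!"}],
}).encode()

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=payload,
    headers={
        "Authorization": f"Bearer {API_KEY}",  # OpenAI-style bearer auth
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```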

Streaming

Managed Inference supports streaming responses using Server-Sent Events (SSE). Pass --stream to receive token chunks as they are generated, rather than waiting for the complete response.
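A streaming client reassembles the response from a sequence of SSE "data:" events. The sketch below assumes the OpenAI streaming chunk format (delta objects, terminated by a data: [DONE] sentinel); the sample events in the usage example are fabricated for illustration.

```python
import json

def collect_stream(lines):
    """Concatenate delta content from SSE 'data:' events until [DONE]."""
    text = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # server signals end of stream
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)
```

For example, two chunks carrying "Hel" and "lo" followed by the [DONE] sentinel reassemble into "Hello".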

Authentication

You authenticate requests using your API key. The CLI resolves your key in this order: the --api-key flag, then the COSMICAC_API_KEY environment variable, then the key stored by inference init.
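That precedence can be expressed as a small resolution function. This is a sketch of the documented order only; the function name and the stored_key parameter (standing in for however inference init persists the key on disk) are assumptions.

```python
import os

def resolve_api_key(flag_value=None, stored_key=None, env=os.environ):
    """Resolve the API key: flag, then environment variable, then stored key."""
    if flag_value:                      # 1. --api-key flag wins
        return flag_value
    if env.get("COSMICAC_API_KEY"):     # 2. environment variable
        return env["COSMICAC_API_KEY"]
    return stored_key                   # 3. key saved by inference init
```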

