Managed Inference

CosmicAC Managed Inference lets you run open-source language models without deploying or managing serving infrastructure. You create an API key, configure the CLI, and send requests.


What Managed Means

Running a model for inference involves more than the model itself. You need infrastructure to handle requests, authenticate callers, balance traffic, and scale as demand grows.

CosmicAC Managed Inference provides all of this as a platform service. The vLLM processes that serve the models, the servers they run on, and the service discovery mechanism that connects requests to available workers all run independently of your code. You interact only through the CLI.


The Proxy Layer

Every inference request goes through the inference proxy before reaching the model. The proxy handles authentication, service discovery, and load balancing.

Authentication. The proxy verifies your API key before forwarding the request. It rejects requests without a valid key before they reach the model.

Service discovery. When a new inference worker starts up, it registers itself with the distributed hash table (DHT). The proxy queries the DHT to discover which workers are reachable and routes requests accordingly.

Load balancing. When multiple inference workers are available, the proxy distributes requests across them.
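The proxy's three responsibilities can be sketched as a small routing function. This is an illustrative model, not the actual CosmicAC implementation: the names (VALID_KEYS, register_worker, route), the round-robin strategy, and the use of a plain dict to stand in for the DHT are all assumptions.

```python
# A minimal sketch of the proxy layer: authenticate, discover, load-balance.
# Everything here is hypothetical; the real proxy's internals are not public.
VALID_KEYS = {"sk-example"}   # keys the proxy accepts (placeholder)
dht = {}                      # worker id -> address; stands in for the DHT
_next = 0                     # round-robin cursor

def register_worker(worker_id, address):
    """Service discovery: a new inference worker announces itself."""
    dht[worker_id] = address

def route(api_key):
    """Authenticate the caller, then pick a reachable worker."""
    global _next
    if api_key not in VALID_KEYS:
        # Rejected at the proxy, before the request reaches any model.
        raise PermissionError("invalid API key")
    workers = sorted(dht.values())
    if not workers:
        raise RuntimeError("no workers registered")
    # Simple round-robin across currently registered workers.
    address = workers[_next % len(workers)]
    _next += 1
    return address
```

With two workers registered, successive calls to route alternate between their addresses, while a request with an unknown key never reaches either one.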


The Inference Worker

Each inference worker runs vLLM inside a KubeVirt virtual machine, the same VM-based compute environment used for GPU containers. CosmicAC deploys and manages the vLLM process, so you do not configure it directly.

When you create a Managed Inference job, CosmicAC provisions the VM and registers the worker to the DHT. The proxy then routes requests to it.


The API

The Managed Inference API is OpenAI-compatible: it implements the /v1/chat/completions endpoint. You interact with it through the CosmicAC CLI using the inference chat command.
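Because the endpoint follows the OpenAI convention, the request the CLI sends under the hood looks like a standard chat-completions POST. The sketch below builds such a request without sending it; the base URL, API key, and model name are placeholders, not real CosmicAC values, and in practice you would use the CLI rather than calling the endpoint directly.

```python
import json
import urllib.request

# Placeholders (assumptions) -- substitute your real endpoint and key.
BASE_URL = "https://example.invalid"
API_KEY = "sk-example"

payload = json.dumps({
    "model": "example-model",  # placeholder model name
    "messages": [{"role": "user", "content": "Hello!"}],
}).encode()

req = urllib.request.Request(
    f"{BASE_URL}/v1/chat/completions",
    data=payload,
    headers={
        "Authorization": f"Bearer {API_KEY}",  # OpenAI-style bearer auth
        "Content-Type": "application/json",
    },
    method="POST",
)
# urllib.request.urlopen(req) would send it; omitted here.
```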

Streaming

Managed Inference supports streaming responses using Server-Sent Events (SSE). Pass --stream to receive token chunks as they are generated, rather than waiting for the complete response.
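A streaming client reassembles the response from a sequence of SSE "data:" events. The sketch below assumes the OpenAI streaming chunk format (delta objects, terminated by a data: [DONE] sentinel); the sample events in the usage example are fabricated for illustration.

```python
import json

def collect_stream(lines):
    """Concatenate delta content from SSE 'data:' events until [DONE]."""
    text = []
    for line in lines:
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # server signals end of stream
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"]
        text.append(delta.get("content", ""))
    return "".join(text)
```

For example, two chunks carrying "Hel" and "lo" followed by the [DONE] sentinel reassemble into "Hello".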

Authentication

You authenticate requests using your API key. The CLI resolves your key in this order: the --api-key flag, then the COSMICAC_API_KEY environment variable, then the key stored by inference init.
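That precedence can be expressed as a small resolution function. This is a sketch of the documented order only; the function name and the stored_key parameter (standing in for however inference init persists the key on disk) are assumptions.

```python
import os

def resolve_api_key(flag_value=None, stored_key=None, env=os.environ):
    """Resolve the API key: flag, then environment variable, then stored key."""
    if flag_value:                      # 1. --api-key flag wins
        return flag_value
    if env.get("COSMICAC_API_KEY"):     # 2. environment variable
        return env["COSMICAC_API_KEY"]
    return stored_key                   # 3. key saved by inference init
```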

