Now in general availability

AI inference.
Without the
infrastructure tax.

Pick a model. Send a request. Get a result. No GPUs to manage, no clusters to babysit, no idle servers to pay for.

# A few lines to your first inference
from compute import Client

client = Client(api_key="sk-...")

response = client.infer(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Hello"}]
)
Response · 180ms
{
  "id": "inf_01HXQ7...",
  "model": "llama-3.1-70b",
  "content": "Hello! How can I help you today?",
  "tokens_used": 42,
  "cost_usd": 0.000021
}
<200ms
Median cold start
50+
Curated catalog models
$0
When you're not using it
99.9%
Uptime SLA

From idea to inference
in under an hour

01
Choose a model
Pick from the public catalog or bring your own fine-tuned weights. Every model includes pricing and latency specs upfront.
02
Receive your endpoint
You get a URL and an API key. That's your entire deployment. No YAML, no Kubernetes, no cluster provisioning.
03
Send requests
JSON in, JSON out. The same calling pattern works across every model type — swap a vision model for an LLM without rewriting client code.
04
Scale automatically
Traffic doubles? Workers spin up. Traffic drops to zero overnight? Workers disappear. You see it only in latency — never in your bill.
Your app
POST /v1/infer
Compute router
Route + queue
GPU worker
Model loaded ✓
Response
180ms · 42 tokens
Auto-scale 0 → 1000 workers
Scale-to-zero No idle cost
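The flow above is plain HTTP: a URL, an API key, a JSON body. As a sketch using only the standard library, here is what that request looks like built by hand. The URL, header names, and payload shape are illustrative assumptions, not a documented wire format, and the request is constructed but never sent.

```python
import json
import urllib.request

# Hypothetical endpoint and credentials -- the real values come from your dashboard.
API_URL = "https://api.example.com/v1/infer"
API_KEY = "sk-..."

payload = {
    "model": "llama-3.1-70b",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Build (but don't send) the POST request: JSON in, JSON out.
req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Swapping a vision model for an LLM changes only the payload, never the call shape.
vision_payload = dict(payload, model="llava-1.6-34b")
```

Because every model type shares this calling pattern, switching models is a one-field change in the body rather than a rewrite of the client.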

Built for the
production reality

Not a proof-of-concept. The features teams quietly need once they're past the prototype stage are already built in.

True elasticity
Workers come up in seconds and scale to zero overnight. A model serving 1,000 RPM at peak and nothing at 3am is billed for exactly that pattern.
Model breadth
LLMs, image generation, speech, embeddings, classifiers, vision — one account, one dashboard, one SDK for your entire inference stack.
Honest pricing
Metered in tokens, seconds, or requests — whichever matches the model. No per-hour charges hiding underneath. No surprise minimums.
Private model hosting
Your fine-tuned weights stay on isolated infrastructure. No commingling with other tenants. Weights are never used for anything beyond serving your requests.
Traffic splitting
Multiple model versions served side by side. Split traffic between them, roll out replacements without dropping requests in flight.
Operational visibility
Per-endpoint rate limits, request-level logs, latency and error metrics. The unglamorous features that decide whether an inference layer survives its first incident.
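Conceptually, splitting traffic between two live versions is weighted routing. The sketch below illustrates the idea client-side with a 90/10 split; the version labels are made up, and this is not the platform's actual routing mechanism.

```python
import random

# Hypothetical 90/10 split between a stable version and a candidate.
versions = ["llama-3.1-70b@v1", "llama-3.1-70b@v2"]
weights = [0.9, 0.1]

def pick_version(rng: random.Random) -> str:
    """Choose a model version according to the configured split."""
    return rng.choices(versions, weights=weights, k=1)[0]

# Over many requests, the candidate receives roughly its 10% share.
rng = random.Random(0)
sample = [pick_version(rng) for _ in range(10_000)]
share_v2 = sample.count("llama-3.1-70b@v2") / len(sample)
```

Rolling out a replacement is then just moving weight from one version to the other; in-flight requests keep whichever version they were routed to.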

Every inference workload,
one platform

Curated models ready the moment your account is active. License terms, latency, and pricing on a single page before you write a line of code.

LLM
Llama 3.1 70B
Instruction-tuned, 128k context. Strong reasoning and code generation.
$0.50 / 1M tokens
LLM
Mixtral 8×22B
Sparse mixture-of-experts. Fast, efficient at 64k context.
$0.65 / 1M tokens
Image
FLUX.1 Pro
State-of-the-art text-to-image. 12B params, photorealistic output.
$0.055 / image
Speech
Whisper Large v3
OpenAI's best transcription model. 99 languages, word timestamps.
$0.006 / minute
Embeddings
BGE-M3
Multi-lingual embeddings. Excellent for retrieval and semantic search.
$0.02 / 1M tokens
Vision
LLaVA 1.6 34B
Visual question answering and document understanding at scale.
$1.20 / 1M tokens
TTS
Kokoro 82M
Fast, natural text-to-speech. Low latency, streaming support.
$0.015 / 1K chars

Pay for what you use.
Nothing else.

No committed minimums. No free-tier traps. What you see in the dashboard is what shows up on the invoice.
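At the catalog's listed token rates, the metering is simple arithmetic. This helper is an illustrative sketch using the per-1M-token prices from the catalog above; the model keys are shorthand, not official identifiers.

```python
# Per-1M-token prices from the catalog above (USD).
PRICE_PER_1M = {
    "llama-3.1-70b": 0.50,
    "mixtral-8x22b": 0.65,
    "llava-1.6-34b": 1.20,
}

def cost_usd(model: str, tokens: int) -> float:
    """Metered cost for a token-billed model: tokens * (price / 1M)."""
    return tokens * PRICE_PER_1M[model] / 1_000_000

# 42 tokens on Llama 3.1 70B -- matches the $0.000021 in the sample response.
example = cost_usd("llama-3.1-70b", 42)
```

Seconds- and request-billed models work the same way with a different unit; there is no hourly floor underneath the metered rate.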

On-demand
Usage-based
Perfect for products with bursty or unpredictable traffic. Scale to zero means you pay nothing between requests.
  • Billed per token, second, or request
  • Scale-to-zero, no idle cost
  • Full catalog access
  • Request logs + metrics
Start building
Enterprise
Custom
Dedicated infrastructure, compliance, SLA guarantees, and volume pricing for teams with specific regulatory or scale requirements.
  • Dedicated compute isolation
  • SOC 2 Type II + HIPAA
  • 99.99% SLA with credits
  • Volume discounts
Talk to sales

The infrastructure layer
you'll never think about.

Start shipping AI features today. Most teams have a working integration before their first cup of coffee gets cold.