Now in general availability

AI inference.
Without the
infrastructure tax.

Pick a model. Send a request. Get a result. No GPUs to manage, no clusters to babysit, no idle servers to pay for.

# A few lines to your first inference
from compute import Client

client = Client(api_key="sk-...")

response = client.infer(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "Hello"}]
)
Response · 180ms
{
  "id": "inf_01HXQ7...",
  "model": "llama-3.1-70b",
  "content": "Hello! How can I help you today?",
  "tokens_used": 42,
  "cost_usd": 0.000021
}
<200ms
Median cold start
50+
Curated catalog models
$0
When you're not using it
99.9%
Uptime SLA

From idea to inference
in under an hour

01
Choose a model
Pick from the public catalog or bring your own fine-tuned weights. Every model includes pricing and latency specs upfront.
02
Receive your endpoint
You get a URL and an API key. That's your entire deployment. No YAML, no Kubernetes, no cluster provisioning.
03
Send requests
JSON in, JSON out. The same calling pattern works across every model type — swap a vision model for an LLM without rewriting client code.
04
Scale automatically
Traffic doubles? Workers spin up. Traffic drops to zero overnight? Workers disappear. You see it only in latency — never in your bill.
Your app
POST /v1/infer
Compute router
Route + queue
GPU worker
Model loaded ✓
Response
180ms · 42 tokens
Auto-scale 0 → 1000 workers
Scale-to-zero No idle cost
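The flow above is plain HTTP: a URL, an API key, a JSON body. As a sketch using only the standard library, here is what that request looks like built by hand. The URL, header names, and payload shape are illustrative assumptions, not a documented wire format, and the request is constructed but never sent.

```python
import json
import urllib.request

# Hypothetical endpoint and credentials -- the real values come from your dashboard.
API_URL = "https://api.example.com/v1/infer"
API_KEY = "sk-..."

payload = {
    "model": "llama-3.1-70b",
    "messages": [{"role": "user", "content": "Hello"}],
}

# Build (but don't send) the POST request: JSON in, JSON out.
req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# Swapping a vision model for an LLM changes only the payload, never the call shape.
vision_payload = dict(payload, model="llava-1.6-34b")
```

Because every model type shares this calling pattern, switching models is a one-field change in the body rather than a rewrite of the client.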

Built for the
production reality

Not a proof-of-concept. The features teams quietly need once they're past the prototype stage are already built in.

True elasticity
Workers come up in seconds and scale to zero overnight. A model serving 1,000 RPM at peak and nothing at 3am is billed for exactly that pattern.
Model breadth
LLMs, image generation, speech, embeddings, classifiers, vision — one account, one dashboard, one SDK for your entire inference stack.
Honest pricing
Metered in tokens, seconds, or requests — whichever matches the model. No per-hour charges hiding underneath. No surprise minimums.
Private model hosting
Your fine-tuned weights stay on isolated infrastructure. No commingling with other tenants. Weights are never used for anything beyond serving your requests.
Traffic splitting
Multiple model versions served side by side. Split traffic between them, roll out replacements without dropping requests in flight.
Operational visibility
Per-endpoint rate limits, request-level logs, latency and error metrics. The unglamorous features that decide whether an inference layer survives its first incident.
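Conceptually, splitting traffic between two live versions is weighted routing. The sketch below illustrates the idea client-side with a 90/10 split; the version labels are made up, and this is not the platform's actual routing mechanism.

```python
import random

# Hypothetical 90/10 split between a stable version and a candidate.
versions = ["llama-3.1-70b@v1", "llama-3.1-70b@v2"]
weights = [0.9, 0.1]

def pick_version(rng: random.Random) -> str:
    """Choose a model version according to the configured split."""
    return rng.choices(versions, weights=weights, k=1)[0]

# Over many requests, the candidate receives roughly its 10% share.
rng = random.Random(0)
sample = [pick_version(rng) for _ in range(10_000)]
share_v2 = sample.count("llama-3.1-70b@v2") / len(sample)
```

Rolling out a replacement is then just moving weight from one version to the other; in-flight requests keep whichever version they were routed to.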

Every inference workload,
one platform

Curated models ready the moment your account is active. License terms, latency, and pricing on a single page before you write a line of code.

LLM
Llama 3.1 70B
Instruction-tuned, 128k context. Strong reasoning and code generation.
$0.50 / 1M tokens
LLM
Mixtral 8×22B
Sparse mixture-of-experts. Fast, efficient at 64k context.
$0.65 / 1M tokens
Image
FLUX.1 Pro
State-of-the-art text-to-image. 12B params, photorealistic output.
$0.055 / image
Speech
Whisper Large v3
OpenAI's best transcription model. 99 languages, word timestamps.
$0.006 / minute
Embeddings
BGE-M3
Multi-lingual embeddings. Excellent for retrieval and semantic search.
$0.02 / 1M tokens
Vision
LLaVA 1.6 34B
Visual question answering and document understanding at scale.
$1.20 / 1M tokens
TTS
Kokoro 82M
Fast, natural text-to-speech. Low latency, streaming support.
$0.015 / 1K chars

Pay for what you use.
Nothing else.

No committed minimums. No free-tier traps. What you see in the dashboard is what shows up on the invoice.
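At the catalog's listed token rates, the metering is simple arithmetic. This helper is an illustrative sketch using the per-1M-token prices from the catalog above; the model keys are shorthand, not official identifiers.

```python
# Per-1M-token prices from the catalog above (USD).
PRICE_PER_1M = {
    "llama-3.1-70b": 0.50,
    "mixtral-8x22b": 0.65,
    "llava-1.6-34b": 1.20,
}

def cost_usd(model: str, tokens: int) -> float:
    """Metered cost for a token-billed model: tokens * (price / 1M)."""
    return tokens * PRICE_PER_1M[model] / 1_000_000

# 42 tokens on Llama 3.1 70B -- matches the $0.000021 in the sample response.
example = cost_usd("llama-3.1-70b", 42)
```

Seconds- and request-billed models work the same way with a different unit; there is no hourly floor underneath the metered rate.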

On-demand
Usage-based
Perfect for products with bursty or unpredictable traffic. Scale to zero means you pay nothing between requests.
  • Billed per token, second, or request
  • Scale-to-zero, no idle cost
  • Full catalog access
  • Request logs + metrics
Start building
Enterprise
Custom
Dedicated infrastructure, compliance, SLA guarantees, and volume pricing for teams with specific regulatory or scale requirements.
  • Dedicated compute isolation
  • SOC 2 Type II + HIPAA
  • 99.99% SLA with credits
  • Volume discounts
Talk to sales

The infrastructure layer
you'll never think about.

Start shipping AI features today. Most teams have a working integration before their first cup of coffee gets cold.