Var is the experimentation platform that tells you whether your prompts, models, and agents actually work better — with statistical honesty, not wishful thinking.
Most teams ship AI changes the same way they choose lunch: gut feeling. Here is why that is dangerous.
The same prompt produces different responses every time. Small differences between variants disappear into natural variance, making traditional A/B testing unreliable.
A variant might be better on helpfulness but worse on cost, latency, or safety. Single-metric testing fails to capture these critical trade-offs.
AI quality has no equivalent of a click. Measuring whether a response was actually good takes thoughtful evaluation at scale, which existing tools cannot provide.
Var was built from the ground up for stochastic outputs, multi-objective evaluation, and the statistical honesty AI workloads demand. No retrofitting — just engineering that matches the problem.
From prompt variations to full agent architectures — Var handles experimentation at every layer of the AI stack.
Test prompts, models, agent configurations, or any combination. The same experiment surface handles everything, so learnings transfer across your stack.
Var publishes confidence intervals, refuses to declare winners on weak evidence, and distinguishes practical from statistical significance (a minimal sketch of this kind of check appears below).
Run new variants against historical traffic before touching real users. Filter out bad candidates quickly, cheaply, and safely.
Rubric-based LLM evaluation, pairwise comparison, embedding similarity, and human review — layer methods for confidence.
Track cost, latency, safety, and quality across multiple dimensions in the same experiment. See the full picture, not just one number.
Every variant has stable IDs, version history, and audit trails. Promote winners in one click, roll back just as fast.
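To make the statistical point above concrete, here is a minimal, self-contained sketch of the kind of analysis such a report rests on: a bootstrap confidence interval on the quality difference between two variants, checked against a practical-significance threshold. The metric values and thresholds are illustrative assumptions, not Var's internal implementation; in practice the same check would be run per objective (quality, cost, latency, safety).

```python
import random

def bootstrap_ci(a, b, n_resamples=5000, alpha=0.05):
    """Percentile-bootstrap confidence interval for mean(b) - mean(a)."""
    diffs = []
    for _ in range(n_resamples):
        resample_a = [random.choice(a) for _ in a]
        resample_b = [random.choice(b) for _ in b]
        diffs.append(sum(resample_b) / len(resample_b) - sum(resample_a) / len(resample_a))
    diffs.sort()
    lower = diffs[int(n_resamples * alpha / 2)]
    upper = diffs[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Per-request quality scores (0-1) logged for each variant (illustrative data).
control   = [0.62, 0.71, 0.55, 0.68, 0.74, 0.60, 0.69, 0.58, 0.72, 0.66]
candidate = [0.70, 0.73, 0.61, 0.75, 0.69, 0.72, 0.77, 0.64, 0.71, 0.74]

lower, upper = bootstrap_ci(control, candidate)
MIN_MEANINGFUL_LIFT = 0.03  # practical-significance bar: smaller wins are not worth shipping

if lower > MIN_MEANINGFUL_LIFT:
    print(f"Candidate wins: lift CI [{lower:.3f}, {upper:.3f}]")
elif upper < 0:
    print(f"Candidate regresses: lift CI [{lower:.3f}, {upper:.3f}]")
else:
    print(f"Evidence too weak to call: lift CI [{lower:.3f}, {upper:.3f}]")
```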
Running an experiment in Var follows a short, predictable arc.
Declare variants, traffic allocation, and the metrics that decide winners — in code or through the dashboard.
Wrap relevant calls with the Var SDK. Deterministic routing ensures users see consistent variants throughout sessions (sketched in code after these steps).
Let traffic flow. Cost and latency captured automatically; quality via configurable judges and optional human review.
Promote the winner to production in one operation. If something regresses, roll back just as fast.
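As a concrete illustration of step two, here is a minimal sketch of the deterministic-routing idea: hash a stable user or session key into a bucket and map it onto the experiment's traffic allocation, so the same user always lands on the same variant. The function name and allocation values are illustrative; this shows the underlying technique, not Var's SDK.

```python
import hashlib

def assign_variant(experiment_id: str, user_id: str, allocation: dict) -> str:
    """allocation maps variant name to traffic share; shares should sum to 1.0."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # stable value in [0, 1]
    cumulative = 0.0
    for variant, share in allocation.items():
        cumulative += share
        if bucket <= cumulative:
            return variant
    return variant  # guard against floating-point rounding at the top of the range

# The same user always sees the same variant for a given experiment.
print(assign_variant("support-prompt-v2", "user-1234", {"control": 0.9, "candidate": 0.1}))
```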
Built specifically for AI, not retrofitted from marketing tools.
Most tools optimize for the appearance of clear answers. Var tells you when evidence is weak, when trade-offs are real, and when the long tail matters more than the average.
Filter out bad variants against historical traffic before they ever touch a real user. Live experiments are reserved for candidates that survive replay (a minimal replay sketch appears below).
The same experiment surface handles prompts, models, and agent architectures. Learning transfers across every part of your AI stack.
Every variant has stable IDs, version history, audit trails, and a record of every experiment it appeared in. No more ad-hoc prompt files.
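To show what replay testing looks like in miniature, the sketch below runs a candidate prompt over logged production inputs, scores each output with a rubric-style judge, and gates promotion on not regressing against the baseline. `call_model` and `judge_quality` are placeholder stand-ins you would replace with your real model client and evaluator; this illustrates the idea, not Var's replay engine.

```python
from statistics import mean

def call_model(system_prompt: str, user_message: str) -> str:
    # Placeholder: swap in your real model client.
    return f"(stub reply to '{user_message}' using: {system_prompt[:24]}...)"

def judge_quality(user_message: str, response: str) -> float:
    # Placeholder rubric judge returning a 0-1 score: swap in an LLM judge or human review.
    return 0.8 if user_message.split()[0].lower() in response.lower() else 0.5

def replay_score(candidate_prompt: str, logged_requests: list[dict]) -> float:
    """Average judge score for the candidate prompt over historical traffic."""
    scores = [
        judge_quality(req["user_message"], call_model(candidate_prompt, req["user_message"]))
        for req in logged_requests
    ]
    return mean(scores)

logged = [{"user_message": "Where is my order?"}, {"user_message": "Cancel my subscription."}]
candidate_prompt = "You are a concise, empathetic support agent."
baseline_score = 0.74  # illustrative: the control's logged quality on the same traffic

score = replay_score(candidate_prompt, logged)
print(f"candidate replay score: {score:.2f}")
print("survives replay" if score >= baseline_score - 0.02 else "filtered out before live traffic")
```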
From model selection to ongoing optimization, Var covers the full AI experimentation lifecycle.
Replace vendor benchmarks with measurements against your own traffic. Know which model actually works best for your specific use case.
Stop gambling on intuition. Compare prompt versions with statistical rigor and know when a change actually helps.
Start at 1% traffic, expand only as the data confirms the upgrade is real. Catch regressions before customers do.
Confirm that a smaller, cheaper model can handle the workload before committing. Know exactly what you are trading off.
Clean HTTP API with SDKs in Python and TypeScript. Designed to wrap existing model calls without restructuring your application. Typically just a few lines of code to get started.
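As a rough picture of what wrapping an existing call can look like, here is a self-contained sketch: one helper times your existing model call, records a simple cost proxy, and tags the record with the variant that served it. The in-memory RECORDS list stands in for reporting to the platform, and all names here are illustrative assumptions rather than the actual Var SDK surface.

```python
import time

RECORDS = []  # stand-in for reporting results back to the experimentation backend

def tracked_call(variant: str, model_call, *args, **kwargs):
    """Run an existing model call unchanged, capturing latency and a cost proxy."""
    start = time.time()
    output = model_call(*args, **kwargs)
    RECORDS.append({
        "variant": variant,
        "latency_s": round(time.time() - start, 4),
        "output_chars": len(output),  # crude proxy for token cost in this sketch
    })
    return output

def support_reply(message: str) -> str:
    return f"Thanks for reaching out! Re: {message}"  # stub for your existing model call

print(tracked_call("candidate", support_reply, "Where is my order?"))
print(RECORDS)
```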
For environments with strict data-handling needs, Var offers a self-hosted edition with the same API surface as the hosted service. Your data stays where you need it.
Var caught a regression in our customer support agent that would have gone unnoticed for weeks. The tail behavior analysis alone has saved us multiple incidents.
We used to ship prompt changes based on vibes. Now we have actual evidence. The replay testing feature lets us filter bad ideas before they touch production.
The multi-objective analysis changed how we think about model selection. Seeing cost, latency, and quality trade-offs in one place made the decision obvious.
| Capability | Var | Traditional Tools |
|---|---|---|
| AI-native statistical methodology | ✓ | ✗ |
| Multi-objective analysis | ✓ | ✗ |
| Replay testing | ✓ | ✗ |
| AI quality judges | ✓ | ✗ |
| Tail behavior analysis | ✓ | ✗ |
| Prompt version control | ✓ | ✗ |
| Confidence intervals | ✓ | Limited |
The difference between an AI product that gets worse over time and one that gets steadily better comes down to one thing: measurement.