SLM is a curated catalog and inference API for small language models that actually perform in production. Stop overpaying for frontier models on tasks that don't need them. Start shipping faster, cheaper, and smarter.
The industry developed a reflex: reach for the largest available model regardless of the task. A ticket classifier gets the same model as a legal brief writer. The cost compounds quietly into a board-meeting problem.
A 3-second response versus a 300ms one is the difference between a feature people love and one they tolerate.
Ten million tickets per month at frontier prices is not a business — it's a liability.
Choosing, hardening, and operating small models at scale is work most teams can't afford to own.
[Chart: cost per 1M tokens, classification task]
For tasks like classification, extraction, routing, and structured output — a well-chosen 3B model matches or beats a frontier model at 1–2% of the cost and a fraction of the latency.
Everything you need to choose, evaluate, and run small language models in production — without building and owning the operational complexity yourself.
50+ hand-selected, production-hardened models. Not every model in existence — only the ones that clear a strict bar: permissive license, predictable behavior, active maintenance, and demonstrated real-world performance.
One endpoint shape, one response format, across every model. Swap from prototype to production in a single field change. If you already use the OpenAI SDK, a base URL swap is all it takes to get started.
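For a codebase already on the OpenAI SDK, that swap could look like the sketch below; the base URL here is a placeholder for illustration, not SLM's documented endpoint.

import os
from openai import OpenAI

# Same SDK, same call sites; only the endpoint and key change.
# The base URL below is a placeholder, not the real SLM endpoint.
client = OpenAI(
    base_url="https://api.slm.example/v1",
    api_key=os.environ["SLM_API_KEY"],
)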
Chat templates, tokenizer quirks, stop tokens, long-context edge cases — SLM's engineering team finds and fixes these at the serving layer before any model enters the catalog. It works the way the model card says it does, every time.
Academic benchmarks favor large models. SLM publishes task-specific metrics: extraction accuracy, classification F1, instruction-following on short prompts, latency under load, and cost per thousand requests.
Upload your dataset, run candidates head-to-head, and get a comparison report on accuracy, latency, and cost before you commit. Evaluate rigorously rather than impressionistically.
When off-the-shelf models leave performance on the table, fine-tune any base model in the catalog. Training, evaluation, deployment, and monitoring in one platform — served through the same API at the same latency.
The features that decide whether a service can sit inside a serious application — not the glamorous ones, the important ones.
Full visibility per API key. Per-model usage breakdowns, request-level logs with optional retention, and latency and error metrics surfaced in the dashboard.
Streaming is supported for every model, at the context lengths the underlying model actually handles — not theoretical maximums that fall apart in practice.
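Under the same OpenAI-compatible assumption, streaming would use the SDK's standard stream=True path; the model id is illustrative, and client is the one configured in the earlier sketch.

stream = client.chat.completions.create(
    model="phi-3.5-mini",  # illustrative catalog id
    messages=[{"role": "user", "content": "Summarize this ticket thread in two sentences: ..."}],
    stream=True,
)
for chunk in stream:
    # Each chunk carries an incremental delta; emit tokens as they arrive.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)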
On-demand traffic scales automatically. For latency-sensitive endpoints, reserve dedicated capacity for any catalog model, metered separately from burst traffic.
Tag workloads by team, product, or environment. Failed requests are categorized with actionable error types rather than lumped together as generic 500s.
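One plausible shape for those tags, assuming SLM reads a tags field from the request body; the field name is an assumption, passed through via the OpenAI SDK's extra_body escape hatch.

resp = client.chat.completions.create(
    model="phi-3.5-mini",  # illustrative catalog id
    messages=[{"role": "user", "content": "Route this ticket."}],
    # Hypothetical SLM-specific field for cost attribution.
    extra_body={"tags": {"team": "support", "env": "production"}},
)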
Structured JSON output for any model that supports it. Reliable field extraction and classification, with schema validation baked into the serving layer.
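If the serving layer mirrors the OpenAI response_format parameter, an assumption rather than documented behavior, invoice extraction could be sketched as:

resp = client.chat.completions.create(
    model="phi-3.5-mini",  # illustrative catalog id
    messages=[{"role": "user", "content": "Invoice #1042, due 2025-03-01, total $1,250.00"}],
    # Schema validation is enforced at the serving layer, per the claim above.
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "schema": {
                "type": "object",
                "properties": {
                    "number": {"type": "string"},
                    "due_date": {"type": "string"},
                    "total": {"type": "number"},
                },
                "required": ["number", "due_date", "total"],
            },
        },
    },
)
print(resp.choices[0].message.content)  # JSON string matching the schema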
Requests are batched automatically to keep latency low and throughput high — letting you serve high-volume pipelines without managing batch logic yourself.
The catalog is organized by what you're trying to accomplish, not by model family. Classify support tickets? Find the best classifier — whether it happens to be a Phi variant or a Qwen fine-tune — with latency, cost, and accuracy data surfaced first. Side-by-side comparison views let you run the same prompt against several candidates and inspect outputs before writing a line of integration code.
A path most developers already recognize — without the operational overhead of running models yourself.
Sign up, create your first project, and generate an API key from the dashboard in under a minute.
Browse by task type. Read the model card — benchmarks, latency profile, cost per token, license, and recommended use cases are all there before you write code.
Point your existing OpenAI-compatible client at the SLM base URL. If you're starting fresh, the dashboard has a live playground. Your first working request takes minutes, not days.
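A minimal first request against that endpoint might look like this; the model id and prompt are illustrative, and client is the OpenAI-compatible client pointed at SLM.

resp = client.chat.completions.create(
    model="phi-3.5-mini",  # illustrative; pick a real id from the model card
    messages=[
        {"role": "system", "content": "Classify the ticket as billing, bug, or how-to. Reply with the label only."},
        {"role": "user", "content": "I was charged twice this month."},
    ],
)
print(resp.choices[0].message.content)  # e.g. "billing"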
Feed the same prompt to multiple catalog candidates side-by-side. Inspect outputs together before writing integration code.
Provide a small set of real inputs and expected outputs. The evaluation runner benchmarks each candidate on your actual task, sketched in code below.
Accuracy, latency, and cost per thousand requests — all in one report. Ship the model that actually works best for your workload, not just the one with the best headline number.
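From code, that evaluation flow might look like the sketch below; the slm Client, the evaluate method, and the dataset shape are assumptions for illustration, not a documented API.

import os
import slm  # hypothetical Python client from the 'pip install slm' step

client = slm.Client(api_key=os.environ["SLM_API_KEY"])

# A handful of labeled examples from the real workload.
dataset = [
    {"input": "I was charged twice this month.", "expected": "billing"},
    {"input": "The app crashes when I upload a PDF.", "expected": "bug"},
]

# Benchmark several catalog candidates head-to-head (illustrative ids).
report = client.evaluate(
    candidates=["phi-3.5-mini", "qwen2.5-3b-instruct"],
    dataset=dataset,
)
print(report)  # accuracy, latency, and cost per 1k requests per candidate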
The prototype code is the production code. No migration, no rewrite — the platform absorbs traffic growth automatically.
For user-facing latency-sensitive endpoints, reserve dedicated throughput. Leave batch workloads on on-demand to optimize cost.
Request logs, per-key rate limits, cost attribution tags, and categorized error types — the visibility to run a model inside a serious application.
Teams that switch to the right small model first tend to be the ones that scale most comfortably. The math compounds fast.
Classification, extraction, routing, and summarization at 1–2% of frontier model pricing. At 10M requests per month, that's the difference between viable and impossible.
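For illustration with hypothetical prices: at $10 per million tokens for a frontier model versus $0.15 for a well-chosen 3B, 10M requests averaging 500 tokens each come to 5 billion tokens a month, roughly $50,000 versus $750.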
Sub-100ms median response on 3B models. User-facing applications that need speed get it — without sacrificing accuracy on tasks small models handle well.
No servers to provision. No tokenizer edge cases to debug. No long-context behavior to harden. SLM owns all of that so your team focuses on the application.
If your language model task is repeatable, well-defined, or running at volume, SLM is almost certainly the right infrastructure layer for it.
Classify, route, and triage support tickets at scale. Match incoming requests to categories and queues without manual review.
Extract structured fields from invoices, contracts, and forms. Turn unstructured documents into clean, database-ready records.
Moderate user-generated content at platform scale. Fast, cheap classification that handles millions of items per day without budget blowouts.
Use SLM as the workhorse for subtasks that a frontier model orchestrates from above. 10x cheaper sub-agent calls, same quality where it counts.
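A sketch of that split, with a frontier orchestrator delegating high-volume subtask calls; the endpoints, key names, and model ids are illustrative.

import os
from openai import OpenAI

frontier = OpenAI(api_key=os.environ["FRONTIER_API_KEY"])  # orchestrator
slm = OpenAI(
    base_url="https://api.slm.example/v1",  # placeholder endpoint
    api_key=os.environ["SLM_API_KEY"],      # workhorse for subtasks
)

def classify_intent(text: str) -> str:
    # High-volume subtask routed to a small model instead of the frontier.
    resp = slm.chat.completions.create(
        model="phi-3.5-mini",  # illustrative catalog id
        messages=[{"role": "user", "content": f"Label the intent of: {text}"}],
    )
    return resp.choices[0].message.content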
Rewrite search queries for better retrieval. Rerank candidate results. Small models handle both tasks with sub-100ms latency that search UX demands.
Code completion, command translation, small refactoring tasks. Code-specialized small models deliver fast, accurate completions at IDE response speeds.
SLM is a drop-in replacement for OpenAI-compatible clients. For the rest of the ecosystem, native integrations are already there.
pip install slm
npm install @slm/client
[Integration logo grid: native integrations, context packages, evals/experiments/CI tooling, and OpenAI-compatible clients]
The operational standards your security and procurement teams expect, built into the platform from the start.
Requests are not stored or used for model training. Opt-in log retention with configurable windows.
TLS 1.3 for all API traffic. Customer data encrypted at rest with AES-256. Keys managed per-tenant.
Granular permissions per API key and team member. Audit logs for all administrative actions and credential changes.
Inclusion criteria for every catalog model are public: license, maintenance status, production-grade quality bar. The curation can be questioned and improved.
SOC 2: annual third-party audit, with a controls report available under NDA.
GDPR: DPA available, EU-region hosting option, and data residency controls.
ISO 27001: information security management system certified.
HIPAA: BAA available for qualifying healthcare workloads.
We were spending $40k/month running support ticket classification through GPT-4. Switching to SLM's Phi-3.5 mini got us to $800/month with better F1 scores on our actual data. It took one afternoon.
The evaluation runner is what sold us. We had three candidate models and a labeled dataset. SLM ran them all and gave us the report in 20 minutes. We knew exactly which model to ship before we touched our codebase.
Our agent pipeline was blowing through token budgets on sub-tasks that didn't warrant frontier reasoning. SLM handles all the workhorse calls now. The main orchestrator is still Claude — the economics are completely different.
The most expensive mistake in LLM deployment is reflexively reaching for the largest model on every task. Start with SLM and let the economics work for you from week one.