Now in Production — 50+ Models Available

The Right Small Model.
Production-Ready.
One API.

SLM is a curated catalog and inference API for small language models that actually perform in production. Stop overpaying for frontier models on tasks that don't need them. Start shipping faster, cheaper, and smarter.

See How It Works
# Drop-in OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
  base_url="https://api.slm.dev/v1",
  api_key="slm_your_key_here"
)

prompt = "Classify this support ticket: billing, bug, or feature request? 'My invoice total looks wrong.'"

response = client.chat.completions.create(
  model="phi-3.5-mini",
  messages=[{
    "role": "user",
    "content": prompt
  }]
)

# 98% cheaper. 10x faster. Same result.
OpenAI Compatible · Streaming · JSON Mode · Fine-Tuning · Auto-Scale
50+ Curated Production Models
API Uptime SLA
<100ms Median Latency (3B models)
98% Average Cost Reduction
LLAMA · MISTRAL · PHI · GEMMA · QWEN · DEEPSEEK

Frontier Models Are Overkill for Most Tasks

The industry developed a reflex: reach for the largest available model regardless of the task. A ticket classifier gets the same model as a legal brief writer. The cost compounds quietly into a board-meeting problem.

⏱️
Latency kills UX

A 3-second response vs. 300ms is the difference between a feature people love and one they tolerate.

💸
Cost scales with every request

Ten million tickets per month at frontier prices is not a business — it's a liability.

🔧
Self-hosting is a full-time job

Choosing, hardening, and operating small models at scale is work most teams can't afford to own.

Cost per 1M tokens — Classification Task

GPT-4o (frontier): $15.00
Claude 3.5 Sonnet: $9.00
SLM — Llama 3.1 8B: $0.18
SLM — Phi-3.5 Mini: $0.10

For tasks like classification, extraction, routing, and structured output — a well-chosen 3B model matches or beats a frontier model at 1–2% of the cost and a fraction of the latency.

Small Models. Serious Infrastructure.

Everything you need to choose, evaluate, and run small language models in production — without building and owning the operational complexity yourself.

📦

Curated Catalog

50+ hand-selected, production-hardened models. Not every model in existence — only the ones that clear a strict bar: permissive license, predictable behavior, active maintenance, and demonstrated real-world performance.

🔌

OpenAI-Compatible API

One endpoint shape and one response format across every model, so swapping the model behind a feature is a one-field change. If you already use the OpenAI SDK, a base URL swap is all it takes to get started.

⚙️

Production Hardening

Chat templates, tokenizer quirks, stop tokens, long-context edge cases — SLM's engineering team finds and fixes these at the serving layer before any model enters the catalog. It works the way the model card says it does, every time.

📊

Honest Benchmarks

Academic benchmarks favor large models. SLM publishes task-specific metrics: extraction accuracy, classification F1, instruction-following on short prompts, latency under load, and cost per thousand requests.

🎯

Evaluation Runner

Upload your dataset, run candidates head-to-head, and get a comparison report on accuracy, latency, and cost before you commit. Evaluate rigorously rather than impressionistically.

🚀

Fine-Tuning Service

When off-the-shelf models leave performance on the table, fine-tune any base model in the catalog. Training, evaluation, deployment, and monitoring in one platform — served through the same API at the same latency.
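
Because fine-tuned models are served through the same API, calling one looks exactly like calling its base model; the only change is the model name. A minimal sketch (the fine-tuned model ID below is hypothetical):

# Calling a fine-tuned model: same client, same endpoint, only the model name changes
from openai import OpenAI

client = OpenAI(
  base_url="https://api.slm.dev/v1",
  api_key="slm_your_key_here"
)

response = client.chat.completions.create(
  model="phi-3.5-mini-ft-acme-tickets",  # hypothetical fine-tuned model ID
  messages=[{"role": "user", "content": "Classify: 'Card declined at checkout'"}]
)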

Built for Production From Day One

The features that decide whether a service can sit inside a serious application — not the glamorous ones, the important ones.

🔑

Per-Key Rate Limits & Usage

Full visibility per API key. Per-model usage breakdowns, request-level logs with optional retention, and latency and error metrics surfaced in the dashboard.

🌊

Streaming Across the Full Catalog

Streaming is supported for every model, at the context lengths the underlying model actually handles — not theoretical maximums that fall apart in practice. A short example follows this feature list.

📈

Autoscaling & Reserved Capacity

On-demand traffic scales automatically. For latency-sensitive endpoints, reserve dedicated capacity for any catalog model, metered separately from burst traffic.

🏷️

Cost Attribution & Tagging

Tag workloads by team, product, or environment. Failed requests are categorized with actionable error types — not thrown together as generic 500s.

📋

Structured Output Mode

JSON-structured responses for any model that supports it. Reliable field extraction and classification with schema validation baked into the serving layer. A short example follows this feature list.

🔄

Intelligent Request Batching

Requests are batched automatically to keep latency low and throughput high — letting you serve high-volume pipelines without managing batch logic yourself.

🗂️

Task-Oriented Catalog Navigation

The catalog is organized by what you're trying to accomplish, not by model family. Classify support tickets? Find the best classifier — whether it happens to be a Phi variant or a Qwen fine-tune — with latency, cost, and accuracy data surfaced first. Side-by-side comparison views let you run the same prompt against several candidates and inspect outputs before writing a line of integration code.
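
Streaming follows the standard OpenAI SDK convention. A minimal sketch, assuming the usual stream=True semantics; the prompt is illustrative:

# Streaming (standard OpenAI SDK semantics)
from openai import OpenAI

client = OpenAI(
  base_url="https://api.slm.dev/v1",
  api_key="slm_your_key_here"
)

stream = client.chat.completions.create(
  model="phi-3.5-mini",
  messages=[{"role": "user", "content": "Summarize this ticket in one sentence: 'Export to CSV hangs on large reports.'"}],
  stream=True
)

for chunk in stream:
  # Print each token as it arrives
  if chunk.choices and chunk.choices[0].delta.content:
    print(chunk.choices[0].delta.content, end="", flush=True)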
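
Structured Output Mode uses the familiar response_format convention from the OpenAI API. A sketch, assuming JSON mode is available on the chosen model; the fields named in the prompt are illustrative:

# Structured output (JSON mode)
import json
from openai import OpenAI

client = OpenAI(
  base_url="https://api.slm.dev/v1",
  api_key="slm_your_key_here"
)

response = client.chat.completions.create(
  model="phi-3.5-mini",
  messages=[{
    "role": "user",
    "content": "Return JSON with keys vendor, total, due_date for: 'Acme Corp invoice, $1,240 due March 3.'"
  }],
  response_format={"type": "json_object"}
)

# Parse the guaranteed-JSON response into a plain dict
record = json.loads(response.choices[0].message.content)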

From First Request to Production

A path most developers already recognize — without the operational overhead of running models yourself.

1

Create an account & generate an API key

Sign up, create your first project, and generate an API key from the dashboard in under a minute.

2

Choose a model from the catalog

Browse by task type. Read the model card — benchmarks, latency profile, cost per token, license, and recommended use cases are all there before you write code.

3

Send your first request

Point your existing OpenAI-compatible client at the SLM base URL. If you're starting fresh, the dashboard has a live playground. Your first working request takes minutes, not days.

# Option 1: Use existing OpenAI client
import openai

openai.base_url = "https://api.slm.dev/v1"
openai.api_key = "slm_xxxxxxxxxxxx"

# Option 2: Direct HTTP
curl https://api.slm.dev/v1/chat/completions \
  -H "Authorization: Bearer $SLM_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3.5-mini",
    "messages": [{"role":"user","content":"..."}]
  }'
1

Run comparison views

Feed the same prompt to multiple catalog candidates side-by-side. Inspect outputs together before writing integration code.

2

Upload your evaluation dataset

Provide a small set of real inputs and expected outputs; a sample file format is sketched after the CLI example below. The evaluation runner benchmarks each candidate on your actual task.

3

Get a comparison report

Accuracy, latency, and cost per thousand requests — all in one report. Ship the model that actually works best for your workload, not just the one with the best headline number.

# Evaluation runner — CLI
slm eval run \
  --dataset ./my_tickets.jsonl \
  --models phi-3.5-mini,llama-3.1-8b,qwen2.5-3b \
  --metric f1,latency,cost

# Returns comparison report:
# model          f1     p50      $/1M tok
# phi-3.5-mini   0.94   87ms     $0.10
# llama-3.1-8b   0.91   142ms    $0.18
# qwen2.5-3b     0.93   101ms    $0.12
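
The --dataset file is simply real inputs paired with expected outputs. A sketch of what a file like my_tickets.jsonl might contain; the field names here are illustrative, not a documented schema:

# my_tickets.jsonl (illustrative format; field names are an assumption)
{"input": "My card was charged twice for the same order", "expected": "billing"}
{"input": "The export button does nothing in Firefox", "expected": "bug"}
{"input": "Can you add SSO support for our Okta tenant?", "expected": "feature_request"}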
1

Same API key, same endpoint

The prototype code is the production code. No migration, no rewrite — the platform absorbs traffic growth automatically.

2

Pin reserved capacity where it matters

For user-facing latency-sensitive endpoints, reserve dedicated throughput. Leave batch workloads on on-demand to optimize cost.

3

Monitor usage and errors in detail

Request logs, per-key rate limits, cost attribution tags, and categorized error types — the visibility to run a model inside a serious application.

# Reserve capacity for consistent latency
slm capacity reserve \
  --model phi-3.5-mini \
  --throughput 500 \
  --label prod-ticket-classifier

# Tag requests for cost attribution
client.chat.completions.create(
  model="phi-3.5-mini",
  messages=[...],
  extra_headers={
    "X-SLM-Tag": "team:support,env:prod"
  }
)

The Economics Are Undeniable

Teams that reach for the right small model first tend to be the ones that scale most comfortably. The math compounds fast.

100x

Lower Cost

Classification, extraction, routing, and summarization at 1–2% of frontier model pricing. At 10M requests per month, that's the difference between viable and impossible; the back-of-envelope math below shows how fast it adds up.

10x

Lower Latency

Sub-100ms median response on 3B models. User-facing applications that need speed get it — without sacrificing accuracy on tasks small models handle well.

Zero

Ops Overhead

No servers to provision. No tokenizer edge cases to debug. No long-context behavior to harden. SLM owns all of that so your team focuses on the application.
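
As a back-of-envelope check on the 10M-requests figure, using the classification prices from the comparison above (the ~500 tokens per request is an illustrative assumption):

# Back-of-envelope cost math (token count per request is an assumption)
requests_per_month = 10_000_000
tokens_per_month = requests_per_month * 500       # ~5B tokens

frontier_cost = tokens_per_month / 1e6 * 15.00    # ~$75,000 / month at $15.00 per 1M tokens
slm_cost      = tokens_per_month / 1e6 * 0.10     # ~$500 / month at $0.10 per 1M tokens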

Built for Real Workloads

If your language model task is repeatable, well-defined, or running at volume, SLM is almost certainly the right infrastructure layer for it.

🎫

Customer Support Automation

Classify, route, and triage support tickets at scale. Match incoming requests to categories and queues without manual review.

Classification · Routing
📄

Document Processing Pipelines

Extract structured fields from invoices, contracts, and forms. Turn unstructured documents into clean, database-ready records.

Extraction · JSON Mode
🛡️

Content Moderation

Moderate user-generated content at platform scale. Fast, cheap classification that handles millions of items per day without budget blowouts.

Moderation · High Volume
🤖

Agent Sub-Tasks

Use SLM as the workhorse for subtasks that a frontier model orchestrates from above. 10x cheaper sub-agent calls, same quality where it counts. A sketch of the pattern follows these use cases.

LLM Agents · Orchestration
🔍

Search & Retrieval

Rewrite search queries for better retrieval. Rerank candidate results. Small models handle both tasks with sub-100ms latency that search UX demands.

Query Rewriting · Reranking
💻

Developer Tooling

Code completion, command translation, small refactoring tasks. Code-specialized small models deliver fast, accurate completions at IDE response speeds.

Code Completion
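
For the agent sub-task pattern above, the split typically looks like two clients: the orchestrator stays on your existing frontier provider while high-volume sub-task calls route to SLM. A sketch under those assumptions; the model names and routing are illustrative:

# Agent sub-tasks: frontier orchestrator, SLM workhorse (illustrative split)
from openai import OpenAI

frontier = OpenAI()  # your existing provider; reads OPENAI_API_KEY from the environment
slm = OpenAI(base_url="https://api.slm.dev/v1", api_key="slm_your_key_here")

def classify_ticket(ticket: str) -> str:
  # High-volume workhorse call: cheap and fast on a small model
  resp = slm.chat.completions.create(
    model="phi-3.5-mini",
    messages=[{"role": "user", "content": f"Classify this ticket: {ticket}"}]
  )
  return resp.choices[0].message.content

def plan_next_step(goal: str) -> str:
  # Reserved for steps that genuinely need frontier reasoning
  resp = frontier.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": goal}]
  )
  return resp.choices[0].message.content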

Works with Your Existing Stack

SLM is a drop-in replacement for OpenAI-compatible clients. For the rest of the ecosystem, native integrations are already there.

Python SDK

pip install slm

JavaScript SDK

npm install @slm/client

LangChain

Native integration

LlamaIndex

Native integration

Claude Code

Context package ready

Cursor

Context package ready

CLI Tool

Evals, experiments, CI

REST API

OpenAI-compatible

Enterprise-Grade Trust

The operational standards your security and procurement teams expect, built into the platform from the start.

Zero Data Retention by Default

Requests are not stored or used for model training. Opt-in log retention with configurable retention windows.

Encryption In Transit & At Rest

TLS 1.3 for all API traffic. Customer data encrypted at rest with AES-256. Keys managed per-tenant.

Role-Based Access Controls

Granular permissions per API key and team member. Audit logs for all administrative actions and credential changes.

Transparent Curation Criteria

Inclusion criteria for every catalog model are public: license, maintenance status, production-grade quality bar. The curation can be questioned and improved.

🔒

SOC 2 Type II

Annual third-party audit. Controls report available under NDA.

🇪🇺

GDPR Ready

DPA available. EU-region hosting option. Data residency controls.

🛡️

ISO 27001

Information security management system certified.

🏥

HIPAA Eligible

BAA available for qualifying healthcare workloads.

From Teams Who Made the Switch

"

We were spending $40k/month running support ticket classification through GPT-4. Switching to SLM's Phi-3.5 mini got us to $800/month with better F1 scores on our actual data. It took one afternoon.

Sarah R.
Head of ML, B2B SaaS — 800 employees
"

The evaluation runner is what sold us. We had three candidate models and a labeled dataset. SLM ran them all and gave us the report in 20 minutes. We knew exactly which model to ship before we touched our codebase.

Daniel K.
Senior Engineer, Document Processing Startup
"

Our agent pipeline was blowing through token budgets on sub-tasks that didn't warrant frontier reasoning. SLM handles all the workhorse calls now. The main orchestrator is still Claude — the economics are completely different.

Anika M.
AI Product Lead, Automation Platform

Common Questions

Is SLM really a drop-in replacement for the OpenAI API?

SLM uses the same endpoint shape, same parameters, and same response format as the OpenAI API. Existing clients in Python, JavaScript, Go, or any language with an OpenAI-compatible library typically work with just a base URL change and an API key swap. For most teams, trying SLM against an existing application is a five-minute experiment rather than a multi-week migration.

How do models make it into the catalog?

Inclusion criteria are explicit and public: production-grade quality, permissive license, active maintenance, predictable behavior under load, and demonstrated performance on real production task types. Models that don't clear this bar are not in the catalog. The criteria can be questioned — and improved — over time. The catalog currently covers roughly 50 models across the leading open-weight families and specialist categories.

What does "production hardening" actually involve?

Open-weight model releases are technically functional but rarely production-ready. Chat templates have inconsistencies. Tokenizers handle whitespace in surprising ways. Stop tokens are wrong or missing. Long-context behavior degrades at lengths shorter than advertised. SLM's engineering team identifies these issues before a model enters the catalog and fixes them at the serving layer — so the model behaves the way the model card claims it does, every time, under load.

Which workloads is SLM a good fit for?

SLM is a strong fit for repeatable, well-defined, or high-volume workloads: classification, routing, extraction, structured output, summarization, content moderation, code completion, and query parsing. It is intentionally not a fit for every problem — tasks that genuinely require frontier-level reasoning, long multi-document analysis, or creative writing at the edge of the field belong on a different kind of model. SLM will tell you so rather than letting you discover it in production.

What if no off-the-shelf model is accurate enough for my task?

When off-the-shelf catalog models leave performance on the table for your specific data, SLM offers fine-tuning for any base model in the catalog. The full lifecycle — training, evaluation, deployment, and monitoring — happens within one platform. Fine-tuned models are served through the same API as their base counterparts, with the same latency, the same pricing model, and the same operational visibility. Fine-tuning is a feature, not a separate project.

How does pricing work?

SLM uses a pay-per-token model. Each catalog entry shows transparent per-million input and output token pricing before you write any code. On-demand and reserved capacity are metered separately. Reserved capacity provides consistent low-latency guarantees for user-facing endpoints; on-demand handles burst traffic and batch workloads. There are no seat fees, no setup costs, and no minimum commitments to get started.

The Right Model for Every Task.
Production-Ready Today.

The most expensive mistake in LLM deployment is reflexively reaching for the largest model on every task. Start with SLM and let the economics work for you from week one.

View the Docs