Technical Guide
Everything engineers need to know about Seracade: setup, audit mechanics, proxy architecture, and security model.
1. Overview
Seracade is an automated LLM cost audit and optimization proxy. It intercepts your existing LLM API calls via a single environment variable change, logs a statistically significant sample of requests, replays them against alternative models, and delivers a report showing exactly where you can reduce spend without degrading output quality. After the audit, an optional paid proxy continuously routes each call to the most cost-effective model that meets your quality threshold.
Architecture
- Log each request/response pair to Cloudflare KV
- At call threshold, send logs to webhook server
- Replay each call against alternative models
- Score output quality with 95% confidence intervals
- Generate HTML savings report
- Email report to customer
2. Setup
Prerequisites
- Node.js 20+ (for local tooling and the setup script)
- A `.env` file in your project root with your LLM API keys
- At least one active provider: OpenAI, Anthropic, or Google
Option A: Browser setup
Walk through the guided setup at /onboarding. It detects your providers, generates the correct env vars, and triggers your first audit automatically.
Option B: Terminal (one command)
Option C: Manual setup
Add the proxy base URL for each provider you use. Your API keys stay in your local .env and are never sent to Seracade.
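For a project using all three providers, the manual additions to `.env` look like the following (base URLs as listed in the Supported Providers table; your existing `*_API_KEY` lines stay unchanged):

```shell
# Added by manual setup: route each provider through the Seracade proxy.
OPENAI_BASE_URL=https://seracade.com/v1
ANTHROPIC_BASE_URL=https://seracade.com
GOOGLE_API_BASE_URL=https://seracade.com
```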
What the setup script does
- Detects your project root and locates the `.env` file
- Backs up `.env` to `.env.seracade-backup-YYYYMMDD-HHMMSS`
- Scans for existing `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, and `GOOGLE_API_KEY`
- Adds the corresponding `*_BASE_URL` env vars pointing to the Seracade proxy
- Sends a registration webhook with a SHA-256 hash of your API key (not the key itself) so the proxy can associate your traffic
- Runs a single test call to confirm the proxy is reachable
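The key-hashing step is ordinary SHA-256 over the raw key. A minimal sketch of what the setup script computes (illustrative, not the script's actual code; `registration_hash` is a hypothetical name):

```python
import hashlib

def registration_hash(api_key: str) -> str:
    """Hex SHA-256 digest of the key; only this digest is sent to Seracade."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()

# The digest identifies your account without ever transmitting the key itself.
print(registration_hash("sk-example-key"))
```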
3. How the Audit Works
Call logging
The Seracade proxy intercepts each LLM request, logs the request/response metadata to Cloudflare KV, and passes the original call through to the provider unchanged. Your application receives the same response it would without the proxy. Latency overhead is sub-millisecond (edge routing only, no payload inspection in the hot path).
50-call trigger
Once 50 calls have been logged for your account, the first audit kicks off automatically. No manual trigger needed. Subsequent audits trigger at 100, 200, 350, and 500 calls, each refining the analysis. Most teams hit the first threshold within a few hours.
Replay process
The audit engine replays your logged calls against a dynamic pool of up to 10 alternative models, ranked by price-performance. The pool refreshes every 6 hours from OpenRouter's full catalog of 348+ models. As of April 2026, the pool typically includes models from OpenAI, Anthropic, Google, xAI, DeepSeek, and others. You can view the current pool at /frontier.
Each call is replayed with identical inputs: same system prompt, same user message, same parameters. Replays route through OpenRouter using Seracade's API key, so there is no cost to you during the audit.
Quality scoring
Each alternative model's output is scored against your original response on three dimensions:
- Semantic similarity: does the meaning match?
- Format preservation: does the structure (JSON, markdown, lists) match?
- Task completion: does the output accomplish what the prompt asked for?
Results are reported with 95% confidence intervals so you see statistical reliability, not just averages.
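The 95% interval follows the standard normal approximation. A minimal sketch of how a mean score and its band might be computed (this is an illustration, not Seracade's scoring code):

```python
import math
from statistics import mean, stdev

def score_with_ci(scores: list[float]) -> tuple[float, float, float]:
    """Mean quality score with a 95% confidence interval (normal approximation)."""
    m = mean(scores)
    half = 1.96 * stdev(scores) / math.sqrt(len(scores))  # margin of error
    return m, m - half, m + half

m, lo, hi = score_with_ci([0.91, 0.88, 0.94, 0.90, 0.87])
print(f"{m:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```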
Report generation and delivery
After replay and scoring complete, the audit engine generates a per-endpoint report showing which calls can safely use an alternative model and which ones genuinely need the model you are paying for. The report is emailed to the address associated with your account and is also available in the dashboard.
4. How the Paid Proxy Works
Model routing
Based on the audit results, the proxy builds a per-endpoint routing table. Each endpoint (identified by system prompt hash + model + path) is mapped to the cheapest model that met your quality threshold. You can pin any endpoint to a specific model to override the automatic routing.
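A routing-table entry can be pictured as a map from an endpoint key (system-prompt hash + model + path) to the cheapest model that cleared the threshold. An illustrative sketch with made-up prices and scores (the function names and candidate data are hypothetical):

```python
import hashlib

def endpoint_key(system_prompt: str, model: str, path: str) -> str:
    """Stable endpoint identity: system-prompt hash + model + path."""
    return f"{hashlib.sha256(system_prompt.encode()).hexdigest()}:{model}:{path}"

def cheapest_qualifying(candidates: list[dict], threshold: float) -> str:
    """Pick the lowest-cost candidate whose quality met the threshold."""
    ok = [c for c in candidates if c["quality"] >= threshold]
    return min(ok, key=lambda c: c["cost_per_1k"])["model"]

candidates = [
    {"model": "gpt-4.1",          "cost_per_1k": 2.00, "quality": 1.00},
    {"model": "gpt-4.1-mini",     "cost_per_1k": 0.40, "quality": 0.96},
    {"model": "claude-haiku-4-5", "cost_per_1k": 0.25, "quality": 0.91},
]
print(cheapest_qualifying(candidates, threshold=0.95))  # gpt-4.1-mini
```

Pinning an endpoint simply overrides this lookup with a fixed model for that key.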
Response caching
The proxy uses hash-based deduplication for repeat prompts. If the same input (system prompt + user message + parameters) is seen within the cache TTL, the cached response is returned instantly. Cache hit rates of 10 to 25% are typical for production workloads with templated prompts.
Provider fallback
If the target model returns a rate limit (429) or server error (5xx), the proxy automatically retries with the next cheapest qualifying model from the routing table. Failover is transparent to your application. No code changes, no retry logic on your side.
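The failover behavior amounts to walking the routing table in cost order on a 429 or 5xx. A simplified sketch with a stand-in `send` function (the real proxy does this server-side):

```python
def call_with_fallback(prompt: str, route: list[str], send) -> tuple[str, str]:
    """Try each qualifying model cheapest-first; fall through on 429/5xx.

    `send(model, prompt)` is a stand-in for the provider call and
    returns (status_code, text).
    """
    last_status = None
    for model in route:
        status, text = send(model, prompt)
        if status == 429 or status >= 500:
            last_status = status
            continue  # transparent failover to the next qualifying model
        return model, text
    raise RuntimeError(f"all models failed (last status {last_status})")
```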
Cost attribution
Tag calls with an `x-seracade-feature` header to get per-feature cost breakdowns in the dashboard. See exactly what each part of your product costs in LLM spend.
Budget caps and spend alerts
Set a monthly budget cap per feature or account-wide. When spend reaches 80% of the cap, you receive an email alert. At 100%, the proxy can either hard-stop (reject calls) or soft-stop (continue but alert on every call). Configure this in the dashboard.
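The 80%/100% thresholds can be expressed as a simple policy check. An illustrative sketch of the behavior described above (function and return values are hypothetical labels, not API names):

```python
def budget_action(spend: float, cap: float, hard_stop: bool) -> str:
    """Map current spend against the cap to the documented behaviors."""
    if spend >= cap:
        # At 100%: reject calls (hard-stop) or keep serving but alert (soft-stop).
        return "reject" if hard_stop else "alert_every_call"
    if spend >= 0.8 * cap:
        return "email_alert"  # at 80%: one email alert
    return "ok"

print(budget_action(85.0, 100.0, hard_stop=True))   # email_alert
print(budget_action(100.0, 100.0, hard_stop=True))  # reject
```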
5. Security
API keys never leave your machine
Your LLM provider API keys remain in your local .env file. The SDK reads them at call time and sends them directly to the proxy over HTTPS. The proxy forwards them to the provider in the same request. Keys are never logged, stored, or persisted by Seracade.
What we see vs. what we don't
| We See | We Don't See |
|---|---|
| Request/response metadata (model, token count, latency) | Your raw API keys (only a SHA-256 hash for identification) |
| Prompt/completion content during replay (in-memory only) | Any data after the audit completes (discarded from memory) |
| Aggregated cost and quality metrics | Your source code, infrastructure, or internal systems |
Request/response handling
During the audit replay phase, request/response pairs are held in memory for scoring. After the quality scores are computed, the raw content is discarded. No prompts or completions are written to disk or persistent storage in plaintext.
Customer identification
Seracade identifies your account using a SHA-256 hash of your API key. The hash is computed locally by the setup script and sent during registration. The actual key is never transmitted to Seracade infrastructure.
Transport security
HTTPS is enforced on all proxy endpoints. The proxy runs on Cloudflare Workers, which terminates TLS at the edge. Connections to upstream LLM providers also use HTTPS exclusively.
6. Supported Providers
| Provider | Env Var | Proxy URL |
|---|---|---|
| OpenAI | OPENAI_BASE_URL | https://seracade.com/v1 |
| Anthropic | ANTHROPIC_BASE_URL | https://seracade.com |
| Google | GOOGLE_API_BASE_URL | https://seracade.com |
7. Undo / Uninstall
Removing Seracade takes one command. The setup script creates a timestamped backup of your .env before making any changes.
Restore from backup
Or manually remove the proxy URLs
Delete these lines from your .env:
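The exact set depends on which providers the setup script configured; for a project using all three, the added lines are the base-URL overrides (leave your `*_API_KEY` lines in place):

```shell
OPENAI_BASE_URL=https://seracade.com/v1
ANTHROPIC_BASE_URL=https://seracade.com
GOOGLE_API_BASE_URL=https://seracade.com
```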
8. Agent Routing
AI agents make hundreds or thousands of API calls per task across different task types. Seracade provides first-class support for identifying, tracking, and optimizing agent costs.
Identify your agents
Add the X-Seracade-Agent header to any request to tag it with an agent name. Seracade tracks usage, cost, and savings per agent automatically.
# Python (OpenAI SDK)
client = OpenAI(base_url="https://seracade.com/v1")
response = client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": "Review this code"}],
extra_headers={"X-Seracade-Agent": "code-reviewer"}
)
# cURL
curl https://seracade.com/v1/chat/completions \
-H "Authorization: Bearer YOUR_KEY" \
-H "X-Seracade-Agent: code-reviewer" \
-d '{"model":"gpt-4.1","messages":[{"role":"user","content":"Review this code"}]}'
Per-agent cost dashboard
Query your agent stats at any time:
GET /api/agents/stats?key_hash=YOUR_HASH
Returns each agent's total calls, cost, savings, task type breakdown, models used, average cost per call, and an efficiency score (quality achieved per dollar spent).
Budget controls
Set per-agent monthly budget limits to prevent runaway costs:
POST /api/agents/budget
{
"key_hash": "YOUR_HASH",
"agent_name": "code-reviewer",
"monthly_budget_usd": 50,
"action_on_exceed": "downgrade"
}
Three actions when a budget is exceeded:
- block — returns 429, agent must handle the error
- downgrade — automatically routes to the cheapest model in the routing table
- alert — sends an email alert, continues routing normally
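The three actions above can be pictured as a dispatch on the budget check. An illustrative sketch (not the server's code; the routing-table shape is assumed):

```python
def on_budget_exceeded(action: str, routing_table: list[dict]) -> dict:
    """Sketch of the three documented `action_on_exceed` behaviors."""
    if action == "block":
        return {"status": 429, "body": "agent budget exceeded"}
    if action == "downgrade":
        # Route to the cheapest model in the routing table.
        cheapest = min(routing_table, key=lambda r: r["cost_per_1k"])
        return {"status": 200, "route_to": cheapest["model"]}
    if action == "alert":
        return {"status": 200, "email_alert": True}
    raise ValueError(f"unknown action: {action}")
```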
Agent cost reports
Generate a per-agent cost efficiency report:
GET /api/agents/report?key_hash=YOUR_HASH
Shows each agent's cost efficiency, which models it uses, which task types, and recommended routing changes. Available as JSON (default) or HTML (via Accept header).
Response headers
When X-Seracade-Agent is present, every response includes:
- `X-Seracade-Agent-Cost` — the cost of that specific call (e.g., `0.001200`)
- `X-Seracade-Routed` — if the call was routed to a cheaper model, shows `original -> routed (task_type)`
- `X-Seracade-Task` — the detected task type for this call
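Because the `X-Seracade-Routed` value follows the `original -> routed (task_type)` shape, it can be parsed mechanically. A small sketch:

```python
import re

def parse_routed(header: str) -> dict:
    """Split 'original -> routed (task_type)' into its three parts."""
    m = re.fullmatch(r"(\S+) -> (\S+) \((\w+)\)", header.strip())
    if not m:
        raise ValueError(f"unexpected header format: {header!r}")
    return {"original": m.group(1), "routed": m.group(2), "task_type": m.group(3)}

print(parse_routed("gpt-4.1 -> claude-haiku-4-5 (summarization)"))
```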
Framework integration
Any framework that supports OPENAI_BASE_URL works with Seracade. Set the env var and add the agent header in your framework's config:
# Environment
OPENAI_BASE_URL=https://seracade.com/v1
# CrewAI — set in your agent's llm config
# LangGraph — set in your ChatOpenAI constructor
# AutoGen — set in your OAI config list
Auto mode (no human in the loop)
Auto mode removes every human gate from the optimization flow. Seracade audits your traffic, activates routing when quality thresholds pass, monitors continuously, and rolls back automatically if quality degrades. No report to review, no button to click.
Enable auto mode
curl -X POST https://seracade.com/api/mode \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"mode":"auto"}'
Returns {"ok": true, "mode": "auto", "customer_id": "sha256_hash"}. To revert: send {"mode":"manual"}.
Check current mode
GET /api/mode
Authorization: Bearer YOUR_API_KEY
What auto mode does
- Audit triggers at 50 calls, same as manual mode
- When the audit generates a routing table and all recommended swaps meet the 95% quality threshold, routing activates on the next request. No email, no confirmation.
- Continuous quality monitoring samples routed calls and scores them against the original model. If quality drops below threshold, the affected route is rolled back automatically.
- A circuit breaker bypasses routing for 5 minutes if proxy latency exceeds acceptable thresholds, then re-enables.
- After 24 hours of traffic, per-agent budgets are auto-inferred at 3x observed daily spend with action set to alert. These are flagged `auto_inferred: true` in the API response.
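The auto-inferred budget rule (3x observed daily spend, action set to alert) can be sketched as follows; mapping the result onto the `monthly_budget_usd` field of the budget API is an assumption for illustration:

```python
def infer_budget(daily_spend_usd: float) -> dict:
    """Per-agent budget inferred at 3x observed daily spend (sketch)."""
    return {
        "monthly_budget_usd": round(3 * daily_spend_usd, 2),  # 3x daily spend
        "action_on_exceed": "alert",
        "auto_inferred": True,  # flagged so you can tell it from a manual budget
    }

print(infer_budget(4.10))
```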
Monitor programmatically
GET /api/status/YOUR_CUSTOMER_ID
Authorization: Bearer YOUR_API_KEY
Returns audit state, active routes, quality scores, savings totals, agent breakdowns, and budget status as JSON. The API key must hash to the requested customer ID or the request returns 401.
No emails in auto mode
Auto mode suppresses all email: audit reports, drip sequences, and savings summaries. Monitoring is API-only. If you want email notifications alongside auto mode, set mode to auto and separately configure an alert email via POST /api/agents/budget with "action_on_exceed": "alert".
9. Additional FAQ
What counts as a "call" for the audit threshold?
Any completion request that passes through the proxy. Chat completions, function calls, embeddings. Each request/response pair counts as one call. Most teams hit the threshold within the first day.
How does quality scoring work?
Seracade replays your exact production inputs against each alternative model, then scores the output against your original response using semantic similarity, format preservation, and task completion. Results are reported with 95% confidence intervals so you see statistical reliability, not just averages.
What if my use case is too specialized for alternative models?
That happens. Roughly 20 to 30% of endpoints show no viable alternative. The audit report tells you which endpoints can save money and which ones genuinely need the model you're using. Both findings are valuable.
Does the proxy add latency?
The routing decision adds sub-millisecond overhead. The proxy runs on Cloudflare's edge network. If an alternative model responds faster (Haiku and Flash typically do), you may see lower total latency.
Can I exclude specific endpoints from optimization?
Yes. Seracade Pro lets you pin any endpoint to a specific model. If your summarization pipeline needs Claude Opus 4, lock it. Seracade only optimizes the endpoints you allow.
10. FAQ
Common questions from customers are answered on the landing page. See the full FAQ section for details on pricing, data handling, latency impact, model pinning, and more.
11. Capabilities Reference
Seracade exposes six named capabilities. Every customer-facing artifact (the routing proxy, the audit report, the dashboard, the MCP server) is built from this set. The names are stable; downstream tooling can integrate against them.
Call
Every inference request through Seracade is a Call. Each Call is addressable by ID and carries full metadata: model used, model considered, tokens, latency, cost, classification, Quality Score, route reason, and the customer's active Quality Gate. Every routed /v1/chat/completions response carries:
- `X-Seracade-Call-Id` — the Call ID, used as input to `/v1/replay`
- `X-Seracade-Calibration-Version` — the calibration version under which the Score was produced
- `X-Seracade-Score-Disclaimer` — "Subject to recalibration. Customer-private. Not for redistribution."
- `X-Seracade-Routed` — original model, routed model, and task type
Step
Agent workflows are sequences of Calls grouped into Steps. A planning Step, a tool-selection Step, a summarization Step, and a code-generation Step in the same agent session are classified and routed independently. Identify Steps via optional headers on /v1/chat/completions:
- `X-Seracade-Step-Id` — opaque identifier for the Step within an agent trajectory
- `X-Seracade-Parent-Call-Id` — ID of the upstream Call that produced this Step's input
- `X-Seracade-Trace-Id` — opaque identifier for the full agent run
Headers are additive. Frameworks that emit step identifiers are picked up automatically; frameworks that do not continue to work at the Call level. Step headers are stripped before the upstream provider sees the request and persisted as opaque tags on the Call record.
# Python (OpenAI SDK)
response = client.chat.completions.create(
model="gpt-4.1",
messages=[{"role": "user", "content": "Plan the next action"}],
extra_headers={
"X-Seracade-Step-Id": "plan-001",
"X-Seracade-Trace-Id": "agent-run-2026-04-25-7f3a",
},
)
Quality Score
Every Call receives a calibrated, version-stamped Quality Score produced from the customer's own traffic. Each Score carries calibration_version and sample_size and a "subject to recalibration" disclaimer. Scores below the per-customer suppression threshold are withheld rather than reported with wide confidence bands. Scores are visible to the customer on their own data only; methodology is not published.
Routing Decision
The routing table is a set of Routing Decisions. Each maps (task_type, Quality Gate threshold) to the candidate model set that clears the threshold, with evidence (sample_size, score distribution, calibration_version, last_rebalance timestamp). Customers can fetch their routing table via the dashboard or the audit report.
Counterfactual Replay
Replay re-runs any historical Call against a specified set of alternate models and returns counterfactual cost and Quality Score per candidate, version-stamped. Use Replay to validate Routing Decisions, build cost-regression tests in CI, or evaluate new models against past traffic before promoting them.
# Replay a Call against two candidate models
curl https://seracade.com/v1/replay \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"call_id": "log:<customer>:<ts>:<uuid>",
"candidate_models": ["claude-haiku-4-5", "gpt-4.1-mini"]
}'
Replay output is private to the requesting customer and is not for redistribution. Pricing matches routed Calls (15% of the absolute price difference per Replay, free until $500/month of cumulative routing value).
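The pricing rule (15% of the absolute price difference per Replay, free until $500/month of cumulative routing value) can be sketched as a small function; the argument names are illustrative:

```python
def replay_fee(original_cost: float, candidate_cost: float,
               cumulative_routing_value: float) -> float:
    """15% of the absolute price difference; free under $500/month of value."""
    if cumulative_routing_value < 500.0:
        return 0.0  # still inside the free tier
    return 0.15 * abs(original_cost - candidate_cost)

# Past the free tier, a $0.0200 call replayed against a $0.0045 candidate:
print(replay_fee(0.0200, 0.0045, cumulative_routing_value=812.0))
```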
Quality Gate
Quality Gates are customer-set policies that constrain the routing table. Declare each Gate as a (task_type, minimum_score) rule; Seracade never routes a Call below an active Gate, even if a cheaper model passes the global threshold. Use Gates to hold quality high on legal, healthcare, financial, or other sensitive task types while cost-routing on summarization, extraction, and classification.
# Set a Gate: never substitute below 0.90 on coding tasks
curl https://seracade.com/v1/gates \
-X POST \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"gates": {"coding": {"minimum_score": 0.90}}}'
# Read active Gates
curl https://seracade.com/v1/gates \
-H "Authorization: Bearer YOUR_KEY"
# Remove one Gate
curl https://seracade.com/v1/gates \
-X DELETE \
-H "Authorization: Bearer YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{"task_type": "coding"}'
Valid task types: extraction, classification, function_calling, summarization, rewriting, qa_rag, reasoning, coding, other. The minimum_score is a number between 0 and 1.
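A client-side check of a gates payload against these constraints might look like the following sketch (the server's own validation is not published):

```python
VALID_TASK_TYPES = {
    "extraction", "classification", "function_calling", "summarization",
    "rewriting", "qa_rag", "reasoning", "coding", "other",
}

def validate_gates(gates: dict) -> None:
    """Raise ValueError on an unknown task type or out-of-range score."""
    for task_type, rule in gates.items():
        if task_type not in VALID_TASK_TYPES:
            raise ValueError(f"unknown task type: {task_type}")
        score = rule.get("minimum_score")
        if not isinstance(score, (int, float)) or not 0 <= score <= 1:
            raise ValueError(f"minimum_score must be in [0, 1], got {score}")

validate_gates({"coding": {"minimum_score": 0.90}})  # passes silently
```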