Technical Guide

Everything engineers need to know about Seracade: setup, audit mechanics, proxy architecture, and security model.

1. Overview

Seracade is an automated LLM cost audit and optimization proxy. It intercepts your existing LLM API calls via a single environment variable change, logs a statistically significant sample of requests, replays them against alternative models, and delivers a report showing exactly where you can reduce spend without degrading output quality. After the audit, an optional paid proxy continuously routes each call to the most cost-effective model that meets your quality threshold.

Architecture

Your existing code stays unchanged: your app sends its API request to the Seracade Proxy (a Cloudflare Worker running at the edge), which logs the call and forwards it unmodified to the LLM provider (OpenAI, Anthropic, or Google). The provider's response returns through the proxy untouched. Alongside this request path, the Audit Engine runs asynchronously and never blocks requests:
  • Log each request/response pair to Cloudflare KV
  • At call threshold, send logs to webhook server
  • Replay each call against alternative models
  • Score output quality with 95% confidence intervals
  • Generate HTML savings report
  • Email report to customer

2. Setup

Option A: Browser setup

Walk through the guided setup at /onboarding. It detects your providers, generates the correct env vars, and triggers your first audit automatically.

Option B: Terminal (one command)

curl -sL seracade.com | bash

Option C: Manual setup

Add the proxy base URL for each provider you use. Your API keys stay in your local .env; Seracade never logs or stores them (see the Security section).

# OpenAI
OPENAI_BASE_URL=https://seracade.com/v1

# Anthropic
ANTHROPIC_BASE_URL=https://seracade.com

# Google
GOOGLE_API_BASE_URL=https://seracade.com

What the setup script does

  1. Detects your project root and locates the .env file
  2. Backs up .env to .env.seracade-backup-YYYYMMDD-HHMMSS
  3. Scans for existing OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY
  4. Adds the corresponding *_BASE_URL env vars pointing to the Seracade proxy
  5. Sends a registration webhook with a SHA-256 hash of your API key (not the key itself) so the proxy can associate your traffic; a sketch of the hash computation follows this list
  6. Runs a single test call to confirm the proxy is reachable
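
For the curious, the hash in step 5 is reproducible locally. A minimal sketch, assuming the hash is the hex SHA-256 digest of the raw key string (the script itself is shell; this is an illustrative Python equivalent):

# Python — illustrative equivalent of the key-hash step
import hashlib
import os

key = os.environ["OPENAI_API_KEY"]  # read locally; the key itself never leaves the machine
key_hash = hashlib.sha256(key.encode("utf-8")).hexdigest()
print(key_hash)  # this hex digest is what the registration webhook carries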

3. How the Audit Works

Call logging

The Seracade proxy intercepts each LLM request, logs the request/response metadata to Cloudflare KV, and passes the original call through to the provider unchanged. Your application receives the same response it would without the proxy. Latency overhead is sub-millisecond (edge routing only, no payload inspection in the hot path).
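
For illustration, a logged entry might look like the sketch below. The field names are hypothetical (the KV schema is internal); the key format matches the call IDs exposed later by the Replay API:

# Python — hypothetical shape of a KV log entry
log_key = "log:<customer>:<ts>:<uuid>"  # placeholder segments as in Replay call IDs
record = {
    "model": "gpt-4o",           # requested model
    "prompt_tokens": 412,        # token counts drive the cost math
    "completion_tokens": 128,
    "latency_ms": 840,           # provider round-trip latency
    "status": 200,
}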

50-call trigger

Once 50 calls have been logged for your account, the first audit kicks off automatically. No manual trigger needed. Subsequent audits trigger at 100, 200, 350, and 500 calls, each refining the analysis. Most teams hit the first threshold within a few hours.

Replay process

The audit engine replays your logged calls against a dynamic pool of up to 10 alternative models, ranked by price-performance. The pool refreshes every 6 hours from OpenRouter's full catalog of 348+ models. As of April 2026, the pool typically includes models from OpenAI, Anthropic, Google, xAI, DeepSeek, and others. You can view the current pool at /frontier.

Each call is replayed with identical inputs: same system prompt, same user message, same parameters. Replays route through OpenRouter using Seracade's API key, so there is no cost to you during the audit.

Quality scoring

Each alternative model's output is scored against your original response on three dimensions:

  • Semantic similarity
  • Format preservation
  • Task completion

Results are reported with 95% confidence intervals so you see statistical reliability, not just averages.
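
As a rough illustration of what that reporting amounts to, here is a 95% confidence interval over per-call scores; the numbers and the plain normal approximation are illustrative, not Seracade's actual pipeline:

# Python — 95% confidence interval over hypothetical quality scores
import math
import statistics

scores = [0.91, 0.88, 0.95, 0.90, 0.87]  # hypothetical per-call quality scores
mean = statistics.mean(scores)
se = statistics.stdev(scores) / math.sqrt(len(scores))  # standard error of the mean
low, high = mean - 1.96 * se, mean + 1.96 * se  # 95% normal approximation
print(f"quality {mean:.3f} (95% CI {low:.3f}-{high:.3f})")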

Report generation and delivery

After replay and scoring complete, the audit engine generates a per-endpoint report showing which calls can safely use an alternative model and which ones genuinely need the model you are paying for. The report is emailed to the address associated with your account and is also available in the dashboard.

4. How the Paid Proxy Works

Model routing

Based on the audit results, the proxy builds a per-endpoint routing table. Each endpoint (identified by system prompt hash + model + path) is mapped to the cheapest model that met your quality threshold. You can pin any endpoint to a specific model to override the automatic routing.
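
A minimal sketch of that lookup, with hypothetical names (the real table lives inside the proxy):

# Python — illustrative routing-table lookup
import hashlib

# (system_prompt_hash, requested_model, path) -> cheapest model meeting the quality threshold
routing_table: dict[tuple[str, str, str], str] = {}

def route(system_prompt: str, requested_model: str, path: str) -> str:
    key = (hashlib.sha256(system_prompt.encode()).hexdigest(), requested_model, path)
    # Pinned or unaudited endpoints fall back to the model you asked for
    return routing_table.get(key, requested_model)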

Response caching

The proxy uses hash-based deduplication for repeat prompts. If the same input (system prompt + user message + parameters) is seen within the cache TTL, the cached response is returned instantly. Cache hit rates of 10 to 25% are typical for production workloads with templated prompts.
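
A sketch of the idea, assuming the cache key is a hash over canonicalized inputs (the proxy's actual scheme is internal):

# Python — illustrative cache-key derivation for repeat prompts
import hashlib
import json

def cache_key(system_prompt: str, user_message: str, params: dict) -> str:
    # Canonical JSON so identical inputs always produce the same digest
    payload = json.dumps(
        {"system": system_prompt, "user": user_message, "params": params},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()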

Provider fallback

If the target model returns a rate limit (429) or server error (5xx), the proxy automatically retries with the next cheapest qualifying model from the routing table. Failover is transparent to your application. No code changes, no retry logic on your side.
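
The behavior amounts to a loop like this sketch (the provider call is a stand-in; the real logic runs in the Worker):

# Python — illustrative failover across qualifying models
def send_to_provider(request: dict, model: str) -> tuple[int, dict]:
    # Stand-in for the upstream HTTP call; returns (status_code, body)
    return 200, {"model": model}

def call_with_fallback(request: dict, qualifying_models: list[str]) -> dict:
    for model in qualifying_models:  # cheapest qualifying model first
        status, body = send_to_provider(request, model)
        if status == 429 or status >= 500:
            continue  # rate limit or server error: fail over to the next model
        return body
    raise RuntimeError("all qualifying models failed")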

Cost attribution

Tag calls with an x-seracade-feature header to get per-feature cost breakdowns in the dashboard. See exactly what each part of your product costs in LLM spend.

# Example: tag a call with a feature name
curl https://seracade.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "x-seracade-feature: summarization" \
  -d '{
    "model": "gpt-4o",
    "messages": [...]
  }'

Budget caps and spend alerts

Set a monthly budget cap per feature or account-wide. When spend reaches 80% of the cap, you receive an email alert. At 100%, the proxy can either hard-stop (reject calls) or soft-stop (continue but alert on every call). Configure this in the dashboard.
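
The thresholds reduce to a check like this sketch (hypothetical names; the real configuration lives in the dashboard):

# Python — illustrative budget-cap thresholds
def budget_action(spend_usd: float, cap_usd: float, hard_stop: bool) -> str:
    if spend_usd >= cap_usd:
        # At 100%: hard-stop rejects calls; soft-stop continues but alerts on each
        return "reject" if hard_stop else "alert-every-call"
    if spend_usd >= 0.8 * cap_usd:
        return "email-alert"  # 80% of the cap
    return "ok"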

5. Security

API keys are never stored

Your LLM provider API keys remain in your local .env file. The SDK reads them at call time and sends them directly to the proxy over HTTPS. The proxy forwards them to the provider in the same request. Keys are never logged, stored, or persisted by Seracade.

What we see vs. what we don't

We see:
  • Request/response metadata (model, token count, latency)
  • Prompt/completion content during replay (in-memory only)
  • Aggregated cost and quality metrics

We don't see:
  • Your raw API keys (only a SHA-256 hash for identification)
  • Any data after the audit completes (it is discarded from memory)
  • Your source code, infrastructure, or internal systems

Request/response handling

During the audit replay phase, request/response pairs are held in memory for scoring. After the quality scores are computed, the raw content is discarded. No prompts or completions are written to disk or persistent storage in plaintext.

Customer identification

Seracade identifies your account using a SHA-256 hash of your API key. The hash is computed locally by the setup script and sent during registration. The actual key is never transmitted to Seracade infrastructure.

Transport security

HTTPS is enforced on all proxy endpoints. The proxy runs on Cloudflare Workers, which terminates TLS at the edge. Connections to upstream LLM providers also use HTTPS exclusively.

6. Supported Providers

Provider     Env Var                Proxy URL
OpenAI       OPENAI_BASE_URL        https://seracade.com/v1
Anthropic    ANTHROPIC_BASE_URL     https://seracade.com
Google       GOOGLE_API_BASE_URL    https://seracade.com

How it works with your SDK

All three major SDKs (openai, @anthropic-ai/sdk, @google/generative-ai) respect the base URL environment variable. Setting the env var is all that's needed. No code changes, no wrapper functions, no monkey-patching.
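
For example, once OPENAI_BASE_URL is exported, the official openai Python SDK picks up the proxy automatically:

# Python (OpenAI SDK) — no code changes once the env var is set
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment
# Every call made with this client now routes through the Seracade proxy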

7. Undo / Uninstall

Removing Seracade takes one command. The setup script creates a timestamped backup of your .env before making any changes.

Restore from backup

cp .env.seracade-backup-YYYYMMDD-HHMMSS .env

Or manually remove the proxy URLs

Delete these lines from your .env:

OPENAI_BASE_URL=https://seracade.com/v1
ANTHROPIC_BASE_URL=https://seracade.com
GOOGLE_API_BASE_URL=https://seracade.com

No residual processes

Seracade does not install any background services, daemons, or agents on your machine. The proxy runs entirely on Cloudflare's edge network. Removing the env vars is a complete uninstall.

8. Agent Routing

AI agents can make hundreds or thousands of API calls per task, spanning several task types. Seracade provides first-class support for identifying, tracking, and optimizing agent costs.

Identify your agents

Add the X-Seracade-Agent header to any request to tag it with an agent name. Seracade tracks usage, cost, and savings per agent automatically.

# Python (OpenAI SDK)
client = OpenAI(base_url="https://seracade.com/v1")
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Review this code"}],
    extra_headers={"X-Seracade-Agent": "code-reviewer"}
)

# cURL
curl https://seracade.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "X-Seracade-Agent: code-reviewer" \
  -d '{"model":"gpt-4.1","messages":[{"role":"user","content":"Review this code"}]}'

Per-agent cost dashboard

Query your agent stats at any time:

GET /api/agents/stats?key_hash=YOUR_HASH

Returns each agent's total calls, cost, savings, task type breakdown, models used, average cost per call, and an efficiency score (quality achieved per dollar spent).
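
A minimal client sketch using the requests library (YOUR_HASH stands in for your key hash, as elsewhere in this guide):

# Python — fetch per-agent stats
import requests

resp = requests.get(
    "https://seracade.com/api/agents/stats",
    params={"key_hash": "YOUR_HASH"},
)
resp.raise_for_status()
stats = resp.json()  # calls, cost, savings, task types, models, efficiency per agent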

Budget controls

Set per-agent monthly budget limits to prevent runaway costs:

POST /api/agents/budget
{
  "key_hash": "YOUR_HASH",
  "agent_name": "code-reviewer",
  "monthly_budget_usd": 50,
  "action_on_exceed": "downgrade"
}

Three actions when a budget is exceeded:

  • downgrade — route the agent to a cheaper qualifying model for the rest of the cycle
  • alert — keep routing unchanged but send a notification
  • block — reject the agent's calls until the budget resets

Agent cost reports

Generate a per-agent cost efficiency report:

GET /api/agents/report?key_hash=YOUR_HASH

Shows each agent's cost efficiency, which models it uses, which task types, and recommended routing changes. Available as JSON (default) or HTML (via Accept header).

Response headers

When X-Seracade-Agent is present, every response includes:

Framework integration

Any framework that supports OPENAI_BASE_URL works with Seracade. Set the env var and add the agent header in your framework's config:

# Environment
OPENAI_BASE_URL=https://seracade.com/v1

# CrewAI — set in your agent's llm config
# LangGraph — set in your ChatOpenAI constructor
# AutoGen — set in your OAI config list

Auto mode (no human in the loop)

Auto mode removes every human gate from the optimization flow. Seracade audits your traffic, activates routing when quality thresholds pass, monitors continuously, and rolls back automatically if quality degrades. No report to review, no button to click.

Enable auto mode

curl -X POST https://seracade.com/api/mode \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"mode":"auto"}'

Returns {"ok": true, "mode": "auto", "customer_id": "sha256_hash"}. To revert: send {"mode":"manual"}.

Check current mode

GET /api/mode
Authorization: Bearer YOUR_API_KEY

What auto mode does

  • Audits your traffic as logged calls accumulate
  • Activates routing automatically once quality thresholds pass
  • Monitors quality continuously on live traffic
  • Rolls back routing automatically if quality degrades

Monitor programmatically

GET /api/status/YOUR_CUSTOMER_ID
Authorization: Bearer YOUR_API_KEY

Returns audit state, active routes, quality scores, savings totals, agent breakdowns, and budget status as JSON. The API key must hash to the requested customer ID or the request returns 401.
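
A polling sketch in the same style (placeholders as above):

# Python — poll status in auto mode
import requests

resp = requests.get(
    "https://seracade.com/api/status/YOUR_CUSTOMER_ID",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
resp.raise_for_status()  # 401 if the key does not hash to the requested customer ID
status = resp.json()     # audit state, routes, quality scores, savings, budgets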

No emails in auto mode

Auto mode suppresses all email: audit reports, drip sequences, and savings summaries. Monitoring is API-only. If you want email notifications alongside auto mode, set mode to auto and separately configure an alert email via POST /api/agents/budget with "action_on_exceed": "alert".

9. Additional FAQ

What counts as a "call" for the audit threshold?

Any inference request that passes through the proxy: chat completions, function calls, embeddings. Each request/response pair counts as one call. Most teams hit the threshold within the first day.

How does quality scoring work?

Seracade replays your exact production inputs against each alternative model, then scores the output against your original response using semantic similarity, format preservation, and task completion. Results are reported with 95% confidence intervals so you see statistical reliability, not just averages.

What if my use case is too specialized for alternative models?

That happens. Roughly 20 to 30% of endpoints show no viable alternative. The audit report tells you which endpoints can save money and which ones genuinely need the model you're using. Both findings are valuable.

Does the proxy add latency?

The routing decision adds sub-millisecond overhead. The proxy runs on Cloudflare's edge network. If an alternative model responds faster (Haiku and Flash typically do), you may see lower total latency.

Can I exclude specific endpoints from optimization?

Yes. Seracade Pro lets you pin any endpoint to a specific model. If your summarization pipeline needs Claude Opus 4, lock it. Seracade only optimizes the endpoints you allow.

10. FAQ

Common questions from customers are answered on the landing page. See the full FAQ section for details on pricing, data handling, latency impact, model pinning, and more.

11. Capabilities Reference

Seracade exposes six named capabilities. Every customer-facing artifact (the routing proxy, the audit report, the dashboard, the MCP server) is built from this set. The names are stable; downstream tooling can integrate against them.

Call

Every inference request through Seracade is a Call. Each Call is addressable by ID and carries full metadata: model used, model considered, tokens, latency, cost, classification, Quality Score, route reason, and the customer's active Quality Gate. Every routed /v1/chat/completions response carries this metadata.

Step

Agent workflows are sequences of Calls grouped into Steps. A planning Step, a tool-selection Step, a summarization Step, and a code-generation Step in the same agent session are classified and routed independently. Identify Steps via optional headers on /v1/chat/completions:

  • X-Seracade-Step-Id — identifies the Step within an agent session
  • X-Seracade-Trace-Id — groups the Steps of a single agent run

Headers are additive. Frameworks that emit step identifiers are picked up automatically; frameworks that do not still work at the Call level. Step headers are stripped before the request reaches the upstream provider and are persisted as opaque tags on the Call record.

# Python (OpenAI SDK)
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Plan the next action"}],
    extra_headers={
        "X-Seracade-Step-Id": "plan-001",
        "X-Seracade-Trace-Id": "agent-run-2026-04-25-7f3a",
    },
)

Quality Score

Every Call receives a calibrated, version-stamped Quality Score produced from the customer's own traffic. Each Score carries calibration_version and sample_size and a "subject to recalibration" disclaimer. Scores below the per-customer suppression threshold are withheld rather than reported with wide confidence bands. Scores are visible to the customer on their own data only; methodology is not published.

Routing Decision

The routing table is a set of Routing Decisions. Each maps (task_type, Quality Gate threshold) to the candidate model set that clears the threshold, with evidence (sample_size, score distribution, calibration_version, last_rebalance timestamp). Customers can fetch their routing table via the dashboard or the audit report.

Counterfactual Replay

Replay re-runs any historical Call against a specified set of alternate models and returns counterfactual cost and Quality Score per candidate, version-stamped. Use Replay to validate Routing Decisions, build cost-regression tests in CI, or evaluate new models against past traffic before promoting them.

# Replay a Call against two candidate models
curl https://seracade.com/v1/replay \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "call_id": "log:<customer>:<ts>:<uuid>",
    "candidate_models": ["claude-haiku-4-5", "gpt-4.1-mini"]
  }'

Replay output is private to the requesting customer and is not for redistribution. Pricing matches routed Calls (15% of the absolute price difference per Replay, free until $500/month of cumulative routing value).
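
A worked example of that pricing, with hypothetical costs:

# Python — Replay pricing arithmetic (hypothetical numbers)
original_cost = 0.0100   # what the original routed Call cost, USD
candidate_cost = 0.0030  # counterfactual cost on the candidate model, USD
fee = 0.15 * abs(original_cost - candidate_cost)
print(f"replay fee: ${fee:.5f}")  # $0.00105 for this Replay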

Quality Gate

Quality Gates are customer-set policies that constrain the routing table. Declare each Gate as a (task_type, minimum_score) rule; Seracade never routes a Call below an active Gate, even if a cheaper model passes the global threshold. Use Gates to hold quality high on legal, healthcare, financial, or other sensitive task types while cost-routing on summarization, extraction, and classification.

# Set a Gate: never substitute below 0.90 on coding tasks
curl https://seracade.com/v1/gates \
  -X POST \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"gates": {"coding": {"minimum_score": 0.90}}}'

# Read active Gates
curl https://seracade.com/v1/gates \
  -H "Authorization: Bearer YOUR_KEY"

# Remove one Gate
curl https://seracade.com/v1/gates \
  -X DELETE \
  -H "Authorization: Bearer YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{"task_type": "coding"}'

Valid task types: extraction, classification, function_calling, summarization, rewriting, qa_rag, reasoning, coding, other. The minimum_score is a number between 0 and 1.
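
The constraint a Gate imposes on routing reduces to a check like this sketch (hypothetical names):

# Python — illustrative Quality Gate check during routing
def passes_gate(task_type: str, candidate_score: float, gates: dict) -> bool:
    gate = gates.get(task_type)  # e.g. {"coding": {"minimum_score": 0.90}}
    # No Gate on this task type: only the global threshold applies
    return gate is None or candidate_score >= gate["minimum_score"]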

Composition

A production Call arrives with optional Step headers. Seracade classifies it, looks up the Routing Decision for that task_type at the customer's active Quality Gate, and selects the model. The Call executes; the response carries the Quality Score and the routing evidence. Replay sits orthogonal to this flow: at any time the customer can re-run any historical Call against alternate models to validate a Routing Decision, evaluate a candidate, or build a regression test.